By Eric Mersch
Are you just getting used to ChatGPT? Well, now there’s a new game in town, and it’s growing its 1.8M daily user base fast.
DeepSeek entered the AI landscape in the second half of 2023, but it disrupted global markets in late January 2025, days after releasing its R1 Large Language Model (LLM) on January 20. R1 achieves performance comparable to OpenAI-o1-1217 on reasoning tasks.
Nvidia, for example, lost about 17% of its value (nearly $593B in market capitalization, a record one-day loss for any publicly traded company), while tech stocks across the globe slid.
What’s All the DeepSeek Hype About?
DeepSeek is a Chinese AI startup founded in May 2023 by entrepreneur Liang Wenfeng. He was already known in the AI community as co-founder of High-Flyer, a quantitative hedge fund that used innovative AI-driven trading strategies.
By mid-2022, High-Flyer had acquired 10,000 of Nvidia's high-performance A100 graphics processor chips to power its AI systems before trade restrictions cut off supply. It has been suggested that Liang drew upon this AI experience and infrastructure to lead the development of DeepSeek's R1 model. Early reports also indicated that the DeepSeek model may have been trained in part on outputs from OpenAI's models. Regardless, DeepSeek did utilize innovative post-training methods to enhance the R1 model, and AI operators should understand these steps.
Post-Training
By focusing on post-training, DeepSeek achieves its striking accuracy on reasoning tasks without extensive data and computing resources.
Post-training refers to the stages of training that occur after the initial pre-training of a language model. Pre-training typically involves training a model on vast amounts of general-purpose data (like text from books, websites, and other sources) to build a foundational understanding of language.
OpenAI emerged as the pre-eminent LLM developer and was followed by similar efforts across the industry. Fueled by tremendous investment and trained on readily available online information, these LLM providers created AI agents whose inference capabilities make interacting with the machine feel human-like to varying degrees.
Post-training, on the other hand, is where DeepSeek's innovation truly shines. The Chinese company has employed several methodologies to achieve results equaling those of OpenAI and similar providers. While well-known within the GenAI community, these methods are applied with purpose and ingenuity by DeepSeek researchers to achieve better output. For example, as we will see below, using reinforcement learning to improve reasoning was explored in an academic machine learning paper called Training Language Models to Self-Correct via Reinforcement Learning.
Process-Based Rewards
Another concept, process-based reward models, was described in a machine learning paper called Let's Verify Step by Step. However, while valuable in enhancing general reasoning performance, these methods did not produce results comparable to OpenAI's o1 models.
DeepSeek researchers then went further, innovating on these methods in a way that appears to yield comparable results. The key to unlocking comparable performance is Reinforcement Learning, specifically Reinforcement Learning without Supervised Fine-Tuning (SFT).
Researchers applied reinforcement learning (RL) directly to a base model, ignoring supervised data and allowing reasoning capabilities to emerge naturally. The authors describe seeking to explore the potential of LLMs to develop reasoning capabilities without any supervised data, focusing on their self-evolution through a pure Reinforcement Learning process. The result: after thousands of RL steps, DeepSeek-R1-Zero exhibits strong performance on reasoning benchmarks. The important discovery was that reasoning capabilities emerged naturally without training the model on massive datasets, which would have consumed tremendous computing infrastructure.
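To make this concrete, the DeepSeek-R1 paper describes driving this pure-RL stage with simple, automatically checkable reward signals rather than a learned reward model. Here is a minimal Python sketch of what such rule-based rewards can look like; the <think>/<answer> tag format and the reward values are illustrative assumptions, not DeepSeek's actual code.

```python
import re

# A minimal sketch of rule-based reward signals for pure RL training:
# no supervised data and no learned reward model, only checks the trainer
# can score automatically. Tags and reward values are illustrative.

def format_reward(completion: str) -> float:
    """Reward completions that separate their reasoning from their final answer."""
    pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
    return 1.0 if re.search(pattern, completion, flags=re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward completions whose final answer matches a verifiable ground truth."""
    match = re.search(r"<answer>(.*?)</answer>", completion, flags=re.DOTALL)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

# Example: a math prompt with an automatically checkable answer.
sample = "<think>7 * 8 = 56</think> <answer>56</answer>"
print(total_reward(sample, ground_truth="56"))  # prints 2.0
```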
The Reinforcement Learning Process
DeepSeek's RL pipeline is a series of processes applied sequentially to achieve the desired outcome. Here's how it works:
DeepSeek researchers started with a base model, which they refer to as DeepSeek-V3-Base. They may have developed their base model using methods pioneered by OpenAI and others. Specifically, DeepSeek may have used prompts to interact with OpenAI’s GPT-4 or ChatGPT, using their responses to train its model. This effectively mimics OpenAI’s approach of leveraging existing models for improvement. But this is pure speculation. I have no way of validating this assertion, nor does anyone not associated with DeepSeek.
To improve the output from the base model, researchers used a method called Group Relative Policy Optimization, or GRPO, which refines the algorithm by comparing multiple outputs generated for the same input prompt. GRPO samples a group of outputs for each prompt and, using Advantage Estimation, assigns each output a numerical score based on how its accuracy and relevance compare to the mean (μ) and standard deviation (σ) of the group. The application of these rewards is referred to as Reward Signal Design, and outputs that score above the group average are reinforced. Repeated training with this reward-signal feedback allows the algorithm to perform better on subsequent queries, while results that fall outside a pre-determined distance from the mean are rejected.
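As a rough illustration of this group-relative scoring, the sketch below standardizes each output's reward against the group's mean (μ) and standard deviation (σ) to produce an advantage, then filters outputs that fall too far below the mean. The reward values and the one-sigma cutoff are illustrative assumptions, not published DeepSeek settings.

```python
import statistics

def group_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantage estimation: standardize each output's reward
    against the mean (mu) and standard deviation (sigma) of its group."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(reward - mu) / sigma for reward in rewards]

# Rewards for four outputs sampled from the same prompt (illustrative values).
rewards = [0.2, 0.9, 0.4, 0.5]
advantages = group_advantages(rewards)

# Outputs with positive advantages beat their group's average and reinforce
# the policy update; outputs far below the mean can be rejected. The 1-sigma
# cutoff here is an assumption, not a published DeepSeek setting.
kept = [r for r, a in zip(rewards, advantages) if a > -1.0]
print(advantages)
print(kept)
```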
The DeepSeek researchers reported that post-GRPO output suffered from poor readability and language mixing. Poor readability means the process produced convoluted answers with non-standard vocabulary, run-on sentences, lengthy paragraphs, and ineffective formatting. Language mixing refers to the model's tendency to switch between languages inadvertently and inconsistently.
To address these output deficiencies, researchers focused on the initial stage of RL, known as the Cold Start phase, which would otherwise proceed without the prior supervised fine-tuning typical of other stages of LLM development. They sought to replicate the fine-tuning process by introducing a small, high-quality dataset curated by human experts. This process, called Reinforcement Learning from Human Feedback, or RLHF, involves repeated human review and reward assignment used in a Multi-Stage Training Pipeline approach. This is why the DeepSeek paper incorporates cold-start data and a multi-stage training pipeline (page 1 and page 9).
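For a concrete picture of the human review and reward assignment step, here is a minimal PyTorch sketch of the standard RLHF reward-modeling recipe: human reviewers pick the better of two responses, and a small model is trained so the preferred response scores higher. The toy features, model size, and training settings are illustrative assumptions, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyRewardModel(nn.Module):
    """Maps a response representation to a single scalar reward."""
    def __init__(self, dim: int = 16):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, 1))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.scorer(features).squeeze(-1)

model = TinyRewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Stand-in feature vectors for (chosen, rejected) response pairs labeled by
# human reviewers; in practice these would come from an LLM's hidden states.
chosen = torch.randn(64, 16)
rejected = torch.randn(64, 16)

for step in range(200):
    # Pairwise (Bradley-Terry) loss: push the chosen response's score above
    # the rejected response's score for every human-labeled comparison.
    loss = -F.logsigmoid(model(chosen) - model(rejected)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(f"final pairwise loss: {loss.item():.3f}")
```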
Utilizing RLHF raises cost and scalability concerns because it requires significant, specialized human labor to provide high-quality feedback and train reward models. RLHF also raises other problems, such as bias and variance in how “quality” is assessed.
The cost of using RLHF to build reasoning capability has yet to be determined, but it will likely weigh on the model's future development. DeepSeek's journey has not been without its challenges, and the team has had to navigate the trade-offs between compute costs and human labor costs. These questions aside, DeepSeek did achieve better results using RLHF, a testament to the team's perseverance and dedication.
The final step is Supervised Fine-Tuning, or SFT. This is a training process in which the now pre-trained model undergoes further training through supervised learning on a specifically labeled dataset. Supervised learning itself does not require a human in the loop; the algorithm learns from the labeled examples independently. However, human involvement is necessary to create the datasets if they are not already labeled: people (or an authoritative source) must generate input-output pairs, for example by writing the correct output for each input.
SFT typically improves the algorithm's performance on a particular task or domain by leveraging task-specific examples that explicitly define the input-output relationships.
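To make the input-output pairing concrete, the sketch below shows one common way to prepare labeled examples for supervised fine-tuning: the prompt and the human-written output are concatenated, and the training loss is applied only to the output tokens. The chat template, the example pairs, and the label-masking convention are common practice but assumptions here, not DeepSeek's exact format.

```python
# Labeled input-output pairs written or verified by humans (illustrative examples).
sft_dataset = [
    {"input": "What is 7 * 8?", "output": "7 * 8 = 56."},
    {"input": "Name the capital of France.", "output": "The capital of France is Paris."},
]

def build_example(pair: dict, tokenizer) -> dict:
    """Concatenate prompt and labeled output; supervise only the output tokens."""
    prompt = f"User: {pair['input']}\nAssistant: "
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    output_ids = tokenizer.encode(pair["output"], add_special_tokens=False)
    input_ids = prompt_ids + output_ids
    # Label -100 is the conventional "ignore" index for cross-entropy loss,
    # so the model is trained only to reproduce the labeled output.
    labels = [-100] * len(prompt_ids) + output_ids
    return {"input_ids": input_ids, "labels": labels}

# Usage sketch (assumes the Hugging Face transformers library is installed):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
# examples = [build_example(pair, tokenizer) for pair in sft_dataset]
```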
What is the Potential Impact of DeepSeek?
In summary, using a complex Reinforcement Learning process, DeepSeek researchers have developed an algorithm with reasoning capabilities that rival, and potentially surpass, current LLMs. Their RL process applies existing methods in novel ways, some tailored specifically for efficiency. This potential for advancement in AI is truly exciting.
They used Group Relative Policy Optimization, which mathematically rewarded good performance and demoted substandard output. Outputs scoring too many sigmas from the mean were eliminated. Although this step involved human intervention, it produced repeatable output with low compute resource requirements.
Introducing high-quality, human-curated Cold Start data before the start of the Reinforcement Learning phase allowed the algorithm to develop enhanced reasoning capabilities.
Sam Altman and the OpenAI team stood on the shoulders of giants to create their LLM. The DeepSeek researchers have given the community new shoulders for others to do the same.
About the Author
Eric Mersch has 25 years of finance experience in the technology industry, including CFO roles at public companies and numerous venture capital and private equity portfolio companies. He has worked with over 40 different SaaS companies and compiled his experience into his book, Hacking SaaS—An Insider’s Guide to Managing Software Business Success. His goal in writing the book is to educate SaaS professionals, thus shortening the apprenticeship of those new to SaaS.
His book, Hacking SaaS – An Insider's Guide to Managing Software Business Success, is available on Amazon.