DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning
Published on: 06 October 2025
Shifting from Supervised Learning to Reinforcement Learning
```mermaid
graph TD
    subgraph "DeepSeek's Approach (Pure RL)"
        direction TB
        DS_HP["Hard Reasoning Problems<br/>(Math, Code)"] --> DS_RLLoop{"Reinforcement Learning<br/>Loop"};
        DS_RLLoop -- "Generates Reasoning & Answer" --> DS_Verifier{"Rule-Based Verifier"};
        DS_Verifier -- "Returns Reward Signal<br/>(based on final answer only)" --> DS_RLLoop;
        DS_RLLoop --> DS_LLMR["LLM's Reasoning<br/>(Self-Discovered Pathways)"];
        subgraph Benefits
            direction TB
            B1["Can surpass human performance"]
            B2["Autonomous self-improvement"]
        end
        DS_LLMR --> Benefits;
    end
    subgraph "Traditional Approach (SFT)"
        direction TB
        TA_HARS["Human-Annotated<br/>Reasoning Steps"] --> TA_SFT{"Supervised<br/>Fine-Tuning"};
        TA_SFT --> TA_LLMR["LLM's Reasoning<br/>(Mimics Human Thought)"];
        subgraph Limitations
            direction TB
            L1["Capped by human<br/>performance"]
            L2["Introduces cognitive biases"]
        end
        TA_LLMR --> Limitations;
    end
    style L1 fill:#f8d7da,stroke:#721c24
    style L2 fill:#f8d7da,stroke:#721c24
    style B1 fill:#d4edda,stroke:#155724
    style B2 fill:#d4edda,stroke:#155724
```
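The reward in this loop is deliberately simple: a rule-based verifier scores only the final answer (plus, per the paper, a format check that the reasoning is wrapped in think tags), never the reasoning steps themselves. Below is a minimal Python sketch of such a verifier-style reward; the boxed-answer extraction and exact scoring rules are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Rule-based accuracy reward: only the final answer is scored,
    never the reasoning that produced it (illustrative sketch)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0                                   # no parsable final answer
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """Format reward: reasoning must be wrapped in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*</think>", response, re.DOTALL) else 0.0

# Only the boxed final answer is checked; the reasoning inside <think> is free-form.
resp = "<think>2 + 2 = 4 because ...</think> The answer is \\boxed{4}."
print(accuracy_reward(resp, "4"), format_reward(resp))   # 1.0 1.0
```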
Emergence of Sophisticated Reasoning Behaviors
```mermaid
graph TD
    A[Base LLM] --> B["Incentivized by RL on<br/>Hard Problems with Simple Rewards"];
    B --> C(Emergent Reasoning Engine);
    subgraph " "
        direction LR
        C --> D["📈 Increased Thinking Time<br/>(Generates Longer Chain-of-Thought)"];
        C --> E["🕵️ Self-Verification<br/>(Checks its own<br/>calculations and logic)"];
        C --> F["🤔 Self-Reflection<br/>(Identifies mistakes and<br/>re-evaluates)"];
        C --> G["💡 'Aha Moment'<br/>(Sudden strategy shifts,<br/>e.g., 'Wait, let's<br/>reevaluate...')"];
    end
    style C fill:#cce5ff,stroke:#004085,stroke-width:2px
    style D fill:#fff3cd,stroke:#856404
    style E fill:#fff3cd,stroke:#856404
    style F fill:#fff3cd,stroke:#856404
    style G fill:#fff3cd,stroke:#856404
```
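The paper tracks these behaviors indirectly: average response length grows steadily over RL steps, and reflective phrases such as "wait" appear without ever being taught. The sketch below shows one way such signals could be logged during training; the marker list and helper names are assumptions for illustration, not the paper's code.

```python
import re
from statistics import mean

# Phrases treated as signs of self-reflection; an illustrative assumption.
REFLECTION_MARKERS = ("wait", "re-evaluate", "reevaluate", "let me check")

def think_tokens(response: str) -> int:
    """Whitespace-token length of the <think> block, a proxy for 'thinking time'."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    return len(m.group(1).split()) if m else 0

def reflection_count(response: str) -> int:
    """Crude count of self-reflection phrases in the reasoning trace."""
    text = response.lower()
    return sum(text.count(marker) for marker in REFLECTION_MARKERS)

def log_step(step: int, responses: list[str]) -> None:
    """Log average thinking length and reflection rate for one RL step."""
    print(f"step {step}: "
          f"avg think tokens = {mean(map(think_tokens, responses)):.1f}, "
          f"avg reflections = {mean(map(reflection_count, responses)):.2f}")

log_step(0, ["<think>4</think> \\boxed{4}",
             "<think>Wait, let's reevaluate... 4</think> \\boxed{4}"])
```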
The Two-Model Development Pipeline
```mermaid
graph TD
    A["DeepSeek-V3<br/>Base Model"] --> B{"Stage 1: Pure<br/>Reinforcement Learning"};
    B -- "on reasoning tasks" --> C[DeepSeek-R1-Zero];
    C -- "Inherits Core Reasoning" --> D{"Stage 2: Multi-stage<br/>Refinement & Alignment"};
    D -- "Includes SFT, Rejection<br/>Sampling & Secondary RL" --> E[DeepSeek-R1];
    subgraph "Model Properties"
        direction LR
        P1["R1-Zero: ✅ Powerful Reasoner | ❌ Poor Readability"]
        P2["R1: ✅ Powerful Reasoner | ✅ Human-Aligned & Readable"]
    end
    C -.-> P1;
    E -.-> P2;
    style A fill:#e0e0e0,stroke:#333
    style C fill:#cce5ff,stroke:#004085
    style E fill:#d4edda,stroke:#155724
```
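For readers who prefer the pipeline as an ordered list rather than a graph, here is a compact, data-only sketch. The stage breakdown follows the paper's description of the R1 recipe; the Python representation itself is purely illustrative.

```python
# Stage names follow the paper's description of the R1 recipe; this data-only
# representation is just for illustration.
R1_PIPELINE = [
    ("Stage 1: pure RL on reasoning tasks (rule-based rewards), from DeepSeek-V3-Base",
     "DeepSeek-R1-Zero"),
    ("Stage 2a: cold-start SFT on curated long chain-of-thought examples",
     "readable checkpoint"),
    ("Stage 2b: reasoning-oriented RL",
     "stronger reasoner"),
    ("Stage 2c: rejection sampling + SFT on reasoning and general data",
     "broad-capability checkpoint"),
    ("Stage 2d: secondary RL for helpfulness and harmlessness",
     "DeepSeek-R1"),
]

for stage, produces in R1_PIPELINE:
    print(f"{stage:<80} ->  {produces}")
```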
Efficient Training with Group Relative Policy Optimization (GRPO)
```mermaid
graph TD
    subgraph "PPO (Traditional Actor-Critic)"
        direction TB
        PPO_A["Policy Model<br/>(Actor)"] -- "Generates action" --> PPO_B{Environment};
        PPO_B -- "Returns state, reward" --> PPO_C["Value Model<br/>(Critic)"];
        PPO_C -- "Computes Advantage" --> PPO_A;
        PPO_D["Requires two complex networks<br/>working in tandem"];
    end
    subgraph "GRPO (Simpler Approach used in the paper)"
        direction TB
        GRPO_A[Policy Model] -- "Generates a group of G responses" --> GRPO_B["{Response 1..G}"];
        GRPO_B --> GRPO_C{"Reward Model"};
        GRPO_C -- "Assigns a reward to each response" --> GRPO_D["{Reward 1..G}"];
        GRPO_D --> GRPO_E{"Group Computation<br/>(Calculates relative advantage)"};
        GRPO_E --> GRPO_A;
        GRPO_F["More efficient: eliminates<br/>the need for a separate Value Model"];
    end
    style GRPO_F fill:#d4edda,stroke:#155724
    style PPO_D fill:#f8d7da,stroke:#721c24
```
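The "Group Computation" box is the heart of GRPO: instead of a learned value baseline, each response's advantage is its reward normalized by the statistics of its own group, A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G), as described in the paper. A minimal sketch follows; the toy rewards and the added epsilon for numerical safety are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G) for one group of G responses.
    The epsilon term is an added assumption for numerical safety."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, G = 4 sampled responses scored by the rule-based reward:
rewards = np.array([1.0, 0.0, 0.0, 1.0])    # two responses got the right answer
print(group_relative_advantages(rewards))   # roughly [ 1. -1. -1.  1.]

# These advantages weight a PPO-style clipped objective (plus a KL penalty to a
# reference policy), but no separate value network is ever trained.
```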