DeepSeek-R1 Incentivizes Reasoning in LLMs through Reinforcement Learning
Published on: 06 October 2025
Shifting from Supervised Learning to Reinforcement Learning
```mermaid
graph TD
    subgraph "DeepSeek's Approach (Pure RL)"
        direction TB
        DS_HP["Hard Reasoning Problems<br/>(Math, Code)"] --> DS_RLLoop{"Reinforcement Learning<br/>Loop"};
        DS_RLLoop -- "Generates Reasoning & Answer" --> DS_Verifier{"Rule-Based Verifier"};
        DS_Verifier -- "Returns Reward Signal<br/>(based on final answer only)" --> DS_RLLoop;
        DS_RLLoop --> DS_LLMR["LLM's Reasoning<br/>(Self-Discovered Pathways)"];
        subgraph Benefits
            direction TB
            B1["Can surpass human performance"]
            B2["Autonomous self-improvement"]
        end
        DS_LLMR --> Benefits;
    end
    subgraph "Traditional Approach (SFT)"
        direction TB
        TA_HARS["Human-Annotated<br/>Reasoning Steps"] --> TA_SFT{"Supervised<br/>Fine-Tuning"};
        TA_SFT --> TA_LLMR["LLM's Reasoning<br/>(Mimics Human Thought)"];
        subgraph Limitations
            direction TB
            L1["Capped by human<br/>performance"]
            L2["Introduces cognitive biases"]
        end
        TA_LLMR --> Limitations;
    end
    style L1 fill:#f8d7da,stroke:#721c24
    style L2 fill:#f8d7da,stroke:#721c24
    style B1 fill:#d4edda,stroke:#155724
    style B2 fill:#d4edda,stroke:#155724
```
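The reward in this loop is deliberately simple: a rule-based verifier scores only the final answer (plus, per the paper, a format check that the reasoning is wrapped in think tags), never the reasoning steps themselves. Below is a minimal Python sketch of such a verifier-style reward; the boxed-answer extraction and exact scoring rules are illustrative assumptions, not DeepSeek's actual implementation.

```python
import re

def accuracy_reward(response: str, ground_truth: str) -> float:
    """Rule-based accuracy reward: only the final answer is scored,
    never the reasoning that produced it (illustrative sketch)."""
    match = re.search(r"\\boxed\{([^}]*)\}", response)
    if match is None:
        return 0.0                                   # no parsable final answer
    return 1.0 if match.group(1).strip() == ground_truth.strip() else 0.0

def format_reward(response: str) -> float:
    """Format reward: reasoning must be wrapped in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*</think>", response, re.DOTALL) else 0.0

# Only the boxed final answer is checked; the reasoning inside <think> is free-form.
resp = "<think>2 + 2 = 4 because ...</think> The answer is \\boxed{4}."
print(accuracy_reward(resp, "4"), format_reward(resp))   # 1.0 1.0
```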
Emergence of Sophisticated Reasoning Behaviors
```mermaid
graph TD
    A[Base LLM] --> B["Incentivized by RL on<br/>Hard Problems with Simple Rewards"];
    B --> C(Emergent Reasoning Engine);
    subgraph " "
        direction LR
        C --> D["📈 Increased Thinking Time<br/>(Generates Longer Chain-of-Thought)"];
        C --> E["🕵️ Self-Verification<br/>(Checks its own<br/>calculations and logic)"];
        C --> F["🤔 Self-Reflection<br/>(Identifies mistakes and<br/>re-evaluates)"];
        C --> G["💡 'Aha Moment'<br/>(Sudden strategy shifts,<br/>e.g., 'Wait, let's<br/>reevaluate...')"];
    end
    style C fill:#cce5ff,stroke:#004085,stroke-width:2px
    style D fill:#fff3cd,stroke:#856404
    style E fill:#fff3cd,stroke:#856404
    style F fill:#fff3cd,stroke:#856404
    style G fill:#fff3cd,stroke:#856404
```
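The paper tracks these behaviors indirectly: average response length grows steadily over RL steps, and reflective phrases such as "wait" appear without ever being taught. The sketch below shows one way such signals could be logged during training; the marker list and helper names are assumptions for illustration, not the paper's code.

```python
import re
from statistics import mean

# Phrases treated as signs of self-reflection; an illustrative assumption.
REFLECTION_MARKERS = ("wait", "re-evaluate", "reevaluate", "let me check")

def think_tokens(response: str) -> int:
    """Whitespace-token length of the <think> block, a proxy for 'thinking time'."""
    m = re.search(r"<think>(.*?)</think>", response, re.DOTALL)
    return len(m.group(1).split()) if m else 0

def reflection_count(response: str) -> int:
    """Crude count of self-reflection phrases in the reasoning trace."""
    text = response.lower()
    return sum(text.count(marker) for marker in REFLECTION_MARKERS)

def log_step(step: int, responses: list[str]) -> None:
    """Log average thinking length and reflection rate for one RL step."""
    print(f"step {step}: "
          f"avg think tokens = {mean(map(think_tokens, responses)):.1f}, "
          f"avg reflections = {mean(map(reflection_count, responses)):.2f}")

log_step(0, ["<think>4</think> \\boxed{4}",
             "<think>Wait, let's reevaluate... 4</think> \\boxed{4}"])
```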
The Two-Model Development Pipeline
```mermaid
graph TD
    A["DeepSeek-V3<br/>Base Model"] --> B{"Stage 1: Pure<br/>Reinforcement Learning"};
    B -- "on reasoning tasks" --> C[DeepSeek-R1-Zero];
    C -- "Inherits Core Reasoning" --> D{"Stage 2: Multi-stage<br/>Refinement & Alignment"};
    D -- "Includes SFT, Rejection<br/>Sampling & Secondary RL" --> E[DeepSeek-R1];
    subgraph "Model Properties"
        direction LR
        P1["R1-Zero: ✅ Powerful Reasoner | ❌ Poor Readability"]
        P2["R1: ✅ Powerful Reasoner | ✅ Human-Aligned & Readable"]
    end
    C -.-> P1;
    E -.-> P2;
    style A fill:#e0e0e0,stroke:#333
    style C fill:#cce5ff,stroke:#004085
    style E fill:#d4edda,stroke:#155724
```
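For readers who prefer the pipeline as an ordered list rather than a graph, here is a compact, data-only sketch. The stage breakdown follows the paper's description of the R1 recipe; the Python representation itself is purely illustrative.

```python
# Stage names follow the paper's description of the R1 recipe; this data-only
# representation is just for illustration.
R1_PIPELINE = [
    ("Stage 1: pure RL on reasoning tasks (rule-based rewards), from DeepSeek-V3-Base",
     "DeepSeek-R1-Zero"),
    ("Stage 2a: cold-start SFT on curated long chain-of-thought examples",
     "readable checkpoint"),
    ("Stage 2b: reasoning-oriented RL",
     "stronger reasoner"),
    ("Stage 2c: rejection sampling + SFT on reasoning and general data",
     "broad-capability checkpoint"),
    ("Stage 2d: secondary RL for helpfulness and harmlessness",
     "DeepSeek-R1"),
]

for stage, produces in R1_PIPELINE:
    print(f"{stage:<80} ->  {produces}")
```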
Efficient Training with Group Relative Policy Optimization (GRPO)
```mermaid
graph TD
    subgraph "PPO (Traditional Actor-Critic)"
        direction TB
        PPO_A["Policy Model<br/>(Actor)"] -- "Generates action" --> PPO_B{Environment};
        PPO_B -- "Returns state, reward" --> PPO_C["Value Model<br/>(Critic)"];
        PPO_C -- "Computes Advantage" --> PPO_A;
        PPO_D["Requires two complex networks<br/>working in tandem"];
    end
    subgraph "GRPO (Simpler Approach used in the paper)"
        direction TB
        GRPO_A[Policy Model] -- "Generates a group of G responses" --> GRPO_B["{Response 1..G}"];
        GRPO_B --> GRPO_C{"Reward Model"};
        GRPO_C -- "Assigns a reward to each response" --> GRPO_D["{Reward 1..G}"];
        GRPO_D --> GRPO_E{"Group Computation<br/>(Calculates relative advantage)"};
        GRPO_E --> GRPO_A;
        GRPO_F["More efficient: eliminates<br/>the need for a separate Value Model"];
    end
    style GRPO_F fill:#d4edda,stroke:#155724
    style PPO_D fill:#f8d7da,stroke:#721c24
```
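The "Group Computation" box is the heart of GRPO: instead of a learned value baseline, each response's advantage is its reward normalized by the statistics of its own group, A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G), as described in the paper. A minimal sketch follows; the toy rewards and the added epsilon for numerical safety are illustrative assumptions.

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G) for one group of G responses.
    The epsilon term is an added assumption for numerical safety."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One prompt, G = 4 sampled responses scored by the rule-based reward:
rewards = np.array([1.0, 0.0, 0.0, 1.0])    # two responses got the right answer
print(group_relative_advantages(rewards))   # roughly [ 1. -1. -1.  1.]

# These advantages weight a PPO-style clipped objective (plus a KL penalty to a
# reference policy), but no separate value network is ever trained.
```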