Large Language Model
Published on: 26 September 2025
High-Level Transformer Architecture
```mermaid
graph TD
    Input([Input Text]) --> PE1[Positional Encoding]
    PE1 --> Enc_MultiHead

    subgraph "Encoder Block (Repeated Nx)"
        Enc_MultiHead[Multi-Head Self-Attention] --> AddNorm1[Add & Norm]
        AddNorm1 --> Enc_FFN[Feed-Forward Network]
        Enc_FFN --> AddNorm2[Add & Norm]
    end

    PrevOutput([Previous Decoder Output]) --> PE2[Positional Encoding]
    PE2 --> Dec_MaskedMultiHead

    subgraph "Decoder Block (Repeated Nx)"
        Dec_MaskedMultiHead[Masked Multi-Head Self-Attention] --> AddNorm3[Add & Norm]
        AddNorm3 --> Dec_EncDecAtt[Encoder-Decoder Attention]
        Dec_EncDecAtt --> AddNorm4[Add & Norm]
        AddNorm4 --> Dec_FFN[Feed-Forward Network]
        Dec_FFN --> AddNorm5[Add & Norm]
    end

    AddNorm2 -- "Encoder's Contextual Output" --> Dec_EncDecAtt
    AddNorm5 --> FinalOutput(Linear Layer) --> Softmax(Softmax Layer) --> Output([Final Output Probabilities])
```
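Inside every attention block in the diagram above, the core computation is scaled dot-product attention: queries are compared against keys, the scores are scaled by the square root of the key dimension, converted to weights with a softmax, and used to mix the values. The NumPy sketch below illustrates the idea only; the single-head simplification, the `causal` flag (standing in for the decoder's look-ahead mask), and the toy dimensions are assumptions for clarity, not the implementation of any particular model.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V: arrays of shape (seq_len, d_k). With causal=True, future
    positions are masked out, as in the decoder's masked self-attention.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # (seq_len, seq_len)

    if causal:
        # Block attention to future positions (strict upper triangle).
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -1e9, scores)

    # Row-wise softmax turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # (seq_len, d_k)

# Toy usage: 4 tokens, one 8-dimensional head, decoder-style masking.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x, causal=True)
print(out.shape)  # (4, 8)
```

Multi-head attention simply runs several such heads in parallel on learned projections of the input and concatenates their outputs before the Add & Norm step.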
The Three-Stage LLM Training Process
```mermaid
graph TD;
    A[Massive Unlabeled Text Corpus] --> B(Phase 1: Self-Supervised Pre-training);
    B -- Learns grammar, facts, reasoning --> C{Base Model};

    D["High-Quality Labeled Dataset<br/>(Prompt-Response Pairs)"] --> E(Phase 2: Supervised Fine-Tuning);
    C -- Adapts to follow instructions --> E;
    E -- Creates a more helpful model --> F{Tuned Model};

    I["Human Preference Data<br/>(Ranked Responses)"] --> G(Phase 3: Reinforcement Learning from Human Feedback);
    F --> G;
    G -- Aligns with human preferences --> H[Final Aligned LLM];

    %% --- Styling ---
    style A fill:#cde4ff
    style D fill:#cde4ff
    style I fill:#cde4ff
    style B fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px
    style G fill:#f9f,stroke:#333,stroke-width:2px
    style C fill:#b4f8c8,stroke:#333,stroke-width:2px
    style F fill:#b4f8c8,stroke:#333,stroke-width:2px
    style H fill:#a8e6cf,stroke:#333,stroke-width:4px
```
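Phase 1 and Phase 2 optimize essentially the same objective, next-token prediction with a cross-entropy loss; what changes is the data (a massive unlabeled corpus versus curated prompt-response pairs). The PyTorch sketch below shows that loss in isolation; the placeholder model, vocabulary size, and tensor shapes are invented for illustration and are not tied to any particular LLM.

```python
import torch
import torch.nn.functional as F

# Assumed toy setup: a batch of token IDs and a stand-in "model" that
# returns logits of shape (batch, seq_len, vocab_size).
vocab_size, batch, seq_len = 100, 2, 8
tokens = torch.randint(0, vocab_size, (batch, seq_len))
model = torch.nn.Sequential(
    torch.nn.Embedding(vocab_size, 32),
    torch.nn.Linear(32, vocab_size),
)  # placeholder for a real transformer stack

def next_token_loss(model, tokens):
    """Self-supervised objective used in pre-training (and, applied to
    prompt-response pairs, in supervised fine-tuning): predict token
    t+1 from the tokens up to t."""
    logits = model(tokens[:, :-1])   # predictions from each prefix
    targets = tokens[:, 1:]          # labels shifted by one position
    return F.cross_entropy(
        logits.reshape(-1, vocab_size), targets.reshape(-1)
    )

loss = next_token_loss(model, tokens)
loss.backward()  # gradients for one optimization step
print(float(loss))
```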
The RLHF (Reinforcement Learning from Human Feedback) Loop
```mermaid
graph TD;
    A[Start with a Prompt] --> B{Tuned LLM};
    B -- Generates --> C["Multiple Responses<br/>(e.g., Response A, B, C)"];
    C --> D(Human Evaluator Ranks Responses);
    D -- "A > C > B" --> E[Ranked Preference Data];
    E --> F(Train a Reward Model);
    F -- "Predicts which responses are 'good'" --> G[Reward Model];
    G -- Provides reward signal --> H(Fine-tune LLM via Reinforcement Learning);
    H --> B;

    style B fill:#b4f8c8
    style G fill:#b4f8c8
    style D fill:#ffcc99
    style H fill:#f9f
```
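The "Train a Reward Model" step is commonly implemented with a pairwise ranking (Bradley-Terry style) loss: for each ranked pair, the human-preferred response should receive a higher score than the rejected one. The sketch below is a minimal illustration under assumed shapes; `reward_model` is a hypothetical stand-in that maps a fixed-size response representation to a scalar, whereas a real reward model shares the LLM's backbone.

```python
import torch
import torch.nn.functional as F

# Hypothetical reward model: maps a fixed-size response representation
# to a single scalar score.
emb_dim = 16
reward_model = torch.nn.Linear(emb_dim, 1)

def pairwise_ranking_loss(chosen_emb, rejected_emb):
    """Bradley-Terry style objective: push the score of the preferred
    response above the score of the rejected one."""
    r_chosen = reward_model(chosen_emb).squeeze(-1)
    r_rejected = reward_model(rejected_emb).squeeze(-1)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Toy batch of 4 preference pairs (as ranked by the human evaluator).
chosen = torch.randn(4, emb_dim)
rejected = torch.randn(4, emb_dim)
loss = pairwise_ranking_loss(chosen, rejected)
loss.backward()
print(float(loss))
```

In the final step of the loop, the trained reward model's scalar output serves as the reward signal for reinforcement learning on the tuned LLM, typically with PPO plus a KL penalty that keeps the policy close to the original model.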