Quantization
Published on: 09 October 2025
Tags: #quantization #ai
1. The Core Quantization and Dequantization Process
graph TD
%% Define styles for input/output and processing nodes
classDef inputOutput fill:#cde4ff,stroke:#333,stroke-width:2px;
classDef process fill:#f9f9f9,stroke:#333,stroke-width:1px;
subgraph Quantization
A[FP32 Value] --> B[Divide by Scale];
B --> C[Add Zero-Point];
C --> D[Round to Nearest Integer];
D --> E[INT8/INT4 Value];
end
subgraph Dequantization
F[INT8/INT4 Value] --> G[Subtract Zero-Point];
G --> H[Multiply by Scale];
H --> I[Approximated FP32 Value];
end
%% Link the two processes
E --> F;
%% Apply styles
class A,I inputOutput;
class E,F inputOutput;
class B,C,D,G,H process;
linkStyle 7 stroke:#ff9999,stroke-width:2px,fill:none;
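In code, this round trip is a handful of arithmetic operations. The sketch below is a minimal NumPy illustration, not any particular library's API; the function names and the symmetric parameters for a [-1, 1] range (scale = 1/127, zero-point = 0) are chosen for the example.

```python
import numpy as np

def quantize(x: np.ndarray, scale: float, zero_point: int,
             qmin: int = -128, qmax: int = 127) -> np.ndarray:
    """FP32 -> INT8: divide by scale, add zero-point, round, clamp."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """INT8 -> approximate FP32: subtract zero-point, multiply by scale."""
    return (q.astype(np.float32) - zero_point) * scale

x = np.array([-0.62, 0.0, 0.31, 0.97], dtype=np.float32)
scale, zero_point = 1.0 / 127, 0      # symmetric parameters for a [-1, 1] range
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)
```

The difference between `x` and `x_hat` is the quantization error introduced by rounding.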
2. Symmetric vs. Asymmetric Quantization
graph TD
subgraph "Symmetric Quantization
(for Weights)"
direction TB
A[FP32 Range: -1.0 to 1.0] --> B["Zero-Point = 0"];
B --> C[INT8 Range: -127 to 127];
style A fill:#cde4ff
style C fill:#d5e8d4
end
subgraph "Asymmetric Quantization
(for Activations)"
direction TB
D[FP32 Range: -0.6 to 1.95] --> E["Zero-Point = 60 (Example)"];
E --> F[INT8 Range: 0 to 255];
style D fill:#cde4ff
style F fill:#d5e8d4
end
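The two schemes differ only in how scale and zero-point are derived. A minimal sketch, assuming per-tensor parameters and ranges like those in the diagrams above (the helper names are illustrative):

```python
import numpy as np

def symmetric_params(x: np.ndarray, qmax: int = 127):
    """Weights: scale from the max absolute value, zero-point fixed at 0."""
    scale = np.abs(x).max() / qmax
    return scale, 0

def asymmetric_params(x: np.ndarray, qmin: int = 0, qmax: int = 255):
    """Activations: scale spans the full [min, max] range, zero-point shifts it."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

weights = np.random.uniform(-1.0, 1.0, size=1000).astype(np.float32)
acts = np.random.uniform(-0.6, 1.95, size=1000).astype(np.float32)
print(symmetric_params(weights))   # roughly (1/127, 0)
print(asymmetric_params(acts))     # scale near 0.01, zero-point near 60
```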
3. Comparison of Quantization Strategies: PTQ vs. QAT
graph TD
subgraph "Post-Training Quantization
(PTQ)"
direction TB
A(FP32 Pre-Trained Model) --> B[Calibration with Data];
B --> C[Calculate Quantization Params];
C --> D("Quantized<br/>INT8/INT4 Model");
end
subgraph "Quantization-Aware Training
(QAT)"
direction TB
E(FP32 Pre-Trained Model) --> F{Fine-Tuning Loop};
F -- Forward Pass --> G(Simulate Quantization);
G -- Backward Pass --> H(Update FP32 Weights);
H --> F;
F -- End of Fine-Tuning --> I("Final Quantized<br/>INT8/INT4 Model");
end
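The defining trick of QAT is simulating quantization in the forward pass while the backward pass still updates the FP32 weights, typically via the straight-through estimator. A minimal PyTorch-style sketch, assuming symmetric per-tensor scaling (`fake_quantize` is an illustrative helper, not a specific framework API):

```python
import torch

def fake_quantize(w: torch.Tensor, scale: float, qmin: int = -127, qmax: int = 127):
    """Forward: quantize-dequantize so the loss sees INT8 rounding.
    Backward: the straight-through estimator passes gradients to the FP32 weights."""
    q = torch.clamp(torch.round(w / scale), qmin, qmax)
    w_q = q * scale
    return w + (w_q - w).detach()   # value equals w_q, gradient flows through w

w = torch.randn(4, 4, requires_grad=True)
scale = w.detach().abs().max().item() / 127
loss = fake_quantize(w, scale).sum()
loss.backward()
print(w.grad)   # non-zero: training proceeds on the FP32 master weights
```

PTQ, by contrast, skips the fine-tuning loop entirely and only needs a calibration pass to estimate scales and zero-points.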
4. Overview of Advanced PTQ Techniques
mindmap
  root((Advanced PTQ))
    GPTQ
    ::icon(fa fa-compress)
      Layer-by-layer quantization
      Uses second-order Hessian information
      Updates remaining weights to compensate for quantization error
    AWQ
    ::icon(fa fa-star)
      Activation-Aware Weight Quantization
      Identifies important weights based on activation magnitudes
      Protects salient weights with per-channel scaling
    SmoothQuant
    ::icon(fa fa-sliders)
      Addresses challenging activation outliers
      Shifts quantization difficulty from activations to weights
      Enables accurate W8A8 quantization
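As a concrete illustration of the SmoothQuant idea, the toy sketch below migrates an activation outlier channel into the weights using per-channel scales while leaving the layer output mathematically unchanged. The function name and alpha = 0.5 are illustrative choices, not the paper's reference implementation.

```python
import numpy as np

def smoothquant_scales(x_absmax: np.ndarray, w_absmax: np.ndarray, alpha: float = 0.5):
    """Per-input-channel scales that move quantization difficulty
    from activation outliers into the (easier to quantize) weights."""
    return (x_absmax ** alpha) / (w_absmax ** (1 - alpha))

# X: activations (tokens x channels), W: weights (channels x out_features)
X = np.random.randn(8, 16).astype(np.float32)
X[:, 3] *= 50.0                                   # simulate an outlier channel
W = np.random.randn(16, 32).astype(np.float32)

s = smoothquant_scales(np.abs(X).max(axis=0), np.abs(W).max(axis=1))
X_smooth, W_smooth = X / s, W * s[:, None]        # X @ W == X_smooth @ W_smooth
print(np.allclose(X @ W, X_smooth @ W_smooth, atol=1e-3))
```

After smoothing, both `X_smooth` and `W_smooth` have tamer per-channel ranges, which is what makes W8A8 quantization workable.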
5. Hardware Acceleration for Quantized Models
graph TD
subgraph Standard Inference
direction TB
A(FP32 Model) --> B("General-Purpose Cores / CUDA Cores");
B --> C(Slower Inference);
end
subgraph Accelerated Inference
direction TB
D("Quantized INT8/INT4 Model") --> E("Specialized Hardware<br/>e.g., NVIDIA Tensor Cores, Intel AMX");
E --> F(Faster Inference);
end
style F fill:#d5e8d4,stroke:#333,stroke-width:2px
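The speedup comes from what these units compute: INT8 multiplies accumulated in INT32, followed by a single rescale back to FP32. NumPy has no Tensor Core or AMX kernel, so the sketch below only mimics that arithmetic; the per-tensor scales are illustrative values.

```python
import numpy as np

# INT8 operands, as a quantized matmul would receive them
A_q = np.random.randint(-128, 128, size=(64, 128), dtype=np.int8)
B_q = np.random.randint(-128, 128, size=(128, 64), dtype=np.int8)
scale_a, scale_b = 0.02, 0.01                      # illustrative per-tensor scales

acc = A_q.astype(np.int32) @ B_q.astype(np.int32)  # multiply INT8, accumulate in INT32
C = acc.astype(np.float32) * (scale_a * scale_b)   # single rescale back to FP32
print(C.shape, C.dtype)
```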