Tokenization
Published on: 26 September 2025
Tags: #tokenization #ai #llm
The Overall NLP Pipeline
graph TD
A["Raw Text: 'Tokenization is crucial'"] --> B{Tokenization};
B --> C["Tokens: ['Tokenization', 'is', 'crucial']"];
C --> D{"Numericalization / Vocabulary Mapping"};
D --> E["Input IDs: [2345, 16, 6789]"];
E --> F{Embedding Lookup};
F --> G["Vector Representations: [[0.1, ...], [0.5, ...], [0.9, ...]]"];
G --> H{LLM Processing};
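To make the pipeline concrete, here is a minimal sketch using the Hugging Face `transformers` library; the choice of `bert-base-uncased` is an illustrative assumption, not something the diagram prescribes. It walks the text through tokenization, vocabulary mapping, and the embedding lookup that feeds the model.

```python
# Minimal sketch of the pipeline above (illustrative model choice).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "Tokenization is crucial"

# Tokenization: raw text -> tokens
tokens = tokenizer.tokenize(text)  # e.g. ['token', '##ization', 'is', 'crucial']

# Numericalization: tokens -> input IDs via the vocabulary
input_ids = tokenizer.convert_tokens_to_ids(tokens)

# Embedding lookup: input IDs -> vector representations passed on to the LLM
embeddings = model.get_input_embeddings()(torch.tensor([input_ids]))
print(tokens, input_ids, embeddings.shape)  # shape: (1, num_tokens, hidden_size)
```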
Comparison of Tokenization Methods
graph TD
subgraph Input
A["Input Text: 'Learning tokenization'"]
end
subgraph Tokenization Methods
B(Word-Level)
C(Character-Level)
D(Subword-Level)
end
subgraph Output Tokens
B_out(["'Learning', 'tokenization'"])
C_out(["'L','e','a','r','n','i','n','g',' ','t','o','k','e','n','i','z','a','t','i','o','n'"])
D_out(["'Learn', '##ing', 'token', '##ization'"])
end
A --> B --> B_out
A --> C --> C_out
A --> D --> D_out
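The same contrast is easy to reproduce in a few lines: the word- and character-level splits are trivial, and the subword split below uses BERT's WordPiece tokenizer as one concrete example (an assumption for illustration, since the exact pieces depend on the trained vocabulary).

```python
from transformers import AutoTokenizer

text = "Learning tokenization"

word_level = text.split()   # ['Learning', 'tokenization']
char_level = list(text)     # ['L', 'e', 'a', 'r', 'n', ...]

# Subword split via WordPiece (illustrative; output depends on the vocabulary)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
subword_level = tokenizer.tokenize(text)  # e.g. ['learning', 'token', '##ization']

print(word_level)
print(char_level)
print(subword_level)
```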
Byte-Pair Encoding (BPE) Algorithm Flow
graph TD
A[Start with a corpus of text] --> B{Initialize vocabulary with all individual characters};
B --> C{"Target vocabulary size reached?"};
C -- No --> D{"Find the most frequent adjacent pair of tokens (e.g. 'e' + 's')"};
D --> E{"Merge this pair into a new, single token ('es')"};
E --> F{Add the new token to the vocabulary};
F --> G{Update the corpus by replacing all instances of the pair with the new token};
G --> C;
C -- Yes --> H[End: Final Vocabulary Generated];
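The flow above can be sketched directly in Python, following the classic word-frequency formulation of BPE training; the toy corpus and the number of merges are illustrative placeholders.

```python
# Minimal BPE training loop: count adjacent pairs, merge the most frequent,
# repeat until the merge budget (a stand-in for the target vocabulary size) is spent.
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every occurrence of the pair with the merged token, respecting symbol boundaries."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Illustrative toy corpus; real training would use a large text corpus.
corpus = ["low", "lower", "newest", "widest", "newest"]

# Initialize the vocabulary at the character level, with an end-of-word marker.
vocab = Counter(" ".join(word) + " </w>" for word in corpus)

num_merges = 10
for _ in range(num_merges):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair, e.g. ('e', 's')
    vocab = merge_pair(best, vocab)   # 'e' + 's' becomes the single token 'es'
    print("merged:", best)
```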
Vocabulary Size Trade-offs
graph TD
subgraph Larger Vocabulary
direction LR
A_Pro1["+ Shorter sequence lengths"]
A_Pro2["+ Fewer 'unknown' tokens"]
A_Con1["- Larger model size (embedding layer)"]
A_Con2["- Slower training"]
A_Con3["- May have undertrained embeddings for rare tokens"]
end
subgraph Smaller Vocabulary
direction LR
B_Pro1["+ Smaller model size"]
B_Pro2["+ More computationally efficient"]
B_Con1["- Longer sequence lengths"]
B_Con2["- May split words into less meaningful pieces"]
B_Con3["- Can make it harder for the model to learn long-range dependencies"]
end
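As a rough illustration of the model-size trade-off, the input embedding table alone holds vocab_size × hidden_dim parameters, so the vocabulary size scales that part of the model linearly; the sizes below are assumed values chosen only for the sake of the example.

```python
# Back-of-the-envelope: embedding parameters = vocab_size * hidden_dim
# (both values below are illustrative assumptions).
hidden_dim = 4096

for vocab_size in (32_000, 128_000, 256_000):
    params = vocab_size * hidden_dim
    print(f"vocab={vocab_size:>7,}  embedding params = {params / 1e6:,.0f}M")
```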