Tokenization plays a key role in large language models—it turns raw text into a format that the models can actually understand and work with.
When building RAG (Retrieval-Augmented Generation) systems or fine-tuning large language models, it is important to understand tokenization techniques. Input data must be tokenized before being fed into the model. Since tokenization can vary between models, it’s essential to use the same tokenization method that was used during the model’s original training.
In this tutorial, we'll go through tokenization and its practical applications in LLM tasks. The tutorial will cover:
- Introduction to Tokenization
- Tokenization in LLMs
- Byte Pair Encoding (BPE)
- WordPiece
- Key Differences Between BPE and WordPiece
- Conclusion
Let's get started.
Tokenization is the process of breaking text into smaller units (tokens) for analysis or modeling. There are two main approaches:
1. Traditional Tokenization
- Splits text into words or sentences based on simple rules, such as whitespace and punctuation.
- Commonly used in classic NLP tasks like part-of-speech (POS) tagging and syntactic parsing.
- Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat", "."]
- Limitation: Struggles with rare, compound, or unknown words (e.g., "ChatGPT" or "e-mail").
2. LLM Tokenization
- Splits text into subword units to better handle rare and unknown terms.
- Allows any word to be represented by combining known subwords (e.g., "ChatGPT" → ["Chat", "G", "PT"], depending on the tokenizer).
- Uses a fixed-size vocabulary (e.g., 50,000 tokens) for efficiency and consistency during training and inference.
- Example: "unhappiness" → ["un", "happi", "ness"] (using Byte-Pair Encoding or WordPiece).
Traditional tokenization preserves linguistic units, while LLM tokenization optimizes the vocabulary for efficient, consistent processing by machine learning models.
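As a concrete illustration of the contrast, here is a minimal sketch comparing a simple rule-based split with a pretrained subword tokenizer; the example sentence and the choice of GPT-2 are illustrative assumptions:

```python
import re
from transformers import AutoTokenizer  # assumes the transformers library is installed

text = "ChatGPT handles unhappiness gracefully."

# Traditional tokenization: split into words and punctuation with a simple rule.
print(re.findall(r"\w+|[^\w\s]", text))

# LLM tokenization: a pretrained subword tokenizer (here GPT-2's byte-level BPE).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize(text))
# The rule-based split keeps whole words; the subword tokenizer may break rare
# words into smaller known pieces, with exact splits depending on its vocabulary.
```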
Tokenization in LLMs
Tokenization is a key step in how large language models (LLMs) process text. It involves breaking input text down into smaller units called tokens, which are the building blocks the model understands. These tokens aren't always whole words; depending on the tokenizer, they might be subwords or even individual characters.
In other words, a single word could be split into multiple tokens, or several short words might be grouped into one token. Tokenization can be considered a translator between human language and the numerical format a model requires.
There are several tokenization methods, such as BPE, WordPiece, SentencePiece, and Byte-level BPE. In the section below, we’ll explore these methods with examples.
Byte Pair Encoding (BPE)
BPE starts with characters as the basic tokens and iteratively merges the most frequent pairs of characters or subwords into new tokens. This helps create a vocabulary of common subwords and efficiently handles rare or new words by breaking them into familiar subword pieces.
Example:
Input: "unhappiness"
Initial tokens: ["u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"]
After merges: ["un", "happi", "ness"]
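To make the merge loop concrete, here is a minimal pure-Python sketch of BPE training; the word frequencies and the number of merges are toy assumptions, not part of any real tokenizer:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every standalone occurrence of the pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Each word is written as space-separated characters; frequencies are toy values.
vocab = {
    "u n h a p p y": 6,
    "h a p p i n e s s": 5,
    "u n h a p p i n e s s": 3,
}

for step in range(8):  # a handful of merges is enough for this toy corpus
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")

print(list(vocab))  # words segmented into the learned subword units
```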
Using the tokenizers library from Hugging Face, we can implement BPE tokenization.
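Here is a minimal sketch of how that could look; the training sentences and vocab_size below are illustrative assumptions rather than a fixed recipe:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Small toy corpus (an assumption for illustration).
corpus = [
    "unhappy people value kindness",
    "happiness and sadness are feelings",
    "the opposite of happiness is sadness",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

output = tokenizer.encode("unhappiness")
print(output.tokens)  # the unseen word is split into learned subword pieces
```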
What happens if we include compound words like unhappiness and unkindness in the training corpus? If they appear frequently enough and the vocabulary size allows, the tokenizer learns them as whole tokens and adds them directly to its vocabulary.
As a result, the tokenizer will treat them as single tokens, as shown below:
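A sketch of this experiment, again with an assumed toy corpus in which the compound words appear often and a vocabulary size large enough to allow full merges:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus (an assumption) in which the compound words occur frequently.
corpus = [
    "unhappiness and unkindness appear in many sentences",
    "her unhappiness grew while his unkindness did not",
    "unhappiness unkindness unhappiness unkindness",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# A generous vocab_size lets BPE keep merging until whole words become single tokens.
trainer = BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("unhappiness").tokens)  # expected to be a single token once fully merged
print(tokenizer.encode("unkindness").tokens)
```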
WordPiece
Developed by Google (and used in BERT), WordPiece is similar to BPE but uses a probabilistic model to decide which subword merges maximize the likelihood of the training data. It tends to keep frequent words intact while breaking rare words into smaller parts.
One of its strengths is its effective handling of rare words by focusing on maximizing the likelihood of the observed data.
Example:
Input: "unhappiness"
Tokens: ["un", "##happiness"]
or ["un", "##happi", "##ness"]
(The ## prefix indicates continuation from the previous token.)
By using the tokenizers library, we can implement WordPiece tokenization. The code below trains a WordPiece tokenizer on a small custom corpus and then uses it to tokenize the word "unhappiness".
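One possible version of that code, with an assumed toy corpus and vocabulary size:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Small toy corpus (an assumption for illustration).
corpus = [
    "unhappy people value kindness",
    "happiness and sadness are feelings",
    "the opposite of happiness is sadness",
]

# [UNK] is the fallback token for anything outside the learned vocabulary.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

output = tokenizer.encode("unhappiness")
print(output.tokens)  # subword pieces; continuation pieces carry the "##" prefix
```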
The unk_token stands for "unknown token"—it defines the special token that the tokenizer uses when it encounters a word or subword not present in its vocabulary.
In such cases, that word (or part of it) is replaced with [UNK].
Key Differences Between BPE and WordPiece
Both BPE and WordPiece are subword tokenization methods, but they differ in how they decide which subword units to merge:
- BPE merges the most frequent pairs of characters or subwords in the training data. It’s purely frequency-based.
- WordPiece uses a probabilistic approach—it selects merges based on which combinations maximize the likelihood of the training data. This often results in more effective handling of rare or unseen words.
In practice, WordPiece tends to split rare words more conservatively than BPE, which helps language models generalize better.
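One quick way to observe this difference is to compare pretrained tokenizers from the two families through the transformers library; the sample words below are arbitrary, and the exact splits depend on each model's learned vocabulary:

```python
from transformers import AutoTokenizer

# GPT-2 ships a byte-level BPE tokenizer; BERT ships a WordPiece tokenizer.
bpe_tok = AutoTokenizer.from_pretrained("gpt2")
wp_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["unhappiness", "tokenization", "ChatGPT"]:
    print(word)
    print("  BPE (GPT-2):      ", bpe_tok.tokenize(word))
    print("  WordPiece (BERT): ", wp_tok.tokenize(word))
# WordPiece marks continuation pieces with "##", while GPT-2's byte-level BPE
# marks tokens that follow a space with "Ġ".
```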