Tokenization plays a key role in large language models—it turns raw text into a format that the models can actually understand and work with.
When building RAG (Retrieval-Augmented Generation) systems or fine-tuning large language models, it is important to understand tokenization techniques. Input data must be tokenized before being fed into the model. Since tokenization can vary between models, it’s essential to use the same tokenization method that was used during the model’s original training.
In this tutorial, we'll go through tokenization and its practical applications in LLM tasks. The tutorial will cover:
- Introduction to Tokenization
- Tokenization in LLMs
- Byte Pair Encoding (BPE)
- WordPiece
- Key Differences Between BPE and WordPiece
- Conclusion
Let's get started.
Tokenization is the process of breaking text into smaller units (tokens) for analysis or modeling. There are two main approaches:
1. Traditional Tokenization
- Splits text into words or sentences based on simple rules, such as whitespace and punctuation.
- Commonly used in classic NLP tasks like part-of-speech (POS) tagging and syntactic parsing.
- Example: "The cat sat on the mat." → ["The", "cat", "sat", "on", "the", "mat", "."]
- Limitation: Struggles with rare, compound, or unknown words (e.g., "ChatGPT" or "e-mail").
2. LLM Tokenization
- Splits text into subword units to better handle rare and unknown terms.
- Allows any word to be represented by combining known subwords (e.g., "ChatGPT" → ["Chat", "G", "PT"], depending on the tokenizer).
- Uses a fixed-size vocabulary (e.g., 50,000 tokens) for efficiency and consistency during training and inference.
- Example: "unhappiness" → ["un", "happi", "ness"] (using Byte-Pair Encoding or WordPiece).
Traditional tokenization preserves linguistic units, while LLM tokenization optimizes the vocabulary for efficient, consistent processing by machine learning models.
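As a concrete illustration of the contrast, here is a minimal sketch comparing a simple rule-based split with a pretrained subword tokenizer; the example sentence and the choice of GPT-2 are illustrative assumptions:

```python
import re
from transformers import AutoTokenizer  # assumes the transformers library is installed

text = "ChatGPT handles unhappiness gracefully."

# Traditional tokenization: split into words and punctuation with a simple rule.
print(re.findall(r"\w+|[^\w\s]", text))

# LLM tokenization: a pretrained subword tokenizer (here GPT-2's byte-level BPE).
tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize(text))
# The rule-based split keeps whole words; the subword tokenizer may break rare
# words into smaller known pieces, with exact splits depending on its vocabulary.
```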
Tokenization in LLMs
Tokenization is a key step in how large language models (LLMs) process text. It involves breaking input text down into smaller units called tokens, which are the building blocks the model understands. These tokens aren't always whole words; depending on the tokenizer, they might be subwords or even individual characters.
In other words, a single word could be split into multiple tokens, or several short words might be grouped into one token. Tokenization can be considered a translator between human language and the numerical format a model requires.
There are several tokenization methods, such as BPE, WordPiece, SentencePiece, and Byte-level BPE. In the section below, we’ll explore these methods with examples.
Byte Pair Encoding (BPE)
BPE starts with characters as the basic tokens and iteratively merges the most frequent pairs of characters or subwords into new tokens. This helps create a vocabulary of common subwords and efficiently handles rare or new words by breaking them into familiar subword pieces.
Example:
Input: "unhappiness"
Initial tokens: ["u", "n", "h", "a", "p", "p", "i", "n", "e", "s", "s"]
After merges: ["un", "happi", "ness"]
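To make the merge loop concrete, here is a minimal pure-Python sketch of BPE training; the word frequencies and the number of merges are toy assumptions, not part of any real tokenizer:

```python
import re
from collections import Counter

def get_pair_counts(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Replace every standalone occurrence of the pair with its merged symbol."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(merged, word): freq for word, freq in vocab.items()}

# Each word is written as space-separated characters; frequencies are toy values.
vocab = {
    "u n h a p p y": 6,
    "h a p p i n e s s": 5,
    "u n h a p p i n e s s": 3,
}

for step in range(8):  # a handful of merges is enough for this toy corpus
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    print(f"merge {step + 1}: {best}")

print(list(vocab))  # words segmented into the learned subword units
```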
Using the tokenizers library from Hugging Face, we can implement BPE tokenization.
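Here is a minimal sketch of how that could look; the training sentences and vocab_size below are illustrative assumptions rather than a fixed recipe:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Small toy corpus (an assumption for illustration).
corpus = [
    "unhappy people value kindness",
    "happiness and sadness are feelings",
    "the opposite of happiness is sadness",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

output = tokenizer.encode("unhappiness")
print(output.tokens)  # the unseen word is split into learned subword pieces
```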
What happens if we include compound words like unhappiness and unkindness in the training corpus? If they appear frequently enough and the vocabulary size allows, the tokenizer learns them as whole tokens and adds them directly to its vocabulary.
As a result, the tokenizer will treat them as single tokens, as shown below:
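A sketch of this experiment, again with an assumed toy corpus in which the compound words appear often and a vocabulary size large enough to allow full merges:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

# Toy corpus (an assumption) in which the compound words occur frequently.
corpus = [
    "unhappiness and unkindness appear in many sentences",
    "her unhappiness grew while his unkindness did not",
    "unhappiness unkindness unhappiness unkindness",
]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# A generous vocab_size lets BPE keep merging until whole words become single tokens.
trainer = BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("unhappiness").tokens)  # expected to be a single token once fully merged
print(tokenizer.encode("unkindness").tokens)
```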
WordPiece
Developed by Google (and used in BERT), WordPiece is similar to BPE but uses a probabilistic model to decide which subword merges maximize the likelihood of the training data. It tends to keep frequent words intact while breaking rare words into smaller parts.
One of its strengths is its effective handling of rare words by focusing on maximizing the likelihood of the observed data.
Example:
Input: "unhappiness"
Tokens: ["un", "##happiness"]
or ["un", "##happi", "##ness"]
(The ## prefix indicates continuation from the previous token.)
By using the tokenizers library, we can implement WordPiece tokenization. The code below trains a WordPiece tokenizer on a small custom corpus and then uses it to tokenize the word "unhappiness".
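One possible version of that code, with an assumed toy corpus and vocabulary size:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Small toy corpus (an assumption for illustration).
corpus = [
    "unhappy people value kindness",
    "happiness and sadness are feelings",
    "the opposite of happiness is sadness",
]

# [UNK] is the fallback token for anything outside the learned vocabulary.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(vocab_size=100, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

output = tokenizer.encode("unhappiness")
print(output.tokens)  # subword pieces; continuation pieces carry the "##" prefix
```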
The unk_token stands for "unknown token"—it defines the special token that the tokenizer uses when it encounters a word or subword not present in its vocabulary.
In such cases, that word (or part of it) is replaced with [UNK].
Key Differences Between BPE and WordPiece
Both BPE and WordPiece are subword tokenization methods, but they differ in how they decide which subword units to merge:
- BPE merges the most frequent pairs of characters or subwords in the training data. It’s purely frequency-based.
- WordPiece uses a probabilistic approach—it selects merges based on which combinations maximize the likelihood of the training data. This often results in more effective handling of rare or unseen words.
In practice, WordPiece tends to split rare words more conservatively than BPE, which helps language models generalize better.
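One quick way to observe this difference is to compare pretrained tokenizers from the two families through the transformers library; the sample words below are arbitrary, and the exact splits depend on each model's learned vocabulary:

```python
from transformers import AutoTokenizer

# GPT-2 ships a byte-level BPE tokenizer; BERT ships a WordPiece tokenizer.
bpe_tok = AutoTokenizer.from_pretrained("gpt2")
wp_tok = AutoTokenizer.from_pretrained("bert-base-uncased")

for word in ["unhappiness", "tokenization", "ChatGPT"]:
    print(word)
    print("  BPE (GPT-2):      ", bpe_tok.tokenize(word))
    print("  WordPiece (BERT): ", wp_tok.tokenize(word))
# WordPiece marks continuation pieces with "##", while GPT-2's byte-level BPE
# marks tokens that follow a space with "Ġ".
```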