In the previous tutorial, we explored LLM tokenization and learned how to use BPE and WordPiece tokenization with the tokenizers library. In this second part, we will learn how to use the SentencePiece and Byte-level BPE methods.
The tutorial will cover:
- Introduction to SentencePiece
- Implementing SentencePiece Tokenization
- Introduction to Byte-level BPE
- Implementing Byte-level BPE Tokenization
- Conclusion
Let's get started.
Introduction to SentencePiece
SentencePiece is a language-agnostic subword tokenization algorithm designed to work directly on raw, untokenized text. It treats the entire input as a character stream, learning subword units using algorithms like Byte Pair Encoding (BPE) or Unigram Language Modeling.
Unigram Language Model tokenization is a probabilistic method that segments text into subwords by selecting the most likely combination from a pre-learned vocabulary, offering more flexible and accurate tokenization than greedy methods like BPE.
A distinctive feature of SentencePiece is its use of a special token (▁) to mark word boundaries, enabling it to tokenize text without relying on whitespace. This makes it particularly effective for languages like Japanese and for multilingual models such as T5 and mT5. It supports flexible vocabulary generation and allows probabilistic tokenization via the Unigram model, making it suitable for both deterministic and sampling-based segmentation.
SentencePiece treats text as a raw stream of characters—including spaces—making it ideal for languages without explicit word boundaries. One of its key strengths is that it's language-agnostic and doesn't require the text to be pre-tokenized, which makes it highly versatile across different languages and datasets.
Example:
Input: "Hello world"
Tokens: ["▁Hello", "▁world"]
(▁ represents a space in the original text.)
To use SentencePiece tokenization, we’ll work with the sentencepiece library. You can install it using the following pip command:
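```
pip install sentencepiece
```

Once installed, we can train a model and tokenize text. The sketch below is a minimal, illustrative version of those steps: the mixed English/Japanese corpus, the file name train.txt, the model prefix sp_model, and vocab_size=100 are placeholder choices, not requirements.

```python
import sentencepiece as spm

# Illustrative mixed English/Japanese corpus -- replace with your own data.
corpus = [
    "Hello world, SentencePiece works on raw text.",
    "Subword tokenization helps with rare and compound words.",
    "Tokenizers should handle punctuation, numbers like 2024, and symbols.",
    "私は日本語を話します。",
    "東京は日本の首都です。",
    "自然言語処理はとても面白い分野です。",
]

# SentencePiece trains from a plain text file, one sentence per line.
with open("train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))

# Keep vocab_size small for a tiny corpus; model_type can be "unigram"
# (the default) or "bpe".
spm.SentencePieceTrainer.train(
    input="train.txt",
    model_prefix="sp_model",
    vocab_size=100,
    model_type="unigram",
    character_coverage=1.0,
)

# Load the trained model and tokenize new text.
sp = spm.SentencePieceProcessor(model_file="sp_model.model")
print(sp.encode("Hello world, this is a test.", out_type=str))
print(sp.encode("東京で日本語を話します。", out_type=str))
```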
Result:
Explanation of the output:
SentencePiece splits English words into subwords, useful for handling rare or compound words. It preserves punctuation as individual tokens. For Japanese, it handles kanji and hiragana sequences as distinct tokens based on learned units. The ▁ symbol is a marker for where whitespace appeared before tokenization.
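Because the sketch above trains a Unigram model (the sentencepiece default), the same processor can also produce sampled segmentations, which is the probabilistic tokenization mentioned earlier. The options used below (enable_sampling, nbest_size, alpha) are standard sentencepiece arguments, but the exact splits will vary from run to run:

```python
# Sampled (probabilistic) segmentation -- only meaningful for Unigram models.
# nbest_size=-1 samples over all candidate segmentations; a smaller alpha
# flattens the sampling distribution, so the splits vary more between runs.
for _ in range(3):
    print(sp.encode("Hello world", out_type=str,
                    enable_sampling=True, nbest_size=-1, alpha=0.1))
```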
Introduction to Byte-level BPE
Byte-Level BPE, used in models like GPT-2, GPT-3, and LLaMA, operates at the byte level instead of the character level, which means it treats every possible byte as a basic unit. This ensures that any input, regardless of language, special characters, or formatting, can be tokenized and perfectly reversed. It encodes spaces and other characters using specific byte representations (e.g., Ġ for a space), preserving exact formatting.
Byte-Level BPE avoids out-of-vocabulary issues by guaranteeing coverage for all inputs and compresses text efficiently by merging frequently occurring byte pairs. This method is ideal for handling rich and noisy datasets like web text or code, where exact preservation of characters is critical.
Byte-level BPE works at the byte level rather than the character level. This means it can handle any kind of input—emojis, accented characters, or special symbols—without needing extra normalization. It's robust to all Unicode characters and doesn’t make assumptions about character encoding, making it highly flexible for real-world text.
Example:
Input: "café"
Tokens: ["c", "a", "f", "Ã", "©"]
(before merges)
After BPE merges: ["café"]
(if it's in vocabulary)
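To see where the two extra symbols come from, you can inspect the raw UTF-8 bytes of the string in plain Python (no tokenizer library needed):

```python
# "é" is one character but two bytes in UTF-8, so byte-level BPE starts
# from five base units for the four visible characters of "café".
text = "café"
print(list(text.encode("utf-8")))  # [99, 97, 102, 195, 169]
# GPT-2-style tokenizers display the last two bytes (0xC3, 0xA9) with the
# printable stand-ins "Ã" and "©" until merges recombine them.
```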
Using Hugging Face's tokenizers library with ByteLevelBPETokenizer, we can implement byte-level BPE tokenization. We'll start by saving the training data to a plain text file, then train the tokenizer using that file and a specified vocabulary size. Finally, we'll encode new text and observe how it's broken down into subword or byte-level tokens.
Here is the source code:
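The listing below is a minimal sketch of those three steps rather than a definitive implementation: the sample sentences, the file name bpe_train.txt, and vocab_size=1000 are illustrative placeholders you can replace with your own data and settings.

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder training corpus -- swap in your own data.
corpus = [
    "Byte-level BPE can tokenize any input, including emojis 😊 and café.",
    "It merges frequently occurring byte pairs into subword tokens.",
    "Models like GPT-2 use this scheme to avoid out-of-vocabulary errors.",
]

# Step 1: save the training data to a plain text file.
with open("bpe_train.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(corpus))

# Step 2: train the tokenizer on that file with a chosen vocabulary size.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["bpe_train.txt"], vocab_size=1000, min_frequency=2)

# Step 3: encode new text and inspect the byte-level subword tokens.
encoding = tokenizer.encode("Byte-level BPE handles café and 日本語 too.")
print(encoding.tokens)
print(encoding.ids)
```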
Result:
Explanation of the output:
The tokenizer splits text into byte-level units, so a single visible character may become several byte-based tokens, especially for non-ASCII characters. The special-looking symbols are printable stand-ins for raw bytes: 'Ġ' marks a byte that was a space, while symbols like 'ä' and 'ģ' are the individual bytes of multibyte Unicode characters (such as Japanese text). The encoding is fully reversible, so the model can always reconstruct the original sentence from these tokens.
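One way to check that reversibility, continuing the sketch above, is to decode the ids and compare the result with the input:

```python
# Because every byte of the input is covered by the byte-level alphabet,
# decoding the ids should reconstruct the original string exactly.
decoded = tokenizer.decode(encoding.ids)
print(decoded)  # expected: "Byte-level BPE handles café and 日本語 too."
```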