Tokenization in LLMs – SentencePiece and Byte-level BPE (Part 2)

     In the previous tutorial, we explored LLM tokenization and learned how to use BPE and WordPiece tokenization with the tokenizers library. In this second part, we will learn how to use the SentencePiece and Byte-level BPE methods.

    The tutorial will cover:

  1. Introduction to SentencePiece
  2. Implementing SentencePiece Tokenization
  3. Introduction to Byte-level BPE 
  4. Implementing Byte-level BPE Tokenization
  5. Conclusion

     Let's get started.

 

Introduction to SentencePiece 

    SentencePiece is a language-agnostic subword tokenization algorithm designed to work directly on raw, untokenized text. It treats the entire input as a character stream, learning subword units using algorithms like Byte Pair Encoding (BPE) or Unigram Language Modeling. 

    Unigram Language Model tokenization is a probabilistic method that segments text into subwords by selecting the most likely combination of pieces from a learned vocabulary. Because it scores whole segmentations rather than applying a fixed merge sequence, it offers more flexible segmentation than deterministic merge-based methods like BPE and can also sample alternative splits of the same text.

    A distinctive feature of SentencePiece is its use of a special marker (▁) to indicate word boundaries, enabling it to tokenize text without relying on whitespace pre-tokenization. This makes it particularly effective for languages like Japanese and for multilingual models such as T5 and mT5. It supports flexible vocabulary generation and, with the Unigram model, allows probabilistic tokenization, making it suitable for both deterministic and sampled segmentations.
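    To make the sampling idea concrete, here is a minimal sketch using the sentencepiece library (installed in the next section). The corpus file name, its contents, and the vocabulary size are illustrative choices for this sketch, not part of the main example. It trains a small Unigram model and then asks for both a deterministic and a few sampled segmentations of the same word:

import sentencepiece as spm

# Hypothetical toy corpus, just for this sketch
with open("unigram_corpus.txt", "w", encoding="utf-8") as f:
    f.write("Hello world. This is a test. Unhappiness is unhelpful.")

# Train a small Unigram model instead of BPE
spm.SentencePieceTrainer.Train(
    input='unigram_corpus.txt', model_prefix='unigram_model',
    vocab_size=40, model_type='unigram'
)

sp = spm.SentencePieceProcessor()
sp.load("unigram_model.model")

# Deterministic segmentation: the single most likely split
print(sp.encode("unhappiness", out_type=str))

# Sampled segmentations: enable_sampling draws alternative plausible splits
for _ in range(3):
    print(sp.encode("unhappiness", out_type=str,
                    enable_sampling=True, alpha=0.1, nbest_size=-1))

    Each sampled call may return a different split of "unhappiness", which is exactly the probabilistic behavior the Unigram model enables.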
    
 

Implementing SentencePiece Tokenization

    SentencePiece treats text as a raw stream of characters—including spaces—making it ideal for languages without explicit word boundaries. One of its key strengths is that it's language-agnostic and doesn't require the text to be pre-tokenized, which makes it highly versatile across different languages and datasets.

Example:

Input: "Hello world"
Tokens: ["▁Hello", "▁world"]
(▁ represents a space in the original text.)

    To use SentencePiece tokenization, we’ll work with the sentencepiece library. You can install it using the following pip command:

 
% pip install sentencepiece

    First, we prepare a training corpus by writing some sample text to a file. Then we train a tokenizer on that file with SentencePieceTrainer, load the trained model with SentencePieceProcessor, and encode new sentences using it.
    Here is the complete code:
 
 
import sentencepiece as spm

# Write training data to file
with open("corpus.txt", "w") as f:
f.write("Hello world. This is a test. Unhappiness, 元気です。")

# Train a SentencePiece tokenizer
spm.SentencePieceTrainer.Train(
    input='corpus.txt', model_prefix='spm_model', vocab_size=50, model_type='bpe'
)

# Load and use the tokenizer
sp = spm.SentencePieceProcessor()
sp.load("spm_model.model")

print(sp.encode("unhappiness, 寒い", out_type=str)) 
 

Result:

 
['▁', 'u', 'n', 'h', 'app', 'in', 'ess', ',', '▁', '今日', 'は', '寒い', 'です', '。']

Explanation of the output

    SentencePiece splits English words into subwords, which is useful for handling rare or compound words, and it preserves punctuation as individual tokens. For Japanese, it treats kanji and hiragana sequences as distinct tokens based on the units it learned during training. The ▁ symbol marks where whitespace appeared in the original text.
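    Because the ▁ markers record exactly where the spaces were, the pieces can be decoded back into the original string. As a quick check, reusing the sp processor loaded above, a decode call should reverse the encoding:

# Encode to pieces, then decode back; ▁ markers are turned back into spaces
pieces = sp.encode("unhappiness, 今日は寒いです。", out_type=str)
print(sp.decode(pieces))  # should print the original sentence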


Introduction to Byte-level BPE 

    Byte-level BPE, used in models like GPT-2, GPT-3, and LLaMA, operates at the byte level instead of the character level, which means it treats every possible byte as a basic unit. This ensures that any input—regardless of language, special characters, or formatting—can be tokenized and perfectly reversed. It encodes spaces and other characters using specific byte representations (e.g., Ġ for a space), preserving exact formatting.

    Byte-Level BPE avoids out-of-vocabulary issues by guaranteeing coverage for all inputs and compresses text efficiently by merging frequently occurring byte pairs. This method is ideal for handling rich and noisy datasets like web text or code, where exact preservation of characters is critical.


Implementing Byte-level BPE Tokenization

    Byte-level BPE works at the byte level rather than the character level. This means it can handle any kind of input—emojis, accented characters, or special symbols—without needing extra normalization. It's robust to all Unicode characters and doesn’t make assumptions about character encoding, making it highly flexible for real-world text.

Example:

Input: "café"
Tokens: ["c", "a", "f", "Ã", "©"] (before merges)
After BPE merges: ["café"] (if it's in vocabulary) 
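    To see where tokens like "Ã" and "©" come from, it helps to look at the byte-to-character mapping that GPT-2-style byte-level tokenizers use. The snippet below is a simplified re-creation of that mapping written for illustration (it is not imported from any library), applied to the UTF-8 bytes of "café" and to a plain space:

# Simplified GPT-2-style byte-to-unicode mapping: every byte (0-255) is
# assigned a printable character so raw bytes can be treated like text
# before BPE merges are applied.
def bytes_to_unicode():
    # Bytes that are already printable map to themselves
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    n = 0
    # Remaining bytes (control characters, space, etc.) are shifted into
    # unused code points starting at 256
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, map(chr, cs)))

byte_map = bytes_to_unicode()

# "café" -> UTF-8 bytes -> mapped characters
print([byte_map[b] for b in "café".encode("utf-8")])  # ['c', 'a', 'f', 'Ã', '©']

# A space maps to 'Ġ', which is why that symbol marks word starts
print(byte_map[ord(" ")])  # Ġ

    Every one of the 256 possible bytes gets a printable stand-in, which is why byte-level BPE never needs an unknown token.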

    Using Hugging Face's tokenizers library with ByteLevelBPETokenizer, we can implement byte-level BPE tokenization. We'll start by saving the training data to a plain text file, then train the tokenizer using that file and a specified vocabulary size. Finally, we'll encode new text and observe how it's broken down into subword or byte-level tokens.
    Here is the source code:

 
from tokenizers import ByteLevelBPETokenizer

# Save corpus to a file
with open("byte_corpus.txt", "w") as f:
f.write("Hello world. This is a test. Unhappiness, 私は元気です。")

# Train the tokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=["byte_corpus.txt"], vocab_size=50)

# Encode sample text
output = tokenizer.encode("hello, 今日は寒い")
print(output.tokens) 
 

Result:

 
['h', 'e', 'l', 'l', 'o', ',', 'Ġ', 'ä', '»', 'Ĭ', 'æ', 'Ĺ', '¥', 'ã', 'ģ', '¯', 'å', 
'¯', 'Ĵ', 'ã', 'ģ', 'Ħ']

Explanation of the output

    The tokenizer splits text into byte-level units, so each visible character may become multiple byte-based tokens, especially for non-ASCII characters. The 'Ġ' token stands for the space, and special-looking symbols like 'ä' and 'ģ' are printable stand-ins for the individual UTF-8 bytes of the Japanese characters. This encoding is reversible: the model can still reconstruct the original sentence exactly from these tokens.
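    As a quick check of that reversibility, the trained tokenizer can decode the token IDs straight back into the original text (a one-line sketch continuing from the code above):

# Decode the token IDs back to text; the byte-level mapping is undone exactly
print(tokenizer.decode(output.ids))  # should print: hello, 今日は寒い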


Conclusion
 
    In this tutorial, we explored how SentencePiece and Byte-level BPE tokenization work: their main ideas, how they differ, and where they are useful in today's language models. We saw how each method handles text at the subword and byte level, implemented both methods in Python, and got a clearer picture of how they behave in real-world scenarios.
    
 

