Tokenization plays a key role in large language models: it turns raw text into the sequences of tokens (and token IDs) that models actually process.
When building RAG (Retrieval-Augmented Generation) systems or fine-tuning large language models, it is important to understand tokenization techniques. Input data must be tokenized before being fed into the model. Since tokenization can vary between models, it’s essential to use the same tokenization method that was used during the model’s original training.
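To see why this matters, here is a minimal sketch (assuming the Hugging Face `transformers` library is installed and the tokenizer files can be downloaded) that tokenizes the same sentence with GPT-2's BPE tokenizer and BERT's WordPiece tokenizer. The two models split the text differently, which is why you must use the tokenizer a model was trained with.

```python
# Compare how two different models tokenize the same text.
# Assumes: pip install transformers
from transformers import AutoTokenizer

text = "Tokenization turns raw text into model-ready units."

# GPT-2 uses Byte Pair Encoding (BPE); BERT uses WordPiece.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The token sequences (and their IDs) will differ between the two models.
print(gpt2_tokenizer.tokenize(text))
# e.g. ['Token', 'ization', 'Ġturns', 'Ġraw', 'Ġtext', ...]
print(bert_tokenizer.tokenize(text))
# e.g. ['token', '##ization', 'turns', 'raw', 'text', ...]
```

Feeding a model token IDs produced by a different tokenizer would map to the wrong vocabulary entries, so the model would effectively see garbled input.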
In this tutorial, we'll go through tokenization and its practical applications in LLM tasks. The tutorial will cover:
- Introduction to Tokenization
- Tokenization in LLMs
- Byte Pair Encoding (BPE)
- WordPiece
- Key Differences Between BPE and WordPiece
- Conclusion
Let's get started.