Tokenization plays a key role in large language models: it turns raw text into the sequences of tokens (and token IDs) that models actually process.
When building RAG (Retrieval-Augmented Generation) systems or fine-tuning large language models, it is important to understand tokenization techniques. Input data must be tokenized before being fed into the model. Since tokenization can vary between models, it’s essential to use the same tokenization method that was used during the model’s original training.
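To see why this matters, here is a minimal sketch (assuming the Hugging Face `transformers` library is installed and the tokenizer files can be downloaded) that tokenizes the same sentence with GPT-2's BPE tokenizer and BERT's WordPiece tokenizer. The two models split the text differently, which is why you must use the tokenizer a model was trained with.

```python
# Compare how two different models tokenize the same text.
# Assumes: pip install transformers
from transformers import AutoTokenizer

text = "Tokenization turns raw text into model-ready units."

# GPT-2 uses Byte Pair Encoding (BPE); BERT uses WordPiece.
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The token sequences (and their IDs) will differ between the two models.
print(gpt2_tokenizer.tokenize(text))
# e.g. ['Token', 'ization', 'Ġturns', 'Ġraw', 'Ġtext', ...]
print(bert_tokenizer.tokenize(text))
# e.g. ['token', '##ization', 'turns', 'raw', 'text', ...]
```

Feeding a model token IDs produced by a different tokenizer would map to the wrong vocabulary entries, so the model would effectively see garbled input.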
In this tutorial, we'll go through tokenization and its practical applications in LLM tasks. The tutorial will cover:
- Introduction to Tokenization
- Tokenization in LLMs
- Byte Pair Encoding (BPE)
- WordPiece
- Key Differences Between BPE and WordPiece
- Conclusion
Let's get started.