CS336: Lecture 1 Overview and Tokenization
Course Overview
The course is built around five main components, all of them focused on efficiency.
More descriptions about the course overview can be found in Course Lecture Notes.
More than a course overview…
What will I learn from this course?
A full understanding of this technology is necessary and crucial for fundamental research, rather than just prompting an existing model.
There are three types of knowledge:
Mechanics: how things work (what a Transformer is, how model parallelism leverages GPUs)
Mindset: squeezing the most out of the hardware, taking scale seriously (scaling laws)
Intuitions: which data and modeling decisions yield good accuracy
We can teach mechanics and mindset (these do transfer). We can only partially teach intuitions (do not necessarily transfer across scales).
About Scaling Laws & The Bitter Lesson
Do algorithms really make no sense, then? Of course not! We simply cannot afford to be so wasteful at scale!
- Wrong interpretation: scale is all that matters, algorithms don’t matter.
- Right interpretation: algorithms that scale is what matters.
In fact, efficiency is way more important at larger scale (can’t afford to be wasteful). In other words, maximize efficiency!
Tokenization
Tokenization covers the encoding and decoding processes between strings and token IDs. We will focus on Byte-Pair Encoding (BPE) in this section.
- Encoding: convert a string into a sequence of integers (token IDs)
- Decoding: convert the integers back into the original string
The vocabulary size is the number of distinct tokens (integers) the tokenizer can produce.
Several Notes
- Blank spaces are also part of tokens (whitespace is not discarded).
- spaCy internally uses hash values to represent strings, and these hashes map to integer IDs.
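A minimal sketch of the spaCy behaviour (assuming spaCy is installed and a blank English pipeline; the specific strings are illustrative):

```python
import spacy

# spaCy stores strings in a StringStore and refers to them by 64-bit hash values
nlp = spacy.blank("en")
doc = nlp("Hello world")

for token in doc:
    # token.orth is the hash of the token's text; the StringStore maps it back to the string
    print(token.text, token.orth, nlp.vocab.strings[token.orth])
```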
BPE: tiktoken
tiktoken is OpenAI's tokenization library; its tokenizers are trained with Byte-Pair Encoding (BPE) and perform very well at encoding and decoding English text.
Encoding and decoding are inverse processes. For example, the input string is first split into a list of tokens, which are then mapped to a list of integers (the `encoded_string`); the process is reversible in that you can simply call the `decode` method on those integers to recover the original string.
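A minimal sketch of this round trip with tiktoken (the gpt2 encoding is assumed here; the exact token IDs depend on the encoding used):

```python
import tiktoken

# Load a pretrained BPE tokenizer (the "gpt2" encoding is assumed here)
enc = tiktoken.get_encoding("gpt2")

text = "Hello, world!"
ids = enc.encode(text)       # string -> list of integer token IDs
decoded = enc.decode(ids)    # list of integer token IDs -> string

print(ids)
assert decoded == text       # encoding followed by decoding recovers the original string
```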
The tiktoken tokenizers are designed primarily for English (largely ASCII) text, so their support for Chinese and other Unicode strings is poor.
Let’s analyse the result:
- For ASCII, a single character is 1 byte (like `char` in C++).
- For Chinese characters in UTF-8, a single character is typically 3 bytes.
- For English text, the tokenization is roughly word-based, so `num_bytes/num_tokens` mainly reflects the average word length.
- For Chinese text, the tokenization falls back to (roughly) single characters, so the compression is much worse.
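A rough sketch of how such a comparison could be computed (the gpt2 encoding and the sample sentences are assumptions; the exact numbers depend on the text):

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")

def bytes_per_token(text: str) -> float:
    """Compression ratio: UTF-8 bytes per BPE token."""
    num_bytes = len(text.encode("utf-8"))
    num_tokens = len(enc.encode(text))
    return num_bytes / num_tokens

print(bytes_per_token("Hello, how are you doing today?"))  # English: several bytes per token
print(bytes_per_token("你好，你今天过得怎么样？"))          # Chinese: a much lower ratio
```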
Principles
To make it more formal: given a Unicode string, we want to find a hash function $f$ that satisfies:
- The Unicode string can be split into a sequence of segments $y \in \mathbb{Y}$. (The splitting part; we hope the splitting is meaningful. The value space $\mathbb{Y}$ is the set of finite sequences of Unicode characters.)
- The hash function: $f(y) = z$, with $y \in \mathbb{Y}$, $z \in \mathbb{Z}$, $\mathbb{Z} \subseteq \mathbb{N}$.
- No hash collision is allowed: $\forall z \in \mathbb{Z}$, there exists only one $y \in \mathbb{Y}$ such that $f(y) = z$. (Injectivity is required.)
Character-based Tokenization
Since the set of Unicode characters is finite, there exists at least one hash function that satisfies all the requirements above. A Unicode string is a sequence of Unicode characters, and each character can be converted into a code point (an integer) via ord.
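A minimal sketch of character-based tokenization with ord and chr:

```python
def encode_chars(text: str) -> list[int]:
    # Each Unicode character maps to its code point (an integer)
    return [ord(ch) for ch in text]

def decode_chars(ids: list[int]) -> str:
    # chr is the inverse of ord
    return "".join(chr(i) for i in ids)

ids = encode_chars("Hello, 世界")
print(ids)                        # [72, 101, 108, 108, 111, 44, 32, 19990, 30028]
assert decode_chars(ids) == "Hello, 世界"
```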
The problems:
- The contextual meaning is lost during the encoding process.
- The vocabulary (the space of Unicode code points, roughly 1.1 million) is quite large; we need to make it smaller for the training process.
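A quick way to see how large this vocabulary would be (a sketch; sys.maxunicode gives the largest code point):

```python
import sys

print(sys.maxunicode)   # 1114111 -- the largest Unicode code point (0x10FFFF)
print(ord("a"))         # 97      -- common ASCII characters get small code points
print(ord("🌍"))        # 127757  -- emoji and rarer characters get very large ones
```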
Byte-based Tokenization
Unicode strings can be represented as a sequence of bytes, which can be represented by integers between 0 and 255.
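A minimal sketch of byte-based tokenization via UTF-8:

```python
def encode_bytes(text: str) -> list[int]:
    # Every string becomes a sequence of UTF-8 bytes, each an integer in [0, 255]
    return list(text.encode("utf-8"))

def decode_bytes(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = encode_bytes("Hello, 世界")
print(ids)                            # ASCII chars take 1 byte each, Chinese chars take 3
assert decode_bytes(ids) == "Hello, 世界"
print(len("Hello, 世界"), len(ids))   # 9 characters but 13 bytes: sequences get longer
```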
Problem: quite long sequences.
Word-based Tokenization
Use a regex to split strings into words and do word-based tokenization (see the sketch after the list below).
Problems:
- The number of words is huge (just like the number of Unicode characters).
- Many words are rare and the model won’t learn much about them.
- This doesn’t obviously provide a fixed vocabulary size (unseen words would need special handling, e.g. an UNK token).
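A minimal sketch of regex-based word splitting (the pattern below is a simplified stand-in; GPT-2 uses a more elaborate pre-tokenization regex):

```python
import re

# Match runs of word characters, or a single non-space symbol
PATTERN = r"\w+|[^\w\s]"

def split_words(text: str) -> list[str]:
    return re.findall(PATTERN, text)

print(split_words("I'll say supercalifragilisticexpialidocious!"))
# ['I', "'", 'll', 'say', 'supercalifragilisticexpialidocious', '!']
```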
Byte-Pair Encoding
Wikipedia: Byte Pair Encoding
The original paper: http://www.pennelynn.com/Documents/CUJ/HTML/94HTML/19940045.HTM
It was adapted to NLP for neural machine translation.
Neural Machine Translation of Rare Words with Subword Units
Neural machine translation (NMT) models typically operate with a fixed vocabulary, but translation is an open-vocabulary problem. Previous work addresses the translation of out-of-vocabulary words by backing off to a dictionary. In this paper, we introduce a simpler and more effective approach, making the NMT model capable of open-vocabulary translation by encoding rare and unknown words as sequences of subword units. This is based on the intuition that various word classes are translatable via smaller units than words, for instance names (via character copying or transliteration), compounds (via compositional translation), and cognates and loanwords (via phonological and morphological transformations). We discuss the suitability of different word segmentation techniques, including simple character n-gram models and a segmentation based on the byte pair encoding compression algorithm.
- Basic idea: train the tokenizer on raw text to automatically determine the vocabulary.
- Intuition: common sequences of characters are represented by a single token, rare sequences are represented by many tokens.
The GPT-2 paper used word-based (regex) pre-tokenization to break the text into initial segments and ran the original BPE algorithm within each segment.
Sketch: start with each byte as a token, and successively merge the most common pair of adjacent tokens.
The BPE (Byte Pair Encoding) training process is an iterative algorithm that builds a vocabulary and a set of merge rules by finding and combining the most frequent adjacent symbols (bytes or subwords) in a given text.
Initialization:
- Start with a vocabulary that contains all unique individual bytes (0-255) present in your training text.
- Represent your training text as a sequence of these individual byte IDs.
Iterative Merging (for N desired merges):
- Count Pairs: In the current sequence of IDs (tokens), count the occurrences of all adjacent pairs of symbols.
- Find Most Frequent: Identify the pair of symbols that appears most frequently.
- Create New Symbol & Rule:
  - Assign a new, unique ID to this most frequent pair (e.g., if IDs up to 255 are used for single bytes, start new IDs from 256).
  - Add this new ID to your vocabulary, mapping it to the combined byte sequence of the two symbols it represents (e.g., if b'p' and b'a' were merged, the new ID maps to b'pa').
  - Record this merge rule (the original pair and its new ID).
- Update Sequence: Replace all occurrences of the most frequent pair in your text sequence with the new symbol’s ID.
Repeat: Continue the merging steps until you’ve performed the desired number of merges (N) or no more pairs can be found.
In essence, BPE training works by:
- Starting small: Treating every byte as a distinct unit.
- Gradually expanding: Learning to combine frequently occurring sequences of bytes into new, larger “subwords.”
- Building a dictionary: Creating a mapping from these new subwords (and original bytes) to unique numerical IDs.
This process allows the tokenizer to represent common words and subword units efficiently, leading to shorter token sequences for given texts, which is crucial for the performance of large language models.
The process is somewhat reminiscent of Huffman encoding.
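A minimal sketch of BPE training along these lines (simplified: no pre-tokenization, ties broken arbitrarily; the names train_bpe and num_merges are illustrative):

```python
from collections import Counter

def train_bpe(text: str, num_merges: int):
    """Toy BPE trainer: returns the merge rules, the vocabulary, and the final token sequence."""
    # Initialization: the sequence starts as raw UTF-8 byte IDs (0-255)
    ids = list(text.encode("utf-8"))
    vocab = {i: bytes([i]) for i in range(256)}   # ID -> byte sequence
    merges = {}                                   # (id1, id2) -> new ID

    for new_id in range(256, 256 + num_merges):
        # Count Pairs: occurrences of all adjacent ID pairs
        counts = Counter(zip(ids, ids[1:]))
        if not counts:
            break
        # Find Most Frequent pair
        pair = max(counts, key=counts.get)
        # Create New Symbol & Rule
        merges[pair] = new_id
        vocab[new_id] = vocab[pair[0]] + vocab[pair[1]]
        # Update Sequence: replace every occurrence of the pair with new_id
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged

    return merges, vocab, ids

merges, vocab, ids = train_bpe("the cat in the hat sat on the mat", num_merges=5)
print(merges)   # learned merge rules: (ID pair) -> new ID
print(ids)      # the training text re-encoded with the learned merges
```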
Given the training string and the chosen num_merges, you can inspect the learned merges and the resulting tokenization.