Byte Pair Encoding
May 12, 2025
With the growing popularity of LLMs, specifically transformers (GPT, BERT, etc.), it is fundamental that we understand how these complex models understand "words". Before language models can process text, they first need to break it down into manageable chunks. But splitting on characters loses meaning, and splitting on full words wastes space. Enter Byte Pair Encoding (BPE): a clever algorithm that finds a middle ground, helping models like GPT and BERT "tokenize" text in an efficient and learnable way. We will discuss the algorithm below.
BPE Training Algorithm
Before the training algorithm starts, we need to initialize our vocabulary from the corpus (a large collection of text). For example, suppose our tiny corpus contains the following words:
"low", "lowest", "newer", "wider", "widest"
We will now build our vocabulary from the symbols used to write these words. The base vocabulary would then be ['d', 'e', 'i', 'l', 'n', 'r', 's', 't', 'w']. However, in real-world use cases, the base vocabulary contains at least all ASCII characters, and usually many Unicode characters as well. If the text you are tokenizing contains a character that never appeared in the training corpus, that character is assigned an unknown token.
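As a quick sanity check of that base vocabulary, here is a minimal Python sketch (the toy_words and toy_base_vocab names are my own) that derives it from the toy corpus:

toy_words = ["low", "lowest", "newer", "wider", "widest"]

# Collect the unique characters across all words to form the base vocabulary
toy_base_vocab = sorted(set("".join(toy_words)))
print(toy_base_vocab)  # ['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w']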
GPT-2 and RoBERTa tokenizers deal with unknown characters in an intuitive way: start from bytes rather than characters. Every Unicode string is a byte sequence in UTF-8, and there are only 256 distinct byte values, so the initial vocabulary always has size 256 (plus special tokens such as <|endoftext|>).
To run BPE merges (which expect "characters"), GPT-2 maps each byte 0…255 to a printable Unicode character so the merge rules can be written as text. Because every possible byte already exists in the base vocabulary, no character can ever be out-of-vocabulary. Rare emoji are just rare byte sequences the model may never merge further, but they are still perfectly legal tokens.
This is called byte-level BPE.
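To make this concrete, here is a minimal sketch (not GPT-2's actual byte-to-Unicode mapping) showing that any string, including emoji, reduces to bytes in the 0-255 range under UTF-8:

# Every Unicode string becomes a sequence of bytes 0-255 under UTF-8,
# so a 256-entry base vocabulary can never encounter an out-of-vocabulary symbol.
for text in ["lower", "café", "🤖"]:
    print(text, list(text.encode("utf-8")))

# lower [108, 111, 119, 101, 114]
# café [99, 97, 102, 195, 169]       ('é' is two bytes)
# 🤖 [240, 159, 164, 150]            (the emoji is four bytes)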
After getting our base vocabulary, new tokens are added through merges until the desired vocabulary size is reached (GPT-2 learns 50,000 merges, for a vocabulary of 50,257 tokens). A merge is a rule that combines two existing tokens in the vocabulary into a new one. At the beginning, these merges combine single characters to create two-character tokens, and as training progresses, longer subwords.
At each step of training, the BPE algorithm searches for the most frequent pair of tokens (two consecutive tokens within a word). This is the pair that gets merged, and the process is then repeated for the next step. For clarity, I'll walk through this more intuitively using the words we defined previously. Let's assume the following word frequencies:
1("low", 15), ("lowest", 10), ("newer", 7), ("wider", 12), ("widest", 10)
meaning that the word "low" occurred 15 times in the corpus, "lowest" 10 times, "newer" 7 times, "wider" 12 times, and "widest" 10 times.
To start, we split each word into the characters of our initial vocabulary. Each word is now represented as a list of tokens:
1("l" "o" "w", 15), ("l" "o" "w" "e" "s" "t", 10), ("n" "e" "w" "e" "r", 7), ("w" "i" "d" "e" "r", 12), ("w" "i" "d" "e" "s" "t", 10)
Now, we examine each pair. For example, the pair ("o", "w") is present in the words "low" and "lowest", meaning its frequency is 25 in our corpus.
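Counting every pair by hand gets tedious, so here is a minimal sketch (the toy_word_freqs name is my own) that tallies the pair frequencies for the toy corpus with collections.Counter:

from collections import Counter

# Toy corpus: word -> frequency, as defined above
toy_word_freqs = {"low": 15, "lowest": 10, "newer": 7, "wider": 12, "widest": 10}

# Count every adjacent pair of symbols, weighted by how often the word occurs
pair_counts = Counter()
for word, freq in toy_word_freqs.items():
    symbols = list(word)
    for a, b in zip(symbols, symbols[1:]):
        pair_counts[(a, b)] += freq

# ("l", "o") and ("o", "w") both appear 25 times; the next most frequent pairs appear 22 times
print(pair_counts.most_common(3))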
This is the most frequent pair (tied with ("l", "o")), meaning that the first rule learned by the tokenizer is ("o", "w") -> "ow", after which "ow" is added to our vocabulary. This pair is then merged in all the words of the corpus. Following the completion of this step, our vocabulary and corpus will look like the following:
Vocabulary: ['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', 'ow']
Corpus: ("l" "ow", 15), ("l" "ow" "e" "s" "t", 10), ("n" "e" "w" "e" "r", 7), ("w" "i" "d" "e" "r", 12), ("w" "i" "d" "e" "s" "t", 10)
Now we have a pair in our corpus whose merge would produce a token longer than two characters: ("l", "ow"), with a frequency of 25. Again, this is the most frequent pair in the corpus, meaning that the second merge rule is ("l", "ow") -> "low", which yields the first three-letter token. Adding it to our vocabulary and merging across all words results in:
Vocabulary: ['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', 'ow', 'low']
Corpus: ("low", 15), ("low" "e" "s" "t", 10), ("n" "e" "w" "e" "r", 7), ("w" "i" "d" "e" "r", 12), ("w" "i" "d" "e" "s" "t", 10)
The most frequent pair is now ("w", "i"), appearing 22 times (tied with ("i", "d") and ("d", "e")), meaning that we can learn the next merge rule: ("w", "i") -> "wi". After merging, we have:
Vocabulary: ['d', 'e', 'i', 'l', 'n', 'o', 'r', 's', 't', 'w', 'ow', 'low', 'wi']
Corpus: ("low", 15), ("low" "e" "s" "t", 10), ("n" "e" "w" "e" "r", 7), ("wi" "d" "e" "r", 12), ("wi" "d" "e" "s" "t", 10)
We continue this process until we reach our desired vocabulary size!
BPE Implementation
We will now examine a code implementation of BPE. Note that this is an intuitive overview of the algorithm, not a highly optimized version that can run on large corpora.
First, we will start with our corpus defined below:
corpus = [
    "Byte Pair Encoding is a popular subword tokenization method.",
    "It splits words into smaller pieces based on frequency.",
    "This technique is used in many modern language models.",
    "Understanding BPE helps us see how text is represented numerically.",
]
We will now pre-tokenize our corpus into words using the gpt2 tokenizer:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
We then compute the frequencies of each word in the corpus:
from collections import defaultdict

word_freqs = defaultdict(int)

# Count how many times each pre-tokenized word appears in the corpus
for text in corpus:
    words_with_offsets = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    new_words = [word for word, offset in words_with_offsets]
    for word in new_words:
        word_freqs[word] += 1

print(word_freqs)
defaultdict(<class 'int'>, {'Byte': 1, 'ĠPair': 1, 'ĠEncoding': 1, 'Ġis': 3, 'Ġa': 1, 'Ġpopular': 1, 'Ġsubword': 1, 'Ġtokenization': 1, 'Ġmethod': 1, '.': 4, 'It': 1, 'Ġsplits': 1, 'Ġwords': 1, 'Ġinto': 1, 'Ġsmaller': 1, 'Ġpieces': 1, 'Ġbased': 1, 'Ġon': 1, 'Ġfrequency': 1, 'This': 1, 'Ġtechnique': 1, 'Ġused': 1, 'Ġin': 1, 'Ġmany': 1, 'Ġmodern': 1, 'Ġlanguage': 1, 'Ġmodels': 1, 'Understanding': 1, 'ĠBPE': 1, 'Ġhelps': 1, 'Ġus': 1, 'Ġsee': 1, 'Ġhow': 1, 'Ġtext': 1, 'Ġrepresented': 1, 'Ġnumerically': 1})
Note that the Ġ symbol above is how the GPT-2 pre-tokenizer marks a leading space. Next, we will compute the base vocabulary using the characters in the corpus:
base_vocab = []

for word in word_freqs:
    for letter in word:
        if letter not in base_vocab:
            base_vocab.append(letter)

base_vocab.sort()
print(base_vocab)
['.', 'B', 'E', 'I', 'P', 'T', 'U', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'w', 'x', 'y', 'z', 'Ġ']
We will now add the <|endoftext|> token, which GPT-2 uses as a special token, to the beginning of our vocabulary list:
vocab = ["<|endoftext|>"] + base_vocab.copy()
In order to start training, we will split each word into its individual characters:
letters = {word: [letter for letter in word] for word in word_freqs.keys()}
Now that we can train, we will write a function that computes the frequency of each pair. This function will be called at each step of training:
def compute_pair_freqs(splits):
    # Count each adjacent pair of tokens, weighted by how often the word occurs
    pair_freqs = defaultdict(int)
    for word, freq in word_freqs.items():
        split = splits[word]
        if len(split) == 1:
            continue
        for i in range(len(split) - 1):
            pair = (split[i], split[i + 1])
            pair_freqs[pair] += freq
    return pair_freqs
Let's take a look at this dictionary:
pair_freqs = compute_pair_freqs(letters)

# Print the first few pairs and their frequencies
for i, pair in enumerate(pair_freqs):
    print(f'{pair}: {pair_freqs[pair]}')
    if i >= 5:
        break
('B', 'y'): 1
('y', 't'): 1
('t', 'e'): 4
('Ġ', 'P'): 1
('P', 'a'): 1
('a', 'i'): 1
We will now find the most frequent pair:
freq_pair = None
max_freq = None

# Scan all pairs and keep the one with the highest frequency
for pair, freq in pair_freqs.items():
    if (max_freq is None) or (freq > max_freq):
        freq_pair = pair
        max_freq = freq
print(f'{freq_pair}, {max_freq}')
('Ġ', 'i'), 5
The first merge rule will be ('Ġ', 'i') -> 'Ġi', and we will add 'Ġi' to our vocabulary:
merges = {('Ġ', 'i'): 'Ġi'}
vocab.append('Ġi')
Continuing, we will write a function that applies a merge rule to our letters dictionary:
def merge_pair(a, b, letters):
    # Apply the merge rule (a, b) -> a + b to every word's token list
    for word, split in letters.items():
        if len(split) == 1:
            continue

        i = 0
        while i < len(split) - 1:
            if (split[i] == a) and (split[i + 1] == b):
                # Replace the pair with the merged token and stay at the same index
                split = split[:i] + [a + b] + split[i + 2:]
            else:
                i += 1
        letters[word] = split
    return letters
letters = merge_pair('Ġ', 'i', letters)
print(letters['Ġinto'])
['Ġi', 'n', 't', 'o']
Now that we have every component, we will loop until our vocabulary reaches the desired size of 50:
size = 50
while len(vocab) < size:
    pair_freqs = compute_pair_freqs(letters)

    # Find the most frequent pair at this step
    freq_pair = None
    max_freq = None
    for pair, freq in pair_freqs.items():
        if (max_freq is None) or (freq > max_freq):
            freq_pair = pair
            max_freq = freq

    # Apply the merge, record the rule, and grow the vocabulary
    letters = merge_pair(*freq_pair, letters)
    merges[freq_pair] = freq_pair[0] + freq_pair[1]
    vocab.append(merges[freq_pair])
As a result, we have learned 17 merge rules since our original vocabulary length was 33 including the special token:
print(merges)
{('Ġ', 'i'): 'Ġi', ('t', 'e'): 'te', ('o', 'd'): 'od', ('Ġ', 's'): 'Ġs', ('Ġ', 'm'): 'Ġm', ('e', 'r'): 'er', ('n', 'g'): 'ng', ('Ġi', 's'): 'Ġis', ('e', 'n'): 'en', ('r', 'e'): 're', ('i', 'ng'): 'ing', ('Ġ', 'p'): 'Ġp', ('l', 'a'): 'la', ('w', 'o'): 'wo', ('wo', 'r'): 'wor', ('wor', 'd'): 'word', ('t', 'o'): 'to'}
The vocabulary will now be composed of the special token, base alphabet, and the results of the merges:
print(vocab)
['<|endoftext|>', '.', 'B', 'E', 'I', 'P', 'T', 'U', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'w', 'x', 'y', 'z', 'Ġ', 'Ġi', 'te', 'od', 'Ġs', 'Ġm', 'er', 'ng', 'Ġis', 'en', 're', 'ing', 'Ġp', 'la', 'wo', 'wor', 'word', 'to']
Lastly, we can turn this into a function and run it on another piece of text to tokenize it.
def tokenize(text):
    # Pre-tokenize the text into words, then split each word into characters
    pre_tokenize_result = tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str(text)
    pre_tokenized_text = [word for word, offset in pre_tokenize_result]
    splits = [[l for l in word] for word in pre_tokenized_text]

    # Apply the learned merge rules in the order they were learned
    for pair, merge in merges.items():
        for idx, split in enumerate(splits):
            i = 0
            while i < len(split) - 1:
                if split[i] == pair[0] and split[i + 1] == pair[1]:
                    split = split[:i] + [merge] + split[i + 2:]
                else:
                    i += 1
            splits[idx] = split

    # Flatten the per-word token lists into a single list of tokens
    return sum(splits, [])
tokenize('Tokenize this piece of text.')
['T', 'o', 'k', 'en', 'i', 'z', 'e', 'Ġ', 't', 'h', 'i', 's', 'Ġp', 'i', 'e', 'c', 'e', 'Ġ', 'o', 'f', 'Ġ', 'te', 'x', 't', '.']
Note that this implementation has no special handling for characters that never appeared in the corpus: such characters end up as tokens outside our learned vocabulary, and a real tokenizer would map them to an unknown token. Large language models such as GPT-2 rely on byte-level BPE, meaning that there is never an unknown character. A full byte-level implementation is beyond the scope of this article, so it's been left out!
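As a minimal illustration (the "<unk>" token and the tokenize_with_unk helper are hypothetical, not part of GPT-2), one way to guard against unseen characters is to map any token outside the learned vocabulary to an unknown token:

# Hypothetical guard: replace any token not in our learned vocabulary with "<unk>"
def tokenize_with_unk(text):
    return [token if token in vocab else "<unk>" for token in tokenize(text)]

print(tokenize_with_unk('Tokenize this, please!'))  # ',' and '!' become '<unk>'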
And that is the BPE algorithm!