Byte Pair Encoding
What does an LLM do?
Given a sequence of tokens, an LLM predicts the next token in the sequence. The LLM is trained on a large corpus of text and learns the probability distribution of the next token given the previous tokens. Text is generated from this distribution either by sampling from it or, in greedy decoding, by choosing the token with the highest probability as the next token.
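As a toy illustration (the vocabulary and probabilities below are made up, not from a real model), greedy decoding and sampling can be contrasted in a few lines of Python:
import random

# Hypothetical next-token distribution for the context "The Dark"
next_token_probs = {"Knight": 0.72, "side": 0.15, "matter": 0.08, "ages": 0.05}

# Greedy decoding: always take the single most probable token
greedy = max(next_token_probs, key=next_token_probs.get)

# Sampling: draw a token in proportion to its probability
sampled = random.choices(list(next_token_probs), weights=list(next_token_probs.values()), k=1)[0]

print(greedy, sampled)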
Byte Pair Tokenization
History
The 2016 paper by Sennrich et al. introduced Byte Pair Encoding (BPE), originally a data compression algorithm, for tokenization. BPE repeatedly replaces the most frequent pair of bytes in a data stream with a single, unused byte. This process is repeated until a predefined number of iterations has been performed or a predefined vocabulary size is reached. The resulting vocabulary consists of the most frequent tokens in the corpus, which can then be used for tokenization.
Algorithm
Tokenization algorithms like BPE (Byte Pair Encoding) identify the sub-words that are most prevalent in the corpus. The most common units of text in the training corpus are identified and treated as separate tokens, and adjacent tokens are merged into new, bigger tokens according to how frequently they occur together.
BPE is a method to build a vocabulary by generating tokens from the corpus. It is used in GPT-2, GPT-4, and Llama 2 (BERT uses the closely related WordPiece algorithm).
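As a quick illustration (the classic toy example from the compression literature, not from this document's corpus), successive merges on the string "aaabdaaabac" can be sketched with plain string replacement; this is a simplification of real BPE, which operates on token sequences rather than strings:
s = "aaabdaaabac"
s = s.replace("aa", "Z")  # most frequent pair "aa" -> new symbol Z: "ZabdZabac"
s = s.replace("ab", "Y")  # next most frequent pair "ab" -> Y: "ZYdZYac"
s = s.replace("ZY", "X")  # pair "ZY" -> X
print(s)  # XdXac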
Finer points
- The word-boundary (white-space) symbol is honored because "esteem" and "dearest" are different words: "est" inside "esteem" is not the same unit as "est" at the end of "dearest". So a token like "est</w>" (with the end-of-word marker) is important.
- The vocabulary contains characters and sub-words, honoring word boundaries: start with characters and keep augmenting the vocabulary with sub-words and whole words.
- What should k be, i.e. when should merging stop? Two common criteria:
  - a fixed number of iterations
  - a vocabulary-size limit
- Select the data set carefully: the more (and more representative) the data, the better the learned tokenization.
- This kind of tokenization is well suited to natural language. It is less well suited to mathematics, where, for example, multi-digit numbers get split into arbitrary chunks.
- Word-based tokenization:
  - OOV (out-of-vocabulary) words, i.e. unknown words, are a challenge to manage.
  - "boy", "boys", etc. get entirely different tokens.
- Character-based tokenization:
  - The vocabulary size is small: on the order of 256 characters (byte values) versus roughly 170K English words. The small vocabulary is the advantage.
  - There are no OOV words.
  - The meaning of words is completely lost.
  - The tokenized sequence is much longer: with word-based tokenization one word is one token, but here there are several tokens per word.
- Sub-word-based tokenization:
  - BPE is a sub-word-based tokenization.
  - It is a compromise between word-based and character-based tokenization.
  - It captures the meaning of words while keeping the vocabulary size small.
  - There are no OOV words.
  - The tokenized sequence is shorter than with character-based tokenization, but longer than with word-based tokenization.
  - It captures the root words and their meaning.
  - Rule:
    - Do not split frequently used words into small sub-words.
    - Split rare words into smaller, meaningful sub-words.
  - Sub-word tokenization helps the model learn that different words sharing a root, such as "token", "tokens", and "tokenization", are similar in meaning (see the tiktoken sketch after this list).
  - It is also helpful syntactically: suffixes like "ization" become separate tokens, and their connection to other tokens can be learned.
- The most common pair of consecutive bytes in the data is replaced with a byte that does not occur in the data.
- BPE ensures that the most common words in the corpus are represented as a single token, while rare words are broken down into two or more sub-word tokens.
- BPE keeps creating new tokens, and the most frequent of them end up corresponding to common root words.
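To see a trained BPE vocabulary in action, the sketch below (an addition using the tiktoken library mentioned in the references; pip install tiktoken) encodes a few related words with GPT-2's vocabulary. Common words tend to stay whole, while longer or rarer forms split into sub-word tokens:
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary

for word in ["token", "tokens", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # the sub-word string behind each id
    print(word, "->", pieces)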
References
- Jupyter Notebook with code
- Byte Pair Encoding & Tiktokenizer library code
- Sebastian Raschka's BPE code
- GPT-2 Merges
- GPT-2 Vocabulary
- BPE paper - A new algorithm of data compression optimization
Implementation of BPE
1. Corpus
text = """The Dark Knight Rises is a superhero movie released in 2012. It is the final part of Christopher Nolan Dark Knight trilogy, following Batman Begins and The Dark Knight. The film stars Christian Bale as Bruce Wayne Batman, who has been retired as Batman for eight years after the events of the previous movie.
"""
2. Character tokenization
Convert the entire text corpus into ids for each character.
ids = [ord(ch) for ch in text]
print(ids)
Output >>
[84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 82, 105, 115, 101, 115, 32, 105, 115, 32, 97, 32, 115, 117, 112, 101, 114, 104, 101, 114, 111, 32, 109, 111, 118, 105, 101, 32, 114, 101, 108, 101, 97, 115, 101, 100, 32, 105, 110, 32, 50, 48, 49, 50, 46, 32, 73, 116, 32, 105, 115, 32, 116, 104, 101, 32, 102, 105, 110, 97, 108, 32, 112, 97, 114, 116, 32, 111, 102, 32, 67, 104, 114, 105, 115, 116, 111, 112, 104, 101, 114, 32, 78, 111, 108, 97, 110, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 66, 97, 116, 109, 97, 110, 32, 66, 101, 103, 105, 110, 115, 32, 97, 110, 100, 32, 84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 46, 32, 84, 104, 101, 32, 102, 105, 108, 109, 32, 115, 116, 97, 114, 115, 32, 67, 104, 114, 105, 115, 116, 105, 97, 110, 32, 66, 97, 108, 101, 32, 97, 115, 32, 66, 114, 117, 99, 101, 32, 87, 97, 121, 110, 101, 32, 66, 97, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 115, 32, 98, 101, 101, 110, 32, 114, 101, 116, 105, 114, 101, 100, 32, 97, 115, 32, 66, 97, 116, 109, 97, 110, 32, 102, 111, 114, 32, 101, 105, 103, 104, 116, 32, 121, 101, 97, 114, 115, 32, 97, 102, 116, 101, 114, 32, 116, 104, 101, 32, 101, 118, 101, 110, 116, 115, 32, 111, 102, 32, 116, 104, 101, 32, 112, 114, 101, 118, 105, 111, 117, 115, 32, 109, 111, 118, 105, 101, 46, 10]
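As an aside (an addition to the original walkthrough): GPT-2-style byte-level BPE operates on UTF-8 bytes rather than ord() code points. For pure-ASCII text like this corpus the ids are identical, but bytes also handle arbitrary Unicode text:
# Byte-level alternative: ids are the UTF-8 bytes of the text (values 0-255).
byte_ids = list(text.encode("utf-8"))
print(byte_ids[:20])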
3. Count all adjacent pairs
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
stats = get_stats(ids)
print(stats)
Output >>
{(84, 104): 3, (104, 101): 8, (101, 32): 10, (32, 68): 3, (68, 97): 3, (97, 114): 6, (114, 107): 3, (107, 32): 3, (32, 75): 3, (75, 110): 3, (110, 105): 3, (105, 103): 4, (103, 104): 4, (104, 116): 4, (116, 32): 5, (32, 82): 1, (82, 105): 1, (105, 115): 5, (115, 101): 2, (101, 115): 1, (115, 32): 11, (32, 105): 3, (32, 97): 5, (97, 32): 1, (32, 115): 2, (115, 117): 1, (117, 112): 1, (112, 101): 1, (101, 114): 4, (114, 104): 1, (114, 111): 1, (111, 32): 2, (32, 109): 2, (109, 111): 2, (111, 118): 2, (118, 105): 3, (105, 101): 2, (32, 114): 2, (114, 101): 4, (101, 108): 1, (108, 101): 2, (101, 97): 2, (97, 115): 4, (101, 100): 2, (100, 32): 3, (105, 110): 4, (110, 32): 6, (32, 50): 1, (50, 48): 1, (48, 49): 1, (49, 50): 1, (50, 46): 1, (46, 32): 2, (32, 73): 1, (73, 116): 1, (32, 116): 4, (116, 104): 3, (32, 102): 4, (102, 105): 2, (110, 97): 1, (97, 108): 2, (108, 32): 1, (32, 112): 2, (112, 97): 1, (114, 116): 1, (32, 111): 2, (111, 102): 2, (102, 32): 2, (32, 67): 2, (67, 104): 2, (104, 114): 2, (114, 105): 3, (115, 116): 3, (116, 111): 1, (111, 112): 1, (112, 104): 1, (114, 32): 3, (32, 78): 1, (78, 111): 1, (111, 108): 2, (108, 97): 1, (97, 110): 6, (116, 114): 1, (105, 108): 2, (108, 111): 2, (111, 103): 1, (103, 121): 1, (121, 44): 1, (44, 32): 2, (102, 111): 2, (108, 108): 1, (111, 119): 1, (119, 105): 1, (110, 103): 1, (103, 32): 1, (32, 66): 6, (66, 97): 4, (97, 116): 3, (116, 109): 3, (109, 97): 3, (66, 101): 1, (101, 103): 1, (103, 105): 1, (110, 115): 1, (110, 100): 1, (32, 84): 2, (116, 46): 1, (108, 109): 1, (109, 32): 1, (116, 97): 1, (114, 115): 2, (116, 105): 2, (105, 97): 1, (66, 114): 1, (114, 117): 1, (117, 99): 1, (99, 101): 1, (32, 87): 1, (87, 97): 1, (97, 121): 1, (121, 110): 1, (110, 101): 1, (110, 44): 1, (32, 119): 1, (119, 104): 1, (104, 111): 1, (32, 104): 1, (104, 97): 1, (32, 98): 1, (98, 101): 1, (101, 101): 1, (101, 110): 2, (101, 116): 1, (105, 114): 1, (111, 114): 1, (32, 101): 2, (101, 105): 1, (32, 121): 1, (121, 101): 1, (97, 102): 1, (102, 116): 1, (116, 101): 1, (101, 118): 2, (118, 101): 1, (110, 116): 1, (116, 115): 1, (112, 114): 1, (105, 111): 1, (111, 117): 1, (117, 115): 1, (101, 46): 1, (46, 10): 1}
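The raw dictionary is hard to scan, so a small readability aid (not in the original code) sorts the pairs by count:
# Show the five most frequent adjacent pairs, most common first.
top5 = sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[:5]
for (a, b), count in top5:
    print((a, b), repr(chr(a) + chr(b)), count)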
4. Select the pair with highest frequency
pair = max(stats, key=stats.get)
print(pair)
# for readability, char_pair holds the characters corresponding to the most frequent pair
char_pair = (chr(pair[0]), chr(pair[1]))
print(char_pair)
Output >>
(115, 32)
('s', ' ')
5. Assign a new token id to the pair
# Assuming only ASCII characters are in the text, the maximum id from ord(ch) is 127,
# so the next available token id is 128.
i = 1
idx = 127 + i
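A quick sanity check (an addition, relying on the assumption that the corpus is pure ASCII) confirms that 128 is not already in use:
# every existing id fits in 7-bit ASCII, so 128 is free to use as a new token id
assert max(ids) <= 127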
6. Replace all occurrences of the most frequent pair with the new token id
def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
ids = merge(ids, pair, idx)
print(ids)
Output >>
# The 21st and 22nd ids (115, 32) are now replaced with the new token 128, followed by 105.
# Similarly, all other occurrences of (115, 32) are replaced with 128.
[84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 82, 105, 115, 101, 128, 105, 128, 97, 32, 115, 117, 112, 101, 114, 104, 101, 114, 111, 32, 109, 111, 118, 105, 101, 32, 114, 101, 108, 101, 97, 115, 101, 100, 32, 105, 110, 32, 50, 48, 49, 50, 46, 32, 73, 116, 32, 105, 128, 116, 104, 101, 32, 102, 105, 110, 97, 108, 32, 112, 97, 114, 116, 32, 111, 102, 32, 67, 104, 114, 105, 115, 116, 111, 112, 104, 101, 114, 32, 78, 111, 108, 97, 110, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 66, 97, 116, 109, 97, 110, 32, 66, 101, 103, 105, 110, 128, 97, 110, 100, 32, 84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 46, 32, 84, 104, 101, 32, 102, 105, 108, 109, 32, 115, 116, 97, 114, 128, 67, 104, 114, 105, 115, 116, 105, 97, 110, 32, 66, 97, 108, 101, 32, 97, 128, 66, 114, 117, 99, 101, 32, 87, 97, 121, 110, 101, 32, 66, 97, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 128, 98, 101, 101, 110, 32, 114, 101, 116, 105, 114, 101, 100, 32, 97, 128, 66, 97, 116, 109, 97, 110, 32, 102, 111, 114, 32, 101, 105, 103, 104, 116, 32, 121, 101, 97, 114, 128, 97, 102, 116, 101, 114, 32, 116, 104, 101, 32, 101, 118, 101, 110, 116, 128, 111, 102, 32, 116, 104, 101, 32, 112, 114, 101, 118, 105, 111, 117, 128, 109, 111, 118, 105, 101, 46, 10]
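Because each merged occurrence replaces two ids with one, the sequence should now be shorter by exactly the 11 occurrences of (115, 32) counted in step 3 (the lengths 309 and 11 come from this walkthrough's own outputs):
print(len(ids))  # 298, i.e. the original 309 ids minus 11 merged occurrences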
7. Repeat the process
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

vocab_size = 148  # the desired final vocabulary size
num_merges = vocab_size - 128

tokens = [ord(ch) for ch in text]  # the original per-character ids from step 2
ids = list(tokens)  # copy so we don't destroy the original list

merges = {}  # (int, int) -> int
for i in range(num_merges):
    # 1) Count all adjacent pairs in our current sequence 'ids'.
    stats = get_stats(ids)
    # 2) Pick the most frequent pair and assign it the next free token id.
    pair = max(stats, key=stats.get)
    idx = 128 + i
    # Decode the characters of the pair for display. For ids >= 128 (earlier merges),
    # chr() has no meaningful glyph, which is why some printed pairs below look garbled.
    char_pair = (chr(pair[0]), chr(pair[1]))
    print(f"merging {pair} ({char_pair[0]}{char_pair[1]}) into a new token {idx}")
    # 3) Replace all occurrences of the pair with the new id and record the merge.
    ids = merge(ids, pair, idx)
    merges[pair] = idx
Output >>
merging (115, 32) (s ) into a new token 128
merging (101, 32) (e ) into a new token 129
merging (104, 129) (h) into a new token 130
merging (97, 114) (ar) into a new token 131
merging (110, 32) (n ) into a new token 132
merging (116, 32) (t ) into a new token 133
merging (105, 103) (ig) into a new token 134
merging (134, 104) (h) into a new token 135
merging (101, 114) (er) into a new token 136
merging (114, 101) (re) into a new token 137
merging (97, 132) (a) into a new token 138
merging (66, 97) (Ba) into a new token 139
merging (84, 130) (T) into a new token 140
merging (68, 131) (D) into a new token 141
merging (141, 107) (k) into a new token 142
merging (142, 32) ( ) into a new token 143
merging (143, 75) (K) into a new token 144
merging (144, 110) (n) into a new token 145
merging (145, 135) () into a new token 146
merging (105, 115) (is) into a new token 147
8. Compression ratio
print(tokens)
print(ids)
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
Output >>
[84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 82, 105, 115, 101, 115, 32, 105, 115, 32, 97, 32, 115, 117, 112, 101, 114, 104, 101, 114, 111, 32, 109, 111, 118, 105, 101, 32, 114, 101, 108, 101, 97, 115, 101, 100, 32, 105, 110, 32, 50, 48, 49, 50, 46, 32, 73, 116, 32, 105, 115, 32, 116, 104, 101, 32, 102, 105, 110, 97, 108, 32, 112, 97, 114, 116, 32, 111, 102, 32, 67, 104, 114, 105, 115, 116, 111, 112, 104, 101, 114, 32, 78, 111, 108, 97, 110, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 66, 97, 116, 109, 97, 110, 32, 66, 101, 103, 105, 110, 115, 32, 97, 110, 100, 32, 84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 46, 32, 84, 104, 101, 32, 102, 105, 108, 109, 32, 115, 116, 97, 114, 115, 32, 67, 104, 114, 105, 115, 116, 105, 97, 110, 32, 66, 97, 108, 101, 32, 97, 115, 32, 66, 114, 117, 99, 101, 32, 87, 97, 121, 110, 101, 32, 66, 97, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 115, 32, 98, 101, 101, 110, 32, 114, 101, 116, 105, 114, 101, 100, 32, 97, 115, 32, 66, 97, 116, 109, 97, 110, 32, 102, 111, 114, 32, 101, 105, 103, 104, 116, 32, 121, 101, 97, 114, 115, 32, 97, 102, 116, 101, 114, 32, 116, 104, 101, 32, 101, 118, 101, 110, 116, 115, 32, 111, 102, 32, 116, 104, 101, 32, 112, 114, 101, 118, 105, 111, 117, 115, 32, 109, 111, 118, 105, 101, 46, 10]
[140, 146, 133, 82, 147, 101, 128, 105, 128, 97, 32, 115, 117, 112, 136, 104, 136, 111, 32, 109, 111, 118, 105, 129, 137, 108, 101, 97, 115, 101, 100, 32, 105, 132, 50, 48, 49, 50, 46, 32, 73, 133, 105, 128, 116, 130, 102, 105, 110, 97, 108, 32, 112, 131, 133, 111, 102, 32, 67, 104, 114, 147, 116, 111, 112, 104, 136, 32, 78, 111, 108, 138, 146, 133, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 139, 116, 109, 138, 66, 101, 103, 105, 110, 128, 97, 110, 100, 32, 140, 146, 116, 46, 32, 140, 102, 105, 108, 109, 32, 115, 116, 131, 128, 67, 104, 114, 147, 116, 105, 138, 139, 108, 129, 97, 128, 66, 114, 117, 99, 129, 87, 97, 121, 110, 129, 139, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 128, 98, 101, 101, 132, 137, 116, 105, 137, 100, 32, 97, 128, 139, 116, 109, 138, 102, 111, 114, 32, 101, 135, 133, 121, 101, 131, 128, 97, 102, 116, 136, 32, 116, 130, 101, 118, 101, 110, 116, 128, 111, 102, 32, 116, 130, 112, 137, 118, 105, 111, 117, 128, 109, 111, 118, 105, 101, 46, 10]
tokens length: 309
ids length: 217
compression ratio: 1.42X
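As a follow-up (not part of the original notebook), a minimal decoder sketch shows that the learned merges are lossless: each merged token's string can be rebuilt from the merges dictionary, and the merged sequence round-trips back to the original text.
# Rebuild each token's string from the learned merges (base ids 0-127 are plain characters).
vocab = {i: chr(i) for i in range(128)}
for (p0, p1), new_id in merges.items():  # dict preserves insertion order, i.e. merge order
    vocab[new_id] = vocab[p0] + vocab[p1]

def decode(ids):
    return "".join(vocab[i] for i in ids)

print(decode(ids) == text)  # expected: True, the merges are lossless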