Byte Pair Encoding
What does an LLM do?
Given a sequence of tokens, an LLM predicts the next token in the sequence. The LLM is trained on a large corpus of text and learns the probability distribution of the next token given the previous tokens. Text is generated from this distribution either by sampling from it or, in greedy decoding, by choosing the token with the highest probability as the next token.
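As a toy illustration (the vocabulary and probabilities below are made up, not from a real model), greedy decoding and sampling can be contrasted in a few lines of Python:
import random

# Hypothetical next-token distribution for the context "The Dark"
next_token_probs = {"Knight": 0.72, "side": 0.15, "matter": 0.08, "ages": 0.05}

# Greedy decoding: always take the single most probable token
greedy = max(next_token_probs, key=next_token_probs.get)

# Sampling: draw a token in proportion to its probability
sampled = random.choices(list(next_token_probs), weights=list(next_token_probs.values()), k=1)[0]

print(greedy, sampled)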
Byte Pair Tokenization
History
The 2016 paper by Sennrich et al. introduced Byte Pair Encoding (BPE), originally a data compression algorithm, for tokenization. BPE repeatedly replaces the most frequent pair of bytes in a data stream with a single, unused byte. This process is repeated until a predefined number of iterations has been performed or a predefined vocabulary size is reached. The resulting vocabulary consists of the most frequent tokens in the corpus, which can then be used for tokenization.
Algorithm
Tokenization algorithms like BPE (Byte Pair Encoding) identify the sub-words that are most prevalent in the corpus. The most common units of text in the training corpus are identified and treated as separate tokens, and adjacent tokens are merged into new, bigger tokens according to how frequently they occur together.
BPE is a method to build a vocabulary by generating tokens from the corpus. It is used in GPT-2, GPT-4, and Llama 2 (BERT uses the closely related WordPiece algorithm).
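As a quick illustration (the classic toy example from the compression literature, not from this document's corpus), successive merges on the string "aaabdaaabac" can be sketched with plain string replacement; this is a simplification of real BPE, which operates on token sequences rather than strings:
s = "aaabdaaabac"
s = s.replace("aa", "Z")  # most frequent pair "aa" -> new symbol Z: "ZabdZabac"
s = s.replace("ab", "Y")  # next most frequent pair "ab" -> Y: "ZYdZYac"
s = s.replace("ZY", "X")  # pair "ZY" -> X
print(s)  # XdXac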
Finer points
- The word-boundary (white-space) symbol is honored because "esteem" and "dearest" are different words: "est" inside "esteem" is not the same unit as "est" at the end of "dearest". So a token like "est</w>" (with the end-of-word marker) is important.
- The vocabulary contains characters and sub-words, honoring word boundaries: start with characters and keep augmenting the vocabulary with sub-words and whole words.
- What should k be, i.e. when should merging stop? Two common criteria:
  - a fixed number of iterations
  - a vocabulary-size limit
- Select the data set carefully: the more (and more representative) the data, the better the learned tokenization.
- This kind of tokenization is well suited to natural language. It is less well suited to mathematics, where, for example, multi-digit numbers get split into arbitrary chunks.
- Word-based tokenization:
  - OOV (out-of-vocabulary) words, i.e. unknown words, are a challenge to manage.
  - "boy", "boys", etc. get entirely different tokens.
- Character-based tokenization:
  - The vocabulary size is small: on the order of 256 characters (byte values) versus roughly 170K English words. The small vocabulary is the advantage.
  - There are no OOV words.
  - The meaning of words is completely lost.
  - The tokenized sequence is much longer: with word-based tokenization one word is one token, but here there are several tokens per word.
- Sub-word-based tokenization:
  - BPE is a sub-word-based tokenization.
  - It is a compromise between word-based and character-based tokenization.
  - It captures the meaning of words while keeping the vocabulary size small.
  - There are no OOV words.
  - The tokenized sequence is shorter than with character-based tokenization, but longer than with word-based tokenization.
  - It captures the root words and their meaning.
  - Rule:
    - Do not split frequently used words into small sub-words.
    - Split rare words into smaller, meaningful sub-words.
  - Sub-word tokenization helps the model learn that different words sharing a root, such as "token", "tokens", and "tokenization", are similar in meaning (see the tiktoken sketch after this list).
  - It is also helpful syntactically: suffixes like "ization" become separate tokens, and their connection to other tokens can be learned.
- The most common pair of consecutive bytes in the data is replaced with a byte that does not occur in the data.
- BPE ensures that the most common words in the corpus are represented as a single token, while rare words are broken down into two or more sub-word tokens.
- BPE keeps creating new tokens, and the most frequent of them end up corresponding to common root words.
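To see a trained BPE vocabulary in action, the sketch below (an addition using the tiktoken library mentioned in the references; pip install tiktoken) encodes a few related words with GPT-2's vocabulary. Common words tend to stay whole, while longer or rarer forms split into sub-word tokens:
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-2's BPE vocabulary

for word in ["token", "tokens", "tokenization"]:
    ids = enc.encode(word)
    pieces = [enc.decode([i]) for i in ids]  # the sub-word string behind each id
    print(word, "->", pieces)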
References
- Jupyter Notebook with code
- Byte Pair Encoding & Tiktokenizer library code
- Sebastian Raschka's BPE code
- GPT-2 Merges
- GPT-2 Vocabulary
- BPE paper - A new algorithm of data compression optimization
Implementation of BPE
1. Corpus
text = """The Dark Knight Rises is a superhero movie released in 2012. It is the final part of Christopher Nolan Dark Knight trilogy, following Batman Begins and The Dark Knight. The film stars Christian Bale as Bruce Wayne Batman, who has been retired as Batman for eight years after the events of the previous movie.
"""
2. Character tokenization
Convert the entire text corpus into ids for each character.
ids = [ord(ch) for ch in text]
print(ids)
Output >>
[84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 82, 105, 115, 101, 115, 32, 105, 115, 32, 97, 32, 115, 117, 112, 101, 114, 104, 101, 114, 111, 32, 109, 111, 118, 105, 101, 32, 114, 101, 108, 101, 97, 115, 101, 100, 32, 105, 110, 32, 50, 48, 49, 50, 46, 32, 73, 116, 32, 105, 115, 32, 116, 104, 101, 32, 102, 105, 110, 97, 108, 32, 112, 97, 114, 116, 32, 111, 102, 32, 67, 104, 114, 105, 115, 116, 111, 112, 104, 101, 114, 32, 78, 111, 108, 97, 110, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 66, 97, 116, 109, 97, 110, 32, 66, 101, 103, 105, 110, 115, 32, 97, 110, 100, 32, 84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 46, 32, 84, 104, 101, 32, 102, 105, 108, 109, 32, 115, 116, 97, 114, 115, 32, 67, 104, 114, 105, 115, 116, 105, 97, 110, 32, 66, 97, 108, 101, 32, 97, 115, 32, 66, 114, 117, 99, 101, 32, 87, 97, 121, 110, 101, 32, 66, 97, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 115, 32, 98, 101, 101, 110, 32, 114, 101, 116, 105, 114, 101, 100, 32, 97, 115, 32, 66, 97, 116, 109, 97, 110, 32, 102, 111, 114, 32, 101, 105, 103, 104, 116, 32, 121, 101, 97, 114, 115, 32, 97, 102, 116, 101, 114, 32, 116, 104, 101, 32, 101, 118, 101, 110, 116, 115, 32, 111, 102, 32, 116, 104, 101, 32, 112, 114, 101, 118, 105, 111, 117, 115, 32, 109, 111, 118, 105, 101, 46, 10]
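As an aside (an addition to the original walkthrough): GPT-2-style byte-level BPE operates on UTF-8 bytes rather than ord() code points. For pure-ASCII text like this corpus the ids are identical, but bytes also handle arbitrary Unicode text:
# Byte-level alternative: ids are the UTF-8 bytes of the text (values 0-255).
byte_ids = list(text.encode("utf-8"))
print(byte_ids[:20])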
3. Count all adjacent pairs
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts
stats = get_stats(ids)
print(stats)
Output >>
{(84, 104): 3, (104, 101): 8, (101, 32): 10, (32, 68): 3, (68, 97): 3, (97, 114): 6, (114, 107): 3, (107, 32): 3, (32, 75): 3, (75, 110): 3, (110, 105): 3, (105, 103): 4, (103, 104): 4, (104, 116): 4, (116, 32): 5, (32, 82): 1, (82, 105): 1, (105, 115): 5, (115, 101): 2, (101, 115): 1, (115, 32): 11, (32, 105): 3, (32, 97): 5, (97, 32): 1, (32, 115): 2, (115, 117): 1, (117, 112): 1, (112, 101): 1, (101, 114): 4, (114, 104): 1, (114, 111): 1, (111, 32): 2, (32, 109): 2, (109, 111): 2, (111, 118): 2, (118, 105): 3, (105, 101): 2, (32, 114): 2, (114, 101): 4, (101, 108): 1, (108, 101): 2, (101, 97): 2, (97, 115): 4, (101, 100): 2, (100, 32): 3, (105, 110): 4, (110, 32): 6, (32, 50): 1, (50, 48): 1, (48, 49): 1, (49, 50): 1, (50, 46): 1, (46, 32): 2, (32, 73): 1, (73, 116): 1, (32, 116): 4, (116, 104): 3, (32, 102): 4, (102, 105): 2, (110, 97): 1, (97, 108): 2, (108, 32): 1, (32, 112): 2, (112, 97): 1, (114, 116): 1, (32, 111): 2, (111, 102): 2, (102, 32): 2, (32, 67): 2, (67, 104): 2, (104, 114): 2, (114, 105): 3, (115, 116): 3, (116, 111): 1, (111, 112): 1, (112, 104): 1, (114, 32): 3, (32, 78): 1, (78, 111): 1, (111, 108): 2, (108, 97): 1, (97, 110): 6, (116, 114): 1, (105, 108): 2, (108, 111): 2, (111, 103): 1, (103, 121): 1, (121, 44): 1, (44, 32): 2, (102, 111): 2, (108, 108): 1, (111, 119): 1, (119, 105): 1, (110, 103): 1, (103, 32): 1, (32, 66): 6, (66, 97): 4, (97, 116): 3, (116, 109): 3, (109, 97): 3, (66, 101): 1, (101, 103): 1, (103, 105): 1, (110, 115): 1, (110, 100): 1, (32, 84): 2, (116, 46): 1, (108, 109): 1, (109, 32): 1, (116, 97): 1, (114, 115): 2, (116, 105): 2, (105, 97): 1, (66, 114): 1, (114, 117): 1, (117, 99): 1, (99, 101): 1, (32, 87): 1, (87, 97): 1, (97, 121): 1, (121, 110): 1, (110, 101): 1, (110, 44): 1, (32, 119): 1, (119, 104): 1, (104, 111): 1, (32, 104): 1, (104, 97): 1, (32, 98): 1, (98, 101): 1, (101, 101): 1, (101, 110): 2, (101, 116): 1, (105, 114): 1, (111, 114): 1, (32, 101): 2, (101, 105): 1, (32, 121): 1, (121, 101): 1, (97, 102): 1, (102, 116): 1, (116, 101): 1, (101, 118): 2, (118, 101): 1, (110, 116): 1, (116, 115): 1, (112, 114): 1, (105, 111): 1, (111, 117): 1, (117, 115): 1, (101, 46): 1, (46, 10): 1}
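The raw dictionary is hard to scan, so a small readability aid (not in the original code) sorts the pairs by count:
# Show the five most frequent adjacent pairs, most common first.
top5 = sorted(stats.items(), key=lambda kv: kv[1], reverse=True)[:5]
for (a, b), count in top5:
    print((a, b), repr(chr(a) + chr(b)), count)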
4. Select the pair with highest frequency
pair = max(stats, key=stats.get)
print(pair)
# for readability, char_pair holds the characters corresponding to the most frequent pair
char_pair = (chr(pair[0]), chr(pair[1]))
print(char_pair)
Output >>
(115, 32)
('s', ' ')
5. Assign a new token id to the pair
# Assuming only ASCII characters are in the text, the maximum id from ord(ch) is 127,
# so the next available token id is 128.
i = 1
idx = 127 + i
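A quick sanity check (an addition, relying on the assumption that the corpus is pure ASCII) confirms that 128 is not already in use:
# every existing id fits in 7-bit ASCII, so 128 is free to use as a new token id
assert max(ids) <= 127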
6. Replace all occurrences of the most frequent pair with the new token id
def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids
ids = merge(ids, pair, idx)
print(ids)
Output >>
# The 21st and 22nd ids (115, 32) are now replaced with the new token 128, followed by 105.
# Similarly, all other occurrences of (115, 32) are replaced with 128.
[84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 82, 105, 115, 101, 128, 105, 128, 97, 32, 115, 117, 112, 101, 114, 104, 101, 114, 111, 32, 109, 111, 118, 105, 101, 32, 114, 101, 108, 101, 97, 115, 101, 100, 32, 105, 110, 32, 50, 48, 49, 50, 46, 32, 73, 116, 32, 105, 128, 116, 104, 101, 32, 102, 105, 110, 97, 108, 32, 112, 97, 114, 116, 32, 111, 102, 32, 67, 104, 114, 105, 115, 116, 111, 112, 104, 101, 114, 32, 78, 111, 108, 97, 110, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 66, 97, 116, 109, 97, 110, 32, 66, 101, 103, 105, 110, 128, 97, 110, 100, 32, 84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 46, 32, 84, 104, 101, 32, 102, 105, 108, 109, 32, 115, 116, 97, 114, 128, 67, 104, 114, 105, 115, 116, 105, 97, 110, 32, 66, 97, 108, 101, 32, 97, 128, 66, 114, 117, 99, 101, 32, 87, 97, 121, 110, 101, 32, 66, 97, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 128, 98, 101, 101, 110, 32, 114, 101, 116, 105, 114, 101, 100, 32, 97, 128, 66, 97, 116, 109, 97, 110, 32, 102, 111, 114, 32, 101, 105, 103, 104, 116, 32, 121, 101, 97, 114, 128, 97, 102, 116, 101, 114, 32, 116, 104, 101, 32, 101, 118, 101, 110, 116, 128, 111, 102, 32, 116, 104, 101, 32, 112, 114, 101, 118, 105, 111, 117, 128, 109, 111, 118, 105, 101, 46, 10]
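Because each merged occurrence replaces two ids with one, the sequence should now be shorter by exactly the 11 occurrences of (115, 32) counted in step 3 (the lengths 309 and 11 come from this walkthrough's own outputs):
print(len(ids))  # 298, i.e. the original 309 ids minus 11 merged occurrences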
7. Repeat the process
def get_stats(ids):
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, idx):
    newids = []
    i = 0
    while i < len(ids):
        if i < len(ids) - 1 and ids[i] == pair[0] and ids[i + 1] == pair[1]:
            newids.append(idx)
            i += 2
        else:
            newids.append(ids[i])
            i += 1
    return newids

vocab_size = 148  # the desired final vocabulary size
num_merges = vocab_size - 128

tokens = [ord(ch) for ch in text]  # the original per-character ids from step 2
ids = list(tokens)  # copy so we don't destroy the original list

merges = {}  # (int, int) -> int
for i in range(num_merges):
    # 1) Count all adjacent pairs in our current sequence 'ids'.
    stats = get_stats(ids)
    # 2) Pick the most frequent pair and assign it the next free token id.
    pair = max(stats, key=stats.get)
    idx = 128 + i
    # Decode the characters of the pair for display. For ids >= 128 (earlier merges),
    # chr() has no meaningful glyph, which is why some printed pairs below look garbled.
    char_pair = (chr(pair[0]), chr(pair[1]))
    print(f"merging {pair} ({char_pair[0]}{char_pair[1]}) into a new token {idx}")
    # 3) Replace all occurrences of the pair with the new id and record the merge.
    ids = merge(ids, pair, idx)
    merges[pair] = idx
Output >>
merging (115, 32) (s ) into a new token 128
merging (101, 32) (e ) into a new token 129
merging (104, 129) (h) into a new token 130
merging (97, 114) (ar) into a new token 131
merging (110, 32) (n ) into a new token 132
merging (116, 32) (t ) into a new token 133
merging (105, 103) (ig) into a new token 134
merging (134, 104) (h) into a new token 135
merging (101, 114) (er) into a new token 136
merging (114, 101) (re) into a new token 137
merging (97, 132) (a) into a new token 138
merging (66, 97) (Ba) into a new token 139
merging (84, 130) (T) into a new token 140
merging (68, 131) (D) into a new token 141
merging (141, 107) (k) into a new token 142
merging (142, 32) ( ) into a new token 143
merging (143, 75) (K) into a new token 144
merging (144, 110) (n) into a new token 145
merging (145, 135) () into a new token 146
merging (105, 115) (is) into a new token 147
8. Compression ratio
print(tokens)
print(ids)
print("tokens length:", len(tokens))
print("ids length:", len(ids))
print(f"compression ratio: {len(tokens) / len(ids):.2f}X")
Output >>
[84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 82, 105, 115, 101, 115, 32, 105, 115, 32, 97, 32, 115, 117, 112, 101, 114, 104, 101, 114, 111, 32, 109, 111, 118, 105, 101, 32, 114, 101, 108, 101, 97, 115, 101, 100, 32, 105, 110, 32, 50, 48, 49, 50, 46, 32, 73, 116, 32, 105, 115, 32, 116, 104, 101, 32, 102, 105, 110, 97, 108, 32, 112, 97, 114, 116, 32, 111, 102, 32, 67, 104, 114, 105, 115, 116, 111, 112, 104, 101, 114, 32, 78, 111, 108, 97, 110, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 32, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 66, 97, 116, 109, 97, 110, 32, 66, 101, 103, 105, 110, 115, 32, 97, 110, 100, 32, 84, 104, 101, 32, 68, 97, 114, 107, 32, 75, 110, 105, 103, 104, 116, 46, 32, 84, 104, 101, 32, 102, 105, 108, 109, 32, 115, 116, 97, 114, 115, 32, 67, 104, 114, 105, 115, 116, 105, 97, 110, 32, 66, 97, 108, 101, 32, 97, 115, 32, 66, 114, 117, 99, 101, 32, 87, 97, 121, 110, 101, 32, 66, 97, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 115, 32, 98, 101, 101, 110, 32, 114, 101, 116, 105, 114, 101, 100, 32, 97, 115, 32, 66, 97, 116, 109, 97, 110, 32, 102, 111, 114, 32, 101, 105, 103, 104, 116, 32, 121, 101, 97, 114, 115, 32, 97, 102, 116, 101, 114, 32, 116, 104, 101, 32, 101, 118, 101, 110, 116, 115, 32, 111, 102, 32, 116, 104, 101, 32, 112, 114, 101, 118, 105, 111, 117, 115, 32, 109, 111, 118, 105, 101, 46, 10]
[140, 146, 133, 82, 147, 101, 128, 105, 128, 97, 32, 115, 117, 112, 136, 104, 136, 111, 32, 109, 111, 118, 105, 129, 137, 108, 101, 97, 115, 101, 100, 32, 105, 132, 50, 48, 49, 50, 46, 32, 73, 133, 105, 128, 116, 130, 102, 105, 110, 97, 108, 32, 112, 131, 133, 111, 102, 32, 67, 104, 114, 147, 116, 111, 112, 104, 136, 32, 78, 111, 108, 138, 146, 133, 116, 114, 105, 108, 111, 103, 121, 44, 32, 102, 111, 108, 108, 111, 119, 105, 110, 103, 32, 139, 116, 109, 138, 66, 101, 103, 105, 110, 128, 97, 110, 100, 32, 140, 146, 116, 46, 32, 140, 102, 105, 108, 109, 32, 115, 116, 131, 128, 67, 104, 114, 147, 116, 105, 138, 139, 108, 129, 97, 128, 66, 114, 117, 99, 129, 87, 97, 121, 110, 129, 139, 116, 109, 97, 110, 44, 32, 119, 104, 111, 32, 104, 97, 128, 98, 101, 101, 132, 137, 116, 105, 137, 100, 32, 97, 128, 139, 116, 109, 138, 102, 111, 114, 32, 101, 135, 133, 121, 101, 131, 128, 97, 102, 116, 136, 32, 116, 130, 101, 118, 101, 110, 116, 128, 111, 102, 32, 116, 130, 112, 137, 118, 105, 111, 117, 128, 109, 111, 118, 105, 101, 46, 10]
tokens length: 309
ids length: 217
compression ratio: 1.42X
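As a follow-up (not part of the original notebook), a minimal decoder sketch shows that the learned merges are lossless: each merged token's string can be rebuilt from the merges dictionary, and the merged sequence round-trips back to the original text.
# Rebuild each token's string from the learned merges (base ids 0-127 are plain characters).
vocab = {i: chr(i) for i in range(128)}
for (p0, p1), new_id in merges.items():  # dict preserves insertion order, i.e. merge order
    vocab[new_id] = vocab[p0] + vocab[p1]

def decode(ids):
    return "".join(vocab[i] for i in ids)

print(decode(ids) == text)  # expected: True, the merges are lossless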