CS336: Lecture 01 - Tokenizer


This unit was inspired by Andrej Karpathy’s video on tokenization. [video]

To get a feel for how tokenizers work, play with this interactive site.


Tokenization Methods

Character-based
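One simple way to do character-level tokenization is to use Unicode code points directly as token IDs. A minimal sketch (function names are just for illustration):

    def char_encode(text: str) -> list[int]:
        # Each Unicode character becomes its code point, so the vocabulary is
        # as large as Unicode itself and most entries are rare.
        return [ord(ch) for ch in text]

    def char_decode(indices: list[int]) -> str:
        return "".join(chr(i) for i in indices)

    assert char_decode(char_encode("hello 🌍")) == "hello 🌍"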

Byte-based
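Byte-level tokenization instead encodes the text as UTF-8 and uses the raw byte values as token IDs: the vocabulary size is fixed at 256, but sequences get long. A minimal sketch:

    def byte_encode(text: str) -> list[int]:
        # Every token ID is in 0..255.
        return list(text.encode("utf-8"))

    def byte_decode(indices: list[int]) -> str:
        return bytes(indices).decode("utf-8")

    assert byte_decode(byte_encode("hello 🌍")) == "hello 🌍"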

Word-based
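Word-level tokenization first splits the text into word-like segments (real systems such as GPT-2's pre-tokenizer use a far more elaborate regex) and then maps each distinct segment to an ID, which forces an UNK token for anything unseen. A toy sketch of the splitting step, with an illustrative pattern:

    import re

    def word_split(text: str) -> list[str]:
        # Split into runs of word characters or single non-space characters;
        # a toy pattern, not the one used by GPT-2.
        return re.findall(r"\w+|\S", text)

    print(word_split("the cat ate."))  # ['the', 'cat', 'ate', '.']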

Byte Pair Encoding (BPE)

The BPE algorithm was introduced by Philip Gage in 1994 for data compression. [article]

It was later adapted for tokenization in neural machine translation. [Sennrich+ 2015]

Basic idea: train the tokenizer on raw text to automatically determine the vocabulary.

Intuition: common sequences of characters are represented by a single token, while rare sequences are represented by many tokens. Training repeatedly finds the most frequent pair of adjacent tokens and merges it into a new token:

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:  # @inspect indices, @inspect pair, @inspect new_index
    """Return `indices`, but with all instances of `pair` replaced with `new_index`."""
    new_indices = []  # @inspect new_indices
    i = 0  # @inspect i
    while i < len(indices):
        if i + 1 < len(indices) and indices[i] == pair[0] and indices[i + 1] == pair[1]:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices
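For example, merging every occurrence of the pair (6, 7) into a new token 99:

    assert merge([5, 6, 6, 7], (6, 7), 99) == [5, 6, 99]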
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BPETokenizerParams:
    """Everything needed to reconstruct the tokenizer."""
    vocab: dict[int, bytes]             # index -> bytes
    merges: dict[tuple[int, int], int]  # (index1, index2) -> merged index

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:  # @inspect string, @inspect num_merges
    # Start with the list of bytes of string.
    indices = list(map(int, string.encode("utf-8")))  # @inspect indices
    merges: dict[tuple[int, int], int] = {}  # index1, index2 => merged index
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes
    for i in range(num_merges):
        # Count the number of occurrences of each pair of tokens.
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):  # For each adjacent pair
            counts[(index1, index2)] += 1  # @inspect counts
        # Find the most common pair.
        pair = max(counts, key=counts.get)  # @inspect pair
        index1, index2 = pair
        # Merge that pair.
        new_index = 256 + i  # @inspect new_index
        merges[pair] = new_index  # @inspect merges
        vocab[new_index] = vocab[index1] + vocab[index2]  # @inspect vocab
        indices = merge(indices, pair, new_index)  # @inspect indices
    return BPETokenizerParams(vocab=vocab, merges=merges)
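A quick illustrative run (the exact merges depend on the training string and on how count ties are broken):

    params = train_bpe("the cat in the hat", num_merges=3)
    print(params.merges)      # e.g. {(116, 104): 256, (256, 101): 257, ...}
    print(params.vocab[256])  # bytes produced by the first merge, e.g. b'th'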

