CS336: Lecture 01 - Tokenizer


This unit was inspired by Andrej Karpathy’s video on tokenization. [video]

To get a feel for how tokenizers work, play with this interactive site.


Tokenization Methods

Character-based
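One simple way to do character-level tokenization is to use Unicode code points directly as token IDs. A minimal sketch (function names are just for illustration):

    def char_encode(text: str) -> list[int]:
        # Each Unicode character becomes its code point, so the vocabulary is
        # as large as Unicode itself and most entries are rare.
        return [ord(ch) for ch in text]

    def char_decode(indices: list[int]) -> str:
        return "".join(chr(i) for i in indices)

    assert char_decode(char_encode("hello 🌍")) == "hello 🌍"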

Byte-based
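Byte-level tokenization instead encodes the text as UTF-8 and uses the raw byte values as token IDs: the vocabulary size is fixed at 256, but sequences get long. A minimal sketch:

    def byte_encode(text: str) -> list[int]:
        # Every token ID is in 0..255.
        return list(text.encode("utf-8"))

    def byte_decode(indices: list[int]) -> str:
        return bytes(indices).decode("utf-8")

    assert byte_decode(byte_encode("hello 🌍")) == "hello 🌍"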

Word-based
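Word-level tokenization first splits the text into word-like segments (real systems such as GPT-2's pre-tokenizer use a far more elaborate regex) and then maps each distinct segment to an ID, which forces an UNK token for anything unseen. A toy sketch of the splitting step, with an illustrative pattern:

    import re

    def word_split(text: str) -> list[str]:
        # Split into runs of word characters or single non-space characters;
        # a toy pattern, not the one used by GPT-2.
        return re.findall(r"\w+|\S", text)

    print(word_split("the cat ate."))  # ['the', 'cat', 'ate', '.']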

Byte Pair Encoding (BPE)

The BPE algorithm was introduced by Philip Gage in 1994 for data compression. [article]

It was later adapted for tokenization in neural machine translation. [Sennrich+ 2015]

Basic idea: train the tokenizer on raw text to automatically determine the vocabulary.

Intuition: common sequences of characters are represented by a single token, while rare sequences are represented by many tokens. Training repeatedly finds the most frequent pair of adjacent tokens and merges it into a new token:

def merge(indices: list[int], pair: tuple[int, int], new_index: int) -> list[int]:  # @inspect indices, @inspect pair, @inspect new_index
    """Return `indices`, but with all instances of `pair` replaced with `new_index`."""
    new_indices = []  # @inspect new_indices
    i = 0  # @inspect i
    while i < len(indices):
        if i + 1 < len(indices) and indices[i] == pair[0] and indices[i + 1] == pair[1]:
            new_indices.append(new_index)
            i += 2
        else:
            new_indices.append(indices[i])
            i += 1
    return new_indices
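For example, merging every occurrence of the pair (6, 7) into a new token 99:

    assert merge([5, 6, 6, 7], (6, 7), 99) == [5, 6, 99]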
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class BPETokenizerParams:
    """Everything needed to reconstruct the tokenizer."""
    vocab: dict[int, bytes]             # index -> bytes
    merges: dict[tuple[int, int], int]  # (index1, index2) -> merged index

def train_bpe(string: str, num_merges: int) -> BPETokenizerParams:  # @inspect string, @inspect num_merges
    # Start with the list of bytes of string.
    indices = list(map(int, string.encode("utf-8")))  # @inspect indices
    merges: dict[tuple[int, int], int] = {}  # index1, index2 => merged index
    vocab: dict[int, bytes] = {x: bytes([x]) for x in range(256)}  # index -> bytes
    for i in range(num_merges):
        # Count the number of occurrences of each pair of tokens.
        counts = defaultdict(int)
        for index1, index2 in zip(indices, indices[1:]):  # For each adjacent pair
            counts[(index1, index2)] += 1  # @inspect counts
        # Find the most common pair.
        pair = max(counts, key=counts.get)  # @inspect pair
        index1, index2 = pair
        # Merge that pair.
        new_index = 256 + i  # @inspect new_index
        merges[pair] = new_index  # @inspect merges
        vocab[new_index] = vocab[index1] + vocab[index2]  # @inspect vocab
        indices = merge(indices, pair, new_index)  # @inspect indices
    return BPETokenizerParams(vocab=vocab, merges=merges)
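A quick illustrative run (the exact merges depend on the training string and on how count ties are broken):

    params = train_bpe("the cat in the hat", num_merges=3)
    print(params.merges)      # e.g. {(116, 104): 256, (256, 101): 257, ...}
    print(params.vocab[256])  # bytes produced by the first merge, e.g. b'th'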

