
Read time: ~14 min · 15 terms · 14 segments

Let's build the GPT Tokenizer

14 chapters with key takeaways — read first, then watch
1. Tokenization: The Unsung Hero of LLMs · 0:00-2:30 (2m 30s) · Intro
2. Tokenization: Source of LLM Quirks · 4:20-5:57 (1m 37s) · Concept
3. GPT-2 Tokenizer in Action: Arbitrary Splits · 5:57-9:30 (3m 33s) · Demo
4. Inefficiencies in Non-English & Code Tokenization · 9:30-12:29 (2m 59s) · Limitation
5. GPT-4 Tokenizer: Denser & More Efficient · 12:29-14:56 (2m 27s) · Architecture
6. Unicode, UTF-8, and the Need for BPE · 14:56-23:49 (8m 53s) · Concept
7. Implementing Byte Pair Encoding (BPE) · 23:49-34:12 (10m 23s) · Architecture
8. Iterative BPE Training, Encoding & Decoding · 34:12-57:21 (23m 9s) · Training
9. GPT-2's Regex: Enforcing Merge Rules · 57:21-1:11:38 (14m 17s) · Architecture
10. Tiktoken, GPT-4 & Special Tokens · 1:11:38-1:25:28 (13m 50s) · Architecture
11. SentencePiece: Llama & Mistral's Tokenizer · 1:28:41-1:43:28 (14m 47s) · Architecture
12. Vocab Size, Model Extension & Multimodality · 1:43:28-1:51:41 (8m 13s) · Architecture
13. Tokenization Explained: LLM Failures & Quirks · 1:51:41-2:10:19 (18m 38s) · Use Case
14. Tokenization: Conclusion & Future Outlook · 2:10:19-2:13:35 (3m 16s) · Conclusion
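
Chapters 3, 5, and 10 all revolve around how the GPT-2 and GPT-4 tokenizers split the same text. A quick way to follow along is OpenAI's tiktoken library; the snippet below is a small illustrative sketch (the sample string is made up, not taken from the lecture).

```python
# Comparing the GPT-2 and GPT-4 tokenizers with OpenAI's tiktoken library
# (pip install tiktoken). The sample text is an arbitrary illustration.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2's BPE vocabulary (~50k tokens)
gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4's BPE vocabulary (~100k tokens)

text = "    for i in range(10):\n        print(i)  # indented Python"
gpt2_ids = gpt2.encode(text)
gpt4_ids = gpt4.encode(text)

print(f"GPT-2: {len(gpt2_ids)} tokens")
print(f"GPT-4: {len(gpt4_ids)} tokens")
# The GPT-4 tokenizer merges runs of spaces into single tokens, so indented
# code (and, more generally, non-English text) encodes into fewer tokens.
```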

Video Details & AI Summary

Published: Feb 20, 2024
Analyzed: Jan 21, 2026

AI Analysis Summary

This video provides a deep dive into tokenization, a fundamental yet complex aspect of large language models (LLMs). It explains how text is converted into numerical tokens, detailing the Byte Pair Encoding (BPE) algorithm from scratch and comparing the tokenization strategies of GPT-2, GPT-4, and Llama 2 (using SentencePiece). The lecture highlights how tokenization design choices are at the root of many common LLM quirks, such as poor performance in non-English languages, arithmetic errors, and unexpected model behaviors, offering practical insights and recommendations for effective tokenization in AI applications.
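
Since the lecture builds Byte Pair Encoding from scratch, a minimal sketch of the core training loop may help orient readers before watching. The function names and the toy training string below are illustrative assumptions, not the lecture's actual code: the idea is simply to start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token id.

```python
# Minimal sketch of byte-level BPE training (illustrative, not the lecture's code).
from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent (id, id) pair occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn `vocab_size - 256` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))          # start from raw UTF-8 bytes
    merges = {}                               # (id, id) -> new token id
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)    # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

merges = train_bpe("aaabdaaabac" * 10, vocab_size=260)
print(merges)  # e.g. {(97, 97): 256, (256, 97): 257, ...}
```

Encoding new text then just replays these learned merges in order, and decoding maps token ids back to their byte sequences, which is what chapter 8 in the list above walks through.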

Title Accuracy Score: 10/10 (Excellent)
Processing time: 54.8s
Model: gemini-2.5-flash