
Read time: ~14 min · 15 terms · 14 segments

Let's build the GPT Tokenizer

14 chapters with key takeaways — read first, then watch
1. Tokenization: The Unsung Hero of LLMs · 0:00-2:30 (2m 30s) · Intro
2. Tokenization: Source of LLM Quirks · 4:20-5:57 (1m 37s) · Concept
3. GPT-2 Tokenizer in Action: Arbitrary Splits · 5:57-9:30 (3m 33s) · Demo
4. Inefficiencies in Non-English & Code Tokenization · 9:30-12:29 (2m 59s) · Limitation
5. GPT-4 Tokenizer: Denser & More Efficient · 12:29-14:56 (2m 27s) · Architecture
6. Unicode, UTF-8, and the Need for BPE · 14:56-23:49 (8m 53s) · Concept
7. Implementing Byte Pair Encoding (BPE) · 23:49-34:12 (10m 23s) · Architecture
8. Iterative BPE Training, Encoding & Decoding · 34:12-57:21 (23m 9s) · Training
9. GPT-2's Regex: Enforcing Merge Rules · 57:21-1:11:38 (14m 17s) · Architecture
10. Tiktoken, GPT-4 & Special Tokens · 1:11:38-1:25:28 (13m 50s) · Architecture
11. SentencePiece: Llama & Mistral's Tokenizer · 1:28:41-1:43:28 (14m 47s) · Architecture
12. Vocab Size, Model Extension & Multimodality · 1:43:28-1:51:41 (8m 13s) · Architecture
13. Tokenization Explained: LLM Failures & Quirks · 1:51:41-2:10:19 (18m 38s) · Use Case
14. Tokenization: Conclusion & Future Outlook · 2:10:19-2:13:35 (3m 16s) · Conclusion
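
Chapters 3, 5, and 10 all revolve around how the GPT-2 and GPT-4 tokenizers split the same text. A quick way to follow along is OpenAI's tiktoken library; the snippet below is a small illustrative sketch (the sample string is made up, not taken from the lecture).

```python
# Comparing the GPT-2 and GPT-4 tokenizers with OpenAI's tiktoken library
# (pip install tiktoken). The sample text is an arbitrary illustration.
import tiktoken

gpt2 = tiktoken.get_encoding("gpt2")         # GPT-2's BPE vocabulary (~50k tokens)
gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4's BPE vocabulary (~100k tokens)

text = "    for i in range(10):\n        print(i)  # indented Python"
gpt2_ids = gpt2.encode(text)
gpt4_ids = gpt4.encode(text)

print(f"GPT-2: {len(gpt2_ids)} tokens")
print(f"GPT-4: {len(gpt4_ids)} tokens")
# The GPT-4 tokenizer merges runs of spaces into single tokens, so indented
# code (and, more generally, non-English text) encodes into fewer tokens.
```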

Video Details & AI Summary

Published: Feb 20, 2024
Analyzed: Jan 21, 2026

AI Analysis Summary

This video provides a deep dive into tokenization, a fundamental yet complex aspect of large language models (LLMs). It explains how text is converted into numerical tokens, detailing the Byte Pair Encoding (BPE) algorithm from scratch and comparing the tokenization strategies of GPT-2, GPT-4, and Llama 2 (using SentencePiece). The lecture highlights how tokenization design choices are at the root of many common LLM quirks, such as poor performance in non-English languages, arithmetic errors, and unexpected model behaviors, offering practical insights and recommendations for effective tokenization in AI applications.
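
Since the lecture builds Byte Pair Encoding from scratch, a minimal sketch of the core training loop may help orient readers before watching. The function names and the toy training string below are illustrative assumptions, not the lecture's actual code: the idea is simply to start from raw UTF-8 bytes and repeatedly merge the most frequent adjacent pair into a new token id.

```python
# Minimal sketch of byte-level BPE training (illustrative, not the lecture's code).
from collections import Counter

def get_pair_counts(ids):
    """Count how often each adjacent (id, id) pair occurs."""
    return Counter(zip(ids, ids[1:]))

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the single token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train_bpe(text, vocab_size):
    """Learn `vocab_size - 256` merges on top of the 256 raw byte values."""
    ids = list(text.encode("utf-8"))          # start from raw UTF-8 bytes
    merges = {}                               # (id, id) -> new token id
    for new_id in range(256, vocab_size):
        counts = get_pair_counts(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)    # most frequent adjacent pair
        ids = merge(ids, pair, new_id)
        merges[pair] = new_id
    return merges

merges = train_bpe("aaabdaaabac" * 10, vocab_size=260)
print(merges)  # e.g. {(97, 97): 256, (256, 97): 257, ...}
```

Encoding new text then just replays these learned merges in order, and decoding maps token ids back to their byte sequences, which is what chapter 8 in the list above walks through.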

Title Accuracy Score: 10/10 (Excellent)
Processing time: 54.8s
Model: gemini-2.5-flash