
The spelled-out intro to language modeling: building makemore

15 chapters with key takeaways — read first, then watch
1. Introduction to makemore and Language Modeling (0:00–3:02, 3m 2s) · Intro
2. Loading and Analyzing the Names Dataset (3:03–5:44, 2m 41s) · Concept
3. Building a Bi-gram Language Model by Counting (5:45–12:44, 6m 59s) · Concept
4. PyTorch Tensor for Bi-gram Counts and Visualization (12:45–24:09, 11m 24s) · Architecture
5. Generating Names by Sampling from Bi-gram Probabilities (24:10–36:15, 12m 5s) · Demo
6. Optimizing Probability Matrix with PyTorch Broadcasting (36:16–50:15, 13m 59s) · Concept
7. Measuring Model Quality: Negative Log Likelihood Loss (50:15–1:00:07, 9m 52s) · Concept
8. Addressing Zero Probabilities with Model Smoothing (1:00:07–1:03:40, 3m 33s) · Concept
9. Neural Network Approach for Bi-gram Language Model (1:03:40–1:10:01, 6m 21s) · Architecture
10. One-Hot Encoding and Single Linear Layer in PyTorch (1:10:01–1:18:46, 8m 45s) · Architecture
11. Transforming Neural Network Outputs to Probabilities (Softmax) (1:18:46–1:28:58, 10m 12s) · Architecture
12. Calculating Negative Log Likelihood Loss for Neural Network (1:28:58–1:38:36, 9m 38s) · Architecture
13. Training the Neural Network with Gradient Descent (1:38:36–1:47:47, 9m 11s) · Training
14. Equivalence to Bi-gram Model and Regularization as Smoothing (1:47:47–1:54:29, 6m 42s) · Concept
15. Sampling from Neural Net and Future Directions (1:54:29–1:57:45, 3m 16s) · Conclusion
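Chapters 3–8 cover the counting-based bi-gram model. A minimal sketch of that pipeline (counting into a tensor, smoothing, normalization, sampling, and the negative log likelihood) might look like the following. A small inline word list stands in for the names dataset used in the video, so the numbers here are illustrative only:

```python
import torch

# Tiny stand-in corpus; the video uses a large file of names.
words = ["emma", "olivia", "ava", "isabella", "sophia"]

# Character vocabulary with '.' as the start/end token.
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
itos = {i: s for s, i in stoi.items()}
V = len(stoi)

# Count bigram occurrences in a V x V tensor.
N = torch.zeros((V, V), dtype=torch.int32)
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        N[stoi[c1], stoi[c2]] += 1

# Model smoothing: add 1 to every count so no bigram has zero
# probability, then normalize each row into a distribution.
# Broadcasting divides the (V, V) matrix by a (V, 1) column of sums.
P = (N + 1).float()
P = P / P.sum(dim=1, keepdim=True)

# Sample a new word by walking the bigram chain from the start token.
g = torch.Generator().manual_seed(2147483647)
ix = 0
out = []
while True:
    ix = torch.multinomial(P[ix], num_samples=1, generator=g).item()
    if ix == 0:
        break
    out.append(itos[ix])
print("".join(out))

# Model quality: average negative log likelihood over training bigrams.
log_likelihood = 0.0
n = 0
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        log_likelihood += torch.log(P[stoi[c1], stoi[c2]])
        n += 1
nll = -log_likelihood / n
print(f"average NLL: {nll.item():.4f}")
```

Lower average NLL means the model assigns higher probability to the observed bigrams; without the add-one smoothing, any unseen bigram would yield infinite loss.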

Video Details & AI Summary

Published Sep 7, 2022
Analyzed Jan 21, 2026

AI Analysis Summary

This video provides a detailed, 'spelled-out' introduction to language modeling using the 'makemore' project. It begins by building a character-level bi-gram language model through explicit counting and normalization, demonstrating how to sample new words and how to evaluate model quality with the negative log likelihood. The tutorial then implements the same bi-gram model as a neural network in PyTorch, explaining one-hot encoding, logits, softmax, and gradient-based optimization. It shows that both approaches yield identical results, while highlighting the superior flexibility and scalability of neural networks for more complex future models such as transformers.
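The neural-network half of the summary (one-hot inputs, a single linear layer producing logits, softmax, NLL loss, and gradient descent) can be sketched as follows. Again, a small inline word list is an assumption standing in for the real dataset, and the learning rate and step count are illustrative:

```python
import torch
import torch.nn.functional as F

words = ["emma", "olivia", "ava", "isabella", "sophia"]  # stand-in corpus
chars = sorted(set("".join(words)))
stoi = {s: i + 1 for i, s in enumerate(chars)}
stoi["."] = 0
V = len(stoi)

# Training set of bigrams: xs are input chars, ys are target next chars.
xs, ys = [], []
for w in words:
    chs = ["."] + list(w) + ["."]
    for c1, c2 in zip(chs, chs[1:]):
        xs.append(stoi[c1])
        ys.append(stoi[c2])
xs, ys = torch.tensor(xs), torch.tensor(ys)

# A single linear layer: W, shape (V, V), is the only parameter.
g = torch.Generator().manual_seed(2147483647)
W = torch.randn((V, V), generator=g, requires_grad=True)

for step in range(100):
    # Forward pass: one-hot encode inputs, matrix-multiply for logits
    # (interpreted as log-counts), softmax into probabilities, then take
    # the average negative log likelihood of the correct next characters.
    xenc = F.one_hot(xs, num_classes=V).float()
    logits = xenc @ W
    counts = logits.exp()
    probs = counts / counts.sum(1, keepdim=True)  # softmax
    loss = -probs[torch.arange(len(ys)), ys].log().mean()
    loss = loss + 0.01 * (W ** 2).mean()  # regularization acts like smoothing

    # Backward pass and gradient descent update.
    W.grad = None
    loss.backward()
    with torch.no_grad():
        W -= 50 * W.grad

print(f"final loss: {loss.item():.4f}")
```

The weight penalty pulls all logits toward zero, i.e. toward a uniform distribution, which is why regularization here plays the same role as add-one smoothing in the counting model.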

Title Accuracy Score: 10/10 · Excellent
Processed in 41.5s
Model: gemini-2.5-flash