Building makemore Part 3: Activations & Gradients, BatchNorm

15 chapters with key takeaways — read first, then watch
1. Importance of Activations & Gradients · 0:00-1:22 (1m 22s) · Intro
2. Code Refactoring & Initial Model Performance · 1:22-4:17 (2m 55s) · Concept
3. High Initial Loss & Softmax Overconfidence · 4:17-9:21 (5m 4s) · Concept
4. Fixing Output Layer Initialization · 9:21-12:59 (3m 38s) · Demo
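
The fix demonstrated here is to initialize the output layer with small weights and a zero bias, so the initial logits sit near zero and the softmax starts out diffuse. A sketch, with layer sizes and the 0.01 factor taken as rough values from the lecture:

```python
import torch

n_hidden, vocab_size = 200, 27
g = torch.Generator().manual_seed(2147483647)

# Small weights and a zero bias give near-zero logits, so the initial
# softmax is close to uniform and the loss starts near -log(1/27).
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.zeros(vocab_size)

h = torch.randn(32, n_hidden)    # stand-in hidden activations
logits = h @ W2 + b2
print(logits.abs().max())        # small, as desired at initialization
```
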
5. Hidden Layer Saturation & Vanishing Gradients · 12:59-24:37 (11m 38s) · Concept
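
tanh squashes large preactivations into its flat tails, where its local gradient (1 - h^2) vanishes, so saturated units receive almost no gradient and stop learning. A small sketch that measures the saturated fraction (shapes and the 0.99 threshold are illustrative):

```python
import torch

torch.manual_seed(42)
hpre = torch.randn(32, 200) * 5.0   # preactivations with too-large scale
h = torch.tanh(hpre)

# tanh's backward pass multiplies by (1 - h**2): it vanishes where |h|
# is near 1, so saturated units are effectively frozen.
saturated = (h.abs() > 0.99).float().mean()
print(f"saturated fraction: {saturated:.2%}")
```
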
6. Fixing Hidden Layer Initialization · 24:37-27:54 (3m 17s) · Demo
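
The same remedy applies one layer down: shrink W1 and b1 so the tanh preactivations land in its active region. A sketch using the lecture's rough dimensions and a hand-tuned shrink factor:

```python
import torch

n_embd, block_size, n_hidden = 10, 3, 200
fan_in = n_embd * block_size
g = torch.Generator().manual_seed(2147483647)

W1 = torch.randn((fan_in, n_hidden), generator=g) * 0.2   # hand-tuned shrink
b1 = torch.randn(n_hidden, generator=g) * 0.01

x = torch.randn(32, fan_in)
h = torch.tanh(x @ W1 + b1)
print((h.abs() > 0.99).float().mean())   # far fewer saturated units
```
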
7. Principled Initialization (Kaiming Normal) · 27:54-37:40 (9m 46s) · Concept
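
Kaiming initialization makes the shrink factor principled: std = gain / sqrt(fan_in), with gain 5/3 for tanh, which roughly preserves activation scale from layer to layer. A sketch of the formula alongside PyTorch's built-in torch.nn.init.kaiming_normal_ (dimensions illustrative):

```python
import math
import torch

fan_in, fan_out = 30, 200
std = (5.0 / 3.0) / math.sqrt(fan_in)    # gain / sqrt(fan_in), tanh gain = 5/3
W = torch.randn(fan_in, fan_out) * std

x = torch.randn(1000, fan_in)
print(x.std(), torch.tanh(x @ W).std())  # activation scale roughly preserved

# The built-in equivalent (note nn.Linear stores weights as (out, in)):
W_pt = torch.empty(fan_out, fan_in)
torch.nn.init.kaiming_normal_(W_pt, nonlinearity='tanh')
print(W_pt.std())                        # ~ (5/3) / sqrt(30)
```
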
8. Applying Kaiming Initialization · 37:40-40:39 (2m 59s) · Demo
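
Applied to the makemore MLP, the Kaiming factor replaces the hand-tuned constants from earlier. A sketch assuming the lecture's approximate hyperparameters:

```python
import torch

n_embd, block_size, n_hidden, vocab_size = 10, 3, 200, 27
fan_in = n_embd * block_size
g = torch.Generator().manual_seed(2147483647)

# Hidden layer: Kaiming scaling with the tanh gain replaces the magic 0.2.
W1 = torch.randn((fan_in, n_hidden), generator=g) * ((5 / 3) / fan_in**0.5)
b1 = torch.randn(n_hidden, generator=g) * 0.01
# Output layer: still deliberately small, keeping the initial softmax diffuse.
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01
b2 = torch.zeros(vocab_size)
```
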
9. Introduction to Batch Normalization · 40:39-53:58 (13m 19s) · Concept
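
Batch normalization standardizes each hidden unit across the batch to zero mean and unit std, then rescales with a learnable gain and bias; the whole operation is differentiable, so gradients flow through the batch statistics. A minimal training-time forward pass (shapes illustrative; the small eps usually added to the denominator is omitted for brevity):

```python
import torch

n_hidden = 200
bngain = torch.ones((1, n_hidden))    # learnable scale (gamma)
bnbias = torch.zeros((1, n_hidden))   # learnable shift (beta)

hpre = torch.randn(32, n_hidden) * 4 + 2      # badly scaled preactivations
bnmean = hpre.mean(0, keepdim=True)           # per-unit mean over the batch
bnstd = hpre.std(0, keepdim=True)             # per-unit std over the batch
h = torch.tanh(bngain * (hpre - bnmean) / bnstd + bnbias)
print((h.abs() > 0.99).float().mean())        # saturation is tamed
```
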
10. BatchNorm Inference & Running Statistics · 53:58-1:00:53 (6m 55s) · Concept
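
At inference there may be no batch to take statistics over, so BatchNorm maintains exponential moving averages of the batch mean and std during training and uses those fixed values at test time. A sketch with an illustrative momentum:

```python
import torch

n_hidden, momentum = 200, 0.001
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))

for _ in range(200):                           # training loop (sketched)
    hpre = torch.randn(32, n_hidden) * 4 + 2
    bnmeani = hpre.mean(0, keepdim=True)
    bnstdi = hpre.std(0, keepdim=True)
    # Exponential moving averages, maintained outside the gradient graph.
    with torch.no_grad():
        bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmeani
        bnstd_running = (1 - momentum) * bnstd_running + momentum * bnstdi

# Inference: a single example is normalized with the running statistics.
x = torch.randn(1, n_hidden) * 4 + 2
x_hat = (x - bnmean_running) / bnstd_running
print(x_hat.shape)
```
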
11. BatchNorm Details & Summary · 1:00:53-1:04:49 (3m 56s) · Concept
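
One detail covered in this summary: a bias in a linear layer feeding directly into BatchNorm is wasted, since the mean subtraction cancels it and BatchNorm's own shift parameter takes over. A sketch in torch.nn terms:

```python
import torch.nn as nn

# The batch-mean subtraction in BatchNorm cancels any bias added by the
# preceding Linear layer, so drop it (bias=False) and let BatchNorm1d's
# own learnable beta play that role.
block = nn.Sequential(
    nn.Linear(30, 200, bias=False),
    nn.BatchNorm1d(200),
    nn.Tanh(),
)
```
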
12. Real-World BatchNorm & PyTorch API · 1:04:49-1:14:50 (10m 1s) · Use Case
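
The hand-rolled version maps directly onto PyTorch's API: nn.BatchNorm1d bundles the normalization, the learnable parameters, the running statistics, and the train/eval switch. A sketch (feature count illustrative):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(200)          # eps and momentum have sensible defaults

bn.train()                        # batch statistics + running-stat updates
y = bn(torch.randn(32, 200) * 4 + 2)
print(y.mean().item(), y.std().item())   # ~0 and ~1

bn.eval()                         # switches to the stored running statistics
y1 = bn(torch.randn(1, 200))      # now even a single example works
```
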
13. Summary of Initialization & Normalization · 1:14:50-1:18:33 (3m 43s) · Conclusion
14. PyTorch Refactoring & Diagnostic Tools (No BatchNorm) · 1:18:33-1:36:15 (17m 42s) · Architecture
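
The refactor rebuilds the model from small Linear and Tanh classes mirroring torch.nn, then inspects per-layer activation statistics to judge network health. A sketch of that diagnostic loop using torch.nn directly (depth, widths, and the saturation threshold are illustrative):

```python
import torch

torch.manual_seed(42)
x = torch.randn(32, 100)
layers = [torch.nn.Linear(100, 100, bias=False) for _ in range(5)]

# Walk a Linear -> tanh stack and log per-layer health statistics;
# the lecture plots full activation and gradient histograms the same way.
for i, lin in enumerate(layers):
    x = torch.tanh(lin(x))
    print(f"layer {i}: std {x.std():.3f}, "
          f"saturated {(x.abs() > 0.97).float().mean():.2%}")
```
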
15. Parameter Update Ratio & BatchNorm Robustness · 1:36:15-1:55:56 (19m 41s) · Concept
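
The update-to-data ratio compares the size of each gradient step to the size of the parameter it updates; the lecture's rule of thumb is that its log10 should hover around -3. A sketch on a throwaway least-squares problem (shapes and the learning rate are illustrative):

```python
import torch

torch.manual_seed(42)
lr = 0.1
p = torch.randn(100, 100, requires_grad=True)

x, y = torch.randn(32, 100), torch.randn(32, 100)
loss = ((x @ p - y) ** 2).mean()
loss.backward()

with torch.no_grad():
    # Size of this step relative to the size of the parameter values.
    ratio = (lr * p.grad).std() / p.data.std()
    print(ratio.log10().item())   # rule of thumb: around -3
    p -= lr * p.grad              # the SGD update itself
```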

Video Details & AI Summary

Published Oct 4, 2022
Analyzed Jan 21, 2026

AI Analysis Summary

This video, 'Building makemore Part 3,' examines why understanding activations and gradients is critical for training deep neural networks. It shows how proper weight initialization, both hand-tuned and principled (Kaiming), significantly improves training stability and performance by preventing failure modes such as softmax overconfidence and hidden-layer saturation. The lecture then introduces Batch Normalization as a powerful modern innovation that robustly stabilizes activation distributions, and demonstrates diagnostic tools, including activation/gradient histograms and update-to-data ratios, for monitoring network health during training.

Title Accuracy Score
10/10 (Excellent)
Processing time: 1.2m
Model: gemini-2.5-flash