Andrej Karpathy
458K subscribers
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, ...
50 Comments
I still can't understand why BatchNorm helps against vanishing/exploding gradients. Are there any ideas?
Thank you @Andrej for bringing us this series. You are a great teacher; the way you have simplified such seemingly complex topics is valuable to all students like me. 🙏
Amazing. Knowledge like this is really hard to find in other videos, and you have an AMAZING skill for clearly explaining complex stuff.
This is a great lecture, especially the second half building intuition about diagnostics. Amazing stuff.
The amount of useful information in this video is impressive. Thanks for such good content.
Thanks for the fantastic download! You have changed my learning_rate in this area from 0.1 to something >1!
🐐
🎯 Course outline for quick navigation:
[00:00-03:21]1. Continuing and refactoring neural networks for language modeling
-[00:00-00:30]Continuing makemore implementation with multilayer perceptron for character-level language modeling, planning to move to larger neural networks.
-[00:31-01:03]Understanding neural net activations and gradients in training is crucial for optimizing architectures.
-[02:06-02:46]Refactored code to optimize neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.
-[03:03-03:28]Using the torch.no_grad decorator to prevent gradient computation.
[03:22-14:22]2. Efficiency of torch.no_grad and neural net initialization issues
-[03:22-04:00]Using torch.no_grad makes computation more efficient by eliminating gradient tracking (see the no_grad sketch after this outline).
-[04:22-04:50]Network initialization causes high loss of 27, rapidly decreases to 1 or 2.
-[05:00-05:32]At initialization, the model aims for a uniform distribution among 27 characters, with roughly 1/27 probability for each.
-[05:49-06:19]Neural net creates skewed probability distributions leading to high loss.
-[12:08-12:36]Loss at initialization is as expected; improved to 2.12-2.16.
[14:24-36:39]3. Neural network initialization
-[16:03-16:31]The local gradient in the chain rule vanishes when tanh outputs are close to -1 or 1, halting backpropagation.
-[18:09-18:38]Concern over destructive gradients in flat regions of h outputs, tackled by analyzing absolute values.
-[26:03-26:31]Optimization improved validation loss from 2.17 to 2.10 by fixing the softmax and tanh layer issues.
-[29:28-30:02]The standard deviation expands to about three, whereas we want a roughly unit gaussian distribution of activations.
-[30:17-30:47]Scaling the weights down by 0.2 shrinks the gaussian to a standard deviation of 0.6.
-[31:03-31:46]Initializing neural network weights for well-behaved activations, following Kaiming He et al. (see the initialization sketch after this outline).
-[36:24-36:55]Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.
[36:39-51:52]4. Neural net initialization and batch normalization
-[36:39-37:05]Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.
-[40:32-43:04]Batch normalization enables reliable training of deep neural nets, ensuring roughly gaussian hidden states for improved performance.
-[40:51-41:13]Batch normalization from 2015 enabled reliable training of deep neural nets.
-[41:39-42:09]Standardizing hidden states to be unit gaussian is a perfectly differentiable operation, a key insight in the paper.
-[43:20-43:50]Calculating standard deviation of activations, mean is average value of neuron's activation.
-[45:45-46:16]Backpropagation guides the movement of the distribution, with a scale and shift added for the final output.
[51:52-01:01:35]5. Jittering and batch normalization in neural network training
-[52:10-52:37]Padding input examples adds entropy, augments data, and regularizes neural nets.
-[53:44-54:09]Batch normalization effectively controls activations and their distributions.
-[56:05-56:33]Batch normalization paper introduces running mean and standard deviation estimation during training.
-[01:00:46-01:01:10]Eliminating the explicit calibration stage nearly completes batch normalization; epsilon prevents division by zero.
[01:01:36-01:09:21]6. Batch normalization and ResNet in PyTorch
-[01:02:00-01:02:30]Biases are subtracted out in batch normalization, reducing their impact to zero.
-[01:03:13-01:03:53]Using batch normalization to control activations in neural net, with gain, bias, mean, and standard deviation parameters.
-[01:07:25-01:07:53]Creating deep neural networks with weight layers, normalization, and non-linearity, as exemplified in the provided code.
[01:09:21-01:23:37]7. PyTorch weight initialization and batch normalization
-[01:10:05-01:10:32]PyTorch initializes weights from a uniform distribution scaled by 1/sqrt(fan_in).
-[01:11:11-01:11:40]Scaling weights by 1 over sqrt(fan_in), and using PyTorch's batch normalization layer with 200 features.
-[01:14:02-01:14:35]Importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.
-[01:16:00-01:16:30]Batch normalization centers data for gaussian activations in deep neural networks.
-[01:17:32-01:18:02]Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.
[01:23:39-01:55:56]8. Custom PyTorch layers and network analysis
-[01:24:01-01:24:32]Updating the running-statistics buffers with an exponential moving average inside a torch.no_grad context manager (see the batch-norm sketch after this outline).
-[01:25:47-01:27:11]The model has 46,000 parameters and uses PyTorch for the forward and backward passes, with visualizations of the forward-pass activations.
-[01:28:04-01:28:30]Saturation is about 20% at first, then stabilizes around 5% with a standard deviation of 0.65, due to the gain being set at 5/3.
-[01:33:19-01:33:50]Setting gain correctly at 1 prevents shrinking and diffusion in batch normalization.
-[01:38:41-01:39:11]The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.
-[01:43:18-01:43:42]Monitoring the update-to-data ratio for the parameters to ensure efficient training, aiming for about -3 on a log10 plot (see the update-ratio sketch after this outline).
-[01:51:36-01:52:04]Introducing batch normalization and PyTorch-style modules for neural networks.
-[01:52:39-01:53:06]Introduction to diagnostic tools for neural network analysis.
-[01:54:45-01:55:50]Introduction to diagnostic tools in neural networks; initialization and backpropagation remain areas of active research with ongoing progress.
offered by Coursnap
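For reference alongside section 2 of the outline, here is a minimal no_grad sketch: evaluating a loss under the torch.no_grad decorator so that no autograd graph is built. The tiny model and tensor shapes are illustrative stand-ins, not the lecture's network.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-in model: a single linear layer with trainable weights.
W = torch.randn(10, 5, requires_grad=True)

@torch.no_grad()  # everything inside runs without gradient tracking
def eval_loss(x, y):
    logits = x @ W                       # forward pass only, no graph is built
    return F.cross_entropy(logits, y).item()

x = torch.randn(32, 10)                  # a batch of 32 examples
y = torch.randint(0, 5, (32,))           # integer class targets
print(eval_loss(x, y))                   # cheaper than a tracked forward pass
```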
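For the initialization discussion in sections 3 and 7, here is a minimal initialization sketch of the fan-in scaling: draw weights from a unit gaussian and scale by gain / sqrt(fan_in) so pre-activations keep roughly unit standard deviation. The layer sizes are illustrative; the 5/3 gain for tanh matches torch.nn.init.calculate_gain('tanh').

```python
import torch
from torch.nn import init

fan_in, fan_out = 30, 200                # illustrative layer sizes
gain = 5 / 3                             # recommended gain for tanh layers

# Manual Kaiming-style scaling: unit gaussian shrunk by gain / sqrt(fan_in).
W = torch.randn(fan_in, fan_out) * gain / fan_in**0.5
print(W.std())                           # roughly gain / sqrt(fan_in) ≈ 0.30

# PyTorch's helper applies the same rule; note it expects the nn.Linear
# weight layout (out_features, in_features), so fan_in is the second dim.
W2 = torch.empty(fan_out, fan_in)
init.kaiming_normal_(W2, mode='fan_in', nonlinearity='tanh')
```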
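For the batch-normalization sections, here is a minimal batch-norm sketch that keeps running estimates of the mean and standard deviation. The variable names follow the lecture's convention; the momentum and epsilon values are illustrative assumptions.

```python
import torch

n_hidden = 200
bngain = torch.ones((1, n_hidden))           # learnable scale (gamma)
bnbias = torch.zeros((1, n_hidden))          # learnable shift (beta)
bnmean_running = torch.zeros((1, n_hidden))  # buffer: not trained by gradients
bnstd_running = torch.ones((1, n_hidden))    # buffer: not trained by gradients

def batchnorm(hpreact, training=True, momentum=0.001, eps=1e-5):
    global bnmean_running, bnstd_running
    if training:
        bnmean = hpreact.mean(0, keepdim=True)   # batch mean
        bnstd = hpreact.std(0, keepdim=True)     # batch std
        # update the running estimates outside the autograd graph
        with torch.no_grad():
            bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmean
            bnstd_running = (1 - momentum) * bnstd_running + momentum * bnstd
    else:
        # at inference there is no batch statistic, so use the running estimates
        bnmean, bnstd = bnmean_running, bnstd_running
    return bngain * (hpreact - bnmean) / (bnstd + eps) + bnbias

# Example: normalize a batch of 32 pre-activations, then apply the nonlinearity.
hpreact = torch.randn(32, n_hidden)
h = torch.tanh(batchnorm(hpreact))
```

In this sketch, bnmean_running and bnstd_running are the estimates used at inference time, when there is no batch to compute statistics over; this is also what the last comment below is asking about.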
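For the diagnostics in section 8, here is a minimal update-ratio sketch: after each optimizer step, compare the size of the update to the size of the parameter values; values around 1e-3 (about -3 on a log10 plot) suggest a reasonably tuned learning rate. The `parameters` and `lr` names are assumed to come from your own training loop.

```python
import torch

def update_ratios(parameters, lr):
    # log10 of (size of this step's update) / (size of the parameter values)
    with torch.no_grad():
        return [((lr * p.grad).std() / p.data.std()).log10().item()
                for p in parameters if p.grad is not None]

# Inside a training loop, after loss.backward() and the parameter update:
#     ud.append(update_ratios(parameters, lr))
# then plot the ratios over time and check they hover around -3.
```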
I keep coming back to these videos again and again. Andrej is a legend!
What is the purpose of bnmean_running and bnstd_running?