Andrej Karpathy
446K subscribers
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, ...
41 Comments
🐐
I still can't understand why BatchNorm helps against vanishing/exploding gradients. Are there any ideas?
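Not an authoritative answer, but a minimal sketch of the usual intuition: if the pre-activations feeding a tanh are too spread out, most units sit in the flat tails where the local gradient 1 - t^2 is nearly zero, so the backward signal dies out; batch norm standardizes the pre-activations to a roughly unit Gaussian at every step, keeping the units in their active region. The sizes and the 0.97 threshold below are arbitrary, for illustration only.

```python
import torch

torch.manual_seed(42)
x = torch.randn(1000, 100)        # a batch of 1000 examples, 100 features
w = torch.randn(100, 200)         # deliberately unscaled weights

hpre = x @ w                      # pre-activations with std ~ sqrt(100) = 10
t = torch.tanh(hpre)
print('saturated (|t| > 0.97):', (t.abs() > 0.97).float().mean().item())  # large fraction

# batch norm's core step: standardize each neuron's pre-activation over the batch
hpre_bn = (hpre - hpre.mean(0, keepdim=True)) / hpre.std(0, keepdim=True)
t_bn = torch.tanh(hpre_bn)
print('saturated after normalization:', (t_bn.abs() > 0.97).float().mean().item())  # small fraction
```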
Thank you @Andrej for bringing this series. You are a great teacher; the way you have simplified such seemingly complex topics is valuable to all students like me. 🙏
🎯Course outline for quick navigation:
[00:00-03:21]1. Continuing and refactoring neural networks for language modeling
-[00:00-00:30]Continuing makemore implementation with multilayer perceptron for character-level language modeling, planning to move to larger neural networks.
-[00:31-01:03]Understanding neural net activations and gradients in training is crucial for optimizing architectures.
-[02:06-02:46]Refactored code to optimize neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.
-[03:03-03:28]Using the torch.no_grad decorator to prevent gradient computation.
[03:22-14:22]2. Efficiency of torch.no_grad and neural net initialization issues
-[03:22-04:00]Using torch's no_grad makes computation more efficient by eliminating gradient tracking.
-[04:22-04:50]Poor network initialization causes a high initial loss of 27, which then rapidly decreases toward 2.
-[05:00-05:32]At initialization, we expect a roughly uniform distribution over the 27 characters, with a probability of about 1/27 for each.
-[05:49-06:19]Instead, the untrained neural net produces confidently skewed probability distributions, leading to the high loss.
-[12:08-12:36]Loss at initialization is now as expected, and the final loss improves to 2.12-2.16.
[14:24-36:39]3. Neural network initialization
-[16:03-16:31]When tanh outputs are close to -1 or 1, the local gradient in the chain rule approaches zero, halting backpropagation.
-[18:09-18:38]Concern over gradients being destroyed in the flat regions of the h outputs, diagnosed by examining their absolute values.
-[26:03-26:31]Optimization improved the validation loss from 2.17 to 2.10 after fixing the softmax and tanh layer issues.
-[29:28-30:02]The standard deviation of the pre-activations expands to about three, whereas the aim is a unit Gaussian distribution.
-[30:17-30:47]Scaling the weights down by 0.2 shrinks the Gaussian to a standard deviation of 0.6.
-[31:03-31:46]Initializing neural network weights for well-behaved activations, following Kaiming He et al. (see the initialization sketch after this outline).
-[36:24-36:55]Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.
[36:39-51:52]4. Neural net initialization and batch normalization
-[36:39-37:05]Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.
-[40:32-43:04]Batch normalization enables reliable training of deep neural nets, ensuring roughly Gaussian hidden states for improved performance.
-[40:51-41:13]Batch normalization from 2015 enabled reliable training of deep neural nets.
-[41:39-42:09]Standardizing hidden states to be unit Gaussian is a perfectly differentiable operation, a key insight of the paper (sketched in code after this outline).
-[43:20-43:50]Calculating standard deviation of activations, mean is average value of neuron's activation.
-[45:45-46:16]Back propagation guides distribution movement, adding scale and shift for final output
[51:52-01:01:35]5. Jittering and batch normalization in neural network training
-[52:10-52:37]Jittering input examples (via the coupling of examples within a batch) adds entropy, augments the data, and regularizes the neural net.
-[53:44-54:09]Batch normalization effectively controls activations and their distributions.
-[56:05-56:33]Batch normalization paper introduces running mean and standard deviation estimation during training.
-[01:00:46-01:01:10]Eliminated explicit calibration stage, almost done with batch normalization, epsilon prevents division by zero.
[01:01:36-01:09:21]6. Batch normalization and ResNet in PyTorch
-[01:02:00-01:02:30]Biases are subtracted out in batch normalization, reducing their impact to zero.
-[01:03:13-01:03:53]Using batch normalization to control activations in neural net, with gain, bias, mean, and standard deviation parameters.
-[01:07:25-01:07:53]Creating deep neural networks with weight layers, normalization, and non-linearity, as exemplified in the provided code.
[01:09:21-01:23:37]7. PyTorch weight initialization and batch normalization
-[01:10:05-01:10:32]PyTorch initializes weights from a uniform distribution scaled by 1/sqrt(fan_in).
-[01:11:11-01:11:40]Scaling weights by 1/sqrt(fan_in), and using a PyTorch batch normalization layer with 200 features.
-[01:14:02-01:14:35]Importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.
-[01:16:00-01:16:30]Batch normalization centers the data for Gaussian activations in deep neural networks.
-[01:17:32-01:18:02]Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.
[01:23:39-01:55:56]8. Custom pytorch layer and network analysis
-[01:24:01-01:24:32]Updating the buffers using an exponential moving average inside the torch.no_grad context manager.
-[01:25:47-01:27:11]The model has 46,000 parameters and uses PyTorch for the forward and backward passes, with visualizations of the forward-pass activations.
-[01:28:04-01:28:30]Saturation starts around 20%, then stabilizes at about 5% with a standard deviation of 0.65 when the gain is set to 5/3.
-[01:33:19-01:33:50]Setting gain correctly at 1 prevents shrinking and diffusion in batch normalization.
-[01:38:41-01:39:11]The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.
-[01:43:18-01:43:42]Monitoring the update-to-data ratio for the parameters to ensure efficient training, aiming for roughly -3 on a log10 plot (see the diagnostics sketch after this outline).
-[01:51:36-01:52:04]Introducing batch normalization and PyTorch modules for neural networks.
-[01:52:39-01:53:06]Introduction to diagnostic tools for neural network analysis.
-[01:54:45-01:55:50]Diagnostic tools in neural networks; initialization and backpropagation remain areas of active research with ongoing progress.
Outline offered by Coursnap
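A rough sketch of the initialization ideas summarized in sections 2-3 and 7 of the outline, assuming makemore-style sizes (27 characters, 3-character context, 10-dimensional embeddings, 200 hidden units); the exact values in the video may differ.

```python
import torch
import torch.nn.functional as F

vocab_size = 27                            # 26 letters plus '.'
n_embd, block_size, n_hidden = 10, 3, 200
fan_in = n_embd * block_size

# a uniform prediction over 27 characters gives loss -log(1/27) ~ 3.29,
# so an initial loss of 27 means the logits are confidently wrong, not just unlucky
print(-torch.log(torch.tensor(1.0 / vocab_size)))

g = torch.Generator().manual_seed(2147483647)
W1 = torch.randn((fan_in, n_hidden), generator=g) * (5/3) / fan_in**0.5  # tanh gain 5/3 over sqrt(fan_in)
W2 = torch.randn((n_hidden, vocab_size), generator=g) * 0.01             # shrink the output layer
b2 = torch.zeros(vocab_size)                                             # so initial logits are ~0

emb = torch.randn(32, fan_in)              # stand-in for an embedded batch
h = torch.tanh(emb @ W1)
print('hidden std:', h.std().item())       # well-behaved, not pinned at +-1
logits = h @ W2 + b2
y = torch.randint(0, vocab_size, (32,))    # stand-in targets
print('initial loss:', F.cross_entropy(logits, y).item())  # close to 3.29
```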
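A minimal sketch of the batch normalization step described in sections 4-5: standardize the hidden pre-activations over the batch, then apply a learnable gain and shift; every operation is differentiable, which is the key insight of the 2015 paper. The variable names (hpreact, bngain, bnbias) imitate the lecture's style but are assumptions here.

```python
import torch

n_hidden = 200
bngain = torch.ones((1, n_hidden), requires_grad=True)    # learnable scale
bnbias = torch.zeros((1, n_hidden), requires_grad=True)   # learnable shift
eps = 1e-5                                                 # prevents division by zero

hpreact = torch.randn(32, n_hidden) * 3 + 1.0              # some badly-scaled pre-activations
bnmean = hpreact.mean(0, keepdim=True)                     # per-neuron mean over the batch
bnstd = hpreact.std(0, keepdim=True)                       # per-neuron std over the batch
hpreact = bngain * (hpreact - bnmean) / (bnstd + eps) + bnbias

print(hpreact.mean().item(), hpreact.std().item())         # ~0 and ~1 at initialization
```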
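And a sketch of the two diagnostics from section 8: the fraction of saturated tanh units, and the update-to-data ratio, which the lecture suggests should sit near 1e-3 (about -3 on a log10 plot). The tensors here are stand-ins for whatever a real training loop would provide.

```python
import torch

# saturation: how many tanh outputs are pinned near +-1
h = torch.tanh(torch.randn(32, 200) * 0.8)                     # stand-in hidden activations
print('saturation:', (h.abs() > 0.97).float().mean().item())

# update-to-data ratio: size of one SGD update relative to the parameter values
lr = 0.1
p = torch.randn(30, 200, requires_grad=True)                   # stand-in parameter
loss = (torch.randn(32, 30) @ p).pow(2).mean()                 # dummy loss, just to get a gradient
loss.backward()
with torch.no_grad():
    ratio = (lr * p.grad).std() / p.std()
print('log10 update/data ratio:', torch.log10(ratio).item())   # the lecture's rule of thumb: healthy values near -3
```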
Amazing, knowledge that is hard as hell to find in other videos, and you also have an AMAZING skill for clearly explaining complex stuff.
This is a great lecture, especially the second half building intuition about diagnostics. Amazing stuff.
The amount of useful information in this video is impressive. Thanks for such good content.
I keep coming back to these videos again and again. Andrej is a legend!
Thanks for the fantastic download! You have changed my learning_rate in this area from 0.1 to something >1!
What is the purpose of bnmean_running and bnstd_running?
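Not an official answer, but a sketch of what those buffers do: bnmean_running and bnstd_running are exponential moving averages of the batch statistics, updated on the side during training (under torch.no_grad, since they are not trained by backprop) so that at test time a single example can be normalized without needing a whole batch. The 0.999/0.001 momentum follows the lecture's style; the helper function name is made up.

```python
import torch

n_hidden = 200
bnmean_running = torch.zeros((1, n_hidden))
bnstd_running = torch.ones((1, n_hidden))

def batchnorm(hpreact, training=True):
    global bnmean_running, bnstd_running
    if training:
        bnmean = hpreact.mean(0, keepdim=True)
        bnstd = hpreact.std(0, keepdim=True)
        with torch.no_grad():  # the running buffers are kept out of the autograd graph
            bnmean_running = 0.999 * bnmean_running + 0.001 * bnmean
            bnstd_running = 0.999 * bnstd_running + 0.001 * bnstd
    else:
        bnmean, bnstd = bnmean_running, bnstd_running  # used at evaluation / inference
    return (hpreact - bnmean) / (bnstd + 1e-5)

out = batchnorm(torch.randn(32, n_hidden))                     # a training step updates the buffers
single = batchnorm(torch.randn(1, n_hidden), training=False)   # inference works on a single example
```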