Andrej Karpathy
458K subscribers
We dive into some of the internals of MLPs with multiple layers and scrutinize the statistics of the forward pass activations, ...
50 Comments
I still can't understand why BatchNorm helps against vanishing/exploding gradients. Are there any ideas?
Thank you @Andrej for bringing us this series. You are a great teacher; the way you have simplified such seemingly complex topics is valuable to all students like me. 🙏
Amazing. Knowledge like this is really hard to find in other videos, and you have an AMAZING skill for clearly explaining complex stuff.
This is a great lecture, especially the second half building intuition about diagnostics. Amazing stuff.
The amount of useful information in this video is impressive. Thanks for such good content.
Thanks for the fantastic download! You have changed my learning_rate in this area from 0.1 to something >1!
🐐
🎯 Course outline for quick navigation:
[00:00-03:21]1. Continuing and refactoring neural networks for language modeling
-[00:00-00:30]Continuing makemore implementation with multilayer perceptron for character-level language modeling, planning to move to larger neural networks.
-[00:31-01:03]Understanding neural net activations and gradients in training is crucial for optimizing architectures.
-[02:06-02:46]Refactored code to optimize neural net with 11,000 parameters over 200,000 steps, achieving train and val loss of 2.16.
-[03:03-03:28]Using the torch.no_grad decorator to prevent gradient computation.
[03:22-14:22]2. Efficiency of torch.no_grad and neural net initialization issues
-[03:22-04:00]Using torch.no_grad makes computation more efficient by eliminating gradient tracking (see the no_grad sketch after this outline).
-[04:22-04:50]Network initialization causes high loss of 27, rapidly decreases to 1 or 2.
-[05:00-05:32]At initialization, the model aims for a uniform distribution among 27 characters, with roughly 1/27 probability for each.
-[05:49-06:19]Neural net creates skewed probability distributions leading to high loss.
-[12:08-12:36]Loss at initialization is as expected; improved to 2.12-2.16.
[14:24-36:39]3. Neural network initialization
-[16:03-16:31]The local gradient in the chain rule vanishes when tanh outputs are close to -1 or 1, halting backpropagation.
-[18:09-18:38]Concern over destructive gradients in flat regions of h outputs, tackled by analyzing absolute values.
-[26:03-26:31]Optimization improved validation loss from 2.17 to 2.10 by fixing the softmax and tanh layer issues.
-[29:28-30:02]The standard deviation expands to about three, whereas we want a roughly unit gaussian distribution of activations.
-[30:17-30:47]Scaling the weights down by 0.2 shrinks the gaussian to a standard deviation of 0.6.
-[31:03-31:46]Initializing neural network weights for well-behaved activations, following Kaiming He et al. (see the initialization sketch after this outline).
-[36:24-36:55]Modern innovations have improved network stability and behavior, including residual connections, normalization layers, and better optimizers.
[36:39-51:52]4. Neural net initialization and batch normalization
-[36:39-37:05]Modern innovations like normalization layers and better optimizers reduce the need for precise neural net initialization.
-[40:32-43:04]Batch normalization enables reliable training of deep neural nets, ensuring roughly gaussian hidden states for improved performance.
-[40:51-41:13]Batch normalization from 2015 enabled reliable training of deep neural nets.
-[41:39-42:09]Standardizing hidden states to be unit gaussian is a perfectly differentiable operation, a key insight in the paper.
-[43:20-43:50]Calculating standard deviation of activations, mean is average value of neuron's activation.
-[45:45-46:16]Backpropagation guides the movement of the distribution, with a scale and shift added for the final output.
[51:52-01:01:35]5. Jittering and batch normalization in neural network training
-[52:10-52:37]Padding input examples adds entropy, augments data, and regularizes neural nets.
-[53:44-54:09]Batch normalization effectively controls activations and their distributions.
-[56:05-56:33]Batch normalization paper introduces running mean and standard deviation estimation during training.
-[01:00:46-01:01:10]Eliminating the explicit calibration stage nearly completes batch normalization; epsilon prevents division by zero.
[01:01:36-01:09:21]6. Batch normalization and ResNet in PyTorch
-[01:02:00-01:02:30]Biases are subtracted out in batch normalization, reducing their impact to zero.
-[01:03:13-01:03:53]Using batch normalization to control activations in neural net, with gain, bias, mean, and standard deviation parameters.
-[01:07:25-01:07:53]Creating deep neural networks with weight layers, normalization, and non-linearity, as exemplified in the provided code.
[01:09:21-01:23:37]7. PyTorch weight initialization and batch normalization
-[01:10:05-01:10:32]PyTorch initializes weights from a uniform distribution scaled by 1/sqrt(fan_in).
-[01:11:11-01:11:40]Scaling weights by 1 over sqrt(fan_in), and using PyTorch's batch normalization layer with 200 features.
-[01:14:02-01:14:35]Importance of understanding activations and gradients in neural networks, especially as they get bigger and deeper.
-[01:16:00-01:16:30]Batch normalization centers data for gaussian activations in deep neural networks.
-[01:17:32-01:18:02]Batch normalization, influential in 2015, enabled reliable training of much deeper neural nets.
[01:23:39-01:55:56]8. Custom PyTorch layers and network analysis
-[01:24:01-01:24:32]Updating the running-statistics buffers with an exponential moving average inside a torch.no_grad context manager (see the batch-norm sketch after this outline).
-[01:25:47-01:27:11]The model has 46,000 parameters and uses PyTorch for the forward and backward passes, with visualizations of the forward-pass activations.
-[01:28:04-01:28:30]Saturation is about 20% at first, then stabilizes around 5% with a standard deviation of 0.65, due to the gain being set at 5/3.
-[01:33:19-01:33:50]Setting gain correctly at 1 prevents shrinking and diffusion in batch normalization.
-[01:38:41-01:39:11]The last layer has gradients 100 times greater, causing faster training, but it self-corrects with longer training.
-[01:43:18-01:43:42]Monitoring the update-to-data ratio for the parameters to ensure efficient training, aiming for about -3 on a log10 plot (see the update-ratio sketch after this outline).
-[01:51:36-01:52:04]Introducing batch normalization and PyTorch-style modules for neural networks.
-[01:52:39-01:53:06]Introduction to diagnostic tools for neural network analysis.
-[01:54:45-01:55:50]Introduction to diagnostic tools in neural networks; initialization and backpropagation remain areas of active research with ongoing progress.
offered by Coursnap
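For reference alongside section 2 of the outline, here is a minimal no_grad sketch: evaluating a loss under the torch.no_grad decorator so that no autograd graph is built. The tiny model and tensor shapes are illustrative stand-ins, not the lecture's network.

```python
import torch
import torch.nn.functional as F

# Illustrative stand-in model: a single linear layer with trainable weights.
W = torch.randn(10, 5, requires_grad=True)

@torch.no_grad()  # everything inside runs without gradient tracking
def eval_loss(x, y):
    logits = x @ W                       # forward pass only, no graph is built
    return F.cross_entropy(logits, y).item()

x = torch.randn(32, 10)                  # a batch of 32 examples
y = torch.randint(0, 5, (32,))           # integer class targets
print(eval_loss(x, y))                   # cheaper than a tracked forward pass
```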
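For the initialization discussion in sections 3 and 7, here is a minimal initialization sketch of the fan-in scaling: draw weights from a unit gaussian and scale by gain / sqrt(fan_in) so pre-activations keep roughly unit standard deviation. The layer sizes are illustrative; the 5/3 gain for tanh matches torch.nn.init.calculate_gain('tanh').

```python
import torch
from torch.nn import init

fan_in, fan_out = 30, 200                # illustrative layer sizes
gain = 5 / 3                             # recommended gain for tanh layers

# Manual Kaiming-style scaling: unit gaussian shrunk by gain / sqrt(fan_in).
W = torch.randn(fan_in, fan_out) * gain / fan_in**0.5
print(W.std())                           # roughly gain / sqrt(fan_in) ≈ 0.30

# PyTorch's helper applies the same rule; note it expects the nn.Linear
# weight layout (out_features, in_features), so fan_in is the second dim.
W2 = torch.empty(fan_out, fan_in)
init.kaiming_normal_(W2, mode='fan_in', nonlinearity='tanh')
```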
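For the batch-normalization sections, here is a minimal batch-norm sketch that keeps running estimates of the mean and standard deviation. The variable names follow the lecture's convention; the momentum and epsilon values are illustrative assumptions.

```python
import torch

n_hidden = 200
bngain = torch.ones((1, n_hidden))           # learnable scale (gamma)
bnbias = torch.zeros((1, n_hidden))          # learnable shift (beta)
bnmean_running = torch.zeros((1, n_hidden))  # buffer: not trained by gradients
bnstd_running = torch.ones((1, n_hidden))    # buffer: not trained by gradients

def batchnorm(hpreact, training=True, momentum=0.001, eps=1e-5):
    global bnmean_running, bnstd_running
    if training:
        bnmean = hpreact.mean(0, keepdim=True)   # batch mean
        bnstd = hpreact.std(0, keepdim=True)     # batch std
        # update the running estimates outside the autograd graph
        with torch.no_grad():
            bnmean_running = (1 - momentum) * bnmean_running + momentum * bnmean
            bnstd_running = (1 - momentum) * bnstd_running + momentum * bnstd
    else:
        # at inference there is no batch statistic, so use the running estimates
        bnmean, bnstd = bnmean_running, bnstd_running
    return bngain * (hpreact - bnmean) / (bnstd + eps) + bnbias

# Example: normalize a batch of 32 pre-activations, then apply the nonlinearity.
hpreact = torch.randn(32, n_hidden)
h = torch.tanh(batchnorm(hpreact))
```

In this sketch, bnmean_running and bnstd_running are the estimates used at inference time, when there is no batch to compute statistics over; this is also what the last comment below is asking about.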
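For the diagnostics in section 8, here is a minimal update-ratio sketch: after each optimizer step, compare the size of the update to the size of the parameter values; values around 1e-3 (about -3 on a log10 plot) suggest a reasonably tuned learning rate. The `parameters` and `lr` names are assumed to come from your own training loop.

```python
import torch

def update_ratios(parameters, lr):
    # log10 of (size of this step's update) / (size of the parameter values)
    with torch.no_grad():
        return [((lr * p.grad).std() / p.data.std()).log10().item()
                for p in parameters if p.grad is not None]

# Inside a training loop, after loss.backward() and the parameter update:
#     ud.append(update_ratios(parameters, lr))
# then plot the ratios over time and check they hover around -3.
```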
I keep coming back to these videos again and again. Andrej is a legend!
What is the purpose of bnmean_running and bnstd_running?