AI Explained
406K subscribers
Do we have a new best AI model, or do we have the downfall of benchmarks in general, as a way of capturing machine ...
43 Comments
I agree. We are in the era of "let's optimize domain XY by training this expert in the MoE set." The Mixture of Experts approach allows this (and also leads to benchmaxxing). I mean, it is still vastly useful. Surely Claude is really advanced in optimizing code and code patterns, and thus performs really nicely. But I think we are heading in the direction of a "collection of narrow AIs that talk together," like the Geth in Mass Effect. The Geth were an AI composed of many programs. I still think that it can lead to very useful things, but maybe it is not the "world-class AGI" that we expect.
Because I think that in some domains, we are already at AGI level.
Glad to see the bat signal worked
You can't solve hallucination in humans let alone machines. People and machines make mistakes.
i tried 3.1 pro. it infinite looped in the thinking phase and wasted countless tokens. then finally after like 5 mins the code it wrote had syntax errors
The TLDR from Google Gemini's summary
-
The video discusses the release of Gemini 3.1 Pro and the growing confusion surrounding AI model benchmarks (0:00). The speaker explains that post-training (1:01) is now the dominant stage in LLM development, leading to models excelling in specific domains rather than universally (1:39).
Here's a breakdown of the key points:
Domain Specialization and Benchmarks (1:39): The video highlights that models optimized for specific domains may perform differently in other areas. For example, Claude Opus 4.6, despite being strong in coding, performed poorly in a chess puzzle benchmark (2:01). This shows that older paradigms, where strong performance in one area meant strong performance in all, no longer apply.
ARC-AGI 2 Caveat (3:42): Gemini 3.1 Pro shows impressive results on ARC-AGI 2, outperforming other models (3:42). However, this is tempered by the observation that models might use "unintended arithmetic patterns" from numerical encodings, leading to accidentally correct solutions (4:22).
Simple Bench Record (5:54): Gemini 3.1 Pro set a new record on the speaker's private "Simple Bench," a test of common sense reasoning (5:57). This performance brings it within the margin of error of human average baseline, marking a significant threshold in AI capabilities in text-based tests (6:06).
Hallucination Caveat (8:22): The video addresses the issue of hallucinations (factual inaccuracies) in models. While Gemini 3.1 Pro appears to have a lower percentage of incorrect answers being hallucinations compared to some other models, it's noted that hallucinations are "definitely not a solved problem" (9:36).
Model Card Insights (9:54): The speaker touches on the Gemini 3.1 model card, highlighting that features like "deep think mode" might not always enhance capability despite the inference costs (10:18). However, it also shows examples where 3.1 Pro significantly reduces runtime in machine learning R&D tasks (10:38).
Exponential Growth and Anthropic's Strategy (11:12): The video discusses the rapid revenue growth of AI companies like Anthropic and OpenAI (11:32). Dario Amodei, CEO of Anthropic, suggests that by specializing in enough "specialisms," models might generalize to all specialisms, potentially achieving super intelligence without extensive continual learning (12:55).
The Elusive "One True Benchmark" (15:10): The video questions the existence of a single benchmark for general intelligence, noting that labs are incentivized to create their own benchmarks, which can lead to bias (15:52). Forecasting the future is presented as a truly objective benchmark, with models approaching human forecaster levels (16:01).
Other Metrics and Realism (17:02): The video briefly touches upon other benchmarks like speed, showcasing a model that can generate full answers in milliseconds (17:09). Finally, it highlights the increasing realism in video generation with examples like "Seed Dance 2.0" (17:44).
LLMs were promised to generalise. Turns out it's whack-a-mole after all. Feel the AGI.
Boring. The last ten videos just pick arguments with this and that. It’s all about nothing, really.
Buffett used to say the taste of cola doesn’t age: he keeps wanting more. Others don’t have that quality; after a while, you can’t stand their taste.
Your narrative, videos feel the second type now 😟
Yeah i stopped trusting benchmarks after the release of Gemini Pro 3.0. It killed it in benchmarks and gave me the impression it was the best model. I've been comparing it against ChatGPT and Claude subs extensively for the past 2 months using all 3 in tandem with the same prompts (trying to figure out which sub to choose), i found that Gemini 3 is by far the most "confidently wrong" most often. It never says "i don't know", it will answer incorrectly and worst of all it will do it confidently & persuasively. And when you try to correct it it doesn't get the message into its head, it'll either push back or dismiss it. For this reason i started to find Gemini hard to trust with certain things (like troubleshooting etc) because i could never know if it was giving me misinformation or not. There were times that i found that if i hadn't put the effort into verifying what it told me then i would've gone ahead believing what it said was true when it wasn't. You can't trust that.
The conclusion i make from that is that Google must be training it to be good at benchmarks rather than real-world usability.
So these days i default to Claude or ChatGPT (still trying to figure out which one i prefer) simply because if they don't know something they will say so, Claude probably moreso than ChatGPT in this regard, but neither of them are as bad as Gemini 3.
Don't get me wrong, Gemini 3 is better for certain things (e.g. multimodal stuff, and i prefer the way Gemini explains/words things), but until they fix the confident hallucination i can't trust it. At this point the only reason i'm still subscribed is because of NotebookLM
What about Grok 4.20? Is that one running among the smartest now?
ive grown cold to ai
seems like incremental improvements won't get us to agi and labs do nothing interesting in that regard