Feeda - OnScreen Live

Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI

AI Explained

408K subscribers

105.5k views • 2 months ago

Do we have a new best AI model, or do we have the downfall of benchmarks in general, as a way of capturing machine ...

48 Comments

@pierQRzt180 2 months ago

I agree. We are in the era of "let's optimize domain XY by training this expert in the MoE set." The Mixture of Experts approach allows this (and also leads to benchmaxxing). [...] an, it is still vastly useful. Surely Claude is really advanced in optimizing code and code patterns, and thus performs really nicely. But I think we are heading in the direction of a "collection of narrow AIs that talk together," like the Geth in Mass Effect. The Geth were an AI composed of many programs. I still think that it can lead to very useful things, but maybe it is not the "world-class AGI" that we expect.

Because I think that in some domains, we are already at AGI level.

@executivelifehacks6747  2 months ago

Glad to see the bat signal worked

@torarinvik4920  2 months ago

You can't solve hallucination in humans let alone machines. People and machines make mistakes.

@stephenkamenar  2 months ago

i tried 3.1 pro. it infinite looped in the thinking phase and wasted countless tokens. then finally after like 5 mins the code it wrote had syntax errors

@YouTubeCommenterChap  2 months ago

The TLDR from Google Gemini's summary
-
The video discusses the release of Gemini 3.1 Pro and the growing confusion surrounding AI model benchmarks (0:00). The speaker explains that post-training (1:01) is now the dominant stage in LLM development, leading to models excelling in specific domains rather than universally (1:39).

Here's a breakdown of the key points:

Domain Specialization and Benchmarks (1:39): The video highlights that models optimized for specific domains may perform differently in other areas. For example, Claude Opus 4.6, despite being strong in coding, performed poorly in a chess puzzle benchmark (2:01). This shows that older paradigms, where strong performance in one area meant strong performance in all, no longer apply.

ARC-AGI 2 Caveat (3:42): Gemini 3.1 Pro shows impressive results on ARC-AGI 2, outperforming other models (3:42). However, this is tempered by the observation that models might use "unintended arithmetic patterns" from numerical encodings, leading to accidentally correct solutions (4:22).

Simple Bench Record (5:54): Gemini 3.1 Pro set a new record on the speaker's private "Simple Bench," a test of common sense reasoning (5:57). This performance brings it within the margin of error of human average baseline, marking a significant threshold in AI capabilities in text-based tests (6:06).

Hallucination Caveat (8:22): The video addresses the issue of hallucinations (factual inaccuracies) in models. While Gemini 3.1 Pro appears to have a lower percentage of incorrect answers being hallucinations compared to some other models, it's noted that hallucinations are "definitely not a solved problem" (9:36).

Model Card Insights (9:54): The speaker touches on the Gemini 3.1 model card, highlighting that features like "deep think mode" might not always enhance capability despite the inference costs (10:18). However, it also shows examples where 3.1 Pro significantly reduces runtime in machine learning R&D tasks (10:38).

Exponential Growth and Anthropic's Strategy (11:12): The video discusses the rapid revenue growth of AI companies like Anthropic and OpenAI (11:32). Dario Amodei, CEO of Anthropic, suggests that by specializing in enough "specialisms," models might generalize to all specialisms, potentially achieving super intelligence without extensive continual learning (12:55).

The Elusive "One True Benchmark" (15:10): The video questions the existence of a single benchmark for general intelligence, noting that labs are incentivized to create their own benchmarks, which can lead to bias (15:52). Forecasting the future is presented as a truly objective benchmark, with models approaching human forecaster levels (16:01).

Other Metrics and Realism (17:02): The video briefly touches upon other benchmarks like speed, showcasing a model that can generate full answers in milliseconds (17:09). Finally, it highlights the increasing realism in video generation with examples like "Seedance 2.0" (17:44).

@shApYT  2 months ago

LLMS were promised to generalise. Turns out it's whack-a-mole after all. Feel the AGI.

@vassilisworld  2 months ago

Boring. The last ten videos just pick arguments with this and that. It’s all about nothing, really.

Buffett used to say the taste of cola doesn’t age; he keeps wanting more. Ot [...]’t have that quality; after a while, you can’t stand their taste.

Your narrative, videos feel the second type now 😟

@Barncore  2 months ago

Yeah i stopped trusting benchmarks after the release of Gemini Pro 3.0. It killed it in benchmarks and gave me the impression it was the best model. I've been comparing it against ChatGP [...] de subs extensively for the past 2 months using all 3 in tandem with the same prompts (trying to figure out which sub to choose), i found that Gemini 3 is by far the most "confidently wrong" most often. It never says "i don't know", it will answer incorrectly and worst of all it will do it confidently & persuasively. And when you try to correct it it doesn't get the message into its head, it'll either push back or dismiss it. For this reason i started to find Gemini hard to trust with certain things (like troubleshooting etc) because i could never know if it was giving me misinformation or not. There were times that i found that if i hadn't put the effort into verifying what it told me then i would've gone ahead believing what it said was true when it wasn't. You can't trust that.

The conclusion i make from that is that Google must be training it to be good at benchmarks rather than real-world usability.

So these days i default to Claude or ChatGPT (still trying to figure out which one i prefer) simply because if they don't know something they will say so, Claude probably moreso than ChatGPT in this regard, but neither of them are as bad as Gemini 3.

Don't get me wrong, Gemini 3 is better for certain things (e.g. multimodal stuff, and i prefer the way Gemini explains/words things), but until they fix the confident hallucination i can't trust it. At this point the only reason i'm still subscribed is because of NotebookLM

@iugey  2 months ago

What about Grok 4.20? Is that one running among the smartest now?

@flamyf  2 months ago

ive grown cold to ai
seems like incremental improvements won't get us to agi and labs do nothing interesting in that regard

AI News

15:00

This Breakthrough Could Change the Path to AGI

TheAIGRID

11.9k views   •   2 months ago

20:52

this EX-OPENAI RESEARCHER just released it...

Wes Roth

55.6k views   •   2 months ago

15:57

This New OpenAI Leak Changes Everything About GPT-6

TheAIGRID

15.3k views   •   2 months ago

27:28

Claude JUST became AWARE

Wes Roth

109.4k views   •   2 months ago

22:12

AI News - New Models From Google & OpenAI , AI Drama & Humanoids In Factories

TheAIGRID

10.8k views   •   2 months ago

19:53

Google, OpenAI & Anthropic All Reported the Same Threat

TheAIGRID

12.3k views   •   2 months ago

21:52

What the New ChatGPT 5.4 Means for the World

AI Explained

78.4k views   •   2 months ago

58:29

OpenAI's GPT-5.4 Is a Beast. But Good Luck Staying King.

AI For Humans

9.7k views   •   2 months ago

23:50

OpenAI’s New GPT-5.4 Pro Is Now The Smartest AI In The World.

TheAIGRID

16.8k views   •   2 months ago

13:15

GPT 5.4 "we see no wall"

Wes Roth

52.6k views   •   2 months ago

Shorts

Google AI presents VEO it's answer to Op...
6.2k views

Yes, This AI Was Trained on Babies #arti...
2.1k views

Weird New AI Facial Animation Software #...
1.4k views

ChatGPT Turns You Into a Bad Electrician...
2.8k views

AI Can Make You Run On Water #artificial...
2.0k views

15:17

wtf is Harness Engineer & why is it important

AI Jason

27.9k views   •   2 months ago

16:40

GPT 5.4 leaks

Wes Roth

46.8k views   •   2 months ago

11:33

GPT-5.3 Instant & Gemini 3.1 Flash Lite - OpenAI and Google’s Newest And Fastest AI Yet

TheAIGRID

7.7k views   •   2 months ago

39:39

Cal Newport AI takes are WILD...

Wes Roth

39.2k views   •   2 months ago

01:03

Claude Code and I are building AI stuff all day long. Yes, I’m Clauderotting. #ai #claude #coding

AI For Humans

2.2k views   •   2 months ago

14:10

Grok 5 Could Be xAI’s Biggest Breakthrough Yet - Nobody Noticed This

TheAIGRID

21.9k views   •   2 months ago

19:31

Perplexity AI Computer Tutorial With New Usecases (How To Use Perplexity AI Computer )

TheAIGRID

10.1k views   •   2 months ago

23:53

Claude kill count going up

Wes Roth

35.1k views   •   2 months ago

16:39

Google’s AGI Plan Just Got Clearer (Demis Hassabis Explains)

TheAIGRID

23.0k views   •   2 months ago

29:32

Did OpenAI Just Help the Government Kill Anthropic?

TheAIGRID

11.0k views   •   2 months ago

12:34

OpenAI & Google Just JOINED FORCES - Staff Demand “No Killer AI”

TheAIGRID

10.4k views   •   2 months ago

13:00

Anthropic REFUSES Military Demands, Pentagon Left STUNNED!

TheAIGRID

7.6k views   •   2 months ago

13:40

Deadline Day for Autonomous AI Weapons & Mass Surveillance

AI Explained

40.1k views   •   2 months ago

58:08

Nano Banana 2 Launched. It's Fine. But Seedance 2.0 Might Be Great.

AI For Humans

9.0k views   •   2 months ago

16:26

The US Government is Threatening to SEIZE Claude

TheAIGRID

10.7k views   •   2 months ago

32:43

The 2028 Global Intelligence Crisis Explained - What Happens When AI Breaks The Economy?

TheAIGRID

24.0k views   •   2 months ago

04:53

How to prompt Gemini 3.1 for Epic animations

AI Jason

20.6k views   •   2 months ago

32:27

$1 Trillion Gone

Wes Roth

56.0k views   •   2 months ago

07:00

Every Vibe Coder Needs This AI Agent - Kane AI Testing Agent

TheAIGRID

3.0k views   •   2 months ago

24:44

the SCARIEST chart in AI

Wes Roth

78.9k views   •   2 months ago

96:00

"The Universe Is A PROGRAM" Is this the SOURCE CODE of our Universe? - Stephen Wolfram

Wes Roth

44.3k views   •   2 months ago

19:59

Sam Altman Sparks OUTRAGE With Controversial AI Comment

TheAIGRID

13.9k views   •   2 months ago

14:10

Anthropic killed Tool calling

AI Jason

183.5k views   •   2 months ago

13:09

OpenClaw Setup Tutorial With New Usecases (OpenClaw Usecases 2026)

TheAIGRID

4.7k views   •   2 months ago

21:55

AGI by 2028? Sam Altman Just Changed the Timeline

TheAIGRID

15.3k views   •   2 months ago

17:29

How to Build ANYTHING with Oz by Warp

Wes Roth

16.9k views   •   2 months ago

18:50

Gemini 3.1 Pro and the Downfall of Benchmarks: Welcome to the Vibe Era of AI

AI Explained

105.5k views   •   2 months ago

52:21

Gemini 3.1 Just Dropped. SuperIntelligence Is Coming. We're Fine.

AI For Humans

14.0k views   •   2 months ago

12:43

Gemini 3.1 Pro For Beginners - All New Features Explained (Gemini 3.1 Pro Tutorial)

TheAIGRID

30.9k views   •   2 months ago

10:58

did Anthropic just END OpenClaw?

Wes Roth

48.8k views   •   2 months ago

21:15

Meta's New AI Is Freaking Everyone Out...

TheAIGRID

20.0k views   •   2 months ago

15:28

Agent memory resolved?

AI Jason

38.4k views   •   2 months ago

10:15

Grok 4.2 Agents For Beginners - Grok 4.2 Full Guide With Usecases

TheAIGRID

12.2k views   •   2 months ago

18:06

Elon Musk vs OpenAI Just Took a Wild Turn

TheAIGRID

19.7k views   •   2 months ago

02:30

Seedance 2.0 Proves China is catching up in AI #ai #ainews #tech

AI For Humans

5.3k views   •   2 months ago

10:32

Gemini 3 Deepthink For Beginners - Gemini 3 Deepthink Full Guide With Usecases

TheAIGRID

9.2k views   •   2 months ago

11:24

WebMCP - Why is awesome & How to use it

AI Jason

52.5k views   •   2 months ago

12:54

8 BILION DIGITAL CLONES

Wes Roth

33.2k views   •   2 months ago

01:50

Seedance 2.0 is scary. #ai #aivideo #seedance2

AI For Humans

5.0k views   •   2 months ago

26:19

it JUST happened

Wes Roth

102.1k views   •   3 months ago

22:33

Google Gemini 3 DeepThink Is Now the Smartest AI In The World

TheAIGRID

57.9k views   •   3 months ago

62:43

Seedance 2.0 Is Peak AI Video. We Tested It. Send Help.

AI For Humans

28.1k views   •   3 months ago

23:07

Insider QUITS OpenAI and Sounds the Alarm - They're making a BIG mistake.

TheAIGRID

12.5k views   •   3 months ago

10:31

AI Video Just Went TOO FAR... NOT OK!

Wes Roth

39.2k views   •   3 months ago

50:36

INSTALL OPENCLAW in 30 seconds and START BUILDING... | Local Install and VPS FULL Tutorial

Wes Roth

75.8k views   •   3 months ago

43:03

Elon Musk Reveals the Future of AI - XAI Full Reveal (Supercut)

TheAIGRID

59.8k views   •   3 months ago

08:06

How To Access Seedance 2.0 - Seedance 2.0 Tutorial Complete Guide For Beginners

TheAIGRID

43.1k views   •   3 months ago

21:07

Elon Musk’s AI Startup Is Falling Apart?

TheAIGRID

28.9k views   •   3 months ago

22:15

Did Anthropic Accidentally Create a Conscious AI?

TheAIGRID

53.8k views   •   3 months ago

18:32

Sam Altman Breaks Silence On The AI Chaos

TheAIGRID

25.1k views   •   3 months ago

15:29

OPUS 4.6 thinks it's "DEMON POSSESSED"

Wes Roth

68.2k views   •   3 months ago

12:35

Meta’s Most Powerful AI Model Just Leaked - (Meta Avocado)

TheAIGRID

14.8k views   •   3 months ago

09:57

China’s New Robot "Bolt" Just Broke the Human Speed Limit

TheAIGRID

12.8k views   •   3 months ago

09:39

How to install and use Claude Code Agent Teams (Reverse-engineered)

AI Jason

25.2k views   •   3 months ago

10:07

Ex-OpenAI Researcher Says They're ALL Wrong About AI

TheAIGRID

13.8k views   •   3 months ago

19:50

The Two Best AI Models/Enemies Just Got Released Simultaneously

AI Explained

79.3k views   •   3 months ago

10:57

Claude Opus 4.6 For Beginners - All New Features Explained (Claude Opus 4.6 Tutorial)

TheAIGRID

7.5k views   •   3 months ago

57:15

AI Coding Updates from OpenAI & Anthropic Are Good... Maybe Too Good?

AI For Humans

11.2k views   •   3 months ago

10:30

xAI will birth a SENTIENT SUN...

Wes Roth

20.1k views   •   3 months ago

15:44

OpenAI's FRONTIER might be the "JOB KILLER" we were waiting for

Wes Roth

46.1k views   •   3 months ago

05:25

How To Use Google Image FX - Image FX Google Tutorial

TheAIGRID

623 views   •   3 months ago

10:03

Opus 4.6 is about to send SHOCKWAVES...

Wes Roth

57.4k views   •   3 months ago

01:35

Opus 4.6 & GPT-5.3 Incoming!! #ai #ainews #openai

AI For Humans

2.5k views   •   3 months ago

10:55

The Internet Is Turning Against ChatGPT - Here's Why - ChatGPT Boycotts Explained

TheAIGRID

9.5k views   •   3 months ago

12:49

Sam Altman FIRES Back At Critics - We Are Not STUPID!

TheAIGRID

11.1k views   •   3 months ago

10:47

AI bubble JUST popped...

Wes Roth

92.3k views   •   3 months ago

10:12

OpenAI Stunned as Anthropic Takes Shots at ChatGPT (Anthropic super bowl)

TheAIGRID

12.5k views   •   3 months ago

01:47

Starting an AI Video MicroStudio?? #ai #aivideo #aiagents

AI For Humans

1.2k views   •   3 months ago

05:19

Google Gemini Agentic Vision Tutorial - How To Use Google Gemini Agentic Vision

TheAIGRID

4.0k views   •   3 months ago

31:51

ClawdBot makes money

Wes Roth

80.7k views   •   3 months ago

08:50

Sam Altman Finally Admits It: "We Screwed Up"

TheAIGRID

41.5k views   •   3 months ago

08:05

How To Setup OpenClaw For Beginners - OpenClaw Tutorial For Complete Beginners

TheAIGRID

3.4k views   •   3 months ago

32:40

ClawdBot BROKE EVERYTHING in 72 hours...

Wes Roth

78.3k views   •   3 months ago

13:22

Yann LeCun Just Called Out the Entire Robotics Industry

TheAIGRID

23.6k views   •   3 months ago

13:22

Yann LeCun Just Called Out the Entire Robotics Industry

TheAIGRID

2.3k views   •   3 months ago

01:44

Project Genie makes WEIRD AI games #ai #videogames #googleai

AI For Humans

1.2k views   •   3 months ago

11:31

Grok Imagine Tutorial - How To Use Grok Imagine 1.0 for Beginners

TheAIGRID

10.9k views   •   3 months ago

10:00

MOLTBOOK EXPOSED: The New AI Scam That Fooled Everyone

TheAIGRID

44.1k views   •   3 months ago

25:00

Clawdbot is about to BREAK EVEREYTHING

Wes Roth

121.8k views   •   3 months ago

11:46

Moltbook Just Stunned The Entire AI Industry And Is Now Out Of Control....

TheAIGRID

23.0k views   •   3 months ago

01:22

Is Moltbook Real? #ai #aiagents #moltbook

AI For Humans

21.6k views   •   3 months ago

11:25

Googles New Genie 3 Just Shocked The AI World And Broke The Stock Market.

TheAIGRID

7.2k views   •   3 months ago

53:42

The AI Holodeck Just Got Real: Google's Project Genie

AI For Humans

12.0k views   •   3 months ago

23:17

Google's MIND BLOWING World Creator (GENIE 3)

Wes Roth

35.0k views   •   3 months ago

08:02

Project Genie Tutorial (How to use Project Genie)

TheAIGRID

9.4k views   •   3 months ago

17:30

KIMI K2.5 AGENT SWARM is INSANE

Wes Roth

34.3k views   •   3 months ago

22:13

Claude AI Co-founder Publishes 4 Big Claims about Near Future: Breakdown

AI Explained

69.4k views   •   3 months ago

23:08

"Almost UNIMAGINABLE Power" - Anthropic Founder

Wes Roth

36.2k views   •   3 months ago

23:08

"Almost UNIMAGINABLE Power" - Anthropic Founder

Wes Roth

41 views   •   3 months ago

09:19

Social Media is Melting Down Over This OpenAI Headline (Here’s the Reality)

TheAIGRID

7.6k views   •   3 months ago

11:31

Microsoft CEO: AI Fails If This Doesn’t Happen

TheAIGRID

13.4k views   •   3 months ago

27:09

ClawdBot is out of control

Wes Roth

77.7k views   •   3 months ago

10:51

Why MCP is dead & How I vibe now

AI Jason

18.6k views   •   3 months ago

09:46

Google’s AI CEO Just Called Out OpenAI Over AGI Claims

TheAIGRID

21.1k views   •   3 months ago

08:19

I Didn’t Expect This AI Tool to Be This Good - Higgsfield AI Is Stunning

TheAIGRID

4.3k views   •   3 months ago

59:16

Claude Code Is Taking Over (And We Don't Hate It)

AI For Humans

11.7k views   •   3 months ago

09:27

This Is How You Know AGI Is Close...

TheAIGRID

19.4k views   •   3 months ago

11:05

The First AI Browser That Actually Works - Norton Neo AI Browser

TheAIGRID

4.4k views   •   3 months ago

09:31

RIP OpenAI? Apple Dumps ChatGPT for Google Gemini!

TheAIGRID

13.7k views   •   3 months ago

08:10

Elon Musks Grok Is Probably Going To Be Banned...

TheAIGRID

8.9k views   •   3 months ago

01:20

Will you opt into Gemini’s Personal Intelligence AI?

AI For Humans

2.8k views   •   3 months ago

12:56

Nvidia Just Changed Self Driving Forever - Tesla Should Be Worried

TheAIGRID

6.8k views   •   3 months ago

55:25

Google's AI Knows Everything About You (We Said Yes)

AI For Humans

12.3k views   •   3 months ago

19:03

Anthropic: Our AI just created a tool that can ‘automate all white collar work’, Me:

AI Explained

104.1k views   •   4 months ago

18:16

How Googles Winning The AI Race

TheAIGRID

19.3k views   •   4 months ago

13:18

Boston Dynamics Atlas Is The Only New Humanoid That Matters

TheAIGRID

21.4k views   •   4 months ago

01:56

Google Gemini Will Power Apple’s Siri AI!?!

AI For Humans

8.6k views   •   4 months ago

41:56

The AI Robot Uprising Has Begun (And It's Weirder Than You Think)

AI For Humans

11.1k views   •   4 months ago

02:31

Star Wars fan film made with AI actually works #starwars #ai #shorts

AI For Humans

2.0k views   •   4 months ago

15:52

The Growing AI Backlash Nobody Wants to Talk About.

TheAIGRID

49.7k views   •   4 months ago

10:25

A New Kind of AI Is Emerging And Its Better Than LLMS?

TheAIGRID

448.8k views   •   4 months ago

18:54

Secrets to unlock Gemini 3's hidden power...

AI Jason

73.4k views   •   4 months ago

33:27

What the Freakiness of 2025 in AI Tells Us About 2026

AI Explained

123.6k views   •   4 months ago

09:34

NVIDIA's New AI Agent Just Crossed the Line - The Age of AI Agents Begins (Nvidia Nitrogen)

TheAIGRID

18.8k views   •   4 months ago

20:00

Gemini Exponential, Demis Hassabis' ‘Proto-AGI’ coming, but …

AI Explained

89.8k views   •   4 months ago

45:20

OpenAI’s new ChatGPT Images is here! But…Will it top Nano Banana Pro?

AI For Humans

9.6k views   •   4 months ago

25:37

China’s "Impossible" AI Breakthrough: We Are In Trouble

TheAIGRID

42.4k views   •   4 months ago

35:11

AI News :The First “AGI-Capable” Model, Prompting Changes Forever , Automated AI Lab and more..

TheAIGRID

23.1k views   •   4 months ago

10:39

OpenAI Researcher QUITS — Says the Company Is Hiding the Truth - (It Actually Worse Than You Think)

TheAIGRID

56.1k views   •   4 months ago

13:03

Ex Google AI Veteran Claims Worlds First AGI Capable System - And Nobodys Talking About it...

TheAIGRID

15.1k views   •   5 months ago

17:42

GPT 5.2: OpenAI Strikes Back

AI Explained

89.4k views   •   5 months ago

55:07

GPT-5.2 Finally Arrived, But The Disney Deal is Bigger

AI For Humans

11.5k views   •   5 months ago

11:29

Nano Banana + Gemini 3 = S-TIER UI DESIGNER

AI Jason

96.4k views   •   5 months ago

41:56

The Latest AI Breakthroughs You Need to See (Google, OpenAI, Deepseek and More)

TheAIGRID

26.6k views   •   5 months ago

33:44

AI News : Deepseek Returns, Amazons Secret AI Models, Googles Breakthrough , Veo 3 Beaten and More

TheAIGRID

9.4k views   •   5 months ago

10:12

Google’s New Breakthrough Brings AGI Even Closer - Titans and Miras

TheAIGRID

20.4k views   •   5 months ago

20:16

You Are Being Told Contradictory Things About AI

AI Explained

74.9k views   •   5 months ago

49:17

OpenAI's Code Red: Can New AI Models Hold Off Google Gemini?

AI For Humans

10.7k views   •   5 months ago

11:29

ChatGPT Privacy CRACKS:The Court Now Has Your ChatGPT History

TheAIGRID

6.3k views   •   5 months ago

13:46

AI Is About to Change Coding Forever in 2026 - "Software Engineering Is Done"

TheAIGRID

25.8k views   •   5 months ago

10:43

Grok Thinks Elon Musk Is a God… This Is Where It Gets Dangerous

TheAIGRID

7.1k views   •   5 months ago

00:26

Do NOT do this with Nano Banana Pro #ai #aiart #google

AI For Humans

3.6k views   •   5 months ago

02:17

How to Tell If an Image Is AI-Generated (Beginner Friendly)

TheAIGRID

6.1k views   •   5 months ago

12:33

"okay, but I want Gemini3 to perform 10x for my specific use case" - Here is how

AI Jason

31.3k views   •   5 months ago

00:42

Nano Banana Pro: Take a Selfie With Every Version of You

AI For Humans

5.4k views   •   5 months ago

44:40

Google's Nano Banana Pro & Gemini 3 Just Changed Everything!

AI For Humans

15.8k views   •   5 months ago

14:56

Nano Banana Pro: But Did You Catch These 10 Details?

AI Explained

60.4k views   •   5 months ago

01:41

Google's Nano Banana Pro is INSANE

AI For Humans

6.0k views   •   5 months ago

21:43

Gemini 3 Pro: Breakdown

AI Explained

118.6k views   •   5 months ago

23:40

Gemini 3 Shows a Level of Intelligence We Haven’t Seen Before. (Gemini 3 Explained)

TheAIGRID

71.5k views   •   5 months ago

13:33

This Chip Could Give OpenAI an Unfair Advantage.

TheAIGRID

8.8k views   •   5 months ago

14:40

Researchers Just Broke AI’s Most Important Assumption. (We Were Wrong About LLMs)

TheAIGRID

27.0k views   •   5 months ago

15:37

If This Works… AGI Arrives Early. (Thermodynamic Computing)

TheAIGRID

113.8k views   •   5 months ago

15:07

Google’s SIMA 2: The Most Advanced AI Agent Ever Built

TheAIGRID

17.6k views   •   5 months ago

18:27

Is GPT-5.1 Really an Upgrade? But Models Can Auto-Hack Govts, so … there’s that

AI Explained

61.9k views   •   6 months ago

45:12

OpenAI Surprise Drops GPT-5.1 But Google Is Lurking

AI For Humans

12.1k views   •   6 months ago

12:54

Bubble or No Bubble, AI Keeps Progressing (ft. Relentless Learning + Introspection)

AI Explained

60.7k views   •   6 months ago

55:23

AI Job Losses Are Real. Don’t Panic (Yet).

AI For Humans

13.1k views   •   6 months ago

08:33

The Design Mode for Claude Code...

AI Jason

41.4k views   •   6 months ago

14:14

Did you miss these 2 AI stories? A *Real* LLM-crafted Breakthrough + Continual Learning Blocked?

AI Explained

58.4k views   •   6 months ago

05:14

Claude Skills - the SOP for your agent that is bigger than MCP

AI Jason

33.6k views   •   6 months ago

11:47

.agent folder is making claude code 10x better...

AI Jason

61.4k views   •   7 months ago

15:44

Sora 2 - It will only get more realistic from here

AI Explained

58.8k views   •   7 months ago

14:07

OpenAI Tests if GPT-5 Can Automate Your Job - 4 Unexpected Findings

AI Explained

67.5k views   •   7 months ago

11:32

ChatGPT Can Now Call the Cops, but 'Wait till 2100 for Full Job Impact' - Altman

AI Explained

20.2k views   •   7 months ago

11:32

ChatGPT Can Now Call the Cops, but 'Wait till 2100 for Full Job Impact' - Altman

AI Explained

48.7k views   •   7 months ago

06:41

Vibe Design is much better than I thought...

AI Jason

18.0k views   •   8 months ago

18:55

An ‘AI Bubble’? What Altman Actually said, the Facts and Nano Banana

AI Explained

57.9k views   •   8 months ago

16:02

I was using sub-agents wrong... Here is my way after 20+ hrs test

AI Jason

116.1k views   •   9 months ago

15:02

GPT-5 has Arrived

AI Explained

163.6k views   •   9 months ago

11:55

Genie 3: The World Becomes Playable (DeepMind)

AI Explained

199.3k views   •   9 months ago

18:44

I was using Claude Code wrong... The Ultimate Workflow

AI Jason

139.2k views   •   9 months ago

17:20

How Not to Read a Headline on AI (ft. new Olympiad Gold, GPT-5 …)

AI Explained

84.7k views   •   9 months ago

07:02

Claude Killer? My review on Kimi K2 after hrs of testing...

AI Jason

81.7k views   •   10 months ago

11:44

Grok 4 - 10 New Things to Know

AI Explained

179.1k views   •   10 months ago

09:29

Tired of AI-ish UI? Here is how to make it better...

AI Jason

53.2k views   •   10 months ago

16:39

Claude Designer is insane...Ultimate vibe coding UI workflow

AI Jason

188.3k views   •   10 months ago

26:20

When Will AI Models Blackmail You, and Why?

AI Explained

110.5k views   •   10 months ago

05:56

Vibe Versioning - Iterate UI in Cursor 10x faster

AI Jason

23.0k views   •   10 months ago

14:01

Apple’s ‘AI Can’t Reason’ Claim Seen By 13M+, What You Need to Know

AI Explained

101.8k views   •   11 months ago

22:02

Build the next Billion $ Agent 🚀

AI Jason

18.2k views   •   11 months ago

16:50

AI Accelerates: New Gemini Model + AI Unemployment Stories Analysed

AI Explained

96.4k views   •   11 months ago

03:35

10x better UI design for vibe coders - Use v0 directly in Cursor

AI Jason

52.7k views   •   11 months ago

19:05

Claude 4: Full 120 Page Breakdown … Is it the Best New Model?

AI Explained

99.1k views   •   11 months ago

04:25

How to make accurate UI Tweak in Cursor with Stagewise

AI Jason

24.5k views   •   11 months ago

14:02

Build MCP business for vibe coder

AI Jason

10.2k views   •   11 months ago

11:44

Cursor + Browser control = Self improving coding agent

AI Jason

35.5k views   •   1 year ago

131:12

How I use LLMs

Andrej Karpathy

2.3M views   •   1 year ago

211:24

Deep Dive into LLMs like ChatGPT

Andrej Karpathy

5.8M views   •   1 year ago

81:55

Founding fathers on today's America

Andrej Karpathy

34.7k views   •   1 year ago

241:26

Let's reproduce GPT-2 (124M)

Andrej Karpathy

1.0M views   •   1 year ago

30:38

Expert AI Developer Explains NEW OpenAI Assistants API v2 Release

Morningside AI

13.8k views   •   2 years ago

133:35

Let's build the GPT Tokenizer

Andrej Karpathy

1.0M views   •   2 years ago

26:56

Expert AI Developer Explains What OpenAI's Q* Means for Businesses

Morningside AI

4.2k views   •   2 years ago

45:54

Voiceflow CEO Talks GPTs, Future of AI Agencies and Chatbot Builders (Full Interview)

Morningside AI

10.1k views   •   2 years ago

59:48

[1hr Talk] Intro to Large Language Models

Andrej Karpathy

3.5M views   •   2 years ago

39:00

Expert AI Developer Explains What OpenAI 'GPTs' Mean For Businesses

Morningside AI

26.7k views   •   2 years ago

116:20

Let's build GPT: from scratch, in code, spelled out.

Andrej Karpathy

6.9M views   •   3 years ago

56:22

Building makemore Part 5: Building a WaveNet

Andrej Karpathy

264.4k views   •   3 years ago

115:24

Building makemore Part 4: Becoming a Backprop Ninja

Andrej Karpathy

328.8k views   •   3 years ago

115:58

Building makemore Part 3: Activations & Gradients, BatchNorm

Andrej Karpathy

475.4k views   •   3 years ago

75:40

Building makemore Part 2: MLP

Andrej Karpathy

510.0k views   •   3 years ago
