Episode 88

Llama 4 Caught Cheating Benchmarks? Meta Under Fire!

OPTIMIZE YOUR LIFE AND SUBSCRIBE: NO BENCHMARK CHEATING REQUIRED

Is Meta’s brand‑new Llama 4 only “state‑of‑the‑art” because it *trained on the test*? 🤔 In this episode of They Might Be Self‑Aware, Hunter Powers and Daniel Bishop dig into the evidence that Llama 4 was benchmark‑tuned, why top Meta engineers are distancing themselves from the release, and what it means for the future of AI evaluation. We also unpack OpenAI’s whirlwind month: GPT‑4.1, the death of GPT‑4.5 (the model that *beat the Turing Test*), the rumored $3 billion Windsurf buyout, and Sam Altman’s dream of the “10× developer.”

🔔 Subscribe for two no‑fluff AI & tech breakdowns every week: https://www.youtube.com/@tmbsa

---

KEY TAKEAWAYS
* Meta’s Llama 4 likely over‑fit to eval suites; benchmark scores ≠ real‑world quality.
* Massive resignations around release hint at internal disputes on ethics & transparency.
* AI benchmarks need a revamp; otherwise, every lab will “teach to the test.”
* OpenAI’s consolidation strategy (Windsurf, o‑series) mirrors Salesforce/Microsoft Office.
* GPT‑4.5’s sudden shutdown sparks debate: are “too‑human” models being shelved?
* Expect 10× productivity tools, not mass layoffs; history shows workload expands.

---

LISTEN ON THE GO
• Apple Podcasts: https://podcasts.apple.com/us/podcast/they-might-be-self-aware/id1730993297
• Spotify: https://open.spotify.com/show/3EcvzkWDRFwnmIXoh7S4Mb
• Full transcript & links: https://www.tmbsa.tech/episodes/llama-4-caught-cheating-benchmarks-meta-under-fire

For more info, visit our website at https://www.tmbsa.tech/

#AI #Llama4 #OpenAI #GPT4 #BenchmarkCheating #TuringTest #Meta #TechPodcast #MachineLearning #Productivity #10xDeveloper

They Might Be Self-Aware · Daniel Bishop, Hunter Powers

April 21, 2025 · 35m 58s


Show Notes

⏱️ CHAPTERS
00:00:00 - Metaverse banter
00:01:28 - Meta drops Llama 4: size, MoE architecture & first‑day hype
00:03:03 - “Cheating the test?” How Llama 4 climbed then fell on leaderboards
00:07:15 - Broken benchmarks, GPU tricks & lessons from 2000‑era graphics cards
00:11:16 - Should we trust today’s AI leaderboards? Transparency + corporate ties
00:16:15 - A/B testing 101 and why secret “mystery models” exist
00:18:13 - Model chaos at OpenAI: GPT‑4.1, o‑series, mini models & naming mess
00:24:28 - OpenAI = Salesforce of AI? Windsurf acquisition & product sprawl
00:26:33 - Sam Altman’s “10× productivity” promise—what it really means
00:27:15 - Will coders vanish or just do more? History of tech‑driven expectations
00:30:55 - Conspiracy corner: GPT‑4.5 passed the Turing Test… then got axed
00:34:45 - Wrap Up

Topics

llama 4 vs gpt, hunter powers, ai productivity, meta resignation, gpt‑4.5, openai, mixture of experts, turing test, benchmark cheating, ai podcast, windsurf acquisition, 10x developer, benchmark gaming, they might be self aware, ai evaluation, ai leaderboard, ai model comparison, ai benchmarks, meta ai, machine learning news, open‑weight model, salesforce of ai, tech podcast, gpt‑4.1, llama 4, artificial intelligence debate, daniel bishop, ai controversy, ai ethics
