PLAY PODCASTS
Benchmarking AI Models

Benchmarking AI Models

How do you know if a new AI model is actually bet…

Linear Digressions

March 30, 202629m 55s

Audio is streamed directly from the publisher (feeds.soundcloud.com) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.

Show Notes

How do you know if a new AI model is actually better than the last one? It turns out answering that question is a lot messier than it sounds. This week we dig into the world of LLM benchmarks — the standardized tests used to compare models — exploring two canonical examples: MMLU, a 14,000-question multiple choice gauntlet spanning medicine, law, and philosophy, and SWE-bench, which throws real GitHub bugs at models to see if they can fix them. Along the way: Goodhart's Law, data contamination, canary strings, and why acing a test isn't always the same as being smart.

Topics

datasciencemachinelearninglineardigressions