Season 2 · Episode 1479

The Speed of Thought: Inside the New Era of Inference

The war for model size is over. Explore the engineering breakthroughs making massive AI models faster than human thought.

My Weird Prompts · Daniel Rosehill

March 23, 2026 · 20m 55s

Show Notes

For years, the AI industry was obsessed with parameter counts, but as of 2026, the competition has moved to what this episode calls the Deployment Era. It is no longer about who has the most parameters in a server room; it is about who can serve the most intelligent tokens at a speed that feels like human thought. This episode dives deep into how massive three-trillion-parameter models like Grok-3 and Grok-4 are achieving real-time streaming speeds that were once thought impossible. We explore the radical efficiency of Mixture of Experts (MoE) architectures, the precision of Latent Routing, and the memory savings of hierarchical quantization. From Multi-Token Prediction to the "draft and verify" system of speculative decoding, we break down the engineering feats that let these enormous models respond far faster than their size suggests. Discover why inference now accounts for two-thirds of all AI compute spend and how the industry is moving from building the brain to effectively using it.
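
For listeners who want to see the "draft and verify" idea in concrete form, here is a minimal sketch of greedy speculative decoding. It is an illustration under stated assumptions, not a description of how any production model mentioned in the episode is served. The names `speculative_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are plain functions that map a token prefix to the next token. Real systems verify all drafted positions in one batched forward pass and usually work with sampled probabilities rather than greedy picks.

```python
from typing import Callable, List

# Assumption: a "model" is any function from a token prefix to its next token.
NextToken = Callable[[List[int]], int]


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], num_new: int, k: int = 4) -> List[int]:
    """Greedy draft-and-verify; output equals decoding with `target` alone."""
    tokens = list(prompt)
    goal = len(prompt) + num_new
    while len(tokens) < goal:
        # Draft: the cheap model proposes up to k tokens, one after another.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(tokens))):
            proposal.append(draft(tokens + proposal))
        # Verify: walk the proposal and compare with the target's own choice.
        # (Done sequentially here for clarity; real systems batch this step.)
        for guess in proposal:
            expected = target(tokens)
            tokens.append(expected)  # the target's token is always the one kept
            if guess != expected:
                break  # first mismatch: discard the rest of the draft
    return tokens


# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 17


def toy_draft(prefix: List[int]) -> int:
    # Deliberately wrong whenever the last token is a multiple of 5.
    return 0 if prefix[-1] % 5 == 0 else toy_target(prefix)
```

Calling `speculative_decode(toy_target, toy_draft, [2], 10)` returns the same sequence that `toy_target` would produce on its own. The speedup in practice comes from the large model checking several drafted tokens per pass instead of generating one token per pass.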
