Season 2 · Episode 1556

Faster Than Thought: The Engineering Behind Real-Time AI

From KV cache monsters to sub-100ms response times, explore the hardware and software innovations making real-time AI a reality.

My Weird Prompts · Daniel Rosehill

March 26, 2026 · 23m 47s

Show Notes

The dream of seamless, real-time interaction with AI is finally within reach, but the path there is paved with immense engineering challenges. This episode dives deep into the "war against latency," exploring how the industry is moving away from clunky, "bolted-on" multimodal models toward unified engines that perceive the world as a single stream of data. We break down the technical breakthroughs, from NVIDIA’s Rubin architecture and Groq’s high-speed LPUs to memory-saving tricks like Grouped-Query Attention and PagedAttention. Learn how techniques like Google’s TurboQuant and the Saguaro algorithm are shrinking the massive "KV cache monster" to achieve sub-100-millisecond response times. Whether it’s autonomous systems making split-second decisions or digital assistants that never miss a beat, the era of "the speed of thought" is here. Join us as we unpack the hardware-software synergy defining the next generation of artificial intelligence.
