Season 2 · Episode 1544

Why It Costs More to Run AI Than to Build It

Discover why the AI runtime is the unsung hero of the tech stack, determining whether your AI feels like a snappy conversation or a slow crawl.

My Weird Prompts · Daniel Rosehill

March 25, 2026 · 22m 3s

Show Notes

As of March 2026, the industry has crossed a threshold: more than half of all AI infrastructure spending now goes to inference, the cost of keeping models running, rather than to training. This shift has put the AI runtime, the software layer between hardware and model weights, at the center of the performance battle. This episode explores the architectural differences between local engines like Ollama and production-grade powerhouses like vLLM, and explains how innovations like PagedAttention and kernel fusion are driving a sixteen-fold increase in throughput. We also dive into the trade-offs between hardware-specific optimization and the portability of standards like ONNX, and what the new Kubernetes AI Requirements (KAIR) mean for the future of agentic deployment.