
Season 2 · Episode 267
Decoding the Transformer: From Attention to Inference
Herman and Corn dive into the mechanics of transformer inference, exploring how models turn massive matrices into meaningful conversation.
My Weird Prompts · Daniel Rosehill
January 21, 2026 · 19m 37s
Show Notes
In this episode, Herman and Corn break down the "black box" of the transformer architecture, moving beyond the 2017 "Attention Is All You Need" paper to explore how modern LLMs actually process data during inference. They discuss the critical shift from encoder-decoder models to decoder-only giants, the compute-saving brilliance of KV caching (which spends memory to avoid recomputing attention keys and values for earlier tokens), and the hardware-aware speed of FlashAttention-3. From speculative decoding to Rotary Positional Embeddings, learn how these technical plumbing upgrades have transformed simple translation tools into sophisticated world models capable of reasoning. This deep dive covers the journey of a token from a numerical vector to a human-readable response, revealing the complex engineering that powers today's most advanced AI systems.