Season 2 · Episode 1564

Why AI is Trading Transcripts for Raw Audio

Forget basic transcription. Explore how native omni-modal models are capturing the "soul" of speech with near-instant latency.

My Weird Prompts · Daniel Rosehill

March 26, 2026 · 24m 52s

Show Notes

The era of the "cascaded pipeline"—where speech is converted to text before being processed—is officially coming to an end. In this episode, we dive into the cutting-edge landscape of audio AI as of March 2026, comparing the raw power of local models like Whisper-large-v3-turbo and Moonshine against the massive scale of SaaS giants like OpenAI and Cohere. We explore the technical breakthroughs in Conformer architectures and the "omni tax" that comes with native multimodality. Why are developers choosing between specialized ASR for accuracy and omni-modal systems for emotional intelligence? From the 160ms latency of Kyutai’s Moshi to the recent audio regressions in Gemini, we break down the decision matrix for building the next generation of voice-first applications. Whether you're a developer seeking data sovereignty or a power user looking for the fastest response times, this deep dive covers the tools, the trade-offs, and the future of human-machine interaction.
