Season 1 · Episode 54

Tokenizing Everything: How Omnimodal AI Handles Any Input

Omnimodal AI: How do models process images, audio, video, and text all at once? Discover the engineering behind AI that accepts anything.

My Weird Prompts · Daniel Rosehill

December 11, 2025 · 32m 58s

Show Notes

How do AI models process images, audio, video, and text all at once? Herman and Corn dive deep into the technical complexity of multimodal tokenization, exploring how modern omnimodal models compress vastly different data types into a unified format that a single neural network can understand. From vision encoders to spectrograms to temporal compression, discover the engineering behind the AI systems that can accept anything and output anything.
