
ThursdAI - Dec 4, 2025 - DeepSeek V3.2 Goes Gold Medal, Mistral Returns to Apache 2.0, OpenAI Hits Code Red, and US-Trained MOEs Are Back!
ThursdAI - The top AI news from the past week · Alex Volkov
Show Notes
Hey y'all, Alex here
Welcome to the first ThursdAI of December! Snow is falling in Colorado, and AI releases are falling even harder. This week was genuinely one of those "drink from the firehose" weeks where every time I refreshed my timeline, another massive release had dropped.
We kicked off the show asking our co-hosts for their top AI pick of the week, and the answers were all over the map: Wolfram was excited about Mistral's return to Apache 2.0, Yam couldn't stop talking about Claude Opus 4.5 after a full week of using it, and Nisten came out of left field with an AWQ quantization of Prime Intellect's model that apparently runs incredibly fast on a single GPU. As for me? I'm torn between Opus 4.5 (which literally fixed bugs that Gemini 3 created in my code) and DeepSeek's gold-medal winning reasoning model.
Speaking of which, let's dive into what happened this week, starting with the open source stuff that's been absolutely cooking.
Open Source LLMs
DeepSeek V3.2: The Whale Returns with Gold Medals
The whale is back, folks! DeepSeek released two major updates this week: V3.2 and V3.2-Speciale. And these aren't incremental improvements: we're talking about an open reasoning-first model that's rivaling GPT-5 and Gemini 3 Pro with actual gold medal Olympiad wins.
Here's what makes this release absolutely wild: DeepSeek V3.2-Speciale is achieving 96% on AIME versus 94% for GPT-5 High. It's getting gold medals on IMO (35/42), CMO, ICPC (10/12), and IOI (492/600). This is a 685 billion parameter MOE model with an MIT license, and it literally broke the benchmark graph on HMMT 2025: the score was so high it went outside the chart boundaries. That's how you DeepSeek, basically.
But it's not just about reasoning. The regular V3.2 (not Speciale) is absolutely crushing it on agentic benchmarks: 73.1% on SWE-Bench Verified, first open model over 35% on Tool Decathlon, and 80.3% on τ²-bench. It's now the second most intelligent open weights model and ranks ahead of Grok 4 and Claude Sonnet 4.5 on Artificial Analysis.
The price is what really makes this insane: 28 cents per million tokens on OpenRouter. That's absolutely ridiculous for this level of performance. They've also introduced DeepSeek Sparse Attention (DSA), which gives you 2-3x cheaper 128K inference without performance loss. LDJ pointed out on the show that he appreciates how transparent they're being about not quite matching Gemini 3's efficiency on reasoning tokens, but it's open source and incredibly cheap.
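To put that price in perspective, here is a quick back-of-the-envelope sketch. It assumes one flat rate of $0.28 per million tokens for everything (real pricing usually splits input and output tokens), and the workload numbers are made up purely for illustration.

```python
# Assumption: a single flat rate for all tokens, taken from the
# "28 cents per million tokens" figure quoted in these notes.
PRICE_PER_MILLION_USD = 0.28


def estimate_cost_usd(total_tokens: int,
                      price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Return the estimated cost in USD for a given number of tokens."""
    if total_tokens < 0:
        raise ValueError("total_tokens must be non-negative")
    return total_tokens / 1_000_000 * price_per_million


# Hypothetical workload: 200 agent runs of 50,000 tokens each.
runs = 200
tokens_per_run = 50_000
total = runs * tokens_per_run

print(f"{total:,} tokens cost about ${estimate_cost_usd(total):.2f}")
# prints "10,000,000 tokens cost about $2.80"
```

Ten million tokens for under three dollars is the kind of math that makes long agentic runs feel cheap, at least under this simplified flat-rate assumption.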
One thing to note: V3.2-Speciale doesn't support tool calling. As Wolfram pointed out from the model card, it's "designed exclusively for deep reasoning tasks." So if you need agentic capabilities, stick with the regular V3.2.
Check out the full release on Hugging Face or read the announcement.
Mistral 3: Europe's Favorite AI Lab Returns to Apache 2.0
Mistral is back, and they're back with fully open Apache 2.0 licenses across the board! This is huge news for the open source community. They released two major things this week: Mistral Large 3 and the Ministral 3 family of small models.
Mistral Large 3 is a 675 billion parameter MOE with 41 billion active parameters and a quarter-million-token (256K) context window, trained on 3,000 H200 GPUs. There's been some debate about this model's performance, and I want to address the elephant in the room: some folks saw a screenshot showing Mistral Large 3 very far down on Artificial Analysis and started dunking on it. But here's the key context that Merve from Hugging Face pointed out: this is the only non-reasoning model on that chart besides GPT 5.1. When you compare it to other instruction-tuned (non-reasoning) models, it's actually performing quite well, sitting at #6 among open models on LMSys Arena.
Nisten checked LM Arena and confirmed that on coding specifically, Mistral Large 3 is scoring as one of the best open source coding models available. Yam made an important point that we should compare Mistral to other open source players like Qwen and DeepSeek rather than to closed models, and in that context, this is a solid release.
But the real stars of this release are the Ministral 3 small models: 3B, 8B, and 14B, all with vision capabilities. These are edge-optimized, multimodal, and the 3B actually runs completely in the browser with WebGPU using transformers.js. The 14B reasoning variant achieves 85% on AIME 2025, which is state-of-the-art for its size class. Wolfram confirmed that the multilingual performance is excellent, particularly for German.
There's been some discussion about whether Mistral Large 3 is a DeepSeek finetune given the architectural similarities, but Mistral claims these are fully trained models. As Nisten noted, even if they used similar architecture (which is Apache 2.0 licensed), there's nothing wrong with that; it's an excellent architecture that works. Lucas Atkins later confirmed on the show that "Mistral Large looks fantastic... it is DeepSeek through and through architecture wise. But Kimi also does that. DeepSeek is the GOAT. Training MOEs is not as easy as just import deepseek and train."
Check out Mistral Large 3 and Ministral 3 on Hugging Face.
Arcee Trinity: US-Trained MOEs Are Back
We had Lucas Atkins, CTO of Arcee AI, join us on the show to talk about their new Trinity family of models, and this conversation was packed with insights about what it takes to train MOEs from scratch in the US.
Trinity is a family of open-weight MOEs fully trained end-to-end on American infrastructure with 10 trillion curated tokens from Datology.ai. They released Trinity-Mini (26B total, 3B active) and Trinity-Nano-Preview (6B total, 1B active), with Trinity-Large (420B parameters, 13B active) coming in mid-January 2026.
The benchmarks are impressive: Trinity-Mini hits 84.95% on MMLU (0-shot), 92.1% on Math-500, and 65% on GPQA Diamond. But what really caught my attention was the inference speed: Nano generates at 143 tokens per second on llama.cpp, and Mini hits 157 t/s on consumer GPUs. They've even demonstrated it running on an iPhone via MLX Swift.
I asked Lucas why it matters where models come from, and his answer was nuanced: for individual developers, it doesn't really matter, so use the best model for your task. But for Fortune 500 companies, compliance and legal teams are getting increasingly particular about where models were trained and hosted. This is slowing down enterprise AI adoption, and Trinity aims to solve that.
Lucas shared a fascinating insight about why they decided to do full pretraining instead of just post-training on other people's checkpoints: "We at Arcee were relying on other companies releasing capable open weight models... I didn't like the idea of the foundation of our business being reliant on another company releasing models." He also dropped some alpha about Trinity-Large: they're going with 13B active parameters instead of 32B because going sparser actually gave them much faster throughput on Blackwell GPUs.
The conversation about MOEs being cheaper for RL was particularly interesting. Lucas explained that because MOEs are so inference-efficient, you can do way more rollouts during reinforcement learning, which means more RL benefit per compute dollar. This is likely why we're seeing labs like MiniMax go from their original 456B/45B-active model to a leaner 220B/10B-active model: they can get more gains in post-training by being able to do more steps.
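To make the "going sparser" point concrete, here is a small sketch that computes the active-parameter fraction for the MOEs mentioned in this episode. The parameter counts are the ones quoted in these notes; treating active/total as a stand-in for per-token compute is my simplification, since it ignores attention cost, routing overhead, and hardware effects.

```python
# (total parameters, active parameters) in billions, as quoted in these notes.
MODELS = {
    "Mistral Large 3": (675, 41),
    "Trinity-Mini": (26, 3),
    "Trinity-Nano-Preview": (6, 1),
    "Trinity-Large": (420, 13),
    "MiniMax (original)": (456, 45),
    "MiniMax (leaner)": (220, 10),
}


def active_fraction(total_b: float, active_b: float) -> float:
    """Fraction of the model's parameters used for each token."""
    if total_b <= 0 or not 0 < active_b <= total_b:
        raise ValueError("expected 0 < active_b <= total_b")
    return active_b / total_b


for name, (total_b, active_b) in MODELS.items():
    share = active_fraction(total_b, active_b)
    print(f"{name:22s} {share:6.1%} of weights active per token")
```

By this rough measure, Trinity-Large at 13B active out of 420B lands near 3%, noticeably sparser than the other models listed, which fits what Lucas described about throughput.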
Check out Trinity-Mini and Trinity-Nano-Preview on Hugging Face, or read The Trinity Manifesto.
OpenAI Code Red: Panic at the Disco (and Garlic?)
It was ChatGPT's 3rd birthday this week (Nov 30th), but the party vibes seem... stressful. Reports came out that Sam Altman has declared a "Code Red" at OpenAI.
Why? Gemini 3. The user numbers don't lie. ChatGPT apparently saw a 6% drop in daily active users following the Gemini 3 launch. Google's integration is just too good, and their free tier is compelling.
In response, OpenAI has supposedly paused "side projects" (ads, shopping bots) to focus purely on model intelligence and speed. Rumors point to a secret model codenamed "Garlic", a leaner, more efficient model that reportedly beats Gemini 3 and Claude Opus 4.5 on coding reasoning, targeting a release in early 2026 (or maybe sooner if they want to save Christmas).
Wolfram and Yam nailed the sentiment here: Integration wins. Wolfram's family uses Gemini because it's right there on the Pixel, controlling the lights and calendar. OpenAI needs to catch up not just on IQ, but on being helpful in the moment.
After the live show, OpenAI also finally added GPT 5.1 Codex Max (which we covered 2 weeks ago) to their API, and it's now available in Cursor, for free, until Dec 11!
Amazon Nova 2: Enterprise Push with Serious Agentic Chops
Amazon came back swinging with Nova 2, and the jump on Artificial Analysis is genuinely impressive: from around 30% to 61% on their index. That's a massive improvement.
The family includes Nova 2 Lite (7x cheaper, 5x faster than Nova Premier), Nova 2 Pro (93% on τ²-Bench Telecom, 70% on SWE-Bench Verified), Nova 2 Sonic (speech-to-speech with 1.39s time-to-first-audio), and Nova 2 Omni (unified text/image/video/speech with a 1M token context window, so you can upload 90 minutes of video!).
Gemini 3 Deep Think Mode
Google launched Gemini 3 Deep Think mode exclusively for AI Ultra subscribers, and it's hitting some wild benchmarks: 45.1% on ARC-AGI-2 (a 2x SOTA leap using code execution), 41% on Humanity's Last Exam, and 93.8% on GPQA Diamond. This builds on their Gemini 2.5 variants that earned gold medals at IMO and ICPC World Finals. The parallel reasoning approach explores multiple hypotheses simultaneously, but it's compute-heavy: limited to 10 prompts per day at $77 per ARC-AGI-2 task.
This Week's Buzz: Mid-Training Evals are Here!
A huge update from us at Weights & Biases this week: We launched LLM Evaluation Jobs. (Docs)
If you are training models or finetuning, you usually wait until the end to run your expensive benchmarks. Now, directly inside W&B, you can trigger evaluations on mid-training checkpoints.
It integrates with Inspect Evals (100+ public benchmarks). You just point it to your checkpoint or an API endpoint (even OpenRouter!), select the evals (MMLU-Pro, GPQA, etc.), and we spin up the managed GPUs to run it. You get a real-time leaderboard of your runs vs. the field.
Also, a shoutout to users of Neptune.ai: congrats on the acquisition by OpenAI, but since the service is shutting down, we have built a migration script to help you move your history over to W&B seamlessly. We aren't going anywhere!
Video & Vision: Physics, Audio, and Speed
The multimodal space was absolutely crowded this week.
Runway Gen 4.5 ("Whisper Thunder")
Runway revealed that the mysterious "Whisper Thunder" model topping the leaderboards is actually Gen 4.5. The key differentiator? Physics and multi-step adherence. It doesn't have that "diffusion wobble" anymore. We watched a promo video where the shot changes every 3-4 seconds, and while it's beautiful, it shows we still haven't cracked super long consistent takes yet. But for 8-second clips? It's apparently the new SOTA.
Kling 2.6: Do you hear that?
Kling hit back with Video 2.6, and the killer feature is Native Audio. I generated a clip of two people arguing, and the lip sync was perfect. Not "dubbed over" perfect, but actively generated with the video. It handles multi-character dialogue, singing, and SFX. It's huge for creators.
Kling was on a roll this week, releasing not one but two video models (O1 Video is an omni-modal one that takes text, images and audio as inputs). O1 Image and Kling Avatar 2.0 are also great updates! (Find all their releases on X)
P-Image: Sub-Second Generation at Half a Cent
Last week we talked about ByteDance's Z-Image, which was super cool and super cheap. Well, this week Pruna AI came out with P-Image, which is even faster and cheaper: image generation under one second for $0.005, and editing under one second for $0.01.
I built a Chrome extension this week (completely rewritten by Opus 4.5, by the way; more on that in a second) that lets me play with these new image models inside the Infinite Craft game. When I tested P-Image Turbo against Z-Image, I was genuinely impressed by the quality at that speed. If you want quick iterations before moving to something like Nano Banana Pro for final 4K output, these sub-second models are perfect.
The extension is available on GitHub if you want to try it. You just need to add your Replicate or Fal API keys.
SeeDream 4.5: ByteDance Levels Up
ByteDance also launched SeeDream 4.5 in open beta, with major improvements in detail fidelity, spatial reasoning, and multi-image reference fusion (up to 10 inputs for consistent storyboards). The text rendering is much sharper, and it supports multilingual typography including Japanese. Early tests show it competing well with Nano Banana Pro in prompt adherence and logic.
Voice & Audio
Microsoft VibeVoice-Realtime-0.5B
In a surprise drop, Microsoft open-sourced VibeVoice-Realtime-0.5B, a compact TTS model optimized for real-time applications. It delivers initial audible output in just 300 milliseconds while generating up to 10 minutes of speech. The community immediately started creating mirrors because, well, Microsoft has a history of releasing things on Hugging Face and then having legal pull them down. Get it while it's hot!
Use Cases: Code, Cursors, and "Antigravity"
We wrapped up with some killer practical tips:
* Opus 4.5 is a beast: As I mentioned, using Opus inside Cursor's "Ask" mode is currently the supreme coding experience. It debugs logic flaws that Gemini misses completely. I also used Opus as a prompt engineer for my infographics, and it absolutely demolished GPT at creating the specific layouts I needed.
* Google's Secret: Nisten dropped a bomb at the end of the show: Opus 4.5 is available for free inside Google's Antigravity (and Colab)! If you want to try the model that's beating GPT-5 without paying, go check Antigravity now before they patch it or run out of compute.
* Microsoft VibeVoice: A surprise drop of a 0.5B speech model on HuggingFace that does real-time TTS (300ms latency). It was briefly questionable if it would stay up, but mirrors are already everywhere.
That's a wrap for this week, folks. Next week is probably going to be our final episode of the year, so we'll be doing recaps and looking at our predictions from last year. Should be fun to see how wrong we were about everything!
Thank you for tuning in. If you missed the live stream, subscribe to our Substack, YouTube, and wherever you get your podcasts. See you next Thursday!
TL;DR and Show Notes
Hosts and Guests
* Alex Volkov - AI Evangelist & Weights & Biases (@altryne)
* Co Hosts - @WolframRvnwlf, @yampeleg, @nisten, @ldjconfirmed
* Guest - Lucas Atkins (@latkins) - CTO Arcee AI
Open Source LLMs
* DeepSeek V3.2 and V3.2-Speciale - Gold medal olympiad wins, MIT license (X, HF V3.2, HF Speciale, Announcement)
* Mistral 3 family - Large 3 and Ministral 3, Apache 2.0 (X, Blog, HF Large, HF Ministral)
* Arcee Trinity - US-trained MOE family (X, HF Mini, HF Nano, Blog)
* Hermes 4.3 - Decentralized training, SOTA RefusalBench (X, HF)
Big CO LLMs + APIs
* OpenAI Code Red - ChatGPT 3rd birthday, Garlic model in development (The Information)
* Amazon Nova 2 - Lite, Pro, Sonic, and Omni models (X, Blog)
* Gemini 3 Deep Think - 45.1% ARC-AGI-2 (X, Blog)
* Cursor + GPT-5.1-Codex-Max - Free until Dec 11 (X, Blog)
This Week's Buzz
* WandB LLM Evaluation Jobs - Evaluate any OpenAI-compatible API (X, Announcement)
Vision & Video
* Runway Gen-4.5 - #1 on text-to-video leaderboard, 1,247 Elo (X)
* Kling VIDEO 2.6 - First native audio generation (X)
* Kling O1 Image - Image generation (X)
Voice & Audio
* Microsoft VibeVoice-Realtime-0.5B - 300ms latency TTS (X, HF)
AI Art & Diffusion
* Pruna P-Image - Sub-second generation at $0.005 (X, Blog, Demo)
* SeeDream 4.5 - Multi-reference fusion, text rendering (X)
This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit sub.thursdai.news/subscribe