PLAY PODCASTS
πŸ¦ƒ ThursdAI - Thanksgiving special 24' - Qwen Open Sources Reasoning, BlueSky hates AI, H controls the web & more AI news

πŸ¦ƒ ThursdAI - Thanksgiving special 24' - Qwen Open Sources Reasoning, BlueSky hates AI, H controls the web & more AI news

ThursdAI - The top AI news from the past week Β· Alex Volkov

November 28, 20241h 46m

Audio is streamed directly from the publisher (api.substack.com) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.

Show Notes

Hey ya'll, Happy Thanskgiving to everyone who celebrates and thank you for being a subscriber, I truly appreciate each and every one of you!

We had a blast on today's celebratory stream, especially given that today's "main course" was the amazing open sourcing of a reasoning model from Qwen, and we had Junyang Lin with us again to talk about it! First open source reasoning model that you can run on your machine, that beats a 405B model, comes close to o1 on some metrics 🀯

We also chatted about a new hybrid approach from Nvidia called Hymba 1.5B (Paper, HF) that beats Qwen 1.5B with 6-12x less training, and Allen AI releasing Olmo 2, which became the best fully open source LLM πŸ‘ (Blog, HF, Demo), though they didn't release WandB logs this time, they did release data!

I encourage you to watch todays show (or listen to the show, I don't judge), there's not going to be a long writeup like I usually do, as I want to go and enjoy the holiday too, but of course, the TL;DR and show notes are right here so you won't miss a beat if you want to use the break to explore and play around with a few things!

ThursdAI - Recaps of the most high signal AI weekly spaces is a reader-supported publication. To receive new posts and support my work, consider becoming a free or paid subscriber.

TL;DR and show notes

* Qwen QwQ 32B preview - the first open weights reasoning model (X, Blog, HF, Try it)

* Allen AI - Olmo 2 the best fully open language model (Blog, HF, Demo)

* NVIDIA Hymba 1.5B - Hybrid smol model beating Qwen, SmolLM w/ 6-12x less training (X, Paper, HF)

* Big CO LLMs + APIs

* Anthropic MCP - model context protocol (X,Blog, Spec, Explainer)

* Cursor, Jetbrains now integrate with ChatGPT MacOS app (X)

* Xai is going to be a Gaming company?! (X)

* H company shows Runner H - WebVoyager Agent (X, Waitlist)

* This weeks Buzz

* Interview w/ Thomas Cepelle about Weave scorers and guardrails (Guide)

* Vision & Video

* OpenAI SORA API was "leaked" on HuggingFace (here)

* Runway launches video Expand feature (X)

* Rhymes Allegro-TI2V - updated image to video model (HF)

* Voice & Audio

* OuteTTS v0.2 - 500M smol TTS with voice cloning (Blog, HF)

* AI Art & Diffusion & 3D

* Runway launches an image model called Frames (X, Blog)

* ComfyUI Desktop app was released πŸŽ‰

* Chat

* 24 hours of AI hate on πŸ¦‹ (thread)

* Tools

* Cursor agent (X thread)

* Google Generative Chess toy (Link)

See you next week and happy Thanks Giving πŸ¦ƒ

Thanks for reading ThursdAI - Recaps of the most high signal AI weekly spaces! This post is public so feel free to share it.

Full Subtitles for convenience

[00:00:00] Alex Volkov: let's get it going.

[00:00:10] Alex Volkov: Welcome, welcome everyone to ThursdAI November 28th Thanksgiving special. My name is Alex Volkov. I'm an AI evangelist with Weights Biases. You're on ThursdAI. We are live [00:00:30] on ThursdAI. Everywhere pretty much.

[00:00:32] Alex Volkov:

[00:00:32] Hosts and Guests Introduction

[00:00:32] Alex Volkov: I'm joined here with two of my co hosts.

[00:00:35] Alex Volkov: Wolfram, welcome.

[00:00:36] Wolfram Ravenwolf: Hello everyone! Happy Thanksgiving!

[00:00:38] Alex Volkov: Happy Thanksgiving, man.

[00:00:39] Alex Volkov: And we have Junyang here. Junyang, welcome, man.

[00:00:42] Junyang Lin: Yeah, hi everyone. Happy Thanksgiving. Great to be here.

[00:00:46] Alex Volkov: You had a busy week. We're going to chat about what you had. I see Nisten joining us as well at some point.

[00:00:51] Alex Volkov: Yam pe joining us as well. Hey, how, Hey Yam. Welcome. Welcome, as well. Happy Thanksgiving. It looks like we're assembled folks. We're across streams, across [00:01:00] countries, but we are.

[00:01:01] Overview of Topics for the Episode

[00:01:01] Alex Volkov: For November 28th, we have a bunch of stuff to talk about. Like really a big list of stuff to talk about. So why don't we just we'll just dive in. We'll just dive in. So obviously I think the best and the most important.

[00:01:13] DeepSeek and Qwen Open Source AI News

[00:01:13] Alex Volkov: Open source kind of AI news to talk about this week is going to be, and I think I remember last week, Junyang, I asked you about this and you were like, you couldn't say anything, but I asked because last week, folks, if you remember, we talked about R1 from DeepSeek, a reasoning model from [00:01:30] DeepSeek, which really said, Oh, maybe it comes as a, as open source and maybe it doesn't.

[00:01:33] Alex Volkov: And I hinted about, and I asked, Junyang, what about some reasoning from you guys? And you couldn't say anything. so this week. I'm going to do a TLDR. So we're going to actually talk about the stuff that, you know, in depth a little bit later, but this week, obviously one of the biggest kind of open source or sorry, open weights, and news is coming from our friends at Qwen as well, as we always celebrate.

[00:01:56] Alex Volkov: So one of the biggest things that we get as. [00:02:00] is, Qwen releases, I will actually have you tell me what's the pronunciation here, Junaid, what is, I say Q W Q or maybe quick, what is the pronunciation of this?

[00:02:12] Junyang Lin: I mentioned it in the blog, it is just like the word quill. Yeah. yeah, because for the qw you can like work and for the q and you just like the U, so I just combine it together and create a new pronunciation called Quill.

[00:02:28] Junyang Lin: Yeah.

[00:02:28] Alex Volkov: So we're saying it's Quin [00:02:30] Quill 32 B. Is that the right pronunciation to say this?

[00:02:33] Junyang Lin: Yeah, it's okay. I would just call it qui quill. It is, some something funny because,the ca the characters look very funny. Oh, we have a subculture,for these things. Yeah. Just to express some, yeah.

[00:02:46] Junyang Lin: our. feelings.

[00:02:49] Alex Volkov: Amazing. Qwen, Quill, 32B, and it's typed,the name is typed QWQ, 32Breview. This is the first OpenWeights reasoning model. This [00:03:00] model is not only predicting tokens, it's actually doing reasoning behind this. What this means is we're going to tell you what this means after we get to this.

[00:03:07] Alex Volkov: So we're still in the, we're still in the TLDR area. We also had. Another drop from Alien Institute for AI, if you guys remember last week we chatted with Nathan, our dear friend Nathan, from Alien Institute about 2. 0. 3, about their efforts for post training, and he gave us all the details about post training, so they released 2.

[00:03:28] Alex Volkov: 0. 3, this week they released Olmo 2. [00:03:30] 0. We also talked about Olmo with the friends from Alien Institute a couple of months ago, and now they released Olmo 2. 0. Which they claim is the best fully open sourced, fully open sourced language models, from Allen Institute for AI.and, so we're going to chat about, Olmo a little bit as well.

[00:03:46] Alex Volkov: And last minute addition we have is NVIDIA Haimba, which is a hybrid small model from NVIDIA, very tiny one, 1. 5 billion parameters. small model building Qwen and building small LLM as well. this is in the area [00:04:00] of open source. I

[00:04:01] Alex Volkov: Okay, in the big companies, LLMs and APIs, I want to run through a few things.

[00:04:06] Anthropic's MCP and ChatGPT macOS Integrations

[00:04:06] Alex Volkov: So first of all, Anthropic really something called MCP. It's a, something they called Model Context Protocol. We're going to briefly run through this. It's a, it's a kind of a release from them that's aimed for developers is a protocol that enables secure connections between a host application, like a cloud desktop, for example,

[00:04:24] Alex Volkov: there's also a bunch of new integrations for the ChatGPT macOS app. If you guys remember a couple of [00:04:30] weeks ago, We actually caught this live.

[00:04:31] Alex Volkov: I refreshed my MacOS app and there's ta da, there's a new thing. And we discovered this live. It was very fun. The MacOS app for ChatGPT integrates with VS Code, et cetera. and so we tried to run this with Cursor. It didn't work. So now it works with Cursor,

[00:04:43] Wolfram Ravenwolf:

[00:04:43] Alex Volkov: So the next thing we're going to look at, I don't know if it's worth mentioning, but you guys know the XAI, the company that Elon Musk is raising another 6 billion for that tries to compete with OpenAI

[00:04:54] Alex Volkov: Do you guys hear that it's going to be a gaming company as well? I don't know if it's worth talking about, but we'll at least [00:05:00] mention this. And the one thing that I wanted to chat about is H, the French company, H that showed a runner that looks. Three times as fast and as good as the Claude computer use runner, and we're definitely going to show examples of this, video live because that looks just incredible.

[00:05:18] Alex Volkov: this out of nowhere company, the biggest fundraise or the biggest seed round that Europe has ever seen, at least French has ever seen, just show they, An agent that controls your [00:05:30] computer that's tiny, ridiculously tiny, I think it's like the three billion parameter, two billion parameter or something.

[00:05:36] Alex Volkov: And it runs way better than computer, cloud computer use. Something definitely worth talking about. after with, after which in this week's Bars, we're going to talk with Thomas Capelli, from, from my team at Weights Biases. about LLM guardrails, that's gonna be fun. and in vision video category, we're gonna cover that OpenAI Sora quote unquote leaked, this week.

[00:05:56] Alex Volkov: And this leak wasn't really a leak, but, definitely [00:06:00] we saw some stuff. and then there's also a new expand feature that we saw in, Runway. And we saw another video model from, Rhymes called Allegro TIV2. which is pretty cool in voice and audio. If we get there in voice and audio, we saw out TTS vision 0.

[00:06:19] Alex Volkov: 2, which is a new TTS, a 500 million parameter, small TTS you can run in your browser and sounds pretty dope.art in the fusion, super quick runway launches an image [00:06:30] model. Yep, Runway, the guys who do video, they launched an image model that looks pretty sick, and we're definitely going to look at some examples of this, and Confi UI Desktop, for those of you who are celebrating something like this, Confi UI now is runnable with desktop, and there's a bunch of tool stuff, but honestly, I can talk about two things.

[00:06:47] Alex Volkov: Tools and there's a cool thing with Google generative chess toy. I can show you this so you can show your folks in Thanksgiving and going to impress them with a generative chess toy. But honestly, instead of this, I would love to chat about the thing that [00:07:00] some of us saw on the other side of the social media networks.

[00:07:04] Alex Volkov: And definitely we'll chat about this, for the past 24 hours. So chat, for the past. 24 hours, on BlueSky, we saw a little bit of a mob going against the Hug Face folks and then, other friends of ours on,from the AI community and the anti AI mob on BlueSky. So we're going to chat about that.

[00:07:26] Alex Volkov: And hopefully give you our feelings about what's going on, about this [00:07:30] world. And this is a pro AI show. And when we see injustice happens against ai, we have to speak out about against this. And I think that this is mostly what we're gonna cover this show, unless this is.

[00:07:42] Wolfram Ravenwolf: Where I could insert the two things I have.

[00:07:44] Wolfram Ravenwolf: One is a tool, which is the AI video composer, which, allows you to talk to, ff mpac, which is a complicated comment line tool, but very powerful. And so you have a UI where you just use natural language to control the tool. So that is one tool. Maybe we get to [00:08:00] it, if not just Google or ask for Plexity or anything.

[00:08:03] Alex Volkov: No, we'll drop it in. Yeah, we'll drop it in show notes, absolutely.

[00:08:04] Wolfram Ravenwolf: Yeah, that's the best part. Okay. And echo mimic. Version 2 is also an HN Synthesia alternative for local use, which is also, yeah, a great open source local runnable tool.

[00:08:17] Alex Volkov: What do we call this? EcoMimic?

[00:08:19] Wolfram Ravenwolf: EcoMimic. EcoMimic

[00:08:21] Alex Volkov: v2.

[00:08:21] Wolfram Ravenwolf: EcoMimic

[00:08:23] Alex Volkov: 2.

[00:08:24] Alex Volkov: Alright, we have a special guest here that we're gonna add Alpin. Hey Alpen, [00:08:30] welcome, feel free to stay anonymous and don't jump, we're gonna start with open source AI and then we're gonna chat with you briefly about the experience you had.

[00:08:38] Alpin Dale: hello everyone.

[00:08:39] Alex Volkov: Hey man. Yeah, you've been on the show before, right Alton? You've been on the show.

[00:08:43] Alpin Dale: a few times, yeah. it's nice to be back here again.

[00:08:46] Alex Volkov: Yeah. Alton, we're gonna get, we're gonna chat with you soon, right? We're gonna start with open source. We need to go to Junyang and talk about reasoning models.

[00:08:52] Alex Volkov: so feel free to stay with us. And then I definitely want to hear about some of the stuff we're going to cover after open source. We're going to cover the [00:09:00] anti AI mob over there.

[00:09:05] Alex Volkov: Alrighty folks, it's time to start with the,with the corner we love the most, yeah? let's dive into this. Let's dive in straight to Open Source AI.

[00:09:29] Alex Volkov: Open Source AI, [00:09:30] let's get it started. Let's start it.

[00:09:35] Alex Volkov: Okay, folks, so open source this week, we're going to get, let me cover the other two things super quick before we dive in.

[00:09:43] NVIDIA Haimba Hybrid Model Discussion

[00:09:43] Alex Volkov: Alright, so I want to like briefly cover the Haimba paper super quick, because we're going to get the least interesting stuff out of the way so we can focus on the main topic. Course, NVIDIA released Heimbar 1. 5 parameters. Heimbar is a hybrid small model, from NVIDIA. We talked about hybrid models [00:10:00] multiple times before.

[00:10:00] Alex Volkov: we have our friend of the pod, LDJ here. He loves talking about hybrid models. He actually brought this to our attention in the, in, in the group chat. We talked about, you guys know the Transformer, we love talking about the Transformer. Haimba specifically is a hybrid model between Transformer and I think they're using a hybrid attention with Mamba layers in parallel.

[00:10:22] Alex Volkov: they claim they're beating Lama and Qwen and SmallLM with 6 to 12 times less training as well. Let's look [00:10:30] at the, let's look at their, let's look at their X.so this is what they're, this is what they're showing, this is the table they're showing some impressive numbers, the interesting thing is, this is a table of comparison that they're showing, and in this table of comparison, the comparison is not only Evaluations.

[00:10:47] Alex Volkov: The comparison they're showing is also cache size and throughput, which I like. it's do you guys know what this reminds me of? This reminds me of when you have a electric vehicle [00:11:00] and you have a gas based vehicle or standard combustion engine vehicle, and then they compare the electric vehicle and acceleration.

[00:11:07] Alex Volkov: It's Oh, our car is faster. But you get this by default, you get the acceleration by default with all the electric vehicles. This is how the model works. This is how those model works. So for me, when you compare like hybrid models, or, non transformer based models, a Mamba based models, the throughput speed up is generally faster because of it.

[00:11:29] Alex Volkov: [00:11:30] But definitely the throughput is significantly higher. Tokens per second. is significantly higher. So for comparison for folks who are listening to us, just so you, you'll hear the comparison, the throughput for this 1. 5 billion model is 664 tokens per second versus a small LM 238 tokens per second, or something like Qwen 1.

[00:11:54] Alex Volkov: 5 at 400. So 600 versus 400. the training cost in [00:12:00] tokens, they say this was, 1. 5 trillion tokens versus Qwen at 18. I don't know if Junyang you want to confirm or deny the 18 mentioned here that they added. Sometimes they, they say different things, but yeah, definitely the highlight of this Heimwehr thing.

[00:12:14] Alex Volkov: And this is from NVIDIA, by the way, I think it's very worth like shouting out that this specific thing comes from this model comes from NVIDIA. Um,they specifically mentioned that the cost, And outperformance of this model comes at 6 to 12 times less [00:12:30] training, which is very impressive.

[00:12:31] Alex Volkov: what else about this model? Performance wise, MMLU at 52, which is lower than Qwen at 59, at, at 1. 5 billion parameters. GSM 8K, we know the GSM 8K is not that interesting anymore, I think, at this point. We're not like over, we're not over, we're not looking at this like too much. What else should we say about this model?

[00:12:52] Alex Volkov: GPK is pretty interesting at 31. GPK is usually knowledge versus something. [00:13:00] Anything else to say about this model? Yeah, you have anything to say Nisten? Anything to say about the small models? About the hybrid model specifically? I know that like our friend LDJ said that like this seems like the first actual model that competes apples to apples.

[00:13:13] Alex Volkov: Because usually when we compare Hybrid models specifically, those usually people say that those are not like necessarily one to one comparisons between hybrid models and just formal models.

[00:13:24] Nisten Tahiraj: I was just going to say that fromfrom NVIDIA, we've heard these [00:13:30] claims before and they didn't quite turn out that way, so I'm going to start off a little bit more skeptical on that end. also from, from the Mistral Mamba, Mambastral, that one was not very performant.

[00:13:44] Nisten Tahiraj: it seemed like it was going to be good for long context stuff. The runtime wasn't that good as well. yeah, I'm going to give this one a test because. Again, the promise of, of like hybrid, SSM models is that it can do better [00:14:00] in longer contexts and it can run faster. So it is worth testing given what, what they're claiming.

[00:14:06] Nisten Tahiraj: But, again, on MMLU, it didn't do that well, but, yeah, overall the numbers do look great actually for what it is, but I think we do need to do further testing on this, whether it is practically. That's good. Because I'm not sure how well it's going to hold up after you just throw like 32k of context of it.

[00:14:25] Nisten Tahiraj: I guess it's going to remember all that, but, yeah, this on paper, this does [00:14:30] look like it's one of the first ones that is Applesauce.

[00:14:33] Alex Volkov: Yeah. All right. anything else to say here? Yeah, the architecture. Jan, go ahead.

[00:14:39] Yam Peleg: Yeah, about the architecture. I tweeted about it.It is, I think it has extreme potential and, it might, I just by looking at the attention maps, from the paper, like just a glimpse is enough for you to see that.

[00:14:55] Yam Peleg: They really do solve something really profound [00:15:00] with many of the models that we have today. basically, I'm really simplifying here, but basically, when you look at the Attention versus Mamba, they act very differently in terms of how they process the tokens, sliding window ones, you could say.

[00:15:20] Yam Peleg: And of course self attention is like global, to everything, but Mamba is not exactly global, it's sequential, and sliding window is also not exactly [00:15:30] global, but it's not the same sequential, it's like everything to everything, but with a window. So what they did is combine the two, and you can really see the difference in attention map of the trained model.

[00:15:44] Yam Peleg: it's not exactly the same as just, hybrid Mamba attention models that we all saw before.there is a lot to this model and I really want to see one of those. I just [00:16:00] trained for like at scale, like a large one on, on, on a huge data set, because I think it might be an improvement to either,just by looking at the way the model learned, but you cannot know until you actually try.

[00:16:15] Yam Peleg: I tweeted about it just like briefly. So if you want to go and look at, I'm just, I'm just pointing out that go and check the paper out because the architecture is unique. There is, there is a reason the model is, for its size, very performant. [00:16:30]

[00:16:30] Alex Volkov: Yeah, I'm gonna add your tweet.

[00:16:31] Alex Volkov: All right, folks, time for us to move to the second thing.

[00:16:36] Allen Institute's Olmo 2.0 Release

[00:16:36] Alex Volkov: The folks at Allen AI, surprises with another release this week, and they have, as always they do, they say, hey folks, we divide the categories of open source to not open source at all, then somewhat open weights maybe, and then fully open source, the folks who release the checkpoints, the data, the, the training code.

[00:16:57] Alex Volkov: I will say this, they used to release Weights [00:17:00] Biases logs as well, and they stopped. So if somebody listens to the show from LMAI, as I know they do, folks, what's up with the Weights Biases logs? We know, and we love them, so please release the Weights Biases logs again. but, they released Olmo 2.

[00:17:14] Alex Volkov: Congrats, folks, for releasing Olmo 2. Let me actually do the clap as well. Yay!Olmo 2 is, they claim, is, they claim,the best open, fully open language model to date, and they show this nice graph as well, where, they released two models, Olmo [00:17:30] 2. 7b and Olmo 2. 13b, and they cite multiple things, to, to attribute for the best performance here.

[00:17:37] Alex Volkov: specifically the training stability, they ran this for a significant longer before. they cite some of the recipes of. What we talked about last week from TULU3 methodology, the kind of the state of the art post training methodology from TULU3 that we've talked with Nathan last week, specifically the verifiable framework, thing that we've talked about, multiple other technical things like rate [00:18:00] annealing and the data curriculum.

[00:18:01] Alex Volkov: And obviously they're focusing on their data. they have their, Ohm's selection of tasks on which they compared these models and,the breakdown that I told you about that they do is the open weights models, partially open models, and then fully open models. So this is the breakdown that they have in the area of open weights models.

[00:18:18] Alex Volkov: They have Lama 2. 13b and Mistral 7b, for example, they put Qwen in there as well. So Qwen 2. 57 and 14. And the partially open models, they put Zamba and Stable [00:18:30] LLM. And the fully open models, they put themselves and Olmo and, Ember7B and Olmo2 beats all of that category with some nice, average of stats.

[00:18:40] Alex Volkov: they talk about pre training and a bunch of other stuff. and the instruct category specifically with the Tulu kind of,recipes. What else can we say about Olmo? That's very interesting for folks before we jump into Qwen. What else can we say about Olmo? The, oh, the fact that the thing about the fully open source, we always mention this, is the data set.

[00:18:59] Alex Volkov: We [00:19:00] always talk about the data, they release all of the data sets, so Olmo mix was released, Dolmino mix was released, the SFT training data, post training data set was released as well. yeah, folks, comments. You can also try this model at playground. lnai. org. I've tried it. It's interesting. it's not look, uh,the best about this is the best among open source.

[00:19:21] Alex Volkov: Obviously it's not the best at, generally with closed source data, you can get more significantly better than this. But comments from folks about OMO? [00:19:30]

[00:19:30] Wolfram Ravenwolf: Yeah, it's not multilingual, they said that there is only English, but they are working on putting that in, I think, in another version, but, yeah, it's a truly open source model, not just OpenWeights, so a big applause for them, releasing everything, that is a big thing and I always appreciate it.

[00:19:46] Wolfram Ravenwolf: Thank you.

[00:19:48] Alex Volkov: A hundred percent. All right, folks, it looks like we got Eugene back. Eugene, talk to us about Heimbar.

[00:19:54] Eugen Cheugh: Yeah, no, sorry, I was just saying that as someone who works on transformer [00:20:00] alternative,it's actually really awesome to get the data point because we all haven't decided what's the best arrangement, what's the percentage of transformer versus non transformer?

[00:20:08] Eugen Cheugh: Is the non transformer layers in the front or the back? It's like you say, the car and the car scenario, it's like electric car, do we even know if we want the electric engine in front or the back? and these are data points that we love to test to just, find out more and it's. And I appreciate what NVIDIA is doing as well and looking forward to more research in this space.

[00:20:26] Alex Volkov: Awesome. thanks for joining us and feel free to stay. The more the merrier. This is like a [00:20:30] Thanksgiving kind of pre party for all of us. The more the merrier, folks. If you're listening to this only and you're not like on the live stream, I encourage you to go and check us out because like we're also like showing stuff.

[00:20:40] Alex Volkov: We're like showing the papers. We're like, we're waving. We're like showing Turkey, whatever. we're having fun. all right, folks. I think it's time to talk about the main course. We just ate the mashed potatoes. Let's eat the turkey for open source.

[00:20:53] Qwen Quill 32B Reasoning Model

[00:20:53] Alex Volkov: In this week's Open Source Turkey dinner, the Reasoning Model, the first ever Reasoning Open [00:21:00] Source, we got Qwen Quill, Qwen Quill?

[00:21:04] Alex Volkov: Yes, Qwen Quill 32 bit preview, the first open source. Let's go! Let's go! The first open source Reasoning Model from our friends at Qwen. We have Jun Yang here, Jun Yang and Justin Lin, to talk to us about this release. Folks at OpenAI released this, they worked for, the rest of about O1, we released a couple of months ago.

[00:21:25] Alex Volkov: Then the folks at DeepSeek released R1, that they just released it, they [00:21:30] promised to give us, maybe at some point. The folks at O1 did not release the reasoning. So, what you see in O1 is the reasoning being obfuscated from us, so we can't actually see how the model reasons. R1 gave us the reasoning itself.

[00:21:44] Alex Volkov: But didn't release the model. And so now we have a reasoning model that you can actually download and use. And unlike reflection, this model actually does the thing that it promises to do. Junyang, how did you do it? What did you do? Please give us all the details as much as possible. Please do the announcement yourself.

[00:21:58] Alex Volkov: Thank you for joining us. [00:22:00] Junyang from Qwen.

[00:22:00] Junyang Lin: Yeah, thanks everyone for the attention and for the appreciation, and I'm Junyang from the Qwen team, and we just released the new model for reasoning, but we just added a tag that it is a preview. Yeah, it is something very experimental, but we would really like to receive some feedback to see how people use it and to see what people think.

[00:22:24] Junyang Lin: The internal problems,they really are. Yeah, it is called QUIL. it is [00:22:30] something, very interesting naming,because we like to see that, we first called it like Q1,things like that, but we think it's something too normal and we'd like to see there was something connected with IQ, EQ, then we call it QQ, and then we found out, QWEN with a W there.

[00:22:47] Junyang Lin: And we found a very interesting expression because it looks really cute. There is a subculture in China with the text expression to express the feelings. So it is something very interesting. So we [00:23:00] just decided to use the name and for. For the pronunciation, it's just like the word Q, because I combined QW, the pronunciation of QW, with U together, and it's still just cute.

[00:23:13] Junyang Lin: Yeah, there's something beside the model, and it is actually a model, which can, And this is the reason before it reaches the final response. If you just try with our demo and you will find that it just keeps talking to itself. And it's something really [00:23:30] surprising for us. If it asks you a question, it just keeps talking to itself to discover more possibilities as possible.

[00:23:42] Junyang Lin: And sometimes will lead to some new things. Endless generation. So we have some limitations there. So we mentioned the limitations in the almost the second paragraph, which includes endless generation. But it is very interesting. I [00:24:00] don't say it is a really strong model, something like competitive to O1 or outcompeting R1.

[00:24:06] Junyang Lin: It is not Simply like that, we show the benchmark scores, but it is something for your reference to see that, maybe it is at this level, and then if you really check the model performance, when it processes like mathematics and coding problems, it really thinks step by step, and it really discovers more possibilities.[00:24:30]

[00:24:30] Junyang Lin: Maybe it is a bit like brute forcing, just like discovering all possibilities. If there are 1 plus 2 is equal to 1, and it discovers a lot of possibilities, but it sometimes finishes,can finish some very difficult tasks. I think, you guys can wait for our more official release, maybe one month or two months later.

[00:24:53] Junyang Lin: We'll make sure, And the next one will be much better than this preview one, but you can play with it. It is something really interesting, [00:25:00] very different from the previous models.

[00:25:02] Alex Volkov: So first of all, a huge congrats on releasing something that, everybody, it looks like it piqued interest for, tons of folks, absolutely.

[00:25:09] Alex Volkov: Second of all, it definitely thinks, it looks like it's,Actually, this seems like this. you can see the thinking, like we're actually showing this right now for folks who are just listening and I'll just read you the actual kind of ice cube question that we have that,somebody places four ice cubes and then at the start of the first minute, and then five ice cubes at the start of the second minute, how many ice cubes there are at the [00:25:30] start of the third minute,we should probably have prepared like a turkey based question,for this one, but basically the answer is zero.

[00:25:36] Alex Volkov: Oh, the ice cubes melt within a minute, and the answer is zero, and people know the answer is zero because, ice cubes melt faster than a minute. But, the,LLM starts going into math and s**t, and, just to be clear, O1 answers this question, it understands the answer is zero. Quill does not.

[00:25:53] Alex Volkov: But the reasoning process is still pretty cool and compared to like other models like you see you can see it thinking It's let me set up an equation. Oh, [00:26:00] actually, it's not correct Ah, now the equation asking for this and this and this and it goes like This is confusing Let me read the problem again.

[00:26:06] Alex Volkov: And so it tries to read the problem again. This feels Not like just spitting tokens. So Junyang, what, could you tell us like what's the difference between this and training at a regular Qwen 2. 5? So as far as I saw, this is based on Qwen 5, correct?

[00:26:27] Junyang Lin: Yeah, it is based on Qwen 2. 5 [00:26:30] 32 billion de instruct Model. Yeah, we have tried a lot of options, maybe we will release more technical details later, but I can tell you something that, we mostly simply do some, do some work on the, post training data. Because it is actually based on our previous model, so we did not change the pre training, because we are actually very confident in our pre training, because we have trained it with [00:27:00] a lot of tokens, so there should be some knowledge about reasoning there, and in Qwen 2.

[00:27:05] Junyang Lin: 5, we also have some text reasoning, relative data, in the pre training process, so we just try to see that if we can align with the behavior of such, reasoning. So we have some very simple,superfines, fine tuning, and we find that while it can generate things like that, we have done a bit like RL stuff, and we also have done something like, RFT, Rejection, [00:27:30] Finetuning, so we can add more data from it.

[00:27:33] Junyang Lin: And there are a lot of techniques, just like self aligned. We use the base language model to use in context learning to build samples for us, to just We've built something like that make the model that can reason and we found that it's really surprising. We did not do very complex stuff, but we find that it has this behavior, but we still find that there is still much room in the reinforcement learning [00:28:00] from human feedback because we found that if you add some RL, you can improve the performance very significantly, so we have some belief that Maybe we, if we have done some more in a process where we're modeling LLM critiques and also things like building more nuanced data for the multi step reasoning, the model will be much better.

[00:28:26] Junyang Lin: Yeah. But this one is interesting. You can keep [00:28:30] talking to it. It keeps talking to itself, just talking about some strange thinking and sometimes maybe I'm wrong. I will check the question again and maybe I'm wrong again and then do it again and again. And sometimes it's generally too long because we have some limitations in long text generation.

[00:28:49] Junyang Lin: I think All models have this problem, so when it reaches maybe some bound and it will turn into some crazy behaviors, it just never [00:29:00] stops generating. We just mentioned this limitation. Just

[00:29:05] Alex Volkov: to make sure folks understand, this is a preview, this is not like an official release. You guys are like, hey, this is a preview, this is a test of you guys.

[00:29:12] Alex Volkov: You guys are like trying this out, like folks should give feedback, folks should try it out. Maybe Finetune also on top of it. Yeah. There's definitely we're trying this out. This is

[00:29:21] Yam Peleg: it's like chatGPT is a research preview. It's not exactly a preview. It beats the benchmarks on so many problems.

[00:29:29] Yam Peleg: We would

[00:29:29] Junyang Lin: like [00:29:30] to make it a fun, funny stuff to make people happy. It's now Thanksgiving and people are always expecting models from us. And they're just talking that all out. where's our reasoning model or things like that. Yeah. so we showed this one to you. And.

[00:29:48] Alex Volkov: Yeah, Jan Wolfram, folks, comments about the reasoning model from Qwen.

[00:29:53] Yam Peleg: Oh, I have a lot of comments. That's a lot. I don't know if you can hear me. Yeah, Jan, [00:30:00] go ahead.

[00:30:00] Alex Volkov: There's just a delay, but we're good.

[00:30:02] Yam Peleg: Yeah, I just want to say, it's like, uh, CGPT is, uh, is a research preview. It's it's a really good thing.

[00:30:10] Yam Peleg: It's a really good model. Seriously. So, I mean, it can be a preview, but it's extremely powerful. How did you guys train this? I mean, what, what, what's the data? How did you generate it? Can you Can I just create data that looks like O1 and Finetune and it's going to work? or, like, give us some details.

[00:30:28] Yam Peleg: it's a really hard thing to [00:30:30] do. it's really, really, really successful. Sohow did you make it?

[00:30:35] Alex Volkov: Give us some details if you can, I'm saying. if you can. Don't let Yam, don't let Yam go into give some details that you cannot give details. but hey, it looks like we may have lost Junyang for a bit with some connection issues, but while he reconnects, we got Maybe he can't, maybe he can't hear details, so

[00:30:52] Wolfram Ravenwolf: They put the plug.

[00:30:53] Alex Volkov: and Wolfram, what's your, I saw your take. Let's, meanwhile, let's take a look. You did some testing for this model as well, right?

[00:30:59] Wolfram Ravenwolf: [00:31:00] Yeah. And I just ran the, the IceCube prompt and on my run, it got the zero correct.

[00:31:04] Wolfram Ravenwolf: So that is a bit of a red flag. Oh, you

[00:31:06] Alex Volkov: did get it correct.

[00:31:07] Wolfram Ravenwolf: Yeah. it was fun because it wrote, Over 10, 000 characters, but in the end it said, okay, so confusing, they all melted zero. So that worked. But of course you have to run benchmarks multiple times. I did run the MMLU Pro computer science benchmark twice.

[00:31:23] Wolfram Ravenwolf: And what is very interesting is, Also here, it generated much more tokens than any other model. The second, highest [00:31:30] number of tokens was GPT 40, the latest one, which was 160, 000 tokens for the whole benchmark. And here we have over 200, 000, 232, 000 tokens it generated. So it took me two and a half hours to run it.

[00:31:45] Wolfram Ravenwolf: And, yeah, it's an 8B model, no, a 32B model at 8 bit in my system where I was running it, because I have 48GB VRAM, so you can run it locally and look at it, it's, it's placed above the 405B [00:32:00] Lama 3. 1, it's above the big Mistral, it's above the GBT, JGBT latest, and the GBT 4. 0 from, yeah, the most recent one.

[00:32:08] Wolfram Ravenwolf: So just to recap

[00:32:09] Alex Volkov: what you're saying. On the MMLU Pro Benchmark, this is a model that you run on your Mac, or whatever PC, and it beats Llama 3. 5, 4 or 5 billion parameter on this benchmark, because it's reasoning and it's smart, it runs for longer, and it uses those test time compute, inference time [00:32:30] compute, Compute, Scaling, Loss that we talked about multiple times.

[00:32:33] Alex Volkov: It runs for longer and achieves a better score. This is like the excitement. This is the stuff. so Junyang, now that you're back with us, could you answer, or at least some of Yam's question, if you couldn't hear this before, I will repeat this for you. How? What does the data look like? can you just come up with some O1 stuff?

[00:32:51] Alex Volkov: By the way, welcome, welcome Nisten.

[00:32:53] Nisten Tahiraj: But I tried it.

[00:32:54] Introduction to the New Google Model

[00:32:54] Nisten Tahiraj: It got the Martian.Rail Train Launcher, it got it perfectly [00:33:00] on first try, and I saw that it did take it three tries, so I use this as a standard question on most models, is if you're going to launch a train from the highest mountain in the solar system, which is on Mars, and you want to accelerate it at two G's, so Still comfortable.

[00:33:21] Nisten Tahiraj: how long would that track need to be in order for you to get to orbital velocity and in order for you to get to, to leave [00:33:30] Mars gravity well? And it's a very good question because there's so many steps to solve it and you can just change it to, you can say 2. 5G and that completely changes the order of the steps for, that the model has to solve.

[00:33:42] Alex Volkov: So it's unlikely to be in the training data and it got it perfectly. It's again, it's this one, it's the new Google preview, even Sonnet takes two tries, two or three tries often to get the right answer. So,yeah, the model worked, and I had the same thing as [00:34:00] Wolfram, he did put out a lot of tokens, but again, it's pretty fast to run locally, Folks, it's a good model. It's, it, for a test preview, for something that was released, as a first, open weights reasoning model, we are very impressed.

[00:34:14] Model Performance and Availability

[00:34:14] Alex Volkov: we're gonna give Junaid, one more, one more attempt here, Junaid, I see you on the spaces. and you're as a speaker, maybe you can unmute there and speak to us through the spaces,while we try this out, I will just tell to folks that like you are, you can download this model.

[00:34:27] Alex Volkov: It's already on, OLAMA. [00:34:30] You can just like OLAMA install Quill or QWQ.it's already on OpenRouter as well. You can get it on OpenRouter. So you can like replace. you can replace whatever you use, like OpenAI, you can replace and put this model in there. it's, you can try it out in Hug Face, this is where we tried it just now.

[00:34:47] Alex Volkov: And, It's awesome. It's awesome to have this. I'm pretty sure that many people are already like trying different variations and different like fine tunes of this model. And it just like going up from here, like to get a open [00:35:00] model, 32 billion parameters, that gets, what is the score? let me take a look.

[00:35:04] Alex Volkov: The score is, I think it gets, 50 on AIME. It's ridiculous. Anybody try this on ARK Challenge, by the way? Do you guys see in your like, like tweets or whatever, the ARK Challenge? Anybody try to run this model on that and try? I would be very interested because that's that's a big prize. It's a very big prize.

[00:35:22] Alex Volkov: I'm pretty sure

[00:35:22] Eugen Cheugh: someone's trying right now. You shall think that out.

[00:35:26] Alex Volkov: I'm pretty sure somebody's trying right now. They could use a

[00:35:29] Wolfram Ravenwolf: 72B [00:35:30] version of it and maybe that gets even better. Probably does.

[00:35:35] Alex Volkov: Yeah. They're probably training a bigger model than this right now. all right folks. So with this, I think that, we've covered pretty much everything that we wanted to cover with Quill.

[00:35:46] Scaling and Model Efficiency

[00:35:46] Alex Volkov: and I think, yeah, the one thing that I wanted to show, let me just show this super quick before we move on to the next topic that we have is this, scaling kind of thing. We saw pretty much the same thing. From, from [00:36:00] DeepSeq. And then we saw pretty much the same thing also from OpenAI. The kind of the scaling confirmation, the scaling log confirmation, the next scaling log confirmation, test time compute or inference time compute works.

[00:36:11] Alex Volkov: Which basically means that the more thinking, the more tokens, the more time you give these models, the better. to think, the better their answer is. We're getting more and more confirmation for this kind of Noah Brown, I don't know, thesis, that these models actually perform [00:36:30] significantly better when you give them more tokens to think.

[00:36:32] Alex Volkov: this is incredible to me. This is like incredible because not only will we have better models with more scale, but Even though some people claim a wall has been hit, no wall has been hit. but also we now have these models that can answer better with more tokens. and this is like another, another confirmation from this.

[00:36:51] Alex Volkov: Qwen, Quail32B is now here. You can, you can now run. a, a 4 0 5 B level models, at least on [00:37:00] MMLU Pro,like wolf from here said on your computers. And shout out to our friends from, Alibaba Quinn for releasing these awesome models for us as a Thanksgiving,present.

[00:37:10] Alex Volkov: Jang, you're back with us. Let's see. maybe you're back.

[00:37:14] Junyang Lin: I don't know if you can hear me. Yes,