PLAY PODCASTS
๐Ÿ“… ThursdAI - Mar 28 - 3 new MoEs (XXL, Medium and Small), Opus is ๐Ÿ‘‘ of the arena, Hume is sounding emotional + How Tanishq and Paul turn brainwaves into SDXL images ๐Ÿง ๐Ÿ‘๏ธ

๐Ÿ“… ThursdAI - Mar 28 - 3 new MoEs (XXL, Medium and Small), Opus is ๐Ÿ‘‘ of the arena, Hume is sounding emotional + How Tanishq and Paul turn brainwaves into SDXL images ๐Ÿง ๐Ÿ‘๏ธ

ThursdAI - The top AI news from the past week ยท Alex Volkov and Paul Scotti

March 28, 20241h 35m

Audio is streamed directly from the publisher (api.substack.com) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.

Show Notes

Hey everyone, this is Alex and can you believe that we're almost done with Q1 2024? March 2024 was kind of crazy of course, so I'm of course excited to see what April brings (besides Weights & Biases conference in SF called Fully Connected, which I encourage you to attend and say Hi to me and the team!)

This week we have tons of exciting stuff on the leaderboards, say hello to the new best AI in the world Opus (+ some other surprises), in the open source we had new MoEs (one from Mosaic/Databricks folks, which tops the open source game, one from AI21 called Jamba that shows that a transformers alternative/hybrid can actually scale) and tiny MoE from Alibaba, as well as an incredible Emotion TTS from Hume.

I also had the pleasure to finally sit down with friend of the pod Tanishq Abraham and Paul Scotti from MedArc and chatted about MindEye 2, how they teach AI to read minds using diffusion models ๐Ÿคฏ๐Ÿง ๐Ÿ‘๏ธ

Thank you for reading ThursdAI - Recaps of the most high signal AI weekly spaces. This post is public so feel free to share it.

TL;DR of all topics covered:

* AI Leaderboard updates

* Claude Opus is number 1 LLM on arena (and in the world)

* Claude Haiku passes GPT4-0613

* ๐Ÿ”ฅ Starling 7B beta is the best Apache 2 model on LMsys, passing GPT3.5

* Open Source LLMs

* Databricks/Mosaic DBRX - a new top Open Access model (X, HF)

* ๐Ÿ”ฅ AI21 - Jamba 52B - Joint Attention Mamba MoE (Blog, HuggingFace)

* Alibaba - Qwen1.5-MoE-A2.7B (Announcement, HF)

* Starling - 7B that beats GPT3.5 on lmsys (HF)

* LISA beats LORA as the frontrunner PeFT (X, Paper)

* Mistral 0.2 Base released (Announcement)

* Big CO LLMs + APIs

* Emad leaves stability ๐Ÿฅบ

* Apple rumors - Baidu, Gemini, Anthropic, who else? (X)

* This weeks buzz

* WandB Workshop in SF confirmed April 17 - LLM evaluations (sign up here)

* Vision & Video

* Sora showed some demos by actual artists, Air Head was great (Video)

* Tencent Aniportait - generate Photorealistic Animated avatars (X)

* MedArc - MindEye 2 - fMRI signals to diffusion models (X)

* Voice & Audio

* Hume demos EVI - empathic voice analysis & generation (X, demo)

* AI Art & Diffusion & 3D

* Adobe firefly adds structure reference and style transfer - (X, Demo)

* Discussion

* Deep dive into MindEye 2 with Tanishq & Paul from MedArc

* Is narrow finetuning done-for with larger context + cheaper prices - debate

๐Ÿฅ‡๐Ÿฅˆ๐Ÿฅ‰Leaderboards updates from LMSys (Arena)

This weeks updates to the LMsys arena are significant. (Reminder in LMsys they use a mix of MT-Bench, LLM as an evaluation and user ELO scores where users play with these models and choose which answer they prefer)

For the first time since the Lmsys arena launched, the top model is NOT GPT-4 based. It's now Claude's Opus, but that's not surprising if you used the model, what IS surprising is that Haiku, it's tiniest, fastest brother is now well positioned at number 6, beating a GPT4 version from the summer, Mistral Large and other models while being dirt cheap.

We also have an incredible show from the only Apache 2.0 licensed model in the top 15, Starling LM 7B beta, which is now 13th on the chart, with incredible finetune of a finetune (OpenChat) or Mistral 7B. ๐Ÿ‘

Yes, you can now run a GPT3.5 beating model, on your mac, fully offline ๐Ÿ‘ Incredible.

Open Source LLMs (Welcome to MoE's)

Mosaic/Databricks gave us DBRX 132B MoE - trained on 12T tokens (X, Blog, HF)

Absolutely crushing the previous records, Mosaic has released the top open access model (one you can download and run and finetune) in a while, beating LLama 70B, Grok-1 (314B) and pretty much every other non closed source model in the world not only on metrics and evals, but also on inference speed

It uses a Mixture of Experts (MoE) architecture with 16 experts that each activate for different tokens. this allows it to have 36 billion actively parameters compared to 13 billion for Mixtral. DBRX has strong capabilities in math, code, and natural language understanding.

The real kicker is the size, It was pre-trained on 12 trillion tokens of text and code with a maximum context length of 32,000 tokens, which is just incredible, considering that LLama 2 was just 2T tokens. And the funny thing is, they call this DBRX-medium ๐Ÿ‘€ Wonder what large is all about.

Graph credit Awni Hannun from MLX (Source)

You can play with the DBRX here and you'll see that it is SUPER fast, not sure what Databricks magic they did there, or how much money they spent (ballpark of ~$10M) but it's truly an awesome model to see in the open access! ๐Ÿ‘

AI21 releases JAMBA - a hybrid Transformer + Mamba 58B MoE (Blog, HF)

Oh don't I love #BreakingNews on the show! Just a few moments before ThursdAI, AI21 dropped this bombshell of a model, which is not quite the best around (see above) but has a few very interesting things going for it.

First, it's a hybrid architecture model, capturing the best of Transformers and Mamba architectures, and achieving incredible performance on the larger context window size (Transformers hardware requirements scale quadratically with attention/context window)

AI21 are the first to show (and take the bet) that hybrid architecture models actually scale well, and are performant (this model comes close to Mixtral MoE on many benchmarks) while also being significantly cost advantageous and faster on inference on longer context window. In fact they claim that Jamba is the only model in its size class that fits up to 140K context on a single GPU! ย 

This is a massive effort and a very well received one, not only because this model is Apache 2.0 license (thank you AI21 ๐Ÿ‘) but also because this is now the longest context window model in the open weights (up to 256K) and we've yet to see the incredible amount of finetuning/optimizations that the open source community can do once they set their mind to it! (see Wing from Axolotl, add support for finetuning Jamba the same day it released)

Can't wait to see the benchmarks for this model once it's properly instruction fine-tuned.

Small MoE from Alibaba - Qwen 1.5 - MoE - A2.7B (Blog, HF)

What a week for Mixture of Experts models, we got an additional MoE from the awesome Qwen team, where they show that training a A2.7B (the full model is actually 14B but only 2.7B are activated at the same time) is cheaper, 75% reduction in training costs and 174% improvement in inference speed!

Also in open source:

Lisa beats LORA for the best parameter efficient training

๐Ÿ“ฐ LISA is a new method for memory-efficient large language model fine-tuning presented in a Hugging Face paper๐Ÿ’ช LISA achieves better performance than LoRA with less time on models up to 70B parameters๐Ÿง  Deep networks are better suited to LISA, providing more memory savings than shallow networks๐Ÿ’พ Gradient checkpointing greatly benefits LISA by only storing gradients for unfrozen layers๐Ÿ“ˆ LISA can fine-tune models with up to 7B parameters on a single 24GB GPU๐Ÿš€ Code implementation in LMFlow is very simple, only requiring 2 lines of code๐Ÿค” LISA outperforms full parameter training in instruction following tasks

Big CO LLMs + APIs

Emad departs from Stability AI.

In a very surprising (perhaps unsurprising to some) move, Emad Mostaque, founder and ex-CEO of stability announces his departure, and focus on decentralized AI

For me personally (and I know countless others) we all started our love for Open Source AI with Stable Diffusion 1.4, downloading the weights, understanding that we can create AI on our machines, playing around with this. It wasn't easy, stability was sued to oblivion, I think LAION is still down from a lawsuit but we got tons of incredible Open Source from Stability, and tons of incredible people who work/worked there.

Big shoutout to Emad and very excited to see what he does next

Throwback to NEURIPS where Emad borrowed my GPU Poor hat and wore it ironically ๐Ÿ˜‚ Promised me a stability hat but... I won't hold it against it him ๐Ÿ™‚

This weeks Buzz (What I learned with WandB this week)

I'm so stoked about the workshop we're running before the annual Fully Connected conference in SF! Come hear about evaluations, better prompting with Claude, and tons of insights that we have to share in our workshop, and of course, join the main event on April 18 with the whole Weights & Biases crew!

Vision

Sora was given to artists, they created ... art

Here's a short by a company called ShyKids who got access to SORA alongside other artists, it's so incredibly human, and I love the way they used storytelling to overcome technological issues like lack of consistency between shots. Watch it and enjoy imagining a world where you could create something like this without living your living room.

This also shows that human creativity and art is still deep in the middle of all these creations, even with tools like SORA

MindEye 2.0 - faster fMRI-to-image

We had the awesome pleasure to have Tanishq Abraham and Paul Scotti, who recently released a significantly bette version of fMRI to Image model called MindEye 2.0, shortening the time it takes from 40 hours of data to just 1 hour of fMRI data. This is quite remarkable and I would encourage you to listen to the full interview that's coming out this Sunday on ThursdAI.

Voice

Hume announces EVI - their Empathic text to speech mode (Announcement, Demo)

This one is big folks, really was blown away (see my blind reaction below), Hume announced EVI, a text to speech generator that can reply with emotions! It's really something, and it has be seen to experience. This is in addition to Hume already having an understanding of emotions via voice/imagery, and the whole end to end conversation with an LLM that understands what I feel is quite novel and exciting!

The Fine-Tuning Disillusionment on X

Quite a few folks noticed a sort of disillusionment from finetuning coming from some prominent pro open source, pro fine-tuning accounts leading me to post this:

And we of course had to have a conversation about it, as well as Hamel Husain wrote this response blog called "Is Finetuning still valuable"

I'll let you listen to the conversation, but I will say, like with RAG, finetuning is a broad term that doesn't apply evenly across the whole field. For some narrow use-cases, it may simply be better/cheaper/faster to deliver value to users with using smaller cheaper but longer context models and just provide all the information/instructions to the model in the context window.

From the other side, we had data privacy concerns, RAG over a finetune model can absolutely be better than just a simple RAG, and just a LOT more considerations before we make this call that fine-tuning is not "valuable" for specific/narrow use-cases.

This is it for this week folks, another incredible week in AI, full of new models, exciting developments and deep conversations! See you next week ๐Ÿ‘

Transcript Below:

[00:00:00] Alex Volkov: Hey, this is ThursdAI, I'm Alex Volkov, and just a little bit of housekeeping before the show. And what a great show we had today. This week started off slow with some, some news, but then quickly, quickly, many open source and open weights releases from Mosaic and from AI21 and from Alibaba. We're starting to pile on and at the end we had too many things to talk about as always.

[00:00:36] Alex Volkov: , I want to thank my co hosts Nisten Tahirai, LDJ, Jan Peleg, And today we also had Robert Scoble with a surprise appearance and helped me through the beginning. We also had Justin, Junyang, Lin from Alibaba and talk about the stuff that they released from Quen. And after the updates part, we also had two deeper conversations at the second part of this show.

[00:01:07] Alex Volkov: The first one was with Danish Matthew Abraham. and Paul Gotti from MedArc about their recent paper and work on MindEye2, which translates fMRI images using diffusion models into images. So fMRI signals into images, which is mind reading, basically, which is incredible. So a great conversation, and it's always fun to have Tanish on the pod.

[00:01:37] Alex Volkov: And the second conversation stemmed from a recent change in the narrative or a sentiment change in our respective feeds about fine tuning in the era of long context, very cheap models like Claude. And that conversation is also very interesting to listen to. One thing to highlight is this week we also saw the first time GPT 4 was toppled down from the Arena, and we now have the, a change in regime of the best AI possible, uh, which is quite, quite stark as a change, and a bunch of other very exciting and interesting things in the pod today.

[00:02:21] Alex Volkov: So, as a brief reminder, if you want to support the pod, the best way to do this is to share it with your friends and join our live recordings every ThursdAI on X. But if you can't sharing it with a friend, sharing a subscription from Substack, or subscribing, uh, to a pod platform of your choice is a great way to support this pod.

[00:02:48] Alex Volkov: With that, I give you March 28th, ThursdAI.

[00:02:52] Alex Volkov: Hello hello everyone, for the second time? we're trying this again, This is ThursdayAI, now you March 28th. My name is Alex Volkov. I'm an AI evangelist with Weights Biases. And for those of you who are live with us in the audience who heard this for the first time, apologies, we just had some technical issues and hopefully they're sorted now.

[00:03:21] Alex Volkov: And in order to make sure that they're sorted, I want to see that I can hear. Hey Robert Scoble joining us. And I usually join their spaces, but Robert is here every week as well. How are you, Robert? Robert.

[00:03:35] Robert Scoble: great. A lot of news flowing through the system. New

[00:03:39] Alex Volkov: we have, a lot of updates to do.

[00:03:43] Robert Scoble: photo editing techniques. I mean, the AI world is just hot and

[00:03:48] Robert Scoble: going.

[00:03:49] Alex Volkov: A week to week, we feel the excited Acceleration and I also want to say hi to Justin Justin is the core maintainer of the Qwen team. Qwen, we've talked about, and we're going to talk about today, because you guys have some breaking news. But also, you recently started a new thing called OpenDevon. I don't know if we have tons of updates there, but definitely folks who saw Devon, which we reported on, what a few weeks ago, I think? Time moves really fast in this AI world. I think, Justin, you posted something on X, and then it started the whole thing. So you want to give , two sentences about OpenDevon.

[00:04:21] Justin Lin: Yeah, sure. I launched the Open Devon project around two weeks ago because we just saw Devon. It is very popular. It is very impressive. And we just think that Whether we can build something with the open source community, work together, build an agent style, or do some research in this. So we have the project, and then a lot of people are coming in, including researchers and practitioners in the industry.

[00:04:46] Justin Lin: So we have a lot of people here. Now we are working generally good. Yeah You can see that we have a front end and back end and a basic agent system. So we are not far from an MVP So stay tuned

[00:05:01] Alex Volkov: Amazing. so definitely Justin when there's updates to update, you know where to come on Thursday. I, and but also you have like specific when updates that we're going to get to in the open source open source area So folks I'm going to run through everything that we have to cover and hopefully we'll get to everything.

[00:05:18] Alex Volkov: ,

[00:05:18] TL;DR - March 28th

[00:05:18] Alex Volkov: here's the TLDR or everything that's important in the world of AI that we're going to talk about for the next two hours, starting now. right So we have a leaderboard update, and I thought this is gonna be cool to just have a leaderboard update section because when big things are happening, on the leaderboards, and specifically I'm talking here about The lmsys Arena leaderboard the one that also does EmptyBench, which is, LLM, Judges, LLMs, but also multiple humans interact with these models and in two windows and then they calculate ELO scores, which correlates the best of the vibes evaluations that We all know and love and folks, Claude Opus is the number one LLM on Arena right now. Claude Appus, as the one that we've been talking about, I think, since week to week to week to week is

[00:06:05] Alex Volkov: now the number one LLM in the world and it's quite impressive, and honestly, in this instance, the arena was like, lagging behind all our vibes We talked about this already, we felt it on AXE and on LokonLama and all other places. so I think it's a big deal it's a big deal because for the first time since, I think forever it's clear to everyone that GPT4 was actually beat now not only that, Sonnet, which is their smaller version, also beats some GPT 4's version. and Haiku, their tiniest, super cheap version, 25 cents per million tokens. you literally can use Haiku the whole day, and at the end of the month, you get I don't know, 5 bucks. Haiku also passes one of the versions of GPT 4 for some of the vibes and Haiku is the distilled Opus version, so that kind of makes sense.

[00:06:53] Alex Volkov: But it's quite incredible that we had this upheaval and this change in leadership in the LMS arena, and I thought it's worth mentioning here before. So let's in the open source LLM stuff, we have a bunch of updates here. I think the hugest one yesterday, Databricks took over all of our feeds the Databricks bought this company called Mosaic, and we've talked about Mosaic multiple times before and now they're combined forces and for the past.

[00:07:17] Alex Volkov: year they've been working on something called DBRX, and now it's we got, in addition to the big company models that's taken over, so cloud Opus took over GPT 4, We now have a new open access model that takes over as the main lead. and they call this DPRX medium, which is funny. It's 132 billion parameter language model. and it's a mixture of experts with I think, 16 experts, and it's huge, and it beats Lama270b, it beats Mixtral, it beats Grock on at least MMLUE and human Evil scores and so it's really impressive to see, and we're gonna, we're gonna chat about DPRx as well and there's a bunch of stuff to cover there as well and Justin, I think you had a thread that we're gonna go through, and you had a great reaction.

[00:08:02] Alex Volkov: summary, so we're gonna cover that just today, what 30 minutes before this happened we have breaking news. I'm actually using breaking news here in the TLDR section because

[00:08:11] Alex Volkov: why [00:08:20] not?

[00:08:22] Alex Volkov: So AI21, a company from Israel releases something incredible. It's called Jamba. It's 52 billion parameters. but the kicker is it's not a just a Transformer It's a joint architecture from joint attention and Mamba. And we've talked about Mamba and we've talked about Hyena. Those are like state space models that they're trying to do a Competition to Transformers architecture with significantly better context understanding. and Jamba 52 looks quite incredible. It's also a mixture of experts. as you notice, we have a bunch of mixture of experts here. and It's it's 16 experts with two active generation It supports up to 256K context length and quite incredible. So we're going to talk about Jamba.

[00:09:03] Alex Volkov: We also have some breaking news So in the topic of breaking news Junyang, you guys also released something. you want to do the announcement yourself? It would be actually pretty cool.

[00:09:13] Justin Lin: Yeah, sure. Yeah just now we released a small MOE model which is called QWEN 1. 5 MOE with A2. 7B, which means we activate, uh, 2. 7 billion parameters. Its total parameter is, uh, 14 billion, but it actually activates around, uh, 2. 7 billion parameters

[00:09:33] Alex Volkov: thanks Justin for breaking this down a little bit. We're going to talk more about this in the open source as we get to this section I also want to mention that, in the news about the Databricks, the DBRX model, something else got lost and was released actually on Thursday last week.

[00:09:49] Alex Volkov: We also didn't cover this. Starling is now a 7 billion parameter model that beats GPT 3.5 on LMsys as Well so Starling is super cool and we're going to add a link to this and talk about Starling as Well Stability gave us A new stable code instruct and Stability has other news as well that we're going to cover and it's pretty cool.

[00:10:07] Alex Volkov: It's like a very small code instruct model that beats the Starchat, like I think 15b as well. So we got a few open source models. We also got a New method to Finetune LLMs, it's called Lisa if you guys know what LORA is, there's a paper called Lisa, a new method for memory efficient large language model Fine tuning.

[00:10:25] Alex Volkov: And I think this is it. Oh no, there's one tiny news in the open source as well mistral finally gave us Mistral 0. 2 base in a hackathon that they participated in with a bunch of folks. on the weekend, and there was a little bit of a confusion about this because we already had Mistral 0.

[00:10:43] Alex Volkov: 2 instruct model, and now they released this base model that many finetuners want the base model. so just worth an update there. In the big companies LLMs and APIs, I don't think We have tons of stuff besides, Cloud opus as we said, is the number one LLM in the world. The little bit of news there is that Emmad Mostak leaves stability AI and that's like worthwhile Mentioning because definitely Imad had a big effect, on my career because I started my whole thing with stable Diffusion 1. 4 release. and we also have some Apple rumors where as you guys remember, we've talked about Apple potentially having their own model generator, they have a bunch of Open source that they're working on, they have the MLX platform, we're seeing all these signs. and then, this week we had rumors that Apple is going to go. with Gemini, or sorry, last week, we had rumors that Apple is going to go with Gemini, this week, we had rumors that Apple is going to sign with Entropic, and then now Baidu, And also this affected the bunch of stuff. so it's unclear, but worth maybe mentioning the Apple rumors as well in this week's buzz, the corner where I talk about weights and biases, I already mentioned, But maybe I'll go a little bit in depth that we're in San Francisco on April 17th and 18th, and the workshop is getting filled up, and it's super cool to see, and I actually worked on the stuff that I'm going to show, and it's super exciting, and it covers pretty much a lot of the techniques, that we cover here on ThursdAI as well.

[00:12:05] Alex Volkov: In the vision and video category, This was a cool category as well, because Sora for the first time, the folks at Sora they gave Sora to artists and they released like a bunch of actual visual demos that look mind blowing. Specifically Airhead, i think was mind blowing. We're gonna cover this a little bit.

[00:12:21] Alex Volkov: If you guys remember Emo, the paper that wasn't released on any code that took One picture and made it sing and made it an animated character. Tencent released something close to that's called AnimPortrait. but Any portrait doesn't look as good as emo, But actually the weights are there.

[00:12:36] Alex Volkov: So you can now take one image and turn it into a talking avatar and the weights are actually open and you can use it and it's pretty cool. and in the vision and video, I put this vision on video as well, but MedArk released MindEye 2, and we actually Have a chat closer to the second hour with with yeah, with Tanishq and Paul from AdArc about MindEye 2, which is reading fMRI signals and turning them into images of what you saw, which is crazy. And I Think the big update from yesterday as Well from voice and audio category is that Hume, a company called Hume, demos something called EVI which is their empathetic voice analysis and generation model, which is crazy I posted a video about this yesterday on my feed. you talk to this model, it understands Your emotions. Apparently this is part of what Hume has on the platform. you can actually use this right now but now they already, they showed a 11 labs competitor, a text to speech model that actually can generate voice in multiple emotions. and it's pretty like stark to talk to it. and it answers sentence by sentence and it changes its emotion sentence from by sentence. and hopefully I'm going to get access to API very soon and play around with this. really worth talking about. Empathetic or empathic AIs in the world of like agentry and everybody talks about the, the

[00:13:53] Alex Volkov: AI therapist.

[00:13:54] Alex Volkov: So we're going to cover Hume as well. I think a very brief coverage in the AI art and diffusion Adobe Firefly had their like annual conference Firefly is a one year old and they added some stuff like structure reference and style transfer and one discussion at the end of the show IS narrow fine tuning done for for large, with larger contexts and cheaper prices for Haiku. we had the sentiment on our timelines, and I maybe participated in this a little bit, and so we had the sentiment and , I would love a discussion about Finetuning, because I do see quite A few prominent folks like moving away from this concept of Finetuning for specific knowledge stuff.

[00:14:32] Alex Volkov: Tasks, still yes but for knowledge, it looks like context windows the way they're evolving. They're going to move towards, potentially folks will just do RAG. So we're going to have a discussion about fine tuning for specific tasks, for narrow knowledge at the end there. and I think this is everything that We are going to talk about here. That's a lot. So hopefully we'll get to a bunch of it.

[00:14:51] Open Source -

[00:14:51] Alex Volkov: and I think we're going to start with our favorite, which is open source

[00:15:12] Alex Volkov: And while I was giving the TLDR a friend of the pod and frequent co host Yam Pelleg joined us. Yam, how are you?

[00:15:18] Yam Peleg: Hey, how are you doing?

[00:15:19] Alex Volkov: Good! I saw something that you were on your way to to visit. our friends at AI21. Is that still the

[00:15:24] Alex Volkov: awesome, awesome.

[00:15:25] Yam Peleg: 10 I'll be there in 10, 20 minutes.

[00:15:27] Alex Volkov: Oh, wow Okay. so we have 10, 20 minutes. and if you guys are there and you want to like hop on, you're also welcome so actually while you're here, I would love to hear from you we, We have two things to discuss. They're major in the open source and like a bunch of other stuff to cover I think the major like the thing that took over all our timelines is that Mosaic is back and Databricks, the huge company that does like a bunch of stuff. They noticed that Mosaic is doing very incredible things. and around, I don't know, six months ago, maybe almost a year ago, they Databricks acquired Mosaic. and Mosaic has been quiet since Then just a refresher for folks who haven't followed us for for longest time Mosaic released a model that was for I don't know, like three months, two months was like the best 7 billion parameter model called mpt and

[00:16:10] DBRX MoE 132B from Mosaic

[00:16:10] Alex Volkov: Mosaic almost a year ago, I think in May also broke the barrier of what we can consider a large context window so they announced a model with 64 or 72k context window and they were the first before cloud, before anybody else. and since then they've been quiet. and they have an inference platform, they have a training platform, they have a bunch of stuff that Databricks acquired. and yesterday they came out with a bang. and this bang is, they now released the top open access model, the BITS LLAMA, The BITS Mixtral, the BITS Grok1, The BITS all these things [00:16:40] And it's huge. It's a 132 billion parameter MOE that they've trained on I don't know why Seven

[00:16:49] Alex Volkov: 12,

[00:16:49] Yam Peleg: 12,

[00:16:50] Alex Volkov: jesus Christ, 12 trillion parameters.

[00:16:53] Alex Volkov: This is like a huge I don't think we've seen anything come close to this amount of training, Right

[00:16:59] Yam Peleg: Oh yeah, it's insane. I mean, the next one is six of Gemma, the next one we know. We don't know about Mistral, but the next one we know is six trillion of Gemma, and it's already nuts. So, but Yeah. It's a much larger model. I think the interesting thing to say is that it's the age of MOE now everyone is really seeing a mixture of experts and the important thing to, to pay attention to is that they are not entirely the same.

[00:17:27] Yam Peleg: So there is still exploration in terms of the architecture or of small tweaks to the MOE, how to do them, how to actually implement them better, what works better, what is more efficient and so on and so forth. That we just heard about Qwen MOE, which is also a little bit different than the others.

[00:17:44] Yam Peleg: So there is still exploration going on and just looking at what is coming out and everything turns out to be at the ballpark of Mistral and Mixtral just makes me more curious. Like, how did they do this? How everything is just on, on the same ballpark as them? How did they manage to train such powerful models?

[00:18:04] Yam Peleg: Both of them. And Yeah.

[00:18:06] Yam Peleg: I just want to say that because it's amazing to see.

[00:18:10] Alex Volkov: So, so just to highlight, and I think we've been highlighting this When Grok was released, we've been highlighting and now we're highlighting This as well. A significantly smaller model from Mixtral is still up there. It's still given the good fight, even though these models like twice and maybe three times as large sometimes and have been trained. So we don't know how much Mixtral was trained on right but Mixtral is still doing The good fight still after all this time which is quite incredible. and we keep mentioning this when Grok was released, we mentioned this. And now when this was released, we mentioned this as well.

[00:18:38] Alex Volkov: It's. What else should we talk about in DBRX? Because I think that obviously Databricks want to show off the platform. Nisten, go ahead. Welcome, by the way. You want to give us a comment about DBRX as well? Feel free.

[00:18:51] Nisten Tahiraj: Hey guys, sorry I'm late. I was stuck debugging C and it finally worked. I just lost a good time. I used DBRX yesterday. I was comparing it I used it in the LMTS arena. And then I opened the Twitter space and told people to use it. And now it just hit rate limits so you can't use it anymore. Yeah.

[00:19:11] Nisten Tahiraj: It was pretty good. I very briefly did some coding example. It felt better than than Code Llama to me. It wasn't as good as Cloud Opus stuff, but it did give me working gave me working bash scripts. So, yeah, in the very brief, short amount of time I use it, it seemed pretty good, so,

[00:19:31] Alex Volkov: Yep.

[00:19:32] Nisten Tahiraj: that's about it.

[00:19:33] Nisten Tahiraj: As for the Mistral and Mixtral question, so, I use Mistral large a lot, I use I use medium a lot, And the 70s, and the Frankensteins of the 70s, and they all start to feel the same, or incremental over each other. It's just the data. It's just the way they feed it. They feed this thing, and the way they raise it, I think it's it's all they're all raised the same way in the same data.

[00:20:03] Nisten Tahiraj: Yeah, the architecture makes some difference, but the one thing that you notice is that it doesn't get that much better with the much larger models. So it's just the data.

[00:20:20] Justin Lin: That's what I think it is.

[00:20:21] Alex Volkov: I want to ask Justin to also comment on this, because Justin, you had a thread that

[00:20:24] Alex Volkov: had a great coverage as well. What's your impressions from DBRX and kind of the size and the performance per size as well?

[00:20:32] Justin Lin: Yeah, the site is pretty large and it activates a lot of parameters. I remember it's 36 billion and the model architecture is generally fine. Actually, I talked to them a few times. around three months ago, last December introduced Quent2Dem and I accidentally saw it yesterday there are some common senses.

[00:20:57] Justin Lin: I think it is really good. They use TIC token tokenizer with the GPT2BP tokenizer. Recently I have been working with LLAMA tokenizer and the sentence piece tokenizer, well, makes me feel sick. Yeah. It's complicated. Yeah, but the GPT BPE tokenizer, because I have been working with BPE tokenizer years ago, so everything works great.

[00:21:22] Justin Lin: And we were just, for Qwen 1. 5, we just changed it from the implementation of TIP token to the GPT 2 BPE tokenizer by Hugging Face. It is simple to use. I think it's good to change the tokenizer. And it's also good to have the native chat ML format so that I think in the future people are going to use this chat ML format because the traditional chat formats like human assistant, there are a lot of risks in it.

[00:21:53] Justin Lin: So chat ML format is generally good. I think they have done a lot of great choices, but I'm not that, Impressed by their performance in the benchmark results, although benchmarks are not that important, but it's a good indicator. For example, when you look at its MMLU performance, I expect it to be, well, if you have trained it really good.

[00:22:19] Justin Lin: I haven't trained a 100 billion MOE model, but I expect it to be near 80. It is just 73 with 12 trillion tokens. I don't know if they repeat the training epics or they have diverse 12 trillion tokens. They didn't share the details, but I think it could be even better. I am relatively impressed by their coding performance, just as Nisten said.

[00:22:47] Justin Lin: The coding capability looks pretty well, but then I found that well?

[00:22:53] Justin Lin: DBRX Instruct because you can improve and instruct model to a really high level at human eval, but, it's hard for you to improve it for the base model. I'm not pretty sure maybe I need to try more, but it's generally a very good model.

[00:23:10] Alex Volkov: Yeah, absolutely. We got the new contender for the Open weights, open source. So the LLAMA folks are probably like thinking about, the release date it's very interesting what LLAMA will come out with. Notable that this is only an LLM. There's nothing like, there's no multimodality here. and the rumors are the LLAMA will hopefully will be multimodal. so whatever comparison folks do and something like like GPT 4, it's also notable that this is not multi modal yet, this is just text. One thing I will say is that they call this DBRX Medium which hints at potentially having a DBLX, DBRX Large or something, and also something that was hidden and they didn't give it, yet they retrained MPT.

[00:23:48] Alex Volkov: Yam, I think you commented on this and actually Matei Zaharia, the chief scientist there commented on, on, on your thread. They retrained the MPT7B, which was like for a while, the best 7 billion parameter model almost a year ago. and they said that it cost them like twice less to train the same model, something like this, which I thought it was notable as well.

[00:24:07] Alex Volkov: I don't know, Yam, if you want to, if you want to chime in on The yeah.

[00:24:10] Yam Peleg: The interesting thing here is that I mean, it's obvious to anyone in the field that you can, making the model much, much, much better if you get better data. So, what they basically say, what they basically show with actions is that if you have, you can even make the model even twice as better or twice as cheaper to train depending on how you look at it, just by making the data better.

[00:24:35] Yam Peleg: And my own comment on this is that at the moment, to the best of my knowledge Better, better data is something that is not quite defined. I mean, there is a lot of there is a lot of intuition, there are, I think big things when you look at broken data, it's broken. But it's really hard to define what exactly is better data apart [00:25:00] from a deduplication and all of the obvious.

[00:25:03] Yam Peleg: It's very hard to define what exactly is the influence of specific data on performance down the line. So, so it's really interesting to hear from people that have done this and made a model twice as better. What exactly did they do? I mean, because they probably are onto something quite big to get to these results.

[00:25:27] Yam Peleg: Again, it's amazing to see. I mean, it's just a year, maybe even less than a year of progress. I think MPT is from May. If I remember, so it's not even a year of progress and we already have like twice as better models and things are progressing

[00:25:42] Alex Volkov: Worth mentioning also that Databricks not only bought Mosaic, they bought like a bunch of startups, lilac, the friends from Lilac the, we had the folks from Lilac, IL and Daniel here on the pod. And we talked about how important data their data tools specifically is. and they've been a big thing in open source.

[00:25:58] Alex Volkov: All these folks from Databricks, they also highlight like how much Li help them understand their data. very much. so I'm really hoping that they're going to keep Lilac around and free to use as well one last thing that I want to say, it's also breaking news, happened two hours ago, the author of Megablocks, The training library from MOEs, Trevor gale I think he's in DeepMind, he has now given Databricks the mega Blocks library.

[00:26:23] Alex Volkov: So Databricks is also taking over and supporting the mega blocks training library for Moes. that is they say out for firms the next best library for Moes as well and there was a little bit of a chat where Arthur Mech from Mistro said, Hey, welcome to the party. And then somebody replied and said, you are welcome and then they showed the kind of the core contributors to the mega blocks library. And a lot of them are, folks from Databricks. and so now they've taken over this library.

[00:26:50] AI21 - JAMBA - hybrid Transformer/Mamba Architecture 52B MoE

[00:26:50] Alex Volkov: So yes MOE seems to be a big thing and now let's talk about the next hot MOE AI 21. The folks that I think the biggest like lab for AI in Israel, they released something called Jamba, which is a 52 billion parameter, MOE. and the interesting thing about Jamba is not that it's an MOE is that it's a Mamba and joint attention. so it's like a, it's a mamba transformer. Is that what it is? It's a combined architecture. We've talked about state space models a little bit here, and we actually talked with the author Eugene from RWKV, and we've mentioned Hyena from Together AI, and we mentioned Mamba before and all I remember that we mentioned is that those models, the Mamba models, still don't get the same kind of performance and now we're getting this like 52 billion parameter mixture of excerpt model that does. Quite impressive on some numbers and comes close to LLAMA70B even, which is quite Impressive MMLU is almost 70, 67%. I don't see a human eval score. I don't think they added this. But they Have quite impressive numbers across the board for something that's like a New architecture.

[00:27:52] Alex Volkov: 50 billion parameters with 12 active and what else is interesting here? The New architecture is very interesting. it supports up to 256. thousand context length, which is incredible. Like this Open model now Beats cloud 2 in just the context length, which is also incredible. Just to remind you Databricks, even though they released like a long context model before Databricks DBRX is 32, 32, 000.

[00:28:15] Alex Volkov: This is 256. And not only does it support 256 because of its Unique architecture They can fit up to 140k contexts on a single A180 GB GPU. I know I'm saying a lot of numbers. Very fast, But if you guys remember, for those of you who frequent the pod, we've talked with folks from , the yarn scaling method. and the problem with the context window in Transformers is that the more context you have the more resources it basically takes in a very basic thing. And so the SSM models and the Mamba architecture, they specifically focus on lowering the requirements for long context. and this model gets three times as throughput on long context compared to Mistral.

[00:28:57] Alex Volkov: 8 times 7b, compared to Mixtral, basically. so very exciting, yeah, you wanna comment on this I know you're like almost there, meeting with the guys but Please give us the comments,

[00:29:07] Yam Peleg: I'm there. I'm there in five minutes, so I can maybe if time works towards favour, maybe I can even get you the people on the pod

[00:29:14] Alex Volkov: That'd be incredible.

[00:29:15] Yam Peleg: I'm just, yeah, what what is important here, in my opinion, is that first, I mean, absolutely amazing to see the results.

[00:29:23] Yam Peleg: But what was not known to this point is whether or not those types of models scale. to these sizes. We had smaller Mambas and they were, they looked really promising, but we were at the point where, okay, it looks promising. It looks like it could be at the same ballpark of transformers, but to test this out, someone need to just invest a lot of money into the compute and just see what the results they get.

[00:29:53] Yam Peleg: And it's a risk. You don't know what you're going to get if you're going to do it. And it turns out that you get a really good model at the same ballpark. Maybe slightly less performant as a transformer, but it is expectable. The thing the thing worth mentioning here is that Mamba the Mamba architecture is way more efficient in terms of context size.

[00:30:15] Yam Peleg: As you just said, transformers are quadratic in terms of complexity. When you increase the context. So you have if you have two tokens, you need you need four times that you can say the memory. And if you have four tokens, you need 16 and it just goes on and on and it just explodes, which is why context length is such a problem but Mamba scales much more friendly, memory friendly, you can say.

[00:30:39] Yam Peleg: So, but the thing is that you do pay with the performance of the model. So. What you, what people do is a hybrid between the two, so you can find some sweet spot where you don't just use so much memory and yet you don't have the performance degrade that bad. And I mean, yeah, it's a risk. At the end of the day, you need to train, training such a large model is a lot of money, is a lot of money in terms of compute.

[00:31:06] Yam Peleg: And they did it, released it in Apache 2, which is amazing for everyone to use. And proving for, to everyone that, all right, if you follow this recipe, you get this result. Now people can build on top of that and can train maybe even larger model or maybe even, maybe just use this model. I'm, I didn't try it yet, but I think it's an incredible thing to try because it's it's not the same as Mixtral.

[00:31:33] Yam Peleg: Mixtral is a little bit better, but it's at the same ballpark as Mixtral, but you get way more context there. At your home on a small GPU for cheap. It's amazing.

[00:31:41] Alex Volkov: and Mixtral specifically,

[00:31:43] Yam Peleg: potential.

[00:31:45] Alex Volkov: thanks Yamin, I just want to highlight that Mixtral Is this like amazing model that WE compare models like three times the size to it, and they barely beat Mixtral. We talked about this when Grok 1 was released, we now talked about this when DBRX was released with like

[00:31:57] Alex Volkov: 12 trillion parameters in data.

[00:32:00] Alex Volkov: Mixtral is this basically like the