PLAY PODCASTS
Interconnects

Interconnects

152 episodes — Page 4 of 4

Where 2024’s “open GPT4” can’t match OpenAI’s

And why the comparisons don't really matter. Repeated patterns in the race for reproducing ChatGPT, another year of evaluation crises, and people who will take awesome news too far.This is AI generated audio with Python and 11LabsSource code: https://github.com/natolambert/interconnects-toolsOriginal post: https://www.interconnects.ai/p/open-gpt4-limitations00:00 Where 2024's "open GPT4" can't match OpenAI's03:19 Models vs. products04:51 RLHF progress: Revisiting Llama 2's release and potential in 202408:30 Smaller scale open RLHF10:33 Opportunities12:24 Commentary This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit www.interconnects.ai/subscribe

Jan 5, 202413 min

Interviewing Tri Dao and Michael Poli of Together AI on the future of LLM architectures

This interview is on YouTube and podcast players.Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies. I’m excited to get to talk with Michael Poli (Stanford PhD student + research at Together AI) and Tri Dao (incoming professor at Princeton + Chief Scientist at Together AI). This builds on the mega-post from yesterday on the same topics, though the interview is obviously less math heavy:Interconnects is a reader-supported publication. Consider becoming a subscriber.Topics: Introductions | Why Attention works and may not scale | Quadratic scaling in attention | What is Striped Hyena | What is Mamba | Mamba hardware optimization | Predictions for 2024 architectures | More predictions for AIIntroductions[00:00:00] Nathan Lambert: Okay. Hey, everyone. Welcome to the first interview that we're going to post on interconnects. I'm really trying to bring more scientific voices into the AI discourse as media is covering a lot these days. I'm happy to be here with Michael Poli and Tri Dao, experts in some of these non attention architectures that have been really blowing up in the last few weeks of December.So, Michael, do you want to introduce yourself first?[00:00:25] Michael Poli: Sure. Thank you. Thank you, Nathan. For inviting me, I, do research at Together AI. And I was also a PhD student at Stanford, working with Stefano Ermon and, and, Chris Re, that's, that's how I met Tri as well. if I had to choose maybe, I moved to a few different areas of research.if I had to choose one, I like to, do research at the intersection of signal processing, dynamical systems, and deep learning, and most recently, luckily, there's been more interest in, in kind of efficient architectures that use some of these principles. to improve scaling, along different axes and to, to get sort of new, new trade offs at inference time.[00:01:13] Nathan Lambert: Great. And Tri?[00:01:16] Tri Dao: Everyone, thanks Nathan for, for, hosting us. really excited to be here. I'm Tri. I, just finished my PhD at Stanford. and I'm being assistant professor at Princeton, and right now I'm chief scientist at Together AI. it's, it's a startup working on AI infrastructure. And, yeah, I've been working at the intersection of machine learning and systems, so designing algorithms that take advantage of the hardware that, that they run on.I'm interested in, long range, dependencies, how to encode that into a model, designing architectures that can, can, learn long range dependencies. yeah, really excited to be here.Why Attention works and may not scale[00:02:01] Nathan Lambert: Okay. I think I'm going to, I have some questions dive right into this. I think you two will kind of both answer them or someone can answer longer, whatever you want.I think to start with, we should talk about maybe why attention works and what the limitations of attention are. I think. Almost every person in tech broadly now knows that a transformer is a model built with attention and chat GPT does that but like, why is this so good, even like how much of a transformer is built with attention are there other things going on, and what might be challenges there.[00:02:35] Tri Dao: Right. so, transformer which is this. Contexture that powers most of the exciting applications that we're seeing nowadays, as you mentioned, and so on. attention is kind of the core layer there, and attention actually became, earlier, around 2014, 2015, and then transformer came out, incorporating that, focusing a lot on, on, attention, with these, MLPs, interleaving, MLP and, and attention.And I think the success largely has been, They are, they seem to be able to scale really well so that you can scale up the models, with more, more parameters, with more data. And that has been the recipe for, for success. It sounds obvious now, but I think, five years ago that wasn't, that wasn't clear.so it seems like, you know, Transformer Architecture is, is a hugely successful one. and, you know, a couple of reasons why it's successful. I think it's like General enough that it's able to learn a lot from data. And two is, is pretty friendly to hardware. You can, there's no kind of sequential dependency like previous RNNs.so as a result, you can run it well on GPUs, TPUs. you can scale it up. It's very hardware efficient. I've personally have worked on making it more hardware efficient as well. So it's just kind of the recipe for, for success, which is, general architecture that scales well. if you're an NLP person, maybe you, you, you said, you know, incorporate some kind of inductive bias for, for, to protect, personally, I think it's a very general architecture that, that scales well and it's hardware friendly.[00:04:16] Nathan Lambert: Yeah. Yeah. It's, it's remarkable how it seems so obvious now and it's like. I think one of the things that studying this work is the context length becomes a really interesting access to stud

Dec 21, 202335 min