The 10,000x Yolo Researcher Metagame — with Yi Tay of Reka

July 5, 20241h 44m

Audio is streamed directly from the publisher (api.substack.com) as published in their RSS feed. Play Podcasts does not host this file. Rights-holders can request removal through the copyright & takedown page.

Original episode page

Show Notes

Livestreams for the AI Engineer World’s Fair (Multimodality ft. the new GPT-4o demo, GPUs and Inference (ft. Cognition/Devin), CodeGen, Open Models tracks) are now live! Subscribe to @aidotEngineer to get notifications of the other workshops and tracks!

It’s easy to get de-sensitized to new models topping leaderboards every other week — however, the top of the LMsys leaderboard has typically been the exclusive domain of very large, very very well funded model labs like OpenAI, Anthropic, Google, and Meta. OpenAI had about 600 people at the time of GPT-4, and Google Gemini had 950 co-authors. This is why Reka Core made waves in May - not only debuting at #7 on the leaderboard, but doing so with all-new GPU infrastructure and 20 employees with <5 people on pre-training and a relatively puny $60m in funding.

Shortly after the release of GPT3, Sam Altman speculated on the qualities of “10,000x researchers”:

* “They spend a lot of time reflecting on some version of the Hamming question—"what are the most important problems in your field, and why aren’t you working on them?” In general, no one reflects on this question enough, but the best people do it the most, and have the best ‘problem taste’, which is some combination of learning to think independently, reason about the future, and identify attack vectors.” — sama

* Taste is something both John Schulman and Yi Tay emphasize greatly

* “They have a laser focus on the next step in front of them combined with long-term vision.” — sama

* “They are extremely persistent and willing to work hard… They have a bias towards action and trying things, and they’re clear-eyed and honest about what is working and what isn’t” — sama

“There's a certain level of sacrifice to be an AI researcher, especially if you're training at LLMs, because you cannot really be detached… your jobs could die on a Saturday at 4am, and there are people who will just leave it dead until Monday morning, or there will be people who will crawl out of bed at 4am to restart the job, or check the TensorBoard” – Yi Tay (at 28 mins)

“I think the productivity hack that I have is, I didn't have a boundary between my life and my work for a long time. So I think I just cared a lot about working most of the time. Actually, during my PhD, Google and everything [else], I'll be just working all the time. It's not like the most healthy thing, like ever, but I think that that was actually like one of the biggest, like, productivity, like and I spent, like, I like to spend a lot of time, like, writing code and I just enjoy running experiments, writing code” — Yi Tay (at 90 mins)

* See @YiTayML example for honest alpha on what is/is not working

and so on.

More recently, Yi’s frequent co-author, Jason Wei, wrote about the existence of Yolo researchers he witnessed at OpenAI:

Given the very aggressive timeline — Yi left Google in April 2023, was GPU constrained until December 2023, and then Reka Flash (21B) was released in Feb 2024, and Reka Core (??B) was released in April 2024 — Reka’s 3-5 person pretraining team had no other choice but to do Yolo runs. Per Yi:

“Scaling models systematically generally requires one to go from small to large in a principled way, i.e., run experiments in multiple phrases (1B->8B->64B->300B etc) and pick the winners and continuously scale them up. In a startup, we had way less compute to perform these massive sweeps to check hparams. In the end, we had to work with many Yolo runs (that fortunately turned out well).

In the end it took us only a very small number of smaller scale & shorter ablation runs to get to the strong 21B Reka Flash and 7B edge model (and also our upcoming largest core model). Finding a solid recipe with a very limited number of runs is challenging and requires changing many variables at once given the ridiculously enormous search space. In order to do this, one has to abandon the systematicity of Bigtech and rely a lot on “Yolo”, gut feeling and instinct.”

We were excited to be the first podcast to interview Yi, and recommend reading our extensive show notes to follow the same papers we reference throughout the conversation.

Special thanks to Terence Lee of TechInAsia for the final interview clip, who are launching their own AI newsletter called The Prompt!

Full Video Podcast

Show Notes

* Yi on LinkedIn, Twitter, Personal

* Full prep doc

* Reka funding/valuation

* Building frontier AI teams as GPU Poors

* Yi’s Research

* 2020

* Efficient Transformers: A Survey went viral!

* Long Range Arena: A Benchmark for Efficient Transformers in 2020

* 2021: Generative Models are Unsupervised Predictors of Page Quality: A Colossal-Scale Study

* 2022:

* UL2: Unifying Language Learning Paradigms

* PaLM -> PaLM-2

* Emergent Abilities of Large Language Models vs the Mirage paper

* Recitation Augmented generation

* DSI++: Updating Transformer Memory with New Documents

* The Efficiency Misnomer: “a model with low FLOPs may not actually be fast, given that FLOPs does not take into account information such as degree of parallelism (e.g., depth, recurrence) or hardware-related details like the cost of a memory access”

* 2023: Flan-{PaLM/UL2/T5}

1.8k tasks for instruction tuning

* Encoder-decoder vs Decoder only

* Latent Space Discord discussion on enc-dec vs dec-only

* Related convo with Yi Tay vs Yann LeCun

* @teortaxes:

* If 2024 papers are to be trusted: You don't need (most) attention you don't need (most) kv cache You don't need (most) FFN layers You don't need a reward model You don't need… all the stuff that still makes frontier models work, ironically

* “there have been no real advance since 2019's T5 models”

* The future of Open source models - relevant to a16z vs Founders Fund debate. Open source cannot compete!

Timestamps

* [00:00:00] Intro

* [00:01:57] Yi Tay Intro

* [00:03:02] Path into LLMs

* [00:09:41] Google Brain: PaLM, UL2, DSI, Emergent Abilities

* [00:11:54] PaLM 2

* [00:15:27] Emergent Abilities

* [00:18:26] Quoc Le

* [00:24:16] Marketing Research: How to Start from Zero with No Reach

* [00:27:34] What's needed to be a successful AI Researcher?

* [00:30:31] Reka Origin

* [00:33:24] Starting Reka Infra

* [00:35:04] Why not to use TPUs outside Google

* [00:36:29] Chaotic vs Stable Infra

* [00:38:04] Risk Sharing of Bad Nodes

* [00:41:05] Checkpointing and Orchestration

* [00:43:39] Reka Flash/Core/Edge

* [00:46:59] Recruiting the team

* [00:47:22] Noam Architecture - Swiglu, GQA, RMSnorm, ROPE

* [00:52:26] Encoder-decoder vs Decoder-only

* [00:55:52] LLM Trends - Llama 3 and Phi 3 Glowup

* [00:57:46] LLM Trends - Benchmarks and Evals

* [01:03:25] LLM Trends - Early vs Late Fusion Multimodality

* [01:07:22] LLM Trends - Scaling Laws

* [01:09:41] LLM Trends - Long Context vs RAG

* [01:12:31] Long Context vs Finetuning

* [01:14:14] If emergence is real, when does Efficiency work?

* [01:17:41] MoEs and Upcycling

* [01:20:47] The Efficiency Misnomer - Efficiency != Speed

* [01:25:05] Open Source vs Closed Models

* [01:28:08] Personal Productivity

* [01:33:19] Singapore vs US Academic Scene

* [01:37:42] Building Silicon Valley outside Silicon Valley

* [01:40:29] TechInAsia Meetup

Transcript

[00:00:00] swyx: Thanks for watching. Bye bye.

[00:00:05] AI Charlie: Welcome back, friends. It's only been a week since the World's Fair, and it was incredible gathering the community to see the latest and greatest in AI engineering. You can catch up now on the four live stream track days on the AI Engineer YouTube, and our team is busy editing the remaining workshops and five other tracks, including the surprisingly popular AI Leadership track.

[00:00:28] Thank you all for your support, and stay tuned for news about the next event. The 2025 AI Engineer Summit. Last week, we did a very special deep dive with Josh and John of InView and Databricks Mosaic on training LLMs and setting up massive GPU clusters. And today, we're pleased to follow that up with a very special conversation with Yi Tai, formerly tech lead of Palm 2 at Google Brain, and now chief scientist of Reka.

[00:00:56] ai. Raker's largest model, Raker Core, was at launch. The fifth best model in the world. And the only GPT 4 class model not trained by a big lab like OpenAI, Google, Anthropic or Meta. In fact, while Google Gemini has 950 co authors, Raker only has 20 employees. With up to five people actually working on pre training.

[00:01:21] Swyx was excited to return to Singapore to delve into Yi Reka and building a new AI model lab outside of Silicon Valley. Stay tuned to the very end for a special bonus clip from Yi's recent appearance at the Tekinesia meetup for his spiciest take on why senior management is overrated and why this is the time to build up senior 10, 000x individual contributors like himself.

[00:01:46] Watch out and take care.

[00:01:48] swyx: Welcome to lay space. This is a long time coming, but I'm so excited to have you here.

[00:01:52] Yi Tay: Yeah, thanks for, thanks for inviting and excited to be here. chat about a lot of stuff.

[00:01:57] Yi Tay Intro

[00:01:57] swyx: Yeah. So you are interesting to research and introduce. You are now chief scientist of Rega, which is a super interesting model lab, but before that you were at Google Brain, you were architecture co-lead on POM two, you were inventor of UL two.

[00:02:10] You're a core contributor on Flan, you're a member of the Bard core team, and you also did some work on generative retrieval. That's a very, very illustrious three year career at Google Brain.

[00:02:19] Yi Tay: Yeah, thanks, thanks, thanks, yeah.

[00:02:20] swyx: And then since then, Reka, you joined in March 2023, announced a 58 million series in June 2023.

[00:02:26] I don't know if you know, the post money valuation, or the pre money valuation is public. So it's, crunch basis is, is, Oh, okay, okay. I

[00:02:33] Yi Tay: did not know that yet. 50

[00:02:34] swyx: something million. So you don't even have to leak it. It's on the internet. Okay. Rekha's stated goals were to work on universal intelligence, including general purpose multimodal and multilingual agents, self improving AI, and model efficiency.

[00:02:45] In February You released Rekha Flash. In April, you released Rekha Core and Edge. And then, most recently, you released VibeEval. Is that a good summary of the last six years? No, it's not. Four years? Four years, yeah. Oh my god. We're talking about AI I was wondering, since when did I,

[00:03:00] Yi Tay: like, step into a time machine or something?

[00:03:02] Path into LLMs

[00:03:02] swyx: Yeah, okay, so can we just talk about your transition into, you know, you did your PhD, and we can talk about your PhD, transition into brain and research and all that. You know, I saw you do some work on recommender systems, I saw you do some work on quaternions. What the f**k was that?

[00:03:17] Yi Tay: Okay, let, let, let's, let's forget about

[00:03:18] swyx: that.

[00:03:18] Just describe your path into modern L lms, right? Because you were, you were, you didn't start there.

[00:03:24] Yi Tay: Yeah. Okay. Sure. Sure. I, I, I think the world also didn't start start there, right? I mean, I think in so I joined Google in 2019, end of 2019. And the world looked like really different at the time, right?

[00:03:34] I think that was around the time the first GBT was released by. GPT 1 or something was released by OpenAI. So, research, like ML research and NLP research looked very different at that time. So I was mostly, I identified as like a language researcher. I don't like to use the word NLP, Jason will kill me if I use the word NLP.

[00:03:51] But like, I was like, okay, a language researcher. I, , but I was more like an architecture kind of researcher. And when I joined Google, I was also I continued on this as a model architecture researcher. I worked a lot on efficient transformers. That was your first viral paper. Yeah, yeah, and like, you know, I worked in the long range arena.

[00:04:09] I spent quite a lot of time looking at what we could do without attention. Like, there was a synthesizer paper back in 2020. I think that was like my early days in Google. There wasn't like a At that point of time transformer research was mainly like WMT, like machine translation and like perplexity and stuff like that.

[00:04:25] It's not really about You know, there wasn't like, I think it was in field short, field short learning and field short in context learning came only about like, you know, when GPT 3 came out and beyond, right? And then, so I think that at that time, the meta, I would say, the meta looked very different. And at that time, a lot of the work was focused on Like fine tuning things like T5 or BERT or something like that, right?

[00:04:45] So I think a lot of the research, not only myself, but like around me or like even the broader community were working on those kind of things. And so I think that was, which I feel that in hindsight today is actually pretty useful to like kind of think about because a lot of people came into like, AI into right after ChatGPT came out, right, so they saw AI as kind of, I think there's a lot of benefits of you know, understanding how, you know, transformers and like, I've broken this thing apart so many times, trying to, it's like, these things actually, you know, help to improve intuition and it's not totally disconnect I think a lot of things are still relevant today and, and it's just the scale has gotten much larger and also the paradigms shift a little bit from Single task, fine tuning to like generally do everything kind of universal foundation models.

[00:05:29] Foundation models, right. I think it's just a slight change in paradigm, but fundamentally, I don't think like the underlying principles of research hasn't really changed that much except for like compute. Yeah. So basically algorithms

[00:05:42] swyx: stay put and then compute and data scale.

[00:05:45] Yi Tay: So I have some thoughts about this.

[00:05:47] So I think back then a lot of the academic research, I think people have talked about this, like Sasha Rush has talked about this, or other people have talked about this, it's like, the conferences were always organized by like, Applications, right? They were always organized by like, Oh, like question answering.

[00:06:02] It was always by this, right? I think there was, there's like a bit of a transpose going on. Things become universal and then becoming like, okay, there's a data work stream, there's a model architecture work stream, and then people work on improving like a universal model and general purpose algorithms to improve this model rather than finding domain specific tricks.

[00:06:20] I think for, even in 2019, I think I've already been Like focusing on works that are like you know, you could improve on general architecture at that time. It was like, like maybe LSTMs in 2017 or something, and then you try on like 10 different tasks and the kind of thing, right? But like a lot of the research community have been focused more on like, how do I get that extra 2 percent on question answering or like, and then sentiment analysis.

[00:06:44] I think. There was this phrase of like, in 2017, 2018, where this style of work was still very fashionable in academia and conferences, right? And then, I think the big thing about the chat GPT moment of like, 2022, the thing that changed drastically is like, it completely like, it was like this sharp, Make all this work like kind of like

[00:07:02] swyx: obsolete.

[00:07:03] So November 2022, you're saying? Exactly, Charged GPT launch? Because I feel like if you're in the research community, this was coming.

[00:07:08] Yi Tay: Yeah, yeah. That's what I'm saying. I'm saying that like the big labs and stuff, like people have already been moving towards general, like even T5 was already like general purpose.

[00:07:15] Yeah. And that's the thing, right? But like, there was, it's like, there's a bit of a time okay, like places like Google and Meta, OpenAI, we will be working on things like three years ahead of everybody else. And academia will be like, Still working on like this path specific things, Got it, got it. And then like, I think the faulty function was the, the ChatGPT moment actually really like, It was coming, it was coming, it was just like the final, the last straw, and then it's finally like, Yeah,

[00:07:39] swyx: now it's serious.

[00:07:40] Yi Tay: Yeah, now it's really, the thing completely changed. I don't know how it turned from my, from my background to like, talking about the meta.

[00:07:47] swyx: I think that you navigate the meta very well, and part of my goal here is to also isolate how you think about the meta for other people to reflect on, because I think obviously you do it very well.

[00:07:57] Oh, thanks. I'm looking at your papers published somewhere around 2021. You had a hard cut to 22 Y two and Palm, and you did Y two Palm Emerge Abilities, DSI, REIT recitation, augmented Generation, all in the same year-ish. Mm-Hmm. So like there was, did you change teams? Did you, did you like have a research focus?

[00:08:17] Like when did you become,

[00:08:19] Yi Tay: oh, you're still saying that like language research became the

[00:08:21] swyx: model guy.

[00:08:21] Yi Tay: My research became emergent. It was like, it's very obvious. No, I don't think I'm like a person that like, I'm not like super, super great at like forcing a trend, like two years ahead. And then especially, especially like, Plan for that, right?

[00:08:34] Yeah. I think I smoothly and as like, kind of like as like I, I never actually really thought about this, this way. I just did like at every step, I just like optimized for like what I found to be most impactful and most promising. And then that gradually, and also it is, it is also a lot of influence by talking to people.

[00:08:52] Right? And then at the time I started working more with. I had some close collaborations with Jason and other people. I mean, Google is, you can work with anybody you want, basically. So you're kind of like, also like, partly it's like the environment shift. And I think the environment shifts like very quickly, but like, I was always like pulling for like the environment.

[00:09:10] I was not like, I think it's always good to like have an open mind and move along with the field rather than, okay, this is my research area. I'm going to get stuck in it two years. I think I just move along to find like things that interest me. And naturally I think like that turned out to be like, The things that were most impactful at that time.

[00:09:27] In retrospect, I kind of did well, but like, I never actually really saw it as intentional. Sure. I didn't do anything really intentional, except that's doing what I find interesting, actually.

[00:09:37] swyx: Cool. Well, we'll just talk about the main work at Google Brain, and then we'll move to Rekha.

[00:09:41] Google Brain: PaLM, UL2, DSI, Emergent Abilities

[00:09:41] swyx: So out of UL2, Palm, Emergent Abilities, which of these came first?

[00:09:46] Yi Tay: Yeah, I, wait, I did, I need, I can't really actually re remember. Okay. What will make you talk about year two then? Year two and DSI, the differential search index? I I was working on it like the December of 2021. So like at Google they were like projects that are like big efforts that are like a researcher will be like part of the effort and then this will be kind of top downish to some extent.

[00:10:04] Right. And then they are, they were also like. Bottom up research that one could do I can't speak for the Google now for sure, but like, at least at that time, right? So UL2 and DSI, Differentiable Search Index, were like, works that I kind of tinkered with in the December break where nobody was around.

[00:10:19] Palm also has this kind of differentiation because there's Palm 1 and there's Palm 2. Right. So Palm 2, I was actually like the co lead of one of the work streams, but like Palm 1, I was more of a contributor and Palm 2, I was like, so, so they were like, now I have to think back of like, okay, what's the timeline, which came first, right?

[00:10:35] In general, they were like three categories of works. One is like broader efforts that are efforts. And then there are some that like UL2 and DSI were like my own projects, like projects I use to compute that. That I had, and then I just played with it. You accidentally left the auto run in for a month.

[00:10:50] Yeah, yeah, yeah, that was in the paper. It was fun, I think. It was really fun. And then, there was also like a third category where those were like, the efforts that my good friends were driving and I contributed. So Flan was just one of them. I know like, maybe on, I would like to just maybe say this publicly, like a lot of people like, I talk a lot about Flan.

[00:11:08] You're Flan's show number one. But like, yeah, but like, the first author is actually Hsiung Wan, who is great, and then like, another guy, Le, I was her cook. core contributor but I mean just because I'm a little more visible so I kind of Accidentally took a little bit more credit for that, but I was a co contributor, but I was not like The lead authors are obvious.

[00:11:25] Yeah, so the third category was projects that my friends Emergence was also like Emergence Abilities No, actually, that paper was supposed to be only me and Jason on the paper. And I actually became friends with Jason From the paper and then that led to like this streak of like, I dunno, 10 papers or something together with Jason and now we are like super good friends, the Ultimate Romance.

[00:11:44] But that was like the immersion paper. But I, emergent paper was also like a belonged to be like a, a bottom up kind of like a thing. And fun times. Yeah, it was fun. ,

[00:11:54] PaLM 2

[00:11:54] swyx: maybe I'll pick on Palm two. Because I feel like, I'll pick on Palm 2 and Emergence, because I really want to make sure I tell those stories.

[00:12:01] Those are important stories. Palm 2, I think it's a career story. It effectively became a co lead on the second version of a very high profile, company wide effort. How did that happen? I think people would like to know, to know how to, you know, what's like the sort of career strategy there.

[00:12:16] Yi Tay: To be clear I was one of the co leads, but there were a lot of co leads, so, so I don't want to take too much credit for that.

[00:12:21] But my involvement with Palm 2 came from the after UL2 was working well, and then it was getting some visibility within Google. Was UL2 the largest model that Google had released

[00:12:32] swyx: at the time? Yeah, I think so. That was the largest. And you just, it was a personal project? It was a personal project.

[00:12:37] Yeah. Yeah. Isn't that unusual? How can it be like one person's decision to like suddenly release something that, you know, effectively changed the trajectory of, I think how, how, how, how people brain, how,

[00:12:47] Yi Tay: how we work was that, I mean, 20 B is not that much larger, but from 11 B to 11 B T five, actually at that time there was starting BT five, right?

[00:12:55] So I think UL two is code decoder 20 B model. I think when we got it approved, it was like. It was released as like, kind of like, like the big brother of T5, you know? Kind of like, okay, we updated T5 with like a new objective and train this new model into DBM we want to, and it uses the same pre training data set and everything, right?

[00:13:13] So like from PRC4. Yeah, from, yeah, that was the easiest because there was precedence, right? It was like, okay.

[00:13:18] swyx: But yeah, there was some architecture, like the mixture of denoisers. Yeah,

[00:13:21] Yi Tay: yeah, yeah. So, so back to Palm two, I think my involvement with Palm Two came from the work to, to, to, to add UL two, to Palm two.

[00:13:28] And then I, I, I mean, it was from the top down point of view. I, I mean, the leads were decided in a top down manner. It's not like, like there was not much like fighting or like, or any major things, right? It was like. It was a mixture of bottom up, top down ish, half half situation, and then from the top it was like, Okay, these are the people who are the most visible in contributing to this workstream, and then, okay, how about E and this other guy will be in charge of this modeling workstream, and something like that, right?

[00:13:58] So I think it just happened that way organically, and yeah, I think that was how I kind of was co leading the modeling workstream.

[00:14:07] swyx: I think in retrospect, you understand now that this is a very valuable experience. And I think now, today, it will be much more competitive to get the job that you got, whereas you didn't, you know, two years ago, you didn't have to try that hard to get it.

[00:14:20] Or like, you kind of lucked into it with you all too, and then like, it just compounded from that initial good decision.

[00:14:25] Yi Tay: I think it's very hard to counterfactually analyze these type of things. I think it's definitely true that there are more people working on generative AI now, and, you know, if you are in a big company, it's way harder to navigate.

[00:14:35] Like these type of things, right? I wouldn't say that there were like nobody or so wanting to work on this at that time. In fact, there were actually But you were the obvious choice. There were less people. There were definitely less people, but I think I would say that maybe it's slightly harder now, but like, it's also not like it was easy at the time.

[00:14:50] Yeah.

[00:14:51] swyx: Yeah. I imagine it's sensitive. But also in my mind this is now the most And this is the most valuable on the job training in the world. And so people want to know how to get it. This is what I'm trying to figure out.

[00:15:03] Yi Tay: Like, actually, individually we also cannot pick somebody else's experience and then try to replicate it on, because everybody's circumstances, their initialization point, their, That thing is kind of also like in different This is not only true for LLMs in general, right?

[00:15:16] Because a lot of times like, oh, okay, you did this in this position, and then because of this It's very hard to trace all this down to to find the causal path for this thing. So I think everything in life, there's some luck involved, I guess. Yeah,

[00:15:26] swyx: there is.

[00:15:27] Emergent Abilities

[00:15:27] swyx: Emergent Abilities. Very influential paper.

[00:15:30] Subsequently contested by the Mirage paper. Oh, yeah, yeah. So before we get to the Mirage, was there a story behind Emergent Abilities? You know, I'm sure it's Jason's Thesis or like what? Just tell, just tell more about like the behind the scenes, like was was, was there a discussion that led to

[00:15:43] Yi Tay: it that, you know, this one was like this, the idea, the inception of it was like mostly Jason.

[00:15:49] Okay. Right. I think I, I helped out to like. You know, shape up a little bit of the paper get some stakeholders involved and stuff. I was discussing quite a bit with Jason, but this, the idea itself was Jason itself. So, actually, when the Mirage thing and everything came out I didn't okay, I was just hot takes for the sake of hot takes.

[00:16:06] I didn't feel, I believe in emergence I have to just go on the record and just say I mean, I believe in emergence. And then I was not feeling very strongly because I think that, I can't speak for Jason, but I would just imagine that he would be Maybe personally offended because because I know Jason is a person that takes a lot of like feedback like very well He's a very like he's not offended by harsh feedback and he rebuts well like online as well, right?

[00:16:29] But like I would imagine he will be the one that is the most like affected by Criticisms of emergence. I was believing in it, but I have to say that the paper, I mean, that's why he's the first author and I'm second, but that was mostly Jason's thesis, and I have to really say that Jason has really good ideas, and I was more of a support role for that paper, yeah.

[00:16:49] swyx: Sure, yeah, you know, lots more to discuss there, but you believe in emergence, that's enough for me to work with.

[00:16:55] Yi Tay: I also think that the, the, the Mirage paper is mostly like I don't know who, actually I don't even remember who wrote it. Ryland

[00:17:01] swyx: Schaefer. Yeah, I, I covered him on, on my NeurIPS podcast.

[00:17:03] Yi Tay: Okay, okay.

[00:17:04] swyx: He's a very good speaker, and the paper was well done. It's just that people drew the wrong conclusions from the paper. Because they had a very good title. Do you believe in emergence?

[00:17:12] Yi Tay: Of course. Okay, high five.

[00:17:14] swyx: I mean, how can you read any paper, read any, the progress of LLMs and not believe in emergence?

[00:17:20] It's so stupid. Like, just because you re parameterize some benchmarks in evals and, you know, make it linear, doesn't mean emergence is completely gone. And even in the Mirage paper, they acknowledged that there were some metrics that were true, genuine emergence, according to them. I think it was something like 25 ish percent in the ballpark.

[00:17:38] That's not the exact number, but it's in the ballpark. So I was like, okay, fine, like some benchmarks you disagree with, but on the whole, there is emergence, it's just, now we're just talking about the magnitude.

[00:17:47] Yi Tay: Yeah, yeah, yeah, for sure, for sure. I don't think the authors of the paper had really very Like they, they didn't, I mean, nobody, we, we should just assume people don't have bad intentions.

[00:17:55] Right. But like, no, they, they definitely were just doing this. But like the, the, I think the Popul media, I was more like annoyed by the nearest best people. I mean, okay. Best people was, let's take the thing, take of a grain of salt, right? Yes. But like, there were people come to me like, oh, you should care about this because it's the nearest best disprove because it's the nearest best paper.

[00:18:11] I'm like, paper awards like mean anything. Actually, it doesn't mean anything. Right? Like. I think that was more of my where my angst was coming from. I don't, I don't think I really had, I don't even remember who were the authors of that paper, right?

[00:18:23] swyx: I'm sure they're doing well for themselves, and we don't have to dwell too much on that.

[00:18:26] Quoc Le as manager

[00:18:26] swyx: Okay, so a couple more things from Google, and then we can go to Rekha. Kwok Le was your manager.

[00:18:30] Yi Tay: Yeah, yeah. What is I had another manager called Don. Like, I had two managers during my time at Google.

[00:18:34] swyx: So I'm just basically going to ask for quick hits from what did you learn from Kwok? What did you learn from Jason?

[00:18:38] What did you learn from, you know, Juan? Who they are, who they represent to you,

[00:18:42] Yi Tay: like, how they advise you and all that. So Kwok as a manager, he was more like a friend, and we would like talk a lot about, I think Kwok is a very researchy person, he has a lot of like, he's more of like an intuition, I learned a lot about like from him about like, there was no like concrete, like it was more like over time, and it was very implicit, soft kind of feeling, but I think like a lot of research science, we were like brainstorming a lot about like, , I quite like that, you know, when we were But there was this U palm paper that, that didn't like get much, like as much attention that I feel it deserves, but like, I think that was one of the works that I, I kind of like discussed with court quite a bit and like and that time you're releasing the fund two stuff and everything.

[00:19:16] And then like, I think court has a lot of good sense about like what makes a work a good hit and like you know, publicly a good hit. And like a lot of research sense about like what what makes like. like research cool, you know, so I think he has good like intuition as a researcher and I learned quite a little bit about and I also I was going to say that I think Jason also probably learned like quite a bit from Quark and and this also influences like more of like it was not only just like me getting influenced but that there was like Jason getting influenced and then Jason influenced me so I think overall what I learned from Quark's like intuition, research taste, people like chat about AGI sometimes, singularity and stuff like this it was Like, he's nice to talk to as a friend, manager, kind of, he's like kind of a friend figure to me.

[00:20:01] He's very much a researcher more than like a corporate, you know, manager kind of thing.

[00:20:06] swyx: I totally expect that. It

[00:20:07] Yi Tay: was fun, it was fun.

[00:20:08] Jason Wei

[00:20:08] swyx: Jason Wei, what did you learn from him? What is your distillation?

[00:20:11] Yi Tay: Okay, Jason is very interesting. So, I learned in my career, I learned two or three things, major things from Jason, right?

[00:20:18] So, I think the first thing I learned from him is that so Jason was actually, okay, I'm going to talk about the more casual, more fun stuff first. Jason was the most spicy on Twitter first before me. There was an era where I was a goody two shoes, I only had my main account, I only tweet my only tweets Newspaper alert, you know?

[00:20:34] Right. And then Jason was starting to post like, like hot takes, right? Yeah. And I just thought to myself, oh damn. Like, you know, and there were, there were types that I was like, Jason, you should not post this. You're gonna get cancer. Right. And he, he, he was fine. He, he always break through the storm and everything until I, I looked at him and I'm like, maybe it's not that bad after all.

[00:20:50] Just be, be, I love it. Right. So there was like kind of like, which is very interesting 'cause Jason is much younger than me and I. And the other thing also, our accounts, right, we created them around the same time, right? And the interesting story behind it was that Jason's account and my account has our own, our original it was not like an anime character that nobody know who is it.

[00:21:09] We have our identity. It's pseudonymous. It's pseudonymous, right? And then I asked Jason why do you want to have a So like, why don't you just make like, and he told me this thing which was quite true was that like, Okay, you can post a thing that is spicy and it's hot, but if you cannot stand by the opinion, then you should not have the opinion in the first place, right?

[00:21:25] Wow. Right, so there was something that, oh, okay, I thought that was profound because so far this, I mean, there are times where, okay, I post something and it's spicy and then, okay, it gets a little bit bad, and then I, okay, I kind of agree that, okay, this is bad, then I will retract it. But if I could stand by the opinion, then I would just stand by it because that's the point of making it It should be said.

[00:21:42] Right, it should be said because. I can put my name behind it, right? So that was This is part of the first bucket about how it kind of influenced my online persona like, a little bit, and then, I mean, and then it turns out that now AGI Hippo is so much more spicy than the cola The cola is just hibernating somewhere, it's not even around, right?

[00:22:00] So, I mean, Jason is also more constrained because he works for he has Like an actual employer, right? And he has to be a little bit

[00:22:08] swyx: more The worst thing about Twitter is that, you know, anytime anyone from OpenAI tweets anything, they're like Did you see this researcher from OpenAI said something?

[00:22:15] And they read tea leaves that are not there, and it makes you very cautious to tweet anything. And so it kills the golden goose, is what I say.

[00:22:22] Yi Tay: There was one tweet, I mean, at the time when somebody was, people were speculating the GPT 2 chatbots, right? And then Jason just posted something on his main account something excited about new experiments being run just a random just, and then people screenshot that, and post like, Yeah, I hate that.

[00:22:35] So I think, I, now I think for, All the count is mostly like personal, like personal stuff , very personal I think he would stay away from non work things non work things.

[00:22:44] swyx: The golden goose has been killed because people on Twitter cannot control themselves from drawing random conclusions from, you know, all these hints and all that Yeah, yeah, yeah, yeah,

[00:22:52] Yi Tay: but Going to like the actual, this is like filler, filler, this is not canon, it's filler.

[00:22:57] I think the second thing I learned from Jason is more about like, from my you know, kind of like, from my own career, it's like, the importance of like marketing and PR. So Jason is actually like, I mean, I was just like, he was actually like really, you know, the emergence, like how many blog posts you wrote about the emergent abilities and how many talks he's given about, about emergent, like a lot, probably like the other day I was just at this webcom keynote and he was giving a keynote again about emergent abilities and it's been two years, right?

[00:23:25] So I, I think one big success of him is that like he, he does the work. Okay. Thanks a lot about like marketing the work itself. I did not like in my early parts of my career, early parts in Google, right? I was putting out a lot of work, but I didn't put in a lot of like effort in like thinking about the, like how the work is going to be received.

[00:23:42] I'll just be like, here's a paper, here's a paper, here's a paper, right? But Jason will be like, I'm going to write this paper and I'm going to like market the s**t out of it. So I, I, I think I learned a lot about like, so every single first author paper that Jason writes in the last, he has like 1000 citations in one year.

[00:23:56] Oh my god. Like no, I mean, not every, but like most of it that he leads. So his hit rate is very high. His hit rate, like impact density, like it's very high, right? So, It's pretty interesting but I kind of see him as like a peer and I learn a lot from his basically some, some people are just like talented in, in different ways.

[00:24:11] And, and I think that like, I, I looked at how he markets his own work and markets himself actually, right?

[00:24:16] Marketing Research: How to Start from Zero with No Reach

[00:24:16] Yi Tay: If someone is starting from zero,

[00:24:17] swyx: like no Twitter presence, what is the second best thing to do? You mean as a researcher? For marketing, yeah.

[00:24:23] Yi Tay: I, I think you would like the, the, the most obvious. If you're like a re like say hypothetically, you're like a researcher in like a place without visibility or without an end.

[00:24:32] You have no personal visibility. The first goal is always to try to find a mentor or coworker that is like within this circle, and then you start from there, right. Because, and then you get, you get, you know. people from like, who has a visibility and following to retweet. So you will like, work with them the big goal is not about like, I learned this, I mean, this is like, probably a career mistake in my early days, was that you know, instead of like, focusing on like, so called people okay, if you do good work, it's more of like, okay, how am I going to I see this visible researcher from DeepMind, right, or how can I collaborate with this person, and then kind of do something that feels cool, and like, I can win their respect, and that they would like.

[00:25:09] You know, they will be willing to co author for me because the exercise itself was so about how to, you're not trying to please reviewers or anything, you're just, if you can find one semi visible, you don't even have to be like a famous person, that's like a semi few thousands of followers, has a good reputation of research, and then you collaborate with this person, and then when you post the work, you are co authored with this person, and then you get the person to vouch for you, or just, over time, this would It could be from internships, it could be from, you know, just DMs.

[00:25:38] I think, you know, people are nicer than some people, they seem scary, but if you DM them, they're actually willing to collaborate, actually.

[00:25:44] swyx: I was scared of you, actually. And when I DMed you, you turned out a lot nicer than I feared. So thank you for being nice. That's really great advice for people.

[00:25:55] I just want to leave that out there for people. For others who follow, you know the work that the career advice that I give, the title topic of this is pick up what others put down and specifically pick up what your mentors put down. Like, mentors always have more work to do than they have personally time for.

[00:26:09] The high visibility mentors, and if you can show that you're a good collaborator with them, they will lift you up. Accordingly, that's a pretty good formula for career growth. Should I ask about Hyungwon? Or I don't know how close you are. Oh, we're

[00:26:21] Yi Tay: still good friends. Hyungwon is a great engineer and he's very systematic in the way he thinks.

[00:26:26] I think Hyungwon is without going into detail, I still spend a lot of time talking to Hyungwon, even like in the, even after we both are different places, about like very interesting algorithm, arithmetic ways to think about life. Very interesting like, perspectives on life rather than research.

[00:26:43] But Hyungwon is a great engineer. And the one thing that scares me about Hyungwon is he doesn't have multiple monitors. He just codes with one small screen. And he does everything with very hyper optimized. And then I

[00:26:54] swyx: want those U curve where one screen, one screen, and then many screens.

[00:26:57] Yeah, yeah, yeah.

[00:26:58] Yi Tay: So I think Hyungwon scares me because it's like, I think that was at NeurIPS 2022. Like, we were doing some work at the New Orleans. And then He'll be coding perfectly fine with this 13 inch MacBook with one terminal, and then he'll be like, he keeps telling us okay, it's more optimal to using keyboard is more optimal than moving your head, because if you can switch your screen fast enough, it's faster than your head, like moving to different screens and stuff.

[00:27:24] I did not actually distill that, because it's too painful to do that, but it's very interesting in a way that I'm he belongs to one of those hardcore people with one monitor and

[00:27:34] What's needed to be a successful AI Researcher?

[00:27:34] swyx: Maybe this is a relevant question to just close out the Google site. What do you think is

← All episodes of Latent Space: The AI Engineer Podcast