
Interviewing Ross Taylor on the state of AI: Chinese open models, scaling reasoning, useful tools, and what comes next
Show Notes
I’m excited to welcome Ross Taylor back on the podcast (and sorry for the lack of episodes in general – I have a lot going on!). The first time Ross came on we focused on reasoning – before inference-time scaling and that sort of RL was popular, agents, Galactica, and more from his Llama days. Since then, and especially after DeepSeek R1, Ross and I have talked asynchronously about the happenings of AI, so it’s exciting to do it face to face.
In this episode we cover some of everything:
* Recent AI news (Chinese models and OpenAI’s coming releases)
* “Do and don’t” of LLM training organizations
* Reasoning research and academic blind spots
* Research people aren’t paying enough attention to
* Non language modeling news & other topics
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show outline as a mix of questions and edited assertions that Ross sent me as potential topics.
00:00 Recent AI news
Related reading is on Kimi’s K2 model, thoughts on OpenAI’s forthcoming open release.
* What did you think of Z.ai’s GLM 4.5 model (including MIT licensed base model) with very strong scores? And Kimi?
* What will OpenAI’s open model actually be?
* What do you make of the state of the ecosystem?
12:10 “Do and don’t” of LLM training organizations
Related reading is on managing training organizations or the Llama 4 release.
This is one of my favorite topics – I think a lot of great stuff will be written on it in the future. For now, Ross asserts…
* Most major LLM efforts are not talent-bound, but politics-bound. Recent failures like Llama 4 are org failures not talent failures.
* Most labs are chaotic, changing direction every week. Very different picture from the narrative presented online.
* Most labs resemble investment banks or accountancy firms in that they hire smart young people as “soldiers” and deliberately burn them out with extremely long hours.
36:40 Reasoning research and academic blind spots
Related reading is two papers pointing questions at the Qwen base models for RL (or a summary blog post I wrote).
I start with: What do you think of o3, and search as something to train with RL?
And Ross asserts…
* Most open reasoning research since R1 has been unhelpful - because not enough compute to see what matters (underlying model and iterations).
* Best stuff has been simple tweaks to GRPO like overlong filtering and removing KL divergence.
* Far too much focus on MATH and code - AIME has tens of samples too so is very noisy.
* People are generally building the wrong kind of environments - like puzzles, games etc - instead of thinking about what kind of new capabilities they’d like to incentivise emerging.
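To make the point about AIME noise concrete, here is a rough sketch of how much a benchmark score can move from sampling alone. It assumes each problem is an independent pass/fail trial scored once, uses a simple normal approximation, and takes 30 problems as a stand-in for "tens of samples" and 500 as a hypothetical larger benchmark. The function names are illustrative, not from any library.

```python
import math


def accuracy_standard_error(accuracy: float, n_problems: int) -> float:
    """Binomial standard error of an accuracy measured on n_problems items."""
    if n_problems <= 0:
        raise ValueError("n_problems must be positive")
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return math.sqrt(accuracy * (1.0 - accuracy) / n_problems)


def approx_95_interval(accuracy: float, n_problems: int) -> tuple[float, float]:
    """Rough 95% interval using the normal approximation (assumption)."""
    margin = 1.96 * accuracy_standard_error(accuracy, n_problems)
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)


# Compare a benchmark with tens of problems against a hypothetical larger one.
for n_problems in (30, 500):
    low, high = approx_95_interval(0.5, n_problems)
    print(f"n={n_problems}: measured 50%, rough 95% interval {low:.1%} to {high:.1%}")
```

Under these assumptions, a measured 50% on 30 problems is compatible with anything from roughly 32% to 68%, so a gain of a few points between two methods is hard to distinguish from noise.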
50:20 Research people aren’t paying enough attention to
The research area I hear the most about right now is “rubrics” – a per-prompt specialized LLM-as-a-judge to replace reward models. SemiAnalysis reported OpenAI scaling this approach and lots of great research is coming out around it.
I start with: What do you think of the state of RL scaling and generalization? What of models losing
Ross asserts…
* Rubrics are underhyped on social media - they were the driving force behind projects like DeepResearch - and GenRMs are interesting but perhaps slightly overhyped.
* There is an evals crisis - there are not enough high quality evals, particularly for frontier tasks like automating research and real life work. Impediment to anyone building agents or ASI.
01:02:46 Extra stuff!
I ask Ross: What AI are you using today? Why?
To conclude, Ross wanted to discuss how AlphaEvolve has been underhyped on social media, and means the future isn’t just RL. Shows there are other effective ways to use inference compute.
Transcript
Created with AI, pardon the minor typos, not quite enough time this week but I’m hiring someone to help with this soon!
Nathan Lambert: Hey, Ross. How's it going? Welcome back to Interconnects. I took a many month break off podcasting. I've been too busy to do all this stuff myself.
Ross Taylor: Yeah, I was trying to think of all the things that happened since the last time we did a podcast a year ago. In AI time, that's like two hundred years.
Nathan Lambert: Yeah, so I was looking at it. We talked about reasoning and o1 hadn’t happened yet.
For a brief intro, Ross was a co-founder of Papers with Code, and that brought him to Meta. And then at Meta, he was a lead on Galactica, which was a kind of language model ahead of its time relative to ChatGPT. So if people don't know about Galactica, there's a great paper worth reading. And then he was doing a bunch of stuff on reasoning with Llama related to a lot of the techniques that we'll talk about in this episode.
And now he's doing a startup. I don't know if he wants to talk about this, but generally, we talk a lot about various things. This got started through o1 and trying to figure out scaling RL. We started talking a lot but then we also just resonate on a lot of topics on training language models and other fun stuff - and also trying to be one of the few people not in these big labs that tries to talk about this and think about what the heck's going on. So we're gonna kind of roll through a long list of a lot of things that Ross sent me that he wanted to talk about, but this will be a compilation of the things that we've talked about and fleshing them out outside of the Signal chat.
So, Ross, if you want to introduce yourself more, you can, or we'll just start talking about news because I think a lot of people already know you.
Ross Taylor: Yeah, let's get into the news. There’s lots of fun things to talk about.
Nathan Lambert: So, the last two weeks of Chinese models. I think we had Z.ai's GLM 4.5 today. Kimi-K2 last week. I think Qwen is on a roll. I thought summer was supposed to be chill but this is crazy.
I haven't even used all of these. The pace is just incredible. And all the open models have actually good licenses now. But is this going to hurt anyone in the US? Where do you see this going in six months?
Ross Taylor: Yeah, so yesterday was the one day I actually tried to turn off Twitter. And so when you told me in the morning about the new GLM model, I had to read up on that. So that shows if you take your eye off Twitter for one second, then you’re behind on open source...
But yes, I think the general theme is that it’s been absolutely relentless. So thinking about the last time I spoke to you on the podcast a year ago, Llama 3 was a fairly established standard.
There were still things happening in the background, if you paid attention to things, but now it's absolutely relentless. In the case of China, I think their business culture is that - as soon as they find something is successful - they’re very good at concentrating resources and going after it. So it’s created a very competitive space.
I think the context is very interesting in several different dimensions. There's the geopolitical dimension, which you've hinted at in some of your blogs. For example, what does it mean if the open source standard is Chinese? What does that mean if we think about these models not just as things which power products, but as (critical) infrastructure? Then it seems like China has a great advantage if they want to be the standard for the whole Global South.
Nathan Lambert: Yeah. There are a few things that we're going to come back to in this conversation that are so interesting. We're gonna roll into what it takes to train these models. And we're going to talk about how crazy, political and hard it is in the US. But we have all these orgs popping up in China - so is this partially just a US problem?
But then we also have OpenAI that's supposedly going to release a model. There are multiple things. But my question is: why is China doing so well? Are they well suited to training these language models?
Ross Taylor: I’ll caveat what I’m about to say by saying that I want to be careful about making generalisations. Because, for example, we’ve seen some of these new Chinese organisations be good at innovation - for example, this week we had GSPO which was nice. But for Chinese orgs, my general sense is that, once something has already been validated, the specification for what to build has been set, and the task can be reduced to an engineering problem, then Chinese culture is very well set up to succeed in those situations.
The other dimension which has become relevant - especially after DeepSeek - is that the Chinese Government has traditionally been very good at recognising what’s successful, pouring resources in, and facilitating public-private collaborations. I think that surprises people still in the West. For example, people are surprised that a group can come out of Tsinghua and fairly quickly have their own state-of-the-art LLM. Why isn’t there a similar story for groups coming out of MIT?
Nathan Lambert: I’m not sure about this.
Ross Taylor: I think the US will eventually wake up to this, but…
Nathan Lambert: My understanding is that Z.ai is a startup that spun out of Tsinghua, so I don’t know if it’s the best comparison. Also Alibaba is the clear winner here because they have Qwen, but they’ve also invested in Moonshot, which is Kimi, and then I think also Z.ai.
So I’m more interested in the question as to why they are all open. That seems more important relative to talent because there are lots of universities that might have model orgs spinning out of them - even in the US - and it’s not solely a Chinese thing.
I think it could happen with a group out of MIT. That being said, I agree that the US should have more compute deployed for academics and a lot of universities are just spinning them up now. It just takes a long time.
So I think there’s a lot of things that Twitter is mixing up here. There's a good tweet in it, but I don't think it'll be 100% true, which makes for a very viral tweet when it feels true.
Ross Taylor: Yeah, I think there is definitely naivety about how things are actually working (in China). And there’s asymmetric information, in that you don’t truly know what’s going on in the inside of these organisations.
The other thing worth mentioning - which is maybe a separate topic - is that there’s a tendency to see open models as a homogenous category. But there are very different use cases. So if I want to do a new reasoning paper, I’m going to use a Qwen model. But then if I’m doing distillation, I’m going to use DeepSeek or Kimi.
This discussion also relates to OpenAI’s rumored open model: because in my mind I still don’t quite see how it will fit into the ecosystem. Because is it going to be something that people build research on? If it’s a post-trained model, then probably not, right?
Nathan Lambert: Yeah. But their tweet was about safety, so I doubt it is a base model if they’re delaying it for safety. I do think they actually delayed it for this reason. It’s very much in OpenAI’s culture. But I don’t think it’s going to change the ecosystem. It will be an interesting one off.
I also don't expect them to release a model that's based on their GPT architecture. My bet is they take an off-the-shelf architecture like Qwen or Llama. A lot of the recent OLMo models are very Qwen-y. And they will also be deciding sizes based on what fits on what cluster - e.g. Qwen is very deep rather than wide, and OLMo 2 is very similar to that. So I think the OpenAI model is going to fit that mold.
Ross Taylor: I think so. I guess one way to think about it is they're just trying to “distill” their RL infrastructure into weight space, right? As opposed to publicising their (internal) architectural choices.
But back to the discussion, and maybe this is a question for you Nathan, but do you think their model is going to be more comparable in use case to a Kimi or DeepSeek? Or is it more similar to Qwen? Or is it actually something completely different, like an on-device model? A smaller model?
Nathan Lambert: I expect it to be smaller. They joked about on-device, which I don't know is the right framing.
Ross Taylor: Yeah.
Nathan Lambert: I'm also just now realizing how - if RL is their great strength - then part of the challenge of shipping an RL model in open source is that you need your training infrastructure to match the inference infrastructure. So unless they train this on an exact vLLM that people have access to - and some open source environments - then they can’t just dump the model and expect people to be able to do search and code execution in the open model stack.
I don't know exactly how Qwen and DeepSeek have gone about this. My impression is that they're actually not as useful in terms of tool use because it's so hard. I think that tool use is naturally a closed model reinforcing thing because it benefits to have these tools match up.
Ross Taylor: So the Qwen models are pretty good at things like function calling. Kimi - at least in the benchmarks - was also pretty good at agentic tool use benchmarks. And then - this is a separate discussion - but they had this nice training innovation where they use lots of MCP servers in a synthetic data strategy. But then again, you’re mostly seeing indications of capability in headline evals, which you shouldn’t really trust anyway.
Nathan Lambert: I think of Claude 4 as the release that ended eval chasing. On paper the release was so lame, but it delivered for everybody - which is very bold because there is a lot of money on the line. They are constantly fundraising and if one fundraiser gets spooked because the release numbers are bad, then that’s a lot of CEO calls that they have got to make.
Ross Taylor: On evals, I was thinking about this a few months ago. It might have changed now given the pace of AI development, but I was thinking about how you might split up the impact timeline for a release.
So day one is headline benchmark numbers - which are mostly b******t. Like I’ve got this amount for my model on MMLU Pro. But then the next tier of impact is the day after the release where people have all these weird bespoke evals on Twitter.
Nathan Lambert: The pelicans and the rotating hexagons and balls…
Ross Taylor: Yes, and by this stage you’re getting more confident. Because unless the model developers are very smart (which some of them are), then they probably haven’t optimised for day two benchmarks. So at that stage you’re beginning to believe that the model actually generalises beyond the headline numbers.
And then finally you have a week or two weeks after the release where you can say that you’ve tried the model quite a lot now, and you then have real confidence that the model is good.
Nathan Lambert: Yeah. Refute my claim: Chinese providers are still optimizing for benchmarks more than OpenAI, Google, and
Ross Taylor: Yep, I mean it’s probably true.
Nathan Lambert: It feels so obvious to me. I think that China has closed the gap to a remarkable degree, but I don't think they've caught up fully. I think that's hard. It’s very hard to get all the data and pipelines in place. A lot of it is actually user data, knowing your user, and hill climbing that. So for example, all these APIs not working is a huge issue for them.
Ross Taylor: Yeah. I think (Chinese models) have also been helped by the fact that a lot of the academic work that builds on them has been doing reasoning work in publicly available data domains like math and code.
The models have been heavily optimised for these domains anyway, so the model developers are not quite as exposed - since people aren’t really testing the true generalisation capabilities of the model. We already know that the Qwen models are heavily mid-trained on math and code, so they will hold up performance-wise there.
Nathan Lambert: Yeah. Okay, this is a good preview for the episode. I think that the main things are going to be how to build good organisations, and then academic reasoning research and how to bridge the gap. I think we can talk starting about org charts.
So how do you make a good org? Or maybe there are two things. One: how do you make a good org chart for training language models? And two, how do you make an effective culture?
I think this is quickly becoming one of my favorite little niche interests because there's just so much intrigue in it. There's just so much money on the line to break everything. So you sent me some hot takes if you want to read them, but the floor is yours for what doesn't work.
Ross Taylor: Sure. So if anyone’s been on social media recently, the general trend nowadays is to check your phone and see these NFL draft style tweets about researchers moving between orgs.
First of all, researchers have always moved between orgs. This is not a new thing. And a lot of the org moves that were talked about - at least outside of Meta - were just regular moves.
But I think the bigger mistake on Twitter is just the tendency to see the bottleneck in LLM projects as skill issues. And at least from my n=1 experience, that has never been the main bottleneck for success.
There are a number of ways to make this case, but I think I'd start by saying that machine learning is a heavily empirical science. So what does genius mean in that context? What does talent actually mean?
There are certainly some skills which are useful - like how do you form the right minimal viable experiment? And how do you iterate fast to explore a research direction where you’re going to hit a lot of dead ends? But a lot of it comes down to hard work, good infrastructure, and ultimately resources.
So in that context, most of these orgs - even before public failings - had very good people. And I don’t think the difference in talent between orgs is that large. Smart people will eventually figure things out. So therefore, more often than not, the difference between a good versus a bad model is reflecting an inefficiency in the ability to channel resources to your talent. And that is the fundamental point.
Now you could say, on the flip side, okay, Ross, well, if that's true, why is Zuck paying people these massive amounts of money? And I think that's a separate question. But yeah, more often…
Nathan Lambert: Well what do you think?
Ross Taylor: I am torn on this because, on the one hand, I think the new group will probably make very good models. They’re very smart people. And I think having a new org as well is the right way to do it.
I think in leadership's mind, it's a case of “Look, we tried this multiple times, we’re very serious about this, we have resources, so let’s do the maximum conviction play”. And I think that's broadly what you should do because it’s a big expense, but it’s not massive, massive spend (for these large companies).
But on the other hand, I feel sorry for - this isn’t a Meta point by the way, but a general point - but I feel it’s a shame these organisations don’t have good mechanisms to identify the talent they already have in their orgs and have to recruit externally.
The talent that has already done the hard work, that is. It’s a shame they have to hire externally and start afresh. That’s the tragedy.
So that’s the conflict in my mind. I think they’ll make great models. I think it’s the right approach to do things afresh. But at the same time, it’s a shame that all the people that came before them, and made the previous generation of models, are treated like an asset. In the sense that you’ve used these people - grinded them really hard - and now you’ve moved on to a new group of people.
Nathan Lambert: You put this in your provocations. You said LLM labs are like investment banks where people are slotted in to burn out and burn through. I know that a lot of the work that needs to be done is somewhat mundane data work and it can be parallelised - e.g. if your users are asking this type of question, let’s create new prompts and manage human workers and create synthetic data pipelines. And that works a lot of the time.
But then, I remember the Dwarkesh podcast with Sholto and Trenton - and it’s the one where they’ve both moved jobs (which reinforces your point), but they were saying you just need to convince someone at a frontier lab that a particular problem is important. I.e. people talk about things, but they just have to do it.
So is it the case that people are just dispatched to solve specific problems, or do individuals have free rein, and it’s fun on the ground because you choose the things you want to add to your beautiful final model?
So you can present a positive and a negative. It might vary across labs, but I guess your provocation is that there's a bunch of places where it is a meat grinder and you just put people in and chew through them.
Ross Taylor: I think so. Unfortunately the model for a lot of successful tech companies is to get very young, motivated, people - with a base level of intelligence - and make them work very long hours on a project with a big mission. This was the classic Elon way to run a company.
But this is also the model for a lot of frontier labs. You have your soldiers who - on the surface - look similar to quants at hedge funds from like 10 years ago in terms of their working hours. And in the culture too, you have friendly competition between people who all want to be the best.
Nathan Lambert: I will say: I know a bunch of people at OpenAI, and they do work crazy hours. I also work a lot, but I do a lot of things that aren't grinding data to go into the model.
Ross Taylor: Yeah, so on the question of decision-making, I think major decisions are generally made by people who are a little more experienced and already have some successes to their name. But you do need to have soldiers in this kind of environment. The space is just highly competitive (and requires people to work long hours).
And I think that's a shame. Even for myself right now, where I’m trying to build a startup, I’m thinking that - yes, we all need to work hard - but is there an alternative model where you invest in your employees instead of using them? - i.e. burning them out and then moving on to a new group. That’s what I’m trying to work out for my new company.
Nathan Lambert: I feel like a lot of people are just more cynical now in tech, myself included. I got a great cold e-mail from someone fresh out of undergrad, and I was pretty sure in two to three years this person would be legit. And I was talking to a coworker on how we could potentially capture this and invest in them. And they were just saying we might get them, but then they’d just go to OpenAI in 2 years. So we don’t get any of the upside.
I think some of that is just cynicism. Investing in people is still the right thing to do because you’ll end up keeping the ones that are a bit more grounded even if it is really hard. For example, I've lost people that are extremely talented that I wouldn't want to keep. So I don't know how to balance that cynicism versus reality of building teams in the long term.
I guess smaller teams might be a bit easier to maintain, whereas if you’re at a tech company, the churn is hard to avoid because there’s so many levels in moving up.
I think some of the rumors around Meta and Llama 4 - at least from the Dylan Patel SemiAnalysis article - were about them doing these cowboy crazy model training runs, including changing pre-training mixes half way through, and that maybe points to dynamics with middle management wanting their data to be used so they can get promotions. But most labs I don't think are doing that type of s**t for their leading models. And I don't think Meta is normally doing that. I think that was a pressure cooker side effect.
Ross Taylor: I would push back on that a bit by stating that all of these labs are deeply chaotic places (not just particular orgs). They change direction every week, right? That’s just the nature of the field we’re in.
But then, it is definitely true that certain labs are good at projecting, at least externally, that they have their s**t together. They have AGI internally, all this kind of b******t.
The truth is that it is a shitshow everywhere. It's just that if you're going to be a s**t show, you at least want to be a functional s**t show, and you want to make good models. Right?
As I mentioned before, I think there are new plays to be made around taking the view that you want to invest in your talent as opposed to just grinding them out. But I would also say that, in lab culture, people tend to overvalue raw talent again - especially in empirical science. If you take the view that an empirical science is mostly about experimental velocity, then you don’t just value infrastructure in that world, but you also want to hire folks who are very collaborative and who want to help each other.
It sounds like a b******t point in a field that lionises individual intelligence, but I just feel that if you're making a marginal hiring choice, then you have to think about how someone adds to an existing group. So I think there are new plays to be made on talent.
But there is nuance. Because there are certainly people who are especially productive. I’ve seen that in person. So it’s not like everyone is equal - that is definitely not the case - but I just feel that individual talent is overemphasised when problems in these orgs are mostly structural.
Nathan Lambert: The differentiation right now is people who are willing to put more highly focused hours turning the crank. Every organisation has the baseline time costs of needing to do meetings, commute time to work, commitments etc. But in terms of AI, where people are doing more and more, this really favors young people who don’t have a lot of responsibilities.
Ross Taylor: This is maybe a transition onto another topic, but I’d make a more controversial point which is that - even the things in ML which seem more like novel research are more the result of persistence rather than inspiration.
For example, this time last year we were both speculating about what o1/Strawberry was. And speculation makes you think it was some amazing new thing. But actually it was basically what we were both doing two years ago right? Essentially RL from verifiable reward, but with very good base models, because they were in a good position to exploit that, and then enough ablations to find a recipe that worked.
So this is oversimplifying things a little, but we should take the view that they just had to do the work to make the recipe good. And that comes down to experimental velocity, and also having the right infrastructure and a good enough base model. So in that world, what is talent?
Is talent the person who says “we should make the models think more”, or is talent the person who is actually on the ground doing the ablations to find out which recipe works? Right? Because I can also make models think more by best-of-N, but then there may be better ways to do it?
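For readers unfamiliar with the term, best-of-N here means sampling several candidate answers and keeping the one a scorer likes most. A minimal sketch follows, assuming a hypothetical `generate` function standing in for a model and a hypothetical `score` function standing in for a verifier or reward model. Neither is a real API, and the toy versions below exist only so the example runs.

```python
import itertools
from typing import Callable


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates for a prompt and return the highest-scoring one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda candidate: score(prompt, candidate))


# Toy stand-ins (assumptions): a "model" that cycles through canned answers
# and a "verifier" that rewards only the exact correct answer.
_canned_answers = itertools.cycle(["5", "22", "4", "four-ish"])


def toy_generate(prompt: str) -> str:
    return next(_canned_answers)


def toy_score(prompt: str, candidate: str) -> float:
    return 1.0 if candidate == "4" else 0.0


answer = best_of_n("What is 2 + 2?", toy_generate, toy_score, n=4)
print(answer)
```

The extra inference compute goes into drawing more samples, and the quality of the result depends entirely on how good the scorer is, which is one reason other recipes may do better.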
Nathan Lambert: I mean, I think I analogize a lot myself with my athletics career - like rowing in college. I think so much of it is the same. I wasn't the most gifted athlete, but if you put in the hours and you understand where you're spending your effort, it works out for people.
The question I wanted to ask you on this topic is, given that these orgs are so chaotic, then what does this mean for the ceiling on progress? One of the most coveted questions is about the trend line. There are obviously going to be new paradigms - inference time scaling was an obvious one if you thought from first principles about what compute and intelligence is - but even if we don’t have a new paradigm, then what is the ceiling?
Ross Taylor: I would say that, even in climates where most organisations are chaotic, you’re still going to have macro factors that lift all boats. So a good example recently was these gold medal results on IMO. Three or so different labs all had different approaches and all found they crossed the threshold for a gold medal.
If you were to zoom out - and one way to do this is to imagine you're looking twenty years into the future back at this time - then would you look at the individual methods that researchers used, or would you just say compute reached a critical threshold where things began to work?
So compute is the big exponential that's underlying all of this. And then if you zoom into a shorter time horizon, then you're seeing more of the local challenges, like what’s the particular bottleneck at a point in time? So maybe the bottleneck to agentic models is scaling RL environments. Or maybe the bottleneck to better reasoning is longer context windows.
But look: fundamentally as long as compute keeps coming online, I think the trends look good and all of the organisational chaos is short-term noise. It slows down progress a bit but is not meaningful in the long-term. But, unfortunately, it's still meaningful for people in their careers because one to two years of organizational chaos could matter personally. But on longer timelines, it doesn't really matter.
Nathan Lambert: Yeah, I agree. It seems like the question is what happens when the fundraising starts to slow down. We're on a trend line of compute rollout. But if Sam Altman can't raise again, that is a very big sign. That's like the end of the “bubble”. OpenAI is not going to go away because of that, but if they can’t get the next cluster… then that would be a bad sign.
Ross Taylor: I'm quite optimistic because I think you only have a bust if AI ceases to be increasingly useful or doesn't live up to certain promises. But even if there's no algorithmic progress, I still think AI will continue to be increasingly useful. I don't think there are fundamental barriers. It's just a question of how quickly you get things right.
I think the argument would have been slightly different two years ago. If the reasoning paradigm didn't come through, then I think it would have been trickier to justify the expense because then you'd be looking at reasoning benchmarks and thinking: s**t, to push this forward I need this amount of data annotation or need to generate this amount of data.
Nathan Lambert: You look at GPT 4.5 as the example.
Ross Taylor: Yeah, exactly. That's a really good example. So you can treat that model like a counterfactual universe where reasoning didn't happen. There we would all be looking at the model thinking “it's good at creative writing, but maybe not so good at some more things we really care about (like reasoning)”.
By the way, I'm sure it’s a really good model. I didn’t play with it enough to form a good judgement.
Nathan Lambert: I've been using it a lot. I used it for a long time - especially until Claude 4 - as it’s just nicer, especially when GPT 4.1 was so sycophantic. But GPT 4.5 was nice.
Ross Taylor: So I'm gonna flip things around and ask you a question Nathan. Let's say we are here in a year's time. What does the key benchmark look like for LLMs that everyone is focused on?
Nathan Lambert: Oh, it's fully gonna be some agentic thing. I don't know if it'll be as stupid as making money on the stock market… I wrote a post on what I thought was coming next. One of the most poignant things I was looking at is the fact that scaling models is no longer the direction anymore. All the marketing is shifting to agents. And I think some of that is because it's not easy to scale parameters anymore.
Every RL curve is this log plot, and it becomes hard. But agents are already beginning to work well. For example, this year Claude Code showed up. There's gonna be versions of that in all sorts of domains and more people working to evaluate them. That will create an interesting marketing problem where labs need to figure out how to communicate that their model is good.
But the future looks like it’s all on the agentic side, and will lead to a big shift in what the language modelling companies need to think about. The prioritization of the company is also different, whereas modelling was always central before. I’m still modelling-pilled and think that is the central thing for the company…
But it’s true that teams building products are now going to hold more weight than they used to. And there will be interesting changes in how these companies manage this transition, and how communications change.
So, I think Claude Code is great. But I think that it's hard to integrate in some things. For example, how do I get that running on my cluster at AI2, where we have all of our data and models, and launch evals from our file system on the GPU machines? I don’t think that quite works yet, but maybe I’m doing something wrong.
Ross Taylor: Yeah, I agree with your answer. So I spent several years working on Papers with Code, where we were trying to focus heavily on evals before they were a big thing - trying to index all these various leaderboards. And I think now is an interesting situation because I feel like if you make good evals now, you possibly have more leverage than you've ever had in the field of ML.
This is a weird thing because traditionally evals were quite an unsexy thing to do. It was a thing that researchers didn't want to do because they'd rather be training models. But now the ability to define a metric for a capability that you'd like to see - e.g. trading stocks, or doing scientific research - is just incredible leverage that you can wield. A small group of people in places like universities can say “this is the new north star that we should achieve for agents” and shape how AI progress evolves.
Nathan Lambert: It can happen. We recently released IFBench, a benchmark for following instructions which is just more constraints and a different prompt sourcing. And I was saying to folks that we need to have the goal of making at least two frontier labs adopt it. And I messaged various people, including someone at OpenAI, and they said they already integrated it last week.
So yes, someone doing research (on evals) has a shot at getting into the OpenAI internal evaluation platform.
Ross Taylor: Exactly, so it's incredible leverage. And then the other interesting thing is that the friction for making and using good evals is going to increase quite a lot.
For example, in some of the recent benchmarks, you need the RL agent to have access to a GPU and then you need to spin up lots of these servers to do rollouts. This is expensive. Long gone are the old days where you had two CSVs with a train and a test split.
And then on the eval creator side, there’s a big difference between good and bad evals as models become more capable.
A bad eval just means that you're going to get incredibly egregious reward hacking, and you're not going to learn anything useful, whereas a good eval is a pathway towards a brand new capability.
Nathan Lambert: I have a related question on this. So I see three eras in evals based on what people are doing with models.
For pre-training, the best evals are testing knowledge and these very broad things and are hard to game. It's just kind of like FLOPs.
At post training, a lot of evals are formatting and extraction. I think formatting became even clearer to people when these RL environments became the hot new thing. And I actually think that post training might be like the ugly duckling in the middle, where then if you go into agents, all the agentic tasks are gonna be evals of actually doing things and you can't like format-lie your way through that. So it might be that post training evals are the hardest one to get right.
Ross Taylor: Yeah, and I think you're going to see more cases of people claiming good results, but when you look beneath the surface, you’ll see insane reward hacking. So the meme right now is KernelBench evals. Have you seen these?
Nathan Lambert: Oh.
Ross Taylor: You see all these amazing speed ups which aren’t even possible based on the hardware. And this is not a problem with KernelBench, I would say it’s more a problem with people publishing papers for agentic evals and not looking at their results carefully.
So this shows that to get an eval in the right place takes a lot of work. And even with progress in models, I don’t think you’re going to be able to fully automate the construction of a good eval in the next year at least. I might be wrong. Models will certainly help us in creating evals. So I think that, for now, it’s a place where a researcher can have a lot of leverage.
I think if you were to ask what the central eval is right now, it'd probably be something like SWE-Bench (verified). But even that is now quite saturated. So there's a big blue sky now where someone can define what the next big task is for ML. And you don’t need a big cluster in order to be the one who defines it; so I think that’s quite exciting.
Nathan Lambert: Yeah. And when you think about the amount of money that'll be steered by these things, it's so crazy to have the uncertainty there, including about who will come up with that. I think that's part of what makes it fun.
We should talk about reasoning things.
Ross Taylor: Reasoning. Yeah.
Nathan Lambert: Where do we start? I don't think I've ever done that much of a rant about the academic community chasing these things. I understand why academics are claiming to do new algorithms that get remarkable scores, but a lot of these papers are just extracting things from a model that are hard to document, or something else, or formatting.
I was on one of these papers, which was hilarious. We figured out that if you train Qwen on random rewards, the evaluation scores go up. And we had to go through the logic on why this can happen.
Because if there's no reward, the advantage is zero and the gradients are all literally zero. And then it turns out that the algorithm manipulates the most common sequences. It's actually something that if you read a lot of the reasoning literature, people talk about how we want to make sure our algorithm doesn't squash uncommon sequences. And then the real hammer is that, if you do random rewards, then you see that the model has mode collapse onto the things that it was trained on. And that can make scores go up.
So if you have a model that two thirds of the time has a certain behavior in its reasoning and that behavior is good on the benchmark, then just by fiddling the weights a bit then it does that behaviour more. This points to a structural failure.
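The "no reward signal means zero advantage" step above can be sketched in a few lines. This is a minimal illustration, assuming a GRPO-style group normalization in which each rollout's reward is centred and scaled against the other rollouts for the same prompt. The function name `group_advantages`, the use of the population standard deviation, and the `eps` value are illustrative assumptions, not details taken from the conversation or from any particular codebase.

```python
import statistics

def group_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its own group (GRPO-style sketch)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations differ
    return [(r - mean) / (std + eps) for r in rewards]

# Every rollout in the group gets the same reward: nothing to learn from.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])

# Rewards differ within the group: some rollouts are pushed up, others down.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
```

When every rollout in a group receives the same reward, each advantage is exactly zero, so that group contributes nothing to the policy gradient. Any effect from random rewards therefore has to come from groups where the rewards happen to differ, combined with other terms in the objective.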
I would also say it is a good example for why people should be using truly open models for research purposes and why they're so good for innovation. For example, if we knew what goes in Qwen data and if someone just filtered it and it was like, oh, look, I found the GPQA prompts in it…then we know data contamination has happened.
The Qwen case is borderline - I don't know how exactly to characterize it because the Qwen models are fantastic - but there's so much research that is showing that they are very likely to be doing some dubious things in terms of benchmarks. It's hard for people that aren't super in the weeds to hold both of these possibilities in their brains.
So I don't know. What do you think of the last six months? Have we actually made any progress? Has the academic community made any progress?
Ross Taylor: I think there's been little progress. I mean that in the literal sense: there’s been some progress, but it has been little. I think I can answer this question in several ways.
So after DeepSeek-R1 came out, there were two approaches in open source more generally, which was either you go down the distillation route or the RL route to make interesting small models.
The initial thing that was undervalued - at least from an engineering perspective - was that for smaller model sizes, it is far more efficient to do distillation than RL.
Nathan Lambert: And not just in compute but also in performance? It's hard to do RL on the small models.
Ross Taylor: I think this point has been made twice now. So there was the original DeepSeek-R1 paper, and then more recently, there was a new Qwen paper as well. The Qwen paper showed that RL needed 17x more compute than distillation.
So one way to think about this is that RL is a brute force lever to do data generation. But assuming that RL is still good, and you still want to do research on it in academia, then you run into a classic problem. And that problem is: if you don’t have enough compute, then you don't know if the structure you are imposing is gonna generalize (to high compute settings).
And my worry is that a lot of the results are on relatively low compute budgets, both in terms of the underlying base model, which determines how well the RL approach learns, but also the total number of RL steps. So it's just quite hard to see - unless there’s a massive gain - what’s truly important.
So the most useful things are - in my opinion - quite boring things. Like, there was the DAPO paper which showed that you should have filtering for overly long sequences, and you shouldn’t overly penalise them if your context window gets cut off.
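One simple way to read the "don't overly penalise cut-off sequences" idea is to give truncated rollouts zero weight in the loss, rather than treating them as failures. The sketch below assumes that reading. The `truncated` field and both function names are hypothetical, and this is not presented as the DAPO authors' implementation.

```python
def loss_weights(rollouts):
    """Weight 0.0 for rollouts cut off by the context window, 1.0 otherwise."""
    return [0.0 if r["truncated"] else 1.0 for r in rollouts]

def masked_mean_loss(rollouts, losses):
    """Average per-rollout losses while ignoring truncated rollouts."""
    weights = loss_weights(rollouts)
    total = sum(weights)
    if total == 0:
        return 0.0  # every rollout was truncated: skip the update
    return sum(w * l for w, l in zip(weights, losses)) / total

# Hypothetical batch: the middle rollout hit the length cap.
batch = [{"truncated": False}, {"truncated": True}, {"truncated": False}]
batch_loss = masked_mean_loss(batch, [1.0, 100.0, 3.0])
```

The truncated rollout's large loss is ignored, so the batch average reflects only the sequences that actually finished.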
There has also been interesting work showing that even simpler approaches (than GRPO) might work, where you remove clipping. So Reka was doing lots of good work using REINFORCE leave-one-out (RLOO). But even there, it’s difficult because you don’t know if simpler algorithms are going to work with long agentic traces.
So it’s not clear. I think the recent work this week was actually quite good. The GSPO work was good, and if you saw their graphs…
Nathan Lambert: Explain it to people. I think a lot of people have heard of the other ones by now. But GSPO is group sequence policy optimization with Qwen Coder. Why are you positive about it relative to the other ideas? I think GSPO is well motivated but why is it getting hyped more?
Ross Taylor: So I hope I don't botch this because it's the morning. But, essentially, with GRPO, you assign a reward to the whole sequence (via the advantage). But you also have an importance weight, which is your policy likelihood relative to your old one. Because when you do RL, you typically sample lots of rollouts but do several mini batches for your gradient update. So that means you go a little bit off policy.
So to fix that you have an importance weight term. But in GRPO, while the advantage is uniform across all tokens, the importance weight is particular for each individual token. And the importance weight is calculated for a single sequence. So one way of looking at this is that, if you had more sequences to calculate the importance weight, it would be a lot less variance - but by calculating it on a single sequence, you introduce a lot of variance through that term.
So the short answer of what GSPO does is that, instead of looking at a token likelihood, they look at the likelihood of the whole sequence. So now the clipping is not on an individual token basis; instead, it looks at one of the sequences in your group and says okay, this one is less likely, so we’ll clip out that sequence. And the TLDR is, at least from the results they show, it seems to be a lot more sample efficient.
I mean, it's not just 0.5 percentage points or something like that. But I think the reason I trust it more is that it’s very simple. And it’s quite directionally well motivated from just a basic understanding of importance sampling. If it were more complex, I'd be a lot more skeptical, but it's fairly simple and it seems to work well.
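The token-level versus sequence-level distinction Ross describes can be made concrete with a small sketch. It assumes you already have per-token log-probabilities of one sampled response under the old and the updated policy. The sequence-level ratio here is the length-normalized form (the geometric mean of the token ratios), and the helper names and the clip range `eps=0.2` are illustrative assumptions. Real implementations also make clipping depend on the sign of the advantage, which is left out here.

```python
import math

def token_ratios(new_logps, old_logps):
    """GRPO-style: one importance ratio per token."""
    return [math.exp(n - o) for n, o in zip(new_logps, old_logps)]

def sequence_ratio(new_logps, old_logps):
    """GSPO-style: one length-normalized importance ratio for the whole sequence."""
    diffs = [n - o for n, o in zip(new_logps, old_logps)]
    return math.exp(sum(diffs) / len(diffs))

def outside_clip_range(ratio, eps=0.2):
    """Simplified check: is the ratio outside [1 - eps, 1 + eps]?"""
    return ratio < 1 - eps or ratio > 1 + eps

# Hypothetical log-probs for a 4-token response under the old and new policy.
old = [-1.0, -2.0, -0.5, -1.5]
new = [-1.0, -1.2, -0.5, -1.9]

per_token = token_ratios(new, old)
per_seq = sequence_ratio(new, old)
```

In this toy case two individual tokens have ratios far from 1 and would be clipped at the token level, while the sequence as a whole has moved only slightly and stays inside the range. That averaging over tokens is the variance reduction the sequence-level view is aiming for.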
Nathan Lambert: Yeah, I'm still fairly skeptical.
I think academic research is relatively wide in what people are trying out but labs are relatively narrow. And once you’re further along in your modelling journey, you’re dealing with different parts of state space and then all these algorithmic tweaks just like help your model on whatever blocker it was or your implementation.
I thought for GSPO the sequence thing was funny because when you read the GRPO paper, you were like oh, the reward is just per sequence. But all the tokens in the sequence get the same loss function. But the standard implementation is to break it down per token. And then GSPO is essentially to take that standard implementation and you change the weight on every token back to this. And I was doubting whether this was really going to be a major thing.
I think for junior researchers, one of the good things about this era is that you can really learn the math by studying all these algorithms and thinking about how they are implemented. I hadn’t done that for a few years until writing this RLHF book on policy gradients and I was getting into the weeds like per-token loss, length bias for GRPO, and so on. For students to be able to do this in their brain, it is really good for thinking about the interface between algorithms and systems.
Ross Taylor: It’s interesting, because as AI became more hyped after ChatGPT, you have more people reading papers. This is a great thing, but also you have lots of new people reading papers in the wrong way.
For me the basic logic (for reading papers) is as follows: what’s the reported gain of the paper and how much complexity does it introduce?
So if you get a gain but the paper introduces shitloads of complexity, it's probably not going to stand the test of time. Whereas if it's something relatively simple, but it seems to get a good gain, then that’s the thing that is going to last.
Nathan Lambert: The o1 lesson. The simple thing. In RL research, I've heard it described as: if you see something that only beats the baseline by a few percent, it's not gonna work. But if it’s 2x then that’s a real innovation, because whether they finetune their baselines or not, they’re still going to be crushing it.
Ross Taylor: Exactly.
Nathan Lambert: So I think that's a good heuristic for people right now.
Ross Taylor: And I think researchers are their own worst enemy because they want to see their own methods work. But the weird thing in ML is that neural networks “want to learn”. So if you push something enough, it will work. It's just a question of whether that is a good use of your time.
So the question is: what's the right thing to scale and push on? So that’s why - when you read papers - at least what I say to young researchers is that you should always judge how big the reported gain is, how much complexity the paper introduces, and whether you trust the gain.
And then based on those three factors, you can judge whether it’s worth caring about the paper. But I can see why - if you’re new to reading papers - you might be attracted to complicated, new techniques in papers that seem methodologically interesting.
Nathan Lambert: And researchers often manipulate the results of their peer methods in a way that tells a convincing story. And I think these algorithms are a perfect example of trying to tell a story.
Ross Taylor: Yeah.
Nathan Lambert: So when you think of cognitive behavior of paper authors, you have to take that into account too.
Ross Taylor: The other point I’d make is that - in the reasoning trace - I understand that everyone has to focus on math and code, because that’s where t