Interviewing Eugene Vinitsky on self-play for self-driving and what else people do with RL

Interconnects

March 12, 2025 · 1h 9m

Show Notes

Eugene Vinitsky is a professor in New York University’s Department of Civil and Urban Engineering. He’s one of my original reinforcement learning friends from when we were both doing our Ph.D.’s in RL at UC Berkeley circa 2020. Eugene has extensive experience in self-driving, open-endedness, multi-agent reinforcement learning, and self-play with RL. In this conversation we focus on a few key topics:

* His latest results on self-play for self-driving and what they say about the future of RL,

* Why self-play is confusing and how it relates to the recent takeoff of RL for language models, and

* The future of RL in LMs and elsewhere.

This is a conversation where we take the time to distill cutting-edge research directions down to their core essence. I felt like we were learning in real time what recent developments mean for RL, how RL has different scaling laws from deep learning, and what is truly salient about self-play.

The main breakthrough we discuss is scaling up self-play techniques for large-scale, simulated reinforcement learning. Previously, scaling RL in simulation had become economical in single-agent domains. Now, the door is open to complex, multi-agent scenarios where more diversity is needed to find solutions (in this case, that’s what self-play provides).

Eugene’s Google Scholar | Research Lab | Linkedin | Twitter | BlueSky | Blog (with some great career advice).

Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.

Show outline & links

We cover many papers in this podcast. Also, as an experiment, here’s a Deep Research report on “all the papers that appeared in this podcast transcript.”

In this episode, we cover:

* Self-play for self-driving, mostly around the recent paper Robust Autonomy Emerges from Self-Play (Cusumano-Towner et al. 2025). The simulator they built powering this is Gigaflow. More discussion on HackerNews. (Here’s another self-play for self-driving paper and another from Eugene from earlier this year.) A few highlights:

“All simulated agents use the same neural net with the same weights, albeit with randomized rewards and conditioning vector to allow them to behave as different types of vehicles with different types of aggressiveness. This is like driving in a world where everyone is different copies of you, but some of your copies are in rush while others are patient. This allows backprop to optimize for a sort of global utility across the entire population.”

“The resulting policy simulates agents that are human-like, even though the system has never seen humans drive.”

* Large Language Models are In-context Preference Learners — how language models can come up with reward functions that will be applied to RL training directly. Related work from Stanford.

* Related literature from Interconnects! The first includes literature we mention on learning locomotion for quadrupeds with deep RL (special shoutout as usual to Marco Hutter’s group).

* Recent and relevant papers: Value-based RL Scales Predictably and Magnetic control of tokamak plasmas through deep reinforcement learning.

* Other things we mention:

* Cruise, Tesla, and Waymo’s autonomy stacks (speculation) and how the self-driving industry has changed since we were / were considering working in it.

* Evo 2 foundation model for biology.

* Eugene is working with a new startup on some LLM and RL stuff. If you’re interested in this episode, ping [email protected]. Not a paid promotion.

Chapters

* 00:00:00 Introduction & RL Fundamentals

* 00:11:27 Self‑Play for Self‑Driving Cars

* 00:31:57 RL Scaling in Robotics and Other Domains

* 00:44:23 Language Models and In-Context Preference Learning

* 00:55:31 Future of RL and Grad School Advice

Transcript

I attempted to generate this with ElevenLabs’ new Scribe tool, but found the formatting annoying and reverted to Alessio’s smol-podcaster. If you’re interested in working part-time as an editorial aide to Interconnects, please get in touch.

Nathan Lambert [00:01:27]: Hey, Eugene. Welcome to the show.

Eugene Vinitsky [00:01:29]: Hey, Nathan. Thanks for having me. Excited to be here.

Nathan Lambert [00:01:32]: Yeah, so I'll have said this in the intro as well, but we definitely go way back, all the way to Berkeley days and RL days, I think.

I will embarrass you a little bit now on the live read, which is like, you were one of the people when I was switching into RL, and they're like, oh, it seems like you only figured out how to get into AI from a potentially different background, and that's what I was trying to do in 2017 and 2018.

So that was kind of fun, and now we're just friends, which is good.

Eugene Vinitsky [00:02:01]: Yeah, we were both figuring out. If I had any lead over you there, I was also frantically trying to figure it out, because I was coming from a weird background.

Nathan Lambert [00:02:11]: There are definitely a lot of people that do that now and over-attribute small time deltas to big strategic plans, which was probably what it was.

And we're just going to do some of our normal conversations on RL and self-play.

I think the backstory of this is you told me that your recent paper from some of your time at Apple, I don't want to time it too specifically, was something, paraphrasing, like the most exciting RL thing you've ever been a part of.

And major RL projects are not that frequent.

I think if you segment out all the language model excitement in the past 10 years, there's really a few major milestones, and it's good to kind of talk about them.

So we can kind of start, I think, basic things, like how do you define reinforcement learning, and it will kind of build up to this self-driving project.

Eugene Vinitsky [00:03:05]: Yeah, so I think RL is kind of a big thing, but at a really basic level, you have this process of taking actions in the world.

You're seeing the state of the world.

If you're taking actions in the world, you sometimes receive a reward that tells you the value of that action, and you're trying to kind of optimize your cumulative behavior over time.

So that, you know, over long trajectories, you're optimizing those costs.

That's both, you know, the hard thing and the exciting thing is that if you do it well, you can really optimize really long horizon behaviors.
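Editor's sketch: the "cumulative behavior over time" Eugene describes is usually formalized as a (discounted) return. The snippet below is a minimal illustration of that standard definition, not anything taken from the paper; the function name and the discount factor `gamma` are assumptions for the example.

```python
def discounted_return(rewards, gamma=0.99):
    """Sum of rewards along one trajectory, discounted by gamma per step.

    gamma is an illustrative assumption; the conversation does not give one.
    """
    total = 0.0
    # Walk backwards so each step folds in the value of everything after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# A sparse, long-horizon signal: nothing for 99 steps, then a reward of 1.
sparse = [0.0] * 99 + [1.0]
print(discounted_return(sparse, gamma=1.0))   # undiscounted sum of rewards
print(discounted_return(sparse, gamma=0.99))  # the late reward counts for less
```

The long-horizon difficulty shows up directly: the only informative reward arrives 100 steps after the first action that contributed to it.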

Nathan Lambert [00:03:41]: Yeah, I agree.

And it's funny because now it's finally, the language models are finally doing this long chain of thought, and I don't really think that's the same.

I think the interactive notion will come up a lot here where these long context behaviors are many, many actions interacting with the world relative to one really, really long action, which is kind of odd.

Eugene Vinitsky [00:04:04]: Yeah, I guess, yeah, it mixes things, right?

Because it has very long state, right?

It's got very long contexts, and it's generating its own context.

But in the end, there's really one action at the end that, like, kind of determines how everything went, you know?

Nathan Lambert [00:04:23]: Yeah, yeah, yeah, we'll get into this.

And then the next thing that we kind of need to set up is what do you define self-play as?

I think this word has been particularly broken in recent times with language models, and I'm hoping we can get a fairly specific criteria for what is self-play and what are related topics.

Eugene Vinitsky [00:04:42]: Yeah, I think even within the field, there's quite a bit of debate as to what constitutes self-play.

So talking to, you know, experts, people will disagree about what methods are and aren't self-play.

But what I will say is I generally define self-play as anything where an agent plays a copy of itself.

So up to a bunch of different agents interacting with each other, as long as they're mostly, in some ways, copies of each other, we're doing self-play.

Nathan Lambert [00:05:12]: Yeah, and then do you think anything, I mean, your background's in multi-agent as well.

Do you think there is something fundamental to kind of a game that has a really specific hill to climb where it's kind of this competitive nature versus something like language?

Eugene Vinitsky [00:05:29]: Yeah, this is kind of the dream of, I think, some multi-agent researchers is this type of like ratchet effect where you have a bunch of agents interacting with each other and kind of increasing complexity on the part of any agent generates increasing, like creates new challenges that need to be solved and then force you to learn new skills.

And then you kind of get this endless, endless ratchet.

Maybe that's what you meant.

I may have misinterpreted.

Nathan Lambert [00:05:55]: We're going to revisit it.

I think also it's like, how does the multi-agent nature of a lot of these things change what people think about with RL?

This is kind of the last building block before we go into the self-driving stuff.

Eugene Vinitsky [00:06:07]: Yeah, yeah, yeah.

So the way that the multi-agent thing changes things is it makes everything much harder and more interesting.

So you go away from this world where you have like a clear score function, right?

So you have some reward for first in single agent setting, you have some reward.

If that reward is high, you're doing well, right?

And when you move into the multi-agent setting, it becomes reward with respect to whom, right?

It all of a sudden matters whom I'm playing, right?

So if we go to a game of like, like one setting is like two-player zero-sum games, right?

So a game of two player poker, I give you, I train a poker bot, right?

How do I know it's any good?

I have to play another poker bot to decide that it's any good, right?

And so all of a sudden, this challenge of like, what is a good policy becomes very fundamental.

And you kind of lose even a notion of there being like one clear good policy.

And like the whole, a lot of, a lot of the field of multi-agent is coming up with different definitions of what would constitute goodness.

Nathan Lambert [00:07:06]: Um, so, and then back to the self-play thing with that, like, is all of the self-play that we discussed, like if you were playing yourself, does the same consideration apply?

Like, is that, is self-play necessarily a multi-agent framing?

Eugene Vinitsky [00:07:19]: Um, I think it, I think it is because oftentimes what we're trying to do with self-play is like to converge to some notion of policy goodness.

And self-play is just a mechanism that gets us to some definition of, of high quality policies.

Um, and, and, and what turns out to be the case is there, there are actually many like non-self-play type methods for doing this.

Self-play just turns out to be an effective way to accomplish constructing effective policies.

Nathan Lambert [00:07:45]: Yeah, I, I, there's many, I'll, I'll link later a lot of these papers on self-play for preference learning and look into them a bit more.

Eugene Vinitsky [00:07:56]: Yeah.

Nathan Lambert [00:07:57]: Essentially that's been the lens.

There's two lenses by which this has come back and both of them, I don't think fit into, I, I think this multi-agent lens of self-play is much richer and I don't think any of them have fulfilled this.

I think there's useful methods for preference tuning.

I think that's like maybe SPIN, it's like self-play something preference learning, is one.

And there's papers related to this where they're probably looking at the probability of the own model in generating a response or something like looking at the internals of the model.

And it's not really set up in this game nature of some sort.

And then also with Q*, when the self-play stuff came back, where I really think, I've, I've talked to some people that did original reporting on this and it was that the model looked like it was talking to itself.

And I think that very understandably for less, a little bit less technical audiences that haven't engaged with self-play, that coverage of talking to itself got transformed into a self-play commentary and hype cycle, which took people down the wrong path for like an entire year, which is so brutal, but also very understandable and funny.

Eugene Vinitsky [00:09:11]: Yeah, I think there's something interesting and different happening in these like multi-agent like LLM self-play setups.

I'm not super familiar, but I think what's happening is something quite different than what we mean in other multi-agent settings when we're talking about self-play.

Like I feel like it's, it's more about like refining like the distribution of actions that it takes in some, some kind of odd way.

Nathan Lambert [00:09:39]: I think this sounds ridiculous at first pass, but it's almost that the language models are simulating a softer version of self-play within themselves to kind of check their own work and engage in their own discourse, which the level of intelligence they have is not going to like unlock the true like incremental progress that we think of with self-play.

Which probably, I think for context, the things for self-play, just to put them on the record, that have been very impactful are things like AlphaGo and MuZero.

I think that's, those are the prime examples of generating some superhuman policy in a closer way.

I think it's, it's important to kind of gate the conversation on like, these are the aspirational goals, um, in terms of outcomes and then figuring out how to apply them to new domains and new tools is kind of unknown.

Eugene Vinitsky [00:10:31]: So, so maybe I should have said this earlier, but like self-play is the thing that gives a, is like maybe the one way that we know to build superhuman agents right now.

So, right.

So, um, superhuman Go, um, human-level Dota, human-level, uh, StarCraft.

Um, technically poker is in a, in a slightly weirder, um, weirder space where I don't, I don't exactly know that I would call the methods that underlie that self-play.

Um, sorry.

Um, and, uh, but yeah, it's the one way we really know how to build superhuman agents.

Nathan Lambert [00:11:06]: And I think this is a kind of a natural transition because the, to make people excited in the work that you did, it seems like you've discovered superhuman driving through self-play without inductive biases.

And I'm like, um, how do you view the potential impact of this?

And then we can kind of go into the method.

Eugene Vinitsky [00:11:27]: Right.

So the, the challenge with self-play is, and this requires a bit of technical detail to get there, but you know, in, in like two-player zero-sum games, games where you and an adversary are playing with each other and somebody wins and somebody loses, there's a very well defined notion of what being good is.

Um, you know, there are, there are well, you know, there are criteria that we would like our policies to converge to.

And, and the challenge has always been about moving beyond that to a domain where it's much harder to define what, what doing well means, right?

There isn't like an abstract notion of what good driving is there out in the world where I could just write down the reward function and simulate it and optimize with respect to that.

And all of a sudden I'd have a good driving policy.

So the, the gap has always been between these methods that work really, really well in, in well-defined games like, like StarCraft or Go, uh, and chess, um, and settings where it's much harder to define that.

And so we haven't been able to, to move to self-play in settings where, for example, humans might be in the loop, right.

And, and particularly driving is an instance of that somewhere where at the end, we're going to take our policy and it's going to drive with humans and we have no way to simulate humans and play against them.

Um, and so figuring out how to close that gap has been kind of an open, open challenge.

And I think maybe this is the first instance of, uh, finding a way to do that.

Nathan Lambert [00:12:51]: Okay.

So that's a much better motivation than I gave.

And I understand the excitement now, because if this works in one domain, um, and you'll tell us about how grand of an effort it actually was.

I know big tech companies can put a lot of force and long-term investment behind things to get them off the ground.

Then a lot of the other things that people are saying about language models or other complicated domains are at least there's an existence proof of something similar happening.

So why don't you just continue to explain, uh, this problem set up of learning driving without having a human teacher.

It will probably take detours to analogize different self-driving stacks just because we know about them and it's good to compare.

Eugene Vinitsky [00:13:36]: So one way of framing this is, and I'm going to put cautions in the end, I'm going to give you the, the, the extreme version of it.

And I'm going to walk it back a little bit is like human level driving without any human level data.

And the caution needs to be that this is in simulation and our ability to measure human level driving in simulation is limited in a lot of ways.

So I can tell you about the ways that we measured it and then I'll, I'll have to tell you what the limitations of those things are.

Um, so this was a large scale effort, um, uh, in Vladlen Koltun's team at Apple, um, it was about like eight researchers, research engineers working together for about a year and a half, uh, build, building the stack out.

Um, it was, I think a lot of us came at it from different places.

I know some folks were very inspired by this idea of like AlphaStar for driving, you know, building a diverse, rich world and then driving in it in such a way that you would, you would transfer to policies that you hadn't seen before.

So like human actors.

Um, so, um, yeah, the, the, if, if, if it's helpful that the idea here is that, or the goal here was to build a human level simulated driver.

Um, and here, what that means in our case is not a fully end-to-end method, right?

So we're not simulating perception.

So driving stacks consist of like generally perception, prediction, planning controls.

So you have a perception stack that, you know, takes your LIDAR, your camera, your radar, and converts it into, you know, where the cars are, where the road is, what's impassable.

Um, and then a prediction stack will take the like positions of all the cars, the cyclists, pedestrians, and it'll predict, predict where they're going to go next.

And then a planning stack will say, okay, given those predictions, you know, what's a good trajectory for me to take.

And then the control stack will say how to actually follow that trajectory safely and robustly.

Right.

And we're talking about subsuming the prediction, planning, control portion of the stack, not the perception part of the stack.

Nathan Lambert [00:15:28]: Okay.

So I was, I was thinking that you might not even do control.

I was thinking you might just say, uh, control is a solved problem and not do that too.

Eugene Vinitsky [00:15:35]: So in the same way, we're kind of, we're only kind of doing control.

Uh, we're, we're, we're doing this for, I think Waymo uses the

Nathan Lambert [00:15:42]: the term behavior for this.

I think it's been their behavior team for a while.

Is that right?

Eugene Vinitsky [00:15:46]: Okay.

Nathan Lambert [00:15:47]: Uh, you know, I very, it's hard to know where the abstraction ends, but they definitely have a behavior team that's done a lot of things through the years.

Well, at least in the job apps that I've applied to or have interviewed for in the past.

Yeah, me too.

Eugene Vinitsky [00:16:01]: Um, I think we do know how to control cars.

We know how to make cars follow a pre-specified trajectory, right?

This is, this is somewhat of an easier problem than like humanoid robotics or something.

You know, big thing got wheels.

We know how to make it turn.

Nathan Lambert [00:16:14]: Um, so how do we get these things from, I mean, they start as like, it doesn't start at just all the simulated cars crashing all the time.

What is the start here?

Eugene Vinitsky [00:16:24]: I'll send you the video once it's out, but like, you know, the, the first 10 hours of simulation is just like cars scattered all across the road, smashing into each other, driving off the road, that type of thing.

It's actually interestingly useful because what we do is when two cars crash, we have them come to an immediate stop.

And this actually creates a lot of blockades in the road.

So at some point during the training, the cars start to learn to drive around stopped cars, even though those cars are stopped because they've crashed, um, as well as to drive around like obstacles and things like that.

Um, so that, yeah, that's what it looks like.

Um, yeah.

Nathan Lambert [00:16:58]: Um, and what's the reward function for these?

So you have a bunch of cars that can see their peers and there's some reward function I'm guessing.

Eugene Vinitsky [00:17:06]: So the, the major component of the reward function is getting to your goal without colliding.

So we, we have these maps that we've taken from the CARLA simulator.

They're fairly large maps.

Some of them are like multiple kilometers in spatial extent.

We have eight of them and we place goals randomly over the map.

Um, and you get a sequence of goals.

So, you know, that like, okay, I want to get to this point.

And then after that, I'm going to want to get to this next point.

After that, you're going to get a big reward for getting to that goal.

You're going to get some amount of penalty for colliding.

And then there's also an implicit penalty because if you collide, you're not ever going to get to your goal.

And then there, there is some amount of hand design here in that there are small rewards for like staying in your lane and being aligned with your lane and like, you know, not driving in the opposite direction in the wrong lane.
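Editor's sketch: the reward structure described here (a large goal reward, a collision penalty, and small hand-designed lane terms) could be written roughly as below. Every name and number is a hypothetical placeholder chosen for illustration; the actual terms and magnitudes used in Gigaflow are not stated in this conversation.

```python
from dataclasses import dataclass


@dataclass
class RewardWeights:
    # Placeholder magnitudes, not values from the paper.
    goal: float = 1.0            # big reward for reaching the next goal
    collision: float = 1.0       # penalty for colliding
    lane_keeping: float = 0.01   # small reward for staying in a lane
    lane_alignment: float = 0.01 # small reward for heading along the lane
    wrong_way: float = 0.01      # small penalty for driving against traffic


def step_reward(w, reached_goal, collided, in_lane, alignment, wrong_way):
    """Per-step reward; `alignment` is in [0, 1], 1 meaning parallel to the lane."""
    reward = 0.0
    if reached_goal:
        reward += w.goal
    if collided:
        reward -= w.collision
    if in_lane:
        reward += w.lane_keeping
    reward += w.lane_alignment * alignment
    if wrong_way:
        reward -= w.wrong_way
    return reward


# An agent that reaches its goal while driving cleanly in its lane.
print(step_reward(RewardWeights(), True, False, True, 1.0, False))
```

Keeping the weights in one small structure makes it easy to vary them per agent, which is what the randomization discussed next relies on.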

Nathan Lambert [00:17:51]: This was one of the questions is if you had to do this sort of thing.

Eugene Vinitsky [00:17:54]: You have to do that.

But one interesting thing, and maybe we could talk about that at some point is we randomize the weights of those rewards.

So there are agents that like really want to drive in the lane going in the right direction.

And there are agents that don't care about that at all.

And they will take the wrong lane on the highway, uh, you know, going at full speed in the opposite direction.

And that's kind of useful because you're ready for that scenario.

You've seen that scenario in the world when you're driving around.

Right.

Um, but yeah, we have to, we have to do some of that stuff because at some point there are laws and you can't avoid encoding the laws into your system.

You know, stop signs are a human concept.

Um, they're, they're not, you know, it's not going to emerge that you see a red thing and you're like, oh yeah, that means I should stop.

And then I should like give the right of way rules to the other cars.

Um, but all of our rewards are kind of soft in the sense, like, you know, if you're at a stop sign and folks have been preventing you from going for a very long period of time, right.

You're going to start to nudge in and like break the rules about right of way.

Nathan Lambert [00:18:55]: One of my questions for later on this is like, do you think our vehicles and driving dynamics and infrastructure kind of constrain the way of driving?

Like we've co-designed human driving in our infrastructure so that human driving is actually no longer that special because of the track is so long, so defined.

Eugene Vinitsky [00:19:11]: I think this is, this is part of why this is all going to work or like why it works is because like human, human driving is, and human behavior in many domains is like fairly constrained by the institutions and the laws and the norms that we design.

Uh, it's not super free-form, uh, so like driving amongst humans is much more of a constrained problem than you would, than you would, you would think. It's also unconstrained in some interesting ways, but, but it's, it's quite constrained.

Nathan Lambert [00:19:42]: And how hard was this to actually learn?

So how sensitive of a process is it now?

I think in the paper, you're talking about Gigaflow, which is like a high speed simulation engine.

So obviously, you know, on data, the final paper says that it learns in 1.6 billion kilometers of driving.

I was wondering if you had an intuition for that.

So like how many miles are driven by all the cars in San Francisco in a day or something like this?

Eugene Vinitsky [00:20:10]: That's a, that's a great question.

Nathan Lambert [00:20:12]: Um, it could be a good ChatGPT query, to be honest.

Eugene Vinitsky [00:20:16]: This might be a ChatGPT question.

Um, let me, let me give some, some numbers that I do know.

Uh, and this is kind of maybe helpful.

So I think cars crash every 20,000 to a hundred thousand miles and a fatal collision happens every a hundred million miles, something like that.

Um, but how many miles are driven in a day in a city?

I'm not sure.

1.6 billion kilometers, the distance between here and Saturn.

Um, it sounds like kind of far when you put it that way, but there are a lot of cars.

Yeah, there are a lot of cars, right?

There are a lot of drivers.

Um, there are surprisingly few trips in a city, fewer than you would expect, but, um, I'm struggling to put a number on it.

Nathan Lambert [00:21:01]: Um, I'll tell you what ChatGPT gets when it's done.

I was thinking it's o3-mini-high.

This is not a reliable number.

Take this time.

So your intuition that it's lower goes a lot.

I mean, you've thought about a lot of these car systems for a very long time and I will link to some of your other work on this.

So you definitely have better intuitions than I would.

Eugene Vinitsky [00:21:20]: Well, the intuition comes with the fact that like a lane of the highway can take 2000 vehicles per hour, which is like just not that many vehicles.

Um, and you know, most, most of, most of traffic is between like, you know, 8am and or like 7am and like 10am and then on the way back home.

And so, you know, you can like kind of estimate based on how many lanes there are on the main highway, how many trips there are.

Nathan Lambert [00:21:43]: So San Francisco, ChatGPT o3-mini-high estimated 4 to 5 million miles in a day in San Francisco.

It's a bully.

It's a plausible number, but it's well below what you are doing.

Like this is, I think maybe globally this billion kilometers could be hit.

So this is okay.

Eugene Vinitsky [00:22:03]: Here's one way to think of it.

We simulate 10,000 years of human driving.

Nathan Lambert [00:22:08]: Okay.

So yeah, 10,000 per one.

I guess it depends on how many cars you have in parallel.

Eugene Vinitsky [00:22:14]: Per one training run. To get the policy that we get.

We simulate about 10,000 years of human driving.

Nathan Lambert [00:22:20]: Yeah.

Eugene Vinitsky [00:22:21]: Yeah.

Nathan Lambert [00:22:22]: So to have 10,000 cars, it's all of them driving for a year.

Eugene Vinitsky [00:22:26]: Yeah, exactly.

And we have about like a million cars driving at any given time in the simulator.
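Editor's sketch: a back-of-the-envelope check on the numbers quoted in this exchange. The 1.6 billion km and 10,000 years come from the conversation; the San Francisco figure is the midpoint of the unverified 4 to 5 million miles per day estimate mentioned above, so treat the last line as a rough order of magnitude only.

```python
KM_PER_MILE = 1.609344
HOURS_PER_YEAR = 365.25 * 24

total_km = 1.6e9          # distance quoted from the paper
sim_years = 10_000        # simulated driving time quoted by Eugene
sf_miles_per_day = 4.5e6  # midpoint of the unverified chatbot estimate

total_miles = total_km / KM_PER_MILE
avg_speed_kmh = total_km / (sim_years * HOURS_PER_YEAR)
sf_days = total_miles / sf_miles_per_day

print(f"{total_miles / 1e6:.0f} million miles in total")
print(f"{avg_speed_kmh:.1f} km/h implied average speed")
print(f"{sf_days:.0f} days of city-wide San Francisco driving")
```

Under these assumptions the training run covers roughly a billion miles, which is on the order of 220 days of all driving in San Francisco, at an implied average speed of about 18 km/h. The low average is plausible if the simulated time includes stopped, crashed, and non-car agents, though the conversation does not say how the 10,000 years were counted.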

Nathan Lambert [00:22:34]: Do you think that substantially changes the learning dynamics?

Like are they all, how many cars are any of them interacting with at any one time?

Eugene Vinitsky [00:22:40]: Yeah.

Any given simulator in any given world.

So this is this like kind of like Isaac Gym-style vectorized simulator.

So it all runs on the GPU.

So it's a bunch of worlds happening in parallel, but any given world, there are about 150 cars in it, which means that sometimes you're driving in sparse traffic and sometimes you're going to interact with like 10 or 20 cars at any given time.
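Editor's sketch: what "a bunch of worlds happening in parallel" can look like in code. This is a toy NumPy stand-in on the CPU, not Gigaflow or Isaac Gym; the world count, collision radius, time step, and point-mass dynamics are all assumptions. At about 150 agents per world, a million simultaneous agents would correspond to several thousand such worlds.

```python
import numpy as np

NUM_WORLDS, AGENTS_PER_WORLD = 4, 150  # 150 per world is from the conversation
DT = 0.1                               # seconds per step (assumed)
COLLISION_RADIUS = 1.0                 # metres (assumed)

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 500.0, size=(NUM_WORLDS, AGENTS_PER_WORLD, 2))
vel = rng.normal(0.0, 5.0, size=(NUM_WORLDS, AGENTS_PER_WORLD, 2))
crashed = np.zeros((NUM_WORLDS, AGENTS_PER_WORLD), dtype=bool)


def step(pos, vel, crashed):
    """Advance every agent in every world with one batched update."""
    # Crashed agents come to an immediate stop, as described earlier.
    vel = np.where(crashed[..., None], 0.0, vel)
    pos = pos + vel * DT
    # Pairwise distances are computed within each world only.
    diff = pos[:, :, None, :] - pos[:, None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    idx = np.arange(AGENTS_PER_WORLD)
    dist[:, idx, idx] = np.inf  # ignore each agent's distance to itself
    crashed = crashed | (dist < COLLISION_RADIUS).any(axis=-1)
    return pos, vel, crashed


for _ in range(100):
    pos, vel, crashed = step(pos, vel, crashed)
print("crashed agents per world:", crashed.sum(axis=1))
```

The leading array axis indexes worlds, so adding worlds grows the batch without changing the step logic, which is the property that lets this style of simulator run entirely on an accelerator.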

Um, and I, I think one thing is that one, one cool thing is that at that scale, I think RL becomes very, very stable.

Um, like for us, like every training run succeeds, the reward curves go straight up.

You know, there's no like, um, what are you scaling?

Nathan Lambert [00:23:19]: Are you just like scaling batch size effectively?

Uh, what is, yeah.

What is the actual thing you're, they're scaling?

Eugene Vinitsky [00:23:26]: We're scaling the amount of experience generated.

So it's like a trillion samples of, of total experience, um, that, that the agents train on.

Um, and then, yeah, we use gigantic batch sizes, like, you know, um, but like, what is the thing

Nathan Lambert [00:23:43]: that you need to dial up in order to make learning actually happen?

Eugene Vinitsky [00:23:47]: Uh, total amount of experience generated, right?

So you need to be generating, you know, million samples per second to train on type of thing.

Nathan Lambert [00:23:57]: Okay.

And then what is the actual, I guess I don't know a ton about multi-agent RL, but what is the actual RL like algorithm and is it a giant replay buffer that is just building and building and building?

Eugene Vinitsky [00:24:08]: It's PPO.

Uh, you know, one thing we've been seeing throughout our work pretty continually is that for, for both theoretical and empirical reasons, PPO is actually a really good multi-agent RL algorithm.
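Editor's sketch: for readers who have not seen it, PPO's core is the clipped surrogate objective from the original PPO paper. The NumPy version below shows only that objective; the clip value of 0.2 is a commonly used default, not a hyperparameter reported in this conversation, and the value-function and entropy terms of a full PPO loss are omitted.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate objective (to be maximized)."""
    # Probability ratio between the updated policy and the data-collecting one.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the minimum removes the incentive to move the ratio far from 1.
    return np.minimum(unclipped, clipped).mean()


logp_old = np.log(np.array([0.5, 0.5]))
logp_new = np.log(np.array([0.9, 0.1]))
advantages = np.array([1.0, -1.0])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```

In the multi-agent setting discussed here, the same objective would simply be averaged over samples from every agent, since they all share one set of weights.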

Nathan Lambert [00:24:20]: You had the paper, are you, you are on the paper years ago.

That's like on the something, something PPO multi-agent simple.

Eugene Vinitsky [00:24:29]: So we know that PPO works empirically pretty well.

That's basically the title of the paper.

That's a PPO simple, good multi-agent cooperative.

Good.

Uh, it's good in cooperative problems.

It's, it turns out to be pretty good in two-player zero-sum games.

And, and here in, um, this driving thing, it's what's called a general-sum game.

And, and there, you know, it seems to work in the setting too.

So evidence is accumulating.

Nathan Lambert [00:24:51]: Something that people probably don't know about multi-agent RL and maybe I don't know either, but in this paper, all of the cars were using the same actual weights of the model.

Is that standard in multi-agent RL or is it kind of a variable?

Eugene Vinitsky [00:25:04]: So I'll add one little, uh, subtlety here.

So yes, we're using every policy is the copy of the same agent, right?

They're all looking at their local observations.

So it's decentralized, but it's all one copy, but every agent gets its own like conditioning vector.

That's like, what are my like reward weights?

How big of a, you know, what's my width and my length?

Am I a cyclist?

Am I a pedestrian?

Am I a driver?

And they flexibly adjust their behavior based on that condition.
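Editor's sketch: one set of weights, many behaviors. Each agent feeds its own local observation plus its own conditioning vector (reward weights, width and length, agent type) into the same network. The layer sizes, value ranges, and conditioning layout below are invented for illustration and are far smaller than the policy discussed in the episode.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_AGENTS, OBS_DIM, ACT_DIM, HIDDEN = 5, 8, 2, 16
NUM_REWARD_TERMS, NUM_KINDS = 3, 3  # kinds: driver, cyclist, pedestrian
COND_DIM = NUM_REWARD_TERMS + 2 + NUM_KINDS


def sample_conditioning(rng, n):
    """Draw a random conditioning vector per agent (all ranges are assumptions)."""
    reward_weights = rng.uniform(0.0, 1.0, size=(n, NUM_REWARD_TERMS))
    width_length = rng.uniform([0.5, 0.5], [2.5, 12.0], size=(n, 2))
    kind = np.eye(NUM_KINDS)[rng.integers(0, NUM_KINDS, size=n)]
    return np.concatenate([reward_weights, width_length, kind], axis=1)


# A single shared set of weights used by every agent.
W1 = rng.normal(0.0, 0.1, size=(OBS_DIM + COND_DIM, HIDDEN))
W2 = rng.normal(0.0, 0.1, size=(HIDDEN, ACT_DIM))


def policy(obs, cond):
    """Map (local observation, conditioning) to bounded actions."""
    x = np.concatenate([obs, cond], axis=1)
    return np.tanh(np.tanh(x @ W1) @ W2)


obs = rng.normal(size=(NUM_AGENTS, OBS_DIM))
cond = sample_conditioning(rng, NUM_AGENTS)
actions = policy(obs, cond)
print(actions.shape)
```

Feeding identical observations with different conditioning vectors yields different actions, which is the mechanism by which a single network can act as a patient driver, an aggressive one, a truck, or a pedestrian.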

Nathan Lambert [00:25:29]: Do you think that's actually like, if you were to squint at the system, is that actually changing the policy or is it changing the environment in kind of an indirect way?

Eugene Vinitsky [00:25:38]: It's, it's changing the policy.

Like you'll see that like a car is like, oh, I'm a, I'm a, like a pedestrian.

I'm a, I'm a big truck.

I'm going to do like a K point turn to turn around.

Uh, I'm a pedestrian.

I'm, you know, going to like smoothly wiggle through these small boxes of areas that I couldn't get through.

Otherwise it, it, it really, uh, appreciably changes the policy, which is cool because it's this like tiny 3 million parameter neural network or like 6 million parameter.

Um, and, and so like, there are all these like little sub policies inside of it that you can activate by, by conditioning.

Nathan Lambert [00:26:11]: Can you do it, um, post hoc to change the behavior in an interpretable way?

Eugene Vinitsky [00:26:16]: Um, I don't know about interpretable.

I guess it, it sometimes depends what we mean when we say interpretable, but yeah.

So you can be like, look, okay, you, you, you don't care about staying in your lane and you'll see it start going into the other lane and driving.

You know, you change the size of the policy or like the, the car and it will change the trajectories that it takes in response.

Um, it's, it's very responsive to this condition.

Um, we have some cool graphs in the paper pointing, pointing out all the different things you can make it do by changing these, these values.

Nathan Lambert [00:26:46]: Um, I'm trying to think of how this reflects on the nature of driving and what the downstream use of this tool is.

So you showed that this is doable and what does this, like, what does this mean for self-driving specifically?

Like, what would you do if you had the same big team and you know that this exists and you're interested in self-driving as a field?

I mean, there are obviously a lot of people that a lot of companies that have big teams and lots of money to try to think about self-driving.

Eugene Vinitsky [00:27:14]: So as I said earlier, like there's this like, um, perception, prediction, planning, control stack.

And I think this is really providing a lot of evidence that you could maybe subsume the prediction and the planning stack, um, and, and put it into this type of like end-to-end policy that you could then like train in sim and then maybe not zero-shot deploy onto the roadway, just like take it straight from sim, put it onto the roadway.

Though I think that's like maybe possible, uh, but it would like really give you this like base policy that you could then start to put on the roadway and start to build this flywheel, um, that you can then use to collect, you know, more and more experience, validate the safety.

You know, like if you're, you know, if you're a, um, uh, automotive manufacturer that doesn't have like a full spun up self-driving team, but you have a pretty good perception stack, like this is something that you can use to just like get something out in the world pretty fast.

Cause like three, I think like two, two, three days of training later, you have something that I think, and we'd like to start testing it, uh, can be like straight up put onto the roadway with humans driven around and things will be like pretty okay.

Um, you know, don't take the safety driver out, but like, yeah.

Nathan Lambert [00:28:24]: And you have some cred saying this, given that you've done RL experiments with real cars; this is not something that's, um, ripping off the bandaid for the first time.

You've done different types of self-driving experiments with RL policies in the real world.

I don't, it might not be at the same level of the stack, but I can add links to that.

Eugene Vinitsky [00:28:42]: That was a lot more constrained, right?

We were putting these cars on the highway to like smooth traffic.

So they would drive in a way such that like stop and go waves and traffic would like get smoothed out and disappear.

Um, but there it was just like, you know, stay in your lane, follow the car behind you.

Here we're talking about like, you know, complicated interactions at intersections and that type of thing.

So a lot, a lot more like safe, everything there is safety critical, but like significantly less constrained than anything we've done in the past.

Nathan Lambert [00:29:08]: And then to kind of keep leading this on, uh, I will say a bunch of things because you're more of an industry insider.

So it makes it less revealing if I say things, cause I don't really know anything.

Um, back when I was interviewing for a job around 2021, at least a lot of RL people were interviewing with self-driving companies who were doing extensive research in RL for different parts of this behavior stack.

Um, even at that time, four years ago, perception, or like sensing and perception, was largely solved.

At least CV stacks are really mature and figuring out the actual driving component and decision making was really hard.

There was, I mean, I did a Tesla take-home for their self-driving team, and they were hiring other RL people, and that take-home was ridiculous.

Eugene Vinitsky [00:29:54]: I was like, yeah, I remember that.

Nathan Lambert [00:29:56]: Freaking intersection of polygons.

It's four years ago.

They've got to be using a different question, but it was so hard.

Um, I did end up solving the test cases.

Um, it was, I solved the test cases.

God, that was rough.

But essentially the rumor was they're doing something like MuZero for self-driving and/or a mix of imitation learning, which is there's a duality of learning a world model from real data relative to building a simulator.

But the motivation of the work is very similar, which is in MuZero, you want to unroll trajectories and be able to learn from that and distill an RL policy versus if you have a big simulator, you then can learn everything from scratch and figure out how to transfer that to real.

And I think there's different assumptions on what would work.

And in the history of RL, it is now that simulator-to-real is generally a more promising path, if you can build the right simulator, than going from real to enhancing real with, with RL alone.

Um, Cruise was building a research team.

And one of the best engineers I talked to was trying to build a world model or like a simulator and do this like AlphaGo for self-driving.

I think that was a phrase from the interviews four years ago.

So a lot of this, and Waymo is now obviously winning.

I think Waymo, I don't know exactly what they're doing.

I think their stack is actually probably the most complicated, um, where they probably were looking at behavior, like all sorts of RL inspired things for very specific parts of the stack to, to improve behavior.

But it's funny that looking back four years ago, this was something that the spectrum of ideas that industry was looking at was actually very related to this.

And in the same time, the self-driving industry has changed a lot.

Uh, so what do you think of this whole industry of self-driving relative to, you have a lot of experience here.

I mean, I'm, I'm a big Waymo fan now, but there's just like, it's so funny how these things evolve.

And I think after this, later on, we'll talk about the, like, this is the RL specific trajectory with simulation, simulated results and stuff too.

Eugene Vinitsky [00:31:57]: I mean, we were interviewing at the same time.

So I was also interviewing with all of these self-driving companies when you were, uh, and, and it, it did seem like it was the place that was the most friendly to doing RL type research at the time.

Um, I think now almost everyone has gone all in on this like imitation learning type approach, um, that are like, this is a huge fraction of what people are doing.

I think a lot of the RL teams have been spun down, uh, which I think is unfortunate a little bit because I think what this work shows is that, uh, it may be wrong to do so that there is a lot of, a lot of value still in RL for this last piece of, of the, of the puzzle.

Um, you know, um, you know, one thing we have here is, uh, an insanely robust policy, right?

So like just an end-to-end neural network in sim, it crashes once in a million miles,

Nathan Lambert [00:32:46]: um, crashes at all.

Eugene Vinitsky [00:32:49]: Yeah.

Nathan Lambert [00:32:50]: And you, but what was the number you said before for miles per crash?

Eugene Vinitsky [00:32:53]: Uh, humans are between 20 and a hundred K, um, somewhere, somewhere like that.

It's a little hard to get estimates because it varies from place to place a lot.
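[Editor's sketch, not part of the conversation: the two figures quoted above can be put side by side with simple arithmetic. The constants are taken directly from the spoken numbers (one crash per million miles in simulation, and one crash per 20,000 to 100,000 miles for humans). Comparing a simulated rate against a real-world rate is not like-for-like, so the ratio is only a rough illustration.]

```python
# Figures as quoted in the conversation; the human range is a loose estimate.
SIM_POLICY_MILES_PER_CRASH = 1_000_000
HUMAN_MILES_PER_CRASH_LOW = 20_000
HUMAN_MILES_PER_CRASH_HIGH = 100_000

# How many times farther the simulated policy goes between crashes.
ratio_vs_low = SIM_POLICY_MILES_PER_CRASH / HUMAN_MILES_PER_CRASH_LOW
ratio_vs_high = SIM_POLICY_MILES_PER_CRASH / HUMAN_MILES_PER_CRASH_HIGH

print(f"Between {ratio_vs_high:.0f}x and {ratio_vs_low:.0f}x more miles per crash")
```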

So, I mean, a lot of industries are pretty excited about this, like AlphaZero for self-driving type thing.

And the question, you know, becomes, as you said, like, what is the simulator that we do this in?

And so one perspective that's very prominent is like, let's collect a lot of data.

Let's build the world model and then let's unroll in that simulator.

And then the challenge becomes like, who do you unroll in that simulator?

Now your world model has to build into itself a model of the other agents, right?

If you kind of take the single agent perspective, I'm going to unroll the world model.

I'm going to place a car inside of it.

And that's the car I'm going to train with RL.

And now what happens.

Nathan Lambert [00:33:40]: This was a big problem for self-driving because you have like a dynamic number of, um, objects in the scene that you're supposed to reason about with your world model.

How does the policy that you train handle this kind of agents coming in and out?

Now, is it all just that you have some, like, are you identifying entities as nearby as other cars are nearby or is there some abstraction or is that the perception stack handles that?

Eugene Vinitsky [00:34:04]: Yeah, exactly.

We roughly simulate a sensor in the sense that you only see cars in some radius of yourself.

Um, but, but we don't, we don't, yeah.

I mean, all the cars are there persistently in the simulator driving around and we, we answered this riddle of like, what should the other cars do by like, they're self-play, right?

They're a copy of your policy.

They're driving around.

Um, whereas I don't know what happens in the world model, right?

Like kind of in this like world model approach, you're limited by how capable the world model is at simulating the behavior of other actors.

And if your world model has actually learned a robust model of human driving for all the other agents in the simulator, then like, you don't even need, you don't really need to do RL because like the world model already has a model of how humans should behave in a simulator at human level, but they don't.

Um, so yeah.
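[Editor's sketch, not part of the conversation: a minimal illustration of the two ideas described above, namely that each agent only observes others within some radius, and that every other car is driven by a copy of the same policy. The radius value, the 2D positions, and the placeholder `shared_policy` are all hypothetical; a trained network would replace the placeholder.]

```python
import numpy as np

def visible_neighbors(positions, ego_index, radius):
    """Indices of the other agents within `radius` of the ego agent."""
    dists = np.linalg.norm(positions - positions[ego_index], axis=1)
    mask = dists <= radius
    mask[ego_index] = False  # an agent does not observe itself
    return np.flatnonzero(mask)

def shared_policy(ego_pos, neighbor_positions):
    """Placeholder policy: unit vector pointing away from visible neighbors."""
    if len(neighbor_positions) == 0:
        return np.zeros(2)
    away = ego_pos - neighbor_positions.mean(axis=0)
    norm = np.linalg.norm(away)
    return away / norm if norm > 0 else np.zeros(2)

# Hypothetical scene: four agents on a 2D plane, sensing radius of 10 units.
positions = np.array([[0.0, 0.0], [3.0, 4.0], [30.0, 0.0], [1.0, 0.0]])
RADIUS = 10.0

# Self-play: every agent is controlled by the same policy function,
# each applied to that agent's own radius-limited view of the scene.
actions = np.stack([
    shared_policy(positions[i], positions[visible_neighbors(positions, i, RADIUS)])
    for i in range(len(positions))
])
```

[All agents stay in the simulation the whole time; the radius only limits what each one observes, so the far-away agent here sees nobody and the placeholder returns a zero action for it.]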

Nathan Lambert [00:34:53]: And it's just like, it's just, it's, it's so funny that it just feels like they haven't.

And the only way that Waymo et cetera has gotten it, it seems like Waymo has adapted an autonomous stack with like some human inspiration to make the driver more smooth is what it seems like when you're in it, which is like extremely, really strong perception and world understanding with some really clever policy that is tuned to feel human, but probably not human or RL at the bottom of the day.

Eugene Vinitsky [00:35:27]: I wonder, I don't know what Waymo's planning stack actually looks like in the end, right?

Like Waymo's pretty secretive and, uh, I've never worked there.

Um, and if I had worked there, I wouldn't be able to say.

Um, but you know, I think, I think, you know, if I had to make a bet, it's some, some kind of like hand-designed cost, um, like mixing a bunch of terms together about like what a good trajectory looks like, maybe mixing with a little bit of human data to like, to make that trajectory feel like a little smooth and human-like.

Nathan Lambert [00:35:59]: And yeah, to prompt you, um, what does your, yeah, I agree with this.

What does your history of being a nerd on urban planning make you think of what is coming for self-driving cars?

Eugene Vinitsky [00:36:12]: So, so I guess the thing to mention is I'm a professor of transportation engineering, uh, among other things.

So I'm, um, required to have some thoughts on this.

Um, I think that, you know, self-driving cars are, are coming.

Um, I don't know if they're, they're coming a year from now to who knows when the cost curve gets driven down.

Nathan Lambert [00:36:32]: Where we live, they're more likely to come sooner given tech hubs and, um, people are willing to pay very high premiums.

Eugene Vinitsky [00:36:39]: That's true.

So like, like a lot of goods, they may come for, for wealthy folks first.

And then that allows the cost scaling to come down over time.

Um, and it really is a magical experience to take a Waymo, right?

Like I remember the first day I saw like the cars driving around and nobody in it.

And I actually just started chasing one of the cars cause I was so like, it was such a magical moment.

I needed to, I needed to experience it for as long as possible.

Nathan Lambert [