
Show Notes
This is the first of a handful of interviews I’m doing with teams building the best open language models in the world. In 2025, the open model ecosystem has changed incredibly. It’s more populated, far more dominated by Chinese companies, and growing. DeepSeek R1 shocked the world and now there are a handful of teams in China training exceptional models. InclusionAI, Ant Group’s leading AI lab and the team behind the Ling models, has been one of the Chinese labs releasing fantastic models at a rapid clip in the second half of the year.
This interview is primarily with Richard Bian, whose official title is Product & Growth Lead, Ant Ling & InclusionAI (on LinkedIn, X), and who previously led AntOSS (Ant Group’s open source software division). Richard spent a substantial portion of his career working in the United States, with time at Square, Microsoft, and an MBA from Berkeley Haas, before returning to China to work at Ant.
Also joining are two leads of the Ant Ling technical team, Chen Liang (Algorithm Engineer), and Ziqi Liu (Research Lead).
This interview focuses on many topics of the open language models, such as:
* Why is Ant Group, known for the popular fintech app Alipay, investing so much in catching up to the frontier of AI?
* What does it take to rapidly gain the ability to train excellent models?
* What decisions does one make when deciding a modeling strategy? Text-only or multimodal? What size of models?…
* How does the Chinese AI ecosystem prioritize different directions than the West?
And many more topics. Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Some more references & links:
* InclusionAI’s homepage, highlighting their mission.
* AntLingAGI on X (models, research, etc.), InclusionAI on X (overall initiative), InclusionAI GitHub, or their Discord community.
* Ling 1T was highlighted in “Our Picks” for our last open model roundup in October.
* Another interview with Richard at State of Open Conference 2025.
* Over the last few months, our coverage of the Chinese ecosystem has taken off, such as our initial ranking of 19 open Chinese AI labs (before a lot of the models we discuss below), model roundups, and tracking the trajectory of China’s ecosystem.
An overview of Ant Ling & Inclusion AI
As important context for the interview, we wanted to present an overview of InclusionAI, Ant’s models, and other efforts that emerged onto the scene just in the last 6-9 months. To start — branding.
Here are a few screenshots of InclusionAI’s new website. It starts with fairly standard “open-source AI lab messaging.”
Then I was struck by the very distinct messaging, which is surprisingly rare in the intense geopolitical era of AI: saying AI is shared for humanity.
I expect a lot of very useful and practical messaging from Chinese open-source labs. They realize that Western companies likely won’t pay for their services, so having open models is their only open door to meaningful adoption and influence.
Main models (Ling, Ring, & Ming)
The main model series is the Ling series, their reasoning models are called Ring, and their multimodal versions are called Ming. The first public release was Ling Plus, a 293B sparse MoE, in April. They released the paper for their reasoning model in June and have continued to build on their MoE-first approach.
Since then, the pace has picked up significantly. Ling 1.5 came in July.
Ling (and Ring) 2.0 came in September of this year, with a 16B total, 2B active mini model, a 100B total, 6B active flash model, and a big 1T total parameter, 50B active primary model. This 1T model was accompanied by a substantial tech report on the challenges of scaling RL to frontier scale models. The rapid pace at which Chinese companies have built this knowledge (and shared it clearly) is impressive, and it is worth considering what it means for the future.
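To make the total versus active parameter counts concrete, here is a small sketch that computes the activation ratio (active parameters divided by total parameters) for the three Ling 2.0 sizes quoted above. The dictionary and function names are just illustrative, and the counts are the rounded figures from this post rather than exact values.

```python
# Total and active parameter counts (in billions), rounded as quoted in this post.
LING_2_MODELS = {
    "mini": (16, 2),
    "flash": (100, 6),
    "1T": (1000, 50),
}


def activation_ratio(total_b: float, active_b: float) -> float:
    """Fraction of the model's parameters that are used for each token."""
    return active_b / total_b


for name, (total, active) in LING_2_MODELS.items():
    ratio = activation_ratio(total, active)
    print(f"{name}: {ratio:.1%} of parameters active per token")
# mini: 12.5% of parameters active per token
# flash: 6.0% of parameters active per token
# 1T: 5.0% of parameters active per token
```

The larger models are sparser: the share of parameters touched per token falls as the total size grows.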
Eval scores obviously aren’t everything, but they’re the first step to building meaningful adoption. Otherwise, you can also check out their linear attention model (paper, similar to Qwen-Next), some intermediate training checkpoints, or multimodal models.
Experiments, software, & other
InclusionAI has a lot of projects going in the open source space. Here are some more highlights:
* Language diffusion models: MoEs, sizes similar to Ling 2.0 mini and flash (so they likely used those as base). Previous versions exist.
* Agent-based models/fine-tunes, Deep Research models, computer-use agentic models.
* GroveMoE, MoE arch experiments.
* RL infra demonstrations (Interestingly, those are dense models)
* AWorld: Training + general framework for agents (RL version, paper)
* AReal: RL training suite
Chapters
* 00:00:00 A frontier lab contender in 8 months
* 00:07:51 Defining AGI with metaphor
* 00:20:16 How the lab was born
* 00:23:30 Pre-training paradigms
* 00:40:25 Post training at Inclusion
* 00:48:15 The Chinese model landscape
* 00:53:59 Gaps in the open source ecosystem today
* 00:59:47 Why China is winning the open race
* 01:11:12 A metaphor for our moment in LLMs
Transcript
A frontier lab contender in 8 months
Nathan Lambert (00:05)
Hey everybody. I’m excited to start a bit of a new series when I’m talking to a lot more people who are building open models. Historically, I’ve obviously talked to people I work with, but there’s a lot of news that has happened in 2025 and I’m excited to be with one of the teams, a mix of product, which is Richard Bian and some technical members from the Ant Ling team as well, which is Chen Liang and Ziqi Liu. But really this is going to be a podcast where we talk about how you’re all building models, why you do this. It’ll talk about different perspectives between US, China and a lot of us going towards a similar goal. I was connected first with Richard, who’s also talked to other people that helped with Interconnects. So we can start there and go through and just kind of talk about what you do. And we’ll roll through the story of building models and why we do this.
Richard Bian (01:07)
Hi. Again, thanks so much, Nathan. Thanks so much for having us. My name is Richard Bian. I’m currently leading the product and growth team of Ant Ling, which is part of the Inclusion AI lab of Ant Group. So Ant Group is the parent company of Alipay, which might be a product which many, many more people know about. But the group has been there for quite some time. It used to be a part of Alibaba, but now it’s a separate company since 2020. I actually have a pretty mixed background. Before I joined the Ling team, I’ve been doing Ant open source for four years. In fact, I built Ant open source from a technical strategy, which is basically a one-liner from our current CTO all the way into a full-fledged multifunctional team of eight people in four years. So it has been a pretty rewarding journey. And before that, my last life, I’ve been spending 11 years in the States working as a software engineer with Microsoft and with Square. Again, it was a pretty rewarding past. I returned back to China during COVID to be close with my family. It was a conscious decision. So far so good. It has been a pretty rewarding journey. And I really love how Nathan you name your column as Interconnects and you actually echoed when you just began the conversation just now. I found that to be a very noble initiative. So very honored to be here.
Nathan Lambert (02:48)
Hopefully first of many, but I think you all have been doing very interesting stuff in the last few weeks, or last few months, so it’s very warranted. And do you two want to introduce yourselves as well?
Chen Liang (02:58)
Me first. My name is Chen Liang and I’m the algorithm engineer of Ling Team, and I’m mainly responsible for the floating point 8 training during the pre-training. Thank you.
Ziqi Liu (03:16)
My name is Ziqi Liu and I graduated, a PhD from Jiao Tong University in China. And I’ve been working at Ant Group for about eight years. And currently I’m working on the Ling language model. That’s it.
Nathan Lambert (03:45)
Nice. I think the way this will flow is I’m going to probably transition. It’ll start more with Richard’s direction. Then as we go, it’ll get more technical. And please jump in. I think that we don’t want to segment this. I mean, the border between product growth, technical modeling, whatever, that’s why AI is fun is because it’s small. But I would like to know how Inclusion AI started and all these initiatives. I don’t know if there’s a link to Ant OSS. I found that in prep and I thought that was pretty interesting and just kind of like, how does the birth of a new language modeling lab go from idea to releasing one trillion parameter models? So like, what does that feel like on the ground?
Richard Bian (04:18)
There’s actually one additional suffix for that in eight months’ time. In fact, we kind of began all of this initiative in February this year. So just to begin with for the audience who probably didn’t know much about Inclusion AI, Inclusion AI basically envisions AGI as a humanity’s shared milestone, not a privileged asset. So we started this initiative back in the February of 2025, inspired by the DeepSeek Research Lab. So the DeepSeek Research Lab and their publication, in fact, motivated a lot of people. I believe not only in China, but globally. Taking one step more closer to the AGI initiative by showing it’s probably not an exclusive game for only the richest people who can afford the best hardware and the best talent. So the way we’re kind of looking at it is like why we named that Inclusion is because we actually have that gene with the company. So the decision was actually made, of course, the decision was made beyond my pay grade, but it was actually very well informed internally for the mission and vision that we want to be more like DeepSeek, which is a research lab with a dedicated effort of pursuing AGI. In fact, I mean, if you kind of think about Ant Group with our business model, like we’re a Fintech company, to some extent, very similar to a combination of Square, Stripe, and many other companies in the States, we have a very broad range of businesses which focus not only on the financial vertical, but on medical insurances and the technical services as well. So a lot of those businesses. In order for us to actually be able to support those businesses, I would say long-term success in the next five to 10 years is going to be critically important for us to be able to really focus on the fundamentals of AI. And we feel that the language model is a key to that door. We cannot give up on that initiative.
Nathan Lambert (06:52)
There’s a lot here and I agree with this. And I think that it’s like, the Ant Group is a big large tech company. And I think large tech companies being able to train AI as like most of the audience here is going to be like, yes, they definitely should be doing this. It’s a transformative technology. I think the two things to double click on are, we’re going to have to define like what you think of as AGI and why you’re pursuing this. Because it has to go deeper than like a term that we are doing. I know like DeepSeek is very ideological in their pursuit of intelligence. So I think it’s good to do that. And then I will also double click on the question of like, why open models and like, because DeepSeek is doing like open and as strong as they can, they’re text only. We’ll talk about this later. But it’s like, let’s do each of these individually to kind of ground the motivation.
Defining AGI with metaphor
Richard Bian (07:51)
Sure. I guess, I mean, for AGI, the way we are looking at it is like, I don’t think there’s a definitive answer to that. I mean, if we kind of search Google or any other search engines, it will give you a line, which means something. But it doesn’t mean anything, honestly, to me personally, just by looking at the definition. I would probably use a metaphor. People are probably very familiar with the navigation era. It’s a glorious navigation era back in the 1400s. Now, I think it feels more like all the ships are just leaving Lisbon last year, or maybe like two years ago.
Nathan Lambert (08:18)
I like it. I agree with this more than most of the definitions, because a lot of the definitions are grounded in like work or something.
Richard Bian (08:26)
The one I’m kind of looking at is like, all the ships are leaving Lisbon. Some of them are heading west, knowing for a fact that, hey, India is over there. But now we all know the truth that India is on the east side. But it doesn’t matter. It’s the whole American continent. So the way I’m kind of looking into the definition of AGI right now is like, I personally have a very firm belief that human intelligence and machine intelligence, to some extent, have their similarities. Humans are trying to, to some extent, explore the limit of human intelligence with the help from the machines. So when everything was beginning, we were kind of using all of this as a co-pilot mode. But moving forward, there are all of these theories indicating that there might be an intrinsic point that the machine intelligence, it goes all the way back from the tooling time. They believe that machine intelligence might, at one point, exceed human intelligence. So I guess we’re looking to that pivoting point. Before we reach there, honestly, I don’t know where we’re going and how long we can go towards that particular direction. But clearly, there are some common consensus right now, including maybe MoE (Mixture of Experts) as architecture, including the pre-training, even to some extent, we’re seeing a diminishing return. But pre-training is still pretty important. And reinforcement learning, to some extent, is probably another general agreement that this might not be wrong. We don’t know if this is right, but it might not be wrong. So there are all of these exploratory directions that we believe in. So we’re just kind of sailing there and see how that goes.
Nathan Lambert (10:20)
I love this. And I think the crucial question is for Chen or Ziqi is like, the team like, how do you build team alignment around this? Is this something that you feel like you walk into the office or get on a call and everybody’s in agreement? Or is this like a vision that you’re still building or trying to sell? Like, to what extent you could say, because I think there’s a big difference between like, I buy the vision for Inclusion AI, but it’s like, how real is this when you’re across the org?
Richard Bian (10:49)
I can maybe share my feeling and Ziqi and Chen can chime in. Of course, at the very beginning, there’s skepticism. It’s by human nature, right? So the way we’re looking at it is like, I think DeepSeek gives a very clear indication that this might be working. There has been this hazy, chaotic era of 2024, which nobody has the tools to navigate. So people are very cautious about sailing. You see ships going out and came back crippled, and you begin to worry about what’s going on there.
Nathan Lambert (11:34)
I think there’s a big difference between the US because I think in the US everybody was bought in. And I’ve talked to a few more labs in China and it’s like there’s so much emotional energy on the DeepSeek moment in China that I think in the US people forget about it where it’s like, I could see this in the sequence of releases as well because it’s like everybody had a few months after DeepSeek like all these labs in China have started releasing models and I just think that it’s good to have you say this, is a shared sense of people so people can internalize like how much has been mobilized. And that’s kind of a culturally salient point.
Richard Bian (12:04)
It’s motivating. To some extent, there was this very famous navigator called Zheng He back in the Ming dynasty. So I think basically when Zheng He was able to pretty much pull through the trip all the way to India from China, people began realizing that, hey, not only the Portuguese can do this kind of long journey sailing, the Chinese can do that too. And we’re exploring different parts of the map. You know, toward the end of the day, nobody knows the whole picture. So the way I’m kind of looking at it is like, first, I’m very bought into the mission to some extent that it kind of feels like, you know, even though we begin sailing late, but we do have our own kind of taste to this game. So we will be able to contribute. And you did ask about the question, you know, like why we chose to be open, right? To some extent, I cannot really believe that open is a choice, just like how the leaders in this game are not the most open player in the game, right? But if you’re kind of thinking about playing poker, the trick leader has their own strategy, which is all understandable. For us, because we’re joining the game at this stage, I guess the best strategy would kind of feel like, A, really trying to follow suit to the right direction to minimize the mistakes we’re making at this moment because we’re so late. Second, stay open and stay polished. So keep a very open mind about what’s going on in the surroundings. And that’s probably the best we can do. That’s my two cents.
Nathan Lambert (13:51)
To provide some color and I’ll have a whole note in the page that I release with this for people listening. The first Ling model, which is like their text only model, very, you could see iterations from DeepSeek and the architecture was in April and then a big updated Ling 1.5 in July. And then in September or recently was Ling 2.0, which also came with a multimodal Ming and a reasoning Ring model. And I think like by this September release is when like me and a couple of people that work at Interconnects were like, Holy crap, like this is a, this is like very much a real deal model. And to kind of ramp in that period of time is not easy. Like there’s a lot of companies in the US that are trying to do this right now. A few companies in China have shown that they can do this. And it’s like, I guess if you want to explain this kind of Ling, Ring, Ming series of models and like if this is a clear strategy behind this or if this is what works like, how did you evolve through the first models through the summer to today to kind of get to this point?
Richard Bian (14:56)
Sure. So I mean, first and foremost, I think the foundation model is really important. To some extent, I’ve been working with many people on the system side, because Ant Group has a very solid cloud-native infrastructure team. So the team has been, when we talk about this, we’re kind of beginning using the metaphor. The model is really like an operating system. It’s not like the operating system itself, but it’s more like the kernel. Right, so only a few people can actually write kernel code, even nowadays. Just like how there’s the most talented people who can actually work on the model team right now. We feel that it’s not only a key leading to the technical future, but it’s also a key leading to the user experience future. Because we do see the, I personally believe in the trend of technology brings in new interactions which will lead to new product, which will lead to new business models, which will lead to potentially new organization structure, rinse and repeat. So we kind of like really choose to do the fundamental model of the Ling series because of that. And the Ring series is an obvious next, given the relationship between V3 and R1. It definitely indicates about how we can potentially take a very polished, well, actually, a very intelligent individual, unpolished, and put some reinforcement learning on it to make it a much better individual in one clear vertical direction. We’re going to be touching on some of those kind of technical aspects in our conversation next. But that has been a very clear direction.
Nathan Lambert (16:48)
Do you see this evolving with kind of feedback from within Ant Group, which is like, you’ve also released this diffusion language model. A diffusion language model is very interesting. I’m going to just go out on a little bit of a side rant because I’ve heard, I was talking to people about these and it’s like very hit or miss with me, whether or not I think they’re going to be big. Because we see that tool use and reasoning is a big thing. So the whole idea of a diffusion language model is you generate a very long sequence at once and that could save on costs because you don’t have this kind of quadratic memory increase and you do very long sequences. So I saw that I was optimistic. And then you see the idea of tool use, which is like, you have to be able to chop up the reasoning. And I was like, I’m really bearish on diffusion models for language again, because you have to be able to search and execute code. But then I was hearing that in like user facing products, like code diffs, where if you’re generating a website and you did take a prompt and go to a huge diff on a code base really fast, then language diffusion is actually really nice. And the motivation of the question is like, do you have this feedback loop in your modeling where Ant Group is trying to use these things for products and might like have a bit of a feedback of like this latency isn’t fast enough or like this area you need to move it to, or is this kind of like a separate play of just build the best models you can and figure it out later?
Richard Bian (18:12)
That’s a very perfect question. We use this metaphor that we’re probably also doing this reinforcement learning in real life by trial and error. Almost kind of feels like, so I think Nathan, you nailed a very good question. And there are some very clear consensus about coding agents, tool use and people kind of going down a path and pursuing their own business models and begin making revenues. So that’s one type of usage patterns for language models. We do that and we see some very clear, I would say feedback loops in that direction. So that’s one pillar. And the second pillar is about the not so clear aspect. By saying the not so clear aspect, it’s like, I believe everyone in the Silicon Valley and in Seattle is still scratching their heads trying to understand about, hey, when can I break even with all this investment? Are we really generating enough user values kind of back to, I’m a product person. So all of those kinds of words keep coming back into my head. And, you know, at this moment, consciously speaking, it’s very hard to come to the conclusion that, you know, all of this is valuable enough for the end user. But, you know, we’re trying to explore the directions for that. I would say a lot of the, you know, generating the whole website, you know, what Labo did, it’s an interesting form of product. But at this moment, we don’t know if it’s A, sustainable as a business model, B, if this is the best type of product we can offer to the user. So all of those are iterative. Within company, we do have some of those explorative products that use our models, not only the Ring model, but Ming as well, like the multimodal. And you mentioned about the, so that’s the second pillar. And the latter is more like the last pillar, because Ant Group does have a research institution called Ant Research. So the model is a joint collaboration between the research and the Ling Team.
How the lab was born
Nathan Lambert (20:16)
I guess there’s another like org chart question, which is like, where in the structure of the big tech company that is Ant did this Inclusion AI slash Ling and all of this grow? Like, is this within cloud that there’s a new modeling or research org or is it kind of separate? Like, do you feel like this is a part of the bigger company or are you kind of insulated from this?
Richard Bian (20:42)
You can actually search on Google and find information about Ant Research which is a joint research lab focusing more on a lot of these frontier technologies like graph, deep learning, reinforcement learning, before all of this. So that’s the background of Ant Research. And second, when we begin forming the AGI initiative of Inclusion AI, we begin getting very serious. So we begin putting all of these resources together to some extent physically, but more from the organizational ways of saying that all of these teams of financial models and research lab institution and the user experience expert focusing on exploratively looking into the next big application that people will actually use. So all of this, we kind of began forming this internal, I wouldn’t call that organization, but more like this internal initiative directly driven by our CTO. So it’s very serious effort. It’s very serious to the extent that, you know, it feels more like when the team actually formed the original DeepSeek initiative. So all of these people, you do nothing else but only focusing on this and this is the only important thing for this.
Nathan Lambert (22:01)
It’s like so much of this is that the mystique I feel like is that in the West, we don’t get what would normally be gossip of what is happening in the Chinese tech ecosystem, which I don’t think this is hard to see if you have friends that work at Ant Group, because it’s probably you’re moving hundreds of people’s jobs around and people talk. Whereas like in my circles, it’s like, Meta is doing another reorg. And then you hear about it in the news a few days later. So it’s just like, I don’t know. That’s my reflection hearing all of this. And I’m mostly learning that all of these orgs end up similar in size. And then you have to prioritize resources per researcher and all of these normal things. I’m going to start transitioning into this section we had prepped on actual modeling things, which is mostly on pre-training, which is fun. I think that state of affairs on my pre-training knowledge from AI2 is that we’ve scaled, done plenty of dense models and some architecture things from up to like 32B, some experiments at 70B that one didn’t work out. MoE is work in progress. So I’m personally very interested in architectural decisions that enable MoEs and long context.
Pre-training paradigms
I think the kind of basic thing is just like, if you’re pre-training, I mean, this is for Ziqi is like, what does your, how do you feel like your trajectory is as a researcher as you’re going through these months? This could be just like, what does your work feel like when you’re trying to boot up like a DeepSeek style, very ambitious lab building new infrastructure and getting models off the ground. And then we’ll kind of go into some more specific discussions around like Ling 1T later and stuff like this. But it’s like, how is building this?
Ziqi Liu (23:45)
Our architecture indeed refers to OpenAI’s scaling law or DeepSeek’s scaling law. They really do a good job. In our Ling scaling law, the non-embedding training FLOPs play the central role of our scaling law. So we set up our own framework that provides foundation for a standardized experimental pipeline. So there are many questions when we start conducting scaling law under the MoE architecture. So the first question is, can we find simple rules for finding optimal hyperparameters with respect to training FLOPs, which are not sensitive to the structure of MoE. Similar to DeepSeek, we first discovered the optimal critical hyperparameters with respect to training FLOPs and the MoE architecture. We find those optimal hyperparameters are not that sensitive to the structure of MoE, like the activation ratio and something others in a mild condition, but more related to the training FLOPs. So this is our first finding. And then we found activation ratio is critical and can consistently improve if we reduce activation ratio.
Nathan Lambert (25:14)
Can you say more about this? I mean, most of pre-training is a lot of different things, which you’re accumulating FLOP efficiency while getting model performance. And then it’s like Chen, you also were saying you focused on FP8 stability, FP8 and training stability in general. So I’m kind of curious of like any major, like, what is your biggest impressions of focusing on kind of this narrow thing in pre-training, which is getting more memory by using lower precision while maintaining stability. So if you have any like high level takes on pre-training stability at that precision, then I’ll zoom into more specific questions on scaling up from there.
Chen Liang (26:00)
At first we heard about the floating point 8 from DeepSeek. They used floating point 8 training through the training of DeepSeek. And we also tried the recipe of them, the block-wise INT8 in the Megatron. And we find that actually the MFU (Model FLOPs Utilization) is not very high. And sometimes it’s even slower than the BF16 (bfloat16) training. And we find that the main costs are the quantization and dequantization. So actually, the floating point 8 is not as fast as they claimed, actually. And we profile the whole training data and try to minimize the quantization and dequantization process.
Nathan Lambert (26:50)
What is getting quantized and dequantized?
Chen Liang (26:53)
If you want to try the floating point 8 training, it’s actually due to GEMM (General Matrix Multiply) in the linear layers. And you want to quantize the weights and the inputs to FP8 (E4M3) type. But the other structure, they compute in the BF16, BFloat16 type. So when you get into the linear layer, you need to quantize it to the floating point 8, and then do the GEMM. And the GEMM output is the BFloat16. So this is the way you need to quantize and dequantize to adapt the other structure.
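Editor’s sketch: the quantize, GEMM, dequantize round trip described here can be imitated in plain Python. This is only an illustration under stated assumptions, not the Ling team’s kernels: the helper names (`round_to_e4m3`, `quantize`, `fp8_linear`) are hypothetical, scaling is done per vector for simplicity, and ordinary Python floats stand in for the BF16 tensors on either side of the linear layer.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)        # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)          # clamp at the smallest normal exponent
    step = 2.0 ** (e - 3)         # 3 mantissa bits: 8 steps per power of two
    return sign * min(round(a / step) * step, E4M3_MAX)


def quantize(values):
    """Scale a vector so its largest magnitude maps to E4M3_MAX, then round."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) for v in values], scale


def fp8_linear(x, weight_rows):
    """Compute y = W x with FP8 operands and a high-precision output."""
    xq, sx = quantize(x)                              # quantize the input
    out = []
    for row in weight_rows:
        wq, sw = quantize(row)                        # quantize the weights
        acc = sum(a * b for a, b in zip(xq, wq))      # GEMM, wide accumulator
        out.append(acc / (sx * sw))                   # dequantize the result
    return out


x = [0.5, -1.25, 2.0]
W = [[0.1, 0.2, 0.3], [1.0, -1.0, 0.5]]
print(fp8_linear(x, W))  # close to the exact result [0.4, 2.75]
```

Every call to `quantize` and every division by the scales is extra work that a BF16 linear layer never does, which is the overhead Chen says can cancel out the speedup of the FP8 matrix multiply itself.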
Nathan Lambert (27:43)
And then what does your work actually look like in getting this? So you find it to be not as fast. Like, what do you actually do to change this?
Chen Liang (27:50)
In the MoE layer, it’s got the FC1 (Fully Connected 1) and FC2 (Fully Connected 2), right? And in the middle of them, they’ve got the switch gated function. So FC1, switch gated function and FC2. And the output of FC1 is the BFloat16. And we fuse the operation of the switch gated function and the quantization function. So we fuse them, the two operations, into one. And so it saves some time. And the MoE layer is a batched operation. So you need to actually do the activation function on all the experts. So that’s a lot of time.
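Editor’s sketch: a rough picture of what fusing the activation with the quantization step buys. It assumes the “switch gated function” in the transcript is a SwiGLU-style gated activation (SiLU of a gate projection multiplied by an up projection) and that FP8 scales are computed per block; both are assumptions, and all function names are hypothetical. The unfused version writes the whole activation buffer and then reads it back to quantize, while the fused version produces FP8 values block by block without ever storing the full intermediate.

```python
import math

E4M3_MAX = 448.0  # largest finite magnitude in the FP8 E4M3 format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)
    step = 2.0 ** (max(exp - 1, -6) - 3)
    return sign * min(round(a / step) * step, E4M3_MAX)


def silu(v: float) -> float:
    return v / (1.0 + math.exp(-v))


def quantize_block(values):
    """Block-wise scaling: one scale per block of activations."""
    amax = max(abs(v) for v in values) or 1.0
    scale = E4M3_MAX / amax
    return [round_to_e4m3(v * scale) for v in values], scale


def swiglu_then_quantize(gate, up, block=4):
    """Unfused: pass 1 stores every activation, pass 2 re-reads and quantizes."""
    act = [silu(g) * u for g, u in zip(gate, up)]     # full intermediate buffer
    q, scales = [], []
    for i in range(0, len(act), block):
        qb, s = quantize_block(act[i:i + block])
        q.extend(qb)
        scales.append(s)
    return q, scales


def fused_swiglu_quantize(gate, up, block=4):
    """Fused: activation and quantization happen in a single pass per block."""
    q, scales = [], []
    for i in range(0, len(gate), block):
        # Activation values live only for the duration of this block.
        act = [silu(g) * u for g, u in zip(gate[i:i + block], up[i:i + block])]
        qb, s = quantize_block(act)
        q.extend(qb)
        scales.append(s)
    return q, scales


gate = [0.5, -1.0, 2.0, 3.0, -0.2, 0.1, 4.0, -3.0]
up = [1.0, 2.0, -1.0, 0.5, 3.0, -2.0, 0.25, 1.0]
assert fused_swiglu_quantize(gate, up) == swiglu_then_quantize(gate, up)
```

The two functions return identical values; the gain from fusion is in memory traffic and kernel launches, not in the numbers, and it is multiplied across every expert in the MoE layer.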
Nathan Lambert (28:52)
For people listening, FC is fully connected, which is just the standard neural network layer. So I might be being silly, but generally the idea with MoEs is that you have the feed forward layers, take up the most parameters and you get more efficient by adding MoEs. And within the MoE, kind of gated to each expert, is it actually standard that it’s like fully connected, MoE gate, fully connected? And it’s kind of alternating because I know this normally like attention block, MoE block is like the higher level of abstraction. And it’s this fully connected, MoE gating and then fully connected, is that actually industry standard? And I just had like a lapse in my brain.
Chen Liang (29:37)
This structure is conventional actually. Some experiments have explained that the switch gated can make your gradient stable during training. So it’s actually a standard architecture.
Nathan Lambert (29:51)
When you’re actually experimenting on this, is this the sort of thing that when you’re doing it at your like first models were about 300B total and you had smaller models? Like, is this a sort of thing done where you get this performance at every scale? Or do you have to revisit this when you’re doing something like Ling 1T, which is this latest model with way more parameters? Because I think the root of my question is like, are the numerical problems you get from scaling like whack-a-mole, where it’s like an old problem that you fixed becomes a problem again? Or is it an entirely new type of thing that comes up when you’re going to big models?
Chen Liang (30:26)
We ran the experiments at the 100-billion-parameter scale first. We can also learn about these situations at that size, not just at 1T.
Nathan Lambert (30:43)
And I remember reading, I saw that you guys did QK norm for this as well. Is this just something you also found to be standard and to work for you? Because we’ve had some issues with long context and QK norm kind of hurting performance there. We still have some ablations to track down.
Chen Liang (30:47)
We actually ran the QK norm experiment in BFloat16, and the result is that the loss is better than if you don’t apply QK norm. And the one big thing is that when you do FP8 (8-bit floating point) training, if you do not apply QK norm before the rotary embedding, the gradient of the linear QKV may underflow. Most of the time it underflows without the QK norm. So if you want to do FP8 training, you need to add the QK norm to avoid the quantization error. The quantization error is propagated from the last layer to the first, so if the last layer has more quantization error, by the time it reaches the first layer the error has been amplified.
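As a rough illustration of the ordering discussed here (QK norm first, rotary embedding second), the sketch below normalizes a query and a key with RMSNorm and only then rotates them. It assumes RMSNorm without a learned weight, the adjacent-pair RoPE convention, and made-up input values; real implementations differ in these details. Because RoPE is a rotation, it preserves vector length, so after normalization the attention logit is bounded by the head dimension no matter how large the raw activations were.

```python
import math


def rms_norm(vec, eps=1e-6):
    """RMSNorm without a learned weight (omitted for brevity)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]


def rope(vec, position, base=10000.0):
    """Rotate adjacent pairs (x0, x1), (x2, x3), ... by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Deliberately large, made-up activations for one head of dimension 4.
q = [120.0, -80.0, 300.0, 5.0]
k = [90.0, 200.0, -150.0, 40.0]

raw_logit = dot(rope(q, 3), rope(k, 7))                         # no QK norm
normed_logit = dot(rope(rms_norm(q), 3), rope(rms_norm(k), 7))  # QK norm, then RoPE
```

Here `abs(normed_logit)` cannot exceed `len(q)`, while `raw_logit` scales with the product of the input magnitudes. Keeping that product bounded is what keeps the backward-pass values in a range that a low-precision format can represent.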
Nathan Lambert (32:07)
Let me try to talk through this because I’m mostly working on post-training and I’ve heard all these terms and I want to make sure that we’re presenting a fairly clear picture to people. So in attention, you have queries, keys, and values. And these are big matrices that store many different things. And generally with pre-training, the magnitude of the variables matters a lot because of what you’re saying about gradient flow. If you have variables that are too small, you might have no signal, and too big is another thing. And what we’re saying is that, God, I guess there’s complicated things, which is where the rotary embeddings are applied relative to the attention computation. And what we’re saying is that you have to put QK norm ahead of the rotary embeddings in this attention module, because otherwise your gradients are too small when you’re scaling this or with FP8.
Chen Liang (32:53)
During the forward process, you’ve got the QK norm and the rotary embedding, and then you go forward. But during the backward, if you do not apply QK norm, the Q times K matrix may have large values, and those large values may bring a large gradient. And when you do the quantization, you actually divide the data by the per-channel max, the max of the column. So some small values will be divided down to nearly zero. So when you dequantize, you cannot recover the real value from before the quantization.
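The underflow described above can be reproduced with a small sketch. It assumes per-column abs-max scaling and a simplified model of the FP8 E4M3 grid (3 mantissa bits, smallest subnormal 2**-9, maximum 448, no NaN handling). The helper names and the column values are hypothetical.

```python
import math

E4M3_MAX = 448.0    # largest finite E4M3 value
E4M3_MIN_EXP = -6   # exponent of the smallest normal value
MANTISSA_BITS = 3


def round_e4m3(x):
    """Round to the nearest value on a simplified E4M3 grid."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)              # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, E4M3_MIN_EXP)      # clamp into the subnormal range
    step = 2.0 ** (e - MANTISSA_BITS)   # spacing of representable values
    return sign * round(a / step) * step


def quant_dequant_column(column):
    """Scale by the column's abs-max, round to FP8, then scale back."""
    amax = max(abs(v) for v in column) or 1.0
    scale = amax / E4M3_MAX
    return [round_e4m3(v / scale) * scale for v in column]


mild = quant_dequant_column([1.0, 0.2, 0.05])             # no outlier
with_outlier = quant_dequant_column([3.0e5, 0.2, 0.05])   # one huge entry
```

In `mild`, the small entries survive with a few percent of rounding error. In `with_outlier`, the single large entry sets the scale, the small entries fall below the smallest representable value, and they come back as exactly zero after dequantization.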
Nathan Lambert (33:52)
That makes sense. I see. What are you actually looking at to figure this out? Are you looking at intermediate activation values when you’re scaling? Because training loss will only show you so much. Or are you seeing that the training loss is better or worse and then going to investigate this later?
Chen Liang (34:08)
First, the loss is not right compared to BFloat16. Then we print the quantization error in the intermediate layers and find that without QK norm, the gradient in the linear QKV is too large.
Nathan Lambert (34:34)
I think that this is very good. It gives people a sense for what the different things moving around are when you’re looking at pre-training research. And then the other side of things: if you make a change and then you have a loss spike, you’re like, okay, you have a numerical stability issue. I guess a loss spike that you can’t skip. So I’m guessing if you have a loss spike, you can skip some of them, but there are some numerical stability issues you can’t get around. This is fun. I’m going to kind of keep rolling through this. I think you’re also talking about how you have a different pipeline for training your MoE, which you described as a heterogeneous fine-grained pipeline. I would read this as matching your training architecture to your compute architecture in order to get a speedup, because with MoEs there’s the communication bottleneck. So if you want to talk about the parallelism strategies you used to get pre-training to be efficient, I think it was also really interesting because it covers multiple layers of the stack and how you design models.
Chen Liang (35:39)
It’s actually a common approach, not just for our model. The modern parallelism strategies are data parallel, tensor parallel, pipeline parallel, and context parallel. Our optimization is focused only on pipeline parallel. As you can see from the paper, we do not use TP (tensor parallel) during our pre-training. The common way to do pre-training is what they call the one-forward-one-backward (1F1B) schedule. Let’s just focus on one machine with eight cards. We call every card a stage, so we’ve got stage 0 to stage 7. Every stage does the forward and the backward: after it does the forward, it sends the forward data to the next stage, and it gets the backward data from the next stage, right?
Nathan Lambert (36:49)
So that’s like an eight step pipeline. That’s like a pipeline parallel that you’re describing.
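For readers who want to see the one-forward-one-backward pattern spelled out, here is a small sketch of the step order a single stage runs under the standard non-interleaved 1F1B schedule. The function name is hypothetical, and the sketch only lists the order of steps; it does not model timing or communication.

```python
def one_f_one_b(stage, num_stages, num_microbatches):
    """Order of forward (F) and backward (B) steps one stage runs under 1F1B."""
    warmup = min(num_stages - stage - 1, num_microbatches)
    steps = [f"F{i}" for i in range(warmup)]        # warm-up: forwards only
    for i in range(num_microbatches - warmup):      # steady state: 1F then 1B
        steps += [f"F{warmup + i}", f"B{i}"]
    for i in range(num_microbatches - warmup, num_microbatches):
        steps.append(f"B{i}")                       # cool-down: drain backwards
    return steps


# Eight stages (one per card, as in the interview) and ten micro-batches.
first_stage = one_f_one_b(stage=0, num_stages=8, num_microbatches=10)
last_stage = one_f_one_b(stage=7, num_stages=8, num_microbatches=10)
```

Stage 0 runs seven forwards before its first backward arrives, while stage 7 alternates forward and backward from the start.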
Chen Liang (36:53)
And every stage communicates with the prior stage and with the next stage. And 1F1B has a problem: stage 0 and stage 7 always get the most computation load. At stage 0, you have the embedding layer, and it’s an index select operation. So it’s close. And at stage 7, you’ve got the LM head layer and the loss function, and you’ve also got a large GEMM, because you need to multiply the hidden states to project them to the vocab size. And the vocab size is always large.
Nathan Lambert (37:45)
How much fine-grained work are you doing to change which part of the model is on each stage? Because that seems like what it would be then. You either have to change the model or you have to change how you split up the model. It’s like your two options.
Chen Liang (37:58)
The common way is you just take the layers, with the LM head layer and the embedding layer, and divide them by the GPU number. So it’s natural that stage 0 and stage 7 get much more computation load, since you ignored the balance of the system when you split the layers. That’s the common one. So the main concern of our optimization is to alleviate the computation load of stage 0 and stage 7.
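A toy calculation shows why an even split leaves the end stages overloaded. The relative costs below are invented for illustration, and the uneven split is only one possible rebalancing, not the partition the Ling team uses.

```python
def stage_loads(layers_per_stage, layer_cost, embed_cost, head_cost):
    """Compute load per pipeline stage for a given layer split."""
    loads = [n * layer_cost for n in layers_per_stage]
    loads[0] += embed_cost    # stage 0 also holds the embedding lookup
    loads[-1] += head_cost    # the last stage also holds the LM head and loss
    return loads


# Hypothetical relative costs, chosen only for illustration.
LAYER, EMBED, HEAD = 1.0, 0.5, 3.0

even = stage_loads([2] * 8, LAYER, EMBED, HEAD)
uneven = stage_loads([2, 2, 2, 3, 3, 2, 2, 0], LAYER, EMBED, HEAD)

even_bottleneck = max(even)      # slowest stage paces the whole pipeline
uneven_bottleneck = max(uneven)
```

Both splits place the same 16 layers, but under these made-up costs the even split has a bottleneck of 5.0 on the last stage, while moving layers away from the last stage brings the bottleneck down to 3.0.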
Nathan Lambert (38:25)
I see. I guess I don’t fully follow what has happened. I’m trying to be very clear about whether or not I understand it. Because in a dense model, I think pipeline parallel really makes sense, but you have a smaller model. And then as you’re getting bigger, it’s like much less of a model. I don’t know what it necessarily means to de-load specifically the embeddings or the loss function, and how much of a change you can make. But I think that might be a me limitation. It might be hard to get to, but I’m curious if you want to try.
Chen Liang (39:14)
Actually, it’s quite the same as the dense model. The only difference is per GPU. You can imagine that during pre-training, if we’ve got 32 experts and we use, like, four machines to gather the expert data, you can just view these four machines as one machine. So in this view, it’s the same as the dense model. So just imagine the dense model. You split the layers according to your GPU cards. And let’s assume that every machine gets two layers of the dense model.
Nathan Lambert (40:11)
So I get that. And then you have to shift things around so the loss is less of a bottleneck in the last layer, or the final part of this pipeline parallel being the bottleneck is kind of potentially fundamental.
Chen Liang (40:24)
Yeah.
Post training at Inclusion
Nathan Lambert (40:25)
I see. I mean, the next question that I wanted to ask is very related to this, which is: how do you scale this to make RL work at the same scale? So, the different problems that you have for doing pre-training versus RL with a large-scale model. I don’t have the title of the paper, but in this Ling 1T paper, there’s a ton of RL details. Is this kind of just the next sequential problem that you got to? And then there’s not necessarily similar solutions, but you’re doing your problem solving in the same way to make RL work rather than pre-training in terms of throughput.
Chen Liang (41:03)
There are actually some common tricks, like the VPP (virtual pipeline parallelism) we mentioned in the paper. It means that each machine gets double the layers compared with the original 1F1B. But the difference is, let us assume that the stage 0 machine has four layers. At any given time, two layers are doing computing and two layers are doing communication. So that’s what they call VPP.
Nathan Lambert (41:47)
What does two layers computing and communicating mean?
Chen Liang (41:50)
In other words, some layers are doing computing and some layers just prepare the data. They get the data.
Nathan Lambert (42:00)
I see, so it’s like some machines.
Chen Liang (42:03)
So when you train, during the computing, the communication bandwidth is idle, right? So they utilize this idle time. And our optimization is just to split the pipeline more precisely.
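One common reading of virtual pipeline parallelism is the interleaved layout, where each device holds several non-contiguous chunks of layers instead of one contiguous block. The sketch below shows that layout only. It assumes this is what VPP refers to in the interview; the function name and the sizes are hypothetical.

```python
def interleaved_layer_assignment(num_layers, num_devices, chunks_per_device):
    """Map layers to devices so that each device holds several separate chunks."""
    total_chunks = num_devices * chunks_per_device
    assert num_layers % total_chunks == 0, "layers must split evenly into chunks"
    layers_per_chunk = num_layers // total_chunks
    assignment = {d: [] for d in range(num_devices)}
    for chunk in range(total_chunks):
        device = chunk % num_devices            # chunks go round-robin over devices
        start = chunk * layers_per_chunk
        assignment[device].append(list(range(start, start + layers_per_chunk)))
    return assignment


# 16 layers over 4 devices with 2 chunks per device.
layout = interleaved_layer_assignment(num_layers=16, num_devices=4, chunks_per_device=2)
```

With these sizes, device 0 holds layers 0 to 1 and 8 to 9 rather than 0 to 3. Because a device has more than one chunk, it can compute on one chunk while the data for another chunk is still being sent or received.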
Nathan Lambert (42:31)
So I think I’m seeing that. So it’s within a node. You have very fast communication between eight GPUs. And then in pre-training, you’re kind of doing it all sequentially, but in RL, you need to kind of sync this. You have to move your weights to be able to generate when you’re doing RL. There’s this sync step. And then I think what you’re saying is, you have this chunk on eight GPUs and then you can split this, so half of them are doing compute and half are doing communication at the same time. So it kind of alleviates the bottlenecks. I see. For context, there’s a lot of different ways of doing RL infrastructure. The abstractions in what we’re doing are much easier: we’re looking at approaches where we have GPUs that are set for generation and for training, and we are primarily looking at ways to make those both faster. From the training GPUs, we sync the weights to the generators and the generators just keep going. Whereas this is much more deeply embedded in the architecture, where you have one cluster and you’re splitting the GPUs and what work is happening on a per-node basis when you’re doing this RL training. And I’m going to go look at this in more detail.
Chen Liang (43:48)
Yeah.
Richard Bian (43:56)
Just to add a little more flavor to this, the reason we didn’t really cover a lot of post-training details in this interview is that we have some additional technical papers or technical reports we’re writing at this moment about the system.
Nathan Lambert (44:14)
That makes sense.
Richard Bian (44:15)
So it was to some extent intentionally vague, Nathan. But I mean, first things first, the current paper of Ling 1T and Ring 1T does have the fundamental intro for our system. It’s called ASystem. I believe the article has been published on ant-ling.medium.com/ as a technical paper on Medium, as well as on Ling Team. So the paper is also available in English on Ling Team, as we publish all the details. So specifically, there are several things which we did for the RL aspect. One is about the system itself. You can imagine that we do have an optimized internal hybrid engine which does all the things you described. And the second part is we’re exploring the reward model system. So this reward model system essent