[Linkpost] “Inference Scaling and the Log-x Chart” by Toby_Ord

EA Forum Podcast (Curated & popular) · EA Forum Team

February 2, 2026 · 16m 32s

Show Notes

This is a link post.

Improving model performance by scaling up inference compute is the next big thing in frontier AI. But the charts being used to trumpet this new paradigm can be misleading. While they initially appear to show steady scaling and impressive performance for models like o1 and o3, they really show poor scaling (characteristic of brute force) and little evidence of improvement between o1 and o3. I explore how to interpret these new charts and what evidence for strong scaling and progress would look like.

From scaling training to scaling inference

The dominant trend in frontier AI over the last few years has been the rapid scale-up of training: using more and more compute to produce smarter and smarter models. Since GPT-4, this kind of scaling has run into challenges, so we haven’t yet seen models much larger than GPT-4. But we have seen a recent shift towards scaling up the compute used during deployment (aka ‘test-time compute’ or ‘inference compute’), with more inference compute producing smarter models. You could think of this as a change in strategy from improving the quality of your employees’ work via giving them more years of training in which to acquire [...]
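To make the point about log-x charts concrete, here is a minimal sketch of what a straight line on such a chart implies. It assumes accuracy grows linearly in the logarithm of compute; the function name `compute_multiplier` and the slope of 20 points per 10x compute are hypothetical, chosen only for illustration and not taken from the article or from any published chart.

```python
import math


def compute_multiplier(gain_points: float, points_per_10x: float) -> float:
    """Factor by which compute must grow to gain `gain_points` of accuracy.

    Assumes a straight line on a log-x chart:
        accuracy = a + points_per_10x * log10(compute)
    so a fixed gain in accuracy costs a fixed *multiple* of compute.
    """
    return 10 ** (gain_points / points_per_10x)


# Hypothetical slope: 20 accuracy points per 10x increase in compute.
slope = 20.0
for gain in (10, 20, 40, 60):
    factor = compute_multiplier(gain, slope)
    print(f"+{gain} points needs about {factor:,.1f}x the compute")
```

Under this assumption, each additional step of accuracy multiplies the cost rather than adding to it, which is why a line that looks steady on a log-x chart can correspond to steeply rising cost.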
---
First published: February 2nd, 2026
Source: https://forum.effectivealtruism.org/posts/zNymXezwySidkeRun/inference-scaling-and-the-log-x-chart
Linkpost URL: https://www.tobyord.com/writing/inference-scaling-and-the-log-x-chart
---
Narrated by TYPE III AUDIO.
---
Images from the article:
Image 1: Two scatter plots showing "o1 AIME accuracy during training" and "o1 AIME accuracy at test time" versus compute on log scale.
Image 2: A logarithmic graph showing "Moore's Law: The number of transistors on microchips doubles every two years."
Image 3: Performance comparison graphs showing model coverage across SWE-bench Lite and other coding benchmarks with varying sample sizes.
Image 4: Two line graphs comparing model performance on "MATH (Oracle Verifier)" and "CodeContests" benchmarks across sample sizes.
Image 5: Graph showing O Series Performance with ARC-AGI Semi-Private Eval scores versus cost per task.
Image 6: Graph showing O Series Performance over cost per task, with models from O1-MINI to O3 HIGH.
Image 7: Graph showing "Best Observed Score@k by Time Budget (95% CI)" comparing AI model and human performance over time.
Image 8: Graph titled "ARC-AGI LEADERBOARD" showing model performance scores versus cost per task.
