16 Comments
Adam:

> The METR graph cannot be saved

I think this is not a helpful framing at all! Instead the question should be: how should and shouldn't I update my views based on the graph. Clearly it contains a bunch of signal and so the correct amount to update is not zero, and also clearly you need to think about how that signal generalises to other claims you might want to think about.

I think arguments like these about how well it generalises are most useful when they come with concrete predictions about exactly to what extent it will and won't generalise. We're working against the clock of rapid AI progress here; we can't pre-empt everything on the first shot, so it's probably most valuable to do a good first version and then work to resolve the most important and decision-relevant differences in people's predictions based on the results of the first version (and the state of play more generally).

On the release of this paper some commentators (e.g. Gary Marcus, IIRC) said they expected that the exponential time horizon trend wouldn't generalise to other kinds of tasks outside of software engineering. I think (at least on some operationalisations) that claim was probably wrong, based on METR's follow-up investigating that question https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains

In this post, I'm glad to see you predict that the baseliners on their initial suite took longer than a different sample would if incentivised differently. It might be helpful to write up a few of these predictions, and take them to METR (or your favourite non-METR research group that wants to do a replication) and see if we can get evidence on those too.

Nathan Witkin:

This is fair, but to be very clear I am in fact claiming that the proper update is zero. It's not that there is no signal per se, but that there is far too much uncertainty / noise to tease out its details, such that updating in any particular direction would be no better than a coin-flip. I suppose if you doubted AI was improving at all at software engineering, you could probably safely update to "it is," but I don't think anyone still doubts that.

I wish I had had space to address METR’s follow-up comparing trendlines across domains. This is from another thread where I respond to the same criticism (OP also linked to the cross-domain time horizons):

With respect to the idea that “lots of different things measured lots of different ways show a similar pattern,” I disagree that the other benchmarks represented in the linked time horizon graph are “measured in lots of different ways.” On the contrary, they have many of the exact same issues as METR’s, and in some cases are worse. For instance: many of their time horizons are guesstimated by task annotators, rather than measured from human baselines. They also suffer from very similar task realism issues, and maybe most importantly the frontier labs are aggressively fine-tuning for all of them. So no, there are too many correlated sources of error here for these other benchmarks to speak to the robustness of METR’s.

As for writing up more concrete predictions, I will consider doing this, though I am busy so no promises.

Thanks for the read and thoughtful comment.

Victualis:

Your critique is reasonable. I agree with much of it. However, the interesting part of the graph really isn't the y-axis, which you point out has values pulled out of a sack marked "METR buddies doing random tasks, using weird metrics and incentives". The graph is getting attention because it shows a nice linear pattern on a linear horizontal axis and a logarithmic vertical axis, with perhaps an even steeper trend recently. This supports claims that LLM progress is continuing and that capabilities are advancing something like exponentially, possibly at an accelerating rate. The exact parameters of the curve are close to irrelevant if this really is exponential growth. Peter Thiel could hypothetically order 1000 Palantir coders to independently implement some difficult task related to their codebase, and use those timings as a more rigorous baseline, but the belief is that LLMs would show the same pattern on such a dataset.
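
To make the relative-improvement reading concrete, here's a rough sketch of the arithmetic behind it (the numbers are invented for illustration, not METR's): the "exponential" claim reduces to the slope of a line fit to log2(time horizon) against release date, and uniformly rescaling every horizon (say, if all the baselines were 2x too slow) shifts the intercept without touching the slope.

```python
# Rough sketch of the relative-improvement reading; the numbers are invented
# for illustration and are not METR's.
import numpy as np

years    = np.array([2021.0, 2022.0, 2023.0, 2024.0, 2025.0])   # hypothetical release dates
horizons = np.array([0.5, 2.0, 8.0, 30.0, 120.0])               # hypothetical 50% horizons, minutes

# "Roughly exponential" just means log(horizon) is roughly linear in time,
# so the claim reduces to the slope of this fit (doublings per year).
doublings_per_year, _ = np.polyfit(years, np.log2(horizons), 1)
print(f"implied doubling time: {12.0 / doublings_per_year:.1f} months")

# Uniformly rescaling every horizon (e.g. if all baselines were 2x too slow)
# shifts the intercept but leaves the slope untouched.
rescaled_slope, _ = np.polyfit(years, np.log2(horizons / 2.0), 1)
assert np.isclose(doublings_per_year, rescaled_slope)
```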

I think there is a strong argument to be made that the fixed 50% threshold is a weakness. It's also a problem that the benchmark assumes that if a system has mastered some set of tasks at 50%, then it has also mastered all shorter-duration tasks at at least 50% (which might not hold, as RL to prefer quick solutions to longer tasks might hurt correctness on shorter tasks). I can also accept that it's possible for the initial duration labels to be adversarially selected to facilitate apparent exponential progress (a nontrivial combinatorial problem but maybe worth investigating). I fully agree with you that trying to say anything about the absolute capability levels of LLMs based on the METR 50% graph is silly. But I'm not convinced your specific complaints undermine the main way the graph has been interpreted: as a claim about relative improvement over time.
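
And for readers less familiar with how the headline number is produced, here's a minimal sketch of what a "50% time horizon" is (again with invented data; this isn't METR's code or exact methodology): fit success probability against log task length and solve for where the fitted curve crosses 50%. Note that the logistic fit bakes in the monotonicity assumption I worried about above; the fitted success curve is monotone in task length by construction, whatever the raw data does.

```python
# Minimal sketch of a "50% time horizon". All data here is invented for
# illustration; this is not METR's code, data, or exact methodology.
import numpy as np
from scipy.special import expit
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical tasks: human completion times (minutes) and model success/failure.
task_minutes = rng.lognormal(mean=2.0, sigma=1.5, size=500)
p_success_true = expit(3.0 - 1.2 * np.log(task_minutes))   # success prob falls with length
success = rng.random(task_minutes.size) < p_success_true

# Fit success probability against log task length; the logistic form forces the
# fitted curve to be monotone in length, whether or not the raw data is.
clf = LogisticRegression().fit(np.log(task_minutes).reshape(-1, 1), success)

# The 50% horizon is the task length where the fitted probability crosses 0.5.
horizon_50 = np.exp(-clf.intercept_[0] / clf.coef_[0, 0])
print(f"fitted 50% time horizon: ~{horizon_50:.1f} minutes")
```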

Nathan Witkin:

Thanks for the read and thoughtful comment. Your worry about the assumption that success at long tasks implies the same or a higher rate of success at shorter ones is a great point I hadn't considered.

I agree that the METR graph is enough to show that LLMs are improving relatively quickly at software engineering, although I'm not sure we can say that that improvement is exponential based on this benchmark alone.

I don't think, though, that the initial duration labels would have to be adversarially selected per se to give the appearance of exponential growth where there wasn't any. My concern is that time-horizons for the longer ~60% of tasks might be inflated by quite a bit, in which case correcting for that would make the curve much shallower. I think no matter what you'd still have exponential growth at the lower end of the x-axis, but I suspect that with a much larger, randomized sample that allowed for better pairing between tasks and domain-specific experts, you could see a pretty substantial tapering of the improvement rate somewhere along the x-axis.
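
To give a feel for the mechanism, borrowing the toy numbers from your sketch above (still made up, not METR's data): if the largest horizons rest on inflated baselines, correcting them pulls the most recent points down the most, and the fitted doubling time stretches out accordingly.

```python
# Toy illustration of the concern above; the numbers are invented, not METR's.
import numpy as np

years    = np.array([2021.0, 2022.0, 2023.0, 2024.0, 2025.0])   # hypothetical release dates
horizons = np.array([0.5, 2.0, 8.0, 30.0, 120.0])               # hypothetical 50% horizons, minutes

def doubling_time_months(years, horizons):
    doublings_per_year, _ = np.polyfit(years, np.log2(horizons), 1)
    return 12.0 / doublings_per_year

# Suppose the horizons above 10 minutes rest on baselines that were 3x too slow.
corrected = np.where(horizons > 10.0, horizons / 3.0, horizons)

print(f"doubling time as measured:  {doubling_time_months(years, horizons):.1f} months")
print(f"doubling time if corrected: {doubling_time_months(years, corrected):.1f} months")
```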

Also, an important detail that wasn't public until the paper's lead author, Thomas Kwa, posted this note (https://metr.org/notes/2026-01-22-time-horizon-limitations/) (originally a LessWrong comment) is that "most of the longer tasks" don't even have baselines; their time-horizons are just being guesstimated. That of course throws a huge wrench in things as well.

Alex Willen:

This is great - thanks for putting it together. I had always thought that while the absolute numbers were dubious and somewhat misrepresented (so often this graph is cited with language that elides the fact that it's talking about a 50% success rate), it was still useful in terms of relative improvement. Didn't realize there were such significant issues with data contamination, though.

Do you think there are any useful conclusions to be drawn from the graph, or is it just one of those things that looks really meaningful while revealing nothing of substance?

Nathan Witkin:

I think it's decent evidence that AI is in fact improving at software engineering, but we already knew that. As for how much or how fast, I have a hard time seeing any way to root through all of the noise to come to a meaningful conclusion on those fronts. There are just too many problems with the study, and with the level of information we have about METR's data it's also very hard to pin down their precise scope or how they interact.

Herbie Bradley:

The criticisms on biased selection of SWEs seem vastly overstated. You don't justify why this selection of SWEs should be expected to produce significantly different results than, say, randomly selected employed SWEs in the US with more than 3 years of experience (reasonable criteria). More speculatively (but you are also speculating), I would expect that SWEs in the US are probably very insensitive to the financial reward given its size relative to their salaries, so it seems unlikely to me that it resulted in inflated time estimates. Given the quality of the METR professional network, the numbers are probably higher than a random selection would give as well.

I do think overall METR should make a more representative version with better longer tasks that can give more general SWE information (rather than being focused on ML R&D at long time horizons), but the rest of the methodological issues seem minor: I still think the overall shape of the curve is correct (and that is the main takeaway people have anyway). It's also notable that I could probably write a piece of this length aggregating reasons why METR numbers are lower than the "true" time horizon, for example that they are very likely not using the best scaffolding—e.g., Claude Code regularly achieves much longer time horizons for Anthropic models and METR are not using it since they go via the API.

Nathan Witkin:

Thanks for the read and thoughtful criticisms.

I would be very careful about playing the “I’m sure the tiny, biased sample is actually representative” game. It is very hard to anticipate from first principles how different two distinct samples might be. Some examples of differences that could be material here:

1. METR’s professional network likely overrepresents expertise in some areas and underrepresents it in others (e.g. I’d imagine expertise in ML is overrepresented).

2. METR’s professional network may overrepresent part or full-time SWEs relative to a nationally representative sample.

3. METR’s professional network may be younger on average than SWEs nationwide.

4. METR’s professional network may over or underrepresent some social groups (by gender, race, nationality, citizenship status, etc.).

5. METR’s professional network may have been educated at different universities. It’s important to realize it’s non-obvious how this would affect baselines. It could mean they get faster because baseliners from top schools are better coders, but it could mean they get slower because top schools train coders in some systematically different way. Or maybe coders from top schools are more likely to view baselining as a chore or favor or waste of time relative to their other work. That could in turn make them speed up, but it could also slow them down if it means they make more mistakes they then have to go back and correct.

6. A larger sample would allow METR to better pair tasks with domain experts (something they tried to do with their sample, but often couldn’t because of how small it was). If baselines for long tasks were inflated because of a lack of pertinent expertise, this would shorten them, possibly by a lot.

7. Engineers who do not personally know METR employees may approach baselining differently. For example, maybe they respond more to the perverse incentive to work slower. Or maybe they respond less (some people are more willing to waste a friend’s money than a stranger’s).

And I’m sure there are plenty of others I’m missing. That these issues are so hard to anticipate is why you get such strict guidelines surrounding proper sampling in the social sciences.

Ditto for your doubting that baseliners would be sensitive to monetary rewards. First off, if they weren’t, then why did METR feel the need to offer them in the first place? But second and more importantly, this is another situation where you just want to ‘follow the rules’ (of good methodology). Rich people do in fact respond to small monetary incentives. There are tons of reasons for this, such as thinking that, in principle, it is wrong or suboptimal to waste or lose money, or just because money is psychologically and symbolically powerful, such that people will make less-than-rational decisions in order to make even a little bit of it. As with not generalizing from a tiny, biased sample, there are good reasons why social scientists would warn you against paying people in proportion to [x] when you are trying to measure [x] (at least if they are not normally paid according to this same pattern).

All this in mind, I would push back on the claim that any of the issues here are minor. Sampling issues (not to mention uneven data contamination, which you don’t raise) could warp the shape of the doubling curve in all sorts of ways (that is, relative to a much larger, random sample). These are not issues you just shrug off.

As for time horizons being understated, I don’t buy that insufficient scaffolding would make a huge difference, esp. given what we already know about how slow baseliners were relative to METR’s repository maintainers. Would of course be interested to see strong evidence to the contrary (do we have any systematic comparisons of one-shot success rates for different tasks across scaffolds?).

Herbie Bradley:

This seems fair, but certainly I'm very sceptical that scaffolding makes little difference. I’ve worked extensively with different scaffolding of LLMs over the past few years, and my sense is it makes a huge difference, especially in coding. Most of Claude Code’s product development is essentially an effort to build out more and more helpful scaffolding (e.g., Skills) on top of the base model. A concrete recent example of a pure-scaffolding performance boost is the Poetiq scaffold on the ARC-AGI 2 benchmark: https://poetiq.ai/posts/arcagi_verified/

Nathan Witkin:

Thanks for sharing this. Fully admit to not being super knowledgeable about scaffolding, but my point was more that, even with it, it'd be hard to fully make up for the pretty substantial amount of baseline slowdown we know about. Also, increasing all models' capabilities by the same amount wouldn't address the main issues I see as affecting the slope of the curve.

On a broader note, I wonder whether scaffoldmaxxing dilutes the model capability signal in favor of the 'scaffold efficacy' signal. Say you figured out which scaffolds are best for each model, and then re-ran the benchmark with them. I think it'd be hard to interpret what the results meant for overall model capability growth (versus our just getting better at scaffolding). Still I take the point that you would get more impressive results, and that in a lot of real-world cases, engineers are figuring out scaffolding best practices, so benchmarking different model-scaffold pairs is informative too.

Herbie Bradley:

Better models also respond better to scaffolding though, because they are smarter! Concretely this is driven by the models having higher success rates at things like tool use and having been post-trained out of making obvious stupid mistakes. I agree it would be hard to interpret because you're no longer controlling for the scaffolding, but if what we care about is how good models can be at some domain then I don't think there's a sensible way to avoid it since you have to do some amount of scaffolding building (METR have their own, it's just probably not optimal at all).

Nathan Witkin:

Ah okay got it. Would be cool to see some careful benchmarking of model-scaffold pairs on realistic tasks.

Phil Bell:

Interesting - what do you think we should be using instead to understand improving AI capability?

Anatol Wegner, PhD:

For anyone interested I did a critical analysis of the METR paper a while ago from a statistical perspective: https://aichats.substack.com/p/are-ai-time-horizon-doubling-every?r=4tn68o

Gabe The Ape:

Props to Nathan for writing this, and props to Transformer for publishing it even though it also somewhat critiques their use of the graph.

Comment removed · Jan 21
Nathan Witkin:

Thank you! It's been wild to see the number of folks in AI research willing to simply brush these issues off. Maybe this is a matter of culture clash between industry and academia, but as someone whose background is more the latter, these issues -- esp. taken all together -- seem pretty disqualifying.