> The METR graph cannot be saved
I think this is not a helpful framing at all! Instead, the question should be: how should and shouldn't I update my views based on the graph? Clearly it contains a bunch of signal, so the correct amount to update is not zero, and just as clearly you need to think about how that signal generalises to the other claims you care about.
I think arguments like these about how well it generalises are most useful when they come with concrete predictions about exactly what will and won't generalise. We're working against the clock of rapid AI progress here and can't pre-empt everything on the first shot, so it's probably most valuable to do a good first version and then work to resolve the most important and decision-relevant differences in people's predictions based on its results (and the state of play more generally).
On the release of this paper, some commentators (e.g. Gary Marcus, IIRC) said they expected the exponential time-horizon trend not to generalise to tasks outside of software engineering. I think (at least on some operationalisations) that claim was probably wrong, based on METR's follow-up investigating the question: https://metr.org/blog/2025-07-14-how-does-time-horizon-vary-across-domains
In this post, I'm glad to see you predict that the baseliners on their initial suite took longer than a differently incentivised sample would have. It might be helpful to write up a few of these predictions and take them to METR (or your favourite non-METR research group that wants to do a replication) to see if we can get evidence on those too.
This is fair, but to be very clear, I am in fact claiming that the proper update is zero. It's not that there is no signal per se, but that there is far too much uncertainty and noise to tease out its details; any update you made on it would be no better than a coin flip. I suppose if you doubted AI was improving at software engineering at all, you could probably safely update to "it is," but I don't think anyone still doubts that.
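To put the coin-flip point slightly more formally (a rough Bayesian sketch of my claim, not a precise model of the graph): posterior odds are prior odds times the likelihood ratio,

$$\frac{P(H \mid E)}{P(\neg H \mid E)} = \frac{P(H)}{P(\neg H)} \cdot \frac{P(E \mid H)}{P(E \mid \neg H)},$$

and my contention is that, once you account for the baselining, contamination, and task-selection problems, a graph like this is about as likely under a fast, clean exponential trend as under a much slower or messier one, i.e. $P(E \mid H) \approx P(E \mid \neg H)$. With a likelihood ratio near 1, the posterior is essentially the prior, so the update rounds to zero.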
I wish I had had space to address METR's follow-up comparing trend lines across domains. This is from another thread where I respond to the same criticism (OP also linked to the cross-domain time horizons):
With respect to the idea that “lots of different things measured lots of different ways show a similar pattern,” I disagree that the other benchmarks represented in the linked time horizon graph are “measured in lots of different ways.” On the contrary, they have many of the exact same issues as METR’s, and in some cases are worse. For instance: many of their time horizons are guesstimated by task annotators, rather than measured from human baselines. They also suffer from very similar task realism issues, and maybe most importantly the frontier labs are aggressively fine-tuning for all of them. So no, there are too many correlated sources of error here for these other benchmarks to speak to the robustness of METR’s.
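One way to put the correlation point quantitatively (a standard identity about averaging correlated estimates, not a model of these specific benchmarks): if you average $n$ noisy estimates with common variance $\sigma^2$ and average pairwise correlation $\rho$, then

$$\operatorname{Var}\!\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{\sigma^2}{n}\bigl(1 + (n-1)\rho\bigr),$$

which shrinks like $\sigma^2/n$ only when the errors are independent ($\rho = 0$) and stays close to $\sigma^2$ as $\rho \to 1$. When every benchmark shares the same annotator-guessed horizons, realism problems, and lab fine-tuning pressure, piling on more of them barely reduces the uncertainty, so their agreement is much weaker evidence than agreement among independent measurements would be.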
As for writing up more concrete predictions, I will consider doing this, though I am busy so no promises.
Thanks for the read and thoughtful comment.
This is great - thanks for putting it together. I had always thought that while the absolute numbers were dubious and somewhat misrepresented (so often this graph is cited with language that elides the fact that it's talking about a 50% success rate), it was still useful in terms of relative improvement. Didn't realize there were such significant issues with data contamination, though.
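For anyone who hasn't dug into the methodology: as I understand it, the headline number is a "50% time horizon," i.e. success probability is fit against the log of the human baseline time, and the reported horizon is the task length at which that fitted curve crosses 50%. A purely illustrative sketch of the calculation, with made-up numbers rather than METR's actual code or data:

```python
# Illustrative only: hypothetical data, not METR's code or results.
# Fit P(model succeeds) as a logistic function of log2(human baseline minutes),
# then read off the task length where the fitted probability equals 50%.
import numpy as np
from sklearn.linear_model import LogisticRegression

human_minutes = np.array([1, 2, 4, 8, 15, 30, 60, 120, 240, 480], dtype=float)
model_succeeded = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 0])  # hypothetical outcomes

X = np.log2(human_minutes).reshape(-1, 1)
clf = LogisticRegression().fit(X, model_succeeded)

# p = 0.5 exactly where the linear term w*x + b crosses zero, i.e. x = -b/w.
log2_horizon = -clf.intercept_[0] / clf.coef_[0, 0]
print(f"50% time horizon ≈ {2 ** log2_horizon:.0f} human-minutes")
```

The point being: the reported horizon is the task length at which the model succeeds about half the time, not the length it can reliably handle, which is exactly the nuance that gets elided.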
Do you think there are any useful conclusions to be drawn from the graph, or is it just one of those things that looks really meaningful while revealing nothing of substance?
I think it's decent evidence that AI is in fact improving at software engineering, but we already knew that. As for how much or how fast, I have a hard time seeing any way to root through all of the noise and come to a meaningful conclusion on those fronts. There are just too many problems with the study, and, given the information we have about METR's data, it's very hard to pin down their precise scope or how they interact.