Discussion about this post

Victualis

Your critique is reasonable, and I agree with much of it. However, the interesting part of the graph really isn't the y-axis, whose values, as you point out, are pulled out of a sack marked "METR buddies doing random tasks, using weird metrics and incentives". The graph is getting attention because it shows a clean straight line on a plot with a linear horizontal axis and a logarithmic vertical axis, with perhaps an even steeper trend more recently. This supports claims that LLM progress is continuing, that capabilities are advancing roughly exponentially, and perhaps that the rate of improvement has recently increased. If this really is exponential growth, the exact parameters of the curve are close to irrelevant. Peter Thiel could hypothetically order 1000 Palantir coders to independently implement some difficult task related to their codebase and use those timings as a more rigorous baseline, but the belief is that LLMs would show the same pattern on such a dataset.
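To make the "parameters are close to irrelevant" point concrete, here is a minimal sketch. All numbers are synthetic and made up for illustration; none come from METR, and `fit_doubling_time` is a hypothetical helper, not part of any benchmark code. It fits a straight line to log2 of the time horizon and shows that multiplying every horizon by a constant (say, if every human baseline time were off by the same factor) shifts the intercept but leaves the doubling time unchanged.

```python
import math

def fit_doubling_time(times, horizons):
    """Least-squares fit of log2(horizon) = intercept + slope * time.

    Returns (doubling_time, intercept), where doubling_time = 1 / slope.
    """
    logs = [math.log2(h) for h in horizons]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, logs))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_t
    return 1.0 / slope, intercept

# Synthetic data (an assumption): the horizon doubles every 0.5 years.
years = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
horizons_min = [1.0 * 2 ** (t / 0.5) for t in years]

# The same data with every task-duration label inflated by 10x.
rescaled = [10.0 * h for h in horizons_min]

dt_a, b_a = fit_doubling_time(years, horizons_min)
dt_b, b_b = fit_doubling_time(years, rescaled)

print("doubling time (original):", dt_a)
print("doubling time (rescaled):", dt_b)
print("intercept shift in log2 units:", b_b - b_a)
```

This only covers a uniform multiplicative error in the labels. Errors that vary with task length or with model release date could change the slope, so the sketch does not address every objection to the y-axis.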

I think there is a strong argument that the fixed 50% threshold is a weakness. It's also a problem that the benchmark assumes that if a system succeeds on at least 50% of tasks of some duration, then it also succeeds on at least 50% of all shorter tasks (which might not hold, as RL training that favors quick solutions to longer tasks might hurt correctness on shorter ones). I can also accept that it's possible for the initial duration labels to be adversarially selected to facilitate apparent exponential progress (a nontrivial combinatorial problem, but maybe worth investigating). I fully agree with you that trying to say anything about the absolute capability levels of LLMs based on the METR 50% graph is silly. But I'm not convinced your specific complaints undermine the main way the graph has been interpreted: as a claim about relative improvement over time.

Alex Willen

This is great - thanks for putting it together. I had always thought that while the absolute numbers were dubious and somewhat misrepresented (so often this graph is cited with language that elides the fact that it's talking about a 50% success rate), it was still useful in terms of relative improvement. Didn't realize there were such significant issues with data contamination, though.

Do you think there are any useful conclusions to be drawn from the graph, or is it just one of those things that looks really meaningful while revealing nothing of substance?
