Consider, suggest the authors of a new pre-print research paper on AI benchmarking, a kitchen environment. Humans build a model of the tools, appliances, and cooking times involved, and that model transfers to, say, a self-catering holiday rental. They can cook new recipes in new settings and make predictions, both near and far, about the consequences of their actions.
Model-learning agents should likewise “support many different tasks unknown ahead of time”.
They do not. Or at least, experiments show that Claude 4 Sonnet, Gemini 2.5 Pro, and o3 do not, and suggest that LLMs will be fundamentally unable to do so until they get “metacognitive capabilities: strategic experimental design, better uncertainty quantification, and flexible belief updating during exploration and task execution.”
In a limited subset of cases, throwing more compute at the problem helps, but not enough to suggest that current model architectures could reach decent performance with enough juice behind them. The fact is, the paper's authors say, "reasoning models cannot determine when or how to revise what they have learned", and that is crippling.
