If you have spent some time studying the rapid progress of LLMs, you will likely have encountered benchmark scores going by names such as MMLU.
Any new model release seems invariably to be followed by a string of numbers comparing the model's performance to that of others on tasks testing various types of knowledge or skills, and the internet seems perpetually abuzz with the evolution of such metrics.
Benchmarks Everywhere!
In many ways, benchmarks seem to be the quantitative measure of progress in the field, and hence to _define_ the progress itself.
In spite (or perhaps because) of that, numerous complaints have emerged that these metrics do not do a very good job of measuring the 'true, real-world' capabilities of models.
These claims sometimes go hand in hand with rumored _trust_ problems in the benchmarking process (data contamination, unreproducible procedures, cheating, ...).
This is particularly the case for benchmarks that rely on a private dataset of questions on which the models are evaluated (questions that should of course not be known in advance, as this would defeat the purpose of the exam).
In several ways, the problems with measuring models mirror challenges found in human education: relevance to real-world skills, fairness, trust, ...
These problems are in fact not merely superficial, and any attempt at addressing them confronts one with the almost philosophical question of what we are (implicitly or explicitly) expecting a benchmark to tell us.
What to expect from a benchmark, in general?
Beyond the LLM world, quantitative metrics have become quite ubiquitous as a means of comparing a wide diversity of 'goods', ranging from digital cameras, GPUs, and locks to influencers, entertainers, writers, and researchers.
What creates the appeal of benchmarks is that they produce one or several _scores_ (i.e. real-valued quantities) telling us _what is best_ in a certain category (via e.g. a ranking).
Such metrics may then be used to make informed decisions about a particular choice one is facing (e.g. a purchase), though the human fascination with metrics runs so deep that they may not serve any obvious purpose (most people who read about acceleration stats for luxury cars cannot afford one).
What should we expect from an LLM benchmark?
LLMs are obviously very useful for some tasks, and they can display varying levels of skill at them.
A user may hence want to find a benchmark relevant to their use case (e.g. if they are looking for a model that can help fix bugs in code, they may want to use SWEBench) and then pick the model best suited for it, hoping that the benchmark reflects the expected skills.
This raises a number of nontrivial issues:
- Is the task well aligned with the benchmark?
- Is the benchmark trustworthy? Did anyone cheat?
- Do the metrics give us an a priori idea of how successful a model will be at any given task?
Each of these questions raises challenges of its own, and it seems a priori necessary to treat them separately.
However, going back to the question of _what we expect_ from an LLM benchmark, we run into the fact that the key strength of LLMs is their generality: for instance, it is entirely possible that we want to use the same model to debug code, to invent cooking recipes, to help fill in documents, and to design a new LLM benchmark, all at the same time.
What would a good benchmark then tell us? It would be a fair measure of _general capabilities_, i.e. the model need not be specialized for anything in particular, but should be _able_ to use the available resources to solve a given problem _(to the extent that this is possible at all; no one expects a model to be able to, e.g., break strong cryptography)_.
Informally, a good benchmark would perhaps present a model with a very wide variety of tasks and assess its performance on them; naively, if we had a way to sample a thousand different 'random' tasks and saw a model perform well on 997 of them, this would give us a good degree of confidence that it would also perform well on a new 'random' (feasible) task.
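To make that 'degree of confidence' a little more concrete, here is a minimal sketch (in Python, using the hypothetical 997-out-of-1000 count from the naive example above) of the kind of interval one could put around the model's success rate, assuming the tasks are sampled independently from some fixed task distribution:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success probability."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return center - half_width, center + half_width

# Hypothetical numbers from the example: good performance on 997 of 1000 'random' tasks.
low, high = wilson_interval(997, 1000)
print(f"estimated success rate on a new 'random' task: [{low:.3f}, {high:.3f}]")
# -> roughly [0.991, 0.999], assuming the tasks are i.i.d. draws from the same distribution
```

The statistics are of course the easy part; the real difficulty, which the sketch simply takes for granted, lies in defining a meaningful distribution of tasks to sample from.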
In subsequent posts, we will see how trying to fulfill this naive goal has shaped the approach behind our paper, and how it also leads to completely open benchmarks, where no cheating is possible.
June 30th, 2025