If you have spent some time studying the rapid progress of LLMs, you will likely have encountered benchmark scores going by names such as MMLU.
Any new model release seems invariably to be followed by a string of numbers comparing the model's performance to that of others on tasks testing various types of knowledge or skills, and the internet seems perpetually abuzz with the evolution of such metrics.
Benchmarks Everywhere!
In many ways, benchmarks seem to be the quantitative measure of progress in the field, and hence to _define_ the progress itself.
In spite (or perhaps because) of that, numerous complaints have emerged that these metrics do not do a very good job of measuring the 'true, real-world' capabilities of models.
These claims sometimes go hand in hand with rumored _trust_ problems in the benchmarking process (data contamination, unreproducible procedures, cheating, ...).
This is particularly the case for benchmarks that rely on a private dataset of questions on which the models are evaluated (questions that should of course not be known in advance, as this would defeat the purpose of the exam).
In several ways, the problems with measuring models mirror challenges found in human education: relevance to real-world skills, fairness, trust, ...
These problems are in fact not merely superficial, and any attempt at addressing them confronts one with the almost philosophical question of what we are (implicitly or explicitly) expecting a benchmark to tell us.
What to expect from a benchmark, in general?
Beyond the LLM world, quantitative metrics have become quite ubiquitous as a means of comparing a wide diversity of 'goods', ranging from digital cameras, GPUs, and locks to influencers, entertainers, writers, and researchers.
What creates the appeal of benchmarks is that they produce one or several _scores_ (i.e. real-valued quantities) telling us _what is best_ in a certain category (via e.g. a ranking).
Such metrics may then be used to make informed decisions about a particular choice one is facing (e.g. a purchase), though the human fascination with metrics runs so deep that they may not serve any obvious purpose (most people who read about acceleration stats for luxury cars cannot afford one).
What should we expect from an LLM benchmark?
LLMs are obviously very useful for some tasks, and they can display varying levels of skill at them.
A user may hence want to find a benchmark relevant to their use case (e.g. if they are looking for a model that can help fix bugs in code, they may want to use SWEBench) and then pick the model best suited for it, hoping that the benchmark reflects the expected skills.
This raises a number of nontrivial issues:
- Is the task well aligned with the benchmark?
- Is the benchmark trustworthy? Did anyone cheat?
- Do the metrics give us an a priori idea of how successful a model will be at any given task?
Each of these questions raises challenges of its own, and it seems a priori necessary to treat them separately.
However, going back to the question of _what we expect_ from an LLM benchmark, we run into the fact that the key strength of LLMs is their generality: for instance, it is entirely possible that we want to use the same model to debug code, to invent cooking recipes, to help fill in documents, and to design a new LLM benchmark, all at the same time.
What would a good benchmark then tell us? It would be a fair measure of _general capabilities_, i.e. the model need not be specialized for anything in particular, but should be _able_ to use the available resources to solve a given problem _(to the extent that this is possible at all; no one expects a model to be able to, e.g., break strong cryptography)_.
Informally, a good benchmark would perhaps present a model with a very wide variety of tasks and assess its performance on them; naively, if we had a way to sample a thousand different 'random' tasks and saw a model perform well on 997 of them, this would give us a good degree of confidence that it would also perform well on a new 'random' (feasible) task.
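To make that 'degree of confidence' a little more concrete, here is a minimal sketch (in Python, using the hypothetical 997-out-of-1000 count from the naive example above) of the kind of interval one could put around the model's success rate, assuming the tasks are sampled independently from some fixed task distribution:

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial success probability."""
    p_hat = successes / trials
    denom = 1 + z**2 / trials
    center = (p_hat + z**2 / (2 * trials)) / denom
    half_width = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / trials + z**2 / (4 * trials**2))
    return center - half_width, center + half_width

# Hypothetical numbers from the example: good performance on 997 of 1000 'random' tasks.
low, high = wilson_interval(997, 1000)
print(f"estimated success rate on a new 'random' task: [{low:.3f}, {high:.3f}]")
# -> roughly [0.991, 0.999], assuming the tasks are i.i.d. draws from the same distribution
```

The statistics are of course the easy part; the real difficulty, which the sketch simply takes for granted, lies in defining a meaningful distribution of tasks to sample from.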
In subsequent posts, we will see how trying to fulfill this naive goal has shaped the approach behind our paper, and how it also leads to completely open benchmarks, where no cheating is possible.
June 30th, 2025