November 12th, 2025
How to Benchmark Benchmarks
Over the past decades, benchmarks have been the driving force of modern AI: while the true progress of models is seen in their use (and in our satisfaction with the results we get), the way we get there is by using hundreds of benchmarks as proxies for the "true" measure of intelligence (whatever this means).
Model-Benchmark Duality
Progress in models has been so fast that benchmarks are now often seen as saturated, as losing their informative power about where the models should go.
Models and benchmarks are naturally paired with each other, in the sense of a mathematical duality: pairing a model with a benchmark yields a score.
A benchmark is used to say 'how good a model is' (universally, or along some specific dimension), but obviously this is only meaningful if the benchmark itself is good.
And a benchmark is only good if it gives good scores to good models... so this feels a bit cyclic/recursive/munchausian/ouroboroian.
Cyclic Definitions: Escaping the Ouroboros
Cyclic definitions are not always empty!
So... great! We have a cyclic definition: good models are good at good benchmarks, which are good because they identify good models!
These are just variants of $x=x$ (or $0=0$ for minimalists), and hence carry no information... for instance, if you define philosophy as what philosophers do and philosophers as those who practice philosophy, you are not defining either!
For instance, if you define the Fibonacci sequence recursively by $F_{n+1}=F_n+F_{n-1}$, and specify the values $F_0=F_1=1$, you have completely defined the whole sequence.
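As a minimal sketch of how the boundary values pin down the sequence, the recurrence can be unrolled from $F_0$ and $F_1$. The function name `fibonacci` and its `f0`/`f1` parameters are illustrative choices, not something from the text above.

```python
def fibonacci(n, f0=1, f1=1):
    """Return F_n for the recurrence F_{n+1} = F_n + F_{n-1}.

    The recurrence alone does not single out a sequence; the boundary
    values f0 and f1 do. The defaults follow the convention F_0 = F_1 = 1.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    prev, curr = f0, f1  # (F_0, F_1)
    for _ in range(n):
        # Shift the window one step: (F_k, F_{k+1}) -> (F_{k+1}, F_{k+2})
        prev, curr = curr, prev + curr
    return prev


print([fibonacci(n) for n in range(8)])        # [1, 1, 2, 3, 5, 8, 13, 21]
print([fibonacci(n, 2, 1) for n in range(5)])  # same rule, other boundary values
```

The second call uses the same recurrence with different boundary values and produces a different sequence, which is the point: the cyclic rule only becomes a definition once the external input is fixed.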
Obviously, the goal of such definitions (beyond laziness or an inability to formulate explicit ones) is often to emphasize the subjectivity of the object being defined.
1. Purely Cyclic Definitions
2. Cyclic Definitions with Boundary Conditions
Such definitions can be very useful, not only conceptually, but also as a means to produce concrete outputs: the definition can be propagated from the 'boundary' (whatever external input is given) towards the 'bulk' (far from the external input).
For instance, you can define interesting websites as websites that are linked to by interesting websites, and specify boundary conditions externally (e.g. you say that Wikipedia is good).
That is the PageRank example... in the 1990s, the canonical example was 'whitehouse.gov: good' and 'whitehouse.com: bad'.
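To make the propagation from 'boundary' to 'bulk' concrete, here is a simplified sketch in the spirit of PageRank with seed pages. It is not Google's actual algorithm: the function `propagate_scores`, the damping value, the page names, and the toy link graph are all assumptions made for illustration, and pages without outgoing links simply lose their score mass.

```python
def propagate_scores(links, seeds, damping=0.85, iterations=50):
    """Propagate externally given seed scores along links.

    links: dict mapping each page to the list of pages it links to.
    seeds: dict mapping trusted pages to a positive externally given score.
    """
    pages = set(links) | set(seeds)
    for targets in links.values():
        pages.update(targets)

    total_seed = sum(seeds.values())
    # The boundary condition: only seed pages get score "from outside".
    base = {p: seeds.get(p, 0.0) / total_seed for p in pages}

    scores = dict(base)
    for _ in range(iterations):
        new_scores = {p: (1 - damping) * base[p] for p in pages}
        for source, targets in links.items():
            if not targets:
                continue
            # Each page passes a damped, equal share of its score to its targets.
            share = damping * scores[source] / len(targets)
            for target in targets:
                new_scores[target] += share
        scores = new_scores
    return scores


# Hypothetical toy web: one trusted seed and an isolated pair citing each other.
toy_links = {
    "wikipedia": ["a", "b"],
    "a": ["c"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
toy_scores = propagate_scores(toy_links, seeds={"wikipedia": 1.0})
for page in sorted(toy_scores, key=toy_scores.get, reverse=True):
    print(f"{page}: {toy_scores[page]:.4f}")
```

In this toy graph the two pages that only link to each other form a purely cyclic definition: with no path from the seed, their score stays at zero, while pages reachable from the seed inherit a decaying share of its score.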
There are (at least) three types of cyclic definitions that I have seen:
This is not to say that philosophy (or any other hard-to-define discipline) is ill-defined... there is just a tendency for certain fields to define themselves as whatever their recognized experts say they are.
3. 'Spiral' Definitions (with initial data)
A third type of definition is the quasi-cyclic kind, associated with problems where much of what defines an object is self-referential, but not the entirety of it; a small fraction of the definition hinges on something external.
This can be viewed as fairly similar in spirit to the second type (cyclic, with boundary conditions), but it is really intermediate between the first and the second: often, the objects satisfying a 'spiral' definition may not be unique (or uniquely determined by their spiral definition), but they are still _more unique_ than if they were purely cyclic with no boundary conditions.
For instance, an English-language dictionary is typically just self-referential: all the words are defined by other words... there is however some 'deep substance' in that complex self-reference, to the point that anything that can be defined using an English dictionary is probably (at least) as interesting as English itself.
A standard indication of a dictionary's poor quality is an abundance of length-2 cycles in the definitions (of the type "philosophy is what philosophers do"; "a philosopher is someone who practices philosophy"), from which we just learn that a philosopher is probably a person
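The length-2 cycle check is easy to sketch. The snippet below assumes a dictionary has already been reduced to a mapping from each headword to the set of (lemmatized) words used in its definition; the function name `find_two_cycles` and the toy entries are hypothetical.

```python
def find_two_cycles(definitions):
    """Return the pairs of words that are each defined in terms of the other.

    definitions: dict mapping a headword to the set of words that appear
    in its definition (assumed to be lemmatized already).
    """
    cycles = set()
    for word, used_words in definitions.items():
        for other in used_words:
            # A length-2 cycle: `word` uses `other`, and `other` uses `word`.
            if other != word and word in definitions.get(other, ()):
                cycles.add(tuple(sorted((word, other))))
    return sorted(cycles)


# Hypothetical toy dictionary, with definitions reduced to word sets.
toy_dictionary = {
    "philosophy": {"what", "philosopher", "do"},
    "philosopher": {"someone", "who", "practice", "philosophy"},
    "person": {"human", "being"},
}
print(find_two_cycles(toy_dictionary))  # [('philosopher', 'philosophy')]
```

Counting such pairs relative to the number of entries would give a crude quality signal in the sense described above; longer cycles are unavoidable in any self-contained dictionary, so only the shortest ones are flagged here.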
Intelligence as a Spiral-Type Object
Intelligence is one of those things that seem hard to detect if one is not (a little bit) intelligent oneself... but conceivably, one can recognize (a priori