October 24, 2025
Benchmarking Beyond Model Evaluation
The xent game idea covers two intertwined aspects of the LLM universe:
Games as a means to evaluate model capabilities
Games as ways to develop new model capabilities
Over recent months we have shifted towards the latter, since it is a distinctive edge of our approach compared to other evaluation mechanisms (alongside other advantages such as uncheatability and reproducibility)
At the same time, there are (at least) two distinctive features of benchmarking with our games that open new avenues:
It is very cheap and fast to run a benchmark based on a select family of games, and hence to measure progress continuously (impossible to do with votes)
The evaluation games can be customized to target specific tasks or classes of tasks: we have much more granularity in what we are measuring
What possibilities does this open?
Here are two:
We can evaluate new games (including non-xent games) based on a fixed collection of games: how much better do we get at the fixed collection by playing the new games?
We can evaluate the quality of data based on the same idea: if learning some data makes us better at playing some games, then that data is probably quite valuable; more generally, this applies to data-creating environments as well
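The idea of scoring a new game (or a piece of data) by its effect on a fixed collection of games can be sketched in a few lines. This is only a toy illustration under assumptions of ours: the names `transfer_value`, `mean_transfer`, `learn` and the dictionary-based "player" are hypothetical and not part of any xent tooling. A real setup would replace them with an actual model, a training step, and actual game scores.

```python
from statistics import mean
from typing import Callable, Dict, TypeVar

Player = TypeVar("Player")


def transfer_value(
    player: Player,
    learn: Callable[[Player], Player],
    reference_games: Dict[str, Callable[[Player], float]],
) -> Dict[str, float]:
    """Per-game score gain on a fixed collection once `learn` has been applied.

    `learn` is a stand-in (assumption) for "play the new game" or
    "train on the new data"; it returns the updated player.
    """
    before = {name: game(player) for name, game in reference_games.items()}
    updated = learn(player)
    after = {name: game(updated) for name, game in reference_games.items()}
    return {name: after[name] - before[name] for name in reference_games}


def mean_transfer(gains: Dict[str, float]) -> float:
    """Collapse per-game gains into one number (one choice among many)."""
    return mean(gains.values())


# Toy illustration: a "player" is just a dict of skill levels (hypothetical).
toy_player = {"tactics": 0.25, "endgames": 0.5}

# The fixed collection: each reference game reads off one skill as its score.
reference = {
    "tactics_suite": lambda p: p["tactics"],
    "endgame_suite": lambda p: p["endgames"],
}


def play_new_game(p: Dict[str, float]) -> Dict[str, float]:
    # Assumed effect of the new game: it improves tactics only.
    return {**p, "tactics": p["tactics"] + 0.25}


gains = transfer_value(toy_player, play_new_game, reference)
print(gains, mean_transfer(gains))
```

Keeping the per-game gains, rather than only their mean, preserves the granularity mentioned above: the same new game can be very valuable for one class of tasks and worthless for another.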
How to think about the value of data?
Obviously, determining the value of a game or of data can be extremely interesting... particularly if we want to think of a (somehow) more decentralized future where various agents holding useful pieces of data could be incentivized to share them, allowing that data to realize its potential
I have long had a vision of data marketplaces where data would in some sense find its way to fit the needs of various agents producing the data
But for a long time, I saw a hard problem: it wasn't clear how to gauge the value of data / of knowledge
Where this idea comes from
I thought about my own learning experience: there were things that made more sense than others, that connected to earlier pieces of knowledge, and of course there were 'aha' moments in my life when I would learn something and that would change my understanding of the world, that would make me feel able to do things that were not possible before, etc...
But I was not exactly sure how to formulate this abstractly before thinking about xent games
(To be sure: this is a nice idea, but much needs to be specified, in particular with regard to which games we care about)
Let's start with a simple example
If you care about a concrete game, like chess, and if I show you a chess position that to you feels like a winning one, and I can show you that this is a losing one, then of course you will find this interesting
If this position happens to be a very niche example, you may find it less interesting than if this corresponds to something that you would be likely to encounter in a real game of chess
If you could convince yourself that you are better at chess after seeing that example, then you would probably be excited by this!
Now, I would like to say that great data should 'feel' exciting in a similar way
Taking examples from attending a lecture or class: my enjoyment of my time in the room largely (and quite obviously) depends on whether I feel I am getting something out of it
Example of what 'getting something' means
A simple base scenario is that I already care about a question, and I just learn a convincing answer to that question
Let's say that I am trying to prove something (or that I tried in the past and didn't succeed): if someone comes along with the solution (and I understand it), I will of course be excited
This base scenario is of course an easy one
In terms of games, we can say that there are specific games in our head that we care about (e.g. proving a specific theorem or fixing a problem in a computer simulation) and we get data that we find convincingly gives us a path towards winning that game
So, it is about a benchmarked capability differential: we see how much better the version of ourselves exposed to the data is at a specific trusted benchmark (and the benchmark can be to solve a problem, or more simply to find something that can be evaluated as a plausible path towards solving the problem -- for people that are not too delusional, say)
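One hedged way to write this differential down, in notation that is ours and not fixed anywhere else in these notes: let $A$ be the agent, $A_{+d}$ the same agent after being exposed to the data $d$, and $S_B$ the score on a trusted benchmark $B$.

```latex
\[
  \Delta_B(d \mid A) \;=\; S_B\!\left(A_{+d}\right) \;-\; S_B\!\left(A\right)
\]
```

Under this reading, the data $d$ is valuable to $A$ with respect to $B$ when $\Delta_B(d \mid A)$ is clearly positive, and the value depends on both the agent and the benchmark rather than on the data alone.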
How can we extend the above scenario?
The above scenario is obviously a dream one...
To some extent, earlier in my studies, I was 'living the dream', in that I had a number of questions I had gotten to care about and some of these were indeed covered in our courses, and that was amazing
But most of the content I find nowadays does not directly fit that bill: maybe I don't care about enough questions a priori, or people are not able to provide enough answers to questions I care about (probably a combination of both)
So, in many ways, one should settle for less...
For instance, let's say we fix a broad (but well-defined) area of interest, like statistical field theory
Then a talk can be interesting if we can imagine (a posteriori) a natural game that we get better at by learning something
For instance, we learn (or discover ourselves) a simple, natural question that is a priori mysterious, and we learn the answer
So, in that case, one could say that the a priori game is one of trying to find questions that are nontrivial in statistical field theory (this is more or less the game that a researcher is playing), and we are now a bit better at it (we learned of a winning move that we didn't know before)
Even More Generally
Assuming that we (hopefully) care about certain things, the natural trait that will make us care about data is some kind of general curiosity (that 'good data' will fulfill): a good sense of which games we may want to be good at. That sense can come from experience, where we discovered that playing certain games made us better at other games we already cared about
Schmidhuber has definitely studied things like artificial curiosity, though in slightly different terms
Ultimately, any data point is an answer to a question, so we shouldn't care about all questions... because we are bounded-energy beings, we should in fact be quite selective about what we accept as 'great data'
So, we learned to ascribe value to certain games based on their description (via a learned estimate of their transfer value), and hence we learned to appreciate things that could appear in different contexts
Somehow, I think one of my strengths as a researcher is being good at playing this game consciously and lucidly (this is also appreciated by some students)
And being curious (in a good sense, i.e. being able to distinguish signal from randomness, to find the true magic in things) means to have developed a good proxy sense of which games we get better at, and why these games are useful, by relating them (think of a web of games) to some 'fundamental basis'
Perspectives
These (abstract, but 'concretely refinable') considerations are what I think should be the basis of measuring the quality of some raw data or some data-generating environment (e.g. a new game)... for humans, it is very tiring to consume data only to find there is nothing to be done with it, but artificial agents don't suffer from this as much (keeping in mind that energy is not infinite)
If we implement them properly, these ideas can help us to set up infrastructure to seek and exchange data/knowledge in a self-organized way, with various actors centering their interest on certain classes of games and interacting fruitfully to acquire new meaningful capabilities... but this is a topic for a future post!
On a meta-level: a reason why I am excited about xent is that it has made me feel better at answering questions such as the value of data (or about the octopus, see 'About the Octopus')!
More General Proxies
... and Finding Intrinsic Value in Data!