the-value-of-data

October 24, 2025

Benchmarking Beyond Model Evaluation

The xent game idea covers two intertwined aspects of the LLM universe:

•

Games as a means to evaluate model capabilities

•

Games as ways to develop new model capabilities

Over the past few months we have shifted our focus towards the latter, as it gives our approach a distinctive edge over other evaluation mechanisms (alongside other advantages such as uncheatability, reproducibility, etc.)

At the same time, there are (at least) two distinctive features of benchmarking with our games that open new avenues:

•

It is very cheap and fast to run a benchmark based on a select family of games, and hence to measure progress continuously (which is impossible to do with votes)

•

The evaluation games can be customized to target specific tasks or classes of tasks: we have much more granularity in what we are measuring

What possibilities does this open?

Here are two:

•

We can evaluate new games (including non-xent games) based on a fixed collection of games: how much better do we get at the fixed collection by playing the new games?

•

We can evaluate the quality of data based on the same idea: if learning some data makes us better at playing some games, then that data is probably quite valuable; more generally, this applies to data-creating environments as well
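As a minimal sketch of the idea above, assume a toy setup where a "player" is just a function that returns a score for a game, and where training on data returns a new player. The value of a dataset (or of a new game) can then be estimated as the change in average score on a fixed collection of games. All names here (`benchmark_score`, `data_value`, `toy_train`) are illustrative assumptions, not part of any actual xent implementation.

```python
from statistics import mean
from typing import Callable, Sequence

# Hypothetical types: a player maps a game name to a score (higher is better),
# and a trainer returns a new player after exposure to some data.
Player = Callable[[str], float]
Trainer = Callable[[Player, Sequence[str]], Player]


def benchmark_score(player: Player, games: Sequence[str]) -> float:
    """Average score of a player over a fixed collection of games."""
    return mean(player(g) for g in games)


def data_value(player: Player, train: Trainer, data: Sequence[str],
               games: Sequence[str]) -> float:
    """Score gain on the fixed benchmark after learning from `data`."""
    before = benchmark_score(player, games)
    after = benchmark_score(train(player, data), games)
    return after - before


# Toy stand-in for training, purely for illustration.
def toy_train(player: Player, data: Sequence[str]) -> Player:
    # Each data item naming a game adds 1.0 to the score on that game.
    boost = {g: float(data.count(g)) for g in set(data)}
    return lambda game: player(game) + boost.get(game, 0.0)


baseline: Player = lambda game: 0.0
games = ["game_a", "game_b"]
useful = data_value(baseline, toy_train, ["game_a", "game_a"], games)
useless = data_value(baseline, toy_train, ["unrelated"], games)
print(useful, useless)
```

In this toy setting, data related to a benchmark game gets a positive value and unrelated data gets a value of zero; a realistic version would replace `toy_train` with actual training and the score function with actual game play.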

How to think about the value of data?

Obviously, determining the value of a game or of data can be extremely interesting... particularly if we want to think of a (somewhat) more decentralized future where various agents holding useful pieces of data could be incentivized to share them, allowing the data to realize its potential

I have long had a vision of data marketplaces where data would in some sense find its way to fit the needs of various agents producing the data

But for a long time, I saw a hard problem: it wasn't clear how to gauge the value of data / of knowledge

Where this idea comes from

I thought about my own learning experience: there were things that made more sense than others, that connected to earlier pieces of knowledge, and of course there were 'aha' moments in my life when I would learn something and that would change my understanding of the world, that would make me feel able to do things that were not possible before, etc...

But I was not exactly sure how to formulate this abstractly before thinking about xent games

(To be sure: this is a nice idea, but much needs to be specified, in particular in regards to which games we care about)

Let's start with a simple example

If you care about a concrete game, like chess, and I show you a chess position that feels like a winning one to you, and I can then show you that it is in fact a losing one, then of course you will find this interesting

If this position happens to be a very niche example, you may find it less interesting than if this corresponds to something that you would be likely to encounter in a real game of chess

If you could convince yourself that you are better at chess after seeing that example, then you would probably be excited by this!

Now, I would like to say that great data should 'feel' exciting in a similar way

Taking examples from attending a lecture or class: my enjoyment of my time in the room largely (and quite obviously) depends on whether I feel I am getting something out of it

Example of what 'getting something' means

A simple base scenario is that I already care about a question, and I just learn a convincing answer to that question

Let's say that I am trying to prove something (or that I tried in the past, and didn't succeed), then if someone comes with the solution (and I understand it), I will of course be excited

This base scenario is of course an easy one

In terms of games, we can say that there are specific games in our head that we care about (e.g. proving a specific theorem or fixing a problem in a computer simulation) and we get data that we find convincingly gives us a path towards winning that game

So, it is about a benchmarked capability differential: we see how much better the version of ourselves exposed to the data is at a specific trusted benchmark (and the benchmark can be to solve a problem, or more simply to find something that can be evaluated as a plausible path towards solving the problem -- for people that are not too delusional, say)
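One possible way to write this differential down (the notation is only an illustrative assumption, not something fixed elsewhere in this post): let $A$ be an agent, $A_D$ the same agent after being exposed to data $D$, and $S_B(\cdot)$ the score obtained on a trusted benchmark $B$. Then the benchmarked capability differential would be

```latex
\Delta_B(D) \;=\; S_B(A_D) \;-\; S_B(A)
```

and, in this reading, the data $D$ is valuable with respect to $B$ when $\Delta_B(D) > 0$.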

How can we extend the above scenario?

The above scenario is obviously a dream one...

To some extent, earlier in my studies, I was 'living the dream', in that I had a number of questions I had gotten to care about and some of these were indeed covered in our courses, and that was amazing

But most of the content I find nowadays does not directly fit that bill: maybe I don't care about enough questions a priori, or people are not able to provide enough answers to questions I care about (probably a combination of both)

So, in many ways, one should settle for less...

For instance, let's say we fix a broad (but well-defined) area of interest, like statistical field theory

Then a talk can be interesting if we can imagine (a posteriori) a natural game that we get better at by learning something

For instance, we learn (or discover ourselves) a simple, natural question that is a priori mysterious, and we learn the answer

So, in that case, one could say that the a priori game is to find questions that are nontrivial in statistical field theory (this is more or less the game that a researcher is playing), and we are now a bit better at it (we have learned a winning move that we didn't know before)

Even More Generally

Assuming that we (hopefully) care about certain things, the natural trait that will make us care about data is some kind of general curiosity (that 'good data' will fulfill): a good sense of which games we may want to be good at. That sense can come from experience, where we discovered that playing certain games made us better at other games we already cared about

Schmidhuber has definitely studied things like artificial curiosity, though in slightly different terms

Ultimately, any data point is an answer to some question, so we shouldn't care about all questions... because we are bounded-energy beings, we should in fact be quite selective about what we accept as 'great data'

So, we learned to ascribe value to certain games based on their description (via a learned estimate of their transfer value), and hence we learned to appreciate things that could appear in different contexts
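A very rough sketch of what such an estimate could look like, under strong simplifying assumptions: instead of predicting transfer from a game's description, it just averages gains observed in the past, and then weights them by how much we care about each target game. The names (`estimate_transfer`, `game_value`) and the toy log are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical log of past experience:
# (game played, game we care about, observed gain on that game).
Observation = Tuple[str, str, float]
Pair = Tuple[str, str]


def estimate_transfer(observations: Iterable[Observation]) -> Dict[Pair, float]:
    """Average observed gain on each target game from playing each source game."""
    totals: Dict[Pair, float] = defaultdict(float)
    counts: Dict[Pair, int] = defaultdict(int)
    for source, target, gain in observations:
        totals[(source, target)] += gain
        counts[(source, target)] += 1
    return {pair: totals[pair] / counts[pair] for pair in totals}


def game_value(game: str, transfer: Dict[Pair, float],
               importance: Dict[str, float]) -> float:
    """Estimated gains on target games, weighted by how much we care about them."""
    return sum(gain * importance.get(target, 0.0)
               for (source, target), gain in transfer.items()
               if source == game)


log = [
    ("puzzles", "chess", 0.5),
    ("puzzles", "chess", 1.5),
    ("trivia", "chess", 0.0),
]
transfer = estimate_transfer(log)
importance = {"chess": 2.0}
puzzle_value = game_value("puzzles", transfer, importance)
trivia_value = game_value("trivia", transfer, importance)
print(puzzle_value, trivia_value)
```

Here the value of a game comes entirely from the games we already care about; a learned version would generalize from past observations to games never played before, based on their description.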

Somehow, I think one of my strengths as a researcher is being good at playing this game consciously and lucidly (this is also appreciated by some students)

And being curious (in a good sense, i.e. being able to distinguish signal from randomness, to find the true magic in things) means to have developed a good proxy sense of which games we get better at, and why these games are useful, by relating them (think of a web of games) to some 'fundamental basis'

Perspectives

These (abstract, but 'concretely refinable') considerations are what I think should be the basis for measuring the quality of some raw data or of some data-generating environment (e.g. a new game)... for humans, it is very tiring to consume data only to find that nothing can be done with it, but artificial agents don't suffer from this as much (still keeping in mind that energy is not infinite)

If we implement them properly, these ideas can help us to set up infrastructure to seek and exchange data/knowledge in a self-organized way, with various actors centering their interest on certain classes of games and interacting fruitfully to acquire new meaningful capabilities... but this is a topic for a future post!

On a meta-level: a reason why I am excited about xent is that it has made me feel better at answering questions such as the value of data (or about the octopus, see 'About the Octopus')!

More General Proxies

... and Finding Intrinsic Value in Data!