July 10th, 2025
Why games and which games?
The idea of using games as benchmarks for AI is almost as old as the field itself
In the modern era of LLMs, games seem to have played only an understated role in evaluating the abilities of models, and their use in benchmarks has become more prevalent only recently (with Minecraft, the Pokemon game, the Vending Machine Bench, ...)
The Turing test was itself famously framed as an imitation game, and many of the early successes of the field (Deep Blue, AlphaGo, AlphaStar, OpenAI Five, DeepStack, ...) were about games
In addition to being a much more fun way to evaluate models, games bring a number of advantages
They are conceivably closer to "real-world scenarios" than "college exam questions"
Why use games to evaluate LLMs?
They can bring interesting solutions, giving a more qualitative insight into the strengths and weaknesses of models
They can give us a feel for "how the model reacts" and "what its style of play is"
Which games to use to evaluate LLMs?
I have always found it fascinating how much more energy can be poured into some specific games than into finding _good_ games
I remember once asking Garry Kasparov at a conference whether he ever wished the rules of chess were a bit different... interestingly, he said he had never thought about it
As a mathematician, I have also seen many colleagues happy to work with a theory once the axioms are laid down, without really asking themselves about the merit of the axioms
I also remember seeing a talk by Demis Hassabis in which he said that machines can play games like Go or chess, but that they can't _invent_ games, unlike humans (note that most humans can't invent games played throughout centuries either)
The game design challenge is very different from the game solving challenge, in that it is a priori much more of an art than a science
As a result of a lack of clarity on this question, the classical criterion (be it for games like Go or chess, or a field of mathematics) is merely historical or sociological (how long has the game been played, and by how many people?)
A nice feature of such a criterion is that it is of course quite simple and that it is hard to 'game' it.
But is historical relevance the best game selection criterion in the age of AI?
It is hard to be insensitive to the historical dimension of the question, but it is equally hard to think that this is the _best we can do_ if we are looking for AI, AGI, or ASI.
If we look at the world around us, it seems that everyone has a different idea of what this could mean (or no idea at all)... and yet progress is very fast on all sorts of benchmarks
I think that, echoing the ideas of Turing, such notions _are best probed (if not defined) by measuring performance at games_
This then raises the question: how should we pick games if we are after measuring general capabilities of LLMs? Is that even well-defined?
What is a game?
For the sake of this discussion, by a game we will just mean some protocol of interaction with fixed rules that awards scores to the players, with the goal of each player being to maximize their score; games can be perfect or imperfect information, single-player or multi-player, etc.
For the sake of concreteness, if we work with LLMs, we should think that games involve a textual interface, allowing models to play text strings based on text strings they receive
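To make this definition a bit more tangible, here is a minimal Python sketch of what such a textual game protocol could look like. Everything in it is an illustrative assumption rather than the interface of any actual benchmark: the names `TextGame`, `GuessNumber`, `play` and `make_bisection_player` are hypothetical, the game is a toy, and the scripted player merely stands in for a call to an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol

# A player maps the text it receives to the text it plays.
# With an LLM, this would wrap a model call (not shown here).
Player = Callable[[str], str]


class TextGame(Protocol):
    """Hypothetical interface: fixed rules, text in, text out, a final score."""

    def reset(self) -> str: ...
    def step(self, move: str) -> str: ...
    def done(self) -> bool: ...
    def score(self) -> float: ...


@dataclass
class GuessNumber:
    """Toy single-player game: find a hidden integer in as few turns as possible."""

    secret: int
    low: int = 1
    high: int = 100
    max_turns: int = 10
    turns: int = 0
    solved: bool = False

    def reset(self) -> str:
        self.turns, self.solved = 0, False
        return f"Guess an integer between {self.low} and {self.high}."

    def step(self, move: str) -> str:
        self.turns += 1
        try:
            guess = int(move.strip())
        except ValueError:
            return "invalid"  # malformed moves still cost a turn
        if guess == self.secret:
            self.solved = True
            return "correct"
        return "higher" if guess < self.secret else "lower"

    def done(self) -> bool:
        return self.solved or self.turns >= self.max_turns

    def score(self) -> float:
        # Fewer turns means a higher score; failing to solve scores zero.
        return float(self.max_turns - self.turns + 1) if self.solved else 0.0


def play(game: TextGame, player: Player) -> float:
    """Run one episode: feed each observation to the player until the game ends."""
    observation = game.reset()
    while not game.done():
        observation = game.step(player(observation))
    return game.score()


def make_bisection_player(low: int = 1, high: int = 100) -> Player:
    """Scripted stand-in for a model: binary search driven by the text feedback."""
    last: Optional[int] = None

    def player(observation: str) -> str:
        nonlocal low, high, last
        if observation == "higher" and last is not None:
            low = last + 1
        elif observation == "lower" and last is not None:
            high = last - 1
        last = (low + high) // 2
        return str(last)

    return player
```

Calling `play(GuessNumber(secret=37), make_bisection_player())` runs one episode and returns the score. The point of the sketch is only that the rules, the text exchanged, and the scoring are all fixed in advance, so any player that maps strings to strings can be evaluated on the same footing.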
In a future post, I will explain how we arrived at a reasonably satisfactory solution to this problem in our paper on Cross-Entropy Games
What makes games truly special?
In some sense, having a reproducible, long-term credible benchmark is all we need, because once we know what we want, we can work our way there
At the same time, having concrete games to play is great, because we have a clear way to reach higher capabilities, by reinforcement learning, for instance