the-right-games

July 10th, 2025

Why games and which games?

•

The idea of using games as benchmarks for AI is almost as old as the field itself

•

In the modern era of LLMs, games seem to have played only a minor role in evaluating models, and game-based benchmarks have become more common only recently (Minecraft, Pokémon, Vending-Bench, ...)

•

The Turing test was itself famously framed as an imitation game, and many of the field's early successes (Deep Blue, AlphaGo, AlphaStar, OpenAI Five, DeepStack, ...) were about games

Why use games to evaluate LLMs?

In addition to being a much more fun way to evaluate models, games bring a number of advantages:

•

They are arguably closer to real-world scenarios than college exam questions are

•

They can elicit interesting solutions, giving more qualitative insight into the strengths and weaknesses of models

•

They can give us a feel for how the model reacts and what its style of play is

Which games to use to evaluate LLMs?

I have always found it fascinating how much more energy is poured into playing a few specific games than into finding good games

I remember once asking Garry Kasparov at a conference whether he ever wished the rules of chess were a bit different... interestingly, he said he had never thought about it

As a mathematician, I have also seen many colleagues happy to work within a theory once its axioms were laid down, without really questioning the merit of those axioms

I also remember seeing a talk by Demis Hassabis in which he said that machines can play games like Go or chess, but that, unlike humans, they cannot invent games (note that most humans cannot invent games that are played for centuries either)

The game design challenge is very different from the game solving challenge, in that it is a priori much more of an art than a science

As a result of the lack of clarity on this question, the classical criterion (be it for games like Go or chess, or for a field of mathematics) is merely historical or sociological: how long has the game been played, and by how many people?

A nice feature of such a criterion is that it is of course quite simple and that it is hard to 'game' it.

But is historical relevance the best game selection criterion in the age of AI?

It is hard to be insensitive to the historical dimension of the question, but it is equally hard to believe that this is the best we can do if we are looking for AI, AGI, or ASI

If we look at the world around us, it seems that everyone has a different idea of what these terms could mean (or no idea at all), even as progress is very fast on all sorts of benchmarks

I think that, echoing Turing's ideas, such notions are best probed (if not defined) by measuring performance at games

This then raises the question: how should we pick games if our aim is to measure the general capabilities of LLMs? Is that question even well-defined?

What is a game?

For the sake of this discussion, by a game we will just mean some protocol of interaction with fixed rules that awards scores to the players, with the goal of each player being to maximize their score; games can have perfect or imperfect information, be single-player or multi-player, etc.

For the sake of concreteness, if we work with LLMs, we should think of games as having a textual interface: models receive text strings and play text strings in response
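
To make this definition a bit more tangible, here is a minimal Python sketch of what such a text-based game protocol could look like. Everything in it is an assumption made for illustration: the `TextGame` interface, the toy `GuessNumber` game, the `run` loop and the bisection player are hypothetical names, and this is not the construction from the Cross-Entropy Games paper. A player is simply a function from received text to played text, which is where an LLM call would go.

```python
from typing import Callable, List, Optional, Protocol

# A player maps the text it receives to the text it plays.
# For an LLM this would wrap a model call; here it is a plain function.
Player = Callable[[str], str]


class TextGame(Protocol):
    """Hypothetical interface: fixed rules, text in, text out, scores at the end."""

    def observe(self, player: int) -> str: ...       # text shown to `player`
    def to_move(self) -> Optional[int]: ...          # whose turn it is, None when over
    def play(self, player: int, move: str) -> None: ...
    def scores(self) -> List[float]: ...             # one score per player


class GuessNumber:
    """Toy single-player game: find a secret integer in [1, 100].

    The score is minus the number of moves, so fewer guesses is better.
    """

    def __init__(self, secret: int) -> None:
        self.secret = secret
        self.guesses = 0
        self.feedback = "Guess an integer between 1 and 100."
        self.done = False

    def observe(self, player: int) -> str:
        return self.feedback

    def to_move(self) -> Optional[int]:
        return None if self.done else 0

    def play(self, player: int, move: str) -> None:
        self.guesses += 1  # invalid moves also count against the score
        try:
            guess = int(move.strip())
        except ValueError:
            self.feedback = "Invalid move, please answer with an integer."
            return
        if guess == self.secret:
            self.done = True
            self.feedback = "correct"
        else:
            self.feedback = "higher" if guess < self.secret else "lower"

    def scores(self) -> List[float]:
        return [-float(self.guesses)]


def run(game: TextGame, players: List[Player], max_turns: int = 1000) -> List[float]:
    """Alternate observe/play until the game ends or the turn budget runs out."""
    for _ in range(max_turns):
        current = game.to_move()
        if current is None:
            break
        game.play(current, players[current](game.observe(current)))
    return game.scores()


def make_bisection_player() -> Player:
    """A scripted stand-in for a model: binary search driven by the text feedback."""
    low, high, last = 1, 100, 0

    def player(text: str) -> str:
        nonlocal low, high, last
        if text == "higher":
            low = last + 1
        elif text == "lower":
            high = last - 1
        last = (low + high) // 2
        return str(last)

    return player


final_scores = run(GuessNumber(secret=37), [make_bisection_player()])
```

The point of the sketch is only that the rules live entirely in the game object, while the players see and produce nothing but strings; multi-player or imperfect-information games would fit the same interface by having `observe` return different text to different players.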

In a future post, I will explain how we arrived at a reasonably satisfactory solution to this problem in our paper on Cross-Entropy Games

What makes games truly special?

In some sense, having a reproducible, long-term credible benchmark is all we need, because once we know what we want, we can work our way there

At the same time, having concrete games to play is great, because we have a clear way to reach higher capabilities, by reinforcement learning, for instance