Some Thoughts on Safety and Meta-Sampling
February 18th, 2026
The question of the safety of AI systems is worth asking in the context of the xent meta-sampling approach to AGI, where a curriculum of games is dynamically sampled to build model capabilities.
We call this process _cognitive_ training (thanks Mihir for the suggestion!): a meta-objective mu is used to measure the usefulness of a candidate new game for teaching the current model to become more capable than previous versions of itself on the set of previously used games, while also expanding its skillset toward new games that have not yet appeared in the curriculum...
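To make this concrete, here is a minimal toy sketch of such a meta-sampling loop, with models and games reduced to plain skill vectors. Everything here is a hypothetical illustration, not the actual xent machinery: the names (`score`, `train_on`, `novelty`, `meta_sampling_step`) and the particular form of `mu` are assumptions chosen only to show the shape of the process.

```python
import random

def score(model, game):
    # Toy performance of a model (skill vector) on a game (demand vector).
    return sum(m * g for m, g in zip(model, game))

def train_on(model, game, lr=0.1):
    # Toy training: nudge the model's skills toward the game's demands.
    return [m + lr * g for m, g in zip(model, game)]

def novelty(game, past_games):
    # Distance from the candidate to the closest previously used game.
    if not past_games:
        return 1.0
    return min(sum((a - b) ** 2 for a, b in zip(game, p)) ** 0.5
               for p in past_games)

def mu(candidate, model, past_games):
    # Hypothetical meta-objective: reward candidates whose training keeps
    # (or improves) performance on past games, while being novel themselves.
    trained = train_on(model, candidate)
    retention = sum(score(trained, g) - score(model, g) for g in past_games)
    return retention + novelty(candidate, past_games)

def meta_sampling_step(model, past_games, rng, n_candidates=8, dim=4):
    # Sample candidate games and keep the one the meta-objective prefers.
    candidates = [[rng.uniform(-1, 1) for _ in range(dim)]
                  for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: mu(c, model, past_games))
    return train_on(model, best), past_games + [best]

rng = random.Random(0)
model, curriculum = [0.0] * 4, []
for _ in range(20):
    model, curriculum = meta_sampling_step(model, curriculum, rng)
```

The static part is the point: `mu` itself never changes across iterations; only the curriculum it selects does.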
Xent Meta-Sampling
The idea is that once the meta-sampler has been through enough data, it will discover genuinely new games, mostly unknown to humans as of now. This is in fact what we mean by AGI: a model's ability to grow somewhat spontaneously, generating its own challenges and tackling them, thereby unlocking a never-ending process, an open-ended stream of capabilities. This picture is quite exciting, I believe, but it also raises safety concerns; let me discuss some of them.
The Need for some Open-Endedness
The whole point of AGI is that we are (somewhat) lazy and incompetent: we don't have enough time (or nerve) to specify all the tasks that models should learn, and in some sense we are also not entirely sure which tasks they will have to deal with in the future... so, in order to be _competitive_, we will need to _delegate_ the construction of the curriculum to an AGI, or something like it!
This question of competitiveness is important in the context of safety: it is in fact trivial to design a safe system, since a system that is weak enough is probably quite safe.
The Risk of Open-Endedness
We are not there yet, but one of the ideas that comes to mind when trying to build very powerful models is to have models write other models.
It is not hard to see the risk in such systems: they could go off the rails in various directions, and predicting what could happen (given that they become more and more capable) is conceivably an impossible task past a certain capability level.
Two Key Features of the Meta-Sampling Process
Within the framework of xent games, there are, I think, two key features worth noting.
The first one is about misalignment: if a model strives to become more general rather than more and more specialized, this helps prevent "paperclipping".
Said slightly differently: a generalist model will spend more time growing general skills than hacking rewards, and will thereby invest energy into understanding the broad context within which its goals lie, limiting the risks of mis-specification and misalignment: it limits the risk of "stupid" errors by being balanced!
The second one is about the use of a static meta-objective: while an automatically designed curriculum allows a model to remain competitive in terms of performance, fixing the meta-objective over time makes it easier to study the possible outcomes. Abrupt phase transitions in abilities become less likely, and it remains clear "what the model wants": to learn more and more skills, as measured in light of past games.
Failure modes (if they exist) can be studied in light of the meta-objective and (hopefully) addressed by adjusting it; it is in fact not even unreasonable to think that a model trained via such a process could itself help analyze the dynamics of the meta-objective!