make-a-wish

June 9th, 2026

On Misalignment

Of all the risks of AGI, one of those we hear about the most is the alignment problem

I think if we have a principled and conceptually transparent way to build an AGI, the risk of a catastrophic event due to some ambiguity in a prompt is minimal

When we prompt a model, there is the question of what we have in mind, i.e. what the prompter is seeking; that may be ambiguous, but it can be disambiguated in good faith

Is Good Faith Well Defined?

The first word spoken by a human was not a lie, for otherwise words would not exist

Truths can exist without falsehoods

The main thing I want to get to is the converse:

Falsehoods can only exist as a parasitic appendage to truths: the main reason we may believe a falsehood is that the vast majority of things said are true

Bad-faith interpretation is not the default; it requires some nontrivial intervention. If a model has been honest in every situation in which it was evaluated, and it was trained in a transparent way, it is safe to assume that it is honest

There can of course be things like "be honest about everything, except ...", but these need to be fed into the model; they are not spontaneously emergent

A 'Nothing Up My Sleeve' Path

I simply think that if we can build an AGI from a base model of 2025 or 2026, if we can be completely transparent about the process by which we trained it, if it runs in public, if its only goal has been to learn to be intrinsically better (via the Xent Game meta-objective) while improving on a number of simple external benchmarks (like mathematics and other standard subjects), and if on numerous tests it acts honestly and in an aligned way, then we can just trust it going forward

Similarly, if we prompt it with tasks that were time-stamped well before its construction, it seems impossible that anything could have been maliciously planted ahead of time

Keep It Simple

As my mother would say: just do not make things too complicated, do not think of twisted scenarios; just think like you did when you were a child

I had a happy childhood and I was certainly not misaligned

A Path to a Non-Crazy Future

A simple vision for the future is that whatever we end up building beyond a certain scale runs in a clean fashion: we see what goes into a model, we see what comes out of it

Models do not need to be gigantic; they can just shard tasks into many other tasks, and they speak with each other in natural language

This way, we can simply build very competitive systems that are stable and not incentivized to do anything strange or non-natural

We can do a lot of research by sharding tasks amongst extremely capable agents, and have them collaborate in a transparent way
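
To make the picture a bit more concrete, here is a minimal sketch of what "sharding tasks in a transparent way" could look like. Everything in it is an assumption made for illustration: `Agent`, `Transcript` and `run_sharded` are hypothetical names, an agent is just a function from text to text, and the way a task gets split is left to whoever calls it. It is a toy, not a description of any actual system

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a model: natural language in, natural language out.
Agent = Callable[[str], str]


@dataclass
class Transcript:
    """Public, append-only record of every message that is exchanged."""
    entries: List[str] = field(default_factory=list)

    def log(self, sender: str, receiver: str, text: str) -> None:
        self.entries.append(f"{sender} -> {receiver}: {text}")


def run_sharded(
    task: str,
    shard: Callable[[str], List[str]],
    agents: List[Agent],
    transcript: Transcript,
) -> List[str]:
    """Split a task into subtasks, hand them to agents round-robin,
    and record every input and every output in the transcript."""
    answers = []
    for i, subtask in enumerate(shard(task)):
        index = i % len(agents)
        name = f"agent-{index}"
        # What goes into the agent is visible...
        transcript.log("coordinator", name, subtask)
        answer = agents[index](subtask)
        # ...and so is what comes out of it.
        transcript.log(name, "coordinator", answer)
        answers.append(answer)
    return answers
```

For instance, calling `run_sharded` with a `shard` function that splits the task text on " and ", and a single toy agent that just echoes its subtask, leaves two transcript entries per subtask: one for what went in, one for what came out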

We should just make sure that models also have a happy childhood (many thanks to George Walker for first bringing that idea to my mind!)

Anyway, that is the way academia is supposed to work... it is not much more complicated than that

No secrets, no weird incentives, just machines tackling the challenges they have been assigned and processing information in the best way they can; no meta-game theory; games are only useful inasmuch as they teach us stuff, and that's it!

Warning: This is a Very Naive Way of Thinking

But that kind of is the point: naive is good!

Can We Prove This?

I am not sure about this, but I think it is a bit like proving that the laws of physics cannot change over time or across space. It would just be so surprising if the world were so complicated that some training process would yield models that are honest at every scale we have tested (as intuition suggests) but start to be dishonest at a larger scale, without a clear incentive or reason for that

Make a Wish

You have a genie in a lamp; will you trust it to understand your wishes correctly?