June 9th, 2026
On Misalignment
One of the risks of AGI we hear about the most is the alignment problem
I think if we have a principled and conceptually transparent way to build an AGI, the risk of a catastrophic event due to some ambiguity in a prompt is minimal
When we prompt a model, there is the question of what we have in mind, i.e. what is sought by the prompter... and that may be ambiguous, but that can be disambiguated in good faith
Is Good Faith Well Defined?
The first word spoken by a human was not a lie, for otherwise words would not exist
Truths can exist without falsehoods
Conversely, falsehoods can only exist as a parasitic appendage to truths: the main reason we may believe a falsehood is that the vast majority of things said are true
The main thing I want to get to is the following:
Bad faith interpretation is not the default; it requires some nontrivial intervention; if a model is honest in all situations where it was evaluated, and it was trained in a transparent way, it is _safe to assume_ that it is honest
There can of course be things like "be honest for everything, except ...", but these _need to be fed into the model_; they _are not spontaneously emergent_
A 'Nothing Under my Sleeves' Path
I simply think that if we can build an AGI from a base model of 2025 or 2026, and if we can be completely transparent about the process by which we trained it, and if it runs in public... and if its only goal has been to learn to be intrinsically better (via the Xent Game meta-objective), while improving on a number of simple external benchmarks (like mathematics and other standard subjects), and if, on numerous tests, it acts honestly and in an aligned way, then we can just trust it going forward
Similarly, I think that if we prompt it with tasks that were time-stamped well before its construction, it seems impossible that anything could have been maliciously planted ahead of time
Keep It Simple
As my mother would say: just do not make things too complicated, do not think of twisted scenarios; just think like when you were a child
I had a happy childhood and I was certainly not misaligned
A Path to a Non-Crazy Future
A simple vision for the future is that whatever we end up building beyond a certain scale runs in a clean fashion: we see what goes into a model, we see what comes out of it; models do not need to be gigantic; they can just shard tasks into many other tasks, and they speak with each other in natural language; this way, we can simply build very competitive systems that are stable and not incentivized to do anything strange or non-natural; we can do a lot of research by sharding tasks amongst extremely capable agents, and have them collaborate in a transparent way
We should just make sure that models also have a happy childhood (many thanks to George Walker for first bringing that idea to my mind!)
Anyway, that is the way academia is supposed to work... it is not much more complicated than that
No secrets, no weird incentives, just machines tackling the challenges they have been assigned and processing information in the best way they can; no meta-game theory; games are only useful inasmuch as they teach us stuff, and that's it!
Warning: This is a Very Naive Way of Thinking
But that kind of is the point: naive is good!
Can we Prove This?
Not sure about this, but I think it is a bit like proving that the laws of physics cannot change in time or across space; it would just be so surprising if the world were so complicated that some training process would yield models that are honest at every scale we can check (as per intuition) but start to be dishonest at some larger scale, without a clear incentive or reason for that
You have a genie in a lamp; will you trust it to understand your wishes correctly?