June 17th, 2025
Using LLMs Beyond Generative Sampling
The theoretical foundations behind the training of LLMs are very clean and transparent, and have been understood at least since Claude Shannon: predicting the probabilities of the next token in a text is precisely what one would do when trying to compress that text
The sense that successfully compressing is tantamount to understanding underpins old ideas (Occam's Razor) and is for instance the basis of many IQ tests, which often read as follows: try to find the simplest explanation possible for a given sequence of symbols, and use that explanation to predict the subsequent symbol
These ideas were then developed further by Kolmogorov and Solomonoff in their formal descriptions of what we should aim for when building intelligent systems (typically taking into account the computational nature of descriptions, etc.)
Now, the relatively surprising thing with LLMs is that they do so well on large corpora of textual data: they manage to compress very large datasets further and further (as we increase their size and training time), which is only possible if some genuinely nontrivial structure in the data is captured (of course, they will not always hit the theoretical limit: if they did, they could break modern ciphers, which they can't)
And all of this is well understood, with the caveat that the optimal compression is typically uncomputable (naively one could say, 'would take an infinite amount of time')
Anyway, these ideas have been there from the beginning, and they have naturally led to training models to minimize cross-entropy losses: the cross-entropy measures the 'surprise' of the LLM when discovering the data, or in other words how much we would have to 'help' the model (i.e. give it 'hints') to produce the correct text (if we do things properly, e.g. by using arithmetic coding relative to the probabilities predicted by the model)
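The link between cross-entropy and compression can be made concrete in a few lines. The sketch below is purely illustrative (a hand-made uniform character model stands in for an LLM, and `bits_to_encode` is a hypothetical helper, not anything from the original): it computes the number of bits an ideal arithmetic coder would need to encode a text using a model's predicted probabilities, which is exactly the model's total cross-entropy on that text.

```python
import math

# Illustrative sketch: cross-entropy = ideal code length.
# An arithmetic coder driven by a model's next-symbol probabilities
# spends -log2 p(symbol) bits per symbol; summing these gives the
# model's total cross-entropy on the text.

def bits_to_encode(text, predict_probs):
    """Total bits = sum over positions of -log2 p(next char | prefix)."""
    total = 0.0
    for i, ch in enumerate(text):
        probs = predict_probs(text[:i])  # model's next-char distribution
        total += -math.log2(probs[ch])
    return total

def uniform_model(prefix, alphabet="ab"):
    # A maximally ignorant model: every character is equally likely,
    # regardless of the prefix seen so far.
    return {c: 1.0 / len(alphabet) for c in alphabet}

# Under the uniform model over {a, b}, each character costs exactly
# 1 bit, so "abab" costs 4 bits in total.
print(bits_to_encode("abab", uniform_model))  # 4.0
```

A better model (one assigning higher probability to the actual continuations) would yield a strictly smaller bit count, which is the sense in which lowering the cross-entropy loss is the same thing as compressing the training data further.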
(A few words on how we got to think of Cross-Entropy Games)
And naturally, we end up with (the pre-trained versions of) LLMs, which are true gems of knowledge: they capture very delicate structures in the data they are fed, and they understand it (at least in some sense) quite deeply: this was famously exemplified by GPT-2 and GPT-3
The most prevalent application of pre-trained models is their fine-tuned versions, which allow one to channel their subtle understanding towards specific applications (chatbots, agents), and (to a lesser extent) the embeddings of data that they produce
At the same time, the raw outcome of the pre-training, namely the predicted probabilities for the next tokens, is usually discarded (and is not available for closed-source models)
How could we use the prediction probabilities?
If instead of data generation (e.g. producing a textual answer), we think of data analysis, then the prediction probabilities are actually very interesting objects
The most basic thing that they can tell us is how surprising a 'scenario' (i.e. a sequence of tokens) is in light of the model
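This surprise measure can be sketched directly. The following is a toy illustration under stated assumptions (a character bigram model trained on a tiny corpus stands in for the LLM; `train_bigram` and `surprise` are hypothetical names, not from the original): the surprise of a scenario is its total negative log-probability, in bits, under the model.

```python
import math
from collections import Counter, defaultdict

# Toy stand-in for an LLM: a character bigram model with add-one
# smoothing. The 'surprise' of a sequence is its summed surprisal,
# -log2 p(next | previous), over all transitions.

def train_bigram(corpus, alphabet):
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    def prob(prev, ch):
        # Add-one smoothing keeps unseen transitions at finite surprise.
        return (counts[prev][ch] + 1) / (sum(counts[prev].values()) + len(alphabet))
    return prob

def surprise(seq, prob):
    # Total surprisal in bits, summed over transitions.
    return sum(-math.log2(prob(a, b)) for a, b in zip(seq, seq[1:]))

alphabet = "ab"
prob = train_bigram("abababababab", alphabet)
# 'ababab' follows the pattern the model has learned;
# 'aaaaaa' violates it, so it should register as far more surprising.
print(surprise("ababab", prob) < surprise("aaaaaa", prob))  # True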
If we trust the model has learned something about the world, this becomes a good way to identify interesting information...
If I look e.g. at a proof of a theorem that I don't know, I try to predict how the proof is going to go; the parts I could not predict in advance are, from my point of view, the _interesting parts_ of the proof
But why would this be interesting information?
Then, I typically try to push a bit further: _why were these in fact not predicted_ (i.e. how could I fail to see such an argument?)
Often, this leads me to re-evaluate my way of thinking about the mathematical objects... and I try to do so by finding some fairly low-surprise a priori explanations (low-surprise from my point of view, putting myself in the shoes of the version of me that hadn't seen the proof of the theorem), such that _given those_ (but not the proof), I would be able to find the proof
And it is only once I have found these low-surprise explanatory elements that I consider I have really understood the proof
This can take some time, sometimes years for fairly difficult theorems (for instance, it has taken me years of teaching to understand the complex-analytic proof of the prime number theorem), and it requires some intellectual honesty (and a good mental model of a past self that does not yet know what we have since learned, which is needed to even start this process)
Now, a great thing with LLMs is that they can be given a piece of information that was not in their training set (in their prompt, say) or not given it, and we can compare how surprised the LLM is in each case
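This counterfactual comparison can be sketched as follows. Everything here is illustrative (the `copying_model` toy stands in for an LLM, and all function names are hypothetical): we score the same continuation twice, once with a hint in the context and once without, and look at the drop in surprise that the hint produces.

```python
import math

# Hedged sketch of the counterfactual comparison: condition the model
# on a hint, or withhold it, and measure the difference in surprise of
# the same continuation. With a real LLM, surprise would come from the
# summed negative log-probabilities of the continuation's tokens.

def surprise_bits(continuation, context, predict_probs):
    """Surprisal of `continuation` given `context`, in bits."""
    total, prefix = 0.0, context
    for ch in continuation:
        total += -math.log2(predict_probs(prefix)[ch])
        prefix += ch
    return total

def copying_model(prefix):
    # Toy model: characters already seen in the prefix become much
    # more likely; unseen characters stay near-uniform.
    alphabet = "abcd"
    seen = set(prefix) & set(alphabet)
    weights = {c: (10.0 if c in seen else 1.0) for c in alphabet}
    z = sum(weights.values())
    return {c: w / z for c, w in weights.items()}

hint, continuation = "d", "dd"
gain = (surprise_bits(continuation, "", copying_model)
        - surprise_bits(continuation, hint, copying_model))
# A positive gain means the hint made the continuation less surprising.
print(gain > 0)  # True
```

The size of this gain, in bits, quantifies how much the given information 'explains' the continuation, which is exactly the comparison a human can only approximate by pretending not to know something.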
I will focus on an example coming from my mathematical experience, as a researcher and teacher
Interestingly, this whole process does not seem to be the way that math books are written (in fact, I don't know any math book that tries to teach the reader to re-construct a subject from elements of information they may know; the closest I have seen were exercise-centered books with very long problems that re-prove genuine theorems)
In other words, we can really do the 'honest' version of what I described above, i.e. really compare a model that would know something to one that wouldn't know that thing, and measure their difference in 'surprise'
Still... I can say that the above _does work_, both in terms of teaching (students are happy) and in terms of research (I have often been able to prove something new once I had reached a satisfactory level of understanding of a proof of a known result)
For me, this _single_ application is enough to justify the use of LLM cross-entropies (as a measure of surprise) beyond the generative setup: being able to perform this type of _counterfactual thinking_ is in fact already superhuman (humans can only try to approximate their ignorance of a certain concept when judging the importance or value of something in light of some information)