Comment by nightski
6 days ago
I feel like you are ignoring or dismissing the word "interpolating", although a better word would likely be generalization. I'd make the claim that it's very hard to generalize without some form of world model. It's clear to me that transformers do have some form of world model, although not the same as what is being presented in V-JEPA.
One other nitpick is that you confine the discussion to "historical data", although models are also trained on other classes of data, such as simulated and generated data.
I didn't say generalisation, because there isn't any. Inductive learning does not generalise, it interpolates -- if the region of your future prediction (here, prompt completion) lies on or close to the interpolated region, then the system is useful.
Generalisation is the opposite process: hypothesising a universal and finding counter-examples to constrain that universal generalisation. E.g., "all fire burns" is hypothesised by a competent animal upon encountering fire once.
Inductive "learners" take the opposite approach: fire burns in "all these cases", and if you have a case similar to those, then fire will burn you.
They can look the same within the region of interpolation, but look very different when you leave it: all of these systems fall over quickly when more than a handful of semantic constraints are imposed. This number is a measure of the distance from the interpolation boundary (e.g., consider this interpretation of Apple's latest paper on reasoning in LLMs: the "environment complexity" is nothing other than a measure of interpolation-dissimilarity).
Early modern philosophers of science were very confused by this, but it's in Aristotle plain as day, and it's also been extremely well established since the 80s, as the development of formal computational stats necessitated making this clear: interpolation is not generalisation. The former does not get you robustness to irrelevant permutation (i.e., generalisation); it does not permit considering counterfactual scenarios (i.e., generalisation); it does not give you a semantics/theory of the data generating process (i.e., generalisation, i.e. a world model).
Interpolation is a model of the data. Generalisation requires a model of the data generating process; the former does not give you the latter, though it can appear to under strong experimental assumptions of known causal models.
Here LLMs model the structure of language-as-symbolic-ordering; that structure, "in the interpolated region", expresses reasoning, but it isn't a model of reasoning. It's a model of reasoning as captured in historical cases of it.
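To make the interpolation/generating-process distinction concrete, here's a minimal toy sketch of my own (nothing from the article; the setup is an assumption for illustration): fit a polynomial to noisy samples of sin(x) on [0, 2π]. Inside that region the curve fit is a fine summary of the data; step outside it and the fit diverges, while a model of the actual process (knowing it's a sine) does not.

    import numpy as np

    rng = np.random.default_rng(0)

    # Data generating process: y = sin(x), measured with noise on [0, 2*pi]
    x_train = np.linspace(0, 2 * np.pi, 50)
    y_train = np.sin(x_train) + rng.normal(0, 0.05, size=x_train.shape)

    # "Model of the data": a degree-9 polynomial curve fit to the measurements
    poly = np.polynomial.Polynomial.fit(x_train, y_train, deg=9)

    # Inside the interpolated region, the fit is an excellent summary...
    x_in = np.linspace(0, 2 * np.pi, 200)
    print("max error inside region: ", np.abs(poly(x_in) - np.sin(x_in)).max())

    # ...but it is not a model of the process: outside the region it diverges,
    # while the process model (sin itself) obviously does not.
    x_out = np.linspace(2 * np.pi, 3 * np.pi, 200)
    print("max error outside region:", np.abs(poly(x_out) - np.sin(x_out)).max())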
Aren’t there papers showing that there is some kind of world model emerging? Like representations of an Othello board that we would recognize were found and manipulated successfully in a small model.
There are two follow-up papers showing the representations are "entangled", a euphemism for statistical garbage, but I can't be bothered at the moment to find them.
However, the whole issue of Othello is a non sequitur, which indicates that the people involved here don't really seem to understand the issue, or what a world model is.
A "world model" is a model of a data generating process which isn't reducible-to or constituted by its measures. Ie., we are concerned for the case where there's a measurement space (eg., that of the height of mercury in a thermometer) and a target property space (eg., that of the temperature of the coffee). So that there is gap between the data-as-measure and its causes. In language this gap is massive: the cause of my saying, "I'm hungry" may have nothing to do with my hunger, even if it often does. For "scientific measuring devices", these are constructed to minimize this gap as much as possible.
In any case, with board games and other mathematical objects, there is no gap. The data is the game. The "board state" is an abstract object constituted by all possible board states. The game "is made out of" its realisations.
However the world isn't made out of language, nor coffee made out of thermometers. So a model of the data isn't a model of its generating process.
So whether an interpolation of board states "fully characterises", in some way, an abstract mathematical object ("the game") is so irrelevant to the question that it betrays a fundamental lack of understanding of even what's at issue.
No one is arguing that a structured interpolative model (i.e., one given an inductive bias by an NN architecture) doesn't express properties of the underlying domain in its structure. The question is what happens to this model of the data when you have the same data generating process, but you aren't in the interpolated region.
In the limit of large data, this problem cannot arise for abstract games, by their nature: e.g., a model that classifies inputs X into legal/illegal board states just is the game.
Another way of phrasing this: ML/AI textbooks often begin by assuming there's a function you're approximating. But in the vast majority of cases where NNs are used, there is no such function -- there is no function tokens -> meanings (e.g., "I am hungry" is ambiguous).
But in the abstract math case there is such a function: {boards} -> Legal|Illegal is a function, and there are no ambiguous boards.
So: of the infinite number of f* approximations to f_game, any is valid in the limit len(X) -> inf. Of the infinite number of f*_lang approximations to f_language, all are invalid (each in its own way).
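For contrast, here's a toy sketch of the abstract-game case (my own example, using tic-tac-toe as a stand-in rather than Othello, and with a deliberately simplified legality rule): the legality function is total over a finite, fully enumerable domain, so a perfect classifier of that domain just is the rule -- there is no gap left for a "world" behind the data.

    from itertools import product

    def is_legal(board):
        # board: 9 cells, each ' ', 'X' or 'O'. Treat a board as legal iff it
        # could arise with X moving first (ignoring early termination on wins,
        # a simplification for this sketch).
        x, o = board.count('X'), board.count('O')
        return x == o or x == o + 1

    # The domain is finite and enumerable; every input has a definite answer.
    all_boards = list(product(' XO', repeat=9))
    print(len(all_boards), "boards,", sum(map(is_legal, all_boards)), "legal")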
Could you give more details about what precisely you mean by interpolation and generalization? The commonplace use of “generalization” in the machine learning textbooks I’ve been studying is model performance (whatever metric is deemed relevant) on new data from the training distribution. In particular, it’s meaningful when you’re modeling p(y|x) and not the generative distribution p(x,y).
It's important to be aware that ML textbooks conditionalise every term on ML being the domain of study and, along with all of computer science, are extremely unconcerned with whether the words they borrow retain their meaning.
Generalisation in the popular sense (science, stats, philosophy of science, popsci) is about reliability and validity: validity = does the model track the target properties of the system we expect it to; reliability = does it continue to do so in environments in which those properties are present, but irrelevant permutations are made.
Interpolation is "curve fitting", which is almost all of ML/AI. The goal of curve fitting is to replace a general model with a summary of the measurement data. This is useful when you have no way of obtaining a model of the data generating process.
What people in ML assume is that there is some true distribution of measurements, and "generalisation" means interpolating the data so that you capture the measurement distribution.
I think it's highly likely there's a profound conceptual mistake in assuming measurements themselves have a true distribution, so even the sense of generalisation meaning "have we interpolated correctly" is, in most cases, meaningless.
Part of the problem is that ML textbooks frame all ML problems with the same set of assumptions (e.g., that there exists an f: X->Y, that X has a "true distribution" Dx, so that finding f* implies learning Dx). For many datasets, these assumptions are false. Compare running a linear regression on photos of the sky, from stars to star signs, vs. running it on V=IR electric circuit data to get `R`.
In the former case, there is no f_star_sign to find; there is no "true distribution" of star-sign measurements; etc. So any model of star signs cannot be a model even of measurements of star signs. ML textbooks do not treat "data" as having these kinds of constraints, or relationships to reality, which breeds pseudoscientific and credulous misunderstandings of the issues (such as, indeed, the Othello paper).
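To put the contrast in code (a sketch under the comment's own framing; the numbers and setup are made up): for V=IR data there really is a parameter R behind the measurements, so a least-squares fit recovers something about the generating process rather than merely summarising the data. Nothing analogous exists on the star-sign side -- there is no f to recover.

    import numpy as np

    rng = np.random.default_rng(1)

    R_true = 4.7                   # ohms -- the actual circuit parameter
    I = np.linspace(0.1, 2.0, 40)  # applied currents (amps)
    V = R_true * I + rng.normal(0, 0.02, size=I.shape)  # noisy voltmeter readings

    # Least-squares slope through the origin: R_hat = sum(I*V) / sum(I*I)
    R_hat = np.dot(I, V) / np.dot(I, I)
    print(f"estimated R = {R_hat:.3f} ohms")  # lands close to 4.7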