Comment by Intrinisical-AI
1 day ago
That's fascinating! But I don't fully agree with the framing.
Using G(t) in the context of embeddings seems problematic, especially given the probabilistic nature of the setting.
Example: take a sentence with a typo but semantically clear and correct (let's suppose): "The justice sistem is corrupt."
G(t) = 0, right? But semantically, it's close to G(t) → 1.
Instead of focusing on exact validity, which seems too rigid for something as ambiguous and context-dependent as language, what if we focused on _approximate semantic trajectories_?
You wrote:
> "If you have two points in the embedding space which represent well-formed sequences and draw a line that interpolates between them you'd think that there would have to be points in between that correspond to ill-formed sequences."
In my view, it's actually the opposite:
> If the embedding model captures meaningful structure, and you account for geometric properties like curvature, local density, and geodesics, then the path between those two points should ideally trace semantically valid (even "optimal", if that exists) reasoning.
The problem isn't that interpolation fails; it's that we're interpolating linearly in a space that likely isn't flat!
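To make that concrete, here is a rough sketch of what I mean by a trajectory check (assuming sentence-transformers; the model name and the toy corpus are placeholders, not anything from the article):

```python
# Rough sketch: walk the straight line between two sentence embeddings and, at
# each step, see how close the interpolated point stays to the "populated"
# region, i.e. its nearest neighbour in a small reference corpus.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # placeholder model choice

corpus = [
    "The justice system is corrupt.",
    "The justice sistem is corrupt.",   # typo'd variant: G(t)=0, semantics intact
    "Courts are failing ordinary people.",
    "The weather is nice today.",
    "Bananas are rich in potassium.",
]
corpus_emb = model.encode(corpus, normalize_embeddings=True)

a, b = model.encode(
    ["The justice system is corrupt.", "Courts are failing ordinary people."],
    normalize_embeddings=True,
)

for t in np.linspace(0.0, 1.0, 6):
    p = (1 - t) * a + t * b            # naive *linear* interpolation
    p = p / np.linalg.norm(p)          # re-project onto the unit sphere
    sims = corpus_emb @ p              # cosine similarity to the corpus
    j = int(np.argmax(sims))
    print(f"t={t:.1f}  nearest: {corpus[j]!r}  sim={sims[j]:.3f}")
```

If the similarity dips in the middle of the walk, that's the straight line leaving the populated region; a geodesic-style path (hopping through nearest neighbours instead of interpolating directly) is the kind of "approximate semantic trajectory" I have in mind.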
Thanks for your comment. Lmk what you think :)
You're right that "systems competent in language" (whether humans or LLMs) are able to accept and understand slightly wrong sequences but generate correct sequences almost all of the time. (Who hasn't made a typo when talking to a chatbot and had the chatbot ignore the typo and respond correctly?)
Treating G(t) as a binary function works for linguists who need a paradigm to do "normal science", but Chomsky's theory has not been so useful for building linguistically competent machines, so there have to be serious things wrong with that theory.
Still, the vast majority of sequences t are gibberish that is nowhere near being valid. If those gibberish sequences are representable in the embedding space and took up a volume anywhere near their numerical prevalence, I can only imagine that in a (say) N=3000 embedding space there is something like a manifold that is N=2999 or N=2998 or N=1500 or something inside the flat embedding space; that structure would be the non-flat embedding you're looking for, or an approximation to it.
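A crude way to probe that guess: embed a pile of ordinary sentences and count how many principal components it takes to explain most of the variance. PCA only sees linear structure, so this is at best an upper bound on any "manifold" dimension; the model, the corpus file, and the 95% threshold below are arbitrary placeholders:

```python
# Crude intrinsic-dimension probe: how many principal components does it take
# to cover most of the variance of embeddings of ordinary sentences?
import numpy as np
from sklearn.decomposition import PCA
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")   # 384-dim stand-in for N=3000

# some_corpus.txt: any plain-text file with one sentence per line (placeholder)
sentences = [line.strip() for line in open("some_corpus.txt") if line.strip()]
emb = model.encode(sentences, normalize_embeddings=True)

pca = PCA().fit(emb)
cum = np.cumsum(pca.explained_variance_ratio_)
k95 = int(np.searchsorted(cum, 0.95)) + 1
print(f"ambient dim = {emb.shape[1]}, components for 95% of variance = {k95}")
```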
It might be that it is not really a manifold, or has different dimensionalities in different places, or even fractional dimensionalities. For instance you'd hope that it would geometrically represent semantics of various sorts, as suggested by the graphs here
https://nlp.stanford.edu/projects/glove/ [1]
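If you want to poke at that structure without trusting a 2-D plot, vector arithmetic on the pretrained vectors is more honest. A quick sketch, assuming gensim and its downloadable glove-wiki-gigaword-50 vectors:

```python
# Probing GloVe's "semantic directions" with vector arithmetic rather than a
# 2-D projection.  Assumes gensim's downloader and the small pretrained
# glove-wiki-gigaword-50 vectors.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-50")

# classic analogy: king - man + woman ~ queen
print(glove.most_similar(positive=["king", "woman"], negative=["man"], topn=3))

# a rougher "semantic direction": country -> capital
print(glove.most_similar(positive=["paris", "germany"], negative=["france"], topn=3))
```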
So I've thought a lot about symplectic spaces in higher dimensions, where area has to be conserved over various transformations (the propagator), and maybe this has led me to think about it in totally the wrong way: maybe the flat embedding space doesn't devote a large volume to gibberish because it was never trained to model gibberish strings, which has to have interesting implications if that is true.
Something else I think of is John Wheeler's idea of superspace in quantum gravity where, even though space-time looks like a smooth manifold to us, the correct representation in the quantum theory might be discrete: maybe for points a, b there are the possibilities that (1) a and b are the same point, (2) a is the future of b, (3) b is the future of a, or (4) a and b are not causally connected. So you have this thing which exists on one level as something basically symbolic but looks like a manifold if you live in it and you're much bigger than the Planck length.
But to get to an answer to "why do we flatten it?": we're not flattening it deliberately, the "flattening" is done by the neural network and we don't know another way to do it.
[1] ... which I don't really believe; of course you can project 20 points out of a 50-dimensional embedding into an N=2 space and have the points land wherever you want!
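That last point is easy to check numerically: with 20 points in 50 dimensions the linear system X W = Y is underdetermined, so a least-squares solve hits essentially any 2-D target layout exactly. A toy demonstration, nothing more:

```python
# With more dimensions (50) than points (20), a linear projection to 2-D can
# place the points wherever you like: X @ W = Y has 20 equations per output
# dimension and 50 unknowns, so it is generically solvable exactly.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 50))          # 20 "embeddings" in 50-D
Y = rng.uniform(-1, 1, size=(20, 2))   # arbitrary target 2-D layout

W, *_ = np.linalg.lstsq(X, Y, rcond=None)   # minimum-norm exact solution
print("max placement error:", np.abs(X @ W - Y).max())   # ~1e-13
```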
Wow! You brought up several deep ideas that deserve unpacking step by step (as if we were LLMs):
- On the manifold being "high-dimensional" (e.g., 2999): I get your intuition; the set of valid linguistic sequences is tiny relative to the space of all possible strings, but still enormously rich and varied. So the valid set doesn't fill the whole space, but it also can't live in a low-dimensional manifold like 20D. But I'm also not so sure about that: how many ways are there to give an accurate response? Hard to argue there are many more than one, and hard to argue that even one of them is completely correct. _There must be some sort of "clustering"_.
- On domain-specific manifolds and semantic transitions: 100% agree with your idea that different domains induce distinct geometric structures in embedding space, and even that the idea of a "simple manifold" seems too optimistic. But what about "regions" with common (geometric / topological) properties? E.g., physics should (?) form a dense, structured region, and I'd guess there are common patterns between the implicit structure of its subspace and that of maths, for example. The semantic trajectories inside each domain will follow specific rules, but patterns must exist, and there should also be transitional zones or "bridges" between them. I relate the emergent abilities of LLMs to this (what are LLMs but transformers of vectorial representations, taken by "views / parts / projections", e.g., multi-attention heads?).
What if we hypothesize about chart atlases: multiple local coordinate systems with smooth transition maps? Maybe a patchwork of overlapping manifolds, each shaped by domain-specific usage, linked by pathways of analogy or shared vocabulary. Even if this is the case (we're only guessing), the problem is that neither the computational costs nor the interpretation are trivial (see the small sketch after this list).
- On GloVe and the projection fallacy: I take your point; you can always "cherry-pick" the best-looking examples to tell your story haha
- On symplectic spaces: I don't know enough about symplectic geometry :( One thing, though: you got me thinking about hyperbolic spaces, where volume grows exponentially; counter-intuitive from a Euclidean point of view.
- “maybe the flat embedding space doesn’t devote volume to gibberish because it was never trained to model gibberish.”
I initially thought of this as a kind of "contraction", but that term might be misleading; thinking about it, I prefer the idea of density redistribution. Like a fluid adapting to an invisible container: maybe it's a manifold emerging through optimization pressure, indirectly sculpted by the model's training dynamics.
- Wheeler Superspace: Again, I can't quite follow you :( I guess you're pointing out that semantic relationships could be formulated as discrete... BUT, as a non-physicist, I honestly can't tell the (any?) difference between being modeled as discrete vs. being discrete. (xD)
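Here is the small sketch I mentioned above, just to make the atlas idea slightly more concrete: k-means clusters play the role of chart domains, and a local PCA per cluster plays the role of the chart map. The cluster count and chart dimension are arbitrary choices, and the hard part (smooth transition maps where charts overlap) is exactly what's missing:

```python
# Very rough "atlas" sketch: cluster the embeddings into chart domains, then
# fit a local PCA per cluster as that chart's coordinate map.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def fit_atlas(emb: np.ndarray, n_charts: int = 8, chart_dim: int = 10):
    """Cluster embeddings and fit one local linear chart (PCA) per cluster."""
    km = KMeans(n_clusters=n_charts, n_init=10, random_state=0).fit(emb)
    charts = {
        c: PCA(n_components=chart_dim).fit(emb[km.labels_ == c])
        for c in range(n_charts)
    }
    return km, charts

def to_local_coords(x: np.ndarray, km, charts):
    """Map a single embedding into the local coordinates of its own chart."""
    c = int(km.predict(x[None, :])[0])
    return c, charts[c].transform(x[None, :])[0]
```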
Thanks for the deep response, Paul! It's a pleasure having this conversation with you.