Comment by Xmd5a
6 days ago
Most LM work implicitly uses surprisal = -log p(w | prefix) as the processing cost. But psycholinguistics keeps finding cases (garden-path sentences, etc.) where human difficulty is less about the next word being unlikely and more about how much of the current parse / interpretation has to be torn down and rebuilt. That’s essentially what Wang et al. formalize with their Incompatibility Fraction: they combine an LLM’s lexical predictions with a dependency parser, build a sheaf-style structure over prefixes, and measure how inconsistent the local parse distributions are with any single global structure. That incompatibility correlates with human reading times and distinguishes easy vs hard garden paths better than surprisal alone.
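(For reference, the baseline quantity is trivial to compute; here is a minimal sketch using GPT-2 via HuggingFace, where the model choice and tokenization details are mine, not Wang et al.'s:)

    # Per-token surprisal -log2 p(w_t | w_<t) from a causal LM.
    # Model choice and details are illustrative, not from Wang et al.
    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    def surprisals(sentence):
        ids = tok(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            logp = torch.log_softmax(lm(ids).logits[0, :-1], dim=-1)
        targets = ids[0, 1:]
        bits = -logp[torch.arange(targets.numel()), targets] / torch.log(torch.tensor(2.0))
        return list(zip(tok.convert_ids_to_tokens(targets.tolist()), bits.tolist()))

    # Classic garden-path example from the psycholinguistics literature.
    print(surprisals("The horse raced past the barn fell."))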
If you take that seriously, you end up with a different "surprise" objective: not just "this token was unlikely", but "this token forced a big update of my latent structure". In information-theoretic terms, the distortion term in a Rate–Distortion / Information Bottleneck objective stops being pure log-loss and starts to look like a backtracking cost on your semantic/structural state.
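Schematically (my notation, not lifted from any of the papers), the shift is from a pure log-loss distortion to one with a structural revision term:

    % standard distortion: pointwise prediction error
    d_{\mathrm{pred}}(w_t) = -\log p_\theta(w_t \mid w_{<t})

    % "backtracking" distortion: add the cost of revising the latent structure z
    d_{\mathrm{total}}(w_t) = -\log p_\theta(w_t \mid w_{<t}) + \lambda \, \Delta(z_{t-1}, z_t)

where Δ is some measure of how much of the parse/interpretation built from w_{<t} has to be thrown away when w_t arrives; Wang et al.'s Incompatibility Fraction is one concrete candidate for that kind of term.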
Now look at Shani et al.’s From Tokens to Thoughts paper: they compare LLM embeddings to classic human typicality/membership data (Rosch, Hampton, etc.) using RDT/IB, and show that LLMs sit in a regime of aggressive compression: broad categories line up with humans, but fine-grained typicality and "weird" members get squashed. Humans, by contrast, keep higher-entropy, messier categories – they "waste bits" to preserve contextual nuance and prototype structure.
Quantum cognition folks like Aerts have been arguing for years that this messiness is not a bug: phenomena like the Guppy effect (where "guppy" is a so-so Pet and a so-so Fish but a very typical Pet-Fish) are better modelled as interference in a Hilbert space, i.e. as emergent concepts rather than classical intersections. Lo et al. then show that LMs like BERT already exhibit quantum-like contextuality in their probability distributions: thousands of sheaf-contextual and tens of millions of CbD-contextual instances, with the degree of contextuality tightly related to the embedding distances between competing words.
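To make the interference point concrete, the simplest Aerts-style rendering (my simplification, assuming the two concept states are orthogonal unit vectors) models the combined concept as a superposition rather than an intersection:

    % exemplar x, concepts A = "Pet", B = "Fish" as unit vectors
    \mu_A(x) = |\langle x \mid A \rangle|^2, \quad \mu_B(x) = |\langle x \mid B \rangle|^2

    % combined concept as a superposition (assuming \langle A \mid B \rangle = 0)
    \lvert AB \rangle = \tfrac{1}{\sqrt{2}} \,(\lvert A \rangle + \lvert B \rangle)

    \mu_{AB}(x) = \tfrac{1}{2}\big(\mu_A(x) + \mu_B(x)\big) + \mathrm{Re}\,\langle A \mid x \rangle \langle x \mid B \rangle

The cross term is what lets the membership of "guppy" in Pet-Fish exceed both of its component memberships, which no classical conjunction (min or product) can do.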
Put those together and you get an interesting picture:
Current LMs do live in a contextual / interference-ish regime at the probabilistic level, but their embedding spaces are still optimized for pointwise predictive compression, not for minimizing re-interpretation cost over time.
If you instead trained them under a "surprise = prediction error + structural backtracking cost" objective (something like log-loss + sheaf incompatibility over parses/meanings), the optimal representations wouldn’t be maximally compressed clusters. They’d be the ones that make structural updates cheap: more typed, factorized, role-sensitive latent spaces where meaning is explicitly organized for recomposition rather than for squeezing out every last bit of predictive efficiency.
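As a training loss that might look something like the sketch below; the structural term here is a crude proxy (total variation between successive parse beliefs), not Wang et al.'s actual measure, and the interface and weighting are my assumptions:

    # "surprise = prediction error + structural backtracking cost", as a loss.
    # The structural term is a crude stand-in (how much the belief over candidate
    # parses gets revised from one prefix to the next), NOT Wang et al.'s
    # Incompatibility Fraction.
    import torch
    import torch.nn.functional as F

    def backtracking_cost(parse_dists):
        # parse_dists: (T, K) distribution over K candidate structures at each prefix.
        # Total-variation distance between consecutive prefixes, averaged over time.
        return 0.5 * (parse_dists[1:] - parse_dists[:-1]).abs().sum(-1).mean()

    def structural_surprise_loss(logits, targets, parse_dists, lam=0.1):
        # logits: (T, V) next-token logits; targets: (T,) gold token ids.
        pred_loss = F.cross_entropy(logits, targets)   # the usual surprisal term
        return pred_loss + lam * backtracking_cost(parse_dists)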
That’s exactly the intuition behind DisCoCat / categorical compositional distributional semantics: you force grammar and semantics to share a compact closed category, treat sentence meaning as a tensor contraction over typed word vectors, and design the embedding spaces so that composition is a simple linear map. You’re trading off fine-grained, context-specific "this token in this situation" information for a geometry that makes it cheap to build and rebuild structured meanings.
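Mechanically that composition is just multilinear algebra; here is a toy sketch (dimensions and the random "meanings" are made up) of how a transitive sentence contracts in a DisCoCat-style setup:

    # Toy DisCoCat-style composition: tensor shapes follow grammatical types,
    # and sentence meaning is a contraction. All numbers here are illustrative.
    import numpy as np

    N, S = 4, 2                    # noun-space and sentence-space dimensions
    rng = np.random.default_rng(0)

    john  = rng.random(N)          # noun: vector in N
    mary  = rng.random(N)          # noun: vector in N
    likes = rng.random((N, S, N))  # transitive verb: tensor in N (x) S (x) N

    # "John likes Mary": contract the verb with its subject and object.
    sentence = np.einsum("i,isj,j->s", john, likes, mary)   # a vector in sentence space S

    # An adjective is a linear map on the noun space, so "old John" is a matrix-vector product.
    old = rng.random((N, N))
    old_john = old @ john

This is also the sense in which recomposition is cheap by construction: swapping one word's tensor changes the sentence meaning without disturbing any of the other words' representations.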
Wang et al.’s Incompatibility Fraction is basically a first step toward such an objective; Shani et al. quantify how far LMs are from the "human" point on the compression–meaning trade-off; Aerts and Lo et al. show that both humans and LMs already live in a quantum/contextual regime; and DisCoCat gives a concrete target for what "structured, recomposable embeddings" could look like. If we ever switch from optimizing pure cross-entropy to "how painful is it to revise my world-model when this token arrives?", I’d expect the learned representations to move away from super-compact clusters and towards something much closer to those typed, compositional spaces.