Comment by Xmd5a
6 days ago
Related:
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning – https://arxiv.org/pdf/2505.17117 (LeCun/Jurafsky)
> Large Language Models (LLMs) demonstrate striking linguistic capabilities that suggest semantic understanding (Singh et al., 2024; Li et al., 2024). Yet, a critical question remains unanswered: Do LLMs navigate the compression-meaning trade-off similarly to humans, or do they employ fundamentally different representational strategies? This question matters because true understanding, which goes beyond surface-level mimicry, requires representations that balance statistical efficiency with semantic richness (Tversky, 1977; Rosch, 1973b).
> To address this question, we apply Rate-Distortion Theory (Shannon, 1948) and Information Bottleneck principles (Tishby et al., 2000) to systematically compare LLM and human conceptual structures. We digitize and release seminal cognitive psychology datasets (Rosch, 1973b; 1975; McCloskey & Glucksberg, 1978), which are foundational studies that shaped our understanding of human categorization but were previously unavailable in a machine-readable form. These benchmarks, comprising 1,049 items across 34 categories with both membership and typicality ratings, offer unprecedented empirical grounding for evaluating whether LLMs truly understand concepts as humans do. It also offers much better quality data than the current crowdsourcing paradigm.
From typicality tests in the paper above, we can jump to:
The Guppy Effect as Interference – https://arxiv.org/abs/1208.2362
> One can refer to the situation wherein people estimate the typicality of an exemplar of the concept combination as more extreme than it is for one of the constituent concepts in a conjunctive combination as overextension. One can refer to the situation wherein people estimate the typicality of the exemplar for the concept conjunction as higher than that of both constituent concepts as double overextension. We posit that overextension is not a violation of the classical logic of conjunction, but that it signals the emergence of a whole new concept. The aim of this paper is to model the Guppy Effect as an interference effect using a mathematical representation in a complex Hilbert space and the formalism of quantum theory to represent states and calculate probabilities. This builds on previous work that shows that Bell Inequalities are violated by concepts [7, 8] and in particular by concept combinations that exhibit the Guppy Effect [1, 2, 3, 9, 10], and add to the investigation of other approaches using interference effects in cognition [11, 12, 13].
And from quantum interference, to:
Quantum-like contextuality in large language models – https://royalsocietypublishing.org/doi/epdf/10.1098/rspa.202...
> This paper provides the first large-scale experimental evidence for contextuality in the large language model BERT. We constructed a linguistic schema modelled over a contextual quantum scenario, instantiated it in the Simple English Wikipedia, and extracted probability distributions for the instances. This led to the discovery of sheaf-contextual and CbD contextual instances. We prove that these contextual instances arise from semantically similar words by deriving an equation that relates degrees of contextuality to the Euclidean distance of BERT’s embedding vectors.
How can large language models become more human – https://discovery.ucl.ac.uk/id/eprint/10196296/1/2024.cmcl-1...
> Psycholinguistic experiments reveal that efficiency of human language use is founded on predictions at both syntactic and lexical levels. Previous models of human prediction exploiting LLMs have used an information theoretic measure called surprisal, with success on naturalistic text in a wide variety of languages, but under-performance on challenging text such as garden path sentences. This paper introduces a novel framework that combines the lexical predictions of an LLM with the syntactic structures provided by a dependency parser. The framework gives rise to an Incompatibility Fraction. When tested on two garden path datasets, it correlated well with human reading times, distinguished between easy and hard garden path, and outperformed surprisal.
Most LM work implicitly uses surprisal = -log p(w | prefix) as the processing cost. But psycholinguistics keeps finding cases (garden-path sentences, etc.) where human difficulty is less about the next word being unlikely and more about how much of the current parse / interpretation has to be torn down and rebuilt. That’s essentially what Wang et al. formalize with their Incompatibility Fraction: they combine an LLM’s lexical predictions with a dependency parser, build a sheaf-style structure over prefixes, and measure how inconsistent the local parse distributions are with any single global structure. That incompatibility correlates with human reading times and distinguishes easy vs hard garden paths better than surprisal alone.
If you take that seriously, you end up with a different "surprise" objective: not just "this token was unlikely", but "this token forced a big update of my latent structure". In information-theoretic terms, the distortion term in a Rate–Distortion / Information Bottleneck objective stops being pure log-loss and starts to look like a backtracking cost on your semantic/structural state.
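To make that distinction concrete, here's a toy Python sketch. The probabilities are made up and the update cost is a simple total-variation distance between parse posteriors, not Wang et al.'s actual sheaf construction; it just shows how two equally improbable words can demand very different amounts of structural revision:

```python
# Toy sketch: contrast token surprisal with a "structural backtracking" cost,
# measured here as the total-variation distance between posteriors over parse
# hypotheses before and after a word. All probabilities are made up.
import numpy as np

def surprisal(p_next_token: float) -> float:
    """Classic processing cost: -log p(w | prefix)."""
    return -np.log2(p_next_token)

def backtracking_cost(p_before: np.ndarray, p_after: np.ndarray) -> float:
    """How much of the posterior over latent parses had to be torn down."""
    return 0.5 * np.abs(p_before - p_after).sum()   # total variation distance

# Garden path: "The horse raced past the barn fell."
# Two parse hypotheses: H0 = "raced" is the main verb, H1 = reduced relative.
posterior_over_parses = {
    "barn": np.array([0.95, 0.05]),   # before "fell": main-verb reading dominates
    "fell": np.array([0.02, 0.98]),   # "fell" forces the reduced-relative reading
}

p_fell_given_prefix = 0.01            # "fell" is also just an unlikely next word

print("surprisal(fell)    =", round(surprisal(p_fell_given_prefix), 2), "bits")
print("backtracking(fell) =", backtracking_cost(posterior_over_parses["barn"],
                                                 posterior_over_parses["fell"]))
# Point: surprisal only sees the second number's token probability; the
# backtracking term sees that almost the entire parse had to be rebuilt.
```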
Now look at Shani et al.’s From Tokens to Thoughts paper: they compare LLM embeddings to classic human typicality/membership data (Rosch, Hampton, etc.) using RDT/IB, and show that LLMs sit in a regime of aggressive compression: broad categories line up with humans, but fine-grained typicality and "weird" members get squashed. Humans, by contrast, keep higher-entropy, messier categories – they "waste bits" to preserve contextual nuance and prototype structure.
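A back-of-the-envelope version of that trade-off, using k-means plus cluster entropy as a crude stand-in for the paper's RDT/IB machinery (the embeddings below are random placeholders, not real LLM vectors):

```python
# Minimal rate-distortion sketch in the spirit of the comparison (not the
# paper's actual pipeline): cluster item embeddings into concepts, measure
# complexity as H(C) and distortion as mean squared distance to the centroid.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(1049, 64))          # stand-in for 1,049 item embeddings

def rate_and_distortion(X: np.ndarray, k: int) -> tuple[float, float]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    p = np.bincount(km.labels_, minlength=k) / len(X)
    p = p[p > 0]
    rate = -np.sum(p * np.log2(p))        # H(C) in bits
    distortion = km.inertia_ / len(X)     # mean squared reconstruction error
    return rate, distortion

for k in (5, 34, 200):                   # very coarse -> 34 categories -> fine
    r, d = rate_and_distortion(X, k)
    print(f"k={k:4d}  rate={r:5.2f} bits  distortion={d:7.2f}")
# Aggressive compression (small k) buys a low rate at the cost of squashing
# within-category structure -- the axis on which the paper places LLM
# embeddings versus human typicality data.
```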
Quantum cognition folks like Aerts have been arguing for years that this messiness is not a bug: phenomena like the Guppy effect (where "guppy" is a so-so Pet and a so-so Fish but a very typical Pet-Fish) are better modelled as interference in a Hilbert space, i.e. as emergent concepts rather than classical intersections. Lo et al. then show that large LMs (BERT) already exhibit quantum-like contextuality in their probability distributions: thousands of sheaf-contextual and tens of millions of CbD-contextual instances, with the degree of contextuality tightly related to embedding distances between competing words.
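The interference story is easy to reproduce in miniature. A toy 2-D Hilbert-space sketch (illustrative angles, real amplitudes, not Aerts' fitted model) where the combined concept is a normalized superposition of the constituents:

```python
# Toy Aerts-style interference model of the Guppy effect: typicality is the
# squared overlap with a concept state, and "Pet-Fish" is a superposition of
# Pet and Fish. The cross term can push the combination above both
# constituents (double overextension). Angles are purely illustrative.
import numpy as np

def ket(theta: float) -> np.ndarray:
    """Unit vector in a 2-D real Hilbert space (enough for this toy)."""
    return np.array([np.cos(theta), np.sin(theta)])

pet, fish, guppy = ket(0.0), ket(1.0), ket(0.45)
pet_fish = (pet + fish) / np.linalg.norm(pet + fish)   # superposition state

def typicality(exemplar: np.ndarray, concept: np.ndarray) -> float:
    return float(np.dot(concept, exemplar) ** 2)        # Born-rule overlap

for name, concept in [("Pet", pet), ("Fish", fish), ("Pet-Fish", pet_fish)]:
    print(f"typicality(guppy, {name}) = {typicality(guppy, concept):.3f}")
# -> roughly 0.81, 0.73, 1.00: the combination beats both constituents,
#    which no classical min/product conjunction of the two membership
#    values can reproduce.
```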
Put those together and you get an interesting picture:
Current LMs do live in a contextual / interference-ish regime at the probabilistic level, but their embedding spaces are still optimized for pointwise predictive compression, not for minimizing re-interpretation cost over time.
If you instead trained them under a "surprise = prediction error + structural backtracking cost" objective (something like log-loss + sheaf incompatibility over parses/meanings), the optimal representations wouldn’t be maximally compressed clusters. They’d be the ones that make structural updates cheap: more typed, factorized, role-sensitive latent spaces where meaning is explicitly organized for recomposition rather than for squeezing out every last bit of predictive efficiency.
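In loss-function terms, that objective might look something like the sketch below (PyTorch, with a hypothetical `structure_logits` head scoring parse/role hypotheses per prefix; this is a guess at the shape of such a loss, not an existing training recipe):

```python
# Hedged sketch of a "prediction error + structural backtracking" objective:
# standard next-token cross-entropy plus a penalty on how much the posterior
# over latent structure shifts when each token arrives.
import torch
import torch.nn.functional as F

def revision_aware_loss(token_logits:     torch.Tensor,   # (batch, seq, vocab)
                        targets:          torch.Tensor,   # (batch, seq)
                        structure_logits: torch.Tensor,   # (batch, seq, n_hypotheses)
                        lam: float = 0.1) -> torch.Tensor:
    # 1) Usual predictive term: "this token was unlikely".
    #    (Target shifting is omitted for brevity.)
    ce = F.cross_entropy(token_logits.flatten(0, 1), targets.flatten())

    # 2) Backtracking term: "this token forced a big update of my latent
    #    structure", here the KL between successive structure posteriors.
    logp = F.log_softmax(structure_logits, dim=-1)
    prev, curr = logp[:, :-1], logp[:, 1:]
    backtrack = F.kl_div(curr, prev, log_target=True, reduction="batchmean")

    return ce + lam * backtrack

# Smoke test with random tensors standing in for model outputs.
B, T, V, H = 2, 8, 100, 5
loss = revision_aware_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)),
                           torch.randn(B, T, H))
print(loss.item())
```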
That’s exactly the intuition behind DisCoCat / categorical compositional distributional semantics: you force grammar and semantics to share a compact closed category, treat sentence meaning as a tensor contraction over typed word vectors, and design the embedding spaces so that composition is a simple linear map. You’re trading off fine-grained, context-specific "this token in this situation" information for a geometry that makes it cheap to build and rebuild structured meanings.
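In code, that composition really is just tensor contraction. A minimal toy example (random vectors and made-up dimensions, not a trained model):

```python
# Minimal DisCoCat-flavoured sketch: nouns are vectors, a transitive verb is an
# order-3 tensor, and sentence meaning is the contraction dictated by the
# grammatical types (n^r s n^l for the verb).
import numpy as np

n_dim, s_dim = 4, 2                       # noun space N, sentence space S
rng = np.random.default_rng(1)

dogs  = rng.normal(size=n_dim)                    # vector in N
cats  = rng.normal(size=n_dim)                    # vector in N
chase = rng.normal(size=(n_dim, s_dim, n_dim))    # tensor in N (x) S (x) N

# "dogs chase cats": contract the verb's noun wires with subject and object.
print("dogs chase cats:", np.einsum("i,isj,j->s", dogs, chase, cats))

# Recomposition reuses the same word representations with no retraining:
print("cats chase dogs:", np.einsum("i,isj,j->s", cats, chase, dogs))
```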
Wang et al.’s Incompatibility Fraction is basically a first step toward such an objective. Shani et al. quantify how far LMs are from the "human" point on the compression–meaning trade-off, Aerts and Lo et al. show that both humans and LMs already live in a quantum/contextual regime, and DisCoCat gives a concrete target for what "structured, recomposable embeddings" could look like. If we ever switch from optimizing pure cross-entropy to "how painful is it to revise my world-model when this token arrives?", I’d expect the learned representations to move away from super-compact clusters and towards something much closer to those typed, compositional spaces.