Comment by red75prime

4 hours ago

The latest models are mostly LMMs (large multimodal models). If a model builds an internal representation that integrates all the modalities we deal with (robotics even provides tactile inputs), it becomes harder and harder to imagine why those representations should be qualitatively different from our own.

It can't, simply because the textual description of a concept is different from the concept itself.

  • Obviously, a concept (which is an abstraction in more ways than one) is different from a textual representation. But LLMs don't operate on the textual description of a concept when they are doing their thing. A textual description (which is associated with other modalities in the training data) serves as an input format. LLMs perform non-linear transformations of points in their latent space. These transformations and representations are useful not only for generating text but also for controlling robots, for example (see VLAs in robotics).
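A minimal sketch of what "operating in latent space" means (toy sizes, random weights, and a made-up vocabulary; nothing here is a real model):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary: tokens become integer ids, not "textual descriptions".
vocab = {"robot": 0, "gripper": 1, "apple": 2}
d_model = 8

# Embedding table: each token id maps to a point in latent space.
embed = rng.normal(size=(len(vocab), d_model))

# One non-linear transformation of that point (a single MLP-style layer,
# standing in for a transformer block). After embedding, the model only
# ever manipulates vectors like this, never the text itself.
W = rng.normal(size=(d_model, d_model))
b = np.zeros(d_model)

def transform(x):
    # tanh makes the map non-linear; real models use other activations
    return np.tanh(x @ W + b)

h = transform(embed[vocab["robot"]])
print(h.shape)  # latent vector, same dimensionality as the embedding
```

The same latent vector could just as well be decoded into motor commands as into text, which is the point about VLAs: the representation is not text-shaped, text is just one input/output format.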

    • > don't operate on the textual description of a concept when they are doing their thing.

It could be mapping the text to some other internal representation, with connections to mappings from other text/tokens. But that does not stop text from being the ground truth. The model has nothing else to go on!

The "hallucination" behavior alone should be enough to reject any claim that these systems are even minimally similar to animal intelligence.
