Comment by aaroninsf
6 days ago
The goal is a Large Phenomenological Model.
A good definition of "real AGI" might be: a multimodal model that understands time-based media, space, and object behavior, and hence has true agency.
Phenomenology is the philosophy of "things as they seem," not "knowledge (words) about things." Seem to our senses, not understood through language.
LLMs, of course, trade in language tokens.
We can extend their behavior with front ends which convert other media types into such tokens.
But we can do better with multimodal models trained directly on other inputs, e.g. by integrating image classifiers with language models architecturally.
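Roughly what I mean, as a minimal PyTorch-ish sketch (the class, dimensions, and the assumption that the LM accepts embeddings directly are illustrative, not any particular library's API):

```python
import torch
import torch.nn as nn

class VisionLanguageBridge(nn.Module):
    """Sketch: map a pretrained image encoder's features into an LLM's embedding space."""
    def __init__(self, vision_encoder, language_model, vision_dim=768, lm_dim=4096):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT, typically frozen
        self.language_model = language_model   # a decoder-only LM
        # A small projection turns image features into "soft tokens" in the LM's space.
        self.projector = nn.Linear(vision_dim, lm_dim)

    def forward(self, pixel_values, text_embeddings):
        patch_features = self.vision_encoder(pixel_values)   # assumed shape (B, P, vision_dim)
        visual_tokens = self.projector(patch_features)        # (B, P, lm_dim)
        # Prepend visual tokens so the LM attends to them alongside the text.
        # Assumes the LM can consume embeddings directly rather than token ids.
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs)
```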
With those one can sort of understand time-based media, by sampling a stream and getting e.g. transcripts.
But again, it's even better to build a time-based multimodal model that directly ingests time-based media rather than sampling it. (Other architectures than transformers are going to be required to do this well IMO...)
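The sampling front end, concretely (caption_image and transcribe_audio are hypothetical stand-ins for whatever captioner/ASR you'd plug in; a truly time-based model would consume the raw stream instead of these summaries):

```python
def describe_video_by_sampling(frames, audio, fps=30, sample_every_s=1.0):
    """Turn a video into text an LLM can consume, losing most temporal detail."""
    step = int(fps * sample_every_s)
    captions = [caption_image(frame) for frame in frames[::step]]  # hypothetical helper
    transcript = transcribe_audio(audio)                           # hypothetical helper
    return "\n".join(captions) + "\n\nTranscript:\n" + transcript
```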
The bootstrapping continues. This work is about training models to understand world and object properties by introducing agency.
Significant footnote: implicitly, models trained to interact with the world necessarily have a "self model" that interacts with the "world model." Presumably they are trained to preserve their expensive "self." Hmmmmm....
When we have a model that knows about things not just as nodes in a language graph but also how such things look, and sound, and move, and "feel" (how much mass they have, how they move, etc.)...
...well, that is approaching something indistinguishable from one of us, at least wrt embodiment and agency.