Like others have said, it's an interesting avenue for AGI. The joint embeddings would be closer to thinking than the current LLM token work. LLMs look like they have a lot of limitations for AGI (although who knows, maybe there's another crazy scale-up coming? But that extra scale is looking difficult right now).
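To make "joint embeddings" concrete: in a JEPA-style setup the model predicts the embedding of a hidden target from surrounding context, so the loss lives in embedding space rather than in token or pixel space. A minimal PyTorch sketch, with every dimension and module choice invented for illustration (real JEPA training uses an EMA copy for the target encoder):

```python
# Minimal sketch of a joint-embedding predictive objective (JEPA-style).
# Shapes and modules are illustrative, not any specific released model.
import torch
import torch.nn as nn

D = 256  # embedding dimension, arbitrary for the sketch

context_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
target_encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True),
    num_layers=2)
predictor = nn.Linear(D, D)

def jepa_loss(context_patches, target_patches):
    """Predict the target's embedding from context; the loss is an L2
    distance in embedding space, never a softmax over a vocabulary."""
    ctx = context_encoder(context_patches)        # (B, Nc, D)
    with torch.no_grad():                         # target encoder gets no
        tgt = target_encoder(target_patches)      # gradient (EMA in practice)
    pred = predictor(ctx.mean(dim=1))             # (B, D) pooled prediction
    return ((pred - tgt.mean(dim=1)) ** 2).mean()

# Toy usage with random "patch embeddings":
loss = jepa_loss(torch.randn(8, 16, D), torch.randn(8, 4, D))
```

The point is the loss: nothing forces the prediction back through a token vocabulary, which is why people say this is closer to "thinking" than next-token prediction.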
There is a world of money in AGI, and they have the resources, and notably the data, to achieve it.
The goal is a Large Phenomenological Model.
A good definition of "real AGI" might be: a multimodal model that understands time-based media, space, and object behavior, and hence has true agency.
Phenomenology is the philosophy of "things as they seem," not "knowledge (words) about things." Things as they seem to our senses, not as understood through language.
LLMs, of course, trade in language tokens.
We can extend their behavior with front ends that convert other media types into such tokens.
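Concretely, that front-end pattern is just a pipeline: a speech-to-text model turns audio into text, and the text becomes ordinary tokens. A minimal sketch using openai-whisper; the downstream LLM call is left abstract since it depends on your stack:

```python
# Front-end pattern: convert non-text media into tokens the LLM already
# handles. Audio -> transcript via Whisper, transcript -> text prompt.
# (pip install openai-whisper)
import whisper

asr = whisper.load_model("base")
result = asr.transcribe("meeting.mp3")   # any audio file path
transcript = result["text"]

# The LLM never sees audio, only the tokens of this string:
prompt = f"Summarize the key decisions in this meeting:\n\n{transcript}"
# answer = your_llm.generate(prompt)     # whichever LLM you use
```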
But we can do better with multimodal models that are trained directly on other inputs, e.g. integrating image classifiers with language models architecturally.
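The architectural version looks roughly like LLaVA-style adapters: a learned projection maps vision-encoder features into the LLM's token-embedding space, so image "tokens" sit in the same sequence as text tokens. A toy sketch with stand-in encoders and made-up dimensions:

```python
# Sketch of architectural integration: project vision features into the
# LLM's embedding space so they can be prepended to text embeddings.
# Both encoders are stand-ins for real pretrained models.
import torch
import torch.nn as nn

D_VIS, D_LLM = 768, 4096   # vision feature dim, LLM embedding dim (typical)

vision_encoder = nn.Linear(3 * 224 * 224, D_VIS)   # stand-in for a ViT
projector = nn.Linear(D_VIS, D_LLM)                # the learned "adapter"
text_embed = nn.Embedding(32000, D_LLM)            # the LLM's token embeddings

def build_inputs(image, text_ids):
    vis = vision_encoder(image.flatten(1))    # (B, D_VIS)
    img_tok = projector(vis).unsqueeze(1)     # (B, 1, D_LLM) image "token"
    txt = text_embed(text_ids)                # (B, T, D_LLM) text tokens
    return torch.cat([img_tok, txt], dim=1)   # one joint sequence for the LLM

seq = build_inputs(torch.randn(2, 3, 224, 224),
                   torch.randint(0, 32000, (2, 10)))
```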
With those, one can sort of understand time-based media by sampling a stream and getting, e.g., transcripts.
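That sampling approach, concretely: pull roughly one frame per second, caption each, and hand the model a text timeline instead of the stream itself. A sketch with OpenCV; caption() is a hypothetical stand-in for an image-captioning model:

```python
# "Sampling" a video for an LLM: one frame per second -> captions -> text.
# caption() below is a hypothetical placeholder for a real captioning model.
import cv2  # pip install opencv-python

def caption(frame):
    return "a person walks across the room"  # placeholder output

cap = cv2.VideoCapture("clip.mp4")
step = max(int(cap.get(cv2.CAP_PROP_FPS)), 1)  # ~1 sampled frame per second
timeline, i = [], 0

while True:
    ok, frame = cap.read()
    if not ok:
        break
    if i % step == 0:
        timeline.append(f"[{i // step}s] {caption(frame)}")
    i += 1
cap.release()

prompt = "Describe what happens in this video:\n" + "\n".join(timeline)
```

Everything downstream of the captioner is lossy: motion between samples, sound, and physics are gone, which is exactly the limitation the next point is about.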
But again, it's even better to build a time-based multimodal model that directly ingests time-based media rather than sampling it. (Architectures other than transformers are going to be required to do this well, IMO...)
The bootstrapping continues. This work is about training models to understand world and object properties by introducing agency.
Significant footnote: implicitly, models trained to interact with the world necessarily have a "self model" that interacts with the "world model." Presumably they are trained to preserve their expensive "self." Hmmmmm....
When we have a model that knows about things not just as nodes in a language graph but also how such things look, sound, move, and "feel" (how much mass they have, how they move, etc.)...
...well, that is approaching indistinguishability from one of us, at least wrt embodiment and agency.
Possibly, with their investment in AR/VR and gaming, they see a pathway to creating "physical intelligence" and tapping into a much bigger untapped market. I mean, isn't Robotaxi the main carrot Musk has been holding in front of Tesla investors for a decade or so? Physical robots may provide a more incremental, fault-tolerant path to applying AI.
Physical robots as impressive as LLMs?
Robots that can do anything.
physical robots arguing endlessly with physical people