Comment by atleastoptimal

6 months ago

All those things could be done by humanoid robots. AI models aren't limited to words, as we've seen with video models. GPT-4o, which has been out for over a year, is natively multimodal. Robotics companies are training robots to take in all the data they have available (video, audio) and interpret it all together in context. Yes, there is a core substrate of tokens, but that's largely just a standard "bit" level of information for AI brains, not some essential limiter that will keep AI from understanding all the soft, abstract stuff that humans can. If you look at o3 now, just feeding it images, it can clearly reason in a way that is closer to human reasoning than a calculator's is to its own.