Comment by ricardobeat

6 days ago

It’s important to keep some perspective: there are zero robots in the wild, at the moment, that use a world model to work on tasks they weren’t specifically trained on. This is cutting-edge research and an 80% success rate is astonishing!

80% success rate is also potentially commercially viable if the task is currently being done by a human.

Work that was once done by 10 humans can now be done by 10 robots + 2 humans for the 20% failure cases, at a lower total cost.
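
To put illustrative, made-up numbers on it: with a robot costing R and a human costing H per shift, 10 robots + 2 humans cost 10R + 2H versus 10H before, so the swap pays off whenever R < 0.8H -- e.g. at R = 0.4H the total drops from 10 to 6 human-equivalents of cost.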

  • This really depends on the failure modes. In general, humans fail in predictable, and mostly safe, ways. AIs fail in highly unpredictable and potentially very dangerous ways. (A human might accidentally drop a knife, an AI might accidentally stab you with it.)

    • Or, if controlling a robot arm, it would stab itself through the conveyor belt at full torque.

  • It might still be a little slow (I'm not sure whether 16 seconds to compute an action is fast enough for commercial use cases), but this is definitely exciting and seems like a great step forward.

I'm surprised that's not how it's already done. I'd have figured some of the inner layers in LLMs were already "world models" and that it's the outer layers that differentiate between text vs. images/robotics/other modalities...

  • That's what the propaganda says, but when we keep explaining it isn't true, an army arrives to repeat ad copy from their favourite tech guru.

    All statistical models of the kind in use are interpolations through historical data -- there's no magic. So when you interpolate through historical texts, your model is of historical text.

    Text is not a measure of the world: to say "the sky is blue" is not even reliably associated with the blueness of the sky, let alone with the fact that the sky isn't blue (there is no sky, and the atmosphere isn't blue).

    These models appear to "capture more" only because when you interpret the text you attribute meaning/understanding to it as the cause of its generation -- but that wasn't the cause, so this is necessarily an illusion. There is no model of the world in a model of historical text -- there is a model of the world in your head which you associate with text, and that association is exploited when you use LLMs to do more than mere syntax transformation.

    LLMs excel most at "fuzzy retrieval" and things like coding -- the latter is principally a matter of syntax, and the former of recollection. As soon as you require the prompt-completion to maintain "semantic integrity" with non-syntactical/retrievable constraints, it falls apart.
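
    To make "a model of historical text" concrete, here is a toy sketch (illustrative Python only, not anything from the article or any real architecture): a bigram model fit to a few sentences produces fluent-looking continuations purely from co-occurrence counts, with no representation of the sky or of blueness anywhere in it.

      # Toy illustration: a bigram "language model" over a tiny corpus.
      from collections import defaultdict, Counter
      import random

      corpus = "the sky is blue . the sea is blue . the sky is clear .".split()

      counts = defaultdict(Counter)
      for a, b in zip(corpus, corpus[1:]):
          counts[a][b] += 1  # statistics of the text, nothing else

      def complete(word, steps=4):
          out = [word]
          for _ in range(steps):
              nxt = counts.get(out[-1])
              if not nxt:
                  break
              # sample continuations in proportion to how often they appeared
              out.append(random.choices(list(nxt), list(nxt.values()))[0])
          return " ".join(out)

      print(complete("the"))  # e.g. "the sky is blue ." -- fluent, but pure interpolation

    (Obviously LLMs are vastly more sophisticated than bigram counts; the sketch only illustrates the "interpolation through text" framing, not where it breaks down.)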

    • I feel like you are ignoring or dismissing the word "interpolating", although a better word would likely be generalization. I'd make the claim that it's very hard to generalize without some form of world model. It's clear to me that transformers do have some form of world model, although not the same as what is being presented in V-JEPA.

      One other nitpick is that you confine this to "historical data", although models are also trained on other classes of data, such as simulated and synthetically generated data.

    • > an army arrives to repeat ad copy from their favourite tech guru

      This is painfully accurate.

      The conversations go like this:

      Me: “guys, I know what I’m talking about, I wrote my first neural network 30 years ago in middle school, this tech is cool but it isn’t magic and it isn’t good enough to do the thing you want without getting us sued or worse.”

      Them: “Bro, I read a tweet that we are on the other side of the singularity. We have six months to make money before everything blows up.”

I can buy this, given a very wide meaning of "specifically trained on" and handwaving a bit about "as far as I know*", but then I read the actual wording, "new objects in new and unseen environments", and remember robots were floating around Mountain View doing tasks involving new objects in novel environments years ago. Then I kinda gotta give up and admit to myself I'm distorting the conversation by emphasizing positivity over ground truth.

They don’t use it because it’s unsafe and potentially life-threatening lol

  • Plenty of things are unsafe and potentially life-threatening, including machines with pre-programmed routines that we use today. We already have robots with limited intelligence interacting safely with humans in workplaces.

    This learning technology didn't exist until this moment in time. That probably has more to do with why no one is using it in the wild.

    • Yes, you can just add other reliable safety measures, meaning if a human comes too close, the robot stops.

      Or the robot is supervised all the time.

      Or just operates in an area without humans.

      But so far this is research, not market-ready.