Comment by noduerme
2 months ago
No. Not at all like that. I said:
>> nor spatial artifacts
I meant visual patterns, too. You're thinking about what I said at too granular a level. JEPA is visual, based ultimately on pixels. The tokens may be digested from pixels until they're as large as whole recognizable objects, but the tokens are not whole mental models themselves.
Here's an example of humans evaluating competing mental models as tokens: You see a car, it's white, it's got some blood stains on the door, and it's traveling towards a red light at 90 miles an hour in a 30 mph residential zone, while you're about to make a left turn. A human foot is dangling from the trunk.
You refer to several mental models you have about high speed chases, drug cartels in the area, murders, etc. You compare these models to determine the next action the car might take.
What were the tokens in this scenario? The color of the car, the pixels of blood, the speed, the traffic pattern? Or whole models of understanding behavior where you had to choose between a normal driver's behavior and that of someone with a dead body fleeing a crime scene?