Comment by thomastjeffery
2 days ago
There is no such thing as "thing" here.
These models are trained so that the given conditions (the visual input and the text prompt) are followed by a desirable continuation (motor function over time).
The only dimension accuracy can apply to is desirability.
You don't think there's any segmentation going on?
Implicitly, maybe. Does that matter if you don't know where?