Comment by traverseda

2 days ago

"The first time you've seen these objects" is a weird thing to say. One presumes that this is already in their training set, and that these models aren't storing a huge amount of data in their context, so what does that even mean?

It probably gives them confidence that they can accurately see a thing even though they don't know what that thing is.

I could also imagine a lot of safety around leaving things outside of the current task alone so you might have to bend over backwards to get new objects worked on.

  • There is no such thing as "thing" here.

    These models are trained such that the given conditions (the visual input and the text prompt) will be continued with a desirable continuation (motor function over time).

    The only dimension accuracy can apply to is desirability.

So from what I understand it actually means that they were for example never trained on a video of an apple. Maybe only on a video of bread, pineapple, chocolate.

However, as it was trained using generic text data similarly to a normal LLM, it knows how an apple is supposed to look like.

Similar than a kid that never saw a banana, but his parent described it to him.

It's normal to have a training set and a validation set and I interpreted that to mean that these items weren't in the training set.