
Comment by liuliu

1 day ago

It looks pretty obvious (I think):

1. S2 is a 7B VLM. It is responsible for taking in camera streams (however many there are), running them through prompt-guided text generation, and, before the lm_head (or a few layers leading to it), directly taking the latent encoding (see the sketch after this list);

2. S1 is where they collected a few hundred hours of teleoperation data, retrospectively came up with prompts for 1, and then trained from scratch;
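
For point 1, here is a minimal sketch of what "take the latent before the lm_head" could look like, assuming a HuggingFace-style causal LM stands in for the 7B VLM (the real S2 also ingests camera frames, which is omitted here). The model name, pooling choice, and latent shape are all my assumptions, not anything from the actual system:

```python
# Sketch: extract the pre-lm_head latent from a prompt-guided LM (stand-in for S2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B"  # placeholder 7B backbone, NOT the actual S2 VLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Pick up the cup on the counter."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[-1] is the activation right before lm_head; taking the last token
# (or pooling a few trailing layers/tokens) gives a latent that could be handed to S1.
latent = out.hidden_states[-1][:, -1, :]  # shape: (1, hidden_dim), e.g. (1, 4096)
```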

Whether S2 is fine-tuned along with S1 is an open question; at minimum there is an MLP adapter that is fine-tuned, but the whole 7B VLM could be fine-tuned too (a rough sketch of that split is below).
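
To make the open question concrete, a sketch of the adapter and the two fine-tuning options, assuming the adapter is a small MLP mapping the S2 latent into S1's conditioning space; the class name and dimensions are mine, purely illustrative:

```python
# Sketch: MLP adapter between the S2 latent and the S1 policy; which parameters
# are trainable is exactly the open question (adapter only vs. the whole 7B VLM).
import torch
from torch import nn

class LatentAdapter(nn.Module):
    """Projects the pre-lm_head latent from S2 into the control policy's space."""
    def __init__(self, vlm_dim: int = 4096, policy_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vlm_dim, policy_dim),
            nn.GELU(),
            nn.Linear(policy_dim, policy_dim),
        )

    def forward(self, latent: torch.Tensor) -> torch.Tensor:
        return self.mlp(latent)

adapter = LatentAdapter()

# Option A: fine-tune only the adapter, keeping the 7B VLM frozen.
trainable = list(adapter.parameters())

# Option B: also unfreeze the VLM backbone ('model' from the earlier sketch).
# trainable += list(model.parameters())
```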

It looks plausible, but I am still skeptical about the generalization claim, given it is all fine-tuned on household tasks. Then again, nowadays it is really difficult to understand how these models generalize.