Comment by MichaelRazum
3 days ago
Technical question: Can someone explain how the vision backbone can be replaced after training? I think this is what they mentioned in the video. Just wondering how that would work, since I'd suspect the visual embeddings would be heavily affected.
PS: Is the approach something like LoRA, or a complete retrain of the visual part?
When I've had Grok evaluate images and dug into how it perceives them, it seemed to just have an image-labeling model slapped onto the text input layer. I'm not sure it can really see anything at all the way "vision" models can.
It was returning bounding-box coordinates and likelihood scores against generic classification labels for each:
…
Don't know how Grok is set up, but in earlier models the vision backbone was effectively a separate model trained to convert vision inputs into "soft tokens" that the main model would treat as input and attend to just like text token inputs. Because they're two separate components, you can modify each somewhat independently. Not sure how things are currently set up, though.
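To make that concrete, here's a minimal sketch of the soft-token idea in PyTorch. This is not Grok's actual architecture; the class names, dimensions, and number of soft tokens are all hypothetical, just to show how features from a (possibly frozen) vision backbone get projected into the language model's embedding space and concatenated with text embeddings:

```python
import torch
import torch.nn as nn

class SoftTokenVisionAdapter(nn.Module):
    """Hypothetical adapter: projects vision-encoder features into the
    language model's embedding space as "soft tokens"."""
    def __init__(self, vision_dim: int, lm_dim: int, num_tokens: int = 32):
        super().__init__()
        # Learned projection from vision feature space to LM embedding space.
        self.proj = nn.Linear(vision_dim, lm_dim)
        self.num_tokens = num_tokens

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the backbone.
        # Keeping the first num_tokens patches is a simplification; real
        # systems typically pool or resample instead.
        return self.proj(vision_feats[:, : self.num_tokens, :])

# Toy usage: prepend soft tokens to text embeddings before the LM.
batch, patches, vision_dim, lm_dim = 2, 256, 1024, 4096
vision_feats = torch.randn(batch, patches, vision_dim)  # stand-in for ViT output
text_embeds = torch.randn(batch, 16, lm_dim)            # stand-in for token embeddings

adapter = SoftTokenVisionAdapter(vision_dim, lm_dim)
soft_tokens = adapter(vision_feats)                     # (2, 32, 4096)
lm_input = torch.cat([soft_tokens, text_embeds], dim=1)
print(lm_input.shape)  # torch.Size([2, 48, 4096]); the LM attends to both
```

Since only the projection sits between the backbone and the language model in this setup, you could in principle swap the backbone and retrain just the adapter (or adapter plus backbone), which is roughly why the two parts can be modified somewhat independently.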