Comment by MichaelRazum
3 days ago
Technical question: Can someone explain how the vision backbone can be replaced after training? I think this is what they mentioned in the video. Just wondering how that would work, since I'd suspect the visual embeddings would be heavily affected.
PS: Is the approach something like LoRA, or a complete retrain of the visual part?
When I've had Grok evaluate images and dug into how it perceives them, it seemed to just have an image-labeling model slapped onto the text input layer. I'm not sure it can really see anything at all the way "vision" models can.
It was returning bounding-box coordinates and likelihood scores against generic classification labels for each:
…
Don't know how Grok is set up, but in earlier models the vision backbone was effectively a separate model trained to convert vision inputs into "soft tokens" that the main model would treat as input and attend to just like text token inputs. Because they're two separate components, you can modify each somewhat independently. Not sure how things are currently set up, though.
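To make that concrete, here's a minimal sketch of the soft-token idea in PyTorch. This is not Grok's actual architecture; the class names, dimensions, and number of soft tokens are all hypothetical, just to show how features from a (possibly frozen) vision backbone get projected into the language model's embedding space and concatenated with text embeddings:

```python
import torch
import torch.nn as nn

class SoftTokenVisionAdapter(nn.Module):
    """Hypothetical adapter: projects vision-encoder features into the
    language model's embedding space as "soft tokens"."""
    def __init__(self, vision_dim: int, lm_dim: int, num_tokens: int = 32):
        super().__init__()
        # Learned projection from vision feature space to LM embedding space.
        self.proj = nn.Linear(vision_dim, lm_dim)
        self.num_tokens = num_tokens

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the backbone.
        # Keeping the first num_tokens patches is a simplification; real
        # systems typically pool or resample instead.
        return self.proj(vision_feats[:, : self.num_tokens, :])

# Toy usage: prepend soft tokens to text embeddings before the LM.
batch, patches, vision_dim, lm_dim = 2, 256, 1024, 4096
vision_feats = torch.randn(batch, patches, vision_dim)  # stand-in for ViT output
text_embeds = torch.randn(batch, 16, lm_dim)            # stand-in for token embeddings

adapter = SoftTokenVisionAdapter(vision_dim, lm_dim)
soft_tokens = adapter(vision_feats)                     # (2, 32, 4096)
lm_input = torch.cat([soft_tokens, text_embeds], dim=1)
print(lm_input.shape)  # torch.Size([2, 48, 4096]); the LM attends to both
```

Since only the projection sits between the backbone and the language model in this setup, you could in principle swap the backbone and retrain just the adapter (or adapter plus backbone), which is roughly why the two parts can be modified somewhat independently.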