
Comment by fdsjgfklsfd

3 days ago

When I've had Grok evaluate images and dug into how it perceives them, it seemed to just have an image-labeling model slapped onto the text input layer. I'm not sure it can really see anything at all, the way true "vision" models can.

For each object it was giving a coordinate bounding box and a confidence score against a generic classification:

    - *Positions*:
      - Central cluster: At least five bugs, spread across the center of the image (e.g., x:200-400, y:150-300).
      - Additional bugs: Scattered around the edges, particularly near the top center (x:300-400, y:50-100) and bottom right (x:400-500, y:300-400).
    - *Labels and Confidence*:
      - Classified as "armored bug" or "enemy creature" with ~80% confidence, based on their insect-like shape, spikes, and clustering behavior typical of game enemies.
      - The striped pattern and size distinguish them from other entities, though my training data might not have an exact match for this specific creature design.

    - *Positions*:
      - One near the top center (x:350-400, y:50-100), near a bug.
      - Another in the bottom right (x:400-450, y:350-400), near another bug.
    - *Labels and Confidence*:
      - Classified as "spider" or "enemy minion" with ~75% confidence, due to their leg structure and body shape.
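That output is exactly what a classic detector-then-describe pipeline produces: an object detector emits boxes, labels, and confidences, and those structured results get flattened into text for the language model to paraphrase. A minimal sketch of what I suspect is happening (all names and the output format here are my own illustrative assumptions, not Grok internals):

```python
def detections_to_text(detections):
    """Flatten detector output (boxes, labels, confidences) into the
    bullet-style text a text-only model could then paraphrase."""
    lines = []
    for d in detections:
        x0, y0, x1, y1 = d["bbox"]
        lines.append(
            f'- "{d["label"]}" at x:{x0}-{x1}, y:{y0}-{y1} '
            f'(~{int(d["confidence"] * 100)}% confidence)'
        )
    return "\n".join(lines)

# Toy detections shaped like the quoted output above.
detections = [
    {"label": "armored bug", "bbox": (200, 150, 400, 300), "confidence": 0.80},
    {"label": "spider", "bbox": (350, 50, 400, 100), "confidence": 0.75},
]

print(detections_to_text(detections))
```

Note the model never sees pixels in this setup, only the flattened text, which would explain why its answers read like a detector's label list rather than an actual description of the scene.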