Comment by maknee

3 months ago

interesting results. why does reload/cross-tile have worse results? would be nice to see some examples of failed results (how close did it get to solving?)

We have an example of a failed cross-tile result in the article - the models seem to be much better at detecting whether something is in an image than at identifying the boundaries of those items. This probably has to do with how they're trained - if you train on description/image pairs, I'm not sure how well the model learns boundaries.

Reload challenges are hard because of how the agent-action loop works. But the models were pretty good at identifying when a tile contained an item.

I'm also curious what the success rates are for humans. Personally I find those two the most bothersome as well. Cross-tile because it's not always clear which parts of the object count and reload because it's so damn slow.