Mostly because OpenAI's vision offerings aren't particularly compelling:
- 4o can't really do localization, and in my experience it's worse than Gemini 2.0 and Qwen2.5 at document tasks
- 4o mini isn't actually cheaper than 4o for images, because it uses far more tokens per image (~5600/tile vs ~170/tile, where each tile is 512x512)
- o1 supports vision but is wildly expensive and slow
- o3-mini doesn't yet support vision, and o1-mini never did
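To make the 4o-mini pricing point concrete, here's a rough back-of-the-envelope sketch. The per-tile token counts are from the comparison above; the per-token prices and the simplified tiling rule (ignoring the API's initial downscaling step) are assumptions for illustration, so treat the numbers as approximate.

```python
import math

# Per-tile token counts from the comparison above; prices (USD per
# 1M input tokens) are assumptions and may be out of date.
TOKENS_PER_TILE = {"gpt-4o": 170, "gpt-4o-mini": 5600}
PRICE_PER_MTOK = {"gpt-4o": 2.50, "gpt-4o-mini": 0.15}

def image_input_cost(model: str, width: int, height: int) -> float:
    """Rough input cost in USD for one image.

    Simplified tiling: split the image into 512x512 tiles,
    ignoring base tokens and the API's downscaling step.
    """
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    tokens = tiles * TOKENS_PER_TILE[model]
    return tokens * PRICE_PER_MTOK[model] / 1_000_000

# A 1024x1024 image is 4 tiles:
#   4o:      4 * 170  =   680 tokens -> ~$0.0017
#   4o-mini: 4 * 5600 = 22400 tokens -> ~$0.0034
print(image_input_cost("gpt-4o", 1024, 1024))
print(image_input_cost("gpt-4o-mini", 1024, 1024))
```

Under these assumed prices, 4o-mini's ~33x token count more than cancels its ~17x cheaper per-token rate, so an image ends up costing roughly twice as much as on 4o.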