Comment by Mizza

3 days ago

Have any multimodal models been reasoning-trained yet?

https://platform.openai.com/docs/models/#o1

> The latest o1 model supports both text and image inputs

  • But not multimodal reasoning: the intermediate and output tokens are text only, at least in the released version. They probably have actual multimodal reasoning that hasn't been shown yet, since they already showed that gpt-4o can output image tokens, but that hasn't been released yet either.

    • That wasn’t the question… They asked whether any multimodal models had been reasoning-trained. o1 fits that criterion precisely, and it can reason about the image input.

      They didn’t ask about a model that can create images while thinking. That’s an entirely unrelated topic.