Comment by edude03

9 months ago

How do we know it generates the images itself and isn’t passing the text to DALL-E? That’s supposedly how the current GPT-4 model does listen mode (with Whisper, but same idea).

Two reasons: the capabilities shown are way beyond what DALL-E can do, and they've been clear that this "omni" model by the "omni team" is natively multimodal.