Comment by empath-nirvana

1 year ago

The reason it can't do that is that, for example, "twenty" and "20" are nearly identical in the vector embedding space, and the model can't really distinguish them in most contexts. That's generally true for any task that depends on how the words look rather than what they mean. Any kind of meta request like that is going to be very difficult for an LLM, though a multi-modal GPT model should be able to handle it.
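
To make the point a bit more concrete, here's a rough sketch of how you could check that claim yourself. It assumes the sentence-transformers package and the all-MiniLM-L6-v2 model, neither of which is mentioned above, and it measures embedding closeness with plain cosine similarity rather than anything specific to how an LLM actually processes tokens:

```python
# Sketch: compare the embeddings of "twenty", "20", and an unrelated word.
# Assumption: sentence-transformers is installed and the named model is used;
# results will differ with other embedding models.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(["twenty", "20", "banana"], convert_to_tensor=True)

# If the claim holds, ("twenty", "20") should score much higher than
# ("twenty", "banana").
print(util.cos_sim(emb[0], emb[1]).item())  # "twenty" vs "20"
print(util.cos_sim(emb[0], emb[2]).item())  # "twenty" vs "banana"
```

The exact numbers depend on the model, but the gap between the two pairs is what illustrates the "words that mean the same thing sit close together" point, which is exactly why "how the word looks" tasks are hard.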