Comment by Jeff_Brown

2 years ago

There seems to be a small error in the reported results: in most rows the better-performing model is highlighted, but in the row reporting results on the FLEURS test, the losing model (Gemini, which scored 7.6% while GPT-4V scored 17.6%) is highlighted.

That row says lower is better. For "word error rate", lower is definitely better.

But they also used Whisper Large-v3, which I have never seen outperform Large-v2 in even a single case. I have no idea why OpenAI even released Large-v3.

The text beside the row says "Automatic speech recognition (based on word error rate, lower is better)".
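For reference, word error rate is the word-level edit distance (substitutions + deletions + insertions) between the system transcript and the reference, divided by the number of reference words, which is why lower is better. A minimal sketch (hypothetical helper, not the benchmark's actual scoring code):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,        # deletion
                d[i][j - 1] + 1,        # insertion
                d[i - 1][j - 1] + cost, # substitution or match
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6 ≈ 0.167.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

On this scale, Gemini's reported 7.6% means fewer transcription errors than GPT-4V's 17.6%.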