Comment by Jeff_Brown
2 years ago
There seems to be a small error in the reported results: in most rows the better-performing model is highlighted, but in the row reporting results for the FLEURS test, it is the losing model that is highlighted (Gemini, which scored 7.6%, while GPT-4V scored 17.6%).
That row says lower is better. For "word error rate", lower is definitely better.
But they also used Whisper Large-v3, which I have never seen outperform Large-v2 in a single case. I have no idea why OpenAI even released Large-v3.
The text beside it says "Automatic speech recognition (based on word error rate, lower is better)"
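For anyone unfamiliar with the metric: word error rate is the word-level edit distance (insertions + deletions + substitutions) between the model's transcript and the reference, divided by the number of reference words, which is why lower means better. A minimal sketch (the function name and example sentences are just for illustration, not from the report):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[len(ref)][len(hyp)] / len(ref)

# One dropped word out of six reference words -> WER of 1/6.
print(wer("the cat sat on the mat", "the cat sat on mat"))
```

So a 7.6% WER means roughly 7.6 word errors per 100 reference words, versus about 17.6 per 100 for the other model.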