Comment by wongarsu
3 hours ago
A major limitation is that they only test GPT-4o. Previous research like [1] investigating the same question has shown significant differences between models, and even between prompt languages.