Comment by wongarsu

3 hours ago

A major limitation is that they only test GPT-4o. Previous research investigating the same question, such as [1], has shown significant differences between models, and even differences depending on the language of your prompt.