Comment by jononor

1 month ago

Being _perceived_ as having the best LLM/chatbot is a billion-dollar game now, and it is an ongoing race at breakneck speed. These companies are likely gaming the metrics in any and all ways they can. Of course, many people are probably also working on genuine improvements. And at the frontier it can be very difficult to separate a "hack" from genuinely better generalized performance. But genuine improvement is much harder, so it may already be the minority in terms of practical impact.

It is a big problem, for researchers at least, that we/they do not know what is in the training data or how that process works. Figuring out whether (for example) data leaks or overeager preference tuning caused performance to improve on a given task is extremely difficult with these giganormous black boxes.

You have potentially billions of dollars to gain and no way to be found out… it’s a good idea to initially assume there’s cheating and work back from there.

  • It’s not quite as bad as “no way to be found out”: there are evals that can suss out contamination or training on the test set (see the sketch below). Science means using every available means to disprove, though. Incredible claims require incredible evidence, etc.
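
To make that concrete, here is a minimal sketch of one such eval: an n-gram overlap check that flags benchmark items appearing near-verbatim in a sample of training data. Everything here (the function names, word-level tokenization, the 13-gram default) is an illustrative assumption, not any lab's actual pipeline, though 13-gram overlap has been used in published decontamination write-ups.

```python
# Minimal sketch of an n-gram overlap contamination check, one simple
# flavor of the decontamination evals mentioned above. Names, word-level
# tokenization, and the 13-gram default are illustrative assumptions.

def ngrams(text: str, n: int = 13) -> set:
    """Word-level n-grams of `text`; 13 is a window size commonly
    reported in decontamination write-ups."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(test_item: str, corpus_docs: list, n: int = 13) -> float:
    """Fraction of the test item's n-grams that also occur in the corpus
    sample; a score near 1.0 suggests a near-verbatim leak."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)

if __name__ == "__main__":
    # Toy example: the benchmark item appears verbatim inside the corpus
    # document, so every one of its 5-grams matches and the score is 1.0.
    corpus = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    item = "quick brown fox jumps over the lazy dog near the river bank"
    print(contamination_score(item, corpus, n=5))  # 1.0 -> flag for manual audit
```

In practice you would run something like this over every benchmark question against a crawled-corpus sample and hand-audit anything above a threshold; the hard part, per the comment above, is that outsiders have no access to the actual training corpus.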