Comment by nwienert

4 days ago

I’ve seen some of the problems before, like https://o3-failed-arc-agi.vercel.app/

It is not hard to build datasets that contain these types of problems, and I would expect LLMs to generalize from them well. I don’t see how this is really any different from any other type of problem LLMs are good at, given they have a dataset to study.

I get that they keep the test updated with secret problems, but I don’t see what stops companies from gaming this just by investing in building their own datasets, even if it means paying teams of smart people to generate them.

The other question is whether training on enough examples of this type of task is helpful and generalizable in some way. If so, why wouldn’t you integrate that dataset into an LLM’s training pipeline?