Comment by echelon
6 months ago
This has me curious about ARC-AGI.
Would it have been possible for OpenAI to game ARC-AGI by seeing the first few examples, quickly Mechanical Turking a training set, fine-tuning their model, and then proceeding with the rest of the evaluation?
Are there other tricks they could have pulled?
It feels like unless a model is deployed to an impartial evaluator's completely air-gapped machine, there's a ton of room for shenanigans, dishonesty, and outright cheating.
> This has me curious about ARC-AGI
In the o3 announcement video, the president of ARC Prize said they'd be partnering with OpenAI to develop the next benchmark.
> mechanical turking a training set, fine tuning their model
You don't need Mechanical Turking here. You can use an LLM to generate a lot more data that's similar to the official training data, and then train on that. It sounds like pulling yourself up by your bootstraps, but it isn't. An approach that does this has been published, and it seems to scale very well with the amount of generated training data (they won the 1st paper award).
I know nothing about LLM training, but do you mean there is a solution to the problem of LLMs gaslighting each other? Sure, this is a proven way of getting training data, but you cannot get theorems and axioms right just by generating different versions of them.
This is the paper: https://arxiv.org/abs/2411.02272
They won the 1st paper award: https://arcprize.org/2024-results
In their approach, the LLM generates inputs (images to be transformed) and solutions (Python programs that do the image transformations). The output images are created by applying the programs to the inputs.
So there's a constraint on the synthetic data here that keeps it honest -- the Python interpreter.
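A minimal sketch of that constraint, in case it helps (the stand-in functions and the toy transformation are hypothetical, not the paper's actual code): the generated program is the ground truth, so executing it on generated inputs yields outputs that are consistent by construction, and anything the program can't handle gets discarded.

    # Sketch: execution-grounded synthetic data for ARC-style tasks.
    # An LLM proposes input grids plus a transformation program; running
    # the program yields outputs that are correct by construction.
    # (Hypothetical stand-in functions, not the paper's actual pipeline.)

    def llm_generate_program() -> str:
        # Stand-in for an LLM call; returns a transformation as Python source.
        return (
            "def transform(grid):\n"
            "    # flip the grid vertically\n"
            "    return grid[::-1]\n"
        )

    def llm_generate_inputs() -> list:
        # Stand-in for an LLM call; returns candidate input grids.
        return [[[1, 0], [0, 2]], [[3, 3, 0], [0, 1, 1]]]

    def make_examples() -> list:
        namespace = {}
        exec(llm_generate_program(), namespace)  # the interpreter keeps it honest
        transform = namespace["transform"]
        examples = []
        for grid in llm_generate_inputs():
            try:
                examples.append({"input": grid, "output": transform(grid)})
            except Exception:
                pass  # discard inputs the program can't handle
        return examples

    print(make_examples())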
I believe the paper being referenced is “Scaling Data-Constrained Language Models” (https://arxiv.org/abs/2305.16264).
For correctness, you can use a solver to verify generated data.
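For example (a sketch; sympy as the solver and the sample problems are my assumptions, not from the thread), you can keep a generated problem/answer pair only when a symbolic solver reproduces the claimed answer:

    # Sketch: solver-verified synthetic math data. sympy is my choice of
    # solver here; the candidate problems stand in for LLM output.
    import sympy as sp

    x = sp.symbols("x")

    def solver_verified(equation, claimed_roots) -> bool:
        # Keep a generated (problem, answer) pair only if the solver agrees.
        return set(sp.solve(equation, x)) == set(claimed_roots)

    # Hypothetical LLM-generated samples: (equation, claimed solutions).
    candidates = [
        (sp.Eq(x**2 - 5*x + 6, 0), [2, 3]),  # correct -> kept
        (sp.Eq(x**2 - 5*x + 6, 0), [2, 4]),  # wrong -> discarded
    ]

    training_data = [c for c in candidates if solver_verified(*c)]
    print(training_data)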
> OpenAI to have gamed ARC-AGI by seeing the first few examples
Not just a few examples. o3 was evaluated on the "semi-private" test set, which had already been used to evaluate earlier OpenAI models, so OpenAI had access to it for a long time.
On their benchmark results, there's a "tuned" tag attached to the o3 score. I guess we need them to tell us exactly what that tag means before we can gauge how much tuning was involved.