Comment by disgruntledphd2

20 days ago

> This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

It's still really, really impressive though.

Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.

  > It's still really, really impressive though.

Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing be a result of luck?

Much like quoting Quake code almost verbatim not so long ago.

  • > Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing be a result of luck?

    No, we don't, and yeah, we would expect them to only report positive results (this is both marketing and investigation). That being said, they provide all the code and artifacts for people to review.

    I do agree that an out of distribution test would be super helpful, but given what we know about LLMs it will almost certainly fail, so I'm not too pushed about it.

    Look, I'm pretty sceptical about AI boosting, but this is a much better attempt than the windsurf browser thing from a few months back and it's interesting to know that one can get this to work.

    I do note that the article doesn't talk much about all the harnesses needed to make this work, which, assuming this approach is plausible, is the kind of thing that will be needed to make demonstrations like this more useful.

    •   > No, we don't, and yeah, we would expect them to only report positive results (this is both marketing and investigation).
      

      This is a matter of methodology. If they train models on that task, or score and select models in any way based on their progress on it, then we have test set leakage [1].

      [1] https://en.wikipedia.org/wiki/Leakage_(machine_learning)

      This question is extremely important because test set leakage leads to impressive-looking results that do not generalize to anything at all.
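      To make the leakage point concrete, here is a minimal, self-contained sketch (synthetic data, hypothetical names, nothing to do with Anthropic's actual setup): tuning even a single decision threshold on the test set itself guarantees a test score at least as good as one tuned honestly on training data, which is exactly why leaked test sets produce impressive-looking numbers.

```python
import random

random.seed(0)

def make_data(n):
    # Binary labels with a weak real signal in a 1-D score, plus noise.
    data = []
    for _ in range(n):
        y = random.randint(0, 1)
        x = 0.3 * y + random.gauss(0, 1)
        data.append((x, y))
    return data

def accuracy(data, t):
    # Classifier: predict 1 when the score exceeds threshold t.
    return sum((x > t) == (y == 1) for x, y in data) / len(data)

def best_threshold(data):
    # Candidate thresholds covering every possible split of `data`.
    xs = sorted(x for x, _ in data)
    candidates = [xs[0] - 1.0] + xs
    return max(candidates, key=lambda t: accuracy(data, t))

train, test = make_data(200), make_data(50)

honest_t = best_threshold(train)  # tuned only on training data
leaky_t = best_threshold(test)    # tuned on the test set itself: leakage

honest_acc = accuracy(test, honest_t)
leaky_acc = accuracy(test, leaky_t)
print(f"honest test accuracy: {honest_acc:.2f}")
print(f"leaky  test accuracy: {leaky_acc:.2f}")
```

      Here the leaky score can only look better, never worse, than the honest one, regardless of the random seed, because the leaky threshold is chosen to maximize accuracy on the very set it is then "evaluated" on. That one-sided bias is what makes leakage hard to spot from reported results alone.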
