
Comment by qarl

20 days ago

> Still a really cool project!

Yeah. This test sorta definitely proves that AI is legit, despite the millions of people still insisting it's a hoax.

The fact that the optimizations aren't as good as the 40-year-old GCC project? Eh - I think people who focus on that are probably still in some serious denial.

It's amazing that it "works", but viability is another issue.

It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.

Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.

On top of that, Anthropic is losing money on it.

All of those things combined, viability remains a serious question.

  • > You won't know until you've finished spending the money whether it will fail or not.

    How do you conclude that? You start off with a bunch of tests and build these things incrementally, why would you spend 20k before realizing there’s a problem?

    • Because literally no real-world non-research project starts with "we have an extremely comprehensive test suite and a specification complete down to the minutest detail" and then searches for a way to turn it into code.

      5 replies →

  • > It cost $20,000

    I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

    You should look it up. :)

    • > > It cost $20,000

      > I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

      I'll bite - I can write you an unoptimised C compiler that emits assembly for $20k, and it won't be 100k lines of code (maybe 15k, the last time I did this?).

      It won't take me a week, though.

      I think this project is a good frame of reference and matches my experience - vibing with AI is sometimes more expensive than doing it myself, and always results in much more code than necessary.

      22 replies →

    • That's irrelevant in this context, because it's not "get the humans to make a working product OR get the AI to make a working product"

      The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.

      Coincidentally, yes, I am aware: my last contract was building out a SCADA module that AI had failed to develop at the company that contracted me.

      I'm using that money to finance a new software company, and so far, AI hasn't been much help getting us off the ground.

      Edit: oh yeah, and on top of paying Claude to fuck it up, you still have to pay the salary of the guy arguing with Claude.

      1 reply →

    • You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.

      If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.

      16 replies →

    • If my devs are writing that much code, they're doing something wrong. Lines of code is an anti-metric. That used to be commonly accepted knowledge.

    • It really depends on the human and the code they output.

      I can get my 2-year-old child to output 100k LoC, but it won't be very good.

      2 replies →

    • Well, if these humans can cheat by taking whatever degree of liberty in copying to fit the budget, I guess a simple `git clone https://gcc.gnu.org/git/gcc.git SomeLocalDir` is as close to $0 as one can hope to get. And it would end up being far more functional and reliable. But I get that big-corp overlords and their wanna-match-KPI minions would prefer a "clean-roomed" code base.

  • > On top of that, Anthropic is losing money on it.

    It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e

    • No, and that is widely known. The actual problem is that the margins at that scale are not sufficient to make up for the gargantuan cost of training their SOTA model.

      3 replies →

    • That's for the API, right? The subscriptions are still a loss. I don't know which of the two is larger.

  • That's a good point! Here, Claude Opus wrote a C compiler. Outrageously cool.

    Earlier today, I couldn't get Opus to replace useEffect-triggered-Redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect Rube Goldberg machine.

    To be fair, it was a pretty horrible mess of useEffects. But just another data point.

    Also I was hoping opus would finally be able to handle complex typescript generics, but alas...

  • It's $20,000 in 2026; with the price of tokens halving every year (at a given performance level), this will be around $1,250 by 2030.

  • Also, heaven knows if the result is maintainable or easy to change.

  • > On top of that, Anthropic is losing money on it

    This has got to be my favorite of the arguments that keep coming up in comments... You know who else was losing money in the beginning? Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and whether you remember "Facebook will never be able to monetize on mobile..." They all lose money, until they don't.

  > This test sorta definitely proves that AI is legit.

This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

The "out of distribution" test would be something like "implement a (self-bootstrapping, Linux-kernel-compatible) C compiler in J." J is different enough from C, and I know of no such compiler.

  • > This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

    It's still really, really impressive though.

    Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

    Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.

    • > It's still really, really impressive though.

      Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing just be the result of luck?

      Much like quoting Quake code almost verbatim not so long ago.

      4 replies →

  • There are two compilers that can handle the Linux kernel: GCC and Clang/LLVM. Neither is written in Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.

    • There is tinycc; that makes it three compilers.

      There is a C compiler implemented in Rust from scratch: https://github.com/PhilippRados/wrecc/commits/master/?after=... (the very beginning of commit history)

      There are several C compilers of comparable quality written in Rust from scratch.

      We do not know whether Anthropic has a closed-source C compiler written in Rust in their training data. We also do not know whether Anthropic validated their models on their ability to implement a C compiler from scratch before releasing this experiment.

      The language J I proposed does not have any C compiler implemented in it at all. Idiomatic J expertise is scarce and expensive, so it would be a significant expense for Anthropic to get a C compiler in J into their training data. Being Turing-complete, J can express all the typical tips and tricks from compiler books, albeit in an unusual way.

      1 reply →

How does spending $20K to replicate code available in the thousands online (toy C compilers) prove anything? It comes with a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.

Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their Go stores, but then finding out it was thousands of people in Pakistan just watching you via camera.

I will write you a C compiler by hand for $19k, and it will be better than what Claude made.

Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are actually the interesting part, and Claude fails hard at that.

> optimizations aren't as good as the 40 year gcc project

with all optimizations disabled:

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

  • That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.

    • Not only is it new, there has been zero performance optimization done. Well, none prompted for, at least. Once you give the agents a profiler and start a loop focused on performance, you'll see it start improving.

      3 replies →

It is legit - with some pretty severe caveats. I am hard-pressed to come up with an example that has more formal specification, more published source implementations, and more public unit-test coverage than a C compiler.

It is not feasible that someone will use AI to tackle genuinely new software and provide a tenth of the guardrails Anthropic had for this project. They were able to keep the million monkeys at their million typewriters on an extremely short leash, and to have them do the vast majority of iteration without human intervention.

It cost $20,000 to reinvent a wheel that it probably trained on. If that's your definition of legit, sure.

  • Well, if today it is a matter of cost, tomorrow it won't be anymore. 4 GB of RAM in the '80s would have cost tens of millions of dollars; now even your car runs 4 GB of memory just for the infotainment system, and dozens of GB for the most complex assistants. So I would see this achievement more as a warning: the final result is not what's concerning, it's the portent behind it.

The full source of several compilers being in its training set is somewhat helpful, though. It's not exactly a novel problem, and the optimizations and edge cases it is seemingly struggling with are the overwhelming majority of the work anyway.

Do we know it didn't just shuffle GCC's source code around a bit?