
Comment by qarl

20 days ago

> Still a really cool project!

Yeah. This test sorta definitely proves that AI is legit, despite the millions of people still insisting it's a hoax.

The fact that the optimizations aren't as good as the 40-year-old GCC project? Eh - I think people who focus on that are probably still in some serious denial.

It's amazing that it "works", but viability is another issue.

It cost $20,000 and it worked, but it's also totally possible to spend $20,000 and have Claude shit out a pile of nonsense. You won't know until you've finished spending the money whether it will fail or not. Anthropic doesn't sell a contract that says "We'll only bill you if it works" like you can get from a bunch of humans.

Do catastrophic bugs exist in that code? Who knows, it's 100,000 lines, it'll take a while to review.

On top of that, Anthropic is losing money on it.

All of those things combined, viability remains a serious question.

  • > You won't know until you've finished spending the money whether it will fail or not.

    How do you conclude that? You start off with a bunch of tests and build these things incrementally, why would you spend 20k before realizing there’s a problem?

    • Because literally no real-world non-research project starts with "we have an extremely comprehensive test suite and a specification complete down to the minutest detail" and then searches for a way to turn it into code.

      5 replies →

  • > It cost $20,000

    I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

    You should look it up. :)

    • > > It cost $20,000

      > I'm curious - do you have ANY idea what it costs to have humans write 100,000 lines of code???

      I'll bite - I can write you an unoptimised C compiler that emits assembly for $20k, and it won't be 100k lines of code (maybe 15k, the last time I did this?).

      It won't take me a week, though.

      I think this project is a good frame of reference and matches my experience - vibing with AI is sometimes more expensive than doing it myself, and always results in much more code than necessary.

      22 replies →

    • That's irrelevant in this context, because it's not "get the humans to make a working product OR get the AI to make a working product"

      The problem is you may pay $20K for gibberish, then try a second time, fail again, and then hire humans.

      Coincidentally, yes, I am aware: my last contract was building out a SCADA module that AI had failed to develop at the company that contracted me.

      I'm using that money to finance a new software company, and so far, AI hasn't been much help getting us off the ground.

      Edit: oh yeah, and on top of paying Claude to fuck it up, you still have to pay the salary of the guy arguing with Claude.

      1 reply →

    • You wouldn’t pay a human to write 100k LOC. Or at least you shouldn’t. You’d pay a human to write a working useful compiler that isn’t riddled with copyright issues.

      If you didn’t care about copying code, usefulness, or correctness you could probably get a human to whip you up a C compiler for a lot less than $20k.

      16 replies →

    • If my devs are writing that much code, they're doing something wrong. Lines of code is an anti-metric. That used to be commonly accepted knowledge.

    • It really depends on the human and the code they output.

      I can get my 2-year-old child to output 100k LoC, but it won't be very good.

      2 replies →

    • Well, if these humans can cheat by taking whatever degree of liberty in copying to fit the budget, I guess a simple `git clone https://gcc.gnu.org/git/gcc.git SomeLocalDir` is as close to $0 as one can hope to get. And it would end up being far more functional and reliable. But I get that big-corp overlords and their wanna-match-KPI minions would prefer a "clean-roomed" code base.

  • > On top of that, Anthropic is losing money on it.

    It seems they are *not* losing money on inference: https://bsky.app/profile/steveklabnik.com/post/3mdirf7tj5s2e

    • No, and that is widely known. The actual problem is that the margins at that scale are not sufficient to make up for the gargantuan cost of training their SOTA model.

      3 replies →

    • That's for the API, right? The subscriptions are still a loss. I don't know which of the two is larger.

  • That's a good point! Here, Claude Opus wrote a C compiler. Outrageously cool.

    Earlier today, I couldn't get Opus to replace useEffect-triggered-Redux-dispatch nonsense with react-query calls. I already had a very nice react-query wrapper with tons of examples. But it just couldn't make sense of the useEffect Rube Goldberg machine.

    To be fair, it was a pretty horrible mess of useEffects. But just another data point.

    Also I was hoping opus would finally be able to handle complex typescript generics, but alas...

  • It's $20,000 in 2026; with the price of tokens halving every year (at a given performance level), this will be around $1,250 by 2030.

  • Also, heaven knows if the result is maintainable or easy to change.

  • > On top of that, Anthropic is losing money on it

    This has got to be my favorite of the arguments that keep coming up in comments... You know who else was losing money in the beginning? Every successful company that ever existed! Some, like Uber, were losing billions for a decade. And when was the last time you rode in a taxi? (I still do; my kid never will.) Not sure how old you are and whether you remember "Facebook will never be able to monetize on mobile..." They all lose money, until they don't.

  > This test sorta definitely proves that AI is legit.

This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

The "out of distribution" test would be something like "implement a (self-bootstrapping, Linux-kernel-compatible) C compiler in J." J is different enough from C, and I know of no such compiler.

  • > This is an "in distribution" test. There are a lot of C compilers out there, including ones with git history, implemented from scratch. "In distribution" tests do not test generalization.

    It's still really, really impressive though.

    Like, economics aside this is amazing progress. I remember GPT3 not being able to hold context for more than a paragraph, we've come a long way since then.

    Hell, I remember bag of words being state of the art when I started my career. We have come a really, really, really long way since then.

    • > It's still really, really impressive though.

      Do we know how many attempts were made to create such a compiler during previous tests? Would Anthropic report on the failed attempts? Could this "really, really impressive" thing just be the result of luck?

      Much like quoting Quake code almost verbatim not so long ago.

      4 replies →

  • There are two compilers that can handle the Linux kernel: GCC and Clang/LLVM. Neither is written in Rust. It's "in distribution" only if you really stretch the meaning of the term. A generic C compiler isn't going to be anywhere near the level of rigour of this one.

    • There is tinycc; that makes it three compilers.

      There is a C compiler implemented in Rust from scratch: https://github.com/PhilippRados/wrecc/commits/master/?after=... (the very beginning of commit history)

      There are several C compilers of comparable quality written in Rust from scratch.

      We do not know whether Anthropic has a closed-source C compiler written in Rust in their training data. We also do not know whether Anthropic validated their models on their ability to implement a C compiler from scratch before releasing this experiment.

      The language J I proposed does not have any C compiler implemented in it at all. Idiomatic J expertise is scarce and expensive, so it would be a significant expense for Anthropic to get a C compiler in J into their training data. Being Turing-complete, J can express all the typical tips and tricks from compiler books, albeit in an unusual way.

      1 reply →

How does spending $20K to replicate code available in the thousands online (toy C compilers) prove anything? It comes with a bunch of caveats about things that don't work, it requires a bunch of other tools to do stuff, and an experienced developer had to guide it pretty heavily to even get that lackluster result.

Only if we take them at their word. I remember thinking things were in a completely different state when Amazon had their Go stores, but then finding out it was thousands of people in Pakistan just watching you via camera.

I will write you a C compiler by hand for $19k, and it will be better than what Claude made.

Writing a toy C compiler isn't that hard. Any decent programmer can write one in a few weeks or months. The optimizations are actually the interesting part, and Claude fails hard at that.

> optimizations aren't as good as the 40 year gcc project

with all optimizations disabled:

> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.

  • That distinction doesn't change my point. I am not surprised that a 40 year old project generates better code than this brand new one.

    • Not only is it new, there has been zero performance optimization done. Well, none prompted for, at least. Once you give the agents a profiler and start a loop focused on performance, you'll see it start improving.

      3 replies →

It is legit - with some pretty severe caveats. I am hard-pressed to come up with an example that has more formal specification, more published source implementations, and more public unit-test coverage than a C compiler.

It is not feasible that someone will use AI to tackle genuinely new software and provide a tenth of the guardrails Anthropic had for this project. They were able to keep the million monkeys at their million typewriters on an extremely short leash, and to have them do the vast majority of iteration without human intervention.

It cost $20,000 to reinvent a wheel that it probably trained on. If that's your definition of legit, sure.

  • Well, if today it is a matter of cost, tomorrow it won't be anymore. 4 GB of RAM in the '80s would have cost tens of millions of dollars; now even your car runs 4 GB of memory just for the infotainment system, and dozens of GB for the most complex assistants. So I would see this achievement more as a warning: the final result is not what's concerning, it's the portent behind it.

The full source of several compilers being in its training set is somewhat helpful, though. It's not exactly a novel problem, and the optimizations and edge cases it is seemingly struggling with are the overwhelming majority of the work anyway.

Do we know it didn't just shuffle GCC's source code around a bit?