Comment by NitpickLawyer
4 hours ago
This is a much more reasonable take than the cursor-browser thing. A few things that make it pretty impressive:
> This was a clean-room implementation (Claude did not have internet access at any point during its development); it depends only on the Rust standard library. The 100,000-line compiler can build Linux 6.9 on x86, ARM, and RISC-V. It can also compile QEMU, FFmpeg, SQLite, postgres, redis
> I started by drafting what I wanted: a from-scratch optimizing compiler with no dependencies, GCC-compatible, able to compile the Linux kernel, and designed to support multiple backends. While I specified some aspects of the design (e.g., that it should have an SSA IR to enable multiple optimization passes) I did not go into any detail on how to do so.
> Previous Opus 4 models were barely capable of producing a functional compiler. Opus 4.5 was the first to cross a threshold that allowed it to produce a functional compiler which could pass large test suites, but it was still incapable of compiling any real large projects.
And the very open points about limitations (and hacks, as cc loves hacks) - a quick encoding sketch after the quotes shows why the first one bites:
> It lacks the 16-bit x86 compiler that is necessary to boot [...] Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase
> It does not have its own assembler and linker;
> Even with all optimizations enabled, it outputs less efficient code than GCC with all optimizations disabled.
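On that 16-bit point: the size blowup follows directly from how the 66/67 prefixes work. A minimal sketch, with hand-encoded x86 bytes of my own for illustration (not code from the article's compiler):

```rust
// Illustration only: standard x86 encodings showing why emitting "32-bit
// style" code in 16-bit real mode via operand/address-size prefixes bloats
// the output. The comparison is my own sketch, not the article's code.
fn main() {
    // Native 16-bit: mov ax, 0x1234 -> opcode + 2-byte immediate (3 bytes)
    let native_16: [u8; 3] = [0xB8, 0x34, 0x12];

    // Same move done the "32-bit way" inside 16-bit mode:
    // 0x66 operand-size prefix + opcode + 4-byte immediate (6 bytes)
    let prefixed_32: [u8; 6] = [0x66, 0xB8, 0x34, 0x12, 0x00, 0x00];

    // Memory operands additionally want the 0x67 address-size prefix, so a
    // code generator that only knows 32-bit operands/addressing pays 1-2
    // prefix bytes plus wider immediates/displacements on nearly every
    // instruction - easily enough to blow past a 32k boot-code limit.
    println!(
        "native 16-bit: {} bytes, prefixed 32-bit: {} bytes",
        native_16.len(),
        prefixed_32.len()
    );
}
```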
Ending with a very down to earth take:
> The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
All in all, I'd say it's a cool little experiment, impressive even with the limitations, and a good test case. As the author says, "The resulting compiler has nearly reached the limits of Opus’s abilities" - yeah, that's fair, but still highly impressive IMO.
> This was a clean-room implementation
This is really pushing it, considering it’s trained on… the internet, with all available C compilers. The work is already impressive enough, no need for such misleading statements.
It's not a clean-room implementation, but not because it's trained on the internet.
It's not a clean-room implementation because of this:
> The fix was to use GCC as an online known-good compiler oracle to compare against
The classical definition of a clean room implementation is something that's made by looking at the output of a prior implementation but not at the source.
I agree that having a reference compiler available is a huge caveat though. Even if we completely put training data leakage aside, they're developing against a programmatic checker for a spec that's already had millions of man hours put into it. This is an optimal scenario for agentic coding, but the vast majority of problems that people will want to tackle with agentic coding are not going to look like that.
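For what it's worth, "GCC as an oracle" presumably means classic differential testing. A minimal sketch of what such a harness might look like - the compiler name `mycc`, the file names, and the exact comparison are my assumptions; the article only says GCC was used as a known-good reference:

```rust
// Hypothetical differential-testing harness. "mycc", "test.c" and the output
// paths are made-up names for illustration; the real harness is not described
// in detail in the article.
use std::process::Command;

fn compile_and_run(compiler: &str, source: &str, out: &str) -> String {
    // Compile `source` with the given compiler into `out`...
    let status = Command::new(compiler)
        .args([source, "-o", out])
        .status()
        .expect("failed to spawn compiler");
    assert!(status.success(), "{compiler} failed on {source}");
    // ...then run the produced binary and capture its stdout.
    let run = Command::new(out).output().expect("failed to run binary");
    String::from_utf8_lossy(&run.stdout).into_owned()
}

fn main() {
    let source = "test.c"; // some generated or hand-written test case
    let reference = compile_and_run("gcc", source, "./a_gcc");
    let candidate = compile_and_run("./mycc", source, "./a_mycc");
    // Any divergence is a ready-made bug report, assuming the test program
    // itself is free of undefined behavior.
    assert_eq!(reference, candidate, "output diverged on {source}");
    println!("{source}: OK");
}
```

Every mismatch points straight at a miscompilation, against a spec that someone else already spent millions of man hours pinning down - which is exactly why this is such a favorable setup for an agent.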
If you read the entire GCC source code and then create a compatible compiler, it's not clean room. Which Opus basically did since, I'm assuming, its training set contained the entire source of GCC. So even if they weren't actively referencing GCC, I think that counts.
I'm using AI to help me code and I love Anthropic, but I choked when I read that in TFA too.
It's anything but a clean-room design. A clean-room design is a very well defined term: "Clean-room design (also known as the Chinese wall technique) is the method of copying a design by reverse engineering and then recreating it without infringing any of the copyrights associated with the original design."
https://en.wikipedia.org/wiki/Clean-room_design
The "without infringing any of the copyrights" contains "any".
We know for a fact that models are extremely good at storing information, at some of the highest compression rates ever achieved. The fact that they typically decompress that information in a lossy way doesn't mean they didn't use that information in the first place.
Note that I'm not saying all AIs do is simply compress/decompress information. I'm saying that, as commenters noted in this thread, when a model was caught spitting out Harry Potter verbatim, there is information being stored.
It's not a clean-room design, plain and simple.
The LLM does not contain a verbatim copy of everything it saw during the pre-training stage. It may remember certain over-represented parts, but otherwise its knowledge, while spanning a huge number of topics, is more like the way you remember things you know very well. And, indeed, if you give it access to the internet or the source code of GCC and other compilers, it will implement such a project N times faster.
We all saw verbatim copies in the early LLMs. They "fixed" it by implementing filters that trigger rewrites on blatant copyright infringement.
It is a research topic for heaven's sake:
https://arxiv.org/abs/2504.16046
So it will copy most of the code while adding subtle bugs
With just a few thousand dollars of API credits you too can inefficiently download a lossy copy of a C compiler!
There seem to still be a lot of people who look at results like this and evaluate them purely based on the current state. I don't know how you can look at this and not realize that it represents a huge improvement over just a few months ago, there have been continuous improvements for many years now, and there is no reason to believe progress is stopping here. If you project out just one year, even assuming progress stops after that, the implications are staggering.
The improvements in tool use and agentic loops have been fast and furious lately, delivering great results. The model growth itself feels more "slow and linear", but what you can do with models as part of an overall system has been improving at a growing rate, and that has been delivering a lot of value. It matters less whether the model can natively keep infinite context or figure things out on its own in one shot, so long as it can orchestrate external tools to achieve that over time.
Every S-curve looks like an exponential until you hit the bend.
We've been hearing this for 3 years now. And 2025 especially was full of "they've hit a wall, no more data, running out of data, plateau this, saturated that". And yet, here we are. Models keep on getting better, at broader tasks, and more useful by the month.
This quote would be more impactful if people hadn't been repeating it since the GPT-4 days.
I have to admit, even if model and tooling progress stopped dead today, the world of software development has forever changed and will never go back.
The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.
Prove this statement wrong.
Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."
Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.
"clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.
> Prove this statement wrong.
If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.
Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained, on that quite small product - its 20kb.
But reimplementing that isn't impressive, because its not a clean room implementation if you trained on that data, to make the model that regurgitates the effort.
This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.
Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?
Perhaps 4.5 could also do it? We don't really know until we try. I don't trust the marketing material as much. The fact that previous (smaller) versions could or couldn't do it does not really disprove that claim.
Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.
This sounds very wrong to me.
Take the C4 training dataset for example. The uncompressed, uncleaned size of the dataset is ~6TB, and it contains an exhaustive English-language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.
I could go on, but I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.
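To make that concrete, here's a back-of-envelope sketch using the rough figures above; the generic-compression ratio is my own assumption, not a measurement of any model or dataset:

```rust
// Back-of-envelope only: the corpus sizes are the rough numbers quoted in
// this thread, and the compression ratio is an assumption, not a measurement.
fn main() {
    let weights_tb = 1.0; // "probable size of the largest state of the art models"
    let c4_cleaned_tb = 0.8; // "significantly less than 1TB" of cleaned text
    let text_compression_ratio = 4.0; // generic compressors get roughly 3-5x on English text (assumption)
    let c4_compressed_tb = c4_cleaned_tb / text_compression_ratio;
    println!(
        "cleaned corpus: ~{c4_compressed_tb:.1} TB after generic compression vs {weights_tb} TB of weights"
    );
    // On these numbers, 1 TB of weights is not obviously "far too small" to
    // encode a meaningful slice of the useful text on the internet.
}
```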
A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.
This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.
I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.
If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.
Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.
The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.
If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc. - that would be a different story.
No one needs to prove you wrong. That’s just personal insecurity trying to justify one's own worth.
> clean-room implementation
Except it's trained on all the source out there, so I assume on GCC and Clang. I wonder how similar the code is to either.
Honestly I don't find it that impressive. I mean, it's objectively impressive that it can be done at all, but it's not impressive from the standpoint of doing stuff that nearly all real-world users will want it to do.
The C specification and Linux kernel source code are undoubtedly in its training data, as are texts about compilers from a theoretical/educational perspective.
Meanwhile, I'm certain most people will never need it to perform this task. I would be more interested in seeing if it could add support for a new instruction set to LLVM, for example. Or perhaps write a compiler for a new language that someone just invented, after writing a first draft of a spec for it.
Are you a frequent user of coding agents?
I ask because, as someone who uses these things every day, the idea that this kind of thing only works because of similar projects in the training data doesn't fit my mental model of how they work at all.
I'm wondering if the "it's in the training data" theorists are coding agent practitioners, or if they're mainly people who don't use the tools.
> Claude did not have internet access at any point during its development
Why is this even desirable? I want my LLM to take into account everything there is out there and give me the best possible output.
It's desirable if you're trying to build a C compiler as a demo of coding agent capabilities without all of the Hacker News commenters saying "yeah but it could just copy implementation details from the internet".