← Back to context

Comment by gmueckl

6 hours ago

The result is hardly a clean room implementation. It was rather a brute force attempt to decompress fuzzily stored knowledge contained within the network and it required close steering (using a big suite of tests) to get a reasonable approximation to the desired output. The compression and storage happened during the LLM training.

Prove this statement wrong.

Nobody disputes that the LLM was drawing on knowledge in its training data. Obviously it was! But you'll need to be a bit more specific with your critique, because there is a whole spectrum of interpretations, from "it just decompressed fuzzily-stored code verbatim from the internet" (obviously wrong, since the Rust-based C compiler it wrote doesn't exist on the internet) all the way to "it used general knowledge from its training about compiler architecture and x86 and the C language."

Your post is phrased like it's a two sentence slam-dunk refutation of Anthropic's claims. I don't think it is, and I'm not even clear on what you're claiming precisely except that LLMs use knowledge acquired during training, which we all agree on here.

  • "clean room" usually means "without looking at the source code" of other similar projects. But presumably the AIs training data would have included GCC, Clang, and probably a dozen other C compilers.

    • Suppose you the human are working on a clean room implementation of C compiler, how do you go about doing it? Will you need to know about: a) the C language, and b) the inner working of a compiler? How did you acquire that knowledge?

  • The result is a fuzzy reproduction of the training input, specifically of the compilers contained within. The reproduction in a different, yet still similar enough programming language does not refute that. The implementation was strongly guided by a compiler and a suite of tests as an explicit filter on those outputs and limiting the acceptable solution space, which excluded unwanted interpolations of the training set that also result from the lossy input compression.

    The fact that the implementation language for the compiler is rust doesn't factor into this. ML based natural language translation has proven that model training produces an abstract space of concepts internally that maps from and to different languages on the input and output side. All this points to is that there are different implicitly formed decoders for the same compressed data embedded in the LLM and the keyword rust in the input activates one specific to that programming language.

    • Thanks for elaborating. So what is the empirically-testable assertion behind this… that an LLM cannot create a (sufficiently complex) system without examples of the source code of similar systems in its training set? That seems empirically testable, although not for compilers without training a whole new model that excludes compiler source code from training. But what other kind of system would count for you?

> Prove this statement wrong.

If all it takes is "trained on the Internet" and "decompress stored knowledge", then surely gpt3, 3.5, 4, 4.1, 4o, o1, o3, o4, 5, 5.1, 5.x should have been able to do it, right? Claude 2, 3, 4, 4.1, 4.5? Surely.

  • Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do. Because most are trained, on that quite small product - its 20kb.

    But reimplementing that isn't impressive, because its not a clean room implementation if you trained on that data, to make the model that regurgitates the effort.

    • > Well, "Reimplement the c4 compiler - C in four functions" is absolutely something older models can do.

      Are you sure about that? Do you have some examples? The older Claude models can’t do it according to TFA.

  • This comparison is only meaningful with comparable numbers of parameters and context window tokens. And then it would mainly test the efficiency and accuracy of the information encoding. I would argue that this is the main improvement over all model generations.

  • Are you really asking for "all the previous versions were implemented so poorly they couldn't even do this simple, basic LLM task"?

    • Please look at the source code and tell me how this is a "simple, basic LLM task".

  • Perhaps 4.5 could also do it? We don’t know really until we try. I don’t trust the marketing material as much. The fact that the previous version (smaller versions) couldn’t or could do it does not really disprove that claim.

Even with 1 TB of weights (probable size of the largest state of the art models), the network is far too small to contain any significant part of the internet as compressed data, unless you really stretch the definition of data compression.

  • This sounds very wrong to me.

    Take the C4 training dataset for example. The uncompressed, uncleaned, size of the dataset is ~6TB, and contains an exhaustive English language scrape of the public internet from 2019. The cleaned (still uncompressed) dataset is significantly less than 1TB.

    I could go on, but, I think it's already pretty obvious that 1TB is more than enough storage to represent a significant portion of the internet.

  • A lot of the internet is duplicate data, low quality content, SEO spam etc. I wouldn't be surprised if 1 TB is a significant portion of the high-quality, information-dense part of the internet.

  • This is obviously wrong. There is a bunch of knowledge embedded in those weights, and some of it can be recalled verbatim. So, by virtue of this recall alone, training is a form of lossy data compression.

I challenge anyone to try building a C compiler without a big suite of tests. Zig is the most recent attempt and they had an extensive test suite. I don't see how that is disqualifying.

If you're testing a model I think it's reasonable that "clean room" have an exception for the model itself. They kept it offline and gave it a sandbox to avoid letting it find the answers for itself.

Yes the compression and storage happened during the training. Before it still didn't work; now it does much better.

  • The point is - for a NEW project, no one has an extensive test suite. And if an extensive test suite exists, it's probably because the product that uses it also exists, already.

    If it could translate the C++ standard INTO an extensive test suite that actually captures most corner cases, and doesn't generate false positives - again, without internet access and without using gcc as an oracle, etc?