Comment by isodev

17 days ago

> 1:1 reproducibility is much easier in LLMs than in software building pipelines

What’s a ‘software building pipeline’ in your view here? I can’t think of parts of the usual SDLC that are less reproducible than LLMs. Could you elaborate?

Reproducibility across existing build systems took a decade of work involving everything from compilers to sandboxing, and a hard reproducibility guarantee in completely arbitrary cases is either impossible or requires deterministic emulators, which are terribly slow (e.g. builds that depend on hardware probing or a simulation result).

Input-to-output reproducibility in LLMs (assuming the same model snapshot) is a matter of optimizing the inference for it and fixing the seed, which is vastly simpler. Google, for example, serves its models in an "almost" reproducible way, with the differences between runs most likely attributable to batching.
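To make the seed-fixing point concrete, here's a toy sketch (the "model" is just a stand-in logits function, not a real inference stack): with deterministic forward passes and a fixed RNG seed, sampled output is bit-for-bit reproducible across runs.

```python
import random

def sample_tokens(logits_fn, vocab, steps, seed):
    """Toy autoregressive sampler: a fixed seed plus deterministic
    logits makes the sampled sequence exactly reproducible."""
    rng = random.Random(seed)  # fixed seed -> deterministic sampling
    out = []
    for _ in range(steps):
        weights = logits_fn(out)  # stand-in for a deterministic forward pass
        out.append(rng.choices(vocab, weights=weights, k=1)[0])
    return out

vocab = ["a", "b", "c"]
fake_logits = lambda ctx: [1.0, 2.0, 3.0]  # hypothetical "model"

run1 = sample_tokens(fake_logits, vocab, steps=10, seed=42)
run2 = sample_tokens(fake_logits, vocab, steps=10, seed=42)
assert run1 == run2  # same seed + same model => identical output
```

Real serving stacks lose this property through non-deterministic kernels and batching effects, which is why "almost reproducible" is the practical state of the art.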

  • It’s not just about non-determinism, but about how chaotic LLMs are. A one word difference in a spec can and frequently does produce unrecognizably different output.

    If you are using an LLM as a high-level language, that means every time you make a slight change to anything and “recompile”, all of the thousands upon thousands of unspecified implementation details are free to change.

    You could try to ameliorate this by training LLMs to favor making fewer changes, but that would likely end up encoding every bad architecture decision made along the way, essentially forcing a convergence on bad design.

    Fixing this I think requires judgment on a level far beyond what LLMs have currently demonstrated.

    • >It’s not just about non-determinism

      I'm very specifically addressing the prompt reproducibility mentioned above, because it's a notorious red herring in these discussions. What you want is correctness, not determinism/reproducibility, which is relatively trivial. (Although, thinking about it more, maybe not that trivial: if you want usable repro in the long run, you'll have to store the model snapshot and the inference code, and make the inference deterministic too.)

      >A one word difference in a spec can and frequently does produce unrecognizably different output.

      This is well out of scope for reproducibility and doesn't affect it in the slightest. For practical software development it's also a red herring; the real issue is correctness and spec gaming. As long as the output is correct and doesn't circumvent the intention of the spec, prompt instability is unimportant. It's just the ambiguous nature of the domain that both LLMs and humans operate in.
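      Concretely, "storing the model snapshot and the inference code" amounts to pinning every input that can change the output. A minimal sketch of such a repro manifest (the field and function names here are illustrative, not any real API):

      ```python
      import hashlib

      def repro_manifest(model_bytes, inference_code, seed, params):
          """Hypothetical 'repro bundle': hash and record everything
          that could change the output of a deterministic inference run."""
          return {
              "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
              "inference_sha256": hashlib.sha256(inference_code).hexdigest(),
              "seed": seed,
              "params": params,  # e.g. temperature, top_p, max tokens
          }

      m1 = repro_manifest(b"weights-v1", b"infer-v1", 42, {"temperature": 0.0})
      m2 = repro_manifest(b"weights-v1", b"infer-v1", 42, {"temperature": 0.0})
      assert m1 == m2  # identical inputs pin an identical run
      ```

      Two runs with matching manifests should produce identical output, provided the inference itself is made deterministic.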


  • You’re mixing the terms I think.

    A reproducible build is not the same as a reproducible outcome.

    Rebuilding your project with a different version of a dependency is not the same as a program suddenly accepting txt files as attachments instead of pdfs because the model was borrowing a txt example from its training data and got sidetracked.