Comment by atherton94027

12 hours ago

The problem is that many of these clean room reimplementations require contributors to not have seen any of the proprietary source. You can't guarantee that with ai because who knows which training data was used

> You can't guarantee that with ai because who knows which training data was used

There are no guarantees in life, but with macOS you can know it is rather unlikely any AI was trained on (recent) Apple proprietary source code – because very little of it has been leaked to the general public – and if it hasn't leaked to the general public, the odds are low any mainstream AI would have been trained on it. Now, significant portions of macOS have been open-sourced – but presumably it is okay for you to use that under its open source license – and if not, you can just compare the AI-generated code to that open source code to evaluate similarity.

It is different for Windows, because there have been numerous public leaks of Windows source code, splattered all over GitHub and other places, and so odds are high a mainstream AI has ingested that code during training (even if only by accident).

But, even for Windows – there are tools you can use to compare two code bases for evidence of copying – so you can compare the AI-generated reimplementation of Windows to the leaked Windows source code, and reject it if it looks too similar. (Is it legal to use the leaked Windows source code in that way? Ask a lawyer–is someone violating your copyright if they use your code to do due diligence to ensure they're not violating your copyright? Could be "fair use" in jurisdictions which have such a concept–although again, ask a lawyer to be sure. And see A.V. ex rel. Vanderhye v. iParadigms, L.L.C., 562 F.3d 630 (4th Cir. 2009))

In fact, I'm pretty sure there are SaaS services you can subscribe to which will do this sort of thing for you, and hence they can run the legal risk of actually possessing leaked code for comparison purposes rather than you having to do it directly. But this is another expense which an open source project might not be able to sustain.

Even for Windows – the vast majority of the leaked Windows code is >20 years old now – so if you are implementing some brand new API, odds of accidentally reusing leaked Windows code is significantly reduced.

Other options: decompile the binary, and compare the decompiled source to the AI-generated source. Or compile the AI-generated source and compare it to the Windows binary (this works best if you can use the exact same compiler, version and options as Microsoft did, or as close to the same as is manageable.)

Are those OSes actually that strict about contributors? That’s got to be impossible to verify and I’ve only seen clean room stuff when a competitor is straight up copying another competitor and doesn’t want to get sued

  • ReactOS froze development to audit their code.[1] Circumstantial evidence was enough to call code not clean. WINE are strict as well. It is impossible to verify beyond all doubt of course.

    [1] https://reactos.org/wiki/Audit

    • We should add that the Windows source leaked, thus ReactOS had to be extra careful regarding contributions.

I’ve been thinking a long time about using AI to do binary decompilation for this exact purpose. Needless to say we’re short of a fundamental leap forward from doing that

I had never thought of this until now. Is the clean-room approach officially done with? I guess we have to wait for a case to be ruled on.