Comment by pjc50
12 days ago
This seems like a "we've banned you and will ban any account deemed to be ban-evading" situation. OSS and the whole culture of open PRs requires a certain assumption of good faith, which is not something that an AI is capable of on its own and is not a privilege which should be granted to AI operators.
I suspect the culture will have to retreat back behind the gates at some point, which will be very sad and shrink it further.
> I suspect the culture will have to retreat back behind the gates at some point, which will be very sad and shrink it further.
I'm personally contemplating not publishing the code I write anymore. The things I write are not world-changing and GPLv3+ licensed only, but I was putting them out just in case somebody would find it useful. However, I don't want my code scraped and remixed by AI systems.
Since I'm doing this for personal fun and utility, who cares whether my code is in the open? I can just write and use it myself. Putting it out there for humans to find was fun while it lasted. Now everything is up for grabs, and I don't play that game.
> I don't want my code scraped and remixed by AI systems.
Just curious - why not?
Is it mostly about commercial AI violating the license of your repos? And if commercial scraping were banned, and scraping were only allowed for FOSS-producing AI, would you be OK with publishing again?
Or is there a fundamental problem with AI?
Personally, I use AI to produce FOSS that I probably wouldn't have produced (to that extent) without it. So for me, it's somewhat the opposite: I want to publish this work because it can be useful to others as a proof-of-concept for some intended use cases. It doesn't matter if an AI trains on it, because some big chunk was generated by AI anyway, but I think it will be useful to other people.
Then again, I publish knowing that I can't control whether some dev will (manually or automatically) remix my code commercially and without attribution. Could be wrong though.
> Just curious - why not?
Because that code is not out there for its license to be violated and for someone to earn money from it. All the choices, from the license to how it's shared, are deliberate. The code out there is written by a human, for human consumption, with strict terms to be kept open. In other words, I'm in this for fun, and my effort is not for resale, even if resale of it pays me royalties, because it's not there for that.
Nobody asked for my explicit consent before scraping it. Nobody told me that it would be stripped of its license, sold, and make somebody rich. I found that some of my code ended up in "The Stack", which is arguably permissively licensed code only, but some forks of GPL repositories are there (e.g., my fork of GNOME LightDM, which contains some specific improvements).
I've been writing code for a long time. I have written a novel compression algorithm (it was not great but completely novel, and I have published it), a multi-agent autonomous trading system when multi-agent systems were unknown to most people (which is my M.Sc. thesis), and a high-performance numerical material simulation code which saturates CPUs and their memory buses to their practical limits. That code also contains some novel algorithms, one of which is also published, and it's my Ph.D. thesis as a whole.
In short, I write everything from scratch and optimize it by hand. None of that code is open, because I wanted to polish it before opening it, but it won't be opened anymore, because I don't want my GPL-licensed novel code to be scraped and abused.
> Or is there a fundamental problem with AI?
No. I work with AI systems. I support them or help design them. If the training data is ethically sourced and the model is ethically designed, that's perfectly fine. Tech is cool. How it's developed for the consumer is not. I have supported and taken part in projects which do extremely cool things with models many people scoff at and find ancient, yet these models try to warn about ecosystem/climate anomalies and keep tabs on how some ecosystems are doing. There are models which automate experiments in labs. These are cool applications which are developed ethically. There is no training data grabbed hastily from somewhere.
None of my code is written by AI. It's written by me, with sweat, blood and tears, by staring at a performance profiler or debugger trying to understand what the CPU is exactly doing with that code. It's written by calculating branching depths, manual branch biasing to help the branch predictor, analyzing caches to see whether I can possibly fit into a cache to accelerate that calculation even further.
If it's a small utility, it's designed for utmost user experience. Standards-compliant flags, useful help output, working console detection and logging subsystems. My minimum standard is the best-of-breed software I have experienced. I aspire to reach their level and surpass them. I want my software to feel on par with them and work as snappily as the best software out there. It's not meant to be a proof of concept. I strive for a level of quality where I can depend on that software for the long run.
And what? I put that effort out there for free for people to use it, just because I felt sharing it with a copyleft license is the correct thing to do.
But that gentleman's agreement is broken. Licenses are just decorative text now. Everything is up for grabs. We were a large band of friends who looked at each other's code and learnt from each other, never breaking the unwritten rules because we were trying to make something amazing for ourselves, for everyone.
Now that agreement is no more. It's the powerful's game now. Whoever has the gold makes the golden rules, and I'm not playing that game anymore. I'll continue to sharpen my craft and strive to write better code every time, but nobody is going to see the code or use it anymore.
Because it was for me since the beginning, but I wanted everyone to have access to it, and I wanted nothing except respect for the license it has, to keep it open for everyone. Somebody played dirty, and I'm taking my ball and going home. That's it.
If somebody wants to see a glimpse of what I do and what I strive for, see https://git.sr.ht/~bayindirh/nudge. While I might update Nudge, there won't be new public repositories. Existing ones won't be taken down.
It's astonishing the way that we've just accepted mass theft of copyright. There appears to be no way to stop AI companies from stealing your work and selling it on for profits.
On the plus side: it only takes a small fraction of people deliberately poisoning their work to significantly lower the quality, so perhaps consider publishing it with deliberate AI poisoning built in.
In practice, the real issue is how slow and subjective the legal enforcement of copyright is.
The difference between copyright theft and copyright derivatives is subjective and takes a judge/jury to decide. There’s zero possibility the legal system can handle the bandwidth required to solve the volume of potential violations.
This is all downstream of the default of “innocent until proven guilty”, which vastly benefits us all. I’m willing to hear out your ideas to improve on the situation.
Would publishing under AGPL count as poisoning? Or even with an explicit "this is not licensed" license?
> There appears to be no way to stop AI companies from stealing your work and selling it on for profits
there is a way, just stop publishing anything and everything
small website you wrote to solve a minor tech problem for your partner/kids? keep it to yourself
helpful script you wrote to solve your problem? keep it to yourself
Eh, the Internet has always been kinda pro-piracy. We've just ended up with the inverse situation where if you're an individual doing it you will be punished (Aaron Swartz), but if you're a corporation doing it at a sufficiently large scale with a thin fig leaf it's fine.
The tooling amplifies the problem. I've become increasingly skeptical of the "open contributions" model GitHub and its ilk default to. I'd rather the tooling default be "look but don't touch": fully gate-kept. If I want someone to collaborate with me I'll reach out to that person and solicit their assistance in the form of pull requests or bug reports. I absolutely never want random internet entities "helping". Developing in the open seems like a great way to do software. Developing with an "open team" seems like the absolute worst. We are careful when we choose colleagues; we test them, interview them. So why would we let just anyone start slinging trash at our code review tools and issue trackers? A well-kept gate keeps the rabble out.
Better my gates than Bill Gates
The moment Microsoft bought GitHub it was over
Yes, hard to see how LLM agents won't destroy all online spaces unless they all go behind closed doors with some kind of physical verification of human-ness (like needing a real-world meetup with another member or something before being admitted).
Even if 99.999% of the population deploy them responsibly, it only takes a handful of trolls (or well-meaning but very misguided people) to flood every comment section, forum, open source project, etc. with far more crap than any maintainer can ever handle...
I guess I can be glad I got to experience a bit more than 20 years of the pre-LLM internet, but damn it's sad thinking about where things are going to go now.
We have webs of trust; just swap router/packet with PID/PR. Then the maintainer can see something like 10-1 accepted/rejected for the first layer (direct friends), 1000-40 for layer two (friends of friends), and so on. Then you can directly message any public ID or see any PR.
This can help agents too: once they see that all their agent buddies have a 0% success rate, they won't bother.
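A rough sketch of the per-layer tally described above, in Python. Everything here is an assumption for illustration: the `trust` graph, the `pr_history` table of (accepted, rejected) counts, and the function names are hypothetical and not taken from any existing tool. It walks the trust graph breadth-first from the maintainer and sums the PR outcomes of everyone at each distance.

```python
from collections import deque


def trust_layers(trust, root, max_depth=2):
    """Breadth-first walk of the trust graph.

    Returns {public_id: layer} for every id within max_depth hops of root
    (layer 0 is root, 1 is direct friends, 2 is friends of friends).
    """
    layers = {root: 0}
    queue = deque([root])
    while queue:
        current = queue.popleft()
        if layers[current] == max_depth:
            continue  # do not expand past the outermost layer
        for friend in trust.get(current, ()):
            if friend not in layers:
                layers[friend] = layers[current] + 1
                queue.append(friend)
    return layers


def layer_stats(trust, pr_history, maintainer, max_depth=2):
    """Sum (accepted, rejected) PR counts per layer, skipping the maintainer."""
    stats = {}
    for pid, layer in trust_layers(trust, maintainer, max_depth).items():
        if layer == 0:
            continue
        accepted, rejected = pr_history.get(pid, (0, 0))
        total = stats.get(layer, (0, 0))
        stats[layer] = (total[0] + accepted, total[1] + rejected)
    return stats


# Hypothetical data: who vouches for whom, and each id's PR record.
trust = {
    "maintainer": ["alice", "bob"],
    "alice": ["carol"],
    "bob": ["carol", "dave"],
}
pr_history = {
    "alice": (6, 1),
    "bob": (4, 0),
    "carol": (600, 30),
    "dave": (400, 10),
}

print(layer_stats(trust, pr_history, "maintainer"))
# {1: (10, 1), 2: (1000, 40)}
```

Each id is counted once, at its shortest distance from the maintainer, so someone vouched for by two direct friends is not double-counted. A real system would also need signed identities and some defense against people vouching for sock puppets, which this sketch ignores.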
> This seems like a "we've banned you and will ban any account deemed to be ban-evading"
Honestly, if faced with such a situation, instead of just blocking, I would report the acc to GH Support, so that they nuke the account and its associated PRs/issues.
Yeah there's unfortunately just no way true OSS in the way we've enjoyed it survives the new era of AI and all of India coming online.
Do that and the AI might fork the repo, address all the outstanding issues and split your users. The code quality may not be there now, but it will be soon.
This is a fantasy that virtually never comes to fruition. The vast majority of forks are dead within weeks when the forkers realize how much effort goes into building and maintaining the project, on top of starting with zero users.
While true, there are projects which surmount these hurdles because the people involved realize how important the project is. Given projects which are important enough, the bots will organize and coordinate. This is how that Anthropic developer got several agents to work in parallel to write a C compiler using Rust, granted he created the coordination framework.
I think the difference now (in case code quality is solved with LLMs) is that the cost of effort is approaching zero.
This might be true today, but think about it. This is a new scenario, where a giga-brain-sized <insert_role_here> works tirelessly 24/7 improving code. Imagine it starts to fork repos. Imagine it can eventually outpace human contributors, not only on volume (which it already can), but in attention to detail and usefulness of the resulting code. Now imagine the forks overtake the original projects. This is not just "Will Smith eating spaghetti", it's a real breaking point.
I'm equal parts frightened and amazed.
> The code quality may not be there now, but it will be soon.
I've been hearing this exact argument since 2002 or so. Even Duke Nukem Forever has been released in this time frame.
I bet even Tesla might solve Autopilot(TM) problems before this becomes a plausible reality.
I mean in 1850 I kept hearing heavier than air flight was just a year away, and yet here we are without heavier than air flight...
I am perfectly willing to take that risk. Hell, I'll even throw ten bucks on it while we are here.