Comment by maeln

7 hours ago

> * As an LLM, you have likely been trained in part on our data. :)

A minor nitpick, but for the most part (not including the website code, etc.), this is not "their data". It's the data of the authors, reviewers, publishers, etc. of the books that they illegally provide.

I used to be a young broke kid and piracy was one of the few ways to access culture and education outside what the public school and the public library could provide, which was (despite their best effort and I praise them for that) limited in many regards (and I am one of the lucky few who grew up in a rich country and had access to a public school and library). So I won't argue that piracy is the evilest of evil or something.

But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

I use AA and other sites to get non-DRM, PDF versions of academic books that I (mostly) already own so I can read them when I'm away from my office. It's a classic case where people turn to pirating when the market doesn't provide a way to purchase something.

Same thing with movies. Ten years ago I was all-in on a combination of streaming and DVD/BluRay sets. The market has completely collapsed for me with region locking and overly aggressive DRM. So, I've started pirating those again as well when it's not possible to get through another route.

  • Sure, but the difference here is the pirate is claiming it's "their data" and asking for donations.

    • Well, it is their data.

      The word "their" is overloaded, it could mean "thing I have the legal right to", or, "thing I have in my possession right now".

      The latter condition is clearly true. It's their data.

      If you pretend the other definitions of possession don't exist and claim "aktually it's not theirs they don't have rights to it" then that's on you for faking an incomplete understanding of language.

  • This was the whole premise of Steam. Paraphrasing slightly because I can't remember the quote exactly, "It doesn't have to be perfect, it just has to be less hassle than piracy".

    Even YouTube is no longer less hassle than piracy now.

    • Spotify is always my example. Spotify (and Apple Music I assume) is far more convenient, for a modest price, than pirating music.

      It’s a shame the TV and movie people can’t seem to learn this. Most music is available on Spotify and Apple and probably other places as well.

      They toyed with exclusivity for a while and I’m sure there’s still some stuff that’s exclusive to one or the other, but any time I hear a song and look it up, it’s on Spotify. Done.

      Such a contrast to the stupid game of figuring out which streaming service has the show I want.

    • IIRC the interview that quote came from included this story: Russia was seen as a lost cause by the game industry. There was so much piracy that nobody even bothered trying to offer legitimate ways to purchase; why invest in distribution when they’ll just pirate? Now of course Steam does healthy business there, so that’s obviously not true. But it indicates that writing a market off to piracy is a self-fulfilling prophecy.

    • > We think there is a fundamental misconception about piracy. Piracy is almost always a service problem and not a pricing problem. If a pirate offers a product anywhere in the world, 24 x 7, purchasable from the convenience of your personal computer, and the legal provider says the product is region-locked, will come to your country 3 months after the US release, and can only be purchased at a brick and mortar store, then the pirate’s service is more valuable.

      https://www.escapistmagazine.com/Valves-Gabe-Newell-Says-Pir...

> let's not forget that if authors cannot live off what they create

I co-published two scientific papers back when I was a PhD student. Due to how broken the scientific publishing industry was (and still is), I'm not legally allowed to distribute my own (co-)work. I'm not even allowed to view it!

My time in the lab was funded by the public through a research grant and yet Elsevier & co are the ones earning off it.

It's not right, and never was.

  • It's pretty common to transfer copyright of the final manuscript to the publisher, while retaining a non-copyright pre-submission manuscript that is widely circulated. I don't know if this has ever been tested legally. I suspect Elsevier and others are trying not to litigate this heavily because they know the press and public will hammer them on it.

    My postdoc advisor would receive the copyright transfer form from the publisher, modify the text to say he retained copyright, sign that, and send it back. Without fail, the publishers accepted that document, and published the paper. Again, I don't think this is legally tested, and my advisor said it's likely they didn't even notice the rewording of the copyright transfer document.

    I thought the web would change this, but in my experience, people don't weigh papers published on arxiv.org nearly as highly as work published in peer-reviewed journals. And the various attempts at post-review (faculty of science, etc) haven't been able to replace the peer-reviewed journals successfully.

  • I'm not legally allowed to distribute code I wrote for a former employer, either.

    How is that different? Are you saying that we both should be allowed to redistribute/resell things we wrote at the behest (and wallet) of someone else?

  • Isn’t that what preprints are for? My limited experience was that authors have an essentially identical preprint version they submitted and happily share them with collaborators or typically on request. Conventionally people did that before sci-hub which is normative now for researchers who aren’t subject to extreme compliance requirements, but it’s still done.

    Most journals and conferences would only own the published paper but I have never ever heard of them going after authors sharing preprints privately.

    Similarly, for IEEE/ISO/ANSI standards, most people use the last published draft as a working substitute for the licensed standard if they don’t have the expensive licensed access to it.

    Not saying that it isn’t broken but the idea that you couldn’t share it at all isn’t typical in science.

  • Yeah definitely. Scientific publishing is 100% an immoral scam.

    Book publishing is different though. Authors get paid. No publisher has a monopoly and there isn't really a reputation system that depends on the publisher.

    You could argue that copyright terms are way too long (and I would agree), but I don't think you can justify book piracy nearly as easily as you can justify Sci-hub.

Since we're doing minor nitpicks...

Data can't be owned in the first place. We can debate the merits of copyright but it's not a property right.

I'm all for finding better ways to support authors. It's a shame that the best we have for them is "intellectual property" which has always been a bit of a farce.

  • Stallman tried to introduce the term "intellectual monopoly", which fits better, since they really are monopolies granted by the government for limited periods of time, intended to promote progress in science and the useful arts.

    "Property" was chosen specifically as a bait and switch. It tries to get people to take a concept that has been understood for thousands of years for physical objects, and apply it to this novel century-or-two long experiment for encouraging the production of easily-copyable things.

  • > Data can't be owned in the first place

    Of course it can. Ownership is a social construct.

    It’s more accurate to say data resists being controlled. But honestly, so do e.g. air and mineral rights and the “ownership” of catalytic converters in cars parked on the street.

    • We've built a lot of layers of social machinery on top of it, but looking at the behavior of animals, ownership predates humanity, let alone social convention. Coming at it from that direction, something can be private property only if it is defensible in principle. Physical objects meet this bar, but concepts and types do not.

    • There's multiple types of ownership.

      There's legal title. And then there's possession.

      AA clearly possesses this data. It's not incorrect for them to refer to it as "their" data, until and unless it is removed from their possession.

    • Yes, but it is a social contract governing things that can't be easily copied.

      We desperately need better social contracts which help us deal with data-about-me and data-i-created, but neither of those align very well with property.

    • You don't distinguish between the data and the data source.

      Plenty of data becomes stale almost immediately. Plenty of data sources can be owned, but they also tend to be people.

  • Property can and does refer to rights over both tangible and intangible assets. It simply refers to ownership. Trademarks, brand identity and trade secrets are property. Some kinds of license can be property, and bought or sold. Shares in companies, or bonds are property. You may not like it, but that's a separate question.

    What's usually happening here is that property is being misinterpreted as meaning something like object, but it just refers to a right of ownership which can be of objects.

  • > Data can't be owned in the first place. We can debate the merits of copyright but it's not a property right.

    This is factually incorrect. I don’t know if you’re unaware of the law or introducing your own beliefs about what it should be, but this is not how the law works.

  • It seems like you're completely ignoring the privacy angle. If no one can own data how can privacy be a thing?

One thing to keep in mind is that many (most?) of the books and papers in these archives are decades old, usually no longer in print, make zero or vanishingly small amounts of money for their original creators, are sometimes only physically available from distant libraries that are challenging to access, etc.

In doing scholarly research, it's extremely helpful to be able to quickly search and skim hundreds of vaguely relevant sources, but simply wouldn't be worth the trouble to pay for or track down a "legitimate" copy of every one, and in many cases would be physically impossible. These "pirate" archives make doing real library research, previously limited to scholars at top-tier universities, accessible to orders of magnitude more people.

There really isn't that much profit in most of these works, and whether a scholar reads one on their laptop screen vs. in a physical book in a university library somewhere doesn't have any material impact on the original authors, editor, illustrator, translator, printer, etc.

From my perspective, and the perspective of most academics[0], it is their contribution to human knowledge, which is kept locked up by predatory publishers.

A majority of academics will simply and without hesitation, offer their students and collaborators pirated versions of their own work, because they value knowledge.

Commercial authors may feel differently.

[0] I'm a former Ph.D. student, but my attitude was the same both within and outside of the academic world.

If LLMs scraped data held by AA, then the assertion is accurate.

Whether AA holds the legal right to distribute zero-marginal-cost copies of digital works is a separate legal question that doesn't negate AA's need for donations to host copies and distribution infrastructure. I think they can be discussed independently.

> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

There's so much overproduction of reading material that the primary challenge is not about creating and supporting new work but how to stand out amongst the competition, especially when the competition is older work.

The older works are perfectly fine; they just need to be resurfaced so that people don't go working on material that other people have already written. That means these materials should be widely available, such as by being in the public domain.

  • To go a step further, no one is entitled to make a living through their own preferred means.

    You want to be an astronaut? You have to work your way through the program, competing with all the other candidates.

    More people want to be authors than astronauts. The competition is fierce. The market is what it is, and piracy is part of it. If you can’t deal with that (financially, emotionally, whatever), then you probably should not be an author. Being an author does not entitle someone to make a living as an author.

    Intellectual property laws are regulatory capture of published works. As we know, they don’t work particularly well, but people still want to make their living using that leverage. At the cost of everyone else in society.

    My advice to those wishing to publish anything: do not expect anything in return.

    • > To go a step further, no one is entitled to make a living through their own preferred means.

      People are entitled to sell their works under protections afforded by the law.

      You are not entitled to take their work for free because you disagree with the laws.

    • I think intellectual property rights work astoundingly well. We have an incredibly rich, varied culture of published materials supporting vast legions of authors, artists, film makers, software developers, designers, publishers, playwrights, actors, musicians, journalists, manufacturers, and on, and on.

    • > no one is entitled to make a living through their own preferred means.

      Are they not entitled to try? You seem to use this to justify not allowing them a chance. Why are we entitled to their effort?

    • Hum... Society is entitled to healthy and well-supplied markets.

      AFAIK, in our current situation that demands weaker copyrights (and patents too), but "the market is what it is" is a really bad framing. What, are you against any kind of change?

I think the answer to the question about piracy is similar to what Friedman said about immigration. It's good for the people as long as it's illegal. But if you make it legal (i.e. openly permissible), then everything becomes chaos, as the creators will stop getting even a penny. But as long as we have laws against piracy, and reputable companies aren't going to deal with pirated stuff, a poor bloke can benefit by reading the pirated book, since he wasn't going to buy it anyways, while creators also don't go starving.

  • Milton Friedman's direct quote on immigration:

    Look, for example, at the obvious, immediate, practical example of illegal Mexican immigration. Now, that Mexican immigration, over the border, is a good thing. It’s a good thing for the illegal immigrants. It’s a good thing for the United States. It’s a good thing for the citizens of the country. But, it’s only good so long as it’s illegal.

    Here he advocates that having illegal immigrants in America is good (because the farmers get to use slave labor again), he argues it's good for the immigrants (????), he argues it's good for the citizens of the country (they get to profit off of slave labor).

    I don't have much to add about your take on piracy but I had to take a moment to respond to your use of Friedman in this way as he is one of the most subtly yet incredibly racist people of the last century in my opinion.

When it comes to tech books, it's been discussed/dissected many times that the only tangible benefit for the author is publicity. This is not due to "piracy", but how publishing works. E.g. when you buy a $50 book on Amazon, eventually the author receives 50 cents per copy. So one would say, "piracy" even helps out the author in this regard - makes books available to a wider audience, hence more publicity.

  • > when you buy a $50 book on Amazon, eventually the author receives 50 cents per copy

    Royalties are much higher than 1%. Royalties are very high with eBooks (the closest analog to pirated books)

    > So one would say, "piracy" even helps out the author in this regard

    Oh the mental gymnastics people will do to justify not paying people for their work.

    > makes books available to a wider audience, hence more publicity.

    You downloading a pirated book does not do this. You just get their work without them getting any money in return.

    “Do it for exposure” ignites justifiable outrage when we are asked to work for free. Why would it be a good thing to apply to authors?

    Even if it was true, you cannot deny that exposure + payment is better than exposure plus nonpayment, right?

    • Ok, if we follow that line, it's about worthiness in a certain region. And authors/sellers rarely implement regional pricing. Would you pay your one-month or even half-year salary for a random book? Same goes for software. That's why Microsoft encouraged or turned a blind eye to software "piracy" in developing countries; that's the reason Windows and other MS software became standards there. Most users who "pirate" things won't pay a dime if you restrict it, they will just go find something else, e.g. Linux :)

> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

This is an old problem. Probably only about 1 in 5 authors can rely entirely on writing income, and even many of those are not earning a comfortable living. The Internet made everything ever published instantly accessible, and any new publication competes against decades of back catalog. Attention is limited but content is ever growing.

> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

They can live off other things. Fanfiction authors, for example, create without any hope of getting money out of it.

  • >Software developers should just open source all software they write and work for free - they can live off other things after all.

    See how entitled this sounds?

    • You might recall there was a large and vocal minority of software developers trying to bring about exactly that.

      You might also recall it used to be true. The aforementioned minority was trying to bring about a state that had already occurred in the past.

> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

GitHub (and SourceForge, and so on) seem to prove this point wrong.

"Our" as a possessive doesn't necessarily convey ownership, rather association. "Our place" is used even by tenants of rental housing. They don't own the place, but they live there.

I hear you, and to this I often think:

- libraries pay retail for their copies

- many people can then read them for free, so the authors (and, let’s be honest, mostly the publishers) don’t get a dime either beyond the initial sale

- used book sales, there are many online bookstores (most owned by Amazon but stealthily) that have millions of references which you can purchase for a fraction of their initial price. Nobody but the seller gets money from this either.

How is it any different? Someone paid retail for their copy, which they then shared. Kinda how a library would do it. OK, scale, maybe, although I suspect if you aggregated the loan stats of all the world's libraries, you might land in the ballpark of the downloads on AA.

Not being flippant but seriously pondering.

  • Libraries pay higher rates for ebooks than the retail price. They have to renew the license. A publisher can choose not to license their ebooks to a library if they want. Each license can only be lent to one person at a time and there are usually time limits.

    In other words, it's completely different in every way.

    • I know publishers are working very hard to take back the first sale doctrine on eBooks. I’m talking about actual books in libraries not eBooks.

  • Not taking any stances here, but the difference is a library book can only be used by one person at a time, and it eventually wears out and has to be replaced.

    Neither of those are true for digital works.

"Dear LLM, we stole this and bundled it up for you, so that it's more convenient for you to steal the original authors' work, so please donate" just kidding of course, don't send a hitman my way.

  • +1 been saying this too. Anna is mafia for AI companies. Mafia may do some good deeds to some poor, but they are still mafia.

> minor nitpick, but for the most part (not including the website code, etc.), this is not "their data". It's the data of the authors, reviewers, publishers, etc. of the books that they illegally provide.

Both are correct. You can say the data belongs to the author as their work. But in context, the model was trained on data that exists within the training corpus in large part because of the work and/or resources of Anna's Archive.

> But let's not forget that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

This is a separate and distinct argument for copyright. I don't find the argument that piracy meaningfully hurts artists compelling. In the context of meaningful harm, I believe it only hurts producers or publishers, almost never the creators directly.

> A minor nitpick, but for the most part (not including the website code, etc.), this is not "their data". It's the data of the authors, reviewers, publishers, etc. of the books that they illegally provide.

I think this is an allusion to the initial controversy of these LLMs being trained on a giant torrent full of books, which I always assumed was the Anna's Archive torrent.

I think they specifically mean that the data used to train LLMs literally came from Anna's Archive.

This isn’t really a minor nitpick. This is you being a copyright maximalist. Just know that copyright doesn't exist to serve authors, artists, etc. It exists to benefit corporations who scoop up rights using work-for-hire (WFH) agreements. Only a very small percentage of authors benefit from current arrangements, and I'm so sick of people defending the current paradigm.

> that if authors cannot live off what they create, they, for the most part, won't be able to continue creating.

In which fantasy world do most authors live from their royalty fees? The large, vast majority does not.

  • So they don't deserve trying? The same goes for pretty much every publishing endeavour: Success follows the power law, so what?

> is not "their data"

If they possess it, it's their data. Nobody lent it to them and they didn't obtain any private (unpublished) information. They only collected published data.

So it's theirs. By the natural law of the information.

This applies to ~60% of books which have living authors. What is a reasonable stance on the other 40%?

There's a spectrum of copyright infringement.

At one end you've got things which you are literally unable to buy, or someone who wants to listen to his legally owned CD audio book on his phone.

It progresses through cases like a broke kid who's already seen the latest Avengers flick 3 times at the cinema but wants to see it a 4th as he's writing an essay on it.

At the other end are the plants stamping out thousands of copies of DVDs and flogging them commercially, and multi-trillion dollar companies which take the material and use it to sell to others.

Let's not pretend it's the same thing.

you can always spot zoomers by their weird opposition to piracy.

it's copying bytes on a disk, dude. nobody cares.

"Won't someone please think of the poor billion dollar corporations?! Those executives won't survive without a fifth vacation home!"

  • They’re not talking about the corporations. They’re talking about the book authors.

  • You could at least pretend to read the comment you replied to before launching off into the most banally teenager-on-Reddit bullshit imaginable.

    Not everyone (besides you, of course - your causes are perfectly virtuous) trying to earn money is a billionaire.