Comment by nonethewiser

18 hours ago

[flagged]

15 comments

nonethewiser

The websites, music, movies, books, photos, art that they stole didn't appear out of thin air. The amount of time and effort people have collectively poured into creating these works throughout history far, far surpasses Anthropic's own effort of converting them into model weights.

bloppe 18 hours ago

The equivocation is crawling websites <-> crawling LLM responses.

Both Anthropic and Alibaba are trying to build bleeding-edge LLMs. That part is the same. The way they source their data is slightly different, but they would both argue it constitutes fair use under copyright law.

walrus01 18 hours ago

"Your extremely efficient multi petabyte internet content suction machine is ripping off my extremely efficient multi petabyte internet content suction machine"

Sucking down petabytes of people's copyrighted content, which they never granted you a specific license to use, seems to be an unavoidable, default part of the process of building any huge LLM.

nonethewiser 18 hours ago
So why was there crawling in 1998 but no LLMs?
- hasteg 17 hours ago
  
  Because the transformer, which all of these models are built on and which none of these companies (bar Google) invented themselves, hadn't been invented yet? The amount of effort it took humanity to generate all the data the models needed to get to where they are now is not even comparable to the effort it took to build the model code. Yeah, it's complicated, but if they hadn't ripped off all of humanity's combined output, it wouldn't even matter that the transformer got invented.
  
- 12_throw_away 13 hours ago
  
  I am unable to comprehend the state of mind that would lead one to ask this question.
- vitally3643 16 hours ago
  
  We didn't have GPUs with hundreds of gigabytes of VRAM and tensor processing cores.
  
- jbxntuehineoh 16 hours ago
  
  [flagged]

epsteingpt 18 hours ago

It's not really equivocation in this instance. This feels like a 'bad faith' comment. We can do better.

LLMs literally wouldn't work without the sum total of knowledge (in the form of books and other copyrighted content) being used as 'training data'.

The 'bleeding edge' LLMs required many things, including:
1. Tech innovation ('attention')
2. Lots of compute
3. Data
4. Pre- and post-training

#4 doesn't happen without #3.

It's pretty obvious at this point that the major providers have stolen vast amounts of #3 - they have paid almost none of the creators.

We can argue about the impact (I'd lean net good) vs. the cost. But arguing there isn't a cost is a bit silly.

nonethewiser 18 hours ago
All of this supports the fact that models aren't essentially just web crawling.
- margalabargala 18 hours ago
  
  Sure, but Alibaba is still building an LLM. The scraping of responses and the scraping of websites occupy the same location in each stack. It's very comparable.
bel8 16 hours ago

The tech is Google's invention, popularized by OpenAI, so Anthropic should still stfu in that case.