Comment by walrus01

18 hours ago

"Your extremely efficient multi petabyte internet content suction machine is ripping off my extremely efficient multi petabyte internet content suction machine"

Sucking down petabytes of people's copyrighted content, for which they never granted you a specific license, seems to be an unavoidable and default part of the process of building any huge LLM.

8 comments

nonethewiser 18 hours ago

So why was there crawling in 1998 but no LLMs?

hasteg 17 hours ago
Because the transformer, which all of these models are foundationally built on and which (bar Google) they didn't invent themselves, hadn't been invented yet? The effort it took humanity to generate all the data the models needed to get to where they are now is not even comparable to the effort it took to build the model code. Yeah, it's complicated, but if they hadn't ripped off all of humanity's combined output, it wouldn't even matter that the transformer got invented.
- Chu4eeno 16 hours ago
  
  Google didn't really invent much. They just had access to an insane amount of data and compute, which let them try training a model with just the attention mechanism from an earlier machine translation paper by some poor academics, ripping out (most of) the rest. It turned out to work very well (though it is insanely training-data and compute intensive).
12_throw_away 13 hours ago

I am unable to comprehend the state of mind that would lead one to ask this question.
vitally3643 16 hours ago
We didn't have GPUs with hundreds of gigabytes of VRAM and tensor processing cores.
- walrus01 16 hours ago
  
  Or a feasible/economical way to even attempt to store the sum total of human written output, multiple petabytes of data (outside of the resources of the NSA, maybe). Back then a high-end server had six 36 GB 10K RPM SCSI HDDs in RAID-5, and its network uplink would be at most two gigabit Ethernet ports.
jbxntuehineoh 16 hours ago

[flagged]