Comment by xipho

2 years ago

> and Google has untouchable resources such as all the books they've scanned (and already won court cases about)

https://www.hathitrust.org/ has that corpus, and its evolution, and you can propose to get access to it via collaborative supercomputer access. It grows very rapidly. I expect the Internet Archive would also like to chat. I've also asked, and prompt manipulated chatGPT to estimate the total books it is trained with; it's a tiny fraction of the corpus. I wonder if it's the same with Google?

notpachet 2 years ago

> I've also asked, and prompt manipulated chatGPT to estimate the total books it is trained with

Whatever answer it gave you is not reliable.

zlg_codes 2 years ago
How does this not extend to ALL output from an LLM? If it can't understand its own runtime environment, it's not qualified to answer my questions.
- amputect 2 years ago
  
  That's correct. LLMs are plausible sentence generators; they don't "understand"* their runtime environment (or any of their other input), and they're not qualified to answer your questions. The companies providing these LLMs to users will typically provide a qualification along these lines, because LLMs tend to make up ("hallucinate", in the industry vernacular) outputs that are plausibly similar to the input text, even if they are wildly and obviously wrong and complete nonsense to boot.
  Obviously, people find some value in some output of some LLMs. I've enjoyed the coding autocomplete stuff we have at work; it's helpful and fun. But "it's not qualified to answer my questions" is still true, even if it occasionally does something interesting or useful anyway.
  *- this is a complicated term with a lot of baggage, but fortunately for the length of this comment, I don't think that any sense of it applies here. An LLM doesn't understand its training set any more than the mnemonic "ETA ONIS"** understands the English language.
  **- a vaguely name-shaped presentation of the most common letters in the English language, in descending order. Useful if you need to remember those for some reason like guessing a substitution cypher.
  
- orf 2 years ago
  
  How many books has your brain been trained with? Can you answer accurately?
  