Comment by drob518
8 hours ago
What large caches of undigitized content exists? Surely, not everything has been digitized, but I can’t think it’s much in percentage terms.
8 hours ago
What large caches of undigitized content exists? Surely, not everything has been digitized, but I can’t think it’s much in percentage terms.
The amount of private data that is locked up inside private internal databases is huge. This is especially true of regulated industries. There is a wealth of data - financial data showing how to budget for things, pricing data on various products that are B2B, standard operating procedures at mature companies that have gone through various revisions, designs for manufacturing plants so people don't keep reinventing and making the same mistakes again, and on and on.
The Vatican Library contains roughly 1.1 million printed books and around 75,000 codices, only a small percentage of which have been digitised.
Reddit alone contains about the same quantity of text (~10 billion posts * 10 words per post, vs 1 million books * 100k words per book). Messaging and document platforms (google docs, slack, discord, telegram, etc.) probably each have 1-3 orders of magnitude more than reddit. To your/GP's point though, those private platforms probably haven't been slurped up by LLMs yet.
Which is what percent of the world’s content? 0.000000001% or something similar. It’s nothing in the scheme of things. To put it another way, if we were to digitize that continent and train on it, our AIs would not get noticeably better in any way. It doesn’t move the needle.
1.1 million being 0.000000001% implies a total count of 1e17 books in the world - the real number is closer to 1e8.
1 reply →