Comment by cgh

8 hours ago

The Vatican Library contains roughly 1.1 million printed books and around 75,000 codices, only a small percentage of which have been digitised.

Reddit alone contains about the same quantity of text (~10 billion posts * 10 words per post, vs. 1 million books * 100k words per book). Messaging and document platforms (Google Docs, Slack, Discord, Telegram, etc.) probably each have 1-3 orders of magnitude more than Reddit. To your/GP's point though, those private platforms probably haven't been slurped up by LLMs yet.
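The back-of-envelope comparison above can be checked in a few lines. This is only a sketch using the rough figures quoted in the comment (post count, words per post, and words per book are the commenter's estimates, not measured values):

```python
# Rough estimates taken from the quoted comment; none are measured values.
reddit_posts = 10_000_000_000      # ~10 billion posts
words_per_post = 10                # assumed average words per post
vatican_books = 1_000_000          # ~1 million printed books
words_per_book = 100_000           # assumed average words per book

reddit_words = reddit_posts * words_per_post
vatican_words = vatican_books * words_per_book

# Ratio of the two estimated word counts.
ratio = reddit_words / vatican_words
print(f"Reddit: {reddit_words:.0e} words, Vatican: {vatican_words:.0e} words, ratio: {ratio}")
```

Under these assumptions both totals come out to about 1e11 words, which is why the comment calls them "about the same quantity".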

Which is what percent of the world’s content? 0.000000001% or something similar. It’s nothing in the scheme of things. To put it another way, if we were to digitize that content and train on it, our AIs would not get noticeably better in any way. It doesn’t move the needle.

  • 1.1 million being 0.000000001% implies a total count of 1e17 books in the world - the real number is closer to 1e8.

    • You’re missing the point. And we’re not just talking about books, whatever that might mean. We’re talking about all documents ever made. Every magazine article, every blog and web page, every Word doc, etc. I’m pretty sure that whatever is in the Vatican archives is tiny by comparison. Given the age of the Vatican archives, I can also guarantee that many of those “books” are nothing more than page fragments. Very few will be full codices or long scrolls. Many will date before the printing press when document production was slow and laborious.
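The percentage check in the first reply can be reproduced as follows. This is a sketch, not a claim about the true size of the world's content: the 1e8 figure for books in the world is the reply's own estimate, and the variable names are made up for illustration.

```python
from fractions import Fraction

vatican_books = 1_100_000                # printed books, from the quoted comment
claimed_percent = Fraction(1, 10**9)     # the claimed 0.000000001 %
claimed_fraction = claimed_percent / 100 # percent -> plain fraction

# Total number of items the claimed percentage would imply.
implied_total = vatican_books / claimed_fraction

# Share of books if the world total is ~1e8, as the reply estimates.
world_books_estimate = 10**8
share_percent = Fraction(vatican_books, world_books_estimate) * 100

print(f"implied total: {float(implied_total):.1e}")
print(f"share of ~1e8 books: {float(share_percent)}%")
```

With these inputs the implied total is about 1.1e17 items, and the share of an estimated 1e8 books is about 1.1%. The second reply's objection is that the denominator should be all documents rather than books, which this sketch does not attempt to estimate.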