
Comment by philipkglass

5 months ago

In the case of Wikipedia dumps there is an easy workaround. The XML dump starts with a small header and a "<siteinfo>" section. After that it's just millions of "<page>" elements, one per wiki page.

You can read the dump as a streaming text source and split it into chunks on matching pairs of "<page>" and "</page>" with a simple state machine. Then you can feed each single-page chunk to an XML parser without worrying about the overall document size. This of course doesn't apply in the general case, where you're processing arbitrary huge XML documents.
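A minimal sketch of that idea in Python, assuming an already-decompressed dump at a hypothetical path "enwiki-pages-articles.xml" and using the standard-library ElementTree parser (the extracted <page> chunks carry no namespace declaration, so plain tag names work):

    import xml.etree.ElementTree as ET

    def iter_pages(path):
        # Accumulate lines between <page> and </page>, then parse each
        # chunk on its own, so memory use stays proportional to one page.
        buf = None
        with open(path, encoding="utf-8") as f:
            for line in f:
                stripped = line.strip()
                if stripped.startswith("<page>"):
                    buf = [line]            # start collecting a new page
                elif buf is not None:
                    buf.append(line)
                    if stripped.startswith("</page>"):
                        yield ET.fromstring("".join(buf))
                        buf = None          # done with this page

    for page in iter_pages("enwiki-pages-articles.xml"):
        print(page.findtext("title"))

The same shape works with lxml or a SAX parser instead of ElementTree; the point is only that each parse sees one page at a time.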

I have processed Wikipedia many times with less than 8 GB of RAM.