Comment by VorpalWay
10 hours ago
> A fairer comparison would be to stream the file in C++ as well and maintain internal state for the count.
Wouldn't memory mapping the data in Python be the more fair comparison? If the language doesn't support that, then this seems to absolutely be a fair comparison.
> For most people that would be the first/naive approach as well when they programmed something like this I think.
I disagree, my mind immediately goes to mmap when I have to deal with a single file that I have to read in it's entirety. I think the non-obvious solution here is rather io-uring (which I would expect to be faster if dealing with lots of small files, as you can load them async concurrently from the file system).
I'd make the bet that "most people" (who can program) would not think of mmap, but either about streaming or would even just load the whole thing into memory.
Ask a bunch of coding agents and they will give you these two versions, which means it's likely that the LLMs have seen these way more often than the mmap version. Both Opus and GPT even pushed back when I asked for mmap, both said it would "add complexity".
It does add complexity, and the optimal solution is probably not to use it. Consider what happens if a 4kB page has only a single unique word in it—you’d still need to load it to memory to read the string, it just isn’t accounted against your process (maybe).
I would have expected something like this:
- Scan the file serially.
- For each word, find and increment a hash table entry.
- Sort and print.
In theory, technically, this does require slightly more memory—but it’s a tiny amount more; just a copy of each unique word, and if this is natural language then there aren’t very many. Meanwhile, OOP’s approach massively pressures the page cache once you get to the “print” step, which is going to be the bulk of the runtime.
It’s not even a full copy of each unique word, actually, because you’re trading it off against the size of the string pointers. That’s… sixteen bytes minimum. A lot of words are smaller than that.
That is a valid solution, but what IO block size should you use for the best performance? What if you end up reading half a word at the end of a chunk?
Handling that is in my opinion way more complex than letting the kernel figure it out via mmap. The kernel knows way more than you about the underlying block devices, and you can use madvise with MADV_SEQUENTIAL to indicate that you will read the whole file sequentially. (That might free pages prematurely if you keep references into the data rather than copy the first occurance of each word though, so perhaps not ideal in this scenario.)