Comment by tzs

2 days ago

I'm looking at the leaderboard and it raises some interesting questions. Currently the fastest entries are around 3.4 seconds.

Yesterday the README said that benchmarks were run on a "Premium Intel Digital Ocean Droplet with 2vCPUs and 1.5GB of available memory".

Today it says they are run on a "Mac Mini M1 with 12GB of available memory", which, if the net is to be believed, is quite a bit faster than the DO Droplet they said they had been using. I'm going to assume those 3.4-second results on the leaderboard were benchmarked on the Mac.

I've got an M2 Max Mac Studio, which should be faster than the Mac Mini.

A program to do this challenge must read the entire input file, and it is going to have to do at least some computation for every character in the file while parsing.

So I thought I'd try to get an idea of an upper limit on how fast this could be done. One idea for that was this:

  $ time LC_ALL=C wc -l data.csv

The idea is that wc should be written in C or C++, and counting lines just requires checking each character to see if it is a newline, so it is pretty minimal computation. LC_ALL=C should keep any Unicode handling from happening, which might otherwise slow it down.

This takes 7.1 seconds. (Same without LC_ALL=C, BTW.)

OK, that was unexpected. I then wrote a line counter in C: allocate a buffer of size N, then loop (read N bytes from stdin into the buffer, scan those bytes counting '\n's) until there is no more input. With a 1 MiB buffer it took 1 second. With a 1024-byte buffer it took 4.3 seconds. With a 512-byte buffer it took 7.1 seconds.
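
In rough outline the counter looks like this (a sketch of the approach, not my exact program):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define BUFSIZE (1024 * 1024)  /* 1 MiB; drop to 1024 or 512 to see the slowdown */

  int main(void) {
      char *buf = malloc(BUFSIZE);
      long long lines = 0;
      ssize_t len;

      if (buf == NULL)
          return 1;
      /* read a chunk, count the '\n's in it, repeat until EOF */
      while ((len = read(STDIN_FILENO, buf, BUFSIZE)) > 0)
          for (ssize_t i = 0; i < len; i++)
              if (buf[i] == '\n')
                  lines++;
      printf("%lld\n", lines);
      free(buf);
      return 0;
  }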

So...maybe wc just has a small buffer?

Then I decided to try "wc -c". That takes 0.008 seconds, which is faster than "cat > /dev/null" (0.6 seconds), suggesting that "wc -c" is not reading the file at all. Someone probably decided to special case requests for just the byte count and use stat/fstat to get the file size, or seek to the end and take the offset, or something like that.
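
Something like this, which never touches the file contents (my guess at the fast path, not wc's actual code):

  #include <stdio.h>
  #include <sys/stat.h>

  int main(int argc, char **argv) {
      struct stat st;

      if (argc < 2 || stat(argv[1], &st) != 0)
          return 1;
      /* byte count straight from the inode, no I/O on the data */
      printf("%lld %s\n", (long long)st.st_size, argv[1]);
      return 0;
  }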

I then looked at the source for wc [1]. It does indeed special case things like -c. It also special cases -l, because lines, unlike words, can be counted without having to deal with locale stuff.

But my guess that it is using a small buffer is wrong. Its buffer size is 1 MiB, same as mine. So why does my line counter take 1 second while "wc -l" takes 7.1 seconds?

Looking at it, I see that wc also finds the longest line, even if you have only asked for the number of lines. When I add longest-line tracking to mine (see the sketch below), it takes 5.1 seconds.
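
The change is just a bit more state in the inner loop (a sketch; it measures line length in bytes, which is what you get in the C locale):

  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define BUFSIZE (1024 * 1024)

  int main(void) {
      char *buf = malloc(BUFSIZE);
      long long lines = 0, linelen = 0, longest = 0;
      ssize_t len;

      if (buf == NULL)
          return 1;
      while ((len = read(STDIN_FILENO, buf, BUFSIZE)) > 0) {
          for (ssize_t i = 0; i < len; i++) {
              if (buf[i] == '\n') {
                  lines++;
                  if (linelen > longest)  /* finished a line; update the max */
                      longest = linelen;
                  linelen = 0;
              } else {
                  linelen++;
              }
          }
      }
      printf("%lld %lld\n", lines, longest);
      free(buf);
      return 0;
  }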

There is also more error handling in wc. Mine just loops as long as read() > 0 and then prints the stats and exits, whereas wc loops as long as read() != 0 and then, inside the loop, does an "if (len < 0)" to see if there was an error.

There is also a check in wc's loop for a flag that gets set on SIGINFO. If that flag is set, wc prints the current stats.
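
Both differences together look roughly like this (my paraphrase, not the actual wc source; see [1] for that):

  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define BUFSIZE (1024 * 1024)

  static volatile sig_atomic_t siginfo = 0;

  static void siginfo_handler(int sig) {
      (void)sig;
      siginfo = 1;
  }

  int main(void) {
      char *buf = malloc(BUFSIZE);
      long long lines = 0;
      ssize_t len;

      if (buf == NULL)
          return 1;
      signal(SIGINFO, siginfo_handler);  /* SIGINFO is Ctrl-T on BSD/macOS */
      while ((len = read(STDIN_FILENO, buf, BUFSIZE)) != 0) {
          if (len < 0) {                 /* error check inside the loop, like wc */
              perror("read");
              return 1;
          }
          if (siginfo) {                 /* report progress on Ctrl-T */
              fprintf(stderr, "%lld\n", lines);
              siginfo = 0;
          }
          for (ssize_t i = 0; i < len; i++)
              if (buf[i] == '\n')
                  lines++;
      }
      printf("%lld\n", lines);
      return 0;
  }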

Still, on the 7 GB data.csv file, with a 1 MiB read buffer, the read loop should run under 7000 times (7 GB / 1 MiB ≈ 6700 iterations), so "if (len < 0)" and "if (siginfo)" are each evaluated under 7000 times, and their enclosed code only runs if there is a read error, for the first, or every time I hit Ctrl-T, for the second. In my tests that's 0 times for both of them.

That's not nearly enough to explain why it is 2 seconds slower than my line counter, which now has the same buffer size, also finds the longest line, and, aside from those two if statements, each evaluated under 7000 times and never taken, is essentially the same loop.

Maybe later I'll see what it takes to build wc locally and try to find where the time is going.

[1] https://github.com/apple-oss-distributions/text_cmds/blob/te...