Comment by tyingq

5 years ago

"They’re parsing JSON. A whopping 10 megabytes worth of JSON with some 63k item entries."

Ahh. Modern software rocks.

Parsing 63k items in a 10 MB JSON string is pretty much a breeze on any modern system, including a Raspberry Pi. I wouldn't even consider JSON an anti-pattern for storing that much data if it's going over the wire (compressed with gzip).

A little further down in the article you'll see one of the real issues:

> But before it’s stored? It checks the entire array, one by one, comparing the hash of the item to see if it’s in the list or not. With ~63k entries that’s (n^2+n)/2 = (63000^2+63000)/2 = 1984531500 checks if my math is right. Most of them useless.
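To put that in concrete terms, here's a minimal Python sketch (made-up item shape and hash function; the game's actual code is native, not Python) of the linear-scan check described above next to a set-based one, which turns ~2 billion comparisons into ~63k lookups:

    import hashlib, json

    # Made-up item shape; the real data are ~63k catalog entries, each identified by a hash.
    items = [{"key": f"item_{i % 60_000}", "price": i % 60_000} for i in range(63_000)]

    def item_hash(item):
        return hashlib.sha1(json.dumps(item, sort_keys=True).encode()).hexdigest()

    # What the article describes: a linear scan of everything stored so far,
    # roughly (n^2 + n) / 2 comparisons over the whole run.
    def dedupe_scan(items):
        stored, hashes = [], []
        for it in items:
            h = item_hash(it)
            if h not in hashes:     # O(n) membership test against a plain list
                stored.append(it)
                hashes.append(h)
        return stored

    # Same logic with a set: the membership test is O(1) on average, ~63k lookups total.
    def dedupe_set(items):
        stored, seen = [], set()
        for it in items:
            h = item_hash(it)
            if h not in seen:
                stored.append(it)
                seen.add(h)
        return stored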

  • The JSON parser patch took out more of the elapsed time. Granted, it was a terrible parser. But I still think JSON is a poor choice here. 63k x X checks for colons, balanced quotes/braces and so on just isn't needed.

      Time with only duplication check patch: 4m 30s
      Time with only JSON parser patch:       2m 50s

    • > But I still think JSON is a poor choice here.

      It’s an irrelevant one. The JSON parser from the Python stdlib parses a 10 MB JSON document patterned after the sample in a few dozen ms. And it’s hardly a fast parser.
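      For reference, a quick-and-dirty benchmark of the stdlib parser on a synthetic payload of roughly that shape (made-up field names; the exact numbers depend on the machine):

        import json, time

        # Synthetic payload: ~63k entries, several MB once serialized.
        entries = [{"key": f"item_{i}", "price": i, "stats": [i] * 10} for i in range(63_000)]
        blob = json.dumps(entries)
        print(f"payload: {len(blob) / 1e6:.1f} MB")

        start = time.perf_counter()
        parsed = json.loads(blob)
        print(f"parsed {len(parsed)} entries in {(time.perf_counter() - start) * 1000:.0f} ms")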

At least parse it into SQLite. Once.
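Something along these lines, as a rough sketch (file and column names are made up); an upsert keeps the local cache current when entries are added or changed:

    import json, sqlite3

    # Hypothetical one-time import of the catalog into a local SQLite cache.
    conn = sqlite3.connect("catalog.db")
    conn.execute("CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, data TEXT)")

    with open("catalog.json") as f:     # the ~10 MB payload, parsed once
        entries = json.load(f)

    # INSERT OR REPLACE keeps the cache in sync when entries are added or updated upstream.
    conn.executemany(
        "INSERT OR REPLACE INTO items (key, data) VALUES (?, ?)",
        ((e["key"], json.dumps(e)) for e in entries),
    )
    conn.commit()

    # Later lookups hit the indexed table instead of rescanning a 63k-element array.
    row = conn.execute("SELECT data FROM items WHERE key = ?", ("item_123",)).fetchone()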

  • They probably add more entries over time (and maybe update/delete old ones), so you’d have to be careful about keeping the local DB in sync.

  • I think just using a length-encoded serialization format would have made this work reasonably fast (rough sketch of the idea below).

    • Or just any properly implemented JSON parser. That's a laughably small amount of JSON, which should easily be parsed in milliseconds.
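    For concreteness, a minimal sketch of the length-prefixed idea (4-byte length per record; not any particular existing format), which lets a reader skip from record to record without scanning for delimiters or balanced quotes:

      import struct

      def encode(records):
          # Each record: a 4-byte little-endian length, then the raw payload bytes.
          out = bytearray()
          for rec in records:
              out += struct.pack("<I", len(rec))
              out += rec
          return bytes(out)

      def decode(blob):
          records, offset = [], 0
          while offset < len(blob):
              (length,) = struct.unpack_from("<I", blob, offset)
              offset += 4
              records.append(blob[offset:offset + length])
              offset += length
          return records

      data = encode([b"item_1:100", b"item_2:250"])
      assert decode(data) == [b"item_1:100", b"item_2:250"]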