Comment by ben_w
14 hours ago
This is probably also true for 98% of startups.
I think most people don't realise that "10 million" records is small, for a computer.
(That said, I have had to deal with code that included an O(n^2) de-duplication where the test data had n ~= 20,000, causing app startup to take 20 minutes. The other developer insisted there was no possible way to speed this up; later that day I found the problem, asked the CTO if there was a business reason for that de-duplication, removed the de-duplication, and the following morning's stand-up was "you know that 20-minute startup you said couldn't possibly be sped up? Yeah, well, I sped it up and now it takes 200ms")
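For the curious, the shape of it was roughly this (a hypothetical Python sketch from memory, not the actual code, which I no longer have):

    def dedupe_quadratic(records):
        unique = []
        for rec in records:          # n iterations
            for seen in unique:      # up to n iterations each -> O(n^2)
                if seen == rec:      # compare whole ~50-byte records
                    break
            else:                    # no match found: keep this record
                unique.append(rec)
        return unique

With n ~= 20,000 that's on the order of 2 * 10^8 pairwise comparisons, which is plenty to wreck startup time.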
I thought you were going to say you had reduced O(n^2) to O(n log n), but you just deleted the operation. Normally I'd say that's great, but just how much duplicate data is being left around now? Is that OK?
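What I had in mind was sorting first so duplicates end up adjacent, something like this (a sketch, assuming the elements are comparable):

    def dedupe_sorted(records):
        out = []
        for rec in sorted(records):        # O(n log n) sort
            if not out or out[-1] != rec:  # duplicates are now adjacent
                out.append(rec)
        return out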
Each element was about, oh, I can't remember exactly, perhaps 50 bytes? It wasn't a fixed size; there could in theory be a string in there, but those needed to be added manually, and when you have 20,000 elements, nobody would bother.
Also, it was overwhelmingly likely that none of the elements were duplicates in the first place, and the few exceptions probably amounted to a single duplicate each.
I'm kind of surprised no one just searched for "deduplication algorithm". If it was absolutely necessary to make this 1MB dataset smaller (when was this? Did it need to fit in L2 on a Pentium 3 or something?), then it could probably have been deduped and loaded in 300-400ms.
Most engineers I've worked with who die on a premature-optimization molehill like the one you describe also make that molehill as complicated as possible. Replacing the inside of the nested loop with a hashtable probe certainly fits the stereotype.
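For concreteness, that hashtable-probe replacement would look something like this (a Python sketch, assuming the elements are hashable):

    def dedupe_hashed(records):
        seen = set()
        out = []
        for rec in records:
            if rec not in seen:    # O(1) average-case membership probe
                seen.add(rec)
                out.append(rec)
        return out

Simple enough in isolation, but still more machinery than deleting a de-duplication nobody needed.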