
Comment by hedora

1 day ago

I'm kind of surprised no one just searched for "deduplication algorithm". If it was absolutely necessary to get this 1MB dataset to be smaller (when was this? Did it need to fit in L2 on a Pentium 3 or something?), then it could probably have been deduped + loaded in 300-400ms.

Most engineers I've worked with who die on a premature-optimization molehill like you describe also make that molehill as complicated as possible. Replacing the inside of the nested loop with a hashtable probe certainly fits the stereotype.
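
For context, a plain set-based dedup is about as simple as it gets. A minimal sketch, assuming the records are hashable (Python purely for illustration; the thread doesn't say what language the original code was in):

```python
def dedupe(records):
    """Deduplicate in O(n) with a set, instead of patching a hash probe into an O(n^2) nested loop."""
    seen = set()
    unique = []
    for rec in records:
        # Assumes records are hashable; use a tuple of key fields otherwise.
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique
```

A ~1MB dataset goes through something like this in well under a second on any modern machine.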

> I'm kind of surprised no one just searched for "deduplication algorithm".

Fair.

To set the scene a bit: the other developer at the time was arrogant, not at all up to date even with developments in his preferred language, and did not listen to or take advice from anyone.

I think a full quarter of my time there went to fire-fighting yet another weird thing he'd done.

> If it was absolutely necessary to get this 1MB dataset to be smaller

It was not, which is why my conversation with the CTO to check whether it was still needed was approximately one or two sentences from each of us. It might have been important during a previous pivot of the thing, at least one platform shift before I got there, but not by the time I got to it.