Comment by TeeMassive

4 years ago

> Pearson and other correlation coefficients are linear, O(n); sorting would incur a logarithmic multiplier, O(n log n). Thus, it is not computationally efficient.

That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be 9 times slower, but throwing 9 times more computational power at a problem is far from unheard of nowadays.

I'd guess more like 30 times slower. Consider mergesort and quicksort, which both (aim to) halve the search space on each step, so the number of steps is log2(n) — 2 is our base, not 10. And yes, base-2 and base-10 logarithms will always be within a constant factor of one another, but seeing as we _are_ talking about constant factors here to begin with: log2(10^9) ≈ 30, while log10(10^9) = 9.
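A quick back-of-the-envelope check of the two factors (a sketch; n is just "a billion entries", not tied to any particular sort implementation):

```python
import math

n = 1_000_000_000  # a billion entries

# The "9x" figure comes from a base-10 logarithm...
decimal_factor = math.log10(n)  # 9.0

# ...but comparison sorts halve the range at each level,
# so the multiplier over a single linear pass is base 2.
binary_factor = math.log2(n)    # ~29.9

print(decimal_factor, binary_factor)
```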

  • If we are talking about billions of entries, even of just 32-bit integers, that is 4 GB of data; your sort is probably going to be I/O-dominated, not CPU (comparison) dominated. In that case you'd likely use a sorted data structure with a much higher branching factor than 2 (like a B-tree or an LSM tree).

    Take the B-tree example: with 4 KB pages you can fit about 1000 integers per node — roughly 500 keys plus 500 child pointers, for a branching factor of around 500 — and the performance multiplier drops to just over 3.

> it will be 9 times slower

Closer to 30 times slower, as the logarithm in sorting complexity is binary rather than decimal: log2(10^9) ≈ 30, not log10(10^9) = 9.