Comment by kortex
4 years ago
There are so many super cool aspects to this!
- it's always fun when a new equation or formula is discovered, doubly so when it is very practical
- actually really easy to wrap your head around
- seems very computationally efficient. Basically boils down to sorts and distance
- not limited to strictly numeric fields (integers and reals). Anything with a defined ordering can act as your Xs/Ys: characters, words, complex numbers/vectors (under magnitude). I think you could even apply it recursively to divide and conquer high-dimensional datasets.
- looks useful in both LTI and non-stationary signal analysis
- possible use cases: exoplanet search, cosmology in general, ecology, climate, cryptanalysis, medicine, neuroscience, and of course, someone will find a way to win (more likely lose) money on stonks.
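For concreteness: assuming the coefficient under discussion is Chatterjee's ξ (which the "sorts and distance" description matches), here is a minimal no-ties sketch; the function name and structure are my own illustration, not code from the article:

```python
def chatterjee_xi(x, y):
    """Chatterjee's xi for samples with no ties: sort the pairs by X,
    rank the Y values in that order, then rescale the summed absolute
    differences of consecutive ranks."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])   # indices sorted by X
    y_ord = [y[i] for i in order]                  # Y values in X-order
    # 1-based ranks of the Y values; assumes all Ys are distinct
    rank_of = {v: r for r, v in enumerate(sorted(y_ord), start=1)}
    r = [rank_of[v] for v in y_ord]
    return 1.0 - 3.0 * sum(abs(a - b) for a, b in zip(r, r[1:])) / (n * n - 1)
```

For a perfectly monotone sample the consecutive rank differences are all 1, so the value is 1 - 3/(n+1), approaching 1 as n grows.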
You can update the Pearson correlation online; you cannot update the OP's correlation coefficient online.
Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N).
Thus, it is not computationally efficient.
To break ties randomly one has to have a good random number generator, which is a problem in itself.
Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(N log^2(N)).
But it is a cool idea nevertheless; it brings up the use of ranking in statistics, and there are ties to other areas of statistical science.
For example, it appears that you can prune language models more efficiently if you use a rank metric rather than probability [1].
[1] https://aclanthology.org/P02-1023.pdf
> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That is not really what I meant. First, I personally consider N log N the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than N log N is like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.
Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there is almost always a way to sort data that is really fast in practice.
I didn't know about sorting networks, that is fascinating!
> First, I personally consider NlogN the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than NlogN is like, "super efficient" in my mind.
What is efficient and what ain't depends totally on context.
For example, in some domains even proving that a problem is in NP (instead of something worse) means it is considered tractable, because in practice we have good solvers for many NP-hard problems.
For other domains, you want O(1) or O(log n) or O(log* n) etc.
In general, P vs NP is such a big deal, because polynomial runtime is generally seen as tractable, and anything worse as intractable.
(Btw, O(n log n) _is_ a polynomial runtime, and so is O(N) or even O(1). Though I can see that as a shorthand of notation you are excluding them.)
Indeed, in practice the log N factor can only get so big, even if N is taken to be the number of particles in the observable universe.
> Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things, there is almost always really fast in practice ways of sorting data.
By the same reasoning, the economy would be a solved problem: you just need to have money.
Computing correlation by sorting a large number of samples instead of reading them once is a cost difference that can make your data processing practically infeasible, despite the efficiency of sorting algorithms.
I can only second this. Few algorithms are as efficient as N log N.
Can you explain in more detail why this can't be updated online?
We can track the rank of newly sampled pairs in O(log N) using something like an order statistic tree:
https://en.wikipedia.org/wiki/Order_statistic_tree
But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.
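As a toy sketch of the rank-query half of that idea (a plain sorted list stands in for the tree here: `bisect` gives the O(log N) rank lookup, though Python list insertion is still O(N); a balanced tree with subtree sizes makes both logarithmic):

```python
import bisect

class RankTracker:
    """Toy stand-in for an order statistic tree."""
    def __init__(self):
        self._sorted = []

    def insert(self, v):
        # O(N) for a list; O(log N) in a real order statistic tree
        bisect.insort(self._sorted, v)

    def rank(self, v):
        # number of stored values strictly less than v, in O(log N)
        return bisect.bisect_left(self._sorted, v)
```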
Your point is a very valid one. I was not able to figure out how to keep the ranks updated efficiently.
But now we're moving away from being simple to compute.
Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:
(1) Rank coefficients can absolutely be updated online.
(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.
(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If, in your initial sort of the X values, you instead sort lexicographically by X then Y, you can get away with a linear algorithm for computing those pairwise absolute differences.
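To illustrate why the lexicographic sort enables a linear pass: for an already-sorted run, the sum of all pairwise absolute differences has a closed form (a standard identity, not code from the article) in which the j-th smallest of m values contributes with weight (2j - m - 1):

```python
def pairwise_abs_diff_sum(sorted_vals):
    """Sum of |a - b| over all unordered pairs of an already-sorted
    sequence, computed in one linear pass instead of over O(m^2) pairs."""
    m = len(sorted_vals)
    return sum((2 * j - m - 1) * v for j, v in enumerate(sorted_vals, start=1))
```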
> Thus, it is not computationally efficient.
Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(n log n) is fine for a very wide variety of practical uses.
I should point out that Pearson's correlation coefficient can be updated on-line; its storage complexity is O(1). The storage complexity for sorting is O(N).
Thus, this particular algorithm will not fit into, say, a traffic analysis framework.
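A minimal sketch of that O(1)-storage online update for Pearson's r (running sums only; numerically naive, and it assumes both variances are nonzero):

```python
class OnlinePearson:
    """Streaming Pearson correlation with O(1) state: running sums of
    x, y, x^2, y^2, x*y, and the sample count."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def value(self):
        cov = self.n * self.sxy - self.sx * self.sy
        vx = self.n * self.sxx - self.sx ** 2
        vy = self.n * self.syy - self.sy ** 2
        return cov / (vx * vy) ** 0.5
```

(A production version would use a Welford-style update for numerical stability; the sketch above just shows that constant storage suffices.)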
> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N). Thus, it is not computationally efficient.
That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be 9 times slower, but throwing 9 times more computational power at a problem is far from something unheard of nowadays.
I guess it's more like 30 times slower. Consider mergesort and quicksort, which both (aim to) halve the search space with every iteration, so there is a log_2 number of steps: 2 is our base. And yes, base-2 and base-10 logarithms will always be within a constant factor of one another, but seeing as we _are_ talking about constant factors here to begin with...
> it will be 9 times slower
Closer to 30 times slower, as the logarithm in sorting complexity is a binary rather than decimal one.
Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.
Yes, exactly. There are so many to choose from. ;)
There are several discussions here on HN about PRNGs of different kinds, and people still want to invent new ones. Why? Because the old ones do not fully satisfy them.
Returning to the problem at hand: how many consecutive ties do we expect? This would help us define how many bits we should extract from the state. For N expected consecutive ties we should extract 2 log_2(N) bits or more (birthday paradox). For 64K ties we need 32 bits, which is a fair amount of state, 1/8 of a typical PRNG discussed here.
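That birthday-bound arithmetic can be checked directly (2 log_2(N) bits for N expected ties; my own restatement of the estimate above):

```python
import math

def tie_break_bits(n_ties):
    """Birthday-paradox estimate: random bits needed to break
    n_ties consecutive ties with low collision probability."""
    return 2 * math.log2(n_ties)
```

`tie_break_bits(65536)` gives 32, matching the 64K figure above.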
Most PRNGs are splittable, and this also raises the question: how does splitting affect the bits generated? Will these bits be correlated between split generators?
I definitely do not want a correlation coefficient computation that produces close results for provably different sets due to correlation between tie-broken sections filled with random numbers.
How do you sample from an arbitrary probability distribution? It is not a trivial problem at all.
You don't have to use sorting networks; to differentiate a sorted list (or get a subgradient, anyway), you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.
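A tiny sketch of that trick (my own illustration): the forward pass records the permutation, and the backward pass scatters each output gradient back to the input position it came from:

```python
def sort_with_grad(xs):
    """Sorting as a differentiable op: forward sorts, backward routes
    each output's gradient back through the recorded permutation."""
    perm = sorted(range(len(xs)), key=lambda i: xs[i])  # perm[out] = in
    ys = [xs[i] for i in perm]

    def backward(grad_ys):
        grad_xs = [0.0] * len(grad_ys)
        for out_pos, in_pos in enumerate(perm):
            grad_xs[in_pos] = grad_ys[out_pos]          # scatter step
        return grad_xs

    return ys, backward
```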
Also, I would say that random number generation is a well-solved problem.
> Finally, if you want to have differentiable version of this correlation coefficient
Could you expand a little more on how to differentiate this coefficient? Since it is based only on ranks, it feels very non-differentiable.