Comment by kortex
4 years ago
There are so many super cool aspects to this!
- it's always fun when a new equation or formula is discovered, doubly so when it is very practical
- actually really easy to wrap your head around
- seems very computationally efficient. Basically boils down to sorts and distance
- not limited to strictly numeric fields (integers and reals). Anything with a defined ordering can act as your Xs/Ys: characters, words, complex numbers/vectors (under magnitude). I think you could even apply it recursively to divide and conquer high-dimensional datasets.
- looks useful in both LTI and non-stationary signal analysis
- possible use cases: exoplanet search, cosmology in general, ecology, climate, cryptanalysis, medicine, neuroscience, and of course, someone will find a way to win (more likely lose) money on stonks.
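For concreteness: assuming the coefficient under discussion is Chatterjee's ξ (which the "sorts and distance" description matches), here is a minimal no-ties sketch; the function name and structure are my own illustration, not code from the article:

```python
def chatterjee_xi(x, y):
    """Chatterjee's xi for samples with no ties: sort the pairs by X,
    rank the Y values in that order, then rescale the summed absolute
    differences of consecutive ranks."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])   # indices sorted by X
    y_ord = [y[i] for i in order]                  # Y values in X-order
    # 1-based ranks of the Y values; assumes all Ys are distinct
    rank_of = {v: r for r, v in enumerate(sorted(y_ord), start=1)}
    r = [rank_of[v] for v in y_ord]
    return 1.0 - 3.0 * sum(abs(a - b) for a, b in zip(r, r[1:])) / (n * n - 1)
```

For a perfectly monotone sample the consecutive rank differences are all 1, so the value is 1 - 3/(n+1), approaching 1 as n grows.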
You can update the Pearson correlation online; you cannot update the OP's correlation coefficient online.
Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N).
Thus, it is not computationally efficient.
To break ties randomly one has to have a good random number generator, which is a problem in itself.
Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(N log^2(N)).
But it is a cool idea nevertheless; it brings up the use of ranking in statistics, and there are ties to other areas of statistical science.
For example, it appears that you can prune language models more efficiently if you use a rank metric rather than probability [1].
[1] https://aclanthology.org/P02-1023.pdf
> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That is not really what I meant. First, I personally consider N log N the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than N log N is like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.
Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there is almost always a way to sort data that is really fast in practice.
I didn't know about sorting networks, that is fascinating!
> First, I personally consider NlogN the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than NlogN is like, "super efficient" in my mind.
What is efficient and what ain't depends totally on context.
For example, in some domains even proving that a problem is in NP (instead of something worse) means it is considered tractable, because in practice we have good solvers for many NP-hard problems.
For other domains, you want O(1) or O(log n) or O(log* n) etc.
In general, P vs NP is such a big deal, because polynomial runtime is generally seen as tractable, and anything worse as intractable.
(Btw, O(n log n) _is_ a polynomial runtime, and so is O(N) or even O(1). Though I can see that as a shorthand of notation you are excluding them.)
Indeed, in practice the log N factor can only get so big, even if N is taken to be the number of particles in the observable universe.
> Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things, there is almost always really fast in practice ways of sorting data.
By the same reasoning, the economy would be a solved problem: you just need to have money.
Computing correlation by sorting a large number of samples instead of reading them once is a cost difference that can make your data processing practically infeasible, despite the efficiency of sorting algorithms.
I can only second this. Few algorithms are as efficient as N log N.
Can you explain in more detail why this can't be updated online?
We can track the rank of newly sampled pairs in O(log N) using something like an order statistic tree:
https://en.wikipedia.org/wiki/Order_statistic_tree
But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.
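As a toy sketch of the rank-query half of that idea (a plain sorted list stands in for the tree here: `bisect` gives the O(log N) rank lookup, though Python list insertion is still O(N); a balanced tree with subtree sizes makes both logarithmic):

```python
import bisect

class RankTracker:
    """Toy stand-in for an order statistic tree."""
    def __init__(self):
        self._sorted = []

    def insert(self, v):
        # O(N) for a list; O(log N) in a real order statistic tree
        bisect.insort(self._sorted, v)

    def rank(self, v):
        # number of stored values strictly less than v, in O(log N)
        return bisect.bisect_left(self._sorted, v)
```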
Your point is a very valid one. I was not able to figure out how to keep the ranks updated efficiently.
But now we're moving away from being simple to compute.
Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:
(1) Rank coefficients can absolutely be updated online.
(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.
(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If, in your initial sort of the X values, you instead sort lexicographically by X then Y, you can get away with a linear algorithm for computing those pairwise absolute differences.
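To illustrate why the lexicographic sort enables a linear pass: for an already-sorted run, the sum of all pairwise absolute differences has a closed form (a standard identity, not code from the article) in which the j-th smallest of m values contributes with weight (2j - m - 1):

```python
def pairwise_abs_diff_sum(sorted_vals):
    """Sum of |a - b| over all unordered pairs of an already-sorted
    sequence, computed in one linear pass instead of over O(m^2) pairs."""
    m = len(sorted_vals)
    return sum((2 * j - m - 1) * v for j, v in enumerate(sorted_vals, start=1))
```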
> Thus, it is not computationally efficient.
Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(n log n) is fine for a very wide variety of practical uses.
I should point out that Pearson's correlation coefficient can be updated on-line; its storage complexity is O(1). The storage complexity for sorting is O(N).
Thus, this particular algorithm will not fit into, say, a traffic analysis framework.
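A minimal sketch of that O(1)-storage online update for Pearson's r (running sums only; numerically naive, and it assumes both variances are nonzero):

```python
class OnlinePearson:
    """Streaming Pearson correlation with O(1) state: running sums of
    x, y, x^2, y^2, x*y, and the sample count."""
    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.syy = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.syy += y * y
        self.sxy += x * y

    def value(self):
        cov = self.n * self.sxy - self.sx * self.sy
        vx = self.n * self.sxx - self.sx ** 2
        vy = self.n * self.syy - self.sy ** 2
        return cov / (vx * vy) ** 0.5
```

(A production version would use a Welford-style update for numerical stability; the sketch above just shows that constant storage suffices.)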
> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N). Thus, it is not computationally efficient.
That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be 9 times slower, but throwing 9 times more computational power at a problem is far from something unheard of nowadays.
I guess it's more like 30 times slower. Consider mergesort and quicksort, which both (aim to) halve the search space with every iteration, so there is a log_2 number of steps: 2 is our base. And yes, base-2 and base-10 logarithms will always be within a constant factor of one another, but seeing as we _are_ talking about constant factors here to begin with...
> it will be 9 times slower
Closer to 30 times slower, as the logarithm in sorting complexity is a binary rather than decimal one.
Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.
Yes, exactly. There are so many to choose from. ;)
There are several discussions here on HN about PRNGs of different kinds, and people still want to invent new ones. Why? Because the old ones do not fully satisfy them.
Returning to the problem at hand: how many consecutive ties do we expect? This would help us define how many bits we should extract from the state. For N expected consecutive ties we should extract 2 log_2(N) bits or more (birthday paradox). For 64K ties we need 32 bits, which is a fair amount of state, 1/8 of a typical PRNG discussed here.
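That birthday-bound arithmetic can be checked directly (2 log_2(N) bits for N expected ties; my own restatement of the estimate above):

```python
import math

def tie_break_bits(n_ties):
    """Birthday-paradox estimate: random bits needed to break
    n_ties consecutive ties with low collision probability."""
    return 2 * math.log2(n_ties)
```

`tie_break_bits(65536)` gives 32, matching the 64K figure above.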
Most PRNGs are splittable, and this also raises the question: how does splitting affect the bits generated? Will these bits be correlated between split generators?
I definitely do not want a correlation coefficient computation that produces close results for provably different sets due to correlation between tie-broken sections filled with random numbers.
How do you sample from an arbitrary probability distribution? It is not a trivial problem at all.
You don't have to use sorting networks; to differentiate a sorted list (or get a subgradient, anyway), you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.
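A tiny sketch of that trick (my own illustration): the forward pass records the permutation, and the backward pass scatters each output gradient back to the input position it came from:

```python
def sort_with_grad(xs):
    """Sorting as a differentiable op: forward sorts, backward routes
    each output's gradient back through the recorded permutation."""
    perm = sorted(range(len(xs)), key=lambda i: xs[i])  # perm[out] = in
    ys = [xs[i] for i in perm]

    def backward(grad_ys):
        grad_xs = [0.0] * len(grad_ys)
        for out_pos, in_pos in enumerate(perm):
            grad_xs[in_pos] = grad_ys[out_pos]          # scatter step
        return grad_xs

    return ys, backward
```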
Also, I would say that random number generation is a well-solved problem.
> Finally, if you want to have differentiable version of this correlation coefficient
Could you expand a little more on how to differentiate this coefficient? Since it is based only on ranks, it feels very non-differentiable.