
Comment by thesz

4 years ago

You can update the Pearson correlation online; you cannot update the OP's correlation coefficient online.

Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(NlogN).

Thus, it is not computationally efficient.

To break ties randomly one has to have a good random number generator, which is a problem in itself.

Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(Nlog^2(N)).

But it is a cool idea nevertheless; it brings up the use of ranking in statistics, and there are ties to other areas of statistical science.

For example, it appears that you can prune language models more efficiently if you use a rank metric rather than probability [1].

[1] https://aclanthology.org/P02-1023.pdf

> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(NlogN).

> Thus, it is not computationally efficient.

That is not really what I meant. First, I personally consider NlogN the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than NlogN is, like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.

Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there is almost always a way that is really fast in practice.

I didn't know about sorting networks, that is fascinating!

  • > First, I personally consider NlogN the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than NlogN is, like, "super efficient" in my mind.

    What is efficient and what ain't depends totally on context.

    For example, in some domains even proving that a problem is in NP (instead of something worse) means the problem is considered tractable, because in practice we have good solvers for many NP-hard problems.

    For other domains, you want O(1) or O(log n) or O(log* n) etc.

    In general, P vs NP is such a big deal, because polynomial runtime is generally seen as tractable, and anything worse as intractable.

    (Btw, O(n log n) _is_ a polynomial runtime, and so is O(N) or even O(1). Though I can see that as a shorthand of notation you are excluding them.)

    • > (Btw, O(n log n) _is_ a polynomial runtime, and so is O(N) or even O(1). Though I can see that as a shorthand of notation you are excluding them.)

      I mean yes, technically n^1 is a polynomial, but normally this is just called linear. You hit the nail on the head with the last sentence. Here I'm using polynomial to basically mean "n^k, k>1". From an NP vs P point of view, P is considered tractable. But from an "actually writing libraries" point of view, I think software developers consider O(n^2) "kinda bleh" if it can be at all avoided. Also, developers tend to give much more weight to the actual coefficients, parallelism potential, and practical runtime. Neural networks have superlinear bounds everywhere, but GPU goes brrr.

      Brass tacks, sorting some tuples and diffing the rank then reducing strikes me as likely to be:

      - pretty fast even in the naive implementation

      - sorting is a well trodden field

      - easy to write things like shaders for

      - amenable to speculative execution

      - amenable to divide and conquer

      So that is what I mean by efficient.

    • The “P” in NP means a solution to a decision problem is verifiable in polynomial time.

      The “N” in NP means the solution can be produced by a non-deterministic Turing machine (i.e. a machine that can explore many branches of the search space in each step).

      NP does not mean “non-polynomial” time to produce a solution, otherwise P != NP by definition. Instead P is a subset of NP.

      The open question is whether all decision problems whose solutions are verifiable in polynomial time also have solutions producible in polynomial time.


  • Indeed, in practice the logN factor can only get so big, even if N is taken to be the number of particles in the observable universe.

  • > Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there is almost always a way that is really fast in practice.

    By the same reasoning, the economy would be a solved problem: you just need to have money.

    Computing correlation by sorting a large amount of samples instead of reading them once is a cost difference that can make your data processing practically infeasible despite the efficiency of sorting algorithms.

  • I can only second this. Few algorithms are as efficient as NlogN.

    • Compare Fourier (O(NlogN)) and wavelet (O(N)) transforms. Both can be used for similarity (correlation) search through projections.

Can you explain in more detail why this can't be updated online?

We can track the rank of newly sampled pairs in O(logN) using something like an order statistic tree:

https://en.wikipedia.org/wiki/Order_statistic_tree

But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.
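To make the rank-tracking step concrete, here is a minimal sketch using Python's bisect module as a stand-in for an order statistic tree. Note the caveat: bisect gives O(logN) rank queries, but Python list insertion is O(N), so this only models the interface; a balanced tree with subtree sizes makes both operations O(logN).

```python
import bisect

class RankTracker:
    """Toy stand-in for an order statistic tree.

    bisect gives O(log n) rank *queries*, but Python list insertion
    is O(n); a balanced tree with subtree sizes makes both O(log n).
    """

    def __init__(self):
        self.sorted_vals = []

    def insert(self, v):
        # Rank = number of previously seen samples strictly less than v.
        rank = bisect.bisect_left(self.sorted_vals, v)
        self.sorted_vals.insert(rank, v)
        return rank
```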

Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:

(1) Rank coefficients can absolutely be updated online.

(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.

(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If, in your initial sort, you sort lexicographically by X then Y instead of by X alone, you can get away with a linear algorithm for computing those pairwise absolute differences.

> Thus, it is not computationally efficient.

Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(nlogn) is fine for a very wide variety of practical uses.
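To make the deterministic tie handling in (3) concrete, here is a hedged sketch (my code, assuming a Chatterjee-style rank coefficient computed from the Y ranks after sorting by X; only the no-ties formula is shown, so the tie-rescaling term is omitted):

```python
import numpy as np

def xi_no_ties(x, y):
    # Rank-based dependence coefficient in the Chatterjee style,
    # assuming no ties in y. Lexicographic sort by (x, y) breaks ties
    # in x deterministically, with no RNG required.
    order = np.lexsort((y, x))                 # primary key x, then y
    ranks = np.argsort(np.argsort(y[order]))   # ranks of y, x-sorted
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n * n - 1)
```

For a strictly monotone relationship this approaches 1 as n grows; for independent data it hovers near 0.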

  • I should point out that Pearson's correlation coefficient can be updated on-line; its storage complexity is O(1). The storage complexity for sorting is O(N).

    Thus, this particular algorithm will not fit into, say, a traffic analysis framework.
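    A minimal sketch of that O(1)-storage on-line update (Welford-style running means and co-moments; the class and variable names are mine):

    ```python
    class OnlinePearson:
        """Running Pearson correlation with O(1) storage."""

        def __init__(self):
            self.n = 0
            self.mean_x = 0.0
            self.mean_y = 0.0
            self.m2_x = 0.0   # sum of squared deviations of x
            self.m2_y = 0.0   # sum of squared deviations of y
            self.co = 0.0     # co-moment sum (x - mean_x)(y - mean_y)

        def update(self, x, y):
            self.n += 1
            dx = x - self.mean_x          # deviation from the *old* mean
            dy = y - self.mean_y
            self.mean_x += dx / self.n
            self.mean_y += dy / self.n
            self.m2_x += dx * (x - self.mean_x)   # old dev * new dev
            self.m2_y += dy * (y - self.mean_y)
            self.co += dx * (y - self.mean_y)

        def corr(self):
            denom = (self.m2_x * self.m2_y) ** 0.5
            return self.co / denom if denom > 0.0 else float("nan")
    ```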

    • Yeah, this has worse complexity than Pearson. However, it's not competing on complexity.

      The point of it is to detect general dependencies, where Pearson only detects linear ones (cases where Y is completely determined by X but their Pearson correlation is 0 are trivial to construct). The "linear" in linear dependency isn't the same as the "linear" in linear complexity here, but it doesn't seem surprising for an algorithm that can detect more general dependencies to have worse than linear performance (and nlogn is pretty much the next best).

      In the absence of a known algorithm with the same properties and better complexity, it doesn't seem unreasonable to call it "computationally efficient" when it comes with a complexity sufficient for most uses, rather than say quadratic or exponential.
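      The "trivial to construct" case can be checked numerically; with a symmetric X and Y = X^2, Y is fully determined by X yet Pearson reports (numerically) zero (a sketch):

      ```python
      import numpy as np

      # Y is completely determined by X, but positive and negative
      # deviations of X cancel in the covariance, so Pearson sees nothing.
      x = np.arange(-100, 101, dtype=float)
      y = x ** 2
      pearson = np.corrcoef(x, y)[0, 1]   # numerically ~0
      ```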


    • The runtime and storage requirements you cite only apply when you need exact answers.

      If you want to compute a correlation coefficient, you are probably happy to get eg 3 significant digits of precision. Thus you can use approximation algorithms and sampling.

> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(NlogN). Thus, it is not computationally efficient.

That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be 9 times slower, but throwing 9 times more computational power at a problem is far from unheard of nowadays.

  • I guess more like 30 times slower. Consider mergesort and quicksort, which both (aim to) halve the search space with every iteration, giving log2(N) levels of steps; 2 is our base. And yes, base-2 and base-10 logarithms are always within a constant factor of one another, but seeing as we _are_ talking about constant factors here to begin with...

    • If we are talking about billions of entries, even of just 32bit integers, we are talking about 4GB of data, your sort is probably going to be IO dominated, not CPU (comparison) dominated. In which case you'd likely use a sorted data structure with a much higher branching factor than 2 (like a B-tree or LSM tree).

      Take the B-tree example, with 4KB pages, you can fit 1000 integers in there for a branching factor of 500, and a performance multiplier of just over 3.
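      The arithmetic behind those multipliers, as a quick sanity check (assuming a billion keys):

      ```python
      import math

      n = 10 ** 9                       # a billion entries
      binary_levels = math.log2(n)      # comparison sort: ~30 levels
      btree_levels = math.log(n, 500)   # branching factor 500: ~3.3 levels
      ```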

  • > it will be 9 times slower

    Closer to 30 times slower, as the logarithm in sorting complexity is a binary rather than decimal one.

Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.

  • Yes, exactly. There are so many to choose from. ;)

    There are several discussions here on HN about PRNGs of different kinds, and people still want to invent new ones. Why? Because the old ones do not satisfy them in full.

    Returning to the problem at hand: how many consecutive ties do we expect? This would help us define how many bits we should extract from the state - for N expected consecutive ties we should extract 2log_2(N) bits or more (birthday paradox). For 64K ties we need 32 bits, which is a fair amount of state, 1/8 of a typical PRNG discussed here.

    Most PRNGs are splittable, and this also raises the question of how splitting affects the bits generated. Will these bits be correlated between split generators?

    I definitely do not want a correlation coefficient computation that produces close results for provably different sets due to correlation between tie-broken sections filled with random numbers.
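    The birthday-bound arithmetic above, sketched out (collisions among k-bit random keys become likely once the number of keys approaches 2^(k/2); function name is mine):

    ```python
    import math

    def tie_break_bits(expected_ties):
        # Birthday bound: to keep collisions unlikely among
        # `expected_ties` random tie-breaking keys, draw at least
        # 2 * log2(ties) bits per key.
        return 2.0 * math.log2(expected_ties)
    ```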

    • > Yes, exactly. There are so many to choose from. ;)

      Just use a good PRNG to pick one of them :D

  • How do you sample from an arbitrary probability distribution? It is not a trivial problem at all.

    • I guess I assumed the operative word was "good". The term "random number generator" almost always refers to a generator that intends to produce a uniform distribution.

    • If you can sample from a uniform distribution, it is trivial to turn that into any arbitrary distribution.
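      For the one-dimensional case this is inverse-transform sampling: push a uniform draw through the target distribution's inverse CDF. A sketch for the exponential distribution (my example, not from the thread):

      ```python
      import math
      import random

      def sample_exponential(lam, u=None):
          # Inverse-transform sampling: if U ~ Uniform(0, 1) and F is
          # the target CDF, then F^{-1}(U) follows F. For Exp(lam),
          # F^{-1}(u) = -ln(1 - u) / lam.
          if u is None:
              u = random.random()
          return -math.log(1.0 - u) / lam
      ```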

You don't have to use sorting networks; to differentiate (or get a subgradient, anyway) a sorted list you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.

Also, I would say that random number generation is a well-solved problem.
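A sketch of that gradient routing (record the permutation from any sort, then scatter the upstream gradient back through it; function names are mine):

```python
import numpy as np

def sort_with_grad(x):
    # Sorting is just a permutation, so its (sub)gradient routes each
    # upstream derivative back to the element's original index. Any
    # sorting algorithm works as long as the permutation is recorded.
    perm = np.argsort(x, kind="stable")
    sorted_x = x[perm]

    def backward(grad_sorted):
        grad_x = np.empty_like(grad_sorted)
        grad_x[perm] = grad_sorted   # scatter back to original positions
        return grad_x

    return sorted_x, backward
```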

> Finally, if you want to have differentiable version of this correlation coefficient

Could you expand a little bit more on how to differentiate this coefficient? Since it is only based on ranks, it feels very non-differentiable.