
Comment by thesz

4 years ago

You can update the Pearson correlation online; you cannot update the OP's correlation coefficient online.

Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(NlogN).

Thus, it is not computationally efficient.

To break ties randomly one has to have a good random number generator, which is a problem in itself.

Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(Nlog^2(N)).

But it is a cool idea nevertheless; it brings up the use of ranking in statistics, and there are ties to other areas of statistical science.

For example, it appears that you can prune language models more efficiently if you use a rank metric rather than probability [1].

[1] https://aclanthology.org/P02-1023.pdf

> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(NlogN).

> Thus, it is not computationally efficient.

That is not really what I meant. First, I personally consider NlogN the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than NlogN is, like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.

Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there is almost always a way that is really fast in practice.

I didn't know about sorting networks, that is fascinating!

  • > First, I personally consider NlogN the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than NlogN is, like, "super efficient" in my mind.

    What is efficient and what ain't depends totally on context.

    For example, in some domains even proving that a problem is in NP (instead of something worse) means the problem is considered tractable, because in practice we have good solvers for many NP-hard problems.

    For other domains, you want O(1) or O(log n) or O(log* n) etc.

    In general, P vs NP is such a big deal, because polynomial runtime is generally seen as tractable, and anything worse as intractable.

    (Btw, O(n log n) _is_ a polynomial runtime, and so is O(N) or even O(1). Though I can see that as a shorthand of notation you are excluding them.)

    • > (Btw, O(n log n) _is_ a polynomial runtime, and so is O(N) or even O(1). Though I can see that as a shorthand of notation you are excluding them.)

      I mean yes, technically n^1 is a polynomial, but normally this is just called linear. You hit the nail on the head with the last sentence. Here I'm using polynomial to basically mean "n^k, k>1". From an NP vs P point of view, P is considered tractable. But from an "actually writing libraries" point of view, I think software developers consider O(n^2) "kinda bleh" if it can be at all avoided. Also, developers tend to give much more weight to the actual coefficients, parallelism potential, and practical runtime. Neural networks have superlinear bounds everywhere, but GPU goes brrr.

      Brass tacks, sorting some tuples and diffing the rank then reducing strikes me as likely to be:

      - pretty fast even in the naive implementation

      - sorting is a well trodden field

      - easy to write things like shaders for

      - amenable to speculative execution

      - amenable to divide and conquer

      So that is what I mean by efficient.

    • The “P” in NP means a solution to a decision problem is verifiable in polynomial time.

      The “N” in NP means the solution can be produced by a non-deterministic Turing machine (i.e. a machine that can explore many branches of the search space in each step).

      NP does not mean “non-polynomial” time to produce a solution, otherwise P != NP by definition. Instead P is a subset of NP.

      The open question is whether all decision problems whose solutions are verifiable in polynomial time also have solutions producible in polynomial time.


  • Indeed, in practice the logN factor can only get so big, even if N is taken to be the number of particles in the observable universe.

  • > Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there is almost always a way that is really fast in practice.

    By the same reasoning, the economy would be a solved problem: you just need to have money.

    Computing correlation by sorting a large amount of samples instead of reading them once is a cost difference that can make your data processing practically infeasible despite the efficiency of sorting algorithms.

  • I can only second this. Few algorithms are as efficient as NlogN.

    • Compare Fourier (O(NlogN)) and wavelet (O(N)) transforms. Both can be used for similarity (correlation) search through projections.

Can you explain in more detail why this can't be updated online?

We can track the rank of newly sampled pairs in O(logN) using something like an order statistic tree:

https://en.wikipedia.org/wiki/Order_statistic_tree

But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.
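To make the rank-tracking step concrete, here is a minimal sketch using Python's bisect module as a stand-in for an order statistic tree. Note the caveat: bisect gives O(logN) rank queries, but Python list insertion is O(N), so this only models the interface; a balanced tree with subtree sizes makes both operations O(logN).

```python
import bisect

class RankTracker:
    """Toy stand-in for an order statistic tree.

    bisect gives O(log n) rank *queries*, but Python list insertion
    is O(n); a balanced tree with subtree sizes makes both O(log n).
    """

    def __init__(self):
        self.sorted_vals = []

    def insert(self, v):
        # Rank = number of previously seen samples strictly less than v.
        rank = bisect.bisect_left(self.sorted_vals, v)
        self.sorted_vals.insert(rank, v)
        return rank
```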

Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:

(1) Rank coefficients can absolutely be updated online.

(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.

(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If, in your initial sort, you sort lexicographically by X then Y instead of by X alone, you can get away with a linear algorithm for computing those pairwise absolute differences.

> Thus, it is not computationally efficient.

Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(nlogn) is fine for a very wide variety of practical uses.
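To make the deterministic tie handling in (3) concrete, here is a hedged sketch (my code, assuming a Chatterjee-style rank coefficient computed from the Y ranks after sorting by X; only the no-ties formula is shown, so the tie-rescaling term is omitted):

```python
import numpy as np

def xi_no_ties(x, y):
    # Rank-based dependence coefficient in the Chatterjee style,
    # assuming no ties in y. Lexicographic sort by (x, y) breaks ties
    # in x deterministically, with no RNG required.
    order = np.lexsort((y, x))                 # primary key x, then y
    ranks = np.argsort(np.argsort(y[order]))   # ranks of y, x-sorted
    n = len(x)
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n * n - 1)
```

For a strictly monotone relationship this approaches 1 as n grows; for independent data it hovers near 0.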

  • I should point out that Pearson's correlation coefficient can be updated on-line; its storage complexity is O(1). The storage complexity for sorting is O(N).

    Thus, this particular algorithm will not fit into, say, a traffic analysis framework.
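    A minimal sketch of that O(1)-storage on-line update (Welford-style running means and co-moments; the class and variable names are mine):

    ```python
    class OnlinePearson:
        """Running Pearson correlation with O(1) storage."""

        def __init__(self):
            self.n = 0
            self.mean_x = 0.0
            self.mean_y = 0.0
            self.m2_x = 0.0   # sum of squared deviations of x
            self.m2_y = 0.0   # sum of squared deviations of y
            self.co = 0.0     # co-moment sum (x - mean_x)(y - mean_y)

        def update(self, x, y):
            self.n += 1
            dx = x - self.mean_x          # deviation from the *old* mean
            dy = y - self.mean_y
            self.mean_x += dx / self.n
            self.mean_y += dy / self.n
            self.m2_x += dx * (x - self.mean_x)   # old dev * new dev
            self.m2_y += dy * (y - self.mean_y)
            self.co += dx * (y - self.mean_y)

        def corr(self):
            denom = (self.m2_x * self.m2_y) ** 0.5
            return self.co / denom if denom > 0.0 else float("nan")
    ```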

    • Yeah, this has worse complexity than Pearson. However, it's not competing on complexity.

      The point of it is to detect general dependencies, where Pearson only detects linear ones (cases where Y is completely determined by X but their Pearson correlation is 0 are trivial to construct). The "linear" in linear dependency isn't the same as the "linear" in linear complexity here, but it doesn't seem surprising for an algorithm that can detect more general dependencies to have worse than linear performance (and nlogn is pretty much the next best).

      In the absence of a known algorithm with the same properties and better complexity, it doesn't seem unreasonable to call it "computationally efficient" when it comes with a complexity sufficient for most uses, rather than say quadratic or exponential.
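      The "trivial to construct" case can be checked numerically; with a symmetric X and Y = X^2, Y is fully determined by X yet Pearson reports (numerically) zero (a sketch):

      ```python
      import numpy as np

      # Y is completely determined by X, but positive and negative
      # deviations of X cancel in the covariance, so Pearson sees nothing.
      x = np.arange(-100, 101, dtype=float)
      y = x ** 2
      pearson = np.corrcoef(x, y)[0, 1]   # numerically ~0
      ```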


    • The runtime and storage requirements you cite only apply when you need exact answers.

      If you want to compute a correlation coefficient, you are probably happy to get eg 3 significant digits of precision. Thus you can use approximation algorithms and sampling.

> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(NlogN). Thus, it is not computationally efficient.

That doesn't seem like a deal breaker to me. Sure, if you're dealing with billions of data entries it will be 9 times slower, but throwing 9 times more computational power at a problem is far from unheard of nowadays.

  • I guess more like 30 times slower. Consider mergesort and quicksort, which both (aim to) halve the search space with every iteration, giving log2(N) levels of steps; 2 is our base. And yes, base-2 and base-10 logarithms are always within a constant factor of one another, but seeing as we _are_ talking about constant factors here to begin with...

    • If we are talking about billions of entries, even of just 32bit integers, we are talking about 4GB of data, your sort is probably going to be IO dominated, not CPU (comparison) dominated. In which case you'd likely use a sorted data structure with a much higher branching factor than 2 (like a B-tree or LSM tree).

      Take the B-tree example, with 4KB pages, you can fit 1000 integers in there for a branching factor of 500, and a performance multiplier of just over 3.
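      The arithmetic behind those multipliers, as a quick sanity check (assuming a billion keys):

      ```python
      import math

      n = 10 ** 9                       # a billion entries
      binary_levels = math.log2(n)      # comparison sort: ~30 levels
      btree_levels = math.log(n, 500)   # branching factor 500: ~3.3 levels
      ```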

  • > it will be 9 times slower

    Closer to 30 times slower, as the logarithm in sorting complexity is a binary rather than decimal one.

Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.

  • Yes, exactly. There are so many to choose from. ;)

    There are several discussions here on HN about PRNGs of different kinds, and people still want to invent new ones. Why? Because the old ones do not satisfy them in full.

    Returning to the problem at hand: how many consecutive ties do we expect? This would help us define how many bits we should extract from the state - for N expected consecutive ties we should extract 2log_2(N) bits or more (birthday paradox). For 64K ties we need 32 bits, which is a fair amount of state, 1/8 of a typical PRNG discussed here.

    Most PRNGs are splittable, and this also raises the question of how splitting affects the bits generated. Will these bits be correlated between split generators?

    I definitely do not want a correlation coefficient computation that produces close results for provably different sets due to correlation between tie-broken sections filled with random numbers.
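    The birthday-bound arithmetic above, sketched out (collisions among k-bit random keys become likely once the number of keys approaches 2^(k/2); function name is mine):

    ```python
    import math

    def tie_break_bits(expected_ties):
        # Birthday bound: to keep collisions unlikely among
        # `expected_ties` random tie-breaking keys, draw at least
        # 2 * log2(ties) bits per key.
        return 2.0 * math.log2(expected_ties)
    ```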

    • > Yes, exactly. There are so many to choose from. ;)

      Just use a good PRNG to pick one of them :D

  • How do you sample from an arbitrary probability distribution? It is not a trivial problem at all.

    • I guess I assumed the operative word was "good". The term "random number generator" almost always refers to a generator that intends to produce a uniform distribution.

    • If you can sample from a uniform distribution, it is trivial to turn that into any arbitrary distribution.
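      For the one-dimensional case this is inverse-transform sampling: push a uniform draw through the target distribution's inverse CDF. A sketch for the exponential distribution (my example, not from the thread):

      ```python
      import math
      import random

      def sample_exponential(lam, u=None):
          # Inverse-transform sampling: if U ~ Uniform(0, 1) and F is
          # the target CDF, then F^{-1}(U) follows F. For Exp(lam),
          # F^{-1}(u) = -ln(1 - u) / lam.
          if u is None:
              u = random.random()
          return -math.log(1.0 - u) / lam
      ```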

You don't have to use sorting networks; to differentiate (or get a subgradient, anyway) a sorted list you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.

Also, I would say that random number generation is a well-solved problem.
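A sketch of that gradient routing (record the permutation from any sort, then scatter the upstream gradient back through it; function names are mine):

```python
import numpy as np

def sort_with_grad(x):
    # Sorting is just a permutation, so its (sub)gradient routes each
    # upstream derivative back to the element's original index. Any
    # sorting algorithm works as long as the permutation is recorded.
    perm = np.argsort(x, kind="stable")
    sorted_x = x[perm]

    def backward(grad_sorted):
        grad_x = np.empty_like(grad_sorted)
        grad_x[perm] = grad_sorted   # scatter back to original positions
        return grad_x

    return sorted_x, backward
```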

> Finally, if you want to have differentiable version of this correlation coefficient

Could you expand a little bit more on how to differentiate this coefficient? Since it is only based on ranks, it feels very non-differentiable.