Comment by fieldcny

4 years ago

Dumb question: what is the significance of this?

From a signal processing perspective: being able to recognise signals in the presence of interference, noise and distortion.

For example, you might have a radio signal (such as WiFi) that you want to receive. The first step is to pick that signal out of whatever comes out of your radio receiver, which will be the WiFi signal along with all sorts of noise and interference from other users. Typically the search is done with the aforementioned Pearson correlation, comparing the received signal against an expected template: a value of 1.0 means the received signal is a perfect match with the template, a value of 0.0 means no match at all. If the wanted signal is present, interference, noise and distortion will reduce the correlation to less than 1.0, meaning you might miss the WiFi signal even though it is there.

This article is about coming up with a measure that gives a more robust result in the face of noise, interference and distortion. It's fundamental stuff, in that it has quite general application.
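A minimal sketch of that detection idea. The template (a sine burst), the noise level, and the seed are all my own made-up stand-ins, not anything from the article:

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the run is repeatable
n = 256

# Hypothetical "expected template": the waveform the receiver searches for.
template = np.sin(2 * np.pi * 5 * np.arange(n) / n)

# Received signal: the template buried in additive noise (a crude stand-in
# for interference and distortion).
received = template + 0.5 * rng.standard_normal(n)

# Pearson correlation between template and received signal: high, but
# pushed below 1.0 by the noise.
r_present = np.corrcoef(template, received)[0, 1]

# A window containing only noise correlates with the template near 0.
noise_only = 0.5 * rng.standard_normal(n)
r_absent = np.corrcoef(template, noise_only)[0, 1]
```

A detector would then compare the correlation against a threshold; the noise is exactly what makes choosing that threshold hard.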

  • (Yay signal processing!)

    Skimming it now, this looks wild. Using the variance of the rank of the dataset (for a given point, how many are less than that point) seems... weird, and throwing out some information. The author seems legit tho, so I can't wait to try drop-in implementing this in a few things.

    • Rank-transforms are pretty common: they show up in a lot of non-parametric hypothesis tests, for example.

      The neat thing about ranks is that, in aggregate, they're very robust. You can make an estimate of the mean arbitrarily bad by tweaking a single data point: just send it towards +/- infinity and the mean will follow. The median, on the other hand, is barely affected by that sort of shenanigans.
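      That robustness is easy to see with made-up numbers (the data here is invented for illustration):

      ```python
      from statistics import mean, median

      data = [9.8, 10.1, 10.0, 9.9, 10.2]  # measurements clustered near 10

      # Corrupt a single point by sending it toward infinity...
      corrupted = data + [1e6]

      # ...the mean follows it, but the median barely moves.
      bad_mean = mean(corrupted)
      good_median = median(corrupted)
      ```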

Correlation typically means y is a linear function of x, but people usually interpret it (incorrectly) as: knowing x tells you something about y. If y = x^2, then y is determined completely by x, but since the relationship is nonlinear the correlation may actually be zero, depending on the distribution of x. This paper proposes a statistic that will indicate whether y is related to any function of x, linear or nonlinear.
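The y = x^2 case is easy to check numerically; with x distributed symmetrically around zero, the Pearson correlation vanishes exactly:

```python
import numpy as np

# x symmetric around zero; y is completely determined by x...
x = np.linspace(-1.0, 1.0, 101)
y = x ** 2

# ...yet the Pearson correlation is (numerically) zero, because the
# positive and negative slopes of the parabola cancel.
r = np.corrcoef(x, y)[0, 1]
```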

  • This is... quite wrong? The dictionary says:

      1. a mutual relationship or connection between two or more things
      2. [Statistics] interdependence of variable quantities.
      3. [Statistics] a quantity measuring the extent of the interdependence of variable quantities.
    

    The most sympathetic to your definition is Wikipedia:

        In statistics, correlation or dependence is any statistical
        relationship, whether causal or not, between two random variables or bivariate
        data. In the broadest sense correlation is any statistical association, though 
        it actually refers to the degree to which a pair of variables are linearly 
        related.
    

    And that's the mathematical formulation. Correlation also has a meaning in everyday speech, and mathematics doesn't have the authority to adopt everyday terms and then claim people are using them wrongly once it has changed the meaning.

    Also correlation very definitely means that knowing <x> tells you something about <y>. And vice versa. Like, for example: its value. Or at least a better idea of it than pure guessing without correlation.

  • I don't think there's a standard enough mathematical definition of correlation to say that. Perhaps the word has been co-opted, but the paper linked suggests that the co-option isn't universally accepted.

Well, the abstract says: “[a coefficient] which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other”. The former property does not hold for the Pearson correlation on general random variables (though it does for jointly Gaussian ones, which is one part of the reason they are used everywhere). I’m not sure about the latter property, actually, but I doubt it holds for Pearson either.

Worth noting the author is a highly regarded professor at Stanford.

It's fast to calculate, simple to understand, and doesn't make assumptions about the underlying distributions. This makes it a more effective generic tool for practitioners. Perhaps useful in the way the Pearson correlation is useful.
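A minimal sketch of the coefficient from the paper (Chatterjee's ξ) in the no-ties case: sort the pairs by x, rank the y values in that order, and penalise large jumps between consecutive ranks. The function name and structure here are my own:

```python
import numpy as np

def xi(x, y):
    """Sketch of Chatterjee's coefficient, assuming no ties:
    xi = 1 - 3 * sum |r_{i+1} - r_i| / (n^2 - 1),
    where r_i is the rank of y_i after sorting the pairs by x."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    order = np.argsort(x)                          # sort pairs by x
    ranks = np.argsort(np.argsort(y[order])) + 1   # ranks of y in x-order
    return 1.0 - 3.0 * np.abs(np.diff(ranks)).sum() / (n ** 2 - 1)
```

Note that ξ is asymmetric (xi(x, y) ≠ xi(y, x) in general) and, unlike Pearson's r, it picks up nonlinear relationships such as y = x².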

I'd like to learn more about the small-sample properties. Proofs of asymptotics are necessary but less interesting. The author's results on example data sets do look sensible, though.