I once attended a summer school in Saint-Flour, France, where Sourav Chatterjee gave a masterful exposition of results on large deviations for random graphs. All chalk talk, clear presentation. My impression is he's a deep thinker. On that basis alone I give such an article much more weight than the plethora of articles that pass before the eyes; reading the masters has always been a good rule of thumb.
This follow-up paper presents a related measure of conditional dependence that has a "natural interpretation as a nonlinear generalization of the familiar partial R2 statistic for measuring conditional dependence by regression."
The follow-up paper also provides some additional interpretive insights, I think.
My intuitive impression is that both of these are important developments.
I also have a very vague suspicion, based on the form of the function, that the correlation measure has some interpretation in terms of mutual information involving rank transformations of random variables.
Thanks for finding this article. I agree, these are important developments in particular because so many econometric models are now using machine learning without any distributional assumptions. Using correlation coefficients based on linearity is grossly insufficient.
I didn't get the idea of ranking. But it's simple:
"In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. For example, the ordinal data hot, cold, warm would be replaced by 3, 1, 2."
Also learned that a Spearman coeff is just the Pearson coefficient taken on the rank of the data, instead of on the raw data.
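That identity is easy to sanity-check numerically. A quick sketch (NumPy only; the rank helper assumes distinct values):

```python
import numpy as np

def ranks(a):
    """1-based ranks; assumes all values are distinct."""
    return np.argsort(np.argsort(a)) + 1

x = np.array([3.4, 5.1, 2.6, 7.3])
y = x**3  # monotone but nonlinear in x

# Spearman's coefficient is just Pearson applied to the ranks
spearman = np.corrcoef(ranks(x), ranks(y))[0, 1]  # ~1.0: perfectly monotone
pearson = np.corrcoef(x, y)[0, 1]                 # < 1: monotone but not linear
```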
But whereas Pearson/Spearman take the normalized sum of products of deviations from the mean ((1/n)Σ(x−x̄)(y−ȳ)/σₓσᵧ, where x̄ is the mean and σ the std. dev.), Chatterjee takes a sum of absolute rank differences (ξₙ = 1 − 3Σ|rᵢ₊₁ − rᵢ|/(n² − 1)), involving just the ranks of the Y data after the (X, Y) pairs have been sorted by X.
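For the curious, here is a minimal sketch of that formula in Python (NumPy only, no-ties case; variable names are mine, not the paper's):

```python
import numpy as np

def chatterjee_xi(x, y):
    """No-ties form of Chatterjee's coefficient: sort the pairs by X,
    rank the Y values, and sum absolute differences of consecutive ranks."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    y_by_x = y[np.argsort(x)]               # Y values in X-sorted order
    r = np.argsort(np.argsort(y_by_x)) + 1  # 1-based rank of each Y
    return 1.0 - 3.0 * np.abs(np.diff(r)).sum() / (n * n - 1)
```

On a perfectly monotone relationship the consecutive rank differences are all 1, so ξₙ = 1 − 3(n−1)/(n²−1) = 1 − 3/(n+1), approaching 1 from below as n grows.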
But I'm still missing the intuition for why the sum of rank differences is so useful, or where the magic numbers come from.
jmount has an explanation elsewhere in this thread linking to https://win-vector.com/2021/12/26/how-to-read-sourav-chatter... which does a great job of explaining the intuition, but in a nutshell the normalization factor of 3 comes from the fact that if you select two random points between 0 and 1, the mean distance between them will be 1/3 (which is pretty easy to write down and solve; it boils down to the fact that a pyramid of height 1 that's 1x1 at the base has volume = 1/3).
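That 1/3 is easy to sanity-check with a quick simulation (a hypothetical snippet, NumPy only):

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.random(1_000_000)
v = rng.random(1_000_000)

# E|U - V| for independent U, V ~ Uniform(0, 1);
# analytically: 2 * int_0^1 int_0^u (u - v) dv du = 1/3
mean_dist = np.abs(u - v).mean()
```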
I have a very naive question: What are the downsides of estimating mutual information instead?
I have a math (but not statistics) background, and am sometimes bewildered by the many correlation coefficients that float around when MI describes pretty exactly what one wants ("how much does knowledge of one variable tell you about the other variable"?).
Different coefficients help you look at different kinds of relationships. For example, Pearson's R tells you about linear relationships between variables -- it's closely tied to the covariance: "how useful is it to draw a line through these data points, and how accurate is interpolating likely to be?".
Spearman's correlation helps you understand monotonic/rank-order relationships between variables: "Is there a trend where increasing X tends to also increase Y?" (This way we can be just as good at measuring the existence of linear, logarithmic, or exponential relationships, although we can't tell them apart.)
Mutual information helps you understand how similar two collections of data are, in the sort of unstructured way that's useful in building decision trees. You could have high mutual information without any sort of linear or monotonic relationship at all. Thus it's more general while at the same time not telling you anything that would be helpful in building, for instance, a predictive multivariate linear model.
TLDR; More specific coefficients leverage assumptions about the structure of the data (eg linearity), which can help you construct optimal versions of models under those assumptions. Mutual information doesn't make any assumptions about the structure of the data so it won't feed into such a model, but it still has lots of applications!
MI is quite useful and widely used. It typically requires binning data though when distributions are unknown / empirically estimated. This approach is a rank-based score, more similar to Spearman correlation than Pearson. This allows for nonlinear relationships between the two variables.
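For reference, the usual binned plug-in MI estimator the parent alludes to is only a few lines (a sketch; the bin count and names are my own choices):

```python
import numpy as np

def binned_mi(x, y, bins=16):
    """Plug-in mutual information estimate (in nats) from a 2-D histogram."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                     # joint distribution estimate
    px = pxy.sum(axis=1, keepdims=True)  # marginal of x
    py = pxy.sum(axis=0, keepdims=True)  # marginal of y
    nz = pxy > 0                         # avoid log(0) on empty cells
    return float((pxy[nz] * np.log(pxy[nz] / (px * py)[nz])).sum())
```

The catch is exactly the one mentioned above: the estimate is sensitive to the bin count, and the plug-in estimator is biased upward for independent data.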
A slightly critical review of the work can be seen here: https://academic.oup.com/biomet/advance-article/doi/10.1093/.... They argue that the older forms of rank correlation, namely D, R, and tau*, are superior. Nonetheless, it seems like a nice contribution to the stats literature, although I doubt the widespread use of correlation is going anywhere.
Transfer entropy (TE) is designed for time series, but it can be adapted to the more general case. The advantage over MI is that with MI you can only see that A and B are related, while TE can show that A drives B but not the reverse.
In the article it is explained that the purpose of this coefficient is to estimate how much Y is a function of X [1] (or how noisy that association is); in particular, this coefficient is 1 iff Y is a (measurable) function of X.
With MI (the article claims) you can have a coefficient of 1 without Y being a function of X.
[1] This means that this coefficient is intentionally not symmetric.
I'm also interested in this, having tried and semi-successfully used mutual information for finding associations between multinomial variables. As an even more naive question, I find the actual selection of estimators bewildering. How do I know which estimator to use for mutual information? How do I know if my chosen estimator has converged or is doing a bad job on my data? Bringing it back to the topic at hand, does the estimator presented in the paper provide good estimates for a wider variety of cases than the mutual information plug-in estimator? If so I can see it might be nice for simplicity reasons alone. Can we have different estimators for this new correlation coefficient? Any ideas what that would look like?
Mutual information is not trivial, or even possible, to estimate in many practical situations as far as I know. Example applications in robotics or computer vision where mutual information would be useful are segmentation and denoising of unordered 3D point data.
Yes, as someone mentioned above, the problem is getting the underlying distribution of the data, so you can measure −Σ pᵢ log(pᵢ); this usually involves some binning, which can be tricky (and yes, I know the formula I gave is entropy, not MI).
I try to remind myself that "it is just a model", as a corollary to "all models are wrong, some are useful." You are never dealing with the real world. And you are usually trying to estimate some future as of yet unobserved signal based on existing data. In other words, if your bins are reasonable and reasonably usefully accurate, you can build a working if not perfect system.
Don't try to optimize testing error performance to a value lower than the irreducible error in the system.
How is it possible to make a general coefficient of correlation that works for any non-linear relationship? Say y = sha256(x): doesn't that mean y is a predictable function of x, but it's statistically impossible to tell from looking at inputs/outputs alone?
Two things:
1. The result is asymptotic, i.e. it holds as the number of samples approaches infinity.
2. The result is an "almost surely" result, i.e. in the collection of all possible infinite samples, the set of samples for which it fails has 0 measure. In non technical terms this means that it works for typical random samples and may not work for handpicked counterexamples.
In our particular case, let f = sha256. Then X must be discrete, i.e. a natural number. Now the particulars depend on the distribution of X, but the general idea is that since we have discrete values, the probability that we get an infinite sample where the values tend to infinity is 0. So in a typical sample there's going to be an infinitude of x ties, and furthermore most x values aren't too large (in a way you can make precise), so the tie factors l_i dominate, since there just aren't that many distinct values encountered in total. And so we get that the coefficient tends to 1.
No. If you have a good hash function, that means it's computationally infeasible to determine anything about x based only on y. It's not statistically impossible at all; "statistically" doesn't concern itself with computational difficulties.
This is similar to how, e.g., we generally assume that AES is unbreakable from a computational point of view, but if you want a statistically unbreakable cipher, your only (IINM) option is a one-time pad.
The summary says that it works if it is a measurable function [0], which is structure preserving. So sha256 would scramble the input too much for it to be detected here.
I've not yet read the article, just the abstract. But the abstract is pretty precise, and being "measurable" is a very weak constraint. For (computationally) complex functions like a hash, my first guess is the "escape clause" is in the number of samples needed for the statistic to converge to its expected value.
Simple: if in the whole sample, the same x always comes with the same y, then y is a function of x.
Example:
x1 = 1, y1 = 2
x2 = 1.1, y2 = 2
Here, y is a function of x, because if we know that x = 1, then we also know that y = 2. However, x is not a function of y, as we don't know what value x has given that y = 2.
I hope that made it more clear. Here, "function" simply means that every time we started with the same x, we also got the same y.
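That criterion ("same x always comes with the same y") is trivially checkable in code (a hypothetical helper, not from the paper):

```python
def is_function_of(pairs):
    """True if every x value in pairs maps to exactly one y value."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            return False  # same x, two different y's: not a function of x
        seen[x] = y
    return True

pairs = [(1, 2), (1.1, 2)]
is_function_of(pairs)                        # y is a function of x
is_function_of([(y, x) for x, y in pairs])   # but x is not a function of y
```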
It seems that, by this logic, if you don't have any repeats in your x values, then you are bound to conclude that any set of y values is a function of x.
In Theorem 1.1, f is a function of random variables, which might be where you're confused.
> doesn't that mean y is a predictable function of x
Sort of: as function of real numbers, sha256 is just some deterministic function.
But the point is that its output "looks like" a uniform random variable for any reasonable input distribution, i.e. as random variables the input and output should have 0 correlation.
That's right, the paper has practical estimates of convergence only for continuous functions (kind of). For hash functions this coefficient is useless.
It is interesting to note that the Chatterjee paper makes a point of mentioning Pearson, Spearman, Kendall's tau, whereas the ones focused on in this paper all appear as citations but aren't explicitly discussed.
To be fair, Pearson, Spearman and Kendall's tau are the coefficients people use in practice. Had I been in the author's position I would have done the same: cite all the interesting developments but compare with what people actually use.
Comparing with something people barely know would be nice, but people have a limited attention span, so I would push that to the annexes at best and focus on the more important parts.
"The formula is so simple that it is likely that there are many such coefficients, some of them possibly having better properties than the one presented below."
suggesting that the author is seeking to highlight a basic principle, not tune for a specific application.
Seems easy enough to play around with, and it need not be strictly numbers either, as long as rank is defined on your fields (they are sortable... total ordering? I'm rusty on my terminology). Basically sort X, take the variance of the rank of Y, Z, etc., much in the same way you would compute a multivariable correlation.
From a signal processing perspective: being able to recognise signals in the presence of interference, noise and distortion.
For example, you might have a radio signal (such as WiFi) that you want to receive. First step is that you have to pick that signal out of whatever signal comes out of your radio receiver: which will be the WiFi signal along with all sorts of noise and interference from other users. Typically the search will be done with the mentioned "Pearson's Correlation", using it to compare the received signal with an expected template: a value of 1.0 meaning the received signal is a perfect match with the template, a value of 0.0 meaning no match at all. If the wanted signal is present, interference, noise and distortion will reduce the result of the correlation to less than 1.0, meaning you might miss the WiFi signal, even though it is present.
This article is about coming up with a measure that gives a more robust result in the face of noise, interference and distortion. It's fundamental stuff, in that it has quite general application.
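As a toy version of that search (my own sketch, not an actual WiFi receiver): slide the known template over the received samples and take the Pearson correlation at each offset, then pick the peak.

```python
import numpy as np

rng = np.random.default_rng(1)
template = np.sin(2 * np.pi * np.arange(64) / 8)  # known 64-sample preamble
received = rng.normal(size=1000)                  # receiver noise
received[300:364] += 3 * template                 # wanted signal buried at offset 300

def sliding_pearson(sig, tpl):
    """Pearson correlation of tpl against every window of sig."""
    n = len(tpl)
    t = (tpl - tpl.mean()) / tpl.std()
    out = np.empty(len(sig) - n + 1)
    for i in range(len(out)):
        w = sig[i:i + n]
        out[i] = np.mean((w - w.mean()) / w.std() * t)
    return out

scores = sliding_pearson(received, template)
offset = int(scores.argmax())   # peak should land at (or next to) 300
```

Lower the signal amplitude (or add distortion) and the peak score drops below 1.0, which is exactly the detection-threshold problem described above.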
Skimming it now, this looks wild. Using the variance of the rank of the dataset (for a given point, how many are less than that point) seems... weird, and throwing out some information. The author seems legit tho, so I can't wait to try drop-in implementing this in a few things.
Correlation typically means y is a linear function of x, but people usually interpret it (incorrectly) as: knowing x tells you something about y. If y = x^2, then y is determined completely by x, but since it's nonlinear the correlation may actually be zero, depending on the distribution of x. This paper proposes a statistic that will indicate if y is related to any function of x, linear or nonlinear.
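A quick numeric illustration of that point (a sketch; the ξ line uses the paper's no-ties formula):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, 10_000)
y = x**2  # y is completely determined by x, but not linearly

pearson = np.corrcoef(x, y)[0, 1]  # ~0: x is symmetric, so no linear trend

# Chatterjee's xi (no-ties form): rank the y's in x-sorted order
r = np.argsort(np.argsort(y[np.argsort(x)])) + 1
xi = 1 - 3 * np.abs(np.diff(r)).sum() / (len(x)**2 - 1)  # close to 1
```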
1. a mutual relationship or connection between two or more things
2. [Statistics] interdependence of variable quantities.
3. [Statistics] a quantity measuring the extent of the interdependence of variable quantities.
The most sympathetic to your definition is Wikipedia:
> In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it actually refers to the degree to which a pair of variables are linearly related.
And that's the mathematical formulation. Correlation also has a meaning in everyday speech, and mathematics doesn't have the authority to just adopt terms and then claim people are wrong after they've changed the meaning.
Also correlation very definitely means that knowing <x> tells you something about <y>. And vice versa. Like, for example: its value. Or at least a better idea of it than pure guessing without correlation.
I don't think that there's a standard enough mathematical definition of correlation to say that. Perhaps the word has been co-opted, but the linked paper suggests that the co-option isn't accepted.
Well, in the abstract it says: “[a coefficient] which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other”, the former property which is not true of general random variables (but is true of Gaussians, which is one part of the reason they are used everywhere). I’m not sure about the latter property, actually, but I also doubt it’s true.
Worth noting the author is a highly regarded professor at Stanford.
It's fast to calculate, simple to understand, and doesn't make assumptions about the underlying distributions. This makes it a more effective generic tool for practitioners. Perhaps useful in the way the Pearson correlation is useful.
I'd like to learn more about the small sample properties. Proofs of asymptotics are necessary but less interesting. But the author's examples on example data sets look like it makes sense.
This is a bit of nonsense. If you have n pairs (x_i, y_i) with all x_i's and y_i's different, then you can always say that y is a function of x: you simply take a very wiggly function that passes through all the pairs. So this new "coefficient of correlation" should always return 100% in such cases. Of course, such a coefficient would be useless, and fortunately, this one doesn't do that. The point, however, is that in order for such a coefficient to not be trivially 100%, you need extra specifications; more precisely, you need to specify some type of desired smoothness (or tension) in the class of functions you are looking for.
Let's make this more precise, and use R-squared (which this coefficient appears to be in some shape or form, since it is between 0 and 1, not -1 and 1). You look for relationships of the type y_i = f(x_i) + epsilon_i, where f is in some class of functions. Then this coefficient would be the ratio of the variance of f(x_i) and the variance of y_i. If you allow the class of functions to be arbitrary, there's no residual epsilon. So you need to specify somehow that class. In the case of the Pearson correlation, that class is the class of linear functions. If you want to generalize the Pearson correlation, you need to think of more general ways to specify the class of regression functions.
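The "wiggly function" point is easy to demonstrate: with n distinct x values, a degree-(n−1) polynomial reproduces any y exactly, so an unrestricted function class leaves no residual at all (a toy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(6, dtype=float)
y = rng.normal(size=6)            # pure noise: no "real" relationship at all

coeffs = np.polyfit(x, y, deg=5)  # degree n-1 interpolates all 6 points
residual = y - np.polyval(coeffs, x)  # essentially zero everywhere
```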
Seems likely. The presented coefficient only looks at the ordering of the X-values, and how it relates to the ordering of the Y-values. All other information is thrown away. That's how it can be so general, but it should come at the expense of power.
Granger causality is about linear inter-temporal dependence. Chatterjee correlation is about non-linear contemporaneous (or time-independent) dependence. They're tools for quite different applications.
Looks like you could do the opposite: use this measure as a proxy for correlation between input data X and Y, and see if increased correlation between X and Y is superior to just predicting Y without X. Maybe model the correlation value going over a threshold as a sequence of Bernoulli trials.
The paper discusses MIC in particular at least. They suggest that MIC sometimes overstates the strength of a noisy relationship, giving the example of a bivariate normal mixture (Figure 9). In that example MIC is 1, but the new measure is 0.48 or something, which seems more reasonable.
My understanding is that one of the major critiques of statistics, especially its use in psychology, has been the use of models which are derived from the mean.
There are inherent flaws/assumptions to this approach which Peter Molenaar has done extensive work to critique (See Todd Rose's book on the subject). For anyone who understands the technique presented in this paper, does it also depend on the mean as a model like when calculating Pearson's r?
Isn't Molenaar looking at networks of symptoms over time? Yes, in any multi-variate time series, when one is searching for relationships between the variables (and allowing that you may have multiple sets of time series which may have some type of grouping, i.e. observations from a set of people with one diagnosis vs observations from a set of people with a contrary or with no diagnosis), then yes, any attempt to find co-relations in the multivariate signal need to account for the underlying statistical distribution of the signal components. The normal distribution isn't a bad first a priori approximation, but you really need to check.
Side note: it also isn't clear that you can group by diagnosis; see, for example, https://pubmed.ncbi.nlm.nih.gov/29154565/, which shows that even within diagnostic groups there is substantial individual variation.
This is an order-based algorithm, so it is more related to the median than the mean.
Another very useful consequence of being order-based, is that this new coefficient is much more robust to noise/outliers than the canonical correlation coefficient.
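A quick demonstration of that robustness (my own sketch; Spearman stands in here for the rank-based family):

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=200)
y = 2 * x + rng.normal(scale=0.1, size=200)  # strong linear relationship

def ranks(a):
    return np.argsort(np.argsort(a)) + 1

y_bad = y.copy()
y_bad[0] = 1e4                                # one wild outlier

pearson_clean = np.corrcoef(x, y)[0, 1]       # ~1: clean linear data
pearson_bad = np.corrcoef(x, y_bad)[0, 1]     # collapses: the outlier dominates
spearman_bad = np.corrcoef(ranks(x), ranks(y_bad))[0, 1]  # barely moves
```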
I think there are far bigger problems with the lack of theoretical foundations and abuse of p-values rather than with ergodicity or whatever is the pet peeve of Peter Molenaar.
Correlation never means causation, but it typically implies it, and is often directly entangled with it. Like the lead in gas pumps being tied to city crime: the correlation likely was directly related to causation. That's not true of all correlation, but cause and effect typically correlate with something, even if there's only one true cause and many other correlations that appear meaningful are meaningless.
You're using the formula for the case where there are no ties, but the vertical line has all X values tied and the horizontal line has all Y values tied. Also, the case where Y is a constant is explicitly excluded from consideration in the paper.
If anyone is interested, I've also published a Go implementation [1] of the code for float64 slices.
Results seem to exactly match the R and Python implementation, so there will be a second pass focusing on performance, stability and support for categorical variables.
The current version of the Python lib seems to be extremely badly written code. Or is the algorithm just that bad? It takes something like 21 s to compute the correlation for just 10k samples.
The equation is on the second page, and if you know enough to know what correlation is, you know enough to implement it from the equation given. It takes O(N log N) to run, though, if implemented naively (because you have to sort your data).
One of the aspects of math papers that I dislike is how unapproachable they are if you’re unfamiliar with some of the terminology and conventions. The esoteric symbols don’t make it any easier to Google their definitions either.
For instance, what does this mean?
> μ is the law of Y
μ is usually the mean or average. Is “law” something else?
Yeah, working code snippets would be great. That would provide an unambiguous implementation that someone could use to dig into the underlying functions used and learn the basic concepts that would be tedious for the author to go through.
In terms of the notation, it seemed like the author actually tried to keep his paper accessible, so my complaint isn’t with the author. My gripe is more with math notation in general.
In my opinion, unless you've read the appropriate textbooks or taken the right classes, math notation is hard to learn. The symbols are hard to Google for. Integral symbols, R for real numbers, sigma, delta, the round E (∈) that stands for "in" are not found on a standard keyboard, so it's challenging for a layman to Google and learn that notation on their own. Math evolved over millennia and the notation wasn't constructed with SEO in mind, so I understand why things are the way they are, but it's a stumbling block for the uninitiated trying to learn more advanced math. Maybe there are resources like math.stackexchange out there that I'm unaware of that would help make learning notation more approachable.
There are so many super cool aspects to this!
- it's always fun when a new equation or formula is discovered, doubly so when it is very practical
- actually really easy to wrap your head around it
- seems very computationally efficient. Basically boils down to sorts and distance
- not limited to strict numeric fields (integers and reals). Anything with an ordering defined can act as your Xs/Ys: characters, words, complex/vectors (under magnitude). I think you could even apply it recursively to divide and conquer high dimensional datasets.
- looks useful in both LTI and non stationary signal analysis
- possible use cases: exoplanet search, cosmology in general, ecology, climate, cryptanalysis, medicine, neuroscience, and of course, someone will find a way to win (more likely lose) money on stonks.
You can update the Pearson correlation online; you cannot update the OP's correlation coefficient online.
Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N).
Thus, it is not computationally efficient.
To break ties randomly one has to have a good random number generator, which is a problem in itself.
Finally, if you want a differentiable version of this correlation coefficient, you will need to use sorting networks, which are O(N log²(N)).
But, it is a cool idea nevertheless, it brings up use of ranking in statistics and there are ties to other areas of the statistical science.
For example, it appears that you can prune language models more efficiently if you use a rank metric rather than probability [1].
[1] https://aclanthology.org/P02-1023.pdf
> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N).
> Thus, it is not computationally efficient.
That is not really what I meant. First, I personally consider O(N log N) the bar for "efficient" as far as algorithms go. Anything polynomial is inefficient, and anything better than O(N log N) is, like, "super efficient" in my mind. Maybe that is a weird opinion; should I update my mental model? My reasoning is that a lot of well-known algorithms operate in better-than-polynomial but worse-than-linear time.
Second, sorting is a solved problem. Regardless of theoretical efficiency, we have so many ways to sort things that there are almost always ways of sorting data that are really fast in practice.
I didn't know about sorting networks, that is fascinating!
Can you explain in more detail why this can't be updated online?
We can track the rank of newly sampled pairs in O(log N) using something like an order statistic tree:
https://en.wikipedia.org/wiki/Order_statistic_tree
But I guess the problem is that with each new pair, O(n) pairs could change their contribution to the correlation coefficient.
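A minimal way to play with the rank-tracking part (stdlib only; `bisect.insort` on a list is O(n) per insert, which is exactly the cost a balanced order statistic tree would reduce to O(log n)):

```python
import bisect

class RankTracker:
    """Track ranks of a growing sample. List-based stand-in for an order statistic tree."""
    def __init__(self):
        self._sorted = []

    def insert(self, value):
        bisect.insort(self._sorted, value)  # O(n) on a list; O(log n) with a tree

    def rank(self, value):
        """Number of stored values <= value."""
        return bisect.bisect_right(self._sorted, value)

t = RankTracker()
for v in [5.0, 3.0, 4.0, 10.0]:
    t.insert(v)
```

As noted above, the rank query itself is cheap; the expensive part is that one new pair can shift the contributions of many existing terms.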
Pearson still has a lot of advantages on modern hardware and in practice should be a ton faster, but:
(1) Rank coefficients can absolutely be updated online.
(2) There aren't many applications where an extra polylogarithmic factor is the difference between a good-enough algorithm and something too slow.
(3) The RNG is an implementation detail I'd probably omit. It's asymptotically equivalent to the deterministic operation of rescaling the sum of all pairwise absolute differences of Y values for each run of identical X values. If in your initial sort of the X values you instead lexicographically sort by X then Y then you can get away with a linear algorithm for computing those pairwise absolute differences.
> Thus, it is not computationally efficient.
Is there some standard sense in which only (sub)linear algorithms are considered "computationally efficient"? O(nlogn) is fine for a very wide variety of practical uses.
> Pearson and other correlation coefficients are linear, O(N); sorting would incur a logarithmic multiplier, O(N log N). Thus, it is not computationally efficient.
That doesn't seem a deal breaker to me. Sure, if you're dealing with billions of data entries it will be 9 times slower, but throwing 9 times more computational power at a problem is far from unheard of nowadays.
Out of curiosity, why do you think having a good random number generator is problematic? It seems like it's easy enough to access one in most situations.
You don't have to use sorting networks; to differentiate (or get a subgradient, anyway) a sorted list you just pass the derivative back to the original index, so you can use any sorting algorithm as long as you keep track of the permutation. You could even use radix sort.
Also, I would say that random number generation is a well-solved problem.
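A concrete sketch of that permutation trick (my own toy version, NumPy only): the forward pass sorts; the backward pass just routes each gradient entry back to the original index it came from.

```python
import numpy as np

def sort_with_grad(x):
    """Differentiate through sorting by remembering the permutation."""
    perm = np.argsort(x)           # any sorting algorithm works, even radix sort
    sorted_x = x[perm]

    def backward(grad_sorted):
        grad_x = np.empty_like(grad_sorted)
        grad_x[perm] = grad_sorted  # grad of sorted[i] belongs to x[perm[i]]
        return grad_x

    return sorted_x, backward

s, backward = sort_with_grad(np.array([3.0, 1.0, 2.0]))
g = backward(np.array([10.0, 20.0, 30.0]))  # gradients w.r.t. the original order
```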
> Finally, if you want to have differentiable version of this correlation coefficient
Could you expand a little bit more on how to differentiate this coefficient? Since it is only based on ranks, it feels very non-differentiable.
This tweet contains a visualization comparing linear correlation and "Chatterjee correlation":
https://twitter.com/adad8m/status/1474754752193830912?s=21
Since it's rank-based, it would be nice to see a comparison to Spearman's correlation instead. It's very easy to find failure cases for Pearson, so the visualization is next to meaningless, IMO.
This new coefficient of correlation is really really awesome, and this visualization shows its value in such a beautifully simple presentation.
It would be great if someone with Wikipedia edit privileges could edit the Wikipedia article at [1] to describe/link how Chatterjee's correlation coefficient solves many of the known limitations of Pearson's correlation coefficient. ;)
[1] https://en.wikipedia.org/wiki/Pearson_correlation_coefficien...
(especially second top-left diagram)
Easy there. New correlation coefficients get proposed all the time (eg. the introduction of the linked paper lists ~10-20 alone!). It's not a good idea to add every newly proposed coefficient to established wiki pages, just because they trend on social media. Yes, the paper looks nice, but if you read any new paper proposing a new measure, they all do! They're meant to be written that way. Let the community decide and test and discuss, and if in 10 years this new coefficient is well accepted and has proven itself, we can think about your proposed edit. Doing it before is putting the cart before the horse, and is a recipe for astroturfing.
The right place to add it would be here AFAICT: https://en.wikipedia.org/wiki/Correlation#Other_measures_of_.... The article on Pearson's shouldn't get sidetracked by discussing other types of variables. (or rather, it already has too many)
But the part that one would add would not necessarily be the definition of the coefficient ξn, but rather the interesting discussion at the beginning about what makes for a good correlation coefficient.
1 reply →
I once attended a summer school in Saint-Flour, France, where Sourav Chatterjee gave a masterful exposition of results on large deviations for random graphs. All chalk talk, clear presentation. My impression is he's a deep thinker. On that basis alone I give such an article much more weight than the plethora of articles that pass before the eyes; reading the masters has always been a good rule of thumb.
what kind of summer school was that?
This one: https://recherche.math.univ-bpclermont.fr/stflour/
This article is now published in the Journal of the American Statistical Association [Volume 116, 2021 - Issue 536]
https://doi.org/10.1080/01621459.2020.1758115
That costs $47 (US). The arXiv article is free.
I think pointing out that it is published in a known journal is to highlight that the paper is now peer-reviewed.
7 replies →
Also worth looking at the cited and related paper with the same author:
https://arxiv.org/abs/1910.12327
This follow-up paper presents a related measure of conditional dependence that has a "natural interpretation as a nonlinear generalization of the familiar partial R2 statistic for measuring conditional dependence by regression."
The follow-up paper also provides some additional interpretive insights, I think.
My intuitive impression is that both of these are important developments.
I also have a very vague suspicion, based on the form of the function, that the correlation measure has some interpretation in terms of mutual information involving rank transformations of random variables.
Thanks for finding this article. I agree, these are important developments in particular because so many econometric models are now using machine learning without any distributional assumptions. Using correlation coefficients based on linearity is grossly insufficient.
I didn't get the idea of ranking. But it's simple:
"In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. If, for example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. For example, the ordinal data hot, cold, warm would be replaced by 3, 1, 2."
https://en.m.wikipedia.org/wiki/Ranking
Also learned that a Spearman coeff is just the Pearson coefficient taken on the rank of the data, instead of on the raw data.
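That identity (Spearman = Pearson applied to the ranks) is easy to check numerically; a quick sketch, assuming no ties:

```python
import numpy as np

def ranks(a):
    """1-based ranks, assuming no ties."""
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(1, len(a) + 1)
    return r

def spearman(x, y):
    # textbook formula for the no-ties case: 1 - 6 * sum(d^2) / (n (n^2 - 1))
    d = ranks(x) - ranks(y)
    n = len(x)
    return 1 - 6 * float(np.sum(d ** 2)) / (n * (n ** 2 - 1))

rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = x ** 3 + rng.normal(scale=0.1, size=50)
s1 = spearman(x, y)                                # rank-difference formula
s2 = float(np.corrcoef(ranks(x), ranks(y))[0, 1])  # Pearson on the ranks
```

The two agree to floating-point precision, which is exactly the claim.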
But whereas Pearson/Spearman takes the sum of products of deviations from the mean (Σ(x−x̄)(y−ȳ)/(n·σx·σy), where x̄ is the mean and σ the std. dev.), Chatterjee takes a sum of absolute rank differences (1 − 3Σ|rᵢ₊₁−rᵢ|/(n²−1)), concerning just the ranks rᵢ of the Y data after the X,Y pairs have been sorted by X.
But still missing the intuition for why the sum of difference of ranks is so useful or where the magic numbers come from.
jmount has an explanation elsewhere in this thread linking to https://win-vector.com/2021/12/26/how-to-read-sourav-chatter... which does a great job of explaining the intuition. In a nutshell, the normalization factor of 3 comes from the fact that if you select two random points between 0 and 1, the mean distance between them will be 1/3 (which is pretty easy to write down and solve; it boils down to the fact that a pyramid of height 1 with a 1x1 base has volume 1/3).
Thanks! That did it
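The 1/3 factor mentioned above is also easy to sanity-check by simulation:

```python
import numpy as np

# Monte Carlo check that E|U - V| = 1/3 for independent U, V ~ Uniform(0, 1)
rng = np.random.default_rng(42)
u = rng.uniform(size=1_000_000)
v = rng.uniform(size=1_000_000)
mean_gap = float(np.mean(np.abs(u - v)))
# standard error is about 0.00024 here, so this lands very close to 1/3
```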
I have a very naive question: What are the downsides of estimating mutual information instead?
I have a math (but not statistics) background, and am sometimes bewildered by the many correlation coefficients that float around when MI describes pretty exactly what one wants ("how much does knowledge of one variable tell you about the other variable"?).
So ... what am I not understanding?
Different coefficients help you look at different kinds of relationships. For example, Pearson's R tells you about linear relationships between variables -- it's closely tied to the covariance: "how useful is it to draw a line through these data points, and how accurate is interpolating likely to be?".
Spearman's correlation helps you understand monotonic/rank-order relationships between variables: "Is there a trend where increasing X tends to also increase Y?" (This way we can be just as good at measuring the existence of linear, logarithmic, or exponential relationships, although we can't tell them apart.)
Mutual information helps you understand how similar two collections of data are, in the sort of unstructured way that's useful in building decision trees. You could have high mutual information without any sort of linear or monotonic relationship at all. Thus it's more general while at the same time not telling you anything that would be helpful in building, for instance, a predictive multivariate linear model.
TLDR; More specific coefficients leverage assumptions about the structure of the data (eg linearity), which can help you construct optimal versions of models under those assumptions. Mutual information doesn't make any assumptions about the structure of the data so it won't feed into such a model, but it still has lots of applications!
MI is quite useful and widely used. It typically requires binning data though when distributions are unknown / empirically estimated. This approach is a rank-based score, more similar to Spearman correlation than Pearson. This allows for nonlinear relationships between the two variables.
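A minimal plug-in MI estimate, to illustrate the binning step (the bin count here is an arbitrary choice, which is exactly the tuning problem mentioned):

```python
import numpy as np

def mutual_info_binned(x, y, bins=16):
    """Plug-in mutual information estimate from a 2D histogram, in nats.
    The bin count is a tuning knob; too few bins washes out structure,
    too many overfits the sample."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                   # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(1)
x = rng.normal(size=20_000)
mi_dep = mutual_info_binned(x, x ** 2)             # strong nonlinear dependence
mi_indep = mutual_info_binned(x, rng.normal(size=20_000))
```

Note the nonlinear y = x² case scores high even though its Pearson correlation is near zero.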
A slightly critical review of the work can be seen here: https://academic.oup.com/biomet/advance-article/doi/10.1093/.... They argue that the older forms of rank correlation, namely D, R, and tau*, are superior. Nonetheless, it seems like a nice contribution to the stats literature, although I doubt the widespread use of correlation is going anywhere.
Freely available version of the review paper, for those who want it: https://arxiv.org/abs/2008.11619
Instead of mutual information, can I recommend the Transfer Entropy?
https://journals.aps.org/prl/abstract/10.1103/PhysRevLett.85...
https://en.wikipedia.org/wiki/Transfer_entropy
It is designed for time series, but can be adapted to a more general case. The advantage over MI is that with MI you can only see that A and B are related, while TE can show that A drives B, but not the reverse.
arxiv link: https://arxiv.org/abs/nlin/0001042
In the article it is explained that the purpose of this coefficient is to estimate how much Y is a function of X [1] (or how noisy that association is); in particular, this coefficient is 1 iff Y is a function of X.
With MI (the article claims), you can have a coefficient of 1 without Y being a function of X.
[1] This means that this coefficient is intentionally not symmetric.
I'm also interested in this, having tried and semi-successfully used mutual information for finding associations between multinomial variables. As an even more naive question, I find the actual selection of estimators bewildering. How do I know which estimator to use for mutual information? How do I know if my chosen estimator has converged or is doing a bad job on my data? Bringing it back to the topic at hand, does the estimator presented in the paper provide good estimates for a wider variety of cases than the mutual information plug-in estimator? If so I can see it might be nice for simplicity reasons alone. Can we have different estimators for this new correlation coefficient? Any ideas what that would look like?
Mutual information is not trivial, or even possible, to estimate in many practical situations, as far as I know. Example applications in robotics or computer vision where mutual information would be useful are segmentation and denoising of unordered 3D point data.
Yes, as someone mentioned above, the problem is getting the underlying distribution of the data so you can measure −Σ pᵢ log(pᵢ); this usually involves some binning, which can be tricky (and yes, I know the formula I gave is entropy, not MI).
I try to remind myself that "it is just a model", as a corollary to "all models are wrong, some are useful." You are never dealing with the real world. And you are usually trying to estimate some future as of yet unobserved signal based on existing data. In other words, if your bins are reasonable and reasonably usefully accurate, you can build a working if not perfect system.
Don't try to optimize testing error performance to a value lower than the irreducible error in the system.
2 replies →
How is it possible to make a general coefficient of correlation that works for any non-linear relationship? Say if y = sha256(x); doesn't that mean y is a predictable function of x, but it's statistically impossible to tell from looking at inputs/outputs alone?
Two things: 1. The result is asymptotic, i.e. it holds as the number of samples approaches infinity.
2. The result is an "almost surely" result, i.e. in the collection of all possible infinite samples, the set of samples for which it fails has 0 measure. In non technical terms this means that it works for typical random samples and may not work for handpicked counterexamples.
In our particular case let f = sha256. Then X must be discrete, i.e. a natural number. Now the particulars depend on the distribution of X, but the general idea is that since we have discrete values, the probability that we get an infinite sample whose values tend to infinity is 0. So in a typical sample there's going to be an infinitude of x ties, and furthermore most x values aren't too large (in a way you can make precise), so the tie factors l_i dominate, since there just aren't that many distinct values encountered in total. And so the coefficient tends to 1.
No. If you have a good hash function, that means it's computationally infeasible to determine anything about x based only on y. It's not statistically impossible at all; "statistically" doesn't concern itself with computational difficulties.
This is similar to how, e.g., we generally assume that AES is unbreakable from a computational point of view, but if you want a statistically unbreakable cipher, your only (IINM) option is a one-time pad.
The summary says that it works if it is a measurable function [0], which is structure preserving. So sha256 would scramble the input too much for it to be detected here.
[0] https://en.m.wikipedia.org/wiki/Measurable_function
I've not yet read the article, just the abstract. But the abstract is pretty precise, and being "measurable" is a very weak constraint. For (computationally) complex functions like a hash, my first guess is the "escape clause" is in the number of samples needed for the statistic to converge to its expected value.
Any function that can be implemented in a computer is a Lebesgue measurable function.
2 replies →
Simple: if in the whole sample, the same x always comes with the same y, then y is a function of x.
Example:
x1 = 1, y1 = 2
x2 = 1.1, y2 = 2
Here, y is a function of x, because if we know that x = 1, then we also know that y = 2. However, x is not a function of y, as we don't know what value x has given that y = 2.
I hope that made it more clear. Here, "function" simply means that every time we started with the same x, we also got the same y.
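In code, that in-sample notion of "y is a function of x" is just a one-pass dictionary check:

```python
def is_function_of(xs, ys):
    """True if, within this sample, equal x values always carry equal y values."""
    seen = {}
    for x, y in zip(xs, ys):
        if x in seen and seen[x] != y:
            return False                # same x, two different y values
        seen[x] = y
    return True

# the example above: y is a function of x, but x is not a function of y
assert is_function_of([1, 1.1], [2, 2])      # same y for distinct x is fine
assert not is_function_of([2, 2], [1, 1.1])  # y = 2 maps to two x values
```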
Either Chatterjee has made a mistake, that claim is wrong, or that definition has some extremely precise meaning, because I assume you could get data like that in the wild by sampling a sine wave as a time series.
I think the abstract is a bit strong, it probably means "converges to" for a large repeating sample.
2 replies →
Seems that by this logic, if you don't have any repeats in your x values, then you are bound to conclude that any set of y values is a function of x.
In Theorem 1.1, f is a function of random variables, which might be where you're confused.
> doesn't that mean y is a predictable function of x
Sort of: as a function of real numbers, sha256 is just some deterministic function. But the point is that its output "looks like" a uniform random variable for any reasonable input distribution, i.e. as a function of random variables, the input and output should have 0 correlation.
This works for a hash function, as a hash function does not introduce uncertainty, i.e. Y = f(X), not Y = f(X) + random_noise. I just tested it:
https://ibb.co/nCXVSqB
Perhaps they mean continuous or differentiable functions (as evidenced by the desire to put an order on the elements)
edit: was wrong about sha256
sha256() doesn't involve state or decision making.
That's right, the paper has practical estimates of convergence only for continuous functions (kind of). For hash functions this coefficient is useless.
There is also another paper comparing the power of Chatterjee coefficient with other traditional ones.
https://arxiv.org/abs/2008.11619
It is interesting to note that the Chatterjee paper makes a point of mentioning Pearson, Spearman, Kendall's tau, whereas the ones focused on in this paper all appear as citations but aren't explicitly discussed.
To be fair, Pearson, Spearman and Kendall's tau are the coefficients people use in practice. Had I been in the author's position I would have done the same: cite all the interesting developments but compare with what people actually use.
Comparing with something people barely know would be nice, but people have a limited attention span, so I would push that to the annexes at best and focus on the more important parts.
1 reply →
Best line of the paper:
"The formula is so simple that it is likely that there are many such coefficients, some of them possibly having better properties than the one presented below."
suggesting that the author is seeking to highlight a basic principle, not tune for a specific application.
I share some notes on reading/explaining the basic definition here: https://win-vector.com/2021/12/26/how-to-read-sourav-chatter...
Is there a way to extend this to measure a multivariate relationship?
For example: Cor(X, Y & Z)
I know you could run them pairwise but it’s possible Cor(X, Y) and Cor(X, Z) are close to zero but Cor(X, Y & Z) is close to 1.
Seems easy enough to play around with, and need not be strictly numbers either, as long as rank is defined on your fields (they are sortable...total ordering? I'm rusty on my terminology). Basically sort X, take the variance of the rank of Y, Z, etc. Much in the same way you would compute multi variable correlation.
This is a good question. When I read the article I'll keep it in mind!
Dumb question, What is the significance of this?
From a signal processing perspective: being able to recognise signals in the presence of interference, noise and distortion.
For example, you might have a radio signal (such as WiFi) that you want to receive. First step is that you have to pick that signal out of whatever signal comes out of your radio receiver: which will be the WiFi signal along with all sorts of noise and interference from other users. Typically the search will be done with the mentioned "Pearson's Correlation", using it to compare the received signal with an expected template: a value of 1.0 meaning the received signal is a perfect match with the template, a value of 0.0 meaning no match at all. If the wanted signal is present, interference, noise and distortion will reduce the result of the correlation to less than 1.0, meaning you might miss the WiFi signal, even though it is present.
This article is about coming up with a measure that gives a more robust result in the face of noise, interference and distortion. It's fundamental stuff, in that it has quite general application.
(Yay signal processing!)
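A toy version of that template search (hypothetical chirp signal, not actual WiFi): slide the template across a noisy signal and take the normalized (Pearson-style) correlation at each offset.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.arange(64)
template = np.sin(np.pi * (t / 16.0) ** 2)   # a chirp-like reference waveform
signal = rng.normal(scale=0.3, size=1024)    # background noise
offset = 300
signal[offset:offset + 64] += template       # bury the template in the noise

def sliding_pearson(signal, template):
    """Pearson correlation of the template against every window of the signal."""
    n = len(template)
    tz = (template - template.mean()) / template.std()
    out = np.empty(len(signal) - n + 1)
    for i in range(len(out)):
        w = signal[i:i + n]
        out[i] = float(np.dot((w - w.mean()) / w.std(), tz)) / n
    return out

corr = sliding_pearson(signal, template)
found = int(np.argmax(corr))                 # peak marks where the template sits
```

Noise pulls the peak value below 1.0, which is the degradation the comment describes.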
Skimming it now, this looks wild. Using the variance of the rank of the dataset (for a given point, how many are less than that point) seems... weird, and throwing out some information. The author seems legit tho, so I can't wait to try drop-in implementing this in a few things.
1 reply →
Correlation typically means y is a linear function of x, but people usually interpret it (incorrectly) as: knowing x tells you something about y. If y = x², then y is determined completely by x, but since the relationship is nonlinear, the correlation may actually be zero depending on the distribution of x. This paper proposes a statistic that will indicate whether y is related to any function of x, linear or nonlinear.
This is... quite wrong? The dictionary says:
The most sympathetic to your definition is Wikipedia:
And that's the mathematical formulation. Correlation also has a meaning in everyday speech, and mathematics doesn't have the authority to just adopt terms and then claim people are wrong after they've changed the meaning.
Also correlation very definitely means that knowing <x> tells you something about <y>. And vice versa. Like, for example: its value. Or at least a better idea of it than pure guessing without correlation.
6 replies →
I don't think there's a standard enough mathematical definition of correlation to say that. Perhaps the word has been co-opted, but the linked paper suggests that the co-option isn't accepted.
Well, in the abstract it says: "[a coefficient] which is 0 if and only if the variables are independent and 1 if and only if one is a measurable function of the other". The former property does not hold for ordinary correlation with general random variables (though it does for Gaussians, which is one part of the reason they are used everywhere). I'm not sure about the latter property, actually, but I also doubt it's true.
Worth noting the author is a highly regarded professor at Stanford.
It's fast to calculate, simple to understand, and doesn't make assumptions about the underlying distributions. This makes it a more effective generic tool for practitioners. Perhaps useful in the way the Pearson correlation is useful.
I'd like to learn more about the small sample properties. Proofs of asymptotics are necessary but less interesting. But the author's examples on example data sets look like it makes sense.
This is a bit of nonsense. If you have n pairs (x_i, y_i) with all x_i's and y_i's different, then you can always say that y is a function of x: you simply take a very wiggly function that passes through all the pairs. So this new "coefficient of correlation" should always return 100% in such cases. Of course, such a coefficient would be useless, and fortunately, this one doesn't do that. The point, however, is that in order for such a coefficient to not be trivially 100%, you need extra specifications; more precisely, you need to specify some type of desired smoothness (or tension) in the class of functions you are looking for.
Let's make this more precise, and use R-squared (which this coefficient appears to be in some shape or form, since it is between 0 and 1, not -1 and 1). You look for relationships of the type y_i = f(x_i) + epsilon_i, where f is in some class of functions. Then this coefficient would be the ratio of the variance of f(x_i) and the variance of y_i. If you allow the class of functions to be arbitrary, there's no residual epsilon. So you need to specify somehow that class. In the case of the Pearson correlation, that class is the class of linear functions. If you want to generalize the Pearson correlation, you need to think of more general ways to specify the class of regression functions.
This is incredibly cool. Thanks HN for bringing things like this to the front page, which I would have never otherwise heard about.
Why would someone use Pearson or Spearman coefficients vs this new one?
If the relationship is linear, I'm guessing a test based on Pearson has greater power.
Seems likely. The presented coefficient only looks at the ordering of the X-values, and how it relates to the ordering of the Y-values. All other information is thrown away. That's how it can be so general, but it should come at the expense of power.
1 reply →
power and interpretability.
Assuming a linear relationship, if you know the correlation coefficient, you can predict unobserved values of y based on a known x with good accuracy.
y = ax + b + error
where strong correlation means error is small.
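A quick sketch of that prediction step: the least-squares slope is just r scaled by the ratio of standard deviations (hypothetical data, true relationship y = 2x + 1).

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)  # y = ax + b + error

r = float(np.corrcoef(x, y)[0, 1])
a = r * y.std() / x.std()          # slope recovered from r and the std devs
b = y.mean() - a * x.mean()        # intercept from the means
y_hat = a * 5.0 + b                # predict y at an unobserved x = 5
```

With a strong correlation (small error term), y_hat lands close to the true value 11.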
Pearson is for affine/linear+intercept relationships. Spearman is for monotonic relationships.
could this better represent semantic textual similarity? https://paperswithcode.com/sota/semantic-textual-similarity-...
Can information theoretic measures like Granger causality[0] be used for this purpose?
[0] https://en.wikipedia.org/wiki/Granger_causality
Granger causality is about linear inter-temporal dependence. Chatterjee correlation is about non-linear contemporaneous (or time-independent) dependence. They're tools for quite different applications.
Looks like you could do the opposite: use this measure as a proxy for correlation between input data X and Y, and see if increased correlation between X and Y is superior to just predicting Y without X. Maybe model the correlation value going over a threshold as a sequence of Bernoulli trials.
How does this relate to Maximal Information Coefficient (MIC), MGC, PPS and the like?
The paper discusses MIC in particular at least. They suggest that MIC sometimes overstates the strength of a noisy relationship, giving the example of a bivariate normal mixture (Figure 9). In that example MIC is 1, but the new measure is 0.48 or something, which seems more reasonable.
My understanding is that one of the major critiques of statistics, especially its use in psychology, has been the use of models which are derived from the mean.
There are inherent flaws/assumptions to this approach which Peter Molenaar has done extensive work to critique (See Todd Rose's book on the subject). For anyone who understands the technique presented in this paper, does it also depend on the mean as a model like when calculating Pearson's r?
Isn't Molenaar looking at networks of symptoms over time? Yes, in any multi-variate time series, when one is searching for relationships between the variables (and allowing that you may have multiple sets of time series which may have some type of grouping, i.e. observations from a set of people with one diagnosis vs observations from a set of people with a contrary or with no diagnosis), then yes, any attempt to find co-relations in the multivariate signal need to account for the underlying statistical distribution of the signal components. The normal distribution isn't a bad first a priori approximation, but you really need to check.
side note- it also isn't clear that you can group by diagnosis, see, for example, https://pubmed.ncbi.nlm.nih.gov/29154565/, which shows that even within diagnostic groups there is substantial individual variation.
This is an order-based algorithm, so it is more related to the median than the mean.
Another very useful consequence of being order-based, is that this new coefficient is much more robust to noise/outliers than the canonical correlation coefficient.
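A small demonstration of that robustness, using Spearman-style rank correlation as a stand-in for the order-based idea (hypothetical data): one wild outlier wrecks Pearson but barely moves the rank-based value, because the outlier's magnitude only affects its rank, not the whole sum.

```python
import numpy as np

def rank(a):
    """0-based ranks, assuming no ties."""
    r = np.empty(len(a))
    r[np.argsort(a)] = np.arange(len(a))
    return r

rng = np.random.default_rng(5)
x = rng.normal(size=100)
y = x + rng.normal(scale=0.2, size=100)
y_bad = y.copy()
y_bad[0] = 1e6                      # one wild outlier

pearson_clean = float(np.corrcoef(x, y)[0, 1])
pearson_bad = float(np.corrcoef(x, y_bad)[0, 1])     # collapses
rank_bad = float(np.corrcoef(rank(x), rank(y_bad))[0, 1])  # barely moves
```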
I think there are far bigger problems with the lack of theoretical foundations and abuse of p-values rather than with ergodicity or whatever is the pet peeve of Peter Molenaar.
Correlation never means causation, but it often suggests it, and the two are often directly entangled. Like leaded gasoline being tied to city crime: that correlation was likely directly related to causation. That's not true of all correlations, but cause-effect typically produces a correlation with something, even if many other correlations that appear meaningful are meaningless.
Am I reading the paper's definition correctly? If yes, why do the vertical line and the horizontal line in 2D have such different correlations?
You're using the formula for the case where there are no ties, but the vertical line has all X values tied and the horizontal line has all Y values tied. Also, the case where Y is a constant is explicitly excluded from consideration in the paper.
I think there could be a lot of use of distance based correlations https://towardsdatascience.com/introducing-distance-correlat...
Is there code?
Yes, the author has shared the link to R package here:
https://cran.r-project.org/web/packages/XICOR/index.html
Edit: R code from Dr. Chatterjee's Stanford page is here - https://souravchatterjee.su.domains//xi.R
If you have never worked with R, the code seems clunky so I suggest checking out Python implementation on Github here:
https://github.com/czbiohub/xicor
The Python library is not from the original author though. But it's easy to read the code and it works with pandas as well.
If anyone is interested, I've also published a Go implementation [1] of the code for float64 slices.
Results seem to exactly match the R and Python implementation, so there will be a second pass focusing on performance, stability and support for categorical variables.
[1] https://github.com/tpaschalis/xicor-go
The current version of the Python lib seems to be extremely badly written code. Or is the algorithm itself that bad? It takes something like 21s to compute the correlation for just 10k samples.
2 replies →
Thanks, the Python code is very clear and simple and makes it super easy to understand the idea without having to digest the paper.
The equation is on the second page, and if you know enough to know what correlation is, you know enough to implement it from the equation given. It takes O(n log n) to run, though, if implemented naively (because you have to sort your data).
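For the no-ties case it really is a one-screen job; a sketch in numpy (my code, not the author's):

```python
import numpy as np

def xi(x, y):
    """Chatterjee's xi for the no-ties case:
    xi = 1 - 3 * sum(|r[i+1] - r[i]|) / (n^2 - 1),
    where r[i] is the rank of y[i] after sorting the pairs by x.
    Assumes distinct values; the paper handles ties separately."""
    x, y = np.asarray(x), np.asarray(y)
    n = len(x)
    y_by_x = y[np.argsort(x, kind="stable")]      # reorder y by increasing x
    r = np.empty(n)
    r[np.argsort(y_by_x)] = np.arange(1, n + 1)   # ranks of the reordered y
    return 1 - 3 * float(np.sum(np.abs(np.diff(r)))) / (n ** 2 - 1)
```

Even when y is a perfect monotone function of x, xi only approaches 1 as n grows: for perfectly sorted ranks it equals 1 − 3/(n+1).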
non-pdf link to the article via arxiv vanity: https://www.arxiv-vanity.com/papers/1909.10140/
I implemented this in Python and it seems to work really well! Thanks for this link!
Pretty cool that it can also work for e.g. textual data in principle, as someone mentioned.
I mean, if you choose a bullshit ordering relationship, sure. But that's (essentially) the only assumption.
Ah, I missed the reference to "the humanities" the first time... seems like you have an ideological axe to grind.
What are you talking about?
Correlation is absolutely useful for analysis of non-physical based systems.
"analysis"
There will be no AI fire alarm
One of the aspects of math papers that I dislike is how unapproachable they are if you’re unfamiliar with some of the terminology and conventions. The esoteric symbols don’t make it any easier to Google their definitions either.
For instance, what does this mean?
> μ is the law of Y
μ is usually the mean or average. Is "law" something else?
The law of Y means the distribution of Y. See the last phrase of this [0] Math Stack Exchange answer.
[0] https://math.stackexchange.com/a/1397467
Great thanks! I’ll remember to use math.stackexchange.com as a resource in the future.
Do you think a random production code snippet is approachable for someone lacking the necessary background?
The notation in this paper is totally standard.
EDIT: \mu there refers to a probability measure. Nothing to do with averages.
Yeah, working code snippets would be great. That would provide an unambiguous implementation that someone could use to dig into the underlying functions used and learn the basic concepts that would be tedious for the author to go through.
In terms of the notation, it seemed like the author actually tried to keep his paper accessible, so my complaint isn’t with the author. My gripe is more with math notation in general.
In my opinion, unless you’ve read the appropriate textbooks or taken the right classes, math notation is hard to learn. The symbols are hard to Google for. Integral symbols, R for real numbers, sigma, delta, the round E that stands for IN are not found on a standard keyboard so it’s challenging for a layman to Google and learn that notation on their own. Math evolved over millennia and the notation wasn’t constructed with SEO in mind, so I understand why things are the way they are, but it’s a stumbling block for the uninitiated trying to learn more advanced math. Maybe there are resources like math.stackexchange out there that I’m unaware of that would help make learning notation more approachable.