Comment by pmayrgundter

4 years ago

I didn't get the idea of ranking. But it's simple:

"In statistics, ranking is the data transformation in which numerical or ordinal values are replaced by their rank when the data are sorted. For example, the numerical data 3.4, 5.1, 2.6, 7.3 are observed, the ranks of these data items would be 2, 3, 1 and 4 respectively. For example, the ordinal data hot, cold, warm would be replaced by 3, 1, 2."

https://en.m.wikipedia.org/wiki/Ranking

Also learned that a Spearman coeff is just the Pearson coefficient taken on the rank of the data, instead of on the raw data.
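To make the "Pearson on ranks" point concrete, here is a minimal pure-Python sketch. The helper names (`ranks`, `pearson`, `spearman`) are made up for illustration, and it assumes there are no tied values; real implementations usually average tied ranks.

```python
from math import sqrt

def ranks(values):
    """Return the rank of each value (1 = smallest). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def pearson(xs, ys):
    """Pearson correlation: co-deviation from the means, normalized."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def spearman(xs, ys):
    """Spearman correlation: Pearson applied to the ranks, not the raw data."""
    return pearson(ranks(xs), ranks(ys))

# The Wikipedia example from the quote above.
print(ranks([3.4, 5.1, 2.6, 7.3]))  # [2, 3, 1, 4]
```

For a monotone but nonlinear relationship such as y = x³, `spearman` comes out at 1 (up to rounding) while `pearson` on the raw data stays below 1, because the ranks of x and y are identical.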

But whereas Pearson/Spearman takes the normalized sum of products of deviations from the mean (Σ(x-x̄)(y-ȳ)/(n·σxσy), where x̄ is the mean and σ the std. dev.), Chatterjee takes the sum of absolute differences between consecutive ranks (ξ = 1 - 3Σ|rᵢ₊₁-rᵢ|/(n²-1)), using just the ranks of the Y data after the X,Y pairs have been sorted by X.
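A rough sketch of that recipe in pure Python, assuming no ties in X or Y (the general formula handles ties differently). The function name `chatterjee_xi` is just an illustrative choice, not a library API.

```python
def chatterjee_xi(xs, ys):
    """Chatterjee's xi for the no-ties case.

    Sort the pairs by x, rank the y values in that order, then sum the
    absolute differences between consecutive ranks.
    """
    n = len(xs)
    # Y values reordered so that the corresponding X values are ascending.
    ys_by_x = [y for _, y in sorted(zip(xs, ys))]
    # Rank of each reordered y (1 = smallest).
    order = sorted(range(n), key=lambda i: ys_by_x[i])
    r = [0] * n
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    total = sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    return 1 - 3 * total / (n * n - 1)

xs = list(range(1, 11))
print(chatterjee_xi(xs, [x ** 2 for x in xs]))          # monotone in x
print(chatterjee_xi(xs, [(x - 5.2) ** 2 for x in xs]))  # U-shaped in x
```

When Y is a monotone function of X, every consecutive rank difference is 1, so the sum is n-1 and ξ = 1 - 3/(n+1), which approaches 1 as n grows. The U-shaped case still gets a clearly positive value, whereas the ranks-based Spearman coefficient would be close to 0 there.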

But I'm still missing the intuition for why the sum of rank differences is so useful, or where the magic numbers come from.

2 comments


cornel_io 4 years ago

jmount has an explanation elsewhere in this thread linking to https://win-vector.com/2021/12/26/how-to-read-sourav-chatter... which does a great job of explaining the intuition. In a nutshell, the normalization factor of 3 comes from the fact that if you select two points uniformly at random between 0 and 1, the mean distance between them will be 1/3 (which is pretty easy to write down and solve; it boils down to the fact that a pyramid of height 1 that's 1x1 at the base has volume = 1/3).
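A quick numerical check of the 1/3 claim, plus the discrete version for ranks, as a sketch. The function names are hypothetical. The continuous mean is approximated with a midpoint grid, and the rank version is computed exactly over all pairs of distinct ranks.

```python
from fractions import Fraction

def mean_abs_gap_uniform(grid=400):
    """Midpoint-grid approximation of E|U - V| for U, V uniform on [0, 1]."""
    pts = [(i + 0.5) / grid for i in range(grid)]
    total = sum(abs(u - v) for u in pts for v in pts)
    return total / grid ** 2

def mean_rank_gap(n):
    """Exact mean of |a - b| over all ordered pairs of distinct ranks in 1..n."""
    total = sum(
        abs(a - b)
        for a in range(1, n + 1)
        for b in range(1, n + 1)
        if a != b
    )
    return Fraction(total, n * (n - 1))

print(mean_abs_gap_uniform())  # close to 1/3
print(mean_rank_gap(10))       # 11/3, i.e. (n + 1) / 3
```

If the mean gap between two distinct ranks is (n+1)/3, then when Y is unrelated to X the n-1 consecutive gaps sum to about (n-1)(n+1)/3 = (n²-1)/3 on average. Multiplying by 3/(n²-1) then gives roughly 1, so ξ lands near 0 for independent data, which would account for both the 3 and the n²-1.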

pmayrgundter 4 years ago

Thanks! That did it