Comment by credit_guy

4 years ago

This is a bit of nonsense. If you have n pairs (x_i, y_i) with all x_i's and y_i's different, then you can always say that y is a function of x: you simply take a very wiggly function that passes through all the pairs. So this new "coefficient of correlation" should always return 100% in such cases. Of course, such a coefficient would be useless, and fortunately, this one doesn't do that. The point, however, is that for such a coefficient not to be trivially 100%, you need extra specifications: more precisely, you need to specify some type of desired smoothness (or tension) in the class of functions you are looking for.
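To illustrate the "wiggly function" point, here is a minimal sketch in plain Python. The helper names (lagrange_interpolant, r_squared) and the sample data are my own choices for illustration, not anything from the paper. It builds the polynomial that passes through n points whose y values are pure noise, and checks that the fit leaves no residual.

```python
import random

def lagrange_interpolant(xs, ys):
    """Return the polynomial of degree < n through the points (xs[i], ys[i]).

    Requires all xs to be distinct.
    """
    def f(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return f

def r_squared(ys, fitted):
    """1 - (residual sum of squares) / (total sum of squares)."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

random.seed(0)
n = 8
xs = [float(i) for i in range(n)]            # distinct x values
ys = [random.gauss(0, 1) for _ in range(n)]  # noise, generated independently of x

wiggly = lagrange_interpolant(xs, ys)
fitted = [wiggly(x) for x in xs]
r2_wiggly = r_squared(ys, fitted)
print(r2_wiggly)
```

The y values here have nothing to do with x, yet the unrestricted fit reproduces them at every sample point, which is exactly why an unrestricted class of functions tells you nothing.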

Let's make this more precise, and use R-squared (which this coefficient appears to be in some shape or form, since it is between 0 and 1, not -1 and 1). You look for relationships of the type y_i = f(x_i) + epsilon_i, where f is in some class of functions. Then this coefficient would be the ratio of the variance of f(x_i) to the variance of y_i. If you allow the class of functions to be arbitrary, there's no residual epsilon, and the ratio is always 1. So you need to specify that class somehow. In the case of the Pearson correlation, that class is the class of linear functions. If you want to generalize the Pearson correlation, you need to think of more general ways to specify the class of regression functions.
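As a rough sketch of the linear case (again plain Python, with function names and data that are my own assumptions): if the class is restricted to straight lines fitted by least squares, the variance ratio described above coincides with the square of the Pearson correlation. The data below is an exact parabola on a symmetric grid, so y is a perfect function of x, but the linear class sees none of it.

```python
import math

def mean(v):
    return sum(v) / len(v)

def variance(v):
    m = mean(v)
    return sum((a - m) ** 2 for a in v) / len(v)

def covariance(u, v):
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

def pearson(xs, ys):
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))

def linear_variance_ratio(xs, ys):
    """Var(f(x_i)) / Var(y_i), with f the least-squares line."""
    slope = covariance(xs, ys) / variance(xs)
    intercept = mean(ys) - slope * mean(xs)
    fitted = [intercept + slope * x for x in xs]
    return variance(fitted) / variance(ys)

xs = [float(x) for x in range(-3, 4)]  # symmetric grid: -3, ..., 3
ys = [x * x for x in xs]               # y is an exact function of x

r = pearson(xs, ys)
ratio_linear = linear_variance_ratio(xs, ys)
print(r, ratio_linear)  # both come out as 0 for this data
```

So the same data scores 0 with the linear class and would score 1 with an arbitrary class; everything interesting lives in how you restrict the class in between.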