
Comment by zaptheimpaler

4 years ago

How is it possible to make a general coefficient of correlation that works for any non-linear relationship? Say y = sha256(x): doesn't that mean y is a predictable function of x, but it's statistically impossible to tell from looking at inputs/outputs alone?

Two things: 1. The result is asymptotic, i.e. it holds as the number of samples approaches infinity.

2. The result is an "almost surely" result, i.e. in the collection of all possible infinite samples, the set of samples for which it fails has measure 0. In non-technical terms this means that it works for typical random samples and may not work for handpicked counterexamples.

In our particular case let f = sha256. Then X must be discrete, i.e. a natural number. Now the particulars depend on the distribution of X, but the general idea is that since we have discrete values, the probability that we get an infinite sample where the values tend to infinity is 0. So in a typical sample there's going to be an infinitude of x ties, and furthermore most x values aren't too large (in a way you can make precise), so the tie factors l_i dominate since there just aren't that many distinct values encountered in total. And so the coefficient tends to 1.
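To make the tie argument concrete, here is a rough Python sketch of the ties-aware form of the statistic as described in the paper (xi = 1 - n*sum|r_{i+1} - r_i| / (2*sum l_i*(n - l_i)), with ties in x broken at random). The function name and the particular "scramble" standing in for a hash are illustrative, not from the paper's code:

```python
import random

def xi_with_ties(xs, ys, rng=random.Random(0)):
    """Ties-aware xi: 1 - n*sum|r_{i+1}-r_i| / (2*sum l_i*(n-l_i))."""
    n = len(xs)
    # Sort by x, breaking ties uniformly at random.
    order = sorted(range(n), key=lambda i: (xs[i], rng.random()))
    y = [ys[i] for i in order]
    r = [sum(yj <= yi for yj in y) for yi in y]  # r_i = #{j : y_j <= y_i}
    l = [sum(yj >= yi for yj in y) for yi in y]  # l_i = #{j : y_j >= y_i}
    num = n * sum(abs(r[i + 1] - r[i]) for i in range(n - 1))
    den = 2 * sum(li * (n - li) for li in l)
    return 1 - num / den

# Discrete X on a small support; y is a deterministic "scramble" of x
# (a stand-in for a hash restricted to a finite set).
rng = random.Random(1)
xs = [rng.randrange(10) for _ in range(1000)]
ys = [(7 * x + 3) % 10 for x in xs]
print(xi_with_ties(xs, ys))  # close to 1: the x ties expose the functional relation
```

Within each block of tied x values, y is constant, so the rank differences vanish; only the handful of block boundaries contribute, and the coefficient ends up near 1.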

No. If you have a good hash function, that means it's computationally infeasible to determine anything about x based only on y. It's not statistically impossible at all; "statistically" doesn't concern itself with computational difficulties.

This is similar to how, e.g., we generally assume that AES is unbreakable from a computational point of view, but if you want a statistically unbreakable cipher, your only (IINM) option is a one-time pad.

The summary says that it works if it is a measurable function [0], which is structure preserving. So sha256 would scramble the input too much for it to be detected here.

[0] https://en.m.wikipedia.org/wiki/Measurable_function

  • I've not yet read the article, just the abstract. But the abstract is pretty precise, and being "measurable" is a very weak constraint. For (computationally) complex functions like a hash, my first guess is the "escape clause" is in the number of samples needed for the statistic to converge to its expected value.

Simple: if in the whole sample, the same x always comes with the same y, then y is a function of x.

Example:

x1 = 1, y1 = 2

x2 = 1.1, y2 = 2

Here, y is a function of x, because if we know that x = 1, then we also know that y = 2. However, x is not a function of y, as we don't know what value x has given that y = 2.

I hope that made it more clear. Here, "function" simply means that every time we started with the same x, we also got the same y.
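The "same x, same y" criterion can be checked mechanically on a finite sample. A tiny illustrative Python sketch (the function name is made up for this example):

```python
def looks_like_function(pairs):
    """True if no x value appears with two different y values."""
    seen = {}
    for x, y in pairs:
        if x in seen and seen[x] != y:
            return False
        seen[x] = y
    return True

# The example above: y is a function of x...
print(looks_like_function([(1, 2), (1.1, 2)]))  # True
# ...but swapping the roles shows x is not a function of y:
print(looks_like_function([(2, 1), (2, 1.1)]))  # False
```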

  • Either Chatterjee has made a mistake, that claim is wrong, or that definition has some extremely precise meaning, because:

        > counterexample = tibble(x = 1:6, y = c(1.1,-1.2,1.3,-1.4,1.5,-1.6))
        > counterexample
        # A tibble: 6 × 2
              x     y
          <int> <dbl>
        1     1   1.1
        2     2  -1.2
        3     3   1.3
        4     4  -1.4
        5     5   1.5
        6     6  -1.6
        > XICOR::calculateXI(xvec = counterexample$x, yvec=counterexample$y)
        [1] -0.2857143
        > -0.2857143 == 1
        [1] FALSE
    

    & I assume you could get data like that in the wild sampling a sine wave as a time series.

    I think the abstract is a bit strong, it probably means "converges to" for a large repeating sample.
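For what it's worth, the value in that R session can be reproduced from scratch with the no-ties form of the statistic, xi = 1 - 3*sum|r_{i+1} - r_i| / (n^2 - 1), where r_i is the rank of y_i after sorting by x. A hedged Python sketch (the function name is illustrative):

```python
def xi_coefficient(xs, ys):
    """No-ties xi: 1 - 3*sum|r_{i+1}-r_i| / (n^2 - 1), ranks taken in x-order."""
    n = len(xs)
    ys_by_x = [y for _, y in sorted(zip(xs, ys))]
    ranks = [sum(yj <= yi for yj in ys_by_x) for yi in ys_by_x]
    return 1 - 3 * sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1)) / (n * n - 1)

x = [1, 2, 3, 4, 5, 6]
y = [1.1, -1.2, 1.3, -1.4, 1.5, -1.6]
print(xi_coefficient(x, y))  # -0.2857142857... i.e. -2/7, matching the R output
```

The alternating y values make every rank difference as large as possible, which is exactly what drives the statistic negative.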

    • There is no mistake in the definition and this is all elaborated upon in page 4 of the article.

      Quote: " On the other hand, it is not very hard to prove that the minimum possible value of ξn(X, Y ) is −1/2 + O(1/n), and the minimum is attained when the top n/2 values of Yi are placed alternately with the bottom n/2 values. This seems to be paradoxical, since Theorem 1.1 says that the limiting value is in [0, 1]. The resolution is that Theorem 1.1 only applies to i.i.d. samples. Therefore a large negative value of ξn has only one possible interpretation: the data does not resemble an i.i.d. sample."

      You propose Y = f(X) = (-1)^(X+1)*(1+X/10) as your functional relation, which is measurable if X is discrete, and indeed if we let x_n = n, then the limiting value of the estimator will be -1/2, not 1.

      However, this is just a particular value of the estimator on a particular sample of (x, y). The theorem is an "almost surely" statement, which means that it fails only on a set of samples with probability 0.

      Indeed, if we actually picked a random sample of (X, f(X)) with your f, then independent of the distribution of X, since X is discrete, we would expect to see many ties (infinitely many ties as the number of samples goes to infinity). This would mean that |r_{i+1}-r_i| is 0 for all but finitely many i, and so the estimator would be 1.

      This also covers the case of f being a hash function, as mentioned before. Worse yet, it only has finitely many distinct values, so once again infinitely many ties.

      The way the estimator is defined, it will take care of the (X, f(X)) case fairly easily as for a typical sample you will get x values that cluster and for a measurable function this implies that the resulting values will be close together and so the rank differences will be small.

      This discussion probably wasn't included in the abstract since it's fairly simple measure theory with which most experts reading the article will be intimately familiar.

    • As an alternative to cirpus's reply, and noting that in roenxi's example roenxi is using the R package supplied by the paper's author (suggesting a real interest in understanding), I again refer to page 4 of the article.

      "it is not very hard to prove that the minimum possible value of [the proposed measure] is −1/2 + O(1/n), and the minimum is attained when the top n/2 values of Yi are placed alternately with the bottom n/2 values. ...Theorem 1.1 only applies to i.i.d. samples. Therefore a large negative value of [the measure] has only one possible interpretation: the data does not resemble an i.i.d. sample."

      roenxi, I congratulate your example, it shows that you are working to understand the measure. "where does it break" is always a good question.
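The -1/2 + O(1/n) minimum quoted above can also be checked numerically by extending roenxi's alternating pattern to a larger n, using a from-scratch sketch of the no-ties formula xi = 1 - 3*sum|r_{i+1} - r_i| / (n^2 - 1) (names here are illustrative):

```python
def xi_coefficient(xs, ys):
    """No-ties xi: 1 - 3*sum|r_{i+1}-r_i| / (n^2 - 1), ranks taken in x-order."""
    n = len(xs)
    ys_by_x = [y for _, y in sorted(zip(xs, ys))]
    ranks = [sum(yj <= yi for yj in ys_by_x) for yi in ys_by_x]
    return 1 - 3 * sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1)) / (n * n - 1)

n = 1000
x = list(range(1, n + 1))
y = [(1 + i / 10) * (-1) ** (i + 1) for i in x]  # 1.1, -1.2, 1.3, -1.4, ...
print(xi_coefficient(x, y))  # about -0.4985, near the -1/2 + O(1/n) minimum
```

The alternating construction interleaves the top and bottom halves of y, so the rank differences run through 1, 2, ..., n-1 and the statistic lands at 1 - 3n/(2(n+1)), approaching -1/2.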

  • Seems that by this logic, if you don’t have any repeats in your x values, then you are bound to conclude that any set of y values is a function of x.

In Theorem 1.1, f is a function of random variables, which might be where you're confused.

> doesn't that mean y is a predictable function of x

Sort of: as a function of real numbers, sha256 is just some deterministic function. But the point is that its output "looks like" a uniform random variable for any reasonable input distribution, i.e. as a function of random variables, the input and output should have 0 correlation.

Perhaps they mean continuous or differentiable functions (evident by the desire to put an order on the elements)

edit: was wrong about sha256
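The "looks like a uniform random variable" point can be sanity-checked empirically: hash consecutive integers with sha256 and feed the results to a from-scratch, no-ties sketch of the coefficient (xi = 1 - 3*sum|r_{i+1} - r_i| / (n^2 - 1)); it lands near 0 at this sample size. The function name is illustrative:

```python
import hashlib

def xi_coefficient(xs, ys):
    """No-ties xi: 1 - 3*sum|r_{i+1}-r_i| / (n^2 - 1), ranks taken in x-order."""
    n = len(xs)
    ys_by_x = [y for _, y in sorted(zip(xs, ys))]
    ranks = [sum(yj <= yi for yj in ys_by_x) for yi in ys_by_x]
    return 1 - 3 * sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1)) / (n * n - 1)

x = list(range(500))
y = [int.from_bytes(hashlib.sha256(str(i).encode()).digest(), "big") for i in x]
print(xi_coefficient(x, y))  # near 0: no dependence is visible at this sample size
```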

  • That's right, the paper has practical estimates of convergence only for continuous functions (kind of). For hash functions this coefficient is useless.