When the correlation is close to 0, it's often because of a feedback loop.
For example, in an economy with a central bank trying to hit an inflation target, interest rates and inflation will have near-zero correlation (interest rates change but inflation stays roughly constant). That's because the central bank adjusts interest rates to counter other variables so that inflation remains near the target.
Another example (my favorite; it was mind-blowing when my teacher showed it to us in econometrics as a warning :) ): the gas pedal and the speed of a car driving on a hilly road. The driver wants to stay near the speed limit, so he adjusts the gas pedal to keep the speed constant. The simplistic conclusion would be: the speed is constant despite the gas pedal position changing, therefore they are unrelated :)
That's a good point. Another one I forgot to make: given the established empirical reality of 'everything is correlated', if you find a variable which does in fact seem to be independent of most or everything else, that alone makes that variable suspicious - it suggests that it may be a pseudo-variable, composed largely or entirely of measurement error/randomness, or perhaps afflicted by a severe selection bias or other problem (such as range restriction or Berkson's paradox eliminating the real correlation).
Somewhat similarly, because 'everything is heritable', if you run into a human trait which is not heritable at all and is precisely estimated at h^2 ~ 0, that casts considerable doubt on whether you have a real trait at all. (I've seen this happen to a few latent variables extracted by factor analysis: they have near-zero heritability in a twin study and, on further investigation, turn out to have been just sampling error or bad factor analysis in the first place, and don't replicate or predict anything or satisfy any of the criteria you might use to decide if a trait is 'real'.)
That's very interesting. In the car driving example we can define three variables: 1) Throttle 2) Speed 3) Elevation derivative
If "3" is constant (ex: flat terrain) then "1" and "2" will have strong correlation. However if "2" is constant (ex: cruise control) as in your example, "1" and "3" will have strong correlation.
In the economic example, however, this kind of analisys should be much more complex and take plenty of variables into account.
The key point is identifying those variables and ensuring they remain constant (e.g., in that example: tire pressure, elevation, fuel load, etc.).
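To put rough numbers on the throttle/speed/gradient point, here is a minimal Python sketch (all coefficients and noise levels are made up, and each road segment is treated as an independent observation rather than simulating the full feedback dynamics):

    # Toy "physics": speed = 5 + 30*throttle - 200*grade + noise (made-up numbers).
    import numpy as np

    rng = np.random.default_rng(42)
    n = 10_000
    noise = rng.normal(0, 0.5, n)            # wind, measurement error, etc.

    # Case 1: flat terrain, driver varies the throttle freely.
    throttle = rng.uniform(0.4, 0.8, n)
    speed = 5 + 30 * throttle + noise
    print(np.corrcoef(throttle, speed)[0, 1])     # ~0.99: throttle and speed strongly correlated

    # Case 2: hilly road, driver picks the throttle needed to hold ~25 m/s.
    grade = rng.normal(0, 0.02, n)                # road gradient on each segment
    driver_error = rng.normal(0, 0.01, n)         # imperfect compensation
    throttle = (25 - 5 + 200 * grade) / 30 + driver_error
    speed = 5 + 30 * throttle - 200 * grade + noise
    print(np.corrcoef(throttle, speed)[0, 1])     # close to 0: speed barely varies with throttle
    print(np.corrcoef(throttle, grade)[0, 1])     # ~1: throttle now tracks the terrain

The contrast is the whole point: the causal link from throttle to speed is identical in both cases; only the driver's behaviour changes which correlations show up.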
> Another example (my favorite; it was mind-blowing when my teacher showed it to us in econometrics as a warning :) ): the gas pedal and the speed of a car driving on a hilly road. The driver wants to stay near the speed limit, so he adjusts the gas pedal to keep the speed constant. The simplistic conclusion would be: the speed is constant despite the gas pedal position changing, therefore they are unrelated :)
I think that's Milton Friedman's Thermostat in case you want to search for it.
Good discussion. On the flip side, in my data mining class the professor keeps saying ~"you may be able to find clusters in a data set, but often no true correlation exists." However, that's an absolute statement I just don't swallow. In my mind, if an unexplained correlation or non-correlation appears, it may be random (or true), or it could be the result of an unmeasured (hidden) variable. In your two examples, you're simply pointing out two respective hidden variables that weren't accounted for in the original analysis.
I think any data analysis should always be caveated with the understanding that there may be hidden variables shrouding or perhaps enhancing correlations - from economics to quantum mechanics. It's up to the reviewer of the results to determine, subjectively or by using a standard measure, whether the level of rigor involved in data collection & analysis sufficiently models reality.
Perhaps they are trying to explain the clustering illusion: the phenomenon that even random data will produce clusters. You can take that further and state that random data WILL produce clusters; if you don't have clusters, then your data is not random and some pattern is at play.
This really trips up our minds, since we try to find patterns everywhere. If you ask people to plot random dots by hand, they will usually place the dots without clusters; a truly random plot will have clusters.
https://en.wikipedia.org/wiki/Clustering_illusion
Edit: Note that your professor said "often", which means they did not make an absolute statement.
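To make the "random data will produce clusters" point concrete, here is a small sketch (arbitrary numbers): 100 uniformly random points on a 10x10 grid average one point per cell, yet the cell counts come out very lumpy.

    # Drop points uniformly at random and look at how uneven the cell counts are.
    import numpy as np

    rng = np.random.default_rng(1)
    points = rng.uniform(0, 10, size=(100, 2))    # 100 random points, 1 per cell on average
    cells = np.floor(points).astype(int)
    counts = np.zeros((10, 10), dtype=int)
    np.add.at(counts, (cells[:, 0], cells[:, 1]), 1)

    print("empty cells: ", int((counts == 0).sum()))   # typically ~35-40 of the 100 cells
    print("busiest cell:", int(counts.max()))          # typically 4 or 5 points piled together

If you asked people to place 100 dots "randomly" by hand, most would produce far fewer empty cells and far less pile-up than this.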
The ability of adults to drink milk and fluency in English are well correlated. This is because people of northern European ancestry are more likely to be able to digest milk, and it happens that most people of northern European ancestry either immigrated to an English-speaking country (the US) 150 years ago or live in a country where English instruction is good.
It's probably in the same vein as the classic Tukey quote: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." A "need" for an answer can easily motivate people to mangle the data in order to find it even if it doesn't exist.
I like the gas pedal example. I read a similar one somewhere: we measure the temperature inside a house and the energy usage of the heater. The energy usage is correlated with the outside temperature, but the inside temperature stays constant, so we conclude that the inside temperature is unrelated to the heater and turn the heater off.
I really like that example, but I am wondering whether it would really hold. Real drivers would not maintain a perfect speed, but would instead work to maintain the average. If you looked closely at the speed, it would drift away from the average, and then the pedal would move to return it to the average. So it would look a bit like an integral (the I in PID control) of the difference from the mean speed, right?
Yup, that's how you know I only had this at university and never used it in real life :) I think in real life you might see the feedback loop in motion, or not, depending on the resolution and sampling.
Well, we're obviously talking about perfectly spherical drivers (good point though).
Pressing the brake is positively correlated with the car going faster. Downhill.
Good thing correlation is not an indicator of causation.
It is true that, as Fisher points out, with enough samples you are almost guaranteed to reject the null hypothesis. That's why we tell students to consider both p values (which you could think of as a form of quality control on the dataset) and variance explained. Loftus and Loftus make the point nicely: p tells you if you have enough samples and any effect to consider, variance explained tells you if it's worth pursuing. Both are useful guides to a thoughtful analysis. In addition, I'd make a case for thinking about the scientific significance and importance of the hypothesis and the Bayesian prior. And to put a positive spin on this, given how easy it is to get small p values, big ones are pretty much a red flag to stop the analysis and go and do something more productive instead.
> "It is true that, as Fisher points out, with enough samples you are almost guaranteed to reject the null hypothesis. "
Where does Fisher point this out?
> "That's why we tell students to consider both p values (which you could think of as a form of quality control on the dataset)"
How is this "quality control"? It just tells you whether your sample size was large enough to pass an arbitrary threshold...
> Where does Fisher point this out?
Probably in the Fisher excerpt.
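To put rough numbers on the p-value versus variance-explained point above, here is a quick sketch (made-up data: a true correlation of 0.02 at n = 1,000,000):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n = 1_000_000
    x = rng.normal(size=n)
    y = 0.02 * x + np.sqrt(1 - 0.02**2) * rng.normal(size=n)   # true correlation 0.02

    r, p = stats.pearsonr(x, y)
    print(f"p = {p:.3g}")          # astronomically small: 'significant' by any threshold
    print(f"r^2 = {r**2:.5f}")     # ~0.0004: explains about 0.04% of the variance

The p-value says the dataset is big enough to see something; the variance explained says the something is negligible.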
Agree that NHST using a simple point null hypothesis of the form
H0: μ = 0
doesn't provide much value. H0 is never true, and the conclusion of "rejecting H0" based on a p-value is therefore not terribly profound. Also, the "rejecting H0" conclusion doesn't really tell you anything about the alternative hypothesis HA (which isn't even considered when computing the p-value, since the p-value is computed under H0). Dichotomies in general are bad, but NHST with a point H0 is useless!
However, a composite hypothesis setup of the form
H0: μ ≤ 0
HA: μ > 0
is probabilistically sound (insofar as some journal requires you to report a p-value). It's much better to report an effect size estimate and/or CI.
That still gives 50-50 odds of rejection with a sufficient sample size, which is not much of a test of the research hypothesis (since many alternatives will predict the same direction). It is better than a 100% chance of rejection, though.
Couldn't you make an argument that the point H0 has a use when you are testing whether two populations are identical? I.e., it's probably true that μ is very close to 0 if it is the difference in heights between men from Nebraska and men from Iowa.
You've kind of hit the point with the second half of your comment. Two populations are virtually never identical, so you don't need any statistics to answer the question. A more reasonable question is whether or not you have the statistical power (i.e. measurement precision) to see the difference, and whether the difference is big enough to matter.
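As a sketch of the "report an effect size and/or CI" suggestion, using the (entirely fabricated) Nebraska-vs-Iowa height example from this sub-thread:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)
    n = 200_000
    nebraska = rng.normal(179.0, 7.0, n)   # heights in cm, made-up parameters
    iowa     = rng.normal(179.1, 7.0, n)   # true difference of 0.1 cm

    diff = iowa.mean() - nebraska.mean()
    se = np.sqrt(iowa.var(ddof=1) / n + nebraska.var(ddof=1) / n)
    ci = (diff - 1.96 * se, diff + 1.96 * se)
    t, p = stats.ttest_ind(iowa, nebraska, equal_var=False)

    print(f"difference = {diff:.3f} cm, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
    print(f"p = {p:.3g}")   # the point null is (almost surely) rejected here,
                            # but the CI shows the effect is on the order of a millimetre

The p-value only says the populations differ; the effect size and CI say by how much and how precisely, which is what you actually need to decide whether the difference matters.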
This reminds me of the current omnigenic hypothesis in genetics: that, unexpectedly, almost every gene seems to affect the expression of traits.
https://www.quantamagazine.org/omnigenic-model-suggests-that...
"Drawing on GWAS analyses of three diseases, they concluded that in the cell types that are relevant to a disease, it appears that not 15, not 100, but essentially all genes contribute to the condition. The authors suggested that for some traits, “multiple” loci could mean more than 100,000."
That is just a special case of the "everything is correlated" principle.
I think a major issue here is that, perhaps, there is a tendency to want to use statistics to decide what the 'truth' is, because it takes the onus of responsibility for making a mistake away from the interpreter. It's nice to be able to stand behind a p-value and not be accountable for whatever argument is being made. But the issue is that almost any argument can be made in a large enough dataset, and a careful analyst will find significance.
This is of course the case only if one does not venture far from the principal assumptions of frequentism, most of which are routinely violated in almost every setting except pure random number generation and fundamental quantum physics.
So a central issue that isn't addressed in STATS101-level hypothesis testing is the impact that the question has on the result. It's almost inevitable that people want to interpret a failure to reject as a positive result. But a p-value really doesn't tell you whether a result is useful; rather, it tells you whether your sample size is big enough to detect a difference.
Statistical significance is something that can be calculated. Practical significance is something that needs to be interpreted.
I think this article is trying to tie two things together: the p-value problem and the fact that you can always throw in more data.
I disagree.
It's cheating, it goes against experimental design analysis, and it does not differentiate between given data and data that was carefully collected. We have experimental design classes for a reason: they help us to be honest. Of course there are tons of pitfalls a novice statistician can fall into.
It also implicitly leads people to think that statistics can magically handle given data and big data the old-fashioned way. If you do that, then of course you'll get a good p-value.
> It's cheating, it goes against experimental design analysis, and it does not differentiate between given data and data that was carefully collected. We have experimental design classes for a reason: they help us to be honest. Of course there are tons of pitfalls a novice statistician can fall into.
Explicit sequential testing runs into exactly the same problem. The problem is, the null hypothesis is not true. So no matter whether you use fixed (large) sample sizes or adaptive procedures which can terminate early while still preserving (the irrelevant) nominal false-positive error rates, you will at some sample size reject the null as your power approaches 100%.
This is mostly right, but you are still thinking of these rejections as "false positives" for some reason. They are real deviations from the null hypothesis ("true positives"). The problem is that the user didn't test the null model they wanted; it is 100% user error.
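A small sketch of the power point above, using the normal approximation and made-up numbers (true mean 0.01 standard deviations away from the point null, two-sided α = 0.05): the rejection rate climbs toward 100% no matter how trivial the true effect is.

    from scipy import stats

    mu, sigma, alpha = 0.01, 1.0, 0.05
    z_crit = stats.norm.ppf(1 - alpha / 2)

    for n in [100, 10_000, 100_000, 1_000_000]:
        ncp = mu / sigma * n**0.5                 # noncentrality of the test statistic
        power = stats.norm.sf(z_crit - ncp) + stats.norm.cdf(-z_crit - ncp)
        print(f"n = {n:>9,}: power ≈ {power:.3f}")   # roughly 0.05, 0.17, 0.89, 1.00

Fixed-n and sequential designs both hit 100% power eventually; the only question is how much data it takes.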
>"The fact that these variables are all typically linear or additive further implies that interactions between variables will be typically rare or small or both (implying that most such hits will be false positives, as interactions are far harder to detect than main effects)."
Where does this "fact" come from? And if everything is correlated with everything else all these effects are true positives...
Also, another ridiculous aspect of this is that when data becomes cheap the researchers just make the threshold stricter so it doesn't become too easy. They are (collectively) choosing what is "significant" or not and then acting like "significant" = real and "non-significant" = 0.
Finally, I didn't read through the whole thing. Does he claim to have found an exception to this rule at any point?
> Finally, I didn't read through the whole thing. Does he claim to have found an exception to this rule at any point?
Oakes 1975 points out that explicit randomized experiments, which test a useless intervention such as school reform, can be exceptions. (Oakes might not be quite right here, since surely even useless interventions have some non-zero effect, if only by wasting peoples' time & effort, but you might say that the 'crud factor' is vastly smaller in randomized experiments than in correlational data, which is a point worth noting.)
Thanks,
How about this "fact": The fact that these variables are all typically linear or additive?
Is this trying to be too clever? If the correlation is weaker than the random noise of the data, then it is equivalent to not being correlated.
Otherwise, we'd get conclusions like the color of your car influencing your risk of lung cancer or some such nonsense. With enough data, you could see a weak correlation between red cars and cancer, but it would still be insignificant. That's what the null hypothesis is for: to put a threshold under which we can just ignore whatever weak correlation seems to be there.
Question: Are these correlations typically transitive? That is to say, in addition to everything having a nonzero correlation with everything else, does it typically happen that the sign of the correlation between A and C is equal to the product of the signs of the correlations between A and B and between B and C?
Thorndike's dictum would suggest that this is so, at least in that particular domain. What about more generally?
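This doesn't answer the "typically" question, but as a quick sketch (toy model: B = A + C, with A and C mildly negatively correlated), sign-transitivity is not guaranteed in general, so if it holds in practice it would be for empirical rather than mathematical reasons:

    import numpy as np

    rng = np.random.default_rng(3)
    n = 100_000
    cov = [[1.0, -0.3], [-0.3, 1.0]]
    a, c = rng.multivariate_normal([0, 0], cov, size=n).T
    b = a + c

    r_ab = np.corrcoef(a, b)[0, 1]   # ~ +0.59
    r_bc = np.corrcoef(b, c)[0, 1]   # ~ +0.59
    r_ac = np.corrcoef(a, c)[0, 1]   # ~ -0.30, breaking sign-transitivity
    print(r_ab, r_bc, r_ac)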
Like background radiation, we have an "absolute background" correlation value... a value we might test against, e.g. |±0.02321|.
Or we could drop the null
REJECT THE NULL HYPOTHESIS !!! :-)
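One way to operationalize that, sketched below as a minimum-effect test via the Fisher z transform (the threshold is just the value mentioned above, used purely for illustration, and min_effect_pvalue is a hypothetical helper, not a library function):

    import numpy as np
    from scipy import stats

    def min_effect_pvalue(r, n, rho0=0.02321):
        """Approximate one-sided p-value for H0: |rho| <= rho0 vs HA: |rho| > rho0."""
        z = (np.arctanh(abs(r)) - np.arctanh(rho0)) * np.sqrt(n - 3)
        return stats.norm.sf(z)

    # With n = 1,000,000, r = 0.02 easily rejects rho = 0 but not the background threshold.
    r, n = 0.02, 1_000_000
    print(min_effect_pvalue(r, n, rho0=0.0))   # ~0: trivially rejects the zero null
    print(min_effect_pvalue(r, n))             # ~1: cannot beat the crud threshold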
It's well known that the number of Nicolas Cage movies is correlated with a wide variety of natural phenomena.
Sample means and true means are different things.
You're being downvoted because you missed the point repeatedly made in the intro and many of the excerpts that this is in fact a claim about the 'true means'.
Cause, like, when you start learning about systems, everything is correlated, everything is connected, everything is linked, and you have to point it all out to everyone all the time.