Comment by mojomark

6 years ago

Good discussion. On the flip side, in my data mining class the professor keeps saying ~"you may be able to find clusters in a data set, but often no true correlation exists." However, that's an absolute statement I just don't swallow. The way I see it, if an unexplained correlation or non-correlation appears, it may be random (or true), or it could be the result of an unmeasured (hidden) variable. In your two examples, you're simply pointing out two respective hidden variables that weren't accounted for in the original analysis.

I think any data analysis should always be caveated with the understanding that there may be hidden variables shrouding or perhaps enhancing correlations - from economics to quantum mechanics. It's up to the reviewer of the results to determine, subjectively or by using a standard measure, whether the level of rigor involved in data collection & analysis sufficiently models reality.

Perhaps they are trying to explain the clustering illusion: the phenomenon that even random data will produce clusters. You can take that further and state that random data WILL produce clusters. If you don't have clusters, then your data is not random and some pattern is at play.

This really trips up our minds, since our minds try to find patterns everywhere. If you try to plot random dots by hand, you will usually place them without clusters. A truly random plot will have clusters.
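A quick way to see this for yourself is to scatter uniformly random points over a square, count how many land in each grid cell, and compare that with an evenly spaced layout. This is only a rough sketch: the grid size, point count, seed, and helper names are my own arbitrary choices, not anything from the linked article.

```python
import random

def cell_counts(points, grid):
    """Count how many (x, y) points in the unit square fall in each grid cell."""
    counts = [[0] * grid for _ in range(grid)]
    for x, y in points:
        i = min(int(x * grid), grid - 1)
        j = min(int(y * grid), grid - 1)
        counts[i][j] += 1
    return [c for row in counts for c in row]

def random_points(n, rng):
    """n points drawn independently and uniformly from the unit square."""
    return [(rng.random(), rng.random()) for _ in range(n)]

def even_points(side):
    """side * side points on a regular lattice (what people tend to draw by hand)."""
    return [((i + 0.5) / side, (j + 0.5) / side)
            for i in range(side) for j in range(side)]

def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)   # fixed seed (arbitrary) so the run is repeatable
grid = 10                # 10 x 10 = 100 cells (assumed for illustration)
n = 400                  # 4 points per cell on average

rand_counts = cell_counts(random_points(n, rng), grid)
even_counts = cell_counts(even_points(20), grid)

print("random: min/max per cell =", min(rand_counts), max(rand_counts),
      "variance =", round(variance(rand_counts), 2))
print("even:   min/max per cell =", min(even_counts), max(even_counts),
      "variance =", round(variance(even_counts), 2))
```

The evenly spaced layout puts exactly the same number of points in every cell, while the random one leaves some cells nearly empty and others crowded. Those crowded cells are the "clusters", and nothing but chance produced them.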

https://en.wikipedia.org/wiki/Clustering_illusion

Edit: Note your professor said "often", which means they did not make an absolute statement.

  • Ipso facto, all "natural" variables are related to a bounded random walk (a Markovian process), which produces clusters, or otherwise have complex chaotic (e.g. fractal) mechanics, which also produce clusters. This follows from physics.

    Maximum entropy as well as zero entropy is a very rare state to observe.

    • does this imply that the universe somehow rewards structures that engender 'compressibility' (coarse graining)? it does seem like our brains subjectively enjoy identifying it, to the point of over-optimization in the form of phenomena like pareidolia

  • >"The phenomenon that even random data will produce clusters."

    You don't really mean "random", you mean i.i.d. You can have a statistical model where what happens next is random but not independent of past values (e.g., the next step of a Markov chain).

The ability of adults to drink milk and fluency in English are well correlated. This is because people of northern European ancestry are more likely to be able to drink milk, and it happens that most people of northern European ancestry either immigrated to an English-speaking country (US) 150 years ago or live in a country where English instruction is good.
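To make the hidden-variable point concrete, here is a toy simulation of the milk/English example. Every probability in it is invented purely for illustration (I have no real figures), and the variable names are my own. Ancestry drives both traits, and within each ancestry group the two traits are generated independently.

```python
import random

def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(1)  # arbitrary fixed seed
people = []
for _ in range(20000):
    # Hidden variable: northern European ancestry (made-up 30% share).
    ancestry = rng.random() < 0.3
    # Both traits depend on ancestry only, never on each other
    # (all four probabilities below are hypothetical).
    milk = rng.random() < (0.9 if ancestry else 0.2)
    english = rng.random() < (0.8 if ancestry else 0.3)
    people.append((ancestry, int(milk), int(english)))

overall = pearson([m for _, m, _ in people], [e for _, _, e in people])

def within(group):
    """Correlation between milk and English inside one ancestry group."""
    rows = [(m, e) for a, m, e in people if a == group]
    return pearson([m for m, _ in rows], [e for _, e in rows])

with_ancestry = within(True)
without_ancestry = within(False)

print("overall correlation:     ", round(overall, 3))
print("within ancestry group:   ", round(with_ancestry, 3))
print("within everyone else:    ", round(without_ancestry, 3))
```

Across the whole population the two traits come out clearly correlated, but once you condition on ancestry the correlation all but disappears. That is what you would expect if the relationship runs entirely through the unmeasured variable.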

It's probably in the same vein as the classic quote by Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." A "need" for an answer can easily motivate people to mangle the data in order to find one, even if it doesn't exist.