Comment by xpe
2 years ago
This story is a playful example of confounding:
> In causal inference, a confounder (also confounding variable, confounding factor, extraneous determinant or lurking variable) is a variable that influences both the dependent variable and independent variable, causing a spurious association. Confounding is a causal concept, and as such, cannot be described in terms of correlations or associations. The existence of confounders is an important quantitative explanation why correlation does not imply causation. Some notations are explicitly designed to identify the existence, possible existence, or non-existence of confounders in causal relationships between elements of a system.
>
> Confounds are threats to internal validity.
https://en.wikipedia.org/wiki/Confounding
Here is a sketch of a statistical model with a confounder (a variable affecting both the dependent and independent variables):
S = f(H, I, T)
`S`: car starting or not (dependent variable)
`H`: how hot is the car engine (independent variable)
`I`: ice cream type chosen (independent variable)
`T`: time taken to buy the ice cream (a confounder)
Explanation: `T` influences `S` through `H`: a shorter time in the store leaves the engine hotter, and a hot engine is prone to vapor lock. `T` also influences `I` (the type of ice cream chosen) because the placement of vanilla ice cream allows for a quicker purchase. Voilà, now we have a spurious association between `I` and `S`.
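The model above can be sketched as a small simulation. All the specifics here (the quick/slow threshold, the probabilities) are made up purely for illustration; the only point is that the code never uses the flavor when deciding whether the car starts, yet a flavor/no-start association appears anyway:

```python
import random

random.seed(0)

# counts[flavor] = [no-start count, trip count]
counts = {"vanilla": [0, 0], "chocolate": [0, 0]}

for _ in range(10_000):
    # T: minutes spent in the store (the confounder)
    t = random.uniform(1, 10)
    quick = t < 4
    # T -> I: vanilla sits near the register, so quick trips usually grab it
    flavor = "vanilla" if random.random() < (0.9 if quick else 0.2) else "chocolate"
    # T -> H -> S: a quick trip leaves the engine hot, and a hot engine
    # sometimes vapor-locks. Note that `flavor` appears nowhere below.
    engine_hot = random.random() < (0.8 if quick else 0.1)
    no_start = engine_hot and random.random() < 0.5

    counts[flavor][1] += 1
    counts[flavor][0] += no_start

rate = {f: failures / trips for f, (failures, trips) in counts.items()}
print(rate)  # vanilla shows a markedly higher no-start rate than chocolate
```

Even though `S` depends only on `H`, the vanilla trips fail to start far more often, because `T` drives both the flavor choice and the engine temperature.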
This is an interesting explanation, but wouldn't `I` influence `T` rather than the other way around? Since the type of ice cream determines the amount of time taken in the store.
My comment did two things (but they were somewhat muddled). It: (a) laid out a particular model; and (b) offered explanations, or claims, of causality. But unfortunately it said nothing about (c) experimental design.
I'll start with (c). Attempting to talk about a model in isolation from its experimental design can be misleading, as it ignores the context that gives the model its interpretive power and validity. In this case, a good experimental design must include a sufficiently diverse sample of people to account for variation.
Regarding (b), depending on the person, the influence could flow either way between `I` and `T`, to varying degrees.
- Example of `I->T`: One person might come into the store strongly preferring one type of ice cream (`I`) and be willing to take time to look for it (`T`).
- Example of `T->I`: Another person might come into the store in a hurry and be motivated to procure the closest ice cream flavor.
Regarding (a), no model is 'true' but some are better than others for particular purposes.
- To the extent that prediction is the key goal, confounders usually don't matter much: a model can exploit a spurious correlation and still predict well.
- But to the extent that _statistical inference_ is the key goal, there are many techniques for teasing apart influence, such as randomization, stratification, and regression adjustment.
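One of the simplest such techniques is stratification: condition on the confounder and compare within each stratum. A minimal sketch, reusing the toy `S = f(H, I, T)` setup with made-up probabilities (quick/slow threshold and all rates are assumptions for illustration):

```python
import random

random.seed(0)

# strata[quick][flavor] = [no-start count, trip count]
strata = {True: {"vanilla": [0, 0], "chocolate": [0, 0]},
          False: {"vanilla": [0, 0], "chocolate": [0, 0]}}

for _ in range(200_000):
    t = random.uniform(1, 10)          # T, the confounder
    quick = t < 4
    flavor = "vanilla" if random.random() < (0.9 if quick else 0.2) else "chocolate"
    engine_hot = random.random() < (0.8 if quick else 0.1)
    no_start = engine_hot and random.random() < 0.5

    strata[quick][flavor][1] += 1
    strata[quick][flavor][0] += no_start

for quick, table in strata.items():
    rates = {f: failures / trips for f, (failures, trips) in table.items()}
    print("quick trip:" if quick else "slow trip:", rates)
```

Within each stratum (quick trips vs. slow trips), the vanilla and chocolate no-start rates are nearly identical: once you hold `T` fixed, the spurious `I`-`S` association vanishes.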
Unfortunately, too often in machine learning contexts, the word "inference" refers to the process of using a trained model for _prediction_. Yikes. This contrasts sharply with the term's use in statistics. The field of statistics got this one right, even as ML techniques have taken off spectacularly.