Comment by simsla

1 day ago

This relates to one of my biggest pet peeves.

People interpret "statistically significant" to mean "notable"/"meaningful". I detected a difference, and statistics say that it matters. That's the wrong way to think about things.

Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".

Whether the measured difference is significant in the sense of "meaningful" is a value judgement that we / stakeholders should impose on top of that, usually based on the magnitude of the measured difference, not the statistical significance.

It sounds obvious, but this is one of the most common fallacies I observe in industry and a lot of science.

For example: "This intervention causes an uplift in [metric] with p<0.001. High statistical significance! The uplift: 0.000001%." Meaningful? Probably not.

You're spot on that significant ≠ meaningful effect. But I'd push back slightly on the example. A very low p-value doesn't always imply a meaningful effect, but it's not independent of effect size either. A p-value comes from a test statistic that's basically:

(effect size) / (noise / sqrt(n))

Note that bigger test statistic means smaller p-value.

So very low p-values usually come from bigger effects or from very large sample sizes (n). That's why you can technically get p<0.001 with a microscopic effect, but only if you have astronomical sample sizes. In most empirical studies, though, p<0.001 does suggest the effect is going to be large because there are practical limits on the sample size.
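
If it helps to make that trade-off concrete, here's a quick sketch of my own (made-up numbers, assuming a simple two-sided z-test with known noise) that just plugs values into that formula:

```python
# Rough illustration (not from the comment above): the same microscopic effect at different n.
import numpy as np
from scipy import stats

effect, noise = 0.0001, 1.0                      # tiny true effect, unit noise

for n in [1_000, 1_000_000, 10_000_000_000]:
    z = effect / (noise / np.sqrt(n))            # the test statistic described above
    p = 2 * stats.norm.sf(abs(z))                # two-sided p-value
    print(f"n = {n:>14,}   z = {z:8.3f}   p = {p:.3g}")

# Only the astronomical n pushes p below 0.001, even though the effect never changes.
```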

  • The challenge is that datasets are just much bigger now. These tools grew up in a world where n=2000 was considered pretty solid. I do a lot of work with social science types, and that's still a decent sized survey.

    I'm regularly working with datasets in the hundreds of thousands to millions, and that's small fry compared with what's out there.

    The use of regression, for me at least, is not getting that p-gotcha for a paper, but as a posh pivot table that accounts for all the variables at once.

    • There’s a common misconception that high throughput methods = large n.

      For example, I’ve encountered the belief that simply recording something at ultra-high temporal resolution gives you “millions of datapoints”, which then (seemingly) transforms what the statistics and hypothesis testing can tell you.

      In reality, the replicability of the entire setup, the day it was performed, the person doing it, etc. means the n for the day is probably closer to 1. So to ensure replicability you’d have to at least do it on separate days, with separately prepared samples. Otherwise, how can you eliminate the chance that your ultra finicky sample just happened to vibe with that day’s temperature and humidity?

      But statistics courses don’t really teach you what exactly “n” means, probably because a hundred years ago it was much more literal: 100 samples meant you had counted 100 mice, 100 peas, or 100 surveys.


  • Depending on the nature of the study, there's lots of scientific disciplines where it's trivial to get populations in the millions. I got to see a fresh new student's poster where they had a p-value in the range of 10^-146 because every cell in their experiment was counted as its own sample.
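
    To make that concrete (and the day-to-day replicability point above), here's a rough simulation of my own, not the poster's data: measurements that share a per-day offset, compared naively as if every cell were independent.

    ```python
    # Hypothetical pseudoreplication sketch: cells measured on a handful of days,
    # with a shared day-level offset. All numbers are invented for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    n_days, cells_per_day = 3, 100_000
    day_offsets = rng.normal(0, 1.0, n_days)                 # the real, day-level noise
    cells = np.concatenate(
        [off + rng.normal(0, 0.5, cells_per_day) for off in day_offsets]
    )

    naive_se = cells.std(ddof=1) / np.sqrt(cells.size)       # "every cell is a sample"
    day_means = cells.reshape(n_days, cells_per_day).mean(axis=1)
    day_se = day_means.std(ddof=1) / np.sqrt(n_days)         # effective n ≈ number of days

    print(f"naive SE (n = {cells.size:,}):   {naive_se:.4f}")
    print(f"day-level SE (n = {n_days}):       {day_se:.4f}")
    # The naive SE is tiny, which is how p-values like 10^-146 happen; the day-level
    # SE reflects an effective n closer to 3 than to 300,000.
    ```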

https://pmc.ncbi.nlm.nih.gov/articles/PMC3444174/

> Using Effect Size—or Why the P Value Is Not Enough

> Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude –not just, does a treatment affect people, but how much does it affect them.

– Gene V. Glass
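
In that spirit, a small illustrative example of my own (not from the paper) of reporting a magnitude measure such as Cohen's d alongside the p-value:

```python
# Hypothetical example: a "highly significant" result whose magnitude is negligible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treatment = rng.normal(0.05, 1.0, 50_000)                # tiny true uplift (0.05 SD)
control = rng.normal(0.00, 1.0, 50_000)

t_stat, p = stats.ttest_ind(treatment, control)
pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
d = (treatment.mean() - control.mean()) / pooled_sd      # Cohen's d, in SD units

print(f"p = {p:.1e}, Cohen's d = {d:.3f}")
# p lands far below 0.05, yet d ≈ 0.05 is conventionally a negligible effect.
```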

Agreed. However, I think you're being overly charitable in calling it a "pet peeve"; it's more like a pathological misunderstanding of stats, one that leads to a lot of bad outcomes, especially in popular wellness media.

As an example, read just about any health or nutrition research article referenced in popular media and there's very often a pretty weak effect size even though they've achieved "statistical significance." People then end up making big changes to their lifestyles and habits based on research that really does not justify those changes.


And if we increase N enough we will be able to find these 'good measurements' and 'statistically significant differences' everywhere.

Worse still if we didn't agree in advance which hypotheses we were testing, and instead go looking back through historical data for 'statistically significant' correlations.
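
That second failure mode is easy to demonstrate with a toy simulation (mine, purely illustrative): scan enough metrics that contain no real effect and some will come out "significant" anyway.

```python
# Hypothetical data-dredging sketch: 100 metrics of pure noise, tested after the fact.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_metrics, n = 100, 10_000
noise = rng.normal(size=(n_metrics, n))                  # no real effect anywhere

pvals = stats.ttest_1samp(noise, 0, axis=1).pvalue
print((pvals < 0.05).sum(), "of 100 null metrics come out 'significant' at the 5% level")
# Roughly 5 "discoveries" by chance alone; trawl enough historical data and you
# are guaranteed to find some.
```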

  • Which means that statistical significance is really a measure of whether N is big enough

    • This has been known ever since the beginning of frequentist hypothesis testing. Fisher warned us not to place too much emphasis on the p-value he asked us to calculate, specifically because it is mainly a measure of sample size, not clinical significance.


    • It's not; that would be quite the misunderstanding of statistical power.

      N being big means that small real effects can plausibly be detected as being statistically significant.

      It doesn't mean that a larger proportion of measurements are falsely identified as being statistically significant. That will still occur at a 5% frequency or whatever your alpha value is, unless your null is misspecified.

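      A toy simulation of my own (numbers invented, just to illustrate the distinction): under a true null the false-positive rate stays near α no matter how large n gets, while the power to detect a tiny real effect is what climbs with n.

      ```python
      # Hypothetical sketch: alpha stays put as n grows; power is what increases.
      import numpy as np
      from scipy import stats

      rng = np.random.default_rng(3)
      sims = 500

      for n in [100, 2_500, 40_000]:
          nulls = rng.normal(0.00, 1.0, (sims, n))     # experiments with no real effect
          tiny = rng.normal(0.02, 1.0, (sims, n))      # experiments with a 0.02 SD effect
          fp_rate = (stats.ttest_1samp(nulls, 0, axis=1).pvalue < 0.05).mean()
          power = (stats.ttest_1samp(tiny, 0, axis=1).pvalue < 0.05).mean()
          print(f"n = {n:>6,}   false-positive rate ≈ {fp_rate:.2f}   power ≈ {power:.2f}")
      ```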

I really like this video [1] from 3blue1brown, where he proposes thinking about significance as a way to update a probability. One positive test (or, in this analogy, one study) only updates the probability by so much, so you nearly always need more tests (or studies) before you can reach a 'meaningful' judgment.

[1] https://www.youtube.com/watch?v=lG4VkPoG3ko
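
A back-of-the-envelope version of that framing, with assumed numbers (a 10% prior, 80% power, α = 0.05; none of these figures come from the video itself):

```python
# Hypothetical Bayesian-update sketch: one "significant" study only moves the
# probability that an effect is real part of the way; replication moves it further.
def update(prior, power, alpha):
    """Posterior P(effect is real) after one statistically significant result."""
    evidence = prior * power + (1 - prior) * alpha
    return prior * power / evidence

prior = 0.10                        # assume 1 in 10 tested hypotheses is actually true
power, alpha = 0.8, 0.05            # assumed study power and significance threshold

after_one = update(prior, power, alpha)
after_two = update(after_one, power, alpha)       # an independent replication
print(f"after one study: {after_one:.2f}, after a replication: {after_two:.2f}")
# ~0.64 after one significant result, ~0.97 after a replication, under these assumptions.
```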

To add nuance: it's not that bad. Given reasonable levels of statistical power, experiments are unlikely to flag meaningless effect sizes as statistically significant. Of course, some people design experiments at power levels way beyond what's useful, and this is perhaps even more common where big data is available (like website analytics), but I would argue the problem is the unreasonable power level, rather than a problem with statistical significance itself.

When wielded correctly, statistical significance is a useful guide both to what's a real signal worth further investigation, and it filters out meaningless effect sizes.
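
One concrete way to wield it correctly is to size the experiment around the smallest effect you would actually act on. A sketch using statsmodels' power calculator, where the 0.2 SD minimum meaningful effect and 80% power target are my assumptions, not universal rules:

```python
# Hypothetical power calculation: choose n for the smallest effect worth acting on,
# so that "significant" and "meaningful" roughly coincide.
from statsmodels.stats.power import TTestIndPower

min_meaningful_effect = 0.2            # smallest effect (in SD units) you'd act on
n_per_arm = TTestIndPower().solve_power(
    effect_size=min_meaningful_effect, alpha=0.05, power=0.8
)
print(f"~{n_per_arm:.0f} participants per arm")    # on the order of 400 per arm
# Running millions of samples instead of ~400 is exactly how negligible effects
# end up "statistically significant".
```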

A bigger problem, even when statistical significance is used correctly, is publication bias. If, out of 100 experiments, we only get to see the 7 that came out significant, and α = 0.05 means roughly 5 of the 100 would clear that bar by chance alone, then the results we see already carry a false:true ratio of about 5:2, even though all 7 are presented as true.

> Significance testing only tells you the probability that the measured difference is a "good measurement". With a certain degree of confidence, you can say "the difference exists as measured".

Significance does not tell you this. The p-value can be arbitrarily close to 0 while the probability of the null hypothesis being true is simultaneously arbitrarily close to 1.

  • Right. The meaning of a p-value is: in a world where there is no effect, what is the probability of getting a result at least as extreme as the one you got, purely by random chance? It doesn’t directly tell you anything about whether this is such a world or not.

This is sort of the basis of econometrics, as well as a driving thought behind causal inference.

Econometrics cares not only about statistical significance but also about economic significance, i.e. whether an effect is large enough to be useful in practice.

Causal inference builds on base statistics and ML, but its strength lies in how it uses design and assumptions to isolate causality. Tools like sensitivity analysis, robustness checks, and falsification tests help assess whether the causal story holds up. My one beef is that these tools still lean heavily on the assumption that the underlying theoretical model is correctly specified. In other words, causal inference helps stress-test assumptions, but it doesn’t always provide a clear way to judge whether one theoretical framework is more valid than another!
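
As one example of the falsification-test idea, here's a rough placebo-style check of my own (the data and column names are invented): shuffle the treatment labels and re-fit; if the "effect" survives on scrambled data, the design is picking up something other than the treatment.

```python
# Hypothetical placebo/falsification check with invented data and column names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(4)
n = 5_000
df = pd.DataFrame({
    "age": rng.normal(40, 10, n),
    "treated": rng.integers(0, 2, n),
})
df["outcome"] = 2.0 * df["treated"] + 0.1 * df["age"] + rng.normal(0, 1, n)

real = smf.ols("outcome ~ treated + age", data=df).fit()

df["placebo_treated"] = rng.permutation(df["treated"].to_numpy())
placebo = smf.ols("outcome ~ placebo_treated + age", data=df).fit()

print(f"real treatment coefficient:    {real.params['treated']:.2f}")
print(f"placebo treatment coefficient: {placebo.params['placebo_treated']:.2f}  (should be ~0)")
```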

I’d say rather that “statistical significance” is a measure of surprise. It’s saying “If this default (the null hypothesis) is true, how surprised would I be to make these observations?”

  • Maybe you can think of it as saying "should I be surprised" but certainly not "how surprised should I be". The magnitude of the p-value is a function of sample size. It is not an odds ratio for updating your beliefs.

For all the shit that HN gives to MBAs, one thing they instill in you during the Managerial Stats class is that Stat Sig is not the same as Managerial Sig.