This has been known ever since the beginning of frequentist hypothesis testing. Fisher warned us not to place too much emphasis on the p-value he asked us to calculate, specifically because it is mainly a measure of sample size, not clinical significance.
Yes, the whole thing has been a bit of a tragedy IMO. A minor tragedy, all things considered, but a tragedy nonetheless.
One interesting thing to keep in mind is that Ronald Fisher did most of his work before the publication of Kolmogorov's probability axioms (1933). There's a real sense in which the statistics used in social sciences diverged from mathematics before the rise of modern statistics.
So there's a lot of tradition going back to the 19th century that's misguided, wrong, or maybe just not best practice.
It's not; that would be quite the misunderstanding of statistical power.
N being big means that small real effects can plausibly be detected as being statistically significant.
It doesn't mean that a larger proportion of measurements are falsely identified as being statistically significant. That will still occur at a 5% frequency or whatever your alpha value is, unless your null is misspecified.
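A quick simulation makes the point. This is a minimal sketch, assuming a one-sample t-test of H0: mu = 0 on normal data; the true effect size 0.05 and the sample sizes are arbitrary illustrative choices, not anything from the comment above. Under a true null the rejection rate stays near alpha for every N, while the small real effect gets detected more and more often as N grows.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    alpha = 0.05
    n_sims = 5000

    def rejection_rate(n, true_mean):
        """Fraction of one-sample t-tests (H0: mu = 0) rejecting at level alpha."""
        rejections = 0
        for _ in range(n_sims):
            sample = rng.normal(loc=true_mean, scale=1.0, size=n)
            res = stats.ttest_1samp(sample, popmean=0.0)
            if res.pvalue < alpha:
                rejections += 1
        return rejections / n_sims

    for n in (20, 200, 2000):
        # Under the null (true mean 0) the false-positive rate stays near alpha
        # for every N; for a small real effect (true mean 0.05) power grows with N.
        print(f"N={n:5d}  "
              f"null rejection rate={rejection_rate(n, 0.0):.3f}  "
              f"power at true mean 0.05={rejection_rate(n, 0.05):.3f}")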
It's standard to set the null hypothesis to be a measure zero set (e.g. mu = 0 or mu1 = mu2). So the probability of the null hypothesis is 0 and the only question remaining is whether your measurement is good enough to detect that.
But even though you know a priori that the measurement can't be exactly 0.000 (to infinitely many decimal places), you don't know a priori whether your measurement is any good or whether you're measuring the right thing.
The probability is only zero a.s.; that's not the same as the null being impossible, and that's a very big difference. And hypothesis tests aren't estimating the probability of the null being true; they're estimating the probability of rejecting the null if the null were true.
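To make that last distinction concrete, here is a sketch in standard notation (T is a generic test statistic, t_obs its observed value; none of this is spelled out in the comments above):

    p = \Pr(T \ge t_{\mathrm{obs}} \mid H_0)
    \Pr(H_0 \mid \text{data}) = \frac{\Pr(\text{data} \mid H_0)\,\Pr(H_0)}{\Pr(\text{data})}

The first line is what the test reports; the second is what people often want it to mean, and computing it requires a prior Pr(H_0) that frequentist tests never specify.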