Comment by mustaphah
21 hours ago
You're spot on that significant ≠ meaningful effect. But I'd push back slightly on the example. A very low p-value doesn't always imply a meaningful effect, but it's not independent of effect size either. A p-value comes from a test statistic that's basically:
(effect size) / (noise / sqrt(n))
Note that a bigger test statistic means a smaller p-value.
So very low p-values usually come from larger effects or from very large sample sizes (n). That's why you can technically get p<0.001 with a microscopic effect, but only with an astronomical sample size. In most empirical studies, though, p<0.001 does suggest a sizeable effect, because there are practical limits on sample size.
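To make that concrete, here's a quick simulation sketch (numbers made up, scipy's two-sample t-test standing in for whatever test you're actually running). The same microscopic 0.02 SD shift goes nowhere at a survey-sized n but sails well past p<0.001 once n hits the millions:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Same tiny effect (a 0.02 SD shift), two very different sample sizes.
    for n in (2_000, 5_000_000):
        control = rng.normal(0.00, 1.0, n)
        treated = rng.normal(0.02, 1.0, n)
        t, p = stats.ttest_ind(control, treated)
        print(f"n = {n:>9,}  t = {t:6.2f}  p = {p:.1e}")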
The challenge is that datasets are just much bigger now. These tools grew up in a world where n=2000 was considered pretty solid. I do a lot of work with social science types, and that's still a decent-sized survey.
I'm regularly working with datasets in the hundreds of thousands to millions, and that's small fry compared with what's out there.
For me, at least, the point of regression is not getting that p-value gotcha for a paper; it's having a posh pivot table that accounts for all the variables at once.
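To show what I mean (toy data I've invented here, nothing from a real project, using pandas and statsmodels): the raw group means are the ordinary pivot table, and the regression coefficients are the same comparison after accounting for the other columns in one go.

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Invented data: the outcome really depends on region ("south" effect),
    # age, and income, and the east region happens to skew older.
    rng = np.random.default_rng(1)
    n = 100_000
    region = rng.choice(["north", "south", "east"], n)
    age = rng.normal(45, 12, n) + 10 * (region == "east")
    income = rng.lognormal(10, 0.5, n)
    outcome = 0.5 * (region == "south") + 0.02 * age + 1e-5 * income + rng.normal(0, 1, n)
    df = pd.DataFrame({"region": region, "age": age, "income": income, "outcome": outcome})

    # Plain pivot table: raw means per region, confounded by the age mix of each region.
    print(df.groupby("region")["outcome"].mean())

    # The "posh pivot table": region differences with age and income accounted for at once.
    print(smf.ols("outcome ~ C(region) + age + income", data=df).fit().params)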
There's a common misconception that high-throughput methods = large n.
For example, I've encountered the belief that simply recording something at ultra-high temporal resolution gives you "millions of datapoints", which then (seemingly) breaks statistics and hypothesis testing in all sorts of ways.
In reality, the replicability of the entire setup, the day it was performed, the person doing it, etc., means the n for the day is probably closer to 1. So to ensure replicability you'd have to at least do it on separate days, with separately prepared samples. Otherwise, how can you rule out the chance that your ultra-finicky sample just happened to vibe with that day's temperature and humidity?
But they don't teach you in statistics what exactly "n" means, probably because a hundred years ago it was much more literal: 100 samples meant you counted 100 mice, 100 peas, or 100 surveys.
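A rough simulation of the n-is-closer-to-1 point (all numbers invented): if each day contributes its own offset, the spread of your final estimate is governed by how many days you ran, not how many points you logged.

    import numpy as np

    rng = np.random.default_rng(2)

    def run_study(n_days, points_per_day, day_sd=1.0, within_sd=0.05):
        # Each day gets its own offset (temperature, humidity, operator, ...),
        # plus tiny within-day noise on top.
        day_effects = rng.normal(0, day_sd, n_days)
        traces = day_effects[:, None] + rng.normal(0, within_sd, (n_days, points_per_day))
        return traces.mean()

    # Same 10,000 "datapoints" each time, split across different numbers of days.
    for n_days, per_day in [(1, 10_000), (5, 2_000), (25, 400)]:
        estimates = [run_study(n_days, per_day) for _ in range(1_000)]
        print(f"days = {n_days:>2}, points/day = {per_day:>6,}: spread of the estimate = {np.std(estimates):.2f}")

Same number of logged points every time, but the single-day run is the least trustworthy by far: its effective n is basically 1.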
I learned about experiment design in statistics, so I wouldn’t blame statisticians for this.
There are a lot of folks out there, though, who learned the mechanics of linear regression in a bootcamp or something without gaining an appreciation for the underlying theory, and those folks are looking for a low p-value; as long as they get one, it's good enough.
I saw this link yesterday and could barely believe it, but I guess these folks really live among us.
https://stats.stackexchange.com/questions/185507/what-happen...
This is an interesting point. I've been trying to think about something similar recently but don't have much of an idea how to proceed. I'm gathering periodic time series data and am wondering how to factor my sampling frequency into the statistical tests. I'm not sure how to assess the effect of sampling at 50 Hz versus 100 Hz on the outcome, given that my periods are significantly longer. Would you have an idea of how to proceed? The person I'm working with currently just bins everything into hour-long buckets and uses the means to compare the time series, but this seems flawed to me.
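To make my worry concrete, here's how I've been sketching it (pretending the signal is AR(1) with a ~100 s correlation time, which is purely my assumption): doubling the sampling rate doubles the rows, but barely changes the effective sample size.

    import numpy as np

    rng = np.random.default_rng(3)

    def ar1(n, rho):
        # Stand-in for a slow signal: stationary AR(1) with per-sample correlation rho.
        eps = rng.normal(0, np.sqrt(1 - rho**2), n)
        x = np.empty(n)
        x[0] = rng.normal()
        for i in range(1, n):
            x[i] = rho * x[i - 1] + eps[i]
        return x

    def effective_n(x):
        # Classic AR(1) correction: n_eff = n * (1 - rho1) / (1 + rho1), rho1 estimated from the data.
        rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]
        return len(x) * (1 - rho1) / (1 + rho1)

    tau = 100.0  # assumed correlation time of the signal, in seconds
    for hz in (50, 100):
        n = 3600 * hz                     # one hour of recording
        rho = np.exp(-1.0 / (tau * hz))   # per-sample correlation at this rate
        x = ar1(n, rho)
        print(f"{hz} Hz: {n:,} samples, effective n ≈ {effective_n(x):.0f}")

If that stand-in is anywhere near right, 50 Hz versus 100 Hz shouldn't matter much; what matters is how many correlation lengths (or independently repeated periods) the recording spans.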
Depending on the nature of the study, there are lots of scientific disciplines where it's trivial to get populations in the millions. I got to see a fresh new student's poster where they had a p-value in the range of 10^-146 because every cell in their experiment was counted as its own sample.
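For anyone who hasn't seen how badly that goes wrong, here's a toy version (invented numbers, scipy's t-test): there is no treatment effect at all, just ordinary dish-to-dish variation, yet counting every cell as its own sample produces an absurdly small p-value, while treating the dish as the unit correctly finds nothing.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)

    # Invented experiment with NO real treatment effect: 3 control dishes vs
    # 3 treated dishes, dish means differing only by prep-to-prep wobble,
    # and 50,000 cells measured per dish.
    cells_per_dish = 50_000
    control_dish_means = [9.7, 10.1, 10.5]
    treated_dish_means = [10.2, 10.6, 9.8]

    control_cells = np.concatenate([rng.normal(m, 1.0, cells_per_dish) for m in control_dish_means])
    treated_cells = np.concatenate([rng.normal(m, 1.0, cells_per_dish) for m in treated_dish_means])

    # Every cell treated as its own sample: n = 150,000 per arm, and the
    # dish-to-dish wobble shows up as a ridiculously small p-value.
    _, p = stats.ttest_ind(control_cells, treated_cells)
    print(f"cell as the unit: n = {len(control_cells):,} per arm, p = {p:.1e}")

    # Dish as the experimental unit: n = 3 per arm, and nothing looks special.
    _, p = stats.ttest_ind([d.mean() for d in np.split(control_cells, 3)],
                           [d.mean() for d in np.split(treated_cells, 3)])
    print(f"dish as the unit: n = 3 per arm, p = {p:.2f}")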