Posted: January 9, 2017, 10:55am
How do we still not understand P values?
Quite a bit of buzz was generated in the science twitter-verse this past week with the release of a new publication called “P values and the search for significance” by Naomi Altman and Martin Krzywinski.
A p-value, via Wikipedia, is “the probability that, using a given statistical model, the statistical summary (such as the sample mean difference between two compared groups) would be the same as or more extreme than the actual observed results.”
In general, we can’t test every single data point available in an experiment, so we use a sample group smaller than the total number of possible samples. We use this to predict the Presidential election, for example. We survey 100 people in order to predict the behavior of 300 million.
There are two ways of looking at the problem. The first (which we won’t discuss here) is determining the adequate size of a sample required for a “good” estimate, or an estimate with statistical significance.
The second way of looking at the problem is to ask how likely it is that a sample would show a result at least as extreme as the one we observed purely by chance, assuming there is no real effect in the larger population. That probability is a p-value.
The example given in the paper: “consider a study in which 10 physiological variables are measured in 100 individuals to determine whether any of the variables are predictive of systolic blood pressure (SBP). Suppose that none of the variables are actually predictive in the population and that they are all independent. If we use simple linear regression and focus on one of the variables as a predictor, a test of association will yield P < 0.05 in 5% of samples.”
That means we would incorrectly conclude that a single variable is predictive of SBP 5% of the time. We see something that isn’t there, purely because of the random draw of our sample (100 individuals) from the overall global pot of individuals in the world. A 5% chance of incorrectly seeing a single variable as predictive is a pretty low error rate. In most cases we are fine with a low error rate like this.
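To make that 5% concrete, here is a rough simulation sketch (mine, not from the paper). It assumes the predictor and the outcome are independent standard normal draws, uses the usual t statistic for the slope of a simple linear regression, and hard-codes an approximate critical value of 1.984 for 98 degrees of freedom. The function names are made up for illustration.

```python
import math
import random

N_INDIVIDUALS = 100
# Assumption: approximate two-sided 5% critical value of Student's t
# with N_INDIVIDUALS - 2 = 98 degrees of freedom.
T_CRIT_98 = 1.984


def slope_t_statistic(x, y):
    """t statistic for the slope when regressing y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    r = sxy / math.sqrt(sxx * syy)  # sample correlation
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def single_test_false_positive_rate(n_datasets=2000, seed=0):
    """Fraction of null datasets in which one unrelated variable gets P < 0.05."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_datasets):
        # The variable and the outcome are drawn independently: no real association.
        x = [rng.gauss(0.0, 1.0) for _ in range(N_INDIVIDUALS)]
        y = [rng.gauss(0.0, 1.0) for _ in range(N_INDIVIDUALS)]
        if abs(slope_t_statistic(x, y)) > T_CRIT_98:
            hits += 1
    return hits / n_datasets


rate = single_test_false_positive_rate()
print(f"false positive rate for one variable: {rate:.3f}")
```

The printed rate should land close to 0.05, with some wobble from the finite number of simulated datasets.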
But here’s the kicker.
If we test each of our predictors, there is now a 40% chance that we find P < 0.05 for at least one variable. That’s bad.
The paper describes some math showing that, when the null hypothesis is true for all 10 variables, every additional predictor we test is another chance to catch that unlucky 5%. With 10 predictors we get 10 chances, so the odds of incorrectly flagging at least one variable as a predictor climb well above 5%. That’s how science is done, isn’t it? Let’s test a bunch of variables and see if any of them correlate with the outcome. This is especially true if we have no previous intuition about the biological significance of any of the 10 variables.
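The 40% figure follows from a one-line formula: if each of k independent tests has a 5% chance of a false positive, the chance of at least one is 1 - (1 - 0.05)^k. A minimal sketch, assuming independent tests with every null hypothesis true (so each P value is uniform between 0 and 1); the function names are my own:

```python
import random


def familywise_error_rate(alpha, n_tests):
    """Chance of at least one P < alpha among n_tests independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def simulated_familywise_error_rate(alpha, n_tests, n_trials=100_000, seed=0):
    """Monte Carlo check: under the null, each P value is uniform on [0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if any(rng.random() < alpha for _ in range(n_tests)):
            hits += 1
    return hits / n_trials


for k in (1, 5, 10, 20):
    print(k, round(familywise_error_rate(0.05, k), 3))
print("simulated, 10 tests:", simulated_familywise_error_rate(0.05, 10))
```

For 10 tests the formula gives about 0.40, and the simulated value should sit very close to it. By 20 tests the chance of at least one false positive is roughly 64%.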
So… now what?
The authors go on to describe a process of fitting all 10 variables simultaneously and testing a single null hypothesis (no association with the outcome for any of the 10 variables). That test raises a false alarm only 5% of the time, as expected when a 5% error rate is allowed. But this assumes independent variables, and dependence among the predictors complicates matters. It may complicate matters further if new models are introduced, going from simple linear regression to quadratic or other models. The paper’s conclusion stresses the need to nail down this problem, and soon.
“In confirmatory use, P values and confidence intervals can be computed and interpreted as taught in basic statistics courses. In exploratory use, P values can be interpreted as measures of statistical significance only if appropriately adjusted for multiple testing or selection; confidence intervals also need to be adjusted for multiple testing. There are no simple and well-accepted means of doing this adjustment except in the case of explicit multiple testing.”
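For the “explicit multiple testing” case, one common adjustment is the Bonferroni correction: multiply each P value by the number of tests (capping at 1) before comparing it to 0.05. I am not claiming this is the specific adjustment the authors recommend; it is just a well-known example, and the P values below are made up for illustration.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted P values: each P times the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def familywise_error_rate_after_bonferroni(alpha, n_tests):
    """Chance of any false positive when each independent null test uses alpha / n_tests."""
    per_test_alpha = alpha / n_tests
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


# Hypothetical raw P values from testing 10 variables one at a time.
raw = [0.004, 0.03, 0.20, 0.45, 0.51, 0.62, 0.70, 0.81, 0.90, 0.97]
adjusted = bonferroni_adjust(raw)
significant = [p for p in adjusted if p < 0.05]

print(adjusted[:2])
print(len(significant), "variable(s) still below 0.05 after adjustment")
print(round(familywise_error_rate_after_bonferroni(0.05, 10), 4))
```

In this made-up example the raw P value of 0.03 would have looked significant on its own, but after adjustment only the 0.004 result survives. With the correction, the chance of at least one false positive across 10 independent tests drops back to just under 5%.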
The article ends with a warning and some helpful advice:
“Recently, the American Statistical Association issued a statement on the appropriate use of P values and other inferential statistical methods, calling for caution in searching for significance. The report warned against confounding relevance with statistical significance and effect sizes, inadequately exploring the data, not considering relevant covariates and overfitting—all practices that can lead to misuse and squandering of a data resource.”
Published on July 26th, 2017
Last updated on August 10th, 2017