I may have been exaggerating one year ago when I started this post with “Hardly a day goes by”, but now it is literally the case*. (This also pertains to reading for Phil6334 for Thurs. March 6):
Hardly a day goes by where I do not come across an article on the problems for statistical inference based on fallaciously capitalizing on chance: high-powered computer searches and “big” data trolling offer rich hunting grounds out of which apparently impressive results may be “cherry-picked”:
When the hypotheses are tested on the same data that suggested them and when tests of significance are based on such data, then a spurious impression of validity may result. The computed level of significance may have almost no relation to the true level. . . . Suppose that twenty sets of differences have been examined, that one difference seems large enough to test and that this difference turns out to be “significant at the 5 percent level.” Does this mean that differences as large as the one tested would occur by chance only 5 percent of the time when the true difference is zero? The answer is no, because the difference tested has been selected from the twenty differences that were examined. The actual level of significance is not 5 percent, but 64 percent! (Selvin 1970, 104)[1]
…Oh wait, this is from a contributor to Morrison and Henkel way back in 1970! But there is one big contrast, I find, that makes current-day reports so much more worrisome: critics of the Morrison and Henkel ilk clearly report that ignoring a variety of “selection effects” results in a fallacious computation of the actual significance level associated with a given inference; clear terminology is used to distinguish the “computed” or “nominal” significance level on the one hand from the actual or warranted significance level on the other. Nowadays, writers make it much less clear that the fault lies with the fallacious use of significance tests and other error statistical methods. Instead, the tests are blamed for permitting or even encouraging such misuses. Criticisms to the effect that we should stop trying to teach these methods correctly have hardly helped. The situation is especially puzzling given that these same statistical fallacies have trickled down to the public sphere, what with Ben Goldacre’s “Bad Pharma”, calls for “all trials” to be registered and reported, and the popular articles on the ills of ‘big data’:
We’re more fooled by noise than ever before, and it’s because of a nasty phenomenon called “big data”. With big data, researchers have brought cherry-picking to an industrial level. (from Taleb 2013)
I say it is puzzling because the very reason these writers are able to raise the antennae of entirely non-technical audiences is that they assume–rightly in my judgment–the intuitive plausibility of the reasoning needed to pinpoint the fallacies being committed. No technical statistics required. Some well-known slogans:
- Statistical significance is not substantive significance
- Association is not causation
- If you torture the data enough they will confess.
The associated statistical fallacies are so antiquated that one is almost embarrassed to take them up in 2013, but there is no way around it if one is to tell the truth about statistical inference. When p-values are reported in the same way regardless of various “selection effects,” testing’s basic principles—whether of the Neyman-Pearson or Fisherian variety—are being twisted, distorted, invalidly used, even if the underlying statistical model has been approximately satisfied—a big if! To those who say we should turn tests into decisions with explicit losses, I ask: how does it help hold “big pharma” accountable to advocate that they explicitly mix their cost-benefit assessments into the very analysis of the data? At least a Goldacre-type criticism can call them out on subliminal bias.
To detect or even explain the fallacies is implicitly to recognize how various tactics mischaracterize and may greatly inflate the chance of reporting unwarranted claims. It is not a problem about long runs either—it is a problem of the capability of the specific test to have done its job. Whenever presented with a statistical report, I always want to audit it by asking: just how frequently would this method have alerted me to erroneous claims of this form? If it would infrequently have alerted me, I deny that it provides good evidence for this particular claim (at least without further assurances). When probability arises to describe how frequently methods are capable of detecting and discriminating erroneous interpretations of data, we may call it an error-statistical use.
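For readers who want to see what such an audit looks like in practice, here is a minimal simulation sketch (the sample sizes and number of trials are illustrative choices of mine, not Selvin’s) of the hunting procedure: generate twenty null comparisons, test the most impressive one at the nominal 5 percent level, and record how often the procedure declares “significance”:

```python
# Minimal audit sketch: how often does "test the largest of 20 hunted
# differences" report a nominally significant result when every null is true?
# Sample sizes and trial counts are illustrative assumptions, not from Selvin.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials, n_comparisons, n_per_group, alpha = 10_000, 20, 30, 0.05

hits = 0
for _ in range(n_trials):
    # Twenty independent two-sample comparisons with zero true difference.
    pvals = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_comparisons)
    ]
    # The hunting procedure reports success if the most impressive
    # (smallest) p-value clears the nominal 5 percent cutoff.
    if min(pvals) < alpha:
        hits += 1

print(f"nominal level: {alpha}")
print(f"actual frequency of a 'significant' report: {hits / n_trials:.2f}")
print(f"Selvin's approximation, 1 - 0.95**20: {1 - 0.95**20:.2f}")
```

Under independence the simulated frequency comes out near Selvin’s 64 percent; the audit consists in comparing that figure with the reported 5 percent.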
I suspect that the growth of fallacious statistics is due not only to the growth of big data but also to the acceptability of, if not preference for, methods that declare themselves free from such error-probabilistic encumbrances. The popular writers, quite correctly in my judgment, assume the reader will know all too well what is troubling about cherry-picking, hunting and all the rest. But statistical accounts that downplay error probabilities are at odds with these commonplace intuitions! These accounts, in more formal terms, deny or downplay the relevance of the sampling distribution on which pinpointing the trouble depends. Deniers do not take into account the sampling distribution once the data are in hand.
In a view that does not take into account the sampling distribution, inferences are conditional on the realized value x; other values which may have occurred are regarded as irrelevant. . . . No consideration of the sampling distribution of a statistic is entertained; sample space averaging is ruled out. (Barnett 1982, 226)
So if I ask a denier how often the procedure would output nominally significant effects, I might receive the reply:
The question of how often a given situation would arise is utterly irrelevant to the question of how we should reason when it does arise. I don’t know how many times this simple fact will have to be pointed out before statisticians of ‘frequentist’ persuasions will take note of it. (Jaynes 1976, 247)[See new note ** and yet newer note ***].
To us, reasoning from the result that did arise is crucially dependent on how often it would occur erroneously.
But what is the “it”? I admit this needs clarification.
It may be best understood as the inference for which the data have been claimed to provide evidence. Even better, one might consider the particular erroneous interpretation of the data that would be of concern or interest. The critique is directed by the sources of misinterpretation relevant to the inferential problem of interest. One then considers the inference, the data, and how they were obtained as parts of a general process that might make it too easy to find an erroneous positive result. Given how common and vital this type of error-statistical reasoning is, I have found myself thinking that the dismissal of sampling distributions (among, say, Bayesians [2]) is based on a confusion—and that we error statisticians have simply not explained the philosophy behind using these tools. (I still think this!)
Remember that, here, statistical scrutiny means scrutinizing for mistaken or unwarranted interpretations of data. To say that a statistical inference to claim H is warranted is to say that H is, at the least, an adequate interpretation of the data (be it this data or all the data available): that is, that H has passed with reasonable severity. So if the report claims to have evidence that treatment T increased benefits regarding factor F, it matters whether the procedure would often have reported that T increases some benefit or other, even if all such reports are spurious. Properly assessing these error frequencies or error probabilities requires considering the overall process, which takes thought and is rarely automatic.
Some advocate that we view the observed statistical significance level, or p-value, simply as a logical measure of the relation between data and hypothesis, with no error-statistical component. One is free to do so (despite apparently violating standard FDA regulations and being at odds with best practices in statistics in the law). But the main problem is that one will still need some way to set about scrutinizing low p-values that can very often be generated, even though the mistaken interpretation has scarcely been well ruled out. We prefer to adhere to the intended and valid use of observed significance levels. But mixing valid and invalid uses has become so prevalent that some new terminology is called for. Any computed or nominal observed significance level (or p-value) will be called unaudited. Until we have vetted its error-statistical credentials, it is at most unaudited. We auditors invariably have an interest that Jaynes would apparently find inexplicable, if not pathological: viewing the positive result as an instance of a general testing procedure, with various abilities to mislead (or not)[3].
Frequentist calculations may be used to examine the particular case by describing how well tests can uncover mistakes in inference, or so I have argued. On these grounds, the “hunting procedure” would have been little able to alert us to temper our enthusiasm, even where tempering is warranted. In Selvin’s illustration, because at least one such impressive departure is common even if all the differences are due to chance, the test has scarcely reassured us that it has done a good job of avoiding such a mistake in this case (Mayo 1996). We do not thereby discount the claim, but call attention to the need for further evidence. Being able to describe a test as affording terrible evidence is actually one of the important assets of this account. In some cases we may want to say that the evidence “so far” is terrible [4].
Other unaudited reports of observed significance levels proceed apace. In some cases, though the null or test hypothesis is fixed, the criteria for rejection or choice of distance measure are chosen in order to produce a result that appears impressively far from what is expected under the null. And despite the absence of genuine association, much less causal connection, the frequency or probability of outputting an impressive result may be high. The non-null has “passed” the test, but with terrible severity. It does not really matter whether one alludes to relative frequencies or probabilities[5]. Mounting the critical audit (whether formal or informal) reflects an error-statistical principle of evidence because its application depends on sampling distributions. A statistical account that denies the relevance of sampling distributions to the interpretation of data obstructs such auditing.
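As a rough sketch of this second kind of selection (the particular trio of test statistics, the sample size, and the cutoff are my own illustrative choices, not drawn from any of the sources above), suppose the analyst keeps whichever of three standard distance measures makes the departure from a true null look most impressive:

```python
# Minimal sketch: the null (true mean 0) is fixed, but the distance measure is
# chosen after seeing the data, keeping whichever statistic looks most impressive.
# The choice of statistics, sample size, and cutoff are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_trials, n, alpha = 10_000, 40, 0.05

impressive = 0
for _ in range(n_trials):
    x = rng.normal(size=n)  # data generated under the null
    p_candidates = [
        stats.ttest_1samp(x, 0.0).pvalue,                    # test of the mean
        stats.wilcoxon(x).pvalue,                            # signed-rank test
        stats.binomtest(int((x > 0).sum()), n, 0.5).pvalue,  # sign test
    ]
    # Report whichever distance measure yields the most impressive departure.
    if min(p_candidates) < alpha:
        impressive += 1

print(f"nominal level: {alpha}")
print(f"actual frequency of an 'impressive' result: {impressive / n_trials:.2f}")
```

Because these three statistics are highly correlated, the inflation here is modest, but the actual frequency still exceeds the nominal 5 percent, and it grows as the menu of candidate measures or cutoffs is enlarged.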
*I am grateful to readers who send me articles retracing, in various forms, the criticisms of significance test abuse, “trouble in the lab” type articles, and items on statistical forensics and fraudbusting. For my diagnosis of the problem, see my recent slides “Probabilism and an obstacle to fraudbusting” here and “Statistical Dirty Laundry” here, and search this blog.
**I want to be clear that the reason outcomes other than the one observed matter (for interpreting the evidential meaning of that one) is not a concern (merely) for low long-run error rates! Critics keep getting this wrong. A procedure might have a swell long-run error rate q, while q fails to qualify the warrant of the inference of interest. (At the same time, the qualification we seek will not be a probability assigned to the inferred hypothesis H, however interpreted, but an assessment of how well tested H is.)
***It may be that Jaynes is merely pointing out that there are inferences that are obviously counterintuitive in the case at hand, even though they stem from methods with good “long-run properties”. But we agreed to this from the start. Recall the case of flipping a coin to decide whether to use a highly imprecise or a highly precise measuring tool (the mixture test, discussed in relation to Birnbaum on this blog). Knowing we got the toss outcome that led to the imprecise tool, it would be a misleading report of the severity achieved to average the two tools’ properties in some way. We “condition” on the measuring tool actually used (the weak conditionality principle). Whether or not you think this requires a “principle” is a separate matter.
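For concreteness, here is a tiny numerical sketch of that mixture example (the standard deviations of the two tools and the reported margin of error are illustrative values of mine): compare the coverage of a report like “the measurement is within 2 units of the true value” averaged over the coin toss with the coverage conditional on the tool actually used.

```python
# Minimal sketch of the mixture example: a fair coin picks either a precise
# (sigma = 1) or an imprecise (sigma = 10) measuring tool. The sigmas and the
# 2-unit margin are illustrative assumptions, not from the original discussion.
from scipy import stats

sigma_precise, sigma_imprecise, margin = 1.0, 10.0, 2.0

def coverage(sigma):
    # P(|X - mu| <= margin) when the measurement X ~ N(mu, sigma^2)
    return stats.norm.cdf(margin / sigma) - stats.norm.cdf(-margin / sigma)

cov_precise = coverage(sigma_precise)
cov_imprecise = coverage(sigma_imprecise)
cov_averaged = 0.5 * cov_precise + 0.5 * cov_imprecise  # pre-toss average

print(f"coverage, precise tool:   {cov_precise:.3f}")
print(f"coverage, imprecise tool: {cov_imprecise:.3f}")
print(f"unconditional (averaged): {cov_averaged:.3f}")
```

Knowing the toss sent us to the imprecise tool, reporting the averaged figure (a bit over .5) would overstate how severely the claim has been probed; the conditional figure is the relevant one, which is just the point of the weak conditionality principle.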
REFERENCES
Barnett, V. 1982. Comparative statistical inference. 2nd ed. New York: John Wiley & Sons.
Jaynes, E. T. 1976. Confidence Intervals vs Bayesian Intervals. In Foundations of probability theory, statistical inference and statistical theories of science, Vol. 2, edited by W. L. Harper and C. A. Hooker, 175-257. Dordrecht, The Netherlands: D. Reidel.
Mayo, D. 1996. Error and the Growth of Experimental Knowledge. Chicago: University of Chicago Press. (see Chapter 9)
Selvin, H. 1970. A critique of tests of significance in survey research. In The significance test controversy, edited by D. Morrison and R. Henkel, 94-106. Chicago: Aldine.
Young, S.: http://www.niss.org/sites/default/files/RASSWebinar_121212.pdf
[1] Selvin calculates this approximately by considering the probability of finding at least one statistically significant difference at the .05 level when 20 independent samples are drawn from populations having true differences of zero: 1 – P(no such difference) = 1 – (.95)^20 ≈ 1 – .36 = .64. This assumes, unrealistically, independent samples, but without that it may be unclear how to even approximately compute actual p-values.
[2] At least those who accept the (strong) likelihood principle.
[3] But because we consider outcomes other than the one observed does not entail we consider experiments other than the one performed, in reasoning from the data.
[4] Contrast this with saying that the hypothesis is not probable so far.
[5] Frequentist or propensity notions.
Filed under: junk science, selection effects, spurious p values, Statistical fraudbusting, Statistics