Such articles have continued apace since this blogpost from 2013. During that time, meta-research, replication studies, statistical forensics, and fraudbusting have become popular academic fields in their own right. Since I regard the ‘programme’ (to evoke Lakatos) as essentially a part of the philosophy and methodology of science, I’m all in favor of it (I employed the term “metastatistics” eons ago), but, as a philosopher, I claim there’s a pressing need for meta-meta-research, i.e., a conceptual, logical, and methodological scrutiny of presuppositions and gaps in meta-level work itself. One issue I raised in the section “But what about the statistics?” below still hasn’t been addressed: I question the way size and power (from statistical hypothesis testing) are employed in a “diagnostics and screening” computation that underlies most “most findings are false” articles. (This is (2) in my new “Let PBP” series, and follows upon my last post; comments in burgundy were added 12/5/15.)
In this time of government cut-backs and sequester, scientists are under increased pressure to dream up ever new strategies to publish attention-getting articles with eye-catching, but inadequately scrutinized, conjectures. Science writers are under similar pressures, and to this end they have found a way to deliver up at least one fire-breathing, front-page article a month. How? By writing minor variations on an article about how, in this time of government cut-backs and sequester, scientists are under increased pressure to dream up ever new strategies to publish attention-getting articles with eye-catching, but inadequately scrutinized, conjectures. (I’m prepared to admit that meta-research consciousness raising, like “self-help books,” warrants frequent revisiting. Lessons are forgotten, and there are always new users of statistics.)
Thus every month or so we see retreads on why most scientific claims are unreliable, biased, wrong, and not even wrong. Maybe that’s the reason the authors of a recent article in The Economist (“Trouble at the Lab“) remain anonymous. (I realize that is their general policy.)
I don’t disagree with everything in the article; on the contrary, part of their strategy is to include such well-known problems as publication bias, nonreplicable priming studies in psychology, hunting and fishing for significance, and failed statistical assumptions. But the “big news”, the one that sells, is that “to an alarming degree” science (as a whole) is not reliable and not self-correcting. The main evidence is the factory-like (thumbs up/thumbs down) application of statistics in exploratory, hypothesis-generating contexts, wherein the goal is merely to screen through reams of associations to identify a smaller batch for further analysis. But do even those screening efforts claim to have evidence of a genuine relationship when a given H is spewed out of their industrial complexes? Do they go straight to press after one statistically significant result? I don’t know; maybe some do. (Shame on them!) What I do know is that the generalizations we are seeing in these “gotcha” articles are (often) as guilty of sensationalizing without substance as the bad statistics they purport to be impugning. As they see it, scientists, upon finding a single statistically significant result at the 5% level, declare an effect real or a hypothesis true, and then move on to the next hypothesis. No real follow-up scrutiny, no building on discrepancies found, no triangulation, no self-scrutiny.
But even so, the argument, which purports to follow from “statistical logic” but is actually a jumble of “up-down” significance testing, Bayesian calculations, and computations that might at best hold for crude screening exercises (e.g., for associations between genes and disease), commits blunders about statistical power, and founders. Never mind that if the highest rate of true outputs were wanted, scientists would dabble in trivialities. Never mind that I guarantee, if you asked Nobel prize winning scientists about the rate of correct attempts vs. blind alleys they went through before their Prize winning results, they’d say far more than 50% were errors (Perrin and Brownian motion, Prusiner and prions, experimental general relativity, just to name some I know).
But what about the statistics?
It is assumed that we know that, in any (?) field of science, 90% of hypotheses are false. [A] Who knows how, we just do. Further, this will serve as a Bayesian prior probability to be multiplied by rejection rates in non-Bayesian hypothesis testing.
Ok, so (1) 90% of the hypotheses that scientists consider are false. Let γ be “the power” of the test. Then the probability of a false discovery, assuming we reject H0 at the α level, is given by the computation I’m pasting from an older article on Normal Deviate’s blog.
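In their notation (with A the event of rejecting H0 at the α level), the computation is, in essence, the familiar screening formula from Bayes’s theorem:

P(H0 | A) = P(A | H0)P(H0) / [P(A | H0)P(H0) + P(A | not-H0)P(not-H0)]
= (.05)(.9) / [(.05)(.9) + γ(.1)].

With γ = .8, this gives .045/(.045 + .08) ≈ .36, so roughly a third of the “discoveries” are said to be false.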
Again, γ is “the power” and A = “the event of rejecting H0 at the α level”. They use α = .05.
So, let’s see…
(1) is equated to the claim that 90% of the null hypotheses in significance testing are true!
So P(H0) = .9, and P(not-H0) = .1.
A false hypothesis means the null of a significance test is true. What is a true hypothesis? It would seem to be the denial of H0, i.e., not-H0. Then the existence of any discrepancy from the null would be a case in which the alternative hypothesis is true. Yet their example considers
P(x reaches cut-off to reject H0|not-H0) = .8.
They call this a power of .8. But the power is only defined relative to detecting a specific alternative or discrepancy from the null, in a given test. You can’t just speak about the power of a test (not that it stops the front page article from doing just this). But to try and make sense of this, they appear to mean
“a hypothesis is true” = there is truly some discrepancy from H0 = not-H0 is true.
But we know the power against (i.e., for detecting) parameter values very close to H0 is scarcely more than .05 (the power at H0 being .05)!
So it can’t be that what they mean by a true hypothesis is “not-H0 is true”.
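To make this concrete, here is a minimal sketch (my own illustration, assuming Python with scipy; the function name power_Tplus is just a label) of the power of a one-sided Normal test at various alternatives. Discrepancies close to the null are detected with probability barely above α = .05, and .8 power is only reached at an alternative roughly 2.5 standard errors away:

```python
from scipy.stats import norm

def power_Tplus(mu1, mu0=0.0, sigma=1.0, n=100, alpha=0.05):
    """Power of the one-sided Normal test T+ (H0: mu <= mu0 vs H1: mu > mu0)
    against the alternative mu1, with sigma treated as known."""
    se = sigma / n ** 0.5
    x_star = mu0 + norm.ppf(1 - alpha) * se   # cut-off for rejecting H0 at level alpha
    return 1 - norm.cdf((x_star - mu1) / se)  # P(sample mean exceeds x* | mu1)

se = 1.0 / 100 ** 0.5                          # sigma/sqrt(n)
for k in (0, 0.1, 0.5, 1, 2, 2.5, 3):          # alternatives in units of sigma/sqrt(n)
    print(f"mu1 = {k}(sigma/sqrt n): power = {power_Tplus(k * se):.2f}")
```

So “the power of the test” is not a single number; it depends entirely on which discrepancy from H0 one has in mind.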
Let’s try to use their power of .8 to figure out, then, what a true hypothesis is supposed to be.
Let x* be the .05 cut-off for the test in question (1 or 2-sided, we’re not told, I assume 1-sided).
P(test rejects H0 at the .05 level| H’) = .8.
So all we have to do to find H’ is consider an alternative against which this test has .8 power.
But H0 together with H’ do not exhaust the space of parameters. So you can’t have this true/false dichotomy referring to them. Let’s try a different interpretation.
Let “a hypothesis is true” mean that H’ is true (where the test has .8 power against H’). Then
“a hypothesis is false” = the true parameter value is closer to H0 than is H’.
But then the probability of an erroneous rejection is no longer .05, but would be much larger, very close to the power .8 (for true parameter values just short of H’). So I fail to see how this argument can hold up.
Suppose, for example, we are applying a Normal test T+ of H0: µ ≤ 0 against H1: µ > 0, and x* is the ~.025 cut-off: 0 + 2(σ/√n), where for simplicity we take σ to be known (and assume iid is not itself problematic). Then
x* + 1(σ/√n)
brings us to µ = 3(σ/√n), an alternative H‘ against which the test has .84 power—so this is a useful benchmark and close enough to their .8 power.
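Notice what happens, on this reading, to the advertised .05 error probability in this example: if “H0 is true” just means the true µ is closer to 0 than to H’, then a rejection when µ is just below H’, say µ = 2.9(σ/√n), counts as erroneous, and its probability is P(x > x* | µ = 2.9(σ/√n)) = P(Z > 2 - 2.9) = P(Z > -0.9) ≈ .82: nearly the power, nowhere near .05.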
Consider then H0: µ ≤ 0 and H’: µ = 3(σ/√n). H0 and H’ do not exhaust the hypothesis space, so how could they have the “priors” (of .9 and .1) they assign?
It’s odd, as well, to consider that the alternative to which we assign .1 prior is to be determined by the test to be run. I’m prepared to grant there is some context where there are just two point hypotheses; but their example involves associations, and one would have thought degrees were the norm.
I’ll come back to this when I have time later; I put it here as a placeholder for a new kind of howler we’ve been seeing for the past decade. Since I do believe in error correction, please let me know where I have gone wrong.
This is the second (2) in a series on “Let PBP”. The first was my last post.
Notes from 12/5/15 begin with [A] and are in bold.
[A] While the “hypotheses” here generally refer to substantive research or causal claims, their denials are apparently statistical null hypotheses of 0 effect. Moving from rejecting one of these nil hypotheses to the research claim is already a howler not countenanced by proper significance testing. This is what the fallacious animal, NHST, is claimed to permit. So I use this acronym only for the fallacious entity.
Original notes:
[1] I made a few corrections (i.e., the original post was draft (iii) which has nothing to do with the numbering I’m now using for “LetPBP” posts).
[2] See the larger version of this great cartoon (which did not come from their article).
[3] To respond to a query on power (in a comment): take the Normal testing example with T+: H0: µ ≤ µ0 against H1: µ > µ0. (Please see the comments from the original post.)
Test T+: Infer a (positive) discrepancy from µ0 iff {x > x*}, where x* corresponds to a difference statistically significant at the α level (I’m being sloppy in using x rather than a proper test stat d(x), but no matter).
Z = (x* - µ1)(√n/σ)
I let x* = µ0 + 2(σ/√n).
So, with µ0 = 0 and µ1 = 3(σ/√n),
Z = [2(σ/√n) - 3(σ/√n)](√n/σ) = -1
P(Z > -1) = .84
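A quick numerical check of this (a sketch assuming scipy is available):

```python
from scipy.stats import norm

# Power of T+ at mu1 = 3(sigma/sqrt n), with cut-off x* = mu0 + 2(sigma/sqrt n)
print(1 - norm.cdf(-1.0))  # P(Z > -1) ≈ 0.841
```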
For (quite a lot) more on power, search power on this blog.
Filed under: junk science, Let PBP, P-values, science-wise screening