In this time of government cut-backs and sequester, scientists are under increased pressure to dream up ever-new strategies to publish attention-getting articles with eye-catching, but inadequately scrutinized, conjectures. Science writers are under similar pressures, and to this end they have found a way to deliver up at least one fire-breathing, front-page article a month. How? By writing minor variations on an article about how, in this time of government cut-backs and sequester, scientists are under increased pressure to dream up ever-new strategies to publish attention-getting articles with eye-catching, but inadequately scrutinized, conjectures.
Thus every month or so we see retreads on why most scientific claims are unreliable, biased, wrong, and not even wrong. Maybe that’s the reason the authors of a recent article in The Economist (“Trouble at the Lab”) remain anonymous.
I don’t disagree with everything in the article; on the contrary, part of their strategy is to include such well-known problems as publication bias, problems with priming studies in psychology, and failed statistical assumptions. But the “big news”, the one that sells, is that “to an alarming degree” science (as a whole) is not reliable and not self-correcting. The main evidence is the factory-like (thumbs up/thumbs down) application of statistics in exploratory, hypothesis-generating contexts, where the goal is merely to screen reams of associations in order to identify a smaller batch for further analysis. But do even those screening efforts claim to have evidence of a genuine relationship whenever a given H is spewed out of their industrial complexes? Do they go straight to press after one statistically significant result? I don’t know; maybe some do. What I do know is that the generalizations we are seeing in these “gotcha” articles are every bit as guilty of sensationalizing without substance as the bad statistics they purport to be impugning. As they see it, scientists, upon finding a single statistically significant result at the 5% level, declare an effect real or a hypothesis true, and then move on to the next hypothesis. No real follow-up scrutiny, no building on discrepancies found, no triangulation, no self-scrutiny, etc.
But even so, the argument, which purports to follow from “statistical logic” but is actually a jumble of “up-down” significance testing, Bayesian calculations, and computations that might at best hold for crude screening exercises (e.g., for associations between genes and disease), commits blunders about statistical power, and founders. Never mind that if the highest rate of true outputs were all that was wanted, scientists would dabble in trivialities…. Never mind that I guarantee that if you asked Nobel-prize-winning scientists about the rate of correct attempts versus blind alleys on the way to their prize-winning results, they’d report far more than 50% errors (Perrin and Brownian motion, Prusiner and prions, experimental general relativity, just to name some I know).
But what about the statistics?
It is assumed that we know that, in any (?) field of science, 90% of hypotheses are false. Who knows how; we just do. Further, this figure will serve as a Bayesian prior probability, to be multiplied by rejection rates from non-Bayesian hypothesis tests.
OK, so (1) 90% of the hypotheses that scientists consider are false. Let γ be “the power”. Then the probability of a false discovery, given that we reject H0 at the α level, is given by the computation I’m pasting from an older article on Normal Deviate’s blog.
Again, γ is “the power” and A = “the event of rejecting H0 at the α level”. They use α = .05.
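For readers following along, here is what I take the pasted computation to be: the familiar Bayes’-rule calculation of P(H0 | A), with their α = .05, γ = .8, and .9 “prior” plugged in. This is my reconstruction, not their wording:

```latex
% Reconstruction (assuming the standard Bayes'-rule form) of the
% false-discovery computation, with alpha = .05, gamma = .8, P(H0) = .9:
\[
P(H_0 \mid A)
  = \frac{P(A \mid H_0)\,P(H_0)}{P(A \mid H_0)\,P(H_0) + P(A \mid \text{not-}H_0)\,P(\text{not-}H_0)}
  = \frac{(.05)(.9)}{(.05)(.9) + (.8)(.1)}
  = \frac{.045}{.125}
  = .36.
\]
```

So, on their assumptions, roughly a third of .05-level rejections would be “false discoveries”; the question is whether those assumptions make any sense.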
So, let’s see…
(1) is equated to: 90% of the null hypotheses in significance testing are true!
(2) So P(H0) = .9, and P(not-H0) = .1.
A false hypothesis means the null of a significance test is true. What is a true hypothesis? It would seem to be the denial of H0, i.e., not-H0. Then the existence of any discrepancy from the null would be a case in which the alternative hypothesis is true. Yet their example considers
P(x reaches cut-off to reject H0|not-H0) = .8.
They call this a power of .8. But power is only defined relative to detecting a specific alternative or discrepancy from the null, in a given test. You can’t just speak of “the power” of a test (not that this stops the front-page article from doing just that). But to try to make sense of it, they appear to mean
“a hypothesis is true” = there is truly some discrepancy from H0 = not-H0 is true.
But we know the power against (i.e., for detecting) parameter values very close to H0 is scarcely more than .05 (the power at H0 being .05)!
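To see just how low the power is near the null, here is a minimal numerical sketch. It is mine, not theirs, and it assumes a one-sided Normal test of H0: µ ≤ 0 with σ known and the α = .05 cutoff, with alternatives expressed in σ/√n units:

```python
# Minimal sketch (my assumptions: one-sided Normal test of H0: mu <= 0,
# sigma known, cutoff at the alpha = .05 level): power at alternatives
# close to H0 is scarcely more than alpha.
from scipy.stats import norm

alpha = 0.05
z_alpha = norm.isf(alpha)  # standardized cutoff, about 1.645

def power(delta):
    """Power against mu = delta * sigma/sqrt(n)."""
    return norm.sf(z_alpha - delta)

for delta in [0.0, 0.1, 0.5, 1.0, 3.0]:
    print(f"mu = {delta} * sigma/sqrt(n): power = {power(delta):.3f}")

# mu = 0.0 * sigma/sqrt(n): power = 0.050   (power at H0 is just alpha)
# mu = 0.1 * sigma/sqrt(n): power = 0.061
# mu = 0.5 * sigma/sqrt(n): power = 0.126
# mu = 1.0 * sigma/sqrt(n): power = 0.259
# mu = 3.0 * sigma/sqrt(n): power = 0.912   (high power needs a well-separated alternative)
```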
So it can’t be that what they mean by a true hypothesis is “not-H0 is true”.
Let’s try to use their power of .8 to figure out, then, what a true hypothesis is supposed to be.
Let x* be the .05 cut-off for the test in question (1- or 2-sided, we’re not told; I assume 1-sided).
P(test rejects H0 at the .05 level | H') = .8.
So all we have to do to find H’ is consider an alternative against which this test has .8 power.
But H0 together with H’ do not exhaust the space of parameters. So you can’t have this true/false dichotomy referring to them. Let’s try a different interpretation.
Let “a hypothesis is true” mean that H’ is true (where the test has .8 power against H’). Then
“a hypothesis is false” = the true parameter value is closer to H0 than is H'.
But then the probability of an erroneous rejection is no longer .05; for parameter values just short of H' it would be much larger, close to the power of .8. So I fail to see how this argument can hold up.
Suppose, for example, we are applying a Normal test T+ of H0: µ ≤ 0 against H1: µ > 0, and x* is the ~.025 cut-off: 0 + 2(σ/√n), where for simplicity σ is known (and assume iid is not itself problematic). Then
x* + 1(σ/√n)
brings us to µ = 3(σ/√n), an alternative H' against which the test has .84 power, so this is a useful benchmark and close enough to their .8 power.
Consider then H0: µ ≤ 0 and H': µ = 3(σ/√n). H0 and H' do not exhaust the hypothesis space, so how could they have the “priors” (of .9 and .1) they assign?
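Here is the same kind of sketch for this example (again mine, not theirs; same T+ test, σ known, cutoff x* = 2σ/√n, i.e. the ~.025 level). It shows the .84 power at H': µ = 3(σ/√n), and also that parameter values strictly between H0 and H', which would count as “false hypotheses” on the interpretation above, get rejected anywhere from about 2% to over 80% of the time:

```python
# Sketch of rejection probabilities for the T+ example (my assumptions:
# one-sided Normal test, sigma known, cutoff x* = 2*sigma/sqrt(n)).
# True means are written as mu = delta * sigma/sqrt(n).
from scipy.stats import norm

cutoff = 2.0  # x* in sigma/sqrt(n) units (~.025 level)

def rejection_prob(delta):
    """P(T+ rejects H0) when the true mean is delta * sigma/sqrt(n)."""
    return norm.sf(cutoff - delta)

for delta in [0.0, 1.0, 2.0, 2.5, 3.0]:
    print(f"true mu = {delta} * sigma/sqrt(n): P(reject) = {rejection_prob(delta):.3f}")

# true mu = 0.0: P(reject) = 0.023   (the nominal error rate at the null)
# true mu = 1.0: P(reject) = 0.159
# true mu = 2.0: P(reject) = 0.500
# true mu = 2.5: P(reject) = 0.691
# true mu = 3.0: P(reject) = 0.841   (the .84 power at H')
```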
It’s odd, as well, to consider that the alternative to which we assign the .1 prior is to be determined by the test to be run. I’m prepared to grant there is some context in which there are just two point hypotheses; but their example involves associations, and one would have thought degrees (of association) were the norm.
I’ll come back to this when I have time later (I’ll call this draft (i)) [1]. I put it here as a placeholder for a new kind of howler we’ve been seeing for the past decade. I’m having a party here at Thebes tonight… so I have to turn to that… Since I do believe in error correction, please let me know where I have gone wrong.
[1] I made a few corrections after the party, so it’s now draft (ii).
[2] See a larger version of this great cartoon (which did not come from their article).
[3] To respond to a query on power (in a comment): for the Normal testing example with T+: H0: µ ≤ µ0 against H1: µ > µ0:
Test T+: Infer a (positive) discrepancy from µ0 iff {x > x*}, where x* corresponds to a difference statistically significant at the α level (I’m being sloppy in using x rather than a proper test stat d(x), but no matter).
Z = (x* - µ1)√n(1/σ)
I let x* = µ0 + 2(σ/√n)
So, with µ0 = 0 and µ1 = 3(σ/√n),
Z = [2(σ/√n) - 3(σ/√n)](√n/σ) = -1
P(Z > -1) = .84
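Since this was in answer to a query, here is a quick check of that algebra, leaving σ and n symbolic. The sympy/scipy sketch below is mine and not part of the original footnote:

```python
# Check of the footnote's algebra: Z = (x* - mu1) * sqrt(n) / sigma = -1
# for x* = mu0 + 2*sigma/sqrt(n) and mu1 = mu0 + 3*sigma/sqrt(n), for any sigma, n.
import sympy as sp
from scipy.stats import norm

sigma, n = sp.symbols('sigma n', positive=True)
mu0 = sp.Symbol('mu0', real=True)
x_star = mu0 + 2 * sigma / sp.sqrt(n)
mu1 = mu0 + 3 * sigma / sp.sqrt(n)

Z = sp.simplify((x_star - mu1) * sp.sqrt(n) / sigma)
print(Z)             # -1
print(norm.sf(-1))   # 0.8413..., i.e. P(Z > -1) = .84
```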
For more, search power on this blog.
[4] Another one! And a guy named Johnson rediscovering the wheel: alternatives! http://www.nature.com/news/weak-statistical-standards-implicated-in-scientific-irreproducibility-1.14131
Filed under: junk science, P-values, Statistics