Monday, September 27, 2010

Fetishizing p-Values - The Cult of Statistical Significance

William Sealy Gosset (image: en.wikipedia.org)
Fetishizing p-Values; Tom Leinster - The n-Category Cafe
Recovering the insight of "Student" Gosset from the over-simplification of Ronald A. Fisher
Leinster: Now there’s a whole book making the same point: The Cult of Statistical Significance, by two economists, Stephen T. Ziliak and Deirdre N. McCloskey. You can see their argument in this 15-page paper with the same title. Just because they’re economists doesn’t mean their prose is sober: according to one subheading, ‘Precision is Nice but Oomph is the Bomb’.
Leinster: it is true that p-value does not measure the magnitude of the effect (but then, anyone who has taken at least one course in statistics should know that)
I think Jost, Ziliak and McCloskey would completely agree that anyone who has taken at least one course in statistics should know that. They’re pointing out, open-mouthed, that this incredibly basic mistake is being made on a massive scale, including by many people who should know much, much better. Bane used the term ‘collective self-deception’; one might go further and say ‘mass delusion’. It’s a situation where a fundamental mistake has become so ingrained in how science is done that it’s hard to get your paper accepted if you don’t perpetuate that mistake.
That last statement is probably putting it too strongly, but as I understand it, the point they’re making is along those lines.
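To make the distinction concrete, here is a minimal sketch of my own (it is not from Leinster or from Ziliak and McCloskey). It assumes a simple two-sample z-test with a known, equal standard deviation in both groups; the helper name `two_sided_p_value` and all the numbers are hypothetical, chosen only for illustration. It compares a practically negligible effect measured on a huge sample with a large effect measured on a tiny sample.

```python
import math


def two_sided_p_value(mean_difference, sigma, n_per_group):
    """Two-sided p-value for a two-sample z-test with known, equal sigma.

    Illustrative helper only; the name and signature are assumptions.
    """
    # Standard error of the difference between two group means.
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = abs(mean_difference) / standard_error
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(z / math.sqrt(2.0))


# Hypothetical case A: an effect of 0.01 standard deviations (no "oomph"),
# but one million observations per group.
p_tiny_effect = two_sided_p_value(mean_difference=0.01, sigma=1.0,
                                  n_per_group=1_000_000)

# Hypothetical case B: an effect of 0.5 standard deviations (plenty of
# "oomph"), but only ten observations per group.
p_large_effect = two_sided_p_value(mean_difference=0.5, sigma=1.0,
                                   n_per_group=10)

print(f"tiny effect, huge sample:   p = {p_tiny_effect:.2e}")
print(f"large effect, small sample: p = {p_large_effect:.2f}")
```

Under these assumed numbers the negligible effect comes out "significant" at the usual 0.05 level and the large one does not, which is the sense in which the p-value says nothing about magnitude.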
From the 15-page paper "The Cult of Statistical Significance":
In 1937 Gosset, the inventor and original calculator of “Student’s” t-table, told Egon [Pearson], then editor of Biometrika, that a significant finding is by itself “nearly valueless”:
...obviously the important thing in such is to have a low real error, not to have a "significant" result at a particular station. The latter seems to me to be nearly valueless in itself. . . . Experiments at a single station [that is, tests of statistical significance on a single set of data] are almost valueless. . . . What you really want is a low real error. You want to be able to say not only "We have significant evidence that if farmers in general do this they will make money by it", but also "we have found it so in nineteen cases out of twenty and we are finding out why it doesn't work in the twentieth." To do that you have to be as sure as possible which is the 20th—your real error must be small...
Gosset to E. S. Pearson 1937, in Pearson 1939, p. 244.
Gosset, we have noted, is unknown to most users of statistics, including economists. Yet he was proposing and using in his own work at Guinness a characteristically economic way of looking at the acquisition of knowledge and the meaning of “error.” The inventor of small sample econometrics focused on the opportunity cost of each observation; he tried to minimize random and non-random errors, real errors.
Edit 11/12/10
A very nice write-up here, along the same lines: Significance Tests in Climate Science -- Maarten H. P. Ambaum -- http://www.met.reading.ac.uk/~sws97mha/Publications/jclim_ambaum_rev2.pdf
Consider a scientist who is interested in measuring some effect and who does an experiment in the lab. Now consider the following thought process that the scientist goes through:
  1. My measurement stands out from the noise.
  2. So my measurement is not likely to be caused by noise.
  3. It is therefore unlikely that what I am seeing is noise.
  4. The measurement is therefore positive evidence that there is really something happening.
  5. This provides evidence for my theory.
This apparently innocuous train of thought contains a serious logical fallacy, and it appears at a spot where not many people notice it.
To the surprise of most, the logical fallacy occurs between step 2 and step 3. Step 2 says that there is a low probability of finding our specific measurement if our system would just produce noise. Step 3 says that there is a low probability that the system just produces noise. These sound the same but they are entirely different.
This can be compactly described using Bayesian statistics...
This comes from a summary of the paper: How significance tests are misused in climate science -- Guest post by Dr Maarten H. P. Ambaum from the Department of Meteorology, University of Reading, U.K. -- http://www.skepticalscience.com/news.php?n=456#
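The gap between step 2 and step 3 can be sketched with Bayes' theorem. This is my own illustration, not Ambaum's calculation: it assumes only two hypotheses ("just noise" versus "real effect"), and the function name `posterior_noise` and every probability below are hypothetical. Step 2 is a statement about P(measurement | noise); step 3 is a statement about P(noise | measurement), which also depends on the prior and on how likely the measurement is if the effect is real.

```python
def posterior_noise(p_data_given_noise, p_data_given_effect, prior_noise):
    """P(noise | data) via Bayes' theorem for two exhaustive hypotheses.

    Illustrative helper only; the name and signature are assumptions.
    """
    prior_effect = 1.0 - prior_noise
    # Total probability of the measurement under both hypotheses.
    evidence = (p_data_given_noise * prior_noise
                + p_data_given_effect * prior_effect)
    return p_data_given_noise * prior_noise / evidence


# Step 2: the measurement is unlikely if the system only produces noise
# (hypothetical value).
p_data_given_noise = 0.05
# Assumed probability of the same measurement if the effect is real
# (hypothetical value).
p_data_given_effect = 0.5

# A skeptical prior: before the experiment, noise is thought 99% likely.
skeptical = posterior_noise(p_data_given_noise, p_data_given_effect,
                            prior_noise=0.99)
# An agnostic prior: noise and a real effect are equally likely beforehand.
agnostic = posterior_noise(p_data_given_noise, p_data_given_effect,
                           prior_noise=0.5)

print(f"P(noise | data), skeptical prior: {skeptical:.2f}")
print(f"P(noise | data), agnostic prior:  {agnostic:.2f}")
```

With these assumed inputs the same measurement leaves noise at roughly 0.91 under the skeptical prior and roughly 0.09 under the agnostic one, so step 3 does not follow from step 2 alone.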
Edit 11/21/10

Significance Tests, frequentist vs. Bayesian

When we perform a statistical significance test, what we
would really like to ask is “what is the probability that the
alternative hypothesis is true?”. A frequentist analysis
fundamentally cannot give a direct answer to that question, as
they cannot meaningfully talk of the probability of a hypothesis
being true – it is not a random variable, it is either true or
false and has no “long run frequency”. Instead, the frequentist
gives a rather indirect answer to the question by telling you the
likelihood of the observations assuming the null hypothesis is
true and leaving it up to you to decide what to conclude from
that. A Bayesian on the other hand can answer the question
directly as the Bayesian definition of probability is not based
on long run frequencies but on the state of knowledge of the
truth of a proposition. The problem with frequentist statistical
tests is that there is a tendency to interpret the result as if it
were the result of a Bayesian test, which is natural as that is
the form of answer we generally want, but still wrong.

The frequentist approach avoids the “subjectivity” of the
Bayesian approach (although the extent of that “subjectivity” is
debatable), but this is only achieved at the expense of not
answering the question we would most like to ask. It could be
argued that the frequentist approach merely shifts the
subjectivity from the analysis to the interpretation (what should
we conclude based on our p-value). Which form of analysis you
should use depends on whether you find the “subjectivity” of the
Bayesian approach or the “indirectness” of the frequentist
approach most abhorrent! ;o)

At the end of the day, as long as the interpretation is
consistent with the formulation, there is no problem and both
forms of analysis are useful.
This was my favorite comment; the whole sub-thread underneath is interesting. The original Open Mind | tamino.wordpress.com article adds good qualifications to Dr Maarten H. P. Ambaum's Skeptical Science post.
