On Statistical Significance

May 9, 2009

I’ve been reading through the papers of Steve’s ESE course. I’m happy it’s rainy out.

Several papers so far have dealt with the issue of statistical significance and multiple tests. If each statistical test you try has a five percent chance of giving a false positive, then across twenty tests you'd expect about one false positive on average, and the chance of getting at least one is roughly 64%.
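To make that arithmetic concrete, here's a quick sketch (illustrative, not from any of the papers; the alpha, test count, and seed are my own choices):

```python
# Multiple-comparisons arithmetic: twenty independent tests at alpha = 0.05.
import random

ALPHA = 0.05
N_TESTS = 20

# Analytic: probability of at least one false positive across independent tests.
p_at_least_one = 1 - (1 - ALPHA) ** N_TESTS
print(f"P(>=1 false positive) = {p_at_least_one:.3f}")  # ~0.642

# Monte Carlo check: each "test" fires a false positive with probability ALPHA.
random.seed(0)
trials = 100_000
hits = sum(
    any(random.random() < ALPHA for _ in range(N_TESTS))
    for _ in range(trials)
)
print(f"simulated             = {hits / trials:.3f}")
```

The expected number of false positives is N_TESTS × ALPHA = 1, while the probability of at least one is 1 − 0.95²⁰ ≈ 0.64.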

The papers I mean, specifically, are these: “Preliminary guidelines for empirical research in software engineering,” “Why Most Published Research Findings Are False,” and “Generalization and Theory-Building in Software Engineering Research.”

Kitchenham et al. don’t say it, but the others seem to: You should decide what statistical tests you’re going to apply before you collect data, or at least make sure your tests are grounded in a solid theory.

Wait, what?

There are two different policies advocated here: Don’t try lots of statistical tests (because you’ll probably find something) and Have a theory.

Have a theory, by analogy: As scientists, we must assume our theories are wrong and behave accordingly. Having a theory about patterns in lottery numbers doesn’t make you any more likely to win. P(Win|TheoryCorrect) = P(Win). And if you win the lottery, it makes no difference whether or not you had a theory. P(TheoryCorrect|Win) = P(TheoryCorrect). It’s great to focus on building theories for software engineering. Theories let us move forward. But let’s not pretend that your having a theory has any bearing on the presence or absence of an effect in the world, nor does it have any bearing on the likelihood of a statistically significant finding being correct. The multiple-test significance problem is not support for a call for theories.
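The lottery claim above, P(Win|TheoryCorrect) = P(Win), can be checked by simulation. A sketch (mine, not from the post; the 1-in-50 lottery and the "always pick 7" theory are made up for illustration):

```python
# A "theory" about lottery numbers (always pick 7) wins at exactly the same
# rate as uninformed random guessing: the draw is independent of your theory.
import random

random.seed(1)
DRAWS = 200_000
N = 50  # lottery draws one winning number from 1..N

theory_wins = random_wins = 0
for _ in range(DRAWS):
    winning = random.randint(1, N)
    theory_wins += (winning == 7)                     # theory-driven pick
    random_wins += (winning == random.randint(1, N))  # uninformed pick

# Both win rates hover around 1/50 = 0.02.
print(theory_wins / DRAWS, random_wins / DRAWS)
```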

As for the matter of trying many statistical tests: The effect is either present in the real world or it isn’t. And there’s either an effect in your data or there isn’t. Those facts don’t change if you choose not to look for them. Kitchenham et al. have it right: You should state how hard you looked for an effect in terms of the number of statistical tests. But you shouldn’t hold back from looking carefully at all your data!
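One concrete way to "state how hard you looked" and still look at everything is to correct the significance threshold for the number of tests. A minimal sketch (my illustration, not a method from the papers; the p-values are made up), using the Bonferroni correction:

```python
# Bonferroni correction: split the family-wise alpha across all m tests,
# so each individual test is judged against alpha / m.
def bonferroni_significant(p_values, alpha=0.05):
    """Return, for each p-value, whether it survives the corrected threshold."""
    m = len(p_values)
    threshold = alpha / m
    return [p < threshold for p in p_values]

p_values = [0.001, 0.04, 0.20, 0.03, 0.002]
# With five tests, only p < 0.05 / 5 = 0.01 survives.
print(bonferroni_significant(p_values))  # [True, False, False, False, True]
```

Bonferroni is conservative; the point is only that running many tests and disclosing the correction are compatible.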

I’m probably either wrong in my logic or I’m building strawmen out of the relevant parts of the latter two papers. Maybe someone who took the course can enlighten me.


5 Responses to “On Statistical Significance”

  1. George Says:

    I don’t know if this is germane, but I hate p values because they are so easily and often misinterpreted!
    I’m assuming none of those links make any of these mistakes, but I don’t care to read them. Regardless, one can never go over these misconceptions enough. So from wikipedia:

    There are several common misunderstandings about p-values.

    1. The p-value is not the probability that the null hypothesis is true. (This false conclusion is used to justify the “rule” of considering a result to be significant if its p-value is very small (near zero).) In fact, frequentist statistics does not, and cannot, attach probabilities to hypotheses. Comparison of Bayesian and classical approaches shows that a p-value can be very close to zero while the posterior probability of the null is very close to unity. This is the Jeffreys-Lindley paradox.
    2. The p-value is not the probability that a finding is “merely a fluke.” (Again, this conclusion arises from the “rule” that small p-values indicate significant differences.) As the calculation of a p-value is based on the assumption that a finding is the product of chance alone, it patently cannot also be used to gauge the probability of that assumption being true. This is subtly different from the real meaning, which is that the p-value is the chance that the null hypothesis explains the result: the result might not be “merely a fluke,” and may be explicable by the null hypothesis with confidence equal to the p-value.
    3. The p-value is not the probability of falsely rejecting the null hypothesis. This error is a version of the so-called prosecutor’s fallacy.
    4. The p-value is not the probability that a replicating experiment would not yield the same conclusion.
    5. 1 − (p-value) is not the probability of the alternative hypothesis being true (see (1)).
    6. The significance level of the test is not determined by the p-value. The significance level of a test is a value that should be decided upon by the agent interpreting the data before the data are viewed, and is compared against the p-value or any other statistic calculated after the test has been performed.
    7. The p-value does not indicate the size or importance of the observed effect (compare with effect size).
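    A quick sketch of the first misconception in that list (my illustration, not from Wikipedia or the papers; the z-test and seed are arbitrary choices): under a true null hypothesis, p-values are uniformly distributed, so p < 0.05 occurs 5% of the time no matter how certain we are that the null holds.

```python
# Under a true null (data really are N(0, 1)), a small p-value is just the
# 5% of cases alpha allows for; it says nothing about P(null is true).
import math
import random

def z_test_p(sample):
    """Two-sided p-value for H0: mean = 0, with known sigma = 1."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n)
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(42)
reps, n = 20_000, 30
false_positives = sum(
    z_test_p([random.gauss(0, 1) for _ in range(n)]) < 0.05
    for _ in range(reps)
)
print(false_positives / reps)  # ~0.05, even though the null is true every time
```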

  2. Neil Says:

    I agree that for a random event like a lottery, having a theory does not help you win, but presumably the point of a theory is that for most phenomena, the theory will predict some result. You then decide whether the phenomena invalidate the theory or not. Am I following you correctly? So in a lottery, any theory other than “the results are random” ought to be rejected. But for e.g. the bending of light by black holes, we don’t reject Einstein’s relativity, as it predicted the result correctly.

  3. aran Says:

    @George: Thanks, it appears I do misunderstand the intent and meaning of p-values somewhat.

    @Neil: I think I might not have made my point as clear as I could. The latter two papers I mentioned above seem to imply that _having a theory_ somehow makes you more likely to be correct. Black holes bend light regardless of whether or not we understand why. And our observations of light-bending are equally likely to be meaningful whether or not we’ve made the theory yet.

  4. Neil Says:

    Although in some fields (I’m looking at you, literary criticism), they would argue that having a theory shapes your observations and even reality (well, reality as it is perceived). That seems to apply more to human artifacts than to external phenomena, but it might be true in some areas of software engineering (e.g., the narrative structure of the history of the creation of C++).

  5. Ah, you missed a good course there.

    The papers you indicate are about research strategy, not epistemology. Basically, they are trying to push empirical SE research to a higher level of maturity. Too many researchers design experiments to test hypotheses, using the standard hypothesis-testing approach of modern stats. Many of these studies have incorrect conclusions, for all the reasons you mention. A much better strategy is to build causal theories from your observations, and then design experiments that probe different aspects of the causal mechanism posited by the theory. Not single studies, but a coordinated series of studies that accumulate knowledge about the theory.

    By the way, the conclusions of the Ioannidis paper are wrong. Did you figure out why (other than the flippant, self-referential reason, that is)?
