Can confidence intervals save psychology? Part 2.

This is part 2 in a series about confidence intervals (here's part 1). Answering the question in the title is not really my goal; I simply want to discuss confidence intervals and their pros and cons. The last post explained why frequentist statistics (and confidence intervals) can't assign probabilities to one-time events, but always refer to a collective of long-run events.

If confidence intervals don't really tell us what we want to know, does that mean we should throw them in the dumpster along with our p-values? No, for a simple reason: in the long run we will make fewer errors with confidence intervals (CIs) than we will with p-values. Eventually we may want to drop CIs for more nuanced inference, but for the time being we would do much better with this simple switch.

If we calculate CIs for every (confirmatory) experiment we ever run, roughly 95% of our CIs will hit the mark (i.e., contain the true population mean). Can we ever know which ones? Tragically, no. But some would feel pretty good about the process being used if it has only a 5% lifetime error rate. One could achieve a lower error rate by stretching the intervals (to, say, 99%), but that would leave them embarrassingly wide for most.
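
To make that coverage claim concrete, here is a minimal simulation, my own sketch rather than anything from the original post: draw many samples from a population with a known mean, build a 95% t-based interval from each sample, and count how often the interval captures the truth. The population parameters and sample size are arbitrary choices for illustration.

```python
# Sketch: long-run coverage of 95% confidence intervals for a mean.
# Population parameters and sample size are arbitrary illustrative choices.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_mean, sd, n, reps = 50.0, 10.0, 30, 10_000

hits = 0
for _ in range(reps):
    sample = rng.normal(true_mean, sd, n)
    lo, hi = stats.t.interval(0.95, df=n - 1,
                              loc=sample.mean(), scale=stats.sem(sample))
    hits += lo <= true_mean <= hi  # did this interval capture the truth?

print(f"Coverage over {reps} experiments: {hits / reps:.3f}")  # ~0.95
```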

If we use p we will be wrong 5% of the time in the long run when we are testing a true null-hypothesis (i.e., no association between variables, or no difference between means, etc., and assuming the analysis is 100% pre-planned). But when we are testing a false null-hypothesis we will be wrong roughly 40-50% of the time or more in the long run (Button et al., 2013; Sedlmeier & Gigerenzer, 1989). If you are one of the many who do not believe a null-hypothesis can actually be true, then we are always in the latter scenario, with that huge error rate. In many cases (i.e., studying smallish and noisy effects, like most of psychology) we would literally be better off flipping a coin and declaring our result "significant" whenever it lands heads.
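
A companion sketch, again with assumed numbers of my own choosing, shows where the 40-50%-or-worse figure comes from: with a smallish true effect (d = 0.4) and n = 30 per group, the kind of design the citations above describe, a two-sided test at alpha = .05 misses the effect roughly two-thirds of the time.

```python
# Sketch: long-run Type II error rate when the null is false.
# Effect size and group size are assumed values typical of underpowered studies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
d, n, reps = 0.4, 30, 10_000       # standardized effect, per-group n

misses = 0
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)  # true effect is d, so the null is false
    _, p = stats.ttest_ind(control, treatment)
    misses += p >= 0.05                # failed to detect a real effect

print(f"Type II error rate: {misses / reps:.2f}")  # ~0.67 at this power
```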

There is a limitation to this benefit of CIs, and this limitation is self-imposed. We cannot escape the monstrous error rates associated with p if we report CIs but then interpret them as if they are significance tests (i.e., reject the null whenever its value falls outside the interval). Switching to confidence intervals will do nothing if we use them as a proxy for p. So the question then becomes: do people actually interpret CIs simply as a null-hypothesis significance test? Yes, unfortunately they do (Coulson et al., 2010).
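
That equivalence is easy to verify in a short sketch (same assumptions as above): a 95% CI for a mean difference excludes zero exactly when the two-sided t-test gives p < .05, so reading intervals only this way reproduces NHST, error rates and all.

```python
# Sketch: a 95% CI excluding 0 is equivalent to p < .05 (two-sided t-test).
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, 40)   # group means and sizes are arbitrary choices
b = rng.normal(0.3, 1.0, 40)

_, p = stats.ttest_ind(a, b)   # pooled-variance t-test by default

# Pooled-variance 95% CI for the mean difference.
n1, n2 = len(a), len(b)
diff = b.mean() - a.mean()
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
se = np.sqrt(sp2 * (1 / n1 + 1 / n2))
crit = stats.t.ppf(0.975, df=n1 + n2 - 2)
lo, hi = diff - crit * se, diff + crit * se

print(f"95% CI: ({lo:.3f}, {hi:.3f}), p = {p:.4f}")
print("CI excludes 0  <=>  p < .05:", (lo > 0 or hi < 0) == (p < 0.05))
```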

References

Button, K. S., Ioannidis, J. P., Mokrysz, C., Nosek, B. A., Flint, J., Robinson, E. S., & Munafò, M. R. (2013). Power failure: Why small sample size undermines the reliability of neuroscience. Nature Reviews Neuroscience, 14(5), 365-376.

Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65(3), 145-153.

Coulson, M., Healey, M., Fidler, F., & Cumming, G. (2010). Confidence intervals permit, but don't guarantee, better inference than statistical significance testing. Frontiers in Psychology, 1, 26.

Sedlmeier, P., & Gigerenzer, G. (1989). Do studies of statistical power have an effect on the power of studies? Psychological Bulletin, 105(2), 309.

http://datacolada.org/2014/10/08/28-confidence-intervals-dont-change-how-we-think-about-data/

Reviews

  • Emil O. W. Kirkegaard

    "confidence intervals ( here's part 1)."

    Unnecessary space.

    "No, for a simple reason: In the long-run we will make less errors with confidence intervals (CIs) than we will with p."

    Change to "p-values".

    "If we calculate CIs for every (confirmatory) experiment we ever run, roughly 95% of our CIs will hit the mark (i.e., contain the true population mean)."

    Depending on whether there are other QRPs in use. If there are, then maybe not, unless you have subtly redefined "population mean".

    "If we use p we will be wrong 5% of the time in the long-run when we are testing a )."

    Missing content. From looking at the original blog post, this is Winnower's fault.

    "But when we are testing a false null-hypothesis then we will be wrong roughly 40-50% of the time or more in the long-run"

    This is true only if one concludes from a non-significant finding that the population effect is 0. While this is sometimes done, it is not a necessary part of NHST.

    "In many cases (i.e., studying smallish and noisy effects- like most of psychology) we would literally be better off by flipping a coin and declaring our result “significant” whenever it lands heads."

    You are claiming it is worse than chance. I think you should substantiate that.

    "There is a limitation to this benefit of CIs, and this limitation is self-imposed. We cannot escape the monstrous error rates associated with p if we report CIs but then interpret them as if they are significance tests (i.e., reject if null value falls inside the interval). Switching to confidence intervals will do nothing if we use them as a proxy for p. So the question then becomes: Do people actually interpret CIs simply as a null-hypothesis significance test? Yes, unfortunately they do (Coulson et al., 2010)."

    Confidence intervals, however, need not necessarily be interpreted that way, and even when they are, they still give additional information, namely the standard error. Cf. Schmidt, F. L., & Hunter, J. E. (1997). Eight common but false objections to the discontinuation of significance testing in the analysis of research data. What if there were no significance tests, 37-64.

    Probably switching to Bayesian stats will be too difficult for many scientists and especially students in the less intelligent fields (e.g. psych).

    • Alexander Etz

      Hi, thanks for reviewing. It seems The Winnower is still working out the kinks of the web importer.

      "This is true only if one concludes from a non-significant finding to that the population effect is 0. While this is sometimes done, it is not a necessary part of NHST." If we are examining power, then it is inherent to NHST to accept the null when you fail to reject it. Neyman and Pearson are very clear that you do not have to personally believe it, but you should act as if it were true. They call it inductive behavior to clarify the distinction. Fisher didn't believe in the concept of power or type-2 errors, so it's moot if you subscribe to his framework of course.

      "You are claiming it is worse than chance. I think you should substantiate that." I did, I cited work showing power is less than 50%.

      "Probably switching to Bayesian stats will be too difficult for many scientists and especially students in the less intelligent fields (e.g. psych)." That is incredibly condescending.

      • Emil O. W. Kirkegaard

        Does it matter what Neyman and Pearson said? I mean, it is clearly wrong to always accept a NH when one has failed to reject it. Often this would be caused by low power or bad measures, data errors, statistical errors, and other mistakes having nothing to do with whether the NH is approximately true or not. One could reject any hypothesis simply by continuously subsetting the data until all tests give p > .05. I've never heard anyone seriously defend this idea (except Neyman and Pearson, according to you).

        So your claim that it is worse than 50% depends on that other claim, OK. This point has also been made by Schmidt and Hunter in their famous book. However, NHST clearly does not require always accepting NHs when they fail to be rejected. Most authors probably realize this fact at least sometimes. Many papers ascribe their lack of a 'significant finding' to low sample size.

        Condescending or not, it is probably true. Psychologists are not very bright compared to other academics. Bayesian statistics is more difficult than frequentist statistics. Do you have any evidence to the contrary? I'm curious if you can find data about the relative difficulty of the two. I assume BS is more difficult. I googled around but didn't find any actual data. It should be easy enough to test with intro psych students. Data by academic field here: http://emilkirkegaard.dk/en/?p=3925

    • Joshua Nicholson

      Hello. On behalf of The Winnower I'd like to apologize for the formatting errors introduced on our end during import as well as the formatting errors in the comment box. We will work on fixing these!

License

This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.