Bad Stats: Not what it Seems

Transparent statistical communication in HCI research

Pierre Dragicevic and colleagues

“In the post p<0.05 era, scientific argumentation is not based on whether a p-value is small enough or not. Attention is paid to effect sizes and confidence intervals. Evidence is thought of as being continuous rather than some sort of dichotomy.”    Ron Wasserstein, executive director of the American Statistical Association, 2016.
We started this page in 2014 to collect arguments and material to help us push back against reviewers who insisted that we should report p-values and declarations of statistical significance. It became a manifesto explaining why it would be beneficial for human-computer interaction and information visualization to move beyond mindless null hypothesis significance testing (NHST), and to focus instead on presenting informative charts with effect sizes and their interval estimates, and on drawing nuanced conclusions. Our scientific standards can also be greatly improved by planning analyses in advance and sharing experimental material online. At the bottom of this page you will find studies published at CHI and VIS without any p-value, some of which have received best paper awards.

👉 2024 update

This web page has not been updated much since 2015, except for the "original material" section below, where we occasionally feature some of our work on statistical communication. Over the past 10 years, I (Pierre) have slightly changed my mind on a few things:

  • I'm now more inclined to say that it's OK to report p-values and use them to draw conclusions, as long as they're not used to make dichotomous inferences, i.e., to put results in two buckets: significant and non-significant. I remain firmly opposed to such inferences and frustrated by their persistence. My current position is summarized in the introduction of this 2019 paper by Lonni Besançon and myself, which has arguments and many references.
  • I'm now also more inclined to accept that it's hard for people not to interpret confidence intervals in a dichotomous way, and perhaps reporting evidence as continuous distributions instead of interval estimates (e.g., reporting posterior probability distributions from Bayesian analyses, as in Matthew Kay's papers) may partly address the problem.
  • However, I came to think that the core issues go beyond the distinction between NHST and estimation, and between frequentist and Bayesian statistics:
    • People want yes/no answers to research questions. Many authors and readers don't know what to do with nuanced and uncertain conclusions. Many readers are uncomfortable with papers reporting only interval estimates — I can only imagine how uncomfortable they will be with papers reporting only continuous distributions. I expect many paper rejections.
    • It's very hard for people to appreciate that the results from any statistical analysis are as unreliable as the data itself. I did a presentation on this topic in 2016.
    • Yvonne adds: the underlying issue seems to be, at least in part, one of mentality. Science is a process, not a machine that produces certainty. We all need to learn to accept this, instead of trying to find new ways to turn stuff into gold. Alchemy isn't real, and there is no magic way to turn scientific data into definitive results. Those who want to reject papers because the results are not definitive (despite the research having been done very rigorously) need to get used to the fact that this is how science works.
  • Finally, I came to think that many issues arise from researchers failing to appreciate the importance of distinguishing between planned and unplanned analyses (and thus of pre-registering studies), and that it is really a shame that over 300 journals now accept registered reports as a submission format, but no HCI or VIS journal currently does, as we pointed out in this 2021 paper. One recent exception is JoVI, a fantastic journal with many desirable characteristics, but it is still nascent, and it remains more attractive for students to submit their work to more prestigious venues.

Table of Contents

Original material:
2024 – The Many Ways of Being Transparent in HCI Research (paper)
2021 – Publishing Visualization Studies as Registered Reports (paper)
2021 – Can visualization alleviate dichotomous thinking? (paper)
2020 – Threats of a replication crisis in empirical computer science (paper)
2019 – Explorable multiverse analyses (paper)
2019 – Dichotomous inferences in HCI (paper)
2018 – What are really effect sizes? (blog post)
2018 – My first preregistered study
2018 – Special Interest Group on transparent statistics guidelines (workshop)
2017 – Moving transparent statistics forward (workshop)
2016 – Statistical Dances (keynote)
2016 – Special Interest Group on Transparent Statistics (workshop)
2016 – Fair Statistical Communication in HCI (book chapter)
2014 – Bad Stats are Miscommunicated Stats (keynote)
2014 – Running an HCI Experiment in Multiple Parallel Universes (paper)

External resources:
Quotes about NHST
Links about NHST
Reading List
More Readings
Papers (somewhat) in favor of NHST
Papers against confidence intervals
Papers from the HCI Community
Examples of HCI and VIS studies without p-values

To Start with

Listen to this 15-minute speech from Geoff Cumming, a prominent researcher in statistical cognition. This will give you a good overview of what follows.

Original Material

2024 – The Many Ways of Being Transparent in HCI Research

My colleagues and I have shared a pre-print, The Many Ways of Being Transparent in Human-Computer Interaction Research.

Research transparency has become a common topic of discussion in many fields, including human-computer interaction (HCI). However, constructive discussions require a shared understanding, and the diversity of contribution types and ways of doing research in HCI have made dialog about research transparency difficult. With this paper, we aim to facilitate such dialog by proposing a definition of research transparency that is broadly encompassing to diverse types of HCI research: Research transparency refers to honesty and clarity in all communications about the research processes and outcomes—to the extent possible. We unpack this definition to highlight the different facets of research transparency and offer examples of how transparency can be practiced across different HCI contribution types, including non-empirical contributions. With this article, we argue that research transparency does not have to be inflexible and that transparency in HCI research can be achieved in many ways depending on one’s methodology or contribution type.

2021 – Publishing Visualization Studies as Registered Reports

My colleagues and I have published a paper at alt.VIS 2021, Publishing Visualization Studies as Registered Reports: Expected Benefits and Researchers' Attitudes.

Registered Reports are publications in which study proposals are peer reviewed and pre-accepted before the study is run. Their adoption in other disciplines has been found to promote research quality and save time and resources. We offer a brief introduction to Registered Reports and their expected benefits for visualization research. We then report on a survey on the attitudes held in the Visualization community towards Registered Reports.

2021 – Can visualization alleviate dichotomous thinking?


My colleagues Jouni Helske, Satu Helske, Matthew Cooper, Anders Ynnerman and Lonni Besançon published a paper in IEEE TVCG, Can visualization alleviate dichotomous thinking? Effects of visual representations on the cliff effect. They studied whether visualization can alleviate dichotomous thinking about inferential statistics and produce smoother "confidence profiles" over the results. They found that, compared to textual representations of p-values and CIs, the classic CI visualization increased the drop in confidence around the underlying p-value of 0.05. Gradient CIs and violin CIs, however, produced a smaller cliff effect and overall smoother confidence profiles.

2020 – Threats of a replication crisis in empirical computer science


Together with Andy Cockburn, Carl Gutwin, and Lonni Besançon, I published a paper on the threats of a replication crisis in empirical computer science and the strategies that can be adopted to mitigate these threats. We found evidence that papers in empirical computer science still rely on binary interpretations of p-values, which have been shown to likely lead to a replication crisis. Most of the solutions we present rely on adopting more transparency in reporting and embracing more nuanced interpretations of statistical results. We also suggest adopting new publication formats such as registered reports.

2019 – Explorable Multiverse Analyses


We published a paper on explorable multiverse analysis reports, a new approach to statistical reporting where readers of research papers can explore alternative analysis options by interacting with the paper itself.

2019 – Dichotomous Inferences in HCI


Lonni Besançon and I published a paper at alt.chi 2019 entitled The Continued Prevalence of Dichotomous Inferences at CHI. We use the term dichotomous inference for the classification of statistical evidence as either sufficient or insufficient, which is most commonly done through null hypothesis significance testing (NHST). Although predominant, dichotomous inferences have proven to cause countless problems. Thus, an increasing number of methodologists have been urging researchers to recognize the continuous nature of statistical evidence and to ban dichotomous inferences. We wanted to see whether they have had any influence on CHI. Our analysis of CHI proceedings from the past nine years suggests that they have not.

In the author version of our paper, we included commentaries from 11 researchers and an authors' response. Commentaries are from Xiaojun Bi, Géry Casiez, Andy Cockburn, Geoff Cumming, Jouni Helske, Jessica Hullman, Matthew Kay, Arnaud Prouzeau, Theophanis Tsandilas, Chat Wacharamanotham, and Shumin Zhai. We are very grateful for their time.

2018 – What are Really Effect Sizes?


In the Transparent Statistics Blog, I answer the question Can we call mean differences "effect sizes"?. The answer is yes. In 2020, I wrote a research report, A Mean Difference is an Effect Size.

2018 – My First Preregistered Study


After four years of planning all my statistical analyses, I'm happy to say that I just conducted my first preregistered study, together with Lonni Besançon, Tobias Isenberg, Amir Semmo, and collaborators from clinical medicine. The paper title is Reducing Affective Responses to Surgical Images through Color Manipulation and Stylization.

All our R analysis scripts were written ahead of time and tested on simulated data. This made it very easy to preregister our study, as we just had to upload the code and add a few details about planned sample sizes and data exclusion criteria.
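
To give an idea of what this looks like in practice, here is a minimal sketch in Python (our actual scripts were in R, and the data, sample size, and effect below are made up purely for illustration): the planned analysis is written as a function and exercised on simulated data, so it can be preregistered before any real data exists.

    import numpy as np

    rng = np.random.default_rng(42)

    def simulate_data(n=24):
        # Hypothetical task completion times (seconds) for two conditions.
        # The assumed 2-second effect only serves to exercise the analysis code.
        baseline = rng.normal(loc=12.0, scale=3.0, size=n)
        new_tech = rng.normal(loc=10.0, scale=3.0, size=n)
        return baseline, new_tech

    def planned_analysis(baseline, new_tech):
        # Planned analysis: unstandardized effect size (difference in means)
        # with an approximate 95% confidence interval.
        diff = new_tech.mean() - baseline.mean()
        se = np.sqrt(baseline.var(ddof=1) / len(baseline) +
                     new_tech.var(ddof=1) / len(new_tech))
        return diff, (diff - 1.96 * se, diff + 1.96 * se)

    # Run the planned analysis end to end on simulated data before
    # preregistering it and collecting real data.
    print(planned_analysis(*simulate_data()))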

2018 – CHI Special Interest Group on Transparent Statistics Guidelines


With Chat Wacharamanotham, Matthew Kay, Steve Haroz and Shion Guha, I co-organized a CHI SIG titled Special Interest Group on Transparent Statistics Guidelines, where we solicited feedback from the HCI community on a first working draft of the Transparent Statistics Guidelines and engaged potential contributors to push the transparent statistics movement forward.

2017 – CHI Workshop on Moving Transparent Statistics Forward


With Matthew Kay, Steve Haroz, Shion Guha and Chat Wacharamanotham, I co-organized a CHI workshop titled Moving Transparent Statistics Forward as a follow-up to our Special Interest Group on Transparent Statistics. The goal of this workshop was to start developing concrete guidelines for improving statistical practice in HCI. Contact us if you'd like to contribute.

2016 – BioVis Primer Keynote – Statistical Dances


https://www.youtube.com/watch?v=UKX9iN0p5_A

I gave a keynote talk at the BioVis 2016 symposium titled Statistical Dances: Why No Statistical Analysis is Reliable and What To Do About It.

I gave a similar talk in Grenoble in June 2017 that you can watch on the GRICAD website or on YouTube (embedded video above). You can also download the video file (1.82 GB). Thanks to François Bérard and Renaud Blanch for inviting me, and to Arnaud Legrand for organizing this event.

Summary of the Talk: It is now widely recognized that we need to improve the way we report empirical data in our scientific papers. More formal training in statistics is not enough. We also need good "intuition pumps" to develop our statistical thinking skills. In this talk I explore the basic concept of statistical dance. The dance analogy has been used by Geoff Cumming to describe the variability of p-values and confidence intervals across replications. I explain why any statistical analysis and any statistical chart dances across replications. I discuss why most attempts at stabilizing statistical dances (e.g., increasing power or applying binary accept/reject criteria) are either insufficient or misguided. The solution is to embrace the uncertainty and messiness in our data. We need to develop a good intuition of this uncertainty and communicate it faithfully to our peers. I give tips for conveying and interpreting interval estimates in our papers in an honest and truthful way.

Animated plots by Pierre Dragicevic and Yvonne Jansen.

Erratum: In the video linked above I said that we need to square the sample size to make the dances twice as small. I should have said that we need to multiply the sample size by four.
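
The reason is that the width of a confidence interval (and the spread of its dance) shrinks with the square root of the sample size, so halving it requires quadrupling the number of participants. Here is a rough simulation sketch that illustrates this (the standard deviation and sample sizes are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    def mean_ci_width(n, sd=3.0, reps=10000):
        # Average width of a 95% CI of the mean over many simulated replications.
        samples = rng.normal(0.0, sd, size=(reps, n))
        se = samples.std(axis=1, ddof=1) / np.sqrt(n)
        return np.mean(2 * 1.96 * se)

    # Quadrupling n roughly halves the CI width, and hence the dance.
    print(mean_ci_width(20), mean_ci_width(80))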

2016 – CHI Special Interest Group on Transparent Statistics in HCI


With Matthew Kay, Steve Haroz and Shion Guha, I co-organized a SIG meeting at CHI '16 titled Transparent Statistics in HCI. We propose to define transparent statistics as a philosophy of statistical reporting whose purpose is scientific advancement rather than persuasion. The meeting generated lots of interest and we are now looking at forming a community and developing some concrete recommendations to move the field forward. If you are interested, join our mailing list.

2016 – Book Chapter: Fair Statistical Communication in HCI


I wrote a book chapter titled Fair Statistical Communication in HCI that attempts to explain why and how we can do stats without p-values in HCI and Vis. The book is titled Modern Statistical Methods for HCI and is edited by Judy Robertson and Maurits Kaptein. I previously released an early draft as a research report titled HCI Statistics without p-values. However, I recommend that you use the Springer chapter instead (pre-print link below), as it's more up-to-date and much improved.

Get the final book chapter from Springer
Download author version (PDF 1.8 MB)

See errata, updated tips, and responses.

2014 – BELIV Keynote: Bad Stats are Miscommunicated Stats


I gave a keynote talk at the BELIV 2014 biennial workshop entitled Bad Stats are Miscommunicated Stats.

Summary of the talk: When reporting on user studies, we often need to do stats. But many of us have little training in statistics, and we are just as anxious about doing it right as we are eager to incriminate others for any flaw we might spot. Violations of statistical assumptions, too small samples, uncorrected multiple comparisons—deadly sins abound. But our obsession with flaw-spotting in statistical procedures makes us miss far more serious issues and the real purpose of statistics. Stats are here to help us communicate about our experimental results for the purpose of advancing scientific knowledge. Science is a cumulative and collective enterprise, so miscommunication, confusion and obfuscation are much more damaging than moderately inflated Type I error rates.

In my talk, I argue that the most common form of bad stats is miscommunicated stats. I also explain why we have all been faring terribly according to this criterion—mostly due to our blind endorsement of the concept of statistical significance. This idea promotes a form of dichotomous thinking that not only gives a highly misleading view of the uncertainty in our data, but also encourages questionable practices such as selective data analysis and various other forms of convolutions to reach the sacred .05 level. While researchers’ reliance on mechanical statistical testing rituals is both deeply entrenched and severely criticized in a range of disciplines—and has been so for more than 50 years—it is particularly striking that it has been so easily endorsed by our community. We repeatedly stress the crucial role of human judgment when analyzing data, but do the opposite when we conduct or review statistical analyses from user experiments. I believe that we can cure our schizophrenia and substantially improve our scientific production by banning p-values, by reporting empirical data using clear figures with effect sizes and confidence intervals, and by learning to provide nuanced interpretations of our results. We can also dramatically raise our scientific standards by pre-specifying our analyses, fully disclosing our results, and sharing extensive replication material online. These are small but important reforms that are much more likely to improve science than methodological nitpicking on statistical testing procedures.

2014 – Alt.CHI Paper: Running an HCI Experiment in Multiple Parallel Universes


Pierre Dragicevic, Fanny Chevalier and Stéphane Huot (2014) Running an HCI Experiment in Multiple Parallel Universes. In ACM CHI Conference on Human Factors in Computing Systems (alt.chi). Toronto, Canada, April 2014.

The purpose of this alt.chi paper was to raise the community's awareness of the need to question the statistical procedures we currently use to interpret and communicate our experiment results, i.e., the ritual of null hypothesis significance testing (NHST). The paper does not elaborate on the numerous problems of NHST, nor does it discuss how statistics should ideally be done, as these points have already been covered in hundreds of articles published across decades in many disciplines. The interested reader can find a non-exhaustive list of references below.

A comment often elicited by this paper is that HCI researchers should learn to use NHST properly, or that they should use larger sample sizes. This is not the point we are trying to make. Also, we are not implying that we should reject inferential statistics altogether, or that we should stop caring about statistical error. See the discussion thread from the alt.chi open review process. Our personal position is that HCI research can and should get rid of NHST procedures and p-values, and instead switch to reporting (preferably unstandardized) effect sizes with interval estimates — e.g., 95% confidence intervals — as recommended by many methodologists.

Video Teaser

In our presentation we take an HCI perspective to rebut common arguments against the discontinuation of NHST, namely: i) the problem is NHST misuse, ii) the problem is low statistical power, iii) NHST is needed to test hypotheses or make decisions, iv) p-values and confidence intervals are equivalent anyway, v) we need both.

Quotes About Null Hypothesis Significance Testing (NHST)

The number of papers stressing the deep flaws of NHST is simply bewildering. Sadly, awareness of this literature seems very low in HCI and Infovis. Yet most criticisms are not about the theoretical underpinnings of NHST, but about its usability. Will HCI and Infovis set a good example for other disciplines?

“...no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
   Sir Ronald A. Fisher (1956), quoted by Gigerenzer (2004).
“[NHST] is based upon a fundamental misunderstanding of the nature of rational inference, and is seldom if ever appropriate to the aims of scientific research.”
   Rozeboom (1960), quoted by Levine et al. (2008)
“Statistical significance is perhaps the least important attribute of a good experiment; it is never a sufficient condition for claiming that a theory has been usefully corroborated, that a meaningful empirical fact has been established, or that an experimental report ought to be published.”
   Lykken (1968), quoted by Levine et al. (2008)
“Small wonder that students have trouble [learning significance testing]. They may be trying to think.”
   Deming (1975), quoted by Ziliak and McCloskey (2009)
“I believe that the almost exclusive reliance on merely refuting the null hypothesis as the standard method for corroborating substantive theories is a terrible mistake, is basically unsound, poor scientific strategy, and one of the worst things that ever happened in the history of psychology. I am not making some nit-picking statistician’s correction. I am saying that the whole business is so radically defective as to be scientifically almost pointless.”
   Paul Meehl (1978), quoted by Levine et al. (2008)
“One of the most frustrating aspects of the journal business is the null hypothesis. It just will not go away. [...]. It is almost impossible to drag authors away from their p values [...]. It is not uncommon for over half the space in a results section to be composed of parentheses inside of which are test statistics, degrees of freedom, and p values. Perhaps p values are like mosquitos. They have an evolutionary niche somewhere and no amount of scratching, swatting, or spraying will dislodge them.”
   John Campbell (1982)
“After 4 decades of severe criticism, the ritual of null hypothesis significance testing -- mechanical dichotomous decisions around a sacred .05 criterion -- still persists.”
   Jacob Cohen (1994)
“Statistical significance testing retards the growth of scientific knowledge; it never makes a positive contribution.”
   Schmidt and Hunter (1997)
“Logically and conceptually, the use of statistical significance testing in the analysis of research data has been thoroughly discredited.”
   Schmidt and Hunter (1997), quoted by Levine et al. (2008)
“D. Anderson, Burnham and W.Thompson (2000) recently found more than 300 articles in different disciplines about the indiscriminate use of NHST, and W. Thompson (2001) lists more than 400 references on this topic. [...] After review of the debate about NHST, I argue that the criticisms have sufficient merit to support the minimization or elimination of NHST.”
   Rex B Kline (2004)
“Our unfortunate historical commitment to significance tests forces us to rephrase good questions in the negative, attempt to reject those nullities, and be left with nothing we can logically say about the questions.”
   Killeen (2005), quoted by Levine et al. (2008)
“If these arguments are sound, then the continuing popularity of significance tests in our peer-reviewed journals is at best embarrassing and at worst intellectually dishonest.”
   Lambdin (2012)
“Many scientific disciplines either do not understand how seriously weird, deficient, and damaging is their reliance on null hypothesis significance testing (NHST), or they are in a state of denial.”
   Geoff Cumming, in his open review of our alt.chi paper
“A ritual is a collective or solemn ceremony consisting of actions performed in a prescribed order. It typically involves sacred numbers or colors, delusions to avoid thinking about why one is performing the actions, and fear of being punished if one stops doing so. The null ritual contains all these features.”
   Gerd Gigerenzer, 2015
“The widespread use of “statistical significance” (generally interpreted as “p ≤ 0.05”) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process.”
   The American Statistical Association, 2016

Links

Note: the material and references that follow are generally older than 2015. A lot has been written since then, and I have also come across a lot more interesting material, particularly on pre-registration and registered reports. I haven't found the courage to update the lists, but the papers below are still worth a read.

To learn more about issues with NHST and how they can be easily addressed with interval estimates, here are two good starting points that will only take 25 minutes of your time in total.

  • (Cumming, 2011) Significant Does not Equal Important: Why we Need the New Statistics (radio interview).
    A short podcast accessible to a wide audience. Focuses on the understandability of p values vs. confidence intervals. In an ideal world, this argument alone should suffice to convince people to whom usability matters.
  • (Cumming, 2013) The Dance of p values (video).
    A nice video demonstrating how unreliable p values are across replications. Methodologist and statistical cognition researcher Geoff Cumming has communicated a lot on this particular problem, although many other problems have been covered in previous literature (references below). An older version of this video was the main source of inspiration for our alt.chi paper (see the small simulation sketch after this list).
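
If you would rather see the dance for yourself, the following rough simulation sketch (the sample size and effect size are arbitrary) replicates the exact same two-group experiment twenty times and prints the p-values, which jump around considerably even though nothing changes between replications:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)

    def one_replication(n=32, effect=0.5):
        # The same experiment every time: two groups, a fixed true effect of 0.5 SD.
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        return stats.ttest_ind(a, b).pvalue

    # Twenty exact replications of the same experiment: the p-values "dance".
    print([round(one_replication(), 3) for _ in range(20)])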

Reading List

These are suggested initial readings. They are often quite accessible and provide good coverage of the problems with NHST, as well as their solutions.

More Readings

List not maintained and basically not updated since 2015.

Other references, some of which are rather accessible, while others may be a bit technical for non-statisticians like most of us. It is best to skip the difficult parts and return to the paper later on. I have plenty of other references that I won't have time to add here; for more, see my book chapter.

  • (Kline, 2004) What's Wrong With Statistical Tests--And Where We Go From Here.
    A broad overview of the problems of NHST and of the publications questioning significance testing (~300 since the 1950s). Recommends downplaying or eliminating NHST, reporting effect sizes with confidence intervals instead, and focusing on replication.
  • (Amrhein​ et al, 2017) The earth is flat (p>0.05): Significance thresholds and the crisis of unreplicable research.
    An up-to-date and exhaustive review of arguments against the use of a significance threshold with p-values. The authors do not advocate banning p-values, but provide a very strong and convincing critique of binary significance, with many references. Still at the pre-print stage.
  • (Gigerenzer, 2004) Mindless statistics.
    Another (better) version of Gigerenzer's paper The Null Ritual - What You Always Wanted to Know About Significance Testing but Were Afraid to Ask. Explains why the NHST ritual is a confusing combination of Fisher and Neyman/Pearson methods. Explains why few people understand p, the problems with meaningless nil hypotheses, the problems with the use of "post-hoc alpha values", the pressure researchers put on students and editors put on researchers, etc. An important paper, although not a very easy read.
  • (Ziliak and McCloskey, 2009) The Cult of Statistical Significance.
    A harsh criticism of NHST, from the authors of the book "The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives". Nicely explains why effect size is more important than precision (which refers to both p values and widths of confidence intervals).
  • (Gelman, 2013) Interrogating p-values.
    Explains towards the end why p-values are inappropriate even for making decisions, and that effect sizes should be used instead.
  • (Cumming, 2008) p Values Predict the Future Only Vaguely, but Confidence Intervals Do Much Better.
    Examines the unreliability of p values and introduces p intervals. For example, when one obtains p = .05, there is an 80% chance that the next replication yields p inside [.00008, .44], and a 20% chance it falls outside (see the rough simulation sketch after this list).
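
As a rough simulation sketch of Cumming's p intervals (under one plausible set of assumptions, not necessarily his exact derivation: the initial study yields z of about 1.96, the replication is equally noisy, and nothing else is known about the true effect, so the replication z differs from 1.96 by normal noise with variance 2):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)

    # An initial two-tailed p = .05 corresponds to z of about 1.96.
    z_obs = 1.96
    z_rep = z_obs + rng.normal(0.0, np.sqrt(2.0), size=200000)
    p_rep = 1 - stats.norm.cdf(z_rep)  # one-tailed replication p values

    # Central 80% interval of replication p values: roughly [.0001, .44].
    print(np.percentile(p_rep, [10, 90]))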

Effect size

Effect size does not mean complex statistics. If you report mean task completion times, or differences in mean task completion times, you're reporting effect sizes.
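
As a trivial illustration with made-up numbers, a difference in mean completion times is already an (unstandardized) effect size:

    import numpy as np

    # Hypothetical task completion times (seconds) for two techniques.
    technique_a = np.array([11.2, 13.5, 9.8, 12.1, 14.0, 10.6, 12.7, 11.9])
    technique_b = np.array([9.4, 11.0, 8.7, 10.2, 12.3, 9.1, 10.8, 10.0])

    # The effect size: technique B is about 1.8 seconds faster on average.
    print(technique_a.mean() - technique_b.mean())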

Bootstrapping

Bootstrapping is an easy method for computing confidence intervals for many kinds of distributions. It was too computationally intensive to even be conceivable in the past, but now we have this thing called a computer. Oh, and it's non-deterministic — if this disturbs you, then maybe take another look at our alt.chi paper!
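
For instance, here is a minimal percentile-bootstrap sketch (reusing the made-up completion times from above) that resamples each group many times to get a 95% confidence interval for the difference in means:

    import numpy as np

    rng = np.random.default_rng(3)

    technique_a = np.array([11.2, 13.5, 9.8, 12.1, 14.0, 10.6, 12.7, 11.9])
    technique_b = np.array([9.4, 11.0, 8.7, 10.2, 12.3, 9.1, 10.8, 10.0])

    def bootstrap_ci(a, b, reps=10000, level=95):
        # Resample each group with replacement and recompute the mean difference.
        diffs = [rng.choice(a, len(a)).mean() - rng.choice(b, len(b)).mean()
                 for _ in range(reps)]
        tail = (100 - level) / 2
        return np.percentile(diffs, [tail, 100 - tail])

    # 95% percentile bootstrap CI for the difference in mean completion times.
    print(bootstrap_ci(technique_a, technique_b))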

Statistical errors

In addition to having been greatly over-emphasized, the notion of type I error is misleading. One issue is that the nil hypothesis of no effect is almost always false, so it's practically impossible to commit a type I error. At the same time, many NHST critics stress the absurdity of testing nil hypotheses, but this is rarely what people (especially in HCI) are really doing. Some articles on this page clarify this; here are more articles specifically on the topic of statistical errors. More to come.

Multiple comparisons

How to deal with multiple comparisons? Use a simple experiment design, pick meaningful comparisons in advance, and use interval estimates.

Contrast analysis

Still looking for an accessible introduction. Suggestions welcome.

Hypotheses

HCI is a young field and rarely has deep hypotheses to test. If you only have superficial research questions, frame them in a quantitative manner (e.g., is my technique better and if so, to what extent?) and answer them using estimation. If you have something deeper to test, be careful how you do it. It's hard.

  • (Meehl, 1967) Theory-testing in Psychology and Physics: A Methodological Paradox.
    Contrasts substantive theories with statistical hypotheses, and explains why testing the latter provides only a very weak confirmation of the former. Also clarifies that point nil hypotheses in NHST are really used for testing directional hypotheses (see also the 'Statistical errors' section above). Paul Meehl is an influential methodologist and has written other important papers, but his style is a bit dry. I will try to add more references from him to this page later on.

Bayesian statistics

  • (Kruschke, 2010) Doing Bayesian Data Analysis: A Tutorial with R and BUGS.
    This book has been recommended to me several times as an accessible introduction to Bayesian methods, although I have not read it yet. If you are willing to invest the time, Bayesian (credible) intervals are worth studying, as they have several advantages over confidence intervals.
  • (McElreath, 2015) Statistical Rethinking.
    This book is recommended by Matthew Kay for its very accessible introduction to Bayesian analysis. Pick this one or Kruschke's.
  • (Kruschke and Liddell, 2015) The Bayesian New Statistics: Two historical trends converge.
    Explains why estimation (rather than null hypothesis testing) and Bayesian (rather than frequentist) methods are complementary, and why we should do both.

Other papers cited in the BELIV talk

  • (Dawkins, 2011) The Tyranny of the Discontinuous Mind.
    A short but insightful article by Richard Dawkins that explains why dichotomous thinking, or more generally categorical thinking, can be problematic.
  • (Rosenthal and Gaito, 1964) Further evidence for the cliff effect in interpretation of levels of significance.
    A short summary of early research in statistical cognition showing that people trust p-values that are slightly below 0.05 much more than p-values that are slightly above. This seems to imply that Fisher's suggestion to minimize the use of decision cut-offs and simply report p-values as a measure of strength of evidence may not work.

Papers (somewhat) in favor of NHST

List not maintained and basically not updated since 2015.

To fight confirmation bias here is an attempt to provide credible references that go against the discontinuation of NHST. Many are valuable in that they clarify what NHST is really about.

Papers against confidence intervals

List not maintained and basically not updated since 2015.

For several decades Bayesians have (more or less strongly) objected to the use of confidence intervals, but their claim that "CIs don't solve the problems of NHST" is misleading. Confidence intervals and credible (Bayesian) intervals are both interval estimates, thus a great step forward compared to dichotomous testing. Credible intervals arguably rest on more solid theoretical foundations and have a number of advantages, but they also have a higher switching cost. By claiming that confidence intervals are invalid and should never be used, Bayesian idealists may just give NHST users an excuse for sticking to their ritual.

  • (Morey et al, in press) The Fallacy of Placing Confidence in Confidence Intervals.
    This is the most recent paper on the topic, with many authors and with the boldest claims. It is very informative and shows that CIs can behave in pathological ways in specific situations. Unfortunately, it is hard to assess the real consequences, i.e., how much we are off when we interpret CIs as ranges of plausible parameter values in typical HCI user studies. Also see this long discussion thread on an older revision, especially toward the end.

Papers from the HCI Community

List not maintained and basically not updated since 2015.

There have been several papers in HCI questioning current practices, although (until recently) none of them calls for a discontinuation or even minimization of NHST, none of them mentions the poor replicability of p-values, and no satisfactory treatment of interval estimation is provided.

Examples of HCI and VIS studies without p-values

The papers we co-author are by no means prescriptive. We are still polishing our methods and learning. We are grateful to Geoff Cumming for his help and support.

Contact

List of people currently involved in this initiative:

License

All material on this page is CC-BY-SA unless specified otherwise. You can reuse or adapt it provided you link to www.aviz.fr/badstats.