Verifiable research: The missing link between replicability and reproducibility

Konrad Hinsen, Centre de Biophysique Moléculaire, CNRS, France

 

To err is human. Scientists being human, they make mistakes. Many if not most of the rules for doing science are designed to weed out mistakes. Reproducibility and replicability are recognized as playing a central role in this process. But a lot of confusion remains about the difference between these two labels and the relation between them. In this essay, I will explain why replicability is the foundation on top of which reproducibility can be constructed, and introduce verifiability as the missing link between them, which deserves particular attention in the context of computer-aided research.

First, a note about terminology. Some people use "reproducible" and "replicable" in the sense I will soon define, whereas others exchange the definitions of the two terms, and yet others seem to consider them synonyms. I hope that the scientific community will ultimately converge to common definitions, but we aren't there yet.

To make the relation between replicability, reproducibility, and verifiability as clear as possible, I will adopt a high-level perspective that intentionally disregards many details, important though they may be in the practice of doing research. For example, when I speak of mistakes, I include in this category everything that may invalidate conclusions drawn from the results of scientific work. This includes forgetting a factor of 2 in evaluating a formula, but also using an insufficient sample size for a statistical analysis, or using an instrument that turns out to be defective. Even fraud is lumped into my general category of mistakes.

A typical report about a scientific study says, "this is what we did, that is what we observed, and those are our conclusions". Replicability is about the observations, whereas reproducibility is about the conclusions. For theoretical and computational work, substitute "this is what we assume, that is what we compute based on our assumptions, and those are our conclusions". Replication is then about the results of the computations. When the computations are done by hand, this is simply called double-checking.

Replication attempts check for mistakes at a technical level. If someone else can independently replicate experimental work or a computation, that means that the published description is sufficiently precise, and it strongly suggests that the original authors did not make a technical mistake in applying their own protocol. A good replication attempt should therefore follow the published instructions as closely as possible. Any deviation from the original protocol makes the interpretation of the outcome more difficult. If the results are significantly different, it is impossible to say if that is due to a mistake in the original work, a mistake in the replication, or a change introduced in the replication protocol.

Reproduction attempts check for mistakes in the scientific interpretation, and in particular for hidden assumptions. The idea is to perform an experiment or computation that should, according to the expectations of experts in the field, lead to the same conclusions, in spite of intentional changes in the protocol. A reproduction attempt therefore retains the important features of the original work but modifies something that according to the current state of knowledge in the field is an unimportant detail. Obviously, reproduction is much more subjective than replication, because what is or isn't important is a matter of personal judgment.

A crucial but often neglected aspect is that reproduction attempts make little sense unless replicability has been verified first. The reason is again the possibility to learn something from the outcome. If you try to reproduce a finding and fail, then what? Perhaps you misunderstood the original authors' protocol. Perhaps they made a technical mistake, or you did. Perhaps all the technical work was done correctly but some assumption — yours or theirs — was not justified. If you can replicate their technical work first, you know that a different outcome in a reproduction attempt with a modified protocol is not due to some simple mistake, and thus really adds new scientific insight.

This simple principle is something that I learned as a physics student many years ago, so it hardly counts as revolutionary. In his "Cargo Cult Science" lecture from 1974 [1], Richard Feynman explains it to his students, and cites an anecdote going back to 1937 to illustrate why it is so often ignored: replication is seen as boring, uninteresting, and unpublishable. Some have even argued that it is useless [2]. With the growing awareness of reliability issues in science, this attitude is finally beginning to change. To cite only two examples, the OSF Reproducibility Project actively supports replication of psychological studies, and the ReScience journal encourages the publication of replication attempts in computational science. There is still a lot more to do — the problems described in [3] are a good example of the open technical issues, and we still lack an effective incentive structure for encouraging replication work — but replicability in science is clearly improving.

Let's imagine an ideal world in which peer review works to perfection, meaning that all published research work has been shown to be replicable at the technical level. Technical mistakes, including fraud, have been effectively eliminated from this world. We can then concentrate on the scientific level, and aim for reproducibility. Ultimately, the conclusions of a study can be considered reproducible if there are many other studies that come to very similar conclusions. It doesn't really matter if those other studies were explicitly designed to be reproduction attempts, or were performed independently and just happen to be similar. We can just look at our nice collection of replicable technical work and its outcomes, and draw our conclusions.

Drawing conclusions implies making scientific judgments, which is always subjective to some extent. But there is also a technical requirement for drawing conclusions, and that is what I call verifiability. Even perfectly replicable work is no sound basis for scientific conclusions if it is not verifiable. But we never discuss verifiability explicitly, because until not very long ago, it was simply obvious.

Illustration 1: "Then a miracle occurs" cartoon by Sidney Harris [4]. © ScienceCartoonsPlus.com

What non-verifiable work looks like is nicely illustrated in this famous cartoon by Sidney Harris [4]. Even the most superficial reviewer would spot such a manifestly non-verifiable line of reasoning, but subtler cases do end up in the scientific literature. Quite often, reviewers don't take the time to check every argument rigorously if it seems plausible at first sight. But massive widespread non-verifiability, to which reviewers never object, is a recent phenomenon. The ubiquitous modern version of "Then a miracle occurs" is "We used version 2.1 of the program InsightDiscoverer."

The problem with computer software is that even if you can download, install, and run it on your computer, and replicate published results with it, you still do not know what it computes, unless the task and the software are particularly simple. You thus cannot judge if a computation supports a scientific conclusion. In the best imaginable scenario, the software is well documented, so you know what its authors intended it to compute. But to err is human. Programmers being human, they make mistakes [5]. With the exception of very simple software that you can completely understand by reading its source code, verifying that a program does what it is supposed to do is nearly impossible. To make matters worse, for much of today's complex scientific software there isn't even a complete and precise description — technically called a specification — of what it is supposed to do. In the philosophy of science, this problem is called epistemic opacity [6]. Unfortunately, the most common attitude in the scientific community today is to shrug off epistemic opacity as inevitable. Journals dedicated to software papers, such as the Journal of Open Research Software or the Journal of Open Source Software, do not even ask reviewers to comment on the correctness of the software's output, because they know that such a request would be unreasonable.
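To make the problem concrete, consider a deliberately trivial sketch in Python, invented for this discussion and not taken from any real scientific package. Both functions claim to compute "the variance of the data", both replicate perfectly on every run, and yet they compute different quantities:

    import numpy as np

    def variance_a(x):
        """One plausible reading of 'the variance of the data'."""
        return np.var(x)             # population variance: divides by N

    def variance_b(x):
        """Another plausible reading of the same phrase."""
        return np.var(x, ddof=1)     # sample variance: divides by N - 1

    data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
    print(variance_a(data))   # 4.0
    print(variance_b(data))   # about 4.571

A report that merely says "we computed the variance with version 2.1 of our script" is thus replicable but not verifiable: the reader cannot tell which of the two quantities entered the analysis, nor whether the implementation matches the authors' intent. Real scientific software hides many such choices, only far less visibly.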

I suspect this is the reason why we use the term "replication", originally applied to experiments, for computations performed by software, instead of the old-fashioned term "double-checking" that we use for manual computations. Double-checking implies verification, because humans cannot do computations well without understanding their context.

Note that this is not just an issue for computational science, i.e. research where computation is the central technique of exploration. Experimental data is processed using software that is often just as opaque as complex simulation software. Worse, we see more and more scientific instruments with integrated computers and embedded software. An increasing part of what we consider raw experimental observations are actually pre-processed by only superficially documented software.

There are numerous examples of this problem, of which I will cite two that I encountered in my own research work in computational biophysics. The theoretical models on which biomolecular simulations are based are called "force fields". They are complex algorithms that compute a physical quantity called "potential energy" from a graph describing a molecular structure. The published descriptions of these algorithms are not detailed enough to write or verify an executable implementation. Seemingly basic questions such as "Does version 6.2 of the GROMACS software correctly implement the AMBER99 force field?" are therefore meaningless: nobody can say what implementing a force field correctly actually means. My second example is protein structures observed by electron microscopy. The files available from public databases such as EMDB contain numbers whose precise meaning is not explained anywhere. There is the vague notion that higher numbers correspond to a higher local Coulomb potential (the physical quantity that electron microscopes measure), but the exact meaning of the numbers is defined by software that is neither published nor documented.
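To give an idea of where the ambiguities hide, here is a toy sketch in Python of the very simplest force-field contribution, the harmonic bond energy; the parameters and coordinates are invented for illustration. Even at this level, a published description must state whether the convention includes a factor of 1/2 in front of the force constant and which units the parameters are expressed in, and a real force field adds angle, dihedral, electrostatic, and Lennard-Jones terms, each with further conventions and parameter sets.

    import numpy as np

    def bond_energy(coords, bonds, k, r0):
        """Sum of harmonic bond terms k*(r - r0)**2 over (i, j) atom pairs.
        Some force fields use 0.5*k*(r - r0)**2 instead; published
        descriptions do not always say which convention applies."""
        e = 0.0
        for (i, j), k_ij, r0_ij in zip(bonds, k, r0):
            r = np.linalg.norm(coords[i] - coords[j])   # bond length
            e += k_ij * (r - r0_ij) ** 2
        return e

    # Three atoms of an invented toy molecule, connected by two bonds.
    coords = np.array([[0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0],
                       [1.0, 1.1, 0.0]])
    print(bond_energy(coords, [(0, 1), (1, 2)],
                      k=[300.0, 300.0], r0=[1.0, 1.0]))   # 3.0

Multiply such ambiguities by the thousands of parameters in a production force field, and the question whether an implementation is correct indeed becomes unanswerable without a precise specification.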

If we want to maintain the reproducibility of scientific conclusions as a cornerstone of reliable science, we must strive to make computer-aided research verifiable. My own contribution to this is an Open Science project that aims to develop digital scientific notations in which we can express precisely what software is supposed to compute. Everyone is welcome to join, or to develop complementary projects. What matters most is that we stop treating epistemic opacity as a normal and inevitable aspect of using computers.
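As a rough indication of the direction, here is what a machine-checkable fragment of such a specification might look like for the toy bond energy sketched above. This is only a crude stand-in written with plain Python assertions, not the notation being developed in the project; it merely shows that even a few explicit, executable properties pin down part of what a computation is supposed to mean.

    import numpy as np

    # Hypothetical properties that any implementation of the toy harmonic
    # bond energy (see the sketch above) would have to satisfy.

    def check_bond_energy(impl):
        coords = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.1, 0.0]])
        bonds, k, r0 = [(0, 1), (1, 2)], [300.0, 300.0], [1.0, 1.0]

        # Property 1: zero energy when every bond is at its equilibrium length.
        relaxed = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
        assert np.isclose(impl(relaxed, bonds, k, r0), 0.0)

        # Property 2: invariance under a global translation of all atoms.
        shift = np.array([1.0, -2.0, 0.5])
        assert np.isclose(impl(coords, bonds, k, r0),
                          impl(coords + shift, bonds, k, r0))

        # Property 3: a documented reference value for one specific input.
        assert np.isclose(impl(coords, bonds, k, r0), 3.0)

    # check_bond_energy(bond_energy)   # would pass for the sketch above

A genuine digital scientific notation would have to express such properties, and the underlying physics, in a form that scientists can read and reason about, not just execute.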

 

 

References

1. Feynman RP. Cargo Cult Science [Internet]. 1974. Available from: http://calteches.library.caltech.edu/51/2/CargoCult.htm

2. Drummond C. Replicability is not Reproducibility: Nor is it Good Science. In: ICML 2009 Proceedings [Internet]. Montréal; 2009. Available from: http://cogprints.org/7691/7/ICMLws09.pdf

3. Mesnard O, Barba LA. Reproducible and replicable CFD: It’s harder than you think. 2016. Available from: http://arxiv.org/abs/1605.04339

4. Harris S. © ScienceCartoonsPlus.com, reprinted with permission.

5. Soergel DAW. Rampant software errors may undermine scientific results [version 2; referees: 2 approved]. F1000Research 2015, 3:303. Available from: http://f1000research.com/articles/3-303/v2

6. Newman J. Epistemic Opacity, Confirmation Holism and Technical Debt: Computer Simulation in the light of Empirical Software Engineering. In Pisa, Italy; 2015. Available from: http://eprints.bbk.ac.uk/12921/1/12921.pdf

 

Reviews

Review by Olivia Guest

This article attempts to define some key words: replication, reproduction, verification. Especially in the case of the first two, there is currently no consensus and instead they are often used synonymously in the literature.

As mentioned, the author proposes three concepts to do with the (re-)running of experiments and a solution. Firstly, he defines replication as “check[ing] for mistakes at a technical level[, which] follow[s] the published instructions as closely as possible.” Secondly, he defines reproduction as “check[ing] for mistakes in the scientific interpretation, and in particular for hidden assumptions [which should] lead to the same conclusions, in spite of intentional changes in the protocol.” Thirdly, he defines verifiability as a synonym for epistemic opacity. Finally, he proposes his own solution, or at least a part-solution, which seems to be a formal specification language.

It is important to nail down definitions for technical terms, and as such I am glad the author is proposing his take on defining replication, reproduction, and verification. This being said, I would have appreciated better signposting of these definitions through subsection headings or some other emphasis placed in the text.

It concerns me that the author does not refer explicitly enough to theory testing as the main drive for either replication or reproduction, or indeed for setting out to run any experiment (original or otherwise). For example, he claims:

    “A typical report about a scientific study says, "this is what we did, that is what we observed, and those are our conclusions".”

    Any good research article before “this is what we did” expounds on the impetus for the experiment: to test a theoretical prediction.

    The author then goes on to define replication as what I and others in cognitive science would call direct replication: the most faithful rerun of the experiment; and reproduction as what we would call conceptual replication: a less faithful rerun to explicitly test the theory and generalisability of the results of the previous case. The comments he makes on the two types of replication/reproduction he defines seem sensible to me, except the part about conceptual replications aka reproductions:

    “A crucial but often neglected aspect is that reproduction attempts make little sense unless replicability has been verified first.”

This does not always hold, if ever. Many, if not all, authors place their experimental work within a larger theory very carefully. So a conceptual replication has more validity in theory testing than a direct replication, because the original authors stated in their theory that such conceptual replications should be in line with their theory. In fact that is what we do in science. We generalise from our findings. We take a sample and conclude from it the properties of the population from which it was drawn. Testing the extent and usefulness of these generalisations is in part what theory testing is about. So in answer to “If you try to reproduce a finding and fail, then what?”: we can then further define what is and is not explained or predicted by the theory. If we discover that a conceptual replication does not make sense within the proposed theory, then the theory is under just as much threat (if not more) as if we had failed to replicate the original experiment faithfully.

    Readers may find it confusing to see the OSF Reproducibility Project being referred to as a replication: not only because of the name (the author would prefer “replicability project”) but also because the OSF Reproducibility Project (at least for Psychology) is actually more of a reproduction project (using the author’s terminology). “[M]any of OSC’s replication studies drew their samples from different populations than the original studies did” (see: Gilbert, D. T., King, G., Pettigrew, S., & Wilson, T. D. (2016). Comment on ‘Estimating the reproducibility of psychological science’. Science, 351(6277), 1037–1037.). The same definition-based confusion goes for the ReScience journal. These are easy confusions to arise because of the lack of consensus regarding the meanings of these words.

    “Let's imagine an ideal world in which peer review works to perfection, meaning that all published research work has been shown to be replicable at the technical level. Technical mistakes, including fraud, have been effectively eliminated from this world. We can then concentrate on the scientific level, and aim for reproducibility.”

Such a world is statistically impossible given the nature of false positives, noisy data, peer review, etc. This means we will never get to the “scientific level”, which I suppose is the testing of theories. Regardless of whether false positives will be eliminated from ever entering the literature (they will not), theory testing needs to have the highest of priorities. Besides, there are millions of scientists in the world who can work on countless projects. Hopefully most of them want to test their own and others’ theories as a matter of course.

It would be nice for the cartoon by Sidney Harris to have been included (under fair use) in the article to help the reader follow the argument. I object to the use of the word “dumbest”, as it has extremely negative connotations and offers nothing to the attempt at defining the word “verifiable”. The concepts touched on to define this term are nonetheless important for science, and so I am glad to see a discussion on epistemic opacity.

    To conclude, the author proposes his own idea — his own project which has the goal of  “develop[ing] digital scientific notations”. This raises the question: what is the difference between a formal specification language (of which many exist, e.g., Z notation) and this new project? I am happy to see formal specification languages being mentioned, as they have been mentioned previously by Cooper and Guest in the same context (see: Cooper, R. P., & Guest, O. (2014). Implementations are not specifications: Specification, replication and experimentation in computational cognitive modeling. Cognitive Systems Research, 27, 42–49.).



All in all, this was a good read. While I may disagree on some key points, I am very glad to see important scientific concepts with respect to methodology being discussed, as well as calls for more replications, both conceptual and direct. Notwithstanding, I would have liked to see more points raised on the value of theory testing.

    Comment by Olivia Guest

      Apologies for the weird spacing between paragraphs. I was editing on Google Documents due to already losing a little of my work, but sadly it seems formatting doesn't carry over as nicely as I expected - especially with respect to the hyperlinks. The two references are links; as are the phrases "Z notation", "direct replication", and "conceptual replication".

    Reply by Konrad Hinsen (author)

      Thanks for this detailed review! I see it as mainly raising issues due to differences between disciplines, which are always a source of difficulties when talking about rep[.*]ibility in science. Moreover, the length limit of the essay contest forced me to condense everything as much as possible.

Perhaps I should have hinted at my own biases by describing my background, which is (in order) physics, chemistry, and biology. My own work in these fields is almost exclusively theoretical and computational, though I do collaborate with experimentalists often enough to be aware of their preoccupations as well.

      One important difference between the reviewer's field and mine is in the interplay of theory and experiment. In physics and chemistry, the theory is well-known shared knowledge, but figuring out the consequences of the theory for a given situation is difficult. With few (but very visible) exceptions, experiments don't aim at testing the theory but at observing new phenomena or exploring more complex systems. Theoretical and computational work is mostly about finding good approximations to the trusted theory in order to be able to apply it to non-trivial systems and phenomena. Most commonly, the experiments come first and then theoreticians try to explain them using clever approximations.

      The typical technical mistakes in doing or reporting experiments, using my wide definition of "mistake", are related to unknowns in the preparation of experimental setups. A parameter that nobody thought of can turn out to be important. A new tool or instrument can have an unknown precision. Such mistakes are often found by replication attempts, sometimes by the original authors themselves. Typical technical mistakes in computational work are human operator errors in managing computations, bugs in software, and inappropriate and undocumented approximations. Replication attempts (same code, same data) check for operator errors. Verifiability as I introduce it is the condition for dealing with bugs and inappropriate approximations.

I do not quite understand the reviewer's criticism of my assertion that reproduction attempts make little sense unless replicability has been verified first. It seems we perfectly agree that reproduction has a higher scientific value than replication. My point is that there is no point in discussing the scientific implications of a failed reproduction attempt as long as there's a good chance that the difference is simply due to technical mistakes.

      My "ideal world in which peer review works to perfection" is indeed statistically impossible. It's a fictional creation set up to discuss verifiability in isolation from other problems.

Finally, to answer the question as to what is the difference between a formal specification language and my new project for digital scientific notations: there is no fundamental difference. From the point of view of a computer scientist, digital scientific notations are specification languages. There are, however, differences from every existing specification language I know of. Not surprisingly, because none of them was designed to represent scientific knowledge. Z notation is as ill-adapted to describing physics as the C language is ill-adapted to writing accounting software. Moreover, my digital scientific notation goes beyond specification languages in that it can represent not only computations, but also the non-computable aspects of traditional mathematics.


License

This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.