The pitfalls of open data

TL;DR summary: some data can be made publicly available without any problems. A lot of data, however, cannot. Therefore, unrestricted sharing should not be the default. Instead, all data could be hosted on institutional repositories to which researchers can get access upon request to the institution.

Data is an essential part of research, and it is a no-brainer that scientists should share their data. The default approach is and has been ‘share on request’: if you’re interested in a dataset, you simply e-mail the author of a paper, and ask for the data. However, it turns out that this does not work that well. Wicherts, Borsboom, Kats, & Molenaar (2006) have shown, for example, that authors are not really enthusiastic about sharing data, something not unique to psychology.

This is bad. Sadly, not just for the sake of scientific progress – recently, social science has seen another data-fabrication scandal where a graduate student faked his data for a study published in Science (you would think they had learned their lesson at Science after Stapel, but sadly, no). Making data available with your publication at least a) shows that you conducted the study, and b) allows others to (re)use your data, saving work in the end.

It is therefore not surprising that there is now an open research movement, calling for full transparency in research, including making all research data public by default. I totally support open research, and I have considered signing the ‘Agenda’ several times. After a discussion on the ISCON Facebook page I have now decided not to.

As a matter of fact, the discussion has convinced me that making all research data publicly available without restriction by default is in fact a bad idea.

Before the flame-war starts, let me point out that I am not against sharing data between researchers, or even against compulsory data sharing (i.e., if an author refuses to share data without good reason, her/his boss will send the data). However, I disagree with unrestricted data publishing, i.e. putting all data online where anyone (i.e., the general public) can access it. I am strongly in favour of a system where data is deposited at an institutional repository and anyone interested in the data may ask for access, if necessary even without consent of the author.

Let me illustrate my concerns with the following thought experiment. You participate in an experiment on sexual arousal, and have to fill out a questionnaire about how aroused you are after watching a clip with the most depraved sex acts. Your data is stored anonymously, and will be uploaded to Github directly after the experiment (see Jeff Rouder’s paper on an implementation of such a system). Would you give consent?

For this example, I may. I can always fake my response on the questionnaire should I feel something tingling in my nether regions to avoid embarrassment.

For the next experiment, this study is repeated, but we’re now measuring physiological arousal (i.e., the response of your private parts to said depraved sex acts). Again, the data will be uploaded directly to Github after the experiment.

Now, I would be a bit uncomfortable. Suppose I got sexually aroused (or not – it actually does not matter, the behaviour of my private Willy Johnson is not anyone’s business besides my own and my wife’s, and for this one occasion, the researcher’s). This is now freely available for anyone to see. And by the timestamp on the file, I may be identified by the one or two students who saw me entering the sex research room for the 12:00 session on June 2nd. Unlikely, but not impossible. Oh sure, remove the timestamp then! Yes, but how is a researcher then going to show (s)he collected all the data after preregistering his/her study and not before (or did not fabricate the data on request after someone asked for it)?

Ok, we take it a step further. We now measure the response of your nether regions, but now we ask you to have your fingerprint scanned and stored with the data as well.

Making this data publicly available would be a huge no for me. Fingerprints are unique identifiers. Are you mad?

But now replace ‘fingerprint’ with raw EEG data. We do not often consider this, but EEG data is as uniquely identifiable as fingerprints. I can recognize some of my regular test subjects and students from their raw EEG data – shape and topography of alpha, for example, are individual traits and may be used to identify individuals if you really, really want to.

One step further: individual, raw fMRI data, associated with your physiological ‘performance’ on this sex rating task. Rendering a 3D face from the associated anatomical image is trivial – it’s one of the first things you do (for fun!) when you start learning MRI analysis. How identifiable do you want your participants to be? And note that raw individual fMRI data cannot be interpreted without the anatomical scan – you need the latter to map activations on brain structures.

So, don’t publish the raw data then! Sure, that fixes some problems, but creates others. What if I want to re-analyze a study’s data, because I do not agree with the author’s preprocessing pipeline, and would rather try my own? For this I would still have to ask the author for the full data set. Mind you – most researcher degrees of freedom for EEG and fMRI are in the preprocessing of data (e.g., what filters you use, what kind of corrections you apply, what rereferencing you apply, etc.), and aggregate datasets, such as those published on Neurosynth, do not allow you to reproduce a preprocessing pipeline.

But the main problem is that many data or patterns in data can be used as unique identifiers. Even questionnaire, reaction time, or psychophysics data. Data mining techniques can be used to find patterns in datasets, such as Facebook likes, that can be used for personal identification. What’s to stop people from running publicly available research data through such algorithms? Unlikely? Sure. Very much so, even. Impossible? Nope.
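To make the point about patterns as identifiers concrete, here is a minimal sketch in Python. It is an illustration only: the function name `unique_fraction` and the five toy records are made up for this example and do not come from any real dataset. It counts how many records become one of a kind once a few harmless-looking fields are combined.

```python
from collections import Counter

def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` occurs exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    singles = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return singles / len(records)

# Hypothetical toy records (invented for illustration, not real participants).
toy_records = [
    {"age": 21, "sex": "F", "session": "Mon"},
    {"age": 21, "sex": "F", "session": "Tue"},
    {"age": 21, "sex": "M", "session": "Mon"},
    {"age": 22, "sex": "F", "session": "Mon"},
    {"age": 21, "sex": "F", "session": "Mon"},
]

for keys in (["age"], ["age", "sex"], ["age", "sex", "session"]):
    print(keys, unique_fraction(toy_records, keys))
```

In this toy set, age alone singles out one of the five records, while age, sex, and session day together single out three of the five. Each extra field looks innocent on its own; the combination is what identifies.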

Of course, my thought experiment deals with a rather extreme example – I guess that very few people are willing to have their boy/girl-boner data in a public database for everyone to see. So let’s take another example. Visual masking. What can go wrong with that? Well – performance on a visual masking task may be affected by illnesses such as schizophrenia, or by being related to an individual with schizophrenia. Is that something you want to be publicly accessible? And there are many other examples like it. Data reveals an awful lot about participants and it is not clear at all how much data is needed to identify people. It may be less than we think.

I fully realize that the scenarios I put forward here are extreme and hypothetical, and I am sure some people will think I am fearmongering, making a fuss, and maybe even an enemy of open science. Ok, so be it. I think that we as scientists do not only have a responsibility to each other, but even more so to our participants. People participating in our studies are the lifeblood of what we do and deserve our utmost respect and care. They participate in our studies and provide us with often very intimate data, but they also trust us to handle that data conscientiously, and they contribute their data for science. We need to protect their privacy. Just putting all data online for everyone to see does not fit with that idea. There is always a potential for violations of privacy, but making all data public also opens up the data to, let’s say, the government, insurance companies, marketeers, and so on, for corporate analyses, marketing purposes, and goals other than the progress of science. Do we want that?

Maybe I should give another example – what about video material? Suppose I carried out an experiment in which I taped participants’ emotional responses to shocking material. Even if I blurred out faces to prevent identification, and my IRB was ok with publishing these clips, I would still not submit such material to a public repository for every Tom, Dick, and Harry to browse clips of crying participants.

I am not saying these are realistic scenarios, but it is worth giving them some thought – at least, more than people are doing now.

There are and will be many datasets that can be made publicly available without any concern at all. I’ve got a feeling that the authors of the Agenda for Open Research primarily work with such datasets, but do not sufficiently realize that there is a lot of sensitive data being collected as well. The ideal of all data made public by default does not fit well with my ideas of being a responsible experimenter. And there is a clear ‘grey zone’ here. Not everyone will share my concerns. Some will even say I am making a fuss out of nothing. But I would like to be able to carry out my job with a clear conscience. Towards my colleagues, but most of all towards my participants. And that means I will not make every dataset I collect publicly available, even if this means that the signatories of the Agenda for Open Research will not review my paper because they do not agree with my reasons for not making a given dataset publicly available. Too bad.

So, you want access to the data I did not make publicly available, but I am on extended leave? There is a fix! And actually, this fix should appeal to the Open Research movement too.

For every IRB-approved experiment, require authors to deposit their data at an institutional repository. All data and materials, that is. Raw data, stimuli, analysis code and scripts. The whole shebang, all documented of course. Authors are free to give access to this data to anyone they want. Scientists interested in the data can request access via a link provided with a paper. In principle, the author will provide access, but if no reply is given within a reasonable term (let’s say two weeks), or access is refused without proper reason, the request is forwarded to the Director of Research (or another independent authority) who then decides.
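The routing rule in the previous paragraph can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of any existing repository software: all names (`AccessRequest`, `route_request`, `Outcome`, and so on) are hypothetical, and the "reasonable term" is set to the two weeks suggested above. Whether a stated reason is actually "proper" is a human judgement that the sketch does not model; it only checks whether a reason was given at all.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from enum import Enum
from typing import Optional

# Assumption: the "reasonable term" is the two weeks suggested in the text.
RESPONSE_TERM = timedelta(weeks=2)

class Decision(Enum):
    GRANTED = "granted"
    REFUSED = "refused"

class Outcome(Enum):
    ACCESS_GRANTED = "access granted by author"
    WAITING = "waiting for author"
    REFUSED_WITH_REASON = "refused by author, reason stated"
    ESCALATED = "forwarded to Director of Research"

@dataclass
class AccessRequest:
    requested_on: date
    author_decision: Optional[Decision] = None  # None means no reply yet
    refusal_reason: str = ""

def route_request(req: AccessRequest, today: date) -> Outcome:
    """Decide what happens to a data-access request on a given day."""
    if req.author_decision is Decision.GRANTED:
        return Outcome.ACCESS_GRANTED
    if req.author_decision is Decision.REFUSED:
        # A refusal without any stated reason goes to the independent authority.
        if req.refusal_reason.strip():
            return Outcome.REFUSED_WITH_REASON
        return Outcome.ESCALATED
    # No reply from the author: escalate once the response term has passed.
    if today - req.requested_on > RESPONSE_TERM:
        return Outcome.ESCALATED
    return Outcome.WAITING

# Example: the author is on extended leave and never answers.
request = AccessRequest(requested_on=date(2015, 6, 2))
print(route_request(request, today=date(2015, 6, 10)).value)
print(route_request(request, today=date(2015, 6, 20)).value)
```

The point of the design is that the author stays the first point of contact, but silence or an unexplained refusal never leaves the requester stuck: the request always ends up with someone who can decide.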

In Groningen, we have such a system in place. It ensures that for every published study, the data is accounted for, and access to the raw data can be granted if an individual requests it. The author of a study controls who has access to the data, but can be overruled by the Director of Research. It works for me, and I do not see what the added benefits of unrestricted access are over this system. Working in this way makes me feel a lot better. I can only hope that the signatories of the Agenda for Open Research consider this practice to be open enough.


This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.