Organize your Data and Code for Sharing from the Start

On September 12, 2016, experimental psychologist Christopher Ferguson created a GoFundMe page to raise funds for access to an existing data set that had been used to support the arguments in a scientific publication. In Ferguson’s own words: “So I spoke with the Flourishing Families project staff who manage the dataset from which the study was published and which was authored by one of their scholars. They agreed to send the data file, but require I cover the expenses for the data file preparation ($300/hour, $450 in total; you can see the invoice here).” Ferguson’s request generated a lot of discussion on social media, with many individuals disappointed that data used to support ideas put forward in a scientific publication are available only after a substantial fee is paid. Others feel a fee is warranted given the effort required to assemble the requested data into one file and to write instructions for using it. In the words of one commenter: “But I also know people who work with giant longitudinal datasets, and preparing just the codebook for one of those, in a way that will make sense to people outside the research team, can take weeks.” (highlighting added by me).

As someone who has collected data over time from large numbers of romantically involved couples, I agree that it would take some time to prepare these data sets and codebooks for others to understand. But I think this is a shame, and a problem in need of a solution. If it takes me weeks to prepare documentation that explains my dataset organization to outsiders, I am guessing it would take about as long to explain the same organization to my future self (e.g., when running new analyses with an existing data set) or to a new graduate student who wants to use the data to test new ideas, not to mention people outside of the lab. This is highly inefficient for in-lab research activities, and it risks the loss of valuable data to the field, because others may never get access to my data if (a) I am too busy to spend weeks (or even hours, for other data sets) putting everything together so that others can make sense of it, or (b) I die before I put these documents together (I am 43 with a love of red meat, so I could drop dead tomorrow. I think twice before buying green bananas).

So what is my proposed solution? Organize your data and code from the start with the assumption that you will need to share this information (see also “Why scientists must share their research code”). Create a data management plan at the beginning of all your research projects. Consider how the data will be organized, where it will be stored, and where the code for data cleaning/variable generation, analyses, and plots will be stored. Create meta-data (information about your dataset) along the way, updating as needed; consider where to store this meta-data from the beginning. If you follow these steps, your data, meta-data, and code can be available for sharing in a manner understandable to other competent researchers in a matter of minutes, not weeks. Even for complex data sets. Your future self will thank you. Your future graduate students will thank you. Your future colleagues will praise your foresight long after you are dead, as your [organized] data will live on.
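One way to picture "creating meta-data along the way" is a small script that rebuilds a codebook from the data file each time the data change. What follows is only a minimal sketch in Python, not a recommended standard: the function name `build_codebook`, the column names (`couple_id`, `partner`, `satisfaction`), and the example values are all hypothetical, and a real project would likely record more (units, coding of missing values, collection wave, and so on).

```python
import csv
import io
import json


def build_codebook(csv_text, descriptions):
    """Summarise each column of a CSV file as one codebook entry.

    `descriptions` maps column names to human-written explanations;
    columns without one are flagged so the gap is visible.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    codebook = {}
    for name in reader.fieldnames or []:
        values = [row[name] for row in rows]
        present = [v for v in values if v != ""]  # empty cell = missing
        codebook[name] = {
            "description": descriptions.get(name, "TODO: describe this variable"),
            "n_missing": len(values) - len(present),
            "n_unique": len(set(present)),
            "example": present[0] if present else None,
        }
    return codebook


# Hypothetical example data: two partners per couple, one score each.
EXAMPLE_CSV = """couple_id,partner,satisfaction
1,A,6
1,B,5
2,A,
2,B,7
"""

# Hypothetical descriptions; "partner" is left out on purpose.
EXAMPLE_DESCRIPTIONS = {
    "couple_id": "Identifier shared by both partners in a couple",
    "satisfaction": "Relationship satisfaction, 1 (low) to 7 (high)",
}

codebook = build_codebook(EXAMPLE_CSV, EXAMPLE_DESCRIPTIONS)

if __name__ == "__main__":
    # Store this output next to the data file so the two travel together.
    print(json.dumps(codebook, indent=2))
```

Because undescribed columns show up as "TODO" entries, rerunning a script like this whenever a variable is added makes missing documentation obvious early, rather than weeks before someone asks for the data.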

Update: see Candice Morey’s post on the same topic.

 

License

This article and its reviews are distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and redistribution in any medium, provided that the original author and source are credited.