
Absolutely Maybe

Evidence and uncertainties about medicine and life

Opening a can of data-sharing worms

The views expressed are those of the author and are not necessarily those of Scientific American.


[Cartoon: missing study data]

Are researchers’ dogs eating a lot of their homework? Well, yesterday afternoon at the quadrennial medical editors’ scientific meeting in Chicago, we found out they kinda are.

Timothy Vines and colleagues did a study on how the availability of data sets in zoology changes over time. They gathered 516 papers published between 1991 and 2011, and then tried to track the data down.

Even tracking down the authors was a challenge, never mind the actual data. As the years went by, a dwindling minority of papers were accompanied by author email addresses that still functioned.

Vines’ luck with the data was even worse. Even for papers published in 2011, only 37% of the data sets were still findable and retrievable, and the proportion dropped with each earlier year. By 1991, only 7% of the data could be confirmed to still exist and be retrievable. For those oldest papers, few authors could be found at all, and most who could be reached reported that their data were lost or inaccessible.

Researchers who had the data had died or retired, or the research had been done five computers and two universities ago. Or the data were in software or hardware that no one could access any more. As the stories and reasons kept coming, we were all wincing and more or less freaked – partly in personal recognition of life as we all know it, and partly at seeing the collective enormity of this problem tabulated. Human research in areas where keeping data is required might fare better, but who knows? Vines thinks that years from now, people will look back and think it was silly not to publish data at the same time as the article.

The following speaker added further cheery news: Christine Laine from the Annals of Internal Medicine told us that between 2008 and 2012, researchers’ willingness to share their data had actually decreased. So although the journal has the admirable practice of including a reproducibility statement, researchers who want to replicate a study will still often have trouble getting the precise details they need.

Photo of Kay Dickersin

Kay Dickersin, annual EQUATOR lecture at the 7th International Congress on Peer Review and Biomedical Publication, Chicago

Some people definitely don’t want to share. Kay Dickersin told one such story, and the way one drug company approached unwelcome data about one of their products. You can read about it here. Although fair warning: you mightn’t want to read it before bedtime – it’s quite scary.

Dickersin pointed out that the problem of hidden trial data is particularly bad for industry studies of off-label use of drugs. No trials have to be submitted to the FDA or other drug regulatory authorities for those uses, cutting off one of the major sources of data.

Dickersin was delivering a tour de force annual EQUATOR lecture at the end of the day yesterday. A key message came early: “We must agree on the balance between scientific trust and scientific accountability,” she said. “It’s not just that the studies aren’t reported – the investigators aren’t telling the whole story.”

The problem goes through the whole health and science ecosystem, Dickersin pointed out. Whether it’s academic researchers, industry or clinicians, fears about legal implications drive all sorts of behavior, including withholding data: “We’ve been too industry-focused: academics are resisting this too.” People’s unpreparedness for high quality, sustainable data-sharing practices needs to be taken seriously – which means working to resolve the content, ethical, and practical problems standing in the way.

Dickersin was concerned that researchers have to “become detectives” to find out treatment effects and resolve discrepancies between different sources of data: “There needs to be more scrutiny by regulators.”

Two particular examples of good practice in sharing clinical trial information were highlighted. One was the extensive processes advanced by the NIH’s NICHD (the Eunice Kennedy Shriver National Institute of Child Health and Human Development), including detailed data dictionaries that explain the content and technical detail of the data. The other was YODA, Yale University’s Open Data Access Project, which got a lot of attention in June with the publication of its project on Medtronic’s biological agent to promote bone growth.
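The idea behind a data dictionary can be sketched in a few lines. This is purely illustrative – the variable names, codes, and format here are invented for the example, not NICHD’s actual scheme: each shared variable gets an entry recording its meaning, type, units, and valid values, so a later re-user doesn’t have to guess what a cryptic column name means.

```python
# A minimal, hypothetical data-dictionary entry for a shared data set.
# Each variable is documented with its label, type, units, valid range,
# and the code used for missing values.
data_dictionary = {
    "bw_g": {
        "label": "Birth weight",
        "type": "integer",
        "units": "grams",
        "valid_range": (300, 6500),
        "missing_code": -9,
    },
    "apgar5": {
        "label": "Apgar score at 5 minutes",
        "type": "integer",
        "units": None,
        "valid_range": (0, 10),
        "missing_code": -9,
    },
}

def check_value(variable, value):
    """Return True if a recorded value is plausible per the dictionary."""
    entry = data_dictionary[variable]
    if value == entry["missing_code"]:
        return True  # explicitly coded as missing, which is allowed
    low, high = entry["valid_range"]
    return low <= value <= high
```

A side benefit of writing the dictionary in a machine-readable form like this is that it doubles as a validation tool: re-users can check incoming values against the declared ranges instead of trusting the data blindly.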

There was a lot of passion in the room on this: discussion went well over time. Safe and valuable data-sharing and preservation are critical. But as we’ve seen with other areas like genomics dealing with the same issues, it’s going to take a lot of effort. And a lot of parts of the system are involved. As Dickersin said last night: “The whole system is depending on the rest of the system to work.”


From day 1: “Bad research rising”

From the morning of day 2: “Academic spin”

As you would expect from a congress on biomedical publication, there’s a whole lot of tweeting going on. Follow on #PRC7

The cartoon is by the author, under a Creative Commons, non-commercial, share-alike license. Photo of Kay Dickersin by the author.

The thoughts Hilda Bastian expresses here are personal, and do not necessarily reflect the views of the National Institutes of Health or the U.S. Department of Health and Human Services.

Hilda Bastian About the Author: Hilda Bastian likes thinking about bias, uncertainty and how we come to know all sorts of things. Her day job is making clinical effectiveness research accessible. And she explores the limitless comedic potential of clinical epidemiology at her cartoon blog, Statistically Funny. Follow on Twitter @hildabast.



Comments (2)

  1. sesuncedu 10:40 am 09/11/2013

    This is where the effort to define the semantics of the information to be recorded shows its value. Standards are better than ad hoc, and richer semantics is better than simple taxonomies, but something is better than nothing (p < 0.05).

    The bioinformatics community has done relatively well, though some early choices in ontology design are limiting in some respects.

    Long term *Data* preservation is relatively easy (though until costs are properly taken into account, it is easy to waste huge amounts of resources on short read sequences that will never end up being assembled).

    Also, usually the data/information is not enough. Instrument models or schematics, calibration metadata, source code, operating system, libraries, workflows, random number state, parameters, etc. all need to be recorded/saved.

    The failure of the UEA climate change unit to follow proper preservation protocols unnecessarily undercut their epistemic virtues.

    If the NSF data management plans ever start getting taken seriously, then funding requests should start to include proper reservations for long term preservation and replication. So far I have only heard tell of a single submission getting kicked back because of the data plan, and that was just to get a pro forma revision so the already formed decision to award the grant could be acted on…
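    The record-keeping the commenter lists (random number state, parameters, software environment) can be sketched as a small provenance record saved alongside a data set. This is only an illustration under assumed conventions – the field names and the `provenance_record` helper are invented for the example, not any community standard:

```python
import json
import platform
import random
import sys

def provenance_record(params, seed):
    """Capture run context a later replicator would need alongside the data."""
    random.seed(seed)  # pin the random number state so the analysis can be re-run
    return {
        "python_version": sys.version.split()[0],  # interpreter used for the run
        "platform": platform.platform(),           # operating system / hardware
        "random_seed": seed,
        "parameters": params,
    }

# Save the record next to the data set it describes.
record = provenance_record(params={"n_reads": 1000, "kmer_size": 31}, seed=42)
print(json.dumps(record, indent=2))
```

    Writing the record as plain JSON is a deliberate choice: it stays readable decades later even when the original software and hardware are long gone, which is exactly the failure mode the article describes.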

  2. Hilda Bastian in reply to sesuncedu 7:54 pm 09/11/2013

    Terrific points, sesuncedu – yes, I agree that something is better than nothing. Only one thing I’d add: there should be more infrastructure support and standards to make it easier. I’m from the clinical effectiveness community, but as I listen to many of these debates happening there, I’ve heard them before in other areas where some of the issues were resolved systemically – and many of those systemic solutions are naturally content-agnostic. This is surely an area where there are economies of scale, and where areas of science need to be learning from each other.


