October 17, 2017

Scientific Data Lost to Poor Archiving

Hundreds of new pieces of scientific research are published every month, in fields from physics to biology. While the studies themselves are assiduously archived by publishers, the underlying data researchers analyze to come to their published conclusions can be another story. A recent study in the journal Current Biology found that the data that forms the backbone of those studies becomes less and less accessible to researchers over the years. That lack of archiving, says University of British Columbia zoologist Tim Vines, represents a missed opportunity for the scientific community as a whole.

The problem isn’t insurmountable, though, and Vines suggests that both publishers and funding institutions from universities to the National Institutes of Health (NIH) could improve record keeping and data sharing just by raising their own standards and expectations for researchers when it comes to archiving data.

“At the moment, the scientific community doesn’t think about this,” said Vines. “There’s zero expectation that you could get the data from an older study. I think we’re missing a vast opportunity there.” When data from older research is not available, opportunities to do meta-analysis research—studies that look at multiple studies on a single topic and can provide new insights from old material—are lost. And data that might be useful to researchers in the future—figures on climate and weather, for example—can be lost entirely, leaving gaps in the research record.

For their study, Vines and his colleagues used a metric that hasn’t changed much over the past few decades: morphological data on plants and animals. Taken with calipers or under microscopes, these measurements, which include things like the length of an animal’s leg or the width of a plant’s leaf, have remained consistent over that time, making them a useful sample set for the study. Analyzing 516 papers from two to 22 years old, the authors found that the older a paper was, the harder it was to find the data sets underlying it. The researchers who conducted these studies proved harder to reach as time wore on, with every year seeing a 7 percent increase in the likelihood that a given article would contain no working email to reach the scientist who had conducted the research in question. And even when researchers were available, their data often wasn’t. Every year a study aged, Vines found, the likelihood that the data underlying it would no longer exist rose by an average of 17 percent.

Of course, morphological data is just one of the sets of numbers scientists use in their research, and it’s a relatively easy one to analyze. One of the problems of data preservation is the staggering variety of data that need cataloging. From tree height to emotional reactions, researchers measure a wide variety of things, making it difficult to lay down a set of universal best practices for data management. Though expectations for data archiving can differ from field to field and even project to project, Vines says the problems detailed in the Current Biology paper are common across research topics.

The problem isn’t that the data isn’t backed up at all, said Vines, but how that backup is handled. “Often at the end of a study, researchers will take all their data, put it into a folder, and then back it up,” said Vines. “But then time starts taking its toll.” Computers where data is stored can break down. Laptops can be lost in travel or dropped from tables. Even if stored data is archived thoughtfully by researchers, its accessibility can be affected as storage media change over the years. “Very carefully backed up data doesn’t do anyone any good if it’s backed up on 3.5-inch floppy disks,” Vines pointed out.

Vines suggests that publishers could improve data management among researchers by demanding more accountability and taking a more active hand in data preservation. In addition to being responsible, he says, it’s good for business as well. “Papers that do a good job archiving their data are more valuable to the scientific community, and to the publication they appear in,” said Vines. While most researchers only send along subsets of their data that are pertinent to their published work, Vines says, most would likely not be averse to offering up their full data sets for analysis. They’re just not being regularly asked by journals to do so.

As an editor as well as a scientist, Vines is trying to change that standard, and feels that academic libraries are well placed to help drive that change. Molecular Ecology, where Vines serves as managing editor, works with databases like the nonprofit data repository Dryad, where he is a former board member.

Dryad is a nonprofit operated by a partnership between North Carolina State University, Duke, and UNC Chapel Hill to ensure that the data researchers are using in submitted articles has a permanent home, one that can easily migrate to new formats in the future. According to Dryad board member Todd Vision, the service was set up with ease of use in mind. “We wanted to make sure it’s a practical process for researchers,” said Vision of making data available through Dryad.

While it began as a service geared toward researchers studying molecular and evolutionary biology, Dryad has expanded to welcome data sets from a much wider variety of fields since it went live in 2011 and now hosts more than 2,000 new data sets every year across a broad spectrum of disciplines. For a one-time fee of $80, paid by either the researcher or a sponsoring publication, Dryad provides data sets with a permanent home and DOI reference so the data can be cited when it is used by researchers going forward. To help data get cited in the first place, Dryad also works to ensure the data sets are searchable, and makes them freely accessible under a Creative Commons license.

Storing data in systems like Dryad is the exception to the rule for now, but Vines thinks that may change as new mandates like that of the Office of Science and Technology Policy (OSTP) drive federally funded research toward open-access availability. Repositories like Dryad are key to the Association of Research Libraries’ SHARE plan, one of the key differentiators of which is that, unlike the publisher-led CHORUS, it addresses the data aspect of the OSTP mandate as well as the published paper part. But in the meantime, “We’re losing a lot of data right now,” Vines told LJ. “And no one has given me a compelling reason why we should let it disappear.”

About Ian Chant

Ian Chant is a former editor at LJ and a freelance journalist whose work has appeared in Scientific American and Popular Mechanics and on NPR.



  1. Ramune Kubilius says:

    It’s hard to believe I’m reading a news item written for library folks (not a popular magazine or newspaper), when I read: “A recent study in the journal Current Biology found…” without any clues as to the authors, the title, or the date of the article. That makes extra work for those of us who may want to look up the original study…

  2. Stephen Maher says:

    My thoughts exactly, Ramune!


  3. Ian Chant says:

    An oversight, and one that’s now corrected. Thanks for your input!

    Also, I’m curious in situations like these – is it more helpful to link to the abstract – since most of our readers will likely have access to the journal in question – or something like the arXiv pre-press, which is less pretty, but freely available? Look forward to hearing everyone’s thoughts, and thanks for reading!

    • Good question. Link to abstract is probably the minimum…Depends on where that abstract “sits”, there may be links to full-text, eg. oa. Unlikely that it will be the version of the abstract where the user’s institutional link resolver will lead to licensed full-text, but what do I know…