The changing nature of how and where scientists share raw data has sparked a growing need for guidelines on how to cite these increasingly available datasets.

Scientists are producing more data than ever before due to the (relative) ease of collecting and storing it. Often, scientists collect more than they can analyze. Instead of letting this unanalyzed data die when the hard drive crashes, they are releasing it in raw form as a dataset. As a result, datasets are increasingly available as separate, stand-alone packages. In the past, any data available for other scientists to use would have been associated with some other kind of publication - printed as a table in a journal article, included as an image in a book, etc. - and cited as such.

Now that we can find datasets "living on their own," scientists need to be able to cite these sources.

Unfortunately, the traditional citation manuals do a poor job of helping a scientist figure out what elements to include in the reference list, either ignoring data or over-complicating things.

I looked at the citation manuals on the shelf behind the reference desk in my library and didn't find much clear advice. Many manuals haven't been updated since 2006 or 2007 and are less likely to offer up-to-date advice. Some of the manuals briefly mention the idea of citing data as if it comes from a print reference book. Others focus on the database the data comes from and assume there won't be a clear author (or creator). Many also recommend including the place of publication, which is tricky to pin down and largely irrelevant in an online world.

The best and most relevant advice comes from the DataCite cooperative. Their mission is to encourage scientists to cite datasets in their work, and they provide a clear pattern for doing so:

Creator (PublicationYear): Title. Publisher. Identifier

Authors can easily adjust the formatting as necessary to meet the style guide of the journal. Where applicable, the DataCite folks recommend including two additional pieces of information, Version and ResourceType:

Creator (PublicationYear): Title. Version. Publisher. ResourceType. Identifier
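For anyone assembling reference lists with a script, the two patterns above can be sketched as a small Python function. This is only an illustration of the element order shown here, not an official DataCite tool: the function name, the parameter names, and the sample values in the usage note are all made up for the example.

```python
from typing import Optional


def format_datacite_citation(
    creator: str,
    publication_year: int,
    title: str,
    publisher: str,
    identifier: str,
    version: Optional[str] = None,
    resource_type: Optional[str] = None,
) -> str:
    """Build a citation string following the element order shown above.

    Hypothetical helper: the name and signature are assumptions made
    for this sketch, not part of any DataCite library.
    """
    # Title always comes first; Version and ResourceType are optional
    # and slot in around Publisher, matching the longer pattern.
    parts = [title]
    if version:
        parts.append(version)
    parts.append(publisher)
    if resource_type:
        parts.append(resource_type)

    body = ". ".join(parts)
    return f"{creator} ({publication_year}): {body}. {identifier}"
```

Calling it with invented values such as `format_datacite_citation("Doe, J.", 2011, "Example field measurements", "Example Repository", "https://doi.org/10.1234/example")` returns a string in the short form; passing `version` and `resource_type` as well produces the longer form. A journal's style guide may still call for different punctuation.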

Let's look at a few of the elements more closely:

  • Creator - Occupying the place where "Author" would normally go, this serves the same function and comes with the same questions. Multiple names may be listed, or an entire organization might be listed as the creator. One of the rationales behind publishing data is to give appropriate credit to the folks who collected the data.
  • Publisher - The entity that makes the data available to others. This might be a data publisher like Dryad, or an institutional repository at an academic institution, or many other options.
  • Identifier - This should be the DOI assigned to the dataset. A DOI (Digital Object Identifier) is a unique, persistent identifier for a digital item. It helps you find the item, even if its URL changes. DOIs are often registered by publishers, and DataCite offers a DOI registry service for datasets. DataCite recommends including a DOI as a clickable URL, e.g.
  • ResourceType - A one-word description of the kind of thing you are citing. Examples include: Image, Dataset, Software, Sound, Audiovisual, etc.
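If you are handed a DOI in one of its common written forms, turning it into a clickable link can be sketched roughly as follows. The function name is hypothetical, the list of prefixes it strips is my own assumption about what people tend to paste, and the `https://doi.org/` resolver prefix is assumed as the link form; the DOI in the usage note is invented.

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI written in a common form into a resolver URL.

    Hypothetical helper for illustration; the accepted prefixes below
    are an assumption, not an exhaustive or official list.
    """
    doi = doi.strip()
    lowered = doi.lower()

    # Strip a leading "doi:" label or an existing resolver address.
    for prefix in (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    ):
        if lowered.startswith(prefix):
            doi = doi[len(prefix):].strip()
            break

    # DOIs begin with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"does not look like a DOI: {doi!r}")

    return "https://doi.org/" + doi
```

For instance, `doi_to_url("doi:10.1234/example")` gives back a link built from the resolver prefix and the bare DOI, which can then be dropped into the Identifier slot of the citation.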

Data can then be cited in-text just like books and articles. Importantly, the folks who worked hard to collect and organize that data get credit (through your citation) for their work.

The DataCite schema for citing data acknowledges that data often exist as independent resources and makes citing these resources simple and straightforward. You can check out their detailed metadata schema, or learn more about the organization on their website.