When I talk to most scientists and mention the word "metadata" they look at me as if I've grown a second head. Despite the fact that these folks regularly use and create metadata (not to be confused with megadata or "big data" which is a whole other subject), many have not heard of the term.
Broadly speaking, metadata is simply a structured description of something else. The most popular example of metadata comes from the library catalog. Each book has a title, author, call number, publisher, ISBN etc. listed in the online catalog. These elements comprise the book's metadata, and there are rules to make sure that things are standardized.
Without metadata, discovery and reuse of digital information would be much harder. This is why discussion about metadata has increased greatly since the second half of the twentieth century.
The best way to understand metadata is to look at a few examples of metadata at work.
Here is part of a digital data table:
If you stumbled across this list on the web you might be able to guess what it was, but you couldn't be sure. It would also be difficult to find this list again if you were looking for it. The list creator might find this pretty useful, but if he or she shared it with others, we would want some added information to help the new user understand what he or she was looking at: this is metadata.
Metadata for this data file:
- Who created the data: Santa Claus, North Pole. An email address would be nice. This way we have some contact information in case we need clarification.
- Title: "My List" isn't a title that is conducive to finding the file again. While it might be tempting to just call this "Santa's list," that won't help other folks who see this file. The title should be descriptive of what the data file contains, and "Santa's List" could be many things: Santa's list of reindeer? Santa's list of toys that need to be made? A more descriptive title might be "Santa's list of naughty and nice children."
- Date created: We don't want to confuse this year's list (2012) with last year's list (2011). This could lead to all sorts of unfortunate events where nice kids get coal, naughty kids get presents, or infants (who weren't around in 2011) get nothing at all.
- Who created the data file: Perhaps Santa created the data, but then used an elf to input the data into a computer file. Many computer programs automatically record this information, although you may not realize this.
- How the list was created: Behavioral scans? Parental surveys? Elf on the Shelf reports? All of the above? In order to reuse this data in future research projects, we need to know how it was collected, including collection instruments and methodologies.
- Definitions of terms used: What is "naughty"? What is "nice"? How did Santa place a child into one category or another?
- File type: What kind of file is it? The data here are pretty simple, but Santa has lots of different file formats to choose from: Excel, .csv, XML, etc. Knowing the file type helps end users determine if they can use the data.
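For the programmers in the audience, here is a minimal sketch of how Santa's metadata might be stored alongside the data file. The field names are my own invention for illustration (they do not come from any formal metadata standard), and anything the list above leaves as an open question is recorded as `None` rather than guessed at.

```python
import json

# Hypothetical metadata record for Santa's list.
# Field names are illustrative, not drawn from a formal standard.
santa_list_metadata = {
    "title": "Santa's list of naughty and nice children",
    "data_creator": "Santa Claus, North Pole",
    "contact_email": None,       # unknown: an email address "would be nice"
    "date_created": "2012",
    "file_creator": None,        # unknown: Santa himself, or perhaps an elf
    "collection_methods": None,  # unknown: scans? surveys? reports?
    "definitions": {"naughty": None, "nice": None},  # still undefined
    "file_format": "csv",        # assumed here purely for illustration
}

# Serialize the record to JSON text, the kind of thing that could be saved
# in a small "sidecar" file next to the data file.
sidecar_text = json.dumps(santa_list_metadata, indent=2)

# Reading the text back gives the same record, so nothing is lost.
restored = json.loads(sidecar_text)

print(sidecar_text)
```

Keeping the description in a plain, structured text format like this means both people and computers can read it, even if they can't open the data file itself.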
Naturally, a different kind of item might have a completely different set of metadata.
This is my mom's favorite Christmas picture of me:
My mom remembers the details of where, when and how this picture was taken, but if she isn't around to tell the story, metadata can help:
Metadata for this photo:
- Date the photo was taken: December 1981. The digital version was created on 12/13/2012.
- Who took the photo: A mall employee. This can have implications for who owns the rights to use and distribute the image. The photographer? The folks who paid to have the photo taken?
- Camera used to take the photo: I have no idea what camera was used for this picture. Luckily, modern digital cameras often automatically record this information as a part of the .jpg file. Digital cameras can also record all the detailed camera settings (for those who understand these things).
- Location where the photo was taken: Arnot Mall, Horseheads, NY. Some digital cameras can automatically capture this information too, using built-in GPS.
- Picture format: .jpg
- Picture size: The original print is 3.5 x 5.5 inches (I think). The original scanned image is 852 x 1116 pixels.
- Description of the photo: Bonnie J M Swoger, age 3, sitting on Santa's lap. Her grandpa brought her to the mall to visit Santa. While not enthusiastic about it, she loved her grandpa and obliged him by sitting on Santa's lap. A description like this matters because, currently, the primary way of searching for an image is for a computer to search the text associated with it. Good file names and good descriptions can be key to finding the image again.
- Copyright information: I don't think the mall Santa folks were thinking about copyright in 1981 because there wasn't an easy way to copy the photo. These days, it is important to state explicitly what rights other folks have to use the picture. Creative Commons licenses are great for being explicit about what users can do with your content.
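To show why the text attached to a photo matters for finding it again, here is a small sketch that stores the photo metadata above as a record and searches its text fields. The field names and the `matches` helper are hypothetical, made up for this example, and the ISO-style dates assume that 12/13/2012 means December 13, 2012.

```python
# Hypothetical metadata record for the photo described above.
photo_metadata = {
    "date_taken": "1981-12",
    "date_digitized": "2012-12-13",  # assumes 12/13/2012 is month/day/year
    "photographer": "A mall employee",
    "camera": None,                  # unknown
    "location": "Arnot Mall, Horseheads, NY",
    "format": "jpg",
    "pixel_size": (852, 1116),
    "description": "Bonnie J M Swoger, age 3, sitting on Santa's lap.",
    "rights": None,                  # not stated
}


def matches(record, query):
    """Return True if the query appears in any text field (case-insensitive)."""
    needle = query.lower()
    return any(
        needle in value.lower()
        for value in record.values()
        if isinstance(value, str)
    )


print(matches(photo_metadata, "santa"))     # found in the description
print(matches(photo_metadata, "reindeer"))  # not mentioned anywhere
```

Without the description and location fields, a text search would have nothing to match against, and the picture would effectively be lost in the pile.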
Depending on the type of data, there may be many more metadata elements. Geospatial data, chemical data, astronomical data, etc. each have specific descriptive elements that are used. Many organizations have developed standards describing what kinds of metadata should be included and how the metadata should be formatted. This helps data creators add metadata that can be read by computers and reused by other interested folks.
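One thing a standard makes possible is an automatic check that nothing required has been left out. The sketch below assumes a made-up standard with four required elements. The element names and the `missing_elements` helper are hypothetical and do not represent any real organization's rules.

```python
# Hypothetical list of elements that an imaginary standard requires.
REQUIRED_ELEMENTS = {"title", "creator", "date_created", "file_format"}


def missing_elements(record, required=REQUIRED_ELEMENTS):
    """Return the required elements that are absent or empty, sorted by name."""
    return sorted(name for name in required if not record.get(name))


incomplete_record = {"title": "My List", "file_format": "csv"}
complete_record = {
    "title": "Santa's list of naughty and nice children",
    "creator": "Santa Claus, North Pole",
    "date_created": "2012",
    "file_format": "csv",
}

print(missing_elements(incomplete_record))  # ['creator', 'date_created']
print(missing_elements(complete_record))    # []
```

A real standard would also say how each value should be formatted, but even a simple completeness check like this catches the "My List" problem before the file is shared.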
Once you have well-established metadata formats, you can start analyzing the metadata. Common metrics used to evaluate scholarly publications (impact factor, altmetrics, etc.) all rely on high-quality metadata.
I think we can agree that Santa would use sound data management practices, including the creation and use of proper metadata, to keep track of his gift-giving and logistical data. He would want the rest of us to use good metadata so we can always locate that 30-year-old picture of him, too.
Be like Santa and make sure your data is findable and reusable: use good metadata!
For a more robust (yet clear and understandable) definition of metadata, see NISO's Understanding Metadata (PDF).