
All that glisters is not gold: Quality of Public Domain Chemistry Databases

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American


Shakespeare wrote "All that glisters is not gold" and how right he was. Whether it’s the before and after shots of models who have lost an incredible 10 pounds in just two days on a particular pill, or the couch potato who showed a six pack of abs in just 2 weeks after drinking some particular concoction, most of us know we are being manipulated by marketers in these cases. The same caution applies to the contents of the internet: as powerful and enabling as it is, it hosts everything from one person’s country-toppling agendas to another’s staggering amounts of dross and drivel.

As a practicing chemist (albeit without a lab), for almost fifteen years I have lauded the promise of the internet as a web-based information management system for chemistry data [1]. Scientists, in particular, have aligned with the concept of "Information wants to be free (and expensive)", and large amounts of scientific data have become available on the internet, freely accessible and, in an increasing trend, free of licenses that restrict use. Do you remember those days when the human genome databases cost money? We should not forget, however, that someone is paying for these data. Whether the money comes from government grants, distributed tax dollars or charitable donations, scientific data are far from free to generate, regardless of the increased discussion of sub-$1000 genomes for all.

The scientific community embraces "free" and has started to use the data provided online for computer modeling, as the basis for designing experiments and as reference data. Remembering that all that glisters is not gold, you will not be surprised to hear that, according to a search on one of the world’s most popular chemistry databases, diamonds are made of the same material as natural gas. Or would you? More on that later.




In the social network of chemists online I am known as ChemConnector [2]. I spend a lot of time figuring out ways by which the chemistry community can source data, preferably of high quality, as well as how to engage chemists in contributing, validating and curating data online, to the benefit of us all. As a hobby project a small group of us developed an online chemistry database known as ChemSpider. In less than 3 years the portal became one of the primary resources for chemists worldwide, hosting over 20 million unique chemicals and linking together almost 300 data sources across the internet.

In May of 2009 one of the world’s most prestigious chemistry societies, the Royal Society of Chemistry (RSC), acquired ChemSpider and provided resources to enhance the functionality and expand the content. Why did ChemSpider become so popular? One of the reasons was our ongoing commitment to curating data in an attempt to clean up chemistry content across the internet. That suggests that online chemistry data might be questionable. Are there examples?

Methane is the simplest alkane, with one carbon atom and four hydrogen atoms. It is of course listed on Wikipedia. It is the principal component of natural gas. It is also contained within many of the public chemistry databases. Diamond differs ever so slightly: it is a gas only under some pretty extreme circumstances, it is a rather hard material, and it is used in jewelry. It is NOT methane! However, the National Institutes of Health database, PubChem, says they are equivalent according to a name-based search. Searching for diamond retrieves the record for methane as well as 67 others. This is because the term diamond shows up in many chemical names, such as "Diamond Chrome Pure Blue". However, if we search just for the term diamond we find one record, methane.

Figure 1: A search on NIH’s PubChem database for "diamond" returns the chemical structure of methane.

If we look at the chemical names associated with methane we find many "dubious" names that would have most of the population, even those not skilled in chemistry, scratching their heads. Methane is not graphite, neither is it animal bone charcoal! It is rather easy to explain away some of these errors as follows. Methane has a single carbon atom. Diamond, graphite, buckminsterfullerene (buckyball) and carbon nanotubes are all forms of carbon (and all in the list of names associated with methane!). A specific electronic representation of a compound, called a SMILES string [3], represents both methane and all forms of carbon simply as the letter "C". It is likely that during data assembly these records were collapsed into one. It can be explained, but it’s not acceptable, especially at a time when internet resources are increasingly being used for reference data.

Figure 2: A subset of the chemical names associated with the record for Methane on the NIH’s PubChem database. Methane is referred to as graphite, carbon nanotube, diamond, fullerene and a number of general organic chemical names.
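
To make the likely failure mode concrete, here is a minimal sketch in Python of how names can collapse when deposited records are merged on nothing but their SMILES string. The records and the merge_by_smiles function are hypothetical illustrations; this is not PubChem's actual data or deposition pipeline.

```python
from collections import defaultdict

# Hypothetical deposits: (name, SMILES string as supplied by the depositor).
# Every form of elemental carbon has been submitted as a bare "C".
deposits = [
    ("methane", "C"),
    ("diamond", "C"),
    ("graphite", "C"),
    ("ethanol", "CCO"),
]


def merge_by_smiles(records):
    """Group deposited names under the SMILES string they were submitted with."""
    merged = defaultdict(list)
    for name, smiles in records:
        merged[smiles].append(name)
    return dict(merged)


merged = merge_by_smiles(deposits)
print(merged["C"])  # every name deposited with the SMILES "C" lands on one record
```

Read strictly, an unbracketed "C" in SMILES carries implicit hydrogens and therefore means methane, while a bare carbon atom would be written "[C]". A merge keyed only on the submitted string has no way to notice that a depositor meant the element rather than the gas.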

Consider slightly more complex chemistry, for example a simple chemical such as Vitamin K1, a vitamin commonly found in green plants. A recent investigation as to whether the correct structure of this compound was captured in public domain databases showed that in many cases the structure was wrong as explained in an online video. The structure was also incorrect on Wikipedia at the time but yours truly made the edit, a benefit of crowdsourced curation.

Figure 3: A series of structure depictions for Vitamin K1 from a number of web-based databases. The correct structure for Vitamin K1 is represented by a specific orientation around the double bond and the presence of two "stereocenters" represented by the hashed bonds. None of the six online databases represented have the correct structure representation. The structure on Wikipedia has since been edited by the author.

Chemical compounds can be very complex but we would assume that those included on pharmaceutical tablet labels that are distributed at the pharmacy would be correct, especially since they are hosted at an FDA website called DailyMed. However, a review of data quality on the DailyMed website showed examples of categorically incorrect chemical representations of drugs, even according to basic chemistry rules! [4]

So what can we trust in terms of "quality chemistry"? Trust is subjective in nature and, in terms of chemistry databases, the majority of users grant trust without objective validation. In 2010 I asked the community about their level of trust in online chemistry databases (Survey online at http://www.surveymonkey.com/s/FJMDFGF). The nature of the responses is discussed elsewhere. In particular, I was quite surprised by the level of trust for Wikipedia relative to NIH databases such as PubChem.

Wikipedia has an active crowd of curators, while PubChem puts very little work into validating and checking data quality and has to trust the depositors to deposit quality data. While Wikipedia has received a lot of press regarding the validity of the data, and the media have challenged the premise and future of the platform, Wikipedia is now a mainstay for many people seeking reference data and information. In particular, a dedicated team of chemists has worked very hard in the past two and a half years to validate Wikipedia chemistry content, especially the "ChemBoxes" and "DrugBoxes" [5,6] on the website, validating each structure one bond at a time. While Wikipedia validation of such chemistry related articles is "crowdsourcing", the crowd is actually rather small for this effort (more like a line at Dunkin’ Donuts): generally fewer than a half dozen dedicated scientists and some extremely well-tuned ‘bots to assist in the work.

Now here comes the challenge. There are literally hundreds of databases containing chemistry related information, specifically chemical structures. These are chemical vendor databases, aggregating databases collecting data from multiple sources and commonly containing millions of chemicals, niche databases of a few thousand compounds (metabolism, drugs, side effects etc.), and a variety of other contributions to the online chemistry content. The strong relationship between chemistry and biology means that much of the effort behind these databases is to support pharmaceutical science. We are also seeing databases released with great fanfare which contain very high numbers of errors. For example, the NIH funded NPC Browser, announced in a Science Translational Medicine paper [7], has been reviewed by Williams [8-10] and discussed by Ekins [11].

The danger here is that chemistry errors are proliferating from one database to another as the content is reused. As a result of the diverse nature of these databases the quality also varies dramatically [12] and there is no single effort, as yet, to produce a quality ranking for the content. Efforts have begun to gather the chemistry databases, at least, into a common directory of resources, and review the content. We have set up a scientific databases wiki so that database hosts (and anyone interested for that matter) can add their databases to the list. For the chemistry related data each database will be painstakingly examined using a selected set of compounds as a review set to hopefully garner a semi-quantitative representation of data quality. Time will tell whether such an approach will be of value. But it’s a long overdue start.
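
As a rough illustration of what a semi-quantitative review might compute, the sketch below scores one database against a review set by counting how many stored structures match a trusted reference. Everything in it is an assumption made for illustration: the compound names, the placeholder structure keys and the quality_score function are hypothetical, and a real review would compare canonical structure identifiers and inspect each mismatch by hand.

```python
def quality_score(database, review_set):
    """Return (fraction of checked compounds that match, number checked).

    Compounds missing from the database are skipped rather than counted as errors.
    """
    checked = 0
    correct = 0
    for name, reference_key in review_set.items():
        stored_key = database.get(name)
        if stored_key is None:
            continue  # compound not present in this database
        checked += 1
        if stored_key == reference_key:
            correct += 1
    score = correct / checked if checked else None
    return score, checked


# Placeholder structure keys, not real identifiers.
review_set = {
    "compound A": "KEY-A",
    "compound B": "KEY-B",
    "compound C": "KEY-C",
    "compound D": "KEY-D",
}
database = {
    "compound A": "KEY-A",
    "compound B": "KEY-X",  # stored structure differs from the reference
    "compound C": "KEY-C",
}

score, checked = quality_score(database, review_set)
print(f"{score:.2f} of {checked} review compounds matched")
```

Reporting the number of compounds checked alongside the score matters, since a database that holds only a handful of the review compounds could otherwise look better than it deserves.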

Certainly the chemistry databases that are available online have the potential to impact the research of chemists throughout the world. When a molecular property or spectrum has been measured, is deemed pre-competitive, and can be shared, putting this into one of the online databases would be of value to chemists searching for reference data. This can contribute to improved research and reduced costs for data generation and analysis. We can then stop questioning data quality [13-15] when building computational models as the data will include the assertions, attributions and details of measurement.

When a reaction synthesis is performed it can now be immediately exposed online and shared with the community without publishing. An example is the ChemSpider SyntheticPages, a crowdsourced reaction database with open "peer review". With wikis, data validation and curation tools on the public databases and a collective crowdsourcing mentality increasingly prevailing, the quality of data will improve as the scope continues to expand. Within a few years we can surely hope that the internet is the source of high quality reference data that the world has helped assemble. Perhaps a bigger question for the future is: who should be hosting such databases and paying for the manual curation efforts? Until then we should continue to question everything, be cautious of what we trust and not buy diamonds from a natural gas vendor.

For those of you who read my Shakespearean quote as "All that glitters is not gold" the spelling I used was indeed "glisters". I refer you to Wikipedia for that "misspelling".

References:

[1] Web-based information management system, D.E. Brown, A.J. Williams and D. McLaughlin, TrAC Trends in Analytical Chemistry, Volume 16, Issue 7, August 1997, Pages 370-380, doi:10.1016/S0165-9936(97)00046-0

[2] Antony Williams, ChemConnector: LinkedIn page.

[3] Wikipedia, SMILES article.

[4] Antony Williams, ACS Salt Lake City, Spring 2009 "Cleaning up chemistry for the pharma industry: delivering a flexible platform for interrogating the FDA DailyMed website"

[5] ChemBox.

[6] DrugBox.

[7] Huang, R. et al. (2011) The NCGC Pharmaceutical Collection: A Comprehensive Resource of Clinically Approved Drugs Enabling Repurposing and Chemical Genomics. Sci Transl Med 3 (80), 80ps16

[8] Antony Williams, ChemConnector Blog, Reviewing Data Quality in the NCGC Pharmaceutical Collection Browser.

[9] Antony Williams, ChemConnector Blog, What is a Drug? Data Quality in the NCGC Pharmaceutical Collection Browser Part 2.

[10] Antony Williams, ChemConnector Blog, Rabbits, Potatoes and other Vegetables in the NCGC Database.

[11] Sean Ekins, Collabchem Blog, Collaboration could give us a gold standard database of drugs.

[12] A Quality Alert and Call for Improved Curation of Public Chemistry Databases, A.J. Williams and S. Ekins, Accepted for publication in Drug Discovery Today, July 2011

[13] Oprea, T. et al. (2002) On the propagation of errors in the QSAR literature. In Euro QSAR 2002

[14] Fourches, D. et al. (2010) Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J Chem Inf Model 50 (7), 1189-1204

[15] Young, D. et al. (2008) Are the chemical structures in your QSAR correct? QSAR Comb Sci 27, 1337-1345

About The Author: Antony Williams is a chemist and host of the online chemistry database ChemSpider. ChemSpider was initiated as a hobby project and acquired by the Royal Society of Chemistry to facilitate data access and sharing for the chemistry community. He is widely published with over 130 peer-reviewed publications and book chapters and is a co-author of the newly released book, Collaborative Computational Technologies for Biomedical Research (Wiley). He is the co-host of the Scientific Mobile Apps and Scientific Databases wikis. He is known on social networks as ChemConnector. Follow on Twitter @ChemConnector or visit www.chemconnector.com. Williams holds a PhD from the University of London, UK and is a resident of North Carolina.

The views expressed are those of the author and are not necessarily those of Scientific American or the Royal Society of Chemistry.
