June 26, 2013

Do blog posts correlate with a higher number of future citations?

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American

Do blog posts correlate with a higher number of future citations? In many cases, yes, at least for Researchblogging.org (RB). Judit Bar-Ilan, Mike Thelwall and I already used RB, a science blogging aggregator for posts citing peer-reviewed research, in our previous article.

RB has many advantages (if you read the previous article’s post, you can probably skip this part), the most important being structured citation(s) at the end of each post. It has human editors, so we didn’t have to check for spam or pseudo-science blogs. In short, RB gives us those bloggers who care about and are familiar enough with research to refer to it in a formal way. Of course, it also has its disadvantages; it’s self-selecting, so we can gather only data from bloggers who bothered to register with it; also, RB is life-science oriented, so the results aren’t necessarily true for other disciplines.

In our previous research we found that RB bloggers are highly educated (32% earned a PhD) and that most (59%) are part of the academic system in one way or another. So, we knew that many RB bloggers either belong or used to belong to the academic system, and we wanted to see if, as a group, they cover articles which will be better cited in future peer-reviewed literature than articles from the same journal and year that they didn’t cover.

As a rule, we differentiate between blog mentions and blog citations. Blog mentions are any sort of reference to scholarly material in blogs, while blog citations are mentions of scholarly material that are written in a structured style (e.g., APA, MLA) and appear in blog posts.

Methodology

As I wrote earlier, the idea was to take blog posts which covered articles from the same year and see whether those articles, as a group, would later receive more citations than articles from the same year and journal which weren’t covered. The problem was that RB was launched around 2008. Since we studied the citations at the beginning of 2013, citations from peer-reviewed journals had not had much time to accumulate. We knew from previous research (Glänzel & Schoepflin, 1995) that in the life sciences, to which most of the journals and articles in the sample belonged, articles reach their citation peak about three years after publication, counting the publication year (biomedical fields tend to be fast-moving). That gave us 2009 and 2010 to work with.

We downloaded all RB data from 2009-2010 and looked at all the posts from a given year that reported on articles from the same year (e.g. a 2009 post covering a 2009 article). There were 4013 posts of that kind in 2009 and 6116 in 2010. Next we limited the sample to journals with 20 or more articles published in the journal and covered during 2009 and 2010. The cut-off of 20 articles was a compromise: we wanted to have as many journals as possible in the sample, but also wanted the results to be statistically reliable. The cut-off left 12 journals from 2009 and 19 from 2010. For both years, the most popular journals were PLoS One, PNAS, Science and Nature (not necessarily in this order).
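For readers who like to see the selection step spelled out, here is a minimal Python sketch of the filtering described above. It is only an illustration, not the code we used: the record layout (the keys `post_year`, `article_year`, `journal` and `article_id`) and the function name `select_journals` are hypothetical, and it assumes the 20-article cut-off is applied separately for each year.

```python
from collections import defaultdict

MIN_ARTICLES = 20  # the cut-off described in the text


def select_journals(posts, year, min_articles=MIN_ARTICLES):
    """Return {journal: set of covered article ids} for journals passing the cut-off.

    `posts` is an iterable of dicts with the hypothetical keys
    'post_year', 'article_year', 'journal' and 'article_id'.
    """
    covered = defaultdict(set)
    for post in posts:
        # Keep only posts that cover an article published in the same year.
        if post["post_year"] == year and post["article_year"] == year:
            # A set counts an article once even if several posts covered it.
            covered[post["journal"]].add(post["article_id"])
    # Drop journals with too few covered articles (assumed per-year threshold).
    return {j: ids for j, ids in covered.items() if len(ids) >= min_articles}


# Toy data with a lowered threshold, just to show the behaviour.
toy_posts = [
    {"post_year": 2009, "article_year": 2009, "journal": "A", "article_id": "a1"},
    {"post_year": 2009, "article_year": 2009, "journal": "A", "article_id": "a1"},
    {"post_year": 2009, "article_year": 2009, "journal": "A", "article_id": "a2"},
    {"post_year": 2009, "article_year": 2008, "journal": "A", "article_id": "a3"},
    {"post_year": 2009, "article_year": 2009, "journal": "B", "article_id": "b1"},
]
toy_selection = select_journals(toy_posts, 2009, min_articles=2)
```

In the toy data, journal A keeps two distinct same-year articles and journal B is dropped for having only one.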

Tables 1 and 2 show the journals for 2009 and 2010. Three journals (Current Biology, Journal of the American Chemical Society and Nature Neuroscience) didn’t make the threshold for 2010, and 10 new journals were added to the remaining ones.

Medians - for each journal we calculated the median number of citations for the group of articles covered by bloggers and for the group that wasn’t covered. We used medians rather than averages because citation numbers for articles in the same journal tend to be highly skewed, and averages would have been affected by the extreme values. For 10 of the 12 journals in 2009 the medians for the covered groups were higher than those for the non-covered ones. The same was true for 17 out of 19 journals in 2010.
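To show why the median is the safer summary for skewed citation counts, here is a small Python sketch using only the standard library. The citation numbers are invented for illustration and are not taken from the study, and `compare_groups` is a hypothetical helper name.

```python
from statistics import mean, median


def compare_groups(covered, not_covered):
    """Compare the median citation counts of two groups of articles."""
    covered_median = median(covered)
    not_covered_median = median(not_covered)
    return {
        "covered_median": covered_median,
        "not_covered_median": not_covered_median,
        "covered_higher": covered_median > not_covered_median,
    }


# Invented citation counts: one very highly cited article skews the group.
toy_covered = [2, 3, 4, 5, 6, 250]
toy_not_covered = [1, 2, 2, 3, 4, 5]

toy_mean = mean(toy_covered)      # pulled far upwards by the single outlier
toy_median = median(toy_covered)  # barely affected by the outlier
toy_result = compare_groups(toy_covered, toy_not_covered)
```

In this toy group a single article with 250 citations drags the mean up to 45, while the median stays at 4.5, much closer to what a typical article in the group received.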

We then tested the differences between the covered and non-covered groups for statistical significance (Mann-Whitney test). In 2009, 7 out of the 12 journals (58%) had significant differences between the medians at p<.05 (the citation window was 2009-2011; there’s a mistake in table 6’s column headline that says 2010-2012 – please ignore, it won’t be like that in the final version). In 2010, 12 out of the 19 journals (68%) had significant differences at p<.05 for the citation window 2010-2012. We also calculated the 2010-2011 citation window for 2009 and the 2011-2012 citation window for 2010 to see if there was any difference, but the results were very similar (the data for these citation windows isn’t shown in the article).
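For the curious, here is a rough Python sketch of a two-sided Mann-Whitney test. It is not the procedure or software we used. It relies on the normal approximation with tie and continuity corrections, which is an assumption that only suits reasonably large groups, and the function name `mann_whitney_u` is hypothetical. In practice a library routine such as `scipy.stats.mannwhitneyu` would be the more usual choice.

```python
import math


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation).

    Returns (U, p), where U is the smaller of the two U statistics.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both samples, remembering which group each value came from.
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * n
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values share the average of the 1-based ranks i+1 .. j.
        avg_rank = (i + 1 + j) / 2
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t
        i = j
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, combined) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mu = n1 * n2 / 2
    # Variance of U, reduced to account for ties.
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:
        return u, 1.0
    # Continuity correction; clamp at zero so p never exceeds 1.
    z = max((abs(u - mu) - 0.5) / math.sqrt(var), 0.0)
    p = math.erfc(z / math.sqrt(2))  # two-sided tail probability
    return u, p


# Invented citation counts for one journal, for illustration only.
toy_covered = [10, 12, 15, 20, 22, 30, 35, 40]
toy_not_covered = [1, 2, 3, 4, 5, 6, 7, 8]
toy_u, toy_p = mann_whitney_u(toy_covered, toy_not_covered)
significant = toy_p < 0.05
```

The test works on the ranks of the pooled citation counts rather than on the raw values, which is why it copes with the skewed distributions mentioned above.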

Martin: “But why? Why? I mean, why? Why?”

Douglas: “Four excellent questions.”

Cabin Pressure, “Douz”

We believe it’s mainly the “wisdom of crowds” in action here. It makes sense that a large group of people with a scholarly background in a field can guess more accurately which articles are likely to have more impact in that field than an editor and 2-3 peer-reviewers. Notice the improved accuracy of bloggers between 2009 (887 items overall in the journals studied) and 2010 (1394 items). It’s true that the bloggers didn’t have a citation advantage in all journals, but that could have had something to do with the 20-article threshold. Had we chosen, say, a 50-article threshold we would have had 10 journals in 2009 and 2010 combined, out of which only 2 would have had non-significant results.

We also looked into other “Whys”; we know that reviews are over-represented among highly-cited articles, so we checked to see if there was an over-representation of reviews among the articles covered by blogs as well, in comparison to their representation in every journal’s general population in the same year. However, it doesn’t seem reviews are over-represented in the covered articles population (though we can’t have statistical significance because of the small numbers of reviews), so this speculation fell through.

Another “Why” we looked at was a possible media-blogs connection. The median differences for the New England Journal of Medicine (NEJM) between the groups covered and not covered by blogs were especially high (172 vs. 56 in 2009; 138 vs. 51 in 2010). Since the NEJM is an elite journal which has many of its articles covered in the media, we wanted to see if bloggers tend to choose NEJM articles which were also reported by the New York Times and Reuters. The results weren’t surprising: 21 out of 26 articles in 2009 (81%) and 20 out of 38 articles in 2010 (53%) were covered by Reuters and/or the New York Times. The numbers of NEJM articles differ from those in earlier tables because some articles were covered by more than one post, some posts covered more than one journal article and some news articles covered more than one journal article. The bloggers were usually not far behind the mainstream media – up to a month’s difference between the news article and the blog post for most articles. So at least for NEJM there could be a media-blog connection, though we can’t tell what kind of connection. However, most journals aren’t covered by the media as thoroughly as NEJM is, so we can’t say bloggers take their cues from the media.

The main limitations of the study were the time frame – we could only take posts from 2009 and 2010 – and the relatively small number of articles. Despite these limitations, I think the results are rather promising and would love to repeat the study in the future to see if they hold.

The article doesn’t have an official publication date yet, but it’ll be published in the Journal of the American Society for Information Science and Technology (JASIST) and can for now be found on Professor Thelwall’s site (PDF).

References

Glänzel, W., & Schoepflin, U. (1995). A bibliometric study on ageing and reception processes of scientific literature. Journal of Information Science, 21(1), 37-53. DOI: 10.1177/016555159502100104

Shema, H., Bar-Ilan, J., & Thelwall, M. (2012). Research blogs and the discussion of scholarly information. PLoS ONE, 7(5). PMID: 22606239

Shema, H., Bar-Ilan, J., & Thelwall, M. (in press). Do blog citations correlate with a higher number of future citations? Research blogs as a potential source of alternative metrics. Journal of the American Society for Information Science and Technology.