September 26, 2012

Data Visualization: Trudging through the Digital Road to New Scientific Discovery

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American

Robert Simpson, a postdoctoral researcher at the University of Oxford, commutes to work on the buses provided by the university. Imagining commuters on the bus, one might envision a collection of individuals reading the paper, drinking coffee, writing in planners, or simply chatting up the person next to them. Recently, Simpson has been using his commute to chart word occurrences from astronomy journals as matrix visualizations on his laptop. Some people bus to the beat of a different drum.

He is building the matrices along the lines of work that Don R. Swanson did at the University of Chicago in the late 1980s. Swanson, an information scientist, discovered a connection between Raynaud's disease and fish oil through a method called literature-based discovery. The literature all came from MEDLINE. Raynaud's causes numbness in the extremities (fingers, toes, etc.) because of blood vessel narrowing. Thanks to Swanson, fish oil was found to bring relief to Raynaud's patients, where previously the relationship had gone unnoticed. No one had ever combined the MEDLINE literature on the benefits of fish oil for blood vessel narrowing with the literature on Raynaud's. Put another way, if many papers mention Raynaud's alongside certain features and terms of the disease, and many other papers mention those same features and terms in relation to fish oil, then chances are better that Raynaud's and fish oil have an interesting relationship.

“In the present article I demonstrate something similar for the pair of literatures on migraine and magnesium [another discovery by Swanson]. The goal of this work is not simply to find unnoticed connections but to develop a systematic approach to the process of hunting for them,” Swanson states in an article he published on the subject. “As in the preceding case, one begins with a disease for which neither cause nor cure is known. The problem is to find, within the literature, indirect evidence that an unknown cure might already exist. The literatures on fish oil and magnesium, respectively, were not fortuitous choices; they were the survivors of a process of elimination.”

Robert Simpson is taking Swanson's process a step further, putting literature-based discovery into a visual matrix that plots the relationships between terms in astronomy journals, drawn mostly from the abstracts available online. Below is one such matrix, from Simpson's blog, Orbiting Frog:

When asked where this idea had come from, the mild-mannered Simpson had this to say:

“I run a conference series called .Astronomy where we have a hack day with the geek elite of astronomy. My pal, local conference organizer, Sarah Kendrew, suggested we try to create a hypothesis generator for astronomy - based on the BrainSCANr project from neuroscience.
Although we hacked on it a bit on that day, I became far more interested afterward. I found myself playing about with a big database of terms and papers and just having fun.
I realized that doing hypothesis generation with the data would require a decent bit of visualization of correlations between pairs of words. You'd need to start with looking at what pairs appear together and then move on to pairs that both correlate with a third word (I haven't done this yet). The word matrices are a great way to instantly see the patterns, and d3.js (software) - which was suggested by our visualization experts Noah Iliinsky and Julie Steele at .Astronomy - includes lots of examples with matrices.”
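The first step Simpson describes, counting which pairs of words appear together, can be sketched as follows. This is an illustrative assumption about how such a matrix might be built, not Simpson's code: the function `cooccurrence_matrix`, the three toy abstracts, and the vocabulary are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_matrix(abstracts, vocabulary):
    """Count, for each pair of vocabulary words, the abstracts containing both."""
    vocab = list(vocabulary)
    counts = Counter()
    for text in abstracts:
        words = set(text.lower().split())
        present = [w for w in vocab if w in words]
        for pair in combinations(present, 2):
            counts[frozenset(pair)] += 1  # unordered pair, so the matrix is symmetric
    # Rows and columns follow the order of `vocab`; the diagonal is left at zero.
    return [[0 if r == c else counts[frozenset((r, c))] for c in vocab] for r in vocab]


# Invented toy abstracts, for illustration only.
abstracts = [
    "galaxy redshift survey",
    "galaxy dust emission",
    "dust redshift galaxy",
]
vocab = ["galaxy", "redshift", "dust"]
matrix = cooccurrence_matrix(abstracts, vocab)
for word, row in zip(vocab, matrix):
    print(word, row)
```

A grid like this is the kind of data a matrix view in a charting library such as d3.js would shade cell by cell.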

Brilliant! Building a hypothesis generator sounds like a sea change that could open doors in science that no one has even considered. Imagine all of the possibilities! Imagine isolating many papers on a subject and having all of the never-recognized relationships between them staring back at you in a handy matrix. What an amazing idea! But what are the drawbacks? Chiefly, the structuring of the data itself.

Enter the wizards of Computer Science and Visual Analytics.

Jeffrey Heer, an assistant professor in the Computer Science Department at Stanford University, said this about visualizing collections of text:

“Text is among our most abundant types of data. However, text is difficult to visualize directly. How does one "plot" a sentence or set of words other than to print them? In practice, text visualization requires both modeling (creating statistical models of the content of text collections) and then visualizing the structure and content of those models.”

Heer has been involved in some amazing work that has produced both successes and instructive failures. At Stanford, his group found significant flaws in the way textually derived data had been modeled, flaws that surfaced when they analyzed a visualization of the data they were working with.

“Probably the most relevant work from our group is on visualizing the shifting similarities among academic disciplines over time, based on an analysis of the text of PhD dissertations.
We were visualizing the results of a chain of models, including text modeling and dimensionality reduction. These models can sometimes give rise to misleading results, which we then spotted in the visualization. This result led us to consider how visualizations must do more than just turn data into images -- it is vital that visualizations support interactive exploration and verification, so that one can not only uncover new hypotheses but begin the process of assessing their credibility. Another result of this work is that the insights gained from the visualizations enabled us to design better machine learning methods, such that our mathematical models of textual similarity better matched the judgments of human experts.”

“Statistics alone are dangerous and they hide a lot,” says Ben Shneiderman, a professor of computer science and a founding director of the Human-Computer Interaction Lab at the University of Maryland. Shneiderman created the widely used treemap, a compact visual display of tree structures that works by “...splitting the screen into rectangles in alternating horizontal and vertical directions as you traverse down the levels.” His work also contributed to the commercial success of Spotfire, a popular visual data analysis company.
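The alternating split in that description can be sketched as a short recursive function. This is a simplified illustration rather than Shneiderman's implementation: the names `size` and `slice_and_dice`, the dictionary layout of the tree, and the choice to split left to right at the top level are all assumptions made for the example.

```python
def size(node):
    """Total size of a node: its own size if it is a leaf, else the sum of its children."""
    children = node.get("children")
    if children:
        return sum(size(child) for child in children)
    return node["size"]


def slice_and_dice(node, x, y, w, h, depth=0, out=None):
    """Assign each node a rectangle (x, y, width, height), alternating the split direction."""
    if out is None:
        out = []
    out.append((node["name"], x, y, w, h))
    children = node.get("children", [])
    total = sum(size(child) for child in children)
    offset = 0.0  # fraction of the parent rectangle already handed out
    for child in children:
        share = size(child) / total
        if depth % 2 == 0:
            # Even levels: divide the width, placing children side by side.
            slice_and_dice(child, x + offset * w, y, share * w, h, depth + 1, out)
        else:
            # Odd levels: divide the height, stacking children top to bottom.
            slice_and_dice(child, x, y + offset * h, w, share * h, depth + 1, out)
        offset += share
    return out


# Hypothetical two-level tree laid out on a 4-by-2 canvas.
tree = {
    "name": "root",
    "children": [
        {"name": "a", "size": 2},
        {"name": "b", "children": [{"name": "b1", "size": 1}, {"name": "b2", "size": 1}]},
    ],
}
rects = slice_and_dice(tree, 0, 0, 4, 2)
for rect in rects:
    print(rect)
```

Each rectangle's area is proportional to the size of the subtree it holds, which is what lets a treemap show a whole hierarchy in a fixed amount of screen space.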

Tim Beardsley profiled Shneiderman in the March 1999 issue of Scientific American, covering his opposition to anthropomorphizing the interactive qualities of computer programs. “The purpose of computing is insight, not numbers, Shneiderman likes to reiterate, and likewise the purpose of visualization is insight, not pictures. What people want in their interactions with computers, he argues, is a feeling of mastery. That comes from interfaces that are controllable, consistent and predictable.”

Contextualizing or structuring the data before it is put into a helpful display seems to be where humans need to be more involved. If you showed someone you had just met a new gold ring, complete with a nice clear stone, and expected tears and acclaim, it would seem odd. But if you had first told them that the person of your dreams had just proposed to you, and that they had a house in Bermuda, the sight of the ring would prove useful in eliciting squeals, tears, and hugs from your new friend. It is the same with visual analytics: if there is no structure for the information, it goes out without context and may ultimately lead to misunderstanding or misrepresentation. Why are you showing me your jewelry? Are you bragging or being vain?

The same concern runs through any conversation about finding efficient ways to structure data so that visual analytics can properly display the relationships within it.

Shneiderman offered the example below of an error he saw in the data of a client who came to his lab. It also highlights the use of visualizations as an error-finding tool:

“I was working with a group that was analyzing emergency room admissions. They had 6400 admissions one month and they were looking at the age of the men and the women, those who were admitted and those who were discharged, etc. They were happily running their statistics and coming up with numbers and statistically significant differences, etc. In 10 minutes, once I put the data through some of the tools we have, there turned out to be 8 patients who were 999 years old! They had no idea! So that throws things off terribly. A lot of people in the statistics or data mining world don't take a look at the data to be able to detect these things.”
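The kind of check that would surface those patients can be sketched in a few lines. This is an illustrative assumption, not the tooling Shneiderman's lab used: the helper `implausible_ages`, the 120-year cutoff, and the five toy records are invented, and treating 999 as a stand-in for a missing value is a guess about how such entries arise.

```python
from statistics import mean


def implausible_ages(records, max_age=120):
    """Return the records whose age falls outside a plausible human range."""
    return [r for r in records if not 0 <= r["age"] <= max_age]


# Invented toy admissions; 999 is assumed to be a placeholder for a missing age.
admissions = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 71},
    {"id": 3, "age": 999},
    {"id": 4, "age": 8},
    {"id": 5, "age": 999},
]
suspect = implausible_ages(admissions)
clean = [r for r in admissions if r not in suspect]

raw_mean = mean(r["age"] for r in admissions)  # distorted by the placeholder values
clean_mean = mean(r["age"] for r in clean)     # computed after setting them aside
print(len(suspect), raw_mean, clean_mean)
```

A histogram of the ages would show the same thing at a glance, as a lone bar far to the right of everything else, which is the point of looking at the data before running statistics on it.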

Recently, Shneiderman and Heer co-authored a paper on the importance of control panels that help users (scientists, teachers, students, etc.) navigate a visual display. The paper goes through different ways of presenting legends, navigation boxes, and widgets so that users can get to what they want.

“Overview first, zoom and filter, then details-on-demand,” Shneiderman has famously said about how to display and work through a graph. It is a commentary on what to do with a sound visualization: it lays out a clear path for looking at the data once we have found an efficient way to structure it.

So, if data could be properly structured and fed into compelling but strict visualizations, we could have an amazingly useful tool that gives scientific insight at a glance (what Simpson was shooting for), yielding new hypotheses purely from emergency room admissions records, medical journals, or even grocery store cash registers. Is Robert Simpson's matrix intriguing because its data, drawn from scholarly articles, is already structured? It seems that way, because he is using a system put together by a successful innovator. Don R. Swanson, someone without a medical degree, furthered medical science purely because he compared contextualized data in a different and efficient way, and he did so without a colorful visualization. Visualization was promoted in science long before Simpson by people like Princeton statistician John Tukey, who is credited with coining the term “software.” But there are still many complications with the process, as compiled above. Why are Simpson's efforts important?

Simpson is the smart “kid” with a question here: What is a simple way that I can display data, with contemporary resources, that can streamline and augment the scientific process? Heer serves as a realist, telling us how complicated these efforts might become. Shneiderman serves as a philosopher and craftsman, figuring out a human's role in the act of transitioning data to the screen and then offering responsible ways to do it.

All in all, we start with a basic question: How can scientists use a computer's ability to gather and process massive amounts of data to generate hypotheses that they might never notice on their own? The answer might well lie at the core of computer science and visual analytics, but it is a dangerous concept that needs some healthy parenting, lest it evolve into a faulty idea generator that ultimately brings scientists back to square one with heaps of useless data.

One last question I had was for Stanford's Jeffrey Heer, in reference to his analysis of PhD dissertations discussed above:

Q: “Do you ever think, once (in the future) the kinks are worked out with data visualizations and someone can maybe easily rule out spurious results, this process will be available to the general public as a community effort towards science? (i.e., maybe a site where people can go and insert their own data sets to generate testable hypotheses)”

A: “First let me clarify something just to be safe... the data visualizations in my earlier anecdote were not responsible for the spurious results. Instead, the underlying models were the culprit and the visualization played a valuable role in helping reveal the problematic data. This issue is unlikely to evaporate -- a single data set can often be analyzed using a variety of algorithms or models, some of which may not be appropriate (the model may make assumptions about the data that are not true). Visualization can play a useful role in helping us understand our data prior to modeling (to help generate hypotheses and assess if modeling assumptions hold) and understanding the results of our models (as in my earlier anecdote).

Now on to your question -- there have already been a number of efforts to enable the type of "citizen science" you describe. Even prior to the internet, "amateurs" played an important role in astronomy (by observing new stars, etc). Also, websites like IBM's Many Eyes or Google's Fusion Tables allow people to upload data sets, visualize them, and share them. While I haven't seen such a public site dedicated to generating testable hypotheses, you might be interested in this recent work we did using paid crowd workers to help generate explanatory hypotheses for public-interest data sets.

However, the more common practice is to engage "citizen scientists" in data collection efforts, while the "professional scientists" then analyze that data. Of course, it is important to provide transparency in this process and recognize where lay people can contribute valuable insights. Here's one example of putting this idea into practice by engaging with a community to monitor local air quality.”

One thing about these wizards of visual analytics: they are thorough.