


Cross-Check: Critical views of science in the news

So Far, Big Data Is Small Potatoes

The views expressed are those of the author and are not necessarily those of Scientific American.


Is Big Data going to revolutionize science and help us make a better world? Not based on what it’s done so far.

Let me back up a moment. I was recently a speaker at How the Light Gets In, a groovy philosophy and music festival in Hay-on-Wye, Britain. The festival lodged me in a fantastical mansion called Great Brampton House, where I hung out with other festival speakers, like physicists George Ellis, Carlo Rovelli, Carlos Frenk and Tara Shears; biologist Rupert Sheldrake; psychiatrist David Nutt; and journalists Colin Tudge and David Malone. (I hope to post Q&As with Ellis and Sheldrake soon.)

One afternoon, I participated in a public debate about Big Data with journalists Kenneth Cukier and Angela Saini and sociologist Laurie Taylor. The festival brochure blurbed our session as follows: “In an age when we can collect information in unimaginable quantities, will we replace simplifying theories with complex real patterns? Might Big Data be the end of theory?” These are questions posed by Cukier, data editor for The Economist, and Viktor Mayer-Schonberger, professor of Internet governance at Oxford, in their 2013 bestseller Big Data: A Revolution That Will Transform How We Live, Work, and Think.

In an essay based on their book, they write: “Big data starts with the fact that there is a lot more information floating around these days than ever before, and it is being put to extraordinary new uses. Big data is distinct from the Internet, although the Web makes it much easier to collect and share data. Big data is about more than just communication: the idea is that we can learn from a large body of information things that we could not comprehend when we used only smaller amounts.”

Their most intriguing assertion is that Big Data will allow us to solve problems without necessarily understanding them. Big Data will shift the emphasis of researchers from “causation to correlation,” Cukier and Mayer-Schonberger write. “This represents a move away from always trying to understand the deeper reasons behind how the world works to simply learning about an association among phenomena and using that to get things done.” Former WIRED editor Chris Anderson made similar claims in his 2008 essay “The End of Theory.”

If Big Data means digital technologies, I love Big Data. Digital technologies have transformed the way journalists as well as scientists gather, analyze and disseminate information. With my MacBook Air, I can Google Cukier without leaving my room and in an instant find reviews of his book—including a surprisingly positive one by often-cranky Michiko Kakutani of The New York Times.

Moreover, Cukier is right that science can achieve a lot merely by uncovering correlations. Epidemiological studies demonstrated more than a half century ago a strong correlation between smoking and cancer. We still don’t understand exactly how smoking causes cancer. The discovery of the correlation nonetheless led to anti-smoking campaigns, which have arguably done more to reduce cancer rates over the past few decades than all our advances in testing and treatment (as I point out in a recent post).

I’ll also grant Cukier’s point that theory can impede problem-solving. Let’s say, for example, you are a judge pondering whether a convicted murderer might kill again. You could ask a psychiatrist or other so-called mind-expert to make a prediction based on the expert’s pet psychological paradigm. But you’re much better off using the method that insurance companies employ to calculate rates for policy-holders; that is, just look at recidivism rates of criminals with backgrounds like that of your murderer.

The enthusiasm of Cukier and others for Big Data nonetheless irks me, for several reasons. First, their rhetoric reminds me of the hype generated by the fields of chaos and its successor, complexity, which in my 1996 book The End of Science I lumped together under the term “chaoplexity.” Both fields promised that with faster computers and more sophisticated software, scientists could solve problems that had resisted analysis by stodgy old reductionist methods. Some chaoplexologists hoped to discover profound new principles governing the “self-organization” of a wide range of complex phenomena—and possibly even an “anti-entropy” force.

These discoveries never happened, and neither have the kinds of practical advances envisioned by Cukier and Mayer-Schonberger. Take genetics. The Human Genome Project was completed in 2003 in less time and for less money than had been expected because of advances in computers and other technologies. The costs of extracting and analyzing genetic data from humans and other organisms have continued to plummet.

But all this progress has produced disappointingly few medical advances. At this writing, not a single gene therapy has been approved for commercial sale in the U.S.; only one has been approved in Europe. The war on cancer has been a bust, as has the effort to find specific genes underpinning complex behavioral traits and disorders.

Just as geneticists are drowning in data, so are neuroscientists. In spite of the increasing power of scanners and other tools, neuroscientists still can’t explain exactly how brains make minds, or why our minds often work so badly. Thomas Insel, director of the National Institute of Mental Health, recently advocated overhauling our methods of defining and diagnosing schizophrenia, depression and other mental illnesses. Our treatments for these illnesses also remain appallingly primitive.

The economic crash of 2008 provides another reality check for Big Data. Wall Streeters have the fastest computers, most sophisticated software and biggest databases money can buy, and yet many failed to see the 2008 crash coming. The hope that Big Data will make economics and other social sciences truly scientific (that is, precise and predictive) remains, for now, a fantasy.

I assume—I hope—that our ever-improving information technologies will one day yield truly revolutionary advances in medicine, social sciences and other fields. But until that day arrives, let’s keep a lid on the hype about Big Data.

Further Reading: Are “Big Data” Sucking Scientific Talent into Big Business?


About the Author: Every week, hockey-playing science writer John Horgan takes a puckish, provocative look at breaking science. A teacher at Stevens Institute of Technology, Horgan is the author of four books, including The End of Science (Addison Wesley, 1996) and The End of War (McSweeney's, 2012). Follow on Twitter @Horganism.



14 Comments

  1. richord 10:45 am 06/9/2014

    There are various factors that are glossed over in many of the articles on big data. One of those factors is the quality of the data and the other is the bias in the data.

    Data quality includes the accuracy, timeliness and authenticity (provenance) as well as the trustworthiness of the data and those who processed the data.

    Equally important is the bias in the data: why the data was collected in the first place. Most organizations collect data to facilitate transactions such as orders and invoices. From those limited interactions they attempt to use this data and derive correlations to determine patterns of behavior and create digital personas for their customers.

    There are many claims about how these correlations are helping gain and retain customers. Some are so bold as to suggest these correlations can be used to create “customer intimacy”.

    The reality is that individuals are not statistical correlations. Just because the stock market goes up doesn’t mean I am making money. Because a group of Netflix viewers “liked” a movie I watched doesn’t mean I like the other movies they like.

    What is lost in correlations is the individual’s behaviors and personality traits. I am not a correlation!

    The quality of the data and the quality and individualization of the correlations are questionable at best, and big data may be nothing more than passing correlated hype.

  2. rshoff2 11:08 am 06/9/2014

    Richord’s comment really strikes a chord and hits the nail on the head. We can see its action in our daily exposure to the results of ‘big data’ analysis. And John is wise as always. With all ‘big data’ promises, what real advances have we seen? Genetics being a great example.

    My off-the-wall view is that ‘big data’ is like a lump of coal sitting in a pile. It doesn’t heat your home until someone does something meaningful with it. And along the lines of Richord’s view, I’d agree, it’s about the people, who we are, who uses the data, and what they do with it.

  3. rshoff2 11:29 am 06/9/2014

    …”Their most intriguing assertion is that Big Data will allow us to solve problems without necessarily understanding them.”…

    Although we can benefit from derivative knowledge and ask only “what”, it’s a tremendous mistake to stop asking the questions “why” and “how”. Without those questions, why bother being human at all?

  4. MarkRK 2:26 pm 06/9/2014

    I have made the point in several organizations that is the exact opposite to the point here (not John Horgan’s, but the book’s). In my opinion, the ONLY way to make good use of “big” data at a deep level is to have a good model underneath.

    If one thinks of “Big Data” just as large data sets, then the LHC is likely the biggest coherent dataset producer in the world and what it outputs would be useless without an extremely precise “Standard Model” underneath.

    If one thinks of Big Data as loosely coupled data sets from, say, billions of sensors (cell phones, internet of things, whatever), then the number of correlations one can find will be almost limitless, which is not too good for providing insight. But if you have a good underlying model, then you can do things like correlate the motions of thousands of cell phones to get reasonably good earthquake magnitude measurements. The model transforms the “dataset” into useful information, not the other way around.

    Actually, even with some of the things touted by these kinds of books, there are incredible underlying assumptions about normality, independence of effect and other things that, if not made explicit, pretty much ruin any predictability.

    Even in the recidivism example given, there is an implied model in “like that of your murderer.” OK, what kind of parameters are you going to use to make that match? Why those?

  5. brodix 11:02 pm 06/9/2014

    So it’s a small infinity inside a much larger infinity.
    It seems that if you want to find the needle in the haystack, you must first define the needle, then propose a method for discovering and extracting it. The first step is usually to eliminate as much excess hay as possible.
    So basically big data is quantity, not quality.
    Knowledge is information. Wisdom is editing.

  6. rshoff2 3:51 pm 06/10/2014

    Why is a needle in a haystack infinitesimal, almost impossibly small and overwhelms the mind at how it can possibly be found, whereas that same needle once found is nothing more than a simple, boring, almost useless inert object? Why is that?

    A lost needle fuels the imagination. It MUST be found. But why?

    The same could be said about any particular data set.

  7. brodix 10:15 pm 06/10/2014

    It goes much deeper than that. Our process of cognition is reductionistic. We absorb lots of information and then organize it into semi-coherent frames.
    For instance, an infant is highly aware, yet doesn’t yet have the ability to distill out coherent thoughts from the masses of information it absorbs and thus has no memory, no stream of ideas/thoughts to refer back to.
    Similarly, as a form of swarm intelligence, humanity is constantly congregating to memes, trends, common concepts, etc. to provide a group focus.
    We could even take it back to physics, in that gravity is a consolidation process, which coalesces mass out of more ephemeral energies, just as our thoughts coalesce out of diverse information.
    Yet obviously the result is not always coherent or the most logical response, that another point of perspective might have gravitated to, yet that is a consequence of the essential subjectivity of reality.
    So, no, there is no true ideal needle to be found, but then every pearl needs that grain of sand at the center.

  8. brodix 10:18 pm 06/10/2014

    If you work around racehorses, losing a needle in the straw does happen and you want to find it and not have it get stuck in the horse. Spectacular Bid presumably lost the 78 Belmont because he got a needle stuck in his foot and it was a bit hot.

  9. brodix 10:20 pm 06/10/2014

    Just double checked, 79 Belmont.

  10. Jerzy v. 3.0. 7:01 am 06/11/2014

    It is naive to believe claims about big data produced by big data marketers.

    There is at present no verifiable audit of which problems were solved by big data, and where big data turned useless.

    These stories ‘somebody used big data and made lots of money’ are like dieting pill advertisements: ‘somebody, somewhere tried it and became slim’.

    To explain what the problem is about: normally, problems are solved by gathering problem-specific data and modelling it. For example, seismological stations gather data about earthquakes. Big data claims that the same information can be found by analysing a bigger amount of random or biased data, and that the cost is smaller than that of getting specialized information (say, analysing smartphone movements is cheaper than running seismological stations). This premise only holds true in certain circumstances: the random data is cheap, the information needed is present, the noise is filtered, and principles of causation and correlation hold. Every one of these conditions must hold, and in practice, big data is often useless.

    Watch out especially for the two situations: one is that the size of data needed to draw proper statistical correlations in chaotic data is often much bigger than present in these so-called ‘big’ databases. For big data, if you have 10,000,000 data points it can mean too little information.

    Second is that we often need information about an individual, which is unpredictable from correlations. In the example from the blog: a judge dealing with 99.9% certainty that a person with such and such profile would repeat the crime should be unhappy, because 0.1% error in real life means 1000s of criminals on the loose and innocent convicts.

    BTW, good idea of @8: why don’t big data people make fortunes betting on horse races and other sport events? Sport should be an ideal test case for big data: lots of high quality past information available for free, clear problems, clear variables.

  11. rshoff2 2:58 pm 06/11/2014

    Very interesting to think of reductionism as a strategy of cognition. After simple observation I can sense that my brain also employs that strategy. But I think your great analogy of the clam is also important. It is an example of the opposite of reductionism. From a grain of sand the universe can be imagined. The detail (grain of sand) is nothing, but the result of its irritation is a beautiful pearl.

    So, how can I express this idea? Hmm. Cognition perhaps invokes both reductionism and emergentism. Consciousness is perhaps the event horizon.

    The pearl in the clam (emergentism) and the needle in the haystack (reductionism).

    How does this relate to big data? There may be a similar ‘event horizon’ between the emergence of data and the analysis of it.

    I do admit to being fearful that analyzing ‘big data’ with reductionist strategies only to develop the ability to predict outcomes (including developing new strategies) is blind to individuality. We lose ourselves as individuals and become part of a hive.

    Just think of advertising. What inappropriate products are directed at me as a result of reductionist assumptions of mass data? What appropriate solutions do those inappropriately directed advertisements displace when they grab for my attention? Maybe I’m looking for dentures AND a boogie board. No harm done when big data is analyzed for marketing purposes, but it sounds like the same broad strokes are used for analysis of healthcare, finances, taxes, liberty, justice, etc, etc, etc.

    Without the balance of emergentism, reductionism can be a trap.

  12. brodix 12:47 pm 06/12/2014

    Consider how the pearl is formed; by adding calcium and then polishing it down. Quality distilled from quantity.
    Emergentism without bounds is cancer. Just quantity.
    We are constantly trying to expand and then become defined by the limits we encounter and how we respond. Yes there is a lot of noise out there, but it is up to us to extract the signals we want. The problem is you assume there must be an ideal that is being missed in all this chaos, but all an ideal is, is a set of preferred characteristics. The absolute would be the essence from which we rise, not an ideal from which we fell.
    We are already a hive. That voice in the back of your head, the thought on the tip of your tongue, those memories washing through are all conscious aspects of you, just as the person next to you is conscious. We are all manifestations of some deeper entity. As someone who has played subordinate roles for much of my life, I find it possible to steer in any number of ways, by leaning this way, or that, who you choose to follow, how willingly you participate, etc. Those seemingly in the executive position are often just riding the wave of popular movements anyway.
    Think of your life as a clear sphere. Now most people like to plaster the insides of their balloon with personal mementos, because it makes them an individual. Yet I often try to polish my bubble as best as possible, to be able to really see and be part of what is going on around me. Then I can be much more aware of what is actually happening and how it might be better managed. In some ways, this can make me invisible and sometimes it can make me a force of nature, when I’ve got to get something done.

  13. Shecky R 11:35 am 06/16/2014

    I agree ‘big data’ is very primitive thus far… I find the majority of ads sent to me over the Web, not only of NO interest to me, but in fact to be so annoying as to leave me with a very negative view of the product or company involved… and for the privilege of creating this negative impression/ill-will these companies are paying Google, Yahoo, or whomever large sums of money (it’s almost amusing).

  14. rshoff2 5:59 pm 06/16/2014

    brodix, you have interesting philosophical perspectives. I can relate to the things you say in your comment. uncanny.

    Emergentism without bounds is cancerous, true, and that may be where reductionism plays its most important role: to smooth and shine the emerging pearl, eliminating any cancerous outcroppings.

    Yes, ‘bubbles’. My dad always accused people of living in bubbles. Oh, do I want a bubble. Do I ever want a bubble, and one of my choosing.

    Ah-oh, wait, I may already have one and I may have selected it. Oh my. Can I have a different one please?! :-)

    Shecky: BINGO! But it’s not amusing. It’s sad and wasteful of human capacity and application.

