
The Business Case for a Data Refinery

Crude data is similar to crude oil—in its raw form, it’s usually too messy to be useful

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American.


The Economist has proclaimed that data is “the oil of the digital era.” Data will be for the 21st century what oil was for the 20th century—an enabler of new technologies, new products and new businesses. Data will be a focal point around which the economy, society and politics will organize. It is the clean new resource of the future, with undiscovered potential.

But turning oil into something valuable has always been a complex process. Oil is crude when it comes out of the ground and needs to be cracked at a refinery before it becomes useful.

Data is similar. In its raw form, it’s usually too big, too messy, and too unstructured to use. To solve this problem, imagine a “data refinery”—a software platform that pulls in huge datasets, finds patterns in that data and makes predictions. The data refinery is the missing link between gathering data and extracting value from it.
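To make the idea concrete, here is a minimal sketch of one refinery pass in Python: ingest raw records, clean them, find patterns and make predictions. The file name and column names are invented for illustration; they don't come from any real system.

```python
# A toy "data refinery" pass: ingest raw records, clean them,
# find patterns, and make predictions. All names are illustrative.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Ingest: pull in a raw dataset (hypothetical CSV of user events).
raw = pd.read_csv("raw_events.csv")

# Refine: drop malformed rows and keep structured features.
clean = raw.dropna(subset=["session_length", "pages_viewed", "converted"])
features = clean[["session_length", "pages_viewed"]]
labels = clean["converted"]

# Find patterns: fit a simple model on historical data.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
model = LogisticRegression().fit(X_train, y_train)

# Predict: score held-out data to see whether the patterns generalize.
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```

Real refineries are vastly larger, but the loop is the same: ingest, clean, model, predict, repeat.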




Digital-first tech firms have already excelled at building data refineries. But soon everyone will need to understand and build them. It’s not enough to just “use data” to steer your business. Your data must be targeted, purposefully gathered, and refined.

Real-world sensors are the next step in moving this data revolution outside the four walls of a business and into the larger environment in which it operates. Today, the average car has between 100 and 200 sensors that generate about 1 terabyte (TB) of data per car per day. Now think about all the cars on the road—not to mention ships, trains, satellites, smart devices, mobile phones and more—and you’ll begin to understand the scale and speed at which sensor data is driving the need for data refineries. Those sensors are turning the physical world into bits, allowing organizations to better understand not only their supply chains but the larger ecosystems they operate in. This means organizations can track where the things they make actually go—from raw materials to final consumption—as well as what the things they care about are doing anywhere on the planet.
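The arithmetic behind that scale is worth a back-of-envelope pass. Taking the figure above of roughly 1 TB per car per day, and assuming (for illustration only) a fleet on the order of the U.S. registered-vehicle count, the totals get enormous fast:

```python
# Back-of-envelope scale of automotive sensor data.
# 1 TB/car/day comes from the article; the fleet size is an
# illustrative assumption, not a measured figure.
TB_PER_CAR_PER_DAY = 1
CARS = 250_000_000  # assumed: roughly the U.S. registered-vehicle fleet

daily_tb = TB_PER_CAR_PER_DAY * CARS
daily_exabytes = daily_tb / 1_000_000  # 1 exabyte = 1,000,000 TB (decimal)

print(f"{daily_tb:,} TB/day, or about {daily_exabytes:,.0f} exabytes per day")
```

That is hundreds of exabytes per day from cars alone, before adding ships, trains, satellites and phones.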

Because of this sensor revolution, which allows computers to see and understand the world, every business will become fully digitized, with data refineries at the core. The metaphor might be new, but big tech firms have understood the value of their data for decades—and the ones that have excelled at data refining are among the largest companies in the world.

The best place to look for examples is among the digital-first leaders: Facebook has become a data refinery for the social network. Amazon refines consumer data, Netflix does it for video, and Google for Web pages.

Let’s focus on Google, the original data refinery business.

The primary business and cash cow is its search engine. Search is an astonishing model in that Web pages, the dataset for search, are publicly available. Though it isn’t easy, anyone can in theory build a search engine (I was involved with a number of start-ups that tried). Furthermore, there’s absolutely no lock-in: it’s just as easy to type “Bing.com” into your browser as it is “Google.com.” Even the interface is the same 10 blue links. But despite Microsoft’s billions, the company just can’t get people to switch to Bing.

So how does Google stay ahead? It created a superior internal platform for its scientists, and it has collected a massive amount of user data.

Google took the raw Web data set, cleaned it up (e.g., by reducing spam), and built the right tools to test out theories and rapidly improve its search algorithms. Google scientists don’t need to worry about wrangling data—they have a platform where they can quickly experiment and test their theories about what makes a better search engine. Google’s scientists aren’t smarter than their peers at competitors; they just have a more powerful workbench that gives them more opportunities to use their intelligence.
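What such a workbench enables is fast, cheap experiments. As a purely hypothetical illustration (none of this reflects Google’s actual code or ranking signals), here is the shape of an offline experiment that compares two ranking theories against human relevance labels:

```python
# Toy experiment harness: compare two candidate ranking functions
# offline against relevance judgments. All data here is invented;
# it only shows the shape of the "rapid experiment" loop.
def mean_reciprocal_rank(rank_fn, judged_queries):
    """Average 1/rank of the first relevant result per query."""
    total = 0.0
    for docs, relevant in judged_queries:
        ranked = sorted(docs, key=rank_fn, reverse=True)
        for i, doc in enumerate(ranked, start=1):
            if doc in relevant:
                total += 1 / i
                break
    return total / len(judged_queries)

# Hypothetical documents encoded as (keyword_matches, inbound_links),
# each paired with the set of documents judged relevant.
judged = [
    ([(3, 10), (5, 2), (1, 80)], {(1, 80)}),
    ([(2, 40), (4, 1)], {(2, 40)}),
]

baseline = lambda d: d[0]              # theory A: keyword matches only
candidate = lambda d: d[0] + d[1] / 10  # theory B: also weigh links

print("baseline MRR: ", mean_reciprocal_rank(baseline, judged))
print("candidate MRR:", mean_reciprocal_rank(candidate, judged))
```

The point is the loop, not the specifics: with the data already cleaned and labeled, testing a new theory takes minutes rather than a months-long wrangling project.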

It goes beyond Google’s internal platform. The virtuous cycle of a data-refinery-based business is this: the more users a service has, the more data it generates. Smart businesses refine all that data into a better service … thereby attracting even more users and maintaining market share.

In Google’s case, the company has a long history of user search queries, user behavior, clicks, ad bids—really every user interaction with Google. By feeding this proprietary data into its data refinery, Google gets an edge. No matter how much its competitors spend on building better algorithms, they’ll never be able to collect those years of data. Google has a natural moat, and it’s filled with data.

Data refineries are powerful for improving an existing service, but they can also spawn entirely new data-enabled products. The classic example is Amazon’s product recommendations, a feature that has been optimized over many years. The reason recommendations are so good now is decades of accumulated purchase data—something only (maybe) a retailer like Wal-Mart could replicate.
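Amazon’s production system is proprietary, but the core idea behind “customers who bought this also bought” can be sketched in a few lines: treat items purchased by overlapping sets of customers as similar. The purchase matrix below is toy data for illustration:

```python
# Toy item-to-item recommendation: items bought by overlapping sets
# of customers are treated as similar. Data and scale are invented;
# a real refinery runs this over decades of purchase history.
import numpy as np

# Rows = customers, columns = items; 1 means "purchased".
purchases = np.array([
    [1, 1, 0, 0],   # customer A bought items 0 and 1
    [1, 1, 1, 0],   # customer B bought items 0, 1 and 2
    [0, 0, 1, 1],   # customer C bought items 2 and 3
])

# Cosine similarity between item columns.
norms = np.linalg.norm(purchases, axis=0)
similarity = (purchases.T @ purchases) / np.outer(norms, norms)
np.fill_diagonal(similarity, 0)  # an item shouldn't recommend itself

# "Customers who bought item 0 also bought..." -> most similar item.
print("Recommend alongside item 0:", int(similarity[0].argmax()))  # item 1
```

The algorithm is simple; the moat is the purchase history it runs on.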

Twitter is a counterexample: a company sitting on a treasure trove of data that can’t seem to build a functional data refinery. My last company, Zite, was largely based on intelligence mined from the social graph, allowing us to recommend great articles to people. Twitter is full of social interactions that link out to Web pages, making it the perfect place to mine data for recommendations across a broad range of topics. At Zite, we built a data refinery optimized for turning that data into daily recommendations for each user—an entire product built on top of the Twitter dataset. It shocks me to this day that Twitter hasn’t done the same and launched new products—or radically improved its service—by refining its treasure trove of data.

So far, all of the examples of companies that have embraced data refineries are digital-first businesses, ones that were born online into a world of analytical data. Physical-product companies are now using sensors to digitize their operations and generate their own proprietary data. GE has been working for years on its industrial data refinery, Predix, in an effort to turn huge amounts of production and operational data into useful feedback loops.

Sensors will cause every business to rethink its data strategy. They are becoming smaller and cheaper, and they can all be networked to send their data back to a centralized brain. This means businesses that operate in the physical world—ones that weren’t fundamentally disrupted by the PC, the Internet or mobile—will now be threatened by disruption in a world of data. Their physical goods will become bits, and therefore analyzable.

Unlike with oil, companies no longer have to prospect for the resource; plenty of them are sitting on virtual oil reserves. But even a huge hoard of data won’t magically turn into value. That requires a data refinery, and a new set of tools to find and extract value.

Regardless of your industry, you are generating data. How are you housing it? What tools are you using to find value in it? I would love to hear what you’re doing to ensure your business isn’t left behind in the data refinery revolution.