How To Reconcile Big Data and Privacy

In many ways "big data" and "encryption" are antithetical. The former involves harvesting, storing and analyzing information to reveal patterns that researchers, law enforcement and industry can use to their benefit.

In many ways “big data” and “encryption” are antithetical. The former involves harvesting, storing and analyzing information to reveal patterns that researchers, law enforcement and industry can use to their benefit. The goal of the latter is to obscure that data from prying eyes. That tension was at the core of a conference this week co-hosted by the White House Office of Science & Technology Policy and the Massachusetts Institute of Technology (M.I.T.), in which more than a dozen experts from academia, politics and industry explored ways encryption and other privacy-oriented technologies might protect the information involved in big data efforts.

Functional encryption is the way to go, said Shafi Goldwasser, a professor at the M.I.T. Computer Science and Artificial Intelligence Lab (CSAIL), during Monday’s “Privacy Enhancing Technologies” panel. Alternatives such as anonymizing data records don’t work, she added. With so much information about people freely available on social networks and other public sites, anyone looking to do harm can build a profile of a target by cross-referencing data from any number of online resources.
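As a hypothetical illustration of that kind of cross-referencing (the data sets, field names and people below are invented for the example), a few lines of code are enough to re-identify records in a nominally anonymized data set by joining it with a public one on shared quasi-identifiers such as ZIP code, birth date and sex:

```python
# Hypothetical illustration: re-identifying "anonymized" records by
# cross-referencing them with a public data set on quasi-identifiers.
# All records below are invented for the example.

anonymized_health_records = [
    {"zip": "02139", "birth_date": "1975-03-07", "sex": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_date": "1988-11-21", "sex": "M", "diagnosis": "diabetes"},
]

public_voter_roll = [
    {"name": "Alice Example", "zip": "02139", "birth_date": "1975-03-07", "sex": "F"},
    {"name": "Bob Sample", "zip": "02139", "birth_date": "1988-11-21", "sex": "M"},
]

# Join the two data sets on the quasi-identifiers they share.
for record in anonymized_health_records:
    for voter in public_voter_roll:
        if (record["zip"], record["birth_date"], record["sex"]) == (
            voter["zip"], voter["birth_date"], voter["sex"]
        ):
            print(f'{voter["name"]} -> {record["diagnosis"]}')
```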

If data is simply being stored, encryption works wonderfully, said Nickolai Zeldovich, an associate professor at CSAIL, during the same panel. The trouble comes when you actually need to process and analyze that data. That’s why there is a need for systems that can do practical processing of encrypted data, he added.


Such practical efforts generally refer to so-called “homomorphic” encryption, which makes it possible to perform computations on encrypted data without decrypting it first. Researchers have theorized since the late 1970s that fully homomorphic encryption—in which protected information can be sliced and diced any number of ways without revealing the actual data—should be possible. Such systems would be a boon to cloud computing, providing a way to analyze information with minimal privacy risks to the people supplying that information.
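To make the idea concrete, the sketch below shows the Paillier scheme, which is only partially (additively) homomorphic: multiplying two ciphertexts yields an encryption of the sum of their plaintexts, so values can be totaled without ever being decrypted individually. The hard-coded primes are deliberately tiny so the arithmetic is easy to follow; a real deployment would use keys thousands of bits long and a vetted cryptographic library rather than this toy code.

```python
import math
import random

# Toy Paillier cryptosystem: partially (additively) homomorphic, meaning the
# product of two ciphertexts decrypts to the SUM of the underlying plaintexts.
# The primes below are far too small for real security; this is a sketch only.
p, q = 61, 53
n = p * q                       # public modulus
n2 = n * n
g = n + 1                       # standard simplified choice of generator
lam = math.lcm(p - 1, q - 1)    # private key component (lambda)
mu = pow(lam, -1, n)            # with g = n + 1, mu is simply lambda^-1 mod n

def encrypt(m):
    """Encrypt an integer 0 <= m < n under the public key (n, g)."""
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    """Decrypt a ciphertext using the private key (lam, mu)."""
    L = (pow(c, lam, n2) - 1) // n
    return (L * mu) % n

c1, c2 = encrypt(17), encrypt(25)
# Multiplying ciphertexts adds the hidden plaintexts: a computation performed
# on encrypted data without decrypting the individual values.
print(decrypt((c1 * c2) % n2))  # prints 42
```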

In practice, however, computer scientists have been unable to develop a way to perform more than a handful of meaningful operations on encrypted data. IBM claimed that computer scientist Craig Gentry had developed a practical, fully homomorphic system in 2009, but critics said the technology was too complex, slow and impractical for actual use in the cloud. IBM has since patented Gentry’s work and continues to develop it.

Efforts are underway to develop systems that, if not fully homomorphic, can nonetheless work with encrypted data in novel ways.

One such project is CryptDB, a system that enables analysis of encrypted data by placing a proxy server between the software requesting the data and the database storing that encrypted data. The proxy uses algorithms designed to compare and analyze encrypted information. In some cases the proxy has to strip away layers of encryption to analyze the data more thoroughly, but the idea is that it never fully decrypts the data into plain text. Despite the limited types of queries that CryptDB can perform, Google is a big supporter of the technology and uses it to provide encrypted queries in its cloud-based BigQuery service for searching massive data sets. Adding an extra piece of equipment such as a proxy server to the search and retrieval process typically slows things down, but Zeldovich and his colleagues say they are making strides in alleviating that problem.
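CryptDB’s layered design is more involved than a short example can show, but the core trick of letting a database answer equality queries over data it cannot read can be sketched roughly as follows. This is not CryptDB’s actual implementation: a keyed HMAC stands in for the deterministic encryption a real system would use, and the table, key and column names are invented for the illustration.

```python
import hmac
import hashlib
import sqlite3

# Sketch of equality queries over protected columns (not CryptDB's actual
# implementation): the proxy deterministically transforms each value with a
# secret key, so equal plaintexts map to equal tokens and the database can
# match them without ever seeing the plaintext.
PROXY_KEY = b"secret key held by the proxy, never by the database"

def equality_token(value: str) -> str:
    """Keyed HMAC used here as a stand-in for a deterministic cipher."""
    return hmac.new(PROXY_KEY, value.encode(), hashlib.sha256).hexdigest()

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (email_token TEXT, plan TEXT)")

# The proxy tokenizes values on the way into the database...
for email, plan in [("alice@example.com", "pro"), ("bob@example.com", "free")]:
    db.execute("INSERT INTO users VALUES (?, ?)", (equality_token(email), plan))

# ...and rewrites incoming queries to use tokens, so the database answers an
# equality query without learning the e-mail address being searched for.
row = db.execute(
    "SELECT plan FROM users WHERE email_token = ?",
    (equality_token("alice@example.com"),),
).fetchone()
print(row)  # ('pro',)
```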

The M.I.T. panelists also proposed privacy measures that do not rely on encryption. Differential privacy, for example, is an alternative to anonymizing data. This approach uses an automated data curator that can protect the privacy of the individuals in a data set while still providing useful information to the person requesting the data, said Salil Vadhan, the Vicky Joseph Professor of Computer Science and Applied Mathematics at Harvard University. As noted in a December 2012 article on Scientific American’s Web site, “A differentially private data release algorithm allows researchers to ask practically any question about a database of sensitive information and provides answers that have been ‘blurred’ so that they reveal virtually nothing about any individual’s data—not even whether the individual was in the database in the first place.”
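A minimal sketch of how such a curator might blur a counting query uses the Laplace mechanism, a standard building block of differential privacy: because adding or removing one person changes a count by at most 1, noise whose scale is inversely proportional to the privacy parameter epsilon hides any individual’s presence. The records and epsilon value below are illustrative only.

```python
import random

# Sketch of a differentially private counting query using the Laplace
# mechanism; the records and epsilon below are illustrative only.
def laplace_noise(scale: float) -> float:
    """Laplace(0, scale) noise, sampled as the difference of two exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float = 0.5) -> float:
    """Answer 'how many records satisfy predicate?' with calibrated noise.

    Adding or removing one person changes a count by at most 1 (sensitivity 1),
    so Laplace noise with scale 1/epsilon suffices for epsilon-differential
    privacy on this query.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Illustrative data: the analyst learns roughly how many smokers are in the
# data set, but the blurred answer reveals almost nothing about any one person.
patients = [{"age": a, "smoker": a % 3 == 0} for a in range(20, 80)]
print(private_count(patients, lambda r: r["smoker"]))
```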

Another option is for engineers to code privacy-policy requirements directly into software that collects, stores, and analyzes data. Such “accountable systems” would be written to automatically analyze whether a particular use of data violates a law, said Daniel Weitzner, M.I.T. CSAIL principal research scientist. “Using an analogy, we can operate economies all over the world with a reasonably high degree of public trust,” he added. “We do this because we have a set of consistent rules applied in a consistent way. I think we ought to have a similar goal for the way information is used.”
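A hypothetical sketch of what such a check might look like appears below; the policy, purposes and field names are invented for the example. Every request to use data declares its purpose, the system compares the request against the governing rules, and it logs and refuses uses the rules do not permit.

```python
from dataclasses import dataclass

# Hypothetical sketch of an "accountable system" check: the policy, purposes
# and field names below are invented for the example.
POLICY = {
    "billing": {"name", "address", "purchase_history"},
    "medical_research": {"diagnosis", "age"},
}

@dataclass
class DataUse:
    purpose: str
    fields_requested: set

def is_permitted(use: DataUse) -> bool:
    """A use is permitted only if every requested field is allowed for its purpose."""
    return use.fields_requested <= POLICY.get(use.purpose, set())

audit_log = []  # every decision is recorded, so uses of data stay accountable

def request_data(use: DataUse) -> str:
    permitted = is_permitted(use)
    audit_log.append((use.purpose, sorted(use.fields_requested), permitted))
    if not permitted:
        raise PermissionError(f"policy forbids {use.fields_requested} for '{use.purpose}'")
    return "...data released..."

print(request_data(DataUse("billing", {"name", "address"})))  # permitted
try:
    request_data(DataUse("billing", {"diagnosis"}))           # not permitted
except PermissionError as err:
    print(err)
```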

Monday’s conference stemmed from Pres. Barack Obama’s call earlier this year for a comprehensive review of big data’s impact on Americans’ lives, livelihoods and relationship with the government. Obama tasked White House Counselor John Podesta with leading the review, which will culminate a few months from now in a report expected to shape policy, funding and research related to big data.