In a Big Data World, Scholars Need New Guidelines for Research

User information from Facebook and other social-media sites is invaluable to political and social scientists, but it must be treated with care 

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American


Mark Zuckerberg’s recent testimony to Congress was full of discussion about Facebook’s privacy policies, its advertising-driven business model and the issue of protecting consumers around the globe. Equally important, however, but less prominent in the public conversation, are questions about trusting scholars to use people’s personal data from social media sites in an ethical way.

To understand political or social behavior today, scholars need access to private data. But in the case leading up to Zuckerberg’s hearing, a scholar collected data via a “third party application” that he developed, then sold those data to Cambridge Analytica, with unfortunate results. Given the importance of research review processes for institutions, and the strict oversight by Institutional Review Boards (IRBs) in the United States in particular, the Facebook case brings the challenges of doing big data internet research into the spotlight.

Certainly, the Zuckerberg hearings centered on important topics (and provided a lot of congressional theater), but the implications for social science research are now in question. Researchers who have asked for access to large datasets in order to learn more about our digital lives are concerned about how tech companies will change their policies. Facebook, for example, has hesitated to share data with social scientists who have questions about political opinions, interpersonal behavior, group networks and digital life.

We have known for a long time that digital technologies have profoundly affected all kinds of scientific inquiry, but managing the oversight of data collection in a digital world has become much more complex over time. Historically, social data, in the form of surveys or transcripts of interviews, were collected with informed consent and then stored on paper in a locked file cabinet.

With the onset of online surveys, datasets were stored instead on password-protected or encrypted computers. Now the emergence of cloud computing is changing data management yet again, and we have to trust a cloud provider to safeguard the data. It is difficult to see how recent shifts in IRB policy fully account for the magnitude of protection that research participants of all kinds now need.

In fact, recent changes to the federal policy in the United States were the first revisions in decades, yet they excluded guidelines for human protections in data science and actually seem to have relaxed existing standards for protecting research participants. These changes include an expanded list of the kinds of research that are exempt from full IRB review, a broadened reach of consent for secondary data use, and a shift toward single-IRB coverage for multisite studies in place of approval at each research site.

IRBs and research ethics committees need to quickly confront big data management problems, given the scale and speed at which data misuse can harm research participants who have, in many cases, entrusted scholars with their personal data.

To be fair, many have written about studying digital behavior and about methods for internet research. New initiatives and ideas are emerging for companies like Facebook to share data in a way that protects users and a company’s proprietary information while making anonymized datasets available for experts to analyze.

As Facebook announced, scholars will soon have the ability to interrogate the impact of social media on electoral processes. In a working paper, scholars from Harvard and Stanford suggest a new model of data management for research purposes, one that protects industry interests while allowing for the kinds of scholarly inquiry needed to understand social trends, online behavior, and human psychology relative to digital engagement. Ideas like these may or may not be the right paths to take. Internet researchers have already voiced concerns about centralizing the research agenda through a proposed commission, and about doing solid research without knowing much about how data were originally collected. But at least new proposals are emerging.

Social scientists need to access digital data in a safe way that protects consenting research participants. Some IRBs and other committees overseeing research ethics around the globe have begun to envision how today’s scholars can safely conduct social media, internet or big data research.

For years there have been calls for educating research review boards about internet research, and concerns have been raised about the credibility of scientific contributions when ethical research processes are under fire. Research institutions are thus already late in reviewing their ethical guidelines.

This is urgent: it is imperative that every research institution review its data protection processes and requirements, especially as researchers need access to big datasets to pursue a wide variety of scholarly inquiry. If research institutions around the globe are not nimble in how they provide guidelines for the ethical use of data, both today and over time, there will be more large Facebook-like data scandals that leave ordinary digital citizens vulnerable.