
Viagra ads and NSA watch lists: smoke but usually no fire




In our society, Big Data plays an increasingly important role. Organizations like Google and the National Security Agency (NSA) have access to an unprecedented quantity of information about what we buy, with whom we communicate, and which websites we visit. These organizations spend billions of dollars to analyze the plethora of data they harvest, a reflection of the power and money to be reaped from Big Data. The NSA revelations have been on the front pages for the entire summer. Some find the idea of Big Brother watching them deeply unsettling. Others see nothing to worry about: if you have nothing to hide, they reckon, you have nothing to fear.

So, should the righteous fear Big Data? Our moral intuitions about the limits of privacy and surveillance may be clear enough, but Big Data is a numbers game, and our intuitions about statistics are notoriously hazy and fallible. So to sharpen the debate, we need to inject some mathematics into the discussion.

For illustration, imagine we have a startup that has developed a predictive algorithm that screens emails, Facebook posts, search queries, and YouTube watch history to identify customers interested in Viagra for targeted advertising. We hope to sell our algorithm to Google. The drug company that owns Viagra would pay Google for each Viagra ad it displays, and the payment would depend on the fraction of the ads that actually reach the target audience. Just as important, Google would like to avoid the uncomfortable situation of these ads popping up for the wrong audience. Calculating the relevant probabilities is a nice illustration of Bayes’ theorem in statistics.


To facilitate our discussion, let us define two events:

Event A: The algorithm identifies you as a member of the target audience for Viagra, and Google therefore displays the ad to you.

Event B: You are a member of the target audience. That is, upon seeing the ad, you would run to the doctor for a prescription for Viagra or at least try to convince your partner, friend, or family member to do so.

Two metrics that characterize an identification algorithm’s efficacy are the specificity and sensitivity. Precisely, the sensitivity is the probability that the algorithm will identify someone to be a member of the target audience, given that the person actually is. The specificity is the probability that the algorithm will not identify someone to be a member of the target audience, given that the person actually is not. The sensitivity measures the algorithm’s ability to pick up on those that are interested in Viagra; the specificity measures its ability to identify those who are not interested. Mathematically, the sensitivity is P(A | B), which is read “the conditional probability that A occurs given that we know B has occurred”. The specificity is P(~A | ~B), where the tilde denotes that the event did not occur.

Back to the drug company that owns Viagra, which wants to know P(B | A): the conditional probability that, given that a person receives an ad, this person actually belongs to the target audience. We can calculate this from Bayes’ powerful theorem, which, as we will see, sometimes rips apart the statistical intuition we thought we had:

P(B | A) = P(A | B) P(B) / P(A)

[Embedded YouTube video: a geometric derivation of Bayes’ theorem.]

The beauty of Bayes’ theorem is that it relates the conditional probability we are interested in, P(B | A), to the sensitivity, P(A | B), of our algorithm. The specificity is buried in the denominator; the probability that someone receives an ad is the sum of the probability that the algorithm correctly identifies someone to belong to the target audience and the probability that the algorithm generates a false positive, or:

P(A) = P(A | B) P(B) + P(A | ~B) P(~B)

= P(A | B) P(B) + [1 - P(~A | ~B)] [1 - P(B)]

Let’s assume that our algorithm is quite good, with a specificity of 99% and a sensitivity of 99%. Estimating that 1.0% of the population is so interested in Viagra that they immediately take action if they see the ad (i.e., P(B) = 0.01), we calculate with Bayes’ theorem that P(B | A) = 0.5. Based on the high (99%) sensitivity and specificity, it may seem surprising that only 50% of the displayed ads successfully reach the target audience. The relatively high fraction of displayed ads that aren’t reaching the target audience is a consequence of the target audience being only a small (1.0%) fraction of the population for which we have collected our Big Data.
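To make the arithmetic concrete, here is a minimal Python sketch of the calculation above (the function name is ours; it simply plugs the assumed 99% sensitivity, 99% specificity, and 1% prevalence into Bayes’ theorem):

```python
def p_target_given_ad(sensitivity, specificity, prevalence):
    """P(B | A): the probability that a person shown the ad belongs to the target audience."""
    # P(A) = P(A|B) P(B) + P(A|~B) P(~B), with P(A|~B) = 1 - specificity
    p_ad = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    # Bayes' theorem: P(B|A) = P(A|B) P(B) / P(A)
    return sensitivity * prevalence / p_ad


# 99% sensitivity, 99% specificity, 1% of the population in the target audience
print(p_target_given_ad(0.99, 0.99, 0.01))  # ≈ 0.5
```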

To visualize this consequence, let each circle in the left panel of Fig. 1 below represent a person in the population to which Google would apply our algorithm. The red circles (1%) represent the target audience for Viagra; the blue circles are those who are not interested. When our algorithm analyzes the emails of each circle in the left panel of Fig. 1, its high sensitivity (99%) picks up that all 9 red circles are interested in Viagra, and Google displays the ad to them. From the specificity, we know that Google will also end up displaying the ad to 100% - 99% = 1% of the blue circles. However, there are so many blue circles that 1.0% of them equals the number of red circles, and the algorithm displays a Viagra ad to 9 blue circles as well! The right panel of Fig. 1 visualizes the population to which Google displays the Viagra ad based on our algorithm; the bookkeeping is spelled out in the short sketch after the caption below.

Fig. 1: Each circle represents a member of the population. The left box visualizes a population of 1.0% Viagra candidates. The right box is the population to which our algorithm displays a Viagra ad based on a 99% specificity and sensitivity.
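The counting behind Fig. 1 can be checked in a few lines (the population of 900 circles is our inference from the 9 red circles representing 1% of the figure):

```python
population = 900                   # hypothetical size chosen so that 1% = 9 red circles
red = round(0.01 * population)     # 9 people in the target audience
blue = population - red            # 891 people who are not interested

ads_to_red = round(0.99 * red)     # 99% sensitivity: all 9 red circles receive the ad
ads_to_blue = round(0.01 * blue)   # 1 - specificity = 1%: about 9 blue circles receive it too

print(ads_to_red, ads_to_blue)     # 9 9 -> half of the displayed ads miss the target
```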

Instead of sending an ad to everybody to reach the entire 1% of Viagra candidates, our algorithm lets Google send the Viagra ad to only 2% of the population (i.e., P(A) = 0.02) and achieve nearly the same result. The chance of reaching the target audience with an ad in Gmail has now increased from 1%, in the case of displaying the ad to random Gmail users, to 50%, in the case of targeted advertising using Big Data. Not only has our algorithm made Viagra advertising much more efficient, but the collateral damage from showing uninterested people uncomfortable ads has also been greatly reduced. This seems a small price to pay for all our “free” Gmail, Facebook, and browsers.

For most of us, receiving an incorrectly targeted ad will most likely go unnoticed. But will this also be the case if Google’s target list of people likely (50% certain!) to be interested in Viagra were to become publicly accessible? Or, imagine a more sensitive scenario where our startup becomes part of the NSA and our algorithm now screens Big Data to predict terrorist behavior. We seek to use our predictive algorithm to deny people access to an airplane. Modifying our definition of events:

Event A: Our algorithm identifies you as a terrorist and puts you on a no-fly list.

Event B: You indeed have intentions of killing many innocent people.

We are interested in the collateral damage of our algorithm: P(~B | A), the probability that you do not have intentions of killing innocent people, given that you are on the no-fly list.

We can use Bayes’ theorem again to calculate P(~B | A) = 1 - P(B | A) if we know the specificity and sensitivity of our algorithm. Let’s make it even more impressive than our Viagra algorithm: here, the sensitivity is 100%, meaning the algorithm picks up terrorists 100% of the time, and the specificity is 99.99%.

The fraction of terrorists in the US population, represented by P(B) here, is much smaller than the fraction of Viagra candidates. The consequence is that, despite the more sensitive and specific algorithm, the amplification of false positives is even worse than in the Viagra scenario. In Fig. 2, we use Bayes’ theorem to plot P(B | A) as a function of the prevalence of terrorists in the US. If fewer than 1 in 10,000 people is a terrorist, then even with our algorithm, which picks up every terrorist but has a specificity of 99.99%, the majority of people on our no-fly list are not terrorists. The story with the red and blue circles is the same, but now the ratio of blue circles to red ones is even greater.
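The curve in Fig. 2 can be sketched with the same Bayes calculation as before; a minimal version, using the 100% sensitivity and 99.99% specificity assumed above and a few illustrative prevalence values, is:

```python
def p_terrorist_given_flagged(sensitivity, specificity, prevalence):
    """P(B | A): the probability that someone on the no-fly list really is a terrorist."""
    p_flagged = sensitivity * prevalence + (1.0 - specificity) * (1.0 - prevalence)
    return sensitivity * prevalence / p_flagged


# Sweep the assumed fraction of terrorists in the population (illustrative values)
# Below a prevalence of 1 in 10,000, P(B | A) drops under 50%: most flagged people are innocent
for prevalence in [1e-3, 1e-4, 1e-5, 1e-6]:
    p = p_terrorist_given_flagged(1.0, 0.9999, prevalence)
    print(f"prevalence {prevalence:.0e}: P(B | A) = {p:.3f}")
```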

This is a general problem with tests and algorithms that seek to identify a rare occurrence in a population, whether for disease testing, drug testing, Viagra ad targeting, or NSA terrorist screening. Bayes’ formula tells us that even an impressively small probability of a false positive is amplified when the test or algorithm is applied to a large population of people who are disease-free, drug-free, or not interested in Viagra. Because the number of actual positives in the population is so small, the number of true positives ends up comparable to, or even smaller than, the number of false positives.

Another problem with screening for a rare occurrence is that its very rarity makes it so much more difficult to train the algorithm to have a high specificity and sensitivity. To figure out what features reliably identify someone as a likely terrorist, all we have to train our algorithm on is the sample of past terrorists – which (fortunately) is a very small sample. The trouble is that for such a small sample it is much harder to filter out which of the properties that they share are truly indicative of being a terrorist, and which are merely accidental, the effect of random fluctuations.

To illustrate the point, let us do the following experiment. We select 20 random people, and we train our algorithm to find the similarities between them. Despite the fact that this is a random sample of people who have nothing to do with each other, trawling through Big Data will inevitably throw up a whole list of things they have in common: we might find, for example, that our small random sample of 20 people just happens to have a higher frequency of people that like yellow shirts, send exactly 5 emails to their mothers each week, put mustard on their cheese sandwiches, and like Laurel and Hardy movies. Remarkable though such similarities may seem, remember that we were looking at a random sample, and the clusters we are seeing are just random fluctuations in the distribution of properties across the population, indicative of exactly nothing.
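To see how easily such flukes arise, here is a rough simulation of our own (the 1,000 traits, the 10% base rate, and the sample size of 20 are illustrative assumptions): each of many unrelated yes/no traits occurs in 10% of the population, yet in a random sample of 20 people a fair number of those traits will show up at 2.5 times their base rate or more.

```python
import random

random.seed(0)
n_traits = 1000    # many unrelated yes/no traits (yellow shirts, mustard on sandwiches, ...)
base_rate = 0.10   # each trait occurs in 10% of the general population
sample_size = 20   # a small "training set" of randomly chosen people

# Each person in the sample independently has or lacks each trait at the base rate
sample = [[random.random() < base_rate for _ in range(n_traits)]
          for _ in range(sample_size)]

# Count traits present in at least a quarter of the sample, 2.5 times their base rate
overrepresented = sum(
    1 for t in range(n_traits)
    if sum(person[t] for person in sample) >= sample_size // 4
)
print(f"{overrepresented} of {n_traits} unrelated traits look like 'signatures' by pure chance")
```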

Our random training sample is like a limiting case: if we trained our algorithm on it and applied it to the general population, we could say that, by construction, it would generate only false positives; whatever remarkable properties the members of our random sample have in common, they have them by statistical fluke. The thing is, such statistical flukes will also occur in a small, non-random sample, like our training sample of past terrorists. Of all the properties they have in common, some will actually be indicators of terrorism, but others will just be random fluctuations. And the problem is that the only way to filter the accidental similarities from those that are a genuine signature of our target group is to increase the size of the training group, which, in the case of terrorists, is precisely what we are trying to avoid.

Big Data and rare occurrences thus conspire to yield a ‘suspects’ list that is a lot longer than you’d expect: the surprising result from Bayes’ theorem is that if one uses Big Data to identify rare occurrences, even the most impressive algorithms will produce a no-fly list on which the majority of people are collateral damage. They are quite innocent and had nothing to hide, yet their very appearance on that list makes them statistically more suspect than anyone who is not on the list. (Remember the Viagra ads: of those on Google’s target list, 50% are likely to actually be interested in Viagra, against only 1% across the population as a whole.) So we end up with a paradoxical result: given a good algorithm and a rare occurrence, only a small minority of the population will end up on the list of suspects; of those on the list, however, the majority will still be innocent. Yet being on the list will make you incomparably more suspect than anyone not on the list. And since no one can provide evidence of their future innocence, having nothing to hide does not necessarily mean having nothing to fear.

About Cory Simon and Berend Smit

Cory Simon is a Ph.D. student in the Department of Chemical and Biomolecular Engineering at the University of California, Berkeley. Cory uses computer simulations and mathematical models to design/identify nanoporous materials for use as adsorbents inside an automotive natural gas fuel tank.
Berend Smit is Professor of Chemical Engineering and Chemistry at the University of California, Berkeley. Berend specializes in developing molecular simulation techniques. Recently, Berend became interested in using Big Data methods to discover novel materials for carbon capture and methane storage.
