Babble-onia: Solving the Cocktail Party Problem

This article was published in Scientific American’s former blog network and reflects the views of the author, not necessarily those of Scientific American.


Walk into a crowded bar, with music blaring, and your first impression is likely to be a shudder at the sudden wall of sound -- which you will interpret as a single loud noise. But very quickly, you adjust, and different sounds begin to emerge. We navigate by tuning our neurons to specific voices, thereby tuning out others -- like that irritating, leering would-be Lothario at the other end of the bar, or all that ambient noise.

Over at Scientopia, Scicurious wrote about a new MEG study by neuroscientists on how the brain deals with the so-called "cocktail party problem" -- distinguishing one conversational thread amid a cacophony of babble in a crowded room. It's not just a question of attention, although it can be difficult to concentrate on even the most fascinating discussion if there's too much background noise.

The brain doesn't just detect sounds; it also processes temporal patterns of speech and visual cues. The latter was the basis for the latest study, in which the authors set out to measure whether (as Scicurious put it) "the visual input from the face that is speaking might help someone to 'predict' what they are about to hear, easing processing of the words." As expected, they found that people followed a conversation just fine one on one, and had difficulty in a small cocktail party setting. But their performance improved dramatically in the latter setting if they had a face to go along with the speech patterns. Per Scicurious:


Why does this help? It could be that the visual input helps you maintain attention. The visual input could also help you predict what is to be said next and help with auditory processing that way.

This is a perennial favorite topic for science writers; I blogged about it back in 2011 when Scientific American featured an article by Graham Collins on how our brains separate various auditory streams in a crowded room, like a restaurant or a cocktail party, so why not revisit that classic post now? (Personally, my brain has never been especially good at this. I find myself having to really concentrate when the noise levels reach a certain critical threshold.) Scientists have been pretty successful at studying how the brain accomplishes this feat. They've been less successful at devising computer algorithms to do the same thing.

A few years ago, at an acoustics conference, I chatted with Shihab Shamma, a researcher at the University of Maryland, College Park. He believes this ability arises from auditory nerve cells in the brain that re-tune themselves to specific sounds as part of the adaptive process. It's kind of an auditory feedback loop that enables us to sort out confusing incoming acoustical stimuli.

He's surprised, however, by how quickly this process happens: auditory neurons in adult mammal brains make the adjustment in a few seconds. To Shamma, this suggests that the developed brain is even more "plastic" or adaptable than previously realized. We're literally changing our minds.

Scientists are still a bit in the dark in terms of understanding the mechanisms that cause this rapid tuning, but Shamma says that if we can mimic those abilities, it could lead to the development of more effective hearing aids and cochlear implants. In the shorter term, it might help improve automatic speech recognition systems by teaching them to filter out moderate levels of background noise and other acoustical "clutter."

And that brings us to the 2011 Scientific American article. Apparently a team of researchers at IBM's TJ Watson Research Center has managed to create an algorithm for the "cocktail party problem" that outperforms human beings. Why is it so hard, and therefore such a big deal? It comes down to the number of possible sound combinations, which quickly becomes unwieldy. Here's how Collins phrases it:

"Whether one person is talking or many, the sound contains a spectrum of frequencies, and the intensity of each frequency changes on a millisecond timescale; spectrograms display data of this kind. Standard single-talker speech recognition analyzes the data at the level of phonemes, the individual units of sound that make up words... Each spoken phoneme produces a variable but recognizable pattern in the spectrogram. Statistical models ... [specify] the expected probability that, for instance, an "oh" sound will be followed by an "n". The recognition engine looks for the most likely sequences of phonemes and tries to build up whole words and plausible sentences."

In other words, speech recognition works a bit like Auto-Correct -- and we all know what can happen when Auto-Correct goes horribly, horribly wrong.
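If you want a feel for how that scoring works, here's a toy sketch in Python -- a made-up table of phoneme-to-phoneme probabilities and a brute-force hunt for the most likely short sequence. The phonemes and numbers are invented purely for illustration; real recognizers use far bigger statistical models and much cleverer search.

```python
# Toy illustration of phoneme-sequence scoring (all probabilities invented).
from itertools import product

# Hypothetical bigram probabilities: P(next phoneme | current phoneme).
bigram = {
    ("oh", "n"): 0.6,   # an "oh" sound is often followed by an "n"
    ("oh", "l"): 0.3,
    ("oh", "k"): 0.1,
    ("n", "ly"): 0.5, ("n", "d"): 0.5,
    ("l", "d"): 0.7,  ("l", "ly"): 0.3,
    ("k", "d"): 0.5,  ("k", "ly"): 0.5,
}

def sequence_score(seq):
    """Multiply transition probabilities along a candidate phoneme sequence."""
    score = 1.0
    for a, b in zip(seq, seq[1:]):
        score *= bigram.get((a, b), 1e-6)  # unseen transitions get a tiny score
    return score

# Enumerate short continuations of an initial "oh" and keep the most likely one.
candidates = [("oh",) + rest for rest in product(["n", "l", "k"], ["ly", "d"])]
best = max(candidates, key=sequence_score)
print(best, sequence_score(best))   # -> ('oh', 'n', 'ly'), score 0.3
```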

Collins continues:

"When two people talk at once, the number of possibilities explodes. The frequency spectrum at each moment could come from any two phonemes, enunciated in any of the ways each person might use them in a word. Each additional talker makes the problem exponentially worse."

The good news is that such algorithms can simplify the search by focusing on the dominant speaker -- c'mon, we all know there's at least one Loud Talker in any given crowd. A number of shortcuts devised in recent years exploit exactly this. A "bottom-up" approach looks for segments in a spectrogram without a dominant speaker and sets those segments aside, removing them from the equation so the algorithm can focus on finding phoneme sequences in the "clean" regions -- i.e., where there is a dominant speaker. That approach has apparently been adopted by scientists at the University of Sheffield in England.
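Here's a minimal sketch of that bottom-up filtering idea, with random numbers standing in for real spectrogram energies: frames where neither talker clearly dominates simply get set aside.

```python
# Bottom-up sketch: keep only the frames where one talker clearly dominates.
# The per-frame "energies" below are random stand-ins for real spectrogram data.
import numpy as np

rng = np.random.default_rng(0)
energy_a = rng.random(20)        # pretend frame-by-frame energy of talker A
energy_b = rng.random(20)        # pretend frame-by-frame energy of talker B
DOMINANCE_RATIO = 2.0            # "dominant" = at least twice as loud (arbitrary threshold)

louder = np.maximum(energy_a, energy_b)
quieter = np.minimum(energy_a, energy_b)
clean = louder > DOMINANCE_RATIO * quieter   # True where one talker clearly dominates

print(f"{clean.sum()} of {clean.size} frames kept for the phoneme search")
```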

Alternatively, you can use a "top-down" approach, devising an algorithm that analyzes trial sequences of the most likely phonemes for all speakers in a given spectrogram. Finnish researchers at Tampere University of Technology exploit this approach by alternating between the two speakers. As Collins explains, "Given the current best estimate of talker A's speech, search for talker B's speech that best explains the total sound." Context is everything, baby. The IBM team achieved their "superhuman" automated speech separation by tweaking a "top-down" approach and devising an algorithm to seek out areas on the spectrogram where one talker was bellowing so loudly s/he masked the voices of the other(s).
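That alternating search can be sketched in a few lines. In this toy version, sine waves stand in for candidate speech models (and everything else is invented too): hold the current guess for talker A fixed, pick the candidate for B that best explains the leftover sound, then swap roles and repeat.

```python
# Toy "top-down" alternation: explain the mixture as A + B by taking turns.
import numpy as np

t = np.linspace(0, 1, 200)
# Sine waves stand in for candidate speech models; real systems use phoneme models.
candidates = [np.sin(2 * np.pi * f * t) for f in (3, 5, 7, 11)]
true_a, true_b = candidates[1], candidates[3]
mixture = true_a + true_b            # the combined sound we actually observe

def best_fit(residual):
    """Pick the candidate closest (least squares) to the unexplained residual."""
    return min(candidates, key=lambda c: np.sum((residual - c) ** 2))

est_a = np.zeros_like(t)
for _ in range(5):                       # a few alternating refinement passes
    est_b = best_fit(mixture - est_a)    # explain whatever A doesn't account for
    est_a = best_fit(mixture - est_b)    # then re-estimate A given the new B

recovered = (np.allclose(est_a, true_a) and np.allclose(est_b, true_b)) or \
            (np.allclose(est_a, true_b) and np.allclose(est_b, true_a))
print("Both talkers recovered (in some order):", recovered)
```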

But you really shouldn't worry too much just yet about secret agents eavesdropping on your party guests: the new algorithms aren't that good. Maybe someday. In the meantime, please to enjoy this classic party scene from Breakfast at Tiffany's to illustrate just how tough the cocktail party problem is likely to be. As one of the YouTube commenters remarked, "It's not a party until someone is laughing and crying at themselves in the mirror."

[Adapted from an April 2011 post from the archived Cocktail Party Physics blog.]