James DiCarlo is a professor of neuroscience in the Department of Brain and Cognitive Sciences at MIT who researches visual object recognition in primates. I had a chance to interview him in late May at the 79th Cold Spring Harbor Laboratory Symposium on Quantitative Biology, which highlighted research findings on the topic of cognition. In the interview, DiCarlo talked about his research, but also addressed basic questions, such as what object recognition is. An edited transcript of the interview follows, or you can watch the full video.
Scientific American: So Jim, can you give us a definition of object recognition?
James DiCarlo: We all have this intuitive feel for what object recognition is. It’s the ability to discriminate your face from other faces, a car from other cars, a dog from a camel, that ability we all intuitively feel. But making progress in understanding how our brains are able to accomplish that is a very challenging problem, and part of the reason is that it’s challenging to define what it is and isn’t. We take this problem for granted because it seems effortless to us. However, a computer vision person would tell you that this is an extremely challenging problem because each object presents an essentially infinite number of images to your retina, so you essentially never see the same image of each object twice.
SA: It seems like object recognition is actually one of the big problems both in neuroscience and in the computational science of machine learning?
DiCarlo: That’s right, not only in machine learning but also in psychology or cognitive science, because the objects that we see are the sources in the world of what we use to build higher cognition, things like memory and decision-making. Should I reach for this, should I avoid it? Our brains can’t do what you would call higher cognition without these foundational elements that we often take for granted.
SA: Maybe you can talk about what’s actually happening in the brain during this process.
DiCarlo: It’s been known for several decades that there’s a portion of the brain, the temporal lobe down the sides of our head, that, when lost or damaged in humans and non-human primates, leads to deficits of recognition. So we had clues that that’s where these algorithms for object recognition are living. But just saying that part of your brain solves the problem is not really specific. It's still a very large piece of tissue. Anatomy tells us that there’s a whole network of areas that exist there, and now the tools of neurophysiology and still more advanced tools allow us to go in and look more closely at the neural activity, especially in non-human primates. We can then begin to decipher the actual computations to the level that an engineer might, for instance, in order to emulate what’s going on in our heads.
SA: Maybe you could address a little bit more how you’re really trying to find the constituents of these networks.
DiCarlo: Well, I would start by saying the foundation of any science is really the ability to have predictive models of a phenomenon. So for the domain of object recognition, if you want to emulate that from an engineering perspective, you first need to define what you are trying to predict. The goal that we call core object recognition is the ability you have when you view an image for just 200 milliseconds, which is about the time that your eyes dwell on something as they explore a scene. But we humans can do a lot with that short time window. We can easily recognize one or more objects within that short, 200-millisecond glimpse, which is a fifth of a second. You can see that’s not all of vision, but it’s a defined domain of behavior where now we can start to get some traction on the problem.
SA: Okay so you’ve got a predictive model and then you want to test that model…
DiCarlo: Well, what I described for you was a domain of tasks to be understood, which we call core recognition. We know, of course, that images come in and are processed by the eyes and then move through a series of visual areas in the brain for further processing, in ways that are sometimes murky, but we can record the neural activity along that pathway. Others have done that before us and now we’re doing it at a much larger scale. We can record neural activity and we’re especially interested in a place in the brain called the inferior temporal cortex, which is at the highest level of this processing chain that we spoke of earlier. We found that the patterns of neural activity there, read out with a very simple model, can predict very accurately the animal’s perception and also our own perception, our ability to do recognition in that core domain.
SA: You could predict, say, that I am looking at a tree in the background from observing that neural activity?
DiCarlo: That’s exactly what I mean. Now the granularity at which we can do that is still part of the active research, but we can certainly do “tree detection.” From looking at neural activity, we can predict if the subject is going to report that it sees a tree versus a dog or if it reports a tree versus a car, and if it’s looking at one tree versus another. We’re now trying to see if we can do that on a moment-to-moment basis, and if we can exactly predict the pattern of errors in the subject reports, meaning the subject reports a dog when it was shown a cat.
SA: The issue with object recognition is that if I’m looking at that tree and then I move slightly to the left or the right the tree changes or I start to see another tree. Will this model be able to still recognize that that’s a tree or that it’s the same tree?
DiCarlo: I should’ve made that more clear. That’s the largest thing that the model has to deal with, and when I say the model deals with that, I mean that activation of the neurons up to the inferior temporal cortex has been recorded. So once we build a decoder of the inferior temporal cortex that reads activity there, an image of the tree will be properly decoded as a tree. It’s a brand-new image but the model will still make a prediction of what you will see and the model will be quite accurate.
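As a rough illustration of what decoding a brand-new image might look like, here is a minimal sketch. It is not the model DiCarlo describes: the population responses are simulated, the nearest-centroid readout is a stand-in for whatever decoder his lab uses, and every name and parameter (simulate_response, fit_centroids, decode, run_demo, the unit count, the noise level) is hypothetical. The added noise plays the role of image-to-image variation, so each test response is one the decoder has never seen.

```python
import random


def simulate_response(category_pattern, rng, noise=1.0):
    """Return one synthetic population response (hypothetical data).

    The response is the category's mean firing pattern plus Gaussian noise;
    the noise stands in for variation in position, size, pose and so on.
    """
    return [mean + rng.gauss(0.0, noise) for mean in category_pattern]


def fit_centroids(responses, labels):
    """Average the training responses for each label."""
    sums, counts = {}, {}
    for response, label in zip(responses, labels):
        if label not in sums:
            sums[label] = [0.0] * len(response)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], response)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def decode(response, centroids):
    """Return the label whose centroid is closest to the response."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(response, centroids[label]))
    return min(centroids, key=squared_distance)


def run_demo(seed=0, n_units=100, n_train=50, n_test=50):
    """Fit on one set of simulated responses, then score on unseen ones."""
    rng = random.Random(seed)
    # Each category gets its own random mean pattern across the recorded units.
    patterns = {
        label: [rng.gauss(0.0, 1.0) for _ in range(n_units)]
        for label in ("tree", "dog", "car")
    }
    train_x, train_y = [], []
    for label, pattern in patterns.items():
        for _ in range(n_train):
            train_x.append(simulate_response(pattern, rng))
            train_y.append(label)
    centroids = fit_centroids(train_x, train_y)

    correct, total = 0, 0
    for label, pattern in patterns.items():
        for _ in range(n_test):
            novel = simulate_response(pattern, rng)  # never used in fitting
            correct += decode(novel, centroids) == label
            total += 1
    return correct / total


if __name__ == "__main__":
    print(f"held-out accuracy: {run_demo():.2f}")
```

The point of the sketch is only the split between fitting and testing: the readout is built from one set of responses and scored on different ones, which is the sense in which a decoder can label an image it has not encountered before.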
SA: What are some of the implications of this for machine learning and perhaps one day even for understanding problems for people who have disruptions of this neurocircuitry?
DiCarlo: From the perspective of machine learning, these neural activities are something that machine learning folks would call features. They are features computed on the image and they are a very powerful set of features. What many people would love to do is to have algorithms that produce those features. So much of machine learning is devoted to finding good features, and the brain’s evolution has already found some good ones. That’s essentially what we’re reporting: Here are some nice features. Here’s where they are, here’s our evidence for that. So now we’re working alongside machine learning scientists to help build what are called encoding algorithms that produce those features. There has been a lot of exciting progress in the field in the last few years, driven by what are essentially brain-inspired models that are now some of the state-of-the-art computer-vision algorithms.
SA: The grand vision of what you’re doing is modeling this all the way from the encoding to the neural activation and then to the decoding and the perception in the brain.
DiCarlo: That’s exactly the grand vision. If we can do all of that, we would then say that we have a complete end-to-end understanding of this domain of behavior.
SA: When do you think all this might happen?
DiCarlo: It depends on the level of detail but I would say certainly in the next ten years, we will have a very good understanding of core basic-level object recognition to the degree that many engineers will be satisfied. We won’t know it down to the synapse level, but we will know it so that the algorithms are very predictive of the neural activity at various levels of the system.
SA: Do you think that this could provide some insight into what sometimes goes wrong with this circuitry?
DiCarlo: The most common deficits that would affect recognition are major damage to the inferior temporal area through stroke or lesion and of course it’s obvious what’s gone wrong: you’ve taken out those neurons. Now maybe this would lead to ways that you could sidestep or replace that.
There are other conditions affecting the temporal lobe in which people have deficits in the ability to discriminate among faces or, rarely, among other types of objects. They’re not very common but this kind of work should bear on those deficits as well. We hope it will also relate to things like how kids learn to read. At the end of the day, whenever you’re doing visual tasks you’re leaning on these kinds of representations in your visual system, and so I think it will help us understand higher-level issues of, say, social cognition or things like dyslexia.
SA: Despite what people see in the movies, robots in the real world are still very limited in what they can do and one of the big issues with that is their ability to recognize and process information that they perceive. Do you think your model could help with that?
DiCarlo: The computer vision community’s already using brain-like algorithms right now and the next frontier is expanding the domain of tasks, not just what you do in 200 milliseconds, but what you might do as you explore a scene with many eye movements or navigate the scene. For that, you’ll have to accumulate information over time. There’ll be more feedback in the system. I won’t say that we can do this work and then we will have robots doing everything you see on “Star Trek” but it will be a foundation to enable us to take next steps.
SA: There has been work in the last few years on retinal prosthetics and one approach being pursued is to implant neural coding into some of these prosthetics so that they can process the incoming photons in the same way that the retina does. Is there any chance that the kind of work that you’re doing could, in some ways, jibe with this?
DiCarlo: That’s actually one of the things that we’re most excited about right now. There are visual prosthetics for people who, say, have lost a retina, and there are various approaches. But the dominant one is to try to just bypass the retina and reinject a spatial pattern of activity, say, in the early visual area or one of the subcortical areas that comes up right beyond the retina, called the lateral geniculate nucleus. That makes sense from an engineering point of view. It makes sense given our knowledge of how you might try to do this.
The downside is trying to get an image in a very high-dimensional space with many, many pixels that would resemble normal vision. But we’re working at the highest level in which your brain has already reduced the dimensionality from millions of pixels to something that’s more abstract, something that’s on the order of 100 dimensions. We might be able to emulate a very rich visual panorama. It could be a better way to think about brain/machine interfaces as we understand them, that you might only have 100 ways to inject a signal, 100 channels rather than millions to make a rich perceptual space.
SA: Just to sum up, what your work is doing is taking something very basic that all of us can relate to, and then coming to a fundamental physical and theoretical understanding of that really huge challenge.
DiCarlo: Yes and I think that’s very well-put and that’s really been the goal of neuroscience since its formation -- we believe that the brain is a set of mechanisms that give rise to amazing mental states and behaviors that each of us can relate to. Object recognition is just one core example of that mental phenomenology, but one that many of us can relate to. So it would be a foundational success if we come to an end-to-end understanding of this behavior and its underlying neural mechanisms. It would be a large brick, if you will, in the foundation of building towards understanding cognition.
Gary Stix: Good luck with that.
DiCarlo: Thank you.
Image Source: MIT