James J. DiCarlo, M.D., Ph.D.
Associate Professor of Neuroscience
McGovern Institute for Brain Research and Department of Brain and Cognitive Sciences
Massachusetts Institute of Technology

Dr. DiCarlo received his Ph.D. in biomedical engineering and M.D. from The Johns Hopkins University in 1998, and his postdoctoral training in visual neurophysiology at Baylor College of Medicine. His research goal is a computational understanding of the brain mechanisms that underlie object recognition. He and his collaborators have shown that populations of neurons at the highest cortical visual processing stage rapidly convey explicit representations of object identity, even in the face of naturally occurring image variability. His group has found that the brain’s ability to accomplish this feat is rapidly altered by natural visual experience and they can now monitor the neuronal substrates of this learning online. This points the way to understanding how the visual system uses the statistics of the visual world to “learn” to build these object representations. He and his collaborators are currently using a combination of neurophysiology, brain imaging, and high-throughput computational simulations to understand the neuronal mechanisms and fundamental cortical computations that underlie the construction of these powerful image representations. He aims to use this understanding to inspire new machine vision systems and neural prosthetics (brain-machine interfaces) to restore or augment lost senses. Dr. DiCarlo is an Alfred Sloan Fellow, a Pew Scholar in the Biomedical Sciences, and a McKnight Scholar in Neuroscience.

Learning to untangle object identity in the ventral visual stream

Although visual object recognition is fundamental to our behavior and seemingly effortless, it is a remarkably challenging computational problem because the visual system must somehow tolerate tremendous image variation produced by different views of each object (the “invariance” problem).
In this talk, I will briefly present a framework for thinking about this computational crux of object recognition and how it might be solved (“untangling” object identity manifolds). Current neurophysiological evidence shows that the primate brain accomplishes this untangling by gradually transforming its initial neuronal population representation (a photograph on the retina) to a new, explicit form of neuronal population representation at the highest level of the ventral visual processing stream (inferior temporal cortex, IT). This explicit object representation depends on the key response property of tolerant shape selectivity found in most IT neurons. We have recently discovered that this key response property can be rapidly, robustly, and specifically altered by unsupervised exposure to naturally occurring temporal contiguity cues in the visual environment. This points the way to understanding how the natural statistics of the visual world “teach” the ventral stream to untangle object identity manifolds. We are currently using a combination of neurophysiology, brain imaging (fMRI and x-ray guided physiology), and high-throughput computational simulations to understand the fundamental cortical computations that underlie this untangling transformation. We aim to use this understanding to inspire new machine vision systems and neural prosthetics (brain-machine interfaces) to restore or augment lost senses.

Jay Yagnik
Head of Computer Vision and Audio Understanding Research
Google Inc.

Jay Yagnik is currently the Head of Computer Vision and Audio Understanding Research at Google Inc. His interests include machine learning, scalable matching, graph information propagation, image representation and recognition, temporal information mining, and statistics. He completed his graduate studies at the Indian Institute of Science and his undergraduate studies at Nirma Institute of Technology.
Prior to Google, he was at the Supercomputer Education and Research Centre at IISc Bangalore, where he worked on criminal identification through beard- and mustache-invariant facial recognition, machine learning for predicting protein function, cooperative robotics, and solving large PDEs. His hobbies include games (badminton, ping-pong, foosball), reading, and writing.

Computer Vision at the Web Scale

Traditional computer vision follows the paradigm of "training" a vision algorithm on manually labelled data and then applying it to the field task at hand. Systems built on this philosophy inherently suffer from a lack of training data representative enough to capture the real use cases. Specifically, when the real use case is to let users search or browse vast collections of image and video data, the traditional paradigm does not scale well. For example, 20 hours of new video are uploaded to YouTube every minute; such applications force us to redefine what we mean by large scale. We'll talk about some new perspectives on formulating vision problems for such applications and their far-reaching implications for how we think about the complexity of the solution space. We'll also touch upon some of the systems issues that come up when dealing with such scales and how they influence the design space of vision algorithms.