New artificial intelligence system lets machines see the world more like humans

A new “common sense” approach to computer vision allows artificial intelligence to interpret scenes more accurately than other systems.

Computer vision systems sometimes make inferences about a scene that go against common sense. For example, if a robot were processing a scene from a dining table, it might completely ignore a bowl visible to any human observer, feel that a plate is floating above the table, or mistakenly perceive that a fork is entering it. a bowl rather than leaning against it.

Move that computer vision system to a self-driving car and the stakes get much higher – for example, such systems have failed to detect emergency vehicles and pedestrians crossing the street.

To overcome these mistakes, MIT researchers have developed a framework that helps machines see the world more like humans do. Their new artificial intelligence system for analyzing scenes learns to perceive real-world objects from just a few images and perceives scenes based on these learned objects.

The researchers built the framework using probabilistic programming, an AI approach that allows the system to compare detected objects to input data, to see if images recorded from a camera likely matched. a candidate scene. Probabilistic inference allows the system to deduce whether the mismatches are likely due to noise or to errors in the interpretation of the scene that need to be corrected by further processing.

3DP3 deduces precise pose estimates of objects

This image shows how 3DP3 (bottom row) derives more accurate pose estimates of objects from input images (top row) than deep learning systems (middle row). Credit: Courtesy of the researchers

This common sense safeguard enables the system to detect and correct many errors that plague the “deep-learning” approaches that have also been used for computer vision. Probabilistic programming also makes it possible to deduce probable contact relationships between objects in the scene and to use common sense reasoning about these contacts to derive more precise positions for the objects.

“If you don’t know the contact relationships, then you could say that an object is floating above the table – that would be a valid explanation. As humans, it is obvious to us that this is physically unrealistic and that the object placed on the table is a more likely pose of the object. Because our reasoning system is aware of this type of knowledge, it can deduce more precise poses. This is a key overview of this work, ”says lead author Nishad Gothoskar, a doctoral student in Electrical and Computer Engineering (EECS) with the Probabilistic Computing Project.

In addition to improving the safety of self-driving cars, this work could improve the performance of computer perception systems that must interpret complicated arrangements of objects, such as a robot tasked with cleaning a cluttered kitchen.

Gothoskar co-authors include Marco Cusumano-Towner, a recent EECS PhD graduate; research engineer Ben Zinberg; the guest student Matin Ghavamizadeh; Falk Pollok, software engineer at MIT-IBM Watson AI Lab; Austin Garrett, recent EECS Masters graduate; Dan Gutfreund, Principal Investigator at MIT-IBM Watson AI Lab; Joshua B. Tenenbaum, Paul E. Newton Professor of Career Development in Cognitive Science and Computing in the Department of Brain and Cognitive Sciences (BCS) and member of the Computer Science and Artificial Intelligence Laboratory; and principal author Vikash K. Mansinghka, principal investigator and head of the probabilistic calculus project at BCS. The research is presented at the Neural Information Processing Systems Conference in December.

A blast from the past

To develop the system, called “3D Scene Perception via Probabilistic Programming (3DP3)”, the researchers were inspired by a concept from the early days of AI research, namely that computer vision can be considered like the “reverse” of computer graphics.

Computer graphics focuses on the generation of images based on the representation of a scene; computer vision can be thought of as the reverse of this process. Gothoskar and his collaborators made this technique easier to learn and scalable by integrating it into a framework built using probabilistic programming.

“Probabilistic programming allows us to write down our knowledge about certain aspects of the world in a way that a computer can interpret, but at the same time, it allows us to express what we don’t know, uncertainty. Thus, the system is able to learn automatically from the data and also automatically detect when the rules do not hold ”, explains Cusumano-Towner.

In this case, the model is coded with prior knowledge about 3D scenes. For example, 3DP3 “knows” that scenes are made up of different objects and that these objects often lie flat on top of each other, but they may not always be in such simple relationships. This allows the model to reason on a stage with more common sense.

Learning shapes and scenes

To analyze an image of a scene, 3DP3 first learns the objects in that scene. After showing just five images of an object, each taken from a different angle, 3DP3 learns the shape of the object and estimates how much volume it would occupy in space.

“If I show you an object from five different perspectives, you can create a pretty good representation of that object. You would understand its color, its shape and you would be able to recognize this object in many different scenes, ”explains Gothoskar.

Mansinghka adds, “This is a lot less data than deep learning approaches. For example, the Dense Fusion Neural Object Detection System requires thousands of training examples for each type of object. On the other hand, 3DP3 only requires a few images per object and signals an uncertainty about the parts of the shape of each object that it does not know.

The 3DP3 system generates a graph to represent the scene, where each object is a node and the lines that connect the nodes indicate which objects are in contact with each other. This allows 3DP3 to produce a more accurate estimate of the arrangement of objects. (Deep learning approaches rely on depth images to estimate the poses of objects, but these methods do not produce a graphical structure of the contact relationships, so their estimates are less precise.)

More efficient reference models

The researchers compared 3DP3 with several deep learning systems, all responsible for estimating the poses of 3D objects in a scene.

In almost all cases, 3DP3 generated more accurate poses than other models and performed much better when some objects partially obstructed others. And 3DP3 only needed to see five images of each object, while each of the base models it surpassed needed thousands of images for training.

When used in conjunction with another model, 3DP3 was able to improve its precision. For example, a deep learning model can predict that a bowl will float slightly above a table, but because 3DP3 has knowledge of contact relationships and can see that this is an unlikely configuration , he is able to make a correction by aligning the bowl with the board.

“I found it surprising how big the mistakes of deep learning can sometimes be, producing depictions of scenes where the objects really didn’t match what people would perceive. I also found it surprising that only a little model-based inference in our causal probabilistic program is sufficient to detect and correct these errors. Of course, there is still a long way to go to make it fast and robust enough for tough real-time vision systems – but for the first time we see probabilistic programming and structured causal models improving the robustness over deep learning on hard 3D. vision landmarks, ”Mansinghka says.

In the future, researchers would like to take the system further so that it can know an object from a single frame or a single frame in a movie, and then be able to robustly detect that object in different scenes. They would also like to explore the use of 3DP3 to collect training data for a neural network. It is often difficult for humans to manually label images with 3D geometry, so 3DP3 could be used to generate more complex image labels.

The 3DP3 system “combines low-fidelity graphical modeling with common sense reasoning to correct large scene interpretation errors made by deep learning neural networks. This type of approach could have wide applicability as it addresses important failure modes of deep learning. The accomplishment of the MIT researchers also shows how the probabilistic programming technology previously developed under DARPAThe Probabilistic Programming Program for the Advancement of Machine Learning (PPAML) can be applied to solve core common sense AI problems under DARPA’s current Machine Common Sense (MCS) program, ”says Matt Turek, DARPA program manager for the Machine Common Sense program, who was not involved in this research, although the program partially funded the study.

Reference: “3DP3: 3D Scene Perception via Probabilistic Programming” by Nishad Gothoskar, Marco Cusumano-Towner, Ben Zinberg, Matin Ghavamizadeh, Falk Pollok, Austin Garrett, Joshua B. Tenenbaum, Dan Gutfreund and Vikash K. Mansinghka, October 30, 2021, Computing> Computer vision and pattern recognition.
arXiv: 2111.00312

Other funders include the Singapore Defense Science and Technology Agency’s collaboration with MIT Schwarzman College of Computing, Intel’s Probabilistic Computing Center, MIT-IBM Watson AI Lab, Aphorism Foundation, and Siegel Family Foundation.

Comments are closed.