This article is part of our reviews of AI research papers, a series of posts that explore the latest findings in artificial intelligence.
What makes us humans so good at making sense of visual data? That’s a question that has preoccupied artificial intelligence and computer vision scientists for decades. Efforts at reproducing the capabilities of human vision have so far yielded results that are commendable but still leave much to be desired.
Our current artificial intelligence algorithms can detect objects in images with remarkable accuracy, but only after they’ve seen many (thousands or maybe millions) examples and only if the new images are not too different from what they’ve seen before.
There is a range of efforts aimed at solving the shallowness and brittleness of deep learning, the main AI algorithm used in computer vision today. But sometimes, finding the right solution is predicated on asking the right questions and formulating the problem in the right way. And at present, there’s a lot of confusion surrounding what really needs to be done to fix computer vision algorithms.
In a paper published last month, scientists at Massachusetts Institute of Technology and University of California, Los Angeles, argue that the key to making AI systems that can reason about visual data like humans is to address the “dark matter” of computer vision, the things that are not visible in pixels.
Titled, “Dark, Beyond Deep: A Paradigm Shift to Cognitive AI with Humanlike Common Sense,” the paper delves into five key elements that are missing from current approaches to computer vision. Adding these five components will enable us to move from “big data for small tasks” AI to “small data for big tasks,” the authors argue.
Today’s AI: Big data for small tasks
“Recent progress in deep learning is essentially based on a ‘big data for small tasks’ paradigm, under which massive amounts of data are used to train a classifier for a single narrow task,” write the AI researchers from MIT and UCLA.
Most recent advances in artificial intelligence rely on deep neural networks, machine learning algorithms that roughly mimic the pattern-matching capabilities of human and animal brains. Deep neural networks are like layers of complex mathematical functions stacked on top of each other. To perform their functions, DNNs go through a “training” process, where they are fed many examples (e.g. images) and their corresponding outcome (e.g. the object the images contain). The DNN adjusts the weights of its functions to represent the common patterns found across objects of common classes.
In general, the more layers a deep neural network has and the more quality data it is trained on, the better it can extract and detect common patterns in data. For instance, to train a neural network that can detect cats with accuracy, you must provide it with many different pictures of cats, from different angles, against different backgrounds, and under different lighting conditions. That’s a lot of cat pictures.
Although DNNs have proven to be very successful and are a key component of many computer vision applications today, they do not see the world as humans do.
In fact, deep neural networks have existed for decades. The reason they have risen to popularity in recent years is the availability of huge data sets (e.g. ImageNet with 14 million labeled images) and more powerful processors. This has allowed AI scientists to create and train bigger neural networks in short timespans. But at their core, neural networks are still statistical engines that search for visible patterns in pixels. That is only part of what the human vision system does.
“The inference and reasoning abilities of current computer vision systems are narrow and highly specialized, require large sets of labeled training data designed for special tasks, and lack a general understanding of common facts (facts that are obvious to average humans),” the authors of “Dark, Beyond Deep” write.
The scientists also point out that human vision is not the memorization of pixel patterns. We use a single vision system to perform thousands of tasks, as opposed to AI systems that are tailored for one model, one task.
How can we achieve human-level computer vision? Some researchers believe that by continuing to invest in larger deep learning models, we’ll eventually be able to develop AI systems that match the efficiency of the human vision.
The authors of “Dark, Beyond Deep,” however, underline that breakthroughs in computer vision are not tied to better recognizing the things that are visible in images. Instead, we need AI systems that can understand and reason about the “dark matter” of visual data, the things that are not present in images and videos.
“By reasoning about the unobservable factors beyond visible pixels, we could approximate humanlike common sense, using limited data to achieve generalizations across a variety of tasks,” the MIT and UCLA scientists write.
These dark components are functionality, intuitive physics, intent, causality, and utility (FPICU). Solving the FPICU problem will enable us to move from “big data for small tasks” AI systems that can only answer “what and where” questions to “small data for big tasks” AI systems that can also discuss the “why, how, and what if” questions of images and videos.
Our understanding of how the world operates at the physical level is one of the key components of our visual system. Since infanthood, we start to explore the world, much of it through observation. We learn about things such as gravity, object persistence, dimensionality, and we later use these concepts to reason about visual scenes.
“The ability to perceive, predict, and therefore appropriately interact with objects in the physical world relies on rapid physical inference about the environment,” the authors of “Dark, Beyond Deep,” write.
With a quick glance at a scene, we can quickly understand which objects support or are hanging from others. We can tell with decent accuracy whether an object will tolerate the weight of another or if a stack of objects is likely to topple or not. We can also reason about not only rigid objects but also about the properties of liquids and sand. For instance, if you see an upended ketchup bottle, you’ll probably know that it has been positioned to harness gravity for easy dispensing.
While physical relationships are, for the most part, visible in images, understanding them without having a model of intuitive physics would be nearly impossible. For instance, whether you know anything about playing pool or not, you can quickly reason about which ball is causing other balls to move in the following scene because of your general knowledge of the physical world. You would also be able to understand the same scene from a different angle, or any other pool table scene.
What needs to change in current AI systems? “To construct humanlike commonsense knowledge, a computational model for intuitive physics that can support the performance of any task that involves physics, not just one narrow task, must be explicitly represented in an agent’s environmental understanding,” the authors write.
This goes against the current end-to-end paradigm in AI, where neural networks are given video sequences or images and their corresponding descriptions and expected to embed those physical properties into their weights.
Recent work shows that AI systems that have incorporated physics engines are much better at reasoning about relations between objects than pure neural network–based systems.
Causality is the ultimate missing piece of today’s artificial intelligence algorithms and the foundation of all FPICU components. Does the rooster’s crow cause the sun to rise or the sunrise prompts the rooster to crow? Does the rising temperature raise the mercury level in a thermometer? Does flipping the switch turn on the lights or vice versa?
We can see things happening at the same time and make assumptions about whether one causes the other or if there are no causal relations between them. Machine learning algorithms, on the other hand, can track correlations between different variables but can’t reason about causality. This is because causal events are not always visible, and they require an understanding of the world.
Causality enables us not only to reason about what’s happening in a scene but also about counterfactuals, “what if” scenarios that have not taken place. “Observers recruit their counterfactual reasoning capacity to interpret visual events. In other words, interpretation is not based only on what is observed, but also on what would have happened but did not,” the AI researchers write.
Why is this important? So far, success in AI systems have been largely tied to providing more and more data to make up for the lack of causal reasoning. This is especially true in reinforcement learning, in which AI agents are unleashed to explore environments through trial and error. Tech giants such as Google use their sheer computational power and limitless financial resources to brute-force their AI systems through millions of scenarios in hopes of capturing all possible combinations. This is the approach has largely been successful in areas such as board and video games.
As the authors of “Dark, Beyond Deep” note, however, reinforcement learning programs don’t capture causal relationships, which limits their capability to transfer their functionality to other problems. For instance, an AI that can play StarCraft 2 at championship level will be dumbfounded if it is given Warcraft 3 or an earlier version of StarCraft. It won’t even be able to generalize its skills beyond the maps and race it has been trained on, unless it goes through thousands of years of extra gameplay in the new settings.
“One approach to solving this challenge is to learn a causal encoding of the environment, because causal knowledge inherently encodes a transferable representation of the world,” the authors write. “Assuming the dynamics of the world are constant, causal relationships will remain true regardless of observational changes to the environment.”
If you want to sit and can’t find a chair, you’ll look for a flat and solid surface that can support your weight. If you want to drive a nail in a wall and can’t find a hammer, you’ll look for a solid and heavy object that has a graspable part. If you want to carry water, you’ll look for a container. If you want to climb a wall, you’ll look for objects or protrusions that can act as handles.
Our vision system is largely task-driven. We reflect on our environment and the objects we see in terms of the functions they can perform. We can classify objects based on their functionalities.
Again, this is missing from today’s AI. Deep learning algorithms can find spatial consistency in images of the same object. But what happens when they have to deal with a class of objects that is very varied?
Since we look at objects in terms of functionality, we will immediately know that the above objects are all chairs, albeit very weird ones. But for a deep neural network that has been trained on images of conventional chairs, they will be confusing masses of pixels that will probably end up being classified as something else.
“Reasoning across such large intraclass variance is extremely difficult to capture and describe for modern computer vision and AI systems. Without a consistent visual pattern, properly identifying tools for a given task is a long-tail visual recognition problem,” the authors note.
“The perception and comprehension of intent enable humans to better understand and predict the behavior of other agents and engage with others in cooperative activities with shared goals,” write the AI researchers from MIT and UCLA.
Inferring intents and goals play a very important part in our understanding of visual scenes. Intent prediction enables us to generalize our understanding of scenes and be able to reason about novel situations without the need for prior examples.
We have the tendency to anthropomorphize animate objects, even when they’re not human—we empathize with them subconsciously to understand their goals. This allows us to reason about their courses of actions. And we do not even need rich visual cues to reason about intent. Sometimes, an eye gaze, a body posture or motion trajectory is enough for us to make inferences about goals and intentions.
Take the following video, which is an old psychology experiment. Can you tell what is happening? Most participants in the experiment were quick to establish social relationships between the simple geometric shapes and give them roles such as bully, victim, etc.
This is something that can’t be fully extracted from pixel patterns and needs complementary knowledge about social relations and intent.
Finally, the authors discuss the tendency of rational agents to make decisions that maximize their expected utility.
“Every possible action or state within a given model can be described with a single, uniform value. This value, usually referred to as utility, describes the usefulness of that action within the given context,” the AI researchers write.
For instance, when searching for a place to sit, we try to find the most comfortable chair. Many AI systems incorporate utility functions, such as scoring more points in a game or optimizing resource usage. But without incorporating the other components of FPICU, the use of utility functions remains very limited.
“these cognitive abilities have shown potential to be, in turn, the building blocks of cognitive AI, and should therefore be the foundation of future efforts in constructing this cognitive architecture,” write the authors of “Dark, Beyond Deep.”
This, of course, is easier said than done. There are numerous efforts to codify some of the components mentioned in the paper, and the authors mention some of the promising work that is being conducted in the field. But so far, advances have been incremental and the community is largely divided on which approach will work best.
The authors of “Dark, Beyond Deep” believe hybrid AI systems that incorporate both neural networks and classic intelligence algorithms have the best chance of achieving FPICU-capable AI systems.
“Experiments show that the current neural network-based models do not acquire mathematical reasoning abilities after learning, whereas classic search-based algorithms equipped with an additional perception module achieve a sharp performance gain with fewer search steps.”