A team of researchers from MIT has trained a machine learning model to reason abstractly: it compares and contrasts dynamic events captured on video.
One defining feature of human intelligence is the ability to reason abstractly about events around us.
It takes no conscious effort to know that crying and writing are both means of communicating. Similarly, we instinctively understand that an apple falling off a tree and a plane landing are both variations of descending.
Machines, on the other hand, are still learning to organize the world into such abstract categories. Recent studies have come closer to training machine learning models to reason about everyday actions.
A team of researchers presented one such study at the European Conference on Computer Vision this month.
The team unveiled a hybrid language-vision model that can compare and contrast a set of dynamic events captured on video and then tease out the high-level concept connecting them.
In a statement, the study’s senior author, Aude Oliva, a senior research scientist at the Massachusetts Institute of Technology, said:
“We show that you can build abstraction into an AI system to perform ordinary visual reasoning tasks close to a human level.”
Deep neural networks have gotten significantly better at recognizing objects and actions in photos. Researchers are now focusing on a new milestone: abstraction, and training models to reason about what they see.
To achieve this goal, the researchers leveraged the links in the meanings of words to give the model visual reasoning power.
Mathew Monfort, a research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), explained:
“Language representations allow us to integrate contextual information learned from text databases into our visual models. Words like ‘running,’ ‘lifting,’ and ‘boxing’ share common characteristics that make them more closely related to the concept ‘exercising,’ for example, than to ‘driving.’”
The team used WordNet, a database of word meanings, to map the relations among the action-class labels in their dataset.
For example, they linked words like “sculpting,” “carving,” and “cutting” to higher-level concepts such as “crafting,” “cooking,” and “making art.” So, when the model recognizes an activity like sculpting, it can pick out conceptually similar activities.
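As a rough illustration of this idea (not the authors’ code, and using a tiny hand-coded hypernym graph as a stand-in for WordNet), two action labels can be treated as conceptually related whenever they share a higher-level ancestor concept:

```python
# Toy hypernym graph: each action label points to more abstract concepts.
# The labels and links here are illustrative, not taken from WordNet itself.
HYPERNYMS = {
    "sculpting": ["carving"],
    "carving": ["cutting", "making art"],
    "cutting": ["crafting"],
    "making art": ["crafting"],
    "cooking": ["crafting"],
    "running": ["exercising"],
    "lifting": ["exercising"],
    "boxing": ["exercising"],
    "driving": ["operating"],
}

def ancestors(label):
    """All higher-level concepts reachable from a label."""
    seen = set()
    stack = [label]
    while stack:
        for parent in HYPERNYMS.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def related(a, b):
    """Two labels are conceptually related if they share an ancestor."""
    return bool(({a} | ancestors(a)) & ({b} | ancestors(b)))

print(related("running", "boxing"))   # → True  (shared ancestor: "exercising")
print(related("running", "driving"))  # → False (no shared ancestor)
```

The real system maps every action-class label through WordNet’s much larger hierarchy, so relations like “sculpting is a kind of crafting” come for free from the database rather than a hand-built table.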
In tests, the model matched, and sometimes exceeded, human performance on two types of reasoning tasks:
- Picking the video that conceptually completes a set
- Identifying the footage that doesn’t fit
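The second task can be sketched at the label level with a hypothetical snippet (the actual model works over video embeddings, not labels; the concept sets below are made up for illustration): the odd one out is the item that shares the fewest high-level concepts with the rest.

```python
# Hypothetical concept annotations for four actions; illustrative only.
CONCEPTS = {
    "barking": {"communicating", "vocalizing"},
    "howling": {"communicating", "vocalizing"},
    "crying":  {"communicating", "vocalizing"},
    "driving": {"operating", "moving"},
}

def odd_one_out(labels):
    """Return the label whose concepts overlap least with the others'."""
    def overlap(label):
        others = set().union(*(CONCEPTS[l] for l in labels if l != label))
        return len(CONCEPTS[label] & others)
    return min(labels, key=overlap)

print(odd_one_out(["barking", "howling", "crying", "driving"]))  # → driving
```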
For example, after viewing a video of a dog barking and a man howling beside the dog, the model completed the set with a video of a crying baby. What’s more, it picked that specific video from a group of five.
“It’s a rich and efficient way to learn that could eventually lead to machine learning models that can understand analogies and are that much closer to communicating intelligently with us,” Oliva said.
The team then replicated these results on two datasets used for training AI systems in action recognition: MIT’s Multi-Moments in Time and DeepMind’s Kinetics.
Full details are available in the published study.