So-called neurosymbolic models, which combine learned neural components with symbolic reasoning techniques, appear to be much better suited to predicting, explaining, and considering counterfactual possibilities than neural networks alone. But researchers at DeepMind claim neural networks can outperform neurosymbolic models under the right testing conditions. In a preprint paper, the coauthors describe an architecture for spatiotemporal reasoning about videos in which all components are learned and all intermediate representations are distributed (rather than symbolic) throughout the layers of the neural network. The team says that it surpasses the performance of neurosymbolic models across all questions in a popular dataset, with the greatest advantage on the counterfactual questions.
DeepMind’s research could have implications for the development of machines that can reason about their experiences. Contrary to the conclusions of some previous studies, models based exclusively on distributed representations can indeed perform well on vision-based tasks that measure high-level cognitive functions, according to the researchers, at least to the extent that they outperform existing neurosymbolic models.
The neural network architecture proposed in the paper leverages attention to effectively integrate information. (Attention is the mechanism by which the algorithm focuses on a single element or a few elements at a time.) It’s self-supervised, meaning the model must infer masked-out objects in videos using the underlying dynamics to extract more information. And the architecture ensures visual elements in the videos correspond to physical objects, a step the coauthors argue is essential for higher-level reasoning.
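To make the two ideas in this paragraph concrete, here is a minimal sketch in plain Python. It is not the architecture from the paper: it assumes generic scaled dot-product attention over a handful of per-object feature vectors, plus a simple masking step of the kind a self-supervised objective might use. The names `attend`, `mask_slots`, and the toy `slots` data are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Generic scaled dot-product attention for one query vector.

    Returns the weighted mix of `values` and the attention weights.
    Elements whose key matches the query get most of the weight, which
    is the sense in which the model "focuses" on a few elements.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    width = len(values[0])
    mixed = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(width)]
    return mixed, weights


def mask_slots(slots, masked_indices, mask_token):
    """Hide selected object slots; a self-supervised model would be
    trained to reconstruct the hidden entries from the visible ones."""
    return [
        list(mask_token) if i in masked_indices else list(slot)
        for i, slot in enumerate(slots)
    ]


# Toy data (assumed): three object slots with 2-D features.
slots = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [4.0, 0.0]  # a query that matches the first feature dimension

mixed, weights = attend(query, slots, slots)
masked = mask_slots(slots, {1}, [0.0, 0.0])
```

In this toy setup the second slot receives the smallest weight because its key has no overlap with the query, and `masked` is a copy of `slots` with the second entry replaced by the mask token. A real model would learn the queries, keys, and values rather than use raw features.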
The researchers benchmarked their neural network on CoLlision Events for Video REpresentation and Reasoning (CLEVRER), a dataset that draws on insights from psychology. CLEVRER contains over 20,000 5-second videos of colliding objects (three shapes of two materials and eight colors) generated by a physics engine and more than 300,000 questions and answers, all focusing on four elements of logical reasoning: descriptive (e.g., “what color”), explanatory (“what’s responsible for”), predictive (“what will happen next”), and counterfactual (“what if”).
According to the DeepMind coauthors, their neural network equaled the performance of the best neurosymbolic models without pretraining or labeled data and with 40% less training data, challenging the notion that neural networks are more data-hungry than neurosymbolic models. Moreover, it scored 59.8% on the hardest counterfactual questions — better than both chance and all other models — and it generalized to other tasks including those in CATER, an object-tracking video dataset where the goal is to predict the location of a target object in the final frame.
“Our results … add to a body of evidence that deep networks can replicate many properties of human cognition and reasoning, while benefiting from the flexibility and expressivity of distributed representations,” the coauthors wrote. “Neural models have also had some success in mathematics, a domain that, intuitively, would seem to require the execution of formal rules and manipulation of symbols. Somewhat surprisingly, large-scale neural language models … can acquire a propensity for arithmetic reasoning and analogy-making without being trained explicitly for such tasks, suggesting that current neural network limitations are ameliorated when scaling to more data and using larger, more efficient architectures.”