Facebook in May launched the Hateful Memes Challenge, a $100,000 competition aimed at spurring researchers to develop systems that can identify memes intended to hurt people. The first phase of the one-year contest recently crossed the halfway mark with over 3,000 entries from hundreds of teams around the world. But while progress has been encouraging, the leaderboard shows even the top-performing systems struggle to outdo humans when it comes to identifying hateful memes.
Detecting hateful memes is a multimodal problem requiring a holistic understanding of photos, words in photos, and the context around the two. Unlike most machine learning systems, humans intrinsically understand the combined meaning of captions and pictures in memes. For example, given text and an image that seem innocuous when considered apart (e.g., “Look how many people love you” and a picture of a barren desert), people recognize that these elements can take on potentially hurtful meanings when they’re paired or juxtaposed.
Using a labeled dataset of 10,000 memes Facebook provided for the competition, a group of humans trained to recognize hate speech managed to accurately identify hateful memes 84.7% of the time. As of this week, the top three algorithms on the public leaderboard attained accuracies of 83.4%, 85.6%, and 85.8%. While those numbers handily beat the 64.7% accuracy the baseline Visual BERT COCO model achieved in May, even the best of them is only marginally better than human performance. Given 1 million memes, the AI system with 85.8% accuracy would misclassify 142,000 of them. If it were deployed on Facebook, for example, untold numbers of users could be exposed to hateful memes.
The challenges of multimodal learning
Why does classifying hateful memes continue to pose a challenge for AI systems? Perhaps because even human experts sometimes wrestle with the task. The annotators who attained 84.7% accuracy on the Hateful Memes benchmark weren’t inexperienced; they received four hours of training in recognizing hate speech and completed three pilot runs in which they were tasked with categorizing memes and given feedback to improve their performance. Despite this preparation, each annotator took an average of 27 minutes to decide whether a meme was “hateful.”
Understanding why the classification problem is more acute within the realm of AI requires knowledge of how multimodal systems work. In any given multimodal system, computer vision and natural language processing models are typically trained on a dataset together to learn a combined embedding space, or a space occupied by variables representing specific features of the images and text. To build a classifier that can detect hateful memes, researchers need to model the correlation between images and text, which helps the system find an alignment between the two modalities. This alignment informs the system’s predictions about whether a meme is hateful.
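The idea of a combined embedding space can be illustrated with a toy "late fusion" sketch: a visual feature vector and a text feature vector are concatenated into one joint vector, which a linear classifier then scores. Every number, dimension, and weight below is an invented assumption for illustration; real systems learn high-dimensional embeddings with neural encoders.

```python
# Toy sketch of a joint embedding space via late fusion; all values
# here are illustrative assumptions, not any real model's parameters.

def fuse(image_emb, text_emb):
    """Concatenate per-modality embeddings into one joint embedding."""
    return image_emb + text_emb  # list concatenation

def score(joint_emb, weights, bias):
    """Linear classifier over the fused embedding (dot product + bias)."""
    return sum(w * x for w, x in zip(weights, joint_emb)) + bias

image_emb = [0.2, -0.5, 0.9]  # toy visual features
text_emb = [0.7, 0.1]         # toy caption features
joint = fuse(image_emb, text_emb)

weights = [0.5, -1.0, 0.3, 0.8, 0.2]  # learned, in a real system
logit = score(joint, weights, bias=0.0)
label = "hateful" if logit > 0 else "not hateful"
```

The key point the sketch captures is that the classifier's decision depends on image and text features jointly: changing either modality moves the fused vector, and therefore the prediction.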
Some multimodal systems leverage a “two-stream” architecture that passes images and text independently through separate encoders to extract features, which are then fused to perform classification. Others adopt a “single-stream” architecture that combines the two modalities at an earlier stage, processing them jointly from the start. Regardless of architecture, state-of-the-art systems employ a method called “attention” to model the relationships between image regions and words according to their semantic meaning, increasingly concentrating on only the most relevant regions of an image.
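Attention between words and image regions can be sketched in miniature: dot products between a word vector and each region vector measure alignment, and a softmax turns those scores into weights over the regions. This is a simplified, hypothetical sketch, not any contest entry's actual code.

```python
import math

def attention_weights(word_vec, region_vecs):
    """Softmax over word-region dot products: how strongly each
    image region aligns with the given word."""
    scores = [sum(w * r for w, r in zip(word_vec, region))
              for region in region_vecs]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

word = [1.0, 0.0]                    # toy embedding for one caption word
regions = [[2.0, 0.0], [0.0, 2.0]]  # two toy image-region embeddings
weights = attention_weights(word, regions)
# weights concentrate on region 0, which points the same way as the word
```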
Many of the Hateful Memes Challenge contestants have yet to detail their work, but in a new paper, IBM and University of Maryland scientists explain how they incorporated an image captioning workflow into the meme detection process to nab 13th place on the leaderboard. Consisting of three components — an object detector, image captioner, and “triplet-relation network” — the system learns to distinguish hateful memes through image captioning and multimodal features. An image captioning model trains on pairs of images and corresponding captions from a dataset, while a separate module predicts whether memes are hateful by drawing on image features, image caption features, and features from image text processed by an optical character recognition model.
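Assuming the three components the paper describes, the overall data flow might look like the stub below. The function names, feature values, and weights are illustrative assumptions, not the authors' code, and the real triplet-relation network models the relations between streams with attention rather than plain concatenation.

```python
# Hypothetical stub of a three-stream meme-classification pipeline.

def detect_objects(image):
    """Object detector -> visual features (stubbed with toy values)."""
    return [0.4, 0.9]

def caption_image(image):
    """Image captioner -> caption features (stubbed with toy values)."""
    return [0.1, 0.6]

def read_overlaid_text(image):
    """OCR on the meme's overlaid text -> text features (stubbed)."""
    return [0.8, 0.2]

def triplet_relation(visual, caption, text):
    """Fuse the three streams; the real network relates them with
    attention rather than this simple concatenation."""
    return visual + caption + text

def is_hateful(image, weights, threshold=0.0):
    joint = triplet_relation(detect_objects(image), caption_image(image),
                             read_overlaid_text(image))
    return sum(w * x for w, x in zip(weights, joint)) > threshold

verdict = is_hateful(image=None, weights=[1.0] * 6)
```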
The researchers believe their triplet-relation network could be extended to other frameworks that require “strong attention” from multimodal signals. “The performance boost brought by image captioning further indicates that, due to the rich affective and societal content in memes, a practical solution should also consider some additional information related to the meme,” they wrote in a paper describing their work.
Skills like natural language understanding, which humans acquire early on and practice in some cases subconsciously, present roadblocks for even top-performing models, particularly in areas like bias.
In a study accepted to last year’s annual meeting of the Association for Computational Linguistics, researchers from the Allen Institute for AI found that annotators’ insensitivity to differences in dialect could lead to racial bias in automatic hate speech detection models. A separate work came to the same conclusion. And according to an investigation by NBC, Black Instagram users in the U.S. were about 50% more likely to have their accounts disabled by automated hate speech moderation systems than those whose activity indicated they were white.
These types of prejudices can become encoded in computer vision models, which are the components multimodal systems use to classify images. Back in 2015, a software engineer discovered that the image recognition algorithms deployed in Google Photos, Google’s photo storage service, were labeling Black people as “gorillas.” A University of Washington study found women were significantly underrepresented in Google Image searches for professions like “CEO.” Google’s Cloud Vision API recently mislabeled thermometers held by people with darker skin as guns. And countless experiments have shown that image-classifying models trained on ImageNet, a popular (but problematic) dataset containing photos scraped from the internet, automatically learn humanlike biases about race, gender, weight, and more.
Audits of multimodal systems like visual question answering (VQA) models, which incorporate two data types (e.g., text and images) to answer questions, demonstrate that these biases and others negatively impact classification performance. VQA systems frequently lean on statistical relationships between words to answer questions irrespective of images. Most struggle when fed a question like “What time is it?” — which requires the skill of being able to read the time on a clockface — but manage to answer questions like “What color is the grass?” because grass is frequently green in the dataset used for training.
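The "language prior" failure mode is easy to reproduce with a toy baseline that never looks at an image and simply memorizes the most common training answer for each question. The dataset below is fabricated for illustration.

```python
from collections import Counter

# Fabricated question-answer training pairs; no images involved at all.
train = [
    ("what color is the grass?", "green"),
    ("what color is the grass?", "green"),
    ("what color is the grass?", "brown"),
    ("what time is it?", "3:00"),
    ("what time is it?", "noon"),
]

def prior_baseline(train_pairs):
    """Map each question to its most frequent training answer."""
    by_question = {}
    for question, answer in train_pairs:
        by_question.setdefault(question, Counter())[answer] += 1
    return {q: counts.most_common(1)[0][0]
            for q, counts in by_question.items()}

model = prior_baseline(train)
# "Answers" grass-color questions correctly without ever seeing grass,
# which is exactly the statistical shortcut real VQA models exploit.
grass_answer = model["what color is the grass?"]
```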
Bias isn’t the only problem multimodal systems have to contend with. A growing body of work suggests natural language models in particular struggle to understand the nuances of human expression.
A paper published by researchers affiliated with Facebook and Tel Aviv University discovered that on a benchmark designed to measure the extent to which an AI system can follow instructions, a popular language model performed dismally across all tasks. Benchmarks commonly used in the AI and machine learning research community, such as XTREME, have been found to poorly measure models’ knowledge.
Facebook might disagree with this finding. In its latest Community Standards Enforcement Report, the company said it now proactively detects 94.7% of the hate speech it ultimately removes, which amounted to 22.1 million text, image, and video posts in Q3 2020. But critics take issue with these claims. A New York University study published in July estimated that Facebook’s AI systems make about 300,000 content moderation mistakes per day, and problematic posts continue to slip through Facebook’s filters.
Multimodal classifiers are also vulnerable to threats in which attackers attempt to circumvent them by modifying the appearance of images and text. In a Facebook paper published earlier this year, which treated the Hateful Memes Challenge as a case study, researchers managed to trip up classifiers 73% of the time by manipulating both images and text and between 30% and 40% of the time by modifying either images or text alone. In one example involving a hateful meme referencing body odor, formatting the caption “Love the way you smell today” as “LOve the wa y you smell today” caused a system to classify the meme as not hateful.
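The fragility is easy to demonstrate against a deliberately naive, hypothetical substring filter. Learned classifiers are more sophisticated than this, but the paper's result shows they fail in an analogous way when the surface form of the text is perturbed.

```python
def naive_filter(caption, blocklist):
    """A deliberately brittle filter: flags a caption only if it
    contains a blocklisted phrase verbatim (after lowercasing)."""
    text = caption.lower()
    return any(phrase in text for phrase in blocklist)

# Hypothetical blocklist entry for the meme discussed above.
blocklist = ["love the way you smell"]

original = "Love the way you smell today"
perturbed = "LOve the wa y you smell today"  # the paper's perturbed caption

caught = naive_filter(original, blocklist)   # exact phrase present
evaded = naive_filter(perturbed, blocklist)  # one inserted space defeats it
```

Lowercasing neutralizes the "LOve" trick, but the stray space inside "wa y" is enough to break the match, mirroring how character-level noise shifts the features a learned text encoder extracts.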
A tough road ahead
Despite the barriers standing in the way of developing superhuman hateful meme classifiers, researchers are forging ahead with techniques that promise to improve accuracy.
Facebook attempted to mitigate biases in its hateful memes dataset through the use of “benign confounders”: minimally altered versions of memes whose image or text is swapped to flip the label from hateful to harmless. By taking an originally mean-spirited meme and turning it into something appreciative or complimentary, the team hoped to defeat whatever shortcuts might otherwise allow a multimodal classifier to judge a meme's hatefulness from one modality alone. Separately, in a paper last year, Facebook researchers pioneered a new learning strategy to reduce the importance of the most biased examples in VQA model training datasets, implicitly forcing models to use both images and text. And Facebook and others have open-sourced libraries and frameworks, like Pythia, to bolster vision and language multimodal research.
But hateful memes are a moving target because “hateful” is a nebulous category. The act of endorsing hateful memes could be considered hateful, and memes can be indirect or subtle in their perpetration of rumors, fake news, extremist views, and propaganda, in addition to hate speech. Facebook considers “attacks” in memes to be violent or dehumanizing speech; statements of inferiority; and calls for exclusion or segregation based on characteristics like ethnicity, race, nationality, immigration status, religion, caste, sex, gender identity, sexual orientation, and disability or disease, as well as mocking hate crime. But despite its broad reach, this definition is likely too narrow to cover all types of hateful memes.
Emerging trends in hateful memes, like writing text on colored background images, also threaten to stymie multimodal classifiers. Beyond that, most experts believe further research will be required to better understand the relationship between images and text. This might require larger and more diverse datasets than Facebook’s hateful memes collection, which draws from 1 million Facebook posts but discards memes for which replacement images from Getty Images can’t be found to avoid copyright issues.
Whether AI ever substantially surpasses human performance on hateful meme classification may be immaterial, given the unreliability of such systems at a scale as vast as, say, Facebook’s. But if that comes to pass, the techniques could be applied to other challenges in AI and machine learning. Research firm OpenAI is reportedly developing a system trained on images, text, and other data using massive computational resources. The company’s leadership believes this is the most promising path toward artificial general intelligence, or AI that can learn any task a human can. In the near term, novel multimodal approaches could lead to stronger performance in tasks from image captioning to visual dialogue.
“Hate speech is an important societal problem, and addressing it requires improvements in the capabilities of modern machine learning systems,” the coauthors of Facebook’s original paper write in describing the Hateful Memes Challenge. “We found that results on the task reflected a concrete hierarchy in multimodal sophistication, with more advanced fusion models performing better. Still, current state-of-the-art multimodal models perform relatively poorly on this dataset, with a large gap to human performance, highlighting the challenge’s promise as a benchmark to the community.”