MIT study finds ‘systematic’ labeling errors in popular AI benchmark datasets

The field of AI and machine learning is arguably built on the shoulders of a few hundred papers, many of which draw conclusions using data from a subset of public datasets. Large, labeled corpora have been critical to the success of AI in domains ranging from image classification to audio classification. That’s because their annotations expose comprehensible patterns to machine learning algorithms, in effect telling machines what to look for in future datasets so they’re able to make predictions.

But while labeled data is usually equated with ground truth, datasets can — and do — contain errors. The processes used to construct corpora often involve some degree of automatic annotation or crowdsourcing techniques that are inherently error-prone. This becomes especially problematic when these errors reach test sets, the subsets of datasets researchers use to compare progress and validate their findings. Labeling errors here could lead scientists to draw incorrect conclusions about which models perform best in the real world, potentially undermining the framework by which the community benchmarks machine learning systems.

A new paper and website published by researchers at MIT instill little confidence that popular test sets in machine learning are immune to labeling errors. In an analysis of 10 test sets from datasets that include ImageNet, an image database used to train countless computer vision algorithms, the coauthors found an average of 3.4% errors across all of the datasets. The quantities ranged from just over 2,900 errors in the ImageNet validation set to over 5 million errors in QuickDraw, a Google-maintained collection of 50 million drawings contributed by players of the game Quick, Draw!

The researchers say the mislabelings make benchmark results from the test sets unstable. For example, when ImageNet and another image dataset, CIFAR-10, were corrected for labeling errors, larger models performed worse than their lower-capacity counterparts. That’s because the higher-capacity models reflected the distribution of labeling errors in their predictions to a greater degree than smaller models — an effect that increased with the prevalence of mislabeled test data.

Above: A chart showing the percentage of labeling errors in popular AI benchmark datasets.

In choosing which datasets to audit, the researchers looked at the most-used open source datasets created in the last 20 years, with a preference for diversity across computer vision, natural language processing, sentiment analysis, and audio modalities. In total, they evaluated six image datasets (MNIST, CIFAR-10, CIFAR-100, Caltech-256, ImageNet, and QuickDraw), three text datasets (20news, IMDB, and Amazon Reviews), and one audio dataset (AudioSet).

The researchers estimate that QuickDraw had the highest percentage of errors in its test set, at 10.12% of the total labels. CIFAR-100 was second, with around 5.85% incorrect labels, while ImageNet was close behind, with 5.83%. And 390,000 label errors make up roughly 4% of the Amazon Reviews dataset.

Errors included:

  • Mislabeled images, like one breed of dog being confused for another or a baby being confused for a nipple.
  • Mislabeled text sentiment, like Amazon product reviews described as negative when they were actually positive.
  • Mislabeled audio of YouTube videos, like an Ariana Grande high note being classified as a whistle.

A previous study out of MIT found that ImageNet has “systematic annotation issues” and is misaligned with ground truth or direct observation when used as a benchmark dataset. The coauthors of that research concluded that about 20% of ImageNet photos contain multiple objects, leading to a drop in accuracy as high as 10% among models trained on the dataset.

In an experiment, the researchers filtered out the erroneous labels in ImageNet and benchmarked a number of models on the corrected set. The results were largely unchanged, but when the models were evaluated only on the erroneous data, those that performed best on the original, incorrect labels were found to perform the worst on the correct labels. The implication is that the models learned to capture systematic patterns of label error in order to improve their original test accuracy.
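
To make that comparison concrete, here is a minimal sketch of scoring a model only on the examples whose labels were corrected, once against the original labels and once against the corrected ones. The function names, the toy labels, and the two sets of predictions are all hypothetical illustrations, not the researchers' code or data.

```python
from typing import Sequence, Tuple


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the given labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def accuracy_on_relabeled_subset(
    predictions: Sequence[str],
    original_labels: Sequence[str],
    corrected_labels: Sequence[str],
) -> Tuple[float, float]:
    """Score a model only on examples whose label changed after correction.

    Returns (accuracy vs. original labels, accuracy vs. corrected labels).
    """
    changed = [
        i
        for i, (old, new) in enumerate(zip(original_labels, corrected_labels))
        if old != new
    ]
    preds = [predictions[i] for i in changed]
    return (
        accuracy(preds, [original_labels[i] for i in changed]),
        accuracy(preds, [corrected_labels[i] for i in changed]),
    )


# Hypothetical toy data: three of the six original labels are wrong.
original = ["cat", "dog", "boa", "dog", "cat", "boa"]
corrected = ["cat", "dog", "dog", "cat", "cat", "dog"]

# Hypothetical predictions: model_a tends to reproduce the original
# (noisy) labels, model_b tends to match the corrected ones.
model_a = ["cat", "dog", "boa", "dog", "cat", "dog"]
model_b = ["cat", "dog", "dog", "cat", "dog", "dog"]

for name, preds in (("model_a", model_a), ("model_b", model_b)):
    on_original, on_corrected = accuracy_on_relabeled_subset(
        preds, original, corrected
    )
    print(f"{name}: original={on_original:.2f} corrected={on_corrected:.2f}")
```

In this toy setup, the model that scores higher against the original labels on the relabeled subset scores lower against the corrected ones, which is the pattern the article describes.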

Above: A Chihuahua mislabeled as a feather boa in ImageNet.

In a follow-up experiment, the coauthors created an error-free CIFAR-10 test set to measure AI models for “corrected” accuracy. The results show that powerful models didn’t reliably perform better than their simpler counterparts because performance was correlated with the degree of labeling errors. For datasets where errors are common, data scientists might be misled to select a model that isn’t actually the best model in terms of corrected accuracy, the study’s coauthors say.

“Traditionally, machine learning practitioners choose which model to deploy based on test accuracy — our findings advise caution here, proposing that judging models over correctly labeled test sets may be more useful, especially for noisy real-world datasets,” the researchers wrote. “It is imperative to be cognizant of the distinction between corrected versus original test accuracy and to follow dataset curation practices that maximize high-quality test labels.”

To promote more accurate benchmarks, the researchers have released a cleaned version of each test set in which a large portion of the label errors have been corrected. The team recommends that data scientists measure the real-world accuracy they care about in practice and consider using simpler models for datasets with error-prone labels, especially for algorithms trained or evaluated with noisy labeled data.

Creating datasets in a privacy-preserving, ethical way remains a major blocker for researchers in the AI community, particularly those who specialize in computer vision. In January 2019, IBM released a corpus designed to mitigate bias in facial recognition algorithms that contained nearly a million photos of people from Flickr. But IBM failed to notify either the photographers or the subjects of the photos that their work would be included. Separately, an earlier version of ImageNet, a dataset used to train AI systems around the world, was found to contain photos of naked children, porn actresses, college parties, and more, all scraped from the web without those individuals’ consent.

In July 2020, the creators of the 80 Million Tiny Images dataset from MIT and NYU took the collection offline, apologized, and asked other researchers to refrain from using the dataset and to delete any existing copies. Introduced in 2006 and containing photos scraped from internet search engines, 80 Million Tiny Images was found to have a range of racist, sexist, and otherwise offensive annotations, such as nearly 2,000 images labeled with the N-word, and labels like “rape suspect” and “child molester.” The dataset also contained pornographic content like nonconsensual photos taken up women’s skirts.

Biases in these datasets often find their way into trained, commercially available AI systems. Back in 2015, a software engineer pointed out that the image recognition algorithms in Google Photos were labeling his Black friends as “gorillas.” Nonprofit AlgorithmWatch showed that Google’s Cloud Vision API automatically labeled a thermometer held by a dark-skinned person as a “gun” while labeling a thermometer held by a light-skinned person as an “electronic device.” And benchmarks of major vendors’ systems by the Gender Shades project and the National Institute of Standards and Technology (NIST) suggest that facial recognition technology exhibits racial and gender bias, and that facial recognition programs can be wildly inaccurate, misclassifying people upwards of 96% of the time.

Some in the AI community are taking steps to build less problematic corpora. The ImageNet creators said they plan to remove virtually all of about 2,800 categories in the “person” subtree of the dataset, which were found to poorly represent people from the Global South. And this week, the group released a version of the dataset that blurs people’s faces in order to support privacy experimentation.

