In the contest, a method called deep learning, which involves feeding examples to a giant simulated neural network, proved significantly better at identifying objects in images than other approaches. That result sparked a surge of interest in using AI to solve different problems.
But research published this week shows that ImageNet and nine other key AI datasets contain many errors. MIT researchers compared how an AI algorithm trained on the data interprets an image with the label applied to it. If, for example, the algorithm decides that an image has a 70% chance of being a cat but the label says “spoon,” then it’s likely that the image is mislabeled and actually shows a cat. To verify, the researchers showed the images on which the algorithm and the label disagreed to more people.
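The comparison described above can be sketched in a few lines of Python. This is a simplified illustration, not the researchers' actual procedure: the function name `flag_label_disagreements`, the fixed 0.7 threshold (borrowed from the 70% example), and the toy probabilities are all assumptions made for the example.

```python
# Hypothetical sketch: flag items where a trained model confidently
# disagrees with the label a human annotator assigned.
# Assumption: pred_probs[i][k] is the model's probability that item i
# belongs to class k; given_labels[i] is the annotator's class index.

def flag_label_disagreements(pred_probs, given_labels, threshold=0.7):
    """Return indices of items whose label looks suspicious."""
    suspects = []
    for i, (probs, label) in enumerate(zip(pred_probs, given_labels)):
        # Class the model considers most likely for this item.
        predicted = max(range(len(probs)), key=probs.__getitem__)
        # Suspicious only if the model disagrees AND is confident enough.
        if predicted != label and probs[predicted] >= threshold:
            suspects.append(i)
    return suspects


CLASSES = ["cat", "spoon"]   # toy classes, taken from the article's example
pred_probs = [
    [0.70, 0.30],  # model: 70% cat
    [0.10, 0.90],  # model: 90% spoon
    [0.55, 0.45],  # model: leans cat, but not confidently
]
given_labels = [1, 1, 1]     # every item was labeled "spoon"

suspects = flag_label_disagreements(pred_probs, given_labels)
for i in suspects:
    print(f"item {i}: labeled {CLASSES[given_labels[i]]}, send to human review")
```

Flagged items are only candidates. As in the study, a disagreement would still need to be confirmed by additional human reviewers before the label is treated as wrong.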
ImageNet and other big data sets are critical to how AI systems, including those used in autonomous cars, medical imaging devices, and credit rating systems, are developed and tested. But they can also be a weak link. The data is generally collected and labeled by low-paid workers, and research is accumulating on the problems that this method introduces.
Algorithms can exhibit bias when recognizing faces, for example, if they are trained on data that is overwhelmingly white and male. Labelers can also introduce bias if, for example, they decide that women shown in medical settings are more likely to be “nurses” while men are more likely to be “doctors”.
Recent research has also highlighted how basic errors lurking in the data used to train and test AI models (the predictions an algorithm produces) can mask how good or bad those models really are.
“What this work reveals to the world is that you need to eliminate the errors,” says Curtis Northcutt, a doctoral student at MIT who led the new work. “Otherwise, the models that you think are the best for your actual business problem might actually be wrong.”
Aleksander Madry, a professor at MIT who led another effort to identify problems in image datasets last year, was not involved in the new work. He says it highlights an important problem, although the methodology needs to be studied carefully to determine whether the errors are as widespread as the new work suggests.
Similar big data sets are used to develop algorithms for various industrial uses of AI. Millions of annotated images of road scenes, for example, are fed to algorithms that help autonomous vehicles perceive obstacles on the road. Large collections of labeled medical records also help algorithms predict the likelihood of a person developing a particular disease.
Such labeling errors can lead machine learning engineers down the wrong path when choosing among different AI models. “They could actually choose the model that performs the worst in the real world,” says Northcutt.
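A toy Python example shows how this can happen. The labels and predictions below are invented for illustration and are not results from the study; `model_a` and `model_b` are hypothetical models, and `corrected_labels` stands in for labels fixed after human review.

```python
# Hypothetical sketch: the same two models, scored against the original
# (partly wrong) test labels and against corrected labels.

def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy test set of 10 items; items 2 and 6 carry wrong labels.
given_labels     = [0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
corrected_labels = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]

# model_a reproduces the two labeling errors and misses item 8.
model_a = [0, 1, 1, 0, 1, 0, 0, 1, 0, 0]
# model_b gets the mislabeled items right but misses item 9.
model_b = [0, 1, 0, 0, 1, 0, 1, 1, 1, 1]

for name, preds in [("model_a", model_a), ("model_b", model_b)]:
    print(
        name,
        "given labels:", accuracy(preds, given_labels),
        "corrected labels:", accuracy(preds, corrected_labels),
    )
```

On the original labels `model_a` looks better, but on the corrected labels `model_b` comes out ahead, so an engineer relying on the uncorrected test set would pick the weaker model.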
Northcutt cites the algorithms used to identify objects on the road in front of self-driving cars as an example of a critical system that might not perform as well as its developers think.
It’s hardly surprising that AI datasets contain errors, given that annotations and labels are typically applied by low-paid workers. This is an open secret in AI research, but few researchers have attempted to determine how frequent such errors are. Nor has the effect of such errors on the performance of different AI models been demonstrated.
MIT researchers looked at the ImageNet test dataset, the subset of images used to test a trained algorithm, and found incorrect labels on 6% of the images. They found a similar proportion of errors in the data sets used to train AI programs to gauge how positive or negative movie reviews are, how many stars a product review will receive, or what a video shows, among other things.
These AI datasets have been used to train algorithms and measure progress in areas such as computer vision and natural language understanding. The work shows that the presence of these errors in the test data set makes it difficult to judge whether one algorithm is better than another. For example, an algorithm designed to identify pedestrians may perform less well when incorrect labels are removed. The difference may not seem like much, but it could have a big impact on the performance of an autonomous vehicle.
After a period of intense hype following the ImageNet breakthrough in 2012, it has become increasingly clear that modern AI algorithms can suffer from problems caused by the data fed to them. Some say the whole concept of labeling data is problematic as well. “At the heart of supervised learning, especially in vision, is this fuzzy idea of a label,” says Vinay Prabhu, a machine learning researcher who works for the company UnifyID.
Last June, Prabhu and Abeba Birhane, a PhD student at University College Dublin, scanned ImageNet and found errors, abusive language and personally identifying information.
Prabhu points out that labels often cannot fully describe an image containing multiple objects, for example. He also says it’s problematic if labellers can add judgments about a person’s profession, nationality or character, as was the case with ImageNet.