As impressively capable as AI systems are these days, teaching machines to perform a variety of tasks, from translating speech in real time to accurately differentiating chihuahuas from blueberry muffins, still involves a fair amount of hand-holding and data curation by the humans who train them. However, emerging self-supervised learning (SSL) methods, which have already revolutionized natural language processing, may hold the key to imbuing AI with some much-needed common sense. Facebook’s AI research division (FAIR) has now, for the first time, applied SSL to computer vision training.
“We’ve developed SEER (SElf-supERvised), a new billion-parameter self-supervised computer vision model that can learn from any random group of images on the internet, without the need for the careful curation and labeling that goes into most computer vision training today,” Facebook AI researchers wrote in a blog post Thursday. In SEER’s case, Facebook showed it more than a billion random, unlabeled and uncurated public Instagram images.
Under supervised learning schemes, Facebook AI chief scientist Yann LeCun told Engadget, “To recognize speech, you have to label the words that have been spoken; if you want to translate, you must have parallel text. To recognize images, you must have labels for each image.”
Unsupervised learning, on the other hand, “is the idea of a problem of trying to train a system to represent images in appropriate ways, without requiring labeled images,” LeCun explained. One such method is joint embedding, in which a neural network is presented with a pair of nearly identical images – an original and a slightly modified, distorted copy. “You train the system so that whatever vectors are produced by those two elements are as close to each other as possible,” LeCun said. “Then the problem is to make sure that when the system sees two different images, it produces different vectors, different ‘embeddings’ as we call them. The very natural way to do this is to randomly pick millions of pairs of images that you know are different, run them through the network, and hope for the best.” However, contrastive methods like this tend to be very resource- and time-intensive given the scale of training data required.
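The pull-together, push-apart objective LeCun describes can be sketched as a classic margin-based contrastive loss. This is only an illustration of the general idea, not SEER’s actual training objective: the squared-distance form, the `margin` value and the three-dimensional toy embeddings are all assumptions made for the example.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(emb_a, emb_b, same_image, margin=1.0):
    """Margin-based contrastive loss for a single pair of embeddings.

    same_image=True:  an image and its distorted copy, so any distance is penalized.
    same_image=False: two different images, penalized only if closer than `margin`.
    """
    d = euclidean_distance(emb_a, emb_b)
    if same_image:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Hypothetical embeddings, invented for illustration only.
original = [0.9, 0.1, 0.0]
distorted = [0.8, 0.2, 0.0]   # slightly modified copy of `original`
other = [0.0, 0.1, 0.9]       # a different image

positive_loss = contrastive_loss(original, distorted, same_image=True)
negative_loss = contrastive_loss(original, other, same_image=False)

print(round(positive_loss, 4))  # small, because the two views are already close
print(negative_loss)            # zero, because the pair is farther apart than the margin
```

Training would adjust the network so that losses like these shrink across millions of pairs, which is where the cost LeCun mentions comes from.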
Applying the same SSL techniques used in NLP to computer vision poses additional challenges. As LeCun notes, semantic language concepts are easily broken down into discrete words and phrases. “But with images, the algorithm must decide which pixel belongs to which concept. Furthermore, the same concept will vary widely between images, such as a cat in different poses or viewed from different angles,” he wrote. “We need to look at a lot of images to grasp the variation around a single concept.”
And for this training method to be effective, researchers needed both an algorithm flexible enough to learn from large numbers of unannotated images and a convolutional network capable of sorting through the data that algorithm generates. Facebook found the former in a recently released algorithm, which “uses online clustering to rapidly group images with similar visual concepts and leverage their similarities,” six times faster than the previous state of the art, according to LeCun. The latter could be found in RegNets, a family of convolutional networks that can scale to billions (if not trillions) of parameters while being optimized for the available computational resources.
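Grouping images online, meaning as they stream in rather than after the whole dataset has been seen, can be illustrated with sequential k-means. This is a deliberately simple stand-in and not the algorithm Facebook used; the function names, the two-dimensional “embeddings” and the starting centroids are hypothetical.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((p - c) ** 2 for p, c in zip(point, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def online_kmeans(stream, initial_centroids):
    """Sequential k-means: assign each arriving point, then update its centroid.

    Each centroid is the running mean of the points assigned to it so far,
    so the first point assigned to a cluster replaces its initial centroid.
    """
    centroids = [list(c) for c in initial_centroids]
    counts = [0] * len(centroids)
    assignments = []
    for point in stream:
        k = nearest_centroid(point, centroids)
        counts[k] += 1
        step = 1.0 / counts[k]  # running-mean update weight
        centroids[k] = [c + step * (p - c) for c, p in zip(centroids[k], point)]
        assignments.append(k)
    return centroids, assignments


# Hypothetical 2-D embeddings arriving one at a time.
stream = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.2, 4.9]]
centroids, assignments = online_kmeans(stream, [[0.0, 0.0], [5.0, 5.0]])

print(assignments)  # [0, 1, 0, 1]
```

Because every point is processed once and only one centroid is touched per step, the cost per image stays constant, which is the property that makes online grouping attractive at billion-image scale.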
The results of this new system are quite impressive. After its billion-parameter pre-training session, SEER managed to outperform leading self-supervised systems on ImageNet, reaching 84.2% top-1 accuracy. Even when trained using only 10% of the original dataset, SEER achieved 77.9% accuracy. And using just 1% of the OG dataset, SEER still managed a respectable 60.5% top-1 accuracy.
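For readers unfamiliar with the metric, top-1 accuracy is the share of images for which the model’s single highest-scoring class matches the true label. A minimal sketch follows; the scores and labels are made up for illustration and have nothing to do with SEER’s actual outputs.

```python
def top1_accuracy(scores, labels):
    """Fraction of samples whose highest-scoring class equals the true label.

    scores: one list of per-class scores for each image
    labels: the true class index for each image
    """
    if not labels or len(scores) != len(labels):
        raise ValueError("scores and labels must be non-empty and the same length")
    correct = 0
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=row.__getitem__)  # argmax
        if predicted == label:
            correct += 1
    return correct / len(labels)


# Hypothetical scores for four images over three classes.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.6, 0.3],
    [0.3, 0.3, 0.4],
    [0.5, 0.4, 0.1],  # the model picks class 0, but the true label is 1
]
labels = [0, 1, 2, 1]

print(top1_accuracy(scores, labels))  # 0.75
```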
Essentially, this research shows that, as with NLP training, unsupervised learning methods can be effectively applied to computer vision applications. With this added flexibility, Facebook and other social media platforms should be better equipped to deal with banned content.
“What we would like to have, and what we already have to some extent but need to improve, is a universal image understanding system,” LeCun said. “So a system that, whenever you upload a photo or an image to Facebook, computes one of those embeddings, and from that we can tell you it’s a cat photo or it’s, you know, terrorist propaganda.”
As with their other AI research, LeCun’s team is publishing its research and SEER’s training library, dubbed VISSL, under an open source license. If you want to give the system a try, head to the VISSL project site for additional documentation and to grab its code from GitHub.