Facebook’s new computer vision model achieves state-of-the-art performance by learning from random images

Facebook today announced an AI model trained on a billion images that ostensibly achieves state-of-the-art results on a range of computer vision benchmarks. Unlike most computer vision models, which learn from labeled datasets, Facebook’s generates labels from the data itself, exposing relationships between the parts of the data, a step considered critical to one day achieving human-level intelligence.

The future of AI lies in creating systems that can make inferences from whatever information they are given, without relying on annotated datasets. Whether working with text, images, or other data, an AI system would ideally be able to recognize objects in a photo, interpret text, or perform any of the countless other tasks asked of it.

Facebook claims to have taken a step in that direction with a computer vision model called SEER, short for SElf-supERvised. SEER contains a billion parameters and can learn from any random group of images on the internet without the need for curation or annotation. Parameters, a fundamental component of machine learning systems, are the parts of a model derived from historical training data.

New techniques

Self-supervision for vision is a challenging task. With text, semantic concepts can be divided into discrete words, but with images, a model must decide for itself which pixels belong to which concepts. Making matters harder, the same concept can vary considerably between images. Grasping the variation around a single concept therefore requires examining many different images.

Facebook researchers found that scaling AI systems to work with complex image data required at least two major components. The first was an algorithm that could learn from a large number of random images without any metadata or annotations; the second was a convolutional network (ConvNet) large enough to capture and learn every visual concept from that data. Convolutional networks, first proposed in the 1980s, are inspired by biological processes: the pattern of connectivity between the model’s components resembles that of the visual cortex.
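The core operation of a convolutional network is simple to state: slide a small filter over the image and record how strongly each patch matches it. As a toy illustration (not Facebook’s code; the image and filter here are made up for demonstration), a minimal NumPy cross-correlation with a hand-made vertical-edge filter looks like this:

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation: slide the kernel over every image patch."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# A vertical-edge filter responds where intensity changes from left to right.
img = np.zeros((5, 5))
img[:, 2:] = 1.0                          # dark left half, bright right half
edge_kernel = np.array([[-1.0, 1.0]] * 3)  # 3x2 vertical-edge filter
out = conv2d(img, edge_kernel)
print(out)  # column 1 of the output lights up (value 3.0) where the edge sits
```

In a real ConvNet the filter values are not hand-designed but learned from data, and hundreds of such filters are stacked in layers so that later layers respond to progressively more abstract visual concepts.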

In developing SEER, Facebook drew on an algorithm called SwAV, which grew out of the company’s research into self-supervised learning. SwAV uses a clustering technique to rapidly group images containing similar visual concepts and leverage their similarities, improving on the previous state of the art in self-supervised learning while requiring as little as one-sixth the training time.
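As a rough, unofficial sketch of the idea (not the SwAV implementation itself; the prototype count, temperature, and Sinkhorn iterations below are illustrative choices), the swapped-prediction objective can be written in NumPy: each augmented view of an image is softly assigned to a set of prototype clusters, and each view is trained to predict the other view’s assignment.

```python
import numpy as np

def sinkhorn(scores, n_iters=3, eps=0.05):
    """Balanced soft assignment of batch features to prototypes (Sinkhorn-Knopp)."""
    Q = np.exp(scores / eps).T             # (n_prototypes, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)  # give each prototype equal total mass
        Q /= K
        Q /= Q.sum(axis=0, keepdims=True)  # make each image's assignment a distribution
        Q /= B
    return (Q * B).T                       # (batch, n_prototypes), rows sum to ~1

def swav_loss(z1, z2, prototypes, temp=0.1):
    """Swapped prediction: view 1 predicts view 2's cluster assignment, and vice versa."""
    normalize = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
    z1, z2, C = normalize(z1), normalize(z2), normalize(prototypes)
    s1, s2 = z1 @ C.T, z2 @ C.T            # cosine similarity to each prototype
    q1, q2 = sinkhorn(s1), sinkhorn(s2)    # target assignments (treated as fixed)
    def cross_entropy(q, s):
        log_p = s / temp - np.log(np.exp(s / temp).sum(axis=1, keepdims=True))
        return -(q * log_p).sum(axis=1).mean()
    return cross_entropy(q2, s1) + cross_entropy(q1, s2)  # note the swap

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))              # embeddings of 8 images, first augmentation
z2 = z1 + 0.05 * rng.normal(size=(8, 16))  # a slightly perturbed second augmentation
prototypes = rng.normal(size=(4, 16))      # 4 learnable cluster prototypes
loss = swav_loss(z1, z2, prototypes)
print(f"swapped-prediction loss: {loss:.3f}")
```

In training, gradients of this loss update both the feature extractor and the prototypes, so images with similar content drift toward the same clusters without any labels.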

Above: A simplified diagram of the SEER model architecture.

Image credit: Facebook

Training models of SEER’s size also required an architecture that was efficient in runtime and memory without compromising accuracy, according to Facebook. The researchers behind SEER opted for RegNets, a family of ConvNet models capable of scaling to billions, and potentially trillions, of parameters while adapting to runtime and memory constraints.

Facebook software engineer Priya Goyal said SEER was trained for 30 days on 512 Nvidia V100 GPUs, each with 32GB of RAM.

The last piece that made SEER possible was a general-purpose library called VISSL, short for VIsion library for state-of-the-art Self-Supervised Learning. VISSL, which Facebook open-sourced today, enables self-supervised training with a variety of modern machine learning methods. The library facilitates self-supervised learning at scale, integrating algorithms that reduce per-GPU memory requirements and speed up the training of any model.

Performance and future work

After pre-training on a billion public Instagram images, SEER surpassed the most advanced self-supervised systems, Facebook says. SEER also outperformed existing models on downstream tasks including object detection, segmentation, and image classification. When trained with just 10% of the examples in the popular ImageNet dataset, SEER still managed to achieve 77.9% accuracy, and with just 1%, it reached 60.5% accuracy.

When asked whether Instagram users whose images were used to train SEER were notified or given a chance to opt out of the research, Goyal noted that Facebook’s data policy informs Instagram account holders that their information, including images, may be used to support research of this kind. That said, Facebook does not plan to share the images or the SEER model itself, in part because the model may contain unintended biases.

“Self-supervised learning has been a focus for Facebook AI because it allows machines to learn directly from the vast amount of information available in the world, rather than just from training data created specifically for AI research,” Facebook wrote in a blog post. “Self-supervised learning has incredible ramifications for the future of computer vision, as well as for other fields of research. Eliminating the need for human annotations and metadata lets the computer vision community work with larger and more diverse datasets, learn from random public images, and potentially mitigate some of the biases that come into play with data curation. Self-supervised learning can also help specialize models for domains where we have limited images or metadata, such as medical imaging. And with no upfront labeling required, models can be created and deployed more quickly, enabling faster and more accurate responses to rapidly evolving situations.”

