Re-read Note: Self-supervised learning: The dark matter of intelligence

Paul Xiong
10 min read · Nov 8, 2022

Original post: https://ai.facebook.com/blog/self-supervised-learning-the-dark-matter-of-intelligence/

Now, at the end of 2022, let's re-read this post, which was published in mid-2021; reviewing it may give us a fresh view of SSL.

We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems.

Self-supervised learning enables AI systems to learn from orders of magnitude more data, which is important to recognize and understand patterns of more subtle, less common representations of the world. Self-supervised learning has long had great success in advancing the field of natural language processing (NLP), including the Collobert-Weston 2008 model and, more recently, BERT, RoBERTa, XLM-R, and others. Systems pretrained this way yield considerably higher performance than when solely trained in a supervised manner.

Our latest research project leverages SwAV and other methods to pretrain a large network on a billion random unlabeled images, yielding top accuracy on a diverse set of vision tasks. This progress demonstrates that self-supervised learning can excel at CV tasks in complex, real-world settings as well.

SwAV has not had big success yet.

Self-supervised learning obtains supervisory signals from the data itself, often leveraging the underlying structure in the data. The general technique of self-supervised learning is to predict any unobserved or hidden part (or property) of the input from any observed or unhidden part of the input. For example, as is common in NLP, we can hide part of a sentence and predict the hidden words from the remaining words.

Self-supervised learning has had a particularly profound impact on NLP, allowing us to train models such as BERT, RoBERTa, XLM-R, and others on large unlabeled text data sets and then use these models for downstream tasks. These models are pretrained in a self-supervised phase and then fine-tuned for a particular task, such as classifying the topic of a text. In the self-supervised pretraining phase, the system is shown a short text (typically 1,000 words) in which some of the words have been masked or replaced. The system is trained to predict the words that were masked or replaced. In doing so, the system learns to represent the meaning of the text so that it can do a good job at filling in “correct” words, or those that make sense in the context.

Predicting missing parts of the input is one of the more standard tasks for SSL pretraining. To complete a sentence such as “The (blank) chases the (blank) in the savanna,” the system must learn that lions or cheetahs can chase antelope or wildebeests, but that cats chase mice in the kitchen, not the savanna. As a consequence of the training, the system learns to represent the meaning of words, the syntactic role of words, and the meaning of entire texts.
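
As an illustration of this fill-in-the-blank pretraining task, here is a minimal sketch using the Hugging Face transformers library; the fill-mask pipeline and the bert-base-uncased checkpoint are my own illustrative choices, not something the original post prescribes.

```python
# A minimal sketch of masked-word prediction with an off-the-shelf model.
# The library and the model name are illustrative assumptions.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# BERT-style models mark the hidden word with a [MASK] token.
for candidate in fill_mask("The [MASK] chases the antelope in the savanna."):
    print(f"{candidate['token_str']:>12}  {candidate['score']:.3f}")
```

Each candidate word comes back with a score, so the model expresses its uncertainty over the possible fillers rather than committing to a single answer.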

These techniques, however, can’t be easily extended to new domains, such as CV. Despite promising early results, SSL has not yet brought about the same improvements in computer vision that we have seen in NLP (though this will change).

ViT and Pix2Seq are attempts in this direction.

But we do not know how to efficiently represent uncertainty when we predict missing frames in a video or missing patches in an image. We cannot list all possible video frames and associate a score to each of them, because there is an infinite number of them. While this problem has limited the performance improvement from SSL in vision, new SSL techniques such as SwAV are starting to beat accuracy records in vision tasks. This is best demonstrated by the SEER system that uses a large convolutional network trained with billions of examples.

To better understand this challenge, we first need to understand the prediction uncertainty and the way it’s modeled in NLP compared with CV. In NLP, predicting the missing words involves computing a prediction score for every possible word in the vocabulary. While the vocabulary itself is large and predicting a missing word involves some uncertainty, it’s possible to produce a list of all the possible words in the vocabulary together with a probability estimate of the words’ appearance at that location. Typical machine learning systems do so by treating the prediction problem as a classification problem and computing scores for each outcome using a giant so-called softmax layer, which transforms raw scores into a probability distribution over words. With this technique, the uncertainty of the prediction is represented by a probability distribution over all possible outcomes, provided that there is a finite number of possible outcomes.
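
A minimal sketch of that giant softmax layer, assuming an arbitrary vocabulary size and hidden dimension (both are illustrative, not taken from any particular model):

```python
# Sketch: representing prediction uncertainty over a discrete vocabulary
# with a softmax layer. Sizes and names are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, hidden_dim = 30_000, 768
to_vocab = nn.Linear(hidden_dim, vocab_size)   # the "giant softmax layer"

h = torch.randn(1, hidden_dim)                 # hidden state at the masked position
scores = to_vocab(h)                           # one raw score per word in the vocabulary
probs = torch.softmax(scores, dim=-1)          # probability distribution over all words

print(probs.shape, probs.sum())                # shape (1, 30000); the probabilities sum to ~1
```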

In CV, on the other hand, the analogous task of predicting “missing” frames in a video, missing patches in an image, or a missing segment in a speech signal involves predicting high-dimensional continuous objects rather than discrete outcomes. There are an infinite number of possible video frames that can plausibly follow a given video clip. It is not possible to explicitly represent all the possible video frames and associate a prediction score to them. In fact, we may never have techniques to represent suitable probability distributions over high-dimensional continuous spaces, such as the set of all possible video frames.

This seems like an intractable problem.

There is a way to think about SSL within the unified framework of an energy-based model (EBM). An EBM is a trainable system that, given two inputs, x and y, tells us how incompatible they are with each other. For example, x could be a short video clip, and y another proposed video clip. The machine would tell us to what extent y is a good continuation for x. To indicate the incompatibility between x and y, the machine produces a single number, called an energy. If the energy is low, x and y are deemed compatible; if it is high, they are deemed incompatible.
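
A minimal sketch of the EBM interface, assuming a toy MLP over feature vectors; the architecture and dimensions are illustrative only:

```python
# Sketch of an energy-based model: a trainable function that maps a pair
# (x, y) to one scalar. Low energy = compatible, high energy = incompatible.
# The MLP architecture and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(x_dim + y_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),              # a single number: the energy
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x, y], dim=-1)).squeeze(-1)

ebm = EnergyModel(x_dim=64, y_dim=64)
x, y = torch.randn(8, 64), torch.randn(8, 64)  # e.g., clip features and candidate continuations
print(ebm(x, y).shape)                         # one energy per (x, y) pair
```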

The energy here acts as an incompatibility score. Contrastive learning, e.g., SimCLR from Google, does this too. It seems that SimCLR has had more influence than the EBM framing.

Joint embedding, Siamese networks

A particularly well-suited deep learning architecture for this is the so-called Siamese network, or joint embedding architecture. The idea goes back to papers from Geoff Hinton’s lab and Yann LeCun’s group in the early 1990s and the mid-2000s. It was relatively ignored for a long time but has enjoyed a revival since late 2019. A joint embedding architecture is composed of two identical (or almost identical) copies of the same network. One network is fed with x and the other with y. The networks produce output vectors called embeddings, which represent x and y. A third module, joining the networks at the head, computes the energy as the distance between the two embedding vectors. When the model is shown distorted versions of the same image, the parameters of the networks can easily be adjusted so that their outputs move closer together. This ensures that the network will produce nearly identical representations (or embeddings) of an object, regardless of the particular view of that object.
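
A minimal PyTorch sketch of this joint embedding idea, with a toy encoder standing in for a real vision backbone (the encoder, sizes, and distance choice are my assumptions):

```python
# Sketch of a joint embedding (Siamese) architecture: the same encoder is
# applied to x and y, and the energy is the distance between the embeddings.
# The tiny encoder is an illustrative stand-in for a real vision backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 64))

def energy(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    zx = F.normalize(encoder(x), dim=-1)       # embedding of x
    zy = F.normalize(encoder(y), dim=-1)       # embedding of y (same weights)
    return (zx - zy).pow(2).sum(dim=-1)        # distance between embeddings = energy

# Two distorted views of the same image should be driven toward low energy.
view_a, view_b = torch.randn(4, 784), torch.randn(4, 784)
loss = energy(view_a, view_b).mean()           # minimizing this pulls the embeddings together
loss.backward()
```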

The difficulty is to make sure that the networks produce high energy, i.e. different embedding vectors, when x and y are different images. Without a specific way to do so, the two networks could happily ignore their inputs and always produce identical output embeddings. This phenomenon is called a collapse. When a collapse occurs, the energy is not higher for nonmatching x and y than it is for matching x and y.

There are two categories of techniques to avoid collapse: contrastive methods and regularization methods.

Contrastive energy-based SSL

Contrastive methods are based on the simple idea of constructing pairs of x and y that are not compatible, and adjusting the parameters of the model so that the corresponding output energy is large.
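
A minimal sketch of a contrastive objective in this spirit, using a margin (hinge) loss; the encoder, margin value, and the way negative pairs are formed are illustrative assumptions:

```python
# Sketch of a contrastive objective: lower the energy of compatible pairs
# and raise the energy of constructed incompatible pairs (up to a margin).
# The encoder and the margin are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))

def energy(x, y):
    return (F.normalize(encoder(x), dim=-1) - F.normalize(encoder(y), dim=-1)).pow(2).sum(-1)

def contrastive_loss(x, y_pos, y_neg, margin: float = 1.0):
    pos = energy(x, y_pos)                             # compatible pair: push energy down
    neg = energy(x, y_neg)                             # incompatible pair: push energy up
    return (pos + F.relu(margin - neg)).mean()         # hinge keeps negatives above the margin

x     = torch.randn(16, 784)                           # anchor views
y_pos = torch.randn(16, 784)                           # distorted versions of the same images
y_neg = torch.randn(16, 784)                           # views of different images
contrastive_loss(x, y_pos, y_neg).backward()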

The method used to train NLP systems by masking or substituting some input words belongs to the category of contrastive methods. But they don’t use the joint embedding architecture. Instead, they use a predictive architecture in which the model directly produces a prediction for y. One starts from a complete segment of text y, then corrupts it, e.g., by masking some words to produce the observation x. The corrupted input is fed to a large neural network that is trained to reproduce the original text y. An uncorrupted text will be reconstructed as itself (low reconstruction error), while a corrupted text will be reconstructed as an uncorrupted version of itself (large reconstruction error). If one interprets the reconstruction error as an energy, it will have the desired property: low energy for “clean” text and higher energy for “corrupted” text.

The general technique of training a model to restore a corrupted version of an input is called a denoising auto-encoder. While early forms of this idea go back to the 1980s, it was revived in 2008 by Pascal Vincent and colleagues at the University of Montréal, introduced in the context of NLP by Collobert and Weston, and popularized by the BERT model from our friends at Google.
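
A minimal denoising auto-encoder sketch, assuming a toy architecture and a simple zero-masking corruption:

```python
# Sketch of a denoising auto-encoder: corrupt the input, reconstruct the
# original, and interpret the reconstruction error as an energy.
# The architecture and the zero-masking corruption are illustrative assumptions.
import torch
import torch.nn as nn

autoencoder = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),    # encoder
    nn.Linear(128, 784),               # decoder
)

def corrupt(x: torch.Tensor, drop_prob: float = 0.3) -> torch.Tensor:
    mask = (torch.rand_like(x) > drop_prob).float()
    return x * mask                    # randomly zero out parts of the input

y = torch.rand(32, 784)                # original ("clean") inputs
x = corrupt(y)                         # corrupted observations
reconstruction = autoencoder(x)        # trained to reproduce the original y
loss = ((reconstruction - y) ** 2).mean()
loss.backward()
# After training, the reconstruction error of an input can be read as its
# energy: low for clean-looking inputs, higher for corrupted ones.
```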

As we pointed out earlier, a predictive architecture of this type can produce only a single prediction for a given input. Since the model must be able to predict multiple possible outcomes, the prediction is not a single set of words but a series of scores for every word in the vocabulary for each missing word location.

But we cannot use this trick for images because we cannot enumerate all possible images. Is there a solution to this problem? The short answer is no. There are interesting ideas in this direction, but they have not yet led to results that are as good as joint embedding architectures. One interesting avenue is latent-variable predictive architectures.

Latent-variable predictive models contain an extra input variable (z). It is called latent because its value is never observed. With a properly trained model, as the latent variable varies over a given set, the output prediction varies over the set of plausible predictions compatible with the input x.
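
A minimal sketch of such a latent-variable predictive architecture: the same observed x paired with different values of z yields different plausible predictions (the tiny predictor and the dimensions are illustrative assumptions):

```python
# Sketch of a latent-variable predictive model: the prediction depends on the
# observed input x AND an unobserved latent z; sweeping z over a set produces
# a set of plausible predictions. Architecture and sizes are assumptions.
import torch
import torch.nn as nn

predictor = nn.Sequential(nn.Linear(64 + 8, 128), nn.ReLU(), nn.Linear(128, 64))

x = torch.randn(1, 64)                         # observed part of the input
for _ in range(3):
    z = torch.randn(1, 8)                      # a different latent value each time
    y_hat = predictor(torch.cat([x, z], dim=-1))
    print(y_hat[0, :4])                        # a different plausible prediction for each z
```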

Latent-variable models can be trained with contrastive methods. A good example of this is a generative adversarial network (GAN). The critic (or discriminator) can be seen as computing an energy indicating whether the input y looks good. The generator network is trained to produce contrastive samples to which the critic is trained to associate high energy.
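
A minimal sketch of this reading of a GAN in energy terms, with hinge-style losses; the architectures and the specific losses are illustrative assumptions, not the post's prescription:

```python
# Sketch of a GAN read through the EBM lens: the critic assigns an energy
# to y, and the generator turns a latent z into contrastive samples that
# the critic learns to give high energy. Architectures/losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

generator = nn.Sequential(nn.Linear(16, 128), nn.ReLU(), nn.Linear(128, 784))
critic    = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 1))  # scalar energy

real = torch.rand(32, 784)
z = torch.randn(32, 16)                       # latent variable
fake = generator(z)                           # contrastive samples

# Critic: low energy for real data, high energy (up to a margin) for fakes.
critic_loss = critic(real).mean() + F.relu(1.0 - critic(fake.detach())).mean()

# Generator: produce samples the critic assigns low energy to.
generator_loss = critic(fake).mean()
# In practice the two losses are minimized in alternation.
```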

But contrastive methods have a major issue: They are very inefficient to train. In high-dimensional spaces such as images, there are many ways one image can be different from another. Finding a set of contrastive images that cover all the ways they can differ from a given image is a nearly impossible task. To paraphrase Leo Tolstoy’s Anna Karenina: “Happy families are all alike; every unhappy family is unhappy in its own way.” This applies to any family of high-dimensional objects, it seems.

What if it were possible to make sure the energy of incompatible pairs is higher than that of compatible pairs without explicitly pushing up on the energy of many incompatible pairs?

Non-contrastive energy-based SSL

Non-contrastive methods applied to joint embedding architectures are possibly the hottest topic in SSL for vision at the moment. The domain is still largely unexplored, but it seems very promising.

Non-contrastive methods for joint embedding include DeeperCluster, ClusterFit, MoCo, SwAV, SimSiam, Barlow Twins, BYOL from DeepMind, and a few others. They use various tricks, such as computing virtual target embeddings for groups of similar images (DeeperCluster, SwAV, SimSiam) or making the two joint embedding architectures slightly different through the architecture or the parameter vector (BYOL, MoCo). Barlow Twins tries to minimize the redundancy between the individual components of the embedding vectors.
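
As a concrete example of the redundancy-reduction idea, here is a minimal sketch of a Barlow Twins-style loss (the batch size, embedding dimension, and the lambd weight are illustrative assumptions):

```python
# Sketch of the Barlow Twins objective: push the cross-correlation matrix
# between the two views' embedding dimensions toward the identity, which
# removes redundancy between components. Sizes and lambd are assumptions.
import torch

def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor, lambd: float = 5e-3):
    n, d = z_a.shape
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)          # standardize each dimension over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = (z_a.T @ z_b) / n                           # d x d cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()  # diagonal -> 1: the two views agree
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # off-diagonal -> 0: no redundancy
    return on_diag + lambd * off_diag

z_a, z_b = torch.randn(256, 128), torch.randn(256, 128)   # embeddings of two distorted views
print(barlow_twins_loss(z_a, z_b))
```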

Perhaps a better alternative in the long run will be to devise non-contrastive methods with latent-variable predictive models. The main obstacle is that they require a way to minimize the capacity of the latent variable. The volume of the set over which the latent variable can vary limits the volume of outputs that take low energy. By minimizing this volume, one automatically shapes the energy in the right way.

A successful example of such a method is the Variational Auto-Encoder (VAE), in which the latent variable is made “fuzzy”, which limits its capacity.
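
A minimal sketch of that “fuzzy” latent variable: the encoder predicts a mean and variance, z is sampled from them, and a KL penalty limits how much information z can carry (the tiny architecture is an illustrative assumption):

```python
# Sketch of a VAE's "fuzzy" latent variable: the encoder predicts a mean and
# variance, a noisy z is sampled from them, and a KL penalty limits how much
# information z can carry. The tiny architecture is an illustrative assumption.
import torch
import torch.nn as nn

enc = nn.Linear(784, 2 * 16)           # outputs mean and log-variance of z (dim 16)
dec = nn.Linear(16, 784)

x = torch.rand(32, 784)
mu, logvar = enc(x).chunk(2, dim=-1)
z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization: "fuzzy" latent
recon = dec(z)

recon_loss = ((recon - x) ** 2).sum(dim=-1).mean()
kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=-1).mean()  # capacity penalty
loss = recon_loss + kl
loss.backward()
```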

But VAEs have not yet been shown to produce good representations for downstream visual tasks.

They are now a big success as a component of Stable Diffusion.

Another successful example is sparse modeling, but its use has been limited to simple architectures. No perfect recipe seems to exist to limit the capacity of latent variables.
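
For intuition, here is a sparse-autoencoder-style sketch of how sparsity can limit latent capacity: an L1 penalty keeps most components of the code at zero (the architecture and penalty weight are illustrative assumptions):

```python
# Sketch of sparsity as a way to limit latent capacity: an L1 penalty
# keeps most components of the latent code at zero. Sizes and the penalty
# weight are illustrative assumptions.
import torch
import torch.nn as nn

encoder = nn.Linear(784, 256)
decoder = nn.Linear(256, 784)

x = torch.rand(32, 784)
code = torch.relu(encoder(x))                        # latent code
recon_loss = ((decoder(code) - x) ** 2).mean()
sparsity = code.abs().mean()                         # L1 penalty: few active components
loss = recon_loss + 0.1 * sparsity
loss.backward()
```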

Most recently, we’ve created and open sourced a new billion-parameter self-supervised CV model called SEER that’s proven to work efficiently with complex, high-dimensional image data. It is based on the SwAV method applied to a convolutional network architecture (ConvNet) and can be trained from a vast number of random images without any metadata or annotations. The ConvNet is large enough to capture and learn every visual concept from this large and complex data. After pretraining on a billion random, unlabeled and uncurated public Instagram images, and supervised fine-tuning on ImageNet, SEER outperformed the most advanced, state-of-the-art self-supervised systems, reaching 84.2 percent top-1 accuracy on ImageNet.

No, it does not seem to have had a big impact on industry yet.

Originally published at https://ai.facebook.com.


Paul Xiong

Coding, implementing, and optimizing ML annotation with self-supervised learning. TL;DR: doctors’ labeling is the first priority for our Cervical AI project.