The Cambridge Dictionary defines “bootstrap” as “to improve your situation or become more successful, without help from others or without advantages that others have.” While a machine learning algorithm’s strength depends heavily on the quality of the data it is fed, an algorithm that can do the work required to improve itself should become stronger still. A team of researchers from DeepMind and Imperial College recently set out to demonstrate this in the field of computer vision.
In the updated paper Bootstrap Your Own Latent – A New Approach to Self-Supervised Learning, the researchers release the source code and checkpoint for their new “BYOL” approach to self-supervised image representation learning along with new theoretical and experimental insights.
In computer vision, learning good image representations is critical, as it allows for efficient training on downstream tasks. Representation learning trains a neural network to produce features that transfer: researchers first train the network on very large datasets, then adapt it to tasks where data is scarcer. This supervised representation learning approach differs from self-supervised learning approaches, where model training does not require manual data labelling.
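The transfer recipe above is often evaluated with a "linear probe": freeze the pretrained encoder and train only a lightweight head on the scarce downstream data. The snippet below is a minimal NumPy illustration of that idea, with a frozen random projection standing in for a real pretrained network; all names, shapes, and the synthetic task are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained encoder: a frozen random ReLU projection.
# (Purely illustrative -- in practice this would be a network such as a
# ResNet trained on a large dataset.)
d_in, d_feat, n = 16, 8, 64
W_enc = rng.normal(size=(d_in, d_feat)) / np.sqrt(d_in)

def encoder(x):
    return np.maximum(x @ W_enc, 0.0)  # frozen features, never updated

# Small synthetic downstream task with few labelled examples.
X = rng.normal(size=(n, d_in))
y = (X[:, 0] > 0).astype(float)

feats = encoder(X)  # encoder weights stay fixed

def loss_fn(w, b):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Train only a linear head (logistic regression) on the frozen features.
w, b, lr = np.zeros(d_feat), 0.0, 0.5
loss_start = loss_fn(w, b)
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    w -= lr * feats.T @ (p - y) / n
    b -= lr * np.mean(p - y)
loss_end = loss_fn(w, b)
```

The point of the sketch is the division of labour: all gradient updates touch only `w` and `b`, while the (here random) encoder is treated as fixed, which is what makes training cheap when downstream data is scarce.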
Many successful self-supervised approaches to representation learning adopt a cross-view prediction framework, in which a network learns to predict one view of the data from another. Such view pairs may, for example, correspond to spatially adjacent patches of the same image, and this framework can allow a neural network with no prior knowledge of the third dimension to discover depth. Following this strategy, current SOTA contrastive methods are trained by reducing the distance between representations of different augmented views of the same image (positive pairs) and increasing the distance between representations of augmented views from different images (negative pairs).
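A minimal sketch of such a contrastive objective, written as a simplified InfoNCE-style loss in NumPy (SimCLR's actual NT-Xent loss additionally draws negatives from within each view; the batch size and dimensions below are arbitrary):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Simplified InfoNCE: for row i of z1, the positive is row i of z2
    (two augmented views of the same image); every other row of z2 acts
    as a negative."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # pairwise cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # positives sit on the diagonal

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
low = info_nce(z, z + 0.01 * rng.normal(size=z.shape))  # near-identical views
high = info_nce(z, rng.normal(size=z.shape))            # unrelated "views"
```

Minimizing this loss pulls each positive pair together while pushing a representation away from every other image in the batch, which is exactly why such methods benefit from many negatives per example.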
But some challenges remain, as “contrastive methods often require comparing each example with many other examples to work well, prompting the question of whether using negative pairs is necessary,” the researchers explain. Performance also hinges on the details of negative-pair retrieval, such as large batch sizes, memory banks, or customized mining strategies.
Shunning negative pairs in self-supervised representation learning, BYOL relies instead on two neural networks, an online network and a target network, that interact and learn from each other. Given an augmented view of an image, the online network is trained to predict the target network's representation of the same image under a different augmented view. The target network's weights are a slow-moving (exponential moving) average of the online network's weights, which helps stabilize the bootstrap step.
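In rough outline, one BYOL-style step can be sketched as below, with simple linear maps standing in for the real encoder, projector, and predictor networks. This is a sketch under assumed shapes and names, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative linear stand-ins for BYOL's networks.
d_in, d_out = 16, 8
W_online = rng.normal(size=(d_in, d_out)) * 0.1  # online encoder + projector
W_pred = np.eye(d_out)                           # online predictor
W_target = W_online.copy()                       # target network

def normalize(z):
    return z / np.linalg.norm(z, axis=1, keepdims=True)

def byol_loss(view1, view2):
    # The online network processes view 1 and tries to predict the target
    # network's projection of view 2 of the same image.
    pred = normalize(view1 @ W_online @ W_pred)
    target = normalize(view2 @ W_target)  # treated as a fixed target
    return np.mean(np.sum((pred - target) ** 2, axis=1))  # = 2 - 2*cosine

x = rng.normal(size=(4, d_in))
view1, view2 = x, x + 0.01 * rng.normal(size=x.shape)  # two "augmentations"
loss = byol_loss(view1, view2)

# After a gradient step has moved the online weights, the target follows
# as a slow exponential moving average -- the "slow-moving average" above.
W_online_new = W_online + 0.01  # stand-in for an optimizer update
tau = 0.99
dist_before = np.linalg.norm(W_online_new - W_target)
W_target = tau * W_target + (1 - tau) * W_online_new
dist_after = np.linalg.norm(W_online_new - W_target)
```

Note that no negatives appear anywhere: the loss only pulls the online prediction toward the target's output, and the slowly updated target is what keeps the pair from collapsing to a trivial constant representation.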
In evaluations of the representations learned by BYOL on ImageNet and other vision benchmarks using ResNet architectures, BYOL reached 74.3 percent top-1 accuracy with a standard ResNet-50 and 79.6 percent top-1 accuracy with a larger ResNet-200 architecture. In the semi-supervised and transfer settings on ImageNet, BYOL performed on par with or better than the SOTA. Moreover, compared with the strong contrastive baseline SimCLR, BYOL suffered much smaller performance drops when the batch size was reduced.
Because BYOL relies on augmentations specific to vision, generalizing it to modalities such as audio, video and text will require finding similarly suitable augmentations for those domains.
Reporter: Fangyu Cai | Editor: Michael Sarazen