SELF SUPERVISED LEARNING ON LARGE SCALE DATASETS
Abstract
Humans and animals possess the remarkable ability to comprehend and perceive the world around them with minimal, if any, reliance on explicit labels. Much of the knowledge acquired by humans is obtained without direct supervision, simply by processing extensive amounts of unlabeled data. This observation strongly suggests that enabling machines to grasp the world without labels could represent a fundamental approach to artificial intelligence. However, the vast majority of advancements achieved by state-of-the-art deep neural networks have been fueled by their dependence on annotated datasets, and annotating datasets is both costly and impractical in numerous domains. This manuscript discusses several ways machines can be taught without any labels through Self-Supervised Learning (SSL). We show that, in general, training machines without labels can yield less biased and more robust representations.
This manuscript deals with four main types of issues in SSL.

The first problem we tackle is the over-reliance of neural networks on low-level shortcuts such as texture. Consider the example of a sofa with the texture of a leopard: state-of-the-art neural networks will often predict this sofa to be a leopard rather than a sofa. Unlike humans, neural networks do not understand the shape of objects and often rely on low-level cues. To reduce this reliance on texture, we propose two methods. First, we suppress texture in images, which encourages the networks to focus less on texture and more on higher-level information such as shape. Second, we augment SSL methods with negative samples that contain only the texture of the images. By augmenting with texture-only images, our method achieves better generalization, especially in out-of-domain settings.

The second problem we address is the poor performance of SSL methods on multi-object datasets such as OpenImages. One fundamental reason is the cropping data augmentation that selects sub-regions of an image to serve as positive samples. In object-centric datasets these positive samples are generally meaningful, since the two views often share semantic content. This does not hold for multi-object datasets: with multiple objects in an image, the two views may have no semantic overlap. To remedy this, we propose replacing one or both of the random crops with crops obtained from an object proposal algorithm. This encourages the network to learn more object-aware representations, yielding significant improvements over random-crop baselines.

Thirdly, current SSL networks generally treat objects and scenes within the same framework. However, while visually similar objects lie close together in the representation space, we argue that scenes and objects should instead follow a hierarchical structure based on their compositionality.
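To make the multi-object cropping idea concrete, the proposal-based positive sampling can be sketched as follows. This is an illustrative sketch, not the manuscript's actual implementation: the proposal boxes are assumed to be precomputed (e.g., by an algorithm such as selective search), and all function names are hypothetical.

```python
import numpy as np

def random_crop_box(h, w, crop, rng):
    """Baseline SSL view: a uniformly random square crop location."""
    top = int(rng.integers(0, h - crop + 1))
    left = int(rng.integers(0, w - crop + 1))
    return top, left, crop, crop

def proposal_crop_box(proposals, rng):
    """Object-aware view: sample one precomputed object-proposal box
    (top, left, height, width), e.g. produced by selective search."""
    return proposals[int(rng.integers(0, len(proposals)))]

def make_views(image, proposals, crop=64, seed=0):
    """Return a positive pair: one random crop and one proposal crop,
    so at least one view is anchored on an object region rather than
    an arbitrary sub-region of a multi-object image."""
    rng = np.random.default_rng(seed)
    h, w = image.shape[:2]
    t1, l1, h1, w1 = random_crop_box(h, w, crop, rng)
    t2, l2, h2, w2 = proposal_crop_box(proposals, rng)
    view1 = image[t1:t1 + h1, l1:l1 + w1]
    view2 = image[t2:t2 + h2, l2:l2 + w2]
    return view1, view2
```

The key design choice is that the second view is tied to an object proposal, so the positive pair is more likely to share semantic content even when the image contains several unrelated objects.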
To solve this, we propose a contrastive learning framework in which a Euclidean loss is used to learn object representations and a hyperbolic loss encourages representations of scenes to lie close to the representations of their constituent objects in a hyperbolic space. Our hyperbolic loss induces a scene-object hypernymy by optimizing the magnitudes of their norms.

Lastly, we address the challenge of training SSL methods on vast real-world datasets like JFT. Current state-of-the-art SSL methods struggle to perform effectively on JFT due to its skewed data distribution. To address this issue, we present a novel approach that combines Masked Autoencoders and contrastive learning. We introduce CAN, a concise and conceptually clear fusion of three components: (C) contrastive learning, (A) masked autoencoders, and (N) the noise prediction approach commonly used in diffusion models. These learning mechanisms complement each other in the following ways: contrastive learning shapes the embedding space when processing a batch of images; masked autoencoders focus on reconstructing low-frequency spatial correlations within a single image; and noise prediction is employed to reconstruct high-frequency image components. When combined, our approach surpasses the performance of its individual constituents, MAE and SimCLR, across a wide range of downstream transfer learning and robustness tasks.
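The hyperbolic loss mentioned above builds on the geometry of the Poincaré ball, where distance from the origin can encode generality: a common convention places more general concepts (here, scenes composed of many objects) nearer the origin and specific concepts nearer the boundary. A minimal sketch of the underlying distance follows; the example points are illustrative, and the manuscript's exact loss may differ.

```python
import numpy as np

def poincare_distance(u, v, eps=1e-7):
    """Geodesic distance between two points inside the unit Poincaré ball:
    d(u, v) = arcosh(1 + 2*||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    uu = np.clip(np.sum(u * u), 0.0, 1.0 - eps)
    vv = np.clip(np.sum(v * v), 0.0, 1.0 - eps)
    duv = np.sum((u - v) ** 2)
    return np.arccosh(1.0 + 2.0 * duv / ((1.0 - uu) * (1.0 - vv)))

# Hypernymy via norms: the (more general) scene embedding sits nearer
# the origin than its (more specific) constituent object embedding.
scene = np.array([0.10, 0.05])   # ||scene|| ≈ 0.11, small norm
obj = np.array([0.60, 0.30])     # ||obj||   ≈ 0.67, larger norm

# A hyperbolic objective would minimize poincare_distance(scene, obj)
# over scene-object pairs while regularizing the norms as above.
d = poincare_distance(scene, obj)
```

Because distances blow up near the boundary of the ball, small differences in norm translate into large geodesic separations, which is what makes the norm an effective carrier of the scene-object hierarchy.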