
Multimodal Learning: Teaching Machines to Understand Different Representations of Data

Daniel Ik
4 min read · Mar 17, 2024


Introduction

In the context of machine learning, multimodal learning is a subfield of deep learning that aims to develop models capable of discovering useful representations across different modalities such as text, vision (images, videos), and audio.

One thing that stands out about representation learning is how close it is to the way humans learn and function. We don't represent data from our senses in isolation (although some processing does happen separately): a "3" in visual form is the same concept as a "three" in textual or auditory form.

Yann LeCun has famously highlighted how much more information there is in high-bandwidth data like video; the example he gives is that a two-year-old has already seen more data than all the text on the internet.

Multimodal Architectures

Multimodal models can be roughly classified by:

  • Whether there’s a shared latent space (Common Space projection)
  • Modality Specific vs Modality agnostic encoders

CLIP: Contrastive Language-Image Pretraining

The idea behind CLIP is surprisingly simple: given an image and a text description, if the text correctly describes the image, then their embedding vectors should be close together.

CLIP is an example of a multimodal model that uses modality-specific encoders.

https://openai.com/research/clip

A batch of images and their text descriptions are passed through an image encoder and a text encoder respectively. The batch of image embeddings is then matrix-multiplied by the batch of text embeddings, and each entry is divided by the product of the corresponding L2 norms, giving a matrix of cosine similarities.

If we train the model so that the diagonal of this similarity matrix holds the largest values, it learns to assign higher cosine similarity to plausible text-image pairs. We are effectively learning text-aligned image embeddings and vice versa!

Once we complete this so-called "contrastive pretraining" stage, we can do cool stuff like zero-shot classification of an image's description.

We could take the prompt "A photo of a ", complete it with each word from a list of candidates, and ask CLIP which completed sentence best aligns with the image.
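As a concrete illustration, here is a minimal zero-shot classification sketch using the Hugging Face transformers wrapper around the public CLIP checkpoint (the image path and candidate word list are placeholders made up for the example):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cat.jpg")                      # placeholder image path
candidate_words = ["cat", "dog", "car"]            # placeholder label set
prompts = [f"A photo of a {word}" for word in candidate_words]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # how well each prompt aligns with the image
print(dict(zip(candidate_words, probs[0].tolist())))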

https://openai.com/research/clip

PyTorch CLIP implementation

import torch
from torch import nn
import torch.nn.functional as F

class CLIP(nn.Module):
    def __init__(self, img_embed_size, text_embed_size, embed_size):
        super().__init__()
        # Project each modality's features into a shared embedding space.
        self.img_proj = nn.Linear(img_embed_size, embed_size)
        self.text_proj = nn.Linear(text_embed_size, embed_size)
        self.loss = nn.CrossEntropyLoss()

    def forward(self, image_embeddings, text_embeddings):
        # L2-normalise so the dot products below are cosine similarities.
        image_embeddings = F.normalize(self.img_proj(image_embeddings), dim=-1)
        text_embeddings = F.normalize(self.text_proj(text_embeddings), dim=-1)
        similarity_matrix = image_embeddings @ text_embeddings.transpose(-2, -1)  # (N, E) @ (E, N) -> (N, N)
        # The i-th image matches the i-th text, so the targets are the diagonal indices.
        labels = torch.arange(image_embeddings.shape[0], device=image_embeddings.device)
        # Symmetric contrastive loss: classify texts given an image and images given a text.
        loss = (self.loss(similarity_matrix, labels) + self.loss(similarity_matrix.t(), labels)) / 2
        return similarity_matrix, loss
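A quick sanity check with random pre-extracted features (the batch and feature sizes here are purely illustrative):

clip = CLIP(img_embed_size=2048, text_embed_size=768, embed_size=512)
image_feats = torch.randn(8, 2048)  # e.g. pooled image-encoder features
text_feats = torch.randn(8, 768)    # e.g. pooled text-encoder features
sims, loss = clip(image_feats, text_feats)
print(sims.shape, loss.item())      # torch.Size([8, 8]) and a scalar loss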

GATO

GATO is a really good example of multimodal learning at scale. DeepMind trained a large Transformer model on data with modalities ranging from text and images to continuous sensor values (preprocessed by discretising them into tokens).

Not only is the model trained on multimodal data, it is also trained in a multi-task fashion: it learns to hold conversations, to play multiple Atari games, and so on.

https://deepmind.google/discover/blog/a-generalist-agent/

Although the inputs are passed through modality-specific encoders (images, for example, are processed by a ResNet before being fed in), the whole batch reaches GATO as a single sequence of embeddings.
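To make the discretisation idea concrete, here is a small sketch in the spirit of GATO's tokenisation of continuous values (the mu-law constants and embedding sizes are taken loosely from the paper's description and should be treated as illustrative, not as the exact implementation):

import math
import torch
from torch import nn

def discretise(values, num_bins=1024, mu=100.0, M=256.0):
    # Mu-law companding squashes large sensor readings into [-1, 1].
    companded = torch.sign(values) * torch.log(torch.abs(values) * mu + 1.0) / math.log(M * mu + 1.0)
    companded = companded.clamp(-1.0, 1.0)
    # Map [-1, 1] onto integer bin indices so continuous values become ordinary tokens.
    return ((companded + 1.0) / 2.0 * (num_bins - 1)).long()

# A shared embedding table turns the discretised readings into vectors that can be
# concatenated with text and image embeddings in a single input sequence.
token_embedding = nn.Embedding(num_embeddings=1024, embedding_dim=512)
sensor_values = torch.tensor([0.03, -1.7, 12.5])
sensor_embeddings = token_embedding(discretise(sensor_values))  # shape: (3, 512)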

https://arxiv.org/pdf/2205.06175.pdf

An interesting observation is how generalist training affected GATO's performance on control tasks such as CartPole: the model trained in a multimodal, multi-task manner outperforms every other method on almost all tasks.

LIMoE: Language-Image Mixture of Experts

Mixture-of-experts models were gaining popularity before LIMoE, but for single-modality use cases (mostly to do with images). LIMoE was the first multimodal model to utilise sparse MoEs.

LIMoE is an example of a modality-agnostic model, as it doesn't have separate encoders for different modalities.

The idea is that, thanks to the sparse experts, the model learns to understand each modality by having different experts specialise in certain aspects of the inputs.
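To give a feel for what "sparse experts" means here, below is a toy top-1 routed mixture-of-experts layer in PyTorch. This is a simplified sketch, not LIMoE's actual implementation; the expert count and layer sizes are made up:

import torch
from torch import nn

class SparseMoE(nn.Module):
    def __init__(self, embed_size, num_experts=4, hidden_size=2048):
        super().__init__()
        # The router scores each token against every expert.
        self.router = nn.Linear(embed_size, num_experts)
        # Each expert is an independent feed-forward block.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(embed_size, hidden_size), nn.GELU(), nn.Linear(hidden_size, embed_size))
            for _ in range(num_experts)
        ])

    def forward(self, tokens):  # tokens: (num_tokens, embed_size)
        gate_probs, expert_idx = self.router(tokens).softmax(dim=-1).max(dim=-1)
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Each token is processed only by its chosen expert, weighted by the gate probability.
                out[mask] = gate_probs[mask].unsqueeze(-1) * expert(tokens[mask])
        return out

Because each token only activates one expert, text tokens and image tokens are free to route to different experts, which is exactly the kind of specialisation the LIMoE authors observed.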

https://blog.research.google/2022/06/limoe-learning-multiple-modalities-with.html

An interesting observation on the behaviour of LIMoE is that after training, a sort of modality specialisation of the experts emerged.

https://blog.research.google/2022/06/limoe-learning-multiple-modalities-with.html

We see that some experts tend to process text tokens and others image tokens.

Upon further inspection, the specialisation goes even deeper: some experts specialise in abstract attributes of an image, like the presence of eyes.

https://blog.research.google/2022/06/limoe-learning-multiple-modalities-with.html

I really appreciate you reading this far! If you have any questions, comments, or feedback in general, drop 'em in the comments and I'd be glad to check them out :)
