
Deep Metric Learning- Supervised Approaches

Jyotsana
9 min read · Apr 18, 2023


Part-1:- Deep Metric Learning- Fundamentals
Part-2:- Deep Metric Learning- Contrastive Approaches
Part-3:- Deep Metric Learning- Supervised Approaches

This is Part 3 of the Deep Metric Learning series. In this part, our emphasis will be on the representation-learning techniques employed in supervised deep metric learning.

Table Of Contents

  1. Introduction
  2. Limitations of Triplet & Contrastive Loss
  3. Exploring Alternatives to Contrastive Approaches
    a. Center Loss
    b. Limitations of Center Loss
    c. SphereFace Loss
  4. CosFace Loss
  5. ArcFace Loss
  6. CurricularFace Loss
  7. ElasticFace Loss
  8. References

Introduction

From our previous blog posts, we know that the goal of metric learning is to create a distance metric that makes sure that similar data points are grouped closely together while dissimilar data points are far apart. This is done by learning features that maximize the distance between different groups of data (inter-class distance) and minimize the distance between data points within the same group (intra-class distance). By doing this, we can make sure that the features we learn are able to distinguish between different groups of data while also being able to identify similarities within those groups.

Limitations of Triplet & Contrastive Loss

Triplet Loss and Contrastive Loss are two commonly used loss functions in deep metric learning. Despite their popularity, they do have some limitations and problems that have been identified by researchers:

  1. Sampling Issues: Both Triplet Loss and Contrastive Loss require carefully selected samples for training to be effective. In Triplet Loss, we need to choose triplets of samples that satisfy certain criteria, such as having one anchor, one positive, and one negative sample. In Contrastive Loss, we need to choose pairs of samples that are similar or dissimilar. This process can be time-consuming and computationally expensive, and it may also lead to overfitting if the samples are not chosen carefully.
  2. Expansion Issues: It is difficult to ensure that samples with the same label are pulled together into a common region of the embedding space, and this problem becomes more pronounced as the number of classes and samples grows. Additionally, during mini-batch training these loss functions can only enforce the structure locally, for the samples in the batch, not globally. This means they may not capture the entire structure of the data, leading to suboptimal embeddings.

Exploring Alternatives to Contrastive Approaches

Before delving into the specifics of Center Loss (Wen et al., 2016), it is worth revisiting the Softmax Loss, since Center Loss, one of the first successful attempts to address the concerns above, is built directly on top of it.

Center Loss
In a classification problem with multiple classes, we typically add a linear layer on top of the neural network to classify the input. This layer is represented by a weight matrix W and a bias vector b, and it is trained with the Softmax Loss over a batch of N samples. The softmax loss is:
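(Written here following the notation of the Center Loss paper, where x_i is the deep feature of the i-th sample, y_i its label, and n the number of classes.)

L_S = -\sum_{i=1}^{N} \log \frac{e^{W_{y_i}^{T} x_i + b_{y_i}}}{\sum_{j=1}^{n} e^{W_{j}^{T} x_i + b_{j}}}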

(Source: A Discriminative Feature Learning Approach for Deep Face Recognition)

From above we can observe that: (i) under the supervision of softmax loss, the deeply learned features are separable, and (ii) the deep features are not discriminative enough, since they still show significant intra-class variations. Consequently, it is not suitable to directly use these features for metric learning but it has good inter-class variation property. So, the idea of Center Loss is to add a new regularization term to the Softmax Loss to pull the features to corresponding class centers.
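The joint objective adds a term penalizing the squared distance between each feature and its class center c_{y_i} (notation follows the Center Loss paper):

L = L_S + \lambda L_C, \qquad L_C = \frac{1}{2} \sum_{i=1}^{N} \lVert x_i - c_{y_i} \rVert_2^{2}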

Center Loss, λ is used for balancing the two loss functions
Distribution of deeply learned features under the joint supervision of softmax loss and center loss under different λ values

The Center Loss solves the Expansion Issue by providing the class centers, thus forcing the samples to cluster together to the corresponding class center. It also solves the Sampling issue because we don’t need to perform hard sample mining anymore.
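To make the joint supervision concrete, here is a minimal PyTorch-style sketch. The class and parameter names are illustrative, and the centers are treated as ordinary learnable parameters, whereas the original paper updates them with a dedicated moving-average rule:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CenterLoss(nn.Module):
    """Sketch of joint softmax + center loss supervision (illustrative only)."""

    def __init__(self, num_classes, feat_dim, lam=0.003):
        super().__init__()
        # One learnable center per class in the embedding space.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lam = lam  # the lambda that balances the two terms

    def forward(self, features, logits, labels):
        # Standard softmax cross-entropy on the classifier logits.
        ce = F.cross_entropy(logits, labels)
        # Pull each feature towards its own class center (squared L2 distance).
        center_term = ((features - self.centers[labels]) ** 2).sum(dim=1).mean()
        return ce + self.lam * 0.5 * center_term
```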

Limitations of Center Loss
There is still no guarantee of large inter-class variability, since clusters that lie closer to the origin benefit less from the regularization term. To make it "fair" for every class, why not enforce all class centers to lie at the same distance from the origin? In other words, let's map the features onto a hypersphere.

SphereFace
In Center Loss, the idea was for each class's features to cluster around their class center. SphereFace instead modifies the softmax loss itself (the resulting loss is shown below):
1. Rewrite the softmax loss in polar coordinates, so that the weight vectors Wyi can be thought of as class centers in angular space.
2. Normalize the weights so that ∥Wj∥ = 1 for every class j.
3. Fix the bias vector b = 0 to make the subsequent analysis easier.
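With these modifications, the Modified Softmax loss (in the SphereFace paper's notation, where θ_{j,i} is the angle between feature x_i and weight vector W_j) becomes:

L_{\text{modified}} = \frac{1}{N} \sum_{i} -\log \frac{e^{\lVert x_i \rVert \cos(\theta_{y_i, i})}}{\sum_{j} e^{\lVert x_i \rVert \cos(\theta_{j, i})}}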

For Modified Softmax, the decision boundary between classes i and j is simply the angular bisector between the two class-center vectors Wi and Wj. Having such a thin decision boundary does not make our features discriminative enough; the inter-class variation is too small. Hence the second part of SphereFace: introducing margins. We want to amplify the angles between two classes.

The idea is that, instead of requiring cos(θi) > cos(θj) for all j = 1,…,m (j ≠ i) to classify a sample as belonging to the i-th class as in Modified Softmax, we additionally enforce a margin μ, so that a sample is classified as belonging to the i-th class only if cos(μθi) > cos(θj) for all j = 1,…,m (j ≠ i), with the requirement that θi ∈ [0, π/μ].

However, cos(μθ) is only monotonically decreasing on [0, π/μ], which causes convergence problems. We therefore replace it with a monotonically decreasing angle function ψ(θ), defined as ψ(θ) = (−1)^k cos(μθ) − 2k for θ ∈ [kπ/μ, (k+1)π/μ] and k ∈ [0, μ−1]. Thus the final form of SphereFace is:
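Following the SphereFace (A-Softmax) paper, the loss reads:

L_{\text{ang}} = \frac{1}{N} \sum_{i} -\log \frac{e^{\lVert x_i \rVert \, \psi(\theta_{y_i, i})}}{e^{\lVert x_i \rVert \, \psi(\theta_{y_i, i})} + \sum_{j \neq y_i} e^{\lVert x_i \rVert \cos(\theta_{j, i})}}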

Difference between Softmax, Modified Softmax, and SphereFace. One can see that features learned by the original softmax loss can not be classified simply via angles, while modified softmax loss can. The SphereFace loss further increases the angular margin of learned features. (Image Source: SphereFace: Deep Hypersphere Embedding for Face Recognition)

After SphereFace was introduced, many new methods were developed that use angular distance with an angular margin. It is important to note that these methods only apply to supervised deep metric learning, where class labels are available. In other situations, such as when there is no labeled data or when many unseen classes appear at test time, contrastive learning methods are still considered a good option.

CosFace Loss

SphereFace's decision boundary is defined through the cosine of the angle, and cos(μθ) is not monotonically decreasing over the whole angle range, which makes optimization difficult and requires the ad-hoc piecewise function ψ(θ) described above. Furthermore, SphereFace's margin is multiplicative in the angle, so the effective margin differs from class to class; some pairs of classes end up with a larger margin than others, which can reduce the model's ability to distinguish between them.

CosFace proposes a simpler yet more effective way to define the margin. Like SphereFace, it normalizes the rows of the weight matrix W, i.e. ∥Wj∥ = 1, and zeroes the biases, b = 0. Additionally, the features z extracted by the neural network are normalized as well, so ∥z∥ = 1, and then re-scaled by the factor s. The CosFace objective is then defined as:
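Following the CosFace paper (where θ_{j,i} is the angle between the normalized feature of sample i and the j-th class weight):

L_{\text{lmc}} = \frac{1}{N} \sum_{i} -\log \frac{e^{s(\cos(\theta_{y_i, i}) - m)}}{e^{s(\cos(\theta_{y_i, i}) - m)} + \sum_{j \neq y_i} e^{s \cos(\theta_{j, i})}}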

where s is referred to as the scaling parameter and m as the margin parameter. The margin is defined in cosine space rather than in angle space (as in A-Softmax). Therefore, for class C1, cos(θ1) is maximized while cos(θ2) is minimized (and similarly for C2) to perform large-margin classification.

Decision boundaries of different loss functions in cosine space; LMCL is the CosFace Loss. (Image Source: CosFace: Large Margin Cosine Loss for Deep Face Recognition)

ArcFace Loss

ArcFace is similar to CosFace in that:
1. the biases are zeroed, b = 0;
2. the weights are normalized, ∥Wj∥ = 1;
3. the feature vectors are also normalized, ∥z∥ = 1;
4. but, unlike the CosFace loss, it defines the margin in angle space instead of cosine space (the resulting objective is shown below).
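Following the ArcFace paper, the additive angular margin loss is:

L = -\frac{1}{N} \sum_{i} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos \theta_j}}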

where s is the scaling parameter and m is the margin parameter.

Decision boundaries of different loss functions in angle space. (Image Source: ArcFace: Additive Angular Margin Loss for Deep Face Recognition)
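To make the difference between the two margins concrete, here is a minimal PyTorch-style sketch of the margin-adjusted logits. The function name, defaults, and structure are illustrative rather than a reference implementation:

```python
import torch
import torch.nn.functional as F

def margin_logits(features, weight, labels, s=64.0, m=0.5, mode="arcface"):
    """Illustrative sketch of ArcFace / CosFace style margin logits."""
    # Normalize features and class weights so the logits become cosines.
    cos = F.normalize(features) @ F.normalize(weight).t()     # (N, num_classes)
    target_cos = cos.gather(1, labels.unsqueeze(1)).squeeze(1)
    if mode == "arcface":
        # ArcFace: additive margin in angle space, cos(theta + m).
        theta = torch.acos(target_cos.clamp(-1 + 1e-7, 1 - 1e-7))
        target = torch.cos(theta + m)
    else:
        # CosFace: additive margin in cosine space, cos(theta) - m.
        target = target_cos - m
    logits = cos.clone()
    logits.scatter_(1, labels.unsqueeze(1), target.unsqueeze(1))
    return s * logits  # feed into F.cross_entropy(logits, labels)
```

The returned logits are then passed to a standard cross-entropy loss, exactly as in the formulas above.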

Several further improvements have been proposed to make ArcFace better:

  1. How to choose hyperparameters s and m?
    AdaCos: Adaptively Scaling Cosine Logits for Effectively Learning Deep Face Representations
  2. Multiple Centers for each class
    It is not advisable to compress samples into a single cluster in the embedding space when intra-class sample variance is high, and training can be affected when large and noisy datasets generate a high loss value due to incorrect samples.
    Sub-center ArcFace: Boosting Face Recognition by Large-scale Noisy Web Faces
  3. Dealing with imbalanced datasets
    For models to converge better in the presence of heavy imbalance, smaller classes need bigger margins, as they are harder to learn. Instead of manually setting different margin levels based on class size, dynamic margins were introduced: a family of continuous functions mapping class size to margin level (a rough sketch is given after this list).
    ArcFace with dynamic margins. Google Landmark Recognition 2020 Competition Third Place Solution
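As a rough illustration of the dynamic-margin idea (the exact functional form used in the referenced solution may differ), per-class margins could be derived from class counts like this:

```python
import numpy as np

def dynamic_margins(class_counts, m_min=0.2, m_max=0.5, power=0.25):
    """Hypothetical sketch: rarer classes receive larger margins."""
    counts = np.asarray(class_counts, dtype=np.float64)
    weights = counts ** -power                      # rarer class -> larger weight
    weights = (weights - weights.min()) / (weights.max() - weights.min() + 1e-12)
    return m_min + (m_max - m_min) * weights        # one margin per class
```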

CurricularFace Loss

This loss function incorporates the concept of curriculum learning, which involves presenting training examples in a gradual, increasing order of difficulty. CurricularFace dynamically adjusts the importance of easy and hard samples during different training stages to achieve a more effective and efficient training strategy. The loss function assigns different weights to samples based on their degree of difficulty, with easy samples being assigned more weight in the early training stages and harder samples being assigned more weight in the later stages.

This approach allows the model to learn from simpler examples first, gradually increasing the difficulty of the training examples, and thereby facilitating better learning and generalization.
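Concretely, CurricularFace keeps the ArcFace-style positive logit cos(θ_{y_i} + m) but replaces each negative cosine cos θ_j with a modulated value N(t, cos θ_j). In the paper's formulation:

N(t, \cos\theta_j) =
\begin{cases}
\cos\theta_j, & \text{if } \cos(\theta_{y_i} + m) - \cos\theta_j \ge 0 \quad \text{(easy sample)} \\
\cos\theta_j \, (t + \cos\theta_j), & \text{otherwise (hard sample)}
\end{cases}

where the curriculum parameter t is increased over training (the paper updates it as an exponential moving average of the mean positive cosine in each batch), so hard negatives contribute little early on and are emphasized later.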

ElasticFace Loss

Marginal-penalty softmax losses such as ArcFace and CosFace assume that the geodesic distance between and within the different identities can be equally learned using a fixed penalty margin. However, such a learning objective is not realistic for real data with inconsistent inter- and intra-class variation, which might limit the discriminability and generalizability of the face recognition model. The ElasticFace loss relaxes the fixed penalty margin constraint by proposing an elastic penalty margin that allows flexibility in the push for class separability. The main idea is to use random margin values drawn from a normal distribution in each training iteration, giving the decision boundary chances to expand and retract and thereby allowing space for flexible class-separability learning.

(Image Source: ElasticFace: Elastic Margin Loss for Deep Face Recognition)
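A minimal PyTorch-style sketch of the ElasticFace-Arc idea (the function name and the value of σ are illustrative): each sample in the batch gets its own margin drawn from a normal distribution centered at m, redrawn every iteration:

```python
import torch

def elastic_arc_margin(target_cos, m=0.5, sigma=0.05):
    """Sketch: per-sample random margins m_i ~ N(m, sigma), ElasticFace-Arc style."""
    margins = m + sigma * torch.randn_like(target_cos)   # one margin per sample
    theta = torch.acos(target_cos.clamp(-1 + 1e-7, 1 - 1e-7))
    return torch.cos(theta + margins)  # replaces cos(theta + m) in ArcFace
```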

References

  1. https://hav4ik.github.io/articles/deep-metric-learning-survey#moving-away-from-contrastive-approaches
  2. “CosFace: Large Margin Cosine Loss for Deep Face Recognition.” CVPR 2018.
  3. “A Discriminative Feature Learning Approach for Deep Face Recognition.” ECCV 2016.
  4. “SphereFace: Deep Hypersphere Embedding for Face Recognition.” CVPR 2017.
  5. “ArcFace: Additive Angular Margin Loss for Deep Face Recognition.” CVPR 2019.
  6. “Google Landmark Recognition 2020 Competition Third Place Solution.”
  7. “CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition.” CVPR 2020.
  8. “ElasticFace: Elastic Margin Loss for Deep Face Recognition.” CVPR Workshops 2022.

Your feedback is valuable to me and greatly appreciated. Even a simple clap 👏🏼 would be a wonderful show of support 😇. You can connect with me on LinkedIn.


Jyotsana

Senior Data Scientist | Computer Vision, Recommendation System, NLP problems | Ecommerce