
LipSim: A Provably Robust Perceptual
Similarity Metric

Sara Ghazanfari, Alexandre Araujo
Prashanth Krishnamurthy, Farshad Khorrami, Siddharth Garg
Department of Electronic and Computer Engineering
New York University
sg7457@nyu.edu
Abstract

Recent years have seen growing interest in developing and applying perceptual similarity metrics. Research has shown the superiority of perceptual metrics over pixel-wise metrics in aligning with human perception and serving as a proxy for the human visual system. On the other hand, as perceptual metrics rely on neural networks, there is a growing concern regarding their resilience, given the established vulnerability of neural networks to adversarial attacks. It is indeed logical to infer that perceptual metrics may inherit both the strengths and shortcomings of neural networks. In this work, we demonstrate the vulnerability of state-of-the-art perceptual similarity metrics based on an ensemble of ViT-based feature extractors to adversarial attacks. We then propose a framework to train a robust perceptual similarity metric called LipSim (Lipschitz Similarity Metric) with provable guarantees. By leveraging 1-Lipschitz neural networks as the backbone, LipSim provides guarded areas around each data point and certificates for all perturbations within an $\ell_2$ ball. Finally, a comprehensive set of experiments shows the performance of LipSim in terms of natural and certified scores and on the image retrieval application.

1 Introduction

Comparing data items and having a notion of similarity has long been a fundamental problem in computer science. For many years, $\ell_p$ norms and other mathematically well-defined distance metrics have been used for comparing data items. However, these metrics focus on pixel-wise comparison and fail to measure the semantic similarity between more complex data such as images. To address this problem, perceptual distance metrics (Zhang et al., 2018; Fu et al., 2023) have been proposed that employ deep neural networks as a backbone to first compute embeddings and then apply traditional distance metrics to the embeddings of the data in the new space.

It is well-established that neural networks are susceptible to adversarial attacks (Goodfellow et al., 2014). That is, imperceptible variations of natural examples can be crafted to deliberately mislead models. Although perceptual metrics provide rich semantic interpretations compared to traditional metrics, they inherit the properties of neural networks and therefore their susceptibility to adversarial attacks (Kettunen et al., 2019; Sjögren et al., 2022; Ghildyal & Liu, 2022). Recent works have tried to address this problem by training robust perceptual metrics (Kettunen et al., 2019; Ghazanfari et al., 2023). However, these works rely on heuristic defenses and do not provide provable guarantees. Recent research has focused on designing and training neural networks with prescribed Lipschitz constants (Tsuzuku et al., 2018; Meunier et al., 2022; Wang & Manchester, 2023), aiming to improve and guarantee robustness against adversarial attacks. Promising techniques, like the SDP-based Lipschitz Layer (SLL) (Araujo et al., 2023), have emerged and allow the design of non-trivial yet efficient neural networks with a pre-defined Lipschitz constant. Constraining the Lipschitz constant of neural networks is known to induce properties such as stability in training (Miyato et al., 2018), robustness (Tsuzuku et al., 2018), and generalization (Bartlett et al., 2017).

[Figure 1 image grid: three rows of nine images each. Row 1: original reference images (❶). Row 2: adversarial reference images (❷). Row 3: images at the same distance from the originals (❸).]
$d(❶, ❷)$: 0.64 0.59 0.50 0.76 0.65 0.64 0.62 0.65 0.73
$d(❶, ❸)$: 0.68 0.63 0.54 0.75 0.66 0.64 0.66 0.62 0.75
Figure 1: Demonstrating the effect of an attack on the alignment of DreamSim distance values with the real perceptual distance between images. Instances of original and adversarial reference images from the NIGHT dataset are shown in the first and second rows and the DreamSim distance between them (i.e., $d(❶, ❷)$) is reported below. To get a sense of how big the distance values are, images that have the same distance from the original images are shown in the third row (i.e., $d(❶, ❸) = d(❶, ❷)$). Obviously, the first and third rows include semantically different images, whereas the images on the first and second rows are perceptually identical.

Recently, the DreamSim metric (Fu et al., 2023) has been established as the state-of-the-art perceptual similarity metric. This metric consists of a concatenation of fine-tuned versions of ViT-based embeddings, namely, DINO (Caron et al., 2021), CLIP (Radford et al., 2021), and Open CLIP (Cherti et al., 2023). To compute the distance between two images, DreamSim measures the cosine similarity distance between these ViT-based embeddings.

In this work, we first demonstrate with a series of experiments that the DreamSim metric is not robust to adversarial examples. Consequently, it could be easy for an attacker to bypass important filtering schemes based on perceptual hashing, copy detection, etc. Then, to tackle this problem, we propose LipSim, the first perceptual similarity metric with provable guarantees. Building on the DreamSim metric and recent advances in 1-Lipschitz neural networks, we propose a novel student-teacher approach with a Lipschitz-constrained student model. Specifically, we train a 1-Lipschitz feature extractor (student network) based on the state-of-the-art SLL architecture. The student network is trained to mimic the outputs of the embedding of the DreamSim metric, thus distilling the intricate knowledge captured by DreamSim into the 1-Lipschitz student model. After training the 1-Lipschitz feature extractor on the ImageNet-1k dataset, we fine-tune it on the NIGHT dataset, a two-alternative forced choice (2AFC) dataset that seeks to encode human perceptions of image similarity (more explanation and some instances of the NIGHT dataset are presented in Appendix B). By combining the capabilities of DreamSim with the provable guarantees of a Lipschitz network, our approach paves the way for a certifiably robust perceptual similarity metric. Finally, we demonstrate good natural accuracy and state-of-the-art certified robustness on the NIGHT dataset. Our contributions can be summarized as follows:

  • We investigate the vulnerabilities of state-of-the-art ViT-based perceptual distance metrics, including DINO, CLIP, OpenCLIP, and the DreamSim ensemble. The vulnerabilities are highlighted in two ways: using APGD (Croce & Hein, 2020) on the 2AFC score, which is an index of human alignment, and using a PGD attack against the distance metric itself, measuring the distance between an original image and its perturbed version.

  • We propose a framework to train the first certifiably robust distance metric, LipSim, which leverages a pipeline composed of a 1-Lipschitz feature extractor, a projection onto the unit $\ell_2$ ball, and cosine distance to provide certified bounds for perturbations applied to the reference image.

  • We show through a comprehensive set of experiments that LipSim not only provides certified accuracy for a specified perturbation budget, but also demonstrates good performance in terms of natural 2AFC score and accuracy on image retrieval, which is an important application for perceptual metrics.

2 Related Works

Similarity Metrics.

Low-level metrics, including $\ell_p$ norms and PSNR as point-wise metrics and SSIM (Wang et al., 2004) and FSIM (Zhang et al., 2011) as patch-wise metrics, fail to capture the high-level structure and the semantic concept of more complicated data points like images. To overcome this challenge, perceptual distance metrics were proposed. In the context of perceptual distance metrics, neural networks are used as feature extractors, and the low-level metrics are applied to the embeddings of images in the new space. The feature extractors used in recent work include a convolutional neural network, as proposed by Zhang et al. (2018) for the LPIPS metric, or an ensemble of ViT-based models (Radford et al., 2021; Caron et al., 2021), as proposed by Fu et al. (2023) for DreamSim. As shown by experiments, perceptual similarity metrics have better alignment with human perception and are considered a good proxy for human vision.

Adversarial Attacks & Defenses.

Initially demonstrated by Szegedy et al. (2013), neural networks are vulnerable to adversarial attacks, i.e., carefully crafted small perturbations that can fool the model into predicting wrong answers. Since then a large body of research has been focused on generating stronger attacks (Goodfellow et al., 2014; Kurakin et al., 2018; Carlini & Wagner, 2017; Croce & Hein, 2020; 2021) and providing more robust defenses (Goodfellow et al., 2014; Madry et al., 2017; Pinot et al., 2019; Araujo et al., 2020; 2021; Meunier et al., 2022). To break this pattern, certified adversarial robustness methods were proposed. By providing mathematical guarantees, the model is theoretically robust against the worst-case attack for perturbations smaller than a specific perturbation budget. Certified defense methods fall into two categories. Randomized Smoothing (Cohen et al., 2019; Salman et al., 2019) turns an arbitrary classifier into a smoother classifier; then, based on the Neyman-Pearson lemma, the smooth classifier obtains some theoretical robustness against a specific $\ell_p$ norm. Despite the impressive results achieved by randomized smoothing in terms of natural and certified accuracy (Carlini et al., 2023), the high computational cost of inference and the probabilistic nature of the certificate make it difficult to deploy in real-time applications. Another direction of research has been to leverage the Lipschitz property of neural networks (Hein & Andriushchenko, 2017; Tsuzuku et al., 2018) to better control the stability and robustness of the model. Tsuzuku et al. (2018) highlighted the connection between the certified radius of the network and its Lipschitz constant and margin.
As calculating the Lipschitz constant of a neural network is computationally expensive, a body of work has focused on designing 1-Lipschitz networks by constraining each layer with its spectral norm (Miyato et al., 2018; Farnia et al., 2018), replacing the normalized weight matrix by an orthogonal one (Li et al., 2019; Prach & Lampert, 2022), or designing 1-Lipschitz layers from dynamical systems (Meunier et al., 2022) or control theory arguments (Araujo et al., 2023; Wang & Manchester, 2023).

Vulnerabilities and Robustness of Perceptual Metrics.

Investigating the vulnerabilities of perceptual metrics has been overlooked for years since the first perceptual metric was proposed. As shown by Kettunen et al. (2019), Ghazanfari et al. (2023), Sjögren et al. (2022), and Ghildyal & Liu (2022), perceptual similarity metrics such as LPIPS (Zhang et al., 2018) are vulnerable to adversarial attacks. Sjögren et al. (2022) present a qualitative analysis of the resilience of deep perceptual similarity metrics to image distortions including color inversion, translation, rotation, and color stain. Finally, Luo et al. (2022) propose a new way to generate attacks on similarity metrics by reducing the similarity between the adversarial example and its original while increasing the similarity between the adversarial example and its most dissimilar one in the minibatch. To introduce robust perceptual metrics, Kettunen et al. (2019) propose E-LPIPS, which uses an ensemble of random transformations of the input image and demonstrates empirical robustness using qualitative experiments. Ghildyal & Liu (2022) employ modules including anti-aliasing filters to make LPIPS robust to a one-pixel shift. More recently, Ghazanfari et al. (2023) propose R-LPIPS, a robust perceptual metric obtained by adversarial training (Madry et al., 2017) over LPIPS, and evaluate it using extensive qualitative and quantitative experiments on the BAPPS dataset (Zhang et al., 2018). Besides the aforementioned methods that show empirical robustness, Kumar & Goldstein (2021) and Shao et al. (2023) propose methods to achieve certified robustness on perceptual metrics based on randomized smoothing. For example, Kumar & Goldstein (2021) proposed center smoothing, an approach that provides certified robustness for structured outputs.
More precisely, the center of the ball enclosing at least half of the perturbed points in output space is considered as the output of the smoothed function and is proved to be robust to input perturbations bounded by an $\ell_2$-size budget. The proof requires the distance metric to satisfy the symmetry property and the triangle inequality. As perceptual metrics generally do not satisfy the triangle inequality, an approximate triangle inequality is used, which makes the bound excessively loose. In Shao et al. (2023), the same enclosing ball is employed; however, the problem is mapped to a binary classification setting to leverage the certified bound as in the randomized smoothing paper (by assigning one to the points that are in the enclosing ball and zero otherwise). Besides their loose bound, these methods are computationally very expensive due to the Monte Carlo sampling for each data point.

3 Background

Lipschitz Networks. After the discovery of the vulnerability of neural networks to adversarial attacks, one major direction of research has focused on improving the robustness of neural networks to small input perturbations by leveraging Lipschitz continuity. This goal can be mathematically achieved by using a Lipschitz function. Let $f$ be a Lipschitz function with Lipschitz constant $L_f$ in terms of the $\ell_2$ norm; then we can bound the output of the function by $\|f(x) - f(x+\delta)\|_2 \leq L_f \|\delta\|_2$. To achieve stability using the Lipschitz property, different approaches have been taken. One efficient way is to design a network with 1-Lipschitz layers, which leads to a 1-Lipschitz network (Meunier et al., 2022; Araujo et al., 2023; Wang & Manchester, 2023).
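As a minimal numerical illustration of the bound above (a sketch, not taken from the paper), the following NumPy snippet builds a toy 1-Lipschitz map by dividing a random weight matrix by its spectral norm and composing it with ReLU, then checks the inequality on random perturbations. The helper names `make_one_lipschitz_layer` and `lipschitz_ratio` are hypothetical, and spectral normalization is used only because it is the simplest construction; it is not the SLL layer used by LipSim.

```python
import numpy as np

def make_one_lipschitz_layer(rng, in_dim, out_dim):
    """Toy 1-Lipschitz map x -> relu(W x), with W rescaled to spectral norm 1.

    Illustration only (spectral normalization), not the SLL layer of LipSim.
    """
    W = rng.standard_normal((out_dim, in_dim))
    W = W / np.linalg.norm(W, 2)  # ord=2 on a matrix is its largest singular value
    return lambda x: np.maximum(W @ x, 0.0)

def lipschitz_ratio(f, x, delta):
    """Return ||f(x) - f(x + delta)||_2 / ||delta||_2."""
    return np.linalg.norm(f(x) - f(x + delta)) / np.linalg.norm(delta)

rng = np.random.default_rng(0)
f = make_one_lipschitz_layer(rng, in_dim=16, out_dim=8)
ratios = [
    lipschitz_ratio(f, rng.standard_normal(16), 0.1 * rng.standard_normal(16))
    for _ in range(1000)
]
# With L_f = 1 the ratio is bounded by 1 (up to floating-point error).
print(f"largest observed ratio: {max(ratios):.3f}")
```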

State of the Art Perceptual Similarity Metric. DreamSim is a recently proposed perceptual distance metric (Fu et al., 2023) that employs cosine distance on the concatenation of feature vectors generated by an ensemble of ViT-based representation learning methods. More precisely, DreamSim is a concatenation of embeddings generated by DINO (Caron et al., 2021), CLIP (Radford et al., 2021), and Open CLIP (Cherti et al., 2023). Let $f$ be the feature extractor function; the DreamSim distance metric $d(x_1, x_2)$ is defined as:

$$d(x_1, x_2) = 1 - S_c(f(x_1), f(x_2)) \tag{1}$$

where $S_c(x_1, x_2)$ is the cosine similarity metric. To fine-tune the DreamSim distance metric, the NIGHT dataset is used, which provides two variations, $x_0$ and $x_1$, for a reference image $x$, and a label $y$ that is based on human judgments about which variation is more similar to the reference image $x$ (some instances and more explanation of the NIGHT dataset are deferred to Appendix B). Supplemented with this dataset, the authors of DreamSim turn the setting into a binary classification problem. More concretely, given a triplet $(x, x_0, x_1)$ and a feature extractor $f$, they define the following classifier:

$$h(x) = \begin{cases} 1, & d(x, x_1) \leq d(x, x_0) \\ 0, & d(x, x_1) > d(x, x_0) \end{cases} \tag{2}$$

Finally, to better align DreamSim with human judgment, given the triplet $(x, x_0, x_1)$, they optimize a hinge loss based on the difference between the perceptual distances $d(x, x_0)$ and $d(x, x_1)$ with a margin parameter. Note that the classifier $h$ depends on $f$ and $d$, and each input $x$ comes as a triplet $(x, x_0, x_1)$, but to simplify the notation we omit all these dependencies.
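The distance of Equation 1, the classifier of Equation 2, and a hinge loss on the distance gap can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than the DreamSim implementation: the function names are hypothetical, the embeddings are toy vectors standing in for $f(x)$, $f(x_0)$, $f(x_1)$, the exact hinge form is assumed, and the margin value 0.05 is a placeholder that the text does not specify.

```python
import numpy as np

def cosine_distance(u, v):
    """Equation 1 applied to embeddings: d = 1 - cosine similarity."""
    return 1.0 - float(u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def two_afc_classifier(f_x, f_x0, f_x1):
    """Equation 2: return 1 if x1 is at least as close to x as x0, else 0."""
    return 1 if cosine_distance(f_x, f_x1) <= cosine_distance(f_x, f_x0) else 0

def hinge_loss(f_x, f_x0, f_x1, y, margin=0.05):
    """Assumed hinge form: max(0, margin - (d(x, x0) - d(x, x1)) * y_pm).

    y is the human label in {0, 1}; y_pm maps it to {-1, +1}.
    The margin value is a placeholder, not a value given in the text.
    """
    y_pm = 2 * y - 1
    gap = cosine_distance(f_x, f_x0) - cosine_distance(f_x, f_x1)
    return max(0.0, margin - gap * y_pm)

# Toy embeddings standing in for f(x), f(x0), f(x1).
f_x = np.array([1.0, 0.0])
f_x0 = np.array([0.0, 1.0])   # orthogonal to the reference
f_x1 = np.array([1.0, 0.1])   # nearly aligned with the reference

print("prediction:", two_afc_classifier(f_x, f_x0, f_x1))
print("loss if humans chose x1:", hinge_loss(f_x, f_x0, f_x1, y=1))
print("loss if humans chose x0:", hinge_loss(f_x, f_x0, f_x1, y=0))
```

The loss vanishes when the predicted ordering agrees with the label by at least the margin, and grows linearly with the distance gap when it disagrees.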

4 LipSim: Lipschitz Similarity Metric

In this section, we present the theoretical guarantees of LipSim along with the technical details of LipSim architecture and training.

4.1 A Perceptual Metric with Theoretical Guarantees

General Robustness for Perceptual Metric. A perceptual similarity metric can have a lot of important use cases, e.g., image retrieval, copy detection, etc. In order to make a robust perceptual metric we need to ensure that when a small perturbation is added to the input image, the output distance should not change a lot. In the following, we demonstrate a general robustness property when the feature extractor is 1-Lipschitz and the embeddings lie on the unit $\ell_2$ ball, i.e., $\|f(x)\|_2 = 1$.

Proposition 4.1. Let $f: \mathcal{X} \rightarrow \mathbb{R}^k$ be a 1-Lipschitz feature extractor with $\|f(x)\|_2 = 1$, let $d$ be a distance metric defined as in Equation 1, and let $\delta \in \mathcal{X}$ and $\varepsilon \in \mathbb{R}^+$ be such that $\|\delta\|_2 \leq \varepsilon$. Then, we have,

$$|d(x_1, x_2) - d(x_1 + \delta, x_2)| \leq \|\delta\|_2 \tag{3}$$

The proof is deferred to Appendix A. This proposition implies that when the feature extractor is 1-Lipschitz and its output is projected on the unit $\ell_2$ ball, then the composition of the distance metric $d$ and the feature extractor, i.e., $d \circ f$, is also 1-Lipschitz with respect to its first argument. This result provides some general stability results and guarantees that the distance metric cannot change more than the norm of the perturbation.
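For intuition, the inequality can be recovered in a few lines. The following is a sketch under the proposition's assumptions and is not necessarily the argument given in Appendix A. Because the embeddings have unit norm, the cosine similarity reduces to an inner product, so $d(x_1, x_2) = 1 - \langle f(x_1), f(x_2) \rangle$, and the rest follows from the Cauchy-Schwarz inequality and the 1-Lipschitz property of $f$:

```latex
\begin{align*}
|d(x_1, x_2) - d(x_1 + \delta, x_2)|
  &= |\langle f(x_1 + \delta), f(x_2) \rangle - \langle f(x_1), f(x_2) \rangle| \\
  &= |\langle f(x_1 + \delta) - f(x_1), f(x_2) \rangle| \\
  &\leq \|f(x_1 + \delta) - f(x_1)\|_2 \, \|f(x_2)\|_2 \\
  &\leq \|\delta\|_2 .
\end{align*}
```

The third line uses Cauchy-Schwarz, and the last line uses $\|f(x_2)\|_2 = 1$ together with $\|f(x_1 + \delta) - f(x_1)\|_2 \leq \|\delta\|_2$.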

Certified Robustness for 2AFC datasets. We aim to go even further and provide certified robustness for perceptual similarity metrics with 2AFC datasets, i.e., in a classification setting. In the following, we show that with the same assumptions as in Proposition 4.1, the classifier $h$ can obtain certified accuracy. First, let us define a soft classifier $H: \mathcal{X} \rightarrow \mathbb{R}^2$ with respect to some feature extractor $f$ as follows:

$$H(x) = \left[ d(x, x_1), d(x, x_0) \right] \tag{4}$$

It is clear that $h(x) = \arg\max_{i \in \{0,1\}} H_i(x)$, where $H_i$ represents the $i$-th value of the output of $H$. The classifier $h$ is said to be certifiably robust at radius $\epsilon \geq 0$ at point $x$ if for all $\|\delta\|_2 \leq \epsilon$ we have:

$$h(x + \delta) = h(x) \tag{5}$$

Equivalently, one can look at the margin of the soft classifier, $M_{H,x} := H_y(x) - H_{1-y}(x)$, and provide a provable guarantee that:

$$M_{H,x+\delta} > 0 \tag{6}$$
Theorem 6 (Certified Accuracy for Perceptual Distance Metric). Let $H: \mathcal{X} \rightarrow \mathbb{R}^2$ be the soft classifier as defined in Equation 4. Let $\delta \in \mathcal{X}$ and $\varepsilon \in \mathbb{R}^+$ be such that $\|\delta\|_2 \leq \varepsilon$. Assume that the feature extractor $f: \mathcal{X} \rightarrow \mathbb{R}^k$ is 1-Lipschitz and that for all $x$, $\|f(x)\|_2 = 1$; then we have the following result:

$$M_{H,x} \geq \varepsilon \|f(x_0) - f(x_1)\|_2 \quad \Longrightarrow \quad M_{H,x+\delta} \geq 0 \tag{7}$$

The proof is deferred to Appendix A. Based on Theorem 6, and assuming $x_1 \neq x_0$, the certified radius for the classifier $h$ at point $x$ can be computed as follows:

$$R(h, x) = \frac{M_{H,x}}{\|f(x_0) - f(x_1)\|_2} \tag{8}$$
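Computing the certificate therefore needs only three embeddings and the label. The snippet below is a minimal NumPy sketch of Equations 4 and 8, assuming the embeddings are already unit-norm outputs of a 1-Lipschitz feature extractor; the helper names `unit`, `margin`, and `certified_radius` are hypothetical, and the toy vectors stand in for $f(x)$, $f(x_0)$, $f(x_1)$.

```python
import numpy as np

def unit(v):
    """Rescale v to unit l2 norm (the theorem assumes ||f(x)||_2 = 1)."""
    return v / np.linalg.norm(v)

def margin(f_x, f_x0, f_x1, y):
    """M_{H,x} = H_y(x) - H_{1-y}(x) with H(x) = [d(x, x1), d(x, x0)].

    For unit-norm embeddings the cosine distance is 1 minus the inner product.
    """
    H = np.array([1.0 - f_x @ f_x1, 1.0 - f_x @ f_x0])
    return H[y] - H[1 - y]

def certified_radius(f_x, f_x0, f_x1, y):
    """Equation 8; a non-positive value means no radius is certified."""
    return margin(f_x, f_x0, f_x1, y) / np.linalg.norm(f_x0 - f_x1)

f_x = unit(np.array([1.0, 0.2, 0.0]))   # reference embedding
f_x0 = unit(np.array([0.0, 1.0, 0.0]))  # far from the reference
f_x1 = unit(np.array([1.0, 0.0, 0.1]))  # close to the reference
y = 1                                    # label: x1 is the closer variation

R = certified_radius(f_x, f_x0, f_x1, y)
print(f"margin = {margin(f_x, f_x0, f_x1, y):.4f}, certified radius = {R:.4f}")
```

Because $f$ is 1-Lipschitz, any input perturbation with $\|\delta\|_2 \leq R$ moves the reference embedding by at most $R$, which is why the radius computed in embedding space transfers to the input space.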
[Figure 2 diagram. Step 1: Lipschitz-based student-teacher training of embeddings (ImageNet images with data jittering; DreamSim teacher built from DINO, CLIP, and OpenCLIP; 1-Lipschitz student model; RMSE loss to minimize). Step 2: Lipschitz fine-tuning on the NIGHT dataset (triplets $(x, x_0, x_1)$; projected 1-Lipschitz model; distances $d(\tilde{x}, \tilde{x}_0)$ and $d(\tilde{x}, \tilde{x}_1)$; hinge loss to minimize against the label).]
Figure 2: Two-step process for training the LipSim perceptual similarity metric. First (left), a distillation on the ImageNet dataset is performed where DreamSim acts as the teacher model, and a 1-Lipschitz neural network (i.e., the feature extractor) is learned to mimic DreamSim embeddings. To reduce color bias with the Lipschitz network, we use two different dataset augmentation schemes: a simple random flip and a jittered data augmentation technique that varies the brightness, contrast, hue, and saturation of the image. Second (right), the 1-Lipschitz neural network with projection is then fine-tuned on the NIGHT dataset with a hinge loss.

Theorem 6 provides a sufficient condition for a provable perceptual distance metric without changing the underlying distance on the embeddings (i.e., cosine similarity). This result has two key advantages. First, as in Tsuzuku et al. (2018), computing the certificate at each point only requires efficient computation of the margin of the classifier $H$. Leveraging Lipschitz continuity enables efficient certificate computation, unlike the randomized smoothing approach of Kumar & Goldstein (2021), which requires Monte Carlo sampling for each point. Second, the bound obtained on the margin to guarantee the robustness is in fact tighter than the one provided by Tsuzuku et al. (2018). Recall that the result of Tsuzuku et al. (2018) states that for an $L$-Lipschitz classifier $H$, we have:

$$M_{H,x} \geq \varepsilon \sqrt{2} L \quad \Longrightarrow \quad M_{H,x+\delta} \geq 0 \tag{9}$$

Given that the Lipschitz constant of $H$ is $\sqrt{2}$ (footnote 1: recall the Lipschitz constant of a concatenation; if $f$ and $g$ are $L_f$- and $L_g$-Lipschitz, then the Lipschitz constant of the function $x \mapsto [f(x), g(x)]$ is upper bounded by $\sqrt{L_f^2 + L_g^2}$), this leads to the following bound:

$$M_{H,x} \geq 2\varepsilon \geq \varepsilon \|f(x_0) - f(x_1)\|_2 \tag{10}$$

simply based on the triangle inequality and the assumption that $\|f(x)\|_2 = 1$.
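The two steps behind Equation 10 can be spelled out as follows; this is a worked restatement of the argument above rather than additional material from the paper. Substituting $L = \sqrt{2}$ into Equation 9 gives the threshold $2\varepsilon$, while the triangle inequality with unit-norm embeddings caps the factor in Equation 7 at 2:

```latex
\begin{align*}
\varepsilon \sqrt{2} L &= \varepsilon \sqrt{2} \cdot \sqrt{2} = 2\varepsilon, \\
\|f(x_0) - f(x_1)\|_2 &\leq \|f(x_0)\|_2 + \|f(x_1)\|_2 = 2 .
\end{align*}
```

Hence any margin that satisfies the condition of Equation 9 also satisfies that of Equation 7, and the requirement of Equation 7 is strictly weaker whenever $\|f(x_0) - f(x_1)\|_2 < 2$, that is, whenever $f(x_0) \neq -f(x_1)$.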

4.2 LipSim Architecture & Training

To design a reliable feature extractor that can be used with Proposition 4.1 and Theorem 6, we combined a 1-Lipschitz neural network architecture with a Euclidean projection. Let $f : \mathcal{X} \rightarrow \mathbb{R}^k$ be defined as:

$$f(x) = \pi_{B_2(0,1)} \circ \beta \circ \phi^{(l)} \circ \cdots \circ \phi^{(1)}(x) \tag{11}$$

where $l$ is the number of layers, $\pi_{B_2(0,1)}$ is the projection onto the unit $\ell_2$ ball, i.e., $\pi_{B_2(0,1)}(x) = \arg\min_{z \in B_2(0,1)} \lVert x - z \rVert_2$, and the layers $\phi$ are the SDP-based Lipschitz Layers (SLL) proposed by Araujo et al. (2023):

$$\phi(x) = x - 2W \, \text{diag}\left(\sum_{j=1}^{n} |W^\top W|_{ij}\right)^{-1} \sigma(W^\top x + b), \tag{12}$$

where $W$ is a parameter matrix, either dense or convolutional, $\{q_i\}$ forms a diagonal scaling matrix, and $\sigma$ is the ReLU nonlinear activation.
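To make Equation 12 concrete, the following is a minimal pure-Python sketch of a dense SLL layer exactly as written above (no $q_i$ scaling), followed by an empirical check that it does not expand distances. The function name `sll_dense`, the shapes, and the random test data are illustrative assumptions, not the authors' implementation.

```python
import math
import random


def sll_dense(x, W, b):
    """phi(x) = x - 2 W diag(sum_j |W^T W|_ij)^{-1} relu(W^T x + b).

    W is an n x m nested list (n = input size, m = hidden size).
    """
    n, m = len(W), len(W[0])
    # Gram matrix W^T W (m x m).
    gram = [[sum(W[r][i] * W[r][j] for r in range(n)) for j in range(m)]
            for i in range(m)]
    # Diagonal entries: absolute row sums of the Gram matrix.
    t = [sum(abs(g) for g in row) for row in gram]
    # ReLU(W^T x + b).
    hidden = [max(0.0, sum(W[r][i] * x[r] for r in range(n)) + b[i])
              for i in range(m)]
    # A zero row sum means the matching column of W is zero: it contributes nothing.
    scaled = [h / ti if ti > 0.0 else 0.0 for h, ti in zip(hidden, t)]
    return [x[r] - 2.0 * sum(W[r][i] * scaled[i] for i in range(m))
            for r in range(n)]


def dist(u, v):
    return math.sqrt(sum((a - c) ** 2 for a, c in zip(u, v)))


# Empirical check on random data (hypothetical sizes and seed).
rng = random.Random(0)
n, m = 4, 6
W = [[rng.uniform(-1, 1) for _ in range(m)] for _ in range(n)]
b = [rng.uniform(-1, 1) for _ in range(m)]
worst_ratio = 0.0
for _ in range(200):
    x = [rng.uniform(-2, 2) for _ in range(n)]
    y = [rng.uniform(-2, 2) for _ in range(n)]
    ratio = dist(sll_dense(x, W, b), sll_dense(y, W, b)) / dist(x, y)
    worst_ratio = max(worst_ratio, ratio)
```

The largest observed ratio of output distance to input distance should stay at or below one, which is the 1-Lipschitz property the layer is designed to have.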

To apply Theorem 6 for certification, we need $\lVert f(x) \rVert_2 = 1$ with $f$ 1-Lipschitz. To respect these conditions, we need $\lVert \beta \circ \phi^{(l)} \circ \cdots \circ \phi^{(1)}(x) \rVert_2 \geq 1$, so that the 1-Lipschitz projection $\pi_{B_2(0,1)}$ returns an embedding of unit norm. The mapping $\beta : x \mapsto x + b$ is a 1-Lipschitz translation that is set to increase the norm of the embedding (ideally above one) before the Euclidean projection. However, we have no guarantee that the norm of the embedding after the translation $\beta$ is above one; when it is not, the assumption of Theorem 6 does not hold and we abstain from the prediction. In practice, however, $\lVert \beta \circ \phi^{(l)} \circ \cdots \circ \phi^{(1)}(x) \rVert_2 \geq 1$ holds for all the samples of the NIGHT dataset, and thus the abstention percentage is zero.
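The projection and the abstention rule described above can be sketched as follows. The helper names `project_unit_ball` and `embed_or_abstain` are hypothetical, and the input is assumed to be the output of $\beta \circ \phi^{(l)} \circ \cdots \circ \phi^{(1)}$ given as a plain list of floats.

```python
import math


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def project_unit_ball(v):
    """Euclidean projection onto the unit l2 ball: v / max(1, ||v||_2)."""
    scale = max(1.0, l2_norm(v))
    return [c / scale for c in v]


def embed_or_abstain(pre_projection):
    """Return the unit-norm embedding, or None (abstain) when the
    pre-projection norm is below one, so the unit-norm assumption fails."""
    if l2_norm(pre_projection) < 1.0:
        return None
    return project_unit_ball(pre_projection)
```

For a vector of norm at least one the projection rescales it to unit norm; for a shorter vector the projection is the identity, which is exactly the case where the method abstains.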

Proposition 1.

The neural network $f : \mathcal{X} \rightarrow \mathbb{R}^k$ described in Equation 11 is 1-Lipschitz, and $\lVert f(x) \rVert_2 = 1$ for all $x$ under the assumption that $\lVert \beta \circ \phi^{(l)} \circ \cdots \circ \phi^{(1)}(x) \rVert_2 \geq 1$.

Proof of Proposition 1.

The proof is a straightforward application of Theorem 3 of Araujo et al. (2023) and Corollary 2.2.3 of Nesterov et al. (2018). ∎

Two-step Process for Training LipSim.

LipSim aims to provide good image embeddings that are less sensitive to adversarial perturbations. We train LipSim in two steps, similar to the DreamSim approach. Recall that DreamSim first concatenates the embeddings of three ViT-based models and then fine-tunes the result on the NIGHT dataset. However, to obtain theoretical guarantees, we cannot use the embeddings of three ViT-based models because they are not generated by a 1-Lipschitz feature extractor. To address this issue and avoid self-supervised schemes for training the feature extractor, we leverage a distillation scheme on the ImageNet dataset, where DreamSim acts as the teacher model and a 1-Lipschitz neural network (without the unit $\ell_2$ ball projection) acts as the student model. This first step is described on the left of Figure 2. In the second step, we fine-tune the 1-Lipschitz neural network with projection on the NIGHT dataset using a hinge loss to increase margins and therefore robustness, as in Araujo et al. (2023). This second step is described on the right of Figure 2.

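[Figure 3, panels (a) and (b). Panel (a) is a bar chart on a 0 to 100 scale covering PSNR, SSIM, LPIPS, DISTS, CLIP, OpenCLIP, DINO, DreamSim, and LipSim, with legend entries: low-level metrics, prior learned metrics, base metrics, fine-tuned metrics, and empirical robustness. Images not reproduced.]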
Figure 3: Figure 3(a) compares the percentage of alignment of several distance metrics with human vision on the NIGHT dataset. The ViT-based methods outperform the pixel-wise and CNN-based metrics on the original images. However, LipSim, with its 1-Lipschitz backbone composed of CNN and linear layers, has a decent natural score and outperforms the (base) CLIP. Moreover, the figure shows the performance under attack ($\ell_2$-APGD with $\epsilon = 2.0$) for the SOTA metric. When the reference image is perturbed, the other methods suffer a large drop in performance, whereas LipSim shows much stronger robustness. Figure 3(b) shows the distribution of $d(x, x+\delta)$ for LipSim and DreamSim. The perturbation $\delta$ is optimized for each method separately.

5 Experiments

In this section, we present a comprehensive set of experiments, first to highlight the vulnerabilities of DreamSim, which is the state-of-the-art perceptual distance metric, and second to demonstrate the certified and empirical robustness of LipSim to adversarial attacks.

5.1 Vulnerabilities of Perceptual Similarity Metrics

To investigate the vulnerabilities of DreamSim to adversarial attacks, we aim to answer two questions in this section. Can adversarial attacks against SOTA metrics cause (1) misalignment with human perception, and (2) large changes in distance between perturbed and original images?

Q1 – Alignment of SOTA Metric with Human Judgments after Attack.

In this part we focus on the binary classification setting and the NIGHT dataset with triplet input. The goal is to generate adversarial attacks and evaluate the resilience of state-of-the-art distance metrics. For this purpose, we use APGD (Croce & Hein, 2020), which is one of the most powerful attack algorithms. During optimization, we maximize the cross-entropy loss; the perturbation $\delta$ is crafted only on the reference image, and the two distortions stay untouched:

$$\arg\max_{\delta : \lVert \delta \rVert_2 \leq \varepsilon} \mathcal{L}_{ce}(y, \hat{y}) = \mathcal{L}_{ce}([d(x+\delta, x_1), d(x+\delta, x_0)], y) \tag{13}$$

where $y \in \{0, 1\}$ and $\hat{y} = [d(x+\delta, x_1), d(x+\delta, x_0)]$, which is treated as the logits generated by the model. The natural and adversarial 2AFC scores of DreamSim are reported in Table 1. The accuracy drops to roughly half its natural value (from 96.16 to 46.27) for a tiny perturbation of size $\epsilon = 0.5$ and decreases to nearly zero (0.93) for $\epsilon = 2.0$. In order to visualize the effect of the attack on the astuteness of the distances provided by DreamSim, original and adversarial images (generated by $\ell_2$-APGD and causing misclassification) are shown in Figure 1. The distances are reported underneath the images as $d(❶, ❷)$. To get a sense of the DreamSim distances between the original and perturbed images, the third row is added so that the original images have (approximately) the same distance to the perturbed images and to the perceptually different images in the third row ($d(❶, ❷) = d(❶, ❸)$).
The takeaway from this experiment is that tiny perturbations can fool the distance metric into producing large values for perceptually identical images.
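The objective of Equation 13 can be illustrated with a small sketch that builds the two-way logits from distances and evaluates the cross-entropy. It assumes $d$ is the cosine distance (one minus cosine similarity) between embeddings; the embeddings below are hand-picked placeholders, not outputs of any real model, and the attack optimizer itself (APGD) is not reproduced.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def two_afc_logits(f_ref, f_x0, f_x1):
    """Logits [d(x, x1), d(x, x0)]: class 1 wins when x1 is closer to x."""
    return [cosine_distance(f_ref, f_x1), cosine_distance(f_ref, f_x0)]


def cross_entropy(logits, y):
    """Cross-entropy of softmax(logits) against the integer label y."""
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[y]


# Placeholder embeddings: the reference is closest to x1, so y = 1.
f_x0, f_x1, y = [0.0, 1.0], [1.0, 0.0], 1
loss_clean = cross_entropy(two_afc_logits([1.0, 0.0], f_x0, f_x1), y)
# A perturbed reference whose embedding moved towards x0 increases the loss.
loss_adv = cross_entropy(two_afc_logits([0.0, 1.0], f_x0, f_x1), y)
```

An attack succeeds when the perturbation of the reference image moves its embedding enough to raise this loss and flip the larger logit.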

Q2 – Specialized Attack for Semantic Metric.

In this part, we perform a direct attack against the feature extractor, which is the source of the vulnerability of perceptual metrics, by employing the $\ell_2$-PGD attack (Madry et al., 2017) with $\epsilon = 1.0$. The following MSE loss is used during the optimization:

$$\max_{\delta : \lVert \delta \rVert_2 \leq \varepsilon} \mathcal{L}_{\text{MSE}}\left[f(x+\delta), f(x)\right] \tag{14}$$

The attack is performed on 500 randomly selected samples from the ImageNet-1k test set and against the DreamSim Ensemble feature extractor. After optimizing $\delta$, the DreamSim distance is calculated between the original image and the perturbed image: $d(x, x+\delta)$. The distribution of distances is shown in Figure 3(b). We observe a shift in the mean of the distance from 0 to 0.6, which can be considered a large value for the DreamSim distance, as shown in Figure 1.
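The structure of the attack in Equation 14 can be sketched as projected gradient ascent on the MSE between perturbed and clean embeddings. Everything below is a toy under stated assumptions: `toy_features` is a hypothetical stand-in for the feature extractor, gradients are taken by finite differences rather than backpropagation, and the step size and iteration count are arbitrary.

```python
import math
import random


def toy_features(x):
    # Hypothetical stand-in for a feature extractor f (not DreamSim).
    return [math.tanh(x[0] + 2.0 * x[1]), math.tanh(x[0] - x[1])]


def mse(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) / len(u)


def project_l2(delta, eps):
    """Project delta back onto the l2 ball of radius eps."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= eps:
        return list(delta)
    return [d * eps / norm for d in delta]


def numerical_grad(loss, delta, h=1e-5):
    grad = []
    for i in range(len(delta)):
        up = list(delta)
        down = list(delta)
        up[i] += h
        down[i] -= h
        grad.append((loss(up) - loss(down)) / (2.0 * h))
    return grad


def pgd_l2(f, x, eps, steps=50, step_size=0.1, seed=0):
    """Maximize MSE[f(x + delta), f(x)] subject to ||delta||_2 <= eps."""
    rng = random.Random(seed)
    target = f(x)

    def loss(d):
        return mse(f([xi + di for xi, di in zip(x, d)]), target)

    # Start from a small random point: the gradient vanishes at delta = 0.
    delta = project_l2([rng.gauss(0.0, 1e-2) for _ in x], eps)
    best, best_loss = delta, loss(delta)
    for _ in range(steps):
        g = numerical_grad(loss, delta)
        g_norm = math.sqrt(sum(c * c for c in g))
        if g_norm == 0.0:
            break
        # Normalized ascent step followed by projection.
        delta = project_l2(
            [d + step_size * c / g_norm for d, c in zip(delta, g)], eps)
        current = loss(delta)
        if current > best_loss:
            best, best_loss = delta, current
    return best, best_loss


delta, adv_loss = pgd_l2(toy_features, [0.2, -0.1], eps=1.0)
```

The returned perturbation always lies inside the $\ell_2$ ball of radius `eps`; the distance $d(x, x+\delta)$ would then be evaluated on the perturbed input.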

5.2 LipSim Results

In this section, we leverage the framework introduced in the paper and evaluate the LipSim perceptual metric. In the first step (i.e., left of Figure 2), we train a 1-Lipschitz network for the backbone of the LipSim metric and use the SLL architecture, which has 20 layers of Conv-SLL and 7 layers of Linear-SLL. For training the 1-Lipschitz feature extractor, the ImageNet-1k dataset is used (without labels) and knowledge distillation is applied to make use of state-of-the-art feature extractors, including DINO, OpenCLIP, and DreamSim, which is an ensemble of ViT-based models. To enhance the effectiveness of LipSim, we incorporate two parallel augmentation pipelines: standard and jittered. The standard version passes through the feature extractor and the teacher model, while the jittered version only passes through the feature extractor. Then, the RMSE loss is applied to enforce similarity between the embeddings of the jittered and standard images. This enables LipSim to focus more on the semantics of the image than on its colors. After training the 1-Lipschitz backbone of LipSim, we further fine-tune our model on the NIGHT dataset (i.e., step 2; see the right of Figure 2). During the fine-tuning process, the embeddings are produced and projected onto the unit $\ell_2$ ball. In order to maintain the margin between logits, the hinge loss is employed, similarly to DreamSim. However, while DreamSim used a margin parameter of 0.05, we used a margin parameter of 0.5 for fine-tuning LipSim in order to boost the robustness of the metric. Remarkably, LipSim achieves strong robustness using a 1-Lipschitz pipeline composed of a 1-Lipschitz feature extractor and a projection onto the unit $\ell_2$ ball that guarantees the 1-Lipschitzness of the cosine distance.
To evaluate the performance of LipSim and compare it against other perceptual metrics, we report empirical and certified results of LipSim for different settings.
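The role of the margin parameter can be illustrated with a short sketch. It assumes the hinge loss has the DreamSim-style form $\max(0, m - \Delta d \cdot \bar{y})$ with $\Delta d = d(x, x_0) - d(x, x_1)$ and $\bar{y} \in \{-1, 1\}$; this form is an assumption here, since the text above only states that a hinge loss with margin 0.5 is used.

```python
def hinge_loss_2afc(d_ref_x0, d_ref_x1, y, margin=0.5):
    """Hinge loss on a 2AFC triplet (assumed form, see lead-in).

    y = 1 means humans judged x1 closer to the reference, y = 0 means x0.
    """
    y_bar = 1.0 if y == 1 else -1.0
    delta_d = d_ref_x0 - d_ref_x1
    return max(0.0, margin - delta_d * y_bar)


# A correctly ranked triplet whose distance gap is 0.3:
loss_small_margin = hinge_loss_2afc(0.4, 0.1, y=1, margin=0.05)
loss_large_margin = hinge_loss_2afc(0.4, 0.1, y=1, margin=0.5)
```

With the smaller margin the triplet already incurs no loss, whereas the larger margin keeps pushing the two distances apart, which is the mechanism by which a larger margin trades natural score for robustness.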

Table 1: Alignment on the NIGHT dataset for original and perturbed images using APGD. In this experiment, the perturbation is applied only to the reference images. LipSim demonstrates higher robustness at all perturbation budgets.

| Metric / Embedding | Natural Score | $\ell_2$-APGD, $\epsilon = 0.5$ | $\epsilon = 1.0$ | $\epsilon = 2.0$ | $\epsilon = 3.0$ |
|---|---|---|---|---|---|
| CLIP | 93.91 | 29.93 | 8.44 | 1.20 | 0.27 |
| OpenCLIP | 95.45 | 72.31 | 42.32 | 11.84 | 3.28 |
| DINO | 94.52 | 81.91 | 59.04 | 19.29 | 6.35 |
| DreamSim | 96.16 | 46.27 | 16.66 | 0.93 | 0.93 |
| LipSim (ours) | 85.58 | 82.89 | 79.82 | 72.20 | 61.84 |

Table 2: Certified scores of LipSim given different settings. The natural and certified 2AFC scores of all variants of LipSim are shown in this table. The LipSim – DreamSim version outperforms other variants regarding certified scores. The tradeoff between robustness and accuracy compares the results for different margins in the hinge loss. A higher margin parameter leads to a higher certified score and a lower natural score.

| LipSim with Teacher Model | Margin in Hinge Loss | Natural Score | Certified Score, $\frac{36}{255}$ | $\frac{72}{255}$ | $\frac{108}{255}$ |
|---|---|---|---|---|---|
| LipSim – DINO | 0.2 | 84.76 | 64.14 | 34.76 | 10.53 |
| LipSim – DINO | 0.4 | 84.65 | 65.19 | 40.51 | 18.04 |
| LipSim – DINO | 0.5 | 81.96 | 66.28 | 44.63 | 22.49 |
| LipSim – OpenCLIP | 0.2 | 83.33 | 61.18 | 34.87 | 13.60 |
| LipSim – OpenCLIP | 0.4 | 80.59 | 63.27 | 42.32 | 21.71 |
| LipSim – OpenCLIP | 0.5 | 81.30 | 64.80 | 45.12 | 25.38 |
| LipSim – DreamSim | 0.2 | 85.58 | 62.88 | 35.36 | 11.18 |
| LipSim – DreamSim | 0.4 | 83.33 | 65.40 | 43.69 | 21.11 |
| LipSim – DreamSim | 0.5 | 82.89 | 66.39 | 44.90 | 23.46 |

Empirical Robustness Evaluation. We provide the empirical results of LipSim against $\ell_2$-APGD in Table 1. Although the natural score of LipSim is lower than that of DreamSim, there is a large gap between the adversarial scores: under attack, LipSim outperforms all state-of-the-art metrics. The results of a more comprehensive comparison between LipSim, state-of-the-art perceptual metrics, previously proposed perceptual metrics, and pixel-wise metrics are presented in Figure 3(a). The pre-trained and fine-tuned natural accuracies are comparable with the state-of-the-art metrics and even higher in comparison to CLIP. In terms of empirical robustness, LipSim demonstrates great resilience. Further comparisons over $\ell_\infty$-APGD and $\ell_p$-MIA are reported in Table 4 and Table 5 in Appendix D; they align with the $\ell_2$ results and show the strong empirical robustness of LipSim. In order to evaluate the robustness of LipSim outside the classification setting, we performed an $\ell_2$-PGD attack ($\epsilon = 1.0$) using the MSE loss defined in Equation 14; the distribution of $d(x, x+\delta)$ is shown in Figure 3(b). The values of $d(x, x+\delta)$ are very close to zero, which illustrates the general robustness of LipSim as discussed in Proposition 1. The histogram of the same attack with a larger perturbation budget ($\epsilon = 3.0$) is shown in Figure 6 of the Appendix.

Certified Robustness Evaluation. In order to find the certified radius for data points, the margin between logits is computed and divided by the $\ell_2$ distance between the embeddings of the distorted images ($\lVert f(x_0) - f(x_1) \rVert_2$). The results for certified 2AFC scores for different settings of LipSim are reported in Table 2, which demonstrates the robustness of LipSim along with a high natural score. The value of the margin parameter in the hinge loss used during fine-tuning is given in the table, which clearly shows the trade-off between robustness and accuracy. A larger margin parameter leads to more robustness and therefore higher certified scores but lower natural scores.
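The radius computation described above amounts to a few lines. The sketch below assumes unit-norm embeddings, the cosine distance for $d$, and the margin taken as the absolute difference of the two logits; the function names and the example embeddings are illustrative only. A certified 2AFC score would additionally require the prediction to agree with the human label, which is not modeled here.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def certified_radius(f_ref, f_x0, f_x1):
    """Margin between the two logits divided by ||f(x0) - f(x1)||_2."""
    margin = abs(cosine_distance(f_ref, f_x1) - cosine_distance(f_ref, f_x0))
    gap = math.sqrt(sum((a - b) ** 2 for a, b in zip(f_x0, f_x1)))
    # Identical distortion embeddings give a zero margin and no certificate.
    return margin / gap if gap > 0.0 else 0.0


def is_certified(f_ref, f_x0, f_x1, eps):
    """True when the prediction cannot change for any ||delta||_2 <= eps."""
    return certified_radius(f_ref, f_x0, f_x1) >= eps


radius = certified_radius([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

In this toy triplet the margin is 1 and the embeddings of the distortions are $\sqrt{2}$ apart, so the radius is $1/\sqrt{2}$ and the point is certified at every budget reported in Table 2.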

[Figure 4: image grid with two panels, LipSim and DreamSim. Images not reproduced.]
Figure 4: Impact of adversarial attacks on the performance of the DreamSim and LipSim distance metrics in the image retrieval application. The rows show the original images and their top-1 nearest neighbors, as well as the adversarial images, generated separately for the LipSim and DreamSim metrics, along with their top-1 nearest neighbors. More precisely, the red block shows a complete sample in the figure, where the upper and lower right images are the original and adversarial queries, and the upper and lower left images are their respective top-1 nearest images.

5.3 Image Retrieval

After demonstrating the robustness of LipSim in terms of certified and empirical scores, this section focuses on a real-world application of a distance metric: image retrieval. We employed the image retrieval dataset proposed by Fu et al. (2023), which has 500 images randomly selected from the COCO dataset. The top-1 closest neighbors to an image with respect to the LipSim and DreamSim distance metrics are shown in the rows of Figure 4. In order to investigate the impact of adversarial attacks on the performance of LipSim and DreamSim in the image retrieval application, we performed an $\ell_2$-PGD attack ($\epsilon = 2.0$) with the same MSE loss defined in Equation 14, separately for the two metrics; the results are depicted in the rows of Figure 4. In the adversarial rows, LipSim sometimes retrieves a different closest image, but one that is semantically similar to the closest image retrieved for the original query.
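Top-1 retrieval with a perceptual metric reduces to a nearest-neighbor search under the distance $d$. The sketch below assumes the cosine distance between embeddings and uses a tiny hand-made gallery; the labels and vectors are hypothetical and carry no relation to the COCO images used in the experiment.

```python
import math


def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def top1_neighbor(query, gallery):
    """Return the key of the gallery embedding closest to the query."""
    return min(gallery, key=lambda name: cosine_distance(query, gallery[name]))


# Hypothetical gallery of embeddings.
gallery = {"cat": [1.0, 0.0], "dog": [0.8, 0.6], "car": [0.0, 1.0]}
clean_match = top1_neighbor([0.9, 0.1], gallery)
# An embedding displaced by an attack can change the retrieved item.
shifted_match = top1_neighbor([0.6, 0.5], gallery)
```

For a 1-Lipschitz feature extractor the displacement of the query embedding is bounded by the perturbation size, which limits how far the retrieved neighbor can drift.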

6 Conclusion

In this paper, we first showed the vulnerability of SOTA perceptual metrics, including DreamSim, to adversarial attacks. More importantly, we presented a framework for training a certifiably robust distance metric called LipSim, which leverages a 1-Lipschitz network as its backbone together with a 1-Lipschitz cosine similarity, and demonstrates non-trivial certified and empirical robustness. Moreover, LipSim was employed for an image retrieval task and exhibited good performance in gathering semantically close images for both original and adversarial image queries. For future work, it will be interesting to investigate the certified robustness of LipSim on other 2AFC datasets and to extend LipSim to other applications, including copy detection and feature inversion.

Acknowledgments

This paper is supported in part by the Army Research Office under grant number W911NF-21-1-0155 and by the New York University Abu Dhabi (NYUAD) Center for Artificial Intelligence and Robotics, funded by Tamkeen under the NYUAD Research Institute Award CG010.

References

  • Araujo et al. (2020) Alexandre Araujo, Laurent Meunier, Rafael Pinot, and Benjamin Negrevergne. Advocating for multiple defense strategies against adversarial examples. In ECML PKDD 2020 Workshops, 2020.
  • Araujo et al. (2021) Alexandre Araujo, Benjamin Negrevergne, Yann Chevaleyre, and Jamal Atif. On lipschitz regularization of convolutional layers using toeplitz matrix theory. In Proceedings of the AAAI Conference on Artificial Intelligence, 2021.
  • Araujo et al. (2023) Alexandre Araujo, Aaron J Havens, Blaise Delattre, Alexandre Allauzen, and Bin Hu. A unified algebraic perspective on lipschitz neural networks. In The Eleventh International Conference on Learning Representations, 2023.
  • Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in neural information processing systems, 30, 2017.
  • Carlini & Wagner (2017) Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 ieee symposium on security and privacy (sp), 2017.
  • Carlini et al. (2023) Nicholas Carlini, Florian Tramer, Krishnamurthy Dj Dvijotham, Leslie Rice, Mingjie Sun, and J Zico Kolter. (certified!!) adversarial robustness for free! In The Eleventh International Conference on Learning Representations, 2023.
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  9650–9660, 2021.
  • Cherti et al. (2023) Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
  • Cohen et al. (2019) Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via randomized smoothing. In international conference on machine learning, 2019.
  • Croce & Hein (2020) Francesco Croce and Matthias Hein. Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. In International conference on machine learning, 2020.
  • Croce & Hein (2021) Francesco Croce and Matthias Hein. Mind the box: $l_1$-apgd for sparse adversarial attacks on image classifiers. In International Conference on Machine Learning, 2021.
  • Dong et al. (2018) Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Hang Su, Jun Zhu, Xiaolin Hu, and Jianguo Li. Boosting adversarial attacks with momentum. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  9185–9193, 2018.
  • Farnia et al. (2018) Farzan Farnia, Jesse M Zhang, and David Tse. Generalizable adversarial training via spectral normalization. arXiv preprint arXiv:1811.07457, 2018.
  • Fu et al. (2023) Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344, 2023.
  • Ghazanfari et al. (2023) Sara Ghazanfari, Siddharth Garg, Prashanth Krishnamurthy, Farshad Khorrami, and Alexandre Araujo. R-lpips: An adversarially robust perceptual similarity metric. arXiv preprint arXiv:2307.15157, 2023.
  • Ghildyal & Liu (2022) Abhijay Ghildyal and Feng Liu. Shift-tolerant perceptual similarity metric. In European Conference on Computer Vision, pp.  91–107. Springer, 2022.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Hein & Andriushchenko (2017) Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. Advances in neural information processing systems, 30, 2017.
  • Kettunen et al. (2019) Markus Kettunen, Erik Härkönen, and Jaakko Lehtinen. E-lpips: robust perceptual image similarity via random transformation ensembles. arXiv preprint arXiv:1906.03973, 2019.
  • Kumar & Goldstein (2021) Aounon Kumar and Tom Goldstein. Center smoothing: Certified robustness for networks with structured outputs. Advances in Neural Information Processing Systems, 34:5560–5575, 2021.
  • Kurakin et al. (2018) Alexey Kurakin, Ian J Goodfellow, and Samy Bengio. Adversarial examples in the physical world. In Artificial intelligence safety and security. 2018.
  • Li et al. (2019) Qiyang Li, Saminul Haque, Cem Anil, James Lucas, Roger B Grosse, and Jörn-Henrik Jacobsen. Preventing gradient attenuation in lipschitz constrained convolutional networks. Advances in neural information processing systems, 32, 2019.
  • Luo et al. (2022) Cheng Luo, Qinliang Lin, Weicheng Xie, Bizhu Wu, Jinheng Xie, and Linlin Shen. Frequency-driven imperceptible adversarial attack on semantic similarity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  15315–15324, 2022.
  • Madry et al. (2017) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Meunier et al. (2022) Laurent Meunier, Blaise J Delattre, Alexandre Araujo, and Alexandre Allauzen. A dynamical system perspective for lipschitz neural networks. In International Conference on Machine Learning. PMLR, 2022.
  • Miyato et al. (2018) Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957, 2018.
  • Nesterov et al. (2018) Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018.
  • Pinot et al. (2019) Rafael Pinot, Laurent Meunier, Alexandre Araujo, Hisashi Kashima, Florian Yger, Cédric Gouy-Pailler, and Jamal Atif. Theoretical evidence for adversarial robustness through randomization. Advances in neural information processing systems, 2019.
  • Prach & Lampert (2022) Bernd Prach and Christoph H Lampert. Almost-orthogonal layers for efficient general-purpose lipschitz networks. In European Conference on Computer Vision, pp.  350–365. Springer, 2022.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.  8748–8763. PMLR, 2021.
  • Salman et al. (2019) Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. Advances in Neural Information Processing Systems, 32, 2019.
  • Shao et al. (2023) Huaqing Shao, Lanjun Wang, and Junchi Yan. Robustness certification for structured prediction with general inputs via safe region modeling in the semimetric output space. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp.  2010–2022, 2023.
  • Sjögren et al. (2022) Oskar Sjögren, Gustav Grund Pihlgren, Fredrik Sandin, and Marcus Liwicki. Identifying and mitigating flaws of deep perceptual similarity metrics. arXiv preprint arXiv:2207.02512, 2022.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Tsuzuku et al. (2018) Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Lipschitz-margin training: Scalable certification of perturbation invariance for deep neural networks. Advances in neural information processing systems, 31, 2018.
  • Wang & Manchester (2023) Ruigang Wang and Ian Manchester. Direct parameterization of lipschitz-bounded deep networks. In International Conference on Machine Learning, pp.  36093–36110. PMLR, 2023.
  • Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612, 2004.
  • Zhang et al. (2011) Lin Zhang, Lei Zhang, Xuanqin Mou, and David Zhang. Fsim: A feature similarity index for image quality assessment. IEEE transactions on Image Processing, 20(8):2378–2386, 2011.
  • Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, 2018.

Appendix A Proofs

Proof of Proposition 4.1.

We have the following:

\begin{align*}
|d(x_1, x_2) - d(x_1 + \delta, x_2)| &= \left| \frac{\langle f(x_1 + \delta), f(x_2) \rangle}{\|f(x_1 + \delta)\| \|f(x_2)\|} - \frac{\langle f(x_1), f(x_2) \rangle}{\|f(x_1)\| \|f(x_2)\|} \right| \\
&\overset{(1)}{=} \left| \langle f(x_1 + \delta), f(x_2) \rangle - \langle f(x_1), f(x_2) \rangle \right| \\
&= \left| \langle f(x_1 + \delta) - f(x_1), f(x_2) \rangle \right| \\
&\leq \|f(x_1 + \delta) - f(x_1)\| \|f(x_2)\| \\
&\overset{(2)}{\leq} \|\delta\|
\end{align*}

where (1) is due to $\|f(x)\| = 1$ for all $x$, and (2) uses the same identity together with the fact that $f$ is 1-Lipschitz, which concludes the proof. ∎
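The bound can be checked numerically. The following is a minimal sketch, not the LipSim implementation: the embedding `f` is a toy stand-in (projection onto the unit $\ell_2$ ball, which is 1-Lipschitz and has unit-norm output whenever the input norm is at least 1), and the helper names, dimensions and sampling ranges are assumptions made for illustration.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def f(x):
    """Toy embedding (assumption): projection onto the unit l2 ball.

    The projection is 1-Lipschitz, and for inputs with norm >= 1 the output
    has exactly unit norm, matching the two properties used in the proof.
    """
    scale = max(1.0, norm(x))
    return [a / scale for a in x]


def d(x, y):
    """Cosine distance between the embeddings of x and y."""
    fx, fy = f(x), f(y)
    return 1.0 - dot(fx, fy) / (norm(fx) * norm(fy))


def random_vector(rng, dim, length):
    """Random direction rescaled to the requested l2 length."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [length * a / n for a in v]


def max_violation(trials=1000, dim=8, seed=0):
    """Largest observed value of |d(x1, x2) - d(x1 + delta, x2)| - ||delta||."""
    rng = random.Random(seed)
    worst = -math.inf
    for _ in range(trials):
        # Input norms stay >= 1.5 after perturbation, so f has unit norm.
        x1 = random_vector(rng, dim, rng.uniform(3.0, 6.0))
        x2 = random_vector(rng, dim, rng.uniform(3.0, 6.0))
        delta = random_vector(rng, dim, rng.uniform(0.0, 1.5))
        x1_pert = [a + b for a, b in zip(x1, delta)]
        gap = abs(d(x1, x2) - d(x1_pert, x2))
        worst = max(worst, gap - norm(delta))
    return worst


# The proposition predicts a non-positive value (up to rounding error).
print(max_violation() <= 1e-12)
```

A random search like this cannot prove the inequality; it only confirms that no sampled perturbation violates it for this toy embedding.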


Proof of Theorem 6.

First, let us recall the soft classifier $H: \mathcal{X} \rightarrow \mathbb{R}^2$ with respect to some feature extractor $f$ as follows:

\begin{equation}
H(x) = \left[ d(x, x_1), d(x, x_0) \right] \tag{15}
\end{equation}

where $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ is defined as: $d(x, y) = 1 - \frac{\langle f(x), f(y) \rangle}{\|f(x)\|_2 \|f(y)\|_2}$.

Let us denote $H_0$ and $H_1$ the first and second logits of the soft classifier. For a tuple $(x, x_0, x_1)$ and a target label $y$, we say that $H$ correctly classifies if $\operatorname*{arg\,max} H(x) = y$. Note that we omit the dependency on $x_0$ and $x_1$ in the notation. Let us assume the target class $y = 1$. The case for $y = 0$ is exactly symmetric. Let us define the margin of the soft classifier $H$ as:

\begin{equation}
M_{H,x} := H_1(x) - H_0(x) \tag{16}
\end{equation}

We have the following:

\begin{align*}
M_{H,x+\delta} &= H_1(x + \delta) - H_0(x + \delta) \\
&= H_1(x + \delta) - H_1(x) - H_0(x + \delta) + H_0(x) + \left( H_1(x) - H_0(x) \right) \\
&= d(x + \delta, x_0) - d(x, x_0) - d(x + \delta, x_1) + d(x, x_1) + \left( H_1(x) - H_0(x) \right) \\
&= \left( 1 - \frac{\langle f(x + \delta), f(x_0) \rangle}{\|f(x + \delta)\|_2 \|f(x_0)\|_2} \right) - \left( 1 - \frac{\langle f(x), f(x_0) \rangle}{\|f(x)\|_2 \|f(x_0)\|_2} \right) \\
&\quad\quad - \left( 1 - \frac{\langle f(x + \delta), f(x_1) \rangle}{\|f(x + \delta)\|_2 \|f(x_1)\|_2} \right) + \left( 1 - \frac{\langle f(x), f(x_1) \rangle}{\|f(x)\|_2 \|f(x_1)\|_2} \right) + M_{H,x} \\
&\overset{(1)}{=} -\langle f(x + \delta), f(x_0) \rangle + \langle f(x), f(x_0) \rangle + \langle f(x + \delta), f(x_1) \rangle - \langle f(x), f(x_1) \rangle + M_{H,x} \\
&= \langle f(x + \delta), f(x_1) - f(x_0) \rangle + \langle f(x), f(x_0) - f(x_1) \rangle + M_{H,x} \\
&= \langle f(x + \delta) - f(x), f(x_1) - f(x_0) \rangle + M_{H,x} \\
&\geq -\|f(x) - f(x + \delta)\| \|f(x_0) - f(x_1)\| + M_{H,x} \\
&\geq -\varepsilon \|f(x_0) - f(x_1)\| + M_{H,x}
\end{align*}

where (1) is due to the fact that $\|f(x)\| = 1$ for all $x$, and the last inequality holds because $f$ is 1-Lipschitz and $\|\delta\| \leq \varepsilon$. Therefore, $M_{H,x+\delta} \geq 0$ whenever $M_{H,x} \geq \varepsilon \|f(x_0) - f(x_1)\|$, which concludes the proof. ∎
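The last line of the derivation gives a per-sample certificate: the prediction cannot flip for any perturbation of $\ell_2$ norm at most $M_{H,x} / \|f(x_0) - f(x_1)\|$. The sketch below illustrates this with a toy embedding; `f`, `certified_radius` and `check_certificate` are hypothetical names introduced here, and the projection onto the unit ball is an assumed stand-in for a real 1-Lipschitz backbone.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def f(x):
    """Toy 1-Lipschitz embedding (assumption): projection onto the unit ball."""
    scale = max(1.0, norm(x))
    return [a / scale for a in x]


def d(x, y):
    """Cosine distance; embeddings have unit norm for inputs of norm >= 1."""
    return 1.0 - dot(f(x), f(y))


def margin(x, x0, x1):
    """M_{H,x} = H_1(x) - H_0(x) with H(x) = [d(x, x1), d(x, x0)]."""
    return d(x, x0) - d(x, x1)


def certified_radius(x, x0, x1):
    """Largest eps such that M_{H,x} >= eps * ||f(x0) - f(x1)|| (target y = 1)."""
    gap = norm([a - b for a, b in zip(f(x0), f(x1))])
    return margin(x, x0, x1) / gap


def random_vector(rng, dim, length):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = norm(v)
    return [length * a / n for a in v]


def check_certificate(trials=500, dim=8, seed=0):
    """Return True if no sampled perturbation inside the radius flips the sign."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, x0, x1 = (random_vector(rng, dim, 4.0) for _ in range(3))
        if margin(x, x0, x1) < 0:
            x0, x1 = x1, x0  # relabel so that the predicted class is y = 1
        eps = certified_radius(x, x0, x1)
        delta = random_vector(rng, dim, eps * rng.random())
        x_pert = [a + b for a, b in zip(x, delta)]
        if margin(x_pert, x0, x1) < -1e-12:
            return False
    return True


print(check_certificate())
```

Because the margin is at most $\|f(x_0) - f(x_1)\|$ for unit-norm embeddings, the radius never exceeds 1, so the perturbed inputs in this sketch keep a norm of at least 3 and the toy embedding stays on the unit sphere.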

Appendix B Dataset Details & Additional Figures

In this section, we first present some details of 2AFC datasets, more specifically the NIGHT dataset, and show some examples from it.

B.1 Dataset Details

In order to train a perceptual distance metric, datasets with perceptual judgments are used. The perceptual judgments are of two types. The first is the two-alternative forced choice (2AFC) test, which asks which of two variations of a reference image is more similar to it. To validate the 2AFC test results, a second test, the just noticeable difference (JND) test, is performed. In the JND test, participants are asked whether the reference image and one of the variations are the same or different. BAPPS (Zhang et al., 2018) and NIGHT (Fu et al., 2023) are two datasets organized with the 2AFC and JND judgments. The JND section of the NIGHT dataset has not been released yet, so our evaluations are based only on the 2AFC score. Figure 5 shows some instances from the NIGHT dataset; the reference is located in the middle and the two variations are on the left and right. The reference images are sampled from well-known datasets including ImageNet-1k.

Figure 5: Some instances from the NIGHT dataset. The middle image is the reference image, and the right and left images are the $x_0$ and $x_1$ distortions, respectively.

Appendix C Certified Robustness on BAPPS Dataset

In this section, we aim to show that LipSim generalizes to the other available 2AFC dataset, BAPPS (Zhang et al., 2018). BAPPS has the same structure as the NIGHT dataset, as explained in Appendix B. However, the left and right images have been generated by adding distortions to the reference image, and the label determines which of the distortions is more similar to the reference. Therefore, the data distributions of the NIGHT and BAPPS datasets are totally different. To provide certification for the BAPPS dataset, the labels, which lie between 0 and 1, are rounded to obtain binary labels for all samples. We then leverage the fine-tuned versions of LipSim to provide certificates on the BAPPS dataset. Although LipSim is fine-tuned on the NIGHT dataset and the distribution of images differs between the two datasets, it achieves good certified scores on the BAPPS dataset (for example, 31.74 at $\frac{36}{255}$ with a hinge-loss margin of 0.4, against 0.0 for DreamSim), as reported in Table 3.

Table 3: Certified scores of LipSim on BAPPS dataset.

| Metric | Margin in Hinge Loss | Natural Score | Certified Score ($\frac{36}{255}$) | Certified Score ($\frac{72}{255}$) | Certified Score ($\frac{108}{255}$) |
|---|---|---|---|---|---|
| DreamSim | - | 78.47 | 0.0 | 0.0 | 0.0 |
| LipSim | 0.2 | 73.47 | 30.0 | 12.96 | 5.33 |
| LipSim | 0.4 | 74.31 | 31.74 | 15.19 | 7.0 |
| LipSim | 0.5 | 74.29 | 31.20 | 15.07 | 6.77 |
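As a rough illustration of how a certified score of this kind could be computed, the sketch below binarizes soft labels and counts the samples whose margin satisfies the bound derived in Appendix A, $M_{H,x} \geq \varepsilon \|f(x_0) - f(x_1)\|$. It operates on precomputed unit-norm embeddings. The function name, the tie-breaking rule at a label of exactly 0.5, and the four hand-made samples are assumptions for illustration only; they are not taken from the BAPPS data or the LipSim code.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def certified_score(samples, eps):
    """Fraction of samples classified correctly with a certificate at radius eps.

    Each sample is (fx, fx0, fx1, label): unit-norm embeddings of the reference
    and the two distortions, plus a soft label in [0, 1].
    """
    certified = 0
    for fx, fx0, fx1, label in samples:
        # Assumption: labels are rounded to binary, with 0.5 mapped to class 0.
        target = 1 if label > 0.5 else 0
        d0 = 1.0 - dot(fx, fx0)  # cosine distance to x0 (unit-norm embeddings)
        d1 = 1.0 - dot(fx, fx1)  # cosine distance to x1
        m = d0 - d1              # margin in favor of class 1
        if target == 0:
            m = -m               # symmetric case
        gap = norm([a - b for a, b in zip(fx0, fx1)])
        if m >= eps * gap:
            certified += 1
    return certified / len(samples)


# Hypothetical 2-D embeddings, chosen only to exercise each branch.
samples = [
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.0], 0.8),  # correct, large margin
    ([0.6, 0.8], [1.0, 0.0], [0.0, 1.0], 0.9),  # correct, small margin
    ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.6),  # misclassified
    ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.2),  # correct, target class 0
]

for eps in (0.0, 36 / 255, 72 / 255, 108 / 255):
    print(eps, certified_score(samples, eps))
```

With `eps = 0.0` the function reduces to plain accuracy on the binarized labels (counting zero-margin ties as correct), which corresponds to a natural score.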

Appendix D Additional Empirical Results for LipSim

In this section, we perform a diverse set of empirical analyses on LipSim. First, we revisit the empirical robustness of LipSim using different adversarial attacks. Second, we compare LipSim with a recent empirical defense called R-LPIPS (Ghazanfari et al., 2023) on both the NIGHT and BAPPS datasets. We then look into the general robustness property and perform stronger attacks directly on the metric. Afterward, we present another application for LipSim as a perceptual metric, the KNN task. Finally, we compare LipSim with other metrics in terms of natural score.

D.1 Empirical Robustness Results

In this part, the empirical robustness of LipSim is evaluated using $\ell_\infty$-APGD, $\ell_\infty$-PGD and the Momentum Iterative Attack (Dong et al., 2018). The cross-entropy loss as defined in Equation 13 is used for the optimization. In the case of $\ell_\infty$-APGD (Table 4), LipSim outperforms all metrics for the entire set of perturbation budgets. For $\ell_\infty$-PGD, the performance of DreamSim is better for $\epsilon = 0.01$; however, LipSim keeps a high and stable score under PGD attacks with different perturbation sizes, which demonstrates the stability of LipSim. The results for the Momentum Iterative Attack are reported in Table 5 and are in line with the other results presented in the paper in terms of the empirical robustness of LipSim towards adversarial attacks.

Table 4: Alignment on NIGHT dataset for original and perturbed images using APGD. In this experiment, different feature extractors are employed in combination with cosine distance to calculate the distance between reference images and distorted images, and the perturbation is only applied to the reference images. DINO, CLIP, and OpenCLIP are ViT-based feature extractors, DreamSim Ensemble is a concatenation of all these three feature extractors and LipSim Backbone is a 1-Lipschitz network that is trained from scratch using the knowledge distillation approach.

| Metric/Embedding | Natural Score | $\ell_\infty$-APGD ($\epsilon = 0.01$) | $\ell_\infty$-APGD ($\epsilon = 0.02$) | $\ell_\infty$-APGD ($\epsilon = 0.03$) |
|---|---|---|---|---|
| DINO | 94.52 | 14.91 | 1.42 | 0.21 |
| CLIP | 93.91 | 0.93 | 0.05 | 0.00 |
| OpenCLIP | 95.45 | 7.89 | 0.76 | 0.05 |
| DreamSim | 96.16 | 2.24 | 0.10 | 0.05 |
| LipSim | 85.58 | 62.28 | 33.66 | 15.19 |

Table 5: Alignment on NIGHT dataset for original and perturbed images using Momentum Iterative Attack.

| Metric | Natural Score | $\ell_2$-MIA ($\epsilon = 0.5$) | $\ell_2$-MIA ($\epsilon = 1$) | $\ell_2$-MIA ($\epsilon = 2$) | $\ell_\infty$-MIA ($\epsilon = 0.01$) | $\ell_\infty$-MIA ($\epsilon = 0.02$) | $\ell_\infty$-MIA ($\epsilon = 0.03$) |
|---|---|---|---|---|---|---|---|
| DreamSim | 96.16 | 61.79 | 52.85 | 52.69 | 2.08 | 0.05 | 0.0 |
| LipSim | 85.58 | 82.79 | 79.99 | 80.10 | 62.45 | 34.38 | 15.84 |

Figure 6: Distribution of $d(x, x + \delta)$ where $\delta$ is generated using the $\ell_2$-PGD attack with $\epsilon = 3.0$.

D.2 General Robustness of LipSim

In order to evaluate the general robustness of LipSim in comparison with DreamSim, we optimize the MSE loss defined in Equation 14 employing the $\ell_2$-PGD attack for LipSim and DreamSim separately. The distribution of $d(x, x + \delta)$ is shown in Figure 6. The difference between this figure and the histogram in Figure 3(b) is that we have chosen a larger perturbation budget, $\epsilon = 3$, to demonstrate that LipSim shows general robustness even for larger perturbations.

D.3 Comparison with an Empirical Defense

As discussed in the Related Work, there exist a few papers on defenses for perceptual metrics. R-LPIPS (Ghazanfari et al., 2023) is a recent paper that proposes an empirical defense for the LPIPS (Zhang et al., 2018) perceptual similarity metric by leveraging adversarial training and $\ell_\infty$-PGD attacks. For our evaluation, we perform $\ell_\infty$-PGD and $\ell_2$-MIA (Momentum Iterative Attack) and provide an empirical robustness score on the BAPPS dataset as well as the NIGHT dataset in Table 6. Finally, we emphasize that R-LPIPS is trained on the BAPPS dataset while LipSim is fine-tuned on the NIGHT dataset, and that LipSim provides certified scores while R-LPIPS does not come with any provable guarantees.

Table 6: Alignment on NIGHT and BAPPS dataset for original and perturbed images with LipSim and R-LPIPS metrics.

| Metric | Dataset | Natural Score | $\ell_2$-MIA ($\epsilon = 1.0$) | $\ell_\infty$-PGD ($\epsilon = 0.03$) |
|---|---|---|---|---|
| R-LPIPS | NIGHT | 70.56 | 58.50 | 32.46 |
| R-LPIPS | BAPPS | 80.25 | 72.38 | 70.94 |
| LipSim | NIGHT | 85.58 | 79.99 | 75.27 |
| LipSim | BAPPS | 73.47 | 60.09 | 42.77 |

D.4 More Applications: KNN Task

For the applications of LipSim, we presented image retrieval in Section 5; as a second application, we add the KNN task here. KNN (the k-nearest neighbors algorithm) is a zero-shot classification task that classifies test images based on the proximity of their feature vectors to the training images' feature vectors. We performed our experiment on Imagenette (https://github.com/fastai/imagenette), a version of the ImageNet dataset with 10 classes, for $k \in \{10, 20\}$. The accuracies of LipSim and DreamSim on the KNN task are reported in Table 7. LipSim achieves more than $85\%$ Top-1 accuracy on the classification task by leveraging its robust embeddings, and its Top-5 accuracy is also very close to that of DreamSim.

Table 7: LipSim and DreamSim comparison for the KNN task on Imagenette dataset.

| Metric | 10-NN Top 1 | 10-NN Top 5 | 20-NN Top 1 | 20-NN Top 5 |
|---|---|---|---|---|
| DreamSim | 99.03 | 99.82 | 98.82 | 99.89 |
| LipSim | 85.32 | 97.20 | 85.35 | 98.09 |
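The KNN procedure described above can be sketched as follows. This is a minimal illustration on precomputed feature vectors, assuming cosine similarity as the proximity measure and a majority vote among the $k$ nearest training embeddings. The function names and the tiny 2-D data are hypothetical, and only Top-1 accuracy is shown, since this section does not specify how the Top-5 figure is computed.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_predict(query, train_embeddings, train_labels, k):
    """Majority vote among the k training embeddings most similar to the query."""
    ranked = sorted(
        range(len(train_embeddings)),
        key=lambda i: cosine_similarity(query, train_embeddings[i]),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(test_embeddings, test_labels, train_embeddings, train_labels, k):
    """Top-1 accuracy of the k-NN classifier on the test embeddings."""
    correct = sum(
        knn_predict(q, train_embeddings, train_labels, k) == y
        for q, y in zip(test_embeddings, test_labels)
    )
    return correct / len(test_labels)


# Hypothetical 2-D embeddings for two classes, "a" and "b".
train_embeddings = [
    [1.0, 0.1], [0.9, 0.0], [1.0, -0.2],
    [0.1, 1.0], [0.0, 0.8], [-0.1, 1.0],
]
train_labels = ["a", "a", "a", "b", "b", "b"]
test_embeddings = [[0.8, 0.1], [0.2, 0.9]]
test_labels = ["a", "b"]

print(knn_accuracy(test_embeddings, test_labels, train_embeddings, train_labels, k=3))
```

In practice the embeddings would come from the metric's feature extractor, and the sort would be replaced by a vectorized or approximate nearest-neighbor search for datasets of realistic size.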