1 Introduction

Quantum machine learning, as highlighted in Biamonte et al. (2017); Schuld et al. (2015); Ciliberto et al. (2018); Lloyd et al. (2013), represents a promising research direction at the intersection of quantum computing and artificial intelligence. Within this realm, the utilization of quantum computers promises to significantly boost machine learning algorithms by leveraging their innate parallel attributes, thereby showcasing quantum advantages that surpass classical algorithms, as suggested by Harrow and Montanaro (2017). Due to the substantial collaborative endeavors of academia and industry, contemporary quantum devices, often referred to as noisy intermediate-scale quantum (NISQ) devices (Preskill 2018), are now capable of demonstrating quantum advantages in specific meticulously crafted tasks (Arute et al. 2019; Zhong et al. 2020). An emerging research focus lies in leveraging near-term quantum devices for practical machine learning applications, with a prominent approach being hybrid quantum-classical algorithms (Bharti et al. 2022; Cerezo et al. 2021), also referred to as variational quantum algorithms. These algorithms typically employ a classical optimizer to refine quantum neural networks (QNNs), allocating complex tasks to quantum computers while assigning simpler ones to classical computers. In typical quantum machine learning scenarios, a quantum circuit utilized in variational quantum algorithms is commonly divided into two components: a data encoding circuit and a QNN. On the one hand, enhancing these algorithms’ efficacy in handling practical tasks involves the development of various QNN architectures. Numerous architectures, including strongly entangling circuit architectures (Schuld et al. 2020), tree-tensor networks (Grant et al. 2018), quantum convolutional neural networks (Cong et al. 2019), and even automatically searched architectures (Ostaszewski et al. 2021a, b; Zhang et al. 2022; Du et al. 2020), have been proposed. 
On the other hand, careful design of the encoding circuit is crucial, as it can significantly impact the generalization performance of these algorithms.

Encoding classical information into quantum data is a crucial step, as it directly impacts the performance of quantum machine learning algorithms. These algorithms are designed to optimize objective functions, such as classification, using encoded data. However, quantum encoding poses significant challenges, especially on near-term quantum devices, as highlighted in previous research (Biamonte et al. 2017). While phase and amplitude encoding are foundational approaches, recent advancements have popularized parameterized quantum circuits (PQCs) as the most practical strategy for encoding on NISQ devices (Benedetti et al. 2019). Nevertheless, despite the prevalence of PQCs, it is essential to utilize the basic encoding methods at the first step, such as phase and amplitude encoding. An important question arises regarding whether these encoding strategies guarantee preserving fundamental properties or characteristics of classical data in its quantum form.

Contributions of this work

This paper has three key contributions. First, we identify a challenge with current visual encoding strategies regarding the preservation of information during the transition from classical to quantum data. Specifically, we observe distinct characteristics between feature spaces in quantum computing compared to their classical counterparts, resulting in lower performance of quantum machine learning algorithms than expected. Second, we introduce a simple but efficient novel training approach to generate classical features conducive to quantum machines post-encoding. This method holds promise for substantially enhancing quantum machine learning algorithms. Finally, our empirical experiments demonstrate the state-of-the-art performance of quantum machine learning across diverse benchmarks.

2 Related work

2.1 Quantum computer vision

Several quantum techniques are available for computer vision tasks, such as recognition and classification (O’Malley et al. 2018; Cavallaro et al. 2020), object tracking (Li and Ghosh 2020), transformation estimation (Golyanik and Theobalt 2020), shape alignment and matching (Noormandipour and Wang 2022; Benkner et al. 2021, 2020), permutation synchronization (Birdal et al. 2021), visual clustering (Nguyen et al. 2023), and motion segmentation (Arrigoni et al. 2022). Via Adiabatic Quantum Computing (AQC), O’Malley et al. (2018) applied binary matrix factorization to extract features of facial images. In contrast, Li and Ghosh (2020) reduced redundant detections in multi-object detection. Dendukuri and Luu (2018) presented the image representation using quantum information to reduce the computational resources of classical computers. Cavallaro et al. (2020) presented multi-spectral image classification using quantum SVM. Golyanik and Theobalt (2020) introduced correspondence problems for point sets using AQC to align the rotations between pairs of point sets. Meanwhile, Noormandipour and Wang (2022) proposed a parameterized quantum circuit learning method for the point set matching problem. Using AQC to solve the formulated Quadratic Unconstrained Binary Optimization (QUBO), Nguyen et al. (2023) proposed an unsupervised visual clustering method optimizing the distances between clusters. In contrast, Arrigoni et al. (2022) optimized the matching motions of key points between consecutive frames.

2.2 Hybrid classical-quantum machine learning

Date et al. (2020) combined a classical high-performance computing model with an Adiabatic Quantum Processor for a classification task on the MNIST dataset. Their experiment evaluated two classification models, i.e., the Deep Belief Network (DBN) and the Restricted Boltzmann Machine (RBM). They showed that classical computing performs heavy matrix computations efficiently, whereas the sampling task is better suited to quantum computing, since quantum mechanical processes generate truly random samples. Barkoutsos et al. (2020) introduced an improved platform for combinatorial optimization problems using hybrid classical-quantum variational circuits. It was empirically shown that this approach leads to faster convergence to better solutions for all combinatorial optimization problems tested, on both classical simulation and quantum hardware. Romero and Aspuru-Guzik (2021) presented generative modeling of continuous probability distributions via a hybrid quantum-classical model. Inspired by convolutional neural networks, Liu et al. (2021) proposed a hybrid quantum-classical convolutional neural network that uses the quantum advantage to enhance the feature mapping process, the most computationally intensive part of convolutional neural networks. The feature map extracted by a parameterized quantum circuit can detect the correlations of neighboring data points in a very large space.

3 Background

3.1 Quantum basics

This section provides a concise introduction to the fundamental concepts in quantum computing that are essential for this paper. For a comprehensive review, we refer the reader to Nielsen and Chuang (2001). In quantum computing, quantum information is typically expressed through n-qubit (pure) quantum states within the Hilbert space \({{\mathbb {C}}}^{2^n}\). Specifically, a pure quantum state can be denoted by a unit vector \(| \psi \rangle \in {{\mathbb {C}}}^{2^n}\) (or \(\langle \psi |\)), where the ket notation \(| \psi \rangle \) signifies a column vector, and the bra notation \(\langle \psi |=| \psi \rangle ^\textsf{T}\), with \(\textsf{T}\) indicating the conjugate transpose, represents a row vector.

Mathematically, the evolution of a pure quantum state \(| \psi \rangle \) is described by a quantum circuit, often called a quantum gate. It is represented as \(| \psi ^\prime \rangle =U| \psi \rangle \), where U denotes the unitary operator (matrix) signifying the quantum circuit, and \(| \psi ^\prime \rangle \) represents the quantum state after the evolution. Standard single-qubit quantum gates include the Pauli operators:

$$\begin{aligned} X := \begin{bmatrix} 0 & 1\\ 1 & 0\end{bmatrix}, Y := \begin{bmatrix} 0 & -i\\ i & 0\end{bmatrix}, Z := \begin{bmatrix} 1 & 0\\ 0 & -1\end{bmatrix}, \end{aligned}$$
(1)

The corresponding rotation gates are denoted by \(R_P(\theta )= \text {exp}(-i\theta P/2)\) \( = \cos \frac{\theta }{2} I -i\sin \frac{\theta }{2}P\), where \(\theta \in [0,2\pi )\) is the rotation angle and \(P\in \{X,Y,Z\}\) indicates rotation around the X, Y, or Z axis. In this paper, multiple-qubit quantum gates mainly include the identity gate I, the CNOT gate, and tensor products of single-qubit gates, e.g., \(Z\otimes Z\), \(Z\otimes I\), \(Z^{\otimes n}\), and so on.
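The closed form of the rotation gates can be checked numerically. The following is a minimal sketch in NumPy (the function names are ours, chosen for illustration) that compares the closed form against the matrix exponential, computed here through the eigendecomposition of the Hermitian Pauli matrix.

```python
import numpy as np

# Pauli matrices from Eq. 1.
I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def rotation_closed_form(P, theta):
    """R_P(theta) = cos(theta/2) I - i sin(theta/2) P."""
    return np.cos(theta / 2) * I2 - 1j * np.sin(theta / 2) * P

def rotation_by_exponential(P, theta):
    """R_P(theta) = exp(-i theta P / 2), via the eigendecomposition of Hermitian P."""
    eigvals, eigvecs = np.linalg.eigh(P)
    phases = np.exp(-1j * theta * eigvals / 2)
    return eigvecs @ np.diag(phases) @ eigvecs.conj().T

theta = 0.7
for P in (X, Y, Z):
    # Both expressions describe the same unitary.
    assert np.allclose(rotation_closed_form(P, theta), rotation_by_exponential(P, theta))
```

The identity holds because every Pauli matrix squares to the identity, so the power series of the exponential splits into a cosine part (even powers) and a sine part (odd powers).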

Quantum measurement is a method for extracting classical information from a quantum state. For example, given a quantum state \(| \psi \rangle \) and an observable H, one can design quantum measurements to obtain the information \(\langle \psi | H | \psi \rangle \). This study concentrates on hardware-efficient Pauli measurements, where H is set as Pauli operators or their tensor products. For instance, one might choose \(Z_1 = Z\otimes I^{\otimes (n-1)}\), \(X_2 = I\otimes X\otimes I^{\otimes (n-2)}\), \(Z_1Z_2 = Z\otimes Z\otimes I^{\otimes (n-2)}\), etc., with a total of n qubits.

3.2 Limitations in current quantum encoding methods

Let \(\textbf{v} \in {{\mathbb {R}}}^{d}\) be a typical \(d\)-dimensional vector on a classical computer. We denote by \(\mathcal {E}(\textbf{v})\) a quantum encoding function that transforms the vector \(\textbf{v}\) into a quantum state vector \(| \psi \rangle \in {{\mathbb {C}}}^{2^n}\) in Hilbert space, where n is the number of qubits.

$$\begin{aligned} | \psi \rangle = \mathcal {E}(\textbf{v}) \end{aligned}$$
(2)
Fig. 1

Overview of the hybrid quantum system, distinguishing the components that run on the classical machine from those that run on the quantum machine. The dashed outline marks the focus of this paper. Best viewed in color

Specifically, \(\mathcal {E}\) can be amplitude encoding, phase encoding, or a PQC. It is important to note that \(| \psi \rangle \) represents the state of the qubits; to use it further in a quantum machine learning function, information must be extracted from this quantum state. To accomplish this, an observable denoted as \(\mathcal {O}(| \psi \rangle )\) is utilized. In particular, the observable \(\mathcal {O}\) measures the state of every single qubit. Let \(\textbf{q} = [q(0),\dots ,q(i),\dots ,q({n-1})] \in {{\mathbb {R}}}^{n}\) be the vector of information measured by \(\mathcal {O}\), where q(i) is the measurement of the \(i^{th}\) qubit, formulated as in Eq. 3.

$$\begin{aligned} q(i) = \langle {\psi }\!|\mathcal {O}_i\!|{\psi }\rangle \end{aligned}$$
(3)

In the equation above, a different observable \(\mathcal {O}_i\) is applied for each qubit. In particular, \(\mathcal {O}_i\) is a Hermitian operator represented by a matrix (for Pauli operators, it is also unitary). With \(P \in \{X, Y, Z\}\) a Pauli operator, \(\mathcal {O}_i\) is given by Eq. 4.

$$\begin{aligned} \mathcal {O}_i = I^{\otimes i} \otimes P \otimes I^{\otimes (n-i-1)} \end{aligned}$$
(4)

According to Eq. 4, we can measure the state of a qubit along any of the X, Y, or Z axes.
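Eqs. 3 and 4 can be illustrated with explicit matrices. The sketch below is for illustration only: the helper names `local_observable` and `measure` are ours, and we assume qubit 0 is the leftmost tensor factor, matching the ordering in Eq. 4.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
PAULI = {
    "X": np.array([[0, 1], [1, 0]], dtype=complex),
    "Y": np.array([[0, -1j], [1j, 0]], dtype=complex),
    "Z": np.array([[1, 0], [0, -1]], dtype=complex),
}

def local_observable(i, n, pauli="Z"):
    """O_i = I^{(x) i} (x) P (x) I^{(x)(n-i-1)} for a 0-indexed qubit i (Eq. 4)."""
    factors = [I2] * i + [PAULI[pauli]] + [I2] * (n - i - 1)
    return reduce(np.kron, factors)

def measure(psi, pauli="Z"):
    """q(i) = <psi| O_i |psi> for every qubit (Eq. 3); returns a real vector of length n."""
    n = psi.shape[0].bit_length() - 1  # psi has 2^n entries
    return np.array(
        [np.vdot(psi, local_observable(i, n, pauli) @ psi).real for i in range(n)]
    )

# Example: the two-qubit basis state |01>.
psi = np.zeros(4, dtype=complex)
psi[1] = 1.0
q = measure(psi, "Z")  # qubit 0 is in |0>, qubit 1 is in |1>
```

Building each \(\mathcal {O}_i\) as a dense \(2^n \times 2^n\) matrix is only practical for a handful of qubits; it is used here because it mirrors the equations directly.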

In summary, the relation between quantum information vector \(\textbf{q}\) and classical information vector \(\textbf{v}\) is represented as Eq. 5.

$$\begin{aligned} \textbf{v} \in {{\mathbb {R}}}^{d} \overset{\mathcal {E}(\textbf{v})}{\underset{\text {Quantum encoding}}{\longmapsto }} | \psi \rangle \in {{\mathbb {R}}}^{2^n} = {{\mathbb {R}}}^{d} \overset{\mathcal {O}(| \psi \rangle )}{\underset{\text {Measurement}}{\longmapsto }} \textbf{q} \in {{\mathbb {R}}}^{n} \end{aligned}$$
(5)

Mathematically, we can define \(\mathcal {Q}\) as the function to map \(\textbf{v} \rightarrow \textbf{q}\) as Eq. 6.

$$\begin{aligned} \textbf{q} = \mathcal {Q}(\textbf{v}, \mathcal {E}, \mathcal {O}) \end{aligned}$$
(6)

The details of the proposed framework are demonstrated in Fig. 1.
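To make the map \(\mathcal {Q}\) of Eq. 6 concrete, the following sketch assumes amplitude encoding for \(\mathcal {E}\) (with zero-padding to the next power of two, which is our assumption) and a per-qubit Pauli-Z readout for \(\mathcal {O}\). The function names are hypothetical. The Pauli-Z expectations are computed from marginal probabilities, which avoids building \(2^n \times 2^n\) matrices.

```python
import numpy as np

def amplitude_encode(v):
    """Amplitude encoding: zero-pad a non-zero vector v to length 2^n and normalize it."""
    v = np.asarray(v, dtype=float)
    n = max(1, (v.size - 1).bit_length())  # number of qubits for d amplitudes
    psi = np.zeros(2 ** n)
    psi[: v.size] = v
    return psi / np.linalg.norm(psi), n

def pauli_z_readout(psi, n):
    """q(i) = <psi| Z_i |psi> = P(qubit i is 0) - P(qubit i is 1)."""
    probs = np.abs(psi.reshape([2] * n)) ** 2  # axis i indexes qubit i
    q = np.empty(n)
    for i in range(n):
        other_axes = tuple(a for a in range(n) if a != i)
        p0, p1 = probs.sum(axis=other_axes)
        q[i] = p0 - p1
    return q

def quantum_map(v):
    """Q(v, E, O) with E = amplitude encoding and O = per-qubit Pauli-Z."""
    psi, n = amplitude_encode(v)
    return pauli_z_readout(psi, n)

v = np.random.default_rng(0).normal(size=512)  # classical feature with d = 512
q = quantum_map(v)                              # quantum readout with n = 9 entries
```

Under these assumptions a 512-dimensional feature is reduced to 9 real numbers in \([-1, 1]\), which makes the dimension-reduction view of \(\mathcal {Q}\) explicit.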

Proposition 1

Consider two different quantum state vectors, denoted as \(|\psi _1\rangle \) and \(|\psi _2\rangle \), and their corresponding quantum information vectors \(\textbf{q}_1\) and \(\textbf{q}_2\). Then \(\langle \psi _1|\psi _2\rangle \ne \textbf{q}_1^\textsf{T} \textbf{q}_2\) for any Pauli observable and any quantum encoding strategy.

Proof

As \(q(i) = \langle \psi |\mathcal {O}_i| \psi \rangle \), we have:

$$\begin{aligned} \begin{aligned} \textbf{q}_1^\textsf{T} \textbf{q}_2&= \sum _{i=1}^n q_1(i)q_2(i) \\&= \sum _{i=1}^n \langle \psi _1| \mathcal {O}_i |\psi _1\rangle \langle \psi _2| \mathcal {O}_i |\psi _2\rangle \\&= \langle \psi _1| \left( \sum _{i=1}^n \mathcal {O}_i |\psi _1\rangle \langle \psi _2| \mathcal {O}_i \right) |\psi _2\rangle , \\&= \langle \psi _1| A |\psi _2\rangle , \end{aligned} \end{aligned}$$
(7)

where \(A = \sum _{i=1}^n \left( \mathcal {O}_i |\psi _1\rangle \langle \psi _2| \mathcal {O}_i \right) \). We have to prove that \(A \ne I\). That is true because:

$$\begin{aligned} \begin{aligned} \text {tr}(A)&= \sum _{i=1}^n \text {tr}(\mathcal {O}_i |\psi _1\rangle \langle \psi _2| \mathcal {O}_i) \\&= \sum _{i=1}^n \text {tr}(\mathcal {O}_i \mathcal {O}_i |\psi _1\rangle \langle \psi _2|) \\&= \sum _{i=1}^n \text {tr}(|\psi _1\rangle \langle \psi _2|) \end{aligned} \end{aligned}$$
(8)

Since \(|\psi _1\rangle \ne |\psi _2\rangle \) are unit vectors, the Cauchy-Schwarz inequality gives \(|\langle \psi _2|\psi _1\rangle | \le 1\), with \(\langle \psi _2|\psi _1\rangle = \text {tr}(|\psi _1\rangle \langle \psi _2|) \ne 1\). Eq. 8 therefore yields \(|\text {tr}(A)| = n\,|\langle \psi _2|\psi _1\rangle | \le n\), whereas the identity on \({{\mathbb {C}}}^{2^n}\) has \(\text {tr}(I) = 2^n > n\). Hence \(\text {tr}(A) \ne \text {tr}(I)\) and \(A \ne I\), which proves Proposition 1. This proposition indicates that no combination of Pauli observable and quantum encoding strategy preserves the information when we transform the classical features into quantum features. \(\square \)
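The quantities in the proof can be inspected numerically. The sketch below is an illustrative check rather than part of the proof: it assumes Pauli-Z observables and two random three-qubit states, and it evaluates the overlap, the product of readout vectors, and the operator \(A\) defined after Eq. 7.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2)
Z = np.diag([1.0, -1.0])

def z_observable(i, n):
    """Pauli-Z on qubit i, identity elsewhere (Eq. 4 with P = Z)."""
    return reduce(np.kron, [I2] * i + [Z] + [I2] * (n - i - 1))

def random_state(n, rng):
    """A random normalized complex state on n qubits."""
    psi = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
    return psi / np.linalg.norm(psi)

n = 3
rng = np.random.default_rng(0)
psi1, psi2 = random_state(n, rng), random_state(n, rng)
obs = [z_observable(i, n) for i in range(n)]

q1 = np.array([np.vdot(psi1, O @ psi1).real for O in obs])
q2 = np.array([np.vdot(psi2, O @ psi2).real for O in obs])

# A = sum_i O_i |psi1><psi2| O_i, as defined after Eq. 7.
A = sum(O @ np.outer(psi1, psi2.conj()) @ O for O in obs)

overlap = np.vdot(psi1, psi2)        # <psi1|psi2>
readout_product = q1 @ q2            # q1^T q2
sandwich = np.vdot(psi1, A @ psi2)   # <psi1| A |psi2>, equal to q1^T q2 by Eq. 7
trace_A = np.trace(A)                # equal to n * <psi2|psi1> by Eq. 8
```

For generic states the overlap is complex while the readout product is real, so the two cannot coincide, which is the gap that Proposition 1 formalizes.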

Fig. 2

Limitations in current quantum encoding strategies, which result in non-robust feature representations in the quantum feature space, and our proposed QIP solution. Panel (b) shows the encoded quantum features. Panel (c) presents our proposed method for enhancing the discriminability of quantum features

3.3 Theoretical analysis and problem visualization

In this section, we first define the term information as the correlation between pairs of vectors.

Theoretical analysis

The goal of encoding \(\mathcal {E}\) is to transform a classical feature \(\textbf{v} \in \mathbb {R}^d\) into a quantum state \(| \psi \rangle \in \mathbb {R}^d\) using fewer bits while retaining as much of the classical information as possible. Assuming \(\textbf{v}\) is a normalized vector and \(\mathcal {E}\) represents amplitude encoding, the preservation of information is evident as \(\textbf{v} = | \psi \rangle \). Additionally, since \(\mathcal {E}\) requires fewer than d qubits (\(n < d\)), it appears to be the optimal choice given these constraints.

However, the limitation of amplitude encoding is its potential unsuitability for many problems. To address this problem, Parametrized Quantum Circuits (PQC) have recently become the most prevalent encoding strategy. PQC incorporates trainable parameters that can be optimized during training, reducing dependencies on specific problems. However, information is not guaranteed to be preserved when representing features in Hilbert spaces of \(| \psi \rangle \). Additionally, Proposition 1 suggests that no observables guarantee uniform discriminability between the features \(| \psi \rangle \) and \(\textbf{q}\). Considering these factors, current encoding strategies fail to ensure the preservation of information when mapping classical features to quantum features, thus creating an information gap.

From a different angle, if we temporarily set aside quantum theory, Eq. 6 reveals that \(\mathcal {Q}\) serves as a dimension reduction function, mapping \(\mathbb {R}^d\) to \(\mathbb {R}^n\) where \(n \ll d\). As far as we know, no dimension reduction algorithm perfectly preserves pairwise cosine distances between vectors. Even if such an algorithm existed, extending its theory to the quantum realm remains an open question.

Problem visualization

Considering the task of face clustering (Nguyen et al. 2021), we assume that a model \(\mathcal {M}(x)\) (Deng et al. 2019) is trained with metric loss functions (Wang et al. 2018; Deng et al. 2019) to map a facial image x into a high-dimensional feature space. This mapping ensures that similar faces are clustered closely while being separated from faces of different identities. As discussed in Nguyen et al. (2021), recent studies have made significant progress on large-scale clustering challenges within classical machine learning. These methods extensively exploit the discriminative nature of facial features, mainly relying on cosine distance in their algorithmic design. However, envisioning a quantum counterpart that mirrors these methods reveals a crucial limitation. Despite their potential, quantum algorithms struggle to match the performance of classical ones due to the absence of ideal strategies for encoding classical information into quantum formats, as shown in Proposition 1.

We illustrate the issue in Fig. 2. Specifically, we employ a face recognition model, ResNet50 (He et al. 2016), trained with ArcFace (Deng et al. 2019) on the MSCeleb-1M database (Guo et al. 2016) using classical machine learning techniques. We randomly select subjects from the hold-out set and extract their facial features. Subsequently, we compute the corresponding quantum information of these features according to Eq. 5. The boundary between these subjects appears blurred from the quantum machine's perspective, whereas it remains distinct in the classical one. Some samples that are close together in the classical feature space appear far apart in the quantum space, making it difficult for quantum algorithms to determine the boundary.

4 Our proposed approach

4.1 Problem formulation

Let \(x \in {{\mathbb {R}}}^{h \times w \times c}\) denote the input image, where h, w, and c are the image height, width, and number of channels, respectively. Let \(\textbf{v} = \mathcal {M}(x)\) be the deep features extracted by a model \(\mathcal {M}\). Let \(\mathcal {K}\) be the function that measures the information gap between the classical vector \(\textbf{v}\) and its corresponding quantum vector \(\textbf{q}\). Our goal can be presented as in Eq. 9.

$$\begin{aligned} \text {min} \quad \mathcal {K}(\textbf{v}, \textbf{q}) = \mathcal {K}(\mathcal {M}(x), \mathcal {Q}(\mathcal {M}(x), \mathcal {E}, \mathcal {O})) \quad \text {w.r.t} \quad \mathcal {E} \text {,} \mathcal {O} \quad \text {and} \quad \textbf{v} = \mathcal {M}(x) \end{aligned}$$
(9)

4.2 Quantum information preserving loss

In Eq. 9, only \(\mathcal {M}\) and \(\mathcal {E}\) are considered trainable. Theoretically, we can optimize either \(\mathcal {M}\) or \(\mathcal {E}\) to minimize Eq. 9. In this study, however, we concentrate on training \(\mathcal {M}\) since, as demonstrated in Eq. 5, \(\textbf{q} = (\mathcal {O} \circ \mathcal {E} \circ \mathcal {M})(x)\), indicating that \(\mathcal {M}\) initiates the quantum encoding process, making it the most critical component to address. Let \(\mathcal {F}\) represent the task-specific layer used to train the feature representation of x. \(\mathcal {M}\) can be optimized with the objective function in Eq. 10.

$$\begin{aligned} \theta ^*_{\mathcal {M}} = \arg \min _{\theta _{\mathcal {M}}} \mathbb {E}_{x_i \sim p(x_i)} \left[ \mathcal {L} ( \mathcal {F}(\mathcal {M}(x_i)), \hat{y}_i) \right] \end{aligned}$$
(10)

Here, \(\hat{y}_i\) and \(\mathcal {L}\) denote the ground truth and the loss function, respectively. The common approach (e.g., Deng et al. 2009; He et al. 2016; Liu et al. 2022) designs \(\mathcal {F}\) as a fully connected layer and employs loss functions such as cross-entropy or metric losses (e.g., Deng et al. 2019; Wang et al. 2018) to train a classification model. For simplicity, we choose cross-entropy as \(\mathcal {L}\). Note, however, that the formulation also applies to metric loss functions such as ArcFace or CosFace.

$$\begin{aligned} \mathcal {L} = - \frac{1}{N} \sum _{i=1}^N \text {log} \frac{e^{W_{\hat{y}_i}^\textsf{T} \textbf{v}_i + b_{\hat{y}_i}}}{\sum _{j=1}^C e^{W_j^\textsf{T} \textbf{v}_i + b_j}} \end{aligned}$$
(11)

where \(W_j \in {{\mathbb {R}}}^d\) denotes the \(j^{th}\) column of the weight \(W \in {{\mathbb {R}}}^{d \times C}\), C is the number of classes, and \(b_j \in {{\mathbb {R}}}\) is the bias term. For simplicity, we fix \(b_j = 0\) as in Wang et al. (2018). The loss then becomes \(\mathcal {L} = - \frac{1}{N} \sum _{i=1}^N \text {log} \frac{e^{W_{\hat{y}_i}^\textsf{T} \textbf{v}_i}}{\sum _{j=1}^C e^{W_j^\textsf{T} \textbf{v}_i}}\). Interestingly, \(W_j\) represents a center vector corresponding to class j. The loss function \(\mathcal {L}\) optimizes model \(\mathcal {M}\) so that the vector \(\textbf{v}_i\) aligns closely with \(W_j\) in the feature space if they belong to the same class. Moreover, \(W_j^\textsf{T} \textbf{v}\) is the cosine similarity between the two vectors, since these features are normalized as in Deng et al. (2019); Wang et al. (2018), so the pair plays precisely the roles of \(| \psi _1 \rangle \) and \(| \psi _2 \rangle \) in Proposition 1. Leveraging this property, we can define \(\mathcal {K}\) as the Kullback–Leibler divergence (KL) to minimize the information gap formulated in Eq. 9 as follows:

$$\begin{aligned} \begin{aligned} \mathcal {K}&= \frac{1}{N} \sum _{i=1}^N \text {KL}\left( W^\textsf{T} \textbf{v}_i, S^\textsf{T} \textbf{q}_i \right) \\&= \frac{1}{N} \sum _{i=1}^N \sum _{j=1}^C \text {softmax}(W^\textsf{T} \textbf{v}_i)_j \times \text {log}\frac{\text {softmax}(W^\textsf{T} \textbf{v}_i)_j}{\text {softmax}(S^\textsf{T} \textbf{q}_i)_j} \end{aligned} \end{aligned}$$
(12)

where \(S_j\) is the corresponding quantum information vector of \(W_j\) using Eq. 6. In conclusion, we propose a novel loss function named Quantum Information Preserving Loss to train \(\mathcal {M}\) as follows:

$$\begin{aligned} \theta ^*_{\mathcal {M}} = \arg \min _{\theta _{\mathcal {M}}} \mathbb {E}_{x_i \sim p(x_i)} \left[ -\text {log} \frac{e^{W_{\hat{y}_i}^\textsf{T} \textbf{v}_i}}{\sum _{j=1}^C e^{W_j^\textsf{T} \textbf{v}_i}} \!+\! \lambda \times \text {KL}\left( W^\textsf{T} \textbf{v}_i, S^\textsf{T} \textbf{q}_i \right) \right] \end{aligned}$$
(13)

where \(\lambda \) is the loss factor controlling how much information is preserved. Using this loss function, the model \(\mathcal {M}\) can produce a feature \(\textbf{v}\) that is friendly to the quantum machine, in the sense that it retains as much information as possible after quantum encoding. We also provide the pseudo-code in Algorithm 1.

Algorithm 1

Pseudo-code for the implementation of Quantum Information Preserving Loss.
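Since the body of Algorithm 1 is not reproduced in this text, the following is a reconstruction of the loss from Eqs. 11 to 13 rather than the authors' code. It is a forward-only NumPy sketch under stated assumptions: \(\mathcal {Q}\) is amplitude encoding followed by a per-qubit Pauli-Z readout, and the names `quantum_map`, `log_softmax`, and `qip_loss` are ours. In actual training, the same computation would be written in an automatic-differentiation framework so that gradients reach \(\mathcal {M}\).

```python
import numpy as np

def quantum_map(v):
    """Q(v): amplitude encoding, then a Pauli-Z expectation per qubit (assumed E and O)."""
    v = np.asarray(v, dtype=float)
    n = max(1, (v.size - 1).bit_length())  # qubits needed for d amplitudes
    psi = np.zeros(2 ** n)
    psi[: v.size] = v
    probs = (psi / np.linalg.norm(psi)).reshape([2] * n) ** 2
    q = np.empty(n)
    for i in range(n):
        p0, p1 = probs.sum(axis=tuple(a for a in range(n) if a != i))
        q[i] = p0 - p1
    return q

def log_softmax(z):
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def qip_loss(V, W, labels, lam=0.5):
    """Cross-entropy on W^T v_i plus lam * KL(softmax(W^T v_i) || softmax(S^T q_i))."""
    Q = np.stack([quantum_map(v) for v in V])      # (N, n): quantum vectors q_i
    S = np.stack([quantum_map(w) for w in W.T]).T  # (n, C): quantum vectors S_j of W_j
    log_p = log_softmax(V @ W)                     # classical logits, Eq. 11 with b_j = 0
    log_r = log_softmax(Q @ S)                     # quantum logits
    ce = -log_p[np.arange(len(labels)), labels].mean()
    kl = (np.exp(log_p) * (log_p - log_r)).sum(axis=1).mean()  # Eq. 12
    return ce + lam * kl, ce, kl                   # Eq. 13

rng = np.random.default_rng(0)
N, d, C = 8, 16, 4
V = rng.normal(size=(N, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)      # normalized features v_i
W = rng.normal(size=(d, C))
W /= np.linalg.norm(W, axis=0, keepdims=True)      # normalized class centers W_j
labels = rng.integers(0, C, size=N)
total, ce, kl = qip_loss(V, W, labels, lam=0.5)
```

The KL term is zero only when the class distribution induced by the quantum vectors matches the one induced by the classical features, which is the sense in which the loss encourages information to survive the encoding.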

5 Experiment setup and implementation

Given that Proposition 1 treats information as the relationship between two vectors, i.e., cosine similarity, selecting a model \(\mathcal {M}\) optimized for cosine similarity becomes paramount for problem validation and experimental demonstration. Consequently, this study targets unsupervised clustering tasks, namely face and landmark clustering, as they align well with models trained using cosine-based loss functions. It is important to note that Proposition 1 also applies to similar problems, such as classification.

5.1 Experiment setup

We follow the experimental framework outlined in previous studies (Nguyen et al. 2021; Yang et al. 2019, 2020; Shen et al. 2023; Shin et al. 2023; Shen et al. 2021; Wang et al. 2022; Nguyen et al. 2023b). In essence, our clustering methodology consists of three key stages. First, we train a model \(\mathcal {M}(x)\) to extract features from an image x. Second, the k nearest neighbors algorithm, denoted as \(\textbf{K}(x_i, k)\), is utilized to identify the k most similar neighbors of a given sample \(x_i\), forming a cluster \(\mathbf {\Phi }_i = \textbf{K}(x_i, k)\). Finally, as clusters \(\mathbf {\Phi }_i\) may contain erroneous samples due to challenges such as database anomalies or imperfect feature representations by \(\mathcal {M}\), previous studies have proposed training a model \(\mathcal {N}(\mathbf {\Phi }_i)\) to detect and eliminate these inaccuracies, thereby refining the cluster.
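The second stage can be sketched as a brute-force nearest-neighbor search under cosine similarity. This is a minimal illustration with a hypothetical function name; it assumes non-zero feature vectors and an \(N \times N\) similarity matrix that fits in memory, so a large-scale run would need an approximate nearest-neighbor index instead.

```python
import numpy as np

def knn_clusters(features, k):
    """For every sample, return the indices of its k most cosine-similar samples.

    The sample itself is normally among them, since its self-similarity is maximal.
    """
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    similarity = normed @ normed.T            # pairwise cosine similarities
    order = np.argsort(-similarity, axis=1)   # most similar first
    return order[:, :k]

features = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
clusters = knn_clusters(features, k=2)        # one candidate cluster Phi_i per row
```

Each row of `clusters` plays the role of \(\mathbf {\Phi }_i\), which the third stage then refines by removing samples that do not belong with \(x_i\).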

In contrast to prior research, we study this problem from a quantum perspective. This leads us to design the modules \(\mathcal {M}(x)\) and \(\mathcal {N}(\mathbf {\Phi }_i)\) to operate on quantum hardware to the fullest extent possible. While training \(\mathcal {M}(x)\) using our proposed methodology constitutes a critical aspect of this study, we also aim to design \(\mathcal {N}(\mathbf {\Phi }_i)\) as a quantum machine learning model, so that as much of the pipeline as possible can be executed on a quantum machine.

Multiple methodologies have addressed the clustering problem on classical computers. These include traditional techniques (Ester et al. 1996; Otto et al. 2017), graph-based methodologies (Wang et al. 2019; Yang et al. 2020, 2019; Shen et al. 2021, 2023; Shin et al. 2023), and transformer-based approaches (Nguyen et al. 2021). While transformer architectures have demonstrated significant success in various computer vision tasks (Li et al. 2022; Yu et al. 2022; Zhai et al. 2023; Luo et al. 2023; Wang et al. 2023; Nguyen et al. 2023a, b, 2020, 2019; Nguyen-Xuan and Lee 2019; Nguyen et al. 2021, 2022, 2023d, c, b; Serna-Aguilera et al. 2024), their potential in quantum computing remains promising. Adapting the typical transformer architecture for quantum systems, as proposed by Chen et al. (2022), offers added convenience. Although graph-based networks present a possible option, the computational challenge of processing large datasets, such as a (5.2M \(\times \) 5.2M) sparse matrix on a quantum machine or even a simulated one, poses limitations. In contrast, transformer models do not encounter such constraints. Hence, inspired by the insights from Nguyen et al. (2021), we propose redesigning \(\mathcal {N}(\mathbf {\Phi }_i)\) as a transformer-based quantum model.

5.2 Implementation details

We employ the ResNet50 architecture for the model \(\mathcal {M}(x)\), as in prior works (Wang et al. 2019; Yang et al. 2020; Nguyen et al. 2021). This model is trained on large-scale datasets such as MSCeleb-1M, employing ArcFace (Deng et al. 2019) for feature representation learning. In addition to ArcFace, we integrate the Quantum Information Preserving Loss outlined in Sect. 4 to mitigate information loss during encoding. The loss factor \(\lambda \) is set to 0.5.

Fig. 3

Samples from the MSCeleb-1M and Google Landmark datasets. Each row represents either a subject (for MSCeleb-1M) or a location (for Google Landmark). The first image in each row denotes the center of a cluster \(\mathbf {\Phi }_i\), while the subsequent images are the nearest neighbors of the first one, identified through the K-NN algorithm using quantum features. Images bordered in red belong to a different class than the first image in the row, whereas those bordered in green share the same class as the first image. The clusters obtained without QIP Loss in (a) exhibit more noisy samples than those in (b), which are obtained with QIP Loss. Best viewed in color

To implement the Quantum Clusformer (Nguyen et al. 2024) \(\mathcal {N}(\mathbf {\Phi }_i)\), we first redesign the self-attention layer (Vaswani et al. 2017) for quantum machines, employing Parameterized Quantum Circuits (PQC) for each of the Query, Key, and Value layers. We then construct transformer blocks suitable for the transformer-based model. Ultimately, we achieve full implementation of the Quantum Clusformer on quantum machines.

For the components running on the classical machine, we use the PyTorch framework, while we use the torchquantum library (Wang et al. 2022) and cuQuantum to simulate the quantum machine. Since this library relies on PyTorch as the backend, we can also leverage GPUs and CUDA to speed up the training process. The models are trained on 8 \(\times \) A100 GPUs, each with 40GB of memory. The learning rate is initially set to 0.0001 and progressively decreases to zero following the cosine annealing policy (Loshchilov and Hutter 2016). Each GPU operates with a batch size of 512. The optimization uses AdamW (Loshchilov and Hutter 2017) for 12 epochs. Training time for the model \(\mathcal {M}\) is approximately 2 h, and the training time for the Quantum Clusformer \(\mathcal {N}(\mathbf {\Phi }_i)\) is about 4 h.
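The learning-rate policy described above can be written out explicitly. The sketch below uses the standard cosine annealing formula without warm restarts. Whether the schedule is stepped per epoch or per iteration is not stated in the text, so the per-epoch stepping shown here is an assumption, as is the function name.

```python
import math

def cosine_annealing_lr(step, total_steps, base_lr=1e-4, min_lr=0.0):
    """Learning rate at `step` when annealing from base_lr down to min_lr along a cosine."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# Assumed per-epoch stepping over the 12 training epochs.
schedule = [cosine_annealing_lr(epoch, 12) for epoch in range(13)]
```

The schedule starts at the initial rate of 0.0001, passes through half of it at the midpoint, and reaches zero at the end of training.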

5.3 Datasets and metrics

5.3.1 Datasets

Following Yang et al. (2019, 2020), we use MSCeleb-1M (Guo et al. 2016), and following Nguyen et al. (2021), we use the Google Landmarks Dataset Version 2 (GLDv2) (Weyand et al. 2020) for experiments.

MSCeleb-1M

MSCeleb-1M (Guo et al. 2016) is a large face recognition dataset compiled from web sources, encompassing 100,000 identities, each represented by approximately 100 facial images. The original dataset, however, contains noisy labels. Consequently, we use a subset derived from ArcFace (Deng et al. 2019) whose annotations have been cleaned. This refined dataset comprises 5.8 million images from 85,000 identities. All images are pre-processed by alignment and cropping to \(112 \times 112\).

The Google Landmarks Dataset Version 2 (GLDv2)

GLDv2 (Weyand et al. 2020) is one of the largest datasets dedicated to visual landmark recognition and identification. Its cleaned version comprises 1.4 million images spanning 85,000 landmarks and involved 800 h of human annotation. These landmarks span diverse categories and are sourced from around the globe. The dataset exhibits an extremely long-tailed distribution, with the number of images per class varying from 0 to 10,000. Compared to face recognition tasks, GLDv2 presents a similar yet notably more challenging scenario. We randomly partition the dataset into three segments, each featuring 28,000 landmarks, with no overlap between the partitions. One segment is used for training the deep visual model and Clusformer, while the remaining segments are reserved for testing. Fig. 3 shows samples from these datasets.

5.3.2 Metrics

To evaluate the approach on the clustering task, we follow Yang et al. (2019, 2020) and Nguyen et al. (2021) and use the Fowlkes-Mallows score to measure the similarity between two clusterings of a set of points. This score is computed as the geometric mean of the precision and recall over point pairs. For this reason, the Fowlkes-Mallows score is also called the Pairwise F-score (\(F_P\)). The BCubed F-score (\(F_B\)) is another popular metric for clustering evaluation, which focuses on each data point.
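The pairwise score described above can be computed by counting point pairs. The following sketch uses only the Python standard library; the function name and the toy labels are ours, chosen for illustration.

```python
from collections import Counter
from math import comb, sqrt

def pairwise_scores(labels_true, labels_pred):
    """Pairwise precision, recall, and their geometric mean (Fowlkes-Mallows score)."""
    # Pairs placed together in both the ground truth and the prediction.
    same_both = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs placed together by the prediction, and by the ground truth.
    same_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    same_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    precision = same_both / same_pred if same_pred else 0.0
    recall = same_both / same_true if same_true else 0.0
    return precision, recall, sqrt(precision * recall)

# Two true classes of three points, split into three predicted clusters of two points.
precision, recall, score = pairwise_scores([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 2, 2])
```

In this toy example, two of the three predicted pairs are correct (precision 2/3) and two of the six true pairs are recovered (recall 1/3), giving a score of about 0.47.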

6 Experimental results

6.1 Performance on MSCeleb-1M clustering

The performance of our proposed method is shown in Table 1. For ease of reference, we define QClusformer as the Clusformer operating on a quantum machine. Due to hardware constraints, we can only emulate QClusformer with fewer layers/transformer blocks than the original model (Nguyen et al. 2021). To ensure a fair evaluation, we first retrain the Clusformer, denoted as \(\text {Clusformer}^{\dagger }\), on a classical machine using the same configuration as QClusformer, explicitly setting the number of encoders to 1. The training process is outlined in Fig. 4a. As a result, the performance of \(\text {Clusformer}^{\dagger }\) is slightly inferior to that of the original model. Notably, the \(F_P\) metric decreases from 88.20% to 86.49% on the 584K test set, a reduction of about 1.7 percentage points. It consistently maintains marginally lower performance across both \(F_B\) and \(F_P\) on the remaining test sets.

Fig. 4 Experiment setup and objective of the clustering problem. a The typical experiment setup used by Nguyen et al. (2021) on the classical machine. b A similar setup in which only the deep model \(\mathcal {M}(x)\) remains on the classical machine, while the rest of the modules are redesigned to run on the quantum computer

Then, we train QClusformer with the strategy shown in Fig. 4b. For the baseline, we use amplitude encoding paired with Pauli-Z as the observable. This results in a notable decline in performance of approximately 2.8%. However, employing the QIP Loss function within the same setup effectively bridges the information gap between quantum and classical features, resulting in a notable performance recovery. Note that QClusformer with QIP Loss achieves 87.18% and 91.01% on \(F_P\) and \(F_B\), respectively, on the 584K test set, surpassing \(\text {Clusformer}^{\dagger }\) by 0.6% and 3.2%. Similar trends are observed across all test sets of MSCeleb-1M.
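
To illustrate what the baseline configuration computes, the following sketch simulates amplitude encoding followed by a Pauli-Z measurement on every qubit in plain Python. It is a noise-free toy simulation under stated assumptions, not the paper's torchquantum pipeline: the helper names, the zero-padding to a power-of-two length, and the qubit ordering (qubit 0 as the most significant bit) are all choices made here for illustration.

```python
import math


def amplitude_encode(features):
    """Zero-pad to a power-of-two length and L2-normalize into amplitudes."""
    n_qubits = max(1, (len(features) - 1).bit_length())
    padded = list(features) + [0.0] * (2 ** n_qubits - len(features))
    norm = math.sqrt(sum(x * x for x in padded))
    if norm == 0.0:
        raise ValueError("cannot encode an all-zero feature vector")
    return [x / norm for x in padded], n_qubits


def pauli_z_expectations(amplitudes, n_qubits):
    """Return <Z_k> per qubit: weight +1 when bit k is 0 and -1 when it is 1."""
    expectations = []
    for k in range(n_qubits):
        total = 0.0
        for index, amp in enumerate(amplitudes):
            bit = (index >> (n_qubits - 1 - k)) & 1  # assumed ordering: qubit 0 is the MSB
            total += (1 - 2 * bit) * abs(amp) ** 2
        expectations.append(total)
    return expectations


# A 4-dimensional classical feature becomes a 2-qubit state.
state, n_qubits = amplitude_encode([3.0, 0.0, 0.0, 4.0])
z_values = pauli_z_expectations(state, n_qubits)
```

In this toy case a four-dimensional vector is summarized by two expectation values, and the result depends only on the squared magnitudes of the normalized entries.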

These findings underscore the competitive performance of Quantum Clusformer, particularly when trained with QIP Loss. Notably, its performance surpasses that of the best-performing Clusformer with a complete setup on a classical machine, signaling the promising capabilities of quantum computing for the clustering problem.

6.2 Performance on Google Landmark clustering

This section compares the proposed method’s performance on the Google Landmark Dataset, a visual landmark clustering dataset; the results are shown in Table 2. The experimental setup and evaluation protocol follow the previous MSCeleb-1M section and the prior work of Nguyen et al. (2021). The results are similar to those obtained on MSCeleb-1M. Specifically, \(\text {Clusformer}^{\dagger }\), when run on a classical machine, achieves 17.74% and 38.80% in terms of \(F_P\) and \(F_B\), respectively. However, when the model operates on a quantum machine (QClusformer), its performance drops significantly to 13.20% and 35.63% for \(F_P\) and \(F_B\), respectively. Nonetheless, with the QIP Loss function, the performance rebounds to 19.02% for \(F_P\) and 40.28% for \(F_B\), surpassing that of \(\text {Clusformer}^{\dagger }\) and remaining competitive with the original Clusformer, which achieves 19.32% and 40.63% for \(F_P\) and \(F_B\).

6.3 Ablation studies

The ablation studies in this section empirically support Proposition 1.

QIP works with different encoding strategies

In Proposition 1, we present the information gap between quantum and classical machines across various encoding strategies. To demonstrate the effectiveness of our proposed method with different encoding approaches, we hold the observable fixed as Pauli-Z and switch between phase and \(U_3\) encoding (Benedetti et al. 2019). Unlike amplitude and phase encoding, \(U_3\) is a Parameterized Quantum Circuit (PQC) with trainable parameters. The performance of these configurations is detailed in Table 3. Remarkably, QClusformer trained with QIP Loss, the Pauli-Z observable, and either phase or \(U_3\) encoding consistently outperforms QClusformer trained without QIP Loss. This underscores the adaptability of the QIP Loss across diverse encoding strategies. Notably, phase and \(U_3\) encoding show inferior performance compared to amplitude encoding. As mentioned in the previous section, amplitude encoding is a more natural fit for the clustering problem than the other strategies.
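
For readers unfamiliar with the \(U_3\) gate, the sketch below builds the standard single-qubit \(U_3(\theta , \phi , \lambda )\) matrix, applies it to \(|0\rangle \), and evaluates the Pauli-Z expectation. How the paper maps input features and trainable parameters onto the three angles is not specified in this section, so the angle values are arbitrary and the helper names are hypothetical. The gate angle `lam` is unrelated to the QIP loss factor \(\lambda \).

```python
import cmath
import math


def u3_matrix(theta, phi, lam):
    """Standard single-qubit U3 gate as a 2x2 complex matrix."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [
        [c, -cmath.exp(1j * lam) * s],
        [cmath.exp(1j * phi) * s, cmath.exp(1j * (phi + lam)) * c],
    ]


def apply_gate(matrix, state):
    """Multiply a 2x2 matrix by a 2-entry state vector."""
    return [sum(matrix[row][col] * state[col] for col in range(2)) for row in range(2)]


def pauli_z_expectation(state):
    """<Z> = P(0) - P(1) for a single-qubit state."""
    return abs(state[0]) ** 2 - abs(state[1]) ** 2


# Arbitrary illustrative angles (assumption, not values from the paper).
theta, phi, lam = 0.7, 0.3, -1.1
state = apply_gate(u3_matrix(theta, phi, lam), [1.0, 0.0])
z = pauli_z_expectation(state)
```

Starting from \(|0\rangle \), the Pauli-Z expectation equals \(\cos \theta \) and is insensitive to the two phase angles, so a Pauli-Z readout alone recovers only one of the three gate parameters.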

QIP works with different observables

The intuition behind this ablation study is similar to that of the encoding-strategy study above. In particular, we fix the encoding strategy as amplitude encoding while experimenting with various observables, i.e., Z, X, and XZ (a combination of measuring both the X and Z coordinates). As shown in Table 4, QClusformer achieves the highest \(F_P\) and \(F_B\) with the Z observable, while both X and XZ show slight decreases. With the Pauli-Y observable, amplitude encoding is ineffective because it results in all-zero measurements. Consequently, we select \(U_3\) for encoding and compare the performance of Pauli-Y with that of Pauli-Z. Interestingly, the performance with Pauli-Y remains relatively unchanged compared to Pauli-Z. Nonetheless, these configurations still significantly outperform QClusformer alone, underscoring the versatility of the Quantum Information Preserving (QIP) Loss across diverse observables.
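
The all-zero Pauli-Y measurements can be checked numerically. The sketch below, a toy simulation with a hypothetical helper name and an assumed qubit ordering, evaluates the Pauli-Y expectation on one qubit of a state vector: with real-valued amplitudes, as produced by amplitude encoding of real features, the value vanishes, whereas a state carrying a complex relative phase gives a non-zero value.

```python
import math


def pauli_y_expectation(amplitudes, qubit, n_qubits):
    """<Y> on one qubit of an n-qubit state given as a list of 2**n amplitudes."""
    mask = 1 << (n_qubits - 1 - qubit)  # assumed ordering: qubit 0 is the MSB
    total = 0.0
    for index, amp in enumerate(amplitudes):
        if (index & mask) == 0:
            partner = amplitudes[index | mask]
            # Y couples |0> and |1> with factors -i and +i, so only the
            # imaginary part of conj(a_0) * a_1 contributes.
            total += 2.0 * (complex(amp).conjugate() * partner).imag
    return total


# Real amplitudes (amplitude encoding of real features): <Y> is zero on every qubit.
real_state = [0.5, 0.5, 0.5, 0.5]
y_real = [pauli_y_expectation(real_state, k, 2) for k in range(2)]

# A single-qubit state with a relative phase of i: <Y> is non-zero.
phase_state = [1 / math.sqrt(2), 1j / math.sqrt(2)]
y_phase = pauli_y_expectation(phase_state, 0, 1)
```

This is consistent with the switch to \(U_3\) encoding for the Pauli-Y experiments, since \(U_3\) can introduce complex phases into the state.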

The role of \(\lambda \) - QIP loss factor

We investigate the impact of the control factor \(\lambda \), which weights the QIP Loss, on performance. To this end, we conduct experiments using a subset of 584K samples from the MSCeleb-1M dataset. The experimental configuration remains consistent with that outlined in the previous section, i.e., amplitude encoding and the Pauli-Z observable.

Table 1 Performance on face clustering w.r.t the different number of unlabelled test sets
Table 2 Performance on landmark clustering w.r.t different quantum encoding and observables
Table 3 Ablation studies on different encoding strategies of MSCeleb-1M
Table 4 Ablation studies on different observables of MSCeleb-1M
Fig. 5 Ablation studies on different QIP Loss factor \(\lambda \)

Fig. 6 Ablation studies on features representation using QIP Loss. From left to right, the first image presents classical features, the second one presents quantum features w/o QIP Loss, and the last one shows the quantum features optimized by QIP Loss

The results are shown in Fig. 5. When \(\lambda = 0\), i.e., when QIP Loss is not used, the performance stands at 83.68% and 86.89% for \(F_P\) and \(F_B\), respectively, as detailed in Table 1 above. Gradually increasing \(\lambda \) yields a steady improvement in performance. The peak is attained at \(\lambda =0.5\), after which performance declines. This behavior stems from the role of QIP Loss in minimizing the disparity between quantum and classical features. According to Proposition 1, the gap tends toward zero only when the two vectors \({\textbf {v}}_1\) and \({\textbf {v}}_2\) are identical. In that case, the model \(\mathcal {M}\) generates similar features irrespective of the input images, leading to model collapse and a failure to distinguish samples from distinct classes. Hence, it is necessary to control \(\lambda \) to prevent such collapse. In our experiments, the optimal value of \(\lambda \) within this framework is 0.5.
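
To make the role of \(\lambda \) concrete, the training objective can be sketched as a weighted sum. This form is an assumption made here for illustration; the exact combination of the two terms follows the paper's definition of QIP Loss and may differ:

\[ \mathcal {L} = \mathcal {L}_{\text {ArcFace}} + \lambda \, \mathcal {L}_{\text {QIP}} \]

Under this assumed form, \(\lambda = 0\) reduces to training with ArcFace alone, while a large \(\lambda \) lets the gap-minimizing term dominate the discriminative term, which is consistent with the collapse described above.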

Quantum feature representations

We investigate how QIP Loss helps to align the features in the quantum computer, as shown in Fig. 6. We randomly select 200 subjects from the 584K part of MSCeleb-1M and extract their features. We employ t-SNE to reduce the dimension from 256 to 2 and visualize these features in 2D space. From left to right, the first image (with a red border) shows the classical features, the second image (with a green border) shows the quantum features of these subjects without training with QIP Loss, and the last one shows the quantum features optimized by QIP Loss.

Performance of feature extractor - \(\mathcal {M}\)

Since \(\mathcal {M}\) is trained with a combination of ArcFace (Deng et al. 2019) and our proposed QIP Loss, it is important to evaluate the effectiveness of \(\mathcal {M}\) and verify how QIP Loss affects its performance. We follow the same evaluation protocol as in Deng et al. (2019). In particular, we evaluate the face verification accuracy of \(\mathcal {M}\) on the IJBC (Maze et al. 2018) database. The results are reported in Table 5. As the baseline, ResNet50 without QIP Loss achieves 96.140% on IJBC. We observe a slight drop to 96.068% when incorporating QIP Loss with \(\lambda = 0.5\). However, when \(\lambda \) is increased to 0.9, the performance drops by approximately 4%. This drop is explained in the section above: the feature representation tends to collapse as \(\lambda \) increases.

Table 5 Face verification accuracy of feature extractor \(\mathcal {M}\) on IJBC database

Comparison with classical method

Since the problem can be treated as a representation learning task, we compare our method to a classical machine learning approach in this section. Specifically, we choose the Support Vector Machine (SVM), a kernel-based feature representation method, for the comparison. Following Schuld (2021), we implement a Quantum SVM algorithm that can be executed on a quantum computer. This algorithm comprises two main components, quantum encoding and measurement, i.e., a Parameterized Quantum Circuit (PQC). Unlike the aforementioned training strategy, we do not train \(\mathcal {M}\) jointly with the Quantum SVM. Instead, we train the Quantum SVM separately, using the classical features \({\textbf {v}}\) as input to perform a classification task. After training, the corresponding quantum features are used to train the Quantum Clusformer \(\mathcal {N}(\mathbf {\Phi }_i)\). The results are presented in Table 6. Using the Quantum SVM for quantum feature representation results in a significant performance drop: it achieves \(F_P\) and \(F_B\) scores of 80.3% and 82.82%, respectively, approximately 7% lower than our proposed method. This decline occurs because the Quantum SVM is designed for a closed-set problem, whereas unsupervised clustering addresses an open-set problem. While the Quantum SVM may provide a good quantum feature representation for the training set, it struggles with the testing set, leading to poor feature distinction and, consequently, the worst performance.

Table 6 Performance comparison with classical method Quantum SVM on the 584K subset of MSCeleb-1M

7 Conclusion

This paper revisits the quantum visual feature encoding strategies employed in quantum machine learning with computer vision applications. We identify a significant Quantum Information Gap (QIG) issue stemming from current encoding methods, resulting in non-discriminative feature representations in the quantum space, thereby challenging quantum machine learning algorithms. To tackle this challenge, we propose a simple yet effective solution called Quantum Information Preserving Loss. Through empirical experiments conducted on various large-scale datasets, we demonstrate the effectiveness of our approach, achieving state-of-the-art performance in clustering problems on quantum machines. Our insights into quantum encoding strategies are poised to stimulate further research efforts in this domain, prompting researchers to focus on designing more effective quantum machine learning algorithms.

8 Discussion

Since public access to quantum machines is limited, the experiments were carried out on noise-free simulation systems such as torchquantum and cuQuantum. However, real-world scenarios may involve noise within the system, leading to uncertain quantum state measurements and affecting overall performance. Despite this limitation, the theoretical problem of QIG persists. It is crucial to recognize that quantum machine learning algorithms must confront the dual challenges of QIG and noise. We anticipate that addressing these issues will attract significant research attention in the future.