Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Dynamic Weighted Gradient Reversal Network for Visible-infrared Person Re-identification

Published: 25 August 2023 Publication History
  • Get Citation Alerts
  • Abstract

    Due to intra-modality variations and cross-modality discrepancy, visible-infrared person re-identification (VI Re-ID) is an important and challenging task in intelligent video surveillance. The cross-modality discrepancy is mainly caused by the differences between visible images and infrared images, the inherent essence of which is heterogeneous. To alleviate this discrepancy, we propose a Dynamic Weighted Gradient Reversal Network (DGRNet) to enhance the learning of discriminative common representations by confusing the modality discrimination. In the proposed DGRNet, we design the gradient reversal model guiding adversarial training between identity classifier and modality discriminator to reduce the modality discrepancy of the same person in different modalities. Furthermore, we propose an optimization training method, that is, designing dynamic weight of gradient reversal to achieve optimal adversarial training, and dynamic weight has the ability to dynamically and adaptively evaluate the significance of target loss term, without involving hyper-parameter tuning. Extensive experiments were conducted on two public VI Re-ID datasets, SYSU-MM01 and RegDB. The experimental results show that the proposed DGRNet outperforms state-of-the-art methods and demonstrate the effectiveness of the DGRNet to learn more discriminative common representations for VI Re-ID.

    1 Introduction

    Person re-identification (Re-ID) [1, 8, 32, 37] is an important research field in intelligent video surveillance. Given a specific person, Re-ID task aims to search for it from the gallery set, which contains person images captured by non-overlapping surveillance cameras [62]. Recent research efforts mainly focus on the visible camera module [20, 27]. However, when ambient light is poor or unavailable, visible-visible (VV) Re-ID will be restricted in surveillance. Besides, in the same condition, visible images will be uninformative. In such a case, imaging device such as infrared cameras, with no need to rely on visible light should be applied. In practical scenarios, we can consist the probe set by visible/infrared images, which are captured by visible/infrared cameras in the daytime/nighttime. Also, the gallery set can be constructed by infrared/visible images, which are captured by infrared/visible cameras in the nighttime/daytime [28]. Therefore, the task of Re-ID in video surveillance has a new and challenging requirement, performing person matching in cross-modality, which is also called visible-infrared person re-identification (VI Re-ID).
    There is a big issue that needs more attention in the VI Re-ID task. Like face recognition tasks in cross-modality scenarios [14, 15, 29], in the VI Re-ID task, the huge cross-modality discrepancy is a major hindrance. The cross-modality discrepancy arises from two aspects. On the one hand, visible and infrared images are intrinsically distinct. As shown in Figure 1, in the first row, the visible images have three channels containing rich color information from visible light. Conversely, in the second row, the infrared images have only one channel and contain no color information. For this, they are considered as heterogeneous data [42]. On the other hand, in terms of imaging principles, the wavelength ranges of visible images and infrared images are also different.
    Fig. 1.
    Fig. 1. The images in each column indicate the same person in different modalities.
    In the VI Re-ID task [11, 19, 22, 25, 35, 36, 39, 41, 42, 48, 49, 50, 52, 54, 56], to cut down the additional cross-modality discrepancy, the existing methods mainly include the following aspects: metric learning, representation learning, and image translation. Metric learning focuses on designing loss function, which make the distance as small as possible between the same persons and the distance as large as possible between different persons [41, 50]. To achieve the similarity measure, some representation learning approaches map the persons from different modalities into a common feature space [48, 49, 54, 56]. Besides, to reduce the modality discrepancy, some representation learning approaches generate a novel common modality between different modalities [19, 25]. Similarly, for person matching, some image translation approaches take image generator methods based on generative adversarial network (GAN) to produce cross-modality images [35, 36]. But both generated cross-modality images and generated common modality introduce noise [39]. Moreover, the newly generated images not only increase the computation but also introduce more uncertainty for network learning [11]. To avoid the above issues, Hao et al. [11] proposed a modality confusion learning network to ignore modality information and learn common features, and an identity-aware marginal center aggregation strategy and a camera-aware learning scheme are explored for further improvement. Their work achieves competitive performance compared with the states of the art. However, in their work, the loss contribution of modality classifier participates in adversarial training with fixed weights, which is detrimental to the learning of instance-level feature representations at the early training stage, to achieve optimal adversarial training of the network. Their idea of though modality confusion learning to obtain discriminative common feature representation is very innovative and inspires us to find out whether there is a better adversarial learning mechanism to obtain more discriminative modality common features. Motivated by this, we propose a Dynamic Weighted Gradient Reversal network (DGRNet) for enhancing the discriminative common representation learning.
    The contributions of this article mainly include the following three aspects:
    We design the gradient reversal model guiding adversarial training in the VI Re-ID task. The gradient reversal works with modality discriminator to guide adversarial training with identity classifier to reduce the modality discrepancy of the same person in different modalities. This can confuse the feature distributions of visible images and infrared images to learn the discriminative common representations.
    An optimization training method through designing the dynamic weight of gradient reversal model is proposed in this article. It is able to evaluate the contribution of the target loss term adaptively and dynamically and guide the network to find the optimal parameter of different parts during the training and thus to achieve optimal adversarial training without any hyper-parameter tuning. This can for further enhance discriminative common representations learning.
    In extensive experiments on two public datasets, our proposed DGRNet with only global features achieves good results compared to global-feature-based state-of-the-art (SOTA) methods, even outperforming majority of local-feature-based methods. And it is constructed by two-stream CNN structures training in an end-to-end way.
    We organize the rest of this article as follows. In Section 2, the related works are discussed, then, in Section 3, we elaborate on the framework of the proposed DGRNet, loss function, the training algorithm, and optimization. In Section 4, we demonstrate the experimental results of our approach. In Section 5, we finally conclude this article.

    2 Related Work

    The intra-modality variations are the problem that traditional VV Re-ID aims to solve. Compared to it, the discrepancy of cross-modality should be handled additionally in the VI Re-ID task. To mitigate this problem, the existing methods focus on projecting (or translating) the heterogeneous cross-modality person images into a common space for similarity measure, mainly including image translation, metric learning, and network designing.
    Image translation–based VI Re-ID. Image translation usually takes image generation methods, which reduces the domain gap between visible modalities and infrared modalities based on generative adversarial network. Kniaz et al. [18] first translated a visible image into a multimodal thermal probe set by using a set of GANs. Choi et al. [6] proposed hierarchical cross-modality disentanglement framework that consists of an identity-preserving person image generation network and a hierarchical feature learning module. Their goal is to separate ID-discriminative and ID-excluded from cross-modal images. Reference [38] introduced a GAN-based network to address the feature-level. Furthermore, AlignGAN [35] and JSIA-ReID [36] were also GAN-based approaches that implement pixel alignment and feature dual-level alignment. Zhong et al. [63] also proposed a GAN-based approach, GECNet, which preserves the structure of informative colored images to bridge the gap between different modalities. However, the methods to generate cross-modality images and common modality will both introduce noise [39]. Moreover, this types of methods have more performance uncertainty, more computation complexity, and higher demands for training trick [23].
    Metric learning–based VI Re-ID. Metric learning is a critical step for Re-ID to perform a similarity metric. Before the era of deep learning, people studied it by learning a projection matrix or Mahalanobis distance function. Nowadays, the loss function designing has replaced the role of metric learning to guide the learning of feature representation. Wu et al. [41] designed a novel loss, which uses the similarity of same-modality to guide the similarity learning of the inter-modality. For cross modality metric learning, to obtain the stronger discriminability, Ye et al. [50] presented a channel-mixed learning strategy to synchronously address the variations of both cross-modality and intra-modality. Hao et al. [11] introduced a strategy called identity-aware marginal center aggregation, through which the centralization features are extracted, and for further improvement, the camera-aware learning scheme was also proposed, which exploited the camera label information to enhance discriminability. From the angles vector respect, a new ranking loss was proposed in Reference [46], which constrained the angle between the embedding vectors to learn common feature space, and this space is angularly separable. Similarly, Gao et al. [10] proposed a novel loss called Enumerate Angular Triplet loss for studying the angularly discriminative feature embedding and presented a new Cross-Modality Knowledge Distillation loss to narrow down the features between different modalities before feature embedding. To further decrease the cross-modality variations, Huang et al. [16] proposed a novel cross-modality quadruplet loss. ZhangLa et al. [59] provided a comprehensive metric learning framework on the basis of paired-based similarity constraints to deal with all the variations within and across modalities.
    Network designing–based VI Re-ID. In the re-identification task, feature learning is an important step before measuring similarity. Many research works designed deep neural networks to study the feature learning between different modalities. To explore the VI Re-ID task, a deep zero-padding network was proposed in Reference [42], which can learn the invariant feature representations between different modalities. In References [48, 49, 54], two-stream network was proposed, which first extracts the modality-specific features and then uses the share fully connected layers to extract shared features. Similarly, Zhang et al. [56] proposed a model by dictionary learning to project the features of different modalities onto a common space. To alleviate the modality discrepancy, Li et al. [19] and Lu et al. [25] used this method to generate a new modality between different modalities. On the foundation of a traditional two-stream network, Liu et al. [22] enhanced the modality-shared person features by introducing mid-level features incorporation. To reduce the cross-modality discrepancy, Ye et al. [52] proposed AGW, which contains three major components, including non-local attention block, weighted regularization triplet loss, and generalize-mean pooling. From the aspect of adversarial learning, Dai et al. [7] proposed a network (cmGAN) to generate the discriminative common representations by constructing a discriminator to act as an adversary. To explore the information of cross-modality and intra-modality, a dual-attentive network was proposed in Reference [51], which contained an attention module to give different weight of variant body parts. Ye et al. [53] proposed a HAT model to generate a grayscale modality from the homogeneous visible images, and the process of generation does not need any additional training.
    Liang et al. [21] proposed an unsupervised VI Re-ID framework that reduces the modality discrepancy by homogeneous-to-heterogeneous learning and finally produced robust feature representations from different modalities. Considering the intrinsic spatial structures as well as the difference of two modalities, Zhang et al. [58] proposed a dual-path framework for learning cross-modality features. Wei et al. [40] proposed a flexible body partition framework by using of adversarial learning method (FBP-AL) for learning more fine-grained information. For utilizing both multi-level information as well as potential contextual cues as a supplement, Cheng et al. [5] proposed a new network, named Dual-path Deep Supervision Network (DDSN). Moreover, Zhao et al. [60] propose a novel approach to learn the color-irrelevant features through the color-irrelevant consistency learning and align the identity-level feature distributions by the identity-aware modality adaptation. Kansal et al. [17] proposed a novel network to explore the spectrum information in the VI Re-ID task, and the network had two branches: The spectrum dispelling branch was designed to keep the useful identity features, and the spectrum distilling branch was designed to dispel spectrum features. The network structure was then optimized through a multi-stage training strategy.

    3 Proposed Method

    In this section, we will elaborate on the framework of the proposed DGRNet model for VI Re-ID. As shown in Figure 2, the proposed DGRNet mainly comprises four main components: feature extractor \(G_f\) , identity classifier \(G_y\) , the gradient reversal \(GRL\) , and modality discriminator \(G_d\) . The input data are first transformed into high-dimensional representations \(Z\) by the component of feature extractor. Using the learned feature representations, the identity classifier, whose goal is to maximize the prediction accuracy, can obtain the predictions of identity label on all the input data. The modality discriminator is designed to discriminate different modalities. To obtain modality-invariant features, the gradient reversal is inserted between feature extractor and modality discriminator.
    Fig. 2.
    Fig. 2. The overall structure of the proposed DGRNet. DGRNet consists of a deep feature extractor \(G_f\) ; an identity classifier \(G_y\) , \(\theta _y\) is the parameter of \(G_y\) ; a modality discriminator \(G_d\) and a gradient reversal \(GRL\) . \(\lambda ^k\) denotes a dynamic weight, function denotes the method of controlling \(\lambda ^k\) and reverse means multiplying by \(-\) 1. \(\otimes\) denotes the product operator. Feature extractor constructed by ResNet-50 model, where the first two stages are parameter independent and the latter three stages are parameter-shared. GAP stands for global average pooling, \(Z\) is the extracted deep features, BN stands for batch normalization, and C and D are constructed by using fully connected layers. \(\hat{y}\) is the predicted probability of human identities, \(\hat{d}\) is the predicted probability of modalities. \(L_{bhtri}\) denotes hard triplet loss, \(L_{cc}\) denotes center cluster loss, \(L_{id}\) denotes the identity loss, and \(L_d\) denotes the loss of modality discriminator.
    The network model of the DGRNet will be detailed in Section 3.1. Then, loss function is presented in Section 3.2. The training algorithm and optimization will be provided in Section 3.3 and Section 3.4, respectively.

    3.1 The Network Model

    Problem Formulation. Formally speaking, let \(x\) be the input data of the network model, which includes two parts: the infrared images \(x^r\) and the visible images \(x^v\) . For each datum \(x_i\) , a corresponding modality label \(d_i \in \mathcal {M}\) is given, where \(\mathcal {M}\) represents the set that includes all the visible and infrared modalities. Each labeled datum \(x_i\) also has a true identity label \(y_i \in \ varUpsilon\) , where \(\ varUpsilon\) is the set of all identities.

    3.1.1 Baseline.

    As shown in Figure 2, our baseline consists of feature extractor and identity classifier. The two-steam network framework presented in Reference [49] is adopted in feature extractor, which using ResNet-50 [12] to extract features for different modalities. The network parameters of the first two convolutional blocks are used to capture the modality specific information of input images from different modalities, which parameters are independent. To narrow the gap between two heterogeneous modalities, the network parameters of the last three convolutional blocks are employed to learn a multi-modality sharable space, which parameters are shared. Let \(\theta _f\) be the parameters of feature extractor \(G_f\) . Given the input data \(x\) , we can obtain their feature representations \(Z\) as
    \(\begin{equation} Z = G_f(x; \theta _f). \end{equation}\)
    (1)
    Let \(\theta _y\) be the set of the identity classifier’s parameters. As shown in Figure 2, based on the outputs of feature extractor \(Z\) , to get the probability vector \(\hat{y_i}\) of human identities, the softmax layer is used, \(\hat{y_i}\) is definded as
    \(\begin{equation} \hat{y_i} = softmax(FC(BN(Z_i))). \end{equation}\)
    (2)

    3.1.2 Modality Discriminator.

    As shown in Figure 2, a modality discrimination \(G_d\) is constructed, and its parameter is denoted as \(\theta _d\) . The modality discriminator acts as an adversary, whose goal is to judge whether the learned representation vector belongs to the visible modality or infrared modality. It is consisted of a two-layer feed-forward neural network, based on the outputs of feature extractor \(Z\) and the probability vector \(\hat{d_i}\) of modalities as
    \(\begin{equation} \hat{d_i} = \sigma (FC(BN(FC(Z_i)))), \end{equation}\)
    (3)
    where \(\sigma (x)\) is the sigmoid function.

    3.1.3 Gradient Reversal.

    Domain adaptation refers to the process of transferring knowledge from a source domain to a target domain, and it can confuse the data distribution from the source domain and target domain [2, 33]. To learn the transferable feature representations, it was successfully embedded into deep network for reducing the distribution discrepancy of different domains. Moreover, gradient reversal [9] is a way to realize domain adaption by adversarial training. Traditional method achieves adversarial training by constantly adjusting the sign of the identity classifier and modality discriminator [7], but it often leads to unstable network training. In this article, we align the feature distribution through gradient reversal to learn the discriminative common representations. Specifically, we join gradient reversal after the output of feature extractor and before the input of modality discriminator to guide adversarial training. The purple lines describe the process of gradient reversal, as illustrated in Figure 3. During the model training process, gradient reversal flips the gradient of the feature extractor and passes the reversed gradient to update the feature extractor, forcing it to minimize the identity loss while maximizing the modality discriminator loss. This encourages the feature extractor to make the feature distributions between different modalities as similar as possible. Meanwhile, the modality discriminator tries to distinguish the feature distributions between different modalities in an adversarial manner. This process is adversarial in nature, with the feature extractor and the modality discriminator playing against each other, ultimately achieving the effect of modality adaptation.
    Fig. 3.
    Fig. 3. The back propagation process of the proposed DGRNet.

    3.2 Loss Function

    The triplet constraints is imposed to the cross-modality loss to minimize the gap among features of the same person from different modalities. In our method, to simplify calculations and improve the performance of model, the batch hard triplet loss [13] is used in the network training. The main ideas are as follows: \(P\) classes (person identities) are first randomly sampled for batch forming, and for each class (person), \(K\) images are randomly sampled, thus forming the mini-batch of \(2\times P \times K\) images. Now, for each sample \(i\) of a mini-batch, instead of selecting all pairs of samples, we choose the hardest positive and the hardest negative samples as triplets for loss calculation, and the loss \(L_{bhtri}\) is computed as
    \(\begin{equation} L_{bhtri} = \sum _{i=1}^{P} \sum _{a=1}^{2K} \left[\rho + \max _{{{\scriptstyle \begin{matrix} {p=1 \cdots 2K} \\ {a \ne p} \end{matrix}}}} D\left(Z_a^i, Z_p^i\right) - \min _{{{\scriptstyle \begin{matrix} {j=1 \cdots P} \\ {n=1 \cdots 2K} \\ {j \ne i} \end{matrix}}}}D\left(Z_a^i, Z_n^j\right)\right]_+, \end{equation}\)
    (4)
    where \(Z_i^j\) represents the \(j{\rm th}\) feature of the \(i{\rm th}\) person. \(\max D(Z_a^i,Z_p^i)\) denotes the hardest positive, which means the maximum distance of anchor \(Z_a^i\) and positive samples \(Z_p^i\) . \(\min D(Z_a^i,Z_n^j)\) denotes the hardest negative, which means the minimize distance of anchor \(Z_a^i\) and negative samples \(Z_n^j\) .
    The key point of hard sample triplet loss is to optimize the features of hard samples, However, it does not specifically constrain features from the perspective of identity learning. To further alleviate the difference between the modalities of the same identity and increase the feature distance between different identities, the center cluster loss function [44] is applied to guide the identity learning during training as
    \(\begin{equation} \begin{aligned} L_{cc} = &\frac{1}{2PK} \sum _{i=1}^{2PK} \left\Vert Z_{i}-c_{y_{i}}\right\Vert _{2} + \\ &\frac{2}{P(P-1)} \sum _{j=1}^{P-1}\sum _{l=j+1}^{P} \left[\rho _{cc} - \left\Vert c_{y_{j}}-c_{y_{l}}\right\Vert _{2}\right]_+, \end{aligned} \end{equation}\)
    (5)
    where \(c_{yi}\) represents the average center of image features with the label \(y_{i}\) and \(\rho _{cc}\) denotes the minimum margin between all center pairs.
    We used entropy loss to calculate the identity loss \(L_{id}\) ; \(L_{id}\) is defined as
    \(\begin{equation} L_{id} = -\frac{1}{M} \sum _{i=1}^{M} q(i, y_i) \log (\hat{y_i}), \end{equation}\)
    (6)
    where \(M\) denotes the number of human identities. \(q(i, y_i)\) represents the true distribution of sample. When the predicted identity \(i\) is the target identity \(y_i\) , \(q(i, y_i)=1\) ; otherwise, \(q(i, y_i)=0\) . \(\hat{y_i}\) represents the predicted probability of the sample on the \(i{\rm th}\) class.
    The baseline (combination of feature extractor and identity classifier) loss \(L_B\) is denoted as
    \(\begin{equation} L_B = \eta _{1} L_{bhtri} + \eta _{2} L_{cc} + L_{id}, \end{equation}\)
    (7)
    where \(\eta _{1}\) and \(\eta _{2}\) are hype-parameters to balance the contributions of individual loss terms.
    Upon \(\hat{d_i}\) the loss of modality discriminator is defined by binomial cross-entropy loss as
    \(\begin{equation} L_d = -\frac{1}{N}\sum _{i=1}^{N} (d_i * \log (\hat{d_i}) + (1-d_i) * \log (1-\hat{d_i})), \end{equation}\)
    (8)
    where \(N\) represents the number of all samples, \(d_i\) indicates the modality label of the \(i{\rm th}\) samples, and \(\hat{d_i}\) is the modality probability of the \(i{\rm th}\) samples.
    Overall. After introducing the modality discriminator, the loss function of the network is expressed as
    \(\begin{equation} L = \beta L_B + \alpha L_d, \end{equation}\)
    (9)
    where \(L_B\) is the baseline loss and \(L_d\) is the modality discriminator loss. The hyper-parameters \(\alpha\) and \(\beta\) are used to adjust the contribution of different loss terms in the network.

    3.3 The Training Algorithm

    The network combined with baseline and modality discriminator is used to learn a feature extractor that maps an example into a representation allowing the identity classifier to accurately classify human identity, while crippling the ability of the modality discriminator to detect each sample belongs to the visible or infrared modality by adversary training. To achieve this, we maximize the loss \(L_d\) of modality discriminator. This would make samples from different modalities are indistinguishable, and the extracted features are modality invariant. Moreover, we minimize the loss \(L_B\) of baseline to further improve the modality invariance and inter-class discriminative ability of learned features. More formally, the complete optimization of our network is equivalent to solving the following minimization problem as
    \(\begin{equation} E(\theta _f,\theta _y,\theta _d) = \sum _{i=1}^{N} L_B^i(\theta _f,\theta _y)-\lambda \sum _{i=1}^{N} L_d^i(\theta _f,\theta _d), \end{equation}\)
    (10)
    where \(i\) is the iteration number and \(\lambda\) is a hyper-parameter that is used to trade off the two objectives in the optimization problem. We can solve the above minimization problem based on the following stochastic updates method:
    \(\begin{equation} \theta _f \leftarrow \theta _f-\mu \left(\frac{\partial L_B^i}{\partial \theta _f}-\lambda \frac{\partial L_d^i}{\partial \theta _f}\right), \end{equation}\)
    (11)
    \(\begin{equation} \theta _y \leftarrow \theta _y-\mu \frac{\partial L_{id}}{\partial \theta _y}, \end{equation}\)
    (12)
    \(\begin{equation} \theta _d \leftarrow \theta _d-\mu \frac{\partial L_d^i}{\partial \theta _d}, \end{equation}\)
    (13)
    where \(\mu\) represents learning rate. Excepting the factor \(-\lambda\) in Equation (11), the update process of Equations (11)–(13) is formally like the stochastic gradient descent (SGD) method. To update the parameters in Equations (11)–(13) with the standard SGD method, the gradient reversal is introduced between the modality discriminator and feature extractor, as shown in Figure 3.
    The gradient reversal does not have any parameter to learn. It was treated as an identity transformation during the forward propagation, whereas, during the backpropagation, the gradient reversal takes the gradient from the subsequent layer and changes its sign, multiplies it by \(\lambda\) , and passes it to the preceding layer. We can formally treat the gradient reversal as a “pseudo-function” by two equations as
    \(\begin{equation} GRL(Z) = Z, \end{equation}\)
    (14)
    \(\begin{equation} \frac{\partial GRL}{\partial Z} = -\lambda I, \end{equation}\)
    (15)
    where \(I\) denotes an identity matrix. Based on the pseudo-function \(GRL\) , the update process of Equations (11)–(13) can then be implemented as doing standard SGD. During the backpropagation, the gradient reversal ensures the gradients from the baseline and modality discriminator are subtracted and leads to the emergence of the following features: modality invariance and inter-class discrimination. The feature distributions are similar over the visible modality, and infrared modality is ensured, but as indistinguishable as possible for the modality discriminator, thus producing the modality-invariant features. In a word, through using gradient reversal, the baseline and the modality discriminator are competing against each other, by adversarial training, over the objective of Equation (10). But \(\lambda\) still introduces hyper-parameter tuning.

    3.4 Optimization

    Dynamic weighted gradient reversal. During the network training, the loss value of \(L_B\) and \(L_d\) in Equation (11) is different. If \(\alpha\) and \(\beta\) take a fix value, then the imbalances loss contribution impedes proper training and finally results in suboptimal training. So it is necessary to adjust the parameters \(\alpha\) and \(\beta\) to achieve the optimal training. For this, we generally take many attempts to find the most suitable parameters for network training. To achieve optimal adversarial training, we design a dynamic weight for gradient reversal to adaptively and dynamically evaluate the significance of the target loss term during the training to further enhance learning of the discriminative common representations. In this way, the modality discriminator is introduced gradually by dynamically adjusting the weights of the loss terms so that the network can learn the instance features better at the early training stage and achieve optimal training of the network.
    We design a dynamic weight for gradient reversal, which is inspired by multi-task learning. Our fundamental idea is to treat the loss of cross-modality person re-identification \(L_B\) as the dominant loss, and then the loss of modality discriminator \(L_d\) is gradually introduced for optimization. The main reason for doing this is that, at an early training stage, it is easier to learn the instance-level feature representations guided with \(L_B\) . Then, based on the optimization degree of the identity classifier, through controlling the weight factor of \(L_d\) , we gradually increase the modality discriminator to conduct adversarial training with the identity classifier to better learn the modality-invariant features.
    We treat the training of the identity classifier and the modality discriminator as different tasks. To optimize the weights \(\lambda ^k\) for the loss contribution of modality discriminator, we present a simple algorithm as shown in Figure 4; that is, it will penalize the modality discriminator if the backpropagated gradients from identity classifier are too large at the beginning. If identity classifier is training relatively slowly, then dynamic weight \(\lambda ^k\) of modality discriminator should be increased to ensure it has more influence on training. When the training rate of different tasks is similar, the correct balance is finally achieved. The dynamic weight \(\lambda ^k\) and the total loss \(L^k\) of the proposed DGRNet can be denoted respectively as
    \(\begin{equation} \lambda ^k = \frac{1}{1 + \left\Vert \frac{\partial L_{id}}{\partial \theta _y}\right\Vert _2^{k-1} }, \end{equation}\)
    (16)
    \(\begin{equation} L^k = L_B^k + \lambda ^k L_d^k, \end{equation}\)
    (17)
    where \(k\) is the current iteration and \(\Vert \tfrac{\partial L_{id}}{\partial \theta _y}\Vert _2\) represents the \(L_2\) norm of the gradient of identity loss \(L_{id}\) with respect to the parameters of identity classifier \(\theta _y\) . \(\Vert \tfrac{\partial L_{id}}{\partial \theta _y}\Vert _2\) reflects the optimization degree for the identity discriminability of pedestrian features. By continuously monitoring \(\Vert \tfrac{\partial L_{id}}{\partial \theta _y}\Vert _2\) , we can construct dynamic weights based on the identity discriminability of pedestrian features, thereby adaptively balancing the adversarial process. Take an example, when the gradient of identity classifier is too big, we can know that \(\lambda ^k\) is small at this time from Equation (16), and do not introduce modality discriminator for the adversary training too much. In this way, we can (1) reason about the relative importance of the target loss contribution through the gradient of identity classifier and then (2) dynamically adjust the target loss contribution so that the different tasks train at suitable rates. The dynamic optimization details are illustrated in Algorithm 1. During the dynamic update process, the dynamic weights \(\lambda ^k\) can be accordingly computed once after each epoch iteration, while the modality discriminator loss \(L_d\) is gradually introduced into the overall learning. When the training converges, DGRNet will learn a rather robust dynamic weight and achieve optimal adversarial training. Different from other domain adaption methods with gradient reversal [9], we use \(\lambda ^k\) to dynamically adjust the target loss contribution of the proposed DGRNet, without involving hyper-parameter tuning, and we only need to run the whole network once to get the stable result.
    Fig. 4.
    Fig. 4. Dynamically adjust the loss contribution by dynamic weight \(\lambda ^k\) . The blue line represents the back propagation gradient of the identity classifier, and \(\frac{\partial L_{id}}{\partial \theta _y}\) is the gradient of identity classifier. The green circles denote the contribution of loss terms, the loss contribution of identity classifier takes a fix value, while the loss contribution of modality discriminator varies dynamically according to the gradient of identity classifier.

    4 Experiments Results and Analyses

    In this section, extensive experiments are conducted to evaluate the effectiveness of proposed DGRNet to enhance the discriminative common representation learning. In the experiments reported blow, to verify the effectiveness of proposed approach, we make the comparison of our proposed DGRNet and the state-of-the-art approaches on the SYSU-MM01 [42] and RegDB [31] datasets. Then we conduct further analysis to investigate the performance of DGRNet in more detail.

    4.1 Datasets and Evluation Metrics

    Datasets. The proposed DGRNet is evaluated on two public VI-ReID datasets, SYSU-MM01 and RegDB. In it, (1) SYSU-MM01 is a large-scale VI Re-ID dataset, and 491 identities captured in outdoor and indoor environment are included. The images are obtained by two near-infrared and four visible cameras. Three hundred ninety-five persons are contained in the training set, which includes 11,909 infrared images and 22,258 visible images. Ninety-six persons are contained in the testing set. There are all-search mode and indoor-search mode. In the indoor-search mode, Cam 1, Cam 2, Cam 3, and Cam6 are used to capture indoor images. In the all-search mode, the pictures collected by Cam 1 to Cam 6 are used. For both modes, the gallery set consists of visible images, and the probe set consists of infrared images. We adopt both the single-shot and multi-shot settings, where only 1 or 10 images in the gallery set can be matched with the anchor image. In this article, single-shot indoor-search mode and the single-shot all-search mode and are adopted as the evaluation protocol. (2) The RegDB dataset includes 412 persons. Each person has 10 visible images and 10 infrared images.
    Settings. For SYSU-MM01, the training set has 395 persons, and the testing set has 96 persons. In the testing set, there are 3,803 infrared images were constructed for query, and 301 visible images were randomly selected from testing set for gallery set. For RegDB, it is randomly split in half; one half is used as a training set and another half is used as a testing set, and then we follow the evaluation protocol. For testing, the images from visible/infrared modality are used to form the gallery set, while the images of the infrared/visible modality are used to form the probe set. The above evaluation will be repeated 10 times to achieve a statistically stable result.
    Evaluation metrics. For indicating the performance of the model, we used cumulative matching characteristic (CMC) [30] and mean average precision (mAP) [61], the reason to use mAP is that one person in the gallery set has multiple ground truths.

    4.2 Implement Details

    The ResNet-50, which is pre-trained on ImageNet, is adopted as our CNN backbone. We set the last stride of convolution of ResNet-50 as 1, and thus the feature map with enlarged spatial size (18 \(\times\) 9) is obtained. This operation increases the computational cost of network, while no additional training parameters are involved. It should be noted that the increase in spatial resolution leads to significant improvement of the performance. Furthermore, we use one fully connected layer for identity prediction, where the size is set as 2,048. The modality discriminator is constructed by two fully connected layers, where the size is set as (2,048–1,024).
    For input images, the size of the input images is resized to 288 \(\times\) 144, and random horizontal flipping, random crop with zero-padding, random erasing [64], and random channel exchangeable augmentation [50] for data augmentation are performed on the input data. The batch size is set to 64 for both datasets, which contains 32 visible images and 32 infrared images. SGD is utilized for optimization, and the momentum is set to 0.9. Meanwhile, to bootstrap the network for enhancing performance, we use the warm-up strategy from Reference [26]. In experiments, in the first 10 epochs, the learning rate grows linearly from 0.01 to 0.1, and in the following, it decays to 0.01 at the 20th epoch, and then decays to 0.001 at the 50th epoch. At the epoch \(k\) , the learning rate \(\mu (k)\) is computed as
    \(\begin{equation} \mu (k)=\left\lbrace \begin{aligned}0.01\times k & , & if & &0\lt k\le 10 \\ 0.1 & , & if & &10\lt k\le 20 \\ 0.01 & , & if & &20\lt k\le 50 \\ 0.001 & , & if & &50\lt k\le 80, \end{aligned} \right. \end{equation}\)
    (18)
    where the training epoch is \(k\) for RegDB dataset and the SYSU-MM01 dataset is set to 80. For the \(PK\) sampling strategy, \(P\) and \(K\) are set to 8 and 4, respectively; \(\rho\) is set to 0.3, and \(\rho _{cc}\) is set to 0.7. We set \(\eta _{1}=1\) , \(\eta _{2}=0.1\) for RegDB, and \(\eta _{1}=0.1\) , \(\eta _{2}=1\) for SYSU-MM01.

    4.3 Ablation Study

    We adopt the feature extractor (two-stream structure) and identity classifier as our baseline method. We evaluate the effectiveness of proposed DGRNet from five aspects: the effectiveness of two-stream backbone network setting, the effectiveness of gradient reversal, the effectiveness of the dynamic weight of gradient reversal, convergence analysis, and feature visualizations. In the following experiment, the gradient reversal will work together with modality discriminator.

    4.3.1 The Effectiveness of Two-stream Backbone Network Setting.

    To improve the adaptability of the model to the two modalities of input images and better extract the shared information between modalities, we use a two-stream network with partially shared structure as the feature extractor. ResNet-50 contains a total of five convolutional modules, and the two-stream network splits ResNet-50 into modality-specific layers and modality-shared layers, where the parameters of modality-specific layers are independent, while the parameters of modality-shared layers are shared. To choose a reasonable method for splitting the two-stream network, we conduct experiments with different numbers of modality-specific and modality-shared layers, and Table 1 shows the detailed experimental results. From the experimental results, we can see that for the RegDB and SYSU-MM01 datasets, the most reasonable two-stream network structure is to use two modality-specific layers and three modality-shared layers.
    Table 1.
    SP:SHRegDBSYSU-MM01
    Visible-InfraredAll Search
    Rank-1Rank-10Rank-20mAPRank-1Rank-10Rank-20mAP
    1:489.03 \(\%\) 97.38 \(\%\) 98.64 \(\%\) 79.87 \(\%\) 70.09 \(\%\) 96.11 \(\%\) 98.58 \(\%\) 66.45 \(\%\)
    2:391.26 \(\%\) 97.91 \(\%\) 99.22 \(\%\) 82.02 \(\%\) 71.53 \(\%\) 96.06 \(\%\) 98.62 \(\%\) 68.04 \(\%\)
    3:288.74 \(\%\) 97.18 \(\%\) 98.54 \(\%\) 80.03 \(\%\) 70.29 \(\%\) 96.12 \(\%\) 98.60 \(\%\) 66.48 \(\%\)
    4:185.34 \(\%\) 96.41 \(\%\) 97.96 \(\%\) 76.05 \(\%\) 68.28 \(\%\) 95.51 \(\%\) 98.44 \(\%\) 65.50 \(\%\)
    Table 1. Effectiveness of Different Splits of Two-steam Backbone Network in Terms of mAP ( \(\%\) ) and CMC ( \(\%\) ) on the RegDB and SYSU-MM01 Datasets
    The corresponding best results are in bold. SP denotes the number of modality-specific layers and SH denotes the number of modality-shared layers.

    4.3.2 The Effectiveness of Gradient Reversal.

    Table 2 displays the results by employing baseline only and baseline combined with gradient reversal. In the experiments, the weight \(\lambda\) of gradient reversal was set as 1. The results can be seen clearly in Table 2. For the RegDB and SYSU-MM01 datasets, the combination of baseline and gradient reversal outperforms the baseline. It demonstrates the effectiveness of gradient reversal for guiding the adversarial training of neural networks to reduce the cross-modality discrepancy. Setting the weight of gradient reversal to 1 is only to verify the effectiveness of gradient reversal. To achieve the optimal training of the network, we need to find the best weight \(\lambda\) of gradient reversal.
    Table 2.
    MethodsRegDBSYSU-MM01
    Visible-InfraredAll Search
    Rank-1Rank-10Rank-20mAPRank-1Rank-10Rank-20mAP
    baseline84.96 \(\%\) 95.53 \(\%\) 97.33 \(\%\) 76.65 \(\%\) 64.94 \(\%\) 94.30 \(\%\) 97.81 \(\%\) 61.58 \(\%\)
    baseline+gradient reversal ( \(\lambda =1\) )89.47 \(\%\) 97.48 \(\%\) 98.27 \(\%\) 79.69 \(\%\) 70.05 \(\%\) 95.70 \(\%\) 98.32 \(\%\) 66.17 \(\%\)
    Table 2. Effectiveness of Gradient Reversal in Terms of mAP ( \(\%\) ) and CMC ( \(\%\) ) on the RegDB and SYSU-MM01 datasets.

    4.3.3 The Effectiveness of the Dynamic Weight of Gradient Reversal.

    To evaluate the effectiveness of dynamic weight of gradient reversal, based on the baseline combined with gradient reversal, we set the weight \(\lambda\) of gradient reversal as \(w_p\) (same as in Reference [9]) and \(\lambda ^k\) (dynamic weight mentioned in Section 3.3), respectively. The weight \(w_p\) is initialed at 0 and is gradually change to 1 using the following schedule:
    \(\begin{equation} w_p = \frac{2}{1+\exp (-\gamma \cdot p)} - 1, \end{equation}\)
    (19)
    where \(\gamma\) was set to 10 (results of multiple hyper-parameter tuning) in all experiments (the schedule was not optimized); \(p\) controls the training progress, which changes from 0 to 1 linearly. For the RegDB dataset, we use visible images as query and infrared images as gallery, noting the default setting as “Visible to Infrared.” For the SYSU-MM01 dataset, we take single-shot all-search mode to get the results of the model.
    The results are displayed in Table 3: (1) When the weight of the gradient reversal is \(w_p\) , the network performs much better than the baseline network on both datasets. Since \(w_p\) is used for updating the feature extractor component \(G_f\) , which allows the modality discriminator to be less sensitive to noisy signal at the early stages of the training procedure. The downside is that it always involves hyper-parameter \(\gamma\) tuning (mentioned in Equation (19)). (2) Compared with the gradient reversal using fixed weights in Table 2 and the gradient reversal using weights \(w_p\) [9] in Table 3, our proposed dynamic weights outperform them on RegDB and SYSU-MM01 datasets. The fixed-weight leads to the premature involvement of the modality discriminator in the early training of the network, which is detrimental to the learning of instance-level feature representations. Furthermore, the method with weight \(w_p\) not only introduces additional hyper-parameter \(\gamma\) but also fails to determine the best time to introduce adversarial training. Unlike the two methods mentioned above, our method aims to ensure that after the identity classifier is trained, we gradually increase the modality discriminator to conduct adversarial training with it to better learn the modality-invariant features. Moreover, our network with dynamic weight only needs to run the entire network once to get the stable results, while other approaches require it to be run multiple times to obtain stable results. It demonstrates that the dynamic weight of gradient reversal has ability to further enhance the discriminative common representation learning. Most importantly, it does not need to introduce any additional hyper-parameter tuning.
    Table 3.
    MethodsRegDBSYSU-MM01
    Visible-InfraredAll Search
    Rank-1Rank-10Rank-20mAPRank-1Rank-10Rank-20mAP
    baseline84.96 \(\%\) 95.53 \(\%\) 97.33 \(\%\) 76.65 \(\%\) 64.94 \(\%\) 94.30 \(\%\) 97.81 \(\%\) 61.58 \(\%\)
    baseline+gradient reversal ( \(\lambda =w_p\) )89.95 \(\%\) 97.43 \(\%\) 98.83 \(\%\) 80.23 \(\%\) 70.94 \(\%\) 95.88 \(\%\) 98.54 \(\%\) 66.58 \(\%\)
    baseline+gradient reversal ( \(\lambda =\lambda ^k\) )91.26 \(\%\) 97.91 \(\%\) 99.22 \(\%\) 82.02 \(\%\) 71.53 \(\%\) 96.06 \(\%\) 98.62 \(\%\) 68.04 \(\%\)
    Table 3. Effectiveness of Gradient Reversal with Different Weights in Terms of mAP ( \(\%\) ) and CMC ( \(\%\) ) on RegDB and SYSU-MM01 Datasets
    The corresponding best results are in bold.
    Figure 5 shows the final performance of proposed DGRNet on both two public datasets. Baseline with gradient reversal ( \(\lambda =1\) ), baseline with gradient reversal ( \(\lambda =w_p\) ), and baseline with dynamic weighted gradient reversal ( \(\lambda =\lambda ^k\) ) all perform better than the baseline, respectively. And network with dynamic weighted gradient reversal (DGRNet) achieves the competitive performance by a large margin.
    Fig. 5.
    Fig. 5. Final performances of our proposed DGRNet on (a) RegDB and (b) SYSU-MM01 datasets.

    4.3.4 Convergence Analysis.

    The gradient of identity classifier and the change of dynamic weight \(\lambda ^k\) are evaluated, in this part, to verify the effectiveness of the proposed designed dynamic weight in Equation (16). From Figure 6, we can see that (1) initially, the gradient of identity classifier shows a large value, and the value of the dynamic weight \(\lambda ^k\) is correspondingly small. (2) In the first 20 epochs, the gradient of identity classifier keeps decreasing and the value of the dynamic weight \(\lambda ^k\) increases in negative correlation. (3) The gradient of identity classifier remains a stable value after 20 epochs, and the dynamic weight \(\lambda ^k\) become steady correspondingly. (4) These results demonstrate that the dynamic weights we designed (in Equation (16)) in the actual change process and the expected change process are consistent.
    Fig. 6.
    Fig. 6. The trend of dynamic weight \(\lambda ^k\) and the gradient of identity classifier.
    We also evaluated the convergence of DGRNet and drew a trend chart of the total loss of DGRNet and baseline, respectively, shown in Figure 7. We can see that in the same number of training iterations, DGRNet can also achieve normal convergence, indicating the stability and effectiveness of the designed adversarial process. Compared to the baseline, DGRNet exhibits a larger convergence point, which provides the network with a certain level of robustness and improved tolerance to modal change in input data. Specifically, DGRNet continuously balances the tasks of feature extraction and modality discrimination, enabling better adaptation to the differences between the visible and infrared modalities.
    Fig. 7.
    Fig. 7. The trend of total loss.

    4.3.5 Feature Visualizations.

    To further evaluate the performance of DGRNet, we visualize the features of person (12 classes) learned by baseline, baseline with gradient reversal ( \(\lambda =1\) ), baseline with gradient reversal ( \(\lambda =w_p\) ), and baseline with dynamic weighted gradient reversal ( \(\lambda =\lambda ^k\) ) through using the t-SNE [34] embedding in Figure 8. Red circles denote the visible samples, and blue pentagrams represent the thermal samples. The visualization tells significant conclusions: (1) From the results of baseline in Figure 8(a), we can clearly see that not only are the distributions from the infrared and visible modalities not well aligned, but also different classes are not well-distinguished clearly. (2) As shown in Figure 8(b) and Figure 8(c), compared to Figure 8(a), the feature distances of the same person in different modalities are effectively approximated, demonstrating that gradient reversal enables the network to learn discriminative common representations. But fix weight method and method with weight \(w_p\) still cannot align features very well. (3) For the features learned with our DGRNet, as shown in Figure 8(d), not only are the distributions aligned very well between the visible and infrared modalities, but also it is discriminated more clearly between different classes. In other words, the proposed DGRNet can get better performance. (4) The observations shown above suggest that DGRNet has the ability to learn discriminative common representations better by confusing the modality discrimination.
    Fig. 8.
    Fig. 8. Network activation after visualization with the t-SNE. Panels (a), (b), (c), and (d) are the visualiation of the learned representations on baseline, combination of baseline and gradient reversal ( \(\lambda =1\) ), and combination of baseline and gradient reversal ( \(\lambda =w_p\) ) and DGRNet, respectively.

    4.4 Comparison to the State of the Art

    The proposed DGRNet will be compared with the following state-of-the-art approaches in this section: zero-padding [42], cmGAN [7], eBDTR [49], AlignGAN [35], JSIA-Re-ID [36], cm-SSFT(sq) [25], DDAG [51], HAT [53], AGW [52], DDSN [5], NFS [4], MCLNet [11], FBP-AL [40], HMML_T [55], DTRM [47], TSME [24], SPOT [3], FMCNet [57], FAM+NNCLoss [43], and TVTR [45]. The comparison results on the SYSU-MM01 and RegDB datasets are respectively shown in Table 4 and Table 5, which is judged based on the Rank-1, 10, 20 accuracies of CMC and mAP. The details are given as follows.
    Table 4.
    SettingsAll-SearchIndoor-Search
    MethodVenuer = 1r = 10r = 20mAPr = 1r = 10r = 20mAP
    Zero-Pading[42]ICCV1714.8054.1271.3315.9520.5868.3885.7926.92
    cmGAN[7]IJCAI1826.9767.5180.5627.8031.6377.2389.1842.19
    eBDTR[49]TIFS1927.8267.3481.3428.4232.4677.4289.6242.46
    AlignGAN[35]ICCV1942.485.093.740.745.987.694.454.3
    JSIA-Re-ID[36]AAAI2038.1080.7089.9036.9043.8086.2094.2052.90
    cm-SSFT(sq)[25]CVPR2047.7054.10
    DDAG[51]ECCV2054.7590.3995.8153.0261.0294.0698.4167.98
    HAT[53]TIFS2055.2992.1497.3653.8962.1095.7599.2069.37
    DDSN[5]ISCAS2146.1686.3494.9746.92
    AGW[52]TPAMI2147.5084.3992.1447.6554.1791.1495.9862.97
    NFS[4]CVPR2156.9191.3496.5255.4562.7996.5399.0769.79
    MCLNet[11]CVPR2165.4093.3397.1461.9872.5696.9899.2076.58
    FBP-AL[40]TNNLS2254.1486.0493.0350.20
    HMML_T[55]ACM2261.9692.5197.0759.62
    DTRM[47]TIFS2263.0393.8297.5658.6366.3595.5898.8071.76
    TSME[24]TCSVT2264.2395.1998.7361.2164.8096.9299.3171.53
    SPOT[3]TIP2265.3492.7397.0462.2569.4296.2299.1274.63
    FMCNet[57]CVPR2266.3462.5168.1574.09
    FAM+NNCLoss[43]SPL2355.7587.5193.2751.5258.2491.0896.4265.65
    TVTR[45]ICASSP2365.3095.4198.7464.1572.2177.94
    DGRNetOurs71.5396.0698.6268.0477.4998.6199.7981.51
    Table 4. Comparison with the States of the Art on the SYSU-MM01 Datasets
    Re-identification rates ( \(\%\) ) at Rank-r and mAP ( \(\%\) ).
    The corresponding best results are in bold.
    Table 5.
    SettingsVisible to InfraredInfrared to Visible
    MethodVenuer = 1r = 10r = 20mAPr = 1r = 10r = 20mAP
    Zero-Pading[42]ICCV1717.7534.2144.3518.9016.6334.6844.2517.82
    eBDTR[49]TIFS1934.6258.9668.7233.4634.2158.7468.6432.49
    AlignGAN[35]ICCV1957.953.656.353.4
    JSIA-Re-ID[36]AAAI2048.549.348.148.9
    cm-SSFT(sq)[25]CVPR2072.372.971.071.7
    DDAG[51]ECCV2069.3486.1991.4963.4668.0685.1590.3161.80
    HAT[53]TIFS2071.8387.1692.1667.5670.0286.4591.6166.30
    AGW[52]TPAMI2170.0566.37
    DDSN[5]ISCAS2179.3288.6195.9275.37
    NFS[4]CVPR2180.5491.9695.0772.1077.9590.4593.6269.79
    MCLNet[11]CVPR2180.3192.7096.0373.0775.9390.9394.5969.49
    FBP-AL[40]TNNLS2273.9889.7193.6968.2470.0589.2293.8866.61
    DTRM[47]TIFS2279.0992.2595.6670.0978.0291.7595.1969.56
    SPOT[3]TIP2280.3593.4896.4472.4679.3792.7996.0172.26
    HMML_T[55]ACM2282.9794.0396.4277.56
    TSME[24]TCSVT2287.3597.1098.9076.9486.4196.3998.2075.70
    FMCNet[57]CVPR2289.1284.4388.3883.86
    TVTR[45]ICASSP2384.179.583.778.0
    FAM+NNCLoss[43]SPL2387.3195.6797.4976.7084.8194.3396.4874.73
    DGRNetOurs91.2697.9199.2282.0287.4896.7098.5080.75
    Table 5. Comparison with the States of the Art on the RegDB Datasets
    Re-identification rates ( \(\%\) ) at Rank-r and mAP ( \(\%\) ).
    The corresponding best results are in bold.
    On the SYSU-MM01 dataset, the DGRNet achieves Rank-1 scores of 71.53 \(\%\) and 77.49 \(\%\) in single-shot all-search mode and single-shot indoor-search mode, better than FMCNet [57] by 5.19 \(\%\) and 9.34 \(\%\) and better than FAM+NNCLoss [43] by 15.78 \(\%\) and 19.25 \(\%\) , respectively. Compared with the method TVTR [45] in single-shot all-search mode, the proposed DGRNet surpassed TVTR [45] by 6.23 \(\%\) on Rank-1 score and by 3.89 \(\%\) on mAP score. AGW [52] is also designed based on top of eBDTR [49], but it performs worse than the proposed DGRNet by a large margin. The DGRNet improves 24.03 \(\%\) on Rank-1 score and 20.39 \(\%\) on mAP score. When compared with some representative adversarial methods, our method also demonstrates superior performance. For example, both DGRNet and cmGAN [7] utilize ResNet-50 as person feature extractor, and cmGAN adopts adversarial training to learn discriminator common representation as well. However, the DGRNet performs much better than cmGAN by 44.56 \(\%\) on Rank-1 score and 40.24 \(\%\) on mAP score. Moreover, DGRNet does not require searching for excessive hyperparameters for stable adversarial training like cmGAN. Compared to MCLNet [11], the Rank-1 accuracy is improved by 6.13 \(\%\) in single-shot all-search mode. Excepting the adversarial training to confuse the features of two modalities, MCLNet also exploited the camera label information for further improvement. It is worth noting that we compare DGRNet with MCLNet (Base+MCM, only adopt adversarial training to confuse the features of two modalities, the results can also be seen in Reference [11]), the Rank-1 score and mAP score of the DGRNet are improved by 20.07 \(\%\) and 18.2 \(\%\) , respectively. TSME [24] proposed a new deeper skip-connection generative adversarial networks as an image generator and generated high-quality cross modality images through adversarial training to alleviate modality discrepancy. In single-shot all-search mode, compared with TSME, the Rank-1 score and mAP score of the DGRNet are improved by 7.3 \(\%\) and 6.83 \(\%\) , respectively. Moreover, DGRNet has a simpler network architecture that does not involve complex image generation processes, and it does not require training in stages like TSME.
    On the RegDB dataset, the DGRNet achieves the Rank-1 scores of 91.26 \(\%\) and 87.48 \(\%\) in visible-to-infrared and infrared-to-visible modes, better than TSME [24] by 3.91 \(\%\) and 1.07 \(\%\) , respectively, and better than TVTR [45] by 7.16 \(\%\) and 3.78 \(\%\) , respectively. Compared with AGW [52], MCLNet [11], FBP-AL [40], DTRM [47], and SPOT [3], the Rank-1 score and mAP score of the DGRNet are improved more than 10 \(\%\) . It demonstrates that the proposed DGRNet containing dynamic weight of gradient reversal has ability to enhance the discriminative common representation learning on VI Re-ID task.
    Notably, our model only adopts global features. Global features refer to the features extracted from the entire image, which can preserve the overall structural information of the image. Local features, however, refer to the features extracted from certain regions of the image, which can better uncover the detailed information of the image. However, compared to global features, local features require greater computational cost. Therefore, our work focuses on global features, attempting to improve model performance without increasing computation. From comparison results listed in Table 4 and Table 5, it is demonstrated that the proposed DGRNet method with only global features achieves good performance compared to the global-feature-based SOTA methods, such as AlignGAN [35], JSIA-Re-ID [36], HAT [53], DDSN [5], AGW [52], MCLNet [11], HMML_T [55], and FMCNet [57], and even outperforming majority of local-feature-based methods, such as the DDAG [51], cm-SSFT(sq) [25], NFS [4], FBP-AL [40], DTRM [47], SPOT [3], and TSME [24]. This indicates that we were able to achieve competitive performance by only optimizing global features without increasing the computational cost.

    5 Conclusion

    This article focuses on a challenging, newly developing task: VI Re-ID. In this work, the DGRNet, based on dynamic weighted gradient reversal, is proposed to help deep networks learn enhanced discriminative common representations from different modalities by confusing the modality discrimination. The proposed dynamic weight for gradient reversal not only dynamically and adaptively evaluates the significance of the target loss term, so that sharable features are learned better through adversarial training, but also involves no hyper-parameter tuning. We conduct feature visualization and extensive experiments to verify the effectiveness of DGRNet. The results demonstrate that our adversarial method with dynamic weighted gradient reversal can better confuse the two modalities and thereby enhance discriminative common representation learning.
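    For readers unfamiliar with gradient reversal, the following minimal PyTorch sketch shows the underlying mechanism (identity in the forward pass, negated and rescaled gradient in the backward pass). It is an illustration under our own naming, not the article's released code, and the scalar weight is only a placeholder for the dynamic weight that DGRNet computes from the loss terms.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    # Identity in the forward pass; negated, rescaled gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, weight):
        ctx.weight = weight
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient from the modality discriminator is reversed and scaled by
        # the (dynamic) weight; the weight itself receives no gradient.
        return -ctx.weight * grad_output, None

def grad_reverse(x: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, weight)

# Shared features go to the identity classifier directly, and to the modality
# discriminator through the reversal layer. The fixed 0.5 below is a placeholder;
# DGRNet instead derives this weight dynamically and adaptively at each step.
features = torch.randn(8, 2048, requires_grad=True)
modality_logits = torch.nn.Linear(2048, 2)(grad_reverse(features, weight=0.5))
```

    With such a layer in front of the modality discriminator, minimizing the discriminator's loss pushes the shared feature extractor toward modality-confusing representations, which is the adversarial effect the dynamic weight modulates.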

    References

    [1]
    Xiang Bai, Mingkun Yang, Tengteng Huang, Zhiyong Dou, Rui Yu, and Yongchao Xu. 2020. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recogn. 98 (2020), 107036.
    [2]
    Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Mach. Learn. 79, 1 (2010), 151–175.
    [3]
    Cuiqun Chen, Mang Ye, Meibin Qi, Jingjing Wu, Jianguo Jiang, and Chia-Wen Lin. 2022. Structure-aware positional transformer for visible-infrared person re-identification. IEEE Trans. Image Process. 31 (2022), 2352–2364.
    [4]
    Yehansen Chen, Lin Wan, Zhihang Li, Qianyan Jing, and Zongyuan Sun. 2021. Neural feature search for rgb-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 587–597.
    [5]
    Yunzhou Cheng, Xinyi Li, Guoqiang Xiao, Wenzhuo Ma, and Xinye Gou. 2021. Dual-path deep supervision network with self-attention for visible-infrared person re-identification. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’21). IEEE, 1–5.
    [6]
    Seokeon Choi, Sumin Lee, Youngeun Kim, Taekyung Kim, and Changick Kim. 2020. Hi-CMD: Hierarchical cross-modality disentanglement for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10257–10266.
    [7]
    Pingyang Dai, Rongrong Ji, Haibin Wang, Qiong Wu, and Yuyu Huang. 2018. Cross-modality person re-identification with generative adversarial training. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’18), Vol. 1. 6.
    [8]
    Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. 2018. Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimedia Comput. Commun. Appl. 14, 4 (2018), 1–18.
    [9]
    Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning. PMLR, 1180–1189.
    [10]
    Guangwei Gao, Hao Shao, Fei Wu, Meng Yang, and Yi Yu. 2022. Leaning compact and representative features for cross-modality person re-identification. WWW J. (2022), 1–18.
    [11]
    Xin Hao, Sanyuan Zhao, Mang Ye, and Jianbing Shen. 2021. Cross-modality person re-identification via modality confusion and center aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16403–16412.
    [12]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
    [13]
    Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737. Retrieved from https://arxiv.org/abs/1703.07737.
    [14]
    Weipeng Hu and Haifeng Hu. 2020. Adversarial disentanglement spectrum variations and cross-modality attention networks for NIR-VIS face recognition. IEEE Trans. Multimedia 23 (2020), 145–160.
    [15]
    Weipeng Hu and Haifeng Hu. 2020. Dual adversarial disentanglement and deep representation decorrelation for NIR-VIS face recognition. IEEE Trans. Inf. Forens. Secur. 16 (2020), 70–85.
    [16]
    Nianchang Huang, Jianan Liu, Qiang Zhang, and Jungong Han. 2021. Exploring modality-shared appearance features and modality-invariant relation features for cross-modality person re-identification. arXiv:2104.11539. Retrieved from https://arxiv.org/abs/2104.11539.
    [17]
    Kajal Kansal, A. Venkata Subramanyam, Zheng Wang, and Shin’ichi Satoh. 2020. SDL: Spectrum-disentangled representation learning for visible-infrared person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 30, 10 (2020), 3422–3432.
    [18]
    Vladimir V. Kniaz, Vladimir A. Knyaz, Jiri Hladuvka, Walter G. Kropatsch, and Vladimir Mizginov. 2018. Thermalgan: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset. In Proceedings of the European Conference on Computer Vision (ECCV’18) Workshops. 0–0.
    [19]
    Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. 2020. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4610–4617.
    [20]
    Yaoyu Li, Hantao Yao, Tianzhu Zhang, and Changsheng Xu. 2020. Part-based structured representation learning for person re-identification. ACM Trans. Multimedia Comput. Commun. Appl. 16, 4 (2020), 1–22.
    [21]
    Wenqi Liang, Guangcong Wang, Jianhuang Lai, and Xiaohua Xie. 2021. Homogeneous-to-heterogeneous: Unsupervised learning for rgb-infrared person re-identification. IEEE Trans. Image Process. 30 (2021), 6392–6407.
    [22]
    Haijun Liu, Jian Cheng, Wen Wang, Yanzhou Su, and Haiwei Bai. 2020. Enhancing the discriminative feature learning for visible-thermal cross-modality person re-identification. Neurocomputing 398 (2020), 11–19.
    [23]
    Haijun Liu, Xiaoheng Tan, and Xichuan Zhou. 2020. Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification. IEEE Trans. Multimedia 23 (2020), 4414–4425.
    [24]
    Jianan Liu, Jialiang Wang, Nianchang Huang, Qiang Zhang, and Jungong Han. 2022. Revisiting modality-specific feature compensation for visible-infrared person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 32, 10 (2022), 7226–7240.
    [25]
    Yan Lu, Yue Wu, Bin Liu, Tianzhu Zhang, Baopu Li, Qi Chu, and Nenghai Yu. 2020. Cross-modality person re-identification with shared-specific feature transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13379–13389.
    [26]
    Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 0–0.
    [27]
    Hao Luo, Wei Jiang, Xing Fan, and Chi Zhang. 2020. Stnreid: Deep convolutional networks with pairwise spatial transformer networks for partial person re-identification. IEEE Trans. Multimedia 22, 11 (2020), 2905–2913.
    [28]
    Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu. 2019. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimedia 22, 10 (2019), 2597–2609.
    [29]
    Mandi Luo, Xin Ma, Zhihang Li, Jie Cao, and Ran He. 2021. Partial NIR-VIS heterogeneous face recognition with automatic saliency search. IEEE Trans. Inf. Forens. Secur. 16 (2021), 5003–5017.
    [30]
    Hyeonjoon Moon and P. Jonathon Phillips. 2001. Computational and performance aspects of PCA-based face-recognition algorithms. Perception 30, 3 (2001), 303–321.
    [31]
    Dat Tien Nguyen, Hyung Gil Hong, Ki Wan Kim, and Kang Ryoung Park. 2017. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 17, 3 (2017), 605.
    [32]
    Yifan Sun, Liang Zheng, Yali Li, Yi Yang, Qi Tian, and Shengjin Wang. 2019. Learning part-based convolutional features for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3 (2019), 902–917.
    [33]
    Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474. Retrieved from https://arxiv.org/abs/1412.3474.
    [34]
    Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 11 (2008).
    [35]
    Guan’an Wang, Tianzhu Zhang, Jian Cheng, Si Liu, Yang Yang, and Zengguang Hou. 2019. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3623–3632.
    [36]
    Guan-An Wang, Tianzhu Zhang, Yang Yang, Jian Cheng, Jianlong Chang, Xu Liang, and Zeng-Guang Hou. 2020. Cross-modality paired-images generation for RGB-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12144–12151.
    [37]
    Pingyu Wang, Zhicheng Zhao, Fei Su, Yanyun Zhao, Haiying Wang, Lei Yang, and Yang Li. 2020. Deep multi-patch matching network for visible thermal person re-identification. IEEE Trans. Multimedia 23 (2020), 1474–1488.
    [38]
    Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin’ichi Satoh. 2019. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 618–626.
    [39]
    Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. 2021. Syncretic modality collaborative learning for visible infrared person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 225–234.
    [40]
    Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. 2022. Flexible body partition-based adversarial learning for visible infrared person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 33, 9 (2022), 4676–4687.
    [41]
    Ancong Wu, Wei-Shi Zheng, Shaogang Gong, and Jianhuang Lai. 2020. Rgb-ir person re-identification by cross-modality similarity preservation. Int. J. Comput. Vis. 128, 6 (2020), 1765–1785.
    [42]
    Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. 2017. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. 5380–5389.
    [43]
    Baotai Wu, Yujian Feng, Yunfei Sun, and Yimu Ji. 2023. Feature aggregation via attention mechanism for visible-thermal person re-identification. IEEE Sign. Process. Lett. 30 (2023), 140–144.
    [44]
    Qiong Wu, Pingyang Dai, Jie Chen, Chia-Wen Lin, Yongjian Wu, Feiyue Huang, Bineng Zhong, and Rongrong Ji. 2021. Discover cross-modality nuances for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4330–4339.
    [45]
    Bin Yang, Jun Chen, and Mang Ye. 2023. Top-K visual tokens transformer: Selecting tokens for visible-infrared person re-identification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 1–5.
    [46]
    Hanrong Ye, Hong Liu, Fanyang Meng, and Xia Li. 2020. Bi-directional exponential angular triplet loss for RGB-infrared person re-identification. IEEE Trans. Image Process. 30 (2020), 1583–1595.
    [47]
    Mang Ye, Cuiqun Chen, Jianbing Shen, and Ling Shao. 2022. Dynamic tri-level relation mining with attentive graph for visible infrared re-identification. IEEE Trans. Inf. Forens. Secur. 17 (2022), 386–398.
    [48]
    Mang Ye, Xiangyuan Lan, Jiawei Li, and Pong Yuen. 2018. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
    [49]
    Mang Ye, Xiangyuan Lan, Zheng Wang, and Pong C. Yuen. 2019. Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Trans. Inf. Forens. Secur. 15 (2019), 407–419.
    [50]
    Mang Ye, Weijian Ruan, Bo Du, and Mike Zheng Shou. 2021. Channel augmented joint learning for visible-infrared recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13567–13576.
    [51]
    Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, and Jiebo Luo. 2020. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In European Conference on Computer Vision. Springer, 229–247.
    [52]
    Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. 2021. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6 (2021), 2872–2893.
    [53]
    Mang Ye, Jianbing Shen, and Ling Shao. 2020. Visible-infrared person re-identification via homogeneous augmented tri-modal learning. IEEE Trans. Inf. Forens. Secur. 16 (2020), 728–739.
    [54]
    Mang Ye, Zheng Wang, Xiangyuan Lan, and Pong C. Yuen. 2018. Visible thermal person re-identification via dual-constrained top-ranking. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’18), Vol. 1. 2.
    [55]
    La Zhang, Haiyun Guo, Kuan Zhu, Honglin Qiao, Gaopan Huang, Sen Zhang, Huichen Zhang, Jian Sun, and Jinqiao Wang. 2022. Hybrid modality metric learning for visible-infrared person re-identification. ACM Trans. Multimedia Comput. Commun. Appl. 18, 1s (2022), 15.
    [56]
    Peng Zhang, Jingsong Xu, Qiang Wu, Yan Huang, and Jian Zhang. 2019. Top-push constrained modality-adaptive dictionary learning for cross-modality person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 30, 12 (2019), 4554–4566.
    [57]
    Qiang Zhang, Changzhou Lai, Jianan Liu, Nianchang Huang, and Jungong Han. 2022. FMCNet: Feature-level modality compensation for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7339–7348.
    [58]
    Shizhou Zhang, Yifei Yang, Peng Wang, Guoqiang Liang, Xiuwei Zhang, and Yanning Zhang. 2021. Attend to the difference: Cross-modality person re-identification via contrastive correlation. IEEE Trans. Image Process. 30 (2021), 8861–8872.
    [59]
    La Zhang, Haiyun Guo, Kuan Zhu, Honglin Qiao, Gaopan Huang, Sen Zhang, Huichen Zhang, Jian Sun, and Jinqiao Wang. 2022. Hybrid modality metric learning for visible-infrared person re-identification. ACM Trans. Multimedia Comput. Commun. Appl. (2022).
    [60]
    Zhiwei Zhao, Bin Liu, Qi Chu, Yan Lu, and Nenghai Yu. 2021. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3520–3528.
    [61]
    Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. 1116–1124.
    [62]
    Liang Zheng, Yi Yang, and Qi Tian. 2017. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40, 5 (2017), 1224–1244.
    [63]
    Xian Zhong, Tianyou Lu, Wenxin Huang, Mang Ye, Xuemei Jia, and Chia-Wen Lin. 2021. Grayscale enhancement colorization network for visible-infrared person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 32, 3 (2021), 1418–1430.
    [64]
    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 13001–13008.
