Location via proxy:   [ UP ]  
[Report a bug]   [Manage cookies]                
skip to main content
research-article
Open access

Dynamic Weighted Gradient Reversal Network for Visible-infrared Person Re-identification

Published: 25 August 2023 Publication History
  • Get Citation Alerts
  • Abstract

    Due to intra-modality variations and cross-modality discrepancy, visible-infrared person re-identification (VI Re-ID) is an important and challenging task in intelligent video surveillance. The cross-modality discrepancy is mainly caused by the differences between visible images and infrared images, the inherent essence of which is heterogeneous. To alleviate this discrepancy, we propose a Dynamic Weighted Gradient Reversal Network (DGRNet) to enhance the learning of discriminative common representations by confusing the modality discrimination. In the proposed DGRNet, we design the gradient reversal model guiding adversarial training between identity classifier and modality discriminator to reduce the modality discrepancy of the same person in different modalities. Furthermore, we propose an optimization training method, that is, designing dynamic weight of gradient reversal to achieve optimal adversarial training, and dynamic weight has the ability to dynamically and adaptively evaluate the significance of target loss term, without involving hyper-parameter tuning. Extensive experiments were conducted on two public VI Re-ID datasets, SYSU-MM01 and RegDB. The experimental results show that the proposed DGRNet outperforms state-of-the-art methods and demonstrate the effectiveness of the DGRNet to learn more discriminative common representations for VI Re-ID.

    1 Introduction

    Person re-identification (Re-ID) [1, 8, 32, 37] is an important research field in intelligent video surveillance. Given a specific person, Re-ID task aims to search for it from the gallery set, which contains person images captured by non-overlapping surveillance cameras [62]. Recent research efforts mainly focus on the visible camera module [20, 27]. However, when ambient light is poor or unavailable, visible-visible (VV) Re-ID will be restricted in surveillance. Besides, in the same condition, visible images will be uninformative. In such a case, imaging device such as infrared cameras, with no need to rely on visible light should be applied. In practical scenarios, we can consist the probe set by visible/infrared images, which are captured by visible/infrared cameras in the daytime/nighttime. Also, the gallery set can be constructed by infrared/visible images, which are captured by infrared/visible cameras in the nighttime/daytime [28]. Therefore, the task of Re-ID in video surveillance has a new and challenging requirement, performing person matching in cross-modality, which is also called visible-infrared person re-identification (VI Re-ID).
    There is a big issue that needs more attention in the VI Re-ID task. Like face recognition tasks in cross-modality scenarios [14, 15, 29], in the VI Re-ID task, the huge cross-modality discrepancy is a major hindrance. The cross-modality discrepancy arises from two aspects. On the one hand, visible and infrared images are intrinsically distinct. As shown in Figure 1, in the first row, the visible images have three channels containing rich color information from visible light. Conversely, in the second row, the infrared images have only one channel and contain no color information. For this, they are considered as heterogeneous data [42]. On the other hand, in terms of imaging principles, the wavelength ranges of visible images and infrared images are also different.
    Fig. 1.
    Fig. 1. The images in each column indicate the same person in different modalities.
    In the VI Re-ID task [11, 19, 22, 25, 35, 36, 39, 41, 42, 48, 49, 50, 52, 54, 56], to cut down the additional cross-modality discrepancy, the existing methods mainly include the following aspects: metric learning, representation learning, and image translation. Metric learning focuses on designing loss function, which make the distance as small as possible between the same persons and the distance as large as possible between different persons [41, 50]. To achieve the similarity measure, some representation learning approaches map the persons from different modalities into a common feature space [48, 49, 54, 56]. Besides, to reduce the modality discrepancy, some representation learning approaches generate a novel common modality between different modalities [19, 25]. Similarly, for person matching, some image translation approaches take image generator methods based on generative adversarial network (GAN) to produce cross-modality images [35, 36]. But both generated cross-modality images and generated common modality introduce noise [39]. Moreover, the newly generated images not only increase the computation but also introduce more uncertainty for network learning [11]. To avoid the above issues, Hao et al. [11] proposed a modality confusion learning network to ignore modality information and learn common features, and an identity-aware marginal center aggregation strategy and a camera-aware learning scheme are explored for further improvement. Their work achieves competitive performance compared with the states of the art. However, in their work, the loss contribution of modality classifier participates in adversarial training with fixed weights, which is detrimental to the learning of instance-level feature representations at the early training stage, to achieve optimal adversarial training of the network. Their idea of though modality confusion learning to obtain discriminative common feature representation is very innovative and inspires us to find out whether there is a better adversarial learning mechanism to obtain more discriminative modality common features. Motivated by this, we propose a Dynamic Weighted Gradient Reversal network (DGRNet) for enhancing the discriminative common representation learning.
    The contributions of this article mainly include the following three aspects:
    We design the gradient reversal model guiding adversarial training in the VI Re-ID task. The gradient reversal works with modality discriminator to guide adversarial training with identity classifier to reduce the modality discrepancy of the same person in different modalities. This can confuse the feature distributions of visible images and infrared images to learn the discriminative common representations.
    An optimization training method through designing the dynamic weight of gradient reversal model is proposed in this article. It is able to evaluate the contribution of the target loss term adaptively and dynamically and guide the network to find the optimal parameter of different parts during the training and thus to achieve optimal adversarial training without any hyper-parameter tuning. This can for further enhance discriminative common representations learning.
    In extensive experiments on two public datasets, our proposed DGRNet with only global features achieves good results compared to global-feature-based state-of-the-art (SOTA) methods, even outperforming majority of local-feature-based methods. And it is constructed by two-stream CNN structures training in an end-to-end way.
    We organize the rest of this article as follows. In Section 2, the related works are discussed, then, in Section 3, we elaborate on the framework of the proposed DGRNet, loss function, the training algorithm, and optimization. In Section 4, we demonstrate the experimental results of our approach. In Section 5, we finally conclude this article.

    2 Related Work

    The intra-modality variations are the problem that traditional VV Re-ID aims to solve. Compared to it, the discrepancy of cross-modality should be handled additionally in the VI Re-ID task. To mitigate this problem, the existing methods focus on projecting (or translating) the heterogeneous cross-modality person images into a common space for similarity measure, mainly including image translation, metric learning, and network designing.
    Image translation–based VI Re-ID. Image translation usually takes image generation methods, which reduces the domain gap between visible modalities and infrared modalities based on generative adversarial network. Kniaz et al. [18] first translated a visible image into a multimodal thermal probe set by using a set of GANs. Choi et al. [6] proposed hierarchical cross-modality disentanglement framework that consists of an identity-preserving person image generation network and a hierarchical feature learning module. Their goal is to separate ID-discriminative and ID-excluded from cross-modal images. Reference [38] introduced a GAN-based network to address the feature-level. Furthermore, AlignGAN [35] and JSIA-ReID [36] were also GAN-based approaches that implement pixel alignment and feature dual-level alignment. Zhong et al. [63] also proposed a GAN-based approach, GECNet, which preserves the structure of informative colored images to bridge the gap between different modalities. However, the methods to generate cross-modality images and common modality will both introduce noise [39]. Moreover, this types of methods have more performance uncertainty, more computation complexity, and higher demands for training trick [23].
    Metric learning–based VI Re-ID. Metric learning is a critical step for Re-ID to perform a similarity metric. Before the era of deep learning, people studied it by learning a projection matrix or Mahalanobis distance function. Nowadays, the loss function designing has replaced the role of metric learning to guide the learning of feature representation. Wu et al. [41] designed a novel loss, which uses the similarity of same-modality to guide the similarity learning of the inter-modality. For cross modality metric learning, to obtain the stronger discriminability, Ye et al. [50] presented a channel-mixed learning strategy to synchronously address the variations of both cross-modality and intra-modality. Hao et al. [11] introduced a strategy called identity-aware marginal center aggregation, through which the centralization features are extracted, and for further improvement, the camera-aware learning scheme was also proposed, which exploited the camera label information to enhance discriminability. From the angles vector respect, a new ranking loss was proposed in Reference [46], which constrained the angle between the embedding vectors to learn common feature space, and this space is angularly separable. Similarly, Gao et al. [10] proposed a novel loss called Enumerate Angular Triplet loss for studying the angularly discriminative feature embedding and presented a new Cross-Modality Knowledge Distillation loss to narrow down the features between different modalities before feature embedding. To further decrease the cross-modality variations, Huang et al. [16] proposed a novel cross-modality quadruplet loss. ZhangLa et al. [59] provided a comprehensive metric learning framework on the basis of paired-based similarity constraints to deal with all the variations within and across modalities.
    Network designing–based VI Re-ID. In the re-identification task, feature learning is an important step before measuring similarity. Many research works designed deep neural networks to study the feature learning between different modalities. To explore the VI Re-ID task, a deep zero-padding network was proposed in Reference [42], which can learn the invariant feature representations between different modalities. In References [48, 49, 54], two-stream network was proposed, which first extracts the modality-specific features and then uses the share fully connected layers to extract shared features. Similarly, Zhang et al. [56] proposed a model by dictionary learning to project the features of different modalities onto a common space. To alleviate the modality discrepancy, Li et al. [19] and Lu et al. [25] used this method to generate a new modality between different modalities. On the foundation of a traditional two-stream network, Liu et al. [22] enhanced the modality-shared person features by introducing mid-level features incorporation. To reduce the cross-modality discrepancy, Ye et al. [52] proposed AGW, which contains three major components, including non-local attention block, weighted regularization triplet loss, and generalize-mean pooling. From the aspect of adversarial learning, Dai et al. [7] proposed a network (cmGAN) to generate the discriminative common representations by constructing a discriminator to act as an adversary. To explore the information of cross-modality and intra-modality, a dual-attentive network was proposed in Reference [51], which contained an attention module to give different weight of variant body parts. Ye et al. [53] proposed a HAT model to generate a grayscale modality from the homogeneous visible images, and the process of generation does not need any additional training.
    Liang et al. [21] proposed an unsupervised VI Re-ID framework that reduces the modality discrepancy by homogeneous-to-heterogeneous learning and finally produced robust feature representations from different modalities. Considering the intrinsic spatial structures as well as the difference of two modalities, Zhang et al. [58] proposed a dual-path framework for learning cross-modality features. Wei et al. [40] proposed a flexible body partition framework by using of adversarial learning method (FBP-AL) for learning more fine-grained information. For utilizing both multi-level information as well as potential contextual cues as a supplement, Cheng et al. [5] proposed a new network, named Dual-path Deep Supervision Network (DDSN). Moreover, Zhao et al. [60] propose a novel approach to learn the color-irrelevant features through the color-irrelevant consistency learning and align the identity-level feature distributions by the identity-aware modality adaptation. Kansal et al. [17] proposed a novel network to explore the spectrum information in the VI Re-ID task, and the network had two branches: The spectrum dispelling branch was designed to keep the useful identity features, and the spectrum distilling branch was designed to dispel spectrum features. The network structure was then optimized through a multi-stage training strategy.

    3 Proposed Method

    In this section, we will elaborate on the framework of the proposed DGRNet model for VI Re-ID. As shown in Figure 2, the proposed DGRNet mainly comprises four main components: feature extractor \(G_f\) , identity classifier \(G_y\) , the gradient reversal \(GRL\) , and modality discriminator \(G_d\) . The input data are first transformed into high-dimensional representations \(Z\) by the component of feature extractor. Using the learned feature representations, the identity classifier, whose goal is to maximize the prediction accuracy, can obtain the predictions of identity label on all the input data. The modality discriminator is designed to discriminate different modalities. To obtain modality-invariant features, the gradient reversal is inserted between feature extractor and modality discriminator.
    Fig. 2.
    Fig. 2. The overall structure of the proposed DGRNet. DGRNet consists of a deep feature extractor \(G_f\) ; an identity classifier \(G_y\) , \(\theta _y\) is the parameter of \(G_y\) ; a modality discriminator \(G_d\) and a gradient reversal \(GRL\) . \(\lambda ^k\) denotes a dynamic weight, function denotes the method of controlling \(\lambda ^k\) and reverse means multiplying by \(-\) 1. \(\otimes\) denotes the product operator. Feature extractor constructed by ResNet-50 model, where the first two stages are parameter independent and the latter three stages are parameter-shared. GAP stands for global average pooling, \(Z\) is the extracted deep features, BN stands for batch normalization, and C and D are constructed by using fully connected layers. \(\hat{y}\) is the predicted probability of human identities, \(\hat{d}\) is the predicted probability of modalities. \(L_{bhtri}\) denotes hard triplet loss, \(L_{cc}\) denotes center cluster loss, \(L_{id}\) denotes the identity loss, and \(L_d\) denotes the loss of modality discriminator.
    The network model of the DGRNet will be detailed in Section 3.1. Then, loss function is presented in Section 3.2. The training algorithm and optimization will be provided in Section 3.3 and Section 3.4, respectively.

    3.1 The Network Model

    Problem Formulation. Formally speaking, let \(x\) be the input data of the network model, which includes two parts: the infrared images \(x^r\) and the visible images \(x^v\) . For each datum \(x_i\) , a corresponding modality label \(d_i \in \mathcal {M}\) is given, where \(\mathcal {M}\) represents the set that includes all the visible and infrared modalities. Each labeled datum \(x_i\) also has a true identity label \(y_i \in \ varUpsilon\) , where \(\ varUpsilon\) is the set of all identities.

    3.1.1 Baseline.

    As shown in Figure 2, our baseline consists of feature extractor and identity classifier. The two-steam network framework presented in Reference [49] is adopted in feature extractor, which using ResNet-50 [12] to extract features for different modalities. The network parameters of the first two convolutional blocks are used to capture the modality specific information of input images from different modalities, which parameters are independent. To narrow the gap between two heterogeneous modalities, the network parameters of the last three convolutional blocks are employed to learn a multi-modality sharable space, which parameters are shared. Let \(\theta _f\) be the parameters of feature extractor \(G_f\) . Given the input data \(x\) , we can obtain their feature representations \(Z\) as
    \(\begin{equation} Z = G_f(x; \theta _f). \end{equation}\)
    (1)
    Let \(\theta _y\) be the set of the identity classifier’s parameters. As shown in Figure 2, based on the outputs of feature extractor \(Z\) , to get the probability vector \(\hat{y_i}\) of human identities, the softmax layer is used, \(\hat{y_i}\) is definded as
    \(\begin{equation} \hat{y_i} = softmax(FC(BN(Z_i))). \end{equation}\)
    (2)

    3.1.2 Modality Discriminator.

    As shown in Figure 2, a modality discrimination \(G_d\) is constructed, and its parameter is denoted as \(\theta _d\) . The modality discriminator acts as an adversary, whose goal is to judge whether the learned representation vector belongs to the visible modality or infrared modality. It is consisted of a two-layer feed-forward neural network, based on the outputs of feature extractor \(Z\) and the probability vector \(\hat{d_i}\) of modalities as
    \(\begin{equation} \hat{d_i} = \sigma (FC(BN(FC(Z_i)))), \end{equation}\)
    (3)
    where \(\sigma (x)\) is the sigmoid function.

    3.1.3 Gradient Reversal.

    Domain adaptation refers to the process of transferring knowledge from a source domain to a target domain, and it can confuse the data distribution from the source domain and target domain [2, 33]. To learn the transferable feature representations, it was successfully embedded into deep network for reducing the distribution discrepancy of different domains. Moreover, gradient reversal [9] is a way to realize domain adaption by adversarial training. Traditional method achieves adversarial training by constantly adjusting the sign of the identity classifier and modality discriminator [7], but it often leads to unstable network training. In this article, we align the feature distribution through gradient reversal to learn the discriminative common representations. Specifically, we join gradient reversal after the output of feature extractor and before the input of modality discriminator to guide adversarial training. The purple lines describe the process of gradient reversal, as illustrated in Figure 3. During the model training process, gradient reversal flips the gradient of the feature extractor and passes the reversed gradient to update the feature extractor, forcing it to minimize the identity loss while maximizing the modality discriminator loss. This encourages the feature extractor to make the feature distributions between different modalities as similar as possible. Meanwhile, the modality discriminator tries to distinguish the feature distributions between different modalities in an adversarial manner. This process is adversarial in nature, with the feature extractor and the modality discriminator playing against each other, ultimately achieving the effect of modality adaptation.
    Fig. 3.
    Fig. 3. The back propagation process of the proposed DGRNet.

    3.2 Loss Function

    The triplet constraints is imposed to the cross-modality loss to minimize the gap among features of the same person from different modalities. In our method, to simplify calculations and improve the performance of model, the batch hard triplet loss [13] is used in the network training. The main ideas are as follows: \(P\) classes (person identities) are first randomly sampled for batch forming, and for each class (person), \(K\) images are randomly sampled, thus forming the mini-batch of \(2\times P \times K\) images. Now, for each sample \(i\) of a mini-batch, instead of selecting all pairs of samples, we choose the hardest positive and the hardest negative samples as triplets for loss calculation, and the loss \(L_{bhtri}\) is computed as
    \(\begin{equation} L_{bhtri} = \sum _{i=1}^{P} \sum _{a=1}^{2K} \left[\rho + \max _{{{\scriptstyle \begin{matrix} {p=1 \cdots 2K} \\ {a \ne p} \end{matrix}}}} D\left(Z_a^i, Z_p^i\right) - \min _{{{\scriptstyle \begin{matrix} {j=1 \cdots P} \\ {n=1 \cdots 2K} \\ {j \ne i} \end{matrix}}}}D\left(Z_a^i, Z_n^j\right)\right]_+, \end{equation}\)
    (4)
    where \(Z_i^j\) represents the \(j{\rm th}\) feature of the \(i{\rm th}\) person. \(\max D(Z_a^i,Z_p^i)\) denotes the hardest positive, which means the maximum distance of anchor \(Z_a^i\) and positive samples \(Z_p^i\) . \(\min D(Z_a^i,Z_n^j)\) denotes the hardest negative, which means the minimize distance of anchor \(Z_a^i\) and negative samples \(Z_n^j\) .
    The key point of hard sample triplet loss is to optimize the features of hard samples, However, it does not specifically constrain features from the perspective of identity learning. To further alleviate the difference between the modalities of the same identity and increase the feature distance between different identities, the center cluster loss function [44] is applied to guide the identity learning during training as
    \(\begin{equation} \begin{aligned} L_{cc} = &\frac{1}{2PK} \sum _{i=1}^{2PK} \left\Vert Z_{i}-c_{y_{i}}\right\Vert _{2} + \\ &\frac{2}{P(P-1)} \sum _{j=1}^{P-1}\sum _{l=j+1}^{P} \left[\rho _{cc} - \left\Vert c_{y_{j}}-c_{y_{l}}\right\Vert _{2}\right]_+, \end{aligned} \end{equation}\)
    (5)
    where \(c_{yi}\) represents the average center of image features with the label \(y_{i}\) and \(\rho _{cc}\) denotes the minimum margin between all center pairs.
    We used entropy loss to calculate the identity loss \(L_{id}\) ; \(L_{id}\) is defined as
    \(\begin{equation} L_{id} = -\frac{1}{M} \sum _{i=1}^{M} q(i, y_i) \log (\hat{y_i}), \end{equation}\)
    (6)
    where \(M\) denotes the number of human identities. \(q(i, y_i)\) represents the true distribution of sample. When the predicted identity \(i\) is the target identity \(y_i\) , \(q(i, y_i)=1\) ; otherwise, \(q(i, y_i)=0\) . \(\hat{y_i}\) represents the predicted probability of the sample on the \(i{\rm th}\) class.
    The baseline (combination of feature extractor and identity classifier) loss \(L_B\) is denoted as
    \(\begin{equation} L_B = \eta _{1} L_{bhtri} + \eta _{2} L_{cc} + L_{id}, \end{equation}\)
    (7)
    where \(\eta _{1}\) and \(\eta _{2}\) are hype-parameters to balance the contributions of individual loss terms.
    Upon \(\hat{d_i}\) the loss of modality discriminator is defined by binomial cross-entropy loss as
    \(\begin{equation} L_d = -\frac{1}{N}\sum _{i=1}^{N} (d_i * \log (\hat{d_i}) + (1-d_i) * \log (1-\hat{d_i})), \end{equation}\)
    (8)
    where \(N\) represents the number of all samples, \(d_i\) indicates the modality label of the \(i{\rm th}\) samples, and \(\hat{d_i}\) is the modality probability of the \(i{\rm th}\) samples.
    Overall. After introducing the modality discriminator, the loss function of the network is expressed as
    \(\begin{equation} L = \beta L_B + \alpha L_d, \end{equation}\)
    (9)
    where \(L_B\) is the baseline loss and \(L_d\) is the modality discriminator loss. The hyper-parameters \(\alpha\) and \(\beta\) are used to adjust the contribution of different loss terms in the network.

    3.3 The Training Algorithm

    The network combined with baseline and modality discriminator is used to learn a feature extractor that maps an example into a representation allowing the identity classifier to accurately classify human identity, while crippling the ability of the modality discriminator to detect each sample belongs to the visible or infrared modality by adversary training. To achieve this, we maximize the loss \(L_d\) of modality discriminator. This would make samples from different modalities are indistinguishable, and the extracted features are modality invariant. Moreover, we minimize the loss \(L_B\) of baseline to further improve the modality invariance and inter-class discriminative ability of learned features. More formally, the complete optimization of our network is equivalent to solving the following minimization problem as
    \(\begin{equation} E(\theta _f,\theta _y,\theta _d) = \sum _{i=1}^{N} L_B^i(\theta _f,\theta _y)-\lambda \sum _{i=1}^{N} L_d^i(\theta _f,\theta _d), \end{equation}\)
    (10)
    where \(i\) is the iteration number and \(\lambda\) is a hyper-parameter that is used to trade off the two objectives in the optimization problem. We can solve the above minimization problem based on the following stochastic updates method:
    \(\begin{equation} \theta _f \leftarrow \theta _f-\mu \left(\frac{\partial L_B^i}{\partial \theta _f}-\lambda \frac{\partial L_d^i}{\partial \theta _f}\right), \end{equation}\)
    (11)
    \(\begin{equation} \theta _y \leftarrow \theta _y-\mu \frac{\partial L_{id}}{\partial \theta _y}, \end{equation}\)
    (12)
    \(\begin{equation} \theta _d \leftarrow \theta _d-\mu \frac{\partial L_d^i}{\partial \theta _d}, \end{equation}\)
    (13)
    where \(\mu\) represents learning rate. Excepting the factor \(-\lambda\) in Equation (11), the update process of Equations (11)–(13) is formally like the stochastic gradient descent (SGD) method. To update the parameters in Equations (11)–(13) with the standard SGD method, the gradient reversal is introduced between the modality discriminator and feature extractor, as shown in Figure 3.
    The gradient reversal does not have any parameter to learn. It was treated as an identity transformation during the forward propagation, whereas, during the backpropagation, the gradient reversal takes the gradient from the subsequent layer and changes its sign, multiplies it by \(\lambda\) , and passes it to the preceding layer. We can formally treat the gradient reversal as a “pseudo-function” by two equations as
    \(\begin{equation} GRL(Z) = Z, \end{equation}\)
    (14)
    \(\begin{equation} \frac{\partial GRL}{\partial Z} = -\lambda I, \end{equation}\)
    (15)
    where \(I\) denotes an identity matrix. Based on the pseudo-function \(GRL\) , the update process of Equations (11)–(13) can then be implemented as doing standard SGD. During the backpropagation, the gradient reversal ensures the gradients from the baseline and modality discriminator are subtracted and leads to the emergence of the following features: modality invariance and inter-class discrimination. The feature distributions are similar over the visible modality, and infrared modality is ensured, but as indistinguishable as possible for the modality discriminator, thus producing the modality-invariant features. In a word, through using gradient reversal, the baseline and the modality discriminator are competing against each other, by adversarial training, over the objective of Equation (10). But \(\lambda\) still introduces hyper-parameter tuning.

    3.4 Optimization

    Dynamic weighted gradient reversal. During the network training, the loss value of \(L_B\) and \(L_d\) in Equation (11) is different. If \(\alpha\) and \(\beta\) take a fix value, then the imbalances loss contribution impedes proper training and finally results in suboptimal training. So it is necessary to adjust the parameters \(\alpha\) and \(\beta\) to achieve the optimal training. For this, we generally take many attempts to find the most suitable parameters for network training. To achieve optimal adversarial training, we design a dynamic weight for gradient reversal to adaptively and dynamically evaluate the significance of the target loss term during the training to further enhance learning of the discriminative common representations. In this way, the modality discriminator is introduced gradually by dynamically adjusting the weights of the loss terms so that the network can learn the instance features better at the early training stage and achieve optimal training of the network.
    We design a dynamic weight for gradient reversal, which is inspired by multi-task learning. Our fundamental idea is to treat the loss of cross-modality person re-identification \(L_B\) as the dominant loss, and then the loss of modality discriminator \(L_d\) is gradually introduced for optimization. The main reason for doing this is that, at an early training stage, it is easier to learn the instance-level feature representations guided with \(L_B\) . Then, based on the optimization degree of the identity classifier, through controlling the weight factor of \(L_d\) , we gradually increase the modality discriminator to conduct adversarial training with the identity classifier to better learn the modality-invariant features.
    We treat the training of the identity classifier and the modality discriminator as different tasks. To optimize the weights \(\lambda ^k\) for the loss contribution of modality discriminator, we present a simple algorithm as shown in Figure 4; that is, it will penalize the modality discriminator if the backpropagated gradients from identity classifier are too large at the beginning. If identity classifier is training relatively slowly, then dynamic weight \(\lambda ^k\) of modality discriminator should be increased to ensure it has more influence on training. When the training rate of different tasks is similar, the correct balance is finally achieved. The dynamic weight \(\lambda ^k\) and the total loss \(L^k\) of the proposed DGRNet can be denoted respectively as
    \(\begin{equation} \lambda ^k = \frac{1}{1 + \left\Vert \frac{\partial L_{id}}{\partial \theta _y}\right\Vert _2^{k-1} }, \end{equation}\)
    (16)
    \(\begin{equation} L^k = L_B^k + \lambda ^k L_d^k, \end{equation}\)
    (17)
    where \(k\) is the current iteration and \(\Vert \tfrac{\partial L_{id}}{\partial \theta _y}\Vert _2\) represents the \(L_2\) norm of the gradient of identity loss \(L_{id}\) with respect to the parameters of identity classifier \(\theta _y\) . \(\Vert \tfrac{\partial L_{id}}{\partial \theta _y}\Vert _2\) reflects the optimization degree for the identity discriminability of pedestrian features. By continuously monitoring \(\Vert \tfrac{\partial L_{id}}{\partial \theta _y}\Vert _2\) , we can construct dynamic weights based on the identity discriminability of pedestrian features, thereby adaptively balancing the adversarial process. Take an example, when the gradient of identity classifier is too big, we can know that \(\lambda ^k\) is small at this time from Equation (16), and do not introduce modality discriminator for the adversary training too much. In this way, we can (1) reason about the relative importance of the target loss contribution through the gradient of identity classifier and then (2) dynamically adjust the target loss contribution so that the different tasks train at suitable rates. The dynamic optimization details are illustrated in Algorithm 1. During the dynamic update process, the dynamic weights \(\lambda ^k\) can be accordingly computed once after each epoch iteration, while the modality discriminator loss \(L_d\) is gradually introduced into the overall learning. When the training converges, DGRNet will learn a rather robust dynamic weight and achieve optimal adversarial training. Different from other domain adaption methods with gradient reversal [9], we use \(\lambda ^k\) to dynamically adjust the target loss contribution of the proposed DGRNet, without involving hyper-parameter tuning, and we only need to run the whole network once to get the stable result.
    Fig. 4.
    Fig. 4. Dynamically adjust the loss contribution by dynamic weight \(\lambda ^k\) . The blue line represents the back propagation gradient of the identity classifier, and \(\frac{\partial L_{id}}{\partial \theta _y}\) is the gradient of identity classifier. The green circles denote the contribution of loss terms, the loss contribution of identity classifier takes a fix value, while the loss contribution of modality discriminator varies dynamically according to the gradient of identity classifier.

    4 Experiments Results and Analyses

    In this section, extensive experiments are conducted to evaluate the effectiveness of proposed DGRNet to enhance the discriminative common representation learning. In the experiments reported blow, to verify the effectiveness of proposed approach, we make the comparison of our proposed DGRNet and the state-of-the-art approaches on the SYSU-MM01 [42] and RegDB [31] datasets. Then we conduct further analysis to investigate the performance of DGRNet in more detail.

    4.1 Datasets and Evluation Metrics

    Datasets. The proposed DGRNet is evaluated on two public VI-ReID datasets, SYSU-MM01 and RegDB. In it, (1) SYSU-MM01 is a large-scale VI Re-ID dataset, and 491 identities captured in outdoor and indoor environment are included. The images are obtained by two near-infrared and four visible cameras. Three hundred ninety-five persons are contained in the training set, which includes 11,909 infrared images and 22,258 visible images. Ninety-six persons are contained in the testing set. There are all-search mode and indoor-search mode. In the indoor-search mode, Cam 1, Cam 2, Cam 3, and Cam6 are used to capture indoor images. In the all-search mode, the pictures collected by Cam 1 to Cam 6 are used. For both modes, the gallery set consists of visible images, and the probe set consists of infrared images. We adopt both the single-shot and multi-shot settings, where only 1 or 10 images in the gallery set can be matched with the anchor image. In this article, single-shot indoor-search mode and the single-shot all-search mode and are adopted as the evaluation protocol. (2) The RegDB dataset includes 412 persons. Each person has 10 visible images and 10 infrared images.
    Settings. For SYSU-MM01, the training set has 395 persons, and the testing set has 96 persons. In the testing set, there are 3,803 infrared images were constructed for query, and 301 visible images were randomly selected from testing set for gallery set. For RegDB, it is randomly split in half; one half is used as a training set and another half is used as a testing set, and then we follow the evaluation protocol. For testing, the images from visible/infrared modality are used to form the gallery set, while the images of the infrared/visible modality are used to form the probe set. The above evaluation will be repeated 10 times to achieve a statistically stable result.
    Evaluation metrics. For indicating the performance of the model, we used cumulative matching characteristic (CMC) [30] and mean average precision (mAP) [61], the reason to use mAP is that one person in the gallery set has multiple ground truths.

    4.2 Implement Details

    The ResNet-50, which is pre-trained on ImageNet, is adopted as our CNN backbone. We set the last stride of convolution of ResNet-50 as 1, and thus the feature map with enlarged spatial size (18 \(\times\) 9) is obtained. This operation increases the computational cost of network, while no additional training parameters are involved. It should be noted that the increase in spatial resolution leads to significant improvement of the performance. Furthermore, we use one fully connected layer for identity prediction, where the size is set as 2,048. The modality discriminator is constructed by two fully connected layers, where the size is set as (2,048–1,024).
    For input images, the size of the input images is resized to 288 \(\times\) 144, and random horizontal flipping, random crop with zero-padding, random erasing [64], and random channel exchangeable augmentation [50] for data augmentation are performed on the input data. The batch size is set to 64 for both datasets, which contains 32 visible images and 32 infrared images. SGD is utilized for optimization, and the momentum is set to 0.9. Meanwhile, to bootstrap the network for enhancing performance, we use the warm-up strategy from Reference [26]. In experiments, in the first 10 epochs, the learning rate grows linearly from 0.01 to 0.1, and in the following, it decays to 0.01 at the 20th epoch, and then decays to 0.001 at the 50th epoch. At the epoch \(k\) , the learning rate \(\mu (k)\) is computed as
    \(\begin{equation} \mu (k)=\left\lbrace \begin{aligned}0.01\times k & , & if & &0\lt k\le 10 \\ 0.1 & , & if & &10\lt k\le 20 \\ 0.01 & , & if & &20\lt k\le 50 \\ 0.001 & , & if & &50\lt k\le 80, \end{aligned} \right. \end{equation}\)
    (18)
    where the training epoch is \(k\) for RegDB dataset and the SYSU-MM01 dataset is set to 80. For the \(PK\) sampling strategy, \(P\) and \(K\) are set to 8 and 4, respectively; \(\rho\) is set to 0.3, and \(\rho _{cc}\) is set to 0.7. We set \(\eta _{1}=1\) , \(\eta _{2}=0.1\) for RegDB, and \(\eta _{1}=0.1\) , \(\eta _{2}=1\) for SYSU-MM01.

    4.3 Ablation Study

    We adopt the feature extractor (two-stream structure) and identity classifier as our baseline method. We evaluate the effectiveness of proposed DGRNet from five aspects: the effectiveness of two-stream backbone network setting, the effectiveness of gradient reversal, the effectiveness of the dynamic weight of gradient reversal, convergence analysis, and feature visualizations. In the following experiment, the gradient reversal will work together with modality discriminator.

    4.3.1 The Effectiveness of Two-stream Backbone Network Setting.

    To improve the adaptability of the model to the two modalities of input images and better extract the shared information between modalities, we use a two-stream network with partially shared structure as the feature extractor. ResNet-50 contains a total of five convolutional modules, and the two-stream network splits ResNet-50 into modality-specific layers and modality-shared layers, where the parameters of modality-specific layers are independent, while the parameters of modality-shared layers are shared. To choose a reasonable method for splitting the two-stream network, we conduct experiments with different numbers of modality-specific and modality-shared layers, and Table 1 shows the detailed experimental results. From the experimental results, we can see that for the RegDB and SYSU-MM01 datasets, the most reasonable two-stream network structure is to use two modality-specific layers and three modality-shared layers.
    Table 1.
    SP:SHRegDBSYSU-MM01
    Visible-InfraredAll Search
    Rank-1Rank-10Rank-20mAPRank-1Rank-10Rank-20mAP
    1:489.03 \(\%\) 97.38 \(\%\) 98.64 \(\%\) 79.87 \(\%\) 70.09 \(\%\) 96.11 \(\%\) 98.58 \(\%\) 66.45 \(\%\)
    2:391.26 \(\%\) 97.91 \(\%\) 99.22 \(\%\) 82.02 \(\%\) 71.53 \(\%\) 96.06 \(\%\) 98.62 \(\%\) 68.04 \(\%\)
    3:288.74 \(\%\) 97.18 \(\%\) 98.54 \(\%\) 80.03 \(\%\) 70.29 \(\%\) 96.12 \(\%\) 98.60 \(\%\) 66.48 \(\%\)
    4:185.34 \(\%\) 96.41 \(\%\) 97.96 \(\%\) 76.05 \(\%\) 68.28 \(\%\) 95.51 \(\%\) 98.44 \(\%\) 65.50 \(\%\)
    Table 1. Effectiveness of Different Splits of Two-steam Backbone Network in Terms of mAP ( \(\%\) ) and CMC ( \(\%\) ) on the RegDB and SYSU-MM01 Datasets
    The corresponding best results are in bold. SP denotes the number of modality-specific layers and SH denotes the number of modality-shared layers.

    4.3.2 The Effectiveness of Gradient Reversal.

    Table 2 displays the results by employing baseline only and baseline combined with gradient reversal. In the experiments, the weight \(\lambda\) of gradient reversal was set as 1. The results can be seen clearly in Table 2. For the RegDB and SYSU-MM01 datasets, the combination of baseline and gradient reversal outperforms the baseline. It demonstrates the effectiveness of gradient reversal for guiding the adversarial training of neural networks to reduce the cross-modality discrepancy. Setting the weight of gradient reversal to 1 is only to verify the effectiveness of gradient reversal. To achieve the optimal training of the network, we need to find the best weight \(\lambda\) of gradient reversal.
    Table 2.
    MethodsRegDBSYSU-MM01
    Visible-InfraredAll Search
    Rank-1Rank-10Rank-20mAPRank-1Rank-10Rank-20mAP
    baseline84.96 \(\%\) 95.53 \(\%\) 97.33 \(\%\) 76.65 \(\%\) 64.94 \(\%\) 94.30 \(\%\) 97.81 \(\%\) 61.58 \(\%\)
    baseline+gradient reversal ( \(\lambda =1\) )89.47 \(\%\) 97.48 \(\%\) 98.27 \(\%\) 79.69 \(\%\) 70.05 \(\%\) 95.70 \(\%\) 98.32 \(\%\) 66.17 \(\%\)
    Table 2. Effectiveness of Gradient Reversal in Terms of mAP ( \(\%\) ) and CMC ( \(\%\) ) on the RegDB and SYSU-MM01 datasets.

    4.3.3 The Effectiveness of the Dynamic Weight of Gradient Reversal.

    To evaluate the effectiveness of dynamic weight of gradient reversal, based on the baseline combined with gradient reversal, we set the weight \(\lambda\) of gradient reversal as \(w_p\) (same as in Reference [9]) and \(\lambda ^k\) (dynamic weight mentioned in Section 3.3), respectively. The weight \(w_p\) is initialed at 0 and is gradually change to 1 using the following schedule:
    \(\begin{equation} w_p = \frac{2}{1+\exp (-\gamma \cdot p)} - 1, \end{equation}\)
    (19)
    where \(\gamma\) was set to 10 (results of multiple hyper-parameter tuning) in all experiments (the schedule was not optimized); \(p\) controls the training progress, which changes from 0 to 1 linearly. For the RegDB dataset, we use visible images as query and infrared images as gallery, noting the default setting as “Visible to Infrared.” For the SYSU-MM01 dataset, we take single-shot all-search mode to get the results of the model.
    The results are displayed in Table 3: (1) When the weight of the gradient reversal is \(w_p\) , the network performs much better than the baseline network on both datasets. Since \(w_p\) is used for updating the feature extractor component \(G_f\) , which allows the modality discriminator to be less sensitive to noisy signal at the early stages of the training procedure. The downside is that it always involves hyper-parameter \(\gamma\) tuning (mentioned in Equation (19)). (2) Compared with the gradient reversal using fixed weights in Table 2 and the gradient reversal using weights \(w_p\) [9] in Table 3, our proposed dynamic weights outperform them on RegDB and SYSU-MM01 datasets. The fixed-weight leads to the premature involvement of the modality discriminator in the early training of the network, which is detrimental to the learning of instance-level feature representations. Furthermore, the method with weight \(w_p\) not only introduces additional hyper-parameter \(\gamma\) but also fails to determine the best time to introduce adversarial training. Unlike the two methods mentioned above, our method aims to ensure that after the identity classifier is trained, we gradually increase the modality discriminator to conduct adversarial training with it to better learn the modality-invariant features. Moreover, our network with dynamic weight only needs to run the entire network once to get the stable results, while other approaches require it to be run multiple times to obtain stable results. It demonstrates that the dynamic weight of gradient reversal has ability to further enhance the discriminative common representation learning. Most importantly, it does not need to introduce any additional hyper-parameter tuning.
    Table 3.
    MethodsRegDBSYSU-MM01
    Visible-InfraredAll Search
    Rank-1Rank-10Rank-20mAPRank-1Rank-10Rank-20mAP
    baseline84.96 \(\%\) 95.53 \(\%\) 97.33 \(\%\) 76.65 \(\%\) 64.94 \(\%\) 94.30 \(\%\) 97.81 \(\%\) 61.58 \(\%\)
    baseline+gradient reversal ( \(\lambda =w_p\) )89.95 \(\%\) 97.43 \(\%\) 98.83 \(\%\) 80.23 \(\%\) 70.94 \(\%\) 95.88 \(\%\) 98.54 \(\%\) 66.58 \(\%\)
    baseline+gradient reversal ( \(\lambda =\lambda ^k\) )91.26 \(\%\) 97.91 \(\%\) 99.22 \(\%\) 82.02 \(\%\) 71.53 \(\%\) 96.06 \(\%\) 98.62 \(\%\) 68.04 \(\%\)
    Table 3. Effectiveness of Gradient Reversal with Different Weights in Terms of mAP ( \(\%\) ) and CMC ( \(\%\) ) on RegDB and SYSU-MM01 Datasets
    The corresponding best results are in bold.
    Figure 5 shows the final performance of proposed DGRNet on both two public datasets. Baseline with gradient reversal ( \(\lambda =1\) ), baseline with gradient reversal ( \(\lambda =w_p\) ), and baseline with dynamic weighted gradient reversal ( \(\lambda =\lambda ^k\) ) all perform better than the baseline, respectively. And network with dynamic weighted gradient reversal (DGRNet) achieves the competitive performance by a large margin.
    Fig. 5.
    Fig. 5. Final performances of our proposed DGRNet on (a) RegDB and (b) SYSU-MM01 datasets.

    4.3.4 Convergence Analysis.

    The gradient of identity classifier and the change of dynamic weight \(\lambda ^k\) are evaluated, in this part, to verify the effectiveness of the proposed designed dynamic weight in Equation (16). From Figure 6, we can see that (1) initially, the gradient of identity classifier shows a large value, and the value of the dynamic weight \(\lambda ^k\) is correspondingly small. (2) In the first 20 epochs, the gradient of identity classifier keeps decreasing and the value of the dynamic weight \(\lambda ^k\) increases in negative correlation. (3) The gradient of identity classifier remains a stable value after 20 epochs, and the dynamic weight \(\lambda ^k\) become steady correspondingly. (4) These results demonstrate that the dynamic weights we designed (in Equation (16)) in the actual change process and the expected change process are consistent.
    Fig. 6.
    Fig. 6. The trend of dynamic weight \(\lambda ^k\) and the gradient of identity classifier.
    We also evaluated the convergence of DGRNet and drew a trend chart of the total loss of DGRNet and baseline, respectively, shown in Figure 7. We can see that in the same number of training iterations, DGRNet can also achieve normal convergence, indicating the stability and effectiveness of the designed adversarial process. Compared to the baseline, DGRNet exhibits a larger convergence point, which provides the network with a certain level of robustness and improved tolerance to modal change in input data. Specifically, DGRNet continuously balances the tasks of feature extraction and modality discrimination, enabling better adaptation to the differences between the visible and infrared modalities.
    Fig. 7.
    Fig. 7. The trend of total loss.

    4.3.5 Feature Visualizations.

    To further evaluate the performance of DGRNet, we visualize the features of person (12 classes) learned by baseline, baseline with gradient reversal ( \(\lambda =1\) ), baseline with gradient reversal ( \(\lambda =w_p\) ), and baseline with dynamic weighted gradient reversal ( \(\lambda =\lambda ^k\) ) through using the t-SNE [34] embedding in Figure 8. Red circles denote the visible samples, and blue pentagrams represent the thermal samples. The visualization tells significant conclusions: (1) From the results of baseline in Figure 8(a), we can clearly see that not only are the distributions from the infrared and visible modalities not well aligned, but also different classes are not well-distinguished clearly. (2) As shown in Figure 8(b) and Figure 8(c), compared to Figure 8(a), the feature distances of the same person in different modalities are effectively approximated, demonstrating that gradient reversal enables the network to learn discriminative common representations. But fix weight method and method with weight \(w_p\) still cannot align features very well. (3) For the features learned with our DGRNet, as shown in Figure 8(d), not only are the distributions aligned very well between the visible and infrared modalities, but also it is discriminated more clearly between different classes. In other words, the proposed DGRNet can get better performance. (4) The observations shown above suggest that DGRNet has the ability to learn discriminative common representations better by confusing the modality discrimination.
    Fig. 8.
    Fig. 8. Network activation after visualization with the t-SNE. Panels (a), (b), (c), and (d) are the visualiation of the learned representations on baseline, combination of baseline and gradient reversal ( \(\lambda =1\) ), and combination of baseline and gradient reversal ( \(\lambda =w_p\) ) and DGRNet, respectively.

    4.4 Comparison to the State of the Art

    The proposed DGRNet will be compared with the following state-of-the-art approaches in this section: zero-padding [42], cmGAN [7], eBDTR [49], AlignGAN [35], JSIA-Re-ID [36], cm-SSFT(sq) [25], DDAG [51], HAT [53], AGW [52], DDSN [5], NFS [4], MCLNet [11], FBP-AL [40], HMML_T [55], DTRM [47], TSME [24], SPOT [3], FMCNet [57], FAM+NNCLoss [43], and TVTR [45]. The comparison results on the SYSU-MM01 and RegDB datasets are respectively shown in Table 4 and Table 5, which is judged based on the Rank-1, 10, 20 accuracies of CMC and mAP. The details are given as follows.
    Table 4.
    SettingsAll-SearchIndoor-Search
    MethodVenuer = 1r = 10r = 20mAPr = 1r = 10r = 20mAP
    Zero-Pading[42]ICCV1714.8054.1271.3315.9520.5868.3885.7926.92
    cmGAN[7]IJCAI1826.9767.5180.5627.8031.6377.2389.1842.19
    eBDTR[49]TIFS1927.8267.3481.3428.4232.4677.4289.6242.46
    AlignGAN[35]ICCV1942.485.093.740.745.987.694.454.3
    JSIA-Re-ID[36]AAAI2038.1080.7089.9036.9043.8086.2094.2052.90
    cm-SSFT(sq)[25]CVPR2047.7054.10
    DDAG[51]ECCV2054.7590.3995.8153.0261.0294.0698.4167.98
    HAT[53]TIFS2055.2992.1497.3653.8962.1095.7599.2069.37
    DDSN[5]ISCAS2146.1686.3494.9746.92
    AGW[52]TPAMI2147.5084.3992.1447.6554.1791.1495.9862.97
    NFS[4]CVPR2156.9191.3496.5255.4562.7996.5399.0769.79
    MCLNet[11]CVPR2165.4093.3397.1461.9872.5696.9899.2076.58
    FBP-AL[40]TNNLS2254.1486.0493.0350.20
    HMML_T[55]ACM2261.9692.5197.0759.62
    DTRM[47]TIFS2263.0393.8297.5658.6366.3595.5898.8071.76
    TSME[24]TCSVT2264.2395.1998.7361.2164.8096.9299.3171.53
    SPOT[3]TIP2265.3492.7397.0462.2569.4296.2299.1274.63
    FMCNet[57]CVPR2266.3462.5168.1574.09
    FAM+NNCLoss[43]SPL2355.7587.5193.2751.5258.2491.0896.4265.65
    TVTR[45]ICASSP2365.3095.4198.7464.1572.2177.94
    DGRNetOurs71.5396.0698.6268.0477.4998.6199.7981.51
    Table 4. Comparison with the States of the Art on the SYSU-MM01 Datasets
    Re-identification rates ( \(\%\) ) at Rank-r and mAP ( \(\%\) ).
    The corresponding best results are in bold.
    Table 5.
    SettingsVisible to InfraredInfrared to Visible
    MethodVenuer = 1r = 10r = 20mAPr = 1r = 10r = 20mAP
    Zero-Pading[42]ICCV1717.7534.2144.3518.9016.6334.6844.2517.82
    eBDTR[49]TIFS1934.6258.9668.7233.4634.2158.7468.6432.49
    AlignGAN[35]ICCV1957.953.656.353.4
    JSIA-Re-ID[36]AAAI2048.549.348.148.9
    cm-SSFT(sq)[25]CVPR2072.372.971.071.7
    DDAG[51]ECCV2069.3486.1991.4963.4668.0685.1590.3161.80
    HAT[53]TIFS2071.8387.1692.1667.5670.0286.4591.6166.30
    AGW[52]TPAMI2170.0566.37
    DDSN[5]ISCAS2179.3288.6195.9275.37
    NFS[4]CVPR2180.5491.9695.0772.1077.9590.4593.6269.79
    MCLNet[11]CVPR2180.3192.7096.0373.0775.9390.9394.5969.49
    FBP-AL[40]TNNLS2273.9889.7193.6968.2470.0589.2293.8866.61
    DTRM[47]TIFS2279.0992.2595.6670.0978.0291.7595.1969.56
    SPOT[3]TIP2280.3593.4896.4472.4679.3792.7996.0172.26
    HMML_T[55]ACM2282.9794.0396.4277.56
    TSME[24]TCSVT2287.3597.1098.9076.9486.4196.3998.2075.70
    FMCNet[57]CVPR2289.1284.4388.3883.86
    TVTR[45]ICASSP2384.179.583.778.0
    FAM+NNCLoss[43]SPL2387.3195.6797.4976.7084.8194.3396.4874.73
    DGRNetOurs91.2697.9199.2282.0287.4896.7098.5080.75
    Table 5. Comparison with the States of the Art on the RegDB Datasets
    Re-identification rates ( \(\%\) ) at Rank-r and mAP ( \(\%\) ).
    The corresponding best results are in bold.
    On the SYSU-MM01 dataset, the DGRNet achieves Rank-1 scores of 71.53 \(\%\) and 77.49 \(\%\) in single-shot all-search mode and single-shot indoor-search mode, better than FMCNet [57] by 5.19 \(\%\) and 9.34 \(\%\) and better than FAM+NNCLoss [43] by 15.78 \(\%\) and 19.25 \(\%\) , respectively. Compared with the method TVTR [45] in single-shot all-search mode, the proposed DGRNet surpassed TVTR [45] by 6.23 \(\%\) on Rank-1 score and by 3.89 \(\%\) on mAP score. AGW [52] is also designed based on top of eBDTR [49], but it performs worse than the proposed DGRNet by a large margin. The DGRNet improves 24.03 \(\%\) on Rank-1 score and 20.39 \(\%\) on mAP score. When compared with some representative adversarial methods, our method also demonstrates superior performance. For example, both DGRNet and cmGAN [7] utilize ResNet-50 as person feature extractor, and cmGAN adopts adversarial training to learn discriminator common representation as well. However, the DGRNet performs much better than cmGAN by 44.56 \(\%\) on Rank-1 score and 40.24 \(\%\) on mAP score. Moreover, DGRNet does not require searching for excessive hyperparameters for stable adversarial training like cmGAN. Compared to MCLNet [11], the Rank-1 accuracy is improved by 6.13 \(\%\) in single-shot all-search mode. Excepting the adversarial training to confuse the features of two modalities, MCLNet also exploited the camera label information for further improvement. It is worth noting that we compare DGRNet with MCLNet (Base+MCM, only adopt adversarial training to confuse the features of two modalities, the results can also be seen in Reference [11]), the Rank-1 score and mAP score of the DGRNet are improved by 20.07 \(\%\) and 18.2 \(\%\) , respectively. TSME [24] proposed a new deeper skip-connection generative adversarial networks as an image generator and generated high-quality cross modality images through adversarial training to alleviate modality discrepancy. In single-shot all-search mode, compared with TSME, the Rank-1 score and mAP score of the DGRNet are improved by 7.3 \(\%\) and 6.83 \(\%\) , respectively. Moreover, DGRNet has a simpler network architecture that does not involve complex image generation processes, and it does not require training in stages like TSME.
    On the RegDB dataset, the DGRNet achieves the Rank-1 scores of 91.26 \(\%\) and 87.48 \(\%\) in visible-to-infrared and infrared-to-visible modes, better than TSME [24] by 3.91 \(\%\) and 1.07 \(\%\) , respectively, and better than TVTR [45] by 7.16 \(\%\) and 3.78 \(\%\) , respectively. Compared with AGW [52], MCLNet [11], FBP-AL [40], DTRM [47], and SPOT [3], the Rank-1 score and mAP score of the DGRNet are improved more than 10 \(\%\) . It demonstrates that the proposed DGRNet containing dynamic weight of gradient reversal has ability to enhance the discriminative common representation learning on VI Re-ID task.
    Notably, our model only adopts global features. Global features refer to the features extracted from the entire image, which can preserve the overall structural information of the image. Local features, however, refer to the features extracted from certain regions of the image, which can better uncover the detailed information of the image. However, compared to global features, local features require greater computational cost. Therefore, our work focuses on global features, attempting to improve model performance without increasing computation. From comparison results listed in Table 4 and Table 5, it is demonstrated that the proposed DGRNet method with only global features achieves good performance compared to the global-feature-based SOTA methods, such as AlignGAN [35], JSIA-Re-ID [36], HAT [53], DDSN [5], AGW [52], MCLNet [11], HMML_T [55], and FMCNet [57], and even outperforming majority of local-feature-based methods, such as the DDAG [51], cm-SSFT(sq) [25], NFS [4], FBP-AL [40], DTRM [47], SPOT [3], and TSME [24]. This indicates that we were able to achieve competitive performance by only optimizing global features without increasing the computational cost.

    5 Conclusion

    This article focuses on a challenging, newly developing task: VI Re-ID. In this work, the DGRNet, based on dynamic weighted gradient reversal, is proposed to help deep networks learn enhanced discriminative common representations from different modalities by confusing the modality discrimination. The proposed dynamic weight for gradient reversal not only dynamically and adaptively evaluates the significance of the target loss term, so that sharable features are learned better through adversarial training, but also involves no hyper-parameter tuning. We conduct feature visualization and extensive experiments to verify the effectiveness of DGRNet. The results demonstrate that our adversarial method with dynamic weighted gradient reversal can better confuse the two modalities and thereby enhance discriminative common representation learning.
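    For readers unfamiliar with gradient reversal, the following minimal PyTorch sketch shows the underlying mechanism (identity in the forward pass, negated and rescaled gradient in the backward pass). It is an illustration under our own naming, not the article's released code, and the scalar weight is only a placeholder for the dynamic weight that DGRNet computes from the loss terms.

```python
import torch
from torch.autograd import Function

class GradReverse(Function):
    # Identity in the forward pass; negated, rescaled gradient in the backward pass.
    @staticmethod
    def forward(ctx, x, weight):
        ctx.weight = weight
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient from the modality discriminator is reversed and scaled by
        # the (dynamic) weight; the weight itself receives no gradient.
        return -ctx.weight * grad_output, None

def grad_reverse(x: torch.Tensor, weight: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, weight)

# Shared features go to the identity classifier directly, and to the modality
# discriminator through the reversal layer. The fixed 0.5 below is a placeholder;
# DGRNet instead derives this weight dynamically and adaptively at each step.
features = torch.randn(8, 2048, requires_grad=True)
modality_logits = torch.nn.Linear(2048, 2)(grad_reverse(features, weight=0.5))
```

    With such a layer in front of the modality discriminator, minimizing the discriminator's loss pushes the shared feature extractor toward modality-confusing representations, which is the adversarial effect the dynamic weight modulates.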

    References

    [1]
    Xiang Bai, Mingkun Yang, Tengteng Huang, Zhiyong Dou, Rui Yu, and Yongchao Xu. 2020. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recogn. 98 (2020), 107036.
    [2]
    Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Mach. Learn. 79, 1 (2010), 151–175.
    [3]
    Cuiqun Chen, Mang Ye, Meibin Qi, Jingjing Wu, Jianguo Jiang, and Chia-Wen Lin. 2022. Structure-aware positional transformer for visible-infrared person re-identification. IEEE Trans. Image Process. 31 (2022), 2352–2364.
    [4]
    Yehansen Chen, Lin Wan, Zhihang Li, Qianyan Jing, and Zongyuan Sun. 2021. Neural feature search for rgb-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 587–597.
    [5]
    Yunzhou Cheng, Xinyi Li, Guoqiang Xiao, Wenzhuo Ma, and Xinye Gou. 2021. Dual-path deep supervision network with self-attention for visible-infrared person re-identification. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS’21). IEEE, 1–5.
    [6]
    Seokeon Choi, Sumin Lee, Youngeun Kim, Taekyung Kim, and Changick Kim. 2020. Hi-CMD: Hierarchical cross-modality disentanglement for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 10257–10266.
    [7]
    Pingyang Dai, Rongrong Ji, Haibin Wang, Qiong Wu, and Yuyu Huang. 2018. Cross-modality person re-identification with generative adversarial training. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’18), Vol. 1. 6.
    [8]
    Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. 2018. Unsupervised person re-identification: Clustering and fine-tuning. ACM Trans. Multimedia Comput. Commun. Appl. 14, 4 (2018), 1–18.
    [9]
    Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning. PMLR, 1180–1189.
    [10]
    Guangwei Gao, Hao Shao, Fei Wu, Meng Yang, and Yi Yu. 2022. Leaning compact and representative features for cross-modality person re-identification. WWW J. (2022), 1–18.
    [11]
    Xin Hao, Sanyuan Zhao, Mang Ye, and Jianbing Shen. 2021. Cross-modality person re-identification via modality confusion and center aggregation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 16403–16412.
    [12]
    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
    [13]
    Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv:1703.07737. Retrieved from https://arxiv.org/abs/1703.07737.
    [14]
    Weipeng Hu and Haifeng Hu. 2020. Adversarial disentanglement spectrum variations and cross-modality attention networks for NIR-VIS face recognition. IEEE Trans. Multimedia 23 (2020), 145–160.
    [15]
    Weipeng Hu and Haifeng Hu. 2020. Dual adversarial disentanglement and deep representation decorrelation for NIR-VIS face recognition. IEEE Trans. Inf. Forens. Secur. 16 (2020), 70–85.
    [16]
    Nianchang Huang, Jianan Liu, Qiang Zhang, and Jungong Han. 2021. Exploring modality-shared appearance features and modality-invariant relation features for cross-modality person re-identification. arXiv:2104.11539. Retrieved from https://arxiv.org/abs/2104.11539.
    [17]
    Kajal Kansal, A. Venkata Subramanyam, Zheng Wang, and Shin’ichi Satoh. 2020. SDL: Spectrum-disentangled representation learning for visible-infrared person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 30, 10 (2020), 3422–3432.
    [18]
    Vladimir V. Kniaz, Vladimir A. Knyaz, Jiri Hladuvka, Walter G. Kropatsch, and Vladimir Mizginov. 2018. Thermalgan: Multimodal color-to-thermal image translation for person re-identification in multispectral dataset. In Proceedings of the European Conference on Computer Vision (ECCV’18) Workshops. 0–0.
    [19]
    Diangang Li, Xing Wei, Xiaopeng Hong, and Yihong Gong. 2020. Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 4610–4617.
    [20]
    Yaoyu Li, Hantao Yao, Tianzhu Zhang, and Changsheng Xu. 2020. Part-based structured representation learning for person re-identification. ACM Trans. Multimedia Comput. Commun. Appl. 16, 4 (2020), 1–22.
    [21]
    Wenqi Liang, Guangcong Wang, Jianhuang Lai, and Xiaohua Xie. 2021. Homogeneous-to-heterogeneous: Unsupervised learning for rgb-infrared person re-identification. IEEE Trans. Image Process. 30 (2021), 6392–6407.
    [22]
    Haijun Liu, Jian Cheng, Wen Wang, Yanzhou Su, and Haiwei Bai. 2020. Enhancing the discriminative feature learning for visible-thermal cross-modality person re-identification. Neurocomputing 398 (2020), 11–19.
    [23]
    Haijun Liu, Xiaoheng Tan, and Xichuan Zhou. 2020. Parameter sharing exploration and hetero-center triplet loss for visible-thermal person re-identification. IEEE Trans. Multimedia 23 (2020), 4414–4425.
    [24]
    Jianan Liu, Jialiang Wang, Nianchang Huang, Qiang Zhang, and Jungong Han. 2022. Revisiting modality-specific feature compensation for visible-infrared person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 32, 10 (2022), 7226–7240.
    [25]
    Yan Lu, Yue Wu, Bin Liu, Tianzhu Zhang, Baopu Li, Qi Chu, and Nenghai Yu. 2020. Cross-modality person re-identification with shared-specific feature transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 13379–13389.
    [26]
    Hao Luo, Youzhi Gu, Xingyu Liao, Shenqi Lai, and Wei Jiang. 2019. Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 0–0.
    [27]
    Hao Luo, Wei Jiang, Xing Fan, and Chi Zhang. 2020. Stnreid: Deep convolutional networks with pairwise spatial transformer networks for partial person re-identification. IEEE Trans. Multimedia 22, 11 (2020), 2905–2913.
    [28]
    Hao Luo, Wei Jiang, Youzhi Gu, Fuxu Liu, Xingyu Liao, Shenqi Lai, and Jianyang Gu. 2019. A strong baseline and batch normalization neck for deep person re-identification. IEEE Trans. Multimedia 22, 10 (2019), 2597–2609.
    [29]
    Mandi Luo, Xin Ma, Zhihang Li, Jie Cao, and Ran He. 2021. Partial NIR-VIS heterogeneous face recognition with automatic saliency search. IEEE Trans. Inf. Forens. Secur. 16 (2021), 5003–5017.
    [30]
    Hyeonjoon Moon and P. Jonathon Phillips. 2001. Computational and performance aspects of PCA-based face-recognition algorithms. Perception 30, 3 (2001), 303–321.
    [31]
    Dat Tien Nguyen, Hyung Gil Hong, Ki Wan Kim, and Kang Ryoung Park. 2017. Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors 17, 3 (2017), 605.
    [32]
    Yifan Sun, Liang Zheng, Yali Li, Yi Yang, Qi Tian, and Shengjin Wang. 2019. Learning part-based convolutional features for person re-identification. IEEE Trans. Pattern Anal. Mach. Intell. 43, 3 (2019), 902–917.
    [33]
    Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474. Retrieved from https://arxiv.org/abs/1412.3474.
    [34]
    Laurens Van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 11 (2008).
    [35]
    Guan’an Wang, Tianzhu Zhang, Jian Cheng, Si Liu, Yang Yang, and Zengguang Hou. 2019. RGB-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 3623–3632.
    [36]
    Guan-An Wang, Tianzhu Zhang, Yang Yang, Jian Cheng, Jianlong Chang, Xu Liang, and Zeng-Guang Hou. 2020. Cross-modality paired-images generation for RGB-infrared person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 12144–12151.
    [37]
    Pingyu Wang, Zhicheng Zhao, Fei Su, Yanyun Zhao, Haiying Wang, Lei Yang, and Yang Li. 2020. Deep multi-patch matching network for visible thermal person re-identification. IEEE Trans. Multimedia 23 (2020), 1474–1488.
    [38]
    Zhixiang Wang, Zheng Wang, Yinqiang Zheng, Yung-Yu Chuang, and Shin’ichi Satoh. 2019. Learning to reduce dual-level discrepancy for infrared-visible person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 618–626.
    [39]
    Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. 2021. Syncretic modality collaborative learning for visible infrared person re-identification. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 225–234.
    [40]
    Ziyu Wei, Xi Yang, Nannan Wang, and Xinbo Gao. 2022. Flexible body partition-based adversarial learning for visible infrared person re-identification. IEEE Trans. Neural Netw. Learn. Syst. 33, 9 (2022), 4676–4687.
    [41]
    Ancong Wu, Wei-Shi Zheng, Shaogang Gong, and Jianhuang Lai. 2020. Rgb-ir person re-identification by cross-modality similarity preservation. Int. J. Comput. Vis. 128, 6 (2020), 1765–1785.
    [42]
    Ancong Wu, Wei-Shi Zheng, Hong-Xing Yu, Shaogang Gong, and Jianhuang Lai. 2017. RGB-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. 5380–5389.
    [43]
    Baotai Wu, Yujian Feng, Yunfei Sun, and Yimu Ji. 2023. Feature aggregation via attention mechanism for visible-thermal person re-identification. IEEE Sign. Process. Lett. 30 (2023), 140–144.
    [44]
    Qiong Wu, Pingyang Dai, Jie Chen, Chia-Wen Lin, Yongjian Wu, Feiyue Huang, Bineng Zhong, and Rongrong Ji. 2021. Discover cross-modality nuances for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 4330–4339.
    [45]
    Bin Yang, Jun Chen, and Mang Ye. 2023. Top-K visual tokens transformer: Selecting tokens for visible-infrared person re-identification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. 1–5.
    [46]
    Hanrong Ye, Hong Liu, Fanyang Meng, and Xia Li. 2020. Bi-directional exponential angular triplet loss for RGB-infrared person re-identification. IEEE Trans. Image Process. 30 (2020), 1583–1595.
    [47]
    Mang Ye, Cuiqun Chen, Jianbing Shen, and Ling Shao. 2022. Dynamic tri-level relation mining with attentive graph for visible infrared re-identification. IEEE Trans. Inf. Forens. Secur. 17 (2022), 386–398.
    [48]
    Mang Ye, Xiangyuan Lan, Jiawei Li, and Pong Yuen. 2018. Hierarchical discriminative learning for visible thermal person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
    [49]
    Mang Ye, Xiangyuan Lan, Zheng Wang, and Pong C. Yuen. 2019. Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Trans. Inf. Forens. Secur. 15 (2019), 407–419.
    [50]
    Mang Ye, Weijian Ruan, Bo Du, and Mike Zheng Shou. 2021. Channel augmented joint learning for visible-infrared recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13567–13576.
    [51]
    Mang Ye, Jianbing Shen, David J. Crandall, Ling Shao, and Jiebo Luo. 2020. Dynamic dual-attentive aggregation learning for visible-infrared person re-identification. In European Conference on Computer Vision. Springer, 229–247.
    [52]
    Mang Ye, Jianbing Shen, Gaojie Lin, Tao Xiang, Ling Shao, and Steven C. H. Hoi. 2021. Deep learning for person re-identification: A survey and outlook. IEEE Trans. Pattern Anal. Mach. Intell. 44, 6 (2021), 2872–2893.
    [53]
    Mang Ye, Jianbing Shen, and Ling Shao. 2020. Visible-infrared person re-identification via homogeneous augmented tri-modal learning. IEEE Trans. Inf. Forens. Secur. 16 (2020), 728–739.
    [54]
    Mang Ye, Zheng Wang, Xiangyuan Lan, and Pong C. Yuen. 2018. Visible thermal person re-identification via dual-constrained top-ranking. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’18), Vol. 1. 2.
    [55]
    La Zhang, Haiyun Guo, Kuan Zhu, Honglin Qiao, Gaopan Huang, Sen Zhang, Huichen Zhang, Jian Sun, and Jinqiao Wang. 2022. Hybrid modality metric learning for visible-infrared person re-identification. ACM Trans. Multimedia Comput. Commun. Appl. 18, 1s (2022), 15.
    [56]
    Peng Zhang, Jingsong Xu, Qiang Wu, Yan Huang, and Jian Zhang. 2019. Top-push constrained modality-adaptive dictionary learning for cross-modality person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 30, 12 (2019), 4554–4566.
    [57]
    Qiang Zhang, Changzhou Lai, Jianan Liu, Nianchang Huang, and Jungong Han. 2022. FMCNet: Feature-level modality compensation for visible-infrared person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7339–7348.
    [58]
    Shizhou Zhang, Yifei Yang, Peng Wang, Guoqiang Liang, Xiuwei Zhang, and Yanning Zhang. 2021. Attend to the difference: Cross-modality person re-identification via contrastive correlation. IEEE Trans. Image Process. 30 (2021), 8861–8872.
    [59]
    La Zhang, Haiyun Guo, Kuan Zhu, Honglin Qiao, Gaopan Huang, Sen Zhang, Huichen Zhang, Jian Sun, and Jinqiao Wang. 2022. Hybrid modality metric learning for visible-infrared person re-identification. ACM Trans. Multimedia Comput. Commun. Appl. (2022).
    [60]
    Zhiwei Zhao, Bin Liu, Qi Chu, Yan Lu, and Nenghai Yu. 2021. Joint color-irrelevant consistency learning and identity-aware modality adaptation for visible-infrared cross modality person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3520–3528.
    [61]
    Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision. 1116–1124.
    [62]
    Liang Zheng, Yi Yang, and Qi Tian. 2017. SIFT meets CNN: A decade survey of instance retrieval. IEEE Trans. Pattern Anal. Mach. Intell. 40, 5 (2017), 1224–1244.
    [63]
    Xian Zhong, Tianyou Lu, Wenxin Huang, Mang Ye, Xuemei Jia, and Chia-Wen Lin. 2021. Grayscale enhancement colorization network for visible-infrared person re-identification. IEEE Trans. Circ. Syst. Vid. Technol. 32, 3 (2021), 1418–1430.
    [64]
    Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random erasing data augmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 13001–13008.
