
1 Introduction

The problem of matching people across non-overlapping cameras, known as person re-identification (Re-ID), has drawn a great deal of attention recently [20, 53]. It remains unsolved for two reasons: (1) a person's appearance often changes dramatically across camera views due to occlusion, illumination and pose changes; (2) many people in public spaces wear similar clothes (e.g. dark coats, jeans) and thus have similar visual appearance.

Most recent Re-ID methods are based on supervised learning. Given a set of labelled training data consisting of images of people paired across camera views according to identity, a distance metric is learned either using hand-crafted features [9, 14, 19, 25, 31, 37, 38, 46, 48, 49, 55, 56, 58, 60], or end-to-end using deep neural networks [2, 36]. However, these methods require images of hundreds or more people to be paired across each pair of camera views, which is both tedious and sometimes impossible – some people do not reappear in other camera views. This severely limits the scalability of existing methods, making them unsuitable for practical large-scale Re-ID tasks. To overcome this problem, a number of unsupervised Re-ID methods have been proposed [30, 41, 54, 57]. However, without labelled training data, they can only focus on learning salient and view-invariant representations. Their performance is thus much weaker than that of the supervised methods, because they are unable to learn effectively the cross-view discriminative information critical for matching the same person whilst separating that person from imposters of similar appearance. Being uncompetitive on published benchmarks, these unsupervised models have received little attention, as current benchmarking does not take practicality and scalability into account.

Fig. 1. An illustration of graph learning for person Re-ID. (a) A graph constructed in the original low-level feature space; (b) a graph learned using the proposed model. One graph node and its five connected neighbours are shown, with the neighbour capturing the same person highlighted in red. (Color figure online)

In this work, we propose to learn a low-dimensional feature representation from a set of unlabelled data that can be easily collected. To learn a feature representation that is both view-invariant and discriminative, we exploit dictionary learning models that are shared across camera views. It is easy to see how a representation obtained by dictionary learning can be view-invariant and low-dimensional: dictionary learning is widely used as an unsupervised model for dimensionality reduction [1, 28, 43], and sharing the same dictionary across camera views intrinsically requires the learned representation to be view-invariant. It is the discriminative part that is non-trivial: how can we ensure that the learned representation is good for matching people across camera views, without the discriminative information from a set of paired training data?

Our solution is to relax the definition of discriminativity. Considering each dictionary atom as a new feature dimension, a learned dictionary defines a subspace into which the original data points, represented by high-dimensional low-level feature vectors, are projected. Instead of enforcing that data points corresponding to the same person be as close as possible whilst being far from other people in the learned subspace, as in supervised learning, we constrain visually similar people to be close to each other. Without identity labels this is obviously a weaker constraint, but it is the best available. Specifically, discriminativity is achieved without supervision via a visual similarity constraint, enforced by introducing a graph Laplacian regularisation term into the dictionary learning objective function [44].

However, two problems remain when the conventional graph Laplacian constraint is used in our problem context. (1) The conventional term has a squared \(\ell _2\)-norm, which makes it susceptible to data outliers. This is particularly unsuitable for Re-ID, where outliers abound for various reasons, such as imperfect person detection boxes and severe (self-)occlusions. (2) The visual similarity is encoded in a graph whose topology and edge weights are all determined by distances computed using the original high-dimensional low-level features. However, these features are not ideal for matching people, which is why a new representation is learned in the first place. As illustrated in Fig. 1(a), a graph constructed using the low-level features connects many visually dissimilar neighbours to each node. This diminishes the power of the graph regularisation term as a visual similarity constraint.

To overcome these two problems, we introduce a robust graph regularisation term and propose to learn the new representation and the optimal graph jointly. Specifically, an \(\ell _1\)-norm is introduced in our graph regularisation term to make it robust against outliers. With this \(\ell _1\)-norm and the joint graph and dictionary learning, our learning objective function is both non-smooth and non-convex, so solving the optimisation problem is non-trivial. An efficient iterative optimisation algorithm is formulated in this work to solve it. Once learned, our model can compute a representation for each image much more efficiently than any existing unsupervised Re-ID method. The final matching is done by computing a simple cosine distance between a pair of representation vectors.

1.1 Related Work

Most existing person Re-ID techniques are based on supervised learning: after hand-crafted features are extracted from each image, the optimal cross-view matching function is learned by distance metric learning [14, 31], learning to rank [7, 46], or discriminant subspace learning [19, 24, 25, 47, 56, 58]. Recently, representation and metric learning have been combined end-to-end in deep neural networks [2, 36], achieving state-of-the-art results when a large number of labelled training images are available. As mentioned earlier, all of these rely on hundreds of labelled samples per camera pair. Considering that a modest-sized surveillance video network can easily have hundreds of cameras, these supervised Re-ID models are of very limited practical use. Our model is related to the discriminant subspace learning methods [19, 24, 25, 47, 56, 58]. However, none of them can be employed in the unsupervised setting. In addition, kernelisation is critical to making them work [55]. In contrast, no kernelisation is required for our model, resulting in a small memory footprint.

The existing models for unsupervised learning of either features or representations for Re-ID fall into three categories. (1) Many focus on designing hand-crafted appearance features [10, 16, 37, 39, 40, 42]. However, it is very challenging to design a set of view-invariant features suitable for all camera view conditions. (2) Several methods exploit localised saliency statistics [54, 57]. Unable to utilise cross-view identity-discriminative information, their performance is typically weak. They are also patch-based methods in which a separate model is learned for every patch, making them computationally expensive. (3) There are also dictionary learning based methods which can intrinsically be used in an unsupervised setting [30, 41]. The key difference in this work is the use of robust graph Laplacian regularisation and joint graph and dictionary learning. We show experimentally that the proposed method is clearly superior to the existing unsupervised alternatives in both matching accuracy and running cost.

Beyond person Re-ID, dictionary learning [1, 28, 43] and graph regularisation [12, 18, 61] have been exploited in many different fields, including unsupervised clustering [34], supervised face verification/recognition [21] and semi-supervised learning [5, 8, 33]. Graph learning has also been considered for subspace clustering [22, 45]. However, none of the existing models is directly applicable to the unsupervised cross-view person matching problem. Importantly, none of them exploits both graph learning and robust graph regularisation. We show experimentally that both properties are critical for dictionary learning to be effective in solving the unsupervised Re-ID problem.

1.2 Contributions

Our contributions are two-fold: (1) We formulate a novel graph regularised dictionary learning model for unsupervised Re-ID with a new robust \(\ell _1\)-norm graph regularisation term and joint graph and dictionary learning. The model only requires unlabelled training data, which makes it suitable for large-scale Re-ID problems. (2) We develop an efficient iterative optimisation algorithm for the non-smooth and non-convex objective function of our model. During test time, the model is linear and has a closed-form solution for inference; it is thus extremely efficient. Extensive experiments are conducted on four large benchmark datasets, and the results show that our method significantly outperforms existing unsupervised methods in terms of both matching accuracy and running cost.

2 Methodology

2.1 Problem Definition

Suppose we have a set of unlabelled training data collected from two camera views. They are denoted as \(\mathbf {X} = [\mathbf {X}^{a},~\mathbf {X}^{b}] \in \mathbb {R}^{n \times m}\), where \(\mathbf {X}^{a} = [\mathbf {x}_1^{a},~...~, \mathbf {x}_{m_1}^{a}] \in \mathbb {R}^{n \times m_1}\) contains n-dimensional feature vectors of \(m_1\) images in view A, and \(\mathbf {X}^{b} = [\mathbf {x}_1^{b},~...~, \mathbf {x}_{m_2}^{b}] \in \mathbb {R}^{n \times m_2}\) of \(m_2\) images in view B. We thus have \(m = m_1+m_2\) data points in total. The objective of unsupervised person Re-ID is to learn a matching function f from \(\mathbf {X}\), so that given \(\mathbf {x}^a\) and \(\mathbf {x}^b\) as two test person images from A and B respectively, \(f(\mathbf {x}^a,\mathbf {x}^b)\) can match their identities.

2.2 Robust Graph Regularised Dictionary Learning

We solve the problem defined above by learning a dictionary \(\mathbf D \in \mathbb {R}^{n \times k}\) shared by the two camera views using \(\mathbf X\). Every atom of the learned dictionary (column of \(\mathbf D\)) can be considered a latent appearance attribute that is invariant to camera view condition changes. With this dictionary, each n-dimensional low-level feature vector, regardless of which view it comes from, is represented by the coefficients of the k dictionary atoms. This is equivalent to projecting the original n-dimensional low-level feature vectors into a lower-dimensional (\(k<n\)) latent attribute space. Matching is done by computing a simple cosine distance between two coefficient vectors in this space. Formally, we aim to learn the optimal dictionary \(\mathbf D\) such that the latent attribute representation of \(\mathbf X\), denoted \(\mathbf Y = [\mathbf Y^{a},~ \mathbf Y^{b}] \in \mathbb {R}^{k \times m}\), where \(\mathbf Y^{a} = [\mathbf y_1^{a},~...~, \mathbf y_{m_1}^{a}] \in \mathbb {R}^{k \times m_1}\) and \(\mathbf Y^{b} = [\mathbf y_1^{b},~...~, \mathbf y_{m_2}^{b}]\in \mathbb {R}^{k \times m_2}\), is optimised for matching the training data. We expect the same \(\mathbf D\) to generalise to matching unseen test data across camera views.

Conventional dictionary learning methods estimate the dictionary \(\mathbf D\) and the representation \(\mathbf Y\) simultaneously by solving the following optimisation problem:

$$\begin{aligned} (\mathbf D^{*}, \mathbf Y^{*}) = \arg \min _{\mathbf D,\mathbf Y} \Vert \mathbf X- \mathbf {DY}\Vert _F^2+\lambda _1 \mathrm {\Omega (\mathbf Y)} ~~s.t.~~\Vert \mathbf {d}_i\Vert _2^2\le 1, \end{aligned}$$
(1)

where \(\Vert \mathbf X- \mathbf {DY}\Vert _F^2 \) is the reconstruction error, evaluating how well a linear combination of the learned atoms approximates the input data, and \(\Vert \cdot \Vert _F\) denotes the matrix Frobenius norm. \(\mathrm {\Omega (\mathbf Y)}\) is a regularisation term weighted by \(\lambda _1\). Different models differ mainly in the choice of this regularisation term on \(\mathbf Y\). The sparsity term \(\mathrm {\Omega (\mathbf Y)} = \Vert \mathbf Y\Vert _1\) is widely used, favouring a small number of atoms for reconstruction. The constraint \(\Vert \mathbf {d}_i\Vert _2^2\le 1\) (\(\mathbf {d}_i\) is a column of \(\mathbf {D}\), \(i = 1, ... , k\)) keeps the learned dictionary atoms compact. It is clear from this formulation that a conventional dictionary learning model only cares about how well \(\mathbf X\) is reconstructed from \(\mathbf D\) and \(\mathbf Y\), without considering whether the representation \(\mathbf Y\) is discriminative. To learn a discriminative dictionary for cross-view Re-ID, one must exploit cross-view identity discriminative information.
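For concreteness, the objective in Eq. (1) can be written directly in a few lines of numpy; the sketch below is our illustration (not the authors' code) and uses the sparsity regulariser \(\mathrm {\Omega (\mathbf Y)} = \Vert \mathbf Y\Vert _1\) discussed above:

```python
import numpy as np

def dl_objective(X, D, Y, lambda1):
    """Value of Eq. (1) with Omega(Y) = ||Y||_1.

    X: (n, m) data, D: (n, k) dictionary, Y: (k, m) codes."""
    reconstruction = np.linalg.norm(X - D @ Y, 'fro') ** 2
    return reconstruction + lambda1 * np.abs(Y).sum()

# The constraint ||d_i||_2^2 <= 1 bounds every atom (column of D), e.g.:
# assert np.all((D ** 2).sum(axis=0) <= 1.0 + 1e-9)
```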

A learned dictionary can be made discriminative by using a graph regularisation term which dictates that visually similar people be close to each other in the learned latent attribute space [11]. Let \(\mathbf {G}=\left( \mathbf {V},\mathbf {E}\right) \) be an undirected graph connecting the data points, where \(\mathbf {V}\) is the set of graph vertices representing the data points and \(\mathbf {E}\) is the edge set. This graph can be encoded by an affinity matrix \(\mathbf {W}\) over the m data points, where \(\mathbf {W}_{i,j}\ne 0\) if the two vertices are connected, i.e. the corresponding data points are in a local neighbourhood. Note: (1) in the context of person Re-ID we focus on cross-view discriminative dictionary learning, so the graph edges are restricted to connecting cross-view nodes only; (2) we use the graph regularisation term to replace the commonly used sparsity constraint \(\Vert \mathbf Y\Vert _1\), for reasons explained later.

A standard graph regularisation term \(\mathrm {\Omega (\mathbf {Y})}\) is defined as:

$$\begin{aligned} \mathrm {\Omega (\mathbf {Y})}=\sum _{ij}^{m}\mathbf {W}_{ij}\Vert \mathbf {y}_i-\mathbf {y}_j\Vert _2^2. \end{aligned}$$
(2)

This regularisation essentially requires the projected data points in the learned latent attribute space to be smooth with respect to the graph, that is, their distances must conform to the visual similarity relationships embedded in the graph. However, we find that Eq. (2) has two critical limitations that make it unsuitable for the unsupervised Re-ID problem. First, the distance between two projected data points is calculated with a squared \(\ell _2\)-norm. It is well known that a square-based regularisation function can easily be dominated by outlying data samples. Unfortunately, outlying samples are commonplace in Re-ID because of background clutter in person detection bounding boxes, detector errors, and (self-)occlusions. The second limitation arises from how the graph is constructed. Most existing methods build the graph in the original high-dimensional low-level feature space using \(\mathbf {X}\). This is suboptimal – if the low-level feature space were good for measuring cross-camera visual similarity, the Re-ID problem would already be solved. We learn a discriminative latent attribute space precisely because measuring visual similarity in the original space is unreliable and error-prone, as illustrated in Fig. 1. To tackle both limitations simultaneously, we introduce a robust graph regularisation formulation and a joint graph and dictionary learning method.

Robust Graph Regularisation. This new term is designed to alleviate the effect of outlying samples during model learning. To derive our robust graph regularisation, let us first rewrite Eq. (2) in a matrix form with trace notation:

$$\begin{aligned} \mathrm {\Omega (\mathbf {Y})}=\sum _{ij}^{m}\mathbf {W}_{ij}\Vert \mathbf {y}_i-\mathbf {y}_j\Vert _2^2 = tr(\mathbf {YL_WY^T}), \end{aligned}$$
(3)

where \(\mathbf {L_W}=\mathbf {\Delta }-\mathbf {W}\) is the Laplacian matrix, with \(\mathbf {\Delta }\) the degree matrix, \(\mathbf {\Delta }_{ii}=\sum _j \mathbf {W}_{ij}\). Writing the eigendecomposition \(\mathbf {L_W}=\mathbf {U_WS_WU_W^T}\), after some matrix manipulation we have

$$\begin{aligned} tr(\mathbf {YL_WY^T}) = tr(\mathbf {YU_WS_WU_W^TY^T})= \nonumber \\ tr(\mathbf {YU_WS_W^{\frac{1}{2}} S_W^{\frac{1}{2}} U_W^T Y^T})=\Vert \mathbf {YA_W}\Vert _F^2, \end{aligned}$$
(4)

where \(\mathbf {A_W}=\mathbf {U_WS_W^{\frac{1}{2}}}\). Equation (4) above is quadratic. To promote sparsity and suppress the effect of outlying samples, we adopt an \(\ell _1\)-norm instead of the Frobenius norm. This gives the proposed graph weighted \(\ell _1\)-norm regularisation term

$$\begin{aligned} \mathrm {\Omega _{R1}(\mathbf {Y})} = \Vert \mathbf {YA_W}\Vert _1. \end{aligned}$$
(5)
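To make the derivation concrete, the following numpy sketch (our illustration) builds \(\mathbf {A_W}\) from an affinity matrix \(\mathbf {W}\) and notes the numerical identity of Eq. (4):

```python
import numpy as np

def laplacian_factor(W):
    """Compute A_W = U_W S_W^(1/2) from the Laplacian of affinity matrix W."""
    L = np.diag(W.sum(axis=1)) - W           # graph Laplacian: degree - affinity
    S, U = np.linalg.eigh(L)                 # L is symmetric positive semi-definite
    return U @ np.diag(np.sqrt(np.clip(S, 0.0, None)))

# Sanity check of Eq. (4): tr(Y L Y^T) == ||Y A_W||_F^2, and the robust
# term of Eq. (5) is then simply np.abs(Y @ A_W).sum().
```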

Replacing \(\mathrm {\Omega (\mathbf {Y})}\) with \(\mathrm {\Omega _{R1}(\mathbf {Y})}\) in Eq. (1), we have a robust graph regularised dictionary learning model:

$$\begin{aligned} \underset{\mathbf {D},\mathbf {Y}}{\text {min}}~ \frac{1}{2}\Vert \mathbf {X}-\mathbf {DY}\Vert _F^2 + \lambda _1\Vert \mathbf {YA_W}\Vert _1 {~~s.t.~~} \Vert \mathbf {d}_i\Vert _2^2\le 1. \end{aligned}$$
(6)

The key advantages of the proposed robust graph regularisation over the conventional formulation, including the existing dictionary learning based Re-ID model DLLAP [30], are as follows:

  1. Non-linearity. Robust graph regularisation introduces non-linearity into the objective, i.e. \(\mathbf Y\) is non-linear with respect to the original data \(\mathbf X\), whilst the conventional graph regularisation is linear.

  2. Sparsity. It is well known that the \(\ell _1\)-norm has a shrinkage property and thus promotes sparsity [27, 29]. Intuitively, in the presence of noise and outliers, the magnitude of \(\Vert \mathbf {YA}_W\Vert _F^2\) becomes very large for outlying data points, so the whole objective function can be dominated by them. In contrast, \(\Vert \mathbf {YA}_W\Vert _1\) becomes sparse due to the \(\ell _1\)-norm, suppressing the impact of outliers and noise. Moreover, with the proposed robust regularisation, an explicit sparsity constraint such as \(\Vert \mathbf Y\Vert _1\) is no longer needed.

Joint Graph and Dictionary Learning. Instead of computing \(\mathbf {W}\) from \(\mathbf {X}\) and fixing it during model learning, we assume that \(\mathbf {W}\) (and hence the graph \(\mathbf {G}\), whose topology \(\mathbf {W}\) encodes) is unknown and must be learned together with \(\mathbf {D}\) and \(\mathbf {Y}\). Our objective function thus becomes:

$$\begin{aligned} \underset{\mathbf {D,W,Y}}{\text {min}}~ \frac{1}{2}\Vert \mathbf {X}-\mathbf {DY}\Vert _F^2 + \lambda _1\Vert \mathbf {YA_W}\Vert _1 +\lambda _2 \Vert \mathbf {W}\Vert _F^2 \quad {s.t.}~~ \Vert \mathbf {d}_i\Vert _2^2\le 1,~ \mathbf {W}_{i}^T \mathbf 1 =1,~ \mathbf {W}_{i} \ge 0, \end{aligned}$$
(7)

where \(\lambda _2 \Vert \mathbf {W}\Vert _F^2\) is a regularisation term on \(\mathbf {W}\), weighted by \(\lambda _2\), to prevent trivial solutions. The constraints \(\mathbf {W}_{i}^T \mathbf 1 = 1\) and \(\mathbf {W}_{i}\ge 0\) ensure the validity of the learned graph. We show in our experiments (Sect. 3.2) that this novel joint learning of graph and dictionary has a significant advantage over the existing dictionary learning based Re-ID model DLLAP [30].

2.3 Optimisation

The optimisation problem in (7) is non-convex and non-smooth. Solving it is thus more difficult than (1) due to the \(\ell _1\)-norm used in \(\mathrm {\Omega _{R1}(\mathbf {Y})}\) and the additional unknown variable \(\mathbf {W}\). Next, we develop an efficient solver for (7) based on the Alternating Direction Method of Multipliers (ADMM) [6].

First, we transform (7) by introducing an auxiliary variable \(\mathbf {U}=\mathbf {YA_W}\); the augmented Lagrangian of (7) with this constraint is:

$$\begin{aligned}&\mathcal {L}(\mathbf {D},\mathbf {Y},\mathbf {U},\mathbf {W}) = \frac{1}{2}\Vert \mathbf {X}-\mathbf {DY}\Vert _F^2 + \lambda _1 \Vert \mathbf {U}\Vert _1 + \langle \mathbf {F},\mathbf {U}-\mathbf {YA_W} \rangle \\&\qquad + \frac{\gamma }{2}\Vert \mathbf {U}-\mathbf {YA_W}\Vert _F^2 + \lambda _2\Vert \mathbf {W}\Vert _F^2 \\&{s.t.}~~ \Vert \mathbf {d}_i\Vert _2^2\le 1,~ \mathbf {W}_{i}^T \mathbf 1 = 1,~ \mathbf {W}_{i} \ge 0, \end{aligned}$$
(8)

where \(\mathbf F\) is the Lagrangian multiplier and \(\gamma \) is a penalty parameter. We can now solve the problem by alternating among the following five steps, updating \(\mathbf {D}\), \(\mathbf {Y}\), \(\mathbf {U}\) and \(\mathbf {W}\) in turn, followed by the multipliers.

(1) Solving for \(\mathbf {D}\) : To learn \(\mathbf {D}\) for a given \(\mathbf {Y}\), the objective function reduces to:

$$\begin{aligned} \min _{\mathbf {D}} \frac{1}{2}\Vert \mathbf {X}-\mathbf {DY}\Vert _F^2 ~~~~s.t.~~ \Vert \mathbf {d}_i\Vert _2^2\le 1 \end{aligned}$$
(9)

To solve this, we use the Lagrange dual method as in [32]. The analytical solution of \(\mathbf {D}\) can be computed as: \(\mathbf {D^{*}} = \mathbf {XY^T(YY^T}+\mathbf {\Lambda ^{*}})^{-1}\), where \(\mathbf {\Lambda ^{*}}\) is a diagonal matrix constructed from all the optimal dual variables.
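As a sketch only: rather than implementing the full Lagrange dual solver of [32], the snippet below uses a common simplification, taking the regularised least-squares solution and rescaling any atom that violates the unit-norm constraint. This projection heuristic is our substitution, not the paper's exact solver.

```python
import numpy as np

def update_dictionary(X, Y, eps=1e-6):
    """Approximate D-step for Eq. (9).

    Closed-form least-squares fit D = X Y^T (Y Y^T + eps*I)^{-1},
    then shrink any atom with ||d_i||_2 > 1 back onto the unit ball."""
    k = Y.shape[0]
    D = X @ Y.T @ np.linalg.inv(Y @ Y.T + eps * np.eye(k))
    norms = np.maximum(np.linalg.norm(D, axis=0), 1.0)  # leave small atoms alone
    return D / norms
```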

(2) Solving for \(\mathbf {Y}\) : For a given \(\mathbf {D}\), solve the following objective to estimate \(\mathbf {Y}\):

$$\begin{aligned} \min _{\mathbf {Y}} \frac{1}{2}\Vert \mathbf {X}-\mathbf {DY}\Vert _F^2 + \frac{\mathrm {\gamma }}{2} \Vert \mathbf {U}-(\mathbf {YA_W}-\frac{\mathbf {F}}{\gamma })\Vert _F^2. \end{aligned}$$

Since each term in this objective is quadratic, we can take its derivative and set it to zero, which gives

$$\begin{aligned} \mathbf {D^TDY}+\gamma \mathbf {YA_WA_W^T} = \mathbf {D^TX}+\gamma \mathbf {UA_W^T}+\mathbf {FA_W^T}. \end{aligned}$$

This is a standard Sylvester equation, which is solved using the Bartels-Stewart algorithm [4].
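With scipy, this Y-step reduces to a single call to a Sylvester solver; a sketch under the notation of the text:

```python
from scipy.linalg import solve_sylvester

def update_codes(X, D, U, F, A_W, gamma):
    """Y-step: solve  D^T D Y + Y (gamma A_W A_W^T) = D^T X + (gamma U + F) A_W^T."""
    A = D.T @ D                          # (k, k)
    B = gamma * (A_W @ A_W.T)            # (m, m)
    Q = D.T @ X + (gamma * U + F) @ A_W.T
    return solve_sylvester(A, B, Q)      # Bartels-Stewart under the hood
```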

(3) Solving for \(\mathbf {U}\) : For a given \(\mathbf {Y}\), solve the following objective to estimate \(\mathbf {U}\):

$$\begin{aligned} \min _\mathbf {U} \lambda _1 \Vert \mathbf {U}\Vert _1 + \frac{\gamma }{2}\Vert \mathbf {U}-(\mathbf {YA_W}-\frac{\mathbf {F}}{\gamma })\Vert _F^2. \end{aligned}$$

We can use the soft-thresholding operator to get \(\mathbf {U}\):

$$\begin{aligned} \mathbf {U} = \mathrm {sign} \bigg (\mathbf {YA_W}-\frac{\mathbf {F}}{\gamma }\bigg ) \odot \mathrm {max}\bigg (\bigg | \mathbf {YA_W}-\frac{\mathbf {F}}{\gamma }\bigg |-\frac{\lambda _1}{\gamma },~0\bigg ), \end{aligned}$$
(10)
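where \(\odot \) and \(\mathrm {max}(\cdot ,0)\) operate elementwise. For reference, this shrinkage operator is one line of numpy (a sketch):

```python
import numpy as np

def soft_threshold(Z, tau):
    """Elementwise soft-thresholding: sign(Z) * max(|Z| - tau, 0)."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

# U-step of Eq. (10):
# U = soft_threshold(Y @ A_W - F / gamma, lambda1 / gamma)
```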

(4) Solving for \(\mathbf {W}\) : Given \(\mathbf {Y}\), the objective function with respect to \(\mathbf {W}\) is:

$$\begin{aligned} \underset{\mathbf {W}}{\text {min}}\, \lambda _1\sum _{ij}^{m}\mathbf {W}_{ij} \Vert \mathbf {y}_i - \mathbf {y}_j\Vert _1 +\lambda _2\Vert \mathbf {W}\Vert _F^2 {~~s.t.~~} \mathbf {W}_{i}^T \mathbf 1 =1,~ \mathbf {W}_{i} \ge 0. \end{aligned}$$

We set \(\lambda _1=1\) for simplicity and denote \(\mathbf {d}_{ij} = \frac{\Vert \mathbf {y}_i-\mathbf {y}_j\Vert _1}{2\lambda _2}\); with \(\Vert \mathbf {W}\Vert _F^2 = \sum _{ij}\mathbf {W}_{ij}^2\), the problem becomes

$$\begin{aligned} \min _\mathbf {W} \sum _{ij}^{m} 2\mathbf {W}_{ij} \mathbf {d}_{ij} + \sum _{ij}^m \mathbf {W}_{ij}^2~~s.t.~~\mathbf {W}_{i}^T \mathbf 1 =1,~ \mathbf {W}_{i} \ge 0. \end{aligned}$$

The above optimisation problem decomposes into independent subproblems, one per row i, and can therefore be rewritten in vector form:

$$\begin{aligned} \min _{\mathbf {W}_i} \Vert \mathbf {W}_i + \mathbf {d}_i\Vert _2^2~~s.t.~~ \mathbf {W}_{i}^T \mathbf 1 =1,~ \mathbf {W}_i \ge 0. \end{aligned}$$

There is a closed-form solution using Lagrange multipliers [22, 45] for this problem:

$$\begin{aligned} \mathbf {W}_i = \bigg ( \frac{1+\sum _{j=1}^K \mathbf {\tilde{d}}_{i,j}}{K}\mathbf 1 - \mathbf {d}_i \bigg )_{+} \end{aligned}$$
(11)

where the operator \((\mathbf {q})_{+}\) projects the negative elements of \(\mathbf {q}\) to 0, K is the parameter controlling the number of neighbours, and \(\mathbf {\tilde{d}}_i\) is \(\mathbf {d}_i\) sorted in ascending order. After obtaining \(\mathbf {W}\), we symmetrise it and perform an eigendecomposition to obtain \(\mathbf {U_W}\) and \(\mathbf {S_W}\); we then set \(\mathbf {A_W} = \mathbf {U_WS_W^{\frac{1}{2}}}\). Note that the regularisation parameter \(\lambda _2\) can be determined following [45]:

$$\begin{aligned} \lambda _2 = \frac{1}{m} \sum _{i=1}^m (\frac{K}{2} \mathbf {d}_{i,K+1} - \frac{1}{2}\sum _{j=1}^K \mathbf {d}_{i,j}). \end{aligned}$$
(12)
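A sketch of the row-wise W-step, folding Eq. (12) into Eq. (11) so that each row uses its own \(\lambda _2\) (the paper averages \(\lambda _2\) over rows; we keep it per-row here for brevity, and omit the cross-view-only edge restriction noted in Sect. 2.2):

```python
import numpy as np

def update_graph(Y, K=5):
    """W-step: closed-form K-sparse graph update from the current codes Y.

    d[i, j] is the l1 distance between coding vectors y_i and y_j; each
    row of W receives K non-zero weights per Eqs. (11)-(12)."""
    m = Y.shape[1]
    d = np.abs(Y[:, :, None] - Y[:, None, :]).sum(axis=0)  # (m, m) l1 distances
    W = np.zeros((m, m))
    for i in range(m):
        order = np.argsort(d[i])          # ascending; order[0] == i (self, d = 0)
        nn, kth = order[1:K + 1], d[i, order[K + 1]]
        denom = K * kth - d[i, nn].sum() + 1e-12
        W[i, nn] = (kth - d[i, nn]) / denom   # non-negative, sums to 1
    return (W + W.T) / 2                  # symmetrise before eigendecomposition
```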

(5) Updating the multipliers \(\mathbf {F}\) and \(\gamma \):

$$\begin{aligned} \mathbf {F} = \mathbf {F}^{old} + \gamma (\mathbf {U}-\mathbf {YA_W}), ~~\gamma = \rho \gamma ^{old} \end{aligned}$$

In this work, we set \(\rho \) to 1.1 and initialise \(\gamma \) to 0.1. Typically the value for \(\rho \) is set between 1.0 and 1.8 [6].

We continue alternating the updates of \(\mathbf {D}\), \(\mathbf {Y}\), \(\mathbf {U}\) and \(\mathbf {W}\) until a maximum number of iterations is reached or the change in the objective value falls below a predefined threshold (\(10^{-3}\)).
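Putting the five steps together, the overall alternation can be sketched as below; it assumes the helper functions from the earlier sketches (laplacian_factor, update_dictionary, update_codes, soft_threshold, update_graph), and the initialisation choices are ours, not specified in the paper:

```python
import numpy as np

def train(X, k=256, lambda1=1.0, K=5, gamma=0.1, rho=1.1, max_iter=50, tol=1e-3):
    """ADMM-style alternation for the objective in Eq. (7) (a sketch)."""
    n, m = X.shape
    rng = np.random.default_rng(0)
    D = X[:, rng.choice(m, size=k, replace=False)]   # initialise atoms from data
    Y = np.linalg.lstsq(D, X, rcond=None)[0]
    A_W = laplacian_factor(update_graph(Y, K))
    U, F = Y @ A_W, np.zeros((k, m))
    prev = np.inf
    for _ in range(max_iter):
        D = update_dictionary(X, Y)                               # step (1)
        Y = update_codes(X, D, U, F, A_W, gamma)                  # step (2)
        U = soft_threshold(Y @ A_W - F / gamma, lambda1 / gamma)  # step (3)
        A_W = laplacian_factor(update_graph(Y, K))                # step (4)
        F = F + gamma * (U - Y @ A_W)                             # step (5)
        gamma *= rho
        obj = 0.5 * np.linalg.norm(X - D @ Y, 'fro') ** 2 \
              + lambda1 * np.abs(Y @ A_W).sum()
        if abs(prev - obj) < tol:     # stop once the objective stabilises
            break
        prev = obj
    return D
```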

Convergence Analysis. No theoretical convergence proof exists for ADMM on a non-convex problem such as ours. In practice, however, the objective reliably converges to at least a stable point [6]. This is validated by our experiments (see Sect. 3); in particular, the proposed algorithm exhibits stable convergence behaviour, always converging within 10–25 iterations.

Remark on Computational Complexity and Scalability. Due to space limitations, we leave the computational complexity analysis and the scalability with respect to the number of samples to the supplementary material.

2.4 Cross-View Matching

After learning the dictionary \(\mathbf D\) using the unlabelled training data \(\mathbf X\), given a pair of test samples \(\mathbf x_i^a\) and \(\mathbf x_i^b\), we first compute their collaborative representations \(\mathbf y_i^a\) and \(\mathbf y_i^b\) by solving the following problems:

$$\begin{aligned} \mathbf y_i^{a*}= \arg \min _{\mathbf y_i^a} \Vert \mathbf x_i^a-\mathbf {Dy}_i^a\Vert _2^2+\lambda \Vert \mathbf y_i^a\Vert _2^2 \end{aligned}$$
(13)
$$\begin{aligned} \mathbf y_i^{b*}= \arg \min _{\mathbf y_i^b} \Vert \mathbf x_i^b-\mathbf {Dy}_i^b\Vert _2^2+\lambda \Vert \mathbf y_i^b\Vert _2^2 \end{aligned}$$
(14)

These are standard \(\ell _2\)-norm regularised least squares problems with closed-form solutions: \(\mathbf y_i^{a*} = \mathbf {Px}_i^a\) and \(\mathbf y_i^{b*} = \mathbf {Px}_i^b\), where \(\mathbf P=(\mathbf {D^T D}+\lambda \mathbf I)^{-1}\mathbf {D^T}\). Then, after obtaining \(\mathbf y_i^{a*}\) and \(\mathbf y_i^{b*}\), their cosine distance is used to measure visual similarity for Re-ID. Hence, our model is very efficient at test time.
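At test time the whole matcher therefore reduces to one precomputed linear map followed by a cosine similarity. A sketch (the ridge weight lam below is an illustrative value; the paper does not specify it here):

```python
import numpy as np

def match(D, Xa, Xb, lam=0.1):
    """Encode probe/gallery images via Eqs. (13)-(14) and score by cosine.

    D: (n, k) dictionary; Xa, Xb: (n, ma) and (n, mb) test features.
    Returns an (ma, mb) cosine-similarity matrix."""
    k = D.shape[1]
    P = np.linalg.solve(D.T @ D + lam * np.eye(k), D.T)  # P = (D^T D + lam I)^{-1} D^T
    Ya, Yb = P @ Xa, P @ Xb
    Ya /= np.linalg.norm(Ya, axis=0, keepdims=True)      # unit-normalise so that
    Yb /= np.linalg.norm(Yb, axis=0, keepdims=True)      # dot product == cosine
    return Ya.T @ Yb
```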

2.5 Extension to Supervised Re-ID

Although our model is designed for unsupervised Re-ID, it can easily be extended when labelled cross-view pairs become available. More specifically, the label information can be encoded in the graph \(\mathbf W\): instead of learning \(\mathbf W\), it is now fixed, with \(\mathbf W_{i,j}\) set to 1 if the corresponding cross-view pair (i, j) is labelled as containing the same person and to 0 otherwise. This essentially gives the ideal graph, and the relaxed visual similarity constraint becomes a more stringent identity constraint requiring people of the same identity to be close in the learned attribute space and people of different identities to be far apart.
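For the supervised extension, the fixed ideal graph can be built directly from the labelled pairs; a minimal sketch (the index conventions are ours):

```python
import numpy as np

def supervised_graph(pairs, m1, m2):
    """Fixed ideal W for Sect. 2.5: W[i, j] = 1 iff the cross-view pair
    shares an identity. View-A images occupy columns 0..m1-1 of X,
    view-B images occupy columns m1..m1+m2-1."""
    W = np.zeros((m1 + m2, m1 + m2))
    for i, j in pairs:                    # i indexes view A, j indexes view B
        W[i, m1 + j] = W[m1 + j, i] = 1.0
    return W
```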

3 Experiments

3.1 Datasets and Settings

Datasets. Four widely used benchmark datasets are used in the experiments. VIPeR [15] contains 632 image pairs of people captured outdoors from two non-overlapping camera views. Following the standard single-shot setting, i.e. one image per person per view, the dataset is randomly split into two sets of 316 image pairs, one for training and the other for testing. For the test set, all images from one view are used as the gallery set and those from the other view as the probe set. The results of all evaluations are averaged over 10 splits. PRID [23] differs from the other datasets in that the gallery and probe sets have different numbers of people. We use the single-shot version of the dataset as in [19, 26, 46]. Specifically, of the 749 people captured in two camera views, only 200 appear in both. In each data split, 100 of these 200 people are chosen randomly for training; the remaining 100 from one view form the probe set, and the remaining 649 people's images from the other view form the gallery, which thus includes the 100 people in the probe set. Experiments are carried out on the same 10 splits as in [19, 26] and the average results are reported. CUHK01 [35] consists of 971 people with two images per person per camera view, i.e. multi-shot. We follow the standard setting [35]: 486 people for training and 485 for testing. CUHK03 [36] contains 13,164 images of 1,467 people. Two versions exist, differing in whether the images were cropped manually or automatically by the DPM person detector [17]. The detector-generated images are used as they better reflect real-world application scenarios and test the robustness of our model against outliers. There are six camera views in total, but each person is observed in only two of them, with 4.8 images per view on average. We use the same setting and random splits as in [36] under a single-shot setting: for the probe set we randomly select 100 people with two images each, whilst the images of the remaining people are used for training. Note that, of the four datasets, CUHK03 is much bigger than the other three in terms of both the number of identities and the number of images in the training set.

Settings. Features: The features introduced in [19] are adopted. Each image is scaled to 128\(\times \)48 for all datasets, and histogram-based image descriptors of three types are computed: (1) colour histograms in the HS, RGB and Lab colour spaces (2880-D), (2) HOG (1040-D) [13], and (3) LBP (1218-D) [3]. The final 5138-D image feature vector is the concatenation of these three. Evaluation metrics: We compute Cumulative Matching Characteristic (CMC) curves. Due to space constraints, we report only the Rank 1 matching accuracies here and leave the full CMC curves to the supplementary material. Parameter settings: There are a number of parameters in our model. Since the method is unsupervised, there is no alternative to setting them manually. The dictionary size k is not carefully tuned: it is set to 256 for the two smaller datasets, VIPeR and PRID, and to 512 for the larger CUHK01 and CUHK03 datasets; its effect on performance is discussed later. The objective function (Eq. (7)) has two regularisation weights, \(\lambda _1\) and \(\lambda _2\). As explained in Sect. 2.3, \(\lambda _2\) is set automatically using Eq. (12) within the ADMM algorithm, whilst \(\lambda _1\) is simply set to 1 throughout, as we found the algorithm insensitive to its value. For the initial construction of the graph \(\mathbf {G}\), we use a KNN graph with cosine distance and \(K=5\) for all datasets.

3.2 Evaluation of Unsupervised Learning Based Re-ID

Compared methods. Under this setting, we compare our approach with state-of-the-art unsupervised alternatives falling into four categories: (1) the hand-crafted feature based methods SDALF [16] and CPS [10]; (2) the saliency learning based eSDC [57] and GTS [54]; (3) the dictionary learning based DLLAP [30], which uses the same 5138-D features for fair comparison; and (4) the codebook learning based BGG [59].

Table 1. Unsupervised Re-ID results measured in Rank-1 matching accuracy (%) on VIPeR, PRID, CUHK01, CUHK03, where ‘-’ denotes no reported result.

Results. Table 1 compares the proposed method against the six alternatives and a non-learning \(\ell _1\)-distance baseline. From Table 1, the following observations can be made. (1) Our robust graph regularised dictionary learning model outperforms all existing unsupervised methods on all four datasets, often by a big margin. (2) The margin is in general bigger on the two larger datasets, CUHK01 and CUHK03, indicating that our model benefits more from larger unlabelled training sets. (3) Among the alternatives, the dictionary learning based DLLAP [30] is the most competitive. However, the gap remains significant thanks to the two novel components introduced here: robust graph regularisation and joint graph and dictionary learning. This result also suggests that learning a low-dimensional latent attribute representation is better suited to unsupervised Re-ID than the alternative models. In particular, the large difference between Ours and \(\ell _1\) shows that matching people is made much easier in the learned discriminative subspace, which has less than one tenth of the original dimensions. The computational efficiency advantage of our method over the others is discussed later.

3.3 Evaluation of Supervised Learning Based Re-ID

Compared methods. Since the performance of existing methods often varies drastically across datasets, we choose the best methods for each dataset separately to better reflect the state-of-the-art. All compared methods were published in the last two years. Note that multi-feature fusion based methods are listed separately from single-feature and deep models, as typically any method can benefit from multi-feature fusion. As mentioned in Sect. 2.5, our model can also operate in a supervised mode, denoted Ours_sup; this can be considered the upper bound of our model's performance under the unsupervised setting, reached when the graph is learned perfectly.

Results. We have the following key findings from Table 2. (1) The gap between Ours_un and Ours_sup is moderate, indicating that our graph learning method is very effective: the performance of the unsupervised model is not far from its upper bound. (2) On the two smaller datasets, VIPeR and PRID, our model is very competitive under the supervised setting: on VIPeR it beats all single-feature based methods, and on PRID it outperforms all existing supervised methods, often significantly. Even our unsupervised model outperforms some very recent supervised models. Note that this is achieved without any kernelisation, which could further improve our model's performance. (3) On the two larger datasets, CUHK01 and CUHK03 (with detected person images), a gap between our method and the state-of-the-art begins to appear. Our model (both supervised and unsupervised) remains competitive on CUHK01, but on CUHK03 the gap is big, particularly for our unsupervised model. This is expected: with over 10,000 labelled training images from 1,367 people, an unsupervised model cannot compete with a supervised one, especially one based on deep learning. However, we point out that in practice collecting hundreds of labelled training samples is very difficult, and collecting thousands would be near impossible across even a handful of camera views.

Table 2. Comparison with state-of-the-art supervised methods

3.4 Further Analysis

The contributions of individual components. Our proposed method has two key components; to see the impact of each, we compare our full model with various stripped-down versions under the unsupervised setting: (1) Ours\(\_DL\) – no graph regularisation, i.e. conventional dictionary learning; (2) Ours\(\_\ell _2\) – the graph is fixed and the \(\ell _2\)-norm is used for graph regularisation; (3) Ours\(\_\ell _2\)_graph – the graph is learned and the \(\ell _2\)-norm is used; (4) Ours\(\_\ell _1\) – the graph is fixed and the \(\ell _1\)-norm is used; (5) Ours_full – our full model, in which the graph is learned and the \(\ell _1\)-norm is used. Table 3 shows that both the robust \(\ell _1\)-norm graph regularisation and the joint graph and dictionary learning contribute positively to the final performance. Comparing Ours\(\_DL\) with the other models also shows that adding a graph regularisation term to learn cross-view discriminative information is critical for dictionary learning based Re-ID.

Table 3. The contributions of individual model components

Effect of dictionary size and convergence analysis. The only parameter tuned for each dataset is the dictionary size. Figure 2 (Left) shows that once the size exceeds 100, its effect is small. Furthermore, Fig. 2 (Right) shows that the proposed method converges rapidly: although there is no theoretical proof, convergence is observed within 25 iterations in all our experiments.

Fig. 2. (Left) Rank 1 accuracies with different dictionary sizes on the VIPeR dataset; (Right) objective function value with respect to the number of iterations on CUHK01.

Running cost. Our experiments were conducted in MATLAB on a PC with two 3.40 GHz CPUs and 16 GB RAM. Training the model on VIPeR takes 178.3 s, but testing is very efficient: once the 5138-D features are extracted, it takes only 0.01 s to match one probe image against the 316 gallery images. Table 4 compares the test-time feature extraction and matching costs against a number of alternative unsupervised methods. Our method is often a few orders of magnitude faster than its competitors.

Table 4. Average testing time of different methods on VIPeR

4 Conclusion

We have proposed a novel unsupervised Re-ID model based on dictionary learning. The key contribution is the introduction of a robust \(\ell _1\)-norm graph regularisation term into the dictionary learning formulation so that cross-view discriminative information can be learned without identity labels. In addition, a joint graph and dictionary learning algorithm is developed, which further improves the model's ability to deal with the outlying samples abundant in person Re-ID data. Extensive experiments on four benchmark datasets show that the proposed method significantly outperforms existing unsupervised methods.