
1 Introduction

Visual object tracking has long been a popular problem in computer vision, and various tracking algorithms have been proposed [1,2,3,4,5]. With the development and application of machine learning and deep learning, visual object tracking based on discriminative modeling has been widely used, and tracking performance has improved significantly. The most common idea is to convert the tracking problem into a binary classification problem and to use online learning to update the model dynamically, as in TLD [3], MIL [9], CT [10], CNT [7] and many other advanced algorithms.

At the same time, tracking algorithms based on correlation filters have also been very popular. Bolme et al. proposed the Minimum Output Sum of Squared Error (MOSSE) algorithm [5]. It used the minimum output sum of squared error as the discriminant model separating the target from the background, and performed the computation in the frequency domain via the discrete Fourier transform, which significantly reduced the computing complexity. CSK [2] then adopted a circulant matrix to obtain a large number of training samples. In the literature [12], Kiani Galoogahi et al. proposed using multi-channel features in correlation filters, with strong experimental results.

Various features have also received great attention in correlation filters, such as gray, HOG and CNN features [6]. The HOG feature has many advantages. Because HOG operates on local grid cells of the image, it keeps good invariance to geometric and photometric deformations, which only appear over larger spatial regions. As long as pedestrians generally maintain an upright posture, HOG tolerates some minor body movements, so it is particularly suitable for human detection in images. KCF [1] improved CSK with a Gaussian kernel and the HOG feature, which improved the performance of the CSK algorithm. Li and Zhu proposed an adaptive kernel correlation filter tracker with feature integration in reference [13]. The gray feature contains only brightness information and no color information, so a tracker can process it very quickly; correlation filter trackers based on it [14,15,16,17,18,19,20] achieve good results. Correlation filter trackers based on CNN features [6] can further improve tracking performance. However, these features also have disadvantages. The HOG feature is very sensitive to noise because of the nature of its gradients. The gray feature is too simple to handle video sequences in complex environments. And as the computing complexity of the feature extractor increases, trackers lose the real-time advantage of correlation filters. Wang et al. divided the object tracking process into five stages in reference [8] and argued that feature extraction is the most important one. Therefore, within the correlation filter tracking framework, a simple and effective feature needs to be researched. Moreover, model drift of the correlation filters can result in tracking failure.

In this paper, a novel and simple correlation filter tracker based on a two-level filtering edge feature (ECFT) is proposed. In Sect. 2, the correlation filter tracking framework is introduced. In Sect. 3, ECFT is detailed: it extracts a low-complexity edge feature based on two-level filtering for object representation and updates the object model adaptively according to the maximum response value of the correlation filter. Experiments in Sect. 4 report the performance of 7 trackers on 20 challenging sequences in terms of AUC and Precision, and Sect. 5 concludes this paper.

2 Correlation Filters Tracking Framework

In this section, we introduce the basic principle of correlation filters and the circulant matrix used in them. The advantages and disadvantages of correlation filters are also analyzed.

2.1 The Basic Principle of Correlation Filters

The first step of the correlation filter framework is to learn a filter \(h_i\); in the Fourier domain, where \(\hat{\cdot }\) denotes the discrete Fourier transform of a signal, it can be computed as Formula (1).

$$\begin{aligned} \hat{h}_i=\hat{g}_i/\hat{f}_i \end{aligned}$$
(1)

where \(g_i\) follows a two-dimensional Gaussian distribution centered at the position \((x_i,y_i)\), as in Formula (2), and \(f_i\) denotes the input image.

$$\begin{aligned} g_i(x,y)=e^{-\frac{{(x\,-\,x_i)^2}+{(y\,-\,y_i)^2}}{\delta ^2}} \end{aligned}$$
(2)

In the second step of the correlation filter framework, the filter \(h_i\) and the image \(f_i\) yield a correlation response map \(g^{'}_{i}\) through the convolution operation, which is defined as Formula (3).

$$\begin{aligned} g^{'}_{i}=f_i\otimes h_i \end{aligned}$$
(3)

Finally, in the correlation response map \(g^{'}_{i}\), the position of the maximum response value gives the current position of the tracked object.
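The two steps above can be sketched in a few lines of NumPy. The paper's implementation is in MATLAB; this is an illustrative simplification in which the filter is learned from a single frame, and the small constant `eps` is our own addition to avoid division by zero.

```python
import numpy as np

def gaussian_response(shape, center, sigma=2.0):
    """Desired output g_i: a 2-D Gaussian peaked at the object center (Formula (2))."""
    h, w = shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - center[1]) ** 2 + (ys - center[0]) ** 2) / sigma ** 2))

def learn_filter(f, g, eps=1e-4):
    """Learn the filter in the Fourier domain (Formula (1)): element-wise division G / F."""
    return np.fft.fft2(g) / (np.fft.fft2(f) + eps)

def correlate(f, H):
    """Formula (3): response map g'_i = f_i (*) h_i, an element-wise product in Fourier space."""
    return np.real(np.fft.ifft2(np.fft.fft2(f) * H))

# Toy example: a bright 5x5 blob centered at row 12, column 20 of a 32x32 "frame".
frame = np.zeros((32, 32))
frame[10:15, 18:23] = 1.0
g = gaussian_response(frame.shape, center=(12, 20))
H = learn_filter(frame, g)
resp = correlate(frame, H)
peak = np.unravel_index(np.argmax(resp), resp.shape)  # maximum response = object position
# peak is (12, 20): the filter reproduces the Gaussian peak at the object center
```

In practice MOSSE averages over many training pairs and uses a regularized ratio of conjugate products; the single-frame division above only illustrates the frequency-domain idea.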

2.2 Circulant Matrices in the Correlation Filters

Under the correlation filter framework, the circulant matrix is used to accelerate dense sampling.

Let \(x=[x_1,x_2,...,x_n]^T\) be an n-dimensional column vector. An \({n}\times {n}\) circulant matrix \(X^c\) for dense sampling based on x can be constructed by Formula (4).

$$\begin{aligned} X^{c}=\left[ \begin{array}{c} (P^{0}x)^T\\ \vdots \\ (P^{n\,-\,1}x)^T \end{array} \right] \end{aligned}$$
(4)

where P is the permutation matrix, and \(P^{i}\) means that the cyclic shift is applied i times. The matrix P is expressed as Formula (5).

$$\begin{aligned} P=\left[ \begin{array}{ccccc} 0&{}0&{}0&{}\cdots &{}1\\ 1&{}0&{}0&{}\cdots &{}0\\ 0&{}1&{}0&{}\cdots &{}0\\ \vdots &{} &{} &{}\ddots &{} \\ 0&{}0&{}\cdots &{}1&{}0\\ \end{array} \right] \end{aligned}$$
(5)

So \(Px=[x_n,x_1,x_2,...,x_{n\,-\,1}]^T\), and the vector x is the base sample vector of the circulant matrix. C(x) denotes the matrix obtained by applying the cyclic shift operations to x, as in Formula (6).

$$\begin{aligned} X^{c}=C(x)=\left[ \begin{array}{cccc} x_1&{} x_2&{} \cdots x_n\\ x_n&{} x_1&{} \cdots x_{n\,-\,1}\\ \vdots &{}\quad \quad \ddots \\ x_2&{} x_3&{} \cdots x_1\\ \end{array} \right] \end{aligned}$$
(6)
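The cyclic-shift construction of Formulas (4)–(6) can be illustrated with NumPy's `roll` (a Python sketch rather than the paper's MATLAB; the helper name `circulant` is ours):

```python
import numpy as np

def circulant(x):
    """Build X^c = C(x): row i is the base sample vector x cyclically shifted i times."""
    n = len(x)
    return np.stack([np.roll(x, i) for i in range(n)])

x = np.array([1, 2, 3, 4])
Xc = circulant(x)
# Row 0 is x itself; row 1 is P x = [x_n, x_1, ..., x_{n-1}] = [4, 1, 2, 3],
# matching the second row of Formula (6).
```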

All circulant matrices can be diagonalized in the discrete Fourier space; the diagonalization is given by Formula (7).

$$\begin{aligned} X^c=Fdiag({\hat{x}})F^H \end{aligned}$$
(7)

where \(\hat{x}\) is the Discrete Fourier Transform (DFT) of x, and \(F^{H}F=I\). This method avoids the complex process of matrix inversion and decreases the computing complexity.
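Formula (7) can be verified numerically with a unitary DFT matrix (an illustrative NumPy check, not part of the original paper):

```python
import numpy as np

n = 4
x = np.array([1.0, 2.0, 3.0, 4.0])

# Circulant matrix C(x): row i is x cyclically shifted i times (Formula (6)).
Xc = np.stack([np.roll(x, i) for i in range(n)])

# Unitary DFT matrix, so that F^H F = I.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)
x_hat = np.fft.fft(x)  # DFT of the base sample vector

# Formula (7): X^c = F diag(x_hat) F^H.
assert np.allclose(F @ np.diag(x_hat) @ F.conj().T, Xc)
assert np.allclose(F.conj().T @ F, np.eye(n))
```

Because of this factorization, products with \(X^c\) reduce to element-wise operations on \(\hat{x}\), which is what makes the ridge-regression solution fast.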

We summarize the advantages of correlation filters in the following aspects. First, because spatial correlation is computed efficiently as an element-wise product in the Fourier domain, a correlation filter tracker can achieve fast tracking. Second, correlation filters consider the information around an object and are thus more discriminative than an appearance model built on the object alone. Third, correlation filtering can be described as a ridge regression problem, in which the input features of each cyclic shift are regressed to a soft label; consequently, a correlation filter tracker does not need to assign a positive or negative label to each spatially correlated sample. As for the disadvantages, correlation filters are very sensitive to image noise, which can lead to inaccurate tracking in complex environments. In this paper, based on the correlation filter tracking framework, we study a simple and effective feature extraction method and a model updating method.

3 Proposed Method

In this section, the proposed correlation filter tracker based on a two-level filtering edge feature is given, abbreviated as ECFT; it is improved based on KCF [1]. In ECFT, a candidate object is represented by an edge feature vector based on two-level filtering. In the model updating stage, the object model is updated adaptively.

3.1 Edge Feature Based on Two-Level Filtering

The edge feature of a candidate object is extracted based on two-level filtering. In the first-level filtering, to effectively reduce the noise of a frame image, median filtering is performed on the input frame based on a nonlinear smoothing technique. The median filtering value g(x, y) at (x, y) can be calculated by Formula (8).

$$\begin{aligned} g(x,y)=med\{f(x-k,y-l),(k,l\in W)\} \end{aligned}$$
(8)

where f(x, y) is the pixel value at (x, y), and W is a two-dimensional template. In the second-level filtering, to extract the edge characteristic, the median-filtered image is convolved with the Laplace operator as a filter. The Laplace operator is given by Formula (9) and can be calculated by Formula (10) in practice.

$$\begin{aligned} \bigtriangleup ^{2}g=\partial ^{2}g(x,y)/\partial x^{2}+\partial ^{2}g(x,y)/\partial y^2 \end{aligned}$$
(9)
$$\begin{aligned} \bigtriangleup ^{2}g=[g(x+1,y)+g(x-1,y)+g(x,y+1)+g(x,y-1)]-4g(x,y) \end{aligned}$$
(10)

The Laplace filtering value h(x, y) at (x, y) can be calculated by Formula (11).

$$\begin{aligned} h(x,y)=g(x,y)+c(\bigtriangleup ^{2}g) \end{aligned}$$
(11)

where \(c(\bullet )=1\) when the mask centre is positive and \(c(\bullet )=-1\) when the mask centre is negative.
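The two-level filtering can be sketched as follows (a NumPy illustration of Formulas (8)–(11); the 3×3 window, the edge-replicating padding, and the helper names are our own choices, not specified in the paper):

```python
import numpy as np

def median_filter3(f):
    """First-level filtering (Formula (8)): 3x3 median filter, W is the 3x3 template."""
    p = np.pad(f, 1, mode='edge')
    windows = np.stack([p[i:i + f.shape[0], j:j + f.shape[1]]
                        for i in range(3) for j in range(3)])
    return np.median(windows, axis=0)

def laplace(g):
    """Second-level filtering (Formula (10)): discrete Laplacian from 4-neighbour sums."""
    p = np.pad(g, 1, mode='edge')
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) - 4.0 * g

def edge_feature(f, c=-1.0):
    """Formula (11): h = g + c * Laplacian(g).
    The mask of Formula (10) has a negative centre (-4), so c = -1 here."""
    g = median_filter3(f)
    return g + c * laplace(g)

# Toy check on a vertical step edge: the Laplacian responds at the boundary only.
f = np.zeros((6, 6))
f[:, 3:] = 1.0
lap = laplace(median_filter3(f))   # nonzero next to column 3, zero in flat regions
```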

3.2 Adaptive Model Updating

In the common model updating scheme, the object model is always updated with the current tracking result for every frame. However, when the current tracking result is not credible, updating the model can result in model drift. To reduce model drift as much as possible, the object model is updated according to Formula (12).

$$\begin{aligned} C_t=\left\{ \begin{array}{cccc} (1-{\rho })C_{t\,-\,1}+{\rho }{\hat{C_t}}\quad ,a_{n}\,-\,th\le r_t\le a_{n}\,+\,th\\ C_{t-1}\quad \quad ,otherwise \end{array} \right. \end{aligned}$$
(12)

where \(C_t\) is the object model at frame t, \(C_{t\,-\,1}\) is the object model at frame \(t-1\), \(\hat{C_t}\) is the object feature of the current tracking result at frame t, and \(\rho \) is the updating rate. Only when the maximum response value \(r_t\) of the correlation filter at frame t satisfies the adaptive condition \(a_{n}-th\le r_t\le a_{n}+th\) is \(C_t\) updated; otherwise \(C_t\) keeps the object model \(C_{t\,-\,1}\) of frame \(t-1\). Here \(a_n\) is the average of the maximum response values over the n previous frames, as in Formula (13), and th is a threshold for model updating.

$$\begin{aligned} a_{n}=\sum \nolimits _{i\,=\,t\,-\,n}^{t\,-\,1}r_{i}/n \end{aligned}$$
(13)
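The adaptive update of Formulas (12) and (13) amounts to a gated linear interpolation; a minimal Python sketch (the function and argument names are ours, and the default \(\rho\) and th follow the experimental settings in Sect. 4):

```python
import numpy as np

def update_model(C_prev, C_hat, r_t, recent_responses, rho=0.09, th=0.2):
    """Formulas (12)-(13): update the model only when the maximum response r_t lies
    within [a_n - th, a_n + th], where a_n averages the n previous maxima."""
    a_n = np.mean(recent_responses)
    if a_n - th <= r_t <= a_n + th:
        return (1.0 - rho) * C_prev + rho * C_hat   # credible result: interpolate
    return C_prev                                    # not credible: keep C_{t-1}

# Example: a_n = 0.5, so the update window is [0.3, 0.7].
C_prev = np.array([1.0])
C_hat = np.array([2.0])
recent = [0.5] * 10
updated = update_model(C_prev, C_hat, r_t=0.5, recent_responses=recent)  # -> 1.09
frozen = update_model(C_prev, C_hat, r_t=0.9, recent_responses=recent)   # -> 1.0
```

The gate rejects both unusually low responses (likely occlusion or drift) and unusually high ones (often a sudden appearance change), which is why the condition is two-sided.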

4 Experiments

For evaluating the overall performance of ECFT, the ECFT tracker was compared with 6 state-of-the-art trackers on 20 sequences. The 20 sequences were publicly available from the OTB100 dataset [11]. For fair comparison, the initial tracking positions and the ground-truth positions of the 20 sequences were publicly available. The 6 compared trackers were KCF [1], CSK [2], CXT [4], TLD [3], CNT [7], and MOSSE [5]. ECFT was implemented in MATLAB and run on a platform with an Intel Quad-Core i5-4590 3.3 GHz CPU and 4 GB memory. The median filter was a smooth spatial filter. The updating rate \(\rho \) was set to 0.09, the 10 previous response values were used for calculating \(a_n\), and the threshold th was set to 0.2. The running speed of ECFT was approximately 150 frames per second. The comparisons covered both quantitative evaluation and visual evaluation. The experimental results showed that ECFT performed favorably against the other 6 trackers on 20 challenging sequences in terms of accuracy and robustness.

4.1 Quantitative Evaluation

We performed the quantitative evaluations with 2 evaluation metrics: AUC and Precision [11]. An object in a frame was successfully tracked if the overlap score \(s=area(ROI_{T}\cap ROI_{G})/area(ROI_{T}\cup ROI_{G})\) was not less than a given threshold t, where \(ROI_T\) and \(ROI_G\) were respectively the tracking bounding box and the ground-truth bounding box. AUC was the area under the curve of a success plot. Precision (20 px) was the precision when the center location errors were less than 20 pixels. Fig. 1 shows the success plots of OPE and the precision plots of OPE for the 7 trackers on 20 sequences. From Fig. 1, the AUC and Precision of ECFT were respectively 0.712 and 0.930, better than those of the other 6 trackers.
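The overlap score used here is the standard intersection-over-union; for boxes given as (x, y, w, h) it can be computed as follows (an illustrative sketch, not the benchmark's own code):

```python
def overlap_score(box_t, box_g):
    """IoU between the tracking box and the ground-truth box,
    each given as (x, y, w, h): area(T ∩ G) / area(T ∪ G)."""
    tx, ty, tw, th = box_t
    gx, gy, gw, gh = box_g
    ix = max(0.0, min(tx + tw, gx + gw) - max(tx, gx))   # intersection width
    iy = max(0.0, min(ty + th, gy + gh) - max(ty, gy))   # intersection height
    inter = ix * iy
    union = tw * th + gw * gh - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes offset by (1, 1): intersection 1, union 7, so IoU = 1/7.
s = overlap_score((0, 0, 2, 2), (1, 1, 2, 2))
```

Sweeping the success threshold t from 0 to 1 and plotting the fraction of frames with score at least t gives the success plot whose area is the AUC.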

Fig. 1. Success plots and precision plots

Tables 1 and 2 respectively show the evaluation results in terms of AUC and Precision for the 7 trackers on 20 sequences. From Table 1, ECFT had the best AUC value on 10 sequences, while KCF and CNT each had the best AUC value on only 3 sequences. Moreover, ECFT had the best overall AUC on the 20 sequences. From Table 2, ECFT had the best overall Precision on the 20 sequences, with KCF ranking second.

We note that, among KCF, CSK and MOSSE, KCF has the highest performance and MOSSE the lowest, in part due to the complexity of their features. The proposed ECFT improves performance significantly further.

The speed of a tracker is also critical in the performance comparison. The running speed of ECFT was approximately 150 frames per second, about the same as KCF. The running speed of TLD was approximately 20 frames per second, and that of CNT was only approximately 1.5 frames per second.

Table 1. AUC. Highlighted fonts indicate the best performances
Table 2. Precision. Highlighted fonts indicate the best performances

4.2 Visual Evaluation

We conducted extensive experiments and list the positioning results of the seven trackers on different videos. Screenshots of some sample tracking results on several sequences are shown in Fig. 2, covering a variety of environments. In the experimental results, our tracker performed well especially when the other trackers drifted. For example, in the Board sequence (e.g., \(\#0036\), \(\#0069\), \(\#0119\), \(\#0276\) and \(\#0445\)), the Fish sequence (e.g., \(\#0027\), \(\#0044\), \(\#0084\), \(\#0136\), \(\#0161\) and \(\#0470\)) and the Jumping sequence (e.g., \(\#0099\), \(\#0158\), \(\#0274\), \(\#0290\), \(\#0307\) and \(\#0313\)), our tracker was still able to track continuously thanks to updating the object model adaptively. In the Sylvester sequence, all trackers tracked the object well before frame \(\#0716\); afterwards (e.g., \(\#1148\), \(\#1259\), \(\#1276\), \(\#1317\) and \(\#1345\)), the models drifted from the tracked object, and KCF, CXT, CNT and MOSSE lost it. In the Singer2 sequence (e.g., \(\#0085\), \(\#0150\), \(\#0246\), \(\#0349\) and \(\#0366\)), the Skater2 sequence (e.g., \(\#0247\), \(\#0337\), \(\#0406\), \(\#0415\) and \(\#0435\)), the Coke sequence (e.g., \(\#0134\), \(\#0165\), \(\#0222\), \(\#0254\) and \(\#0261\)) and the Boy sequence (e.g., \(\#0402\), \(\#0417\), \(\#0551\), \(\#0557\) and \(\#0593\)), only ECFT overcame the model drift well and kept a continuous tracking. In the Coke sequence, every tracker tracked accurately before frame \(\#0134\); after about frame \(\#0165\), the other trackers gradually drifted away from the object, with MOSSE, CNT and CSK starting to drift early, while our algorithm maintained very high tracking accuracy.

In reference [14], Ma et al. proposed an online random fern classifier to re-detect objects in case of tracking failure, realizing long-term correlation tracking. In our method, the adaptive model updating technique can also realize long-term tracking, and the proposed feature is well adapted to correlation filters. For example, in the Jumping sequence, the KCF tracker with the HOG feature fails after frame \(\#0158\). The CNT tracker with a deep feature not only processes very slowly, but its success rate on the Jumping sequence is also very low. In contrast, as shown in Fig. 2, our tracker keeps tracking with high accuracy all the time.

Fig. 2. Screenshots for some sample tracking results

5 Conclusions

In this paper, we proposed a correlation filter tracker based on a two-level filtering edge feature (ECFT), improved from KCF within the correlation filter framework. For feature extraction, a candidate object is represented by an edge feature vector based on two-level filtering. For model updating, the object model is updated adaptively only when the current tracking result is credible. Comparative experiments with 7 trackers on 20 challenging sequences showed that the proposed ECFT tracker performs well in terms of AUC and Precision.