Learning Rotation Adaptive Correlation Filters
in Robust Visual Object Tracking
Litu Rout1 , Priya Mariam Raju1 , Deepak Mishra1 , and Rama Krishna Sai
Subrahmanyam Gorthi2
arXiv:1906.01551v1 [cs.CV] 4 Jun 2019
1
Department of Avionics, Indian Institute of Space Science and Technology
Thiruvananthapuram, Kerala, India - 695 547
liturout1997@gmail.com, priyamariyam123@gmail.com,
deepak.mishra@iist.ac.in
2
Department of Electrical Engineering, Indian Institute of Technology
Tirupati, Andhra Pradesh, India - 517 506
rkg@iittp.ac.in
Abstract. Visual object tracking is one of the major challenges in the
field of computer vision. Correlation Filter (CF) trackers are one of the
most widely used categories in tracking. Though numerous tracking algorithms based on CFs are available today, most of them fail to efficiently
detect the object in an unconstrained environment with dynamically
changing object appearance. In order to tackle such challenges, the existing strategies often rely on a particular set of algorithms. Here, we
propose a robust framework that offers the provision to incorporate illumination and rotation invariance in the standard Discriminative Correlation Filter (DCF) formulation. We also supervise the detection stage of
DCF trackers by eliminating false positives in the convolution response
map. Further, we demonstrate the impact of displacement consistency on
CF trackers. The generality and efficiency of the proposed framework are
illustrated by integrating our contributions into two state-of-the-art CF
trackers: SRDCF and ECO. As per the comprehensive experiments on
the VOT2016 dataset, our top trackers show substantial improvement of
13.04% and 6.41% in robustness, 11.4% and 1.71% in Average Expected
Overlap (AEO) over the baseline SRDCF and ECO, respectively. 3
Keywords: Rotation Adaptiveness · False Positive Elimination · Displacement Consistency.
1 Introduction
Visual object tracking finds applications in diverse fields like traffic monitoring,
surveillance systems, human computer interaction etc. Though the same object
is being tracked throughout a given video sequence, the conditions under which
the video is captured may vary due to changes in the environment, object, or
camera. Illumination variations, object deformations, object rotations etc. are
3
The final authenticated version is available online here.
various challenges that occur due to changes in the aforementioned factors. A
good tracking algorithm should continue tracking a desired object and its performance should remain unaffected under all these conditions. Most of the existing
trackers can be classified as either generative or discriminative. The generative
trackers [1,4,17,31,15] use the object information alone to search for the most
probable region in an image that matches the initially specified target object.
On the other hand, the discriminative trackers [28,2,13,18,7,14,11] use both the
object and background information to learn a classifier that discriminates the
object from its background. The discriminative trackers, to a large extent, make
use of CFs as classifiers. The main advantage of CFs is that correlation can be
efficiently performed in the Fourier domain as simple element-wise multiplication, as follows
from the convolution theorem. For this reason, CF trackers are learned and all computations are performed in the Fourier domain, which leads to a drastic
reduction in computational complexity [13]. Thus, the CF trackers have gained
popularity in the community because of their strong discriminative power, which
emerges due to the implicit inclusion of a large number of negative samples in training.
Despite all the advancements in CF tracking, most of these algorithms are
not robust enough to object deformations, rotations, and illumination changes.
These limitations are due to the inherent scarcity of robust training features
that can be derived from the preceding frames. This restricts the ability of the
learned appearance model to adapt to changes in the target object. Therefore, we
propose rotation adaptiveness and illumination correction schemes in order to
extract sophisticated features from previous frames that help in learning a robust
appearance model. The rotation adaptiveness, to some extent, also tackles the
issue of object deformation owing to the robustness of the representation.
The main contributions of this paper are as follows. (a) An Illumination
Correction filter (IC) (Sec. 3) is introduced in the tracking framework that eliminates the adverse effects of variable illumination on feature extraction. (b) We
propose an approach to incorporate rotation adaptiveness (Sec. 4) in the standard
DCF by optimizing across the orientations (Sec. 4.4) of the target object in the
detector stage. The orientation optimization helps in extracting robust features
from properly oriented bounding boxes, unlike most state-of-the-art trackers that
rely on axis-aligned bounding boxes. (c) Building on this, we supervise the sub-grid localization cost function (Sec. 4.4) in the detector stage of DCF trackers.
This cost function is intended to eliminate false positives during detection.
(d) Further, we show the impact of enhancing smoothness through displacement
correction (Sec. 5), and demonstrate all these contributions on two popular CF
trackers: Spatially Regularized Discriminative Correlation Filters (SRDCF) [7]
and Efficient Convolution Operators (ECO) [5].
Though we have demonstrated the importance of our contributions by integrating them with SRDCF and ECO, the proposed framework is generic and can
be readily integrated with other state-of-the-art correlation filter trackers. The rest
of the paper is structured as follows. First, we discuss the previous works
related to ours (Sec. 2), followed by the illumination correction filter (Sec. 3),
the detailed description of rotation adaptive correlation filters (Sec. 4), and displacement consistency (Sec. 5). Thereafter, we provide experimental evidence
(Sec. 6) to validate our contributions. In Fig. 1, we show the pipeline of our
overall architecture.

Fig. 1. As the pipeline indicates, both train ($k$-th) and test ($(k+1)$-th) frames undergo
illumination correction (IC) prior to feature extraction. The training features are then
used to learn the parameters of the Rotation Adaptive Correlation Filter (RACF). During
the detection stage, each candidate patch passes through a coarse orientation space from
which the orientation optimizer picks a seed orientation. The seed orientation is usually
the object's immediate previous orientation, which is then used by Newton's iterative
optimization scheme as the initial point to determine the optimal orientation for the $(k+1)$-th frame.
The optimizer maximizes the total energy content in the False Positive Eliminated
(FPE) convolutional response map. The response map corresponds to the winning
scale in the scale pyramid. Note that the optimal orientation in the first frame ($\theta_1$) is
assumed to be $0^\circ$ without loss of generality. Thereafter, the optimal orientations in the
subsequent frames are determined through a deterministic optimization strategy.
2 Related Works
Numerous variants of the CF tracker have been proposed by adding constraints
to the basic filter design, and by utilizing different feature representations of the
target object. Initial extensions start with the KCF tracker [13] which uses a
kernel trick to perform efficient computations in the Fourier domain. The Structural CF tracker [18] uses a part based technique in which each part of the object
is independently tracked using separate CFs. Danelljan et al. [7] proposed the
SRDCF tracker which uses a spatial regularizer to weigh the CF coefficients
in order to emphasize the target locations and suppress the background information. Thus, the SRDCF tracker includes a larger set of negative patches in
training, leading to a much better discriminative model.
The earlier trackers directly used the image intensities to represent the target
object. Later on, feature representations such as color transformations [3,23,24,13],
Colornames [9] etc. were used in CF trackers. Due to significant advancement of
deep neural networks in object detection, features from these networks have also
found applications in tracking, giving rise to substantial gain in performance. The
deep trackers, such as DeepSRDCF [6], MDNet [22], and TCNN [21], clearly
indicate the distinctive feature extraction ability of deep networks. The HCF
tracker [19] exploits both semantic and fine-grained details learned from a pretrained Convolutional Neural Network (CNN). It uses a multi-level correlation
map to locate the target object. The CCOT tracker [8] uses DeepSRDCF [6]
as the baseline and incorporates an interpolation technique to learn the filter
in continuous domain with multi-resolution feature maps. The ECO tracker [5]
reduces the computational cost of CCOT by using a factorized convolution operator that acts as a dimensionality reduction operator. ECO also updates the
features and filters after a predefined number of frames, instead of updating after each frame. This eliminates redundancy and over-fitting to recently observed
samples. As a result, the deep feature based ECO tracker does reasonably well
on diverse datasets outperforming other CF trackers by a large margin.
Among rotation adaptive trackers, Zhang et al. propose an exhaustive template search in the joint scale and spatial space to determine the target location,
and learn a rotation template by transforming the training samples to the Log-Polar domain, as explained in RAJSSC [30]. We learn a rotation adaptive filter in
the Cartesian domain by incorporating orientation in the standard DCF, rather than relying on
exhaustive search. In contrast to a recent rotation adaptive scheme, SiameseFC-DSR, proposed by Rout et al. [26], we incorporate rotation adaptiveness directly in the standard DCF formulation by performing a pseudo optimization on
a coarse grid in the orientation space, leading to robust training of the CF. Qianyun
et al. [10] use a multi-oriented Circulant Structure with Kernel (CSK) tracker
to get multiple translation models each dominating one orientation. Each translation model is built upon the KCF tracker. The model with highest response
is picked to estimate the object’s location. The main difference is that we do
not learn multiple translation models at various orientations, as proposed in
multi-oriented CSK, in order to reduce computational cost. In contrast, we optimize the total energy content in convolution responses at the detector stage
with respect to object’s orientation. The multi-channel correlation filter is then
learned from a set of training samples which are properly oriented through a
deterministic approach. Note that our training process requires a single model.
3 Illumination Correction (IC) Filter
Illumination changes occur in a video due to dynamically changing environmental conditions, such as waving tree branches, low contrast regions, shadows of
other objects etc. This variable illumination gives rise to low frequency interference, which is one of the prominent causes of disturbing the object’s appearance.
As the appearance of an object changes dramatically under different lighting
conditions, the learned model fails to detect the object, leading to reduction
in accuracy and robustness. Also, we may sometimes be interested in high frequency variations, such as edges, which are part of the dominant features in
representing an object. Though these issues are investigated extensively in the image processing community, to our knowledge, they have not
received explicit attention, even in the state-of-the-art trackers. Though deep features
have been shown to be fairly invariant to random fluctuations in the input image, such as
blur, white noise, illumination variation etc., the experimental results in Sec. 6
show that trackers with deep features also fail to track the object under these
challenges. Therefore, we introduce an Illumination Correction filter (IC)
in the tracking paradigm in order to tackle the aforementioned issues up to
some degree without affecting the usual tracked scenarios. At first, we employ
a standard contrast stretching mechanism [25] to adjust the intensities of each
frame. The contrast stretched image is then subjected to unsharp masking [25],
a popular image enhancement technique, in order to suppress the low frequency
interference and enhance high frequency variations. To our surprise, the performance of
the baseline trackers improves by a considerable amount just by enhancing the
input images, as given in Sec. 6. This validates the fact that the robust feature extractors still lack high quality visual inputs, which otherwise can lead to
substantial gain in performance.
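The two-step enhancement described above can be sketched as follows. This is a minimal NumPy illustration for a grayscale frame, not the implementation used in our experiments: the function names, the Gaussian blur, `sigma`, `amount`, and the interpretation of `threshold` as a fraction of the largest detail magnitude are all assumptions made for the example.

```python
import numpy as np

def contrast_stretch(img, out_min=0.0, out_max=255.0):
    """Linearly map the intensity range of img onto [out_min, out_max]."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:  # flat image: nothing to stretch
        return np.full_like(img, out_min)
    return (img - lo) / (hi - lo) * (out_max - out_min) + out_min

def gaussian_blur(img, sigma=1.0):
    """Separable Gaussian low-pass filter with edge replication."""
    radius = max(1, int(round(3 * sigma)))
    t = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (t / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(np.convolve, 1, padded, k, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, k, mode="valid")

def unsharp_mask(img, sigma=1.0, amount=0.8, threshold=0.5):
    """Add back the high-frequency detail (img - blurred) where it is strong.

    Assumption: `threshold` is a fraction of the largest detail magnitude;
    weaker detail is left untouched.
    """
    detail = img - gaussian_blur(img, sigma)
    peak = np.abs(detail).max()
    if peak == 0:
        return img.copy()
    mask = np.abs(detail) >= threshold * peak
    return img + amount * detail * mask

def illumination_correct(frame):
    """Contrast stretching followed by unsharp masking, clipped to [0, 255]."""
    stretched = contrast_stretch(frame)
    return np.clip(unsharp_mask(stretched), 0.0, 255.0)
```

In a tracking loop, such a function would be applied to every frame before feature extraction, so that both training and test samples see the same enhancement.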
4 Rotation Adaptive Correlation Filters (RACF)
Here, we elaborate the training and detection phase of rotation adaptive correlation filters in light of standard SRDCF. For the ease of understanding and
clearly distinguishing our contributions, we have used identical notations as in
SRDCF [7]. First, we explain standard SRDCF training and detection process,
and then, we integrate rotation adaptiveness with false positive elimination into
the optimization framework of CF, unlike heuristic template search [30,26].
4.1 SRDCF Training and Detection
In the standard DCF formulation, a multi-channel correlation filter $f$ is learned
from a set of training samples $\{(x_k, y_k)\}_{k=1}^{t}$. Each training sample $x_k$ has a
$d$-dimensional feature map, which is extracted from an image region. All the
samples are assumed to be of identical spatial resolution $M \times N$. Thus, we have
a $d$-dimensional feature vector $x_k(m, n) \in \mathbb{R}^d$ at each spatial location $(m, n) \in
\Omega := \{0, \ldots, M-1\} \times \{0, \ldots, N-1\}$. We also denote feature layer $l \in \{1, \ldots, d\}$
of $x_k$ by $x_k^l$. The target of each training sample $x_k$ is denoted as $y_k$, which is a
scalar valued function over the domain $\Omega$. The correlation filter $f$ has a stack
of $d$ layers, each of which is an $M \times N$ convolution filter $f^l$. The response of the
convolution filter $f$ on an $M \times N$ sample $x$ is computed by,
\[ S_f(x) = \sum_{l=1}^{d} x^l * f^l . \tag{1} \]
Here, $*$ represents circular convolution. The desired filter $f$ is obtained by minimizing the $L^2$-error between the convolution response $S_f(x_k)$ of training sample
$x_k$ and the corresponding label $y_k$ with a more general Tikhonov regularizer
$w : \Omega \to \mathbb{R}$,
\[ \varepsilon(f) = \sum_{k=1}^{t} \alpha_k \left\| S_f(x_k) - y_k \right\|^2 + \sum_{l=1}^{d} \left\| \frac{w}{MN} \cdot f^l \right\|^2 . \tag{2} \]
Here, $\cdot$ denotes point-wise multiplication. With the help of Parseval's theorem,
the filter $f$ can be equivalently computed by minimizing equation (2) in the
Fourier domain with respect to the Discrete Fourier Transform (DFT) coefficients
$\hat{f}$,
\[ \hat{\varepsilon}(\hat{f}) = \sum_{k=1}^{t} \alpha_k \left\| \sum_{l=1}^{d} \hat{x}_k^l \cdot \hat{f}^l - \hat{y}_k \right\|^2 + \sum_{l=1}^{d} \left\| \frac{\hat{w}}{MN} * \hat{f}^l \right\|^2 . \tag{3} \]
Here, $\hat{\ }$ denotes the DFT of a function. After learning the DFT coefficients $\hat{f}$ of
filter $f$, it is typically applied in a sliding-window-like manner on all cyclic shifts
of a test sample $z$. Let $\hat{s} := \mathcal{F}\{S_f(z)\} = \sum_{l=1}^{d} \hat{z}^l \cdot \hat{f}^l$ denote the DFT ($\mathcal{F}$) of the
convolution response $S_f(z)$ evaluated at test sample $z$. The convolution response
$s(u, v)$ at a continuous location $(u, v) \in [0, M) \times [0, N)$ is interpolated by,
\[ s(u, v) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \hat{s}(m, n)\, e^{i 2\pi \left( \frac{m}{M} u + \frac{n}{N} v \right)} . \tag{4} \]
Here, $i$ denotes the imaginary unit. The maximal sub-grid location $(u^*, v^*)$ is
then computed by optimizing $\arg\max_{(u,v) \in [0,M) \times [0,N)} s(u, v)$ using Newton's
method, starting at the maximal grid-level score $(u^{(0)}, v^{(0)}) \in \Omega$. In a nutshell,
the standard SRDCF adapts translation invariance efficiently by exploiting the
periodic assumption with spatial regularization, but this does not learn rotation adaptiveness inherently. Therefore, we propose to extend the discriminative
power of SRDCF by learning rotation adaptive filters through a deterministic
optimization procedure.
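To make the detection step concrete, the following NumPy sketch evaluates the multi-channel response of equation (1) in the Fourier domain and interpolates it at a continuous location as in equation (4). It is an illustration under stated assumptions: the function names are hypothetical, and signed frequency indices (via `np.fft.fftfreq`) are used in place of $m = 0, \ldots, M-1$ so that the interpolated value stays smooth between grid points. At integer locations both conventions give the same value.

```python
import numpy as np

def response_dft(z, f):
    """DFT of S_f(z) = sum_l z^l * f^l (circular convolution over channels).

    z, f: real arrays of shape (d, M, N).
    """
    # fft2 transforms the last two axes; channels are summed afterwards.
    return np.sum(np.fft.fft2(z) * np.fft.fft2(f), axis=0)

def interpolate_response(s_hat, u, v):
    """Evaluate the response at a continuous location (u, v), cf. Eq. (4)."""
    M, N = s_hat.shape
    m = np.fft.fftfreq(M) * M  # signed frequency indices (assumption)
    n = np.fft.fftfreq(N) * N
    phase = np.exp(2j * np.pi * (m[:, None] * u / M + n[None, :] * v / N))
    return np.real(np.sum(s_hat * phase)) / (M * N)

# Small demonstration with random features and filter coefficients.
rng = np.random.default_rng(0)
z_demo = rng.standard_normal((3, 8, 6))
f_demo = rng.standard_normal((3, 8, 6))
s_hat_demo = response_dft(z_demo, f_demo)
grid_response = np.real(np.fft.ifft2(s_hat_demo))     # s on the integer grid
between_pixels = interpolate_response(s_hat_demo, 3.4, 2.7)
```

Newton's method would then refine the location starting from the best grid point, using derivatives of the same trigonometric sum.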
4.2 RACF Training and Detection
First, we incorporate rotation adaptiveness in spatially regularized correlation filters by learning from appropriately oriented training samples. Similar to SRDCF,
we solve the resulting optimization problem in the Fourier domain, by employing
a deterministic orientation in each training sample. Let $\theta_k$ denote the orientation corresponding to $x_k$. Without loss of generality, it can be assumed that
$\theta_k = 0, \ \forall k \le 1$. The training sample $x_k$ undergoes rotation $\theta_k$ by,
\[ x_k^{\theta}(m, n)\big|_{(m,n) \in \Omega} = \begin{cases} x_k(m', n'), & (m', n') \in \Omega \\ 0, & \text{elsewhere} \end{cases} \tag{5} \]
where $(m, n)$ and $(m', n')$ are related by,
\[ \begin{bmatrix} n' \\ m' \end{bmatrix} = \begin{bmatrix} \cos(\theta_k) & -\sin(\theta_k) \\ \sin(\theta_k) & \cos(\theta_k) \end{bmatrix} \begin{bmatrix} n \\ m \end{bmatrix} . \tag{6} \]
In other words, $x_k^{\theta}$ is obtained by rotating $x_k$ anti-clockwise by an angle $\theta_k$ in
the Euclidean space and cropping it to the same size $M \times N$ as $x_k$. In order to avoid wrong
gradient estimation due to zero padding, we use a common solution that windows
the rotated image patch with a cosine window. This does not disturb the structure
of the object, assuming that the patch size is larger than the target object. This
is different from standard SRDCF, in the sense that we learn the multi-channel
correlation filter $f$ from properly oriented training samples $\{(x_k^{\theta}, y_k)\}_{k=1}^{t}$. The
training stage of rotation adaptive filters is explained in the following Sec. 4.3.
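The rotation of a training patch in equations (5) and (6), followed by cosine windowing, can be sketched as below. This is an illustrative example only: the names `rotate_patch` and `cosine_window` are hypothetical, the rotation is taken about the patch centre, and nearest-neighbour sampling is used, none of which is prescribed by the formulation above.

```python
import numpy as np

def rotate_patch(x, theta_deg):
    """Resample x (shape (M, N) or (M, N, d)) under the rotation of Eq. (6).

    Source coordinates that fall outside the patch are filled with zeros,
    as in Eq. (5). Assumptions: rotation about the patch centre and
    nearest-neighbour sampling.
    """
    M, N = x.shape[:2]
    theta = np.deg2rad(theta_deg)
    c, s = np.cos(theta), np.sin(theta)
    cm, cn = (M - 1) / 2.0, (N - 1) / 2.0
    m, n = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    dm, dn = m - cm, n - cn
    # [n'; m'] = R(theta) [n; m], expressed relative to the centre.
    n_src = c * dn - s * dm + cn
    m_src = s * dn + c * dm + cm
    mi = np.rint(m_src).astype(int)
    ni = np.rint(n_src).astype(int)
    inside = (mi >= 0) & (mi < M) & (ni >= 0) & (ni < N)
    out = np.zeros_like(x)
    out[inside] = x[mi[inside], ni[inside]]
    return out

def cosine_window(M, N):
    """Separable Hann window that tapers the patch borders to zero."""
    return np.hanning(M)[:, None] * np.hanning(N)[None, :]

# Rotate a toy single-channel patch by 5 degrees and suppress the zero-filled
# corners with the cosine window before feature extraction.
patch = np.arange(49, dtype=np.float64).reshape(7, 7)
oriented = rotate_patch(patch, 5.0) * cosine_window(7, 7)
```

With this convention a rotation by $90^\circ$ of a square patch coincides with `np.rot90(x, 1)`, which is a convenient sanity check for the sign of the angle.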
4.3 Training: Learning RACF parameters
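The convolution response $S_f(x_k^{\theta})$ of the rotated training sample $x_k^{\theta} \in \mathbb{R}^d$ is
computed by,
\[ S_f(x_k^{\theta}) = \sum_{l=1}^{d} x_k^{\theta l} * f^l . \tag{7} \]
After incorporating rotation into the DCF formulation, the resulting cost
function is expressed as,
\[ \varepsilon_{\theta}(f) = \sum_{k=1}^{t} \alpha_k \left\| S_f(x_k^{\theta}) - y_k \right\|^2 + \sum_{l=1}^{d} \left\| \frac{w}{MN} \cdot f^l \right\|^2 . \tag{8} \]
Similar to SRDCF, we perform the Gauss-Seidel iterative optimization in the
Fourier domain by computing the DFT of equation (8) as,
\[ \hat{\varepsilon}_{\theta}(\hat{f}) = \sum_{k=1}^{t} \alpha_k \left\| \sum_{l=1}^{d} \hat{x}_k^{\theta l} \cdot \hat{f}^l - \hat{y}_k \right\|^2 + \sum_{l=1}^{d} \left\| \frac{\hat{w}}{MN} * \hat{f}^l \right\|^2 . \tag{9} \]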
The equation (9) is vectorized and simplified further by using fully vectorized
real-valued filter, as implemented in the standard SRDCF [7]. The aforementioned training procedure is feasible, provided we obtain the object’s orientation
corresponding to all the training samples beforehand. In the following Sec. 4.4,
we propose an approach to localize the target object and detect its orientation
by optimizing a newly formulated objective function.
4.4 Detection: Localization of the Target Object
At the detection stage, the correlation filter $f$ learned from $t$ training samples is
utilized to compute the convolution response of a test sample $z$ obtained from the
$(t+1)$-th frame, which is then optimized to locate the object in that $(t+1)$-th
frame. For example, at $t = 1$, we learn the coefficients of $f$ from $(x_{k=1}^{\theta=0^\circ}, y_{k=1})$ and
detect the object location, $(u_{k+1}^*, v_{k+1}^*)$, and orientation, $\theta_{k+1}$, in the $(t+1)$-th,
i.e., 2nd frame. For efficient detection of scale, we construct multiple resolution
test samples $\{z_r\}_{r \in \{\lfloor \frac{1-S}{2} \rfloor, \ldots, \lfloor \frac{S-1}{2} \rfloor\}}$ by resizing the image at various scales $a^r$, as
implemented in SRDCF [7]. Here, $S$ and $a$ denote the number of scales and the scale
increment factor, respectively. Next, we discuss the false positive elimination
scheme, which offers notable gain in the overall performance.

Fig. 2. Sample frames from the sequence glove of VOT2016 [16]. The blue, green,
and red rectangles show the groundtruth, the output of ECO, and the output of F-ECO (with FPE),
respectively. Convolution response of the shaded (red) region obtained directly (a) without,
and (b) with optimization through false positive elimination.
False Positive Elimination (FPE) As per our extensive experiments, we
report that the convolution response map of a test sample may sometimes contain
multiple peaks with equal detection scores. This situation usually arises when the
test sample is constructed from an image region that consists of multiple objects
with representations similar to that of the target object. In fact, this issue can occur in many
real world scenarios, such as the glove, leaves, and rabbit sequences from the VOT2016
dataset [16]. Therefore, we propose to maximize $\frac{s(u, v)}{\|(u - u_k^*, v - v_k^*)\|}$, unlike SRDCF,
which focuses on maximizing $s(u, v)$ alone. Here, $(u_k^*, v_k^*)$ denotes the sub-grid
level target location in the $k$-th frame. Thereby, we intend to detect the object that
has a high response score as well as minimum deviation from the previous location.
Arguably, this hypothesis is justified by the fact that it is less likely for an object
to undergo drastic deviation from its immediate past location. Due to identical
representation in feature space, both the gloves have equal response scores, as
shown in Fig. 2(a). However, the FPE scheme mitigates this issue, as shown in
Fig. 2(b), by maximizing the response score subject to minimum deviation from the
previous centroid, which in turn creates a very distinct decision boundary. Note
that the Gauss-Seidel optimization of total energy content through FPE directly
yields this scalar valued response (Fig. 2(b)) without any post-processing.
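The effect of the distance-normalized score on a response map with two equal peaks can be illustrated with a short NumPy sketch. The function name `fpe_response` and the guard `min_dist`, which avoids division by zero exactly at the previous location, are assumptions introduced for this example and are not part of the formulation above.

```python
import numpy as np

def fpe_response(s, prev_loc, min_dist=1.0):
    """Divide the response s(u, v) by the distance to the previous location.

    s: (M, N) real response map; prev_loc: (u_k, v_k) from the previous frame.
    min_dist is a hypothetical lower bound on the distance (assumption).
    """
    M, N = s.shape
    u, v = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    dist = np.hypot(u - prev_loc[0], v - prev_loc[1])
    return s / np.maximum(dist, min_dist)

# Two identical peaks, e.g. two gloves with the same appearance.
s_demo = np.zeros((32, 32))
s_demo[5, 5] = 1.0
s_demo[20, 20] = 1.0

# The raw response cannot separate the peaks; the normalized one prefers
# the peak closest to the previous target location (19, 21).
raw_peak = np.unravel_index(np.argmax(s_demo), s_demo.shape)
fpe_peak = np.unravel_index(
    np.argmax(fpe_response(s_demo, (19, 21))), s_demo.shape
)
```

On this toy map the raw maximum is ambiguous, since `np.argmax` simply returns the first of the tied peaks, whereas the normalized map has a unique maximum near the previous location.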
Detection of Orientation (DoO) Here, we elaborate the detection mechanism of the object's orientation in the test sample. Let $\hat{s}_{\theta} := \mathcal{F}\{S_f(z^{\theta})\} = \sum_{l=1}^{d} \hat{z}^{\theta l} \cdot
\hat{f}^l$ denote the DFT ($\mathcal{F}\{\cdot\}$) of the convolution response $S_f(z^{\theta})$, evaluated at orientation $\theta$ of test sample $z$. Similar to equation (4), we compute $s_{\theta}(u, v)$ on a coarse
grid $(u, v) \in \Omega$ by,
\[ s_{\theta}(u, v) = \frac{1}{MN} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} \hat{s}_{\theta}(m, n)\, e^{i 2\pi \left( \frac{m}{M} u + \frac{n}{N} v \right)} . \tag{10} \]
Then, the aim is to find the orientation that maximizes the total energy content in
the convolution response map by,
\[ \theta_{k+1} = \arg\max_{\theta \in \Phi} \left\{ \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} \left( \frac{s_{\theta}(u, v)}{\|(u - u_k^*, v - v_k^*)\|} \right)^2 \right\} . \tag{11} \]
Here, $\Phi := \{\theta_k \pm a\delta\}$, where $a = 0, 1, 2, \ldots, A$. Thus, the orientation space $\Phi$
consists of $(2A + 1)$ rotations with step size $\delta$. In our experiments,
we have used $\delta = 5^\circ$ and $A = 2$, based on the fact that an object's orientation
is less likely to change drastically between consecutive frames. Nevertheless, the
orientation can be further optimized by Newton's approach, or any suitable optimization algorithm, starting at the optimal coarse orientation $\theta_{k+1}$. Also, a suitable
combination of $A$ and $\delta$ can be chosen for searching exhaustively in $\Phi$, but at the
expense of time complexity. Next, we incorporate the FPE and DoO techniques
in the Fast Sub-grid Detection method of the standard SRDCF (Sec. 4.4) formulation.
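The coarse orientation search over $\Phi$ can be sketched as follows. The sketch assumes a user-supplied callable `response_fn(theta)` that returns the real response map $s_{\theta}$ for a test sample rotated by `theta` degrees; this callable, the function names, and the guard `min_dist` against division by zero are hypothetical and only serve the illustration.

```python
import numpy as np

def fpe_weighted(s, prev_loc, min_dist=1.0):
    """Response divided by the distance to the previous target location."""
    M, N = s.shape
    u, v = np.meshgrid(np.arange(M), np.arange(N), indexing="ij")
    dist = np.hypot(u - prev_loc[0], v - prev_loc[1])
    return s / np.maximum(dist, min_dist)  # min_dist is an assumption

def candidate_orientations(theta_prev, delta=5.0, A=2):
    """Orientation space Phi = {theta_prev + a * delta, a = -A, ..., A}."""
    return [theta_prev + a * delta for a in range(-A, A + 1)]

def detect_orientation(response_fn, theta_prev, prev_loc, delta=5.0, A=2):
    """Pick the orientation in Phi with the largest total energy, cf. Eq. (11)."""
    best_theta, best_energy = theta_prev, -np.inf
    for theta in candidate_orientations(theta_prev, delta, A):
        s_theta = response_fn(theta)  # (M, N) response at this orientation
        energy = np.sum(fpe_weighted(s_theta, prev_loc) ** 2)
        if energy > best_energy:
            best_theta, best_energy = theta, energy
    return best_theta
```

With $\delta = 5^\circ$ and $A = 2$, only five response maps are evaluated per frame, which keeps the additional cost of the search small.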
Fast Sub-grid Detection We apply Newton's optimization strategy, as
in SRDCF, for finding the sub-grid location that maximizes the detection score.
However, we incorporate the false positive elimination and optimal orientation in
the standard SRDCF sub-grid detection. Thus, we compute the sub-grid location
that corresponds to the maximum detection score by,
\[ (u_{k+1}^*, v_{k+1}^*) = \arg\max_{(u,v) \in [0,M) \times [0,N)} \frac{s_{\theta_{k+1}}(u, v)}{\|(u - u_k^*, v - v_k^*)\|} , \tag{12} \]
starting at $(u^{(0)}, v^{(0)}) \in \Omega$, such that $\frac{s_{\theta_{k+1}}(u^{(0)}, v^{(0)})}{\|(u^{(0)} - u_k^*, v^{(0)} - v_k^*)\|}$ is maximal.
5 Displacement Consistency
Motivated by the displacement consistency techniques proposed in [26], we
enhance the degree of smoothness imposed on the movement variables, such as
speed and angular displacement. We update the sub-grid location $(u_{k+1}^*, v_{k+1}^*)$
obtained from equation (12) by,
\[ \begin{aligned} (u_{k+1}^*, v_{k+1}^*) &= (u_k^*, v_k^*) + d_{1n} \angle \varphi_{1n}, \\ d_{1n} &= \omega_d \times d_1 + (1 - \omega_d) \times d_0, \\ \varphi_{1n} &= \omega_a \times \varphi_1 + (1 - \omega_a) \times \varphi_0, \end{aligned} \tag{13} \]
where $d_0 = \|(u_k^* - u_{k-1}^*, v_k^* - v_{k-1}^*)\|$, $d_1 = \|(u_{k+1}^* - u_k^*, v_{k+1}^* - v_k^*)\|$,
$\varphi_0 = \arctan(u_k^* - u_{k-1}^*, v_k^* - v_{k-1}^*)$, $\varphi_1 = \arctan(u_{k+1}^* - u_k^*, v_{k+1}^* - v_k^*)$, $\omega_d =
0.9$, and $\omega_a = 0.9$. The abrupt transition from $(u_k^*, v_k^*)$ to $(u_{k+1}^*, v_{k+1}^*)$ is restricted
by reducing the contribution of $d_1$ and $\varphi_1$ slightly to 0.9. Note that for $\omega_d =
\omega_a = 1$, the updated $(u_{k+1}^*, v_{k+1}^*)$ of equation (13) remains unaltered from the
optimal solution of equation (12). In the following Sec. 6, we briefly describe our
experimental setup, and critically analyze the results.
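The update of equation (13) amounts to blending the current and previous displacement in polar form, as the following sketch shows. The function name and the use of the two-argument arctangent `math.atan2(dv, du)` for the direction are assumptions of this example; as in equation (13), the angles are blended linearly and no wrap-around handling is added.

```python
import math

def smooth_location(loc_prev2, loc_prev, loc_new, w_d=0.9, w_a=0.9):
    """Blend the newest displacement with the previous one, cf. Eq. (13).

    loc_prev2, loc_prev: refined locations in frames k-1 and k.
    loc_new: location detected in frame k+1 by Eq. (12).
    """
    du0, dv0 = loc_prev[0] - loc_prev2[0], loc_prev[1] - loc_prev2[1]
    du1, dv1 = loc_new[0] - loc_prev[0], loc_new[1] - loc_prev[1]
    d0, d1 = math.hypot(du0, dv0), math.hypot(du1, dv1)            # speeds
    phi0, phi1 = math.atan2(dv0, du0), math.atan2(dv1, du1)        # directions
    d = w_d * d1 + (1.0 - w_d) * d0
    phi = w_a * phi1 + (1.0 - w_a) * phi0
    return (loc_prev[0] + d * math.cos(phi), loc_prev[1] + d * math.sin(phi))

# A target that moved 1 pixel and is now detected 3 pixels further along the
# same direction is placed 0.9 * 3 + 0.1 * 1 = 2.8 pixels from its last position.
smoothed = smooth_location((0.0, 0.0), (1.0, 0.0), (4.0, 0.0))
```

Setting both weights to 1 returns the detected location unchanged, which matches the remark following equation (13).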
6 Experiments
First, we detail the experimental setup, and then carry out the ablation studies to analyze the effect of each individual component towards overall tracking
performance. Then we conduct extensive experiments to compare with various
state-of-the-art trackers both qualitatively and quantitatively on VOT2016 [16]
and OTB100 [29] benchmark. In all our experiments, we use VOT toolkit and
OTB toolkit for evaluation on VOT2016 and OTB100 benchmark, respectively.
6.1 Implementation Details
To avoid any bias that may arise due to the varying numerical
precision of different systems, we evaluate all the models, including the baseline
SRDCF and ECO, on the same system under an identical experimental setup. All
the experiments are conducted on a single machine: Intel(R) Xeon(R) CPU E3-1225 v2 @ 3.20GHz, 4 Core(s), 4 Logical Processor(s) and 16GB RAM. The
proposed tracker has been implemented in MATLAB with MatConvNet. We use
parameter settings similar to the baseline, apart from the additional parameters
$\delta = 5^\circ$ and $A = 2$ in the rotation adaptive filters. These settings are selected because
the orientation does not change drastically between consecutive frames. In IC,
we use the output intensity range [0, 255] for contrast stretching, and a threshold of
0.5 for unsharp masking. This intensity range is selected so as to match
the conventional representation (uint8) of most images. The
threshold, on the other hand, is selected manually by observing qualitative results on VOT2016.
6.2 Estimation of Computational Complexity
The Fast Fourier Transform (FFT) of a 2-dimensional signal of size $M \times N$
can be computed in $O(MN \log MN)$. Since there are $d$ feature layers, $S$ scales,
and $(2A + 1)$ orientations, the training and detection stages of our algorithm
require $O(ASdMN \log MN)$ FFT computations. To compute the convolution
response, the computed FFTs require $O(ASdMN)$ multiplication operations
and $O(ASMN)$ division operations. The division operations are used in the False
Positive Elimination (FPE) strategy. Assuming that Newton's optimization
converges in $N_{Ne}$ iterations, the total time complexity of matrix multiplication
and FPE sums up to $O((ASdMN + ASMN) N_{Ne})$. In contrast to standard
SRDCF [7], we learn the multi-resolution filter coefficients from properly oriented training samples. After detection of orientation through optimization of
total energy content on a coarse grid, the training samples are oriented appropriately in $O(MN)$ time complexity. The fraction of non-zero elements in $A_t$
of size $dMN \times dMN$, as given in standard SRDCF, is bounded by the upper
limit $\frac{2d + k^2}{dMN}$. Thus, the total time complexity of standard SRDCF training, assuming that the
Gauss-Seidel optimization converges in $N_{GS}$ iterations, sums up
to $O((d + k^2) dMN N_{GS})$. In addition to the standard SRDCF training, our
approach requires $O(MN)$ operations to orient the samples, leading to a total
complexity of $O(MN + (d + k^2) dMN N_{GS})$. Therefore, the overall time complexity of our RIDF-SRDCF is given by,
\[ O\left( ASdMN \log MN + (ASdMN + ASMN) N_{Ne} + MN + (d + k^2) dMN N_{GS} \right) \tag{14} \]
and that of SRDCF is given by,
\[ O\left( dSMN \log MN + SMN N_{Ne} + (d + k^2) dMN N_{GS} \right) . \tag{15} \]
Note that the overall complexity of both of these is largely dominated by
$O((d + k^2) dMN N_{GS})$, leading to only a slight increment in computational cost
due to the additional terms, but resulting in significant improvement in the overall
performance of RIDF-SRDCF relative to standard SRDCF. In fact, RIDF-SRDCF runs at 5 fps and SRDCF runs at 7 fps on our machine.
6.3 Ablation Studies
We progressively integrate Displacement consistency (D), False positive elimination (F), Rotation adaptiveness (R), Illumination correction (I), and their
combinations into the ECO framework for faster experimentation, and assess the
impact of each individual component on AEO, which is the standard metric on the
VOT2016 benchmark. We evaluate each ablative tracker on a set of 16 videos
(Table 1) during the development phase. The set is constructed from the pool
of 60 videos from the VOT2016 dataset. A video is selected if its frames are labelled
as either severe deformation, rotation, or illumination change. Note that the
FPE scheme improves the performance in every integration, and illumination
correction alone provides a gain of 7.7% over base RDF-ECO. As per the results in Table 1, the proposed ideas independently and together provide a good
improvement relative to the base model.
6.4 Comparison with the State of the Arts
Here, we present detailed evaluation results to experimentally validate the
efficacy of our contributions in the holistic visual object tracking challenge. We use the
VOT2016 benchmark during the development stage and for hyper parameter tuning.
To analyze the generalization ability of the proposed contributions, we benchmark
our trackers with the same parameter settings on both VOT2016 and OTB100.
Table 1. Quantitative evaluation of Ablative trackers on a set of 16 challenging videos
from VOT2016 benchmark.

Tracker   ECO       DECO   DFECO  RECO   RFECO  RDECO  RDFECO  RIDFECO
AEO       0.357     0.360  0.362  0.383  0.386  0.395  0.402   0.433
%Gain     Baseline  0.8    1.4    7.3    8.1    10.6   12.6    21.3
Evaluation on VOT2016 We evaluate the top performing models from Table 1, including Ablative trackers of SRDCF, on VOT2016 dataset. As per the
results in Table 2, the I-SRDCF, RDF-SRDCF, and RIDF-SRDCF provide a
considerable improvement of 3.53%, 10.60%, and 11.41% in AEO, 4.83%, 17.87%,
and 13.04% in robustness, respectively. The RIDF-ECO performs favourably
against the state-of-the-art trackers including MDNet (won VOT2015) and CCOT
(won VOT2016) with a slight improvement of 1.71% in AEO, and as high as
6.41% in robustness. The percentage gain is computed relative to baseline.
Table 2. State-of-the-art comparison on whole VOT2016 dataset.

Trackers                   SRDCF   I-SRDCF  RDF-SRDCF  RIDF-SRDCF  TCNN    CCOT    ECO     MDNet   RIDF-ECO
AEO                        0.1981  0.2051   0.2191     0.2207      0.3249  0.3310  0.3563  0.3584  0.3624
Failure Rate (Robustness)  2.07    1.97     1.70       1.80        0.96    0.83    0.78    0.76    0.73
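The percentage gains quoted for VOT2016 follow directly from the entries of Table 2, as the short calculation below illustrates. The helper `relative_gain` is a hypothetical name introduced for this example; for AEO a higher value is better, whereas for the failure rate a lower value is better.

```python
def relative_gain(baseline, value, lower_is_better=False):
    """Percentage improvement of `value` over `baseline`."""
    change = (baseline - value) if lower_is_better else (value - baseline)
    return 100.0 * change / baseline

# AEO and failure-rate entries taken from Table 2.
aeo_gain_srdcf = relative_gain(0.1981, 0.2207)                    # RIDF-SRDCF vs SRDCF
aeo_gain_eco = relative_gain(0.3563, 0.3624)                      # RIDF-ECO vs ECO
rob_gain_srdcf = relative_gain(2.07, 1.80, lower_is_better=True)  # RIDF-SRDCF vs SRDCF
rob_gain_eco = relative_gain(0.78, 0.73, lower_is_better=True)    # RIDF-ECO vs ECO

print(f"AEO gain: {aeo_gain_srdcf:.2f}% (SRDCF), {aeo_gain_eco:.2f}% (ECO)")
print(f"Robustness gain: {rob_gain_srdcf:.2f}% (SRDCF), {rob_gain_eco:.2f}% (ECO)")
```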
Evaluation on OTB100 As per the evaluation on OTB100 (Table 3), the
proposed RIDF-ECO performs favourably against the baseline and also the state-of-the-art trackers in most of the existing categories. As per the quantitative study
in Table 3, our method finds difficulties in dealing with Background Clutter, Deformation and Out-of-plane rotation. Though our contributions strengthen the
performance of ECO in these failure cases, it is still not better than MDNet.
Of particular interest, the SRDCF tracker with deep features, i.e. DeepSRDCF,
lags behind SRDCF with RIDF, i.e. RIDF-SRDCF, in success rate as well as precision, which are the standard metrics in the OTB100 benchmark. Note that exactly
the same hyper parameters are used in the evaluation on OTB100 as on VOT2016.
This verifies that rotation adaptive filters, along with the other contributions,
generalize well across the OTB100 and VOT2016 benchmark datasets.
Evaluation of CF trackers To qualitatively assess the overall performance of
RIDF-SRDCF, we compare our results with baseline approach and few other CF
trackers on challenging sequences from VOT2016, as shown in Fig. 3. Further,
we quantitatively assess the performance by comparing the Average Expected
Overlap (AEO) of few correlation filter based trackers, as shown in Fig. 4. The
Rotation Adaptive Correlation Filters
13
Table 3. State-of-the-art comparison on OTB100 dataset.
Category                RIDF-ECO  ECO    MDNet  CCOT   RIDF-SRDCF  DeepSRDCF  SRDCF  CFNet  Staple  KCF
Out-of-view             0.767     0.726  0.708  0.725  0.712       0.619      0.555  0.423  0.518   0.550
Occlusion               0.721     0.710  0.702  0.692  0.652       0.625      0.641  0.573  0.610   0.535
Illumination Variation  0.702     0.662  0.688  0.676  0.649       0.631      0.620  0.561  0.601   0.530
Low Resolution          0.734     0.652  0.663  0.642  0.588       0.438      0.537  0.545  0.494   0.384
Background Clutter      0.648     0.638  0.697  0.620  0.634       0.616      0.612  0.592  0.580   0.557
Deformation             0.687     0.687  0.722  0.657  0.652       0.645      0.641  0.618  0.690   0.608
In-plane rotation       0.696     0.645  0.656  0.653  0.635       0.625      0.615  0.606  0.596   0.510
Out-of-plane rotation   0.682     0.665  0.707  0.663  0.646       0.637      0.618  0.593  0.594   0.514
Fast Motion             0.716     0.698  0.671  0.694  0.693       0.640      0.599  0.547  0.526   0.482
Overall Success Rate    0.702     0.691  0.678  0.671  0.641       0.635      0.598  0.589  0.581   0.477
Overall Precision       0.937     0.910  0.909  0.898  0.870       0.851      0.789  0.777  0.784   0.696
proposed RIDF-SRDCF outperforms the standard SRDCF in most of the individual categories,
which leads to 11.4% and 13.04% overall improvement in AEO
and robustness, respectively. The categorical comparison, as can be inferred from
Fig. 4, shows 56.25%, 23.53%, 38.46%, 5.26%, and 16.66% gain in the Illumination
change, Size change, Motion change, Camera motion, and Empty categories,
respectively. Note that the percentage improvement is computed relative to the base
SRDCF. In Table 4, we compare our rotation adaptive scheme with two recent
approaches that aim at addressing this issue heuristically. As per our experiments,
the proposed rotation adaptive scheme outperforms these
Fig. 3. Qualitative analysis of RIDF-SRDCF. The proposed tracker successfully tracks
the target under severe rotation, unlike SRDCF and KCF. The rotation adaptive filters
assist in determining the orientation of the target object effectively, which leads to
a substantial gain in overall performance. To avoid clutter, only a few bounding boxes
are plotted; other variants are quantified in Fig. 4.
Fig. 4. Average Expected Overlap analysis of correlation filter based trackers.
counterparts on the VOT2016 benchmark. Since base CF trackers are used as core
components in most trackers, we believe that the reported performance gain
will carry over to their derivatives.
Table 4. Comparison with two recent rotation adaptive trackers on VOT2016.
Trackers  RAJSSC  SRDCF   SiameseFC-DSR  RIDF-SRDCF  ECO     RIDF-ECO
AEO       0.1664  0.1981  0.2084         0.2207      0.3563  0.3624
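As a quick check on how the relative improvements are obtained, the sketch below recomputes the AEO gains from the values in Table 4, assuming the percentage is (proposed - baseline) / baseline. The helper name `relative_gain` is hypothetical; the AEO numbers are copied from Table 4.

```python
def relative_gain(baseline, proposed):
    """Percentage change of `proposed` relative to `baseline`."""
    return 100.0 * (proposed - baseline) / baseline

# AEO values on VOT2016, taken from Table 4.
aeo = {"SRDCF": 0.1981, "RIDF-SRDCF": 0.2207, "ECO": 0.3563, "RIDF-ECO": 0.3624}

srdcf_gain = relative_gain(aeo["SRDCF"], aeo["RIDF-SRDCF"])
eco_gain = relative_gain(aeo["ECO"], aeo["RIDF-ECO"])

print(f"RIDF-SRDCF vs SRDCF: {srdcf_gain:.2f}%")  # about 11.4%
print(f"RIDF-ECO vs ECO:     {eco_gain:.2f}%")    # about 1.71%
```

The same formula, applied per category to the values plotted in Fig. 4, would give the categorical gains quoted in the text.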
7
Concluding Remarks
In this study, we demonstrated that employing a simple yet effective image enhancement
technique prior to feature extraction can yield considerable gain in the
tracking paradigm. We analyzed the effectiveness of the proposed rotation adaptive
correlation filter in the standard DCF formulation, and showed compelling results
on popular tracking benchmarks. We renovated the sub-grid detection approach
by optimizing the object’s orientation through false positive elimination, which was
reflected favourably in the overall performance. Also, the supervision of displacement
consistency on CF trackers showed promising results in various scenarios.
Moreover, since the DCF formulation is used as the backbone of most state-of-the-art
trackers, we believe that the proposed rotation adaptive scheme in correlation
filters can be suitably integrated into many frameworks and will be useful in
pushing tracking research forward. In future, we will assess the performance
of the proposed tracker on other publicly available datasets [12,20,27].
References
1. Adam, A., Rivlin, E., Shimshoni, I.: Robust fragments-based tracking using the
integral histogram. In: 2006 IEEE Computer Society Conference on Computer
vision and pattern recognition. vol. 1, pp. 798–805. IEEE (2006)
2. Babenko, B., Yang, M.H., Belongie, S.: Robust object tracking with online multiple
instance learning. IEEE transactions on pattern analysis and machine intelligence
33(8), 1619–1632 (2011)
3. Bolme, D.S., Beveridge, J.R., Draper, B.A., Lui, Y.M.: Visual object tracking using
adaptive correlation filters. In: Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on. pp. 2544–2550. IEEE (2010)
4. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on pattern analysis and machine intelligence 25(5), 564–577 (2003)
5. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ECO: Efficient convolution operators for tracking. In: Proceedings of the 2017 IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), Honolulu, HI, USA. pp. 21–26 (2017)
6. Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M.: Convolutional features
for correlation filter based visual tracking. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 58–66 (2015)
7. Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4310–4318 (2015)
8. Danelljan, M., Robinson, A., Khan, F.S., Felsberg, M.: Beyond correlation filters:
Learning continuous convolution operators for visual tracking. In: European Conference on Computer Vision. pp. 472–488. Springer (2016)
9. Danelljan, M., Shahbaz Khan, F., Felsberg, M., Van de Weijer, J.: Adaptive color
attributes for real-time visual tracking. In: IEEE Conference on Computer Vision
and Pattern Recognition (CVPR), Columbus, Ohio, USA, June 24-27, 2014. pp.
1090–1097. IEEE Computer Society (2014)
10. Du, Q., Cai, Z.q., Liu, H., Yu, Z.L.: A rotation adaptive correlation filter for robust
tracking. In: 2015 IEEE International Conference on Digital Signal Processing
(DSP). pp. 1035–1038. IEEE (2015)
11. Fan, H., Ling, H.: Parallel tracking and verifying: A framework for real-time and
high accuracy visual tracking. In: Proc. IEEE Int. Conf. Computer Vision, Venice,
Italy (2017)
12. Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: A
benchmark for higher frame rate object tracking. arXiv preprint arXiv:1703.05884
(2017)
13. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with
kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine
Intelligence 37(3), 583–596 (2015)
14. Huang, C., Lucey, S., Ramanan, D.: Learning policies for adaptive tracking with
deep feature cascades. In: IEEE Int. Conf. on Computer Vision (ICCV). pp. 105–
114 (2017)
15. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. IEEE transactions on pattern analysis and machine intelligence 34(7), 1409–1422 (2012)
16. Kristan, M., Matas, J., Leonardis, A., Vojir, T., Pflugfelder, R., Fernandez, G., Nebehay, G., Porikli, F., Čehovin, L.: A novel performance evaluation methodology for single-target trackers. IEEE Transactions on Pattern Analysis and Machine Intelligence 38(11), 2137–2155 (Nov 2016).
https://doi.org/10.1109/TPAMI.2016.2516982
17. Kwon, J., Lee, K.M.: Visual tracking decomposition. In: 2010 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR). pp. 1269–1276. IEEE (2010)
18. Liu, S., Zhang, T., Cao, X., Xu, C.: Structural correlation filter for robust visual
tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 4312–4320 (2016)
19. Ma, C., Huang, J.B., Yang, X., Yang, M.H.: Hierarchical convolutional features for
visual tracking. In: Proceedings of the IEEE International Conference on Computer
Vision. pp. 3074–3082 (2015)
20. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking.
In: European conference on computer vision. pp. 445–461. Springer (2016)
21. Nam, H., Baek, M., Han, B.: Modeling and propagating CNNs in a tree structure
for visual tracking. arXiv preprint arXiv:1608.07242 (2016)
22. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual
tracking. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR). pp. 4293–4302. IEEE (2016)
23. Nummiaro, K., Koller-Meier, E., Van Gool, L.: An adaptive color-based particle
filter. Image and vision computing 21(1), 99–110 (2003)
24. Oron, S., Bar-Hillel, A., Levi, D., Avidan, S.: Locally orderless tracking. International Journal of Computer Vision 111(2), 213–228 (2015)
25. Petrou, M., Petrou, C.: Ch.4: Image enhancement. Image Processing: The Fundamentals pp. 293–394 (2010)
26. Rout, L., Sidhartha, Manyam, G.R., Mishra, D.: Rotation adaptive visual object
tracking with motion consistency. In: 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). pp. 1047–1055 (March 2018)
27. Smeulders, A.W., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.:
Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis
& Machine Intelligence (1) (2013)
28. Wang, S., Lu, H., Yang, F., Yang, M.H.: Superpixel tracking. In: 2011 IEEE International Conference on Computer Vision (ICCV). pp. 1323–1330. IEEE (2011)
29. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: IEEE
Conference on Computer Vision and Pattern Recognition (CVPR) (2013)
30. Zhang, M., Xing, J., Gao, J., Shi, X., Wang, Q., Hu, W.: Joint scale-spatial correlation tracking with adaptive rotation estimation. In: Proceedings of the IEEE
International Conference on Computer Vision Workshops. pp. 32–40 (2015)
31. Zhang, T., Liu, S., Ahuja, N., Yang, M.H., Ghanem, B.: Robust visual tracking
via consistent low-rank sparse learning. International Journal of Computer Vision
111(2), 171–190 (2015)