CSRT
Alan Lukežič¹, Tomáš Vojíř², Luka Čehovin Zajc¹, Jiří Matas² and Matej Kristan¹
arXiv:1611.08461v3 [cs.CV] 14 Jan 2019
¹ Faculty of Computer and Information Science, University of Ljubljana, Slovenia
² Faculty of Electrical Engineering, Czech Technical University in Prague, Czech Republic
{alan.lukezic, luka.cehovin, matej.kristan}@fri.uni-lj.si
{vojirtom, matas}@cmp.felk.cvut.cz
Abstract

Short-term tracking is an open and challenging problem for which discriminative correlation filters (DCF) have shown excellent performance. We introduce the channel and spatial reliability concepts to DCF tracking and provide a learning algorithm for their efficient and seamless integration in the filter update and the tracking process. The spatial reliability map adjusts the filter support to the part of the object suitable for tracking. This both allows the search region to be enlarged and improves tracking of non-rectangular objects. Reliability scores reflect channel-wise quality of the learned filters and are used as feature weighting coefficients in localization. Experimentally, with only two simple standard feature sets, HoGs and Colornames, the novel CSR-DCF method (DCF with Channel and Spatial Reliability) achieves state-of-the-art results on VOT 2016, VOT 2015 and OTB100. The CSR-DCF runs close to real-time on a CPU.

Keywords: Visual tracking, Correlation filters, Channel reliability, Constrained optimization

1 Introduction

Short-term, model-free visual object tracking is the problem of continuously localizing a target in a video sequence given a single example of its appearance. It has received significant attention from the computer vision community, which is reflected in the number of papers published on the topic and the existence of multiple performance evaluation benchmarks (Wu et al, 2013; Kristan et al, 2013, 2014, 2015, 2016c; Liang et al, 2015; Smeulders et al, 2014; Mueller et al, 2016). Diverse factors (occlusion, illumination change, fast object or camera motion, appearance changes due to rigid or non-rigid deformations, and similarity to the background) make short-term tracking challenging.

Recent short-term tracking evaluations (Wu et al, 2013; Kristan et al, 2013, 2014, 2015) consistently confirm the advantages of semi-supervised discriminative tracking approaches (Grabner et al, 2006; Babenko et al, 2011; Hare et al, 2011; Bolme et al, 2010). In particular, trackers based on the discriminative correlation filter (DCF) method (Bolme et al, 2010; Danelljan et al, 2014a; Henriques et al, 2015; Li and Zhu, 2014a; Danelljan et al, 2015a) have shown state-of-the-art performance in all standard benchmarks. Discriminative correlation methods learn a filter with a pre-defined response on the training image. The latter is obtained by slightly extending the region around the target to include background samples.

The standard formulation of DCF uses circular correlation, which allows learning to be implemented efficiently by the Fast Fourier transform (FFT). However, the FFT requires the filter and the search region size to be equal, which limits the detection range. Due to the circularity, the filter is trained on many examples that contain unrealistic, wrapped-around circularly-shifted versions of the target. A naive approach to reducing the windowing problems is to learn the filter from a larger region. However, due to the large area of the background in the region, the tracking performance of the DCF drops significantly, as shown in Figure 2.

The windowing problems were recently addressed by
Kiani Galoogahi et al (2015), who propose zero-padding the filter during learning, and by Danelljan et al (2015a), who introduce spatial regularization to penalize filter values outside the target boundaries. Both approaches train from image regions much larger than the target and thus increase the detection range.

[Figure 1 panels: Learning - Update stage (training patch, spatial map, spatially constrained filters, channel weights); Localization stage (test patch, filter responses, channel weights, final response)]
Figure 1: Overview of the CSR-DCF approach. An automatically estimated spatial reliability map restricts the correlation filter to the parts suitable for tracking (top), improving localization within a larger search region and performance for irregularly shaped objects. Channel reliability weights calculated in the constrained optimization step of the correlation filter learning reduce the noise of the weight-averaged filter response (bottom).

Another limitation of the published DCF methods is the assumption that the target shape is well approximated by an axis-aligned rectangle. For irregularly shaped objects or those with a hollow center, the filter eventually learns the background, which may lead to drift and failure. The same problem appears for approximately rectangular objects in the case of occlusion. The Kiani Galoogahi et al (2015) and Danelljan et al (2015a) methods both suffer from this problem.

In this paper we introduce the CSR-DCF, the Discriminative Correlation Filter with Channel and Spatial Reliability. The spatial reliability map adapts the filter support to the part of the object suitable for tracking, which overcomes both the problems of circular shift, enabling an arbitrary search (and training) region size, and the limitations related to the rectangular shape assumption. An important benefit of a large training region is that background samples from a wider area around the target are obtained to improve the discriminative power of the filter. The spatial reliability map is estimated using the output of a graph labeling problem solved efficiently in each frame. An efficient optimization procedure is applied for learning a correlation filter with the support constrained by the spatial reliability map, since the standard closed-form solution cannot be generalized to this case. Figure 2 shows that the tracking performance of our spatially constrained correlation filter (denoted as S-DCF) does not degrade with increasing training and search region size, as is the case with the standard DCF. In contrast, the performance of S-DCF improves from better treatment of training samples and increased search region size. Experiments show that the novel filter optimization procedure outperforms related approaches for constrained learning in DCFs.

Channel reliability is the second novelty the CSR-DCF tracker introduces. The reliability is estimated from the properties of the constrained least-squares solution to filter design. The channel reliability scores are used for weighting the per-channel filter responses in localization (Figure 1). The CSR-DCF shows state-of-the-art performance on standard benchmarks, OTB100 (Wu et al, 2015), VOT2015 (Kristan et al, 2015) and VOT2016 (Kristan et al, 2016b), while running close to real-time on a single CPU. The spatial and channel reliability formulation is general and can be used in most modern correlation filters, e.g. those using deep features.

The remainder of the paper is structured as follows. In Section 2 we review the most closely related work, our approach is described in Section 3, experimental results are presented in Section 4 and conclusions are drawn in Section 5.

2 Related work

The discriminative correlation filters for object detection date back to the 1980s with the seminal work of Hester and Casasent (1980). They have been popularized only recently in the tracking community, starting with the Bolme et al (2010) MOSSE tracker. Using a gray-scale template, MOSSE achieved state-of-the-art performance on a tracking benchmark (Wu et al, 2013) at a remarkable processing speed. Significant improvements have been made since, and in 2014 the top-performing trackers on a recent benchmark (Kristan et al, 2014) were all from this class of trackers. DCF improvements fall into two categories: introduction of new features and conceptual improvements in filter learning.

[Figure 2 plot: x-axis "Target template padding factor" (0 to 4); curve label S-DCF]
Figure 2: Tracking performance measured by the Expected Average Overlap (EAO) of the standard DCF and our spatially constrained DCF (S-DCF) as a function of search region size, expressed as the multiple of the target size (right, x-axis). The filter is learned from a training region equal in size to the search region. The search region sizes are visualized by black-white dashed rectangles (left image) and the target bounding box is shown in yellow.

In the first group, Henriques et al (2015) replaced the grayscale templates by HoG (Dalal and Triggs, 2005), Danelljan et al (2014b) proposed multi-dimensional color attributes and Li and Zhu (2014b) applied feature combination. Recently, convolutional network features learned for object detection have been applied (Ma et al, 2015; Danelljan et al, 2015b, 2016), leading to a performance boost, but at a cost of significant speed reduction.

Conceptually, the first successful theoretical extension of the standard DCF was the kernelized formulation by Henriques et al (2015), which achieved remarkable tracking performance while preserving high speed. Later, Danelljan et al (2014a) proposed a correlation filter based scale adaptation, introducing a scale-space pyramid learned within a correlation filter framework. Zhang et al (2014) introduced spatio-temporal context learning in the DCFs. To improve localization with correlation filters, Bertinetto et al (2016a) proposed a tracking method that combines the output of the correlation filter with the target segmentation probability map. Danelljan et al (2016) [...] continuous space, while Qi et al (2016) proposed a mechanism to combine correlation responses from multiple convolutional layers. A correlation filter tracker which is able to handle drifts in longer sequences was proposed by Wang et al (2016). It clusters similar target appearances together and uses the clusters for target localization instead of a single online learned filter.

Since most of the correlation filter trackers represent the target with a single filter, the filter can easily get corrupted when occlusion or target deformation happens. In general, part-based trackers are better at addressing these issues. Therefore several part-based correlation filter methods were proposed. Liu et al (2015) use an efficient method to combine correlation outputs of multiple parts and Liu et al (2016) proposed a tracking method for modeling the target structure with multiple parts using multiple correlation filters. Lukežič et al (2017) treat the parts' correlation filter responses and their constellation constraints jointly as an equivalent spring system. They derive a highly efficient optimization to infer the most probable target deformation.

Recently, Kiani Galoogahi et al (2015) addressed the problem that occurs due to learning with circular correlation from small training regions. They proposed a learning framework that artificially increases the filter size by implicitly zero-padding the filter. This reduces the boundary artifacts by increasing the number of training examples in constrained filter learning. Danelljan et al (2015a) reformulate the learning cost function to penalize non-zero filter values outside the object bounding box. Performance better than that of Kiani Galoogahi et al (2015) is reported, but the learned filter is still a trade-off between the correlation response and regularization, and it does not guarantee that filter values are zero outside of the object bounding box.

3 Spatially constrained correlation filters

The use of multiple channels in correlation filters (Henriques et al, 2015; Danelljan et al, 2017; Galoogahi et al, 2013) has become very popular in visual tracking. In the following we present the main ideas behind learning [...]
3.1 Constrained correlation filter learning

Since filter learning is independent across the channels in our formulation (5), we assume only a single channel in the following derivation (i.e., $N_c = 1$) and drop the channel index for clarity.

Let $m \in \{0, 1\}$ be a spatial reliability map with elements either zero or one, that identifies pixels which should be set to zero in the learned filter. The constraint can be formalized as $h \equiv m \odot h$, where $\odot$ represents the Hadamard (element-wise) product. Such a constraint does not lead to a closed-form solution, but an iterative approach akin to Kiani Galoogahi et al (2015) can be derived for efficiently solving the optimization problem. In the following we summarize the main steps of our approach and report the full derivation in Appendix 6.

We start by introducing a dual variable $h_c$ and the constraint

$$h_c - m \odot h \equiv 0, \tag{7}$$

which leads to the following augmented Lagrangian (Boyd et al, 2011)

$$L(\hat{h}_c, h, \hat{l} \mid m) = \|\mathrm{diag}(\hat{f})\,\overline{\hat{h}_c} - \hat{g}\|^2 + \frac{\lambda}{2}\|h_m\|^2 + \left[\hat{l}^H(\hat{h}_c - \hat{h}_m) + \overline{\hat{l}^H(\hat{h}_c - \hat{h}_m)}\right] + \mu\|\hat{h}_c - \hat{h}_m\|^2, \tag{8}$$

where $\hat{l}$ is a complex Lagrange multiplier, $\mu > 0$, and we use the definition $h_m = (m \odot h)$ for compact notation. The augmented Lagrangian (8) can be iteratively minimized by the alternating direction method of multipliers, e.g. Boyd et al (2011), which sequentially solves the following sub-problems at each iteration:

$$\hat{h}_c^{i+1} = \arg\min_{h_c} L(\hat{h}_c, h^{i}, \hat{l}^{i} \mid m), \tag{9}$$
$$h^{i+1} = \arg\min_{h} L(\hat{h}_c^{i+1}, h, \hat{l}^{i} \mid m), \tag{10}$$

and the Lagrange multiplier is updated as

$$\hat{l}^{i+1} = \hat{l}^{i} + \mu(\hat{h}_c^{i+1} - \hat{h}^{i+1}). \tag{11}$$

Minimizations in (9,10) have at each iteration a closed-form solution, i.e.,

$$\hat{h}_c^{i+1} = \left(\hat{f} \odot \overline{\hat{g}} + (\mu \hat{h}_m^{i} - \hat{l}^{i})\right) \odot^{-1} \left(\overline{\hat{f}} \odot \hat{f} + \mu^{i}\right), \tag{12}$$
$$h^{i+1} = m \odot \mathcal{F}^{-1}\left[\hat{l}^{i} + \mu^{i}\hat{h}_c^{i+1}\right] \Big/ \left(\frac{\lambda}{2D} + \mu^{i}\right). \tag{13}$$

A standard scheme for updating the constraint penalty $\mu$ values (Boyd et al, 2011) is applied, i.e., $\mu^{i+1} = \beta\mu^{i}$.

Computations of (12,11) are fully carried out in the frequency domain, while the solution for (13) requires a single inverse FFT and another FFT to compute $\hat{h}^{i+1}$. A single optimization iteration thus requires only two calls of the Fourier transform, resulting in a very fast optimization. The computational complexity is that of the Fourier transform, i.e., $O(D \log D)$. Filter learning is implemented in less than five lines of Matlab code and is summarized in Algorithm 1.

Algorithm 1: Constrained filter optimization.
Require:
  Features extracted from training region $f$, ideal correlation response $g$, binary mask $m$.
Ensure:
  Optimized filter $\hat{h}$.
Procedure:
  1: Initialize filter $\hat{h}^{0}$ by $h_{t-1}$.
  2: Initialize Lagrangian coefficients: $\hat{l}^{0} \leftarrow$ zeros.
  3: while stop condition not met do
  4:   Calculate $\hat{h}_c^{i+1}$ from $\hat{h}^{i}$ and $\hat{l}^{i}$ using (12).
  5:   Calculate $h^{i+1}$ from $\hat{h}_c^{i+1}$ and $\hat{l}^{i}$ using (13).
  6:   Update the Lagrangian $\hat{l}^{i+1}$ from $\hat{h}_c^{i+1}$ and $h^{i+1}$ using (11).
  7: end while

3.2 Constructing spatial reliability map

Once the target is localized, a training region is extracted and used to update the filter. Our constrained filter learning (13) requires estimation of the spatial reliability map $m$ (i.e., segmentation) that identifies pixels in the training region which likely belong to the target (see Figure 4). In the following we briefly outline the segmentation model which is used to estimate $m$.

During tracking, the object foreground/background color models are maintained as color histograms $c = \{c^f, c^b\}$. Let $y_i = [y_i^c, y_i^x]$ be the observation, i.e., the color $y_i^c$ and position $y_i^x$ at the i-th pixel in the training region, and let $m_i \in \{0, 1\}$ be a random variable denoting the unknown foreground/background label. The joint [...]
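As an illustration of the iterations (11) to (13) summarized in Algorithm 1 of Section 3.1, the following is a minimal single-channel, one-dimensional Python sketch. It is not the authors' Matlab implementation. Several points are assumptions made for illustration only: the naive `dft`/`idft` helpers stand in for an FFT, the filter is initialized with zeros rather than with the previous filter, a fixed iteration count replaces the stop condition, the cross-spectrum is taken as f̂ ⊙ conj(ĝ) and the auto-spectrum as conj(f̂) ⊙ f̂, and all function and parameter names are hypothetical. The default values of `lam`, `mu`, `beta` and `n_iter` mirror the settings reported in Sections 4.1 and 4.2.

```python
import cmath
import math


def dft(x):
    """Naive O(D^2) discrete Fourier transform (stand-in for an FFT)."""
    D = len(x)
    return [sum(x[n] * cmath.exp(-2j * math.pi * k * n / D) for n in range(D))
            for k in range(D)]


def idft(X):
    """Naive inverse DFT with the usual 1/D normalization."""
    D = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * n / D) for k in range(D)) / D
            for n in range(D)]


def learn_constrained_filter(f, g, m, lam=0.01, mu=5.0, beta=3.0, n_iter=4):
    """Sketch of constrained filter learning for one 1-D feature channel.

    f: training features, g: ideal response, m: binary mask (all length D).
    Returns the spatial-domain filter h, which is zero wherever m is zero.
    """
    D = len(f)
    f_hat, g_hat = dft(f), dft(g)
    s_xx = [abs(a) ** 2 for a in f_hat]                        # conj(f_hat) * f_hat
    s_xy = [a * b.conjugate() for a, b in zip(f_hat, g_hat)]   # f_hat * conj(g_hat)

    h = [0.0] * D        # assumption: zero start (Algorithm 1 starts from h_{t-1})
    h_hat = dft(h)       # h is already masked, so this plays the role of h_m_hat
    l_hat = [0j] * D     # Lagrange multiplier, initialized to zeros

    for _ in range(n_iter):
        # Equation (12): closed-form update of the dual variable in the Fourier domain.
        hc_hat = [(s_xy[k] + mu * h_hat[k] - l_hat[k]) / (s_xx[k] + mu)
                  for k in range(D)]
        # Equation (13): back to the spatial domain, rescale, apply the mask.
        z = idft([l_hat[k] + mu * hc_hat[k] for k in range(D)])
        h = [m[n] * z[n].real / (lam / (2 * D) + mu) for n in range(D)]
        h_hat = dft(h)
        # Equation (11): update the Lagrange multiplier.
        l_hat = [l_hat[k] + mu * (hc_hat[k] - h_hat[k]) for k in range(D)]
        # Penalty schedule mu^{i+1} = beta * mu^i.
        mu *= beta
    return h


if __name__ == "__main__":
    D = 16
    f = [math.sin(0.7 * n) + 0.3 * math.cos(2.1 * n) for n in range(D)]
    g = [math.exp(-0.5 * (min(n, D - n) / 1.5) ** 2) for n in range(D)]
    m = [1 if 4 <= n < 12 else 0 for n in range(D)]
    print(learn_constrained_filter(f, g, m))
```

Because the mask is applied multiplicatively in the step corresponding to (13), the returned filter is exactly zero outside the mask after every iteration, which is the hard-constraint behaviour the section describes. A practical implementation would replace the quadratic-time transforms with a 2-D FFT and run the same loop per channel.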
[Figure 4 panels: Training region | Spatial prior | Backprojection | Posterior | Overlayed training region]
Figure 4: Spatial reliability map construction from the training region. From left to right: a training region with the target bounding box, [...] the foreground-background color models, the posterior object probability after Markov random field regularization, and the training region masked with the final binary reliability map. The probabilities are color-coded in a blue (0.0), green (0.5), yellow (1.0) colormap.

[...] the appearance likelihood, the spatial likelihood and the foreground/background prior probability. The appearance likelihood term $p(y_i^c \mid m_i = j)$ is computed by Bayes rule from the object foreground/background color models $c^f$ and $c^b$. The prior probability $p(m_i = j)$ is defined by the ratio between the region sizes for foreground/background histogram extraction.

The central pixels in axis-aligned approximations of an elongated rotating, articulated or deformable object are likely to contain the object regardless of the specific deformation. On the other hand, in the absence of measurements, pixels away from the center are equally likely to belong to the object or to the background. This deformation invariance of the reliability of central elements is enforced in our approach by defining a weak spatial prior

$$p(y_i^x \mid m_i = j) = k(x; \sigma), \tag{15}$$

where $k(x; \sigma)$ is a modified Epanechnikov kernel, $k(r; \sigma) = 1 - (r/\sigma)^2$, with size parameter $\sigma$ equal to the minor bounding box axis and clipped to the interval [0.5, 0.9] such that the object prior probability at the center is 0.9 and changes to a uniform prior away from the center (Figure 4).

[...] optimization and can be implemented as a series of convolutions.

The prior over the i-th pixel is defined compactly as $\pi_i = [\pi_{i0}, \pi_{i1}]$ with $\pi_{ij} = p(m_i = j)$, and a standard approximation is made (Diplaros et al, 2007) that decomposes the joint pdf over priors $\pi = [\pi_1, \ldots, \pi_M]$ into a product of local conditional distributions $p(\pi) = \prod_{i=1}^{M} p(\pi_i \mid \pi_{N_i})$, where $M$ is the number of pixels and $\pi_{N_i}$ is a mixture distribution over the priors of the i-th pixel's neighbors, i.e., $\pi_{N_i} = \sum_{j \in N_i, j \neq i} \lambda_{ij}\pi_j$, with $\lambda_{ij}$ fixed weights satisfying $\sum_j \lambda_{ij} = 1$. In Diplaros et al (2007) the weights are fixed to a normalized Gaussian and are shared across all pixel locations. The potentials in the MRF are defined as $p(\pi_i \mid \pi_{N_i}) \propto \exp\left(-\frac{1}{2}E(\pi_i, \pi_{N_i})\right)$, with the exponent defined as $E(\pi_i, \pi_{N_i}) = D(\pi_i \| \pi_{N_i}) + H(\pi_i)$. The term $D(\pi_i \| \pi_{N_i})$ is the Kullback-Leibler divergence, which penalizes the difference between prior distributions over the neighboring pixels ($\pi_i$ and $\pi_{N_i}$), while the term $H(\pi_i)$ is the entropy defined as $H(\pi_i) = -\sum_{j=0}^{1} \pi_{ij}\log\pi_{ij}$, which penalizes uninformative priors $\pi_i$.

For smooth solutions Diplaros et al (2007) propose using a similar constraint over the posteriors $p_i = [p_{i0}, p_{i1}]$, with $p_{ij}$ being the posterior probability of class $j$ at the i-th pixel, leading to the following energy function

$$F = \sum_{i=1}^{M}\left[\log p(y_i) - \frac{1}{2}\left(E(\pi_i, \pi_{N_i}) + E(p_i, p_{N_i})\right)\right]. \tag{16}$$

Minimization of the energy (16) w.r.t. $\pi$ and $p$ is efficiently solved by the solver from Diplaros et al (2007). The final mask $m$ for learning the filter in Section 3.1 is obtained by thresholding the posterior at 0.5.

[Figure 5 panels: Training image; Feature channel; Spatial reliability map m; Filter response for Channel #17 (Max: 0.58, discriminative case) and Channel #20 (Max: 0.17, non-discriminative case)]
Figure 5: A filter is learned on feature channels from a training region using the constrained optimization with a binary segmentation mask m. Correlation responses between the learned filter and the training region for two feature channels are shown on the right. On a discriminative feature channel the filter response is much stronger and less noisy than on a non-discriminative channel.

3.3 Channel reliability estimation

Channel reliability $\tilde{w}_d$ in (6) reflects the importance of each channel at the target localization stage. In our approach it consists of two types of reliability measures: (i) channel learning reliability $\tilde{w}_d^{(lrn)}$, which is calculated in the filter learning stage, and (ii) channel detection reliability $\tilde{w}_d^{(det)}$, which is calculated in the target localization stage. The joint channel reliability $\tilde{w}_d$ in (6) at the target localization stage is computed as the product of both reliability measures, i.e.,

$$\tilde{w}_d = \tilde{w}_d^{(lrn)} \cdot \tilde{w}_d^{(det)}, \tag{17}$$

and normalized s.t. $\sum_d \tilde{w}_d = 1$. The reliability measures are described in the following paragraphs.

Channel learning reliability. Constrained minimization of (8) solves a least squares problem averaged over all circular displacements of the filter on a feature channel. A discriminative feature channel $f_d$ produces a filter $h_d$ whose output $f_d * h_d$ nearly exactly fits the ideal response $g$. On the other hand, since the response is highly noisy on channels with low discriminative power, a global error reduction in the least squares significantly reduces the maximal response. This effect is demonstrated in Figure 5, which shows correlation responses for a highly discriminative and a non-discriminative channel. Thus a straightforward measure of channel learning reliability $\tilde{w}_d^{(lrn)}$ is the maximum response value of a learned channel filter, which is computed as

$$\tilde{w}_d^{(lrn)} = \max(f_d * h_d). \tag{18}$$

Channel detection reliability. The second part of the channel reliability reflects how uniquely each channel votes for a single target location. Note that Bolme et al (2010) proposed a similar approach to detect loss of target. Our measure is based on the ratio between the second and first highest non-adjacent peaks in the channel response map, i.e., $1 - \rho_d^{max2}/\rho_d^{max1}$. The two largest peaks in the response map are obtained as the two largest values after a 3×3 non-maximum suppression. Note that this ratio penalizes situations in which multiple similar objects appear in the target vicinity (i.e., the response map contains many well expressed modes), even if the major mode accurately depicts the target position. To mitigate such penalizations, the final values are not allowed to fall below 0.5. The detection reliability of the d-th channel is estimated as

$$\tilde{w}_d^{(det)} = \max\left(1 - \rho_d^{max2}/\rho_d^{max1},\ 0.5\right). \tag{19}$$

3.4 Tracking with channel and spatial reliability

A single tracking iteration of the proposed channel and spatial reliability correlation filter tracker (CSR-DCF) is summarized in Algorithm 2 and visualized in Figure 6. The localization and update steps proceed as follows.
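Before turning to the individual steps, the reliability measures (17) to (19) of Section 3.3 can be illustrated with a short Python sketch operating on per-channel 2-D response maps given as nested lists. This is a hedged illustration rather than the authors' code. The function names are hypothetical, and several edge-case choices are assumptions not specified in the text: peaks are strict local maxima of their 3×3 neighbourhood without wrap-around at the borders, a missing or negative second peak is treated as zero, a non-positive main peak yields the floor value 0.5, and all-zero weights fall back to a uniform distribution.

```python
def nms_peaks(resp):
    """Values that survive a 3x3 non-maximum suppression, largest first."""
    rows, cols = len(resp), len(resp[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            neighbours = [resp[rr][cc]
                          for rr in range(max(r - 1, 0), min(r + 2, rows))
                          for cc in range(max(c - 1, 0), min(c + 2, cols))
                          if (rr, cc) != (r, c)]
            # Assumption: a peak must be strictly larger than all its neighbours.
            if all(resp[r][c] > v for v in neighbours):
                peaks.append(resp[r][c])
    return sorted(peaks, reverse=True)


def learning_reliability(resp):
    """Equation (18): maximum of the channel response on the training region."""
    return max(max(row) for row in resp)


def detection_reliability(resp):
    """Equation (19): 1 - (second peak / first peak), floored at 0.5."""
    peaks = nms_peaks(resp)
    if not peaks or peaks[0] <= 0:
        return 0.5  # assumption: no usable main peak gives the floor value
    second = max(peaks[1], 0.0) if len(peaks) > 1 else 0.0
    return max(1.0 - second / peaks[0], 0.5)


def joint_reliability(learn_resps, detect_resps):
    """Equation (17): per-channel product, normalized to sum to one."""
    w = [learning_reliability(a) * detection_reliability(b)
         for a, b in zip(learn_resps, detect_resps)]
    total = sum(w)
    if total == 0:
        return [1.0 / len(w)] * len(w)  # assumption: uniform fallback
    return [v / total for v in w]


if __name__ == "__main__":
    # One dominant peak: a channel that votes clearly for a single location.
    resp_a = [[0, 0, 0, 0, 0],
              [0, 1.0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0.25, 0],
              [0, 0, 0, 0, 0]]
    # Two peaks of similar height: an ambiguous, weaker channel.
    resp_b = [[0, 0, 0, 0, 0],
              [0, 0.4, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0.3, 0],
              [0, 0, 0, 0, 0]]
    print(joint_reliability([resp_a, resp_b], [resp_a, resp_b]))
```

In the tracker the learning responses come from the training region at update time and the detection responses from the search region at localization time; the example passes the same maps for both only to keep the demonstration short.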
Localization step Update step
Spatial reliability
map estimation
Correlation
Weighted sum
Correlation filter Reliability Filter optimization &
channels weights weights estimation
Figure 6: The CSR-DCF tracking iteration: localization step is shown on the left and update step on the right side of
the image.
Localization step. Features are extracted from a search learning reliability (17). The filters and channel reliability
region centered at the target estimated position in the weights are updated by exponential moving average (cur-
previous time-step and correlated with the learned filter rent and from previous frame) with learning rate η (steps
ht−1 . The object is localized by summing the corre- 10 and 11 in the Algorithm 2). Note that we compute
lation responses weighted by the estimated channel re- the spatial reliability map in each frame independently to
liability scores wt−1 . The scale is estimated by a sin- capture large target appearance changes, e.g. caused by
gle scale-space correlation filter as in Danelljan et al rotation or deformation.
(2014a). Per-channel filter responses are used to compute
the corresponding detection reliability values w̃(det) = 3.5 Comparison with prior work
(det) (det)
[w̃1 , . . . , w̃Nc ]T according to (19).
Kiani Galoogahi et al (2015) and Danelljan et al (2015a)
Update step. The training region is centered at the tar- have previously considered constrained filter learning.
get location estimated at localization step. The foreground Here we highlight the differences of our approach.
and background histograms c̃ are extracted and updated The LBCF tracker (Kiani Galoogahi et al, 2015) ad-
by exponential moving average with learning rate ηc (step dresses the circular boundary effect of the Fourier trans-
5 in Algorithm 2). The foreground histogram is extracted form and implicitly increases the filter search region size.
by an Epanechnikov kernel within the estimated object In contrast, the CSR-DCF primarily reduces the impact
bounding box and the background is extracted from the of the background in the filter. The solution of Kiani Ga-
neighborhood twice the object size. The spatial reliability loogahi et al (2015) is similar to our filter optimization,
map m (Sect. 3.2) is constructed and the optimal filters h̃ but it is derived for a rectangular mask only. Since ro-
are computed by optimizing (8). The per-channel learning tating and deformable targets are poorly approximated by
(lrn) (lrn)
reliability weights w̃(lrn) = [w̃1 , . . . , w̃Nc ]T are esti- an axis-aligned bounding box their filter is contaminated
mated from the correlation responses (18). Current frame by background leading to a reduced performance. The
reliability weights w̃ are computed from detection and LBCF updates the auto-spectral and cross-spectral ener-
8
Algorithm 2 : The CSR-DCF tracking algorithm. to converge. In CSR-DCF the map serves as a hard con-
Require: straint resulting in a filter with values off the target set
Image It , object position on previous frame pt−1 , to zero. In contrast, the SRDCF (Danelljan et al, 2015a)
scale st−1 , filter ht−1 , color histograms ct−1 , chan- filter is a compromise between target position regression
nel reliability wt−1 . and a penalty term that prefers potentially non-zero val-
Ensure: ues in the filter center and close-to-zero values away from
Position pt , scale st and updated models. the center, but does not guarantee zero values outside the
Localization and scale estimation: mask.
1: New target location pt : position of the maximum in
correlation between ht−1 and image patch features f
extracted on position pt−1 and weighted by the chan- 4 Experimental analysis
nel reliability scores w (Sect. 3.3).
2: Using per-channel responses, estimate detection reli- This section presents a comprehensive experimental eval-
ability w̃(det) (Sect. 3.3). uation of the CSR-DCF tracker. Implementation details
3: Using location pt , estimate new scale st . are discussed in Section 4.1, convergence of the filter
Update: optimization method is presented in Section 4.2, Sec-
4: Extract foreground and background histograms c̃f , tion 4.3 reports comparison of the proposed constrained
c̃b . learning to the related state-of-the-art and the ablation
5: Update foreground and background histograms study is provided in Section 4.4. Tracking performance
cft = (1 − ηc )cft−1 + ηc c̃f , cbt = (1 − ηc )cbt−1 + ηc c̃b . on three recent benchmarks: OTB-100 (Wu et al, 2015),
VOT2015 (Kristan et al, 2015) and VOT2016 (Kristan
6: Estimate reliability map m (Sect. 3.2). et al, 2016b) is reported in Sections 4.6, 4.7 and 4.8, re-
7: Estimate a new filter h̃ using m (Algorithm 1). spectively. The detailed analysis of the tracker, includ-
8: Estimate learning channel reliability w̃(lrn) from h ing per-attribute tracking performance is presented in Sec-
(Sect. 3.3). tion 4.9 and tracking speed analysis in Section 4.10.
9: Calculate channel reliability w̃ = w̃(lrn) w̃(det)
10: Update filter ht = (1 − η)ht−1 + η h̃.
11: Update channel reliability wt = (1 − η)wt−1 + η w̃. 4.1 Implementation details and parameters
A popular implementation Felzenszwalb et al (2010) of
the standard HoG (Dalal and Triggs, 2005) and Color-
gies (f̂ f̂ and f̂ ĝ in (12)) separately, which approxi- names (van de Weijer et al, 2009) features are used in the
mates computation of a single filter from a weighted sum correlation filter and HSV foreground/background color
of errors over past training samples. This adaptation is histograms with 16 bins per color channel are used in re-
reasonable since it is derived for a rectangular mask that liability map estimation with parameter αmin = 0.05. All
remains constant throughout tracking. The CSR-DCF es- the parameters are set to values commonly used in litera-
timates the mask separately for each training sample and ture (Danelljan et al, 2015a; Kiani Galoogahi et al, 2015).
learns a corresponding filter. For articulated objects in Histogram adaptation rate is set to ηc = 0.04, correlation
particular the mask varies significantly with time, there- filter adaptation rate is set to η = 0.02, and the regu-
fore it is beneficial to compute the exact filter for each larization parameter is set to λ = 0.01. The augmented 0
frame. Robustness is increased by moderately averaging Lagrangian optimization parameters are set to µ = 5 and
the filters temporally. β = 3. All parameters have a straight-forward interpre-
tation, do not require fine-tuning, and were kept constant
Similarly to our approach, the SRDCF (Danelljan et al,
throughout all experiments. Our Matlab implementation1
2015a) uses a spatial map in filter learning. In contrast
to our approach, their map does not adapt to the target 1 The CSR-DCF Matlab source is publicly available on:
9
runs at 13 frames per second on an Intel Core i7 3.4GHz plemented to emphasize the difference in boundary con-
standard desktop. straints: the first uses our spatial reliability boundary con-
straint formulation from Section 3 (TSC ) the second ap-
4.2 Convergence of constrained learning plies the spatial regularization constraint (Danelljan et al,
2015a) (TSR ) and the third applies the limited boundaries
The constrained filter learning described in Section 3.1 is constraint (Kiani Galoogahi et al, 2015) (TLB ).
an iterative optimization method that minimizes the cost The three variants were compared on the challenging
function (8). This experiment demonstrates how the cost VOT2015 dataset (Kristan et al, 2015) by applying a stan-
changes with the number of iterations during filter opti- dard no-reset one-pass evaluation from OTB (Wu et al,
mization. 2013) and computing the AUC on the success plot. The
Figure 7 shows the average squared difference between tracker with our constraint formulation TSC achieved 0.32
the result of the correlation of the filter constrained by the AUC, while the alternatives achieved 0.28 (TSR ) and 0.16
spatially constrained function and the ideal output. This (TLB ). The only difference between these tackers is in the
graph was obtained by averaging 60 examples of initializing a filter on a target (one per VOT2015 sequence) and scaling each to an interval between zero and one. It is clear that the error drops by 80% within the first few iterations. Already after four iterations, the performance improvements become negligible, therefore we set the number of iterations to N = 4.

Figure 7: Convergence speed of constrained filter learning from Section 3.1 shown as a relative drop of the initial cost.

4.3 Impact of the boundary constraint formulation

This section compares our proposed boundary constraints formulation (Sect. 3) with recent state-of-the-art approaches (Danelljan et al, 2015a; Kiani Galoogahi et al, 2015). In the first experiment, three variants of the standard single-scale HoG-based correlation filter were im-

constraint formulation, which indicates superiority of the proposed spatial-reliability-based constraints formulation over the recent alternatives (Kiani Galoogahi et al, 2015; Danelljan et al, 2015a).

4.3.1 Robustness to non-axis-aligned target initialization

The CSR-DCF tracker from Section 3 was compared to the original recent state-of-the-art trackers SRDCF (Danelljan et al, 2015a) and LBCF (Kiani Galoogahi et al, 2015) that apply alternative boundary constraints. For fair comparison, the source code of SRDCF and LBCF was obtained from the authors, and all three trackers used only HoG features and tracked on the same single scale. An experiment was designed to evaluate initialization and tracking of non-axis-aligned targets, which is the case for most realistic deforming and non-circular objects. Trackers were initialized on frames with non-axis-aligned targets and left to track until the sequence end, resulting in a large number of tracking trajectories.

The VOT2015 dataset (Kristan et al, 2015) contains non-axis-aligned annotations, which allows automatic identification of tracker initialization frames, i.e., frames in which the ground truth bounding box significantly deviates from an axis-aligned approximation. Frames with overlap (intersection over union of predicted and ground-truth bounding boxes) of the ground truth and the axis-aligned approximation lower than 0.5 were identified and filtered to obtain a set of initialization frames at least one hundred frames apart. This constraint fits half the typical short-term sequence length (Kristan et al, 2015) and reduces the potential correlation across the initializations
Figure 9: Qualitative results for the CSR-DCF (red), SRDCF (blue) and LBCF (green) trackers.
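The selection of initialization frames described in Section 4.3.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the axis-aligned approximation is assumed to be the min/max box enclosing the four ground-truth corners (the text does not say how the approximation is built), and the one-hundred-frame spacing is assumed to be enforced by a greedy in-order pass. Under the first assumption the rotated box lies entirely inside its approximation, so the overlap reduces to a ratio of areas. All function names are hypothetical.

```python
import math


def polygon_area(corners):
    """Area of a simple polygon given as (x, y) corners, via the shoelace formula."""
    area = 0.0
    n = len(corners)
    for i in range(n):
        x1, y1 = corners[i]
        x2, y2 = corners[(i + 1) % n]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0


def overlap_with_axis_aligned(corners):
    """Overlap (IoU) of a rotated box with its enclosing min/max box (assumed approximation).

    The rotated box is contained in the enclosing box, so the intersection is the
    rotated box itself and the union is the enclosing box.
    """
    xs = [x for x, _ in corners]
    ys = [y for _, y in corners]
    box_area = (max(xs) - min(xs)) * (max(ys) - min(ys))
    if box_area == 0.0:
        return 0.0
    return polygon_area(corners) / box_area


def select_init_frames(annotations, max_overlap=0.5, min_gap=100):
    """Greedy pass (an assumption): keep frames with overlap below max_overlap
    that are at least min_gap frames after the previously kept frame."""
    selected = []
    for idx, corners in enumerate(annotations):
        if overlap_with_axis_aligned(corners) >= max_overlap:
            continue
        if selected and idx - selected[-1] < min_gap:
            continue
        selected.append(idx)
    return selected


def rotated_rect(cx, cy, w, h, angle_deg):
    """Helper for experiments: corners of a w-by-h rectangle rotated about its center."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    half = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + x * c - y * s, cy + x * s + y * c) for x, y in half]
```

For an elongated target, e.g. `rotated_rect(50, 50, 40, 10, 45)`, the overlap with the enclosing box is well below 0.5, whereas an upright box gives an overlap of one and is never selected.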
fluence of background in filter learning resulting in considerable robustness to poor initializations.

4.4 Spatial and channel reliability ablation study

An ablation study on VOT2016 was conducted to evaluate the contribution of spatial and channel reliability in CSR-DCF. Results of the VOT primary measure, expected average overlap (EAO), and two supplementary measures, accuracy and robustness (A, R), are summarized in Table 2. For the details of performance measures and evaluation protocol we refer the reader to Section 4.7. Performance of the various modifications of CSR-DCF is discussed in the following.

Table 2: Ablation study of CSR-DCF. The use of channel reliability is indicated in the Chan. column, the type of spatial reliability map in the Spat. column. The Opt. column indicates whether the constrained optimization is used.

Tracker        Chan.   Spat.   Opt.   EAO       Rav      Aav
CSR-DCF        ✓       segm.   ✓      ① 0.338   ① 0.85   ① 0.51
CSR-DCFc−      –       segm.   ✓      ② 0.297   ② 1.08   ② 0.51
CSR-DCFsu      ✓       unif.   ✓      ③ 0.264   ③ 1.18   ③ 0.49
CSR-DCFc−su    –       unif.   ✓      0.256     1.33     ② 0.51
CSR-DCFc−o−    –       segm.   –      0.251     1.47     ② 0.51
CSR-DCFc−s−    –       –       –      0.152     2.85     0.47

Channel reliability weights. Setting the channel reliability weights to uniform values (CSR-DCFc−) is equivalent to treating all channels as independent and equally important. The performance drop in EAO compared to CSR-DCF is 12%.

Spatial reliability map. Replacing the spatial reliability map in CSR-DCF by a constant map with uniform values within the bounding box and zeros elsewhere (CSR-DCFsu) results in a 21% drop in EAO. The other parts of the tracker remained unchanged in this experiment, including the channel reliability. This clearly shows the importance of our segmentation-based spatial reliability map estimation from Section 3.2.

Channel and spatial reliability. Making both replacements in the original tracker means that this version (CSR-DCFc−su) does not use channel reliability weights and uses a uniform spatial reliability map (uniform values within the bounding box and zeros elsewhere). The performance drops by 24% compared to CSR-DCF. Removal of the uniform spatial reliability map from CSR-DCFc−su results in CSR-DCFc−s−. This version reduces our tracker to a standard DCF with a large receptive field. Since the learned filter captures a significant amount of background, the performance drops by over 50%.

ADMM Filter optimization method. To demonstrate the importance of the constrained optimization method we modify the proposed tracker as follows. The filter h is calculated with a naive approach, i.e., a closed-form solution followed by masking with the spatial reliability map m: $\hat{\mathbf{h}} = \mathcal{F}(\mathcal{F}^{-1}(\hat{\mathbf{h}}) \odot \mathbf{m})$. For a fair comparison the tracker, denoted as CSR-DCFc−o−, does not use channel reliability weights. The performance drop in EAO compared to CSR-DCFc− is 15%.

4.5 Spatial reliability map quality analysis

In this section we evaluate the quality of our spatial reliability map estimation (Section 3.2) from a visual tracking perspective. We compare the CSR-DCF tracker with a version of CSR-DCF that uses an ideal spatial reliability map (this tracker is denoted as CSR*-DCF). In the VOT2016 challenge (Kristan et al, 2016b), the ground truth bounding boxes were automatically computed by optimizing coverage over manually segmented targets in each frame. The VOT2016 per-frame segmentations have recently been made freely available (Vojir and Matas, 2017). We use these per-frame segmentation masks in CSR*-DCF as the spatial reliability map m.

Results of evaluation on VOT2016 (Kristan et al, 2016b) are reported in Table 3. The performances of CSR-DCF and CSR*-DCF are very similar. The trackers achieve an equal expected average overlap (EAO) and average accuracy (Aav). However, CSR*-DCF has a single failure fewer than CSR-DCF over the 60 sequences, which is approximately 0.02 on average. In Table 3 the average number of failures is denoted as robustness (Rav). These results show that our approach for spatial reliability estimation (Section 3.2) generates near-ideal maps from a tracking perspective.

Figure 10 qualitatively compares the spatial reliabil-
the maps are different. But from the perspective of tracking they are nearly equivalent since the tracking performance remains unchanged. For example, in the case of a basketball player, the legs are not well segmented by our approach. But since the legs constantly move, they are in fact non-informative for object localization from the perspective of the correlation filter template matching and do not contribute to improved tracking.

Table 3: Tracking performance comparison of the two versions of CSR-DCF on VOT2016. The proposed method is denoted as CSR-DCF while the version using ground-truth segmentation masks instead of color-based spatial reliability map is denoted as CSR*-DCF.

Tracker     EAO     Aav    Rav
CSR-DCF     0.338   0.51   0.85
CSR*-DCF    0.338   0.51   0.83
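Both CSR-DCF and CSR*-DCF restrict the filter with a binary map m; they differ only in where m comes from. How such a map acts on a filter is easiest to see in the naive baseline of Section 4.4, where a closed-form solution is masked in the spatial domain. The sketch below illustrates that baseline for a single channel. It is an assumption-laden illustration, not the authors' implementation: the closed-form solution is taken to be the standard regularized Fourier-domain ridge solution, the model follows the appendix convention (filter multiplied by the feature spectrum), and `naive_masked_filter` and the value of `lam` are hypothetical.

```python
import numpy as np


def naive_masked_filter(f, g, m, lam=0.01):
    """Closed-form filter followed by spatial masking: h_hat = F(F^-1(h_hat) * m).

    f   : 2-D feature channel (training patch)
    g   : 2-D desired response
    m   : 2-D binary spatial reliability map
    lam : ridge regularization weight (illustrative value, an assumption)
    """
    f_hat = np.fft.fft2(f)
    g_hat = np.fft.fft2(g)
    # Assumed unconstrained closed-form solution, element-wise in the Fourier domain.
    h_hat = np.conj(f_hat) * g_hat / (np.conj(f_hat) * f_hat + lam)
    # Naive constraint: go back to the spatial domain, zero everything outside m.
    h = np.real(np.fft.ifft2(h_hat)) * m
    return np.fft.fft2(h)


# Toy example: random patch, Gaussian desired response, rectangular map.
rng = np.random.default_rng(0)
f = rng.standard_normal((32, 32))
yy, xx = np.mgrid[0:32, 0:32]
g = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 2.0 ** 2))
m = np.zeros((32, 32))
m[8:24, 10:22] = 1.0

h_hat = naive_masked_filter(f, g, m)
h_spatial = np.real(np.fft.ifft2(h_hat))  # zero outside the map, up to round-off
```

The masking step ignores the coupling between the filter values inside and outside m, which is what the constrained optimization is designed to account for.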
[Figure 11 plots not reproduced. Precision plots (precision vs. location error threshold), legend scores: CSR-DCF [0.733], SRDCF [0.725], MUSTER [0.709], Struck [0.599], TLD [0.550], SCM [0.540], CXT [0.521], CSK [0.496], ASLA [0.491], LSK [0.478]. Success plots (success rate vs. overlap threshold), legend scores: SRDCF [0.598], CSR-DCF [0.587], MUSTER [0.572], Struck [0.463], SCM [0.446], TLD [0.427], CXT [0.414], ASLA [0.410], LSK [0.386], CSK [0.386].]

Figure 11: Evaluation on OTB100 (Wu et al, 2015) benchmark.
lenging sequences. In contrast to related benchmarks, the VOT2015 dataset was constructed from over 300 sequences by an advanced sequence selection methodology that favors objects difficult to track and maximizes a visual attribute diversity cost function (Kristan et al, 2015). This makes it arguably the most challenging sequence set available. The VOT methodology (Kristan et al, 2016c) resets a tracker upon failure to fully use the dataset. The basic VOT measures are the number of failures during tracking (robustness) and average overlap during the periods of successful tracking (accuracy), while the primary VOT2015 measure is the expected average overlap (EAO) on short-term sequences. The latter can be thought of as the expected no-reset average overlap (AUC in OTB methodology), but with reduced bias and variance as explained in (Kristan et al, 2015).

Figure 12 shows the VOT EAO plots with the CSR-DCF and the VOT2015 state-of-the-art approaches under the VOT2016 rules, which exclude trackers learned on video sequences related to VOT to prevent over-fitting. The CSR-DCF outperforms all trackers and achieves the top rank. The CSR-DCF significantly outperforms the related correlation filter trackers like SRDCF (Danelljan et al, 2015a) as well as trackers that apply computationally intensive state-of-the-art deep features, e.g., deepSRDCF (Danelljan et al, 2015b) and SO-DLT (Wang et al, 2015b). For completeness, detailed results for the ten top-performing trackers are shown in Table 4.

Figure 12: Expected average overlap (EAO) plot for CSR-DCF (#1) and all trackers participating in the VOT 2015 (Kristan et al, 2015) benchmark listed below the plot in alphabetical order with their numerical codes.

Table 4: The ten top-performing trackers on the VOT2015 benchmark.

Tracker      EAO       Aav      Rav
CSR-DCF      ① 0.320   ③ 0.55   ② 0.93
DeepSRDCF    ② 0.318   ② 0.56   ③ 1.00
EBT          ③ 0.313   0.45     ① 0.81
srdcf        0.288     ③ 0.55   1.18
LDP          0.278     0.49     1.30
sPST         0.277     0.54     1.42
scebt        0.255     0.54     1.72
nsamf        0.254     0.53     1.45
struck       0.246     0.46     1.50
rajssc       0.242     ① 0.57   1.75

4.8 The VOT2016 benchmark (Kristan et al, 2016b)

Finally, we assess our tracker on the most recent visual tracking benchmark, VOT2016 (Kristan et al, 2016b). The dataset contains 60 sequences from VOT2015 (Kris-
performing trackers come from various classes e.g., correlation filter methods: CCOT (Danelljan et al, 2016),
[Figure 14 plot not reproduced. Legend: CSR-DCF, CCOT, TCNN, SSAT, MLDF, Staple, Staple+, DDC, EBT, SRBT, DNT. Attribute axes with their scales: Overall (0.27, 0.34), Unassigned (0.10, 0.21), Camera motion (0.28, 0.37), Motion change (0.20, 0.46), Illumination change (0.20, 0.46), Size change (0.25, 0.38), Occlusion (0.16, 0.29).]

Figure 14: Expected averaged overlap performance on different visual attributes on the VOT2016 (Kristan et al, 2016b) benchmark. The CSR-DCF and the top 10 performing trackers from VOT2016 are shown. The scales of visual attribute axes are shown below the attribute labels.

applies deep ConvNets, with respect to VOT measures, while being 20 times faster than the CCOT. The CCOT was modified by replacing the computationally intensive deep features with the same simple features used in CSR-DCF. The resulting tracker, indicated by CCOT*, is still ten times slower than CSR-DCF, while the performance drops by over 15%. The CSR-DCF performs twice as fast as the related SRDCF (Danelljan et al, 2015a), while achieving approximately 25% better tracking results. The speed of baseline real-time trackers like DSST (Danelljan et al, 2014a) and Struck (Hare et al, 2011) is comparable to CSR-DCF, but their tracking performance is significantly poorer. The fastest compared tracker, KCF (Henriques et al, 2015), runs much faster than real-time, but delivers a significantly poorer performance than CSR-DCF.

The experiments show that the CSR-DCF tracks comparably to the state-of-the-art trackers which apply computationally demanding high-dimensional features, but runs considerably faster and delivers top tracking performance among the real-time trackers.

Table 6: Speed in frames per second (fps) of correlation trackers and the Struck baseline. The EAO, average accuracy (Aav) and average failures (Rav) are shown for reference.

Tracker            EAO       Aav      Rav      fps
CSR-DCF            ① 0.338   ② 0.51   ① 0.85   ③ 13.0
CCOT (ECCV2016)    ② 0.331   ① 0.52   ① 0.85   0.6
CCOT* (ECCV2016)   ③ 0.274   ① 0.52   ② 1.18   1.0
SRDCF (ICCV2015)   0.247     ① 0.52   ③ 1.50   7.3
KCF (PAMI2015)     0.192     ③ 0.48   2.03     ① 115.7
DSST (PAMI2016)    0.181     ③ 0.48   2.52     ② 18.6
Struck (ICCV2011)  0.142     0.42     3.37     8.5

The average speed of our tracker measured on the VOT 2016 dataset is approximately 13 frames per second² or 77 milliseconds per frame. Figure 15 shows the processing time required by each step of the CSR-DCF. A tracking iteration is divided into two steps: (i) target localization and (ii) the visual model update. Target localization takes on average 35 milliseconds at each frame and is composed of two sub-steps: estimation of object translation (23ms) and scale change estimation (12ms). The visual model update step takes on average 42 milliseconds. It consists of three sub-steps: spatial reliability map estimation (16ms), filter update (12ms) and scale model update (14ms). Filter optimization, which is part of the filter update step, takes on average 7 milliseconds.

4.11 Qualitative evaluation

Figure 16 shows four examples of tracking with the CSR-DCF. In the following we describe tracking performance on each sequence.

The first example shows tracking of an octopus along with channel reliability weights. The first eighteen weights correspond to HoG channels, the 19th weight is the reliability of a grayscale template and the last ten weights correspond to colornames. Note that the colors in boxes are not the actual colors of the colornames, because these features are a subspace of the original colornames, designed to

² With some basic code optimization and refactoring we sped up our algorithm to 19 FPS without a significant performance drop (only one additional failure on the VOT2016 dataset).
[Figure 15 not reproduced. Recoverable labels: Localization, comprising Translation (23ms) and Scale estimation (12ms).]

5 Conclusion
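grangian from Equation (8) is

$$\mathcal{L}(\hat{\mathbf{h}}_c, \mathbf{h}, \hat{\mathbf{l}}) = \|\mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \hat{\mathbf{g}}\|^2 + \frac{\lambda}{2}\|\mathbf{h}_m\|^2 + \left[\hat{\mathbf{l}}^H(\hat{\mathbf{h}}_c - \hat{\mathbf{h}}_m) + \overline{\hat{\mathbf{l}}^H(\hat{\mathbf{h}}_c - \hat{\mathbf{h}}_m)}\right] + \mu\|\hat{\mathbf{h}}_c - \hat{\mathbf{h}}_m\|^2, \tag{20}$$

with $\mathbf{h}_m = (\mathbf{m} \odot \mathbf{h})$. For the purposes of derivation we will rewrite (20) into a fully vectorized form

$$\mathcal{L}(\hat{\mathbf{h}}_c, \mathbf{h}, \hat{\mathbf{l}}) = \|\mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \hat{\mathbf{g}}\|^2 + \frac{\lambda}{2}\|\mathbf{h}_m\|^2 + \hat{\mathbf{l}}^H(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}) + \overline{\hat{\mathbf{l}}^H(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h})} + \mu\|\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\|^2, \tag{21}$$

where $\mathbf{F}$ denotes the $D \times D$ orthonormal matrix of Fourier coefficients, such that the Fourier transform is defined as $\hat{\mathbf{x}} = \mathcal{F}(\mathbf{x}) = \sqrt{D}\mathbf{F}\mathbf{x}$, and $\mathbf{M} = \mathrm{diag}(\mathbf{m})$. For clearer representation we denote the four terms in the summation (21) as

$$\mathcal{L}(\hat{\mathbf{h}}_c, \mathbf{h}, \hat{\mathbf{l}}) = \mathcal{L}_1 + \mathcal{L}_2 + \mathcal{L}_3 + \mathcal{L}_4, \tag{22}$$

where

$$\mathcal{L}_1 = \overline{\left(\mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \hat{\mathbf{g}}\right)}^T \left(\mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \hat{\mathbf{g}}\right), \tag{23}$$

$$\mathcal{L}_2 = \frac{\lambda}{2}\|\mathbf{h}_m\|^2, \tag{24}$$

$$\mathcal{L}_3 = \hat{\mathbf{l}}^H(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}) + \overline{\hat{\mathbf{l}}^H(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h})}, \tag{25}$$

$$\mathcal{L}_4 = \mu\|\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\|^2. \tag{26}$$

Minimization of Equation (8) in Section 3.1 is an iterative process in which the following minimizations are required:

$$\hat{\mathbf{h}}_c^{\mathrm{opt}} = \arg\min_{\hat{\mathbf{h}}_c} \mathcal{L}(\hat{\mathbf{h}}_c, \mathbf{h}, \hat{\mathbf{l}}), \tag{27}$$

$$\mathbf{h}^{\mathrm{opt}} = \arg\min_{\mathbf{h}} \mathcal{L}(\hat{\mathbf{h}}_c^{\mathrm{opt}}, \mathbf{h}, \hat{\mathbf{l}}). \tag{28}$$

Minimization w.r.t. $\hat{\mathbf{h}}_c$ is derived by finding $\hat{\mathbf{h}}_c$ at which the complex gradient of the augmented Lagrangian vanishes, i.e.,

$$\nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L} \equiv 0, \tag{29}$$

$$\nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_1 + \nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_2 + \nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_3 + \nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_4 \equiv 0. \tag{30}$$

The partial complex gradients are:

$$\begin{aligned}
\nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_1 &= \frac{\partial}{\partial \overline{\hat{\mathbf{h}}}_c}\left[\overline{\left(\mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \hat{\mathbf{g}}\right)}^T\left(\mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \hat{\mathbf{g}}\right)\right] \\
&= \frac{\partial}{\partial \overline{\hat{\mathbf{h}}}_c}\left[\overline{\hat{\mathbf{h}}}_c^T \mathrm{diag}(\hat{\mathbf{f}})^H \mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \overline{\hat{\mathbf{h}}}_c^T \mathrm{diag}(\hat{\mathbf{f}})^H \hat{\mathbf{g}} - \hat{\mathbf{g}}^H \mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c + \hat{\mathbf{g}}^H \hat{\mathbf{g}}\right] \\
&= \mathrm{diag}(\hat{\mathbf{f}})^H \mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \mathrm{diag}(\overline{\hat{\mathbf{f}}})\hat{\mathbf{g}},
\end{aligned} \tag{31}$$

$$\nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_2 = 0, \tag{32}$$

$$\begin{aligned}
\nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_3 &= \frac{\partial}{\partial \overline{\hat{\mathbf{h}}}_c}\left[\hat{\mathbf{l}}^H\left(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\right) + \overline{\hat{\mathbf{l}}^H\left(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\right)}\right] \\
&= \frac{\partial}{\partial \overline{\hat{\mathbf{h}}}_c}\left[\hat{\mathbf{l}}^H \hat{\mathbf{h}}_c - \hat{\mathbf{l}}^H \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h} + \hat{\mathbf{l}}^T \overline{\hat{\mathbf{h}}}_c - \hat{\mathbf{l}}^T \overline{\sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}}\right] \\
&= \hat{\mathbf{l}},
\end{aligned} \tag{33}$$

$$\begin{aligned}
\nabla_{\overline{\hat{\mathbf{h}}}_c} \mathcal{L}_4 &= \frac{\partial}{\partial \overline{\hat{\mathbf{h}}}_c}\left[\mu \overline{\left(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\right)}^T\left(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\right)\right] \\
&= \frac{\partial}{\partial \overline{\hat{\mathbf{h}}}_c}\left[\mu\left(\hat{\mathbf{h}}_c^H \hat{\mathbf{h}}_c - \hat{\mathbf{h}}_c^H \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h} - \sqrt{D}\mathbf{h}^T \mathbf{M}\mathbf{F}^H \hat{\mathbf{h}}_c + D\mathbf{h}^T \mathbf{M}\mathbf{F}^H \mathbf{F}\mathbf{M}\mathbf{h}\right)\right] \\
&= \mu\hat{\mathbf{h}}_c - \mu\sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}.
\end{aligned} \tag{34}$$

Note that $\sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h} = \hat{\mathbf{h}}_m$ according to our original definition of $\hat{\mathbf{h}}_m$. Plugging (31-34) into (30) yields

$$\mathrm{diag}(\hat{\mathbf{f}})^H \mathrm{diag}(\hat{\mathbf{f}})\hat{\mathbf{h}}_c - \mathrm{diag}(\overline{\hat{\mathbf{f}}})\hat{\mathbf{g}} + \hat{\mathbf{l}} + \mu\hat{\mathbf{h}}_c - \mu\hat{\mathbf{h}}_m = 0, \tag{35}$$

$$\hat{\mathbf{h}}_c = \frac{\mathrm{diag}(\overline{\hat{\mathbf{f}}})\hat{\mathbf{g}} + \mu\hat{\mathbf{h}}_m - \hat{\mathbf{l}}}{\mathrm{diag}(\hat{\mathbf{f}})^H \mathrm{diag}(\hat{\mathbf{f}}) + \mu},$$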
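which can be rewritten into

$$\hat{\mathbf{h}}_c = \frac{\overline{\hat{\mathbf{f}}} \odot \hat{\mathbf{g}} + \mu\hat{\mathbf{h}}_m - \hat{\mathbf{l}}}{\overline{\hat{\mathbf{f}}} \odot \hat{\mathbf{f}} + \mu}. \tag{36}$$

Next we derive the closed-form solution of (28). The optimal $\mathbf{h}$ is obtained when the complex gradient w.r.t. $\mathbf{h}$ vanishes, i.e.,

$$\begin{aligned}
\nabla_{\overline{\mathbf{h}}} \mathcal{L}_4 &= \frac{\partial}{\partial \overline{\mathbf{h}}}\left[\mu \overline{\left(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\right)}^T\left(\hat{\mathbf{h}}_c - \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h}\right)\right] \\
&= \frac{\partial}{\partial \overline{\mathbf{h}}}\left[\mu\left(\hat{\mathbf{h}}_c^H \hat{\mathbf{h}}_c - \hat{\mathbf{h}}_c^H \sqrt{D}\mathbf{F}\mathbf{M}\mathbf{h} - \sqrt{D}\mathbf{h}^H \mathbf{M}\mathbf{F}^H \hat{\mathbf{h}}_c + D\mathbf{h}^H \mathbf{M}\mathbf{h}\right)\right] \\
&= -\mu\sqrt{D}\mathbf{M}\mathbf{F}^H \hat{\mathbf{h}}_c + \mu D\mathbf{M}\mathbf{h}.
\end{aligned} \tag{43}$$

Plugging (39-43) into (38) yields

$$\frac{\lambda}{2}\mathbf{M}\mathbf{h} - \sqrt{D}\mathbf{M}\mathbf{F}^H \hat{\mathbf{l}} - \mu\sqrt{D}\mathbf{M}\mathbf{F}^H \hat{\mathbf{h}}_c + \mu D\mathbf{M}\mathbf{h} = 0, \tag{44}$$

$$\mathbf{M}\mathbf{h} = \mathbf{M}\frac{\sqrt{D}\mathbf{F}^H(\hat{\mathbf{l}} + \mu\hat{\mathbf{h}}_c)}{\frac{\lambda}{2} + \mu D}.$$

References

Babenko B, Yang MH, Belongie S (2011) Robust object tracking with online multiple instance learning. IEEE Trans Pattern Anal Mach Intell 33(8):1619–1632

Bertinetto L, Valmadre J, Golodetz S, Miksik O, Torr PHS (2016a) Staple: Complementary learners for real-time tracking. In: Comp. Vis. Patt. Recognition, pp 1401–1409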
Bertinetto L, Valmadre J, Henriques JF, Vedaldi A, Torr PH (2016b) Fully-convolutional siamese networks for object tracking. arXiv preprint arXiv:1606.09549

Bolme DS, Beveridge JR, Draper BA, Lui YM (2010) Visual object tracking using adaptive correlation filters. In: Comp. Vis. Patt. Recognition, IEEE, pp 2544–2550

Boyd S, Parikh N, Chu E, Peleato B, Eckstein J (2011) Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 3(1):1–122

Čehovin L, Leonardis A, Kristan M (2016) Visual object tracking performance measures revisited. IEEE Trans Image Proc 25(3):1261–1274

Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Comp. Vis. Patt. Recognition, vol 1, pp 886–893

Danelljan M, Häger G, Khan FS, Felsberg M (2014a) Accurate scale estimation for robust visual tracking. In: Proc. British Machine Vision Conference, pp 1–11

Danelljan M, Khan FS, Felsberg M, van de Weijer J (2014b) Adaptive color attributes for real-time visual tracking. In: 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp 1090–1097

Danelljan M, Hager G, Shahbaz Khan F, Felsberg M (2015a) Learning spatially regularized correlation filters for visual tracking. In: Int. Conf. Computer Vision, pp 4310–4318

Danelljan M, Häger G, Khan FS, Felsberg M (2015b) Convolutional features for correlation filter based visual tracking. In: IEEE International Conference on Computer Vision Workshop (ICCVW), pp 621–629

Danelljan M, Robinson A, Khan FS, Felsberg M (2016) Beyond correlation filters: learning continuous convolution operators for visual tracking. In: Proc. European Conf. Computer Vision, Springer, pp 472–488

Danelljan M, Häger G, Khan FS, Felsberg M (2017) Discriminative scale space tracking. IEEE Trans Pattern Anal Mach Intell 39(8):1561–1575

Dinh TB, Vo N, Medioni G (2011) Context tracker: Exploring supporters and distracters in unconstrained environments. In: Comp. Vis. Patt. Recognition, pp 1177–1184

Diplaros A, Vlassis N, Gevers T (2007) A spatially constrained generative model and an EM algorithm for image segmentation. IEEE Trans Neural Networks 18(3):798–808

Felzenszwalb P, Girshick R, McAllester D, Ramanan D (2010) Object detection with discriminatively trained part-based models. IEEE Trans Pattern Anal Mach Intell 32(9):1627–1645

Galoogahi HK, Sim T, Lucey S (2013) Multi-channel correlation filters. In: Int. Conf. Computer Vision, pp 3072–3079

Grabner H, Grabner M, Bischof H (2006) Real-time tracking via on-line boosting. In: Proc. British Machine Vision Conference, vol 1, pp 47–56

Hare S, Saffari A, Torr PHS (2011) Struck: Structured output tracking with kernels. In: Int. Conf. Computer Vision, IEEE Computer Society, Washington, DC, USA, pp 263–270

Henriques JF, Caseiro R, Martins P, Batista J (2012) Exploiting the circulant structure of tracking-by-detection with kernels. In: Fitzgibbon A, Lazebnik S, Perona P, Sato Y, Schmid C (eds) Proc. European Conf. Computer Vision, Springer Berlin Heidelberg, Berlin, Heidelberg, pp 702–715

Henriques JF, Caseiro R, Martins P, Batista J (2015) High-speed tracking with kernelized correlation filters. IEEE Trans Pattern Anal Mach Intell 37(3):583–596

Hester CF, Casasent D (1980) Multivariant technique for multiclass pattern recognition. Applied Optics 19(11):1758–1761

Hong Z, Chen Z, Wang C, Mei X, Prokhorov D, Tao D (2015) Multi-store tracker (muster): A cognitive psychology inspired approach to object tracking. In: Comp. Vis. Patt. Recognition, pp 749–758
Kalal Z, Mikolajczyk K, Matas J (2012) Tracking-learning-detection. IEEE Trans Pattern Anal Mach Intell 34(7):1409–1422

Kiani Galoogahi H, Sim T, Lucey S (2015) Correlation filters with limited boundaries. In: Comp. Vis. Patt. Recognition, pp 4630–4638

Kristan M, Pflugfelder R, Leonardis A, Matas J, Porikli F, Čehovin L, Nebehay G, Fernandez G, Vojir T, et al (2013) The visual object tracking vot2013 challenge results. In: Vis. Obj. Track. Challenge VOT2013, In conjunction with ICCV2013, pp 98–111

Kristan M, Pflugfelder R, Leonardis A, Matas J, Čehovin L, Nebehay G, Vojir T, Fernandez G, et al (2014) The visual object tracking vot2014 challenge results. In: Proc. European Conf. Computer Vision, pp 191–217

Kristan M, Matas J, Leonardis A, Felsberg M, Čehovin L, Fernandez G, Vojir T, Häger G, Nebehay G, Pflugfelder R, et al (2015) The visual object tracking vot2015 challenge results. In: Int. Conf. Computer Vision

Kristan M, Kenk VS, Kovačič S, Perš J (2016a) Fast image-based obstacle detection from unmanned surface vehicles. IEEE Transactions on Cybernetics 46(3):641–654

Kristan M, Leonardis A, Matas J, Felsberg M, Pflugfelder R, Čehovin L, Vojir T, Häger G, Lukežič A, Fernandez G, et al (2016b) The visual object tracking vot2016 challenge results. In: Proc. European Conf. Computer Vision

Kristan M, Matas J, Leonardis A, Vojir T, Pflugfelder R, Fernandez G, Nebehay G, Porikli F, Cehovin L (2016c) A novel performance evaluation methodology for single-target trackers. IEEE Trans Pattern Anal Mach Intell

Li Y, Zhu J (2014a) A scale adaptive kernel correlation filter tracker with feature integration. In: Proc. European Conf. Computer Vision, pp 254–265

Li Y, Zhu J (2014b) A scale adaptive kernel correlation filter tracker with feature integration. In: Proc. European Conf. Computer Vision, pp 254–265

Liang P, Blasch E, Ling H (2015) Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans Image Proc 24(12):5630–5644

Liu B, Huang J, Yang L, Kulikowski C (2011) Robust tracking using local sparse appearance model and k-selection. In: Comp. Vis. Patt. Recognition, pp 1313–1320

Liu S, Zhang T, Cao X, Xu C (2016) Structural correlation filter for robust visual tracking. In: Comp. Vis. Patt. Recognition, pp 4312–4320

Liu T, Wang G, Yang Q (2015) Real-time part-based visual tracking via adaptive correlation filters. In: Comp. Vis. Patt. Recognition, pp 4902–4912

Lukežič A, Č Zajc L, Kristan M (2017) Deformable parts correlation filters for robust visual tracking. IEEE Transactions on Cybernetics PP(99):1–13

Ma C, Huang JB, Yang X, Yang MH (2015) Hierarchical convolutional features for visual tracking. In: Int. Conf. Computer Vision, pp 3074–3082

Mueller M, Smith N, Ghanem B (2016) A benchmark and simulator for UAV tracking. In: Proc. European Conf. Computer Vision

Nam H, Han B (2016) Learning multi-domain convolutional neural networks for visual tracking. In: Comp. Vis. Patt. Recognition, pp 4293–4302

Qi Y, Zhang S, Qin L, Yao H, Huang Q, Lim J, Yang MH (2016) Hedged deep tracking. In: CVPR, pp 4303–4311

Smeulders A, Chu D, Cucchiara R, Calderara S, Dehghan A, Shah M (2014) Visual tracking: An experimental survey. IEEE Trans Pattern Anal Mach Intell 36(7):1442–1468

Vojir T, Matas J (2017) Pixel-wise object segmentations for the VOT 2016 dataset. Research Report CTU–CMP–2017–01, Center for Machine Perception, K13133 FEE Czech Technical University, Prague, Czech Republic
Wang L, Ouyang W, Wang X, Lu H (2015a) Visual tracking with fully convolutional networks. In: Int. Conf. Computer Vision, pp 3119–3127

Wang N, Li S, Gupta A, Yeung D (2015b) Transferring rich feature hierarchies for robust visual tracking. CoRR abs/1501.04587

Wang S, Zhang S, Liu W, Metaxas DN (2016) Visual tracking with reliable memories. In: Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pp 3491–3497
Figure 16: Qualitative results of tracking with the CSR-DCF on four video sequences.
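The closed-form updates derived in the appendix, Eq. (36) for the Fourier-domain filter and the solution for Mh that follows Eq. (44), can be combined into a short single-channel sketch. It is an illustration under stated assumptions rather than the authors' implementation. It follows the appendix convention, in which the filter multiplies the feature spectrum. With the unnormalized FFT playing the role of √D·F, the h update becomes m ⊙ F⁻¹(l̂ + µĥc) / (λ/(2D) + µ). The multiplier update, the growth of µ by a factor `beta`, the zero initialization, taking the real part of the inverse transform, and the values of `lam`, `mu` and `beta` are all assumptions in the style of standard ADMM and are not taken from this section; only the N = 4 iterations come from the text. The function name is hypothetical.

```python
import numpy as np


def constrained_filter_admm(f, g, m, lam=0.01, mu=5.0, beta=3.0, n_iter=4):
    """Single-channel sketch of the alternating closed-form updates.

    f, g : 2-D float arrays (feature channel and desired response)
    m    : 2-D binary spatial reliability map
    lam, mu, beta : illustrative values (assumptions)
    n_iter : number of iterations (N = 4 in the text)
    """
    D = f.size                       # number of filter elements
    f_hat = np.fft.fft2(f)           # unnormalized FFT, i.e. sqrt(D) * F
    g_hat = np.fft.fft2(g)
    h = np.zeros(f.shape)            # spatial filter, zero-initialized (assumption)
    l_hat = np.zeros_like(f_hat)     # complex Lagrange multiplier
    h_c_hat = np.zeros_like(f_hat)

    for _ in range(n_iter):
        h_m_hat = np.fft.fft2(m * h)
        # Eq. (36): element-wise closed-form update of the Fourier-domain filter.
        h_c_hat = (np.conj(f_hat) * g_hat + mu * h_m_hat - l_hat) / (
            np.conj(f_hat) * f_hat + mu
        )
        # Update of h from the solution for Mh, rewritten with the inverse FFT.
        h = m * np.real(np.fft.ifft2(l_hat + mu * h_c_hat)) / (lam / (2.0 * D) + mu)
        # Standard ADMM multiplier and penalty updates (assumptions).
        l_hat = l_hat + mu * (h_c_hat - np.fft.fft2(m * h))
        mu *= beta

    return h, h_c_hat


# Toy example: random patch, Gaussian desired response, rectangular map.
rng = np.random.default_rng(0)
f = rng.standard_normal((32, 32))
yy, xx = np.mgrid[0:32, 0:32]
g = np.exp(-((yy - 16) ** 2 + (xx - 16) ** 2) / (2 * 2.0 ** 2))
m = np.zeros((32, 32))
m[8:24, 10:22] = 1.0

h, h_c_hat = constrained_filter_admm(f, g, m)
```

By construction the returned spatial filter `h` is zero wherever `m` is zero, so the support constraint holds exactly at every iteration, while `h_c_hat` is the unconstrained Fourier-domain variable that the iterations pull towards the transform of `m * h`.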