Zero-Shot Open-Vocabulary Tracking With Large Pre-Trained Models
Wen-Hsuan Chu1, Adam W. Harley2, Pavel Tokmakov3, Achal Dave3, Leonidas Guibas2, Katerina Fragkiadaki1
Abstract— Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for object tracking of specific object

exceeding their accuracy. Recent methods for visual tracking build upon transformer architectures, where feature vectors represent tracked objects, and these are re-contextualized in
Fig. 1: Architecture of OVTracktor. An open-vocabulary detector detects objects and an open-world segmenter segments
their masks. We propagate boxes to the next frame or decide their termination using an optical flow-based motion model.
The propagated boxes are refined with the detector’s box regression module. Refined boxes are used to prompt the segmenter
in the next frame. The detections and their associated appearance features from the next frame are used to determine whether
new tracks should be spawned or merged with previously terminated tracks.
consider object rotation or anisotropic scaling. The category label and instance ID of the box are maintained.

c) Object track termination: We determine whether an object track should be terminated due to occlusion by checking whether the ratio of forward-backward flow-consistent pixels falls below a fixed ratio λ_flow, or whether the objectness score of the box is too low.
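As a concrete illustration, the flow-consistency part of this test can be sketched as follows (a minimal NumPy example; the nearest-neighbor warping, the tolerance, and the default λ_flow value are our assumptions rather than the paper's exact implementation, and the objectness-score check is omitted):

import numpy as np

def should_terminate(flow_fwd, flow_bwd, mask, lambda_flow=0.5, tol=1.5):
    """Terminate a track when too few of its pixels pass the
    forward-backward flow consistency check.

    flow_fwd: (H, W, 2) optical flow from frame t to t+1.
    flow_bwd: (H, W, 2) optical flow from frame t+1 to t.
    mask:     (H, W) boolean object mask at frame t.
    Default lambda_flow and tol are illustrative placeholders.
    """
    H, W = mask.shape
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return True  # empty mask: nothing left to track

    # Follow the forward flow into frame t+1 (nearest-neighbor lookup).
    x_fwd = np.clip(np.round(xs + flow_fwd[ys, xs, 0]).astype(int), 0, W - 1)
    y_fwd = np.clip(np.round(ys + flow_fwd[ys, xs, 1]).astype(int), 0, H - 1)

    # Follow the backward flow back into frame t.
    x_back = x_fwd + flow_bwd[y_fwd, x_fwd, 0]
    y_back = y_fwd + flow_bwd[y_fwd, x_fwd, 1]

    # A pixel is flow-consistent if the round trip lands near its start.
    err = np.hypot(x_back - xs, y_back - ys)
    consistent_ratio = float(np.mean(err < tol))

    # Occluded pixels fail the round trip, so a low ratio signals occlusion.
    return consistent_ratio < lambda_flow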
d) Object box refinement: We refine the propagated (non-terminated) boxes at frame t+1 using Detic's bounding box regression module. This adjusts the bounding boxes according to objectness cues on that frame and gives us higher-quality box estimates in frame t+1.
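For illustration, such a refinement applies the detector's predicted regression deltas to the propagated boxes. Below is a sketch of the standard R-CNN-style box-delta transform; we assume Detic's regression head follows this common parameterization, and the function name and shapes are our own:

import numpy as np

def apply_box_deltas(boxes, deltas):
    """Refine (x0, y0, x1, y1) boxes with predicted regression deltas.

    boxes:  (N, 4) propagated boxes at frame t+1.
    deltas: (N, 4) per-box (dx, dy, dw, dh) predicted by the detector's
            box regression module (parameterization assumed, not verified
            against Detic's code).
    """
    widths = boxes[:, 2] - boxes[:, 0]
    heights = boxes[:, 3] - boxes[:, 1]
    ctr_x = boxes[:, 0] + 0.5 * widths
    ctr_y = boxes[:, 1] + 0.5 * heights

    dx, dy, dw, dh = deltas[:, 0], deltas[:, 1], deltas[:, 2], deltas[:, 3]
    new_ctr_x = ctr_x + dx * widths
    new_ctr_y = ctr_y + dy * heights
    new_w = widths * np.exp(dw)    # scale clamping omitted for brevity
    new_h = heights * np.exp(dh)

    return np.stack([
        new_ctr_x - 0.5 * new_w, new_ctr_y - 0.5 * new_h,
        new_ctr_x + 0.5 * new_w, new_ctr_y + 0.5 * new_h,
    ], axis=1)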
e) Temporally consistent object segmentation: The bounding box estimates at frame t+1 are used to prompt SAM to segment the object's interior. SAM produces multiple segmentation mask candidates per box, to handle ambiguity regarding what to segment from the box's interior. Overall, we found that a box prompt often unambiguously determines the object to segment (so all resulting masks will be identical), in contrast to a center object point prompt, which carries no information about the object's extent. To handle the cases where this ambiguity exists, we implement a form of temporal cycle consistency at the mask level.

SAM segments an object via iterative attention between an object query vector and pixel features, followed by a final inner product between the contextualized query vector and the pixel feature vectors. The three segmentation hypotheses correspond to different object query initializations. For each box i, we use the updated (contextualized) object query vector at frame t+1 to segment the object at frame t via an inner product with the pixel features from frame t; this results in a temporally corresponding mask m̂_i^t. We select the SAM segmentation hypothesis at frame t+1 whose query-driven segmentation m̂_i^t has the highest Intersection over Union (IoU) with m_i^t. We then update the object boxes to tightly contain the resulting segmentation mask.
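The hypothesis selection itself reduces to an IoU argmax once the backward masks m̂_i^t are available. A minimal NumPy sketch follows; the array names and shapes are our own, and we assume the backward masks have already been produced by applying each candidate's contextualized query to the frame-t pixel features:

import numpy as np

def mask_iou(a, b):
    """IoU between two boolean masks of shape (H, W)."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def select_cycle_consistent_mask(cand_masks_tp1, back_masks_t, mask_t):
    """Pick the SAM hypothesis at frame t+1 whose backward segmentation
    best matches the track's mask at frame t.

    cand_masks_tp1: (K, H, W) boolean SAM candidates at frame t+1.
    back_masks_t:   (K, H, W) boolean masks at frame t, one per candidate,
                    obtained from the contextualized queries (assumed given).
    mask_t:         (H, W) boolean mask m_i^t of the track at frame t.
    """
    ious = [mask_iou(back, mask_t) for back in back_masks_t]
    best = int(np.argmax(ious))
    chosen = cand_masks_tp1[best]

    # Update the box to tightly contain the chosen mask (x0, y0, x1, y1).
    ys, xs = np.nonzero(chosen)
    box = (xs.min(), ys.min(), xs.max(), ys.max()) if len(xs) else None
    return chosen, box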
f) Spawning new object tracks: At each frame, we need to take into account new objects that enter the scene or reappear after an occlusion. For each detection in D^{t+1}, we compute its IoU with all the masks in M^t. A new track is spawned if the IoU between the detection and every mask in M^t is below a specified threshold λ_spawn.
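In code, the spawning rule is a per-detection IoU test against the currently tracked masks (a hedged sketch; comparing detection masks rather than boxes and the default λ_spawn value are our own illustrative choices):

import numpy as np

def spawn_new_tracks(det_masks_tp1, track_masks_t, lambda_spawn=0.3):
    """Return indices of detections at frame t+1 that should start new tracks.

    det_masks_tp1: (N, H, W) boolean masks of the detections in D^{t+1}.
    track_masks_t: (M, H, W) boolean masks of the existing tracks in M^t.
    A detection spawns a new track only if its IoU with every existing
    track mask is below lambda_spawn (default value is a placeholder).
    """
    new_track_ids = []
    for i, det in enumerate(det_masks_tp1):
        ious = []
        for trk in track_masks_t:
            inter = np.logical_and(det, trk).sum()
            union = np.logical_or(det, trk).sum()
            ious.append(inter / union if union > 0 else 0.0)
        if len(ious) == 0 or max(ious) < lambda_spawn:
            new_track_ids.append(i)
    return new_track_ids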
g) Track re-identification: We use appearance feature matching to determine whether to merge a new track with an existing but terminated track. We store a small window of features before a track's termination and compare them with the features of newly spawned tracks. Newly spawned tracks are considered for Re-ID until T_reid time-steps have passed. We use the box features from Detic (before the box regression module) and normalize them along the channel dimension to obtain a small set of features that represent each instance. We then compute the inner product between the normalized appearance features of any two tracks and merge them if the value is above a threshold λ_reid.
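A sketch of this matching step is given below; we assume each track stores a (T, C) window of Detic box features, and the max-over-frame-pairs aggregation and the default threshold are our own illustrative choices:

import numpy as np

def reid_score(feats_terminated, feats_new, eps=1e-8):
    """Appearance similarity between a terminated track and a new track.

    feats_terminated: (T1, C) box features stored before termination.
    feats_new:        (T2, C) box features of the newly spawned track.
    Features are L2-normalized along the channel dimension; the score is the
    maximum inner product over all frame pairs (aggregation choice is ours).
    """
    a = feats_terminated / (np.linalg.norm(feats_terminated, axis=1, keepdims=True) + eps)
    b = feats_new / (np.linalg.norm(feats_new, axis=1, keepdims=True) + eps)
    return float((a @ b.T).max())

def should_merge(feats_terminated, feats_new, lambda_reid=0.8):
    # Merge the two tracks if their appearance similarity clears the
    # threshold (the value 0.8 is illustrative, not from the paper).
    return reid_score(feats_terminated, feats_new) >= lambda_reid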
None of OVTracktor's described components require any additional training. As detectors and segmenters improve, the components can easily be swapped out for potential improvements to the tracker.

Implementation details. During inference, we apply test-time augmentation to video frames when running the detector, scaling and horizontally flipping the individual frames to obtain better object proposals; we found this helps improve the recall of detections, especially in the harder open-world datasets. OVTracktor has the following hyper-parameters: λ_c = 0.5 for thresholding the detector's confidence, λ_flow for deciding track termination due to occlusion, λ_spawn for instantiating new objects that do not overlap with existing object tracks, and λ_reid for merging temporally non-overlapping tracks during re-identification. We found the model robust to the choice of these hyper-parameters, due to the nature of videos: an object needs to be detected confidently only very sparsely in time, and our propagation method will carry it forward. In evaluations, we used some videos in the training set to select a good set of hyper-parameters for the individual datasets. As memory consumption scales with the number of objects being tracked, we also put a hard limit K on the number of tracks that can co-exist at the same time.
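For concreteness, these knobs can be gathered into a single configuration object. In the sketch below, only λ_c = 0.5 is specified in the text; every other default value, and the field names themselves, are placeholders of ours:

from dataclasses import dataclass

@dataclass
class OVTracktorConfig:
    # Detection confidence threshold (value given in the text).
    lambda_c: float = 0.5
    # Forward-backward flow consistency ratio below which a track is
    # terminated (default is a placeholder).
    lambda_flow: float = 0.5
    # Max IoU with existing track masks for a detection to spawn a new
    # track (placeholder value).
    lambda_spawn: float = 0.3
    # Appearance similarity threshold for merging tracks in Re-ID
    # (placeholder value).
    lambda_reid: float = 0.8
    # Time-steps a newly spawned track stays eligible for Re-ID
    # (placeholder value).
    t_reid: int = 30
    # Hard cap K on co-existing tracks, for memory reasons (placeholder).
    max_tracks: int = 50
    # Whether to run test-time augmentation (multi-scale + horizontal flip)
    # in the detector; can be disabled for a speedup on easy scenes.
    use_tta: bool = True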
TABLE I: Tracking performance on BURST [10]. Higher is better. The best-performing method is bolded.

Method                | HOTA_all | HOTA_com | HOTA_unc
STCN Tracker [10, 48] |   5.5    |   17.5   |   2.5
Box Tracker [10]      |   8.2    |   27.0   |   3.6
also unsurprisingly hurts performance, since the bounding box refinement alone cannot recover the correct object box, especially for frames where the motion is large.

D. Running Time Analysis
We analyze the running time of the individual components of OVTracktor in Table IV. The results are averaged over all the frames in the videos. The model runs at 0.41 FPS on an Nvidia V100 and needs around 18 GB of VRAM for 480p videos, without caching anything to disk. Most of the running time is spent in the Detic detector, with the SAM segmenter coming in second. Test-time augmentations (TTA) incur a large overhead for Detic, and for scenes where detection is easy, noticeable speedups can be achieved by turning off TTA.
V. LIMITATIONS AND FUTURE WORK

A limitation of OVTracktor is its use of pre-trained features for feature matching when re-identifying object tracks. Empirically, we observed that these features are not necessarily temporally consistent when used directly out-of-the-box, leading to many mistakes in the Re-ID process. We will explore training more general-purpose re-id networks as an extension in future work. Augmenting SAM with extra modules [59] for tighter spatio-temporal reasoning, where mask query tokens attend to previous frames to be better conditioned on a track's history, is another interesting avenue of future work.

VI. CONCLUSION

We present OVTracktor, a zero-shot framework for open-vocabulary visual tracking that re-purposes modules of large-scale pre-trained models for object tracking in videos without any training or fine-tuning. Our model can be applied equally well to video object segmentation and multi-object tracking across various benchmarks and helps unify the video tracking and segmentation literature, which has often been fragmented by different evaluation protocols and ground-truth information at test time. Instead of specifying what to track with ground-truth object masks, which are hard to annotate and provide, our tracking framework offers a language interface over what to focus on and track in videos. Thanks to its simplicity, we hope that our model can serve as an upgradeable and extendable baseline for future work in the open-world and open-vocabulary video tracking literature.
REFERENCES

[1] M. Andriluka, S. Roth, and B. Schiele, "People-tracking-by-detection and people-detection-by-tracking," in CVPR, 2008.
[2] X. Weng and K. Kitani, "A baseline for 3d multi-object tracking," arXiv, 2019.
[3] X. Zhou, R. Girdhar, A. Joulin, P. Krähenbühl, and I. Misra, "Detecting twenty-thousand classes using image-level supervision," in ECCV, 2022.
[4] A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo, et al., "Segment anything," arXiv, 2023.
[5] H. Xu, J. Zhang, J. Cai, H. Rezatofighi, and D. Tao, "GMFlow: Learning optical flow via global matching," in CVPR, 2022.
[6] P. Bergmann, T. Meinhardt, and L. Leal-Taixé, "Tracking without bells and whistles," in CVPR, 2019.
[7] J. Pont-Tuset, F. Perazzi, S. Caelles, P. Arbelaez, A. Sorkine-Hornung, and L. Van Gool, "The 2017 DAVIS challenge on video object segmentation," arXiv, 2017.
[8] N. Xu, L. Yang, Y. Fan, J. Yang, D. Yue, Y. Liang, B. L. Price, S. Cohen, and T. S. Huang, "YouTube-VOS: Sequence-to-sequence video object segmentation," in ECCV, 2018.
[9] W. Wang, M. Feiszli, H. Wang, and D. Tran, "Unidentified video objects: A benchmark for dense, open-world segmentation," CoRR, 2021.
[10] A. Athar, J. Luiten, P. Voigtlaender, T. Khurana, A. Dave, B. Leibe, and D. Ramanan, "BURST: A benchmark for unifying object recognition, segmentation and tracking in video," in WACV, 2023.
[11] M. Vecerik, C. Doersch, Y. Yang, T. Davchev, Y. Aytar, G. Zhou, R. Hadsell, L. Agapito, and J. Scholz, "RoboTAP: Tracking arbitrary points for few-shot visual imitation," arXiv, 2023.
[12] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, "Online multi-person tracking-by-detection from a single, uncalibrated camera," TPAMI, 2011.
[13] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in NeurIPS, 2015.
[14] T. Meinhardt, A. Kirillov, L. Leal-Taixé, and C. Feichtenhofer, "TrackFormer: Multi-object tracking with transformers," CoRR, 2021.
[15] A. Athar, J. Luiten, A. Hermans, D. Ramanan, and B. Leibe, "HODOR: High-level object descriptors for object re-segmentation in video learned from static images," in CVPR, 2022.
[16] P. Sun, Y. Jiang, R. Zhang, E. Xie, J. Cao, X. Hu, T. Kong, Z. Yuan, C. Wang, and P. Luo, "TransTrack: Multiple-object tracking with transformer," CoRR, 2020.
[17] S. Liu, Z. Zeng, T. Ren, F. Li, H. Zhang, J. Yang, C. Li, J. Yang, H. Su, J. Zhu, and L. Zhang, "Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection," 2023.
[18] H. Zhang, P. Zhang, X. Hu, Y.-C. Chen, L. H. Li, X. Dai, L. Wang, L. Yuan, J.-N. Hwang, and J. Gao, "GLIPv2: Unifying localization and vision-language understanding," 2022.
[19] X. Zhou, V. Koltun, and P. Krähenbühl, "Tracking objects as points," in ECCV, 2020.
[20] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft, "Simple online and realtime tracking," in ICIP, 2016.
[21] N. Wojke, A. Bewley, and D. Paulus, "Simple online and realtime tracking with a deep association metric," in ICIP, 2017.
[22] S. Tang, M. Andriluka, B. Andres, and B. Schiele, "Multiple people tracking by lifted multicut and person re-identification," in CVPR, 2017.
[23] Y. Zhang, C. Wang, X. Wang, W. Zeng, and W. Liu, "FairMOT: On the fairness of detection and re-identification in multiple object tracking," IJCV, 2021.
[24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in ICML, 2021.
[25] C. Jia, Y. Yang, Y. Xia, Y.-T. Chen, Z. Parekh, H. Pham, Q. Le, Y.-H. Sung, Z. Li, and T. Duerig, "Scaling up visual and vision-language representation learning with noisy text supervision," in ICML, 2021.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," in NAACL, 2019.
[27] J. Pennington, R. Socher, and C. D. Manning, "GloVe: Global vectors for word representation," in EMNLP, 2014.
[28] A. Zareian, K. D. Rosa, D. H. Hu, and S.-F. Chang, "Open-vocabulary object detection using captions," in CVPR, 2021.
[29] A. Bansal, K. Sikka, G. Sharma, R. Chellappa, and A. Divakaran, "Zero-shot object detection," in ECCV, 2018.
[30] S. Rahman, S. Khan, and N. Barnes, "Improved visual-semantic alignment for zero-shot object detection," in AAAI, 2020.
[31] X. Gu, T.-Y. Lin, W. Kuo, and Y. Cui, "Open-vocabulary object detection via vision and language knowledge distillation," in ICLR, 2022.
[32] M. Minderer, A. Gritsenko, A. Stone, M. Neumann, D. Weissenborn, A. Dosovitskiy, A. Mahendran, A. Arnab, M. Dehghani, Z. Shen, et al., "Simple open-vocabulary object detection with vision transformers," arXiv, 2022.
[33] J. Shi and J. Malik, "Motion segmentation and tracking using normalized cuts," in ICCV, 1998.
[34] M. Grundmann, V. Kwatra, M. Han, and I. Essa, "Efficient hierarchical graph-based video segmentation," in CVPR, 2010.
[35] P. Ochs, J. Malik, and T. Brox, "Segmentation of moving objects by long term video analysis," TPAMI, vol. 36, no. 6, 2013.
[36] K. Fragkiadaki, P. Arbelaez, P. Felsen, and J. Malik, "Learning to segment moving objects in videos," in CVPR, 2015.
[37] P. Bideau, A. RoyChowdhury, R. R. Menon, and E. Learned-Miller, "The best of both worlds: Combining CNNs and geometric constraints for hierarchical motion segmentation," in CVPR, 2018.
[38] Y. Liu, I. E. Zulfikar, J. Luiten, A. Dave, D. Ramanan, B. Leibe, A. Ošep, and L. Leal-Taixé, "Opening up open world tracking," in CVPR, 2022.
[39] S. Li, M. Danelljan, H. Ding, T. E. Huang, and F. Yu, "Tracking every thing in the wild," in ECCV, 2022.
[40] A. Ošep, W. Mehner, P. Voigtlaender, and B. Leibe, "Track, then decide: Category-agnostic vision-based multi-object tracking," in ICRA, 2018.
[41] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, "Swin transformer: Hierarchical vision transformer using shifted windows," in ICCV, 2021.
[42] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NeurIPS, 2017.
[43] P. Pan, F. Porikli, and D. Schonfeld, "Recurrent tracking using multifold consistency," in Proceedings of the Eleventh IEEE International Workshop on Performance Evaluation of Tracking and Surveillance, vol. 3, 2009.
[44] N. Sundaram, T. Brox, and K. Keutzer, "Dense point trajectories by GPU-accelerated large displacement optical flow," in ECCV, 2010.
[45] A. Dave, T. Khurana, P. Tokmakov, C. Schmid, and D. Ramanan, "TAO: A large-scale benchmark for tracking any object," in ECCV, 2020.
[46] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[47] A. Gupta, P. Dollár, and R. B. Girshick, "LVIS: A dataset for large vocabulary instance segmentation," in CVPR, 2019.
[48] H. K. Cheng, Y. Tai, and C. Tang, "Rethinking space-time networks with improved memory coverage for efficient video object segmentation," in NeurIPS, 2021.
[49] B. Cheng, A. Choudhuri, I. Misra, A. Kirillov, R. Girdhar, and A. G. Schwing, "Mask2Former for video instance segmentation," CoRR, 2021.
[50] L. Yang, Y. Fan, and N. Xu, "Video instance segmentation," CoRR, 2019.
[51] J. Wu, Q. Liu, Y. Jiang, S. Bai, A. L. Yuille, and X. Bai, "In defense of online models for video instance segmentation," in ECCV, 2022.
[52] J. Yang, M. Gao, Z. Li, S. Gao, F. Wang, and F. Zheng, "Track anything: Segment anything meets videos," CoRR, 2023.
[53] H. K. Cheng and A. G. Schwing, "XMem: Long-term video object segmentation with an Atkinson-Shiffrin memory model," in ECCV, 2022.
[54] S. W. Oh, J. Lee, N. Xu, and S. J. Kim, "Video object segmentation using space-time memory networks," in ICCV, 2019.
[55] Q. Wang, L. Zhang, L. Bertinetto, W. Hu, and P. H. Torr, "Fast online object tracking and segmentation: A unifying approach," in CVPR, 2019.
[56] B. Yan, Y. Jiang, P. Sun, D. Wang, Z. Yuan, P. Luo, and H. Lu, "Towards grand unification of object tracking," in ECCV, 2022.
[57] P. Voigtlaender, J. Luiten, P. H. S. Torr, and B. Leibe, "Siam R-CNN: Visual tracking by re-detection," in CVPR, 2020.
[58] B. Yan, Y. Jiang, J. Wu, D. Wang, Z. Yuan, P. Luo, and H. Lu, "Universal instance perception as object discovery and retrieval," in CVPR, 2023.
[59] L. Ke, M. Ye, M. Danelljan, Y. Liu, Y.-W. Tai, C.-K. Tang, and F. Yu, "Segment anything in high quality," arXiv, 2023.