DOI: 10.1145/3343031.3350906

W2VV++: Fully Deep Learning for Ad-hoc Video Search

Published: 15 October 2019

Abstract

Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. In contrast to previous concept-based methods, we propose a fully deep learning method for query representation learning that requires no explicit concept modeling, matching, or selection. The backbone of our method is the proposed W2VV++ model, an enhanced version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. These simple yet important changes bring a substantial improvement. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state of the art. Performance can be boosted further by a model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
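The abstract names two ingredients without spelling them out: an "improved triplet ranking loss" and ensemble via late average fusion. As a rough illustration only (the exact formulation is in the paper, not this abstract), the sketch below assumes the improvement is hardest-negative mining over a sentence-video similarity matrix, in the style commonly used for visual-semantic embeddings; the function names and margin value are hypothetical.

```python
import numpy as np

def triplet_ranking_loss(sim, margin=0.2, hard_negative=True):
    """Margin-based triplet ranking loss over a similarity matrix.

    sim[i, j] is the similarity between sentence i and video j; matched
    pairs lie on the diagonal. With hard_negative=True, only the hardest
    (highest-scoring) negative per pair contributes, which is one common
    way to "improve" the plain sum-over-negatives triplet loss.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                  # similarity of matched pairs
    mask = ~np.eye(n, dtype=bool)       # off-diagonal entries are negatives

    # cost of ranking a negative video above the matched video (per sentence)
    cost_s = np.maximum(0.0, margin - pos[:, None] + sim) * mask
    # cost of ranking a negative sentence above the matched sentence (per video)
    cost_v = np.maximum(0.0, margin - pos[None, :] + sim) * mask

    if hard_negative:
        return float(cost_s.max(axis=1).sum() + cost_v.max(axis=0).sum())
    return float(cost_s.sum() + cost_v.sum())

def late_average_fusion(score_lists):
    """Late average fusion: average per-video scores from several models."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)
```

With a well-separated similarity matrix every hinge term is zero, so the loss vanishes; a negative scoring above its matched pair contributes `margin - pos + neg` per direction.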




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. ad-hoc video search
  2. cross-modal matching
  3. deep learning
  4. query representation learning
  5. trecvid benchmarks

Qualifiers

  • Research-article


Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 995 of 4,171 submissions (24%)



Article Metrics

  • Downloads (last 12 months): 35
  • Downloads (last 6 weeks): 3
Reflects downloads up to 16 Oct 2024


Cited By

  • (2024) Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition. Intelligent Data Analysis 28(4), 921-941. DOI: 10.3233/IDA-230399. Online publication date: 17-Jul-2024.
  • (2024) Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1-21. DOI: 10.1145/3663571. Online publication date: 12-Sep-2024.
  • (2024) Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank. Proceedings of the 2024 International Conference on Multimedia Retrieval, 73-82. DOI: 10.1145/3652583.3658052. Online publication date: 30-May-2024.
  • (2024) AdOCTeRA: Adaptive Optimization Constraints for improved Text-guided Retrieval of Apartments. Proceedings of the 2024 International Conference on Multimedia Retrieval, 1043-1050. DOI: 10.1145/3652583.3658039. Online publication date: 30-May-2024.
  • (2024) Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning. IEEE Transactions on Knowledge and Data Engineering 36(11), 5860-5873. DOI: 10.1109/TKDE.2024.3400060. Online publication date: Nov-2024.
  • (2024) Toward Video Anomaly Retrieval From Video Anomaly Detection: New Benchmarks and Model. IEEE Transactions on Image Processing 33, 2213-2225. DOI: 10.1109/TIP.2024.3374070. Online publication date: 2024.
  • (2024) Long Term Memory-Enhanced Via Causal Reasoning for Text-To-Video Retrieval. ICASSP 2024, 8160-8164. DOI: 10.1109/ICASSP48485.2024.10448201. Online publication date: 14-Apr-2024.
  • (2024) Cliprerank: An Extremely Simple Method For Improving Ad-Hoc Video Search. ICASSP 2024, 7850-7854. DOI: 10.1109/ICASSP48485.2024.10446902. Online publication date: 14-Apr-2024.
  • (2024) Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition. IEEE Access 12, 79342-79366. DOI: 10.1109/ACCESS.2024.3405638. Online publication date: 2024.
  • (2024) Multi-modal video search by examples - A video quality impact analysis. IET Computer Vision. DOI: 10.1049/cvi2.12303. Online publication date: 27-Jul-2024.
