DOI: 10.1145/3343031.3350906

W2VV++: Fully Deep Learning for Ad-hoc Video Search

Published: 15 October 2019

Abstract

Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. In contrast to previous concept-based methods, we propose a fully deep learning method for query representation learning that requires no explicit concept modeling, matching, or selection. The backbone of our method is the proposed W2VV++ model, an enhanced version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. These simple yet important changes bring a substantial improvement. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state of the art. Performance can be boosted further by a model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
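The abstract names two ingredients without spelling them out: an "improved triplet ranking loss" and ensemble via late average fusion. As a rough illustration only (the exact formulation is in the paper, not this abstract), the sketch below assumes the improvement is hardest-negative mining over a sentence-video similarity matrix, in the style commonly used for visual-semantic embeddings; the function names and margin value are hypothetical.

```python
import numpy as np

def triplet_ranking_loss(sim, margin=0.2, hard_negative=True):
    """Margin-based triplet ranking loss over a similarity matrix.

    sim[i, j] is the similarity between sentence i and video j; matched
    pairs lie on the diagonal. With hard_negative=True, only the hardest
    (highest-scoring) negative per pair contributes, which is one common
    way to "improve" the plain sum-over-negatives triplet loss.
    """
    n = sim.shape[0]
    pos = np.diag(sim)                  # similarity of matched pairs
    mask = ~np.eye(n, dtype=bool)       # off-diagonal entries are negatives

    # cost of ranking a negative video above the matched video (per sentence)
    cost_s = np.maximum(0.0, margin - pos[:, None] + sim) * mask
    # cost of ranking a negative sentence above the matched sentence (per video)
    cost_v = np.maximum(0.0, margin - pos[None, :] + sim) * mask

    if hard_negative:
        return float(cost_s.max(axis=1).sum() + cost_v.max(axis=0).sum())
    return float(cost_s.sum() + cost_v.sum())

def late_average_fusion(score_lists):
    """Late average fusion: average per-video scores from several models."""
    return np.mean(np.stack(score_lists, axis=0), axis=0)
```

With a well-separated similarity matrix every hinge term is zero, so the loss vanishes; a negative scoring above its matched pair contributes `margin - pos + neg` per direction.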




Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN:9781450368896
DOI:10.1145/3343031


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. ad-hoc video search
  2. cross-modal matching
  3. deep learning
  4. query representation learning
  5. trecvid benchmarks

Qualifiers

  • Research-article


Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 995 of 4,171 submissions (24%)



Article Metrics

  • Downloads (last 12 months): 35
  • Downloads (last 6 weeks): 3
Reflects downloads up to 16 Oct 2024


Cited By

  • (2024) Multiple Distilling-based spatial-temporal attention networks for unsupervised human action recognition. Intelligent Data Analysis 28(4), 921-941. DOI: 10.3233/IDA-230399. Online publication date: 17-Jul-2024.
  • (2024) Exploiting Instance-level Relationships in Weakly Supervised Text-to-Video Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20(10), 1-21. DOI: 10.1145/3663571. Online publication date: 12-Sep-2024.
  • (2024) Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank. Proceedings of the 2024 International Conference on Multimedia Retrieval, 73-82. DOI: 10.1145/3652583.3658052. Online publication date: 30-May-2024.
  • (2024) AdOCTeRA: Adaptive Optimization Constraints for improved Text-guided Retrieval of Apartments. Proceedings of the 2024 International Conference on Multimedia Retrieval, 1043-1050. DOI: 10.1145/3652583.3658039. Online publication date: 30-May-2024.
  • (2024) Cross-Lingual Cross-Modal Retrieval With Noise-Robust Fine-Tuning. IEEE Transactions on Knowledge and Data Engineering 36(11), 5860-5873. DOI: 10.1109/TKDE.2024.3400060. Online publication date: Nov-2024.
  • (2024) Toward Video Anomaly Retrieval From Video Anomaly Detection: New Benchmarks and Model. IEEE Transactions on Image Processing 33, 2213-2225. DOI: 10.1109/TIP.2024.3374070. Online publication date: 2024.
  • (2024) Long Term Memory-Enhanced Via Causal Reasoning for Text-To-Video Retrieval. ICASSP 2024, 8160-8164. DOI: 10.1109/ICASSP48485.2024.10448201. Online publication date: 14-Apr-2024.
  • (2024) Cliprerank: An Extremely Simple Method For Improving Ad-Hoc Video Search. ICASSP 2024, 7850-7854. DOI: 10.1109/ICASSP48485.2024.10446902. Online publication date: 14-Apr-2024.
  • (2024) Evaluating Performance and Trends in Interactive Video Retrieval: Insights From the 12th VBS Competition. IEEE Access 12, 79342-79366. DOI: 10.1109/ACCESS.2024.3405638. Online publication date: 2024.
  • (2024) Multi-modal video search by examples - A video quality impact analysis. IET Computer Vision. DOI: 10.1049/cvi2.12303. Online publication date: 27-Jul-2024.
