
(Un)likelihood Training for Interpretable Embedding

Published: 30 December 2023

Abstract

Cross-modal representation learning has become the new normal for bridging the semantic gap between text and visual data. Learning modality-agnostic representations in a continuous latent space, however, is often treated as a black-box, data-driven training process. It is well known that the effectiveness of representation learning depends heavily on the quality and scale of training data. For video representation learning, obtaining a complete set of labels that annotates the full spectrum of video content for training is highly difficult, if not impossible. These issues, black-box training and dataset bias, make representation learning difficult to deploy in practice for video understanding because of its unexplainable and unpredictable results. In this article, we propose two novel training objectives, likelihood and unlikelihood functions, to unroll the semantics behind embeddings while addressing the label sparsity problem in training. The likelihood training aims to interpret the semantics of embeddings beyond the training labels, while the unlikelihood training leverages prior knowledge for regularization to ensure semantically coherent interpretation. With both training objectives, a new encoder-decoder network, which learns an interpretable cross-modal representation, is proposed for ad-hoc video search. Extensive experiments on the TRECVid and MSR-VTT datasets show that the proposed network outperforms several state-of-the-art retrieval models by a statistically significant margin.
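
The abstract only sketches the two objectives at a high level. As a rough illustration, a combined likelihood/unlikelihood loss over a concept-decoder vocabulary, in the spirit of unlikelihood training (Welleck et al., 2019), might look like the minimal PyTorch sketch below. This is not the authors' implementation; the function and tensor names (likelihood_unlikelihood_loss, concept_logits, positive_mask, negative_mask, alpha) are illustrative assumptions.

import torch

def likelihood_unlikelihood_loss(concept_logits, positive_mask, negative_mask, alpha=1.0):
    """Hypothetical combined objective over a concept vocabulary.

    concept_logits: (batch, vocab) raw decoder scores for each concept.
    positive_mask:  (batch, vocab) 1.0 where a concept is labelled for the video.
    negative_mask:  (batch, vocab) 1.0 where prior knowledge rules a concept out.
    alpha:          weight of the unlikelihood (regularization) term.
    """
    probs = torch.sigmoid(concept_logits)   # independent per-concept probabilities
    eps = 1e-8
    # Likelihood term: raise the probability of annotated concepts.
    likelihood = -(positive_mask * torch.log(probs + eps)).sum(dim=-1)
    # Unlikelihood term: suppress concepts contradicted by prior knowledge.
    unlikelihood = -(negative_mask * torch.log(1.0 - probs + eps)).sum(dim=-1)
    return (likelihood + alpha * unlikelihood).mean()

# Toy usage with a 5-concept vocabulary.
logits = torch.randn(2, 5)
pos = torch.tensor([[1., 0., 0., 1., 0.], [0., 1., 0., 0., 0.]])
neg = torch.tensor([[0., 1., 0., 0., 0.], [1., 0., 0., 0., 1.]])
loss = likelihood_unlikelihood_loss(logits, pos, neg)

In this reading, the unlikelihood term acts only on concepts explicitly marked as contradictory by prior knowledge, which is what lets it serve as a regularizer rather than treating every unlabelled concept as a negative.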



Published In

ACM Transactions on Information Systems, Volume 42, Issue 3
May 2024
721 pages
EISSN: 1558-2868
DOI: 10.1145/3618081
Editor: Min Zhang

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 30 December 2023
Online AM: 13 November 2023
Accepted: 03 November 2023
Revised: 14 September 2023
Received: 18 May 2023
Published in TOIS Volume 42, Issue 3


Author Tags

  1. Explainable embedding
  2. cross-modal representation learning
  3. ad-hoc video search

Qualifiers

  • Research-article

Funding Sources

  • Singapore Ministry of Education (MOE) Academic Research Fund (AcRF) Tier 1

