Article
DOI: 10.1007/978-3-031-27077-2_31

A Study of a Cross-modal Interactive Search Tool Using CLIP and Temporal Fusion

Published: 29 March 2023

Abstract

Recently, the CLIP model has demonstrated impressive performance on text-image search and zero-shot classification tasks. Consequently, CLIP has been adopted as the primary model in many cross-modal search tools at evaluation campaigns. In this paper, we present a study of the model integrated into a successful video search tool at the respected Video Browser Showdown competition. The tool supports more complex querying actions on top of the primary model. Specifically, temporal querying and Bayesian-like relevance feedback were tested, as well as their natural combination, temporal relevance feedback. In a thorough analysis of the tool's performance, we show the current limits of cross-modal searching with CLIP, as well as the impact of more advanced query formulation strategies. We conclude that current cross-modal search models enable users to solve some types of tasks trivially with a single query; however, for more challenging tasks it is also necessary to rely on interactive search strategies.
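The two querying actions named in the abstract can be sketched in a few lines. The snippet below is an illustrative sketch, not the authors' implementation: it assumes per-frame CLIP similarity scores are already computed, fuses a temporal query "A, then B within a window" by combining each frame's score for A with the best later-frame score for B, and applies a PicHunter-style Bayesian-like update that boosts frames similar to one the user marked relevant. All function names and parameters are hypothetical.

```python
import numpy as np

def temporal_fusion_scores(sim_a, sim_b, window=5):
    """Score each frame i for sub-query A, fused with the best match
    for sub-query B among the next `window` frames (temporal query)."""
    sim_a, sim_b = np.asarray(sim_a), np.asarray(sim_b)
    n = len(sim_a)
    scores = np.full(n, -np.inf)
    for i in range(n):
        j_end = min(n, i + 1 + window)
        if i + 1 < j_end:  # at least one later frame inside the window
            scores[i] = sim_a[i] + sim_b[i + 1:j_end].max()
    return scores

def relevance_feedback_update(prior, sim_to_selected, temperature=0.2):
    """Bayesian-like feedback: the posterior is proportional to
    prior * exp(similarity_to_selected_frame / temperature)."""
    likelihood = np.exp(np.asarray(sim_to_selected) / temperature)
    posterior = np.asarray(prior) * likelihood
    return posterior / posterior.sum()
```

Temporal relevance feedback, the combination studied in the paper, would chain the two ideas: the posterior maintained by the feedback loop re-ranks the temporally fused scores instead of single-frame scores.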


Cited By

  • (2024) Searching Temporally Distant Activities in Lifelog Data With PraK Tool V2. In: Proceedings of the 7th Annual ACM Workshop on the Lifelog Search Challenge, pp. 111–116. DOI: 10.1145/3643489.3661131. Online publication date: 10 June 2024.

Published In

MultiMedia Modeling: 29th International Conference, MMM 2023, Bergen, Norway, January 9–12, 2023, Proceedings, Part I
Jan 2023
718 pages
ISBN: 978-3-031-27076-5
DOI: 10.1007/978-3-031-27077-2

Publisher

Springer-Verlag, Berlin, Heidelberg

Author Tags

  1. Multimedia retrieval
  2. User study
  3. Cross-modal search
