A Novel Lightweight Audio-visual Saliency Model for Videos

Published: 27 February 2023

Abstract

Audio information has rarely been considered an important factor in visual attention models, despite many psychological studies showing its importance in the human visual perception system. Since existing visual attention models utilize only visual information, their performance is limited, and they require high computational complexity to compensate for the limited information available. To overcome these problems, we propose a lightweight audio-visual saliency (LAVS) model for video sequences. To the best of our knowledge, this article is the first attempt to utilize audio cues in an efficient deep-learning model for video saliency estimation. First, spatial-temporal visual features are extracted by a lightweight receptive field block (RFB) with bidirectional ConvLSTM units. Then, audio features are extracted by an improved lightweight environment sound classification model. Subsequently, deep canonical correlation analysis (DCCA) captures the correspondence between the audio and spatial-temporal visual features, yielding a spatial-temporal auditory saliency map. Lastly, the spatial-temporal visual and auditory saliency maps are fused to obtain the final audio-visual saliency map. Extensive comparative experiments and ablation studies validate the performance of the LAVS model in terms of both effectiveness and complexity.
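
To make the DCCA step concrete, the sketch below shows the quantity it maximizes: the total canonical correlation between a batch of visual features and a batch of audio features. This is a minimal NumPy sketch of the linear CCA core that DCCA generalizes with two deep networks; the function name, feature dimensions, and the `reg` ridge term are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the total canonical correlation
# between two feature batches, the quantity DCCA maximizes after passing
# each modality through its own deep network. Dimensions are hypothetical.
import numpy as np

def total_canonical_correlation(X, Y, reg=1e-4):
    """X: (n, dx) visual features; Y: (n, dy) audio features."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)      # center each view
    # Regularized covariance estimates (ridge term keeps Cholesky stable)
    Sxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / (n - 1)
    Lx, Ly = np.linalg.cholesky(Sxx), np.linalg.cholesky(Syy)
    # Whitened cross-covariance: its singular values are the
    # canonical correlations between the two views.
    T = np.linalg.solve(Lx, Sxy) @ np.linalg.inv(Ly).T
    return np.linalg.svd(T, compute_uv=False).sum()

# Example: random 64-dim visual and 32-dim audio features for 256 frames.
rng = np.random.default_rng(0)
X, Y = rng.standard_normal((256, 64)), rng.standard_normal((256, 32))
print(total_canonical_correlation(X, Y))
```

In DCCA-style training, the negative of this correlation typically serves as the loss backpropagated through both feature extractors, which is what aligns the audio features with the spatial-temporal visual features.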


Published In

ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 19, Issue 4
July 2023
263 pages
ISSN: 1551-6857
EISSN: 1551-6865
DOI: 10.1145/3582888
Editor: Abdulmotaleb El Saddik

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 February 2023
Online AM: 16 December 2022
Accepted: 07 December 2022
Revised: 05 November 2022
Received: 01 May 2022
Published in TOMM Volume 19, Issue 4


Author Tags

  1. Lightweight model
  2. deep canonical correlation analysis
  3. audio-visual saliency prediction
  4. feature fusion
  5. sound source localization

Qualifiers

  • Research-article

Funding Sources

  • Fundamental Research Funds for the Central Universities
  • Key Laboratory of Artificial Intelligence, Ministry of Education, P.R. China
  • National Natural Science Foundation of China
