DOI: 10.1145/3581783.3613809

Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer

Published: 27 October 2023

Abstract

Viewport prediction is a crucial component of tile-based 360° video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information across different modality inputs, leading to error accumulation. In this paper, we propose a tile classification based viewport prediction method with a Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes transformer-based networks to extract long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of user historical inputs and video contents on future viewport selection. In addition, MFTR categorizes future tiles into two classes, user-interested or not, and selects the future viewport as the region that contains the most user-interested tiles. Compared with predicting head trajectories, choosing the future viewport based on tiles' binary classification results exhibits better robustness and interpretability. To evaluate the proposed MFTR, we conduct extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR shows superior performance over state-of-the-art methods in terms of average prediction accuracy and overlap ratio, and also presents competitive computational efficiency.
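To make the selection step concrete, here is a minimal sketch of the tile-classification idea described in the abstract (this is not the authors' implementation; the 8x16 tile grid, 4x6 viewport footprint, and 0.5 decision threshold are illustrative assumptions): given per-tile "user interested" probabilities from a binary classifier, the future viewport is chosen as the position whose tile region covers the most interested tiles, wrapping horizontally since a 360° frame is periodic in longitude.

```python
# Minimal sketch (illustrative assumptions, not the authors' code): pick
# the viewport position whose tile region covers the most tiles that a
# binary classifier marked as user-interested.
import numpy as np

def select_viewport(tile_probs: np.ndarray,
                    viewport_shape=(4, 6),
                    threshold=0.5):
    """Return (top-left tile index, covered count) for the viewport
    region containing the most tiles classified as user-interested."""
    interested = (tile_probs >= threshold).astype(int)  # binary tile labels
    vh, vw = viewport_shape
    h, w = interested.shape
    best_pos, best_count = (0, 0), -1
    for i in range(h - vh + 1):            # no vertical wrap
        for j in range(w):                 # horizontal wrap: 360° longitude
            cols = [(j + k) % w for k in range(vw)]
            count = interested[i:i + vh, cols].sum()
            if count > best_count:
                best_pos, best_count = (i, j), count
    return best_pos, best_count

# Example with random probabilities standing in for classifier output.
rng = np.random.default_rng(0)
pos, n = select_viewport(rng.random((8, 16)))
print(f"viewport top-left tile: {pos}, interested tiles covered: {n}")
```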

Supplementary Material

MP4 File (mmfp3892-video.mp4)
Presentation Video


Cited By

  • (2024) FHVAC: Feature-Level Hybrid Video Adaptive Configuration for Machine-Centric Live Streaming. IEEE Transactions on Parallel and Distributed Systems, Vol. 35, 5 (May 2024), 780-795. DOI: 10.1109/TPDS.2024.3372046



Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. multi-modal fusion
  2. tile classification
  3. transformer network
  4. viewport prediction

Qualifiers

  • Research-article

Funding Sources

  • China University Innovation Fund
  • Project of China Knowledge Centre for Engineering Science and Technology
  • NSFC under Grant
  • the MOE Innovation Research Team
  • National Key R&D Program of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Article Metrics

  • Downloads (last 12 months): 142
  • Downloads (last 6 weeks): 6

Reflects downloads up to 30 Aug 2024
