DOI: 10.1145/3581783.3613809

Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer

Published: 27 October 2023

Abstract

Viewport prediction is a crucial component of tile-based 360° video streaming systems. However, existing trajectory-based methods lack robustness and oversimplify the construction and fusion of information across different modality inputs, leading to error accumulation. In this paper, we propose a tile classification based viewport prediction method with a Multi-modal Fusion Transformer, namely MFTR. Specifically, MFTR utilizes transformer-based networks to extract long-range dependencies within each modality, then mines intra- and inter-modality relations to capture the combined impact of user historical inputs and video contents on future viewport selection. In addition, MFTR categorizes future tiles into two classes, user-interested or not, and selects the future viewport as the region that contains the most user-interested tiles. Compared with predicting head trajectories, choosing the future viewport based on tiles' binary classification results exhibits better robustness and interpretability. To evaluate the proposed MFTR, we conduct extensive experiments on two widely used datasets, PVS-HM and Xu-Gaze. MFTR shows superior performance over state-of-the-art methods in terms of average prediction accuracy and overlap ratio, and also presents competitive computational efficiency.
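To make the selection step concrete, here is a minimal sketch of the tile-classification idea described in the abstract (this is not the authors' implementation; the 8x16 tile grid, 4x6 viewport footprint, and 0.5 decision threshold are illustrative assumptions): given per-tile "user interested" probabilities from a binary classifier, the future viewport is chosen as the position whose tile region covers the most interested tiles, wrapping horizontally since a 360° frame is periodic in longitude.

```python
# Minimal sketch (illustrative assumptions, not the authors' code): pick
# the viewport position whose tile region covers the most tiles that a
# binary classifier marked as user-interested.
import numpy as np

def select_viewport(tile_probs: np.ndarray,
                    viewport_shape=(4, 6),
                    threshold=0.5):
    """Return (top-left tile index, covered count) for the viewport
    region containing the most tiles classified as user-interested."""
    interested = (tile_probs >= threshold).astype(int)  # binary tile labels
    vh, vw = viewport_shape
    h, w = interested.shape
    best_pos, best_count = (0, 0), -1
    for i in range(h - vh + 1):            # no vertical wrap
        for j in range(w):                 # horizontal wrap: 360° longitude
            cols = [(j + k) % w for k in range(vw)]
            count = interested[i:i + vh, cols].sum()
            if count > best_count:
                best_pos, best_count = (i, j), count
    return best_pos, best_count

# Example with random probabilities standing in for classifier output.
rng = np.random.default_rng(0)
pos, n = select_viewport(rng.random((8, 16)))
print(f"viewport top-left tile: {pos}, interested tiles covered: {n}")
```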

Supplementary Material

MP4 File (mmfp3892-video.mp4)
Presentation Video


Cited By

  • (2024) FHVAC: Feature-Level Hybrid Video Adaptive Configuration for Machine-Centric Live Streaming. IEEE Transactions on Parallel and Distributed Systems, Vol. 35, 5 (May 2024), 780-795. DOI: 10.1109/TPDS.2024.3372046



Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 27 October 2023


Author Tags

  1. multi-modal fusion
  2. tile classification
  3. transformer network
  4. viewport prediction

Qualifiers

  • Research-article

Funding Sources

  • China University Innovation Fund
  • Project of China Knowledge Centre for Engineering Science and Technology
  • NSFC under Grant
  • the MOE Innovation Research Team
  • National Key R&D Program of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall Acceptance Rate 995 of 4,171 submissions, 24%



Article Metrics

  • Downloads (last 12 months): 142
  • Downloads (last 6 weeks): 6

Reflects downloads up to 30 Aug 2024
