
Advanced Predictive Tile Selection Using Dynamic Tiling for Prioritized 360° Video VR Streaming

Published: 24 August 2023

Abstract

The widespread availability of smart computing and display devices such as mobile phones, gaming consoles, laptops, and tethered/untethered head-mounted displays has fueled an increase in demand for omnidirectional (360°) videos. 360° video applications enable users to change their viewing angles while interacting with the video during playback. This allows users to have a more personalized and interactive viewing experience. Unfortunately, these applications require substantial network and computational resources that conventional infrastructure and end devices cannot support. Recently proposed viewport adaptive fixed tiling solutions stream only relevant video tiles based on user interaction with the virtual reality (VR) space to use existing transmission resources more efficiently. However, achieving real-time accurate viewport extraction and transmission in response to both head movements and bandwidth dynamics can be challenging, which can impact the user's Quality of Experience (QoE). This article proposes innovative dynamic tiling-based adaptive 360° video streaming solutions to achieve high viewer QoE. First, novel and easy-to-scale tiling layout selection methods are introduced, and the best tiling layouts are employed in each adaptation interval based on the prediction-assisted visual quality metric and the observed viewport divergence. Second, a novel proactive tile selection approach is presented, which adaptively extracts tiles for each selected tiling layout based on two low-complexity viewport prediction mechanisms. Finally, a practical dynamic tile priority-oriented bitrate adaptation scheme is introduced, which uniformly distributes the bitrate budget among different tiles during 360° video streaming. Extensive trace-driven experiments are conducted to evaluate the proposed solutions using head motion traces from 48 VR users for five 360° videos with tiling layouts of 4 × 3, 6 × 4, and 8 × 6 and segment durations of 1s, 1.5s, and 2s. The experimental evaluations show that the dynamic video tiling solutions achieve up to 11.2% more viewport matches and an average improvement in QoE of 9.7% to 18% compared to state-of-the-art 360° streaming approaches.

1 Introduction

Recently, 360° virtual reality (VR) video has enhanced the traditional streaming format by providing a complete spherical field of view (FoV), allowing the viewer to feel fully immersed in the video. This is achieved by capturing video from all directions using multiple cameras and then stitching the footage together into a single, seamless sphere. Users can have an incredibly immersive experience, especially when using high-resolution head-mounted display (HMD) devices [53]. However, remote transmission and rendering of ultra-high-resolution panoramic content significantly exceeds the capacity of conventional infrastructure. The emerging 5G and beyond wireless network technologies are expected to bridge this performance gap by offering higher network flexibility, transmission capacity, and mobility support [3].
Currently, a standard way to mitigate the transmission cost of ever-increasing 360° video services is through viewport-based adaptive streaming frameworks (i.e., monolithic streaming [7, 64] and tile-based streaming [37, 54]). In monolithic streaming, multiple versions of pre-defined viewports are prepared on the server side; for each viewing direction, the entire spherical frame is delivered with higher quality inside the viewport and gradually lower quality outside it. In contrast, tile-based streaming lowers these requirements by spatially partitioning the video frames into independently encodable rectangular video parts known as tiles [21, 59]. The VR user can watch the FoV tiles at higher-quality levels [31, 65] compared to the other tiles, which are delivered at lower resolution [12, 38] or even discarded [49]. The user's head motion pattern is essential input for quality-efficient remote transmission, but in many cases it is available only with limited accuracy and lead time. Viewport prediction can help to reduce the time it takes for new tiles to be loaded as the viewer changes their viewing angle, improving the overall streaming experience. The client can then allocate more bits to the tiles most likely to be visited [28].
The spatial partitioning structure of tiles plays a vital role in balancing viewport availability and bandwidth utilization. Existing fixed tiling layout solutions [15, 16, 36] stream variable-quality views in order to reduce data transmission. However, this can still lead to poor visual boundaries and inefficient use of bandwidth. In contrast, a dynamic tiling-based streaming framework reduces redundant data and provides improved FoV availability for different viewing behaviors of users. However, it is challenging to support dynamic tiling-based streaming under complex viewing patterns. Similarly, identifying and selecting prioritized views is necessary but not simple. Using traditional bitrate adaptation heuristics [42, 52] for tile-based streaming in the presence of various uncertainties (such as connection speed, user movements, and segment sizes) is not practical due to the spatial and temporal separation of 360° content. Even if a learning-based [22, 45, 46] or control-based adaptation technique [60, 61] can correctly calculate the bitrate for the next segment in real time, it remains strenuous to match the quality scores under instantaneous short-term viewport updates.
This article introduces two novel Dynamic video Frames Tiling-based (DFT) 360° video streaming solutions involving a three-tier adaptation in terms of tiling layout adaptation, streaming tile selection, and bitrate adaptation. In an end-to-end remote 360° video transmission, the first solution, DFT1, decides an optimal tiling layout based on a newly proposed priority-assisted weighted visual quality metric. The second solution, referred to as DFT2, intelligently adapts the tiling version based on the head movement prediction accuracy for each video segment. The proposed DFT solutions perform prioritized tile selection by classifying streaming regions into the following cases: (1) Case 1: fixed viewport with no marginal tiles; (2) Case 2: fixed viewport with marginal tiles; and (3) Case 3: extended viewport with no marginal tiles. Finally, a DFT bitrate adaptation heuristic is designed in such a way as to support the dynamic tiling-based streaming framework by implementing prioritized bitrate budget distribution between different tile groups. This article has the following main contributions:
(1)
Adaptive Tiling Layout Switching Based on Visual Quality and Prediction Relevance: Two innovative solutions that dynamically determine tiling layouts, taking into account both visual quality prioritization (DFT1) and viewport prediction accuracy (DFT2), during each segment playback are introduced. In particular, DFT1 selects the highest-quality tiling layout to deliver an optimal viewing experience, effectively addressing the complexity and scalability issues faced by existing solutions. The second strategy, DFT2, tailors tiling layouts based on viewport prediction performance, thereby enhancing viewport availability across a variety of motion content.
(2)
Efficient Computation of Streaming Regions: A low-complexity yet precise solution for determining the optimal arrangement of streaming tiles is described, utilizing a combination of two viewport prediction mechanisms, where the viewport is defined by 110\(^{\circ }\) angles in both the horizontal and vertical directions. This approach employs advanced tile classification, i.e., dynamic viewport and marginal regions, in order to improve the displayed viewport's adaptability in response to unanticipated head movements.
(3)
Region-based Uniform Bitrate Adaptation: A dynamic tiling-based uniform bitrate adaptation algorithm that incorporates diverse adaptation policies, including aggressive, weighted, and conservative, is proposed. This novel algorithm proactively allocates the available bandwidth to specific spatial regions and optimizes viewer experience according to the desired adaptation strategy.
We present extensive experimental evaluations using real head motion traces of 48 VR users considering five 4K videos prepared in three tiling layouts (4 \(\times\) 3, 6 \(\times\) 4, 8 \(\times\) 6) and with three segment durations (1s, 1.5s, 2s). Experimental results show that DFT improves the streaming performance measured in terms of viewport overlap (8.6% to 11.2%) and QoE (9.70% to 18%) under dynamic bandwidth conditions in comparison to popular fixed tiling-based and dynamic tiling-based solutions.
This work presents significant new contributions compared to our previously proposed solutions, CFOV [55] and DVS [56]. Compared to [55] and [56], the proposed solutions introduce the following new elements. First, two novel options for tiling layout selection are proposed that can improve viewport availability and reduce the transmission of redundant pixels under variable head movement prediction accuracy. Second, the DFT tile selection mechanisms are comprehensively different from those proposed before. DFTs employ adaptive marginal and extension region selections, which are fine-grained and help with highly dynamic viewing patterns. DVS considered visual complexity and circular distance between viewpoints to classify viewport, marginal, and background tile sets, while CFOV considered fixed and extended FoV scenarios and adopted a wider marginal region based on prediction results. Third, DFT solutions introduce a novel bitrate adaptation algorithm designed to handle dynamic adaptation decisions for multiple tiling layouts, which is a significant new contribution in contrast with the previously introduced fixed tiling-based solutions. DVS specifically switches between uniform (per-region) and non-uniform (per-tile) quality allocation strategies, while DFT considers per-region uniform bitrate adaptation. Finally, a significantly expanded testing setup is used to comparatively evaluate the streaming behaviors of both fixed and dynamic tiling-based solutions.
Article Organization: Section 2 discusses the most recent literature on 360° tile-based streaming. Section 3 details the structure of the proposed 360° adaptation framework and problem formulation. The details of tiling layout selection, tile selection, and tile bitrate adaptation are introduced in Section 4. Section 5 presents the experimental settings, results, and performance analysis. Finally, Section 6 offers conclusive remarks.

2 Background and Related Works

This section presents the important technical background linked to our research and provides a comprehensive overview of the most recent streaming techniques, applications, and limitations.

2.1 Fixed Viewport-based Streaming

In this streaming approach, the size of the viewer window (the “viewport”) is fixed. The system delivers a higher-quality version of the video to the portion of the video that is within the viewport. This approach takes into account the viewer’s dynamic motion patterns, as the viewport is adjusted to follow their movements.
Hosseini and Swaminathan [16] proposed a priority-based bitrate adaptation (PBA) algorithm for 360° video streaming that takes into account the location of different tiles within the video (central, surrounding, and outside). The algorithm starts by assigning the lowest-quality version of the video to the entire segment and then gradually increases the quality of the central tile to the highest level, followed by the surrounding and outer tiles. However, the PBA algorithm was evaluated using a VR setup with a 2K resolution and videos encoded using H.264/AVC, which may not be optimal for enriched 360° videos. Similarly, Chen et al. [4] proposed a system for adapting the quality of 360° video based on the location of different tiles within the viewport, with higher priority given to tiles in the center and lower priority given to tiles in the corners. However, this system does not take into account viewer motion or use any prediction mechanism and was evaluated using fixed network connections. Nasrabadi et al. [28] employed a cube map projection-based scalable video coding scheme where each face of the cube was divided into two horizontal and two vertical tiles and encoding was performed using one base layer and two enhancement layers. The experimental evaluations using four streams of different spatiotemporal complexities demonstrate that, compared to non-scalable coding, layer-assisted tile coding results in fewer rebuffering events while offering improved quality. Van der Hooft et al. [15] proposed a Uniform ViewPort (UVP) quality solution that is designed for use with a fixed viewport. UVP divides the video into two regions: the viewport, which is the portion of the video that is currently being displayed to the viewer, and the non-viewport, which is the rest of the video. The tiles in both regions are assigned quality using a prediction approach that extrapolates the viewer's head motion to anticipate their upcoming viewing points. However, this method was only tested using three videos with a single segment duration. Wei et al. [45] proposed a hybrid adaptation solution that controls viewport prediction and adaptation decisions by leveraging a deep reinforcement learning (DRL) method: it first computes the segment bitrate and then the per-tile bitrate based on predicted fixed viewport maps, and uses them in a cooperative bargaining game theory approach. The proposed solution processes head movement and eye fixation information to adjust the prioritized quality decisions within the spatial and temporal domains.

2.2 Marginal Region-based Streaming

In this streaming approach, a spatial extension, known as the "marginal area," is defined around the viewport. The purpose of the marginal region is to provide a buffer around the viewport to account for possible errors in head movement prediction. Petrangeli et al. [36] proposed an adaptive virtual reality (AVR) streaming approach that divides the tiles of the 360° video into viewport, adjacent, and outside groups. The authors collected viewport traces using the Gear VR framework while 10 users watched a single 360° video. However, the evaluation was limited to a single 60-second-long 360° video clip. Ben Yahia et al. [2] divided the equirectangular frame into viewport, marginal, immediate background, and far background regions. The proposed model involves two viewport prediction intervals, i.e., before and during the delivery of the same segment. The client assigns variable weights to different priority regions and can update the resource allocation based on updated prediction results. Zou et al. [65] introduced a convolutional neural network (CNN)-based prediction mechanism and then distributed the communication resources for the quality selection of predicted tiles. The proposed solution maps the spherical representation to the planar projection to calculate the viewing probability of each tile. The tiles are then divided into viewport, marginal, and background tile groups. The marginal tiles surround the viewport in all directions, similar to [36]. However, CNN-based viewport prediction models are computationally expensive and are difficult to extend for different videos. Yuan et al. [57] proposed a simple yet effective buffer-based quality-aware bitrate adaptation algorithm to allocate different quality levels to the viewport, marginal, and outside tiles. The experimental evaluations using three 4K test sequences prepared in a 6 \(\times\) 4 tiling layout under staged bandwidth variations show that the proposed solution favors high visible quality levels with considerable navigation smoothness. However, only brief simulations (about 10s) were performed for each video. Yadav and Ooi [50] modeled the per-tile bitrate allocation problem as a multiclass knapsack problem based on a dynamic profit function of the current FoV, buffer level, and per-tile representation level. The proposed tile-rate allocation solution, based on the previously proposed non-tiled ABR algorithm [51], achieves good results in terms of reducing playback interruptions and quality switches while improving the overall quality and bandwidth savings. However, this approach may lead to higher spatial quality variance within the viewport, and the use of a separate buffer for each tile can cause the playback of the entire video to stall if one of the tiles is not downloaded in time.

2.3 Extended Viewport-based Streaming

Extended viewport-based streaming is a technique of delivering 360° video in which the viewport is virtually extended by a certain percentage, typically 10% to 30%, in order to provide a buffer around the viewport to account for viewer movements. Van der Hooft et al. [15] proposed a quality adaptation approach by considering the extended viewport (full-frame) region. This approach, called Center Tile First (CTF), focuses on improving the quality of the center or viewpoint tile and then gradually increases the quality of the remaining tiles. CTF was evaluated considering the weighted viewport quality metric, which assigns higher weights to the center tile quality and gradually lowers the weights toward the end tiles. It was shown to outperform the uniform viewport quality allocation solution, UVP, for the weighted viewport quality metric. However, when tested using average viewport quality, UVP performs better than CTF.
He et al. [14] proposed a joint adaptation solution that adjusts both the size of the viewport and the bitrate of the video based on network conditions. The algorithm measures the round-trip time (RTT) of the network connection and uses this information to determine the viewport size and the necessary bitrate for smooth streaming. Simulation results using the Network Simulator (NS)-3 tool showed that this adaptable viewport coverage approach can improve the quality of the streaming experience. However, the details of this work, such as the viewport prediction mechanism, the dataset and tiling layout used, and the content resolution, are not provided. Similarly, Hu et al. [17] proposed a system called MELiveOV for live streaming high-resolution 360° video using 5G-enabled edge servers to distribute processing tasks. This edge-based live streaming system adjusts the size of the viewport based on network conditions, with a smaller viewport (90°) requested at higher bitrates under poor network conditions and a larger viewport (120°) selected for streaming under ideal conditions. However, the performance of this work was only compared to a viewport-independent streaming approach. Guo et al. [13] proposed a solution for 360° video streaming that takes into account random motion patterns and variable network conditions for each viewer and tries to use multicast opportunities to reduce redundant data transmissions. The proposed solution computes the actual viewport tiles for the current user and adds more tiles to the viewing region based on the common interest of other users. The authors considered 100° viewport coverage and an extra 15° in both horizontal and vertical directions. Similarly, Long et al. [24] optimized the overall utility of multiple users in a wireless network environment with a single server. The proposed solution takes into account factors such as transmission time, video quality smoothness, and power constraints in order to maximize the aggregated utility of the users.

2.4 Dynamic Tiling-based Streaming

In dynamic tiling-based adaptive streaming, multiple tiling layouts are prepared on the server side in order to optimize the delivery of a 360° video to a viewer. The tiling layout that is used for a particular viewer may be changed dynamically in order to adapt to their viewing and network conditions. Khiem et al. [39] investigated the impact of tiling layouts on interactive zoomable video streaming by employing the dynamic cropping of regions of interest (RoI). The authors compared the performance of regular monolithic streaming and tile-based streaming using two HD videos and found that larger tiles can improve compression efficiency, but at the cost of transmitting redundant pixels. In this work, we attempt to reduce the transmitted bits and provide improved viewport availability, but with an unmodified decoder. In the follow-up work [30], the authors employed user access patterns to encode the different streaming regions with different encoding parameters. Our DFT solutions also assign variable uniform bitrates to different streaming regions, but with more profound viewing region selection and dynamic bandwidth distribution. Nguyen et al. [32] proposed an adaptive tiling selection (ATS) solution for 360° video streaming. The authors evaluated four different tiling layouts (4 \(\times\) 3, 6 \(\times\) 4, 8 \(\times\) 4, and 8 \(\times\) 8) and divided the selected tiles into viewport and non-viewport groups for each layout. During each adaptation interval, the tile sets that resulted in the minimum viewport distortion or the maximum viewport bitrate were chosen for streaming. However, this approach did not incorporate any viewport prediction mechanism and was tested using fixed network connections. Xiao et al. [48] proposed an optimal tiling solution by partitioning a 360° segment into variable-size sub-rectangles to minimize the storage cost on the server side. The proposed solution estimates the storage and transmission cost by extracting the motion vectors and sizes of all basic sub-rectangles. An integer linear program (ILP) is then used to output the optimal tiling version that covers possible views of the segment. The proposed solution achieves interesting results, but at the cost of increased computational complexity. We attempt to achieve a similar goal of balancing storage size and data transmission, but with reduced server-side storage overhead and by utilizing standard computing and streaming components. This makes the proposed solutions attractive for viewers who want to enjoy an immersive and interactive VR experience without investing in additional hardware.
Kattadige and Thilakarathna [20] proposed a method for selecting the tiling layout of each segment of a video based on the visual attention of the user. The approach involves analyzing the frames of the video, creating visual attention maps for the user, and dividing the frames into three regions based on the user's attention. The proposed solution was compared to three fixed tiling layouts (4 \(\times\) 6, 6 \(\times\) 6, and 10 \(\times\) 20) and was found to be more efficient in terms of pixel and bandwidth usage. Ozcinar et al. [34] employed visual attention maps to improve the network capacity planning for different tile groups. Variable-sized non-overlapping tiles are adaptively selected for each segment. However, real-time visual attention map computation and transmission require extensive resources, which weighs against this proposed solution. In a follow-up work [35], the authors extended their visual-attention-aware variable-size non-overlapping tile mapping to benefit from the dynamic tiling structure. Each 360° video frame was split into two fixed-size polar tiles (one-fourth of the frame from the top and one-fourth from the bottom). The remaining equator region was horizontally divided into 1 and 2 tiles, and then each part was divided into 1, 2, 4, 8, and 16 vertical tiles. Numerous dynamic tiling combinations can be considered using this division. The authors employed seven different spatial and temporal motion content types, but all with a duration of 10s. However, this type of tiling structure is not feasible in real-time streaming scenarios, as the two fixed-size polar tiles (half of the frame) need to be transmitted in full quality if any part of the viewport is predicted to be in that region.
Table 1 summarizes the most significant tile-based adaptive 360° video streaming techniques. These algorithms use user-specific viewing preferences to improve the user's QoE by establishing a stable background. Most of the fixed viewport-based solutions [4, 15, 16] define variable quality levels within the viewport, which can lead to severe spatial quality oscillations even for perfect prediction results. Several solutions [2, 12, 36, 65] simply employ a fixed marginal area around the viewport in all directions. This can compensate for the highly dynamic viewing nature of the user; however, significant bandwidth waste can be observed under medium to high prediction accuracy. Similarly, always extending the viewport region by 15° [13] or 10° [24] can lead to unnecessary transmission under perfect predictions. Different from previous works, in our approach the viewport and marginal region are treated as special cases in the quest to overcome viewing uncertainty. Dynamic tiling solutions [20, 30, 34, 35] are theoretically effective in terms of increasing the picture quality and users' QoE. However, some of these solutions require real-time visual mapping, which makes them difficult to implement in traditional on-demand scenarios. Mixing tiles of different resolutions [44] to provide a non-redundant viewport transmission [20, 34, 35] can result in users sensing quality variations and degradation for high and relatively static motion content. These solutions are difficult to extend to different content types and are associated with additional coding and reconstruction overheads.
Table 1.

| Streaming Technique | Works | Design | Dataset | Tile Layouts | Resolution | Segment Duration | Experimental Duration |
|---|---|---|---|---|---|---|---|
| Fixed Viewport | [16] | Non-uniform VP | 5 Videos, 1 User | 6 tiles | 720p-4K | - | Video duration |
| | [4] | Non-uniform VP | 5 Videos [23] | 3×3, 4×4, 5×5 | 2K | 1s | 20s |
| | [15] | Uniform and Non-uniform VP | 3 Videos, 48 Users [47] | 1×1, 2×2, 4×2, 4×4, 8×4, 8×6, 8×8, 16×12/16 | 4K | 1.067s | Video duration |
| | [49] | Probability Based | 1 Video, 5 Users | 6×12 | 2K | 1s | 3m |
| | [28] | Layer Assisted | 4 Videos, 5 Users | 6 and 24 tiles | 4K | 32 frames | Video duration |
| Marginal Region | [36] | Fixed Margin | 1 Video, 10 Users | 6 tiles | 8K | 1s, 2s, 4s | 60s |
| | [2] | Fixed Margin | 3 Videos, 3 Users [6] | 6×4 | 4K | 1s | 1m |
| | [65] | Fixed Margin | 3 Videos, 10 Users [1] | 8×8 | 4K | 1s | Video duration |
| | [57] | Dynamic Margin | 3 Videos, 1 Trace | 4×6 | 4K | 2s | 10s |
| Extended Viewport | [14] | Dynamic Extension | - | - | - | - | - |
| | [17] | Dynamic Extension | 4 Videos, 1 User | 4×6 | 4K | Live | Video duration |
| | [13] | Fixed Extension (15°) | 1 Video | 36×2 | - | 0.1s | Video duration |
| | [24] | Fixed Extension (10°) | 1 Video | 18×36 | - | - | Video duration |
| Dynamic Tiling | [32] | Visual Distortion | 1 Video, 10 Users | 4×3, 6×4, 8×4, 8×8 | 4K | 1s | 60s |
| | [35] | Visual Attention | 7 Videos, 25 Users | Multiple | 8K | - | 10s |
| | [20] | Region Based | 30 Videos, 30 Users | Multiple | HD-4K | - | 60s |
| | [48] | Variable Rectangles | 5 Videos, 58 Users | Multiple | 2K & 4K | - | - |

Table 1. Summary of Tile-based Viewport Adaptive 360° Video Streaming Solutions

3 Proposed Dynamic Tiling-based Architecture

3.1 Dynamic Tiling-based System Architecture

Figure 1 illustrates the workflow of DFT solutions. On the server side, the 360° video is pre-processed by dividing it into a number of segments, i.e., \(\mathcal {S}=\lbrace \mathcal {S}(1), \mathcal {S}(2), \ldots , \mathcal {S}(i), \ldots , \mathcal {S}(I)\rbrace\). Each segment is then divided into \(l\) tiling layouts, i.e., \(\mathcal {T}_l(i), \forall l \in \lbrace x, y, z\rbrace\), containing a small, medium, and large number of tiles, respectively. Each tiling layout is further divided into a number of tiles, i.e., \(\mathcal {T}_l=\lbrace \mathcal {T}_l^{1}(i), \mathcal {T}_l^{2}(i), \ldots , \mathcal {T}_l^{k}(i), \ldots , \mathcal {T}_l^{K}(i)\rbrace\). These tiles are then encoded at a number of different bitrates, i.e., \(\mathcal {L}_l=\lbrace \mathcal {L}^k_{l,1}(i), \mathcal {L}^k_{l,2}(i), \ldots , \mathcal {L}^k_{l,j}(i), \ldots , \mathcal {L}^k_{l,J}(i)\rbrace\), where \(\mathcal {L}^{k}_{l,j}(i)\) represents the \(j\)th bitrate of the \(k\)th tile in the \(l\)th tiling layout of the \(i\)th segment.
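For illustration, the following minimal Python sketch (our own, not part of the original system) shows one way to organize this server-side index of segments, tiling layouts, tiles, and per-tile bitrate ladders. The layout shapes match the evaluation setup, while the segment count and bitrate values are assumed placeholders.

```python
LAYOUTS = {"x": (4, 3), "y": (6, 4), "z": (8, 6)}    # (columns, rows), as in the evaluation
NUM_SEGMENTS = 100                                   # I, an assumed placeholder
BITRATE_LADDER = [0.6, 1.2, 2.4, 4.8, 7.0]           # J per-tile bitrates in Mbps, illustrative

def build_media_index():
    """index[i][l][k] -> list of bitrates L_{l,j}^k(i) available for tile k."""
    index = {}
    for i in range(NUM_SEGMENTS):
        index[i] = {}
        for l, (cols, rows) in LAYOUTS.items():
            num_tiles = cols * rows                  # K tiles in layout l
            index[i][l] = {k: list(BITRATE_LADDER) for k in range(num_tiles)}
    return index

media = build_media_index()
print(len(media[0]["z"]))  # 48 tiles in the 8 x 6 layout
```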
Fig. 1.
Fig. 1. The proposed 360° client-server streaming architecture.
The DFT clients, which control the adaptive streaming operations, need to know in advance about the available tiling layouts on the server side. DFT2 performs tiling layout selection before determining the streaming tiles and bitrate allocations during each adaptation interval. The tiling layout selection module in DFT2 checks the overlap between the actual and predicted viewport areas during the previous segment. The streaming tile selection module selects sets of tiles for different priority regions (i.e., viewport (\(\mathcal {T}_{l}^{v}(i)\)), marginal (\(\mathcal {T}_{l}^{m}(i)\)), and background (\(\mathcal {T}_{l}^{b}(i)\))) based on the predicted viewport coordinates for each segment. This helps to ensure that the video is able to adapt to the viewer’s movements and maintain a high level of quality by pre-downloading tiles that are most likely to be watched. The tile bitrate adaptation unit then selects appropriate bitrates for each tile based on the associated region and the available network capacity. DFT1, on the other hand, first calculates the streaming regions and relevant bitrates for each tiling layout. It then selects the tiling layout that results in the highest-weighted-area-based visual quality score in each adaptation interval. The segment request is then sent, and upon receiving the segments, the client decodes and reconstructs the requested views similar to fixed tiling-based views in the post-processing phase with no additional decoding overhead. The requested content is then presented to the user.

3.2 Problem Definition

In 360° adaptive video streaming, it is important to consider the user’s quality expectations, which depend largely on the quality of the visible area. Even if the viewport tiles are played at higher-quality levels, the intra- and inter-segment quality oscillations may not satisfy the user. The QoE metric used in this context includes viewport quality and spatial and temporal smoothness factors, as well as the risk of playback buffer issues.
Viewport Quality: The user is able to visualize only certain tiles during 360° video playback. The viewport quality reflects how much a user is satisfied with the visual perception. The client can be presented with any visual quality representation, but the average quality levels of the viewport tiles are highly correlated with the average bitrate that is actually consumed by the viewer. Therefore, by averaging the quality of the actual viewport tiles in segment \((i)\), for the \(l\)th tiling layout, the viewport quality is given as follows [37, 63]:
\begin{equation} f_{1}(i)=\frac{\sum _{k \in \mathcal {T}_{l}^{\hat{v}}(i)}\sum _{j \in \mathcal {L}_l}\mathcal {Q}(\mathcal {L}^{k}_{l,j}(i))}{|\mathcal {T}_{l}^{\hat{v}}(i)|}, \end{equation}
(1)
where \(\mathcal {T}_{l}^{\hat{v}}(i)\) represents the actual viewport tiles set in the \((i)\)th segment and \(|\mathcal {T}_{l}^{\hat{v}}(i)|\) indicates the cardinality of the set. \(\mathcal {Q}(\mathcal {L}^{k}_{l,j}(i))\) maps the \(j\)th bitrate of the \(k\)th tile to the particular video quality level.
Temporal Quality Oscillations: The inter-segment quality switches can reduce the "sense of being there" in an immersive environment. This may happen not only because of network fluctuations but also due to head movement prediction errors. The user's experience can be impaired by physiological symptoms such as dizziness and headache when observing frequent visual disparity [41]. Therefore, the inter-segment quality fluctuations should not be drastic; they can be calculated as the difference between the observed viewport quality levels of two consecutive segments [37, 63]:
\begin{equation} f_{2}(i)=|f_{1}(i)-f_{1}(i-1)|. \end{equation}
(2)
Spatial Quality Oscillations: Visible tiles with different quality levels lead to a disturbed perception. Cybersickness, viewing irritation, nausea, fatigue, and aversion [11] can be driven by inconsistent quality levels within the viewport. Compared to regular 2D videos, if the perceived quality of 360° tiles is not smooth, the overall QoE is reduced. Following [19], we measured the spatial quality oscillations according to the coefficient of variation (CV) of the viewport tiles' quality:
\begin{equation} f_{3}(i)=\frac{\sigma (\mathcal {Q}(\mathcal {L}^{k}_{l,j}(i)))}{\mu (\mathcal {Q}(\mathcal {L}^{k}_{l,j}(i)))},\quad \forall k \in \mathcal {T}_{l}^{\hat{v}}(i),\ \forall j \in \mathcal {L}_l. \end{equation}
(3)
The standard deviation of the viewport quality samples is in the numerator, and the mean of the samples is in the denominator.
Playback Buffer Risk: A large buffer capacity may not be efficient for 360° video streaming because of the constantly changing FoV during playback [9, 33]. Pre-buffering high-quality tiles can be risky, as the user's FoV may shift at the time of playback. Instead of relying on the traditional playback discontinuity metric under short-term viewport prediction, it is more beneficial to directly assess risky buffer events based on the available connection bandwidth and the selected video bitrates. This can be expressed as follows [45]:
\begin{equation} f_{4}(i) = \begin{cases} 1, & \text{if } \widehat{B}(i) < \sum _{k\in \mathcal {T}_{l}(i)}\mathcal {L}^k_{l,j}(i) \\ 0, & \text{otherwise} \end{cases}, \end{equation}
(4)
where \(\widehat{B}(i)\) represents the available bandwidth budget for the \((i)\)th segment.
Following the principle behind the QoE metric for traditional video [26], some works [37, 62, 63] consider video quality, quality variations, rebuffering events, and so forth to model a QoE metric for 360° videos. The user-perceived QoE for each 360° segment is defined by a weighted summation formulation:
\begin{equation} QoE(i)=\alpha \times f_{1}(i) - \beta \times f_{2}(i) - \gamma \times f_{3}(i) - \delta \times f_{4}(i), \end{equation}
(5)
where \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\) are the parameters indicating how much importance a user gives to video bitrate, temporal and spatial quality variances, and rebuffering risk, respectively. As users do not want to experience quality fluctuations and rebuffering events, the terms \(f_{2}(i)\), \(f_{3}(i)\), and \(f_{4}(i)\) enter the QoE with negative signs.
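To make Equations (1) through (5) concrete, here is a small Python sketch (our illustration; the weight values are placeholders, not the paper's settings) that evaluates the per-segment QoE from the viewport tile qualities, the previous segment's viewport quality, and the bandwidth check of Equation (4).

```python
import statistics

def viewport_quality(qualities):                     # Equation (1)
    """Mean quality level over the actually viewed tiles."""
    return sum(qualities) / len(qualities)

def temporal_oscillation(f1_now, f1_prev):           # Equation (2)
    return abs(f1_now - f1_prev)

def spatial_oscillation(qualities):                  # Equation (3): coefficient of variation
    mu = statistics.mean(qualities)
    return statistics.pstdev(qualities) / mu if mu else 0.0

def buffer_risk(bandwidth, requested_bitrates):      # Equation (4)
    return 1 if bandwidth < sum(requested_bitrates) else 0

def qoe(qualities, f1_prev, bandwidth, requested,
        alpha=1.0, beta=1.0, gamma=1.0, delta=1.0):  # Equation (5); placeholder weights
    f1 = viewport_quality(qualities)
    return (alpha * f1
            - beta * temporal_oscillation(f1, f1_prev)
            - gamma * spatial_oscillation(qualities)
            - delta * buffer_risk(bandwidth, requested))

# Viewport tiles at quality levels 4, 4, 3, 4; previous viewport quality 3.5;
# 8 Mbps available against 6.5 Mbps requested for the whole segment.
print(qoe([4, 4, 3, 4], 3.5, 8.0, [1.5, 1.5, 1.0, 1.0, 0.75, 0.75]))
```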
Accurate evaluation of QoE is essential for optimizing the performance of traditional, multimedia [58], and immersive video content. The level of satisfaction a user experiences while watching a VR video is determined by how long they feel immersed in the scene. The proposed clients aim to select optimal bitrates for each segment in a dynamic tiling streaming system in order to maximize the user’s long-term QoE reward. The mathematical problem formulation is as follows:
Problem:
\begin{equation} \max \sum _{i\in \mathcal {S}} QoE(i) \end{equation}
(6)
The proposed solutions solve this problem by implementing a three-tier adaptation mechanism. First, they select a relevant tiling layout for each segment. Next, DFT solutions dynamically perform the viewing area selection based on the two viewport prediction mechanisms to predict the most likely to be watched tiles. Finally, the tile bitrate adaptation mechanism improves the bitrate budget distribution between different tile groups. These mechanisms are elaborated on in the next section.
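The control flow of this three-tier mechanism can be summarized with the following hypothetical Python skeleton; `ToyDFTClient` and its trivial method bodies are our stand-ins for Algorithms 1 through 3 described in Section 4, not the authors' pseudocode.

```python
class ToyDFTClient:
    """Hypothetical stand-in: each method abbreviates one adaptation tier."""

    def select_layout(self, i):
        # Tier 1 (Section 4.1): pick a tiling layout per segment.
        return "z" if i == 0 else "y"

    def select_tiles(self, i, layout):
        # Tier 2 (Section 4.2): classify tiles into priority regions.
        return {0, 1}, {2}, {3, 4, 5}

    def allocate_bitrates(self, i, layout, viewport, marginal, background):
        # Tier 3 (Section 4.3): distribute the bitrate budget per region.
        return {k: 2.0 if k in viewport else 1.0 if k in marginal else 0.5
                for k in viewport | marginal | background}

client = ToyDFTClient()
for i in range(3):                                   # stream three segments
    layout = client.select_layout(i)
    vp, mg, bg = client.select_tiles(i, layout)
    print(i, layout, client.allocate_bitrates(i, layout, vp, mg, bg))
```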

4 Proposed Dynamic Tiling-based Adaptation Algorithms

This section presents the adaptation algorithms for DFT1 and DFT2 streaming clients.

4.1 DFT Tiling Layout Selection Algorithms

Tile-based encoding brings several opportunities, such as efficient video coding [40], improved quality distribution, and parallel [25] and partial decoding [5], for VR video applications. The choice of the appropriate tiling layout, which reflects the spatial partitioning of frame areas, impacts the overall video compression performance. In 360° video, the polar regions have higher viewing distortions and a lower viewing probability than the equator regions when transforming a spherical representation into a two-dimensional planar format, i.e., the equirectangular projection. Therefore, encoding polar areas with more pixels consumes the user's limited bandwidth to transmit data related to less relevant image regions. Fixed tiling solutions encode polar and equator regions at similar bitrate levels, leading to unattractive viewport boundaries and missing compression opportunities. Employing a smaller number of tiles (i.e., larger-resolution tiles) can improve the compression performance in some cases. Yet, at the same time, it may include unnecessary higher-quality portions outside the viewport [35]. Conversely, smaller-resolution tiles can reduce the number of redundant pixels [45]; however, they may also cause visual distortions such as flickering, floating, and blurring at the edges of the tiles [8]. Finding ways to dynamically select the most appropriate tiling layout for a given viewing scenario and preferences is an important area of research. By developing smart techniques that can take these factors into account and adjust the tiling layout accordingly, it may be possible to improve the overall viewing experience. Therefore, the proposed solution considers two tiling layout selection methods to lower redundant data transmission and facilitate fine-grained visual perception for different motion content.
DFT1: The proposed DFT1 solution decides an optimal tiling layout during each adaptation interval based on the observed visual quality scores. Since the user's gaze point is mostly located around the center of the viewport [27, 43], the viewpoint quality should have a higher priority compared to other tiles. Therefore, we design a priority-assisted visual quality metric to adaptively select the suitable tiling layout during 360° video streaming. In this context, DFT1 assigns different priority weights to the viewport tiles such that the tiles closer to the viewpoint have a higher priority than the other tiles. The tiles are ranked based on how far they are located from the viewpoint. The priority weights are assigned such that the most important parts of the image, as determined by their proximity to the center of the viewer's focus, are rendered with the highest quality, while less important parts of the image are rendered with lower quality. In this context, the highest and lowest weights are allocated to the mapped quality of the viewpoint tile and the last tile, respectively, in the sorted tile set. The weighted quality metric is given in Equation (7):
\begin{equation} \mathcal {WQ}_{l}^{v}(i) = \frac{\sum _{k=1}^{|\mathcal {T}_{l}^{v}(i)|}\sum _{j=1}^{J} 2^{|\mathcal {T}_{l}^{v}(i)|-k}\times \mathcal {Q}(\mathcal {L}^{k}_{l,j}(i))}{2^{|\mathcal {T}_{l}^{v}(i)|}-1}, \end{equation}
(7)
where the quantity \(|\mathcal {T}_{l}^{{v}}(i)|\) represents the number of tiles in the set of tiles predicted to be within the viewport, and \(\mathcal {Q}(\mathcal {L}^{k}_{l,j}(i))\) maps the video bitrate to a specific quality level. Since we consider the extended viewport case, elaborated in Section 4.2, the visual area can differ between tiling layouts; for instance, an extended viewport with \(\mathcal {T}_{x}^{{v}}(i)\) could cover a larger region than an extended viewport with \(\mathcal {T}_{z}^{{v}}(i)\). Therefore, we define the visual-area-based weighted video quality metric, which balances the visual area and the weighted quality and is given in Equation (8):
\begin{equation} \mathcal {VQ}_{l}^{v}(i) = \frac{|\mathcal {T}_{l}^{v}(i)|}{|\mathcal {T}_{l}(i)|}\times \mathcal {WQ}_{l}^{v}(i), \end{equation}
(8)
where \(|\mathcal {T}_{l}(i)|\) represents the total number of tiles in the tiling layout \(l\). The tiling layout selection procedure for DFT1 is given as follows:
(1)
For each tiling layout:
Perform streaming tile selection and identify the streaming case using Algorithm 2.
Perform bitrate adaptation for the tile groups of the selected case using Algorithm 3.
Compute the prioritized visual-area-based quality scores using Equations (7) and (8).
(2)
Stream the tiles from the tiling layout that results in the highest visual quality score (a sketch of this procedure follows below).
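As a sketch of the procedure above, the following Python fragment computes Equations (7) and (8) for each candidate layout and picks the best one; the per-layout quality levels and tile counts in the example are invented purely for illustration.

```python
def weighted_quality(sorted_qualities):
    """Equation (7): exponentially decaying weights over the viewport tiles,
    sorted by distance from the viewpoint (closest tile first)."""
    n = len(sorted_qualities)
    num = sum(2 ** (n - k) * q for k, q in enumerate(sorted_qualities, start=1))
    return num / (2 ** n - 1)

def visual_area_quality(sorted_qualities, total_tiles):
    """Equation (8): weight the score by the fraction of the frame covered."""
    return len(sorted_qualities) / total_tiles * weighted_quality(sorted_qualities)

# Invented per-layout viewport qualities and total tile counts (x: 4x3, y: 6x4, z: 8x6).
candidates = {"x": ([4, 4, 3], 12), "y": ([4, 3, 3, 3, 2], 24), "z": ([3, 3, 3, 2, 2, 2, 1], 48)}
best = max(candidates, key=lambda l: visual_area_quality(*candidates[l]))
print(best)  # the layout with the highest visual-area-based weighted quality
```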
DFT2: DFT2 decides an optimal tiling layout based on the viewport prediction performance. Unlike DFT1, which is based on visual area, DFT2 measures the closeness between actual and predicted viewport tile sets in terms of viewport overlap to select the appropriate tiling layout for the next segment. Let \(\mathcal {O}(i-1)\) denote the overlap percentage of the actual and predicted viewport tiles for the \((i-1)\)th segment; it is given as [29]
\begin{equation} \mathcal {O}(i-1) = \frac{|\mathcal {T}_{l}^{\hat{v}}(i-1) \cap \mathcal {T}_{l}^{v}(i-1)|}{|\mathcal {T}_{l}^{\hat{v}}(i-1)|}\times 100. \end{equation}
(9)
Algorithm 1 details the tiling layout selection procedure in DFT2. As no information is available at the start, the tiling layout with a larger number of tiles (\(\mathcal {T}_z(i)\)) is selected for the first segment (lines 1–2). If there is no overlap between actual and predicted viewing tiles, then the tiling layout with a smaller number of tiles (\(\mathcal {T}_x(i)\)) is selected for the \((i)\)th segment to deal with fast head rotations (lines 3–4). If the actual and predicted viewports perfectly overlapped during the previous segment, the smallest-resolution tiles are selected to reduce the number of imperceptible pixels delivered outside the viewport region (lines 5–6). If the actual and predicted viewports partially overlapped during the playback of the previous segment, medium-resolution tiles, represented as \(\mathcal {T}_y(i)\), are streamed for the next segment (lines 7–8). DFT solutions do not involve complex frame partitioning and ensure a flexible uniform tiling structure without any modifications of existing video coding and stream processing tools, which makes them attractive for adoption in on-demand and live streaming scenarios. DFT1 is a scalable solution that can work with any number of tiling layouts. It is also practical for both simulation and real-time environments.
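A minimal Python rendering of Equation (9) and the Algorithm 1 logic just described might look as follows; tile sets are represented as plain Python sets, and the branches follow the no/partial/perfect overlap cases above.

```python
def viewport_overlap(actual, predicted):
    """Equation (9): percentage of actually viewed tiles that were predicted."""
    return 100.0 * len(actual & predicted) / len(actual)

def select_layout_dft2(i, actual_prev=None, predicted_prev=None):
    """Sketch of Algorithm 1: map last segment's overlap to the next layout."""
    if i == 0:
        return "z"                   # no feedback yet: largest number of tiles (8x6)
    overlap = viewport_overlap(actual_prev, predicted_prev)
    if overlap == 0:
        return "x"                   # prediction missed entirely: fewest tiles (4x3)
    if overlap == 100:
        return "z"                   # perfect prediction: smallest-resolution tiles
    return "y"                       # partial overlap: medium layout (6x4)

print(select_layout_dft2(5, actual_prev={1, 2, 5}, predicted_prev={2, 5, 6}))  # 'y'
```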

4.2 DFT Streaming Tile Selection Algorithm

The ability to choose the best-fit tiles in response to the user’s unpredictable head movements is one of the fundamental criteria for 360° video applications. The prediction accuracy of current streaming solutions based on a single viewport prediction technique can decrease when predicting longer in the future. To adaptively encompass the real viewing region, this work employs two viewpoint/viewport prediction techniques. It’s interesting to note that, in the majority of cases, the naive prediction model (using the current coordinates as predicted points) outperforms more sophisticated models [10]. The primary viewport tile set (\(\mathcal {T}_{l}^{vn}(i)\)) contains the viewport tiles actually watched by the user during the previous segment. The secondary viewport tile set (\(\mathcal {T}_{l}^{vs}(i)\)) is computed using a spherical walk approach described in [15].
Algorithm 2 aims to find appropriate tiles for the viewport, marginal, and background regions, respectively. The tile identification and selection are dynamically performed for each adaptation interval. Algorithm 2 takes as input the tile set \(\mathcal {T}_{l}(i)\) with tiling layout \(l\) for the \((i)\)th segment, the primary predicted viewport tile set \(\mathcal {T}_{l}^{vn}(i)\), and the secondary predicted viewport tile set \(\mathcal {T}_{l}^{vs}(i)\). It outputs the estimated viewport tile set \(\mathcal {T}_{l}^v(i)\), the estimated marginal tile set \(\mathcal {T}_{l}^{m}(i)\), and the estimated background tile set \(\mathcal {T}_{l}^b(i)\). The algorithm first determines the viewport tile set based on the intersection between the primary and secondary predicted viewport tile sets. If the primary and secondary predicted viewport tile sets are disjoint, then the viewport tile set is the union of the primary and secondary predicted viewport tile sets. Otherwise, the primary predicted viewport tile set is assigned to the viewport tile set. Next, the algorithm determines the marginal tile set: if the intersection of the primary and secondary viewport sets is empty, then the marginal tile set is empty; otherwise, the marginal tile set is the difference between the secondary predicted viewport tile set and the primary predicted viewport tile set. Finally, the algorithm determines the background tile set by checking each tile against the viewport and marginal tile sets: all the tiles that do not belong to the viewport or marginal tile sets are added to the background tile set. Figures 2 and 3 illustrate the tile selection cases in DFT2 based on the output of Algorithm 1 for two consecutive segments. The black rectangle represents the primary predicted viewport, while the blue rectangle represents the secondary predicted viewport. The potential viewport tiles are represented by a purple window, whereas the marginal and background tiles are marked in light green and brown, respectively.
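Since Algorithm 2 reduces to a few set operations on the two predicted viewport tile sets, it can be sketched in Python as follows; this is a hedged illustration consistent with the description above, not the authors' exact pseudocode.

```python
def classify_tiles(all_tiles, primary_vp, secondary_vp):
    """Sketch of Algorithm 2. primary_vp: tiles watched during the previous
    segment (naive prediction); secondary_vp: spherical-walk prediction [15]."""
    if primary_vp.isdisjoint(secondary_vp):
        viewport = primary_vp | secondary_vp     # Case 3: extended viewport, no margin
        marginal = set()
    else:
        viewport = primary_vp                    # Case 1 or 2: fixed viewport
        marginal = secondary_vp - primary_vp     # empty in Case 1, non-empty in Case 2
    background = all_tiles - viewport - marginal
    return viewport, marginal, background

tiles = set(range(24))                           # e.g., the 6 x 4 layout
print(classify_tiles(tiles, {7, 8, 13, 14}, {8, 9, 14, 15}))
```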
Fig. 2.
Fig. 2. Tile selection cases in DFT2 for \(\mathcal {T}_z(i)\) tiling layout of \((i)\)th segment.
Fig. 3.
Fig. 3. Tile selection cases in DFT2 for \(\mathcal {T}_x(i+1)\) tiling layout of \((i+1)\)th segment.

4.3 DFT Tile Bitrate Adaptation Algorithm

Adaptive streaming players usually maintain a large buffer space for regular 2D videos to absorb the uneven motions in video scenes and playback interruptions. However, for 360° videos, a large buffer capacity is not encouraged due to FoV dynamics. In practice, for 360° tiled video streaming, the buffer should be as small as possible (usually two segments [15]) to accommodate the new chunks in response to the user movements within the immersive video. Algorithm 3 takes into account both the predicted tiles and network conditions to more accurately adjust the video quality for a smoother viewing experience. This algorithm is specifically designed for dynamic tiling-based 360° video streaming. Both DFT1 and DFT2 clients employ the same bitrate adaptation algorithm to decide the suitable bitrates for tiles.
In the absence of buffer consideration, accurate bandwidth estimation is crucial to achieving higher playback performance [53]. An over/under-estimation of the available bandwidth can result in frequent rebuffering/lower-quality playback. Following [28], the bandwidth for the \((i)\)th segment is computed as follows:
\begin{equation} \widehat{B}(i)=\frac{\sum _{\forall k, j}\mathcal {L}_{l,j}^k(i-1)\times \tau }{\mathcal {D}(i-1)}, \end{equation}
(10)
where \(\mathcal {L}_{l,j}^k(i-1)\) represents the bitrate of the previous segment, \(\tau\) is the playback duration of the segment, and \(\mathcal {D}(i-1)\) represents the download time of the \((i-1)\)th segment. The proposed bitrate allocation algorithm considers aggressive, weighted, and conservative quality adjustments for different tile selection cases to improve the corresponding bitrate choice for each tile that the network can support. For tile selection Case 1, an aggressive quality adjustment is performed for viewport tiles. The algorithm performs a weighted quality adjustment if the marginal region is non-empty (Case 2 of Algorithm 2). A relatively conservative bitrate selection is performed for Case 3, where the viewport region is extended to lower the viewport mismatch while sacrificing the quality.
Algorithm 3 determines the bitrate selection for the tiles belonging to different priority regions calculated in Section 4.2. The input to the algorithm consists of various sets of video tiles (viewport, marginal, and background tiles), the number of tiles in the viewport and marginal regions, the available bandwidth for each segment of the video, and initial priority weights for the viewport and marginal tiles. The output of the algorithm is the selected bitrates for each tile in each segment of the video. The playback adaptation is performed for each segment after the previous segment has been fully downloaded. The algorithm begins by checking if the available bandwidth is less than or equal to the sum of the lowest bitrate options for all tiles in the current segment. If this is the case, the lowest bitrate is selected for all tiles (lines 1–2). If the available bandwidth is greater than or equal to the sum of the highest bitrate options for all tiles, the highest bitrate is selected for all tiles (lines 3–4). In other cases, the algorithm sets the bitrate for all tiles to the lowest bitrate option and calculates the remaining available bandwidth (lines 6–7). If there are tiles in the marginal region (i.e., \(\mathcal {T}_{l}^{m}(i) \ne \emptyset\)), the algorithm updates the priority weights for the viewport and marginal tiles (lines 9–10). The priority weights are determined based on the number of tiles in the viewport and marginal regions, with the viewport tiles being given higher priority. The viewport and marginal tiles (only possible in Case 2) are then allocated bandwidth based on the computed weights (lines 11–12). Next, the highest possible bitrates for the viewport and marginal tiles are chosen based on the available bandwidth for each region (lines 13–14). This ensures the weighted quality adaptation for viewport and marginal tiles. If there are no marginal tiles, then for Case 1 or Case 3 of Algorithm 2, an aggressive or relatively conservative quality allocation is considered for viewport tiles to ensure visual smoothness. After determining the bitrates for the viewport and marginal tiles, the bandwidth for the background tiles is calculated by subtracting the sum of these bitrates from the revised overall bandwidth budget (line 15). Finally, the bitrate of the background tiles is also increased, as long as it does not exceed the available bandwidth budget (line 16).
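The following Python sketch illustrates the flavor of Equation (10) and Algorithm 3. The bitrate ladder and the region weights `w_vp`/`w_mg` are illustrative assumptions, and the per-region split is a simplification of the weighted adjustment described above; the line numbers in the comments refer to Algorithm 3.

```python
BITRATES = [0.6, 1.2, 2.4, 4.8]   # per-tile ladder in Mbps (illustrative)

def estimate_bandwidth(prev_segment_mbps, tau, download_time):
    """Equation (10): throughput observed while fetching segment i-1."""
    return prev_segment_mbps * tau / download_time

def best_level(per_tile_budget):
    """Highest ladder entry affordable within the per-tile budget."""
    affordable = [r for r in BITRATES if r <= per_tile_budget]
    return affordable[-1] if affordable else BITRATES[0]

def allocate(viewport, marginal, background, bandwidth, w_vp=0.7, w_mg=0.3):
    tiles = viewport | marginal | background
    if bandwidth <= BITRATES[0] * len(tiles):    # lines 1-2: floor everything
        return {k: BITRATES[0] for k in tiles}
    if bandwidth >= BITRATES[-1] * len(tiles):   # lines 3-4: ceiling everything
        return {k: BITRATES[-1] for k in tiles}
    rates = {k: BITRATES[0] for k in tiles}      # lines 6-7: start from the floor
    budget = bandwidth - sum(rates.values())
    if marginal:                                 # Case 2: weighted adjustment
        vp_share, mg_share = budget * w_vp, budget * w_mg
    else:                                        # Case 1/3: residual to the viewport
        vp_share, mg_share = budget, 0.0
    for k in viewport:
        rates[k] = best_level(BITRATES[0] + vp_share / len(viewport))
    for k in marginal:
        rates[k] = best_level(BITRATES[0] + mg_share / len(marginal))
    leftover = bandwidth - sum(rates.values())   # line 15: background budget
    if background and leftover / len(background) >= BITRATES[1]:
        for k in background:                     # line 16: lift the background
            rates[k] = BITRATES[1]
    return rates

print(allocate({0, 1}, {2}, set(range(3, 24)), bandwidth=18.0))
```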

5 Experimental Evaluation

This section presents the experimental evaluations of our proposed solutions using a diverse range of content and network conditions.

5.1 Experimental Setup

The proposed solutions are evaluated by modifying a VR player provided by [15] on a machine with an Intel Core i7-7500U CPU and 16 GB of memory running Ubuntu 16.04. In the experiments, the VR player retrieves 360° video segments from an HTTP server while the connection speed between the VR player and the HTTP server is varied, as illustrated in Figure 4. Bandwidth trace 1 has more irregular increasing and decreasing trends than bandwidth trace 2. The maximum connection speed for trace 1 is 20 Mbps, while for trace 2 the maximum bandwidth value is 12 Mbps.
Fig. 4.
Fig. 4. Bandwidth traces employed in experiments.

5.1.1 Content Pre-processing.

This work employs a highly cited open-source video and head movement dataset captured by Wu et al. [47]. The dataset contains real head movement patterns of 48 unique VR users viewing 18 long-duration videos in two learning-based testing sessions using an HTC Vive headset with a field of view of 110°. In the first experiment, participants were asked to explore the content without paying too much attention to the specifics of what they were looking at. In the second experiment, on the other hand, they were asked to focus on the content and pay close attention to it, simulating certain behaviors or habits. We choose five videos, namely, LOSC Football (experiment 1), Weekly Idol-Dancing (experiment 2), Google Spotlight-HELP (experiment 1), GoPro VR-Tahiti Surf (experiment 1), and Rio Olympics VR Interview (experiment 2), from this dataset. This is in line with the recommendations of ITU-T Rec. P.913 [18] and is typical for research and development solution evaluations. The five immersive clips, of different durations, can be classified into four categories: Sport (LOSC Football and GoPro VR-Tahiti Surf), Performance (Weekly Idol-Dancing), Film (Google Spotlight-HELP), and Talkshow (Rio Olympics VR Interview). These videos are referred to as Football, Performance, Spotlight, Surfing, and VR Interview throughout the rest of the article. Table 2 summarizes the content features of the five videos. All of the videos were resized to 4K resolution using FFmpeg. Following [12], we spatially split the 360° videos into 4 \(\times\) 3, 6 \(\times\) 4, and 8 \(\times\) 6 tiling layouts; that work suggests that the 6 \(\times\) 4 tiling structure results in an optimal tradeoff between viewport availability, bitrate overhead, and bandwidth requirements. The video tiles were encoded using the open-source Kvazaar encoder with five different quantization parameter (QP) values: 22, 27, 32, 37, and 42. Considering the experimental recommendations for selecting segment duration for viewport adaptive streaming [7, 38], MPEG-DASH video segments of three different durations (1s, 1.5s, and 2s) were generated using GPAC MP4Box. The playback buffer was set to two segments for each experiment. The average segment sizes for each video are shown in Table 3. The simulation length was set according to the duration of each video.
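For reproducibility, a content pre-processing pass along these lines could be scripted as below. The exact command-line flags for FFmpeg, Kvazaar, and MP4Box are assumptions to be checked against each tool's documentation, and all file names are placeholders.

```python
import subprocess

def prepare(video, qp=27, tiles="6x4", seg_ms=1000):
    """Hypothetical pre-processing pass for one video, QP, and tiling layout.
    All CLI flags below are assumptions; verify against each tool's docs."""
    subprocess.run(["ffmpeg", "-i", video, "-vf", "scale=3840:2160",
                    "-pix_fmt", "yuv420p", "resized.yuv"], check=True)   # resize to 4K raw YUV
    subprocess.run(["kvazaar", "-i", "resized.yuv", "--input-res", "3840x2160",
                    "--qp", str(qp), "--tiles", tiles,
                    "-o", "tiled.hvc"], check=True)                      # HEVC tile encoding
    subprocess.run(["MP4Box", "-add", "tiled.hvc", "tiled.mp4"], check=True)
    subprocess.run(["MP4Box", "-dash", str(seg_ms), "tiled.mp4"], check=True)  # DASH segments

# e.g., prepare("football.mp4", qp=32, tiles="8x6", seg_ms=1500)
```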
Table 2.

| Videos | Category | Duration | Resolution | FPS |
|---|---|---|---|---|
| Football | Sport | 2′44″ | 3840 × 2160 | 25 |
| Performance | Performance | 4′38″ | 3840 × 1920 | 29 |
| Spotlight | Film | 4′53″ | 3840 × 2160 | 30 |
| Surfing | Sport | 3′25″ | 3840 × 1920 | 29 |
| VR-Interview | Talkshow | 3′07″ | 3840 × 1920 | 25 |

Table 2. Content Characteristics
Table 3.

| Video | QP | 4×3 (1s) | 6×4 (1s) | 8×6 (1s) | 4×3 (1.5s) | 6×4 (1.5s) | 8×6 (1.5s) | 4×3 (2s) | 6×4 (2s) | 8×6 (2s) |
|---|---|---|---|---|---|---|---|---|---|---|
| Football | 22 | 6.9 ± 2.3 | 7.0 ± 2.3 | 7.2 ± 2.3 | 10.5 ± 5.1 | 10.6 ± 5.1 | 10.9 ± 5.2 | 13.8 ± 4.6 | 14.1 ± 4.6 | 14.4 ± 4.6 |
| | 27 | 3.5 ± 1.3 | 3.6 ± 1.4 | 3.8 ± 1.4 | 5.3 ± 2.8 | 5.5 ± 2.8 | 5.7 ± 2.9 | 7.1 ± 2.7 | 7.3 ± 2.7 | 7.6 ± 2.7 |
| | 32 | 1.9 ± 0.8 | 2.0 ± 0.8 | 2.2 ± 0.8 | 2.9 ± 1.6 | 3.1 ± 1.6 | 3.3 ± 1.6 | 3.9 ± 1.5 | 4.1 ± 1.5 | 4.5 ± 1.6 |
| | 37 | 1.1 ± 0.4 | 1.2 ± 0.4 | 1.4 ± 0.4 | 1.7 ± 0.9 | 1.8 ± 0.9 | 2.1 ± 1.0 | 2.3 ± 0.9 | 2.4 ± 0.9 | 2.8 ± 0.9 |
| | 42 | 0.7 ± 0.2 | 0.7 ± 0.2 | 0.9 ± 0.2 | 1.0 ± 0.5 | 1.1 ± 0.5 | 1.4 ± 0.6 | 1.3 ± 0.5 | 1.5 ± 0.5 | 1.8 ± 0.5 |
| Performance | 22 | 8.5 ± 2.9 | 8.6 ± 2.9 | 8.9 ± 3.0 | 12.8 ± 5.9 | 13.0 ± 5.9 | 13.4 ± 6.0 | 17.0 ± 4.7 | 17.3 ± 4.7 | 17.8 ± 4.8 |
| | 27 | 4.6 ± 1.7 | 4.7 ± 1.7 | 5.0 ± 1.7 | 6.9 ± 3.3 | 7.1 ± 3.3 | 7.5 ± 3.4 | 9.3 ± 2.7 | 9.5 ± 2.7 | 10.0 ± 2.7 |
| | 32 | 2.6 ± 0.9 | 2.7 ± 0.9 | 2.9 ± 0.9 | 4.0 ± 1.9 | 4.1 ± 1.9 | 4.5 ± 2.0 | 5.3 ± 1.5 | 5.5 ± 1.5 | 6.0 ± 1.5 |
| | 37 | 1.6 ± 0.5 | 1.7 ± 0.5 | 1.9 ± 0.5 | 2.4 ± 1.1 | 2.5 ± 1.1 | 2.8 ± 1.2 | 3.2 ± 0.9 | 3.4 ± 0.8 | 3.8 ± 0.9 |
| | 42 | 0.9 ± 0.3 | 1.0 ± 0.3 | 1.2 ± 0.3 | 1.4 ± 0.6 | 1.6 ± 0.6 | 1.9 ± 0.7 | 1.9 ± 0.5 | 2.1 ± 0.5 | 2.5 ± 0.5 |
| Spotlight | 22 | 13.6 ± 8.8 | 13.9 ± 8.8 | 14.3 ± 8.9 | 20.4 ± 15.0 | 20.9 ± 15.2 | 21.5 ± 15.4 | 27.1 ± 17.1 | 27.7 ± 17.2 | 28.5 ± 17.3 |
| | 27 | 7.2 ± 5.3 | 7.4 ± 5.3 | 7.7 ± 5.4 | 10.8 ± 8.9 | 11.1 ± 9.0 | 11.6 ± 9.1 | 14.3 ± 10.3 | 14.8 ± 10.4 | 15.5 ± 10.5 |
| | 32 | 4.0 ± 3.1 | 4.2 ± 3.1 | 4.5 ± 3.2 | 6.1 ± 5.2 | 6.3 ± 5.3 | 6.7 ± 5.4 | 8.1 ± 6.1 | 8.4 ± 6.1 | 9.0 ± 6.2 |
| | 37 | 2.3 ± 1.8 | 2.4 ± 1.8 | 2.7 ± 1.8 | 3.5 ± 2.9 | 3.7 ± 3.0 | 4.1 ± 3.1 | 4.7 ± 3.5 | 4.9 ± 3.5 | 5.4 ± 3.5 |
| | 42 | 1.3 ± 0.9 | 1.4 ± 0.9 | 1.6 ± 0.9 | 2.0 ± 1.5 | 2.2 ± 1.6 | 2.5 ± 1.6 | 2.7 ± 1.7 | 2.9 ± 1.8 | 3.3 ± 1.8 |
| Surfing | 22 | 22.7 ± 11.2 | 23.0 ± 11.3 | 23.5 ± 11.4 | 34.0 ± 21.0 | 34.5 ± 21.2 | 35.3 ± 21.4 | 45.3 ± 22.2 | 45.9 ± 22.3 | 46.9 ± 22.5 |
| | 27 | 12.8 ± 6.7 | 13.0 ± 6.8 | 13.4 ± 6.8 | 19.2 ± 12.4 | 19.5 ± 12.5 | 20.2 ± 12.7 | 25.5 ± 13.3 | 26.0 ± 13.4 | 26.8 ± 13.5 |
| | 32 | 7.2 ± 3.9 | 7.4 ± 3.9 | 7.7 ± 3.9 | 10.8 ± 7.1 | 11.1 ± 7.2 | 11.6 ± 7.3 | 14.4 ± 7.7 | 14.7 ± 7.8 | 15.4 ± 7.8 |
| | 37 | 4.0 ± 2.2 | 4.1 ± 2.2 | 4.4 ± 2.2 | 6.0 ± 3.9 | 6.2 ± 4.0 | 6.6 ± 4.1 | 7.9 ± 4.3 | 8.2 ± 4.3 | 8.8 ± 4.3 |
| | 42 | 2.1 ± 1.1 | 2.2 ± 1.1 | 2.5 ± 1.1 | 3.2 ± 2.0 | 3.4 ± 2.1 | 3.7 ± 2.2 | 4.2 ± 2.2 | 4.5 ± 2.2 | 5.0 ± 2.2 |
| VR Interview | 22 | 7.6 ± 1.0 | 7.7 ± 1.1 | 7.8 ± 1.1 | 11.4 ± 4.1 | 11.5 ± 4.2 | 11.8 ± 4.3 | 15.2 ± 2.0 | 15.4 ± 2.0 | 15.7 ± 2.1 |
| | 27 | 3.7 ± 0.7 | 3.8 ± 0.7 | 3.9 ± 0.7 | 5.5 ± 2.1 | 5.7 ± 2.2 | 5.9 ± 2.3 | 7.4 ± 1.3 | 7.6 ± 1.3 | 7.9 ± 1.4 |
| | 32 | 1.7 ± 0.3 | 1.8 ± 0.4 | 2.0 ± 0.4 | 2.6 ± 1.0 | 2.8 ± 1.1 | 3.0 ± 1.2 | 3.5 ± 0.7 | 3.7 ± 0.7 | 4.0 ± 0.8 |
| | 37 | 0.9 ± 0.2 | 1.0 ± 0.2 | 1.2 ± 0.2 | 1.4 ± 0.6 | 1.6 ± 0.6 | 1.8 ± 0.7 | 1.9 ± 0.4 | 2.1 ± 0.4 | 2.5 ± 0.4 |
| | 42 | 0.6 ± 0.1 | 0.7 ± 0.1 | 0.8 ± 0.1 | 0.9 ± 0.3 | 1.0 ± 0.4 | 1.3 ± 0.4 | 1.2 ± 0.2 | 1.4 ± 0.2 | 1.7 ± 0.2 |

Table 3. Average and Standard Deviations of Segment Bitrates (Mbps) for the Football, Performance, Spotlight, Surfing, and VR Interview Videos

5.1.2 Comparative Approaches.

The proposed DFT solutions are compared with a dynamic tiling-based solution (ATS) and fixed tiling-based solutions (UVP, CTF, PBA, and AVR):
(1) ATS [32]: This solution performs adaptive tile selection based on weighted viewport distortions. The tiling layout resulting in the minimum viewport distortion or the maximum viewport bitrate is selected for streaming during each decision interval.
(2) UVP [15]: A straightforward approach that applies uniform quality adaptation per region, with frame areas classified using a spherical-walk user viewport prediction mechanism.
(3) CTF [15]: An extended version of UVP that considers the entire frame as a potential viewing area. Rather than dividing the frame into regions and assigning bitrates evenly across them, this method increases video quality in a per-tile fashion, beginning with the center tiles and working outward toward the edges.
(4) PBA [16]: This highly cited approach divides tiles into three zones: \(Z_1\) (viewport center tile), \(Z_2\) (surrounding tiles), and \(Z_3\) (background tiles). Priority-based bitrate adaptation is applied to tiles within these zones while considering the available bandwidth budget.
(5) AVR [36]: One of the early approaches that enables efficient resource use while maintaining high playback quality by dividing 360° frames into viewport, adjacent, and outside regions.

5.1.3 Evaluation Metrics.

The performance of the proposed and comparative schemes is assessed in terms of the following metrics:
(1) Streaming Behavior: We evaluate how DFT1 and DFT2 switch between tiling layouts and how they adopt the tile selection and bitrate adaptation scenarios. We also show how the ATS client switches between the available tiling layouts in each streaming session.
(2) Tile Overlap: This metric measures the overlap between the real and the predicted viewport tiles, as defined in Equation (9); a minimal sketch of this computation is given after this list.
(3) Average QoE: The average quality score across all users for each video, based on the QoE metric defined in Equation (5).
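Since Equation (9) is defined earlier in the article and not reproduced here, the sketch below assumes a common formulation of tile overlap: the fraction of tiles intersecting the real viewport that were also predicted as viewport tiles. The function name and the set representation are illustrative.

```python
def tile_overlap(actual: set[int], predicted: set[int]) -> float:
    """Assumed reading of Equation (9): |A ∩ P| / |A|, where A holds the
    indices of tiles intersecting the real viewport and P the predicted set."""
    if not actual:
        return 1.0  # degenerate case: no viewport tiles were recorded
    return len(actual & predicted) / len(actual)

# Example: 5 of the 6 actually viewed tiles were predicted -> ~0.833
print(tile_overlap({1, 2, 3, 7, 8, 9}, {2, 3, 4, 7, 8, 9}))
```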

5.2 Experimental Results

This subsection presents the results of experiments and a thorough analysis of the performance of each solution in a variety of testing conditions.

5.2.1 Streaming Behavior.

Table 4 provides insight into how the DFT1 solution performs in terms of tiling layout selection, tile selection, and bitrate adaptation for the five 360° videos with different motion characteristics. DFT1 supports a larger visual area with higher-quality streaming; therefore, for all the videos, layouts with larger tiles (i.e., 4 \(\times\) 3 and 6 \(\times\) 4) are predominantly selected. However, the use of the 6 \(\times\) 4 tiling layout decreases, while the use of the 4 \(\times\) 3 tiling layout slightly increases (by 5.38%), when the segment duration is increased from 1s to 2s for all the videos. Overall, only a small percentage of layouts with smaller tiles (i.e., 8 \(\times\) 6) is selected for all videos. DFT1 selects the 4 \(\times\) 3 tiling layout for more than 67% of the VR Interview video and mostly performs aggressive bitrate selection for the selected tiling layouts. DFT1 fetches the segments of the Football, Performance, Spotlight, Surfing, and VR Interview videos with aggressive bitrate selection in up to 59.14%, 75.33%, 66.27%, 58.75%, and 78.43% of cases, respectively, averaged across the three segment durations. DFT1 performs weighted quality adjustments for segments of these videos in up to 32.37%, 21.08%, 27.64%, 32.66%, and 17.18% of cases. Interestingly, the percentage of aggressive quality adjustments decreases and the percentage of weighted quality adjustments increases when the segment duration is increased. In addition, a small percentage of conservative bitrate selections is observed for all the videos in the DFT1 solution.
Table 4. Streaming Behavior of DFT1 Client in Terms of Tiling Layout Selection, Tile Selection, and Bitrate Adaptation Scenarios. The percentage results are averaged for five videos watched by 48 VR users. The columns give the tiling layout selection shares and, per layout, the shares of Tile Selection Case 1 (aggressive bitrate, C1), Case 2 (weighted bitrate, C2), and Case 3 (conservative bitrate, C3).

| Video | Seg. dur. | Layout 8×6 | Layout 6×4 | Layout 4×3 | C1 8×6 | C1 6×4 | C1 4×3 | C2 8×6 | C2 6×4 | C2 4×3 | C3 8×6 | C3 6×4 | C3 4×3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Football | 1s | 17.73 | 52.27 | 30.00 | 8.84 | 37.39 | 20.45 | 8.70 | 14.36 | 6.45 | 0.19 | 0.53 | 3.09 |
| | 1.5s | 17.53 | 50.10 | 32.38 | 7.42 | 32.59 | 17.74 | 9.54 | 16.02 | 7.91 | 0.57 | 1.49 | 6.73 |
| | 2s | 17.48 | 47.97 | 34.55 | 6.43 | 30.06 | 16.51 | 9.88 | 15.19 | 9.07 | 1.17 | 2.72 | 8.97 |
| Performance | 1s | 4.96 | 66.74 | 28.30 | 2.68 | 55.40 | 21.41 | 2.16 | 11.08 | 5.42 | 0.12 | 0.25 | 1.47 |
| | 1.5s | 5.42 | 63.77 | 30.81 | 2.80 | 51.00 | 21.32 | 2.48 | 12.18 | 6.48 | 0.14 | 0.59 | 3.02 |
| | 2s | 6.71 | 60.49 | 32.79 | 3.75 | 46.66 | 20.97 | 2.67 | 13.08 | 7.69 | 0.30 | 0.75 | 4.14 |
| Spotlight | 1s | 21.00 | 44.49 | 34.51 | 12.99 | 32.75 | 27.48 | 7.71 | 11.13 | 5.09 | 0.30 | 0.61 | 1.93 |
| | 1.5s | 19.01 | 42.00 | 39.00 | 10.30 | 27.67 | 27.24 | 8.10 | 12.86 | 7.47 | 0.61 | 1.46 | 4.28 |
| | 2s | 18.06 | 38.90 | 43.04 | 8.57 | 24.14 | 27.66 | 8.65 | 12.50 | 9.41 | 0.84 | 2.27 | 5.97 |
| Surfing | 1s | 20.46 | 42.23 | 37.32 | 11.12 | 28.75 | 27.63 | 9.10 | 12.89 | 7.00 | 0.24 | 0.59 | 2.68 |
| | 1.5s | 19.30 | 39.81 | 40.89 | 8.67 | 23.62 | 24.68 | 10.04 | 14.32 | 9.87 | 0.59 | 1.87 | 6.34 |
| | 2s | 17.84 | 36.67 | 45.49 | 7.12 | 19.72 | 24.96 | 9.59 | 14.02 | 11.17 | 1.13 | 2.93 | 9.36 |
| VR Interview | 1s | 6.46 | 26.42 | 67.11 | 3.93 | 19.11 | 59.60 | 2.46 | 6.87 | 5.99 | 0.07 | 0.45 | 1.52 |
| | 1.5s | 7.29 | 24.14 | 68.57 | 4.33 | 16.15 | 57.48 | 2.67 | 7.17 | 7.64 | 0.29 | 0.82 | 3.44 |
| | 2s | 8.60 | 23.16 | 68.23 | 4.61 | 14.45 | 55.65 | 3.20 | 7.53 | 8.00 | 0.78 | 1.19 | 4.59 |
The streaming behavior of the DFT2 client is presented in Table 5. DFT2 achieves a perfect viewport match (in up to 65.50% of cases), a partial viewport match (up to 27.71%), and a complete viewport mismatch (up to 6.77%) by selecting, on average, the 8 \(\times\) 6, 6 \(\times\) 4, and 4 \(\times\) 3 tiling layouts, respectively. In particular, DFT2 observes a perfect viewport match in 57.35% of cases for the Football video, 73.88% for the Performance video, 64.95% for the Spotlight video, 55.46% for the Surfing video, and 75.86% for the VR Interview video, averaged across the three prediction horizons. The lower perfect-viewport-match values for the sports videos, i.e., Football and Surfing, reflect the fast-moving objects within these videos. Consequently, the client observes lower percentages of aggressive bitrate adaptation with the 8 \(\times\) 6 tiling layout, 34.03% and 31.31% for the Football and Surfing videos, in comparison to the other videos. For content with minimal movement, such as the Performance and VR Interview videos, there is only a small percentage of viewport mismatch even when the segment duration is set to 2s. DFT2 requests the 4 \(\times\) 3 tiling layout for up to 3.36% and 5.94% of the Performance and VR Interview videos, respectively, at the 2s segment duration. Therefore, these videos show a limited percentage of extended viewport and conservative bitrate adaptation cases compared to the other videos. Interestingly, the percentage of 4 \(\times\) 3 and 6 \(\times\) 4 tiling layout selections increases with segment duration. Additionally, as the segment duration increases, the viewer tends to experience a higher percentage of weighted and conservative quality adjustments. Conversely, the percentage of fixed viewport cases with aggressive bitrate adjustments tends to decrease with longer segment durations. This is because prediction accuracy tends to decline when predicting further into the future.
Table 5. Streaming Behavior of DFT2 Client in Terms of Tiling Layout Selection, Tile Selection, and Bitrate Adaptation Scenarios. The percentage results are averaged for five videos watched by 48 VR users. The column groups follow Table 4: layout selection shares, then Case 1 (aggressive, C1), Case 2 (weighted, C2), and Case 3 (conservative, C3) per layout.

| Video | Seg. dur. | Layout 8×6 | Layout 6×4 | Layout 4×3 | C1 8×6 | C1 6×4 | C1 4×3 | C2 8×6 | C2 6×4 | C2 4×3 | C3 8×6 | C3 6×4 | C3 4×3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Football | 1s | 62.94 | 31.97 | 5.09 | 39.46 | 10.66 | 0.74 | 23.24 | 21.16 | 2.39 | 0.24 | 0.15 | 1.96 |
| | 1.5s | 56.31 | 33.68 | 10.02 | 33.03 | 9.27 | 0.73 | 22.76 | 24.06 | 4.05 | 0.52 | 0.34 | 5.24 |
| | 2s | 52.82 | 34.55 | 12.63 | 29.62 | 8.51 | 0.76 | 22.41 | 25.28 | 4.07 | 0.79 | 0.76 | 7.80 |
| Performance | 1s | 77.81 | 20.51 | 1.68 | 57.95 | 8.03 | 0.20 | 19.66 | 12.40 | 0.85 | 0.19 | 0.08 | 0.62 |
| | 1.5s | 73.85 | 23.28 | 2.87 | 52.48 | 7.38 | 0.17 | 21.07 | 15.64 | 1.19 | 0.30 | 0.26 | 1.51 |
| | 2s | 69.99 | 26.65 | 3.36 | 48.91 | 7.76 | 0.10 | 20.65 | 18.51 | 0.99 | 0.43 | 0.37 | 2.26 |
| Spotlight | 1s | 71.39 | 24.18 | 4.44 | 46.03 | 8.26 | 0.75 | 25.21 | 15.73 | 1.92 | 0.15 | 0.19 | 1.76 |
| | 1.5s | 64.12 | 28.40 | 7.48 | 38.76 | 7.69 | 0.56 | 24.96 | 20.42 | 2.69 | 0.41 | 0.29 | 4.23 |
| | 2s | 59.35 | 30.77 | 9.88 | 35.50 | 9.18 | 1.50 | 23.38 | 21.03 | 2.95 | 0.47 | 0.55 | 5.43 |
| Surfing | 1s | 62.08 | 32.17 | 5.74 | 37.10 | 10.25 | 0.87 | 24.75 | 21.78 | 2.71 | 0.23 | 0.14 | 2.15 |
| | 1.5s | 54.41 | 34.93 | 10.66 | 30.52 | 8.06 | 0.81 | 23.51 | 26.51 | 4.43 | 0.38 | 0.36 | 5.43 |
| | 2s | 49.90 | 36.25 | 13.86 | 26.31 | 7.71 | 0.85 | 22.92 | 27.51 | 4.53 | 0.67 | 1.03 | 8.47 |
| VR Interview | 1s | 79.18 | 17.67 | 3.15 | 58.25 | 6.38 | 0.29 | 20.68 | 11.23 | 1.59 | 0.25 | 0.06 | 1.27 |
| | 1.5s | 74.75 | 20.33 | 4.92 | 53.41 | 6.15 | 0.34 | 20.93 | 13.93 | 1.65 | 0.40 | 0.25 | 2.94 |
| | 2s | 73.66 | 20.41 | 5.94 | 50.13 | 5.62 | 0.31 | 22.92 | 14.14 | 1.68 | 0.60 | 0.65 | 3.94 |
Figure 5 presents the streaming behavior of the ATS algorithm in terms of the average tiling layout selection over the entire video dataset. ATS selects tiling layouts based on the minimum weighted viewport distortion, aiming to achieve the maximum viewport bitrate. ATS mostly selects the 4 \(\times\) 3 tiling layout for the Football video, followed by the 8 \(\times\) 6 and 6 \(\times\) 4 tiling grids. ATS requests the 6 \(\times\) 4 and 8 \(\times\) 6 tiling layouts for about 15.06% and 50.94% of the streaming session for the Performance video across the 1s, 1.5s, and 2s segment durations. The 6 \(\times\) 4 tiling layout is mostly requested for the VR Interview video with a 2s segment duration. For the Spotlight and Surfing videos, ATS mostly requests the 8 \(\times\) 6 tiling layout (41.23% and 38.71%), followed by the 6 \(\times\) 4 (34.97% and 30.29%) and 4 \(\times\) 3 (23.79% and 30.98%) layouts, respectively. Over the entire test dataset, the ATS method selects the 8 \(\times\) 6 layout 42.61% of the time, the 6 \(\times\) 4 layout 29%, and the 4 \(\times\) 3 layout 28.39%. This is because layouts with more tiles result in relatively larger segment sizes.
Fig. 5. Average tiling layout selection in the ATS method.

5.2.2 Average Tile Overlap.

Figure 6 summarizes the average tile overlap results (48 head movement traces per video) for the DFT1, DFT2, ATS, and UVP methods under various prediction horizons. The ATS, UVP, CTF, PBA, and AVR streaming algorithms all use the spherical-walk prediction method to inform adaptive tile selection and bitrate selection. Figure 6 shows that the DFT1 method leads to higher tile overlap for all five videos. This is because the tiles in the dynamic tiling layouts produced by DFT1 are arranged based on the arc distance between the viewpoint and the center of each tile, which allows DFT1 to cover the viewport and reduce the risk of gaps in the visual field. The Football and Surfing videos tend to elicit more dynamic head movements from viewers because they contain fast-moving outdoor sports-related objects. In contrast, the Performance and VR Interview videos tend to have a higher average tile overlap because they feature slower-moving indoor objects that are the primary focus of attention. This suggests that the nature of the content being watched can impact the amount of head movement and, in turn, the tile overlap observed in the video. Notably, DFT1 and DFT2 attain higher matching performance and outperform the ATS and UVP methods for different user behaviors. For all 48 VR users, DFT1 and DFT2 experience an average tile overlap of 85.40% and 81.95% (Football) and 92.22% and 90.43% (Performance) for 1s (Figure 6(a)), 84.09% and 80.50% (Spotlight) and 90.03% and 86.94% (VR Interview) for 1.5s (Figure 6(b)), and 74.93% and 70.91% (Surfing) and 88.92% and 85.54% (Performance) for 2s (Figure 6(c)) prediction windows. The proactive tile selection methods adapt more effectively to the varied spatial and temporal information present in different motion scenes, which explains their superior performance. At the same time, the ATS method exhibits a lower average tile overlap than the UVP method for content with fast and stable head movements. As can be seen for the Spotlight video, DFT1 outperforms the ATS and UVP methods by up to 8.88% and 11.19% for the next 1.5s (Figure 6(b)) and by 14.66% and 12.43% for a 2s prediction horizon (Figure 6(c)), respectively. Similarly, DFT2 demonstrates its ability to increase viewport overlap for the Surfing video, outperforming the other methods by about 7.37%, 9.02%, and 10.35% for 1s, 1.5s, and 2s prediction times, respectively. For the Spotlight video, the average gain of the DFT methods ranges from 6.27% to 9.32%, 7.02% to 10.61%, and 9.28% to 12.98% for the different prediction horizons. The tile overlap of DFT2 is reduced by 8.64% (Football) and by 10.23% (Surfing) when the segment duration is increased from 1s to 2s. In contrast, for the ATS and UVP methods, the tile overlap is reduced by 11.83% and 11.57% (Football) and by 13.29% and 13.17% (Surfing), respectively (Figure 6). This indicates that the DFT2 method is more effective at maintaining a high level of tile overlap even when the segment duration is increased. As a result, it can be concluded that employing two prediction mechanisms (as in DFT) leads to better viewing probability than employing a single prediction mechanism for fixed (UVP) and dynamic (ATS) tiling-based streaming.
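As a concrete illustration of the arc-distance ordering mentioned above, the sketch below ranks tiles by the great-circle distance between a predicted viewpoint and each tile center. It assumes directions are given as (yaw, pitch) pairs in radians; the paper's exact weighting of this distance is not reproduced.

```python
import math

def arc_distance(yaw1: float, pitch1: float,
                 yaw2: float, pitch2: float) -> float:
    """Great-circle (arc) distance in radians between two viewing
    directions on the unit sphere, via the spherical law of cosines."""
    cos_d = (math.sin(pitch1) * math.sin(pitch2)
             + math.cos(pitch1) * math.cos(pitch2) * math.cos(yaw1 - yaw2))
    return math.acos(max(-1.0, min(1.0, cos_d)))  # clamp rounding noise

def rank_tiles(viewpoint: tuple, tile_centers: list) -> list:
    """Order tile indices from angularly closest to farthest, so the
    closest tiles can be assigned to the viewport region first."""
    vy, vp = viewpoint
    return sorted(range(len(tile_centers)),
                  key=lambda i: arc_distance(vy, vp, *tile_centers[i]))
```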
Fig. 6. Average tile overlap achieved by the DFT1, DFT2, and Spherical Walk methods for the Football, Performance, Spotlight, Surfing, and VR Interview videos, prepared in 4 \(\times\) 3, 6 \(\times\) 4, and 8 \(\times\) 6 tiling layouts and watched by 48 VR users. The recorded results are for 1s, 1.5s, and 2s segment durations.

5.2.3 Average QoE.

Next, the performance of the proposed solutions is tested against five tile-based methods using bandwidth trace 1 and trace 2 for the Football and Performance videos. The values of the QoE functions defined in Equations (1) through (4) are normalized. The QoE weight coefficients are set to \(\alpha =1, \beta = 0.8, \gamma = 0.6, \delta =0.2\). The weights are selected to emphasize different combinations of QoE objectives: a larger value of \(\alpha\) indicates that the user is more concerned with viewport quality, while a smaller value of \(\delta\) indicates that the user places less importance on playback buffer risk. Increasing the \(\beta\), \(\gamma\), and \(\delta\) weights results in negative QoE values for the CTF and PBA clients for the Surfing video. Therefore, these values are selected to provide a useful QoE comparison between the proposed and the other solutions.
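As a concrete reading of this weighting, the sketch below combines the four normalized terms into one score. Only the weighted-sum structure is assumed here; the exact functional forms of Equations (1) through (5) are given earlier in the article.

```python
def qoe_score(viewport_quality: float, temporal_osc: float,
              spatial_osc: float, buffer_risk: float,
              alpha: float = 1.0, beta: float = 0.8,
              gamma: float = 0.6, delta: float = 0.2) -> float:
    """Assumed composition of the normalized QoE terms: reward viewport
    quality and penalize quality oscillations and playback buffer risk."""
    return (alpha * viewport_quality - beta * temporal_osc
            - gamma * spatial_osc - delta * buffer_risk)

# Example with normalized inputs in [0, 1]:
print(qoe_score(0.9, 0.1, 0.05, 0.02))  # ~0.786
```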
The reference tile-based delivery solutions use viewers' head motion patterns to adaptively select bitrates. Figure 7 depicts the video quality experienced, averaged across 48 users, for 1s, 1.5s, and 2s segments. It can be seen that the performance of the algorithms in Figures 7(a) and 7(c) is higher than that in Figures 7(b) and 7(d): for the same QoE weight coefficients, the average QoE values decrease accordingly as the available bandwidth decreases. The higher QoE scores of the finer tiling layouts (i.e., 6 \(\times\) 4 and 8 \(\times\) 6) for the 1s Performance video (Figures 7(c) and 7(d)) are due to their higher average tile overlap. Despite the lower tile overlap, the UVP, CTF, PBA, and AVR streaming methods achieve higher quality scores for the Football video with a 1s segment duration due to the smaller average segment sizes (Figures 7(a) and 7(b)). Figure 7(a) shows that DFT1 improves QoE compared to the other methods by about 3.96%, 9.29%, and 12.90% for the Football video with 1s, 1.5s, and 2s segment durations, respectively, when employing bandwidth trace 1. For both bandwidth traces, DFT1 outperforms ATS by about 25.31% to 38.71%, UVP by about 2.25% to 4.25%, CTF by about 5.08% to 7.67%, PBA by about 11.16% to 15.42%, and AVR by about 13.37% to 20.07% for the Football video with a 1.5s segment duration. Figure 7(b) shows that DFT2 achieves about 5.44% (for 1s), 12.56% (for 1.5s), and 15.98% (for 2s) higher average QoE for Football video streaming in comparison to the other solutions. The growing gain with increasing segment duration reflects the better prediction accuracy of the DFT solutions over longer horizons. Similarly, Figures 7(c) and 7(d) show that the DFT solutions deliver the highest visual quality levels for all segment durations, since they accommodate the user's viewing directions better than the other methods. In particular, DFT1 achieves an average gain of 7.45% (1s), 14.42% (1.5s), and 17.69% (2s) for Performance video streaming under bandwidth trace 1, increasing to 10.34% (1s), 23.20% (1.5s), and 27.23% (2s) under bandwidth trace 2. Viewport mismatch leads to a drop in quality for tile-based streaming methods at longer segment lengths. In the DFT methods, the combination of viewport coverage selection and bitrate selection policies favors higher-quality perceptibility of the viewing area. For the Performance video with a 2s segment duration, DFT2 outperforms the fixed tiling-based solutions by about 2.34% to 6.39%, 11.76% to 21.67%, 27.75% to 43.46%, and 23.80% to 35.58% for both bandwidth scenarios. The improved performance of the DFT solutions over the CTF and PBA methods stems from their uniform quality allocation to the predicted tiles, which favors higher visual quality levels in the viewing area while reducing the amount of data spent on the background tiles.
Fig. 7. Average QoE achieved by DFT1, DFT2, ATS, UVP, CTF, PBA, and AVR streaming clients for the Football and Performance videos.
The results of the experiments on the Spotlight, Surfing, and VR Interview videos are shown in Figure 8. The Surfing and Spotlight videos require higher bitrates for satisfactory quality scores (see Table 3), making it more difficult to achieve a high QoE over limited network connections with high QoE expectations. On the other hand, the VR Interview video attains higher QoE scores due to its smaller average segment sizes and higher viewport overlap. Therefore, factors such as segment size, bandwidth capacity, and viewport prediction significantly impact the streaming performance of 360° videos. For example, when streaming the Spotlight video with a 2s segment duration, the DFT1 method achieves average QoE improvements of up to 29.8%, 12.15%, 24.36%, 28.7%, and 30.6% compared to ATS, UVP 8 \(\times\) 6, CTF 8 \(\times\) 6, PBA 6 \(\times\) 4, and AVR 8 \(\times\) 6, respectively (Figure 8(b)). This is because DFT1 has 14.65% and 12.15% higher average tile overlap than the ATS and UVP methods for the Spotlight video with a 2s segment duration (Figure 6(c)). The average quality score for the Surfing video with a 1s segment duration under bandwidth trace 2 (Figure 8(d)) is 64.21% for DFT1, 61.57% for DFT2, 37.53% for ATS, 56.45% for UVP 4 \(\times\) 3, 48.79% for CTF 6 \(\times\) 4, 38.9% for PBA 8 \(\times\) 6, and 41.08% for AVR 4 \(\times\) 3. For the VR Interview video with a 1.5s segment duration, DFT2 improves the average QoE by up to 20.55% compared to ATS, 3.02% compared to UVP, 9% compared to CTF, 17.61% compared to PBA, and 37.74% compared to AVR for bandwidth trace 2 (Figure 8(f)), while the average improvement for DFT1 is 25.7%, 5.3%, 13.94%, 23.64%, and 42.3% over all tiling layouts of ATS, UVP, CTF, PBA, and AVR, respectively, for the 2s VR Interview video (Figure 8(e)). The ATS method performs better than the AVR method in only a few cases for the Performance and VR Interview videos. The poor performance of the ATS method is due to its restriction of the background tile quality to minimum levels, which leads to lower quality scores under low and medium prediction performance. Figure 8 also shows that, when simulated over all tiling layouts, segment durations, and bandwidth profiles, the DFT1 and DFT2 methods achieve QoE improvements of 16.53%, 15.56%, and 13.62% for the Spotlight, Surfing, and VR Interview videos, respectively. This is because the QoE metric used favors higher visible quality. The lower QoE values for the PBA algorithm are due to its strategy of assigning different priorities to tiles within the viewport zones (\(Z_1\) and \(Z_2\)), which leads to poor user-perceived quality and visual smoothness. The AVR method, meanwhile, performs poorly even under stable head movements because it unnecessarily increases the quality of adjacent tiles. In general, the DFT1 and DFT2 solutions lead to average QoE improvements of 9.70% to 10.56% for the Football, 16.33% to 16.72% for the Performance, 15.08% to 18% for the Spotlight, 14.33% to 16.79% for the Surfing, and 13.45% to 13.79% for the VR Interview videos compared to the other solutions.
Fig. 8. Average QoE achieved by DFT1, DFT2, ATS, UVP, CTF, PBA, and AVR streaming clients for the Spotlight, Surfing, and VR Interview videos.

5.2.4 Ablation Study–Impact of QoE Weight Coefficients.

We investigated the influence of the QoE weight coefficients on the streaming performance of the adaptive 360° video solutions. For each streaming solution, we collected the streaming metrics presented in Equations (1) through (4), including viewport quality, temporal quality oscillations, spatial quality oscillations, and playback buffer risk, across a comprehensive testing dataset encompassing five videos, three tiling patterns, three segment durations, and two bandwidth traces. Figure 9(a) illustrates the sampled QoE weight coefficients, where the values of \(\alpha\), \(\beta\), \(\gamma\), and \(\delta\) range between 0 and 1. Figure 9(b) displays the average QoE values for each corresponding weight sample. The findings from Figure 9(b) reveal that the DFT1 and DFT2 solutions consistently outperform the other methods, achieving the highest QoE scores across all combinations of QoE weight samples. The average QoE scores obtained are: DFT1 (60.82%), DFT2 (59.17%), ATS (35.08%), UVP (56.06%), CTF (48.45%), PBA (38.23%), and AVR (36.02%). In general, DFT1 and DFT2 surpass ATS by 24% to 25.74%, UVP by 3.11% to 4.76%, CTF by 10.71% to 12.36%, PBA by 20.94% to 22.59%, and AVR by 23.14% to 24.79% in terms of QoE performance. The average values of the sampled QoE weight coefficients are \(\alpha\) = 0.885, \(\beta\) = 0.835, \(\gamma\) = 0.817, and \(\delta\) = 0.466.
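The weight sweep can be mimicked with a simple uniform sampler over \([0, 1]^4\); the actual sample set behind Figure 9 is not listed in the article, so the snippet below is only indicative.

```python
import random

def sample_qoe_weights(n: int, seed: int = 0) -> list:
    """Draw n (alpha, beta, gamma, delta) vectors uniformly from [0, 1]^4,
    mirroring the ablation's sweep over QoE weight combinations."""
    rng = random.Random(seed)
    return [tuple(round(rng.random(), 3) for _ in range(4)) for _ in range(n)]

# Each sampled vector would be used to recompute each client's average QoE.
print(sample_qoe_weights(3))
```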
Fig. 9. Average QoE obtained by DFT1, DFT2, ATS, UVP, CTF, PBA, and AVR streaming clients for the comprehensive dataset (comprising five videos, three tiling patterns, three segment durations, and two bandwidth traces) when assessed under varying QoE weight coefficients.

5.3 Discussion

Existing fixed tiling-based adaptive streaming solutions aim to improve visual quality while reducing spatial and temporal quality variations and the risk of playback interruptions. The proposed dynamic tiling-based streaming solutions, however, achieve more accurate viewport prediction and higher QoE levels, since they select the best-resolution tiling layouts for both static and dynamic motion scenes. The ATS and UVP solutions allocate bitrate uniformly to tiles in the same classification to improve the visual smoothness objectives defined in Equations (2) and (3). However, ATS limits the background quality to the minimum level, suffers from lower viewport matching performance, and achieves the lowest average QoE values over the entire test dataset. The UVP solution, on the other hand, increases the quality of the whole video to the highest possible level and produces better quality scores even under difficult-to-predict head movements. The CTF and PBA solutions focus primarily on improving the quality of the center tile, which leads to significantly degraded quality under poor viewport prediction and to spatial quality variations even when viewport prediction is stable. The AVR streaming method underperforms under both drastic and stable viewport switches because its inefficient tile arrangement consumes a substantial share of the network bandwidth. In contrast, the DFT solutions devote a much larger bandwidth share to the tiles most likely to be watched and achieve higher QoE scores than the comparative methods across all tested datasets. DFT1 provides a useful tradeoff between visual area and visual quality, while DFT2 works to minimize the viewport mismatch ratio. Both proposed solutions perform well under different testing settings and avoid unacceptable viewport deviations for end-users. The DFT solutions allocate a fair share of the bandwidth to tiles in the viewport, marginal, and background regions, resulting in lower spatial and temporal quality variations across different viewport prediction outcomes. Under stable or variable motions of experienced or naive VR users, the dynamic selection of tiling layouts and visible-region coverage (fixed/extended), combined with the aggressive, weighted, and/or conservative quality adjustment policies, provides improved QoE for different bandwidth settings, segment sizes, and motion trends. Therefore, the proposed solutions demonstrate their potential to offer a superior quality of experience compared to other approaches for delivering 360° video.
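A minimal sketch of such a region-weighted allocation is given below: each tile carries a weight derived from its region, and quality levels are raised greedily while the segment still fits the bandwidth budget. The region weights and the greedy rule are assumptions for illustration, not the exact DFT policy.

```python
def allocate(tiles: list, sizes: dict, budget: float) -> dict:
    """Greedy, weight-ordered quality allocation.
    tiles: list of (tile_id, region_weight) pairs; sizes[tile_id] is a list
    of segment sizes at each quality level (ascending). Every tile starts at
    the lowest level; upgrades go to the highest-weight tiles first."""
    level = {t: 0 for t, _ in tiles}
    spent = sum(sizes[t][0] for t, _ in tiles)  # baseline cost of all tiles
    for t, _ in sorted(tiles, key=lambda x: -x[1]):  # viewport tiles first
        while level[t] + 1 < len(sizes[t]):
            extra = sizes[t][level[t] + 1] - sizes[t][level[t]]
            if spent + extra > budget:
                break
            level[t] += 1
            spent += extra
    return level

# Example: tile 0 in the viewport (weight 1.0), tile 1 in the background.
print(allocate([(0, 1.0), (1, 0.2)], {0: [1, 2, 4], 1: [1, 2, 4]}, 6.0))
```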

6 Conclusions and Future Works

This article proposed and evaluated two novel dynamic video frame tiling-based solutions, DFT1 and DFT2, for advanced predictive tile selection during adaptive 360° video streaming. The DFT solutions achieve an appropriate balance between viewport availability and perceived visual quality. DFT1 performs interactive tiling layout selection by leveraging the visual area and the associated weighted quality while coping with the dynamics of the user's attention field. DFT2 accounts for potential viewport prediction errors to best accommodate different tiling layouts. The DFT solutions extract the user attention fields by leveraging two viewport prediction mechanisms to select the best-fit dynamic-size regions for transmission over bandwidth-limited networks. The proposed solutions consider the level of interest in each region when deciding how much bitrate it should receive, simplifying the selection of an appropriate bitrate for each tile. The effectiveness of the DFT algorithms was evaluated through extensive trace-driven experiments. The experimental results on publicly available datasets, under different segment lengths and bandwidth settings, demonstrate that the proposed solutions achieve up to 8.6%, 9.77%, and 11.2% improved viewport availability for 1s, 1.5s, and 2s segment durations, respectively. At the same time, the DFT solutions improve QoE by 9.7% to 18% for VR videos with different motion characteristics compared to alternative solutions. In the future, we aim to develop a guidance-enhanced fuzzy reinforcement learning (FRL) solution to control continuous tile selection and bitrate adaptation for equirectangular, cubemap, and truncated squared pyramid projected 360° videos under more complex network and head movement datasets. Using advanced QoE metrics, we will evaluate the effectiveness of our FRL-based solution and identify potential optimization opportunities.

References

[1] Yanan Bao, Huasen Wu, Tianxiao Zhang, Albara A. H. Ramli, and Xin Liu. 2016. Shooting a moving target: Motion-prediction-based transmission for 360-degree videos. In 2016 IEEE International Conference on Big Data (Big Data'16). IEEE, 1161–1170.
[2] M. Ben Yahia, Y. Le Louedec, G. Simon, and L. Nuaymi. 2018. HTTP/2-based streaming solutions for tiled omnidirectional videos. In 2018 IEEE International Symposium on Multimedia (ISM'18). 89–96.
[3] Tengfei Cao, Changqiao Xu, Mu Wang, Zhongbai Jiang, Xingyan Chen, Lujie Zhong, and Luigi Alfredo Grieco. 2019. Stochastic optimization for green multimedia services in dense 5G networks. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 3 (2019), 1–22.
[4] Xiaolei Chen, Di Wu, and Ishfaq Ahmad. 2021. Optimized viewport-adaptive 360-degree video streaming. CAAI Transactions on Intelligence Technology 6, 3 (2021), 347–359.
[5] Cyril Concolato, Jean Le Feuvre, Franck Denoual, Frédéric Mazé, Eric Nassor, Nael Ouedraogo, and Jonathan Taquet. 2017. Adaptive streaming of HEVC tiled videos using MPEG-DASH. IEEE Transactions on Circuits and Systems for Video Technology 28, 8 (2017), 1981–1992.
[6] Xavier Corbillon, Francesca De Simone, and Gwendal Simon. 2017. 360-degree video head movement dataset. In Proceedings of the 8th ACM on Multimedia Systems Conference. 199–204.
[7] Xavier Corbillon, Gwendal Simon, Alisa Devlic, and Jacob Chakareski. 2017. Viewport-adaptive navigable 360-degree video delivery. In 2017 IEEE International Conference on Communications (ICC'17). IEEE, 1–7.
[8] R. G. d. A. Azevedo, N. Birkbeck, F. De Simone, I. Janatra, B. Adsumilli, and P. Frossard. 2019. Visual distortions in 360-degree videos. IEEE Transactions on Circuits and Systems for Video Technology 30, 8 (2019), 2524–2537.
[9] Pingping Dong, Rongcheng Shen, Xiaowei Xie, Yajing Li, Yuning Zuo, and Lianming Zhang. 2022. Predicting long-term field of view in 360-degree video streaming. IEEE Network (2022), 1–8.
[10] Miguel Fabian Romero Rondon, Lucile Sassatelli, Ramon Aparicio Pardo, and Frederic Precioso. 2019. Revisiting deep architectures for head motion prediction in 360° videos. arXiv preprint arXiv:1911.11702 (2019).
[11] Ajoy S. Fernandes and Steven K. Feiner. 2016. Combating VR sickness through subtle dynamic field-of-view modification. In 2016 IEEE Symposium on 3D User Interfaces (3DUI'16). IEEE, 201–210.
[12] Mario Graf, Christian Timmerer, and Christopher Mueller. 2017. Towards bandwidth efficient adaptive streaming of omnidirectional video over HTTP: Design, implementation, and evaluation. In Proceedings of the 8th ACM on Multimedia Systems Conference (MMSys'17). ACM, New York, NY, 261–271.
[13] Chengjun Guo, Ying Cui, and Zhi Liu. 2018. Optimal multicast of tiled 360 VR video. IEEE Wireless Communications Letters 8, 1 (2018), 145–148.
[14] Dongbiao He, Cedric Westphal, and J. Garcia-Luna-Aceves. 2018. Joint rate and FoV adaptation in immersive video streaming. In ACM SIGCOMM Workshop on AR/VR Networks.
[15] Jeroen van der Hooft, Maria Torres Vega, Stefano Petrangeli, Tim Wauters, and Filip De Turck. 2019. Tile-based adaptive streaming for virtual reality video. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 4 (2019), 1–24.
[16] Mohammad Hosseini and Viswanathan Swaminathan. 2016. Adaptive 360 VR video streaming: Divide and conquer. In 2016 IEEE International Symposium on Multimedia (ISM'16). 107–110.
[17] Xinjue Hu, Wei Quan, Tao Guo, Yu Liu, and Lin Zhang. 2019. Mobile edge assisted live streaming system for omnidirectional video. Mobile Information Systems 2019 (2019), 8487372:1–8487372:15.
[18] ITU-T Recommendation. 2014. P.913: Methods for the subjective assessment of video quality, audio quality and audiovisual quality of Internet video and distribution quality television in any environment.
[19] Xiaolan Jiang, Yi-Han Chiang, Yang Zhao, and Yusheng Ji. 2018. Plato: Learning-based adaptive streaming of 360-degree videos. In 2018 IEEE 43rd Conference on Local Computer Networks (LCN'18). IEEE, 393–400.
[20] Chamara Kattadige and Kanchana Thilakarathna. 2021. VAD360: Viewport aware dynamic 360-degree video frame tiling. arXiv preprint arXiv:2105.11563 (2021).
[21] Jean Le Feuvre and Cyril Concolato. 2016. Tiled-based adaptive streaming using MPEG-DASH. In Proceedings of the 7th International Conference on Multimedia Systems. ACM, 41.
[22] Weihe Li, Jiawei Huang, Wenjun Lyu, Baoshen Guo, Wanchun Jiang, and Jianxin Wang. 2022. RAV: Learning-based adaptive streaming to coordinate the audio and video bitrate selections. IEEE Transactions on Multimedia (2022), 1–14.
[23] Wen-Chih Lo, Ching-Ling Fan, Jean Lee, Chun-Ying Huang, Kuan-Ta Chen, and Cheng-Hsin Hsu. 2017. 360 video viewing dataset in head-mounted virtual reality. In Proceedings of the 8th ACM on Multimedia Systems Conference. 211–216.
[24] Kaixuan Long, Chencheng Ye, Ying Cui, and Zhi Liu. 2018. Optimal multi-quality multicast for 360 virtual reality video. In 2018 IEEE Global Communications Conference (GLOBECOM'18). IEEE, 1–6.
[25] Pietro Lungaro, Rickard Sjöberg, Alfredo Jose Fanghella Valero, Ashutosh Mittal, and Konrad Tollmar. 2018. Gaze-aware streaming solutions for the next generation of mobile VR experiences. IEEE Transactions on Visualization and Computer Graphics 24, 4 (2018), 1535–1544.
[26] Hongzi Mao, Ravi Netravali, and Mohammad Alizadeh. 2017. Neural adaptive video streaming with Pensieve. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication. ACM, 197–210.
[27] Gebremariam Mesfin, Nadia Hussain, Alexandra Covaci, and Gheorghita Ghinea. 2019. Using eye tracking and heart-rate activity to examine crossmodal correspondences QoE in mulsemedia. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 15, 2 (June 2019), Article 34, 22 pages.
[28] Afshin Taghavi Nasrabadi, Anahita Mahzari, Joseph D. Beshay, and Ravi Prakash. 2017. Adaptive 360-degree video streaming using scalable video coding. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 1689–1697.
[29] Afshin Taghavi Nasrabadi, Aliehsan Samiei, and Ravi Prakash. 2020. Viewport prediction for 360° videos: A clustering approach. In Proceedings of the 30th ACM Workshop on Network and Operating Systems Support for Digital Audio and Video (NOSSDAV'20). ACM, New York, NY, 34–39.
[30] Khiem Quang Minh Ngo, Ravindra Guntur, and Wei Tsang Ooi. 2011. Adaptive encoding of zoomable video streams based on user access pattern. In Proceedings of the 2nd Annual ACM Conference on Multimedia Systems. 211–222.
[31] Duc V. Nguyen, Huyen T. T. Tran, Anh T. Pham, and Truong Cong Thang. 2019. An optimal tile-based approach for viewport-adaptive 360-degree video streaming. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (2019), 29–42.
[32] Duc V. Nguyen, Huyen T. T. Tran, and Truong Cong Thang. 2019. Adaptive tiling selection for viewport adaptive streaming of 360-degree video. IEICE Transactions on Information and Systems 102, 1 (2019), 48–51.
[33] Leandro Ordonez-Ante, Jeroen van der Hooft, Tim Wauters, Gregory Van Seghbroeck, Bruno Volckaert, and Filip De Turck. 2022. Explora-VR: Content prefetching for tile-based immersive video streaming applications. Journal of Network and Systems Management 30, 3 (2022), 1–30.
[34] Cagri Ozcinar, Julián Cabrera, and Aljosa Smolic. 2018. Omnidirectional video streaming using visual attention-driven dynamic tiling for VR. In 2018 IEEE Visual Communications and Image Processing (VCIP'18). IEEE, 1–4.
[35] C. Ozcinar, J. Cabrera, and A. Smolic. 2019. Visual attention-aware omnidirectional video streaming using optimal tiles for virtual reality. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 9, 1 (March 2019), 217–230.
[36] Stefano Petrangeli, Viswanathan Swaminathan, Mohammad Hosseini, and Filip De Turck. 2017. An HTTP/2-based adaptive streaming framework for 360 virtual reality videos. In Proceedings of the 2017 ACM on Multimedia Conference. ACM, 306–314.
[37] Feng Qian, Bo Han, Qingyang Xiao, and Vijay Gopalakrishnan. 2018. Flare: Practical viewport-adaptive 360-degree video streaming for mobile devices. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking. ACM, 99–114.
[38] Feng Qian, Lusheng Ji, Bo Han, and Vijay Gopalakrishnan. 2016. Optimizing 360 video delivery over cellular networks. In Proceedings of the 5th Workshop on All Things Cellular: Operations, Applications and Challenges. ACM, 1–6.
[39] Ngo Quang Minh Khiem, Guntur Ravindra, Axel Carlier, and Wei Tsang Ooi. 2010. Supporting zoomable video streams with dynamic region-of-interest cropping. In Proceedings of the 1st Annual ACM SIGMM Conference on Multimedia Systems. 259–270.
[40] Yago Sánchez, Robert Skupin, and Thomas Schierl. 2015. Compressed domain video processing for tile based panoramic streaming using HEVC. In 2015 IEEE International Conference on Image Processing (ICIP'15). IEEE, 2244–2248.
[41] Muhammad Shahid Anwar, Jing Wang, Sadique Ahmad, Asad Ullah, Wahab Khan, and Zesong Fei. 2020. Evaluating the factors affecting QoE of 360-degree videos and cybersickness levels predictions in virtual reality. Electronics 9, 9 (2020), 1530.
[42] Kevin Spiteri, Rahul Urgaonkar, and Ramesh K. Sitaraman. 2016. BOLA: Near-optimal bitrate adaptation for online videos. In The 35th Annual IEEE International Conference on Computer Communications (INFOCOM'16). IEEE, 1–9.
[43] Evgeniy Upenik and Touradj Ebrahimi. 2017. A simple method to obtain visual attention data in head mounted virtual reality. In 2017 IEEE International Conference on Multimedia & Expo Workshops (ICMEW'17). 73–78.
[44] Hui Wang, Vu-Thanh Nguyen, Wei Tsang Ooi, and Mun Choon Chan. 2014. Mixing tile resolutions in tiled video: A perceptual quality assessment. In Proceedings of Network and Operating System Support on Digital Audio and Video Workshop. ACM, 25.
[45] Xuekai Wei, Mingliang Zhou, Sam Kwong, Hui Yuan, and Weijia Jia. 2022. A hybrid control scheme for 360-degree dynamic adaptive video streaming over mobile devices. IEEE Transactions on Mobile Computing 21, 10 (2022), 3428–3442.
[46] Xuekai Wei, Mingliang Zhou, Sam Kwong, Hui Yuan, Shiqi Wang, Guopu Zhu, and Jingchao Cao. 2021. Reinforcement learning-based QoE-oriented dynamic adaptive streaming framework. Information Sciences 569 (2021), 786–803.
[47] Chenglei Wu, Zhihao Tan, Zhi Wang, and Shiqiang Yang. 2017. A dataset for exploring user behaviors in VR spherical video streaming. In Proceedings of the 8th ACM on Multimedia Systems Conference. 193–198.
[48] Mengbai Xiao, Chao Zhou, Yao Liu, and Songqing Chen. 2017. OpTile: Toward optimal tiling in 360-degree video streaming. In Proceedings of the 25th ACM International Conference on Multimedia. 708–716.
[49] Lan Xie, Zhimin Xu, Yixuan Ban, Xinggong Zhang, and Zongming Guo. 2017. 360ProbDASH: Improving QoE of 360 video streaming using tile-based HTTP adaptive streaming. In Proceedings of the ACM Multimedia Conference. 315–323. https://doi.org/10.1145/3123266.3123291
[50] Praveen Kumar Yadav and Wei Tsang Ooi. 2020. Tile rate allocation for 360-degree tiled adaptive video streaming. In Proceedings of the 28th ACM International Conference on Multimedia. 3724–3733.
[51] Praveen Kumar Yadav, Arash Shafiei, and Wei Tsang Ooi. 2017. QUETRA: A queuing theory approach to DASH rate adaptation. In Proceedings of the 25th ACM International Conference on Multimedia (MM'17). ACM, New York, NY, 1130–1138.
[52] A. Yaqoob, T. Bi, and G. M. Muntean. 2019. A DASH-based efficient throughput and buffer occupancy-based adaptation algorithm for smooth multimedia streaming. In 2019 15th International Wireless Communications & Mobile Computing Conference (IWCMC'19). 643–649.
[53] A. Yaqoob, T. Bi, and G. M. Muntean. 2020. A survey on adaptive 360° video streaming: Solutions, challenges and opportunities. IEEE Communications Surveys & Tutorials 22, 4 (2020), 2801–2838.
[54] Abid Yaqoob and Gabriel-Miro Muntean. 2020. A weighted tile-based approach for viewport adaptive 360° video streaming. In 2020 IEEE International Symposium on Broadband Multimedia Systems and Broadcasting (BMSB'20). IEEE.
[55] Abid Yaqoob and Gabriel-Miro Muntean. 2021. A combined field-of-view prediction-assisted viewport adaptive delivery scheme for 360° videos. IEEE Transactions on Broadcasting 67, 3 (2021), 746–760.
[56] Abid Yaqoob, Mohammed Amine Togou, and Gabriel-Miro Muntean. 2022. Dynamic viewport selection-based prioritized bitrate adaptation for tile-based 360° video streaming. IEEE Access 10 (2022), 29377–29392.
[57] Hui Yuan, Shiyun Zhao, Junhui Hou, Xuekai Wei, and Sam Kwong. 2019. Spatial and temporal consistency-aware dynamic adaptive streaming for 360-degree videos. IEEE Journal of Selected Topics in Signal Processing 14, 1 (2019), 177–193.
[58] Zhenhui Yuan, Shengyang Chen, Gheorghita Ghinea, and Gabriel-Miro Muntean. 2014. User quality of experience of mulsemedia applications. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 11, 1s (2014), 1–19.
[59] Alireza Zare, Alireza Aminlou, Miska M. Hannuksela, and Moncef Gabbouj. 2016. HEVC-compliant tile-based streaming of panoramic video for virtual reality applications. In Proceedings of the 24th ACM International Conference on Multimedia. ACM, 601–605.
[60] Haodan Zhang, Yixuan Ban, Zongming Guo, Ken Chen, and Xinggong Zhang. 2022. RAM360: Robust adaptive multi-layer 360 video streaming with Lyapunov optimization. IEEE Transactions on Multimedia (2022).
[61] Lei Zhang, Yanyan Suo, Ximing Wu, Feng Wang, Yuchi Chen, Laizhong Cui, Jiangchuan Liu, and Zhong Ming. 2021. TBRA: Tiling and bitrate adaptation for mobile 360-degree video streaming. In Proceedings of the 29th ACM International Conference on Multimedia. 4007–4015.
[62] Yuanhong Zhang, Zhiwen Wang, Junquan Liu, Haipeng Du, Qinghua Zheng, and Weizhan Zhang. 2022. Deep reinforcement learning based adaptive 360-degree video streaming with field of view joint prediction. In 2022 IEEE Symposium on Computers and Communications (ISCC'22). 1–8.
[63] Yuanxing Zhang, Pengyu Zhao, Kaigui Bian, Yunxin Liu, Lingyang Song, and Xiaoming Li. 2019. DRL360: 360-degree video streaming with deep reinforcement learning. In IEEE Conference on Computer Communications (INFOCOM'19). IEEE, 1252–1260.
[64] Chao Zhou, Zhenhua Li, Joe Osgood, and Yao Liu. 2018. On the effectiveness of offset projections for 360-degree video streaming. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, 3s (2018), 1–24.
[65] Junni Zou, Chenglin Li, Chengming Liu, Qin Yang, Hongkai Xiong, and Eckehard Steinbach. 2019. Probabilistic tile visibility-based server-side rate adaptation for adaptive 360-degree video streaming. IEEE Journal of Selected Topics in Signal Processing 14, 1 (2019), 161–176.
