Viewport-Adaptation-Induced Immersive Video Quality Assessment Database

Shaowei Xie


OmniDirectional Videos (ODVs) offer users the freedom to navigate inside a virtualized environment. Instead of streaming the entire bulky content, viewport or Field of View (FoV) adaptive streaming is preferred. The High-Quality (HQ) content is typically streamed within the current viewport and a Low-Quality (LQ) representation elsewhere, so as to reduce network bandwidth consumption. Such a scheme leads to a quality refinement after the user adapts his/her focus to a new viewport. Therefore, based on this dataset, we have attempted to model the perceptual impact of quality variations (obtained by adjusting the Quantization Stepsize (QS or q) and the Spatial Resolution (SR or s)) with respect to the Refinement Duration (RD) when refining from an arbitrary LQ scale to an arbitrary HQ one. A number of quality variations are studied to cover sufficient use cases in practice, resulting in a unified analytical model: a product of separable exponential functions that measure the QS- and SR-induced perceptual impacts in terms of the RD, and a perceptual index measuring the subjective quality of the corresponding viewport video after refinement.

author={S. {Xie} and Y. {Xu} and Q. {Shen} and Z. {Ma} and W. {Zhang}},
journal={IEEE Transactions on Circuits and Systems for Video Technology},
title={Modeling the Perceptual Quality of Viewport Adaptive Omnidirectional Video Streaming},


In general, we wish to model the perceptual impact of quality refinement from an arbitrary lower level Ql = Q(ql, sl) to an arbitrary higher level Qh = Q(qh, sh) (Case C in Fig. 1(c)) without manually imposing any constraints, so that both levels depend only on the actual network conditions. However, this requires considering multiple variables at a time, which makes it harder to design a subjective quality assessment methodology that covers sufficient use cases. Instead, we derive the perceptual model by fixing one boundary condition and leaving the other unconstrained. We first assume refinement from an arbitrary lower level Ql to the highest one QHst = Q(qmin, smax), noted as Case A in Fig. 1(c), and then consider refinement from the lowest level QLst = Q(qmax, smin) to an unconstrained higher level Qh, denoted as Case B. We finally extend the models of Case A and Case B to the generalized Case C, leading to a unified functional form in which the perceptual quality is well explained by a product of exponential functions of the ql- or sl-induced impact with respect to the RD τ, and a perceptual index measuring the subjective quality Qh after refinement.

We have chosen 31 ODVs: 14 are test sequences selected by the international standard organization Joint Video Exploration Team (JVET, marked with *), 5 are test sequences from the Virtual Reality Unity (VRU) organization in China (marked with †), and 12 are YouTube 360-degree videos. Note that “GT_Sheriff” is a Computer-Generated (CG) video. The native SRs are mostly 3840×1920, except for some YouTube videos, i.e., ‘Elephants’, ‘Rhinos’, ‘Diving’, ‘Elephants2’ and ‘Street’ at 3840×2048, and ‘Venice’ at 3840×2160. The YouTube videos are all clipped to 300 frames, to be consistent with the sequences from JVET and VRU. All videos are rendered at 30 FPS. The ODVs are selected to cover sufficient use cases and a wide range of spatio-temporal activities. We also ensure that the videos contain sufficient salient regions, each of which could belong to a distinct FoV, since the user’s viewport usually adapts across these salient FoVs. In addition to the training video “KiteFlite”, 26 ODVs are used to generate Processed Video Samples (PVSs) for evaluating the aforementioned refinement use cases (Fig. 2), while the other 4 ODVs are prepared for real-life demonstration in the wild (Fig. 3), where we collect the MOSs while users enjoy and navigate the content freely.

In our experiment, each PVS consists of three consecutive parts, i.e., the viewing period of FoV#1, the viewport adaptation period and the viewing period of FoV#2, as shown in Fig. 1(a). Users start at FoV#1, then navigate their focus to FoV#2. Quality refinement happens once the user's focus has stabilized in FoV#2. Specifically, the first temporal segment of FoV#2 is up to a few seconds long and encoded at an LQ scale Ql, followed by the HQ one Qh after refinement. To cover sufficient combinations of quality variation ∆Q = Qh − Ql and RD τ, we apply five different quantization parameters (i.e., QP = 22, 27, 32, 37, 42) or three SRs (i.e., native, 1/4 and 1/16 downscaled versions), and additionally set ten distinct RDs (RD = 0.1, 0.3, 0.7, 1.5, 2, 5 seconds for PVS generation in the managed lab environment, and RD = 0.2, 0.5, 1.2, 3 seconds for validation in the wild). Note that QP = 22 and the native SR are the compression parameters for the highest quality level. We keep the frame rate unchanged in this work, i.e., 30 FPS. We use open source x264 to produce H.264/AVC coded PVSs. The same methodology can be applied to HEVC compliant videos as well. Because the HTC Vive HMD offers a horizontal viewing range of 110 degrees, we crop a 120° × 90° FoV from the original content so that it fully covers the screen of the HMD. The total length of each PVS is 10 seconds.
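The timing figures above can be turned into frame counts with a little arithmetic. The following is a minimal sketch, not the authors' tooling: the constant and function names are hypothetical, and rounding each RD to the nearest whole frame is an assumption.

```python
# Sketch: convert each Refinement Duration (RD) into a frame count at 30 FPS.
# All names here are illustrative; rounding to the nearest frame is an assumption.
FPS = 30                 # frame rate stated in the text
PVS_SECONDS = 10         # total PVS length stated in the text
LAB_RDS = [0.1, 0.3, 0.7, 1.5, 2, 5]   # seconds, managed lab environment
WILD_RDS = [0.2, 0.5, 1.2, 3]          # seconds, validation in the wild


def refinement_frames(rd_seconds, fps=FPS):
    """Number of frames shown at the LQ scale before refinement to HQ."""
    return round(rd_seconds * fps)


if __name__ == "__main__":
    print("frames per PVS:", PVS_SECONDS * FPS)
    for rd in LAB_RDS + WILD_RDS:
        print(f"RD = {rd:>4} s -> {refinement_frames(rd)} LQ frames")
```

Under these assumptions the shortest lab RD (0.1 s) corresponds to 3 frames and the longest (5 s) to 150 frames, i.e. half of a 300-frame PVS.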

For model validation using a real-life viewport adaptive streaming system, we apply Cube Map Projection (CMP), which uses six faces to represent the front, back, top, bottom, left and right viewports respectively. To further reduce the transport bandwidth, Truncated Square Pyramid (TSP) packing is used to produce the current target viewport at native SR, with the other viewports downscaled. Six copies of the same content (corresponding to the six viewports of the CMP scheme) are cached on the server, and the appropriate video is delivered in MPEG Media Transport Protocol (MMTP) compliant packets according to the user’s request. To guarantee a sufficient viewing duration, including the few seconds needed for the user's focus to stabilize during navigation, each video lasts 15 seconds.

PVS Samples


For a fast download, the original ODVs in .yuv format have been encoded into .mp4 format via lossless compression (FFmpeg settings: -vcodec libx265; -crf 0).
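To work with raw frames again, the .mp4 files can be decoded back to .yuv with FFmpeg. The sketch below only builds the command line; the file names are hypothetical, and the `yuv420p` pixel format is an assumption that should be checked against the actual bit depth and chroma format of each sequence.

```python
# Sketch: build an FFmpeg command that decodes a downloaded .mp4 back to raw .yuv.
# File names are placeholders; pix_fmt="yuv420p" (8-bit 4:2:0) is an assumption.
import subprocess


def decode_command(src, dst, pix_fmt="yuv420p"):
    """Return the argument list for decoding `src` into raw video `dst`."""
    return ["ffmpeg", "-i", src, "-f", "rawvideo", "-pix_fmt", pix_fmt, dst]


if __name__ == "__main__":
    cmd = decode_command("example_odv.mp4", "example_odv.yuv")
    print(" ".join(cmd))
    # Uncomment to actually run the decode (requires ffmpeg on PATH):
    # subprocess.run(cmd, check=True)
```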



This database consists of distorted screen content videos generated under different compression scenarios.


Three distortion types with five distortion levels:

  • 264 compression: ffmpeg with -g 8 -qp [24 30 36 42 48]
  • 265 compression: HM 16.18 with encoder_lowdelay_main_rext.cfg -IntraPeriod 8 -GOPSize 4 -QP [24 30 36 42 48]
  • 265-SCC compression: HM 16.18 with encoder_lowdelay_main_scc.cfg -IntraPeriod 8 -GOPSize 4 -FastSearch 2 -QP [24 30 36 42 48]

The configuration of the bitstream is included in the name of the file.

Take ‘MissionControlClip3_1920x1080_30p_8bit_420_304f_QP24.bin’ as an example:
Frame rate: 30
Bit depth: 8
YUV format: 420
Number of frames: 304
Distortion type*: 265 compression with QP=24

*264QP** means 264 compression; QP** means 265 compression; SCCQP** means 265-SCC compression, where ** stands for the QP value.
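The naming convention can be read programmatically. The following is a minimal sketch rather than an official tool: the `ClipInfo` type and `parse_name` function are hypothetical, and the regular expression assumes that every bitstream follows the same field order as the example, with the `264` or `SCC` tag placed directly before `QP`.

```python
# Sketch: parse a bitstream file name into its configuration fields.
# ClipInfo and parse_name are illustrative names; the field order is assumed
# to match the single example given in the text.
import re
from typing import NamedTuple


class ClipInfo(NamedTuple):
    name: str
    width: int
    height: int
    frame_rate: int
    bit_depth: int
    yuv_format: str
    num_frames: int
    distortion: str   # "264", "265" or "265-SCC"
    qp: int


_PATTERN = re.compile(
    r"^(?P<name>.+)_(?P<w>\d+)x(?P<h>\d+)_(?P<fps>\d+)p_(?P<depth>\d+)bit_"
    r"(?P<yuv>\d{3})_(?P<frames>\d+)f_(?P<tag>264|SCC)?QP(?P<qp>\d+)\.bin$"
)

# No tag before "QP" means 265 compression, per the note above.
_DISTORTION = {"264": "264", None: "265", "SCC": "265-SCC"}


def parse_name(filename):
    """Split a file name such as '..._304f_QP24.bin' into a ClipInfo."""
    m = _PATTERN.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    return ClipInfo(
        name=m.group("name"),
        width=int(m.group("w")),
        height=int(m.group("h")),
        frame_rate=int(m.group("fps")),
        bit_depth=int(m.group("depth")),
        yuv_format=m.group("yuv"),
        num_frames=int(m.group("frames")),
        distortion=_DISTORTION[m.group("tag")],
        qp=int(m.group("qp")),
    )


if __name__ == "__main__":
    print(parse_name("MissionControlClip3_1920x1080_30p_8bit_420_304f_QP24.bin"))
```

Raising `ValueError` on a non-matching name is a deliberate choice here, so that files deviating from the assumed pattern are noticed rather than silently misread.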