AVT Workbook Video Streaming Whitepaper
MANAGER'S WORKBOOK
Planning for Video Streaming: What You Really Need to Know
Learning Objectives & Thought Leadership
Learn the key decision points and concepts, and gain enough technical background information to make informed decisions when specifying and integrating streaming media solutions.
»»Audience Requirements
»»Latency Tolerance
»»Video Compression
The content and images in this Workbook are owned by its authors and/or
respective companies and may not be reprinted or reproduced in any way.
THE Workbook You Need for Video Streaming
Streaming audio and video content across a network is a broad topic, ranging from real time connections replacing a traditional AV matrix switch, to conferencing, to cat videos on YouTube, with a huge range of mix and match technologies in any given solution. Choosing the right solution involves considering both the application and content attributes, and ruling out the technologies that do not support those attributes.
The goal of this document is to explain key decision points and concepts, and to give enough technical background information to allow the audience to make informed decisions in specifying and integrating streaming media solutions. This document discusses networks only at a high level, and only where they impact the decision process. A familiarity with IP networks and video technology is helpful but not required.
Application Attributes
Real Time or On Demand . Audience . Latency . Bandwidth . Security . Management . Scalability
Attributes and Descriptions
Real Time, On Demand, or Both: Real time streaming is consumed as it is produced. On demand streaming is recorded, and each user consumes the video at their convenience.
Bandwidth: The number of bits per second available on the network for content distribution.
Audio Reinforcement (echo threshold for intelligibility): Acceptable <18ms; Marginal 18-30ms. References: ITU-T G.131, Talker Echo and Its Control; Helmut Haas, The Influence of a Single Echo on the Audibility of Speech.
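For planning, these thresholds reduce to simple bookkeeping: sum the latency of every stage in the chain and compare against the limits cited above. The Python sketch below is illustrative only; the stage values and function name are hypothetical, not from the whitepaper or any standard.

```python
# Hedged sketch: classify an end-to-end audio latency budget against the
# echo thresholds cited above (ITU-T G.131 / Haas).
ACCEPTABLE_MS = 18   # below this, talker echo is acceptable
MARGINAL_MS = 30     # 18-30ms is marginal for intelligibility

def classify_echo_latency(total_ms: float) -> str:
    if total_ms < ACCEPTABLE_MS:
        return "acceptable"
    if total_ms <= MARGINAL_MS:
        return "marginal"
    return "unacceptable"

# Hypothetical budget: encode + network + decode + DSP (values assumed)
budget_ms = 5 + 2 + 5 + 4
print(budget_ms, classify_echo_latency(budget_ms))  # 16 acceptable
```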
Content Attributes
Content Source . Interface . Resolution and Rate . Bandwidth
Attributes and Descriptions
Content Source: What is generating the content? Is there more than one content source? Do multiple sources need to be streamed simultaneously?
Resolution (Source, Transmitted, and Reproduced):
Video
»»The number of pixels (the smallest element of an image) in the video. Generally specified as Width x Height.
Audio
»»The number of bits of information in each audio sample.
»»Higher bit depth results in a higher maximum signal to noise ratio (S/N).
Rate:
Video
»»Frame rate: the number of frames (complete images) per second. Generally specified as frames per second (fps).
Audio
»»Sample rate: the number of times per second an audio signal is sampled (measured for its instantaneous value).
»»Higher sample rates allow for reproduction of higher frequencies.
Bandwidth: The number of bits per second required to transmit the required AV signal at an acceptable quality.
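Resolution, frame rate, and bit depth fix the raw bandwidth of a video signal before any compression. The sketch below (function and variable names are ours, for illustration) shows why compression is central to any streaming design.

```python
# Illustrative sketch: raw (uncompressed) bandwidth implied by resolution,
# frame rate, and bits per pixel.
def raw_video_bitrate_bps(width: int, height: int, fps: float,
                          bits_per_pixel: int = 24) -> float:
    """Bits per second before any compression is applied."""
    return width * height * bits_per_pixel * fps

# 1080p at 30fps with 24 bit color needs roughly 1.5 Gbps uncompressed.
print(raw_video_bitrate_bps(1920, 1080, 30) / 1e9)  # ~1.49 Gbps
```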
Audio
Audio, which is inherently analog, is converted to a digital representation that can be transmitted through a network by a process with the easy to remember name, analog to digital conversion (ADC). In ADC, the audio signal is sampled (measured) at a regular interval. The resulting set of samples is a Pulse Code Modulated (PCM) signal.
PCM has two attributes which determine the quality of the sampled audio: bit depth and sampling rate. Bit depth, the number of bits used to quantify the signal, determines the resolution of a sample. An 8 bit sample has 256 possible values, a 16 bit sample has 65,536, and a 24 bit sample has 16,777,216.
The sampling rate is the number of times in a second the sample is taken. This determines the highest frequency that can be reproduced, as described by the Nyquist–Shannon sampling theorem, which states: If a function x(t) contains no frequencies higher than B hertz, it is completely determined by giving its ordinates at a series of points spaced 1/(2B) seconds apart.
This means that the highest frequency which can be reproduced is half the sample rate. The full range of human hearing is between 20Hz and 20kHz. Human speech is in the range of 300Hz to 3,500Hz. Common sampling frequencies run from 8,000Hz for telephone calls to 48,000Hz for professional audio recordings. (CD audio reproduces 20Hz to 22,000Hz.) PCM audio can be streamed uncompressed, or it can be further compressed before streaming.
If the audio source is in the form of a digital PCM stream that has a different sample rate from the desired stream, that conversion is typically done as part of the encoding process.

BANDWIDTH
In the context of content, bandwidth is the bitrate of a single AV stream crossing the network. The biggest tradeoffs typically made in a streaming application are balancing quality and bandwidth. Based on the overall bandwidth available, discovered in the application phase, a bandwidth budget for the content must be established and used as a goal when evaluating the technologies presented in the next section.
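The PCM relationships above reduce to simple arithmetic. The following sketch is illustrative only; the function names are ours and not from any standard library.

```python
# Bit depth, sample rate, and channel count determine PCM quality and bitrate.
def pcm_levels(bit_depth: int) -> int:
    """Number of distinct values a sample can take: 2**bit_depth."""
    return 2 ** bit_depth

def nyquist_limit_hz(sample_rate_hz: int) -> float:
    """Highest reproducible frequency is half the sample rate."""
    return sample_rate_hz / 2

def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed PCM bandwidth in bits per second."""
    return sample_rate_hz * bit_depth * channels

print(pcm_levels(16))                  # 65536 possible values
print(nyquist_limit_hz(48_000))        # 24000.0 Hz
print(pcm_bitrate_bps(48_000, 24, 2))  # 2,304,000 bps (~2.3 Mbps, stereo)
```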
Streaming Technology Components
Video Compression . Compression and Codec Terminology . Intra-frame Codecs . Inter-frame Codecs . Containers . Session Controls . Transport Protocols
A Group of Pictures (GOP) is a repeating pattern of how the various types of frames are arranged.
This stream hierarchy makes MPEG codecs particularly suitable for transport over IP networks.
MPEG encoders' capabilities are defined by profiles and levels. The profile defines the subset of features, such as compression algorithm and chroma format. The level defines the subset of quantitative capabilities, such as maximum bit rate and maximum frame size. Decoders often describe their playback specification using the maximum profile and level they are capable of decoding. For example, a decoder with an MPEG specification of MP@ML can play back any stream encoded up to Main Profile, Main Level.
MPEG defines a bitstream and a decoder function. A valid MPEG file or stream is one that can be decoded by the standard decoder. This leaves the encoding methodology up to the manufacturer. While this means that there is room for innovation and growth, the video quality and encoding efficiency are not guaranteed by the standard, so due diligence should be performed when choosing an encoder.

Inter-frame Compression
MPEG codecs use a variety of compression techniques to compress video based around a GOP. A greatly simplified version of the compression is discussed below.
The first frame to be compressed in a GOP is the I-frame.

Macroblock Compression Example

MPEG compression is organized around blocks. The illustration above depicts the data in a 16 pixel x 16 pixel macroblock, the smallest unit of video in MPEG-2, with 4:2:0 chroma subsampling. The Y component has 256 samples; the Cb and Cr components each contain 64 samples. The macroblock is divided into six 8 x 8 blocks of 64 samples each: four Y, one Cb, one Cr. These 8 x 8 blocks are then compressed using the Discrete Cosine Transform (DCT), and the resulting compressed values are combined to form the compressed macroblock. The entire I-frame is compressed as an array of macroblocks. If the height or width in pixels of the frame is not evenly divisible by 16, the leftover area must still be encoded as a macroblock; the unallocated area is simply not displayed. I-frames can be considered effectively identical to baseline JPEG images.
Even if the next frame in display order is a B-frame, the next frame to be encoded is a P-frame. This is because a B-frame is bi-predictive: it requires a compressed anchor frame both before and after it as references for compression.

P-frame Compression Example

P-frames exploit the fact that often much of the video changes little, if at all, from frame to frame. This is called temporal (over time) redundancy. The P-frame is divided into macroblocks, which are checked to see if any of them equal any macroblocks on the anchor frame. If any do, they are recorded as "the same as that other one," which takes much less space than a DCT compression (shown in red on the above illustration). The remaining macroblocks are checked to see if they mostly match any other macroblock on the anchor frame. If they do, the moved macroblock is recorded as a vector (a direction and a distance), as shown in green on the above illustration. Any remaining macroblocks are compressed with DCT. If a video changes drastically from one frame to the next (such as a cut), it is more efficient to encode the new frame as an I-frame.
Next, any B-frames between the two anchor frames are encoded, using both anchor frames as references.
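To make the I-frame and P-frame ideas above concrete, here is a greatly simplified Python sketch. It is illustrative, not a real MPEG encoder: the coefficient-dropping stands in for real quantization, and the macroblock origin and match threshold are assumed values.

```python
# Two MPEG ideas in miniature: (1) an 8x8 block compressed with a 2-D DCT,
# and (2) a P-frame macroblock recorded as a motion vector when it matches
# the anchor frame.
import numpy as np
from scipy.fft import dctn, idctn

def compress_block(block: np.ndarray, keep: int = 10) -> np.ndarray:
    """DCT an 8x8 block and zero all but the `keep` largest coefficients."""
    coeffs = dctn(block, norm="ortho")
    threshold = np.sort(np.abs(coeffs), axis=None)[-keep]
    coeffs[np.abs(coeffs) < threshold] = 0  # crude stand-in for quantization
    return coeffs

def decompress_block(coeffs: np.ndarray) -> np.ndarray:
    return idctn(coeffs, norm="ortho")

def find_motion_vector(anchor: np.ndarray, mb: np.ndarray, search: int = 8):
    """Exhaustive search for a 16x16 macroblock match near an assumed
    origin in the anchor frame. Returns (dy, dx) or None."""
    h, w = anchor.shape
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = 32 + dy, 32 + dx          # hypothetical macroblock origin
            if 0 <= y <= h - 16 and 0 <= x <= w - 16:
                candidate = anchor[y:y + 16, x:x + 16]
                if np.abs(candidate - mb).mean() < 1.0:
                    return (dy, dx)          # "same as that other one"
    return None                              # fall back to DCT compression
```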
H.26X standards and their MPEG equivalents:
H.261: MPEG 1
H.262: MPEG 2
H.263: MPEG 4 Part 2
H.264: MPEG 4 Part 10

H.264 is an ITU adoption of MPEG 4 Part 10 (AVC). The ITU and ISO/IEC jointly maintain the standards so they remain identical. It is important to note that not all MPEG 4 video is H.264. Unless it states MPEG 4 Part 10 or AVC, an MPEG 4 encoder may or may not be H.264, depending on the profiles implemented. An H.264 decoder will be backwards compatible across all the MPEG 4 profiles, and will typically be backwards compatible across all the H.26X standards, including H.262 (MPEG-2).

CONTAINERS
A container is a metafile: a file that can store multiple types of data in a structured format. A video container holds all the different files and data the player requires to play the video. This may include one or more audio channels, one or more videos, graphics files, metadata, and data files. Metadata is "data about data": a data structure that organizes information the player may need, including, but not limited to, timing information, window size, which codec(s) are being used, titles, program information, bookmarks, and closed captioning.

Color Model and Color Space
Pixels have only one property: color. The color of a pixel in digital video is represented by a string of bits.
A color model is a mathematical way of representing a color with numbers. The most commonly known color model is RGB, which represents a color by a combination of red, green, and blue values. 24 bit RGB video has 8 bits representing each color. This allows for 256 x 256 x 256 = 16.7 million individual colors. Other bit depths can be used depending on the application. The color model most often used for digital video is Y'Cb'Cr'. The prime ( ' ) is a mathematical symbol showing that the value is non-linear (the sample values are not equally spaced). Y'Cb'Cr' is typically noted as YCbCr, as it will be for the remainder of this document. The term YUV, which is the PAL and SECAM equivalent (and therefore incorrectly used in digital video), is also often used interchangeably. Y is luma (brightness); the simple description of Cb and Cr is Blue minus Y and Red minus Y.
Color space is the real world matchup between the numbers and the colors they represent. Depending on the range of colors needing to be represented, the same color model can represent different color spaces. When video from a computer, which uses a 24 bit RGB model, is compressed with a standard video codec, the color space must be converted to YCbCr before compression. If you compress a video and the colors come out wrong, the likely culprit is color space conversion.
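The conversion described above can be written out directly. This sketch assumes the BT.601 coefficients common in standard definition video and 8 bit (0-255) full-range values; studio-range handling is omitted.

```python
# Sketch of the RGB -> YCbCr conversion: luma as a weighted brightness,
# chroma as scaled blue-difference and red-difference values.
def rgb_to_ycbcr(r: int, g: int, b: int) -> tuple[float, float, float]:
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luma (brightness)
    cb = 128 + 0.564 * (b - y)              # blue-difference chroma (B - Y)
    cr = 128 + 0.713 * (r - y)              # red-difference chroma (R - Y)
    return y, cb, cr

print(rgb_to_ycbcr(255, 255, 255))  # white: (255.0, 128.0, 128.0)
print(rgb_to_ycbcr(255, 0, 0))      # pure red: high Cr, low Cb
```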
Network Transit
TCP and UDP . When to Use Which . Unicast and Multicast
[Figure: Unicast vs. Multicast]
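The unicast/multicast distinction is visible directly at the socket level. Below is a minimal, illustrative Python sender sketch; the addresses and port are example values (192.0.2.x is documentation address space, 239.x.x.x is organization-local multicast space).

```python
# Unicast sends one copy of every packet per receiver; multicast sends one
# copy total and the network replicates it to receivers that joined the group.
import socket

PAYLOAD = b"one packet of AV stream data"

# Unicast: the sender's bandwidth scales with the number of receivers.
uni = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for receiver in ("192.0.2.10", "192.0.2.11"):   # example receivers
    uni.sendto(PAYLOAD, (receiver, 5004))

# Multicast: one transmission to a group address, replicated by the network.
multi = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
multi.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 1)
multi.sendto(PAYLOAD, ("239.1.1.1", 5004))
```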
Each attribute you define will have a different importance based on the application. The charts below offer generalized information on how various technologies compare within chosen attributes. Some of the attributes, such as management, security, and some degree of scalability, will depend more on the specific implementation.
Quality as an attribute is somewhat subjective, since each lossy compression algorithm has different artifacts which affect the quality. Subjective quality is also very dependent on resolution, frame rate, latency, and the bandwidth allowed for the stream. It is up to the reader to provide the priorities and tradeoffs allowed.

Example: A classroom overflow with the ability of the remote classrooms to ask questions verbally.
Within a building or campus:
»»They might prioritize the attributes Latency and Quality, with a very low priority on bandwidth, since the application stays within the local network, which has very high available bandwidth, and choose a proprietary codec using very high bandwidth over multicast.
Between two campuses over the Internet:
»»They might prioritize Bandwidth and Latency and choose a JPEG2000 codec over unicast, since the Internet won't support multicast.
»»They might prioritize Bandwidth and Quality and choose an H.264 codec, since H.264 will give a higher quality at a given bandwidth than JPEG2000, while sacrificing latency.
General Attributes Prioritized (most suitable option first)
Real Time (Audience Local): Transport Stream, Multicast / SDP with RTP, Multicast / Progressive Download, Unicast
Real Time (Audience Local): Multicast, UDP / Unicast, UDP / Unicast, TCP
On Demand: HTTP Live Stream / RTSP with RTP / Progressive Download
High Quality to BYOD: H.264 / JPEG2000 / Proprietary
Diverse Devices: H.264 / JPEG2000 / Proprietary
High Quality Room Overflow: Proprietary / H.264
Set top boxes: Transport Stream
Bandwidth: H.264, other MPEG / JPEG2000 / Proprietary Codec
Latency: Proprietary Codec / JPEG2000 / H.264, other MPEG
Capture for production: JPEG2000 / H.264, other MPEG
Progressive Download: N/A / Fair / RT: poor
»»Requires the full file to be available before serving.
»»Does not require a streaming server, just a file server.
*Dependent on streaming server configuration
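The "just a file server" point above is why progressive download is so simple to deploy: the client fetches an ordinary file over HTTP and can begin playback once enough has arrived. A minimal sketch using the requests library follows; the URL and filename are hypothetical.

```python
# Progressive download: plain HTTP fetch of a complete, pre-recorded file.
import requests

url = "http://example.com/lecture.mp4"   # any plain web/file server works
with requests.get(url, stream=True) as resp:
    resp.raise_for_status()
    with open("lecture.mp4", "wb") as f:
        for chunk in resp.iter_content(chunk_size=64 * 1024):
            f.write(chunk)   # a player could begin reading the file here
```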
MPEG codecs (H.264, H.265): Inter-frame compression is most efficient, and later standards are more efficient. Latency: 100ms-600ms or more. At least one frame of latency at encode and at decode; additional fixed latency is added based on Group of Pictures (GOP) size. Each B-frame adds one frame of encode latency, and any number of B-frames adds one frame of decode latency.
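The latency rule in that row can be turned into arithmetic. The formula below is our reading of the text above, not a vendor specification, and ignores network and buffering delay.

```python
# Minimum MPEG codec latency from GOP structure, per the rule above:
# one frame each at encode and decode, plus one encode frame per
# consecutive B-frame, plus one decode frame if any B-frames are used.
def min_codec_latency_ms(fps: float, consecutive_b_frames: int) -> float:
    frame_ms = 1000.0 / fps
    encode_frames = 1 + consecutive_b_frames
    decode_frames = 1 + (1 if consecutive_b_frames > 0 else 0)
    return (encode_frames + decode_frames) * frame_ms

# 30 fps stream with the common IBBP pattern (2 consecutive B-frames):
print(min_codec_latency_ms(30, 2))  # ~166.7 ms before network/buffer delay
```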