Architecting a Video Encoding
Strategy Designed for Growth
When building out an online video strategy, there are myriad decisions that will
have a direct impact on how viewers engage with video. By architecting the
video experience from the beginning to be flexible and dynamic, it’s possible
to build a system that is not only a joy for users to watch, but is designed at its
core for growth. In this guide, we will discuss simplifying output renditions for
multi-device streaming, dynamically generating playlists with HLS and Smooth
Streaming protocols and concatenating video using manifest files.
RENDITIONS AND THE MODERN WORLD
In its most basic form, online video consists of
transcoding a single source file into a single
output file that will play over the Web. Each of
these video files is called a rendition, and an array
of renditions defines how video will be delivered
to end-users.
When YouTube launched in 2005, it delivered a
single output rendition through a basic player.
Fast forward to 2013 and the world of online video
is defined by HTML5/Flash players, ad-insertion,
recommendation engines, paywalls, and anywhere
from a handful to a boatload of renditions at
different bitrates and in various formats.
It may sound like a confusing mess, and it can
be, but there are strategies that can simplify your
approach to delivering video, shrink costs, and
improve the viewer’s experience. It all starts with
renditions.
CLIMB THE LADDER
Imagine the world of devices as a wall. At the
bottom of the wall are the least capable, most
painful-to-use feature phones with 3G connections
and a tiny screen. At the top of the wall, we have a
brand-new HDTV with a fast Internet connection.
Between the bottom and the top of the wall is a
range of devices, each having different processors,
GPUs, network connections and screen sizes.
The height of the wall is determined by average
content duration; the longer the duration, the
higher the wall. Renditions are like ladders that
help us start anywhere along the wall and climb up
or down smoothly. If the wall is high, the ladder needs more rungs to ensure users can smoothly climb up and down. If the wall is
short, we can get away with only a couple of rungs
and still provide a good experience.
Structuring Video Renditions for Simplicity
Encoding strategies to keep costs down and quality high.
Step 1: The First Ladder
The first step is to decide on a base format. The base format should be playable on a wide range of
devices. It might not always be the best choice on every device, but it should always be playable. The
goal of online video is to get in front of everybody.
Zencoder supports a wide swath of the most important output formats for Web, mobile and connected
TVs. Valid use cases exist for each of these formats, but for the vast majority of cases MP4 is the best option due to its ubiquity across the widest range of devices. The first ladder we build will be based on the MP4 format.

[Figure: Climbing the Ladder — devices arranged by bandwidth, device class, and screen size, from SD through 720p to 1080p.]
Step 2: Bitrates — Creating the Ladder’s Rungs
Now that we have decided which ladder to create first, we can begin constructing the rungs.
First, decide where on the wall the service should start and end. For example, consider a user-generated
content site where the average video duration is one minute. The maximum size of each video is
small, so there is no need to worry about buffering or stream disruptions; the player should be able to
download the whole stream in a few seconds, which means only a couple of renditions are needed, for
example, one HD and one SD.
On the other hand, consider a movie service with an average video length of 120 minutes. The files
are large, which means the user’s device won’t be able to download the entire stream. In addition,
users generally have higher expectations for the quality of feature films. We need to create a number
of renditions so users will be able to watch high-quality videos when they have a strong network
connection. If the connection is poor, we still want them to be able to watch a video, and then improve
the experience as soon as more bandwidth is available by providing intermediate renditions — stepping
up the ladder.
The longer the content and the higher the quality, the more renditions are needed to provide a
consistent viewing experience.

[Figure: Climbing the Ladder — the rungs as bitrates, from 250 kbps, 500 kbps, and 750 kbps through 1.5 Mbps, 2.5 Mbps, and 5 Mbps, spanning SD to 1080p.]
Step 3: Defining the Rungs
We have created a nice, smooth ladder, but there is room for improvement. Aside from bitrate and
resolution, H.264 has two other features that are used to target renditions at subsets of devices: profile
and level.
Profile defines the complexity of the encoding algorithm required to decode a given rendition, ranging from low to high. The three most important profiles are baseline, main and high. Level defines a maximum number of pixels and a maximum bitrate that a given rendition is guaranteed not to exceed. The three most important levels are 3.0 (SD/legacy mobile), 3.1 (720p/mobile), and 4.1 (1080p/modern devices).
At the bottom rung, we want to provide the widest array of support so that we can always deliver a playable video regardless of the device. That means we should choose either baseline 3.0 or main 3.1, and we should choose a resolution that is fairly modest, most likely between 480x270 and 640x360. As we move up the ladder, we can gradually increment these values until we reach the top, where we can maximize our video quality with high-profile, level 4.1, 1080p videos.

[Figure: Climbing the Ladder — the finished rungs: 250 kbps and 500 kbps at Baseline 3.0; 750 kbps, 1.5 Mbps, and 2.5 Mbps at Main 3.1; 5 Mbps at High 4.1.]
Step 4: Formats — Duplicating Ladders
Now that our MP4s have been created, we have a stable base format and customers can watch video on
a variety of devices; we created a ladder to scale the wall. While MP4 is a strong baseline format, other
formats can improve the user’s experience. For example, HLS allows a user’s device to automatically
and seamlessly jump up and down the ladder.
Since we have already created MP4s, and because MP4 is a standard format, we can easily repackage it into other formats. In fact, this is such an easy task that Zencoder charges only 25 percent of a normal job to perform this duplication, called transmuxing, and it can be done nearly instantly alongside a group of MP4 encodings by using "source", "copy_video", and "copy_audio".
The "source" option tells Zencoder to reuse the file created under a given output "label". So, if we create a file with "label": "MP4_250", all we need to do is set "source": "MP4_250" to tell Zencoder to reuse that rendition. "copy_video" and "copy_audio" will then extract the elementary audio and video tracks and repackage them into an HLS-formatted file.
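As a concrete sketch, a job request with one MP4 rung and its HLS transmux might be built as follows in Python; the input URL, bitrate, and the "MP4_250" label are illustrative, and a real job would list the full rendition ladder:

```python
import json

def build_transmux_job(input_url):
    """Build a Zencoder-style job request that encodes one MP4 rendition
    and then transmuxes it to HLS by referencing the MP4 output's label.
    URLs, labels, and bitrates here are illustrative."""
    return {
        "input": input_url,
        "outputs": [
            {   # The real encode: the bottom rung of the MP4 ladder.
                "label": "MP4_250",
                "format": "mp4",
                "video_bitrate": 250,
                "h264_profile": "baseline",
                "h264_level": "3",
            },
            {   # The transmux: reuse MP4_250's tracks, repackage as segmented HLS.
                "source": "MP4_250",
                "type": "segmented",
                "copy_video": True,
                "copy_audio": True,
            },
        ],
    }

payload = json.dumps(build_transmux_job("s3://example-bucket/source.mov"))
```

Because the second output only repackages existing tracks, it carries none of the cost of a full re-encode.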
We can do the same thing for Smooth Streaming as well. And almost instantly, at a fraction of the cost, we have created two new ladders that let virtually anybody watch great-quality video.
[Figure: Duplicating Ladders — the same six rungs (250 kbps through 5 Mbps, Baseline 3.0 through High 4.1) packaged as two parallel ladders, one MP4 and one HLS.]
Step 5: Refine
The most important thing a video service can do is commit itself to constantly improving, revisiting, and
refining its renditions.
With the pace of online video accelerating by the day, what seems terrific today might only be sufficient
next year. In a couple of years, it will be downright obsolete. Zencoder helps solve these issues by being
a driving force behind the bleeding edge of video encoding technology. We are constantly updating
and building our tools to make the encoding platform faster and more stable with higher quality. The
next step is up to you.
Constantly testing new variations to find the best set of renditions for your users will result in a more
stable and optimized delivery infrastructure and a more engaged user base.
The Dynamic Generation of Playlists
For years, there were two basic models of Internet streaming: server-based proprietary technology such
as RTMP or progressive download. Server-based streaming allows the delivery of multi-bitrate streams
that can be switched on demand, but it requires licensing expensive software. Progressive download
can be done over Apache, but switching bitrates requires playback to stop.
The advent of HTTP-based streaming protocols such as HLS and Smooth Streaming meant that
streaming delivery was possible over standard HTTP connections using commodity server technology
such as Apache. Seamless bitrate switching became commonplace and delivery over CDNs was
simple as it was fundamentally the same as delivering any file over HTTP. HTTP streaming has resulted
in nothing short of a revolution in the delivery of streaming media, vastly reducing the cost and
complexity of high-quality streaming.
When designing a video platform there are countless things to consider; however, one of the most
important and oft-overlooked decisions is how to treat HTTP-based manifest files.
A STATIC MANIFEST FILE
In the physical world, when you purchase a video, you look at the packaging, grab the box, head to the
checkout stand, pay the cashier, go home and insert it into your player.
Most video platforms are structured pretty similarly; fundamentally, a group of metadata (the box)
is associated with a playable media item (the video). Most video platforms start with the concept of
a single URL that connects the metadata to a single MP4 video. As a video platform becomes more
complex, there may be multiple URLs connected to the metadata representing multiple bitrates,
resolutions, or perhaps other media associated with the main item such as previews or special features.
Things become more complicated when trying to extend the physical model to an online streaming
world that includes HTTP-based streaming protocols such as HLS. HLS is based on many fragments
of a video file linked together by a text file called a manifest. When implementing HLS, the most
straightforward method is to simply add a URL that links to the manifest, or m3u8 file. This has the
benefit of being extremely easy and fitting into the existing model.
The drawback is that HLS is not really like a static media item. An MP4, for example, is very much like a video track on a DVD; it is a single video at a single resolution and bitrate. The HLS manifest consists, most likely, of multiple bitrates, resolutions, and thousands of fragmented pieces of video. HLS has the capacity to do so much more than an MP4, so why treat it the same?
THE HLS PLAYLIST
An HLS playlist includes some metadata that describes basic elements of the stream, and an ordered set of links to fragments of the video. By downloading each fragment, or segment, of the video and playing them back in sequence, the user is able to watch what appears to be a single continuous video.
#EXTM3U
#EXT-X-PLAYLIST-TYPE:VOD
#EXT-X-TARGETDURATION:10
#EXTINF:10,
file-0001.ts
#EXTINF:10,
file-0002.ts
#EXTINF:10,
file-0003.ts
#EXTINF:10,
file-0004.ts
#EXT-X-ENDLIST
Above is a basic m3u8 playlist. It links to four video segments. To generate this data programmatically,
all that is needed is the filename of the first item, the target duration of the segments (in this case, 10),
and the total number of segments.
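A minimal sketch of that generation step, assuming segments are named in the pattern file-0001.ts, file-0002.ts, and so on:

```python
def build_playlist(first_file, target_duration, segment_count):
    """Generate a simple VOD m3u8 playlist from the first segment's
    filename, the target duration, and the number of segments.
    Assumes segments are named like 'file-0001.ts', 'file-0002.ts', ..."""
    base, _, _ = first_file.rpartition("-")  # 'file-0001.ts' -> base 'file'
    lines = [
        "#EXTM3U",
        "#EXT-X-PLAYLIST-TYPE:VOD",
        f"#EXT-X-TARGETDURATION:{target_duration}",
    ]
    for i in range(1, segment_count + 1):
        lines.append(f"#EXTINF:{target_duration},")
        lines.append(f"{base}-{i:04d}.ts")
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines)
```

Calling `build_playlist("file-0001.ts", 10, 4)` reproduces the playlist shown above.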
THE HLS MANIFEST
An HLS manifest is an unordered series of links to playlists. There are two reasons for having multiple playlists: to provide various bitrates and to provide backup playlists. Here is a typical manifest, where each of the .m3u8s is a relative link to another HLS playlist:
#EXTM3U
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=2040000
file-2040k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1540000
file-1540k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=1040000
file-1040k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=640000
file-640k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=440000
file-440k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=240000
file-240k.m3u8
#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH=64000
file-64k.m3u8
The playlists are of varying bitrates and resolutions in order to provide smooth playback regardless of
the network conditions. All that is needed to generate a manifest are the bitrates of each playlist and
their relative paths.
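Given just those two pieces of data per playlist, manifest generation is a few lines; a sketch, where the input order controls the order of entries:

```python
def build_manifest(playlists):
    """Generate an HLS master manifest from (bitrate_kbps, relative_path)
    pairs; the ordering of the input list controls entry order."""
    lines = ["#EXTM3U"]
    for kbps, path in playlists:
        # BANDWIDTH is expressed in bits per second in the manifest.
        lines.append(f"#EXT-X-STREAM-INF:PROGRAM-ID=1,BANDWIDTH={kbps * 1000}")
        lines.append(path)
    return "\n".join(lines)

manifest = build_manifest([
    (2040, "file-2040k.m3u8"),
    (1540, "file-1540k.m3u8"),
    (1040, "file-1040k.m3u8"),
])
```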
FILLING IN THE BLANKS
There are many other important pieces of information that an online video platform should be capturing
for each encoded video asset: video codec, audio codec, container, and total bitrate are just a few.
The data stored for a single video item should be meaningful to the viewer (description, rating, cast),
meaningful to the platform (duration, views, engagement), and meaningful for applications (format,
resolution, bitrate). With this data, you enable a viewer to decide what to watch, the system to decide how to program, and the application to decide how to play it back.
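One way to sketch such a record, with illustrative field names grouped by who consumes them (this is not a prescribed schema):

```python
from dataclasses import dataclass

@dataclass
class VideoAsset:
    """Metadata captured for one encoded video item; field names are
    illustrative, grouped by which audience consumes them."""
    # Meaningful to the viewer
    description: str
    rating: str
    # Meaningful to the platform
    duration_s: float
    views: int
    # Meaningful to applications
    video_codec: str
    audio_codec: str
    container: str
    resolution: str
    bitrate_kbps: int
```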
By capturing the data necessary to
programmatically generate a playlist, a manifest
and the codec information for each of the
playlists, it becomes possible to have a system
where manifests and playlists are generated per
request.
EXAMPLE — THE FIRST PLAYLIST
The HLS specification states that whichever playlist comes first in the manifest will be the first chosen for playback. In the previous section's
example, the first item in the list was also the
highest quality track. That is fine for users with
a fast, stable Internet connection, but for people
with slower connections it will take some time for
playback to start.
It would be better to determine whether the device appears to have a good Internet connection and then customize the playlist accordingly. Luckily, with dynamic manifest generation, that is exactly what the system is set up to accomplish.
For the purposes of this exercise, assume a
request for a manifest is made with an ordered
array of bitrates. For example, the request [2040,1540,1040,640,440,240,64] would return a manifest identical to the one in the previous section.
On iOS, it is possible to determine if the user is on
WiFi or a cellular connection. Since data has been
captured about each playlist including bitrate,
resolution, and other such parameters, an app can
intelligently decide how to order the manifest.
For example, it may be determined that it is best
to start between 800-1200kbps if the user is on
WiFi and between 200-600kbps if the user is on
a cellular connection. If the user were on WiFi, the
app would request an array that looks something
like: [1040,2040,1540,640,440,240,64]. If the
app detected only a cellular connection, it would
request [440,2040,1540,1040,640,240,64].
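A sketch of how the app might build that ordered array; the kbps windows are the illustrative ones above:

```python
def order_bitrates(bitrates, low, high):
    """Move the best starting rendition (the highest bitrate inside the
    [low, high] kbps window) to the front, keeping the remaining rungs
    in their original order. Window bounds are illustrative."""
    candidates = [b for b in bitrates if low <= b <= high]
    if not candidates:
        return list(bitrates)  # no rung in the window; leave order unchanged
    start = max(candidates)
    return [start] + [b for b in bitrates if b != start]

ladder = [2040, 1540, 1040, 640, 440, 240, 64]
wifi = order_bitrates(ladder, 800, 1200)  # [1040, 2040, 1540, 640, 440, 240, 64]
cell = order_bitrates(ladder, 200, 600)   # [440, 2040, 1540, 1040, 640, 240, 64]
```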
EXAMPLE — THE LEGACY DEVICE
On Android, video support is a bit of a black box.
For years, the official Android documentation only promised support for 640x480 Baseline H.264 MP4 video, even though certain models were able to
handle 1080p. In the case of HLS, support is even
more fragmented and difficult to understand.
Luckily, Android is dominated by a handful of
marquee devices. With dynamic manifests, the app can not only target the best playlist to start with, but can also exclude playlists that are determined to be incompatible.
Since our media items are also capturing data
such as resolution and codec information, support
can be targeted at specific devices. An app could
decide to send all of the renditions: [2040,1540,1040,640,440,240,64]. Or, an older device that only supports up to 720p could remove the highest rendition: [1540,1040,640,440,240,64].
Furthermore, beyond the world of mobile devices,
if the app is a connected TV, it could remove the
lowest quality renditions: [2040,1540,1040,640].
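A sketch of that exclusion step; the rendition fields and thresholds are illustrative:

```python
def filter_renditions(renditions, max_height=None, min_kbps=None):
    """Drop renditions a target device can't (or shouldn't) play.
    Each rendition is a dict with 'kbps' and 'height'; both field names
    and the cutoff values are illustrative."""
    kept = []
    for r in renditions:
        if max_height is not None and r["height"] > max_height:
            continue  # e.g. a 720p-only device skips the 1080p rung
        if min_kbps is not None and r["kbps"] < min_kbps:
            continue  # e.g. a connected TV skips the lowest rungs
        kept.append(r)
    return kept
```

The resulting list can then be handed to the same ordering logic used for the connection-aware example.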
Choosing a static manifest model is perfectly fine.
Some flexibility is lost, but there is nothing wrong
with simplicity. Many use cases, especially in the
user-generated content world, do not require
the amount of complexity dynamic generation
involves; however, dynamic manifest generation
opens a lot of doors for those willing to take the
plunge.
Video Concatenation Using Manifest Files
CONCATENATION AND THE OLD WAY
Content equals value, so, in the video world, one way to create more value is by taking a single video
and mixing it with other videos to create a new piece of content. Many times this is done through
concatenation, or the ability to stitch multiple videos together, which represents a basic form of editing.
Add to that the creation of clips through edit lists and you have two of the most basic functions of a
non-linear editor.
As promising as concatenation appears, it can also introduce a burden on both infrastructure and
operations. Imagine a social video portal. Depending on the devices they target, there could be
anywhere from a handful to many dozens of output formats per video. Should they decide to
concatenate multiple videos to extend the value of their library, they will also see a massive increase in
storage cost and the complexity of managing assets. Each time a new combination of videos is created,
a series of fixed assets are generated and need to be stored.
Traditional concatenation involves creating new video files that are combinations of multiple existing
files, creating a mess of large files.
[Figure: Traditional concatenation — on a player request, one of the available pre-built concatenated videos is sent from storage to the player.]
VIDEO CONCATENATION USING MANIFEST FILES
The introduction of manifest-driven, HTTP-based streaming protocols has created an entirely new
paradigm for creating dynamic viewing experiences. Traditionally, the only option for delivering
multiple combinations of clips from a single piece of content was through editing, which means the
creation of fixed assets. With technology such as HLS, since the playable item is no longer a video file but a simple text file, making edits to a video is the same as making edits to a document in a word processor.
For a video platform, there are two ways to treat the HLS m3u8 manifest file. Most simply, the m3u8 file can be treated as a discrete, playable asset. In this model, the m3u8 is stored on the origin server alongside the segmented TS files and delivered to devices. The result is simple and quick to implement, but the m3u8 file can only be changed through a manual process.
Instead, by treating the manifest as something that is dynamically generated, it becomes possible to
deliver a virtually limitless combination of clips to viewers. In this model, the m3u8 is generated on the
fly, so it does not sit on the server but will be created and delivered every time it is requested.
1. This article is focused on HTTP Live Streaming (HLS), but the basic concepts are valid for other HTTP-based streaming protocols as well.
By generating HLS manifests on the fly, an unlimited combination of videos can be seamlessly delivered instantly to end-users.

[Figure: Dynamic generation — on a player request, an m3u8 is generated on the fly and any combination of segmented TS files is sent from storage to the player.]
DYNAMIC MANIFEST GENERATION
What is a manifest file? It is a combination of some
metadata and links to segments of video:
Exemplary Video A
#EXTM3U
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-TARGETDURATION:10
#EXTINF:10,
Exemplary_A_segment-01.ts
#EXTINF:10,
Exemplary_A_segment-02.ts
The above m3u8 has two video segments of 10 seconds each, so Exemplary Video A, which, by the way, is a truly great video, is 20 seconds long. Now let's imagine we also have:
Exemplary Video B
#EXTM3U
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-TARGETDURATION:10
#EXTINF:10,
Exemplary_B_segment-01.ts
#EXTINF:10,
Exemplary_B_segment-02.ts
And let’s also say that we know that a particular
viewer would be thrilled to watch a combination of
both videos, with Video B running first and Video
A running second:
Superb Video
#EXTM3U
#EXT-X-MEDIA-SEQUENCE:0
#EXT-X-TARGETDURATION:10
#EXTINF:10,
Exemplary_B_segment-01.ts
#EXTINF:10,
Exemplary_B_segment-02.ts
#EXT-X-DISCONTINUITY
#EXTINF:10,
Exemplary_A_segment-01.ts
#EXTINF:10,
Exemplary_A_segment-02.ts
Instantly, without creating any permanent assets
that need to be stored on origin, and without
involving an editor to create a new asset, we
have generated a new video for the user that
begins with Video B followed by Video A. As
if that wasn’t cool enough, the video will play
seamlessly as though it was a single video.
You may have noticed a small addition to the m3u8, the "Discontinuity Flag":
#EXT-X-DISCONTINUITY
Placing this tag in the m3u8 tells the player to
expect the next video segment to be a different
resolution or have a different audio profile than
the last. If the videos are all encoded with the
same resolution, codecs, and profiles, then this tag
can be left out.
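The splice above can be generated mechanically from two segment lists; a sketch, mirroring the example's 10-second segments:

```python
def concat_playlists(first, second, target_duration=10):
    """Splice two lists of (duration, filename) segments into one VOD
    playlist, inserting EXT-X-DISCONTINUITY at the join in case the
    clips differ in resolution or audio profile."""
    lines = ["#EXTM3U",
             "#EXT-X-MEDIA-SEQUENCE:0",
             f"#EXT-X-TARGETDURATION:{target_duration}"]
    for dur, name in first:
        lines += [f"#EXTINF:{dur},", name]
    lines.append("#EXT-X-DISCONTINUITY")  # omit if clips share encoding settings
    for dur, name in second:
        lines += [f"#EXTINF:{dur},", name]
    return "\n".join(lines)

b = [(10, "Exemplary_B_segment-01.ts"), (10, "Exemplary_B_segment-02.ts")]
a = [(10, "Exemplary_A_segment-01.ts"), (10, "Exemplary_A_segment-02.ts")]
superb = concat_playlists(b, a)  # Superb Video: B first, then A
```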
EXTENDING THE NEW MODEL
The heavy lifting for making a video platform
capable of delivering on-the-fly, custom playback
experiences is to treat the m3u8 manifest not
as a fixed asset, but as something that needs to
be generated per request. That means that the
backend must be aware of the location of every
segment of video, the total number of segments
per item, and the length of each segment.
There are ways to make this simpler. For example, by naming the files consistently, only the base filename needs to be known for all of the segments, and the segment iteration can be handled programmatically. It can be assumed that
all segments except the final segment will be of
the same target duration, so only the duration
of the final segment needs to be stored. So, for
a single video file with many video segments,
all that needs to be stored is base path, base
filename, number of segments, average segment
length, and length of the last segment.
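A sketch of regenerating a playlist from only those five stored fields; the naming pattern and field names are assumptions:

```python
def playlist_from_record(rec):
    """Rebuild a full segment playlist from five stored fields: base path,
    base filename, segment count, average segment length, and final
    segment length. Assumes segments are named '<base_name>-01.ts', ..."""
    lines = ["#EXTM3U",
             "#EXT-X-MEDIA-SEQUENCE:0",
             f"#EXT-X-TARGETDURATION:{rec['avg_len']}"]
    for i in range(1, rec["count"] + 1):
        # Only the final segment is allowed to have a different length.
        dur = rec["last_len"] if i == rec["count"] else rec["avg_len"]
        lines.append(f"#EXTINF:{dur},")
        lines.append(f"{rec['base_path']}/{rec['base_name']}-{i:02d}.ts")
    lines.append("#EXT-X-ENDLIST")
    return "\n".join(lines)
```

With this record in the backend, any slice or combination of segments can be emitted per request instead of stored on origin.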
By considering even long-form titles to be
a combination of scenes, or even further, by
considering scenes to be a combination of shots,
there is an incredible amount of power that can be
unlocked through dynamic manifest generation.
If planned for and built early, the architecture of the delivery platform can achieve a great deal of flexibility without a subsequent increase in operational or infrastructure costs.
CONTACT
sales@zencoder.com
zencoder.com