Sistemi Multimediali - DIS 2011
5.1 Types of Video Signals
Component video
• Component video: Higher‐end video systems make use of three separate
video signals for the red, green, and blue image planes. Each color channel
is sent as a separate video signal.
(a) Most computer systems use component video, with separate signals for the R, G,
and B components.
(b) For any color separation scheme, component video gives the best color
reproduction, since there is no "crosstalk" between the three channels.
(c) This is not the case for S-Video or composite video, discussed next.
Component video, however, requires more bandwidth and good
synchronization of the three components.
Li & Drew
Composite Video — 1 Signal
• Composite video: color ("chrominance") and intensity ("luminance") signals are
mixed into a single carrier wave.
a) Chrominance is a composition of two color components (I and Q, or U and V).
b) In NTSC TV, e.g., I and Q are combined into a chroma signal, and a color subcarrier is then
employed to put the chroma signal at the high-frequency end of the band shared with the
luminance signal.
c) The chrominance and luminance components can be separated at the receiver end, and then
the two color components can be further recovered.
d) When connecting to TVs or VCRs, composite video uses only one wire; the video color signals
are mixed, not sent separately. The audio and sync signals are additions to this one signal.
• Since color and intensity are wrapped into the same signal, some interference
between the luminance and chrominance signals is inevitable.
S‐Video — 2 Signals
• S-Video (separated video, or super-video, e.g., in S-VHS): as a compromise,
it uses two wires, one for luminance and another for a composite
chrominance signal.
• As a result, there is less crosstalk between the color information and the
crucial gray-scale information.
• The reason for placing luminance into its own part of the signal is that
black-and-white information is most crucial for visual perception.
– In fact, humans are able to differentiate spatial resolution in grayscale images
with much higher acuity than for the color part of color images.
– As a result, we can send less accurate color information than must be sent for
intensity information — we can only see fairly large blobs of color, so it makes
sense to send less color detail.
5.2 Analog Video
• An analog signal f(t) samples a time-varying image. So-called
"progressive" scanning traces through a complete picture (a frame)
row-wise for each time interval.
• In TV, and in some monitors and multimedia standards as well,
another system, called "interlaced" scanning, is used:
a) The odd-numbered lines are traced first, and then the even-numbered
lines are traced. This results in "odd" and "even" fields — two fields
make up one frame.
b) In fact, the odd lines (starting from 1) end up at the middle of a line
at the end of the odd field, and the even scan starts at a half-way point.
• Table 5.2 gives a comparison of the three major analog
broadcast TV systems.

Table 5.2: Comparison of Analog Broadcast TV Systems

            Frame Rate   # of Scan   Total Channel    Bandwidth Allocation (MHz)
TV System   (fps)        Lines       Width (MHz)      Y     I or U   Q or V
NTSC        29.97        525         6.0              4.2   1.6      0.6
PAL         25           625         8.0              5.5   1.8      1.8
SECAM       25           625         8.0              6.0   2.0      2.0
5.3 Digital Video
• The advantages of digital representation for video are many.
For example:
(a) Video can be stored on digital devices or in memory, ready to
be processed (noise removal, cut and paste, etc.), and
integrated into various multimedia applications;
(b) Direct access is possible, which makes nonlinear video editing
achievable as a simple, rather than a complex, task;
(c) Repeated recording does not degrade image quality;
(d) Ease of encryption and better tolerance to channel noise.
CCIR Standards for Digital Video
• CCIR is the Consultative Committee for
International Radio, and one of the most
important standards it has produced is
CCIR-601, for component digital video.
– This standard has since become standard ITU-R-601,
an international standard for professional
video applications
— adopted by certain digital video formats, including the
popular DV video.
HDTV (High Definition TV)
• The main thrust of HDTV (High Definition TV) is not to increase the
"definition" in each unit area, but rather to increase the visual field,
especially in its width.
(a) The first generation of HDTV was based on an analog technology developed
by Sony and NHK in Japan in the late 1970s.
(b) MUSE (MUltiple sub-Nyquist Sampling Encoding) was an improved NHK HDTV
with hybrid analog/digital technologies that was put in use in the 1990s. It has
1,125 scan lines, interlaced (60 fields per second), and a 16:9 aspect ratio.
(c) Since uncompressed HDTV will easily demand more than 20 MHz bandwidth,
which will not fit in the current 6 MHz or 8 MHz channels, various
compression techniques are being investigated.
(d) It is also anticipated that high-quality HDTV signals will be transmitted using
more than one channel, even after compression.
• A brief history of HDTV evolution:
(a) In 1987, the FCC decided that HDTV standards must be compatible with the
existing NTSC standard and be confined to the existing VHF (Very High
Frequency) and UHF (Ultra High Frequency) bands.
(b) In 1990, the FCC announced a very different initiative, i.e., its preference for a
full-resolution HDTV, and it was decided that HDTV would be simultaneously
broadcast with the existing NTSC TV and eventually replace it.
(c) Witnessing a boom of proposals for digital HDTV, the FCC made a key decision
to go all-digital in 1993. A "grand alliance" was formed that included four main
proposals, by General Instruments, MIT, Zenith, and AT&T, and by Thomson,
Philips, Sarnoff, and others.
(d) This eventually led to the formation of the ATSC (Advanced Television Systems
Committee) — responsible for the standard for TV broadcasting of HDTV.
(e) In 1995 the U.S. FCC Advisory Committee on Advanced Television Service
recommended that the ATSC Digital Television Standard be adopted.
• The standard supports the video scanning formats shown in
Table 5.4. In the table, "I" means interlaced scan and "P"
means progressive (non-interlaced) scan.

Table 5.4: Advanced Digital TV formats supported by ATSC

# of Active       # of Active
Pixels per Line   Lines         Aspect Ratio   Picture Rate
1,920             1,080         16:9           60I 30P 24P
1,280             720           16:9           60P 30P 24P
704               480           16:9 & 4:3     60I 60P 30P 24P
640               480           4:3            60I 60P 30P 24P
• For video, MPEG-2 is chosen as the compression
standard. For audio, AC-3 is the standard. It supports
the so-called 5.1-channel Dolby surround sound, i.e.,
five surround channels plus a subwoofer channel.
• The salient differences between conventional TV and
HDTV:
(a) HDTV has a much wider aspect ratio of 16:9 instead of
4:3.
(b) HDTV moves toward progressive (non-interlaced) scan.
The rationale is that interlacing introduces serrated edges
to moving objects and flickers along horizontal edges.
• The FCC planned to replace all analog
broadcast services with digital TV broadcasting by
the year 2009. The services provided include:
– SDTV (Standard Definition TV): the current NTSC TV
or higher.
– EDTV (Enhanced Definition TV): 480 active lines or
higher, i.e., the third and fourth rows in Table 5.4.
– HDTV (High Definition TV): 720 active lines or higher.
6.1 Digitization of Sound
What is Sound?
• Sound is a wave phenomenon like light, but it is macroscopic
and involves molecules of air being compressed and
expanded under the action of some physical device.
(a) For example, a speaker in an audio system vibrates back and
forth and produces a longitudinal pressure wave that we
perceive as sound.
(b) Since sound is a pressure wave, it takes on continuous values,
as opposed to digitized ones.
(c) Even though such pressure waves are
longitudinal, they still have ordinary wave
properties and behaviors, such as reflection
(bouncing), refraction (change of angle when
entering a medium with a different density),
and diffraction (bending around an obstacle).
(d) If we wish to use a digital version of sound
waves, we must form digitized representations
of audio information.
Digitization
• Digitization means conversion to a stream of
numbers — preferably integers, for efficiency.
• Fig. 6.1 shows the 1-dimensional nature of
sound: amplitude values depend on a 1D
variable, time. (Note that images depend
instead on a 2D set of variables, x and y.)
• The graph in Fig. 6.1 has to be made digital in both time and
amplitude. To digitize, the signal must be sampled in each
dimension: in time, and in amplitude.
(a) Sampling means measuring the quantity we are interested in, usually
at evenly spaced intervals.
(b) The first kind of sampling — using measurements only at evenly spaced
time intervals — is simply called sampling. The rate at which it is
performed is called the sampling frequency (see Fig. 6.2(a)).
(c) For audio, typical sampling rates are from 8 kHz (8,000 samples per
second) to 48 kHz. This range is determined by the Nyquist theorem,
discussed later.
(d) Sampling in the amplitude or voltage dimension is called
quantization. Fig. 6.2(b) shows this kind of sampling.
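The two discretization steps can be sketched in a few lines of Python (a toy illustration; the function and parameter names are ours, not from the text):

```python
import math

def sample_and_quantize(freq_hz, sample_rate_hz, n_samples, bits):
    """Sample a unit-amplitude sine at evenly spaced time intervals
    (sampling), then map each amplitude to a signed integer with the
    given number of bits (quantization)."""
    levels = 2 ** (bits - 1) - 1          # e.g. 127 for 8 bits
    pcm = []
    for n in range(n_samples):
        t = n / sample_rate_hz            # sampling: discrete time
        x = math.sin(2 * math.pi * freq_hz * t)
        pcm.append(round(x * levels))     # quantization: discrete amplitude
    return pcm

# a 1 kHz tone sampled at 8 kHz and quantized to 8 bits
pcm = sample_and_quantize(1000, 8000, 80, 8)
```

Both steps lose information: sampling discards what happens between the chosen instants, and quantization rounds each amplitude to the nearest representable level.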
Signal to Noise Ratio (SNR)
• The ratio of the power of the correct signal to the power of the noise is
called the signal-to-noise ratio (SNR) — a measure of the
quality of the signal.
• The SNR is usually measured in decibels (dB), where 1 dB is
a tenth of a bel. The SNR value, in units of dB, is defined in
terms of base-10 logarithms of squared voltages, as
follows:

    SNR = 10 log10 (V_signal^2 / V_noise^2) = 20 log10 (V_signal / V_noise)    (6.2)
a) The power in a signal is proportional to the
square of the voltage. For example, if the
signal voltage V_signal is 10 times the noise voltage,
then the SNR is 20 × log10(10) = 20 dB.
b) In terms of power, if the power from ten
violins is ten times that from one violin
playing, then the ratio of power is 10 dB, or
1 B.
c) To remember: Power — 10; Signal Voltage — 20.
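The 10-vs-20 rule follows directly from Eq. (6.2); a quick check in Python (helper name is ours):

```python
import math

def snr_db(v_signal, v_noise):
    """Eq. (6.2): power goes as voltage squared, so 10*log10 of the
    power ratio equals 20*log10 of the voltage ratio."""
    return 20 * math.log10(v_signal / v_noise)
```

A signal voltage 10 times the noise voltage gives 20 dB, while a power ratio of 10 (voltage ratio of sqrt(10)) gives 10 dB, i.e. 1 B.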
Audio Filtering
• Prior to sampling and A/D conversion, the audio signal is usually filtered
to remove unwanted frequencies. The frequencies kept depend on the
application:
(a) For speech, typically the range from 50 Hz to 10 kHz is retained; other frequencies
are blocked by the use of a band-pass filter that screens out lower and higher
frequencies.
(b) An audio music signal will typically contain from about 20 Hz up to 20 kHz.
(c) At the D/A converter end, high frequencies may reappear in the output —
because of sampling and then quantization, the smooth input signal is replaced by
a series of step functions containing all possible frequencies.
(d) So at the decoder side, a lowpass filter is used after the D/A circuit.
Audio Quality vs. Data Rate
• The uncompressed data rate increases as more bits are used for
quantization. Stereo doubles the bandwidth needed to transmit a digital
audio signal.

Table 6.2: Data rate and bandwidth in sample audio applications

            Sample Rate   Bits per   Mono /       Data Rate (uncompressed)   Frequency Band
Quality     (kHz)         Sample     Stereo       (kB/sec)                   (kHz)
Telephone   8             8          Mono         8                          0.200-3.4
AM Radio    11.025        8          Mono         11.0                       0.1-5.5
FM Radio    22.05         16         Stereo       88.2                       0.02-11
CD          44.1          16         Stereo       176.4                      0.005-20
DAT         48            16         Stereo       192.0                      0.005-20
DVD Audio   192 (max)     24 (max)   6 channels   1,200 (max)                0-96 (max)
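The uncompressed rates in Table 6.2 are simply sample rate × bits per sample × channels; a sketch (here 1 kB means 1000 bytes, matching the table's figures; the function name is ours):

```python
def pcm_data_rate_kb(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM data rate in kB/sec (1 kB = 1000 bytes)."""
    return sample_rate_hz * bits_per_sample * channels / 8 / 1000

cd_rate = pcm_data_rate_kb(44100, 16, 2)   # CD: 44.1 kHz, 16 bits, stereo
```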
6.2 MIDI: Musical Instrument Digital Interface
• Use the sound card's default sounds: use a simple
scripting language and hardware setup called MIDI.
• MIDI Overview
(a) MIDI is a scripting language — it codes "events" that stand for
the production of sounds. E.g., a MIDI event might include
values for the pitch of a single note, its duration, and its
volume.
(b) MIDI is a standard adopted by the electronic music industry for
controlling devices, such as synthesizers and sound cards, that
produce music.
(c) The MIDI standard is supported by most
synthesizers, so sounds created on one
synthesizer can be played and manipulated on
another synthesizer and sound reasonably close.
(d) Computers must have a special MIDI interface,
but this is incorporated into most sound cards.
The sound card must also have both D/A and A/D
converters.
MIDI Concepts
• MIDI channels are used to separate messages.
(a) There are 16 channels, numbered from 0 to 15. The
channel forms the last 4 bits (the least significant bits) of
the message.
(b) Usually a channel is associated with a particular
instrument: e.g., channel 1 is the piano, channel 10 is the
drums, etc.
(c) Nevertheless, one can switch instruments midstream, if
desired, and associate another instrument with any
channel.
• System messages
(a) There are several other types of messages, e.g., a general message
for all instruments indicating a change in tuning or timing.
(b) If the first 4 bits are all 1s, then the message is
interpreted as a system common message.
• The way a synthetic musical instrument responds to a
MIDI message is usually by simply ignoring any "play
sound" message that is not for its channel.
– If several messages are for its channel, then the instrument
responds, provided it is multi-voice, i.e., can play more
than a single note at once.
• It is easy to confuse the term voice with the term timbre — the
latter is MIDI terminology for just what instrument is trying to
be emulated, e.g., a piano as opposed to a violin: it is the quality of
the sound.
(a) An instrument (or sound card) that is multi-timbral is one that is
capable of playing many different sounds at the same time, e.g., piano,
brass, drums, etc.
(b) On the other hand, the term voice, while sometimes used by
musicians to mean the same thing as timbre, is used in MIDI to mean
every different timbre and pitch that the tone module can produce at
the same time.
• Different timbres are produced digitally by using a patch — the set
of control settings that define a particular timbre. Patches are
often organized into databases, called banks.
Hardware Aspects of MIDI
• The MIDI hardware setup consists of a 31.25 kbps serial
connection. Usually, MIDI-capable units are either
input devices or output devices, not both.
• A traditional synthesizer is shown in Fig. 6.10.

Fig. 6.10: A MIDI synthesizer
• The physical MIDI ports consist of 5-pin connectors for
IN and OUT, as well as a third connector called THRU.
(a) MIDI communication is half-duplex.
(b) MIDI IN is the connector via which the device receives all
MIDI data.
(c) MIDI OUT is the connector through which the device
transmits all the MIDI data it generates itself.
(d) MIDI THRU is the connector by which the device echoes
the data it receives from MIDI IN. Note that it is only the
MIDI IN data that is echoed by MIDI THRU — all the data
generated by the device itself is sent via MIDI OUT.
Structure of MIDI Messages
• MIDI messages can be classified into two types: channel
messages and system messages, as in Fig. 6.12:
Fig. 6.12: MIDI message taxonomy
• A. Channel messages: can have up to 3 bytes:
a) The first byte is the status byte (the opcode, as it were); it has its most significant bit set to 1.
b) The 4 low-order bits identify which channel this message belongs to (allowing 16 possible channels).
c) The 3 remaining bits hold the message type. For a data byte, the most significant bit is set to 0.
• A.1. Voice messages:
a) This type of channel message controls a voice, i.e., sends information specifying which note to
play or to turn off, and encodes key pressure.
b) Voice messages are also used to specify controller effects such as sustain, vibrato, tremolo, and
the pitch wheel.
c) Table 6.3 lists these operations.
Table 6.3: MIDI voice messages

Voice Message        Status Byte   Data Byte1        Data Byte2
Note Off             &H8n          Key number        Note Off velocity
Note On              &H9n          Key number        Note On velocity
Poly. Key Pressure   &HAn          Key number        Amount
Control Change       &HBn          Controller num.   Controller value
Program Change       &HCn          Program number    None
Channel Pressure     &HDn          Pressure value    None
Pitch Bend           &HEn          MSB               LSB

(** &H indicates hexadecimal, and 'n' in the status byte hex
value stands for a channel number. All values are in 0..127,
except Controller number, which is in 0..120.)
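The status-byte layout of Table 6.3 can be decoded with simple bit operations (a sketch; the function name is ours):

```python
OPCODES = {0x8: "Note Off", 0x9: "Note On", 0xA: "Poly. Key Pressure",
           0xB: "Control Change", 0xC: "Program Change",
           0xD: "Channel Pressure", 0xE: "Pitch Bend"}

def parse_status_byte(status):
    """Split a channel-message status byte: the MSB must be 1, the
    high nibble is the message type, the low 4 bits are the channel."""
    assert status & 0x80, "status bytes have the most significant bit set"
    return OPCODES[status >> 4], status & 0x0F
```

For example, `parse_status_byte(0x93)` identifies a Note On message addressed to channel 3.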
General MIDI
• General MIDI is a scheme for standardizing the assignment
of instruments to patch numbers.
a) A standard percussion map specifies 47 percussion sounds.
b) Where a "note" appears on a musical score determines which percussion instrument is being
struck: a bongo drum, a cymbal, etc.
c) Other requirements for General MIDI compatibility: the MIDI device must support all 16 channels; a
device must be multitimbral (i.e., each channel can play a different instrument/program); a
device must be polyphonic (i.e., each channel is able to play many voices); and there must be
a minimum of 24 dynamically allocated voices.
• General MIDI Level 2: an extended General MIDI has recently been defined, with a
standard .smf ("Standard MIDI File") format defined — including extra
character information, such as karaoke lyrics.
MIDI to WAV Conversion
• Some programs, such as early versions of
Premiere, cannot include .mid files — instead,
they insist on .wav format files.
a) Various shareware programs exist for approximating
a reasonable conversion between MIDI and WAV
formats.
b) These programs essentially consist of large lookup
files that try to substitute pre-defined or shifted WAV
output for MIDI messages, with inconsistent success.
7.1 Introduction
• Compression: the process of coding that will
effectively reduce the total number of bits
needed to represent certain information.

Fig. 7.1: A General Data Compression Scheme.
Introduction (cont'd)
• If the compression and decompression processes
induce no information loss, then the compression
scheme is lossless; otherwise, it is lossy.
• Compression ratio:

    compression ratio = B0 / B1    (7.1)

B0 – number of bits before compression
B1 – number of bits after compression
7.2 Basics of Information Theory
• The entropy η of an information source with alphabet S =
{s1, s2, . . . , sn} is:

    η = H(S) = Σ_{i=1..n} p_i log2 (1/p_i)    (7.2)
             = − Σ_{i=1..n} p_i log2 p_i      (7.3)

p_i – probability that symbol s_i will occur in S.
log2 (1/p_i) – indicates the amount of information (self-information
as defined by Shannon) contained in s_i, which
corresponds to the number of bits needed to encode s_i.
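Eq. (7.3) translates directly into code (a minimal sketch; the function name is ours):

```python
import math

def entropy(probs):
    """Eq. (7.3): H(S) = -sum_i p_i log2 p_i.  Symbols with zero
    probability contribute nothing to the sum."""
    return -sum(p * math.log2(p) for p in probs if p > 0)
```

A uniform source over 256 symbols gives log2 256 = 8 bits per symbol, matching Eq. (7.4) on the next slide; a source that always emits the same symbol has entropy 0.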
Distribution of Gray-Level Intensities

Fig. 7.2: Histograms for Two Gray-level Images.

• Fig. 7.2(a) shows the histogram of an image with a uniform distribution of
gray-level intensities, i.e., p_i = 1/256 for all i. Hence, the entropy of this image
is:

    log2 256 = 8    (7.4)

• Fig. 7.2(b) shows the histogram of an image with two possible values. Its
entropy is 0.92.
Entropy and Code Length
• As can be seen in Eq. (7.3), the entropy η is a weighted sum
of the terms log2 (1/p_i); hence it represents the average amount of
information contained per symbol in the source S.
• The entropy η specifies the lower bound for the average
number of bits needed to code each symbol in S, i.e.,

    η ≤ l̄    (7.5)

l̄ – the average length (measured in bits) of the codewords
produced by the encoder.
7.3 Run-Length Coding
• Memoryless source: an information source that is
independently distributed. Namely, the value of the
current symbol does not depend on the values of the
previously appeared symbols.
• Instead of assuming a memoryless source, Run-Length Coding
(RLC) exploits memory present in the information source.
• Rationale for RLC: if the information source has the
property that symbols tend to form continuous groups,
then such a symbol and the length of the group can be
coded.
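A minimal run-length coder illustrating this rationale (a sketch; the function names are ours):

```python
def rle_encode(symbols):
    """Collapse each run of identical symbols into a (symbol, length) pair."""
    runs = []
    for s in symbols:
        if runs and runs[-1][0] == s:
            runs[-1] = (s, runs[-1][1] + 1)   # extend the current run
        else:
            runs.append((s, 1))               # start a new run
    return runs

def rle_decode(runs):
    """Inverse of rle_encode: expand each pair back into a run."""
    return "".join(s * n for s, n in runs)
```

The coding is lossless: decoding the runs reproduces the original sequence exactly, and it pays off only when runs are long; a source with no repeats would actually grow.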
7.4 Variable-Length Coding (VLC)
Shannon-Fano Algorithm — a top-down approach
1. Sort the symbols according to the frequency count of their
occurrences.
2. Recursively divide the symbols into two parts, each with
approximately the same number of counts, until all parts contain
only one symbol.

An example: coding of "HELLO"

Symbol   H   E   L   O
Count    1   1   2   1

Frequency count of the symbols in "HELLO".
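The two steps above, sketched in Python. The "approximately equal counts" split is a judgment call when counts tie; this version (names ours) picks the split point that minimizes the count imbalance:

```python
def shannon_fano(counts):
    """Top-down Shannon-Fano coding for a dict {symbol: count}."""
    items = sorted(counts.items(), key=lambda kv: -kv[1])  # step 1: sort

    codes = {}
    def split(part, prefix):
        if len(part) == 1:
            codes[part[0][0]] = prefix or "0"
            return
        total = sum(c for _, c in part)
        running, best_diff, cut = 0, None, 1
        for k in range(1, len(part)):          # step 2: most even split
            running += part[k - 1][1]
            diff = abs(2 * running - total)
            if best_diff is None or diff < best_diff:
                best_diff, cut = diff, k
        split(part[:cut], prefix + "0")
        split(part[cut:], prefix + "1")

    split(items, "")
    return codes

codes = shannon_fano({"H": 1, "E": 1, "L": 2, "O": 1})
```

For "HELLO" this yields a 1-bit code for L and longer codes for the rarer symbols, 10 bits in total for the whole word.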
Huffman Coding
ALGORITHM 7.1 Huffman Coding Algorithm — a bottom-up approach
1. Initialization: put all symbols on a list sorted according to their frequency counts.
2. Repeat until the list has only one symbol left:
(1) From the list, pick the two symbols with the lowest frequency counts. Form a Huffman subtree
that has these two symbols as child nodes, and create a parent node.
(2) Assign the sum of the children's frequency counts to the parent and insert it into the list such
that the order is maintained.
(3) Delete the children from the list.
3. Assign a codeword for each leaf based on the path from the root.
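Algorithm 7.1 maps naturally onto a priority queue; a sketch using Python's heapq (the tie-breaking counter is our addition, to keep the heap from ever comparing tree nodes):

```python
import heapq
from itertools import count

def huffman_codes(freqs):
    """Bottom-up Huffman coding for a dict {symbol: count}."""
    tiebreak = count()
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)                       # step 1: sorted list
    while len(heap) > 1:                      # step 2: merge two lowest
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))
    codes = {}
    def walk(node, prefix):                   # step 3: root-to-leaf paths
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes({"H": 1, "E": 1, "L": 2, "O": 1})
```

For "HELLO" every symbol ends up with a 2-bit code, again 10 bits in total, consistent with η ≤ l̄ < η + 1 (the entropy of "HELLO" is about 1.92 bits/symbol).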
Huffman Coding (cont'd)
In Fig. 7.5, new symbols P1, P2, P3 are created
to refer to the parent nodes in the Huffman
coding tree. The contents of the list are
illustrated below:

After initialization:   L H E O
After iteration (a):    L P1 H
After iteration (b):    L P2
After iteration (c):    P3
Properties of Huffman Coding
1. Unique prefix property: no Huffman code is a prefix of any other Huffman
code — this precludes any ambiguity in decoding.
2. Optimality: minimum-redundancy code — proved optimal for a given data
model (i.e., a given, accurate, probability distribution):
• The two least frequent symbols will have the same length for their Huffman
codes, differing only in the last bit.
• Symbols that occur more frequently will have shorter Huffman codes than
symbols that occur less frequently.
• The average code length for an information source S is strictly less than η + 1.
Combined with Eq. (7.5), we have:

    η ≤ l̄ < η + 1    (7.6)
7.7 Lossless Image Compression
• Approaches to differential coding of images:
– Given an original image I(x, y), using a simple difference operator
we can define a difference image d(x, y) as follows:

    d(x, y) = I(x, y) − I(x − 1, y)    (7.9)

or use the discrete version of the 2-D Laplacian operator to
define a difference image d(x, y) as

    d(x, y) = 4 I(x, y) − I(x, y − 1) − I(x, y + 1) − I(x + 1, y) − I(x − 1, y)    (7.10)

• Due to the spatial redundancy existing in normal images I, the
difference image d will have a narrower histogram and
hence a smaller entropy, as shown in Fig. 7.9.
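The entropy drop can be demonstrated on a toy ramp image (a sketch; names ours). The first column is stored verbatim so the transform stays invertible:

```python
import math

def difference_image(img):
    """d(x, y) = I(x, y) - I(x-1, y), Eq. (7.9); the first column is
    kept as-is so each row can be rebuilt by a running sum."""
    return [[row[x] - row[x - 1] if x > 0 else row[x]
             for x in range(len(row))] for row in img]

def entropy_of(values):
    """Empirical entropy of a list of values, in bits per symbol."""
    n = len(values)
    hist = {}
    for v in values:
        hist[v] = hist.get(v, 0) + 1
    return -sum(c / n * math.log2(c / n) for c in hist.values())

img = [[2 * y + x for x in range(16)] for y in range(4)]  # smooth ramp
orig = [v for row in img for v in row]
diff = [v for row in difference_image(img) for v in row]
```

The original image uses many distinct gray levels, while the difference image is almost entirely 1s, so its histogram is far narrower and its entropy far smaller.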
Fig. 7.9: Distributions for Original versus Derivative Images. (a,b): original
gray-level image and its partial derivative image; (c,d): histograms for the original
and derivative images.
(This figure uses a commonly employed image called "Barb".)
8.1 Introduction
• Lossless compression algorithms do not deliver
compression ratios that are high enough. Hence,
most multimedia compression algorithms are
lossy.
• What is lossy compression?
– The compressed data is not the same as the original
data, but a close approximation of it.
– It yields a much higher compression ratio than
lossless compression.
8.2 Distortion Measures
• The three most commonly used distortion measures in image compression are:
– mean square error (MSE) σ²:

    σ² = (1/N) Σ_{n=1..N} (x_n − y_n)²    (8.1)

where x_n, y_n, and N are the input data sequence, the reconstructed data sequence, and the length of the
data sequence, respectively.
– signal-to-noise ratio (SNR), in decibel units (dB):

    SNR = 10 log10 (σ_x² / σ_d²)    (8.2)

where σ_x² is the average squared value of the original data sequence and σ_d² is the MSE.
– peak signal-to-noise ratio (PSNR):

    PSNR = 10 log10 (x_peak² / σ_d²)    (8.3)
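Eqs. (8.1) and (8.3) in a few lines of Python (a sketch; names ours):

```python
import math

def mse(x, y):
    """Eq. (8.1): mean square error between input x and reconstruction y."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def psnr_db(x, y, peak=255):
    """Eq. (8.3): peak signal-to-noise ratio in dB, for 8-bit data
    by default (peak value 255)."""
    return 10 * math.log10(peak ** 2 / mse(x, y))
```

A reconstruction that is off by ±1 everywhere has MSE 1, giving a PSNR of 20 log10 255, about 48 dB.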
Spatial Frequency and DCT
• Spatial frequency indicates how many times pixel
values change across an image block.
• The DCT formalizes this notion with a measure of how
much the image contents change as a function of
the number of cycles of a cosine wave per block.
• The role of the DCT is to decompose the original signal
into its DC and AC components; the role of the IDCT is
to reconstruct (re-compose) the signal.
Definition of DCT:
Given an input function f(i, j) over two integer variables i and j
(a piece of an image), the 2D DCT transforms it into a new
function F(u, v), with integers u and v running over the same
range as i and j. The general definition of the transform is:

    F(u, v) = (2 C(u) C(v) / √(MN)) · Σ_{i=0..M−1} Σ_{j=0..N−1} cos((2i+1)uπ / 2M) · cos((2j+1)vπ / 2N) · f(i, j)    (8.15)

where i, u = 0, 1, . . . , M − 1; j, v = 0, 1, . . . , N − 1; and the
constants C(u) and C(v) are determined by

    C(ξ) = √2/2   if ξ = 0,
           1      otherwise.    (8.16)
2D Discrete Cosine Transform (2D DCT):

    F(u, v) = (C(u) C(v) / 4) · Σ_{i=0..7} Σ_{j=0..7} cos((2i+1)uπ / 16) · cos((2j+1)vπ / 16) · f(i, j)    (8.17)

where i, j, u, v = 0, 1, . . . , 7, and the constants C(u) and C(v) are determined
by Eq. (8.16).

2D Inverse Discrete Cosine Transform (2D IDCT):
The inverse function is almost the same, with the roles of f(i, j) and F(u, v)
reversed, except that now C(u) C(v) must stand inside the sums:

    f̃(i, j) = Σ_{u=0..7} Σ_{v=0..7} (C(u) C(v) / 4) · cos((2i+1)uπ / 16) · cos((2j+1)vπ / 16) · F(u, v)    (8.18)

where i, j, u, v = 0, 1, . . . , 7.
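Eq. (8.17) transcribed literally (slow, O(N⁴) per block, but useful for checking; names ours):

```python
import math

def C(xi):
    """Eq. (8.16): sqrt(2)/2 for index 0, else 1."""
    return math.sqrt(2) / 2 if xi == 0 else 1.0

def dct2_8x8(f):
    """Direct 2D DCT of an 8x8 block, straight from Eq. (8.17)."""
    F = [[0.0] * 8 for _ in range(8)]
    for u in range(8):
        for v in range(8):
            s = sum(f[i][j]
                    * math.cos((2 * i + 1) * u * math.pi / 16)
                    * math.cos((2 * j + 1) * v * math.pi / 16)
                    for i in range(8) for j in range(8))
            F[u][v] = C(u) * C(v) / 4 * s
    return F

# a constant block puts all of its energy into the DC coefficient F(0, 0)
F = dct2_8x8([[100] * 8 for _ in range(8)])
```

For the constant block of 100s, F(0, 0) = 800 and every AC coefficient is zero, illustrating the DC/AC decomposition described above.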
The Cosine Basis Functions
• Functions Bp(i) and Bq(i) are orthogonal if

    Σ_i [Bp(i) · Bq(i)] = 0   if p ≠ q    (8.22)

• Functions Bp(i) and Bq(i) are orthonormal if they are orthogonal and

    Σ_i [Bp(i) · Bq(i)] = 1   if p = q    (8.23)

• It can be shown that:

    Σ_{i=0..7} cos((2i+1)pπ / 16) · cos((2i+1)qπ / 16) = 0   if p ≠ q

    Σ_{i=0..7} (C(p)/2) cos((2i+1)pπ / 16) · (C(q)/2) cos((2i+1)qπ / 16) = 1   if p = q
2D Separable Basis
• The 2D DCT can be separated into a sequence of two
1D DCT steps:

    G(i, v) = (1/2) C(v) Σ_{j=0..7} cos((2j+1)vπ / 16) · f(i, j)    (8.24)

    F(u, v) = (1/2) C(u) Σ_{i=0..7} cos((2i+1)uπ / 16) · G(i, v)    (8.25)

• It is straightforward to see that this simple change saves
many arithmetic steps. The number of iterations
required is reduced from 8 × 8 to 8 + 8.
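Eqs. (8.24)-(8.25) as a row pass followed by a column pass, reusing one 8-point routine for both (a sketch; names ours):

```python
import math

def C(xi):
    """Eq. (8.16): sqrt(2)/2 for index 0, else 1."""
    return math.sqrt(2) / 2 if xi == 0 else 1.0

def dct_1d_8(x):
    """One 8-point DCT pass, the common core of Eqs. (8.24) and (8.25)."""
    return [0.5 * C(u) * sum(x[i] * math.cos((2 * i + 1) * u * math.pi / 16)
                             for i in range(8))
            for u in range(8)]

def dct2_separable(f):
    """Row pass (Eq. 8.24) followed by column pass (Eq. 8.25)."""
    G = [dct_1d_8(row) for row in f]                 # G(i, v)
    F = [[0.0] * 8 for _ in range(8)]
    for v in range(8):
        col = dct_1d_8([G[i][v] for i in range(8)])  # transform over i
        for u in range(8):
            F[u][v] = col[u]
    return F

F = dct2_separable([[100] * 8 for _ in range(8)])
```

The result matches the direct 2D form (e.g., DC coefficient 800 for a constant block of 100s), and because the 8x8 DCT with these constants is orthonormal, the transform preserves the block's total energy.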
9.1 The JPEG Standard
• JPEG is an image compression standard that was developed
by the "Joint Photographic Experts Group". JPEG was
formally accepted as an international standard in 1992.
• JPEG is a lossy image compression method. It employs a
transform coding method using the DCT (Discrete Cosine
Transform).
• An image is a function of i and j (or conventionally x and y)
in the spatial domain. The 2D DCT is used as one step in
JPEG in order to yield a frequency response, a
function F(u, v) in the spatial frequency domain, indexed
by two integers u and v.
Observations for JPEG Image Compression
• The effectiveness of the DCT transform coding
method in JPEG relies on 3 major observations:
Observation 1: useful image contents change
relatively slowly across the image, i.e., it is unusual
for intensity values to vary widely several times in
a small area, for example, within an 8×8 image
block.
• Much of the information in an image is repeated,
hence "spatial redundancy".
Observations for JPEG Image Compression (cont'd)
Observation 2: psychophysical experiments suggest that
humans are much less likely to notice the loss of very high
spatial-frequency components than the loss of lower-
frequency components.
• The spatial redundancy can be reduced by largely reducing
the high spatial-frequency contents.
Observation 3: visual acuity (accuracy in distinguishing closely
spaced lines) is much greater for gray ("black and white")
than for color.
• Chroma subsampling (4:2:0) is used in JPEG.
DCT on image blocks
• Each image is divided into 8 × 8 blocks. The 2D
DCT is applied to each block image f(i, j), with
output being the DCT coefficients F(u, v) for each
block.
• Using blocks, however, has the effect of isolating
each block from its neighboring context. This is
why JPEG images look choppy ("blocky") when a
high compression ratio is specified by the user.
Quantization

    F̂(u, v) = round( F(u, v) / Q(u, v) )    (9.1)

• F(u, v) represents a DCT coefficient, Q(u, v) is a "quantization matrix" entry,
and F̂(u, v) represents the quantized DCT coefficient, which
JPEG will use in the succeeding entropy coding.
– The quantization step is the main source of loss in JPEG compression.
– The entries of Q(u, v) tend to have larger values towards the lower right corner.
This aims to introduce more loss at the higher spatial frequencies — a practice
supported by Observations 1 and 2.
– Tables 9.1 and 9.2 show the default Q(u, v) values, obtained from psychophysical
studies with the goal of maximizing the compression ratio while minimizing
perceptual losses in JPEG images.
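Eq. (9.1) and its approximate inverse on the decoder side (a sketch; the uniform Q matrix here is a stand-in for illustration, not one of the JPEG default tables):

```python
def quantize(F, Q):
    """Eq. (9.1): F_hat(u, v) = round(F(u, v) / Q(u, v)) - the lossy step."""
    return [[round(F[u][v] / Q[u][v]) for v in range(8)] for u in range(8)]

def dequantize(F_hat, Q):
    """Decoder side: scale back; the rounding error is not recoverable."""
    return [[F_hat[u][v] * Q[u][v] for v in range(8)] for u in range(8)]

Q = [[16] * 8 for _ in range(8)]   # stand-in quantization matrix
F = [[800 if (u, v) == (0, 0) else 10 for v in range(8)] for u in range(8)]
F_hat = quantize(F, Q)
F_rec = dequantize(F_hat, Q)
```

Coefficients that are exact multiples of their Q entry survive the round trip unchanged (800 → 50 → 800), while the small AC values are rounded (10 → 1 → 16): that rounding is exactly where JPEG loses information.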
9.1.2 Four Commonly Used JPEG Modes
• Sequential mode — the default JPEG mode,
implicitly assumed in the discussions so far.
Each gray-level image or color image
component is encoded in a single left-to-right,
top-to-bottom scan.
• Progressive mode.
• Hierarchical mode.
• Lossless mode.
9.2 The JPEG2000 Standard
• Design Goals:
– To provide a better rate-distortion tradeoff and
improved subjective image quality.
– To provide additional functionalities lacking in the
current JPEG standard.
• The JPEG2000 standard addresses the following
problems:
– Lossless and lossy compression: there is currently
no standard that can provide superior lossless
compression and lossy compression in a single
bitstream.
– Low bit-rate compression: the current JPEG standard
offers excellent rate-distortion performance at mid
and high bit-rates. However, at bit-rates below 0.25
bpp, subjective distortion becomes unacceptable. This
is important if we hope to receive images on our web-
enabled ubiquitous devices, such as web-aware
wristwatches and so on.
– Large images: the new standard will allow image
resolutions greater than 64K by 64K without tiling. It
can handle image sizes up to 2^32 − 1.
– Single decompression architecture: the current JPEG
standard has 44 modes, many of which are
application-specific and not used by the majority of
JPEG decoders.
73. Sistemi Mul+mediali ‐ DIS 2011
– Transmission in Noisy Environments: The new standard will provide improved error resilience for transmission in noisy environments such as wireless networks and the Internet.
– Progressive Transmission: The new standard provides seamless quality and resolution scalability from low to high bit-rate. The target bit-rate and reconstruction resolution need not be known at the time of compression.
– Region of Interest Coding: The new standard allows the specification of Regions of Interest (ROI), which can be coded with higher quality than the rest of the image. One might like to code the face of a speaker with more quality than the surrounding furniture.
– Computer-Generated Imagery: The current JPEG standard is optimized for natural imagery and does not perform well on computer-generated imagery.
– Compound Documents: The new standard offers metadata mechanisms for incorporating additional non-image data as part of the file. This might be useful for including text along with imagery, as one important example.
• In addition, JPEG2000 is able to handle up to 256 channels of information, whereas the current JPEG standard is only able to handle three color channels.
Properties of JPEG2000 Image Compression
• Uses the Embedded Block Coding with Optimized Truncation (EBCOT) algorithm, which partitions each subband (LL, LH, HL, HH) produced by the wavelet transform into small blocks called "code blocks".
• A separate scalable bitstream is generated for each code block, providing improved error resilience.
Fig. 9.7: Code block structure of EBCOT.
Region of Interest Coding in JPEG2000
• Goal:
– Particular regions of the image may contain important information and thus should be coded with better quality than others.
• Usually implemented using the MAXSHIFT method, which scales up the coefficients within the ROI so that they are placed into higher bit-planes.
• During the embedded coding process, the resulting bits are placed in front of the non-ROI part of the image. Therefore, given a reduced bit-rate, the ROI will be decoded and refined before the rest of the image.
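A toy sketch of the MAXSHIFT idea (not the actual JPEG2000 code path): shift ROI coefficients up by s bits, with s chosen so that even the smallest nonzero ROI magnitude lands in a higher bit-plane than the largest background magnitude. Since embedded coding emits bit-planes most-significant first, ROI bits then come out first. The flat coefficient list and helper name are our simplifications.

```python
# Toy illustration of MAXSHIFT: ROI coefficient magnitudes are scaled
# by 2**s so every nonzero ROI coefficient exceeds the largest
# background coefficient. The decoder can identify ROI coefficients
# simply as those >= 2**s and shift them back down.

def maxshift_scale(coeffs, roi_mask):
    """coeffs: list of int coefficient magnitudes;
    roi_mask: list of bools, True where the coefficient is in the ROI."""
    background = [c for c, in_roi in zip(coeffs, roi_mask) if not in_roi]
    max_bg = max(background, default=0)
    s = max_bg.bit_length()            # smallest s with 2**s > max_bg
    scaled = [c << s if in_roi else c
              for c, in_roi in zip(coeffs, roi_mask)]
    return scaled, s
```

Any nonzero ROI coefficient c satisfies c << s >= 2**s > max_bg, so the ROI bit-planes sit strictly above all background bit-planes.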
Fig. 9.13 (Cont'd): Comparison of JPEG and JPEG2000. (b) JPEG (left) and JPEG2000 (right) images compressed at 0.75 bpp. (c) JPEG (left) and JPEG2000 (right) images compressed at 0.25 bpp.
9.3 The JPEG-LS Standard
• JPEG-LS is the current ISO/ITU standard for lossless or "near-lossless" compression of continuous-tone images.
• It is part of a larger ISO effort aimed at better compression of medical images.
• Uses the LOCO-I (LOw COmplexity LOssless COmpression for Images) algorithm proposed by Hewlett-Packard.
• Motivated by the observation that complexity reduction is often more important than the small increases in compression offered by more complex algorithms.
Main Advantage: Low complexity!
10.1 Introduction to Video Compression
• A video consists of a time-ordered sequence of frames, i.e., images.
• An obvious solution to video compression would be predictive coding based on previous frames. Compression proceeds by subtracting images: subtract in time order and code the residual error.
• It can be done even better by searching for just the right parts of the image to subtract from the previous frame.
10.2 Video Compression with Motion Compensation
• Consecutive frames in a video are similar — temporal redundancy exists.
• Temporal redundancy is exploited so that not every frame of the video needs to be coded independently as a new image. The difference between the current frame and other frame(s) in the sequence will be coded — small values and low entropy, good for compression.
• Steps of video compression based on Motion Compensation (MC):
1. Motion Estimation (motion vector search).
2. MC-based Prediction.
3. Derivation of the prediction error, i.e., the difference.
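Why the difference codes so well can be seen in a tiny sketch: two similar "frames" (invented 1-D sample data here) produce a residual with only a couple of distinct small values, hence low entropy.

```python
# Minimal illustration of temporal redundancy: the frame-to-frame
# residual has far smaller values (and lower entropy) than the frames
# themselves. The 1-D "frames" are invented sample data.
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (bits/symbol) of a list of values."""
    counts = Counter(values)
    n = len(values)
    return -sum(c / n * log2(c / n) for c in counts.values())

prev_frame = [100, 102, 104, 110, 120, 130, 128, 125]
curr_frame = [100, 102, 105, 111, 120, 131, 128, 126]  # slightly changed

residual = [c - p for c, p in zip(curr_frame, prev_frame)]
# residual == [0, 0, 1, 1, 0, 1, 0, 1] -- two symbols, 1 bit/symbol,
# versus 3 bits/symbol for the 8 distinct values of curr_frame itself.
```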
Motion Compensation
• Each image is divided into macroblocks of size N × N.
– By default, N = 16 for luminance images. For chrominance images, N = 8 if 4:2:0 chroma subsampling is adopted.
• Motion compensation is performed at the macroblock level.
– The current image frame is referred to as the Target Frame.
– A match is sought between the macroblock in the Target Frame and the most similar macroblock in previous and/or future frame(s) (referred to as Reference frame(s)).
– The displacement of the reference macroblock to the target macroblock is called a motion vector MV.
– Figure 10.1 shows the case of forward prediction, in which the Reference frame is taken to be a previous frame.
10.3 Search for Motion Vectors
• The difference between two macroblocks can then be measured by their Mean Absolute Difference (MAD):

MAD(i, j) = (1/N²) ∑_{k=0}^{N−1} ∑_{l=0}^{N−1} |C(x + k, y + l) − R(x + i + k, y + j + l)|   (10.1)

N — size of the macroblock,
k and l — indices for pixels in the macroblock,
i and j — horizontal and vertical displacements,
C(x + k, y + l) — pixels in the macroblock in the Target frame,
R(x + i + k, y + j + l) — pixels in the macroblock in the Reference frame.
• The goal of the search is to find a vector (i, j) as the motion vector MV = (u, v), such that MAD(i, j) is minimum:

(u, v) = { (i, j) | MAD(i, j) is minimum, i ∈ [−p, p], j ∈ [−p, p] }   (10.2)
Sequential Search
• Sequential search: sequentially search the whole (2p + 1) × (2p + 1) window in the Reference frame (also referred to as Full search).
– A macroblock centered at each of the positions within the window is compared to the macroblock in the Target frame pixel by pixel, and their respective MAD is then derived using Eq. (10.1).
– The vector (i, j) that offers the least MAD is designated as the MV (u, v) for the macroblock in the Target frame.
– The sequential search method is very costly — assuming each pixel comparison requires three operations (subtraction, absolute value, addition), the cost of obtaining a motion vector for a single macroblock is (2p + 1)² · N² · 3, i.e., O(p²N²).
PROCEDURE 10.1 Motion-vector: sequential-search
begin
  min_MAD = LARGE_NUMBER; /* Initialization */
  for i = −p to p
    for j = −p to p
    {
      cur_MAD = MAD(i, j);
      if cur_MAD < min_MAD
      {
        min_MAD = cur_MAD;
        u = i; /* Get the coordinates for MV. */
        v = j;
      }
    }
end
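Procedure 10.1 together with Eq. (10.1) can be sketched as runnable Python; the 2-D-list frame representation, the (row, column) indexing, and the function names are our choices, and the caller must keep the search window inside the Reference frame.

```python
# Runnable sketch of Procedure 10.1 (sequential/full search).
# Frames are 2-D lists of pixel values; (x, y) is the top-left corner
# (row, column) of the target macroblock; p is the search range.

def mad(C, R, x, y, i, j, N):
    """Eq. (10.1): mean absolute difference for displacement (i, j)."""
    total = 0
    for k in range(N):
        for l in range(N):
            total += abs(C[x + k][y + l] - R[x + i + k][y + j + l])
    return total / (N * N)

def sequential_search(C, R, x, y, N, p):
    """Return the MV (u, v) minimizing MAD over the (2p+1) x (2p+1) window."""
    min_mad, u, v = float("inf"), 0, 0
    for i in range(-p, p + 1):
        for j in range(-p, p + 1):
            cur = mad(C, R, x, y, i, j, N)
            if cur < min_mad:
                min_mad, u, v = cur, i, j
    return u, v
```

On a target frame that is an exact translate of the reference, the search recovers the translation (the matching displacement gives MAD = 0).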
2D Logarithmic Search
• Logarithmic search: a cheaper version that is suboptimal but still usually effective.
• The procedure for 2D Logarithmic Search of motion vectors takes several iterations and is akin to a binary search:
– As illustrated in Fig. 10.2, initially only nine locations in the search window are used as seeds for a MAD-based search; they are marked as '1'.
– After the one that yields the minimum MAD is located, the center of the new search region is moved to it and the step-size (offset) is reduced to half.
– In the next iteration, the nine new locations are marked as '2', and so on.
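The iteration above can be sketched as follows. mad_fn(i, j) stands for an evaluation of Eq. (10.1) at candidate displacement (i, j); the initial offset of ⌈p/2⌉ and the first-found tie-breaking are our assumptions, and the result may be suboptimal when the MAD surface is not unimodal.

```python
# Sketch of 2D logarithmic search: repeatedly evaluate nine seed
# positions (center plus 8 neighbors at +/- offset), move the center
# to the best one, and halve the offset until it reaches 1.

def logarithmic_search(mad_fn, p):
    """Return an (often suboptimal) MV within [-p, p] x [-p, p]."""
    ci, cj = 0, 0                      # current search center
    offset = (p + 1) // 2              # initial step size, ceil(p/2)
    while True:
        candidates = [(ci + di, cj + dj)
                      for di in (-offset, 0, offset)
                      for dj in (-offset, 0, offset)
                      if abs(ci + di) <= p and abs(cj + dj) <= p]
        ci, cj = min(candidates, key=lambda c: mad_fn(*c))
        if offset == 1:
            return ci, cj
        offset = (offset + 1) // 2     # halve the step size
```

Because the center (0, 0) is among the first nine seeds and the search only ever moves to a candidate with smaller-or-equal MAD, the result is never worse than the zero-motion candidate.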
Hierarchical Search
• The search can benefit from a hierarchical (multiresolution) approach in which an initial estimation of the motion vector can be obtained from images with a significantly reduced resolution.
• Figure 10.3: a three-level hierarchical search in which the original image is at Level 0, images at Levels 1 and 2 are obtained by down-sampling from the previous levels by a factor of 2, and the initial search is conducted at Level 2. Since the size of the macroblock is smaller and p can also be proportionally reduced, the number of operations required is greatly reduced.
Hierarchical Search (Cont'd)
• Given the estimated motion vector (uk, vk) at Level k, a 3 × 3 neighborhood centered at (2·uk, 2·vk) at Level k − 1 is searched for the refined motion vector.
• The refinement is such that at Level k − 1 the motion vector (uk−1, vk−1) satisfies:

2uk − 1 ≤ uk−1 ≤ 2uk + 1,  2vk − 1 ≤ vk−1 ≤ 2vk + 1

• Let (xk0, yk0) denote the center of the macroblock at Level k in the Target frame. The procedure for hierarchical motion vector search for the macroblock centered at (x00, y00) in the Target frame can be outlined as follows:
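The outlined procedure can be sketched in Python as below, assuming 2 × 2 averaging for the down-sampling, a full search with range p/4 at Level 2, and the ±1 refinement above at Levels 1 and 0. The helper names are ours, and the caller must keep all search windows inside the frames.

```python
# Sketch of a 3-level hierarchical motion-vector search. Frames are
# 2-D lists indexed [row][col]; (x, y) is the top-left corner of the
# N x N target macroblock at Level 0; p is the Level-0 search range.

def downsample(img):
    """Average each 2x2 block into one pixel (factor-2 reduction)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2*r][2*c] + img[2*r][2*c+1] +
              img[2*r+1][2*c] + img[2*r+1][2*c+1]) / 4
             for c in range(w)] for r in range(h)]

def mad(C, R, x, y, i, j, N):
    return sum(abs(C[x+k][y+l] - R[x+i+k][y+j+l])
               for k in range(N) for l in range(N)) / (N * N)

def best_mv(C, R, x, y, N, candidates):
    return min(candidates, key=lambda c: mad(C, R, x, y, c[0], c[1], N))

def hierarchical_search(C0, R0, x, y, N, p):
    # Build the pyramid: Level 0 (full resolution), Levels 1 and 2.
    Cs, Rs = [C0], [R0]
    for _ in range(2):
        Cs.append(downsample(Cs[-1]))
        Rs.append(downsample(Rs[-1]))
    # Full search at Level 2 with range p/4 on the N/4 x N/4 block.
    p2 = p // 4
    cands = [(i, j) for i in range(-p2, p2 + 1) for j in range(-p2, p2 + 1)]
    u, v = best_mv(Cs[2], Rs[2], x // 4, y // 4, N // 4, cands)
    # Refine at Levels 1 and 0: 3x3 neighborhood around (2u, 2v).
    for level in (1, 0):
        f = 2 ** level
        u, v = 2 * u, 2 * v
        cands = [(u + di, v + dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
        u, v = best_mv(Cs[level], Rs[level], x // f, y // f, N // f, cands)
    return u, v
```

The cost is dominated by the small full search at Level 2 plus two 9-candidate refinements, instead of a (2p + 1)² full search at full resolution.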
10.4 H.261
• H.261: an early digital video compression standard; its principle of MC-based compression is retained in all later video compression standards.
– The standard was designed for videophone, video conferencing and other audiovisual services over ISDN.
– The video codec supports bit-rates of p × 64 kbps, where p ranges from 1 to 30 (hence also known as p * 64).
– It requires that the delay of the video encoder be less than 150 msec so that the video can be used for real-time bidirectional video conferencing.
ITU Recommendations & H.261 Video Formats
• H.261 belongs to the following set of ITU recommendations for visual telephony systems:
1. H.221 — Frame structure for an audiovisual channel supporting 64 to 1,920 kbps.
2. H.230 — Frame control signals for audiovisual systems.
3. H.242 — Audiovisual communication protocols.
4. H.261 — Video encoder/decoder for audiovisual services at p × 64 kbps.
5. H.320 — Narrow-band audiovisual terminal equipment for p × 64 kbps transmission.
H.261 Frame Sequence
• Two types of image frames are defined: Intra-frames (I-frames) and Inter-frames (P-frames):
– I-frames are treated as independent images. A transform coding method similar to JPEG is applied within each I-frame, hence "Intra".
– P-frames are not independent: they are coded by a forward predictive coding method (prediction from a previous P-frame is allowed — not just from a previous I-frame).
– Temporal redundancy removal is included in P-frame coding, whereas I-frame coding performs only spatial redundancy removal.
– To avoid propagation of coding errors, an I-frame is usually sent a couple of times in each second of the video.
• Motion vectors in H.261 are always measured in units of full pixels and have a limited range of ±15 pixels, i.e., p = 15.
Intra-frame (I-frame) Coding
Fig. 10.5: I-frame Coding.
• Macroblocks are of size 16 × 16 pixels for the Y frame, and 8 × 8 for the Cb and Cr frames, since 4:2:0 chroma subsampling is employed. A macroblock consists of four Y, one Cb, and one Cr 8 × 8 blocks.
• For each 8 × 8 block a DCT transform is applied; the DCT coefficients then go through quantization, zigzag scan and entropy coding.
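The zigzag scan mentioned above visits the 8 × 8 quantized coefficients along anti-diagonals, alternating direction, so that low-frequency coefficients (and most nonzero values) come first. A minimal sketch of the ordering:

```python
# Generate the zigzag scan order for an n x n coefficient block:
# walk the anti-diagonals r + c = s for s = 0 .. 2n-2, reversing
# direction on even diagonals so the path zigzags.

def zigzag_order(n=8):
    order = []
    for s in range(2 * n - 1):
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        if s % 2 == 0:
            diag.reverse()   # even diagonals run bottom-left to top-right
        order.extend(diag)
    return order
```

In flat (row * 8 + col) indices the scan begins 0, 1, 8, 16, 9, 2, ..., which is the familiar JPEG zigzag pattern; grouping trailing high-frequency zeros together is what makes the subsequent run-length/entropy coding effective.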
Inter-frame (P-frame) Predictive Coding
• Figure 10.6 shows the H.261 P-frame coding scheme based on motion compensation:
– For each macroblock in the Target frame, a motion vector is allocated by one of the search methods discussed earlier.
– After the prediction, a difference macroblock is derived to measure the prediction error.
– Each of these 8 × 8 blocks goes through DCT, quantization, zigzag scan and entropy coding procedures.
• The P-frame coding encodes the difference macroblock (not the Target macroblock itself).
• Sometimes, a good match cannot be found, i.e., the prediction error exceeds a certain acceptable level.
– The MB itself is then encoded (treated as an Intra MB) and in this case it is termed a non-motion-compensated MB.
• For a motion vector, the difference MVD is sent for entropy coding:

MVD = MV_Preceding − MV_Current   (10.3)
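The differential coding of Eq. (10.3) can be sketched as below. Treating (0, 0) as the predecessor of the first motion vector is our assumption for the sketch; the point is that neighboring MVs are similar, so the MVDs are small and entropy-code well.

```python
# Sketch of differential motion-vector coding per Eq. (10.3):
# MVD = MV_preceding - MV_current. The first MV is differenced
# against (0, 0) in this sketch.

def encode_mvds(mvs):
    """mvs: list of (u, v) motion vectors -> list of MVD pairs."""
    prev, mvds = (0, 0), []
    for mv in mvs:
        mvds.append((prev[0] - mv[0], prev[1] - mv[1]))
        prev = mv
    return mvds

def decode_mvds(mvds):
    """Invert Eq. (10.3): MV_current = MV_preceding - MVD."""
    prev, mvs = (0, 0), []
    for d in mvds:
        mv = (prev[0] - d[0], prev[1] - d[1])
        mvs.append(mv)
        prev = mv
    return mvs
```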
11.1 Overview
• MPEG: Moving Pictures Experts Group, established in
1988 for the development of digital video.
• It is appropriately recognized that proprietary interests
need to be maintained within the family of MPEG
standards:
– Accomplished by defining only a compressed bitstream
that implicitly defines the decoder.
– The compression algorithms, and thus the encoders, are
completely up to the manufacturers.
11.2 MPEG-1
• MPEG-1 adopts the CCIR601 digital TV format, also known as SIF (Source Input Format).
• MPEG-1 supports only non-interlaced video. Normally, its picture resolution is:
– 352 × 240 for NTSC video at 30 fps
– 352 × 288 for PAL video at 25 fps
– It uses 4:2:0 chroma subsampling
• The MPEG-1 standard is also referred to as ISO/IEC 11172. It has five parts: 11172-1 Systems, 11172-2 Video, 11172-3 Audio, 11172-4 Conformance, and 11172-5 Software.
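A quick back-of-the-envelope calculation (8-bit samples assumed) shows why compression is needed even at these modest resolutions:

```python
# Raw data rate of 352 x 240 video at 30 fps with 4:2:0 chroma
# subsampling and 8-bit samples. 4:2:0 carries 1.5 samples per pixel
# on average: one Y sample plus quarter-resolution Cb and Cr.

width, height, fps = 352, 240, 30
bits_per_sample = 8

raw_bps = width * height * fps * bits_per_sample * 3 // 2
print(raw_bps)   # 30412800 bits/s, i.e. about 30.4 Mbps uncompressed
```

MPEG-1 was targeted at roughly 1.5 Mbps (the single-speed CD-ROM rate), so the codec must achieve on the order of a 20:1 reduction from this raw rate.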
Motion Compensation in MPEG-1
• Motion Compensation (MC) based video encoding in H.261 works as follows:
– In Motion Estimation (ME), each macroblock (MB) of the Target P-frame is assigned a best matching MB from the previously coded I or P frame — prediction.
– Prediction error: the difference between the MB and its matching MB, sent to DCT and its subsequent encoding steps.
– The prediction is from a previous frame — forward prediction.
Fig 11.1: The Need for Bidirectional Search.
The MB containing part of a ball in the Target frame cannot find a good matching MB in the previous frame because half of the ball was occluded by another object. A match, however, can readily be obtained from the next frame.
Motion Compensation in MPEG-1 (Cont'd)
• MPEG introduces a third frame type — B-frames — and its accompanying bi-directional motion compensation.
• The MC-based B-frame coding idea is illustrated in Fig. 11.2:
– Each MB from a B-frame will have up to two motion vectors (MVs), one from the forward and one from the backward prediction.
– If matching in both directions is successful, then two MVs will be sent and the two corresponding matching MBs are averaged (indicated by '%' in the figure) before comparing to the Target MB for generating the prediction error.
– If an acceptable match can be found in only one of the reference frames, then only one MV and its corresponding MB will be used, from either the forward or backward prediction.
11.3 MPEG-2
• MPEG-2: for higher-quality video at a bit-rate of more than 4 Mbps.
• Defines seven profiles aimed at different applications:
– Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2, Multiview.
– Within each profile, up to four levels are defined (Table 11.5).
– The DVD video specification allows only four display resolutions: 720 × 480, 704 × 480, 352 × 480, and 352 × 240 — a restricted form of the MPEG-2 Main profile at the Main and Low levels.
Table 11.5: Profiles and Levels in MPEG-2

Level       Simple    Main      SNR Scal.  Spat. Scal.  High      4:2:2     Multiview
            Profile   Profile   Profile    Profile      Profile   Profile   Profile
High                  *                                 *
High 1440             *                    *            *
Main        *         *         *                       *         *         *
Low                   *         *

Table 11.6: Four Levels in the Main Profile of MPEG-2

Level       Max Resolution   Max fps   Max pixels/sec   Max coded Data Rate (Mbps)   Application
High        1,920 × 1,152    60        62.7 × 10^6      80                           film production
High 1440   1,440 × 1,152    60        47.0 × 10^6      60                           consumer HDTV
Main        720 × 576        30        10.4 × 10^6      15                           studio TV
Low         352 × 288        30        3.0 × 10^6       4                            consumer tape equiv.
Supporting Interlaced Video
• MPEG-2 must support interlaced video as well, since this is one of the options for digital broadcast TV and HDTV.
• In interlaced video each frame consists of two fields, referred to as the top-field and the bottom-field.
– In a Frame-picture, all scanlines from both fields are interleaved to form a single frame, then divided into 16 × 16 macroblocks and coded using MC.
– If each field is treated as a separate picture, then it is called a Field-picture.
Fig. 11.6: Field pictures and Field-prediction for Field-pictures in MPEG-2. (a) Frame-picture vs. Field-pictures, (b) Field Prediction for Field-pictures.
Five Modes of Predictions
• MPEG-2 defines Frame Prediction and Field Prediction as well as five prediction modes:
1. Frame Prediction for Frame-pictures: identical to MPEG-1 MC-based prediction methods in both P-frames and B-frames.
2. Field Prediction for Field-pictures: a macroblock size of 16 × 16 from Field-pictures is used. For details, see Fig. 11.6(b).
3. Field Prediction for Frame-pictures: the top-field and bottom-field of a Frame-picture are treated separately. Each 16 × 16 macroblock (MB) from the target Frame-picture is split into two 16 × 8 parts, each coming from one field. Field prediction is carried out for these 16 × 8 parts in a manner similar to that shown in Fig. 11.6(b).
4. 16 × 8 MC for Field-pictures: each 16 × 16 macroblock (MB) from the target Field-picture is split into top and bottom 16 × 8 halves. Field prediction is performed on each half. This generates two motion vectors for each 16 × 16 MB in the P-Field-picture, and up to four motion vectors for each MB in the B-Field-picture. This mode is good for finer MC when motion is rapid and irregular.
5. Dual-Prime for P-pictures: first, Field prediction from each previous field with the same parity (top or bottom) is made. Each motion vector mv is then used to derive a calculated motion vector cv in the field with the opposite parity, taking into account the temporal scaling and vertical shift between lines in the top and bottom fields. For each MB the pair mv and cv yields two preliminary predictions. Their prediction errors are averaged and used as the final prediction error. This mode mimics B-picture prediction for P-pictures without adopting backward prediction (and hence with less encoding delay). This is the only mode that can be used for either Frame-pictures or Field-pictures.
Alternate Scan and Field DCT
• Techniques aimed at improving the effectiveness of the DCT on prediction errors, applicable only to Frame-pictures in interlaced videos:
– Due to the nature of interlaced video, consecutive rows in the 8 × 8 blocks are from different fields, so there exists less correlation between them than between alternate rows.
– Alternate scan recognizes the fact that in interlaced video the vertically higher spatial frequency components may have larger magnitudes, and thus allows them to be scanned earlier in the sequence.
• In MPEG-2, Field_DCT can also be used to address the same issue.