2. Introduction
Audio (analog) signals are digitized using PCM, a process that
involves sampling.
Sampling rate >= 2 x (highest frequency component).
Band-limited signal: when the bandwidth of the communication channel
to be used is less than that of the signal, the signal must first be
band-limited.
Speech signal (15 Hz-10 kHz):
Max. frequency component: 10 kHz
Minimum sampling rate: 2 x 10 kHz = 20 ksps
Bits per sample: 12
Bit rate: sampling rate x bits per sample = 20 ksps x 12 = 240 kbps
General audio signal (50 Hz-20 kHz):
Max. frequency component: 20 kHz
Minimum sampling rate: 2 x 20 kHz = 40 ksps
Bits per sample: 16
Bit rate: 40 ksps x 16 = 640 kbps per channel, i.e. 1.28 Mbps for stereo
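The bit rate figures above follow directly from the sampling rule. A minimal sketch of the arithmetic is given here; the function name and the `channels` parameter are illustrative assumptions, and the 1.28 Mbps figure is reproduced by assuming two (stereo) channels.

```python
def pcm_bit_rate(max_freq_hz, bits_per_sample, channels=1):
    """Return (minimum sampling rate in samples/s, bit rate in bits/s)."""
    # Minimum sampling rate is twice the highest frequency component.
    sampling_rate = 2 * max_freq_hz
    return sampling_rate, sampling_rate * bits_per_sample * channels

speech = pcm_bit_rate(10_000, 12)
audio_mono = pcm_bit_rate(20_000, 16)
audio_stereo = pcm_bit_rate(20_000, 16, channels=2)  # assumed stereo

print(speech)        # (20000, 240000)
print(audio_mono)    # (40000, 640000)
print(audio_stereo)  # (40000, 1280000)
```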
3. Why is Audio Compression needed?
In most multimedia applications, the bandwidth of the available
communication channels does not support bit rates as high as 240 kbps
or 1.28 Mbps; only lower bit rates are offered.
There are two possible solutions:
Solution 1: Sample the audio signal at a lower rate (poor choice)
Merit: Simple to implement
Demerits: 1. Quality of the decoded signal is reduced because the
high-frequency components of the original signal are lost
2. Using fewer bits per sample results in higher quantization noise
Solution 2: Use a compression algorithm (preferred)
Gives good perceptual quality
Reduces the bandwidth requirement
The rest of the discussion covers audio compression methods.
4. 1. Differential Pulse Code Modulation (DPCM)
Differential pulse code modulation is a derivative of standard PCM.
It exploits the fact that the range of differences in amplitude between
successive samples of the audio waveform is smaller than the range of
the actual sample amplitudes.
Hence fewer bits are required to represent the difference signals
than with PCM at the same sampling rate.
It reduces the bit rate requirement from 64 kbps to 56 kbps.
6. Operation of DPCM:
Encoder
The previously digitized sample is held in the register (R).
The DPCM signal is computed by subtracting the current register
contents (R0) from the new sample output by the ADC (PCM).
The register value is then updated before transmission.
DPCM = PCM - R0
Decoder
The decoder simply adds the received DPCM value to its previous
register contents (R0) to recover the PCM sample.
R1 = R0 + DPCM
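The two register equations can be sketched as follows. This is a minimal illustration that works on integer samples and assumes the register starts at 0 on both sides; it ignores the quantization of the difference signal, so the round trip here is exact.

```python
def dpcm_encode(pcm_samples):
    """Encode PCM samples as differences from the previous sample."""
    register = 0  # R: previously digitized sample (assumed initial value 0)
    diffs = []
    for pcm in pcm_samples:
        diffs.append(pcm - register)  # DPCM = PCM - R0
        register = pcm                # update R before the next sample
    return diffs

def dpcm_decode(diffs):
    """Rebuild PCM samples by accumulating the received differences."""
    register = 0
    out = []
    for d in diffs:
        register = register + d       # R1 = R0 + DPCM
        out.append(register)
    return out

samples = [100, 104, 109, 111, 108, 102]
encoded = dpcm_encode(samples)
print(encoded)               # [100, 4, 5, 2, -3, -6]
print(dpcm_decode(encoded))  # [100, 104, 109, 111, 108, 102]
```

After the first sample, the differences span a much smaller range than the samples themselves, which is why fewer bits suffice.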
Limitation of DPCM:
Each ADC operation introduces quantization error, and these errors
accumulate in the value stored in the register (R).
So the previous value (R) is only an approximation, and a more accurate
estimate of the previous signal is needed.
7. 2. Third Order Predictive DPCM
To reduce this noise effect, predictive methods are used to obtain a
more accurate estimate of the previous signal. They use not only the
current signal but also varying proportions of a number of the
preceding estimated signals.
The proportions used are known as predictor coefficients.
The difference signal is computed by subtracting varying proportions
of the last three predicted values from the current output of the ADC.
It reduces the bit rate requirement from 64 kbps to 32 kbps.
9. Operation of Third Order Predictive DPCM
Proportions of the contents of R1, R2 and R3 are subtracted from the
PCM value.
The value in R2 is then transferred to R3, the value in R1 to R2, and
the new predicted value goes into R1.
The decoder operates in a similar way, adding the same proportions of
the last three computed PCM signals to the received DPCM signal.
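A minimal sketch of the three-register scheme is shown below. The predictor coefficients (0.5, 0.25, 0.25) are hypothetical values chosen for illustration, and the sketch stores the reconstructed sample in R1 so that encoder and decoder stay in step; a real coder would also quantize the difference signal.

```python
COEFFS = (0.5, 0.25, 0.25)  # hypothetical predictor coefficients for R1, R2, R3

def predict(regs):
    """Weighted sum of the three register values, rounded to an integer."""
    return round(sum(c * r for c, r in zip(COEFFS, regs)))

def predictive_encode(pcm_samples):
    regs = [0, 0, 0]  # R1, R2, R3 (assumed to start at 0)
    diffs = []
    for pcm in pcm_samples:
        diffs.append(pcm - predict(regs))
        regs = [pcm, regs[0], regs[1]]  # R2 -> R3, R1 -> R2, new value -> R1
    return diffs

def predictive_decode(diffs):
    regs = [0, 0, 0]
    out = []
    for d in diffs:
        pcm = predict(regs) + d  # same prediction as the encoder
        out.append(pcm)
        regs = [pcm, regs[0], regs[1]]
    return out

samples = [100, 104, 109, 111, 108, 102]
diffs = predictive_encode(samples)
print(diffs)
print(predictive_decode(diffs) == samples)
```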
10. 3. Adaptive differential PCM (ADPCM)
The first ADPCM international standard is defined in ITU-T
Recommendation G.721.
Bandwidth is saved by varying the number of bits used for the
difference signal depending on its amplitude (fewer bits to encode
smaller difference signals).
It is based on the same principle as DPCM, except that an eighth-order
predictor is used and the number of bits used to quantize each
difference is varied.
This can be either 6 bits (producing 32 kbps), to obtain better quality
output than with third-order DPCM, or 5 bits (producing 16 kbps) if
lower bandwidth is more important.
The second ADPCM international standard is defined in ITU-T
Recommendation G.722.
It gives better sound quality at the cost of added complexity.
The input speech bandwidth is extended to 50 Hz-7 kHz, compared with
3.4 kHz for a standard PCM system.
The wider bandwidth gives higher quality, as needed in video conferencing.
11. G.722 uses subband coding
Prior to sampling, the input signal is passed through two filters:
one passes only signal frequencies in the range 50 Hz-3.5 kHz, and
the other only frequencies in the range 3.5 kHz-7 kHz.
This effectively divides the input signal into two separate
equal-bandwidth signals:
the first known as the lower subband signal, and
the second as the upper subband signal.
Each is then sampled and encoded independently using ADPCM.
The use of two subbands has the advantage that a different bit rate
can be used for each.
The two bitstreams are multiplexed to produce the transmitted signal
in such a way that the decoder in the receiver can divide them
back into two separate streams for decoding.
Operating bit rates are 64, 56, or 48 kbps.
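The idea of splitting a signal into a lower and an upper band that can be coded separately and recombined can be illustrated with a very simple two-tap filter pair (average and difference of adjacent samples). This is only a toy stand-in working on already-sampled data; it is not the filter bank that G.722 specifies, and the function names are assumptions.

```python
def split_subbands(samples):
    """Split an even-length list into lower and upper subbands (half rate each)."""
    lower, upper = [], []
    for a, b in zip(samples[0::2], samples[1::2]):
        lower.append((a + b) / 2)  # average: slowly varying (low-frequency) part
        upper.append((a - b) / 2)  # difference: rapidly varying (high-frequency) part
    return lower, upper

def merge_subbands(lower, upper):
    """Recombine the two subband streams into the original sample sequence."""
    out = []
    for lo, hi in zip(lower, upper):
        out.extend([lo + hi, lo - hi])
    return out

samples = [4, 6, 10, 2]
lower, upper = split_subbands(samples)
print(lower)                         # [5.0, 6.0]
print(upper)                         # [-1.0, 4.0]
print(merge_subbands(lower, upper))  # [4.0, 6.0, 10.0, 2.0]
```

Because each band is a separate stream, an encoder is free to spend a different number of bits on each before multiplexing them.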
13. 4. Adaptive predictive coding
Even higher levels of compression are possible, at higher levels of
complexity.
These can be obtained by also making the predictor coefficients
adaptive.
In practice, the optimum set of predictor coefficients varies
continuously, since it is a function of the characteristics of the
audio signal being digitized.
The optimum set of coefficients is then computed and used to predict
the previous signal more accurately.
This type of compression can reduce the bandwidth requirement to
8 kbps while still giving acceptable perceived quality.
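To illustrate why the best coefficients depend on the signal, the sketch below computes a single least-squares predictor coefficient for each block of samples. Using one coefficient (first order) and a per-block least-squares fit are simplifying assumptions made here for brevity; the source does not say how the coefficients are derived.

```python
def best_first_order_coeff(segment):
    """Least-squares coefficient a that minimizes sum (x[n] - a * x[n-1])^2."""
    num = sum(cur * prev for prev, cur in zip(segment, segment[1:]))
    den = sum(prev * prev for prev in segment[:-1])
    return num / den if den else 0.0

steady_segment = [5, 5, 5, 5]   # each sample equals the previous one
rising_segment = [1, 2, 4, 8]   # each sample is twice the previous one

print(best_first_order_coeff(steady_segment))  # 1.0
print(best_first_order_coeff(rising_segment))  # 2.0
```

The two segments need different coefficients, so a coder that recomputes them per segment predicts more accurately than one with fixed values.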
14. 5. Linear predictive coding
With this coding, the perceptual features of an audio waveform are
first analysed at the source.
These are then quantized and sent, and the destination uses them,
together with a sound synthesizer, to regenerate a sound that is
perceptually comparable with the source audio signal.
With this technique the generated speech can often sound synthetic,
but very high levels of compression can be achieved.
Which perceptual features need to be analysed?
15. In terms of speech, the three features that determine the perception of
a signal by the ear are its:
Pitch: closely related to the frequency of the signal. This is important
since the ear is more sensitive to signals in the range 2-5 kHz.
Period: the duration of the signal.
Loudness: determined by the amount of energy in the signal.
In addition, the origin of the sound is important. These are called vocal
tract excitation parameters:
Voiced sounds: generated through the vocal cords, e.g. the letters m, v
and i.
Unvoiced sounds: produced with the vocal cords open, e.g. the letters f and s.
16. Operation of LPC encoder and decoder
ENCODER:
The input speech waveform is first sampled and quantized at a defined
rate.
A block of digitized samples, known as a segment, is then analysed to
determine the various perceptual parameters of the speech that it
contains.
The output of the encoder is a string of frames, one for each segment.
Each frame contains:
Fields for pitch and loudness
(the period is determined by the sampling rate being used)
Notification of whether the signal is voiced or unvoiced
Set of computed model coefficients
Some LPC encoders use up to 10 sets of previous model coefficients to
predict the output sound (LPC-10) and operate at bit rates as low as
2.4 kbps or even 1.2 kbps.
17. Operation of LPC encoder and decoder (cont.)
DECODER
The speech signal generated by the vocal tract model in the decoder is a
function of:
The present output of the speech synthesizer (as determined by the
current set of model coefficients),
plus a linear combination of the previous sets of model coefficients.
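A common textbook way to realise such a decoder is an all-pole synthesis filter, in which each output sample is the excitation plus a weighted sum of previous output samples. The sketch below uses that formulation as an assumption (it is not taken from the slides); the coefficient values and the pulse-train excitation for voiced sounds are purely illustrative.

```python
def lpc_synthesize(excitation, coeffs, gain=1.0):
    """All-pole synthesis: s[n] = gain * e[n] + sum_k coeffs[k] * s[n-1-k]."""
    out = []
    for n, e in enumerate(excitation):
        s = gain * e
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                s += a * out[n - 1 - k]
        out.append(s)
    return out

def pulse_train(length, pitch_period):
    """Voiced excitation: one unit pulse every pitch_period samples."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]

# A single pulse through a one-coefficient model decays geometrically.
print(lpc_synthesize([1.0, 0.0, 0.0, 0.0], [0.5]))  # [1.0, 0.5, 0.25, 0.125]

# Voiced frame: pitch sets the pulse spacing, gain sets the loudness.
voiced = lpc_synthesize(pulse_train(12, pitch_period=4), [0.5], gain=2.0)
print(voiced[:5])
```

For an unvoiced frame the pulse train would be replaced by a noise-like excitation, which is the role of the voiced/unvoiced flag in each frame.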
APPLICATION
The sound generated at these low bit rates is very synthetic, so LPC
encoders are used primarily in military applications where bandwidth is
all-important.
20. 6. Code-excited LPC (CELP)
The synthesizers used in most LPC decoders are based on a very basic
model of the vocal tract.
CELP coders are intended for applications in which the amount of
bandwidth available is limited but the perceived quality of the speech
must be of an acceptable standard for use in various multimedia
applications.
In the CELP model, instead of treating each digitized segment
independently for encoding purposes, just a limited set of segments is
used, each known as a wave template.
A precomputed set of templates is held by the encoder and the
decoder in what is known as the template codebook.
Each of the individual digitized samples that make up a particular
template in the codebook is differentially encoded.
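The codebook idea can be sketched as a nearest-template search: the encoder sends only the index of the best-matching template, and the decoder looks the template up in its own copy of the codebook. The tiny codebook and the squared-error measure below are illustrative assumptions, not values from any standard.

```python
# Hypothetical shared codebook: both encoder and decoder hold the same list.
CODEBOOK = [
    [0, 0, 0, 0],
    [1, 2, 1, 0],
    [3, 1, -1, -3],
]

def nearest_template(segment, codebook):
    """Return the index of the template with the smallest squared error."""
    def sq_error(template):
        return sum((s - t) ** 2 for s, t in zip(segment, template))
    return min(range(len(codebook)), key=lambda i: sq_error(codebook[i]))

segment = [1, 2, 2, 0]
index = nearest_template(segment, CODEBOOK)  # encoder transmits this index only
decoded = CODEBOOK[index]                    # decoder looks the template up
print(index)    # 1
print(decoded)  # [1, 2, 1, 0]
```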
21. All coders of this type have a delay associated with them, which is
incurred while each block of digitized samples is analysed by the encoder
and the speech is reconstructed at the decoder.
The combined delay is known as the coder's processing delay.
In addition, before the speech samples can be analysed it is necessary to
buffer the block of samples.
The time to accumulate the block of samples is known as the algorithmic
delay.
The coder's delay is an important parameter: in a conventional telephony
application a low-delay coder is required, whereas in an interactive
application a delay of several seconds before the speech starts is
acceptable.
22. Perceptual Coding (PC)
LPC and CELP are used only for telephony applications, and hence for
compression of speech signals.
Perceptual coders are designed for compression of general audio, such as
that associated with a digital television broadcast.
They use a psychoacoustic model, which exploits a number of limitations
of the human ear.
Using this approach, sampled segments of the source audio waveform
are analysed, but only those features that are perceptible to the ear are
transmitted.
E.g. although the human ear is sensitive to signals in the range 15 Hz to
20 kHz, the level of sensitivity to each signal is non-linear; that is, the
ear is more sensitive to some signals than others.
The limitation of the human ear that is exploited is the masking effect.
23. Frequency masking: when multiple signals are present in audio, a
strong signal may reduce the level of sensitivity of the ear to other
signals that are near to it in frequency.
Temporal masking: after the ear hears a loud sound, it takes a short
but finite time before it can hear a quieter sound.
A psychoacoustic model is used to identify those signals that are
influenced by masking; these are then eliminated from the
transmitted signal, and hence compression is achieved.
24. Sensitivity of the ear:
The dynamic range of the ear is defined as the ratio of the loudest sound
it can hear to the quietest sound.
The sensitivity of the ear varies with the frequency of the signal, as
shown in the next slide.
The ear is most sensitive to signals in the range 2-5 kHz; hence the
signals in this band are the quietest the ear is sensitive to.
The vertical axis gives all the other signal amplitudes relative to this
signal (2-5 kHz).
In the figure, although signals A and B have the same relative amplitude,
only signal A would be heard, because it is above the hearing threshold
while B is below it.
26. Frequency Masking
When an audio sound consisting of multiple frequency signals is
present, the sensitivity of the ear changes and varies with the
relative amplitude of the signals.
27. Conclusions from diagram:
Signal B is larger than signal A. This causes the basic sensitivity curve of
the ear to be distorted in the region of signal B.
Signal A will no longer be heard, as it is within the distortion band.
Variation of the frequency masking effect with frequency:
The masking effects at 1, 4, and 8 kHz show that:
The width of the masking curve (the range of frequencies affected)
increases with increasing frequency.
The width of each curve at a particular signal level is known as the
critical bandwidth for that frequency.
For frequencies greater than 500 Hz, the critical bandwidth increases
linearly in multiples of 100 Hz.
29. Temporal masking
After the ear hears a loud sound it takes a further short time before it can
hear a quieter sound.
This is known as temporal masking.
After the loud sound ceases it takes a short period of time for the signal
amplitude to decay.
During this time, signals whose amplitudes are less than the decay
envelope will not be heard and hence need not be transmitted.
In order to exploit this phenomenon, the input audio waveform must be
processed over a time period that is comparable with that associated with
temporal masking.
31. Audio Compression – MPEG Audio
coder
The Moving Picture Experts Group (MPEG) was formed by the
ISO to formulate a set of standards relating to a range of
multimedia applications that involve the use of video with
sound. The coders associated with audio compression that form
part of these standards are known as MPEG audio coders.
32. Why Do We Need International
Standards?
International standardization is conducted to achieve
interoperability.
Only the syntax and the decoder are specified.
The encoder is not standardized, and its optimization is left
to the manufacturer.
Standards provide state-of-the-art technology that is
developed by a group of experts in the field.
They not only solve current problems but also anticipate
future application requirements.
34. MPEG audio coder
The audio input signal is first sampled and quantized using PCM.
The bandwidth available for transmission is divided into a number of
frequency subbands using a bank of analysis filters.
Analysis filter bank:
Maps each set of 32 (time-related) PCM samples into an equivalent set of
32 frequency samples.
Determines the peak amplitude in each subband (consisting of 12 frequency
components), called the scaling factor.
Processing associated with both frequency and temporal masking is carried
out by the psychoacoustic model.
In a basic encoder, the time duration of each sampled segment of the audio
input signal is equal to the time to accumulate 12 successive sets of 32
PCM samples.
The 12 sets of 32 PCM time samples are converted into frequency components
using the DFT.
35. The output of the psychoacoustic model is a set of what are known as
signal-to-mask ratios (SMRs), which indicate the frequency components
whose amplitude is below the audible threshold.
These are used to allocate more bits to the regions of highest
sensitivity than to less sensitive regions.
In an encoder, all the frequency components are carried in a frame.
36. Frame Format:
Header: contains information such as the sampling frequency that has
been used.
SBS: the peak amplitude level in each subband is first quantized using 6
bits, and a further 4 bits are then used to quantize the 12 frequency
components in the subband relative to this level. Collectively this is
called the subband sample (SBS) format.
Ancillary data field: optional, at the end of the frame.
For example, it is used to carry additional coded samples associated with
the surround sound that is present with some digital video broadcasts.
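The subband sample format can be sketched as "one scale value plus 12 small codes". The code below is only an illustration: it assumes the 4 bits apply to each of the 12 components, uses a symmetric set of signed levels (-7 to 7), and leaves the peak level itself unquantized, none of which is specified in the slides.

```python
SCALE_BITS = 6    # bits for the subband peak level (from the text)
SAMPLE_BITS = 4   # bits per frequency component (assumed to apply to each one)
LEVELS = 2 ** (SAMPLE_BITS - 1) - 1  # 7: signed codes in the range -7..7

def quantize_subband(samples):
    """Return (scale, codes): the peak level and each sample relative to it."""
    scale = max(abs(s) for s in samples)
    if scale == 0:
        return 0.0, [0] * len(samples)
    codes = [round(s / scale * LEVELS) for s in samples]
    return scale, codes

def dequantize_subband(scale, codes):
    """Decoder side: turn the codes back into approximate sample values."""
    return [code / LEVELS * scale for code in codes]

def bits_per_subband(num_samples=12):
    return SCALE_BITS + SAMPLE_BITS * num_samples

samples = [0.9, -0.3, 0.1, 0.0, 0.45, -0.9, 0.2, 0.7, -0.6, 0.05, 0.3, -0.15]
scale, codes = quantize_subband(samples)
print(scale, codes)
print(bits_per_subband())  # 54
```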
37. At the decoder, the dequantizers determine the magnitude of
each signal.
The synthesis filters then produce the PCM samples at the decoder.
Various parameters associated with the encoder:
Sampling rate used: 32 ksps
Max. signal frequency component: 16 kHz, so each of the 32 subbands has
BW = 500 Hz.
12 successive sets of 32 PCM samples are used, giving:
Segment length = 12 x 32 = 384 PCM samples, i.e. a time duration of
384 / 32 ksps = 12 ms
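The encoder parameters above are linked by simple arithmetic, worked through here. The variable names are illustrative; the 12 ms segment duration is derived from the stated sampling rate rather than quoted from the slides.

```python
sampling_rate = 32_000   # samples per second (32 ksps)
num_subbands = 32        # one frequency sample per subband per set
sets_per_segment = 12    # successive sets of 32 PCM samples

max_signal_freq = sampling_rate / 2               # highest representable frequency
subband_bandwidth = max_signal_freq / num_subbands
samples_per_segment = sets_per_segment * num_subbands
segment_duration_ms = samples_per_segment / sampling_rate * 1000

print(max_signal_freq)      # 16000.0
print(subband_bandwidth)    # 500.0
print(samples_per_segment)  # 384
print(segment_duration_ms)  # 12.0
```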
38. Summary of MPEG layer 1, 2 and 3 Perceptual Encoders
Layer   Application                                   Compressed bit rate
1       Digital audio cassette                        32-448 kbps
2       Digital audio and video broadcasting          32-192 kbps
3       CD-quality audio over low bit rate channels   64 kbps
40. What is VIDEO?
Video is simply a sequence of digitized pictures. It is also
referred to as moving pictures, and the terms "frame" and "picture" are
used interchangeably.
APPLICATIONS:
Interpersonal: Video Telephony & Video Conferencing
Interactive: access to stored video in various forms
Entertainment: Digital TV & MOD/VOD
Problem with uncompressed Video:
Raw video contains an immense amount of data
Communication and storage capabilities are limited and expensive.
41. Definitions related to VIDEO:
Bit-rate
Information stored/transmitted per unit time
Usually measured in Mbps (Megabits per second)
Resolution
Number of pixels per frame
Ranges from 160x120 to 1920x1080
FPS (frames per second)
Usually 24, 25, 30, 50 or 60
Don’t need more because of limitations of the human eye
42. Video Compression: Why?
Bandwidth reduction:
Application             Resolution    Uncompressed   Compressed
Video conference        352 x 240     30.4 Mbps      64-768 kbps
CD-ROM digital video    352 x 240     60.8 Mbps      1.5-4 Mbps
Broadcast video         720 x 480     248.8 Mbps     3-8 Mbps
HDTV                    1280 x 720    1.33 Gbps      20 Mbps
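The uncompressed figures in the table can be reproduced as width x height x frames per second x bits per pixel. The frame rates and bits per pixel below are not given in the slides; they are assumptions chosen because they match the quoted numbers (12 bits per pixel corresponds to subsampled chrominance, 24 bits to full colour).

```python
def raw_bit_rate(width, height, fps, bits_per_pixel):
    """Uncompressed video bit rate in bits per second."""
    return width * height * fps * bits_per_pixel

# (application, width, height, assumed fps, assumed bits per pixel)
cases = [
    ("Video conference", 352, 240, 30, 12),
    ("CD-ROM digital video", 352, 240, 30, 24),
    ("Broadcast video", 720, 480, 30, 24),
    ("HDTV", 1280, 720, 60, 24),
]

for name, w, h, fps, bpp in cases:
    print(f"{name}: {raw_bit_rate(w, h, fps, bpp) / 1e6:.1f} Mbps")
# Video conference: 30.4 Mbps
# CD-ROM digital video: 60.8 Mbps
# Broadcast video: 248.8 Mbps
# HDTV: 1327.1 Mbps
```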
43. Video Compression Standards:
Standard   Application                                        Bit rate
JPEG       Continuous-tone still-image compression            Variable
H.261      Video telephony and teleconferencing over ISDN     p x 64 kb/s
MPEG-1     Video on digital storage media (CD-ROM)            1.5 Mb/s
MPEG-2     Digital television                                 > 2 Mb/s
H.263      Video telephony over PSTN                          < 33.6 kb/s
MPEG-4     Object-based coding, synthetic content,            Variable
           interactivity
H.264      From low-bitrate coding to HD encoding, HD-DVD,    Variable
           surveillance, video conferencing
45. Spatial Redundancy
Takes advantage of the similarity among most neighbouring pixels
Occurs within a frame
46. Temporal Redundancy
Takes advantage of the similarity between successive frames
Occurs between frames; exploited using motion estimation (ME) and
motion compensation (MC)
47. Motion Estimation (ME): To measure movement between successive
frames.
Motion Compensation (MC): This is the additional information that
must be sent to indicate any small differences between the predicted and
actual positions of the moving segments involved
49. Intracoded frames (I-frames)
I-frames (intracoded frames) are encoded without reference to any
other frames.
Each frame is treated as a separate picture, and the Y, Cr and Cb
matrices are encoded separately using JPEG.
For I-frames the level of compression is small.
They are good for the first frame relating to a new scene in a movie.
I-frames must be repeated at regular intervals, because a frame can be
corrupted during transmission and a lost I-frame would otherwise mean
losing the whole picture.
The number of frames/pictures between successive I-frames is known
as a group of pictures (GOP). Typical values of GOP are N = 3 to 12.
51. Encoding of I-Frame:
RGB to YUV
less information required for YUV (humans are less sensitive to
chrominance)
Macroblocks
Take groups of pixels (16x16)
Discrete Cosine Transform (DCT)
Based on Fourier analysis, where a signal is represented as a sum of
sines and cosines
Concentrates most of the energy in the low-frequency coefficients, so
higher-frequency values are often near zero
Represents the pixels in a block with fewer numbers
Quantization
Reduces the data required for the coefficients
Entropy coding
Lossless compression of the quantized values
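The DCT and quantization steps can be sketched on a single 8 x 8 block. This is a direct, unoptimized 2-D DCT-II written out for clarity; the single uniform quantization step of 16 is an assumption made for illustration (JPEG-style coders use a table with a different step per coefficient).

```python
import math

N = 8  # block size

def dct_2d(block):
    """Direct 2-D DCT-II of an N x N block (orthonormal scaling)."""
    def c(k):
        return math.sqrt(1 / N) if k == 0 else math.sqrt(2 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * total
    return out

def quantize(coeffs, step=16):
    """Divide every coefficient by a uniform step (assumed) and round."""
    return [[round(value / step) for value in row] for row in coeffs]

flat_block = [[128] * N for _ in range(N)]  # a block of uniform brightness
quantized = quantize(dct_2d(flat_block))
print(quantized[0][0])  # 64: all the energy sits in the DC coefficient
print(sum(abs(v) for row in quantized for v in row) - abs(quantized[0][0]))  # 0
```

A block with no detail needs only one non-zero number after the transform, which is what "representing pixels with fewer numbers" refers to.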
52. Encoding of I-Frame (cont.)
Quantization: major reduction; controls 'quality'
Zig-zag scan, run-length coding
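The zig-zag scan and run-length coding steps can be sketched as follows. The example uses a 4 x 4 block to keep the output short, encodes each non-zero value as a (number of preceding zeros, value) pair, and ends with an "EOB" marker; the exact pair format and marker are illustrative assumptions, and the separate handling of the DC coefficient used by JPEG is omitted.

```python
def zigzag_indices(n=8):
    """(row, col) positions of an n x n block in zig-zag order."""
    def key(rc):
        r, c = rc
        diagonal = r + c
        # Odd diagonals run downwards (row increasing), even ones upwards.
        return (diagonal, r if diagonal % 2 else -r)
    return sorted(((r, c) for r in range(n) for c in range(n)), key=key)

def zigzag_scan(block):
    return [block[r][c] for r, c in zigzag_indices(len(block))]

def run_length(values):
    """Encode as (zero_run, value) pairs followed by an end-of-block marker."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the marker
    return pairs

block = [
    [12, 3, 0, 0],
    [2, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]
scanned = zigzag_scan(block)
print(scanned)              # [12, 3, 2, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0]
print(run_length(scanned))  # [(0, 12), (0, 3), (0, 2), (6, 1), 'EOB']
```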
53. Predictive Frame (P-frame)
The encoding of a P-frame is relative to the contents of either a
preceding I-frame or a preceding P-frame.
P-frames are encoded using a combination of motion estimation and motion
compensation.
The accuracy of the prediction is determined by how well any
movement between successive frames is estimated. This is known as
motion estimation.
Since the estimation is not exact, additional information must also be
sent to indicate any small differences between the predicted and actual
positions of the moving segments involved. This is known as motion
compensation.
The number of P-frames between I-frames is limited to avoid error
propagation (any error present in the first P-frame is propagated to the
next).
The number of frames between a P-frame and the immediately preceding I- or
P-frame is called the prediction span (M).
55. Bi-directional Frame (B-frame)
For fast-moving video, e.g. movies, B-frames (bi-directional frames) are
used. Their contents are predicted using both past and future frames.
A B-frame is encoded relative to the preceding as well as the succeeding
I- or P-frame.
B-frames introduce encoding delay because of the time needed to wait for
the next I- or P-frame in the sequence.
B-frames provide the highest level of compression, and because they are
not involved in the coding of other frames they do not propagate
errors.
56. PB-Frames
PB-frame: this does not refer to a new frame type as such, but rather to
the way two neighbouring P- and B-frames are encoded as if they were
a single frame.
57. D-frame
This is application-specific and used in MOD/VOD applications.
In these applications the user may wish to fast-forward or rewind through
the movie, which requires the compressed video to be decompressed at a
much higher rate. To support this function the encoded bit stream also
contains D-frames.
58. Motion Estimation & Motion Compensation
(Encoding of P & B frame)
Motion estimation involves comparing small segments of two consecutive
frames for differences; if a difference is detected, a search is carried
out to determine to which neighbouring segment the original segment has
moved.
To limit the search time, the comparison is limited to a few segments.
P-frame: the motion is estimated between the frame being encoded and the
preceding I- or P-frame.
B-frame: the motion is estimated between the frame being encoded and
both the preceding and the succeeding I- or P-frame.
59. P-frame encoding
The digitized contents of the Y matrix associated with each frame are first
divided into a two-dimensional matrix of 16 x 16 pixel blocks, each known
as a macroblock (MB).
60. An MB consists of:
4 DCT blocks (8 x 8) for the luminance signal
1 DCT block each for the two chrominance signals (Cb and Cr).
Each MB has an address associated with it.
To encode a P-frame, the contents of each macroblock in the frame (known
as the target frame) are compared on a pixel-by-pixel basis with the
contents of the preceding I- or P-frame (the reference frame).
[Figure: reference frame (I or P) and target frame (P)]
Possible outcomes of the search:
If a close match is found, only the address of the MB is encoded.
If a match is not found, the search is extended to cover an area around
the MB in the reference frame.
61. All the possible MBs in the selected search area of the reference frame
are searched for a match.
Case 1: if a close match is found, two parameters are
encoded:
Motion vector (V): indicates the (x, y) offset of the MB being encoded.
It is further encoded using differential encoding.
Prediction error: consists of three matrices (one each for Y, Cb, Cr),
each of which contains the difference values between those in the target
MB and the set of pixels in the search area of the reference frame that
produced the closest match. This is encoded by the same method as used
for I-frames.
Case 2: if a match is not found, e.g. if the moving object has
moved out of the extended search area:
The MB is encoded independently, in the same way as MBs in an I-frame.
62. A match is said to be found if the mean of the absolute errors over all
the pixel positions in the difference MB (MD) is less than a given
threshold.
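The search and the match criterion can be sketched as an exhaustive block-matching loop that uses the mean absolute difference. The search range of 2 pixels, the threshold of 1.0, and the tiny 2 x 2 "macroblock" are illustrative assumptions chosen to keep the example readable; real coders use 16 x 16 macroblocks.

```python
def mean_abs_diff(target_mb, ref_frame, top, left):
    """Mean absolute difference between target_mb and a region of ref_frame."""
    size = len(target_mb)
    total = sum(abs(target_mb[r][c] - ref_frame[top + r][left + c])
                for r in range(size) for c in range(size))
    return total / (size * size)

def find_motion_vector(target_mb, ref_frame, top, left, search=2, threshold=1.0):
    """Return ((dx, dy), error) for the best match, or None if no close match."""
    size = len(target_mb)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            r0, c0 = top + dy, left + dx
            if (r0 < 0 or c0 < 0 or r0 + size > len(ref_frame)
                    or c0 + size > len(ref_frame[0])):
                continue  # candidate region falls outside the reference frame
            err = mean_abs_diff(target_mb, ref_frame, r0, c0)
            if best is None or err < best[1]:
                best = ((dx, dy), err)
    if best is not None and best[1] < threshold:
        return best
    return None  # no match: encode the MB independently, as in an I-frame

# Reference frame: a bright 2 x 2 object with its top-left corner at row 1, col 1.
ref = [[0] * 6 for _ in range(6)]
for r in (1, 2):
    for c in (1, 2):
        ref[r][c] = 9

# In the target frame the same object sits at row 2, col 3.
target_mb = [[9, 9], [9, 9]]
print(find_motion_vector(target_mb, ref, top=2, left=3))  # ((-2, -1), 0.0)
```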
64. B-frame encoding
To encode a B-frame, any motion is estimated with reference to both the
immediately preceding I- or P-frame and the immediately succeeding P-
or I-frame.
The motion vector and prediction error (difference matrices)
are computed using:
first the preceding frame as reference, and
then the succeeding frame as reference.
A third motion vector and set of difference matrices are then computed
using the target and the mean of the other two predicted sets of values
(MD and MD').
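The three candidate predictions can be sketched as below. The final step, picking the candidate with the smallest mean absolute difference, is an assumption added here for illustration; the slides describe how the three sets are computed but not how one is selected.

```python
def difference(target, prediction):
    """Element-wise difference matrix between target and predicted pixels."""
    return [[t - p for t, p in zip(trow, prow)]
            for trow, prow in zip(target, prediction)]

def mean_abs(matrix):
    values = [abs(v) for row in matrix for v in row]
    return sum(values) / len(values)

def b_frame_candidates(target, forward_pred, backward_pred):
    """Difference matrices for forward, backward and averaged prediction."""
    interpolated = [[(f + b) / 2 for f, b in zip(frow, brow)]
                    for frow, brow in zip(forward_pred, backward_pred)]
    return {
        "forward": difference(target, forward_pred),
        "backward": difference(target, backward_pred),
        "interpolated": difference(target, interpolated),
    }

target = [[10, 10], [10, 10]]
forward_pred = [[8, 8], [8, 8]]       # matched region in the preceding frame
backward_pred = [[12, 12], [12, 12]]  # matched region in the succeeding frame

candidates = b_frame_candidates(target, forward_pred, backward_pred)
best = min(candidates, key=lambda name: mean_abs(candidates[name]))  # assumed rule
print(best)  # interpolated
```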
66. Decoding of I, P, and B frames:
I-frames:
Decoded immediately to recreate the original frame.
P-frames:
The received information is decoded and the result is
used together with the decoded contents of the preceding I- or P-frame
(two buffers are used).
B-frames:
The received information is decoded and the result is
used together with the decoded contents of the preceding and succeeding
P- or I-frames (three buffers are used).
67. Implementation schematic – I-frames
The encoding procedure used for the macroblocks that make up an I-
frame is the same as that used in the JPEG standard to encode each 8 x 8
block of pixels.
Implementation Issues:
I-frame same as JPEG implementation
FDCT, Quantization, entropy encoding
Assuming 4 blocks for the luminance and 2 blocks for the chrominance,
each macroblock (MB) would require six 8x8 pixel blocks to be encoded
68. Implementation Schematic- P-frames
In the case of P-frames, encoding of each macroblock is dependent on
output of the motion estimation (ME) unit which, in turn, depends
on the contents of the MB (target frame) being encoded and the
contents of the macroblock in the search area (reference frame) that
produces the closest match. There are three possibilities:
If the two contents are the same, only the address of the macroblock
in the reference frame is encoded
If the two contents are very close, both the motion vector and the
difference matrices associated with the macroblock in the reference
frame are encoded
If no close match is found, then the target macroblock is encoded in
the same way as a macroblock in an I-frame
69. In order to carry out its role, the motion estimation unit,
which contains the search logic, uses a copy of the (uncoded)
reference frame.
70. Implementation schematic – B-frames
The same previous procedure is followed for encoding B-
frames except both the preceding (reference) and the
succeeding frame to the target frame are involved
71. Macroblock encoded bit-stream format
For each macroblock it is necessary to identify the type of encoding that
has been used. This is the role of the formatter.
Type: indicates the type of frame encoded (I, P or B)
Address: identifies the location of the macroblock in the frame
Quantization value: the value used to quantize all the DCT
coefficients in the macroblock
Motion vector: the encoded vector
Block representation: indicates which of the six 8 x 8 blocks that make
up the macroblock are present
B1, B2, ... B6: JPEG-encoded DCT coefficients for those blocks present
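The fields listed above can be pictured as a simple record. The class and field names below are hypothetical, and no bit widths or byte layout are implied; the sketch only shows how the block-representation flags relate to the blocks actually carried.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class EncodedMacroblock:
    frame_type: str                           # 'I', 'P' or 'B'
    address: int                              # location of the MB in the frame
    quantization_value: int                   # used for all DCT coefficients
    motion_vector: Optional[Tuple[int, int]]  # None for intracoded MBs
    blocks_present: List[bool]                # six flags: 4 luminance + Cb + Cr
    blocks: List[bytes] = field(default_factory=list)  # coded data, present blocks only

    def is_consistent(self) -> bool:
        """Six flags, and one coded block for every flag that is set."""
        return (len(self.blocks_present) == 6
                and sum(self.blocks_present) == len(self.blocks))

mb = EncodedMacroblock(
    frame_type="P",
    address=42,
    quantization_value=8,
    motion_vector=(-2, -1),
    blocks_present=[True, False, False, True, False, False],
    blocks=[b"\x01\x02", b"\x03"],  # placeholder payloads, not real coded data
)
print(mb.is_consistent())  # True
```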
72. MPEG (Moving Picture Experts Group)
Committee of experts that develops video encoding standards
in the year 1990.
Until recently, was the only game in town (still the most
popular, by far)
Suitable for wide range of videos
Low resolution to high resolution
Slow movement to fast action
Can be implemented either in software or hardware
73. MPEG:
MPEG-1 (ISO Recommendation 11172)
Source intermediate digitization format (SIF) is used.
Uses a resolution of 352 x 288 pixels; used for VHS-quality audio and video
on CD-ROM at a bit rate of 1.5 Mbps.
MPEG-2 (ISO Recommendation 13818)
Used in recording and transmission of studio-quality audio and video.
Different levels of video resolution are possible:
Low: 352 x 288 pixels, comparable with MPEG-1
Main: 720 x 576 pixels, studio-quality video and audio, bit rate up to
15 Mbps
High: 1920 x 1152 pixels, used in wide-screen HDTV; bit rates of up to
80 Mbps are possible
74. MPEG-4: used for interactive multimedia applications over the
Internet and over various entertainment networks.
The MPEG-4 standard contains features that enable a user not only to
passively access a video sequence, using for example start/stop controls,
but also to manipulate the individual elements that make up a scene
within a video.
In MPEG-4 each video frame is segmented into a number of video
object planes (VOPs), each of which corresponds to an AVO (audio-visual
object) of interest.
75. MPEG-1
• Uses a video compression technique similar to that of H.261; the
digitization format used is the source intermediate format
(SIF), with progressive scanning and a refresh rate of 30 Hz
(for NTSC) and 25 Hz (for PAL)
76. Performance
Compression ratios for I-frames are similar to those of JPEG: for video,
typically 10:1 through to 20:1, depending on the complexity of the frame
contents.
P- and B-frames achieve higher compression: in the region of 20:1
through to 30:1 for P-frames and 30:1 to 50:1 for B-frames.
77. Video Compression – MPEG-1 video
bitstream structure: composition
• The compressed bitstream produced by the video encoder is
hierarchical: at the top level is the complete compressed video
(sequence), which consists of a string of groups of pictures
78. Video Compression – MPEG-1 video
bitstream structure: format
• In order for the decoder to decompress the received
bitstream, each data structure must be clearly identified within
the bitstream
79. Video Compression – MPEG-4 coding
principles
• Content based video coding principles showing how a frame/
scene is defined in the form of multiple video object planes
80. Video Compression – MPEG – 4
encoder/decoder schematic
• Before being compressed each scene is defined in the form
of a background and one or more foreground audio-visual
objects (AVOs)
81. Video Compression – MPEG VOP encoder
The audio associated with an AVO is compressed using one of
the algorithms described before and depends on the available
bit rate of the transmission channel and the sound quality
required