1. INTRODUCTION
In imaging science, image processing is any form of signal processing for which the
input is an image, such as a photograph or video frame; the output of image processing may
be either an image or a set of characteristics or parameters related to the image. Most
image-processing techniques involve treating the image as a two-dimensional signal and
applying standard signal-processing techniques to it. Image processing usually refers
to digital image processing, but optical and analog image processing also are possible.
Image processing, in its broadest and most literal sense, aims to address the goal of
providing practical, reliable and affordable means to allow machines to cope with images
while assisting man in his general endeavors. In electrical engineering and computer
science, it is any form of signal processing for which the input is an image, such as a video
frame; the output may be either an image or a set of parameters related to the image. In most
image-processing techniques, the image is treated as a two-dimensional signal. In short,
image processing is the act of examining images for the purpose of identifying objects and
judging their significance.
Image processing refers to processing of a 2D picture by a computer. Basic
definitions: An image defined in the “real world” is considered to be a function of two real
variables, for example, a(x,y) with a as the amplitude (e.g. brightness) of the image at the
real coordinate position (x,y). Modern digital technology has made it possible to manipulate
multi-dimensional signals with systems that range from simple digital circuits to advanced
parallel computers. The goal of this manipulation can be divided into three categories:
1) Image Processing (image in -> image out)
2) Image Analysis (image in -> measurements out)
3) Image Understanding (image in -> high-level description out)
An image may be considered to contain sub-images sometimes referred to as regions-
of-interest, ROIs, or simply regions. This concept reflects the fact that images frequently
contain collections of objects each of which can be the basis for a region. In a sophisticated
image processing system it should be possible to apply specific image processing operations
to selected regions. Thus, one part of an image (region) might be processed to suppress
motion blur while another part might be processed to improve color rendition. The typical
sequence of image processing is as follows:
Most usually, image processing systems require that the images be available in
digitized form, that is, arrays of finite length binary words. For digitization, the given Image
is sampled on a discrete grid and each sample or pixel is quantized using a finite number of
bits. The digitized image is processed by a computer.
To display a digital image, it is first converted into analog signal, which is scanned
onto a display. Closely related to image processing are computer graphics and computer
vision. In computer graphics, images are manually made from physical models of objects,
environments, and lighting, instead of being acquired (via imaging devices such as cameras)
from natural scenes, as in most animated movies.
Computer vision, on the other hand, is often considered high-level image processing
out of which a machine/computer/software intends to decipher the physical contents of an
image or a sequence of images (e.g., videos or 3D full-body magnetic resonance scans). In
modern sciences and technologies, images also gain much broader scopes due to the ever
growing importance of scientific visualization (of often large-scale complex
scientific/experimental data). Examples include microarray data in genetic research, or real-
time multi-asset portfolio trading in finance. Before an image is processed, it is first
converted into a digital form.
Digitization includes sampling of image and quantization of sampled values. After
converting the image into bit information, processing is performed. This processing
technique may be Image enhancement, Image restoration, and Image compression.
1) Image enhancement: It refers to the accentuation, or sharpening, of image features such
as boundaries or contrast to make a graphic display more useful for display and
analysis. This process does not increase the inherent information content in data. It
includes gray level & contrast manipulation, noise reduction, edge crispening and
sharpening, filtering, interpolation and magnification, pseudo coloring, and so on.
2) Image restoration: It is concerned with filtering the observed image to minimize the
effect of degradations. Effectiveness of image restoration depends on the extent and
accuracy of the knowledge of the degradation process as well as on the filter design. Image
restoration differs from image enhancement in that the latter is concerned with the
extraction or accentuation of image features rather than the correction of degradations.
3) Image compression: It is concerned with minimizing the number of bits required to
represent an image. Applications of compression include broadcast TV, remote sensing
via satellite, military communication via aircraft, radar, teleconferencing, facsimile
transmission of educational and business documents, medical images that arise in
computed tomography, magnetic resonance imaging and digital radiology, motion
pictures, satellite images, weather maps, geological surveys and so on. Commonly used
compression standards include:
1) Text compression – CCITT GROUP3 & GROUP4
2) Still image compression – JPEG
3) Video image compression – MPEG
Digital Image Processing is a rapidly evolving field with growing applications in Science
and Engineering. Modern digital technology has made it possible to manipulate multi-
dimensional signals. Digital Image Processing has a broad spectrum of applications. They
include remote sensing data via satellite, medical image processing, radar, sonar and acoustic
image processing and robotics.
Uncompressed multimedia graphics, audio and video data require considerable
storage capacity and transmission bandwidth. Despite rapid progress in mass-storage
density, processor speeds, and digital communication system performance, demand for data
storage capacity and data-transmission bandwidth continues to outstrip the capabilities of
available technologies. This is a crippling disadvantage during transmission and storage. So
there arises a need for data compression of images.
There are several Image compression techniques. Two ways of classifying
compression techniques are mentioned here: 1) lossless vs. lossy compression and 2)
predictive vs. transform coding. For correct diagnosis, medical images should be
displayed with 100% quality. The popular JPEG image compression technique is a lossy
technique and so causes some loss in image quality. Even though the loss is not a cause of
concern for non-medical images, it makes the analysis of medical images a difficult task. So
it is not suitable for the compression of medical images.
A digital remotely sensed image is typically composed of picture elements (pixels)
located at the intersection of each row i and column j in each K bands of imagery.
Associated with each pixel is a number known as Digital Number (DN) or Brightness Value
(BV), that depicts the average radiance of a relatively small area within a scene.
A smaller number indicates low average radiance from the area and a higher number
indicates higher radiance from the area. The size of this area affects the
reproduction of detail within the scene. As the pixel size is reduced, more scene detail is
preserved in the digital representation.
When the different bands of a multispectral data set are displayed in image planes other
than their own, the resulting color composite is regarded as a False Color Composite (FCC).
High spectral resolution is important when producing color composites.
For a true color composite, the image data acquired in the red, green and blue spectral
regions must be assigned to the red, green and blue planes of the image processor's frame
buffer memory. A color infrared composite, the 'standard false color composite', is displayed
by placing the infrared, red and green bands in the red, green and blue frame buffer memory,
respectively.
In such a composite, healthy vegetation shows up in shades of red because vegetation
absorbs most of the green and red energy but reflects approximately half of the incident
infrared energy. Urban areas reflect roughly equal portions of NIR, R and G, and therefore
appear steel grey.
Geometric distortions manifest themselves as errors in the position of a pixel relative
to other pixels in the scene and with respect to their absolute position within some defined
map projection. If left uncorrected, these geometric distortions render any data extracted
from the image useless. This is particularly so if the information is to be compared to other
data sets, be it from another image or a GIS data set.
Distortions occur for many reasons. For instance, distortions occur due to changes in
platform attitude (roll, pitch and yaw), altitude, earth rotation, earth curvature, panoramic
distortion and detector delay. Most of these distortions can be modelled mathematically and
are removed before the imagery is delivered. Changes in attitude, however, can be difficult
to account for mathematically, and so a procedure called image rectification is performed.
Satellite systems are however geometrically quite stable and geometric rectification
is a simple procedure based on a mapping transformation relating real ground coordinates,
say in easting and northing, to image line and pixel coordinates.
Rectification is a process of geometrically correcting an image so that it can be
represented on a planar surface, conform to other images or conform to a map. That is, it is
the process by which the geometry of an image is made planimetric. It is necessary when
accurate area, distance and direction measurements are required to be made from the
imagery. It is achieved by transforming the data from one grid system into another grid
system using a geometric transformation. Rectification is not necessary if there is no
distortion in the image. For example, if an image file is produced by scanning or digitizing a
paper map that is in the desired projection system, then that image is already planar and does
not require rectification unless there is some skew or rotation of the image.
Scanning and digitizing produce images that are planar, but do not contain any map
coordinate information. These images need only to be geo-referenced, which is a much
simpler process than rectification. In many cases, the image header can simply be updated
with new map coordinate information. This involves redefining the map coordinate of the
upper left corner of the image and the cell size (the area represented by each pixel).
Ground Control Points (GCP) are the specific pixels in the input image for which the
output map coordinates are known. By using more points than necessary to solve the
transformation equations a least squares solution may be found that minimises the sum of the
squares of the errors. Care should be exercised when selecting ground control points, as
their number, quality and distribution affect the result of the rectification.
Once the mapping transformation has been determined, a procedure called
resampling is employed. Resampling matches the coordinates of image pixels to their real
world coordinates and writes a new image on a pixel by pixel basis. Since the grid of pixels
in the source image rarely matches the grid for the reference image, the pixels are resampled
so that new data file values for the output file can be calculated.
Image enhancement techniques improve the quality of an image as perceived by a
human. These techniques are most useful because many satellite images when examined on a
colour display give inadequate information for image interpretation. There is no conscious
effort to improve the fidelity of the image with regard to some ideal form of the image.
There exists a wide variety of techniques for improving image quality.
The contrast stretch, density slicing, edge enhancement, and spatial filtering are the
more commonly used techniques. Image enhancement is attempted after the image is
corrected for geometric and radiometric distortions. Image enhancement methods are applied
separately to each band of a multispectral image. Digital techniques have been found to be
more satisfactory than photographic techniques for image enhancement, because of the
precision and wide variety of digital processes.
Contrast generally refers to the difference in luminance or grey level values in an
image and is an important characteristic. It can be defined as the ratio of the maximum
intensity to the minimum intensity over an image. Contrast ratio has a strong bearing on the
resolving power and detectability of an image. The larger this ratio, the easier it is to
interpret the image. Satellite images often lack adequate contrast and require contrast
improvement.
Contrast enhancement techniques expand the range of brightness values in an image
so that the image can be efficiently displayed in a manner desired by the analyst. The density
values in a scene are literally pulled farther apart, that is, expanded over a greater range. The
effect is to increase the visual contrast between two areas of different uniform densities. This
enables the analyst to discriminate easily between areas initially having a small difference in
density.
This is the simplest contrast stretch algorithm. The grey values in the original image
and the modified image follow a linear relation in this algorithm. A density number in the
low range of the original histogram is assigned to extremely black and a value at the high
end is assigned to extremely white. The remaining pixel values are distributed linearly
between these extremes. The features or details that were obscure on the original image will
be clear in the contrast stretched image. Linear contrast stretch operation can be represented
graphically. To provide optimal contrast and color variation in color composites the small
range of grey values in each band is stretched to the full brightness range of the output or
display unit.
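As a concrete illustration of the linear stretch described above, the following is a minimal
Python/NumPy sketch; the 2nd and 98th percentile cut-off values and the 8-bit display range
are illustrative assumptions rather than values prescribed by the text:

    import numpy as np

    def linear_stretch(band, low_pct=2, high_pct=98):
        """Linearly stretch one image band to the full 0-255 display range.

        Pixels at or below the low percentile map to extreme black (0), pixels
        at or above the high percentile map to extreme white (255), and the
        remaining values are distributed linearly between these extremes.
        """
        band = band.astype(np.float64)
        lo, hi = np.percentile(band, [low_pct, high_pct])
        stretched = (band - lo) / max(hi - lo, 1e-12)       # linear mapping
        return (np.clip(stretched, 0.0, 1.0) * 255).astype(np.uint8)

    # Example: stretch a synthetic low-contrast band occupying a narrow grey range
    band = np.random.randint(90, 120, size=(256, 256))
    display = linear_stretch(band)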
In these methods, the input and output data values follow a non-linear
transformation. The general form of the non-linear contrast enhancement is defined by y = f
(x), where x is the input data value and y is the output data value. The non-linear contrast
enhancement techniques have been found to be useful for enhancing the colour contrast
between closely related classes and subclasses of a main class.
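A gamma (power-law) mapping is one common choice of the non-linear function y = f(x);
the sketch below, including the gamma value of 0.5, is only an illustrative example and not
a transformation prescribed above:

    import numpy as np

    def gamma_stretch(band, gamma=0.5):
        """Non-linear contrast enhancement y = f(x) with f(x) = x**gamma.

        gamma < 1 expands the darker grey levels, gamma > 1 expands the
        brighter ones; the output is rescaled to 8 bits for display.
        """
        x = band.astype(np.float64)
        x = (x - x.min()) / max(x.max() - x.min(), 1e-12)   # normalize to [0, 1]
        y = np.power(x, gamma)                              # non-linear transform
        return (y * 255).astype(np.uint8)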
In this report, chapter 2 deals with a model of a visual attention system, which builds
on a biologically plausible architecture at the basis of several models, and an
automatic segmentation algorithm for video frames captured by a webcam that closely
approximates depth segmentation from a stereo camera.
The model of visual attention system is related to the so-called “feature integration
theory,” explaining human visual search strategies. Visual input is first decomposed into a
set of topographic feature maps. Different spatial locations then compete for saliency within
each map, such that only locations which locally stand out from their surround can persist.
All feature maps feed, in a purely bottom-up manner, into a master “saliency map,” which
topographically codes for local conspicuity over the entire visual scene.
In primates, such a map is believed to be located in the posterior parietal cortex as
well as in the various visual maps in the pulvinar nuclei of the thalamus. The model’s
saliency map is endowed with internal dynamics which generate attentional shifts.
An automatic segmentation algorithm exploits motion and its spatial context as a
powerful cue for layer separation, and the correct level of geometric rigidity is automatically
learned from training data.
The algorithm benefits from a novel, quantized motion representation (cluster
centroids of the spatiotemporal derivatives of video frames), referred to as motons. Motons
(related to textons) and inspired by recent research in motion modeling and object/material
recognition, are combined with shape filters to model long-range spatial correlations (shape).
These new features prove useful for capturing the visual context and filling in missing,
textureless, or motionless regions.
Fused motion-shape cues are discriminatively selected by supervised learning. Key to
the technique is a classifier trained on depth-defined layer labels, such as those used in a
stereo setting, as opposed to motion-defined layer labels. Thus, to induce depth in the
absence of stereo while maintaining generalization, the classifier is forced to combine other
available cues accordingly.
Combining multiple cues addresses one of the two aforementioned requirements,
robustness, for the bilayer segmentation. To meet the other requirement, efficiency, a
straightforward way is to trade accuracy for speed if only one type of classifier is used.
However, efficiency can be achieved without sacrificing much accuracy if multiple types
of classifiers, such as AdaBoost, decision trees, random forests, random ferns, and attention
cascades, are available. That is, if the “strong” classifiers are viewed as a composition of
“weak” learners (decision stumps), it is possible to control how the strong classifiers are
constructed to fit the limitations in evaluation time.
This work describes a general taxonomy of classifiers which interprets these common
algorithms as variants of a single tree-based classifier. This taxonomy allows the
different algorithms to be compared fairly in terms of evaluation complexity (time) and the
most efficient or accurate one to be selected for the application at hand.
By fusing motion, shape, color, and contrast with a local smoothness prior in a
conditional random field model, pixelwise binary segmentation is achieved through min-cut.
The result is a segmentation algorithm that is efficient and robust to distracting events and
that requires no initialization.
Chapter 3 proposes a robust video object extraction (VOE) framework, which
utilizes both visual and motion saliency information across video frames. The observed
saliency information allows several visual and motion cues to be inferred for learning
foreground and background models, and a conditional random field (CRF) is applied to automatically
determine the label (foreground or background) of each pixel based on the observed
models.
With the ability to preserve both spatial and temporal consistency, the VOE framework
exhibits promising results on a variety of videos and produces quantitatively and
qualitatively satisfactory performance. While the focus here is on VOE problems for single-
concept videos (i.e., videos in which only one object category of interest is presented), the
method is able to deal with multiple object instances (of the same type) with variations in
pose, scale, etc. Chapter 4 deals with the comparison between the three schemes. Chapter
5 gives a conclusion.
2. LITERATURE SURVEY
Digital Image Processing is a rapidly evolving field with growing applications in
Science and Engineering. Modern digital technology has made it possible to manipulate
multi-dimensional signals. Digital Image Processing has a broad spectrum of applications.
They include remote sensing data via satellite, medical image processing, radar, sonar and
acoustic image processing and robotics.
Uncompressed multimedia graphics, audio and video data require considerable
storage capacity and transmission bandwidth. Despite rapid progress in mass-storage
density, processor speeds, and digital communication system performance, demand for data
storage capacity and data-transmission bandwidth continues to outstrip the capabilities of
available technologies. This is a crippling disadvantage during transmission and storage. So
there arises a need for data compression of images.
There are several Image compression techniques. Two ways of classifying
compression techniques are mentioned here.
1) Loss less Vs Lossy compression
2) Predictive Vs Transform coding
For correct diagnosis, medical images should be displayed with 100% quality. The popular
JPEG image compression technique is a lossy technique and so causes some loss in image
quality. Even though the loss is not a cause of concern for non-medical images, it makes
the analysis of medical images a difficult task. So it is not suitable for the compression of
medical images.
The first scheme, “Model of Saliency-Based Visual Attention” [1], presents a model
of a saliency-based visual attention system inspired by the behavior and the neuronal
architecture of the early primate visual system. Multiscale image features are
combined into a single topographical saliency map. A dynamical neural network then selects
attended locations in order of decreasing saliency. The system breaks down the complex
problem of scene understanding by rapidly selecting, in a computationally efficient manner,
conspicuous locations to be analyzed in detail.
The second scheme, “Bilayer Segmentation of Webcam Videos” [2], addresses the
problem of extracting a foreground layer from video chat captured by a (monocular) webcam
that closely approximates depth segmentation from a stereo camera. The foreground is
intuitively defined as a subject of video chat, not necessarily frontal, while the background is
literally anything else. Applications for this technique include background substitution,
compression, adaptive bit rate video transmission, and tracking. These applications have at
least two requirements:
1) Robust segmentation against strong distracting events, such as people moving in
the background, camera shake, or illumination change, and
2) Efficient separation for attaining live streaming speed.
Image processing can be defined as the acquisition and processing of visual information by
computer. The computer representation of an image requires the equivalent of many thousands
of words of data, and this massive amount of data is a primary reason for
the development of many subareas within the field of computer imaging, such as image
compression and segmentation.
Another important aspect of computer imaging involves the ultimate “receiver” of
visual information: in some cases the human visual system and in others the computer itself.
Computer imaging can be separated into two primary categories, computer vision and image
processing, which are not totally separate and distinct. The boundaries that separate the two
are fuzzy, but this division allows the differences between the two to be explored and helps
in understanding how they fit together. One of the major topics within the field of computer
vision is image analysis. Image analysis involves the examination of the image data to
facilitate solving a vision problem. The image analysis process involves two other topics:
1) Feature Extraction: the process of acquiring higher-level image
information, such as shape or color information.
2) Pattern Classification: the act of taking this higher-level information and
identifying objects within the image.
Computer vision systems are used in many and various types of environments, such as
Manufacturing Systems, Medical Community, Law Enforcement, Infrared Imaging,
Satellites Orbiting.
Image processing is computer imaging where the application involves a human being in
the visual loop. In other words, the images are to be examined and acted upon by people.
The major topics within the field of image processing include:
1) Image restoration.
2) Image enhancement.
3) Image compression.
2.1 MODEL OF SALIENCY-BASED VISUAL ATTENTION
Primates have a remarkable ability to interpret complex scenes in real time, despite
the limited speed of the neuronal hardware available for such tasks. Intermediate and higher
visual processes appear to select a subset of the available sensory information before further
processing, most likely to reduce the complexity of scene analysis.
This selection appears to be implemented in the form of a spatially circumscribed
region of the visual field, the so-called “focus of attention,” which scans the scene both in a
rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-
down, volition-controlled, and task-dependent manner. Models of attention include
“dynamic routing” models, in which information from only a small region of the visual field
can progress through the cortical visual hierarchy.
The attended region is selected through dynamic modifications of cortical
connectivity or through the establishment of specific temporal patterns of activity, under
both top-down (task-dependent) and bottom-up (scene-dependent) control.
The model used here builds on a second biologically plausible architecture at the
basis of several models. It is related to the so-called “feature integration theory,” explaining
human visual search strategies. Visual input is first decomposed into a set of topographic
feature maps. Different spatial locations then compete for saliency within each map, such
that only locations which locally stand out from their surround can persist.
All feature maps feed, in a purely bottom-up manner, into a master “saliency map,”
which topographically codes for local conspicuity over the entire visual scene. In primates,
such a map is believed to be located in the posterior parietal cortex as well as in the various
visual maps in the pulvinar nuclei of the thalamus.
The model’s saliency map is endowed with internal dynamics which generate
attentional shifts. This model consequently represents a complete account of bottom-up
saliency and does not require any top-down guidance to shift attention.
This framework provides a massively parallel method for the fast selection of a small
number of interesting image locations to be analyzed by more complex and time consuming
object-recognition processes. Extending this approach in “guided-search,” feedback from
higher cortical areas (e.g., knowledge about targets to be found) was used to weight the
importance of different features, such that only those with high weights could reach higher
processing levels.
Input is provided in the form of static color images, usually digitized at 640 × 480
resolution. Nine spatial scales are created using dyadic Gaussian pyramids, which
progressively low-pass filter and subsample the input image, yielding horizontal and vertical
image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.
Each feature is computed by a set of linear “center-surround” operations akin to
visual receptive fields: Typical visual neurons are most sensitive in a small region of the
visual space (the center), while stimuli presented in a broader, weaker antagonistic region
concentric with the center (the surround) inhibit the neuronal response.
Such an architecture, sensitive to local spatial discontinuities, is particularly well-
suited to detecting locations which stand out from their surround and is a general
computational principle in the retina, lateral geniculate nucleus, and primary visual cortex.
Center-surround is implemented in the model as the difference between fine and coarse
scales: The center is a pixel at scale c ϵ {2, 3, 4}, and the surround is the corresponding pixel
at scale s = c + δ, with δ ϵ {3, 4}.
The across-scale difference between two maps, denoted “Ө” below, is obtained by
interpolation to the finer scale and point-by-point subtraction. Using several scales not only
for c but also for δ = s - c yields truly multiscale feature extraction, by including different
size ratios between the center and surround regions (contrary to previously used fixed
ratios).
2.1.1 Extraction of Early Visual Features
With r, g, and b being the red, green, and blue channels of the input image, an
intensity image I is obtained as I = (r + g + b)/3. I is used to create a Gaussian pyramid I(σ),
where σ ɛ [0..8] is the scale. The r, g, and b channels are normalized by I in order to
decouple hue from intensity. However, because hue variations are not perceivable at very
low luminance (and hence are not salient), normalization is only applied at the locations
where I is larger than 1/10 of its maximum over the entire image (other locations yield zero
r, g, and b). Four broadly-tuned color channels are created: R = r - (g + b)/2 for red, G = g -
(r + b)/2 for green, B = b - (r + g)/2 for blue, and Y = (r + g)/2 - |r - g|/2 - b for yellow
(negative values are set to zero).
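The channel definitions above translate directly into code. The sketch below follows the
formulas given in this section, including the threshold of 1/10 of the maximum intensity; the
assumption is that the input is a float RGB array of shape (H, W, 3):

    import numpy as np

    def early_channels(rgb):
        """Compute intensity I and broadly-tuned color channels R, G, B, Y."""
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        I = (r + g + b) / 3.0

        # Normalize r, g, b by I only where I exceeds 1/10 of its maximum, so that
        # hue is decoupled from intensity; very dark (non-salient) locations yield zero.
        mask = I > 0.1 * I.max()
        norm = np.where(mask, I, 1.0)
        r = np.where(mask, r / norm, 0.0)
        g = np.where(mask, g / norm, 0.0)
        b = np.where(mask, b / norm, 0.0)

        R = np.clip(r - (g + b) / 2.0, 0, None)                        # red
        G = np.clip(g - (r + b) / 2.0, 0, None)                        # green
        B = np.clip(b - (r + g) / 2.0, 0, None)                        # blue
        Y = np.clip((r + g) / 2.0 - np.abs(r - g) / 2.0 - b, 0, None)  # yellow
        return I, R, G, B, Y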
Four Gaussian pyramids R(s), G(s), B(s), and Y(s) are created from these color
channels. Center-surround differences (Ө defined previously) between a “center” fine scale c
and a “surround” coarser scale s yield the feature maps. The first set of feature maps is
concerned with intensity contrast, which, in mammals, is detected by neurons sensitive either
to dark centers on bright surrounds or to bright centers on dark surrounds.
Here, both types of sensitivities are simultaneously computed (using a rectification)
in a set of six maps I(c, s), with c ϵ {2, 3, 4} and s = c + δ, δ ϵ {3, 4}:
I(c, s) = |I(c) Ө I(s)|. (1)
A second set of maps is similarly constructed for the color channels, which, in cortex,
are represented using a so-called “color double-opponent” system: In the center of their
receptive fields, neurons are excited by one color (e.g., red) and inhibited by another (e.g.,
green), while the converse is true in the surround.
Such spatial and chromatic opponency exists for the red/green, green/red,
blue/yellow, and yellow/blue color pairs in human primary visual cortex. Accordingly, maps
RG(c,s) are created in the model to simultaneously account for red/green and green/red
double opponency (2) and BY(c, s) for blue/yellow and yellow/blue double opponency (3):
RG(c, s) = |(R(c) - G(c)) Ө (G(s) - R(s))| (2)
BY(c, s) = |(B(c) - Y(c)) Ө (Y(s) - B(s))|. (3)
Local orientation information is obtained from I using oriented Gabor pyramids
O(σ, θ), where σ ϵ [0..8] represents the scale and θ ϵ {0°, 45°, 90°, 135°} is the preferred
orientation. (Gabor filters, which are the product of a cosine grating and a 2D Gaussian
envelope, approximate the receptive field sensitivity profile (impulse response) of
orientation-selective neurons in primary visual cortex.)
Orientation feature maps, O(c, s, θ), encode, as a group, local orientation contrast
between the center and surround scales:
O(c, s, θ) = |O(c, θ) Ө O(s, θ)|. (4)
In total, 42 feature maps are computed: six for intensity, 12 for color, and 24 for
orientation.
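As an illustration, a minimal sketch of the center-surround operation Ө is given below,
assuming OpenCV for the dyadic Gaussian pyramid and bilinear interpolation; I is assumed
to be the intensity image from the earlier sketch, and the loop produces the six intensity
feature maps I(c, s) of (1) for c ϵ {2, 3, 4} and δ ϵ {3, 4}:

    import cv2
    import numpy as np

    def gaussian_pyramid(img, levels=9):
        """Dyadic Gaussian pyramid: scale 0 is the input, scale 8 is 1:256."""
        pyr = [img.astype(np.float32)]
        for _ in range(levels - 1):
            pyr.append(cv2.pyrDown(pyr[-1]))
        return pyr

    def center_surround(pyr, c, s):
        """Across-scale difference |pyr[c] Ө pyr[s]|: interpolate the coarse map
        to the finer scale and subtract point by point."""
        h, w = pyr[c].shape[:2]
        surround = cv2.resize(pyr[s], (w, h), interpolation=cv2.INTER_LINEAR)
        return np.abs(pyr[c] - surround)

    # Six intensity feature maps I(c, s), c in {2, 3, 4}, s = c + delta, delta in {3, 4}
    I_pyr = gaussian_pyramid(I)   # I: intensity image from the earlier sketch
    intensity_maps = [center_surround(I_pyr, c, c + d)
                      for c in (2, 3, 4) for d in (3, 4)]

The same two helpers can be reused for the color pyramids R(σ), G(σ), B(σ), Y(σ) and for the
Gabor orientation pyramids to produce the remaining feature maps of (2)-(4).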
2.1.2 The Saliency Map
The purpose of the saliency map is to represent the conspicuity—or “saliency”—at
every location in the visual field by a scalar quantity and to guide the selection of attended
locations, based on the spatial distribution of saliency. A combination of the feature maps
provides bottom-up input to the saliency map, modeled as a dynamical neural network.
One difficulty in combining different feature maps is that they represent a priori not
comparable modalities, with different dynamic ranges and extraction mechanisms. Also,
because all 42 feature maps are combined, salient objects appearing strongly in only a few
maps may be masked by noise or by less-salient objects present in a larger number of maps.
In the absence of top-down supervision, a map normalization operator, N(.), is proposed,
which globally promotes maps in which a small number of strong peaks of activity
(conspicuous locations) is present, while globally suppressing maps which contain numerous
comparable peak responses. N(.) consists of
1) normalizing the values in the map to a fixed range [0..M], in order to eliminate
modality-dependent amplitude differences;
2) finding the location of the map’s global maximum M and computing the average m
of all its other local maxima; and
3) globally multiplying the map by (M − m)².
Only local maxima of activity are considered, such that N(.) compares responses associated
with meaningful “activation spots” in the map and ignores homogeneous areas. Comparing
the maximum activity in the entire map to the average overall activation measures how
different the most active location is from the average.
When this difference is large, the most active location stands out, and the map is
strongly promoted. When the difference is small, the map contains nothing unique and is
suppressed. The biological motivation behind the design of N(.) is that it coarsely replicates
cortical lateral inhibition mechanisms, in which neighboring similar features inhibit each
other via specific, anatomically defined connections.
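A minimal sketch of the three steps of N(.) follows, assuming SciPy's maximum filter for
detecting local maxima; the neighborhood size of 7 pixels and the normalization range M = 1
are assumptions, not values fixed by the description above:

    import numpy as np
    from scipy.ndimage import maximum_filter

    def normalize_map(fmap, M=1.0, neighborhood=7):
        """Map normalization operator N(.)."""
        # 1) Rescale the map to the fixed range [0, M].
        fmap = fmap - fmap.min()
        fmap = M * fmap / max(fmap.max(), 1e-12)

        # 2) Find the global maximum and the average of the other local maxima.
        local_max = (fmap == maximum_filter(fmap, size=neighborhood)) & (fmap > 0)
        peaks = fmap[local_max]
        M_glob = peaks.max() if peaks.size else 0.0
        others = peaks[peaks < M_glob]
        m_bar = others.mean() if others.size else 0.0

        # 3) Globally multiply the map by (M_glob - m_bar)**2.
        return fmap * (M_glob - m_bar) ** 2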
Feature maps are combined into three “conspicuity maps,” I for intensity (5), C for
color (6), and O for orientation (7), at the scale (σ = 4) of the saliency map. They are
obtained through across-scale addition, “⊕,” which consists of reduction of each map to
scale four and point-by-point addition:

I = ⊕(c=2 to 4) ⊕(s=c+3 to c+4) N(I(c, s))   (5)

C = ⊕(c=2 to 4) ⊕(s=c+3 to c+4) [N(RG(c, s)) + N(BY(c, s))]   (6)
For orientation, four intermediary maps are first created by combination of the six feature
maps for a given θ and are then combined into a single orientation conspicuity map:

O = Σ (over θ ϵ {0°, 45°, 90°, 135°}) N( ⊕(c=2 to 4) ⊕(s=c+3 to c+4) N(O(c, s, θ)) )   (7)
The motivation for the creation of three separate channels, I, C, and O, and their individual
normalization is the hypothesis that similar features compete strongly for saliency, while
different modalities contribute independently to the saliency map. The three conspicuity
maps are normalized and summed into the final input S to the saliency map:
S= 1/3 (N(I) + N(C) + N(O)) (8)
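Given the feature maps and the normalize_map operator sketched earlier, the across-scale
addition of (5)-(7) and the final combination (8) reduce to resizing every map to scale four
and summing. The sketch below assumes OpenCV for the reduction and assumes the feature
maps are supplied in simple containers (lists per channel, one list of maps per θ); these
containers are illustrative, not part of the original description:

    import cv2
    import numpy as np

    def across_scale_add(maps, shape4):
        """Across-scale addition: reduce every map to scale four and add point by point."""
        h, w = shape4
        return sum(cv2.resize(m, (w, h), interpolation=cv2.INTER_LINEAR) for m in maps)

    def saliency_input(intensity_maps, rg_maps, by_maps, ori_maps_by_theta, shape4):
        """Combine normalized feature maps into conspicuity maps and the final input S."""
        I_bar = across_scale_add([normalize_map(m) for m in intensity_maps], shape4)   # (5)
        C_bar = across_scale_add([normalize_map(rg) + normalize_map(by)
                                  for rg, by in zip(rg_maps, by_maps)], shape4)        # (6)
        O_bar = sum(normalize_map(across_scale_add([normalize_map(m) for m in maps], shape4))
                    for maps in ori_maps_by_theta.values())                            # (7)
        return (normalize_map(I_bar) + normalize_map(C_bar) + normalize_map(O_bar)) / 3.0  # (8)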
At any given time, the maximum of the saliency map (SM) defines the most salient image
location, to which the focus of attention (FOA) should be directed. One could now simply
select the most active location as defining the point where the model should next attend.
However, in a neuronally plausible implementation, the SM is modeled as a 2D layer of
leaky integrate-and-fire neurons at scale four.
These model neurons consist of a single capacitance which integrates the charge
delivered by synaptic input, of a leakage conductance, and of a voltage threshold. When the
threshold is reached, a prototypical spike is generated, and the capacitive charge is shunted
to zero. The SM feeds into a biologically-plausible 2D “winner-take-all” (WTA) neural
network, at scale σ = 4, in which synaptic interactions among units ensure that only the most
active location remains, while all other locations are suppressed.
The neurons in the SM receive excitatory inputs from S and are all independent. The
potential of SM neurons at more salient locations hence increases faster (these neurons are
used as pure integrators and do not fire). Each SM neuron excites its corresponding WTA
neuron. All WTA neurons also evolve independently of each other, until one (the “winner”)
first reaches threshold and fires. This triggers three simultaneous mechanisms:
1) The FOA is shifted to the location of the winner neuron;
2) The global inhibition of the WTA is triggered and completely inhibits (resets) all
WTA neurons;
3) Local inhibition is transiently activated in the SM, in an area with the size and new
location of the FOA; this not only yields dynamical shifts of the FOA, by allowing
the next most salient location to subsequently become the winner, but it also prevents
the FOA from immediately returning to a previously-attended location.
Such an “inhibition of return” has been demonstrated in human visual psychophysics. In
order to slightly bias the model to subsequently jump to salient locations spatially close to
the currently-attended location, a small excitation is transiently activated in the SM, in a near
surround of the FOA (“proximity preference” rule of Koch and Ullman). Since it do not
model any top-down attentional component, the FOA is a simple disk whose radius is fixed
to one sixth of the smaller of the input image width or height.
The time constants, conductances, and firing thresholds of the simulated neurons
were chosen so that the FOA jumps from one salient location to the next in approximately
30–70 ms (simulated time), and that an attended area is inhibited for approximately 500–900
ms, as has been observed psychophysically. The difference in the relative magnitude of these
delays proved sufficient to ensure thorough scanning of the image and prevented cycling
through only a limited number of locations. All parameters are fixed in the implementation,
and the system proved stable over time for all images studied.
2.2 BILAYER SEGMENTATION OF WEBCAM VIDEOS
This scheme addresses the problem of extracting a foreground layer from video chat captured
by a (monocular) webcam that closely approximates depth segmentation from a stereo
camera. The foreground is intuitively defined as a subject of video chat, not necessarily
frontal, while the background is literally anything else. Applications for this technique
include background substitution, compression, adaptive bit rate video transmission, and
tracking.
This work focuses on the common scenario of the video chat application. Image layer
extraction has been an active research area. Recent work in this area has produced
compelling, real-time algorithms, based on either stereo or motion. Some previously used
algorithms require initialization in the form of a “clean” image of the background.
Stereo-based segmentation seems to achieve the most robust results, as background
objects are correctly separated from the foreground independently from their motion versus-
stasis characteristics. The goal here is to achieve a similar behavior monocularly. In some
monocular systems, the static background assumption causes inaccurate segmentation in the
presence of camera shake (e.g., for a webcam mounted on a laptop screen), illumination
change, and large objects moving in the background.
Although the algorithm precludes the need for a clean background image, the
segmentation still suffers in the presence of large background motion. Moreover,
initialization is sometimes necessary in the form of global color models. However, in the
video chat application, such rigid models are not capable of accurately describing the
foreground motion. Furthermore, the approach avoids the complexities associated with
optical flow computation.
The algorithm exploits motion and its spatial context as a powerful cue for layer
separation, and the correct level of geometric rigidity is automatically learned from training
data. The algorithm benefits from a novel, quantized motion representation (cluster centroids
of the spatiotemporal derivatives of video frames), referred to as motons. Motons (related to
textons), inspired by recent research in motion modeling and object/material recognition are
combined with shape filters to model long-range spatial correlations (shape).
These new features prove useful for capturing the visual context and filling in missing,
textureless, or motionless regions. Fused motion-shape cues are discriminatively selected by
supervised learning. Key to the technique is a classifier trained on depth-defined layer labels,
such as those used in a stereo setting, as opposed to motion-defined layer labels.
Thus, to induce depth in the absence of stereo while maintaining generalization, the
classifier is forced to combine other available cues accordingly. Combining multiple cues
addresses one of the two aforementioned requirements, robustness, for the bilayer
segmentation. To meet the other requirement, efficiency, a straightforward way is to trade
accuracy for speed if only one type of classifier is used.
However, efficiency can be achieved without sacrificing much accuracy if
multiple types of classifiers, such as AdaBoost, decision trees, random forests, random ferns,
and attention cascades, are available. That is, if the “strong” classifiers are viewed as a
composition of “weak” learners (decision stumps), it is possible to control how the strong
classifiers are constructed to fit the limitations in evaluation time.
This work describes a general taxonomy of classifiers which interprets these common
algorithms as variants of a single tree-based classifier. This taxonomy allows the
different algorithms to be compared fairly in terms of evaluation complexity (time) and the
most efficient or accurate one to be selected for the application at hand.
By fusing motion, shape, color, and contrast with a local smoothness prior in a
conditional random field model, pixelwise binary segmentation is achieved through min-cut.
The result is a segmentation algorithm that is efficient and robust to distracting events and
that requires no initialization.
2.2.1 Notation
Given an input sequence of images, a frame is represented as an array
z = (z1, z2, . . . , zn, . . . , zN) of pixels in the YUV color space, indexed by the pixel position
n. A frame at time t is denoted z^t. Temporal derivatives are denoted
ż = (ż1, ż2, . . . , żn, . . . , żN) and computed as ż_n^t = |G(z_n^t) − G(z_n^(t−1))| at each time t,
with a Gaussian kernel G(.) at a scale of σt pixels. Spatial gradients g = (g1, g2, . . . , gn, . . . , gN),
in which gn = |∇zn|, are computed by convolving the images with derivative-of-Gaussian (DoG)
kernels of width σs. Here, σs = σt = 0.8 is used, approximating a Nyquist sampling filter.
Spatiotemporal derivatives are computed on the Y channel only. Given Om = (g, ż), the
segmentation task is to infer a binary label xn ϵ {Fg, Bg}.
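The derivative definitions above can be sketched as follows, assuming SciPy for the Gaussian
and derivative-of-Gaussian filtering; frames are assumed to be the Y channel supplied as float
arrays, and the function and variable names are illustrative:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    SIGMA = 0.8  # sigma_s = sigma_t = 0.8, approximating a Nyquist sampling filter

    def temporal_derivative(y_t, y_prev):
        """z_dot_n^t = |G(z_n^t) - G(z_n^(t-1))| with a Gaussian kernel G."""
        return np.abs(gaussian_filter(y_t, SIGMA) - gaussian_filter(y_prev, SIGMA))

    def spatial_gradient(y_t):
        """g_n = |grad z_n|, computed with derivative-of-Gaussian kernels."""
        gx = gaussian_filter(y_t, SIGMA, order=(0, 1))   # derivative along x (columns)
        gy = gaussian_filter(y_t, SIGMA, order=(1, 0))   # derivative along y (rows)
        return np.sqrt(gx ** 2 + gy ** 2)

    def motion_observation(y_t, y_prev):
        """Per-pixel motion observation O_m = (g, z_dot), shape (H, W, 2)."""
        return np.stack([spatial_gradient(y_t),
                         temporal_derivative(y_t, y_prev)], axis=-1)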
2.2.2 Motons
The two-dimensional Om is computed for all training pixels and then clustered
into M clusters using expectation maximization (EM). The M resulting
cluster centroids are called motons. An example with M = 10 motons is shown in Fig. 2.1.
This operation may be interpreted as building a vocabulary of motion-based visual words.
The visual words capture information about the motion and the “edgeness” of image pixels,
rather than their texture content as in textons.
Fig 2.1 Motons. Training spatiotemporal derivatives clustered into 10 cluster centroids (motons).
Different colors for different clusters.
Clustering 1) enables efficient indexing of the joint (g, ż) space while maintaining a
useful correlation between g and ż and 2) reduces sensitivity to noise. Empirically, 6, 10, and
15 clusters were tested with multiple random starts. A dictionary size of just 10 motons
has proven sufficient, and the clusters are generally stable over multiple runs. The moton
representation also yields fewer segmentation errors than the use of Om directly.
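A sketch of the moton dictionary construction is given below, assuming scikit-learn's
Gaussian-mixture EM as the clustering tool and the motion_observation array from the
previous sketch as input; the subsampling of 50,000 training pixels is only an implementation
convenience, not part of the description above:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def learn_motons(observations, n_motons=10, n_sample=50_000, seed=0):
        """Cluster the 2-D (g, z_dot) observations into M motons with EM."""
        pixels = observations.reshape(-1, 2)
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(pixels), size=min(n_sample, len(pixels)), replace=False)
        gmm = GaussianMixture(n_components=n_motons, n_init=3, random_state=seed)
        gmm.fit(pixels[idx])
        return gmm  # gmm.means_ are the moton centroids

    def moton_map(gmm, observations):
        """Assign every pixel of a new frame to its closest moton (ML assignment)."""
        h, w, _ = observations.shape
        return gmm.predict(observations.reshape(-1, 2)).reshape(h, w)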
The observation is that strong edges with low temporal derivatives usually
correspond to background regions, while strong edges with high temporal derivatives are
likely to be in the foreground. Textureless regions tend to have their Fg/Bg log-likelihood
ratio (LLR) close to zero due to uncertainty. Such motion-versus-stasis discrimination
properties are retained by the quantized representation; however, they cannot sufficiently
separate the moving background from the moving foreground.
Given a dictionary of motons, each pixel in a new image can be assigned to its
closest moton by maximum likelihood (ML). Therefore, each pixel can now be replaced by
an index into the small visual dictionary. Then, a moton map can be decomposed into its M
component bands, namely “moton bands”.
Thus, there are M moton bands I^k, k = 1, . . . , M, for each video frame z. Each I^k is a
binary image, with I^k(n) indicating whether the nth pixel has been assigned to the kth moton
or not.
2.2.3 Shape Filters
In the video chat application, the foreground object (usually a person) moves nonrigidly
yet in a structured fashion. This section briefly explains the shape filters and then shows how
to capture the spatial context of motion adaptively. To detect faces using the spatial context
of an image patch (detection window), an overcomplete set of Haar-like shape filters is built,
as shown in Fig. 2.2.
Fig 2.2 Shape Filters (a) A shape filter composed of two rectangles (shown in a detection window centered at
a particular pixel), (b) and (c) Shape filters composed of more rectangles.
The sum of the pixel values in the black rectangles is subtracted from the sum of the
pixel values in the white rectangles. The resulting value of the subtraction is the feature
response used for classification. The shape filter is applied only to a gray-scale image, that
is, the “pixel value” in that paper corresponds to image intensity.
In speaker detection, similar shape filters are applied to a gray-scale image, a frame
difference, and a frame running average. In object categorization, texton layout filters
generalize the shape filters by randomly generating the coordinates of the white and black
rectangles and applying the filters to randomly selected texton channels (bands). In
experiments across multiple applications, shape filters produce simple but effective features
for classification.
In order to efficiently compute the feature value of the shape filters, integral image
processing can be used. An integral image ii(x, y, k) is the sum of the pixel values in the
rectangle from the top left corner (0, 0) to the pixel (x, y) in image band k (k = 1). Therefore,
computing the sum of the pixel values for an arbitrary rectangle with the top left at (x1, y1)
and the bottom right at (x2, y2) in image band k requires only three operations, ii(x2, y2, k) −
ii(x1, y2, k) − ii(x2, y1, k) + ii(x1, y1, k), that is, constant time complexity for each rectangle
regardless of its size in the shape filter and band index k.
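The constant-time rectangle sum described above can be sketched with a cumulative-sum
integral image; the zero-padded first row and column in the sketch below are a small
implementation convenience that avoids boundary special cases:

    import numpy as np

    def integral_image(band):
        """ii(x, y): sum of pixel values in the rectangle from (0, 0) to (x, y)."""
        ii = np.zeros((band.shape[0] + 1, band.shape[1] + 1), dtype=np.int64)
        ii[1:, 1:] = band.cumsum(axis=0).cumsum(axis=1)
        return ii

    def rect_sum(ii, x1, y1, x2, y2):
        """Sum over the rectangle with top-left (x1, y1) and bottom-right (x2, y2),
        using only three additions/subtractions regardless of rectangle size."""
        return ii[y2 + 1, x2 + 1] - ii[y1, x2 + 1] - ii[y2 + 1, x1] + ii[y1, x1]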
To infer Fg/Bg for a pixel n, the contextual information within a (sliding) detection
window centered at n is used. The size of the detection window is about the size of the video
frame and is fixed for all pixels. Within a detection window, a shape filter is defined as a
moton-rectangle pair (k, r), with k indexing into the dictionary of motons and r indexing a
rectangular mask. First, denote all of the possible shape filters within a detection
window as S*, and then define a whole set of d shape filters S = {(ki, ri)}, i = 1, . . . , d, by
randomly selecting moton-rectangle pairs from S*.
For each pixel position n, the associated feature ψ is computed as follows: Given the
moton k, center the detection window at n and count the number of pixels in I^k that fall in
the offset rectangle mask r. This count is denoted vn(k, r). The feature value ψn(i, j) is
obtained by simply subtracting the moton counts collected for the two shape filters (ki, ri)
and (kj, rj), i.e., ψn(i, j) = vn(ki, ri) − vn(kj, rj). The moton counts vn can be computed
efficiently with one integral image for every moton band I^k. Therefore, given S, by randomly
selecting i, j, and k (1 ≤ i, j ≤ d, 1 ≤ k ≤ M), a total of d² · M² features can be computed at
every pixel n.
2.2.4 The Tree-Cube Taxonomy
Common classification algorithms, such as decision trees, boosting, and random
forests, share the fact that they build “strong” classifiers from a combination of “weak”
learners, often just decision stumps. The main difference among these algorithms is the way
the weak learners are combined, and exploring the difference may lead to an accurate
classifier that is also efficient enough to fit the limitations of the evaluation time. This
section presents a useful framework for constructing strong classifiers by combining weak
learners in different ways to facilitate the analysis of accuracy and efficiency.
The three most common ways to combine weak learners are 1) hierarchically (H),
2) by averaging (A), and 3) by boosting (B), or more generally adaptive reweighting and
combining (ARCing). The origin represents the weak learner (e.g., a decision stump), and
axes H, A, and B represent those three basic combination “moves.”
1. The H-move hierarchically combines weak learners into decision trees. During
training, a new weak learner is iteratively created and attached to a leaf node, if needed,
based on information gain. Evaluation of one instance includes only the weak learners along
the corresponding decision path in the tree. It can be shown that the H-move reduces
classification bias.
2. The B-move, instead, linearly combines weak learners. After the insertion of each
weak learner, the training data are reweighted or resampled such that the last weak learner
has 50 percent accuracy in the new data distribution. Evaluation of one instance includes the
weighted sum of the outputs of all of the weak learners in the booster. Examples of the B-
move include AdaBoost and gentle boost. Boosting reduces the empirical error bound by
perturbing the training data.
3. The A-move creates strong classifiers by averaging the results of many weak
learners. Note that all of the weak learners added by the A-move solve the same problem
while those sequentially added by the H and B-moves solve problems with a different data
distribution. Thus, the main computational advantage in training is that each weak learner
can be learned independently and in parallel. When the weak learner is a random tree, A-
move gives rise to random forests. The A-move also reduces classification variance.
Paths, not vertices, along the edges of the cube in Fig. 2.3 correspond to different
combinations of weak learners and thus to (unlimited) different strong classifiers. If each of
the three basic moves is restricted to being used at most once, the tree-cube taxonomy
produces three order-1 algorithms (excluding the base learner itself), six order-2, and six
order-3 algorithms. Many known algorithms, such as boosting (B), decision trees (H),
boosters of trees (HB), and random forests (HA), are conveniently mapped into paths through
the tree cube. Also note that the widely used attention cascade can be interpreted as a
one-sided tree of boosters (BH). The tree-cube taxonomy also makes it possible to explore
new algorithms (e.g., HAB) and
compare them to other algorithms of the same order (e.g., BHA). Next, the question is which
classifier performs best for the video segmentation application. Following the tree-cube
taxonomy, the focus is on comparing three common order-2 models, HA, HB, and BA, that
is, comparing random forests (RF) of trees, boosters of trees (BT), and ensembles of boosters
(EB), to evaluate the behavior of the different moves. As a sanity check, the performance of
a common order-1 model B, namely the booster of stumps (using gentle boost, denoted GB),
is also evaluated. From this elaborate comparison, a number of insights were gained into
the design of tree-based classifiers. These insights, such as bias/variance reduction,
accuracy/efficiency trade-off, the complexity of the problem, and the labeling quality of the
data, have motivated the design of an order-3 classifier, HBA.
2.2.5 Random Forests Vs Booster Of Trees Vs Ensemble Of Boosters
The base weak learner is the widely used decision stump. A decision stump
applied to the nth pixel takes the form h(n) = a · 1(ψn(i, j) > θ) + b, in which 1(.) is a 0-1
indicator function and ψn(i, j) is the shape filter response for the ith and jth shape filters.
Fig 2.3 The tree-cube taxonomy of classifiers captures many classification algorithms in a single
structure. (Axes of the cube: H — stacking weak learners hierarchically (decision tree); B — reweighting of
training data (booster of weak learners, as in common AdaBoost); A — averaging of weak learners (forest of
weak learners). The origin of the cube is the weak learner itself.)
Decision Tree
When training a tree, the parameters θ, a, and b of a new decision stump are computed at
each iteration for either the least-square error or the maximum entropy gain, as described
later. During testing, the output F(n) of a tree classifier is the output of the leaf node.
Gentle Boost
Out of the many variants of boosting, the focus here is on the Gentle Boost algorithm
because of its robustness properties. For the nth pixel, a strong classifier F(n) is a linear
combination of stumps, F(n) = Σ(l=1 to L) h_l(n). Gentle boost is employed as the B-move
(in Fig. 2.3) algorithm for GB, BT, and EB.
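As an illustration of the weak learner and the B-move, the sketch below shows a decision
stump h(n) = a · 1(ψn(i, j) > θ) + b, a boosted sum of stumps, and the softmax confidence of
(9); the training procedure (choosing θ, a, b and the reweighting schedule) is omitted, and
the container of shape-filter responses indexed by (i, j) is an assumption:

    import numpy as np

    class DecisionStump:
        """Weak learner h(n) = a * 1(psi_n(i, j) > theta) + b."""
        def __init__(self, i, j, theta, a, b):
            self.i, self.j, self.theta, self.a, self.b = i, j, theta, a, b

        def __call__(self, psi):
            """psi: mapping from the filter pair (i, j) to the response psi_n(i, j)."""
            return self.a * float(psi[(self.i, self.j)] > self.theta) + self.b

    def strong_classifier(stumps, psi):
        """B-move: linear combination F(n) = sum_l h_l(n) of the weak learners."""
        return sum(h(psi) for h in stumps)

    def fg_confidence(F):
        """Softmax transformation (9): P(x_n = Fg | O_m) = exp(F) / (1 + exp(F))."""
        return np.exp(F) / (1.0 + np.exp(F))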
Random Forests
A forest is made of many trees, and its output F(n) is the average of the output of all the trees
(the A-move). A random forest is an ensemble of decision trees trained with random
features. In this case, each tree is trained by adding new stumps in the leaf nodes in order to
achieve maximum information gain. However, unlike boosting, the training data of RF are
not reweighted for different trees. In the more complicated order-3 classifier HBA, the
training data are only weighted within each BT, but the BTs are constructed independently.
RF has been applied to recognition problems, such as OCR and keypoint recognition in
vision.
Randomization
The tree-based classifiers are effectively trained by optimizing each stump on only a
few (1,000 in the implementation) randomly selected shape filter features. This reduces the
statistical dependence between weak learners and provides increased efficiency without
significantly affecting their accuracy.
In all three algorithms, the classification confidence is computed by softmax
transformation as follows:
P(xn = Fg|Om) = exp(F(n))/(1 + exp(F(n))) (9)
2.2.6 Layer Segmentation
Segmentation is cast as an energy minimization problem in which the energy to be
minimized is similar to the stereo case, the only difference being that the stereo-match unary
potential U^M is replaced by the motion-shape unary potential U^MS:

U^MS(Om, x; Θ) = Σ(n=1 to N) log(P(xn | Om))   (10)

in which P is from (9). The CRF energy is as follows:

E(Om, z, x; Θ) = γ^MS U^MS(Om, x; Θ) + γ^C U^C(z, x; Θ) + V(z, x; Θ)   (11)

U^C is the color potential (a combination of global and pixelwise contributions), and V is the
widely used contrast-sensitive spatial smoothness term. Model parameters are incorporated
in Θ. The relative weights γ^MS and γ^C are optimized discriminatively from the training
data. The final segmentation is inferred by a binary min-cut. No complex temporal models
are used here. Finally, because many segmentation errors are caused by strong background
edges, background edge abating, which adaptively “attenuates” background edges, could also
be exploited here if a pixelwise background model were learned on the fly.
3. EXPLORING VISUAL AND MOTION SALIENCY FOR
AUTOMATIC VIDEO OBJECT EXTRACTION
Humans can easily determine the subject of interest in a video, even though that
subject is presented in an unknown or cluttered background or even has never been seen
before. With the complex cognitive capabilities exhibited by human brains, this process can
be interpreted as simultaneous extraction of both foreground and background information
from a video.
Many researchers have been working toward closing the gap between human and
computer vision. However, without any prior knowledge on the subject of interest or training
data, it is still very challenging for computer vision algorithms to automatically extract the
foreground object of interest in a video. As a result, if one needs to design an algorithm to
automatically extract the foreground objects from a video, several tasks need to be
addressed.
1) Unknown object category and unknown number of the object instances in a video.
2) Complex or unexpected motion of foreground objects due to articulated parts or
arbitrary poses.
3) Ambiguous appearance between foreground and background regions due to conditions
such as similar color, low contrast, or insufficient lighting.
In practice, it is infeasible to model all possible foreground objects or backgrounds
beforehand. However, if one can extract representative information from either the
foreground or the background (or both) regions of a video, the extracted information can be
utilized to distinguish between foreground and background regions, and thus the task of
foreground object extraction can be addressed. Most of the prior works either consider a
fixed background or assume that the background exhibits dominant motion across video
frames. These assumptions might not be practical for real-world applications, since they
cannot generalize well to videos captured by freely moving cameras with arbitrary
movements. Here, a robust video object extraction (VOE) framework is proposed, which
utilizes both visual and motion saliency information across video frames. The observed
saliency information allows the framework to infer several visual and motion cues for learning foreground
and background models, and a conditional random field (CRF) is applied to automatically
determine the label (foreground or background) of each pixel based on the observed
models. With the ability to preserve both spatial and temporal consistency, the VOE framework
exhibits promising results on a variety of videos and produces quantitatively and
qualitatively satisfactory performance. While the focus here is on VOE for single-concept
videos (i.e., videos in which only one object category of interest is present), the method is
able to deal with multiple object instances (of the same type) with variations in pose, scale,
and so on.
3.1 AUTOMATIC OBJECT MODELING AND EXTRACTION
Most existing unsupervised VOE approaches assume the foreground objects to be
outliers in terms of the observed motion information, so that the induced appearance, color,
and other features can be utilized for distinguishing between foreground and background regions.
However, these methods cannot generalize well to videos captured by freely moving
cameras, as discussed earlier. This work proposes a saliency-based VOE framework which learns
saliency information in both the spatial (visual) and temporal (motion) domains. By advancing
conditional random fields (CRF), the integration of the resulting features can automatically
identify the foreground object without the need to treat either the foreground or the background as
outliers.
3.1.1 Determination of Visual Saliency
To extract the visual saliency of each frame, image segmentation is performed on each video
frame and color and contrast information is extracted. In this work, Turbopixels are advanced for
segmentation, and the resulting image segments (superpixels) are used to perform
saliency detection. The use of Turbopixels produces edge-preserving superpixels
of similar sizes, which achieves improved visual saliency results, as verified later.
For the kth superpixel rk, its saliency score S(rk) is calculated as follows:
S(rk) = Σ_{rk≠ri} exp(Ds(rk, ri)/σs²) ω(ri) Dr(rk, ri) ≈ Σ_{rk≠ri} exp(Ds(rk, ri)/σs²) Dr(rk, ri)   (12)
where Ds is the Euclidean distance between the centroid of rk and that of its surrounding
superpixels ri , while σs controls the width of the kernel. The parameter ω(ri ) is the weight of
the neighbor superpixel ri, which is proportional to the number of pixels in ri. ω(ri) can be
treated as a constant for all superpixels due to the use of Turbopixels (with similar sizes).
The last term Dr(rk,ri) measures the color difference between rk and ri , which is also in terms
of Euclidean distance. Pixel i is considered a salient point if its saliency score satisfies S(i)
> 0.8 · max(S), and the collection of the resulting salient pixels is treated as the
salient point set. Since image pixels which are closer to this salient point set should be
visually more significant than those farther away, the saliency S(i)
of each pixel i is further refined as follows:
S(i) = S(i) · (1 − dist(i)/distmax)   (13)
where S(i) is the original saliency score derived by (12), and dist(i) measures the nearest
Euclidean distance to the salient point set. The term distmax in (13) is determined as the
maximum distance from a pixel of interest to its nearest salient point within an image, and is thus
an image-dependent constant.
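A rough sketch of (12) and (13) follows. It substitutes SLIC superpixels (from scikit-image) for Turbopixels and uses a Euclidean distance transform for dist(i), so the segmentation backend, the decaying form of the spatial kernel, and the parameter values are assumptions rather than the authors' exact setup.

```python
import numpy as np
from skimage.segmentation import slic
from scipy.ndimage import distance_transform_edt

def visual_saliency(frame_rgb, n_segments=300, sigma_s=0.25):
    """Per-pixel visual saliency following (12) and (13) (sketch)."""
    labels = slic(frame_rgb, n_segments=n_segments, start_label=0)
    n_sp = labels.max() + 1
    h, w, _ = frame_rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    # Superpixel centroids (normalized coordinates) and mean colors.
    cen = np.stack([np.array([ys[labels == k].mean() / h,
                              xs[labels == k].mean() / w]) for k in range(n_sp)])
    col = np.stack([frame_rgb[labels == k].mean(axis=0) for k in range(n_sp)])
    # Eq. (12): spatially weighted color contrast between superpixels.
    S = np.zeros(n_sp)
    for k in range(n_sp):
        d_s = np.linalg.norm(cen - cen[k], axis=1)      # spatial distance Ds
        d_r = np.linalg.norm(col - col[k], axis=1)      # color difference Dr
        weight = np.exp(-d_s / sigma_s ** 2)            # assumed decaying spatial kernel
        S[k] = np.sum(np.delete(weight * d_r, k))
    S = (S - S.min()) / (S.max() - S.min() + 1e-8)
    sal = S[labels]                                     # map scores back to pixels
    # Eq. (13): refine by the distance to the salient point set.
    salient = sal > 0.8 * sal.max()
    dist = distance_transform_edt(~salient)
    return sal * (1.0 - dist / (dist.max() + 1e-8))
```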
3.1.2 Extraction of Motion-Induced Cues
There are three steps for extracting motion-induced cues:
1) Determination of Motion Saliency: Unlike prior works which assume that either
foreground or background exhibits dominant motion, this framework aims at extracting
motion salient regions based on the retrieved optical flow information. To detect each
moving part and its corresponding pixels, dense optical-flow forward and backward
propagation is performed at each frame of the video. A moving pixel qt at frame t is determined
by
qt = q_{t,t−1} ∩ q_{t,t+1}   (14)
where q denotes the pixel pair detected by forward or backward optical flow propagation.
Frames that result in a large number of moving pixels are not discarded at this stage, which
makes the setting more practical for real-world videos captured by freely moving cameras.
After determining the moving regions, saliency scores are derived for
each pixel in terms of the associated optical flow information. Inspired by the visual saliency
approach, the algorithms in (12) and (13) are applied to the derived optical flow results to calculate
the motion saliency M(i, t) for each pixel i at frame t, and the saliency score at each frame is
normalized to the range of [0, 1].
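The sketch below gives one simple reading of (14), using OpenCV's Farnebäck dense flow as a stand-in for whichever optical flow method is actually used; the magnitude thresholds and the normalization are illustrative assumptions.

```python
import cv2
import numpy as np

def moving_pixel_mask(prev_gray, cur_gray, next_gray, mag_thresh=1.0):
    """Approximate q_t = q_{t,t-1} ∩ q_{t,t+1}: pixels moving both backward and forward.

    All inputs are single-channel (grayscale) frames.
    """
    flow_bwd = cv2.calcOpticalFlowFarneback(cur_gray, prev_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    flow_fwd = cv2.calcOpticalFlowFarneback(cur_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
    mag_bwd = np.linalg.norm(flow_bwd, axis=2)
    mag_fwd = np.linalg.norm(flow_fwd, axis=2)
    return (mag_bwd > mag_thresh) & (mag_fwd > mag_thresh)

def flow_magnitude_map(flow):
    """Normalize a flow-derived score to [0, 1]; the saliency operator of
    (12)-(13) could be applied to the flow field here instead of the raw magnitude."""
    mag = np.linalg.norm(flow, axis=2)
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-8)
```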
It is worth noting that, when the foreground object exhibits significant movements
(compared to the background), its motion will be easily captured by optical flow and thus the
corresponding motion salient regions can be easily extracted. On the other hand, if the
camera is moving and thus causes significant background movements, the motion
saliency method will still be able to identify motion salient regions (associated with the
foreground object).
The motion saliency derived from the optical flow describes the foreground regions
better than the direct use of the optical flow does. In the example of a surfing video, the
foreground object (the surfer) is significantly more salient than the moving background
in terms of motion.
2) Learning of Shape Cues: Although motion saliency allows motion-salient regions to be
captured within and across video frames, those regions might only correspond to moving parts
of the foreground object within some time interval. If one simply assumes that the foreground
should be near the high-motion-saliency region, as some prior methods do, the entire
foreground object cannot easily be identified.
Since it is typically observed that each moving part of a foreground object forms a
complete sampling of the entire foreground object, part-based shape information
induced by motion cues is advanced for characterizing the foreground object. To describe the motion
salient regions, the motion saliency image is converted into a binary output and shape
information is extracted from the motion salient regions. More precisely, the
aforementioned motion saliency M(i, t) is first binarized into Mask(i, t) using a threshold of 0.25.
Each video frame is divided into disjoint 8 × 8 pixel patches. For each image patch, if
more than 30% of its pixels have high motion saliency (i.e., a pixel value of 1 in the
binarized output), histogram of oriented gradients (HOG) descriptors are computed with 4 × 4
= 16 grids to represent its shape information. To capture scale-invariant shape
information, the resolution of each frame is further reduced and the above process is repeated.
The lowest resolution of the scaled image is chosen as a quarter of that of the original
one. Since the use of sparse representation has been shown to be very effective in many
computer vision tasks, an over-complete codebook is learned and the associated sparse
representation of each HOG descriptor is determined. For a total of N HOG descriptors calculated for the above
motion-salient patches {hn, n = 1, 2, . . . , N} in a p-dimensional space, an over-complete
dictionary D of size p × K with K basis vectors is constructed, and the
corresponding sparse coefficient αn of each HOG descriptor is determined. The sparse coding
problem can be formulated as
min_{D,α} (1/N) Σ_{n=1}^{N} [ (1/2)||hn − Dαn||₂² + λ||αn||₁ ]   (15)
where λ balances the sparsity of αn against the ℓ2-norm reconstruction error. Existing sparse
coding software is used to solve the above problem. Note that each codeword is illustrated by averaging the
image patches with the top 15 αn coefficients.
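The sketch below illustrates the patch-level HOG extraction and the dictionary learning step of (15) using scikit-image's hog and scikit-learn's DictionaryLearning; the patch and grid sizes follow the text, but the solver choice, codebook size, and λ value are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.decomposition import DictionaryLearning

def motion_salient_hogs(gray_frame, motion_mask, patch=8):
    """HOG descriptors (4 x 4 = 16 cells) for 8 x 8 patches whose binarized
    motion saliency covers more than 30% of the pixels."""
    descriptors, positions = [], []
    h, w = gray_frame.shape
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            if motion_mask[y:y + patch, x:x + patch].mean() > 0.3:
                d = hog(gray_frame[y:y + patch, x:x + patch],
                        orientations=8, pixels_per_cell=(2, 2),
                        cells_per_block=(1, 1))
                descriptors.append(d)
                positions.append((y, x))
    return np.array(descriptors), positions

def learn_shape_codebook(hog_descriptors, n_codewords=50, lam=0.1):
    """Approximate (15): learn an over-complete dictionary D and sparse codes alpha."""
    dl = DictionaryLearning(n_components=n_codewords, alpha=lam,
                            transform_algorithm='lasso_lars', transform_alpha=lam)
    alphas = dl.fit_transform(hog_descriptors)   # sparse coefficients alpha_n
    return dl.components_, alphas                # D (K x p) and codes (N x K)
```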
To alleviate the possible presence of background in each codeword k, the binarized
masks of the top 15 patches are combined using the corresponding weights αn to obtain the map
Mk. As a result, the moving pixels within each map (induced by motion saliency) have non-zero
values, while the remaining parts of the patch are considered static background
and are thus zero. After obtaining the dictionary and the masks representing the shape of the
foreground object, they are used to encode all image patches at each frame.
This recovers non-moving regions of the foreground object which do not exhibit
significant motion and thus cannot be detected by motion cues. For each image patch, its
sparse coefficient vector α is derived, and each entry of this vector indicates the contribution
of the corresponding shape codeword. Correspondingly, the associated masks and their weight
coefficients are used to calculate the final mask for each image patch. Finally, the reconstructed
image at frame t using the above maps Mk can be denoted as the foreground shape likelihood X^S_t,
which is calculated as follows:
X^S_t = Σ_{n∈It} Σ_{k=1}^{K} (αn,k · Mk)   (16)
where αn,k is the weight for the nth patch using the kth codeword. X^S_t serves as the
likelihood of the foreground object at frame t in terms of shape information.
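A sketch of the reconstruction in (16): each encoded patch contributes the weighted sum of the codeword masks Mk at its location. The helper names and the final normalization are assumptions carried over from the previous sketch.

```python
import numpy as np

def shape_likelihood(frame_shape, patch_positions, alphas, codeword_masks, patch=8):
    """Accumulate X_t^S = sum_n sum_k alpha_{n,k} * M_k over all encoded patches.

    codeword_masks : (K, patch, patch) array of per-codeword foreground masks M_k
    alphas         : (N, K) sparse codes of the N patches at this frame
    """
    xs = np.zeros(frame_shape, dtype=float)
    for (y, x), alpha_n in zip(patch_positions, alphas):
        # Weighted sum of codeword masks placed at the patch location.
        xs[y:y + patch, x:x + patch] += np.tensordot(alpha_n, codeword_masks, axes=1)
    # Normalize to [0, 1] so the map can be thresholded at 0.5 and used as a likelihood.
    return (xs - xs.min()) / (xs.max() - xs.min() + 1e-8)
```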
3) Learning of Color Cues: Besides the motion-induced shape information, both
foreground and background color information is extracted for improved VOE performance. Based on
the observation and the assumption that each moving part of the foreground object forms a
complete sampling of itself, foreground or background color models cannot be constructed
simply from the visual or motion saliency detection results at each individual frame;
otherwise, foreground regions which are not salient in terms of visual or motion
appearance would be considered background, and the resulting color models would not have
sufficient discriminating capability. In this work, the shape likelihood X^S_t obtained
from the previous step is thresholded at 0.5 to determine the candidate
foreground (FSshape) and background (BSshape) regions.
In other words, the color information of pixels in FSshape is used to calculate the
foreground color GMM, and that of pixels in BSshape to derive the background color GMM. Once
these candidate foreground and background regions are determined, Gaussian mixture
models (GMM) G^C_f and G^C_b are used to model the RGB distributions of the two regions. The
parameters of the GMMs, such as the mean vectors and covariance matrices, are determined by the
expectation-maximization (EM) algorithm. Finally, both foreground and
background color models are integrated with visual saliency and shape likelihood into a unified framework
for VOE.
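A sketch of this color-cue step using scikit-learn's GaussianMixture (which runs EM internally); the number of mixture components is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_models(frame_rgb, shape_likelihood_map, n_components=5):
    """Fit foreground/background RGB GMMs from the thresholded shape likelihood."""
    fg_pixels = frame_rgb[shape_likelihood_map > 0.5].reshape(-1, 3)
    bg_pixels = frame_rgb[shape_likelihood_map <= 0.5].reshape(-1, 3)
    gmm_fg = GaussianMixture(n_components=n_components).fit(fg_pixels)
    gmm_bg = GaussianMixture(n_components=n_components).fit(bg_pixels)
    return gmm_fg, gmm_bg

def color_likelihoods(frame_rgb, gmm_fg, gmm_bg):
    """Per-pixel foreground/background color likelihoods (density values)."""
    flat = frame_rgb.reshape(-1, 3).astype(float)
    lf = np.exp(gmm_fg.score_samples(flat)).reshape(frame_rgb.shape[:2])
    lb = np.exp(gmm_bg.score_samples(flat)).reshape(frame_rgb.shape[:2])
    return lf, lb
```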
3.2 CONDITIONAL RANDOM FIELD FOR VOE
The following subsections describe how the conditional random field integrates the saliency-induced features for VOE.
3.2.1 Feature Fusion via CRF
Utilizing an undirected graph, a conditional random field (CRF) is a powerful technique
for estimating the structural information (e.g., class labels) of a set of variables from the
associated observations. For video foreground object segmentation, CRF has been applied to
predict the label of each observed pixel in an image I. Pixel i in a video frame is associated
with an observation zi, while the hidden node Fi indicates its corresponding label (i.e.,
foreground or background).
In this framework, the label Fi is calculated from the observation zi, while the spatial
coherence between this output and the neighboring observations zj and labels Fj is
simultaneously taken into consideration. Therefore, predicting the label of an observation
node is equivalent to maximizing the following posterior probability function
p(F|I,ψ) ∝ exp{−(Σi∈I(ψi ) + Σi∈I, j∈Neighbor(ψi, j ) )} (17)
where ψi is the unary term which infers the likelihood of Fi with observation zi . ψi,j is the
pairwise term describing the relationship between neighboring pixels zi and zj, and that
between their predicted output labels Fi and Fj. Note that the observation z can be
represented by a particular feature, or a combination of multiple types of features. To solve a
CRF optimization problem, one can convert the above problem into an energy minimization
task, and the objective energy function E of (17) can be derived as
E = −log(p) = Σ_{i∈I} ψi + Σ_{i∈I, j∈Neighbor} ψi,j = Eunary + Epairwise.   (18)
In the VOE framework, the shape energy function E^S is defined in terms of the shape likelihood X^S_t
(derived by (16)) as one of the unary terms:
E^S = −ws · log(X^S_t).   (19)
In addition to shape information, visual saliency and color cues need to be incorporated into the
introduced CRF framework. As discussed earlier, foreground and background color
models are derived for VOE, and thus the unary term E^C describing color information is defined as
follows:
E^C = wc · (E^CF − E^CB).   (20)
Note that the foreground and background color GMM models G^C_f and G^C_b are utilized to
derive the associated energy terms E^CF and E^CB, which are calculated as
E^CF = −log(Σ_{i∈I} G^C_f(i))
E^CB = −log(Σ_{i∈I} G^C_b(i)).
As for the visual saliency cue at frame t, the visual saliency score St derived in
(13) is converted into the following energy term E^V:
E^V = −wv · log(St).   (21)
Note that in the above equations, the parameters ws, wc, and wv are the weights for the shape, color,
and visual saliency cues, respectively. These weights control the contributions of the
associated energy terms of the CRF model for performing VOE.
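The sketch below shows one way the unary energies (19)-(21) could be assembled into per-pixel foreground/background costs for a graph cut; the particular split of terms between the two labels, the weight values, and the epsilon guard against log(0) are assumptions.

```python
import numpy as np

def unary_energies(shape_like, fg_color_like, bg_color_like, visual_sal,
                   w_s=1.0, w_c=1.0, w_v=1.0, eps=1e-6):
    """Combine shape (19), color (20), and visual saliency (21) terms.

    All inputs are per-pixel maps; returns (cost_fg, cost_bg), so a lower
    cost_fg favors labeling the pixel as foreground.
    """
    e_shape = -w_s * np.log(shape_like + eps)          # E^S
    e_color_fg = -np.log(fg_color_like + eps)          # E^CF
    e_color_bg = -np.log(bg_color_like + eps)          # E^CB
    e_visual = -w_v * np.log(visual_sal + eps)         # E^V
    cost_fg = e_shape + w_c * e_color_fg + e_visual    # cost of choosing foreground
    cost_bg = w_c * e_color_bg                         # cost of choosing background
    return cost_fg, cost_bg
```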
It is also worth noting that disregarding the background color model would limit the performance
of VOE, since the foreground color model alone might not be sufficient for distinguishing
between foreground and background regions. The VOE framework therefore utilizes multiple
types of visual and motion salient features, and experiments confirm the effectiveness and
robustness of the approach on a variety of real-world videos.
3.2.2 Preserving Spatio-Temporal Consistency
In the same shot of a video, an object of interest can be considered as a compact
space-time volume, which exhibits smooth changes in location, scale, and motion across
frames. Therefore, how to preserve spatial and temporal consistency within the extracted
foreground object regions across video frames is a major challenge for VOE. Since there is no
guarantee that combining multiple motion-induced features would address this
problem, additional constraints need to be enforced in the CRF model in order to achieve this
goal.
1) Spatial Continuity for VOE: When applying a pixel-level prediction process for VOE
(as this and some prior VOE methods do), the spatial structure of the extracted foreground
region is typically not considered during the VOE process. This is because the
prediction made for one pixel is not related to those made for its neighboring pixels. To maintain the
spatial consistency of the extracted foreground object, a pairwise term is added to the CRF
framework. The introduced pairwise term Ei,j is defined as
Ei,j = Σ_{i∈I, j∈Neighbor} |Fi − Fj| × (λ1 + λ2 · exp(−||zi − zj||/β)).   (22)
Note that β is set as the averaged pixel color difference over all pairs of neighboring pixels. In
(22), λ1 is a data-independent Ising prior that smooths the predicted labels, and λ2 relaxes
the tendency toward smoothness when the color observations zi and zj form an edge (i.e., when ||zi − zj||
is large). This pairwise term is able to produce coherent labeling results even under low
contrast or blurring effects.
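A sketch of the contrast-sensitive pairwise weight in (22) for horizontally and vertically neighboring pixels; the λ1 and λ2 values are illustrative assumptions.

```python
import numpy as np

def pairwise_weights(frame_rgb, lam1=1.0, lam2=5.0):
    """Edge weights lam1 + lam2 * exp(-||z_i - z_j|| / beta) for 4-neighbor pairs.

    beta is the average color difference over all neighboring pairs, as in (22).
    Returns (w_right, w_down): weights between each pixel and its right/down neighbor.
    """
    z = frame_rgb.astype(float)
    d_right = np.linalg.norm(z[:, 1:] - z[:, :-1], axis=2)
    d_down = np.linalg.norm(z[1:, :] - z[:-1, :], axis=2)
    beta = np.concatenate([d_right.ravel(), d_down.ravel()]).mean() + 1e-8
    w_right = lam1 + lam2 * np.exp(-d_right / beta)
    w_down = lam1 + lam2 * np.exp(-d_down / beta)
    return w_right, w_down
```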
2) Temporal Consistency for VOE: Although both visual and motion saliency
information are exploited for determining the foreground object, the motion-induced features such as
shape and foreground/background color GMM models might not be able to describe well the
changes of foreground objects across video frames due to issues such as motion blur, compression
loss, or noise/artifacts present in video frames.
To alleviate this concern, the foreground/background shape likelihood and the CRF
prediction outputs are propagated across video frames to preserve temporal
continuity in the VOE results. To be more precise, when constructing the foreground and
background color GMM models, the corresponding pixel sets FS and BS are produced not only
by the shape likelihoods FSshape and BSshape at the current frame; those at the
previous frame (including the CRF prediction outputs Fforeground and Fbackground) are
considered to update FS and BS as well. In other words, the foreground and background
pixel sets FS and BS at frame t + 1 are updated by
FSt+1 = FSshape(t + 1) ∪ FSshape(t) ∪ Fforeground(t)
BSt+1 = BSshape(t + 1) ∪ BSshape(t) ∪ Fbackground(t)   (23)
where Fforeground(t) indicates the pixels at frame t predicted as foreground, and FSshape(t)
is the set of pixels whose shape likelihood is above 0.5, as described in Section 3.1.2.
Similar remarks apply for Fbackground(t) and BSshape(t). Finally, by integrating (19), (20), (21),
and (22), plus the introduced terms for preserving spatial and temporal information, the
objective energy function (18) can be updated as
E = Eunary + Epairwise = (E^S + E^CF − E^CB + E^V) + Ei,j = E^S + E^C + E^V + Ei,j.   (24)
To minimize (24), one can apply graph-based energy minimization techniques such as max-
flow/min-cut algorithms. When the optimization process is complete, the labeling function
output F would indicate the class label (foreground or background) of each observed pixel at
each frame, and thus the VOE problem is solved accordingly.
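As an illustration of this final step, the sketch below minimizes an energy of the form (24) over a 4-connected pixel grid with the PyMaxflow package (an assumed choice; any max-flow/min-cut implementation would do), reusing the hypothetical unary and pairwise helpers from the previous sketches.

```python
import maxflow  # PyMaxflow: pip install PyMaxflow
import numpy as np

def segment_frame(cost_fg, cost_bg, w_right, w_down):
    """Binary min-cut over a 4-connected pixel grid; returns a foreground mask."""
    h, w = cost_fg.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((h, w))
    # Pairwise (smoothness) edges between horizontal and vertical neighbors.
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                g.add_edge(nodes[y, x], nodes[y, x + 1], w_right[y, x], w_right[y, x])
            if y + 1 < h:
                g.add_edge(nodes[y, x], nodes[y + 1, x], w_down[y, x], w_down[y, x])
    # Unary (t-link) capacities: source capacity = background cost, sink = foreground cost,
    # so that labeling a pixel as foreground (source side) pays cost_fg.
    g.add_grid_tedges(nodes, cost_bg, cost_fg)
    g.maxflow()
    # Pixels in the source segment take the "foreground" label under this convention.
    return ~g.get_grid_segments(nodes)
```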
4. COMPARISON
The saliency-based visual attention model was able to reproduce human
performance for a number of pop-out tasks. When a target differed from an array of
surrounding distractors by its unique orientation, color, intensity, or size, it was always the
first attended location, irrespective of the number of distractors.
In contrast, when the target differed from the distractors only by a conjunction of
features (e.g., it was the only red horizontal bar in a mixed array of red vertical and green
horizontal bars), the search time necessary to find the target increased linearly with the
number of distractors. It is difficult to objectively evaluate the model, because no objective
reference is available for comparison, and observers may disagree on which locations are the
most salient. Despite its simple architecture and feed-forward feature-extraction mechanisms, the
model is capable of strong performance with complex natural scenes.
Without modifying the preattentive feature-extraction stages, the model cannot detect
conjunctions of features. While the system immediately detects a target which differs from
surrounding distracters by its unique size, intensity, color, or orientation, it will fail at
detecting targets salient for unimplemented feature types (e.g., T junctions or line
terminators, for which the existence of specific neural detectors remains controversial).
For simplicity, no recurrent mechanism has been implemented within the
feature maps; hence, the model cannot reproduce phenomena like contour completion and closure,
which are important for certain types of human pop-out. In addition, at present, the model
does not include any magnocellular motion channel, which is known to play a strong role in
human saliency.
The bilayer segmentation of webcam videos reports the lowest classification error
achieved and the corresponding frame rate for each of the three algorithms at their "optimal"
parameter settings according to validation. RF produces the lowest errors; the
improved speed of RF is then evaluated when it is allowed to produce the same error level as GB, BT, and
EB.
The size of the RF ensemble is reduced to observe its (suboptimal) classification error
when RF evaluates at the same speed as GB, EB, and BT. In all cases, RF outperforms the
other classifiers. The median segmentation error with respect to ground truth is
around or below one percent. The segmentation of past frames affects the current frame
only through the learned color model.
Under harsh lighting conditions, unary potentials may not be very strong; thus, the
Ising smoothness term may force the segmentation to “cut through” the shoulder and hair
regions. Similar effects may occur in stationary frames. Noise in the temporal derivatives
also affects the results. This situation can be detected by monitoring the magnitude of the
motion. The bilayer segmentation scheme is capable of inferring the segmentation monocularly even
in the presence of distracting background motion, without the need for manual initialization.
It has accomplished the following:
1) introduced novel visual features that capture motion and motion context
efficiently,
2) provided a general understanding of tree-based classifiers, and
3) determined an efficient and accurate classifier in the form of random forests.
This confirms accurate and robust layer segmentation in videochat sequences, where it achieves the
lowest unary classification error.
Although the color features can be automatically determined from the input video,
these methods still need the user to train object detectors for extracting shape or motion
features. Recently, researchers used some preliminary strokes to manually select the
foreground and background regions to train local classifiers to detect the foreground objects.
While these works produce promising results, it might not be practical for users to manually
annotate a large amount of video data.
The VOE framework was able to extract visually salient regions, even for videos
with low visual contrast. The VOE method automatically extracts foreground objects of
interest without any prior knowledge or the need to collect training data in advance, and it
is able to handle videos captured by a freely moving camera or with complex background motion.
It remains very difficult for unsupervised VOE methods to properly detect the
foreground regions even when multiple types of visual and motion-induced features are
considered. Discrimination between such challenging foreground and background regions
might require observing both visual and motion cues over a longer period.
Alternatively, if the video has sufficient resolution, one can consider utilizing the trajectory
information of extracted local interest points to determine the candidate foreground
regions. In such cases, one can expect improved VOE results.
5. CONCLUSION
The first scheme is a conceptually simple computational model for saliency-driven
focal visual attention. The biological insight guiding its architecture proved efficient in
reproducing some of the performances of primate visual systems. The efficiency of this
approach for target detection critically depends on the feature types implemented. The
framework presented here can consequently be easily tailored to arbitrary tasks through the
implementation of dedicated feature maps.
The second scheme ‘bilayer segmentation’ is capable of inferring bilayer
segmentation monocularly even in the presence of distracting background motion without
the need for manual initialization. It has accomplished the following:
1) introduced novel visual features that capture motion and motion context
efficiently,
2) provided a general understanding of tree-based classifiers, and
3) determined an efficient and accurate classifier in the form of random forests.
Bilayer segmentation extracts the top point of the foreground pixels and uses it as a reference
point for a framing window; this result can be used for the "smart framing" of videos. On the
one hand, this simple head tracker is efficient: it works for both frontal and side views, and it
is not affected by background motion. On the other hand, it is heavily tailored toward the
videochat application, in which it achieves the lowest unary classification error.
The third scheme, the VOE method, was shown to better model the foreground object due to
the fusion of multiple types of saliency-induced features. The major advantage of this
method is that it requires neither prior knowledge of the object of interest (i.e., the need to
collect training data) nor interaction from the users during the segmentation process.
REFERENCES
[1] L. Itti, C. Koch, and E. Niebur, “A model of saliency-based visual attention for rapid
scene analysis,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259,
Nov. 1998.
[2] P. Yin, A. Criminisi, J. M. Winn, and I. A. Essa, “Bilayer segmentation of webcam
videos using tree-based classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1,
pp. 30–42, Jan. 2011.
[3] W.-T. Li, H.-S. Chang, K.-C. Lien, H.-T. Chang, and Y.-C. F. Wang, “Exploring visual and
motion saliency for automatic video object extraction,” IEEE Trans. Image Process., vol. 22,
no. 7, Jul. 2013.
[4] http://en.wikipedia.org/wiki/Image_processing
GLOSSARY
1. Saliency Map: The saliency map (SM) is a topographically arranged map that
represents the visual saliency of a corresponding visual scene.
2. WTA: Winner-take-all is a computational principle applied in computational models
of neural networks by which neurons in a layer compete with each other for activation.
3. Texton: Textons refer to fundamental micro-structures in natural images and are
considered the atoms of pre-attentive human visual perception.
4. CRF: Conditional random fields (CRFs) are a class of statistical modeling
methods often applied in pattern recognition and machine learning, where they are used
for structured prediction.
More Related Content

What's hot

Fundamental steps in image processing
Fundamental steps in image processingFundamental steps in image processing
Fundamental steps in image processingPremaPRC211300301103
 
Image processing
Image processingImage processing
Image processingVarun Raj
 
Intensity Enhancement in Gray Level Images using HSV Color Coding Technique
Intensity Enhancement in Gray Level Images using HSV Color Coding TechniqueIntensity Enhancement in Gray Level Images using HSV Color Coding Technique
Intensity Enhancement in Gray Level Images using HSV Color Coding TechniqueIRJET Journal
 
Dip 1 introduction
Dip 1 introductionDip 1 introduction
Dip 1 introductionManas Mantri
 
Introduction to image processing-Class Notes
Introduction to image processing-Class NotesIntroduction to image processing-Class Notes
Introduction to image processing-Class NotesDr.YNM
 
Image processing
Image processingImage processing
Image processingsree_2099
 
Introduction to Image Processing:Image Modalities
Introduction to Image Processing:Image ModalitiesIntroduction to Image Processing:Image Modalities
Introduction to Image Processing:Image ModalitiesKalyan Acharjya
 
imageprocessing-abstract
imageprocessing-abstractimageprocessing-abstract
imageprocessing-abstractJagadeesh Kumar
 

What's hot (19)

Application of image processing
Application of image processingApplication of image processing
Application of image processing
 
Digital Image Processing
Digital Image ProcessingDigital Image Processing
Digital Image Processing
 
Fundamental steps in image processing
Fundamental steps in image processingFundamental steps in image processing
Fundamental steps in image processing
 
Image processing
Image processingImage processing
Image processing
 
Image processing ppt
Image processing pptImage processing ppt
Image processing ppt
 
Intensity Enhancement in Gray Level Images using HSV Color Coding Technique
Intensity Enhancement in Gray Level Images using HSV Color Coding TechniqueIntensity Enhancement in Gray Level Images using HSV Color Coding Technique
Intensity Enhancement in Gray Level Images using HSV Color Coding Technique
 
Image Processing
Image ProcessingImage Processing
Image Processing
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
P180203105108
P180203105108P180203105108
P180203105108
 
Dip 1 introduction
Dip 1 introductionDip 1 introduction
Dip 1 introduction
 
Dip sdit 7
Dip sdit 7Dip sdit 7
Dip sdit 7
 
Introduction to image processing-Class Notes
Introduction to image processing-Class NotesIntroduction to image processing-Class Notes
Introduction to image processing-Class Notes
 
Image processing
Image processingImage processing
Image processing
 
Introduction to Image Processing:Image Modalities
Introduction to Image Processing:Image ModalitiesIntroduction to Image Processing:Image Modalities
Introduction to Image Processing:Image Modalities
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
Image processing
Image processingImage processing
Image processing
 
imageprocessing-abstract
imageprocessing-abstractimageprocessing-abstract
imageprocessing-abstract
 
Basics of Image processing
Basics of Image processingBasics of Image processing
Basics of Image processing
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 

Viewers also liked

Viewers also liked (13)

2.ack, abstract,contents page deepa
2.ack, abstract,contents page deepa2.ack, abstract,contents page deepa
2.ack, abstract,contents page deepa
 
Design & Force Analysis of Portable Hand Tiller
Design & Force Analysis of  Portable  Hand  Tiller Design & Force Analysis of  Portable  Hand  Tiller
Design & Force Analysis of Portable Hand Tiller
 
1111
11111111
1111
 
RWilliams-CV2016
RWilliams-CV2016RWilliams-CV2016
RWilliams-CV2016
 
Iosrme113
Iosrme113Iosrme113
Iosrme113
 
Worm Gear ppt
Worm Gear pptWorm Gear ppt
Worm Gear ppt
 
Jtc5 worm gear screw lifter,small mechanical worm gear screw jacks, stainless...
Jtc5 worm gear screw lifter,small mechanical worm gear screw jacks, stainless...Jtc5 worm gear screw lifter,small mechanical worm gear screw jacks, stainless...
Jtc5 worm gear screw lifter,small mechanical worm gear screw jacks, stainless...
 
Tillage pattern vs fuel consumption
Tillage pattern vs fuel consumptionTillage pattern vs fuel consumption
Tillage pattern vs fuel consumption
 
Hindustan machine tools limited
Hindustan machine tools limitedHindustan machine tools limited
Hindustan machine tools limited
 
Worm gear
Worm gearWorm gear
Worm gear
 
WORM GEAR PPT
WORM GEAR PPTWORM GEAR PPT
WORM GEAR PPT
 
Horticulture
HorticultureHorticulture
Horticulture
 
Design and analysis of composite drive shaft
Design and analysis of composite drive shaftDesign and analysis of composite drive shaft
Design and analysis of composite drive shaft
 

Similar to 3.introduction onwards deepa

Image Processing Training in Chandigarh
Image Processing Training in Chandigarh Image Processing Training in Chandigarh
Image Processing Training in Chandigarh E2Matrix
 
Matlab Training in Chandigarh
Matlab Training in ChandigarhMatlab Training in Chandigarh
Matlab Training in ChandigarhE2Matrix
 
Matlab Training in Jalandhar | Matlab Training in Phagwara
Matlab Training in Jalandhar | Matlab Training in PhagwaraMatlab Training in Jalandhar | Matlab Training in Phagwara
Matlab Training in Jalandhar | Matlab Training in PhagwaraE2Matrix
 
Evaluation Of Proposed Design And Necessary Corrective Action
Evaluation Of Proposed Design And Necessary Corrective ActionEvaluation Of Proposed Design And Necessary Corrective Action
Evaluation Of Proposed Design And Necessary Corrective ActionSandra Arveseth
 
A Review on Overview of Image Processing Techniques
A Review on Overview of Image Processing TechniquesA Review on Overview of Image Processing Techniques
A Review on Overview of Image Processing Techniquesijtsrd
 
image processing
image processingimage processing
image processingDhriya
 
Unit 1 DIP Fundamentals - Presentation Notes.pdf
Unit 1 DIP Fundamentals - Presentation Notes.pdfUnit 1 DIP Fundamentals - Presentation Notes.pdf
Unit 1 DIP Fundamentals - Presentation Notes.pdfsdbhosale860
 
Digital image processing
Digital image processingDigital image processing
Digital image processingtushar05
 
IRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCR
IRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCRIRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCR
IRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCRIRJET Journal
 
A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...ASWATHY VG
 
EC4160-lect 1,2.ppt
EC4160-lect 1,2.pptEC4160-lect 1,2.ppt
EC4160-lect 1,2.pptssuser812128
 
Multimodel Operation for Visually1.docx
Multimodel Operation for Visually1.docxMultimodel Operation for Visually1.docx
Multimodel Operation for Visually1.docxAROCKIAJAYAIECW
 
Review Paper on Image Processing Techniques
Review Paper on Image Processing TechniquesReview Paper on Image Processing Techniques
Review Paper on Image Processing TechniquesIJSRD
 
DIP-LECTURE_NOTES.pdf
DIP-LECTURE_NOTES.pdfDIP-LECTURE_NOTES.pdf
DIP-LECTURE_NOTES.pdfVaideshSiva1
 
IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...
IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...
IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...cscpconf
 
Improving image resolution through the cra algorithm involved recycling proce...
Improving image resolution through the cra algorithm involved recycling proce...Improving image resolution through the cra algorithm involved recycling proce...
Improving image resolution through the cra algorithm involved recycling proce...csandit
 
Digital_image_processing_-Vijaya_Raghavan.pdf
Digital_image_processing_-Vijaya_Raghavan.pdfDigital_image_processing_-Vijaya_Raghavan.pdf
Digital_image_processing_-Vijaya_Raghavan.pdfVaideshSiva1
 
Image Processing By SAIKIRAN PANJALA
 Image Processing By SAIKIRAN PANJALA Image Processing By SAIKIRAN PANJALA
Image Processing By SAIKIRAN PANJALASaikiran Panjala
 
Quality assessment of resultant images after processing
Quality assessment of resultant images after processingQuality assessment of resultant images after processing
Quality assessment of resultant images after processingAlexander Decker
 

Similar to 3.introduction onwards deepa (20)

Image Processing Training in Chandigarh
Image Processing Training in Chandigarh Image Processing Training in Chandigarh
Image Processing Training in Chandigarh
 
Matlab Training in Chandigarh
Matlab Training in ChandigarhMatlab Training in Chandigarh
Matlab Training in Chandigarh
 
Matlab Training in Jalandhar | Matlab Training in Phagwara
Matlab Training in Jalandhar | Matlab Training in PhagwaraMatlab Training in Jalandhar | Matlab Training in Phagwara
Matlab Training in Jalandhar | Matlab Training in Phagwara
 
Evaluation Of Proposed Design And Necessary Corrective Action
Evaluation Of Proposed Design And Necessary Corrective ActionEvaluation Of Proposed Design And Necessary Corrective Action
Evaluation Of Proposed Design And Necessary Corrective Action
 
A Review on Overview of Image Processing Techniques
A Review on Overview of Image Processing TechniquesA Review on Overview of Image Processing Techniques
A Review on Overview of Image Processing Techniques
 
image processing
image processingimage processing
image processing
 
Unit 1 DIP Fundamentals - Presentation Notes.pdf
Unit 1 DIP Fundamentals - Presentation Notes.pdfUnit 1 DIP Fundamentals - Presentation Notes.pdf
Unit 1 DIP Fundamentals - Presentation Notes.pdf
 
Digital image processing
Digital image processingDigital image processing
Digital image processing
 
IRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCR
IRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCRIRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCR
IRJET- Proposed Approach for Layout & Handwritten Character Recognization in OCR
 
A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...A supervised lung nodule classification method using patch based context anal...
A supervised lung nodule classification method using patch based context anal...
 
EC4160-lect 1,2.ppt
EC4160-lect 1,2.pptEC4160-lect 1,2.ppt
EC4160-lect 1,2.ppt
 
Multimodel Operation for Visually1.docx
Multimodel Operation for Visually1.docxMultimodel Operation for Visually1.docx
Multimodel Operation for Visually1.docx
 
Review Paper on Image Processing Techniques
Review Paper on Image Processing TechniquesReview Paper on Image Processing Techniques
Review Paper on Image Processing Techniques
 
DIP-LECTURE_NOTES.pdf
DIP-LECTURE_NOTES.pdfDIP-LECTURE_NOTES.pdf
DIP-LECTURE_NOTES.pdf
 
IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...
IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...
IMPROVING IMAGE RESOLUTION THROUGH THE CRA ALGORITHM INVOLVED RECYCLING PROCE...
 
Improving image resolution through the cra algorithm involved recycling proce...
Improving image resolution through the cra algorithm involved recycling proce...Improving image resolution through the cra algorithm involved recycling proce...
Improving image resolution through the cra algorithm involved recycling proce...
 
Jc3416551658
Jc3416551658Jc3416551658
Jc3416551658
 
Digital_image_processing_-Vijaya_Raghavan.pdf
Digital_image_processing_-Vijaya_Raghavan.pdfDigital_image_processing_-Vijaya_Raghavan.pdf
Digital_image_processing_-Vijaya_Raghavan.pdf
 
Image Processing By SAIKIRAN PANJALA
 Image Processing By SAIKIRAN PANJALA Image Processing By SAIKIRAN PANJALA
Image Processing By SAIKIRAN PANJALA
 
Quality assessment of resultant images after processing
Quality assessment of resultant images after processingQuality assessment of resultant images after processing
Quality assessment of resultant images after processing
 

More from Safalsha Babu

More from Safalsha Babu (16)

2.doc
2.doc2.doc
2.doc
 
dokumen.tips_gurobusdocx.pdf
dokumen.tips_gurobusdocx.pdfdokumen.tips_gurobusdocx.pdf
dokumen.tips_gurobusdocx.pdf
 
3.doc
3.doc3.doc
3.doc
 
S.A.NiknamandV.SongmeneINLACO2013.pdf
S.A.NiknamandV.SongmeneINLACO2013.pdfS.A.NiknamandV.SongmeneINLACO2013.pdf
S.A.NiknamandV.SongmeneINLACO2013.pdf
 
axial-field-electrical-machines.pptx
axial-field-electrical-machines.pptxaxial-field-electrical-machines.pptx
axial-field-electrical-machines.pptx
 
4.doc
4.doc4.doc
4.doc
 
sanu 39.doc
sanu 39.docsanu 39.doc
sanu 39.doc
 
GJBM_Vol8_No2_2014.pdf
GJBM_Vol8_No2_2014.pdfGJBM_Vol8_No2_2014.pdf
GJBM_Vol8_No2_2014.pdf
 
1.doc
1.doc1.doc
1.doc
 
TIGMIGHybridWeldedSteelJointAReview (2) (1).pdf
TIGMIGHybridWeldedSteelJointAReview (2) (1).pdfTIGMIGHybridWeldedSteelJointAReview (2) (1).pdf
TIGMIGHybridWeldedSteelJointAReview (2) (1).pdf
 
ADITH NEW.pdf
ADITH NEW.pdfADITH NEW.pdf
ADITH NEW.pdf
 
sd project report Final.pdf
sd project report Final.pdfsd project report Final.pdf
sd project report Final.pdf
 
Mihirsenreport.doc
Mihirsenreport.docMihirsenreport.doc
Mihirsenreport.doc
 
Main cover deepa
Main cover  deepaMain cover  deepa
Main cover deepa
 
Certificate deepa
Certificate deepaCertificate deepa
Certificate deepa
 
1. inside cover deepa
1. inside cover  deepa1. inside cover  deepa
1. inside cover deepa
 

3.introduction onwards deepa

  • 1. Extraction of visual and motion saliency for automatic video object 1 Dept. of Computer Science & Engineering AWH Engineering College 1. INTRODUCTION In imaging science, image processing is any form of signal processing for which the input is an image, such as a photograph or video frame; the output of image processing may be either an image or a set of characteristics or parameters related to the image. Most image-processing techniques involve treating the image as a two-dimensional signal and applying standard signal-processing techniques to it. Image processing usually refers to digital image processing, but optical and analog image processing also are possible. Image processing, in its broadest and most literal sense, aims to address the goal of providing practical, reliable and affordable means to allow machines to cope with images while assisting man in his general endeavors. In Electrical Engineering and Computer Science it is any form of signal processing for which the input is an image, such as a video frame; The output of this may be either image or, a set of parameters related to image. In most image processing techniques, image is treated as two dimensional signal. In short, act of examining images for the purpose of identifying objects and judging their significance. Image processing refers to processing of a 2D picture by a computer. Basic definitions: An image defined in the “real world” is considered to be a function of two real variables, for example, a(x,y) with a as the amplitude (e.g. brightness) of the image at the real coordinate position (x,y). Modern digital technology has made it possible to manipulate multi-dimensional signals with systems that range from simple digital circuits to advanced parallel computers. The goal of this manipulation can be divided into three categories: 1) Image Processing (image in -> image out) 2) Image Analysis (image in -> measurements out) 3) Image Understanding (image in -> high-level description out) An image may be considered to contain sub-images sometimes referred to as regions- of-interest, ROIs, or simply regions. This concept reflects the fact that images frequently contain collections of objects each of which can be the basis for a region. In a sophisticated image processing system it should be possible to apply specific image processing operations
  • 2. Extraction of visual and motion saliency for automatic video object 2 Dept. of Computer Science & Engineering AWH Engineering College to selected regions. Thus one part of an image (region) might be processed to suppress motion blur while another part might be processed to improve color rendition. Sequence of image processing: Most usually, image processing systems require that the images be available in digitized form, that is, arrays of finite length binary words. For digitization, the given Image is sampled on a discrete grid and each sample or pixel is quantized using a finite number of bits. The digitized image is processed by a computer. To display a digital image, it is first converted into analog signal, which is scanned onto a display. Closely related to image processing are computer graphics and computer vision. In computer graphics, images are manually made from physical models of objects, environments, and lighting, instead of being acquired (via imaging devices such as cameras) from natural scenes, as in most animated movies. Computer vision, on the other hand, is often considered high-level image processing out of which a machine/computer/software intends to decipher the physical contents of an image or a sequence of images (e.g., videos or 3D full-body magnetic resonance scans). In modern sciences and technologies, images also gain much broader scopes due to the ever growing importance of scientific visualization (of often large-scale complex scientific/experimental data). Examples include microarray data in genetic research, or real- time multi-asset portfolio trading in finance. Before going to processing an image, it is converted into a digital form. Digitization includes sampling of image and quantization of sampled values. After converting the image into bit information, processing is performed. This processing technique may be Image enhancement, Image restoration, and Image compression. 1) Image enhancement: It refers to accentuation, or sharpening, of image features such as boundaries, or contrast to make a graphic display more useful for display & analysis. This process does not increase the inherent information content in data. It includes gray level & contrast manipulation, noise reduction, edge crispening and sharpening, filtering, interpolation and magnification, pseudo coloring, and so on.
  • 3. Extraction of visual and motion saliency for automatic video object 3 Dept. of Computer Science & Engineering AWH Engineering College 2) Image restoration: It is concerned with filtering the observed image to minimize the effect of degradations. Effectiveness of image restoration depends on the extent and accuracy of the knowledge of degradation process as well as on filter design. Image restoration differs from image enhancement in that the latter is concerned with more extraction or accentuation of image features. 3) Image compression: It is concerned with minimizing the number of bits required to represent an image. Application of compression are in broadcast TV, remote sensing via satellite, military communication via aircraft, radar, teleconferencing, facsimile transmission, for educational & business documents, medical images that arise in computer tomography, magnetic resonance imaging and digital radiology, motion, pictures, satellite images, weather maps, geological surveys and so on. 1) Text compression – CCITT GROUP3 & GROUP4 2) Still image compression – JPEG 3) Video image compression – MPEG Digital Image Processing is a rapidly evolving field with growing applications in Science and Engineering. Modern digital technology has made it possible to manipulate multi- dimensional signals. Digital Image Processing has a broad spectrum of applications. They include remote sensing data via satellite, medical image processing, radar, sonar and acoustic image processing and robotics. Uncompressed multimedia graphics, audio and video data require considerable storage capacity and transmission bandwidth. Despite rapid progress in mass-storage density, processor speeds, and digital communication system performance, demand for data storage capacity and data-transmission bandwidth continues to outstrip the capabilities of available technologies. This is a crippling disadvantage during transmission and storage. So there arises a need for data compression of images. There are several Image compression techniques. Two ways of classifying compression techniques are mentioned here :1) Loss less Vs Lossy compression and 2) Predictive Vs Transform coding. For correct diagnosis, the medical images should be displayed with 100% quality. The popular JPEG image compression technique is lossy
  • 4. Extraction of visual and motion saliency for automatic video object 4 Dept. of Computer Science & Engineering AWH Engineering College technique so causes some loss in quality of image. Even though the loss is not a cause of concern for non-medical images, it makes the analysis of medical images a difficult task. So it is not suitable for the compression of medical images. A digital remotely sensed image is typically composed of picture elements (pixels) located at the intersection of each row i and column j in each K bands of imagery. Associated with each pixel is a number known as Digital Number (DN) or Brightness Value (BV), that depicts the average radiance of a relatively small area within a scene. A smaller number indicates low average radiance from the area and the high number is an indicator of high radiant properties of the area. The size of this area effects the reproduction of details within the scene. As pixel size is reduced more scene detail is presented in digital representation. While displaying the different bands of a multispectral data set, images obtained in different bands are displayed in image planes (other than their own) the color composite is regarded as False Color Composite (FCC). High spectral resolution is important when producing color components. For a true color composite an image data used in red, green and blue spectral region must be assigned bits of red, green and blue image processor frame buffer memory. A color infrared composite ‘standard false color composite’ is displayed by placing the infrared, red, green in the red, green and blue frame buffer memory. In this healthy vegetation shows up in shades of red because vegetation absorbs most of green and red energy but reflects approximately half of incident Infrared energy. Urban areas reflect equal portions of NIR, R & G, and therefore they appear as steel grey. Geometric distortions manifest themselves as errors in the position of a pixel relative to other pixels in the scene and with respect to their absolute position within some defined map projection. If left uncorrected, these geometric distortions render any data extracted from the image useless. This is particularly so if the information is to be compared to other data sets, be it from another image or a GIS data set.
  • 5. Extraction of visual and motion saliency for automatic video object 5 Dept. of Computer Science & Engineering AWH Engineering College Distortions occur for many reasons. For instance distortions occur due to changes in platform attitude (roll, pitch and yaw), altitude, earth rotation, earth curvature, panoramic distortion and detector delay. Most of these distortions can be modelled mathematically and are removed before you buy an image. Changes in attitude however can be difficult to account for mathematically and so a procedure called image rectification is performed. Satellite systems are however geometrically quite stable and geometric rectification is a simple procedure based on a mapping transformation relating real ground coordinates, say in easting and northing, to image line and pixel coordinates. Rectification is a process of geometrically correcting an image so that it can be represented on a planar surface , conform to other images or conform to a map. That is, it is the process by which geometry of an image is made planimetric. It is necessary when accurate area, distance and direction measurements are required to be made from the imagery. It is achieved by transforming the data from one grid system into another grid system using ageometric transformation. Rectification is not necessary if there is no distortion in the image. For example, if an image file is produced by scanning or digitizing a paper map that is in the desired projection system, then that image is already planar and does not require rectification unless there is some skew or rotation of the image. Scanning and digitizing produce images that are planar, but do not contain any map coordinate information. These images need only to be geo-referenced, which is a much simpler process than rectification. In many cases, the image header can simply be updated with new map coordinate information. This involves redefining the map coordinate of the upper left corner of the image and the cell size (the area represented by each pixel). Ground Control Points (GCP) are the specific pixels in the input image for which the output map coordinates are known. By using more points than necessary to solve the transformation equations a least squares solution may be found that minimises the sum of the squares of the errors. Care should be exercised when selecting ground control points as their number, quality and distribution affect the result of the rectification . Once the mapping transformation has been determined a procedure called resampling is employed. Resampling matches the coordinates of image pixels to their real
Once the mapping transformation has been determined, a procedure called resampling is employed. Resampling matches the coordinates of image pixels to their real-world coordinates and writes a new image on a pixel-by-pixel basis. Since the grid of pixels in the source image rarely matches the grid of the reference image, the pixels are resampled so that new data file values for the output file can be calculated.

Image enhancement techniques improve the quality of an image as perceived by a human. These techniques are useful because many satellite images, when examined on a color display, give inadequate information for image interpretation. There is no conscious effort to improve the fidelity of the image with regard to some ideal form of the image. A wide variety of techniques exists for improving image quality; contrast stretching, density slicing, edge enhancement and spatial filtering are the most commonly used. Image enhancement is attempted after the image has been corrected for geometric and radiometric distortions, and enhancement methods are applied separately to each band of a multispectral image. Digital techniques have proven more satisfactory than photographic ones for image enhancement, because of their precision and the wide variety of available digital processes.

Contrast generally refers to the difference in luminance or grey-level values in an image and is an important characteristic. It can be defined as the ratio of the maximum intensity to the minimum intensity over an image. The contrast ratio has a strong bearing on the resolving power and detectability of an image: the larger this ratio, the easier it is to interpret the image. Satellite images often lack adequate contrast and require contrast improvement.

Contrast enhancement techniques expand the range of brightness values in an image so that the image can be displayed efficiently in a manner desired by the analyst. The density values in a scene are literally pulled farther apart, that is, expanded over a greater range. The effect is to increase the visual contrast between two areas of different uniform densities, which enables the analyst to discriminate easily between areas that initially had only a small difference in density.

The linear contrast stretch is the simplest contrast stretch algorithm. The grey values in the original image and the modified image follow a linear relation. A density number in the low range of the original histogram is assigned to extreme black and a value at the high end is assigned to extreme white; the remaining pixel values are distributed linearly between these extremes. Features or details that were obscure in the original image become clear in the contrast-stretched image.
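A minimal sketch of a linear contrast stretch on an 8-bit band follows; the percentile cut points are an assumption (a plain minimum/maximum stretch is the special case low_pct = 0, high_pct = 100).

import numpy as np

def linear_stretch(band, low_pct=2, high_pct=98):
    lo, hi = np.percentile(band, [low_pct, high_pct])
    stretched = (band.astype(np.float64) - lo) / max(hi - lo, 1e-9)
    # Values below the low cut map to black, values above the high cut to white,
    # and everything in between is distributed linearly.
    return (np.clip(stretched, 0.0, 1.0) * 255).astype(np.uint8)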
The linear contrast stretch operation can also be represented graphically. To provide optimal contrast and color variation in color composites, the small range of grey values in each band is stretched to the full brightness range of the output or display device.

In non-linear contrast enhancement, the input and output data values follow a non-linear transformation. The general form of a non-linear contrast enhancement is y = f(x), where x is the input data value and y is the output data value. Non-linear contrast enhancement techniques have been found useful for enhancing the color contrast between closely related classes and subclasses of a main class.

In this report, Chapter 2 deals with a model of the visual attention system, which builds on a biologically plausible architecture underlying several models, and with an automatic segmentation algorithm for video frames captured by a webcam that closely approximates depth segmentation from a stereo camera. The model of the visual attention system is related to the so-called "feature integration theory", which explains human visual search strategies. Visual input is first decomposed into a set of topographic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master "saliency map", which topographically codes for local conspicuity over the entire visual scene. In primates, such a map is believed to be located in the posterior parietal cortex as well as in the various visual maps of the pulvinar nuclei of the thalamus. The model's saliency map is endowed with internal dynamics which generate attentional shifts.

The automatic segmentation algorithm exploits motion and its spatial context as a powerful cue for layer separation, and the correct level of geometric rigidity is learned automatically from training data. The algorithm benefits from a novel, quantized motion representation (cluster centroids of the spatiotemporal derivatives of video frames), referred to as motons.
Motons (related to textons), inspired by recent research in motion modeling and object/material recognition, are combined with shape filters to model long-range spatial correlations (shape). These new features prove useful for capturing visual context and for filling in missing, textureless, or motionless regions. The fused motion-shape cues are discriminatively selected by supervised learning. Key to the technique is a classifier trained on depth-defined layer labels, such as those used in a stereo setting, as opposed to motion-defined layer labels. Thus, to induce depth in the absence of stereo while maintaining generalization, the classifier is forced to combine the other available cues accordingly.

Combining multiple cues addresses one of the two requirements for bilayer segmentation, robustness. To meet the other requirement, efficiency, a straightforward way is to trade accuracy for speed if only one type of classifier is used. However, efficiency can be achieved without sacrificing much accuracy if multiple types of classifiers, such as AdaBoost, decision trees, random forests, random ferns, and attention cascades, are available. That is, if the "strong" classifiers are viewed as compositions of "weak" learners (decision stumps), it becomes possible to control how the strong classifiers are constructed to fit the limitations on evaluation time. A general taxonomy of classifiers is described which interprets these common algorithms as variants of a single tree-based classifier. This taxonomy makes it possible to compare the different algorithms fairly in terms of evaluation complexity (time) and to select the most efficient or accurate one for the application at hand. By fusing motion-shape, color, and contrast cues with a local smoothness prior in a conditional random field model, pixel-wise binary segmentation is achieved through min-cut. The result is a segmentation algorithm that is efficient, robust to distracting events, and requires no initialization.
Chapter 3 proposes a robust video object extraction (VOE) framework which utilizes both visual and motion saliency information across video frames. The observed saliency information allows several visual and motion cues to be inferred for learning foreground and background models, and a conditional random field (CRF) is applied to automatically determine the label (foreground or background) of each pixel based on the observed models. With the ability to preserve both spatial and temporal consistency, the VOE framework exhibits promising results on a variety of videos and produces quantitatively and qualitatively satisfactory performance. While the focus here is on VOE problems for single-concept videos (i.e., videos in which only one object category of interest is present), the method is able to deal with multiple object instances (of the same type) with variations in pose, scale, etc. Chapter 4 deals with a comparison between the three schemes, and Chapter 5 gives the conclusion.
2. LITERATURE SURVEY

Digital image processing is a rapidly evolving field with growing applications in science and engineering. Modern digital technology has made it possible to manipulate multi-dimensional signals. Digital image processing has a broad spectrum of applications, including remote sensing via satellite, medical image processing, radar, sonar and acoustic image processing, and robotics.

Uncompressed multimedia graphics, audio and video data require considerable storage capacity and transmission bandwidth. Despite rapid progress in mass-storage density, processor speeds, and digital communication system performance, the demand for data storage capacity and data-transmission bandwidth continues to outstrip the capabilities of available technologies. This is a crippling disadvantage during transmission and storage, so there arises a need for image compression. Several image compression techniques exist; two ways of classifying them are:
1) Lossless vs. lossy compression
2) Predictive vs. transform coding
For correct diagnosis, medical images should be displayed with full quality. The popular JPEG image compression technique is lossy and therefore causes some loss in image quality. Although this loss is not a cause for concern in non-medical images, it makes the analysis of medical images difficult, so JPEG is not suitable for compressing medical images.

The first scheme, "Model of Saliency-Based Visual Attention [1]", presents a model of a saliency-based visual attention system inspired by the behavior and the neuronal architecture of the early primate visual system. Multiscale image features are combined into a single topographical saliency map. A dynamical neural network then selects attended locations in order of decreasing saliency. The system breaks down the complex problem of scene understanding by rapidly selecting, in a computationally efficient manner, conspicuous locations to be analyzed in detail.
The second scheme, "Bilayer Segmentation of Webcam Videos [2]", addresses the problem of extracting a foreground layer from video chat captured by a (monocular) webcam in a way that closely approximates depth segmentation from a stereo camera. The foreground is intuitively defined as the subject of the video chat, not necessarily frontal, while the background is literally anything else. Applications of this technique include background substitution, compression, adaptive bit-rate video transmission, and tracking. These applications have at least two requirements:
1) Robust segmentation against strong distracting events, such as people moving in the background, camera shake, or illumination change, and
2) Efficient separation, to attain live streaming speed.

Image processing can be defined as the acquisition and processing of visual information by computer. The computer representation of an image requires the equivalent of many thousands of words of data; this massive amount of data is a primary reason for the development of many sub-areas within the field of computer imaging, such as image compression and segmentation. Another important aspect of computer imaging concerns the ultimate "receiver" of the visual information: in some cases the human visual system, and in others the computer itself. Computer imaging can be separated into two primary categories, computer vision and image processing, which are not totally separate and distinct. The boundaries between them are fuzzy, but this division allows the differences between the two to be explored and shows how they fit together.

One of the major topics within the field of computer vision is image analysis. Image analysis involves the examination of image data to facilitate solving a vision problem. The image analysis process involves two other topics:
1) Feature extraction: the process of acquiring higher-level image information, such as shape or color information.
2) Pattern classification: the act of taking this higher-level information and identifying objects within the image.
Computer vision systems are used in many and varied environments, such as manufacturing systems, the medical community, law enforcement, infrared imaging, and orbiting satellites. Image processing is computer imaging in which the application involves a human being in the visual loop; in other words, the images are to be examined and acted upon by people. The major topics within the field of image processing include:
1) Image restoration
2) Image enhancement
3) Image compression

2.1 MODEL OF SALIENCY-BASED VISUAL ATTENTION

Primates have a remarkable ability to interpret complex scenes in real time, despite the limited speed of the neuronal hardware available for such tasks. Intermediate and higher visual processes appear to select a subset of the available sensory information before further processing, most likely to reduce the complexity of scene analysis. This selection appears to be implemented in the form of a spatially circumscribed region of the visual field, the so-called "focus of attention", which scans the scene both in a rapid, bottom-up, saliency-driven, task-independent manner and in a slower, top-down, volition-controlled, task-dependent manner.

Models of attention include "dynamic routing" models, in which information from only a small region of the visual field can progress through the cortical visual hierarchy. The attended region is selected through dynamic modifications of cortical connectivity or through the establishment of specific temporal patterns of activity, under both top-down (task-dependent) and bottom-up (scene-dependent) control.

The model used here builds on a second biologically plausible architecture, which is at the basis of several models. It is related to the so-called "feature integration theory", explaining human visual search strategies.
Visual input is first decomposed into a set of topographic feature maps. Different spatial locations then compete for saliency within each map, such that only locations which locally stand out from their surround can persist. All feature maps feed, in a purely bottom-up manner, into a master "saliency map", which topographically codes for local conspicuity over the entire visual scene. In primates, such a map is believed to be located in the posterior parietal cortex as well as in the various visual maps of the pulvinar nuclei of the thalamus. The model's saliency map is endowed with internal dynamics which generate attentional shifts. The model consequently represents a complete account of bottom-up saliency and does not require any top-down guidance to shift attention. This framework provides a massively parallel method for the fast selection of a small number of interesting image locations to be analyzed by more complex and time-consuming object-recognition processes. Extending this approach in "guided search", feedback from higher cortical areas (e.g., knowledge about the targets to be found) was used to weight the importance of different features, such that only those with high weights could reach higher processing levels.

Input is provided in the form of static color images, usually digitized at 640 × 480 resolution. Nine spatial scales are created using dyadic Gaussian pyramids, which progressively low-pass filter and subsample the input image, yielding horizontal and vertical image-reduction factors ranging from 1:1 (scale zero) to 1:256 (scale eight) in eight octaves.

Each feature is computed by a set of linear "center-surround" operations akin to visual receptive fields: typical visual neurons are most sensitive in a small region of the visual space (the center), while stimuli presented in a broader, weaker antagonistic region concentric with the center (the surround) inhibit the neuronal response. Such an architecture, sensitive to local spatial discontinuities, is particularly well suited to detecting locations which stand out from their surround and is a general computational principle in the retina, lateral geniculate nucleus, and primary visual cortex.
Center-surround is implemented in the model as the difference between fine and coarse scales: the center is a pixel at scale c ϵ {2, 3, 4}, and the surround is the corresponding pixel at scale s = c + δ, with δ ϵ {3, 4}. The across-scale difference between two maps, denoted "Ө" below, is obtained by interpolation to the finer scale followed by point-by-point subtraction. Using several scales not only for c but also for δ = s − c yields truly multiscale feature extraction, by including different size ratios between the center and surround regions (contrary to previously used fixed ratios).

2.1.1 Extraction of Early Visual Features

With r, g, and b being the red, green, and blue channels of the input image, an intensity image I is obtained as I = (r + g + b)/3. I is used to create a Gaussian pyramid I(σ), where σ ϵ [0..8] is the scale. The r, g, and b channels are normalized by I in order to decouple hue from intensity. However, because hue variations are not perceivable at very low luminance (and hence are not salient), normalization is applied only at locations where I is larger than 1/10 of its maximum over the entire image (other locations yield zero r, g, and b). Four broadly tuned color channels are created: R = r − (g + b)/2 for red, G = g − (r + b)/2 for green, B = b − (r + g)/2 for blue, and Y = (r + g)/2 − |r − g|/2 − b for yellow (negative values are set to zero). Four Gaussian pyramids R(σ), G(σ), B(σ), and Y(σ) are created from these color channels.

Center-surround differences (Ө, defined previously) between a "center" fine scale c and a "surround" coarser scale s yield the feature maps. The first set of feature maps is concerned with intensity contrast, which, in mammals, is detected by neurons sensitive either to dark centers on bright surrounds or to bright centers on dark surrounds. Here, both types of sensitivities are simultaneously computed (using a rectification) in a set of six maps I(c, s), with c ϵ {2, 3, 4} and s = c + δ, δ ϵ {3, 4}:

I(c, s) = |I(c) Ө I(s)|. (1)
A second set of maps is similarly constructed for the color channels, which, in cortex, are represented using a so-called "color double-opponent" system: in the center of their receptive fields, neurons are excited by one color (e.g., red) and inhibited by another (e.g., green), while the converse is true in the surround. Such spatial and chromatic opponency exists for the red/green, green/red, blue/yellow, and yellow/blue color pairs in human primary visual cortex. Accordingly, maps RG(c, s) are created in the model to simultaneously account for red/green and green/red double opponency (2), and BY(c, s) for blue/yellow and yellow/blue double opponency (3):

RG(c, s) = |(R(c) − G(c)) Ө (G(s) − R(s))| (2)
BY(c, s) = |(B(c) − Y(c)) Ө (Y(s) − B(s))|. (3)

Local orientation information is obtained from I using oriented Gabor pyramids O(σ, θ), where σ ϵ [0..8] represents the scale and θ ϵ {0°, 45°, 90°, 135°} is the preferred orientation. (Gabor filters, which are the product of a cosine grating and a 2D Gaussian envelope, approximate the receptive-field sensitivity profile (impulse response) of orientation-selective neurons in primary visual cortex.) Orientation feature maps O(c, s, θ) encode, as a group, local orientation contrast between the center and surround scales:

O(c, s, θ) = |O(c, θ) Ө O(s, θ)|. (4)

In total, 42 feature maps are computed: six for intensity, 12 for color, and 24 for orientation.
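For concreteness, the following is a minimal sketch, not the authors' implementation, of the pyramid construction, color-channel computation, and center-surround differencing of Eqs. (1)-(3). It assumes OpenCV and NumPy; pyrDown stands in for the dyadic Gaussian pyramid, and the orientation (Gabor) channel is omitted for brevity.

import cv2
import numpy as np

def gaussian_pyramid(img, levels=9):
    pyr = [img.astype(np.float32)]
    for _ in range(levels - 1):
        pyr.append(cv2.pyrDown(pyr[-1]))                    # scale 0 .. scale 8
    return pyr

def across_scale_diff(pyr_center, pyr_surround, c, s):
    """The across-scale difference Ө: upsample the coarse map to scale c, subtract, rectify."""
    fine, coarse = pyr_center[c], pyr_surround[s]
    coarse_up = cv2.resize(coarse, (fine.shape[1], fine.shape[0]),
                           interpolation=cv2.INTER_LINEAR)
    return np.abs(fine - coarse_up)

def early_feature_maps(bgr):
    b, g, r = [ch.astype(np.float32) for ch in cv2.split(bgr)]
    I = (r + g + b) / 3.0
    mask = I > 0.1 * I.max()                                # decouple hue from intensity
    denom = np.where(mask, I, 1.0)
    r_n, g_n, b_n = [np.where(mask, ch / denom, 0.0) for ch in (r, g, b)]
    R = np.clip(r_n - (g_n + b_n) / 2, 0, None)             # broadly tuned color channels
    G = np.clip(g_n - (r_n + b_n) / 2, 0, None)
    B = np.clip(b_n - (r_n + g_n) / 2, 0, None)
    Y = np.clip((r_n + g_n) / 2 - np.abs(r_n - g_n) / 2 - b_n, 0, None)
    I_pyr = gaussian_pyramid(I)
    rg_pyr = gaussian_pyramid(R - G)                        # red/green opponent signal
    gr_pyr = gaussian_pyramid(G - R)
    by_pyr = gaussian_pyramid(B - Y)                        # blue/yellow opponent signal
    yb_pyr = gaussian_pyramid(Y - B)
    intensity_maps, color_maps = [], []
    for c in (2, 3, 4):
        for s in (c + 3, c + 4):
            intensity_maps.append(across_scale_diff(I_pyr, I_pyr, c, s))    # Eq. (1)
            color_maps.append(across_scale_diff(rg_pyr, gr_pyr, c, s))      # Eq. (2)
            color_maps.append(across_scale_diff(by_pyr, yb_pyr, c, s))      # Eq. (3)
    return intensity_maps, color_maps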
2.1.2 The Saliency Map

The purpose of the saliency map is to represent the conspicuity (or "saliency") at every location in the visual field by a scalar quantity and to guide the selection of attended locations based on the spatial distribution of saliency. A combination of the feature maps provides bottom-up input to the saliency map, modeled as a dynamical neural network.

One difficulty in combining different feature maps is that they represent a priori non-comparable modalities, with different dynamic ranges and extraction mechanisms. Also, because all 42 feature maps are combined, salient objects appearing strongly in only a few maps may be masked by noise or by less salient objects present in a larger number of maps. In the absence of top-down supervision, a map normalization operator N(.) is proposed, which globally promotes maps in which a small number of strong peaks of activity (conspicuous locations) is present, while globally suppressing maps which contain numerous comparable peak responses. N(.) consists of 1) normalizing the values in the map to a fixed range [0..M], in order to eliminate modality-dependent amplitude differences; 2) finding the location of the map's global maximum M and computing the average m of all its other local maxima; and 3) globally multiplying the map by (M − m)². Only local maxima of activity are considered, so that N(.) compares the responses associated with meaningful "activation spots" in the map and ignores homogeneous areas. Comparing the maximum activity in the entire map to the average overall activation measures how different the most active location is from the average. When this difference is large, the most active location stands out and the map is strongly promoted; when the difference is small, the map contains nothing unique and is suppressed.

The biological motivation behind the design of N(.) is that it coarsely replicates cortical lateral inhibition mechanisms, in which neighboring similar features inhibit each other via specific, anatomically defined connections.
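A minimal sketch of the normalization operator N(.) follows, assuming a 2-D NumPy feature map; the local-maximum neighborhood size is an assumption, since the text does not specify how local maxima are detected.

import numpy as np
from scipy.ndimage import maximum_filter

def normalize_map(feature_map, M=1.0, neighborhood=7):
    fmap = feature_map.astype(np.float64)
    fmap -= fmap.min()                                     # 1) rescale to the fixed range [0, M]
    if fmap.max() > 0:
        fmap = fmap / fmap.max() * M
    local_max = (maximum_filter(fmap, size=neighborhood) == fmap) & (fmap > 0)
    peaks = fmap[local_max]                                # 2) global max and the other local maxima
    global_max = peaks.max() if peaks.size else 0.0
    others = peaks[peaks < global_max]
    mean_other = others.mean() if others.size else 0.0
    return fmap * (global_max - mean_other) ** 2           # 3) promote maps with one dominant peak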
Feature maps are combined into three "conspicuity maps": Ī for intensity (5), C̄ for color (6), and Ō for orientation (7), at the scale (σ = 4) of the saliency map. They are obtained through across-scale addition, denoted "⊕", which consists of reducing each map to scale four and performing point-by-point addition:

Ī = ⊕(c=2..4) ⊕(s=c+3..c+4) N(I(c, s)) (5)
C̄ = ⊕(c=2..4) ⊕(s=c+3..c+4) [N(RG(c, s)) + N(BY(c, s))] (6)

For orientation, four intermediary maps are first created by combining the six feature maps for a given θ; these are then combined into a single orientation conspicuity map:

Ō = Σ_{θ ϵ {0°, 45°, 90°, 135°}} N( ⊕(c=2..4) ⊕(s=c+3..c+4) N(O(c, s, θ)) ) (7)

The motivation for the creation of three separate channels Ī, C̄, and Ō, and for their individual normalization, is the hypothesis that similar features compete strongly for saliency, while different modalities contribute independently to the saliency map. The three conspicuity maps are normalized and summed into the final input S to the saliency map:

S = 1/3 (N(Ī) + N(C̄) + N(Ō)) (8)
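Continuing the sketches above (and reusing normalize_map() together with a resize-based across-scale addition), the combination of Eqs. (5)-(8) could look as follows; the grouping of the orientation maps by angle is assumed to be supplied by the caller.

import cv2
import numpy as np

def across_scale_add(maps, shape_at_scale4):
    acc = np.zeros(shape_at_scale4, dtype=np.float64)
    for m in maps:
        acc += cv2.resize(m.astype(np.float32),
                          (shape_at_scale4[1], shape_at_scale4[0]),
                          interpolation=cv2.INTER_LINEAR)
    return acc

def saliency_input(intensity_maps, color_maps, orientation_maps_by_angle, shape_at_scale4):
    I_bar = across_scale_add([normalize_map(m) for m in intensity_maps], shape_at_scale4)   # Eq. (5)
    C_bar = across_scale_add([normalize_map(m) for m in color_maps], shape_at_scale4)       # Eq. (6)
    O_bar = np.zeros(shape_at_scale4)
    for theta_maps in orientation_maps_by_angle:            # one list of maps per angle, Eq. (7)
        O_bar += normalize_map(across_scale_add([normalize_map(m) for m in theta_maps],
                                                shape_at_scale4))
    return (normalize_map(I_bar) + normalize_map(C_bar) + normalize_map(O_bar)) / 3.0        # Eq. (8)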
At any given time, the maximum of the saliency map (SM) defines the most salient image location, to which the focus of attention (FOA) should be directed. One could simply select the most active location as the point where the model should next attend. However, in a neuronally plausible implementation, the SM is modeled as a 2D layer of leaky integrate-and-fire neurons at scale four. These model neurons consist of a single capacitance which integrates the charge delivered by synaptic input, a leakage conductance, and a voltage threshold. When the threshold is reached, a prototypical spike is generated and the capacitive charge is shunted to zero. The SM feeds into a biologically plausible 2D "winner-take-all" (WTA) neural network at scale σ = 4, in which synaptic interactions among units ensure that only the most active location remains, while all other locations are suppressed.

The neurons in the SM receive excitatory inputs from S and are all independent. The potential of SM neurons at more salient locations hence increases faster (these neurons are used as pure integrators and do not fire). Each SM neuron excites its corresponding WTA neuron. All WTA neurons also evolve independently of each other, until one (the "winner") first reaches threshold and fires. This triggers three simultaneous mechanisms:
1) The FOA is shifted to the location of the winner neuron;
2) The global inhibition of the WTA is triggered and completely inhibits (resets) all WTA neurons;
3) Local inhibition is transiently activated in the SM, in an area with the size and new location of the FOA; this not only yields dynamical shifts of the FOA, by allowing the next most salient location to subsequently become the winner, but also prevents the FOA from immediately returning to a previously attended location.

Such an "inhibition of return" has been demonstrated in human visual psychophysics. In order to slightly bias the model to subsequently jump to salient locations spatially close to the currently attended location, a small excitation is transiently activated in the SM in a near surround of the FOA (the "proximity preference" rule of Koch and Ullman). Since no top-down attentional component is modeled, the FOA is a simple disk whose radius is fixed to one sixth of the smaller of the input image width or height. The time constants, conductances, and firing thresholds of the simulated neurons were chosen so that the FOA jumps from one salient location to the next in approximately 30–70 ms (simulated time), and so that an attended area is inhibited for approximately 500–900 ms, as has been observed psychophysically. The difference in the relative magnitude of these delays proved sufficient to ensure thorough scanning of the image and prevented cycling through only a limited number of locations. All parameters are fixed in the implementation, and the system proved stable over time for all images studied.

2.2 BILAYER SEGMENTATION OF WEBCAM VIDEOS

This scheme addresses the problem of extracting a foreground layer from video chat captured by a (monocular) webcam in a way that closely approximates depth segmentation from a stereo camera. The foreground is intuitively defined as the subject of the video chat, not necessarily frontal, while the background is literally anything else. Applications of this technique include background substitution, compression, adaptive bit-rate video transmission, and tracking. The focus is on the common scenario of the video-chat application. Image layer extraction has been an active research area, and recent work has produced compelling, real-time algorithms based on either stereo or motion. Some previously used algorithms require initialization in the form of a "clean" image of the background.
Stereo-based segmentation seems to achieve the most robust results, as background objects are correctly separated from the foreground independently of their motion-versus-stasis characteristics. This scheme aims at achieving similar behavior monocularly. In some monocular systems, the static-background assumption causes inaccurate segmentation in the presence of camera shake (e.g., for a webcam mounted on a laptop screen), illumination change, and large objects moving in the background. Although some algorithms preclude the need for a clean background image, their segmentation still suffers in the presence of large background motion. Moreover, initialization is sometimes necessary in the form of global color models; in the video-chat application, however, such rigid models are not capable of accurately describing the foreground motion. Furthermore, the complexities associated with optical flow computation are avoided.

The algorithm exploits motion and its spatial context as a powerful cue for layer separation, and the correct level of geometric rigidity is learned automatically from training data. The algorithm benefits from a novel, quantized motion representation (cluster centroids of the spatiotemporal derivatives of video frames), referred to as motons. Motons (related to textons), inspired by recent research in motion modeling and object/material recognition, are combined with shape filters to model long-range spatial correlations (shape). These new features prove useful for capturing visual context and for filling in missing, textureless, or motionless regions. The fused motion-shape cues are discriminatively selected by supervised learning. Key to the technique is a classifier trained on depth-defined layer labels, such as those used in a stereo setting, as opposed to motion-defined layer labels. Thus, to induce depth in the absence of stereo while maintaining generalization, the classifier is forced to combine the other available cues accordingly.

Combining multiple cues addresses one of the two aforementioned requirements for bilayer segmentation, robustness. To meet the other requirement, efficiency, a straightforward way is to trade accuracy for speed if only one type of classifier is used.
However, efficiency can be achieved without sacrificing much accuracy if multiple types of classifiers, such as AdaBoost, decision trees, random forests, random ferns, and attention cascades, are available. That is, if the "strong" classifiers are viewed as compositions of "weak" learners (decision stumps), it becomes possible to control how the strong classifiers are constructed to fit the limitations on evaluation time. A general taxonomy of classifiers is described which interprets these common algorithms as variants of a single tree-based classifier. This taxonomy makes it possible to compare the different algorithms fairly in terms of evaluation complexity (time) and to select the most efficient or accurate one for the application at hand. By fusing motion-shape, color, and contrast cues with a local smoothness prior in a conditional random field model, pixel-wise binary segmentation is achieved through min-cut. The result is a segmentation algorithm that is efficient, robust to distracting events, and requires no initialization.

2.2.1 Notation

Given an input sequence of images, a frame is represented as an array z = (z_1, z_2, . . . , z_n, . . . , z_N) of pixels in the YUV color space, indexed by the pixel position n. A frame at time t is denoted z^t. Temporal derivatives ż = (ż_1, ż_2, . . . , ż_n, . . . , ż_N) are computed at each time t as ż_n^t = |G(z_n^t) − G(z_n^(t−1))|, with a Gaussian kernel G(.) at a scale of σ_t pixels. Spatial gradients g = (g_1, g_2, . . . , g_n, . . . , g_N), in which g_n = |∇z_n|, are computed by convolving the images with derivative-of-Gaussian (DoG) kernels of width σ_s. Here σ_s = σ_t = 0.8 is used, approximating a Nyquist sampling filter. Spatiotemporal derivatives are computed on the Y channel only. Given O_m = (g, ż), the segmentation task is to infer a binary label x_n ϵ {Fg, Bg} for every pixel.

2.2.2 Motons

The two-dimensional O_m is computed for all training pixels, and the values are then clustered into M clusters using expectation maximization (EM). The M resulting cluster centroids are called motons; an example with M = 10 motons is shown in Fig. 2.1. This operation may be interpreted as building a vocabulary of motion-based visual words. The visual words capture information about the motion and the "edgeness" of image pixels, rather than their texture content as in textons.

Fig 2.1 Motons. Training spatiotemporal derivatives clustered into 10 cluster centroids (motons); different colors denote different clusters.
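The following is a minimal sketch of building a moton vocabulary under the notation above, assuming grayscale (Y-channel) frames as NumPy arrays; scikit-learn's GaussianMixture is used as a stand-in for the EM clustering, and σ = 0.8 follows the text.

import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.mixture import GaussianMixture

def spatiotemporal_features(prev_frame, frame, sigma=0.8):
    smoothed = gaussian_filter(frame.astype(np.float64), sigma)
    smoothed_prev = gaussian_filter(prev_frame.astype(np.float64), sigma)
    z_dot = np.abs(smoothed - smoothed_prev)               # temporal derivative |G(z_t) - G(z_{t-1})|
    gy, gx = np.gradient(smoothed)                         # spatial gradient of the smoothed Y channel
    g = np.hypot(gx, gy)
    return np.stack([g.ravel(), z_dot.ravel()], axis=1)    # per-pixel O_m = (g, z_dot)

def learn_motons(features, M=10):
    gmm = GaussianMixture(n_components=M, covariance_type='full', random_state=0)
    gmm.fit(features)                                      # EM clustering; the means act as motons
    return gmm

def moton_bands(gmm, features, frame_shape, M=10):
    labels = gmm.predict(features).reshape(frame_shape)
    return [(labels == k).astype(np.uint8) for k in range(M)]   # binary moton bands I_k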
Clustering 1) enables efficient indexing of the joint (g, ż) space while maintaining a useful correlation between g and ż, and 2) reduces sensitivity to noise. Dictionaries of 6, 10, and 15 clusters were tested empirically with multiple random starts; a dictionary size of just 10 motons proved sufficient, and the clusters are generally stable across runs. The moton representation also yields fewer segmentation errors than using O_m directly.

The underlying observation is that strong edges with low temporal derivatives usually correspond to background regions, while strong edges with high temporal derivatives are likely to be in the foreground. Textureless regions tend to have a Fg/Bg log-likelihood ratio (LLR) close to zero due to uncertainty. Such motion-versus-stasis discrimination properties are retained by the quantized representation; however, they cannot by themselves separate a moving background from a moving foreground.

Given a dictionary of motons, each pixel in a new image can be assigned to its closest moton by maximum likelihood (ML). Therefore, each pixel can now be replaced by an index into the small visual dictionary. A moton map can then be decomposed into its M component bands, the "moton bands". Thus there are M moton bands I^k, k = 1, . . . , M, for each video frame z. Each I^k is a binary image, with I^k(n) indicating whether the nth pixel has been assigned to the kth moton or not.
2.2.3 Shape Filters

In the video-chat application, the foreground object (usually a person) moves non-rigidly yet in a structured fashion. This section briefly explains shape filters and then shows how the spatial context of motion is captured adaptively. To detect faces using the spatial context of an image patch (detection window), an over-complete set of Haar-like shape filters is built, shown in Fig. 2.2.

Fig 2.2 Shape filters. (a) A shape filter composed of two rectangles (shown in a detection window centered at a particular pixel); (b) and (c) shape filters composed of more rectangles. The sum of the pixel values in the black rectangles is subtracted from the sum of the pixel values in the white rectangles; the resulting value is the feature response used for classification.

In the original face-detection setting, the shape filter is applied only to a gray-scale image, that is, the "pixel value" corresponds to image intensity. In speaker detection, similar shape filters are applied to a gray-scale image, a frame difference, and a frame running average. In object categorization, texton layout filters generalize the shape filters by randomly generating the coordinates of the white and black rectangles and applying the filters to randomly selected texton channels (bands). In experiments across multiple applications, shape filters have produced simple but effective features for classification.
In order to compute the feature value of the shape filters efficiently, integral images can be used. An integral image ii(x, y, k) is the sum of the pixel values in the rectangle from the top-left corner (0, 0) to the pixel (x, y) in image band k (k = 1 for a single gray-scale image). Computing the sum of the pixel values for an arbitrary rectangle with top-left corner (x1, y1) and bottom-right corner (x2, y2) in image band k then requires only three operations, ii(x2, y2, k) − ii(x1, y2, k) − ii(x2, y1, k) + ii(x1, y1, k), that is, constant time for each rectangle regardless of its size and of the band index k.

To infer Fg/Bg for a pixel n, the contextual information within a (sliding) detection window centered at n is used. The size of the detection window is about the size of the video frame and is fixed for all pixels. Within a detection window, a shape filter is defined as a moton-rectangle pair (k, r), with k indexing into the dictionary of motons and r indexing a rectangular mask. Denote all possible shape filters within a detection window as S*, and define a set of d shape filters S = {(k_i, r_i)}, i = 1, . . . , d, by randomly selecting moton-rectangle pairs from S*. For each pixel position n, the associated feature ψ is computed as follows: given the moton k, center the detection window at n and count the number of pixels in I^k that fall in the offset rectangle mask r. This count is denoted v_n(k, r). The feature value ψ_n(i, j) is obtained by simply subtracting the moton counts collected for the two shape filters (k_i, r_i) and (k_j, r_j), i.e., ψ_n(i, j) = v_n(k_i, r_i) − v_n(k_j, r_j). The moton counts v_n can be computed efficiently with one integral image per moton band I^k. Therefore, given S, by randomly selecting i, j, and k (1 ≤ i, j ≤ d, 1 ≤ k ≤ M), a total of d²·M² features can be computed at every pixel n.
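A minimal sketch of the integral-image trick and of ψ_n(i, j) follows, assuming binary moton bands as NumPy arrays and rectangles given as offsets relative to the window center; clipping of out-of-image rectangles is omitted for brevity.

import numpy as np

def integral_image(band):
    ii = band.astype(np.int64).cumsum(axis=0).cumsum(axis=1)
    return np.pad(ii, ((1, 0), (1, 0)))          # ii[y, x] = sum over rows < y, cols < x

def rect_sum(ii, x1, y1, x2, y2):
    # Constant time regardless of rectangle size (x2, y2 are exclusive).
    return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]

def shape_filter_response(ii_bands, n_xy, filt_i, filt_j):
    """psi_n(i, j) = v_n(k_i, r_i) - v_n(k_j, r_j); rectangles are offsets from pixel n."""
    x, y = n_xy
    ki, (ax1, ay1, ax2, ay2) = filt_i
    kj, (bx1, by1, bx2, by2) = filt_j
    v_i = rect_sum(ii_bands[ki], x + ax1, y + ay1, x + ax2, y + ay2)
    v_j = rect_sum(ii_bands[kj], x + bx1, y + by1, x + bx2, y + by2)
    return v_i - v_j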
2.2.4 The Tree-Cube Taxonomy

Common classification algorithms, such as decision trees, boosting, and random forests, share the fact that they build "strong" classifiers from a combination of "weak" learners, often just decision stumps. The main difference among these algorithms is the way the weak learners are combined, and exploring this difference may lead to an accurate classifier that is also efficient enough to fit the limitations of the evaluation time. This section presents a useful framework for constructing strong classifiers by combining weak learners in different ways, to facilitate the analysis of accuracy and efficiency.

The three most common ways to combine weak learners are 1) hierarchically (H), 2) by averaging (A), and 3) by boosting (B), or more generally adaptive reweighting and combining (ARCing). The origin of the tree cube represents the weak learner (e.g., a decision stump), and the axes H, A, and B represent those three basic combination "moves".
1. The H-move hierarchically combines weak learners into decision trees. During training, a new weak learner is iteratively created and attached to a leaf node, if needed, based on information gain. Evaluation of one instance includes only the weak learners along the corresponding decision path in the tree. It can be shown that the H-move reduces classification bias.
2. The B-move, instead, linearly combines weak learners. After the insertion of each weak learner, the training data are reweighted or resampled such that the last weak learner has 50 percent accuracy on the new data distribution. Evaluation of one instance includes the weighted sum of the outputs of all of the weak learners in the booster. Examples of the B-move include AdaBoost and gentle boost. Boosting reduces the empirical error bound by perturbing the training data.
3. The A-move creates strong classifiers by averaging the results of many weak learners. Note that all of the weak learners added by the A-move solve the same problem, while those sequentially added by the H- and B-moves solve problems with a different data distribution. Thus, the main computational advantage in training is that each weak learner can be learned independently and in parallel. When the weak learner is a random tree, the A-move gives rise to random forests. The A-move also reduces classification variance.

Paths, not vertices, along the edges of the cube in Fig. 2.3 correspond to different combinations of weak learners and thus to (unlimited) different strong classifiers. If each of the three basic moves is restricted to being used at most once, the taxonomy produces three order-1 algorithms (excluding the base learner itself), six order-2 algorithms, and six order-3 algorithms. Many known algorithms, such as boosting (B), decision trees (H), boosters of trees (HB), and random forests (HA), are conveniently mapped onto paths through the tree cube. Also note that the widely used attention cascade can be interpreted as a one-sided tree of boosters (BH). The tree-cube taxonomy also makes it possible to explore new algorithms (e.g., HAB) and compare them to other algorithms of the same order (e.g., BHA).
Next, the question of which classifier performs best for the video segmentation application is explored. Following the tree-cube taxonomy, the focus is on comparing three common order-2 models, HA, HB, and BA; that is, random forests of trees (RF), boosters of trees (BT), and ensembles of boosters (EB) are compared to evaluate the behavior of the different moves. As a sanity check, the performance of a common order-1 model B, namely the booster of stumps (using gentle boost, denoted GB), is also evaluated. From this comparison a number of insights into the design of tree-based classifiers were gained. These insights, such as bias/variance reduction, the accuracy/efficiency trade-off, the complexity of the problem, and the labeling quality of the data, motivated the design of an order-3 classifier, HBA.

Fig 2.3 The tree-cube taxonomy of classifiers captures many classification algorithms in a single structure. (The origin is the weak learner; axis H stacks weak learners hierarchically into a decision tree, axis B reweights the training data to form a booster of weak learners as in common AdaBoost, and axis A averages weak learners into a forest. Edges of the cube correspond to order-2 combinations such as HB, BH, HA, AH, BA, and AB.)

2.2.5 Random Forests vs. Boosters of Trees vs. Ensembles of Boosters

The base weak learner used is the widely adopted decision stump. A decision stump applied to the nth pixel takes the form h(n) = a · 1(ψ_n(i, j) > θ) + b, in which 1(.) is a 0-1 indicator function and ψ_n(i, j) is the shape-filter response for the ith and jth shape filters.
Decision Tree: When training a tree, θ, a, and b of a new decision stump are computed at each iteration for either least-squares error or maximum entropy gain. During testing, the output F(n) of a tree classifier is the output of the leaf node reached.

Gentle Boost: Out of the many variants of boosting, the focus here is on the Gentle Boost algorithm because of its robustness properties. For the nth pixel, a strong classifier F(n) is a linear combination of stumps, F(n) = Σ_{l=1}^{L} h_l(n). Gentle boost is employed as the B-move (in Fig. 2.3) for GB, BT, and EB.

Random Forests: A forest is made of many trees, and its output F(n) is the average of the outputs of all the trees (the A-move). A random forest is an ensemble of decision trees trained with random features. In this case, each tree is trained by adding new stumps at the leaf nodes in order to achieve maximum information gain. However, unlike boosting, the training data of an RF are not reweighted for different trees. In the more complicated order-3 classifier HBA, the training data are only reweighted within each BT, but the BTs are constructed independently. RFs have been applied to recognition problems such as OCR and keypoint recognition in vision.

Randomization: The tree-based classifiers are trained effectively by optimizing each stump on only a few (1,000 in the implementation) randomly selected shape-filter features. This reduces the statistical dependence between weak learners and provides increased efficiency without significantly affecting accuracy.

In all three algorithms, the classification confidence is computed by a softmax transformation as follows:

P(x_n = Fg | O_m) = exp(F(n)) / (1 + exp(F(n))) (9)
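The sketch below illustrates the weak learner h(n) = a · 1(ψ > θ) + b, a gentle-boost-style combination of stumps, and the softmax of Eq. (9). It operates on a single precomputed feature column and omits the random shape-filter selection and the H/A moves, so it is only an illustration of the B-move, not the authors' training code.

import numpy as np

def fit_stump(psi, y, w, n_thresholds=32):
    """Weighted least-squares regression stump; y in {-1, +1}, w are sample weights."""
    best = None
    for theta in np.quantile(psi, np.linspace(0.05, 0.95, n_thresholds)):
        ind = (psi > theta).astype(np.float64)
        sw_hi = (w * ind).sum()
        sw_lo = w.sum() - sw_hi
        b = (w * y * (1 - ind)).sum() / max(sw_lo, 1e-9)   # weighted mean response where psi <= theta
        a = (w * y * ind).sum() / max(sw_hi, 1e-9) - b     # step height where psi > theta
        err = (w * (y - (a * ind + b)) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, theta, a, b)
    return best[1:]

def gentle_boost(psi, y, rounds=50):
    """Returns the stumps and the accumulated scores F(n)."""
    w = np.full(len(y), 1.0 / len(y))
    F = np.zeros(len(y))
    stumps = []
    for _ in range(rounds):
        theta, a, b = fit_stump(psi, y, w)
        h = a * (psi > theta) + b
        F += h
        w *= np.exp(-y * h)            # gentle-boost reweighting toward misclassified pixels
        w /= w.sum()
        stumps.append((theta, a, b))
    return stumps, F

def foreground_confidence(F):
    return np.exp(F) / (1.0 + np.exp(F))    # Eq. (9): P(x_n = Fg | O_m)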
2.2.6 Layer Segmentation

Segmentation is cast as an energy minimization problem in which the energy to be minimized is similar to that of the stereo-based formulation, the only difference being that the stereo-match unary potential U^M is replaced by the motion-shape unary potential U^MS:

U^MS(O_m, x; Θ) = −Σ_{n=1}^{N} log P(x_n | O_m) (10)

in which P is the classifier confidence from (9). The CRF energy is as follows:

E(O_m, z, x; Θ) = γ_MS U^MS(O_m, x; Θ) + γ_C U^C(z, x; Θ) + V(z, x; Θ) (11)

U^C is the color potential (a combination of global and pixel-wise contributions), and V is the widely used contrast-sensitive spatial smoothness term. Model parameters are collected in Θ. The relative weights γ_MS and γ_C are optimized discriminatively from the training data. The final segmentation is inferred by a binary min-cut. No complex temporal models are used here. Finally, because many segmentation errors are caused by strong background edges, background edge abating, which adaptively "attenuates" background edges, could also be exploited here if a pixel-wise background model were learned on the fly.
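As an illustration of how the terms of Eq. (11) could be assembled, the sketch below computes negative-log unaries from the classifier and color confidences and a contrast-sensitive pairwise weight. The weights γ_MS, γ_C and β are placeholder values, and the final binary min-cut would be delegated to a standard graph-cut solver (e.g., PyMaxflow), which is not shown.

import numpy as np

def unary_costs(p_fg_classifier, p_fg_color, gamma_ms=1.0, gamma_c=0.5, eps=1e-6):
    # Negative log-likelihood costs for labelling each pixel Fg or Bg.
    cost_fg = -(gamma_ms * np.log(p_fg_classifier + eps) + gamma_c * np.log(p_fg_color + eps))
    cost_bg = -(gamma_ms * np.log(1.0 - p_fg_classifier + eps) + gamma_c * np.log(1.0 - p_fg_color + eps))
    return cost_fg, cost_bg

def contrast_weights(frame_yuv, beta=0.05):
    # Contrast-sensitive smoothness V: penalties are weak across strong image edges.
    f = frame_yuv.astype(np.float64)
    diff_h = ((f[:, 1:] - f[:, :-1]) ** 2).sum(axis=-1)
    diff_v = ((f[1:, :] - f[:-1, :]) ** 2).sum(axis=-1)
    return np.exp(-beta * diff_h), np.exp(-beta * diff_v)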
3. EXPLORING VISUAL AND MOTION SALIENCY FOR AUTOMATIC VIDEO OBJECT EXTRACTION

Humans can easily determine the subject of interest in a video, even if that subject is presented against an unknown or cluttered background or has never been seen before. Given the complex cognitive capabilities exhibited by human brains, this process can be interpreted as the simultaneous extraction of both foreground and background information from a video. Many researchers have been working toward closing the gap between human and computer vision. However, without any prior knowledge of the subject of interest or training data, it is still very challenging for computer vision algorithms to automatically extract the foreground object of interest in a video. As a result, an algorithm that automatically extracts foreground objects from a video needs to address several issues:
1) An unknown object category and an unknown number of object instances in the video.
2) Complex or unexpected motion of foreground objects due to articulated parts or arbitrary poses.
3) Ambiguous appearance between foreground and background regions due to similar colors, low contrast, insufficient lighting, and similar conditions.

In practice, it is infeasible to prepare all possible foreground object or background models beforehand. However, if representative information can be extracted from the foreground or background regions (or both) of a video, this information can be utilized to distinguish between foreground and background regions, and thus the task of foreground object extraction can be addressed. Most prior works either consider a fixed background or assume that the background exhibits dominant motion across video frames. These assumptions are not always practical for real-world applications, since they do not generalize well to videos captured by freely moving cameras with arbitrary movements.

Here, a robust video object extraction (VOE) framework is proposed which utilizes both visual and motion saliency information across video frames.
The observed saliency information allows several visual and motion cues to be inferred for learning foreground and background models, and a conditional random field (CRF) is applied to automatically determine the label (foreground or background) of each pixel based on the observed models. With the ability to preserve both spatial and temporal consistency, the VOE framework exhibits promising results on a variety of videos and produces quantitatively and qualitatively satisfactory performance. While the focus here is on VOE problems for single-concept videos (i.e., videos in which only one object category of interest is present), the method is able to deal with multiple object instances (of the same type) with variations in pose, scale, etc.

3.1 AUTOMATIC OBJECT MODELING AND EXTRACTION

Most existing unsupervised VOE approaches regard the foreground objects as outliers in terms of the observed motion information, so that the induced appearance, color, and similar features can be utilized for distinguishing between foreground and background regions. However, these methods cannot generalize well to videos captured by freely moving cameras, as discussed earlier. The proposed saliency-based VOE framework instead learns saliency information in both the spatial (visual) and temporal (motion) domains. By advancing conditional random fields (CRF), the integration of the resulting features can automatically identify the foreground object without the need to treat either foreground or background as outliers.

3.1.1 Determination of Visual Saliency

To extract the visual saliency of each frame, image segmentation is performed on each video frame and color and contrast information is extracted. In this work, Turbopixels are adopted for segmentation, and the resulting image segments (superpixels) are used to perform saliency detection. The use of Turbopixels produces edge-preserving superpixels of similar sizes, which leads to improved visual saliency results, as verified later. For the kth superpixel r_k, its saliency score S(r_k) is calculated as follows:

S(r_k) = Σ_{r_i≠r_k} exp(−D_s(r_k, r_i)/σ_s²) ω(r_i) D_r(r_k, r_i) ≈ Σ_{r_i≠r_k} exp(−D_s(r_k, r_i)/σ_s²) D_r(r_k, r_i) (12)
where D_s is the Euclidean distance between the centroid of r_k and that of a surrounding superpixel r_i, and σ_s controls the width of the kernel. The parameter ω(r_i) is the weight of the neighboring superpixel r_i, which is proportional to the number of pixels in r_i. Due to the use of Turbopixels (with superpixels of similar sizes), ω(r_i) can be treated as a constant for all superpixels. The last term D_r(r_k, r_i) measures the color difference between r_k and r_i, also in terms of Euclidean distance.

A pixel i is considered a salient point if its saliency score satisfies S(i) > 0.8 · max(S), and the collection of the resulting salient pixels is considered a salient point set. Since image pixels which are closer to this salient point set should be visually more significant than those which are farther away, the saliency S(i) of each pixel i is further refined as follows:

S(i) = S(i) · (1 − dist(i)/dist_max) (13)

where S(i) on the right-hand side is the original saliency score derived by (12), and dist(i) measures the nearest Euclidean distance to the salient point set. The value dist_max in (13) is the maximum distance from a pixel of interest to its nearest salient point within an image, and is thus an image-dependent constant.
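A minimal sketch of Eqs. (12)-(13) follows, assuming a precomputed superpixel label map (e.g., SLIC as a stand-in for Turbopixels) and a Lab-color frame as NumPy arrays; the value of σ_s and the normalization of spatial distances are assumptions.

import numpy as np
from scipy.ndimage import distance_transform_edt

def superpixel_saliency(labels, lab_frame, sigma_s=0.4):
    ids = np.unique(labels)
    centroids, colors = [], []
    for k in ids:
        ys, xs = np.nonzero(labels == k)
        centroids.append([ys.mean(), xs.mean()])
        colors.append(lab_frame[ys, xs].mean(axis=0))
    centroids = np.asarray(centroids) / np.asarray(labels.shape, dtype=np.float64)
    colors = np.asarray(colors)
    S = np.zeros(len(ids))
    for k in range(len(ids)):
        Ds = np.linalg.norm(centroids - centroids[k], axis=1)   # spatial distance between centroids
        Dr = np.linalg.norm(colors - colors[k], axis=1)         # color difference
        w = np.exp(-Ds / sigma_s ** 2)                          # Eq. (12); omega treated as constant
        w[k] = 0.0
        S[k] = (w * Dr).sum()
    per_pixel = S[np.searchsorted(ids, labels)]
    rng = per_pixel.max() - per_pixel.min()
    return (per_pixel - per_pixel.min()) / max(rng, 1e-9)

def refine_saliency(per_pixel_S):
    salient = per_pixel_S > 0.8 * per_pixel_S.max()             # salient point set
    dist = distance_transform_edt(~salient)                     # distance to the nearest salient pixel
    return per_pixel_S * (1.0 - dist / max(dist.max(), 1e-9))   # Eq. (13)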
3.1.2 Extraction of Motion-Induced Cues

There are three steps for extracting motion-induced cues.

1) Determination of Motion Saliency: Unlike prior works which assume that either the foreground or the background exhibits dominant motion, this framework aims at extracting motion-salient regions based on the retrieved optical-flow information. To detect each moving part and its corresponding pixels, dense optical-flow forward and backward propagation is performed at each frame of the video. A moving pixel q_t at frame t is determined by

q_t = q_{t,t−1} ∩ q_{t,t+1} (14)

where q denotes the pixel pair detected by forward or backward optical-flow propagation. Frames which produce a large number of moving pixels are not discarded at this stage, which makes the setting more practical for real-world videos captured by freely moving cameras.

After determining the moving regions, saliency scores are derived for each pixel in terms of the associated optical-flow information. Inspired by the visual saliency approach, the algorithms of (12) and (13) are applied to the derived optical-flow results to calculate the motion saliency M(i, t) for each pixel i at frame t, and the saliency score at each frame is normalized to the range [0, 1]. It is worth noting that when the foreground object exhibits significant movement (compared to the background), its motion is easily captured by optical flow, and the corresponding motion-salient regions can be easily extracted. On the other hand, if the camera is moving and thus causes remarkable background movement, the motion saliency method is still able to identify the motion-salient regions associated with the foreground object. The motion saliency derived from the optical flow describes the foreground regions better than the direct use of the optical flow does; in the example sequence, the foreground object (the surfer) is significantly more salient than the moving background in terms of motion.
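The following sketch uses OpenCV's Farneback dense flow as a stand-in for the flow estimator and applies the intersection rule of Eq. (14); the magnitude threshold is an assumption, and the per-pixel motion saliency is simplified to a normalized flow magnitude rather than re-running the full contrast-based saliency of Eq. (12) on the flow field.

import cv2
import numpy as np

def dense_flow(frame_a, frame_b):
    return cv2.calcOpticalFlowFarneback(frame_a, frame_b, None, pyr_scale=0.5,
                                        levels=3, winsize=15, iterations=3,
                                        poly_n=5, poly_sigma=1.2, flags=0)

def moving_mask(prev_gray, gray, next_gray, mag_thresh=0.5):
    fwd = dense_flow(gray, next_gray)                 # forward propagation, q_{t, t+1}
    bwd = dense_flow(gray, prev_gray)                 # backward propagation, q_{t, t-1}
    fwd_mag = np.linalg.norm(fwd, axis=2)
    bwd_mag = np.linalg.norm(bwd, axis=2)
    # Eq. (14): a pixel counts as moving only if both propagations flag it.
    return (fwd_mag > mag_thresh) & (bwd_mag > mag_thresh)

def motion_saliency(flow):
    # Simplified stand-in for M(i, t): normalized flow magnitude in [0, 1].
    mag = np.linalg.norm(flow, axis=2)
    rng = mag.max() - mag.min()
    return (mag - mag.min()) / max(rng, 1e-9)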
2) Learning of Shape Cues: Although motion saliency captures motion-salient regions within and across video frames, those regions might correspond only to the moving parts of the foreground object within some time interval. If the foreground were simply assumed to lie near the high-motion-saliency regions, the entire foreground object could not easily be identified. Since it is typically observed that the moving parts of a foreground object together form a complete sampling of the entire object, part-based shape information induced by the motion cues is used to characterize the foreground object.

To describe the motion-salient regions, the motion saliency image is converted into a binary output and shape information is extracted from the motion-salient regions. More precisely, the motion saliency M(i, t) is first binarized into Mask(i, t) using a threshold of 0.25. Each video frame is divided into disjoint 8 × 8 pixel patches. For each image patch, if more than 30% of its pixels have high motion saliency (i.e., a pixel value of 1 in the binarized output), histogram of oriented gradients (HOG) descriptors are computed with 4 × 4 = 16 grids to represent its shape information. To capture scale-invariant shape information, the resolution of each frame is further reduced and the above process is repeated; the lowest resolution of the scaled image is chosen as a quarter of that of the original one.

Since the use of sparse representation has been shown to be very effective in many computer vision tasks, an over-complete codebook is learned and the associated sparse representation of each HOG descriptor is determined. For a total of N HOG descriptors calculated for the above motion-salient patches {h_n, n = 1, 2, . . . , N} in a p-dimensional space, an over-complete dictionary D of size p × K with K basis vectors is constructed, and the corresponding sparse coefficient α_n of each HOG descriptor is determined. The sparse coding problem can be formulated as

min_{D,α} (1/N) Σ_{n=1}^{N} [ (1/2)||h_n − Dα_n||₂² + λ||α_n||₁ ] (15)

where λ balances the sparsity of α_n against the ℓ2-norm reconstruction error. Publicly available software is used to solve this problem. Each codeword is illustrated by averaging the image patches with the top 15 α_n coefficients. To alleviate the possible presence of background in each codeword k, the binarized masks of the top 15 patches are combined using the corresponding weights α_n to obtain the map M_k. As a result, the moving pixels within each map (induced by motion saliency) have non-zero pixel values, and the remaining parts of the patch are considered static background and are thus zero.

After obtaining the dictionary and the masks representing the shape of the foreground object, they are used to encode all image patches at each frame. This recovers non-moving regions of the foreground object which do not exhibit significant motion and thus cannot be detected by motion cues. For each image patch, its sparse coefficient vector α is derived, and each entry of this vector indicates the contribution of each shape codeword. Correspondingly, the associated masks and their weight coefficients are used to calculate the final mask for each image patch. Finally, the reconstructed image at frame t using the above maps M_k is denoted as the foreground shape likelihood X_t^S, which is calculated as

X_t^S = Σ_{n∈I_t} Σ_{k=1}^{K} (α_{n,k} · M_k) (16)

where α_{n,k} is the weight of the nth patch for the kth codeword. X_t^S serves as the likelihood of the foreground object at frame t in terms of shape information.
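A heavily simplified sketch of this shape-cue pipeline follows: HOG descriptors of motion-salient 8 × 8 patches (4 × 4 cells, matching the 16-grid description), dictionary learning and sparse coding as stand-ins for Eq. (15), and the mask-weighted reconstruction of Eq. (16). skimage and scikit-learn replace the unnamed sparse-coding software, the multi-scale repetition is omitted, and the codeword masks M_k are assumed to be precomputed.

import numpy as np
from skimage.feature import hog
from sklearn.decomposition import DictionaryLearning, sparse_encode

def patch_hogs(gray, saliency_mask, patch=8, min_salient=0.3):
    descs, coords = [], []
    for y in range(0, gray.shape[0] - patch + 1, patch):
        for x in range(0, gray.shape[1] - patch + 1, patch):
            if saliency_mask[y:y + patch, x:x + patch].mean() < min_salient:
                continue                                 # keep only motion-salient patches
            d = hog(gray[y:y + patch, x:x + patch], orientations=8,
                    pixels_per_cell=(2, 2), cells_per_block=(1, 1))   # 4 x 4 = 16 cells
            descs.append(d)
            coords.append((y, x))
    return np.asarray(descs), coords

def learn_codebook(all_descs, K=50, lam=0.1):
    dl = DictionaryLearning(n_components=K, alpha=lam, transform_algorithm='lasso_lars')
    dl.fit(all_descs)                                    # approximates Eq. (15)
    return dl.components_                                # dictionary D, one codeword per row

def shape_likelihood(descs, coords, D, masks, frame_shape, patch=8, lam=0.1):
    alphas = sparse_encode(descs, D, algorithm='lasso_lars', alpha=lam)
    XS = np.zeros(frame_shape)
    for (y, x), a in zip(coords, alphas):
        XS[y:y + patch, x:x + patch] += np.tensordot(a, masks, axes=1)   # Eq. (16)
    return XS / max(XS.max(), 1e-9)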
where α_{n,k} is the weight for the nth patch using the kth codeword. X^S_t serves as the likelihood of the foreground object at frame t in terms of shape information.

3) Learning of Color Cues: Besides the motion-induced shape information, both foreground and background color information are extracted for improved VOE performance. Based on the observation and assumption that each moving part of the foreground object forms a complete sampling of itself over time, foreground or background color models cannot be constructed simply from the visual or motion saliency detection results at each individual frame; otherwise, foreground object regions that are not salient in terms of visual or motion appearance would be treated as background, and the resulting color models would not have sufficient discriminating capability. In this work, the shape likelihood X^S_t obtained in the previous step is thresholded at 0.5 to determine the candidate foreground (FS_shape) and background (BS_shape) regions. In other words, the color information of pixels in FS_shape is used to calculate the foreground color GMM, and that of pixels in BS_shape to derive the background color GMM. Once these candidate foreground and background regions are determined, Gaussian mixture models (GMMs) G^C_f and G^C_b are used to model the RGB distributions of the two regions. The GMM parameters, such as the mean vectors and covariance matrices, are determined by an expectation-maximization (EM) algorithm. Finally, both foreground and background color models are integrated with the visual saliency and shape likelihood into a unified framework for VOE.

3.2 CONDITIONAL RANDOM FIELD FOR VOE

Most existing unsupervised VOE approaches assume the foreground objects to be outliers in terms of the observed motion information, so that the induced appearance, color, and related features can be utilized for distinguishing between foreground and background regions. However, these methods do not generalize well to videos captured by freely moving cameras, as discussed earlier. The proposed saliency-based VOE framework instead learns saliency information in both the spatial (visual) and temporal (motion) domains. By advancing conditional random fields (CRF), the integration of the resulting features can automatically identify the foreground object without the need to treat either foreground or background as outliers. The following subsections describe the methods built on the conditional random field.
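A minimal sketch of this color-model step, assuming scikit-learn's GaussianMixture as the EM-based GMM implementation; the number of mixture components is an assumed value, and frame_rgb / X_S stand for the current color frame and the shape likelihood of (16).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_color_models(frame_rgb, X_S, thr=0.5, n_components=5):
    """Fit foreground/background RGB GMMs (G^C_f, G^C_b) via EM,
    using the shape likelihood thresholded at 0.5 to pick candidate regions."""
    pixels = frame_rgb.reshape(-1, 3).astype(np.float64)
    fg_mask = (X_S.ravel() >= thr)                 # FS_shape: candidate foreground pixels
    bg_mask = ~fg_mask                             # BS_shape: candidate background pixels
    gmm_fg = GaussianMixture(n_components=n_components, covariance_type='full',
                             random_state=0).fit(pixels[fg_mask])
    gmm_bg = GaussianMixture(n_components=n_components, covariance_type='full',
                             random_state=0).fit(pixels[bg_mask])
    return gmm_fg, gmm_bg

# Per-pixel log-likelihoods, later used for the color energy term:
#   loglik_fg = gmm_fg.score_samples(pixels); loglik_bg = gmm_bg.score_samples(pixels)
```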
3.2.1 Feature Fusion via CRF

Built upon an undirected graph, the conditional random field (CRF) is a powerful technique to estimate the structural information (e.g., class labels) of a set of variables from the associated observations. For video foreground object segmentation, the CRF has been applied to predict the label of each observed pixel in an image I. Pixel i in a video frame is associated with an observation z_i, while the hidden node F_i indicates its corresponding label (i.e., foreground or background). In this framework, the label F_i is inferred from the observation z_i, while the spatial coherence between this output and the neighboring observations z_j and labels F_j is simultaneously taken into consideration. Therefore, predicting the label of an observation node is equivalent to maximizing the following posterior probability:

p(F \mid I, \psi) \propto \exp\left\{ -\left( \sum_{i \in I} \psi_i + \sum_{i \in I,\, j \in \mathrm{Neighbor}} \psi_{i,j} \right) \right\}   (17)

where ψ_i is the unary term which infers the likelihood of F_i given the observation z_i, and ψ_{i,j} is the pairwise term describing the relationship between neighboring pixels z_i and z_j and between their predicted output labels F_i and F_j. Note that the observation z can be represented by a particular feature or by a combination of multiple types of features. To solve a CRF optimization problem, one can convert it into an energy minimization task, and the objective energy function E of (17) can be derived as

E = -\log(p) = \sum_{i \in I} \psi_i + \sum_{i \in I,\, j \in \mathrm{Neighbor}} \psi_{i,j} = E_{unary} + E_{pairwise}.   (18)

In the VOE framework, the shape energy function E_S, defined in terms of the shape likelihood X^S_t derived in (16), serves as one of the unary terms:

E_S = -w_s \log(X^S_t).   (19)
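For illustration, the shape unary of (19) can be written as a per-pixel energy map; the weight value and the small epsilon guard against log(0) are assumptions, not from the text.

```python
import numpy as np

def shape_unary(X_S, w_s=1.0, eps=1e-6):
    """Per-pixel shape energy E_S = -w_s * log(X^S_t), cf. (19).
    Pixels with high shape likelihood receive low (foreground-friendly) energy."""
    return -w_s * np.log(np.clip(X_S, eps, 1.0))
```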
In addition to shape information, visual saliency and color cues are incorporated into the introduced CRF framework. As discussed earlier, foreground and background color models are derived for VOE, and thus the unary term E_C describing color information is defined as follows:

E_C = w_c (E_{CF} - E_{CB}).   (20)

Note that the foreground and background color GMMs G^C_f and G^C_b are utilized to derive the associated energy terms E_{CF} and E_{CB}, which are calculated as

E_{CF} = -\log\left(\sum_{i \in I} G^C_f(i)\right), \quad E_{CB} = -\log\left(\sum_{i \in I} G^C_b(i)\right).

As for the visual saliency cue at frame t, the previously derived visual saliency score S_t is converted into the following energy term E_V:

E_V = -w_v \log(S_t).   (21)

Note that in the above equations the parameters w_s, w_c, and w_v are the weights for the shape, color, and visual saliency cues, respectively; they control the contributions of the associated energy terms in the CRF model for performing VOE. It is also worth noting that disregarding the background color model would limit VOE performance, since the foreground color model alone might not be sufficient for distinguishing between foreground and background regions. The VOE framework thus utilizes multiple types of visual and motion-salient features, and the experiments confirm the effectiveness and robustness of the approach on a variety of real-world videos.
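Similarly, a sketch of the color and visual-saliency unaries of (20) and (21); the per-pixel form of E_CF and E_CB (negative log-likelihood under each GMM) is an illustrative choice, and the weights are placeholders.

```python
import numpy as np

def color_unary(frame_rgb, gmm_fg, gmm_bg, w_c=1.0):
    """Per-pixel color energy E_C = w_c * (E_CF - E_CB), cf. (20),
    with E_CF / E_CB taken as negative log-likelihoods under G^C_f / G^C_b."""
    pixels = frame_rgb.reshape(-1, 3).astype(np.float64)
    e_cf = -gmm_fg.score_samples(pixels)          # -log G^C_f(i)
    e_cb = -gmm_bg.score_samples(pixels)          # -log G^C_b(i)
    return (w_c * (e_cf - e_cb)).reshape(frame_rgb.shape[:2])

def visual_unary(S_t, w_v=1.0, eps=1e-6):
    """Per-pixel visual-saliency energy E_V = -w_v * log(S_t), cf. (21)."""
    return -w_v * np.log(np.clip(S_t, eps, 1.0))
```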
3.2.2 Preserving Spatio-Temporal Consistency

Within the same shot of a video, an object of interest can be considered a compact space-time volume that exhibits smooth changes in location, scale, and motion across frames. Preserving the spatial and temporal consistency of the extracted foreground object regions across video frames is therefore a major challenge for VOE. Since there is no guarantee that combining multiple motion-induced features alone would address this problem, additional constraints need to be enforced in the CRF model to achieve this goal.

1) Spatial Continuity for VOE: When a pixel-level prediction process is applied for VOE (as in this approach and some prior VOE methods), the spatial structure of the extracted foreground region is typically not considered during the VOE process, because the prediction made for one pixel is not related to those of its neighboring pixels. To maintain the spatial consistency of the extracted foreground object, a pairwise term is added to the CRF framework. The introduced pairwise term E_{i,j} is defined as

E_{i,j} = \sum_{i \in I,\, j \in \mathrm{Neighbor}} |F_i - F_j| \left( \lambda_1 + \lambda_2 \exp\left( -\frac{\lVert z_i - z_j \rVert}{\beta} \right) \right).   (22)

Note that β is set as the average pixel color difference over all pairs of neighboring pixels. In (22), λ1 is a data-independent Ising prior that smooths the predicted labels, and λ2 relaxes the tendency toward smoothness when the color observations z_i and z_j form an edge (i.e., when ||z_i − z_j|| is large). This pairwise term produces coherent labeling results even under low contrast or blurring effects.
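A sketch of the contrast-sensitive pairwise weights of (22) for 4-connected neighbors; λ1 and λ2 are placeholder values, while β is computed as the average color difference over neighboring pixel pairs, as stated in the text. The |F_i − F_j| factor is applied later by the energy-minimization solver.

```python
import numpy as np

def pairwise_weights(frame_rgb, lam1=1.0, lam2=10.0):
    """Contrast-sensitive Ising weights of (22) for right- and down-neighbor pairs."""
    img = frame_rgb.astype(np.float64)
    diff_r = np.linalg.norm(img[:, 1:] - img[:, :-1], axis=2)   # ||z_i - z_j|| horizontally
    diff_d = np.linalg.norm(img[1:, :] - img[:-1, :], axis=2)   # ||z_i - z_j|| vertically
    beta = (diff_r.sum() + diff_d.sum()) / (diff_r.size + diff_d.size)
    beta = max(beta, 1e-8)                                      # guard against constant frames
    w_r = lam1 + lam2 * np.exp(-diff_r / beta)
    w_d = lam1 + lam2 * np.exp(-diff_d / beta)
    return w_r, w_d
```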
2) Temporal Consistency for VOE: Although both visual and motion saliency information are exploited for determining the foreground object, the motion-induced features, such as the shape and the foreground/background color GMMs, might not describe the changes of the foreground object across the video well, due to issues such as motion blur, compression loss, or noise and artifacts present in the video frames. To alleviate this concern, the foreground/background shape likelihood and the CRF prediction outputs are propagated across video frames to preserve temporal continuity in the VOE results. More precisely, when constructing the foreground and background color GMMs, the corresponding pixel sets FS and BS are produced not only from the shape likelihood regions FS_shape and BS_shape at the current frame; those at the previous frame (including the CRF prediction outputs F_foreground and F_background) are considered as well. In other words, the foreground and background pixel sets FS and BS at frame t + 1 are updated as

FS_{t+1} = FS_{shape}(t+1) \cup FS_{shape}(t) \cup F_{foreground}(t)
BS_{t+1} = BS_{shape}(t+1) \cup BS_{shape}(t) \cup F_{background}(t)   (23)

where F_foreground(t) indicates the pixels at frame t predicted as foreground, and FS_shape(t) is the set of pixels whose shape likelihood is above 0.5, as described earlier. Similar remarks apply to F_background(t) and BS_shape(t).

Finally, by integrating (19), (20), (21), and (22), together with the introduced terms for preserving spatial and temporal information, the objective energy function (18) can be updated as

E = E_{unary} + E_{pairwise} = (E_S + E_{CF} - E_{CB} + E_V) + E_{i,j} = E_S + E_C + E_V + E_{i,j}.   (24)

To minimize (24), one can apply graph-based energy minimization techniques such as max-flow/min-cut algorithms. Once the optimization process is complete, the labeling output F indicates the class label (foreground or background) of each observed pixel at each frame, and the VOE problem is thus solved.
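The minimization of (24) can be sketched with an off-the-shelf max-flow/min-cut solver. The snippet below uses the PyMaxflow package as one possible choice (not necessarily the solver used by the authors), and the mapping of the combined unary energy onto source/sink terminal edges is an illustrative sign convention rather than something taken from the text; unary is E_S + E_C + E_V per pixel, and w_r, w_d are the pairwise weights from the previous sketch.

```python
import numpy as np
import maxflow  # PyMaxflow: pip install PyMaxflow

def segment(unary, w_r, w_d):
    """Minimize E = E_unary + E_pairwise with a max-flow/min-cut solver.
    Returns a boolean per-pixel labeling F for the frame."""
    H, W = unary.shape
    g = maxflow.Graph[float]()
    nodes = g.add_grid_nodes((H, W))
    # Pairwise (smoothness) edges between 4-connected neighbors, cf. (22).
    for y in range(H):
        for x in range(W):
            if x + 1 < W:
                g.add_edge(nodes[y, x], nodes[y, x + 1], w_r[y, x], w_r[y, x])
            if y + 1 < H:
                g.add_edge(nodes[y, x], nodes[y + 1, x], w_d[y, x], w_d[y, x])
    # Unary (data) terminal edges: positive combined energy is treated here as a
    # foreground cost, negative as a background cost (illustrative convention).
    fg_cost = np.maximum(unary, 0.0)
    bg_cost = np.maximum(-unary, 0.0)
    g.add_grid_tedges(nodes, fg_cost, bg_cost)
    g.maxflow()
    return g.get_grid_segments(nodes)   # boolean labeling of each pixel
```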
4. COMPARISON

The saliency-based visual attention model was able to reproduce human performance on a number of pop-out tasks. When a target differed from an array of surrounding distractors by its unique orientation, color, intensity, or size, it was always the first attended location, irrespective of the number of distractors. Conversely, when the target differed from the distractors only by a conjunction of features (e.g., it was the only red horizontal bar in a mixed array of red vertical and green horizontal bars), the search time necessary to find the target increased linearly with the number of distractors. It is difficult to evaluate the model objectively, because no objective reference is available for comparison and observers may disagree on which locations are the most salient. Despite its simple architecture and feed-forward feature-extraction mechanisms, the model is capable of strong performance on complex natural scenes. Without modifying the preattentive feature-extraction stages, however, the model cannot detect conjunctions of features. While the system immediately detects a target that differs from the surrounding distractors by its unique size, intensity, color, or orientation, it will fail to detect targets salient for unimplemented feature types (e.g., T-junctions or line terminators, for which the existence of specific neural detectors remains controversial). For simplicity, no recurrent mechanism within the feature maps has been implemented; hence the model cannot reproduce phenomena such as contour completion and closure, which are important for certain types of human pop-out. In addition, the model does not currently include any magnocellular motion channel, which is known to play a strong role in human saliency.

The bilayer segmentation of webcam videos reports the lowest classification error achieved and the corresponding frame rate for each of the three algorithms at their "optimal" parameter settings according to validation. RF produces the lowest errors; the improved speed of RF is then evaluated when it is allowed to produce the same error level as GB, BT, and EB.
The size of the RF ensemble is then reduced to observe its suboptimal classification error when RF evaluates at the same speed as GB, EB, and BT. In all cases, RF outperforms the other classifiers. The median segmentation error with respect to ground truth is around or below one percent. Segmentation of past frames affects the current frame only through the learned color model. Under harsh lighting conditions, the unary potentials may not be very strong; thus, the Ising smoothness term may force the segmentation to "cut through" the shoulder and hair regions. Similar effects may occur in stationary frames. Noise in the temporal derivatives also affects the results; this situation can be detected by monitoring the magnitude of the motion. Bilayer segmentation is capable of inferring a bilayer segmentation monocularly, even in the presence of distracting background motion, without the need for manual initialization. It has accomplished the following: 1) introduced novel visual features that capture motion and motion context efficiently, 2) provided a general understanding of tree-based classifiers, and 3) determined an efficient and accurate classifier in the form of random forests. This confirms accurate and robust layer segmentation in videochat sequences, in which it achieves the lowest unary classification error.

Although color features can be automatically determined from the input video, earlier methods still require the user to train object detectors for extracting shape or motion features. More recently, researchers have used preliminary strokes to manually select the foreground and background regions and train local classifiers to detect the foreground objects. While these works produce promising results, it might not be practical for users to manually annotate a large amount of video data. The VOE framework, in contrast, is able to extract visually salient regions even for videos with low visual contrast, and it automatically extracts foreground objects of interest without any prior knowledge or the need to collect training data in advance; it is also able to handle videos captured by a freely moving camera or with complex background motion.
For particularly challenging videos, however, it will be very difficult for unsupervised VOE methods to properly detect the foreground regions even when multiple types of visual and motion-induced features are considered. Discriminating between such challenging foreground and background regions might require observing both visual and motion cues over a longer period. Alternatively, if the video has sufficient resolution, one can consider utilizing the trajectory information of the extracted local interest points to determine the candidate foreground regions. In such cases, improved VOE results can be expected.
5. CONCLUSION

The first scheme is a conceptually simple computational model of saliency-driven focal visual attention. The biological insight guiding its architecture proved effective in reproducing some of the performance of primate visual systems. The efficiency of this approach for target detection depends critically on the feature types implemented; the framework can therefore be easily tailored to arbitrary tasks through the implementation of dedicated feature maps.

The second scheme, bilayer segmentation, is capable of inferring a bilayer segmentation monocularly, even in the presence of distracting background motion, without the need for manual initialization. It has accomplished the following: 1) introduced novel visual features that capture motion and motion context efficiently, 2) provided a general understanding of tree-based classifiers, and 3) determined an efficient and accurate classifier in the form of random forests. Bilayer segmentation extracts the topmost point of the foreground pixels and uses it as a reference point for a framing window; this result can be used for the "smart framing" of videos. On the one hand, this simple head tracker is efficient: it works for both frontal and side views, and it is not affected by background motion. On the other hand, it is heavily tailored toward the videochat application, in which it achieves the lowest unary classification error.

The third scheme, the VOE method, was shown to better model the foreground object thanks to the fusion of multiple types of saliency-induced features. The major advantage of this method is that it requires neither prior knowledge of the object of interest (i.e., no training data need be collected) nor interaction from the user during the segmentation process.
REFERENCES

[1] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254-1259, Nov. 1998.

[2] P. Yin, A. Criminisi, J. M. Winn, and I. A. Essa, "Bilayer segmentation of webcam videos using tree-based classifiers," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 30-42, Jan. 2011.

[3] W.-T. Li, H.-S. Chang, K.-C. Lien, H.-T. Chang, and Y.-C. F. Wang, "Exploring visual and motion saliency for automatic video object extraction," IEEE Trans. Image Process., vol. 22, no. 7, Jul. 2013.

[4] http://en.wikipedia.org/wiki/Image_processing
GLOSSARY

1. Saliency Map: The saliency map (SM) is a topographically arranged map that represents the visual saliency of a corresponding visual scene.

2. WTA: Winner-take-all is a computational principle applied in computational models of neural networks, by which neurons in a layer compete with each other for activation.

3. Texton: Textons are fundamental micro-structures in natural images and are considered the atoms of pre-attentive human visual perception.

4. CRF: Conditional random fields (CRFs) are a class of statistical modeling methods often applied in pattern recognition and machine learning, where they are used for structured prediction.