4. Related Work: 3-d Object Retrieval
[Figures: excerpts from the cited papers, including the Scale-Invariant HKS (SI-HKS) descriptor and the detection of 3D SURF features on a voxelized shape.]
R. Toldo, et al., 2009
J. Knopp, et al., 2010
A. M. Bronstein, et al., 2011
Maria Isabel Restrepo. February 7, 2012 4
5. Related Work: Scene Description In LIDAR
[Figures: excerpts from the cited papers. Korah, Medasani, and Owechko segment LIDAR data from urban scenes with a Strip Histogram Grid, a grid of vertical 3D population histograms rising up from the locally detected ground. They report processing almost a billion points spanning an area of 3.3 km² in less than an hour on a regular desktop.]
A. Patterson and P. Mordohai, 2008
A. Golovinskiy, et al., 2009
T. Korah, et al., 2011
9. Challenges Of Multi-View Stereo
Scene Ambiguity:
Scene Uncertainty:
10. Probabilistic 3-d Volumetric Model: PVM
Probabilistic representation of 3-d scenes based on volumetric units (voxels).
[Figure: ray diagram with labels C, R_X, I, I_X, Voxel Volume, V, S, X', and a plot of P(I_X | V = X') against intensity.]
Pollard and Mundy, 2007
11. Probabilistic 3-d Volumetric Modeling
[Figure: ray diagram with labels C, R_X, I, I_X, Voxel Volume, V, S, X', and a plot of P(I_X | V = X') against intensity.]
12. Probabilistic 3-d Volumetric Modeling
Surface probability is given by on-line Bayesian learning:
\[
P^{N+1}(X \in S \mid I_X^{N+1}) = P^{N}(X \in S)\, \frac{p^{N}(I_X^{N+1} \mid X \in S)}{p^{N}(I_X^{N+1})}
\]
[Figure: ray diagram with labels C, R_X, I, I_X, Voxel Volume, V, S, X', and a plot of P(I_X | V = X') against intensity.]
13. Probabilistic 3-d Volumetric Modeling
Update using information along a projection ray
\[
P^{N+1}(X \in S) = P^{N}(X \in S)\, \frac{p^{N}(I_X^{N+1} \mid X \in S)}{p^{N}(I_X^{N+1})} \tag{3}
\]
\[
P^{N+1}(X \in S) = P^{N}(X \in S)\, \frac{\sum_{X' \in R_X} p^{N}(I_X^{N+1} \mid V = X')\, P^{N}(V = X' \mid X \in S)}{\sum_{X' \in R_X} p^{N}(I_X^{N+1} \mid V = X')\, P^{N}(V = X')} \tag{4}
\]
To make the PVM representation clear, a term by term explanation of the update equation in (4) is outlined.
• The term $p^{N}(I_X^{N+1} \mid V = X')$ is computed using the mixture of Gaussians model stored at the voxel $X'$.
• The probability of a voxel $X'$ producing the color in the image is interpreted geometrically, where a voxel produces the intensity seen in the image if it is a surface element and it is not occluded by other voxels along the ray. Thus,
\[
P^{N}(V = X') = P^{N}(X' \in S)\, P^{N}(X' \text{ is not occluded}) \tag{5}
\]
The probability that $X'$ is not occluded is defined as the probability that all voxels between $X'$ and the sensor are empty.
14. Probabilistic 3-d Volumetric Modeling
Every voxel contains appearance information
Probability of the observed intensity, given that the voxel produced the color seen in the image:
\[
p(I) = \sum_{k=1}^{3} \frac{w_k}{W}\, \frac{1}{\sqrt{2\pi\sigma_k^2}}\, \exp\!\left(-\frac{(I - \mu_k)^2}{2\sigma_k^2}\right) \tag{1}
\]
The appearance at each voxel is modeled with a Gaussian mixture, given by (1). The quantities $\mu_k$, $\sigma_k$ and $w_k$ are the distribution and mixing parameters associated with each component, and $W$ is the sum of $w_k$ over all mixture components. The mixtures are learned using a modified expectation-maximization (EM) algorithm similar to that in [45].
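As a minimal sketch of how the per-voxel appearance density in (1) could be evaluated, the function below computes the weight-normalized Gaussian mixture at a single intensity. The function name and argument layout are illustrative assumptions, not taken from the PVM implementation.

```python
import math


def mixture_density(intensity, weights, means, sigmas):
    """Evaluate the normalized Gaussian mixture of equation (1) at one intensity.

    weights, means, sigmas hold w_k, mu_k, sigma_k for each component
    (hypothetical argument names chosen for this sketch).
    """
    total_weight = sum(weights)  # W: sum of the mixing weights
    density = 0.0
    for w, mu, sigma in zip(weights, means, sigmas):
        norm = math.sqrt(2.0 * math.pi * sigma ** 2)
        exponent = -((intensity - mu) ** 2) / (2.0 * sigma ** 2)
        density += (w / total_weight) * math.exp(exponent) / norm
    return density
```

Dividing by W means the weights do not need to sum to one beforehand, which matches the normalization written in (1).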
15. Probabilistic 3-d Volumetric Modeling
Probability that a voxel produced the color seen in the image:
\[
P^{N}(V = X') = P^{N}(X' \in S)\, P^{N}(X' \text{ is not occluded}) \tag{5}
\]
The probability that $X'$ is not occluded is defined as the probability that all voxels between $X'$ and the sensor are empty, namely:
\[
P^{N}(X' \text{ is not occluded}) = \prod_{X'' < X'} \left(1 - P^{N}(X'' \in S)\right) \tag{6}
\]
• The term $P^{N}(V = X' \mid X \in S)$ is computed analogously to $P^{N}(V = X')$. However, any instances of $P^{N}(X \in S)$ …
18. Spatial Optimization: Octree
[Figure: two plots of p(intensity) against intensity.]
Crispell, Mundy, and Taubin, 2011
Miller, Jain, and Mundy, 2011
19. Probabilistic 3-d Volumetric Modeling
Demo:
https://vimeo.com/43729866
20. Geometry And Appearance
Demo:
https://vimeo.com/43690883
https://vimeo.com/45322168
21. Expected Appearance Volume Model: EVM
\[
\text{Voxel's Expected Appearance} = E(I_X \mid V = X')\, P(X' \in S)
\]
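A small sketch of the expected-appearance value for one voxel, assuming the voxel's appearance is stored as a Gaussian mixture so that the expected intensity is the weight-normalized mean of the component means. The function and argument names are illustrative only.

```python
def expected_appearance(weights, means, p_surface):
    """E(I | V = X') * P(X' in S) for one voxel.

    Assumes a Gaussian mixture appearance model, whose expected value is
    the normalized weighted average of the component means.
    """
    total_weight = sum(weights)
    mean_intensity = sum(w * mu for w, mu in zip(weights, means)) / total_weight
    return mean_intensity * p_surface
```

Scaling by the surface probability pushes voxels that are unlikely to contain a surface towards zero, so empty space contributes little to the features computed later.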
22. Object Categorization: Bag Of Volumetric Words
Input: EVM
Sampling: Dense
Feature Descriptor: Taylor, PCA
Volumetric Vocabulary: K-means
Classifier: Naive Bayes
Classes: Parking, Car, Plane, Building, House
23. Experiments: Data Collection
http://vision.lems.brown.edu/project_desc/Object-Recognition-in-Probabilistic-3D-Scenes
24. Experiments: Train And Test Sites
Site 1 Site 2 Site 3 Site 5 Site 6
Site 7 Site 8 Site 10 Site 11 Site 12
Site 16 Site 18 Site 21 Site 22 Site 23
Site 25 Site 26 Site 27
http://vision.lems.brown.edu/project_desc/Object-Recognition-in-Probabilistic-3D-Scenes
25. Experiments: The Input
Camera matrices were recovered using Bundler: Snavely, N. and Seitz, S. (2006). Photo tourism: exploring photo collections in 3D. ACM Transactions on Graphics.
26. Feature Description
Global Features
Spherical Harmonics: D. Saupe and D. V. Vranić, 2001
3-d Zernike Moments: M. Novotni and R. Klein, 2003
Local Features
Spin Images: Johnson and Hebert, 1999
3-D SURF: Knopp, et al., 2010
Shape Context: Frome, et al., 2004
[Figures: excerpts from the cited papers.]
27. Feature Formation
Volumetric Form of Voxel Neighborhoods
Vector Form of Voxel Neighborhoods
\[
E(I_X \mid V = X')\, P(X' \in S)
\]
31. Feature Description: PCA Features
EVD on the sample scatter matrix S
[Figure: a point in d-dimensional space projected onto a 1-dimensional space, $x \approx \bar{x} + a_1 e_1$; k-dimensional approximation, for k < d.]
The PCA space is given by the eigenvalue decomposition of S. In the PCA space, every neighborhood (represented by a d-dimensional feature vector $x$) can be exactly expressed as
\[
x = \bar{x} + \sum_{i=1}^{d} a_i e_i,
\]
where $e_i$ are the principal axes associated with the d eigenvalues, and $a_i$ are the corresponding coefficients. A k-dimensional (k < d) approximation of the neighborhoods can be obtained by using the first k principal components, i.e.
\[
\tilde{x} = \bar{x} + \sum_{i=1}^{k} a_i e_i.
\]
Section V presents a detailed analysis of the reconstruction error of local neighborhoods, namely $|x - \tilde{x}|^2$, as a function of dimension and training set size. In the remainder of this paper, the vector arrangement of projection coefficients in the PCA space is referred to as a PCA feature.
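The PCA feature construction can be sketched with NumPy as below: an eigenvalue decomposition of the sample scatter matrix, followed by projection onto the first k principal axes. The function names are hypothetical, and the sketch assumes each row of the input is one vectorized voxel neighborhood.

```python
import numpy as np


def pca_features(neighborhoods, k):
    """Project d-dimensional neighborhood vectors onto the first k principal axes.

    Returns the mean vector, the d x k matrix of principal axes e_i, and the
    n x k matrix of coefficients a_i (the PCA features).
    """
    X = np.asarray(neighborhoods, dtype=float)
    mean = X.mean(axis=0)
    centered = X - mean
    scatter = centered.T @ centered              # sample scatter matrix S
    eigvals, eigvecs = np.linalg.eigh(scatter)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]        # keep the k largest
    axes = eigvecs[:, order]
    coeffs = centered @ axes
    return mean, axes, coeffs


def reconstruct(mean, axes, coeffs):
    """k-dimensional approximation: mean + sum_i a_i e_i."""
    return mean + coeffs @ axes.T
```

With k equal to d the reconstruction is exact; for smaller k the squared reconstruction error is the quantity the slide refers to as $|x - \tilde{x}|^2$.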
32. Feature Description: Taylor Features
The computation of derivatives in the expectation volume model (EVM) can be expressed as a least square error minimization of the following energy function.
Minimize:
\[
E = \sum_{i=-n_i}^{n_i} \sum_{j=-n_j}^{n_j} \sum_{k=-n_k}^{n_k} \left( V(i, j, k) - \tilde{V}(i, j, k) \right)^2 \tag{6}
\]
where $\tilde{V}(i, j, k)$ is the Taylor series approximation of the expected 3-d appearance of a volume $V$ centered on the point $(i, j, k)$. Using the second degree Taylor expansion about $(0, 0, 0)$, (6) becomes
\[
E = \sum_{x} \left( V(x) - V_0 - x^{T} G - \frac{1}{2!}\, x^{T} H x \right)^2 \tag{7}
\]
where $V_0$, $G$, $H$ are the zeroth derivative, the gradient vector and the Hessian matrix of the volume of expected 3-d appearances about the point $(0, 0, 0)$, respectively. The coefficients for 3-d derivative operators can be found by minimizing (7) with respect to the zeroth, first and second order derivatives. The computed derivative operators are applied algebraically to neighborhoods in the EVM.
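One way to sketch the least-squares minimization of (7) is to build a design matrix of second-degree monomials over the neighborhood offsets and take its pseudo-inverse; each row of the pseudo-inverse then acts as a linear derivative operator. This is an illustration under assumptions (cubic neighborhoods of half-width n, a hypothetical ordering of the ten coefficients), not the authors' implementation.

```python
import numpy as np


def taylor_operators(n):
    """Least-squares operators for a (2n+1)^3 neighborhood.

    Returns (operators, offsets). Each of the 10 operator rows estimates one
    coefficient, in the assumed order
    [V0, Gx, Gy, Gz, Hxx, Hyy, Hzz, Hxy, Hxz, Hyz].
    """
    r = np.arange(-n, n + 1)
    x, y, z = [a.ravel().astype(float)
               for a in np.meshgrid(r, r, r, indexing="ij")]
    # Columns follow V0 + x^T G + (1/2) x^T H x, with H symmetric.
    design = np.stack([np.ones_like(x), x, y, z,
                       0.5 * x * x, 0.5 * y * y, 0.5 * z * z,
                       x * y, x * z, y * z], axis=1)
    return np.linalg.pinv(design), np.stack([x, y, z], axis=1)


def taylor_feature(neighborhood, operators):
    """Apply the operators to a neighborhood indexed as V[i, j, k]."""
    return operators @ np.asarray(neighborhood, dtype=float).ravel()
```

Because the operators depend only on the neighborhood size, they can be computed once and applied to every neighborhood as a matrix-vector product.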
33. Learning The Codebook
Learn Volumetric Vocabulary using K-Means Clustering:
✤ Determine the best number of means: Heuristically
✤ Convergence depends on initialization: P. S. Bradley and U. M. Fayyad, 1998
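A minimal sketch of learning a vocabulary with K-means (Lloyd's algorithm) and quantizing features against it. The random initialization with a fixed seed is an assumption made for this sketch; as noted above, the result depends on initialization, and the refinement of Bradley and Fayyad is not implemented here.

```python
import numpy as np


def quantize(features, centers):
    """Index of the nearest cluster center (vocabulary word) for each feature."""
    X = np.asarray(features, dtype=float)
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(features, k, n_iter=50, seed=0):
    """Plain Lloyd iterations from k randomly chosen data points."""
    X = np.asarray(features, dtype=float)
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = quantize(X, centers)
        new_centers = centers.copy()
        for c in range(k):
            members = X[labels == c]
            if len(members):              # keep the old center if a cluster empties
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers
```

Counting how often each index returned by `quantize` occurs inside an object gives the per-object word counts used by the classifier.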
35. Learning Class Distributions
Let $O_l$ be the set of objects of a particular category with class label $l$. Then, the set of all objects is defined as $O = \bigcup_{l=1}^{N_c} O_l$, where $N_c$ is the number of categories. Let the vocabulary of 3-d expected appearance patterns be defined as $V = \bigcup_{i=1}^{k} v_i$, where $k$ is the number of cluster centers in the vocabulary. From the quantization step a count $c_{ij}$ is obtained of the number of times a cluster center $v_i$ occurs in object $o_j$. Using Bayes formula, the a posteriori class probability is given by:
\[
P(C_l \mid o_i) \propto P(o_i \mid C_l)\, P(C_l) \tag{8}
\]
The likelihood of an object is given by the product of the likelihoods of the independent entries of the vocabulary, $P(v_j \mid C_l)$, which are estimated during learning. The full expression for the class posterior becomes:
\[
P(C_l \mid o_i) \propto P(C_l) \prod_{j=1}^{k} P(v_j \mid C_l)^{c_{ji}} \tag{9}
\]
\[
\propto P(C_l) \prod_{j=1}^{k} \left( \frac{\sum_{m : o_m \in O_l} c_{jm}}{\sum_{n=1}^{k} \sum_{m : o_m \in O_l} c_{nm}} \right)^{c_{ji}} \tag{10}
\]
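The learning step, estimating the per-class word likelihoods as the ratio of counts in (10), can be sketched as follows. Two choices here are assumptions not stated on the slide: class priors are estimated from class frequencies in the training set, and an optional additive smoothing term keeps the logarithm finite when a word never occurs in a class. The count table is stored object-major (`counts[m][j]`), the transpose of the slide's $c_{jm}$.

```python
import math


def learn_word_likelihoods(counts, labels, n_classes, smoothing=1.0):
    """Estimate log P(C_l) and log P(v_j | C_l) from training counts.

    counts[m][j]: times word v_j occurs in training object o_m.
    labels[m]:    class index of o_m (every class must appear at least once).
    smoothing:    additive constant (an assumption; 0 reproduces the plain ratio).
    """
    k = len(counts[0])
    totals = [[smoothing] * k for _ in range(n_classes)]
    for row, label in zip(counts, labels):
        for j, c in enumerate(row):
            totals[label][j] += c
    log_likelihoods = []
    for class_totals in totals:
        denom = sum(class_totals)
        log_likelihoods.append([math.log(t / denom) for t in class_totals])
    log_priors = [math.log(labels.count(l) / len(labels))
                  for l in range(n_classes)]
    return log_priors, log_likelihoods
```

Working in log space avoids numerical underflow when the product in (9) runs over many vocabulary entries.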
36. Classification: Bayes Classifier
During classification, the class label with the highest a posteriori probability is chosen. The posterior follows from Bayes formula and the likelihoods $P(v_j \mid C_l)$ of the vocabulary entries estimated during learning:
\[
P(C_l \mid o_i) \propto P(o_i \mid C_l)\, P(C_l) = P(C_l) \prod_{j=1}^{k} P(v_j \mid C_l)^{c_{ji}}
\]
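The classification rule can be sketched as a maximum over log posteriors, where the log of the product in (9) becomes a count-weighted sum. The function name is hypothetical, and the example tables at the bottom are made-up numbers for illustration, not results from the experiments.

```python
import math


def classify(object_counts, log_priors, log_likelihoods):
    """Return (best_class, scores) using log P(C_l) + sum_j c_j * log P(v_j | C_l)."""
    scores = []
    for log_prior, word_logs in zip(log_priors, log_likelihoods):
        score = log_prior + sum(c * lw for c, lw in zip(object_counts, word_logs))
        scores.append(score)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Illustrative two-class, two-word tables (invented values).
example_priors = [math.log(0.5), math.log(0.5)]
example_likelihoods = [[math.log(0.75), math.log(0.25)],
                       [math.log(0.25), math.log(0.75)]]
example_label, _ = classify([5, 1], example_priors, example_likelihoods)
```

The scores are unnormalized log posteriors; they are sufficient for picking the best class but would need normalization to be read as probabilities.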
37. Learning Class Distributions
[Figure, panel label: Train]
38. Learning Class Distributions
[Figure, panel labels: Test, Train]
39. Results: PCA Classes
Buildings
Planes
41. Experiments: Number Of Objects
18 Probabilistic Sites
Table 2: Number of objects in every category.
         Planes   Cars   Houses   Buildings   Parking Lots
Train    18       54     61       24          27
Test     16       29     45       15          17
Two measurements were used to evaluate the classification performance: (i) classifier accuracy (i.e., the fraction of correctly classified objects), and (ii) the confusion matrix. During classification experiments, the number of clusters in the codebook was varied from k = 2 to k = 100. Figure 4 presents classification accuracy as a function of the number of clusters. For both Taylor-based features and PCA-based features, …
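The two evaluation measurements can be sketched as below. Normalizing each true-class column so that it sums to one is an assumption of this sketch, and the function names are hypothetical.

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of correctly classified objects."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


def confusion_matrix(true_labels, predicted_labels, n_classes):
    """matrix[p][t]: fraction of objects of true class t predicted as class p."""
    counts = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[p][t] += 1
    col_totals = [sum(counts[p][t] for p in range(n_classes))
                  for t in range(n_classes)]
    return [[counts[p][t] / col_totals[t] if col_totals[t] else 0.0
             for t in range(n_classes)]
            for p in range(n_classes)]
```

The diagonal of the matrix gives the per-class recognition rate, while accuracy summarizes all classes in a single number weighted by class size.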
43. Results: Confusion Matrix
(a) PCA
True Class:    Plane   House   Building   Car    Parking Lot
Plane          0.86    0.02    0.00       0.03   0.00
House          0.00    0.67    0.27       0.00   0.12
Building       0.00    0.31    0.67       0.00   0.00
Car            0.00    0.00    0.07       0.93   0.00
Parking Lot    0.14    0.00    0.00       0.03   0.88
(b) Taylor
True Class:    Plane   House   Building   Car    Parking Lot
Plane          0.86    0.02    0.00       0.03   0.00
House          0.00    0.64    0.27       0.00   0.12
Building       0.00    0.33    0.67       0.00   0.00
Car            0.00    0.00    0.07       0.86   0.00
Parking Lot    0.14    0.00    0.00       0.10   0.88
Fig. 9. Confusion matrix for a 20-keyword codebook of PCA-based features on the left and Taylor-based features on the right.
44. Future Work
✴ Evaluation of effectiveness of the EVM, by performing classification
tasks on different underlying 3-d reconstruction algorithms.
✴ Performance evaluation of additional feature descriptors.
✴ Explore algorithms for detection.
45. Effectiveness Of Probabilistic Volumetric Learning
Y. Furukawa and J. Ponce, 2010
46. Effectiveness Of Probabilistic Volumetric Learning
Probabilistic 3-d Modeling Threshold Based 3-d Modeling
Welcome, everyone. My name is M.R. I come from Brown U, and I am pleased to present our work on ORIP3-dS. This is joint work with BM and Mundy.
Let me start by explaining the goal of our work. Suppose we are given an image sequence or video of a realistic, large-scale scene, where the images are collected under unrestricted conditions in terms of illumination, weather, resolution and so on.
Then we are interested in characterizing the information in the three-dimensional scene, so that we can provide automated descriptions of the objects present in this three-dimensional world. We are interested in being able to tell where the buildings, the streets, the trees, the water and so on are. Before I move on to explain the set of methods that we propose to achieve this goal, I would like to briefly discuss some related work in the area of 3-d object recognition.
In recent years, there has been an exponential growth in the number of 3-d models. Typically these models come from 3-d scanners or are CAD models. Therefore, a lot of the work in 3-d shape understanding has focused on the problem of object retrieval. In object retrieval, the task is: given a query object, an algorithm needs to retrieve the closest match in the database. The main difference between these works and ours is that we are not working with isolated objects, and our shape information is collected under very different conditions. Also, we are not doing instance recognition but class recognition.
Another body of work, more in line with what we want to achieve, is the area of segmentation and object recognition in large-scale point clouds obtained using LIDAR sensors. Very encouraging results have recently been reported in this area. As mentioned, we have a very similar goal, but we operate on geometry that is learned from images. We believe that future models could use a combination of imagery and LIDAR to achieve more accurate representations.
The first challenge is known as scene ambiguity. It can be described as follows: suppose we have a surface with three regions of constant appearance, observed from two different viewpoints. The surfaces in the constant-color regions can be reconstructed anywhere within the diamond-shaped regions. Therefore, whenever featureless surfaces are observed, the 3-d geometry cannot be precisely modeled. Of course, the number and positions of the cameras determine the area of the ambiguous regions. The second difficulty is known as scene uncertainty, and it happens when the same 3-d structure has a different appearance in different viewpoints, due to transient objects, reflectivity properties or sensor noise. Finally, I would like to emphasize that when choosing a 3-d reconstruction technique it is important to handle or model the ambiguities just explained.
In our work we propose to model scene geometry and appearance through a probabilistic volumetric model. This model was first proposed by Pollard and Mundy in 2007. In its original form, a region of 3-d space is decomposed into regular 3-d cells called voxels. A voxel contains information about the geometry and appearance of that portion of space, and the information in the voxels is learned using input images, calibrated camera matrices and the corresponding projection rays.
The problem setup is as follows: for every pixel in an image there is an associated projection ray, here denoted RX. At every voxel along that ray, the geometry and appearance models are updated using the intensity of the corresponding pixel. Along the ray there is only one voxel that produces the color seen in the image.
Geometry is modeled as surface probability. At every point in time a voxel has two possible states: it is a surface element or it is not. The probability of a voxel being a surface is updated with the information in an image using Bayesian learning. The update is done in an online fashion, by which I mean using only one image at a time. This allows the model to adapt to the ever-changing world surfaces.
The Bayesian update can be expressed using the probability and appearance information along the projection ray. The interpretation of this equation is that the surface probability at a particular voxel increases if the appearance model at that voxel explains the given intensity better than any other voxel along the projection ray.
At every voxel, appearance is modeled by a mixture of Gaussians that is updated using expectation-maximization.
The other term in this equation corresponds to the probability of a voxel being the one that caused the intensity seen in the image. This term is interpreted geometrically: a voxel caused the color in the image if it is a surface element and it is not occluded. Occlusion is modeled as the probability that the space between that voxel and the sensor is empty.
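The per-ray update described in these notes can be sketched in a few lines. This is a sketch under stated assumptions, not the exact equation from the slides: voxels are ordered from the sensor outward, the likelihood values stand in for the mixture-of-Gaussians appearance densities, no background term is modeled, and the function name is hypothetical.

```python
def update_ray(occupancy, likelihood):
    """One online Bayesian update of surface probabilities along a projection ray.

    occupancy[i]: current P(voxel i is a surface), ordered from the sensor outward.
    likelihood[i]: density of the observed pixel intensity under voxel i's
        appearance model (stand-in values here; an assumption of this sketch).
    """
    visibility = 1.0  # probability that the space in front of voxel i is empty
    pre = 0.0         # evidence contributed by voxels in front of voxel i
    conditional = []  # p(intensity | voxel i is a surface)
    for p, a in zip(occupancy, likelihood):
        conditional.append(pre + visibility * a)
        pre += visibility * p * a  # voxel i is a surface and is not occluded
        visibility *= 1.0 - p
    evidence = pre  # p(intensity) marginalized over all voxels on the ray
    if evidence <= 0.0:
        return list(occupancy)  # nothing on the ray explains the pixel
    return [min(1.0, p * c / evidence) for p, c in zip(occupancy, conditional)]


# The middle voxel explains the observed intensity best, so its surface
# probability rises, while the voxel in front of it falls.
print(update_ray([0.1, 0.5, 0.1], [0.1, 2.0, 0.1]))
```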
One difficulty of Pollard's model is that the storage requirements are high. In practice most of the voxels in a scene correspond to empty space.
Ideally, we would like to represent the information near surfaces with high resolution and use coarse voxels in empty space.
In 2011, C, M, T proposed a variable-resolution model. Crispell's model is based on an octree subdivision of space, where geometry and appearance information is stored at the leaf cells. The screenshot on the bottom right compares details of two models, one reconstructed with a regular grid and the other using the octree model. Finally, Miller… proposed a GPU implementation of Crispell's model, and that is the implementation that we use in our current work.
This is a volumetric rendering of a scene's reconstructed geometry, where white corresponds to surface probability 1 and black to empty space.
Even more exciting is to show you renderings of the combined surface and appearance information; pay attention to the level of detail and sharpness achieved by the model. Explain the streaks in the air.
Now that we have learned the geometry and appearance at every voxel, we would like to combine this information. To do so, the expected appearance is multiplied by the occupancy. This allows us to explore interesting features not only in the geometry of objects but also in their appearance.
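As a small illustration of this combination step, the sketch below multiplies a voxel's occupancy by the expected value of its appearance mixture. The function names and the use of the weighted mean of the Gaussian components as the "expected appearance" are assumptions.

```python
def expected_appearance(weights, means):
    """Expected intensity of a mixture of Gaussians: weighted mean of component means."""
    total = sum(weights)
    return sum(w * m for w, m in zip(weights, means)) / total


def voxel_value(occupancy, weights, means):
    """Occupancy-weighted expected appearance used as the per-voxel scalar."""
    return occupancy * expected_appearance(weights, means)


# Half-occupied voxel with a two-component appearance model (toy values).
print(voxel_value(0.5, [0.25, 0.75], [0.2, 0.6]))
```

Empty space therefore maps to values near zero regardless of its appearance model, so only likely surfaces contribute to the local neighborhoods that are described later.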
At this point let me move on to explain the object categorization pipeline. In this work we propose to use a bag-of-features representation to learn and classify objects. I have explained how input objects are represented using the volume of expected appearances. For each object, neighborhoods are sampled in a dense manner and described. This is motivated by the success of bag-of-features methods in 2-d images.
Before explaining the details of our bag-of-features models, let me talk about the data and inputs used in the experiments. For this work, we had some fun and flew in a helicopter.
We only use grey-scale images.
In general, objects can be described using either local features or global ones. Global features describe the overall shape of the object. Examples are 3-d moments and spherical harmonics, among others. Local features, on the other hand, describe neighborhoods with local support. Local features tend to be more robust in the presence of occlusions. Also, global features depend on successful presegmentation ...
Let me explain how the volumetric rendering of voxel neighborhoods should be interpreted. First, recall that the underlying function is the expected appearance of a voxel. Then white regions in these volumes correspond
The first feature we propose is based on the principal component analysis (PCA) of local neighborhoods. PCA finds the directions that best represent the data in the least-squares sense. The data can be expressed exactly using all principal directions, or approximated using a smaller number of them. In our experiment the original space had 125 dimensions, and the approximation was achieved using 10 dimensions.
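A minimal sketch of this projection follows, assuming each neighborhood has been flattened into a vector (125 values in the talk, reduced to 10 coefficients). Power iteration with deflation is swapped in here for a full eigendecomposition so that the example needs no external libraries; the function names and the toy 3-dimensional data are hypothetical.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def fit_pca(samples, n_components, iters=100, seed=0):
    """Return the mean and the leading principal directions of the samples."""
    rng = random.Random(seed)
    n, d = len(samples), len(samples[0])
    mean = [sum(row[j] for row in samples) / n for j in range(d)]
    residual = [[row[j] - mean[j] for j in range(d)] for row in samples]
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scale = dot(v, v) ** 0.5
        v = [x / scale for x in v]
        for _ in range(iters):
            # Multiply by X^T X without forming the covariance matrix.
            scores = [dot(row, v) for row in residual]
            w = [sum(s * row[j] for s, row in zip(scores, residual)) for j in range(d)]
            norm = dot(w, w) ** 0.5
            if norm < 1e-12:
                break  # no variance left in the residual
            v = [x / norm for x in w]
        components.append(v)
        # Deflate: remove the found direction from every sample.
        residual = [[x - dot(row, v) * vj for x, vj in zip(row, v)] for row in residual]
    return mean, components


def project(sample, mean, components):
    """Descriptor: coefficients of the centered sample on each principal direction."""
    centered = [x - m for x, m in zip(sample, mean)]
    return [dot(centered, c) for c in components]


def reconstruct(coeffs, mean, components):
    """Approximate the sample from its coefficients."""
    return [m + sum(a * c[j] for a, c in zip(coeffs, components)) for j, m in enumerate(mean)]


# Toy data lying on one line in 3-d, so one component reconstructs it exactly.
data = [[t / 3, 2 * t / 3, 2 * t / 3] for t in (-2, -1, 0, 1, 2)]
mean, comps = fit_pca(data, n_components=1)
print(reconstruct(project(data[4], mean, comps), mean, comps))
```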
The second type of feature that we propose is based on the Taylor series approximation of the local volumetric function. Differential kernels are found by minimizing the squared distance between the volumetric function and its Taylor series approximation. At every location in space we assign a descriptor that is made either from the PCA projection coefficients or from the 10 responses to the Taylor kernels.
After computing descriptors at every location in the objects, for all objects in the training set, we would like to find a small number of descriptors that represent all samples.
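One common way to find such a small set of representative descriptors is k-means clustering, after which each object becomes a histogram of nearest-codeword counts. The sketch below assumes plain k-means with random initialization; the function names and toy descriptors are hypothetical, and the talk's own clustering settings are not reproduced.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(descriptor, centers):
    """Index of the codeword closest to the descriptor."""
    return min(range(len(centers)), key=lambda i: sq_dist(descriptor, centers[i]))


def build_codebook(descriptors, k, iters=20, seed=0):
    """Plain k-means: returns k cluster centers used as the codebook."""
    rng = random.Random(seed)
    centers = [list(c) for c in rng.sample(descriptors, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for d in descriptors:
            groups[nearest(d, centers)].append(d)
        for i, group in enumerate(groups):
            if group:  # keep the old center if a cluster becomes empty
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return centers


def object_histogram(object_descriptors, centers):
    """Bag-of-features vector: how often each codeword occurs in one object."""
    counts = [0] * len(centers)
    for d in object_descriptors:
        counts[nearest(d, centers)] += 1
    return counts


# Two well-separated groups of 2-d descriptors (toy values).
training = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
codebook = build_codebook(training, k=2)
print(object_histogram([[0, 0], [0.5, 0.5], [10, 10]], codebook))
```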
Mostly composed of slowly varying first-order derivatives.
Just as a preview of further work that we have performed… if there is time.