Often, high-dimensional data lie close to a low-dimensional submanifold, and it is of interest to understand the geometry of these submanifolds. The homology groups of a manifold are important topological invariants that provide an algebraic summary of the manifold. These groups contain rich topological information, for instance about the connected components, holes, tunnels, and sometimes the dimension of the manifold. In this paper, we consider the statistical problem of estimating the homology of a manifold from noisy samples under several different noise models. We derive upper and lower bounds on the minimax risk for this problem. Our upper bounds are based on estimators constructed from a union of balls of appropriate radius around carefully selected points. In each case we establish a complementary lower bound using Le Cam's lemma.
4. Something like a joke.
What is topological inference?
It's when you infer the topology of a space given only a finite subset.
6. We add geometric and statistical hypotheses to make the problem well-posed.
Geometric Assumption: The underlying space is a smooth manifold M.
Statistical Assumption: The points are drawn i.i.d. from a distribution derived from M.
20. Input: n points sampled i.i.d. from a distribution supported on a d-manifold M in D dimensions, with noise.
Output: An estimate of the homology of M.
Upper bound: What is the worst-case complexity (the probability of giving a wrong answer)?
Lower bound: What is the worst-case complexity of the best possible algorithm?
The Goal: Matching Bounds (asymptotically)
27. Minimax risk is the error probability of the best estimator on the hardest examples.
Minimax Risk: $R_n = \inf_{\hat H} \sup_{Q \in \mathcal{Q}} Q^n(\hat H \neq H(M))$
($\hat H$: the best estimator; $Q$: the hardest distribution; $Q^n$: the product distribution; $H(M)$: the true homology.)
Sample Complexity: $n(\varepsilon) = \min\{n : R_n \leq \varepsilon\}$
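The inf–sup structure of the minimax risk can be illustrated numerically. The following sketch is not from the talk; the error table is made up, with `risk[i][j]` standing for the error probability of estimator `i` under distribution `j`:

```python
# Toy illustration of minimax risk over a finite family of estimators
# and distributions (hypothetical numbers).
risk = [
    [0.10, 0.80],  # estimator 0: good on dist 0, bad on dist 1
    [0.70, 0.15],  # estimator 1: the reverse
    [0.30, 0.35],  # estimator 2: balanced
]

# sup over distributions, then inf over estimators
worst_case = [max(row) for row in risk]
minimax_risk = min(worst_case)

print(worst_case)    # [0.8, 0.7, 0.35]
print(minimax_risk)  # 0.35
```

Note that the minimax-optimal estimator here is the balanced one, even though it is the best choice for neither distribution individually.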
33. We assume manifolds without boundary of bounded volume and reach.
Let $\mathcal{M}$ be the set of compact d-dimensional Riemannian manifolds without boundary such that
1. $M \subset \mathrm{ball}_D(0, 1)$
2. $\mathrm{vol}(M) \leq c_d$
3. The reach of $M$ is at least $\tau$.
Let $\mathcal{P}$ be the set of probability distributions supported over $M \in \mathcal{M}$ with densities bounded from below by a constant $a$.
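As a toy sanity check (not from the talk): a circle of radius $r$ centered at the origin in the plane has reach exactly $r$ and one-dimensional volume (length) $2\pi r$, so membership in the class is easy to test. The helper `circle_in_class` and its parameter values are hypothetical.

```python
import math

def circle_in_class(r, c_d, tau):
    """Check the class conditions for a radius-r circle at the origin:
    contained in the unit ball, volume at most c_d, reach at least tau."""
    contained = r <= 1.0                # M ⊂ ball_D(0, 1)
    vol_ok = 2 * math.pi * r <= c_d     # vol(M) ≤ c_d, vol = circumference
    reach_ok = r >= tau                 # reach of a circle equals its radius
    return contained and vol_ok and reach_ok

print(circle_in_class(0.5, c_d=2 * math.pi, tau=0.1))   # True
print(circle_in_class(0.05, c_d=2 * math.pi, tau=0.1))  # False: reach too small
```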
38. We consider 4 different noise models.
Noiseless: $\mathcal{Q} = \mathcal{P}$.
Clutter: $\mathcal{Q} = \{(1 - \gamma)U + \gamma P : P \in \mathcal{P}\}$, where $U$ is uniform on $\mathrm{ball}(0, 1)$.
Tubular: $\mathcal{Q} = \{Q_{M,\sigma} : M \in \mathcal{M}\}$, where $Q_{M,\sigma}$ is uniform on the tube $M^\sigma$ and $\sigma < \tau$.
Additive: $\mathcal{Q} = \{P \star \Phi : P \in \mathcal{P}\}$, where $\Phi$ is Gaussian or has Fourier transform bounded away from 0, and $\tau$ is fixed.
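To make the four models concrete, here is an illustrative sampler (not from the talk) for the unit circle in the plane; the parameter values $\gamma = 0.8$, $\sigma = 0.05$ are made up for the sketch.

```python
import math
import random

def on_circle():
    """One point drawn uniformly from the unit circle in R^2."""
    t = random.uniform(0, 2 * math.pi)
    return (math.cos(t), math.sin(t))

def noiseless(n):
    return [on_circle() for _ in range(n)]

def clutter(n, gamma=0.8):
    """With prob. gamma a manifold point, else uniform clutter in ball(0, 1)."""
    pts = []
    for _ in range(n):
        if random.random() < gamma:
            pts.append(on_circle())
        else:
            while True:  # rejection-sample the unit disk
                x, y = random.uniform(-1, 1), random.uniform(-1, 1)
                if x * x + y * y <= 1:
                    pts.append((x, y))
                    break
    return pts

def tubular(n, sigma=0.05):
    """Uniform on the tube M^sigma, i.e. the annulus 1-sigma <= |x| <= 1+sigma;
    the sqrt gives the correct radial density for a uniform annulus."""
    pts = []
    for _ in range(n):
        t = random.uniform(0, 2 * math.pi)
        r = math.sqrt(random.uniform((1 - sigma) ** 2, (1 + sigma) ** 2))
        pts.append((r * math.cos(t), r * math.sin(t)))
    return pts

def additive(n, sd=0.05):
    """Convolution with a Gaussian: manifold point plus N(0, sd^2) noise."""
    return [(x + random.gauss(0, sd), y + random.gauss(0, sd))
            for (x, y) in noiseless(n)]
```

For example, `noiseless(100)` returns points of norm exactly 1, while `tubular(100)` returns points with norms spread across $[1 - \sigma, 1 + \sigma]$.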
43. Le Cam's Lemma is a powerful tool for proving minimax lower bounds.
Lemma. Let $\mathcal{Q}$ be a set of distributions. Let $\theta(Q)$ take values in a metric space $(X, \rho)$ for $Q \in \mathcal{Q}$. For any $Q_1, Q_2 \in \mathcal{Q}$,
$$\inf_{\hat\theta} \sup_{Q \in \mathcal{Q}} \mathbb{E}_{Q^n}\, \rho(\hat\theta, \theta(Q)) \geq \frac{1}{8}\, \rho(\theta(Q_1), \theta(Q_2))\, (1 - \mathrm{TV}(Q_1, Q_2))^{2n}$$
For homology, use the trivial metric: $\rho(x, y) = 0$ if $x = y$ and $1$ if $x \neq y$. Then
$$R_n = \inf_{\hat H} \sup_{Q \in \mathcal{Q}} Q^n(\hat H \neq H(M)) \geq \frac{1}{8}\, (1 - \mathrm{TV}(Q_1, Q_2))^{2n}$$
47. The lower bound requires two manifolds that are geometrically close but topologically distinct.
$B = \mathrm{ball}_d(0, 1 - \tau)$, $\quad A = B \setminus \mathrm{ball}_d(0, 2\tau)$
$M_1 = \partial(B^\tau)$, $\quad M_2 = \partial(A^\tau)$
(Figure: $M_1$ and $M_2$, showing the overlap.)
51. It suffices to bound the total variation distance.
Total Variation Distance:
$$\mathrm{TV}(Q_1, Q_2) = \sup_A |Q_1(A) - Q_2(A)| \leq a \max\{\mathrm{vol}(M_1 \setminus M_2), \mathrm{vol}(M_2 \setminus M_1)\} \leq C_d a \tau^d$$
Minimax Risk:
$$R_n \geq \frac{1}{8}(1 - \mathrm{TV}(Q_1, Q_2))^{2n} \geq \frac{1}{8}(1 - C_d a \tau^d)^{2n} \geq \frac{1}{8} e^{-2 C_d a \tau^d n}$$
Sampling Rate:
$$n(\varepsilon) \geq \frac{1}{\tau^d} \log \frac{1}{\varepsilon}$$
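Plugging numbers into these bounds gives a feel for the rate. This is a back-of-the-envelope sketch, not from the talk; `t` stands in for the quantity $C_d a \tau^d$.

```python
import math

def lecam_lower_bound(tv, n):
    """Le Cam lower bound on the minimax risk with the trivial metric:
    (1/8) * (1 - TV)^(2n)."""
    return (1 - tv) ** (2 * n) / 8

t = 0.01  # plays the role of C_d * a * tau^d
for n in (10, 100, 1000):
    print(n, lecam_lower_bound(t, n))

def sample_complexity(t, eps):
    """Smallest n with (1/8) * exp(-2*t*n) <= eps, i.e.
    n >= log(1 / (8*eps)) / (2*t) -- the 1/tau^d * log(1/eps) flavor
    of the sampling rate above."""
    return math.ceil(math.log(1 / (8 * eps)) / (2 * t))

print(sample_complexity(0.01, 1e-3))  # 242
```

The bound is $1/8$ when the two distributions are indistinguishable ($\mathrm{TV} = 0$) and decays geometrically in $n$ otherwise, which is what forces the logarithmic dependence on $1/\varepsilon$.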
63. The upper bound uses a union of balls to estimate the homology of M.
0. Denoise the data.
1. Take a union of balls.
2. Compute the homology of the resulting Čech complex.
To prove: The density is bounded from below near M and from above far from M.
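Steps 1–2 can be sketched in a few lines for the zeroth homology group (an illustrative toy, not the paper's full estimator): two balls of radius $r$ intersect iff their centers are within $2r$, so $\beta_0$ of the union of balls equals the number of connected components of that proximity graph, i.e. the 1-skeleton of the Čech complex suffices for $H_0$.

```python
import itertools
import math

def beta0(points, r):
    """Number of connected components of the union of radius-r balls
    around the given points, via union-find on the proximity graph."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in itertools.combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= 2 * r:
            parent[find(i)] = find(j)  # balls intersect: merge components

    return len({find(i) for i in range(len(points))})

# Two well-separated clusters: two components at a small radius,
# one component once the radius is large enough to bridge them.
pts = [(0, 0), (0.1, 0), (0, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
print(beta0(pts, 0.2))  # 2
print(beta0(pts, 4.0))  # 1
```

Higher homology groups need the full Čech (or Vietoris–Rips) complex and a boundary-matrix reduction, but the radius-selection issue is already visible here: too small a radius fragments M, too large a one fills in its holes.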
69. Many fundamental problems are still open.
1 Is the reach the right parameter?
2 What about manifolds with boundary?
3 Homotopy equivalence?
4 How to choose parameters?
5 Are there efficient algorithms?