Tommi Jaakkola, "Scaling Structured Prediction"
MIT Day at Yandex, Moscow office
Key points of the talk:
- Structured prediction in applications across natural language processing, computer vision, and computational biology.
- Dual decomposition methods as exact prediction algorithms.
- Methods for efficiently estimating structured prediction models without solving a single combinatorial problem of high complexity.
3. Structured prediction
• The goal is to learn a mapping from input examples (x)
to complex objects (y)
- e.g., from sentences (x) to dependency parses (y)
y=
x= * John saw a movie yesterday that he liked
4. Structured prediction
• The goal is to learn a mapping from input examples (x)
to complex objects (y)
- e.g., from pairs of images (x) to disparity maps (y)
y = disparity map
x = stereo image pair (Scharstein '07)
[figure: left images of the Middlebury stereo datasets (Art, Books, Dolls, Laundry), Scharstein & Pal 07]
• We'd like to learn these score functions s(y; x)
5. Structured prediction
• The goal is to learn a mapping from input examples (x)
to complex objects (y)
- e.g., from pairs of web pages (x) to their alignments (y)
y = semantic alignment
x = two web page designs (Kumar et al., '10)
["A Structured-Prediction Algorithm for Example-Based Web Design", Ranjitha Kumar, Jerry O. Talton, Salman Ahmad, Scott R. Klemmer, Stanford University: the Bricolage system learns coherent mappings between human-generated exemplar pages and can automatically transfer the content of one page into the style and layout of another]
7. Goals and challenges
• Goals
- use rich classes of output structures
- exercise fine control of how structures are chosen (scoring)
- learn models efficiently from data
• Challenges
- prediction problems are often provably hard
- most learning algorithms rely on explicit predictions and are
therefore inefficient with large amounts of data
- richer structures lead to ambiguity
8. Structured prediction
• The goal is to learn a mapping from input examples (x)
to complex objects (y)
- e.g., from sentences (x) to dependency parses (y)
y=
x= * John saw a movie yesterday that he liked
- in lexicalized dependency parsing, we draw an arc from the
head word of each phrase to words that modify it
- the resulting parse is a directed tree. In many languages, the
tree is non-projective (crossing arcs)
- each sentence is mapped to arc scores; the parse is obtained
as the highest scoring directed tree
9. Structured prediction
• The goal is to learn a mapping from sentences (x) to
dependency parses (y)
y =
x = * John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
y(i, j) = 1 if arc i → j is selected, and zero otherwise
10. Structured prediction
• The goal is to learn a mapping from sentences (x) to
dependency parses (y)
y =
x = * John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
y(i, j) = 1 if arc i → j is selected, and zero otherwise
x → w · f(x; i, j) = θ(i, j)
    (sentence, features, parameters → arc scores)
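To make the arc-scoring step concrete, here is a minimal sketch of θ(i, j) = w · f(x; i, j) as a sparse dot product. The feature templates and weight values are hypothetical toy choices for illustration, not the ones used in the talk.

```python
def arc_features(words, i, j):
    """Sparse binary features for a candidate arc i -> j (a hypothetical toy set)."""
    head = words[i] if i > 0 else "*ROOT*"
    mod = words[j]
    return {
        f"head={head}": 1.0,          # identity of the head word
        f"mod={mod}": 1.0,            # identity of the modifier word
        f"pair={head}_{mod}": 1.0,    # head-modifier conjunction
        f"dist={j - i}": 1.0,         # signed head-modifier distance
    }

def arc_score(w, words, i, j):
    """theta(i, j) = w · f(x; i, j), computed as a sparse dot product."""
    return sum(w.get(name, 0.0) * v for name, v in arc_features(words, i, j).items())

words = ["*", "John", "saw", "a", "movie"]   # position 0 is the root symbol
w = {"pair=saw_John": 2.0, "head=saw": 0.5}  # toy weight vector
theta = arc_score(w, words, 2, 1)            # score of the arc saw -> John
```

Only the features that fire on an arc and also appear in w contribute, which is what makes the dot product cheap even with millions of feature templates.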
11. Structured prediction
• The goal is to learn a mapping from sentences (x) to
dependency parses (y)
y =
x = * John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
y(i, j) = 1 if arc i → j is selected, and zero otherwise
x → w · f(x; i, j) = θ(i, j)
    (sentence, features, parameters → arc scores)
y* = argmax_y { Σ_{i,j} y(i, j) θ(i, j) + θ_T(y) }
    (highest scoring tree)
12. Structured prediction
y =
x = * John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
• The complexity of the prediction task depends on how
we score each candidate tree
• In an arc-factored model (as before) each arc is scored
separately: Σ_{i,j} y(i, j) θ(i, j)
• The highest scoring tree is found as the maximum
weighted directed spanning tree
y* = argmax_y { Σ_{i,j} y(i, j) θ(i, j) + θ_T(y) }
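The maximum weighted directed spanning tree can be found in O(n²) with the Chu-Liu-Edmonds algorithm; as an illustration of the argmax itself, the sketch below simply brute-forces all head assignments for tiny sentences. The function names and the score dictionary are hypothetical, not from the talk.

```python
from itertools import product

def best_tree(theta, n):
    """Arc-factored prediction by enumeration: theta[(i, j)] is the score of
    arc i -> j over nodes 0..n, where node 0 is the root.  Returns the score
    and the head assignment of the best directed spanning tree."""
    nodes = range(1, n + 1)
    best, best_heads = float("-inf"), None
    for heads in product(range(n + 1), repeat=n):
        head_of = dict(zip(nodes, heads))
        if any(h == j for j, h in head_of.items()):
            continue  # self-loop
        # the arcs form a tree iff every node reaches the root without a cycle
        if not all(_reaches_root(j, head_of) for j in nodes):
            continue
        score = sum(theta.get((h, j), float("-inf")) for j, h in head_of.items())
        if score > best:
            best, best_heads = score, head_of
    return best, best_heads

def _reaches_root(j, head_of):
    seen = set()
    while j != 0:
        if j in seen:
            return False  # cycle detected
        seen.add(j)
        j = head_of[j]
    return True
```

Enumeration is exponential in n; real parsers replace this loop with Chu-Liu-Edmonds while keeping exactly the same objective.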
14. Structured prediction
y =
x = * John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
• The complexity of the prediction task depends on how
we score each candidate tree
• It is often advantageous to include interactions between
modifiers (outgoing arcs), known as "sibling scoring":
    Σ_i θ_i(y|_i),  where y|_i = { y(i, j), j ≠ i }
15. Structured prediction
y =
x = * John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
• The complexity of the prediction task depends on how
we score each candidate tree
• It is often advantageous to include interactions between
modifiers (outgoing arcs), known as "sibling scoring":
    Σ_i θ_i(y|_i),  where y|_i = { y(i, j), j ≠ i }
• Finding the highest scoring tree is now NP-hard
(McDonald and Satta, 2007)
    y* = argmax_y { Σ_i θ_i(y|_i) + θ_T(y) }
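Note that *evaluating* the sibling-augmented objective for one fixed candidate parse is easy; only the maximization over trees is NP-hard. A minimal sketch of the evaluation, with hypothetical toy scores:

```python
def sibling_sets(arcs, n):
    """y|_i: the set of modifiers chosen for each head i (0 is the root)."""
    return {i: frozenset(j for h, j in arcs if h == i) for i in range(n + 1)}

def total_score(arcs, n, theta_arc, theta_sib):
    """sum_i theta_i(y|_i) + theta_T(y) for a given set of arcs."""
    arc_part = sum(theta_arc.get(a, 0.0) for a in arcs)          # theta_T(y)
    sib_part = sum(theta_sib.get((i, mods), 0.0)                 # sum_i theta_i(y|_i)
                   for i, mods in sibling_sets(arcs, n).items())
    return arc_part + sib_part
```

The sibling term looks up whole modifier sets at once, which is exactly the interaction that breaks the spanning-tree structure of the search problem.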
16. Decomposition
* John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
θ_T(y): directed tree, arc-factored scores
θ_0(y|_0), θ_2(y|_2), ..., θ_n(y|_n): modifiers (outgoing arcs), solved
separately for each word
• We can always turn a hard problem into an easy one by
solving each “part” separately from others
• But the parts are unlikely to agree on a solution ...
17. Dual decomposition
* John saw a movie yesterday that he liked
    i = 0   1    2   ...              n
θ_T(y)  →  θ_T(y) + Σ_{i,j} y(i, j) λ(i, j)
    (directed tree, arc-factored scores → effective arc scores)
θ_0(y|_0), θ_2(y|_2), ..., θ_n(y|_n)  →  θ_i(y|_i) − Σ_{j≠i} y(i, j) λ(i, j)
    (modifiers (outgoing arcs), solved separately for each word;
    agreement encouraged through the multipliers λ(i, j))
• We can encourage parts to agree on the maximizing
arcs via Lagrange multipliers (c.f. Guignard, Fisher, ‘80s)
18. Dual decomposition algorithm
• An iterative sub-gradient algorithm (Koo et al., 2010)
* John saw a movie yesterday that he liked
find a directed spanning tree:
    ŷ = argmax_y { θ_T(y) + Σ_{i,j} y(i, j) λ(i, j) }
find the modifiers of each word:
    ŷ′|_i = argmax_{y|_i} { θ_i(y|_i) − Σ_{j≠i} y(i, j) λ(i, j) }
update the Lagrange multipliers based on disagreement:
    λ(i, j) ← λ(i, j) + α_k ( ŷ′(i, j) − ŷ(i, j) )
19. Dual decomposition algorithm
• An iterative sub-gradient algorithm (Koo et al., 2010)
* John saw a movie yesterday that he liked
find a directed spanning tree:
    ŷ = argmax_y { θ_T(y) + Σ_{i,j} y(i, j) λ(i, j) }
find the modifiers of each word:
    ŷ′|_i = argmax_{y|_i} { θ_i(y|_i) − Σ_{j≠i} y(i, j) λ(i, j) }
update the Lagrange multipliers based on disagreement:
    λ(i, j) ← λ(i, j) + α_k ( ŷ′(i, j) − ŷ(i, j) )
• Thm: The solution is optimal if an agreement (no
updates) is reached
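The three steps above can be sketched as a generic sub-gradient loop. The two exact sub-solvers (`tree_oracle`, `sibling_oracle`) are assumed to be supplied by the caller; the names and the diminishing step-size rule here are illustrative choices, not the specific ones of Koo et al.

```python
def dual_decomposition(tree_oracle, sibling_oracle, arcs, iters=100, alpha0=1.0):
    """Sub-gradient loop for two sub-problems coupled through shared arcs.
    Each oracle maps the multipliers to a dict arc -> 0/1 indicator."""
    lam = {a: 0.0 for a in arcs}          # Lagrange multipliers lambda(i, j)
    for k in range(1, iters + 1):
        y_tree = tree_oracle(lam)         # argmax theta_T(y) + sum y(i,j) lam(i,j)
        y_sib = sibling_oracle(lam)       # argmax sum_i theta_i(y|_i) - sum y(i,j) lam(i,j)
        if y_tree == y_sib:
            return y_tree, True           # agreement => certified optimal solution
        alpha = alpha0 / k                # diminishing step size
        for a in arcs:                    # lam += alpha_k * (y_sib - y_tree)
            lam[a] += alpha * (y_sib[a] - y_tree[a])
    return y_tree, False                  # no certificate within the budget
```

The returned flag is the certificate from the theorem above: when the parts agree, the dual bound is tight and the solution is exactly optimal.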
20. Dual decomposition in practice
• The table shows the percentage of test cases where the
sub-gradient algorithm quickly finds the optimal solution
       CertS   CertG
Dan    99.07   98.45
Dut    98.19   97.93
Por    99.65   99.31
Slo    90.55   95.27
Swe    98.71   98.97
Tur    98.72   99.04
Eng    98.65   99.18
Eng2   98.96   99.12
Dan    98.50   98.50
Dut    98.00   99.50
21. Goals and challenges
• Goals
- use rich classes of output structures
- exercise fine control of how structures are chosen (scoring)
- learn models efficiently from data
• Challenges
✓ - prediction problems may be provably hard but we can solve
practical instances effectively with decomposition methods
- most learning algorithms rely on explicit predictions and are
therefore inefficient with large amounts of data
- richer structures lead to ambiguity
22. Learning to predict
• We’d like to estimate the score functions from data
such that
y^(i) ≈ argmax_{y ∈ Y} { w · f(x^(i), y) },   i = 1, ..., n
    (parameterized scores)
- e.g., lexicalized dependency parsing
y^(1): x^(1) = * John saw a movie
y^(2): x^(2) = * kids make nutritious snacks
...
23. Learning to predict
• We'd like to estimate the score functions from data such that
    y^(i) ≈ argmax_{y ∈ Y} s(y; x^(i)),   i = 1, ..., n
• Prediction is often done by maximizing an MRF:
    s(y; x) = Σ_f θ_f(y_f; x)
- e.g., stereo reconstruction
y^(1), y^(2), ... = disparity maps;  x^(1), x^(2), ... = stereo image pairs
[Scharstein & Pal 07, Middlebury dataset: six stereo pairs (Art, Books, Dolls, Laundry, Reindeer, Moebius) with ground-truth disparities]
24. Learning to predict
• We’d like to estimate the score functions from data
such that
y^(i) ≈ argmax_{y ∈ Y} { w · f(x^(i), y) },   i = 1, ..., n
    (parameterized scores)
• The prediction problem can be challenging. Can we
learn the parameters more easily?
25. Learning to predict
• We’d like to estimate the score functions from data
such that
y^(i) ≈ argmax_{y ∈ Y} { w · f(x^(i), y) },   i = 1, ..., n
    (parameterized scores)
• The prediction problem can be challenging. Can we
learn the parameters more easily?
• Thm: (Sontag et al.) If “max” is hard, then learning is
hard as well
26. Learning to predict
• We’d like to estimate the score functions from data
such that
y^(i) ≈ argmax_{y ∈ Y} { w · f(x^(i), y) },   i = 1, ..., n
    (parameterized scores)
• Each training example introduces (often) exponentially
many linear constraints
w · f(x^(i), y^(i)) > w · f(x^(i), y),   ∀ y ∈ Y \ {y^(i)}
    (score of the target structure > score for an alternative,
    over the set of all alternatives)
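To see why explicit enumeration of these constraints is infeasible, the toy check below materializes them for a binary output space, where each example already yields 2^n − 1 alternatives. The feature map and names are illustrative assumptions, not from the talk.

```python
from itertools import product

def dot(w, feats):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(w.get(k, 0.0) * v for k, v in feats.items())

def violated_constraints(w, f, x, y_target, n):
    """Alternatives y in {0,1}^n whose score is not beaten by the target,
    i.e. the margin constraints w·f(x, y_target) > w·f(x, y) that fail."""
    target_score = dot(w, f(x, y_target))
    return [y for y in product([0, 1], repeat=n)
            if y != y_target and dot(w, f(x, y)) >= target_score]
```

Even at n = 30 this loop visits over a billion alternatives per example, which is the inefficiency the next slides address.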
28. Learning with pseudo-max
• We’d like to estimate the score functions from data
such that
y^(i) ≈ argmax_{y ∈ Y} { w · f(x^(i), y) },   i = 1, ..., n
    (parameterized scores)
• Each training example now provides a small number of
linear constraints for alternatives “around the target”
w · f(x^(i), y^(i)) > w · f(x^(i), y),   ∀ y ∈ Y^(i)
    (score of the target structure > score for an alternative,
    over a reduced set of alternatives)
where each alternative may differ from the target in at
most one (or a few) coordinates
29. Learning with pseudo-max
• We’d like to estimate the score functions from data
such that
y^(i) ≈ argmax_{y ∈ Y} { w · f(x^(i), y) },   i = 1, ..., n
    (parameterized scores)
• Each training example now provides a small number of
linear constraints for alternatives “around the target”
w · f(x^(i), y^(i)) > w · f(x^(i), y),   ∀ y ∈ Y^(i)
    (score of the target structure > score for an alternative,
    over a reduced set of alternatives)
• Thm: consistency still guaranteed in “restricted” cases
(cf. pseudo-likelihood)
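The reduced set can be made concrete for binary outputs: instead of all 2^n − 1 alternatives, pseudo-max compares the target only against the n neighbors that flip a single coordinate. A minimal sketch under that assumption (names are illustrative):

```python
def one_flip_neighbors(y_target):
    """Binary outputs differing from the target in exactly one coordinate:
    the reduced alternative set Y^(i) used by pseudo-max."""
    out = []
    for j in range(len(y_target)):
        y = list(y_target)
        y[j] = 1 - y[j]
        out.append(tuple(y))
    return out

def pseudo_max_satisfied(score, y_target):
    """True if the target strictly outscores every one-flip neighbor,
    i.e. all of the reduced linear constraints hold."""
    return all(score(y_target) > score(y) for y in one_flip_neighbors(y_target))
```

Checking n constraints per example instead of 2^n − 1 is what removes the inference bottleneck from learning; the consistency theorem above says this weaker condition can still identify the right parameters in restricted cases.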
30. Learning with pseudo-max
• When the assumptions are strictly correct
[plot: test error (0 to 0.2) vs. training set size (10^1 to 10^3) for exact, LP-relaxation, and pseudo-max]
• In practice (multi-label prediction)
[plot: test error (0 to 0.4) vs. training set size (10^1 to 10^4) for exact, LP-relaxation, and pseudo-max]
31. Goals and challenges
• Goals
- use rich classes of output structures
- exercise fine control of how structures are chosen (scoring)
- learn models efficiently from data
• Challenges
✓ - prediction problems may be provably hard but we can solve
practical instances effectively with decomposition methods
✓ - most learning algorithms rely on explicit predictions and are
therefore inefficient. Much weaker predictions (constraints)
may suffice for learning.
- richer structures lead to ambiguity
32. Dealing with ambiguity
• Ambiguity underlies many problems that are otherwise
well suited for structured prediction
- e.g., dependency parsing
* kids make nutritious snacks
- e.g., pose estimation