Abstract: For many years, machine learning has focused on a key issue: the design of input features for solving prediction tasks. In this presentation, we show that many learning tasks, from structured output prediction to zero-shot learning, can benefit from an appropriate design of output features, broadening the scope of regression. As an illustration, I will briefly review different examples and recent results obtained in my team.
Let us talk about output features! by Florence d'Alché-Buc, LTCI & Full Professor @ Télécom ParisTech
1. Let us talk about output features!
Florence d'Alché-Buc
Joint work with Céline Brouard (Aalto U.), Juho Rousu (Aalto U.), Alexandre Garcia, Slim Essid, Chloé Clavel, Moussab Djerrab
LTCI, Télécom ParisTech
This work has been partially funded by the Télécom ParisTech Chair Machine Learning for Big Data
3. Supervised learning in a nutshell
Supervised learning helps us answer questions of the following form:
• Given some input object x ∈ X, provide a prediction ŷ of some output object y ∈ Y associated with x
using
• a training set Sn = {(xi, yi), i = 1, . . . , n}
• a class of functions H
• a loss ℓ(y, y′) that tells how much y and y′ differ
• a complexity measure Ω of a function h ∈ H
and a learning algorithm A able to build a predictive model hn from Sn, H, ℓ and Ω.
Assumption: the new data point x is assumed to be drawn from the same distribution as the sample Sn.
4. Input Features, Input representation
Once you are in the right space, it is easier to decide: choosing the features strongly influences the choice of models.
5. Input Features, Input representation
The choice of input representation to describe input objects is key to the success of machine learning algorithms:
• Feature space in tree-based methods
• Choice of kernels in kernel methods: say how you want to compare two objects with k(x, x′)
• Representation learning in deep learning: provide the raw data to the deep neural network; the first layers learn appropriate representations to be handled by the next layers
6. Question: what about output features?
• May the choice of appropriate output features help for the prediction task at hand?
• In other words: is it interesting to modify the output space to get an easier problem?
• What is the price of coming back to the original problem?
We will give a short overview of the use of output features in machine learning problems.
8. First example: input output kernel regression for metabolite identification
• x is a mass spectrum
• y is a metabolite signature (a binary vector encoding the presence or absence of substructures)
This is a difficult problem in chemoinformatics that requires in silico approaches.
9. Output features for structured output prediction
Choose φ : Y → F and solve the following simpler problem:
1. Learning problem: hn = argmin_h Σ_{i=1}^n L(φ(yi), h(xi)) + Ω(h)
2. Predictive model: fn(x) = argmin_{y ∈ Y} L(φ(y), hn(x))
Question: how do we choose L, and how do we choose the output feature map φ?
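The two-step recipe above can be sketched in a few lines. This is a toy illustration, not the talk's actual pipeline: the data are synthetic, the feature map φ is simply the identity on hypothetical binary signatures, and the regressor is plain ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical candidate output set Y: four binary "signatures" of length 3
Y_candidates = np.array([[0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 1, 0]], float)

def phi(y):
    # Toy output feature map: here simply the identity on the signature
    return y

# Training data: inputs x_i in R^5, each linked to one candidate output
X = rng.normal(size=(20, 5))
labels = rng.integers(0, 4, size=20)
Phi = phi(Y_candidates[labels])            # targets phi(y_i)

# 1. Learning problem: h_n = argmin_h sum_i ||phi(y_i) - h(x_i)||^2 + lam ||h||^2
lam = 0.1
W = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ Phi)

# 2. Predictive model: f_n(x) = argmin_{y in Y} ||phi(y) - h_n(x)||^2
def predict(x):
    h = x @ W
    dists = ((phi(Y_candidates) - h) ** 2).sum(axis=1)
    return Y_candidates[np.argmin(dists)]

pred = predict(X[0])
```

The decoding step is a brute-force search over the candidate set; for large structured spaces Y this search is the expensive part.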
10. Implicit Output features for structured output prediction
Use the kernel trick in the output space
Output Kernel Regression (Geurts et al. 2006, 2007; Brouard et al. 2011,
2016)
11. Implicit Output feature for structured output prediction
Use the kernel trick in the output space: example, metabolite identification
• L(φ(y), h(x)) = ‖φ(y) − h(x)‖²
• φ is not explicitly defined, but a kernel k on Y is (a Gaussian kernel on finite-dimensional fingerprints of the molecules)
• g is the decoding function: if k is normalized, we have fn(x) = g(hn(x)) = argmin_{y ∈ Y} ‖φ(y) − hn(x)‖²
• When h is chosen as a kernel-based model with an input kernel and an output kernel, everything goes nicely: closed-form solution!
Brouard et al., JMLR, 2016.
Brouard et al., Bioinformatics, 2016.
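The closed-form flavour can be sketched with kernel ridge regression toward the implicit output features. Everything below is an illustrative assumption (toy fingerprints, Gaussian kernels on both sides, the training outputs reused as the candidate set), not the actual IOKR implementation of Brouard et al.; the point is only that ‖φ(y) − hn(x)‖² is computable through output kernel evaluations alone.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: inputs in R^4, outputs are binary fingerprints in {0,1}^6
n = 30
X = rng.normal(size=(n, 4))
Y = (rng.random(size=(n, 6)) > 0.5).astype(float)

def gauss_kernel(A, B, gamma=0.5):
    # Gaussian kernel, used here both as input kernel k_x and output kernel k_y
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Closed form: h_n(x) = sum_i alpha_i(x) phi(y_i),
# with alpha(x) = (K_x + lam I)^{-1} k_x(X, x)
lam = 0.1
Kx = gauss_kernel(X, X)
M = np.linalg.inv(Kx + lam * np.eye(n))

def decode(x, candidates):
    alpha = M @ gauss_kernel(X, x[None, :])[:, 0]   # weights alpha_i(x)
    # ||phi(y) - h_n(x)||^2 = k_y(y, y) - 2 sum_i alpha_i k_y(y, y_i) + const,
    # and k_y(y, y) = 1 since the Gaussian kernel is normalized
    Kyc = gauss_kernel(candidates, Y)               # k_y(y, y_i)
    scores = 1.0 - 2.0 * Kyc @ alpha
    return candidates[np.argmin(scores)]

pred = decode(X[0], Y)
```

Note how φ never appears explicitly: both learning and decoding only touch the two Gram matrices.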
12. Overview of the problem
• Setup: we want to predict the labels of a known target structure (encoded by a directed graph).
"x = TripAdvisor review" ⇒ "y = sentence-level opinion annotations"
The room was OK, nothing special, still a perfect choice to quickly join the main places.
14. Overview of the problem
• Additional difficulty: we want to be able to abstain on some nodes of the graph while continuing to make predictions.
"x = TripAdvisor review" ⇒ "y = sentence-level opinion annotations"
The room was OK, nothing special, still a perfect choice to quickly join the main places.
16. Output features for hierarchical structure labeling with abstention
We seek a prediction function h and a reject function r.
• Learning deals with: hn = argmin_h Σ_i ‖ψwa(yi) − h(xi)‖² + Ω(h)
• Abstention is handled at the very last moment:
(fn(x), rn(x)) = argmin_{(yf, yr) ∈ YF × R} ⟨hn(x), C ψa(yf, yr)⟩
ψwa and ψa, with the help of C, take the tree structure into account.
Garcia et al., ICML 2018.
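To convey the idea of abstaining per node at decoding time, here is a generic reject-option sketch. It is deliberately simpler than the paper's ψwa/C construction: each coordinate of the predicted output-feature vector stands for one binary node label, and we abstain on coordinates whose score sits too close to the decision boundary.

```python
import numpy as np

def decode_with_abstention(h_x, threshold=0.3):
    # h_x: predicted output-feature vector for one input; each coordinate
    # stands for one node label in {0, 1}. Abstain (None) when the score
    # is within `threshold` of the 0.5 decision boundary.
    labels = []
    for s in h_x:
        if abs(s - 0.5) < threshold:
            labels.append(None)          # abstain on this node
        else:
            labels.append(int(s > 0.5))
    return labels

print(decode_with_abstention(np.array([0.9, 0.55, 0.1])))  # → [1, None, 0]
```

In the actual method, the abstention decision is made jointly over the whole tree via the cost matrix C rather than coordinate by coordinate.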
18. Third example: zero-shot learning
Multiclass classification? A simple question, really?
A human being is able to recognise an object (say, an animal) in an image even though he/she has never seen an instance of it before.
The classic setting of supervised learning does not address this issue: the relevant task is not just about recognising the index of a class, it is about recognising the concept underlying the class.
19. Realistic Scenario for Multiclass classification
• You know the set of possible classes Y
• Your training dataset
  • contains a handful of instances for each class: few-shot learning, the so-called small-data regime
  • contains at least one instance per class: one-shot learning
  • does not contain instances of some classes (Y = Yseen ∪ Yunseen): zero-shot learning
See Xian et al. 2018 (a review in IEEE Trans. PAMI).
20. Few-, one-, and zero-shot learning in image recognition
Use a semantic encoding z = φ(y) ∈ R^d of class y ∈ Y such that two close classes have close representations.
Major tool: take the name of an object class and encode it as a semantic vector in a finite-dimensional space with word2vec or GloVe.
1. Predictive model: fw(x) = argmax_{y ∈ Y} S(x, φ(y), w)
2. Learning problem: min_w Σ_{i=1}^n ℓ(φ(yi), fw(xi)) + Ω(w)
Question: how to improve φ? Learn a good encoding of the output data.
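A minimal zero-shot sketch of this scheme, under several stated assumptions: the class embeddings are random stand-ins for word2vec/GloVe vectors, the "image features" are synthetic, the compatibility S is a plain inner product, and the regressor is ridge regression trained on seen classes only.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in semantic codes phi(y) for 5 classes (in practice: word2vec/GloVe)
class_emb = rng.normal(size=(5, 8))
class_emb /= np.linalg.norm(class_emb, axis=1, keepdims=True)

seen = [0, 1, 2]          # training uses only these classes
unseen = [3, 4]

# Synthetic "image features": noisy linear images of the class embedding
A = rng.normal(size=(8, 12))
def make_x(c):
    return class_emb[c] @ A + 0.05 * rng.normal(size=12)

X_train = np.stack([make_x(c) for c in seen for _ in range(20)])
Z_train = np.stack([class_emb[c] for c in seen for _ in range(20)])

# Ridge regression from x into the semantic space
lam = 0.1
W = np.linalg.solve(X_train.T @ X_train + lam * np.eye(12), X_train.T @ Z_train)

def predict(x, candidate_classes):
    z_hat = x @ W
    sims = class_emb[candidate_classes] @ z_hat   # compatibility S(x, phi(y))
    return candidate_classes[int(np.argmax(sims))]

# Zero-shot step: classify an instance of an unseen class among unseen classes
x_new = make_x(3)
```

The key mechanism is visible in the last lines: nothing prevents the argmax from ranging over classes never seen during training, because they too have a semantic code φ(y).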
21. A relevant output embedding: the Fisher score!
A plug-in for any method that uses a semantic encoding z = φ(y):
ψ(z) = ∇θ log pθ(z) ∈ R^|θ|
Example: take pθ to be a Gaussian mixture model.
• Use ψ(z) as the new output feature vector that encodes a class y
• The new code ψ ∘ φ(y) highlights the proximity of some classes: those that belong to the same cluster but are not seen in the training phase will benefit from the learning as well.
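For a spherical GMM, the mean-gradient part of the Fisher score has a simple closed form: ∂ log pθ(z)/∂μk = γk(z)(z − μk)/σ², where γk(z) is the posterior responsibility of component k. The sketch below uses a fixed hypothetical two-component model (parameters chosen by hand, not estimated); a full Fisher vector would also include gradients with respect to the weights and variances.

```python
import numpy as np

# Hypothetical 2-component spherical GMM over semantic codes z in R^4
means = np.array([[0., 0., 0., 0.], [3., 3., 3., 3.]])
weights = np.array([0.5, 0.5])
sigma2 = 1.0

def responsibilities(z):
    # gamma_k(z) proportional to pi_k * N(z | mu_k, sigma2 I)
    log_comp = -((z - means) ** 2).sum(axis=1) / (2 * sigma2) + np.log(weights)
    log_comp -= log_comp.max()            # stabilise the softmax
    g = np.exp(log_comp)
    return g / g.sum()

def fisher_score_means(z):
    # d/d mu_k log p(z) = gamma_k(z) * (z - mu_k) / sigma2, stacked over k
    g = responsibilities(z)
    return (g[:, None] * (z - means) / sigma2).ravel()   # vector in R^{K*d}

z = np.array([2.9, 3.1, 3.0, 2.8])
psi = fisher_score_means(z)
```

For a point near one component, the score is dominated by that component's block, which is what makes classes in the same cluster get nearby codes.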
23. Output Fisher Embedding Regression
To wrap up:
1. First estimate θ from {z1, . . . , zC}, the set of semantic vectors encoding the classes
2. Encode each yi from the training dataset as ψθ(zi), with θ the estimate from step 1
3. Solve the regression problem with your preferred multiple-output regression tool: min_h Σ_i ‖ψθ(zi) − h(xi)‖² + Ω(h)
4. Prediction phase: for each x, compute argmin_y ‖ψ ∘ φ(y) − h(x)‖²
The approach is also valid for any structured output learning problem (results for text-to-time-series, for instance).
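The four steps can be strung together in a small end-to-end sketch. The numbers are all hypothetical: step 1 is assumed already done (the GMM parameters are fixed by hand rather than fitted), the semantic vectors are toy 2-D codes, and the regressor is ridge regression.

```python
import numpy as np

rng = np.random.default_rng(4)

# Semantic vectors z_c for C = 4 hypothetical classes (stand-ins for word2vec)
Z = np.array([[0., 0.], [0.5, 0.2], [3., 3.], [3.2, 2.8]])

# Step 1 (assumed done): theta of a 2-component spherical GMM over the z_c
means = np.array([[0.25, 0.1], [3.1, 2.9]])
weights = np.array([0.5, 0.5])
sigma2 = 0.5

def fisher_embed(z):
    # psi(z): gradient of log p_theta(z) w.r.t. the component means
    log_c = -((z - means) ** 2).sum(axis=1) / (2 * sigma2) + np.log(weights)
    log_c -= log_c.max()
    g = np.exp(log_c)
    g /= g.sum()
    return (g[:, None] * (z - means) / sigma2).ravel()

Psi = np.stack([fisher_embed(z) for z in Z])     # step 2: encode each class

# Step 3: synthetic training data, then ridge regression onto the embeddings
labels = rng.integers(0, 4, size=40)
X = Z[labels] @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(40, 6))
lam = 0.1
W = np.linalg.solve(X.T @ X + lam * np.eye(6), X.T @ Psi[labels])

# Step 4: decode by nearest Fisher embedding among the classes
def predict(x):
    return int(np.argmin(((Psi - x @ W) ** 2).sum(axis=1)))
```

Classes 0/1 and 2/3 fall into the same mixture components here, so their Fisher embeddings land in the same block of coordinates, mirroring the cluster-sharing effect described above.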
24. Experimental results for OFER: multiclass prediction task, Caltech101
Number of modes for the GMM: C = 2

Classification accuracy on test set: mean ± std (%)

# ex/class   m-SVM          Sem-IOKR       Sem-KRR        OFER-GMM
1            9.61 ± 3.98    13.40 ± 2.22   14.83 ± 4.02   38.22 ± 2.87
3            33.89 ± 1.79   22.51 ± 1.81   22.71 ± 2.33   46.33 ± 2.44
5            47.63 ± 2.87   24.90 ± 1.27   25.91 ± 1.28   49.40 ± 2.09
7            55.19 ± 2.43   26.84 ± 0.92   27.42 ± 1.59   50.39 ± 2.04
10           58.55 ± 1.84   31.27 ± 1.84   29.49 ± 1.39   50.49 ± 1.07

Table 1: Results on Caltech101 with a growing number of labeled examples per class.
25. Results on zero-shot learning
Top-1 accuracy in %

Method     SUN    CUB    AWA1   AWA2   aPY
IAP        19.4   24.0   35.9   35.9   36.6
CONSE      38.8   34.3   45.6   44.5   26.9
LATEM      55.3   49.3   55.1   55.8   35.2
ALE        58.1   54.9   59.9   62.5   39.7
DEVISE     56.5   52.0   54.2   59.7   39.8
SJE        53.7   53.9   65.6   61.9   32.9
ESZSL      54.5   53.9   58.2   58.6   38.3
SYNC       56.3   55.6   54.0   46.6   23.9
GFZSL      60.6   49.3   68.3   63.8   38.4
KRR-ZSL    0.27   7.24   2.77   1.95   0.3
OFER-ZSL   42.7   38.7   46.6   45.7   28.5
SJE-OFE    55.6   57.1   69.3   64.2   33.4

Table 2: Comparison of OFER-ZSL against state-of-the-art methods with (att) attributes.
27. • Encoding the outputs appropriately helps in structured output prediction
• The problem can be seen as defining new families of surrogate losses that involve solving an easy (fast-to-compute) subsidiary regression problem
• Output Fisher Embedding is just one example of a learned representation
• Current work: learning both the output feature map and the surrogate regressor, and extending the approach to deep learning
28. References
• Céline Brouard, Huibin Shen, Kai Dührkop, Florence d'Alché-Buc, Sebastian Böcker, Juho Rousu: Fast metabolite identification with Input Output Kernel Regression. Bioinformatics 32(12): 28-36 (2016)
• C. Brouard, M. Szafranski, F. d'Alché-Buc: Input Output Kernel Regression: Supervised and Semi-Supervised Structured Output Prediction with Operator-Valued Kernels. Journal of Machine Learning Research 17: 176:1-176:48 (2016)
• Moussab Djerrab, Alexandre Garcia, Maxime Sangnier, Florence d'Alché-Buc: Output Fisher embedding regression. Machine Learning 107(8-10): 1229-1256 (2018)
• Alexandre Garcia, Chloé Clavel, Slim Essid, Florence d'Alché-Buc: Structured Output Learning with Abstention: Application to Accurate Opinion Prediction. ICML 2018: 1681-1689
• Anna Korba, Alexandre Garcia, Florence d'Alché-Buc: A Structured Prediction Approach for Label Ranking. To appear, NIPS (2018)