Presentation of published work testing the most advanced, state-of-the-art syntactic parsers based on deep neural networks (DNNs) on Italian. We carried out a set of experiments using the Universal Dependencies benchmarks and propose a new solution based on ensemble systems, obtaining very good performance.
CLiC-it 2018 Presentation
1. Parsing Italian texts together is better
than parsing them alone!
Oronzo Antonelli, Fabio Tamburini
University of Bologna
CLiC-it 2018, 10 December 2018
2. Two Goals
Test the effectiveness of Dependency Parsers based on Deep Neural
Networks on Italian.
We collected nine different state-of-the-art parsers;
All parsers' hyper-parameters were set following the developers'
recommendations to obtain the best performance.
Propose ensemble systems able to further improve the neural parsers'
performance on Italian texts.
Focus on ensemble systems that can be built using pre-trained
parsing models.
Antonelli and Tamburini Parsing Italian texts together is better than parsing them alone! 10/12/2018 1 / 12
3. Parsers
All parsers considered in this study are based on two popular
approaches:
Transition-based: train a classifier to predict the next transition given
the previous ones.
Graph-based: learn the score of each arc and then find the
dependency tree using a maximum spanning tree (MST) algorithm.
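The two approaches can be made concrete with a small sketch. The code below (hypothetical, not from the paper) is the static oracle for the arc-standard transition system: given the gold heads of a projective tree, it derives the SHIFT / LEFT-ARC / RIGHT-ARC sequence that a transition-based parser's classifier learns to predict.

```python
# Hypothetical sketch (not from the paper): the static oracle for the
# arc-standard transition system, assuming a projective gold tree.
def arc_standard_oracle(heads):
    """heads[i] is the head of word i+1; 0 is the artificial root."""
    n = len(heads)
    buffer = list(range(1, n + 1))
    stack = [0]                       # the artificial root starts on the stack
    pending = [0] * (n + 1)           # dependents each head still has to collect
    for h in heads:
        pending[h] += 1
    transitions = []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            # LEFT-ARC: s1 depends on s0 and has collected all its dependents
            if s1 != 0 and heads[s1 - 1] == s0 and pending[s1] == 0:
                transitions.append(("LEFT-ARC", s0, s1))
                stack.pop(-2)
                pending[s0] -= 1
                continue
            # RIGHT-ARC: s0 depends on s1 and has collected all its dependents
            if heads[s0 - 1] == s1 and pending[s0] == 0:
                transitions.append(("RIGHT-ARC", s1, s0))
                stack.pop()
                pending[s1] -= 1
                continue
        if not buffer:
            raise ValueError("gold tree is not projective")
        transitions.append(("SHIFT", buffer[0]))
        stack.append(buffer.pop(0))
    return transitions
```

For "He eats" with heads [2, 0] this yields SHIFT, SHIFT, LEFT-ARC(2,1), RIGHT-ARC(0,2). A graph-based parser instead scores all candidate head-dependent arcs directly and runs an MST algorithm over the resulting weighted graph.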
Parser Approach Architecture Optimizer
Chen & Manning (2014) T-based MLP AdaGrad
Ballesteros et al. (2015) T-based Stack LSTM SGD
Kiperwasser & Goldberg (2016) T/G-based Deep BiLSTM with MLP Adam
Andor et al. (2016) T-based MLP Momentum
Cheng et al. (2016) G-based BiGRU attention with MLP AdaGrad
Dozat & Manning (2017) G-based Deep Biaffine attention with MLP Adam
Shi et al. (2017) T/G-based Deep Biaffine attention with MLP Adam
Nguyen et al. (2017) G-based Deep BiLSTM with MLP Adam
4. Setups and Evaluation Metrics
Two Italian corpora from the Universal Dependencies (UD) project
have been used to train/evaluate the models:
UD Italian 2.1, composed of generic-domain texts, contains 13,884
sentences with a train/dev/test split of 12,838/564/482;
UD PoSTWITA 2.2, composed of social media texts, contains
6,713 sentences with a train/dev/test split of 5,368/671/674.
The train/validation/test cycle was executed 5 times for each of the
9 parsers, considering three different setups:
Setup0 uses only the UD Italian 2.1 dataset (generic);
Setup1 uses only the UD Italian PoSTWITA 2.2 dataset (domain);
Setup2 uses the UD Italian 2.1 dataset joined with the UD Italian
PoSTWITA 2.2 dataset, keeping the PoSTWITA test set (mixed).
Two standard accuracy metrics were selected to evaluate the models
with respect to the gold standard:
UAS (Unlabelled Attachment Score): percentage of words assigned the same head as the gold standard.
LAS (Labelled Attachment Score): percentage of words assigned the same head and dependency relation as the gold standard.
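A minimal sketch of how the two metrics are computed, assuming each parse is a list of per-word (head, relation) pairs (illustrative code, not the official evaluation script):

```python
# Illustrative sketch of the two metrics: each parse is a list of
# (head, relation) pairs, one per word, compared against the gold standard.
def uas_las(gold, predicted):
    """Return (UAS, LAS) as percentages over all words."""
    n = len(gold)
    head_ok = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    both_ok = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * head_ok / n, 100.0 * both_ok / n

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "iobj")]   # wrong relation on word 3
uas, las = uas_las(gold, pred)                    # UAS = 100.0, LAS ≈ 66.7
```

UAS only checks the head, so the mislabelled third word still counts; LAS also requires the relation, so it drops to 2 out of 3.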
5. Results on UD Italian texts (Setup0), µ ± σ.
Valid. UD Ita Test UD Ita
UAS LAS UAS LAS
C&M (2014) 88.20±0.18% 85.46±0.14% 89.33±0.17% 86.85±0.22%
Ballesteros et al. (2015) 91.15±0.11% 88.55±0.23% 91.57±0.38% 89.15±0.33%
K&G (2016) – T 91.17±0.29% 88.42±0.24% 91.21±0.33% 88.72±0.24%
K&G (2016) – G 91.85±0.27% 89.23±0.31% 92.04±0.18% 89.65±0.10%
Andor et al. (2016) 85.52±0.34% 77.67±0.30% 87.70±0.31% 79.48±0.24%
Cheng et al. (2016) 92.42±0.00% 89.60±0.00% 92.82±0.00% 90.26±0.00%
D&M (2017) 93.37±0.27% 91.37±0.24% 93.72±0.14% 91.84±0.18%
Shi et al. (2017) 89.67±0.24% 85.05±0.24% 89.89±0.29% 84.55±0.30%
Nguyen et al. (2017) 90.37±0.12% 87.19±0.21% 90.67±0.15% 87.58±0.11%
The previous best results in Italian dependency parsing were obtained at
EVALITA 2014, with UAS 93.55% and LAS 88.76% on a subset of UD Italian.
6. Results on UD PoSTWITA texts (Setup1), µ ± σ.
Valid. UD PoSTW Test UD PoSTW
UAS LAS UAS LAS
C&M (2014) 81.03±0.17% 75.24±0.30% 81.50±0.28% 76.07±0.17%
Ballesteros et al. (2015) 83.44±0.20% 77.70±0.25% 84.06±0.38% 78.64±0.44%
K&G (2016) – T 77.38±0.14% 68.81±0.25% 77.41±0.43% 69.13±0.43%
K&G (2016) – G 78.81±0.23% 70.14±0.33% 78.78±0.44% 70.52±0.51%
Andor et al. (2016) 77.74±0.25% 66.63±0.16% 77.78±0.33% 67.21±0.30%
Cheng et al. (2016) 84.78±0.00% 78.51±0.00% 86.12±0.00% 79.89±0.00%
D&M (2017) 85.01±0.16% 78.80±0.09% 86.26±0.16% 80.40±0.19%
Shi et al. (2017) 80.52±0.18% 73.71±0.14% 81.11±0.29% 74.53±0.26%
Nguyen et al. (2017) 82.02±0.11% 75.20±0.24% 82.74±0.39% 76.22±0.41%
7. Results on UD Italian+PoSTWITA texts (Setup2), µ ± σ.
Valid. UD Ita+PoSTW Test UD PoSTW
UAS LAS UAS LAS
C&M (2014) 85.52±0.13% 81.51±0.05% 82.62±0.24% 77.45±0.23%
Ballesteros et al. (2015) 87.85±0.13% 83.80±0.12% 85.15±0.29% 80.12±0.27%
K&G (2016) – T 83.89±0.23% 77.77±0.26% 80.47±0.36% 72.92±0.46%
K&G (2016) – G 84.70±0.14% 78.41±0.14% 81.41±0.37% 73.49±0.19%
Andor et al. (2016) 82.95±0.33% 73.46±0.37% 79.81±0.27% 69.19±0.19%
Cheng et al. (2016) 89.16±0.00% 84.56±0.00% 86.85±0.00% 80.93±0.00%
D&M (2017) 89.72±0.10% 85.85±0.13% 87.22±0.24% 81.65±0.21%
Shi et al. (2017) 85.85±0.36% 80.00±0.39% 83.12±0.50% 76.38±0.38%
Nguyen et al. (2017) 86.81±0.04% 82.13±0.09% 84.09±0.07% 78.02±0.11%
8. Ensemble systems: Theoretical Gain
Let us consider two oracles (Choi et al. 2015):
Micro chooses the best dependency relation among the m dependency
relations involved in the ensemble;
Macro chooses the best tree for a sentence among the m dependency
trees involved in the ensemble.
Results for an ensemble system using Micro and Macro oracles
and considering all parsers.
Validation Test
UAS LAS UAS LAS
Setup0
Micro 98.30% 97.82% 98.08% 97.72%
Macro 96.62% 95.10% 96.31% 94.82%
Setup2
Micro 97.08% 96.02% 96.32% 94.73%
Macro 94.62% 91.29% 93.27% 88.50%
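The two oracles can be sketched as follows, counting correct arcs under the same per-word (head, relation) representation (hypothetical helper functions, not Choi et al.'s implementation):

```python
# Hedged sketch of the two oracles: each parse is a list of per-word
# (head, relation) pairs; `parses` holds the m ensemble outputs.
def micro_oracle(gold, parses):
    """Correct arcs if, per word, we could pick the best of the m parsers."""
    return sum(any(p[i] == g for p in parses) for i, g in enumerate(gold))

def macro_oracle(gold, parses):
    """Correct arcs if we could pick the single best whole tree."""
    return max(sum(p[i] == g for i, g in enumerate(gold)) for p in parses)

gold = [(2, "nsubj"), (0, "root")]
parses = [[(2, "nsubj"), (1, "root")],   # parser A gets word 1 right
          [(1, "nsubj"), (0, "root")]]   # parser B gets word 2 right
```

Here Micro recovers both arcs (one from each parser) while Macro is stuck with the best single tree (one arc), which is why the Micro rows in the table dominate the Macro rows.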
9. Tested Ensemble Techniques
Voting. Each parser contributes by assigning a vote on every
dependency edge.
Majority: for each word, the edge with the highest number of votes is
taken; in case of a draw, the choice of the first parser is used.
Switching: since majority voting could produce an ill-formed dependency
tree, in that case the whole tree is replaced with the output of the first parser.
Reparsing. An MST algorithm is used to reparse a graph built using
each word in the sentence as a node, the edges from all the parses, and
the number of votes as edge weights.
cle: Chu-Liu/Edmonds algorithm.
eisner: Eisner algorithm.
Distilling. Train a distillation parser using a loss function with a cost
that incorporates ensemble uncertainty estimates for each possible
attachment.
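The majority vote with its tie-break can be sketched as follows, assuming each parse is a per-word list of (head, relation) pairs and parsers are listed in a fixed order of trust (illustrative code, not the actual system):

```python
from collections import Counter

# Hedged sketch of the majority strategy: parses[0] is the most trusted
# parser, used to break ties.
def vote_majority(parses):
    """Per word, keep the arc with the most votes; ties go to the first parser."""
    result = []
    for i in range(len(parses[0])):
        votes = Counter(p[i] for p in parses)
        best = max(votes.values())
        # scan in parser order so a draw resolves to the first parser's choice
        result.append(next(p[i] for p in parses if votes[p[i]] == best))
    return result
```

Switching would then run this vote, check the resulting head assignment for well-formedness, and fall back to the first parser's full tree when the check fails.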
10. Setup
The best model on the validation set was taken from Setup0 (UD Italian)
and Setup2 (UD Italian + PoSTWITA).
For the voting approach the following parser combinations were
used:
The best three (DM17+CH16+BA15);
The worst three (AN16+CM14+SH17);
The best plus those with lowest agreement (DM17+CM14+SH17);
The worst plus all the others (AN16+ALL);
The best plus all the others (DM17+ALL).
For the reparsing approach the following parser combinations were
used:
The best three (DM17+CH16+BA15);
All parsers (ALL).
For the distilling approach we considered the combination of all
parsers together (ALL).
11. Comparing the Ensembles Results
Differences in performance on the test set with respect to
the best single parser (DM17).
Setup0
Ensemble strategy UAS LAS
Voting: majority (DM17+ALL) 93.94% (+0.19%) 92.41% (+0.38%)
Voting: switching (DM17+ALL) 93.91% (+0.16%) 92.37% (+0.34%)
Reparsing: cle (ALL) 94.00% (+0.25%) 92.48% (+0.45%)
Reparsing: eisner (ALL) 93.95% (+0.20%) 92.35% (+0.32%)
Distilling (ALL) 92.50% (–1.25%) 89.93% (–2.10%)
Setup2
Ensemble strategy UAS LAS
Voting: majority (DM17+ALL) 88.51% (+0.92%) 84.42% (+2.47%)
Voting: switching (DM17+ALL) 88.50% (+0.91%) 84.20% (+2.25%)
Reparsing: cle (ALL) 88.36% (+0.77%) 84.25% (+2.30%)
Reparsing: eisner (ALL) 88.31% (+0.72%) 84.08% (+2.13%)
Distilling (ALL) 86.73% (–0.86%) 81.39% (–0.56%)
12. Voting-Majority Side Effects
Even if the voting-majority strategy exhibits good results, we have to
consider that it may produce some ill-formed dependency trees.
The number of ill-formed trees obtained using the majority
strategy in both setups is reported in the following table:
Setup0 Setup2
Voters Valid Test Valid Test Average
DM17+CH16+BA15 9/564 7/482 31/1235 31/674 2.5%
AN16+CM14+SH17 45/564 25/482 88/1235 77/674 7.9%
DM17+CM14+SH17 6/564 6/482 19/1235 23/674 1.8%
AN16+ALL 18/564 17/482 73/1235 63/674 5.5%
DM17+ALL 17/564 11/482 75/1235 57/674 5.0%
For tasks that do not involve a subsequent manual correction,
the majority strategy is not the recommended choice.
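The check behind these counts can be sketched in a few lines (a minimal sketch, assuming heads[i] is the voted head of word i+1 and 0 is the artificial root): the assignment is a well-formed tree only if every word reaches the root without revisiting a node.

```python
# Minimal sketch of the well-formedness check on a voted head assignment.
def is_well_formed(heads):
    for start in range(1, len(heads) + 1):
        seen, node = set(), start
        while node != 0:
            if node < 0 or node > len(heads) or node in seen:
                return False          # out-of-range head, or a cycle:
            seen.add(node)            # this word can never reach the root
            node = heads[node - 1]
    return True

# [2, 0, 2] is a valid tree; in [2, 3, 1] the three words form a cycle.
```

Majority voting decides each word's head independently, so nothing prevents such cycles; this is exactly the case the switching strategy repairs.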
13. Conclusions
Our experiments show that recent neural parsers achieve results
that define the new state of the art for Italian (both on UD Italian
and UD PoSTWITA).
The ensemble models we proposed were able to improve single-parser
performance, especially when using in-domain data (PoSTWITA),
exhibiting relevant improvements (∼ 1% in UAS and ∼ 2.5% in LAS).
The performance of the ensemble models increases as the number of
parsers grows.