1. Idiom Token Classification using Sentential Distributed Semantics
Giancarlo D. Salton, Robert J. Ross, John D. Kelleher
Applied Intelligence Research Centre
School of Computing
NLP Dublin Meetup
2. Outline
Idioms
Distributed Representations
“Per-expression” classification
“General” classification
Conclusions
Future Work on Idiom Token Classification
Idiom Classification on Machine Translation Pipeline
4. Idioms
Idioms are multiword expressions (MWEs)
Their meaning is non-compositional
There is no linguistic agreement on the set of characteristics defining idioms
5. Idiomatic and Literal Usages
Literally...
Actually...
How do we distinguish between a literal and an idiomatic usage?
Idiom token classification
6. Previous Work
Previous work used “per-expression” models
– a different set of features for each expression
– in general, these features are not reusable
– i.e., a model is trained for each particular expression
In our opinion, the state of the art is Peng et al. (2014)
– also “per-expression” classification
– topic models
– up to 5 paragraphs of context!
8. General Classifiers?
Can we find a common set of features?
Can we train a general classifier?
hold+horses vs. break+ice vs. spill+beans
10. Distributed Representations of Words
Word2vec (Mikolov et al., 2013)
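Word2vec is only name-checked on this slide, but as a toy illustration of the idea, the sketch below trains a skip-gram model with gensim on a tiny hypothetical corpus; the corpus and hyperparameters are placeholders, not the ones used in this work.

    # Minimal word2vec sketch using gensim on a toy, hypothetical corpus.
    from gensim.models import Word2Vec

    corpus = [
        ["he", "decided", "to", "spill", "the", "beans", "about", "the", "plan"],
        ["she", "spilled", "coffee", "beans", "all", "over", "the", "floor"],
        # ... more tokenised sentences would go here ...
    ]

    # Skip-gram model (sg=1), 100-dimensional vectors, small context window.
    model = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

    # Words used in similar contexts end up close together in vector space.
    print(model.wv.most_similar("beans", topn=5))

With a realistic corpus, the nearest neighbours of a word reflect its distributional semantics, which is the property the rest of the talk builds on.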
15. Skip-thought Vectors (or Sent2Vec)
(Kiros et al., 2015)
Encoder/Decoder Framework
– Encoder learns to encode information about the context of an input sentence
Distributed representations = features!
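The “distributed representations = features” point can be sketched as follows; here encode() is a stand-in for whatever sentence encoder is used (e.g. a skip-thought model) and is not a real library call, and the sentences and labels are illustrative only.

    import numpy as np
    from sklearn.svm import SVC

    def encode(sentence: str) -> np.ndarray:
        """Placeholder sentence encoder; in practice this would be the
        skip-thought (sent2vec) encoder producing a fixed-size vector."""
        rng = np.random.default_rng(abs(hash(sentence)) % (2**32))
        return rng.normal(size=300)

    sentences = [
        "He finally spilled the beans about the merger.",      # idiomatic
        "She spilled the beans all over the kitchen floor.",   # literal
        "They broke the ice with a few jokes.",                # idiomatic
        "The ship broke the ice as it entered the harbour.",   # literal
    ]
    labels = [1, 0, 1, 0]  # 1 = idiomatic, 0 = literal

    # Each sentence vector is the feature vector for one usage.
    X = np.stack([encode(s) for s in sentences])
    clf = SVC(kernel="linear").fit(X, labels)
    print(clf.predict(X))

The point is that the sentence vector itself is the feature representation, so no hand-crafted, expression-specific features are required.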
16. Distributed Representations vs Idioms
Distributed representations cluster words (word2vec) or sentences (sent2vec) with similar semantics
– empirical results have shown this
Idiomatic vs. literal usages
– idiomatic usages should also sit in a different part of the space than literal usages (at least when considering the same expression)
18. “Per-expression” settings
Following the baseline evaluation (Peng et al., 2014)
4 expressions from the VNC-Tokens dataset:
– blow+whistle, lose+head, make+scene and take+heart
Balanced training sets
Imbalanced test sets
19. “Per-expression” classifiers
K-Nearest Neighbours
– 2, 3, 5 and 10 neighbours
Support Vector Machines
– Linear SVM: linear kernel and grid search for best parameters
– Grid SVM: grid search for best kernel/parameters
– SGD SVM: linear kernel trained with Stochastic Gradient Descent
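A rough sketch of this classifier line-up in scikit-learn is shown below; the feature matrix X (sentence vectors), the labels y, and the parameter grids are placeholders for illustration, not the settings used in the work.

    # Sketch of the classifier line-up in scikit-learn; X holds sentence
    # vectors (one row per usage) and y the idiomatic/literal labels.
    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import SVC
    from sklearn.linear_model import SGDClassifier
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 100))   # placeholder sentence vectors
    y = np.repeat([0, 1], 20)        # 1 = idiomatic, 0 = literal (placeholder)

    # K-NN with the neighbour counts listed on the slide.
    knns = {k: KNeighborsClassifier(n_neighbors=k).fit(X, y) for k in (2, 3, 5, 10)}

    # "Linear SVM": linear kernel, grid search over C (grid is illustrative).
    linear_svm = GridSearchCV(SVC(kernel="linear"), {"C": [0.1, 1, 10]}).fit(X, y)

    # "Grid SVM": grid search over both kernel and parameters.
    grid_svm = GridSearchCV(
        SVC(),
        {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]},
    ).fit(X, y)

    # "SGD SVM": a linear SVM (hinge loss) trained with stochastic gradient descent.
    sgd_svm = SGDClassifier(loss="hinge").fit(X, y)

In the per-expression setting, one such model is trained separately for each expression's training set.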
24. “Per-expression” evaluation
No single model performed best for all expressions
SVMs consistently outperformed K-NNs
Peng et al. (2014) features may capture a different set of dimensions
A combination with the baseline model may result in a stronger classifier
26. “General classifier” settings
Simulation of expected behaviour on real data
27 expressions from the “balanced” part of the VNC-Tokens dataset
Imbalanced training set
Imbalanced test set
27. “General classifier” classifiers
SVMs only
– Linear SVM: linear kernel and grid search for best parameters
– Grid SVM: grid search for best kernel/parameters
– SGD SVM: linear kernel trained with Stochastic Gradient Descent
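Read as one model trained over all expressions pooled together (rather than one model per expression), the “general classifier” setup might look roughly like the sketch below; the data, the parameter grid, and the per-expression breakdown are placeholders for illustration.

    # Sketch of a "general" classifier: one SVM over all expressions pooled.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import precision_recall_fscore_support

    # Placeholders: sentence vectors, labels (1 = idiomatic, 0 = literal) and
    # the expression each test usage belongs to (e.g. "blow+whistle").
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(200, 100))
    y_train = np.tile([0, 1], 100)
    X_test = rng.normal(size=(60, 100))
    y_test = np.tile([0, 1], 30)
    test_expr = np.array(["blow+whistle"] * 30 + ["take+heart"] * 30)

    # A single "Grid SVM" trained on the pooled (in the real setting, imbalanced) data.
    clf = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]})
    clf.fit(X_train, y_train)

    # Report precision/recall/F1 per expression, as in the results table.
    pred = clf.predict(X_test)
    for expr in np.unique(test_expr):
        mask = test_expr == expr
        p, r, f1, _ = precision_recall_fscore_support(
            y_test[mask], pred[mask], average="binary", zero_division=0
        )
        print(f"{expr}: P={p:.2f} R={r:.2f} F1={f1:.2f}")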
28. “General classifier” results

                  Linear SVM           Grid SVM             SGD SVM
Expression        Pr.   Rec.  F1       Pr.   Rec.  F1       Pr.   Rec.  F1
blow+whistle      0.84  0.67  0.75     0.84  0.68  0.75     0.67  0.59  0.63
lose+head         0.78  0.66  0.72     0.75  0.64  0.69     0.75  0.67  0.71
make+scene        0.92  0.84  0.88     0.92  0.81  0.86     0.78  0.81  0.79
take+heart        0.94  0.79  0.86     0.94  0.80  0.86     0.86  0.80  0.83
Total             0.84  0.80  0.83     0.84  0.80  0.83     0.79  0.79  0.78
29. “General classifier” evaluation
Expected behaviour in the “real world”
– considers the imbalances of real data
2 classifiers had high performance
– same overall precision, recall and F1
– deviations occurred across individual expressions
Performance is still not consistent over all classifiers and across expressions
30. PCA Analysis of Distributed Representations on the “General” Classifier
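The figure for this slide is not recoverable from the transcript, but a hedged sketch of this kind of analysis is shown below: it projects (placeholder) sentence vectors to two dimensions with scikit-learn's PCA and colours them by idiomatic vs. literal label.

    # Sketch of a PCA projection of sentence vectors, coloured by usage label.
    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(100, 4800))   # placeholder sentence vectors
    labels = np.repeat([0, 1], 50)           # 1 = idiomatic, 0 = literal

    points = PCA(n_components=2).fit_transform(vectors)

    for label, name in [(0, "literal"), (1, "idiomatic")]:
        mask = labels == label
        plt.scatter(points[mask, 0], points[mask, 1], label=name, alpha=0.6)
    plt.legend()
    plt.title("PCA of sentence representations")
    plt.show()

If idiomatic and literal usages really do occupy different parts of the space, they should form visibly separate clusters in such a projection.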
32. Conclusions
Our approach needs fewer resources to achieve roughly the same performance
SVMs generally perform better than K-NNs
A “general classifier” is feasible
“Per-expression” classification does achieve better results in some cases
34. Future Work on Idiom Token Classification
Apply to languages other than English
Apply to other datasets
– e.g., the IDX Corpus
What are the main sources of error for the “general classifier”?
– A better understanding of the representations is needed
36–43. Idiom Token Classification on Machine Translation Pipeline
(Salton et al., 2014b)
44. References
Ryan Kiros, Yukun Zhu, Ruslan R Salakhutdinov, Richard Zemel, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Skip-thought vectors. In Advances in Neural Information Processing Systems 28, pages 3276–3284.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
Jing Peng, Anna Feldman, and Ekaterina Vylomova. 2014. Classifying idiomatic and literal expressions using topic models and intensity of emotions. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2019–2027, October.
Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014a. An Empirical Study of the Impact of Idioms on Phrase Based Statistical Machine Translation of English to Brazilian-Portuguese. In Third Workshop on Hybrid Approaches to Translation (HyTra), pages 36–41.
Giancarlo D. Salton, Robert J. Ross, and John D. Kelleher. 2014b. Evaluation of a substitution method for idiom transformation in statistical machine translation. In The 10th Workshop on Multiword Expressions (MWE 2014), pages 38–42.
45. Thank you!
Giancarlo D. Salton would like to thank CAPES (“Coordenação de Aperfeiçoamento de Pessoal de Nível Superior”) for his Science Without Borders scholarship, proc. n. 9050-13-2.