Andrew Smirnov and Valentin Mendelev - Applying Word Embeddings to Leverage Knowledge Available in One Language in Order to Solve a Practical Text Classification Problem In Another Language
1. Financially supported by the Ministry of Education and Science
of the Russian Federation, Contract 14.579.21.0008, ID RFMEFI57914X0008.
APPLYING WORD EMBEDDINGS TO
LEVERAGE KNOWLEDGE AVAILABLE IN ONE
LANGUAGE TO SOLVE A TEXT
CLASSIFICATION PROBLEM IN ANOTHER
LANGUAGE
Andrew Smirnov and Valentin Mendelev
smirnov-a@speechpro.com
AIST 2016
CONTENTS
The problem
Word embeddings
Knowledge transfer
Results
New results
Conclusions
CALL STEERING IN DIFFERENT
LANGUAGES
Little training data in the target language
Up to 40 classes
The classifier has to be built rapidly
Our goal is to be able to build a classifier
given only the class titles and 1-5 artificially
generated examples for each class
THE PROBLEM
«У меня вот просто не технический вопрос, а просто можно ли во время отпуска отключить вот этот пакет*»
Приостановка услуг**
* My question is not a technical one, I simply want to suspend this package while I'm on vacation
** Service suspension
TRAINING DATA
6000 users’ requests in Russian
250 manual translations from Russian
into Kazakh, split into development and
test sets
WORD EMBEDDINGS
The CBOW architecture predicts the current word
based on the context, and the Skip-gram architecture predicts
the surrounding words given the current word
Mikolov, Tomas, et al. "Efficient estimation of word representations in
vector space." arXiv preprint arXiv:1301.3781 (2013)
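The difference between the two architectures is easiest to see in the training examples each one consumes. The following is a minimal sketch, not the word2vec implementation itself; the function names, the `window` parameter, and the sample sentence are illustrative assumptions.

```python
def cbow_examples(tokens, window=2):
    """Build (context_words, current_word) pairs.

    CBOW predicts the current word from the words around it.
    """
    examples = []
    for i, current in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        context = left + right
        if context:
            examples.append((context, current))
    return examples


def skipgram_examples(tokens, window=2):
    """Build (current_word, surrounding_word) pairs.

    Skip-gram predicts each surrounding word from the current word.
    """
    examples = []
    for i, current in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                examples.append((current, tokens[j]))
    return examples


# Hypothetical tokenized request, used only for illustration.
tokens = "suspend this package during vacation".split()
cbow_pairs = cbow_examples(tokens, window=1)
skipgram_pairs = skipgram_examples(tokens, window=1)
```

With `window=1`, CBOW yields one example per token (the neighbours as input, the token as output), while Skip-gram yields one example per (token, neighbour) pair.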
DETAILS
CBOW
Training set for Russian: ~200m tokens
Conversations, books, news articles
Training set for Kazakh: ~30m tokens
Kazakh Wikipedia and news articles
Vector representation dimension is 200
for Russian and 100 for Kazakh
KNOWLEDGE TRANSFER
Possible categories
сервисы / services
баланс / balance
интернет / internet
неисправность интернет / internet failure
….
Transfer destination
Target -> Source
-> Classify
Source -> Target
-> Build Classifier
-> Classify
Transfer mechanism
Manual translation
Automatic translation
Semantic vector space
KNOWLEDGE TRANSFER
VECTOR SPACE TRANSFORMATION
APPROACH*
Translate a set of words (the 5000 most frequent ones from the training
corpus)
Train a linear model that maps one embedding space onto the other by
minimizing the L2 distance between translated word pairs
Apply the transformation and build a kNN classifier
*Mikolov, Tomas, Quoc V. Le, and Ilya Sutskever. "Exploiting similarities among languages for machine
translation." arXiv preprint arXiv:1309.4168 (2013).
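The second step can be sketched as fitting a matrix W so that each source-side word vector x, multiplied by W, lands close to the vector z of its translation. The sketch below uses plain batch gradient descent on toy 2- and 3-dimensional vectors; the function names, learning rate, epoch count, and data are assumptions made for illustration, not the settings used in this work.

```python
def apply_map(W, x):
    """Return x @ W for a vector x and a matrix W given as a list of rows."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def fit_linear_map(src, tgt, lr=0.1, epochs=500):
    """Fit W by gradient descent on the mean squared L2 distance ||x W - z||^2."""
    n, d_src, d_tgt = len(src), len(src[0]), len(tgt[0])
    W = [[0.0] * d_tgt for _ in range(d_src)]
    for _ in range(epochs):
        grad = [[0.0] * d_tgt for _ in range(d_src)]
        for x, z in zip(src, tgt):
            pred = apply_map(W, x)
            err = [p - t for p, t in zip(pred, z)]
            for i in range(d_src):
                for j in range(d_tgt):
                    grad[i][j] += 2.0 * x[i] * err[j] / n
        for i in range(d_src):
            for j in range(d_tgt):
                W[i][j] -= lr * grad[i][j]
    return W


# Toy "dictionary": vectors of words in one language (X) and of their
# translations (Z). Here Z is generated from a known matrix M so that the
# fit can be checked; real embeddings would come from trained models.
M = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
Z = [apply_map(M, x) for x in X]

W = fit_linear_map(X, Z)
mapped = apply_map(W, [2.0, 1.0])  # a new vector moved into the other space
```

Once W is fitted, any vector from one language can be moved into the other language's space with `apply_map` and compared with the vectors already there.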
RESULTS
Leave-one-out cross-validation results for the kNN classifier on the Kazakh language
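Leave-one-out evaluation of a kNN classifier can be sketched as follows: each example is held out in turn, classified by its nearest neighbours among the remaining examples, and the fraction of correct predictions is reported. Cosine similarity, the value of `k`, the helper names, and the toy vectors are assumptions for illustration; the slides do not state the similarity measure or how a request is turned into a vector.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_predict(query, vectors, labels, k=3):
    """Return the majority label among the k most similar vectors."""
    ranked = sorted(range(len(vectors)),
                    key=lambda i: cosine(query, vectors[i]),
                    reverse=True)
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_accuracy(vectors, labels, k=3):
    """Hold out each example in turn and classify it with the rest."""
    correct = 0
    for i in range(len(vectors)):
        rest_vectors = vectors[:i] + vectors[i + 1:]
        rest_labels = labels[:i] + labels[i + 1:]
        if knn_predict(vectors[i], rest_vectors, rest_labels, k) == labels[i]:
            correct += 1
    return correct / len(vectors)


# Toy request vectors for two of the categories; purely illustrative.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
           [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
labels = ["balance", "balance", "balance",
          "internet", "internet", "internet"]

accuracy = leave_one_out_accuracy(vectors, labels, k=1)
```

Because every example serves as a test item once, this procedure suits small labelled sets such as the 250 translated requests.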
NEW RESULTS
Classification accuracy for kNN and CNN (not leave-one-out)
Classification accuracy for 10 classes
CONCLUSIONS
Knowledge transfer makes it possible to achieve reasonable classification accuracy
for low-resource tasks
CNN and translation strategies produce better results
We want to do better
THANK YOU
CONTACTS
Russia
4 Krasutskogo street, St. Petersburg,
196084
Tel.: +7 812 325-8848
Fax: +7 812 327 9297
Email: info@speechpro.com
USA
Suite 316, 369 Lexington ave
New York, NY, 10017
Tel.: +1 646 237 7895
Email: sales-usa@speechpro.com
ABOUT THE COMPANY
STC-Innovations is a leader in the multimodal biometric
market. STC-Innovations develops multimodal biometric
solutions based on person-identifying technologies via voice,
face and other noncontact biometric features.
STC-Innovations is a spin-off company of the Speech
Technologies Center, a leading global provider of innovative
systems in high-quality recording, audio and video processing
and analysis, speech synthesis and recognition, and real-time,
high-accuracy voice and facial biometrics solutions with over
20 years of research, development and implementation
experience in Russia and internationally.
STC is ISO-9001: 2008 certified.