RusProfiling Gender Identification in Russian Texts PAN@FIRE

RusProfiling
Gender Identification in
Russian Texts
PAN@FIRE 2017
Bangalore, 8-10 December
Francisco Rangel
Autoritas Consulting
Paolo Rosso
PRHLT - Universitat Politècnica
de València - Spain
Pavel Seredin & Olga Litvinova
RusProfiling Lab & Kurchatov Institute
Russia
Tatiana Litvinova
RusProfiling Lab
Russia

Introduction
Author profiling aims at identifying
personal traits such as age, gender,
native language or personality traits from
writings.
This is crucial for:
- Marketing
- Security
- Forensics

Task goal
To predict Gender in Russian text
from a cross-genre perspective:
- Essays.
- Facebook.
- Twitter.
- Reviews.
- Gender-Imitated texts.

Corpora description
Dataset / Genre (number of authors): description
Training / Twitter (600 authors)
- Manually annotated (name, picture…)
- From 1 to 200 tweets per author
Test / Essays (370 authors)
- Between 1 and 2 texts per author
- From the RusPersonality corpus
- Topics: letter to a friend, picture description, letter to an employee...
- Average length of 150 words
Test / Facebook (228 authors)
- Different age groups (20+, 30+, 40+)
- Different cities (minimum mutual friends)
- Average length of 1,000 words
Test / Twitter (400 authors)
- A random partition ensuring no intersection with the training authors
Test / Reviews (776 authors)
- From the TrustPilot corpus
- One text per author
- Average length of 80 words
Test / Gender-Imitated (94 authors)
- From the Gender Imitation Corpus
- Three texts per author:
  - Normal style
  - Imitating the other gender
  - Obfuscating their own style

Evaluation measures
Accuracy is calculated per corpus.
The final ranking is obtained from the weighted accuracy,
computed as if the corpora were concatenated.
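
A minimal sketch of this weighting, assuming each corpus contributes in proportion to its number of evaluated instances (which is what concatenating the corpora amounts to). The function name and the example figures are hypothetical and are not results from the task.

```python
def weighted_accuracy(per_corpus):
    """Combine per-corpus accuracies as if the corpora were concatenated.

    per_corpus maps a corpus name to (accuracy, number_of_instances).
    """
    total = sum(n for _, n in per_corpus.values())
    if total == 0:
        raise ValueError("no instances to evaluate")
    # Each accuracy is weighted by the share of instances in its corpus.
    return sum(acc * n for acc, n in per_corpus.values()) / total


# Hypothetical figures, for illustration only.
example = {"corpus_a": (0.50, 100), "corpus_b": (1.00, 300)}
print(weighted_accuracy(example))
```

With these made-up figures the larger corpus dominates, so the combined score sits closer to its accuracy than a plain average of the two would.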

Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocess: lowercase, removal of punctuation signs and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
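
The two representations above can be sketched in plain Python. This is an illustration under stated assumptions, not the organisers' code: the stopword list is a tiny hypothetical placeholder, the function names are invented, and `class_weights` covers only the per-word class probabilities of BASELINE-LDR, not the document-level representation built from them.

```python
import re
from collections import Counter

# Hypothetical placeholder; the stopword list actually used is not given.
STOPWORDS = {"и", "в", "на", "не"}


def tokenize(text):
    """Lowercase, keep letter-only tokens (drops punctuation and numbers), drop stopwords."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return [w for w in words if w not in STOPWORDS]


def build_vocabulary(train_texts, size=1000):
    """Return the `size` most common words in the training texts."""
    counts = Counter()
    for text in train_texts:
        counts.update(tokenize(text))
    return [word for word, _ in counts.most_common(size)]


def bow_vector(text, vocabulary):
    """Represent a text by the absolute frequency of each vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


def class_weights(train_texts, labels):
    """Estimate, for each word, the probability that it belongs to each class."""
    per_class = {}
    for text, label in zip(train_texts, labels):
        per_class.setdefault(label, Counter()).update(tokenize(text))
    totals = Counter()
    for counts in per_class.values():
        totals.update(counts)
    return {
        word: {label: per_class[label][word] / totals[word] for label in per_class}
        for word in totals
    }
```

A classifier would then be trained on the `bow_vector` outputs, or on features derived from the class weights. The slides do not say which learner the baselines used.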

Participants
AmritaNLP [18] V. Vinayan, N. J.R., H. NB, A. Kumar M, and S. K P.
Amritanlp@pan-rusprofiling: Author profiling using machine learning techniques.
BITS_Pilani [1] R. Bhargava, G. Goel, A. Shah, and Y. Sharma.
Gender identification in Russian texts.
CIC [7] I. Markov, H. Gomez-Adorno, G. Sidorov, and A. Gelbukh.
The winning approach to cross-genre gender identification in Russian at RusProfiling 2017.
DUBL [17] G. Skitalinskaya, L. Akhtyamova, and J. Cardiff.
Cross-genre gender identification in Russian texts using topic modeling working note: Team dubl.
RBG [3] B. Ganesh HB, A. Kumar M, and S. KP.
Representation of target classes for text classification - amrita cen nlp@rusprofiling pan 2017.

Participants’ runs per dataset
Dataset Runs
Essays 18
Facebook 19
Twitter 19
Reviews 19
Gender-Imitated 19
Total 93

Approaches

Approaches - Preprocessing
Preprocessing step: Teams
Obtain plain text: BITS_Pilani
Remove stopwords: BITS_Pilani, DUBL
Remove short words: DUBL
Twitter specific elements (mentions, hashtags, urls): BITS_Pilani, DUBL
Remove punctuation marks: BITS_Pilani, CIC
Remove numbers: BITS_Pilani
Remove non-cyrillic characters: CIC
Lemmatisation: DUBL
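
Several of the steps listed above could be chained roughly as follows. This is a hedged sketch that merges steps from different teams into one hypothetical `preprocess` function; no single participant is known to have used exactly this pipeline. The regular expressions are assumptions, and lemmatisation is left out because it needs an external morphological analyser.

```python
import re

URL_RE = re.compile(r"https?://\S+")
MENTION_OR_HASHTAG_RE = re.compile(r"[@#]\w+")
# Anything that is neither a Cyrillic letter nor whitespace.
NON_CYRILLIC_RE = re.compile(r"[^а-яёА-ЯЁ\s]")


def preprocess(text, stopwords=frozenset(), min_word_length=1):
    """Return a list of cleaned, lowercased Cyrillic tokens."""
    text = URL_RE.sub(" ", text)                 # Twitter-specific: URLs
    text = MENTION_OR_HASHTAG_RE.sub(" ", text)  # Twitter-specific: mentions, hashtags
    text = NON_CYRILLIC_RE.sub(" ", text)        # also drops punctuation and numbers
    words = text.lower().split()
    return [
        w for w in words
        if len(w) >= min_word_length and w not in stopwords
    ]


print(preprocess("@ivan Привет, мир! 123 http://t.co/abc #тег", stopwords={"мир"}))
```

Removing URLs and mentions before the non-Cyrillic filter matters: otherwise fragments of Cyrillic hashtags would survive as ordinary words.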

Approaches - Features
AmritaNLP
- Number of user mentions
- Hashtags
- Urls
- Emoticons
- Punctuation marks
- Average word length
- Tf-idf bag-of-words
BITS_Pilani
- Linguistic patterns such as word endings or the use of first person singular pronouns within a given distance of a verb in past tense
- (Combined with) deep learning techniques
CIC
- Word and character n-grams
- Words most frequently used per gender
- Linguistic patterns such as word endings or the use of first person singular pronouns within a given distance of a verb in past tense
DUBL
- Topic modelling
RBG
- A representation scheme based on the texts belonging to the corresponding target classes.
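
As an illustration of two of the feature families above, the sketch below computes a few surface counts of the kind listed for AmritaNLP and character n-gram counts of the kind listed for CIC. The function names, the regular expressions and the punctuation set are assumptions. Emoticons and tf-idf weighting are omitted, and none of this is taken from the participants' code.

```python
import re
from collections import Counter

TWITTER_ELEMENT_RE = re.compile(r"https?://\S+|[@#]\w+")


def surface_features(text):
    """Count Twitter elements and simple stylistic cues in a text."""
    # Words and punctuation are measured after removing URLs, mentions and hashtags.
    cleaned = TWITTER_ELEMENT_RE.sub(" ", text)
    words = re.findall(r"[^\W\d_]+", cleaned)
    return {
        "mentions": len(re.findall(r"@\w+", text)),
        "hashtags": len(re.findall(r"#\w+", text)),
        "urls": len(re.findall(r"https?://\S+", text)),
        "punctuation": sum(1 for ch in cleaned if ch in ".,;:!?"),
        "avg_word_length": sum(len(w) for w in words) / len(words) if words else 0.0,
    }


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; empty if the text is shorter than n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))
```

For example, `surface_features("@anna Привет! Как дела? #утро")` counts one mention, one hashtag and two punctuation marks, and averages word length over the three remaining words.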

Approaches - Methods
Method: Teams
Support Vector Machines: AmritaNLP, CIC, RBG
Random Forest: AmritaNLP
AdaBoost: AmritaNLP
Additive Regularization for Topic Modelling: DUBL
Rule-based: BITS_Pilani
Long Short-Term Memory networks: BITS_Pilani
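
To make the rule-based idea concrete: Russian past-tense verbs agree in gender with their subject (for example "я пошла" versus "я пошёл"), so a first person singular pronoun followed closely by a past-tense verb can hint at the author's gender. The sketch below is a deliberately crude heuristic based only on word endings. The window size, the ending lists and the function names are assumptions, not the rules used by any participant. It will misfire on nouns such as "школа" and miss masculine forms without "-л" such as "мог".

```python
import re

FIRST_PERSON_SINGULAR = {"я"}
FEMININE_ENDINGS = ("ла", "лась")
MASCULINE_ENDINGS = ("л", "лся")


def past_tense_gender_votes(text, window=3):
    """Count gendered past-tense endings within `window` tokens after 'я'."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    votes = {"female": 0, "male": 0}
    for i, token in enumerate(tokens):
        if token not in FIRST_PERSON_SINGULAR:
            continue
        for candidate in tokens[i + 1:i + 1 + window]:
            if candidate.endswith(FEMININE_ENDINGS):
                votes["female"] += 1
                break
            if candidate.endswith(MASCULINE_ENDINGS):
                votes["male"] += 1
                break
    return votes


def guess_gender(text):
    """Return 'female', 'male', or None when the votes are tied or absent."""
    votes = past_tense_gender_votes(text)
    if votes["female"] == votes["male"]:
        return None
    return max(votes, key=votes.get)
```

Returning `None` on a tie leaves room for a statistical model to decide when the rules find no evidence, in the spirit of combining rules with a learned component.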

Results on Essays
- Best result:
- A combination of linguistic rules and
deep learning.
- 10% higher than second best result.
- Second best result:
- Stylistic features with traditional
machine learning.
- 7 runs below the bow and majority
baselines.
- The LDR baseline outperforms the best and second best
systems by 3% and 13%, respectively.

Results on Facebook
- 4 best results:
- SVMs with combinations of n-grams
and linguistic rules.
- 2 results higher than 90%
- 5th and 6th best results:
- Linguistic rules combined with deep
learning.
- 5 runs below the majority baseline.
- 12 runs below the bow baseline.

Results on Twitter
- 2 best results:
- 3rd best result:
learning.
- 4 runs below the majority baseline.
- Bow baseline below the majority baseline.

Results on Reviews
- 2 best results:
- 3rd and 4th best results:
learning.
- 5 runs below the majority baseline.
- Bow baseline ties the majority baseline.
- The LDR baseline outperforms the best system by 4%.

Results on Gender Imitation
- 2 best results:
- Linguistic rules combined with deep learning.
- 3rd best result:
- Stylistic features with traditional machine
learning.
- 4th to 7th best results:
- SVMs with combinations of n-grams and
linguistic rules.
- 11 runs below the majority and bow baselines.
- Most systems improve on the majority baseline
by less than 5%.

Global ranking
- 4 best results:
- SVMs with combinations of n-grams and
linguistic rules.
- 5th to 7th best results:
- Stylistic features with traditional machine
learning.
- The deep learning approach did not participate in all
the datasets.
- 9 runs below the majority and 10 below the bow
baselines.
- LDR baseline outperforms the best result by 6.65%

Conclusions
● The task aimed at identifying gender from Russian texts from a cross-genre
perspective:
○ Essays, Twitter, Facebook, reviews, gender-imitated.
● Five participants submitted 93 runs.
● Accuracy was used to evaluate the systems.
● Several different features:
○ Traditional hand-crafted features such as word and character n-grams, and
stylometrics, with traditional machine learning methods such as Support Vector
Machines.
○ Deep learning techniques.
● Regarding results:
○ Deep learning techniques obtained some of the best results, especially on essays
and gender-imitated texts.
○ The best results were achieved not on Twitter but on Facebook.
○ Some of the worst results were obtained on reviews.

On behalf of the RusProfiling task organisers:
Thank you very much for participating
and hope to see you next year!!
