RusProfiling
Gender Identification in
Russian Texts
PAN@FIRE 2017
Bangalore, 8-10 December
Francisco Rangel
Autoritas Consulting
Paolo Rosso
PRHLT - Universitat Politècnica
de València - Spain
Pavel Seredin & Olga Litvinova
RusProfiling Lab & Kurchatov Institute
Russia
Tatiana Litvinova
RusProfiling Lab
Russia
Introduction
Author profiling aims at identifying
personal traits such as age, gender,
native language or personality traits from
writings.
This is crucial for:
- Marketing
- Security
- Forensics
PAN@FIRE’17RusProfiling
Task goal
To predict Gender in Russian text
from a cross-genre perspective:
- Essays.
- Facebook.
- Twitter.
- Reviews.
- Gender-Imitated texts.
Corpora description
Dataset, genre (number of authors): description
Training, Twitter (600 authors):
- Manually annotated (name, picture…)
- From 1 to 200 tweets per author
Test, Essays (370 authors):
- Between 1 and 2 texts per author
- From RusPersonality corpus
- Topics: letter to a friend, picture description, letter to an employee...
- Average length of 150 words
Test, Facebook (228 authors):
- Different age groups (20+, 30+, 40+)
- Different cities (minimum mutual friends)
- Average length of 1,000 words
Test, Twitter (400 authors):
- A random partition ensuring no overlap with training authors
Test, Reviews (776 authors):
- From TrustPilot corpus
- One text per author
- Average length of 80 words
Test, Gender-Imitated (94 authors):
- From the Gender Imitation Corpus
- Three texts per author:
  - Normal style
  - Imitating the other gender
  - Obfuscating their own style
Evaluation measures
The accuracy measure is calculated per corpus.
The final ranking is obtained by calculating the weighted
accuracy, as if the corpora were concatenated.
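The weighting can be sketched as follows. This is a minimal illustration, assuming one prediction per author and using the test author counts from the corpora description as weights; the function name `weighted_accuracy` and the accuracy values are hypothetical, not results from the task.

```python
def weighted_accuracy(per_corpus):
    """Combine per-corpus accuracies, weighting each by corpus size.

    per_corpus maps a corpus name to (accuracy, number_of_authors).
    """
    total = sum(size for _, size in per_corpus.values())
    return sum(acc * size for acc, size in per_corpus.values()) / total


# Hypothetical accuracies, for illustration only; sizes are from the slides.
example = {
    "essays": (0.50, 370),
    "facebook": (0.75, 228),
}
print(weighted_accuracy(example))
```

Weighting by corpus size gives the same number as pooling all predictions and computing accuracy once, so larger corpora such as Reviews influence the ranking more than small ones such as Gender-Imitated.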
Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocess: lowercase, removal of punctuation signs and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
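The two learned baselines can be sketched roughly as below. This is an illustration under stated assumptions, not the organisers' code: the stopword list is a hypothetical placeholder, all function names are invented for the example, and `class_word_shares` is a simplified stand-in for the word-weighting idea behind BASELINE-LDR (it only computes, per word, the share of its training occurrences that fall in each class).

```python
import re
from collections import Counter

# Hypothetical stopword list; the slides do not say which list was used.
STOPWORDS = {"и", "в", "не", "на", "что"}


def preprocess(text):
    """Lowercase, drop punctuation signs and numbers, remove stopwords."""
    text = re.sub(r"[^\w\s]|\d|_", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def build_vocabulary(train_texts, size=1000):
    """Keep the `size` most common words in the training set."""
    counts = Counter(tok for text in train_texts for tok in preprocess(text))
    return [word for word, _ in counts.most_common(size)]


def bow_vector(text, vocabulary):
    """Represent a document by absolute word frequencies over the vocabulary."""
    counts = Counter(preprocess(text))
    return [counts[word] for word in vocabulary]


def class_word_shares(train_texts, labels):
    """Simplified LDR-style weights: share of each word's occurrences per class."""
    per_class = {label: Counter() for label in set(labels)}
    for text, label in zip(train_texts, labels):
        per_class[label].update(preprocess(text))
    totals = Counter()
    for counts in per_class.values():
        totals.update(counts)
    return {
        word: {label: per_class[label][word] / totals[word] for label in per_class}
        for word in totals
    }
```

The vectors from `bow_vector` would then be fed to any standard classifier; the slides do not state which one the baseline used.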
Participants
AmritaNLP [18] V. Vinayan, N. J.R., H. NB, A. Kumar M, and S. K P.
Amritanlp@pan-rusprofiling: Author profiling using machine learning techniques.
BITS_Pilani [1] R. Bhargava, G. Goel, A. Shah, and Y. Sharma.
Gender identification in Russian texts.
CIC [7] I. Markov, H. Gomez-Adorno, G. Sidorov, and A. Gelbukh.
The winning approach to cross-genre gender identification in Russian at RusProfiling 2017.
DUBL [17] G. Skitalinskaya, L. Akhtyamova, and J. Cardiff.
Cross-genre gender identification in Russian texts using topic modeling working note: Team dubl.
RBG [3] B. Ganesh HB, A. Kumar M, and S. KP.
Representation of target classes for text classification - amrita cen nlp@rusprofiling pan 2017.
Participants’ runs per dataset
Dataset Runs
Essays 18
Facebook 19
Twitter 19
Reviews 19
Gender-Imitated 19
Total 93
Approaches
Approaches - Preprocessing
Obtain plain text: BITS_Pilani
Remove stopwords: BITS_Pilani, DUBL
Remove short words: DUBL
Twitter-specific elements (mentions, hashtags, URLs): BITS_Pilani, DUBL
Remove punctuation marks: BITS_Pilani, CIC
Remove numbers: BITS_Pilani
Remove non-Cyrillic characters: CIC
Lemmatisation: DUBL
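Most of these steps can be combined in a few lines. The sketch below is illustrative only and does not reproduce any team's pipeline: the function name `clean`, the stopword list and the minimum word length are assumptions, Twitter-specific elements are simply stripped (the slides do not say how each team handled them), and lemmatisation is left out because it needs an external morphological analyser.

```python
import re

# Hypothetical stopword list and length threshold, for illustration only.
STOPWORDS = {"и", "в", "не", "на", "что"}


def clean(text, min_len=3):
    """Apply the listed preprocessing steps to a single text."""
    text = re.sub(r"https?://\S+", " ", text)    # URLs
    text = re.sub(r"[@#]\w+", " ", text)         # mentions and hashtags
    # Keeping only Cyrillic letters also removes punctuation and numbers.
    text = re.sub(r"[^а-яёА-ЯЁ\s]", " ", text)
    tokens = text.lower().split()
    # Drop short words and stopwords.
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS]
```

The order matters: URLs, mentions and hashtags are removed before the non-Cyrillic filter, otherwise the Cyrillic part of a hashtag would survive as an ordinary word.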
Approaches - Features
AmritaNLP:
- Number of user mentions
- Hashtags
- URLs
- Emoticons
- Punctuation marks
- Average word length
- Tf-idf bag-of-words
BITS_Pilani:
- Linguistic patterns such as word endings or the use of first person singular pronouns within a given distance of a verb in past tense
- (Combined with) Deep learning techniques
CIC:
- Word and character n-grams
- Words most frequently used per gender
- Linguistic patterns such as word endings or the use of first person singular pronouns within a given distance of a verb in past tense
DUBL:
- Topic modelling
RBG:
- A representation scheme based on the texts belonging to the corresponding target classes.
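The pronoun-plus-past-tense pattern works because Russian singular past-tense verbs agree in gender with their subject (for example "я купила" for a female writer, "я купил" for a male one). A rough sketch of such a rule is shown below. It is a heuristic under stated assumptions, not the rule set of BITS_Pilani or CIC: the function name, the window size, the choice to look only at words after the pronoun, and the suffix lists are all hypothetical, and no part-of-speech check is done, so nouns ending in "л" or "ла" will produce false matches.

```python
import re


def past_tense_gender_votes(text, window=3):
    """Count gendered past-tense endings shortly after the pronoun 'я'."""
    tokens = re.findall(r"[а-яё]+", text.lower())
    votes = {"female": 0, "male": 0}
    for i, tok in enumerate(tokens):
        if tok != "я":
            continue
        # Look at the next `window` words for a past-tense-like ending.
        for nxt in tokens[i + 1 : i + 1 + window]:
            if nxt.endswith(("ла", "лась")):
                votes["female"] += 1
                break
            if nxt.endswith(("л", "лся")):
                votes["male"] += 1
                break
    return votes
```

The vote counts could be used directly as a rule-based prediction or added as two extra features next to n-grams.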
Approaches - Methods
Support Vector Machines: AmritaNLP, CIC, RBG
Random Forest: AmritaNLP
AdaBoost: AmritaNLP
Additive Regularization for Topic Modelling: DUBL
Rule-based: BITS_Pilani
Long Short-Term Memory networks: BITS_Pilani
Results on Essays
- Best result:
  - A combination of linguistic rules and deep learning.
  - 10% higher than the second best result.
- Second best result:
  - Stylistic features with traditional machine learning.
- 7 runs below the bow and majority baselines.
- The LDR baseline outperforms the best systems by 3% and 13%.
Results on Facebook
- 4 best results:
  - SVMs with combinations of n-grams and linguistic rules.
  - 2 results higher than 90%.
- 5th and 6th best results:
  - Linguistic rules combined with deep learning.
- 5 runs below the majority baseline.
- 12 runs below the bow baseline.
Results on Twitter
- 2 best results:
  - SVMs with combinations of n-grams and linguistic rules.
- 3rd best result:
  - Linguistic rules combined with deep learning.
- 4 runs below the majority baseline.
- Bow baseline below the majority baseline.
Results on Reviews
- 2 best results:
  - SVMs with combinations of n-grams and linguistic rules.
- 3rd and 4th best results:
  - Linguistic rules combined with deep learning.
- 5 runs below the majority baseline.
- Bow baseline ties the majority baseline.
- The LDR baseline outperforms the best system by 4%.
Results on Gender Imitation
- 2 best results:
  - Linguistic rules combined with deep learning.
- 3rd best result:
  - Stylistic features with traditional machine learning.
- 4th to 7th best results:
  - SVMs with combinations of n-grams and linguistic rules.
- 11 runs below the majority and bow baselines.
- Most systems improve on the majority baseline by less than 5%.
Global ranking
- 4 best results:
  - SVMs with combinations of n-grams and linguistic rules.
- 5th to 7th best results:
  - Stylistic features with traditional machine learning.
- The deep learning approach did not participate in all the datasets.
- 9 runs below the majority baseline and 10 below the bow baseline.
- The LDR baseline outperforms the best result by 6.65%.
Conclusions
● The task aimed at identifying gender from Russian texts from a cross-genre
perspective:
○ Essays, Twitter, Facebook, reviews, gender-imitated.
● 5 participants submitted 93 runs.
● Accuracy was used to evaluate the systems.
● Several different approaches:
○ Traditional hand-crafted features such as word and character n-grams and
stylometrics, with traditional machine learning methods such as Support Vector
Machines.
○ Deep learning techniques.
● Regarding results:
○ Deep learning techniques obtained some of the best results, especially on essays
and gender-imitated texts.
○ The best results were achieved on Facebook, not on Twitter.
○ Some of the worst results were obtained on reviews.
On behalf of the RusProfiling task organisers:
Thank you very much for participating
and hope to see you next year!!
modul pembelajaran robotic Workshop _ by Slidesgo.pptxmodul pembelajaran robotic Workshop _ by Slidesgo.pptx
modul pembelajaran robotic Workshop _ by Slidesgo.pptx
 
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
April 2024 - NLIT Cloudera Real-Time LLM Streaming 2024
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
毕业文凭制作#回国入职#diploma#degree澳洲中央昆士兰大学毕业证成绩单pdf电子版制作修改#毕业文凭制作#回国入职#diploma#degree
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
Data Analysis Project : Targeting the Right Customers, Presentation on Bank M...
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGILLMs, LMMs, their Improvement Suggestions and the Path towards AGI
LLMs, LMMs, their Improvement Suggestions and the Path towards AGI
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
专业一比一美国俄亥俄大学毕业证成绩单pdf电子版制作修改
 
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
Decoding the Heart: Student Presentation on Heart Attack Prediction with Data...
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
Vision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptxVision, Mission, Goals and Objectives ppt..pptx
Vision, Mission, Goals and Objectives ppt..pptx
 

RusProfiling Gender Identification in Russian Texts PAN@FIRE

  • 1. RusProfiling: Gender Identification in Russian Texts
      PAN@FIRE 2017, Bangalore, 8-10 December
      Francisco Rangel (Autoritas Consulting)
      Paolo Rosso (PRHLT - Universitat Politècnica de València, Spain)
      Pavel Seredin & Olga Litvinova (RusProfiling Lab & Kurchatov Institute, Russia)
      Tatiana Litvinova (RusProfiling Lab, Russia)
  • 2. Introduction
      Author profiling aims to identify an author's personal traits, such as age, gender, native language or personality, from their writing. This is crucial for:
      - Marketing
      - Security
      - Forensics
  • 3. Task goal
      To predict gender in Russian texts from a cross-genre perspective:
      - Essays
      - Facebook
      - Twitter
      - Reviews
      - Gender-imitated texts
  • 4. Corpora description
      Training dataset:
      - Twitter (600 authors)
          - Manually annotated (name, picture…)
          - From 1 to 200 tweets per author
      Test datasets:
      - Essays (370 authors)
          - Between 1 and 2 texts per author
          - From the RusPersonality corpus
          - Topics: letter to a friend, picture description, letter to an employee...
          - Average length of 150 words
      - Facebook (228 authors)
          - Different age groups (20+, 30+, 40+)
          - Different cities (minimum mutual friends)
          - Average length of 1,000 words
      - Twitter (400 authors)
          - A random partition ensuring no intersection with the training authors
      - Reviews (776 authors)
          - From the TrustPilot corpus
          - One text per author
          - Average length of 80 words
      - Gender-Imitated (94 authors)
          - From the Gender Imitation Corpus
          - Three texts per author:
              - Normal style
              - Imitating the other gender
              - Obfuscating the author's own style
  • 5. Evaluation measures
      Accuracy is calculated per corpus. The final ranking is obtained from the weighted accuracy, computed as if the corpora were concatenated.
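As a rough illustration of the ranking measure on this slide, the sketch below pools per-corpus accuracies into one weighted accuracy, which is what concatenating the corpora amounts to. The author counts come from the corpora description slide. The function name and the example accuracies are hypothetical placeholders, and weighting by number of authors (rather than by number of texts) is an assumption, since the slides do not say which unit was used.

```python
# Test-set sizes (number of authors) from the corpora description slide.
TEST_SIZES = {
    "essays": 370,
    "facebook": 228,
    "twitter": 400,
    "reviews": 776,
    "gender-imitated": 94,
}


def weighted_accuracy(per_corpus_accuracy, sizes=TEST_SIZES):
    """Accuracy over the concatenation of the given corpora.

    Each corpus contributes accuracy * size correct predictions, so the
    result equals total correct predictions divided by total instances.
    """
    total = sum(sizes[name] for name in per_corpus_accuracy)
    correct = sum(acc * sizes[name] for name, acc in per_corpus_accuracy.items())
    return correct / total


# Placeholder accuracies for illustration only (NOT results from the task).
example = {
    "essays": 0.5,
    "facebook": 0.5,
    "twitter": 0.5,
    "reviews": 1.0,
    "gender-imitated": 0.5,
}
print(round(weighted_accuracy(example), 4))
```

Because Reviews is the largest test corpus, it dominates the pooled score under this weighting, which is worth keeping in mind when reading the global ranking.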
  • 6. Baselines
      ● BASELINE-stat: A statistical baseline that emulates random choice.
      ● BASELINE-bow:
          ○ Documents represented as bag-of-words.
          ○ The 1,000 most common words in the training set.
          ○ Weighted by absolute frequency.
          ○ Preprocessing: lowercasing, removal of punctuation marks and numbers, removal of stopwords.
      ● BASELINE-LDR:
          ○ Documents represented by the probability distribution of occurrence of their words in the different classes.
          ○ Each word is weighted depending on its probability of belonging to each class.
          ○ The distribution of weights for a given document should be closer to the weights of its corresponding class.
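The BASELINE-bow representation described on this slide can be sketched with the standard library alone. This is a minimal sketch, not the organisers' code: the function names are hypothetical, the stopword set is a tiny placeholder (the list actually used is not given), and the classifier trained on top of these vectors is not specified in the slides, so it is left out.

```python
import re
from collections import Counter

# Tiny placeholder stopword list; the list used for the baseline is not given.
STOPWORDS = {"и", "в", "на", "не", "я", "что", "с"}


def preprocess(text):
    """Lowercase, drop punctuation and numbers, then drop stopwords."""
    # [^\W\d_]+ matches runs of Unicode letters only (no digits, no underscore).
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [tok for tok in tokens if tok not in STOPWORDS]


def build_vocabulary(train_texts, size=1000):
    """Return the `size` most common words of the training set."""
    counts = Counter(tok for text in train_texts for tok in preprocess(text))
    return [word for word, _ in counts.most_common(size)]


def vectorize(text, vocabulary):
    """Represent a document by the absolute frequency of each vocabulary word."""
    counts = Counter(preprocess(text))
    return [counts[word] for word in vocabulary]


train = ["Я люблю кофе, кофе и чай!", "Чай в 5 часов"]
vocab = build_vocabulary(train, size=1000)
print(vocab)
print(vectorize("Кофе, кофе, кофе!", vocab))
```

Fixing the vocabulary on the training set (Twitter) and reusing it on the other genres is what makes the cross-genre setting hard for this baseline: words frequent in tweets need not be frequent in essays or reviews.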
  • 7. Participants
      - AmritaNLP [18]: V. Vinayan, N. J.R., H. NB, A. Kumar M, and S. K P. Amritanlp@pan-rusprofiling: Author profiling using machine learning techniques.
      - BITS_Pilani [1]: R. Bhargava, G. Goel, A. Shah, and Y. Sharma. Gender identification in Russian texts.
      - CIC [7]: I. Markov, H. Gomez-Adorno, G. Sidorov, and A. Gelbukh. The winning approach to cross-genre gender identification in Russian at rusprofiling 2017.
      - DUBL [17]: G. Skitalinskaya, L. Akhtyamova, and J. Cardiff. Cross-genre gender identification in Russian texts using topic modeling working note: Team dubl.
      - RBG [3]: B. Ganesh HB, A. Kumar M, and S. KP. Representation of target classes for text classification - amrita cen nlp@rusprofiling pan 2017.
  • 8. Participants' runs per dataset
      Dataset            Runs
      Essays             18
      Facebook           19
      Twitter            19
      Reviews            19
      Gender-Imitated    19
      Total              93
  • 10. Approaches - Preprocessing
      - Obtain plain text: BITS_Pilani
      - Remove stopwords: BITS_Pilani, DUBL
      - Remove short words: DUBL
      - Twitter-specific elements (mentions, hashtags, URLs): BITS_Pilani, DUBL
      - Remove punctuation marks: BITS_Pilani, CIC
      - Remove numbers: BITS_Pilani
      - Remove non-Cyrillic characters: CIC
      - Lemmatisation: DUBL
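Several of the preprocessing steps listed on this slide can be combined in a few lines. The sketch below is only an illustration: it pools steps used by different teams, so it does not reproduce any single participant's pipeline, and the function name, the regular expressions and the `min_length` threshold are assumptions. Stopword removal and lemmatisation are left out because they need external resources.

```python
import re

# Twitter-specific elements: user mentions, hashtags and URLs.
TWITTER_ELEMENTS = re.compile(r"(?:@\w+|#\w+|https?://\S+)")
# Anything that is neither a lowercase Cyrillic letter nor whitespace
# (this also removes punctuation marks, numbers and Latin characters).
NON_CYRILLIC = re.compile(r"[^а-яё\s]")


def clean_tweet(text, min_length=3):
    """Return the Cyrillic word tokens of a tweet after basic cleaning."""
    text = text.lower()
    text = TWITTER_ELEMENTS.sub(" ", text)
    text = NON_CYRILLIC.sub(" ", text)
    # Drop short words; the threshold of 3 characters is an assumed value.
    return [word for word in text.split() if len(word) >= min_length]


print(clean_tweet("@ivan Привет!!! Смотри http://t.co/abc #новости 2017 ok я"))
```

Note that the order matters: mentions, hashtags and URLs are stripped before the non-Cyrillic filter, otherwise a Cyrillic hashtag would lose its `#` and survive as an ordinary word.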
  • 11. Approaches - Features
      - AmritaNLP:
          - Number of user mentions
          - Hashtags
          - URLs
          - Emoticons
          - Punctuation marks
          - Average word length
          - Tf-idf bag-of-words
      - BITS_Pilani:
          - Linguistic patterns such as word endings or the use of first person singular pronouns within a distance to a verb in past tense
          - (Combined with) deep learning techniques
      - CIC:
          - Word and character n-grams
          - Words most frequently used per gender
          - Linguistic patterns such as word endings or the use of first person singular pronouns within a distance to a verb in past tense
      - DUBL:
          - Topic modelling
      - RBG:
          - A representation scheme based on the texts belonging to the corresponding target classes.
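The linguistic pattern mentioned for BITS_Pilani and CIC relies on the fact that Russian past-tense verbs agree with the subject in gender ("я сделал" for a male speaker, "я сделала" for a female one). The sketch below is a deliberately crude illustration of that idea and not the participants' actual rules: the function name, the window size and the suffix lists are assumptions, and matching on endings alone will also fire on non-verbs (for example nouns ending in "ла").

```python
import re


def past_tense_gender_cues(text, window=3):
    """Count gendered past-tense endings near the pronoun "я".

    For each occurrence of "я", the `window` tokens before and after it
    are checked for typical feminine or masculine past-tense endings.
    """
    tokens = re.findall(r"[а-яё]+", text.lower())
    cues = {"female": 0, "male": 0}
    for i, token in enumerate(tokens):
        if token != "я":
            continue
        nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for word in nearby:
            if word.endswith(("ла", "лась")):   # e.g. "пошла", "смеялась"
                cues["female"] += 1
            elif word.endswith(("л", "лся")):   # e.g. "пошёл", "смеялся"
                cues["male"] += 1
    return cues


print(past_tense_gender_cues("Я вчера пошла в магазин"))
print(past_tense_gender_cues("Вчера я сделал уроки"))
```

Counts like these could be used directly in a rule or appended to an n-gram feature vector; which of the two the teams did is described in their working notes rather than in the slides.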
  • 12. Approaches - Methods
      - Support Vector Machines: AmritaNLP, CIC, RBG
      - Random Forest: AmritaNLP
      - AdaBoost: AmritaNLP
      - Additive Regularization for Topic Modelling: DUBL
      - Rule-based: BITS_Pilani
      - Long Short-Term Memory networks: BITS_Pilani
  • 13. Results on Essays
      - Best result:
          - A combination of linguistic rules and deep learning.
          - 10% higher than the second best result.
      - Second best result:
          - Stylistic features with traditional machine learning.
      - 7 runs below the bow and majority baselines.
      - The LDR baseline outperforms the best systems by 3% and 13%.
  • 14. Results on Facebook
      - 4 best results:
          - SVMs with combinations of n-grams and linguistic rules.
          - 2 results higher than 90%.
      - 5th and 6th best results:
          - Linguistic rules combined with deep learning.
      - 5 runs below the majority baseline.
      - 12 runs below the bow baseline.
  • 15. Results on Twitter
      - 2 best results:
          - SVMs with combinations of n-grams and linguistic rules.
      - 3rd best result:
          - Linguistic rules combined with deep learning.
      - 4 runs below the majority baseline.
      - The bow baseline is below the majority baseline.
  • 16. Results on Reviews
      - 2 best results:
          - SVMs with combinations of n-grams and linguistic rules.
      - 3rd and 4th best results:
          - Linguistic rules combined with deep learning.
      - 5 runs below the majority baseline.
      - The bow baseline ties the majority baseline.
      - The LDR baseline outperforms the best system by 4%.
  • 17. Results on Gender Imitation
      - 2 best results:
          - Linguistic rules combined with deep learning.
      - 3rd best result:
          - Stylistic features with traditional machine learning.
      - 4th to 7th best results:
          - SVMs with combinations of n-grams and linguistic rules.
      - 11 runs below the majority and bow baselines.
      - Most systems improve on the majority baseline by less than 5%.
  • 18. Global ranking
      - 4 best results:
          - SVMs with combinations of n-grams and linguistic rules.
      - 5th to 7th best results:
          - Stylistic features with traditional machine learning.
      - The deep learning approach did not participate in all the datasets.
      - 9 runs below the majority baseline and 10 below the bow baseline.
      - The LDR baseline outperforms the best result by 6.65%.
  • 19. Conclusions
      ● The task aimed at identifying gender in Russian texts from a cross-genre perspective:
          ○ Essays, Twitter, Facebook, reviews, gender-imitated texts.
      ● 5 participants submitted 93 runs.
      ● Accuracy was used to evaluate the systems.
      ● Several different approaches:
          ○ Traditional hand-crafted features, such as word and character n-grams and stylometrics, with traditional machine learning methods such as Support Vector Machines.
          ○ Deep learning techniques.
      ● Regarding the results:
          ○ Deep learning techniques obtained some of the best results, especially on essays and gender-imitated texts.
          ○ The best results were achieved not on Twitter but on Facebook.
          ○ Results on reviews were among the worst.
  • 20. On behalf of the RusProfiling task organisers: Thank you very much for participating, and we hope to see you next year!