5th Author Profiling task at PAN
Gender and Language Variety
Identification in Twitter
PAN-AP-2017 CLEF 2017
Dublin, 11-14 September
Francisco Rangel
Autoritas Consulting &
PRHLT Research Center -
Universitat Politècnica de València
Paolo Rosso
PRHLT Research Center
Universitat Politècnica de València
Martin Potthast & Benno Stein
Bauhaus-Universität Weimar
Introduction
Author profiling aims at identifying
personal traits such as age, gender,
personality traits, native language,
language variety… from writings.
This is crucial for:
- Marketing
- Security
- Forensics
PAN’16AuthorProfiling
Task goal
To investigate the identification of
an author’s gender and language
variety together.
Four languages:
English, Spanish, Portuguese, Arabic
Corpus collection
● Step 1: Languages and varieties selection.
● Step 2: Tweets per region retrieval.
● Step 3: Unique authors identification.
● Step 4: Authors selection:
○ Tweets are not retweets.
○ Tweets are written in the corresponding language.
● Step 5: Language variety annotation:
○ 80% of tweet meta-data coincide with:
■ Geotagging.
■ Toponyms of the region.
● Step 6: Gender annotation:
○ Automatically: dictionary of proper nouns.
○ Manually: visual review.
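Steps 4 and 5 above can be sketched in Python as follows. This is only an illustration: the `Tweet` fields, the substring match against a toponym list, and the reading of the 80% rule as "at least 80% of an author's tweets carry matching location metadata" are assumptions, not the organisers' actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class Tweet:
    text: str
    lang: str          # language code reported for the tweet (hypothetical field)
    is_retweet: bool
    location: str      # free-text location metadata, "" if missing (hypothetical field)

def keep_tweet(tweet, target_lang):
    """Step 4: keep tweets that are not retweets and are in the target language."""
    return (not tweet.is_retweet) and tweet.lang == target_lang

def matches_region(tweet, toponyms):
    """True if the location metadata mentions any toponym of the region."""
    location = tweet.location.lower()
    return any(name.lower() in location for name in toponyms)

def annotate_variety(tweets, toponyms, threshold=0.8):
    """Step 5 (assumed reading): accept the author if enough tweets match the region."""
    if not tweets:
        return False
    hits = sum(matches_region(t, toponyms) for t in tweets)
    return hits / len(tweets) >= threshold
```

An author whose tweets mostly carry a location such as "Dublin, Ireland" would be accepted for a region whose toponym list contains "Ireland"; an author with mixed locations would be dropped.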
Corpus
● Step 7: Corpus construction:
○ 500 authors per variety and gender.
■ 300 for training, 200 for test.
○ 100 tweets per author.
Evaluation measures
The accuracy is calculated per task and language.
Then, the averages per task are calculated.
Finally, the ranking is the global average.
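A minimal Python sketch of the averaging described above. It assumes that the tasks being averaged are gender, variety and joint identification (the three result sets named later in the slides) and that every language and task is weighted equally; the dictionary layout and function names are hypothetical.

```python
LANGUAGES = ("ar", "en", "es", "pt")
TASKS = ("gender", "variety", "joint")  # assumed set of tasks

def task_average(accuracy, task):
    """Average the per-language accuracies of one task."""
    return sum(accuracy[task][lang] for lang in LANGUAGES) / len(LANGUAGES)

def global_ranking_score(accuracy):
    """Average the per-task averages to obtain the score used for ranking."""
    return sum(task_average(accuracy, task) for task in TASKS) / len(TASKS)
```

Here `accuracy` is a nested mapping such as `accuracy["gender"]["es"]`, holding one accuracy value per task and language.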
Baselines
● BASELINE-stat: A statistical baseline that emulates random
choice.
● BASELINE-bow:
○ Documents represented as bag-of-words.
○ The 1,000 most common words in the training set.
○ Weighted by absolute frequency.
○ Preprocess: lowercase, removal of punctuation signs and
numbers, removal of stopwords.
● BASELINE-LDR:
○ Documents represented by the probability distribution of
occurrence of their words in the different classes.
○ Each word is weighted depending on its probability of
belonging to each class.
○ The distribution of weights for a given document should be
closer to the weights of its corresponding class.
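The two non-random baselines can be illustrated with a short Python sketch. The bag-of-words part follows the bullets above (lowercasing, removal of punctuation, numbers and stopwords, the most common training words, absolute frequencies). The second part is only a loose simplification of the LDR idea: it weights each word by its estimated probability of belonging to each class and averages those weights per document. The stopword list, function names and the averaging step are assumptions.

```python
import re
from collections import Counter

# Illustrative stopword list only; a real run would use a per-language list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def preprocess(text):
    """Lowercase, drop punctuation and digits, remove stopwords."""
    text = re.sub(r"[^\w\s]|\d", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]

def build_vocabulary(train_docs, size=1000):
    """The `size` most common words in the training set."""
    counts = Counter(tok for doc in train_docs for tok in preprocess(doc))
    return [word for word, _ in counts.most_common(size)]

def bow_vector(doc, vocabulary):
    """Absolute frequency of each vocabulary word in the document."""
    counts = Counter(preprocess(doc))
    return [counts[word] for word in vocabulary]

def class_word_weights(train_docs, labels):
    """Estimate P(class | word) from training counts (simplified weighting)."""
    per_class, totals = {}, Counter()
    for doc, label in zip(train_docs, labels):
        toks = preprocess(doc)
        per_class.setdefault(label, Counter()).update(toks)
        totals.update(toks)
    return {label: {w: c / totals[w] for w, c in counts.items()}
            for label, counts in per_class.items()}

def ldr_like_vector(doc, weights):
    """Mean per-class weight of the document's words, classes in sorted order."""
    toks = preprocess(doc)
    classes = sorted(weights)
    if not toks:
        return [0.0 for _ in classes]
    return [sum(weights[c].get(t, 0.0) for t in toks) / len(toks) for c in classes]
```

Under this simplification, a document's vector has one value per class, and the value for the document's own class would be expected to be the largest.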
22 participants
20 working notes
19 countries
Countries labelled on the slide: Qatar, Netherlands, Cuba, Slovenia
Approaches
Approaches - Preprocessing
● HTML cleaning to obtain plain text: Khan; Martinc et al.; Ribeiro-Oliveira & Ferreira
● Punctuation signs: Ribeiro-Oliveira & Ferreira; Martinc et al.; Schaetti
● Stop words: Kheng et al.; Martinc et al.
● Lowercase: Franco-Salvador et al.; Kheng et al.; Kodiyan et al.; Miura et al.
● Remove short tweets: Kheng et al.
● Twitter specific components (hashtags, urls, mentions and RTs): Franco-Salvador et al.; Adame et al.; Kheng et al.; Kodiyan et al.; Markov et al.; Miura et al.; Ribeiro-Oliveira & Ferreira; Schaetti
● Out-of-alphabet words: Schaetti
● Expand contractions: Adame et al.
Approaches - Features
Stylistic features:
- Ratios of links
- Hashtag or user mentions
- Character flooding
- Emoticons / laughter expressions
- Domain names
Alrifai et al.; Ribeiro-Oliveira & Ferreira; Martinc et al.; Adame et al.; Markov et al.
Emotional features:
● Emotions
● Appraisal
● Admiration
● Pos/neg emoticons
● Sentiment words
● ...
Adame et al.; Martinc et al.
Specific lists of words, most discriminant words, ...
Martinc et al.; Kocher & Savoy; Khan
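Some of the stylistic features listed above can be computed with a few regular expressions. The Python sketch below is an assumption about how such ratios might be defined (fraction of an author's tweets containing each marker); the patterns and feature names are hypothetical and not taken from any participant's system.

```python
import re

# Hypothetical patterns; real systems may define these markers differently.
PATTERNS = {
    "link_ratio": re.compile(r"https?://\S+"),
    "hashtag_ratio": re.compile(r"#\w+"),
    "mention_ratio": re.compile(r"@\w+"),
    # Character flooding: the same character three or more times in a row.
    "flooding_ratio": re.compile(r"(\w)\1{2,}"),
}

def stylistic_features(tweets):
    """Fraction of tweets that contain each stylistic marker."""
    n = len(tweets)
    features = {}
    for name, pattern in PATTERNS.items():
        hits = sum(1 for tweet in tweets if pattern.search(tweet))
        features[name] = hits / n if n else 0.0
    return features
```

The resulting dictionary can be appended to any other document representation, for example an n-gram vector.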
Approaches - Features
● N-gram models: Martinc et al.; Alrifai et al.; Kheng et al.; Markov et al.; Ribeiro-Oliveira & Ferreira; Ogaltsov & Romanov; Schaetti; Ciobanu et al.
● Bag-of-words: Adame et al.; Tellez et al.
● Tf-idf n-grams: Poulston et al.; Schaetti; Basile et al.
● LSA: Kheng et al.
● Second order representation: Pastor et al.
● Word embeddings: Ignatov et al.; Kodiyan et al.; Sierra et al.; Poulston et al.; Miura et al.
● Character embeddings: Franco-Salvador et al.; Miura et al.
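As an illustration of the tf-idf n-gram representation named above, here is a small pure-Python sketch over character n-grams. The choice of character trigrams, relative term frequency and an unsmoothed `log(N / df)` inverse document frequency are assumptions; participants may have used word n-grams, other weighting variants or library implementations.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """All overlapping character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def tfidf_vectors(docs, n=3):
    """One sparse {ngram: weight} dictionary per document."""
    doc_counts = [Counter(char_ngrams(doc, n)) for doc in docs]
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter(gram for counts in doc_counts for gram in counts)
    num_docs = len(docs)
    vectors = []
    for counts in doc_counts:
        total = sum(counts.values())
        vectors.append({gram: (c / total) * math.log(num_docs / df[gram])
                        for gram, c in counts.items()})
    return vectors
```

With this weighting, n-grams that occur in every document receive weight zero, so only n-grams that help to tell documents apart contribute.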
Approaches - Methods
● Logistic regression: Ignatov et al.; Martinc et al.; Poulston et al.; Ogaltsov & Romanov
● SVM: Alrifai et al.; Kheng et al.; Pastor et al.; Markov et al.; Tellez et al.; Basile et al.; Ribeiro-Oliveira & Ferreira; Ciobanu et al.
● Naive Bayes: Kheng et al.
● Distance-based approaches: Adame et al.; Kocher & Savoy; Khan
● Recurrent Neural Networks: Kodiyan et al.; Miura et al.
● Convolutional Neural Networks: Schaetti; Sierra et al.; Miura et al.
● Deep Averaging Networks: Franco-Salvador et al.
Gender results
Variety results
Confusion among varieties (AR)
Confusion among varieties (PT)
Confusion among varieties (ES)
Confusion among varieties (EN)
Coarse vs. fine-grained English
● American: United States + Canada.
● European: Great Britain + Ireland.
● Oceanic: New Zealand + Australia.
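The coarse-grained grouping above amounts to mapping each fine-grained English variety to its region before scoring. A minimal Python sketch, in which the label strings and function names are hypothetical:

```python
# Fine-grained variety -> coarse-grained group, as listed on the slide.
COARSE_GROUP = {
    "united states": "american",
    "canada": "american",
    "great britain": "european",
    "ireland": "european",
    "new zealand": "oceanic",
    "australia": "oceanic",
}

def accuracy(gold, predicted):
    """Fraction of positions where the prediction equals the gold label."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

def coarse_accuracy(gold, predicted):
    """Accuracy after collapsing both label lists to the coarse groups."""
    return accuracy([COARSE_GROUP[g] for g in gold],
                    [COARSE_GROUP[p] for p in predicted])
```

A prediction of "united states" for a Canadian author counts as an error in the fine-grained evaluation but as correct in the coarse-grained one.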
The Impact of Gender on Variety Identification
● All participants’ predictions together.
● Except in Spanish, it is easier to predict the variety when the
author is female.
The difficulty of Gender Id. depending on Variety
● All participants’ predictions together.
● For most Arabic and Portuguese varieties, female authors are easier to identify.
● In Spanish and English, both genders are similarly difficult to identify.
Joint evaluation
Final ranking
PAN-AP 2017 best results
Conclusions
● Participants combined many kinds of features (content-based, stylometric, n-grams, …) and, for the first time,
deep learning approaches were widely used.
○ Deep learning approaches did not obtain the best results.
● Per language:
○ The best results were obtained in Portuguese.
○ On average, the worst results in gender identification were obtained in Arabic.
○ On average, the worst results in language variety identification were obtained in English.
● Per variety:
○ In Arabic, the most difficult variety is Gulf and the easiest is Levantine.
○ In English, the highest confusion occurs among varieties which share regional locations.
○ In Spanish, most confusions involve Colombia. The highest confusion is from Peru.
○ Portuguese is asymmetric: the highest confusion is from Portugal to Brazil.
● Coarse vs. fine-grained evaluation in English:
○ The differences are significant, although not very high (3.75%) in the case of the best approaches.
● The impact of gender on language variety identification:
○ In Arabic and Portuguese, the differences between genders are significant.
● The difficulty of gender identification depending on the language variety:
○ For most Arabic and Portuguese varieties, female authors are easier to identify.
○ In Spanish and English, both genders are similarly difficult to identify.
Task impact
              PARTICIPANTS  COUNTRIES  CITATIONS
PAN-AP 2013   21            16         67 (+28)
PAN-AP 2014   10            8          41 (+25)
PAN-AP 2015   22            13         42 (+25)
PAN-AP 2016   22            15         5
PAN-AP 2017   22            19
Next year?
Industry at PAN (Author Profiling)
Organisation, Sponsors, Participants
On behalf of the author profiling task organisers:
Thank you very much for participating
and hope to see you next year!!
