SlideShare ist ein Scribd-Unternehmen logo
1 von 56
Downloaden Sie, um offline zu lesen
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Text Analysis of Social Networks: 
Working with FB and VK Data 
AI Ukraine Conference, Kharkiv, Ukraine 
Alexander Panchenko 
alexander.panchenko@uclouvain.be 
Digital Society Laboratory LLC & UCLouvain 
October 25, 2014 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Social networks from the users’s standpoint 
Facebook (FB) and VKontakte (VK) 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Social networks from the data miner’s standpoint 
Facebook (FB) and VKontakte (VK) 
Profiles: a set of user attributes 
categorical variables (region, city, profession, etc.) 
integer variables (age, graduation year, etc.) 
text variables (name, surname, etc.) 
Network: a graph that relates users 
friendship graph 
followers graph 
commenting graph, etc. 
Texts: 
posts 
comments 
group titles and descriptions 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Gathering of VK and FB data 
Big Data: VK worth tens or even hundreds of TB 
Decide what do you need (posts, profiles, etc.). 
Download: 
API 
Scraping 
Download limits and API limitations are specific for each 
network. 
Parallelization is very practical, especially horizontal one: 
Amazon EC2, Distributed Message Queues 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Storing VK and FB data 
Again, Big Data 
NoSQL solutions are helpful 
Raw data: Amazon S3 
For analysis: HDFS 
Efficient retrieval: Elastic Search 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Social Network Analysis 
Structure analysis: friendship graph, comments graph, etc. 
Content analysis: profile attributes, posts, comments, etc. 
Combined approaches. 
What scientific communities analyze social networks? 
60s – the first structural methods 
00s – online social network analysis boom 
Social Network Analysis community (Sociologists, 
Statisticians, Physicists) 
Data and Graph Mining community 
Natural Language Processing community 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Technologies for analysis of social networks 
Machine Learning: hidden vs observable user attributes 
Training of the model often can be scaled vertically 
Applying the model should be scaled horizontally 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Problem 
Joint work with Andrey Teterin. 
Detect gender of a user 
to profile a user; 
user segmentation is helpful in search, advertisement, etc. 
By text written by a user: 
Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], 
Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. 
[2010], Rangel and Rosso Rangel and Rosso [2013], Al Zamal 
et al.Al Zamal et al. [2012] and Lui et al. Liu et al. [2012]. 
By full name: Burger et al. [2011], Panchenko and Teterin 
[2014] 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Online demo 
http://research.digsolab.com/gender 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Training Data 
100,000 full names of Facebook users with known gender 
full name – first and last name of a user 
gender: male or female 
names written in both Cyrillic and Latin alphabets 
“Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Training Data 
Figure : Name-surname Alexander co-Panchenko occurrences: Text rows Analysis and of columns Social Networks 
are sorted by
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Character endings of Russian names 
72% of first names have typical male/female ending 
68% of surnames have typical male/female ending 
a typical male/female ending splits males from females with an 
error less than 5% 
gender of  50% first names recognized with 8 endings 
gender of  50% second names recognized with 5 endings 
Conclusion 
Simple symbolic ending-based method cannot robustly classify 
about 30% of names. This motivates the need for a more 
sophisticated statistical approach. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Character endings of Russian names 
Type Ending Gender Error, % Example 
first name na (íà) female 0.27 Ekaterina 
first name iya (èÿ) female 0.32 Anastasiya 
first name ei (åé) male 0.16 Sergei 
first name dr (äð) male 0.00 Alexandr 
first name ga (ãà) male 4.94 Serega 
first name an (àí) male 4.99 Ivan 
first name la (ëà) female 4.23 Luidmila 
first name ii (èé) male 0.34 Yurii 
second name va (âà) female 0.28 Morozova 
second name ov (îâ) male 0.21 Objedkov 
second name na (íà) female 2.22 Matyushina 
second name ev (åâ) male 0.44 Sergeev 
second name in (èí) male 1.94 Teterin 
Table : Most discriminative and frequent two character endings of 
Russian names. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Gender Detection Method 
input: a string representing a name of a person 
output: gender (male or female) 
binary classification task 
Features 
endings 
character n-grams 
dictionary of male/female names and surnames 
Model 
L2-regularized Logistic Regression 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Features 
Character n-grams 
males: Alexander Yaroskavski, Oleg Arbuzov 
females: Alexandra Yaroskavskaya, Nayaliya Arbuzova 
BUT: “Sidorenko”, “Moroz” or “Bondar”! 
two most common one-character endings: “a” and “ya” (“ÿ”) 
Dictionaries of first and last names 
probability that it belongs to the male gender: 
P(c = malejfirstname), P(c = malejlastname). 
3,427 first names, 11,411 last names 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Results 
Model Accuracy Precision Recall F-measure 
rule-based baseline 0,638 0,995 0,633 0,774 
endings 0,850  0,002 0,921  0,003 0,784  0,004 0,847  0,002 
3-grams 0,944  0,003 0,948  0,003 0,946  0,003 0,947  0,003 
dicts 0,956  0,002 0,992  0,001 0,925  0,003 0,957  0,002 
endings+3-grams 0,946  0,003 0,950  0,002 0,947  0,004 0,949  0,003 
3-grams+dicts 0,956  0,003 0,960  0,003 0,957  0,004 0,959  0,003 
endings+3-grams+dicts 0,957  0,003 0,961  0,003 0,959  0,004 0,960  0,002 
Table : Results of the experiments on the training set of 10,000 names. 
Here endings – 4 Russian female endings, trigrams – 1000 most frequent 
3-grams, dictionary – name/surname dict. This table presents precision, 
recall and F-measure of the female class. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Results 
Figure : Learning curves of single and combined models. Accuracy was 
estimated on separate sample of 10,000 names. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Problem 
Motivation 
Goal: to detect Russian-speaking users 
Cyrillic alphabet is used also by Ukrainian, Belorussian, 
Bulgarian, Serbian, Macedonian, Kazakh, etc 
Research Questions 
Which method is the best for Russian language? 
How to adopt it to the FB profile? 
Contributions 
Comparison of Russian-enabled language detection modules. 
A technique for identification of Russian-speaking users. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Method 
input: a FB user profile 
output: is Russian-speaker? (or a set of languages user speaks) 
Common Russian character trigrams 
íà ,  ïð,  òî,  íå,  ëè,  ïî, íî ,  â ,  íà,  
òü,  íå,  è ,  êî,  îì, ïðî, òî ,  èõ,  êà, àòü, 
îòî,  çà,  èå, îâà, òåë, òîð,  äå, îé , ñòè,  
îò, àõ ,  ìè, ñòð,  áå,  âî,  ðà, àÿ , âàò, åé , 
åò ,  æå, è÷å, èÿ , îâ , ñòî,  îá, âåð, ãî , è 
â, è ï, è ñ, èè , èñò, î â, îñò, òðà,  òå, åëè, 
åðå, êîò, ëüí, íèê, íòè, î ñ 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Existing modules for language identification 
langid.py 
https://github.com/saffsd/langid.py 
Advanced n-gram selection 
chromium compact language detector (cld) 
https://code.google.com/p/chromium-compact-language-detector 
guess-language 
https://code.google.com/p/guess-language 
Google Translate API 
https://developers.google.com/translate/v2/using_ 
rest#detect-language 
20$/1M characters 
Yandex Translate API 
http://api.yandex.ru/translate 
Free of charge, 1M of characters / day (by September 2013) 
Many more, e.g. language-detection for Java 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
DBpedia Dataset 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Accuracy of Different Language Detection Modules 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Facebook Dataset: Method 
Profile text: posts + comments + user names – Latin 
symbols. 
Profile text length: 3,367 +- 17,540 
Russian-speakers: P(ru)  0.95 
Core Russian-speakers: 
P(ru)  0.95 
# Cyrillic symbols = 20% 
locale is ru_RU 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Facebook Dataset: Results 
9,906,524 public FB profiles (= 50 cyr. characters) 
8,687,915 (88%) Russian-speaking users 
3,190,813 (32%) core Russian-speaking public Facebook users 
5,365,691 (54%) of profiles with no profile text (= 200 
characters) 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Problem 
Joint work with Dmitry Babaev and Sergei Objedkov. 
input: some SN data representing a user 
output: list of user interests 
Motivation 
Advertisement: targeting, user segmentation, etc. 
Recommendations of content and friends 
Customization of user experience 
. . . 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Data: FB and VK groups 
Text corpus 
41 million of VK groups 
11 million of FB publics 
1.5 million of FB groups 
Data format 
Title and/or description 
List of members 
Number of comments, likes, posts by a member 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Data: VK groups 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
253 interests detected by our system 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Method 
1 Create a text index of groups 
2 Create a keyword list for each of 253 interests 
3 KW classifier: 
Retrieve top k groups retrieved by a set interest keywords 
Rank by TF-IDF 
Associate group’s interests with its users 
A group may have multiple interests 
4 ML classifier: 
Use top k groups as a training data 
BOW features 
Keyword features 
Linear models: L2 LR, Liner SVM, NB 
Classify all groups 
A group may have up to three top interests 
Associate group’s interests with its users 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Association of group’s interests with its users 
Engagement of a person into an interest category is 
proportional to the activity of the person in groups of this category: 
e  wlike  l + ws:comm  cs + wl:comm  cl + wrepost  r 
l – the number of post likes 
cs – the number of short comments 
cl – the number of long comments 
r – the number of reposts 
Association score of a user and an interest depends on 
engagement in a group and on the number of groups: 
all    efb  gfb +
evk  gvk : 
evk , efb – engagement into VK/FB interest 
gvk , gfb – number of groups a user has in FB/VK 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Results 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Results per category: the best and the worst 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Top 30 interests on FB and VK 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Intersection of the top 30 interests on FB and VK 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Interests co-occurrences 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Problem 
Joint work with Dmitry Babaev and Segei Objedkov. 
Motivation 
input: a user profile of one social network 
output: profile of the same person in another social network 
immediate applications in marketing, search, security, etc. 
Contribution 
user identity resolution approach 
precision of 0.98 and recall of 0.54 
the method is computationally effective and easily parallelizable 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Dataset 
VK Facebook 
Number of users in our dataset 89,561,085 2,903,144 
Number of users in Russia 1 100,000,000 13,000,000 
User overlap 29% 88% 
training set: 92,488 matched FB-VK profiles 
1According to comScore and http://vk.com/about 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Profile matching algorithm 
1 Candidate generation. For each VK profile we retrieve a set 
of FB profiles with similar first and second names. 
2 Candidate ranking. The candidates are ranked according to 
similarity of their friends. 
3 Selection of the best candidate. The goal of the final step 
is to select the best match from the list of candidates. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Candidate generation 
Retrieve FB users with names similar to the input VK profile. 
Two names are similar if the first letters are the same and the 
edit distance between names  2. 
Levenshtein Automata for fuzzy match between a VK user 
name and all FB user names 
Automatically extracted dictionary of name synonyms: 
“Alexander”, “Sasha”, “Sanya”, “Sanek”, etc. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Candidate ranking 
The higher the number of friends with similar names in VK 
and FB profiles, the greater the similarity of these profiles. 
Two friends are considered to be similar if: 
First two letters of their last names match 
Similarity between first/last names sims are greater than 
thresholds ;
: 
sims (si ; sj ) = 1  
lev (si ; sj ) 
max(jsi j; jsj j) 
; 
Contribution of each friend to similarity simp of two profiles 
pvk and pfb is inverse of name expectation frequency: 
X 
simp(pvk ; pfb) = 
j:sims (sf 
i ;sf 
j )^sims (ss 
i ;ss 
j )
min(1; 
N 
jsf 
j j  jss 
j j 
): 
Here sf 
i and ss 
i are first and second names of a VK profile, 
correspondingly, while sf 
j and ss 
j refer to a FB profile. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Best candidate selection 
FB candidates are ranked according to similarity simp to an 
input profile pvk 
The best candidate pfb should pass two thresholds to match: 
its score should be higher than the score threshold 
: 
simp(pvk ; pfb)  
: 
either the only candidate or score ratio between it and the next 
best candidate p0 
fb should be higher than the ratio threshold : 
simp(pvk ; pfb) 
simp(pvk ; p0 
fb) 
 : 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Results 
Figure : Precision-recall plot of the matching method. The bold line 
denotes the best precision at given recall. 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Results: matching VK and FB profiles 
First name threshold,  0.8 
Second name threshold,
0.6 
Profile score threshold, 
 3 
Profile ratio threshold,  5 
Number of matched profiles 644,334 (22%) 
Expected precision 0.98 
Expected recall 0.54 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Outline 
1 Social Network Data 
2 Social Network Analysis 
3 User Gender Detection 
4 User Language Detection 
5 User Interests Detection 
6 VK-FB User Matching 
7 Other SNA Tasks 
Alexander Panchenko Text Analysis of Social Networks
Data SNA User Gender User Language User Interests User Matching Other tasks References 
Much more fun stuff can be done with the FB/VK data 
User Age  Region Detection 
Tell me who are your friends, and I will say who you are. 
Most frequent age/region of friends. 
Reject users with high variation of age/region among friends. 
Up to 85-90% of precision. 
User Income Detection 
Transfer learning: target variable is not present in SNs. 
Training a model on a set of users with known income. 
Applying the model on the social network profiles. 
Alexander Panchenko Text Analysis of Social Networks

Weitere ähnliche Inhalte

Andere mochten auch

Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Alexander Panchenko
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation Extraction
Alexander Panchenko
 

Andere mochten auch (20)

Document
DocumentDocument
Document
 
Similarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation ExtractionSimilarity Measures for Semantic Relation Extraction
Similarity Measures for Semantic Relation Extraction
 
Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...Вычислительная лексическая семантика: метрики семантической близости и их при...
Вычислительная лексическая семантика: метрики семантической близости и их при...
 
Social Networks at Scale
Social Networks at ScaleSocial Networks at Scale
Social Networks at Scale
 
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
Dmitry Gubanov. An Approach to the Study of Formal and Informal Relations of ...
 
2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media sna2015 pdf-marc smith-node xl-social media sna
2015 pdf-marc smith-node xl-social media sna
 
Making the invisible visible through SNA
Making the invisible visible through SNAMaking the invisible visible through SNA
Making the invisible visible through SNA
 
Социальные сети в России. Исследование аудитории Вконтакте.
Социальные сети в России. Исследование аудитории Вконтакте.Социальные сети в России. Исследование аудитории Вконтакте.
Социальные сети в России. Исследование аудитории Вконтакте.
 
Making Sense of Word Embeddings
Making Sense of Word EmbeddingsMaking Sense of Word Embeddings
Making Sense of Word Embeddings
 
Semantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation ExtractionSemantic Similarity Measures for Semantic Relation Extraction
Semantic Similarity Measures for Semantic Relation Extraction
 
Social Network Analysis for Competitive Intelligence
Social Network Analysis for Competitive IntelligenceSocial Network Analysis for Competitive Intelligence
Social Network Analysis for Competitive Intelligence
 
Social network analysis
Social network analysisSocial network analysis
Social network analysis
 
Using Social Network Analysis to Assess Organizational Development Initiatives
Using Social Network Analysis to Assess Organizational Development InitiativesUsing Social Network Analysis to Assess Organizational Development Initiatives
Using Social Network Analysis to Assess Organizational Development Initiatives
 
Social Network Analysis (SNA) and its implications for knowledge discovery in...
Social Network Analysis (SNA) and its implications for knowledge discovery in...Social Network Analysis (SNA) and its implications for knowledge discovery in...
Social Network Analysis (SNA) and its implications for knowledge discovery in...
 
project
projectproject
project
 
Text analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco ControlText analytics in Python and R with examples from Tobacco Control
Text analytics in Python and R with examples from Tobacco Control
 
Statistical analysis of facebook using r
Statistical analysis of facebook using rStatistical analysis of facebook using r
Statistical analysis of facebook using r
 
TN3270 Access to Mainframe SNA Applications
TN3270 Access to Mainframe SNA ApplicationsTN3270 Access to Mainframe SNA Applications
TN3270 Access to Mainframe SNA Applications
 
Three different classifiers for facial age estimation based on K-nearest neig...
Three different classifiers for facial age estimation based on K-nearest neig...Three different classifiers for facial age estimation based on K-nearest neig...
Three different classifiers for facial age estimation based on K-nearest neig...
 
Дмитрий Бугайченко, Одноклассники. Анализ данных в социальных сетях на практике
Дмитрий Бугайченко, Одноклассники. Анализ данных в социальных сетях на практикеДмитрий Бугайченко, Одноклассники. Анализ данных в социальных сетях на практике
Дмитрий Бугайченко, Одноклассники. Анализ данных в социальных сетях на практике
 

Ähnlich wie Text Analysis of Social Networks: Working with FB and VK Data

2006-05-25__coi-semdis
2006-05-25__coi-semdis2006-05-25__coi-semdis
2006-05-25__coi-semdis
webuploader
 
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Anastasia Zhukova
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.
Deepak K
 
Mining and Supporting Community Structures in Sensor Network Research
Mining and Supporting Community Structures in Sensor Network ResearchMining and Supporting Community Structures in Sensor Network Research
Mining and Supporting Community Structures in Sensor Network Research
Marko Rodriguez
 
Network Theory Final Draft
Network Theory Final DraftNetwork Theory Final Draft
Network Theory Final Draft
kmbaldau
 

Ähnlich wie Text Analysis of Social Networks: Working with FB and VK Data (20)

Vivo Search
Vivo SearchVivo Search
Vivo Search
 
2006-05-25__coi-semdis
2006-05-25__coi-semdis2006-05-25__coi-semdis
2006-05-25__coi-semdis
 
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
Automated Identification of Framing by Word Choice and Labeling to Reveal Med...
 
Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.Improving VIVO search through semantic ranking.
Improving VIVO search through semantic ranking.
 
Quality assessment of Wikipedia and its sources
Quality assessment of Wikipedia and its sourcesQuality assessment of Wikipedia and its sources
Quality assessment of Wikipedia and its sources
 
Review of "Tastes, ties, and time: A new social network dataset using Faceboo...
Review of "Tastes, ties, and time: A new social network dataset using Faceboo...Review of "Tastes, ties, and time: A new social network dataset using Faceboo...
Review of "Tastes, ties, and time: A new social network dataset using Faceboo...
 
Network Analysis and Law: Introductory Tutorial @ Jurix 2011 Meeting (Vienna)
Network Analysis and Law: Introductory Tutorial @ Jurix 2011 Meeting (Vienna)Network Analysis and Law: Introductory Tutorial @ Jurix 2011 Meeting (Vienna)
Network Analysis and Law: Introductory Tutorial @ Jurix 2011 Meeting (Vienna)
 
Social Network Analysis
Social Network AnalysisSocial Network Analysis
Social Network Analysis
 
Optimizing Search Interactions within Professional Social Networks (thesis p...
Optimizing Search Interactions within Professional Social Networks (thesis p...Optimizing Search Interactions within Professional Social Networks (thesis p...
Optimizing Search Interactions within Professional Social Networks (thesis p...
 
Factoid based natural language question generation system
Factoid based natural language question generation systemFactoid based natural language question generation system
Factoid based natural language question generation system
 
Data science syllabus
Data science syllabusData science syllabus
Data science syllabus
 
Question Answering over Linked Data - Reasoning Issues
Question Answering over Linked Data - Reasoning IssuesQuestion Answering over Linked Data - Reasoning Issues
Question Answering over Linked Data - Reasoning Issues
 
Can we predict your sentiments by listening to your peers?
Can we predict your sentiments by listening to your peers?Can we predict your sentiments by listening to your peers?
Can we predict your sentiments by listening to your peers?
 
Knowledge Acquisition from Social Awareness Streams
Knowledge Acquisition from Social Awareness StreamsKnowledge Acquisition from Social Awareness Streams
Knowledge Acquisition from Social Awareness Streams
 
Social Network Analysis based on MOOC's (Massive Open Online Classes)
Social Network Analysis based on MOOC's (Massive Open Online Classes)Social Network Analysis based on MOOC's (Massive Open Online Classes)
Social Network Analysis based on MOOC's (Massive Open Online Classes)
 
Mining and analyzing social media part 1 - hicss47 tutorial - dave king
Mining and analyzing social media   part 1 - hicss47 tutorial - dave kingMining and analyzing social media   part 1 - hicss47 tutorial - dave king
Mining and analyzing social media part 1 - hicss47 tutorial - dave king
 
Mining and Supporting Community Structures in Sensor Network Research
Mining and Supporting Community Structures in Sensor Network ResearchMining and Supporting Community Structures in Sensor Network Research
Mining and Supporting Community Structures in Sensor Network Research
 
Network Theory Final Draft
Network Theory Final DraftNetwork Theory Final Draft
Network Theory Final Draft
 
Dissertation Defense
Dissertation DefenseDissertation Defense
Dissertation Defense
 
Future of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic WebFuture of Web 2.0 & The Semantic Web
Future of Web 2.0 & The Semantic Web
 

Mehr von Alexander Panchenko

The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
Alexander Panchenko
 
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationUsing Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Alexander Panchenko
 
Неологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукНеологизмы в социальной сети Фейсбук
Неологизмы в социальной сети Фейсбук
Alexander Panchenko
 

Mehr von Alexander Panchenko (13)

Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...Graph's not dead: from unsupervised induction of linguistic structures from t...
Graph's not dead: from unsupervised induction of linguistic structures from t...
 
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common CrawlBuilding a Web-Scale Dependency-Parsed Corpus from Common Crawl
Building a Web-Scale Dependency-Parsed Corpus from Common Crawl
 
Improving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic ClassesImproving Hypernymy Extraction with Distributional Semantic Classes
Improving Hypernymy Extraction with Distributional Semantic Classes
 
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical ResourcesInducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
Inducing Interpretable Word Senses for WSD and Enrichment of Lexical Resources
 
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
IIT-UHH at SemEval-2017 Task 3: Exploring Multiple Features for Community Que...
 
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
Fighting with Sparsity of the Synonymy Dictionaries for Automatic Synset Indu...
 
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...The 6th Conference on Analysis of Images, Social Networks, and Texts  (AIST 2...
The 6th Conference on Analysis of Images, Social Networks, and Texts (AIST 2...
 
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense DisambiguationUsing Linked Disambiguated Distributional Networks for Word Sense Disambiguation
Using Linked Disambiguated Distributional Networks for Word Sense Disambiguation
 
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
Unsupervised Does Not Mean Uninterpretable: The Case for Word Sense Induction...
 
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
Noun Sense Induction and Disambiguation using Graph-Based Distributional Sema...
 
Getting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part IIGetting started in Apache Spark and Flink (with Scala) - Part II
Getting started in Apache Spark and Flink (with Scala) - Part II
 
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
IIT-TUDA at SemEval-2016 Task 5: Beyond Sentiment Lexicon: Combining Domain ...
 
Неологизмы в социальной сети Фейсбук
Неологизмы в социальной сети ФейсбукНеологизмы в социальной сети Фейсбук
Неологизмы в социальной сети Фейсбук
 

Kürzlich hochgeladen

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
gindu3009
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
ssuser79fe74
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
PirithiRaju
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
Areesha Ahmad
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Sérgio Sacani
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
Lokesh Kothari
 

Kürzlich hochgeladen (20)

Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
Feature-aligned N-BEATS with Sinkhorn divergence (ICLR '24)
 
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls AgencyHire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
Hire 💕 9907093804 Hooghly Call Girls Service Call Girls Agency
 
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptxCOST ESTIMATION FOR A RESEARCH PROJECT.pptx
COST ESTIMATION FOR A RESEARCH PROJECT.pptx
 
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
Chemical Tests; flame test, positive and negative ions test Edexcel Internati...
 
GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)GBSN - Microbiology (Unit 3)
GBSN - Microbiology (Unit 3)
 
GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)GBSN - Microbiology (Unit 1)
GBSN - Microbiology (Unit 1)
 
GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)GBSN - Microbiology (Unit 2)
GBSN - Microbiology (Unit 2)
 
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceuticsPulmonary drug delivery system M.pharm -2nd sem P'ceutics
Pulmonary drug delivery system M.pharm -2nd sem P'ceutics
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdfPests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
Pests of cotton_Borer_Pests_Binomics_Dr.UPR.pdf
 
Conjugation, transduction and transformation
Conjugation, transduction and transformationConjugation, transduction and transformation
Conjugation, transduction and transformation
 
Chemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdfChemistry 4th semester series (krishna).pdf
Chemistry 4th semester series (krishna).pdf
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
High Class Escorts in Hyderabad ₹7.5k Pick Up & Drop With Cash Payment 969456...
 
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
GUIDELINES ON SIMILAR BIOLOGICS Regulatory Requirements for Marketing Authori...
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Botany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdfBotany 4th semester file By Sumit Kumar yadav.pdf
Botany 4th semester file By Sumit Kumar yadav.pdf
 

Text Analysis of Social Networks: Working with FB and VK Data

  • 1. Data SNA User Gender User Language User Interests User Matching Other tasks References Text Analysis of Social Networks: Working with FB and VK Data AI Ukraine Conference, Kharkiv, Ukraine Alexander Panchenko alexander.panchenko@uclouvain.be Digital Society Laboratory LLC & UCLouvain October 25, 2014 Alexander Panchenko Text Analysis of Social Networks
  • 2. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 3. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 4. Data SNA User Gender User Language User Interests User Matching Other tasks References Social networks from the users’s standpoint Facebook (FB) and VKontakte (VK) Alexander Panchenko Text Analysis of Social Networks
  • 5. Data SNA User Gender User Language User Interests User Matching Other tasks References Social networks from the data miner’s standpoint Facebook (FB) and VKontakte (VK) Profiles: a set of user attributes categorical variables (region, city, profession, etc.) integer variables (age, graduation year, etc.) text variables (name, surname, etc.) Network: a graph that relates users friendship graph followers graph commenting graph, etc. Texts: posts comments group titles and descriptions Alexander Panchenko Text Analysis of Social Networks
  • 6. Data SNA User Gender User Language User Interests User Matching Other tasks References Gathering of VK and FB data Big Data: VK worth tens or even hundreds of TB Decide what do you need (posts, profiles, etc.). Download: API Scraping Download limits and API limitations are specific for each network. Parallelization is very practical, especially horizontal one: Amazon EC2, Distributed Message Queues Alexander Panchenko Text Analysis of Social Networks
  • 7. Data SNA User Gender User Language User Interests User Matching Other tasks References Storing VK and FB data Again, Big Data NoSQL solutions are helpful Raw data: Amazon S3 For analysis: HDFS Efficient retrieval: Elastic Search Alexander Panchenko Text Analysis of Social Networks
  • 8. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 9. Data SNA User Gender User Language User Interests User Matching Other tasks References Social Network Analysis Structure analysis: friendship graph, comments graph, etc. Content analysis: profile attributes, posts, comments, etc. Combined approaches. What scientific communities analyze social networks? 60s – the first structural methods 00s – online social network analysis boom Social Network Analysis community (Sociologists, Statisticians, Physicists) Data and Graph Mining community Natural Language Processing community Alexander Panchenko Text Analysis of Social Networks
  • 10. Data SNA User Gender User Language User Interests User Matching Other tasks References Technologies for analysis of social networks Machine Learning: hidden vs observable user attributes Training of the model often can be scaled vertically Applying the model should be scaled horizontally Alexander Panchenko Text Analysis of Social Networks
  • 11. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 12. Data SNA User Gender User Language User Interests User Matching Other tasks References Problem Joint work with Andrey Teterin. Detect gender of a user to profile a user; user segmentation is helpful in search, advertisement, etc. By text written by a user: Ciot et al. [2013], Koppel et al. [2002], Goswami et al. [2009], Mukherjee and Liu [2010], Peersman et al. [2011], Rao et al. [2010], Rangel and Rosso Rangel and Rosso [2013], Al Zamal et al.Al Zamal et al. [2012] and Lui et al. Liu et al. [2012]. By full name: Burger et al. [2011], Panchenko and Teterin [2014] Alexander Panchenko Text Analysis of Social Networks
  • 13. Data SNA User Gender User Language User Interests User Matching Other tasks References Online demo http://research.digsolab.com/gender Alexander Panchenko Text Analysis of Social Networks
  • 14. Data SNA User Gender User Language User Interests User Matching Other tasks References Training Data 100,000 full names of Facebook users with known gender full name – first and last name of a user gender: male or female names written in both Cyrillic and Latin alphabets “Alexander Ivanov”, “Masha Sidorova”, “Pavel Nikolenko”, etc. Alexander Panchenko Text Analysis of Social Networks
  • 15. Data SNA User Gender User Language User Interests User Matching Other tasks References Training Data Figure : Name-surname Alexander co-Panchenko occurrences: Text rows Analysis and of columns Social Networks are sorted by
  • 16. Data SNA User Gender User Language User Interests User Matching Other tasks References Character endings of Russian names 72% of first names have typical male/female ending 68% of surnames have typical male/female ending a typical male/female ending splits males from females with an error less than 5% gender of 50% first names recognized with 8 endings gender of 50% second names recognized with 5 endings Conclusion Simple symbolic ending-based method cannot robustly classify about 30% of names. This motivates the need for a more sophisticated statistical approach. Alexander Panchenko Text Analysis of Social Networks
  • 17. Data SNA User Gender User Language User Interests User Matching Other tasks References Character endings of Russian names Type Ending Gender Error, % Example first name na (íà) female 0.27 Ekaterina first name iya (èÿ) female 0.32 Anastasiya first name ei (åé) male 0.16 Sergei first name dr (äð) male 0.00 Alexandr first name ga (ãà) male 4.94 Serega first name an (àí) male 4.99 Ivan first name la (ëà) female 4.23 Luidmila first name ii (èé) male 0.34 Yurii second name va (âà) female 0.28 Morozova second name ov (îâ) male 0.21 Objedkov second name na (íà) female 2.22 Matyushina second name ev (åâ) male 0.44 Sergeev second name in (èí) male 1.94 Teterin Table : Most discriminative and frequent two character endings of Russian names. Alexander Panchenko Text Analysis of Social Networks
  • 18. Data SNA User Gender User Language User Interests User Matching Other tasks References Gender Detection Method input: a string representing a name of a person output: gender (male or female) binary classification task Features endings character n-grams dictionary of male/female names and surnames Model L2-regularized Logistic Regression Alexander Panchenko Text Analysis of Social Networks
  • 19. Data SNA User Gender User Language User Interests User Matching Other tasks References Features Character n-grams males: Alexander Yaroskavski, Oleg Arbuzov females: Alexandra Yaroskavskaya, Nayaliya Arbuzova BUT: “Sidorenko”, “Moroz” or “Bondar”! two most common one-character endings: “a” and “ya” (“ÿ”) Dictionaries of first and last names probability that it belongs to the male gender: P(c = malejfirstname), P(c = malejlastname). 3,427 first names, 11,411 last names Alexander Panchenko Text Analysis of Social Networks
  • 20. Data SNA User Gender User Language User Interests User Matching Other tasks References Results Model Accuracy Precision Recall F-measure rule-based baseline 0,638 0,995 0,633 0,774 endings 0,850 0,002 0,921 0,003 0,784 0,004 0,847 0,002 3-grams 0,944 0,003 0,948 0,003 0,946 0,003 0,947 0,003 dicts 0,956 0,002 0,992 0,001 0,925 0,003 0,957 0,002 endings+3-grams 0,946 0,003 0,950 0,002 0,947 0,004 0,949 0,003 3-grams+dicts 0,956 0,003 0,960 0,003 0,957 0,004 0,959 0,003 endings+3-grams+dicts 0,957 0,003 0,961 0,003 0,959 0,004 0,960 0,002 Table : Results of the experiments on the training set of 10,000 names. Here endings – 4 Russian female endings, trigrams – 1000 most frequent 3-grams, dictionary – name/surname dict. This table presents precision, recall and F-measure of the female class. Alexander Panchenko Text Analysis of Social Networks
  • 21. Data SNA User Gender User Language User Interests User Matching Other tasks References Results Figure : Learning curves of single and combined models. Accuracy was estimated on separate sample of 10,000 names. Alexander Panchenko Text Analysis of Social Networks
  • 22. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 23. Data SNA User Gender User Language User Interests User Matching Other tasks References Problem Motivation Goal: to detect Russian-speaking users Cyrillic alphabet is used also by Ukrainian, Belorussian, Bulgarian, Serbian, Macedonian, Kazakh, etc Research Questions Which method is the best for Russian language? How to adopt it to the FB profile? Contributions Comparison of Russian-enabled language detection modules. A technique for identification of Russian-speaking users. Alexander Panchenko Text Analysis of Social Networks
  • 24. Data SNA User Gender User Language User Interests User Matching Other tasks References Method input: a FB user profile output: is Russian-speaker? (or a set of languages user speaks) Common Russian character trigrams íà , ïð, òî, íå, ëè, ïî, íî , â , íà, òü, íå, è , êî, îì, ïðî, òî , èõ, êà, àòü, îòî, çà, èå, îâà, òåë, òîð, äå, îé , ñòè, îò, àõ , ìè, ñòð, áå, âî, ðà, àÿ , âàò, åé , åò , æå, è÷å, èÿ , îâ , ñòî, îá, âåð, ãî , è â, è ï, è ñ, èè , èñò, î â, îñò, òðà, òå, åëè, åðå, êîò, ëüí, íèê, íòè, î ñ Alexander Panchenko Text Analysis of Social Networks
  • 25. Data SNA User Gender User Language User Interests User Matching Other tasks References Existing modules for language identification langid.py https://github.com/saffsd/langid.py Advanced n-gram selection chromium compact language detector (cld) https://code.google.com/p/chromium-compact-language-detector guess-language https://code.google.com/p/guess-language Google Translate API https://developers.google.com/translate/v2/using_ rest#detect-language 20$/1M characters Yandex Translate API http://api.yandex.ru/translate Free of charge, 1M of characters / day (by September 2013) Many more, e.g. language-detection for Java Alexander Panchenko Text Analysis of Social Networks
  • 26. Data SNA User Gender User Language User Interests User Matching Other tasks References DBpedia Dataset Alexander Panchenko Text Analysis of Social Networks
  • 27. Data SNA User Gender User Language User Interests User Matching Other tasks References Accuracy of Different Language Detection Modules Alexander Panchenko Text Analysis of Social Networks
  • 28. Data SNA User Gender User Language User Interests User Matching Other tasks References Facebook Dataset: Method Profile text: posts + comments + user names – Latin symbols. Profile text length: 3,367 +- 17,540 Russian-speakers: P(ru) 0.95 Core Russian-speakers: P(ru) 0.95 # Cyrillic symbols = 20% locale is ru_RU Alexander Panchenko Text Analysis of Social Networks
  • 29. Data SNA User Gender User Language User Interests User Matching Other tasks References Facebook Dataset: Results 9,906,524 public FB profiles (= 50 cyr. characters) 8,687,915 (88%) Russian-speaking users 3,190,813 (32%) core Russian-speaking public Facebook users 5,365,691 (54%) of profiles with no profile text (= 200 characters) Alexander Panchenko Text Analysis of Social Networks
  • 30. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 31. Data SNA User Gender User Language User Interests User Matching Other tasks References Problem Joint work with Dmitry Babaev and Sergei Objedkov. input: some SN data representing a user output: list of user interests Motivation Advertisement: targeting, user segmentation, etc. Recommendations of content and friends Customization of user experience . . . Alexander Panchenko Text Analysis of Social Networks
  • 32. Data SNA User Gender User Language User Interests User Matching Other tasks References Data: FB and VK groups Text corpus 41 million of VK groups 11 million of FB publics 1.5 million of FB groups Data format Title and/or description List of members Number of comments, likes, posts by a member Alexander Panchenko Text Analysis of Social Networks
  • 33. Data SNA User Gender User Language User Interests User Matching Other tasks References Data: VK groups Alexander Panchenko Text Analysis of Social Networks
  • 34. Data SNA User Gender User Language User Interests User Matching Other tasks References 253 interests detected by our system Alexander Panchenko Text Analysis of Social Networks
  • 35. Data SNA User Gender User Language User Interests User Matching Other tasks References Method 1 Create a text index of groups 2 Create a keyword list for each of 253 interests 3 KW classifier: Retrieve top k groups retrieved by a set interest keywords Rank by TF-IDF Associate group’s interests with its users A group may have multiple interests 4 ML classifier: Use top k groups as a training data BOW features Keyword features Linear models: L2 LR, Liner SVM, NB Classify all groups A group may have up to three top interests Associate group’s interests with its users Alexander Panchenko Text Analysis of Social Networks
  • 36. Data SNA User Gender User Language User Interests User Matching Other tasks References Association of group’s interests with its users Engagement of a person into an interest category is proportional to the activity of the person in groups of this category: e wlike l + ws:comm cs + wl:comm cl + wrepost r l – the number of post likes cs – the number of short comments cl – the number of long comments r – the number of reposts Association score of a user and an interest depends on engagement in a group and on the number of groups: all efb gfb +
  • 37. evk gvk : evk , efb – engagement into VK/FB interest gvk , gfb – number of groups a user has in FB/VK Alexander Panchenko Text Analysis of Social Networks
  • 38. Data SNA User Gender User Language User Interests User Matching Other tasks References Results Alexander Panchenko Text Analysis of Social Networks
  • 39. Data SNA User Gender User Language User Interests User Matching Other tasks References Results per category: the best and the worst Alexander Panchenko Text Analysis of Social Networks
  • 40. Data SNA User Gender User Language User Interests User Matching Other tasks References Top 30 interests on FB and VK Alexander Panchenko Text Analysis of Social Networks
  • 41. Data SNA User Gender User Language User Interests User Matching Other tasks References Intersection of the top 30 interests on FB and VK Alexander Panchenko Text Analysis of Social Networks
  • 42. Data SNA User Gender User Language User Interests User Matching Other tasks References Interests co-occurrences Alexander Panchenko Text Analysis of Social Networks
  • 43. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 44. Data SNA User Gender User Language User Interests User Matching Other tasks References Problem Joint work with Dmitry Babaev and Segei Objedkov. Motivation input: a user profile of one social network output: profile of the same person in another social network immediate applications in marketing, search, security, etc. Contribution user identity resolution approach precision of 0.98 and recall of 0.54 the method is computationally effective and easily parallelizable Alexander Panchenko Text Analysis of Social Networks
  • 45. Data SNA User Gender User Language User Interests User Matching Other tasks References Dataset VK Facebook Number of users in our dataset 89,561,085 2,903,144 Number of users in Russia 1 100,000,000 13,000,000 User overlap 29% 88% training set: 92,488 matched FB-VK profiles 1According to comScore and http://vk.com/about Alexander Panchenko Text Analysis of Social Networks
  • 46. Data SNA User Gender User Language User Interests User Matching Other tasks References Profile matching algorithm 1 Candidate generation. For each VK profile we retrieve a set of FB profiles with similar first and second names. 2 Candidate ranking. The candidates are ranked according to similarity of their friends. 3 Selection of the best candidate. The goal of the final step is to select the best match from the list of candidates. Alexander Panchenko Text Analysis of Social Networks
  • 47. Data SNA User Gender User Language User Interests User Matching Other tasks References Candidate generation Retrieve FB users with names similar to the input VK profile. Two names are similar if the first letters are the same and the edit distance between names 2. Levenshtein Automata for fuzzy match between a VK user name and all FB user names Automatically extracted dictionary of name synonyms: “Alexander”, “Sasha”, “Sanya”, “Sanek”, etc. Alexander Panchenko Text Analysis of Social Networks
  • 48. Data SNA User Gender User Language User Interests User Matching Other tasks References Candidate ranking The higher the number of friends with similar names in VK and FB profiles, the greater the similarity of these profiles. Two friends are considered to be similar if: First two letters of their last names match Similarity between first/last names sims are greater than thresholds ;
  • 49. : sims (si ; sj ) = 1 lev (si ; sj ) max(jsi j; jsj j) ; Contribution of each friend to similarity simp of two profiles pvk and pfb is inverse of name expectation frequency: X simp(pvk ; pfb) = j:sims (sf i ;sf j )^sims (ss i ;ss j )
  • 50. min(1; N jsf j j jss j j ): Here sf i and ss i are first and second names of a VK profile, correspondingly, while sf j and ss j refer to a FB profile. Alexander Panchenko Text Analysis of Social Networks
  • 51. Data SNA User Gender User Language User Interests User Matching Other tasks References Best candidate selection FB candidates are ranked according to similarity simp to an input profile pvk The best candidate pfb should pass two thresholds to match: its score should be higher than the score threshold : simp(pvk ; pfb) : either the only candidate or score ratio between it and the next best candidate p0 fb should be higher than the ratio threshold : simp(pvk ; pfb) simp(pvk ; p0 fb) : Alexander Panchenko Text Analysis of Social Networks
  • 52. Data SNA User Gender User Language User Interests User Matching Other tasks References Results Figure : Precision-recall plot of the matching method. The bold line denotes the best precision at given recall. Alexander Panchenko Text Analysis of Social Networks
  • 53. Data SNA User Gender User Language User Interests User Matching Other tasks References Results: matching VK and FB profiles First name threshold, 0.8 Second name threshold,
  • 54. 0.6 Profile score threshold, 3 Profile ratio threshold, 5 Number of matched profiles 644,334 (22%) Expected precision 0.98 Expected recall 0.54 Alexander Panchenko Text Analysis of Social Networks
  • 55. Data SNA User Gender User Language User Interests User Matching Other tasks References Outline 1 Social Network Data 2 Social Network Analysis 3 User Gender Detection 4 User Language Detection 5 User Interests Detection 6 VK-FB User Matching 7 Other SNA Tasks Alexander Panchenko Text Analysis of Social Networks
  • 56. Data SNA User Gender User Language User Interests User Matching Other tasks References Much more fun stuff can be done with the FB/VK data User Age Region Detection Tell me who are your friends, and I will say who you are. Most frequent age/region of friends. Reject users with high variation of age/region among friends. Up to 85-90% of precision. User Income Detection Transfer learning: target variable is not present in SNs. Training a model on a set of users with known income. Applying the model on the social network profiles. Alexander Panchenko Text Analysis of Social Networks
  • 57. Data SNA User Gender User Language User Interests User Matching Other tasks References Thank you! Questions? Alexander Panchenko Text Analysis of Social Networks
  • 58. Data SNA User Gender User Language User Interests User Matching Other tasks References Al Zamal, F., Liu, W., and Ruths, D. (2012). Homophily and latent attribute inference: Inferring latent attributes of twitter users from neighbors. In ICWSM. Burger, J. D., Henderson, J., Kim, G., and Zarrella, G. (2011). Discriminating gender on twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1301–1309. Association for Computational Linguistics. Ciot, M., Sonderegger, M., and Ruths, D. (2013). Gender inference of twitter users in non-english contexts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Wash, pages 18–21. Goswami, S., Sarkar, S., and Rustagi, M. (2009). Stylometric analysis of bloggers’ age and gender. In Third International AAAI Conference on Weblogs and Social Media. Koppel, M., Argamon, S., and Shimoni, A. R. (2002). Alexander Panchenko Text Analysis of Social Networks
  • 59. Data SNA User Gender User Language User Interests User Matching Other tasks References Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412. Liu, W., Zamal, F. A., and Ruths, D. (2012). Using social media to infer gender composition of commuter populations. Proceedings of the When the City Meets the Citizen Worksop. Mukherjee, A. and Liu, B. (2010). Improving gender classification of blog authors. In Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pages 207–217. Association for Computational Linguistics. Peersman, C., Daelemans, W., and Van Vaerenbergh, L. (2011). Predicting age and gender in online social networks. In Proceedings of the 3rd international workshop on Search and mining user-generated contents, pages 37–44. ACM. Rangel, F. and Rosso, P. (2013). Use of language and author profiling: Identification of gender and age. Natural Language Processing and Cognitive Science, page 177. Alexander Panchenko Text Analysis of Social Networks
  • 60. Data SNA User Gender User Language User Interests User Matching Other tasks References Rao, D., Yarowsky, D., Shreevats, A., and Gupta, M. (2010). Classifying latent user attributes in twitter. In Proceedings of the 2nd international workshop on Search and mining user-generated contents, pages 37–44. ACM. Alexander Panchenko Text Analysis of Social Networks