This document describes research on linking a user's accounts across multiple online social networks. It discusses why linking is hard: usernames and profile information can differ across networks. Existing linking techniques are reviewed, along with their limitations. The paper then presents a new supervised learning approach that links Twitter and LinkedIn accounts using similarity metrics for different profile fields. In the evaluation, the approach matches accounts with 98% accuracy and can also discover new candidate matches for a given user profile.
Studying Digital Footprints Across Social Networks
1. Studying User Footprints in Different Online Social Networks
Anshu Malhotra (1), Luam Totti (2), Wagner Meira Jr. (2),
Ponnurangam Kumaraguru (1), Virgílio Almeida (2)
(1) Indraprastha Institute of Information Technology, New Delhi, India
(2) Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
August, 2012
2. Studying User Footprints in Different Online Social Networks
Online Digital Footprints
Users commonly register and access accounts (sometimes
several) on multiple and diverse online services such as
Facebook, LinkedIn, Twitter and YouTube
The set of all information related to the user, either provided
directly or observed from the user’s interaction, is often called
the user’s online digital footprint [6]
3. Studying User Footprints in Different Online Social Networks
Linking User’s Online Accounts
To create a user’s digital footprint the user’s multiple
accounts must be known.
We call this process linking user’s online accounts.
4. Studying User Footprints in Different Online Social Networks
Linking User’s Online Accounts
Linking user accounts from different services can serve several
purposes [1, 4, 10, 11, 12, 13]:
Centralize user information, enforcing data consistency and
simplifying account maintenance
Enrich recommendation systems
Cross-system personalization
Enable cross-system characterization and pattern analysis
Assess and possibly prevent unwanted information leakage,
thereby protecting users from various privacy and security
threats
5. Studying User Footprints in Different Online Social Networks
Main Challenges
Users may choose different usernames on different services,
and these may be unrelated to each other and to their real
names [5]
People with common names tend to have similar usernames
[8, 17]
Users may enter inconsistent and misleading information
across their profiles [5], unintentionally or often deliberately in
order to preserve privacy
Heterogeneity in the network structure and profile fields
among the services
6. Studying User Footprints in Different Online Social Networks
Existing Techniques
Various techniques have been proposed for unifying /
disambiguating users’ various profiles across different online
services:
Techniques based on FOAF ontology & graphs [4, 9, 10, 11]
Techniques based on user generated tags [5, 12, 13]
Techniques based on usernames [8, 17]
Techniques based on user profile attributes [1, 2, 3, 6, 7, 14]
7. Studying User Footprints in Different Online Social Networks
Limitations of Existing Techniques
Specificity to certain types of social networks
Dependency on identifiers like email IDs, Instant Messenger
IDs which might not be publicly available
Use of simple text matching algorithms for comparing
complex profile fields
Manual and experimental assignment of weights and
thresholds which can be subjective and not scalable
8. Studying User Footprints in Different Online Social Networks
Limitations of Existing Techniques
Use of small datasets (biggest being 5,000 users), wherein
the data collection and evaluation was done manually in
some approaches
Real world evaluation has not been done for most of the
techniques
Computationally expensive
9. Studying User Footprints in Different Online Social Networks
Major Contributions of our Work
A scalable supervised learning approach for linking users’
accounts from different services
Evaluation of different context specific similarity metrics
for comparing different profile fields
Results using a large dataset of linking user accounts across
Twitter and LinkedIn
Evaluation of the system’s performance for discovering new
accounts for a given user.
10. Studying User Footprints in Different Online Social Networks
User Profile Disambiguation - System Architecture
Account Correlation Extractor: collates the dataset of user
profiles known to belong to the same user across
different social networks
Profile Crawler: crawls the public profile information from
user accounts on these services
A user’s Online Digital Footprints are generated after Feature
Extraction and Selection
Various Classifiers are trained on account pairs belonging to
the same user and pairs belonging to different users; they
are then used to disambiguate user profiles, i.e. to classify a
given input profile pair as belonging to the same user or
not
11. Studying User Footprints in Different Online Social Networks
User Profile Disambiguation - System Architecture
12. Studying User Footprints in Different Online Social Networks
Dataset Collection - 1st Stage
The training and testing dataset was collated from two types
of sources:
Social Aggregators: Services that allow users to specify their
multiple accounts in order to create a unified feed. We
crawled 883,668 users from FriendFeed (http://friendfeed.com/)
and 38,755 users from Profilactic (http://www.profilactic.com/).
Social Graph (http://code.google.com/apis/socialgraph/docs/):
API that constructs and provides social interaction data,
including information about users’ accounts on multiple
services. Of the 14 million users collected, only 3.9
million had useful information.
13. Studying User Footprints in Different Online Social Networks
Dataset Collection - Example
twitter,justinbieber youtube,kidrauhl youtube,justinbieber
twitter,aplusk facebook,ashton
youtube,felipeneto youtube,felipenetovlog
youtube,maspoxavida twitter,pecesiqueira
youtube,jp youtube,MysteryGuitarMan
youtube,descealetra twitter,cauemoura
twitter,jcillpam11six facebook,jeremiahcillpam11six
twitter,cjdances facebook,cjperrydances
twitter,sirhilton facebook,richardsrueda
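Each example line appears to list the accounts of one user as space-separated service,username entries. The Python sketch below assumes that layout; the function names parse_accounts and cross_service_pairs are hypothetical and are not part of the original system. It shows how one line could be turned into known cross-service account pairs.

```python
from itertools import combinations

def parse_accounts(line):
    """Split one aggregator line into (service, username) tuples."""
    accounts = []
    for token in line.split():
        # Assumed layout: "service,username" with a single comma separator.
        service, _, username = token.partition(",")
        accounts.append((service, username))
    return accounts

def cross_service_pairs(accounts):
    """Return the account pairs of one user that span two different services."""
    return [(a, b) for a, b in combinations(accounts, 2) if a[0] != b[0]]

accounts = parse_accounts("twitter,justinbieber youtube,kidrauhl youtube,justinbieber")
pairs = cross_service_pairs(accounts)
for first, second in pairs:
    print(first, "<->", second)
```

Pairs within the same service (the two YouTube accounts here) are skipped, since the linking task concerns accounts on different services.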
14. Studying User Footprints in Different Online Social Networks
Dataset Collection - 2nd Stage
Publicly available information on each account was then
collected from each service
Four services were initially chosen for the analysis
Twitter, LinkedIn, YouTube and Flickr
Six profile fields were chosen for the analysis
UserID, display name, description, location, connections, image
15. Studying User Footprints in Different Online Social Networks
Dataset Collection
Due to the high percentage of missing fields, YouTube and
Flickr were excluded
All further analysis in this work refers only to Twitter and
LinkedIn accounts
Figure: Percentage of missing values (Missing (%), axis from 0 to 80)
for the Name, Location, Description and Image fields on Twitter,
LinkedIn, YouTube and Flickr.
16. Studying User Footprints in Different Online Social Networks
Profile Similarity
A profile may be seen as an N-dimensional vector, where each
component is a profile field [14]
Therefore, comparing profiles can be done component-wise
However, components are of very distinct nature and hence
demand different similarity methods for comparison
In this work we evaluated different approaches for comparing
each profile field
17. Studying User Footprints in Different Online Social Networks
Similarity Metrics
UserID & Display Name:
Jaro-Winkler [15] distance (JW) is best suited for measuring
similarity between short strings, hence it was used for both
these fields
Description (desc): Punctuation and stop words were removed
from the fields. The words were then lemmatized and
converted to lower case to produce the final token set.
TF-IDF: Cosine similarity between the two token sets using
their tf-idf vector space representation
Jaccard (Jacc): Jaccard’s similarity score between the two
token sets
Ontology (Ont): Wu-Palmer [16] similarity between
the WordNet-based ontologies of each description field
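The two token-set metrics for the description field can be sketched in plain Python. This is a minimal illustration under stated assumptions: the stop-word list is a tiny made-up sample, lemmatization is omitted, and the smoothed idf weighting is one common variant, since the slides do not say which tf-idf formula was used.

```python
import math
import string
from collections import Counter

# Tiny illustrative stop-word list (an assumption; the slides do not name one).
STOP_WORDS = {"a", "an", "and", "at", "in", "of", "the", "to"}

def tokenize(text):
    """Lower-case, strip punctuation and stop words (lemmatization omitted)."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOP_WORDS]

def jaccard(tokens_a, tokens_b):
    """Jaccard score: size of the intersection over size of the union."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def tfidf_cosine(tokens_a, tokens_b, corpus):
    """Cosine similarity of two token lists under tf-idf weights from corpus."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))

    def vector(tokens):
        counts = Counter(tokens)
        # Smoothed idf (assumed variant) so unseen terms never divide by zero.
        return {t: c * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
                for t, c in counts.items()}

    va, vb = vector(tokens_a), vector(tokens_b)
    dot = sum(w * vb.get(t, 0.0) for t, w in va.items())
    norm_a = math.sqrt(sum(w * w for w in va.values()))
    norm_b = math.sqrt(sum(w * w for w in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

d1 = tokenize("Professor of Computer Science at the University.")
d2 = tokenize("Computer science professor, privacy researcher")
score_jacc = jaccard(d1, d2)
score_tfidf = tfidf_cosine(d1, d2, corpus=[d1, d2])
print(score_jacc, round(score_tfidf, 3))
```

In practice the document frequencies would come from all descriptions in the dataset rather than from the two descriptions being compared.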
18. Studying User Footprints in Different Online Social Networks
Similarity Metrics
Location (Loc): Tokens were extracted from the location
fields of both profiles by removing punctuation and
converting them to lower case.
Sub-string (Substr): Normalized count of the tokens from one
field value that appear as a substring of the other field
value
Geo-distance (Geo): Euclidean distance between the two
locations using their latitudes and longitudes, obtained from
the Google Maps Geocoding API
(https://developers.google.com/maps)
Jaro-Winkler distance
Jaccard’s score
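The string and location metrics named above can be sketched as follows. This is an illustrative implementation, not the authors' code: the Jaro-Winkler function uses the usual textbook definition (prefix scale 0.1, prefix capped at 4 characters), the substring score is normalized by the token count of the first field (an assumption, since the slides do not give the normalization), and coordinates are assumed to have been geocoded already.

```python
import math
import string

def jaro(s1, s2):
    """Jaro similarity between two strings, in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1, matched2 = [False] * len1, [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    k = transpositions = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)

def substring_score(loc_a, loc_b):
    """Fraction of loc_a's tokens that occur as a substring of loc_b."""
    strip = str.maketrans("", "", string.punctuation)
    tokens = loc_a.lower().translate(strip).split()
    other = loc_b.lower().translate(strip)
    return sum(1 for t in tokens if t in other) / len(tokens) if tokens else 0.0

def geo_distance(coord_a, coord_b):
    """Euclidean distance between two (latitude, longitude) pairs, in degrees."""
    return math.hypot(coord_a[0] - coord_b[0], coord_a[1] - coord_b[1])

print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
print(substring_score("New Delhi, India", "delhi"))
print(geo_distance((0.0, 0.0), (3.0, 4.0)))
```

Note that jaro_winkler returns a similarity (1.0 for identical strings), whereas geo_distance is a true distance where smaller values mean closer locations.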
19. Studying User Footprints in Different Online Social Networks
Similarity Metrics
Profile Image (img ):
The profile image was downloaded and stored locally
Each image was then scaled down to 48 × 48 pixels using
cubic spline interpolation
Each image was then converted to grayscale
Each image could then be represented as a vector of values
from 0 to 255, to which Mean Square Error (mse), Peak
Signal-to-Noise Ratio (psnr), and Levenshtein distance (ls)
were applied to quantify profile image
similarity
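The three image measures can be sketched on plain lists of grayscale values. The sketch assumes the images have already been scaled and converted, so each is a list of integers from 0 to 255; the short four-value vectors are made up for illustration, and any normalization of the Levenshtein distance is left out because the slides do not specify one.

```python
import math

def mse(img_a, img_b):
    """Mean square error between two equal-length grayscale vectors."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)

def psnr(img_a, img_b, max_value=255):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = mse(img_a, img_b)
    if error == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / error)

def levenshtein(seq_a, seq_b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(seq_b) + 1))
    for i, a in enumerate(seq_a, start=1):
        current = [i]
        for j, b in enumerate(seq_b, start=1):
            cost = 0 if a == b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

# Made-up pixel vectors; a real 48 x 48 image would have 2,304 values.
img_a = [0, 128, 255, 64]
img_b = [0, 130, 255, 60]
print(mse(img_a, img_b), round(psnr(img_a, img_b), 2), levenshtein(img_a, img_b))
```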
20. Studying User Footprints in Different Online Social Networks
Similarity Metrics
Number of Connections (conn): For Twitter, the number of
connections of a user u is the number of users that u follows.
For LinkedIn, it is the number of users in the private network
of user u
The number of connections in different services can assume
different ranges, with different meanings
Normalized (norm): Each connection value c was normalized
to the range [0..1] using the smallest and greatest connection
values observed in each service. The similarity was then taken
as the unsigned difference between the two values.
Class: Each value was assigned an (equally sized) class denoting
how big it was. Five classes were used in this work (0-4). The
similarity was taken as the difference between the two classes.
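Both connection metrics can be sketched in a few lines. The sketch assumes that "equally sized" means equal-width bins over the normalized range (the slides do not say whether bins have equal width or equal population), and the per-service minimum and maximum values in the example are hypothetical.

```python
def normalize(value, smallest, greatest):
    """Min-max scale a connection count to [0, 1] for one service."""
    if greatest == smallest:
        return 0.0
    return (value - smallest) / (greatest - smallest)

def norm_difference(c_a, range_a, c_b, range_b):
    """Unsigned difference between the two normalized connection values."""
    return abs(normalize(c_a, *range_a) - normalize(c_b, *range_b))

def size_class(value, smallest, greatest, n_classes=5):
    """Map a count to one of n_classes equal-width classes (0 .. n_classes-1)."""
    scaled = normalize(value, smallest, greatest)
    return min(int(scaled * n_classes), n_classes - 1)

def class_difference(c_a, range_a, c_b, range_b):
    """Difference between the size classes of the two connection values."""
    return abs(size_class(c_a, *range_a) - size_class(c_b, *range_b))

# Hypothetical observed (smallest, greatest) counts per service.
twitter_range = (0, 2000)
linkedin_range = (0, 500)
print(norm_difference(500, twitter_range, 250, linkedin_range))
print(class_difference(500, twitter_range, 250, linkedin_range))
```

For both functions a smaller value means the two accounts have more similar connection counts relative to their own service.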
21. Studying User Footprints in Different Online Social Networks
Evaluation Experiments
Feature Analysis: To analyze the discriminative capacity of
different profile attributes and similarity metrics for
disambiguating user profiles
Matching Profiles: To test the effectiveness of supervised
learning approaches for classifying two profiles as belonging
to the same user
Discovering Candidate Profiles: To evaluate the
performance of our framework for discovering new accounts
for a given known user
The analysis was done using a dataset of account pairs of
29,129 unique users
22. Studying User Footprints in Different Online Social Networks
Feature Analysis
Feature      Metric   IG      Relief  MDL     Gini
UserID       JW       0.548   0.434   0.379   0.151
Name         JW       0.812   0.521   0.562   0.217
Description  Jacc     0.286   0.134   0.274   0.084
Description  TF-IDF   0.323   0.180   0.300   0.092
Description  Ont      0.161   0.113   0.188   0.051
Connections  Norm     0.000   0.002  -0.006   0.000
Connections  Class    0.009   0.095   0.006   0.003
Location     JW       0.232   0.108   0.158   0.067
Location     Jacc     0.337   0.041   0.233   0.098
Location     Substr   0.350   0.039   0.270   0.102
Location     Geo      0.520   0.227   0.488   0.146
Image        MSE      0.183   0.157   0.205   0.051
Image        PSNR     0.184   0.158   0.205   0.051
Image        LS       0.215   0.188   0.227   0.061
Table: Discriminative capacity of each <feature, metric> pair according to
four different approaches.
23. Studying User Footprints in Different Online Social Networks
Feature Analysis - Box Plots
Figure: Box plots for the UserID, Name and Location features.
Panels: userid jw, name jw, location jw, location jaccard,
location substring, location geo. Each panel compares Match and
Non Match pairs.
24. Studying User Footprints in Different Online Social Networks
Feature Analysis - Box Plots
Figure: Box plots for the Description, Connections and Image features.
Panels: description jaccard, description tf-idf, description ontology,
connections class, image psnr, image levenshtein. Each panel compares
Match and Non Match pairs.
25. Studying User Footprints in Different Online Social Networks
Matching User Profiles
The most promising similarity metrics and features were used
to train classifiers for the task of detecting profiles that belong
to the same user
Similarity Vector: E.g.: <userid_jw, desc_jaccard, loc_geo>
Training Set:
Positive Examples: Similarity vectors for the account pairs of
the dataset
Negative Examples: An equal number of negative examples
synthesized by randomly pairing accounts from different users
and calculating their similarity vectors
A total of 58,258 training instances
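The construction of the training set can be sketched as below. With 29,129 linked users, one positive and one synthesized negative example per user gives 29,129 + 29,129 = 58,258 instances, which matches the total stated above. The function build_training_set, the toy profiles and the one-component similarity function are hypothetical stand-ins for the real profile data and metrics.

```python
import random

def build_training_set(linked_pairs, similarity_vector, seed=0):
    """Build labelled similarity vectors from known same-user account pairs.

    linked_pairs: list of (twitter_profile, linkedin_profile) for one user each.
    Returns a list of (vector, label) with label 1 = Match, 0 = Non Match.
    """
    if len(linked_pairs) < 2:
        raise ValueError("need at least two users to synthesize negatives")
    rng = random.Random(seed)
    positives = [(similarity_vector(t, l), 1) for t, l in linked_pairs]
    negatives = []
    n = len(linked_pairs)
    while len(negatives) < len(positives):
        i, j = rng.randrange(n), rng.randrange(n)
        if i == j:
            continue  # both accounts belong to the same user: not a negative
        twitter_profile = linked_pairs[i][0]
        linkedin_profile = linked_pairs[j][1]
        negatives.append((similarity_vector(twitter_profile, linkedin_profile), 0))
    return positives + negatives

# Toy data: each profile is a dict; the single feature is exact name equality.
toy_pairs = [
    ({"name": "ana"}, {"name": "ana"}),
    ({"name": "bob"}, {"name": "bob"}),
    ({"name": "cy"}, {"name": "cy"}),
]
toy_similarity = lambda a, b: (1.0 if a["name"] == b["name"] else 0.0,)
training_set = build_training_set(toy_pairs, toy_similarity)
print(len(training_set))
```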
26. Studying User Footprints in Different Online Social Networks
Matching User Profiles
After training, the classifiers were tested with Twitter-LinkedIn
profile pairs to be classified as a “Match” or a “Non Match”
A “Match” means that the two given input profiles belong to
the same user, while “Non Match” means they don’t
Classifiers used:
Naïve Bayes
Decision Tree
SVM
kNN
27. Studying User Footprints in Different Online Social Networks
Matching User Profiles
Results were generated for all possible combinations of
profile features and similarity metrics using 10-fold cross
validation.
As shown below, we achieved accuracy, precision and recall of
98%, 99% and 96% respectively for the best feature set
Classifier      Accuracy  Precision  Recall  F1
Naïve Bayes     0.980     0.996      0.964   0.980
Decision Tree   0.965     0.994      0.936   0.964
SVM             0.972     0.988      0.956   0.971
kNN             0.898     0.998      0.798   0.887
Table: Results for multiple classifiers using the feature set
{name_jw, userid_jw, loc_geo, desc_tfidf, img_ls, conn_norm}.
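As a consistency check on the reported results, F1 is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the rounded precision and recall values in the results table; small deviations in the third decimal are expected because the inputs are themselves rounded.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) per classifier, from the results table.
reported = {
    "Naive Bayes": (0.996, 0.964, 0.980),
    "Decision Tree": (0.994, 0.936, 0.964),
    "SVM": (0.988, 0.956, 0.971),
    "kNN": (0.998, 0.798, 0.887),
}
recomputed = {name: f1_score(p, r) for name, (p, r, _) in reported.items()}
for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.4f}, reported {reported[name][2]:.3f}")
```

The kNN row illustrates why F1 is worth reporting: its precision is the highest of the four classifiers, but its much lower recall pulls F1 down.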
28. Studying User Footprints in Different Online Social Networks
Discovering New User Profiles
The results are very good on a static dataset, but what
if we don’t have candidates to match to a known user
profile?
A system was developed for retrieving profile candidates of
possible matches for a known account from some other
service.
A part (one fifth) of the true positive data was reserved to be
the testing set
A Naïve Bayes classifier was trained with the remaining set
and was modified to return the probability that the two input
profiles belong to the same user
29. Studying User Footprints in Different Online Social Networks
Discovering Candidate User Profiles
We query Twitter’s API using LinkedIn’s display name for
each profile pair from the testing dataset
For each of the profiles returned from the Twitter API, we
compute the similarity vector with the LinkedIn profile of the
user
We then use the trained classifier to return the probability
that each of these profiles belongs to the same user
We rank the Twitter profiles in decreasing order of their
probabilities
Ideally the correct Twitter profile (of the profile pair from the
testing set) should be at the top of this ranking
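The ranking step and its evaluation can be sketched as follows. The candidate identifiers and probabilities are made up, and the probability function stands in for the trained classifier; found_within computes the quantity evaluated for this experiment, namely the share of test users whose correct profile appears at rank r or better.

```python
def rank_of_correct(candidates, probability, correct_id):
    """Return the 1-based rank of correct_id, or None if it was not retrieved."""
    ranked = sorted(candidates, key=probability, reverse=True)
    for position, candidate in enumerate(ranked, start=1):
        if candidate == correct_id:
            return position
    return None

def found_within(ranks, r):
    """Fraction of queries whose correct profile is ranked at position <= r."""
    hits = sum(1 for rank in ranks if rank is not None and rank <= r)
    return hits / len(ranks)

# Made-up classifier probabilities for the candidates returned for one query.
scores = {"alice_t": 0.91, "alice2": 0.40, "al_ice": 0.75}
rank = rank_of_correct(list(scores), scores.get, "al_ice")

# Made-up ranks for four test users (None: correct profile never returned).
ranks = [1, 2, None, 4]
print(rank, found_within(ranks, 1), found_within(ranks, 3))
```

Queries for which the search never returns the correct profile count as misses at every rank, which caps the achievable percentage.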
30. Studying User Footprints in Different Online Social Networks
Discovering Candidate User Profiles
Figure: Relation between the position r in the rank (Rank position,
axis from 5 to 20) and the percentage of times the right profile is
found at a position lower than or equal to r (Profiles Found (%),
axis from 40 to 80), for All features and Best Features.
31. Studying User Footprints in Different Online Social Networks
Discovering Candidate User Profiles
In 64% of the cases the right profile was found in the first
position of the rank when using all features, while this value
was 49% for the set of the best features
This means that using all features instead of only the best can
help to disambiguate between the possible candidates.
In 75% of the cases the right profile was in the first 3
positions of the rank
This suggests the system can be used in a semi-supervised
manner
32. Studying User Footprints in Different Online Social Networks
Conclusions & Results
Applied automated techniques to identify accounts belonging
to the same user in different online services
Only publicly available information was extracted and used
Proposed and evaluated multiple similarity metrics, comparing
their discriminative capacity for the task of profile linking
UserID and Name, when compared using the Jaro-Winkler
metric, were the most discriminative features
33. Studying User Footprints in Different Online Social Networks
Conclusions & Results
For the best set of features and similarity metrics we achieved
accuracy, precision and recall of 98%, 99% and 96%
respectively
Evaluated the system’s performance for discovering a
user’s profile on Twitter given their display name on LinkedIn
Using all features instead of only the most discriminative ones
gave better results
The system may be used to match user profiles automatically
with 64% accuracy, or in a semi-supervised manner to narrow
down candidate profiles
34. Studying User Footprints in Different Online Social Networks
Future Work
Incorporate more profile fields
Generalize our model to include other social networks
Adapt our system to handle missing and incorrect profile
attributes
35. Studying User Footprints in Different Online Social Networks
For any further information, please write to
pk@iiitd.ac.in
precog.iiitd.edu.in
36. Studying User Footprints in Different Online Social Networks
Bibliography I
[1] Carmagnola, F., and Cena, F.
User identification for cross-system personalisation.
Inf. Sci. 179, 1-2 (Jan. 2009), 16–32.
[2] Carmagnola, F., Osborne, F., and Torre, I.
Cross-systems identification of users in the social web.
In 8th IADIS Int. Conf. WWW/INTERNET, Rome, Italy (2009), pp. 129–134.
[3] Carmagnola, F., Osborne, F., and Torre, I.
User data distributed on the social web: how to identify users on different social systems and collecting data about them.
In Proceedings of the 1st International Workshop on Information Heterogeneity and Fusion in Recommender Systems (New York, NY, USA, 2010), HetRec ’10, ACM, pp. 9–15.
[4] Golbeck, J., and Rothstein, M.
Linking social networks on the web with FOAF: a semantic web case study.
AAAI’08, pp. 1138–1143.
[5] Iofciu, T., Fankhauser, P., Abel, F., and Bischoff, K.
Identifying users across social tagging systems, 2011.
37. Studying User Footprints in Different Online Social Networks
Bibliography II
[6] Irani, D., Webb, S., Li, K., and Pu, C.
Large online social footprints: an emerging threat.
In CSE ’09 (Aug. 2009), vol. 3, pp. 271–276.
[7] Kontaxis, G., Polakis, I., Ioannidis, S., and Markatos, E.
Detecting social network profile cloning.
In PerCom (March 2011), pp. 295–300.
[8] Perito, D., Castelluccia, C., Kaafar, M. A., and Manils, P.
How unique and traceable are usernames?
In PETS (2011), pp. 1–17.
[9] Rowe, M.
Applying semantic social graphs to disambiguate identity references.
In 6th Annual European Semantic Web Conference (ESWC2009) (June 2009), pp. 461–475.
[10] Rowe, M.
Interlinking distributed social graphs.
In LDOW2009 (April, Spring 2009).
38. Studying User Footprints in Different Online Social Networks
Bibliography III
[11] Rowe, M., and Ciravegna, F.
Harnessing the social web: The science of identity disambiguation.
In Web Science Conference (2010).
[12] Szomszor, M., Alani, H., Cantador, I., O’Hara, K., and Shadbolt, N.
Semantic modelling of user interests based on cross-folksonomy analysis.
ISWC ’08, pp. 632–648.
[13] Szomszor, M. N., Cantador, I., and Alani, H.
Correlating user profiles from multiple folksonomies.
HT ’08, pp. 33–42.
[14] Vosecky, J., Hong, D., and Shen, V. Y.
User identification across multiple social networks.
In NDT’09 (July 2009).
[15] Winkler, W. E.
String comparator metrics and enhanced decision rules in the Fellegi-Sunter model of record linkage.
In Proceedings of the Section on Survey Research (1990), pp. 354–359.
39. Studying User Footprints in Different Online Social Networks
Bibliography IV
[16] Wu, Z., and Palmer, M.
Verbs semantics and lexical selection.
In Proceedings of the 32nd annual meeting on Association for Computational Linguistics (Stroudsburg, PA, USA, 1994), ACL ’94, Association for Computational Linguistics, pp. 133–138.
[17] Zafarani, R., and Liu, H.
Connecting corresponding identities across communities, 2009.