People manage a spectrum of identities in cyber domains. Profiling individuals and assigning them to distinct groups or classes has potential applications in targeted services, online fraud detection, extensive social sorting, and cyber-security. This talk presents the Uncertainty of Identity Toolset, a framework for the identification and profiling of users from their social media accounts and e-mail addresses. More specifically, we discuss the design and implementation of two tools of the framework. The Twitter Geographic Profiler builds a map of the ethno-cultural communities of a person's friends on the Twitter social media service. The E-mail Address Profiler identifies the probable identities of individuals from their e-mail addresses and maps their geographical distribution across the UK. Together, these tools form a framework for profiling the digital traces of individuals.
Using Digital Traces for User Profiling: the Uncertainty of Identity Toolset
1. Using Digital Traces for User Profiling: the Uncertainty
of Identity Toolset
Muhammad Adnan (1), Antonio Lima (2), Luca Rossi (2), Suresh Veluru (3), Paul
Longley (1), Mirco Musolesi (2), Muttukrishnan Rajarajan (3)
(1) Department of Geography, University College London
(2) School of Computer Science, University of Birmingham
(3) School of Engineering and Mathematical Sciences, City University London
Web: www.uncertaintyofidentity.com
2. Introduction
• Recent years have witnessed rapid growth in the use of
online services
• Online shopping, bank transactions, social networking services
• Issues related to cyber-crime, identity fraud, and hacking
• This project aims to combine real and virtual world
datasets to better understand the identity of individuals
• Identities
• Real world (Name: Forename & Surname)
• Virtual world (E-mail addresses, social media accounts, etc.)
3. Introduction
• This paper presents a framework for the identification and
profiling of individuals from their
• Social media accounts
• E-mail addresses
• Twitter Geographic Profiler
• Maps ethno-cultural communities of a person’s friends
• E-mail Address Profiler
• Uses a database of family names to extract probable identities from
e-mail addresses
• Could have potential applications in targeted marketing and
online fraud detection
4. Outline
• Onomap
• A Name (Forename and Surname) classification system
• Twitter Geographic Profiler
• Extracting identities of Twitter users
• Mapping them to probable ethnic origins
• E-mail Address Profiler
• Extracting identities from E-mail addresses
• Geographic distribution
5. Onomap classification
• A name is a person’s ethnic, linguistic, and cultural identity
• A network of forename-surname pairs was created using
data from 26 different countries
• www.onomap.org
[Figure: network linking forenames (Pablo, Juan, Rosa, Marta, ...) to surnames (Mateos, Garcia, Pérez, Sánchez, Rodríguez, ...), shown for the name Pablo Mateos]
9. Twitter Geographic Profiler
• Given an individual’s Twitter Username or ID
• Extracts information about the individual's friends
• Extracts the forename-surname pairs of the friends
• Maps the forename-surname pairs to Onomap
• Builds an ethno-cultural profile of the person's friends
• Maps the geographic distribution
10. Data available through the Twitter API
• User ID
• User Creation Date
• Followers
• Friends
• Language
• Location
• Name
• Screen Name or User Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
11. Twitter: getting the ids and usernames
• Given a Twitter username of a person, we use the Twitter
API to get the list of friends’ ids
– A max of 15 requests every 15 minutes is allowed
– Each query can get up to 5000 ids
– Generally enough to download all the ids
• Using the ids, we fetch the name associated with each id
– Limited to 180 requests every 15 min
– Returns a single string from which we need to extract the name
and surname tokens
– Not necessarily a valid forename + surname!
• E.g., “University of Birmingham”, “John1965”, “What is Love”,
“Mystic_mind”
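The download loop described above can be sketched as follows. This is only an illustration under stated assumptions: `fetch_page` is a hypothetical stand-in for the real Twitter API call (it is not part of any Twitter client library), and the cursor convention used here (None for the first and after the last page) is invented for the sketch. Only the limits of 15 requests per 15 minutes and 5000 ids per request come from the slide.

```python
import time
from typing import Callable, List, Optional, Tuple

# Limits quoted on the slide for the friends' ids request.
MAX_REQUESTS_PER_WINDOW = 15
WINDOW_SECONDS = 15 * 60
MAX_IDS_PER_REQUEST = 5000

# Hypothetical page fetcher: takes a cursor (None for the first page) and
# returns (ids_on_this_page, next_cursor); next_cursor is None on the last page.
PageFetcher = Callable[[Optional[int]], Tuple[List[int], Optional[int]]]


def collect_friend_ids(
    fetch_page: PageFetcher,
    max_requests: int = MAX_REQUESTS_PER_WINDOW,
    window_seconds: int = WINDOW_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[int]:
    """Page through a friend list, pausing once the request budget is spent."""
    ids: List[int] = []
    cursor: Optional[int] = None
    requests_in_window = 0
    while True:
        if requests_in_window >= max_requests:
            # Budget exhausted: wait for a full window before continuing.
            sleep(window_seconds)
            requests_in_window = 0
        page, cursor = fetch_page(cursor)
        requests_in_window += 1
        ids.extend(page)
        if cursor is None:
            return ids


def make_fake_fetcher(all_ids: List[int], page_size: int = MAX_IDS_PER_REQUEST) -> PageFetcher:
    """Toy in-memory fetcher used only to exercise the loop above."""
    def fetch_page(cursor: Optional[int]) -> Tuple[List[int], Optional[int]]:
        start = cursor or 0
        end = start + page_size
        next_cursor = end if end < len(all_ids) else None
        return all_ids[start:end], next_cursor
    return fetch_page
```

With these limits, one 15-minute window covers up to 15 × 5000 = 75,000 friend ids, which is consistent with the remark that the budget is generally enough to download all the ids.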
12. Twitter: getting forename-surname pairs
• The name field was divided into tokens
• Forenames and surnames were detected by matching the
string tokens against the database of forename-surname
pairs from 26 countries
• Users were discarded
– where the tokens did not match a valid forename and
surname
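A minimal sketch of this extraction step is given below. It assumes whitespace tokenisation, takes the first token as the forename and the last as the surname, and rejects single-token names, as described elsewhere in this deck. The name sets are toy placeholders standing in for the 26-country database, and checking the two tokens separately against a forename set and a surname set is an assumption of this sketch, not necessarily how the tool performs the match.

```python
from typing import Optional, Set, Tuple


def extract_name_pair(
    display_name: str,
    forenames: Set[str],
    surnames: Set[str],
) -> Optional[Tuple[str, str]]:
    """Return (forename, surname) in lower case, or None if the name is rejected."""
    tokens = display_name.split()
    if len(tokens) < 2:
        # A single token cannot hold a forename + surname pair.
        return None
    forename = tokens[0].casefold()
    surname = tokens[-1].casefold()
    if forename in forenames and surname in surnames:
        return forename, surname
    return None


# Toy placeholder databases (hypothetical, for illustration only).
TOY_FORENAMES = {"pablo", "kevin"}
TOY_SURNAMES = {"mateos", "hodge"}
```

For example, `extract_name_pair("Pablo Mateos", TOY_FORENAMES, TOY_SURNAMES)` yields a pair, while strings such as “John1965” or “University of Birmingham” are discarded.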
13. Onomap: from names to ethnicity
• ONOMAP (www.onomap.org) was applied to the forename-surname
pairs
Kevin Hodge (English)
Pablo Mateos (Spanish)
…
14. Friends’ Ethnicity Histogram
Figure 1: Screenshot of the Twitter Geographic Profiler. The
bottom part of the screen shows the histogram of the ethno-cultural
groups of the Twitter user's friends.
Once the entire list of friends' forename + surname pairs has been parsed, we can
easily estimate the distribution of the Twitter user's friends over the set of
possible ethno-cultural groups.
[…] actually follow a limited number of profiles, which are then
accessible even with the rate limitation in place.
With the list of (surname, forename) pairs to hand, we query
Onomap to get the ethno-cultural classification associated with
each (surname, forename) pair, and the
SearchSurnameTopCountries method to get the list of the
countries where an instance of a given surname was observed.
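The histogram estimate can be illustrated with a short sketch. The `classify` callable is a hypothetical stand-in for the Onomap query; its name and signature are assumptions, not the Onomap interface, and the lookup table below is a toy built from the two example names on the earlier slide. Pairs are passed as (forename, surname) here purely for readability.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Optional, Tuple

NamePair = Tuple[str, str]  # (forename, surname)


def ethnicity_distribution(
    pairs: Iterable[NamePair],
    classify: Callable[[str, str], Optional[str]],
) -> Dict[str, float]:
    """Return the share of friends falling into each ethno-cultural group."""
    counts: Counter = Counter()
    for forename, surname in pairs:
        group = classify(forename, surname)
        if group is not None:  # skip pairs the classifier cannot place
            counts[group] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {group: count / total for group, count in counts.items()}


# Toy classifier (hypothetical): a fixed lookup instead of a real Onomap call.
TOY_CLASSIFICATION = {
    ("kevin", "hodge"): "English",
    ("pablo", "mateos"): "Spanish",
}


def toy_classify(forename: str, surname: str) -> Optional[str]:
    return TOY_CLASSIFICATION.get((forename.casefold(), surname.casefold()))
```

Unclassified pairs are left out of the denominator in this sketch; whether the tool treats them that way is not stated in the slides.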
15. Friends’ Geographic Origins
Figure 2: Map showing the geographic origin of the Twitter user's friends’ surnames as
assigned by our tool. Below the map the user is shown a list of the top 10
countries with the respective frequency.
[…] pair among the extracted tokens. In this work we mark as invalid
any string that is composed of a single token. If this is the case,
we skip the profile of the corresponding friend.
If the string contains two or more tokens, we take the first one to
be the forename and the last one to be the surname. Moreover,
when a (surname, forename) pair is sent to Onomap, an error […]
16. Twitter Geographic Profiler
• Potential applications include
– Measure the level of segregation/integration of a given individual
(community) as the Shannon entropy of the (average) friends’
ethnicity histogram
– Outlier detection: identify uncommon behaviors, e.g., individuals
who stand out in terms of the ethno-cultural groups they bond with
• Limitations
– Twitter data is very noisy
– We need a better heuristic to extract forename + surname
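The segregation/integration measure mentioned above can be sketched as the Shannon entropy H = -Σ p_i log2(p_i) of the friends' ethnicity histogram, where p_i is the share of friends in group i. The base-2 logarithm is an assumption of this sketch; the slides do not specify the base. Under this reading, an entropy of 0 means all friends fall into one group, and higher values indicate a more mixed set of friends.

```python
import math
from typing import Mapping


def shannon_entropy(histogram: Mapping[str, float]) -> float:
    """Shannon entropy (in bits) of a histogram of counts or weights."""
    total = sum(histogram.values())
    if total <= 0:
        return 0.0
    entropy = 0.0
    for count in histogram.values():
        if count > 0:  # empty groups contribute nothing
            p = count / total
            entropy -= p * math.log2(p)
    return entropy
```

For instance, a histogram with two equally sized groups gives 1 bit, and four equally sized groups give 2 bits.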
18. E-mail address profiler
• In many instances, an e-mail address encapsulates some
kind of identity information
– Forename or surname
• This tool
– Extracts identities of individuals from their e-mail addresses
– Maps the geographical distribution of a surname in the UK
• The tool identifies a surname or forename as a substring of an
e-mail address
• The tool builds a suffix tree of an e-mail address and searches
for probable identities
19. An example suffix tree
Suffix tree for the name aamalam$. The surname contained in this name is alam$,
which is shown at a leaf node
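To make the example concrete, the sketch below builds a suffix trie for aamalam$ using nested dictionaries. A suffix trie is the uncompressed form of a suffix tree (one character per edge), chosen here only for brevity; it is not claimed to be the data structure used in the tool. The function names are illustrative.

```python
from typing import Dict


def build_suffix_trie(text: str) -> Dict[str, dict]:
    """Insert every suffix of text into a nested-dict trie, one character per edge."""
    root: Dict[str, dict] = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
    return root


def contains(trie: Dict[str, dict], pattern: str) -> bool:
    """True if pattern is a substring of the text the trie was built from."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


def ends_at_leaf(trie: Dict[str, dict], pattern: str) -> bool:
    """True if following pattern from the root ends at a node with no children."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return not node
```

Building the trie for `"aamalam$"` and following the path `alam$` ends at a leaf, which matches the caption: the surname is a suffix of the name.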
20. Surname matching algorithm
• The surname matching algorithm constructs a suffix tree for an
e-mail address
• It uses a database of surnames and forenames and matches
them
– against each substring in the suffix tree
• A probable identity is any substring that matches a surname or
forename in the database
• We use a database of the 10,000 most common surnames
in the UK
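The matching step can be approximated with the sketch below. It deliberately uses Python's built-in substring test in place of a suffix tree, which gives the same matches on short strings but is not the tool's implementation. Restricting the search to the part before the `@`, the `min_length` threshold, and the longest-first ordering are all assumptions added for this illustration, and the name sets are toy placeholders for the surname and forename databases.

```python
from typing import Iterable, List, Tuple


def probable_identities(
    email: str,
    surnames: Iterable[str],
    forenames: Iterable[str],
    min_length: int = 3,  # assumed threshold to avoid very short accidental matches
) -> List[Tuple[str, str]]:
    """Return (kind, name) matches found in the local part of an e-mail address."""
    local_part = email.split("@", 1)[0].casefold()
    matches: List[Tuple[str, str]] = []
    for kind, names in (("surname", surnames), ("forename", forenames)):
        for name in names:
            candidate = name.casefold()
            if len(candidate) >= min_length and candidate in local_part:
                matches.append((kind, candidate))
    # Longest matches first; ties broken alphabetically for a stable order.
    matches.sort(key=lambda match: (-len(match[1]), match[0], match[1]))
    return matches
```

A call such as `probable_identities("aamalam@example.com", {"alam", "smith"}, {"amal"})` reports both `alam` and `amal` as candidates, which shows why the output is described as probable identities rather than a single answer.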
21. E-mail Address Profiler: geographic distribution
• 2007 Electoral Register
– Name and Address of every individual who is eligible to vote in
the UK
• Every postcode in the Electoral Register was converted
to latitude/longitude values
• The tool maps all the latitude/longitude values for a particular
surname geographically
• Onomap is used to identify the probable ethnic origin of
a surname
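The lookup behind the map can be sketched as a join between register records and a postcode-to-coordinate table. Everything below is illustrative: the record layout, the function name, and the postcode labels and coordinates are toy placeholders, not data from the Electoral Register.

```python
from typing import Dict, Iterable, List, Tuple

Coordinate = Tuple[float, float]  # (latitude, longitude)


def surname_locations(
    register: Iterable[Tuple[str, str]],          # (surname, postcode) records
    postcode_coordinates: Dict[str, Coordinate],  # postcode -> (lat, lon)
    surname: str,
) -> List[Coordinate]:
    """Collect the coordinates of every register record bearing the given surname."""
    wanted = surname.casefold()
    points: List[Coordinate] = []
    for record_surname, postcode in register:
        if record_surname.casefold() != wanted:
            continue
        coordinate = postcode_coordinates.get(postcode)
        if coordinate is not None:  # skip postcodes with no known coordinates
            points.append(coordinate)
    return points


# Toy placeholder data (hypothetical labels and coordinates).
TOY_REGISTER = [("Alam", "PC-A"), ("Hodge", "PC-B"), ("alam", "PC-B"), ("Alam", "PC-UNKNOWN")]
TOY_COORDINATES = {"PC-A": (51.5, -0.1), "PC-B": (52.5, -1.9)}
```

The resulting list of points is what a mapping layer would plot to show where a surname is concentrated.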
24. Conclusion
• A toolkit for identity detection and profiling
• Identification and profiling of ethno-cultural characteristics of
individuals
• From social media accounts and e-mail addresses
• Future work will include
• The extension of the Twitter Geographic Profiler to other social media
services
• The extension of the E-mail Address Profiler to process a large corpus of
e-mail addresses
• A study of the privacy implications for social media services