Spatio-temporal linkage of real and virtual identity
1. Spatio-temporal linkage of real and virtual identity
Muhammad Adnan (and Paul Longley)
University College London
2. Geodemographics
• “Analysis of people by where they live [places]”
(Sleight, 1993:3)
• Social similarity, not locational proximity
[Diagram labels: Person, Home Address, Area]
4. Identity of individuals in the real world
• Name (Forename & Surname)
• Surnames have geographic concentrations
• Prospects for linkage with socio-economic data
• E.g. Analysing the socio-economic circumstances of
different ethnic groups
5. An example – gbnames.publicprofiler.org
[Figure panels: Longley; Cheshire]
6. An example – Output Area Classification
[Figure panels: Kingston upon Hull; Hereford]
14. The European scale
16 countries.
400 million people.
5.95 million unique
surnames
Courtesy: James Cheshire
15. Onomap classification
Forename-Surname clustering
(based on Hanks and Tucker, 2000)
UK Electoral Roll
[Diagram: forenames linked to surnames]
Forenames: Pablo, Juan, Rosa, Marta, ...
Surnames: Mateos, Garcia, Pérez, Sánchez, Rodríguez, ...
– Several iterations until self-contained cluster is exhausted
– Cluster assigned a cultural, ethnic & linguistic Onomap type
– Probability of ethnicity assigned to each name
Mateos et al (2007) CASA Working Paper 116
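A minimal sketch of the iterative expansion described above, not the Onomap implementation itself. It assumes the input is a list of (forename, surname) pairs and treats a cluster as everything reachable from one seed surname. The function name `name_cluster` and the sample pairs are hypothetical, and the real method presumably applies thresholds and assigns probabilities that this sketch leaves out.

```python
from collections import defaultdict


def name_cluster(pairs, seed_surname):
    """Grow a forename-surname cluster outwards from one seed surname."""
    forenames_by_surname = defaultdict(set)
    surnames_by_forename = defaultdict(set)
    for forename, surname in pairs:
        forenames_by_surname[surname].add(forename)
        surnames_by_forename[forename].add(surname)

    forenames, surnames = set(), {seed_surname}
    frontier = {seed_surname}
    # Each pass: surnames -> their forenames -> those forenames' surnames.
    # Stop when a pass adds no new surname (the cluster is exhausted).
    while frontier:
        new_forenames = set()
        for surname in frontier:
            new_forenames |= forenames_by_surname[surname]
        new_forenames -= forenames
        forenames |= new_forenames

        new_surnames = set()
        for forename in new_forenames:
            new_surnames |= surnames_by_forename[forename]
        new_surnames -= surnames
        surnames |= new_surnames
        frontier = new_surnames
    return forenames, surnames


# Hypothetical sample records, for illustration only.
sample_pairs = [
    ("Pablo", "Mateos"), ("Pablo", "Garcia"),
    ("Juan", "Garcia"), ("Juan", "Pérez"),
    ("Rosa", "Pérez"), ("Marta", "Pérez"), ("Marta", "Sánchez"),
    ("John", "Smith"),
]
forenames, surnames = name_cluster(sample_pairs, "Mateos")
```

In this toy data the pair ("John", "Smith") shares no name with the seeded cluster, so it stays outside it.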
19. Uncertainty and virtual identity
• Identity increasingly shaped by online activities
– Value may be gained by fusing physical and virtual data sources
• Data fusion and generalisation to relate physical
and virtual properties
• Use of residence alongside activity patterns and
social network information
20. Most of us have virtual identities
• Email address; social media accounts
• People use different procedures and providers to
establish virtual identities
• Harvesting these data has interesting potential
applications
• Cyber crime
• Cyber geodemographics (Facebook has already started
this)
21. Most of us have virtual identities
• Facebook data mining engine
• Analyses the words you use and tailors advertisement
accordingly
23. Starting Point
http://worldnames.publicprofiler.org
• Worldnames has been archiving ‘Surname search’,
‘Email Address’, ‘Gender’, and ‘IP Address’ for
searches over the past 6 months
• c. 175,000 records: email validation
• 150,000 usable ‘IP Address’ entries
24. IP Address to Latitude/Longitude conversion
http://quova.com
An API to convert “IP addresses” to their corresponding
latitude / longitude values
25. IP Address to Latitude/Longitude conversion
http://quova.com
A search for an IP Address in UCL (128.40.214.196)
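A minimal sketch of what an IP-to-coordinates lookup does in principle. It does not use the Quova API, whose interface is not shown in the slides. It assumes a local table that maps network blocks to coordinates. The table contents (the block 128.40.0.0/16 and the approximate central London coordinates) and the names `GEO_TABLE` and `ip_to_latlon` are illustrative assumptions.

```python
import ipaddress

# Hypothetical lookup table: network block -> (latitude, longitude).
# The block and coordinates are assumed values for illustration.
GEO_TABLE = {
    ipaddress.ip_network("128.40.0.0/16"): (51.5246, -0.1340),
}


def ip_to_latlon(ip, table=GEO_TABLE):
    """Return coordinates of the most specific block containing ip, or None."""
    address = ipaddress.ip_address(ip)
    best = None
    for network, coords in table.items():
        if address in network:
            # Prefer the longest prefix, i.e. the smallest matching block.
            if best is None or network.prefixlen > best[0].prefixlen:
                best = (network, coords)
    return best[1] if best else None


ucl_location = ip_to_latlon("128.40.214.196")
```

Addresses outside every block in the table return None, which is one source of the unusable entries that any geocoding step of this kind has to handle.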
26. Top Countries
Website was searched from 155 countries over the past 6 months
Top 20 countries (count):
UNITED STATES 76708
UNITED KINGDOM 21892
CANADA 8154
GERMANY 7158
ITALY 4058
AUSTRALIA 2978
BRAZIL 2440
FRANCE 2028
ARGENTINA 1958
SPAIN 1830
NEW ZEALAND 1236
NETHERLANDS 1074
GREECE 1040
SWITZERLAND 992
BELGIUM 940
POLAND 880
AUSTRIA 874
MEXICO 834
IRELAND 710
SWEDEN 630
32. Popular Surname Searches
Top 20 surnames (count):
SMITH 708
JONES 306
JOHNSON 258
ANDERSON 224
WILLIAMS 222
MILLER 218
MARTIN 202
WILSON 194
BROWN 194
MOORE 188
THOMAS 178
TAYLOR 170
CLARK 164
LEE 160
ROBERTS 156
DAVIS 152
CAMPBELL 144
LEWIS 138
HARRIS 138
MITCHELL 136
40. Who uses their surname as part of their email address?
• Approximately 40% of the users have their surname
as part of their email address
• abbie.harper@hotmail.com (Surname: Harper)
• helmut.kempe@inode.at (Surname: Kempe)
• Top Countries [bar chart; country labels and values not recoverable]
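A minimal sketch of one way to test whether a surname appears in an email address, assuming a simple case-insensitive substring match on the part before the "@". The function names are hypothetical, and the slides do not state which matching rule was used.

```python
def surname_in_email(email, surname):
    """True if the surname occurs in the local part (before '@')."""
    local_part = email.split("@", 1)[0]
    return surname.lower() in local_part.lower()


def share_with_surname(records):
    """Fraction of (email, surname) records where the surname is found."""
    if not records:
        return 0.0
    hits = sum(surname_in_email(email, surname) for email, surname in records)
    return hits / len(records)


examples = [
    ("abbie.harper@hotmail.com", "Harper"),
    ("helmut.kempe@inode.at", "Kempe"),
]
example_share = share_with_surname(examples)
```

A substring match will over-count very short surnames (for example "Lee" inside "kathleen"), so a stricter rule may be needed in practice.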
41. Who uses long email addresses?
• Grand mean email length: 8 characters
• Length = number of characters to the left of ‘@’
• United Kingdom, USA, Canada, and other European countries
• People from South American countries and India have long
email addresses (Average length: 13 characters)
BRAZIL ANA.ARAUJO3909@CREASP.ORG.BR (14 characters)
CHILE BYRON.DELGADO.INOSTROZA@HOTMAIL.COM (23 characters)
URUGUAY DIEGOJAVIERZEBALLOS@GMAIL.COM (19 characters)
INDIA GANGULYDEEPANJAN@HOTMAIL.COM (16 characters)
ARGENTINA AGUSTINAREYNOZO@GMAIL.COM (15 characters)
• South Indians have longer email addresses than North Indians
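A minimal sketch of the length measure used on this slide, assuming length means the number of characters before the "@". The function names are hypothetical, and the second address in the sample is made up for illustration.

```python
from statistics import mean


def local_part_length(email):
    """Number of characters to the left of '@'."""
    return len(email.split("@", 1)[0])


def mean_local_part_length(emails):
    """Average local-part length over a collection of addresses."""
    return mean(local_part_length(email) for email in emails)


sample = ["ANA.ARAUJO3909@CREASP.ORG.BR", "a.b@example.com"]
sample_mean = mean_local_part_length(sample)
```

Grouping the same measure by the country inferred from the IP address would give the per-country averages discussed above.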
42. What else we can infer from email addresses
• Internet service provider
• A.GOODEVE@AOL.COM
• BERRYMANL@BTINTERNET.COM
• CARL@VALLEYWISP.NET (Person lives in a rural area of northeast Oregon)
• Country of origin
• A.HAKIM26@YAHOO.FR
• CBARNES@MEDIAWORKS.CO.NZ
• Probable temporal aspects
• ABBY527@OPTONLINE.NET
• BERZINSKY102@YAHOO.COM
• C.JOHNSTON2@BTINTERNET.COM
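A minimal sketch of how the three kinds of clue on this slide might be pulled out of an address. It assumes that the domain stands in for the provider, that a two-letter top-level domain is a country code, and that the "temporal" clue is the run of digits that ends the local part in each example above. What those digits mean is not stated in the slides. The function name and dictionary keys are hypothetical.

```python
import re


def describe_email(email):
    """Extract the domain, a country-code TLD, and trailing digits, if any."""
    local_part, domain = email.lower().split("@", 1)
    top_level = domain.split(".")[-1]
    digits = re.search(r"(\d+)$", local_part)
    return {
        "domain": domain,
        # Two-letter top-level domains are country codes (e.g. fr, nz).
        "country_code": top_level if len(top_level) == 2 else None,
        "trailing_digits": digits.group(1) if digits else None,
    }


info = describe_email("A.HAKIM26@YAHOO.FR")
```

A country code in the domain indicates where the provider is registered, which is only a probable indication of the user's own country.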
43. What else we can infer from email addresses
• Probable forename of a person
• BEVERLY.RICHARDS@YAHOO.COM
• BJORN.SOBRY@HOTMAIL.COM
• BRANDAN.HOLMES@HOTMAIL.COM
• How up to date someone is with technology
• ALEXANDER.BREUSCH@GMAIL.COM
• WILLIAM.NEALON@GOOGLEMAIL.COM
• Professional Affiliations
• CHRIS@IEEE.ORG
44. What else we can infer from email addresses
• Work Locations
• DOUG.GOODMAN@FOUNDATION.ORG.UK
• GRL@KCS.ORG.UK
• ERM43@CAM.AC.UK
• Studying
• RTRIPOLI@STUDENT.UMASS.EDU
• CBALIN01@STUDENTS.BBK.AC.UK
• KATHERINE.LITTEN@STUDENT.KIRKWOOD.EDU
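A minimal sketch of how work and study clues might be read from the domain, assuming a few simple label rules: "student" or "students" as a domain label, ".edu" or ".ac.<country>" for academic institutions, and ".org" or ".org.<country>" for organisations. The rules and the function name are illustrative assumptions, not the authors' method.

```python
def affiliation_hints(email):
    """Return a sorted list of hints derived from the email domain labels."""
    labels = email.lower().split("@", 1)[1].split(".")
    hints = set()
    if "student" in labels or "students" in labels:
        hints.add("student")
    # Academic: *.edu, or *.ac.<country code> such as cam.ac.uk.
    if labels[-1] == "edu" or (len(labels) >= 3 and labels[-2] == "ac"):
        hints.add("academic")
    # Organisation: *.org, or *.org.<country code> such as kcs.org.uk.
    if labels[-1] == "org" or (len(labels) >= 3 and labels[-2] == "org"):
        hints.add("organisation")
    return sorted(hints)


student_hints = affiliation_hints("RTRIPOLI@STUDENT.UMASS.EDU")
```

Addresses at general-purpose providers carry none of these labels and return an empty list.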
45. Conclusion and future work
• There are some interesting patterns found in the study of
email addresses
• some problems (accuracy of geocoding techniques)
• Prospect of data linkage of data coded to unit postcode level
• cluster analysis and data mining techniques
• Future work may involve the data mining of Facebook and
Twitter data
• issues of generalisation
• Visualisation of the data
47. A research agenda
1 Acquire relevant real and virtual data sources and devise DBMS
2 Devise GB-wide classification of NICT usage at neighbourhood
scale
3 Devise GB-wide classification of social network traffic
4 Develop enhanced worldnames site to harvest real and virtual
user data
5 Undertake text analysis of worldnames user data and use to link
classifications (2) and (3)
6 Devise, implement and analyse social networking application and
cybergeodemographic classification