An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles
  Mining Data Semantics Workshop 2011
  Carlton Northern, Old Dominion University
  8/25/2011
Background: Digital Preservation
  How are students using social media as a digital preservation strategy?
  Evaluating Personal Archiving Strategies for Internet-based Information - Marshall, McCown, Nelson
  http://www.cs.odu.edu/~mln/pubs/archiving-2007/eval-personal-arch-strat-archiving07.pdf
Goal
  Ascertain the set of social media profiles for ODU CS students.
What's out there already?
  Intelius
  Wink / MyLife
  Google
Requirements and Assumptions
  Approach must be automated: no human interaction except for a search query consisting of location, organization, and profession/education domain.
  Achieve precision of 0.7 or higher and f-measure of 0.5 or higher, comparable to a human performing the same activity.
  Must find profiles not indexed by search engines.
  Can use any means available, including search engines, page scraping, web service APIs, etc.
  Only publicly declared identities; do not expose obfuscated identities, e.g. "Bruce Wayne" -> "Batman".
  Find profiles from 25 pre-defined sites (next slide).
  Approach must be extensible, i.e. new social media sites can be added with minimal changes.
Social Media Sites
Approach
Algorithm
  Discovery Phase:
    Generate Usernames
    Check Sites for Profiles
    Check Rapportive
    Check Google and Yahoo
    Check Social Graph
    Remove Duplicates
  Disambiguation Phase:
    Assign Points for Keywords, Email, Me and Friend Links
  *Run multiple times
Discovery Phase
Starting Information
  Given:
    Full name, e.g. Carlton Northern
    CS username, e.g. cnorther
    CS email, e.g. cnorther@cs.odu.edu
    .forward files -> carlton.northern@gmail.com
    CS profile URI, e.g. http://www.cs.odu.edu/~cnorther
  Inferred:
    School affiliation, e.g. Old Dominion
    Approximate location, e.g. Norfolk, Hampton Roads
    Computer Science affiliation, e.g. software engineer
Username Generation
  Generate usernames from full name derivatives, e.g. for “Carlton Northern” we have:
    cnorthern
    northernc
    carlton.northern
    carlton_northern
    carlton-norther
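The derivation above can be sketched in Python. This is a minimal illustration, not the author's implementation: the function name `generate_usernames` is hypothetical, the hyphenated form is written out in full, and the sixth pattern (first and last name concatenated) is an assumption based on the "6 usernames" figure on the Poll Sites slide and the `carltonnorthern` handle in the "Me" Links example.

```python
def generate_usernames(full_name):
    """Return candidate usernames derived from a 'First Last' style name."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    candidates = [
        first[0] + last,      # cnorthern
        last + first[0],      # northernc
        first + "." + last,   # carlton.northern
        first + "_" + last,   # carlton_northern
        first + "-" + last,   # carlton-northern (written out in full here)
        first + last,         # carltonnorthern (assumed sixth pattern)
    ]
    # dict.fromkeys keeps insertion order while dropping duplicates
    return list(dict.fromkeys(candidates))


if __name__ == "__main__":
    for username in generate_usernames("Carlton Northern"):
        print(username)
```

Keeping the patterns in one list makes it straightforward to add or remove derivations without touching the rest of the pipeline.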
Poll Sites
  Issue an HTTP GET to determine whether a profile exists for a generated username.
  Create site templates for links:
    http://www.facebook.com/'username here'
    http://www.stumbleupon.com/stumbler/'username here'
    https://picasaweb.google.com/'username here'
  2016 students x 6 usernames x 25 sites = 302k requests
    GET http://www.facebook.com/carlton.northern HTTP/1.1
  If the response status is 200, the profile exists; otherwise it doesn't.
  Soft 404s can be somewhat problematic but can be handled.
  Some sites detect robots and present a CAPTCHA, which is also problematic.
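A rough sketch of the polling step, using only the Python standard library. The three templates are the ones listed on the slide; the names `candidate_urls`, `profile_exists` and `total_requests`, the `{username}` placeholder syntax, and the User-Agent string are assumptions made for illustration. Note that `urlopen` follows redirects, so this sketch does not detect soft 404s or CAPTCHA pages.

```python
import urllib.request

# Templates from the slide; "{username}" stands in for 'username here'.
SITE_TEMPLATES = [
    "http://www.facebook.com/{username}",
    "http://www.stumbleupon.com/stumbler/{username}",
    "https://picasaweb.google.com/{username}",
]


def candidate_urls(usernames, templates=SITE_TEMPLATES):
    """Expand every template with every generated username."""
    return [t.format(username=u) for t in templates for u in usernames]


def profile_exists(url, timeout=10):
    """Issue an HTTP GET and accept the profile only on a 200 response."""
    request = urllib.request.Request(
        url, headers={"User-Agent": "profile-poller-sketch"}  # assumed value
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:
        # HTTPError and URLError are OSError subclasses, as are timeouts.
        return False


def total_requests(students, usernames_per_student, sites):
    """Upper bound on the number of GET requests for one full pass."""
    return students * usernames_per_student * sites


if __name__ == "__main__":
    print(total_requests(2016, 6, 25))  # the slide rounds this to 302k
    print(candidate_urls(["carlton.northern"]))
```

Because each site is just a template string, adding a new site is a one-line change, which matches the extensibility requirement stated earlier.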
Google's Social Graph API
  Run existing profile URLs through the Google Social Graph API to find “me” links.
“Me” Links
  “me” links are links in Friend of a Friend (FOAF) and XHTML Friends Network (XFN) that specify the same identity.
  For example, a me link from my CS profile page to Twitter:
    <html>
      <head>
        <title>Carlton Northern's CS Home Page</title>
      </head>
      <body>
        stuff here ...
        <a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
      </body>
    </html>
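As an illustration of how such links could be pulled from a page without a third-party service, the sketch below uses Python's standard `html.parser` module to collect every anchor whose `rel` attribute contains `me`. The class and function names are hypothetical, and this is not a reimplementation of the Social Graph API.

```python
from html.parser import HTMLParser


class MeLinkParser(HTMLParser):
    """Collect href values of <a> tags whose rel attribute includes 'me'."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attributes = dict(attrs)
        # rel may hold several space-separated values, e.g. rel="me nofollow"
        rel_values = (attributes.get("rel") or "").lower().split()
        href = attributes.get("href")
        if "me" in rel_values and href:
            self.me_links.append(href)


def extract_me_links(html_text):
    parser = MeLinkParser()
    parser.feed(html_text)
    return parser.me_links


PAGE = """<html>
  <head><title>Carlton Northern's CS Home Page</title></head>
  <body>
    stuff here ...
    <a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a>
    <a href="http://www.cs.odu.edu/">Department</a>
  </body>
</html>"""

if __name__ == "__main__":
    print(extract_me_links(PAGE))
```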
Rapportive
  Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail.
  It uses AJAX and JSON to serve content to its Gmail widget.
  Mined .forward files on the CS departmental server; found only 24 email addresses out of 2016 students.
  Ran CS and non-CS email addresses through Rapportive's not-so-public API to access its results.
  Produced 15.9% of our truth set profile results, with only 1.6% being unique to Rapportive.
Google and Yahoo
  Query Google and Yahoo using their respective APIs:
    "carlton northern" AND norfolk
    "carlton northern" AND "computer science"
    "carlton northern" AND "old dominion"
    "carlton northern" site:http://www.facebook.com
  GeoNames could be used to derive nearby cities to automatically form search queries.
  The same could be done with WordNet to derive profession or education terms.
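The query patterns above could be generated along these lines. This sketch only builds the query strings; it does not call any search API, and the function name `build_queries` and its parameters are assumptions for illustration.

```python
def build_queries(full_name, context_terms, sites):
    """Build search-engine query strings for one person.

    context_terms: location, organization and profession/education terms.
    sites: site URLs to restrict the search to.
    """
    name = '"{}"'.format(full_name.lower())
    queries = []
    for term in context_terms:
        # Quote multi-word terms so they are matched as a phrase.
        quoted = '"{}"'.format(term) if " " in term else term
        queries.append("{} AND {}".format(name, quoted))
    for site in sites:
        queries.append("{} site:{}".format(name, site))
    return queries


if __name__ == "__main__":
    for query in build_queries(
        "Carlton Northern",
        ["norfolk", "computer science", "old dominion"],
        ["http://www.facebook.com"],
    ):
        print(query)
```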
Google and Yahoo
  Calls to Google and Yahoo need to be limited because of API restrictions.
  Google restricts use to about 1,000 requests per hour.
  Furthermore, the best results are in the first 1-8 positions of the result set.
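One simple way to stay under an hourly quota is to space requests evenly. The sketch below is an assumption about how this could be done, not a description of the author's code; the `Throttle` class is hypothetical, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class Throttle:
    """Space calls so that at most max_per_hour happen in any hour."""

    def __init__(self, max_per_hour=1000, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour  # 3.6 s for 1,000 per hour
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


if __name__ == "__main__":
    throttle = Throttle(max_per_hour=1000)
    print(throttle.min_interval)
```

Calling `throttle.wait()` before each search request keeps the request rate at or below the configured limit.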
Disambiguation Phase
Personally Identifiable Information Rich Profile
Point System
  Simple point system based on:
    Keyword matching
    Link community structure analysis
    Extraction of semantic and feature data from profiles
  A profile with 11 or more points is considered validated.
  Scores can range from a negative total to about 50.
Keyword Matching
  1 point for weak indicators: one-word terms like “programmer” or “student”
  4 points for stronger indicators: terms of two or more words like “computer science” or “software engineer”
  7 points for very strong indicators: locations, e.g. “norfolk” or “portsmouth”
    Localized advertisements can be problematic
  2 points for first name or given name
  4 points for last name
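The weights above can be applied to page text roughly as follows. This is a sketch under stated assumptions: each distinct term is counted once no matter how often it appears, matching is case-insensitive on whole words, and the function name `keyword_score` and its parameters are hypothetical.

```python
import re


def keyword_score(page_text, weak_terms, strong_terms, locations,
                  first_name, last_name):
    """Score a page using the keyword weights from the slide."""
    text = page_text.lower()

    def contains(term):
        # Whole-word, case-insensitive match (an assumption of this sketch).
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        return re.search(pattern, text) is not None

    score = 0
    score += sum(1 for term in weak_terms if contains(term))    # weak: 1 point
    score += sum(4 for term in strong_terms if contains(term))  # strong: 4 points
    score += sum(7 for term in locations if contains(term))     # location: 7 points
    if contains(first_name):
        score += 2
    if contains(last_name):
        score += 4
    return score


if __name__ == "__main__":
    text = "Carlton Northern is a computer science student in Norfolk"
    print(keyword_score(
        text,
        weak_terms=["programmer", "student"],
        strong_terms=["computer science", "software engineer"],
        locations=["norfolk", "portsmouth"],
        first_name="Carlton",
        last_name="Northern",
    ))
```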
Name Matching
  Facebook, LinkedIn, Google, and Twitter use real names, so:
    2 points for a first name or diminutive/nickname
    5 points for a last name
    Subtract 21 points if neither a first name (or diminutive/nickname) nor a last name is found
  Watch out for diminutives/nicknames!
    http://code.google.com/p/nickname-and-diminutive-names-lookup/
  LinkedIn provides location: add or subtract 7 points
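A small sketch of the name-matching rule follows. The penalty condition is read here as "neither a first-name form nor the last name appears", which is an interpretation of the slide; the function name `name_score`, the `nicknames` parameter, and the example nickname are hypothetical. The LinkedIn location adjustment is left out.

```python
import re


def name_score(profile_name, first_name, last_name, nicknames=()):
    """Score the display name of a profile on a real-name site."""
    tokens = set(re.findall(r"[a-z]+", profile_name.lower()))
    first_forms = {first_name.lower()} | {n.lower() for n in nicknames}
    first_found = bool(tokens & first_forms)
    last_found = last_name.lower() in tokens
    if not first_found and not last_found:
        return -21  # assumed reading of the penalty condition
    return (2 if first_found else 0) + (5 if last_found else 0)


if __name__ == "__main__":
    print(name_score("Carl Northern", "Carlton", "Northern", nicknames=["carl"]))
    print(name_score("Bruce Wayne", "Carlton", "Northern"))
```

In practice the `nicknames` list could come from a lookup table such as the one linked on the slide.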
Link Community Structure Analysis
  Retrieve all links in a page and check whether they point to other validated profiles in the data set; if so, assign 5 points.
  [Figure: a validated profile and a not-validated profile; assign 5 points to Michael's Twitter profile]
Me Links and Email Matching
  10 points if a profile is found from Rapportive
  10 points if a profile has a me link from an already validated profile
  [Figure: a validated profile and a not-validated profile; assign 10 points to Carlton's Twitter profile]
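The link-based evidence from this slide and the previous one can be combined in a few lines. This is a minimal sketch: the friend-link bonus is assumed to be awarded once per page rather than once per link, URL comparison is done after a simple normalization, and all function names are hypothetical. The threshold of 11 comes from the Point System slide.

```python
VALIDATION_THRESHOLD = 11  # from the Point System slide


def normalize(url):
    """Crude URL normalization for comparison (an assumption of this sketch)."""
    return url.strip().lower().rstrip("/")


def link_evidence_score(outbound_links, validated_urls,
                        found_via_rapportive=False,
                        me_link_from_validated=False):
    """Points from friend links, Rapportive matches and me links."""
    validated = {normalize(u) for u in validated_urls}
    score = 0
    if any(normalize(link) in validated for link in outbound_links):
        score += 5   # page links to an already validated profile
    if found_via_rapportive:
        score += 10  # profile was returned by Rapportive for a known email
    if me_link_from_validated:
        score += 10  # a validated profile declares this one via rel="me"
    return score


def is_validated(total_points):
    return total_points >= VALIDATION_THRESHOLD


if __name__ == "__main__":
    points = link_evidence_score(
        ["http://twitter.com/someone/"],
        ["http://twitter.com/someone"],
        me_link_from_validated=True,
    )
    print(points, is_validated(points))
```

Since newly validated profiles can supply evidence for others, scoring like this would be repeated over the data set, which is consistent with the "run multiple times" note on the Algorithm slide.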
Experiment
Dataset
  2016 students from our departmental server
    142 graduate
    1874 undergraduate
  Generated 9 GB worth of data
  Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries
  Use the information retrieval metrics of precision, recall and f-measure to assess our truth set
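For reference, the three metrics can be computed from a set of discovered profile URLs and a set of ground-truth URLs as sketched below. The function name and the example URLs are made up for illustration, and this computes a single pooled F1 score; the slides do not say how the reported figures were aggregated.

```python
def evaluate(found_profiles, truth_profiles):
    """Return (precision, recall, f_measure) for two collections of URLs."""
    found, truth = set(found_profiles), set(truth_profiles)
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1: harmonic mean of precision and recall
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical data, not taken from the experiment.
    found = {"u/a", "u/b", "u/c", "u/d"}
    truth = {"u/a", "u/b", "u/c", "u/e", "u/f", "u/g"}
    print(evaluate(found, truth))
```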
Truth Set Results Summary
Social Media Web Site Results
Whole Set Service Graph
Truth Set User Graph
Whole Set User Graph
Whole Set User Graph Without Blogger Links
Closeup
Future Work
  Facial recognition
  Better link community structure analysis
  Perform a quantitative social media digital preservation study
  Remove social media sites that produced few or no results (unpopular) and add new ones (foursquare.com)?
Potential Impacts/Uses
  Open source intelligence gathering
    “Open source” as in publicly available information
  Social media research
  Measure the social health of an organization
Conclusions
  Completely automated; the only human interaction is the creation of the search query
  Precision 0.863, recall 0.526, f-measure 0.632
  The approach uses non-traditional search mechanisms to achieve its goals
  Only publicly available information was used
Carlton Northern
  carlton.northern@gmail.com
  http://carlton-northern.com/
MDS 2011 Presentation: An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles

  • 1. An Unsupervised Approach to Discovering and Disambiguating Social Media Profiles Mining Data Semantics Workshop 2011 Carlton Northern Old Dominion University 8/25/2011 1
  • 2. Background Digital Preservation How are students using social media as a digital preservation strategy? Evaluating Personal Archiving Strategies for Internet-based Information - Marshall, McCown, Nelson http://www.cs.odu.edu/~mln/pubs/archiving-2007/eval-personal-arch-strat-archiving07.pdf 2
  • 3. Goal Ascertain the set of social media profiles for ODU CS students. { } ... 3
  • 4. 4 What's out there already?
  • 6. Wink / my life 6
  • 8. Requirements and Assumptions Approach must be automated - no human interaction except for search query consisting of: location organization profession/education domain. Achieve precision 0.7 or higher and f-measure 0.5 or higher comparable to a human level of the same activity Must find profiles not indexed by search engines Can use any means available including using search engines, page scraping, web service APIs, etc. Only publicly declared identities; do not expose obfuscated identities e.g., “Bruce Wayne“ -> “Batman" Find profiles from 25 pre-defined sites (next slide) Approach must be extensible, i.e. new social media sites can be added with minimal changes. 8
  • 11. Algorithm (diagram labels): Discovery Phase; Generate Usernames; Check Rapportive; Disambiguation Phase; Assign Points for Keywords, Email, Me and Friend Links; Check Google and Yahoo; Check Sites for Profiles; Check Sites for Profiles; Check Social Graph; Remove Duplicates; *Run multiple times
  • 13. Starting Information. Given: full name, e.g. Carlton Northern; CS username, e.g. cnorther; CS email, e.g. cnorther@cs.odu.edu; .forward files -> carlton.northern@gmail.com; CS profile URI, e.g. http://www.cs.odu.edu/~cnorther. Inferred: school affiliation, e.g. Old Dominion; approximate location, e.g. Norfolk, Hampton Roads; Computer Science affiliation, e.g. software engineer.
  • 14. Username Generation: Generate usernames from full-name derivatives, e.g. for "Carlton Northern" we have: cnorthern, northernc, carlton.northern, carlton_northern, carlton-northern
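The derivation on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `generate_usernames` is hypothetical, the five patterns are read off the slide's examples, and the separator variants are assumed to use the full first and last name. The slide deck does not say how the actual generator was written.

```python
def generate_usernames(full_name: str) -> list[str]:
    """Derive candidate usernames from a full name (illustrative patterns only)."""
    parts = full_name.lower().split()
    first, last = parts[0], parts[-1]
    candidates = [
        first[0] + last,      # first initial + last name
        last + first[0],      # last name + first initial
        f"{first}.{last}",    # dot separator
        f"{first}_{last}",    # underscore separator
        f"{first}-{last}",    # hyphen separator (assumed to use the full last name)
    ]
    # Drop duplicates while preserving order
    return list(dict.fromkeys(candidates))


print(generate_usernames("Carlton Northern"))
# ['cnorthern', 'northernc', 'carlton.northern', 'carlton_northern', 'carlton-northern']
```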
  • 15. Poll Sites: Issue an HTTP GET to determine whether a profile exists for a generated username. Create site templates for links: http://www.facebook.com/'username here', http://www.stumbleupon.com/stumbler/'username here', https://picasaweb.google.com/'username here'. 2016 students x 6 usernames x 25 sites = 302k requests. Example: GET http://www.facebook.com/carlton.northern HTTP/1.1. If the response is 200, the profile exists; otherwise it doesn't. Soft 404s can be somewhat problematic but can be handled. Some sites detect robots and present a CAPTCHA, which is also problematic.
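A minimal sketch of the polling step, assuming the site templates are stored as format strings. The names `SITE_TEMPLATES`, `candidate_urls`, and `profile_exists` are hypothetical, and `fetch_status` stands in for whatever HTTP client returns a status code, so the sketch itself makes no network calls. Soft 404 and CAPTCHA handling are left out.

```python
from typing import Callable

# Templates copied from the slide; "{username}" marks where the username goes.
SITE_TEMPLATES = {
    "facebook": "http://www.facebook.com/{username}",
    "stumbleupon": "http://www.stumbleupon.com/stumbler/{username}",
    "picasaweb": "https://picasaweb.google.com/{username}",
}


def candidate_urls(usernames, templates=SITE_TEMPLATES):
    """Expand every (site, username) pair into a profile URL to probe."""
    return [t.format(username=u) for t in templates.values() for u in usernames]


def profile_exists(url: str, fetch_status: Callable[[str], int]) -> bool:
    """Treat HTTP 200 as 'profile exists'; soft 404s would need extra checks."""
    return fetch_status(url) == 200


# Request volume from the slide: students x usernames x sites.
total_requests = 2016 * 6 * 25
print(total_requests)  # 302400
```

Adding a new site under this design is one more entry in the template table, which is in the spirit of the extensibility requirement stated earlier.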
  • 16. Google's Social Graph API: Run existing profile URLs through Google Social Graph to find "me" links.
  • 17. "Me" Links: "me" links are links in Friend of a Friend (FOAF) and XHTML Friends Network (XFN) that specify the same identity. For example, a me link from my CS profile page to Twitter: <html> <head> <title>Carlton Northern's CS Home Page</title> </head> <body> stuff here ... <a href="http://twitter.com/carltonnorthern" rel="me">My Twitter</a> </body> </html>
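To illustrate how such a link could be detected directly in a page, here is a small sketch using Python's standard `html.parser`. The class and function names are hypothetical, and this is not how the Social Graph API worked internally; it only shows the `rel` check on anchor tags.

```python
from html.parser import HTMLParser


class MeLinkParser(HTMLParser):
    """Collect href values of <a> tags whose rel attribute includes 'me'."""

    def __init__(self):
        super().__init__()
        self.me_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr = dict(attrs)
        # rel may hold several space-separated tokens, e.g. rel="me nofollow"
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "me" in rel_tokens and attr.get("href"):
            self.me_links.append(attr["href"])


def extract_me_links(html: str) -> list[str]:
    parser = MeLinkParser()
    parser.feed(html)
    return parser.me_links


page = ('<html><body><a href="http://twitter.com/carltonnorthern" rel="me">'
        'My Twitter</a></body></html>')
print(extract_me_links(page))  # ['http://twitter.com/carltonnorthern']
```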
  • 18. Rapportive: Rapportive is a contact relationship management (CRM) tool that sits on top of Gmail. It uses AJAX and JSON to serve content to its Gmail widget. We mined .forward files on the CS departmental server and found only 24 email addresses out of 2016 students. We ran CS and non-CS email addresses through Rapportive's not-so-public API to access its results. This produced 15.9% of our truth set profile results, with only 1.6% being unique to Rapportive.
  • 19. Google and Yahoo: Query Google and Yahoo using their respective APIs, e.g.: "carlton northern" AND norfolk; "carlton northern" AND "computer science"; "carlton northern" AND "old dominion"; "carlton northern" site:http://www.facebook.com. GeoNames could be used to derive nearby cities to automatically form search queries. The same could be done with WordNet to derive profession or education terms.
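The query patterns above can be produced mechanically from the inferred starting information. The sketch below is an assumption about how that might look: `build_queries` and its parameters are hypothetical names, and the term lists are passed in by hand rather than derived from GeoNames or WordNet.

```python
def build_queries(full_name, locations, professions, organizations, sites):
    """Combine a quoted full name with context terms and site restrictions."""
    name = f'"{full_name.lower()}"'
    queries = []
    for term in [*locations, *professions, *organizations]:
        # Quote multi-word terms so they are matched as phrases
        quoted = f'"{term}"' if " " in term else term
        queries.append(f"{name} AND {quoted}")
    for site in sites:
        queries.append(f"{name} site:{site}")
    return queries


for q in build_queries(
    "Carlton Northern",
    locations=["norfolk"],
    professions=["computer science"],
    organizations=["old dominion"],
    sites=["http://www.facebook.com"],
):
    print(q)
```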
  • 20. Google and Yahoo: Calls to Google and Yahoo need to be limited because of API restrictions; Google restricts use to about 1,000 requests per hour. Furthermore, the best results are in the first 1 to 8 positions of the result set.
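One simple way to stay under an hourly quota is to space requests evenly; at about 1,000 requests per hour that is one request every 3.6 seconds. The `Throttle` class and `top_results` helper below are hypothetical and only sketch the idea; the slides do not describe how the limit was actually enforced.

```python
import time


class Throttle:
    """Space calls so that at most max_per_hour are issued per hour."""

    def __init__(self, max_per_hour: int, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 3600.0 / max_per_hour
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def top_results(results, limit=8):
    """Keep only the leading results, where the best matches tend to be."""
    return results[:limit]


throttle = Throttle(max_per_hour=1000)
print(throttle.min_interval)  # 3.6
```

A caller would invoke `throttle.wait()` before each search request and pass the returned hits through `top_results`.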
  • 24. Point System: A simple point system based on keyword matching, link community structure analysis, and extraction of semantic and feature data from profiles. A profile with 11 points is considered validated. Points can range from a negative total to about 50.
  • 25. Keyword Matching: 1 point for weak indicators: one-word terms like "programmer" or "student". 4 points for stronger indicators: terms of two or more words like "computer science" or "software engineer". 7 points for very strong indicators: locations, e.g. "norfolk" or "portsmouth" (localized advertisements can be problematic). 2 points for first name or given name. 4 points for last name.
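The keyword weights can be expressed as a small scoring function. In this sketch the term sets contain only the examples from the slide, matching is a plain case-insensitive substring test, and each term counts once; all three choices are assumptions, as is the function name `keyword_score`.

```python
WEAK_TERMS = {"programmer", "student"}                     # 1 point each
STRONG_TERMS = {"computer science", "software engineer"}   # 4 points each
LOCATION_TERMS = {"norfolk", "portsmouth"}                 # 7 points each


def keyword_score(page_text: str, first_name: str, last_name: str) -> int:
    """Sum keyword points for one profile page (substring matching assumed)."""
    text = page_text.lower()
    score = 0
    score += sum(1 for term in WEAK_TERMS if term in text)
    score += sum(4 for term in STRONG_TERMS if term in text)
    score += sum(7 for term in LOCATION_TERMS if term in text)
    if first_name.lower() in text:
        score += 2
    if last_name.lower() in text:
        score += 4
    return score


text = "Carlton Northern is a computer science student in Norfolk"
print(keyword_score(text, "Carlton", "Northern"))  # 1 + 4 + 7 + 2 + 4 = 18
```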
  • 26. Name Matching: Facebook, LinkedIn, Google, and Twitter use real names, so: 2 points for a first name or diminutive/nickname; 5 points for a last name; subtract 21 points if neither a nickname/diminutive nor a last name is found. Watch out for diminutives/nicknames! http://code.google.com/p/nickname-and-diminutive-names-lookup/ LinkedIn provides location: add or subtract 7 points.
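A sketch of the name rules for real-name sites follows. The `NICKNAMES` table is invented stand-in data rather than the linked lookup project, the penalty is applied only when neither a first-name form nor the last name appears (one reading of the slide), and the function names are hypothetical.

```python
# Tiny stand-in for a nickname/diminutive lookup table (hypothetical data).
NICKNAMES = {"carlton": {"carl"}}


def name_score(profile_name, first_name, last_name, nicknames=NICKNAMES):
    """Score the displayed name on a real-name site against the known name."""
    tokens = set(profile_name.lower().split())
    first = first_name.lower()
    first_forms = {first} | nicknames.get(first, set())
    has_first = bool(tokens & first_forms)
    has_last = last_name.lower() in tokens
    if not has_first and not has_last:
        return -21  # assumed reading: penalty when nothing matches
    return (2 if has_first else 0) + (5 if has_last else 0)


def linkedin_location_score(profile_location, known_locations):
    """Add 7 points for a matching location, subtract 7 otherwise."""
    loc = profile_location.lower()
    return 7 if any(k.lower() in loc for k in known_locations) else -7


print(name_score("Carl Northern", "Carlton", "Northern"))      # 7
print(name_score("Bruce Wayne", "Carlton", "Northern"))        # -21
print(linkedin_location_score("Norfolk, Virginia Area", ["norfolk"]))  # 7
```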
  • 27. Link Community Structure Analysis: Retrieve all links in a page and see whether they point to other validated profiles in the data set; if so, assign 5 points. (Figure: a validated profile links to a not-validated profile; assign 5 points to Michael's Twitter profile.)
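The link check reduces to a set-membership test once the page's outgoing links have been extracted. The sketch below assumes the 5 points are awarded once per page rather than once per link, uses a deliberately crude URL normalization, and uses made-up example URLs; `link_structure_score` is a hypothetical name.

```python
def normalize(url: str) -> str:
    """Crude normalization so trivially different URLs compare equal."""
    return url.strip().lower().rstrip("/")


def link_structure_score(outgoing_links, validated_profiles, points=5):
    """Return the bonus if any outgoing link targets a validated profile."""
    validated = {normalize(u) for u in validated_profiles}
    hit = any(normalize(link) in validated for link in outgoing_links)
    return points if hit else 0


validated = {"http://profiles.example/carlton"}          # hypothetical URL
links_on_page = ["http://profiles.example/carlton/", "http://other.example/x"]
print(link_structure_score(links_on_page, validated))    # 5
```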
  • 28. Me Links and Email Matching: 10 points if a profile is found from Rapportive. 10 points if a profile has a me link from an already validated profile. (Figure: a validated profile has a me link to a not-validated profile; assign 10 points to Carlton's Twitter profile.)
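Combining these two signals with the validation threshold from the point-system slide might look like the sketch below. The function names are hypothetical, and treating "11 points" as "11 or more" is an assumption.

```python
VALIDATION_THRESHOLD = 11  # from the point-system slide; ">=" is assumed


def identity_link_score(found_via_rapportive: bool,
                        has_me_link_from_validated: bool) -> int:
    """10 points per identity signal (Rapportive hit, me link)."""
    score = 0
    if found_via_rapportive:
        score += 10
    if has_me_link_from_validated:
        score += 10
    return score


def is_validated(total_points: int, threshold: int = VALIDATION_THRESHOLD) -> bool:
    return total_points >= threshold


me_only = identity_link_score(False, True)
print(me_only, is_validated(me_only))          # 10 False
print(me_only + 2, is_validated(me_only + 2))  # 12 True
```

Under these assumptions a single me link or Rapportive hit is one point short of validation, so at least one additional signal, such as a matching first name, is needed.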
  • 30. Dataset: 2016 students from our departmental server: 142 graduate, 1874 undergraduate. Generated 9 GB worth of data. Truth set: 20 graduate students and 2 professors from our research group, Web Science and Digital Libraries. Use the information retrieval metrics of precision, recall, and f-measure to assess our truth set.
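For reference, the three metrics can be computed from the set of profiles the system returns and the set of known-correct profiles. The sketch takes f-measure to be the balanced F1 (harmonic mean of precision and recall), which is an assumption, and the profile identifiers are invented placeholders rather than data from the study.

```python
def precision_recall_f1(found: set, truth: set):
    """Return (precision, recall, F1) for a retrieved set against a truth set."""
    true_positives = len(found & truth)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


found = {"p1", "p2", "p3", "p4"}              # profiles the system validated
truth = {"p1", "p2", "p3", "p5", "p6", "p7"}  # profiles known to be correct
print(precision_recall_f1(found, truth))      # (0.75, 0.5, 0.6)
```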
  • 31. Truth Set Results Summary
  • 32. Social Media Web Site Results
  • 33. Whole Set Service Graph
  • 35. Truth Set User Graph
  • 36. Whole Set User Graph
  • 38. Whole Set User Graph Without Blogger Links
  • 40. Future Work: Facial recognition. Better link community structure analysis. Perform a quantitative social media digital preservation study. Remove social media sites that produced few or no results (unpopular) and add new ones (foursquare.com)?
  • 41. Potential Impacts/Uses: Open source intelligence gathering ("open source" as in publicly available information). Social media research. Measuring the social health of an organization.
  • 42. Conclusions: Completely automated, with the only human interaction being the creation of the search query. Precision 0.863, recall 0.526, f-measure 0.632. The approach uses non-traditional search mechanisms to achieve its goals. Only publicly available information was used.
  • 43. Carlton Northern, carlton.northern@gmail.com, http://carlton-northern.com/