Inferring Web Citations using Social Data and SPARQL Rules
1. Inferring Web Citations using Social Data and SPARQL Rules. Matthew Rowe, Organisations, Information and Knowledge Group, University of Sheffield
2. Outline: Problem Setting; Personal Information Dissemination; SPARQL Rules: Identifying Web Citations; Generating Seed Data; Gathering Possible Web Citations; Inferring Web Citations; Evaluation; Conclusions; Future Work
3. Personal Information on the Web. Personal information on the Web is disseminated: voluntarily; involuntarily. Increase in personal information: identity theft; lateral surveillance. Web users must discover their identity web references. Two-stage process: find possible references; identify definite references.
10. Problem Setting. Performing identification manually: time consuming; laborious; must handle masses of information; repeated often, because the Web keeps changing. Solution = automated techniques: alleviate the need for humans; need background knowledge (Who am I searching for? What makes them unique?).
13. Generating Seed Data. Profiles on the Social Web are leveraged as seed data. To generate seed data: export social graphs (interface with the platform’s API; convert the proprietary response into RDF: biographical information, social network information); enrich graphs with URIs; interlink graphs (detect equivalent foaf:Person instances; builds a single social graph).
14. Generating Seed Data: http://www.dcs.shef.ac.uk/~mrowe/foafgenerator.html
16. Generating Seed Data. Detect equivalent foaf:Person instances: blocking step; compare values of Inverse Functional Properties; compare Geo URIs; compare Geo data.
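The comparison of Inverse Functional Property values can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's implementation: each exported social graph is assumed to be a list of (subject, predicate, object) tuples, foaf:homepage and foaf:mbox are assumed to be the properties compared, and the blocking step and geo comparisons are left out. The function name and all example identifiers are hypothetical.

```python
from collections import defaultdict

FOAF = "http://xmlns.com/foaf/0.1/"
# Assumption: the inverse functional properties (IFPs) used for matching.
IFPS = {FOAF + "homepage", FOAF + "mbox"}


def find_equivalent_persons(graph_a, graph_b):
    """Return pairs (a, b) of subjects that share a value for the same IFP."""
    # Index graph_a by (IFP, value) so graph_b can be matched in one pass.
    index = defaultdict(set)
    for s, p, o in graph_a:
        if p in IFPS:
            index[(p, o)].add(s)
    pairs = set()
    for s, p, o in graph_b:
        if p in IFPS:
            for other in index.get((p, o), ()):
                pairs.add((other, s))
    return pairs


# Hypothetical exports from two platforms.
facebook = [
    ("fb:alice", FOAF + "name", "Alice Example"),
    ("fb:alice", FOAF + "homepage", "http://example.org/alice"),
    ("fb:bob", FOAF + "name", "Bob Example"),
]
twitter = [
    ("tw:alice_e", FOAF + "homepage", "http://example.org/alice"),
    ("tw:carol", FOAF + "homepage", "http://example.org/carol"),
]
matches = find_equivalent_persons(facebook, twitter)
print(matches)
```

Because an inverse functional property value identifies at most one individual, a shared value is taken as evidence that the two foaf:Person instances are the same person and can be merged into a single social graph.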
19. Gathering Possible Web Citations. Search the WWW and the Semantic Web for possible citations. Web resources come in many flavours: data models, HTML documents, XHTML documents. Convert into RDF. XHTML documents: use GRDDL; automated RDF model lifting. HTML documents: apply a person name gazetteer to identify person information; apply a Hidden Markov Model to extract information; build an RDF model from the information. Reference: M. Rowe. Data.dcs: Converting Legacy Data into Linked Data. In Proceedings of the Linked Data on the Web Workshop, WWW 2010. Raleigh, USA. (2010)
21. Inferring Web Citations using SPARQL Rules. Seed data = solitary example to build rules. State-of-the-art rule induction strategies are limited, e.g. FOIL and C4.5. Build rules from RDF instances! 1. Extract instances from the seed data. 2. For each instance, build a rule: build a skeleton rule; add triples to the rule; create a new rule if a triple’s predicate is inverse functional. 3. Apply the rules to the web resources.
23. Inferring Web Citations using SPARQL Rules. Build a skeleton rule:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n
}
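To make concrete what the skeleton rule infers, here is a rough equivalent in plain Python rather than a SPARQL engine. It is a sketch under assumptions: the seed data and the web resources are lists of (subject, predicate, object) tuples, the function name `apply_skeleton_rule` is hypothetical, and the example pages and blank-node labels are invented for illustration.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
ME = "http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me"


def apply_skeleton_rule(seed, resources, me):
    """Infer (me, foaf:page, url) when a resource has a foaf:topic whose
    foaf:name is exactly equal to one of the names held for `me`."""
    triples = set(seed) | set(resources)
    my_names = {o for s, p, o in triples if s == me and p == FOAF + "name"}
    # Every node carrying one of those names (exact literal match only).
    named = {s for s, p, o in triples if p == FOAF + "name" and o in my_names}
    return {
        (me, FOAF + "page", url)
        for url, p, topic in triples
        if p == FOAF + "topic" and topic in named
    }


seed = [(ME, FOAF + "name", "Matthew Rowe")]
resources = [
    ("http://example.org/page1", FOAF + "topic", "_:b1"),
    ("_:b1", FOAF + "name", "Matthew Rowe"),
    ("http://example.org/page2", FOAF + "topic", "_:b2"),
    ("_:b2", FOAF + "name", "M. Rowe"),  # name variant: no exact match
]
inferred = apply_skeleton_rule(seed, resources, ME)
print(inferred)
```

Only page1 is inferred as a citation. page2 is missed because "M. Rowe" is not literally equal to "Matthew Rowe", which shows how strict matching can cost recall.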
24. Inferring Web Citations using SPARQL Rules. Add triples to the rule:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:name ?m .
  ?url foaf:topic ?r .
  ?r foaf:name ?m
}
25. Inferring Web Citations using SPARQL Rules. Create a new rule if a triple’s predicate is inverse functional:
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
CONSTRUCT { <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:page ?url }
WHERE {
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:name ?n .
  ?url foaf:topic ?p .
  ?p foaf:name ?n .
  <http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me> foaf:knows ?q .
  ?q foaf:homepage ?h .
  ?url foaf:topic ?r .
  ?r foaf:homepage ?h
}
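The rule-building step can be sketched as a small generator of CONSTRUCT queries. This is one possible reading of the slides, not the author's code: it assumes an instance is described by the list of predicates held for a foaf:knows contact, that non-inverse-functional predicates are collected into one rule, and that each inverse functional predicate (assumed here to be foaf:homepage or foaf:mbox) gets its own rule. The helper names and the `?v0`, `?v1` variable naming are hypothetical.

```python
ME = "<http://www.dcs.shef.ac.uk/~mrowe/foaf.rdf#me>"
# Assumption: predicates treated as inverse functional.
IFPS = {"foaf:homepage", "foaf:mbox"}

# Triple patterns of the skeleton rule (person name match only).
SKELETON = [
    f"{ME} foaf:name ?n",
    "?url foaf:topic ?p",
    "?p foaf:name ?n",
]


def render(patterns):
    """Render a list of triple patterns as a SPARQL CONSTRUCT rule."""
    body = " .\n  ".join(patterns)
    return (
        "PREFIX foaf: <http://xmlns.com/foaf/0.1/>\n"
        f"CONSTRUCT {{ {ME} foaf:page ?url }}\n"
        f"WHERE {{\n  {body}\n}}"
    )


def knows_patterns(pairs):
    """Patterns tying a known person ?q in the seed data to a topic ?r of the page."""
    seed_side = [f"?q {pred} {var}" for pred, var in pairs]
    web_side = [f"?r {pred} {var}" for pred, var in pairs]
    return [f"{ME} foaf:knows ?q", *seed_side, "?url foaf:topic ?r", *web_side]


def build_rules(predicates):
    """Build rules for one instance: one rule for all non-IFP predicates,
    plus a separate rule for each IFP predicate."""
    pairs = [(pred, f"?v{i}") for i, pred in enumerate(predicates)]
    general = [pair for pair in pairs if pair[0] not in IFPS]
    rules = []
    if general:
        rules.append(render(SKELETON + knows_patterns(general)))
    for pair in pairs:
        if pair[0] in IFPS:
            rules.append(render(SKELETON + knows_patterns([pair])))
    return rules


rules = build_rules(["foaf:name", "foaf:homepage"])
for rule in rules:
    print(rule, end="\n\n")
```

For a contact with a name and a homepage this yields two rules that have the same shape as the two queries on the preceding slides, up to variable names.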
27. Evaluation. Measures: precision, recall, F-measure. Dataset: 50 participants from the Semantic Web and Web 2.0 communities; seed data collected from Facebook and Twitter; ~17,300 web resources (346 web resources for each participant). Baselines: Baseline 1: person name as positive classification (skeleton SPARQL rule); Baseline 2: human processing.
28. Results. High precision; better than humans; triple patterns. Low recall; rules are strict; no room for variability; hard to generalise. No learning from disambiguation decisions.
29. Conclusions. SPARQL rules are precise; poor generalisation, however. Outperform humans at low web presence levels: “needle in a haystack” problem. User profiles provide seed data inexpensively, capturing biographical information and social networking information. Inability to learn from identifications: plan for future work. Overcome poor seed data feature coverage.
30. Questions? Twitter: @mattroweshow. Web: http://www.dcs.shef.ac.uk/~mrowe. Email: m.rowe@dcs.shef.ac.uk. For more information: M. Rowe and F. Ciravegna. Disambiguating Identity Web References using Web 2.0 Data and Semantics. In press for the special issue on "Web 2.0" in the Journal of Web Semantics. (2010)
Editor's Notes
Voluntary: e.g. personal web pages, blog pages. Involuntary: e.g. publication of electoral registers, people listings/aggregators (123people.co.uk). Automated techniques require background knowledge! Expensive to produce manually (e.g. form filling). Must be accurate. Common problem in machine learning! [Yu, 2004] highlights the painstaking methods required to acquire labelled/seed data.
1,900,000 results returned for my name. I am: a conductor; a cyclist; wrote the song “Wannabe” by the Spice Girls; PhD student. That is only the first page! It gets worse later on: lawyer, surfer, another PhD student.
3-stage approach, USER CENTRIC!: 1. Gather seed data, via profiles on the Social Web. 2. Gather possible web citations, via querying search engines. 3. Identify web citations, via SPARQL rules. Intuition: a person will appear on the Web with people they know! Similar intuition used by state-of-the-art (SOA) disambiguation techniques.
Export individual social graphs from Facebook, Twitter, etc.! Overcomes data portability issues!
Some flavours taste better to machines
Now have seed data AS RDF! Now have possible web citations AS RDF!
We now have our seed data and a collection of web resources. Both are in RDF! Now we can pass them on to the disambiguation techniques.
Precision = proportion of web resources which are correctly labelled as citing a person. Recall = proportion of web references which are correctly disambiguated. F-measure = harmonic mean of precision and recall.
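The three measures can be computed as in the short sketch below. The function name and the page identifiers are hypothetical, and the numbers are invented to illustrate the arithmetic; they are not results from the evaluation.

```python
def precision_recall_f1(predicted, actual):
    """predicted: resources labelled as citations; actual: true citations."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F-measure: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative only: two pages labelled, both correct, out of four true citations.
predicted = {"page1", "page2"}
actual = {"page1", "page2", "page3", "page4"}
p, r, f = precision_recall_f1(predicted, actual)
print(p, r, round(f, 3))
```

In this made-up case every labelled page is correct (precision 1.0) but half of the true citations are missed (recall 0.5), so the harmonic mean comes out at about 0.667.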
Achieves high levels of precision, outperforming humans and other baselines. SPARQL rules require strict literal and resource matching within the triple patterns; this leads to poor recall levels, however. Unable to learn from past disambiguation decisions. At lower levels of web presence (where identity web references are sparse), rules outperform all baselines in terms of F-measure. Humans find it difficult to detect sparse web references; automating disambiguation at such levels is more suitable.
Similar to state-of-the-art work. Uses co-occurrence in a web page as denoting a relationship.