Integrating and Interpreting Social Data from Heterogeneous Sources
1. Integrating and Interpreting Social Data from Heterogeneous Sources
   Matthew Rowe, Organisations, Information and Knowledge Group, University of Sheffield
   Suvodeep Mazumdar, Department of Information Studies, University of Sheffield
2. Outline
   - Information overload
   - Increase in social data publication
   - Interlinking social data
   - Metadata generation
   - Integrating social data
   - Application: interpreting social data
   - Cumbrian floods use case
   - Interacting with social data
   - Conclusions
3. Information Overload
   - Masses of social data are published every day, e.g.:
     - 50 million tweets per day (about 600 per second) (http://blog.twitter.com)
     - 22 million Facebook users in the UK (http://www.clickymedia.co.uk/2009/10/uk-facebook-user-statistics-october-2009/)
   - Too much information to deal with!
   - Social data is multi-faceted:
     - Provenance
     - Topic
     - Geo
   - Trend services (e.g. trendistic, blogpulse):
     - Focus on majority consensus
       - Need to listen in to a specific topic
     - Concentrate on a single source/platform
     - Do not consider the geo facet
6. Interlinking Social Data
   - Consider the multi-faceted nature of social data:
     - Allows fine-grained analysis
     - Show geo-localised social data
     - Relevant past social data
   - Solution: interlink social data from heterogeneous sources
     - Use semantics!
     - Consistent data interpretation
7. Metadata Generation
   - Web 2.0 platforms return data using:
     - Proprietary formats
     - Heterogeneous data schemas
   - Need to link data together from disparate sources
   - A social data fragment = a single piece of social data
     - E.g. a tweet, an image, a video
   - Lift each social data fragment to RDF:
     - Create an instance of sioc:Post and itr:LocalizedResource
       - Assign it a URI
     - Assign the content to the instance (topic)
       - Use hashtags of the microblog
     - Create an instance of gml:Geometry (geo)
       - Capture geo facet
     - Assign timestamp of fragment creation (provenance)
       - Using dcterms:created
     - Assign the fragment to its owner (provenance)
       - Create foaf:Person instance
8. Metadata Generation
   Flickr photo (XML response):

       <photo id="949406913" media="photo">
         <owner nsid="54948696@N00"/>
         <title>DSC00171.JPG</title>
         <description></description>
         <dates posted="1205398307" taken="2009-01-09 09:16:31" lastupdate="1257421561" />
         <tags>
           <tag id="24539622-2330113101-400" author="54948696@N00" raw="arctic">arctic</tag>
           <tag id="24539622-2330113101-401" author="54948696@N00" raw="monkeys">monkeys</tag>
         </tags>
         <location latitude="53.4813" longitude="-2.2392" place_id="R8vDw_abBpSzUA">
           <locality place_id="R8vDw_abBpSzUA" woeid="27872">Manchester</locality>
           <region place_id="pn4MsiGbBZlXeplyXg" woeid="24554868">England</region>
           <country place_id="DevLebebApj4RVbtaQ" woeid="23424975">United Kingdom</country>
         </location>
       </photo>

   Twitter status (XML response):

       <status>
         <created_at>Sun Feb 28 12:22:47 +0000 2010</created_at>
         <id>9774519667</id>
         <text>Writing up our Geovation work for #lupas2010.</text>
         <truncated>false</truncated>
         <in_reply_to_status_id></in_reply_to_status_id>
         <in_reply_to_user_id></in_reply_to_user_id>
         <favorited>false</favorited>
         <in_reply_to_screen_name></in_reply_to_screen_name>
         <geo xmlns:georss="http://www.georss.org/georss">
           <georss:point>53.3833,-1.4722</georss:point>
         </geo>
       </status>
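The two responses above use different schemas for the same facets (owner, topic, time, location). The following is a minimal sketch, not the authors' implementation, of mapping both onto one common record before lifting to RDF. The function names, the dictionary keys and the trimmed sample strings are assumptions made for illustration; only the element and attribute names come from the slides.

```python
import re
import xml.etree.ElementTree as ET

# Qualified name of <georss:point> as ElementTree exposes it.
GEORSS_POINT = "{http://www.georss.org/georss}point"


def parse_flickr_photo(xml_text):
    """Map a Flickr <photo> element onto the common fragment fields."""
    photo = ET.fromstring(xml_text)
    location = photo.find("location")
    pos = None
    if location is not None:
        pos = (float(location.get("latitude")), float(location.get("longitude")))
    return {
        "id": photo.get("id"),
        "owner": photo.find("owner").get("nsid"),
        "tags": [tag.text for tag in photo.iter("tag")],
        "created": photo.find("dates").get("taken"),
        "pos": pos,
    }


def parse_twitter_status(xml_text):
    """Map a Twitter <status> element onto the same fields."""
    status = ET.fromstring(xml_text)
    text = status.findtext("text", default="")
    point = status.findtext("geo/" + GEORSS_POINT)
    pos = tuple(float(v) for v in point.split(",")) if point else None
    return {
        "id": status.findtext("id"),
        "owner": None,  # the <status> excerpt on the slide has no user element
        "tags": re.findall(r"#(\w+)", text),  # hashtags stand in for topics
        "created": status.findtext("created_at"),
        "pos": pos,
    }


# Trimmed copies of the slide examples (assumption: unused fields dropped).
FLICKR_SAMPLE = """<photo id="949406913" media="photo">
  <owner nsid="54948696@N00"/>
  <dates posted="1205398307" taken="2009-01-09 09:16:31"/>
  <tags><tag raw="arctic">arctic</tag><tag raw="monkeys">monkeys</tag></tags>
  <location latitude="53.4813" longitude="-2.2392"/>
</photo>"""

TWITTER_SAMPLE = """<status>
  <created_at>Sun Feb 28 12:22:47 +0000 2010</created_at>
  <id>9774519667</id>
  <text>Writing up our Geovation work for #lupas2010.</text>
  <geo xmlns:georss="http://www.georss.org/georss">
    <georss:point>53.3833,-1.4722</georss:point>
  </geo>
</status>"""

if __name__ == "__main__":
    print(parse_flickr_photo(FLICKR_SAMPLE))
    print(parse_twitter_status(TWITTER_SAMPLE))
```

Keeping one parser per platform and one shared record shape means a new source only needs a new parser; everything downstream stays the same.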
10. Metadata Generation
    Create an instance of sioc:Post / itr:LocalizedResource and assign it a URI:

        <http://twitter.com/mattroweshow/9774519667>
            rdf:type sioc:Post ;
            rdf:type itr:LocalizedResource .
11. Metadata Generation
    Assign the content to the instance (topic), using the hashtags of the microblog:

        <http://twitter.com/mattroweshow/9774519667>
            rdf:type sioc:Post ;
            rdf:type itr:LocalizedResource ;
            sioc:content "Writing up our Geovation work for #lupas2010." ;
            dcterms:subject "lupas2010" .
12. Metadata Generation
    Create an instance of gml:Geometry to capture the geo facet:

        <http://twitter.com/mattroweshow/9774519667>
            rdf:type sioc:Post ;
            rdf:type itr:LocalizedResource ;
            sioc:content "Writing up our Geovation work for #lupas2010." ;
            dcterms:subject "lupas2010" ;
            itr:has_Localization _:a2 .

        _:a2
            rdf:type gml:Geometry ;
            gml:pos "53.3833,-1.4722" .
13. Metadata Generation
    Assign the timestamp of fragment creation (provenance) using dcterms:created:

        <http://twitter.com/mattroweshow/9774519667>
            rdf:type sioc:Post ;
            rdf:type itr:LocalizedResource ;
            sioc:content "Writing up our Geovation work for #lupas2010." ;
            dcterms:subject "lupas2010" ;
            dcterms:created "2010-2-28 12:22:47.0" ;
            itr:has_Localization _:a2 .

        _:a2
            rdf:type gml:Geometry ;
            gml:pos "53.3833,-1.4722" .
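Each platform reports creation time in its own format, so the timestamp step needs a normalisation pass if dates from different sources are to be compared. A minimal sketch follows, assuming ISO 8601 as the target format (the slides do not say which format the authors settled on; the function names are hypothetical):

```python
from datetime import datetime

# Formats as they appear in the Twitter and Flickr examples on the slides.
TWITTER_FORMAT = "%a %b %d %H:%M:%S %z %Y"
FLICKR_FORMAT = "%Y-%m-%d %H:%M:%S"


def twitter_time_to_iso(created_at):
    """Convert a Twitter created_at value to an ISO 8601 string."""
    return datetime.strptime(created_at, TWITTER_FORMAT).isoformat()


def flickr_time_to_iso(taken):
    """Convert a Flickr 'taken' attribute to an ISO 8601 string.

    Flickr's value has no UTC offset, so the result is a naive timestamp.
    """
    return datetime.strptime(taken, FLICKR_FORMAT).isoformat()


if __name__ == "__main__":
    print(twitter_time_to_iso("Sun Feb 28 12:22:47 +0000 2010"))
    print(flickr_time_to_iso("2009-01-09 09:16:31"))
```

Zero-padded ISO 8601 strings with the same offset sort chronologically as plain strings, which an unpadded value such as "2010-2-28" does not.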
14. Metadata Generation
    Assign the fragment to its owner (provenance) by creating a foaf:Person instance:

        <http://twitter.com/mattroweshow>
            rdf:type foaf:Person ;
            rdf:type itr:LocalizedResource ;
            foaf:name "Matthew Rowe" ;
            foaf:homepage <http://www.dcs.shef.ac.uk/~mrowe> .

        <http://twitter.com/mattroweshow/9774519667>
            rdf:type sioc:Post ;
            rdf:type itr:LocalizedResource ;
            sioc:content "Writing up our Geovation work for #lupas2010." ;
            dcterms:subject "lupas2010" ;
            dcterms:created "2010-2-28 12:22:47.0" ;
            sioc:has_creator <http://twitter.com/mattroweshow> ;
            itr:has_Localization _:a2 .

        _:a2
            rdf:type gml:Geometry ;
            gml:pos "53.3833,-1.4722" .
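The lifting steps can be put together in a small serialiser. The following is a sketch under stated assumptions rather than the authors' code: the function names and parameters are hypothetical, prefix declarations are left out as on the slides, the blank node label `_:geo` is arbitrary, and the creator link uses `sioc:has_creator`, the property name defined by the SIOC Core ontology.

```python
def turtle_literal(value):
    """Quote a string as a Turtle literal, escaping \\, " and newlines."""
    escaped = (
        value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    )
    return '"' + escaped + '"'


def fragment_to_turtle(uri, content, tags, created, creator_uri, pos):
    """Serialise one social data fragment following the steps on the slides."""
    pos_text = f"{pos[0]},{pos[1]}"
    lines = [
        f"<{uri}>",
        "    rdf:type sioc:Post ;",
        "    rdf:type itr:LocalizedResource ;",
        f"    sioc:content {turtle_literal(content)} ;",
    ]
    # One dcterms:subject per hashtag (topic facet).
    for tag in tags:
        lines.append(f"    dcterms:subject {turtle_literal(tag)} ;")
    lines += [
        f"    dcterms:created {turtle_literal(created)} ;",  # provenance: time
        f"    sioc:has_creator <{creator_uri}> ;",           # provenance: owner
        "    itr:has_Localization _:geo .",                  # geo facet
        "_:geo",
        "    rdf:type gml:Geometry ;",
        f"    gml:pos {turtle_literal(pos_text)} .",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(
        fragment_to_turtle(
            uri="http://twitter.com/mattroweshow/9774519667",
            content="Writing up our Geovation work for #lupas2010.",
            tags=["lupas2010"],
            created="2010-2-28 12:22:47.0",
            creator_uri="http://twitter.com/mattroweshow",
            pos=(53.3833, -1.4722),
        )
    )
```

String concatenation is enough for a sketch; a production pipeline would more likely hand the triples to an RDF library so that escaping and prefix handling are done for it.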
15. Integrated Social Data
    - Triplify social data from multiple platforms
      - Flickr XML response -> RDF
      - Picasa XML response -> RDF
    - Use common semantics
    - Can perform SPARQL queries

    Items tagged "iranelections", most recent first:

        PREFIX dcterms: <http://purl.org/dc/terms/>
        SELECT ?item
        WHERE {
          ?item dcterms:subject "iranelections" .
          ?item dcterms:created ?date
        }
        ORDER BY DESC(?date)

    Posts and their tags at a given location:

        PREFIX dcterms: <http://purl.org/dc/terms/>
        PREFIX itr: <http://www.dcs.shef.ac.uk/~gregoire/interaction/ns#>
        PREFIX gml: <http://www.opengis.net/gml/>
        SELECT DISTINCT ?post ?tag
        WHERE {
          ?post dcterms:subject ?tag .
          ?post itr:has_Localization ?geo .
          ?geo gml:pos "53.4813,-2.2392"
        }
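To make the semantics of the first query concrete, here is a small sketch that evaluates the same two triple patterns over an in-memory list of triples. It is not a SPARQL engine and is not how the authors ran their queries; the function name, the example.org URIs and the sample dates are invented for illustration, and dates are assumed to be ISO 8601 strings so that string order matches time order.

```python
DCTERMS = "http://purl.org/dc/terms/"


def items_by_tag(triples, tag):
    """Return subjects tagged with `tag`, newest dcterms:created first.

    Mirrors:  ?item dcterms:subject tag . ?item dcterms:created ?date
              ORDER BY DESC(?date)
    """
    # Pattern 1: every subject carrying the requested tag.
    tagged = {s for s, p, o in triples if p == DCTERMS + "subject" and o == tag}
    # Pattern 2: join on the subject to pick up its creation date.
    dated = [
        (o, s) for s, p, o in triples
        if p == DCTERMS + "created" and s in tagged
    ]
    return [s for _, s in sorted(dated, reverse=True)]


# Hypothetical sample data; only the property URIs are real.
SAMPLE_TRIPLES = [
    ("http://example.org/post/1", DCTERMS + "subject", "iranelections"),
    ("http://example.org/post/1", DCTERMS + "created", "2009-06-13T10:00:00"),
    ("http://example.org/post/2", DCTERMS + "subject", "iranelections"),
    ("http://example.org/post/2", DCTERMS + "created", "2009-06-20T08:30:00"),
    ("http://example.org/post/3", DCTERMS + "subject", "lupas2010"),
    ("http://example.org/post/3", DCTERMS + "created", "2010-02-28T12:22:47"),
]

if __name__ == "__main__":
    for item in items_by_tag(SAMPLE_TRIPLES, "iranelections"):
        print(item)
```

Because every platform's data is lifted to the same properties, the query does not need to know which platform a fragment came from.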
16. Interpreting Social Data
    - Cumbrian use case
      - UK region suffered worst floods in centuries
      - Observe the effects in social data
        - Rise in publication
        - Fine-grained geocoded social data
    - Dataset:
      - Microblogs from 200 Cumbrian Twitter users
        - Published during 2009
        - 3513 microblogs
        - Produced 475,043 triples
      - Images from Flickr taken in Cumbria
        - 6663 images
        - Produced 182,304 triples
17. Interacting with Social Data
    - Built a visualisation application to analyse social data fragments
      - http://www.dcs.shef.ac.uk/~suvodeep/ViziSocial
    - Filter by date
      - Lower slider
    - Fine-grained focus
      - Zoom in
    - Tag cloud
      - Shows fragment topics
      - Window controls tag cloud topics
    - Markers contain number of fragments
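The interaction described above (date slider plus map window driving the tag cloud) amounts to filtering fragments on two facets and counting topics. The sketch below illustrates that idea only; it is not the ViziSocial code, and the function name, record layout, coordinates and dates are all assumptions.

```python
from collections import Counter


def tag_cloud(fragments, start, end, south, west, north, east):
    """Count tags of fragments inside a date range and a map window.

    Returns (tag_counts, visible), where `visible` is the number of
    fragments that passed both filters.
    """
    counts = Counter()
    visible = 0
    for fragment in fragments:
        lat, lon = fragment["pos"]
        in_dates = start <= fragment["created"] <= end
        in_window = south <= lat <= north and west <= lon <= east
        if in_dates and in_window:
            counts.update(fragment["tags"])
            visible += 1
    return counts, visible


# Hypothetical fragments; dates are ISO strings so they compare as text.
SAMPLE_FRAGMENTS = [
    {"tags": ["flood", "cumbria"], "created": "2009-11-20", "pos": (54.66, -3.36)},
    {"tags": ["flood"], "created": "2009-11-21", "pos": (54.60, -3.13)},
    {"tags": ["arctic", "monkeys"], "created": "2009-01-09", "pos": (53.4813, -2.2392)},
]

if __name__ == "__main__":
    cloud, visible = tag_cloud(
        SAMPLE_FRAGMENTS, "2009-11-01", "2009-11-30", 54.0, -3.7, 55.2, -2.1
    )
    print(visible, cloud.most_common())
```

Zooming in corresponds to shrinking the window bounds, and moving the slider to narrowing the date range; the counts are recomputed from whatever remains visible.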
18. Conclusions
    - Consistent interpretation of social data
      - Across heterogeneous sources
    - Application
      - Allows analyses of social data
        - To fine-grained detail
      - Utilises multiple facets of social data
        - Requires metadata
      - Issue of scalability
    - Future work
      - Adapting to real-time data acquisition
        - Focussing on South Yorkshire region at present
      - Assess scalability issue