Fabrizio Orlandi's PhD Viva @Insight NUI Galway (ex-DERI) - 31/03/2014.
Supervisors: Alexandre Passant and John G. Breslin.
Examiners: Fabien Gandon and Stefan Decker
5. 1 – Heterogeneous data sources
Challenges
Example interests (figure): Sport, CEV Volleyball Cup, Music, Heavy Metal, Mastodon, Atlanta, …
Microblog? Social Networking Service?
6. 2 – Lack of provenance
Challenges
Example interests (figure): Sport, CEV Volleyball Cup, Music, Heavy Metal, Mastodon, Atlanta, …
Where? Who? How? What?
7. 3 – Semantics of entities of interest
Challenges
Example interests (figure): Sport, CEV Volleyball Cup, Music, Heavy Metal, Mastodon, Atlanta, …
Semantics? Pragmatics? Relevance?
8. Research Questions
1. Aggregation of Social Web data:
How can we aggregate and represent user data distributed across
heterogeneous social media systems for profiling user interests?
2. Provenance of data for user profiling:
What is the role of provenance on the Social Web and on the Web of
Data and how to leverage its potential for user profiling?
3. Semantic enrichment of user profiles and personalisation:
How to combine data from the Social and Semantic Web for enriching
user profiles of interests and deploying them to different
personalisation tasks?
9. Research Goal
How can we collect, represent, aggregate, mine, enrich and
deploy user profiles of interests on the Social Web for
multi-source personalisation?
11. 1. Aggregation of Social Web data:
How can we aggregate and represent user data distributed across
heterogeneous social media systems for profiling user interests?
12. Aggregation of Social Web Data
Modelling solution for Social Web data and user profiles
Based on SIOC, FOAF and extensions
Experiments on wikis
[Orlandi, Passant. WikiSym. ACM. 2010.]
13. Example
Input from the user's social media activity:
“Mastodon is the best heavy metal band from Atlanta…
Can’t wait to see them live again!”
“Trentino vs Lugano about to start - Diatec youngster to
impress again in CEV Champions League #volleyball”
User likes RDF and SemanticWeb on Facebook
Extracted interests (figure): Music, Heavy Metal, Mastodon, Atlanta, CEV Champions League, Volleyball, Semantic Web, RDF
• Natural language processing tools for entity extraction (Zemanta & Spotlight)
• Frequency + time-decay weighting schemes
14. Aggregation and Mining of Interests
7 types of user profiling strategies:
2 types of DBpedia entities: Categories vs. Resources
2 types of weighting scheme for category-based methods
- Cat1: Interests Weight Propagation
- Cat2: Interests Weight Propagation w/ Cat. Discount
2 types of exponential Time Decay function
- Short mean lifetime: 120 days
- Long mean lifetime: 360 days
1 “bag-of-words” (Tag-based) state-of-the-art approach
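The frequency plus exponential time-decay weighting can be sketched as follows. This is a minimal illustration, assuming the standard decay form exp(-age / mean lifetime) summed over every mention of an entity; the slides do not give the exact formula, and the function name `decayed_weights` and the sample mentions are hypothetical.

```python
import math
from collections import defaultdict
from datetime import datetime


def decayed_weights(mentions, now, mean_lifetime_days):
    """Sum exp(-age / mean_lifetime) over all mentions of each entity.

    Assumption: each mention contributes 1 when fresh and decays
    exponentially with its age in days.
    """
    weights = defaultdict(float)
    for entity, seen_at in mentions:
        age_days = (now - seen_at).total_seconds() / 86400.0
        weights[entity] += math.exp(-age_days / mean_lifetime_days)
    return dict(weights)


# Hypothetical (entity, timestamp) pairs produced by entity extraction.
now = datetime(2014, 3, 31)
mentions = [
    ("Mastodon", datetime(2014, 3, 30)),
    ("Mastodon", datetime(2013, 12, 1)),
    ("Volleyball", datetime(2013, 3, 31)),
]

short = decayed_weights(mentions, now, mean_lifetime_days=120)
long_ = decayed_weights(mentions, now, mean_lifetime_days=360)

for entity in sorted(short):
    print(f"{entity}: short={short[entity]:.3f} long={long_[entity]:.3f}")
```

With the long mean lifetime, an interest mentioned once a year ago keeps noticeably more weight than with the short one, while frequent recent mentions dominate in both cases.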
15. Evaluation
Aggregation and Mining of Interests
User study: 21 users rating their user profiles from Twitter & Facebook
210 ratings for each of the 7 different profiling methods
[Chart: P@10 and average score, on a 0 to 1 scale, for each profiling method]
Key findings
DBpedia resource-based profiles outperform DBpedia category-based and
tag-based profiles.
Best strategy: Resources + Frequency &
Slow Time Decay weighting scheme
[Orlandi, Breslin, Passant. I-Semantics. ACM. 2012.]
16. 1. Aggregation of Social Web data:
How can we aggregate and represent user data distributed across
heterogeneous social media systems for profiling user interests?
2. Provenance of data for user profiling:
What is the role of provenance on the Social Web and on the Web of
Data and how to leverage its potential for user profiling?
17. Provenance of Data
Motivation: use of provenance information as the core of the profiling heuristics
to improve mining of user interests and semantic enrichment
Data provenance as the history, the origins and the evolution of data
Who created/modified it? When? What is the content? Where is it located?
How and why was it created? Which tools and processes were used?
Provenance as the “bridge” between the Social Web and the Web of Data,
e.g. Wikipedia/DBpedia
18. Use Case: Provenance on Wikis
Provenance on the Social Web
for the Web of Data
A semantic model to represent provenance information in wikis
A software architecture to extract provenance from Wikipedia
An application that uses and exposes provenance data to compute measures
and statistics on Wikipedia articles
[Orlandi, Champin, Passant. SWPM at ISWC. 2010.]
20. Use Case: Provenance on DBpedia
Provenance on the Web of Data for the Social Web
Using detailed provenance information extracted from Wikipedia we are
able to compute provenance also for DBpedia resources.
Analysing the “diffs” between the revisions of Wikipedia articles and the
users' contributions, we identify the edits on Wikipedia that resulted in a
change in the related DBpedia resource.
We built a model and an application that shows provenance information for
each triple on DBpedia that is the result of users' edits on Wikipedia.
[Orlandi, Passant. Journal of Web Semantics. 2011]
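The idea of tracing DBpedia triples back to Wikipedia edits can be sketched as a set difference between the triples derived from consecutive article revisions. This is only an illustration under stated assumptions: the triples per revision are taken as already extracted (the extraction step itself is not shown), revisions are assumed to be in chronological order, and the names `Triple`, `Revision` and `attribute_changes`, the editors and the sample data are all hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


class Revision(NamedTuple):
    rev_id: int
    editor: str
    triples: frozenset  # triples assumed to be derived from this revision


def attribute_changes(revisions):
    """Diff consecutive revisions and attribute each change to its editor.

    Returns (rev_id, editor, change, triple) records, where change is
    "added" or "removed".
    """
    records = []
    previous = frozenset()
    for rev in revisions:
        for triple in sorted(rev.triples - previous):
            records.append((rev.rev_id, rev.editor, "added", triple))
        for triple in sorted(previous - rev.triples):
            records.append((rev.rev_id, rev.editor, "removed", triple))
        previous = rev.triples
    return records


# Illustrative data only.
hometown = Triple("dbr:Mastodon_(band)", "dbo:hometown", "dbr:Atlanta")
genre = Triple("dbr:Mastodon_(band)", "dbo:genre", "dbr:Heavy_metal_music")
history = [
    Revision(1, "alice", frozenset({hometown})),
    Revision(2, "bob", frozenset({hometown, genre})),
    Revision(3, "carol", frozenset({hometown})),
]

records = attribute_changes(history)
for record in records:
    print(record)
```

Each record links one triple-level change to the revision and editor that caused it, which is the kind of per-triple provenance the application exposes.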
22. Provenance for Profiling Interests
Different provenance features to support interest mining
Not only: authorship and temporal features
But also: social media source, object, type of action, …
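A provenance-aware piece of interest evidence could be represented as in the sketch below. The record layout is an assumption made for illustration: the class `InterestEvidence`, its field names and the sample values are hypothetical and only mirror the features listed above (authorship, time, social media source, object, type of action).

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class InterestEvidence:
    entity: str           # extracted concept of interest
    author: str           # who produced the content (authorship)
    created_at: datetime  # when it was produced (temporal feature)
    source: str           # social media source, e.g. "TW" or "FB"
    obj: str              # object acted upon, e.g. "post" or "page"
    action: str           # type of action, e.g. "retweet" or "like"


def count_by_feature(evidence):
    """Count evidence items per (source, action) provenance feature."""
    return Counter((e.source, e.action) for e in evidence)


# Illustrative data only.
evidence = [
    InterestEvidence("RDF", "user1", datetime(2013, 5, 1), "FB", "page", "like"),
    InterestEvidence("Mastodon", "user1", datetime(2013, 6, 2), "TW", "post", "retweet"),
    InterestEvidence("Volleyball", "user1", datetime(2013, 6, 9), "TW", "post", "retweet"),
]

counts = count_by_feature(evidence)
print(counts)
```

Keeping these fields on every extracted interest makes it possible to weight or filter interests per provenance feature later on.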
23. Provenance for Profiling Interests
User study: 27 users on Twitter and Facebook
They evaluated their aggregated and provenance-aware user profiles
Average scores from 1 to 5 (E = explicit, I = implicit; FB = Facebook, TW = Twitter):
Type  Source  Social feature     Score
E     FB      education          4.62
E     FB      workplace          4.60
I     TW      followees’ posts   4.03
I     FB      checkins           3.95
E     FB      interests          3.95
E     FB      likes              3.92
I     TW      favourite posts    3.76
I     TW      retweets           3.76
I     TW      posts              3.61
I     TW      replies            3.52
I     FB      status updates     3.50
I     FB      media actions      3.24
I     FB      comments           2.56
I     FB      direct posts       2.37
Locations, explicit profile info and also followees’ posts
provide better accuracy for mining user interests
Interests stated explicitly by users produce user profiles 20%
more accurate than interests derived implicitly
[Orlandi, Kapanipathi, Sheth, Passant. IEEE/ACM WI. 2013]
24. 2. Provenance of data for user profiling:
What is the role of provenance on the Social Web and on the Web of
Data and how to leverage its potential for user profiling?
3. Semantic enrichment of user profiles and personalisation:
How to combine data from the Social and Semantic Web for enriching
user profiles of interests and deploying them to different
personalisation tasks?
26. Example
Extracted entities (figure): Music, Heavy Metal, Mastodon (band), CEV Champions League, Volleyball, Semantic Web, RDF
Are all the extracted entities useful for personalisation?
How are concepts/entities being used on the Social Web? (Pragmatics)
Annotations on the entities (figure):
Very abstract, very popular
Specific and time-dependent on events, etc. (two entities)
Abstract and not popular
Abstract and popular
Specific and not popular
Very popular
27. Characterising Concepts of Interest
Novel measures for the characterisation and semantic expansion of
concepts of interest
Enrichment of entity-based user profiles for personalisation
Popularity of concepts on the Social Web (using Twitter)
How popular is an entity on the Social Web? How frequently is it
mentioned/used at that point in time?
Trend and temporal dynamics (using Wikipedia page views)
The trend and evolution of the frequency of mentions of an entity on
the Social Web (i.e. popularity over time)
Specificity and categorisation of entities of interest (using LOD)
The level of abstraction that an entity has in a common conceptual
schema shared by humans
28. Requirements
Use case: real-time personalisation of Social Web streams
1. Real-time computation of the dimensions
2. Results constantly up to date with the real world
3. Knowledge base and domain independent approach
30. Real-time Semantic Personalisation of
Social Web Streams
“SPOTS”: A methodology for real-time personalisation of any large
social stream
Automatic dynamic generation of multi-source user profiles of interests.
Semantic enrichment of concepts of interest with provenance and Linked
Data info.
Ranking and selection of the interests according to their relevance for the
user and for the personalisation use case.
Informativeness measures for posts to filter a large social stream.
Evaluation of the approach on the public Twitter stream
Against Twitter #Discover: from 192% increase in accuracy
31. Real-time Semantic Personalisation of
Social Web Streams
[Kapanipathi, Orlandi, Sheth, Passant. SPIM at ISWC 2011.]
32. Evaluation on SPOTS
User study to evaluate the impact of the enrichment on a
personalisation use case
27 users, 800 user ratings collected
Main outcome:
Popularity and Temporal Dynamics are useful measures for real-time
personalisation
SPOTS                    Improvement*
No Enrichment            ---
Trendy                   +29%
Not Stable               +26%
At Least 2 Features      +9%
Specific + Not Popular   +5%
* In recommendation accuracy over non-enriched profiles
33. Evaluation on User Profiles
User study to evaluate the impact of the enrichment on user profiles
according to users’ judgement
27 users, 800 user ratings collected
Main outcome:
Specificity is more useful than popularity measures according to user perception
User Profiles                 Improvement*
No Enrichment                 ---
Not Specific + Not Popular    +13%
Not Specific                  +8%
Not Popular                   +2%
Stable + Not Trendy           +1%
* In profile accuracy over non-enriched profiles
35. Summary
We provide and evaluate a complete methodology for profiling user
interests across multiple sources on the Social Web
Collect, Represent, Aggregate, Mine, Enrich, Deploy
Aggregation of user data:
• Semantic representation of Social Web content and user activities
Provenance of data:
• Improves profiling accuracy and connects Social Web and WoD
Mining of user interests:
• Provenance + Linked Data/Entity-based strategies + time decay, outperform
traditional “bag-of-words” strategies and facilitate enrichment
Semantic enrichment:
• Improves profiling accuracy and is necessary for deploying the
profiles in a personalisation use case
• Different types of personalisation need different entities of interest
36. Future Work
Federated Personal Data Manager
Privacy-aware, interoperable, autonomous,
user profiling infrastructure
Provenance at Web Scale
Necessary to focus on techniques for an easier and less expensive tracking and
management of provenance on the Social Semantic Web
Adaptive Profiling of User Interests
Adaptation of the profiling algorithm and strategy according to the application and
the context
37. Contributions & Dissemination
Semantic Web modelling solutions for Social Web data, user
profiles, provenance on the Social Web and Web of Data.
A provenance computation framework
Novel measures for characterising entities of interest
A real-time personalisation system for large Social Web streams
User studies for different profiling strategies, provenance features
and personalisation use-cases
A privacy-aware user profile management system
Publications
2 journal, 4 conference, 2 workshop papers
Thanks!
39. Context
User Modelling
• The process of representing a user or some of his/her
characteristics (e.g. interests, workplace, location, etc.)
User Profile
• A characterisation of a user at a particular point of time
40. Experiment
6 types of user profiles evaluated:
2 types of DBpedia entities
Categories vs. Resources
2 types of weighting-scheme for category-based methods
Cat1: Interests Weight Propagation
Cat2: Interests Weight Propagation w/ Cat. Discount
2 types of exponential Time Decay function
Short mean lifetime: 120 days
Long mean lifetime: 360 days
41. Experiment
6 types of user profiles evaluated:
Res (DBpedia Resources): Res-120, Res-360
Cat (DBpedia Categories):
Cat1: Cat1-120, Cat1-360
Cat2: Cat2-120, Cat2-360
42. User-based Evaluation
We asked users to rate the top 10 interests generated for each of
the 6 profiling strategies
Question:
“Please rate how relevant is each concept for representing your
personal interests and context…”
Rating:
0 (not at all or don't know), 1 (low), 2, 3, 4, 5 (high)
Rating converted to a (0…10) scale
Performance evaluated with:
MRR (Mean Reciprocal Rank)
P@10 (Precision at K = 10)
Comparison with a Baseline
A traditional approach based on “keyword frequency”
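The two metrics named above can be sketched as follows. This is a minimal illustration under assumptions the slides do not state: the 0 to 5 rating is mapped to the 0 to 10 scale by doubling, and an interest counts as relevant when its converted rating reaches a threshold of 6. All function names and the sample ratings are hypothetical.

```python
def to_ten_scale(rating):
    """Map a 0..5 user rating to 0..10 (assumed to be a linear mapping)."""
    return rating * 2


def precision_at_k(ratings, k=10, threshold=6):
    """Fraction of the top-k ranked interests rated at or above threshold."""
    return sum(1 for r in ratings[:k] if r >= threshold) / k


def reciprocal_rank(ratings, threshold=6):
    """1 / rank of the first relevant interest, or 0.0 if none is relevant."""
    for rank, rating in enumerate(ratings, start=1):
        if rating >= threshold:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(profiles, threshold=6):
    """Average reciprocal rank over several ranked profiles."""
    return sum(reciprocal_rank(p, threshold) for p in profiles) / len(profiles)


# Hypothetical ratings for one user's top-10 interests, in ranked order.
raw = [1, 4, 5, 0, 3, 2, 5, 4, 1, 3]
profile = [to_ten_scale(r) for r in raw]

p_at_10 = precision_at_k(profile)
mrr = mean_reciprocal_rank([profile, [10, 0, 0]])
print(p_at_10, mrr)
```

P@10 rewards profiles with many relevant interests in the top 10, while MRR rewards placing the first relevant interest near the top of the ranking.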
44. Evaluation
On average for:
200 Tweets & 200 Facebook posts, and items.
~106 interests – DBpedia Resources
~720 interests – DBpedia Categories (~7 times)
Statistical significance for:
Resources vs. Categories (p<0.05)
Any method vs. Baseline (p<0.05)
Not for time decay (p~0.2) and Cat1 vs. Cat2