Extracting emerging knowledge from social media - WWW2017

Extracting Emerging Knowledge
from Social Media
Marco Brambilla, Stefano Ceri, Emanuele Della Valle,
Riccardo Volonterio, Felix Acero Salazar
marco.brambilla@polimi.it
@marcobrambi
WWW 2017, Perth, Australia

Humans aim at
formalizing
knowledge

Ontology is the philosophical study of
the nature of being, becoming,
existence or reality
and the basic categories of being and their
relations.

Formalizing new knowledge is hard
Only high frequency emerges
The long tail challenge

There are more things
In heaven and earth, Horatio,
Than are dreamt of in your philosophy.
Shakespeare (Hamlet Act 1, scene 5)

The Answer to the Great Question...
Of Life, the Universe and Everything
[Figure: the Data, Information, Knowledge, Wisdom hierarchy. Axes: context independence and understanding. Transitions: understanding relations, understanding patterns, understanding principles.]

Our focus: The Evolving Knowledge
[Figure: regions of knowledge labeled a, b, c, ¬c and d, annotated as: known, social, factoid, actual and solid, potentially emerging, potentially decaying.]

Heaven and Earth
How to peer into the world
through an effective window?
TWO INGREDIENTS
Social media – the data
Domain experts – the context

Can we use social media to discover and codify
emerging knowledge?

Knowledge Enrichment Setting
[Figure: a two-layer knowledge graph. The Types layer contains Type1, Type11, Type111 and Type2; the Instances layer contains High Frequency Entities (HF Entity1 to HF Entity5) and Low Frequency Entities (LF Entity1 to LF Entity4). Instances are linked to types by <<instanceof>>; the links and values marked ?? are the unknowns.]
Legend: Seed Entity, Seed Type, Type of interest, Expert inputs
Enrichment problems:
Extraction of new LF entities
Typing of LF entities
Relations HF - LF entities
Relations LF - LF entities
Finding attribute values (Property1, Property2)

Input (1): Domain Specific Types
Types selected by the expert
Relevant for the domain

Input (2): Seeds (emerging entities)
Known and selected by the domain expert
Belonging to an expert type
Thoroughly Described

Objectives
(1) Discover candidate unknown emerging entities
(2) Determine the relevance of the candidate
(3) Determine the type of the candidate

Step (1): Social Media Sourcing
Collect content produced by the seeds

Step (2): Candidate Extraction
Potentially any entity extracted from the social
streams of the seeds
Resulting in huge sets of candidates
Our hypothesis: take only social network users (@ handles) as candidates
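A minimal sketch of this extraction step, assuming the seeds' posts are already available as plain strings. The function name `extract_candidates`, the mention regex, and the choice to exclude the seeds themselves from the candidate set are illustrative assumptions, not details given in the slides.

```python
import re
from collections import Counter

# Assumption: a handle is "@" followed by word characters, and is not
# preceded by a word character (so e-mail addresses are not matched).
MENTION = re.compile(r"(?<!\w)@(\w+)")


def extract_candidates(posts_by_seed):
    """Collect the user handles mentioned in the seeds' posts.

    posts_by_seed maps a seed handle to the list of its post texts.
    Returns (ttf, seeds_of): total mentions per candidate, and the set
    of seeds that mention each candidate.
    """
    seeds = {seed.lower() for seed in posts_by_seed}
    ttf = Counter()
    seeds_of = {}
    for seed, posts in posts_by_seed.items():
        for text in posts:
            for handle in MENTION.findall(text):
                handle = handle.lower()
                if handle in seeds:  # assumption: seeds are not candidates
                    continue
                ttf[handle] += 1
                seeds_of.setdefault(handle, set()).add(seed)
    return ttf, seeds_of
```

The two returned structures are the raw counts a later pruning step would need: how often a candidate occurs, and with how many seeds it co-occurs.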

Step (3): Candidate Pruning
Initial pruning of candidates based on
TF-DF := df * ttf / (N - df + 1)
where:
df = number of seeds with which a candidate co-occurs;
ttf = total number of times a candidate occurs in the analyzed content;
N = number of seeds.
Ranking + threshold
(*) TF-DF is a variant of TF-IDF that rewards document frequency instead of discounting it: a candidate that appears with many seeds is more interesting, not less
(we are not looking for information entropy!)
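The pruning score can be sketched directly from the formula above. The function names `tf_df` and `prune`, the input layout, and the threshold value in the usage note are assumptions for illustration; the slides do not state which threshold was used.

```python
def tf_df(df, ttf, n_seeds):
    """TF-DF score: df * ttf / (N - df + 1).

    df      -- number of seeds the candidate co-occurs with
    ttf     -- total occurrences of the candidate in the analyzed content
    n_seeds -- N, the number of seeds
    """
    if not 0 <= df <= n_seeds:
        raise ValueError("df must be between 0 and the number of seeds")
    return df * ttf / (n_seeds - df + 1)


def prune(stats, n_seeds, threshold):
    """Rank candidates by TF-DF and keep those at or above the threshold.

    stats maps a candidate to its (df, ttf) pair.
    Returns a list of (candidate, score), highest score first.
    """
    scored = {c: tf_df(df, ttf, n_seeds) for c, (df, ttf) in stats.items()}
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return [(c, score) for c, score in ranked if score >= threshold]
```

With 22 seeds, a candidate mentioned 5 times by all 22 seeds scores 22 * 5 / 1 = 110, while one mentioned 3 times by a single seed scores 1 * 3 / 22, so broad co-occurrence dominates the ranking.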

Step (4): Candidate Description
Repeat social media sourcing for candidates
A potentially good candidate is one that behaves
similarly to one or more of the seeds
Our hypothesis: a good candidate talks about the same things as the seeds

Step (5): Candidate Ranking
Rank candidates by their distance from the seed centroid
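A minimal sketch of centroid-based ranking, assuming each seed and candidate is described by a sparse feature vector stored as a dict from feature to weight. Cosine distance is used here only as one plausible choice: the slides mention that three distance functions were analyzed but do not name them. All function names are hypothetical.

```python
import math


def centroid(vectors):
    """Average a non-empty list of sparse vectors (dicts feature -> weight)."""
    if not vectors:
        raise ValueError("at least one seed vector is required")
    total = {}
    for vec in vectors:
        for feature, weight in vec.items():
            total[feature] = total.get(feature, 0.0) + weight
    return {feature: weight / len(vectors) for feature, weight in total.items()}


def cosine_distance(a, b):
    """1 - cosine similarity; 1.0 when either vector is all zeros."""
    dot = sum(weight * b.get(feature, 0.0) for feature, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def rank_candidates(seed_vectors, candidate_vectors):
    """Return candidate names sorted from closest to farthest from the centroid."""
    center = centroid(seed_vectors)
    return sorted(
        candidate_vectors,
        key=lambda name: cosine_distance(candidate_vectors[name], center),
    )
```

Candidates that share many features with the seeds end up at the top of the list; a candidate with no feature in common gets the maximum distance of 1.0.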

Step (6): Feature selection
Purely syntactic
only user handles (accounts)
handles and hashtags
Semantic:
based on entity extraction / DBpedia
based on deep learning on images / Clarifai
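The two purely syntactic options can be sketched as simple counting over a user's posts. This is an illustration only: the regular expressions, the lowercasing, and the `syntactic_features` name are assumptions, and the semantic options (entity extraction, image analysis) are not covered because they depend on external services.

```python
import re
from collections import Counter

# Assumption: tokens start with "@" or "#" and are not glued to a preceding word.
HANDLE = re.compile(r"(?<!\w)@\w+")
HASHTAG = re.compile(r"(?<!\w)#\w+")


def syntactic_features(posts, use_hashtags=True):
    """Build a feature vector (Counter) from a user's posts.

    use_hashtags=False -> only user handles
    use_hashtags=True  -> handles and hashtags
    """
    features = Counter()
    for text in posts:
        features.update(h.lower() for h in HANDLE.findall(text))
        if use_hashtags:
            features.update(h.lower() for h in HASHTAG.findall(text))
    return features
```

The same function would be applied to seeds and candidates alike, so that both live in one feature space.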

Step (6): Semantic Feature selection for text
9 basic strategies
Generating 18 combinations of T + E strategies

990 semantic strategies evaluated (18 x 11 x 5):
18 alternative feature vectors
11 different weighting values for aggregations
5 levels of recall for entity extraction
(+ 3 different distance functions analyzed)
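The count of 990 is the size of the full grid 18 x 11 x 5. The sketch below enumerates such a grid; the concrete values are placeholders (the slides give only the number of options on each axis, not the options themselves), and `strategy_grid` is a hypothetical name.

```python
from itertools import product


def strategy_grid():
    """Enumerate every (feature vector, weight, recall level) combination."""
    # Placeholder identifiers: only the counts (18, 11, 5) come from the slides.
    feature_vectors = [f"fv{i:02d}" for i in range(1, 19)]  # 18 feature vectors
    weights = [i / 10 for i in range(11)]                   # 11 values (assumed 0.0 to 1.0)
    recall_levels = [1, 2, 3, 4, 5]                         # 5 recall levels
    return list(product(feature_vectors, weights, recall_levels))


strategies = strategy_grid()
print(len(strategies))
```

Each tuple would then be evaluated once per scenario; the distance function is a separate choice on top of this grid.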

Experiments
Fashion Brands
Writers
Exhibitions

Emerging Australian Writers – 22 seeds
Source: http://www.emergingwritersfestival.org.au/ (festival held in June in Melbourne)

Emerging Australian Writers
Weighting parameter
Entity extraction recall

Emerging Australian Writers
Precision @ K for two strategies
EHE—AST CHE—AST
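Precision @ K is the fraction of the top K ranked candidates that are judged relevant. A minimal sketch, assuming a ranked list of candidates and a set of relevant ones provided by expert judgment; the names are hypothetical.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the first k ranked candidates that are relevant.

    The denominator is always k, so a ranking shorter than k is
    penalized for the missing positions.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for candidate in ranked[:k] if candidate in relevant)
    return hits / k
```

Plotting this value for increasing K, once per strategy, gives the kind of comparison curve the slide refers to.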

Cross-scenario
39 strategies always outperform the syntactic one
Scenarios: Writers, Expo, Fashion

Conclusions
Extraction of relevant emerging entities
Top, Fast and Reliable are the important
Off-the-shelf or as-a-service tools

Challenges ahead
Repeatability in time (years!)
Recursion (candidates to seeds)
Multi-source data collection
Multiple types
Emerging relations
Emerging types

You can try it yourself!
http://datascience.deib.polimi.it/social-knowledge

THANKS!
QUESTIONS?
Marco Brambilla, Stefano Ceri, Emanuele Della Valle, Riccardo Volonterio, Felix Acero Salazar
Extracting Emerging Knowledge from Social Media
Marco Brambilla @marcobrambi marco.brambilla@polimi.it
http://datascience.deib.polimi.it http://home.deib.polimi.it/marcobrambi
