Presented at the 4th DEOS workshop, http://diadem.cs.ox.ac.uk/deos13/
Social media presents itself as a context-rich source of big data, readily exhibiting volume, velocity and variety. Mining information from microblogs and other social media is a challenging, emerging research area. Unlike carefully authored news text and other longer content, social media text poses a number of new challenges, due to its short, noisy, context-dependent, and dynamic nature.
This talk will discuss firstly how Linked Open Data (LOD) vocabularies (namely DBpedia and YAGO) have been used to help entity recognition and disambiguation in such content. We will introduce LODIE, the LOD-based extension of the widely used ANNIE open-source entity recognition system. LODIE also includes entity disambiguation (covering products, as well as names of persons, locations, and organisations) and has been developed as part of the TrendMiner and uComp projects. Quantitative evaluation results will be shown, including a comparison against other state-of-the-art methods and an analysis of how errors in upstream linguistic pre-processing (i.e. tokenisation and POS tagging) can affect disambiguation performance. Our results demonstrate the importance of adjusting approaches for this genre.
The second half of the talk will focus on fine-grained events in tweets. Awareness of temporal context in social media enables many interesting applications. We identify events using the TimeML schema, focusing on occurrences and actions. Challenges of event annotation will be discussed, as well as the development of a supervised event extractor specifically for social media. We evaluate this against traditional event annotation approaches (e.g. Evita, TIPSem).
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extraction
1. Mining Social Media with Linked Open Data,
Entity Recognition and Event Extraction
Leon Derczynski
Kalina Bontcheva
Third Workshop on Data Extraction and Object Search,
Oxford,
7 July 2013
2.
3. Social Media = Big Data
Gartner "3V" definition:
1. Volume
2. Velocity
3. Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber
4. What resources do we have now?
Large, content-rich, connected, digital streams of human discourse
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
"You've only done that which you can communicate"
The metadata (time – place – imagery) gives a richer resource:
→A sampling of human behaviour
6. Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
L. Derczynski, D. Maynard, N. Aswani, K. Bontcheva. "Microblog-Genre Noise and Impact on Semantic Annotation Accuracy." 24th ACM Conference on Hypertext and Social Media. 2013
Newswire:
London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
Microblog:
Gotta dress up for london fashion week and party in style!!!
7. NER difficulties
Rule-based systems get the bulk of entities (newswire 77% F1)
ML-based systems do well at the remainder (newswire 89% F1)
Small proportion of difficult entities
Many complex issues
Using improved pipeline:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
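The slide above contrasts rule-based and ML approaches on microblog text. As a minimal sketch of why simple rules can cut through microblog noise, here is a toy case-insensitive gazetteer matcher in Python; the gazetteer entries and matching logic are illustrative only, not ANNIE's actual grammars or lists:

```python
import re

# Toy gazetteer; real systems (e.g. ANNIE) use large curated lists.
GAZETTEER = {
    "london": "LOC",
    "london fashion week": "EVENT",
    "lincoln park": "LOC",
}

def rule_based_ner(text):
    """Longest-match gazetteer lookup, case-insensitive so that
    lowercased microblog text ('london fashion week') still matches."""
    found = []
    lowered = text.lower()
    # Try longer entries first so 'london fashion week' beats 'london'.
    for entry in sorted(GAZETTEER, key=len, reverse=True):
        for m in re.finditer(re.escape(entry), lowered):
            span = (m.start(), m.end())
            # Skip spans already covered by a longer match.
            if not any(s <= span[0] < e or s < span[1] <= e
                       for s, e, _ in found):
                found.append((span[0], span[1], GAZETTEER[entry]))
    return sorted(found)

print(rule_based_ner("Gotta dress up for london fashion week and party in style!!!"))
# → [(19, 38, 'EVENT')]
```

Case-insensitive longest-match lookup is what lets the rule survive the missing capitalisation that trips up ML taggers trained on well-edited text.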
8. Word-level linking performance
Dataset: Ritter NER + DBpedia URIs
Detect mentions of entity in tweets
Crowdsourced annotations
Expert gold standard
Discard after disagreement or ambiguity
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 81%
9. Word-level linking issues
Automatic annotation:
Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter!
Actual:
Branching out from Lincoln park after dark(PROD) ... Hello "Russian Navy(PROD)", it's like the same thing but with glitter!
Clue in unusual collocations
10. LODIE: LOD-based Inf. Extr.
Uses DBpedia as reference knowledge graph
Why DBpedia?
Regularly updated (from Wikipedia)
Good source for named entities
A hierarchy of concepts
A capital is also a city, but not vice versa
Relations between concepts
Paris locatedIn France
ParisHilton bornIn NewYorkCity
Demo: http://demos.gate.ac.uk/trendminer/obie/
11. LODIE: LOD-based Inf. Extr.
We increase recall by:
Deriving abbreviations from link anchor texts in Wikipedia
"She was born in <a href="New_York_(city)">NYC</a>"
Rank boosting terms using redirect pages
Matching NE candidates using wildcard queries (e.g. Burton upon Trent vs. Burton-on-Trent)
This makes disambiguation harder (lowers precision)
Use naive string, latent semantic, and contextual similarity metrics +
URI commonness to disambiguate
This is what achieved our good results!
Demo: http://demos.gate.ac.uk/trendminer/obie/
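The disambiguation step above can be sketched as a weighted combination of URI commonness and context overlap. This is a minimal illustration only: the candidate lists, context sets, and weights are invented, and the real LODIE system also uses latent semantic similarity, which a set-overlap score merely stands in for:

```python
# Hypothetical candidate URIs for the surface form "Paris", with
# commonness = fraction of Wikipedia anchors using this text for the URI.
CANDIDATES = {
    "Paris": [("dbpedia:Paris", 0.85), ("dbpedia:Paris_Hilton", 0.10)],
}

# Toy context words for each URI (a crude stand-in for LSA vectors).
URI_CONTEXT = {
    "dbpedia:Paris": {"france", "city", "seine", "capital"},
    "dbpedia:Paris_Hilton": {"heiress", "celebrity", "hotel"},
}

def disambiguate(mention, context_words, alpha=0.5):
    """Score each candidate as alpha * commonness + (1 - alpha) * overlap."""
    best, best_score = None, -1.0
    for uri, commonness in CANDIDATES.get(mention, []):
        overlap = len(context_words & URI_CONTEXT[uri]) / max(len(context_words), 1)
        score = alpha * commonness + (1 - alpha) * overlap
        if score > best_score:
            best, best_score = uri, score
    return best

tweet = {"visiting", "france", "capital"}
print(disambiguate("Paris", tweet))  # → dbpedia:Paris
```

The commonness prior keeps precision up after recall-boosting expansion: rare readings only win when the context strongly supports them.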
12. Social media contains events
How are events differently described in social media and news?
Conventional docs (e.g. newswire) have contextual info
Central event in distinct document segment (e.g. headline)
Location
Actors / participants
Causes
Outcomes
Similar prior events
This kind of description not found in social media
No editing guidelines
Often limited message length
Instead, event facets are represented sparsely
Only 1-2 facets per message about the event
13. Event extraction
Social media streams are punctuated with descriptions of events
… Accompanied by event facets
"Obama is visiting Russia"
"The US president has not visited Putin before"
Many viewpoints on the same temporal entity
(like triples)
How can we extract these?
We use the TimeML definitions of events in text:
Minimal lexicalisation (i.e. annotate one word)
Event classes: we focus on ACTIONs and OCCURRENCEs
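In TimeML, minimal lexicalisation means only the event's head word is marked, with an EVENT element carrying a class attribute. A small sketch (toy regex extraction for illustration; a real pipeline like Evita uses proper XML parsing and linguistic analysis):

```python
import re

# A TimeML-style fragment: minimal lexicalisation means only the
# head word ("visiting") is annotated, together with its event class.
timeml = 'Obama is <EVENT class="OCCURRENCE">visiting</EVENT> Russia'

# Pull out (class, word) pairs from the markup.
events = re.findall(r'<EVENT class="(\w+)">(\w+)</EVENT>', timeml)
print(events)  # → [('OCCURRENCE', 'visiting')]
```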
14. Event extraction
How can we extract event mentions?
Conventional approaches are hybrid:
Statistical learning
Syntactic structures
Existing TimeML resources
TimeBank corpus (newswire)
Evita event extraction tool
Adapting to social media text
Negatively impacted by problems with NER
Short sentence structure
→ Use shallow linguistic techniques and fuzzy matches
Evita: F1 80.1
TIPSem: F1 81.4 (on well-formed text)
USFD Arcomem: F1 81.1 (noise-resilient)
15. LOD for event reassembly
What is needed to reassemble events from social media?
Identify mentions of the same event
Collect facets and integrate them
LOD gives unique identifiers for facet values
Many possible lexicalisations for the same event (run, control)
Identify co-referring mentions through:
Shared actors
Consistent facets (i.e. non-conflicting)
Lexical event similarity (e.g. WordNet)
This helps:
Cluster mentions of the same event
Agglomerate facets
Final product: Event description grounded in linked open data
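The reassembly steps above can be sketched as greedy agglomeration of event mentions; the mention tuples, actor URIs, and the crude prefix-match standing in for WordNet similarity are all illustrative assumptions:

```python
# Toy event mentions: (event word, set of LOD-grounded actor URIs).
mentions = [
    ("visiting", {"dbpedia:Barack_Obama", "dbpedia:Russia"}),
    ("visit", {"dbpedia:Barack_Obama", "dbpedia:Vladimir_Putin"}),
    ("earthquake", {"dbpedia:Japan"}),
]

def same_event(m1, m2):
    """Co-reference heuristic: at least one shared actor, plus lexically
    similar event words (prefix match stands in for WordNet similarity)."""
    shared = m1[1] & m2[1]
    lexical = m1[0][:4] == m2[0][:4]
    return bool(shared) and lexical

# Agglomerate: greedily merge each mention into the first matching cluster.
clusters = []
for m in mentions:
    for c in clusters:
        if any(same_event(m, other) for other in c):
            c.append(m)
            break
    else:
        clusters.append([m])

print(len(clusters))  # → 2: the two "visit" mentions merge, earthquake stays apart
```

Grounding actors in LOD URIs is what makes the shared-actor test reliable: "Obama" and "the US president" can resolve to the same identifier before clustering.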