SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Microblog-Genre Noise and
Impact on Semantic Annotation Accuracy
Leon Derczynski
Diana Maynard
Niraj Aswani
Kalina Bontcheva
Evolution of communication
Functional utterances
Vowels
Velar closure: consonants
Speech
New modality: writing
Digital text
E-mail
Social media
Increased
machine-
readable
information
??
Social Media = Big Data
Gartner ''3V'' definition:
1.Volume
2.Velocity
3.Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber
What resources do we have now?
Large, content-rich, linked, digital streams of human communication
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
''You've only done that which you can communicate''
The metadata (time – place – imagery) gives a richer resource:
→A sampling of human behaviour
Linking these resources
Why is it useful to link this data?
Identifying subjects, themes, … ''entities''
What's the text about?
Who'd be interested in it?
How can we summarise it?
- very important, considering information overload in soc med!
Typical annotation pipeline
Language ID
Tokenisation
PoS tagging
Text
Typical annotation pipeline
Named entity recognition
dbpedia.org/resource/.....
Michael_Jackson
Michael_Jackson_(writer)
Linking entities
Pipeline cumulative effect
Good performance is important at each stage – not just entity linking
Language ID
LADY GAGA IS BETTER THE 5th TIME OH BABY(:
The Jan. 21 show started with the unveiling of an
impressive three-story castle from which Gaga
emerges. The band members were in various
portals, separated from each other for most of the
show. For the next 2 hours and 15 minutes, Lady
Gaga repeatedly stormed the moveable castle,
turning it into her own gothic Barbie Dreamhouse
Newswire:
Microblog:
Language ID difficulties
General accuracy on microblogs: 89.5%
Problems include switching language mid-text:
je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt,
het is duidelijk, ze bedekken hun gezicht. Get over it
New info in this format:
Metadata:
spatial information
linked URLs
Emoticons:
:) vs. ^_^
cu vs. 88
Accuracy when customised to genre: 97.4%
Tokenisation
General accuracy on microblogs: 80%
Goal is to convert byte stream to readily-digestible word chunks
Word bound discovery is a critical language acquisition task
The LIBYAN AID Team successfully shipped these broadcasting
equipment to Misrata last August 2011, to establish an FM Radio
station ranging 600km, broadcasting to the west side of Libya to
help overthrow Gaddafi's regime.
RT @JosetteSheeran: @WFP #Libya breakthru! We move
urgently needed #food (wheat, flour) by truck convoy into
western Libya for 1st time :D
Tokenisation difficulties
Not curated, so typos
Improper grammar – e.g. apostrophe usage; live with it!
doesnt → doesnt
doesn't → does n't
Smileys and emoticons
I <3 you → I & lt ; you
This piece ;,,( so emotional → this piece ; , , ( so emotional
Loss of information (sentiment)
Punctuation for emphasis
*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *
Words run together
Tokenisation fixes
Custom tokeniser!
Apostrophe insertion
Slang unpacking
Ima get u → I 'm going to get you
Emoticon rules
- if we can spot them, we know not to break them
Customised accuracy on microblogs: 96%
Part of speech tagging
Goal is to assign words to classes (verb, noun etc)
General accuracy on newswire:
97.3% token, 56.8% sentence
General accuracy on microblogs:
73.6% token, 4.24% sentence
Sentence-level accuracy important:
without whole sentence correct, difficult to extract syntax
Part of speech tagging difficulties
Many unknowns:
Music bands
Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010
Places
#LB #news: Silverado Park Pool Swim Lessons
Capitalisation way off
@thewantedmusic on my tv :) aka derek
last day of sorting pope visit to birmingham stuff out
Slang
~HAPPY B-DAY TAYLOR !!! LUVZ YA~
Orthographic errors
dont even have homwork today, suprising ?
Dialect
Shall we go out for dinner this evening?
Ey yo wen u gon let me tap dat
Part of speech tagging fixes
Slang dictionary for ''repair''
Won't cover previously-unseen slang
In-genre labelled data
Expensive to create!
Leverage ML
Existing taggers can handle unknown words
Maximise use of these features!
General accuracy on microblogs:
73.6% token, 4.24% sentence
Accuracy when customised to genre:
88.4% token, 25.40% sentence
Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
Newswire:
Microblog:
Gotta dress up for london fashion week and party in
style!!!
London Fashion Week grows up – but mustn't take
itself too seriously. Once a launching pad for new
designers, it is fast becoming the main event. But
LFW mustn't let the luxury and money crush its
sense of silliness.
NER difficulties
Rule-based systems get the bulk of entities (newswire 77% F1)
ML-based systems do well at the remainder (newswire 89% F1)
Small proportion of
difficult entities
Many complex issues
Using improved pipeline:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
NER on Facebook
Longer texts than tweets
Still has informal tone
MWEs are a problem!
- all capitalised:
Green Europe Imperiled as Debt Crises
Triggers Carbon Market Drop
Difficult, though easier than Twitter
Maybe due to the possibility of including more verbal context?
Entity linking
Goal is to find out which entity a mention refers to
''The murderer was Professor Plum, in the Library,
with the Candlestick!''
Which Professor Plum?
Disambiguation is through connecting text to the web of data
dbpedia.org/resource/Professor_Plum_(astrophysicist)
Two tasks:
- Whole-text linking
- Entity-level linking
''Aboutness''
Goal: answer ''What entities is this text about?''
Good for tweets:
Lack of lexicalised context
Not all related concepts are in the text
Helpful for summarisation
No concern for entity bounds; (finding them is tough in microblog!)
* but *
Added concern for themes in text
e.g. ''marketing'', ''US elections''
Aboutness performance
Corpus:
from Meij et al. ''Adding semantics to microblog posts.''
468 tweets
From one to six concepts per tweet
DBpedia spotlight: highest recall (47.5)
TextRazor: highest precision (64.6)
Zemanta: highest F1 (41.0)
Zemanta tuned for blog entries – so compensates for some noise
Word-level linking
Goal is to link an entity
Given:
The entity mention
Surrounding microblog context
No corpora exist for this exact task:
Two commercially produced ones
Policy says ''no sharing''
How can we approach this key task?
Word-level linking performance
Dataset: Replab
Task is to determine relatedness-or-not
Six entities given
Few hundred tweets per entity
Detect mentions of entity in tweets
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 around 70
Word-level linking issues
NER errors
Missed entities damages / destroys linking
Specificity problems
Lufthansa Cargo
Lufthansa Cargo
Which organisation to choose?
Require good NER
Direct linking chunking reduces precision:
Apple trees in the home garden bit.ly/yOztKs
Pipeline NER does not mark Apple as entity here
Lack of disambiguation context is a problem!
Word-level linking issues
Automatic annotation:
Branching out from Lincoln park(LOC) after dark ... Hello "Russian
Navy(ORG)", it's like the same thing but with glitter!
Actual:
Branching out from Lincoln park after dark(ORG) ... Hello "Russian
Navy", it's like the same thing but with glitter!
Clue in unusual collocations
+ ?
Whole pipeline: how to fix?
Common genre problems centre on mucky, uncurated text
Orth error
Slang
Brevity
Condensed
Non-Chicago punctuation..
Maybe clearing up this will improve performance?
Normalisation
General solution for overcoming linguistic noise
How to repair?
1. Gazetteer (quick & dirty); or..
2. Noisy channel model
Task is to ''reverse engineer'' the noise on this channel
Brown clustering; double metaphone; auto orth correction
An honest, well-formed sentence u wot m8 biber #lol
Normalisation performance
NER on tweets:
Rule-based
No normalisation F1 80%
Gazetter normalisation F1 81%
Noisy channel F1 81%
ML-based
No normalisation F1 49.1%
Gazetter normalisation F1 47.6%
Noisy channel F1 49.3%
Negligible performance impact, and introduces errors!
Sentiment change:
undisambiguable → disambiguable
Meaning change:
She has Huntington's → She has Huntingdon's
Future directions
MORE DATA!
and better..
no IAA for many resources
Maybe from the crowd?
MORE CONTEXT!
Not just linguistic
Microblog has host of metatdata
Explicit:
Time, Place, URIs, Hashtags
Implicit:
Friend network
Previous messages
Thank you!
Thank you for listening!
Do you have any questions?

Weitere ähnliche Inhalte

Andere mochten auch

Web Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMAWeb Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMACynthia Typaldos
 
Web Communities With RelationSys And D2C
Web Communities With RelationSys And D2CWeb Communities With RelationSys And D2C
Web Communities With RelationSys And D2CDavid Terrar
 
Microblog,change the world
Microblog,change the worldMicroblog,change the world
Microblog,change the worldJinGui LI
 
Building Web Communities That Add Value
Building Web Communities That Add ValueBuilding Web Communities That Add Value
Building Web Communities That Add ValueDavid Terrar
 
Microblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in ChinaMicroblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in ChinaMedia MIC
 
Distributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of DataDistributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of DataMarko Rodriguez
 
SP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiSP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiJohn Breslin
 
Networkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCNetworkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCGilad Lotan
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQLMike Crabb
 

Andere mochten auch (13)

Web Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMAWeb Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMA
 
Web Communities With RelationSys And D2C
Web Communities With RelationSys And D2CWeb Communities With RelationSys And D2C
Web Communities With RelationSys And D2C
 
Microblog,change the world
Microblog,change the worldMicroblog,change the world
Microblog,change the world
 
Building Web Communities That Add Value
Building Web Communities That Add ValueBuilding Web Communities That Add Value
Building Web Communities That Add Value
 
Microblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in ChinaMicroblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in China
 
Distributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of DataDistributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of Data
 
SP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiSP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with Gephi
 
Networkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCNetworkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYC
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium Presentation
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Web Mining
Web Mining Web Mining
Web Mining
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQL
 

Ähnlich wie Microblog-genre noise and its impact on semantic annotation accuracy

Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...Leon Derczynski
 
Soylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd InsideSoylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd InsideMichael Bernstein
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLLawrie Hunter
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1Sara Hooker
 
Collaborative Ontology Building Project
Collaborative Ontology Building Project  Collaborative Ontology Building Project
Collaborative Ontology Building Project Jie Bao
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleMichael Bernstein
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part IDelip Rao
 
State of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendState of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendEgor Pushkin
 
Best Argument Essay Topics
Best Argument Essay TopicsBest Argument Essay Topics
Best Argument Essay TopicsJessica Miller
 
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...Maryam Farooq
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsForward Gradient
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
Publish Once, Brand Everywhere
Publish Once, Brand EverywherePublish Once, Brand Everywhere
Publish Once, Brand Everywherelaurendreier
 
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...ABES
 

Ähnlich wie Microblog-genre noise and its impact on semantic annotation accuracy (20)

Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
 
Soylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd InsideSoylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd Inside
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1
 
Collaborative Ontology Building Project
Collaborative Ontology Building Project  Collaborative Ontology Building Project
Collaborative Ontology Building Project
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the people
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part I
 
State of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendState of NLP and Amazon Comprehend
State of NLP and Amazon Comprehend
 
Best Argument Essay Topics
Best Argument Essay TopicsBest Argument Essay Topics
Best Argument Essay Topics
 
Putnam "This is Today's Metadata Quality"
Putnam "This is Today's Metadata Quality"Putnam "This is Today's Metadata Quality"
Putnam "This is Today's Metadata Quality"
 
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
 
FWS Cutting Edge presentation
FWS Cutting Edge presentationFWS Cutting Edge presentation
FWS Cutting Edge presentation
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
 
lexicographic evidence
lexicographic evidencelexicographic evidence
lexicographic evidence
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Publish Once, Brand Everywhere
Publish Once, Brand EverywherePublish Once, Brand Everywhere
Publish Once, Brand Everywhere
 
Sattose talk
Sattose talkSattose talk
Sattose talk
 
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
 
Bird05 nltk-intro
Bird05 nltk-introBird05 nltk-intro
Bird05 nltk-intro
 

Mehr von Leon Derczynski

Joint Rumour Stance and Veracity
Joint Rumour Stance and VeracityJoint Rumour Stance and Veracity
Joint Rumour Stance and VeracityLeon Derczynski
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018Leon Derczynski
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceLeon Derczynski
 
Handling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCHandling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCLeon Derczynski
 
Efficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingEfficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingLeon Derczynski
 
Leveraging the Power of Social Media
Leveraging the Power of Social MediaLeveraging the Power of Social Media
Leveraging the Power of Social MediaLeon Derczynski
 
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesCorpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesLeon Derczynski
 
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Leon Derczynski
 
Christmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doChristmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doLeon Derczynski
 
Recognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsRecognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsLeon Derczynski
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextLeon Derczynski
 
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
 Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy DataLeon Derczynski
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseLeon Derczynski
 
Empirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense FrameworkEmpirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense FrameworkLeon Derczynski
 
Towards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media DataTowards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media DataLeon Derczynski
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseLeon Derczynski
 
TIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation ResourceTIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation ResourceLeon Derczynski
 
Review of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologiesReview of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologiesLeon Derczynski
 
A data driven approach to query expansion in question answering
A data driven approach to query expansion in question answeringA data driven approach to query expansion in question answering
A data driven approach to query expansion in question answeringLeon Derczynski
 

Mehr von Leon Derczynski (20)

Joint Rumour Stance and Veracity
Joint Rumour Stance and VeracityJoint Rumour Stance and Veracity
Joint Rumour Stance and Veracity
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018
 
RumourEval
RumourEvalRumourEval
RumourEval
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
 
Handling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCHandling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGC
 
Efficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingEfficient named entity annotation through pre-empting
Efficient named entity annotation through pre-empting
 
Leveraging the Power of Social Media
Leveraging the Power of Social MediaLeveraging the Power of Social Media
Leveraging the Power of Social Media
 
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesCorpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
 
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
 
Christmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doChristmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I do
 
Recognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsRecognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal Expressions
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
 
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
 Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in Discourse
 
Empirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense FrameworkEmpirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense Framework
 
Towards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media DataTowards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media Data
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in Discourse
 
TIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation ResourceTIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation Resource
 
Review of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologiesReview of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologies
 
A data driven approach to query expansion in question answering
A data driven approach to query expansion in question answeringA data driven approach to query expansion in question answering
A data driven approach to query expansion in question answering
 

Kürzlich hochgeladen

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 

Kürzlich hochgeladen (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 

Microblog-genre noise and its impact on semantic annotation accuracy

  • 1. Microblog-Genre Noise and Impact on Semantic Annotation Accuracy Leon Derczynski Diana Maynard Niraj Aswani Kalina Bontcheva
  • 2. Evolution of communication Functional utterances Vowels Velar closure: consonants Speech New modality: writing Digital text E-mail Social media Increased machine- readable information ??
  • 3. Social Media = Big Data Gartner ''3V'' definition: 1.Volume 2.Velocity 3.Variety High volume & velocity of messages: Twitter has ~20 000 000 users per month They write ~500 000 000 messages per day Massive variety: Stock markets; Earthquakes; Social arrangements; … Bieber
  • 4. What resources do we have now? Large, content-rich, linked, digital streams of human communication We transfer knowledge via communication Sampling communication gives a sample of human knowledge ''You've only done that which you can communicate'' The metadata (time – place – imagery) gives a richer resource: →A sampling of human behaviour
  • 5. Linking these resources Why is it useful to link this data? Identifying subjects, themes, … ''entities'' What's the text about? Who'd be interested in it? How can we summarise it? - very important, considering information overload in soc med!
  • 6. Typical annotation pipeline Language ID Tokenisation PoS tagging Text
  • 7. Typical annotation pipeline Named entity recognition dbpedia.org/resource/..... Michael_Jackson Michael_Jackson_(writer) Linking entities
  • 8. Pipeline cumulative effect Good performance is important at each stage – not just entity linking
  • 9. Language ID LADY GAGA IS BETTER THE 5th TIME OH BABY(: The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse Newswire: Microblog:
  • 10. Language ID difficulties General accuracy on microblogs: 89.5% Problems include switching language mid-text: je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun gezicht. Get over it New info in this format: Metadata: spatial information linked URLs Emoticons: :) vs. ^_^ cu vs. 88 Accuracy when customised to genre: 97.4%
  • 11. Tokenisation General accuracy on microblogs: 80% Goal is to convert byte stream to readily-digestible word chunks Word bound discovery is a critical language acquisition task The LIBYAN AID Team successfully shipped these broadcasting equipment to Misrata last August 2011, to establish an FM Radio station ranging 600km, broadcasting to the west side of Libya to help overthrow Gaddafi's regime. RT @JosetteSheeran: @WFP #Libya breakthru! We move urgently needed #food (wheat, flour) by truck convoy into western Libya for 1st time :D
  • 12. Tokenisation difficulties Not curated, so typos Improper grammar – e.g. apostrophe usage; live with it! doesnt → doesnt doesn't → does n't Smileys and emoticons I <3 you → I & lt ; you This piece ;,,( so emotional → this piece ; , , ( so emotional Loss of information (sentiment) Punctuation for emphasis *HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU * Words run together
  • 13. Tokenisation fixes Custom tokeniser! Apostrophe insertion Slang unpacking Ima get u → I 'm going to get you Emoticon rules - if we can spot them, we know not to break them Customised accuracy on microblogs: 96%
  • 14. Part of speech tagging Goal is to assign words to classes (verb, noun etc) General accuracy on newswire: 97.3% token, 56.8% sentence General accuracy on microblogs: 73.6% token, 4.24% sentence Sentence-level accuracy important: without whole sentence correct, difficult to extract syntax
  • 15. Part of speech tagging difficulties Many unknowns: Music bands Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010 Places #LB #news: Silverado Park Pool Swim Lessons Capitalisation way off @thewantedmusic on my tv :) aka derek last day of sorting pope visit to birmingham stuff out Slang ~HAPPY B-DAY TAYLOR !!! LUVZ YA~ Orthographic errors dont even have homwork today, suprising ? Dialect Shall we go out for dinner this evening? Ey yo wen u gon let me tap dat
  • 16. Part of speech tagging fixes Slang dictionary for ''repair'' Won't cover previously-unseen slang In-genre labelled data Expensive to create! Leverage ML Existing taggers can handle unknown words Maximise use of these features! General accuracy on microblogs: 73.6% token, 4.24% sentence Accuracy when customised to genre: 88.4% token, 25.40% sentence
  • 17. Named Entity Recognition Goal is to find entities we might like to link General accuracy on newswire: 89% F1 General accuracy on microblogs: 41% F1 Newswire: Microblog: Gotta dress up for london fashion week and party in style!!! London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
  • 18. NER difficulties Rule-based systems get the bulk of entities (newswire 77% F1) ML-based systems do well at the remainder (newswire 89% F1) Small proportion of difficult entities Many complex issues Using improved pipeline: ML struggles, even with in-genre data: 49% F1 Rules cut through microblog noise: 80% F1
  • 19. NER on Facebook Longer texts than tweets Still has informal tone MWEs are a problem! - all capitalised: Green Europe Imperiled as Debt Crises Triggers Carbon Market Drop Difficult, though easier than Twitter Maybe due to the possibility of including more verbal context?
  • 20. Entity linking Goal is to find out which entity a mention refers to ''The murderer was Professor Plum, in the Library, with the Candlestick!'' Which Professor Plum? Disambiguation is through connecting text to the web of data dbpedia.org/resource/Professor_Plum_(astrophysicist) Two tasks: - Whole-text linking - Entity-level linking
  • 21. ''Aboutness'' Goal: answer ''What entities is this text about?'' Good for tweets: Lack of lexicalised context Not all related concepts are in the text Helpful for summarisation No concern for entity bounds; (finding them is tough in microblog!) * but * Added concern for themes in text e.g. ''marketing'', ''US elections''
  • 22. Aboutness performance Corpus: from Meij et al. ''Adding semantics to microblog posts.'' 468 tweets From one to six concepts per tweet DBpedia spotlight: highest recall (47.5) TextRazor: highest precision (64.6) Zemanta: highest F1 (41.0) Zemanta tuned for blog entries – so compensates for some noise
  • 23. Word-level linking Goal is to link an entity Given: The entity mention Surrounding microblog context No corpora exist for this exact task: Two commercially produced ones Policy says ''no sharing'' How can we approach this key task?
  • 24. Word-level linking performance Dataset: Replab Task is to determine relatedness-or-not Six entities given Few hundred tweets per entity Detect mentions of entity in tweets We disambiguate mentions to DBpedia / Wikipedia (easy to map) General performance: F1 around 70
  • 25. Word-level linking issues NER errors Missed entities damages / destroys linking Specificity problems Lufthansa Cargo Lufthansa Cargo Which organisation to choose? Require good NER Direct linking chunking reduces precision: Apple trees in the home garden bit.ly/yOztKs Pipeline NER does not mark Apple as entity here Lack of disambiguation context is a problem!
  • 26. Word-level linking issues Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter! Actual: Branching out from Lincoln park after dark(ORG) ... Hello "Russian Navy", it's like the same thing but with glitter! Clue in unusual collocations + ?
  • 27. Whole pipeline: how to fix? Common genre problems centre on mucky, uncurated text Orth error Slang Brevity Condensed Non-Chicago punctuation.. Maybe clearing up this will improve performance?
  • 28. Normalisation General solution for overcoming linguistic noise How to repair? 1. Gazetteer (quick & dirty); or.. 2. Noisy channel model Task is to ''reverse engineer'' the noise on this channel Brown clustering; double metaphone; auto orth correction An honest, well-formed sentence u wot m8 biber #lol
  • 29. Normalisation performance NER on tweets: Rule-based No normalisation F1 80% Gazetter normalisation F1 81% Noisy channel F1 81% ML-based No normalisation F1 49.1% Gazetter normalisation F1 47.6% Noisy channel F1 49.3% Negligible performance impact, and introduces errors! Sentiment change: undisambiguable → disambiguable Meaning change: She has Huntington's → She has Huntingdon's
  • 30. Future directions MORE DATA! and better.. no IAA for many resources Maybe from the crowd? MORE CONTEXT! Not just linguistic Microblog has host of metatdata Explicit: Time, Place, URIs, Hashtags Implicit: Friend network Previous messages
  • 31. Thank you! Thank you for listening! Do you have any questions?