SlideShare ist ein Scribd-Unternehmen logo
1 von 31
Downloaden Sie, um offline zu lesen
Microblog-Genre Noise and
Impact on Semantic Annotation Accuracy
Leon Derczynski
Diana Maynard
Niraj Aswani
Kalina Bontcheva
Evolution of communication
Functional utterances
Vowels
Velar closure: consonants
Speech
New modality: writing
Digital text
E-mail
Social media
Increased
machine-
readable
information
??
Social Media = Big Data
Gartner ''3V'' definition:
1.Volume
2.Velocity
3.Variety
High volume & velocity of messages:
Twitter has ~20 000 000 users per month
They write ~500 000 000 messages per day
Massive variety:
Stock markets;
Earthquakes;
Social arrangements;
… Bieber
What resources do we have now?
Large, content-rich, linked, digital streams of human communication
We transfer knowledge via communication
Sampling communication gives a sample of human knowledge
''You've only done that which you can communicate''
The metadata (time – place – imagery) gives a richer resource:
→A sampling of human behaviour
Linking these resources
Why is it useful to link this data?
Identifying subjects, themes, … ''entities''
What's the text about?
Who'd be interested in it?
How can we summarise it?
- very important, considering information overload in soc med!
Typical annotation pipeline
Language ID
Tokenisation
PoS tagging
Text
Typical annotation pipeline
Named entity recognition
dbpedia.org/resource/.....
Michael_Jackson
Michael_Jackson_(writer)
Linking entities
Pipeline cumulative effect
Good performance is important at each stage – not just entity linking
Language ID
LADY GAGA IS BETTER THE 5th TIME OH BABY(:
The Jan. 21 show started with the unveiling of an
impressive three-story castle from which Gaga
emerges. The band members were in various
portals, separated from each other for most of the
show. For the next 2 hours and 15 minutes, Lady
Gaga repeatedly stormed the moveable castle,
turning it into her own gothic Barbie Dreamhouse
Newswire:
Microblog:
Language ID difficulties
General accuracy on microblogs: 89.5%
Problems include switching language mid-text:
je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt,
het is duidelijk, ze bedekken hun gezicht. Get over it
New info in this format:
Metadata:
spatial information
linked URLs
Emoticons:
:) vs. ^_^
cu vs. 88
Accuracy when customised to genre: 97.4%
Tokenisation
General accuracy on microblogs: 80%
Goal is to convert byte stream to readily-digestible word chunks
Word bound discovery is a critical language acquisition task
The LIBYAN AID Team successfully shipped these broadcasting
equipment to Misrata last August 2011, to establish an FM Radio
station ranging 600km, broadcasting to the west side of Libya to
help overthrow Gaddafi's regime.
RT @JosetteSheeran: @WFP #Libya breakthru! We move
urgently needed #food (wheat, flour) by truck convoy into
western Libya for 1st time :D
Tokenisation difficulties
Not curated, so typos
Improper grammar – e.g. apostrophe usage; live with it!
doesnt → doesnt
doesn't → does n't
Smileys and emoticons
I <3 you → I & lt ; you
This piece ;,,( so emotional → this piece ; , , ( so emotional
Loss of information (sentiment)
Punctuation for emphasis
*HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU *
Words run together
Tokenisation fixes
Custom tokeniser!
Apostrophe insertion
Slang unpacking
Ima get u → I 'm going to get you
Emoticon rules
- if we can spot them, we know not to break them
Customised accuracy on microblogs: 96%
Part of speech tagging
Goal is to assign words to classes (verb, noun etc)
General accuracy on newswire:
97.3% token, 56.8% sentence
General accuracy on microblogs:
73.6% token, 4.24% sentence
Sentence-level accuracy important:
without whole sentence correct, difficult to extract syntax
Part of speech tagging difficulties
Many unknowns:
Music bands
Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010
Places
#LB #news: Silverado Park Pool Swim Lessons
Capitalisation way off
@thewantedmusic on my tv :) aka derek
last day of sorting pope visit to birmingham stuff out
Slang
~HAPPY B-DAY TAYLOR !!! LUVZ YA~
Orthographic errors
dont even have homwork today, suprising ?
Dialect
Shall we go out for dinner this evening?
Ey yo wen u gon let me tap dat
Part of speech tagging fixes
Slang dictionary for ''repair''
Won't cover previously-unseen slang
In-genre labelled data
Expensive to create!
Leverage ML
Existing taggers can handle unknown words
Maximise use of these features!
General accuracy on microblogs:
73.6% token, 4.24% sentence
Accuracy when customised to genre:
88.4% token, 25.40% sentence
Named Entity Recognition
Goal is to find entities we might like to link
General accuracy on newswire: 89% F1
General accuracy on microblogs: 41% F1
Newswire:
Microblog:
Gotta dress up for london fashion week and party in
style!!!
London Fashion Week grows up – but mustn't take
itself too seriously. Once a launching pad for new
designers, it is fast becoming the main event. But
LFW mustn't let the luxury and money crush its
sense of silliness.
NER difficulties
Rule-based systems get the bulk of entities (newswire 77% F1)
ML-based systems do well at the remainder (newswire 89% F1)
Small proportion of
difficult entities
Many complex issues
Using improved pipeline:
ML struggles, even with in-genre data: 49% F1
Rules cut through microblog noise: 80% F1
NER on Facebook
Longer texts than tweets
Still has informal tone
MWEs are a problem!
- all capitalised:
Green Europe Imperiled as Debt Crises
Triggers Carbon Market Drop
Difficult, though easier than Twitter
Maybe due to the possibility of including more verbal context?
Entity linking
Goal is to find out which entity a mention refers to
''The murderer was Professor Plum, in the Library,
with the Candlestick!''
Which Professor Plum?
Disambiguation is through connecting text to the web of data
dbpedia.org/resource/Professor_Plum_(astrophysicist)
Two tasks:
- Whole-text linking
- Entity-level linking
''Aboutness''
Goal: answer ''What entities is this text about?''
Good for tweets:
Lack of lexicalised context
Not all related concepts are in the text
Helpful for summarisation
No concern for entity bounds; (finding them is tough in microblog!)
* but *
Added concern for themes in text
e.g. ''marketing'', ''US elections''
Aboutness performance
Corpus:
from Meij et al. ''Adding semantics to microblog posts.''
468 tweets
From one to six concepts per tweet
DBpedia spotlight: highest recall (47.5)
TextRazor: highest precision (64.6)
Zemanta: highest F1 (41.0)
Zemanta tuned for blog entries – so compensates for some noise
Word-level linking
Goal is to link an entity
Given:
The entity mention
Surrounding microblog context
No corpora exist for this exact task:
Two commercially produced ones
Policy says ''no sharing''
How can we approach this key task?
Word-level linking performance
Dataset: Replab
Task is to determine relatedness-or-not
Six entities given
Few hundred tweets per entity
Detect mentions of entity in tweets
We disambiguate mentions to DBpedia / Wikipedia (easy to map)
General performance: F1 around 70
Word-level linking issues
NER errors
Missed entities damages / destroys linking
Specificity problems
Lufthansa Cargo
Lufthansa Cargo
Which organisation to choose?
Require good NER
Direct linking chunking reduces precision:
Apple trees in the home garden bit.ly/yOztKs
Pipeline NER does not mark Apple as entity here
Lack of disambiguation context is a problem!
Word-level linking issues
Automatic annotation:
Branching out from Lincoln park(LOC) after dark ... Hello "Russian
Navy(ORG)", it's like the same thing but with glitter!
Actual:
Branching out from Lincoln park after dark(ORG) ... Hello "Russian
Navy", it's like the same thing but with glitter!
Clue in unusual collocations
+ ?
Whole pipeline: how to fix?
Common genre problems centre on mucky, uncurated text
Orth error
Slang
Brevity
Condensed
Non-Chicago punctuation..
Maybe clearing up this will improve performance?
Normalisation
General solution for overcoming linguistic noise
How to repair?
1. Gazetteer (quick & dirty); or..
2. Noisy channel model
Task is to ''reverse engineer'' the noise on this channel
Brown clustering; double metaphone; auto orth correction
An honest, well-formed sentence u wot m8 biber #lol
Normalisation performance
NER on tweets:
Rule-based
No normalisation F1 80%
Gazetter normalisation F1 81%
Noisy channel F1 81%
ML-based
No normalisation F1 49.1%
Gazetter normalisation F1 47.6%
Noisy channel F1 49.3%
Negligible performance impact, and introduces errors!
Sentiment change:
undisambiguable → disambiguable
Meaning change:
She has Huntington's → She has Huntingdon's
Future directions
MORE DATA!
and better..
no IAA for many resources
Maybe from the crowd?
MORE CONTEXT!
Not just linguistic
Microblog has host of metatdata
Explicit:
Time, Place, URIs, Hashtags
Implicit:
Friend network
Previous messages
Thank you!
Thank you for listening!
Do you have any questions?

Weitere ähnliche Inhalte

Andere mochten auch

Web Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMAWeb Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMACynthia Typaldos
 
Web Communities With RelationSys And D2C
Web Communities With RelationSys And D2CWeb Communities With RelationSys And D2C
Web Communities With RelationSys And D2CDavid Terrar
 
Microblog,change the world
Microblog,change the worldMicroblog,change the world
Microblog,change the worldJinGui LI
 
Building Web Communities That Add Value
Building Web Communities That Add ValueBuilding Web Communities That Add Value
Building Web Communities That Add ValueDavid Terrar
 
Microblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in ChinaMicroblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in ChinaMedia MIC
 
Distributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of DataDistributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of DataMarko Rodriguez
 
SP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiSP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiJohn Breslin
 
Networkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCNetworkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCGilad Lotan
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQLMike Crabb
 

Andere mochten auch (13)

Web Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMAWeb Communities Oct 1998 Presentation to SVAMA
Web Communities Oct 1998 Presentation to SVAMA
 
Web Communities With RelationSys And D2C
Web Communities With RelationSys And D2CWeb Communities With RelationSys And D2C
Web Communities With RelationSys And D2C
 
Microblog,change the world
Microblog,change the worldMicroblog,change the world
Microblog,change the world
 
Building Web Communities That Add Value
Building Web Communities That Add ValueBuilding Web Communities That Add Value
Building Web Communities That Add Value
 
Microblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in ChinaMicroblog tells_Research on Sina Microblog in China
Microblog tells_Research on Sina Microblog in China
 
Distributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of DataDistributed Graph Databases and the Emerging Web of Data
Distributed Graph Databases and the Emerging Web of Data
 
SP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with GephiSP1: Exploratory Network Analysis with Gephi
SP1: Exploratory Network Analysis with Gephi
 
Networkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYCNetworkx & Gephi Tutorial #Pydata NYC
Networkx & Gephi Tutorial #Pydata NYC
 
Gephi Consortium Presentation
Gephi Consortium PresentationGephi Consortium Presentation
Gephi Consortium Presentation
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
Web Mining
Web Mining Web Mining
Web Mining
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
A Beginners Guide to noSQL
A Beginners Guide to noSQLA Beginners Guide to noSQL
A Beginners Guide to noSQL
 

Ähnlich wie Microblog-genre noise and its impact on semantic annotation accuracy

Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...Leon Derczynski
 
Soylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd InsideSoylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd InsideMichael Bernstein
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processingpunedevscom
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLLawrie Hunter
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1Sara Hooker
 
Collaborative Ontology Building Project
Collaborative Ontology Building Project  Collaborative Ontology Building Project
Collaborative Ontology Building Project Jie Bao
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleMichael Bernstein
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part IDelip Rao
 
State of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendState of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendEgor Pushkin
 
Best Argument Essay Topics
Best Argument Essay TopicsBest Argument Essay Topics
Best Argument Essay TopicsJessica Miller
 
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...Maryam Farooq
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsForward Gradient
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language ProcessingMichel Bruley
 
Publish Once, Brand Everywhere
Publish Once, Brand EverywherePublish Once, Brand Everywhere
Publish Once, Brand Everywherelaurendreier
 
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...ABES
 

Ähnlich wie Microblog-genre noise and its impact on semantic annotation accuracy (20)

Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
Mining Social Media with Linked Open Data, Entity Recognition, and Event Extr...
 
Soylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd InsideSoylent: A Word Processor with a Crowd Inside
Soylent: A Word Processor with a Crowd Inside
 
Natural Language Processing
Natural Language ProcessingNatural Language Processing
Natural Language Processing
 
Gadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALLGadgets pwn us? A pattern language for CALL
Gadgets pwn us? A pattern language for CALL
 
Module 8: Natural language processing Pt 1
Module 8:  Natural language processing Pt 1Module 8:  Natural language processing Pt 1
Module 8: Natural language processing Pt 1
 
Collaborative Ontology Building Project
Collaborative Ontology Building Project  Collaborative Ontology Building Project
Collaborative Ontology Building Project
 
HarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the peopleHarambeeNet: Data by the people, for the people
HarambeeNet: Data by the people, for the people
 
NLP in Practice - Part I
NLP in Practice - Part INLP in Practice - Part I
NLP in Practice - Part I
 
State of NLP and Amazon Comprehend
State of NLP and Amazon ComprehendState of NLP and Amazon Comprehend
State of NLP and Amazon Comprehend
 
Best Argument Essay Topics
Best Argument Essay TopicsBest Argument Essay Topics
Best Argument Essay Topics
 
Putnam "This is Today's Metadata Quality"
Putnam "This is Today's Metadata Quality"Putnam "This is Today's Metadata Quality"
Putnam "This is Today's Metadata Quality"
 
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
NYAI #27: Cognitive Architecture & Natural Language Processing w/ Dr. Catheri...
 
FWS Cutting Edge presentation
FWS Cutting Edge presentationFWS Cutting Edge presentation
FWS Cutting Edge presentation
 
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and ApplicationsICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
ICDM 2019 Tutorial: Speech and Language Processing: New Tools and Applications
 
lexicographic evidence
lexicographic evidencelexicographic evidence
lexicographic evidence
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Publish Once, Brand Everywhere
Publish Once, Brand EverywherePublish Once, Brand Everywhere
Publish Once, Brand Everywhere
 
Sattose talk
Sattose talkSattose talk
Sattose talk
 
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
Jabes 2009 - Conférence inaugurale "Un avenir sans livres pour les bibliothèq...
 
Bird05 nltk-intro
Bird05 nltk-introBird05 nltk-intro
Bird05 nltk-intro
 

Mehr von Leon Derczynski

Joint Rumour Stance and Veracity
Joint Rumour Stance and VeracityJoint Rumour Stance and Veracity
Joint Rumour Stance and VeracityLeon Derczynski
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018Leon Derczynski
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceLeon Derczynski
 
Handling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCHandling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCLeon Derczynski
 
Efficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingEfficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingLeon Derczynski
 
Leveraging the Power of Social Media
Leveraging the Power of Social MediaLeveraging the Power of Social Media
Leveraging the Power of Social MediaLeon Derczynski
 
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesCorpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesLeon Derczynski
 
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Leon Derczynski
 
Christmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doChristmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doLeon Derczynski
 
Recognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsRecognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsLeon Derczynski
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextLeon Derczynski
 
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
 Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy DataLeon Derczynski
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseLeon Derczynski
 
Empirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense FrameworkEmpirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense FrameworkLeon Derczynski
 
Towards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media DataTowards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media DataLeon Derczynski
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseLeon Derczynski
 
TIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation ResourceTIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation ResourceLeon Derczynski
 
Review of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologiesReview of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologiesLeon Derczynski
 
A data driven approach to query expansion in question answering
A data driven approach to query expansion in question answeringA data driven approach to query expansion in question answering
A data driven approach to query expansion in question answeringLeon Derczynski
 

Mehr von Leon Derczynski (20)

Joint Rumour Stance and Veracity
Joint Rumour Stance and VeracityJoint Rumour Stance and Veracity
Joint Rumour Stance and Veracity
 
State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018State of Tools for NLP in Danish: 2018
State of Tools for NLP in Danish: 2018
 
RumourEval
RumourEvalRumourEval
RumourEval
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
 
Handling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGCHandling and Mining Linguistic Variation in UGC
Handling and Mining Linguistic Variation in UGC
 
Efficient named entity annotation through pre-empting
Efficient named entity annotation through pre-emptingEfficient named entity annotation through pre-empting
Efficient named entity annotation through pre-empting
 
Leveraging the Power of Social Media
Leveraging the Power of Social MediaLeveraging the Power of Social Media
Leveraging the Power of Social Media
 
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice GuidelinesCorpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
Corpus Annotation through Crowdsourcing: Towards Best Practice Guidelines
 
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Rec...
 
Christmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I doChristmas Presentation at Aarhus: What I do
Christmas Presentation at Aarhus: What I do
 
Recognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal ExpressionsRecognising and Interpreting Named Temporal Expressions
Recognising and Interpreting Named Temporal Expressions
 
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog TextTwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
TwitIE: An Open-Source Information Extraction Pipeline for Microblog Text
 
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
 Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data Twitter Part-of-Speech Tagging for All:  Overcoming Sparse and Noisy Data
Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in Discourse
 
Empirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense FrameworkEmpirical Validation of Reichenbach’s Tense Framework
Empirical Validation of Reichenbach’s Tense Framework
 
Towards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media DataTowards Context-Aware Search and Analysis on Social Media Data
Towards Context-Aware Search and Analysis on Social Media Data
 
Determining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in DiscourseDetermining the Types of Temporal Relations in Discourse
Determining the Types of Temporal Relations in Discourse
 
TIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation ResourceTIMEN: An Open Temporal Expression Normalisation Resource
TIMEN: An Open Temporal Expression Normalisation Resource
 
Review of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologiesReview of: Challenges of migrating to agile methodologies
Review of: Challenges of migrating to agile methodologies
 
A data driven approach to query expansion in question answering
A data driven approach to query expansion in question answeringA data driven approach to query expansion in question answering
A data driven approach to query expansion in question answering
 

Kürzlich hochgeladen

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxBkGupta21
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Kürzlich hochgeladen (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
unit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptxunit 4 immunoblotting technique complete.pptx
unit 4 immunoblotting technique complete.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Microblog-genre noise and its impact on semantic annotation accuracy

  • 1. Microblog-Genre Noise and Impact on Semantic Annotation Accuracy Leon Derczynski Diana Maynard Niraj Aswani Kalina Bontcheva
  • 2. Evolution of communication Functional utterances Vowels Velar closure: consonants Speech New modality: writing Digital text E-mail Social media Increased machine- readable information ??
  • 3. Social Media = Big Data Gartner ''3V'' definition: 1.Volume 2.Velocity 3.Variety High volume & velocity of messages: Twitter has ~20 000 000 users per month They write ~500 000 000 messages per day Massive variety: Stock markets; Earthquakes; Social arrangements; … Bieber
  • 4. What resources do we have now? Large, content-rich, linked, digital streams of human communication We transfer knowledge via communication Sampling communication gives a sample of human knowledge ''You've only done that which you can communicate'' The metadata (time – place – imagery) gives a richer resource: →A sampling of human behaviour
  • 5. Linking these resources Why is it useful to link this data? Identifying subjects, themes, … ''entities'' What's the text about? Who'd be interested in it? How can we summarise it? - very important, considering information overload in soc med!
  • 6. Typical annotation pipeline Language ID Tokenisation PoS tagging Text
  • 7. Typical annotation pipeline Named entity recognition dbpedia.org/resource/..... Michael_Jackson Michael_Jackson_(writer) Linking entities
  • 8. Pipeline cumulative effect Good performance is important at each stage – not just entity linking
  • 9. Language ID LADY GAGA IS BETTER THE 5th TIME OH BABY(: The Jan. 21 show started with the unveiling of an impressive three-story castle from which Gaga emerges. The band members were in various portals, separated from each other for most of the show. For the next 2 hours and 15 minutes, Lady Gaga repeatedly stormed the moveable castle, turning it into her own gothic Barbie Dreamhouse Newswire: Microblog:
  • 10. Language ID difficulties General accuracy on microblogs: 89.5% Problems include switching language mid-text: je bent Jacques cousteau niet die een nieuwe soort heeft ontdekt, het is duidelijk, ze bedekken hun gezicht. Get over it New info in this format: Metadata: spatial information linked URLs Emoticons: :) vs. ^_^ cu vs. 88 Accuracy when customised to genre: 97.4%
  • 11. Tokenisation General accuracy on microblogs: 80% Goal is to convert byte stream to readily-digestible word chunks Word bound discovery is a critical language acquisition task The LIBYAN AID Team successfully shipped these broadcasting equipment to Misrata last August 2011, to establish an FM Radio station ranging 600km, broadcasting to the west side of Libya to help overthrow Gaddafi's regime. RT @JosetteSheeran: @WFP #Libya breakthru! We move urgently needed #food (wheat, flour) by truck convoy into western Libya for 1st time :D
  • 12. Tokenisation difficulties Not curated, so typos Improper grammar – e.g. apostrophe usage; live with it! doesnt → doesnt doesn't → does n't Smileys and emoticons I <3 you → I & lt ; you This piece ;,,( so emotional → this piece ; , , ( so emotional Loss of information (sentiment) Punctuation for emphasis *HUGS YOU**KISSES YOU* → * HUGS YOU**KISSES YOU * Words run together
  • 13. Tokenisation fixes Custom tokeniser! Apostrophe insertion Slang unpacking Ima get u → I 'm going to get you Emoticon rules - if we can spot them, we know not to break them Customised accuracy on microblogs: 96%
  • 14. Part of speech tagging Goal is to assign words to classes (verb, noun etc) General accuracy on newswire: 97.3% token, 56.8% sentence General accuracy on microblogs: 73.6% token, 4.24% sentence Sentence-level accuracy important: without whole sentence correct, difficult to extract syntax
  • 15. Part of speech tagging difficulties Many unknowns: Music bands Soulja Boy | TheDeAndreWay.com in stores Nov 2, 2010 Places #LB #news: Silverado Park Pool Swim Lessons Capitalisation way off @thewantedmusic on my tv :) aka derek last day of sorting pope visit to birmingham stuff out Slang ~HAPPY B-DAY TAYLOR !!! LUVZ YA~ Orthographic errors dont even have homwork today, suprising ? Dialect Shall we go out for dinner this evening? Ey yo wen u gon let me tap dat
  • 16. Part of speech tagging fixes Slang dictionary for ''repair'' Won't cover previously-unseen slang In-genre labelled data Expensive to create! Leverage ML Existing taggers can handle unknown words Maximise use of these features! General accuracy on microblogs: 73.6% token, 4.24% sentence Accuracy when customised to genre: 88.4% token, 25.40% sentence
  • 17. Named Entity Recognition Goal is to find entities we might like to link General accuracy on newswire: 89% F1 General accuracy on microblogs: 41% F1 Newswire: Microblog: Gotta dress up for london fashion week and party in style!!! London Fashion Week grows up – but mustn't take itself too seriously. Once a launching pad for new designers, it is fast becoming the main event. But LFW mustn't let the luxury and money crush its sense of silliness.
  • 18. NER difficulties Rule-based systems get the bulk of entities (newswire 77% F1) ML-based systems do well at the remainder (newswire 89% F1) Small proportion of difficult entities Many complex issues Using improved pipeline: ML struggles, even with in-genre data: 49% F1 Rules cut through microblog noise: 80% F1
  • 19. NER on Facebook Longer texts than tweets Still has informal tone MWEs are a problem! - all capitalised: Green Europe Imperiled as Debt Crises Triggers Carbon Market Drop Difficult, though easier than Twitter Maybe due to the possibility of including more verbal context?
  • 20. Entity linking Goal is to find out which entity a mention refers to ''The murderer was Professor Plum, in the Library, with the Candlestick!'' Which Professor Plum? Disambiguation is through connecting text to the web of data dbpedia.org/resource/Professor_Plum_(astrophysicist) Two tasks: - Whole-text linking - Entity-level linking
  • 21. ''Aboutness'' Goal: answer ''What entities is this text about?'' Good for tweets: Lack of lexicalised context Not all related concepts are in the text Helpful for summarisation No concern for entity bounds; (finding them is tough in microblog!) * but * Added concern for themes in text e.g. ''marketing'', ''US elections''
  • 22. Aboutness performance Corpus: from Meij et al. ''Adding semantics to microblog posts.'' 468 tweets From one to six concepts per tweet DBpedia spotlight: highest recall (47.5) TextRazor: highest precision (64.6) Zemanta: highest F1 (41.0) Zemanta tuned for blog entries – so compensates for some noise
  • 23. Word-level linking Goal is to link an entity Given: The entity mention Surrounding microblog context No corpora exist for this exact task: Two commercially produced ones Policy says ''no sharing'' How can we approach this key task?
  • 24. Word-level linking performance Dataset: Replab Task is to determine relatedness-or-not Six entities given Few hundred tweets per entity Detect mentions of entity in tweets We disambiguate mentions to DBpedia / Wikipedia (easy to map) General performance: F1 around 70
  • 25. Word-level linking issues NER errors Missed entities damages / destroys linking Specificity problems Lufthansa Cargo Lufthansa Cargo Which organisation to choose? Require good NER Direct linking chunking reduces precision: Apple trees in the home garden bit.ly/yOztKs Pipeline NER does not mark Apple as entity here Lack of disambiguation context is a problem!
  • 26. Word-level linking issues Automatic annotation: Branching out from Lincoln park(LOC) after dark ... Hello "Russian Navy(ORG)", it's like the same thing but with glitter! Actual: Branching out from Lincoln park after dark(ORG) ... Hello "Russian Navy", it's like the same thing but with glitter! Clue in unusual collocations + ?
  • 27. Whole pipeline: how to fix? Common genre problems centre on mucky, uncurated text Orth error Slang Brevity Condensed Non-Chicago punctuation.. Maybe clearing up this will improve performance?
  • 28. Normalisation General solution for overcoming linguistic noise How to repair? 1. Gazetteer (quick & dirty); or.. 2. Noisy channel model Task is to ''reverse engineer'' the noise on this channel Brown clustering; double metaphone; auto orth correction An honest, well-formed sentence u wot m8 biber #lol
  • 29. Normalisation performance NER on tweets: Rule-based No normalisation F1 80% Gazetter normalisation F1 81% Noisy channel F1 81% ML-based No normalisation F1 49.1% Gazetter normalisation F1 47.6% Noisy channel F1 49.3% Negligible performance impact, and introduces errors! Sentiment change: undisambiguable → disambiguable Meaning change: She has Huntington's → She has Huntingdon's
  • 30. Future directions MORE DATA! and better.. no IAA for many resources Maybe from the crowd? MORE CONTEXT! Not just linguistic Microblog has host of metatdata Explicit: Time, Place, URIs, Hashtags Implicit: Friend network Previous messages
  • 31. Thank you! Thank you for listening! Do you have any questions?