Code: http://gate.ac.uk/wiki/twitie.html
Paper: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf
Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every stage. Additionally, it includes Twitter-specific data import and metadata handling. This paper introduces each stage of the TwitIE pipeline, which is a modification of the GATE ANNIE open-source pipeline for news text. An evaluation against some state-of-the-art systems is also presented.
University of Sheffield, NLP
Genre Differences in Entity Types
• PER – News: politicians, business leaders, journalists, celebrities. Tweets: sportsmen, actors, TV personalities, celebrities, names of friends
• LOC – News: countries, cities, rivers, and other places related to current affairs. Tweets: restaurants, bars, local landmarks/areas, cities, rarely countries
• ORG – News: public and private companies, government organisations. Tweets: bands, internet companies, sports clubs
Tweet-specific NER challenges
• Capitalisation is not indicative of named entities
• All uppercase, e.g. APPLE IS AWSOME
• All lowercase, e.g. all welcome, joe included
• All words with upper initials, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School
• Unusual spelling, acronyms, and abbreviations
• Social media conventions:
• Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance
• @Mentions, e.g. @edchi (PER), @mcg_graz (LOC), @BBC (ORG)
Importing tweets into GATE
• GATE now supports JSON format import for tweets
• Located in the Format_Twitter plugin
• Automatically used for *.json files
• Alternatively, specify text/x-json-twitter as the MIME type
• The tweet text becomes the document content; all other JSON fields become document features
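The mapping can be sketched as follows. This is an illustrative Python sketch only, not the actual Format_Twitter plugin (which is implemented in Java inside GATE); the field names follow the Twitter JSON API:

```python
import json

def tweet_to_document(raw):
    """Sketch of the mapping described above: the tweet text becomes
    the document content, and all remaining JSON fields become
    document features."""
    tweet = json.loads(raw)
    text = tweet.pop("text", "")   # tweet text -> document content
    features = dict(tweet)         # everything else -> features
    return text, features

raw = '{"text": "Loving #GATE!", "lang": "en", "retweet_count": 3}'
text, features = tweet_to_document(raw)
```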
Language Detection: Less than 50% English
The main challenges with tweets and Facebook status updates:
the small number of tokens (10 tokens per tweet on average)
the noisy nature of the words (abbreviations, misspellings)
Because the texts are so short, we can assume that each tweet is written in a single language
We have adapted the TextCat language identification plugin
Fingerprints are provided for 5 languages: DE, EN, FR, ES, NL
It can easily be extended to new languages
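TextCat-style identification compares character n-gram frequency fingerprints using an out-of-place distance. A minimal sketch, with toy fingerprints built from a few words rather than the plugin's real training data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Character n-gram frequency profile (a TextCat-style fingerprint).
    Real TextCat mixes 1- to 5-grams; this sketch uses trigrams only."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(profile, fingerprint, penalty=300):
    """Out-of-place measure: sum of rank differences, with a fixed
    penalty for n-grams absent from the fingerprint."""
    return sum(abs(i - fingerprint.index(g)) if g in fingerprint else penalty
               for i, g in enumerate(profile))

# Toy fingerprints from tiny samples -- illustrative only
FINGERPRINTS = {
    "EN": ngram_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "DE": ngram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

def identify(tweet):
    """Assign the whole tweet one language (tweets are assumed monolingual)."""
    profile = ngram_profile(tweet)
    return min(FINGERPRINTS, key=lambda lang: distance(profile, FINGERPRINTS[lang]))
```

Extending to a new language is then just a matter of adding a fingerprint built from sample text in that language.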
Tokenisation
Splitting a text into its constituent parts
Plenty of “unusual”, but very important tokens in social media:
– @Apple – mentions of company/brand/person names
– #fail, #SteveJobs – hashtags expressing sentiment, person or company names
– :-(, :-), :-P – emoticons (punctuation and optionally letters)
– URLs
Tokenisation is key for entity recognition and opinion mining
A study of 1.1 million tweets found that 26% of English tweets contain a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter, 2013]
Example
– #WiredBizCon #nike vp said when @Apple saw what
http://nikeplus.com did, #SteveJobs was like wow I didn't
expect this at all.
– Tokenising on white space doesn't work that well:
• Nike and Apple are company names, but if we produce tokens such as #nike and @Apple, entity recognition becomes harder, as it has to look at sub-token level
– Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as do emoticons and email addresses
The TwitIE Tokeniser
RTs and URLs are each treated as a single token
#nike is two tokens (# and nike), plus a separate HashTag annotation covering both; @mentions are handled the same way, with a covering UserID annotation
Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase
Date and phone number normalisation, lowercasing, and emoticon handling are optionally done later, in separate modules
Consequently, tokenisation is faster and more generic, as well as more closely tailored to our NER module
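The rules above can be sketched with a small regular-expression tokeniser. This is an illustration only; the pattern set is heavily simplified compared with the actual TwitIE implementation:

```python
import re

# Ordered alternation: URLs and RT first, so they survive as single
# tokens; hashtags/@mentions next; then emoticons, words, stray chars.
URL = r"https?://\S+"
RT = r"\bRT\b"
TAG = r"[#@]\w+"
EMOTICON = r"[:;]-?[()PD]"
WORD = r"\w+"
TOKEN = re.compile("|".join([URL, RT, TAG, EMOTICON, WORD, r"\S"]))

def orth(token):
    """Orthography feature as described above: all caps / lowercase / mixCase."""
    if token.isupper():
        return "all caps"
    if token.islower():
        return "lowercase"
    return "mixCase"

def tokenise(text):
    """Return (tokens, covering annotations): '#nike' yields two tokens
    ('#', 'nike') plus a covering HashTag span; same for @mentions."""
    tokens, covering = [], []
    for m in TOKEN.finditer(text):
        tok = m.group(0)
        if tok[0] in "#@" and len(tok) > 1:
            covering.append(("HashTag" if tok[0] == "#" else "UserID", tok))
            tokens += [tok[0], tok[1:]]
        else:
            tokens.append(tok)
    return tokens, covering
```

For the earlier example, `tokenise("RT @Apple saw http://nikeplus.com #nike :-)")` keeps the URL and the emoticon intact, while splitting the hashtag and the @mention into marker plus word.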
POS Tagging
• The accuracy of the Stanford POS tagger drops from about
97% on news to 80% on tweets (Ritter, 2011)
• An adapted POS tagger is needed, specifically for tweets
• We re-trained the Stanford POS tagger on hand-annotated tweets, IRC and news texts
• Next we compare the ANNIE POS tagger and the TwitIE POS tagger on example tweets
POS Tagging Example
• TwitIE POS tagger output on the left, ANNIE POS tagger output on the right
• The TwitIE POS tagger is described in a separate paper at RANLP 2013
• It beats Ritter (2011) and uses the full, fine-grained tag set rather than a coarse one (cf. Gimpel, 2011)
Tweet Normalisation
“RT @Bthompson WRITEZ: @libbyabrego honored?!
Everybody knows the libster is nice with it...lol...(thankkkks a
bunch;))”
OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
Similar to SMS normalisation
For some components (e.g. the POS tagger or a parser) to work well, a normalised version of each token must be produced
BUT uppercasing, and letter and exclamation mark repetition, often convey strong sentiment
Therefore some systems choose not to normalise, while others keep both versions of each token
A normalised example
The normaliser is currently based on spelling correction and lists of common abbreviations
Outstanding issues:
Should new Token annotations be inserted, to make POS tagging etc. easier? For example, “trying to” is currently one annotation
Abbreviations which span token boundaries (e.g. gr8, do n’t) are difficult to handle
Capitalisation and punctuation normalisation
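A toy sketch of two of the mechanisms involved, abbreviation lookup and repeated-letter collapsing, returning both versions of each token. The abbreviation list here is illustrative, and the real normaliser also performs spelling correction:

```python
import re

# Illustrative abbreviation list; the real one is much larger
ABBREV = {"lol": "laughing out loud", "omg": "oh my god"}

def normalise(token):
    """Return (original, normalised), keeping both versions, since
    repetition and capitalisation often carry sentiment.  Runs of
    three or more identical characters are collapsed to one before
    the abbreviation lookup."""
    collapsed = re.sub(r"(.)\1{2,}", r"\1", token.lower())
    return token, ABBREV.get(collapsed, collapsed)
```

On the examples above, "thankkkks" normalises to "thanks" and "lol" to its expansion, while the original spelling is preserved alongside.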
Trying TwitIE
• The plugin is included in the latest GATE snapshot and the forthcoming 7.2 release
• Download details at: https://gate.ac.uk/wiki/twitie.html
• Available soon as a web service on the forthcoming
AnnoMarket NLP cloud marketplace:
• https://annomarket.com/
Coming Soon: TwitIE-as-a-Service
Preview of some text analytics services on AnnoMarket.com
Acknowledgements
• Kalina Bontcheva is supported by a Career Acceleration
Fellowship from the Engineering and Physical Sciences
Research Council (grant EP/I004327/1)
• This research is also partially supported by the EU-funded
FP7 TrendMiner project (http://www.trendminer-project.eu)
and the CHIST-ERA uComp project (http://www.ucomp.eu)
Thank you for your time!
Editor's notes
These are mostly politicians. Their names are often preceded by titles, and there is also a larger context, within which entity coreference helps with detection (e.g. Atef and Mohammed Atef; bin Laden and Osama bin Laden).
These are names of friends, singers, artists, sportspeople, and celebrities. They are often in lowercase, referred to by first name or surname only, and sometimes misspelled.
Hashtags: some contain locations, some contain person names, and others are phrases. As for @mentions: Ritter (2011), or a similar recent paper on Twitter NER, if I recall rightly, excluded @mentions from their evaluation on the grounds that they are trivially recognisable as persons. The point is that they are not all persons (that used to be true): @mentions now also cover locations/facilities, organisations, products, research projects, and the like. Hence, even though it is trivial to identify an @mention as a named entity, assigning it the appropriate NE type is far from a solved problem!