Code: http://gate.ac.uk/wiki/twitie.html
Paper: https://gate.ac.uk/sale/ranlp2013/twitie/twitie-ranlp2013.pdf
Twitter is the largest source of microblog text, responsible for gigabytes of human discourse every day. Processing microblog text is difficult: the genre is noisy, documents have little context, and utterances are very short. As such, conventional NLP tools fail when faced with tweets and other microblog text. We present TwitIE, an open-source NLP pipeline customised to microblog text at every stage. Additionally, it includes Twitter-specific data import and metadata handling. This paper introduces each stage of the TwitIE pipeline, which is a modification of the GATE ANNIE open-source pipeline for news text. An evaluation against some state-of-the-art systems is also presented.
University of Sheffield, NLP
Genre Differences in Entity Types
• PER – News: politicians, business leaders, journalists, celebrities. Tweets: sportsmen, actors, TV personalities, celebrities, names of friends
• LOC – News: countries, cities, rivers, and other places related to current affairs. Tweets: restaurants, bars, local landmarks/areas, cities, rarely countries
• ORG – News: public and private companies, government organisations. Tweets: bands, internet companies, sports clubs
Tweet-specific NER challenges
• Capitalisation is not indicative of named entities
• All uppercase, e.g. APPLE IS AWSOME
• All lowercase, e.g. all welcome, joe included
• All words with upper initials, e.g. 10 Quotes from Amy Poehler That Will Get You Through High School
• Unusual spelling, acronyms, and abbreviations
• Social media conventions:
• Hashtags, e.g. #ukuncut, #RusselBrand, #taxavoidance
• @Mentions, e.g. @edchi (PER), @mcg_graz (LOC), @BBC (ORG)
Importing tweets into GATE
• GATE now supports JSON format import for tweets
• Located in the Format_Twitter plugin
• Automatically used for *.json files
• Alternatively, specify text/x-json-twitter as the MIME type
• The tweet text becomes the document content; all other JSON fields become document features
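The mapping can be sketched as follows. This is an illustrative Python sketch only, not the actual Format_Twitter plugin (which is implemented in Java inside GATE); the field names follow the Twitter JSON API:

```python
import json

def tweet_to_document(raw):
    """Sketch of the mapping described above: the tweet text becomes
    the document content, and all remaining JSON fields become
    document features."""
    tweet = json.loads(raw)
    text = tweet.pop("text", "")   # tweet text -> document content
    features = dict(tweet)         # everything else -> features
    return text, features

raw = '{"text": "Loving #GATE!", "lang": "en", "retweet_count": 3}'
text, features = tweet_to_document(raw)
```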
Language Detection: Less than 50% English
The main challenges with tweets and Facebook status updates:
the small number of tokens (10 tokens per tweet on average)
the noisy nature of the words (abbreviations, misspellings)
Because the texts are so short, we can assume that each tweet is written in a single language
We have adapted the TextCat language identification plugin
Fingerprints are provided for 5 languages: DE, EN, FR, ES, NL
It can easily be extended to new languages
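TextCat-style identification compares character n-gram frequency fingerprints using an out-of-place distance. A minimal sketch, with toy fingerprints built from a few words rather than the plugin's real training data:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Character n-gram frequency profile (a TextCat-style fingerprint).
    Real TextCat mixes 1- to 5-grams; this sketch uses trigrams only."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def distance(profile, fingerprint, penalty=300):
    """Out-of-place measure: sum of rank differences, with a fixed
    penalty for n-grams absent from the fingerprint."""
    return sum(abs(i - fingerprint.index(g)) if g in fingerprint else penalty
               for i, g in enumerate(profile))

# Toy fingerprints from tiny samples -- illustrative only
FINGERPRINTS = {
    "EN": ngram_profile("the quick brown fox jumps over the lazy dog and then the cat"),
    "DE": ngram_profile("der schnelle braune fuchs springt über den faulen hund und die katze"),
}

def identify(tweet):
    """Assign the whole tweet one language (tweets are assumed monolingual)."""
    profile = ngram_profile(tweet)
    return min(FINGERPRINTS, key=lambda lang: distance(profile, FINGERPRINTS[lang]))
```

Extending to a new language is then just a matter of adding a fingerprint built from sample text in that language.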
Tokenisation
Splitting a text into its constituent parts
Plenty of “unusual”, but very important tokens in social media:
– @Apple – mentions of company/brand/person names
– #fail, #SteveJobs – hashtags expressing sentiment, person or company names
– :-(, :-), :-P – emoticons (punctuation and optionally letters)
– URLs
Tokenisation is key for entity recognition and opinion mining
A study of 1.1 million tweets found that 26% of English tweets contain a URL, 16.6% a hashtag, and 54.8% a user name mention [Carter, 2013]
Example
– #WiredBizCon #nike vp said when @Apple saw what
http://nikeplus.com did, #SteveJobs was like wow I didn't
expect this at all.
– Tokenising on white space doesn't work that well:
• Nike and Apple are company names, but if we produce tokens such as #nike and @Apple, entity recognition becomes harder, as it has to look at sub-token level
– Tokenising on white space and punctuation characters doesn't work well either: URLs get separated (http, nikeplus), as do emoticons and email addresses
The TwitIE Tokeniser
RTs and URLs are each treated as a single token
#nike is two tokens (# and nike), plus a separate HashTag annotation covering both; @mentions are handled the same way, with a covering UserID annotation
Capitalisation is preserved, but an orthography feature is added: all caps, lowercase, mixCase
Date and phone number normalisation, lowercasing, and emoticon handling are optionally done later, in separate modules
Consequently, tokenisation is faster and more generic, as well as more closely tailored to our NER module
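The rules above can be sketched with a small regular-expression tokeniser. This is an illustration only; the pattern set is heavily simplified compared with the actual TwitIE implementation:

```python
import re

# Ordered alternation: URLs and RT first, so they survive as single
# tokens; hashtags/@mentions next; then emoticons, words, stray chars.
URL = r"https?://\S+"
RT = r"\bRT\b"
TAG = r"[#@]\w+"
EMOTICON = r"[:;]-?[()PD]"
WORD = r"\w+"
TOKEN = re.compile("|".join([URL, RT, TAG, EMOTICON, WORD, r"\S"]))

def orth(token):
    """Orthography feature as described above: all caps / lowercase / mixCase."""
    if token.isupper():
        return "all caps"
    if token.islower():
        return "lowercase"
    return "mixCase"

def tokenise(text):
    """Return (tokens, covering annotations): '#nike' yields two tokens
    ('#', 'nike') plus a covering HashTag span; same for @mentions."""
    tokens, covering = [], []
    for m in TOKEN.finditer(text):
        tok = m.group(0)
        if tok[0] in "#@" and len(tok) > 1:
            covering.append(("HashTag" if tok[0] == "#" else "UserID", tok))
            tokens += [tok[0], tok[1:]]
        else:
            tokens.append(tok)
    return tokens, covering
```

For the earlier example, `tokenise("RT @Apple saw http://nikeplus.com #nike :-)")` keeps the URL and the emoticon intact, while splitting the hashtag and the @mention into marker plus word.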
POS Tagging
• The accuracy of the Stanford POS tagger drops from about
97% on news to 80% on tweets (Ritter, 2011)
• An adapted POS tagger is needed, specifically for tweets
• We re-trained the Stanford POS tagger on hand-annotated tweets, IRC and news texts
• Next we compare the ANNIE POS tagger and the TwitIE POS tagger on example tweets
POS Tagging Example
• TwitIE POS tagger output on the left, ANNIE POS tagger output on the right
• The TwitIE POS tagger is described in a separate paper at RANLP 2013
• It beats Ritter (2011) and uses the full, fine-grained tag set rather than a coarse one (cf. Gimpel, 2011)
Tweet Normalisation
“RT @Bthompson WRITEZ: @libbyabrego honored?!
Everybody knows the libster is nice with it...lol...(thankkkks a
bunch;))”
OMG! I’m so guilty!!! Sprained biibii’s leg! ARGHHHHHH!!!!!!
Similar to SMS normalisation
For some components (e.g. the POS tagger or a parser) to work well, a normalised version of each token must be produced
BUT uppercasing, and letter and exclamation mark repetition, often convey strong sentiment
Therefore some systems choose not to normalise, while others keep both versions of each token
A normalised example
The normaliser is currently based on spelling correction and lists of common abbreviations
Outstanding issues:
Should new Token annotations be inserted, to make POS tagging etc. easier? For example, “trying to” is currently one annotation
Abbreviations which span token boundaries (e.g. gr8, do n’t) are difficult to handle
Capitalisation and punctuation normalisation
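A toy sketch of two of the mechanisms involved, abbreviation lookup and repeated-letter collapsing, returning both versions of each token. The abbreviation list here is illustrative, and the real normaliser also performs spelling correction:

```python
import re

# Illustrative abbreviation list; the real one is much larger
ABBREV = {"lol": "laughing out loud", "omg": "oh my god"}

def normalise(token):
    """Return (original, normalised), keeping both versions, since
    repetition and capitalisation often carry sentiment.  Runs of
    three or more identical characters are collapsed to one before
    the abbreviation lookup."""
    collapsed = re.sub(r"(.)\1{2,}", r"\1", token.lower())
    return token, ABBREV.get(collapsed, collapsed)
```

On the examples above, "thankkkks" normalises to "thanks" and "lol" to its expansion, while the original spelling is preserved alongside.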
Trying TwitIE
• The plugin is included in the latest GATE snapshot and the forthcoming 7.2 release
• Download details at: https://gate.ac.uk/wiki/twitie.html
• Available soon as a web service on the forthcoming
AnnoMarket NLP cloud marketplace:
• https://annomarket.com/
Coming Soon: TwitIE-as-a-Service
Preview of some text analytics services on AnnoMarket.com
Acknowledgements
• Kalina Bontcheva is supported by a Career Acceleration
Fellowship from the Engineering and Physical Sciences
Research Council (grant EP/I004327/1)
• This research is also partially supported by the EU-funded
FP7 TrendMiner project (http://www.trendminer-project.eu)
and the CHIST-ERA uComp project (http://www.ucomp.eu)
Thank you for your time!
Editor's notes
These are mostly politicians. Their names are often preceded by titles, and there is also a larger context, within which entity coreference helps with detection (e.g. Atef and Mohammed Atef; bin Laden and Osama bin Laden).
These are names of friends, singers, artists, sportspeople, and celebrities. They are often in lowercase, referred to by first name or surname only, and sometimes misspelled.
Hashtags: some contain locations, some contain person names, and others are phrases. As for @mentions: Ritter (2011), or a similar recent paper on Twitter NER, if I recall rightly, excluded @mentions from their evaluation on the grounds that they are trivially recognisable as persons. The point is that they are not all persons (that used to be true): @mentions now also cover locations/facilities, organisations, products, research projects, and the like. Hence, even though it is trivial to identify an @mention as a named entity, assigning it the appropriate NE type is far from a solved problem!