This document provides an overview of the GATE (General Architecture for Text Engineering) natural language processing toolkit. It discusses how GATE can be used to analyze social media texts, recognize entities and events, perform semantic search, and extract information. GATE includes components for language processing, information extraction tools, and resources for visualizing and annotating text. The document demonstrates running GATE's ANNIE information extraction system on news texts and tweets to recognize named entities. It also shows how GATE's semantic search tool Mimir can query annotated texts and semantic metadata.
1. University of Sheffield, NLP
Text Analysis with GATE
Diana Maynard
University of Sheffield
Big Social Data Workshop
Reading, UK, April 2015
Š The University of Sheffield, 2015
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike Licence
2. University of Sheffield, NLP
Outline of the Tutorial
⢠Introduction to NLP and Information Extraction
⢠Introduction to GATE
⢠Social media analysis
⢠Example applications
⢠Semantic Search, Nesta, Decarbonet.
3. University of Sheffield, NLP
We are all connected to each other...
â Information,
thoughts and
opinions are
shared prolifically
on the social web
these days
â 72% of online
adults use social
networking sites
4. University of Sheffield, NLP
Your grandmother is three times as likely to
use a social networking site now as in 2009
5. University of Sheffield, NLP
There are hundreds of tools for social media
analytics
â Most of them are commercial and not freely available
â The research tools tend to focus on specific topics and
scenarios, and aren't easily adaptable
â The analysis they do often doesn't go much beyond number
crunching, e.g.
â look at number of tweets, retweets, favourites
â filter by hashtag or keyword for topic categorisation
â use off-the-shelf sentiment tools
â use counts of word length, POS categories etc
â very little semantics, don't deal with variation, ambiguity,
slang, sarcasm etc.
6. University of Sheffield, NLP
Analysing Social Media is harder than it sounds
There are lots
of things to
think about!
7. University of Sheffield, NLP
Analysing language in social media is hard
â Grundman:politics makes #climatechange scientific issue,people
donât like knowitall rational voice tellin em wat 2do
â @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
â Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
â Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE ⌠F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
10. University of Sheffield, NLP
It is difficult to access unstructured information
efficiently
Information extraction tools can help you:
â Save time and money on management of text and data from
multiple sources
â Find hidden links scattered across huge volumes of diverse
information
â Integrate structured data from variety of sources
â Interlink text and data
â Collect information and extract new facts
11. University of Sheffield, NLP
What is Entity Recognition?
â
Entity Recognition is about recogising and classifying key Named
Entities and terms in the text
â
A Named Entity is a Person, Location, Organisation, Date etc.
â
A term is a key concept or phrase that is representative of the text
â
Entities and terms may be described in different ways but refer to
the same thing. We call this co-reference.
Mitt Romney, the favorite to win the Republican nomination for president in 2012
DatePerson Term
The GOP tweeted that they had knocked on 75,000 doors in Ohio the day prior.
Organisation
co-reference
Location
12. University of Sheffield, NLP
What is Event Recognition?
â
An event is an action or situation relevant to the domain
expressed by some relation between entities or terms.
â
It is always grounded in time, e.g. the performance of a
band, an election, the death of a person
Mitt Romney, the favorite to win the Republican nomination for president in 2012
Event DatePerson
Relation Relation
13. University of Sheffield, NLP
Why are Entities and Events Useful?
â They can help answer the âBig 5â journalism questions
(who, what, when, where, why)
â They can be used to categorise the texts in different ways
â look at all texts about Obama.
â They can be used as targets for opinion mining
â find out what people think about President Obama
â When linked to an ontology and/or combined with other
information, they can be used for reasoning about things
not explicit in the text
â seeing how opinions about different American
presidents have changed over the years
14. University of Sheffield, NLP
Approaches to Information Extraction
Knowledge Engineering
ďŹ
rule based
ďŹ
developed by
experienced language
engineers
ďŹ
make use of human
intuition
ďŹ
easier to understand
results
ďŹ
development could be
very time consuming
ďŹ
some changes may be
hard to accommodate
Learning Systems
ďŹ
use statistics or other
machine learning
ďŹ
developers do not need
LE expertise
ďŹ
requires large amounts of
annotated training data
ďŹ
some changes may
require re-annotation of
the entire training corpus
15. University of Sheffield, NLP
Seems like we need a tool to do this clever
stuff for us.
How about GATE?
16. University of Sheffield, NLP
What is GATE?
GATE is an NLP toolkit developed at the University of Sheffield
over the last 20 years.
It's open source and freely available. http://gate.ac.uk
⢠components for language processing, e.g. parsers, machine
learning tools, stemmers, IR tools, IE components for various
languages...
⢠tools for visualising and manipulating text, annotations,
ontologies, parse trees, etc.
⢠various information extraction tools
⢠evaluation and benchmarking tools
17. University of Sheffield, NLP
GATE components
â Language Resources (LRs), e.g. lexicons, corpora,
ontologies
â Processing Resources (PRs), e.g. parsers, generators,
taggers
â
Visual Resources (VRs), i.e. visualisation and editing
components
â Algorithms are separated from the data, which means:
â the two can be developed independently by users with
different expertise.
â alternative resources of one type can be used without
affecting the other, e.g. a different visual resource can be
used with the same language resource
18. University of Sheffield, NLP
ANNIE
⢠ANNIE is GATE's rule-based IE system
⢠It uses the language engineering approach (though we also
have tools in GATE for ML)
⢠Distributed as part of GATE
⢠Uses a finite-state pattern-action rule language, JAPE
⢠ANNIE contains a reusable and easily extendable set of
components:
â generic preprocessing components for tokenisation,
sentence splitting etc
â components for performing NE on general open domain text
23. University of Sheffield, NLP
Named Entity Grammars
⢠Hand-coded rules written in JAPE applied to annotations to
identify NEs
⢠Phases run sequentially and constitute a cascade of FSTs over
annotations
⢠Annotations from format analysis, tokeniser. splitter, POS tagger,
morphological analysis, gazetteer etc.
⢠Because phases are sequential, annotations can be built up over
a period of phases, as new information is gleaned
⢠Standard named entities: persons, locations, organisations, dates,
addresses, money
⢠Basic NE grammars can be adapted for new applications, domains
and languages
26. University of Sheffield, NLP
Hands-on: Running ANNIE on news texts
â
Start GATE by double clicking on the icon
â
Because ANNIE is a ready-made application, we can just load it
directly from the menu
â
Right click on Applications, select âReady-made applicationsâ
and then ANNIE
â
Create a new corpus (right click on Language Resources â
New â GATE corpus and click âOK)
â
Populate the corpus (right click on the corpus, and use the file
chooser in Directory URL to select the hands-on/corpora/news-
texts directory. Click âOpenâ and then âOKâ)
â
Double click on ANNIE and select âRun this applicationâ
â
This will now run the application on all the documents in your
corpus
27. University of Sheffield, NLP
Hands-on: Looking at the results
â Now you can inspect your annotated data: double click on a
document to open it
â Select âAnnotation Setsâ and âAnnotation Listâ from the tabs
â Click on the top arrow above âOriginal markupsâ in the right
pane
â You should see a mixture of Named Entity annotations
(Person, Location etc) and some other linguistic annotations
(Token, Sentence etc) in the Default annotation set
â Click on the name of an annotation in the right hand pane that
you want to view (you can select multiple annotations)
â Hover over the annotation in the text to get more info
â Try the Annotation Stack for a different view
29. University of Sheffield, NLP
Analysing language in social media is hard
â Grundman:politics makes #climatechange scientific issue,people
donât like knowitall rational voice tellin em wat 2do
â @adambation Try reading this article , it looks like it would be
really helpful and not obvious at all. http://t.co/mo3vODoX
â Want to solve the problem of #ClimateChange? Just #vote for a
#politician! Poof! Problem gone! #sarcasm #TVP #99%
â Human Caused #ClimateChange is a Monumental Scam!
http://www.youtube.com/watch?v=LiX792kNQeE ⌠F**k yes!!
Lying to us like MOFO's Tax The Air We Breath! F**k Them!
30. University of Sheffield, NLP
Challenges for NLP
â Noisy language: unusual punctuation, capitalisation, spelling, use
of slang, sarcasm etc.
â Terse nature of microposts such as tweets
â Use of hashtags, @mentions etc causes problems for
tokenisation #thisistricky
â Lack of context gives rise to ambiguities
â NER performs poorly on microposts, mainly because of linguistic
pre-processing failure
â Performance of standard IE tools decreases from ~90% to
~40% when run on tweets rather than news articles
31. University of Sheffield, NLP
Lack of context causes ambiguity
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
??
32. University of Sheffield, NLP
Getting the NEs right is crucial
Branching out from Lincoln park after dark ... Hello Russian
Navy, it's like the same thing but with glitter!
33. University of Sheffield, NLP
Hands-on with TwitIE
â First we need to load the TWITIE plugin
â Click on the green jigsaw-shaped icon at the top to load the
plugins manager (this takes a few seconds to load)
â Scroll down to the Twitter plugin and select âLoad nowâ.
â Click âApply Allâ and then âCloseâ.
â Now right-click on âApplicationsâ, select âReady-made
applicationsâ and âTwitIEâ
â Create another new corpus, name it âTweetsâ and populate
with texts from hands-on/corpora/tweets
â Once loaded, double click on TwitIE to open it, and then select
âRun this applicationâ (make sure the tweets corpus is chosen)
â Use the Annotation Stack view to look at Tokens in hashtags
35. University of Sheffield, NLP
GATE MĂmir
⢠can be used to index and search over text, annotations,
semantic metadata (concepts and instances)
⢠allows queries that arbitrarily mix full-text, structural, linguistic
and semantic annotations
⢠is open source
36. University of Sheffield, NLP
What can GATE MĂmir do that Google can't?
Show me:
⢠all documents mentioning a temperature between 30 and 90
degrees F (expressed in any unit)
⢠all abstracts written in French on patent documents from the last
6 months which mention any form of the word âtransistorâ in the
English abstract
⢠the names of the patent inventors of those abstracts
⢠all documents mentioning steel industries in the UK, along with
their location
37. University of Sheffield, NLP
Search News Articles for
Politicians born in Sheffield
http://demos.gate.ac.uk/mimir/gpd/search/gus
38. University of Sheffield, NLP
MIMIR: Searching Text Mining Results
â
Searching and managing text annotations, semantic
information, and full text documents in one search engine
â
Queries over annotation graphs
â
Regular expressions, Kleene operators
â
Designed to be integrated as a web service in custom end-user
systems with bespoke interfaces
â
Demos at http://services.gate.ac.uk/mimir/
39. University of Sheffield, NLP
Hands-on with Semantic Search
Try these queries on the BBC News demo:
http://services.gate.ac.uk/mimir/gpd/search/index
â Gordon Brown
â Gordon Brown said
â
Gordon Brown [0..3] root:say
â
{Person} [0..3] root:say
â
{Person.gender=female}[0..3] root:say
â
Make sure you type the queries EXACTLY or they probably won't
work!
â
Try making up some of your own queries.
40. University of Sheffield, NLP
Try your hand with some SPARQL
(for the more adventurous!)
â
{Person inst ="http://dbpedia.org/resource/Gordon_Brown"} [0..3]
root:say
â
{Person sparql="SELECT ?inst WHERE { ?inst a :Politician }"}
[0..3] root:say
â
{Person sparql = "SELECT ?inst WHERE { ?inst :party
<http://dbpedia.org/resource/Labour_Party_%28UK%29> }" }
[0..3] root:say
â
{Person sparql = "SELECT ?inst WHERE {
?inst :party <http://dbpedia.org/resource/Labour_Party_%28UK
%29> .
?inst :almaMater
<http://dbpedia.org/resource/University_of_Edinburgh> }"
} [0..3] root:say
Can you work out what these do?
41. University of Sheffield, NLP
We don't just have to look at politicians saying and
measuring things
â If we first process the text with other NLP tools such as
sentiment analysis, we can also search for positive or negative
documents
â Or positive/negative comments about certain
people/topics/things/companies
â In the DecarboNet project, we're looking at people's opinions
about topics relating to climate change, e.g. fracking
â We could index on the sentiment annotations too
â Other people are using the combination of opinion mining and
MIMIR to look at e.g. customer feedback about products /
companies on a huge corpus
49. University of Sheffield, NLP
The Political Futures Tracker
â Example of using the technology on a real scenario - analysing
political tweets in the run-up to the UK elections
â Project funded by Nesta http://www.nesta.org.uk/
â Series of blog posts about the project, leading up to the
election
http://www.nesta.org.uk/news/political-futures-tracker
â Some of our own more technical blog posts about the
underlying tools:
http://gate.ac.uk/projects/pft/
51. University of Sheffield, NLP
Topic recognition
This term is found by first
recognising the head
word âjobâ from a list
under the theme
âemploymentâ and
matching against its root
form in the text, i.e. âjobâ.
It is then extended to
include the adjectival
modifier âBritishâ, which
is not present in a list
anywhere.
53. University of Sheffield, NLP
What do the different parties talk about?
Conservative vs Labour
Transport, Europe and employment are
mentioned more by Conservatives
NHS is mentioned more by Labour
57. University of Sheffield, NLP
Summary
â Text mining is a very useful pre-requisite for doing all kinds of
more interesting things, like semantic search
â Semantic web search allows you to do much more interesting
kinds of search than a standard text-based search
â Text mining is hard, so it won't always be correct, however
â This is especially true on lower quality text such as social
media
â On the other hand, social media has some of the most
interesting material to look at
â Still plenty of work to be done, but there are lots of tools that
you can use now and get useful results
â We run annual GATE training courses where you can spend a
whole week learning all this and more!
58. University of Sheffield, NLP
Acknowledgements and further information
â Research partially supported by the European Union/EU under the Information
and Communication Technologies (ICT) theme of the 7th Framework Programme
for R&D (FP7) DecarboNet (610829) http://www.decarbonet.eu and Nesta
http://nesta.org.uk
â Annual GATE training course in June in
Sheffield:https://gate.ac.uk/family/training.html
â Download gate: http://gate.ac.uk/download
59. University of Sheffield, NLP
Relevant publications
â K Bontcheva, V Tablan, H Cunningham. Semantic Search over Documents
and Ontologies (2014) Bridging Between Information Retrieval and
Databases, 31-53
â V. Tablan, K. Bontcheva, I. Roberts, and H. Cunningham. MĂmir: An open-
source semantic search framework for interactive information seeking and
discovery. Journal of Web Semantics: Science, Services and Agents on the
World Wide Web, 2014
â A. Dietzel and D. Maynard. Climate Change: A Chance for Political Re-
Engagement? In Proc. of the Political Studies Association 65th Annual
International Conference, April 2015, Sheffield, UK.
â Diana Maynard. Challenges in Analysing Social Media. In Adrian DuĹa,
Dietrich Nelle, GĂźnter Stock and Gert G. Wagner (eds.) (2014): Facing the
Future: European Research Infrastructures for the Humanities and Social
Sciences. SCIVERO Verlag, Berlin, 2014.
â These and other relevant papers on the GATE website:
http://gate.ac.uk/gate/doc/papers.html