3. Overview
Natural Language Processing 101
The NLP pipeline
NLP tasks
NLP Challenges
NIF (NLP Interchange Format)
Monday, December 3, 12
4. NLP: What is it?
NLP or text analytics adds semantic understanding of:
named entities: people, companies, locations, etc.
pattern-based entities: email-addresses, phone numbers
concepts: abstractions of entities
facts and relationships
concrete and abstract attributes (e.g., 5 years, expensive)
subjectivity in the form of opinions, sentiments and emotions
SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
5. 80% of information relevant to businesses is in ‘unstructured’ textual form:
web pages, news and blog articles, forum postings,
other social media
email and messages
surveys, feedback forms, warranty claims
scientific literature, books, legal documents, patents
...
6. NLP: What is it for?
NLP transforms unstructured text into structured
information which may be:
categorised
queried
mined for patterns, topics or themes
presented intelligently
visualised and explored
7. NLP: Some history
1950 - 1980: Handwritten rules
Russian-English translation system
ELIZA
Since 1980: Machine learning
IBM’s Watson
8. NLP: Tasks
IMAGE SOURCE: HTTP://NLTK.ORG/IMAGES/DIALOGUE.PNG
9. Morphological/Lexical Analysis
Language identification
Tokenisation
Stemming/Lemmatisation
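The tokenisation and stemming steps can be sketched with toy stand-ins (real pipelines use libraries such as NLTK or GATE; the regex tokeniser and suffix-stripping stemmer below are illustrative simplifications):

```python
import re

def tokenise(text):
    # Naive tokeniser: words and single punctuation marks.
    # Real tokenisers also handle clitics, abbreviations and hyphens.
    return re.findall(r"\w+|[^\w\s]", text)

def stem(token):
    # Toy suffix-stripping stemmer in the spirit of Porter's algorithm;
    # a lemmatiser would instead map to dictionary forms (ran -> run).
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenise("The cats were running quickly.")
print([stem(t.lower()) for t in tokens])
# ['the', 'cat', 'were', 'runn', 'quick', '.']
```

Note the over-stemming of "running" to "runn": stems need not be real words, which is exactly why lemmatisation is listed as a separate option.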
10. Syntactic Analysis
Text segmentation
Part of Speech (POS) tagging
Chunking
Shallow Parsing
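As a sketch of chunking/shallow parsing, here is a toy rule-based noun-phrase chunker over already POS-tagged input (the determiner + adjectives + nouns pattern is an illustrative simplification; real chunkers are trained on treebanks):

```python
def chunk_nps(tagged):
    """Group determiner + adjectives + nouns into NP chunks."""
    chunks, current = [], []
    for token, pos in tagged:
        if pos == "DT":
            if current:
                chunks.append(current)
            current = [token]          # a determiner starts a new NP
        elif pos in ("JJ", "NN", "NNS"):
            current.append(token)      # adjectives/nouns extend the NP
        else:
            if current:
                chunks.append(current)
            current = []               # any other tag closes the NP
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_nps(tagged))  # ['The quick fox', 'the lazy dog']
```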
11. Semantic Analysis
Named entity recognition (NER)
Relation finding
Semantic role labelling (SRL)
Word-sense disambiguation (WSD)
Co-reference/anaphora resolution
14. Named Entity Recognition Explained
15. NER: State-of-the-Art
Statistical methods: Conditional Random Fields (CRF)
Precision: 92.15%
Recall: 92.39%
F-Measure: 92.27%
16. Precision
How many predictions were correct?
P = TP / (TP + FP)

                        ACTUAL
                        Spam                  Not Spam
PREDICTED   Spam        True Positive (TP)    False Positive (FP)
            Not Spam    False Negative (FN)   True Negative (TN)
17. Recall
Of all the instances in a class, how many were found?
R = TP / (TP + FN)

                        ACTUAL
                        Spam                  Not Spam
PREDICTED   Spam        True Positive (TP)    False Positive (FP)
            Not Spam    False Negative (FN)   True Negative (TN)
18. F-Score
Harmonic mean of Precision and Recall
F = 2 · P · R / (P + R)
[Acc = (TP + TN) / (TP + FP + FN + TN)]

                        ACTUAL
                        Spam                  Not Spam
PREDICTED   Spam        True Positive (TP)    False Positive (FP)
            Not Spam    False Negative (FN)   True Negative (TN)
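The three formulas above can be checked in a few lines of Python (the counts are made-up example numbers):

```python
def prf(tp, fp, fn, tn):
    precision = tp / (tp + fp)             # P = TP / (TP + FP)
    recall = tp / (tp + fn)                # R = TP / (TP + FN)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

p, r, f, acc = prf(tp=90, fp=10, fn=15, tn=885)
print(f"P={p:.4f}  R={r:.4f}  F={f:.4f}  Acc={acc:.4f}")
```

Note how accuracy (0.975 here) stays high even when the positive class is rare, which is why precision, recall and F-measure, not accuracy, are the standard NER evaluation metrics.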
19. Machine Learning 101
Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognised entities
SLIDE FROM: HTTP://WWW.STANFORD.EDU/CLASS/CS124/LEC/INFORMATION_EXTRACTION_AND_NAMED_ENTITY_RECOGNITION.PDF
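A minimal instance of this train/test loop, with a most-frequent-label baseline standing in for a real sequence classifier (training a CRF is beyond a slide, but the interface is the same):

```python
from collections import Counter, defaultdict

def train(labelled_docs):
    # Steps 1-4 in miniature: memorise each token's most frequent label.
    counts = defaultdict(Counter)
    for doc in labelled_docs:
        for token, label in doc:
            counts[token][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def predict(model, tokens):
    # Inference: label each token; unknown tokens default to O.
    return [(t, model.get(t, "O")) for t in tokens]

model = train([[("Meg", "I-PER"), ("Whitman", "I-PER"), ("CEO", "O"),
                ("of", "O"), ("eBay", "I-ORG")]])
print(predict(model, ["Whitman", "joined", "eBay"]))
# [('Whitman', 'I-PER'), ('joined', 'O'), ('eBay', 'I-ORG')]
```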
20. k-NN
HTTP://WWW.YOUTUBE.COM/USER/ANTALVANDENBOSCH#P/U/2/PB4QATZITLQ
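The linked video comes from the memory-based-learning (TiMBL) school; a minimal k-NN classifier over feature dictionaries using overlap similarity might look like this (a sketch, not TiMBL's actual implementation, and the example features are invented):

```python
def knn_predict(train_set, query, k=3):
    # Memory-based learning: keep all training examples and classify a
    # query by majority vote among its k most similar neighbours.
    def overlap(a, b):
        return sum(1 for key in a if b.get(key) == a[key])
    neighbours = sorted(train_set, key=lambda ex: overlap(ex[0], query),
                        reverse=True)[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

examples = [({"capitalised": True, "prev": "of"}, "ORG"),
            ({"capitalised": True, "prev": "<s>"}, "PER"),
            ({"capitalised": False, "prev": "the"}, "O")]
print(knn_predict(examples, {"capitalised": True, "prev": "of"}, k=1))  # ORG
```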
21. NER Training Data
IOB Scheme
Inside, Outside, Begin
For each type of entity there is an I-XXX and a B-XXX tag
Non-entities are tagged O
B-XXX is only used when two entities of the same type are adjacent
Assumes that named entities are non-recursive and don’t overlap
Example:
Meg    Whitman  CEO  of  eBay
I-PER  I-PER    O    O   I-ORG
SLIDE FROM: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
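Decoding IOB tags back into entity spans, under the scheme described above (I-XXX continues or opens a span, B-XXX forces a new one, O closes any open span):

```python
def iob_to_entities(tagged):
    # Collapse IOB-tagged tokens into (entity text, entity type) spans.
    entities, tokens, etype = [], [], None
    for token, tag in tagged:
        if tag == "O":
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
        else:
            prefix, _, label = tag.partition("-")
            if prefix == "B" or label != etype:
                if tokens:
                    entities.append((" ".join(tokens), etype))
                tokens = []
            tokens.append(token)
            etype = label
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities

tagged = [("Meg", "I-PER"), ("Whitman", "I-PER"), ("CEO", "O"),
          ("of", "O"), ("eBay", "I-ORG")]
print(iob_to_entities(tagged))  # [('Meg Whitman', 'PER'), ('eBay', 'ORG')]
```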
22. Features for a text learning task
Is the word capitalised?
Is the word at the start of a sentence?
What is the part-of-speech tag?
Previous and following words
Info from gazetteers
Useful features help your learner; badly chosen features may harm it
SLIDE BASED ON: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
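The first four feature types above, as a sketch of a per-token feature extractor (POS tags and gazetteer hits would be added as further entries in the same dictionary):

```python
def token_features(tokens, i):
    # Features for token i: capitalisation, sentence position and the
    # neighbouring words, each as one entry of a feature dictionary.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalised": tok[:1].isupper(),
        "sentence_start": i == 0,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(token_features(["Meg", "Whitman", "leads", "eBay"], 1))
```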
23. Relation Finding Explained
[Image: example relation between the taxa Amphibia and Anura]
24. Relation Finding: State-of-the-Art
Induce relation dictionaries using slot filling (AutoSlog)
Example-based learning (Snowball)
Pattern recognition over shallow parses (LEILA)
25. Relation Finding: pattern finding over shallow parses

candidate relation                 frequency   rating
is a municipality and a town in    45          +
is a municipality and a city in    19          +
is a municipality in               10          +
is one of the five districts of     5          -
is the name of two provinces in     5          -
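The frequency column above comes from counting, over a corpus, the strings that appear between pairs of known entities. A toy version of that counting step (the corpus sentences and entity pair are made-up):

```python
from collections import Counter

def pattern_counts(sentences, subj, obj):
    # Count the token strings between two entities when both occur,
    # subject first: candidate relation patterns with frequencies.
    counts = Counter()
    for s in sentences:
        if subj in s and obj in s and s.index(subj) < s.index(obj):
            middle = s[s.index(subj) + len(subj):s.index(obj)].strip()
            counts[middle] += 1
    return counts

corpus = ["Haarlem is a municipality and a city in the Netherlands."]
print(pattern_counts(corpus, "Haarlem", "the Netherlands"))
# Counter({'is a municipality and a city in': 1})
```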
26. RL for domain modelling
[Diagram: a learned domain model linking classes such as Species, Genus, Family, Order, Class, Type, Town, Location, Country and Province via relations such as "is a", "is found in", "is a town in", "is a municipality in", "occur in" and "may refer to", each weighted with a confidence score between 0.333 and 1.000]
27. RL for template filling

Date         Ship             Type          Crew   Ransom
2005/04/10   Feisty Gas       LNG carrier   12     $315,000
2005/06/27   Semlow           Freighter     10     $50,000
2005/10/28   Panagia          Bulk carrier  22     $700,000
2005/11/05   Seabourn Spirit  Cruise ship   210    none
29. Opinion Mining: State-of-the-Art
Supervised learning using features such as:
opinion words and phrases
negation
part-of-speech-tags
dependency parsing
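A toy illustration of the first two feature types, opinion words and negation. The word lists are invented for the example; supervised systems learn such features from labelled data rather than hard-coding them:

```python
POSITIVE = {"nice", "cool", "clear", "good"}    # hypothetical lexicon
NEGATIVE = {"mad", "expensive", "bad"}          # hypothetical lexicon
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(tokens):
    # +1/-1 per opinion word, with polarity flipped when a negator
    # appears among the two preceding tokens.
    score = 0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if polarity and any(t in NEGATORS for t in tokens[max(0, i - 2):i]):
            polarity = -polarity
        score += polarity
    return score

print(lexicon_sentiment("the touch screen was really cool".split()))  # 1
print(lexicon_sentiment("the battery life was not good".split()))     # -1
```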
30. Positive or negative?
“I bought an iPhone a few days ago. It was such a
nice phone. The touch screen was really cool. The
voice quality was clear too. Although the battery
life was not long, that is ok for me. However, my
mother was mad with me as I did not tell her
before I bought it. She also thought the phone was
too expensive, and wanted me to return it to the
shop. … ”
EXAMPLE FROM: BING LIU, "SENTIMENT ANALYSIS AND SUBJECTIVITY", IN: HANDBOOK OF NATURAL LANGUAGE PROCESSING, 2ND EDITION, N. INDURKHYA AND F. J. DAMERAU (EDS.), 2010.
31. IBM’s Watson
HTTP://WWW.YOUTUBE.COM/WATCH?V=DYWO4ZKSFXW
32. NLP: Challenges
Negation
Messy text (Twitter and SMS language)
Domain adaptation
Cross- and multi-document text analysis
Resource-scarce languages
36. NIF: Why do we need it?
Integration of NLP tools
Bridge between LOD and NLP communities
37. NIF Claims
1. NIF provides global interoperability: if an NLP tool incorporates a NIF parser and a NIF serializer, it is compatible with all other tools that implement NIF.
2. NIF achieves this interoperability by using and defining a common denominator for annotations. This means that some standard annotations are required. On the other hand, NIF is flexible and allows NLP tools to add any extra annotations at will.
3. NIF allows you to create tool chains without a large amount of up-front development work. As the output of each tool is compatible, you can quickly test whether the tools you selected actually produce what you need to solve a given task.
4. As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:
RDF makes data integration easy: URIs, Linked Data
OWL is based on Description Logics (types, type inheritance)
Availability of open data sets (access and licence)
Reusability of vocabularies and ontologies
Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
Scalable tool support (databases, reasoning)
Data is flexible and can be queried/transformed in many ways
38. Structural Interoperability
NIF specifies how to create an identifier for uniquely locating arbitrary substrings in a document
either using offset- or context-hash-based URIs
String Ontology to describe strings
Structured Sentence Ontology
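A sketch of the offset-based variant: the fragment identifier encodes the begin and end character offsets of the substring. The exact URI scheme differs across NIF versions, so treat the `offset_<begin>_<end>` form below as illustrative:

```python
def nif_offset_uri(doc_uri, text, substring):
    # Locate the substring and build an offset-based identifier for it
    # (first occurrence only; the document URI is a made-up example).
    begin = text.index(substring)
    end = begin + len(substring)
    return f"{doc_uri}#offset_{begin}_{end}"

text = "Meg Whitman is the CEO of eBay."
print(nif_offset_uri("http://example.org/doc1", text, "Meg Whitman"))
# http://example.org/doc1#offset_0_11
```

Because the identifier is derived only from the document URI and character offsets, any tool that sees the same document can mint the same URI for the same substring, which is what makes the annotations mergeable.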
39. Conceptual Interoperability
Lemma and stem annotations are data type properties in the Structured Sentence Ontology
POS tags use OLiA (Ontologies of Linguistic Annotation)
NER tags use the vocabulary of the Semantic Content Management Systems (SCMS) EU project
40. Access Interoperability
Main interface: wrapper to NIF Web service
IMG: HTTP://NLP2RDF.ORG/FILES/2011/09/NIF_ARCHITECTURE.PNG
41. NLP/NIF: Wrap up
NLP History and tasks
Machine learning 101
Use cases: NER, relation finding and opinion mining
Interoperability of NLP results with NIF
42. Further reading/Tools
Peter Jackson and Isabelle Moulinier (2007) Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. John Benjamins. ISBN: 9027249938
ACL Anthology: A Digital Archive of Research
Papers in Computational Linguistics
Machine learning: WEKA
Natural language processing: GATE