3. Overview
Natural Language Processing 101
The NLP pipeline
NLP tasks
NLP Challenges
NIF (NLP Interchange Format)
Monday, December 3, 12
4. NLP: What is it?
NLP or text analytics adds semantic understanding of:
named entities: people, companies, locations, etc.
pattern-based entities: email-addresses, phone numbers
concepts: abstractions of entities
facts and relationships
concrete and abstract attributes (e.g., 5 years, expensive)
subjectivity in the form of opinions, sentiments and emotions
SLIDE INSPIRATION: HTTP://WWW.SLIDESHARE.NET/SETHGRIMES/TEXT-ANALYTICS-OVERVIEW-2011
5. 80% of information relevant to businesses is in ‘unstructured’ textual form:
web pages, news and blog articles, forum postings,
other social media
email and messages
surveys, feedback forms, warranty claims
scientific literature, books, legal documents, patents
...
6. NLP: What is it for?
NLP transforms unstructured text into structured
information which may be:
categorised
queried
mined for patterns, topics or themes
presented intelligently
visualised and explored
7. NLP: Some history
1950 - 1980: Handwritten rules
Russian-English translation system
ELIZA
Since 1980: Machine learning
IBM’s Watson
8. NLP: Tasks
IMAGE SOURCE: HTTP://NLTK.ORG/IMAGES/DIALOGUE.PNG
9. Morphological/Lexical Analysis
Language identification
Tokenisation
Stemming/Lemmatisation
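The tokenisation and stemming steps can be sketched with toy stand-ins (real pipelines use libraries such as NLTK or GATE; the regex tokeniser and suffix-stripping stemmer below are illustrative simplifications):

```python
import re

def tokenise(text):
    # Naive tokeniser: words and single punctuation marks.
    # Real tokenisers also handle clitics, abbreviations and hyphens.
    return re.findall(r"\w+|[^\w\s]", text)

def stem(token):
    # Toy suffix-stripping stemmer in the spirit of Porter's algorithm;
    # a lemmatiser would instead map to dictionary forms (ran -> run).
    for suffix in ("ing", "ly", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

tokens = tokenise("The cats were running quickly.")
print([stem(t.lower()) for t in tokens])
# ['the', 'cat', 'were', 'runn', 'quick', '.']
```

Note the over-stemming of "running" to "runn": stems need not be real words, which is exactly why lemmatisation is listed as a separate option.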
10. Syntactic Analysis
Text segmentation
Part of Speech (POS) tagging
Chunking
Shallow Parsing
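As a sketch of chunking/shallow parsing, here is a toy rule-based noun-phrase chunker over already POS-tagged input (the determiner + adjectives + nouns pattern is an illustrative simplification; real chunkers are trained on treebanks):

```python
def chunk_nps(tagged):
    """Group determiner + adjectives + nouns into NP chunks."""
    chunks, current = [], []
    for token, pos in tagged:
        if pos == "DT":
            if current:
                chunks.append(current)
            current = [token]          # a determiner starts a new NP
        elif pos in ("JJ", "NN", "NNS"):
            current.append(token)      # adjectives/nouns extend the NP
        else:
            if current:
                chunks.append(current)
            current = []               # any other tag closes the NP
    if current:
        chunks.append(current)
    return [" ".join(c) for c in chunks]

tagged = [("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]
print(chunk_nps(tagged))  # ['The quick fox', 'the lazy dog']
```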
11. Semantic Analysis
Named entity recognition (NER)
Relation finding
Semantic role labelling (SRL)
Word-sense disambiguation (WSD)
Co-reference/anaphora resolution
14. Named Entity Recognition Explained
15. NER: State-of-the-Art
Statistical methods: Conditional Random Fields (CRF)
Precision: 92.15%
Recall: 92.39%
F-Measure: 92.27%
16. Precision
How many predictions were correct?
P = TP / (TP + FP)

                        ACTUAL
                        Spam                  Not Spam
PREDICTED   Spam        True Positive (TP)    False Positive (FP)
            Not Spam    False Negative (FN)   True Negative (TN)
17. Recall
Of all the instances in a class, how many were found?
R = TP / (TP + FN)

                        ACTUAL
                        Spam                  Not Spam
PREDICTED   Spam        True Positive (TP)    False Positive (FP)
            Not Spam    False Negative (FN)   True Negative (TN)
18. F-Score
Harmonic mean of Precision and Recall
F = 2 · P · R / (P + R)
[Acc = (TP + TN) / (TP + FP + FN + TN)]

                        ACTUAL
                        Spam                  Not Spam
PREDICTED   Spam        True Positive (TP)    False Positive (FP)
            Not Spam    False Negative (FN)   True Negative (TN)
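The three formulas above can be checked in a few lines of Python (the counts are made-up example numbers):

```python
def prf(tp, fp, fn, tn):
    precision = tp / (tp + fp)             # P = TP / (TP + FP)
    recall = tp / (tp + fn)                # R = TP / (TP + FN)
    f_score = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_score, accuracy

p, r, f, acc = prf(tp=90, fp=10, fn=15, tn=885)
print(f"P={p:.4f}  R={r:.4f}  F={f:.4f}  Acc={acc:.4f}")
```

Note how accuracy (0.975 here) stays high even when the positive class is rare, which is why precision, recall and F-measure, not accuracy, are the standard NER evaluation metrics.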
19. Machine Learning 101
Training
1. Collect a set of representative training documents
2. Label each token for its entity class or other (O)
3. Design feature extractors appropriate to the text and classes
4. Train a sequence classifier to predict the labels from the data
Testing
1. Receive a set of testing documents
2. Run sequence model inference to label each token
3. Appropriately output the recognised entities
SLIDE FROM: HTTP://WWW.STANFORD.EDU/CLASS/CS124/LEC/INFORMATION_EXTRACTION_AND_NAMED_ENTITY_RECOGNITION.PDF
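A minimal instance of this train/test loop, with a most-frequent-label baseline standing in for a real sequence classifier (training a CRF is beyond a slide, but the interface is the same):

```python
from collections import Counter, defaultdict

def train(labelled_docs):
    # Steps 1-4 in miniature: memorise each token's most frequent label.
    counts = defaultdict(Counter)
    for doc in labelled_docs:
        for token, label in doc:
            counts[token][label] += 1
    return {tok: c.most_common(1)[0][0] for tok, c in counts.items()}

def predict(model, tokens):
    # Inference: label each token; unknown tokens default to O.
    return [(t, model.get(t, "O")) for t in tokens]

model = train([[("Meg", "I-PER"), ("Whitman", "I-PER"), ("CEO", "O"),
                ("of", "O"), ("eBay", "I-ORG")]])
print(predict(model, ["Whitman", "joined", "eBay"]))
# [('Whitman', 'I-PER'), ('joined', 'O'), ('eBay', 'I-ORG')]
```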
20. k-NN
HTTP://WWW.YOUTUBE.COM/USER/ANTALVANDENBOSCH#P/U/2/PB4QATZITLQ
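The linked video comes from the memory-based-learning (TiMBL) school; a minimal k-NN classifier over feature dictionaries using overlap similarity might look like this (a sketch, not TiMBL's actual implementation, and the example features are invented):

```python
def knn_predict(train_set, query, k=3):
    # Memory-based learning: keep all training examples and classify a
    # query by majority vote among its k most similar neighbours.
    def overlap(a, b):
        return sum(1 for key in a if b.get(key) == a[key])
    neighbours = sorted(train_set, key=lambda ex: overlap(ex[0], query),
                        reverse=True)[:k]
    labels = [label for _, label in neighbours]
    return max(set(labels), key=labels.count)

examples = [({"capitalised": True, "prev": "of"}, "ORG"),
            ({"capitalised": True, "prev": "<s>"}, "PER"),
            ({"capitalised": False, "prev": "the"}, "O")]
print(knn_predict(examples, {"capitalised": True, "prev": "of"}, k=1))  # ORG
```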
21. NER Training Data
IOB Scheme
Inside, Outside, Begin
For each type of entity there is an I-XXX and a B-XXX tag
Non-entities are tagged O
B-XXX is only used when two entities of the same type are adjacent
Assumes that named entities are non-recursive and don’t overlap
Example:
Meg    Whitman  CEO  of  eBay
I-PER  I-PER    O    O   I-ORG
SLIDE FROM: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
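Decoding IOB tags back into entity spans, under the scheme described above (I-XXX continues or opens a span, B-XXX forces a new one, O closes any open span):

```python
def iob_to_entities(tagged):
    # Collapse IOB-tagged tokens into (entity text, entity type) spans.
    entities, tokens, etype = [], [], None
    for token, tag in tagged:
        if tag == "O":
            if tokens:
                entities.append((" ".join(tokens), etype))
            tokens, etype = [], None
        else:
            prefix, _, label = tag.partition("-")
            if prefix == "B" or label != etype:
                if tokens:
                    entities.append((" ".join(tokens), etype))
                tokens = []
            tokens.append(token)
            etype = label
    if tokens:
        entities.append((" ".join(tokens), etype))
    return entities

tagged = [("Meg", "I-PER"), ("Whitman", "I-PER"), ("CEO", "O"),
          ("of", "O"), ("eBay", "I-ORG")]
print(iob_to_entities(tagged))  # [('Meg Whitman', 'PER'), ('eBay', 'ORG')]
```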
22. Features for a text learning task
Is the word capitalised?
Is the word at the start of a sentence?
What is the part-of-speech tag?
Previous and following words
Info from gazetteers
Useful features help your learner; badly chosen features may harm it
SLIDE BASED ON: HTTP://WWW.INF.ED.AC.UK/TEACHING/COURSES/EMNLP/SLIDES/EMNLP07.PDF
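The first four feature types above, as a sketch of a per-token feature extractor (POS tags and gazetteer hits would be added as further entries in the same dictionary):

```python
def token_features(tokens, i):
    # Features for token i: capitalisation, sentence position and the
    # neighbouring words, each as one entry of a feature dictionary.
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalised": tok[:1].isupper(),
        "sentence_start": i == 0,
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(token_features(["Meg", "Whitman", "leads", "eBay"], 1))
```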
23. Relation Finding Explained
[Image: example relation between the taxa Amphibia and Anura]
24. Relation Finding: State-of-the-Art
Induce relation dictionaries using slot filling (AutoSlog)
Example-based learning (Snowball)
Pattern recognition over shallow parses (LEILA)
25. Relation Finding: pattern finding over shallow parses

candidate relation                 frequency   rating
is a municipality and a town in    45          +
is a municipality and a city in    19          +
is a municipality in               10          +
is one of the five districts of     5          -
is the name of two provinces in     5          -
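The frequency column above comes from counting, over a corpus, the strings that appear between pairs of known entities. A toy version of that counting step (the corpus sentences and entity pair are made-up):

```python
from collections import Counter

def pattern_counts(sentences, subj, obj):
    # Count the token strings between two entities when both occur,
    # subject first: candidate relation patterns with frequencies.
    counts = Counter()
    for s in sentences:
        if subj in s and obj in s and s.index(subj) < s.index(obj):
            middle = s[s.index(subj) + len(subj):s.index(obj)].strip()
            counts[middle] += 1
    return counts

corpus = ["Haarlem is a municipality and a city in the Netherlands."]
print(pattern_counts(corpus, "Haarlem", "the Netherlands"))
# Counter({'is a municipality and a city in': 1})
```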
26. RL for domain modelling
[Diagram: a learned domain model linking classes such as Species, Genus, Family, Order, Class, Type, Town, Location, Country and Province via relations such as "is a", "is found in", "is a town in", "is a municipality in", "occur in" and "may refer to", each weighted with a confidence score between 0.333 and 1.000]
27. RL for template filling

Date         Ship             Type          Crew   Ransom
2005/04/10   Feisty Gas       LNG carrier   12     $315,000
2005/06/27   Semlow           Freighter     10     $50,000
2005/10/28   Panagia          Bulk carrier  22     $700,000
2005/11/05   Seabourn Spirit  Cruise ship   210    none
29. Opinion Mining: State-of-the-Art
Supervised learning using features such as:
opinion words and phrases
negation
part-of-speech-tags
dependency parsing
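A toy illustration of the first two feature types, opinion words and negation. The word lists are invented for the example; supervised systems learn such features from labelled data rather than hard-coding them:

```python
POSITIVE = {"nice", "cool", "clear", "good"}    # hypothetical lexicon
NEGATIVE = {"mad", "expensive", "bad"}          # hypothetical lexicon
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(tokens):
    # +1/-1 per opinion word, with polarity flipped when a negator
    # appears among the two preceding tokens.
    score = 0
    for i, tok in enumerate(tokens):
        polarity = 1 if tok in POSITIVE else -1 if tok in NEGATIVE else 0
        if polarity and any(t in NEGATORS for t in tokens[max(0, i - 2):i]):
            polarity = -polarity
        score += polarity
    return score

print(lexicon_sentiment("the touch screen was really cool".split()))  # 1
print(lexicon_sentiment("the battery life was not good".split()))     # -1
```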
30. Positive or negative?
“I bought an iPhone a few days ago. It was such a
nice phone. The touch screen was really cool. The
voice quality was clear too. Although the battery
life was not long, that is ok for me. However, my
mother was mad with me as I did not tell her
before I bought it. She also thought the phone was
too expensive, and wanted me to return it to the
shop. … ”
EXAMPLE FROM: BING LIU, "SENTIMENT ANALYSIS AND SUBJECTIVITY", IN: HANDBOOK OF NATURAL LANGUAGE PROCESSING, 2ND EDITION, N. INDURKHYA AND F. J. DAMERAU (EDS.), 2010.
31. IBM’s Watson
HTTP://WWW.YOUTUBE.COM/WATCH?V=DYWO4ZKSFXW
32. NLP: Challenges
Negation
Messy text (Twitter and SMS language)
Domain adaptation
Cross- and multi-document text analysis
Resource-scarce languages
36. NIF: Why do we need it?
Integration of NLP tools
Bridge between LOD and NLP communities
37. NIF Claims
1. NIF provides global interoperability: if an NLP tool incorporates a NIF parser and a NIF serializer, it is compatible with all other tools that implement NIF.
2. NIF achieves this interoperability by using and defining a common denominator for annotations. This means that some standard annotations are required. On the other hand, NIF is flexible and allows NLP tools to add any extra annotations at will.
3. NIF allows you to create tool chains without a large amount of up-front development work. As the output of each tool is compatible, you can quickly test whether the tools you selected actually produce what you need to solve a given task.
4. As NIF is based on RDF/OWL, you can choose from a broad range of tools and technologies to work with it:
RDF makes data integration easy: URIs, Linked Data
OWL is based on Description Logics (types, type inheritance)
Availability of open data sets (access and licence)
Reusability of vocabularies and ontologies
Diverse serializations for annotations: XML, Turtle, RDFa+XHTML
Scalable tool support (databases, reasoning)
Data is flexible and can be queried/transformed in many ways
38. Structural Interoperability
NIF specifies how to create an identifier for uniquely locating arbitrary substrings in a document
either using offset- or context-hash-based URIs
String Ontology to describe strings
Structured Sentence Ontology
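A sketch of the offset-based variant: the fragment identifier encodes the begin and end character offsets of the substring. The exact URI scheme differs across NIF versions, so treat the `offset_<begin>_<end>` form below as illustrative:

```python
def nif_offset_uri(doc_uri, text, substring):
    # Locate the substring and build an offset-based identifier for it
    # (first occurrence only; the document URI is a made-up example).
    begin = text.index(substring)
    end = begin + len(substring)
    return f"{doc_uri}#offset_{begin}_{end}"

text = "Meg Whitman is the CEO of eBay."
print(nif_offset_uri("http://example.org/doc1", text, "Meg Whitman"))
# http://example.org/doc1#offset_0_11
```

Because the identifier is derived only from the document URI and character offsets, any tool that sees the same document can mint the same URI for the same substring, which is what makes the annotations mergeable.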
39. Conceptual Interoperability
Lemma and stem annotations are data type properties in the Structured Sentence Ontology
POS tags use OLiA (Ontologies of Linguistic Annotation)
NER tags use the vocabulary of the Semantic Content Management Systems (SCMS) EU project
40. Access Interoperability
Main interface: wrapper to NIF Web service
IMG: HTTP://NLP2RDF.ORG/FILES/2011/09/NIF_ARCHITECTURE.PNG
41. NLP/NIF: Wrap up
NLP History and tasks
Machine learning 101
Use cases: NER, relation finding and opinion mining
Interoperability of NLP results with NIF
42. Further reading/Tools
Peter Jackson and Isabelle Moulinier (2007) Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. John Benjamins. ISBN: 9027249938
ACL Anthology: A Digital Archive of Research
Papers in Computational Linguistics
Machine learning: WEKA
Natural language processing: GATE