This is an update of a talk I originally gave in 2010. I had intended to make a wholesale update to all the slides, but noticed that one of them was worth keeping verbatim: a snapshot of the state of the art back then (see slide 38). Less than a decade has passed since then but there are some interesting and noticeable changes. For example, there was no word2vec, GloVe or fastText, or any of the neurally-inspired distributed representations and frameworks that are now so popular. Also no mention of sentiment analysis (maybe that was an oversight on my part, but I rather think that what we perceive as a commodity technology now was just not sufficiently mainstream back then).
Also if you compare with Jurafsky and Martin's current take on the state of the art (see slide 39), you could argue that POS tagging, NER, IE and MT have all made significant progress too (which I would agree with). I am not sure I share their view that summarisation is in the 'still really hard' category; but like many things, it depends on how & where you set the quality bar.
2. Tony Russell-Rose, PhD
Director, UXLabs
Led applied research & technology innovation at Microsoft, Canon, Reuters, BT
Labs, HP Labs, Endeca
Visiting Professor of Cognitive Computing & AI, Essex University. Founder of
Search Solutions conference series
Founder, 2dSearch: next generation advanced search
PhD (natural language interfaces), MSc in human-computer interaction, first
degree in engineering (human factors)
5 patents for search user interface design. 80+ scientific and technical papers
on AI/NLP, information retrieval and user experience
Honorary Visiting Fellow at City University London (Centre for Interactive
Systems Research)
Blog: http://isquared.wordpress.com
Industry Expertise
Media & Publishing
Healthcare
Business Intelligence
Financial Services
eCommerce
Domain Expertise
UX research
Information architecture
Search & information
retrieval
AI + machine learning
2Intro to Natural Language Processing
3. Course outline
Background & terminology
Fundamentals
Techniques & applications
Further reading
16-Jan-19 Intro to Natural Language Processing 3
5. As humans we do it effortlessly ... don’t we?
DRUNK GETS NINEYEARS INVIOLIN CASE
PROSTITUTESAPPEALTO POPE
STOLEN PAINTING FOUND BYTREE
REDTAPE HOLDS UP NEW BRIDGE
DEER KILL 300,000
RESIDENTS CAN DROP OFFTREES
INCLUDE CHILDRENWHEN BAKING COOKIES
MINERS REFUSETOWORK AFTER DEATH
16-Jan-19 Intro to Natural Language Processing 5
6. Polysemy
One word maps to
many concepts
e.g. bat
Synonymy
One concept maps to
many words
16-Jan-19 Intro to Natural Language Processing 6
7. Word order
Venetian blind vs.
Blind venetian
16-Jan-19 Intro to Natural Language Processing 7
8. Language is generative
Starbucks coffee is the best
The place I like most when I need to feed my caffeine
addiction is the company from Seattle with branches
everywhere
Many different ways to express a given idea
Synonymy, paraphrase, metaphor, etc.
16-Jan-19 Intro to Natural Language Processing 8
9. Language is changing
I want to buy a mobile
Ill-formed input
“accomodation office”
Co-ordination, negation, etc.
This is not a talk about neuro-linguistic programming
Multi-linguality
Claudia Schiffer is on the cover of Elle
Sarcasm, irony, slang, jargon, etc.
That was a wicked lecture
Yep – the coffee break was the best part
16-Jan-19 Intro to Natural Language Processing 9
10. Dan Jurafsky
non-standard English
Great job @justinbieber! Were
SOO PROUD of what youve
accomplished! U taught us 2
#neversaynever & you yourself
should never give up either♥
segmentation issues idioms
dark horse
get cold feet
lose face
throw in the towel
neologisms
unfriend
Retweet
bromance
tricky entity names
Where is A Bug’s Life playing …
Let It Be was recorded …
… a mutation on the for gene …
world knowledge
Mary and Sue are sisters.
Mary and Sue are mothers.
But that’s what makes it fun!
the New York-New Haven Railroad
the New York-New Haven Railroad
Why else is natural language
understanding difficult?
11. Text Analytics: a set of linguistic, analytical and
predictive techniques to extract structure and meaning
from unstructured documents
NLP: language can be spoken as well as written
Text Analytics ~= Natural Language Processing ~=Text Mining
16-Jan-19 Intro to Natural Language Processing 11
12. Computational Linguistics
Use of computational techniques to study linguistic
phenomena
Cognitive science
Study of human information processing (perception,
language, reasoning, etc.)
Computer science
Theoretical foundations of computation and practical
techniques for implementation
Information science
Analysis, classification, manipulation, retrieval and
dissemination of information
16-Jan-19 Intro to Natural Language Processing 12
13. Symbolic approaches
Rule-based, hand coded (by linguists/SMEs)
Knowledge-intensive
Statistical approaches
Distributional & neural approaches, supervised or
unsupervised
Data-intensive
16-Jan-19 Intro to Natural Language Processing 13
14. Language is AMBIGUOUS
To determine structure, we must resolve ambiguity!
Lexical analysis (tokenisation)
The cat sat on the mat
I can’t tokenise this sentence
Stop word removal
No definitive list
The Who, The The, Take That…
To be or not to be
16-Jan-19 Intro to Natural Language Processing 14
15. Stemming
fishing, fished, fish, fisher -> fish
Lemmatization
Linguistically principled analysis
Passing -> pass + ING
Were -> be + PAST
Delegate = de-leg-ate (?)
Ratify = rat-ify (?)
Morphology (prefixes, suffixes, etc.)
Gebäudereinigungsfirmenangestellter -> Gebäude +
Reinigung + Firma + Angestellter (building + cleaning +
company + employee)
16-Jan-19 Intro to Natural Language Processing 15
16. Syntax (part of speech tagging)
book -> NOUN, VERB
that -> DETERMINER
flight -> NOUN
Book that flight -> VB DT NN
Ambiguity problem
Time flies like an arrow -> ?
Fruit flies like a banana -> ?
Eats shoots and leaves -> ?
16-Jan-19 Intro to Natural Language Processing 16
17. Parsing (grammar)
I saw a venetian blind
I saw a blind venetian
I saw the man on the hill with
a telescope
Rugby is a game played by men
with odd-shaped balls
Sentence boundary
detection
Punctuation denotes the end of
a sentence!
“But not always!”, said Fred...
16-Jan-19 Intro to Natural Language Processing 17
18. ELIZA
Wiezenbaum 1966
Simple pattern matching
PARRY
Colby et al 1972
Pattern matching & planning
SHRDLU
Winograd 1972
Natural language understanding
Comprehensive grammar of English
16-Jan-19 Intro to Natural Language Processing 18
19. Differentiate between useful index terms and
‘noise’
Help lexicographers identify new
terminology:
the C4 carbons of the nicotinamide rings
X-ray therapy vs. X-ray therapies
PAHO vs. Pan-American Health Organization
Term extraction systems process scientific
papers to identify terminology
16-Jan-19 Intro to Natural Language Processing 19
20. Identification of key concepts,
e.g. people, places, organisations, etc.
Also postcodes, temporal/numerical expressions, etc.
“Mexico has been trying to stage a recovery since the
beginning of this year and it's always been getting
ahead of itself in terms of fundamentals,” said Matthew
Hickman of Lehman Brothers in NewYork.”
16-Jan-19 Intro to Natural Language Processing 20
Persons Organisations Cities Countries
Matthew Hickman Lehman Brothers NewYork Mexico
21. Entities + relationships
Based on pre-defined structures, e.g.
movements of company executives
corporate mergers / acquisitions
interactions between genes and proteins
Example (from MUC-6):
Now, Mr.James is preparing to sail into the sunset, and Mr. Dooner is poised to
rev up the engines to guide Interpublic Group's McCann-Erickson into the 21st
century.Yesterday, McCann made official what had been widely anticipated: Mr.
James, 57 years old, is stepping down as chief executive officer on July 1 and will
retire as chairman at the end of the year. He will be succeeded by Mr. Dooner, 45.
16-Jan-19 Intro to Natural Language Processing 21
22. <SUCCESSION_EVENT-2> :=
SUCCESSION_ORG:
<ORGANIZATION-1>
POST: "chairman"
IN_AND_OUT: <IN_AND_OUT-4>
VACANCY_REASON: DEPART_WORKFORCE
<IN_AND_OUT-4> :=
IO_PERSON: <PERSON-1>
NEW_STATUS: IN
ON_THE_JOB: NO
OTHER_ORG: <ORGANIZATION-1>
REL_OTHER_ORG: SAME_ORG
<ORGANIZATION-1> :=
ORG_NAME: "McCann-Erickson"
ORG_ALIAS: "McCann"
ORG_TYPE: COMPANY
<PERSON-1> :=
PER_NAME: "John J. Dooner Jr."
PER_ALIAS: "John Dooner" "Dooner"
16-Jan-19 Intro to Natural Language Processing 22
23. Can now use as metadata (for retrieval)
Or store in DB and query against it, e.g.
“show me all the events where a finance officer left
a company to take up the position of CEO in another
company”
16-Jan-19 Intro to Natural Language Processing 23
24. Dan Jurafsky
Information Extraction
Subject: curriculum meeting
Date: January 15, 2012
To: Dan Jurafsky
Hi Dan, we’ve now scheduled the curriculum meeting.
It will be in Gates 159 tomorrow from 10:00-11:30.
-Chris
24
Create new Calendar entry
Event: Curriculum mtg
Date: Jan-16-2012
Start: 10:00am
End: 11:30am
Where: Gates 159
25. Anaphora resolution relies on context
(semantic / pragmatic knowledge):
We gave the bananas to the monkeys because they were hungry.
We gave the bananas to the monkeys because they were ripe.
We gave the bananas to the monkeys because they were here.
16-Jan-19 Intro to Natural Language Processing 25
26. Resolve ambiguities due to polysemy
Often characterised by syntax, e.g.
Magnesium is a light metal (ADJ)
The light in the kitchen is quite bright (NOUN)
Can use a Part of SpeechTagger to resolve
broad ambiguities
PoS tags = NN, NNP, NNS, NNPS, etc.
16-Jan-19 Intro to Natural Language Processing 26
27. How do we measure NLP accuracy?
Compare against human annotation
However:
Humans often disagree
▪ Particularly for complex tasks
▪ IAA places an upper bound on performance
Comparison can be measured in many ways
▪ “Bill Gates is chairman of Microsoft” -> “Gates” = person?
16-Jan-19 Intro to Natural Language Processing 27
28. Human-computer interfaces (chatbots etc.)
Text classification
Text summarisation
Machine translation
Speech recognition & synthesis
Natural language generation
Text mining
Question answering
Sentiment Analysis
16-Jan-19 Intro to Natural Language Processing 28
29. Media monitoring
classify incoming news stories
Search engines:
classify query intent, e.g. search
Google for ‘LOG313’
Spam detection
16-Jan-19 Intro to Natural Language Processing 29
30. Dan Jurafsky
Machine Translation
• Fully automatic
30
• Helping human translators
Enter Source Text:
Translation from Stanford’s Phrasal:
这 不过 是 一 个 时间 的 问题 .
This is only a matter of time.
31. Summarization
single-document vs.
multi-document
Search results
MSWord
Research / analysis
tools
16-Jan-19 Intro to Natural Language Processing 31
32. Chatbots
Smart speakers
Smartphone assistants
Call handling systems
Travel
Hospitality
Banking
16-Jan-19 Intro to Natural Language Processing 32
33. Analogy with Data Mining
Discover or infer new knowledge from unstructured
text resources
A <-> B and B <-> C
Infer A <-> C?
e.g. link between migraine headaches and
magnesium deficiency
Applications in life sciences, media/publishing,
counter terrorism and competitive intelligence
16-Jan-19 Intro to Natural Language Processing 33
34. Going beyond the
document retrieval
paradigm
provide specific answers to
specific questions
Introduced as aTREC track
in 1999
16-Jan-19 Intro to Natural Language Processing 34
35. Dan Jurafsky
Question Answering: IBM’s Watson
• Won Jeopardy on February 16, 2011!
35
WILLIAM WILKINSON’S
“AN ACCOUNT OF THE PRINCIPALITIES OF
WALLACHIA AND MOLDOVIA”
INSPIRED THIS AUTHOR’S
MOST FAMOUS NOVEL
Bram Stoker
36. Identify and extract subjective information
Several sub-tasks:
Identify polarity, e.g. of movie reviews
▪ e.g. positive, negative, or neutral
Identify emotional states
▪ e.g. angry, sad, happy, etc.
Subjectivity/objectivity identification
▪ E.g. “fact” from opinion
Feature/aspect-based
▪ Differentiate between specific features or aspects of entities
16-Jan-19 Intro to Natural Language Processing 36
37. Dan Jurafsky
Information Extraction & Sentiment Analysis
• nice and compact to carry!
• since the camera is small and light, I won't need to carry
around those heavy, bulky professional cameras either!
• the camera feels flimsy, is plastic and very light in weight you
have to be very delicate in the handling of this camera
37
Size and weight
Attributes:
zoom
affordability
size and weight
flash
ease of use
✓
✗
✓
38. Commodity technologies:
Tokenisation, normalization, regular expression search
Mainstream, but room for improvement:
Lemmatization, spelling correction, speech synthesis
Mainstream but needs improvement:
PoS tagging, term extraction, summarization, speech recognition,
text categorization, spoken dialogue systems
Almost there:
Information extraction, NL generation, simple speech translation
Don’t hold your breath:
Full machine translation, advanced dialogue systems
16-Jan-19 Intro to Natural Language Processing 38
39. Dan Jurafsky
Language Technology
Coreference resolution
Question answering (QA)
Part-of-speech (POS) tagging
Word sense disambiguation
(WSD)
Paraphrase
Named entity recognition (NER)
Parsing
Summarization
Information extraction (IE)
Machine translation (MT)
Dialog
Sentiment analysis
mostly solved
making good progress
still really hard
Spam detection
Let’s go to Agra!
Buy V1AGRA …
✓
✗
Colorless green ideas sleep furiously.
ADJ ADJ NOUN VERB ADV
Einstein met with UN officials in Princeton
PERSON ORG LOC
You’re invited to our dinner
party, Friday May 27 at 8:30
Party
May 27
add
Best roast chicken in San Francisco!
The waiter ignored us for 20 minutes.
Carter told Mubarak he shouldn’t run again.
I need new batteries for my mouse.
The 13th Shanghai International Film Festival…
第13届上海国际电影节开幕…
The Dow Jones is up
Housing prices rose
Economy is
good
Q. How effective is ibuprofen in reducing
fever in patients with acute febrile illness?
I can see Alcatraz from the window!
XYZ acquired ABC yesterday
ABC has been taken over by XYZ
Where is Citizen Kane playing in SF?
Castro Theatre at 7:30. Do
you want a ticket?
The S&P500 jumped
40. Many commercial vendors
Open source/access:
GATE (Sheffield University)
RapidMiner
Stanford NLP
OpenNLP
OpenCalais
LingPipe
nltk
16-Jan-19 Intro to Natural Language Processing 40
41. Recent additions
Gensim (& word2vec, GloVe, fastText, etc.)
Textacy + Spacy
TextBlob
BERT, ELMo etc.
16-Jan-19 Intro to Natural Language Processing 41
42. D. Jurafsky and J. Martin, Speech and Language Processing.
Prentice-Hall, 2nd edition, 2008.
Bird, Steven, Ewan Klein, and Edward Loper (2009), Natural
Language Processing with Python, O'Reilly Media.
Markus Hofmann and Andrew Chisholm. 2016.Text Mining
andVisualization: Case Studies Using Open-SourceTools.
Chapman & Hall/CRC.
Manning, C. and Schutze, H. Foundations of Statistical
Language Processing. MIT Press, 1999.
16-Jan-19 Intro to Natural Language Processing 42
43. Install/configure Python (ideally 3.6)
Install an IDE (e.g PyCharm)
Set up a venv etc.
Install nltk + download data
See Ch 1 Section 1.2
Work thru Q19 onwards fromCh 1 of nltk
book:
https://www.nltk.org/book/ch01.html
16-Jan-19 Intro to Natural Language Processing 43
44. Tony Russell-Rose, PhD FBCS CITP CEng
Director, UXLabs | Founder, 2dSearch
Visiting Professor, Essex University
Web: uxlabs.co.uk, 2dsearch.com
Email: tgr@uxlabs.co.uk
Blog: http://isquared.wordpress.com
LinkedIn: http://uk.linkedin.com/in/tonyrussellrose
Twitter: @tonygrr
44Intro to Natural Language Processing