1
Natural Language Processing
Toine Bogers
Aalborg University Copenhagen
Christina Lioma
University of Copenhagen
QUARTZ WINTER SCHOOL / FEBRUARY 12, 2018 / PADUA, ITALY
Who?
• Toine Bogers (toine@hum.aau.dk)
– Associate professor @ Aalborg University Copenhagen
– Interests
§ Recommender systems
§ Information retrieval (search engines)
§ Information behavior
• Christina Lioma (c.lioma@di.ku.dk)
– Full professor @ University of Copenhagen
– Interests
§ Information retrieval (search engines)
§ Natural language processing, computational linguistics
2
Outline
• Introduction to NLP
• Vector semantics
• Text classification
3
Useful references
• Slides in this lecture are based on Jurafsky & Martin
book
– Jurafsky, D., & Martin, J. H. (2014). Speech and
Language Processing (2nd ed.). Harlow: Pearson
Education. https://web.stanford.edu/~jurafsky/slp3/
• Other good textbooks on NLP
– Jackson, P., & Moulinier, I. (2007). Natural Language
Processing for Online Applications: Text Retrieval,
Extraction and Categorization. Amsterdam: Benjamins.
– Manning, C. D., & Schütze, H. (1999). Foundations of
Statistical Natural Language Processing. Cambridge,
MA: MIT Press.
4
5
Part 1
Introduction to NLP
What is Natural Language Processing?
• Multidisciplinary branch of CS drawing largely from linguistics
• Definition by Liddy (1998)
– “Natural language processing is a range of computational techniques for
analyzing and representing naturally occurring texts at one or more levels
of linguistic analysis for the purpose of achieving human-like language
processing for a range of particular tasks or applications.”
• Goal: applied mechanization of human language
6
Levels of linguistic analysis
• Phonological — Interpretation of speech sounds within and across words
• Morphological — Componential analysis of words, including prefixes, suffixes
and roots
• Lexical — Word-level analysis including lexical meaning and part-of-speech
analysis
• Syntactic — Analysis of words in a sentence in order to uncover the grammatical
structure of the sentence
• Semantic — Determining the possible meanings of a sentence, including
disambiguation of words in context
• Discourse — Interpreting structure and meaning conveyed by texts larger than a
sentence
• Pragmatic — Understanding the purposeful use of language in situations,
particularly those aspects of language which require world knowledge
7
Liddy (1998)
NLP is hard!
• Multidisciplinary knowledge gap
– Some computer scientists might not “get” linguistics
– Some linguists might not “get” computer science
– Goal: bridge this gap
• String similarity alone is not enough
– Very similar strings can mean different things (1 & 2)
– Very different strings can mean similar things (2 & 3)
– Examples:
1. How fast is the TZ?
2. How fast will my TZ arrive?
3. Please tell me when I can expect the TZ I ordered
8
NLP is hard!
• Ambiguity
– Identical strings can have different meanings
– Example: “I made her duck” has at least five possible meanings
§ I cooked waterfowl for her
§ I cooked waterfowl belonging to her
§ I created the (plaster?) duck she owns
§ I caused her to quickly lower her head or body
§ I waved my magic wand and turned her into undifferentiated waterfowl
9
Overcoming NLP difficulty
• Natural language ambiguity is very common, but also largely local
– Immediate context resolves ambiguity
– Immediate context ➝ common sense
– Example: “My connection is too slow today.”
• Humans use common sense to resolve ambiguity, sometimes
without being aware there was ambiguity
10
Overcoming NLP difficulty
• Machines do not have common sense
• Initial suggestion
– Hand-code common sense in machines
– Impossibly hard to do for more than very limited domains
• Present suggestion
– Applications that work with very limited domains
– Approximate common sense by relatively simple techniques
11
Applications of NLP
• Language identification
• Spelling & grammar checking
• Speech recognition & synthesis
• Sentiment analysis
• Automatic summarization
• Machine translation
• Information retrieval
• Information extraction
• …
12
Applications of NLP
• Some applications are widespread (e.g., spell check), while others
are not ready for industry or are too expensive for popular use
• NLP tools rarely hit a 100% success rate
– Accuracy is assessed in statistical terms
– Tools become mature and usable when they operate above a certain
precision and below an acceptable cost
• All NLP tools improve continuously (and often rapidly)
13
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & part-of-speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
14
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
15
today’s
focus
16
Part 2a
Vector semantics
Introduction to distributional semantics
Word similarity
• Understanding word similarity is essential for NLP
– Example
§ “fast” is similar to “rapid”
§ “tall” is similar to “height”
– Question answering
§ Question: “How tall is Mt. Everest?”
§ Candidate answer: “The official height of Mount Everest is 29029 feet.”
• Can we compute the similarity between words automatically?
– Distributional semantics is an approach to doing this
17
Application: Plagiarism detection
18
Distributional semantics
• Distributional semantics is the study of semantic similarities between
words using their distributional properties in text corpora
– Distributional models of meaning = vector-space models of meaning =
vector semantics
• Intuition behind this
– Linguistic items with similar distributions have similar meanings
– Zellig Harris (1954):
§ “oculist and eye-doctor … occur in almost the same environments”
§ “If A and B have almost identical environments we say that they are synonyms.”
– Firth (1957)
§ “You shall know a word by the company it keeps!”
19
Distributional semantics
• Example
– From context words, humans can guess tesgüino means an alcoholic
beverage like beer
• Intuition for algorithm
– Two words are similar if they have similar word contexts
20
A bottle of tesgüino is on the table.
Everybody likes tesgüino.
Tesgüino makes you drunk.
We make tesgüino out of corn.
Four kinds of models of vector semantics
• Sparse vector representations
– Mutual-information weighted word co-occurrence matrices
• Dense vector representations
– Singular value decomposition (and Latent Semantic Analysis)
– Neural-network-inspired models (skip-grams, CBOW)
– Brown clusters
21
today’s
focus
Shared intuition
• Model the meaning of a word by “embedding” in a vector space
– The meaning of a word is a vector of numbers
– Vector models are also called embeddings
• Contrast
– Word meaning is represented in many computational linguistic
applications by a vocabulary index (“word number 545”)
22
Representing vector semantics
• Words are related if they occur in the same context
• How big is this context?
– Entire document? Paragraph? Window of ±n words?
– Smaller contexts (e.g., context windows) are best for capturing similarity
– If a word w occurs a lot in the context windows of another word v (i.e., if
they frequently co-occur), then they are probably related
23
words rather than documents. This matrix is thus of dimensionality |V| × |V| and
each cell records the number of times the row (target) word and the column (context)
word co-occur in some context in some training corpus. The context could be the
document, in which case the cell represents the number of times the two words
appear in the same document. It is most common, however, to use smaller contexts,
such as a window around the word, for example of 4 words to the left and 4 words
to the right, in which case the cell represents the number of times (in some training
corpus) the column word occurs in such a ±4 word window around the row word.
For example here are 7-word windows surrounding four sample words from the
Brown corpus (just one example of each word):
sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of,
their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened
well suited to programming on the digital computer. In finding the optimal R-stage policy from
for the purpose of gathering data and information necessary for the study authorized in the
For each word we collect the counts (from the windows around each occurrence)
of the occurrences of context words. Fig. 17.2 shows a selection from the word-word
co-occurrence matrix computed from the Brown corpus for these four words.
Representing vector semantics
• We can now define a word w by a vector of counts of context
words
– Counts represent how often those words have co-occurred in the same
context window with word w
– Each vector is of length |V|, where V is the vocabulary
– Vector semantics is captured in a word-word matrix of size |V| × |V|
24
Example: Contexts ±7 words
25
aardvark computer data pinch result sugar …
apricot 0 0 0 1 0 1
pineapple 0 0 0 1 0 1
digital 0 2 1 0 1 0
information 0 1 6 0 4 0
…
Word-word matrix
• We showed only 4 x 6, but the real matrix is 50,000 x 50,000
– Most values are 0 so it is very sparse
– That’s OK, since there are lots of efficient algorithms for sparse matrices
• The size of windows depends on your goals
– The shorter the windows, the more syntactic the representation
§ ± 1-3 very syntactic
– The longer the windows, the more semantic the representation
§ ± 4-10 more semantic
26
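As a minimal illustration of how such a word-word matrix can be built, here is a Python sketch that counts co-occurrences within a ±4-word context window. The toy corpus, whitespace tokenization, and window size are only illustrative assumptions, not part of the original slides.

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=4):
    """Count how often each context word appears within +/- `window`
    positions of each target word."""
    counts = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, target in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[target][tokens[j]] += 1
    return counts

# Toy corpus from the tesgüino example (whitespace tokenization for simplicity)
corpus = [
    "a bottle of tesgüino is on the table".split(),
    "everybody likes tesgüino".split(),
    "tesgüino makes you drunk".split(),
    "we make tesgüino out of corn".split(),
]
matrix = cooccurrence_matrix(corpus, window=4)
print(dict(matrix["tesgüino"]))  # context-word counts for 'tesgüino'
```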
Two types of co-occurrence
• First-order co-occurrence (syntagmatic association)
– They are typically nearby each other
– wrote is a first-order associate of book or poem
• Second-order co-occurrence (paradigmatic association)
– They have similar neighbors
– wrote is a second-order associate of words like said or remarked
27
28
Part 2b
Vector semantics
Positive Pointwise Mutual Information (PPMI)
Problem with raw co-occurrence counts
• Raw word frequency is not a great measure of association between
words
– It’s very skewed
– “the” and “of” are very frequent, but maybe not the most discriminative
• We would prefer a measure that asks whether a context word is
particularly informative about the target word
– Positive Pointwise Mutual Information (PPMI)
29
Pointwise Mutual Information
• Do events x and y co-occur more than if they were independent?
• PMI between two words (Church & Hanks, 1989)
– Do words x and y co-occur more than if they were independent?
30
PMI(word_1, word_2) = \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)}

PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}
Positive Pointwise Mutual Information
• PMI ranges from –∞ to +∞, but negative values are problematic
– Things are co-occurring less than we expect by chance
– Unreliable without enormous corpora
§ Imagine w1 and w2 whose probability is each 10-6
§ Hard to be sure p(w1, w2) is significantly different than 10-12
– Plus it is not clear people are good at “unrelatedness”
• So we just replace negative PMI values by 0
– Positive PMI (PPMI) between word1 and word2:
31
PPMI(word_1, word_2) = \max\left( \log_2 \frac{P(word_1, word_2)}{P(word_1)\,P(word_2)},\; 0 \right)
• Matrix F with W rows (words) and C columns (contexts)
• fij is # of times wi occurs in context cj
32
p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

p_{i*} = \frac{\sum_{j=1}^{C} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}} \qquad p_{*j} = \frac{\sum_{i=1}^{W} f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}}

pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}} \qquad ppmi_{ij} = \begin{cases} pmi_{ij} & \text{if } pmi_{ij} > 0 \\ 0 & \text{otherwise} \end{cases}
Computing PPMI on a term-context matrix
p(w = information, c = data) = 6/19 = .32
p(w = information) = 11/19 = .58
p(c = data) = 7/19 = .37

33

p(w, context):
              computer   data   pinch   result   sugar     p(w)
apricot         0.00     0.00    0.05    0.00     0.05     0.11
pineapple       0.00     0.00    0.05    0.00     0.05     0.11
digital         0.11     0.05    0.00    0.05     0.00     0.21
information     0.05     0.32    0.00    0.21     0.00     0.58
p(context)      0.16     0.37    0.11    0.26     0.11

p_{ij} = \frac{f_{ij}}{\sum_{i=1}^{W} \sum_{j=1}^{C} f_{ij}} \qquad p(w_i) = \frac{\sum_{j=1}^{C} f_{ij}}{N} \qquad p(c_j) = \frac{\sum_{i=1}^{W} f_{ij}}{N}
34
pmi_{ij} = \log_2 \frac{p_{ij}}{p_{i*}\, p_{*j}}

p(w, context):
              computer   data   pinch   result   sugar     p(w)
apricot         0.00     0.00    0.05    0.00     0.05     0.11
pineapple       0.00     0.00    0.05    0.00     0.05     0.11
digital         0.11     0.05    0.00    0.05     0.00     0.21
information     0.05     0.32    0.00    0.21     0.00     0.58
p(context)      0.16     0.37    0.11    0.26     0.11

PPMI(w, context):
              computer   data   pinch   result   sugar
apricot          -        -     2.25     -       2.25
pineapple        -        -     2.25     -       2.25
digital         1.66     0.00    -       0.00     -
information     0.00     0.57    -       0.47     -

PMI(information, data) = \log_2\left( \frac{.32}{.37 \times .58} \right) = .57
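The PPMI computation above can be reproduced with a short NumPy sketch. The count matrix is the one from the slides, and the final prints simply check the 0.57 and 1.66 values in the PPMI table.

```python
import numpy as np

# Count matrix from the slides: rows = target words, columns = context words
words = ["apricot", "pineapple", "digital", "information"]
contexts = ["computer", "data", "pinch", "result", "sugar"]
F = np.array([[0, 0, 1, 0, 1],
              [0, 0, 1, 0, 1],
              [2, 1, 0, 1, 0],
              [1, 6, 0, 4, 0]], dtype=float)

N = F.sum()                           # 19
P = F / N                             # joint probabilities p_ij
p_w = P.sum(axis=1, keepdims=True)    # row marginals p(w_i)
p_c = P.sum(axis=0, keepdims=True)    # column marginals p(c_j)

with np.errstate(divide="ignore", invalid="ignore"):
    pmi = np.log2(P / (p_w * p_c))
ppmi = np.maximum(pmi, 0)             # replace negative (and -inf) PMI by 0

print(round(ppmi[words.index("information"), contexts.index("data")], 2))   # 0.57
print(round(ppmi[words.index("digital"), contexts.index("computer")], 2))   # 1.66
```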
One more problem…
• We are unlikely to encounter rare
words unless we have large
corpora
– PMI values cannot be calculated if
co-occurrence count is 0
• Solution: give rare words slightly
higher probabilities
– Steal probability mass to
generalize better
– Laplace smoothing (aka add-one
smoothing)
§ Pretend we saw each word one
more time than we did
35
[Figure: distribution over context words (allegations, reports, claims, attack, request, man, outcome, ...) before and after smoothing: probability mass is shifted from seen words to unseen words]
36
Part 2c
Vector semantics
Measuring word similarity: the cosine
Measuring similarity
• We need a way to measure the similarity between two target words v
and w
– Most vector similarity measures are based on the dot product or inner
product from linear algebra
– High when two vectors have large values in same dimensions
– Low (in fact 0) for orthogonal vectors with zeros in complementary
distribution
37
[Figure 19.6: the add-2 Laplace-smoothed PPMI matrix (rows for pineapple, digital, information shown)]

The cosine, like most measures for vector similarity used in NLP, is based on the dot product operator from linear algebra, also called the inner product:

\text{dot-product}(\vec{v}, \vec{w}) = \vec{v} \cdot \vec{w} = \sum_{i=1}^{N} v_i w_i = v_1 w_1 + v_2 w_2 + \dots + v_N w_N

Intuitively, the dot product acts as a similarity metric because it will tend to be high just when the two vectors have large values in the same dimensions. Alternatively, vectors that have zeros in different dimensions (orthogonal vectors) will be very dissimilar, with a dot product of 0.
Measuring similarity
• Problem: dot product is not normalized for vector length
– Vectors are longer if they have higher values in each dimension
– That means more frequent words will have higher dot products
– Our similarity metric should not be sensitive to word frequency
• Solution: divide it by the length of the two vectors
– Is equal to the cosine of the angle between the two vectors!
– This is the cosine similarity
38
The dot product is higher if a vector is longer, with higher values in each dimension. More frequent words have longer vectors, since they tend to co-occur with more words and have higher co-occurrence values with each of them. Raw dot product thus will be higher for frequent words. This is a problem; we would like a similarity metric that tells us how similar two words are regardless of their frequency.

The simplest way to modify the dot product to normalize for vector length is to divide it by the lengths of each of the two vectors. This normalized dot product turns out to be the same as the cosine of the angle between the two vectors, following from the definition of the dot product between two vectors \vec{a} and \vec{b}:

\vec{a} \cdot \vec{b} = |\vec{a}|\,|\vec{b}| \cos\theta

\frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|} = \cos\theta \qquad (19.12)
Calculating the cosine similarity
• vi is the PPMI value for word v in context i
• wi is the PPMI value for word w in context i
• cos(v, w) is the cosine similarity between v and w
– Raw frequency or PPMI are non-negative, so cosine range is [0, 1]
39
\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}

(the normalized dot product = the dot product of the two unit vectors)
              large   data   computer
apricot         2       0       0
digital         0       1       2
information     1       6       1
40
Which pair of words is more similar?
\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\vec{v}}{|\vec{v}|} \cdot \frac{\vec{w}}{|\vec{w}|} = \frac{\sum_{i=1}^{N} v_i w_i}{\sqrt{\sum_{i=1}^{N} v_i^2}\,\sqrt{\sum_{i=1}^{N} w_i^2}}
Example
\cos(apricot, digital) = \frac{2 \cdot 0 + 0 \cdot 1 + 0 \cdot 2}{\sqrt{4 + 0 + 0}\,\sqrt{0 + 1 + 4}} = 0

\cos(digital, information) = \frac{0 + 6 + 2}{\sqrt{0 + 1 + 4}\,\sqrt{1 + 36 + 1}} = \frac{8}{\sqrt{5}\,\sqrt{38}} = .58

\cos(apricot, information) = \frac{2 + 0 + 0}{\sqrt{4 + 0 + 0}\,\sqrt{1 + 36 + 1}} = \frac{2}{2\sqrt{38}} = .16
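A minimal Python sketch of cosine similarity that reproduces the three values in the example above (0, .58, and .16), using the count vectors from the small table:

```python
import math

def cosine(v, w):
    """Cosine similarity between two equal-length count/PPMI vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w)

# Vectors over the contexts (large, data, computer) from the slide
apricot     = [2, 0, 0]
digital     = [0, 1, 2]
information = [1, 6, 1]

print(round(cosine(apricot, digital), 2))      # 0.0
print(round(cosine(digital, information), 2))  # 0.58
print(round(cosine(apricot, information), 2))  # 0.16
```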
Visualizing vectors and angles

              large   data
apricot         2       0
digital         0       1
information     1       6

[Figure: apricot, digital, and information plotted as vectors in the two dimensions 'large' (x-axis, 1-7) and 'data' (y-axis, 1-3); the angle between two vectors reflects their similarity]

41
Clustering vectors to visualize similarity in co-occurrence matrices

[Figure: hierarchical clustering of word co-occurrence vectors for body parts (WRIST, ANKLE, SHOULDER, ...), animals (DOG, CAT, PUPPY, ...), cities (CHICAGO, ATLANTA, TOKYO, ...), and countries/regions (CHINA, RUSSIA, AFRICA, ...)]

42
Rohde et al. (2006)
Other possible similarity measures
43
44
Part 2d
Vector semantics
Evaluation
Evaluating word similarity
• Intrinsic evaluation
– Correlation between algorithm and human word similarity ratings
§ Wordsim353 ➞ 353 noun pairs rated on a 0-10 scale
– Taking TOEFL multiple-choice vocabulary tests
• Extrinsic (= task-based, end-to-end) evaluation
– Question answering
– Spell-checking
– Essay grading
45
sim(plane, car) = 5.77
Levied is closest in meaning to:
imposed / believed / requested / correlated
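A sketch of the intrinsic evaluation described above: given human similarity judgments (for example the WordSim-353 pairs) and any model similarity function, the standard score is the Spearman rank correlation. The function name, toy data, and data layout below are assumptions for illustration only.

```python
from scipy.stats import spearmanr  # assumes SciPy is available

def evaluate_word_similarity(pairs, human_scores, model_sim):
    """Spearman rank correlation between human similarity ratings and
    the similarities produced by a model.

    pairs        : list of (word1, word2) tuples, e.g. from WordSim-353
    human_scores : the corresponding human ratings (e.g. on a 0-10 scale)
    model_sim    : any function model_sim(word1, word2) -> similarity score
    """
    model_scores = [model_sim(w1, w2) for w1, w2 in pairs]
    rho, _pvalue = spearmanr(human_scores, model_scores)
    return rho

# Toy usage with made-up model similarities looked up from a dictionary
toy_model = {("plane", "car"): 0.40, ("tiger", "cat"): 0.80, ("king", "cabbage"): 0.05}
toy_human = [5.77, 7.35, 0.23]
print(evaluate_word_similarity(list(toy_model), toy_human,
                               lambda w1, w2: toy_model[(w1, w2)]))  # 1.0 (identical rankings)
```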
46
Part 2e
Vector semantics
Dense vectors
Sparse vs. dense vectors
• PPMI vectors are
– Long (length |V|= 20,000 to 50,000)
– Sparse (most elements are zero)
• Alternative: learn vectors which are
– Short (length 200-1000)
– Dense (most elements are non-zero)
47
Sparse vs. dense vectors
• Why dense vectors?
– Short vectors may be easier to use as features in machine learning (fewer
weights to tune)
– De-noising ➞ low-order dimensions may represent unimportant
information
– Dense vectors may generalize better than storing explicit counts
– Dense models may do better at capturing higher-order co-occurrence
§ car and automobile are synonyms, but represented as distinct dimensions
§ Fails to capture similarity between a word with car as a neighbor and a word
with automobile as a neighbor (= paradigmatic association)
48
Methods for creating dense vector embeddings
• Singular Value Decomposition (SVD)
– A special case of this is called Latent Semantic Analysis (LSA)
• “Neural Language Model”-inspired predictive models (skip-grams
and CBOW)
• Brown clustering
49
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
50
51
Part 3a
Text classification
Introduction
Is this spam?
52
What is text classification?
• Goal: Take a document and assign it a label representing its content
• Classic example: decide if a newspaper article is about politics, sports, or
business
• Many uses for the same technology
– Is this e-mail spam or not?
– Is this page a laser printer product page?
– Does this company accept overseas orders?
– Is this tweet positive or negative?
– Does this part of the CV describe part of a person’s work experience?
– Is this text written in Danish or Norwegian?
– Is this the “computer” or “harbor” sense of port?
53
Definition
• Input:
– A document d
– A fixed set of classes C = {c1, c2,…, cj}
• Output
– A predicted class c ∈ C
54
Top-down vs. bottom-up text classification
• Top-down approach
– We tell the computer exactly how it should solve a task (e.g., expert
systems)
– Example: black-list-address OR ("dollars" AND "have been
selected")
– Potential for high accuracy, but requires expensive & extensive
refinement by experts
• Bottom-up approach
– The computer finds out for itself with a ‘little’ help from us (e.g., machine
learning)
– Example: The word “viagra” is very often found in spam e-mails
55
Definition (updated)
• Input:
– A document d
– A fixed set of classes C = {c1, c2,…, cj}
– A training set of m hand-labeled documents (d1, c1), …, (dm, cm)
• Output
– A learned classifier γ: d ➞ c
56
Bottom-up text classification
• Any kind of classifier
– Naïve Bayes
– Logistic regression
– Support Vector Machines
– k-Nearest Neighbor
– Decision tree learning
– ...
57
today’s
focus
58
Part 3b
Text classification
Naïve Bayes
Naïve Bayes
• Simple (“naïve”) classification method based on Bayes rule
– Relies on very simple representation of document: bag of words
59
Bag-of-words representation
60
[Figure: bag-of-words representation of a movie review; the text is reduced to an unordered collection of words and their counts]

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

it          6
I           5
the         4
to          3
and         3
seen        2
yet         1
would       1
whimsical   1
times       1
sweet       1
satirical   1
adventure   1
genre       1
fairy       1
humor       1
have        1
great       1
…
Bag-of-words representation
61
γ( document ) = c

seen        2
sweet       1
whimsical   1
recommend   1
happy       1
...         ...
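A bag-of-words representation like the one above can be produced in a few lines of Python; the tokenization regex below is a simplifying assumption, not part of the original slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, tokenize on letter/apostrophe runs, count tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun...")
bow = bag_of_words(review)
print(bow.most_common(5))  # [('the', 2), ('i', 1), ('love', 1), ('this', 1), ('movie', 1)]
```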
Naïve Bayes
• Simple (“naïve”) classification method based on Bayes rule
– Relies on very simple representation of document: bag of words
• Bayes’ rule applied to documents and classes
– For a document d and a class c
62
P(c \mid d) = \frac{P(d \mid c)\, P(c)}{P(d)}
Naïve Bayes (cont’d)
63
c_{MAP} = \arg\max_{c \in C} P(c \mid d) \qquad \text{(MAP = "maximum a posteriori" = most likely class)}

= \arg\max_{c \in C} \frac{P(d \mid c)\, P(c)}{P(d)} \qquad \text{(Bayes' rule)}

= \arg\max_{c \in C} P(d \mid c)\, P(c) \qquad \text{(dropping the denominator)}

= \arg\max_{c \in C} P(x_1, x_2, \dots, x_n \mid c)\, P(c) \qquad \text{(document d represented as features x_1, ..., x_n)}
Naïve Bayes (cont’d)
• Prior P(c): How often does this class occur?
– We can just count the relative frequencies in a corpus
• Likelihood P(x_1, x_2, ..., x_n | c)
– O(|X|^n · |C|) parameters
– Could only be estimated if a very, very large number of training
examples was available!
64
c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n \mid c)\, P(c)
Multinomial Naïve Bayes
65
• Bag-of-words assumption
– Assume position does not matter
• Conditional independence assumption
– Assume the feature probabilities P(xi | cj) are independent given the
class c
P(x_1, x_2, \dots, x_n \mid c)

P(x_1, \dots, x_n \mid c) = P(x_1 \mid c) \cdot P(x_2 \mid c) \cdot P(x_3 \mid c) \cdots P(x_n \mid c)

c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)
Multinomial Naïve Bayes (cont’d)
66
c_{MAP} = \arg\max_{c \in C} P(x_1, x_2, \dots, x_n \mid c)\, P(c)
Multinomial Naïve Bayes (cont’d)
67
c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in \text{positions}} P(x_i \mid c_j)

(positions = all word positions in the test document)
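In practice the product of many small probabilities underflows floating point, so implementations usually maximize a sum of log probabilities instead. A minimal sketch of the decision rule, assuming the priors and per-class word likelihoods have already been estimated (words unseen in training are simply skipped here):

```python
import math

def classify_nb(doc_tokens, priors, likelihoods):
    """Return the class maximizing log P(c) + sum_i log P(x_i | c).

    priors      : dict mapping class -> P(c)
    likelihoods : dict mapping class -> {word: P(w | c)}
    """
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)
        for w in doc_tokens:
            if w in likelihoods[c]:          # skip out-of-vocabulary words
                score += math.log(likelihoods[c][w])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```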
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– Simply use the frequencies in the data
68
\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j)}{\sum_{w \in V} \text{count}(w, c_j)}

\hat{P}(c_j) = \frac{\text{doccount}(C = c_j)}{N_{doc}}
fraction of times word wi appears among
all words in documents of topic cj
fraction of documents with topic cj
Problem with Maximum Likelihood
• What if we have seen no training documents with the word fantastic
and classified in the topic positive (= thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other
evidence!
69
\hat{P}(\text{"fantastic"} \mid \text{positive}) = \frac{\text{count}(\text{"fantastic"}, \text{positive})}{\sum_{w \in V} \text{count}(w, \text{positive})} = 0

c_{MAP} = \arg\max_{c} \hat{P}(c) \prod_{i} \hat{P}(x_i \mid c)
Solution: Laplace smoothing
• Also known as add-one (or add-α) smoothing
– Pretend we saw each word one more time (or α more times) than we did
– Shown here for α = 1:
70

\hat{P}(w_i \mid c_j) = \frac{\text{count}(w_i, c_j) + 1}{\sum_{w \in V} \left( \text{count}(w, c_j) + 1 \right)} = \frac{\text{count}(w_i, c_j) + 1}{\left( \sum_{w \in V} \text{count}(w, c_j) \right) + |V|}
Learning a multinomial Naïve Bayes model
• From training corpus, extract vocabulary V
• Calculate P(cj) terms
– For each cj in C do
§ docsj ⟵ all docs with class = cj
71
• Calculate P(wk | cj) terms
– Textj ⟵ single doc containing all
docsj
– For each word wk in V
§ nk ⟵ # of occurrences of wk in Textj
(Text_j = aggregate representation of all documents of class j)

P(w_k \mid c_j) \leftarrow \frac{n_k + \alpha}{n + \alpha\,|V|}

P(c_j) \leftarrow \frac{|\text{docs}_j|}{\text{total number of documents}}
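The training procedure on this slide can be written down directly. The sketch below uses add-α smoothing (α = 1 by default) and treats each class's documents as one concatenated text, as described above; the function name and data layout are illustrative assumptions.

```python
from collections import Counter

def train_nb(docs, alpha=1.0):
    """Estimate multinomial Naive Bayes parameters with add-alpha smoothing.

    docs : list of (tokens, class) pairs.
    Returns priors P(c) and smoothed likelihoods P(w|c) over the vocabulary V.
    """
    vocab = {w for tokens, _ in docs for w in tokens}
    classes = {c for _, c in docs}
    priors, likelihoods = {}, {}
    for c in classes:
        class_docs = [tokens for tokens, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)
        word_counts = Counter(w for tokens in class_docs for w in tokens)  # Text_j
        total = sum(word_counts.values())                                  # n
        likelihoods[c] = {w: (word_counts[w] + alpha) / (total + alpha * len(vocab))
                          for w in vocab}
    return priors, likelihoods
```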
Doc   Words                                   Class
TRAINING
 1    Chinese Beijing Chinese                 China
 2    Chinese Chinese Shanghai                China
 3    Chinese Macao                           China
 4    Tokyo Japan Chinese                     Japan
TEST
 5    Chinese Chinese Chinese Tokyo Japan     ?
Worked example: Priors
72
\hat{P}(c) = \frac{N_c}{N} \qquad P(China) = \frac{3}{4} \qquad P(Japan) = \frac{1}{4} \qquad \text{(priors)}
priors
Doc   Words                                   Class
TRAINING
 1    Chinese Beijing Chinese                 China
 2    Chinese Chinese Shanghai                China
 3    Chinese Macao                           China
 4    Tokyo Japan Chinese                     Japan
TEST
 5    Chinese Chinese Chinese Tokyo Japan     ?

Worked example: Conditional probabilities
73

\hat{P}(w \mid c) = \frac{\text{count}(w, c) + 1}{\text{count}(c) + |V|}

P(\text{"Chinese"} \mid China) = \frac{\text{count}(\text{"Chinese"}, China) + 1}{\text{count}(China) + |V|} = \frac{5 + 1}{8 + 6} = \frac{6}{14} = \frac{3}{7}
Doc   Words                                   Class
TRAINING
 1    Chinese Beijing Chinese                 China
 2    Chinese Chinese Shanghai                China
 3    Chinese Macao                           China
 4    Tokyo Japan Chinese                     Japan
TEST
 5    Chinese Chinese Chinese Tokyo Japan     ?
Worked example: Conditional probabilities
74
P(\text{"Chinese"} \mid China) = \frac{5 + 1}{8 + 6} = \frac{3}{7} \qquad P(\text{"Chinese"} \mid Japan) = \frac{1 + 1}{3 + 6} = \frac{2}{9}

P(\text{"Tokyo"} \mid China) = \frac{0 + 1}{8 + 6} = \frac{1}{14} \qquad P(\text{"Tokyo"} \mid Japan) = \frac{1 + 1}{3 + 6} = \frac{2}{9}

P(\text{"Japan"} \mid China) = \frac{0 + 1}{8 + 6} = \frac{1}{14} \qquad P(\text{"Japan"} \mid Japan) = \frac{1 + 1}{3 + 6} = \frac{2}{9}
Doc   Words                                   Class
TRAINING
 1    Chinese Beijing Chinese                 China
 2    Chinese Chinese Shanghai                China
 3    Chinese Macao                           China
 4    Tokyo Japan Chinese                     Japan
TEST
 5    Chinese Chinese Chinese Tokyo Japan     ?
Worked example: Choosing a class
75
P(\text{"Chinese"} \mid China) = \frac{3}{7} \qquad P(\text{"Tokyo"} \mid China) = P(\text{"Japan"} \mid China) = \frac{1}{14}

P(\text{"Chinese"} \mid Japan) = P(\text{"Tokyo"} \mid Japan) = P(\text{"Japan"} \mid Japan) = \frac{2}{9}

P(China) = \frac{3}{4} \qquad P(Japan) = \frac{1}{4}

c_{NB} = \arg\max_{c \in C} P(c) \prod_{x \in X} P(x \mid c)

P(China \mid doc_5) \propto \frac{3}{4} \times \left(\frac{3}{7}\right)^{3} \times \frac{1}{14} \times \frac{1}{14} \approx 0.0003

P(Japan \mid doc_5) \propto \frac{1}{4} \times \left(\frac{2}{9}\right)^{3} \times \frac{2}{9} \times \frac{2}{9} \approx 0.0001

➝ The test document is assigned to class China
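The worked example can be checked mechanically with exact fractions; the smoothed parameters below are the ones derived on the previous slides (|V| = 6).

```python
from fractions import Fraction as F

# Add-one smoothed parameters from the worked example
p_china, p_japan = F(3, 4), F(1, 4)
p_w_china = {"Chinese": F(6, 14), "Tokyo": F(1, 14), "Japan": F(1, 14)}
p_w_japan = {"Chinese": F(2, 9),  "Tokyo": F(2, 9),  "Japan": F(2, 9)}

doc5 = ["Chinese", "Chinese", "Chinese", "Tokyo", "Japan"]

score_china, score_japan = p_china, p_japan
for w in doc5:
    score_china *= p_w_china[w]
    score_japan *= p_w_japan[w]

print(float(score_china))  # ~0.0003 -> predict China
print(float(score_japan))  # ~0.0001
```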
Summary
• Naive Bayes is not so naïve
– Very fast
– Low storage requirements
– Robust to irrelevant features
§ Irrelevant features cancel each other out without affecting results
– Very good in domains with many equally important features
§ Decision Trees suffer from fragmentation in such cases – especially if little data
– A good, dependable baseline for text classification
76
Naïve Bayes in spam filtering
• SpamAssassin features (http://spamassassin.apache.org/tests_3_3_x.html)
– Properties
§ From: starts with many numbers
§ Subject is all capitals
§ HTML has a low ratio of text to image area
§ Claims you can be removed from the list
– Phrases
§ “viagra”
§ “impress ... girl”
§ “One hundred percent guaranteed”
§ “Prestigious Non-Accredited Universities”
77
78
Part 3c
Text classification
Evaluation
Evaluation
• Text classification can be seen as an application of machine learning
– Evaluation is similar to best-practice in machine learning
• Experimental setup
– Splitting in training, development, and test set
– Cross-validation for parameter optimization
• Evaluation
– Precision
§ For a particular class c, how many times were we correct in predicting c?
– Recall
§ For a particular class c, how many of the actual instances of c did we manage to
find?
– Other metrics: F-score, AUC
79
Precision vs. recall
• High precision
– When all returned answers must be correct
– Good when missing results are not problematic
– More common from hand-built systems
• High recall
– You get all the right answers, but garbage too
– Good when incorrect results are not problematic
– More common from automatic systems
• Trade-off
– In general, one can trade one for the other
– But it is harder to score well on both
80
[Figure: the precision/recall trade-off]
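Per-class precision and recall as defined above can be computed from parallel lists of predicted and gold labels; the spam/ham labels in the example are only illustrative.

```python
def precision_recall(predicted, actual, target_class):
    """Per-class precision and recall from parallel label lists."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == target_class and a == target_class)
    fp = sum(1 for p, a in zip(predicted, actual) if p == target_class and a != target_class)
    fn = sum(1 for p, a in zip(predicted, actual) if p != target_class and a == target_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 3 of the 4 predicted 'spam' labels are correct (precision 0.75),
# and 3 of the 5 actual 'spam' documents were found (recall 0.6)
pred = ["spam", "spam", "spam", "spam", "ham",  "ham",  "ham"]
gold = ["spam", "spam", "spam", "ham",  "spam", "spam", "ham"]
print(precision_recall(pred, gold, "spam"))  # (0.75, 0.6)
```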
Non-binary classification
• What if we have more than two classes?
– Solution: Train sets of binary classifiers
• Scenario 1: Any-of classification (aka multivalue classification)
– A document can belong to 0, 1, or >1 classes
– For each class c∈C
§ Build a classifier γc to distinguish c from all other classes c’∈C
– Given test document d
§ Evaluate it for membership in each class using each γc
§ d belongs to any class for which γc returns true
81
Non-binary classification
• Scenario 2: One-of classification (aka multinomial classification)
– A document can belong to exactly 1 class
– For each class c∈C
§ Build a classifier γc to distinguish c from all other classes c’∈C
– Given test document d
§ Evaluate it for membership in each class using each γc
§ d belongs to the one class with maximum score
82
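The two scenarios differ only in how the decisions of the per-class binary classifiers γc are combined. A sketch, assuming each classifier returns a real-valued score and using 0 as an (assumed) decision threshold:

```python
def any_of_classify(doc, binary_classifiers):
    """Any-of: return every class whose binary classifier accepts the document.

    binary_classifiers maps class -> function(doc) returning a score;
    a score above the (assumed) threshold 0 counts as membership.
    """
    return [c for c, clf in binary_classifiers.items() if clf(doc) > 0]

def one_of_classify(doc, binary_classifiers):
    """One-of: return the single class with the maximum classifier score."""
    return max(binary_classifiers, key=lambda c: binary_classifiers[c](doc))
```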
83
Part 3d
Text classification
Practical issues
Text classification
• Usually, simple machine learning algorithms are used
– Examples: Naïve Bayes, decision trees
– Very robust, very re-usable, very fast
• Recently, slightly better performance from better algorithms
– Examples: SVMs, Winnow, boosting, k-NN
• Accuracy is more dependent on
– Naturalness of classes
– Quality of features extracted
– Amount of training data available
• Accuracy typically ranges from 65% to 97% depending on the situation
– Note particularly performance on rare classes
84
The real world
• Gee, I’m building a text classifier for real, now! What should I do?
• No training data?
– Manually written rules
§ Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN
CATEGORIZE AS 'grain'
– Need careful crafting
§ Human tuning on development data
§ Time-consuming: 2 days per class
85
The real world
• Very little data?
– Use Naïve Bayes
§ Naïve Bayes is a “high-bias” algorithm (Ng & Jordan, 2002)
– Get more labeled data
§ Find clever ways to get humans to label data for you
– Try semi-supervised training methods
§ Bootstrapping, EM over unlabeled documents, …
• A reasonable amount of data?
– Perfect for all the clever classifiers
§ SVM, Regularized Logistic Regression, ...
– You can even use user-interpretable decision trees
§ Users like to hack & management likes quick fixes
86
Banko & Brill (2001)
The real world
• A huge amount of data?
– Can achieve high accuracy!
– Comes at a cost
§ SVMs (train time) or kNN (test time)
can be too slow
§ Regularized logistic regression can
be somewhat better
§ So Naïve Bayes can come back into
its own again!
– With enough data, classifier may
not matter…
87
Tweaking performance
• Domain-specific features and weights are very important in real
performance
• Sometimes need to collapse terms:
– Part numbers, chemical formulas, …
– But stemming generally does not help
• Upweighting (= counting a word as if it occurred twice)
– Title words (Cohen & Singer, 1996)
– First sentence of each paragraph (Murata, 1999)
– In sentences that contain title words (Ko et al., 2002)
88
Generic architecture of (most) NLP tools
1. Input pre-processing
2. Morphological & Part-of-Speech analysis (tokens)
3. Parsing (syntactic & semantic relations between tokens)
4. Context module (context-specific resolution)
5. Inference (according to the aim of the tool)
6. Generation (output representation)
7. Output processing (output representation refinement)
89
90
Questions?
References
• Jackson, P., & Moulinier, I. (2007). Natural Language Processing for
Online Applications: Text Retrieval, Extraction and Categorization.
Amsterdam: Benjamins.
• Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing
(2nd ed.). Harlow: Pearson Education.
• Liddy, E. D. (1998). Enhanced Text Retrieval Using Natural Language
Processing. Bulletin of the American Society for Information Science
and Technology, 24(4), 14-16.
• Manning, C. D., & Schütze, H. (1999). Foundations of Statistical
Natural Language Processing. Cambridge, MA: MIT Press.
91
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Presentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
DIFFERENCE IN BACK CROSS AND TEST CROSS
DIFFERENCE IN  BACK CROSS AND TEST CROSSDIFFERENCE IN  BACK CROSS AND TEST CROSS
DIFFERENCE IN BACK CROSS AND TEST CROSS
 
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 60009654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
9654467111 Call Girls In Raj Nagar Delhi Short 1500 Night 6000
 
Disentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOSTDisentangling the origin of chemical differences using GHOST
Disentangling the origin of chemical differences using GHOST
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Forensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdfForensic Biology & Its biological significance.pdf
Forensic Biology & Its biological significance.pdf
 
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral AnalysisRaman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
Raman spectroscopy.pptx M Pharm, M Sc, Advanced Spectral Analysis
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
Lucknow 💋 Russian Call Girls Lucknow Finest Escorts Service 8923113531 Availa...
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 bAsymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
Asymmetry in the atmosphere of the ultra-hot Jupiter WASP-76 b
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Botany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questionsBotany krishna series 2nd semester Only Mcq type questions
Botany krishna series 2nd semester Only Mcq type questions
 

Natural Language Processing

  • 1. 1 Natural Language Processing Toine Bogers Aalborg University Copenhagen Christina Lioma University of Copenhagen QUARTZ WINTER SCHOOL / FEBRUARY 12, 2018 / PADUA, ITALY
  • 2. Who? • Toine Bogers (toine@hum.aau.dk) – Associate professor @ Aalborg University Copenhagen – Interests § Recommender systems § Information retrieval (search engines) § Information behavior • Christina Lioma (c.lioma@di.ku.dk) – Full professor @ University of Copenhagen – Interests § Information retrieval (search engines) § Natural language processing, computational linguistics 2
  • 3. Outline • Introduction to NLP • Vector semantics • Text classification 3
  • 4. Useful references • Slides in this lecture are based on the Jurafsky & Martin book – Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. https://web.stanford.edu/~jurafsky/slp3/ • Other good textbooks on NLP – Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins. – Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. 4
  • 6. What is Natural Language Processing? • Multidisciplinary branch of CS drawing largely from linguistics • Definition by Liddy (1998) – “Natural language processing is a range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis for the purpose of achieving human-like language processing for a range of particular tasks or applications.” • Goal: applied mechanization of human language 6
  • 7. Levels of linguistic analysis • Phonological — Interpretation of speech sounds within and across words • Morphological — Componential analysis of words, including prefixes, suffixes and roots • Lexical — Word-level analysis including lexical meaning and part-of-speech analysis • Syntactic — Analysis of words in a sentence in order to uncover the grammatical structure of the sentence • Semantic — Determining the possible meanings of a sentence, including disambiguation of words in context • Discourse — Interpreting structure and meaning conveyed by texts larger than a sentence • Pragmatic — Understanding the purposeful use of language in situations, particularly those aspects of language which require world knowledge 7Liddy (1998)
  • 8. NLP is hard! • Multidisciplinary knowledge gap – Some computer scientists might not “get” linguistics – Some linguists might not “get” computer science – Goal: bridge this gap • String similarity alone is not enough – Very similar strings can mean different things (1 & 2) – Very different strings can mean similar things (2 & 3) – Examples: 1. How fast is the TZ? 2. How fast will my TZ arrive? 3. Please tell me when I can expect the TZ I ordered 8
  • 9. NLP is hard! • Ambiguity – Identical strings can have different meanings – Example: “I made her duck” has at least five possible meanings § I cooked waterfowl for her § I cooked waterfowl belonging to her § I created the (plaster?) duck she owns § I caused her to quickly lower her head or body § I waved my magic wand and turned her into undifferentiated waterfowl 9
  • 10. Overcoming NLP difficulty • Natural language ambiguity is very common, but also largely local – Immediate context resolves ambiguity – Immediate context ➝ common sense – Example: “My connection is too slow today.” • Humans use common sense to resolve ambiguity, sometimes without being aware there was ambiguity 10
  • 11. Overcoming NLP difficulty • Machines do not have common sense • Initial suggestion – Hand-code common sense in machines – Impossibly hard to do for more than very limited domains • Present suggestion – Applications that work with very limited domains – Approximate common sense by relatively simple techniques 11
  • 12. Applications of NLP • Language identification • Spelling & grammar checking • Speech recognition & synthesis • Sentiment analysis • Automatic summarization • Machine translation • Information retrieval • Information extraction • … 12
  • 13. Applications of NLP • Some applications are widespread (e.g., spell check), while others are not ready for industry or are too expensive for popular use • NLP tools rarely hit a 100% success rate – Accuracy is assessed in statistical terms – Tools become mature and usable when they operate above a certain precision and below an acceptable cost • All NLP tools improve continuously (and often rapidly) 13
  • 14. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & part-of-speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 14
  • 15. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 15 today’s focus
  • 16. 16 Part 2a Vector semantics Introduction to distributional semantics
  • 17. Word similarity • Understanding word similarity is essential for NLP – Example § “fast” is similar to “rapid” § “tall” is similar to “height” – Question answering § Question: “How tall is Mt. Everest?” § Candidate answer: “The official height of Mount Everest is 29029 feet.” • Can we compute the similarity between words automatically? – Distributional semantics is an approach to doing this 17
  • 19. Distributional semantics • Distributional semantics is the study of semantic similarities between words using their distributional properties in text corpora – Distributional models of meaning = vector-space models of meaning = vector semantics • Intuition behind this – Linguistic items with similar distributions have similar meanings – Zellig Harris (1954): § “oculist and eye-doctor … occur in almost the same environments” § “If A and B have almost identical environments we say that they are synonyms.” – Firth (1957) § “You shall know a word by the company it keeps!” 19
  • 20. Distributional semantics • Example – From context words, humans can guess tesgüino means an alcoholic beverage like beer • Intuition for algorithm – Two words are similar if they have similar word contexts 20 A bottle of tesgüino is on the table. Everybody likes tesgüino. Tesgüino makes you drunk. We make tesgüino out of corn.
  • 21. Four kinds of models of vector semantics • Sparse vector representations – Mutual-information weighted word co-occurrence matrices • Dense vector representations – Singular value decomposition (and Latent Semantic Analysis) – Neural-network-inspired models (skip-grams, CBOW) – Brown clusters 21 today’s focus
  • 22. Shared intuition • Model the meaning of a word by “embedding” in a vector space – The meaning of a word is a vector of numbers – Vector models are also called embeddings • Contrast – Word meaning is represented in many computational linguistic applications by a vocabulary index (“word number 545”) 22
  • 23. Representing vector semantics • Words are related if they occur in the same context • How big is this context? – Entire document? Paragraph? Window of ±n words? – Smaller contexts (e.g., context windows) are best for capturing similarity – If a word w occurs a lot in the context windows of another word v (i.e., if they frequently co-occur), then they are probably related 23 (Textbook excerpt shown on the slide: the word-word matrix is of dimensionality |V| × |V|; the context can be the whole document or, more commonly, a small window around the word, e.g. ±4 words. Example context windows from the Brown corpus around four sample words: “sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of, their enjoyment.” / “Cautiously she sampled her first pineapple and another fruit whose taste she likened” / “well suited to programming on the digital computer. In finding the optimal R-stage policy from” / “for the purpose of gathering data and information necessary for the study authorized in the”)
  • 24. Representing vector semantics • We can now define a word w by a vector of counts of context words – Counts represent how often those words have co-occurred in the same context window with word w – Each vector is of length |V|, where V is the vocabulary – Vector semantics is thus captured in a word-word matrix of size |V| × |V| 24
  • 25. Example: Contexts ±7 words 25 • Word-word co-occurrence counts for four sample words from the Brown corpus (context words: aardvark, computer, data, pinch, result, sugar): – apricot: 0, 0, 0, 1, 0, 1 – pineapple: 0, 0, 0, 1, 0, 1 – digital: 0, 2, 1, 0, 1, 0 – information: 0, 1, 6, 0, 4, 0
  • 26. Word-word matrix • We showed only 4 x 6, but the real matrix is 50,000 x 50,000 – Most values are 0 so it is very sparse – That’s OK, since there are lots of efficient algorithms for sparse matrices • The size of windows depends on your goals – The shorter the windows, the more syntactic the representation § ± 1-3 very syntactic – The longer the windows, the more semantic the representation § ± 4-10 more semantic 26
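A minimal Python sketch of how such a word-word matrix can be built by counting context words in a ±k window around each target word (the helper name and the toy corpus are mine, not from the slides):

  from collections import defaultdict

  def cooccurrence_counts(sentences, window=4):
      # counts[target][context] = how often `context` appears within
      # `window` words to the left or right of `target`
      counts = defaultdict(lambda: defaultdict(int))
      for tokens in sentences:
          for i, target in enumerate(tokens):
              lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
              for j in range(lo, hi):
                  if j != i:
                      counts[target][tokens[j]] += 1
      return counts

  corpus = ["a bottle of tesguino is on the table".split(),
            "everybody likes tesguino".split(),
            "tesguino makes you drunk".split(),
            "we make tesguino out of corn".split()]
  print(dict(cooccurrence_counts(corpus)["tesguino"]))   # 'of' -> 2, 'bottle' -> 1, etc.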
  • 27. Two types of co-occurrence • First-order co-occurrence (syntagmatic association) – They are typically nearby each other – wrote is a first-order associate of book or poem • Second-order co-occurrence (paradigmatic association) – They have similar neighbors – wrote is a second-order associate of words like said or remarked 27
  • 28. 28 Part 2b Vector semantics Positive Pointwise Mutual Information (PPMI)
  • 29. Problem with raw co-occurrence counts • Raw word frequency is not a great measure of association between words – It’s very skewed – “the” and “of” are very frequent, but maybe not the most discriminative • We would prefer a measure that asks whether a context word is particularly informative about the target word – Positive Pointwise Mutual Information (PPMI) 29
  • 30. Pointwise Mutual Information • Do events x and y co-occur more than if they were independent? • PMI between two words (Church & Hanks, 1989) – Do words x and y co-occur more than if they were independent? 30 PMI(x, y) = log2( P(x, y) / (P(x) · P(y)) ), so for two words: PMI(word1, word2) = log2( P(word1, word2) / (P(word1) · P(word2)) )
  • 31. Positive Pointwise Mutual Information • PMI ranges from –∞ to +∞, but negative values are problematic – Things are co-occurring less than we expect by chance – Unreliable without enormous corpora § Imagine w1 and w2 whose probability is each 10-6 § Hard to be sure p(w1, w2) is significantly different than 10-12 – Plus it is not clear people are good at “unrelatedness” • So we just replace negative PMI values by 0 – Positive PMI (PPMI) between word1 and word2: PPMI(word1, word2) = max( log2( P(word1, word2) / (P(word1) · P(word2)) ), 0 ) 31
  • 32. Computing PPMI on a term-context matrix • Matrix F with W rows (words) and C columns (contexts) • fij is # of times wi occurs in context cj – p_ij = f_ij / Σ_{i=1..W} Σ_{j=1..C} f_ij – p_i* = Σ_{j=1..C} f_ij / Σ_{i=1..W} Σ_{j=1..C} f_ij – p_*j = Σ_{i=1..W} f_ij / Σ_{i=1..W} Σ_{j=1..C} f_ij – pmi_ij = log2( p_ij / (p_i* · p_*j) ) – ppmi_ij = pmi_ij if pmi_ij > 0, else 0 32
  • 33. Worked example 33 • p(w = information, c = data) = 6/19 = .32 ; p(w = information) = 11/19 = .58 ; p(c = data) = 7/19 = .37 – using p_ij = f_ij / Σ_i Σ_j f_ij, p(w_i) = Σ_j f_ij / N, p(c_j) = Σ_i f_ij / N • p(w, context) table (context words: computer, data, pinch, result, sugar; last column is p(w)): – apricot: .00, .00, .05, .00, .05 | .11 – pineapple: .00, .00, .05, .00, .05 | .11 – digital: .11, .05, .00, .05, .00 | .21 – information: .05, .32, .00, .21, .00 | .58 – p(context): .16, .37, .11, .26, .11
  • 34. 34 • Using pmi_ij = log2( p_ij / (p_i* · p_*j) ) on the p(w, context) table above: PMI(information, data) = log2( .32 / (.37 × .58) ) = .57 • PPMI(w, context) table (context words: computer, data, pinch, result, sugar; “-” marks cells with a zero co-occurrence count): – apricot: -, -, 2.25, -, 2.25 – pineapple: -, -, 2.25, -, 2.25 – digital: 1.66, 0.00, -, 0.00, - – information: 0.00, 0.57, -, 0.47, -
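A minimal NumPy sketch of the PPMI computation on slides 32-34 (the function name is mine); it reproduces PPMI(information, data) ≈ 0.57 from the count table on slide 25:

  import numpy as np

  def ppmi(F):
      # F[i, j] = co-occurrence count of word i with context j
      p_ij = F / F.sum()
      p_i = p_ij.sum(axis=1, keepdims=True)    # word (row) marginals
      p_j = p_ij.sum(axis=0, keepdims=True)    # context (column) marginals
      with np.errstate(divide="ignore", invalid="ignore"):
          pmi = np.log2(p_ij / (p_i * p_j))
      pmi[~np.isfinite(pmi)] = 0.0             # zero counts give -inf (or NaN); treat as 0
      return np.maximum(pmi, 0.0)              # clip remaining negative PMI values to 0

  # Counts from slide 25 (rows: apricot, pineapple, digital, information;
  # columns: computer, data, pinch, result, sugar)
  F = np.array([[0, 0, 1, 0, 1],
                [0, 0, 1, 0, 1],
                [2, 1, 0, 1, 0],
                [1, 6, 0, 4, 0]], dtype=float)
  print(ppmi(F).round(2))    # PPMI(information, data) ≈ 0.57, PPMI(digital, computer) ≈ 1.66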
  • 35. One more problem… • We are unlikely to encounter rare words unless we have large corpora – PMI values cannot be calculated if the co-occurrence count is 0 • Solution: give rare words slightly higher probabilities – Steal probability mass to generalize better – Laplace smoothing (aka add-one smoothing) § Pretend we saw each word one more time than we did 35 (Figure: probability mass shaved off seen words such as allegations, reports, claims, request and redistributed to unseen words such as attack, man, outcome)
  • 36. 36 Part 2c Vector semantics Measuring word similarity: the cosine
  • 37. Measuring similarity • We need a way to measure the similarity between two target words v and w – Most vector similarity measures are based on the dot product or inner product from linear algebra – High when two vectors have large values in same dimensions – Low (in fact 0) for orthogonal vectors with zeros in complementary distribution 37 dot-product(v, w) = v · w = Σ_{i=1..N} v_i w_i = v1 w1 + v2 w2 + … + vN wN (The slide also shows Figure 19.6 from Jurafsky & Martin, the Add-2 Laplace-smoothed PPMI matrix.)
  • 38. Measuring similarity • Problem: dot product is not normalized for vector length – Vectors are longer if they have higher values in each dimension – That means more frequent words will have higher dot products, since frequent words co-occur with more words and have higher co-occurrence counts – Our similarity metric should not be sensitive to word frequency • Solution: divide it by the length of the two vectors – Is equal to the cosine of the angle between the two vectors! – This is the cosine similarity 38 a · b = |a| |b| cos θ ⟹ cos θ = (a · b) / (|a| |b|)
  • 39. Calculating the cosine similarity • vi is the PPMI value for word v in context i • wi is the PPMI value for word w in context i • cos(v, w) is the cosine similarity between v and w – Raw frequency or PPMI are non-negative, so cosine range is [0, 1] 39 cos(v, w) = (v · w) / (|v| |w|) = Σ_{i=1..N} v_i w_i / ( √(Σ_{i=1..N} v_i²) · √(Σ_{i=1..N} w_i²) ) (dot product of the unit vectors)
  • 40. Example: Which pair of words is more similar? 40 • Counts (contexts: large, data, computer): – apricot: 2, 0, 0 – digital: 0, 1, 2 – information: 1, 6, 1 • Using cos(v, w) = Σ v_i w_i / ( √(Σ v_i²) · √(Σ w_i²) ): – cosine(apricot, digital) = (0 + 0 + 0) / ( √(4 + 0 + 0) · √(0 + 1 + 4) ) = 0 – cosine(digital, information) = (0 + 6 + 2) / ( √(0 + 1 + 4) · √(1 + 36 + 1) ) = 8 / (√5 · √38) = .58 – cosine(apricot, information) = (2 + 0 + 0) / ( √(4 + 0 + 0) · √(1 + 36 + 1) ) = 2 / (2 · √38) = .16
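A minimal Python sketch of the cosine computation above (vector values taken from the slide; the variable names are mine):

  import numpy as np

  def cosine(v, w):
      return float(np.dot(v, w) / (np.linalg.norm(v) * np.linalg.norm(w)))

  vec = {                       # context dimensions: large, data, computer
      "apricot":     np.array([2.0, 0.0, 0.0]),
      "digital":     np.array([0.0, 1.0, 2.0]),
      "information": np.array([1.0, 6.0, 1.0]),
  }
  print(round(cosine(vec["apricot"], vec["digital"]), 2))       # 0.0
  print(round(cosine(vec["digital"], vec["information"]), 2))   # 0.58
  print(round(cosine(vec["apricot"], vec["information"]), 2))   # 0.16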
  • 41. Visualizing vectors and angles 41 (Figure: the 2-D vectors apricot = (2, 0), digital = (0, 1), and information = (1, 6) plotted against Dimension 1: ‘large’ and Dimension 2: ‘data’)
  • 45. Evaluating word similarity • Intrinsic evaluation – Correlation between algorithm and human word similarity ratings § Wordsim353 ➞ 353 noun pairs rated on a 0-10 scale – Taking TOEFL multiple-choice vocabulary tests • Extrinsic (= task-based, end-to-end) evaluation – Question answering – Spell-checking – Essay grading 45 sim(plane, car) = 5.77 Levied is closest in meaning to: imposed / believed / requested / correlated
  • 47. Sparse vs. dense vectors • PPMI vectors are – Long (length |V|= 20,000 to 50,000) – Sparse (most elements are zero) • Alternative: learn vectors which are – Short (length 200-1000) – Dense (most elements are non-zero) 47
  • 48. Sparse vs. dense vectors • Why dense vectors? – Short vectors may be easier to use as features in machine learning (fewer weights to tune) – De-noising ➞ low-order dimensions may represent unimportant information – Dense vectors may generalize better than storing explicit counts – Dense models may do better at capturing higher-order co-occurrence § car and automobile are synonyms, but represented as distinct dimensions § Fails to capture similarity between a word with car as a neighbor and a word with automobile as a neighbor (= paradigmatic association) 48
  • 49. Methods for creating dense vector embeddings • Singular Value Decomposition (SVD) – A special case of this is called Latent Semantic Analysis (LSA) • “Neural Language Model”-inspired predictive models (skip-grams and CBOW) • Brown clustering 49
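A minimal NumPy sketch of the SVD route (an LSA-style truncation, assuming a PPMI or count matrix X has already been built; the function name and the random stand-in matrix are mine):

  import numpy as np

  def svd_embeddings(X, k=100):
      # Truncated SVD: keep the top-k singular dimensions as dense word vectors
      U, S, Vt = np.linalg.svd(X, full_matrices=False)
      return U[:, :k] * S[:k]          # one dense k-dimensional row per word

  X = np.random.rand(500, 2000)        # stand-in for a |V| x |C| PPMI matrix
  emb = svd_embeddings(X, k=100)
  print(emb.shape)                     # (500, 100)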
  • 50. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 50
  • 53. What is text classification? • Goal: Take a document and assign it a label representing its content • Classic example: decide if a newspaper article is about politics, sports, or business • Many uses for the same technology – Is this e-mail spam or not? – Is this page a laser printer product page? – Does this company accept overseas orders? – Is this tweet positive or negative? – Does this part of the CV describe part of a person’s work experience? – Is this text written in Danish or Norwegian? – Is this the “computer” or “harbor” sense of port? 53
  • 54. Definition • Input: – A document d – A fixed set of classes C = {c1, c2,…, cj} • Output – A predicted class c ∈ C 54
  • 55. Top-down vs. bottom-up text classification • Top-down approach – We tell the computer exactly how it should solve a task (e.g., expert systems) – Example: black-list-address OR ("dollars" AND "have been selected") – Potential for high accuracy, but requires expensive & extensive refinement by experts • Bottom-up approach – The computer finds out for itself with a ‘little’ help from us (e.g., machine learning) – Example: The word “viagra” is very often found in spam e-mails 55
  • 56. Definition (updated) • Input: – A document d – A fixed set of classes C = {c1, c2,…, cj} – A training set of m hand-labeled documents (d1, c1), …, (dm, cm) • Output – A learned classifier γ: d ➞ c 56
  • 57. Bottom-up text classification • Any kind of classifier – Naïve Bayes – Logistic regression – Support Vector Machines – k-Nearest Neighbor – Decision tree learning – ... 57 today’s focus
  • 59. Naïve Bayes • Simple (“naïve”) classification method based on Bayes rule – Relies on very simple representation of document: bag of words 59
  • 60. Bag-of-words representation 60 • Example document: “I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!” • Its bag-of-words representation (word: count): it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
  • 61. Bag-of-words representation 61 γ( seen: 2, sweet: 1, whimsical: 1, recommend: 1, happy: 1, … ) = c
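A minimal Python sketch of the bag-of-words step (the crude lowercase-and-whitespace tokenization is my assumption, not something the slide prescribes):

  from collections import Counter

  def bag_of_words(text):
      # deliberately naive tokenization: lowercase + whitespace split
      return Counter(text.lower().split())

  print(bag_of_words("I love this movie! It's sweet, but with satirical humor."))
  # Counter({'i': 1, 'love': 1, 'this': 1, ...})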
  • 62. Naïve Bayes • Simple (“naïve”) classification method based on Bayes rule – Relies on very simple representation of document: bag of words • Bayes’ rule applied to documents and classes – For a document d and a class c: P(c|d) = P(d|c) P(c) / P(d) 62
  • 63. Naïve Bayes (cont’d) 63 c_MAP = argmax_{c ∈ C} P(c|d) (MAP is “maximum a posteriori” = most likely class) = argmax_{c ∈ C} P(d|c) P(c) / P(d) (Bayes’ rule) = argmax_{c ∈ C} P(d|c) P(c) (dropping the denominator) = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c) (document d represented as features x1, …, xn)
  • 64. Naïve Bayes (cont’d) • How often does this class occur? – We can just count the relative frequencies in a corpus – O(|X|^n · |C|) parameters – Could only be estimated if a very, very large number of training examples was available! 64 c_MAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
  • 65. Multinomial Naïve Bayes 65 • Bag-of-words assumption – Assume position does not matter • Conditional independence assumption – Assume the feature probabilities P(xi | cj) are independent given the class c P(x1, x2, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)
  • 66. Multinomial Naïve Bayes (cont’d) 66 c_MAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c) c_NB = argmax_{c ∈ C} P(c) ∏_{x ∈ X} P(x|c)
  • 67. Multinomial Naïve Bayes (cont’d) 67 c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi | cj) (positions = all word positions in the test document)
  • 68. Learning the Multinomial Naïve Bayes Model • First attempt: maximum likelihood estimates – Simply use the frequencies in the data 68 P̂(cj) = doc_count(C = cj) / N_doc (fraction of documents with topic cj) P̂(wi | cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj) (fraction of times word wi appears among all words in documents of topic cj)
  • 69. Problem with Maximum Likelihood • What if we have seen no training documents with the word fantastic and classified in the topic positive (= thumbs-up)? P̂(“fantastic” | positive) = count(“fantastic”, positive) / Σ_{w ∈ V} count(w, positive) = 0 • Zero probabilities cannot be conditioned away, no matter the other evidence! c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c) 69
  • 70. Solution: Laplace smoothing • Also known as add-one (or add-α) – Pretend we saw each word one more time (or α more times) than we did – For α = 1: P̂(wi | cj) = (count(wi, cj) + 1) / Σ_{w ∈ V} (count(w, cj) + 1) = (count(wi, cj) + 1) / ( (Σ_{w ∈ V} count(w, cj)) + |V| ) 70
  • 71. Learning a multinomial Naïve Bayes model • From training corpus, extract vocabulary V • Calculate the P(cj) terms – For each cj in C do § docsj ⟵ all docs with class = cj § P(cj) ⟵ |docsj| / |total # of documents| • Calculate the P(wk | cj) terms – Textj ⟵ single doc containing all docsj (aggregate representation of all documents of class j) – For each word wk in V § nk ⟵ # of occurrences of wk in Textj § P(wk | cj) ⟵ (nk + α) / (n + α·|V|) 71
  • 72. Worked example: Priors 72 • Training set: – Doc 1: “Chinese Beijing Chinese”, class China – Doc 2: “Chinese Chinese Shanghai”, class China – Doc 3: “Chinese Macao”, class China – Doc 4: “Tokyo Japan Chinese”, class Japan • Test set: – Doc 5: “Chinese Chinese Chinese Tokyo Japan”, class ? • Priors: P̂(c) = Nc / N, so P(China) = 3/4 and P(Japan) = 1/4
  • 73. Worked example: Conditional probabilities 73 • Add-one smoothed estimate: P̂(w|c) = (count(w, c) + 1) / (count(c) + |V|) • P(“Chinese” | China) = (count(“Chinese”, China) + 1) / (count(China) + |V|) = (5 + 1) / (8 + 6) = 6/14 = 3/7 (training and test documents as on the previous slide)
  • 74. Worked example: Conditional probabilities 74 • P(“Chinese” | China) = (5 + 1) / (8 + 6) = 3/7 • P(“Tokyo” | China) = (0 + 1) / (8 + 6) = 1/14 • P(“Japan” | China) = (0 + 1) / (8 + 6) = 1/14 • P(“Chinese” | Japan) = (1 + 1) / (3 + 6) = 2/9 • P(“Tokyo” | Japan) = (1 + 1) / (3 + 6) = 2/9 • P(“Japan” | Japan) = (1 + 1) / (3 + 6) = 2/9
  • 75. Worked example: Choosing a class 75 • c_NB = argmax_{c ∈ C} P(c) ∏_{x ∈ X} P(x|c), with the priors and conditional probabilities from the previous slides • P(China | doc5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003 • P(Japan | doc5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001 • The classifier therefore assigns doc5 (“Chinese Chinese Chinese Tokyo Japan”) to class China
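A minimal Python sketch of multinomial Naïve Bayes with add-one smoothing that reproduces the China/Japan worked example above (all function and variable names are mine):

  from collections import Counter, defaultdict
  import math

  train = [("Chinese Beijing Chinese", "China"),
           ("Chinese Chinese Shanghai", "China"),
           ("Chinese Macao", "China"),
           ("Tokyo Japan Chinese", "Japan")]

  docs_per_class = Counter(c for _, c in train)           # priors: China 3, Japan 1
  word_counts = defaultdict(Counter)                      # per-class word counts
  for text, c in train:
      word_counts[c].update(text.split())
  vocab = {w for text, _ in train for w in text.split()}  # |V| = 6

  def posterior(text, c):
      # P(c) times the product of add-one smoothed P(w | c), computed in log space
      logp = math.log(docs_per_class[c] / len(train))
      total = sum(word_counts[c].values())
      for w in text.split():
          logp += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
      return math.exp(logp)

  test = "Chinese Chinese Chinese Tokyo Japan"
  for c in ("China", "Japan"):
      print(c, round(posterior(test, c), 4))              # China 0.0003, Japan 0.0001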
  • 76. Summary • Naive Bayes is not so naïve – Very fast – Low storage requirements – Robust to irrelevant features § Irrelevant features cancel each other out without affecting results – Very good in domains with many equally important features § Decision Trees suffer from fragmentation in such cases – especially if little data – A good, dependable baseline for text classification 76
  • 77. Naïve Bayes in spam filtering • SpamAssassin features (http://spamassassin.apache.org/tests_3_3_x.html) – Properties § From: starts with many numbers § Subject is all capitals § HTML has a low ratio of text to image area § Claims you can be removed from the list – Phrases § “viagra” § “impress ... girl” § “One hundred percent guaranteed” § “Prestigious Non-Accredited Universities” 77
  • 79. Evaluation • Text classification can be seen as an application of machine learning – Evaluation is similar to best-practice in machine learning • Experimental setup – Splitting in training, development, and test set – Cross-validation for parameter optimization • Evaluation – Precision § For a particular class c, how many times were we correct in predicting c? – Recall § For a particular class c, how many of the actual instances of c did we manage to find? – Other metrics: F-score, AUC 79
  • 80. Precision vs. recall • High precision – When all returned answers must be correct – Good when missing results are not problematic – More common from hand-built systems • High recall – You get all the right answers, but garbage too – Good when incorrect results are not problematic – More common from automatic systems • Trade-off – In general, one can trade one for the other – But it is harder to score well on both 80 (Figure: precision-recall trade-off curve)
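A minimal Python sketch of per-class precision and recall (the gold labels and predictions are made up for illustration):

  def precision_recall(gold, pred, c):
      tp = sum(g == c and p == c for g, p in zip(gold, pred))   # correctly predicted c
      fp = sum(g != c and p == c for g, p in zip(gold, pred))   # predicted c, but wrong
      fn = sum(g == c and p != c for g, p in zip(gold, pred))   # missed instances of c
      precision = tp / (tp + fp) if tp + fp else 0.0
      recall = tp / (tp + fn) if tp + fn else 0.0
      return precision, recall

  gold = ["spam", "spam", "ham", "ham", "spam"]
  pred = ["spam", "ham", "ham", "spam", "spam"]
  print(precision_recall(gold, pred, "spam"))   # both 2/3 ≈ 0.67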
  • 81. Non-binary classification • What if we have more than two classes? – Solution: Train sets of binary classifiers • Scenario 1: Any-of classification (aka multivalue classification) – A document can belong to 0, 1, or >1 classes – For each class c∈C § Build a classifier γc to distinguish c from all other classes c’∈C – Given test document d § Evaluate it for membership in each class using each γc § d belongs to any class for which γc returns true 81
  • 82. Non-binary classification • Scenario 2: One-of classification (aka multinomial classification) – A document can belong to exactly 1 class – For each class c∈C § Build a classifier γc to distinguish c from all other classes c’∈C – Given test document d § Evaluate it for membership in each class using each γc § d belongs to the one class with maximum score 82
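A minimal Python sketch contrasting the two decision rules, assuming each binary classifier γc has already produced a score for the test document (the class names, scores, and the 0.5 threshold are made up for illustration):

  scores = {"politics": 0.81, "sports": 0.35, "business": 0.64}

  # Any-of (multivalue): keep every class whose binary classifier fires
  any_of = [c for c, s in scores.items() if s >= 0.5]   # ['politics', 'business']

  # One-of (multinomial): keep only the single highest-scoring class
  one_of = max(scores, key=scores.get)                  # 'politics'
  print(any_of, one_of)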
  • 84. Text classification • Usually, simple machine learning algorithms are used – Examples: Naïve Bayes, decision trees – Very robust, very re-usable, very fast • Recently, slightly better performance from better algorithms – Examples: SVMs, Winnow, boosting, k-NN • Accuracy is more dependent on – Naturalness of classes – Quality of features extracted – Amount of training data available • Accuracy typically ranges from 65% to 97% depending on the situation – Note particularly performance on rare classes 84
  • 85. The real world • Gee, I’m building a text classifier for real, now! What should I do? • No training data? – Manually written rules § Example: IF (wheat OR grain) AND NOT (whole OR bread) THEN CATEGORIZE AS 'grain' – Need careful crafting § Human tuning on development data § Time-consuming: 2 days per class 85
  • 86. The real world • Very little data? – Use Naïve Bayes § Naïve Bayes is a “high-bias” algorithm (Ng & Jordan, 2002) – Get more labeled data § Find clever ways to get humans to label data for you – Try semi-supervised training methods § Bootstrapping, EM over unlabeled documents, … • A reasonable amount of data? – Perfect for all the clever classifiers § SVM, Regularized Logistic Regression, ... – You can even use user-interpretable decision trees § Users like to hack & management likes quick fixes 86
  • 87. Banko & Brill (2001) The real world • A huge amount of data? – Can achieve high accuracy! – Comes at a cost § SVMs (train time) or kNN (test time) can be too slow § Regularized logistic regression can be somewhat better § So Naïve Bayes can come back into its own again! – With enough data, classifier may not matter… 87
  • 88. Tweaking performance • Domain-specific features and weights are very important in real performance • Sometimes need to collapse terms: – Part numbers, chemical formulas, … – But stemming generally does not help • Upweighting (= counting a word as if it occurred twice) – Title words (Cohen & Singer, 1996) – First sentence of each paragraph (Murata, 1999) – In sentences that contain title words (Ko et al., 2002) 88
  • 89. Generic architecture of (most) NLP tools 1. Input pre-processing 2. Morphological & Part-of-Speech analysis (tokens) 3. Parsing (syntactic & semantic relations between tokens) 4. Context module (context-specific resolution) 5. Inference (according to the aim of the tool) 6. Generation (output representation) 7. Output processing (output representation refinement) 89
  • 91. References • Jackson, P., & Moulinier, I. (2007). Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization. Amsterdam: Benjamins. • Jurafsky, D., & Martin, J. H. (2014). Speech and Language Processing (2nd ed.). Harlow: Pearson Education. • Liddy, E. D. (2005). Enhanced Text Retrieval Using Natural Language Processing. Bulletin of the American Society for Information Science and Technology, 24(4), 14-16. • Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press. 91