2. What is Natural Language Processing (NLP)?
• A field of computer science concerned with interactions
between computers and human (natural) languages.
• A subfield of artificial intelligence.
• Natural language:
Refers to the languages spoken by people, as opposed to
artificial languages such as Java, Python, C++, etc.
Basha D (Natural Language Processing)
3. Forms of Natural Language
• The input/output of an NLP system can be:
– written text
– speech
• We will mostly be concerned with written text (not speech).
• To process written text, we need:
– lexical, syntactic, semantic knowledge about the language
– discourse information, real world knowledge
4. Components of NLP
• Natural Language Understanding
– Mapping the given input in the natural language into a useful representation.
– Different levels of analysis are required:
morphological analysis,
syntactic analysis,
semantic analysis,
discourse analysis, …
• Natural Language Generation
– Producing output in the natural language from some internal representation.
– Different levels of synthesis are required:
deep planning (what to say),
syntactic generation
• NL Understanding is much harder than NL Generation,
but both of them are hard.
5. Why is NL Understanding hard?
• Natural language is extremely rich in form and structure, and
very ambiguous.
– How should meaning be represented?
– Which structures map to which meaning structures?
• One input can mean many different things. Ambiguity can be at
different levels.
– Lexical (word level) ambiguity -- different meanings of words
– Syntactic ambiguity -- different ways to parse the sentence
– Interpreting partial information -- how to interpret pronouns
– Contextual information -- context of the sentence may affect the meaning of that
sentence.
• Many inputs can mean the same thing.
6. Knowledge of Language
• Phonology – concerns how words are related to the sounds that
realize them.
• Morphology – concerns how words are constructed from more
basic meaning units called morphemes. A morpheme is the
primitive unit of meaning in a language.
• Syntax – concerns how words can be put together to form correct
sentences and determines what structural role each word plays in
the sentence and what phrases are subparts of other phrases.
• Semantics – concerns what words mean and how these meanings
combine in sentences to form sentence meaning. The study of
context-independent meaning.
7. Knowledge of Language (cont.)
• Pragmatics – concerns how sentences are used in different
situations and how use affects the interpretation of the sentence.
• Discourse – concerns how the immediately preceding sentences
affect the interpretation of the next sentence. For example,
interpreting pronouns and interpreting the temporal aspects of the
information.
• World Knowledge – includes general knowledge about the
world, as well as what each language user must know about the other’s
beliefs and goals.
8. Ambiguity
I made her duck.
• How many different interpretations does this sentence have?
• What are the reasons for the ambiguity?
• The categories of knowledge of language can be thought of as
ambiguity resolving components.
• How can each ambiguous piece be resolved?
• Does speech input make the sentence even more ambiguous?
– Yes – deciding word boundaries
9. Ambiguity (cont.)
• Some interpretations of: I made her duck.
1. I cooked duck for her.
2. I cooked the duck belonging to her.
3. I created a toy duck which she owns.
4. I caused her to quickly lower her head or body.
5. I used magic and turned her into a duck.
• duck – morphologically and syntactically ambiguous:
noun or verb.
• her – syntactically ambiguous: dative or possessive.
• make – semantically ambiguous: cook or create.
• make – syntactically ambiguous: transitive (one object),
ditransitive (two objects), or taking an object and a verb.
10. Resolve Ambiguities
• We will introduce models and algorithms to resolve ambiguities
at different levels.
• part-of-speech tagging -- Deciding whether duck is a verb or
a noun.
• word-sense disambiguation -- Deciding whether make means
create or cook.
• lexical disambiguation -- Resolving part-of-speech and
word-sense ambiguities are two important kinds of lexical
disambiguation.
• syntactic ambiguity -- her duck is an example of syntactic
ambiguity, and can be addressed by probabilistic parsing.
11. Resolve Ambiguities (cont.)
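I made her duck
Parse 1 (her and duck are separate noun phrases):
(S (NP I) (VP (V made) (NP her) (NP duck)))
Parse 2 (her duck is a single noun phrase):
(S (NP I) (VP (V made) (NP (DET her) (N duck))))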
12. Zipf's law
• States that the frequency of a word is inversely proportional to its rank,
where rank 1 is given to the most frequent word, rank 2 to the second most
frequent, and so on. This is a power-law distribution.
• Zipf's law gives the basic intuition for stopwords: these are the
words with the highest frequencies (that is, the lowest rank numbers) in the text, and they are
typically of limited ‘importance’.
Broadly, there are three kinds of words in any text corpus:
• Highly frequent words, called stop words, such as ‘is’, ‘an’, ‘the’, etc.
• Significant words, which are typically more important for understanding the text
• Rarely occurring words, which are again less important than significant words
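The rank-frequency idea above can be sketched in a few lines of Python. This is only an illustration: the function name rank_frequency and the sample sentence are made up here, and a sample this small is far too short to actually follow Zipf's law.

```python
from collections import Counter
import re

def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    # most_common() sorts by descending count, so rank 1 is the most frequent word.
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]

# Hypothetical toy text; a real corpus is needed to see the power-law shape.
sample = "the cat sat on the mat and the dog sat on the rug"
table = rank_frequency(sample)
for rank, word, count in table:
    # Under an idealized Zipf distribution, rank * count stays roughly constant.
    print(rank, word, count, rank * count)
```

On a large corpus, the words at the top of this table are usually the stop words, and the long tail at the bottom holds the rarely occurring words.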
13. Stopwords
• Generally speaking, stopwords are removed from the text for two reasons:
• They provide little useful information, especially in applications such as spam
detection or search engines.
• Since stopwords occur very frequently, removing them results in
much smaller data. Reduced size results
in faster computation on text data. There is also the advantage of having fewer
features to deal with once stopwords are removed.
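A minimal sketch of stopword removal, assuming a tiny hand-written stopword set (STOPWORDS below is hypothetical; libraries ship much longer lists):

```python
# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"is", "an", "a", "the", "in", "of", "and", "to"}

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]

tokens = "The driver is racing in the car of an engineer".split()
filtered = remove_stopwords(tokens)
print(filtered)
# The token count drops, which is the size reduction described above.
print(len(tokens), "->", len(filtered))
```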
14. NLP tasks that we deal with
• Lexical processing
• Syntactic Analysis
• Semantic processing
15. Lexical Processing
• Stop word removal
• Tokenization
• Bag of words representation
• Stemming and Lemmatization
• DTM
• TF-IDF representation
16. Lexical Processing
• Stopword removal – removing the less important words from the
corpus.
• Tokenization – a technique used to split the text into
smaller elements. These elements can be characters, words,
sentences, or even paragraphs, depending on the application we
are working on.
• Bag-of-words representation – represents text in a format that
we can feed into machine learning algorithms. Here the order in which
words occur does not matter. A bag-of-words model is just the
matrix of word counts that you get from text data.
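Tokenization and the bag-of-words idea can be sketched as follows. The regular expressions are naive assumptions chosen for illustration, not the rules a production tokenizer would use.

```python
import re
from collections import Counter

def sentence_tokenize(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def word_tokenize(text):
    """Naive word tokenizer: lowercase runs of letters, digits and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

def bag_of_words(text):
    """Map each word to its count; word order is discarded."""
    return Counter(word_tokenize(text))

text = "I made her duck. She made a duck!"
print(sentence_tokenize(text))
print(word_tokenize(text))
print(bag_of_words(text))
```

Because only counts are kept, two texts containing the same words in a different order get the same bag-of-words representation.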
17. Lexical Processing (cont.)
• Stemming – a rule-based technique that just chops off the
suffix of a word to get its root form, which is called the ‘stem’.
• Example: in "The driver is racing in his boss’s car", the words
‘driver’ and ‘racing’ are converted to their root forms by just
chopping off the suffixes ‘er’ and ‘ing’. So ‘driver’ is
converted to ‘driv’ and ‘racing’ is converted to ‘rac’.
• Lemmatization – takes an input word and searches for its base
word by going recursively through all the variations of dictionary
words. The base word in this case is called the lemma. Words
such as ‘feet’, ‘drove’, ‘arose’, ‘bought’, etc. cannot be reduced to
their base form by chopping off a suffix, so they need a lemmatizer.
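The contrast between the two techniques can be sketched with a toy suffix stripper and a toy lemma table. Both SUFFIXES and LEMMAS are hypothetical stand-ins; real stemmers (for example the Porter stemmer) apply more careful rules and may produce different stems than this sketch.

```python
SUFFIXES = ("ing", "er", "ed", "s")  # hypothetical, minimal rule set

def naive_stem(word):
    """Chop off the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a real lemma dictionary.
LEMMAS = {"feet": "foot", "drove": "drive", "arose": "arise", "bought": "buy"}

def naive_lemmatize(word):
    """Look the word up in the lemma table; fall back to the word itself."""
    return LEMMAS.get(word, word)

for w in ["driver", "racing", "feet", "bought"]:
    print(w, naive_stem(w), naive_lemmatize(w))
```

The stemmer leaves ‘feet’ and ‘bought’ untouched because no suffix rule applies, while the dictionary lookup maps them to ‘foot’ and ‘buy’.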
18. Lexical Processing (cont.)
• DTM – A document-term matrix describes the
frequency of terms that occur in a collection of documents.
• In a document-term matrix, rows correspond to documents in the
collection and columns correspond to terms.
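A document-term matrix for a small collection can be built by hand as in this sketch. The helper name document_term_matrix and the three sample documents are assumptions made for illustration; whitespace splitting stands in for a real tokenizer.

```python
from collections import Counter

def document_term_matrix(documents):
    """Return (vocabulary, matrix) with one row per document and one column per term."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # A Counter returns 0 for terms that do not occur in this document.
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix

docs = ["the cat sat", "the dog sat", "the cat saw the dog"]
vocabulary, dtm = document_term_matrix(docs)
print(vocabulary)
for row in dtm:
    print(row)
```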
19. Lexical Processing (cont.)
• The TF (term frequency) of a word is how often the
word appears in a document, here divided by the document's length.
• For example, when a 100-word document contains the term “cat”
12 times, the TF for the word ‘cat’ is
• TF(cat) = 12/100, i.e. 0.12
• The IDF (inverse document frequency) of a word is a measure
of how significant that term is in the whole corpus: a term that occurs
in few documents gets a higher IDF than one that occurs in most of them.
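A minimal TF-IDF sketch follows. The slides give no IDF formula, so the code assumes one common variant, IDF = log(N / df), where N is the number of documents and df is the number of documents containing the term; libraries often use smoothed variants instead. The corpus is made up to reproduce the 12/100 example.

```python
import math

def term_frequency(term, tokens):
    """Count of the term divided by the document length."""
    return tokens.count(term) / len(tokens)

def inverse_document_frequency(term, corpus):
    """Assumed variant: log(N / df), or 0.0 if the term occurs nowhere."""
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df) if df else 0.0

def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)

# Hypothetical corpus: "cat" appears 12 times in a 100-word document.
doc = ["cat"] * 12 + ["filler"] * 88
corpus = [doc, ["dog", "filler"], ["bird", "filler"], ["fish", "filler"]]
print(term_frequency("cat", doc))
print(tf_idf("cat", doc, corpus))
print(tf_idf("filler", doc, corpus))
```

A word such as "filler" that appears in every document gets an IDF of zero under this formula, so its TF-IDF weight vanishes no matter how often it occurs.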
20. Syntactic Analysis
• Part-of-speech (POS) tagging
• Named Entity Recognition
• Constituency parsing
• Dependency parsing
21. Part-of-Speech (POS) Tagging
• Each word has a part-of-speech tag to describe its category.
• The part-of-speech tag of a word is one of the major word groups
(or their subgroups).
– open classes -- noun, verb, adjective, adverb
– closed classes -- prepositions, determiners, conjunctions, pronouns, particles
• POS taggers try to find POS tags for the words.
• Is duck a verb or a noun? (A morphological analyzer cannot make
this decision.)
• A POS tagger may make that decision by looking at the surrounding
words.
– Duck! (verb)
– Duck is delicious for dinner. (noun)
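As a rough illustration of "looking at the surrounding words", the toy function below guesses the tag of an ambiguous word from its neighbours. The rules and the DETERMINERS list are invented for this sketch; real taggers learn such contextual cues statistically instead of hard-coding them.

```python
DETERMINERS = {"a", "an", "the", "this", "that"}  # hypothetical list

def tag_duck(tokens, index):
    """Guess 'NOUN' or 'VERB' for the token at `index` using toy context rules."""
    previous = tokens[index - 1].lower() if index > 0 else None
    following = tokens[index + 1].lower() if index + 1 < len(tokens) else None
    if previous in DETERMINERS:
        return "NOUN"   # "the duck", "a duck"
    if following in {"is", "was"}:
        return "NOUN"   # "Duck is delicious": subject of the following verb
    if previous is None and following in {None, "!"}:
        return "VERB"   # bare imperative: "Duck!"
    return "NOUN"       # default guess when no rule fires

print(tag_duck(["Duck", "!"], 0))
print(tag_duck(["Duck", "is", "delicious", "for", "dinner", "."], 0))
```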
22. Syntactic Analysis
• Parsing – A key task in syntactic analysis is parsing. It means
breaking down a given sentence into its 'grammatical constituents'.
Parsing is an important step in many applications and helps us
better understand the linguistic structure of sentences.
E.g.: "The quick brown fox jumps over the table"
• This structure divides the sentence into three main constituents:
'The quick brown fox' is a noun phrase
'jumps' is a verb phrase
'over the table' is a prepositional phrase.
23. Syntactic Analysis
• The IOB (or BIO) method tags each token in the sentence with one of three
labels: I - inside (the entity), O - outside (the entity) and B - beginning (of the
entity).
• IOB labeling is especially helpful if the entities contain multiple words. We
would want our system to read words like ‘Air India’, ‘New Delhi’, etc., as
single entities.
• The Named Entity Recognition (NER) task identifies ‘entities’ in the text. Entities could
refer to names of people, organizations (e.g. Air India, United Airlines),
places/cities (Mumbai, Chicago), dates and time points (May, Wednesday,
morning flight), numbers of specific types (e.g. money - 5000 INR), etc. POS
tagging by itself cannot identify such entities, so IOB
labeling is required. The NER task is then to predict the IOB label of each word.
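Once each token has an IOB label, multi-word entities can be recovered by grouping a B tag with the I tags that follow it. The sketch below assumes tags of the form B-TYPE / I-TYPE / O; the type names ORG, LOC and DATE and the hand-written tag sequence are illustrative assumptions, not the output of a trained model.

```python
def iob_to_entities(tokens, tags):
    """Group tokens into (entity_text, entity_type) pairs using IOB tags such as 'B-ORG'."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity begins; close any entity that was still open.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            current.append(token)
        else:
            # 'O' (or a stray I- tag) ends the current entity.
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

tokens = ["Air", "India", "flies", "to", "New", "Delhi", "on", "Wednesday"]
tags = ["B-ORG", "I-ORG", "O", "O", "B-LOC", "I-LOC", "O", "B-DATE"]
entities = iob_to_entities(tokens, tags)
print(entities)
```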
24. Syntactic Analysis
• Constituency parsers divide the sentence into constituent
phrases such as noun phrases, verb phrases, prepositional phrases,
etc. Each constituent phrase can itself be divided into further
phrases. The constituency parse tree given below divides the
sentence into two main phrases - a noun phrase and a verb phrase.
The verb phrase is further divided into a verb and a prepositional
phrase, and so on.
25. Syntactic Analysis
• Dependency Parsers do not divide a sentence into constituent
phrases, but rather establish relationships directly between the
words themselves. The figure below is an example of a
dependency parse tree.
26. Semantic Analysis
• Assigning meanings to the structures created by syntactic
analysis.
• Mapping words and structures to particular domain objects in a way
consistent with our knowledge of the world.
• Semantics can play an important role in selecting among competing
syntactic analyses and discarding illogical analyses.
– I robbed the bank -- is bank a river bank or a financial institution?
• We have to decide the formalisms which will be used in the
meaning representation.
28. Databases: WordNet and ConceptNet
• WordNet is a semantically oriented dictionary of English, similar
to a traditional thesaurus but with a richer structure.
• WordNet is available through NLTK, and we can use it to identify
the 'correct' sense of a word (i.e. for word sense disambiguation).
• ConceptNet is a representation that provides commonsense
linkages between words. For example, it states that bread is
commonly found near toasters. These everyday facts could be
useful if, for example, you wanted to make a smart chatbot which says
- “Since you like toasters, do you also like bread? I can order some for
you.”
29. Distributional Semantics
• The term-document occurrence matrix, where each row is a term
in the vocabulary and each column is a document (such as a
webpage, tweet, book, etc.).
• The term-term co-occurrence matrix, where the entry in the ith row and jth
column represents the occurrence of the ith word in the context
of the jth word.
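A term-term co-occurrence matrix can be sketched with nested dictionaries. Here "context" is assumed to mean a window of a fixed number of words on each side within the same sentence; the window size and the sample sentences are illustrative choices, not something the slides specify.

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """Count how often each word appears within `window` positions of another word."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            start, end = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts

sentences = ["the bank approved the loan", "the river bank was muddy"]
matrix = cooccurrence_matrix(sentences, window=2)
# Row for "bank": the words that appear in its context, with counts.
print(dict(matrix["bank"]))
```

With a symmetric window like this one, the count for (word i, word j) equals the count for (word j, word i).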
32. Word Sense Disambiguation
• Word sense disambiguation (WSD) is the task of identifying the
correct sense of an ambiguous word such as 'bank', 'bark', 'pitch'
etc.
• Supervised techniques for word sense disambiguation require the
input words to be tagged with their senses.
• Supervised example: Naive Bayes classifier.
• Unsupervised example: Lesk algorithm.
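The idea behind the Lesk algorithm can be sketched in a simplified form: pick the sense whose dictionary gloss shares the most words with the context. The two-sense inventory and the stopword set below are hypothetical; a real implementation would take its glosses from a resource such as WordNet.

```python
# Hypothetical two-sense inventory for "bank"; real systems use dictionary glosses.
SENSES = {
    "financial institution": "an institution that accepts deposits and lends money",
    "river bank": "sloping land beside a body of water such as a river",
}
STOPWORDS = {"a", "an", "the", "of", "and", "that", "as", "such",
             "i", "on", "my", "at", "to"}

def content_words(text):
    """Lowercased words of the text with stopwords removed."""
    return {w for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(context, senses=SENSES):
    """Pick the sense whose gloss shares the most content words with the context."""
    context_words = content_words(context)
    overlaps = {sense: len(context_words & content_words(gloss))
                for sense, gloss in senses.items()}
    return max(overlaps, key=overlaps.get)

print(simplified_lesk("I deposited money at the bank"))
print(simplified_lesk("We sat on the bank of the river beside the water"))
```

No sense-tagged training data is needed here, which is why this family of methods is listed on the unsupervised side.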
33. Natural Language Generation
• NLG is the process of constructing natural language outputs from
non-linguistic inputs.
• NLG can be viewed as the reverse process of NL understanding.
• An NLG system may have two main parts:
– Discourse Planner -- decides what will be generated, i.e. which
sentences.
– Surface Realizer -- realizes a sentence from its internal
representation.
• Lexical Selection -- selecting the correct words describing the
concepts.
34. Some NLP Applications
• Machine Translation – Translation between two natural
languages.
• Information Retrieval – Web search (monolingual or multilingual).
• Query Answering/Dialogue – Natural language interface with a
database system, or a dialogue system.
• Chat Bots
• Sentiment Analysis
• Some Small Applications –
– Grammar Checking, Spell Checking, Spell Correctors
35. Python Libraries for NLP
• NLTK – supports more languages than the other
libraries; no support for word vectors
• spaCy – fast NLP framework; provides built-in word vectors
• Gensim – designed primarily for unsupervised text modelling
• TextBlob – provides language translation and detection
powered by Google Translate