Natural Language Processing
CC-9016
Dr. C. Saravanan, Ph.D., NIT Durgapur
dr.cs1973@gmail.com
Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
Introduction
• What is NLP ?
• Natural language processing (NLP) can be defined as the automatic (or semi-
automatic) processing of human language.
• Nowadays, alternative terms are often preferred, like `Language
Technology' or `Language Engineering'.
• Language is often used in contrast with speech (e.g., Speech and
Language Technology).
Introduction cont…
• NLP is essentially multidisciplinary: it is closely related to linguistics.
• It also has links to research in cognitive science, psychology,
philosophy and maths (especially logic).
• Within Computer Science, it relates to formal language theory,
compiler techniques, theorem proving, machine learning and human-
computer interaction.
• Of course it is also related to AI, though nowadays it's not generally
thought of as part of AI.
Introduction cont…
• The HAL 9000 computer in Stanley Kubrick’s film 2001: A Space Odyssey is
one of the most recognizable characters in twentieth-century cinema.
• HAL is an artificial agent capable of speaking and understanding English,
and even reading lips.
• HAL is able to do information retrieval (finding out where needed textual
resources reside), information extraction (extracting pertinent facts from
those textual resources), and inference (drawing conclusions based on
known facts).
Space Odyssey
Introduction cont…
Other valuable areas of language processing include
spelling correction,
grammar checking,
information retrieval, and
machine translation.
Example: Is the door open? The door is open.
Online Applications
• Dictation - Online Speech Recognition https://dictation.io/
• SpeechPad - a new voice notebook https://speechpad.pw/
• Speechlogger - https://speechlogger.appspot.com/en/
• Talktyper https://talktyper.com/
(a microphone and speaker are required)
Knowledge in Language Processing
• Requires knowledge about phonetics and phonology, which can help
model how words are pronounced in colloquial speech
• they're- they + are. They're from Canada.
• their- something belongs to "them." This is their car.
• there- in that place. The park is over there.
• two- the number 2. I'll have two hamburgers, please.
• to- this has many meanings. One means "in the direction of." I'm
going to South America.
• too- also. I want to go, too.
Knowledge in Language Processing cont…
• Requires knowledge about morphology, which captures information
about the shape and behavior of words in context.
• Singular and plural: man → men, boy → boys
• The knowledge needed to order and group words together comes
under the heading of syntax.
Knowledge in Language Processing cont…
• Requires knowledge of the meanings of the component words, the
domain of lexical semantics
• and knowledge of how these components combine to form larger
meanings, compositional semantics.
• No
• No, I won’t open the door.
• I’m sorry and I’m afraid, I can’t. (indirect refusal)
Knowledge in Language Processing cont…
• The knowledge of language needed to engage in complex language
behavior can be separated into six distinct categories.
• Phonetics and Phonology – The study of linguistic sounds.
• Morphology – The study of the meaningful components of words.
• Syntax – The study of the structural relationships between words.
• Semantics – The study of meaning.
• Pragmatics – The study of how language is used to accomplish goals.
• Discourse – The study of linguistic units larger than a single utterance.
AMBIGUITY
• Some input is ambiguous if there are multiple alternative linguistic
structures that can be built for it.
• Look at the dog with one eye.
• Look at the dog using only one of your eyes.
• Look at the dog that only has one eye.
• The phrase with one eye is ambiguous in the sentence: it can modify
either the dog or the act of looking.
AMBIGUITY
• In some cases a word is lexically ambiguous: for example, make can
mean ‘create’ or ‘cook’.
• Such lexical ambiguity can be resolved by word sense disambiguation,
• while the intended force of an utterance can be resolved by speech
act interpretation.
• Example: I make a cake.
Models and Algorithms
• Most important elements are
• state machines,
• formal rule systems,
• logic,
• probability theory
• Well known computational algorithms
• dynamic programming
• state space search
State Machines
• State machines are formal models that consist of states, transitions
among states, and an input representation.
• Variations of this basic model include deterministic and non-
deterministic finite-state automata; finite-state transducers, which
can write to an output device; weighted automata; and Markov models
and hidden Markov models, which have a probabilistic component.
Formal Rule Systems
• The declarative counterparts of state machines; the most important are
• regular grammars
• regular relations
• context-free grammars
• feature-augmented grammars
• probabilistic variants of them all.
• Algorithms often used with these models include depth-first search,
best-first search, and A* search.
Logic and Probability Theory
• First-order logic, also known as the predicate calculus, along with
related formalisms such as feature structures, semantic networks, and
conceptual dependency.
• Probability theory helps solve many kinds of ambiguity problems:
• given N choices for some ambiguous input, choose the most probable one
Language, Thought, and Understanding
• Alan Turing (1950) introduced the Turing Test.
• In Turing’s game, there are three participants: two people and a computer.
• One of the people is a contestant and plays the role of an interrogator.
• To win, the interrogator must determine which of the other two
participants is the machine by asking a series of questions via a teletype.
• The task of the machine is to fool the interrogator into believing it is a
person by responding as a person would to the interrogator’s questions.
• The task of the second human participant is to convince the interrogator
that the other participant is the machine, and that they are human.
ELIZA
• ELIZA (1966) was an early natural language processing system that used
pattern matching to carry on a limited form of conversation with a
user.
• User1: You are like my father in some ways.
• ELIZA1: WHAT RESEMBLANCE DO YOU SEE
• User2: You are not very aggressive but I think you don’t want me to notice that.
• ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE
• User3: You don’t argue with me.
• ELIZA3: WHY DO YOU THINK I DON’T ARGUE WITH YOU
• User4: You are afraid of me.
• ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
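The pattern-matching idea behind ELIZA can be sketched in a few lines. The three rules below are invented to cover part of the sample dialogue above (the real ELIZA had a much larger script and also swapped pronouns in the captured text):

```python
import re

# Each rule pairs a regular expression with a response template;
# \1 substitutes the user's captured words. Rules are tried in order.
RULES = [
    (re.compile(r"you are like my (.*)", re.I), "WHAT RESEMBLANCE DO YOU SEE"),
    (re.compile(r"you are (.*)", re.I), r"WHAT MAKES YOU THINK I AM \1"),
    (re.compile(r"you (.*) me", re.I), r"WHY DO YOU THINK I \1 YOU"),
]

def respond(utterance):
    """Apply the first matching rule; fall back to a neutral prompt."""
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return match.expand(template).upper().rstrip(".?! ")
    return "PLEASE GO ON"
```

For instance, `respond("You don't argue with me.")` reproduces ELIZA3 above. The fallback response is what makes such systems seem engaged even when no rule fires.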
HISTORY 1940’s and 1950’s
• Shannon (1948) contributed to automata theory by applying
probabilistic models of discrete Markov processes to automata for
language.
• Chomsky (1956) first considered finite-state machines as a way to
characterize a grammar, and defined a finite-state language as a language
generated by a finite-state grammar.
• These early models led to the field of formal language theory, which used
algebra and set theory to define formal languages as sequences of symbols.
• This includes the context-free grammar, first defined by Chomsky (1956) for
natural languages but independently discovered by Backus (1959) and Naur
et al. (1960) in their descriptions of the ALGOL programming language.
• The sound spectrograph was developed (Koenig et al., 1946)
1957–1970
• Speech and Language Processing had split very cleanly into two
paradigms: symbolic and stochastic.
• The first was the work of Chomsky and others on formal language
theory and generative syntax throughout the late 1950’s and early to
mid 1960’s, and the work of many linguists and computer scientists
on parsing algorithms, initially top-down and bottom-up, and then via
dynamic programming.
• One of the earliest complete parsing systems was Zelig Harris’s
Transformations and Discourse Analysis Project (TDAP), which was
implemented between June 1958 and July 1959 at the University of
Pennsylvania (Harris, 1962).
1957–1970 cont…
• The second line of research was the new field of artificial intelligence.
• In 1956, John McCarthy, Marvin Minsky, Claude Shannon, and
Nathaniel Rochester organized a two-month workshop on artificial
intelligence.
• The major focus of the new field was the work on reasoning and logic
typified by Newell and Simon’s work on the Logic Theorist and the
General Problem Solver.
• These were simple systems which worked in single domains mainly by
a combination of pattern matching and key-word search with simple
heuristics for reasoning and question-answering.
1957–1970 cont…
• Bledsoe and Browning (1959) built a Bayesian system for text-recognition
(optical character recognition) that used a large dictionary and computed
the likelihood of each observed letter sequence given each word in the
dictionary by multiplying the likelihoods for each letter.
• Mosteller and Wallace (1964) applied Bayesian methods to the problem of
authorship attribution on The Federalist papers.
• Other milestones were the Brown corpus of American English, a 1 million
word collection of samples from 500 written texts in different genres
(newspaper, novels, non-fiction, academic, etc.), assembled in 1963-64
(Kucera and Francis, 1967; Francis, 1979; Francis and Kucera, 1982), and
William S. Y. Wang’s 1967 DOC (Dictionary on Computer), an on-line
Chinese dialect dictionary.
1970–1983
• The stochastic paradigm advanced through use of the Hidden Markov
Model and the metaphors of the noisy channel and decoding, developed
independently by Jelinek, Bahl, Mercer, and colleagues at IBM’s Thomas
J. Watson Research Center, and by Baker at Carnegie Mellon University.
• The logic-based paradigm was begun by the work of Colmerauer and
his colleagues on Q-systems and metamorphosis grammars
(Colmerauer, 1970, 1975), the forerunners of Prolog and Definite
Clause Grammars (Pereira and Warren, 1980).
• Independently, Kay’s (1979) work on functional grammar and, shortly
later, Bresnan and Kaplan’s (1982) work on LFG established the
importance of feature structure unification.
1970–1983 cont…
• Terry Winograd’s SHRDLU system which simulated a robot embedded
in a world of toy blocks (Winograd, 1972a). The program was able to
accept natural language text commands (Move the red block on top of
the smaller green one) of a hitherto unseen complexity and
sophistication.
• Roger Schank and his colleagues and students built a series of
language understanding programs that focused on human conceptual
knowledge such as scripts, plans and goals, and human memory
organization (Schank and Abelson, 1977; Schank and Riesbeck, 1981;
Cullingford, 1981; Wilensky, 1983; Lehnert, 1977).
1970–1983 cont…
• The discourse modeling paradigm focused on four key areas in
discourse. Grosz and her colleagues proposed ideas of discourse
structure and discourse focus (Grosz, 1977a; Sidner, 1983a), a number
of researchers began to work on automatic reference resolution
(Hobbs, 1978a), and the BDI (Belief-Desire-Intention) framework for
logic-based work on speech acts was developed (Perrault and Allen,
1980; Cohen and Perrault, 1979).
1983-1993
• The finite-state models, which began to receive attention again after
work on finite-state phonology and morphology by (Kaplan and Kay,
1981) and finite-state models of syntax by Church (1980).
• Another trend was the rise of probabilistic models throughout
speech and language processing, influenced strongly by the work at
the IBM Thomas J. Watson Research Center on probabilistic models of
speech recognition.
1994-1999
• First, probabilistic and data-driven models had become quite standard
throughout natural language processing.
• Algorithms for parsing, part of speech tagging, reference resolution, and
discourse processing all began to incorporate probabilities, and employ
evaluation methodologies borrowed from speech recognition and
information retrieval.
• Second, the increases in the speed and memory of computers had allowed
commercial exploitation of a number of subareas of speech and language
processing, in particular speech recognition and spelling and grammar
checking.
• Finally, the rise of the Web emphasized the need for language-based
information retrieval and information extraction.
NLP applications
• spelling and grammar checking
• optical character recognition (OCR)
• screen readers for blind and partially sighted users
• augmentative and alternative communication
• machine aided translation
• lexicographers' tools
• information retrieval
• document classification (filtering, routing)
• document clustering
• information extraction
• question answering
• summarization
• text segmentation
• exam marking
• report generation (possibly multilingual)
• machine translation
• natural language interfaces to databases
• email understanding
• dialogue systems
Some Linguistic Terminology
• Morphology: the structure of words. For instance, unusually can be
thought of as composed of a prefix un-, a stem usual, and an affix -ly.
composed is compose plus the inflectional affix -ed: a spelling rule
means we end up with composed rather than composeed.
• Syntax: the way words are used to form phrases. e.g., it is part of
English syntax that a determiner such as the will come before a noun,
and also that determiners are obligatory with certain singular nouns.
Some Linguistic Terminology
• Semantics. Compositional semantics is the construction of meaning
(generally expressed as logic) based on syntax.
• Pragmatics: meaning in context.
Difficulties
• How fast is the 3G?
• How fast will my 3G arrive?
• Please tell me when I can expect the 3G I ordered.
• Has my order number 4291 been shipped yet?
Spelling and grammar checking
• All spelling checkers can flag words which aren't in a dictionary.
• * The neccesary steps are obvious.
• The necessary steps are obvious.
• If the user can expand the dictionary, or if the language has complex
productive morphology (see §2.1), then a simple list of words isn't
enough to do this and some morphological processing is needed.
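Dictionary lookup plus a little morphological processing can be sketched as follows. The tiny lexicon and suffix list here are invented stand-ins; a real checker would use a full lexicon and proper morphological analysis:

```python
# Illustrative word list and suffixes -- not a real lexicon.
LEXICON = {"the", "necessary", "step", "are", "obvious",
           "dog", "come", "into", "room", "its", "tail", "wag"}
SUFFIXES = ["s", "ed", "ing"]  # naive stand-in for morphological analysis

def known(word):
    """Accept a word if it is in the lexicon directly, or if stripping
    one common suffix yields a lexicon entry."""
    w = word.lower()
    if w in LEXICON:
        return True
    for suffix in SUFFIXES:
        if w.endswith(suffix) and w[: -len(suffix)] in LEXICON:
            return True
    return False

def flag(sentence):
    """Return the words a simple checker would flag."""
    return [w for w in sentence.split() if not known(w.strip(".,!?"))]
```

On the example above, `flag("The neccesary steps are obvious.")` flags only the misspelling: *steps* is accepted via suffix stripping even though only *step* is listed.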
• * The dog came into the room, it's tail wagging.
• The dog came into the room, its tail wagging.
• For instance, possessive its generally has to be followed by a noun or
an adjective noun combination etc.
• But, it sometimes isn't locally clear:
e.g. fair is ambiguous between a noun and an adjective.
• `It's fair', was all Kim said.
• Every village has an annual fair.
• The most elaborate spelling/grammar checkers can get some of these cases
right.
• * The tree's bows were heavy with snow.
• The tree's boughs were heavy with snow.
• Getting this right requires an association between tree and bough.
• Attempts might have been made to hand-code this in terms of general
knowledge of trees and their parts.
• Checking punctuation can be hard
• Students at Cambridge University, who come from less affluent
backgrounds, are being offered up to £1,000 a year under a bursary scheme.
• The sentence implies that most/all students at Cambridge come from less
affluent backgrounds. The following sentence conveys a different meaning.
• Students at Cambridge University who come from less affluent
backgrounds are being offered up to £1,000 a year under a bursary scheme.
Information retrieval
• Information retrieval involves returning a set of documents in
response to a user query: Internet search engines are a form of IR.
• However, one change from classical IR is that Internet search now
uses techniques that rank documents according to how many links
there are to them (e.g., Google's PageRank) as well as the presence of
search terms.
Information extraction
• Information extraction involves trying to discover specific information
from a set of documents.
• The information required can be described as a template.
• For instance, for company joint ventures, the template might have
slots for the companies, the dates, the products, the amount of
money involved. The slot fillers are generally strings.
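A template with string-valued slots can be illustrated with a toy regex-based filler. The pattern, slot names, and sentence below are invented for this sketch; real IE systems use far richer patterns or learned extractors:

```python
import re

# A hypothetical joint-venture template with three named slots.
TEMPLATE = re.compile(
    r"(?P<company1>[A-Z]\w+) and (?P<company2>[A-Z]\w+) announced "
    r"a joint venture worth (?P<amount>\$[\d,]+ \w+)")

def fill_template(text):
    """Return slot fillers as strings, or None if the pattern fails."""
    match = TEMPLATE.search(text)
    return match.groupdict() if match else None
```

For example, a sentence like "Acme and Globex announced a joint venture worth $10 million." fills all three slots with strings, as the slide describes.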
Question answering
• Attempts to find a specific answer to a specific question from a set of
documents, or at least a short piece of text that contains the answer.
• What is the capital of France?
• Paris has been the French capital for many centuries.
• There are some question-answering systems on the Web, but most
use very basic techniques.
Machine translation (MT)
• MT work started in the US in the early fifties, concentrating on
Russian to English. A prototype system was demonstrated in 1954.
• Systran was providing useful translations by the late 60s.
• Until the 80s, the utility of general purpose MT systems was severely
limited by the fact that text was not available in electronic form:
Systran used teams of skilled typists to input Russian documents.
• Systran and similar systems are not a substitute for human
translation:
• They are useful because they allow people to get an idea of what a
document is about, and maybe decide whether it is interesting
enough to get translated properly.
• Spoken language translation is viable for limited domains: research
systems include Verbmobil, SLT and CSTAR.
Natural language interfaces
• Natural language interfaces were the `classic' NLP problem in the 70s and
80s.
• LUNAR is the classic example of a natural language interface to a database
(NLID):
• Its database concerned lunar rock samples brought back from the Apollo
missions. LUNAR is described by Woods (1978).
• It was capable of translating elaborate natural language expressions into
database queries.
SHRDLU
• SHRDLU (Winograd, 1973) was a system capable of participating in a
dialogue about a micro world (the blocks world) and manipulating
this world according to commands issued in English by the user.
• SHRDLU had a big impact on the perception of NLP at the time since it
seemed to show that computers could actually `understand'
language:
• The impossibility of scaling up from the micro world was not realized.
• There have been many advances in NLP since these systems were built:
• systems have become much easier to build, and somewhat easier to use,
but they still haven't become ubiquitous.
• Natural Language interfaces to databases were commercially available in
the late 1970s, but largely died out by the 1990s:
• Porting to new databases and especially to new domains requires very
specialist skills and is essentially too expensive.
NLP application architecture
• The input to an NLP system could be speech or text.
• It could also be gesture (multimodal input or perhaps a Sign
Language).
• The output might be non-linguistic, but most systems need to give
some sort of feedback to the user, even if they are simply performing
some action.
Components
• Input Pre Processing: speech / gesture recogniser or text pre-processor
• Morphological Analysis
• Part-of-Speech Tagging
• Parsing - this includes syntax and compositional semantics
• Disambiguation: this can be done as part of parsing
• Context Module: this maintains information about the context
• Text Planning: the part of language generation / what meaning to convey
• Tactical Generation: converts meaning representations to strings.
• Morphological Generation
• Output Processing: text-to-speech, text formatter, etc.
• The following are required in order to construct the standard
components and knowledge sources:
• Lexicon Acquisition
• Grammar Acquisition
• Acquisition of Statistical Information
Note
• NLP applications need complex knowledge sources.
• Applications cannot be 100% perfect. Humans aren't 100% perfect
anyway.
• Applications that aid humans are much easier to construct than
applications which replace humans.
• Improvements in NLP techniques are generally incremental rather
than revolutionary.
Morphology
• Morphology concerns the structure of words.
• Words are assumed to be made up of morphemes, which are the
minimal information-carrying units.
• Morphemes which can only occur in conjunction with other
morphemes are affixes: words are made up of a stem and zero or
more affixes.
• For instance, dog is a stem which may occur with the plural suffix +s
i.e., dogs.
• English only has suffixes (affixes which come after a stem) and
prefixes (which come before the stem).
• Some languages have infixes (affixes which occur inside the stem) and
circumfixes (affixes which go around a stem).
• Some English irregular verbs show a relic of inflection by infixation
(e.g. sing, sang, sung)
Inflectional Morphology
• Inflectional morphology concerns properties such as
• tense, aspect, number, person, gender, and case,
• although not all languages code all of these:
• English, for instance, has very little morphological marking of case
and gender.
Derivational Morphology
• Derivational affixes, such as un-, re-, anti- etc, have a broader range
of semantic possibilities and don't fit into neat paradigms.
• Antidisestablishmentarianism
• Antidisestablishmentarianismization
• Antiantimissile
Zero Derivation
• tango, waltz etc. are words
• which are basically nouns but can be used as verbs.
• An English verb will have a present tense form, a 3rd person singular
present tense form, a past participle and a passive participle.
• e.g., text as a verb - texts, texted.
• Stems and affixes can be individually ambiguous.
• There is also potential for ambiguity in how a word form is split into
morphemes.
• For instance, unionised could be union -ise -ed
• or (in chemistry) un- ion -ise -ed.
• Note that un- ion is not a possible form (because un- can't attach to a
noun).
• The un- that attaches to verbs nearly always denotes the reversal of a
process (e.g., untie),
• whereas the un- that attaches to adjectives means `not', which is the
meaning in the case of un- ion -ise -ed.
Spelling / orthographic rules
• English morphology is essentially concatenative.
• Some words have irregular morphology and their inflectional forms
simply have to be listed.
• For instance, the suffix -s is pronounced differently when it is added
to a stem which ends in s, x or z and the spelling reflects this with the
addition of an e (boxes etc.).
• The `e-insertion' rule can be described as follows:
• ɛ → e / {s, x, z} ^ _ s
• _ indicates the position in question
• ɛ is used for the empty string and
• ^ for the affix boundary
• Other spelling rules in English include consonant doubling
(e.g., rat, ratted; though note, not *auditted)
• and y/ie conversion (party, parties).
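A minimal sketch of two of these spelling rules, applied when attaching the -s suffix. Only the rules as stated above are implemented (e-insertion after s, x, z; y/ie conversion after a consonant), so forms like *wishes* would need further rules:

```python
def add_s_suffix(stem):
    """Attach the -s suffix, applying two English spelling rules."""
    if stem and stem[-1] in "sxz":
        return stem + "es"          # e-insertion: box -> boxes
    if len(stem) > 1 and stem[-1] == "y" and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"    # y/ie conversion: party -> parties
    return stem + "s"               # default concatenation: dog -> dogs
```

Note that the vowel check matters: *day* becomes *days*, not *daies*, because the y follows a vowel.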
Lexical requirements for morphological
processing
• There are three sorts of lexical information that are needed for full,
high precision morphological processing:
• affixes, plus the associated information conveyed by the affix
• irregular forms, with associated information similar to that for affixes
• stems with syntactic categories
• One approach to an affix lexicon is for it to consist of a pairing of affix
and some encoding of the syntactic/semantic effect of the affix.
• For instance, consider the following fragment of a suffix lexicon
• ed PAST_VERB
• ed PSP_VERB
• s PLURAL_NOUN
• A lexicon of irregular forms is also needed.
• began PAST_VERB begin
• begun PSP_VERB begin
• English irregular forms are the same for all senses of a word. For
instance, ran is the past of run whether we are talking about athletes,
politicians or noses.
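The two lexicon fragments above can be combined into a first-stage analyser that proposes (tag, stem) candidates. This is a sketch: it deliberately over-generates (any word ending in s gets a plural-noun candidate), leaving filtering to the later, syntax-coupled stage:

```python
# Fragments of the suffix and irregular-form lexicons shown above.
SUFFIX_LEXICON = [
    ("ed", "PAST_VERB"),
    ("ed", "PSP_VERB"),
    ("s", "PLURAL_NOUN"),
]
IRREGULARS = {
    "began": [("PAST_VERB", "begin")],
    "begun": [("PSP_VERB", "begin")],
}

def analyse(word):
    """Return (tag, stem) candidates: irregular forms are looked up
    directly; otherwise each matching suffix is stripped."""
    if word in IRREGULARS:
        return IRREGULARS[word]
    candidates = []
    for suffix, tag in SUFFIX_LEXICON:
        if word.endswith(suffix):
            candidates.append((tag, word[: -len(suffix)]))
    return candidates
```

For instance, *walked* yields both the past-tense and past-participle analyses, exactly the ambiguity the -ed entries encode.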
• The morphological analyser is split into two stages.
• The first of these only concerns morpheme forms
• A second stage which is closely coupled to the syntactic analysis
Finite state automata for recognition
• The approach to spelling rules described here involves the use of finite
state transducers (FSTs).
• Suppose we want to recognize dates (just day and month pairs)
written in the format day/month. The day and the month may be
expressed as one or two digits (e.g. 11/2, 1/12 etc).
• This format corresponds to the following simple FSA, where each
character corresponds to one transition:
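The date FSA can be written out as an explicit transition table. State numbering here is one plausible layout (state 0 is the start; states 4 and 5, reached after one or two month digits, are accepting):

```python
def is_digit(ch):
    return ch.isdigit()

def is_slash(ch):
    return ch == "/"

# state -> list of (test, target state); one transition per character.
TRANSITIONS = {
    0: [(is_digit, 1)],                 # first day digit
    1: [(is_digit, 2), (is_slash, 3)],  # optional second day digit
    2: [(is_slash, 3)],
    3: [(is_digit, 4)],                 # first month digit
    4: [(is_digit, 5)],                 # optional second month digit
}
ACCEPTING = {4, 5}

def accepts(s):
    """Run the FSA over s; reject if any character has no transition."""
    state = 0
    for ch in s:
        for test, target in TRANSITIONS.get(state, []):
            if test(ch):
                state = target
                break
        else:
            return False
    return state in ACCEPTING
```

This accepts 11/2 and 1/12 but rejects strings like 123/4 or 1/, matching the one-or-two-digit format described above.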
Prediction and part-of-speech tagging
• Corpora
• A corpus (corpora is the plural) is simply a body of text that has been
collected for some purpose.
• A balanced corpus contains texts which represent different genres
(newspapers, fiction, textbooks, parliamentary reports, cooking
recipes, scientific papers etc.):
• Early examples were the Brown corpus (US English) and the
Lancaster-Oslo-Bergen (LOB) corpus (British English) which are each
about 1 million words: the more recent British National Corpus (BNC)
contains approx. 100 million words and includes 20 million words of
spoken English.
• Implementing an email answering application:
• it is essential to collect samples of representative emails.
• For interface applications in particular, collecting a corpus requires a
simulation of the actual application:
• generally this is done by a Wizard of Oz experiment, where a human
pretends to be a computer.
• Corpora are needed in NLP for two reasons.
• To evaluate algorithms on real language
• Providing the data source for many machine-learning approaches
Prediction
• The essential idea of prediction is, given a sequence of words,
to determine what is most likely to come next.
• Used for language modelling for automatic speech recognition.
• Speech recognisers cannot accurately determine a word from the
sound signal.
• They cannot reliably tell where each word starts and finishes.
• The most probable word is chosen on the basis of the language model
• The language models which are currently most effective work on the
basis of N-grams (a type of Markov chain), where the sequence of the
prior n-1 words is used to predict the next.
• Trigram models use the preceding 2 words,
• bigram models use the preceding word and
• unigram models use no context at all, but simply work on the basis of
individual word probabilities.
• Word prediction is also useful in communication aids:
• Systems for people who can't speak because of some form of
disability.
• People who use text-to-speech systems to talk because of a non-
linguistic disability usually have some form of general motor
impairment which also restricts their ability to type at normal rates
(stroke, ALS, cerebral palsy etc).
• Often they use alternative input devices, such as adapted keyboards, puffer
switches, mouth sticks or eye trackers.
• Generally such users can only construct text at a few words a minute,
which is too slow for anything like normal communication to be possible
(normal speech is around 150 words per minute).
• As a partial aid, a word prediction system is sometimes helpful:
• This gives a list of candidate words that changes as the initial letters are
entered by the user.
• The user chooses the desired word from a menu when it appears.
• The main difficulty with using statistical prediction models in such
applications is in finding enough data: to be useful, the model really has to
be trained on an individual speaker's output.
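The prefix-driven candidate list described above can be sketched with a simple frequency model. As the slide notes, the model should ideally be trained on the individual user's own output; the training text below is just a stand-in:

```python
from collections import Counter

def build_model(text):
    """Word-frequency model from a training text."""
    return Counter(text.lower().split())

def predict(model, prefix, k=3):
    """The k candidate words matching the typed prefix,
    most frequent first (ties broken alphabetically)."""
    matches = [(-count, word) for word, count in model.items()
               if word.startswith(prefix)]
    return [word for _, word in sorted(matches)][:k]
```

As the user types more letters, the prefix lengthens and the candidate menu narrows, which is exactly the interaction the slide describes.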
• Prediction is important in estimation of entropy, including estimations
of the entropy of English.
• The notion of entropy is important in language modelling because it
gives a metric for the difficulty of the prediction problem.
• For instance, speech recognition is much easier in situations where
the speaker is only saying two easily distinguishable words than when
the vocabulary is unlimited: measurements of entropy can quantify
this.
• Other applications for prediction include optical character recognition
(OCR), spelling correction and text segmentation.
Bigrams
• A bigram model assigns a probability to a word based on the previous
word, P(w_n | w_{n-1}), where w_n is the nth word in some string.
• The probability of a string of words P(w_1 ... w_n) is thus approximated
by the product of these conditional probabilities:
• P(w_1 ... w_n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1})
Example
• good morning
• good afternoon
• good afternoon
• it is very good
• it is good
• We'll use the symbol <s> to indicate the start of an utterance, so the
corpus really looks like:
• <s> good morning <s> good afternoon <s> good afternoon <s> it is
very good <s> it is good <s>
• The bigram probabilities are given as
• P(w_n | w_{n-1}) = C(w_{n-1} w_n) / Σ_w C(w_{n-1} w)
• i.e., the count of a particular bigram, normalized by dividing by the total
number of bigrams starting with the same word.
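Computed over the toy corpus above, a minimal sketch of this estimation:

```python
from collections import Counter

corpus = ("<s> good morning <s> good afternoon <s> good afternoon "
          "<s> it is very good <s> it is good <s>")
tokens = corpus.split()

# Count bigrams, and how often each word starts a bigram (the normalizer).
bigram_counts = Counter(zip(tokens, tokens[1:]))
context_counts = Counter(tokens[:-1])

def p(word, prev):
    """P(word | prev) = C(prev word) / sum_w C(prev w)."""
    return bigram_counts[(prev, word)] / context_counts[prev]

print(p("good", "<s>"))        # 3/5 = 0.6
print(p("afternoon", "good"))  # 2/5 = 0.4
```

Three of the five utterances begin with good, and two of the five bigrams starting with good continue with afternoon, matching the two printed probabilities.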
Part of Speech (POS) tagging
• One important application is to part-of-speech tagging (POS tagging), where the words in
a corpus are associated with a tag indicating some syntactic information that applies to
that particular use of the word.
• For instance, consider the example sentence below:
• They can fish.
• This has two readings: one (the most likely) about the ability to fish, and the other about
putting fish in cans. Fish is ambiguous between a singular noun, a plural noun and a verb, while
can is ambiguous between a singular noun, a verb (the `put in cans' use) and a modal verb.
However, they is unambiguously a pronoun. (I am ignoring some less likely possibilities,
such as proper names.)
• These distinctions can be indicated by POS tags:
• they PNP
• can VM0 VVB VVI NN1
• fish NN1 NN2 VVB VVI
• The meaning of the tags above is:
• NN1 singular noun
• NN2 plural noun
• PNP personal pronoun
• VM0 modal auxiliary verb
• VVB base form of verb (except infinitive)
• VVI infinitive form of verb (i.e. occurs with `to')
• A POS tagger resolves the lexical ambiguities to give the most likely
set of tags for the sentence. In this case, the right tagging is likely to
be:
• They PNP can VM0 fish VVB . PUN
• Note the tag for the full stop: punctuation is treated as unambiguous.
POS tagging can be regarded as a form of very basic word sense
disambiguation.
• The other syntactically possible reading is:
• They PNP can VVB fish NN2 . PUN
Stochastic POS tagging
• One form of POS tagging applies the N-gram technique that we saw
above, but in this case it applies to the POS tags rather than the
individual words.
• The most common approaches depend on a small amount of
manually tagged training data from which POS N-grams can be
extracted.
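A brute-force sketch of the idea for they can fish, using the tag set from the earlier slides. All probabilities below are invented for illustration; a real tagger would estimate them from a tagged corpus and use Viterbi search rather than enumerating every tag sequence.

```python
from itertools import product

# Invented lexicon and probabilities, for illustration only.
LEXICON = {"they": ["PNP"], "can": ["VM0", "VVB", "NN1"],
           "fish": ["NN2", "VVB", "VVI", "NN1"]}
P_WORD = {("they", "PNP"): 1.0, ("can", "VM0"): 0.2, ("can", "VVB"): 0.05,
          ("can", "NN1"): 0.01, ("fish", "VVB"): 0.1, ("fish", "NN2"): 0.1,
          ("fish", "VVI"): 0.1, ("fish", "NN1"): 0.05}
P_TAG = {("<s>", "PNP"): 0.5, ("PNP", "VM0"): 0.3, ("PNP", "VVB"): 0.2,
         ("VM0", "VVB"): 0.6, ("VVB", "NN2"): 0.3}  # unseen tag bigrams get 0.01

def score(words, tags):
    """Product of tag-bigram and word-given-tag probabilities."""
    prob, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        prob *= P_TAG.get((prev, t), 0.01) * P_WORD[(w, t)]
        prev = t
    return prob

words = "they can fish".split()
best = max(product(*(LEXICON[w] for w in words)),
           key=lambda ts: score(words, ts))
print(best)  # ('PNP', 'VM0', 'VVB')
```

With these numbers the modal reading wins, matching the preferred tagging shown earlier.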
Generative grammar
• concerned with the structure of phrases
• the big dog slept
• can be bracketed
• ((the (big dog)) slept)
Equivalent
• Two grammars are said to be weakly equivalent if they generate the
same strings.
• Two grammars are strongly equivalent if they assign the same
bracketings to all strings they generate.
• In most, but not all, approaches, the internal structures are given
labels.
• For instance, the big dog is a noun phrase (abbreviated NP),
• slept, slept in the park and licked Sandy are verb phrases (VPs).
• The labels such as NP and VP correspond to non-terminal symbols in
a grammar.
Context free grammars
• A CFG has four components,
1. a set of non-terminal symbols (e.g., S, VP), conventionally written in
uppercase;
2. a set of terminal symbols (i.e., the words), conventionally written in
lowercase;
3. a set of rules (productions), where the left hand side (the mother) is
a single non-terminal and the right hand side is a sequence of one or
more non-terminal or terminal symbols (the daughters);
4. a start symbol, conventionally S, which is a member of the set of
non-terminal symbols.
• they fish
• (S (NP they) (VP (V fish)))
• they can fish
• (S (NP they) (VP (V can) (VP (V fish))))
• ;;; the modal verb `are able to' reading
• (S (NP they) (VP (V can) (NP fish)))
• ;;; the less plausible, put fish in cans, reading
• they fish in rivers
• (S (NP they) (VP (VP (V fish)) (PP (P in) (NP rivers))))
Parse Trees
• Parse trees are equivalent to bracketed structures, but are easier to
read for complex cases.
• A parse tree and bracketed structure for one reading of they can fish
in December, e.g.: (S (NP they) (VP (V can) (VP (VP (V fish)) (PP (P in) (NP December)))))
Chart Parsing
• A chart is a list of edges.
• In the simplest version of chart parsing, each edge records a rule
application and has the following structure:
• [id, left vertex, right vertex, mother category, daughters]
• A vertex is an integer representing a point in the input string, as illustrated
below:
. they . can . fish .
0      1     2      3
• mother category is the category on the left hand side of the rule that
has been applied to create the edge.
• daughters is a list of the edges that acted as the daughters for this
particular rule application.
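A minimal exhaustive bottom-up chart parser along these lines. The toy grammar and lexicon are assumptions, chosen to match the they can fish examples earlier; each edge has the [id, left vertex, right vertex, mother category, daughters] structure described above.

```python
# Toy grammar and lexicon (assumed for illustration).
GRAMMAR = [("S", ("NP", "VP")), ("VP", ("V", "VP")),
           ("VP", ("V", "NP")), ("VP", ("V",))]
LEXICON = {"they": ["NP"], "can": ["V"], "fish": ["V", "NP"]}

def chart_parse(words):
    """Build all edges [id, left, right, mother, daughters] bottom-up."""
    edges = []
    def add(left, right, mother, daughters):
        edges.append([len(edges), left, right, mother, daughters])
    for i, w in enumerate(words):            # lexical edges, one per tag
        for cat in LEXICON[w]:
            add(i, i + 1, cat, [w])
    changed = True
    while changed:                           # apply rules until no new edge appears
        changed = False
        for mother, rhs in GRAMMAR:
            if len(rhs) == 1:
                combos = [[e] for e in edges if e[3] == rhs[0]]
            else:
                combos = [[a, b] for a in edges for b in edges
                          if a[3] == rhs[0] and b[3] == rhs[1] and a[2] == b[1]]
            for ds in combos:
                cand = [ds[0][1], ds[-1][2], mother, [d[0] for d in ds]]
                if not any(e[1:] == cand for e in edges):
                    add(*cand)
                    changed = True
    return edges

edges = chart_parse("they can fish".split())
spanning = [e for e in edges if e[3] == "S" and e[1] == 0 and e[2] == 3]
print(len(spanning))  # 2 -- the two readings of "they can fish"
```

The two spanning S edges differ in their VP daughters: one builds VP from V plus VP (the modal-like reading), the other from V plus NP (the canning reading).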
Lexical Semantics
1. Meaning postulates
2. Classical lexical relations: hyponymy, meronymy, synonymy, antonymy
3. Taxonomies and WordNet
4. Classes of polysemy: homonymy, regular polysemy, vagueness
5. Word sense disambiguation
Meaning postulates
• Inference rules can be used to relate open class predicates.
• bachelor(x) -> man(x) ^ unmarried(x)
• But how would we define table, tomato or thought?
Hyponymy
• Hyponymy is the classical IS-A relation
• e.g. dog is a hyponym of animal
• That is, the relevant sense of dog is the hyponym of animal
• animal is the hypernym of dog
• Classes of words can be categorized by hyponymy
• Some nouns - biological taxonomies, human artefacts, professions etc
• Abstract nouns, such as truth
• Some verbs can be treated as hyponyms of one another, e.g.
murder is a hyponym of kill
• Hyponymy is essentially useless for adjectives.
• coin a hyponym of money?
• coin might be metal and also money - multiple parents might be
possible
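A toy is-a hierarchy (the entries are invented) shows how hyponymy supports transitive inference. For simplicity each word gets a single hypernym link; as the coin example above suggests, a real taxonomy may need multiple parents.

```python
# Hypothetical single-parent hypernym links, for illustration only.
HYPERNYM = {"dog": "animal", "cat": "animal", "animal": "entity",
            "murder": "kill", "coin": "money"}

def is_hyponym_of(word, ancestor):
    """Follow hypernym links upward; hyponymy is transitive."""
    while word in HYPERNYM:
        word = HYPERNYM[word]
        if word == ancestor:
            return True
    return False

print(is_hyponym_of("dog", "animal"))  # True
print(is_hyponym_of("dog", "entity"))  # True (via animal)
print(is_hyponym_of("coin", "kill"))   # False
```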
Meronymy
• The standard examples of meronymy apply to physical relationships
• arm is part of a body (arm is a meronym of body);
• steering wheel is a meronym of car
• a soldier is part of an army
Synonymy and Antonymy
• True synonyms are relatively uncommon
• thin, slim, slender, skinny
• Antonymy is mostly discussed with respect to adjectives
• e.g., big/little, like/dislike
WordNet
• WordNet is the main resource for lexical semantics for English that is
used in NLP, primarily because of its very large coverage and the fact
that it's freely available.
• The primary organisation of WordNet is into synsets: synonym sets
(near-synonyms). To illustrate this, the following is part of what
WordNet returns as an `overview' of red:
• wn red -over
• Overview of adj red
• The adj red has 6 senses (first 5 from tagged texts)
• 1. (43) red, reddish, ruddy, blood-red, carmine, cerise, cherry, cherry-
red, crimson, ruby, ruby-red, scarlet -- (having any of numerous bright
or strong colors reminiscent of the color of blood or cherries or
tomatoes or rubies)
• 2. (8) red, reddish -- ((used of hair or fur) of a reddish brown color;
"red deer"; "reddish hair")
• Nouns in WordNet are organized by hyponymy
• Sense 6
• big cat, cat
•   => leopard, Panthera pardus
•      => leopardess
•      => panther
•   => snow leopard, ounce, Panthera uncia
•   => jaguar, panther, Panthera onca, Felis onca
•   => lion, king of beasts, Panthera leo
•      => lioness
•      => lionet
•   => tiger, Panthera tigris
•      => Bengal tiger
•      => tigress
•   => liger
•   => tiglon, tigon
•   => cheetah, chetah, Acinonyx jubatus
•   => saber-toothed tiger, sabertooth
•      => Smiledon californicus
•      => false saber-toothed tiger
Polysemy
• The standard example of polysemy is bank (river bank) vs bank
(financial institution).
• teacher is intuitively vague between male and female teachers
• Dictionaries are not much help
Word sense disambiguation
• Word sense disambiguation (WSD) is needed for most NL applications
that involve semantics.
• What's changed since the 1980s is that various statistical or machine-
learning techniques have been used to avoid hand-crafting rules.
• supervised learning
• unsupervised learning
• Machine readable dictionaries (MRDs)
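A minimal Lesk-style sketch of MRD-based WSD (the glosses below are hand-written stand-ins, not entries from a real dictionary): pick the sense whose gloss shares the most words with the context.

```python
# Hand-written toy glosses standing in for machine-readable dictionary entries.
SENSES = {
    "bank/river": "sloping land beside a river or lake",
    "bank/finance": "an institution that accepts deposits and lends money",
}

def disambiguate(context, senses):
    """Simplified Lesk: choose the sense with maximal gloss/context word overlap."""
    ctx = set(context.lower().split())
    return max(senses, key=lambda s: len(ctx & set(senses[s].split())))

print(disambiguate("she sat on the bank of the river", SENSES))  # bank/river
```

Supervised and unsupervised learners replace the hand-written glosses with evidence gathered from (possibly labelled) corpora, but the core move is the same: score each sense against the context.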
Discourse
• Utterances are always understood in a particular context. Context-
dependent situations include:
1. Referring expressions: pronouns, definite expressions etc.
2. Universe of discourse: every dog barked, doesn't mean every dog in
the world but only every dog in some explicit or implicit contextual set.
3. Responses to questions, etc: only make sense in a context:
Who came to the party? Not Sandy.
4. Implicit relationships between events: Max fell. John pushed him.
Rhetorical relations and coherence
• Consider the following discourse: Max fell. John pushed him.
• This discourse can be interpreted in at least two ways:
1. Max fell because John pushed him.
2. Max fell and then John pushed him.
• In 1 the link is a form of explanation, but 2 is an example of narration.
• because and and then are said to be cue phrases.
Coherence
• Discourses have to have connectivity to be coherent:
• Kim got into her car. Sandy likes apples.
• Both of these sentences make perfect sense in isolation, but taken
together they are incoherent.
• Adding context can restore coherence:
• Kim got into her car. Sandy likes apples, so Kim thought she'd go to
the farm shop and see if she could get some.
• The second sentence can be interpreted as an explanation of the first.
• For example, consider a system that reports share prices. This might
generate:
• In trading yesterday: Dell was up 4.2%, Safeway was down 3.2%,
Compaq was up 3.1%.
• This is much less acceptable than a connected discourse:
• Computer manufacturers gained in trading yesterday: Dell was up
4.2% and Compaq was up 3.1%. But retail stocks suffered: Safeway
was down 3.2%.
• Here but indicates a Contrast.
Factors influencing discourse interpretation
• Cue phrases. These are sometimes unambiguous, but usually not (e.g.,
and).
• Punctuation (also prosody) and text structure.
• Max fell (John pushed him) and Kim laughed.
• Max fell, John pushed him and Kim laughed.
• Real world content:
• Max fell. John pushed him as he lay on the ground.
• Tense and aspect.
• Max fell. John had pushed him.
• Max was falling. John pushed him.
Referring expressions
• referent: a real-world entity that some piece of text (or speech) refers
to.
• referring expressions: bits of language used to perform reference by a
speaker.
• antecedent: the text evoking a referent.
• anaphora: the phenomenon of referring to an antecedent.
Pronoun agreement
• Pronouns generally have to agree in number and gender with their
antecedents.
• A little girl is at the door - see what she wants, please?
• My dog has hurt his foot - he is in a lot of pain.
Reflexives
• John_i cut himself_i shaving.
(himself = John; the subscript notation is used to indicate this)
• Reflexive pronouns must be coreferential with a preceding argument
of the same verb.
Pleonastic pronouns
• Pleonastic pronouns are semantically empty, and don't refer:
• It is snowing.
• It is not easy to think of good examples.
• It is obvious that Kim snores.
Salience
• There are a number of effects which cause particular pronoun
referents to be preferred.
• Recency More recent referents are preferred
• Kim has a fast car. Sandy has an even faster one. Lee likes to drive it.
• it preferentially refers to Sandy's car, rather than Kim's.
• Grammatical role Subjects > objects > everything else:
• Fred went to the Grafton Centre with Bill. He bought a CD.
• he is more likely to be interpreted as Fred than as Bill.
• Repeated mention Entities that have been mentioned more
frequently are preferred.
• Fred was getting bored. He decided to go shopping. Bill went to the Grafton
Centre with Fred. He bought a CD.
• He=Fred (maybe) despite the general preference for subjects
• Parallelism Entities which share the same role as the pronoun in the
same sort of sentence are preferred
• Bill went with Fred to the Grafton Centre. Kim went with him to Lion Yard.
• Him=Fred, because the parallel interpretation is preferred.
• Coherence effects The pronoun resolution may depend on the
rhetorical/discourse relation that is inferred.
• Bill likes Fred. He has a great sense of humour.
• He = Fred preferentially, possibly because the second sentence is
interpreted as an explanation of the first, and having a sense of
humour is seen as a reason to like someone.
Applications
• Machine Translation
• Spoken Dialogue Systems
• Machine translation (MT)
• Direct transfer: map between morphologically analysed structures.
• Syntactic transfer: map between syntactically analysed structures.
• Semantic transfer: map between semantic structures.
• Interlingua: construct a language-neutral representation from parsing and use
this for generation
MT using semantic transfer
Semantic transfer is an approach to MT which involves:
1. Parsing a source language string to produce a meaning
representation.
2. Transforming that representation to one appropriate for the target
language.
3. Generating target-language text from the transformed representation.
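The three steps can be caricatured with flat predicate lists. Everything here (the representation format, the German example, the transfer rules) is invented for illustration; steps 1 and 3 would be performed by a real parser and generator.

```python
# Step 1 (assumed done): parsing "der Hund schläft" yields a source
# meaning representation as a list of predicate tuples.
source_sem = [("schlafen", "e"), ("hund", "x"), ("arg1", "e", "x"), ("def", "x")]

# Step 2: transfer -- map source-language predicates onto target-language ones,
# leaving structural predicates (arg1, def) untouched.
TRANSFER = {"schlafen": "sleep", "hund": "dog"}

def transfer(sem):
    return [(TRANSFER.get(pred, pred), *args) for pred, *args in sem]

target_sem = transfer(source_sem)
print(target_sem)
# Step 3 (assumed): a generator realizes target_sem as "the dog sleeps".
```

The appeal of transfer at the semantic level is visible even in this caricature: the mapping is mostly a relabelling, because the structural predicates are shared between the two representations.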
Dialogue Systems
Turn-taking: generally there are points where a speaker invites someone else
to take a turn (possibly choosing a specific person), explicitly (e.g., by asking a
question) or otherwise.
Pauses: pauses between turns are generally very short (a few hundred
milliseconds, but highly culture specific).
• Longer pauses are assumed to be meaningful: example from Levinson
(1983: 300)
A: Is there something bothering you or not? (1.0 sec pause)
A: Yes or no? (1.5 sec pause)
A: Eh?
B: No.
• Overlap: Utterances can overlap (the acceptability of this is
dialect/culture specific), and unfortunately humans also tend to interrupt
automated systems; this is known as barge-in.
• Backchannel: Utterances like Uh-huh, OK can occur during another
speaker's utterance as a sign that the hearer is paying attention.
• Attention: The speaker needs reassurance that the hearer is
understanding/paying attention. Often eye contact is enough, but this
is problematic with telephone conversations, dark sunglasses, etc.
Dialogue systems should give explicit feedback.
• Cooperativity: Because participants assume the others are
cooperative, we get effects such as indirect answers to questions.
• When do you want to leave?
• My meeting starts at 3pm.
1. Single initiative systems (also known as system initiative systems):
system controls what happens when.
• System: Which station do you want to leave from?
• User: King's Cross
2. Mixed initiative dialogue. Both participants can control the dialogue
to some extent.
• System: Which station do you want to leave from?
• User: I don't know, tell me which station I need for Cambridge.
3. Dialogue tracking
Email response
• The LinGO ERG constraint-based grammar has been used for parsing in
commercially-deployed email response systems.
Well, I am going on vacation for the next two weeks, so the first day that I
would be able to meet would be the eighteenth.
If you give me your name and your address we will send you the ticket.
The morning is good, but nine o'clock might be a little too late, as I have a
seminar at ten o'clock.
References
• James Allen, Natural Language Understanding, Second Edition,
Benjamin-Cummings Publishing Co., Inc. USA, 1995.
• Eugene Charniak, Statistical Language Learning, MIT Press, USA,
1996.
• Daniel Jurafsky and James H. Martin, Speech and Language
Processing, Second Edition, Prentice Hall, 2008.
• Christopher D. Manning and Hinrich Schütze, Foundations of
Statistical Natural Language Processing, MIT Press, 1999.
Questions?
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17Celine George
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfTechSoup
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfJemuel Francisco
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatYousafMalik24
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parentsnavabharathschool99
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designMIPLM
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxAnupkumar Sharma
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...JhezDiaz1
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomnelietumpap1
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management systemChristalin Nelson
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Mark Reed
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...Nguyen Thanh Tu Collection
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...Postal Advocate Inc.
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Seán Kennedy
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Celine George
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4MiaBumagat1
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONHumphrey A Beña
 

Kürzlich hochgeladen (20)

ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITYISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
ISYU TUNGKOL SA SEKSWLADIDA (ISSUE ABOUT SEXUALITY
 
How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17How to Add Barcode on PDF Report in Odoo 17
How to Add Barcode on PDF Report in Odoo 17
 
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdfInclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
Inclusivity Essentials_ Creating Accessible Websites for Nonprofits .pdf
 
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdfGrade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
Grade 9 Quarter 4 Dll Grade 9 Quarter 4 DLL.pdf
 
Earth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice greatEarth Day Presentation wow hello nice great
Earth Day Presentation wow hello nice great
 
Choosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for ParentsChoosing the Right CBSE School A Comprehensive Guide for Parents
Choosing the Right CBSE School A Comprehensive Guide for Parents
 
Keynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-designKeynote by Prof. Wurzer at Nordex about IP-design
Keynote by Prof. Wurzer at Nordex about IP-design
 
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptxMULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
MULTIDISCIPLINRY NATURE OF THE ENVIRONMENTAL STUDIES.pptx
 
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
ENGLISH 7_Q4_LESSON 2_ Employing a Variety of Strategies for Effective Interp...
 
ENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choomENGLISH6-Q4-W3.pptxqurter our high choom
ENGLISH6-Q4-W3.pptxqurter our high choom
 
Concurrency Control in Database Management system
Concurrency Control in Database Management systemConcurrency Control in Database Management system
Concurrency Control in Database Management system
 
Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)Influencing policy (training slides from Fast Track Impact)
Influencing policy (training slides from Fast Track Impact)
 
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
HỌC TỐT TIẾNG ANH 11 THEO CHƯƠNG TRÌNH GLOBAL SUCCESS ĐÁP ÁN CHI TIẾT - CẢ NĂ...
 
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
USPS® Forced Meter Migration - How to Know if Your Postage Meter Will Soon be...
 
Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...Student Profile Sample - We help schools to connect the data they have, with ...
Student Profile Sample - We help schools to connect the data they have, with ...
 
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
Incoming and Outgoing Shipments in 3 STEPS Using Odoo 17
 
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptxFINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
FINALS_OF_LEFT_ON_C'N_EL_DORADO_2024.pptx
 
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptxYOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
YOUVE_GOT_EMAIL_PRELIMS_EL_DORADO_2024.pptx
 
ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4ANG SEKTOR NG agrikultura.pptx QUARTER 4
ANG SEKTOR NG agrikultura.pptx QUARTER 4
 
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATIONTHEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
THEORIES OF ORGANIZATION-PUBLIC ADMINISTRATION
 

Natural language processing

  • 1. Natural Language Processing CC-9016 Dr. C. Saravanan, Ph.D., NIT Durgapur dr.cs1973@gmail.com Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 2. Introduction • What is NLP? • Natural language processing (NLP) can be defined as the automatic (or semi-automatic) processing of human language. • Nowadays, alternative terms are often preferred, like 'Language Technology' or 'Language Engineering'. • Language is often used in contrast with speech (e.g., Speech and Language Technology).
  • 3. Introduction cont… • NLP is essentially multidisciplinary: it is closely related to linguistics. • It also has links to research in cognitive science, psychology, philosophy and maths (especially logic). • Within Computer Science, it relates to formal language theory, compiler techniques, theorem proving, machine learning and human-computer interaction. • Of course it is also related to AI, though nowadays it's not generally thought of as part of AI.
  • 4. Introduction cont… • The HAL 9000 computer in Stanley Kubrick’s film 2001: A Space Odyssey is one of the most recognizable characters in twentieth-century cinema. • HAL is an artificial agent capable of speaking and understanding English, and even reading lips. • HAL is able to do information retrieval (finding out where needed textual resources reside), information extraction (extracting pertinent facts from those textual resources), and inference (drawing conclusions based on known facts).
  • 5. Space Odyssey
  • 6. Introduction cont… Other valuable areas of language processing include spelling correction, grammar checking, information retrieval, and machine translation. Example: Is the door open? The door is open.
  • 7. Online Applications • Dictation - Online Speech Recognition https://dictation.io/ • SpeechPad - a new voice notebook https://speechpad.pw/ • Speechlogger - https://speechlogger.appspot.com/en/ • Talktyper https://talktyper.com/ (microphone and speaker are required)
  • 8. Knowledge in Language Processing • Requires knowledge about phonetics and phonology, which can help model how words are pronounced in colloquial speech • they're: they + are. They're from Canada. • their: something belongs to "them." This is their car. • there: in that place. The park is over there. • two: the number 2. I'll have two hamburgers, please. • to: this has many meanings. One means "in the direction of." I'm going to South America. • too: also. I want to go, too.
  • 9. Knowledge in Language Processing cont… • Requires knowledge about morphology, which captures information about the shape and behavior of words in context. • Singular and Plural: Man → Men, Boy → Boys • The knowledge needed to order and group words together comes under the heading of syntax.
  • 10. Knowledge in Language Processing cont… • Requires knowledge of the meanings of the component words, the domain of lexical semantics, • and knowledge of how these components combine to form larger meanings, compositional semantics. • No • No, I won’t open the door. • I’m sorry and I’m afraid, I can’t. (indirect refusal)
  • 11. Knowledge in Language Processing cont… • The knowledge of language needed to engage in complex language behavior can be separated into six distinct categories. • Phonetics and Phonology – The study of linguistic sounds. • Morphology – The study of the meaningful components of words. • Syntax – The study of the structural relationships between words. • Semantics – The study of meaning. • Pragmatics – The study of how language is used to accomplish goals. • Discourse – The study of linguistic units larger than a single utterance.
  • 12. AMBIGUITY • Some input is ambiguous if there are multiple alternative linguistic structures that can be built for it. • Look at the dog with one eye. • Look at the dog using only one of your eyes. • Look at the dog that only has one eye. • The word with is semantically ambiguous in this sentence.
  • 13. AMBIGUITY • In some cases, for example where the word make can mean ‘create’ or ‘cook’, the ambiguity • can be resolved by word sense disambiguation or • by speech act interpretation. • Example: I make a cake.
  • 14. Models and Algorithms • The most important elements are • state machines, • formal rule systems, • logic, • probability theory • Well-known computational algorithms include • dynamic programming • state space search
  • 15. State Machines • State machines are formal models that consist of states, transitions among states, and an input representation. • Variations of this basic model include deterministic and non-deterministic finite-state automata and finite-state transducers, • which can write to an output device; weighted automata; and Markov models and hidden Markov models, which have a probabilistic component.
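A deterministic finite-state automaton can be written down as a small transition table. The "sheep talk" language baa+! below is a common textbook example, assumed here for illustration since the slide gives no concrete machine:

```python
# A minimal deterministic finite-state automaton as a transition table.
# It accepts the toy language baa+! ("baa!", "baaa!", ...).
SHEEP_FSA = {
    "start": 0,
    "accept": {4},
    # (current state, input symbol) -> next state
    "delta": {
        (0, "b"): 1,
        (1, "a"): 2,
        (2, "a"): 3,
        (3, "a"): 3,   # loop: one or more extra a's
        (3, "!"): 4,
    },
}

def accepts(fsa, string):
    """Run the DFA over the string; True iff it ends in an accept state."""
    state = fsa["start"]
    for symbol in string:
        key = (state, symbol)
        if key not in fsa["delta"]:
            return False          # no transition defined: reject
        state = fsa["delta"][key]
    return state in fsa["accept"]
```

The non-deterministic and weighted variants mentioned on the slide generalise exactly this table: several next states per key, or a probability attached to each transition.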
  • 16. Formal Rule Systems • Declarative counterparts to state machines; the most important ones are • regular grammars • regular relations • context-free grammars • feature-augmented grammars • probabilistic variants of them all. • The algorithms that are often used include depth-first search, best-first search, and A* search.
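A context-free grammar can be tried out with a naive top-down recogniser, which is itself a depth-first search through the space of derivations. The grammar fragment below is a standard textbook toy, assumed here for illustration:

```python
# A toy context-free grammar: nonterminals map to lists of productions;
# anything not in the table is a terminal (a word).
GRAMMAR = {
    "S":   [["NP", "VP"]],
    "NP":  [["Det", "N"]],
    "VP":  [["V", "NP"], ["V"]],
    "Det": [["the"], ["a"]],
    "N":   [["dog"], ["door"]],
    "V":   [["opened"], ["barked"]],
}

def parse(symbol, words):
    """Yield each possible remainder of `words` after deriving `symbol`."""
    if symbol not in GRAMMAR:                 # terminal: must match next word
        if words and words[0] == symbol:
            yield words[1:]
        return
    for production in GRAMMAR[symbol]:        # try each rule in turn
        remainders = [words]
        for rhs_symbol in production:
            remainders = [rest
                          for r in remainders
                          for rest in parse(rhs_symbol, r)]
        yield from remainders

def recognise(sentence):
    """True iff some derivation of S consumes the whole sentence."""
    return any(rest == [] for rest in parse("S", sentence.split()))
```

A probabilistic variant would attach a probability to each production and multiply them along a derivation, which is the bridge to the probability-theory slide that follows.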
  • 17. Logic and Probability Theory • First-order logic, also known as the predicate calculus, as well as related formalisms such as feature structures, semantic networks, and conceptual dependency. • Probability theory helps solve the many kinds of ambiguity problems: • given N choices for some ambiguous input, choose the most probable one.
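The "choose the most probable one" rule is just an argmax over the candidate analyses. The probabilities attached to the two readings of the earlier dog example are invented for illustration:

```python
def most_probable(candidates):
    """Given (analysis, probability) pairs for an ambiguous input,
    return the analysis with the highest probability."""
    return max(candidates, key=lambda pair: pair[1])[0]

# Hypothetical probabilities for two readings of
# "Look at the dog with one eye" (the numbers are made up):
readings = [
    ("look-using-one-eye", 0.3),
    ("dog-has-one-eye", 0.7),
]
```

In a real system those probabilities would come from a trained model (e.g. a probabilistic grammar), not from hand-assigned numbers.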
  • 18. Language, Thought, and Understanding • Alan Turing (1950) introduced the Turing Test. • In Turing’s game, there are three participants: two people and a computer. • One of the people is a contestant and plays the role of an interrogator. • To win, the interrogator must determine which of the other two participants is the machine by asking a series of questions via a teletype. • The task of the machine is to fool the interrogator into believing it is a person by responding as a person would to the interrogator’s questions. • The task of the second human participant is to convince the interrogator that the other participant is the machine, and that they are human.
  • 19. ELIZA • ELIZA (1966) was an early natural language processing system that made use of pattern matching and was capable of carrying on a limited form of conversation with a user. • User1: You are like my father in some ways. • ELIZA1: WHAT RESEMBLANCE DO YOU SEE • User2: You are not very aggressive but I think you don’t want me to notice that. • ELIZA2: WHAT MAKES YOU THINK I AM NOT AGGRESSIVE • User3: You don’t argue with me. • ELIZA3: WHY DO YOU THINK I DON’T ARGUE WITH YOU • User4: You are afraid of me. • ELIZA4: DOES IT PLEASE YOU TO BELIEVE I AM AFRAID OF YOU
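ELIZA's behaviour in the dialogue above can be sketched with a few regular-expression rules. The three rules below are illustrative inventions in the spirit of Weizenbaum's system, not his original script:

```python
import re

# Pattern -> response template; \1 echoes back the captured text.
RULES = [
    (re.compile(r"you are (.*)", re.I), r"WHAT MAKES YOU THINK I AM \1"),
    (re.compile(r"you (.*) me", re.I),  r"WHY DO YOU THINK I \1 YOU"),
    (re.compile(r"i am (.*)", re.I),    r"HOW LONG HAVE YOU BEEN \1"),
]

def eliza_reply(utterance):
    """Answer with the first rule whose pattern matches the input."""
    utterance = utterance.rstrip(".!?")
    for pattern, template in RULES:
        match = pattern.search(utterance)
        if match:
            return match.expand(template)
    return "TELL ME MORE"  # default when no pattern applies
```

The point of the example is how shallow the mechanism is: there is no understanding, only string rearrangement, which is why ELIZA-style systems fool users only within a very limited form of conversation.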
  • 20. HISTORY 1940’s and 1950’s • Shannon (1948) contributed to automata theory by applying probabilistic models of discrete Markov processes to automata for language. • Chomsky (1956) first considered finite-state machines as a way to characterize a grammar, and defined a finite-state language as a language generated by a finite-state grammar. • These early models led to the field of formal language theory, which used algebra and set theory to define formal languages as sequences of symbols. • This includes the context-free grammar, first defined by Chomsky (1956) for natural languages but independently discovered by Backus (1959) and Naur et al. (1960) in their descriptions of the ALGOL programming language. • The sound spectrograph was developed (Koenig et al., 1946).
  • 21. 1957–1970 • Speech and Language Processing had split very cleanly into two paradigms: symbolic and stochastic. • The first was the work of Chomsky and others on formal language theory and generative syntax throughout the late 1950’s and early to mid 1960’s, and the work of many linguists and computer scientists on parsing algorithms, initially top-down and bottom-up, and then via dynamic programming. • One of the earliest complete parsing systems was Zelig Harris’s Transformations and Discourse Analysis Project (TDAP), which was implemented between June 1958 and July 1959 at the University of Pennsylvania (Harris, 1962).
  • 22. 1957–1970 cont… • The second line of research was the new field of artificial intelligence. • In 1956, John McCarthy, Marvin Minsky, Claude Shannon, and Nathaniel Rochester organized a two-month workshop on artificial intelligence. • The major focus of the new field was the work on reasoning and logic typified by Newell and Simon’s work on the Logic Theorist and the General Problem Solver. • These were simple systems which worked in single domains mainly by a combination of pattern matching and key-word search with simple heuristics for reasoning and question-answering.
  • 23. 1957–1970 cont… • Bledsoe and Browning (1959) built a Bayesian system for text-recognition (optical character recognition) that used a large dictionary and computed the likelihood of each observed letter sequence given each word in the dictionary by multiplying the likelihoods for each letter. • Mosteller and Wallace (1964) applied Bayesian methods to the problem of authorship attribution on The Federalist papers. • The Brown corpus of American English, a 1 million word collection of samples from 500 written texts from different genres (newspaper, novels, non-fiction, academic, etc.), was assembled in 1963–64 (Kucera and Francis, 1967; Francis, 1979; Francis and Kucera, 1982), and William S. Y. Wang’s 1967 DOC (Dictionary on Computer) was an on-line Chinese dialect dictionary.
  • 24. 1970–1983 • Use of the Hidden Markov Model and the metaphors of the noisy channel and decoding, developed independently by Jelinek, Bahl, Mercer, and colleagues at IBM’s Thomas J. Watson Research Center, and Baker at Carnegie Mellon University. • The logic-based paradigm was begun by the work of Colmerauer and his colleagues on Q-systems and metamorphosis grammars (Colmerauer, 1970, 1975), the forerunners of Prolog and Definite Clause Grammars (Pereira and Warren, 1980). • Independently, Kay’s (1979) work on functional grammar, and shortly later, Bresnan and Kaplan’s (1982) work on LFG, established the importance of feature structure unification.
  • 25. 1970–1983 cont… • Terry Winograd’s SHRDLU system simulated a robot embedded in a world of toy blocks (Winograd, 1972a). The program was able to accept natural language text commands (Move the red block on top of the smaller green one) of a hitherto unseen complexity and sophistication. • Roger Schank and his colleagues and students built a series of language understanding programs that focused on human conceptual knowledge such as scripts, plans and goals, and human memory organization (Schank and Abelson, 1977; Schank and Riesbeck, 1981; Cullingford, 1981; Wilensky, 1983; Lehnert, 1977).
  • 26. 1970–1983 cont… • The discourse modeling paradigm focused on four key areas in discourse. Grosz and her colleagues proposed ideas of discourse structure and discourse focus (Grosz, 1977a; Sidner, 1983a), a number of researchers began to work on automatic reference resolution (Hobbs, 1978a), and the BDI (Belief-Desire-Intention) framework for logic-based work on speech acts was developed (Perrault and Allen, 1980; Cohen and Perrault, 1979).
  • 27. 1983–1993 • Finite-state models began to receive attention again after work on finite-state phonology and morphology by Kaplan and Kay (1981) and finite-state models of syntax by Church (1980). • Another trend was the rise of probabilistic models throughout speech and language processing, influenced strongly by the work at the IBM Thomas J. Watson Research Center on probabilistic models of speech recognition.
  • 28. 1994–1999 • First, probabilistic and data-driven models had become quite standard throughout natural language processing. • Algorithms for parsing, part-of-speech tagging, reference resolution, and discourse processing all began to incorporate probabilities, and employ evaluation methodologies borrowed from speech recognition and information retrieval. • Second, the increases in the speed and memory of computers had allowed commercial exploitation of a number of subareas of speech and language processing, in particular speech recognition and spelling and grammar checking. • Finally, the rise of the Web emphasized the need for language-based information retrieval and information extraction.
  • 29. NLP applications • spelling and grammar checking • optical character recognition (OCR) • screen readers for blind and partially sighted users / augmentative and alternative communication • machine-aided translation • lexicographers' tools • information retrieval • document classification (filtering, routing) • document clustering • information extraction • question answering • summarization • text segmentation • exam marking • report generation (possibly multilingual) • machine translation • natural language interfaces to databases • email understanding • dialogue systems
  • 30. Some Linguistic Terminology • Morphology: the structure of words. For instance, unusually can be thought of as composed of a prefix un-, a stem usual, and an affix -ly. composed is compose plus the inflectional affix -ed: a spelling rule means we end up with composed rather than composeed. • Syntax: the way words are used to form phrases. e.g., it is part of English syntax that a determiner such as the will come before a noun, and also that determiners are obligatory with certain singular nouns.
  • 31. Some Linguistic Terminology • Semantics. Compositional semantics is the construction of meaning (generally expressed as logic) based on syntax. • Pragmatics: meaning in context.
  • 32. Difficulties • How fast is the 3G? • How fast will my 3G arrive? • Please tell me when I can expect the 3G I ordered. • Has my order number 4291 been shipped yet?
  • 33. Spelling and grammar checking • All spelling checkers can flag words which aren't in a dictionary. • * The neccesary steps are obvious. • The necessary steps are obvious. • If the user can expand the dictionary, or if the language has complex productive morphology (see §2.1), then a simple list of words isn't enough to do this and some morphological processing is needed.
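Flagging a word that is not in the dictionary and proposing near misses can be sketched with the classic dynamic-programming edit distance (an instance of the dynamic programming mentioned under Models and Algorithms). The tiny dictionary is an assumption for the example:

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b, by dynamic programming."""
    prev = list(range(len(b) + 1))        # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                   # deletion
                            curr[j - 1] + 1,               # insertion
                            prev[j - 1] + (ca != cb)))     # substitution
        prev = curr
    return prev[-1]

# Toy word list; a real checker would use a full lexicon plus morphology.
DICTIONARY = {"necessary", "the", "steps", "are", "obvious"}

def suggest(word, max_dist=2):
    """Return dictionary words within max_dist edits, nearest first
    (empty list if the word is already in the dictionary)."""
    if word in DICTIONARY:
        return []
    ranked = sorted(DICTIONARY, key=lambda w: edit_distance(word, w))
    return [w for w in ranked if edit_distance(word, w) <= max_dist]
```

For the starred example above, `suggest("neccesary")` proposes `necessary`, which is two edits away.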
  • 34. • * The dog came into the room, it's tail wagging. • The dog came into the room, its tail wagging. • For instance, possessive its generally has to be followed by a noun or an adjective-noun combination, etc.
  • 35. • But it sometimes isn't locally clear: e.g., fair is ambiguous between a noun and an adjective. • 'It's fair', was all Kim said. • Every village has an annual fair.
  • 36. • The most elaborate spelling/grammar checkers can get some of these cases right. • * The tree's bows were heavy with snow. • The tree's boughs were heavy with snow. • Getting this right requires an association between tree and bough. • Attempts might have been made to hand-code this in terms of general knowledge of trees and their parts.
  • 37. • Checking punctuation can be hard • Students at Cambridge University, who come from less affluent backgrounds, are being offered up to £1,000 a year under a bursary scheme. • The sentence implies that most/all students at Cambridge come from less affluent backgrounds. The following sentence, without the commas, has a different meaning. • Students at Cambridge University who come from less affluent backgrounds are being offered up to £1,000 a year under a bursary scheme.
  • 38. Information retrieval • Information retrieval involves returning a set of documents in response to a user query: Internet search engines are a form of IR. • However, one change from classical IR is that Internet search now uses techniques that rank documents according to how many links there are to them (e.g., Google's PageRank) as well as the presence of search terms.
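The "presence of search terms" side of IR can be sketched with classical tf-idf scoring (this is term-based ranking, not PageRank's link analysis). The three-document collection is a toy assumption:

```python
import math
from collections import Counter

# A toy document collection, invented for illustration.
DOCS = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "paris is the capital of france",
}

def tfidf_rank(query):
    """Rank document ids by the summed tf-idf of the query terms."""
    tokenised = {d: text.split() for d, text in DOCS.items()}
    n = len(DOCS)
    scores = {}
    for doc, tokens in tokenised.items():
        tf = Counter(tokens)                      # term frequency in this doc
        score = 0.0
        for term in query.split():
            df = sum(1 for t in tokenised.values() if term in t)
            if df:                                # idf = log(N / df)
                score += tf[term] * math.log(n / df)
        scores[doc] = score
    return sorted(scores, key=scores.get, reverse=True)
```

Note how a term like "the", which occurs in every document, gets idf = log(1) = 0 and so contributes nothing, which is exactly the behaviour a stop word should have.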
  • 39. Information extraction • Information extraction involves trying to discover specific information from a set of documents. • The information required can be described as a template. • For instance, for company joint ventures, the template might have slots for the companies, the dates, the products, and the amount of money involved. The slot fillers are generally strings.
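A template with string-valued slots can be filled with simple surface patterns. The slot patterns and the example sentence below are invented for illustration; real extraction systems use far richer machinery:

```python
import re

# One pattern per template slot (toy joint-venture template).
PATTERNS = {
    "companies": re.compile(r"^(.+?) and (.+?) announced"),
    "amount":    re.compile(r"\$([\d,]+ (?:million|billion))"),
    "product":   re.compile(r"to produce ([a-z ]+)"),
}

def extract(text):
    """Fill each template slot with the matched string (None if no match)."""
    template = {}
    for slot, pattern in PATTERNS.items():
        match = pattern.search(text)
        if slot == "companies" and match:
            template[slot] = [match.group(1), match.group(2)]
        else:
            template[slot] = match.group(1) if match else None
    return template

# An invented example sentence for the toy template:
sentence = ("Acme Corp and Widget Ltd announced a $40 million "
            "joint venture to produce solar panels.")
```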
  • 40. Question answering • Attempts to find a specific answer to a specific question from a set of documents, or at least a short piece of text that contains the answer. • What is the capital of France? • Paris has been the French capital for many centuries. • There are some question-answering systems on theWeb, but most use very basic techniques. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 41. Machine translation (MT) • MT work started in the US in the early fifties, concentrating on Russian to English. A prototype system was demonstrated in 1954. • Systran was providing useful translations by the late 60s. • Until the 80s, the utility of general purpose MT systems was severely limited by the fact that text was not available in electronic form: Systran used teams of skilled typists to input Russian documents. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 42. • Systran and similar systems are not a substitute for human translation: • They are useful because they allow people to get an idea of what a document is about, and maybe decide whether it is interesting enough to get translated properly. • Spoken language translation is viable for limited domains: research systems include Verbmobil, SLT and CSTAR. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 43. Natural language interfaces • Natural language interfaces were the `classic' NLP problem in the 70s and 80s. • LUNAR is the classic example of a natural language interface to a database (NLID): • Its database concerned lunar rock samples brought back from the Apollo missions. LUNAR is described by Woods (1978). • It was capable of translating elaborate natural language expressions into database queries. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 44. SHRDLU • SHRDLU (Winograd, 1973) was a system capable of participating in a dialogue about a micro world (the blocks world) and manipulating this world according to commands issued in English by the user. • SHRDLU had a big impact on the perception of NLP at the time since it seemed to show that computers could actually `understand' language: • The impossibility of scaling up from the micro world was not realized. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 45. • There have been many advances in NLP since these systems were built: • systems have become much easier to build, and somewhat easier to use, but they still haven't become ubiquitous. • Natural Language interfaces to databases were commercially available in the late 1970s, but largely died out by the 1990s: • Porting to new databases and especially to new domains requires very specialist skills and is essentially too expensive. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 46. NLP application architecture • The input to an NLP system could be speech or text. • It could also be gesture (multimodal input or perhaps a Sign Language). • The output might be non-linguistic, but most systems need to give some sort of feedback to the user, even if they are simply performing some action. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 47. Components • Input Pre-Processing: speech / gesture recogniser or text pre-processor • Morphological Analysis • Part of Speech Tagging • Parsing - this includes syntax and compositional semantics • Disambiguation: this can be done as part of parsing • Context Module: this maintains information about the context • Text Planning: the part of language generation concerned with deciding what meaning to convey • Tactical Generation: converts meaning representations to strings. • Morphological Generation • Output Processing: text-to-speech, text formatter, etc. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 48. • The following are required in order to construct the standard components and knowledge sources: • Lexicon Acquisition • Grammar Acquisition • Acquisition of Statistical Information Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 49. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 50. Note • NLP applications need complex knowledge sources. • Applications cannot be 100% perfect. Humans aren't 100% perfect anyway. • Applications that aid humans are much easier to construct than applications which replace humans. • Improvements in NLP techniques are generally incremental rather than revolutionary. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 51. Morphology • Morphology concerns the structure of words. • Words are assumed to be made up of morphemes, which are the minimal information-carrying units. • Morphemes which can only occur in conjunction with other morphemes are affixes: words are made up of a stem and zero or more affixes. • For instance, dog is a stem which may occur with the plural suffix +s i.e., dogs. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 52. • English only has suffixes (affixes which come after a stem) and prefixes (which come before the stem). • Few languages have infixes (affixes which occur inside the stem) and circumfixes (affixes which go around a stem). • Some English irregular verbs show a relic of inflection by infixation (e.g. sing, sang, sung) Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 53. Inflectional Morphology • Inflectional morphology concerns properties such as • tense, aspect, number, person, gender, and case, • although not all languages code all of these: • English, for instance, has very little morphological marking of case and gender. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 54. Derivational Morphology • Derivational affixes, such as un-, re-, anti- etc, have a broader range of semantic possibilities and don't fit into neat paradigms. • Antidisestablishmentarianism • Antidisestablishmentarianismization • Antiantimissile Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 55. Zero Derivation • tango, waltz etc. are words • which are basically nouns but can be used as verbs. • An English verb will have a present tense form, a 3rd person singular present tense form, a past participle and a passive participle. • e.g., text as a verb - texts, texted. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 56. • Stems and affixes can be individually ambiguous. • There is also potential for ambiguity in how a word form is split into morphemes. • For instance, unionised could be union -ise -ed • or (in chemistry) un- ion -ise -ed. • Note that un- ion is not a possible form (because un- can't attach to a noun). Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 57. • The un- that attaches to verbs nearly always denotes a reversal of a process (e.g., untie), • whereas the un- that attaches to adjectives means `not', which is the meaning in the case of un- ion -ise -ed. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 58. Spelling / orthographic rules • English morphology is essentially concatenative. • Some words have irregular morphology and their inflectional forms simply have to be listed. • For instance, the suffix -s is pronounced differently when it is added to a stem which ends in s, x or z and the spelling reflects this with the addition of an e (boxes etc.). Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 59. • The `e-insertion' rule can be described as follows: • ɛ → e / {s, x, z} ^ _ s • where _ indicates the position in question, • ɛ is used for the empty string, and • ^ for the affix boundary Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 60. • Other spelling rules in English include consonant doubling • (e.g., rat, ratted; though note, not *auditted) • and y/ie conversion (party, parties). Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
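These spelling rules can be sketched in code. The following is a minimal, deliberately simplified illustration (a real morphological system would implement such rules as finite state transducers, as discussed later; the rule coverage here is an assumption):

```python
# Sketch of English plural spelling rules from the slides: e-insertion for
# stems ending in s/x/z (also sh/ch), and y/ie conversion after a consonant.
# Deliberately simplified; not a full orthographic rule engine.
def pluralize(stem: str) -> str:
    if stem.endswith(("s", "x", "z", "sh", "ch")):
        return stem + "es"          # e-insertion: box -> boxes
    if stem.endswith("y") and len(stem) > 1 and stem[-2] not in "aeiou":
        return stem[:-1] + "ies"    # y/ie conversion: party -> parties
    return stem + "s"               # default concatenation: dog -> dogs

print(pluralize("box"), pluralize("party"), pluralize("dog"))  # boxes parties dogs
```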
  • 61. Lexical requirements for morphological processing • There are three sorts of lexical information that are needed for full, high precision morphological processing: • affixes, plus the associated information conveyed by the affix • irregular forms, with associated information similar to that for affixes • stems with syntactic categories Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 62. • One approach to an affix lexicon is for it to consist of a pairing of each affix with some encoding of the syntactic/semantic effect of the affix. • For instance, consider the following fragment of a suffix lexicon • ed PAST_VERB • ed PSP_VERB • s PLURAL_NOUN Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 63. • A lexicon of irregular forms is also needed. • began PAST_VERB begin • begun PSP_VERB begin • English irregular forms are the same for all senses of a word. For instance, ran is the past of run whether we are talking about athletes, politicians or noses. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
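The suffix lexicon and the irregular-form lexicon can be combined into a simple analyser sketch. The entries below are illustrative only, following the slides' tag names:

```python
# Sketch of lexicon-based morphological analysis: look the word up in the
# irregular-form lexicon, then try stripping known suffixes. Entries are a
# tiny illustrative fragment, using the slides' PAST_VERB/PSP_VERB-style tags.
SUFFIXES = {
    "ed": ["PAST_VERB", "PSP_VERB"],
    "s": ["PLURAL_NOUN"],
}
IRREGULAR = {
    "began": ("begin", ["PAST_VERB"]),
    "begun": ("begin", ["PSP_VERB"]),
}

def analyse(word):
    """Return (stem, tag) hypotheses for a word form."""
    hypotheses = []
    if word in IRREGULAR:
        stem, tags = IRREGULAR[word]
        hypotheses += [(stem, t) for t in tags]
    for suffix, tags in SUFFIXES.items():
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[: -len(suffix)]
            hypotheses += [(stem, t) for t in tags]
    return hypotheses

print(analyse("began"))   # [('begin', 'PAST_VERB')]
print(analyse("walked"))  # [('walk', 'PAST_VERB'), ('walk', 'PSP_VERB')]
```

Note this first stage only proposes candidate analyses; choosing among them is left to the later, syntax-coupled stage the slides describe.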
  • 64. • The morphological analyser is split into two stages. • The first of these only concerns morpheme forms • A second stage which is closely coupled to the syntactic analysis Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 65. Finite state automata for recognition • The approach to spelling rules described above involves the use of finite state transducers (FSTs). • Suppose we want to recognize dates (just day and month pairs) written in the format day/month. The day and the month may be expressed as one or two digits (e.g. 11/2, 1/12 etc). • This format corresponds to the following simple FSA, where each character corresponds to one transition: Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 66. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
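The date FSA can be encoded directly as a transition function over states. The state numbering below is an assumption (only the recognition behaviour is specified by the slides):

```python
# FSA recognizing day/month dates: one or two digits, '/', one or two digits.
# Each input character is one transition; any missing transition rejects.
DIGITS = set("0123456789")

def accepts(s: str) -> bool:
    state = 0  # 0: start, 1-2: day digits seen, 3: '/', 4-5: month digits
    for ch in s:
        if state == 0 and ch in DIGITS:
            state = 1            # first day digit
        elif state == 1 and ch in DIGITS:
            state = 2            # optional second day digit
        elif state in (1, 2) and ch == "/":
            state = 3            # separator
        elif state == 3 and ch in DIGITS:
            state = 4            # first month digit
        elif state == 4 and ch in DIGITS:
            state = 5            # optional second month digit
        else:
            return False         # no transition defined: reject
    return state in (4, 5)       # accepting states: at least one month digit

print(accepts("11/2"), accepts("1/12"), accepts("1/"))  # True True False
```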
  • 67. Prediction and part-of-speech tagging • Corpora • A corpus (corpora is the plural) is simply a body of text that has been collected for some purpose. • A balanced corpus contains texts which represent different genres (newspapers, fiction, textbooks, parliamentary reports, cooking recipes, scientific papers etc.): • Early examples were the Brown corpus (US English) and the Lancaster-Oslo-Bergen (LOB) corpus (British English) which are each about 1 million words: the more recent British National Corpus (BNC) contains approx. 100 million words and includes 20 million words of spoken English. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 68. • Implementing an email answering application: • it is essential to collect samples of representative emails. • For interface applications in particular, collecting a corpus requires a simulation of the actual application: • generally this is done by a Wizard of Oz experiment, where a human pretends to be a computer. • Corpora are needed in NLP for two reasons. • To evaluate algorithms on real language • Providing the data source for many machine-learning approaches Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 69. Prediction • The essential idea of prediction is that, • given a sequence of words, • we determine what is most likely to come next. • Used for language modelling for automatic speech recognition. • Speech recognisers cannot accurately determine a word from the sound signal alone. • They cannot reliably tell where each word starts and finishes. • The most probable word is chosen on the basis of the language model. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 70. • The language models which are currently most effective work on the basis of N-grams (a type of Markov chain), where the sequence of the prior n-1 words is used to predict the next. • Trigram models use the preceding 2 words, • bigram models use the preceding word and • unigram models use no context at all, but simply work on the basis of individual word probabilities. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 71. • Word prediction is also useful in communication aids: • Systems for people who can't speak because of some form of disability. • People who use text-to-speech systems to talk because of a non- linguistic disability usually have some form of general motor impairment which also restricts their ability to type at normal rates (stroke, ALS, cerebral palsy etc). Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 72. • Often they use alternative input devices, such as adapted keyboards, puffer switches, mouth sticks or eye trackers. • Generally such users can only construct text at a few words a minute, which is too slow for anything like normal communication to be possible (normal speech is around 150 words per minute). • As a partial aid, a word prediction system is sometimes helpful: • This gives a list of candidate words that changes as the initial letters are entered by the user. • The user chooses the desired word from a menu when it appears. • The main difficulty with using statistical prediction models in such applications is in finding enough data: to be useful, the model really has to be trained on an individual speaker's output. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 73. • Prediction is important in estimation of entropy, including estimations of the entropy of English. • The notion of entropy is important in language modelling because it gives a metric for the difficulty of the prediction problem. • For instance, speech recognition is much easier in situations where the speaker is only saying two easily distinguishable words than when the vocabulary is unlimited: measurements of entropy can quantify this. • Other applications for prediction include optical character recognition (OCR), spelling correction and text segmentation. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 74. Bigrams • A bigram model assigns a probability to a word based on the previous word: P(w_n | w_{n-1}), where w_n is the nth word in some string. • The probability of a string of words P(w_1^n) is thus approximated by the product of these conditional probabilities: • P(w_1^n) ≈ ∏_{k=1}^{n} P(w_k | w_{k-1}) Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 75. Example • good morning • good afternoon • good afternoon • it is very good • it is good • We'll use the symbol <s> to indicate the start of an utterance, so the corpus really looks like: • <s> good morning <s> good afternoon <s> good afternoon <s> it is very good <s> it is good <s> Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 76. • The bigram probabilities are given as • C(w_{n-1} w_n) / Σ_w C(w_{n-1} w) • i.e., the count of a particular bigram, normalized by dividing by the total number of bigrams starting with the same word. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
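Using the toy corpus above, the bigram probabilities can be computed directly from counts:

```python
# Bigram probability estimation over the slides' toy corpus, with <s>
# marking utterance starts: P(w | prev) = C(prev w) / sum_w' C(prev w').
from collections import Counter

corpus = ("<s> good morning <s> good afternoon <s> good afternoon "
          "<s> it is very good <s> it is good <s>").split()

bigram_counts = Counter(zip(corpus, corpus[1:]))
context_counts = Counter(corpus[:-1])  # how often each word starts a bigram

def p(w, prev):
    return bigram_counts[(prev, w)] / context_counts[prev]

print(p("good", "<s>"))       # 3/5 = 0.6
print(p("afternoon", "good")) # 2/5 = 0.4
```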
  • 77. Part of Speech (POS) tagging • One important application is part-of-speech tagging (POS tagging), where the words in a corpus are associated with a tag indicating some syntactic information that applies to that particular use of the word. • For instance, consider the example sentence below: • They can fish. • This has two readings: one (the most likely) about the ability to fish and the other about putting fish in cans. fish is ambiguous between a singular noun, plural noun and a verb, while can is ambiguous between singular noun, verb (the `put in cans' use) and modal verb. However, they is unambiguously a pronoun. (I am ignoring some less likely possibilities, such as proper names.) • These distinctions can be indicated by POS tags: • they PNP • can VM0 VVB VVI NN1 • fish NN1 NN2 VVB VVI Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 78. • The meaning of the tags above is: • NN1 singular noun • NN2 plural noun • PNP personal pronoun • VM0 modal auxiliary verb • VVB base form of verb (except infinitive) • VVI infinitive form of verb (i.e. occurs with `to') Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 79. • A POS tagger resolves the lexical ambiguities to give the most likely set of tags for the sentence. In this case, the right tagging is likely to be: • They PNP can VM0 fish VVB . PUN • Note the tag for the full stop: punctuation is treated as unambiguous. POS tagging can be regarded as a form of very basic word sense disambiguation. • The other syntactically possible reading is: • They PNP can VVB fish NN2 . PUN Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 80. Stochastic POS tagging • One form of POS tagging applies the N-gram technique that we saw above, but in this case it applies to the POS tags rather than the individual words. • The most common approaches depend on a small amount of manually tagged training data from which POS N-grams can be extracted. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
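The idea can be made concrete with a brute-force version of the search a stochastic tagger performs: score each tag sequence by tag-bigram and word-given-tag probabilities and keep the best. All probabilities below are invented for illustration; a real tagger estimates them from tagged training data and uses the Viterbi algorithm rather than enumerating sequences:

```python
# Brute-force tag-sequence search for "they can fish" using a tag bigram
# model. TRANS holds invented P(tag | previous tag); EMIT holds invented
# P(word | tag). "<s>" marks the sentence start.
TAGS = ["PNP", "VM0", "VVB", "NN2"]

TRANS = {  # P(tag | previous tag), invented
    ("<s>", "PNP"): 0.5, ("PNP", "VM0"): 0.4, ("PNP", "VVB"): 0.2,
    ("VM0", "VVB"): 0.6, ("VVB", "NN2"): 0.3,
}
EMIT = {  # P(word | tag), invented
    ("they", "PNP"): 0.9, ("can", "VM0"): 0.5, ("can", "VVB"): 0.05,
    ("fish", "VVB"): 0.1, ("fish", "NN2"): 0.2,
}

def best_tags(words):
    paths = {("<s>",): 1.0}  # partial tag sequence -> probability
    for w in words:
        paths = {
            seq + (t,): p * TRANS.get((seq[-1], t), 0.0) * EMIT.get((w, t), 0.0)
            for seq, p in paths.items()
            for t in TAGS
            if TRANS.get((seq[-1], t), 0.0) * EMIT.get((w, t), 0.0) > 0
        }
    best = max(paths, key=paths.get)
    return best[1:]  # drop the <s> marker

print(best_tags(["they", "can", "fish"]))  # ('PNP', 'VM0', 'VVB')
```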
  • 81. Generative grammar • concerned with the structure of phrases • the big dog slept • can be bracketed • ((the (big dog)) slept) Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 82. Equivalence • Two grammars are said to be weakly equivalent if they generate the same strings. • Two grammars are strongly equivalent if they assign the same bracketings to all strings they generate. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 83. • In most, but not all, approaches, the internal structures are given labels. • For instance, the big dog is a noun phrase (abbreviated NP), • slept, slept in the park and licked Sandy are verb phrases (VPs). • The labels such as NP and VP correspond to non-terminal symbols in a grammar. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 84. Context free grammars • A CFG has four components, 1. a set of non-terminal symbols (e.g., S, VP), conventionally written in uppercase; 2. a set of terminal symbols (i.e., the words), conventionally written in lowercase; 3. a set of rules (productions), where the left hand side (the mother) is a single non-terminal and the right hand side is a sequence of one or more non-terminal or terminal symbols (the daughters); 4. a start symbol, conventionally S, which is a member of the set of non-terminal symbols. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 85. • they fish • (S (NP they) (VP (V fish))) • they can fish • (S (NP they) (VP (V can) (VP (V fish)))) • ;;; the modal verb `are able to' reading • (S (NP they) (VP (V can) (NP fish))) • ;;; the less plausible, put fish in cans, reading • they fish in rivers • (S (NP they) (VP (VP (V fish)) (PP (P in) (NP rivers)))) Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
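A grammar fragment underlying these parses can be written down in the four-component form and checked with a brute-force recognizer. The rule set below is a small assumed fragment, not a complete grammar:

```python
# A toy CFG (non-terminals uppercase, terminals lowercase, start symbol S)
# with a brute-force recognizer that tries every rule and split point.
RULES = {
    "S":  [["NP", "VP"]],
    "VP": [["V", "VP"], ["V", "NP"], ["V"]],
    "NP": [["they"], ["fish"]],
    "V":  [["can"], ["fish"]],
}
TERMINALS = {"they", "can", "fish"}

def derives(cat, words):
    """True if symbol `cat` derives the word sequence `words`."""
    if cat in TERMINALS:
        return list(words) == [cat]
    for rhs in RULES.get(cat, []):
        if len(rhs) == 1 and derives(rhs[0], words):
            return True
        if len(rhs) == 2:
            # try every way of splitting words between the two daughters
            for i in range(1, len(words)):
                if derives(rhs[0], words[:i]) and derives(rhs[1], words[i:]):
                    return True
    return False

print(derives("S", ["they", "can", "fish"]))  # True
print(derives("S", ["can", "they"]))          # False
```

This recognizer accepts both readings of they can fish (modal and `put in cans') since it only checks derivability, not which bracketing was used.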
  • 86. Parse Trees • Parse trees are equivalent to bracketed structures, but are easier to read for complex cases. • A parse tree and bracketed structure for one reading of they can fish in December is shown below. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 87. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 88. Chart Parsing • A chart is a list of edges. • In the simplest version of chart parsing, each edge records a rule application and has the following structure: • [id, left vertex, right vertex, mother category, daughters] • A vertex is an integer representing a point in the input string, as illustrated below: • 0 they 1 can 2 fish 3 Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 89. • mother category refers to the rule that has been applied to create the edge. • daughters is a list of the edges that acted as the daughters for this particular rule application. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
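A few edges for they can fish can be built by hand to illustrate the record format. The categories assigned here assume a small grammar fragment (NP they, V can, VP fish, plus VP → V VP and S → NP VP):

```python
# Chart edges in the slides' format: [id, left vertex, right vertex,
# mother category, daughters], where daughters lists the ids of the
# edges consumed by the rule application (empty for lexical edges).
edges = []

def add_edge(left, right, mother, daughters):
    edge_id = len(edges)
    edges.append([edge_id, left, right, mother, daughters])
    return edge_id

# Lexical edges over "0 they 1 can 2 fish 3"
e_they = add_edge(0, 1, "NP", [])
e_can  = add_edge(1, 2, "V",  [])
e_fish = add_edge(2, 3, "VP", [])          # fish as VP via VP -> V
# Rule applications: VP -> V VP, then S -> NP VP spanning the whole input
e_vp = add_edge(1, 3, "VP", [e_can, e_fish])
e_s  = add_edge(0, 3, "S",  [e_they, e_vp])

print(edges[e_s])  # [4, 0, 3, 'S', [0, 3]]
```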
  • 90. Lexical Semantics 1. Meaning postulates 2. Classical lexical relations: hyponymy, meronymy, synonymy, antonymy 3. Taxonomies and WordNet 4. Classes of polysemy: homonymy, regular polysemy, vagueness 5. Word sense disambiguation Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 91. Meaning postulates • Inference rules can be used to relate open class predicates. • bachelor(x) -> man(x) ^ unmarried(x) • how to define table, tomato or thought Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 92. Hyponymy • Hyponymy is the classical relation • e.g. dog is a hyponym of animal • That is, the relevant sense of dog is the hyponym of animal • animal is the hypernym of dog Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 93. • Classes of words can be categorized by hyponymy • Some nouns - biological taxonomies, human artefacts, professions etc • Abstract nouns, such as truth • Some verbs can be treated as being hyponyms of one another, e.g. murder is a hyponym of kill • Hyponymy is essentially useless for adjectives. • Is coin a hyponym of money? • coin might be metal and also money - multiple parents might be possible Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 94. Meronymy • The standard examples of meronymy apply to physical relationships • arm is part of a body (arm is a meronym of body); • steering wheel is a meronym of car • a soldier is part of an army Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 95. Synonymy and Antonymy • True synonyms are relatively uncommon • thin, slim, slender, skinny • Antonymy is mostly discussed with respect to adjectives • e.g., big/little, like/dislike Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 96. WordNet • WordNet is the main resource for lexical semantics for English that is used in NLP, primarily because of its very large coverage and the fact that it's freely available. • The primary organisation of WordNet is into synsets: synonym sets (near-synonyms). To illustrate this, the following is part of what WordNet returns as an `overview' of red: Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 97. • wn red -over • Overview of adj red The adj red has 6 senses (first 5 from tagged texts) • 1. (43) red, reddish, ruddy, blood-red, carmine, cerise, cherry, cherry-red, crimson, ruby, ruby-red, scarlet -- (having any of numerous bright or strong colors reminiscent of the color of blood or cherries or tomatoes or rubies) • 2. (8) red, reddish -- ((used of hair or fur) of a reddish brown color; "red deer"; "reddish hair") • Nouns in WordNet are organized by hyponymy Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 98. • Sense 6 • big cat, cat • => leopard, Panthera pardus • => leopardess • => panther • => snow leopard, ounce, Panthera uncia • => jaguar, panther, Panthera onca, Felis onca • => lion, king of beasts, Panthera leo • => lioness • => lionet • => tiger, Panthera tigris • => Bengal tiger • => tigress • => liger • => tiglon, tigon • => cheetah, chetah, Acinonyx jubatus • => saber-toothed tiger, sabertooth • => Smiledon californicus • => false saber-toothed tiger Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
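The hypernym links in such a taxonomy support simple chain lookups. The fragment below is hand-built for illustration, not actual WordNet data:

```python
# A toy hyponymy taxonomy: each entry maps a word sense to its hypernym.
# Hand-built fragment in the spirit of the WordNet listing above.
HYPERNYM = {
    "leopard": "big cat",
    "tiger": "big cat",
    "big cat": "cat",
    "cat": "animal",
    "dog": "animal",
}

def hypernym_chain(word):
    """Follow hypernym links from a word up to the top of the taxonomy."""
    chain = [word]
    while chain[-1] in HYPERNYM:
        chain.append(HYPERNYM[chain[-1]])
    return chain

def is_hyponym_of(word, ancestor):
    # A word is a hyponym of anything strictly above it in its chain.
    return ancestor in hypernym_chain(word)[1:]

print(hypernym_chain("tiger"))        # ['tiger', 'big cat', 'cat', 'animal']
print(is_hyponym_of("dog", "animal")) # True
```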
  • 99. Polysemy • The standard example of polysemy is bank (river bank) vs bank (financial institution). • teacher is intuitively general between male and female teachers • Dictionaries are not much help Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 100. Word sense disambiguation • Word sense disambiguation (WSD) is needed for most NL applications that involve semantics. • What's changed since the 1980s is that various statistical or machine- learning techniques have been used to avoid hand-crafting rules. • supervised learning • unsupervised learning • Machine readable dictionaries (MRDs) Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 101. Discourse • Utterances are always understood in a particular context. Context- dependent situations include: 1. Referring expressions: pronouns, definite expressions etc. 2. Universe of discourse: every dog barked, doesn't mean every dog in the world but only every dog in some explicit or implicit contextual set. 3. Responses to questions, etc: only make sense in a context: Who came to the party? Not Sandy. 4. Implicit relationships between events: Max fell. John pushed him. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 102. Rhetorical relations and coherence • Consider the following discourse: Max fell. John pushed him. • This discourse can be interpreted in at least two ways: 1. Max fell because John pushed him. 2. Max fell and then John pushed him. • In 1 the link is a form of explanation, but 2 is an example of narration. • because and and then are said to be cue phrases. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 103. Coherence • Discourses have to have connectivity to be coherent: • Kim got into her car. Sandy likes apples. • Both of these sentences make perfect sense in isolation, but taken together they are incoherent. • Adding context can restore coherence: • Kim got into her car. Sandy likes apples, so Kim thought she'd go to the farm shop and see if she could get some. • The second sentence can be interpreted as an explanation of the first. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 104. • For example, consider a system that reports share prices. This might generate: • In trading yesterday: Dell was up 4.2%, Safeway was down 3.2%, Compaq was up 3.1%. • This is much less acceptable than a connected discourse: • Computer manufacturers gained in trading yesterday: Dell was up 4.2% and Compaq was up 3.1%. But retail stocks suffered: Safeway was down 3.2%. • Here but indicates a Contrast. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 105. Factors influencing discourse interpretation • Cue phrases. These are sometimes unambiguous, but not usually. e.g. and • Punctuation (also prosody) and text structure. • Max fell (John pushed him) and Kim laughed. • Max fell, John pushed him and Kim laughed. • Real world content: • Max fell. John pushed him as he lay on the ground. • Tense and aspect. • Max fell. John had pushed him. • Max was falling. John pushed him. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 106. Referring expressions • referent: a real world entity that some piece of text (or speech) refers to. • referring expressions: bits of language used to perform reference by a speaker • antecedent: the text evoking a referent. • anaphora: the phenomenon of referring to an antecedent Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 107. Pronoun agreement • Pronouns generally have to agree in number and gender with their antecedents. • A little girl is at the door - see what she wants, please? • My dog has hurt his foot - he is in a lot of pain. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 108. Reflexives • John_i cut himself shaving. (himself = John, subscript notation used to indicate this) • Reflexive pronouns must be coreferential with a preceding argument of the same verb. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 109. Pleonastic pronouns • Pleonastic pronouns are semantically empty, and don't refer: • It is snowing. • It is not easy to think of good examples. • It is obvious that Kim snores. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 110. Salience • There are a number of effects which cause particular pronoun referents to be preferred. • Recency More recent referents are preferred • Kim has a fast car. Sandy has an even faster one. Lee likes to drive it. • it preferentially refers to Sandy's car, rather than Kim's. • Grammatical role Subjects > objects > everything else: • Fred went to the Grafton Centre with Bill. He bought a CD. • he is more likely to be interpreted as Fred than as Bill. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 111. • Repeated mention Entities that have been mentioned more frequently are preferred. • Fred was getting bored. He decided to go shopping. Bill went to the Grafton Centre with Fred. He bought a CD. • He=Fred (maybe) despite the general preference for subjects • Parallelism Entities which share the same role as the pronoun in the same sort of sentence are preferred • Bill went with Fred to the Grafton Centre. Kim went with him to Lion Yard. • Him=Fred, because the parallel interpretation is preferred. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 112. • Coherence effects The pronoun resolution may depend on the rhetorical/discourse relation that is inferred. • Bill likes Fred. He has a great sense of humour. • He = Fred preferentially, possibly because the second sentence is interpreted as an explanation of the first, and having a sense of humour is seen as a reason to like someone. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 113. Applications • Machine Translation • Spoken Dialogue Systems • Machine translation (MT) • Direct transfer: map between morphologically analysed structures. • Syntactic transfer: map between syntactically analysed structures. • Semantic transfer: map between semantics structures. • Interlingua: construct a language-neutral representation from parsing and use this for generation Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 114. MT using semantic transfer Semantic transfer is an approach to MT which involves: 1. Parsing a source language string to produce a meaning representation. 2. Transforming that representation to one appropriate for the target language. 3. Generating Target Language from the transformed representation. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 115. Dialogue Systems Turn-taking: generally there are points where a speaker invites someone else to take a turn (possibly choosing a specific person), explicitly (e.g., by asking a question) or otherwise. Pauses: pauses between turns are generally very short (a few hundred milliseconds, but highly culture specific). • Longer pauses are assumed to be meaningful: example from Levinson (1983: 300) A: Is there something bothering you or not? (1.0 sec pause) A: Yes or no? (1.5 sec pause) A: Eh? B: No. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 116. • Overlap: Utterances can overlap (the acceptability of this is dialect/culture specific, but unfortunately humans tend to interrupt automated systems; this is known as barge-in). • Backchannel: Utterances like Uh-huh, OK can occur during another speaker's utterance as a sign that the hearer is paying attention. • Attention: The speaker needs reassurance that the hearer is understanding/paying attention. Often eye contact is enough, but this is problematic with telephone conversations, dark sunglasses, etc. Dialogue systems should give explicit feedback. • Cooperativity: Because participants assume the others are cooperative, we get effects such as indirect answers to questions. • When do you want to leave? • My meeting starts at 3pm. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 117. 1. Single initiative systems (also known as system initiative systems): system controls what happens when. • System: Which station do you want to leave from? • User: King's Cross 2. Mixed initiative dialogue. Both participants can control the dialogue to some extent. • System: Which station do you want to leave from? • User: I don't know, tell me which station I need for Cambridge. 3. Dialogue tracking Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 118. Email response • The LinGO ERG constraint-based grammar has been used for parsing in commercially-deployed email response systems. Well, I am going on vacation for the next two weeks, so the first day that I would be able to meet would be the eighteenth. If you give me your name and your address we will send you the ticket. The morning is good, but nine o'clock might be a little too late, as I have a seminar at ten o'clock. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 119. References • James Allen, Natural Language Understanding, Second Edition, Benjamin-Cummings Publishing Co., Inc., USA, 1995. • Eugene Charniak, Statistical Language Learning, MIT Press, USA, 1996. • Daniel Jurafsky and James H. Martin, Speech and Language Processing, Second Edition, Prentice Hall, 2008. • Christopher D. Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999. Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.
  • 120. Questions ? Natural Language Processing, Dr. C. Saravanan, NIT Durgapur.