Natural Language Processing (NLP) began in the 1950s and uses machine learning to analyze and understand human language. NLP can automatically summarize text, translate between languages, identify entities and sentiment, and perform many other tasks. Popular open source NLP libraries such as NLTK, Stanford NLP, and Apache OpenNLP provide algorithms for part-of-speech tagging, named entity recognition, dependency parsing, and more. Common machine learning methods in NLP include part-of-speech tagging, named entity recognition, lemmatization, and sentiment analysis.
2. History of NLP
NLP began in the 1950s at the intersection of artificial intelligence
and linguistics. It was originally distinct from text information
retrieval (IR), which employs highly scalable, statistics-based
techniques to index and search large volumes of text efficiently;
Manning et al. provide an excellent introduction to IR. Over time,
however, NLP and IR have converged somewhat. Today, NLP
borrows from several very diverse fields, requiring NLP
researchers and developers to broaden their knowledge base
significantly.
3. What is Natural Language Processing?
NLP is a way for computers to analyze, understand, and
derive meaning from human language in a smart and
useful way. By utilizing NLP, developers can organize and
structure knowledge to perform tasks such as automatic
summarization, translation, named entity recognition,
relationship extraction, sentiment analysis, speech
recognition, and topic segmentation.
4. What Can Developers Use NLP Algorithms For?
NLP algorithms are typically based on machine learning. Instead of hand-coding large sets of
rules, NLP can rely on machine learning to learn these rules automatically by analyzing a set of examples
(i.e., a large corpus, ranging from a book down to a collection of sentences) and making statistical
inferences. In general, the more data analyzed, the more accurate the model will be.
● Summarize blocks of text using Summarizer to extract the most important and central ideas while
ignoring irrelevant information.
● Create a chatbot using Parsey McParseface, a language parsing deep learning model made by
Google that uses Part-of-Speech tagging.
● Identify the type of entity extracted, such as it being a person, place, or organization using Named
Entity Recognition.
● Use Sentiment Analysis to identify the sentiment of a string of text, from very negative to neutral
to very positive.
● Reduce words to their root, or stem, using PorterStemmer, or break up text into tokens using
Tokenizer.
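The last bullet's two building blocks, tokenization and stemming, can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the actual PorterStemmer or Tokenizer algorithms: the regex tokenizer and the suffix-stripping rules here are toy assumptions for illustration.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (simplified regex rule)."""
    return re.findall(r"[A-Za-z]+", text.lower())

def stem(token):
    """Strip a few common suffixes: a crude stand-in for the Porter algorithm."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The runner was running and jumped over fences.")
stems = [stem(t) for t in tokens]
# stem("jumped") → "jump"; stem("fences") → "fenc"
```

Real stemmers such as Porter's apply ordered rule phases with measure conditions, so their output differs from this sketch, but the idea of reducing inflected forms to a shared root is the same.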
5. Open Source NLP Libraries
These libraries provide the algorithmic building blocks of NLP in real-world
applications. Algorithmia provides a free API endpoint for many of these algorithms,
without ever having to set up or provision servers and infrastructure.
● Apache OpenNLP: a machine learning toolkit that provides tokenizers, sentence
segmentation, part-of-speech tagging, named entity extraction, chunking, parsing,
coreference resolution, and more.
● Natural Language Toolkit (NLTK): a Python library that provides modules for
text processing, classification, tokenization, stemming, tagging, parsing, and more.
● Stanford NLP: a suite of NLP tools that provides part-of-speech tagging, a
named entity recognizer, a coreference resolution system, sentiment analysis, and
more.
● MALLET: a Java package that provides Latent Dirichlet Allocation, document
classification, clustering, topic modeling, information extraction, and more.
6. Some Common Machine-Learning Methods Used in NLP Tasks
● Parts-of-speech
● Named entities
● Dependency parse
● OpenIE
● Lemmas
● Coreference
● Wikipedia Entities
● Relations
● Sentiments
7. Part-of-Speech
A Part-of-Speech Tagger (POS Tagger) is a piece of software that reads text in
some language and assigns a part of speech to each word (and other token), such
as noun, verb, adjective, etc.
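A minimal sketch of the idea: look each word up in a lexicon of its most frequent tag and fall back to a default. The lexicon entries here are assumptions made up for the example; real taggers learn these statistics from an annotated corpus and also use context.

```python
# Toy lexicon mapping words to their most frequent part of speech
# (Penn Treebank-style tags); real taggers learn this from data.
LEXICON = {
    "the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
    "barks": "VBZ", "sleeps": "VBZ", "loud": "JJ",
}

def pos_tag(tokens, default="NN"):
    """Assign each token its lexicon tag, falling back to a default noun tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tags = pos_tag(["The", "dog", "barks"])
# → [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```

Even this naive most-frequent-tag baseline performs surprisingly well on English; statistical taggers improve on it by conditioning on neighboring words and tags.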
8. Named Entities
Named Entity Recognition (NER) labels sequences of words in a text that are the
names of things, such as person and company names, or gene and protein names,
particularly for the three classic classes (PERSON, ORGANIZATION,
LOCATION).
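One simple way to picture the task is gazetteer matching over those three classes. The name lists below are made-up examples; production NER systems use statistical sequence models rather than fixed lists, precisely because lists can never cover all names.

```python
# Toy gazetteers for the three classic NER classes; real systems
# use statistical sequence models (e.g. CRFs or neural taggers).
GAZETTEERS = {
    "PERSON": {"ada lovelace", "alan turing"},
    "ORGANIZATION": {"acme corp"},
    "LOCATION": {"london", "paris"},
}

def ner(text):
    """Return sorted (span, label) pairs for gazetteer matches in the text."""
    found = []
    lowered = text.lower()
    for label, names in GAZETTEERS.items():
        for name in names:
            if name in lowered:
                found.append((name, label))
    return sorted(found)

mentions = ner("Alan Turing moved to London.")
# → [('alan turing', 'PERSON'), ('london', 'LOCATION')]
```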
9. Dependency Parse
A dependency parser analyzes the grammatical structure of a sentence,
establishing relationships between "head" words and the words that modify those
heads.
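A dependency parse is commonly stored as a head index per token. The hand-written parse of "She ate the apple" below is illustrative only (the arc labels follow Universal Dependencies conventions); a real parser predicts these arcs statistically.

```python
# A dependency parse stored as parallel arrays: each token's 1-based
# head index (0 = root) and the label of the arc to that head.
tokens = ["She", "ate", "the", "apple"]
heads  = [2, 0, 4, 2]
labels = ["nsubj", "root", "det", "obj"]

def arcs(tokens, heads, labels):
    """Yield (modifier, relation, head) triples from the parallel arrays."""
    for tok, h, lab in zip(tokens, heads, labels):
        head_word = "ROOT" if h == 0 else tokens[h - 1]
        yield (tok, lab, head_word)

parse = list(arcs(tokens, heads, labels))
# → [('She', 'nsubj', 'ate'), ('ate', 'root', 'ROOT'),
#    ('the', 'det', 'apple'), ('apple', 'obj', 'ate')]
```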
10. OpenIE
Open information extraction (open IE) refers to the extraction of relation tuples,
typically binary relations, from plain text. The central difference from
conventional relation extraction is that the schema for these relations does not
need to be specified in advance; typically, the relation name is just the text
linking the two arguments.
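A toy sketch of that idea: split a simple sentence at a verb and treat the linking word itself as the relation name. The verb list is an assumption for the example; real open IE systems identify relation phrases from the parse rather than from a fixed list.

```python
# Crude open-IE sketch: the first recognized verb splits the sentence
# into (arg1, relation, arg2), with the verb text as the relation name.
VERBS = {"visited", "founded", "acquired", "married"}

def extract_triple(sentence):
    """Return an (arg1, rel, arg2) triple from a simple sentence, or None."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in VERBS:
            return (" ".join(words[:i]), w, " ".join(words[i + 1:]))
    return None

triple = extract_triple("Barack Obama visited Paris.")
# → ('Barack Obama', 'visited', 'Paris')
```

Note that because the schema is open, "visited" becomes the relation name directly; a closed-schema extractor would instead map it onto a predefined type.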
11. Lemmas
Lemmatization generates the word lemmas for all tokens in the corpus.
For grammatical reasons, documents use different forms of a word, such as
organize, organizes, and organizing. Additionally, there are families of
derivationally related words with similar meanings, such as democracy,
democratic, and democratization. In many situations, it is useful for a search for
one of these words to return documents that contain another word in the set.
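The search use case above can be sketched with a lookup table mapping inflected forms to a shared lemma. The dictionary entries are illustrative assumptions; real lemmatizers combine a morphological analyzer with the token's part of speech instead of a fixed table.

```python
# Toy lemma dictionary; real lemmatizers use morphology plus POS.
LEMMAS = {
    "organizes": "organize", "organizing": "organize",
    "ran": "run", "better": "good",
}

def lemmatize(token):
    """Map an inflected form to its lemma, else return the token unchanged."""
    return LEMMAS.get(token.lower(), token.lower())

def same_lemma(query, word):
    """True when a search term and a document word share a lemma."""
    return lemmatize(query) == lemmatize(word)
# same_lemma("organizes", "organizing") → True
```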
12. Coreference
Coreference resolution is the task of finding all expressions that refer to the same
entity in a text. It is an important step for many higher-level NLP tasks that
involve natural language understanding, such as document summarization,
question answering, and information extraction.
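To make the task concrete, here is a deliberately naive heuristic: link each pronoun to the most recent preceding entity mention. The token list and entity set are made up for the example, and the heuristic's mistakes (it links "it" to Ada, when "it" really refers to the program) are exactly why practical coreference systems use much richer features.

```python
# Naive coreference sketch: a pronoun resolves to the most recent
# preceding entity mention. Real systems use gender, number, syntax,
# and semantics; this heuristic makes obvious mistakes by design.
PRONOUNS = {"he", "she", "it", "they"}

def resolve(tokens, entities):
    """Return {pronoun_index: entity} linking pronouns to the latest mention."""
    links, last_entity = {}, None
    for i, tok in enumerate(tokens):
        if tok in entities:
            last_entity = tok
        elif tok.lower() in PRONOUNS and last_entity is not None:
            links[i] = last_entity
    return links

tokens = ["Ada", "wrote", "a", "program", "and", "she", "published", "it"]
links = resolve(tokens, {"Ada"})
# → {5: 'Ada', 7: 'Ada'}  (the second link is wrong: "it" means the program)
```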
14. Relation Extraction
Relation extraction detects semantic relations among entities in natural language
texts. The Stanford Relation Extractor is a Java implementation that finds
relations between two entities. The current relation extraction model is trained
on the relation types (except the 'kill' relation) and data from Roth and Yih,
"Global inference for entity and relation identification," 2007.
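In contrast to open IE, the relation inventory here is fixed in advance. A minimal sketch of that closed-schema setup, assuming made-up trigger words and relation type names (the real Stanford extractor uses a trained statistical model, not trigger lists):

```python
# Closed-schema relation extraction sketch: trigger words map an
# entity pair onto one of a predefined set of relation types.
TRIGGERS = {
    "works": "Work_For", "employed": "Work_For",
    "lives": "Live_In", "resides": "Live_In",
}

def classify_relation(sentence, e1, e2):
    """Return (e1, relation_type, e2) if a trigger word appears, else None."""
    for w in sentence.lower().split():
        rel = TRIGGERS.get(w)
        if rel:
            return (e1, rel, e2)
    return None

triple = classify_relation("Alice works at Acme Corp", "Alice", "Acme Corp")
# → ('Alice', 'Work_For', 'Acme Corp')
```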
15. Sentiment Analysis
Sentiment analysis is the process of computationally identifying and categorizing
opinions expressed in a piece of text, especially in order to determine whether
the writer's attitude toward a particular topic, product, etc. is positive, negative,
or neutral.
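The simplest family of approaches is lexicon-based scoring: sum per-word polarities and bucket the total into the three classes above. The word scores below are illustrative assumptions, not taken from a real sentiment lexicon.

```python
# Lexicon-based sentiment sketch; scores here are made up for the demo.
POLARITY = {"great": 2, "good": 1, "fine": 0, "bad": -1, "awful": -2}

def sentiment(text):
    """Classify text as positive, negative, or neutral by summed word scores."""
    score = sum(POLARITY.get(w.strip(".,!?").lower(), 0) for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
# sentiment("The movie was great!") → "positive"
```

Machine-learned sentiment models improve on this by handling negation ("not good") and context, which a bag-of-words lexicon cannot capture.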