Natural Language Processing (NLP) began in the 1950s and uses machine learning to analyze and understand human language. NLP can automatically summarize text, translate between languages, identify entities and sentiment, and perform many other tasks. Popular open source NLP libraries such as NLTK, Stanford NLP, and Apache OpenNLP provide algorithms for part-of-speech tagging, named entity recognition, dependency parsing, and more. Common machine learning methods in NLP include part-of-speech tagging, named entity recognition, lemmatization, and sentiment analysis.
2. History of NLP
NLP began in the 1950s at the intersection of artificial intelligence
and linguistics. It was originally distinct from text information
retrieval (IR), which employs highly scalable, statistics-based
techniques to index and search large volumes of text efficiently;
Manning et al. provide an excellent introduction to IR. Over time,
however, NLP and IR have converged somewhat. Today, NLP
borrows from several very diverse fields, requiring NLP
researchers and developers to broaden their knowledge base
significantly.
3. What is Natural Language Processing?
NLP is a way for computers to analyze, understand, and
derive meaning from human language in a smart and
useful way. By utilizing NLP, developers can organize and
structure knowledge to perform tasks such as automatic
summarization, translation, named entity recognition,
relationship extraction, sentiment analysis, speech
recognition, and topic segmentation.
4. What Can Developers Use NLP Algorithms For?
NLP algorithms are typically based on machine learning. Instead of hand-coding large sets of
rules, NLP can rely on machine learning to learn these rules automatically by analyzing a set of examples
(i.e., a large corpus, ranging from a book down to a collection of sentences) and making statistical
inferences. In general, the more data analyzed, the more accurate the model will be.
● Summarize blocks of text using Summarizer to extract the most important and central ideas while
ignoring irrelevant information.
● Create a chatbot using Parsey McParseface, a language parsing deep learning model made by
Google that uses Part-of-Speech tagging.
● Identify the type of entity extracted, such as it being a person, place, or organization using Named
Entity Recognition.
● Use Sentiment Analysis to identify the sentiment of a string of text, from very negative to neutral
to very positive.
● Reduce words to their root, or stem, using PorterStemmer, or break up text into tokens using
Tokenizer.
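The last bullet's two building blocks, tokenization and stemming, can be sketched in a few lines of plain Python. This is a deliberately simplified stand-in, not the actual PorterStemmer or Tokenizer algorithms: the regex tokenizer and the suffix-stripping rules here are toy assumptions for illustration.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (simplified regex rule)."""
    return re.findall(r"[A-Za-z]+", text.lower())

def stem(token):
    """Strip a few common suffixes: a crude stand-in for the Porter algorithm."""
    for suffix in ("ing", "ies", "es", "s", "ed"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

tokens = tokenize("The runner was running and jumped over fences.")
stems = [stem(t) for t in tokens]
# stem("jumped") → "jump"; stem("fences") → "fenc"
```

Real stemmers such as Porter's apply ordered rule phases with measure conditions, so their output differs from this sketch, but the idea of reducing inflected forms to a shared root is the same.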
5. Open Source NLP Libraries
These libraries provide the algorithmic building blocks of NLP in real-world
applications. Algorithmia provides a free API endpoint for many of these algorithms,
without ever having to set up or provision servers and infrastructure.
● Apache OpenNLP: a machine learning toolkit that provides tokenizers, sentence
segmentation, part-of-speech tagging, named entity extraction, chunking, parsing,
coreference resolution, and more.
● Natural Language Toolkit (NLTK): a Python library that provides modules for
text processing, classification, tokenization, stemming, tagging, parsing, and more.
● Stanford NLP: a suite of NLP tools that provides part-of-speech tagging, a
named entity recognizer, a coreference resolution system, sentiment analysis, and
more.
● MALLET: a Java package that provides Latent Dirichlet Allocation, document
classification, clustering, topic modeling, information extraction, and more.
6. Some Common Machine-Learning Methods Used in NLP Tasks
● Parts-of-speech
● Named entities
● Dependency parse
● OpenIE
● Lemmas
● Coreference
● Wikipedia Entities
● Relations
● Sentiments
7. Part-of-Speech
A Part-of-Speech Tagger (POS Tagger) is a piece of software that reads text in
some language and assigns a part of speech to each word (and other token), such
as noun, verb, adjective, etc.
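A minimal sketch of the idea: look each word up in a lexicon of its most frequent tag and fall back to a default. The lexicon entries here are assumptions made up for the example; real taggers learn these statistics from an annotated corpus and also use context.

```python
# Toy lexicon mapping words to their most frequent part of speech
# (Penn Treebank-style tags); real taggers learn this from data.
LEXICON = {
    "the": "DT", "a": "DT", "dog": "NN", "cat": "NN",
    "barks": "VBZ", "sleeps": "VBZ", "loud": "JJ",
}

def pos_tag(tokens, default="NN"):
    """Assign each token its lexicon tag, falling back to a default noun tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

tags = pos_tag(["The", "dog", "barks"])
# → [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```

Even this naive most-frequent-tag baseline performs surprisingly well on English; statistical taggers improve on it by conditioning on neighboring words and tags.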
8. Named Entities
Named Entity Recognition (NER) labels sequences of words in a text that are the
names of things, such as person and company names, or gene and protein names,
particularly for the three classic classes (PERSON, ORGANIZATION,
LOCATION).
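One simple way to picture the task is gazetteer matching over those three classes. The name lists below are made-up examples; production NER systems use statistical sequence models rather than fixed lists, precisely because lists can never cover all names.

```python
# Toy gazetteers for the three classic NER classes; real systems
# use statistical sequence models (e.g. CRFs or neural taggers).
GAZETTEERS = {
    "PERSON": {"ada lovelace", "alan turing"},
    "ORGANIZATION": {"acme corp"},
    "LOCATION": {"london", "paris"},
}

def ner(text):
    """Return sorted (span, label) pairs for gazetteer matches in the text."""
    found = []
    lowered = text.lower()
    for label, names in GAZETTEERS.items():
        for name in names:
            if name in lowered:
                found.append((name, label))
    return sorted(found)

mentions = ner("Alan Turing moved to London.")
# → [('alan turing', 'PERSON'), ('london', 'LOCATION')]
```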
9. Dependency Parse
A dependency parser analyzes the grammatical structure of a sentence,
establishing relationships between "head" words and the words that modify those
heads.
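A dependency parse is commonly stored as a head index per token. The hand-written parse of "She ate the apple" below is illustrative only (the arc labels follow Universal Dependencies conventions); a real parser predicts these arcs statistically.

```python
# A dependency parse stored as parallel arrays: each token's 1-based
# head index (0 = root) and the label of the arc to that head.
tokens = ["She", "ate", "the", "apple"]
heads  = [2, 0, 4, 2]
labels = ["nsubj", "root", "det", "obj"]

def arcs(tokens, heads, labels):
    """Yield (modifier, relation, head) triples from the parallel arrays."""
    for tok, h, lab in zip(tokens, heads, labels):
        head_word = "ROOT" if h == 0 else tokens[h - 1]
        yield (tok, lab, head_word)

parse = list(arcs(tokens, heads, labels))
# → [('She', 'nsubj', 'ate'), ('ate', 'root', 'ROOT'),
#    ('the', 'det', 'apple'), ('apple', 'obj', 'ate')]
```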
10. OpenIE
Open information extraction (open IE) refers to the extraction of relation tuples,
typically binary relations, from plain text. The central difference from
conventional relation extraction is that the schema for these relations does not
need to be specified in advance; typically, the relation name is just the text
linking the two arguments.
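A toy sketch of that idea: split a simple sentence at a verb and treat the linking word itself as the relation name. The verb list is an assumption for the example; real open IE systems identify relation phrases from the parse rather than from a fixed list.

```python
# Crude open-IE sketch: the first recognized verb splits the sentence
# into (arg1, relation, arg2), with the verb text as the relation name.
VERBS = {"visited", "founded", "acquired", "married"}

def extract_triple(sentence):
    """Return an (arg1, rel, arg2) triple from a simple sentence, or None."""
    words = sentence.rstrip(".").split()
    for i, w in enumerate(words):
        if w.lower() in VERBS:
            return (" ".join(words[:i]), w, " ".join(words[i + 1:]))
    return None

triple = extract_triple("Barack Obama visited Paris.")
# → ('Barack Obama', 'visited', 'Paris')
```

Note that because the schema is open, "visited" becomes the relation name directly; a closed-schema extractor would instead map it onto a predefined type.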
11. Lemmas
Lemmatization generates the word lemmas for all tokens in the corpus.
For grammatical reasons, documents use different forms of a word, such as
organize, organizes, and organizing. Additionally, there are families of
derivationally related words with similar meanings, such as democracy,
democratic, and democratization. In many situations, it is useful for a search for
one of these words to return documents that contain another word in the set.
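The search use case above can be sketched with a lookup table mapping inflected forms to a shared lemma. The dictionary entries are illustrative assumptions; real lemmatizers combine a morphological analyzer with the token's part of speech instead of a fixed table.

```python
# Toy lemma dictionary; real lemmatizers use morphology plus POS.
LEMMAS = {
    "organizes": "organize", "organizing": "organize",
    "ran": "run", "better": "good",
}

def lemmatize(token):
    """Map an inflected form to its lemma, else return the token unchanged."""
    return LEMMAS.get(token.lower(), token.lower())

def same_lemma(query, word):
    """True when a search term and a document word share a lemma."""
    return lemmatize(query) == lemmatize(word)
# same_lemma("organizes", "organizing") → True
```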
12. Coreference
Coreference resolution is the task of finding all expressions that refer to the same
entity in a text. It is an important step for many higher-level NLP tasks that
involve natural language understanding, such as document summarization,
question answering, and information extraction.
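To make the task concrete, here is a deliberately naive heuristic: link each pronoun to the most recent preceding entity mention. The token list and entity set are made up for the example, and the heuristic's mistakes (it links "it" to Ada, when "it" really refers to the program) are exactly why practical coreference systems use much richer features.

```python
# Naive coreference sketch: a pronoun resolves to the most recent
# preceding entity mention. Real systems use gender, number, syntax,
# and semantics; this heuristic makes obvious mistakes by design.
PRONOUNS = {"he", "she", "it", "they"}

def resolve(tokens, entities):
    """Return {pronoun_index: entity} linking pronouns to the latest mention."""
    links, last_entity = {}, None
    for i, tok in enumerate(tokens):
        if tok in entities:
            last_entity = tok
        elif tok.lower() in PRONOUNS and last_entity is not None:
            links[i] = last_entity
    return links

tokens = ["Ada", "wrote", "a", "program", "and", "she", "published", "it"]
links = resolve(tokens, {"Ada"})
# → {5: 'Ada', 7: 'Ada'}  (the second link is wrong: "it" means the program)
```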
14. Relation Extraction
Relation extraction detects semantic relations among entities in natural language
texts. The Stanford Relation Extractor is a Java implementation that finds
relations between two entities. The current relation extraction model is trained
on the relation types (except the 'kill' relation) and data from Roth and Yih,
"Global inference for entity and relation identification," 2007.
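In contrast to open IE, the relation inventory here is fixed in advance. A minimal sketch of that closed-schema setup, assuming made-up trigger words and relation type names (the real Stanford extractor uses a trained statistical model, not trigger lists):

```python
# Closed-schema relation extraction sketch: trigger words map an
# entity pair onto one of a predefined set of relation types.
TRIGGERS = {
    "works": "Work_For", "employed": "Work_For",
    "lives": "Live_In", "resides": "Live_In",
}

def classify_relation(sentence, e1, e2):
    """Return (e1, relation_type, e2) if a trigger word appears, else None."""
    for w in sentence.lower().split():
        rel = TRIGGERS.get(w)
        if rel:
            return (e1, rel, e2)
    return None

triple = classify_relation("Alice works at Acme Corp", "Alice", "Acme Corp")
# → ('Alice', 'Work_For', 'Acme Corp')
```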
15. Sentiment Analysis
Sentiment analysis is the process of computationally identifying and categorizing
opinions expressed in a piece of text, especially in order to determine whether
the writer's attitude toward a particular topic, product, etc. is positive, negative,
or neutral.
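The simplest family of approaches is lexicon-based scoring: sum per-word polarities and bucket the total into the three classes above. The word scores below are illustrative assumptions, not taken from a real sentiment lexicon.

```python
# Lexicon-based sentiment sketch; scores here are made up for the demo.
POLARITY = {"great": 2, "good": 1, "fine": 0, "bad": -1, "awful": -2}

def sentiment(text):
    """Classify text as positive, negative, or neutral by summed word scores."""
    score = sum(POLARITY.get(w.strip(".,!?").lower(), 0) for w in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
# sentiment("The movie was great!") → "positive"
```

Machine-learned sentiment models improve on this by handling negation ("not good") and context, which a bag-of-words lexicon cannot capture.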