What is NLP?
Natural language processing is the process of building computational
models for understanding natural language. It studies the problems
of automated generation and understanding of natural human
languages. NLP includes natural-language-generation systems that
convert information from computer databases into normal human
language and natural-language-understanding systems that convert
samples of human language into more formal representations that
are easier for computer programs to manipulate.
NLP involves multiple disciplines, including artificial intelligence
techniques, multivariate analysis, logical inference, statistics,
linguistics, and any other technique that can be used to process,
generate, or interpret language with computers.
In order to understand this field it is fundamental to know the
meaning of its terms. These terms can refer either to the processes
used in the field or to different kinds of information. The kinds of
information are:
1- repositories of knowledge containing linguistic information,
real-world facts, and the different kinds of relations that can be
found in language;
2- specifications describing kinds of content and how to obtain
them from texts, which provide information about different
aspects of those texts.
Machine learning and NLP
Machine learning is a subfield of artificial intelligence (AI) concerned
with algorithms that allow computers to learn.
We can view NLP as “an extension of machine learning” or “a
special kind of machine learning”. Both build models from
algorithms and datasets in order to process new data with those
already-built models.
Machine learning provides natural language processing with a range
of alternative learning algorithms, as well as additional general
approaches and methodologies.
NLP also introduces new learning frameworks and tasks, ranging
from information retrieval and extraction, through speech
recognition, to syntax, semantics, and language-understanding
tasks. It also presents the theoretical paradigms (learning-theoretic,
probabilistic, and information-theoretic), the relations among them,
and the main algorithmic techniques developed within these
paradigms and in key natural language applications.
The two NLP approaches
1. Statistical NLP: comprises all quantitative approaches to
automated language processing, including probabilistic
modeling, information theory, and linear algebra.[6] The
technology for statistical NLP comes mainly from machine
learning and data mining, both of which are fields of artificial
intelligence that involve learning from data (see the sketch
after this list).
2. Linguistic oriented: based on large repositories that contain
information about texts, for example a list of synonyms, a
taxonomy, a definition of the grammar rules of a language, etc.
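As a toy illustration of the statistical approach, here is a minimal
sketch that estimates bigram probabilities from a tiny made-up
corpus (the corpus text and the helper p are illustrative
assumptions, not part of any standard library):

from collections import Counter

# A toy corpus; illustrative only, not a real dataset.
corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    # Maximum-likelihood estimate of P(word | prev).
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p("cat", "the"))  # 2/3: "the" is followed by "cat" in 2 of its 3 occurrences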
Major tasks in NLP
Automatic summarization: Produce a readable summary of a
chunk of text. Often used to provide summaries of text of a known
type, such as articles in the financial section of a newspaper.
Machine translation: Automatically translate text from one human
language to another. This is one of the most difficult problems,
and is a member of a class of problems colloquially termed "AI-
complete", i.e. requiring all of the different types of knowledge
that humans possess (grammar, semantics, facts about the real
world, etc.) in order to solve properly.
Part-of-speech tagging: Given a sentence, determine the part of
speech for each word. Many words, especially common ones, can
serve as multiple parts of speech. For example, "book" can be a
noun ("the book on the table") or verb ("to book a flight"); "set"
can be a noun, verb or adjective; and "out" can be any of at least
five different parts of speech. Note that some languages have
more such ambiguity than others. Languages with little inflectional
morphology, such as English, are particularly prone to such
ambiguity. Chinese is prone to it as well: inflection there is
expressed through tone during verbalization, and such inflection is
not readily conveyed by the characters of its orthography.
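As a small illustration, the sketch below tags the two uses of "book"
from the example above with NLTK; it assumes NLTK is installed and
its tokenizer and tagger models have already been downloaded:

import nltk
# One-time setup (assumed already done):
# nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

for sentence in ["The book is on the table.", "I want to book a flight."]:
    print(nltk.pos_tag(nltk.word_tokenize(sentence)))
# "book" should be tagged as a noun (NN) in the first sentence
# and as a verb (VB) in the second.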
Parsing: Determine the parse tree (grammatical analysis) of a
given sentence. The grammar for natural languages is ambiguous
and typical sentences have multiple possible analyses. In fact,
perhaps surprisingly, for a typical sentence there may be
thousands of potential parses (most of which will seem completely
nonsensical to a human).
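A classic way to see this ambiguity is prepositional-phrase
attachment. The sketch below uses NLTK with a toy grammar (the
grammar itself is an illustrative assumption, not a standard
resource) and yields two parses for one sentence:

import nltk

# A toy grammar exhibiting PP-attachment ambiguity.
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N | NP PP
VP -> V NP | VP PP
PP -> P NP
Det -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("I saw the man with the telescope".split()):
    print(tree)
# Two parses come out: one where "with the telescope" modifies
# "the man", and one where it modifies the seeing.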
Sentiment analysis: Extract subjective information, usually from a
set of documents, often using online reviews to determine the
"polarity" of opinions about specific objects. It is especially useful
for identifying trends of public opinion in social media, for example
for marketing purposes.
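As a deliberately naive illustration of polarity, the sketch below
scores text against two tiny hand-made word lists; real sentiment
systems typically use trained models, and the word lists here are
illustrative assumptions:

# A naive lexicon-based polarity scorer; illustrative only.
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def polarity(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("I love this product it is great"))   # positive
print(polarity("terrible quality and poor service"))  # negative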
Topic segmentation and recognition: Given a chunk of text,
separate it into segments each of which is devoted to a topic, and
identify the topic of the segment.
Part of the NLP-specific vocabulary and its meaning
Linguistics is the scientific and philosophical study of language,
encompassing a number of sub-fields. At the core of theoretical
linguistics is the study of language structure (grammar) and the
study of meaning (semantics). The first of these encompasses
morphology (the formation and composition of words) and syntax
(the rules that determine how words combine into phrases and
sentences).
A controlled vocabulary is a list of terms that have been
enumerated explicitly. This list is controlled by and is available from a
controlled vocabulary registration authority. All terms in a controlled
vocabulary should have an unambiguous, non-redundant definition.
Named entity recognition is a subtask of information extraction
that seeks to locate and classify atomic elements in text into
predefined categories such as the names of persons, organizations,
locations, expressions of times, quantities, monetary values,
percentages, etc.
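For instance, the sketch below runs an off-the-shelf named entity
recognizer from spaCy; it assumes spaCy and its small English model
en_core_web_sm are installed, and the example sentence is made up:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple opened an office in London in 2021 for $1 billion.")
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected categories would be e.g. Apple -> ORG, London -> GPE,
# 2021 -> DATE, $1 billion -> MONEY.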
A taxonomy is a collection of controlled vocabulary terms
organized into a hierarchical structure (tree shaped). Each term in
the taxonomy is in one or more parent-child relationships. The child
kind of thing has by definition the same constraints as the parent
kind, plus one or more additional constraints. For example, car is a
child of vehicle: any car is also a vehicle, but not every vehicle is a
car. There are also specific kinds of taxonomies, like an “enterprise
taxonomy”, which contains terms related only to that specific
field. Taxonomies are seen as less broad
than ontologies because ontologies include logical inference and
allow a larger variety of relation types.
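A minimal sketch of the parent-child idea, reusing the car/vehicle
example above (the table contents and the is_a helper are
illustrative assumptions):

# A minimal taxonomy as parent-child links.
parents = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
}

def is_a(term, ancestor):
    # True if `term` equals `ancestor` or descends from it in the tree.
    while term is not None:
        if term == ancestor:
            return True
        term = parents.get(term)
    return False

print(is_a("car", "vehicle"))  # True: every car is a vehicle
print(is_a("vehicle", "car"))  # False: not every vehicle is a car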
An ontology is a formal representation of a set of concepts within a
domain and the relationships between those concepts. It is used to
reason about the properties of that domain, and may be used to
define the domain itself. Ontologies are a form of knowledge
representation.
Part-of-speech (POS) tagging is a process whereby tokens are
sequentially labeled with syntactic labels, such as "finite verb" or
"gerund" or "subordinating conjunction".
Morphology is the study of the internal structure of words.
Lexeme: the distinction between two senses of "word" is
arguably the most important one in morphology. In the first sense of
"word," the one in which dog and dogs are "the same word," they
are instances of one lexeme. In the second sense, each is a distinct
word-form: different forms of the same lexeme. The form of a word
that is conventionally chosen to represent the canonical form of a
lexeme is called its lemma; we thus say that dog and dogs share a
common lemma. A stemmer is used to reduce the different forms of
a word to a common root (its stem). A lexicon is the collection of
all the lexemes of a language.
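The sketch below contrasts a stemmer with a lemmatizer using
NLTK; it assumes NLTK is installed and the WordNet data has been
downloaded (once, via nltk.download('wordnet')):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["dogs", "running", "studies"]:
    print(word, "->", stemmer.stem(word), lemmatizer.lemmatize(word))
# "dogs" should reduce to "dog" under both; stems need not be real
# words (e.g. "studies" stems to "studi"), while lemmas are
# dictionary forms ("studies" lemmatizes to "study").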
Grammar is the field of linguistics that covers the rules governing
the use of any given language. It mainly
includes morphology and syntax, but it can be complemented with
other linguistic fields.
Syntax is the study of the principles and rules for constructing
sentences in natural languages; the term is also used to refer
directly to the rules and principles that govern sentence
structure. Semantics is basically the study of the meaning of signs.
These studies can be performed at the word level, sentence level,
paragraph level, and even at larger units of discourse.
A corpus is a large and structured set of texts used for statistical
analysis, text mining, validation of linguistic rules, computing
document similarities, etc.
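As one of those uses, here is a minimal sketch that computes
pairwise document similarities over a tiny made-up corpus with
scikit-learn (the example texts are illustrative assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A tiny stand-in corpus; a real corpus would hold many documents.
docs = [
    "the cat sat on the mat",
    "a cat ate the mouse",
    "stock markets fell sharply today",
]

tfidf = TfidfVectorizer().fit_transform(docs)
print(cosine_similarity(tfidf))
# The two cat sentences should score more similar to each other
# than either does to the finance sentence.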
A slow but well-organized video introduction:
http://www.youtube.com/watch?v=bDPULOFFlaI