This document discusses natural language processing and text analytics techniques for analyzing text documents. It describes early work by H.P. Luhn at IBM on automatic text summarization in the 1950s. It also discusses how techniques such as part-of-speech tagging, semantic networks, sentiment analysis, and intent analysis can be used to extract entities, relationships, and sentiment from text. Challenges with analyzing human language are noted, and it suggests expanding text analysis to audio, images, video and integrating with social media and user behavior data.
4. Natural Language Processing
By H.P. Luhn, in IBM Journal, April 1958.
http://altaplana.com/ibm-luhn58-LiteratureAbstracts.pdf
5. Modelling Text
“Statistical information derived from word frequency and distribution is
used by the machine to compute a relative measure of significance, first
for individual words and then for sentences. Sentences scoring highest in
significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
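Luhn's frequency-and-extraction idea can be sketched in a few lines. This is an illustrative toy, not Luhn's actual algorithm (which also weighted word distribution within sentences); the stopword list here is a made-up sample.

```python
import re
from collections import Counter

# Toy stopword list; a real system would use a fuller lexicon.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "was", "for", "by"}

def luhn_summary(text, n_sentences=1):
    """Score each sentence by the corpus frequency of its significant
    words and return the top-scoring sentences as the auto-abstract."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOPWORDS)

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens if t not in STOPWORDS)

    return sorted(sentences, key=score, reverse=True)[:n_sentences]
```

Sentences whose words recur most often across the document score highest, exactly the "relative measure of significance" the quote describes.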
Figure: Luhn’s analysis of “Messengers of the Nervous System,” a Scientific American article.
Figure: http://wordle.net applied to the NY Times article.
13. Text Analytics
Lexical, syntactic, and semantic analysis discern features, including relationships, in source materials.
Features = entities, measure-value pairs, concepts, topics, events, sentiment, and more.
Text analytics may draw on:
• Lexicons & taxonomies.
• Statistics.
• Patterns.
• Linguistics.
• Machine learning.
15. From POS to Relationships
Understand parts of speech (POS), e.g. <subject> <verb> <object>, to discern facts and relationships.
Semantic networks such as WordNet are a disambiguation asset.
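A minimal sketch of turning POS tags into fact triples, assuming the input has already been tagged (the Penn-style tag sets below are illustrative; a real pipeline would use a tagger and parser rather than a linear scan):

```python
def extract_svo(tagged_tokens):
    """Scan pre-tagged (word, tag) pairs for the first
    noun -> verb -> noun pattern and return it as a
    (subject, verb, object) fact triple, or None."""
    nouns = {"NN", "NNS", "NNP", "NNPS"}
    verbs = {"VB", "VBD", "VBZ", "VBP"}
    subj = verb = None
    for word, tag in tagged_tokens:
        if subj is None and tag in nouns:
            subj = word                      # first noun = candidate subject
        elif subj is not None and verb is None and tag in verbs:
            verb = word                      # first verb after the subject
        elif verb is not None and tag in nouns:
            return (subj, verb, word)        # first noun after the verb = object
    return None
```

For example, the tagged sentence IBM/NNP acquired/VBD the/DT company/NN yields the triple ("IBM", "acquired", "company").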
17. The Back End
Platforms and ecosystems. APIs and services.
Text and content analytics: discerns and extracts features, including relationships, from source materials.
Features = entities, key-value pairs, concepts, topics, events, sentiment, etc.
Provides BI on content-sourced data.
Data integration, record linkage, data fusion.
21. Sentiment Analysis
“Sentiment analysis is the task of identifying positive
and negative opinions, emotions, and evaluations.”
-- Wilson, Wiebe & Hoffman, 2005, “Recognizing Contextual Polarity in
Phrase-Level Sentiment Analysis”
“Sentiment analysis or opinion mining is the
computational study of opinions, sentiments and
emotions expressed in text… An opinion on a feature f is
a positive or negative view, attitude, emotion or
appraisal on f from an opinion holder.”
-- Bing Liu, 2010, “Sentiment Analysis and Subjectivity,” in Handbook of
Natural Language Processing
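The "positive or negative view on a feature f" framing can be illustrated with a minimal lexicon-based polarity scorer. The word lists here are made-up samples, not a real sentiment lexicon, and the negation handling is deliberately naive:

```python
# Illustrative mini-lexicons; real systems use lexicons with
# thousands of entries plus context-sensitive (contextual polarity) rules.
POSITIVE = {"good", "great", "excellent", "love", "improved"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "worsen"}
NEGATORS = {"not", "never", "no"}

def polarity(text):
    """Signed score: positive words add 1, negative words subtract 1,
    and a preceding negator flips the sign of the next sentiment word."""
    score, flip = 0, 1
    for token in text.lower().split():
        word = token.strip(".,!?")
        if word in NEGATORS:
            flip = -1
        elif word in POSITIVE:
            score += flip
            flip = 1
        elif word in NEGATIVE:
            score -= flip
            flip = 1
    return score
```

"Not good" thus scores negative even though "good" is a positive word, which is the phrase-level contextual polarity problem Wilson, Wiebe & Hoffman address.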
25. Complications
Sentiment may be of interest at multiple levels:
• Corpus / data space, i.e., across multiple sources.
• Document.
• Statement / sentence.
• Entity / topic / concept.
Human language is noisy and chaotic! Jargon, slang, irony, ambiguity, anaphora, polysemy, synonymy, etc.
Context is key. Discourse analysis comes into play.
Must distinguish the sentiment holder from the object:
“Geithner said the recession may worsen.”
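The holder-vs-object distinction in the Geithner example can be sketched with a toy reported-speech pattern. Real systems need parsing and coreference resolution; this regex handles only the simplest "X said (that) Y" form:

```python
import re

def holder_and_claim(sentence):
    """Split 'X said (that) Y' into the opinion holder X and the
    claim Y; the sentiment in Y is about its own target (the
    recession), not about the holder (Geithner)."""
    m = re.match(r"(?P<holder>\w+) said (?:that )?(?P<claim>.+)", sentence)
    if m:
        return m.group("holder"), m.group("claim")
    return None, sentence  # no reported-speech pattern found
```

Applied to the slide's example, the holder is "Geithner" while the negative sentiment ("worsen") attaches to the claim's subject, the recession.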
27. Sensemaking
“It is convenient to divide the entire
information access process into two
main components: information retrieval
through searching and browsing, and
analysis and synthesis of results. This
broader process is often referred to in
the literature as sensemaking.
Sensemaking refers to an iterative
process of formulating a conceptual
representation from a large volume
of information. Search plays only one
part in this process.”
-- Marti Hearst, 2009 http://searchuserinterfaces.com/
28. Suggestions
Apply new tech to old needs, e.g., automated coding.
Select from and use all available data.
Marry social to profiles and surveys.
Factor in behaviors.
Interpret according to context and needs.
Understand intent to create situational predictive models.
Explore; experiment.