SlideShare ist ein Scribd-Unternehmen logo
1 von 83
An Introduction to Text Analytics:
Social, Online, and Enterprise
Seth Grimes
Alta Plana Corporation
@sethgrimes
Text & Social Analytics Summit 2013
Workshop
June 4, 2013
Text Analytics Introduction
2013 Text Analytics Summit
2
Perspectives
Perspective #1: You support research or work in IT.
You help end users who have lots of text.
Perspective #2: You’re a researcher, business analyst, or
other “end user.”
You have lots of text. You want an automated way to deal
with it.
Perspective #3: You work for a solution provider.
Perspective #4: Other?
Text Analytics Introduction
2013 Text Analytics Summit
3
From the Analytics/Business Perspective
1. If you are not analyzing text – if you're analyzing only
transactional information – you're missing opportunity
or incurring risk.
2. Text analytics can boost business results –
“Organizations embracing text analytics all report having an
epiphany moment when they suddenly knew more than
before.”
-- Philip Russom, the Data Warehousing Institute, 2007
http://tdwi.org/articles/2007/05/09-what-works/bi-search-and-text-analytics.aspx
– via established BI / data-mining programs, or
independently.
Text Analytics Introduction
2013 Text Analytics Summit
4
Agenda
1. The “Unstructured” Data Challenge.
2. Text analytics for information retrieval and BI.
3. Text analysis technologies and processes.
4. Applications.
5. The market: Software, services, and solutions.
Note:
I will not cover the agenda in a linear fashion.
Product images and references are used to illustrate only.
Some slides are included for reference purposes.
Text Analytics Introduction
2013 Text Analytics Summit
5
Value in Text
It’s a truism that 80% of enterprise-relevant information
originates in “unstructured” form:
E-mail and messages.
Web pages, news & blog articles, forum postings, and other
social media.
Contact-center notes and transcripts.
Surveys, feedback forms, warranty claims.
Scientific literature, books, legal documents.
...
http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/
Non-text “unstructured” content?
Images
Audio including speech
Video
Value derives from patterns.
Text Analytics Introduction
2013 Text Analytics Summit
6
Unstructured Sources
These sources may contain “traditional” data.
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard
& Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85.
And they may not.
www.stanford.edu/%7ernusse/wntwindow.html
Axin and Frat1 interact with dvl and GSK, bridging
Dvl to GSK in Wnt-mediated regulation of LEF-1.
Wnt proteins transduce their signals through
dishevelled (Dvl) proteins to inhibit glycogen synthase
kinase 3beta (GSK), leading to the accumulation of
cytosolic beta-catenin and activation of TCF/LEF-1
transcription factors. To understand the mechanism
by which Dvl acts through GSK to regulate LEF-1, we
investigated the roles of Axin and Frat1 in Wnt-
mediated activation of LEF-1 in mammalian cells. We
found that Dvl interacts with Axin and with Frat1,
both of which interact with GSK. Similarly, the Frat1
homolog GBP binds Xenopus Dishevelled in an
interaction that requires GSK.
www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed&
cmd=Retrieve&list_uids=10428961&dopt=Abstract
Text Analytics Introduction
2013 Text Analytics Summit
7
http://www.tropicalisland.de/NYC_New_York_
Brooklyn_Bridge_from_World_Trade_Center_
b.jpg
x(t) = t
y(t) = ½ a (et/a + e-t/a)
= acosh(t/a)
http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg
Structure in “unstructured” sourcesStructure in “unstructured” sources
Text Analytics Introduction
2013 Text Analytics Summit
8
Business Intelligence
Conventional BI feeds off:
It runs off:
"SUMLEV","STATE","COUNTY","STNAME","CTYNAME","YEAR","POPESTIMATE",
50,19,1,"Iowa","Adair County",1,8243,4036,4207,446,225,221,994,509
50,19,1,"Iowa","Adair County",2,8243,4036,4207,446,225,221,994,509
50,19,1,"Iowa","Adair County",3,8212,4020,4192,442,222,220,987,505
50,19,1,"Iowa","Adair County",4,8095,3967,4128,432,208,224,935,488
50,19,1,"Iowa","Adair County",5,8003,3924,4079,405,186,219,928,495
50,19,1,"Iowa","Adair County",6,7961,3892,4069,384,183,201,907,472
50,19,1,"Iowa","Adair County",7,7875,3855,4020,366,179,187,871,454
50,19,1,"Iowa","Adair County",8,7795,3817,3978,343,162,181,841,439
50,19,1,"Iowa","Adair County",9,7714,3777,3937,338,159,179,805,417
“The bulk of information value
is perceived as coming from
data in relational tables. The
reason is that data that is
structured is easy to mine and
analyze.”
-- Prabhakar Raghavan,
Google, formerly
Yahoo Research
Text Analytics Introduction
2013 Text Analytics Summit
9
Business Intelligence
Conventional BI produces:
Text Analytics Introduction
2013 Text Analytics Summit
10
Text-BI: Back to the Future
Business intelligence (BI) was first defined in 1958:
“In this paper, business is a collection of activities carried on for
whatever purpose, be it science, technology, commerce,
industry, law, government, defense, et cetera... The notion of
intelligence is also defined here... as ‘the ability to apprehend
the interrelationships of presented facts in such a way as to
guide action towards a desired goal.’”
-- Hans Peter Luhn
“A Business Intelligence System”
IBM Journal, October 1958
What was IT like in the ‘50s?
Document
input and
processing
Knowledge
handling is
key
Desk Set (1957): Computer engineer
Richard Sumner (Spencer Tracy)
and television network librarian
Bunny Watson (Katherine Hepburn)
and the "electronic brain" EMERAC.
Text Analytics Introduction
2013 Text Analytics Summit
12
From the Information Retrieval Perspective
What do people do with electronic documents?
1. Publish, Manage, and Archive.
2. Index and Search.
3. Categorize and Classify according to metadata &
contents.
4. Information Extraction.
For textual documents, text analytics enhances #1 & #2 and
enables #3 & #4.
You need linguistics to do #1 & #4 well, to deal with meaning
(a.k.a. semantics).
Search is not enough...
Text Analytics Introduction
2013 Text Analytics Summit
13
Semantics
Text analytics adds semantic understanding of –
Named entities: people, companies, places, etc.
Pattern-based entities: e-mail addresses, phone numbers, etc.
Concepts: abstractions of entities.
Facts and relationships.
Concrete and abstract attributes (e.g., 10-year, expensive,
comfortable).
Subjectivity in the forms of opinions, sentiments, and
emotions: attitudinal data.
Call these elements, collectively, features.
Text Analytics Introduction
2013 Text Analytics Summit
14
Semantics, Analytics, and IR
Text analytics generates semantics to bridge search, BI, and
applications, enabling next-generation information
systems.
Search BI
Applica-
tions
Search based
applications
(search + text +
apps)
Information access
(search + text + BI)
Integrated analytics
(text + BI)
Text analytics
(inner circle)
Semantic search
(search + text)
NextGen CRM, EFM,
MR, marketing, …
Text Analytics Introduction
2013 Text Analytics Summit
15
Information Access
Text analytics transforms Information Retrieval (IR) into
Information Access (IA).
• Search terms become queries.
• Retrieved material is mined for larger-scale structure.
• Retrieved material is mined for features such as entities and
topics or themes.
• Retrieved material is mined for smaller-scale structure such
as facts and relationships.
• Results are presented intelligently, for instance, grouping
on mined topics-themes.
• Extracted information may be visualized and explored.
Text Analytics Introduction
2013 Text Analytics Summit
16
Information Access
Text analytics enables results that suit the information and
the user, e.g., answers –
Text Analytics Introduction
2013 Text Analytics Summit
17
Text Data Mining Enables Content Exploration
Decisive Analytics
http://www.dac.us/
Text Analytics Introduction
2013 Text Analytics Summit
18
Text Analytics Definition
Text analytics automates what researchers, writers,
scholars, and all the rest of us have been doing for years.
Text analytics –
Applies linguistic and/or statistical techniques to extract
concepts and patterns that can be applied to categorize and
classify documents, audio, video, images.
Transforms “unstructured” information into data for
application of traditional analysis techniques.
Discerns meaning and relationships in large volumes of
information that were previously unprocessable by
computer.
Text Analytics Introduction
2013 Text Analytics Summit
19
Glossary: Information in Text
Entity: Typically a name (person, place, organization, etc.) or
a patterned composite (phone number, e-mail address).
Concept: An abstract entity or collection of entities.
Feature: An element of interest, e.g., an entity, concept,
topic, event, fact, relationship, etc.
Metadata: Descriptive information such as author, title,
publication date, file type and size.
Information Extraction (IE) involves pulling metadata,
features & their attributes out of textual sources.
Co-reference: Multiple expressions that describe the same
thing. Anaphora including pronoun use is an example:
John pushed Max. He fell.
John pushed Max. He laughed.
-- Laure Vieu and Patrick Saint-Dizier
Text Analytics Introduction
2013 Text Analytics Summit
20
Glossary: Methods
Natural Language Processing (NLP): Computers hear
humans.
Parsing: Evaluating the content of a document or text.
Tokenization: Identification of distinct elements, e.g.,
words, punctuation marks, n-grams.
Stemming/Lemmatization: Reducing word variants
(conjugation, declension, case, pluralization) to bases.
Term reduction: Use of synonyms, taxonomy, similarity
measures to group like terms.
Tagging: Wrapping XML tags around distinct features, a.k.a.
text augmentation. May involve text enrichment.
POS Tagging: Specifically identifying parts of speech via
syntactic analysis.
Text Analytics Introduction
2013 Text Analytics Summit
21
Glossary: Organizing and Structuring
Categorization: Specification of feature/doc groupings.
Clustering: Creating categories according to outcome-
similarity criteria.
Taxonomy: An exhaustive, hierarchical categorization of
entities and concepts, either specified or generated by
clustering.
Classification: Assigning an item to a category, perhaps
using a taxonomy.
Ontology : In practice, a classification of a set of items in a
way that represents knowledge.
An oak is a tree. A rose is a flower. A deer is an animal. A sparrow is a
bird. Russia is our fatherland. Death is inevitable.
-- P. Smirnovskii, A Textbook of Russian Grammar
Semantics relates features to others.
Text Analytics Introduction
2013 Text Analytics Summit
22
Glossary: Evaluation
Precision: The proportion of decisions (e.g., classifications)
that are correct.
Recall: The proportion of actual correct decisions (e.g.,
classifications) relative to the total number of correct
decisions.
Find the even numbers:
9 17 12 4 1 6 2 20 7 3 8 10
What is my Precision? What is my Recall?
Accuracy: How well an IE or IR task has been performed,
computed as an F-score weighting Precision & Recall,
typically:
f = 2*(precision * recall) / (precision + recall)
Relevance: Do results match the individual user’s needs?
Text Analytics Introduction
2013 Text Analytics Summit
23
Text Analytics Pipeline
Typical steps in text analytics include –
1. Identify and retrieve documents for analysis.
2. Apply statistical &/ linguistic &/ structural techniques to
discern, tag, and extract entities, concepts, relationships,
and events (features) within document sets.
3. Apply statistical pattern-matching & similarity techniques
to classify documents and organize extracted features
according to a specified or generated categorization /
taxonomy.
– via a pipeline of statistical & linguistic steps.
Let’s look at them, at steps to model text...
Text Analytics Introduction
2013 Text Analytics Summit
24
Modelling Text
Metadata.
E.g., title, author, date.
Statistics.
E.g., term frequency, co-occurrence, proximity.
Linguistics.
Lexicons, gazetteers, phrase books.
Word morphology, parts of speech, syntactic rules.
Semantic networks.
Larger-scale structure including discourse.
Machine learning.
An Introduction to Text Analytics: 2013 Workshop presentation
An Introduction to Text Analytics: 2013 Workshop presentation
Text Analytics Introduction
2013 Text Analytics Summit
27
“Statistical information derived from word frequency and distribution is
used by the machine to compute a relative measure of significance, first
for individual words and then for sentences. Sentences scoring highest in
significance are extracted and printed out to become the auto-abstract.”
-- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958.
http://wordle.net
Text Analytics Introduction
2013 Text Analytics Summit
28
Modelling Text
The text content of a document can be considered an
unordered “bag of words.”
Particular documents are points in a high-dimensional vector
space.
Salton, Wong &
Yang, “A Vector
Space Model for
Automatic
Indexing,”
November 1975.
Text Analytics Introduction
2013 Text Analytics Summit
29
Modelling Text
We might construct a document-term matrix...
D1 = “I like databases”
D2 = “I hate hate databases”
and use a weighting such as TF-IDF (term frequency–inverse
document frequency)…
in computing the cosine of the angle between weighted
doc-vectors to determine similarity.
I like hate databases
D1 1 1 0 1
D2 1 0 2 1
http://en.wikipedia.org/wiki/Term-document_matrix
Text Analytics Introduction
2013 Text Analytics Summit
30
Modelling Text
Analytical methods make text tractable.
Latent semantic indexing utilizing singular value
decomposition for term reduction / feature selection.
Creates a new, reduced concept space.
Takes care of synonymy, polysemy, stemming, etc.
Classification technologies / methods:
Naive Bayes.
Support Vector Machine.
K-nearest neighbor.
Text Analytics Introduction
2013 Text Analytics Summit
31
Modelling Text
In the form of query-document similarity, this is Information
Retrieval 101.
See, for instance, Salton & Buckley, “Term-Weighting
Approaches in Automatic Text Retrieval,” 1988.
A useful basic tech paper: Russ Albright, SAS, “Taming Text
with the SVD,” 2004.
Given the complexity of human language, statistical models
may fall short.
“Reading from text in general is a hard problem, because it
involves all of common sense knowledge.”
-- Expert systems pioneer Edward A. Feigenbaum
Text Analytics Introduction
2013 Text Analytics Summit
32
“Tri-grams” here
are pretty good at
describing the
Whatness of the
source text. Yet...
“This rather unsophisticated argument on ‘significance’
avoids such linguistic implications as grammar and syntax...
No attention is paid to the logical and semantic
relationships the author has established.”
-- Hans Peter Luhn, 1958
Text Analytics Introduction
2013 Text Analytics Summit
33
New York Times,
September 8, 1957
Anaphora /
coreference:
“They”
Text Analytics Introduction
2013 Text Analytics Summit
34
Advanced Term Counting
Counting term hits, in one source, at the doc level, doesn’t
take you far...
Good or bad? What’s behind the posts?
Text Analytics Introduction
2013 Text Analytics Summit
35
Why Do We Need Linguistics?
To get more out of text than can be delivered by a
bag/vector of words and term counting.
The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard
& Poor's 500 index gained1.44, or 0.11 percent, to 1,263.85.
The Dow gained 46.58, or 0.42 percent, to 11,002.14. The
Standard & Poor's 500 index fell 1.44, or 0.11 percent, to
1,263.85.
-- Luca Scagliarini, Expert System
Time flies like an arrow. Fruit flies like a banana.
-- Groucho Marx
(Statistical co-occurrence to build a model for analysis of
text such as these is possible but still limited.)
Text Analytics Introduction
2013 Text Analytics Summit
36
Parts of Speech
Text Analytics Introduction
2013 Text Analytics Summit
37
Parts of Speech
Text Analytics Introduction
2013 Text Analytics Summit
38
Parts of Speech
Text Analytics Introduction
2013 Text Analytics Summit
39
From POS to Relationships
When we under-
stand, for instance,
parts of speech
(POS), e.g. –
<subject> <verb>
<object>
– we’re in a position
to discern facts and
relationships...
Semantic networks
such as WordNet
are an asset for
word-sense
disambiguation.
An Introduction to Text Analytics: 2013 Workshop presentation
Text Analytics Introduction
2013 Text Analytics Summit
41
Tagging with GATE
Annotation (tagging) in action via GATE, an open-source
tool:
Text Analytics Introduction
2013 Text Analytics Summit
42
GATE Language Processing Pipeline
Text Analytics Introduction
2013 Text Analytics Summit
43
GATE Text Annotation
Text Analytics Introduction
2013 Text Analytics Summit
44
GATE Exported XML
<?xml version='1.0' encoding='windows-1252'?>
<GateDocument>
<!-- The document's features-->
<GateDocumentFeatures>
<Feature>
<Name className="java.lang.String">MimeType</Name>
<Value className="java.lang.String">text/html</Value>
</Feature>
<Feature>
<Name className="java.lang.String">gate.SourceURL</Name>
<Value className="java.lang.String">http://altaplana.com/SentimentAnalysis.html</Value>
</Feature>
</GateDocumentFeatures>
<!-- The document content area with serialized nodes -->
<TextWithNodes><Node id="0" />Sentiment<Node id="9" /> <Node id="10" />Analysis<Node id="18"
/>:<Node id="19" /> <Node id="20" />A<Node id="21" /> <Node id="22" />Focus<Node id="27" /> <Node
id="28" />on<Node id="30" /> <Node id="31" />Applications<Node id="43" />
<Node id="44" />
<Node id="45" />by<Node id="47" /> <Node id="48" />Seth<Node id="52" /> <Node id="53"
/>Grimes<Node id="59" /> [material cut]
</TextWithNodes>
<!-- The default annotation set -->
<AnnotationSet> [material cut]
<Annotation Id="67" Type="Token" StartNode="48" EndNode="52">
<Feature><Name className="java.lang.String">length</Name><Value
className="java.lang.String">4</Value></Feature>
<Feature><Name className="java.lang.String">category</Name><Value
className="java.lang.String">NNP</Value></Feature>
<Feature><Name className="java.lang.String">orth</Name><Value
className="java.lang.String">upperInitial</Value></Feature>
<Feature><Name className="java.lang.String">kind</Name><Value
className="java.lang.String">word</Value></Feature>
<Feature><Name className="java.lang.String">string</Name><Value
className="java.lang.String">Seth</Value></Feature>
</Annotation> [material cut]
</AnnotationSet>
</GateDocument>
Text Analytics Introduction
2013 Text Analytics Summit
45
Information Extraction
For content analysis, key in on extracting information.
Text features are typically marked up (annotated) in-place
with XML.
Entities and concepts may correspond to dimensions in a
standard BI model.
Both classes of object are hierarchically organized and have
attributes.
We can have both discovered and predetermined classifications
(taxonomies) of text features.
Dimensional modelling facilitates extraction to
databases...
Text Analytics Introduction
2013 Text Analytics Summit
46
Database Insert
Illustrated via an IBM example:
“The standard features are stored in the STANDARD_KW table,
keywords with their occurrences in the KEYWORD_KW_OCC
table, and the text list features in the TEXTLIST_TEXT table.
Every feature table contains the DOC_ID as a reference to the
DOCUMENT table.”
Text Analytics Introduction
2013 Text Analytics Summit
47
Sophisticated Pattern Matching
Lexicons and language rules boost accuracy. An example –
a bit complicated – GATE Extension via JAPE Rules...
/* locationcontext2.jape
* Dhavalkumar Thakker, Nottingham Trent University/PA Photos 15 Sept 2008*/
Phase: locationcontext2
Input: Lookup Token
Options: control = all debug = false
//Manchester, UK
Rule: locationcontext2
Priority:50
({Token.string == "at"})
( ({Token.string =~ "[Tt]he"})?
(
(
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
({Token.kind == punctuation})?
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
({Token.kind == punctuation})?
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ) |
( {Token.kind == word, Token.category == NNP, Token.orth == upperInitial}
({Token.kind == punctuation})?
( {Token.kind == word, Token.category == NNP, Token.orth == allCaps} |
{Token.kind == word, Token.category == NNP, Token.orth == upperInitial} )
) |
...
Text Analytics Introduction
2013 Text Analytics Summit
48
Predictive Modeling
Another processing pipeline and more rules…
Text Analytics Introduction
2013 Text Analytics Summit
49
Predictive Modeling
In the text context, predictive analytics is mostly about
classification and automated processing.
Modeling also helps, operationally, with:
Completion
Disambiguation
: use dictionaries, context
Error correction
http://en.wikipedia.org/wiki/File:
ITap_on_Motorola_C350.jpg
Text Analytics Introduction
2013 Text Analytics Summit
50
Error Correction
“Search logs suggest that from 10-15% of queries contain
spelling or typographical errors. Fittingly, one important
query reformulation tool is spelling suggestions or
corrections.”
-- Marti Hearst, Search User Interfaces
Text Analytics Introduction
2013 Text Analytics Summit
51
Accuracy and Semi-Structured Sources
An e-mail message is “semi-structured,” which facilitates
extracting metadata --
Date: Sun, 13 Mar 2005 19:58:39 -0500
From: Adam L. Buchsbaum <alb@research.att.com>
To: Seth Grimes <grimes@altaplana.com>
Subject: Re: Papers on analysis on streaming data
seth, you should contact divesh srivastava,
divesh@research.att.com regarding at&t labs data streaming
technology.
Adam
“Reading from text in structured domains I don’t think is as
hard.”
-- Edward A. Feigenbaum
Surveys are also typically s-s in a different way...
Text Analytics Introduction
2013 Text Analytics Summit
52
Structured &‘Unstructured’
The respondent is invited to explain his/her attitude:
Text Analytics Introduction
2013 Text Analytics Summit
53
Structured &‘Unstructured’
We typically look at frequencies and distributions of coded-
response questions:
Linkage of responses to
coded ratings helps in
analyses of free text:
Text Analytics Introduction
2013 Text Analytics Summit
54
“Sentiment analysis is the task of identifying positive and
negative opinions, emotions, and evaluations.”
-- Wilson, Wiebe & Hoffman, 2005, “Recognizing Contextual
Polarity in Phrase-Level Sentiment Analysis”
“Sentiment analysis or opinion mining is the computational
study of opinions, sentiments and emotions expressed in
text… An opinion on a feature f is a positive or negative
view, attitude, emotion or appraisal on f from an opinion
holder.”
-- Bing Liu, 2010, “Sentiment Analysis and Subjectivity,” in Handbook
of Natural Language Processing
“Dell really... REALLY need to stop overcharging... and when i
say overcharing... i mean atleast double what you would
pay to pick up the ram yourself.”
-- From Dell’s IdeaStorm.com
Sentiment Analysis
Text Analytics Introduction
2013 Text Analytics Summit
55
What are people saying? What’s hot/trending?
What are they saying about {topic|person|product} X?
... about X versus {topic|person|product} Y?
How has opinion about X and Y evolved?
How has opinion correlated with {our|competitors’|general}
{news|marketing|sales|events}?
What’s behind opinion, the root causes?
(How) Can we link opinions & transactions?
(How) Can we link opinion & intent?
Who are opinion leaders?
How does sentiment propagate across channels?
Sentiment Analysis
Text Analytics Introduction
2013 Text Analytics Summit
56
Sentiment Analysis
Applications include:
Brand / Reputation Management.
Competitive intelligence.
Customer Experience Management.
Enterprise Feedback Management.
Quality improvement.
Trend spotting.
There are significant challenges…
An Introduction to Text Analytics: 2013 Workshop presentation
Text Analytics Introduction
2013 Text Analytics Summit
58
Complications Illustrated
Unfiltered
duplicates
External
reference
“Kind” =
type, variety,
not a
sentiment.
Complete
misclassification
Text Analytics Introduction
2013 Text Analytics Summit
59
Sentiment Complications
There are many complications.
Sentiment may be of interest at multiple levels.
Corpus / data space, i.e., across multiple sources.
Document.
Statement / sentence.
Entity / topic / concept.
Human language is noisy and chaotic!
Jargon, slang, irony, ambiguity, anaphora, polysemy, synonymy,
etc.
Context is key. Discourse analysis comes into play.
Must distinguish the sentiment holder from the object:
“Geithner said the recession may worsen.”
Text Analytics Introduction
2013 Text Analytics Summit
60
Beyond Polarity
Text Analytics Introduction
2013 Text Analytics Summit
61
Intent Analysis
http://www.aiaioo.com/whitepapers/intention_analysis_use_cases.pdf
http://sentibet.com/
Text Analytics Introduction
2013 Text Analytics Summit
62
Applications
Text analytics has applications in –
• Intelligence & law enforcement.
• Life sciences.
• Media & publishing including social-media analysis and
contextual advertizing.
• Competitive intelligence.
• Voice of the Customer: CRM, product management &
marketing.
• Legal, tax & regulatory (LTR) including compliance.
• Recruiting.
Text Analytics Introduction
2013 Text Analytics Summit
63
Online Commerce
Text analytics is applied for marketing, search optimization,
competitive intelligence.
Analyze social media and enterprise feedback to understand
the Voice of the Market:
• Opportunities
• Threats
• Trends
Categorize product and service offerings for on-site search
and faceted navigation and to enrich content delivery.
Annotate pages to enhance Web-search findability, ranking.
Scrape competitor sites for offers and pricing.
Analyze social and news media for competitive information.
Text Analytics Introduction
2013 Text Analytics Summit
64
Voice of the Customer
Text analytics is applied to enhance customer service and
satisfaction.
Analyze customer interactions and opinions –
• E-mail, contact-center notes, survey responses
• Forum & blog posting and other social media
– to –
• Address customer product & service issues
• Improve quality
• Manage brand & reputation
If you can link qualitative information from text you can –
• Link feedback to transactions
• Assess customer value
• Understand root causes
• Mine data for measures such as churn likelihood
Text Analytics Introduction
2013 Text Analytics Summit
65
E-Discovery and Compliance
Text analytics is applied for compliance, fraud and risk, and
e-discovery.
Regulatory mandates and corporate practices dictate –
• Monitoring corporate communications
• Managing electronic stored information for production in
event of litigation
Sources include e-mail (!!), news, social media
Risk avoidance and fraud detection are key to effective
decision making
• Text analytics mines critical data from unstructured sources
• Integrated text-transactional analytics provides rich insights
Text Analytics Introduction
2013 Text Analytics Summit
66
Text Analytics Introduction
2013 Text Analytics Summit
67
http://altaplana.com/TA2011
Text Analytics Introduction
2013 Text Analytics Summit
68
8%
21%
25%
27%
36%
21%
34%
35%
44%
47%
21%
22%
23%
27%
29%
30%
35%
35%
41%
62%
0% 10% 20% 30% 40% 50% 60% 70%
text messages/SMS/chat
Web-site feedback
contact-center notes or transcripts
scientific or technical literature
e-mail and correspondence
review sites or forums
customer/market surveys
on-line forums
news articles
blogs and other social media
What textual information are you analyzing or do you plan to
analyze?
2011 (n=215)
2009 (n=100)
Text Analytics Introduction
2013 Text Analytics Summit
69
Text Analytics Introduction
2013 Text Analytics Summit
70
Text Analytics Introduction
2013 Text Analytics Summit
71
Getting Started
A best practices approach…
Assess:
• Assess business goals.
• Understand information sources.
• Consult and educate stakeholders.
Evaluate:
• Evaluate installed, hosted/SaaS, database-integrated options.
• Determine performance and business requirements.
• Match methods to goals, sources, and work practices.
Implement:
• Start with basic functions such as search, modest goals, or
with a single information source.
• Go for clear wins to gain support.
• Build out applications, capacity, BI/research integration.
Text Analytics Introduction
2013 Text Analytics Summit
72
Software & Platform Options
Text-analytics options may be grouped in general classes.
• Installed text-analysis application, whether desktop or
server or deployed in-database.
• Data mining workbench.
• Hosted.
• Programming tool.
• As-a-service, via an application programming interface
(API).
• Code library or component of a business/vertical
application, for instance for CRM, e-discovery, search.
Text analytics is frequently embedded in search or other
end-user applications.
The slides that follow next will present leading options in
each category except Hosted…
Text Analytics Introduction
2013 Text Analytics Summit
73
Text Analysis Applications
Vendors:
Attensity, Clarabridge, Daedalus, Digital Reasoning, Expert
System, Fido Labs, IBM, Linguamatics, Medallia, NetBase,
Open Text (Nstein), Provalis Research, SAP, SAS, SRA
NetOwl, Sysomos, Temis.
Typical uses:
Customer experience management (CEM), survey analysis,
social-media analysis, law enforcement, life sciences.
Typical characteristics:
Interface that allows the user to configure a processing
pipeline.
Interface for text exploration and visualization.
Export to databases
Text Analytics Introduction
2013 Text Analytics Summit
74
Data Mining Workbench
Vendors:
IBM SPSS Modeler, Megaputer PolyAnalyst, RapidMiner, SAS
Text Analytics, WEKA.
Typical uses:
Customer experience management (CEM), marketing
analytics, survey analysis, social-media analysis, law
enforcement.
Predictive modeling.
Typical characteristics:
Same as text-analysis applications, but with more
sophisticated modeling and analysis capabilities.
Text Analytics Introduction
2013 Text Analytics Summit
75
Programming/Development Tool
Vendors:
GATE, OpenNLP, Python NLTK, R, Stanford NLP – open
source.
NooJ – free, non-open source.
Typical uses:
Language modeling.
Data exploration.
Up to the programmer.
Typical characteristics:
Text is an add-in to a programming language/environment.
Text Analytics Introduction
2013 Text Analytics Summit
76
As a Service, via API
Vendors:
Alchemy API, Clarabridge, Open Amplify, Pingar, Saplo,
Semantria (Lexalytics), Thomson Reuters Calais, others.
Typical uses:
Annotation and content enrichment with the application
domain up to the user.
Typical characteristics:
Relies on remote, server-resident processing resources.
May or, more likely, may not be end-user customizable.
Text Analytics Introduction
2013 Text Analytics Summit
77
Code Library or Annotation Engine
Vendors:
Alias-I LingPipe, GATE.
Basis Technology Rosette, Lexalytics Salience, SAP Inxight,
SAS Teragram, TEMIS Luxid.
Typical uses:
Information extraction in support of business applications.
Typical characteristics:
Same as text-analysis applications, but with more
sophisticated modeling and analysis capabilities.
Text Analytics Introduction
2013 Text Analytics Summit
78
The Crowdsourcing Alternative (example)
Text Analytics Introduction
2013 Text Analytics Summit
79
http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html
http://hpccsystems.com/ (GNU Affero GPL)
A Big Data Platform (example)
Text Analytics Introduction
2013 Text Analytics Summit
80
Text Analytics Introduction
2013 Text Analytics Summit
81
(Accessible) Data Everywhere
Text Analytics Introduction
2013 Text Analytics Summit
82
http://open.blogs.nytimes.com/2012/02/16/rnews-is-
here-and-this-is-what-it-means/
<div itemscope itemtype="http://schema.org/Organization">
<span itemprop="name">Google.org (GOOG)</span>
Contact Details:
<div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress">
Main address:
<span itemprop="streetAddress">38 avenue de l'Opera</span>
<span itemprop="postalCode">F-75002</span>
<span itemprop="addressLocality">Paris, France</span> ,
</div>
Tel:<span itemprop="telephone">( 33 1) 42 68 53 00 </span>,
Fax:<span itemprop="faxNumber">( 33 1) 42 68 53 01 </span>,
E-mail: <span itemprop="email">secretariat(at)google.org</span>
</div>
http://schema.org/Organization
Structure Matters
http://img.freebase.com/api/trans/raw/m/02dtnzv
http://www.cambridgesemantics.com/se
mantic-university/semantic-search-and-
the-semantic-web
An Introduction to Text Analytics:
Social, Online, and Enterprise
Seth Grimes
Alta Plana Corporation
@sethgrimes
Text & Social Analytics Summit 2013
Workshop
June 4, 2013

Weitere ähnliche Inhalte

Was ist angesagt?

Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics
 
Text Analytics Today
Text Analytics TodayText Analytics Today
Text Analytics TodaySeth Grimes
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewSeth Grimes
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and VisualizationSeth Grimes
 
Popular Text Analytics Algorithms
Popular Text Analytics AlgorithmsPopular Text Analytics Algorithms
Popular Text Analytics AlgorithmsPromptCloud
 
Getting Started with Unstructured Data
Getting Started with Unstructured DataGetting Started with Unstructured Data
Getting Started with Unstructured DataChristine Connors
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsDerek Kane
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and boltsNBER
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis ReportAbanoub Amgad
 
Directed versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysDirected versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysRoy Clariana
 
How many truths can you handle?
How many truths can you handle?How many truths can you handle?
How many truths can you handle?Panos Alexopoulos
 
How to be successful with search in your organisation
How to be successful with search in your organisationHow to be successful with search in your organisation
How to be successful with search in your organisationvoginip
 
project sentiment analysis
project sentiment analysisproject sentiment analysis
project sentiment analysissneha penmetsa
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social networkChanon Hongsirikulkit
 
Applications: Prediction
Applications: PredictionApplications: Prediction
Applications: PredictionNBER
 
Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...
Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...
Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...Gan Keng Hoon
 
ACIS 2015 Bibliographical-based Facets for Expertise Search
ACIS 2015 Bibliographical-based Facets for Expertise SearchACIS 2015 Bibliographical-based Facets for Expertise Search
ACIS 2015 Bibliographical-based Facets for Expertise SearchGan Keng Hoon
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using Rsantoshi mangalgi
 

Was ist angesagt? (20)

Lexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text AnalyticsLexalytics Text Analytics Workshop: Perfect Text Analytics
Lexalytics Text Analytics Workshop: Perfect Text Analytics
 
Text Analytics Today
Text Analytics TodayText Analytics Today
Text Analytics Today
 
Text Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry ViewText Analytics Past, Present & Future: An Industry View
Text Analytics Past, Present & Future: An Industry View
 
Text Mining and Visualization
Text Mining and VisualizationText Mining and Visualization
Text Mining and Visualization
 
Popular Text Analytics Algorithms
Popular Text Analytics AlgorithmsPopular Text Analytics Algorithms
Popular Text Analytics Algorithms
 
Getting Started with Unstructured Data
Getting Started with Unstructured DataGetting Started with Unstructured Data
Getting Started with Unstructured Data
 
Data Science - Part XI - Text Analytics
Data Science - Part XI - Text AnalyticsData Science - Part XI - Text Analytics
Data Science - Part XI - Text Analytics
 
Nuts and bolts
Nuts and boltsNuts and bolts
Nuts and bolts
 
Datapedia Analysis Report
Datapedia Analysis ReportDatapedia Analysis Report
Datapedia Analysis Report
 
Directed versus undirected network analysis of student essays
Directed versus undirected network analysis of student essaysDirected versus undirected network analysis of student essays
Directed versus undirected network analysis of student essays
 
How many truths can you handle?
How many truths can you handle?How many truths can you handle?
How many truths can you handle?
 
Mind the Semantic Gap
Mind the Semantic GapMind the Semantic Gap
Mind the Semantic Gap
 
How to be successful with search in your organisation
How to be successful with search in your organisationHow to be successful with search in your organisation
How to be successful with search in your organisation
 
project sentiment analysis
project sentiment analysisproject sentiment analysis
project sentiment analysis
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social network
 
Applications: Prediction
Applications: PredictionApplications: Prediction
Applications: Prediction
 
Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...
Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...
Project: Interfacing Chatbot with Data Retrieval and Analytics Queries for De...
 
ACIS 2015 Bibliographical-based Facets for Expertise Search
ACIS 2015 Bibliographical-based Facets for Expertise SearchACIS 2015 Bibliographical-based Facets for Expertise Search
ACIS 2015 Bibliographical-based Facets for Expertise Search
 
D018212428
D018212428D018212428
D018212428
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using R
 

Ähnlich wie An Introduction to Text Analytics: 2013 Workshop presentation

Analysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataAnalysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataSeth Grimes
 
The Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language ProcessingThe Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language ProcessingWaqas Tariq
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text AnalyticsRushikeshChikane2
 
Synthesys Technical Overview
Synthesys Technical OverviewSynthesys Technical Overview
Synthesys Technical OverviewDigital Reasoning
 
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big DataIRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big DataIRJET Journal
 
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYA DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYijaia
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentThe Digital Group
 
Paper id 26201475
Paper id 26201475Paper id 26201475
Paper id 26201475IJRAT
 
Decision Support for E-Governance: A Text Mining Approach
Decision Support for E-Governance: A Text Mining ApproachDecision Support for E-Governance: A Text Mining Approach
Decision Support for E-Governance: A Text Mining ApproachIJMIT JOURNAL
 
Bioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalBioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalDr. Rupak Chakravarty
 
12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content Analytics12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content AnalyticsSeth Grimes
 
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...IJERA Editor
 
Language First Protocol from QSi
Language First Protocol from QSiLanguage First Protocol from QSi
Language First Protocol from QSiJohn O'Gorman
 
Text Analytics for Semantic Computing
Text Analytics for Semantic ComputingText Analytics for Semantic Computing
Text Analytics for Semantic ComputingMeena Nagarajan
 
Frame-Script and Predicate logic.pptx
Frame-Script and Predicate logic.pptxFrame-Script and Predicate logic.pptx
Frame-Script and Predicate logic.pptxnilesh405711
 
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTION
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTIONSPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTION
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTIONNexgen Technology
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
 

Ähnlich wie An Introduction to Text Analytics: 2013 Workshop presentation (20)

Analysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ DataAnalysis of ‘Unstructured’ Data
Analysis of ‘Unstructured’ Data
 
The Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language ProcessingThe Process of Information extraction through Natural Language Processing
The Process of Information extraction through Natural Language Processing
 
Social Media and Text Analytics
Social Media and Text AnalyticsSocial Media and Text Analytics
Social Media and Text Analytics
 
Synthesys Technical Overview
Synthesys Technical OverviewSynthesys Technical Overview
Synthesys Technical Overview
 
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big DataIRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
IRJET-A Review on Topic Detection and Term-Term Relation Analysis in Big Data
 
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEYA DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
A DECADE OF USING HYBRID INFERENCE SYSTEMS IN NLP (2005 – 2015): A SURVEY
 
Empowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic EnrichmentEmpowering Search Through 3RDi Semantic Enrichment
Empowering Search Through 3RDi Semantic Enrichment
 
Paper id 26201475
Paper id 26201475Paper id 26201475
Paper id 26201475
 
IJET-V3I2P23
IJET-V3I2P23IJET-V3I2P23
IJET-V3I2P23
 
The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...The impact of standardized terminologies and domain-ontologies in multilingua...
The impact of standardized terminologies and domain-ontologies in multilingua...
 
Decision Support for E-Governance: A Text Mining Approach
Decision Support for E-Governance: A Text Mining ApproachDecision Support for E-Governance: A Text Mining Approach
Decision Support for E-Governance: A Text Mining Approach
 
Bioinformatioc: Information Retrieval
Bioinformatioc: Information RetrievalBioinformatioc: Information Retrieval
Bioinformatioc: Information Retrieval
 
12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content Analytics12 Things the Semantic Web Should Know about Content Analytics
12 Things the Semantic Web Should Know about Content Analytics
 
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
Dynamic & Attribute Weighted KNN for Document Classification Using Bootstrap ...
 
Nordic health data metadata
Nordic health data   metadataNordic health data   metadata
Nordic health data metadata
 
Language First Protocol from QSi
Language First Protocol from QSiLanguage First Protocol from QSi
Language First Protocol from QSi
 
Text Analytics for Semantic Computing
Text Analytics for Semantic ComputingText Analytics for Semantic Computing
Text Analytics for Semantic Computing
 
Frame-Script and Predicate logic.pptx
Frame-Script and Predicate logic.pptxFrame-Script and Predicate logic.pptx
Frame-Script and Predicate logic.pptx
 
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTION
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTIONSPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTION
SPIRIT: A TREE KERNEL-BASED METHOD FOR TOPIC PERSON INTERACTION DETECTION
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 

Mehr von Seth Grimes

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingSeth Grimes
 
Creating an AI Startup: What You Need to Know
Creating an AI Startup: What You Need to KnowCreating an AI Startup: What You Need to Know
Creating an AI Startup: What You Need to KnowSeth Grimes
 
NLP 2020: What Works and What's Next
NLP 2020: What Works and What's NextNLP 2020: What Works and What's Next
NLP 2020: What Works and What's NextSeth Grimes
 
Efficient Deep Learning in Natural Language Processing Production, with Moshe...
Efficient Deep Learning in Natural Language Processing Production, with Moshe...Efficient Deep Learning in Natural Language Processing Production, with Moshe...
Efficient Deep Learning in Natural Language Processing Production, with Moshe...Seth Grimes
 
From Customer Emotions to Actionable Insights, with Peter Dorrington
From Customer Emotions to Actionable Insights, with Peter DorringtonFrom Customer Emotions to Actionable Insights, with Peter Dorrington
From Customer Emotions to Actionable Insights, with Peter DorringtonSeth Grimes
 
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AIIntro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AISeth Grimes
 
Text Analytics Market Trends
Text Analytics Market TrendsText Analytics Market Trends
Text Analytics Market TrendsSeth Grimes
 
Text Analytics for NLPers
Text Analytics for NLPersText Analytics for NLPers
Text Analytics for NLPersSeth Grimes
 
Our FinTech Future – AI’s Opportunities and Challenges?
Our FinTech Future – AI’s Opportunities and Challenges? Our FinTech Future – AI’s Opportunities and Challenges?
Our FinTech Future – AI’s Opportunities and Challenges? Seth Grimes
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Seth Grimes
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...Seth Grimes
 
Fairness in Machine Learning and AI
Fairness in Machine Learning and AIFairness in Machine Learning and AI
Fairness in Machine Learning and AISeth Grimes
 
Classification with Memes–Uber case study
Classification with Memes–Uber case studyClassification with Memes–Uber case study
Classification with Memes–Uber case studySeth Grimes
 
Aspect Detection for Sentiment / Emotion Analysis
Aspect Detection for Sentiment / Emotion AnalysisAspect Detection for Sentiment / Emotion Analysis
Aspect Detection for Sentiment / Emotion AnalysisSeth Grimes
 
Content AI: From Potential to Practice
Content AI: From Potential to PracticeContent AI: From Potential to Practice
Content AI: From Potential to PracticeSeth Grimes
 
An Industry Perspective on Subjectivity, Sentiment, and Social
An Industry Perspective on Subjectivity, Sentiment, and SocialAn Industry Perspective on Subjectivity, Sentiment, and Social
An Industry Perspective on Subjectivity, Sentiment, and SocialSeth Grimes
 
The Insight Value of Social Sentiment
The Insight Value of Social SentimentThe Insight Value of Social Sentiment
The Insight Value of Social SentimentSeth Grimes
 
Text Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersText Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersSeth Grimes
 
Social Data Sentiment Analysis
Social Data Sentiment AnalysisSocial Data Sentiment Analysis
Social Data Sentiment AnalysisSeth Grimes
 

Mehr von Seth Grimes (20)

Recent Advances in Natural Language Processing
Recent Advances in Natural Language ProcessingRecent Advances in Natural Language Processing
Recent Advances in Natural Language Processing
 
Creating an AI Startup: What You Need to Know
Creating an AI Startup: What You Need to KnowCreating an AI Startup: What You Need to Know
Creating an AI Startup: What You Need to Know
 
NLP 2020: What Works and What's Next
NLP 2020: What Works and What's NextNLP 2020: What Works and What's Next
NLP 2020: What Works and What's Next
 
Efficient Deep Learning in Natural Language Processing Production, with Moshe...
Efficient Deep Learning in Natural Language Processing Production, with Moshe...Efficient Deep Learning in Natural Language Processing Production, with Moshe...
Efficient Deep Learning in Natural Language Processing Production, with Moshe...
 
From Customer Emotions to Actionable Insights, with Peter Dorrington
From Customer Emotions to Actionable Insights, with Peter DorringtonFrom Customer Emotions to Actionable Insights, with Peter Dorrington
From Customer Emotions to Actionable Insights, with Peter Dorrington
 
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AIIntro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
Intro to Deep Learning for Medical Image Analysis, with Dan Lee from Dentuit AI
 
Emotion AI
Emotion AIEmotion AI
Emotion AI
 
Text Analytics Market Trends
Text Analytics Market TrendsText Analytics Market Trends
Text Analytics Market Trends
 
Text Analytics for NLPers
Text Analytics for NLPersText Analytics for NLPers
Text Analytics for NLPers
 
Our FinTech Future – AI’s Opportunities and Challenges?
Our FinTech Future – AI’s Opportunities and Challenges? Our FinTech Future – AI’s Opportunities and Challenges?
Our FinTech Future – AI’s Opportunities and Challenges?
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
 
Fairness in Machine Learning and AI
Fairness in Machine Learning and AIFairness in Machine Learning and AI
Fairness in Machine Learning and AI
 
Classification with Memes–Uber case study
Classification with Memes–Uber case studyClassification with Memes–Uber case study
Classification with Memes–Uber case study
 
Aspect Detection for Sentiment / Emotion Analysis
Aspect Detection for Sentiment / Emotion AnalysisAspect Detection for Sentiment / Emotion Analysis
Aspect Detection for Sentiment / Emotion Analysis
 
Content AI: From Potential to Practice
Content AI: From Potential to PracticeContent AI: From Potential to Practice
Content AI: From Potential to Practice
 
An Industry Perspective on Subjectivity, Sentiment, and Social
An Industry Perspective on Subjectivity, Sentiment, and SocialAn Industry Perspective on Subjectivity, Sentiment, and Social
An Industry Perspective on Subjectivity, Sentiment, and Social
 
The Insight Value of Social Sentiment
The Insight Value of Social SentimentThe Insight Value of Social Sentiment
The Insight Value of Social Sentiment
 
Text Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and ProvidersText Analytics 2014: User Perspectives on Solutions and Providers
Text Analytics 2014: User Perspectives on Solutions and Providers
 
Social Data Sentiment Analysis
Social Data Sentiment AnalysisSocial Data Sentiment Analysis
Social Data Sentiment Analysis
 

Kürzlich hochgeladen

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding TeamAdam Moalla
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...DianaGray10
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxGDSC PJATK
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxUdaiappa Ramachandran
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaborationbruanjhuli
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationIES VE
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IES VE
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXTarek Kalaji
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsSafe Software
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfJamie (Taka) Wang
 

Kürzlich hochgeladen (20)

9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
Connector Corner: Extending LLM automation use cases with UiPath GenAI connec...
 
Cybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptxCybersecurity Workshop #1.pptx
Cybersecurity Workshop #1.pptx
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability Adventure
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership Blueprint
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
20230104 - machine vision
20230104 - machine vision20230104 - machine vision
20230104 - machine vision
 
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
IESVE Software for Florida Code Compliance Using ASHRAE 90.1-2019
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 

An Introduction to Text Analytics: 2013 Workshop presentation

  • 1. An Introduction to Text Analytics: Social, Online, and Enterprise Seth Grimes Alta Plana Corporation @sethgrimes Text & Social Analytics Summit 2013 Workshop June 4, 2013
  • 2. Text Analytics Introduction 2013 Text Analytics Summit 2 Perspectives Perspective #1: You support research or work in IT. You help end users who have lots of text. Perspective #2: You’re a researcher, business analyst, or other “end user.” You have lots of text. You want an automated way to deal with it. Perspective #3: You work for a solution provider. Perspective #4: Other?
  • 3. Text Analytics Introduction 2013 Text Analytics Summit 3 From the Analytics/Business Perspective 1. If you are not analyzing text – if you're analyzing only transactional information – you're missing opportunity or incurring risk. 2. Text analytics can boost business results – “Organizations embracing text analytics all report having an epiphany moment when they suddenly knew more than before.” -- Philip Russom, the Data Warehousing Institute, 2007 http://tdwi.org/articles/2007/05/09-what-works/bi-search-and-text-analytics.aspx – via established BI / data-mining programs, or independently.
  • 4. Text Analytics Introduction 2013 Text Analytics Summit 4 Agenda 1. The “Unstructured” Data Challenge. 2. Text analytics for information retrieval and BI. 3. Text analysis technologies and processes. 4. Applications. 5. The market: Software, services, and solutions. Note: I will not cover the agenda in a linear fashion. Product images and references are used to illustrate only. Some slides are included for reference purposes.
  • 5. Text Analytics Introduction 2013 Text Analytics Summit 5 Value in Text It’s a truism that 80% of enterprise-relevant information originates in “unstructured” form: E-mail and messages. Web pages, news & blog articles, forum postings, and other social media. Contact-center notes and transcripts. Surveys, feedback forms, warranty claims. Scientific literature, books, legal documents. ... http://breakthroughanalysis.com/2008/08/01/unstructured-data-and-the-80-percent-rule/ Non-text “unstructured” content? Images Audio including speech Video Value derives from patterns.
  • 6. Text Analytics Introduction 2013 Text Analytics Summit 6 Unstructured Sources These sources may contain “traditional” data. The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85. And they may not. www.stanford.edu/%7ernusse/wntwindow.html Axin and Frat1 interact with dvl and GSK, bridging Dvl to GSK in Wnt-mediated regulation of LEF-1. Wnt proteins transduce their signals through dishevelled (Dvl) proteins to inhibit glycogen synthase kinase 3beta (GSK), leading to the accumulation of cytosolic beta-catenin and activation of TCF/LEF-1 transcription factors. To understand the mechanism by which Dvl acts through GSK to regulate LEF-1, we investigated the roles of Axin and Frat1 in Wnt- mediated activation of LEF-1 in mammalian cells. We found that Dvl interacts with Axin and with Frat1, both of which interact with GSK. Similarly, the Frat1 homolog GBP binds Xenopus Dishevelled in an interaction that requires GSK. www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed& cmd=Retrieve&list_uids=10428961&dopt=Abstract
  • 7. Text Analytics Introduction 2013 Text Analytics Summit 7 http://www.tropicalisland.de/NYC_New_York_ Brooklyn_Bridge_from_World_Trade_Center_ b.jpg x(t) = t y(t) = ½ a (et/a + e-t/a) = acosh(t/a) http://en.wikipedia.org/wiki/Seven_Bridges_of_K%C3%B6nigsberg Structure in “unstructured” sourcesStructure in “unstructured” sources
  • 8. Text Analytics Introduction 2013 Text Analytics Summit 8 Business Intelligence Conventional BI feeds off: It runs off: "SUMLEV","STATE","COUNTY","STNAME","CTYNAME","YEAR","POPESTIMATE", 50,19,1,"Iowa","Adair County",1,8243,4036,4207,446,225,221,994,509 50,19,1,"Iowa","Adair County",2,8243,4036,4207,446,225,221,994,509 50,19,1,"Iowa","Adair County",3,8212,4020,4192,442,222,220,987,505 50,19,1,"Iowa","Adair County",4,8095,3967,4128,432,208,224,935,488 50,19,1,"Iowa","Adair County",5,8003,3924,4079,405,186,219,928,495 50,19,1,"Iowa","Adair County",6,7961,3892,4069,384,183,201,907,472 50,19,1,"Iowa","Adair County",7,7875,3855,4020,366,179,187,871,454 50,19,1,"Iowa","Adair County",8,7795,3817,3978,343,162,181,841,439 50,19,1,"Iowa","Adair County",9,7714,3777,3937,338,159,179,805,417 “The bulk of information value is perceived as coming from data in relational tables. The reason is that data that is structured is easy to mine and analyze.” -- Prabhakar Raghavan, Google, formerly Yahoo Research
  • 9. Text Analytics Introduction 2013 Text Analytics Summit 9 Business Intelligence Conventional BI produces:
  • 10. Text Analytics Introduction 2013 Text Analytics Summit 10 Text-BI: Back to the Future Business intelligence (BI) was first defined in 1958: “In this paper, business is a collection of activities carried on for whatever purpose, be it science, technology, commerce, industry, law, government, defense, et cetera... The notion of intelligence is also defined here... as ‘the ability to apprehend the interrelationships of presented facts in such a way as to guide action towards a desired goal.’” -- Hans Peter Luhn “A Business Intelligence System” IBM Journal, October 1958 What was IT like in the ‘50s?
  • 11. Document input and processing Knowledge handling is key Desk Set (1957): Computer engineer Richard Sumner (Spencer Tracy) and television network librarian Bunny Watson (Katherine Hepburn) and the "electronic brain" EMERAC.
  • 12. Text Analytics Introduction 2013 Text Analytics Summit 12 From the Information Retrieval Perspective What do people do with electronic documents? 1. Publish, Manage, and Archive. 2. Index and Search. 3. Categorize and Classify according to metadata & contents. 4. Information Extraction. For textual documents, text analytics enhances #1 & #2 and enables #3 & #4. You need linguistics to do #1 & #4 well, to deal with meaning (a.k.a. semantics). Search is not enough...
  • 13. Text Analytics Introduction 2013 Text Analytics Summit 13 Semantics Text analytics adds semantic understanding of – Named entities: people, companies, places, etc. Pattern-based entities: e-mail addresses, phone numbers, etc. Concepts: abstractions of entities. Facts and relationships. Concrete and abstract attributes (e.g., 10-year, expensive, comfortable). Subjectivity in the forms of opinions, sentiments, and emotions: attitudinal data. Call these elements, collectively, features.
  • 14. Text Analytics Introduction 2013 Text Analytics Summit 14 Semantics, Analytics, and IR Text analytics generates semantics to bridge search, BI, and applications, enabling next-generation information systems. Search BI Applica- tions Search based applications (search + text + apps) Information access (search + text + BI) Integrated analytics (text + BI) Text analytics (inner circle) Semantic search (search + text) NextGen CRM, EFM, MR, marketing, …
  • 15. Text Analytics Introduction 2013 Text Analytics Summit 15 Information Access Text analytics transforms Information Retrieval (IR) into Information Access (IA). • Search terms become queries. • Retrieved material is mined for larger-scale structure. • Retrieved material is mined for features such as entities and topics or themes. • Retrieved material is mined for smaller-scale structure such as facts and relationships. • Results are presented intelligently, for instance, grouping on mined topics-themes. • Extracted information may be visualized and explored.
  • 16. Text Analytics Introduction 2013 Text Analytics Summit 16 Information Access Text analytics enables results that suit the information and the user, e.g., answers –
  • 17. Text Analytics Introduction 2013 Text Analytics Summit 17 Text Data Mining Enables Content Exploration Decisive Analytics http://www.dac.us/
  • 18. Text Analytics Introduction 2013 Text Analytics Summit 18 Text Analytics Definition Text analytics automates what researchers, writers, scholars, and all the rest of us have been doing for years. Text analytics – Applies linguistic and/or statistical techniques to extract concepts and patterns that can be applied to categorize and classify documents, audio, video, images. Transforms “unstructured” information into data for application of traditional analysis techniques. Discerns meaning and relationships in large volumes of information that were previously unprocessable by computer.
  • 19. Text Analytics Introduction 2013 Text Analytics Summit 19 Glossary: Information in Text Entity: Typically a name (person, place, organization, etc.) or a patterned composite (phone number, e-mail address). Concept: An abstract entity or collection of entities. Feature: An element of interest, e.g., an entity, concept, topic, event, fact, relationship, etc. Metadata: Descriptive information such as author, title, publication date, file type and size. Information Extraction (IE) involves pulling metadata, features & their attributes out of textual sources. Co-reference: Multiple expressions that describe the same thing. Anaphora including pronoun use is an example: John pushed Max. He fell. John pushed Max. He laughed. -- Laure Vieu and Patrick Saint-Dizier
  • 20. Text Analytics Introduction 2013 Text Analytics Summit 20 Glossary: Methods Natural Language Processing (NLP): Computers hear humans. Parsing: Evaluating the content of a document or text. Tokenization: Identification of distinct elements, e.g., words, punctuation marks, n-grams. Stemming/Lemmatization: Reducing word variants (conjugation, declension, case, pluralization) to bases. Term reduction: Use of synonyms, taxonomy, similarity measures to group like terms. Tagging: Wrapping XML tags around distinct features, a.k.a. text augmentation. May involve text enrichment. POS Tagging: Specifically identifying parts of speech via syntactic analysis.
  • 21. Text Analytics Introduction 2013 Text Analytics Summit 21 Glossary: Organizing and Structuring Categorization: Specification of feature/doc groupings. Clustering: Creating categories according to outcome- similarity criteria. Taxonomy: An exhaustive, hierarchical categorization of entities and concepts, either specified or generated by clustering. Classification: Assigning an item to a category, perhaps using a taxonomy. Ontology : In practice, a classification of a set of items in a way that represents knowledge. An oak is a tree. A rose is a flower. A deer is an animal. A sparrow is a bird. Russia is our fatherland. Death is inevitable. -- P. Smirnovskii, A Textbook of Russian Grammar Semantics relates features to others.
  • 22. Text Analytics Introduction 2013 Text Analytics Summit 22 Glossary: Evaluation Precision: The proportion of decisions (e.g., classifications) that are correct. Recall: The proportion of actual correct decisions (e.g., classifications) relative to the total number of correct decisions. Find the even numbers: 9 17 12 4 1 6 2 20 7 3 8 10 What is my Precision? What is my Recall? Accuracy: How well an IE or IR task has been performed, computed as an F-score weighting Precision & Recall, typically: f = 2*(precision * recall) / (precision + recall) Relevance: Do results match the individual user’s needs?
  • 23. Text Analytics Introduction 2013 Text Analytics Summit 23 Text Analytics Pipeline Typical steps in text analytics include – 1. Identify and retrieve documents for analysis. 2. Apply statistical &/ linguistic &/ structural techniques to discern, tag, and extract entities, concepts, relationships, and events (features) within document sets. 3. Apply statistical pattern-matching & similarity techniques to classify documents and organize extracted features according to a specified or generated categorization / taxonomy. – via a pipeline of statistical & linguistic steps. Let’s look at them, at steps to model text...
  • 24. Text Analytics Introduction 2013 Text Analytics Summit 24 Modelling Text Metadata. E.g., title, author, date. Statistics. E.g., term frequency, co-occurrence, proximity. Linguistics. Lexicons, gazetteers, phrase books. Word morphology, parts of speech, syntactic rules. Semantic networks. Larger-scale structure including discourse. Machine learning.
  • 27. Text Analytics Introduction 2013 Text Analytics Summit 27 “Statistical information derived from word frequency and distribution is used by the machine to compute a relative measure of significance, first for individual words and then for sentences. Sentences scoring highest in significance are extracted and printed out to become the auto-abstract.” -- H.P. Luhn, The Automatic Creation of Literature Abstracts, IBM Journal, 1958. http://wordle.net
  • 28. Text Analytics Introduction 2013 Text Analytics Summit 28 Modelling Text The text content of a document can be considered an unordered “bag of words.” Particular documents are points in a high-dimensional vector space. Salton, Wong & Yang, “A Vector Space Model for Automatic Indexing,” November 1975.
  • 29. Text Analytics Introduction 2013 Text Analytics Summit 29 Modelling Text We might construct a document-term matrix... D1 = “I like databases” D2 = “I hate hate databases” and use a weighting such as TF-IDF (term frequency–inverse document frequency)… in computing the cosine of the angle between weighted doc-vectors to determine similarity. I like hate databases D1 1 1 0 1 D2 1 0 2 1 http://en.wikipedia.org/wiki/Term-document_matrix
  • 30. Text Analytics Introduction 2013 Text Analytics Summit 30 Modelling Text Analytical methods make text tractable. Latent semantic indexing utilizing singular value decomposition for term reduction / feature selection. Creates a new, reduced concept space. Takes care of synonymy, polysemy, stemming, etc. Classification technologies / methods: Naive Bayes. Support Vector Machine. K-nearest neighbor.
  • 31. Text Analytics Introduction 2013 Text Analytics Summit 31 Modelling Text In the form of query-document similarity, this is Information Retrieval 101. See, for instance, Salton & Buckley, “Term-Weighting Approaches in Automatic Text Retrieval,” 1988. A useful basic tech paper: Russ Albright, SAS, “Taming Text with the SVD,” 2004. Given the complexity of human language, statistical models may fall short. “Reading from text in general is a hard problem, because it involves all of common sense knowledge.” -- Expert systems pioneer Edward A. Feigenbaum
  • 32. Text Analytics Introduction 2013 Text Analytics Summit 32 “Tri-grams” here are pretty good at describing the Whatness of the source text. Yet... “This rather unsophisticated argument on ‘significance’ avoids such linguistic implications as grammar and syntax... No attention is paid to the logical and semantic relationships the author has established.” -- Hans Peter Luhn, 1958
  • 33. Text Analytics Introduction 2013 Text Analytics Summit 33 New York Times, September 8, 1957 Anaphora / coreference: “They”
  • 34. Text Analytics Introduction 2013 Text Analytics Summit 34 Advanced Term Counting Counting term hits, in one source, at the doc level, doesn’t take you far... Good or bad? What’s behind the posts?
  • 35. Text Analytics Introduction 2013 Text Analytics Summit 35 Why Do We Need Linguistics? To get more out of text than can be delivered by a bag/vector of words and term counting. The Dow fell 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index gained1.44, or 0.11 percent, to 1,263.85. The Dow gained 46.58, or 0.42 percent, to 11,002.14. The Standard & Poor's 500 index fell 1.44, or 0.11 percent, to 1,263.85. -- Luca Scagliarini, Expert System Time flies like an arrow. Fruit flies like a banana. -- Groucho Marx (Statistical co-occurrence to build a model for analysis of text such as these is possible but still limited.)
  • 36. Text Analytics Introduction 2013 Text Analytics Summit 36 Parts of Speech
  • 37. Text Analytics Introduction 2013 Text Analytics Summit 37 Parts of Speech
  • 38. Text Analytics Introduction 2013 Text Analytics Summit 38 Parts of Speech
  • 39. Text Analytics Introduction 2013 Text Analytics Summit 39 From POS to Relationships When we under- stand, for instance, parts of speech (POS), e.g. – <subject> <verb> <object> – we’re in a position to discern facts and relationships... Semantic networks such as WordNet are an asset for word-sense disambiguation.
  • 41. Text Analytics Introduction 2013 Text Analytics Summit 41 Tagging with GATE Annotation (tagging) in action via GATE, an open-source tool:
  • 42. Text Analytics Introduction 2013 Text Analytics Summit 42 GATE Language Processing Pipeline
  • 43. Text Analytics Introduction 2013 Text Analytics Summit 43 GATE Text Annotation
  • 44. Text Analytics Introduction 2013 Text Analytics Summit 44 GATE Exported XML <?xml version='1.0' encoding='windows-1252'?> <GateDocument> <!-- The document's features--> <GateDocumentFeatures> <Feature> <Name className="java.lang.String">MimeType</Name> <Value className="java.lang.String">text/html</Value> </Feature> <Feature> <Name className="java.lang.String">gate.SourceURL</Name> <Value className="java.lang.String">http://altaplana.com/SentimentAnalysis.html</Value> </Feature> </GateDocumentFeatures> <!-- The document content area with serialized nodes --> <TextWithNodes><Node id="0" />Sentiment<Node id="9" /> <Node id="10" />Analysis<Node id="18" />:<Node id="19" /> <Node id="20" />A<Node id="21" /> <Node id="22" />Focus<Node id="27" /> <Node id="28" />on<Node id="30" /> <Node id="31" />Applications<Node id="43" /> <Node id="44" /> <Node id="45" />by<Node id="47" /> <Node id="48" />Seth<Node id="52" /> <Node id="53" />Grimes<Node id="59" /> [material cut] </TextWithNodes> <!-- The default annotation set --> <AnnotationSet> [material cut] <Annotation Id="67" Type="Token" StartNode="48" EndNode="52"> <Feature><Name className="java.lang.String">length</Name><Value className="java.lang.String">4</Value></Feature> <Feature><Name className="java.lang.String">category</Name><Value className="java.lang.String">NNP</Value></Feature> <Feature><Name className="java.lang.String">orth</Name><Value className="java.lang.String">upperInitial</Value></Feature> <Feature><Name className="java.lang.String">kind</Name><Value className="java.lang.String">word</Value></Feature> <Feature><Name className="java.lang.String">string</Name><Value className="java.lang.String">Seth</Value></Feature> </Annotation> [material cut] </AnnotationSet> </GateDocument>
  • 45. Text Analytics Introduction 2013 Text Analytics Summit 45 Information Extraction For content analysis, key in on extracting information. Text features are typically marked up (annotated) in-place with XML. Entities and concepts may correspond to dimensions in a standard BI model. Both classes of object are hierarchically organized and have attributes. We can have both discovered and predetermined classifications (taxonomies) of text features. Dimensional modelling facilitates extraction to databases...
  • 46. Text Analytics Introduction 2013 Text Analytics Summit 46 Database Insert Illustrated via an IBM example: “The standard features are stored in the STANDARD_KW table, keywords with their occurrences in the KEYWORD_KW_OCC table, and the text list features in the TEXTLIST_TEXT table. Every feature table contains the DOC_ID as a reference to the DOCUMENT table.”
  • 47. Text Analytics Introduction 2013 Text Analytics Summit 47 Sophisticated Pattern Matching Lexicons and language rules boost accuracy. An example – a bit complicated – GATE Extension via JAPE Rules... /* locationcontext2.jape * Dhavalkumar Thakker, Nottingham Trent University/PA Photos 15 Sept 2008*/ Phase: locationcontext2 Input: Lookup Token Options: control = all debug = false //Manchester, UK Rule: locationcontext2 Priority:50 ({Token.string == "at"}) ( ({Token.string =~ "[Tt]he"})? ( ( {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ({Token.kind == punctuation})? {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ({Token.kind == punctuation})? {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ) | ( {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ({Token.kind == punctuation})? ( {Token.kind == word, Token.category == NNP, Token.orth == allCaps} | {Token.kind == word, Token.category == NNP, Token.orth == upperInitial} ) ) | ...
  • 48. Text Analytics Introduction 2013 Text Analytics Summit 48 Predictive Modeling Another processing pipeline and more rules…
  • 49. Text Analytics Introduction 2013 Text Analytics Summit 49 Predictive Modeling In the text context, predictive analytics is mostly about classification and automated processing. Modeling also helps, operationally, with: Completion Disambiguation : use dictionaries, context Error correction http://en.wikipedia.org/wiki/File: ITap_on_Motorola_C350.jpg
  • 50. Text Analytics Introduction 2013 Text Analytics Summit 50 Error Correction “Search logs suggest that from 10-15% of queries contain spelling or typographical errors. Fittingly, one important query reformulation tool is spelling suggestions or corrections.” -- Marti Hearst, Search User Interfaces
  • 51. Text Analytics Introduction 2013 Text Analytics Summit 51 Accuracy and Semi-Structured Sources An e-mail message is “semi-structured,” which facilitates extracting metadata -- Date: Sun, 13 Mar 2005 19:58:39 -0500 From: Adam L. Buchsbaum <alb@research.att.com> To: Seth Grimes <grimes@altaplana.com> Subject: Re: Papers on analysis on streaming data seth, you should contact divesh srivastava, divesh@research.att.com regarding at&t labs data streaming technology. Adam “Reading from text in structured domains I don’t think is as hard.” -- Edward A. Feigenbaum Surveys are also typically s-s in a different way...
  • 52. Text Analytics Introduction 2013 Text Analytics Summit 52 Structured &‘Unstructured’ The respondent is invited to explain his/her attitude:
  • 53. Text Analytics Introduction 2013 Text Analytics Summit 53 Structured &‘Unstructured’ We typically look at frequencies and distributions of coded- response questions: Linkage of responses to coded ratings helps in analyses of free text:
  • 54. Text Analytics Introduction 2013 Text Analytics Summit 54 “Sentiment analysis is the task of identifying positive and negative opinions, emotions, and evaluations.” -- Wilson, Wiebe & Hoffman, 2005, “Recognizing Contextual Polarity in Phrase-Level Sentiment Analysis” “Sentiment analysis or opinion mining is the computational study of opinions, sentiments and emotions expressed in text… An opinion on a feature f is a positive or negative view, attitude, emotion or appraisal on f from an opinion holder.” -- Bing Liu, 2010, “Sentiment Analysis and Subjectivity,” in Handbook of Natural Language Processing “Dell really... REALLY need to stop overcharging... and when i say overcharing... i mean atleast double what you would pay to pick up the ram yourself.” -- From Dell’s IdeaStorm.com Sentiment Analysis
  • 55. Text Analytics Introduction 2013 Text Analytics Summit 55 What are people saying? What’s hot/trending? What are they saying about {topic|person|product} X? ... about X versus {topic|person|product} Y? How has opinion about X and Y evolved? How has opinion correlated with {our|competitors’|general} {news|marketing|sales|events}? What’s behind opinion, the root causes? (How) Can we link opinions & transactions? (How) Can we link opinion & intent? Who are opinion leaders? How does sentiment propagate across channels? Sentiment Analysis
  • 56. Text Analytics Introduction 2013 Text Analytics Summit 56 Sentiment Analysis Applications include: Brand / Reputation Management. Competitive intelligence. Customer Experience Management. Enterprise Feedback Management. Quality improvement. Trend spotting. There are significant challenges…
  • 58. Text Analytics Introduction 2013 Text Analytics Summit 58 Complications Illustrated Unfiltered duplicates External reference “Kind” = type, variety, not a sentiment. Complete misclassification
  • 59. Text Analytics Introduction 2013 Text Analytics Summit 59 Sentiment Complications There are many complications. Sentiment may be of interest at multiple levels. Corpus / data space, i.e., across multiple sources. Document. Statement / sentence. Entity / topic / concept. Human language is noisy and chaotic! Jargon, slang, irony, ambiguity, anaphora, polysemy, synonymy, etc. Context is key. Discourse analysis comes into play. Must distinguish the sentiment holder from the object: “Geithner said the recession may worsen.”
  • 60. Text Analytics Introduction 2013 Text Analytics Summit 60 Beyond Polarity
  • 61. Text Analytics Introduction 2013 Text Analytics Summit 61 Intent Analysis http://www.aiaioo.com/whitepapers/intention_analysis_use_cases.pdf http://sentibet.com/
  • 62. Text Analytics Introduction 2013 Text Analytics Summit 62 Applications Text analytics has applications in – • Intelligence & law enforcement. • Life sciences. • Media & publishing including social-media analysis and contextual advertizing. • Competitive intelligence. • Voice of the Customer: CRM, product management & marketing. • Legal, tax & regulatory (LTR) including compliance. • Recruiting.
  • 63. Text Analytics Introduction 2013 Text Analytics Summit 63 Online Commerce Text analytics is applied for marketing, search optimization, competitive intelligence. Analyze social media and enterprise feedback to understand the Voice of the Market: • Opportunities • Threats • Trends Categorize product and service offerings for on-site search and faceted navigation and to enrich content delivery. Annotate pages to enhance Web-search findability, ranking. Scrape competitor sites for offers and pricing. Analyze social and news media for competitive information.
  • 64. Text Analytics Introduction 2013 Text Analytics Summit 64 Voice of the Customer Text analytics is applied to enhance customer service and satisfaction. Analyze customer interactions and opinions – • E-mail, contact-center notes, survey responses • Forum & blog posting and other social media – to – • Address customer product & service issues • Improve quality • Manage brand & reputation If you can link qualitative information from text you can – • Link feedback to transactions • Assess customer value • Understand root causes • Mine data for measures such as churn likelihood
  • 65. Text Analytics Introduction 2013 Text Analytics Summit 65 E-Discovery and Compliance Text analytics is applied for compliance, fraud and risk, and e-discovery. Regulatory mandates and corporate practices dictate – • Monitoring corporate communications • Managing electronic stored information for production in event of litigation Sources include e-mail (!!), news, social media Risk avoidance and fraud detection are key to effective decision making • Text analytics mines critical data from unstructured sources • Integrated text-transactional analytics provides rich insights
  • 66. Text Analytics Introduction 2013 Text Analytics Summit 66
  • 67. Text Analytics Introduction 2013 Text Analytics Summit 67 http://altaplana.com/TA2011
  • 68. Text Analytics Introduction 2013 Text Analytics Summit 68 8% 21% 25% 27% 36% 21% 34% 35% 44% 47% 21% 22% 23% 27% 29% 30% 35% 35% 41% 62% 0% 10% 20% 30% 40% 50% 60% 70% text messages/SMS/chat Web-site feedback contact-center notes or transcripts scientific or technical literature e-mail and correspondence review sites or forums customer/market surveys on-line forums news articles blogs and other social media What textual information are you analyzing or do you plan to analyze? 2011 (n=215) 2009 (n=100)
  • 69. Text Analytics Introduction 2013 Text Analytics Summit 69
  • 70. Text Analytics Introduction 2013 Text Analytics Summit 70
  • 71. Text Analytics Introduction 2013 Text Analytics Summit 71 Getting Started A best practices approach… Assess: • Assess business goals. • Understand information sources. • Consult and educate stakeholders. Evaluate: • Evaluate installed, hosted/SaaS, database-integrated options. • Determine performance and business requirements. • Match methods to goals, sources, and work practices. Implement: • Start with basic functions such as search, modest goals, or with a single information source. • Go for clear wins to gain support. • Build out applications, capacity, BI/research integration.
  • 72. Text Analytics Introduction 2013 Text Analytics Summit 72 Software & Platform Options Text-analytics options may be grouped in general classes. • Installed text-analysis application, whether desktop or server or deployed in-database. • Data mining workbench. • Hosted. • Programming tool. • As-a-service, via an application programming interface (API). • Code library or component of a business/vertical application, for instance for CRM, e-discovery, search. Text analytics is frequently embedded in search or other end-user applications. The slides that follow next will present leading options in each category except Hosted…
  • 73. Text Analytics Introduction 2013 Text Analytics Summit 73 Text Analysis Applications Vendors: Attensity, Clarabridge, Daedalus, Digital Reasoning, Expert System, Fido Labs, IBM, Linguamatics, Medallia, NetBase, Open Text (Nstein), Provalis Research, SAP, SAS, SRA NetOwl, Sysomos, Temis. Typical uses: Customer experience management (CEM), survey analysis, social-media analysis, law enforcement, life sciences. Typical characteristics: Interface that allows the user to configure a processing pipeline. Interface for text exploration and visualization. Export to databases
  • 74. Text Analytics Introduction 2013 Text Analytics Summit 74 Data Mining Workbench Vendors: IBM SPSS Modeler, Megaputer PolyAnalyst, RapidMiner, SAS Text Analytics, WEKA. Typical uses: Customer experience management (CEM), marketing analytics, survey analysis, social-media analysis, law enforcement. Predictive modeling. Typical characteristics: Same as text-analysis applications, but with more sophisticated modeling and analysis capabilities.
  • 75. Text Analytics Introduction 2013 Text Analytics Summit 75 Programming/Development Tool Vendors: GATE, OpenNLP, Python NLTK, R, Stanford NLP – open source. NooJ – free, non-open source. Typical uses: Language modeling. Data exploration. Up to the programmer. Typical characteristics: Text is an add-in to a programming language/environment.
  • 76. Text Analytics Introduction 2013 Text Analytics Summit 76 As a Service, via API Vendors: Alchemy API, Clarabridge, Open Amplify, Pingar, Saplo, Semantria (Lexalytics), Thomson Reuters Calais, others. Typical uses: Annotation and content enrichment with the application domain up to the user. Typical characteristics: Relies on remote, server-resident processing resources. May or, more likely, may not be end-user customizable.
  • 77. Text Analytics Introduction 2013 Text Analytics Summit 77 Code Library or Annotation Engine Vendors: Alias-I LingPipe, GATE. Basis Technology Rosette, Lexalytics Salience, SAP Inxight, SAS Teragram, TEMIS Luxid. Typical uses: Information extraction in support of business applications. Typical characteristics: Same as text-analysis applications, but with more sophisticated modeling and analysis capabilities.
  • 78. Text Analytics Introduction 2013 Text Analytics Summit 78 The Crowdsourcing Alternative (example)
  • 79. Text Analytics Introduction 2013 Text Analytics Summit 79 http://www.geeklawblog.com/2011/12/lexis-advance-platform-launch-two.html http://hpccsystems.com/ (GNU Affero GPL) A Big Data Platform (example)
  • 80. Text Analytics Introduction 2013 Text Analytics Summit 80
  • 81. Text Analytics Introduction 2013 Text Analytics Summit 81 (Accessible) Data Everywhere
  • 82. Text Analytics Introduction 2013 Text Analytics Summit 82 http://open.blogs.nytimes.com/2012/02/16/rnews-is- here-and-this-is-what-it-means/ <div itemscope itemtype="http://schema.org/Organization"> <span itemprop="name">Google.org (GOOG)</span> Contact Details: <div itemprop="address" itemscope itemtype="http://schema.org/PostalAddress"> Main address: <span itemprop="streetAddress">38 avenue de l'Opera</span> <span itemprop="postalCode">F-75002</span> <span itemprop="addressLocality">Paris, France</span> , </div> Tel:<span itemprop="telephone">( 33 1) 42 68 53 00 </span>, Fax:<span itemprop="faxNumber">( 33 1) 42 68 53 01 </span>, E-mail: <span itemprop="email">secretariat(at)google.org</span> </div> http://schema.org/Organization Structure Matters http://img.freebase.com/api/trans/raw/m/02dtnzv http://www.cambridgesemantics.com/se mantic-university/semantic-search-and- the-semantic-web
  • 83. An Introduction to Text Analytics: Social, Online, and Enterprise Seth Grimes Alta Plana Corporation @sethgrimes Text & Social Analytics Summit 2013 Workshop June 4, 2013