Text Mining: Tools,
Techniques, and Applications
Nathan Treloar
President
AvaQuest, Inc.
© 2002, AvaQuest Inc.
Outline
 Text Mining Defined
 Foundations of Text Mining
 Example Applications
 User Interface Challenges
 The Future
Mining Medical Literature
 Medical research
 Find causal links between symptoms
or diseases and drugs or chemicals.
A Real Example
 Research objective:
– Follow chains of causal implication to discover a
relationship between migraines and biochemical
levels.
 Data:
– medical research papers, medical news
(unstructured text information)
 Key concept types:
– symptoms, drugs, diseases, chemicals…
Example Application: Medical
Research
 stress is associated with migraines
 stress can lead to loss of magnesium
 calcium channel blockers prevent some migraines
 magnesium is a natural calcium channel blocker
 spreading cortical depression (SCD) is implicated
in some migraines
 high levels of magnesium inhibit SCD
 migraine patients have high platelet aggregability
 magnesium can suppress platelet aggregability
(source: Swanson and Smalheiser, 1994)
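The chain of findings above can be sketched as a small link-following exercise: collect pairwise associations, then look for intermediate concepts that connect two terms never linked directly. The reduction of each finding to a simple pair of terms, and the names `LINKS`, `build_graph`, and `bridging_terms`, are assumptions made for illustration; this is not the method used in the cited work.

```python
from collections import defaultdict

# Findings from the slide, reduced to undirected (term, term) links.
# Reducing each sentence to a pair of terms is an illustrative assumption.
LINKS = [
    ("migraine", "stress"),
    ("stress", "magnesium"),
    ("migraine", "calcium channel blockers"),
    ("calcium channel blockers", "magnesium"),
    ("migraine", "spreading cortical depression"),
    ("spreading cortical depression", "magnesium"),
    ("migraine", "platelet aggregability"),
    ("platelet aggregability", "magnesium"),
]


def build_graph(links):
    """Build an undirected adjacency map from (a, b) pairs."""
    graph = defaultdict(set)
    for a, b in links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def bridging_terms(graph, a, c):
    """Return the intermediate terms linked to both a and c, sorted by name."""
    return sorted(graph[a] & graph[c])


graph = build_graph(LINKS)
bridges = bridging_terms(graph, "migraine", "magnesium")
directly_linked = "magnesium" in graph["migraine"]
print(bridges)          # intermediate concepts shared by both terms
print(directly_linked)  # whether any single finding links the two directly
```

No single finding mentions migraine and magnesium together, yet four intermediate concepts connect them, which is what makes the relationship a candidate for further study.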
Text Mining Defined
 Discover useful and previously unknown
“gems” of information in large text
collections
“Search” versus “Discover”
 Structured Data
– Search (goal-oriented): Data Retrieval
– Discover (opportunistic): Data Mining
 Unstructured Data (Text)
– Search (goal-oriented): Information Retrieval
– Discover (opportunistic): Text Mining
Data Retrieval
 Find records within a structured
database.
Database Type: Structured
Search Mode: Goal-driven
Atomic Entity: Data Record
Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
Example Query: “SELECT * FROM restaurants WHERE city = 'Boston' AND type = 'Japanese' AND has_veg = true”
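A minimal sketch of the data retrieval case, assuming a `restaurants` table with `name`, `city`, `type`, and `has_veg` columns. The schema and the sample rows are made up for illustration; only the shape of the query comes from the slide.

```python
import sqlite3

# In-memory database with a hypothetical schema and made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE restaurants (name TEXT, city TEXT, type TEXT, has_veg INTEGER)"
)
conn.executemany(
    "INSERT INTO restaurants VALUES (?, ?, ?, ?)",
    [
        ("Sakura House", "Boston", "Japanese", 1),
        ("Tokyo Grill", "Boston", "Japanese", 0),
        ("Green Leaf", "Boston", "Indian", 1),
        ("Kyoto Garden", "Cleveland", "Japanese", 1),
    ],
)

# Goal-driven lookup: every condition is known before the query is written.
rows = conn.execute(
    "SELECT name FROM restaurants WHERE city = ? AND type = ? AND has_veg = 1",
    ("Boston", "Japanese"),
).fetchall()
matches = [name for (name,) in rows]
print(matches)
conn.close()
```

The string values are passed as parameters, which avoids quoting mistakes in the literal comparison.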
Information Retrieval
 Find relevant information in an
unstructured information source
(usually text)
Database Type: Unstructured
Search Mode: Goal-driven
Atomic Entity: Document
Example Information Need: “Find a Japanese restaurant in Boston that serves vegetarian food.”
Example Query: “Japanese restaurant Boston” or Boston->Restaurants->Japanese
Data Mining
 Discover new knowledge
through analysis of data
Database Type: Structured
Search Mode: Opportunistic
Atomic Entity: Numbers and Dimensions
Example Information Need: “Show trend over time in # of visits to Japanese restaurants in Boston”
Example Query: “SELECT date, SUM(visits) FROM restaurants WHERE city = 'Boston' AND type = 'Japanese' GROUP BY date ORDER BY date”
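A trend over time needs one total per period, so a runnable version of this kind of query groups rows by period before ordering them. The sketch below assumes a hypothetical `visits` table with one row per restaurant per month; the table, its columns, and the numbers are invented for illustration.

```python
import sqlite3

# In-memory database; schema and figures are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits "
    "(restaurant TEXT, city TEXT, type TEXT, month TEXT, visits INTEGER)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?, ?)",
    [
        ("Sakura House", "Boston", "Japanese", "2002-01", 120),
        ("Tokyo Grill", "Boston", "Japanese", "2002-01", 80),
        ("Sakura House", "Boston", "Japanese", "2002-02", 150),
        ("Tokyo Grill", "Boston", "Japanese", "2002-02", 95),
        ("Kyoto Garden", "Cleveland", "Japanese", "2002-01", 60),
    ],
)

# One summed value per month, in chronological order.
trend = conn.execute(
    "SELECT month, SUM(visits) FROM visits "
    "WHERE city = ? AND type = ? GROUP BY month ORDER BY month",
    ("Boston", "Japanese"),
).fetchall()
print(trend)
conn.close()
```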
Text Mining
 Discover new knowledge
through analysis of text
Database Type: Unstructured
Search Mode: Opportunistic
Atomic Entity: Language feature or concept
Example Information Need: “Find the types of food poisoning most often associated with Japanese restaurants”
Example Query: Rank diseases found associated with “Japanese restaurants”
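One simple way to approximate the example query is to count how often each disease term appears in the same document as the anchor phrase. The documents, the disease vocabulary, and the function name `rank_cooccurring` are all made up for this sketch, and substring matching stands in for real concept extraction.

```python
from collections import Counter

# Hypothetical concept vocabulary and made-up documents.
DISEASES = ["salmonella", "norovirus", "scombroid poisoning"]
DOCS = [
    "Norovirus outbreak traced to a Japanese restaurant downtown.",
    "Two diners report scombroid poisoning after visiting a Japanese restaurant.",
    "Health inspectors link norovirus cases to a Japanese restaurant buffet.",
    "Salmonella found in eggs at a local diner.",
]


def rank_cooccurring(docs, anchor, concepts):
    """Count concepts that appear in the same document as the anchor phrase."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        if anchor not in text:
            continue
        for concept in concepts:
            if concept in text:
                counts[concept] += 1
    return counts.most_common()


ranking = rank_cooccurring(DOCS, "japanese restaurant", DISEASES)
print(ranking)
```

Document-level co-occurrence is a crude signal; it says the two concepts were mentioned together, not that one caused the other.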
Motivation for Text Mining
 Approximately 90% of the world’s data is held in
unstructured formats (source: Oracle Corporation)
 Information-intensive business processes demand
that we move beyond simple document retrieval to
“knowledge” discovery.
[Pie chart: Unstructured or Semi-structured Information, 90%; Structured Numerical or Coded Information, 10%]
Challenges of Text Mining
 Very high number of possible “dimensions”
– All possible word and phrase types in the language!!
 Unlike data mining:
– records (= docs) are not structurally identical
– records are not statistically independent
 Complex and subtle relationships between concepts in
text
– “AOL merges with Time-Warner”
– “Time-Warner is bought by AOL”
 Ambiguity and context sensitivity
– automobile = car = vehicle = Toyota
– Apple (the company) or apple (the fruit)
The Emergence of Text Mining
 Advances in text processing technology
– Natural Language Processing (NLP)
– Computational Linguistics
 Cheap Hardware!
– CPU
– Disk
– Network
Text Processing
 Statistical Analysis
– Quantify text data
 Language or Content Analysis
– Identifying structural elements
– Extracting and codifying meaning
– Reducing the dimensions of text data
Statistical Analysis
 Use statistics to add a numerical
dimension to unstructured text
Term frequency
Document length
Document frequency
Term proximity
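The four measures listed above can each be computed in a few lines. This is a minimal sketch over three short sentences borrowed from the medical example; the helper names are illustrative, and the proximity measure used here (smallest distance in tokens between two terms) is one of several reasonable definitions.

```python
import re
from collections import Counter

DOCS = [
    "magnesium can suppress platelet aggregability",
    "migraine patients have high platelet aggregability",
    "stress can lead to loss of magnesium",
]


def tokenize(text):
    """Lowercase the text and keep runs of letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


tokenized = [tokenize(doc) for doc in DOCS]


def term_frequency(term, tokens):
    """Number of times the term occurs in one document."""
    return Counter(tokens)[term]


def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for tokens in docs if term in tokens)


def min_proximity(term_a, term_b, tokens):
    """Smallest token distance between two terms, or None if either is absent."""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


doc_lengths = [len(tokens) for tokens in tokenized]
print(doc_lengths)
print(term_frequency("magnesium", tokenized[0]))
print(document_frequency("magnesium", tokenized))
print(min_proximity("magnesium", "platelet", tokenized[0]))
```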
Content Analysis
 Lexical and Syntactic Processing
– Recognizing “tokens” (terms)
– Normalizing words
– Language constructs (parts of speech, sentences, paragraphs)
 Semantic Processing
– Extracting meaning
– Named Entity Extraction (People names, Company Names,
Locations, etc…)
 Extra-semantic features
– Identify feelings or sentiment in text
 Goal = Dimension Reduction
Syntactic Processing
 Lexical analysis
– Recognizing word boundaries
– Relatively simple process in English
 Syntactic analysis
– Recognizing larger constructs
– Sentence and Paragraph Recognition
– Parts of speech tagging
– Phrase recognition
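The first two steps, word-boundary and sentence recognition, can be sketched with regular expressions. The rules below are deliberately naive assumptions (a sentence ends at `.`, `!`, or `?` followed by whitespace; a word may contain internal hyphens or apostrophes) and would need refinement for abbreviations and similar cases. Part-of-speech tagging and phrase recognition are not attempted here.

```python
import re

TEXT = (
    "AOL merges with Time-Warner. Time-Warner is bought by AOL! "
    "Is that the same event?"
)


def split_sentences(text):
    """Split after ., ! or ? when followed by whitespace (naive rule)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Return words, keeping internal hyphens and apostrophes together."""
    return re.findall(r"[A-Za-z0-9]+(?:[-'][A-Za-z0-9]+)*", sentence)


sentences = split_sentences(TEXT)
tokens = [tokenize(s) for s in sentences]
print(sentences)
print(tokens[0])
```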
Named Entity Extraction
 Identify and type language features
 Examples:
 People names
 Company names
 Geographic location names
 Dates
 Monetary amount
 Others… (domain specific)
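For the more regular entity types, such as dates and monetary amounts, pattern matching already goes a long way. The sketch below assumes two hand-written patterns (month-and-year dates, dollar amounts) and an invented sample sentence; people and company names generally need dictionaries or trained models and are out of scope here.

```python
import re

# Hand-written patterns; both are simplifying assumptions.
MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
PATTERNS = {
    "DATE": re.compile(r"\b(?:" + MONTHS + r") \d{4}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?: (?:million|billion))?"),
}


def extract_entities(text):
    """Return (type, value) pairs in the order they appear in the text."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return [(label, value) for _, label, value in sorted(found)]


SAMPLE = (
    "In March 2002 the vendor closed a $1.5 million deal, "
    "up from $250,000 in February 2002."
)
entities = extract_entities(SAMPLE)
print(entities)
```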
Simple Entity Extraction
“The quick brown fox jumps over the lazy dog”
– “The quick brown fox”: noun phrase; types: Mammal, Canidae
– “the lazy dog”: noun phrase; types: Mammal, Canidae
Entity Extraction in Use
 Categorization
– Assign structure to unstructured content to facilitate
retrieval
 Summarization
– Get the “gist” of a document or document collection
 Query expansion
– Expand query terms with related “typed” concepts
 Text Mining
– Find patterns, trends, relationships between
concepts in text
Extra-semantic Information
 Extracting hidden meaning or sentiment based
on use of language.
– Examples:
 “Customer is unhappy with their service!”
 Sentiment = discontent
 Sentiment is:
– Emotions: fear, love, hate, sorrow
– Feelings: warmth, excitement
– Mood, disposition, temperament, …
 Or even (someday)…
– Lies, sarcasm
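A very small lexicon-based tagger illustrates the idea of typing text by sentiment. The word list, the two labels, and the one-word negation rule are assumptions for this sketch; a practical system would need a far larger lexicon and better handling of context.

```python
import re

# Hypothetical lexicon mapping cue words to a sentiment label.
LEXICON = {
    "unhappy": "discontent",
    "hates": "discontent",
    "angry": "discontent",
    "happy": "satisfaction",
    "pleased": "satisfaction",
    "loves": "satisfaction",
}
NEGATORS = {"not", "never", "no"}
FLIP = {"discontent": "satisfaction", "satisfaction": "discontent"}


def tag_sentiment(text):
    """Return the sentiment labels triggered by cue words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    labels = []
    for i, token in enumerate(tokens):
        label = LEXICON.get(token)
        if label is None:
            continue
        # A negator directly before the cue word flips its label.
        if i > 0 and tokens[i - 1] in NEGATORS:
            label = FLIP[label]
        labels.append(label)
    return labels


print(tag_sentiment("Customer is unhappy with their service!"))
print(tag_sentiment("Customer is not happy with the fees"))
```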
Text Mining:
General Applications
 Relationship Analysis
– If A is related to B, and B is related to C, there is
potentially a relationship between A and C.
 Trend analysis
– Occurrences of A peak in October.
 Mixed applications
– Co-occurrence of A together with B peak in
November.
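Trend and mixed analyses reduce to counting documents per time period, either for one concept or for two concepts appearing together. In the sketch below the dated snippets are made up, with "statement" playing the role of A and "fees" the role of B; whole-word matching stands in for concept extraction.

```python
from collections import Counter

# Made-up (month, text) pairs.
DOCS = [
    ("2001-09", "statement format complaint"),
    ("2001-10", "statement arrived late"),
    ("2001-10", "statement balance missing"),
    ("2001-10", "statement fees complaint"),
    ("2001-11", "statement fees too high"),
    ("2001-11", "fees on statement unclear"),
]


def monthly_counts(docs, *terms):
    """Count documents per month that contain every one of the given terms."""
    counts = Counter()
    for month, text in docs:
        words = set(text.split())
        if all(term in words for term in terms):
            counts[month] += 1
    return counts


def peak_month(counts):
    """Return the month with the highest count."""
    return max(counts, key=counts.get)


a_counts = monthly_counts(DOCS, "statement")
ab_counts = monthly_counts(DOCS, "statement", "fees")
print(peak_month(a_counts))   # peak for A alone
print(peak_month(ab_counts))  # peak for A together with B
```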
Text Mining:
Business Applications
 Ex 1: Decision Support in CRM
– What are customers’ typical complaints?
– What is the trend in the number of satisfied
customers in Cleveland?
 Ex 2: Knowledge Management
– People Finder
 Ex 3: Personalization in eCommerce
– Suggest products that fit a user’s interest profile
(even based on personality info).
Example 1:
Decision Support using Bank Call
Center Data
 The Needs:
– Analysis of call records as input into the
decision-making process of the bank’s
management
– Quick answers to important questions
 Which offices receive the most angry calls?
 What products have the fewest satisfied customers?
 (“Angry” and “Satisfied” are recognizable sentiments)
– User-friendly interface and visualization
tools
Example 1:
Decision Support using Bank Call
Center Data
 The Information Source:
– Call center records
– Example:
AC2G31, 01, 0101, PCC, 021, 0053352,
NEW YORK, NY, H-SUPRVR8, STMT,
“mr stark has been with the company for
about 20 yrs. He hates his stmt format and
wishes that we would show a daily balance
to help him know when he falls below the
required balance on the account.”
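A record like the one above mixes coded fields with a free-text note, so a first step is to separate the two and then tag the note. The slide does not document the field layout; the sketch assumes that the 7th field is the city, the 8th the state, the 10th a topic code, and the 11th the note. The second record and the negative-word list are made up.

```python
from collections import Counter

NEGATIVE_WORDS = {"hates", "angry", "unhappy"}  # hypothetical cue words

RECORDS = [
    # Record quoted on the slide.
    'AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, '
    '"mr stark has been with the company for about 20 yrs. He hates his '
    'stmt format and wishes that we would show a daily balance to help him '
    'know when he falls below the required balance on the account."',
    # Made-up record for contrast.
    'ZZ0001, 01, 0101, PCC, 021, 0000001, BOSTON, MA, H-SUPRVR1, STMT, '
    '"customer called to confirm the mailing address."',
]


def parse_record(line):
    """Split off ten coded fields; the remainder is the free-text note."""
    parts = [part.strip() for part in line.split(",", 10)]
    return {
        "city": parts[6],    # assumed position
        "state": parts[7],   # assumed position
        "topic": parts[9],   # assumed position
        "note": parts[10].strip('"'),
    }


def is_negative(note):
    """True if the note contains any of the negative cue words."""
    words = {word.strip(".,!?") for word in note.lower().split()}
    return bool(words & NEGATIVE_WORDS)


negative_by_city = Counter(
    record["city"]
    for record in map(parse_record, RECORDS)
    if record["topic"] == "STMT" and is_negative(record["note"])
)
print(negative_by_city)
```

Counts of this kind, grouped by office, are the sort of input a call-volume-by-sentiment chart would be drawn from.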
Example 1:
Call Volume by Sentiment
[Bar chart: Negative Calls Related to Bank Statements for Cleveland, New York, and Boston; value axis from 0 to 1000]
Example 2:
KM People Finder
 The Needs:
– Find people as well as documents that
can address my information need.
– Promote collaboration and knowledge
sharing
– Leverage existing information access
system
 The Information Sources:
– Email, groupware, online reports, …
Example 2:
Simple KM People Finder
[Diagram components: Query, Search or Navigation System, Relevant Docs, Name Extractor, Authority List, Ranked People Names]
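One plausible reading of this design is: run the query, extract names from the relevant documents, keep only names found on an authority list, and rank people by how many relevant documents mention them. The sketch below follows that reading; the documents, the names, and the keyword-match search are all invented for illustration.

```python
from collections import Counter

# Hypothetical authority list of known people.
AUTHORITY_LIST = ["Alice Chen", "Bob Ortiz", "Carla Diaz"]

# Made-up documents standing in for email, groupware, and reports.
DOCS = [
    "Alice Chen wrote the text mining pilot report with Bob Ortiz.",
    "Text mining vendor evaluation, prepared by Alice Chen.",
    "Carla Diaz: quarterly budget summary.",
    "Email from Bob Ortiz about text mining tools, cc Alice Chen.",
]


def search(docs, query):
    """Return documents that contain every query word (toy search system)."""
    terms = query.lower().split()
    return [doc for doc in docs if all(term in doc.lower() for term in terms)]


def rank_people(docs, names):
    """Rank authority-list names by the number of documents mentioning them."""
    counts = Counter()
    for doc in docs:
        for name in names:
            if name in doc:
                counts[name] += 1
    return counts.most_common()


relevant = search(DOCS, "text mining")
ranked = rank_people(relevant, AUTHORITY_LIST)
print(ranked)
```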
Example 2:
KM People Finder
Example 3:
Personalized Movie “Matcher”
 The Need:
– Match movies to individuals based on preference
profile
 The Information:
– Written reviews of movies
– Users’ lists of favorite movies.
[Diagram: Movie Reviews → Sentiment Analysis → Typed and Tagged Reviews]
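Once reviews are tagged, each movie can be described by a vector of sentiment scores, and a user's favorites can be averaged into a preference profile. The sketch below matches movies to that profile by cosine similarity, which is one common choice and not necessarily the one used in the original system. The titles, categories, and scores are hypothetical.

```python
import math

# Hypothetical sentiment categories and per-movie scores in [0, 1].
CATEGORIES = ["fear", "conflict", "death", "deception"]
MOVIE_PROFILES = {
    "Action Movie A": [0.8, 0.9, 0.7, 0.3],
    "Romance Movie B": [0.1, 0.4, 0.1, 0.6],
    "Thriller Movie C": [0.9, 0.6, 0.8, 0.7],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def user_profile(favorites, profiles):
    """Average the sentiment vectors of the user's favorite movies."""
    vectors = [profiles[title] for title in favorites]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def recommend(favorites, profiles):
    """Return the unseen movie whose profile is closest to the user's."""
    profile = user_profile(favorites, profiles)
    candidates = [title for title in profiles if title not in favorites]
    return max(candidates, key=lambda title: cosine(profile, profiles[title]))


best = recommend(["Action Movie A"], MOVIE_PROFILES)
print(best)
```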
Sentiment Analysis of Movies:
Visualization (after Evans)
[Chart: Action and Romance movies scored from 0 to 1 on absurdity, destruction, fear, horror, immorality, inferiority, injustice, insecurity, deception, death, crime, and conflict]
Commercial Tools
 IBM Intelligent Miner for Text
 Semio Map
 InXight LinguistX / ThingFinder
 LexiQuest
 ClearForest
 Teragram
 SRA NetOwl Extractor
 Autonomy
User Interfaces for Text
Mining
 Need some way to present results of Text
Mining in an intuitive, easy to manage form.
 Options:
– Conventional text “lists” (1D)
– Charts and graphs (2D)
– Advanced visualization tools (3D+)
 Network maps
 Landscapes
 3d “spaces”
UI Challenges
 Simple lists, charts, and graphs are either not
obviously applicable or difficult to work with
because of the high dimensionality of text
 Advanced visualization tools can be intimidating
for the general community and are not readily
accepted
Charts and Graphs
http://www.cognos.com/
Visualization: Network Maps
http://www.thinkmap.com/
Visualization: Network Maps
http://www.lexiquest.com/
Visualization: Landscapes
http://www.aurigin.com/
Visualization: 3D Spaces
http://zing.ncsl.nist.gov/~cugini/uicd/cc-paper.html
The Future
 Different tools and data, but common dimensions
 Example:
– “Find sales trends by product and correlate with occurrences of
company name in business news articles”
– Dimensions: Time, Company names (or stock symbols), Product
names, Regions
Recent Events
 February 2002
– Meta Group posts a report arguing for the need to
integrate business intelligence applications with
knowledge management portals.
 March 2002
– SAS, a leading provider of business intelligence
software solutions, partners with Inxight to introduce
a true text mining product.
Text mining

  • 1. Text Mining: Tools, Techniques, and Applications Nathan Treloar President AvaQuest, Inc.
  • 2. © 2002, AvaQuest Inc. Outline  Text Mining Defined  Foundations of Text Mining  Example Applications  User Interface Challenges  The Future
  • 3. © 2002, AvaQuest Inc. Mining Medical Literature  Medical research  Find causal links between symptoms or diseases and drugs or chemicals.
  • 4. © 2002, AvaQuest Inc. A Real Example  Research objective: – Follow chains of causal implication to discover a relationship between migraines and biochemical levels.  Data: – medical research papers, medical news (unstructured text information)  Key concept types: – symptoms, drugs, diseases, chemicals…
  • 5. © 2002, AvaQuest Inc. Example Application: Medical Research  stress is associated with migraines  stress can lead to loss of magnesium  calcium channel blockers prevent some migraines  magnesium is a natural calcium channel blocker  spreading cortical depression (SCD) is implicated in some migraines  high levels of magnesium inhibit SCD  migraine patients have high platelet aggregability  magnesium can suppress platelet aggregability (source: Swanson and Smalheiser, 1994)
  • 6. © 2002, AvaQuest Inc. Text Mining Defined  Discover useful and previously unknown “gems” of information in large text collections
  • 7. © 2002, AvaQuest Inc. “Search” versus “Discover” Data Mining Text Mining Data Retrieval Information Retrieval Search (goal-oriented) Discover (opportunistic) Structured Data Unstructured Data (Text)
  • 8. © 2002, AvaQuest Inc. Data Retrieval  Find records within a structured database. Database Type Structured Search Mode Goal-driven Atomic entity Data Record Example Information Need “Find a Japanese restaurant in Boston that serves vegetarian food.” Example Query “SELECT * FROM restaurants WHERE city = boston AND type = japanese AND has_veg = true”
  • 9. © 2002, AvaQuest Inc. Information Retrieval  Find relevant information in an unstructured information source (usually text) Database Type Unstructured Search Mode Goal-driven Atomic entity Document Example Information Need “Find a Japanese restaurant in Boston that serves vegetarian food.” Example Query “Japanese restaurant Boston” or Boston->Restaurants->Japanese
  • 10. © 2002, AvaQuest Inc. Data Mining  Discover new knowledge through analysis of data Database Type Structured Search Mode Opportunistic Atomic entity Numbers and Dimensions Example Information Need “Show trend over time in # of visits to Japanese restaurants in Boston ” Example Query “SELECT SUM(visits) FROM restaurants WHERE city = boston AND type = japanese ORDER BY date”
  • 11. © 2002, AvaQuest Inc. Text Mining  Discover new knowledge through analysis of text Database Type Unstructured Search Mode Opportunistic Atomic entity Language feature or concept Example Information Need “Find the types of food poisoning most often associated with Japanese restaurants” Example Query Rank diseases found associated with “Japanese restaurants”
  • 12. © 2002, AvaQuest Inc. Motivation for Text Mining  Approximately 90% of the world’s data is held in unstructured formats (source: Oracle Corporation)  Information intensive business processes demand that we transcend from simple document retrieval to “knowledge” discovery. 90% Structured Numerical or Coded Information 10% Unstructured or Semi-structured Information
  • 13. Challenges of Text Mining. Very high number of possible “dimensions”: all possible word and phrase types in the language! Unlike data mining: records (= docs) are not structurally identical; records are not statistically independent. Complex and subtle relationships between concepts in text: “AOL merges with Time-Warner”; “Time-Warner is bought by AOL”. Ambiguity and context sensitivity: automobile = car = vehicle = Toyota; Apple (the company) or apple (the fruit).
  • 14. The Emergence of Text Mining. Advances in text processing technology: Natural Language Processing (NLP); Computational Linguistics. Cheap hardware: CPU; disk; network.
  • 15. Text Processing. Statistical analysis: quantify text data. Language or content analysis: identifying structural elements; extracting and codifying meaning; reducing the dimensions of text data.
  • 16. Statistical Analysis. Use statistics to add a numerical dimension to unstructured text: term frequency; document length; document frequency; term proximity.
  • 17. Content Analysis. Lexical and syntactic processing: recognizing “tokens” (terms); normalizing words; language constructs (parts of speech, sentences, paragraphs). Semantic processing: extracting meaning; named entity extraction (people names, company names, locations, etc.). Extra-semantic features: identify feelings or sentiment in text. Goal = dimension reduction.
  • 18. Syntactic Processing. Lexical analysis: recognizing word boundaries; relatively simple process in English. Syntactic analysis: recognizing larger constructs; sentence and paragraph recognition; parts of speech tagging; phrase recognition.
  • 19. Named Entity Extraction. Identify and type language features. Examples: people names; company names; geographic location names; dates; monetary amount; others (domain specific).
  • 20. Simple Entity Extraction. “The quick brown fox jumps over the lazy dog”: “the quick brown fox” is a noun phrase typed as Mammal, Canidae; “the lazy dog” is a noun phrase typed as Mammal, Canidae.
  • 21. Entity Extraction in Use. Categorization: assign structure to unstructured content to facilitate retrieval. Summarization: get the “gist” of a document or document collection. Query expansion: expand query terms with related “typed” concepts. Text mining: find patterns, trends, relationships between concepts in text.
  • 22. Extra-semantic Information. Extracting hidden meaning or sentiment based on use of language. Example: “Customer is unhappy with their service!” gives sentiment = discontent. Sentiment is: emotions (fear, love, hate, sorrow); feelings (warmth, excitement); mood, disposition, temperament, … Or even (someday): lies, sarcasm.
  • 23. Text Mining: General Applications. Relationship analysis: if A is related to B, and B is related to C, there is potentially a relationship between A and C. Trend analysis: occurrences of A peak in October. Mixed applications: co-occurrences of A together with B peak in November.
  • 24. Text Mining: Business Applications. Ex 1: Decision support in CRM. What are customers’ typical complaints? What is the trend in the number of satisfied customers in Cleveland? Ex 2: Knowledge management: People Finder. Ex 3: Personalization in eCommerce: suggest products that fit a user’s interest profile (even based on personality info).
  • 25. Example 1: Decision Support using Bank Call Center Data. The needs: analysis of call records as input into the decision-making process of the bank’s management; quick answers to important questions, such as “Which offices receive the most angry calls?” and “What products have the fewest satisfied customers?” (“angry” and “satisfied” are recognizable sentiments); user-friendly interface and visualization tools.
  • 26. Example 1: Decision Support using Bank Call Center Data. The information source: call center records. Example: AC2G31, 01, 0101, PCC, 021, 0053352, NEW YORK, NY, H-SUPRVR8, STMT, “mr stark has been with the company for about 20 yrs. He hates his stmt format and wishes that we would show a daily balance to help him know when he falls below the required balance on the account.”
  • 27. Example 1: Call Volume by Sentiment. Chart: negative calls related to bank statements for Cleveland, New York, and Boston (axis from 0 to 1000).
  • 28. Example 2: KM People Finder. The needs: find people as well as documents that can address my information need; promote collaboration and knowledge sharing; leverage existing information access system. The information sources: email, groupware, online reports, …
  • 29. Example 2: Simple KM People Finder. Diagram components: Query; Search or Navigation System; Relevant Docs; Name Extractor; Authority List; Ranked People Names.
  • 30. Example 2: KM People Finder
  • 31. Example 3: Personalized Movie “Matcher”. The need: match movies to individuals based on preference profile. The information: written reviews of movies; users’ lists of favorite movies. Diagram components: Movie Reviews; Sentiment Analysis; Typed and Tagged Reviews.
  • 32. Sentiment Analysis of Movies: Visualization (after Evans). Chart: Action and Romance plotted on a 0 to 1 scale across the categories absurdity, destruction, fear, horror, immorality, inferiority, injustice, insecurity, deception, death, crime, conflict.
  • 33. Commercial Tools. IBM Intelligent Miner for Text; Semio Map; InXight LinguistX / ThingFinder; LexiQuest; ClearForest; Teragram; SRA NetOwl Extractor; Autonomy.
  • 34. User Interfaces for Text Mining. Need some way to present results of text mining in an intuitive, easy to manage form. Options: conventional text “lists” (1D); charts and graphs (2D); advanced visualization tools (3D+), including network maps, landscapes, and 3D “spaces”.
  • 35. UI Challenges. Simple lists, charts, and graphs are not obviously applicable, or are difficult to work with, due to the high dimensionality of text. Advanced visualization tools can be intimidating for the general community and are not readily accepted.
  • 36. Charts and Graphs: http://www.cognos.com/
  • 37. Visualization: Network Maps: http://www.thinkmap.com/
  • 38. Visualization: Network Maps: http://www.lexiquest.com/
  • 39. Visualization: Landscapes: http://www.aurigin.com/
  • 40. Visualization: 3D Spaces: http://zing.ncsl.nist.gov/~cugini/uicd/cc-paper.html
  • 41. The Future. Different tools and data, but common dimensions. Example: “Find sales trends by product and correlate with occurrences of company name in business news articles.” Dimensions: time, company names (or stock symbols), product names, regions.
  • 42. Recent Events. February 2002: Meta Group posts a report arguing for the need to integrate business intelligence applications with knowledge management portals. March 2002: SAS, a leading provider of business intelligence software solutions, partners with Inxight to introduce a true text mining product.
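
The four measures named on the Statistical Analysis slide (term frequency, document length, document frequency, term proximity) can be illustrated with a short Python sketch. This is not taken from the presentation: the function names `tokenize`, `term_statistics`, and `min_proximity` are hypothetical, the tokenizer is deliberately naive, and the sample documents reuse three statements from the medical research slide.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic word tokens (naive)."""
    return re.findall(r"[a-z]+", text.lower())


def term_statistics(docs, term):
    """Return per-document term frequencies, document lengths, and document frequency."""
    tokenized = [tokenize(d) for d in docs]
    term_freqs = [Counter(tokens)[term] for tokens in tokenized]
    doc_lengths = [len(tokens) for tokens in tokenized]
    doc_freq = sum(1 for tf in term_freqs if tf > 0)  # number of docs containing the term
    return term_freqs, doc_lengths, doc_freq


def min_proximity(text, term_a, term_b):
    """Smallest distance in tokens between two terms, or None if either is absent."""
    tokens = tokenize(text)
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    if not pos_a or not pos_b:
        return None
    return min(abs(i - j) for i in pos_a for j in pos_b)


# Sample statements borrowed from the medical research example slide.
docs = [
    "Stress is associated with migraines.",
    "Stress can lead to loss of magnesium.",
    "Magnesium is a natural calcium channel blocker.",
]

term_freqs, doc_lengths, doc_freq = term_statistics(docs, "magnesium")
print(term_freqs, doc_lengths, doc_freq)
print(min_proximity(docs[1], "stress", "magnesium"))
```

Numbers like these are what give unstructured text a "numerical dimension"; a real system would likely add stemming and stop-word handling before counting.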

Editor's Notes

  1. We'll look at five things. (1) What is text mining? It is easiest to understand by relating it to known technologies. (2) Foundations of text mining: the fundamental theories and technologies that make text mining work. (3) Applications of text mining: general and real-world problems that can be solved with text mining. (4) User interface challenges: we’ll look at UIs that have been developed for text mining. (5) The future: a quick glimpse into the potential for text mining in the future.
  2. This is all very interesting, but what real-life business problems does this hold promise for? Let’s consider a specific scenario involving the medical domain, specifically, medical research. Consider that the medical domain has quite a lot of knowledge captured in the form of unstructured documents: physician reports, medical news articles and reports, etc...
  3. The source of information is a collection of medical research papers and news articles. From this source, we can extract the “dimensions” of the data. Dimensions are the “classes” of information that add substance and some implicit structure to the otherwise unstructured data we’re dealing with. The goal is to explore the source of migraines by identifying potential causal links to blood chemistry.
  4. Here’s what was found. Note that this is an indication of a “potential” link. It turns out that a follow-up clinical study validated this result.
  5. How many people have been involved in an implementation of a so-called Business Intelligence system (Decision Support, Knowledge Discovery, Data Mining System)? How many people have been part of building a text retrieval or information retrieval system (in other words, a “search” application)? In the loosest definition, text mining applies the idea of “mining” to textual information, employing some of the same technologies used for text retrieval.
  6. Data Retrieval systems are the ones most people are familiar with. They are the applications provided by behemoths like Oracle and Sybase. An “information need” is what is in the user’s head. The “query” is the user’s articulation of this information need to the system. They are not always the same.
  7. Most of us are familiar with “search”. Thanks to the growth of the Web and sites like Google, AltaVista, Excite, etc…, anyone who’s reasonably “Net savvy” has had some exposure to the technology that is IR or information retrieval. IR systems usually attempt to address one of two modes of searching: goal-driven or opportunistic. The two modes represent the two types of searches that people typically perform. How many people still go to their local public library? I maintain that when people use the library they are in one of two modes. Either they are looking for a particular book or books, or they are browsing an area of interest. That is the difference between goal-driven and opportunistic search.
  8. Data Mining employs analysis and interpretation of data captured in structured databases to facilitate decision making. So-called “Decision Support” systems usually employ some kind of Data Mining capabilities.
  9. Text Mining employs the same concepts as Data Mining but against unstructured or semi-structured text information sources. Text mining aids the opportunistic searcher. Not only can it help traditional IR by “suggesting” relevant information, it can extract knowledge that is not nicely encapsulated in a single document (or book).
  10. The justification for the interest in text mining is the same as for the interest in knowledge retrieval (search and categorization). The sheer amount of unstructured data (mostly textual) out there calls for more than just document retrieval. Tools and techniques exist to mine this data and realize value in the same way that data mining taps structured data for business intelligence and knowledge discovery.
  11. Why aren’t there more products that do text mining? Because it’s hard! First, there are many possible dimensions of text. Consider just the classes of nouns that might be represented in a text collection. Then, add to that noun phrases (nouns plus adjectives or multi-word concepts). Second, different documents can look quite different, never mind issues like formatting differences. Third, the relationships between words and concepts in text are subtle. Figuring out that a relationship exists is easy; providing information about the nature of the relationship is tricky. Finally, the same word can have many meanings (e.g. “interest”), or many words can have the same meaning.
  13. So, what helps? Well, the technology to analyze the written word and to address the problems listed in the previous slide has existed for quite a number of years, but only in the last 2 or 3 years have we seen products that apply this technology to the idea of text mining. It is sometimes called computational linguistics (CL), sometimes natural language processing (NLP), but it is easiest to just refer to it as Text Analysis.
  14. Statistics about text are at the heart of most IR systems. Simple statistics like the number of times a search term occurs in a document can be used to infer the potential relevance of that document.
  15. Content Analysis tries to disambiguate structure and meaning in text. The three processing “levels” represent three levels of sophistication in this disambiguation. Ultimately, what we’re trying to do is reduce the number of dimensions in the text data.
  16. Simple syntactic processing is designed fundamentally to reduce the complexity inherent in text by reducing the possible number of words and phrases to a more manageable number.
  17. Semantic processing aims to “type” language features or concepts so that the information can be mined by these different concept types.
  18. Here’s a simple example: Given a sentence, it’s useful to recognize the important concepts present. In this example, we are recognizing noun phrases and then classifying the phrases as particular types. How concepts are classified depends on the research domain. Here I may have an application intended for a biologist where the kinds of things we might like to know are potential relationships between foxes and dogs. This could easily be factored a different way where dog and fox are not the same concept type.
  19. So what is concept extraction good for? Well, it has lots of general applications.
  20. The ultimate (to date) in language processing is the inference of deep and hidden meaning in unstructured text. It is inherently subjective, but a standard classification scheme can help in the association of business rules to the inferred affects.
  21. Some general applications of text mining.
  22. A couple specific business applications of text mining. Gotta get that “e” in there!
  23. Bank call centers handle thousands of calls a day, recorded mostly in unstructured or semi-structured formats. This information represents a wealth of knowledge that can be translated into market strategies, etc…
  24. This diagram, created by Dr. David Evans at Clairvoyance, shows a way to visualize the affect of a movie using statistical data and normalization techniques.
  25. That’s all fine and dandy, but how do we provide this functionality in a mainstream application?
  26. Here’s a traditional user interface for presenting the results of a data mining operation. In this case, the data is product sales data which is being used to generate a variance report. It’s a bit harder to see how text information could be presented as a histogram, but, as you’ll see in the demos at the end of this presentation, it can be done.
  27. Another technique for visualizing relationship information is network maps. Here’s an example that shows, albeit dimly, how the relationships implied in a thesaurus can be shown in a 2d and pseudo-3d map. This is a viewer from a company called ThinkMap.
  28. Another example from a company called LexiQuest (formerly Erli) which makes language processing technology.
  29. One of the more interesting approaches to visualizing patterns in text is from a company called Aurigin in their Themescape product. Now, I have a formal education in geology and geophysics, so I’m comfortable with looking at maps like this, but I have to believe this is also a fairly intuitive interface for most people. Generally what it tries to do is show thematic clusters as peaks in the map. By clicking on a peak, you can “drill down” into that cluster. This example shows “whole collection” analysis.
  30. Of course, you can even get more esoteric with 3d spaces. Here’s an example from research being done at the NIST.
  31. A very interesting future for text mining is integration with traditional data mining concepts and application. Recent activity in the Information Retrieval space shows promise that this bridge will get crossed in the next several years.
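
The relationship analysis idea from the slides (if A is related to B, and B is related to C, there is potentially a relationship between A and C) can be sketched against the migraine statements from the medical research example. The sketch below is an illustration only, not the method used in the cited study: the concept list is hand-picked, matching is plain substring lookup, and the names `cooccurrence_graph` and `indirect_links` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hand-picked concept stems (an assumption; a real system would extract these).
CONCEPTS = [
    "migraine",
    "stress",
    "magnesium",
    "calcium channel blocker",
    "platelet aggregability",
]

# Statements taken from the medical research example slide.
STATEMENTS = [
    "stress is associated with migraines",
    "stress can lead to loss of magnesium",
    "calcium channel blockers prevent some migraines",
    "magnesium is a natural calcium channel blocker",
    "migraine patients have high platelet aggregability",
    "magnesium can suppress platelet aggregability",
]


def cooccurrence_graph(statements, concepts):
    """Link two concepts whenever both appear in the same statement."""
    graph = defaultdict(set)
    for statement in statements:
        found = [c for c in concepts if c in statement.lower()]
        for a, b in combinations(found, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def indirect_links(graph, start):
    """Map each concept C with no direct link to `start` to the intermediates B joining them."""
    direct = graph.get(start, set())
    candidates = defaultdict(list)
    for b in sorted(direct):
        for c in sorted(graph[b]):
            if c != start and c not in direct:
                candidates[c].append(b)
    return dict(candidates)


graph = cooccurrence_graph(STATEMENTS, CONCEPTS)
print(indirect_links(graph, "migraine"))
```

In this toy data, no statement mentions migraines and magnesium together, yet three intermediate concepts connect them. That is only an indication of a potential link, which matches the caution in the notes above.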