Text Mining
Presenter: Gokul K S
Text mining is also known as Text Data Mining (TDM)
and Knowledge Discovery in Textual Databases (KDT).
It is the process of identifying novel information
in a collection of text.
What are text databases?
Comparison
Data Mining
• Processes the data directly
• Identifies causal relationships
• Works on structured numeric transaction data residing in a relational data warehouse
Text Mining
• Requires linguistic processing or natural language processing (NLP)
• Discovers heretofore unknown information
Data Mining / Knowledge Discovery
Applies to: Structured Data, Multimedia, Free Text, Hypertext
Structured data example:
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years
)
Free text example:
Frank Rizzo bought his home from Lake View Real Estate in 1992.
He paid $200,000 under a 15-year loan from MW Financial.
Hypertext example:
<a href>Frank Rizzo</a> Bought <a href>this home</a>
from <a href>Lake View Real Estate</a> In <b>1992</b>.
<p>...
Loans($200K,[map],...)
Information Retrieval
• The science of searching for:
◦ Information in documents
◦ Documents themselves
◦ Metadata that describes documents
◦ Text, sound, images, or data within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets
Information Retrieval (cont.)
• A field developed in parallel with database systems
• Information is organized into (a large number of) documents
• Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents
Basic Measures for Text Retrieval
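Precision
Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., "correct" responses)
[Figure: Venn diagram inside All Documents showing the Relevant set, the Retrieved set, and their overlap, Relevant & Retrieved]
\[ \mathrm{precision} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Retrieved}\}|} \]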
Recall
Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
\[ \mathrm{recall} = \frac{|\{\mathrm{Relevant}\} \cap \{\mathrm{Retrieved}\}|}{|\{\mathrm{Relevant}\}|} \]
Trade-off
○ F-score: a measure of the trade-off between recall and precision, defined as their harmonic mean:
\[ F\_\mathrm{score} = \frac{\mathrm{recall} \times \mathrm{precision}}{(\mathrm{recall} + \mathrm{precision}) / 2} \]
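The three measures above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `precision_recall_f`, the document IDs, and the choice to return 0.0 when a denominator is empty are all assumptions made for the example.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, F-score) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|

    # Assumption: an empty denominator yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    if precision + recall == 0:
        f_score = 0.0
    else:
        # Harmonic mean, written in the same form as the slide's formula.
        f_score = (recall * precision) / ((recall + precision) / 2)
    return precision, recall, f_score


# Hypothetical example: four relevant documents, three retrieved, two in common.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d3", "d4", "d5"}
p, r, f = precision_recall_f(relevant, retrieved)
print(p, r, f)
```

In this example precision is 2/3, recall is 1/2, and the F-score is 4/7, which sits between the two but closer to the smaller value, as a harmonic mean does.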
Text Retrieval Methods
• Document Selection
• Boolean Model
A typical method of this category is the Boolean retrieval model, in which a
document is represented by a set of keywords and a user provides a
Boolean expression of keywords, such as "car and repair shops," "tea or
coffee," or "database systems but not Oracle."
The Boolean model predicts that each document is either relevant or
nonrelevant based on the match of the document to the query.
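A rough sketch of Boolean document selection follows. The tiny keyword sets, the document IDs, and the helper `boolean_select` are hypothetical; the three queries mirror the example expressions quoted above, with each Boolean expression written as a Python predicate over a document's keyword set.

```python
# Hypothetical collection: each document is represented by a set of keywords.
docs = {
    "d1": {"car", "repair", "shop"},
    "d2": {"tea", "coffee"},
    "d3": {"database", "systems", "oracle"},
    "d4": {"database", "systems"},
}


def boolean_select(docs, predicate):
    """Return the IDs of documents whose keyword set satisfies the predicate."""
    return sorted(doc_id for doc_id, keywords in docs.items() if predicate(keywords))


# "car and repair"
car_and_repair = boolean_select(docs, lambda k: "car" in k and "repair" in k)
# "tea or coffee"
tea_or_coffee = boolean_select(docs, lambda k: "tea" in k or "coffee" in k)
# "database systems but not Oracle"
db_not_oracle = boolean_select(
    docs, lambda k: {"database", "systems"} <= k and "oracle" not in k
)

print(car_and_repair, tea_or_coffee, db_not_oracle)
```

Each document is either selected or not; there is no notion of one match being better than another, which is what the ranking methods on the following slides add.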
Document ranking
Document ranking methods use the query to
rank all documents in the order of relevance.
Document ranking
Basic techniques
• Stop list
Set of words that are deemed "irrelevant", even though they may
appear frequently
◦ E.g., a, the, of, for, to, with, etc.
◦ Stop lists may vary when the document set varies
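Stop-list filtering can be sketched as follows. The stop list here contains only the six example words from the slide, and whitespace tokenization with lowercasing is an assumption made for brevity; a real stop list would be chosen per document set.

```python
# Stop list taken from the slide's examples; real lists vary with the document set.
STOP_LIST = {"a", "the", "of", "for", "to", "with"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Lowercase the text, split on whitespace, and drop stop-list words."""
    return [word for word in text.lower().split() if word not in stop_list]


tokens = remove_stop_words("The science of searching for information with a query")
print(tokens)
```

Only "science", "searching", "information", and "query" remain, so the frequent but uninformative words no longer influence ranking.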
Document ranking
◦ Word stem
Several words are small syntactic variants of each other since they share a
common word stem
E.g., drug, drugs, drugged
◦ A term frequency table
Each entry frequent_table(i, j) = number of occurrences of the word t_i in
document d_j
◦ Usually, the ratio instead of the absolute number of occurrences is used
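The sketch below combines the two ideas: words are reduced to a stem, then a term frequency table of ratios is built per document. The stemmer `naive_stem` is a deliberately crude, hypothetical rule set that only handles the slide's example ("drug", "drugs", "drugged"); it is not a real stemming algorithm such as Porter's. The table is stored per document and keyed by stem, which is one possible layout rather than the slide's (i, j) indexing.

```python
from collections import Counter


def naive_stem(word):
    """Toy stemmer (an assumption for illustration): strips 'ed' or a final 's'."""
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Collapse a doubled final consonant left behind, e.g. "drugg" -> "drug".
        if word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word


def term_frequency_table(documents):
    """For each document, map each stem to its share of the document's words."""
    table = []
    for text in documents:
        stems = [naive_stem(word) for word in text.lower().split()]
        counts = Counter(stems)
        total = len(stems)
        table.append({stem: count / total for stem, count in counts.items()})
    return table


docs = ["drug drugs drugged patient", "patient recovered"]
table = term_frequency_table(docs)
print(table)
```

In the first document the three variants all count toward the stem "drug", giving it a ratio of 0.75 against 0.25 for "patient".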
Document ranking
◦ Term Frequency (TF)
Let the term frequency be the number of occurrences of term t in the
document d, that is, freq(d, t). The (weighted) term-frequency
matrix TF(d, t) measures the association of a term t with respect to
the given document d: it is generally defined as 0 if the document
does not contain the term, and nonzero otherwise.
\[ TF(d,t) = \begin{cases} 0 & \text{if } \mathrm{freq}(d,t) = 0 \\ 1 + \log(1 + \log(\mathrm{freq}(d,t))) & \text{otherwise.} \end{cases} \]
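As a small sketch, the weighting above can be written directly as a function of the raw count freq(d, t). The slide does not state the base of the logarithm, so the natural logarithm used here is an assumption, as is the function name `tf`.

```python
import math


def tf(freq):
    """Dampened term frequency: 0 for an absent term, 1 + log(1 + log(freq)) otherwise."""
    if freq == 0:
        return 0.0
    # Assumption: natural logarithm; the source does not specify the base.
    return 1.0 + math.log(1.0 + math.log(freq))


for count in (0, 1, 2, 10, 100):
    print(count, round(tf(count), 4))
```

A single occurrence gives a weight of exactly 1, and the double logarithm makes the weight grow very slowly after that, so a term that appears 100 times is not treated as 100 times more important than one that appears once.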
Document ranking
Inverse document frequency (IDF)
◦ Represents the scaling factor, or the importance, of a term t.
○ If a term t occurs in many documents, its importance will be
scaled down due to its reduced discriminative power.
\[ IDF(t) = \log \frac{1 + |d|}{|d_t|} \]
Here d is the document collection and d_t is the set of documents containing term t.
If |d_t| << |d|, the term t will have a large IDF scaling factor, and vice
versa.
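The IDF formula can be sketched as below. The four-document collection and the function name `idf` are made up for illustration, the natural logarithm is an assumption, and raising an error for a term that appears in no document is a design choice of this sketch, since the formula is undefined in that case.

```python
import math


def idf(term, documents):
    """Inverse document frequency: log((1 + |d|) / |d_t|) over a list of term sets."""
    containing = sum(1 for doc in documents if term in doc)  # |d_t|
    if containing == 0:
        # Assumption: the formula is undefined here, so fail loudly.
        raise ValueError(f"term {term!r} does not occur in the collection")
    return math.log((1 + len(documents)) / containing)


# Hypothetical collection: "systems" is in every document, "oracle" in only one.
documents = [
    {"database", "systems", "oracle"},
    {"database", "systems"},
    {"operating", "systems"},
    {"information", "systems"},
]
common = idf("systems", documents)
rare = idf("oracle", documents)
print(common, rare)
```

The rare term receives the larger scaling factor (log 5 against log 1.25), matching the observation that a term occurring in many documents has reduced discriminative power.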
Document ranking
○ In a complete vector-space model, TF and IDF are combined
to form the TF-IDF measure:
TF-IDF(d, t) = TF(d, t) × IDF(t)
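A compact sketch of the combined measure is given below: every document becomes a sparse vector that maps each of its terms to TF(d, t) × IDF(t), using the TF and IDF definitions from the preceding slides. The corpus, the function name `tf_idf_vectors`, and the natural logarithm are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf_vectors(token_lists):
    """Return one {term: TF-IDF weight} dictionary per tokenized document."""
    n_docs = len(token_lists)
    # Document frequency |d_t|: the number of documents that contain each term.
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))

    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            term: (1 + math.log(1 + math.log(count)))      # TF(d, t), count >= 1
            * math.log((1 + n_docs) / doc_freq[term])       # IDF(t)
            for term, count in counts.items()
        })
    return vectors


# Hypothetical, already tokenized corpus.
corpus = [["tea", "coffee", "tea"], ["coffee", "shop"], ["car", "repair", "shop"]]
vectors = tf_idf_vectors(corpus)
print(vectors[0])
```

In the first document "tea" outweighs "coffee" for two reasons: it occurs twice there, and it appears in only one of the three documents.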
Document ranking
Similarity based
◦ Finds similar documents based on a set of common keywords
◦ The answer should be based on the degree of relevance, determined by the
nearness of the keywords, relative frequency of the keywords, etc.
◦ Measures the closeness of a document to a query (a set of keywords)
\[ \mathrm{sim}(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|} \]
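The cosine measure above can be sketched for sparse term-weight vectors stored as dictionaries. The vectors, the function name `cosine_similarity`, and the convention of returning 0.0 for an all-zero vector are assumptions of this example.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(weight * weight for weight in v1.values()))
    norm2 = math.sqrt(sum(weight * weight for weight in v2.values()))
    if norm1 == 0 or norm2 == 0:
        # Assumption: an empty or all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical query and documents.
query = {"tea": 1.0, "coffee": 1.0}
doc_a = {"tea": 2.0, "coffee": 2.0}   # same direction as the query
doc_b = {"car": 1.0, "repair": 1.0}   # shares no terms with the query

sim_a = cosine_similarity(query, doc_a)
sim_b = cosine_similarity(query, doc_b)
print(sim_a, sim_b)
```

Documents can then be ranked by their similarity to the query: doc_a scores about 1 because it points in the same direction regardless of its length, while doc_b scores 0.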
Thanks!