Text Mining

Presenter: Gokul K S

Text mining is also known as Text Data Mining (TDM)
and Knowledge Discovery in Textual Databases (KDT).
It is the process of identifying novel information
from a collection of texts.

What are text databases?

Comparison

Data Mining
• Processes data directly
• Identifies causal relationships
• Works on structured numeric transaction data residing in a relational data warehouse

Text Mining
• Requires linguistic processing or natural language processing (NLP)
• Discovers heretofore unknown information

Data Mining / Knowledge Discovery
Kinds of data: Structured Data, Multimedia, Free Text, Hypertext

Structured Data:
HomeLoan (
  Loanee: Frank Rizzo
  Lender: MWF
  Agency: Lake View
  Amount: $200,000
  Term: 15 years
)

Multimedia:
Loans($200K,[map],...)

Free Text:
Frank Rizzo bought his home from Lake View Real Estate in 1992.
He paid $200,000 under a 15-year loan from MW Financial.

Hypertext:
<a href>Frank Rizzo</a> bought <a href>this home</a>
from <a href>Lake View Real Estate</a> in <b>1992</b>.
<p>...

Information Retrieval
• The science of searching for:
  ◦ Information in documents
  ◦ Documents themselves
  ◦ Metadata which describes documents
  ◦ Text, sound, images, or data within databases: relational stand-alone databases or hypertext networked databases such as the Internet or intranets

Information Retrieval (cont.)
• A field developed in parallel with database systems
• Information is organized into (a large number of) documents
• Information retrieval problem: locating relevant documents based on user input, such as keywords or example documents

Basic Measures for Text Retrieval

Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)

$$\text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|}$$

[Figure: Venn diagram within All Documents showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved)]



Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved

$$\text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|}$$



Trade-off
○ F-score: a commonly used trade-off between recall and precision, defined as their harmonic mean:

$$F\text{-score} = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision})/2}$$
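The three measures above can be computed directly from sets of document identifiers. The following is a minimal sketch, not a reference implementation: the function name `retrieval_metrics` and the example document ids are assumptions made for illustration, and returning 0 when a denominator is empty is a convention chosen here.

```python
def retrieval_metrics(relevant, retrieved):
    """Return (precision, recall, f_score) for two collections of document ids."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|
    # Assumption: an empty denominator yields 0.0 instead of an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    # Harmonic mean of recall and precision.
    f_score = (recall * precision) / (total / 2) if total else 0.0
    return precision, recall, f_score


# Hypothetical example: 4 relevant documents, 5 retrieved, 3 in common.
precision, recall, f_score = retrieval_metrics({1, 2, 3, 4}, {2, 3, 4, 5, 6})
print(precision, recall, f_score)
```

Because the F-score is a harmonic mean, it stays low unless both precision and recall are reasonably high.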



Text Retrieval Methods
• Document Selection
  ◦ Boolean Model
A typical method of this category is the Boolean retrieval model, in which a
document is represented by a set of keywords and a user provides a
Boolean expression of keywords, such as “car and repair shops,” “tea or
coffee,” or “database systems but not Oracle.”
The Boolean model predicts that each document is either relevant or
non-relevant based on the match of a document to the query.
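As a rough illustration of Boolean selection, the sketch below treats each document as a set of keywords and evaluates a query made of required, optional, and excluded terms. The helper `boolean_match`, its parameter names, and the sample documents (including the keyword "postgres") are hypothetical; a real system would parse arbitrary Boolean expressions.

```python
def boolean_match(doc_keywords, must=(), any_of=(), must_not=()):
    """Return True if the keyword set satisfies the Boolean query.

    must     -> all of these keywords must be present (AND)
    any_of   -> at least one must be present, if any are given (OR)
    must_not -> none of these may be present (NOT)
    """
    keywords = set(doc_keywords)
    if not all(k in keywords for k in must):
        return False
    if any_of and not any(k in keywords for k in any_of):
        return False
    return not any(k in keywords for k in must_not)


# Hypothetical document collection, each document reduced to keywords.
docs = {
    "d1": {"database", "systems", "oracle"},
    "d2": {"database", "systems", "postgres"},
    "d3": {"tea", "coffee"},
}

# "database systems but not Oracle"
selected = sorted(
    name for name, kw in docs.items()
    if boolean_match(kw, must=("database", "systems"), must_not=("oracle",))
)
print(selected)
```

Each document is either selected or not; the model produces no ordering among the selected documents.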

Document ranking
Document ranking methods use the query to
rank all documents in the order of relevance.

Document ranking
Basic techniques
• Stop list
  ◦ A set of words that are deemed “irrelevant”, even though they may appear frequently
  ◦ E.g., a, the, of, for, to, with, etc.
  ◦ Stop lists may vary when the document set varies
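Stop-word filtering can be sketched in a few lines. The stop list below contains only the example words from the slide; the function name `remove_stop_words` and the whitespace tokenization are simplifying assumptions.

```python
# Example stop list taken from the words named above; real lists are longer
# and depend on the document set.
STOP_LIST = {"a", "the", "of", "for", "to", "with"}


def remove_stop_words(text, stop_list=STOP_LIST):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in stop_list]


tokens = remove_stop_words("The science of searching for information")
print(tokens)
```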

Document ranking
• Word stem
  ◦ Several words are small syntactic variants of each other since they share a common word stem
  ◦ E.g., drug, drugs, drugged
• A term frequency table
  ◦ Each entry frequent_table(i, j) = number of occurrences of the word t_i in document d_j
  ◦ Usually, the ratio instead of the absolute number of occurrences is used
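The sketch below builds a small term frequency table that stores ratios instead of raw counts, after mapping word variants to a common stem. The `STEMS` lookup table is a hypothetical stand-in for a real stemming algorithm, and `term_frequency_table` is an illustrative name.

```python
from collections import Counter

# Hypothetical stem lookup; a real system would use a stemming algorithm.
STEMS = {"drugs": "drug", "drugged": "drug"}


def term_frequency_table(docs):
    """Map each document name to {stem: occurrences / total words}."""
    table = {}
    for name, text in docs.items():
        words = [STEMS.get(w, w) for w in text.lower().split()]
        counts = Counter(words)
        total = len(words)
        # Store the ratio rather than the absolute number of occurrences.
        table[name] = {term: n / total for term, n in counts.items()}
    return table


table = term_frequency_table({
    "d1": "drug drugs drugged test",
    "d2": "test test",
})
print(table)
```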

Document ranking
◦ Term frequency (TF)
Let the term frequency be the number of occurrences of term t in the
document d, that is, freq(d, t). The (weighted) term-frequency
matrix TF(d, t) measures the association of a term t with respect to
the given document d: it is generally defined as 0 if the document
does not contain the term, and nonzero otherwise. For example:

$$TF(d,t) = \begin{cases} 0 & \text{if } freq(d,t) = 0 \\ 1 + \log(1 + \log(freq(d,t))) & \text{otherwise} \end{cases}$$
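A small sketch of this weighting follows. The slide does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `tf` is illustrative.

```python
import math


def tf(freq):
    """Weighted term frequency: 0 for absent terms, damped growth otherwise.

    Assumption: natural logarithm (the source does not specify the base).
    """
    if freq == 0:
        return 0.0
    return 1.0 + math.log(1.0 + math.log(freq))


for count in (0, 1, 10, 100):
    print(count, tf(count))
```

The nested logarithm dampens the effect of very frequent terms: a term that occurs 100 times receives a weight only modestly larger than one that occurs 10 times.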



Document ranking
Inverse document frequency (IDF)
◦ Represents the scaling factor, or the importance, of a term t
○ If a term t occurs in many documents, its importance will be scaled down due to its reduced discriminative power

$$IDF(t) = \log\frac{1 + |d|}{|d_t|}$$

Here d is the document collection and d_t is the set of documents containing term t.
If |d_t| << |d|, the term t will have a large IDF scaling factor, and vice versa.
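The IDF factor can be sketched as follows, with each document represented as a set of terms. The natural logarithm, the function name `idf`, the sample collection, and the choice to raise an error for a term that appears in no document are all assumptions.

```python
import math


def idf(term, docs):
    """IDF(t) = log((1 + |d|) / |d_t|) over a list of term sets."""
    containing = sum(1 for terms in docs if term in terms)  # |d_t|
    if containing == 0:
        # Assumption: IDF is left undefined for terms absent from the collection.
        raise ValueError(f"term {term!r} does not occur in any document")
    return math.log((1 + len(docs)) / containing)


# Hypothetical collection of four documents.
docs = [
    {"text", "mining"},
    {"text", "retrieval"},
    {"text", "database"},
    {"data", "mining"},
]
common = idf("text", docs)      # occurs in 3 of 4 documents
rare = idf("retrieval", docs)   # occurs in 1 of 4 documents
print(common, rare)
```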



Document ranking
○ In a complete vector-space model, TF and IDF are combined to form the TF-IDF measure:
TF-IDF(d, t) = TF(d, t) × IDF(t)
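Putting the two factors together, the sketch below turns tokenized documents into TF-IDF weight vectors. It assumes the TF and IDF definitions 1 + log(1 + log(freq)) and log((1 + |d|) / |d_t|) with natural logarithms; the function name `tf_idf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents containing each term (|d_t|).
    doc_freq = Counter(term for tokens in docs for term in set(tokens))
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (1 + math.log(1 + math.log(freq)))       # TF(d, t)
            * math.log((1 + n_docs) / doc_freq[term])      # IDF(t)
            for term, freq in counts.items()
        })
    return vectors


docs = [["text", "mining", "text"], ["text", "retrieval"], ["data", "mining"]]
vectors = tf_idf_vectors(docs)
print(vectors[1])
```

In the second document, "retrieval" and "text" both occur once, so the rarer term "retrieval" ends up with the larger weight.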

Document ranking
Similarity based
• Finds similar documents based on a set of common keywords
• The answer should be based on the degree of relevance, determined by the nearness of the keywords, the relative frequency of the keywords, etc.
• A similarity metric measures the closeness of a document to a query (a set of keywords); a representative metric is the cosine measure:

$$sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1|\,|v_2|}$$
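The cosine measure can be sketched over sparse vectors stored as dictionaries, which also gives a simple way to rank documents against a query. The function name `cosine_similarity`, the example weights, and the convention of returning 0 for a zero-length vector are assumptions.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # assumption: similarity with an empty vector is 0
    return dot / (norm1 * norm2)


# Hypothetical query and document vectors.
query = {"text": 1.0, "mining": 1.0}
documents = {
    "doc_a": {"text": 2.0, "mining": 2.0},
    "doc_b": {"text": 1.0, "database": 3.0},
    "doc_c": {"data": 1.0, "warehouse": 1.0},
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(
    documents,
    key=lambda name: cosine_similarity(query, documents[name]),
    reverse=True,
)
print(ranking)
```

Because the measure normalizes by vector length, a document that repeats the query terms more often is not automatically favored over a shorter one with the same term proportions.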


