2. Text mining is also known as Text Data Mining (TDM)
and Knowledge Discovery in Textual Databases (KDT).
It is the process of identifying novel information
from a collection of text.
4. Comparison
Data Mining
◦Processes data directly
◦Identifies causal relationships
◦Works on structured numeric transaction data residing in a relational data warehouse
Text Mining
◦Applies linguistic processing or natural language processing (NLP)
◦Discovers heretofore unknown information
5. Data Mining / Knowledge Discovery
Structured Data
HomeLoan (
Loanee: Frank Rizzo
Lender: MWF
Agency: Lake View
Amount: $200,000
Term: 15 years
)
Free Text
Frank Rizzo bought his home from Lake View Real Estate in 1992.
He paid $200,000 under a 15-year loan from MW Financial.
Hypertext
<a href>Frank Rizzo</a> bought <a href>this home</a> from <a href>Lake View Real Estate</a> in <b>1992</b>.
<p>...
Multimedia
Loans($200K,[map],...)
6. Information Retrieval
The science of searching for:
◦Information in documents
◦Documents themselves
◦Metadata that describe documents
◦Text, sound, images, or data within databases, whether relational stand-alone databases or hypertext networked databases such as the Internet or intranets
7. Information Retrieval (cont.)
A field developed in parallel with database
systems
Information is organized into (a large
number of) documents
Information retrieval problem: locating
relevant documents based on user input,
such as keywords or example documents
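9. Precision
Precision: the percentage of retrieved documents that
are in fact relevant to the query (i.e., “correct”
responses)
[Figure: Venn diagram within All Documents showing the Relevant set, the Retrieved set, and their overlap (Relevant & Retrieved)]
\[ \text{precision} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Retrieved}\}|} \]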
10. Recall
Recall: the percentage of documents that are relevant
to the query and were, in fact, retrieved
\[ \text{recall} = \frac{|\{\text{Relevant}\} \cap \{\text{Retrieved}\}|}{|\{\text{Relevant}\}|} \]
11. Trade-off
○F-score: a measure of the trade-off between the two, defined as the harmonic mean of
recall and precision:
\[ F\_score = \frac{\text{recall} \times \text{precision}}{(\text{recall} + \text{precision}) / 2} \]
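As a rough illustration, precision, recall, and the F-score can be computed from two sets of document IDs. The function name and the sample IDs are assumptions made for this sketch; returning 0.0 when a denominator is empty is also a choice made here, not something the slides specify.

```python
def precision_recall_f(relevant, retrieved):
    """Return (precision, recall, f_score) for two collections of document IDs."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)  # |{Relevant} ∩ {Retrieved}|

    # Guard against empty sets (a convention chosen for this sketch).
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # Harmonic mean: (recall * precision) / ((recall + precision) / 2)
    mean = (recall + precision) / 2
    f_score = (recall * precision) / mean if mean else 0.0
    return precision, recall, f_score


# Hypothetical example: 4 relevant documents, 3 retrieved, 2 in common.
p, r, f = precision_recall_f(relevant={1, 2, 3, 4}, retrieved={3, 4, 5})
print(p, r, f)
```

With these sample sets, two of the three retrieved documents are relevant, and two of the four relevant documents were retrieved.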
12. Text Retrieval Methods
Document Selection
Boolean Model
A typical method of this category is the Boolean retrieval model, in which a
document is represented by a set of keywords and a user provides a
Boolean expression of keywords, such as “car and repair shops,” “tea or
coffee,” or “database systems but not Oracle.”
The Boolean model predicts that each document is either relevant or non-
relevant based on the match of a document to the query
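A minimal sketch of the Boolean model, assuming each document is stored as a set of keywords. The document IDs, keyword sets, and the helper name `docs_with` are hypothetical; the three queries mirror the examples above using set intersection (and), union (or), and difference (but not).

```python
# Hypothetical collection: each document is represented by a set of keywords.
DOCS = {
    "d1": {"car", "repair", "shops"},
    "d2": {"tea", "coffee"},
    "d3": {"database", "systems", "oracle"},
    "d4": {"database", "systems", "postgres"},
}


def docs_with(term, docs=DOCS):
    """Return the IDs of all documents whose keyword set contains `term`."""
    return {doc_id for doc_id, keywords in docs.items() if term in keywords}


# "car and repair" -> intersection
car_and_repair = docs_with("car") & docs_with("repair")
# "tea or coffee" -> union
tea_or_coffee = docs_with("tea") | docs_with("coffee")
# "database systems but not Oracle" -> intersection, then difference
db_not_oracle = (docs_with("database") & docs_with("systems")) - docs_with("oracle")

print(car_and_repair, tea_or_coffee, db_not_oracle)
```

Each document either matches the expression or it does not; there is no ranking among the matches.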
14. Document ranking
Basic techniques
Stop list
Set of words that are deemed “irrelevant”, even though they may
appear frequently
◦E.g., a, the, of, for, to, with, etc.
◦Stop lists may vary when document set varies
15. Document ranking
◦Word stem
Several words are small syntactic variants of each other since they share a
common word stem
E.g., drug, drugs, drugged
◦A term frequency table
Each entry frequent_table(i, j) = # of occurrences of the word t_i in
document d_j
◦Usually, the ratio instead of the absolute number of occurrences is used
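The three basic techniques (stop list, word stem, term frequency table) might be combined roughly as follows. This is only a sketch: the stop list is the small example set given above, `toy_stem` is a deliberately crude suffix stripper written for this illustration (not a real stemming algorithm), and the sample document is made up.

```python
from collections import Counter

# Example stop list from the slides; real stop lists vary with the document set.
STOP_WORDS = {"a", "the", "of", "for", "to", "with"}


def toy_stem(word):
    """Crude illustrative stemmer: strip one common suffix, then undouble."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant left behind, e.g. "drugg" -> "drug".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


def term_frequency_table(docs):
    """Map doc_id -> {stem: ratio of occurrences}, ignoring stop words."""
    table = {}
    for doc_id, text in docs.items():
        stems = [toy_stem(w) for w in text.lower().split() if w not in STOP_WORDS]
        counts = Counter(stems)
        total = len(stems)
        # Store the ratio rather than the absolute count.
        table[doc_id] = {t: c / total for t, c in counts.items()} if total else {}
    return table


# Hypothetical one-document collection.
table = term_frequency_table({"d1": "the drug drugs drugged the patient"})
print(table)
```

Here "drug", "drugs", and "drugged" all collapse to the same stem, so they are counted as one term.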
16. Document ranking
◦Term Frequency (TF)
Let the term frequency be the number of occurrences of term t in the
document d, that is, freq(d, t). The (weighted) term-frequency
matrix TF(d, t) measures the association of a term t with respect to
the given document d: it is generally defined as 0 if the document
does not contain the term, and nonzero otherwise.
\[ TF(d, t) = \begin{cases} 0 & \text{if } freq(d, t) = 0 \\ 1 + \log(1 + \log(freq(d, t))) & \text{otherwise} \end{cases} \]
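The damped term-frequency weight can be sketched in a few lines. The function name is an assumption for this example, and the natural logarithm is used because the slides do not state a base.

```python
import math


def tf_weight(freq):
    """TF(d, t) given freq(d, t): 0 if absent, else 1 + log(1 + log(freq))."""
    if freq == 0:
        return 0.0
    # Natural log is an assumption; the source does not specify the base.
    return 1 + math.log(1 + math.log(freq))


for freq in (0, 1, 3, 10, 100):
    print(freq, tf_weight(freq))
```

The double logarithm grows very slowly, so a term that occurs 100 times is weighted only modestly higher than one that occurs 10 times.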
17. Document ranking
Inverse document frequency (IDF)
◦Represents the scaling factor, or the importance, of a term t.
○If a term t occurs in many documents, its importance will be
scaled down due to its reduced discriminative power.
\[ IDF(t) = \log \frac{1 + |d|}{|d_t|} \]
where d is the document collection and d_t is the set of documents containing term t.
If |d_t| << |d|, the term t will have a large IDF scaling factor, and vice
versa.
18. Document ranking
○In a complete vector-space model, TF and IDF are combined
to form the TF-IDF measure:
TF-IDF(d, t) = TF(d, t) * IDF(t)
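A small sketch of how the TF-IDF weights could be computed for a toy collection, using TF(d, t) = 1 + log(1 + log(freq(d, t))) and IDF(t) = log((1 + |d|) / |d_t|). The function names and the three sample documents are hypothetical, and the natural logarithm is assumed.

```python
import math
from collections import Counter


def tf(freq):
    """Damped term frequency: 0 if the term is absent."""
    return 0.0 if freq == 0 else 1 + math.log(1 + math.log(freq))


def idf(num_docs, num_docs_with_term):
    """IDF(t) = log((1 + |d|) / |d_t|)."""
    return math.log((1 + num_docs) / num_docs_with_term)


def tf_idf(docs):
    """Map doc_id -> {term: TF(d, t) * IDF(t)} for docs given as term lists."""
    num_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        weights[doc_id] = {
            t: tf(c) * idf(num_docs, doc_freq[t]) for t, c in counts.items()
        }
    return weights


# Hypothetical collection, already tokenized.
weights = tf_idf({"d1": ["tea", "tea", "coffee"], "d2": ["tea"], "d3": ["car"]})
print(weights["d1"])
```

In this toy collection "coffee" outweighs "tea" in d1 even though "tea" occurs twice there, because "tea" also appears in d2 and so discriminates less.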
19. Document ranking
Similarity based
Finds similar documents based on a set of common keywords
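The answer should reflect the degree of relevance, based on the
nearness of the keywords, the relative frequency of the keywords, etc.
A similarity measure gives the closeness of a document to a query (a set of keywords).
◦A common choice is the cosine measure between the two term vectors v_1 and v_2:
\[ sim(v_1, v_2) = \frac{v_1 \cdot v_2}{|v_1| \, |v_2|} \]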
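The cosine measure, sim(v1, v2) = (v1 · v2) / (|v1| |v2|), can be sketched for sparse term vectors stored as dictionaries. The function name, the choice to return 0.0 for an all-zero vector, and the sample weights are assumptions made for this illustration.

```python
import math


def cosine_similarity(v1, v2):
    """Cosine of the angle between two sparse vectors given as {term: weight}."""
    # Dot product over the terms of v1; terms missing from v2 contribute 0.
    dot = sum(weight * v2.get(term, 0.0) for term, weight in v1.items())
    norm1 = math.sqrt(sum(w * w for w in v1.values()))
    norm2 = math.sqrt(sum(w * w for w in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm1 * norm2)


# Hypothetical document and query vectors (weights could be TF-IDF values).
document = {"car": 1.0, "repair": 1.0}
query = {"car": 1.0}
print(cosine_similarity(document, query))
```

Documents can then be ranked by this score against the query vector, highest first.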