This document discusses text mining and provides an outline of the topic. It defines text mining as the analysis of natural language text data and explains why it is useful given the large amount of unstructured data. The document then describes the basic text mining process, which includes steps like filtering, segmentation, stemming, eliminating excessive words, and clustering. Several applications of text mining are mentioned like call centers, anti-spam, and market intelligence. Challenges of text mining like dealing with unstructured data and large collections of documents are also outlined.
3. Introduction
• What is Text Mining?
– Text mining is the analysis of data contained in natural language text.
4. Introduction
• Why Text Mining?
– Massive amounts of new information are being created: the world's data doubles every 18 months (Jacques Vallee, Ph.D.)
– 80-90% of all data is held in various unstructured formats
– Useful information can be derived from this unstructured data
5. Unstructured Data Examples ("Ore")
• Email
• Insurance claims
• News articles
• Web pages
• Patent portfolios
• Customer complaint letters
• Contracts
• Transcripts of phone calls with customers
• Technical documents
6. Reasons for Text Mining
[Figure: bar chart comparing the percentage of data held in collections of text with the percentage held as structured data; axis "Percentage", 0 to 90.]
7. How Text Mining Differs from Data Mining
Data Mining
• Identify data sets
• Select features
• Prepare data
• Analyze distribution
Text Mining
• Identify documents
• Extract features
• Select features by algorithm
• Prepare data
• Analyze distribution
8. Mining
• Filtering: remove punctuation and special characters.
• Segmentation: split the document into words.
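The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, assuming that "special characters" means anything other than letters, digits, and whitespace, and that words are separated by whitespace; the function names `filter_text` and `segment` are hypothetical and do not come from the slides.

```python
import re


def filter_text(text: str) -> str:
    """Filtering: replace punctuation and special characters with spaces."""
    # Assumption: anything that is not a letter, digit, or whitespace
    # (plus the underscore) counts as a special character.
    return re.sub(r"[^\w\s]|_", " ", text)


def segment(text: str) -> list[str]:
    """Segmentation: split the filtered text into lowercase words."""
    return filter_text(text).lower().split()


if __name__ == "__main__":
    print(segment("Text mining, in short: analysis of natural-language text!"))
```

Replacing special characters with spaces, rather than deleting them, keeps hyphenated terms such as "natural-language" from collapsing into a single token.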
9. Mining
• Stemming: techniques used to find the root (stem) of a word.
– E.g.,
– user, users, used, using → stem (root): use
– engineering, engineered, engineer → stem (root): engineer
Usefulness
• Improving the effectiveness of retrieval and text mining
– matching similar words
• Reducing index size
– combining words with the same root may reduce index size by as much as 40-50%.
10. Mining
▪ Basic stemming methods
• Remove endings
– If a word ends with a consonant other than s, followed by an s, then delete the s.
– If a word ends in es, drop the s.
– If a word ends in ing, delete the ing unless the remaining word consists of only one letter or of th.
– If a word ends with ed, preceded by a consonant, delete the ed unless this leaves only a single letter.
– ...
• Transform words
– If a word ends with "ies" but not "eies" or "aies", then "ies" → "y".
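The suffix rules listed on this slide can be tried out with a short Python sketch. It rests on several assumptions that the slide does not state: at most one rule is applied per word, the "ies" rule is checked before the "es" and "s" rules, every letter other than a, e, i, o, u counts as a consonant, and "ies" is rewritten to "y". The function name `simple_stem` is hypothetical.

```python
VOWELS = set("aeiou")


def is_consonant(ch: str) -> bool:
    """Assumption: any letter other than a, e, i, o, u is a consonant."""
    return ch.isalpha() and ch not in VOWELS


def simple_stem(word: str) -> str:
    """Apply the first matching suffix rule from the slide to one word."""
    w = word.lower()
    # Transform rule: "ies" -> "y" unless the word ends in "eies" or "aies".
    # The replacement "y" is an assumption of this sketch.
    if w.endswith("ies") and not w.endswith(("eies", "aies")):
        return w[:-3] + "y"
    # Word ends in "es": drop the final "s".
    if w.endswith("es"):
        return w[:-1]
    # Consonant other than "s", followed by "s": delete the "s".
    if len(w) >= 2 and w.endswith("s") and is_consonant(w[-2]) and w[-2] != "s":
        return w[:-1]
    # Word ends in "ing": delete it unless one letter or "th" would remain.
    if w.endswith("ing"):
        rest = w[:-3]
        return rest if len(rest) > 1 and rest != "th" else w
    # "ed" preceded by a consonant: delete it unless one letter would remain.
    if w.endswith("ed") and len(w) >= 3 and is_consonant(w[-3]):
        rest = w[:-2]
        return rest if len(rest) > 1 else w
    return w


if __name__ == "__main__":
    for word in ["users", "engineered", "engineering", "studies", "thing"]:
        print(word, "->", simple_stem(word))
```

These rules are deliberately crude. With this sketch "using" becomes "us" rather than "use", and "user" is left unchanged, so the rules alone do not reduce every variant to a single shared stem.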
11. Mining
• Eliminate excessive words (commonly called stop words): words that carry no meaning by themselves, such as prepositions, conjunctions, and conditional particles.
This is done by comparing each word against a list of such words.
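The list comparison described here can be sketched as a simple set lookup. The word list below is a tiny made-up sample for illustration only, not the list used by the author, and the name `remove_stop_words` is hypothetical.

```python
# Illustrative sample only: a real list would contain many more entries.
STOP_WORDS = {
    "a", "an", "the",            # articles
    "of", "in", "on", "at", "to",  # prepositions
    "and", "or", "but",          # conjunctions
    "if", "unless",              # conditional particles
}


def remove_stop_words(words: list[str]) -> list[str]:
    """Keep only the words that do not appear in the stop list."""
    return [w for w in words if w.lower() not in STOP_WORDS]


if __name__ == "__main__":
    print(remove_stop_words(["the", "analysis", "of", "text", "in", "documents"]))
```

Storing the list as a set keeps each lookup cheap even when the list grows to several hundred words.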
12. Canonical Names
Variants:
President Bush
Mr. Bush
George Bush
Canonical Name:
George Bush
• The canonical name is the most explicit, least ambiguous name constructed from the different variants found in the document
• Reduces ambiguity of variants
13. Mining
• Clipping: eliminate words that appear with very high or very low frequency.
o Low-frequency words form small clusters that are not useful, and high-frequency words appear almost everywhere, so they are not useful either.
o There are many ways to calculate a word's frequency in one or more documents.
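One possible way to clip is to count, for each word, the number of documents that contain it and to keep only the words inside a chosen band. The sketch below assumes this document-frequency measure; the thresholds `min_df` and `max_ratio`, like the function names, are hypothetical choices made for illustration, since the slide does not fix any particular measure.

```python
from collections import Counter


def document_frequency(docs: list[list[str]]) -> Counter:
    """Count in how many documents each word occurs."""
    df = Counter()
    for words in docs:
        df.update(set(words))  # count each word once per document
    return df


def clip(docs: list[list[str]], min_df: int = 2, max_ratio: float = 0.9) -> list[list[str]]:
    """Drop words found in fewer than min_df documents or in more than
    max_ratio of all documents (both thresholds are assumed values)."""
    df = document_frequency(docs)
    max_df = max_ratio * len(docs)
    keep = {w for w, n in df.items() if min_df <= n <= max_df}
    return [[w for w in words if w in keep] for words in docs]


if __name__ == "__main__":
    docs = [
        ["text", "mining", "finds", "patterns"],
        ["text", "mining", "needs", "data"],
        ["text", "clustering", "groups", "data"],
        ["text", "patterns", "emerge"],
    ]
    print(clip(docs))
```

In this toy collection "text" occurs in every document and is clipped as too frequent, while words that occur in a single document are clipped as too rare.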
15. Text Mining: Analysis
• Which words occur most often?
• Which words are most interesting?
• Which words help define the document?
• What are the interesting text phrases?
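The first three questions can be explored with simple counts. The sketch below uses raw counts for "most present" and TF-IDF weighting as one common stand-in for "interesting" or "defining" words. TF-IDF is an assumption of this example and is not named in the slides, and the function names `most_present` and `tf_idf` are hypothetical.

```python
import math
from collections import Counter


def most_present(words: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Return the n most frequent words in one document with their counts."""
    return Counter(words).most_common(n)


def tf_idf(docs: list[list[str]]) -> list[dict[str, float]]:
    """Score each word per document as term frequency times log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for words in docs:
        df.update(set(words))  # document frequency: one count per document
    scores = []
    for words in docs:
        tf = Counter(words)
        scores.append(
            {w: (c / len(words)) * math.log(n_docs / df[w]) for w, c in tf.items()}
        )
    return scores


if __name__ == "__main__":
    docs = [["text", "mining", "text"], ["text", "data"]]
    print(most_present(docs[0], 1))
    print(tf_idf(docs)[0])
```

A word that appears in every document scores zero under this weighting, which matches the intuition that it does not help distinguish one document from another.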
17. Actual examples
• A clinical center in the USA was able to identify a gene responsible for a harmful disease by processing more than 150,000 newspapers.
• Text mining of the Holy Quran.
• Etc.
18. Challenges in Text Mining
• Information is in unstructured textual form, written in natural language (NL).
• It is not readily accessible to computers.
• Huge collections of documents must be handled.
• A skilled person is required to choose which documents to process and to analyze the output.
• It requires more time.
• Cost: $50,000 for the software alone.
19. More information
• The Central Intelligence Agency (CIA) is the strongest supporter of text mining.
- The events of September 11.
- Mining of e-mail, chat rooms, and social networks.
- It therefore supports many companies, such as Attensity, Inxight, and Intelliseek.
20. More information
• SPSS company statistics: text mining software has very few users compared with data mining software.
21. Conclusion
• Finally, most sources indicate that the field of text mining is still in the research phase,
• and that its applications are still in limited operation at present,
• but the possibilities it offers, helping to make sense of huge amounts of text and to extract the important and useful information at its core, hold promise in many areas.