India is a multilingual nation, and text mining is a growing
research area within data mining. The aim of this work is therefore
to conduct a detailed study of text mining for Indian languages. The
objectives of the research work are as follows:
1. To design a method for representing Indian language documents.
2. To propose an algorithm to categorize documents based
on language and domain.
3. To design a language-independent algorithm to extract
keywords from Indian language documents.
Third DC Meeting
Kannada
ನಮಸ್ಕಾರ, ಶುಭ ಮುಂಜಾನೆ
ಶುಭ ಮಧ್ಯಾಹ್ನ, ಶುಭ ರಾತ್ರಿ
ಶುಭ ಹಾರೈಕೆ (ಗುಡ್ ಬೈ)
ಧನ್ಯವಾದಗಳು
namaskAra, shubha muMjAne
shubha madhyAhna, shubha rAtri
shubha hAraike (guD bai)
dhanyavAdagaLu
Tamil
வணக்கம், (காலை) வணக்கம், (மதிய) வணக்கம்
நல்லிரவாக அமையட்டும்
சென்று வருகிறேன்
நன்றி
vaNakkam (kAlai) vaNakkam (matiya) vaNakkam
~nalliravAka amaiyaTTum
cenRu varukiREn
~nanRi
Telugu
హలో, నమస్కారం
నమస్కారం, నమస్తే
నమస్తే, కృతజ్ఞతలు
halO, namaskAraM
namaskAraM, namastE
namastE, kRutaj~jatalu
The research work is divided into 3 phases:
1. Document representation: vector space model, properties of the corpus
2. Document categorization: based on the language, language-independent classifier
3. Keyword extraction: using TF, IDF, TFIDF
Data preprocessing means converting unstructured data into
structured data.
In a text mining study, a document is the basic unit of analysis.
To analyze a document, the first step is data preprocessing.
Corpus -> Standardization -> Tokenization -> Remove Stop Words -> Stemming -> Vector Space Model
After preprocessing, data mining algorithms can be applied.
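The first stages of this pipeline can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the study: the function names, the punctuation set, and the tiny English stop-word list are all assumptions, and a real system would use a language-specific stop-word list. Stemming is left out here.

```python
import unicodedata

# Hypothetical stop-word list for illustration only; real lists are language specific.
STOP_WORDS = {"the", "is", "a", "of"}


def standardize(text):
    """Normalize to Unicode NFC and lowercase (lowercasing leaves Indic scripts unchanged)."""
    return unicodedata.normalize("NFC", text).lower()


def tokenize(text):
    """Split on whitespace and strip punctuation from the ends of each token."""
    tokens = (tok.strip(".,;:!?()[]\"'") for tok in text.split())
    return [tok for tok in tokens if tok]


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [tok for tok in tokens if tok not in stop_words]


def preprocess(text):
    """Standardization -> tokenization -> stop-word removal."""
    return remove_stop_words(tokenize(standardize(text)))


print(preprocess("The corpus is a collection of documents."))
print(preprocess("ನಮಸ್ಕಾರ, ಶುಭ ಮುಂಜಾನೆ"))
```

Because tokenization only relies on whitespace and punctuation, the same function handles the Kannada, Tamil and Telugu samples without script-specific rules.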
1. What is text mining?
* Strict definition:
The nontrivial extraction of implicit, previously unknown, and
potentially useful information from [textual] data.
* Loose definition:
The science of extracting useful information from large [textual]
data sets.
* Text mining = information retrieval + statistics + artificial
intelligence (natural language processing, machine learning / pattern
recognition)
2. What are the data sources for text mining?
* World Wide Web
3. What is the need for Indian language text mining?
* The Constitution of India allows each Indian state to choose its
own official language for official communication at the state level.
The amount of textual data available in electronic form in the various
Indian regional languages is growing constantly. Indian language text
mining is therefore required.
Stemming is the process of reducing inflected or derived words
to their root form.
Example:
"cats", "catlike", "catty" -> "cat"
"stemmer", "stemming", "stemmed" -> "stem"
"fishing", "fished", "fisher" -> "fish"
"argue", "argued", "argues", "arguing", "argus" -> "argu"
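To make the idea of stemming concrete, here is a toy suffix-stripping stemmer for English. It is an assumption-laden sketch, not the stemmer used in this work and not a full Porter stemmer: the suffix list and the doubled-consonant rule are hypothetical simplifications that happen to reproduce several of the examples above (it does not handle "catlike" or "argues", for instance). A stemmer for an Indian language would need that language's own suffix rules.

```python
# Hypothetical suffix list, checked in this order; chosen only to cover the examples.
SUFFIXES = ("ing", "ed", "er", "s")


def toy_stem(word, min_stem_len=3):
    """Strip one known suffix, then undo consonant doubling ("stemm" -> "stem")."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            word = word[: -len(suffix)]
            break
    # Drop one letter of a doubled final consonant, except for l, s and z.
    if len(word) > min_stem_len and word[-1] == word[-2] and word[-1] not in "aeioulsz":
        word = word[:-1]
    return word


for w in ["cats", "stemmer", "stemming", "stemmed", "fishing", "fished", "fisher", "argued"]:
    print(w, "->", toy_stem(w))
```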
      W1  W2  W3  W4  W5
DOC1   1   0   2   1   0
DOC2   2   0   1   0   3
DOC3   0   1   2   2   0
d_ij is the number of times term j (column Wj) appears in
document i (row DOCi).
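A term-document matrix like the one above can be built directly from tokenized documents. The sketch below is illustrative: the function name and the placeholder tokens `w1` to `w5` are assumptions, chosen so that the output reproduces the example matrix.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocabulary = sorted({tok for doc in documents for tok in doc})
    matrix = []
    for doc in documents:
        counts = Counter(doc)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


# Placeholder tokens standing in for real words (assumption for illustration).
docs = [
    ["w1", "w3", "w3", "w4"],
    ["w1", "w1", "w3", "w5", "w5", "w5"],
    ["w2", "w3", "w3", "w4", "w4"],
]
vocabulary, matrix = term_document_matrix(docs)
print(vocabulary)
for row in matrix:
    print(row)
```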
Language    Kannada  Tamil  Telugu
Docs            100    100     100
Tokens        26315  20360   18427
Vocabulary    20417  15941   14652
Zipf's law describes how word frequencies are distributed
over an entire corpus.
In natural language, there
are a few frequent terms
and many rare terms.
Frequency × rank ≈ constant.
So the frequency of a word is approximately
inversely proportional to its rank.
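The rank-frequency relation can be checked on any tokenized corpus with a short script. The sketch below is an illustration with made-up tokens, not data from the study; the function names are assumptions. It also reports the token and vocabulary counts of the kind listed in the corpus table.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency), most frequent word first."""
    counts = Counter(tokens).most_common()
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(counts, start=1)]


def corpus_properties(tokens):
    """Number of running tokens and number of distinct words (vocabulary)."""
    return {"tokens": len(tokens), "vocabulary": len(set(tokens))}


# Toy token list constructed so that rank * frequency is exactly constant.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
print(corpus_properties(tokens))
```

On real text the product is only roughly constant, and the fit is usually judged on a log-log plot of frequency against rank.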
Objective: To classify the documents based on language.
Documents -> Classifier -> Tamil language documents / Kannada language documents / Telugu language documents
Algorithm
1. Identify the files written in each specific language.
2. Associate a language label with each of the files.
3. Build a corpus C.
4. Preprocess the corpus C.
5. Apply a stemming algorithm to reduce all the words to their root
form.
6. Generate a VSM (term-document matrix) D using binary term
occurrence, where D(i, j) = 1 if the jth term occurs in document i
and 0 otherwise.
7. Train the classifiers (kNN, J48 and NB) using C as training examples.
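The steps above can be sketched end to end with a small nearest-neighbour classifier over binary term-occurrence vectors. This is a minimal illustration under stated assumptions, not the experimental setup of the study: the training "documents" are a handful of transliterated greeting words from the sample slide, the similarity measure (number of shared terms) is an assumption, and stemming is skipped.

```python
from collections import Counter


def binary_vector(tokens, vocabulary):
    """D(i, j) = 1 if term j occurs in the document, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def train(labelled_docs):
    """Build the vocabulary and the binary term-document matrix from (tokens, label) pairs."""
    vocabulary = sorted({tok for tokens, _ in labelled_docs for tok in tokens})
    vectors = [binary_vector(tokens, vocabulary) for tokens, _ in labelled_docs]
    labels = [label for _, label in labelled_docs]
    return vocabulary, vectors, labels


def knn_predict(model, tokens, k=1):
    """Majority label among the k training documents sharing the most terms with the query."""
    vocabulary, vectors, labels = model
    query = binary_vector(tokens, vocabulary)
    overlap = [sum(a * b for a, b in zip(vec, query)) for vec in vectors]
    nearest = sorted(range(len(vectors)), key=lambda i: overlap[i], reverse=True)[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]


# Toy training data (assumption): transliterated words, one short "document" per row.
TRAINING = [
    (["namaskAra", "shubha", "muMjAne"], "kannada"),
    (["shubha", "rAtri", "dhanyavAdagaLu"], "kannada"),
    (["vaNakkam", "kAlai"], "tamil"),
    (["cenRu", "varukiREn", "nanRi"], "tamil"),
    (["halO", "namaskAraM"], "telugu"),
    (["namastE", "kRutajjatalu"], "telugu"),
]
model = train(TRAINING)
print(knn_predict(model, ["shubha", "madhyAhna"]))
print(knn_predict(model, ["vaNakkam", "matiya"]))
```

Query terms that never occurred in training (such as "madhyAhna" here) simply do not appear in the vocabulary and are ignored.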
Confusion Matrix
          kNN Classifier           J48 Classifier           NB Classifier
          Kannada  Tamil  Telugu   Kannada  Tamil  Telugu   Kannada  Tamil  Telugu
Kannada        87      2      11        99      1       0       100      0       0
Tamil           2     96       2         1     97       2         2     98       0
Telugu          4      0      96         4      0      96         5      0      95
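Accuracy follows from a confusion matrix as the sum of the diagonal divided by the sum of all entries. The short sketch below applies that to the three matrices above; the function and variable names are illustrative assumptions.

```python
def accuracy(confusion):
    """Correctly classified documents (diagonal) divided by all documents."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Confusion matrices from the table; row and column order: Kannada, Tamil, Telugu.
KNN = [[87, 2, 11], [2, 96, 2], [4, 0, 96]]
J48 = [[99, 1, 0], [1, 97, 2], [4, 0, 96]]
NB = [[100, 0, 0], [2, 98, 0], [5, 0, 95]]

for name, matrix in [("kNN", KNN), ("J48", J48), ("NB", NB)]:
    print(name, round(100 * accuracy(matrix), 2))
```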
Data mining algorithms that are used for English text categorization
can similarly be applied to Indian language text categorization.
Effectiveness of the classification algorithms:
kNN gives 93% accuracy.
Decision tree (J48) gives 97.33% accuracy.
Naïve Bayes gives 97.67% accuracy.
Naïve Bayes is the most accurate of the three algorithms for Indian
language text categorization.
The objective of this work is to design a language-independent
classifier to categorize the documents based on domain.
Documents -> Language Independent Classifier -> Cinema / Sports / Politics
Case    Domain    No. of documents  No. of tokens  No. of vocabulary  No. of vocabulary after removing stop words
Case 1  Cinema                   5           1378                978                                         943
Case 1  Politics                 5            831                560                                         537
Case 1  Sports                   5            712                426                                         398
Case 2  Cinema                  48          13463               5302                                        5107
Case 2  Sports                  48          24096               8156                                        7934
The predictions of the classification models for case 1 are tabulated in the form of a
confusion matrix.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
The predictions of the classification models for case 2 are tabulated in the form of a confusion matrix.
kNN classifier accuracy = (38+46) / (38+10+2+46) = 87.5%
J48 classifier accuracy = (48+20) / (48+0+28+20) = 70.83%
Case 1 uses only five documents from each of three domains (Cinema, Sports, and Politics).
A confusion matrix is used to measure the accuracy of the classification algorithms.
kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%
J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%
Case 2 uses 48 documents from each of two domains (Cinema, Sports).
A confusion matrix is again used to measure the accuracy of the classification algorithms.
kNN classifier accuracy = (38+46) / (38+10+2+46) = 87.5%
J48 classifier accuracy = (48+20) / (48+0+28+20) = 70.83%
Objective:
Keyword extraction is the task of identifying a small set of
words (keywords) in a document that describe the meaning of
the document.
It should be done systematically, without human intervention,
and the model should be language independent.
Indian language document -> Data preprocessing -> Candidate keywords -> TF, IDF -> Ranking -> Selection of keywords with line/page no.
Algorithm
1. Tokenize the Dravidian language text document.
2. Eliminate stop words and frequent words to obtain the vocabulary words.
3. Store the vocabulary words in a matrix called the vector space
model.
4. Calculate the term frequency (TF), inverse document frequency (IDF) and
TF*IDF for each word.
5. Select the vocabulary words whose TF*IDF exceeds a fixed threshold value.
6. Along with each keyword, extract the corresponding line number, paragraph
number or page number.
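A minimal sketch of these steps is given below. Several details are assumptions made for illustration, since the slides do not state them: TF is taken as the raw count in the document, IDF as ln(N / df) with N the number of documents, each document is a list of text lines, and the toy corpus and threshold are invented. Stop-word elimination is omitted for brevity.

```python
import math
from collections import Counter


def tfidf_keywords(documents, doc_index, threshold):
    """Return {term: (tfidf, line_numbers)} for terms of one document above the threshold.

    documents: list of documents, each a list of text lines (assumed layout).
    """
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for lines in documents:
        df.update({tok for line in lines for tok in line.split()})

    lines = documents[doc_index]
    tf = Counter(tok for line in lines for tok in line.split())

    keywords = {}
    for term, count in tf.items():
        score = count * math.log(n_docs / df[term])  # TF * IDF, natural log assumed
        if score > threshold:
            line_numbers = [n for n, line in enumerate(lines, start=1)
                            if term in line.split()]
            keywords[term] = (score, line_numbers)
    return keywords


# Invented toy corpus: three documents, two lines each.
corpus = [
    ["cinema hero song", "hero film cinema"],
    ["cricket match score", "match win"],
    ["election vote party", "party win"],
]
result = tfidf_keywords(corpus, doc_index=0, threshold=2.0)
for term, (score, line_numbers) in sorted(result.items()):
    print(term, round(score, 4), line_numbers)
```

Raising the threshold returns fewer keywords, so it trades recall for precision; the choice of threshold is what the evaluation that follows is about.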
Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the corpus)
Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved from the corpus)
Recall = TP / (TP+FN) × 100
Precision = TP / (TP+FP) × 100
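These two measures can be computed from the set of retrieved items and the set of relevant items. The sketch below is illustrative; the function name and the example sets are assumptions.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) in percent for two collections of items."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but missed
    precision = 100 * tp / (tp + fp) if retrieved else 0.0
    recall = 100 * tp / (tp + fn) if relevant else 0.0
    return precision, recall


# Invented example: every relevant keyword is retrieved, plus two extra ones.
print(precision_recall(retrieved={"a", "b", "c", "d"}, relevant={"a", "b"}))
```

The example shows that 100% recall does not imply high precision: retrieving more items can only keep or raise recall, while precision may drop.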
In the case of Tamil text, the recall is
100% when TFIDF is 2.3979. We therefore use
2.3979 as the TFIDF threshold value for
extracting keywords: words whose TFIDF
value is greater than 2.3979 are treated
as keywords.
In the case of Kannada text, the recall is
100% when TFIDF is 3.0910. We therefore use
3.0910 as the TFIDF threshold value for
extracting keywords: words whose TFIDF
value is greater than 3.0910 are treated
as keywords.
In the case of Telugu text (Fig 6),
the recall is 100% when TFIDF is 3.4095.
We therefore use 3.4095 as the TFIDF
threshold value for extracting keywords:
words whose TFIDF value is greater than
3.4095 are treated as keywords.
The third phase of the work is keyword extraction. TF*IDF is used to evaluate
how important a word is in a document, and a threshold on TF*IDF is used to
select the important keywords. In the case of Kannada text, the recall is 100%
when TFIDF is 3.0910, so 3.0910 is used as the TFIDF threshold value to extract
the keywords. In the case of Tamil text, the recall is 100% when TFIDF is 2.3979,
so 2.3979 is used as the TFIDF threshold value. In the case of Telugu text, the
recall is 100% when TFIDF is 3.4095, so 3.4095 is used as the TFIDF threshold
value.
33
Third DC Meeting

Weitere ähnliche Inhalte

Ähnlich wie Third DC Meeting1.ppt

Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...CSCJournals
 
A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems IJECEIAES
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...kevig
 
electronics-11-01780-v2.pdf
electronics-11-01780-v2.pdfelectronics-11-01780-v2.pdf
electronics-11-01780-v2.pdfNaveenkushwaha18
 
A-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfA-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfSUDESHNASANI1
 
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...IRJET Journal
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageEditor IJCATR
 
Creation of speech corpus for emotion analysis in Gujarati language and its e...
Creation of speech corpus for emotion analysis in Gujarati language and its e...Creation of speech corpus for emotion analysis in Gujarati language and its e...
Creation of speech corpus for emotion analysis in Gujarati language and its e...IJECEIAES
 
Script identification from printed document images using statistical
Script identification from printed document images using statisticalScript identification from printed document images using statistical
Script identification from printed document images using statisticalIAEME Publication
 
IRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET Journal
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMcscpconf
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir systemcsandit
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...IJECEIAES
 
Keyword Extraction Based Summarization of Categorized Kannada Text Documents
Keyword Extraction Based Summarization of Categorized Kannada Text Documents Keyword Extraction Based Summarization of Categorized Kannada Text Documents
Keyword Extraction Based Summarization of Categorized Kannada Text Documents ijsc
 
Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...IJECEIAES
 
B tech project_report
B tech project_reportB tech project_report
B tech project_reportabhiuaikey
 
Summer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportSummer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportAnwar Jameel
 

Ähnlich wie Third DC Meeting1.ppt (20)

A017420108
A017420108A017420108
A017420108
 
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
Script Identification of Text Words from a Tri-Lingual Document Using Voting ...
 
Applsci 09-02758
Applsci 09-02758Applsci 09-02758
Applsci 09-02758
 
A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems A Context-based Numeral Reading Technique for Text to Speech Systems
A Context-based Numeral Reading Technique for Text to Speech Systems
 
G1803013542
G1803013542G1803013542
G1803013542
 
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
A NOVEL APPROACH FOR NAMED ENTITY RECOGNITION ON HINDI LANGUAGE USING RESIDUA...
 
electronics-11-01780-v2.pdf
electronics-11-01780-v2.pdfelectronics-11-01780-v2.pdf
electronics-11-01780-v2.pdf
 
A-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdfA-STUDY-ON-SENTIMENT-POLARITY.pdf
A-STUDY-ON-SENTIMENT-POLARITY.pdf
 
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
Review and Approaches to Develop Legal Assistance for Lawyers and Legal Profe...
 
Survey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi LanguageSurvey on Indian CLIR and MT systems in Marathi Language
Survey on Indian CLIR and MT systems in Marathi Language
 
Creation of speech corpus for emotion analysis in Gujarati language and its e...
Creation of speech corpus for emotion analysis in Gujarati language and its e...Creation of speech corpus for emotion analysis in Gujarati language and its e...
Creation of speech corpus for emotion analysis in Gujarati language and its e...
 
Script identification from printed document images using statistical
Script identification from printed document images using statisticalScript identification from printed document images using statistical
Script identification from printed document images using statistical
 
IRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb PeopleIRJET- Communication Aid for Deaf and Dumb People
IRJET- Communication Aid for Deaf and Dumb People
 
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEMA LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
A LANGUAGE INDEPENDENT APPROACH TO DEVELOP URDUIR SYSTEM
 
A language independent approach to develop urduir system
A language independent approach to develop urduir systemA language independent approach to develop urduir system
A language independent approach to develop urduir system
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...
 
Keyword Extraction Based Summarization of Categorized Kannada Text Documents
Keyword Extraction Based Summarization of Categorized Kannada Text Documents Keyword Extraction Based Summarization of Categorized Kannada Text Documents
Keyword Extraction Based Summarization of Categorized Kannada Text Documents
 
Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...Identification of monolingual and code-switch information from English-Kannad...
Identification of monolingual and code-switch information from English-Kannad...
 
B tech project_report
B tech project_reportB tech project_report
B tech project_report
 
Summer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) ReportSummer Research Project (Anusaaraka) Report
Summer Research Project (Anusaaraka) Report
 

Kürzlich hochgeladen

HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARKOUSTAV SARKAR
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersMairaAshraf6
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesRAJNEESHKUMAR341697
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network DevicesChandrakantDivate1
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsvanyagupta248
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdfKamal Acharya
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesMayuraD1
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Call Girls Mumbai
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxNadaHaitham1
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdfKamal Acharya
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptNANDHAKUMARA10
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VDineshKumar4165
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxchumtiyababu
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadhamedmustafa094
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaOmar Fathy
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.Kamal Acharya
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXssuser89054b
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEselvakumar948
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationBhangaleSonal
 

Kürzlich hochgeladen (20)

HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKARHAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
HAND TOOLS USED AT ELECTRONICS WORK PRESENTED BY KOUSTAV SARKAR
 
Computer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to ComputersComputer Lecture 01.pptxIntroduction to Computers
Computer Lecture 01.pptxIntroduction to Computers
 
Engineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planesEngineering Drawing focus on projection of planes
Engineering Drawing focus on projection of planes
 
Computer Networks Basics of Network Devices
Computer Networks  Basics of Network DevicesComputer Networks  Basics of Network Devices
Computer Networks Basics of Network Devices
 
AIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech studentsAIRCANVAS[1].pdf mini project for btech students
AIRCANVAS[1].pdf mini project for btech students
 
Hostel management system project report..pdf
Hostel management system project report..pdfHostel management system project report..pdf
Hostel management system project report..pdf
 
DeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakesDeepFakes presentation : brief idea of DeepFakes
DeepFakes presentation : brief idea of DeepFakes
 
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
Bhubaneswar🌹Call Girls Bhubaneswar ❤Komal 9777949614 💟 Full Trusted CALL GIRL...
 
Wadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptxWadi Rum luxhotel lodge Analysis case study.pptx
Wadi Rum luxhotel lodge Analysis case study.pptx
 
School management system project Report.pdf
School management system project Report.pdfSchool management system project Report.pdf
School management system project Report.pdf
 
Block diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.pptBlock diagram reduction techniques in control systems.ppt
Block diagram reduction techniques in control systems.ppt
 
Thermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - VThermal Engineering-R & A / C - unit - V
Thermal Engineering-R & A / C - unit - V
 
Integrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - NeometrixIntegrated Test Rig For HTFE-25 - Neometrix
Integrated Test Rig For HTFE-25 - Neometrix
 
Verification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptxVerification of thevenin's theorem for BEEE Lab (1).pptx
Verification of thevenin's theorem for BEEE Lab (1).pptx
 
kiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal loadkiln thermal load.pptx kiln tgermal load
kiln thermal load.pptx kiln tgermal load
 
Introduction to Serverless with AWS Lambda
Introduction to Serverless with AWS LambdaIntroduction to Serverless with AWS Lambda
Introduction to Serverless with AWS Lambda
 
Employee leave management system project.
Employee leave management system project.Employee leave management system project.
Employee leave management system project.
 
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
 
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLEGEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
GEAR TRAIN- BASIC CONCEPTS AND WORKING PRINCIPLE
 
DC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equationDC MACHINE-Motoring and generation, Armature circuit equation
DC MACHINE-Motoring and generation, Armature circuit equation
 

Third DC Meeting1.ppt

  • 1. India is multilingual nation. Text mining is a growing research area in data mining. So the aim is conduct a detailed study on text mining on Indian language. The objectives of the research work are as follows 1. To design a method for Indian language documents representation 2. To propose an algorithm to categorize documents based on language and domain. 3. To design a language independent algorithm to extract the keywords from all the Indian language documents. 1 Third DC Meeting
  • 4. Kannada ನಮಸ್ಕಾ ರ, ಶುಭ ಮುಂಜಾನೆ ಶುಭ ಮಧ್ಯಾ ಹ್ನ , ಶುಭ ರಾತ್ರ ಿ ಶುಭ ಹಾರೈಕೆ (ಗುಡ್ ಬೈ) ಧನಾ ವಾದಗಳು namaskAra ,shubha muMjAne shubha madhyAhna, shubha rAtri shubha hAraike (guD bai) dhanyavAdagaLu Tamil வணக்கம், (காலை) வணக்கம், (மதிய) வணக்கம் நை்லிரவாக அலமயட் டும் சென ் று வருகிறேன ் நன ் றி vaNakkam (kAlai) vaNakkam (matiya) vaNakkam ~nalliravAka amaiyaTTum cenRu varukiREn ~nanRi Telugu హలో, నమస్కా రం నమస్కా రం, నమస్తే నమస్తే, కృతజ్ఞతలు halO, namaskAraM namaskAraM , namastE namastE , kRutaj~jatalu 4 Third DC Meeting
  • 5. The research work is divided into 3 phases Documents Representation Vector Space Model Properties of Corpus Document categorization Based on the language Language Independent Classifier Keywords extraction Using TF, IDF, TFIDF 5 Third DC Meeting
  • 6. Data Preprocessing means converting unstructured data into structured data. Corpus Standardization Tokenization Remove Stop Words Stemming Vector Space Model After preprocessing data mining algorithms can be applied. In Text mining study, a document is used as a basic unit of analysis. To analyze the document, the first step is data preprocessing 6 Third DC Meeting
  • 7. 1. What is text mining?
        * Strict definition: the nontrivial extraction of implicit, previously unknown, and potentially useful information from [textual] data.
        * Loose definition: the science of extracting useful information from large [textual] data sets.
        * Text mining = information retrieval + statistics + artificial intelligence (natural language processing, machine learning / pattern recognition).
      2. What are the data sources for text mining?
        * World Wide Web
      3. What is the need for Indian language text mining?
        * The Constitution of India makes provision for each Indian state to choose its own official language for official communication at the state level. The amount of textual data available in electronic form in the various Indian regional languages is growing at an accelerating rate. Therefore Indian language text mining is required.
  • 12. Stemming is the process of reducing derived words to their root form. Examples: "cats", "catlike", "catty" -> "cat"; "stemmer", "stemming", "stemmed" -> "stem"; "fishing", "fished", "fisher" -> "fish"; "argue", "argued", "argues", "arguing", "argus" -> "argu".
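As a rough illustration of the idea, a stemmer can be approximated by stripping a known suffix from the end of a word. The suffix list, the function name `strip_suffix`, and the minimum stem length below are assumptions made for this sketch; it is English-only and far simpler than a rule-based stemmer such as Porter's.

```python
# Hypothetical, English-only suffix list, tried in this order.
SUFFIXES = ("ing", "ed", "er", "s")

def strip_suffix(word, suffixes=SUFFIXES, min_stem=3):
    """Remove the first matching suffix, keeping at least min_stem characters."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word

for w in ["fishing", "fished", "fisher", "cats"]:
    print(w, "->", strip_suffix(w))
```

This naive version does not handle doubled consonants ("stemming" becomes "stemm", not "stem"), which is one reason practical stemmers use more elaborate rules.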
  • 13. Vector space model:
              W1  W2  W3  W4  W5
      DOC1     1   0   2   1   0
      DOC2     2   0   1   0   3
      DOC3     0   1   2   2   0
      d_ij is the number of times term j appears in document i.
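A term-document matrix like the one above can be built by counting tokens per document. The sketch below is illustrative only: the function name `term_document_matrix` and the toy documents (chosen so that the counts reproduce the slide's matrix) are assumptions.

```python
from collections import Counter

def term_document_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    vocab = sorted({term for doc in docs for term in doc})
    matrix = []
    for doc in docs:
        counts = Counter(doc)  # missing terms count as 0
        matrix.append([counts[term] for term in vocab])
    return vocab, matrix

# Toy tokenized documents, made up to match the matrix on the slide.
docs = [
    ["w1", "w3", "w3", "w4"],
    ["w1", "w1", "w3", "w5", "w5", "w5"],
    ["w2", "w3", "w3", "w4", "w4"],
]
vocab, matrix = term_document_matrix(docs)
for name, row in zip(["DOC1", "DOC2", "DOC3"], matrix):
    print(name, row)
```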
  • 14. Language      Kannada   Tamil   Telugu
      Docs              100     100      100
      Tokens          26315   20360    18427
      Vocabulary      20417   15941    14652
      Zipf's law describes the behavior of words across an entire corpus.
  • 15. In natural language, there are a few frequent terms and many rare terms.
  • 16. Frequency × rank = constant, so the frequency of a word is inversely proportional to its rank.
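One way to inspect Zipf's law on a corpus is to rank words by frequency and look at the product rank × frequency, which the law predicts to be roughly constant. The function name `rank_frequency` and the toy token list are assumptions for this sketch; real corpora follow the law only approximately.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) tuples, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending frequency; break ties alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, freq, rank * freq)
            for rank, (word, freq) in enumerate(ranked, start=1)]

# Toy tokens constructed so that rank * frequency is exactly constant.
tokens = ["a"] * 6 + ["b"] * 3 + ["c"] * 2
for row in rank_frequency(tokens):
    print(row)
```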
  • 17. Objective: to classify documents by language. Documents → Classifier → Tamil language documents / Kannada language documents / Telugu language documents.
  • 18. Algorithm
      1. Identify the language-specific files.
      2. Associate a language label with each file.
      3. Build a corpus C.
      4. Preprocess the corpus C.
      5. Apply a stemming algorithm to reduce all words to their root form.
      6. Generate the VSM (term-document matrix) using binary term occurrence: D(i, j) = 1 if term j occurs in document i, and 0 otherwise.
      7. Train the classifiers (kNN, J48, and NB) using C as training examples.
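The steps above can be sketched end to end with a very small nearest-neighbour classifier over binary term occurrence. This is an illustrative stand-in, not the kNN, J48, or NB implementations used in the study: the 1-nearest-neighbour rule, the Jaccard similarity, the function names, and the toy romanized training words are all assumptions.

```python
def jaccard(a, b):
    """Similarity of two term sets: shared terms divided by all terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def train(labelled_docs):
    """Binary term occurrence: each document is reduced to its set of terms."""
    return [(set(tokens), label) for tokens, label in labelled_docs]

def predict(model, tokens):
    """Return the label of the most similar training document (1-NN)."""
    terms = set(tokens)
    _, best_label = max(model, key=lambda item: jaccard(item[0], terms))
    return best_label

# Toy training corpus (hypothetical, romanized for readability).
training = [
    (["namaskara", "shubha", "ratri"], "Kannada"),
    (["vanakkam", "nanri"], "Tamil"),
    (["namaskaram", "kritajnatalu"], "Telugu"),
]
model = train(training)
print(predict(model, ["shubha", "namaskara"]))
```

With a real corpus, the token lists would come from the preprocessing and stemming steps, and a held-out test set would be needed to measure accuracy.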
  • 19. Confusion matrix (each row sums to the 100 documents of one language; columns are the predicted language)
                  kNN classifier          J48 classifier          NB classifier
                  Kannada Tamil Telugu    Kannada Tamil Telugu    Kannada Tamil Telugu
      Kannada        87     2     11         99     1     0         100     0     0
      Tamil           2    96      2          1    97     2           2    98     0
      Telugu          4     0     96          4     0    96           5     0    95
  • 20. Data mining algorithms used for English text categorization can similarly be applied to Indian language text categorization. Classification accuracy: kNN gives 93%, the decision tree (J48) gives 97.33%, and Naïve Bayes gives 97.67%. Naïve Bayes is an effective algorithm for Indian language text categorization.
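The accuracies quoted above can be checked from the confusion matrices of slide 19: accuracy is the sum of the diagonal (correctly classified documents) divided by the total number of documents. The sketch assumes the slide's numbers form three 3×3 matrices in Kannada, Tamil, Telugu order; the function name `accuracy` is illustrative.

```python
def accuracy(matrix):
    """Percentage of documents on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

# Confusion matrices as read from slide 19 (layout is an assumption).
knn = [[87, 2, 11], [2, 96, 2], [4, 0, 96]]
j48 = [[99, 1, 0], [1, 97, 2], [4, 0, 96]]
nb = [[100, 0, 0], [2, 98, 0], [5, 0, 95]]

for name, m in [("kNN", knn), ("J48", j48), ("NB", nb)]:
    print(name, round(accuracy(m), 2))
```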
  • 21. The objective of this work is to design a language-independent classifier that categorizes documents by domain. Documents → Language-independent classifier → Cinema / Sports / Politics.
  • 22. S. No.   Domain     No. of documents   No. of tokens   Vocabulary size   Vocabulary size after removing stop words
      Case 1   Cinema            5               1378             978               943
               Politics          5                831             560               537
               Sports            5                712             426               398
      Case 2   Cinema           48              13463            5302              5107
               Sports           48              24096            8156              7934
  • 23. The predictions of the classification models for case 1 are tabulated as a confusion matrix. kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 4/15 = 26.67%. J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 9/15 = 60.00%.
  • 24. The predictions of the classification models for case 2 are tabulated as a confusion matrix. kNN accuracy = (38+46) / (38+10+2+46) = 84/96 = 87.5%. J48 accuracy = (48+20) / (48+0+28+20) = 68/96 = 70.83%.
  • 25. A confusion matrix is used to measure the accuracy of each classification algorithm. In case 1, with only five documents in each of three domains (Cinema, Sports, and Politics): kNN accuracy = (0+0+4) / (0+0+5+0+0+5+1+0+4) = 26.67%; J48 accuracy = (2+3+4) / (2+0+3+0+3+2+0+1+4) = 60.00%. In case 2, with 48 documents in each of two domains (Cinema, Sports): kNN accuracy = (38+46) / (38+10+2+46) = 87.5%; J48 accuracy = (48+20) / (48+0+28+20) = 70.83%.
  • 26. Objective: keyword extraction is the task of identifying a small set of words (keywords) that describe the meaning of a document. It should be done systematically, without human intervention, and the model should be language independent. Indian language document → Data preprocessing → Candidate keywords → TF, IDF → Ranking → Selection of keywords with line/page number.
  • 27. Algorithm
      1. The Dravidian language text document is tokenized.
      2. Stop words and frequent words are eliminated to obtain the vocabulary words.
      3. The vocabulary words are stored as a matrix, the vector space model.
      4. Term frequency (TF), inverse document frequency (IDF), and TF*IDF are calculated for each word.
      5. Keywords are selected from the vocabulary words by fixing a threshold value for TF*IDF.
      6. Along with each keyword, the corresponding line number, paragraph number, or page number is also extracted.
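Steps 4 and 5 of the algorithm above can be sketched as follows. The slides do not state which TF and IDF variants were used, so this sketch assumes raw term counts for TF and the natural logarithm of N/df for IDF; the function name `tfidf_keywords`, the toy documents, and the threshold of 2.0 are likewise illustrative. Step 6 (line or page numbers) is not covered.

```python
import math
from collections import Counter

def tfidf_keywords(docs, threshold):
    """For each tokenized document, return the terms whose TF*IDF exceeds threshold."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    keywords = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts (assumed TF variant)
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        keywords.append(sorted(t for t, s in scores.items() if s > threshold))
    return keywords

# Toy corpus: "the" occurs in every document, so its IDF (and TF*IDF) is 0.
docs = [
    ["cricket", "cricket", "match", "the"],
    ["film", "actor", "the"],
    ["vote", "the", "party"],
]
print(tfidf_keywords(docs, threshold=2.0))
```

In the toy corpus only "cricket" passes the threshold, because it is both repeated within its document and absent from the others.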
  • 28. Recall = (no. of relevant documents retrieved) / (total no. of relevant documents in the corpus) = TP / (TP + FN) × 100
      Precision = (no. of relevant documents retrieved) / (total no. of documents retrieved from the corpus) = TP / (TP + FP) × 100
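The two measures can be computed directly from the set of extracted items and the set of relevant items. The function name `precision_recall` and the example sets are assumptions for this sketch; returning 0.0 when a denominator is zero is a convention chosen here, not something stated in the slides.

```python
def precision_recall(extracted, relevant):
    """Return (precision, recall) as percentages for two collections of items."""
    extracted, relevant = set(extracted), set(relevant)
    tp = len(extracted & relevant)   # relevant items that were retrieved
    fp = len(extracted - relevant)   # retrieved items that are not relevant
    fn = len(relevant - extracted)   # relevant items that were missed
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four items retrieved, two of them relevant, no relevant item missed.
print(precision_recall({"a", "b", "c", "d"}, {"a", "b"}))
```

Lowering a selection threshold generally raises recall at the cost of precision, which is why both are reported.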
  • 29. For Tamil text, recall is 100% when TF*IDF is 2.3979. Therefore 2.3979 is used as the TF*IDF threshold value, and words whose TF*IDF value is greater than 2.3979 are taken as keywords.
  • 30. For Kannada text, recall is 100% when TF*IDF is 3.0910. Therefore 3.0910 is used as the TF*IDF threshold value, and words whose TF*IDF value is greater than 3.0910 are taken as keywords.
  • 31. For Telugu text (Fig. 6), recall is 100% when TF*IDF is 3.4095. Therefore 3.4095 is used as the TF*IDF threshold value, and words whose TF*IDF value is greater than 3.4095 are taken as keywords.
  • 33. The third phase of the work is keyword extraction. TF*IDF is used to evaluate how important a word is in a document, and a TF*IDF threshold is used to select the important keywords. For Kannada text, recall is 100% when TF*IDF is 3.0910, so 3.0910 is used as the threshold value. For Tamil text, recall is 100% when TF*IDF is 2.3979, so 2.3979 is used as the threshold value. For Telugu text, recall is 100% when TF*IDF is 3.4095, so 3.4095 is used as the threshold value.