International Journal of Computational Intelligence Research
ISSN 0973-1873 Volume 5, Number 2 (2009), pp. 225–232
© Research India Publications
http://www.ripublication.com/ijcir.htm
Classification and Identification of Telugu Aksharas
using Moment Invariants and C4.5 Algorithm
C. Srikanth, B. L. Deekshatulu, C. Raghavendra Rao
and Chakravarthy Bhagvati
A.I. Laboratory, Department of Computer and Information Science,
University of Hyderabad, Hyderabad 500046, India.
Abstract
Classifying and recognizing Telugu characters (aksharas) is a challenging task
because of the variations in the script and the large number of characters. The
complexity of the shape is a result of structural compositions involving vowels
(V), consonants (C), consonants with vowel modifiers (CV) and consonant
clusters (CCV). This paper presents a novel classification strategy for
classifying aksharas with the CV structure. This is achieved by constructing
two decision trees — one for consonants and the other for vowel modifiers —
by hybridizing moment invariants and the C4.5 algorithm. The results,
although preliminary and limited in scale, illustrate the potential strengths of
the approach. The uniqueness and strength of the approach perhaps lie in
the divide-and-conquer strategy, in which two orthographically orthogonal
decision trees, each recognizing only a few tens of classes, may in combination
classify hundreds of classes.
Introduction
Optical character recognition (OCR) refers to the field of study dealing with
converting the character images present in a scanned document image into computer
readable text. Usually, the output from an OCR system is text represented in ASCII
for English and Unicode or ISCII for Indian scripts.
Research on OCR systems has a long history for English [1] and several
commercially successful systems, such as ABBYY FineReader, are in existence today.
The situation is different for Indian scripts. Although initial and pioneering papers
appeared in the ’70s, it is only in the past decade that truly systematic development
has been taking place with assistance from the Govt. of India for developing
technology solutions for Indian languages [2]. The results are indicated in a number of
papers on Indian OCR systems in recent ICDAR conferences [3]–[7].
OCR techniques reported in literature view the problem as one of classification,
i.e., assigning a label representing a character to a pattern of pixels comprising an
image of the character. One of the key components of such an approach is classifier
design. The challenge lies in the large character sets seen in many Indian scripts and
therefore the need for a large multiclass classifier. Whereas the English script has fewer
than 100 different symbols (lowercase and uppercase characters, punctuation and special
symbols), most Indian scripts involve classifying a few hundred characters and, for some
languages (Hindi and Bangla), more than a thousand. All the traditional
classifiers such as nearest neighbour [5], hand-crafted decision trees [8], neural
networks [9] and SVMs [10] have been tried with varying levels of success.
In this paper, we present a different idea of classifying and recognizing Telugu
characters using a combination of two decision trees. One decision tree is trained for
recognizing consonants (C) while the other is trained for vowel-modified consonants
(CV). The approach is based on hybridizing the set of seven invariant moments
proposed by M. K. Hu [14] and the C4.5 algorithm of J. Ross Quinlan [11].
The rest of the paper is organized as follows. We present the main characteristics
of the Telugu script in Section 2 followed by our approach in Section 3. Section 4
summarizes the results while in Section 5, we present our conclusions.
Telugu Script
Telugu is the main language of the state of Andhra Pradesh in India, and records indicate its
existence in almost the present form since the 7th century AD. Telugu is a phonetic
language with an ortho-syllabic script that is written from left to right, with each
character generally representing a syllable. The Telugu alphabet is generally said to have
16 vowels, 36 consonants and three special symbols. Modern Telugu consists,
however, of only 12 vowels, 36 consonants and 2 special symbols (shown in Figure
1). Besides these, Telugu script consists of other symbols called vattulu, which are
half-consonants that appear in consonant clusters (CCV). Vowel marks (VM), or
half-vowels, are also used in modifying a consonant with a vowel sound. Examples are
shown in Figure 2, where the first character represents the sound ku and the vowel mark
to the right of the character corresponds to the vowel u. The second character in
the figure is the sound kri, where the circular stroke at the bottom is the vattu
corresponding to the consonant ra and the little circular mark at the top is the vowel
mark corresponding to the vowel i. The complete sequence of vowel-modified
variants of the consonant ka is shown in Figure 3. Such a sequence is called a
gunintham.
(a) Vowels (12)
(b) Consonants (36)
(c) Special symbols (2)
Figure 1: The Telugu alphabet.
Figure 2: Examples of compound characters: the first has a CV structure while the
second has a CCV structure.
Figure 3: The ka gunintham showing the full set of vowel marks.
A gunintham defines a natural class in the Telugu language, and we train our classifier
to recognize the different characters within a gunintham as a single class. All the
vowels are grouped into a single class (labelled ‘1’). The 36 consonants, starting with
ka, are assigned the labels ‘2’ to ‘37’. The vowel marks are assigned
the labels ‘1’ to ‘16’ for the full set of 16 modifiers that appear in older texts.
Our Approach
Our approach is to classify a given unknown character into its base consonant class
using one decision tree and then into a specific vowel mark class using the second
decision tree. The combination of the two gives the precise label for the character.
The features used for both the decision trees are invariant moments. The first decision
tree needs to classify a character into 36 classes while the second outputs one of 16
classes. By using group classification based on guninthams and combining the output
from two relatively small decision trees, which in some sense are orthogonal in their
functionality, we achieve a combined classification capability of 576 classes (36×16).
It is this aspect of our classifier design that is most interesting.
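The way two small label sets combine into 576 joint classes can be illustrated with a short sketch. It assumes the labelling described earlier (consonant codes ‘2’ to ‘37’, vowel mark codes ‘1’ to ‘16’); the function names and the flat 0..575 index are hypothetical and introduced only for illustration.

```python
NUM_CONSONANTS = 36    # consonant codes 2..37 (code 1 is reserved for vowels)
NUM_VOWEL_MARKS = 16   # vowel mark codes 1..16


def combined_label(c_code: int, vm_code: int) -> int:
    """Map a (consonant code, vowel mark code) pair to one index in 0..575."""
    if not 2 <= c_code <= NUM_CONSONANTS + 1:
        raise ValueError(f"consonant code out of range: {c_code}")
    if not 1 <= vm_code <= NUM_VOWEL_MARKS:
        raise ValueError(f"vowel mark code out of range: {vm_code}")
    return (c_code - 2) * NUM_VOWEL_MARKS + (vm_code - 1)


def split_label(index: int) -> tuple:
    """Inverse of combined_label: recover the two codes from the flat index."""
    c_offset, vm_offset = divmod(index, NUM_VOWEL_MARKS)
    return c_offset + 2, vm_offset + 1
```

Each tree only ever has to separate 36 or 16 classes, yet every pair of outputs identifies a distinct character, which is the sense in which the two trees are orthogonal.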
A. Moment invariants
Moment invariants are well-known in literature and characterize properties of
connected regions in binary images. These moments are invariant to translation,
rotation and scaling and hence their name of invariant moments. They are useful
because they provide a fairly simple representation of shape for classification and
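recognition tasks. The seven invariant moments are derived from the normalized
central moments
\[ \eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \]
where
\[ \gamma = \frac{p+q}{2} + 1 \]
for p+q = 2, 3, . . ., and \(\mu_{pq}\) are the standard central moments of 2-D functions.
The set of seven invariant moments is defined as
\[
\begin{aligned}
\phi_1 &= \eta_{20} + \eta_{02} \\
\phi_2 &= (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2 \\
\phi_3 &= (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2 \\
\phi_4 &= (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2 \\
\phi_5 &= (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
       &\quad + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] \\
\phi_6 &= (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03}) \\
\phi_7 &= (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] \\
       &\quad + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]
\end{aligned}
\]
The above definitions are taken from Gonzalez [12].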
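As an illustration, the seven invariants can be computed directly from a binary character image. The following is a minimal sketch using NumPy and the standard definitions of Hu's invariants as given in Gonzalez and Woods; the function name hu_moments and the pixel-grid conventions are assumptions for illustration and are not taken from the authors' implementation.

```python
import numpy as np


def hu_moments(image: np.ndarray) -> np.ndarray:
    """Return Hu's seven invariant moments of a 2-D (binary or grey) image."""
    img = np.asarray(image, dtype=float)
    ys, xs = np.mgrid[:img.shape[0], :img.shape[1]]  # row and column indices
    m00 = img.sum()
    if m00 == 0:
        raise ValueError("image has no foreground pixels")
    xbar = (xs * img).sum() / m00  # centroid, column direction
    ybar = (ys * img).sum() / m00  # centroid, row direction

    def eta(p: int, q: int) -> float:
        # Central moment mu_pq, normalized by mu_00 ** ((p + q) / 2 + 1).
        mu = (((xs - xbar) ** p) * ((ys - ybar) ** q) * img).sum()
        return mu / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03  # sums that recur in phi_4 .. phi_7
    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
        a ** 2 + b ** 2,
        (n30 - 3 * n12) * a * (a ** 2 - 3 * b ** 2)
        + (3 * n21 - n03) * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        (3 * n21 - n03) * a * (a ** 2 - 3 * b ** 2)
        + (3 * n12 - n30) * b * (3 * a ** 2 - b ** 2),
    ])
```

Shifting a shape within the image, or rotating it by a multiple of 90 degrees on the pixel grid, leaves the returned vector unchanged up to floating-point error, which is the property that makes these values usable as shape features.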
Table I gives the seven invariant moments for three sample characters: na,
ki and lo. The column titled C-code gives the consonant code while the
column titled CV-code gives the vowel mark code. From the table, it may be seen that
na is represented by the two-code sequence ‘21’ and ‘1’, ki is represented by ‘2’
and ‘3’, and lo by ‘29’ and ‘13’. It should also be clear that the sound ko
would be represented by ‘2’ and ‘13’.
Table I: Seven invariant moment values for sample characters.
B. Data and sample decision tables
C4.5 is a supervised learning algorithm used to construct decision trees from the given
data. It is an extension of the ID3 algorithm [11]; it is not restricted to binary splits and
uses simple depth-first construction.
Training data was generated from the Andhra Bhoomi newspaper after suitable
preprocessing to remove skew and noise and to perform binarization. Invariant moments
(IM) were calculated for each character. The IM values thus generated are
discretized so that they can be used with the C4.5 algorithm. The discretization is a
function of x, μ, σ and h, where x is the IM value, μ is the mean, σ is the standard
deviation and h is a scaling constant.
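To make the roles of the mean, the standard deviation and h concrete, the sketch below assumes one plausible form of discretization: z-score binning with a bin width of h standard deviations. This specific formula, the use of floor, and the function name fit_discretizer are assumptions for illustration only and are not necessarily the function used in the experiments.

```python
import numpy as np


def fit_discretizer(train_values, h: float):
    """Build a discretizer from training data (hypothetical z-score binning).

    train_values: array of shape (n_samples, n_moments).
    h: scaling constant; a smaller h gives narrower bins.
    """
    train = np.asarray(train_values, dtype=float)
    mu = train.mean(axis=0)       # per-moment mean from the training set
    sigma = train.std(axis=0)     # per-moment standard deviation
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns

    def discretize(x):
        # ASSUMED form: bin index = floor((x - mu) / (h * sigma)).
        z = (np.asarray(x, dtype=float) - mu) / (h * sigma)
        return np.floor(z).astype(int)

    return discretize
```

Under this assumed form, the statistics come from the training set only and are reused unchanged for test characters, and varying h changes how many distinct values each moment can take.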
The mean and standard deviation are calculated from the training set and used for
discretization. The discretized values of the seven moments for the same characters
shown in Table I are given in Table II.
Table II: Discretized moments for the sample characters shown in Table I.
These discrete values are input to the C4.5 algorithm to construct a decision tree
for classifying an unknown character into a base consonant (i.e., the C-code). Sample
rules for some instances of the guninthams, or vowel-modified versions, of the three
consonants shown in the previous tables are shown below.
Table III: Sample rules for various instances of the three consonants shown in
previous tables.
Table III is based on the output generated by MATLAB for consonant recognition.
The column order shows the relative importance of each discretized moment, roughly
according to entropy gain, while the first column shows the character class code
(C-code). The first moment has the highest entropy gain. Then, depending on the next
dominating entropy value, it generates a complete tree with the decision attribute as
the leaf node. The table should be interpreted as rules in the following manner. For
example, the rule in the first row of Table III is that
if the value of discretized moment 1 is ‘3’ and the value of discretized moment
7 is ‘3’ and the value of discretized moment 6 is ‘4’ and .... and the value of
discretized moment 4 is ‘3’, then the sample character belongs to the class ‘2’, i.e.,
ka.
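Reading a table row as a rule amounts to checking a conjunction of conditions on the discretized moments. The sketch below shows one hypothetical way to represent and apply such rules; the data structure and function name are assumptions, and the example rule contains only the four conditions spelled out in the text, with the elided ones left out.

```python
from typing import Dict, List, Optional, Tuple

# A rule pairs {moment index: required discretized value} with a class code.
Rule = Tuple[Dict[int, int], int]


def classify(dm: Dict[int, int], rules: List[Rule]) -> Optional[int]:
    """Return the class code of the first rule whose conditions all hold."""
    for conditions, c_code in rules:
        if all(dm.get(moment) == value for moment, value in conditions.items()):
            return c_code
    return None  # no rule matched


# Partial version of the first-row rule: only the conditions stated in the text.
RULES: List[Rule] = [({1: 3, 7: 3, 6: 4, 4: 3}, 2)]
```

A character whose discretized moments satisfy every listed condition is assigned class ‘2’; any mismatch means the rule does not fire and other rules would be tried.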
A similar approach is followed for constructing a decision tree to recognize the
vowel-marks or guninthams. From the resulting tree, it was found that the third
discretized moment (DM3) has the greatest entropy gain in this case with the first
discretized moment coming up in the third place.
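The ranking of discretized moments by entropy gain can be sketched as follows. This is a generic illustration of information gain and of the gain ratio criterion that C4.5 uses for choosing splits, written with the Python standard library only; the function names are hypothetical and the code is not the MATLAB implementation used in the experiments.

```python
import math
from collections import Counter, defaultdict


def entropy(items) -> float:
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def information_gain(values, labels) -> float:
    """Reduction in label entropy after splitting on one discretized moment."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def gain_ratio(values, labels) -> float:
    """Information gain normalized by the entropy of the split itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info > 0 else 0.0
```

Computing the score for each of the seven discretized moments and sorting in decreasing order gives an attribute ordering of the kind reflected in the column order of the rule table.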
Figure 4 shows a portion of the decision tree for classifying consonants. It is
generated from the MATLAB output shown in Table III.
Validation Tests and Results
The Telugu characters for both training and testing are taken from 10 editorials of the
Andhra Bhoomi newspaper.
Figure 4: A portion of the decision tree constructed for classifying consonants.
Several sets of experiments were done using a 10-fold test [11]. The different
experiments were done by varying the value of h, the scaling constant in the
discretization function. The CART algorithm [11] was also tried for comparison. C4.5 was
generally found to give better results than CART. The results of these
initial experiments are summarized in Table IV. The number of distinct characters
observed in the test sets is approximately 350 (which may be compared against the
total possible characters which is 36 × 16 or 576).
It may be seen from Table IV that the classification accuracy does not vary much
for 0.5 ≤ h ≤ 1.0. Also, the CART algorithm does not depend on h, and therefore its
results are simply repeated for the different values of h.
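A 10-fold test of the kind used here can be sketched generically. The code below is a minimal illustration with the standard library; the function names, the fixed shuffling seed and the fit/predict interface are assumptions, and any classifier (for example a decision tree learner) can be plugged in as fit.

```python
import random


def k_fold_indices(n_samples: int, k: int = 10, seed: int = 0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal, disjoint folds
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def cross_validated_accuracy(samples, labels, fit, k: int = 10) -> float:
    """Average accuracy over k folds.

    fit(train_samples, train_labels) must return a predict(sample) function.
    """
    correct = 0
    for train, test in k_fold_indices(len(samples), k):
        predict = fit([samples[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(samples[i]) == labels[i] for i in test)
    return correct / len(samples)
```

Every character is used for testing exactly once and is never in the training portion of the fold that tests it, so the reported accuracy is not inflated by memorized samples.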
Table IV: Results of initial training and testing on a dataset obtained from 10 Andhra
Bhoomi editorials.
Later, a further set of editorials was used in training and testing. The results of a
10-fold test showed that the accuracy is 98.52% for characters with the CV structure and
97.03% for the C structure. These results may seem counterintuitive because
a base consonant should be simpler in shape than a vowel-modified consonant. A possible
explanation is that several consonant-vowel-modifier combinations do not occur with
sufficiently high frequency. A case in point occurs with pairs of consonants that may be
easily confused or mis-recognized. However, the vowel-modified forms of one consonant
in such a pair are extremely rare compared with those of the other, which leads to fewer
confusions when the CV structure is used.
The distinguished Telugu linguist Dr. Bh. Krishnamurti
notes that the CV structure is the most common character type in Telugu, comprising
roughly 78% of all the characters in large corpora [13]. Consonants make up a
significant part of the remaining characters while vowels, which can occur only at the
beginning of a word, and CCV structures are relatively less frequent. Therefore,
designing a classifier or an OCR system for Telugu may well give a higher priority to
C and CV structures for achieving high accuracy.
Conclusions
In this paper, we showed that shape-based features such as moment invariants in
combination with a well-designed classification strategy lead to high accuracy in
recognizing Telugu characters. The results reported are admittedly on small datasets, but the
approach is interesting. The strategy of dividing the character recognition problem
comprising several hundred classes into two ‘orthographically orthogonal’ decision trees
is perhaps unique. That neither of the two decision trees recognizes more than a few tens
of classes gives it additional beauty in that the training sets and feature sizes may be
relatively small. In our approach and experiments, the two decision trees, using seven
invariant moments, and classifying 36 and 16 classes respectively have together
recognized nearly 350 distinct characters (hence class labels) with a good accuracy.
References
[1] George Nagy, Sharad C. Seth, Mahesh Viswanathan: A Prototype Document
Image Analysis System for Technical Journals. IEEE Computer 25(7): 10–22
(1992)
[2] Ministry of Communications and Information Technology, Govt. of India:
Technology Development in Indian Languages. http://www.tdil.gov.in
[3] B. B. Chaudhuri, U. Pal, Mandar Mitra: Automatic Recognition of Printed
Oriya Script. Intl. Conf. on Document Analysis and Recognition (ICDAR)
2001, Seattle (USA). pp 795–799
[4] G. S. Lehal, Chandan Singh, Ritu Lehal: A Shape Based Post Processor for
Gurmukhi OCR. Intl. Conf. on Document Analysis and Recognition (ICDAR)
2001, Seattle (USA). pp 1105–1109
[5] Atul Negi, Chakravarthy Bhagvati, B. Krishna: An OCR System for Telugu.
Intl. Conf. on Document Analysis and Recognition (ICDAR) 2001, Seattle
(USA). pp 1110–1114
[6] Santanu Chaudhury, Geetika Sethi, Anand Vyas, Gaurav Harit: Devising
Interactive Access Techniques for Indian Language Document Images. Intl.
Conf. on Document Analysis and Recognition (ICDAR) 2003, Barcelona
(Spain). pp 885–889
[7] Aurélie Lemaitre, B. B. Chaudhuri, Bertrand Coüasnon: Perceptive Vision for
Headline Localisation in Bangla Handwritten Text Recognition. Intl. Conf. on
Document Analysis and Recognition (ICDAR) 2007, Brazil. pp 614–618
[8] G. S. Lehal, Chandan Singh: A Gurmukhi Script Recognition System. Intl.
Conf. on Pattern Recognition (ICPR 2000). pp 2557–2560
[9] K. G. Aparna, A. G. Ramakrishnan: A Complete Tamil Optical Character
Recognition System. Document Analysis Systems 2002. pp 53–57
[10] Angshul Majumdar, B. B. Chaudhuri: Curvelet-Based Multi SVM Recognizer
for Offline Handwritten Bangla: A Major Indian Script. Intl. Conf. on
Document Analysis and Recognition (ICDAR) 2007, Brazil. pp 491–495
[11] Mitchell, T. M.: Machine Learning. McGraw Hill, 1997
[12] R. C. Gonzalez, R. E. Woods: Digital Image Processing. 2nd Ed., Addison-
Wesley, 1997.
[13] Bh. Krishnamurti, I. Ramambrahmam, C. R. Rao: Evaluation of Total Literacy
Campaigns: Case Studies. Booklinks Corporation, Hyderabad.
[14] M. K. Hu: Visual Pattern Recognition by Moment Invariants. IRE Trans. on
Information Theory, IT-8: 179–187 (1962)
[15] C. Srikanth, “Classification and Identification of Telugu Aksharas using
Moment Invariants and C4.5 Algorithm”, Technical Report, 2008, DCIS,
University of Hyderabad.

 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substationstephanwindworld
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...Chandu841456
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgsaravananr517913
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptSAURABHKUMAR892774
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girlsssuser7cb4ff
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxsomshekarkn64
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - GuideGOPINATHS437943
 

Kürzlich hochgeladen (20)

computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptxExploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
Exploring_Network_Security_with_JA3_by_Rakesh Seal.pptx
 
Class 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm SystemClass 1 | NFPA 72 | Overview Fire Alarm System
Class 1 | NFPA 72 | Overview Fire Alarm System
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Earthing details of Electrical Substation
Earthing details of Electrical SubstationEarthing details of Electrical Substation
Earthing details of Electrical Substation
 
Design and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdfDesign and analysis of solar grass cutter.pdf
Design and analysis of solar grass cutter.pdf
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
POWER SYSTEMS-1 Complete notes examples
POWER SYSTEMS-1 Complete notes  examplesPOWER SYSTEMS-1 Complete notes  examples
POWER SYSTEMS-1 Complete notes examples
 
An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...An experimental study in using natural admixture as an alternative for chemic...
An experimental study in using natural admixture as an alternative for chemic...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfgUnit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
Unit7-DC_Motors nkkjnsdkfnfcdfknfdgfggfg
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
Arduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.pptArduino_CSE ece ppt for working and principal of arduino.ppt
Arduino_CSE ece ppt for working and principal of arduino.ppt
 
Call Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call GirlsCall Girls Narol 7397865700 Independent Call Girls
Call Girls Narol 7397865700 Independent Call Girls
 
lifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptxlifi-technology with integration of IOT.pptx
lifi-technology with integration of IOT.pptx
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
🔝9953056974🔝!!-YOUNG call girls in Rajendra Nagar Escort rvice Shot 2000 nigh...
 

Classification and Identification of Telugu Aksharas using Moment Invariants and C4.5 Algorithm

International Journal of Computational Intelligence Research
ISSN 0973-1873 Volume 5, Number 2 (2009), pp. 225–232
© Research India Publications
http://www.ripublication.com/ijcir.htm

Classification and Identification of Telugu Aksharas using Moment Invariants and C4.5 Algorithm

C. Srikanth, B.L. Deekshatulu, C. Raghavendra Rao and Chakravarthy Bhagvati
A.I. Laboratory, Department of Computer and Information Science, University of Hyderabad, Hyderabad 500046, India.

Abstract

Classifying and recognizing Telugu characters (aksharas) is a challenging task because of the variations in the script and the large number of characters. The complexity of the shape is a result of structural compositions involving vowels (V), consonants (C), consonants with vowel modifiers (CV) and consonant clusters (CCV). This paper presents a novel classification strategy for classifying aksharas with the CV structure. This is achieved by constructing two decision trees, one for consonants and the other for vowel modifiers, by hybridizing the moment invariants and the C4.5 algorithm. The results, although preliminary and limited in scale, illustrate the potential strengths of the approach. Perhaps the uniqueness and the strength of the approach lie in the divide-and-conquer strategy, where two orthographically orthogonal decision trees, each recognizing only a few tens of classes, may in combination classify hundreds of classes.

Introduction

Optical character recognition (OCR) refers to the field of study dealing with converting the character images present in a scanned document image into computer-readable text. Usually, the output from an OCR system is text represented in ASCII for English and in Unicode or ISCII for Indian scripts.

Research on OCR systems has a long history for English [1], and several commercially successful systems, such as ABBYY FineReader, are in existence today. The situation is different for Indian scripts. Although initial and pioneering papers appeared in the '70s, it is only in the past decade that truly systematic development has been taking place, with assistance from the Govt. of India for developing technology solutions for Indian languages [2]. The results are reflected in a number of papers on Indian OCR systems in recent ICDAR conferences [3]–[7].

OCR techniques reported in the literature view the problem as one of classification, i.e., assigning a label representing a character to a pattern of pixels comprising an image of the character. One of the key components of such an approach is classifier design. The challenge lies in the large character sets seen in many Indian scripts and therefore the need for a large multiclass classifier. Whereas the English script has fewer than 100 different symbols (lowercase and uppercase characters, punctuation and special symbols), most Indian scripts involve classifying a few hundred characters and, for some languages (Hindi and Bangla), more than a thousand. All the traditional classifiers, such as nearest neighbour [5], hand-crafted decision trees [8], neural networks [9] and SVMs [10], have been tried with varying levels of success.

In this paper, we present a different idea of classifying and recognizing Telugu characters using a combination of two decision trees. One decision tree is trained for recognizing consonants (C) while the other is trained for vowel-modified consonants (CV). The approach is based on hybridizing the set of seven invariant moments proposed by M. K. Hu [14] and the C4.5 algorithm of J. Ross Quinlan [11].

The rest of the paper is organized as follows. We present the main characteristics of the Telugu script in Section 2, followed by our approach in Section 3. Section 4 summarizes the results, while in Section 5 we present our conclusions.

Telugu Script

Telugu is the main language of Andhra Pradesh State in India, and records indicate its existence in almost the present form since the 7th century AD. Telugu is a phonetic language with an ortho-syllabic script that is written from left to right, with each character generally representing a syllable. The Telugu alphabet is generally said to have 16 vowels, 36 consonants and three special symbols. Modern Telugu, however, consists of only 12 vowels, 36 consonants and 2 special symbols (shown in Figure 1).

Figure 1: The Telugu alphabet: (a) vowels (12), (b) consonants (36), (c) special symbols (2).

Besides these, the Telugu script contains other symbols called vattulu, which are half-consonants that appear in consonant clusters (CCV). Vowel marks (VM), or half-vowels, are also used in modifying a consonant with a vowel sound. Examples are shown in Figure 2, where the first character represents the sound ku and the vowel mark to the right of the character corresponds to the vowel u. The second character in the figure is the sound kri, where the circular stroke at the bottom is the vattu corresponding to the consonant ra and the little circular mark at the top is the vowel mark corresponding to the vowel i. The complete sequence of vowel-modified variants of the consonant ka is shown in Figure 3. Such a sequence is called a gunintham.

Figure 2: Examples of compound characters: the first has a CV structure while the second has a CCV structure.

Figure 3: The ka gunintham showing the full set of vowel marks.

A gunintham defines a natural class in the Telugu language, and we train our classifier to recognize the different characters within a gunintham as a single class. All the vowels are grouped into a single class (labelled '1'). The 36 consonants, starting with ka, are assigned the labels '2' to '37'. The vowel marks are assigned the labels '1' to '16' for the full set of 16 modifiers that appear in older texts.

Our Approach

Our approach is to classify a given unknown character into its base consonant class using one decision tree and then into a specific vowel mark class using the second decision tree. The combination of the two gives the precise label for the character. The features used for both decision trees are invariant moments. The first decision tree needs to classify a character into 36 classes while the second outputs one of 16 classes. By using group classification based on guninthams and combining the output from two relatively small decision trees, which in some sense are orthogonal in their functionality, we achieve a combined classification capability of 576 classes (36 × 16). It is this aspect of our classifier design that is most interesting.
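
As an illustration of how the two tree outputs combine, the sketch below maps a (consonant code, vowel-mark code) pair onto a single joint index. The label ranges '2' to '37' and '1' to '16' come from the text; the joint index itself, and the function names `combine_codes` and `split_code`, are hypothetical and are not part of the paper.

```python
from typing import Tuple

NUM_CONSONANTS = 36   # consonant labels '2'..'37' in the text
NUM_VOWEL_MARKS = 16  # vowel-mark labels '1'..'16' in the text


def combine_codes(c_code: int, vm_code: int) -> int:
    """Map a (C-code, vowel-mark code) pair to a joint index in 0..575.

    The joint index is an illustrative assumption; the paper only states
    that the pair of codes identifies the character.
    """
    if not 2 <= c_code <= NUM_CONSONANTS + 1:
        raise ValueError("consonant code must be in 2..37")
    if not 1 <= vm_code <= NUM_VOWEL_MARKS:
        raise ValueError("vowel-mark code must be in 1..16")
    return (c_code - 2) * NUM_VOWEL_MARKS + (vm_code - 1)


def split_code(joint: int) -> Tuple[int, int]:
    """Inverse of combine_codes: recover (C-code, vowel-mark code)."""
    if not 0 <= joint < NUM_CONSONANTS * NUM_VOWEL_MARKS:
        raise ValueError("joint index must be in 0..575")
    c_index, vm_index = divmod(joint, NUM_VOWEL_MARKS)
    return c_index + 2, vm_index + 1
```

Each tree only ever sees its own 36 or 16 classes; the 576 joint labels arise purely from pairing the two outputs, which is why neither tree needs training examples for every combination.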
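
A. Moment invariants

Moment invariants are well known in the literature and characterize properties of connected regions in binary images. These moments are invariant to translation, rotation and scaling, hence the name invariant moments. They are useful because they provide a fairly simple representation of shape for classification and recognition tasks. The seven invariant moments are derived from the normalized central moments

$$\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p+q}{2} + 1,$$

for p + q = 2, 3, . . ., where $\mu_{pq}$ are the standard central moments of 2-D functions. The set of seven invariant moments is defined as

$$\phi_1 = \eta_{20} + \eta_{02}$$
$$\phi_2 = (\eta_{20} - \eta_{02})^2 + 4\eta_{11}^2$$
$$\phi_3 = (\eta_{30} - 3\eta_{12})^2 + (3\eta_{21} - \eta_{03})^2$$
$$\phi_4 = (\eta_{30} + \eta_{12})^2 + (\eta_{21} + \eta_{03})^2$$
$$\phi_5 = (\eta_{30} - 3\eta_{12})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{21} - \eta_{03})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$$
$$\phi_6 = (\eta_{20} - \eta_{02})\left[(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right] + 4\eta_{11}(\eta_{30} + \eta_{12})(\eta_{21} + \eta_{03})$$
$$\phi_7 = (3\eta_{21} - \eta_{03})(\eta_{30} + \eta_{12})\left[(\eta_{30} + \eta_{12})^2 - 3(\eta_{21} + \eta_{03})^2\right] + (3\eta_{12} - \eta_{30})(\eta_{21} + \eta_{03})\left[3(\eta_{30} + \eta_{12})^2 - (\eta_{21} + \eta_{03})^2\right]$$

The above definitions are taken from Gonzalez [12]. Table I gives the seven invariant moments for three sample characters: na, ki and lo. The column titled C-code gives the consonant code while the column titled CV-code gives the vowel mark code. From the table, it may be seen that na is represented by the two-code sequence '21' and '1', while ki is represented by '2' and '3', and lo by '29' and '13'. It should also be clear now that the sound ko would be represented by '2' and '13'.

Table I: Seven invariant moment values for sample characters.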
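
The seven invariant moments can be computed directly from a binary character image. The following is a minimal NumPy sketch of the standard definitions (Hu [14], as presented in Gonzalez [12]); the function name `hu_moments` and the choice of pixel indices as coordinates are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def hu_moments(image: np.ndarray) -> np.ndarray:
    """Return the seven invariant moments (phi_1..phi_7) of a 2-D image.

    `image` is treated as a 2-D intensity function; for a binary glyph,
    foreground pixels are 1 and background pixels are 0.
    """
    img = np.asarray(image, dtype=float)
    m00 = img.sum()
    if m00 == 0:
        raise ValueError("image has no foreground pixels")

    # Pixel coordinates and centroid.
    ys, xs = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    dx = xs - (xs * img).sum() / m00
    dy = ys - (ys * img).sum() / m00

    def eta(p: int, q: int) -> float:
        """Normalized central moment: mu_pq / mu_00^((p+q)/2 + 1)."""
        mu_pq = (dx ** p * dy ** q * img).sum()
        return mu_pq / m00 ** ((p + q) / 2 + 1)

    n20, n02, n11 = eta(2, 0), eta(0, 2), eta(1, 1)
    n30, n03, n21, n12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    a, b = n30 + n12, n21 + n03  # recurring sums in phi_4..phi_7

    return np.array([
        n20 + n02,
        (n20 - n02) ** 2 + 4 * n11 ** 2,
        (n30 - 3 * n12) ** 2 + (3 * n21 - n03) ** 2,
        a ** 2 + b ** 2,
        (n30 - 3 * n12) * a * (a ** 2 - 3 * b ** 2)
        + (3 * n21 - n03) * b * (3 * a ** 2 - b ** 2),
        (n20 - n02) * (a ** 2 - b ** 2) + 4 * n11 * a * b,
        (3 * n21 - n03) * a * (a ** 2 - 3 * b ** 2)
        + (3 * n12 - n30) * b * (3 * a ** 2 - b ** 2),
    ])
```

On a pixel grid, translation and 90-degree rotation leave these values unchanged up to floating-point error, while invariance to scaling and to arbitrary rotation angles holds only approximately because of resampling.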

B. Data and sample decision tables

C4.5 is a supervised learning algorithm used to construct decision trees from the given data. It is an extension of the ID3 algorithm [11], is not restricted to binary splits, and uses simple depth-first construction. Training data was generated from the Andhra Bhoomi newspaper after suitable preprocessing to remove skew and noise and to perform binarization. Invariant moments (IM) were calculated for each character. The IM values thus generated are discretized in order to fit into the C4.5 algorithm. The discretization function takes the IM value x, the mean μ, the standard deviation σ and a scaling constant h. The mean and standard deviation are calculated from the training set and used for discretization.

The discretized values of the seven moments for the same characters shown in Table I are given in Table II. These discrete values are input to the C4.5 algorithm to construct a decision tree for classifying an unknown character into a base consonant (i.e., the C-code). Sample rules for some instances of the guninthams, or vowel-modified versions, of the three consonants shown in the previous tables are given in Table III.

Table II: Discretized moments for the sample characters shown in Table I.

Table III: Sample rules for various instances of the three consonants shown in previous tables.

Table III is based on the output generated by Matlab for consonant recognition. The column order shows the relative importance of each discretized moment, roughly according to entropy gain, while the first column shows the character class code (C-code). The first moment has the highest entropy gain. Then, depending on the next dominating entropy value, the algorithm generates a complete tree with the decision attribute as the leaf node. The table should be interpreted as rules in the following manner. For example, the rule in the first row of Table III is that if the value of discretized moment 1 is '3' and the value of discretized moment 7 is '3' and the value of discretized moment 6 is '4' and .... and the value of discretized moment 4 is '3', then the sample character belongs to the class '2', i.e., ka.

A similar approach is followed for constructing a decision tree to recognize the vowel marks or guninthams. From the resulting tree, it was found that the third discretized moment (DM3) has the greatest entropy gain in this case, with the first discretized moment coming in third place. Figure 4 shows a portion of the decision tree for classifying consonants. It is generated from the Matlab output shown in Table III.

Figure 4: A portion of the decision tree constructed for classifying consonants.

Validation Tests and Results

The Telugu characters for both training and testing are taken from 10 editorials of the Andhra Bhoomi newspaper. Several sets of experiments were done using a 10-fold test [11]. The different experiments were done by varying the value of h, the scaling constant in the discretization function. The CART algorithm [11] was also tried for comparison. It has generally been found that C4.5 gives better results than CART. The results of these initial experiments are summarized in Table IV. The number of distinct characters observed in the test sets is approximately 350 (which may be compared against the total number of possible characters, 36 × 16 or 576). It may be seen from Table IV that the classification accuracy does not vary much for 0.5 ≤ h ≤ 1.0. Also, the CART algorithm does not depend on h, and therefore its results are simply replicated for the different values of h.

Table IV: Results on initial training and testing on a dataset obtained from 10 Andhra Bhoomi editorials.

Later, a further set of editorials was used in training and testing. The results of a 10-fold test showed that the accuracy is 98.52% for characters with the CV structure and 97.03% for the C structure. These results may seem counterintuitive because a base consonant should be simpler in shape than a vowel-modified consonant. A possible explanation is that several consonant-vowel-modifier combinations do not occur with sufficiently high frequency. A case in point occurs with consonants such as [glyph] and [glyph], which may be easily confused or misrecognized. However, the vowel-modified forms of the consonant [glyph] are extremely rare when compared with those of [glyph], which leads to fewer confusions when the CV structure is used.

It may also be noted that the distinguished Telugu linguist Dr. Bh. Krishnamurti observes that the CV structure is the most common character structure in Telugu, comprising roughly 78% of all the characters in large corpora [13]. Consonants make up a significant part of the remaining characters, while vowels, which can occur only at the beginning of a word, and CCV structures are relatively less frequent. Therefore, a classifier or an OCR system for Telugu may well give higher priority to C and CV structures in order to achieve high accuracy.
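
The ranking of discretized moments by entropy gain described in Section B can be sketched as follows. The `discretize` function is an ASSUMED form (a rounded z-score with step h·σ), because the paper's exact discretization formula is not available in this copy; it only illustrates how x, μ, σ and h could interact. The entropy and information-gain functions follow the standard ID3 definitions in Mitchell [11]; C4.5 itself normalizes this quantity into a gain ratio, which is omitted here for brevity. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def discretize(x: float, mean: float, std: float, h: float) -> int:
    """ASSUMED discretization: signed count of (h * std)-wide steps from the mean.

    This is not the paper's formula, only a plausible stand-in.
    """
    return int(round((x - mean) / (h * std)))


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy obtained by splitting on a discrete attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_attributes(columns: Sequence[Sequence[Hashable]],
                    labels: Sequence[Hashable]) -> list:
    """Return attribute indices sorted from highest to lowest information gain."""
    gains = [information_gain(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda i: gains[i], reverse=True)
```

With the seven discretized moments as `columns` and the C-codes as `labels`, the first index returned by `rank_attributes` would be the attribute chosen at the root of an ID3-style tree; a smaller h gives finer bins and hence more branches per node.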

Conclusions

In this paper, we showed that shape-based features such as moment invariants, in combination with a well-designed classification strategy, lead to high accuracy in recognizing Telugu characters. The results reported are admittedly on small datasets, but the approach is interesting. The strategy of dividing a character recognition problem comprising several hundred classes between two 'orthographically orthogonal' decision trees is perhaps unique. That neither of the two decision trees recognizes more than a few tens of classes gives it additional beauty, in that the training sets and feature sizes may be relatively small. In our approach and experiments, the two decision trees, using seven invariant moments and classifying 36 and 16 classes respectively, have together recognized nearly 350 distinct characters (hence class labels) with good accuracy.

References

[1] George Nagy, Sharad C. Seth, Mahesh Viswanathan: A Prototype Document Image Analysis System for Technical Journals. IEEE Computer 25(7): 10–22 (1992)
[2] Ministry of Communications and Information Technology, Govt. of India: Technology Development in Indian Languages. http://www.tdil.gov.in
[3] B. B. Chaudhuri, U. Pal, Mandar Mitra: Automatic Recognition of Printed Oriya Script. Intl. Conf. on Document Analysis and Recognition (ICDAR) 2001, Seattle (USA). pp 795–799
[4] G. S. Lehal, Chandan Singh, Ritu Lehal: A Shape Based Post Processor for Gurmukhi OCR. Intl. Conf. on Document Analysis and Recognition (ICDAR) 2001, Seattle (USA). pp 1105–1109
[5] Atul Negi, Chakravarthy Bhagvati, B. Krishna: An OCR System for Telugu. Intl. Conf. on Document Analysis and Recognition (ICDAR) 2001, Seattle (USA). pp 1110–1114
[6] Santanu Chaudhury, Geetika Sethi, Anand Vyas, Gaurav Harit: Devising Interactive Access Techniques for Indian Language Document Images. Intl. Conf. on Document Analysis and Recognition (ICDAR) 2003, Barcelona (Spain). pp 885–889
[7] Aurelie Lemaitre, B. B. Chaudhuri, Bertrand Coüasnon: Perceptive Vision for Headline Localisation in Bangla Handwritten Text Recognition. Intl. Conf. on Document Analysis and Recognition (ICDAR) 2007, Brazil. pp 614–618
[8] G. S. Lehal, Chandan Singh: A Gurmukhi Script Recognition System. Intl. Conf. on Pattern Recognition (ICPR 2000). pp 2557–2560
[9] K. G. Aparna, A. G. Ramakrishnan: A Complete Tamil Optical Character Recognition System. Document Analysis Systems 2002. pp 53–57
[10] Angshul Majumdar, B. B. Chaudhuri: Curvelet-Based Multi SVM Recognizer for Offline Handwritten Bangla: A Major Indian Script. Intl. Conf. on Document Analysis and Recognition (ICDAR) 2007, Brazil. pp 491–495
[11] T. M. Mitchell: Machine Learning. McGraw Hill, 1997
[12] R. C. Gonzalez, R. E. Woods: Digital Image Processing. 2nd Ed., Addison-Wesley, 1997.
[13] Bh. Krishnamurti, I. Ramambrahmam, C. R. Rao: Evaluation of Total Literacy Campaigns: Case Studies. Booklinks Corporation, Hyderabad.
[14] M. K. Hu: Visual Pattern Recognition by Moment Invariants. IRE Trans. on Information Theory, IT-8: 179–187
[15] C. Srikanth: Classification and Identification of Telugu Aksharas using Moment Invariants and C4.5 Algorithm. Technical Report, 2008, DCIS, University of Hyderabad.