A RULE BASED APPROACH ON
STEMMING OF BENGALI VERBS
A Project Work Submitted in Partial Fulfilment
of the Requirements for the Degree of
BACHELOR OF TECHNOLOGY
in
COMPUTER SCIENCE & ENGINEERING
by
ANASUYA PAUL (Roll No. 10700111006)
JOYEETA BAGCHI (Roll No. 10700111021)
KOUSHIK DUTTA (Roll No. 10700111024)
SNEHA SARKAR (Roll No. 10700111049)
Under the supervision of
Mr. Alok Ranjan Pal
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT
(Affiliated to West Bengal University of Technology)
Purba Medinipur – 721171, West Bengal, India
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT
(Affiliated to West Bengal University of Technology)
Purba Medinipur – 721171, West Bengal, India
CERTIFICATE OF APPROVAL
This is to certify that the work embodied in this project entitled A RULE BASED
APPROACH ON STEMMING OF BENGALI VERBS submitted by Anasuya Paul,
Joyeeta Bagchi, Koushik Dutta and Sneha Sarkar to the Department of Computer Science &
Engineering, is carried out under my direct supervision and guidance.
The project work has been prepared as per the regulations of West Bengal University
of Technology and I strongly recommend that this project work be accepted in fulfilment of
the requirement for the degree of B.Tech.
Supervisor
Mr. Alok Ranjan Pal
Asst. Prof., Dept. of CSE
Countersigned by
Prof. (Dr.) Dilip Kumar Gayen
Head, Department of CSE
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING
COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT
(Affiliated to West Bengal University of Technology)
Purba Medinipur – 721171, West Bengal, India
Certificate by the Board of Examiners
This is to certify that the project work entitled A RULE BASED APPROACH ON
STEMMING OF BENGALI VERBS submitted by Anasuya Paul, Joyeeta Bagchi, Koushik
Dutta and Sneha Sarkar to the Department of Computer Science and Engineering of College
of Engineering & Management, Kolaghat, has been examined and evaluated.
The project work has been prepared as per the regulations of West Bengal University
of Technology and qualifies to be accepted in fulfilment of the requirement for the degree of
B. Tech.
Project Co-ordinator Board of Examiners
ABSTRACT
Based on the various inflexions of verbs available in the Bengali Dictionary, an attempt is
made to retrieve the stem word from its inflexions in the underlying sentences. The input
sentences are collected from 50 different categories of the Bengali text corpus developed in
the TDIL project of the Govt. of India, while the information about the different inflexions of
a particular verb is collected from the Bengali Dictionary. In this project, we present a
lightweight stemmer for 14 selected Bengali verbs that strips suffixes using a predefined
suffix list on a “longest match” basis and then finds the root on the basis of some rules.
We have applied the algorithm over 450 sentences and achieved around 99.36% accuracy in
retrieving the root words from their inflexions in the underlying sentences. The proposed
stemmer is both computationally inexpensive and domain independent.
INDEX
Sl.No. TITLE Pg.No.
1. Introduction ------------------------------------------------------------------ 5 – 6
2. Theoretical Study ------------------------------------------------------------ 7 – 12
3. Related Work ---------------------------------------------------------------- 13 – 14
4. Proposed Approach --------------------------------------------------------- 15 – 21
4.1. Overall Pictorial Representation ------------------------------------------ 15
4.1.1. Explanation of Proposed Approach with example ---------------------- 16
4.1.2. Detail explanation of Module 1 (Suffix Stripping) --------------------- 16
4.1.3. Detail explanation of Module 2 (Applying Rules) ---------------------- 17
4.1.4. Sentence Collection --------------------------------------------------------- 17
4.1.5. Normalization ---------------------------------------------------------------- 18
4.1.6. Tagging of Verbs ------------------------------------------------------------ 19
4.1.7. Preparing Output File ------------------------------------------------------- 19
4.1.8. Preparing Suffix List -------------------------------------------------------- 19
4.1.9. Verification ------------------------------------------------------------------- 20
4.2. Algorithm --------------------------------------------------------------------- 20 – 21
5. Output and Discussion ------------------------------------------------------ 22 – 24
5.1. Partial View of Input File -------------------------------------------------- 22
5.2. Suffix List -------------------------------------------------------------------- 22
5.3. Partial View of Output File ------------------------------------------------ 23
5.4. Efficiency --------------------------------------------------------------------- 24
5.5. Time Complexity ------------------------------------------------------------ 24
6. Conclusion and Future Work ---------------------------------------------- 25
i. Acknowledgement ---------------------------------------------------------- 26
ii. References ------------------------------------------------------------------- 27 – 28
iii. Appendix --------------------------------------------------------------------- 29 – 32
1. INTRODUCTION
Stemming is an operation that splits a word into the constituent root part and affix
without doing complete morphological analysis. It is used to improve the performance of
spelling checkers and information retrieval applications, where morphological analysis
would be too computationally expensive. It is a pre-processing step in Text Mining
applications as well as a very common requirement of Natural Language processing
functions. The main purpose of stemming is to reduce different grammatical forms / word
forms of a word like its noun, adjective, verb, adverb etc. to its root form. We can say that the
goal of stemming is to reduce inflectional forms and sometimes derivationally related forms
of a word to a common base form.
Bengali is one of the most morphologically rich languages: more than one inflexion can be
applied to a stem to form a word type. Stemming is a hard problem for four categories
(noun, adjective, adverb and verb), and the verb is the most problematic area for
stemming. Bangla has a vast inflectional system; the number of inflected and
derivational forms of a given lexicon entry is huge. For example, there are nearly
10 × 5 = 50 base forms of a given Bengali verb, since there are 10 tenses and 5 persons
and a root verb changes its form according to tense and person. Many forms of the verb
root DEKHA (দেখা) are listed below. Beyond this, there are many prefixes and suffixes
that can attach to a root word and form a new word.
Different forms of verb root DEKHA (দেখা) are dekhi(দেখখ), dekhis(দেখখস) ,dekh (দেখ) ,dekhe
(দেখখ) ,dekhen (দেখখন) , dekhbo (দেখব) , dekhbi (দেখখব) , dekhbe (দেখখব) , dekhben (দেখখবন) ,
dekhchi(দেখখি) , dekhchis (দেখখিস) , dekhche(দেখখি) , dekhchen (দেখখিন) , dekhchilam
(দেখখিলাম) , dekhchili (দেখখিখল) , dekhchilo(দেখখিল) , dekhchilen (দেখখিখলন) , dekhlam
(দেখলাম) , dekhli (দেখখল) , dekhlo (দেখল) , dekhlen (দেখখলন) , dekhtis (দেখখিস) , dekhtam
(দেখিাম) , dekhto (দেখি) , dekhten (দেখখিন) , dekhai (দেখাই) , dekhay (দেখায়) , dekhas
(দেখাস) , dekhao (দেখাও) , dekhechi (দেখখখি) , dekhecho (দেখখি) , dekhechis (দেখখখিস) ,
dekhechen (দেখখখিন) , dekhtei (দেখখিই) , dekhar (দেখার) , dekhabo (দেখাব) , dekhaben
(দেখাখবন) , dekhabi (দেখাখব) etc.
Different suffixes that are added with root word to form a new word are chilen(খিখলন) ,
chilam (খিলাম) , chilis (খিখলস) , chilo (খিখলা) , chile (খিখল) , chili (খিখল) , chilo (খিল) , chen
(দিন) , lam (লাম) , len (দলন) , tam (িাম) , tei (দিই) , tis (খিস) , ten (দিন) , ben (দবন) , chi (খি) ,
che (দি) , bi (খব) , be (দব) , te(দি) , le (দল) , li (খল) , lo (দলা) , to (দিা) etc.
Overview of Stemming of Bengali Verbs
Root Word Inflected Verb Form Stripped Word + Suffixes Suffixes
দেখা দেখখিলাম দেখ + খিলাম খিলাম
করা করখল কর + খল খল
জানা জানখিস জান + খিস খিস
বলা বলিাম বল + িাম িাম
নাচা নাচখিখলন নাচ + খিখলন খিখলন
Table 1: Stemming of Bengali Verbs
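The suffix stripping in Table 1 can be sketched as a “longest match” search over a predefined suffix list. The sketch below uses romanized suffixes from the list above for readability; the actual stemmer operates on Bengali Unicode strings, so the suffix values here are illustrative stand-ins.

```python
# Minimal sketch of longest-match suffix stripping (romanized suffixes).
SUFFIXES = ["chilam", "chilen", "chilis", "chile", "chili", "chilo",
            "chen", "lam", "len", "tam", "tis", "ten", "ben",
            "chi", "che", "bi", "be", "te", "le", "li", "lo", "to"]

# Sort once so the longest suffix is always tried first.
SUFFIXES.sort(key=len, reverse=True)

def strip_suffix(word):
    """Remove the longest matching suffix, if any, and return the subroot."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[:-len(suffix)]
    return word

print(strip_suffix("dekhchilam"))  # dekh  (dekhchilam = dekh + chilam)
print(strip_suffix("korli"))       # kor   (korli = kor + li)
```

Sorting by length first is what makes the match “longest”: without it, a short suffix such as "li" could be stripped before the longer "chili" is even tried.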
We review the existing work in this area in Section 3; then we present the proposed
stemming algorithm in Section 4, followed by its output, discussion and evaluation in
Section 5. Finally, we conclude in Section 6 with a look at future research directions.
2. THEORETICAL STUDY
Natural language processing (NLP) is a field of computer science, artificial intelligence,
and computational linguistics concerned with the interactions between computers and human
(natural) languages. As such, NLP is related to the area of human–computer interaction.
Many challenges in NLP involve natural language understanding, that is, enabling computers
to derive meaning from human or natural language input, and others involve natural language
generation. The history of NLP generally starts in the 1950s, although work can be found
from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery
and Intelligence" which proposed what is now called the Turing test as a criterion of
intelligence. Some notably successful NLP systems developed in the 1960s were SHRDLU, a
natural language system working in restricted "blocks worlds" with restricted vocabularies,
and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum
between 1964 and 1966. Using almost no information about human thought or emotion,
ELIZA sometimes provided a startlingly human-like interaction. When the "patient"
exceeded the very small knowledge base, ELIZA might provide a generic response, for
example, responding to "My head hurts" with "Why do you say your head hurts?".
Modern NLP algorithms are based on machine learning, especially statistical machine
learning. The paradigm of machine learning is different from that of most prior attempts at
language processing. The machine-learning paradigm calls instead for using general learning
algorithms — often, although not always, grounded in statistical inference — to
automatically learn such rules through the analysis of large corpora of typical real-world
examples. A corpus (plural, "corpora") is a set of documents (or sometimes, individual
sentences) that have been hand-annotated with the correct values to be learned.
The following is a list of some of the most commonly researched tasks in NLP. What
distinguishes these tasks from other potential and actual NLP tasks is not only the volume of
research devoted to them but the fact that for each one there is typically a well-defined
problem setting, a standard metric for evaluating the task, standard corpora on which the task
can be evaluated, and competitions devoted to the specific task.
a. Automatic summarization
Produce a readable summary of a chunk of text. Often used to provide summaries of
text of a known type, such as articles in the financial section of a newspaper.
b. Coreference resolution
Given a sentence or larger chunk of text, determine which words ("mentions") refer to
the same objects ("entities"). Anaphora resolution is a specific example of this task,
and is specifically concerned with matching up pronouns with the nouns or names that
they refer to. The more general task of co-reference resolution also includes
identifying so-called "bridging relationships" involving referring expressions.
For example, in a sentence such as "He entered John's house through the front door",
"the front door" is a referring expression and the bridging relationship to be identified
is the fact that the door being referred to is the front door of John's house (rather than
of some other structure that might also be referred to).
c. Discourse analysis
This rubric includes a number of related tasks. One task is identifying the discourse
structure of connected text, i.e. the nature of the discourse relationships between
sentences (e.g. elaboration, explanation, contrast). Another possible task is
recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question,
content question, statement, assertion, etc.).
d. Machine translation
Automatically translate text from one human language to another. This is one of the
most difficult problems, and is a member of a class of problems colloquially termed
"AI-complete", i.e. requiring all of the different types of knowledge that humans
possess (grammar, semantics, facts about the real world, etc.) in order to solve
properly.
e. Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes.
The difficulty of this task depends greatly on the complexity of the morphology (i.e.
the structure of words) of the language being considered. English has fairly simple
morphology, especially inflectional morphology, and thus it is often possible to ignore
this task entirely and simply model all possible forms of a word (e.g. "open, opens,
opened, opening") as separate words. In languages such as Turkish, however, such an
approach is not possible, as each dictionary entry has thousands of possible word
forms. This is true not only for Turkish but also for Manipuri [4], a highly
agglutinative Indian language.
f. Named entity recognition (NER)
Given a stream of text, determine which items in the text map to proper names, such
as people or places, and what the type of each such name is (e.g. person, location,
organization). Capitalization alone is not a reliable cue: the first word of a sentence is
also capitalized, and named entities often span several words, only some of which are
capitalized.
g. Natural language generation
Convert information from computer databases into readable human language.
h. Natural language understanding
Convert chunks of text into more formal representations such as first-order logic
structures that are easier for computer programs to manipulate. Natural language
understanding involves identifying the intended meaning among the multiple possible
semantics that can be derived from a natural language expression, usually expressed as
organized notations of natural language concepts. Introducing a language metamodel and
ontology is an efficient, though empirical, solution. An explicit formalization of natural
language semantics, free of confusion with implicit assumptions such as the closed world
assumption (CWA) vs. the open world assumption, or subjective yes/no vs. objective
true/false, is expected to form the basis of a semantics formalization.
i. Optical character recognition (OCR)
Given an image representing printed text, determine the corresponding text.
j. Part-of-speech tagging
Given a sentence, determine the part of speech for each word. Many words, especially
common ones, can serve as multiple parts of speech. For example, "book" can be a
noun ("the book on the table") or verb ("to book a flight"); "set" can be a noun, verb
or adjective; and "out" can be any of at least five different parts of speech. Some
languages have more such ambiguity than others. Languages with little inflectional
morphology, such as English, are particularly prone to such ambiguity.
k. Parsing
Determine the parse tree (grammatical analysis) of a given sentence. The grammar for
natural languages is ambiguous and typical sentences have multiple possible analyses.
In fact, perhaps surprisingly, for a typical sentence there may be thousands of
potential parses (most of which will seem completely nonsensical to a human).
l. Question answering
Given a human-language question, determine its answer. Typical questions have a
specific right answer (such as "What is the capital of Canada?"), but sometimes open-
ended questions are also considered (such as "What is the meaning of life?"). Recent
works have looked at even more complex questions.
m. Relationship extraction
Given a chunk of text, identify the relationships among named entities (e.g. who is the
wife of whom).
n. Sentence breaking (also known as sentence boundary disambiguation)
Given a chunk of text, find the sentence boundaries. Sentence boundaries are often
marked by periods or other punctuation marks, but these same characters can serve
other purposes (e.g. marking abbreviations).
o. Sentiment analysis
Extract subjective information usually from a set of documents, often using online
reviews to determine "polarity" about specific objects. It is especially useful for
identifying trends of public opinion in social media, for the purpose of marketing.
p. Speech recognition
Given a sound clip of a person or people speaking, determine the textual
representation of the speech. This is the opposite of text to speech and is one of the
extremely difficult problems colloquially termed "AI-complete" (see above). In
natural speech there are hardly any pauses between successive words, and thus speech
segmentation is a necessary subtask of speech recognition (see below). Note also that
in most spoken languages, the sounds representing successive letters blend into each
other in a process termed coarticulation, so the conversion of the analog signal to
discrete characters can be a very difficult process.
q. Speech segmentation
Given a sound clip of a person or people speaking, separate it into words. A subtask
of speech recognition and typically grouped with it.
r. Topic segmentation and recognition
Given a chunk of text, separate it into segments each of which is devoted to a topic,
and identify the topic of the segment.
s. Word segmentation
Separate a chunk of continuous text into separate words. For a language like English,
this is fairly trivial, since words are usually separated by spaces. However, some
written languages like Chinese, Japanese and Thai do not mark word boundaries in
such a fashion, and in those languages text segmentation is a significant task requiring
knowledge of the vocabulary and morphology of words in the language.
t. Word sense disambiguation
Many words have more than one meaning; we have to select the meaning which
makes the most sense in context. For this problem, we are typically given a list of
words and associated word senses, e.g. from a dictionary or from an online resource
such as WordNet.
In some cases, sets of related tasks are grouped into subfields of NLP that are often
considered separately from NLP as a whole. Examples include:
 Information retrieval (IR)
This is concerned with storing, searching and retrieving information. It is a
separate field within computer science (closer to databases), but IR relies on
some NLP methods (for example, stemming). Some current research and
applications seek to bridge the gap between IR and NLP.
 Information extraction (IE)
This is concerned in general with the extraction of semantic information from
text. This covers tasks such as named entity recognition, coreference
resolution, relationship extraction, etc.
 Speech processing
This covers speech recognition, text-to-speech and related tasks.
Stemming is the term used in linguistic morphology and information retrieval to describe the
process of reducing inflected (or sometimes derived) words to their word stem, base or root
form, generally a written word form. The stem need not be identical to the
morphological root of the word; it is usually sufficient that related words map to the same
stem, even if that stem is not itself a valid root.
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike",
"catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on
"stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root
word, "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to
the stem "argu" (illustrating the case where the stem is not itself a word or root) but
"argument" and "arguments" reduce to the stem "argument".
The design of stemmers is language specific and requires some to significant linguistic
expertise in the language, as well as an understanding of the needs of a spelling checker for
that language. A typical simple stemming algorithm removes suffixes using a list of
frequent suffixes, while a more complex one uses morphological knowledge to derive a
stem from the words.
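The “list of frequent suffixes” idea can be illustrated for English with a toy sketch. This is not the Porter algorithm; the suffix list and minimum-stem-length guard are illustrative choices, intended only to show why a stripped stem (such as "argu") need not itself be a dictionary word.

```python
# Toy English suffix stripper: tries longer suffixes before shorter ones.
SUFFIXES = ["ing", "ed", "er", "es", "s"]

def simple_stem(word):
    """Strip the first matching suffix, keeping at least 3 stem characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ["fishing", "fished", "fisher", "argues", "arguing"]:
    print(w, "->", simple_stem(w))
# fishing -> fish, fished -> fish, fisher -> fish,
# argues -> argu, arguing -> argu
```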
Words that are identified as having the same root form are grouped in a cluster with the
identified root word as the cluster centre. An inflectional suffix is a terminal affix that does not
change the word-class (parts of speech) of the root during concatenation; it is added to
maintain the syntactic environment of the root in Bangla. On the other hand, derivational
suffixes change word-class (parts of speech) and the orthographic-form of the root word.
Experiments have been carried out with two types of algorithms: a simple suffix stripping
algorithm and a score-based stemming cluster identification algorithm. The suffix stripping
algorithm simply checks whether a word carries any suffix (one or more) from a manually
generated suffix list; the word is then assigned to the appropriate cluster whose centre is
the assumed root word, i.e., the form obtained after deleting the suffix from the surface
form. Suffix stripping works well for the noun, adjective and adverb categories, but words
of other part-of-speech categories, especially verbs, follow derivational morphology.
The score based stemming technique has been designed to resolve the stem for inflected word
forms. The technique uses Minimum Edit Distance method, well known for spelling error
detection, to measure the cost of classifying every word being in a particular class.
Score based technique considers two standard operations of Minimum Edit Distance, i.e.,
insertion and deletion. The consideration range of insertion and deletion for the present task
is maximum three characters. The idea is that the present word matches an existing cluster
centre after insertion and/or deletion of maximum three characters. The present word will be
assigned to the cluster that can be reached with minimum number of insertion and/or
deletion. This is an iterative clustering mechanism for assigning each word into a cluster. A
separate list of verb inflections (only 50 entries; manually edited) has been maintained to
validate the result of the score based technique.
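The score-based clustering step described above can be sketched as follows. Since only insertion and deletion are allowed (no substitution), the edit distance equals len(a) + len(b) − 2·LCS(a, b), where LCS is the longest common subsequence. The cluster centres and the test word below are romanized, hypothetical stand-ins for the Bengali data.

```python
def indel_distance(a, b):
    """Insertion/deletion-only edit distance = len(a) + len(b) - 2*LCS(a, b)."""
    m, n = len(a), len(b)
    lcs = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            lcs[i + 1][j + 1] = (lcs[i][j] + 1 if a[i] == b[j]
                                 else max(lcs[i][j + 1], lcs[i + 1][j]))
    return m + n - 2 * lcs[m][n]

def assign_cluster(word, centres, max_edits=3):
    """Assign the word to the nearest cluster centre within max_edits, else None."""
    best = min(centres, key=lambda c: indel_distance(word, c))
    return best if indel_distance(word, best) <= max_edits else None

# Hypothetical romanized cluster centres:
print(assign_cluster("dekhle", ["dekha", "kora"]))  # dekha
```

The max_edits=3 cutoff mirrors the stated consideration range of at most three inserted and/or deleted characters; a word farther than that from every centre is left unassigned for the current iteration.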
Stemming algorithms can be broadly classified into two categories, namely rule based
and statistical.
2.1. Rule Based Approach
In a rule based approach language specific rules are encoded and based on these rules
stemming is performed. In this approach various conditions are specified for converting
a word to its derivational stem, a list of all valid stems are given and also there are some
exceptional rules which are used to handle the exceptional cases. For example the
word “absorption” is derived from the stem “absorpt” and “absorbing” is derived from the
stem “absorb”. The problem of spelling exceptions arises in the above case when we try
to match the two words “absorpt” and “absorb”. Such exceptions are handled very
carefully by introducing recoding and partial-matching techniques in the stemmer as
post-stemming procedures.
Advantages of the Rule Based Approach are:
1. It is fast; the computation time needed to find a stem is low.
2. The retrieval results for English using a rule-based stemmer are very good.
One of the main disadvantages of a rule-based stemmer is that building one requires
extensive language expertise.
2.2. Statistical Approach
Statistical stemming is an effective and popular approach in information retrieval.
Some recent studies show that statistical stemmers are good alternatives to rule-based
stemmers. Additionally, their advantage lies in the fact that they do not require language
expertise. Rather, they employ statistical information from a large corpus of a given
language to learn the morphology of words.
Yet Another Suffix Stripper (YASS) is one such statistics-based, language-independent
stemmer. Its performance is comparable to that of Porter’s and Lovins’s stemmers, both in
terms of average precision and the total number of relevant documents retrieved, and it
addresses the challenge of retrieval in languages with poor resources.
GRAS is a graph based language independent stemming algorithm for information
retrieval [19]. The following features make this algorithm attractive and useful: (1)
retrieval effectiveness, (2) generality, that is, its language-independent nature, and (3)
low computational cost.
Advantages of the Statistical Stemmer are:
1. Statistical stemmers are useful for languages having scarce resources.
2. This approach yields the best retrieval results for suffixing languages and languages
which are morphologically more complex, such as French, Portuguese, Hindi, Marathi
and Bengali, rather than English.
A disadvantage of the statistical approach is that statistical stemmers are time consuming:
for them to work, we need complete language coverage in terms of the morphology of
words, their variants, etc.
3. RELATED WORK
Martin Porter developed the “Porter Stemmer”, a conflation stemmer, in 1980
at the University of Cambridge [5]. The Porter stemmer exploits the fact that English
suffixes are mostly combinations of smaller and simpler suffixes. Porter
designed a rule-based stemmer with five steps, each of which applies a set of rules.
Ramanathan and Rao (2003) proposed a lightweight stemmer for Hindi that uses a
hand-crafted suffix list and performs longest-match stripping. Light stemming refers to
stripping a small set of prefixes and/or suffixes, without trying to deal with infixes or to
recognize patterns and find roots. The lightweight stemmer proposed for Hindi is based on
the grammar of the Hindi language, from which a list of 65 suffixes was generated
manually. Terms are conflated by stripping off word endings from the suffix list on a
`longest match' basis. Noun, adjective and verb inflections were analysed, and on that
basis 65 unique suffixes were collected. The major advantage of this approach is that it is
computationally inexpensive. Documents were chosen from varied domains such as films,
health, business, sports and politics; the collection contained 35,977 unique words. The
under-stemming and over-stemming errors reported for this methodology were 4.68% and
13.84% respectively. No recall/precision-based evaluation of the work has been reported,
so the effectiveness of this stemming procedure is difficult to estimate.
Majumder et al. (2007) developed statistical approach YASS: Yet Another Suffix
Stripper, which uses a clustering based approach based on string distance measures and
requires no linguistic knowledge. They concluded that stemming improves recall of IR
systems for Indian languages like Bengali. YASS is based on string distance measure
which is used to cluster a lexicon created from a text corpus into homogenous groups. Each
group is expected to represent an equivalence class consisting of morphological
variants of the single root word.
Dasgupta and Ng (2006) proposed unsupervised morphological parsing of Bengali.
Unsupervised morphological analysis is the task of segmenting words into prefixes,
suffixes and stems without prior knowledge of language-specific morphotactics and
morphophonological rules. This parser is composed of two steps: (1) inducing prefixes,
suffixes and roots from a vocabulary consisting of words taken from a large,
unannotated corpus, and (2) segmenting a word based on these induced morphemes. When
evaluated on a set of 4,110 human-segmented Bengali words, the algorithm achieves
83% accuracy.
Pandey and Siddiqui (2008) [17] proposed an unsupervised stemming algorithm for
Hindi based on Goldsmith's (2001) [69] approach. It uses the split-all method. For
unsupervised learning (training), words were extracted from Hindi documents in the
EMILLE corpus. Each word is split to give n-gram (n = 1, 2, 3, …, l) suffixes, where l is
the length of the word. Suffix and stem probabilities are then computed and multiplied to
give a split probability, and the optimal segmentation is the one with the maximum split
probability. Some post-processing steps were taken to refine the learned suffixes. The
stemmer was evaluated on 1,000 words randomly extracted from the Hindi WordNet
database; the training data was constructed by extracting 106,403 words from the
EMILLE corpus. The observed accuracy is 89.9% after applying some heuristic measures,
and the F-score is 94.96%. The algorithm does not require any language-specific
information.
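The split-all scoring just described can be sketched as follows: every split of a word into (stem, suffix) is scored by P(stem) × P(suffix), with both probabilities estimated from counts over a training vocabulary, and the highest-scoring split wins. The training words below are a tiny hypothetical stand-in, not the EMILLE data, and the refinement heuristics are omitted.

```python
from collections import Counter

def train(words):
    """Count every possible stem (prefix) and suffix over the vocabulary."""
    stems, suffixes = Counter(), Counter()
    for w in words:
        for i in range(1, len(w) + 1):   # split after position i
            stems[w[:i]] += 1
            suffixes[w[i:]] += 1         # the suffix may be empty
    return stems, suffixes

def best_split(word, stems, suffixes):
    """Return the (stem, suffix) split with maximum P(stem) * P(suffix)."""
    total_s, total_f = sum(stems.values()), sum(suffixes.values())
    def score(i):
        return (stems[word[:i]] / total_s) * (suffixes[word[i:]] / total_f)
    i = max(range(1, len(word) + 1), key=score)
    return word[:i], word[i:]

stems, suffixes = train(["khelta", "khelti", "khelna", "chalta", "chalna"])
print(best_split("khelta", stems, suffixes))
```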
Majgaonker and Siddiqui (2010) developed an unsupervised approach for a Marathi
stemmer. Three different approaches to suffix-rule generation (rule based, suffix
stripping and statistical stripping) were used. The rule-based stemmer uses a set of
manually extracted suffix stripping rules, whereas the unsupervised approach learns
suffixes automatically from a set of words extracted from raw Marathi text. The
performance of both stemmers was compared on a test dataset of 1,500 manually
stemmed words. The maximum accuracy observed was 82.5%, for the statistical suffix
stripping approach, which uses a set of words to learn suffixes.
Suba et al. (2011) proposed two stemmers for Gujarati: a lightweight inflectional stemmer
based on a hybrid approach and a heavyweight derivational stemmer based on a rule-based
approach. The inflectional stemmer has an average accuracy of about 90.7%, which is
considerable as far as IR is concerned. The boost in accuracy due to POS-based stemming
was 9.6%, and inclusion of language characteristics boosted it by a further 12.7%. The
derivational stemmer has an average accuracy of 70.7%, which can act as a good baseline
and can be useful in tasks such as dictionary search or data compression. The limitations
of the inflectional stemmer can be easily overcome if modules such as a named entity
recognizer are integrated with the system.
“A Light Weight Stemmer for Bengali and Its Use in Spelling Checker” by Md. Zahurul
Islam, Md. Nizam Uddin and Mumit Khan of the Center for Research on Bangla Language
Processing, BRAC University, Dhaka, Bangladesh, presents a computationally inexpensive
stemming algorithm for Bengali that handles suffix removal in a domain-independent
way. First, the spelling checker checks the given word against a lexicon containing only
root words. If the word is found, it is a valid word and the checking process terminates. If
the word is not found in the lexicon, the stemming algorithm is applied. There are two
possible scenarios: the stemming algorithm finds and returns a stem, or it cannot find a
possible suffix. A modified stemming method is then used to obtain a list of probable
stems with their suffixes. The correction accuracy is 90.8% for single-error misspellings
and 67% for multi-error misspellings.
In 2012, an iterative stemmer for the Tamil language was proposed by Vivekanandan
Ramachandran et al. In this model, a suffix stripper algorithm is used to stem Tamil
words to their root words.
Upendra Mishra and Chandra Prakash present a hybrid approach that combines the brute
force and suffix removal approaches and reduces the problems of over-stemming and
under-stemming.
4. PROPOSED APPROACH
Our proposed algorithm is a lightweight stemmer for Bengali verbs that strips suffixes
using a predefined suffix list on a “longest match” basis and then finds the root on the
basis of some rules. First, the input file is read and the inflected verb forms are fetched.
The inflexion of each inflected verb is compared with the suffixes in the suffix list and
removed if a match is found. The subroot is then checked: if it ends with e-kar(‘খ ’),
o-kar(‘ো া’), a-kar(‘দ ’) or aa-kar(‘ া’), it is replaced with aa-kar(‘ া’); if it starts with
e-kar (‘খ ’), u-kar (‘ ’) or a-kar(‘দ ’), it is replaced with a-kar(‘দ ’), o-kar(‘ো া’) or
aa-kar(‘ া’) respectively. The output doc file is generated by copying the contents of the
input file and concatenating each word containing ‘/verb’ with its obtained root word.
Finally, the generated output file is compared with the desired output file and the
efficiency is calculated.
4.1. Overall Pictorial Representation
The overall flow of the proposed approach is: Reading Input Text → Selecting & tagging
verbs → Fetching of tagged verbs → Module 1: Applying suffix stripping → Obtaining
stripped part → Module 2: Applying rules → Generating Output File → Calculating
Efficiency.
Figure 1: Pictorial representation of proposed approach
4.1.1. Explanation of Proposed Approach with example
Reading Input Text: গাড় ছাড় ার কয়েক ্ত আয়গ একি অপ েকা জা তান ম য়ে যত ত ায় গাড় য়্ উয়ে ।
Selecting & tagging verbs: গাড় ছাড় ার/verb কয়েক ্ত আয়গ একি অপ েকা জা তান ম য়ে যত ত ায় গাড় য়্ উয়ে ল/verb ।
Fetching of tagged verbs: ছাড় ার/verb, ল/verb
Applying suffix stripping: ছাড় ার -> ছাড় + ার, ল -> + ল
Obtaining stripped part: ছাড়,
Applying rules: ছাড়া, া
Generating Output File: গাড় ছাড় ার/verb/ছাড়া কয়েক ্ত আয়গ একি অপ েকা জা তান ম য়ে যত ত ায় গাড় য়্ উয়ে ল/verb/ া ।
Table 2: Proposed Approach with example
4.1.2. Detail explanation of Module 1 (Suffix Stripping)
Module 1 works as follows: the suffix list is read and suffixes are fetched from it one at a
time. For each suffix, the algorithm checks whether the considered verb contains that
suffix. If it does, the suffix is stripped from the inflected verb and the subroot (stripped
verb) is obtained; if not, the next suffix is fetched from the list, until all suffixes have been
fetched.
Figure 2: Module 1 (Suffix Stripping)
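The longest-match stripping of Module 1 can be sketched as follows. This is a minimal illustration, not the project's full code (which appears in the Appendix); the class name is ours, and Latin-transliterated words stand in for Bengali strings in the usage below.

```java
import java.util.List;

// A minimal sketch of Module 1 (longest-match suffix stripping).
// The guard (>= 2 remaining characters) mirrors the index >= 2 check in
// the project's Appendix code, so a suffix never swallows the whole stem.
class SuffixStripper {
    public static String strip(String word, List<String> suffixes) {
        String best = null;
        for (String s : suffixes) {
            // a candidate suffix must match the ending and leave a stem
            if (word.endsWith(s) && word.length() - s.length() >= 2) {
                if (best == null || s.length() > best.length()) {
                    best = s; // keep the longest matching suffix
                }
            }
        }
        return best == null ? word
                            : word.substring(0, word.length() - best.length());
    }
}
```

For example, for the transliterated inflexion "dekhchilam" with both "lam" and "chilam" in the suffix list, the longer match "chilam" is stripped, leaving the subroot "dekh".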
4.1.3. Detail explanation of Module 2: Applying Rules
4.1.4. Sentence Collection
The Technology Development for Indian Languages (TDIL) Programme, initiated by the
Department of Electronics & Information Technology (DeitY), Ministry of Communication
& Information Technology (MC&IT), Govt. of India, has the objective of developing
information processing tools and techniques to facilitate human-machine interaction
without a language barrier; creating and accessing multilingual knowledge resources; and
integrating them to develop innovative user products and services.
The stripped verb (subroot) is read and checked. If the subroot ends with e-kar(‘খ ’),
o-kar(‘ো া’), a-kar(‘দ ’) or aa-kar(‘ া’), the ending kar is replaced with aa-kar(‘ া’);
otherwise, if the length of the subroot is less than 3, it is concatenated with aa-kar(‘ া’).
Next, if the subroot starts with e-kar(‘খ ’), the starting kar is replaced with a-kar(‘দ ’); if it
starts with u-kar(‘ ’), the starting kar is replaced with o-kar(‘ো া’); and if it starts with
a-kar(‘দ ’), the starting kar is replaced with aa-kar(‘ া’). The root verb is then obtained.
Figure 3: Module 2 (Applying Rules)
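The rule chain of Module 2 can be sketched as follows. This is an illustration only, not the project's code: the Unicode code points for the kars are our assumption (the glyphs do not render reliably in this copy), and the a-kar cases are omitted for the same reason.

```java
// A sketch of Module 2's vowel-sign (kar) rewriting. Assumed code points
// (hedged): aa-kar U+09BE, e-kar U+09C7, o-kar U+09CB, u-kar U+09C1.
// The a-kar rules are omitted because that glyph is unreadable here.
class KarRules {
    static final char AA = '\u09BE', E = '\u09C7', O = '\u09CB', U = '\u09C1';

    public static String applyRules(String s) {
        int n = s.length();
        // Rule 1: an ending e-kar, o-kar or aa-kar is replaced by aa-kar
        if (n > 0 && (s.charAt(n - 1) == E || s.charAt(n - 1) == O
                   || s.charAt(n - 1) == AA)) {
            s = s.substring(0, n - 1) + AA;
        } else if (n < 3) {
            s = s + AA; // Rule 2: a too-short subroot gets aa-kar appended
        }
        // Rule 3: a u-kar after the first consonant becomes o-kar
        if (s.length() > 1 && s.charAt(1) == U) {
            s = s.charAt(0) + String.valueOf(O) + s.substring(2);
        }
        return s;
    }
}
```

Because Bengali vowel signs are combining characters that follow their consonant, the "starts with" checks inspect position 1 of the string, just as the Appendix code does with `startsWith(kar, 1)`.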
The Programme also promotes language technology standardization through active
participation in international and national standardization bodies such as ISO, Unicode, the
World Wide Web Consortium (W3C) and the BIS (Bureau of Indian Standards), to ensure
adequate representation of Indian languages in existing and future language technology
standards.
The input sentences are collected from 50 different categories of the Bengali text corpus
developed in the TDIL project of the Govt. of India, while the information about the
different inflexions of a particular verb is collected from a Bengali dictionary. We have
selected 14 Bengali verbs and prepared a sentence for each inflexion of a particular verb.
Accordingly, we have applied our algorithm over 638 sentences.
4.1.5. Normalization
The Bengali text corpus developed in the TDIL project of the Govt. of India separates
words by ‘|’, whereas we have separated words by spaces (‘ ’). Moreover, the end of each
sentence is marked by ‘|’, and any kind of terminating punctuation, e.g. question mark ‘?’,
comma ‘,’, exclamation mark ‘!’, etc., is replaced by ‘|’.
Figure 4: Screen Shot of Un-normalized Document
Figure 5: Screen Shot of Normalized Document
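The normalization rule above can be sketched as follows. This is a hedged illustration: we assume the ‘|’ sentence marker is the Bengali danda U+0964 (as in the Appendix code, which splits the suffix file on that character), and the class name is ours.

```java
// A sketch of the normalization step: terminating punctuation marks
// (?, , and !) are mapped to the Bengali danda (assumed to be U+0964,
// as used by the Appendix code), and words are separated by single spaces.
class Normalizer {
    public static String normalize(String text) {
        return text.replaceAll("[?,!]", " \u0964 ") // punctuation -> danda
                   .replaceAll("\\s+", " ")         // collapse whitespace
                   .trim();
    }
}
```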
4.1.6. Tagging of Verb
In every sentence, the inflected word whose root is to be found is tagged with ‘/verb’.
Figure 6: Screen Shot of verb-tagged Document
4.1.7. Preparing Output File
An output file is prepared in which the inflected word of every sentence whose root is to be
found is tagged with ‘/verb/’ followed by the actual root word. This file is prepared in
order to calculate the efficiency of our proposed algorithm.
Figure 7: Screen Shot of desired output Document
4.1.8. Preparing Suffix List
After surveying the inflexions of various Bengali verbs across the 50 different categories of
the Bengali text corpus developed in the TDIL project of the Govt. of India, we have
prepared a suffix list by selecting the 35 most frequently occurring suffixes.
4.1.9. Verification
The generated output file is compared with the prepared output file, and thereby the
efficiency of the algorithm is calculated.
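The comparison can be sketched as follows, mirroring the token-by-token `equalsIgnoreCase` check in the Appendix code; the class and method names here are illustrative.

```java
// A sketch of the verification step: compare the generated tokens with
// the desired tokens position by position and report percentage agreement.
class Verifier {
    public static double efficiency(String[] generated, String[] desired) {
        int matches = 0;
        int n = Math.min(generated.length, desired.length);
        for (int i = 0; i < n; i++) {
            if (generated[i].equalsIgnoreCase(desired[i])) {
                matches++; // token agrees with the desired output
            }
        }
        return 100.0 * matches / generated.length; // percent correct
    }
}
```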
4.2. Algorithm
STEP 1. Start of algorithm.
STEP 2. Create three new string arrays, namely splits1[ ], splits2[ ] and splits3[ ].
STEP 3. Read the contents of the doc files and split the words by space (‘ ’) separator.
3.1. Store the words of each sentence in splits1[ ].
3.2. Store the inflexions in splits2[ ].
3.3. Store the desired root words in splits3[ ].
STEP 4. Declare and initialize variables l1=length of splits1[ ] , l2=length of splits2[ ] .
STEP 5. Fetch the inflected verb forms into input1[ ] from splits1[i] if the currently
fetched word contains ‘/verb’. This step is repeated l1 times.
5.1. Determine the subroot from input1[i] by repeating the steps l2 times.
5.1.1. if splits2[j] is contained in input1[i] then,
5.1.1.a. Declare variable index which stores the index
of last occurrence of splits2[j] in input1[i].
5.1.1.b. If index is greater than equal to 2 then,
5.1.1.b.i. Store the substring of input1[i] from
begindex=0 to endindex=index in input1[i].
5.1.1.b.ii. Break the loop.
5.2. Determine the actual root input1[i] by repeating the steps l1 times.
5.2.1. Check the ending kar of input1[i].
5.2.1.a. if input1[i] ends with e-kar(‘খ ’), o-kar(‘ো া’), a-
kar(‘দ ’) or aa-kar(‘ া’) then, replace it with aa-kar(‘ া’).
5.2.1.b. if length of input1[i] is less than 3, concatenate it
with aa-kar(‘ া’).
5.2.2. Check the starting kar of input1[i].
5.2.2.a. if input1[i] starts with e-kar (‘খ ’), then replace it with
a-kar(‘দ ’).
5.2.2.b. if input1[i] starts with u-kar (‘ ’), then replace it with
o-kar(‘ো া’).
5.2.2.c. if input1[i] starts with a-kar(‘দ ’), then replace it with
aa-kar(‘ া’).
STEP 6. Generate the output doc file by copying the contents of splits1[] and
concatenating it with their obtained root words from input1[] wherever the word
contains ‘/verb’.
STEP 7. Compare the obtained sentences in splits1[ ] with the desired sentences in
splits3[ ] and calculate the efficiency.
STEP 8. End of algorithm.
5. OUTPUT AND DISCUSSION:
5.1. Partial View of Input File:
Figure 8: Partial view of Input File
5.2. Suffix List:
Figure 9: Screen shot of Suffix List
5.3. Partial View of Output File:
Figure 10: Partial view of Output File
5.4. EFFICIENCY:
Dealing with 500 sentences, our proposed approach gives an efficiency of 99.4%.
5.5. TIME COMPLEXITY:
The time complexity of the proposed algorithm is:
WORST CASE: O(n²)
Figure 11: Screen shot of Efficiency of proposed approach
6. CONCLUSION AND FUTURE WORK:
Stemming plays a vital role in information retrieval systems, and its effect on their
performance is significant. In this project, we present a lightweight stemmer for 14 selected
Bengali verbs that strips suffixes using a predefined suffix list, on a “longest match” basis,
and then finds the root by applying a set of rules. Except in a few cases, the results
obtained from our algorithm are quite satisfactory and meet our expectations. We argue
that a stronger, more fully populated learning set would invariably yield better results. In
future, we plan to test our algorithm on more sets of Bengali verbs.
As research on the Bengali language is far less advanced than on languages like English
and Hindi, many dimensions remain untouched. Using relevant new approaches, a better
Bengali stemmer can be developed, which will be useful for further linguistic computing.
ACKNOWLEDGEMENT
It gives us great pleasure to have the opportunity to express our deep and sincere gratitude
to our project guide, Mr. Alok Ranjan Pal. We very respectfully recall his constant
encouragement, kind attention and keen interest throughout the course of our work. We are
highly indebted to him for the way he modeled and structured our work with the valuable
tips and suggestions he accorded us in every aspect of our work.
We are extremely grateful to the Department of Computer Science & Engineering, CEMK,
for extending all the facilities of our department.
We humbly extend our gratitude to the other faculty members, laboratory staff, library staff
and administration of this Institute for providing their valuable help and time, along with a
congenial working environment.
Last but not least, we would like to convey our heartiest thanks to all our classmates who
from time to time helped us with their valuable suggestions during our project work.
Date:23.05.2015
Anasuya Paul
University Roll:10700111006
University Registration No:111070110006
Joyeeta Bagchi
University Roll:10700111021
University Registration No:111070110021
Koushik Dutta
University Roll:10700111024
University Registration No:111070110024
Sneha Sarkar
University Roll:10700111049
University Registration No:111070110049
References:
1. A. Ramanathan and D. D. Rao, “A Lightweight Stemmer for Hindi”, Workshop on
Computational Linguistics for South-Asian Languages, EACL, 2003.
2. M. Z. Islam, M. N. Uddin and M. Khan, “A Light Weight Stemmer for Bengali
and its Use in Spelling Checker”, Proc. 1st Intl. Conf. on Digital Comm. and
Computer Applications (DCCA07), Irbid, Jordan, March 19-23, 2007.
3. P. Majumder, M. Mitra, S. K. Parui, G. Kole, P. Mitra, and K. Datta, “YASS: Yet
Another Suffix Stripper”, Association for Computing Machinery Transactions on
Information Systems, 25(4):18-38, 2007.
4. S. Dasgupta and V. Ng, “Unsupervised Morphological Parsing of Bengali”,
Language Resources and Evaluation, 40(3-4):311-330, 2006.
5. A. K. Pandey and T. J. Siddiqui, “An Unsupervised Hindi Stemmer with Heuristic
Improvements”, In Proceedings of the Second Workshop on Analytics For Noisy
Unstructured Text Data, 303:99-105, 2008.
6. M. M. Majgaonker and T. J Siddiqui, “Discovering Suffixes: A Case Study for
Marathi Language”,International Journal on Computer Science and Engineering,
Vol. 02, No. 08, pp. 2716-2720, 2010.
7. K. Suba, D. Jiandani and P. Bhattacharyya, “Hybrid Inflectional Stemmer and
Rule-based Derivational Stemmer for Gujarati”, In proceedings of the 2nd
Workshop on South and Southeast Asian Natural Language Processing
(WSSANLP), IJCNLP 2011, Chiang Mai, Thailand, pp.1-8, 2011.
8. M.F. Porter, “An algorithm for suffix stripping”, Program, 14(3) 1980, pp. 130−137.
9. P. Kundu and B.B. Chaudhuri, “Error Pattern in Bengali Text”, International Journal
of Dravidian Linguistics, 28(2) 1999.
10. B.B. Chaudhuri, “Reversed word dictionary and phonetically similar word grouping
based spell-checker to Bengali text”, In the Proceedings of LESAL Workshop, 2001.
12. Sandipan Sarkar and Sivaji Bandyopadhyay. Study on Rule-Based Stemming
Patterns and Issues in a Bengali Short Story-Based Corpus. In ICON 2009.
13. S. Dasgupta,M. Khan: Morphological parsing of Bangla words using PCKIMMO. In:
ICCIT 2004. (2004).
14. Barzilay, R. and Elhadad, M. 1997. Using Lexical Chains for Text Summarization. In
Proceedings of the Workshop on Intelligent Scalable Text Summarization. Madrid,
Spain.
15. Pratikkumar Patel and Kashyap Popat, “Hybrid Stemmer for Gujarati”, in Proc. of the
1st Workshop on South and Southeast Asian Natural Language Processing
(WSSANLP), pages 51–55, the 23rd International Conference on Computational
Linguistics (COLING), Beijing, August 2010.
16. Upendra Mishra and Chandra Prakash, “MAULIK: An Effective Stemmer for Hindi
Language”, International Journal on Computer Science and Engineering (IJCSE).
Abduelbaset M. Goweder, Husien A. Alhammi, Tarik Rashed, and Abdulsalam Musrat,
“A Hybrid Method for Stemming Arabic Text”.
17. Kartik Suba, Dipti Jiandani and Pushpak Bhattacharyya “Hybrid Inflectional Stemmer
and Rule-based Derivational Stemmer for Gujarati”
18. Haidar Harmanani, Walid Keirouz and Saeed Raheel, “A Rule Based Extensible
Stemmer for Information Retrieval with Application to Arabic”, The International
Arab Journal of Information Technology, Vol. 3, July 2006.
19. Navanath Saharia, Utpal Sharma and Jugal Kalita, “Analysis and Evaluation of
Stemming Algorithms: A Case Study with Assamese”, ICACCI’12, August 3-5,
2012, Chennai, Tamil Nadu, India.
20. Nikhil Kanuparthi, Abhilash Inumella and Dipti Misra Sharma “Hindi Derivational
Morphological Analyzer”. Proceedings of the Twelfth Meeting of the Special Interest
Group on Computational Morphology and Phonology (SIGMORPHON2012), pages
10–16, Montréal, Canada, June 7, 2012. © 2012 Association for Computational
Linguistics.
21. Juhi Ameta, Nisheeth Joshi, Iti Mathur “A Lightweight Stemmer for Gujarati”.
22. Mohamad Ababneh, Riyad Al-Shalabi, Ghassan Kanaan, and Alaa AlNobani
“Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve
Search Effectiveness” The International Arab Journal of Information Technology,
Vol. 9, No. 4, July 2012.
23. Ms. Anjali Ganesh Jivani “A Comparative Study of Stemming Algorithms” Int. J.
Comp. Tech. Appl., Vol 2 (6), 1930-1938
24. M. F. Porter, “An Algorithm for Suffix Stripping”, Program, 14(3):130-137, 1980.
25. V. M. Orengo and C. Huyck “A Stemming Algorithm for the Portuguese Language”
Proceedings of the Eighth International Symposium on String Processing and
Information Retrieval, pages 186-193, 2001.
26. Deepika Sharma “Stemming Algorithms: A Comparative Study and their Analysis”
International Journal of Applied Information Systems (IJAIS) – ISSN: 2249-0868
Foundation of Computer Science FCS, New York, USA Volume 4– No.3, September
2012 .
27. J. B. Lovins, “Development of a Stemming Algorithm”, Mechanical Translation
and Computational Linguistics, 11(1-2):22-31, 1968.
Appendix:
1. Program Code :
package stemming_verb;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.*;
public class Stemming_verb {
public static void main(String[] args) {
File file1=null, file2=null, file3=null, file4=null;
WordExtractor extractor1 = null, extractor2=null, extractor3=null, extractor4=null;
try{
/*------------------Reading sentences-----------------------*/
file1 = new File("G:\\Stemming\\final_project\\sentence_input.doc");
FileInputStream fis1 = new FileInputStream(file1.getAbsolutePath());
HWPFDocument document1 = new HWPFDocument(fis1);
extractor1 = new WordExtractor(document1);
String fileData1 = extractor1.getText();
String[] splits1 = fileData1.split(" ");
String[] input1=new String[splits1.length];
int l1=splits1.length;
/*------------------Reading inflexions-----------------------*/
file2 = new File("G:\\Stemming\\final_project\\suffixes.doc");
FileInputStream fis2 = new FileInputStream(file2.getAbsolutePath());
HWPFDocument document2 = new HWPFDocument(fis2);
extractor2 = new WordExtractor(document2);
String fileData2 = extractor2.getText();
String[] splits2 = fileData2.split("।");
int l2=splits2.length;
/*-------------------Reading desired output file----------------------*/
file4 = new File("G:\\Stemming\\final_project\\sentence_output.doc");
FileInputStream fis4 = new FileInputStream(file4.getAbsolutePath());
HWPFDocument document4 = new HWPFDocument(fis4);
extractor4 = new WordExtractor(document4);
String fileData4 = extractor4.getText();
String[] splits4 = fileData4.split(" ");
int l4=splits4.length;
/*-------------------Suffix stripping----------------------*/
int verb=0;
for(int i=0;i<l1;i++)
{
if(splits1[i].contains("/verb"))
{
verb++;
int index1=splits1[i].lastIndexOf("/verb");
input1[i]=splits1[i].substring(0,index1);
for(int j=0;j<l2;j++)
{
if(input1[i].contains(splits2[j]))
{
int index=input1[i].lastIndexOf(splits2[j]);
if(index>=2)
{
input1[i]=input1[i].substring(0, index);
break;
}
}
}
}
}
/*--------------------applying rules------------------*/
for(int i=0;i<l1;i++)
{
if(input1[i]!=null){
if(input1[i].endsWith("খ "))//ends with e-kar
{
input1[i]=input1[i].replace("খ "," া") ;
}
else if(input1[i].endsWith("দ া"))//ends with o-kar
{
input1[i]=input1[i].replace("দ া"," া") ;
}
else if(input1[i].endsWith("দ "))//ends with a-kar
{
input1[i]=input1[i].replace("দ "," া") ;
}
else if(input1[i].endsWith(" া"))//ends with aa-kar
{
input1[i]=input1[i].replace(" া"," া") ;
}
else if(input1[i].length()<=3)
{
input1[i]=input1[i].concat(" া");
}
String word4;
word4=(input1[i]);
if(word4.startsWith("খ ", 1))//for replacing e-kar by a-kar
{
input1[i]=input1[i].replace("খ ","দ ") ;
}
else if(word4.startsWith(" ", 1))//for replacing u-kar by o-kar
{
input1[i]=input1[i].replace(" ","দ া") ;
}
else if(word4.startsWith("দ ", 1))//for replacing a-kar by aa-kar
{
input1[i]=input1[i].replace("দ "," া") ;
}
}
}
/*--------------------Writing obtained root words to doc file----------------------*/
boolean append75=true;
FileWriter write75=new
FileWriter("G:\\Stemming\\final_project\\output_new.doc", append75);
PrintWriter print_line75= new PrintWriter(write75);
String word1,word2;
for(int i=0;i<l1;i++)
{
word1=(splits1[i]);
word2=(input1[i]);
if(splits1[i].contains("/verb"))
{
splits1[i]=splits1[i].concat("/").concat(input1[i]);
print_line75.printf(word1+"/"+word2+" ");
}
else
print_line75.printf(word1+" ");
}
print_line75.close();
/*--------------------Calculating efficiency-----------------------*/
int count=0;
double eff;
for(int i=0;i<l1;i++)
{
if(splits1[i].equalsIgnoreCase(splits4[i]))
{
count++;
}
}
eff=(double)count*100/l1;
System.out.println("No of inflected verb forms = "+verb);
System.out.println("Efficiency = "+eff);
}
catch(Exception e){
System.out.println(e);
}
}
}

Weitere ähnliche Inhalte

Ähnlich wie project doc (1)

Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...butest
 
Formalization and implementation of BFO 2 with a focus on the OWL implementation
Formalization and implementation of BFO 2 with a focus on the OWL implementationFormalization and implementation of BFO 2 with a focus on the OWL implementation
Formalization and implementation of BFO 2 with a focus on the OWL implementationgolpedegato2
 
QUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageQUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageIJERA Editor
 
LIP READING: VISUAL SPEECH RECOGNITION USING LIP READING
LIP READING: VISUAL SPEECH RECOGNITION USING LIP READINGLIP READING: VISUAL SPEECH RECOGNITION USING LIP READING
LIP READING: VISUAL SPEECH RECOGNITION USING LIP READINGIRJET Journal
 
12EEE032- text 2 voice
12EEE032-  text 2 voice12EEE032-  text 2 voice
12EEE032- text 2 voiceNsaroj kumar
 
IRJET - Voice based Natural Language Query Processing
IRJET -  	  Voice based Natural Language Query ProcessingIRJET -  	  Voice based Natural Language Query Processing
IRJET - Voice based Natural Language Query ProcessingIRJET Journal
 
Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22Sebastiano Panichella
 
Pattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to DatabasePattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to DatabaseIJERA Editor
 
LoCloud - D3.4: Vocabulary services
LoCloud - D3.4: Vocabulary servicesLoCloud - D3.4: Vocabulary services
LoCloud - D3.4: Vocabulary serviceslocloud
 
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGLARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGkevig
 
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGLARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGkevig
 
Hindi language as a graphical user interface to relational database for tran...
Hindi language as a graphical user interface to relational  database for tran...Hindi language as a graphical user interface to relational  database for tran...
Hindi language as a graphical user interface to relational database for tran...IRJET Journal
 
KOS Management - The case of the Organic.Edunet Ontology
KOS Management - The case of the Organic.Edunet OntologyKOS Management - The case of the Organic.Edunet Ontology
KOS Management - The case of the Organic.Edunet OntologyVassilis Protonotarios
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural LanguageIRJET Journal
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlpeSAT Journals
 
Venkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitVenkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitBOSC 2010
 
New Features of Python 3.10
New Features of Python 3.10New Features of Python 3.10
New Features of Python 3.10Gabor Guta
 
Carmichael.kevin
Carmichael.kevinCarmichael.kevin
Carmichael.kevinNASAPMC
 

Ähnlich wie project doc (1) (20)

Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
Part-of-Speech Tagging for Bengali Thesis submitted to Indian ...
 
Formalization and implementation of BFO 2 with a focus on the OWL implementation
Formalization and implementation of BFO 2 with a focus on the OWL implementationFormalization and implementation of BFO 2 with a focus on the OWL implementation
Formalization and implementation of BFO 2 with a focus on the OWL implementation
 
QUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu LanguageQUrdPro: Query processing system for Urdu Language
QUrdPro: Query processing system for Urdu Language
 
LIP READING: VISUAL SPEECH RECOGNITION USING LIP READING
LIP READING: VISUAL SPEECH RECOGNITION USING LIP READINGLIP READING: VISUAL SPEECH RECOGNITION USING LIP READING
LIP READING: VISUAL SPEECH RECOGNITION USING LIP READING
 
12EEE032- text 2 voice
12EEE032-  text 2 voice12EEE032-  text 2 voice
12EEE032- text 2 voice
 
IRJET - Voice based Natural Language Query Processing
IRJET -  	  Voice based Natural Language Query ProcessingIRJET -  	  Voice based Natural Language Query Processing
IRJET - Voice based Natural Language Query Processing
 
Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22Search-based Software Testing (SBST) '22
Search-based Software Testing (SBST) '22
 
Pattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to DatabasePattern based approach for Natural Language Interface to Database
Pattern based approach for Natural Language Interface to Database
 
LoCloud - D3.4: Vocabulary services
LoCloud - D3.4: Vocabulary servicesLoCloud - D3.4: Vocabulary services
LoCloud - D3.4: Vocabulary services
 
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGLARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
 
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDINGLARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
LARQS: AN ANALOGICAL REASONING EVALUATION DATASET FOR LEGAL WORD EMBEDDING
 
Hindi language as a graphical user interface to relational database for tran...
Hindi language as a graphical user interface to relational  database for tran...Hindi language as a graphical user interface to relational  database for tran...
Hindi language as a graphical user interface to relational database for tran...
 
KOS Management - The case of the Organic.Edunet Ontology
KOS Management - The case of the Organic.Edunet OntologyKOS Management - The case of the Organic.Edunet Ontology
KOS Management - The case of the Organic.Edunet Ontology
 
IRJET- An Efficient Way to Querying XML Database using Natural Language
IRJET-  	  An Efficient Way to Querying XML Database using Natural LanguageIRJET-  	  An Efficient Way to Querying XML Database using Natural Language
IRJET- An Efficient Way to Querying XML Database using Natural Language
 
Knowledge Organization Systems (KOS): Management of Classification Systems in...
Knowledge Organization Systems (KOS): Management of Classification Systems in...Knowledge Organization Systems (KOS): Management of Classification Systems in...
Knowledge Organization Systems (KOS): Management of Classification Systems in...
 
Accessing database using nlp
Accessing database using nlpAccessing database using nlp
Accessing database using nlp
 
Venkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkitVenkatesan bosc2010 onto-toolkit
Venkatesan bosc2010 onto-toolkit
 
New Features of Python 3.10
New Features of Python 3.10New Features of Python 3.10
New Features of Python 3.10
 
NEW_PPT
NEW_PPTNEW_PPT
NEW_PPT
 
Carmichael.kevin
Carmichael.kevinCarmichael.kevin
Carmichael.kevin
 

project doc (1)

  • 1. A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS A Project Work Submitted in Partial Fulfilment of the Requirements for the Degree of BACHELOR OF TECHNOLOGY in COMPUTER SCIENCE & ENGINEERING by ANASUYA PAUL (Roll No. 10700111006) JOYEETA BAGCHI (Roll No. 10700111021) KOUSHIK DUTTA (Roll No. 10700111024) SNEHA SARKAR (Roll No. 10700111049) Under the supervision of Mr. Alok Ranjan Pal DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT (Affiliated to West Bengal University of Technology) Purba Medinipur – 721171, West Bengal, India
  • 2. 1 DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT (Affiliated to West Bengal University of Technology) Purba Medinipur – 721171, West Bengal, India CERTIFICATE OF APPROVAL This is to certify that the work embodied in this project entitled A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS submitted by Anasuya Paul, Joyeeta Bagchi, Koushik Dutta and Sneha Sarkar to the Department of Computer Science & Engineering, is carried out under my direct supervision and guidance. The project work has been prepared as per the regulations of West Bengal University of Technology and I strongly recommend that this project work be accepted in fulfilment of the requirement for the degree of B.Tech. Supervisor Mr. Alok Ranjan Pal Asst. Prof., Dept. of CSE Countersigned by Prof. (Dr.) Dilip Kumar Gayen Head Department of CSE
  • 3. 2 DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING COLLEGE OF ENGINEERING & MANAGEMENT, KOLAGHAT (Affiliated to West Bengal University of Technology) Purba Medinipur – 721171, West Bengal, India Certificate by the Board of Examiners This is to certify that the project work entitled A RULE BASED APPROACH ON STEMMING OF BENGALI VERBS submitted by Anasuya Paul, Joyeeta Bagchi, Koushik Dutta and Sneha Sarkar to the Department of Computer Science and Engineering of College of Engineering of Management, Kolaghat has been examined and evaluated. The project work has been prepared as per the regulations of West Bengal University of Technology and qualifies to be accepted in fulfilment of the requirement for the degree of B. Tech. Project Co-ordinator Board of Examiners
  • 4. 3 ABSTRACT Based on the various inflexions of verbs available in the Bengali Dictionary, an attempt is made to retrieve the stem word from their inflexions in the underlying sentences. The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while the information about different inflexions of particular verb is collected from Bengali Dictionary. In this project, we present a lightweight stemmer for 14 selected Bengali Verbs that strips the suffixes using a predefined suffix list, on a “longest match” basis, and then finds root on basis of some rules. We have applied the algorithm over 450 sentences and achieved around 99.36% accuracy in retrieving the root word from their inflexions in the underlying sentences .The proposed stemmer is both computationally inexpensive and domain independent.
  • 5. 4 INDEX Sl.No. TITLE Pg.No. 1. Introduction ------------------------------------------------------------------ 5 – 6 2. Theoretical Study ------------------------------------------------------------ 7 – 12 3. Related Work ---------------------------------------------------------------- 13 – 14 4. Proposed Approach --------------------------------------------------------- 15 – 21 4.1. Overall Pictorial Representation ------------------------------------------ 15 4.1.1. Explanation of Proposed Approach with example ---------------------- 16 4.1.2. Detail explanation of Module 1 (Suffix Stripping) --------------------- 16 4.1.3. Detail explanation of Module 2 (Applying Rules) ---------------------- 17 4.1.4. Sentence Collection --------------------------------------------------------- 17 4.1.5. Normalization ---------------------------------------------------------------- 18 4.1.6. Tagging of Verbs ------------------------------------------------------------ 19 4.1.7. Preparing Output File ------------------------------------------------------- 19 4.1.8. Preparing Suffix List -------------------------------------------------------- 19 4.1.9. Verification ------------------------------------------------------------------- 20 4.2. Algorithm --------------------------------------------------------------------- 20 – 21 5. Output and Discussion ------------------------------------------------------ 22 – 24 5.1. Partial View of Input File -------------------------------------------------- 22 5.2. Suffix List -------------------------------------------------------------------- 22 5.3. Partial View of Output File ------------------------------------------------ 23 5.4. Efficiency --------------------------------------------------------------------- 24 5.5. Time Complexity ------------------------------------------------------------ 24 6. Conclusion and Future Work ---------------------------------------------- 25 i. 
Acknowledgement ---------------------------------------------------------- 26 ii. References ------------------------------------------------------------------- 27 – 28 iii. Appendix --------------------------------------------------------------------- 29 – 32
  • 6. 5 1. INTRODUCTION Stemming is an operation that splits a word into the constituent root part and affix without doing complete morphological analysis. It is used to improve the performance of spelling checkers and information retrieval applications, where morphological analysis would be too computationally expensive. It is a pre-processing step in Text Mining applications as well as a very common requirement of Natural Language processing functions. The main purpose of stemming is to reduce different grammatical forms / word forms of a word like its noun, adjective, verb, adverb etc. to its root form. We can say that the goal of stemming is to reduce inflectional forms and sometimes derivationally related forms of a word to a common base form. Bengali is one of the most morphologically rich languages. More than one inflection can be applied to the stem to form the word type. Stemming is a hard problem for the four categories:Noun, Adjective, Adverb and Verb but Verb is the most problematic area for Stemming. Bangla has a vast inflectional system, the number of inflected and derivational forms of a certain lexicon is huge. For example there are nearly (10*5) forms for a certain verb word in Bengali as there is 10 tenses and 5 persons and a root verb changes its form according to tense and person. For example here are 20 forms of verb root KA (খা). Other than this, there are lots of prefixes and suffixes, which can attach with a root word and form a new word. 
Different forms of verb root DEKHA (দেখা) are dekhi(দেখখ), dekhis(দেখখস) ,dekh (দেখ) ,dekhe (দেখখ) ,dekhen (দেখখন) , dekhbo (দেখব) , dekhbi (দেখখব) , dekhbe (দেখখব) , dekhben (দেখখবন) , dekhchi(দেখখি) , dekhchis (দেখখিস) , dekhche(দেখখি) , dekhchen (দেখখিন) , dekhchilam (দেখখিলাম) , dekhchili (দেখখিখল) , dekhchilo(দেখখিল) , dekhchilen (দেখখিখলন) , dekhlam (দেখলাম) , dekhli (দেখখল) , dekhlo (দেখল) , dekhlen (দেখখলন) , dekhtis (দেখখিস) , dekhtam (দেখিাম) , dekhto (দেখি) , dekhten (দেখখিন) , dekhai (দেখাই) , dekhay (দেখায়) , dekhas (দেখাস) , dekhao (দেখাও) , dekhechi (দেখখখি) , dekhecho (দেখখি) , dekhechis (দেখখখিস) , dekhechen (দেখখখিন) , dekhtei (দেখখিই) , dekhar (দেখার) , dekhabo (দেখাব) , dekhaben (দেখাখবন) , dekhabi (দেখাখব) etc. Different suffixes that are added with root word to form a new word are chilen(খিখলন) , chilam (খিলাম) , chilis (খিখলস) , chilo (খিখলা) , chile (খিখল) , chili (খিখল) , chilo (খিল) , chen (দিন) , lam (লাম) , len (দলন) , tam (িাম) , tei (দিই) , tis (খিস) , ten (দিন) , ben (দবন) , chi (খি) , che (দি) , bi (খব) , be (দব) , te(দি) , le (দল) , li (খল) , lo (দলা) , to (দিা) etc.
Overview of Stemming of Bengali Verbs

Root Word | Inflected Verb Form | Stripped Word + Suffix | Suffix
দেখা      | দেখখিলাম            | দেখ + খিলাম            | খিলাম
করা       | করখল               | কর + খল               | খল
জানা      | জানখিস              | জান + খিস              | খিস
বলা       | বলিাম              | বল + িাম              | িাম
নাচা      | নাচখিখলন            | নাচ + খিখলন            | খিখলন

Table 1: Stemming of Bengali Verbs

We present background material in Section 2 and review existing work in Section 3; we then present the proposed stemming algorithm in Section 4, followed by its output, discussion and evaluation in Section 5. Finally, we conclude in Section 6 with a look at future research directions.
2. THEORETICAL STUDY

Natural language processing (NLP) is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input; others involve natural language generation.

The history of NLP generally starts in the 1950s, although work can be found from earlier periods. In 1950, Alan Turing published an article titled "Computing Machinery and Intelligence" which proposed what is now called the Turing test as a criterion of intelligence. Some notably successful NLP systems developed in the 1960s were SHRDLU, a natural language system working in restricted "blocks worlds" with restricted vocabularies, and ELIZA, a simulation of a Rogerian psychotherapist, written by Joseph Weizenbaum between 1964 and 1966. Using almost no information about human thought or emotion, ELIZA sometimes provided a startlingly human-like interaction. When the "patient" exceeded its very small knowledge base, ELIZA might provide a generic response, for example responding to "My head hurts" with "Why do you say your head hurts?".

Modern NLP algorithms are based on machine learning, especially statistical machine learning. The machine-learning paradigm is different from that of most prior attempts at language processing: it calls instead for using general learning algorithms, often (although not always) grounded in statistical inference, to automatically learn such rules through the analysis of large corpora of typical real-world examples. A corpus (plural "corpora") is a set of documents (or sometimes individual sentences) that have been hand-annotated with the correct values to be learned.
The following is a list of some of the most commonly researched tasks in NLP. What distinguishes these tasks from other potential and actual NLP tasks is not only the volume of research devoted to them but the fact that for each one there is typically a well-defined problem setting, a standard metric for evaluating the task, standard corpora on which the task can be evaluated, and competitions devoted to the specific task.

a. Automatic summarization
Produce a readable summary of a chunk of text. Often used to provide summaries of text of a known type, such as articles in the financial section of a newspaper.

b. Coreference resolution
Given a sentence or larger chunk of text, determine which words ("mentions") refer to the same objects ("entities"). Anaphora resolution is a specific example of this task, and is specifically concerned with matching up pronouns with the nouns or names that they refer to. The more general task of coreference resolution also includes identifying so-called "bridging relationships" involving referring expressions. For example, in a sentence such as "He entered John's house through the front door", "the front door" is a referring expression and the bridging relationship to be identified is the fact that the door being referred to is the front door of John's house (rather than of some other structure that might also be referred to).
c. Discourse analysis
This rubric includes a number of related tasks. One task is identifying the discourse structure of connected text, i.e. the nature of the discourse relationships between sentences (e.g. elaboration, explanation, contrast). Another possible task is recognizing and classifying the speech acts in a chunk of text (e.g. yes-no question, content question, statement, assertion, etc.).

d. Machine translation
Automatically translate text from one human language to another. This is one of the most difficult problems, and is a member of a class of problems colloquially termed "AI-complete", i.e. requiring all of the different types of knowledge that humans possess (grammar, semantics, facts about the real world, etc.) in order to solve properly.

e. Morphological segmentation
Separate words into individual morphemes and identify the class of the morphemes. The difficulty of this task depends greatly on the complexity of the morphology (i.e. the structure of words) of the language being considered. English has fairly simple morphology, especially inflectional morphology, and thus it is often possible to ignore this task entirely and simply model all possible forms of a word (e.g. "open, opens, opened, opening") as separate words. Such an approach is not possible in languages like Turkish, where each dictionary entry has thousands of possible word forms, nor in highly agglutinative Indian languages such as Manipuri [4].

f. Named entity recognition (NER)
Given a stream of text, determine which items in the text map to proper names, such as people or places, and what the type of each such name is (e.g. person, location, organization). Capitalization is of only limited help here: for example, the first word of a sentence is also capitalized, and named entities often span several words, only some of which are capitalized.

g. Natural language generation
Convert information from computer databases into readable human language.

h. Natural language understanding
Convert chunks of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate. Natural language understanding involves identifying the intended semantics from the multiple possible semantics which can be derived from a natural language expression, which usually takes the form of organized notations of natural language concepts. Introduction and creation of a language metamodel and ontology are efficient, however empirical, solutions. An explicit formalization of natural language semantics, without confusion with implicit assumptions such as the closed-world assumption (CWA) vs. the open-world assumption, or subjective Yes/No vs. objective True/False, is expected for the construction of a basis of semantics formalization.

i. Optical character recognition (OCR)
Given an image representing printed text, determine the corresponding text.
j. Part-of-speech tagging
Given a sentence, determine the part of speech for each word. Many words, especially common ones, can serve as multiple parts of speech. For example, "book" can be a noun ("the book on the table") or a verb ("to book a flight"); "set" can be a noun, verb or adjective; and "out" can be any of at least five different parts of speech. Some languages have more such ambiguity than others. Languages with little inflectional morphology, such as English, are particularly prone to such ambiguity.

k. Parsing
Determine the parse tree (grammatical analysis) of a given sentence. The grammar of natural languages is ambiguous, and typical sentences have multiple possible analyses. In fact, perhaps surprisingly, for a typical sentence there may be thousands of potential parses (most of which will seem completely nonsensical to a human).

l. Question answering
Given a human-language question, determine its answer. Typical questions have a specific right answer (such as "What is the capital of Canada?"), but sometimes open-ended questions are also considered (such as "What is the meaning of life?"). Recent work has looked at even more complex questions.

m. Relationship extraction
Given a chunk of text, identify the relationships among named entities (e.g. who is the wife of whom).

n. Sentence breaking (also known as sentence boundary disambiguation)
Given a chunk of text, find the sentence boundaries. Sentence boundaries are often marked by periods or other punctuation marks, but these same characters can serve other purposes (e.g. marking abbreviations).

o. Sentiment analysis
Extract subjective information, usually from a set of documents, often using online reviews to determine "polarity" about specific objects. It is especially useful for identifying trends of public opinion in social media, for the purpose of marketing.

p. Speech recognition
Given a sound clip of a person or people speaking, determine the textual representation of the speech. This is the opposite of text-to-speech and is one of the extremely difficult problems colloquially termed "AI-complete" (see above). In natural speech there are hardly any pauses between successive words, and thus speech segmentation is a necessary subtask of speech recognition (see below). Note also that in most spoken languages, the sounds representing successive letters blend into each other in a process termed coarticulation, so the conversion of the analog signal to discrete characters can be a very difficult process.

q. Speech segmentation
Given a sound clip of a person or people speaking, separate it into words. A subtask of speech recognition, and typically grouped with it.
r. Topic segmentation and recognition
Given a chunk of text, separate it into segments, each of which is devoted to a topic, and identify the topic of each segment.

s. Word segmentation
Separate a chunk of continuous text into separate words. For a language like English this is fairly trivial, since words are usually separated by spaces. However, some written languages like Chinese, Japanese and Thai do not mark word boundaries in such a fashion, and in those languages text segmentation is a significant task requiring knowledge of the vocabulary and morphology of words in the language.

t. Word sense disambiguation
Many words have more than one meaning; we have to select the meaning which makes the most sense in context. For this problem, we are typically given a list of words and associated word senses, e.g. from a dictionary or from an online resource such as WordNet.

In some cases, sets of related tasks are grouped into subfields of NLP that are often considered separately from NLP as a whole. Examples include:

- Information retrieval (IR): concerned with storing, searching and retrieving information. It is a separate field within computer science (closer to databases), but IR relies on some NLP methods (for example, stemming). Some current research and applications seek to bridge the gap between IR and NLP.

- Information extraction (IE): concerned in general with the extraction of semantic information from text. This covers tasks such as named entity recognition, coreference resolution, relationship extraction, etc.

- Speech processing: covers speech recognition, text-to-speech and related tasks.

Stemming is the term used in linguistic morphology and information retrieval to describe the process of reducing inflected (or sometimes derived) words to their word stem, base or root form, generally a written word form.
The stem need not be identical to the morphological root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root. A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root), while "argument" and "arguments" reduce to the stem "argument". The design of stemmers is language specific, and requires moderate to significant linguistic expertise in the language, as well as an understanding of the needs of a spelling checker for that language. A typical simple stemmer algorithm involves removing suffixes using a
list of frequent suffixes, while a more complex one would use morphological knowledge to derive a stem from the words. Words that are identified as having the same root form are grouped in a cluster, with the identified root word as the cluster centre. An inflectional suffix is a terminal affix that does not change the word class (part of speech) of the root during concatenation; it is added to maintain the syntactic environment of the root in Bangla. On the other hand, derivational suffixes change the word class (part of speech) and the orthographic form of the root word.

Experiments have been carried out with two types of algorithms: a simple suffix stripping algorithm and a score based stemming cluster identification algorithm. The suffix stripping algorithm simply checks whether a word carries any suffix (one or more) from a manually generated suffix list; the word is then assigned to the appropriate cluster, whose centre is the assumed root word, i.e., the form obtained after deleting the suffix from the surface form. The suffix stripping algorithm works well for the Noun, Adjective and Adverb categories. Words of other part-of-speech categories, especially Verbs, follow derivational morphology.

The score based stemming technique has been designed to resolve the stem for such inflected word forms. The technique uses the Minimum Edit Distance method, well known for spelling error detection, to measure the cost of classifying every word as belonging to a particular class. The score based technique considers two standard operations of Minimum Edit Distance, i.e., insertion and deletion. The consideration range of insertion and deletion for the present task is a maximum of three characters. The idea is that the present word matches an existing cluster centre after insertion and/or deletion of at most three characters, and the word is assigned to the cluster that can be reached with the minimum number of insertions and/or deletions.
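The score based clustering idea described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the function names and the example cluster centres are hypothetical, and the distance allows only the two operations the text names (insertion and deletion), with the stated three-character limit.

```python
# Sketch of an insertion/deletion-only edit distance and the
# cluster-assignment rule of the score based stemming technique.

def ins_del_distance(a, b):
    """Minimum number of single-character insertions and deletions
    needed to turn string a into string b (no substitutions)."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i              # delete all remaining characters of a
    for j in range(n + 1):
        d[0][j] = j              # insert all remaining characters of b
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = 1 + min(d[i - 1][j],   # deletion from a
                                  d[i][j - 1])   # insertion into a
    return d[m][n]

def nearest_cluster(word, centres, limit=3):
    """Assign word to the cluster centre reachable with the fewest
    insertions/deletions, subject to the three-character limit."""
    scored = [(ins_del_distance(word, c), c) for c in centres]
    cost, centre = min(scored)
    return centre if cost <= limit else None
```

With romanized stand-ins, `nearest_cluster("dekhlam", ["dekh", "kar"])` picks `"dekh"` (three deletions), while a word more than three operations away from every centre is left unassigned.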
This is an iterative clustering mechanism for assigning each word to a cluster. A separate list of verb inflections (only 50 entries; manually edited) has been maintained to validate the result of the score based technique.

Stemming algorithms can be broadly classified into two categories, namely Rule Based and Statistical.

2.1. Rule Based Approach

In a rule based approach, language specific rules are encoded and stemming is performed based on these rules. In this approach, various conditions are specified for converting a word to its derivational stem, a list of all valid stems is given, and there are also some exceptional rules which are used to handle the exceptional cases. For example, the word “absorption” is derived from the stem “absorpt” while “absorbing” is derived from the stem “absorb”. The problem of spelling exceptions arises in the above case when we try to match the two stems “absorpt” and “absorb”. Such exceptions are handled very carefully by introducing recoding and partial matching techniques in the stemmer as post-stemming procedures.
Advantages of the Rule Based Approach:
1. It is fast in nature, i.e. the computation time used to find a stem is low.
2. The retrieval results for English using a Rule Based Stemmer are very high.

The main disadvantage of a Rule Based Stemmer is that extensive language expertise is needed to build one.

2.2. Statistical Approach

Statistical stemming is an effective and popular approach in information retrieval. Some recent studies show that statistical stemmers are good alternatives to rule-based stemmers. Additionally, their advantage lies in the fact that they do not require language expertise; rather, they employ statistical information from a large corpus of a given language to learn the morphology of words. Yet Another Suffix Stripper (YASS) is one such statistics based, language independent stemmer. Its performance is comparable to that of Porter’s and Lovins’s stemmers, both in terms of average precision and the total number of relevant documents retrieved, and it addresses the challenge of retrieval from languages with poor resources. GRAS is a graph based, language independent stemming algorithm for information retrieval [19]. The following features make this algorithm attractive and useful: (1) retrieval effectiveness, (2) generality, that is, its language-independent nature, and (3) low computational cost.

Advantages of the Statistical Stemmer:
1. Statistical stemmers are useful for languages having scarce resources.
2. This approach yields the best retrieval results for suffixing languages, or languages which are morphologically more complex (like French, Portuguese, Hindi, Marathi, and Bengali), rather than English.

The disadvantage of the Statistical approach is that statistical stemmers are time consuming, because for these stemmers to work we need complete language coverage in terms of the morphology of words, their variants, etc.
3. RELATED WORK

Martin Porter developed the “Porter Stemmer”, a conflation stemmer, in 1980 at the University of Cambridge [5]. The Porter Stemmer uses the fact that English suffixes are mostly combinations of smaller and simpler suffixes. Porter designed a rule-based stemmer with five steps, each of which applies a set of rules.

Ramanathan and Rao (2003) proposed a lightweight stemmer for Hindi which uses a hand-crafted suffix list and performs longest match stripping. Light stemming refers to stripping a small set of prefixes and/or suffixes, without trying to deal with infixes or to recognize patterns and find roots. This lightweight stemmer for Hindi is based on the grammar of the Hindi language, in which a list of 65 suffixes in total is generated manually. Terms are conflated by stripping off word endings from the suffix list on a ‘longest match’ basis. Noun, adjective and verb inflections have been discussed, and based on these, 65 unique suffixes were collected. The major advantage of this approach is that it is computationally inexpensive. Documents were chosen from varied domains such as Films, Health, Business, Sports and Politics; the collection contained 35,977 unique words. The under-stemming and over-stemming errors calculated in this methodology were 4.68% and 13.84% respectively. No recall/precision-based evaluation of the work has been reported; thus the effectiveness of this stemming procedure is difficult to estimate.

Majumder et al. (2007) developed the statistical approach YASS: Yet Another Suffix Stripper, which uses a clustering based approach built on string distance measures and requires no linguistic knowledge. They concluded that stemming improves the recall of IR systems for Indian languages like Bengali. YASS is based on a string distance measure which is used to cluster a lexicon created from a text corpus into homogeneous groups.
Each group is expected to represent an equivalence class consisting of morphological variants of a single root word.

Dasgupta and Ng (2006) proposed unsupervised morphological parsing of Bengali. Unsupervised morphological analysis is the task of segmenting words into prefixes, suffixes and stems without prior knowledge of language-specific morphotactics and morphophonological rules. This parser is composed of two steps: (1) inducing prefixes, suffixes and roots from a vocabulary consisting of words taken from a large, unannotated corpus, and (2) segmenting a word based on these induced morphemes. When evaluated on a set of 4,110 human-segmented Bengali words, the algorithm achieves 83% accuracy.

Pandey and Siddiqui (2008) [17] proposed an unsupervised stemming algorithm for Hindi based on the approach of Goldsmith (2001) [69]. It is based on the split-all method. For unsupervised learning (training), words from Hindi documents in the EMILLE corpus have been extracted. These words are split to give n-gram (n = 1, 2, 3, …, l) suffixes, where l is the length of the word. Then suffix and stem probabilities are computed, and these probabilities are multiplied to give a split probability; the optimal segment corresponds to the maximum split probability. Some post-processing steps have been taken to refine the
learned suffixes. It was evaluated on 1000 words randomly extracted from the Hindi WordNet database; the training data was constructed from 106,403 words extracted from the EMILLE corpus. The observed accuracy is 89.9% after applying some heuristic measures, and the F-score is 94.96%. The algorithm does not require any language specific information.

Majgaonker and Siddiqui (2010) developed an unsupervised approach for a Marathi stemmer. Three different approaches (rule based, suffix stripping and statistical stripping) for suffix rule generation have been used in the unsupervised stemmer. The rule-based stemmer uses a set of manually extracted suffix stripping rules, whereas the unsupervised approach learns suffixes automatically from a set of words extracted from raw Marathi text. The performance of the stemmers has been compared on a test dataset consisting of 1500 manually stemmed words. The maximum accuracy observed is 82.5%, for the statistical suffix stripping approach.

Suba et al. (2011) proposed two stemmers for Gujarati: a lightweight inflectional stemmer based on a hybrid approach, and a heavyweight derivational stemmer based on a rule-based approach. The inflectional stemmer has an average accuracy of about 90.7%, which is considerable as far as IR is concerned. The boost in accuracy due to POS based stemming was 9.6%, and the inclusion of language characteristics boosted it by a further 12.7%. The derivational stemmer has an average accuracy of 70.7%, which can act as a good baseline and can be useful in tasks such as dictionary search or data compression. The limitations of the inflectional stemmer can be easily overcome if modules like a Named Entity Recognizer are integrated with the system.

In “A Light Weight Stemmer for Bengali and Its Use in Spelling Checker” by Md. Zahurul Islam, Md.
Nizam Uddin and Mumit Khan, from the Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh, a computationally inexpensive stemming algorithm for Bengali is presented which handles suffix removal in a domain independent way. First, the spelling checker checks the given word against a lexicon containing only root words; if the word is found, it is a valid word and the checking process terminates. If the word is not found in the lexicon, the stemming algorithm is applied. There are two possible scenarios: the stemming algorithm finds and returns a stem, or it cannot find a possible suffix. A probable stem list with suffixes is then obtained from the modified stemming method. The correction accuracy for single-error misspellings is 90.8%, and for multi-error misspellings 67%.

In 2012 an iterative stemmer for the Tamil language was proposed by Vivekanandan Ramachandran et al. In this model, a suffix stripper algorithm is used to stem Tamil words to their root words. Upendra Mishra and Chandra Prakash present a hybrid approach, a combination of the brute force and suffix removal approaches, which reduces the problems of over-stemming and under-stemming.
4. PROPOSED APPROACH

Our proposed algorithm is based on a lightweight stemmer for Bengali Verbs that strips the suffixes using a predefined suffix list, on a “longest match” basis, and then finds the root on the basis of some rules. For this purpose, the input file is first read and the inflected verb forms are fetched. The inflexion of each such inflected verb is then compared with the suffixes in the suffix list and removed, if any match is found. The subroot is then checked. If it ends with e-kar (‘খ ’), o-kar (‘ো া’), a-kar (‘দ ’) or aa-kar (‘ া’), that kar is replaced with aa-kar (‘ া’). If it starts with e-kar (‘খ ’), u-kar (‘ ’) or a-kar (‘দ ’), that kar is replaced with a-kar (‘দ ’), o-kar (‘ো া’) or aa-kar (‘ া’) respectively. The output doc file is generated by copying the contents of the input file and concatenating each word containing ‘/verb’ with its obtained root word. Finally, the generated output file is compared with the desired output file and the efficiency is calculated.

4.1. Overall Pictorial Representation

Reading Input Text → Selecting & tagging verbs → Fetching of tagged verbs → Module 1: Applying suffix stripping → Obtaining stripped part → Module 2: Applying rules → Generating Output File → Calculating Efficiency

Figure 1: Pictorial representation of the proposed approach
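The vowel-sign ("kar") rules described above can be sketched in code. This is a minimal, illustrative sketch only: the Unicode code points below are assumptions based on the standard Bengali block (the kar glyphs in this report are garbled), the report's "a-kar" sign is omitted because its identity is unclear from the source, and the starts-with rules would be handled analogously on the sign following the initial consonant.

```python
# Sketch of the Module 2 ending rules, assuming standard Unicode
# Bengali vowel signs (an assumption; the report's glyphs are garbled).

E_KAR = "\u09C7"    # ে
O_KAR = "\u09CB"    # ো
AA_KAR = "\u09BE"   # া

ENDING_KARS = {E_KAR, O_KAR, AA_KAR}

def apply_ending_rules(subroot):
    """Replace a trailing vowel sign with aa-kar, or append aa-kar
    to very short subroots, as in the rules described above."""
    if subroot and subroot[-1] in ENDING_KARS:
        return subroot[:-1] + AA_KAR
    if len(subroot) < 3:
        return subroot + AA_KAR
    return subroot
```

For example, a subroot ending in e-kar such as দেখে becomes দেখা, and a two-character subroot such as কর becomes করা, matching the root forms in Table 1.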
4.1.1. Explanation of Proposed Approach with Example

PROCESS: Reading Input Text
EXAMPLE: গাড় ছাড় ার কয়েক ্ত আয়গ একি অপ েকা জা তান ম য়ে যত ত ায় গাড় য়্ উয়ে ।

PROCESS: Selecting & tagging verbs
EXAMPLE: গাড় ছাড় ার/verb কয়েক ্ত আয়গ একি অপ েকা জা তান ম য়ে যত ত ায় গাড় য়্ উয়ে ল/verb ।

PROCESS: Fetching of tagged verbs
EXAMPLE: ছাড় ার/verb, ল/verb

PROCESS: Applying suffix stripping
EXAMPLE: ছাড় ার -> ছাড় + ার, ল -> + ল

PROCESS: Obtaining stripped part
EXAMPLE: ছাড়,

PROCESS: Applying rules
EXAMPLE: ছাড়া, া

PROCESS: Generating Output File
EXAMPLE: গাড় ছাড় ার/verb/ছাড়া কয়েক ্ত আয়গ একি অপ েকা জা তান ম য়ে যত ত ায় গাড় য়্ উয়ে ল/verb/ া ।

Table 2: Proposed Approach with example

4.1.2. Detailed Explanation of Module 1 (Suffix Stripping)

Module 1 proceeds as follows (Figure 2): read the suffix list and fetch suffixes from it one by one, checking whether the considered verb contains the current suffix. If it does, strip the suffix from the inflected verb and obtain the subroot (stripped verb); otherwise fetch the next suffix from the list, until all the suffixes have been tried.

Figure 2: Module 1 (Suffix Stripping)
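The longest-match suffix stripping of Module 1 can be sketched as follows. This is an illustrative sketch, not the project's actual code: the suffix list below uses a handful of romanized stand-ins for the 35 Bengali suffixes described in Section 4.1.8, and the two-character minimum mirrors the index >= 2 check of the algorithm in Section 4.2.

```python
# Sketch of Module 1: longest-match suffix stripping against a
# predefined suffix list (romanized stand-ins for the real suffixes).

SUFFIXES = ["chilam", "chilen", "chilo", "lam", "len", "chi", "li", "be"]

def strip_suffix(word, suffixes=SUFFIXES):
    """Return (subroot, suffix), trying longer suffixes first and
    keeping at least two characters of the subroot."""
    for suf in sorted(suffixes, key=len, reverse=True):
        idx = word.rfind(suf)
        # suffix must sit at the end of the word, leaving a subroot
        # of at least two characters
        if idx >= 2 and idx + len(suf) == len(word):
            return word[:idx], suf
    return word, ""
```

For example, `strip_suffix("dekhchilam")` yields `("dekh", "chilam")` rather than stopping at the shorter suffix `"lam"`, which is the point of the longest-match order.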
4.1.3. Detailed Explanation of Module 2 (Applying Rules)

Module 2 proceeds as follows (Figure 3): read the stripped verb (subroot) and check its ending. If the subroot ends with e-kar (‘খ ’), o-kar (‘ো া’), a-kar (‘দ ’) or aa-kar (‘ া’), replace the ending kar with aa-kar (‘ া’); otherwise, if the length of the subroot is less than 3, concatenate it with aa-kar (‘ া’). Then check its beginning: if the subroot starts with e-kar (‘খ ’), replace the starting kar with a-kar (‘দ ’); if it starts with u-kar (‘ ’), replace the starting kar with o-kar (‘ো া’); if it starts with a-kar (‘দ ’), replace the starting kar with aa-kar (‘ া’). The root verb is thus obtained.

Figure 3: Module 2 (Applying Rules)

4.1.4. Sentence Collection

The Technology Development for Indian Languages (TDIL) Programme, initiated by the Department of Electronics & Information Technology (DeitY), Ministry of Communication & Information Technology (MC&IT), Govt. of India, has the objective of developing Information Processing Tools and Techniques to facilitate human-machine interaction without language barrier; creating and accessing multilingual knowledge resources; and integrating them to develop innovative user products and services.
The Programme also promotes Language Technology standardization through active participation in international and national standardization bodies such as ISO, UNICODE, the World Wide Web Consortium (W3C) and BIS (Bureau of Indian Standards), to ensure adequate representation of Indian languages in existing and future language technology standards.

The input sentences are collected from 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, while the information about the different inflexions of a particular verb is collected from a Bengali dictionary. We have selected 14 Bengali Verbs and presented a sentence for each inflexion of a particular verb. Accordingly, we have applied our algorithm over 638 sentences.

4.1.5. Normalization

The Bengali text corpus developed in the TDIL project of the Govt. of India separates words by ‘|’, whereas we separate words by spaces ‘ ’. Moreover, the end of each sentence is marked by ‘ | ’, and any other punctuation mark, e.g. question mark ‘?’, comma ‘,’, exclamation mark ‘!’, etc., is replaced by ‘ | ’.

Figure 4: Screen Shot of Un-normalized Document
Figure 5: Screen Shot of Normalized Document
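The normalization step above can be sketched as a small text filter. This is a rough reconstruction from the description in Section 4.1.5 (the exact set of punctuation marks handled by the project is an assumption); the example uses ASCII stand-ins for the Bengali text.

```python
# Sketch of the normalization step: replace sentence-internal
# punctuation with the ' | ' marker and collapse extra whitespace.

import re

def normalize(line):
    line = re.sub(r"[?,!]", " | ", line)   # unify punctuation as '|'
    line = re.sub(r"\s+", " ", line)       # collapse whitespace runs
    return line.strip()
```

For instance, `normalize("ke tumi? ami jani, bolo!")` yields `"ke tumi | ami jani | bolo |"`, so every clause boundary is marked the same way before tagging.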
4.1.6. Tagging of Verbs

In every sentence, the inflected word whose root is to be found is tagged with ‘/verb’.

Figure 6: Screen Shot of verb-tagged Document

4.1.7. Preparing Output File

An output file is prepared whereby the inflected word of every sentence whose root is to be found is tagged with ‘/verb/’ concatenated with the actual root word. This file is prepared in order to calculate the efficiency of our proposed algorithm.

Figure 7: Screen Shot of desired output Document

4.1.8. Preparing Suffix List

After surveying the inflexions of various Bengali Verbs from the 50 different categories of the Bengali text corpus developed in the TDIL project of the Govt. of India, we have prepared a suffix list by selecting the 35 most frequently occurring suffixes.
4.1.9. Verification

The generated output file is compared with the prepared output file, and thereby the efficiency of the algorithm is calculated.

4.2. Algorithm

STEP 1. Start of algorithm.
STEP 2. Create three new string arrays, namely splits1[], splits2[] and splits3[].
STEP 3. Read the contents of the doc files and split the words by the space (‘ ’) separator.
    3.1. Store the words of each sentence in splits1[].
    3.2. Store the inflexions in splits2[].
    3.3. Store the desired root words in splits3[].
STEP 4. Declare and initialize variables l1 = length of splits1[], l2 = length of splits2[].
STEP 5. Fetch the inflected verb forms into input1[] from splits1[i] if ‘/verb’ is contained in the currently fetched word. This step is repeated l1 times.
    5.1. Determine the subroot from input1[i] by repeating the following steps l2 times.
        5.1.1. If splits2[j] is contained in input1[i] then,
            5.1.1.a. Declare a variable index which stores the index of the last occurrence of splits2[j] in input1[i].
            5.1.1.b. If index is greater than or equal to 2 then,
                5.1.1.b.i. Store the substring of input1[i] from begindex = 0 to endindex = index in input1[i].
                5.1.1.b.ii. Break the loop.
    5.2. Determine the actual root input1[i] by repeating the following steps l1 times.
        5.2.1. Check the ending kar of input1[i].
            5.2.1.a. If input1[i] ends with e-kar (‘খ ’), o-kar (‘ো া’), a-kar (‘দ ’) or aa-kar (‘ া’), then replace it with aa-kar (‘ া’).
            5.2.1.b. If the length of input1[i] is less than 3, concatenate it with aa-kar (‘ া’).
        5.2.2. Check the starting kar of input1[i].
            5.2.2.a. If input1[i] starts with e-kar (‘খ ’), then replace it with a-kar (‘দ ’).
            5.2.2.b. If input1[i] starts with u-kar (‘ ’), then replace it with o-kar (‘ো া’).
            5.2.2.c. If input1[i] starts with a-kar (‘দ ’), then replace it with aa-kar (‘ া’).
STEP 6. Generate the output doc file by copying the contents of splits1[] and concatenating each word containing ‘/verb’ with its obtained root word from input1[].
STEP 7. Compare the obtained sentences in splits1[] with the desired sentences in splits3[] and calculate the efficiency.
STEP 8. End of algorithm.
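Steps 5 and 6 of the algorithm, taken together, amount to the following sketch. It is illustrative only: the suffixes are romanized stand-ins, the helper names are hypothetical, and the Module 2 vowel-sign rules are elided so that the output tagging format (‘/verb/’ followed by the obtained root) stands out.

```python
# End-to-end sketch of Steps 5-6: find '/verb'-tagged words, strip
# the longest matching suffix, and append the obtained root after a
# second '/'.

SUFFIXES = ["chilam", "lam", "li"]

def stem_word(word, suffixes=SUFFIXES):
    """Strip the longest matching suffix, keeping >= 2 characters."""
    for suf in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suf) and len(word) - len(suf) >= 2:
            return word[: -len(suf)]
    return word

def tag_roots(sentence, suffixes=SUFFIXES):
    """Rewrite every token tagged '/verb' as 'surface/verb/root'."""
    out = []
    for token in sentence.split(" "):
        if token.endswith("/verb"):
            surface = token[: -len("/verb")]
            token = token + "/" + stem_word(surface, suffixes)
        out.append(token)
    return " ".join(out)
```

So a normalized sentence such as `"ami boi porchilam/verb |"` becomes `"ami boi porchilam/verb/por |"`, matching the output-file format of Section 4.1.7.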
5. OUTPUT AND DISCUSSION:

5.1. Partial View of Input File:

Figure 8: Partial view of Input File

5.2. Suffix List:

Figure 9: Screenshot of Suffix List
5.3. Partial View of Output File:

Figure 10: Partial view of Output File
5.4. EFFICIENCY:

Dealing with 500 sentences, our proposed approach gives an efficiency of 99.4%.

Figure 11: Screenshot of Efficiency of proposed approach

5.5. TIME COMPLEXITY:

The time complexity of the proposed algorithm is:

WORST CASE: O(n²)
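The 99.4% figure comes from the word-level comparison of STEP 7: each token of the generated output file is compared, case-insensitively, with the corresponding token of the hand-prepared output file. A minimal sketch of that calculation, using hypothetical placeholder tokens:

```java
public class EfficiencySketch {

    // Percentage of generated tokens that match the gold-standard tokens,
    // compared case-insensitively as in the project code (STEP 7).
    static double efficiency(String[] obtained, String[] desired) {
        int count = 0;
        for (int i = 0; i < obtained.length; i++) {
            if (obtained[i].equalsIgnoreCase(desired[i])) {
                count++;
            }
        }
        return (double) count * 100 / obtained.length;
    }

    public static void main(String[] args) {
        // Placeholder tokens: 4 of 5 match, so the efficiency is 80.0%.
        String[] desired  = { "t1", "t2", "t3", "t4", "t5" };
        String[] obtained = { "t1", "t2", "t3", "t4", "x5" };
        System.out.println("Efficiency = " + efficiency(obtained, desired)); // Efficiency = 80.0
    }
}
```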
6. CONCLUSION AND FUTURE WORK:

Stemming plays a vital role in information retrieval systems, and its effect is considerable. In this project, we present a lightweight stemmer for 14 selected Bengali verbs that strips suffixes using a predefined suffix list on a "longest match" basis, and then finds the root on the basis of some rules. Except for a few cases, the result obtained from our algorithm is quite satisfactory and meets our expectation. We argue that a larger, better-populated learning set would invariably yield better results.

In future, we plan to test our algorithm with more sets of Bengali verbs. As research on the Bengali language is far less advanced than that on languages like English and Hindi, many dimensions remain untouched. Using relevant new approaches, a better Bengali stemmer can be developed, which will be useful for further linguistic computing.
ACKNOWLEDGEMENT

It gives us great pleasure to find an opportunity to express our deep and sincere gratitude to our project guide, Mr. Alok Ranjan Pal. We very respectfully recollect his constant encouragement, kind attention and keen interest throughout the course of our work. We are highly indebted to him for the way he modelled and structured our work and for the valuable tips and suggestions that he accorded to us in every respect of our work.

We are extremely grateful to the Department of Computer Science & Engineering, CEMK, for extending all the facilities of our department. We humbly extend our sense of gratitude to the other faculty members, laboratory staff, library staff and administration of this Institute for providing us their valuable help and time within a congenial working environment.

Last but not the least, we would like to convey our heartiest thanks to all our classmates who have from time to time helped us with their valuable suggestions during our project work.

Date: 23.05.2015

Anasuya Paul
University Roll: 10700111006
University Registration No: 111070110006

Joyeeta Bagchi
University Roll: 10700111021
University Registration No: 111070110021

Koushik Dutta
University Roll: 10700111024
University Registration No: 111070110024

Sneha Sarkar
University Roll: 10700111049
University Registration No: 111070110049
References:

1. A. Ramanathan and D. D. Rao, "A Lightweight Stemmer for Hindi", Workshop on Computational Linguistics for South-Asian Languages, EACL, 2003.
2. M. Z. Islam, M. N. Uddin and M. Khan, "A Light Weight Stemmer for Bengali and its Use in Spelling Checker", Proc. 1st Intl. Conf. on Digital Communications and Computer Applications (DCCA07), Irbid, Jordan, March 19-23, 2007.
3. P. Majumder, M. Mitra, S. K. Parui, G. Kole, P. Mitra and K. Datta, "YASS: Yet Another Suffix Stripper", ACM Transactions on Information Systems, 25(4):18-38, 2007.
4. S. Dasgupta and V. Ng, "Unsupervised Morphological Parsing of Bengali", Language Resources and Evaluation, 40(3-4):311-330, 2006.
5. A. K. Pandey and T. J. Siddiqui, "An Unsupervised Hindi Stemmer with Heuristic Improvements", in Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data, 303:99-105, 2008.
6. M. M. Majgaonker and T. J. Siddiqui, "Discovering Suffixes: A Case Study for Marathi Language", International Journal on Computer Science and Engineering, Vol. 02, No. 08, pp. 2716-2720, 2010.
7. K. Suba, D. Jiandani and P. Bhattacharyya, "Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati", in Proceedings of the 2nd Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), IJCNLP 2011, Chiang Mai, Thailand, pp. 1-8, 2011.
8. M. F. Porter, "An Algorithm for Suffix Stripping", Program, 14(3):130-137, 1980.
9. P. Kundu and B. B. Chaudhuri, "Error Pattern in Bengali Text", International Journal of Dravidian Linguistics, 28(2), 1999.
10. B. B. Chaudhuri, "Reversed word dictionary and phonetically similar word grouping based spell-checker to Bengali text", in Proceedings of the LESAL Workshop, 2001.
12. S. Sarkar and S. Bandyopadhyay, "Study on Rule-Based Stemming Patterns and Issues in a Bengali Short Story-Based Corpus", in ICON 2009.
13. S. Dasgupta and M. Khan, "Morphological Parsing of Bangla Words Using PCKIMMO", in ICCIT 2004.
14. R. Barzilay and M. Elhadad, "Using Lexical Chains for Text Summarization", in Proceedings of the Workshop on Intelligent Scalable Text Summarization, Madrid, Spain, 1997.
15. P. Patel and K. Popat, "Hybrid Stemmer for Gujarati", in Proceedings of the 1st Workshop on South and Southeast Asian Natural Language Processing (WSSANLP), pp. 51-55, 23rd International Conference on Computational Linguistics (COLING), Beijing, August 2010.
16. U. Mishra and C. Prakash, "MAULIK: An Effective Stemmer for Hindi Language", International Journal on Computer Science and Engineering (IJCSE); A. M. Goweder, H. A. Alhammi, T. Rashed and A. Musrat, "A Hybrid Method for Stemming Arabic Text".
17. K. Suba, D. Jiandani and P. Bhattacharyya, "Hybrid Inflectional Stemmer and Rule-based Derivational Stemmer for Gujarati".
18. H. Harmanani, W. Keirouz and S. Raheel, "A Rule-Based Extensible Stemmer for Information Retrieval with Application to Arabic", The International Arab Journal of Information Technology, Vol. 3, July 2006.
19. N. Saharia, U. Sharma and J. Kalita, "Analysis and Evaluation of Stemming Algorithms: A Case Study with Assamese", ICACCI '12, August 3-5, 2012, Chennai, Tamil Nadu, India.
20. N. Kanuparthi, A. Inumella and D. M. Sharma, "Hindi Derivational Morphological Analyzer", in Proceedings of the Twelfth Meeting of the Special Interest Group on Computational Morphology and Phonology (SIGMORPHON 2012), pp. 10-16, Montréal, Canada, June 7, 2012, Association for Computational Linguistics.
21. J. Ameta, N. Joshi and I. Mathur, "A Lightweight Stemmer for Gujarati".
22. M. Ababneh, R. Al-Shalabi, G. Kanaan and A. AlNobani, "Building an Effective Rule-Based Light Stemmer for Arabic Language to Improve Search Effectiveness", The International Arab Journal of Information Technology, Vol. 9, No. 4, July 2012.
23. A. G. Jivani, "A Comparative Study of Stemming Algorithms", Int. J. Comp. Tech. Appl., Vol. 2(6), pp. 1930-1938.
24. M. F. Porter, "An Algorithm for Suffix Stripping", Program, 14(3):130-137, 1980.
25. V. M. Orengo and C. Huyck, "A Stemming Algorithm for the Portuguese Language", in Proceedings of the Eighth International Symposium on String Processing and Information Retrieval, pp. 186-193, 2001.
26. D. Sharma, "Stemming Algorithms: A Comparative Study and their Analysis", International Journal of Applied Information Systems (IJAIS), ISSN: 2249-0868, Foundation of Computer Science FCS, New York, USA, Vol. 4, No. 3, September 2012.
27. J. B. Lovins, "Development of a Stemming Algorithm", Mechanical Translation and Computational Linguistics, 11(1-2):22-31, 1968.
Appendix:

1. Program Code:

package stemming_verb;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import java.io.*;

public class Stemming_verb {

    public static void main(String[] args) {
        File file1 = null, file2 = null, file4 = null;
        WordExtractor extractor1 = null, extractor2 = null, extractor4 = null;
        try {
            /*------------------ Reading sentences ------------------*/
            file1 = new File("G:\\Stemming\\final_project\\sentence_input.doc");
            FileInputStream fis1 = new FileInputStream(file1.getAbsolutePath());
            HWPFDocument document1 = new HWPFDocument(fis1);
            extractor1 = new WordExtractor(document1);
            String fileData1 = extractor1.getText();
            String[] splits1 = fileData1.split(" ");
            String[] input1 = new String[splits1.length];
            int l1 = splits1.length;

            /*------------------ Reading inflexions ------------------*/
            file2 = new File("G:\\Stemming\\final_project\\suffixes.doc");
            FileInputStream fis2 = new FileInputStream(file2.getAbsolutePath());
            HWPFDocument document2 = new HWPFDocument(fis2);
            extractor2 = new WordExtractor(document2);
            String fileData2 = extractor2.getText();
            String[] splits2 = fileData2.split("।"); // Bengali full stop as separator
            int l2 = splits2.length;

            /*------------------ Reading desired output file ------------------*/
            file4 = new File("G:\\Stemming\\final_project\\sentence_output.doc");
            FileInputStream fis4 = new FileInputStream(file4.getAbsolutePath());
            HWPFDocument document4 = new HWPFDocument(fis4);
            extractor4 = new WordExtractor(document4);
            String fileData4 = extractor4.getText();
            String[] splits4 = fileData4.split(" ");
            int l4 = splits4.length;

            /*------------------ Suffix stripping ------------------*/
            int verb = 0;
            for (int i = 0; i < l1; i++) {
                if (splits1[i].contains("/verb")) {
                    verb++;
                    int index1 = splits1[i].lastIndexOf("/verb");
                    input1[i] = splits1[i].substring(0, index1);
                    for (int j = 0; j < l2; j++) {
                        if (input1[i].contains(splits2[j])) {
                            int index = input1[i].lastIndexOf(splits2[j]);
                            if (index >= 2) {
                                input1[i] = input1[i].substring(0, index);
                                break;
                            }
                        }
                    }
                }
            }

            /*------------------ Applying rules ------------------*/
            for (int i = 0; i < l1; i++) {
                if (input1[i] != null) {
                    if (input1[i].endsWith("খ ")) {          // ends with e-kar
                        input1[i] = input1[i].replace("খ ", " া");
                    } else if (input1[i].endsWith("দ া")) {  // ends with o-kar
                        input1[i] = input1[i].replace("দ া", " া");
                    } else if (input1[i].endsWith("দ ")) {   // ends with a-kar
                        input1[i] = input1[i].replace("দ ", " া");
                    } else if (input1[i].endsWith(" া")) {   // ends with aa-kar
                        input1[i] = input1[i].replace(" া", " া");
                    } else if (input1[i].length() <= 3) {
                        input1[i] = input1[i].concat(" া");
                    }
                    String word4 = input1[i];
                    if (word4.startsWith("খ ", 1)) {         // replace e-kar by a-kar
                        input1[i] = input1[i].replace("খ ", "দ ");
                    } else if (word4.startsWith(" ", 1)) {    // replace u-kar by o-kar
                        input1[i] = input1[i].replace(" ", "দ া");
                    } else if (word4.startsWith("দ ", 1)) {  // replace a-kar by aa-kar
                        input1[i] = input1[i].replace("দ ", " া");
                    }
                }
            }

            /*------------------ Writing obtained root words to doc file ------------------*/
            boolean append75 = true;
            FileWriter write75 = new FileWriter("G:\\Stemming\\final_project\\output_new.doc", append75);
            PrintWriter print_line75 = new PrintWriter(write75);
            String word1, word2;
            for (int i = 0; i < l1; i++) {
                word1 = splits1[i];
                word2 = input1[i];
                if (splits1[i].contains("/verb")) {
                    splits1[i] = splits1[i].concat("/").concat(input1[i]);
                    print_line75.printf(word1 + "/" + word2 + " ");
                } else {
                    print_line75.printf(word1 + " ");
                }
            }
            print_line75.close();

            /*------------------ Calculating efficiency ------------------*/
            int count = 0;
            double eff;
            for (int i = 0; i < l1; i++) {
                if (splits1[i].equalsIgnoreCase(splits4[i])) {
                    count++;
                }
            }
            eff = (double) count * 100 / l1;
            System.out.println("No of inflected verb forms = " + verb);
            System.out.println("Efficiency = " + eff);
        } catch (Exception e) {
            System.out.println(e);
        }
    }
}