SlideShare ist ein Scribd-Unternehmen logo
1 von 10
Downloaden Sie, um offline zu lesen
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
DOI: 10.5121/ijnlc.2017.6102 13
ANALYSIS OF MWES IN HINDI TEXT USING NLTK
Rakhi Joon and Archana Singhal
Department of Computer Science, University of Delhi, New Delhi, India
ABSTRACT
Natural Language Toolkit (NLTK) is a generic platform to process the data of various natural (human)
languages and it provides various resources for Indian languages also like Hindi, Bangla, Marathi and so
on. In the proposed work, the repositories provided by NLTK are used to carry out the processing of Hindi
text and then further for analysis of Multi word Expressions (MWEs). MWEs are lexical items that can be
decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical
idiomaticity. The main focus of this paper is on processing and analysis of MWEs for Hindi text. The
corpus used for Hindi text processing is taken from the famous Hindi novel “KaramaBhumi by Munshi
PremChand”. The result analysis is done using the Hindi corpus provided by Resource Centre for Indian
Language Technology Solutions (CFILT). Results are analysed to justify the accuracy of the proposed
work.
KEYWORDS
NLTK, Hindi, Multi Words, MWEs, Analysis of MWEs, Hindi Text.
1. INTRODUCTION
Hindi is most widely used language spoken as well as used for official work in India and other
countries. It is originated from Sanskrit language and thus has a rich set of grammar and
literature. The use of Hindi language has gained much popularity in NLP research. There are
many resources developed for Hindi and some are under consideration. The NLTK framework is
used for processing natural languages and providing comprehensive support for various NLP
related tasks [8]. It is also used as a tool for carrying out the proposed work in which, the Hindi
text is processed to generate the frequency graphs for various words and that can be further used
for MWEs frequency measures.
MWEs comprises of two or more words (2-grams, 3-grams,…, n-grams) that convey some
meaning as a whole, but independently the words convey some other meaning. The meaning of
MWEs cannot be directly given by its components words. Some examples of MWE in Hindi are
पि मी स यता(paschami sabhyata), बाल िववाह (baal Vivah), and so on. There are various views from
different authors regarding MWEs :
“A unit whose exact meaning cannot be derived directly from the meaning of its parts” [12]
“Arbitrary and recurrent word combination” [11]
“Idiosyncratic concepts that cross word boundaries” [10]
As per the linguistic properties, “MWE are lexical items that can be decomposed into multiple
lexemes and display lexical, syntactic, semantic, pragmatic and statistical idiomaticity” [5].
All the above definitions follow the property of “Idiomaticity” for MWEs. Idiomaticity refers to
deviation from the basic properties of the words and applied at lexical, syntactic, semantic,
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
14
pragmatic and statistical level [2]. MWEs are the phrases that cannot be entirely predicted on the
basis of standard grammar rules and lexical entries (http://mwe.stanford.edu/reading-group.html).
The processing of text in any language depends on the grammatical constructs as well as the
linguistic and syntactic properties of that particular language. The representation of a particular
construct in any language is affected by the POS taggers. In Hindi, the extraction process of
MWEs begins with the processing of text, finding n-grams, POS tagging and then applying the
algorithms and procedures to carry out the required task. In the proposed task, the list of n-grams
is extracted which can be further used for MWEs extraction. MWEs extraction in English
language is very common task and a lot of work has been done, but MWEs have not gained much
popularity in Hindi language. In the proposed work an attempt has been made to process the
Hindi text and to analyse the extracted types of MWEs present in Hindi text [2,3] on the basis of
their properties and usage. The further analysis is done for accuracy. The corpus used for the
processing is taken from one of the famous Hindi novel “KaramaBhumi by Munshi PremChand”
and for result analysis, the Hindi corpus is obtained from CFILT.
The organization of the paper is done in the following manner: The next section gives a brief
review of related work. In third section, Dataset used is explained with brief introduction about
the corpus. In the next section, the description of Hindi Text processing using NLTK is given,
how the various operations are performed on text and so on. In section 5, various types of Hindi
MWEs and how to process the text for MWEs extraction in Hindi are proposed by the authors
with detailed analysis of previous results. It is followed by the Conclusion and Reference
sections.
2. RELATED WORK
The basic understanding of NLTK and its specifications are given by many authors, in [7], the
basics of python programming language for NLP are given which are used for carrying out the
task of text processing in NLTK. Python is a particularly convenient language to use for writing
scripts to perform natural language processing tasks [9]. The simplified version of NLTK toolkit
and its efficient usage is provided in [8]. Various processing tasks are discussed in the paper. The
more simplified version of toolkit, NLTK-Lite is discussed in [9], which provides ready access to
standard corpora, along with representations for common linguistic data structures, reference
implementations for many NLP tasks, and extensive documentation including tutorials and library
reference. More detailed knowledge and syntax related information is given in [13-15]. In [1] an
empirical study of integrating n-grams and multi-word terms into topic models is presented by the
authors while maintaining similarities between them and words based on their component
structure with the help of LDA-ITER algorithm by which the most suitable n-grams and multi-
word terms are incorporated. In [4], the author thoroughly examined various types of MWE
encountered in Hindi from machine translation viewpoint. These specific types were not given
proper attention in research for e.g. ‘vaalaa’ construct, doublets, replication etc. Many of the
types are frequently used in daily life but are not given proper place in formal textual corpus. The
author presented a stepwise mining of various MWEs in Hindi and their machine translation
perspectives. The lexical and syntactic configuration of MWEs is discussed by the authors in [5],
i.e. sometimes the component words preserve original semantics but sometimes MWEs encodes
extra semantics. Further base types of MWEs and the linguistic properties of MWEs are
explained. The basic methodology for MWEs is presented in [10] by the author. In this paper the
difficulty of extracting Multi Words from a particular document along with types and properties
of MWEs are discussed. Collocations are the combination of random and repeated words, as
described in [11]. A number of techniques based on statistical methods for extracting collocations
from large textual corpora are presented by the author. These techniques were based on some
original filtering methods that give higher-precision output. A lexicographic tool, Xtract is used
for the implementation of proposed techniques. In [6], the authors tried to find out whether MWE
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
15
identification methods can be efficiently applied to different types of MWEs and various
languages. The authors further investigated the hypothesis that MWEs can be detected
independently by the unique statistical properties of their component words, regardless of their
type. The performance is measured and compared by using three statistical measures: Mutual
Information, χ2 and Permutation Entropy. In the proposed work the main focus is on Hindi text
processing and MEWs extraction analysis. Earlier, the identification and Extraction of Hindi
MWEs are done by the authors [2]. A new classification of Hindi MWEs is given which include
MWEs types: Acronyms and Abbreviations, Compound Adverb, Complex predicates, foreign
words and terms, Idioms, morphemes, Named entities, proverbs, replicating words and Standard
phrases and Expressions. Some existing types are also included along with new types and
ontology concept to have a better classification representation. A brief introduction of Hindi
adverbs is presented in [3] and compound adverbs which act as MWEs are identified in the Hindi
text. Further the classification of the compound adverbs is proposed on the basis of types of
adverbs.
3. DATASET
A corpus is the collection of text for a particular subject. In any text processing task corpus plays
a vital role as all the operations are performed on the text and the training and test data
categorization is done on the data. In the proposed work, the experiments are performed on the
Hindi Dataset taken from one of the famous novel by Hindi writer, “Munshi Premchand”: “कम भूमी
(Karama Bhumi)”. This novel is a political novel in which political problems are presented
through some families and thus the collection of text entitles a particular subject as well as the
identity. The dataset contains 128842 Hindi words with 14081 unique words. As the total data in
the book is divided into five sections. Last four sections together (containing 84637 words and
11009 unique words) are taken for training purpose and the first section (containing 44205 words
and 7301 unique words) is taken for testing purpose. The Text processing as well as MWEs
extraction tasks are applied on the uniformly distributed data, like here in a particular ratio of text
is used as training set and rest as text set.
The data used for result analysis is obtained from the Hindi corpus provided by CFILT, IIT
Bombay (http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar). The file can’t be directly used for the
required process so it is processed further to obtain a refined form of text. After obtaining the
refined text file from the repository, the Hindi POS tagger is used to get the text file containing
the POS tagged data [3]. That POS tagged file is further processed to get the final list of Hindi
MWEs.
4. HINDI TEXT PROCESSING WITH NLTK
The Hindi text is processed in Unicode format as NLTK supports this format. There are several
repositories provided by NLTK for Indian Languages via “Indian Language POS-Tagged Corpus
Reader”. The following sections discusses about the various text processing tasks.
4.1. Corpus Exploration
Several corpora are provided by NLTK in many languages, for Indian languages also Indian
Language POS-Tagged Corpus is there on which experiments can be performed. On the other
hand, a new corpus can be added into Python and can be processed using NLTK as done in the
proposed work. Use the NLTK corpus module to read the corpus “kb_complete.txt”, included in
the Python directory, and various operations are performed on this text file to get the complete
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
16
understanding of how to process the Hindi text using NLTK. The sorted set as well as unique
words are extracted out from the corpus.
Figure 1. Frequency Distribution of three common words in complete dataset
Figure 2. Frequency Distribution of three common words in the Training dataset
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
17
Figure 3. Frequency Distribution of three common words in the Test dataset
The most commonly used words are obtained from the corpus, here most common 50 words are
discovered and from those 3 words ("म", "वह","रहा") are selected for finding and checking out the
frequency distribution for those words in the corpus for training as well as test data. Then various
dispersion plots of these three words are plotted with the training and test data as shown in the
above mentioned figures.
4.2. Operations done on text
There are a number of ways in which one can use the inbuilt NLTK supported python functions
to process the text in a particular language. Since for Multiword, the individual words are
combined together to form a new one, so Bi-grams, Tri-grams and n-grams are needed to be
extracted from the text and thus can be further filtered out to extract multiword. Like here from
the text the bi-grams and tri-grams are obtained:
>>> list(bigrams(words))
[('हमारे', ' कूल '), (' कूल ', 'और'), ('और', 'कॉलेज '), ('कॉलेज ', 'म'), ('म', 'िजस'), ('िजस', 'त परता'), ('त परता', 'से'), ('से',
'फ स'), ('फ स', 'वसूल'), ('वसूल', 'क '), ('क ', 'जाती'), ('जाती', 'है,'), ('है,', 'शायद'), ('शायद', 'मालगुजारी'), …]
>>> list(trigrams(words))
[('हमारे', ' कूल ', 'और'), (' कूल ', 'और', 'कॉलेज '), ('और', 'कॉलेज ', 'म'), ('कॉलेज ', 'म', 'िजस'), ('म', 'िजस', 'त परता'),
('िजस', 'त परता', 'से'), ('त परता', 'से', 'फ स'), ('से', 'फ स', 'वसूल'), ('फ स', 'वसूल', 'क '), ('वसूल', 'क ', 'जाती'), ('क ', 'जाती',
'है,')…]
Similarly n-grams can be obtained from the text. In MWEs extraction the initial step is to group
the words and that is initiated using the n-grams operation on the text. Then further steps are
followed to get the final MWEs list.
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
18
4.3. N-Grams as MWEs
A Multi word can be 2-grams, 3-grams or extended up to any number of words depending on the
type as well as the property. With the help of NLTK, one can easily extract n-grams from the text
as explained in the above section. The identification of MWEs from the n-grams is an important
task because the first step is to extract the n-grams list that can be further processed to identify
and extract MWEs. In figure 4 some n-grams extracted from the Dataset (“कम भूमी (Karama
Bhumi)”) are shown which can act as MWEs.
Figure 4. Pictorial representation of n-grams (हवाई जहाज, &दयहीन द(तरी शासन, ल(ज क झंकार का नाम गजल) as
MWEs in Hindi
5. HINDI MWES CLASSIFICATION AND ANALYSIS
5.1. MWEs in Hindi and Significance
In any language, a sentence itself exhibits various properties and predicates like noun, pronoun,
verb, adjective, adverb and so on. For the formation and extraction of MWEs from a corpus, first
it is detected whether the combination of words which is to be categorized as MWEs exhibits the
necessary and sufficient conditions of MWE which mainly include that words have to be
separated by space or delimiter, non-compositionality of meaning and idiomaticity and many
more [2]. The existing and new types of Hindi MWEs which were defined by the authors are
listed in the below mentioned table.
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
19
Table 1. Types of Hindi MWEs with examples [2].
S.No. Type Example
1. Acronyms and Abbreviations ‘ए पी जे अ-दुल कलाम ’(A. P. J. Abdul Kalam)
2. Complex Predicates वापस लेना (vaapas lena, return back)
3. Foreign Words and Terms सीिनयर डॉ/टर (Senior Doctor)
4. Idioms ऊँ गली उठाना: आलोचना करना (to lift finger, to blame)
5. Morphemes पीने वाला पानी (pine vaalaa pani, drinking water)
6. Replicating words अभी अभी (now now, recently)
7. Proverbs दूर के ढोल सुहावने लगते ह9 (The drums sound better at a
distance, The grass is always greener on the other
side)
8. Named Entities नई िद:ली (New Delhi)
9. Adverbs:
Adverb of Place( थानवाचक)
Adverb of Quantity (प<रमाणवाचक)
Adverb of Time (कालवाचक)
Adverb of Manner (रीितवाचक)
आगे पीछे (aage piche, front and back)
बारी बारी (baari baari, one by one)
आज कल (aaj kal, now-a-days)
भली भांित (bhali bhanti, clearly)
Here, the last three types were proposed by the authors as MWEs in Hindi and rest six types were
defined in [4]. The examples are taken from the paper itself. The result analysis of these types
will be done in section 5.2 based on most of these types.
Figure 5. Ontological Classification of MWEs in Hindi
In [2], the classification tools used for the Hindi MWEs is Protégé, which provides ontological
classification for the various class types. The representation of ontological classification of Hindi
MWEs is shown below. The highlighted types [3] are the additions to the previous classification
[4] of Hindi MWEs. In this way the classification is done in the previous work [2-4] and the
results are analyzed in the next section.
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
20
5.2. MWEs Extraction: Result Analysis of existing work
The experiments in [3] and [4] are performed on the Hindi corpus taken from CFILT, IIT
Bombay (http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar ) and results are collected from these
papers. In this paper the results are used for analysis purpose only. Since the F-Measure score of
many types are >90% as they frequently occur in the text and the properties exhibited by the
words fulfil the necessary and sufficient condition to be a multiword.
Table 2. F-Measure score of various types of Hindi MWEs.
S.No. Type F-Measure Score
1. Acronyms and Abbreviations
with dots
92.2%
2. Replicating class 97.4%
3. Doublet class 73.6%
4. ‘vaala’ construct class 90.7%
5. Complex predicates and
compound words
77.2%
6. Acronym and named entities 27.5%
7. Compound Adverbs 90.66%
(Average of three sample
experiments)
The identification of Acronyms and abbreviations and ‘vaala’ construct classes can be done easily
and they occur frequently in text so there is a large collection of these types which results in
higher f-score. Replicating words also occur frequently in any text and thus added to highest
score. The Doublet class mainly includes pair of words that are antonyms, hyponyms or near
synonyms of each other [4]. There is no such frequent occurrence of these types exist in any
document but an average number exists, so the score is also an average. Since the acronyms and
named entities occur in many verities in any text and are very difficult to find out as well as
identified, this is the reason behind the least score of these types. The Compound adverbs also
occur frequently in any text and these are of various types also as shown in Table 1, so gained a
good F-Measure score. The pictorial representation of analysis is shown below.
Figure 6. Analysis of various Hindi MWEs types
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
21
6. CONCLUSIONS AND FUTURE SCOPE
The MWEs are very important aspect in any language which is based on the property of
“Idiomaticity” (deviation from the basic properties of the words) applied at lexical, syntactic,
semantic, pragmatic and statistical level [3]. In this paper a detailed understanding of Hindi Text
processing as well as MWEs is provided which is comprehensive enough for the processing of
Hindi MWEs. NLTK is a toolkit available for processing various natural languages tasks. It
supports Unicode format and provides separate packages for Indian languages. Various
repositories of NLTK are being used for processing the tasks. Further the result analysis of
previous work [2,3] has been done and shown using the graphical measures. The corpus used for
the experiments has been collected from various sources for accuracy. For text processing the
corpus is collected from the famous Hindi novel “कम भूमी (Karama Bhumi)” by Munshi Premchand
and for analysis purpose the Hindi corpus provided by CFILT has been used.
The future scope lies in various directions, mainly in extension of ontologies for Hindi MWEs
types and exploration for more new types.
REFERENCES
[1] Nokel, M. & Loukachevitch , N., (2016), “Accounting ngrams and multi-word terms can improve
topic models” Proceedings of the 12th Workshop on Multiword Expressions, Association for
Computational Linguistics, pp44–49
[2] Joon, R. & Singhal, A., (2015) “Classification of MWEs in Hindi using ontology,” ITC 2015, in the
proceedings of international conference on Recent Trends in Information, Telecommunication and
Computing -ITC 2015, Vol. 6, pp84-92.
[3] Joon, R. & Singhal, A., (2015), "A System for Compound Adverbs MWEs extraction in Hindi." in the
proceedings of International Conference on Contemporary Computing (IC3), IEEE Computer
Society, Vol. 8, pp336-341.
[4] Sinha, RMK, (2011) “Stepwise Mining of Multi-Word Expressions in Hindi,” In: Proceedings of the
Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011),
Portland Oregon, USA, pp110–115.
[5] Baldwin, Tim & Kim, SN (2010) Multiword Expressions Handbook of Natural Language Processing,
Edited by: Nitin Indurkhya and Fred J. Damerau (eds.) Second Edition, CRC Press, Boca Raton,
USA, pp.267-292.
[6] Ramisch, C., Schreiner, P., Idiart, M. & Villavicencio, A., (2008) “An Evaluation of Methods for the
Extraction of Multiword Expressions”, in the proceedings of International conference on Language
Resources and Evaluation, Vol. 6, pp50-53.
[7] Madnani, N., (2007),“Getting Started on Natural Language Processing with Python”, Crossroads.
[8] Bird, S., (2006), “NLTK: The Natural Language Toolkit” Proceedings of the COLING/ACL 2006
Interactive Presentation Sessions, Association for Computational Linguistics, pp69–72.
[9] Bird, S., (2005), “NLTK-Lite: Efficient Scripting for Natural Language Processing”, Proceedings of
the 4th International Conference on Natural Language Processing, 2005, pp. 11 – 18.
[10] Sag, I.A., Baldwin, T., Bond, F., Copestake, A. & Flickinger, D., (2002) “ Multiword expressions: A
pain in the neck for NLP,” In: Proceedings of Third International Conference on Computational
Linguistics and Intelligent Text Processing: CICLing-2002, Lecture Notes in Computer, pp. 1-15.
[11] Smadja, F. (1993) “Retrieving Collocations from Text: Xtract”, Computational Linguistics 19, pp143-
177
[12] Choueka, Y. (1988), “Looking for needles in a haystack or locating interesting collocational
expressions in large textual databases” In RIAO 88:(Recherche d'Information Assistée par
Ordinateur). Conference, pp609-623.
[13] Natural Language Toolkit. http://nltk.sourceforge.net
[14] NLTK Tutorial. http://nltk.sourceforge.net/index.php/Book
[15] http://www.nltk.org/
International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017
22
Authors
Rakhi Joon is pursuing her Ph.D. in Computer Science from University of Delhi,
New Delhi. She did her M.Tech. in Computer Science & Engineering from GJU S&T,
Hisar, Haryana and B.Tech. in Information Technology from MDU, Rohtak, Haryana.
Her research areas include Natural Language Processing, Wireless Networks.
Dr. Archana Singhal is working as an Associate Professor, Department of Computer
Science, Indraprastha College for Women, University of Delhi, New Delhi. Her
research areas include Natural Language Processing, Semantic Web, Multi-agent
Systems, Information Retrieval and Ontologies, Secure Software Systems. She has
many publications to her credit in reputed journals and International conferences.

Weitere ähnliche Inhalte

Was ist angesagt?

Conceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarizationConceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarizationijnlc
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION cscpconf
 
Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...ijctcm
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEkevig
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEijnlc
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)inventionjournals
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...ijaia
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systemspaperpublications3
 
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTIONTRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTIONijnlc
 
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODPART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODijait
 
A New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in MalayalamA New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in Malayalamijcsit
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATAijnlc
 

Was ist angesagt? (17)

Conceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarizationConceptual framework for abstractive text summarization
Conceptual framework for abstractive text summarization
 
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
DOCUMENT SUMMARIZATION IN KANNADA USING KEYWORD EXTRACTION
 
Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...Automatic classification of bengali sentences based on sense definitions pres...
Automatic classification of bengali sentences based on sense definitions pres...
 
SCTUR: A Sentiment Classification Technique for URDU
SCTUR: A Sentiment Classification Technique for URDUSCTUR: A Sentiment Classification Technique for URDU
SCTUR: A Sentiment Classification Technique for URDU
 
G1803013542
G1803013542G1803013542
G1803013542
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGEADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
ADVANCEMENTS ON NLP APPLICATIONS FOR MANIPURI LANGUAGE
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
 
Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...Suitability of naïve bayesian methods for paragraph level text classification...
Suitability of naïve bayesian methods for paragraph level text classification...
 
Phonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech SystemsPhonetic Recognition In Words For Persian Text To Speech Systems
Phonetic Recognition In Words For Persian Text To Speech Systems
 
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTIONTRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
 
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTIONTRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
TRANSLATING LEGAL SENTENCE BY SEGMENTATION AND RULE SELECTION
 
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHODPART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
PART OF SPEECH TAGGING OFMARATHI TEXT USING TRIGRAMMETHOD
 
P1803018289
P1803018289P1803018289
P1803018289
 
A New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in MalayalamA New Approach to Parts of Speech Tagging in Malayalam
A New Approach to Parts of Speech Tagging in Malayalam
 
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATANAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
NAMED ENTITY RECOGNITION FROM BENGALI NEWSPAPER DATA
 

Andere mochten auch

A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTSA PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTSijsc
 
INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...
INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...
INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...ijsptm
 
MULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNING
MULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNINGMULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNING
MULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNINGsipij
 
BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATION
BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATIONBUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATION
BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATIONijnlc
 
A PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODEL
A PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODELA PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODEL
A PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODELVLSICS Design
 
ANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISES
ANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISESANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISES
ANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISESsipij
 
A CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMS
A CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMSA CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMS
A CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMSIJMIT JOURNAL
 
TRUST: DIFFERENT VIEWS, ONE GOAL
TRUST: DIFFERENT VIEWS, ONE GOALTRUST: DIFFERENT VIEWS, ONE GOAL
TRUST: DIFFERENT VIEWS, ONE GOALijsptm
 
A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...
A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...
A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...ijwmn
 
NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...
NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...
NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...ijwmn
 
A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...
A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...
A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...sipij
 
DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...
DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...
DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...ijma
 
A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...
A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...
A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...ijma
 
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...sipij
 
A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...
A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...
A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...ijma
 
ASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITY
ASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITYASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITY
ASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITYijsc
 
BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...
BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...
BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...ijsc
 
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAREAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAijp2p
 
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLPA NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLPijnlc
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)ijnlc
 

Andere mochten auch (20)

A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTSA PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
A PROPOSED MULTI-DOMAIN APPROACH FOR AUTOMATIC CLASSIFICATION OF TEXT DOCUMENTS
 
INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...
INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...
INNOVATIVE AND SOCIAL TECHNOLOGIES ARE REVOLUTIONIZING SAUDI CONSUMER ATTITUD...
 
MULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNING
MULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNINGMULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNING
MULTIMODAL BIOMETRICS RECOGNITION FROM FACIAL VIDEO VIA DEEP LEARNING
 
BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATION
BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATIONBUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATION
BUILDING A SYLLABLE DATABASE TO SOLVE THE PROBLEM OF KHMER WORD SEGMENTATION
 
A PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODEL
A PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODELA PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODEL
A PREDICTION METHOD OF GESTURE TRAJECTORY BASED ON LEAST SQUARES FITTING MODEL
 
ANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISES
ANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISESANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISES
ANALYSIS OF MMSE SPEECH ESTIMATION IMPACT IN WEST SUMATRA'S NOISES
 
A CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMS
A CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMSA CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMS
A CASE STUDY ON AUTO SOCIALIZATION IN ONLINE PLATFORMS
 
TRUST: DIFFERENT VIEWS, ONE GOAL
TRUST: DIFFERENT VIEWS, ONE GOALTRUST: DIFFERENT VIEWS, ONE GOAL
TRUST: DIFFERENT VIEWS, ONE GOAL
 
A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...
A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...
A STUDY ON QUANTITATIVE PARAMETERS OF SPECTRUM HANDOFF IN COGNITIVE RADIO NET...
 
NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...
NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...
NETWORK PERFORMANCE ENHANCEMENT WITH OPTIMIZATION SENSOR PLACEMENT IN WIRELES...
 
A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...
A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...
A NOVEL METHOD FOR PERSON TRACKING BASED K-NN : COMPARISON WITH SIFT AND MEAN...
 
DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...
DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...
DEVELOPMENT OF AN ANDROID APPLICATION FOR OBJECT DETECTION BASED ON COLOR, SH...
 
A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...
A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...
A NOVEL IMAGE STEGANOGRAPHY APPROACH USING MULTI-LAYERS DCT FEATURES BASED ON...
 
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...
ROBUST FEATURE EXTRACTION USING AUTOCORRELATION DOMAIN FOR NOISY SPEECH RECOG...
 
A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...
A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...
A COLLAGE IMAGE CREATION & “KANISEI” ANALYSIS SYSTEM BY COMBINING MULTIPLE IM...
 
ASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITY
ASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITYASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITY
ASPHALTIC MATERIAL IN THE CONTEXT OF GENERALIZED POROTHERMOELASTICITY
 
BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...
BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...
BFO – AIS: A FRAME WORK FOR MEDICAL IMAGE CLASSIFICATION USING SOFT COMPUTING...
 
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATAREAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
REAL-TIME INTRUSION DETECTION SYSTEM FOR BIG DATA
 
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLPA NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
A NOVEL APPROACH FOR INFORMATION RETRIEVAL TECHNIQUE FOR WEB USING NLP
 
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
DEVELOPMENT OF ARABIC NOUN PHRASE EXTRACTOR (ANPE)
 

Ähnlich wie ANALYSIS OF MWES IN HINDI TEXT USING NLTK

A Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text SummarizationA Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text SummarizationIJERD Editor
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMijnlc
 
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONKANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONijnlc
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...IJECEIAES
 
Possibility of interdisciplinary research software engineering andnatural lan...
Possibility of interdisciplinary research software engineering andnatural lan...Possibility of interdisciplinary research software engineering andnatural lan...
Possibility of interdisciplinary research software engineering andnatural lan...Nakul Sharma
 
Information extraction using discourse
Information extraction using discourseInformation extraction using discourse
Information extraction using discourseijitcs
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...kevig
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ijnlc
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Taggingkevig
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGkevig
 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineeringNakul Sharma
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUEJournal For Research
 
Corpus study design
Corpus study designCorpus study design
Corpus study designbikashtaly
 
A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...Scott Bou
 
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...ijnlc
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsLisa Graves
 
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...ijnlc
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
 
Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...Waqas Tariq
 

Ähnlich wie ANALYSIS OF MWES IN HINDI TEXT USING NLTK (20)

A Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text SummarizationA Survey of Various Methods for Text Summarization
A Survey of Various Methods for Text Summarization
 
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVMHINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
HINDI AND MARATHI TO ENGLISH MACHINE TRANSLITERATION USING SVM
 
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATIONKANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
KANNADA NAMED ENTITY RECOGNITION AND CLASSIFICATION
 
Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...Automatic text summarization of konkani texts using pre-trained word embeddin...
Automatic text summarization of konkani texts using pre-trained word embeddin...
 
Possibility of interdisciplinary research software engineering andnatural lan...
Possibility of interdisciplinary research software engineering andnatural lan...Possibility of interdisciplinary research software engineering andnatural lan...
Possibility of interdisciplinary research software engineering andnatural lan...
 
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
Applying Rule-Based Maximum Matching Approach for Verb Phrase Identification ...
 
Information extraction using discourse
Information extraction using discourseInformation extraction using discourse
Information extraction using discourse
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
ATTENTION-BASED SYLLABLE LEVEL NEURAL MACHINE TRANSLATION SYSTEM FOR MYANMAR ...
 
Parsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function TaggingParsing of Myanmar Sentences With Function Tagging
Parsing of Myanmar Sentences With Function Tagging
 
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGINGPARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
PARSING OF MYANMAR SENTENCES WITH FUNCTION TAGGING
 
Integrating natural language processing and software engineering
Integrating natural language processing and software engineeringIntegrating natural language processing and software engineering
Integrating natural language processing and software engineering
 
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUECOMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
COMPREHENSIVE ANALYSIS OF NATURAL LANGUAGE PROCESSING TECHNIQUE
 
Corpus study design
Corpus study designCorpus study design
Corpus study design
 
A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...A Comprehensive Study On Natural Language Processing And Natural Language Int...
A Comprehensive Study On Natural Language Processing And Natural Language Int...
 
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
IMPROVING THE QUALITY OF GUJARATI-HINDI MACHINE TRANSLATION THROUGH PART-OF-S...
 
A Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And ApplicationsA Review Of Text Mining Techniques And Applications
A Review Of Text Mining Techniques And Applications
 
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
EXTENDING THE KNOWLEDGE OF THE ARABIC SENTIMENT CLASSIFICATION USING A FOREIG...
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...Design and Development of a Malayalam to English Translator- A Transfer Based...
Design and Development of a Malayalam to English Translator- A Transfer Based...
 

Kürzlich hochgeladen

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Bhuvaneswari Subramani
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfOrbitshub
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Orbitshub
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 

Kürzlich hochgeladen (20)

Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdfRising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
Rising Above_ Dubai Floods and the Fortitude of Dubai International Airport.pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
Navigating the Deluge_ Dubai Floods and the Resilience of Dubai International...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor Presentation
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 

ANALYSIS OF MWES IN HINDI TEXT USING NLTK

  • 1. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 DOI: 10.5121/ijnlc.2017.6102 13 ANALYSIS OF MWES IN HINDI TEXT USING NLTK Rakhi Joon and Archana Singhal Department of Computer Science, University of Delhi, New Delhi, India ABSTRACT Natural Language Toolkit (NLTK) is a generic platform to process the data of various natural (human) languages and it provides various resources for Indian languages also like Hindi, Bangla, Marathi and so on. In the proposed work, the repositories provided by NLTK are used to carry out the processing of Hindi text and then further for analysis of Multi word Expressions (MWEs). MWEs are lexical items that can be decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical idiomaticity. The main focus of this paper is on processing and analysis of MWEs for Hindi text. The corpus used for Hindi text processing is taken from the famous Hindi novel “KaramaBhumi by Munshi PremChand”. The result analysis is done using the Hindi corpus provided by Resource Centre for Indian Language Technology Solutions (CFILT). Results are analysed to justify the accuracy of the proposed work. KEYWORDS NLTK, Hindi, Multi Words, MWEs, Analysis of MWEs, Hindi Text. 1. INTRODUCTION Hindi is most widely used language spoken as well as used for official work in India and other countries. It is originated from Sanskrit language and thus has a rich set of grammar and literature. The use of Hindi language has gained much popularity in NLP research. There are many resources developed for Hindi and some are under consideration. The NLTK framework is used for processing natural languages and providing comprehensive support for various NLP related tasks [8]. It is also used as a tool for carrying out the proposed work in which, the Hindi text is processed to generate the frequency graphs for various words and that can be further used for MWEs frequency measures. MWEs comprises of two or more words (2-grams, 3-grams,…, n-grams) that convey some meaning as a whole, but independently the words convey some other meaning. The meaning of MWEs cannot be directly given by its components words. Some examples of MWE in Hindi are पि मी स यता(paschami sabhyata), बाल िववाह (baal Vivah), and so on. There are various views from different authors regarding MWEs : “A unit whose exact meaning cannot be derived directly from the meaning of its parts” [12] “Arbitrary and recurrent word combination” [11] “Idiosyncratic concepts that cross word boundaries” [10] As per the linguistic properties, “MWE are lexical items that can be decomposed into multiple lexemes and display lexical, syntactic, semantic, pragmatic and statistical idiomaticity” [5]. All the above definitions follow the property of “Idiomaticity” for MWEs. Idiomaticity refers to deviation from the basic properties of the words and applied at lexical, syntactic, semantic,
  • 2. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 14 pragmatic and statistical level [2]. MWEs are the phrases that cannot be entirely predicted on the basis of standard grammar rules and lexical entries (http://mwe.stanford.edu/reading-group.html). The processing of text in any language depends on the grammatical constructs as well as the linguistic and syntactic properties of that particular language. The representation of a particular construct in any language is affected by the POS taggers. In Hindi, the extraction process of MWEs begins with the processing of text, finding n-grams, POS tagging and then applying the algorithms and procedures to carry out the required task. In the proposed task, the list of n-grams is extracted which can be further used for MWEs extraction. MWEs extraction in English language is very common task and a lot of work has been done, but MWEs have not gained much popularity in Hindi language. In the proposed work an attempt has been made to process the Hindi text and to analyse the extracted types of MWEs present in Hindi text [2,3] on the basis of their properties and usage. The further analysis is done for accuracy. The corpus used for the processing is taken from one of the famous Hindi novel “KaramaBhumi by Munshi PremChand” and for result analysis, the Hindi corpus is obtained from CFILT. The organization of the paper is done in the following manner: The next section gives a brief review of related work. In third section, Dataset used is explained with brief introduction about the corpus. In the next section, the description of Hindi Text processing using NLTK is given, how the various operations are performed on text and so on. In section 5, various types of Hindi MWEs and how to process the text for MWEs extraction in Hindi are proposed by the authors with detailed analysis of previous results. It is followed by the Conclusion and Reference sections. 2. RELATED WORK The basic understanding of NLTK and its specifications are given by many authors, in [7], the basics of python programming language for NLP are given which are used for carrying out the task of text processing in NLTK. Python is a particularly convenient language to use for writing scripts to perform natural language processing tasks [9]. The simplified version of NLTK toolkit and its efficient usage is provided in [8]. Various processing tasks are discussed in the paper. The more simplified version of toolkit, NLTK-Lite is discussed in [9], which provides ready access to standard corpora, along with representations for common linguistic data structures, reference implementations for many NLP tasks, and extensive documentation including tutorials and library reference. More detailed knowledge and syntax related information is given in [13-15]. In [1] an empirical study of integrating n-grams and multi-word terms into topic models is presented by the authors while maintaining similarities between them and words based on their component structure with the help of LDA-ITER algorithm by which the most suitable n-grams and multi- word terms are incorporated. In [4], the author thoroughly examined various types of MWE encountered in Hindi from machine translation viewpoint. These specific types were not given proper attention in research for e.g. ‘vaalaa’ construct, doublets, replication etc. Many of the types are frequently used in daily life but are not given proper place in formal textual corpus. The author presented a stepwise mining of various MWEs in Hindi and their machine translation perspectives. The lexical and syntactic configuration of MWEs is discussed by the authors in [5], i.e. sometimes the component words preserve original semantics but sometimes MWEs encodes extra semantics. Further base types of MWEs and the linguistic properties of MWEs are explained. The basic methodology for MWEs is presented in [10] by the author. In this paper the difficulty of extracting Multi Words from a particular document along with types and properties of MWEs are discussed. Collocations are the combination of random and repeated words, as described in [11]. A number of techniques based on statistical methods for extracting collocations from large textual corpora are presented by the author. These techniques were based on some original filtering methods that give higher-precision output. A lexicographic tool, Xtract is used for the implementation of proposed techniques. In [6], the authors tried to find out whether MWE
  • 3. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 15 identification methods can be efficiently applied to different types of MWEs and various languages. The authors further investigated the hypothesis that MWEs can be detected independently by the unique statistical properties of their component words, regardless of their type. The performance is measured and compared by using three statistical measures: Mutual Information, χ2 and Permutation Entropy. In the proposed work the main focus is on Hindi text processing and MEWs extraction analysis. Earlier, the identification and Extraction of Hindi MWEs are done by the authors [2]. A new classification of Hindi MWEs is given which include MWEs types: Acronyms and Abbreviations, Compound Adverb, Complex predicates, foreign words and terms, Idioms, morphemes, Named entities, proverbs, replicating words and Standard phrases and Expressions. Some existing types are also included along with new types and ontology concept to have a better classification representation. A brief introduction of Hindi adverbs is presented in [3] and compound adverbs which act as MWEs are identified in the Hindi text. Further the classification of the compound adverbs is proposed on the basis of types of adverbs. 3. DATASET A corpus is the collection of text for a particular subject. In any text processing task corpus plays a vital role as all the operations are performed on the text and the training and test data categorization is done on the data. In the proposed work, the experiments are performed on the Hindi Dataset taken from one of the famous novel by Hindi writer, “Munshi Premchand”: “कम भूमी (Karama Bhumi)”. This novel is a political novel in which political problems are presented through some families and thus the collection of text entitles a particular subject as well as the identity. The dataset contains 128842 Hindi words with 14081 unique words. As the total data in the book is divided into five sections. Last four sections together (containing 84637 words and 11009 unique words) are taken for training purpose and the first section (containing 44205 words and 7301 unique words) is taken for testing purpose. The Text processing as well as MWEs extraction tasks are applied on the uniformly distributed data, like here in a particular ratio of text is used as training set and rest as text set. The data used for result analysis is obtained from the Hindi corpus provided by CFILT, IIT Bombay (http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar). The file can’t be directly used for the required process so it is processed further to obtain a refined form of text. After obtaining the refined text file from the repository, the Hindi POS tagger is used to get the text file containing the POS tagged data [3]. That POS tagged file is further processed to get the final list of Hindi MWEs. 4. HINDI TEXT PROCESSING WITH NLTK The Hindi text is processed in Unicode format as NLTK supports this format. There are several repositories provided by NLTK for Indian Languages via “Indian Language POS-Tagged Corpus Reader”. The following sections discusses about the various text processing tasks. 4.1. Corpus Exploration Several corpora are provided by NLTK in many languages, for Indian languages also Indian Language POS-Tagged Corpus is there on which experiments can be performed. On the other hand, a new corpus can be added into Python and can be processed using NLTK as done in the proposed work. Use the NLTK corpus module to read the corpus “kb_complete.txt”, included in the Python directory, and various operations are performed on this text file to get the complete
  • 4. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 16 understanding of how to process the Hindi text using NLTK. The sorted set as well as unique words are extracted out from the corpus. Figure 1. Frequency Distribution of three common words in complete dataset Figure 2. Frequency Distribution of three common words in the Training dataset
  • 5. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 17 Figure 3. Frequency Distribution of three common words in the Test dataset The most commonly used words are obtained from the corpus, here most common 50 words are discovered and from those 3 words ("म", "वह","रहा") are selected for finding and checking out the frequency distribution for those words in the corpus for training as well as test data. Then various dispersion plots of these three words are plotted with the training and test data as shown in the above mentioned figures. 4.2. Operations done on text There are a number of ways in which one can use the inbuilt NLTK supported python functions to process the text in a particular language. Since for Multiword, the individual words are combined together to form a new one, so Bi-grams, Tri-grams and n-grams are needed to be extracted from the text and thus can be further filtered out to extract multiword. Like here from the text the bi-grams and tri-grams are obtained: >>> list(bigrams(words)) [('हमारे', ' कूल '), (' कूल ', 'और'), ('और', 'कॉलेज '), ('कॉलेज ', 'म'), ('म', 'िजस'), ('िजस', 'त परता'), ('त परता', 'से'), ('से', 'फ स'), ('फ स', 'वसूल'), ('वसूल', 'क '), ('क ', 'जाती'), ('जाती', 'है,'), ('है,', 'शायद'), ('शायद', 'मालगुजारी'), …] >>> list(trigrams(words)) [('हमारे', ' कूल ', 'और'), (' कूल ', 'और', 'कॉलेज '), ('और', 'कॉलेज ', 'म'), ('कॉलेज ', 'म', 'िजस'), ('म', 'िजस', 'त परता'), ('िजस', 'त परता', 'से'), ('त परता', 'से', 'फ स'), ('से', 'फ स', 'वसूल'), ('फ स', 'वसूल', 'क '), ('वसूल', 'क ', 'जाती'), ('क ', 'जाती', 'है,')…] Similarly n-grams can be obtained from the text. In MWEs extraction the initial step is to group the words and that is initiated using the n-grams operation on the text. Then further steps are followed to get the final MWEs list.
  • 6. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 18 4.3. N-Grams as MWEs A Multi word can be 2-grams, 3-grams or extended up to any number of words depending on the type as well as the property. With the help of NLTK, one can easily extract n-grams from the text as explained in the above section. The identification of MWEs from the n-grams is an important task because the first step is to extract the n-grams list that can be further processed to identify and extract MWEs. In figure 4 some n-grams extracted from the Dataset (“कम भूमी (Karama Bhumi)”) are shown which can act as MWEs. Figure 4. Pictorial representation of n-grams (हवाई जहाज, &दयहीन द(तरी शासन, ल(ज क झंकार का नाम गजल) as MWEs in Hindi 5. HINDI MWES CLASSIFICATION AND ANALYSIS 5.1. MWEs in Hindi and Significance In any language, a sentence itself exhibits various properties and predicates like noun, pronoun, verb, adjective, adverb and so on. For the formation and extraction of MWEs from a corpus, first it is detected whether the combination of words which is to be categorized as MWEs exhibits the necessary and sufficient conditions of MWE which mainly include that words have to be separated by space or delimiter, non-compositionality of meaning and idiomaticity and many more [2]. The existing and new types of Hindi MWEs which were defined by the authors are listed in the below mentioned table.
  • 7. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 19 Table 1. Types of Hindi MWEs with examples [2]. S.No. Type Example 1. Acronyms and Abbreviations ‘ए पी जे अ-दुल कलाम ’(A. P. J. Abdul Kalam) 2. Complex Predicates वापस लेना (vaapas lena, return back) 3. Foreign Words and Terms सीिनयर डॉ/टर (Senior Doctor) 4. Idioms ऊँ गली उठाना: आलोचना करना (to lift finger, to blame) 5. Morphemes पीने वाला पानी (pine vaalaa pani, drinking water) 6. Replicating words अभी अभी (now now, recently) 7. Proverbs दूर के ढोल सुहावने लगते ह9 (The drums sound better at a distance, The grass is always greener on the other side) 8. Named Entities नई िद:ली (New Delhi) 9. Adverbs: Adverb of Place( थानवाचक) Adverb of Quantity (प<रमाणवाचक) Adverb of Time (कालवाचक) Adverb of Manner (रीितवाचक) आगे पीछे (aage piche, front and back) बारी बारी (baari baari, one by one) आज कल (aaj kal, now-a-days) भली भांित (bhali bhanti, clearly) Here, the last three types were proposed by the authors as MWEs in Hindi and rest six types were defined in [4]. The examples are taken from the paper itself. The result analysis of these types will be done in section 5.2 based on most of these types. Figure 5. Ontological Classification of MWEs in Hindi In [2], the classification tools used for the Hindi MWEs is Protégé, which provides ontological classification for the various class types. The representation of ontological classification of Hindi MWEs is shown below. The highlighted types [3] are the additions to the previous classification [4] of Hindi MWEs. In this way the classification is done in the previous work [2-4] and the results are analyzed in the next section.
  • 8. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 20 5.2. MWEs Extraction: Result Analysis of existing work The experiments in [3] and [4] are performed on the Hindi corpus taken from CFILT, IIT Bombay (http://www.cfilt.iitb.ac.in/hin_corp_unicode.tar ) and results are collected from these papers. In this paper the results are used for analysis purpose only. Since the F-Measure score of many types are >90% as they frequently occur in the text and the properties exhibited by the words fulfil the necessary and sufficient condition to be a multiword. Table 2. F-Measure score of various types of Hindi MWEs. S.No. Type F-Measure Score 1. Acronyms and Abbreviations with dots 92.2% 2. Replicating class 97.4% 3. Doublet class 73.6% 4. ‘vaala’ construct class 90.7% 5. Complex predicates and compound words 77.2% 6. Acronym and named entities 27.5% 7. Compound Adverbs 90.66% (Average of three sample experiments) The identification of Acronyms and abbreviations and ‘vaala’ construct classes can be done easily and they occur frequently in text so there is a large collection of these types which results in higher f-score. Replicating words also occur frequently in any text and thus added to highest score. The Doublet class mainly includes pair of words that are antonyms, hyponyms or near synonyms of each other [4]. There is no such frequent occurrence of these types exist in any document but an average number exists, so the score is also an average. Since the acronyms and named entities occur in many verities in any text and are very difficult to find out as well as identified, this is the reason behind the least score of these types. The Compound adverbs also occur frequently in any text and these are of various types also as shown in Table 1, so gained a good F-Measure score. The pictorial representation of analysis is shown below. Figure 6. Analysis of various Hindi MWEs types
  • 9. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 21 6. CONCLUSIONS AND FUTURE SCOPE The MWEs are very important aspect in any language which is based on the property of “Idiomaticity” (deviation from the basic properties of the words) applied at lexical, syntactic, semantic, pragmatic and statistical level [3]. In this paper a detailed understanding of Hindi Text processing as well as MWEs is provided which is comprehensive enough for the processing of Hindi MWEs. NLTK is a toolkit available for processing various natural languages tasks. It supports Unicode format and provides separate packages for Indian languages. Various repositories of NLTK are being used for processing the tasks. Further the result analysis of previous work [2,3] has been done and shown using the graphical measures. The corpus used for the experiments has been collected from various sources for accuracy. For text processing the corpus is collected from the famous Hindi novel “कम भूमी (Karama Bhumi)” by Munshi Premchand and for analysis purpose the Hindi corpus provided by CFILT has been used. The future scope lies in various directions, mainly in extension of ontologies for Hindi MWEs types and exploration for more new types. REFERENCES [1] Nokel, M. & Loukachevitch , N., (2016), “Accounting ngrams and multi-word terms can improve topic models” Proceedings of the 12th Workshop on Multiword Expressions, Association for Computational Linguistics, pp44–49 [2] Joon, R. & Singhal, A., (2015) “Classification of MWEs in Hindi using ontology,” ITC 2015, in the proceedings of international conference on Recent Trends in Information, Telecommunication and Computing -ITC 2015, Vol. 6, pp84-92. [3] Joon, R. & Singhal, A., (2015), "A System for Compound Adverbs MWEs extraction in Hindi." in the proceedings of International Conference on Contemporary Computing (IC3), IEEE Computer Society, Vol. 8, pp336-341. [4] Sinha, RMK, (2011) “Stepwise Mining of Multi-Word Expressions in Hindi,” In: Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), Portland Oregon, USA, pp110–115. [5] Baldwin, Tim & Kim, SN (2010) Multiword Expressions Handbook of Natural Language Processing, Edited by: Nitin Indurkhya and Fred J. Damerau (eds.) Second Edition, CRC Press, Boca Raton, USA, pp.267-292. [6] Ramisch, C., Schreiner, P., Idiart, M. & Villavicencio, A., (2008) “An Evaluation of Methods for the Extraction of Multiword Expressions”, in the proceedings of International conference on Language Resources and Evaluation, Vol. 6, pp50-53. [7] Madnani, N., (2007),“Getting Started on Natural Language Processing with Python”, Crossroads. [8] Bird, S., (2006), “NLTK: The Natural Language Toolkit” Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, Association for Computational Linguistics, pp69–72. [9] Bird, S., (2005), “NLTK-Lite: Efficient Scripting for Natural Language Processing”, Proceedings of the 4th International Conference on Natural Language Processing, 2005, pp. 11 – 18. [10] Sag, I.A., Baldwin, T., Bond, F., Copestake, A. & Flickinger, D., (2002) “ Multiword expressions: A pain in the neck for NLP,” In: Proceedings of Third International Conference on Computational Linguistics and Intelligent Text Processing: CICLing-2002, Lecture Notes in Computer, pp. 1-15. [11] Smadja, F. (1993) “Retrieving Collocations from Text: Xtract”, Computational Linguistics 19, pp143- 177 [12] Choueka, Y. (1988), “Looking for needles in a haystack or locating interesting collocational expressions in large textual databases” In RIAO 88:(Recherche d'Information Assistée par Ordinateur). Conference, pp609-623. [13] Natural Language Toolkit. http://nltk.sourceforge.net [14] NLTK Tutorial. http://nltk.sourceforge.net/index.php/Book [15] http://www.nltk.org/
  • 10. International Journal on Natural Language Computing (IJNLC) Vol. 6, No.1, February 2017 22 Authors Rakhi Joon is pursuing her Ph.D. in Computer Science from University of Delhi, New Delhi. She did her M.Tech. in Computer Science & Engineering from GJU S&T, Hisar, Haryana and B.Tech. in Information Technology from MDU, Rohtak, Haryana. Her research areas include Natural Language Processing, Wireless Networks. Dr. Archana Singhal is working as an Associate Professor, Department of Computer Science, Indraprastha College for Women, University of Delhi, New Delhi. Her research areas include Natural Language Processing, Semantic Web, Multi-agent Systems, Information Retrieval and Ontologies, Secure Software Systems. She has many publications to her credit in reputed journals and International conferences.