1
POS Tagging And Token
Classification By Using Bangla
Tokenizer
Presented by-
Sujit Kumar Das
M.Tech 3rd sem, IT
Roll-021413 No-363202205
Under the Supervision Of
Mr. Sourish Dhar
Asst. Professor, Dept of IT
Assam University
Contents…
2
 Introduction
 Literature Survey
 Our Proposal
 Future Works To Be Done
 Conclusions
 References
Introduction:
3
What is NLP?
 A field of computer science, artificial intelligence,
and linguistics concerned with the interactions
between computers and human (natural)
languages[1].
 NLP provides means of analyzing text.
 The goal of NLP is to make computers analyze
and understand the languages that humans use
naturally.
Cont…
4
Why Natural Language Processing?
Computers do not “see” English text the way we
do.
 People have no trouble understanding language,
but computers do.
– No common sense knowledge.
– No reasoning capacity.
Cont…
5
What We Need In NLP Task?
 Knowledge about Language.
 Knowledge about world.
 A way to combine Knowledge sources.
Cont…
6
Language Technology:
Mostly Solved             Making Good Progress        Still Really Hard
Spam Detection            Sentiment Analysis          Question Answering
POS Tagging               Word Sense Disambiguation   Paraphrase
Named Entity Recognition  Parsing                     Summarization
                          Machine Translation         Dialog
Cont…
7
POS Tagging:
Input: The grand jury commented on a number of
other topics.
Output: The/DT grand/JJ jury/NN commented/VBD
on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
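The word/TAG output format shown above can be read back into (word, tag) pairs with a few lines of Python. This is only a minimal sketch; the function name parse_tagged is a hypothetical helper, not part of any tagger mentioned in these slides.

```python
def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for item in line.split():
        # Split on the last slash so that './.' still yields ('.', '.').
        word, tag = item.rsplit("/", 1)
        pairs.append((word, tag))
    return pairs


tagged = parse_tagged(
    "The/DT grand/JJ jury/NN commented/VBD on/IN a/DT "
    "number/NN of/IN other/JJ topics/NNS ./."
)
```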
NE Recognition:
Input: Dan went to London to attend a conference
on NLP in 2012.
Output: Dan went to London to attend a conference
on NLP in 2012.
Name Dan
Location London
Date 2012
Cont…
What Is Tokenization?
8
Tokenization is the process of breaking a stream of
text up into words, phrases, symbols and other
meaningful elements called tokens.
Token: A sequence of characters that can be
treated as a single logical entity.
Typically tokens are-
Natural Languages   Programming Languages
Words               Identifiers
Numbers             Keywords
Abbreviations       Operators
Symbols             Special symbols
                    Constants
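As a rough illustration of tokenization for mixed Bangla and English text, the sketch below uses a regular expression over Unicode ranges. The pattern and the name tokenize are assumptions made for this example, not the system described in these slides. Explicit Bengali ranges are used because Python's \w does not match Bengali vowel signs, so a plain \w+ would split words apart.

```python
import re

# Assumed token pattern (not from the slides), tried in this order:
# Bengali digits, Bengali letters with their signs, Latin words,
# ASCII digits, and finally any single non-space symbol.
TOKEN_RE = re.compile(
    r"[\u09E6-\u09EF]+"                            # Bengali digits
    r"|[\u0980-\u09E5\u09F0\u09F1\u200C\u200D]+"   # Bengali letters and signs
    r"|[A-Za-z]+"                                  # Latin words
    r"|[0-9]+"                                     # ASCII digits
    r"|\S"                                         # any other symbol
)


def tokenize(text):
    """Return the tokens of text in order; white space is discarded."""
    return TOKEN_RE.findall(text)


tokens = tokenize("মাইকেল ৪৫ খবর।")
```

Each punctuation mark, including the danda (।), comes out as its own token, so a later classification step can label it separately.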
Cont…
What Is Tokenizer?
9
The job of a Tokenizer is to break up a stream of text
into tokens.
Why Tokenizer?
 It performs a crucial task in pre-processing any
natural language.
 It helps handle semantic issues in the subsequent
stages of machine translation.
 It produces a structural description of an input
sentence.
 For language modeling, segmenting the input text
into tokens is required[9].
Cont…
10
What is Token Classification?
Token classification means identifying each
token (word/term) in a document and classifying it
into one of some predefined categories.
These predefined categories can be the name of a
person, symbols, punctuation, abbreviations,
numbers, dates etc.
Cont…
Steps in Token Classification:
11
 Tokenize the given input text.
 Assign to each token the class (or tag) that it
belongs to.
For Example,
Token Class
মাইকেল Name
৪৫ Number
খবর Word
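The two steps above can be sketched as a small rule-based classifier. Everything here is an assumption for illustration: the function name classify_token, the rule order, and the known_names gazetteer are hypothetical, and a real system would need far richer rules for names, dates and abbreviations.

```python
import re
import unicodedata

# ASCII and Bengali digits both count as numbers in this sketch.
NUMBER_RE = re.compile(r"[0-9\u09E6-\u09EF]+")


def classify_token(token, known_names=frozenset()):
    """Assign one class to a token using simple, assumed rules."""
    if not token:
        return "Unknown"
    if NUMBER_RE.fullmatch(token):
        return "Number"
    # Unicode general categories starting with 'P' are punctuation.
    if all(unicodedata.category(ch).startswith("P") for ch in token):
        return "Punctuation"
    if token in known_names:
        return "Name"
    return "Word"


names = frozenset({"মাইকেল"})  # hypothetical name list
classes = [classify_token(t, names) for t in ["মাইকেল", "৪৫", "খবর", "।"]]
```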
Cont…
12
Why Bengali Language Processing:
 One of the top ten spoken languages in the
world.
 Little research work has been done so far.
Challenges In Bengali Language
Processing:
 Its grammatical vastness.
 It is not as well structured as Western languages (for
example, English).
Cont…
13
Goals of Bengali Language Processing:
To develop technology and standards to make
computer usage Bangla enabled.
To establish standards for Bangla text processing to
ensure interoperability across platforms.
 To develop large standardized corpus for Bangla text
and speech.
 To create an ensemble of available Bangla software
and corpus in a standardized form and make them
easily available to all.
 To develop new software and modify or enhance the
existing software.
 To develop suitable speech Technology for Bangla.
Literature Survey:
14
 A tokenizer is a component of a parser. Parsing
natural language text is more difficult than parsing
computer languages (as done in compilers and word
processors) because the grammars of natural
languages are complex and ambiguous, and the
vocabulary is effectively unbounded[8].
 Natural language applications, namely Information
Extraction, Machine Translation, and Speech
Recognition, need an accurate parser[8].
 A tokenizer plays its significant part in a parser, by
identifying the group or collection of words, existing
as a single and complex word in a sentence. Later
on, it breaks up the complex word into its
Cont…
Related Works:
15
Some existing standard tokenizers-
 Stanford Tokenizer for English Language[10].
 Shallow Tokenizer for Bengali Language.
 Vaakkriti Tokenizer for Sanskrit Language[2].
These tokenizers were developed for particular
languages only, i.e., no single tokenizer works
for all languages.
Cont…
Stanford Tokenizer:
16
 Developed mainly for the English language and later
extended to Arabic, Chinese and Spanish.
 Developed in Java.
Online Interface:
Cont…
Results after parsing:
17
S=Sentence, NP=Noun Phrase, NNS=Noun, plural, VP=Verb Phrase,
VBZ=Verb, 3rd person singular present, VBN=Verb, past participle,
PP=Prepositional Phrase, TO=to, IN=Preposition or
subordinating conjunction.
Cont…
Shallow Bangla Tokenizer:
18
The shallow parser gives the analysis of a sentence in
terms of-
 Morphological Analysis.
 POS Tagging.
 Chunking.
Apart from the final output, intermediate output of
individual modules is also available.
Cont…
19
Online Interface:
Cont…
20
Result after submitting:
Cont…
21
Bengali Stemmers:
 A Rule-Based Stemmer for Bengali Language by
Sandipan Sarkar, IBM and Sivaji
Bandyopadhyay, Jadavpur University[12].
 A light weight stemmer for Bengali, which was
used in a spelling checker, by Md. Zahurul Islam, Md.
Nizam Uddin and Mumit Khan, CRBLP, BRAC
University, Dhaka in 2007[13].
 Yet Another Suffix Stripper, which uses a clustering
approach based on string distance
measures and requires no linguistic knowledge, by
P. Majumder, Gobinda Kole, ISI, Pabitra Mitra, IIT and
Kalyankumar Dutta, Jadavpur University in
Cont…
22
Comparison Of Three stemmers:
Stemmer        Used Method               Accuracy(%)
Rule-Based     Orthographic-syllable     89.0
Light weight   Longest Match Basis       90.8
YASS           String Distance Measure   88.0
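To make the "longest match" idea from the table concrete, here is a minimal suffix-stripping sketch. It is not the algorithm of any of the cited stemmers: the suffix list is a small hypothetical sample, and the minimum stem length is an assumed safeguard against over-stemming.

```python
import unicodedata

# Hypothetical suffix sample for illustration only.
SUFFIXES = ["ের", "দের", "কে", "রা", "টি", "টা"]


def stem(word, suffixes=SUFFIXES, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len code points."""
    word = unicodedata.normalize("NFC", word)
    candidates = [unicodedata.normalize("NFC", s) for s in suffixes]
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(candidates, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


root = stem("ছেলেদের")
```

The word ছেলেদের ends with both ের and দের; trying the longer suffix first yields ছেলে, whereas stripping the shorter one would leave the incorrect ছেলেদ.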
Cont…
23
POS Tagger:
 Supervised POS Tagging: Uses pre-tagged
corpora for training, to learn information
about the tagset, word-tag frequencies, rule sets
etc[11].
e.g., N-Gram, Maximum Entropy Model (ME), Hidden
Markov Model (HMM) etc.
 Unsupervised POS Tagging: Does not require
pre-tagged corpora. It uses advanced
computational methods to automatically induce
tagsets.
e.g., Brill, Baum-Welch algorithm etc[11].
Cont…
24
Supervised Model POS Taggers Comparison:
Tagger          Applied Method
Uni-Gram(N=1)   Most likely approach
HMM             One sentence at a time. Formula-
                P (word | tag) * P (tag | previous n tags)
Bi-Gram(N=2)    Same as Unigram, but considers only the
                previous word's tag
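The "most likely" unigram approach in the comparison above can be sketched as follows: each word receives the tag it carried most often in the training data. The function names, the toy training sentences and the default tag for unseen words are all assumptions for illustration; this is not the implementation evaluated in [11].

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each word to the tag it has most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_unigram(words, model, default_tag="NN"):
    """Tag each word with its most likely tag; unseen words get default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]


# Toy training data (hypothetical), using the tag names from the earlier example.
training = [
    [("the", "DT"), ("number", "NN"), ("grew", "VBD")],
    [("a", "DT"), ("number", "NN")],
    [("they", "PRP"), ("number", "VBP"), ("the", "DT"), ("pages", "NNS")],
]
model = train_unigram(training)
result = tag_unigram(["the", "number", "jury"], model)
```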
Cont…
25
                     UNI-GRAM      BI-GRAM       HMM
Sentences   Tokens   Accuracy(%)   Accuracy(%)   Accuracy(%)
87          1002     28.6          28.6          39.3
304         4003     42.4          41.9          49.7
532         8026     48.1          47.9          53.6
677         10001    49.8          49.5          54.3
Bangla - SPSAL Corpus and Tagset with Test data: 400
sentences, 5225 tokens from the SPSAL test corpus[11].
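The slides do not say how the accuracy figures were computed. Assuming they are token-level accuracy (the percentage of test tokens whose predicted tag equals the reference tag), the measure could be sketched like this; token_accuracy is a hypothetical helper name.

```python
def token_accuracy(gold_tags, predicted_tags):
    """Percentage of positions where the predicted tag equals the gold tag."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("tag sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return 100.0 * correct / len(gold_tags)


accuracy = token_accuracy(["DT", "NN", "VBD", "IN"], ["DT", "NN", "NN", "IN"])
```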
Cont…
Problem Domain:
26
 Bangla is very rich in inflections, vibhaktis (suffixes)
and karakas, and these are often ambiguous.
 It is not easy to provide the semantic and
world knowledge that we humans often use when
we parse and understand Bangla
sentences.
So, mainly due to this grammatical vastness, designing
a Bangla tokenizer is not an easy task.
Cont…
Bengali Grammar: POS
27
Cont…
Bengali Grammar: Genders
28
There are four genders in Bengali grammar -
1.Pung lingo(masculine)
2.Stree lingo(feminine)
3.Ubha lingo(common)
4.Klib lingo(material)
Cont…
Bengali Grammar: Numbers
29
Like English, Bengali has two
numbers-
 Singular: When we refer to a single object or
person, it is singular.
e.g., a man, a girl etc.
 Plural: When we refer to more than one object or
person, it is plural.
e.g., two men, mangoes etc.
Our Proposal:
30
We are going to develop a system that can
be used to tokenize Bengali text and that will
also be able to solve the problem of token
classification.
Fig: Our Model (Natural Language Processing): raw (unstructured) text → Pre-processing → part-of-speech tagging → Token Classification → annotated (structured) text
Cont…
Flow Chart :
31
Input → Words → Stop Words Removal → Stemming → POS Tag → Classify Text
Cont…
32
Input:
Input will be a Bengali Text.
Words (Completed):
The text will be split into words after removing all
non-character symbols and white space, and the words
will then be stored in an Excel file.
Stop Words Removal (Completed):
Stop words are the frequently occurring
words that do not contribute relevant information to
the text classification task.
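Stop-word removal as described above amounts to filtering the token list against a fixed set. The stop-word list below is a tiny hypothetical sample, not the list used in the presented system, and remove_stop_words is an illustrative name.

```python
# Hypothetical sample of Bangla stop words; a real list would be much longer.
STOP_WORDS = frozenset({"এবং", "ও", "কিন্তু", "এই"})


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Return the tokens that are not stop words, preserving their order."""
    return [token for token in tokens if token not in stop_words]


filtered = remove_stop_words(["খবর", "এবং", "মাইকেল"])
```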
Root words:
The original form of a word, obtained after stripping
its prefixes and suffixes, is known as the root
Cont…
33
POS Tagging:
After the root word is found (stemming), each
element will be placed into one of a set of
previously defined classes. In this way, each
word will be tagged with its Part-Of-Speech (POS).
Tokens Classification:
Token classification means categorizing the
tokens obtained from the above tasks into
some pre-defined classes.
The classes we will mainly consider are
Title, Surname, Collocation, Punctuation,
Abbreviation, Number, Date, Unknown and
Foreign word.
Current Status Of Our Work:
34 Snapshot1: system Interface
Cont…
35
Snapshot 2: After Loading Using Load
Button
Cont…
36 Snapshot 3: After getting tokens from
Cont…
37
Snapshot4: Tokens after removing Stop-
words
Cont…
38
Snapshot3: After execution words are split and stored in excel file.
Future Works To Be Done:
39
 Stemming i.e., Finding Root Words.
 POS Tagging.
 Classification
Conclusions:
40
Although tokenizing is a fundamental task in
language processing, the richness of Bengali
grammar and the structure of Bengali text make it
a difficult task for the Bengali language. Stemming
is also difficult. To build an effective Bangla
tokenizer, one must have extensive knowledge of
Bengali grammar. We hope to develop a system that
overcomes the difficulties and limitations of the
existing Bangla tokenizer, produces tokens
efficiently, and finally classifies them.
References:
41
[1] Wikipedia
[2] Aasish Pappu and Ratna Sanyal, “Vaakkriti:
Sanskrit Tokenizer”, Indian Institute of Information
Technology, Allahabad (U.P.), India.
[3] Firoj Alam, S. M. Murtoza Habib, Mumit Khan
“Text Normalization system for Bangla” Center for
research on Bangla Language Processing,
Department of Computer Science and Engineering,
BRAC University, Bangladesh.
[4] Goutam Kumar Saha, “Parsing Bengali Text - an
Intelligent Approach” Scientist-F, Centre for
Development of Advanced Computing, (CDAC),
Kolkata.
Cont…
42
[5] “Magic of ASP.Net with C#” by Kumar Sanjeeb and
Shibi Panikkar.
[6] www.C-sharpcorner.com
[7] “Overview of Stemming Algorithms” Ilia Smirnov
http://the-smirnovs.org/info/stemming.pdf.
[8] “Recognizing Bangla grammar using predictive
parser”, by K. M. Azharul Hasan, Al-Mahmud, Amit
Mondal, Amit Saha. Department of Computer Science
and Engineering (CSE) Khulna University of
Engineering and Technology (KUET) Khulna-9203,
Bangladesh.
[9] “Model for Sindhi Text Segmentation into Word
Tokens” J. A. MAHAR, H. SHAIKH*, G. Q. MEMON
Faculty of Engineering, Science and Technology,
Cont…
43
[11] “COMPARISON OF DIFFERENT POS TAGGING
TECHNIQUES FOR SOME SOUTH ASIAN
LANGUAGES” by Fahim Muhammad Hasan, BRAC
University,Dhaka,Bangladesh.
[12] “Design of a Rule-based Stemmer for Natural
Language Text in Bengali” by Sandipan Sarkar, IBM
India and Sivaji Bandyopadhyay, Computer Science
and Engineering Department, Jadavpur University,
Kolkata.
[13] “A Light Weight Stemmer for Bengali and Its Use in
Spelling Checker” by Md. Zahurul Islam, Md. Nizam
Uddin and Mumit Khan, Center for Research on
Bangla Language Processing, BRAC University,
Dhaka, Bangladesh.
[14] “Yet Another Suffix Stripper” by PRASENJIT
MAJUMDER, MANDAR MITRA, SWAPAN K. PARUI,
44
Thank
You

8251 universal synchronous asynchronous receiver transmitterShivangiSharma879191
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024Mark Billinghurst
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerAnamika Sarkar
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catcherssdickerson1
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...VICTOR MAESTRE RAMIREZ
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfme23b1001
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidNikhilNagaraju
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...121011101441
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024hassan khalil
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvLewisJB
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction managementMariconPadriquez1
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort servicejennyeacort
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .Satyam Kumar
 

Kürzlich hochgeladen (20)

INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETEINFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
INFLUENCE OF NANOSILICA ON THE PROPERTIES OF CONCRETE
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsyncWhy does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
Why does (not) Kafka need fsync: Eliminating tail latency spikes caused by fsync
 
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdfCCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
CCS355 Neural Network & Deep Learning Unit II Notes with Question bank .pdf
 
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptxDecoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
Decoding Kotlin - Your guide to solving the mysterious in Kotlin.pptx
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube ExchangerStudy on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
Study on Air-Water & Water-Water Heat Exchange in a Finned Tube Exchanger
 
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor CatchersTechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
TechTAC® CFD Report Summary: A Comparison of Two Types of Tubing Anchor Catchers
 
Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...Software and Systems Engineering Standards: Verification and Validation of Sy...
Software and Systems Engineering Standards: Verification and Validation of Sy...
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
young call girls in Rajiv Chowk🔝 9953056974 🔝 Delhi escort Service
 
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCRCall Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
Call Us -/9953056974- Call Girls In Vikaspuri-/- Delhi NCR
 
main PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfidmain PPT.pptx of girls hostel security using rfid
main PPT.pptx of girls hostel security using rfid
 
Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...Instrumentation, measurement and control of bio process parameters ( Temperat...
Instrumentation, measurement and control of bio process parameters ( Temperat...
 
Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024Architect Hassan Khalil Portfolio for 2024
Architect Hassan Khalil Portfolio for 2024
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
computer application and construction management
computer application and construction managementcomputer application and construction management
computer application and construction management
 
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort serviceGurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
Gurgaon ✡️9711147426✨Call In girls Gurgaon Sector 51 escort service
 
Churning of Butter, Factors affecting .
Churning of Butter, Factors affecting  .Churning of Butter, Factors affecting  .
Churning of Butter, Factors affecting .
 

NLP

  • 1. POS Tagging and Token Classification Using a Bangla Tokenizer. Presented by Sujit Kumar Das, M.Tech 3rd sem, IT, Roll-021413 No-363202205. Under the supervision of Mr. Sourish Dhar, Asst. Professor, Dept. of IT, Assam University.
  • 2. Contents: Introduction; Literature Survey; Our Proposal; Future Works To Be Done; Conclusions; References.
  • 3. Introduction: What is NLP? A field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages [1]. NLP provides a means of analyzing text. The goal of NLP is to make computers analyze and understand the languages that humans use naturally.
  • 4. Why Natural Language Processing? Computers do not “see” English text the way we do. People have no trouble understanding language, but computers do, because they have no common-sense knowledge and no reasoning capacity.
  • 5. What do we need in an NLP task? Knowledge about language; knowledge about the world; a way to combine knowledge sources.
  • 6. Language Technology. Mostly solved: Spam Detection, POS Tagging, Named Entity Recognition. Making good progress: Sentiment Analysis, Word Sense Disambiguation, Parsing, Machine Translation. Still really hard: Question Answering, Paraphrase, Summarization, Dialog.
  • 7. POS Tagging: Input: The grand jury commented on a number of other topics. Output: The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./. NE Recognition: Input: Dan went to London to attend a conference on NLP in 2012. Output: the same sentence with its entities marked: Name = Dan, Location = London, Date = 2012.
  • 8. What Is Tokenization? Tokenization is the process of breaking a stream of text up into words, phrases, symbols and other meaningful elements called tokens. Token: a sequence of characters that can be treated as a single logical entity. Typical tokens in natural languages: words, numbers, abbreviations, symbols. Typical tokens in programming languages: identifiers, keywords, operators, special symbols, constants.
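The definition of tokenization above can be illustrated with a minimal sketch. It is not the tokenizer described in these slides: the pattern `TOKEN_PATTERN` and the function `tokenize` are hypothetical names, and the rule (a run of word characters is one token, any other non-space character is a token by itself) is an assumption that only suits simple English input. Bengali vowel signs are combining marks, so a Bengali tokenizer would likely need a different pattern.

```python
import re

# Hypothetical rule: a run of word characters (letters, digits, underscore)
# forms one token; any other non-space character is a token by itself.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word/number tokens and single-character symbol tokens."""
    return TOKEN_PATTERN.findall(text)


# The example sentence from the POS tagging slide.
tokens = tokenize("The grand jury commented on a number of other topics.")
print(tokens)
```

Whitespace is discarded rather than emitted as a token, which matches the slide's view of tokens as meaningful elements only.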
  • 9. What Is a Tokenizer? The job of a tokenizer is to break up a stream of text into tokens. Why a tokenizer? It performs a crucial task in pre-processing any natural language. It helps handle semantic issues in the subsequent stages of machine translation. It produces a structural description of an input sentence. For language modeling, dividing the input text into tokens is compulsory [9].
  • 10. What is Token Classification? Token classification means identifying each token (word/term) in a document and classifying it into one of a set of predefined categories. These predefined categories can be the name of a person, symbols, punctuation, abbreviations, numbers, dates, etc.
  • 11. Steps in Token Classification: Tokenize the given input text. Assign to each token the class (or tag) that it belongs to. For example (token: class): মাইকেল: Name; ৪৫: Number; খবর: Word.
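The second step, assigning a class to each token, might be sketched as follows. This is an illustrative rule set, not the classifier the slides propose: `KNOWN_NAMES` is a hypothetical one-entry name list standing in for a real gazetteer, and the fallback class "Word" for anything unrecognized is an assumption.

```python
import unicodedata

# Hypothetical name list; a real system would need a much larger gazetteer.
KNOWN_NAMES = {"মাইকেল"}
BENGALI_DIGITS = set("০১২৩৪৫৬৭৮৯")


def classify_token(token):
    """Return a coarse class label for a single token."""
    if token in KNOWN_NAMES:
        return "Name"
    if token and all(ch in BENGALI_DIGITS for ch in token):
        return "Number"
    # Unicode general categories starting with "P" are punctuation.
    if token and all(unicodedata.category(ch).startswith("P") for ch in token):
        return "Punctuation"
    return "Word"  # assumed fallback class


# The three example tokens from the slide.
examples = ["মাইকেল", "৪৫", "খবর"]
labels = [classify_token(token) for token in examples]
for token, label in zip(examples, labels):
    print(token, label)
```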
  • 12. Why Bengali Language Processing: It is one of the top ten spoken languages in the world. There has been a lack of research work so far. Challenges in Bengali Language Processing: Its grammatical vastness. It is not as well structured as Western languages (for example, English).
  • 13. Goals of Bengali Language Processing: To develop technology and standards to make computer usage Bangla-enabled. To establish standards for Bangla text processing to ensure interoperability across platforms. To develop a large standardized corpus for Bangla text and speech. To create an ensemble of available Bangla software and corpora in a standardized form and make them easily available to all. To develop new software and modify or enhance the existing software. To develop suitable speech technology for Bangla.
  • 14. Literature Survey: A tokenizer is a component of a parser. Parsing natural language text is more difficult than parsing computer languages (as done by a compiler or word processor), because the grammars of natural languages are complex and ambiguous and the vocabulary is unbounded [8]. Natural language applications, namely Information Extraction, Machine Translation, and Speech Recognition, need an accurate parser [8]. A tokenizer plays a significant part in a parser by identifying groups of words that exist as a single, complex word in a sentence. Later on, it breaks up the complex word into its
  • 15. Related Works: Some existing standard tokenizers: Stanford Tokenizer for the English language [10]; Shallow Tokenizer for the Bengali language; Vaakkriti Tokenizer for the Sanskrit language [2]. These tokenizers were developed for particular languages only, i.e., no tokenizer works for all languages.
  • 16. Stanford Tokenizer: Developed mainly for the English language and later extended to Arabic, Chinese and Spanish. It was developed in Java. Online interface: [screenshot]
  • 17. Results after parsing: S = sentence; NP = noun phrase; NNS = noun, plural; VP = verb phrase; VBZ = verb, 3rd person singular present; VBN = verb, past participle; PP = prepositional phrase; TO = to; IN = preposition or subordinating conjunction.
  • 18. Shallow Bangla Tokenizer: The shallow parser gives the analysis of a sentence in terms of morphological analysis, POS tagging, and chunking. Apart from the final output, the intermediate output of individual modules is also available.
  • 21. Bengali Stemmers: A Rule-Based Stemmer for Bengali Language by Sandipan Sarkar, IBM, and Sivaji Bandyopadhyay, Jadavpur University [12]. A lightweight stemmer for Bengali, which was used in a spelling checker, by Md. Zahurul Islam, Md. Nizam Uddin and Mumit Khan, CRBLP, BRAC University, Dhaka, in 2007 [13]. Yet Another Suffix Stripper, which uses a clustering-based approach built on string distance measures and requires no linguistic knowledge, by P. Majumder, Gobinda Kole, ISI, Pabitra Mitra, IIT, and Kalyankumar Dutta, Jadavpur University in
  • 22. Comparison of Three Stemmers (stemmer: method used, accuracy): Rule-Based: orthographic syllable, 89.0%; Lightweight: longest-match basis, 90.8%; YASS: string distance measure, 88.0%.
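To make the "longest-match basis" entry in the comparison concrete, here is a minimal sketch of longest-match suffix stripping in general. It is not the lightweight stemmer of [13]: the suffix list `SUFFIXES` is a deliberately tiny, hypothetical sample, and the minimum stem length `MIN_STEM_LENGTH` is an assumed safeguard.

```python
# Hypothetical, deliberately tiny suffix list for illustration only.
SUFFIXES = ["গুলি", "দের", "রা", "কে"]
MIN_STEM_LENGTH = 2  # assumed lower bound so short words are not emptied


def stem(word):
    """Strip the longest matching suffix, if enough of the word remains."""
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: len(word) - len(suffix)]
    return word


# Inflected form mapped to the stem this sketch is expected to return.
examples = {"বইগুলি": "বই", "ছেলেরা": "ছেলে", "ছেলেদের": "ছেলে"}
for word in examples:
    print(word, "->", stem(word))
```

Lengths here are counted in Unicode code points, not visible characters, which is a simplification for Bengali script.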
  • 23. POS Tagger: Supervised POS tagging: uses pre-tagged corpora for training, to learn information about the tagset, word-tag frequencies, rule sets, etc. [11], e.g., N-gram, Maximum Entropy Model (ME), Hidden Markov Model (HMM). Unsupervised POS tagging: does not require a pre-tagged corpus; it uses advanced computational methods to automatically induce tagsets, e.g., Brill, Baum-Welch algorithm [11].
  • 24. Supervised Model POS Taggers Comparison (tagger: applied method): Unigram (N=1): most-likely-tag approach. HMM: tags one sentence at a time, using the formula P(word | tag) * P(tag | previous n tags). Bigram (N=2): same as unigram, but also considers the tag of the previous word.
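The unigram "most likely" approach in this comparison can be sketched as below. This is an illustration under stated assumptions, not the tagger evaluated in [11]: the function names `train_unigram` and `tag` are hypothetical, the training data is just the one tagged sentence from the POS tagging slide, and the default tag `NN` for unseen words is an assumption.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sentences):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag(words, model, default_tag="NN"):
    """Tag each word with its most likely tag; unseen words get default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]


# Training data: the single tagged sentence shown on the POS tagging slide.
training = [[("The", "DT"), ("grand", "JJ"), ("jury", "NN"),
             ("commented", "VBD"), ("on", "IN"), ("a", "DT"),
             ("number", "NN"), ("of", "IN"), ("other", "JJ"),
             ("topics", "NNS"), (".", ".")]]
model = train_unigram(training)
tagged = tag(["The", "jury", "commented", "."], model)
print(tagged)
```

A unigram tagger ignores context entirely; the bigram and HMM taggers in the comparison add the previous tag or tags on top of this word-level evidence.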
  • 25. Tagger accuracy on the Bangla SPSAL corpus and tagset. Test data: 400 sentences, 5225 tokens from the SPSAL test corpus [11].
      Sentences  Tokens  Unigram accuracy (%)  Bigram accuracy (%)  HMM accuracy (%)
      87         1002    28.6                  28.6                 39.3
      304        4003    42.4                  41.9                 49.7
      532        8026    48.1                  47.9                 53.6
      677        10001   49.8                  49.5                 54.3
  • 26. Problem Domain: Bangla is very rich in inflections, vibhaktis (suffixes) and karakas, and these are often ambiguous. It is not easy to provide the semantic and world knowledge that we humans often use when we parse and understand Bangla sentences. So, mainly because of this grammatical vastness, designing a Bangla tokenizer is not an easy task.
  • 28. Bengali Grammar: Genders. There are four genders in Bengali grammar: 1. Pung lingo (masculine) 2. Stree lingo (feminine) 3. Ubha lingo (common) 4. Klib lingo (neuter)
  • 29. Bengali Grammar: Numbers. Like English, Bengali has two numbers. Singular: used for a single object or person, e.g., a man, a girl. Plural: used for more than one object or person, e.g., two men, mangoes.
  • 30. Our Proposal: We are going to develop a system that can be used to tokenize Bengali text and that will also be able to solve the problem of token classification. [Fig: Our Model. Labels: raw (unstructured) text; pre-processing; part-of-speech tagging; token classification; annotated (structured) text; natural language processing]
  • 31. Flow Chart: Input → Words → Stop Words Removal → Stemming → POS Tag → Classify Text
  • 32. Input: The input will be a Bengali text. Words (completed): The text is split into words after removing all non-character symbols and white space, and the words are then stored in an Excel file. Stop Words Removal (completed): Stop words are the frequently occurring words that do not contribute relevant information to the text classification task. Root words: After prefixes and suffixes are stripped from a word, the original form of the word that remains is known as root
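The two completed steps, splitting the text into words and removing stop words, could look roughly like this. It is a sketch, not the system shown in the snapshots: the pattern `BENGALI_WORD` assumes that a word is a run of characters from the Bengali Unicode block (U+0980 to U+09FF, which also contains the Bengali digits), `STOP_WORDS` is a hypothetical three-word sample, and writing the result to an Excel file is left out.

```python
import re

# Assumption: a word is a run of characters from the Bengali Unicode block.
# The danda "।" lies outside this block, so it is dropped along with spaces.
BENGALI_WORD = re.compile(r"[\u0980-\u09FF]+")

# Hypothetical sample; a real stop-word list would be much longer.
STOP_WORDS = {"এবং", "ও", "কিন্তু"}


def split_words(text):
    """Return the Bengali words in text, dropping spaces and punctuation."""
    return BENGALI_WORD.findall(text)


def remove_stop_words(words):
    """Keep only the words that are not in the stop-word list."""
    return [word for word in words if word not in STOP_WORDS]


sample = "আমি এবং তুমি বই পড়ি।"
words = split_words(sample)
filtered = remove_stop_words(words)
print(words)
print(filtered)
```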
  • 33. POS Tagging: After the root word is found (stemming), each element is placed into one of a set of previously generated classes. In this way each word is tagged with its part of speech (POS). Token Classification: After the tokens are obtained from the tasks above, they are categorized into predefined classes. The classes we consider are mainly Title, Surname, Collocation, Punctuation, Abbreviation, Number, Date, Unknown and Foreign word.
  • 34. Current Status of Our Work: Snapshot 1: System interface
  • 35. Snapshot 2: After loading using the Load button
  • 36. Snapshot 3: After getting tokens from
  • 37. Snapshot 4: Tokens after removing stop words
  • 38. Snapshot 3: After execution, words are split and stored in an Excel file.
  • 39. Future Works To Be Done: Stemming, i.e., finding root words. POS tagging. Classification.
  • 40. Conclusions: Although tokenizing is a fundamental task in language processing, the richness of Bengali grammar and the structure of Bengali text make it a difficult task for the Bengali language. Stemming is also difficult. To build an effective Bangla tokenizer, one must have extensive knowledge of Bengali grammar. We hope to develop a system that overcomes the difficulties and limitations of existing Bangla tokenizers, produces tokens efficiently, and finally classifies those tokens.
  • 41. References: [1] Wikipedia. [2] Aasish Pappu and Ratna Sanyal, “Vaakkriti: Sanskrit Tokenizer”, Indian Institute of Information Technology, Allahabad (U.P.), India. [3] Firoj Alam, S. M. Murtoza Habib, Mumit Khan, “Text Normalization System for Bangla”, Center for Research on Bangla Language Processing, Department of Computer Science and Engineering, BRAC University, Bangladesh. [4] Goutam Kumar Saha, “Parsing Bengali Text - an Intelligent Approach”, Scientist-F, Centre for Development of Advanced Computing (CDAC), Kolkata.
  • 42. [5] Kumar Sanjeeb and Shibi Panikkar, “Magic of ASP.Net with C#”. [6] www.C-sharpcorner.com [7] Ilia Smirnov, “Overview of Stemming Algorithms”, http://the-smirnovs.org/info/stemming.pdf. [8] K. M. Azharul Hasan, Al-Mahmud, Amit Mondal, Amit Saha, “Recognizing Bangla grammar using predictive parser”, Department of Computer Science and Engineering (CSE), Khulna University of Engineering and Technology (KUET), Khulna-9203, Bangladesh. [9] J. A. Mahar, H. Shaikh, G. Q. Memon, “Model for Sindhi Text Segmentation into Word Tokens”, Faculty of Engineering, Science and Technology,
  • 43. [11] Fahim Muhammad Hasan, “Comparison of Different POS Tagging Techniques for Some South Asian Languages”, BRAC University, Dhaka, Bangladesh. [12] Sandipan Sarkar (IBM India) and Sivaji Bandyopadhyay (Computer Science and Engineering Department, Jadavpur University, Kolkata), “Design of a Rule-based Stemmer for Natural Language Text in Bengali”. [13] Md. Zahurul Islam, Md. Nizam Uddin and Mumit Khan, “A Light Weight Stemmer for Bengali and Its Use in Spelling Checker”, Center for Research on Bangla Language Processing, BRAC University, Dhaka, Bangladesh. [14] Prasenjit Majumder, Mandar Mitra, Swapan K. Parui, “Yet Another Suffix Stripper”,