Sanskrit Parser Report
SANSKRIT LANGUAGE PARSER
Akash Bhargava - 10UCS002
Ashok Kumar - 10UCS010
Laxmi Kant Yadav - 10UCS027
Vijay Kumar Gupta - 10UCS057
COMPUTER SCIENCE & ENGINEERING DEPARTMENT
NATIONAL INSTITUTE OF TECHNOLOGY, AGARTALA
INDIA-799055
MAY, 2014
SANSKRIT LANGUAGE PARSER
Dissertation submitted to
National Institute of Technology, Agartala
for the award of the degree
of
Bachelor of Technology
by
Akash Bhargava - 10UCS002
Ashok Kumar - 10UCS010
Laxmi Kant Yadav - 10UCS027
Vijay Kumar Gupta - 10UCS057
Under the Guidance of
Mr. Nikhil Debbarma
Assistant Professor, CSE Department, NIT Agartala, India
COMPUTER SCIENCE & ENGINEERING DEPARTMENT
NATIONAL INSTITUTE OF TECHNOLOGY AGARTALA
MAY, 2014
DISSERTATION APPROVAL SHEET
This dissertation entitled “Language Parser”, by Akash Bhargava, Enrollment Number 10UCS002;
Ashok Kumar, Enrollment Number 10UCS010; Laxmi Kant Yadav, Enrollment Number 10UCS027;
Vijay Kumar Gupta, Enrollment Number 10UCS057, is approved for the award of Bachelor of
Technology in Computer Science & Engineering.
Nikhil Debbarma
Dissertation Supervisor
Assistant Professor
Computer Science & Engineering Department
NIT, Agartala
Paritosh Bhattacharya
Head Of Department
Professor
Computer Science & Engineering Department
NIT, Agartala
Date: 19.05.2014
Place: NIT, Agartala
DECLARATION
We declare that the work presented in this dissertation titled “Language Parser”,
submitted to the Computer Science & Engineering Department, National Institute
of Technology, Agartala, for the award of the Bachelor of Technology degree
in Computer Science & Engineering, represents our ideas in our own words, and
where others’ ideas or words have been included, we have adequately cited and
referenced the original sources. We also declare that we have adhered to all principles
of academic honesty and integrity and have not misrepresented, fabricated,
or falsified any idea/data/fact/source in our submission. We understand that any
violation of the above will be cause for disciplinary action by the Institute and can
also evoke penal action from the sources which have not been properly cited
or from whom proper permission has not been taken when needed.
MAY, 2014
Agartala
Akash Bhargava
10UCS002
Ashok Kumar
10UCS010
Laxmi Kant Yadav
10UCS027
Vijay Kumar Gupta
10UCS057
CERTIFICATE
This dissertation entitled “Language Parser”, by Akash Bhargava, Enrollment Number 10UCS002;
Ashok Kumar, Enrollment Number 10UCS010; Laxmi Kant Yadav, Enrollment Number 10UCS027;
Vijay Kumar Gupta, Enrollment Number 10UCS057, is approved for the award of Bachelor of
Technology in Computer Science & Engineering.
Nikhil Debbarma
Dissertation Supervisor
Assistant Professor
Computer Science & Engineering Department
NIT, Agartala
Suman Deb
Coordinator
Assistant Professor
Computer Science & Engineering Department
NIT, Agartala
Acknowledgement
We would like to take this opportunity to express our deep sense of gratitude to all who helped
us directly or indirectly during this project work. Firstly, we would like to thank our supervisor
Asst. Prof. Nikhil Debbarma and coordinator Asst. Prof. Suman Deb for being great mentors and
the best advisors we could ever have. Their advice, encouragement, and criticism have been a
source of innovative ideas and inspiration, and a cause of the successful completion of this
project. The confidence they showed in us was our biggest source of inspiration. It has been a
privilege working with them for the last two semesters on two different projects.
We are highly obliged to all the faculty members of the Computer Science and Engineering
Department for their support and encouragement. We also thank our Director Dr. Gopal Mugeraya
and Head of the CSE Department Asst. Prof. Paritosh Bhattacharya for providing excellent computing
and other facilities, without which this work could not have achieved its quality goal.
We would like to express our sincere appreciation and gratitude to Asst. Prof. Anupam
Jamatia for his support in preparing this project report in LaTeX. Finally, we are grateful to our
parents for their support. It would have been impossible for us to complete this project without
their love, blessings, and encouragement.
-Akash Bhargava, Ashok Kumar, Laxmi Kant Yadav, Vijay Kumar Gupta
Dedicated to
our loving families, for their kind love and support, and
to our Project Supervisor Asst. Prof. Nikhil Debbarma and our Project Coordinator
Asst. Prof. Suman Deb, for sharing valuable knowledge, for their encouragement,
and for showing confidence in us at all times.
Abstract
Parsing, or syntactic analysis, is the process of analysing a string of symbols, either in natural
language or in computer languages, according to the rules of a formal grammar. The term
parsing comes from Latin pars (orationis), meaning part of speech. Traditional sentence parsing
is often performed as a method of understanding the exact meaning of a sentence, sometimes
with the aid of devices such as sentence diagrams. It usually emphasizes the importance of
grammatical divisions such as subject and predicate.
According to many researchers, Sanskrit is a very scientific language that behaves much
like a programming language. So if we are able to build a translator that translates
Sanskrit into another language, it would prove to be a significant development in the field of
NLP (Natural Language Processing).
In this project we try to parse a Sanskrit sentence so that it can later be translated into
another language with relative ease. We take a Sanskrit sentence or paragraph as input,
tokenize the whole sentence (lexical analysis), recognize the part of speech of each
individual token (tagging), and then parse the sentence to extract its grammatical structure (parsing).
Chapter 1
Introduction
1.1 Purpose
In this project we try to parse a Sanskrit sentence so that it can later be translated into
another language with relative ease.
1.2 Scope
The ability to parse a Sanskrit sentence into an English sentence.
1.3 Basis
We will first introduce some concepts and then employ them:
CSED, NIT Agartala
• Lexical Analysis
• Parsing
• Advantages of using Sanskrit
• Approach
1.4 Overview
This design document is divided into five major sections.
Section 1 is an introduction that provides information about the document itself.
Section 2 is an overview of the application and its primary functionality.
Section 3 identifies the assumptions and constraints followed during the design of the software.
Section 4 documents the overall system architecture.
Section 5 provides the detailed design information for every subsystem and component in the
current delivery.
1.5 Objective
In this project we try to parse a Sanskrit sentence so that it can later be translated into
another language with relative ease. Here we describe a machine translation technique for
translating a Sanskrit sentence into an English sentence.
1.6 About The Project
• Machine translation has been defined as the process that utilizes computer software to
translate text from one natural language to another. It is one of the most important
applications of Natural Language Processing.
• It helps people from different places to understand an unknown language without the aid
of a human translator.
• The language to be translated is the Source Language (SL); the language into which the
source language is translated is the Target Language (TL).
• The major machine translation techniques are Rule-Based Machine Translation (RBMT),
Statistical Machine Translation (SMT), and Example-Based Machine Translation
(EBMT).
• One of the most effective techniques for machine translation is Rule-Based Machine
Translation.
• In India, several machine translation systems have been implemented: AnglaUrdu
(AnglaHindi-based, English to Urdu), HindiAngla (Hindi to English), the English-Assamese
Machine Translation System, MaTra (a human-aided machine translation system),
AnglaHindi (an English to Hindi machine-aided translation system), and the AnglaBharti
technology for machine-aided translation from English to Indian languages.
• Machine translation from Sanskrit is never an easy task because of the structural vastness of
its grammar, but the grammar is well organized and less ambiguous compared to other
natural languages.
• The Sanskrit sentence is the input to our first module, the lexical parser, which generates
a parse tree using semantic relationships.
• This parse tree acts as the input to the second module, the semantic mapper, where each
Sanskrit semantic word is mapped to the corresponding English semantic word.
1.7 Drawbacks
Some of the most notable drawbacks of the project:
• This project is about parsing one language into another; it is not a pure translator.
• This project is platform dependent (here the platform is Linux).
• It is a database-oriented project, not one that uses an online approach.
1.8 Study of the Project
The project provides the facility for users to give input in the Sanskrit language and converts
(parses) it into the English language. We use some predefined methods for parsing:
• We first tokenize the input using strtok(str, " ").
• Each token can be one of 3 types: noun, verb, or preposition. The task is to identify these
tokens, which is done by matching against an indexed database.
• Each token is stored in a structure along with its meaning and its morphological information.
• Then the parser comes into play and forms a tree-like structure using these tokens.
One of the major approaches to machine translation is rule-based machine translation (RBMT,
also known as the rational approach). Rule-based translation consists of:
1. Analyzing an input sentence of the source language syntactically and/or semantically.
2. Generating an output sentence of the target language based on the internal structure;
each process is controlled by the dictionary and the rules.
• The strength of the rule-based method is that the information can be obtained through
introspection and analysis.
• The weakness of the rule-based method is that the accuracy of the entire process is the
product of the accuracies of each sub-stage.
Chapter 2
System Requirement Specification
2.1 Compiler Phases
A compiler operates in phases, and each phase transforms the source program from one
representation to another. A compiler has six phases:
• Lexical Analyzer
• Syntax Analyzer
• Semantic Analyzer
• Intermediate code generation
• Code optimization
• Code Generation
The symbol table and error handler interact with all six phases. Some of the phases may be
grouped together.
Figure 2.1: Phases of the Compiler
2.1.1 Lexical Analysis Phase
The lexical phase reads the characters in the source program and groups them into a stream
of tokens in which each token represents a logically cohesive sequence of characters, such as
an identifier, a keyword, or a punctuation character. The character sequence forming a token is
called the lexeme for the token. The semantic standard representation was designed to provide a
simple description of the grammatical relationships in a sentence that can easily be understood
and effectively used by people without linguistic expertise who want to extract textual relations.
The sentence relationships are represented uniformly as semantic standard relations between
pairs of words.
Figure 2.2: Lexical Analyzer
2.1.2 Semantic Analysis Phase
This phase checks the source program for semantic errors and gathers type information for the
subsequent code-generation phase. It uses the hierarchical structure determined by the
syntax-analysis phase to identify the operators and operands of expressions and statements. An
important component of semantic analysis is type checking.
2.1.3 Intermediate Code Generation
The syntax and semantic analysis phases generate an explicit intermediate representation of the
source program. The intermediate representation should have two important properties:
• It should be easy to produce.
• It should be easy to translate into the target program.
The intermediate representation can have a variety of forms. One of these is three-address
code, which is like an assembly language for a machine in which every location can act as a
register. Three-address code consists of a sequence of instructions, each of which has at most
three operands.
2.1.4 Code Optimization
The code optimization phase attempts to improve the intermediate code so that faster-running
machine code will result.
2.1.5 Code Generation
The final phase of the compiler is the generation of target code, consisting normally of
relocatable machine code or assembly code. Memory locations are selected for each of the variables
used by the program. Then each intermediate instruction is translated into a sequence of
machine instructions that perform the same task.
2.2 Parsing Methods
In the compiler model, the parser obtains a string of tokens from the lexical analyser and verifies
that the string can be generated by the grammar for the source language. The parser reports any
syntax error in the source language. There are two types of parsing methods: top-down and
bottom-up. "Top-down" is pretty much self-explanatory: from left to right, we drill down
through each non-terminal until we get to a terminal. We also build our tree from the root node
down to the leaves in a top-down fashion. It is important to note that we drill down from left
to right, replacing the leftmost non-terminal first; the definitive meaning of top-down parsing
is an attempt to find a leftmost derivation. In bottom-up parsing we are doing a rightmost
derivation in reverse, where we replace the rightmost non-terminal first.
There are three general types of parsers for grammars. Universal parsing methods such as
the Cocke-Younger-Kasami algorithm and Earley's algorithm can parse any grammar, but these
methods are too inefficient to use in production compilers. The methods commonly used in compilers
are classified as either top-down or bottom-up parsing. Top-down parsers build parse
trees from the top (root) to the bottom (leaves); bottom-up parsers build parse trees from the
Figure 2.3: Parsing Step
leaves and work up to the root. In both cases the input to the parser is scanned from left to right,
one symbol at a time. The output of the parser is some representation of the parse tree for the
stream of tokens. There are a number of tasks that might be conducted during parsing, such as:
• Collecting information about various tokens into the symbol table.
• Performing type checking and other kinds of semantic analysis.
• Generating intermediate code.
Algorithm for Parsing an English sentence
1. Tokenize the sentence into various tokens, i.e. a token list.
2. To find the relationship between tokens we use dependency grammar and binary
relations for our Sanskrit language. The token list acts as an input to the semantic class to
represent the semantic standard.
3. The semantic class generates a tree; we have a class Tree Transform which creates this tree.
2.3 Grammar
Grammar provides a precise way to specify the syntax (the structure or arrangement of composing
units) of a language. In grade school we take grammar lessons that teach us to speak and write
proper English. They teach us the correct way to form sentences with subjects, predicates,
noun phrases, verb phrases, etc. Subjects, predicates, and phrases are some of the composing
units of a sentence in English; similarly, if/else statements, assignment statements, and function
definitions are some of the composing units of source code, which itself is a single sentence of
a particular programming language. There are a very large number of valid English sentences
one could compose; likewise, there are a large (probably infinite) number of valid source code
programs one could create. If someone says "on the computer she is," we immediately recognize
that the sentence is ill-formed. Its structure is invalid, because the noun phrase should precede
the verb phrase: it should be "She is on the computer." If we take a look at a sentence-diagramming
article, we'll see that the model is exactly like an AST. So it goes without saying that parsing, or
more formally "syntactic analysis," has its roots in linguistics. Moreover, just as in English,
programming languages need to be specified in a way that allows us to verify whether a sentence
of the language is valid. That's where context-free grammars (CFGs) come into play; they
allow us to specify the syntax of a programming language's source code.
Sanskrit verbs: There are 10 types of verb declension forms. One example for the root
word bhava is given here (only present, past, and future).
Figure 2.10: Noun
2.4 Makefile
The GNU make utility maintains groups of programs. The purpose of the make utility is to
determine automatically which pieces of a large program need to be recompiled, and to issue the
commands to recompile them. To prepare to use make, you must write a file called the makefile
that describes the relationships among the files in your program and states the commands
for updating each file. In a program, typically the executable file is updated from object files,
which are in turn made by compiling source files. Once a suitable makefile exists, each time you
change some source files, running the make command will bring the program up to date. By
default, make looks for a file named makefile (or Makefile) in the current directory; the -f option
is needed only to process a file with a different name.
make clean: "make clean" deletes any files generated by previous attempts, leaving you with
clean source code.
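A minimal makefile for a project like this one might look as follows. The file names are illustrative, not the project's actual layout:

```makefile
# Link the parser from its object files.
parser: lexer.o parser.o
	g++ -o parser lexer.o parser.o

# Compile each C++ source file to an object file.
%.o: %.cpp
	g++ -c $< -o $@

# Remove generated files, leaving clean source code.
clean:
	rm -f parser *.o

.PHONY: clean
```

Running make rebuilds only the object files whose sources have changed, then relinks the executable.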
Chapter 3
System Design
3.1 Spiral Model
The diagrammatic representation of the spiral model of software development appears like a
spiral with many loops. The exact number of loops in the spiral is not fixed; each loop of the
spiral represents a phase of the software process. This model is much more flexible than other
models, since the exact number of phases through which the product is developed is not fixed.
Each phase in this model is split into four sectors, as shown in the figure. The first quadrant
identifies the objectives of the phase and the alternative solutions possible for the phase under
consideration. During the second quadrant, the alternative solutions are evaluated to select the
best solution possible. The spiral model provides direct support for coping with project risks.
Activities during the fourth quadrant concern reviewing the results of the stages traversed
so far with the customer and planning the next iteration around the spiral. It is viewed as a
meta model, since it subsumes all the previously discussed models. The spiral model uses a
prototyping approach by first building a prototype before embarking on the actual product
development effort. Also, the spiral model can be considered as supporting the evolutionary model: the iterations
Figure 3.1: Spiral Model
along the spiral can be considered as evolutionary levels through which the complete
system is built. This enables the developer to understand and resolve the risks at each
evolutionary level. The spiral model uses prototyping as a risk-reduction mechanism and also retains
the systematic step-wise approach of the waterfall model.
3.2 Input Stages
The main input stages can be listed as below:
• Data supply
• Data transaction
• Data synchronization
• Data verification
• Data validation
• Data correction
3.3 Input Types
It is necessary to determine the various types of inputs. Inputs can be categorized as follows:
• External inputs, which are prime inputs for the system.
• Internal inputs, which are user communications with the system.
• Interactive inputs, which are inputs entered during a dialogue.
3.4 Input Media
At this stage a choice has to be made about the input media. To decide on the input media,
consideration has to be given to:
• Type of input
• Flexibility of format
• Speed
• Accuracy
• Ease of correction
• Ease of use
• Portability
3.5 Data Flow Diagram
3.6 Output Design
Outputs from computer systems are required primarily to communicate the results of processing
to users. They are also used to provide a permanent copy of the results for later consultation. The
various types of outputs are:
• External outputs, whose destination is the file named temp.
• Internal outputs, whose destination is within the organization; they are the user's main
interface with the Linux system.
• Operational outputs, whose use is purely within the Android mobile department.
• Interface outputs, which involve the user in communicating directly with the system.
Chapter 4
Implementation & Screen shots
We can see a trend in programming languages, which are moving from machine-level to
high-level to human-level languages: see how they have moved from assembly > C > C++ > Java >
Ruby. And this will not stop until something entirely human-like is created. The scope for Sanskrit
to become a computer language lies in the library system. When you compile code in C, it patches
your code with some predefined libraries; e.g., calling strcmp(string1, string2) is the best way
to compare strings because it links library code into your executable. Libraries are written in
assembly language and highly optimized. So if you have all the libraries with you, why do you
need C? Why can't we just say GO AND OPEN THE DOOR and expect the computer to understand
it and do it in a highly optimized way? The onus lies with an intelligent interpreter. Sanskrit is a
language where letters have meanings; they do not need to form words to transmit emotions or
information. Composition of letters into words again changes their meaning; yes, something like
OOP. E.g., ANU is a particle and PARMANU is a nanoparticle. To be a programming language,
consistency is needed, and it is there in Sanskrit. We will explore further in future how Sanskrit
can be adapted to be a human computer language. Sanskrit is not a descriptive language: you do
not need to write paragraphs to explain. When you translate something into Sanskrit, its size will
reduce. It is precise, crisp, and clear.
4.1 Parser
Parsing is the de-linearization of linguistic input; that is, the use of grammatical rules and
other knowledge sources to determine the functions of words in the input sentence. Getting an
efficient and unambiguous parse of natural languages has been a subject of wide interest in the
field of artificial intelligence over the past 50 years. A parser breaks data into smaller elements
according to a set of rules that describe its structure. Parsing is the process of analysing a text,
made of a sequence of tokens (for example, words), to determine its grammatical structure with
respect to a given grammar.
The steps to generate a parse tree are as follows:
1. The input is an English sentence.
2. The lexical analyzer creates tokens.
3. The generated tokens act as input to the semantic analyzer.
4. The output is a parse tree.
4.1.1 Parsing Methods
As described in Section 2.2, there are two types of parsing methods: top-down and bottom-up.
Top-down parsing attempts to find a leftmost derivation, drilling down from the root through
each non-terminal until reaching a terminal; bottom-up parsing performs a rightmost derivation
in reverse, replacing the rightmost non-terminal first.
• Bottom-Up Parsing
In bottom-up parsing the derivation starts from the string of terminals (our sentence), and
we try to derive the start symbol of our CFG. It is essentially a top-down derivation
backwards. Initially, instead of replacing a non-terminal with another non-terminal or terminal
(drilling down), we replace a terminal with a non-terminal (drilling up). At certain points
we may even replace several non-terminals with one non-terminal. Since the derivation
is the exact reverse of a leftmost derivation, we are then replacing non-terminals from
right to left (a rightmost derivation). When we make a replacement we create a node that
becomes the parent of some other node instead of its child.
• Top-Down Parsing
There are several problems with top-down parsing.
(1) Left-recursion can lead to infinite parsing loops, so it must be eliminated. Left
recursion in a CFG production occurs when the non-terminal on the left side appears first
on the right side of the arrow. There are simple algorithms to remove it, but the CFG
becomes twice as long in many cases.
(2) Top-down parsing may involve backtracking. Backtracking is the act of climbing back
up the derivation (the parse), reversing everything to try another derivation path. We end
up re-scanning the input as well. If we are inserting information into a symbol table as the
parse proceeds, everything has to be removed. The need for backtracking can be eliminated
by parsing with lookahead. Backtracking is not restricted to top-down parsers; there are
backtracking LR parsers as well.
Finally, (3) the order in which we choose non-terminal expansions can cause valid inputs
to be rejected without information as to why.
4.1.2 Ambiguity
Ambiguous grammars are those in which a string of the language has more than one parse tree.
This is problematic because it may be hard to interpret the intended meaning of the string.
Consider the C statement x*y;. It can be interpreted as the multiplication of two variables, x and
y, or as the declaration of a variable y whose type is a pointer to x. To resolve the conflict the
compiler must locate x's type information in the symbol table. If it is a numerical type, the
statement is interpreted as an expression. Generally speaking, ambiguity is an unwanted feature
of any grammar and may pose a threat to the correctness of both top-down and bottom-up parsers.
Different parsers handle it with varying efficacy. In spite of all this, ambiguity is not always
a problem. It is possible to generate a non-ambiguous language from an ambiguous grammar.
Even if there are two parse trees that generate a string, as long as it has one intended meaning
there is no problem. Some parser generators allow specifying precedence and associativity rules
to remove any ambiguity.
4.2 Implementation Steps
The following steps were used in developing this application:
4.2.1 The Lexer
The first step towards creating a successful Sanskrit English Parser (SEP) is to create a
lexer that analyses every word of the input Sanskrit sentence.
Tokenizer:
The tokenizer divides the complete sentence into a stream of individual words separated by blank
spaces.
Avyaya Analyser:
Every single output of the tokenizer goes through the smallest database, of avyaya words
(indeclinables), and only if it produces a complete match is the word accepted as an avyaya.
Verb Analyser:
The second, relatively bigger database of verb roots (dhaturoops) is placed after the avyaya
database. Tokens not recognized as avyaya are then processed by the verb analyser. The
program verb.cpp analyses the suffix of every input token and generates information regarding
the tense, person, and number of the corresponding token. The suffix is then removed and the
verb is mapped to its respective root using the verb database. If a match is found the token is
accepted as a verb; otherwise it is passed on for noun analysis.
Noun Analyser:
Tokens not yet recognized are fed to the noun analyser (noun.cpp). Noun declensions belonging
to different genders have different patterns that cannot all be matched by the program. Hence, of
the 21 possible noun declensions for a single noun, 10 declensions are stored as exceptions while
the remaining 11 are processed by the program and the root word is obtained. Lastly, if the word
is still not recognized, then it is not present in the database and must be entered manually for
analysis.
Figure 4.1: Lexical Analysis Steps
4.2.2 The Parser
Equipped with the knowledge of what individual words represent, we can now move towards
re-arranging them in such a way that their mere translation results in a meaningful English
sentence. When parsing from Sanskrit to English we move from a free word order language to
a language in which only a particular order of words conveys the intended meaning.
How to represent CONTEXT?
By CONTEXT we mean the parts of a statement that precede or follow a specific word or
passage, usually influencing its meaning or effect. Sanskrit uses the concept of vibhakti to generate
context. Due to the lack of vibhakti in English, the user will have to understand the context of
every word with help from the LEXER. Using the lexer the user can add words like for, from, to,
etc., which are not used in Sanskrit. Thus the PARSER gives us the spatial arrangement of input
words in converted form (in English) and the LEXER is referred to for context. This results in
an English translation of a Sanskrit sentence.
Structure of an English sentence:
Every English sentence is a combination of nouns and verbs related to each other through
context. In a SIMPLE sentence (a sentence without connectors, having only 1 verb), the verb is the
central entity. Nouns then relate to this central entity via context, as defined:
Nominative (S): the SUBJECT/doer of the verb
Accusative (O): the OBJECT of the verb
Instrumental (I): the cause/means of the verb
Dative (D): the indirect object of the verb
Ablative (A): represents comparison/separation
Locative (L): represents position in space/time
The LEXER already generates this contextual information for every noun, and the PARSER can
now arrange a simple input sentence spatially, following the rules of English as shown below.
Thus, we have the following order:
S V O L/A/D/I
The PARSER interprets the LEXER's outputs and rearranges the various nouns at their respective
positions as shown. The user can now apply the context of every noun to obtain a corresponding
English translation.
Parsing rules for a simple sentence:
The PARSER can handle all forms of noun declensions, verb declensions, and avyayas (including
connectors). The following points summarise the working of the parser:
Figure 4.2: Parsing
• The parser stores nouns, verbs, and avyayas in 3 separate structures, along with the
respective information required by the parser, such as case context, number, and person.
• The parser can handle words representing adjectives.
• The parser can handle words representing adverbs.
• The parser can resolve ambiguity generated by Sanskrit noun declensions. For example, if an
input Sanskrit sentence contains no nominative noun but there is a noun which can be both
nominative and accusative, then it is treated as nominative.
• The parser requires that the subject and verb agree in number.
• The parser also handles the GENITIVE case, which represents a noun-noun relationship
rather than a noun-verb relationship, as the other declensions do.
• The parser handles avyayas which correspond to a given noun declension type.
• The parser handles avyayas representing questions.
• The parser handles avyayas that act as conjunctions of different types.
• The parser can thus handle multiple sentences joined together using avyayas.
• The parser displays the interpreted spatial arrangement of the input sentence in a text file
named temp.
• The parser can process an input even if some part of it is not defined in the lexer database.
Such unrecognized input tokens are output as-is at the start of the resultant sentence in
the temp file.
4.2.3 Grammar Used:
Sanskrit can be described by a context-free grammar, and a BNF grammar for Sanskrit exists.
The general form of a BNF grammar is given as:
<BNF rule> ::= <nonterminal> "::=" <definitions>
<nonterminal> ::= "<" <words> ">"
<terminal> ::= <word> | <punctuation mark> | '"' <any chars> '"'
<words> ::= <word> | <words> <word>
<word> ::= <letter> | <word> <letter> | <word> <digit>
<definitions> ::= <definition> | <definitions> "|" <definition>
<definition> ::= <empty> | <term> | <definition> <term>
<empty> ::=
<term> ::= <terminal> | <nonterminal>
4.2.4 Uses Of A Grammar:
A BNF grammar can be used in two ways:
• To generate strings belonging to the grammar. Start with a string containing a non-terminal;
while there are still non-terminals in the string, replace a non-terminal with one of its
definitions.
• To recognize strings belonging to the grammar. This is the way programs are compiled: a
program is a string belonging to the grammar that defines the language.
Chapter 5
Testing
While developing this project we faced some discrepancies between the grammar definition and
its implementation in the query classes. In order to have a coherent implementation, we had to
correct them.
For testing, there are different strategies:
5.1 Syntax Error Handling:
Planning the error handling right from the start can both simplify the structure of a compiler and
improve its response to errors. A program can contain errors at many different levels, e.g.
• Lexical, such as misspelling an identifier, keyword, or operator.
• Syntax, such as an arithmetic expression with unbalanced parentheses.
• Semantic, such as an operator applied to an incompatible operand.
• Logical, such as an infinitely recursive call.
Much of the error detection and recovery in a compiler is centred on the syntax analysis
phase. One reason for this is that many errors are syntactic in nature or are exposed when the
stream of tokens coming from the lexical analyser disobeys the grammatical rules defining the
programming language. Another is the precision of modern parsing methods; they can detect
the presence of syntactic errors in programs very efficiently.
The error handler in a parser has simple goals:
• It should report the presence of errors clearly and accurately.
• It should recover from each error quickly enough to be able to detect subsequent errors.
• It should not significantly slow down the processing of correct programs.
5.2 Error-Recovery Strategies:
There are many different general strategies that a parser can employ to recover from a syntactic
error.
• Panic mode
• Phrase level
• Error production
• Global correction
5.2.1 Panic mode:
• This is used by most parsing methods.
• On discovering an error, the parser discards input symbols one at a time until one of a
designated set of synchronizing tokens (delimiters such as a semicolon or end) is found.
• Panic mode correction often skips a considerable amount of input without checking it for
additional errors.
• It is simple.
5.2.2 Phrase-level recovery:
• On discovering an error, the parser may perform local correction on the remaining input;
i.e., it may replace a prefix of the remaining input by some string that allows the parser to
continue.
• For example, a local correction might be replacing a comma with a semicolon, deleting an
extraneous semicolon, or inserting a missing semicolon.
• Its major drawback is the difficulty it has in coping with situations in which the actual
error has occurred before the point of detection.
5.2.3 Error productions:
• If error productions are used, the parser can generate appropriate error diagnostics to
indicate the erroneous construct that has been recognized in the input.
5.2.4 Global correction :
• Given an incorrect input string x and grammar G, the algorithm will find a parse tree for
a related string y, such that the number of insertions, deletions and changes of tokens
required to transform x into y is as small as possible.
Chapter 6
Conclusion
The project is based mainly on two languages, C and C++. In this project we have used Sanskrit
as the input language and English as the output language. The program first takes Sanskrit input
from the keyboard, tokenizes the sentence using the Tokenizer, identifies the tokens using the
Token Analyser, matches the tokens against the database to fetch the output words, and finally
joins the resulting words to produce the output. The main goal of the current study was to parse
a Sanskrit sentence so that later it could easily be translated into some other language.
The findings from this study make several contributions to the current literature, the first being
the suggestion that Sanskrit can serve as a primary language for programming purposes.
Finally, a number of important limitations need to be considered. First, this project is about
parsing one language into another; it is not a pure translator. Second, the project is platform
dependent (the platform here being Linux). Third, it is a database-oriented project rather than
one relying on an online approach. It is recommended that further research be undertaken in the
following areas:
• The project could be made more user friendly with a graphical user interface.
• The same scheme could be applied to many other languages.
The findings of this study have a number of important implications for future practice. This
translator is based mainly on fetching data from a database.
Chapter 7
Appendix
A
Avyaya Analyser 37
Ambiguous 15
C
Compiler 6
Code Optimization 9
Code Generation 9
D
Drawbacks 4
Data Flow Diagram 20
E
Error-Recovery Strategies 35
Error productions 36
G
Grammar 11
Grammar Used 30
Global correction 36
I
Intermediate Code Generation 8
Input Stages 19
Input Types 20
L
Lexical Analysis Phase 7
M
Makefile 17
O
Objective 3
Output Design 22
P
Parsing Methods 9
S
Scope 2
T
Testing 34
U
Uses Of A Grammar 30
Chapter 8
References
We thank our Project Supervisor, Assistant Professor Nikhil Debbarma, and our Project Coordi-
nator, Assistant Professor Suman Deb, for sharing valuable knowledge, for their encouragement,
for showing confidence in us throughout, and for pointing us to several links on the internet.
• Sanskrit & Artificial Intelligence (NASA):
Knowledge Representation in Sanskrit and Artificial Intelligence, by Rick Briggs, RIACS,
NASA Ames Research Center, Moffett Field, California
• http://www.vedicsciences.net/articles/sanskrit-nasa.html
• AI Magazine publishes the importance of Sanskrit
• http://www.aaai.org/ojs/index.php/aimagazine/article/viewArticle/466
• http://sanskrit.jnu.ac.in/morph/analyze.jsp
• http://uttishthabharata.wordpress.com/2011/05/30/sanskrit-programming/