This document discusses syntactic aggregation in Bengali text generation. It analyzes a corpus of Bengali sentences to identify common syntactic aggregation constructs, including paratactic and elliptic constructions. It then proposes an approach to syntactically aggregate two simple Bengali clauses into a more fluent compound sentence based on the identified constructs. The approach takes as input the constituent clauses, their rhetorical relation, and connecting discourse marker to generate the aggregated sentence.
Syntactic Aggregation in Bengali Text Generation
Sumit Das, Anupam Basu, Sudeshna Sarkar
Department of Computer Science and Engineering,
Indian Institute of Technology, Kharagpur, India – 721302
sumit.jucse@gmail.com,{anupam,sudeshna}@cse.iitkgp.ernet.in
Proceedings of ICON-2009: 7th International Conference on Natural Language Processing
Macmillan Publishers, India. Also accessible from http://ltrc.iiit.ac.in/proceedings/ICON-2009

Abstract
The quality of the sentences generated by a natural language generation system can be evaluated based on their well-formedness (fluency, conciseness and coherence) and faithfulness to the communication intent. In this paper, we explore the prevalent syntactic aggregation constructs in Bengali and present an approach towards generating Bengali compound sentences using the identified constructs. The inputs to our syntactic aggregation method are the constituent simple sentences, the rhetorical relations defined over them and the discourse markers realizing the relations. The paper describes a rule-based approach to forming the compound sentences by reorganizing components and then eliminating redundant lexical entities, and presents a user-based evaluation of the results obtained.

1 Introduction
Any Natural Language Generation (NLG) system should have the capability to remove unnecessary repetitions when generating text. Unnecessary repetitions make the text less fluent and less coherent. In NLG, the task of combining simpler constituent text spans by removing repetitions is called aggregation. According to the standard three-stage pipeline NLG architecture proposed by Reiter and Dale (2000), aggregation is a basic task of any NLG system for generating fluent, concise, and coherent text. Dalianis (1993) viewed aggregation mainly as a redundancy elimination problem that should be solved in such a way that the original meaning of the text is preserved and no undesirable implication is produced. For example, the two text spans in (1a), linked by a CONJUNCTION rhetorical relation (Mann and Thompson, 1988), can be combined as in (1b). But (1b) contains unnecessary repetitions, shown by the words in bold. So, these can be aggregated to produce (1c), which is more fluent, concise, and coherent than (1b).

1. a. * Jack went up the hill.
      * Jill went up the hill.
   b. Jack went up the hill and Jill went up the hill.
   c. Jack and Jill went up the hill.

Syntactic aggregation is the most common form of aggregation observed in any real discourse. Shaw (2002) proposed that in syntactic aggregation simpler linguistic components are combined in accordance with linguistic rules. As it is a language-dependent process, linguistic knowledge such as preferred word ordering and special verb form usage is required for combining text spans. For example, in Bengali the two simple text spans in (2a), linked by a SEQUENCE rhetorical relation, can be simply combined using the appropriate discourse marker eba.n as in (2b). But in (2b), the word in bold is redundant. So, applying the conjunction reduction construct, the two text spans can be aggregated to generate (2c). But (2c) can be aggregated further to (2d) by using the non-finite verb giYe.

2. a. * rAma mAThe giYechhila [1] (Ram went to the playground).
      * rAma phuTabala khelechhila (Ram played football).
   b. rAma mAThe giYechhila eba.n rAma phuTabala khelechhila (Ram went to the playground and Ram played football).
   c. rAma mAThe giYechhila eba.n phuTabala khelechhila (Ram went to the playground and played football).
   d. rAma mAThe giYe phuTabala khelechhila. (Ram went to the playground and played football).

[1] In this paper, Bengali graphemes are written in Roman script using ITRANS notation. They are set in italic font.

Clearly, to syntactically aggregate smaller text spans in Bengali, an NLG system should have knowledge of Bengali grammar.

In this work, we have studied a corpus of Bengali sentences to identify the prevalent syntactic aggregation constructs in Bengali. Then, we have proposed a method to syntactically aggregate two simple clauses using the constructs identified to generate a more fluent, concise and coherent compound sentence. The inputs are two simple clauses, the rhetorical relation between them and the discourse marker realizing that relation.

The rest of this paper is organized as follows: In Section 2, we briefly review related work on syntactic aggregation. In Section 3, we present a corpus analysis to identify the prevalent syntactic aggregation constructs in Bengali. The rhetorical relations considered in this work are described in Section 4 and the semantic representation used is described in Section 5. We describe our approach in Section 6 and the evaluation method in Section 7. Section 8 provides concluding remarks and some directions for future work.

2 Related Work
There does not exist any general consensus regarding the exact definition of aggregation, the types of aggregation or the component of an NLG system where aggregation tasks should be performed. The general approach is to handle the aggregation tasks in a domain- and application-specific way.

Dalianis (1993; 1996) equated aggregation with the process of redundancy elimination. He divided it into four principal categories: syntactic, elision, lexical, and referential aggregation. In syntactic aggregation, repetitions are removed syntactically, leaving at least one item in the text to express the meaning explicitly.

Wilkinson (1995) contradicted Dalianis's view of equating text aggregation with redundancy elimination, because in certain contexts redundancy elimination can be done by using suitable referring expression generation. Apart from redundancy elimination, aggregation choices can affect other characteristics of text, such as sentence complexity, focus, emphasis, theme/rheme, prosody etc.

Reape and Mellish (1999) defined aggregation as a process to generate more concise, cohesive, and fluent text by omitting or substituting repeating entities where the reader can infer the deleted entities from the remaining text. Reape and Mellish distinguished among different types of aggregation: conceptual, discourse, semantic, syntactic, lexical, and referential. According to them, syntactic aggregation is the most common and can be stated by some grouping rules, such as subject grouping, predicate grouping etc.

Horacek (1992) has given a more theoretical view of aggregation. He explained it through grouping phenomena such as content-based grouping and structurally motivated propositional grouping.

Shaw (2002) categorized aggregation into four types: interpretive, referential, syntactic, and lexical. He focused mainly on syntactic aggregation, which he divided into two types: hypotactic and paratactic. In paratactic aggregation all the constituent text spans are of equal status. On the other hand, in hypotactic aggregation the constituent text spans are related by some subordinate relation.

In the Virtual Storyteller project (Marit Theune and Hendriks, 2006), different conjunctive and elliptical constructs were used to syntactically aggregate simpler text spans to generate more coherent and concise fairy tales.

All the works in the area of text aggregation encountered so far are focused on English and other European languages. In this work, we have proposed methods to perform syntactic aggregation in Bengali text generation.

3 Corpus Analysis
We conducted a corpus analysis to identify the prevalent syntactic aggregation constructs used in Bengali for generating compound sentences. For this we have chosen text of narrative style because narrative texts are mainly activity or event driven. So, it is easier to model the different types of aggregation construct in narrative text. We have a corpus of 600 compound sentences collected from Bengali story books. We have randomly chosen 350 sentences from that corpus for analysis. First the selected compound sentences
were segmented into simple clauses. A simple clause is equivalent to a simple sentence which contains only one finite verb and no coordinating conjunction. For example, the compound sentence rAma eba.n shyAma kAla skule giYechhila (Ram and Shyam went to school yesterday) contains 2 simple clauses: rAma kAla skule giYechhila (Ram went to school yesterday) and shyAma kAla skule giYechhila (Shyam went to school yesterday). By decomposing the 350 compound sentences, we got 868 simple clauses (2.48 simple clauses per sentence). This measure is important to determine the maximum number of simple clauses that can be aggregated in a single sentence. We cannot keep on aggregating an arbitrarily large number of simple clauses even if they are syntactically similar, since the result may be too complex and less fluent. From the corpus analysis, we have identified two types of frequently used syntactic aggregation constructs in Bengali, namely the paratactic construct and the elliptic construct.

• Simple paratactic construction: In this case, the two constituent simple clauses are simply connected by the conjunctive discourse marker and no word deletion is required.

  – rAma ekatA boi paRachhila eba.n shyAma phuTabala khelachhila (Ram was reading a book and Shyam was playing football).

• Elliptic construction: Ellipsis is defined as the omission of superfluous words from the surface form which are inferable from the entities in the remaining text. The different elliptic constructs observed in Bengali are:

  – Conjunction reduction: In conjunction reduction, the subject of the second simple clause is deleted.
    * rAma khAbAra kheYechhe eba.n bandhudera sAthe sinemA dekhate gechhe (Ram has eaten food and gone to see a movie with friends).
    In the example given above, the subject of the second simple clause, i.e., rAma, is deleted using the conjunction reduction construct.

  – Right node raising (RNR): In RNR, the right most portion of the first simple clause is deleted.
    * rAma bhAta eba.n shyAma ruTi khAbe (Ram will eat rice and Shyam will eat roti).
    Here the right most portion of the first proposition (khAbe) is deleted.

  – Coordinating one constituent: In this case, one constituent entity from each of the input simple clauses is coordinated by a conjunction. This can happen to any entity of the constituent simple clauses.
    * rAma eba.n shyAma phuTabala khelachhila (Ram and Shyam were playing football).
    The subjects of the two constituent simple clauses in the above example are coordinated.

  – Non-finite verb generation: If both the input simple clauses are about events or actions performed sequentially or concurrently by the same subject, then they are aggregated using the non-finite form of the verb of the first simple clause.
    * rAma bhAta kheYe skule yAbe (Ram will eat rice and go to school).
    In the above example, the two constituent simple clauses are about two actions performed sequentially by the same subject. So, the perfect participle form of the verb khAoYA, i.e. kheYe, is used for aggregation.

  Any combination of the above four types of elliptic constructs is also allowed. For example, in (3) both conjunction reduction and RNR are used and (4) is generated by using both conjunction reduction and non-finite verb.

  3. rAma bhAta eba.n mAchha khAbe (Ram will eat rice and fish).
  4. rAma skule giYe phuTabala khelabe (Ram will go to school and play football).

In summary, though we have considered only narrative Bengali text for the corpus study, it is part of a more general approach. As syntactic aggregation is a language-dependent but domain-independent task (Shaw, 2002), the contributions of this work can be extended to generate aggregated text in Bengali in other domains as well.
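The elliptic constructs catalogued in this section can be illustrated with a small token-level sketch. It is only an approximation under stated assumptions: the approach in this paper works on case-frame representations, whereas the sketch compares whitespace-separated surface tokens, and the helper names (`shared_prefix_len`, `conjunction_reduction`, `right_node_raising`, `reduce_both`) are hypothetical and introduced here purely for illustration.

```python
from typing import List

MARKER = "eba.n"  # conjunctive discourse marker used in the paper's examples


def shared_prefix_len(a: List[str], b: List[str]) -> int:
    """Count how many leading tokens the two lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def conjunction_reduction(first: List[str], second: List[str],
                          marker: str = MARKER) -> List[str]:
    """Drop the leading tokens of `second` that repeat the start of `first`
    (assumption: the repeated subject is the shared token prefix)."""
    k = shared_prefix_len(first, second)
    return first + [marker] + second[k:]


def right_node_raising(first: List[str], second: List[str],
                       marker: str = MARKER) -> List[str]:
    """Drop the trailing tokens of `first` that repeat the end of `second`."""
    k = shared_prefix_len(first[::-1], second[::-1])
    return first[:len(first) - k] + [marker] + second


def reduce_both(first: List[str], second: List[str],
                marker: str = MARKER) -> List[str]:
    """Apply both deletions at once, as in example (3)."""
    p = shared_prefix_len(first, second)
    s = shared_prefix_len(first[::-1], second[::-1])
    # Never let the two deleted regions overlap.
    s = min(s, len(first) - p, len(second) - p)
    return first[:len(first) - s] + [marker] + second[p:]
```

Applied to the clauses of example (2a), `conjunction_reduction` yields the token sequence of (2c); applied to rAma bhAta khAbe and shyAma ruTi khAbe, `right_node_raising` yields the RNR example above. Non-finite verb generation is not covered, since it needs morphological knowledge that a token comparison cannot supply.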
4 Rhetorical Relations Considered
From the corpus study, we know that paratactic aggregations are the most common form of syntactic aggregation in Bengali. In paratactic aggregation, the constituent text spans are of equal status and are linked by multi-nuclear rhetorical relations (Mann and Thompson, 1988). In this work, we have focused on the different paratactic constructs for syntactic aggregation of Bengali text. The multi-nuclear rhetorical relations considered in this paper are CONJUNCTION, DISJUNCTION, CONTRAST, and SEQUENCE as defined by the original Rhetorical Structure Theory (RST). In addition to the said relations, we have considered another multi-nuclear temporal coherence relation, PARALLEL, as defined below:

    Two text spans are said to be related by PARALLEL relation if the actions or the events in those two text spans are occurring simultaneously.

For example, the two constituent clauses present in (5) are rAma khAbAra khAchchhila (Ram was eating food) and rAma Tibhi dekhachhila (Ram was watching TV). The actions in these two clauses are concurrent. So, the coherence relation between them is PARALLEL.

5. rAma khAbAra khete khete Tibhi dekhachhila (Ram was watching TV while eating food).

5 The Semantic Representation
The semantic representation chosen here is a case-frame representation, also called a predicate-argument representation. The basic building block in this representation is the sentence. An example of the sentence frame is given in Figure 1. A sentence contains a clause frame and a clause-count, which denotes the number of simple clauses present in the sentence. The clause is a recursive structure that can contain clauses inside itself, which makes it capable of representing both simple and composite (compound and complex) sentences. For a simple sentence, the outer clause contains only one inner clause. On the other hand, for a composite sentence the outer clause contains the constituent inner clauses along with the rhetorical relation (rh-rel) connecting them and the discourse marker (dm) realizing that rhetorical relation. A clause frame contains a predicate frame (pre) and a list of argument frames (arg). The pre frame contains verb-related information, such as verb root (v-root), theme, tense, aspect, mood, polarity etc. The arg frame contains the nominal entities along with the thematic role of each entity in that clause. If there is any modifier for the verb or for any nominal entity in a clause, then the respective modifier frames (v-mod and w-mod frames) are present inside the corresponding pre and arg frames.

Figure 1: Case-frame representation for the sentence "rAma pa.Dachhila eba.n shyAma khelachhila." (Ram was reading and Shyam was playing).

6 Proposed Approach
In our approach for syntactic aggregation, the inputs are two simple clauses, the rhetorical relation between them, and the discourse marker realizing that relation. To syntactically aggregate the two simple clauses by using the different paratactic constructs identified in section 3 we propose
the following steps:

• Step 1: Ordering arguments in the constituent clauses.
• Step 2: Repeating entity identification.
• Step 3: Ordering constituent clauses.
• Step 4: Superfluous words deletion and non-finite verb generation.
• Step 5: Correct surface form generation.

The above steps are described below.

6.1 Argument Ordering in the Constituent Clauses
Preferred word ordering in a sentence varies with languages and it is very important for syntactic aggregation. Though Bengali is a free-word-order language, the preferred word ordering in a Bengali sentence is subject-object-verb.

In this work, the input simple clauses are taken in their corresponding semantic case-frame representation as shown in Figure 1. The arg frames in the clause are then ordered by using a total order among the roles associated with the arg frames. These roles are neither semantic roles nor Paninian roles. The problem that rules out both semantic and Paninian roles is that neither of them can be associated with a unique postposition, which is very important for generating sentences in Bengali. So the alternative approach is to design an intermediate representation that has sufficient granularity of roles, such that ambiguous assignments of postpositions are not possible. Now, Bengali has a list of postpositions that are used in different contexts to convey different semantics. In this work, roles have been designed at a granularity level where one role is assigned to a semantically unique postposition. For developing the total order of the roles, we have followed an approach taken in the SANYOG system (Bhattacharya, 2004). We have taken the constituent simple clauses of the compound sentences used for corpus analysis. Each simple clause was represented in its case-frame representation and the arg frames inside it were then ordered as they appear in the surface form of the clause. In this way, the ordering among the roles of the arg frames in a clause is known. For example, the role set for (6) is {ke, kothAYa, kakhana}. From (6) we can infer that the preferred order among these roles is ke < kakhana < kothAYa. The role on the left side of < will appear before the role on the right side in the surface form.

6. Ami AgAmIkAla skule yAba (Tomorrow I shall go to school).

Again, in (7) the role set is {ke, kothAYa, kakhana, kAra sAthe}. By using (7) the total order obtained from (6) can be extended to ke < kakhana < kAra sAthe < kothAYa.

7. Ami AgAmIkAla bAbAra sAthe skule yAba (Tomorrow I shall go to school with my father).

Using the above method for the entire set of simple clauses we have identified the set of possible roles in Bengali and developed a total order among them. The arg frames in the input simple clauses are ordered using the developed total order.

6.2 Repeating Entity Identification
In our current approach, to remove the redundant entities we first identify the repeating entities present in both the simple clauses taken as input. We assume that nominal entities are equivalent if they have the same thematic role and root word in the constituent simple clauses. For example, in the simplified semantic representation of the compound sentence shown in Figure 2, the constituent simple clauses have one repeating nominal entity. In both the simple sentences, the thematic role of that entity is ki and its surface form is bhAta. Two verbs are equivalent if they have the same root word and the same functional parameters, such as tense, aspect, mood, polarity etc. In Figure 2, the verbs are equivalent and thus repeating. Two noun modifiers are equivalent if they have the same root word and are modifying two nominal entities with the same thematic role. Lastly, two verb modifiers are equivalent if they have the same root word. The repeating entities are tagged with the status REPEATING.

Figure 2: Simplified case-frame representation for the sentence "rAma eba.n shyAma bhAta khAbe." (Ram and Shyam will eat rice). Note: ∼() denotes a frame.

6.3 Ordering Constituent Clauses
All the rhetorical relations considered in this work, mentioned in section 4, are multi-nuclear relations. So, two simple clauses connected by any of these relations, except the SEQUENCE relation, can be realized in any order. In case of the SEQUENCE relation, an ordering constraint is imposed by the sequence of the input clauses. So, for the SEQUENCE relation the clauses cannot be reordered. For other relations, after identifying the repeating entities, the constituent simple clauses in the resulting compound sentence are reordered on the basis of their chronological order and polarity following the rules mentioned below:

• Tense: If the two constituent clauses have different tense then they are ordered chronologically. This improves the fluency of the generated compound sentence. For example, if the two clauses in (8a), linked by CONJUNCTION relation, are aggregated without chronological ordering then (8b) is generated. But if they are ordered according to their tense and aggregated then (8c) is generated, which is more fluent and coherent than (8b).

  8. a. · Ami bA.Di yAba. (I shall go home).
        · rAma skule gechhe. (Ram has gone to school).
     b. Ami bA.Di yAba eba.n rAma skule gechhe. (I shall go home and Ram has gone to school).
     c. rAma skule gechhe eba.n Ami bA.Di yAba. (Ram has gone to school and I shall go home).

  The chronological ordering is done when the rhetorical relation between the two constituent clauses is CONJUNCTION, DISJUNCTION or CONTRAST. As the constituent simple clauses are concurrent for the PARALLEL relation, this ordering is not required.

• Polarity: If two simple clauses have the same tense but different polarity for the verb then the clause with negative polarity will come first in the surface form. For example, if the simple clauses in (9a), linked by CONJUNCTION relation, are aggregated as in (9b) then the negative polarity marker nA affects both the verbs kinabe and khAbe. So, the communicative goal is not preserved. However, if the clauses are reordered and then aggregated, (9c) results, which is grammatically correct, fluent and preserves the meaning.

  9. a. rAma chakaleTa kinabe. rAma chakaleTa khAbe nA. (Ram will buy chocolate. Ram will not eat chocolate).
     b. rAma chakaleTa kinabe eba.n khAbe nA (Ram will buy chocolate and will not eat).
     c. rAma chakaleTa khAbe nA eba.n kinabe (Ram will not eat chocolate and will buy).

  The ordering based on polarity is done when the clauses are linked by either CONJUNCTION or DISJUNCTION relation.

6.4 Superfluous Words Identification and Non-finite Verb Generation
After identifying the repeating entities and ordering the constituent clauses, the superfluous words are identified using the following two methods:

• Forward deletion: If the entities at the beginning of the surface forms of both clauses
are REPEATING then they are marked as DELETED in the second clause. Surface forms of both the clauses are traversed from left to right and REPEATING entities are marked as DELETED in the second clause until a NON-REPEATING entity is encountered. For example, the two constituent clauses in (10), linked by CONJUNCTION relation, have REPEATING entities with the roles ke and kakhana and they occur at the beginning of both the clauses. So, the REPEATING entities are marked DELETED in the second clause, indicated by the words in bold face.

  10. rAma gatakAla khAbAra kheYechhila eba.n rAma gatakAla skule giYechhila (Ram ate food yesterday and Ram went to school yesterday).

• Backward deletion: If the verb and the entities at the end of the surface forms of both clauses are REPEATING then they are marked as DELETED in the first clause. Surface forms of both the clauses are traversed from right to left and the REPEATING verb and entities are marked as DELETED in the first clause until a NON-REPEATING entity is encountered. For example, the two constituent clauses in (11), linked by CONJUNCTION relation, have a REPEATING verb and a REPEATING entity with the role ki and they occur at the end of both the clauses. So, the REPEATING elements are marked DELETED in the first clause, indicated by the words in bold face.

  11. rAma bhAta khAbe eba.n shyAma bhAta khAbe (Ram will eat rice and Shyam will eat rice).

If the two simple clauses, linked by CONJUNCTION, DISJUNCTION or CONTRAST relation, have the same role set then the REPEATING entities are both forward deleted and backward deleted. For example, in (12) the two simple clauses, connected by CONJUNCTION relation, have the same set of associated roles. So, bold faced words in the second clause are deleted forward and those in the first clause are deleted backward. However, if the role sets are different then only forward deletion is done. As the two clauses in (13), connected by a CONTRAST relation, have different role sets, only the bold faced words in the second clause are forward deleted.

12. rAma Aja bhAta khAbe eba.n rAma kAle bhAta khAbe (Ram will eat rice today and Ram will eat rice tomorrow).

13. rAma Aja bhAta khAbe kintu rAma kAle bAbAra sAthe ruTi khAbe (Ram will eat rice today but Ram will eat roti with father tomorrow).

In case of SEQUENCE or PARALLEL relation, only forward deletion is done. In addition to that, the verb of the first clause is modified to its non-finite form if the subjects of both the clauses are the same. For the SEQUENCE relation, the non-finite form is the perfect participle of the verb and for the PARALLEL relation, it is the progressive participle. For example, in (14a) the two clauses are linked by SEQUENCE relation. So, first the bold faced words in the second clause are forward deleted and then the perfect participle form of the verb of the first clause is generated. This results in the compound sentence (14b). Similarly, the two clauses in (15a), linked by PARALLEL relation, are aggregated to (15b) by using the progressive participle of the root verb paRA.

14. a. rAma bA.Di yAbe eba.n rAma bhAta khAbe (Ram will go home and Ram will eat rice).
    b. rAma bA.Di giYe bhAta khAbe (Ram will go home and eat rice).

15. a. rAma bai pa.Dachhila eba.n rAma khAbAra khAchchhila (Ram was reading a book and Ram was eating food).
    b. rAma bai pa.Date pa.Date khAbAra khAchchhila (Ram was eating food while he was reading a book).

6.5 Correct Surface Form Generation
The redundant words are identified in the previous step but the actual deletion is done in this step. While generating the resulting compound sentence, the entities marked as DELETED are not realized, i.e., they are deleted from the surface form.

In case of subject coordinating and RNR constructs, if the subjects of the two input clauses are different then the correct surface form of the common verb should be generated. For example, in (16) the surface form used for the common verb khelA
is khelaba, which is generated by the subject of the first clause, i.e., Ami.

16. Ami eba.n rAma kAla phuTabala khelaba (I and Ram will play football tomorrow).

Here we have given some rules for generating the correct inflectional form of the common verb for different syntactic aggregation constructs in Bengali.

• In case of subject coordinating, if one of the subjects is of first person then the common verb will be inflected by that first person subject. As in (17), the common verb inflection yAba is generated by the first person subject Ami.

  17. Ami eba.n tumi kAla skule yAba (I and you will go to school tomorrow).

• In case of subject coordinating, if one of the subjects is of second person and the other is of either second or third person then the common verb will be inflected by that second person subject. As in (18), the common verb inflection yAo is generated by the second person subject tumi.

  18. tumi eba.n rAma skule yAo (You and Ram go to school).

• In case of subject coordinating, if both the subjects are of third person then the subject of the complete clause will inflect the common verb. As in (19), both the subjects are of third person and the common verb inflection karabena is generated by the subject of the complete clause, i.e., tini.

  19. rAma eba.n tini kAjatA karabena (Ram and he will do the work).

• In case of RNR constructs other than subject coordinating, the subject of the complete clause will inflect the common verb. As in (20), the common verb inflection khelabe is generated by the subject of the complete clause, i.e., se.

  20. Ami krikeTa eba.n se phuTabala khelabe (I shall play cricket and he will play football).

So, following the above rules the correct inflectional form of the common verb is generated, which increases the fluency and naturalness of the generated text.

7 Evaluation
We have developed a system which performs syntactic aggregation of two simple clauses by following the steps mentioned in section 6. Evaluation of that system is important to validate our approach. We performed a user-based evaluation. The system outputs were shown to human evaluators, who were asked to rate those outputs on some parameters. The overall system performance is measured from their feedback. We evaluated the system with three human evaluators, all native speakers of Bengali. They were given only a brief idea about the rhetorical relations considered in this work. As mentioned in section 3, from a corpus of 600 compound sentences 350 were chosen randomly for corpus study. The remaining 250 sentences were used as test sentences in the evaluation. The test sentences were segmented into constituent simple clauses. The simple clauses, the rhetorical relation connecting them, and the appropriate discourse marker realizing that relation were given to the human evaluator as the test inputs. The evaluation is performed on the following two criteria:

• Well-formedness: We define the well-formedness of an output sentence by its grammatical correctness and conciseness. The grammatical correctness measures the accuracy of the syntax, word order and the morphological inflections used.

• Faithfulness: The faithfulness of an output measures how well the communication goal is preserved by the generated output.

For both the measures, the evaluators were asked to score the outputs on a scale of 1 to 5, where 1 is the best and 5 is the worst. The scoring for well-formedness and faithfulness was done separately by each evaluator so that the score of one does not influence the score of the other. The results of each evaluator for well-formedness and faithfulness are shown in Figure 3 and Figure 4 respectively.

To calculate the overall performance of the system, the scores given by the individual evaluators were combined as follows: if two or more evaluators have given a common score to a test sentence then it is assigned that common score; if all the evaluators have given different scores to a test sentence then it is not considered for the overall performance calculation. The overall performance of our system for well-formedness and faithfulness is shown in Figure 5 and Figure 6 respectively.
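The score-combination rule just described can be sketched as follows. This is a minimal illustration, assuming each test sentence carries one integer score per evaluator; the function names `combine_scores` and `overall_distribution` are hypothetical, and any ratings passed to them in a demonstration would be made-up inputs, not the scores collected in this evaluation.

```python
from collections import Counter
from typing import Iterable, Optional, Sequence


def combine_scores(scores: Sequence[int]) -> Optional[int]:
    """Return the score shared by at least two evaluators.

    Returns None when every evaluator gave a different score, meaning the
    test sentence is left out of the overall calculation.
    """
    if not scores:
        return None
    score, count = Counter(scores).most_common(1)[0]
    return score if count >= 2 else None


def overall_distribution(all_scores: Iterable[Sequence[int]]) -> Counter:
    """Count how many test sentences received each combined score,
    skipping sentences on which no two evaluators agreed."""
    combined = (combine_scores(s) for s in all_scores)
    return Counter(c for c in combined if c is not None)
```

With three evaluators, at most one score can be shared by two or more of them, so the combined score is unambiguous whenever it exists.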
Figure 3: Well-formedness Bar Graph
Figure 4: Faithfulness Bar Graph
Figure 5: Well-formedness Pie Chart
Figure 6: Faithfulness Pie Chart

The inconsistencies with respect to well-formedness of the system generated output are mainly due to errors in word ordering and conciseness. For example, the two clauses in (21a) are connected by SEQUENCE relation and the system syntactically aggregates them to (21b). But (21b) is not very good in terms of word ordering and conciseness.

21. a. rahima ekadina rAstAYa bhi.Da dekhechhila. rahimera mAthA ghure giYechhila (One day Rahim saw a huge mass in the street. Rahim was moved by that).
    b. rahima ekadina rAstAYa bhi.Da dekhechhila eba.n tAra mAthA ghure giYechhila (One day Rahim saw a huge mass in the street and he was moved by that).

The errors regarding the faithfulness measure are due to wrong ordering of the constituent clauses and the absence of cues which indicate emphasis and prosody. For example, the two clauses in (22a), connected by CONJUNCTION relation, are aggregated to (22b). But the output is ambiguous in terms of faithfulness as both the verbs are now in the scope of the words bAbAra sAthe.

22. a. rAma bAbAra sAthe khAbAra khAbe. rAma Tibhi dekhabe (Ram will eat food with father. Ram will watch TV).
    b. rAma bAbAra sAthe khAbAra khAbe eba.n Tibhi dekhabe (Ram will eat food with father and watch TV).

8 Conclusion
In this article, we have shown our methods to generate aggregated and elliptic sentences in Bengali
10. from clause-sized semantic representations. The Mukhopadhyay for their valuable advice and sup-
current system can produce paratactic construc- port. This work is supported by the project Sanyog
tions and use ellipsis to omit repeated entities. We - Phase II, funded by Media Lab Asia, and con-
were able to produce all the desired forms of syn- ducted in Communication Empowerment Labora-
tactic aggregation (see Section 3), though there are tory, Indian Institute of Technology.
scopes for improvements.
Deletion of the repeating words in the gener-
ated output sentence sometimes does not preserve References
meaning. In that case, to make the text fluent Samit Bhattacharya. 2004. Sanyog: An iconic sys-
anaphoric pronouns need to be used. For example, tem for multilingual communication for people with
speech and motor impairments. M.S. Thesis, IIT,
if the two clauses in (23a), connected by C ON - Kharagpur, Supervisor-Basu, A, Sarkar, Sudeshna.
JUNCTION relation, are aggregated by removing
the repeating words in boldface then actual com- Hercules Dalianis and Eduard H. Hovy. 1993. Aggre-
municative goal is not preserved. In place of that, gation in natural language generation. In EWNLG
’93, Proceedings of the 4th European Workshop on
these two clauses are correctly aggregated to (23b) Natural Language Generation, Pisa, Italy.
by using anaphoric pronoun tAra.
H. Dalianis. 1996. Aggregation as a subtask of text and
sentence planning. In J.H.Stewman (ed.), Proceed-
23. a. Ami rAmer sAthe phuTabala khelaba
ings of Florida AI Research Symposium, FLAIRS-
eba.n yadu rAmer sAthe sinemA 96, pages 1–5, Key West, Florida.
dekhabe (I shall play football with
Ram and Jadu will see a movie with Helmut Horacek. 1992. An integrated view of text
planning. In Proceedings of the 6th International
Ram). Workshop on Natural Language Generation, pages
b. Ami rAmer sAthe phuTabala khelaba 29–44, London, UK. Springer-Verlag.
eba.n yadu tAra sAthe sinemA dekhabe
William C. Mann and Sandra A. Thompson. 1988.
(I shall play football with Ram and Jadu Rhetorical structure theory: Toward a functional the-
will see a movie with him. ory of text organization. Text, 8(3):243–281.
Feikje Hielkema Marit Theune and Petra Hendriks.
The current system takes discourse marker as in- 2006. Performing aggregation and ellipsis using dis-
put for a combining simple clauses. But it can course structures. Research on Language and Com-
be extended to select the appropriate discourse putation, 4(4):353–375.
marker depending upon the rhetorical relation and
M. Reape and C. Mellish. 1999. Just what is aggre-
other functional informations such as polarity, gation anyway. In Proceedings of the 7th European
prosody, emphasis etc. Workshop on Natural Language Generation, pages
The system can be extended to aggregate more 20–29, May.
than two simple clauses. In that case the docu- Ehud Reiter and Robert Dale. 2000. Building Natural
ment structure tree (Reiter and Dale, 2000) will be Language Generation Systems. Cambridge Univer-
the input. Clauses can be aggregated according to sity Press, New York, NY, USA.
the specification of the document structure tree un-
James Chi-Kuei Shaw. 2002. Clause aggregation: an
less the complexity of an single sentence exceed approach to generating concise text. Ph.D. thesis,
a predefined threshold. Depending upon the re- New York, NY, USA. Sponsor-Mckeown, Kathleen
sulting sentence complexity and other contextual R.
information, sentence break may be declared re- John Wilkinson. 1995. Aggregation in natural lan-
sulting in multi-sentential text. guage generation: Another look. Technical report,
In our future works, we intend to handle the Computer Science Department, University of Water-
above mentioned limitations to generate more nat- loo.
ural Bengali text.
Acknowledgement
We would like to thank anonymous reviewers for
valuable comments. We would also like to thank
Mr. Plaban Kumar Bhowmik and Mr. Sibansu