Gujarati Text-to-Speech Presentation

10:02 10:02
Text-to-Speech System for
Gujarati
Project Presentation by Samyak Bhuta

* PROJECT PROFILE *
Objective : Developing a Text-to-Speech
System for Gujarati

* PROJECT PROFILE *
Under the guidance of
 Prof. Ram Mohan
 Shri Jignesh Dholakia

* PROJECT PROFILE *
At Resource Centre for Indian Language
Technology Solutions in Gujarati,
Faculty of Arts,
The M. S. University of Baroda, BARODA.

Next 25 minutes …
> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components

Sound : a flow of air
[Figure: air flows from the Source to the Ear, carrying the sound]

What makes sounds different ?
 The factors responsible for the perceptual
difference between one kind of sound and
another are
 Amplitude (or volume), which tells how much
power the air-flow carries
 Frequency (or pitch), which tells at what rate
the air-flow pattern repeats itself
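
The two factors above can be illustrated with a pure tone. The following is a minimal sketch, not part of the project: the function name `sine_wave` and the 8000 Hz sample rate are assumptions chosen for illustration.

```python
import math


def sine_wave(amplitude, frequency_hz, duration_s, sample_rate=8000):
    """Return the samples of a pure tone as a list of floats.

    amplitude    -- peak value of the wave (perceived as volume)
    frequency_hz -- repetitions per second (perceived as pitch)
    """
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * i / sample_rate)
        for i in range(n_samples)
    ]


# Two sounds that differ only in amplitude and frequency.
quiet_low = sine_wave(amplitude=0.2, frequency_hz=220, duration_s=0.01)
loud_high = sine_wave(amplitude=0.8, frequency_hz=440, duration_s=0.01)
print(len(quiet_low), max(quiet_low), max(loud_high))
```

Changing only `amplitude` scales every sample; changing only `frequency_hz` changes how often the pattern repeats.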

The “Source” doesn’t matter
 An air-flow of kind A will sound the same
whether it is generated by source X
or source Y.

Speech Sound
 A kind of sound whose source is the
human vocal organs and which
finds its place in human speech.
 e.g. ક્ , સ્ , અ , ઈ
 A standard called the International Phonetic
Alphabet (IPA) is used to represent such sounds

IPA
 The IPA covers almost all the speech sounds
of all the languages in the world.
 Speech sounds are more formally known as
phones.
 The IPA uses a set of symbols to represent them,
e.g. k , s , ə , i , ʤ
 IPA Chart …

Synthesized Speech Sound
 If we can produce the same pattern of
air-flow that the human vocal organs
produce for a speech sound,
we can say that we have synthesized that
speech sound.

Speech Synthesizer
 A mechanism capable of producing
synthesized speech sounds in a controlled
manner.

Text-to-Speech Systems
 A speech synthesizer that is smart enough
to produce speech output equivalent to the
given text.
 The smartness lies in making the
output as natural and intelligible as
possible.

Text-to-Speech Systems
 Usually, a TTS system is specific to
one human language and takes input
text only in that language.

Basic structure of TTS Systems
 The function of any TTS system is generally
divided into three subtasks or phases.
I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production
 The text input travels through these
phases, one by one, and eventually ends
up as speech.

Preprocessing
 “Dr. Ajay Shah will come to clinic on 23 Jan.”
 We read it as …
“DOCTOR Ajay Shah will come to clinic on
TWENTY THIRD OF JANUARY”.
 Preprocessing converts the input text
from its raw form into text made up of
pronounceable words.
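
The expansion on this slide can be sketched in a few lines. This is only an illustration for the English example sentence, not the project's Gujarati preprocessor; the tables `ABBREVIATIONS` and `MONTHS` and the function names are assumptions made for the sketch.

```python
import re

# Illustrative tables; a real preprocessor would need far larger ones.
ABBREVIATIONS = {"Dr.": "Doctor"}
MONTHS = {
    "Jan": "January", "Feb": "February", "Mar": "March", "Apr": "April",
    "Jun": "June", "Jul": "July", "Aug": "August", "Sep": "September",
    "Oct": "October", "Nov": "November", "Dec": "December",
}
ORDINALS = [
    "", "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth",
]
# A day number, an optional stray comma, then an abbreviated month with a dot.
DATE = re.compile(r"\b(\d{1,2})\s*,?\s*(" + "|".join(MONTHS) + r")\.")


def day_ordinal(day):
    """Spell out a day of the month (1 to 31) as an ordinal."""
    if day < 20:
        return ORDINALS[day]
    if day % 10 == 0:
        return "twentieth" if day == 20 else "thirtieth"
    tens = "twenty" if day < 30 else "thirty"
    return tens + " " + ORDINALS[day % 10]


def expand_date(match):
    day = int(match.group(1))
    if not 1 <= day <= 31:
        return match.group(0)  # leave anything that is not a valid day alone
    return day_ordinal(day) + " of " + MONTHS[match.group(2)]


def preprocess(text):
    """Turn raw text into text made up of pronounceable words."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return DATE.sub(expand_date, text)


print(preprocess("Dr. Ajay Shah will come to clinic on 23 Jan."))
```

Keeping the expansions in plain tables makes it easy to move them out of the code later, for example into an editable text file.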

Phonetic-Prosodic Translation
 This phase can be logically divided into two
different phases,
• Phonetic Translation
• Prosodic Translation
 Real TTS Systems may implement these
phases separately or as a unit but together
they provide data for the next phase of TTS.

Phonetic Translation
 In human languages, the script in use
does not necessarily possess a one-to-one
mapping with speech.
 e.g.
enough is pronounced as INAF (IPA: inəf)
છોકરો is pronounced as છોક્રો (IPA: ʧokɾo)

Phonetic Translation
 Phonetic Translation tells the next phase
exactly which speech sounds (phones) are to be
produced for the given text.
 Phonetic Translation is also known as
letter-to-sound rules.
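
A toy set of letter-to-sound rules for a handful of Gujarati letters might look like the sketch below. It is not the project's rule set: the letter tables are deliberately partial, the phone symbols follow the transcriptions used on these slides, and the schwa deletion rule (drop a word-final schwa, then any schwa between vowel+consonant and consonant+vowel) is a common simplification assumed here for illustration.

```python
SCHWA = "ə"
VIRAMA = "\u0acd"  # Gujarati sign virama: suppresses the inherent vowel

# Partial, illustrative tables covering only the example words.
CONSONANTS = {
    "ક": "k", "ગ": "g", "છ": "ʧ", "ણ": "ɳ", "ત": "t", "ન": "n",
    "પ": "p", "ભ": "bʰ", "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ",
}
VOWEL_SIGNS = {"\u0abe": "a", "\u0acb": "o"}  # dependent signs for aa and o
VOWEL_PHONES = {SCHWA, "a", "o"}


def letters_to_phones(word):
    """Map letters to phones, giving every consonant an inherent schwa."""
    phones = []
    for ch in word:
        if ch in CONSONANTS:
            phones += [CONSONANTS[ch], SCHWA]
        elif ch in VOWEL_SIGNS and phones and phones[-1] == SCHWA:
            phones[-1] = VOWEL_SIGNS[ch]  # the vowel sign replaces the schwa
        elif ch == VIRAMA and phones and phones[-1] == SCHWA:
            phones.pop()  # the virama removes the schwa
        else:
            raise ValueError(f"unsupported character: {ch!r}")
    return phones


def delete_schwas(phones):
    """Drop a word-final schwa, then any schwa in a V C _ C V context."""
    if len(phones) > 1 and phones[-1] == SCHWA:
        phones = phones[:-1]
    kept = []
    for i, p in enumerate(phones):
        medial = (
            p == SCHWA
            and 2 <= i <= len(phones) - 3
            and phones[i - 2] in VOWEL_PHONES
            and phones[i - 1] not in VOWEL_PHONES
            and phones[i + 1] not in VOWEL_PHONES
            and phones[i + 2] in VOWEL_PHONES
        )
        if not medial:
            kept.append(p)
    return kept


def transcribe(word):
    """Return the IPA string for a word written with the supported letters."""
    return "".join(delete_schwas(letters_to_phones(word)))


for word in ["રામ", "કરણ", "ગમન"]:
    print(word, transcribe(word))
```

The point of the sketch is that the written letters alone are not enough: the rules decide where the inherent vowel is actually spoken.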

Prosodic Translation
 Letter-to-sound rules only provide
information about the kind of speech
sound to be generated. To convey the
emotions and expressions residing in the
input text, Prosody needs to be applied.
 By Prosody we mean
Amplitude + Pitch + Duration

Speech Production
 This phase is responsible for actual output
of the speech.
 The phase uses the phonetic and prosodic
information provided from the previous
phase.
 Various approaches exist for production of
speech.

Different ways for Speech Production
 Three widely used approaches to speech
production are
• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis
 The speech production part of a TTS system
is generally referred to as the speech engine.


Use cases
 Once we understood the structure of TTS
systems, we realized that all three phases are
required in order to develop a complete TTS
system for Gujarati.
 At the topmost level of abstraction, a use case
can be conceived for fulfilling the requirement
of having a TTS system for Gujarati.

Use cases
 The topmost use case can then be divided
into three further use cases, each fulfilling
the requirement of one of the three phases.
 During the project we tried to realize each
use case one by one.

Pilot Project
 As we approached the various requirements
and use cases to be realized, we found that
developing a Preprocessor is not as
significant as developing the other two
phases, so we decided to develop it later.
 We decided to develop the Phonetic-Prosodic
Translation phase first, as it can be easily
plugged into any already built … speech

Pilot Project
… speech engine that takes its input in terms
of IPA.
 FreeTTS, IBMJS, Dhvani and Narad were
studied.
 We used the Java Speech API along with IBMJS
as the speech engine.
 The input to the engine was provided through
Java Speech Markup Language (JSML).

Pilot Project : Objective
 To develop a TTS system using an already
available speech engine, supplying the engine
with the IPA transcription of the target
Gujarati Unicode text.

Pilot Project : S/W Requirement
 A speech engine component which takes
IPA and speaks it out.

Pilot Project : Design
 A number of use cases were conceived and their
implementations were provided as separate
Java classes.

Pilot Project : Conclusion
 We cannot continue developing a TTS
system with an “outsider” speech engine, as
the accent and other characteristics need to be
Gujarati in nature.

Starting GTTS from Scratch
 From the result of the Pilot Project we
concluded that the speech engine had to be
developed with Gujarati in mind.
 The concatenative approach was chosen
since it provides naturalness and has a proven
track record.

Concatenation
 In the concatenative approach, already stored
segments of sound are joined together to
produce the complete speech.
 Such segments are known as concatenation
units.
 We used partnemes as our concatenation
units.

Partnemes
 A partneme is a very small segment of sound
whose typical length ranges from 8 ms to
100 ms. We get partnemes by cutting up
recorded speech.
 But before understanding what a partneme is,
we have to understand human speech in
greater detail, especially the relation
between speech and syllable.

How we speak ?
 During normal breathing, the period we
devote to breathing in is longer than that of
breathing out in a complete breath cycle.
 But when we start speaking, the breath-in
period becomes shorter, paving the way for
a longer breath-out period.
 This is because to speak (anything) we
need some air-flow. We use the air-flow …

How we speak ? : Human Vocal Tract
… powered by the lungs, during breath-out.
 This air-flow is modified at various points
of the human vocal tract, ending up as one
kind of speech sound (phone) or another.
 The human vocal tract comprises various
organs which, in one way or another,
change the air-flow.
 Human Vocal Tract …

How we speak ? : Syllable and Speech
 During one complete breath cycle
we can speak more than one phone.
 All the phones spoken in just one
breath cycle constitute a syllable.
 A continuous sequence of such syllables
forms speech.

How we speak ? : Syllable Structure
 It is important to know the structure of a
syllable in order to understand partnemes.
 Typically a syllable is made up of a vowel as
the nucleus with consonants around it.
 Gujarati employs the following syllable
structure.
< C + C + C + V + V̯ + C + C >

 < C + C + C + V + V̯ + C + C >
where C - consonant
V - vowel
V̯ - non-syllabic vowel
 An utterance (spoken word) is made up of a
series of such syllables.

 રામ - ɾam is made up of a single syllable.
Here the structure becomes
< ɾ(C) + a(V) + m(C) >
 પત્ર - pətɾ is also made up of a single syllable.
< p(C) + ə(V) + t(C) + ɾ(C) >
 લશ્કર - ləʃkəɾ is made up of two syllables.
< l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >
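
The syllable template can also be checked mechanically. The sketch below is an illustration only: it assumes the optional slots of the template may be empty, treats the non-syllabic vowel as an optional second V, and uses a small vowel set that is an assumption of this example.

```python
import re

# Assumed vowel phones for the example; everything else counts as a consonant.
VOWELS = {"ə", "a", "i", "o", "u", "e"}

# Up to three consonants, a vowel, an optional second vowel,
# then up to two consonants.
SYLLABLE = re.compile(r"C{0,3}VV?C{0,2}")


def cv_skeleton(phones):
    """Reduce a list of phones to a string of C and V marks."""
    return "".join("V" if p in VOWELS else "C" for p in phones)


def is_valid_syllable(phones):
    """True if the phones fit the template as one single syllable."""
    return SYLLABLE.fullmatch(cv_skeleton(phones)) is not None


print(cv_skeleton(["ɾ", "a", "m"]), is_valid_syllable(["ɾ", "a", "m"]))
print(is_valid_syllable(["l", "ə", "ʃ", "k", "ə", "ɾ"]))
```

The second call fails the check because ləʃkəɾ has two vowel nuclei and therefore cannot be a single syllable.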

How we speak ? : Consonants and Vowels
 Consonants and vowels are two different
kinds of speech sound with different
acoustic parameters.
 To know the exact difference between
consonants and vowels we have to
understand how a single vocal tract is
capable of producing so many different
sounds.

How we speak ? : Articulation
 Modification of the air-flow is achieved by
articulation of the various speech organs of the
vocal tract.
 The exact nature of the speech sound
produced during breath-out is determined
by
1 Place of Articulation
2 Manner of Articulation

How we speak ? : Place of articulation
 Place of articulation refers to the exact point
in the human vocal tract where articulation
happens.
e.g. [p] - two lips
[k] - back of tongue with velum
[ɾ] - tip of tongue with alveolar ridge

How we speak ? : Manner of articulation
 Manner of articulation refers to the degree
of constriction made during articulation.
e.g. [p] - stop or plosive
[ʧ] - affricate
[ɾ] - tap
[j] - glide
[o] - vowel (no constriction)

How we speak ? : Voicedness
 If the vocal cords are vibrating (and thus
changing the air-flow) while the air-flow
travels through the glottis, we get a voiced
sound.
e.g. [g] - voiced
[k] - unvoiced

How we speak ? : Aspiration
 Aspiration refers to the state of the vocal cords
during the final stage of producing a phone.
When we speak an aspirated phone, the vocal
cords gradually approach the vibrating state
as time goes on (irrespective of the phone’s
voicedness).
e.g. [kʰ] - aspirated
[k] - unaspirated
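
Place, manner, voicedness and aspiration together act as a feature description of a phone. A minimal sketch of such a table follows; the class name `Phone`, the feature labels and the small inventory are assumptions made for this example and cover only a few phones mentioned on the slides.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Phone:
    symbol: str
    place: str        # where in the vocal tract the articulation happens
    manner: str       # degree of constriction
    voiced: bool      # vocal cords vibrating or not
    aspirated: bool = False


# A tiny illustrative inventory, not a full phone set for Gujarati.
PHONES = {
    p.symbol: p
    for p in [
        Phone("p", "bilabial", "plosive", voiced=False),
        Phone("k", "velar", "plosive", voiced=False),
        Phone("kʰ", "velar", "plosive", voiced=False, aspirated=True),
        Phone("g", "velar", "plosive", voiced=True),
        Phone("ɾ", "alveolar", "tap", voiced=True),
    ]
}


def differing_features(a, b):
    """List the features in which two phones differ."""
    fa, fb = asdict(PHONES[a]), asdict(PHONES[b])
    return [name for name in fa if name != "symbol" and fa[name] != fb[name]]


print(differing_features("k", "g"))
print(differing_features("k", "kʰ"))
```

Two phones that share every feature except one, such as [k] and [g], differ only in that single articulatory setting.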

Segmentation and Partneme
 Partnemes are obtained by segmenting
the recorded syllable.
 Shown is the sound waveform for ગમન, built from
partnemes. Red lines mark the separations.

Partnemes
 As shown, a syllable is logically divided into
 null sound to consonant transition
 core consonant
 consonant to vowel transition
 core vowel
 vowel to consonant transition
 core consonant
 consonant to null sound transition

Partnemes
 If we can provide the partnemes for each
vowel and consonant, we can join them
accordingly to produce any complete syllable
and hence any utterance.
e.g.
કરણ - kəɾəɳ
0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0

ભારત - bʰaɾət
0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
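
The partneme sequences in these examples follow a regular pattern: a transition into each phone, the core of the phone, and a final transition to the null sound. A minimal sketch that generates such a sequence from a list of phones is given below; the function name `partneme_sequence` is an assumption, and `0` stands for the null sound as in the examples.

```python
NULL = "0"  # the null sound (silence) at the start and end of an utterance


def partneme_sequence(phones):
    """Return the partneme names needed to speak the given phones.

    Each phone contributes a transition from the previous sound and its
    own core; the utterance ends with a transition back to the null sound.
    """
    if not phones:
        return []
    units = []
    previous = NULL
    for phone in phones:
        units.append(f"{previous}_{phone}")  # transition partneme
        units.append(phone)                  # core partneme
        previous = phone
    units.append(f"{previous}_{NULL}")
    return units


print(";".join(partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])))
# prints 0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
```

A word of n phones therefore needs 2n + 1 partnemes.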

Core Engine
 The speech engine we developed to
concatenate such partneme sequences,
based on the given IPA, uses a pair of files.
 One, called the Voice File, contains the audio
data of all the partnemes.
 The other serves as an index to the
Voice File and is called the Voice Info File.
It contains the position and length of each
partneme in the Voice File.
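
The lookup described above can be sketched as follows. The actual formats of the Voice File and the Voice Info File are not given on the slides, so this example assumes raw audio bytes for the former and one `name offset length` line per partneme for the latter; both formats and all function names are hypothetical.

```python
def parse_voice_info(text):
    """Parse hypothetical 'name offset length' lines into a lookup table."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, offset, length = line.split()
        info[name] = (int(offset), int(length))
    return info


def concatenate(units, voice_data, voice_info):
    """Join the audio of the named partnemes, in order, into one byte string."""
    output = bytearray()
    for name in units:
        offset, length = voice_info[name]
        output += voice_data[offset:offset + length]
    return bytes(output)


# Toy data: letters stand in for audio samples.
voice_data = b"AAAABBCCCCCC"
voice_info = parse_voice_info("0_k 0 4\nk 4 2\nk_0 6 6\n")

print(concatenate(["0_k", "k", "k_0"], voice_data, voice_info))
```

Because the index stores only positions and lengths, the audio itself is kept once, and the engine never needs to know which language the partnemes came from.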

Core Engine
 The Core Engine realizes the use case of
having a speech engine.

Language Dependent Components
 Since the Core Engine only understands IPA
sequences, we have to provide a component
which translates Gujarati text into an IPA
sequence.
 Preprocessing capabilities also need to
be developed for a complete TTS system.
 Unlike the Core Engine, both of these
components would be specific to a particular
language and …

Language Dependent Components
… are therefore kept aside as language dependent
components.
 Preprocessor :
As preprocessing should be highly
customizable by the end user, we
have provided a text file which can be
edited to control the functionality of the
preprocessor.

 IPATranscriptor : This component currently
provides only phonetic translation of the given
Gujarati text as complete rules for prosodic
translation are not available.

Thanks
 Prof. Bhartiben Modi
 Mr. Ajay Sarvaiya
 Mr. Irshad Shaikh
 Mr. Mihir Trivedi

Sloka
Having grasped the meanings through the intellect, the soul joins the mind with the desire
to speak. The mind kindles the bodily fire, and that (bodily fire) drives the vital breath.
That driven breath strikes the head (murdha), reaches the mouth and, passing through the
respective places, brings forth the five kinds of speech sounds (varnas) through the
contribution of tone, duration, place, and external and internal effort.
- Paniniya Shiksha, tenth chapter, karikas 6, 9.
