Presentation on the development of a text-to-speech system for Gujarati. The input is arbitrary Gujarati Unicode text and the output is the equivalent speech sound.
2.
* PROJECT PROFILE *
Objective : Developing a Text-to-Speech
System for Gujarati
3.
* PROJECT PROFILE *
Under the guidance of
Prof. Ram Mohan
Shri Jignesh Dholakia
4.
* PROJECT PROFILE *
At the Resource Centre for Indian Language
Technology Solutions in Gujarati,
Faculty of Arts,
The M. S. University of Baroda, BARODA.
5.
Next 25 minutes …
> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components
7.
What makes different sounds ?
The factors responsible for the perceptual
difference between one kind of sound and
another are
Amplitude (or volume), which tells how much
power the air-flow carries
Frequency (or pitch), which tells at what rate
the air-flow repeats itself
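The two factors can be illustrated with a pure tone. The sketch below is not part of the project; the function name, the sample rate and the example values are assumptions chosen only for illustration.

```python
import math

def sine_wave(amplitude, frequency_hz, duration_s=0.01, sample_rate=8000):
    """Return samples of a pure tone.

    amplitude    -> how strong the air-flow variation is (volume)
    frequency_hz -> how many times per second the pattern repeats (pitch)
    """
    n = round(duration_s * sample_rate)
    return [amplitude * math.sin(2 * math.pi * frequency_hz * t / sample_rate)
            for t in range(n)]

# Two tones that differ in both factors: one quiet and low, one loud and high.
quiet_low = sine_wave(amplitude=0.2, frequency_hz=100)
loud_high = sine_wave(amplitude=0.8, frequency_hz=400)
```

Changing only `amplitude` scales the samples, while changing only `frequency_hz` changes how quickly the same shape repeats.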
8.
The “Source” doesn’t matter
An air-flow of kind A will sound the same
whether it is generated by source X
or source Y.
9.
Speech Sound
A kind of sound whose source is the
human vocal organs and which
finds its place in human speech.
e.g. ક્ , સ્ , અ , ઈ
A standard called the International Phonetic
Alphabet (IPA) is used to depict such sounds.
10.
IPA
IPA covers almost all the speech sounds
of all the languages in the world.
Speech sounds are more formally known as
phones.
IPA uses a set of symbols to represent them,
e.g. k , s , ə , i , ʤ
IPA Chart …
12.
Synthesized Speech Sound
If we can produce the same pattern of
air-flow that the human vocal organs
produce for a speech sound, we can say
that we have synthesized that speech
sound.
14.
Text-to-Speech Systems
A speech synthesizer that is smart enough
to produce speech output equivalent to the
given text.
The smartness lies in making the
output as natural and intelligible as
possible.
15.
Text-to-Speech Systems
Usually, a TTS system is specific to
one human language and accepts input
text only in that language.
16.
Basic structure of TTS Systems
The function of any TTS system is generally
divided into three subtasks or phases.
I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production
The input text travels through these
phases one by one and eventually ends
up as speech.
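The three phases can be pictured as a chain of functions, each consuming the output of the previous one. The sketch below only shows that data flow: the function names and the placeholder bodies are assumptions, not the project's code.

```python
def preprocess(raw_text):
    # Phase I (placeholder): a real preprocessor would expand
    # abbreviations, numbers and dates; here we only split into words.
    return raw_text.split()

def phonetic_prosodic_translation(words):
    # Phase II (placeholder): pretend each character is already one phone.
    return [list(word) for word in words]

def speech_production(phone_groups):
    # Phase III (placeholder): return a printable stand-in for audio.
    return " | ".join("-".join(phones) for phones in phone_groups)

def text_to_speech(raw_text):
    # The text travels through the phases one by one.
    return speech_production(phonetic_prosodic_translation(preprocess(raw_text)))
```

Keeping the phases as separate functions means any one of them can be replaced without touching the other two.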
17.
Preprocessing
“Dr. Ajay Shah will come to clinic on 23, Jan.”
We read it as …
“DOCTOR Ajay Shah will come to clinic on
TWENTY THIRD OF JANUARY”.
Preprocessing converts the input text
from its raw form into pronounceable
words.
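A minimal sketch of this kind of expansion, using the English sentence from the slide. The lookup tables are deliberately tiny and hypothetical, and the date pattern covers only the "day, Mon" form seen in the example; a real preprocessor would need far more rules.

```python
import re

# Hypothetical, incomplete lookup tables (illustration only).
ABBREVIATIONS = {"Dr.": "DOCTOR"}
MONTHS = {"Jan": "JANUARY", "Feb": "FEBRUARY"}
ORDINALS = {1: "FIRST", 2: "SECOND", 3: "THIRD", 23: "TWENTY THIRD"}

# A day number, optional comma and spaces, then a three-letter month.
DATE = re.compile(r"\b(\d{1,2})\s*,?\s*([A-Z][a-z]{2})\b")

def expand_date(match):
    day, month = int(match.group(1)), match.group(2)
    if day not in ORDINALS or month not in MONTHS:
        return match.group(0)  # leave anything we cannot expand untouched
    return f"{ORDINALS[day]} OF {MONTHS[month]}"

def preprocess(text):
    # Expand abbreviations first, then dates.
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)
    return DATE.sub(expand_date, text)
```

Usage: `preprocess("Dr. Ajay Shah will come to clinic on 23 ,Jan.")` yields the spoken form of the sentence, with the final full stop kept.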
18.
Phonetic-Prosodic Translation
This phase can be logically divided into two
sub-phases,
• Phonetic Translation
• Prosodic Translation
Real TTS systems may implement these
sub-phases separately or as a unit, but together
they provide the data for the next phase of TTS.
19.
Phonetic Translation
In human languages, the script in use
does not necessarily possess a one-to-one
mapping with speech.
e.g.
enough is pronounced as INAF / inəf (IPA)
છોકરો is pronounced as છોક્રો / ʧokɾo (IPA)
20.
Phonetic Translation
Phonetic Translation tells the next phase
exactly which speech sounds (phones) are
to be produced for the given text.
Phonetic Translation is also known as
Letter-to-Sound rules.
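A rough idea of what letter-to-sound rules look like for Gujarati script is sketched below. It is not the project's rule set. The tables cover only a handful of letters, and the two rules used (an inherent ə after a bare consonant, dropped at the end of a word) are simplifying assumptions that happen to fit the single-word examples in this deck. Medial schwa deletion, as in છોકરો, and independent vowel letters are not handled.

```python
# Hypothetical, deliberately tiny tables (illustration only).
CONSONANTS = {"ક": "k", "ગ": "g", "ણ": "ɳ", "ત": "t", "ન": "n",
              "પ": "p", "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ"}
VOWEL_SIGNS = {"\u0abe": "a",   # GUJARATI VOWEL SIGN AA
               "\u0abf": "i",   # GUJARATI VOWEL SIGN I
               "\u0acb": "o"}   # GUJARATI VOWEL SIGN O
VIRAMA = "\u0acd"               # suppresses the inherent vowel
SCHWA = "ə"

def to_ipa(word):
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch not in CONSONANTS:
            raise ValueError(f"letter not covered by this sketch: {ch!r}")
        phones.append(CONSONANTS[ch])
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if nxt == VIRAMA:            # consonant without a vowel
            i += 2
        elif nxt in VOWEL_SIGNS:     # consonant + explicit vowel sign
            phones.append(VOWEL_SIGNS[nxt])
            i += 2
        else:                        # bare consonant: inherent schwa
            phones.append(SCHWA)
            i += 1
    # Assumed rule: a word-final inherent schwa is not pronounced.
    if len(phones) > 1 and phones[-1] == SCHWA:
        phones.pop()
    return "".join(phones)
```

Even this small sketch shows why a plain character table is not enough: the pronunciation of a letter depends on what follows it.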
21.
Prosodic Translation
Letter-to-sound rules only provide
information about the kind of speech
sound to be generated. To convey the
emotions and expressions residing in the
input text, prosody needs to be applied.
By prosody we mean
Amplitude + Pitch + Duration
22.
Speech Production
This phase is responsible for the actual output
of speech.
It uses the phonetic and prosodic
information provided by the previous
phase.
Various approaches exist for the production of
speech.
23.
Different ways for Speech Production
Three widely used approaches to speech
production are
• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis
The speech production part of a TTS system
is generally called the speech engine.
24.
Use cases
Once we understood the structure of TTS
systems, we realized that all three phases are
required in order to develop a complete TTS
system for Gujarati.
At the topmost level of abstraction, a use case
can be conceived for fulfilling the requirement
of having a TTS system for Gujarati.
25.
Use cases
The topmost use case can then be divided
into three further use cases, each fulfilling
the requirement of one of the three phases.
During the project we tried to realize each
use case one by one.
26.
Pilot Project
As we approached the various requirements
and use cases to be realized, we found that
developing a Preprocessor is not as
significant as developing the other two
phases, so we decided to develop it later.
We decided to develop the Phonetic-Prosodic
Translation phase first, as it can easily be
plugged into any already built … speech
27.
Pilot Project
… speech engine that takes its input in terms
of IPA.
FreeTTS, IBMJS, Dhvani and Narad were
studied.
We used the Java Speech API along with IBMJS
as the speech engine.
The input to the engine was provided through
Java Speech API Markup Language (JSML).
28.
Pilot Project : Objective
To develop a TTS system using an already
available speech engine, supplying the engine
with the IPA transcription of the target
Gujarati Unicode text.
29.
Pilot Project : S/W Requirement
A speech engine component which takes
IPA and speaks it out.
30.
Pilot Project : Design
A number of use cases were conceived and their
implementations were provided as separate
Java classes.
31.
Pilot Project : Conclusion
We cannot continue developing a TTS
system with an “outsider” speech engine, as
the accent and other characteristics need to be
Gujarati in nature.
32.
Starting GTTS from Scratch
From the result of the Pilot Project we
concluded that the speech engine must be
developed with Gujarati in mind.
The concatenative approach was chosen
since it provides naturalness and has a proven
track record.
33.
Concatenation
In the concatenative approach, already stored
segments of sound are joined together to
produce the complete speech.
Such segments are known as concatenation
units.
We used partnemes as our concatenation
unit.
34.
Partnemes
A partneme is a very small segment of sound
whose typical length ranges from 8 ms to
100 ms. We get the partnemes by cutting
up the recorded speech.
But before understanding what a partneme is,
we have to understand human speech in
greater detail, especially the relation
between speech and syllable.
35.
How we speak ?
During normal breathing, the period we
devote to breathing in is longer than that of
breathing out in a complete breath cycle.
But when we start speaking, the breath-in
period becomes shorter, paving the way for
a longer breath-out period.
This is because to speak (anything) we
need some air-flow. We use the air-flow …
36.
How we speak ? : Human Vocal Tract
… powered by the lungs during breath-out.
This air-flow is modified at various points
of the human vocal tract, ending up as one
or another kind of speech sound (phone).
The human vocal tract comprises various
organs which, in one way or another,
change the air-flow.
Human Vocal Tract …
39.
How we speak ? : Syllable and Speech
During one complete breath cycle
we can speak more than one phone.
All the phones spoken in just one
breath cycle constitute a syllable.
A continuous sequence of such syllables
forms speech.
40.
How we speak ? : Syllable Structure
It is important to know the structure of a
syllable in order to understand partnemes.
Typically a syllable is made up of a vowel as
the nucleus with consonants around it.
Gujarati employs the following syllable
structure.
< C + C + C + V + V̯ + C + C >
41.
How we speak ? : Syllable Structure
< C + C + C + V + V̯ + C + C >
where C - consonant
V - vowel
V̯ - non-syllabic vowel
An utterance ( spoken word ) is made up of a
series of such syllables.
42.
How we speak ? : Syllable Structure
રામ - ɾam is made up of a single syllable.
Here the structure becomes
< ɾ(C) + a(V) + m(C) >
પત્ર - pətɾ is also made up of a single syllable.
Here the structure becomes
< p(C) + ə(V) + t(C) + ɾ(C) >
લશ્કર - ləʃkəɾ is made up of two syllables.
Here the structure becomes
< l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >
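The syllable template can be checked mechanically. The sketch below assumes that every slot except the vowel nucleus V is optional, which the slides do not state explicitly. The vowel inventory is a simplified, hypothetical set, and the letter `U` stands in for the non-syllabic vowel V̯.

```python
import re

VOWELS = set("əaiueoɛɔ")      # hypothetical, simplified vowel inventory
NON_SYLLABIC = "\u032f"       # combining inverted breve below, as in V̯

# < C + C + C + V + V̯ + C + C >, assuming only V is obligatory.
SYLLABLE = re.compile(r"C{0,3}VU?C{0,2}")

def pattern(phones):
    """Map a list of phones to a string of C / V / U slot letters."""
    slots = []
    for p in phones:
        if p.endswith(NON_SYLLABIC):
            slots.append("U")
        elif p in VOWELS:
            slots.append("V")
        else:
            slots.append("C")
    return "".join(slots)

def is_valid_syllable(phones):
    return SYLLABLE.fullmatch(pattern(phones)) is not None
```

With these assumptions, `["ɾ", "a", "m"]` and `["p", "ə", "t", "ɾ"]` are accepted as single syllables, while the six phones of ləʃkəɾ taken together are rejected and have to be split into two syllables.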
43.
How we speak ? : Consonants and Vowels
Consonants and vowels are two different
kinds of speech sound with different
acoustic parameters.
To know the exact difference between
consonants and vowels we have to
understand how a single vocal tract is
capable of producing so many different
sounds.
44.
How we speak ? : Articulation
Modification of the air-flow is achieved by
articulation of the various speech organs of the
vocal tract.
The exact nature of the speech sound that
comes out during breath-out is determined
by
1 Place of Articulation
2 Manner of Articulation
45.
How we speak ? : Place of articulation
Place of articulation refers to the exact point
in the human vocal tract where articulation
happens.
e.g. [p] - two lips
[k] - back of tongue with velum
[ɾ] - tip of tongue with alveolar ridge
46.
How we speak ? : Manner of articulation
Manner of articulation refers to the degree
of constriction made during articulation.
e.g. [p] - stop or plosive
[ʧ] - affricate
[ɾ] - tap
[j] - glide
[o] - vowel ( no constriction )
47.
How we speak ? : Voicedness
If the vocal cords are vibrating (and thus
changing the air-flow) while the air-flow
travels through the glottis, we get a voiced
sound.
e.g. [g] - voiced
[k] - unvoiced
48.
How we speak ? : Aspiration
Aspiration refers to the state of the vocal cords
during the final stage of speaking a phone.
When we speak an aspirated phone, the vocal
cords approach the vibrating state as time
goes on ( irrespective of the phone's voicedness ).
e.g. [kʰ] - aspirated
[k] - unaspirated
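The four properties discussed so far (place, manner, voicedness, aspiration) can be recorded as a small feature table. The class and function names below are assumptions made for illustration. "bilabial" and "velar" are the usual phonetic terms for "two lips" and "back of tongue with velum".

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phone:
    symbol: str
    place: str
    manner: str
    voiced: bool
    aspirated: bool = False

# A few of the phones used as examples on the slides.
PHONES = {p.symbol: p for p in [
    Phone("p", "bilabial", "plosive", voiced=False),
    Phone("k", "velar", "plosive", voiced=False),
    Phone("kʰ", "velar", "plosive", voiced=False, aspirated=True),
    Phone("g", "velar", "plosive", voiced=True),
    Phone("ɾ", "alveolar", "tap", voiced=True),
]}

def differing_features(a, b):
    """Return the names of the features in which two phones differ."""
    names = ("place", "manner", "voiced", "aspirated")
    return [n for n in names if getattr(a, n) != getattr(b, n)]
```

For example, `differing_features(PHONES["k"], PHONES["g"])` reports only `voiced`, and comparing `k` with `kʰ` reports only `aspirated`.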
49.
Segmentation and Partneme
Partnemes are obtained by segmenting
the recorded syllable.
Shown is the sound waveform for ગમન built from
partnemes. Red lines mark the separations.
50.
Partnemes
As shown, a syllable is logically divided into
null sound to consonant transition
core consonant
consonant to vowel transition
core vowel
vowel to consonant transition
core consonant
consonant to null sound transition
51.
Partnemes
If we can provide the partnemes for each
vowel and consonant, we can join them
accordingly to produce any complete syllable
and hence any utterance.
e.g.
કરણ - kəɾəɳ
0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
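The partneme sequence in the example above follows a regular pattern: a transition unit between every pair of neighbouring sounds, with the core of each phone in between, and the null sound `0` at both ends. A minimal sketch that generates such a sequence from a list of phones is given below; the function name is an assumption, and the naming scheme is simply read off the example.

```python
NULL = "0"   # the "null sound" (silence) at the start and end

def partneme_sequence(phones):
    """Build the ';'-separated partneme names for a list of phones."""
    if not phones:
        return ""
    padded = [NULL] + list(phones) + [NULL]
    units = []
    for left, right in zip(padded, padded[1:]):
        units.append(f"{left}_{right}")   # transition partneme
        if right != NULL:
            units.append(right)           # core partneme of the phone
    return ";".join(units)
```

Usage: `partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])` reproduces the sequence shown for કરણ.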
53.
Core Engine
The speech engine we developed, which
concatenates such a partneme sequence
for the given IPA, uses a pair of files.
One, called the Voice File, contains the audio
data of all the partnemes.
The other serves as an index into the
Voice File and is called the Voice Info File.
It contains the position and length of each
partneme in the Voice File.
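The idea of a data file plus an index file can be sketched as follows. The actual formats of the Voice File and the Voice Info File are not described in the slides, so everything here is an assumption: the "audio" is a fake byte string held in memory, and the index is a plain dictionary from partneme name to (offset, length).

```python
# Stand-in for the Voice File: audio data of all partnemes, back to back.
VOICE_FILE = b"AAAABBCCCCCC"          # fake bytes instead of real samples

# Stand-in for the Voice Info File: name -> (offset, length) in VOICE_FILE.
VOICE_INFO = {
    "0_k": (0, 4),
    "k":   (4, 2),
    "k_0": (6, 6),
}

def synthesize(partneme_names, voice=VOICE_FILE, info=VOICE_INFO):
    """Concatenate the stored segments for the given partneme names."""
    out = bytearray()
    for name in partneme_names:
        offset, length = info[name]   # raises KeyError for unknown names
        out += voice[offset:offset + length]
    return bytes(out)
```

Keeping the index separate from the audio means a different voice can be supplied by replacing the two files together, without changing the engine.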
55.
Language Dependent Components
Since the Core Engine only understands an IPA
sequence, we have to provide a component
which translates Gujarati text into an IPA
sequence.
Preprocessing capabilities also need to
be developed for a complete TTS system.
Unlike the Core Engine, both of these
components are specific to a particular
language and …
56.
Language Dependent Components
… are therefore kept aside as language dependent
components.
Preprocessor :
As preprocessing should be highly
customizable by the end user, we
have provided a text file which can be
edited to control the behaviour of the
preprocessor.
57.
IPATranscriptor : This component currently
provides only phonetic translation of the given
Gujarati text, as complete rules for prosodic
translation are not available.
59.
Sloka
Having grasped the meanings through the intellect, the soul yokes the mind
with the desire to speak. The mind kindles the bodily fire, and that (bodily fire)
impels the vital breath. That impelled breath, striking against the crown (of the head),
reaching the mouth and passing through the various places, brings forth
speech sounds of five kinds through the contribution of tone, duration, place,
and external and internal effort.
- Pāṇinīya Śikṣā, tenth chapter, kārikā 6, 9.