Text-to-Speech System for
Gujarati
Project Presentation by Samyak Bhuta

* PROJECT PROFILE *
Objective: Developing a Text-to-Speech System for Gujarati
Under the guidance of
 Prof. Ram Mohan
 Shri Jignesh Dholakia
At the Resource Centre for Indian Language
Technology Solutions in Gujarati,
Faculty of Arts,
The M. S. University of Baroda, Baroda.

Next 25 minutes …
> Sound and Speech Sound
> ABC of TTS Systems
> Pilot Project
> GTTS from scratch
> Speech, Syllable and Partneme
> Speech Sounds in detail
> Core Engine
> Language Dependent Components

Sound : a flow of air
[Diagram: air flows from the Source to the Ear and is heard as Sound]

What makes different sounds ?
 The factors responsible for the perceptual
difference between one kind of sound and
another are
 Amplitude (perceived as volume), which tells how much
power the air-flow carries
 Frequency (perceived as pitch), which tells at what rate
the air-flow pattern repeats itself
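The two factors above can be illustrated with a pure tone. This is only a minimal sketch: the function name `sine_wave`, the sample rate and the duration are assumptions chosen for the example and are not part of the project.

```python
import math

def sine_wave(amplitude, frequency_hz, duration_s=0.01, sample_rate=8000):
    """Return samples of a pure tone.

    amplitude controls how strong the wave is (volume);
    frequency_hz controls how often the pattern repeats per second (pitch).
    sample_rate and duration_s are illustrative defaults.
    """
    n_samples = round(duration_s * sample_rate)
    return [
        amplitude * math.sin(2 * math.pi * frequency_hz * t / sample_rate)
        for t in range(n_samples)
    ]

# Two tones that differ in both factors.
quiet_low = sine_wave(amplitude=0.2, frequency_hz=200)
loud_high = sine_wave(amplitude=0.8, frequency_hz=400)
```

Changing only `amplitude` scales the samples, while changing only `frequency_hz` changes how quickly the same shape repeats.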

The “Source” doesn’t matter
 An air-flow of kind A will sound the same
whether it is generated by source X
or source Y.

Speech Sound
 A kind of sound whose source is the
human vocal apparatus and which
finds its place in human speech.
 e.g. ક્ , સ્ , અ , ઈ
 A standard called the International Phonetic
Alphabet (IPA) is used to represent such sounds

IPA
 The IPA covers almost all the speech sounds
of all languages in the world.
 Speech sounds are more formally known as
Phones
 The IPA uses a set of symbols to represent them
e.g. k , s , ə , i , ʤ
 IPA Chart …

IPA Chart
[Figure: IPA chart]

Synthesized Speech Sound
 If we can produce the same pattern of
air-flow that the human vocal apparatus
produces for a speech sound,
we can say that we have synthesized that
speech sound

Speech Synthesizer
 A mechanism capable of producing
synthesized speech sounds in a controlled
manner.

Text-to-Speech Systems
 A Speech Synthesizer which is smart enough
to produce the equivalent speech output for a
given text.
 The smartness lies in making the
output as natural and intelligible as
possible.
 Usually, a TTS System is specific to
one human language and takes input
text only in that language

Basic structure of TTS Systems
 The function of any TTS System is generally
divided into three subtasks or phases.
I. Preprocessing
II. Phonetic-Prosodic Translation
III. Speech Production
 The text input travels through these
phases, one by one, and eventually ends
up as speech.
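The three phases can be pictured as functions applied one after another. The sketch below is only a skeleton under stated assumptions: all function names are hypothetical, and each body is a placeholder standing in for the real work of that phase.

```python
def preprocess(text):
    # Phase I placeholder: a real preprocessor would expand
    # abbreviations, numbers and dates into pronounceable words.
    return text.strip()

def phonetic_prosodic_translation(words):
    # Phase II placeholder: every non-space character stands in for a
    # phone, and each phone gets neutral prosody values.
    neutral = {"amplitude": 1.0, "pitch": 1.0, "duration": 1.0}
    return [(ch, dict(neutral)) for ch in words if not ch.isspace()]

def speech_production(phones):
    # Phase III placeholder: a real speech engine would return audio;
    # here we only list the phones that would be spoken.
    return " ".join(phone for phone, _prosody in phones)

def text_to_speech(text):
    # The input travels through the phases one by one.
    return speech_production(phonetic_prosodic_translation(preprocess(text)))
```

Keeping the phases as separate functions means any one of them can be replaced without touching the other two.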

Preprocessing
 “Dr. Ajay Shah will come to clinic on 23 Jan.”
 We read it as …
“DOCTOR Ajay Shah will come to clinic on
TWENTY THIRD OF JANUARY”.
 Preprocessing is meant to convert
the input text from its raw form into
pronounceable words.
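A minimal sketch of this kind of normalization, using the English example from the slide. The lookup tables and the names `normalize` and `ordinal_words` are assumptions made for illustration; the project's own preprocessor targets Gujarati and is not shown in the slides.

```python
import re

# Illustrative lookup tables (assumptions, not the project's data).
ABBREVIATIONS = {"Dr.": "Doctor", "Prof.": "Professor"}
MONTHS = {"Jan": "January", "Feb": "February", "Mar": "March"}
ORDINALS = [
    "", "first", "second", "third", "fourth", "fifth", "sixth", "seventh",
    "eighth", "ninth", "tenth", "eleventh", "twelfth", "thirteenth",
    "fourteenth", "fifteenth", "sixteenth", "seventeenth", "eighteenth",
    "nineteenth",
]

def ordinal_words(day):
    """Spell out a day of the month (1 to 31) as an ordinal."""
    if day < 20:
        return ORDINALS[day]
    tens, ones = divmod(day, 10)
    if ones == 0:
        return "twentieth" if tens == 2 else "thirtieth"
    return ("twenty " if tens == 2 else "thirty ") + ORDINALS[ones]

# A day number, an optional stray comma, then a known month abbreviation.
DATE = re.compile(r"\b(\d{1,2})\s*,?\s*(" + "|".join(MONTHS) + r")\b\.?")

def normalize(text):
    """Expand known abbreviations and day-month dates into words."""
    for short, full in ABBREVIATIONS.items():
        text = text.replace(short, full)

    def expand(match):
        day = int(match.group(1))
        if not 1 <= day <= 31:
            return match.group(0)  # leave implausible dates untouched
        return f"{ordinal_words(day)} of {MONTHS[match.group(2)]}"

    return DATE.sub(expand, text)

print(normalize("Dr. Ajay Shah will come to clinic on 23 ,Jan."))
# prints "Doctor Ajay Shah will come to clinic on twenty third of January"
```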

Phonetic-Prosodic Translation
 This phase can be logically divided into two
sub-phases,
• Phonetic Translation
• Prosodic Translation
 Real TTS Systems may implement these
sub-phases separately or as a unit, but together
they provide the data for the next phase of TTS.

Phonetic Translation
 In human languages, the script in use
doesn’t necessarily have a one-to-one
mapping with speech.
 e.g.
“enough” is pronounced as INAF / inəf in IPA
છોકરો is pronounced as છોક્રો / ʧokɾo in IPA
 Phonetic Translation provides
information to the next phase about exactly
which speech sounds (phones) are to be
produced for the given text.
 Phonetic Translation is also referred to as
Letter-to-Sound rules.
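A rough sketch of what a letter-to-sound rule set for Gujarati script might look like. It is not the project's IPATranscriptor. The tables cover only a handful of letters, and the rules (an inherent ə after a bare consonant, dropped at the end of a word) are simplifying assumptions. Cases such as the vowel dropped in the middle of છોકરો are not handled.

```python
VIRAMA = "\u0acd"  # ્ suppresses the inherent vowel of a consonant

# Small illustrative subset of the Gujarati Unicode block.
CONSONANTS = {
    "\u0a95": "k",   # ક
    "\u0a97": "g",   # ગ
    "\u0aa3": "ɳ",   # ણ
    "\u0aa4": "t",   # ત
    "\u0aa8": "n",   # ન
    "\u0aaa": "p",   # પ
    "\u0aad": "bʰ",  # ભ
    "\u0aae": "m",   # મ
    "\u0ab0": "ɾ",   # ર
    "\u0ab2": "l",   # લ
    "\u0ab6": "ʃ",   # શ
}
VOWEL_SIGNS = {
    "\u0abe": "a",   # ા
    "\u0abf": "i",   # િ
    "\u0acb": "o",   # ો
}

def to_ipa(word):
    """Transcribe one Gujarati word using the simplified rules above."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch not in CONSONANTS:
            raise ValueError(f"letter not covered by this sketch: {ch!r}")
        phones.append(CONSONANTS[ch])
        if nxt in VOWEL_SIGNS:
            phones.append(VOWEL_SIGNS[nxt])  # explicit vowel sign
            i += 1
        elif nxt == VIRAMA:
            i += 1                           # consonant cluster, no vowel
        elif nxt != "":
            phones.append("ə")               # inherent vowel, not word-final
        i += 1
    return "".join(phones)
```

With these rules, `to_ipa` returns ɾam for રામ and ləʃkəɾ for લશ્કર.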

Prosodic Translation
 Letter-to-sound mapping only
provides information about which speech
sounds to generate. To convey the
emotions and expressions residing in the
input text, Prosody needs to be applied.
 By Prosody we mean,
Amplitude + Pitch + Duration

Speech Production
 This phase is responsible for the actual output
of the speech.
 The phase uses the phonetic and prosodic
information provided by the previous
phase.
 Various approaches exist for producing
speech.

Different ways for Speech Production
 Three widely used approaches to speech
production are
• Articulatory Synthesis
• Source-Filter Synthesis
• Concatenative Synthesis
 The speech production part of a TTS System
is generally referred to as the speech engine.

Use cases
 Having understood the structure of TTS
Systems, we realized that all three phases are
required in order to develop a complete TTS System
for Gujarati.
 At the topmost abstraction level, a use case
can be conceived for fulfilling the requirement
of having a TTS System for Gujarati.
 The topmost use case can then be divided
into three further use cases, each fulfilling
the requirement of one of the three phases.
 During the project we tried to realize each
use case one by one.

Pilot Project
 As we approached the various requirements
and use cases to be realized, we found that
developing a Preprocessor was less
significant than developing the other two
phases, so we decided to develop it later.
 We decided to develop the Phonetic-Prosodic
Translation phase first, as it could be easily
plugged into any existing speech engine
that takes its input as IPA.
 FreeTTS, IBMJS, Dhvani and Narad were
studied.
 We used the Java Speech API with IBMJS
as the speech engine.
 The input to the engine was provided through
the Java Speech API Markup Language (JSML).

Pilot Project : Objective
 To develop a TTS System using an already
available speech engine, supplying the engine
with the IPA transcription of the target
Gujarati Unicode text.

Pilot Project : S/W Requirement
 A speech engine component which takes
IPA and speaks it out.

Pilot Project : Design
 A number of use cases were conceived and their
implementations were provided as separate
Java classes.

Pilot Project : Conclusion
 We cannot continue developing a TTS
System with an “outsider” speech engine, as
the accent and other characteristics need to be
Gujarati in nature.

Starting GTTS from Scratch
 From the result of the Pilot Project we
concluded that the speech engine must be
developed with Gujarati in mind.
 The concatenative approach was chosen
since it provides naturalness and has a proven
track record.

Concatenation
 In the concatenative approach, already stored
segments of sound are joined together to
produce the complete speech.
 Such segments are known as concatenation
units.
 We used Partnemes as our concatenation
unit.

Partnemes
 A partneme is a very small segment of sound
whose typical length ranges from 8 ms to
100 ms. We obtain partnemes by cutting up
recorded speech.
 But before understanding what a partneme is,
we have to understand human speech in
greater detail, especially the relation
between speech and syllables.

How we speak ?
 During normal breathing, breath-in and
breath-out take roughly comparable time
in a complete breath cycle.
 But when we start speaking, the breath-in
period becomes much shorter, making way for
a longer breath-out period.
 This is because to speak (anything) we
need some air-flow. We use the air-flow
powered by the lungs during breath-out.

How we speak ? : Human Vocal Tract
 This air-flow is modified at various points
of the human vocal tract, ending up as one
or another kind of speech sound (phone).
 The human vocal tract comprises various
organs which, in one way or another,
change the air-flow.
 Human Vocal Tract …
[Figure: human vocal tract]

How we speak ? : Syllable and Speech
 During one complete breath cycle
we can speak more than one phone.
 All the phones spoken in a single
breath cycle constitute a syllable.
 A continuous sequence of such syllables
forms speech.

How we speak ? : Syllable Structure
 It is important to know the structure of a
syllable in order to understand partnemes.
 Typically a syllable is made up of a vowel as its
nucleus with consonants around it.
 Gujarati employs the following syllable
structure.
< C + C + C + V + V̯ + C + C >
where C - consonant
V - vowel
V̯ - non-syllabic vowel
 An utterance (spoken word) is made up of a
series of such syllables.

How we speak ? : Syllable Structure
 રામ - ɾam is made up of a single syllable.
Here the structure becomes
< ɾ(C) + a(V) + m(C) >
 પત્ર - pətɾ is also made up of a single syllable.
Here the structure becomes
< p(C) + ə(V) + t(C) + ɾ(C) >
 લશ્કર - ləʃkəɾ is made up of two syllables.
Here the structure becomes
< l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >
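The syllable template can be written as a pattern over C/V labels. The sketch below assumes that every slot except the vowel nucleus is optional, which is inferred from the examples rather than stated in the slides. The letter G is an arbitrary stand-in for the non-syllabic vowel V̯, and the name `fits_template` is hypothetical.

```python
import re

# Up to three onset consonants, one vowel nucleus, an optional
# non-syllabic vowel (G), and up to two coda consonants.
SYLLABLE_TEMPLATE = re.compile(r"C{0,3}VG?C{0,2}")

def fits_template(pattern):
    """Check a string of C/V/G labels against the syllable template."""
    return SYLLABLE_TEMPLATE.fullmatch(pattern) is not None

# The example syllables: ɾam is CVC, pətɾ is CVCC, ləʃ and kəɾ are CVC.
examples = ["CVC", "CVCC", "CVC", "CVC"]
all_fit = all(fits_template(p) for p in examples)
```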

How we speak ? : Consonants and Vowels
 Consonants and vowels are two different
kinds of speech sound with different
acoustic properties.
 To know the exact difference between
consonants and vowels, we have to
understand how a single vocal tract is
capable of producing so many different
sounds.

How we speak ? : Articulation
 The air-flow is modified by the
articulation of various speech organs of the
vocal tract.
 The exact nature of the speech sound
produced during breath-out is determined
by
1 Place of Articulation
2 Manner of Articulation

How we speak ? : Place of articulation
 Place of articulation refers to the exact point
in the human vocal tract where articulation
happens.
e.g. [p] - two lips
[k] - back of tongue with velum
[ɾ] - tip of tongue with alveolar ridge

How we speak ? : Manner of articulation
 Manner of articulation refers to the degree
of constriction made during articulation.
e.g. [p] - stop or plosive
[ʧ] - affricate
[ɾ] - tap
[j] - glide
[o] - vowel (no constriction)

How we speak ? : Voicedness
 If the vocal cords are vibrating (and thus
changing the air-flow) while the air-flow
passes through the glottis, we get a voiced
sound.
e.g. [g] - voiced
[k] - unvoiced

How we speak ? : Aspiration
 Aspiration refers to the state of the vocal cords
during the final stage of producing
a phone. When we speak an
aspirated phone, the vocal cords gradually
approach the vibrating state as
time goes on (irrespective of the phone's voicedness).
e.g. [kʰ] - aspirated
[k] - unaspirated
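The four properties discussed so far (place, manner, voicedness, aspiration) can be kept together as a small record per phone. This is a hypothetical data structure for illustration only. The class `Phone`, the helper `select` and the four-entry table are assumptions, not the project's representation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Phone:
    symbol: str      # IPA symbol
    place: str       # place of articulation
    manner: str      # manner of articulation
    voiced: bool     # vocal cords vibrating?
    aspirated: bool  # aspirated release?

# A few phones from the examples above.
PHONES = [
    Phone("k", "velar", "stop", voiced=False, aspirated=False),
    Phone("kʰ", "velar", "stop", voiced=False, aspirated=True),
    Phone("g", "velar", "stop", voiced=True, aspirated=False),
    Phone("p", "bilabial", "stop", voiced=False, aspirated=False),
]

def select(phones, **features):
    """Return the symbols of all phones matching the given features."""
    return [
        p.symbol
        for p in phones
        if all(getattr(p, name) == value for name, value in features.items())
    ]

unvoiced_velars = select(PHONES, place="velar", voiced=False)
```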

Segmentation and Partneme
 Partnemes are obtained by segmenting
a recorded syllable.
 Shown is the sound waveform for ગમન, built from
partnemes. Red lines mark the separations.

Partnemes
 As shown, a syllable is logically divided into
 null sound to consonant transition
 core consonant
 consonant to vowel transition
 core vowel
 vowel to consonant transition
 core consonant
 consonant to null sound transition

Partnemes
 If we can provide the partnemes for each
vowel and consonant, we can join them
accordingly to produce any complete syllable
and hence any utterance.
e.g.
કરણ - kəɾəɳ
0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
ભારત - bʰaɾət
0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
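The pattern in these sequences (a transition in from the null sound, then a core unit and a transition for each phone) can be generated mechanically. The sketch below is one possible way to do it. The function name `partneme_sequence` is hypothetical, and the phones are assumed to arrive already split into a list.

```python
NULL = "0"  # the null sound (silence) at either end of the utterance

def partneme_sequence(phones):
    """Build the list of partneme names for a list of phones.

    Each phone contributes its core unit followed by a transition to
    the next phone, or to the null sound after the last phone.
    """
    if not phones:
        return []
    units = [f"{NULL}_{phones[0]}"]
    for i, phone in enumerate(phones):
        following = phones[i + 1] if i + 1 < len(phones) else NULL
        units.append(phone)
        units.append(f"{phone}_{following}")
    return units

print(";".join(partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])))
# prints "0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0"
```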

Core Engine
 The speech engine we developed to
concatenate such a partneme sequence,
based on the given IPA, uses a pair of files.
 One, called the Voice File, contains the audio
data of all the partnemes.
 The other serves as an index into the
Voice File and is called the Voice Info File.
It contains the position and length of each
partneme in the Voice File.
 The Core Engine realizes the use case of
having a speech engine.
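A minimal sketch of how a pair of files like this could drive concatenation. The actual formats of the Voice File and Voice Info File are not described in the slides, so the in-memory layout here is an assumption: the voice data is a byte string, and the info maps each partneme name to an (offset, length) pair. The function name `concatenate` is also hypothetical.

```python
def concatenate(units, voice_data, voice_info):
    """Join the stored audio segments for a sequence of partnemes.

    voice_data: raw audio bytes of all partnemes (stands in for the Voice File)
    voice_info: partneme name -> (offset, length) into voice_data
                (stands in for the Voice Info File)
    """
    audio = bytearray()
    for unit in units:
        offset, length = voice_info[unit]
        audio += voice_data[offset:offset + length]
    return bytes(audio)

# Toy data: letters stand in for audio samples.
voice_data = b"AAABBCCCC"
voice_info = {"0_k": (0, 3), "k": (3, 2), "k_0": (5, 4)}
speech = concatenate(["0_k", "k", "k_0"], voice_data, voice_info)
```

Keeping the index apart from the audio means a partneme can be located without scanning the audio data.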

Language Dependent Components
 Since the Core Engine only understands an IPA
sequence, we have to provide a component
which translates Gujarati text into an IPA
sequence.
 Preprocessing capabilities also need to
be developed for a complete TTS System.
 Unlike the Core Engine, both of these
components are specific to a particular
language and are therefore kept aside as
language dependent components.
 Preprocessor:
As preprocessing should be highly
customizable by the end user, we
have provided a text file which can be
edited to control the behaviour of the
preprocessor.
 IPATranscriptor: This component currently
provides only phonetic translation of the given
Gujarati text, as complete rules for prosodic
translation are not available.

Thanks
 Prof. Bhartiben Modi
 Mr. Ajay Sarvaiya
 Mr. Irshad Shaikh
 Mr. Mihir Trivedi

Sloka
Grasping meanings through the intellect, the soul joins the mind to the desire to speak.
The mind kindles the bodily fire, and that (bodily fire) drives the vital breath.
That driven breath, striking the head, reaching the mouth and
passing through the respective places of articulation, gives rise to five kinds of
speech sounds (varna) through the contribution of accent (svara), duration, place,
and external and internal effort.
- Paniniya Shiksha, tenth chapter, karika 6, 9.

Gujarati Text-to-Speech Presentation

  • 1. 10:02 10:02 Text-to-Speech System for Gujarati Project Presentation by Samyak Bhuta
  • 2. 10:02 10:02 * PROJECT PROFILE * Objective : Developing a Text-to-Speech System for Gujarati
  • 3. 10:02 10:02 * PROJECT PROFILE * Under the guidance of  Prof. Ram Mohan  Shri Jignesh Dholakia
  • 4. 10:02 10:02 * PROJECT PROFILE * At Resorce Centre for Indian Language Technology Solutions in Gujarati, Faculty of Arts, The M. S. University of Baroda, BARODA.
  • 5. 10:02 10:02 Next 25 minutes … > Sound and Speech Sound > ABC of TTS Systems > Pilot Project > GTTS from scratch > Speech , Syllable and Partneme > Speech Sounds in detail > Core Engine > Language Dependent Components
  • 6. 10:02 10:02 Sound : a flow of air Source EarAir flows Sound ♫ ♪ ♫
  • 7. 10:02 10:02 What makes different sounds ?  The factors, responsible for perceptual difference between one kind of sound from the another are  Amplitude (or volume) which tells how much power the air-flow holds within  Frequency (or pitch) which tells at what rate the air-flow is repeating itself
  • 8. 10:02 10:02 The “Source” doesn’t matters  An air-flow of kind A will sound same weather it has generated from source X or source Y.
  • 9. 10:02 10:02 Speech Sound  A kind of sound whose source is Human Vocal Organism and who finds its place in human speech.  e.g. ક્ , સ્ , અ , ઈ  A standard called International Phonetic Alphabet (IPA) is used to depict such sounds
  • 10. 10:02 10:02 IPA  IPA comprises almost all the speech sounds of all languages in the world.  Speech sounds are more formally known as Phones  IPA uses set of symbols to represent them e.g. k , s , ə , i , ʤ  IPA Chart …
  • 12. 10:02 10:02 Synthesized Speech Sound  If we can produce the same pattern of air-flow as it is produced by Human Vocal Organism, representing a speech sound, we can say that we have synthesized the speech sound
  • 13. 10:02 10:02 Speech Synthesizer  A mechanism which is capable of producing synthesized speech sound in controlled manner.
  • 14. 10:02 10:02 Text-to-Speech Systems  A Speech Synthesizer which is smart enough to produce equivalent Speech output of the given text.  The smartness accounts for making the output as natural and intelligible as possible.
  • 15. 10:02 10:02 Text-to-Speech Systems  Usually, the TTS Systems are specific to only one human language and takes input text from only that language
  • 16. 10:02 10:02 Basic structure of TTS Systems  Function of any TTS System is, generally, divided into three subtasks or phases. I. Preprocessing II. Phonetic-Prosodic Translation III. Speech Production  The text input travels through these phases, one by one, and eventually ends up in a speech .
  • 17. 10:02 10:02 Preprocessing  “Dr. Ajay Shah will come to clinic on 23 ,Jan.”  We read it … “DOCTOR Ajay Shah will come to clinic on TWENTY THIRD OF JANUARY”.  The Preprocessing is meant to convert the input text, from raw condition, to pronounceable word text.
  • 18. Phonetic-Prosodic Translation  This phase can be logically divided into two phases: • Phonetic Translation • Prosodic Translation  Real TTS Systems may implement these phases separately or as a unit, but together they provide the data for the next phase of the TTS.
  • 19. Phonetic Translation  In human languages, the script in use doesn’t necessarily possess a one-to-one mapping with speech.  e.g. enough is pronounced as INAF / inəf (IPA); છોકરો is pronounced as છોક્રો / ʧokɾo (IPA)
  • 20. Phonetic Translation  Phonetic Translation tells the next phase exactly which speech sounds (phones) are to be produced for the given text.  Phonetic Translation is also referred to as Letter-to-Sound rules.
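A minimal sketch of what letter-to-sound rules can look like for Gujarati Unicode text follows. The letter tables are deliberately tiny and the three rules (a vowel sign replaces the inherent vowel, a virama suppresses it, a word-final inherent vowel is dropped) are simplifying assumptions; the function name `to_ipa` is hypothetical and this is not the project's IPATranscriptor.

```python
# Tiny, hypothetical letter tables covering only the words used in the slides.
CONSONANTS = {
    "ક": "k", "ગ": "g", "છ": "ʧ", "ણ": "ɳ", "ત": "t", "ન": "n",
    "પ": "p", "ભ": "bʰ", "મ": "m", "ર": "ɾ", "લ": "l", "શ": "ʃ",
}
VOWEL_SIGNS = {"\u0abe": "a", "\u0acb": "o"}  # dependent signs for aa and o
VIRAMA = "\u0acd"                             # suppresses the inherent vowel
INHERENT_VOWEL = "ə"

def to_ipa(word):
    """Return the list of phones for one Gujarati word (simplified rules)."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else None
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:      # explicit vowel replaces the inherent one
                phones.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:         # conjunct: no vowel after this consonant
                i += 1
            elif nxt is not None:       # inherent vowel, except at the word end
                phones.append(INHERENT_VOWEL)
        else:
            phones.append(ch)           # unknown symbols pass through unchanged
        i += 1
    return phones

print("".join(to_ipa("રામ")), "".join(to_ipa("પત્ર")), "".join(to_ipa("લશ્કર")))
```

These rules do not model the word-medial vowel deletion seen in છોકરો, which is one reason a real rule set has to be considerably richer than this.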
  • 21. Prosodic Translation  Letter-to-sound rules only provide information about the kind of speech sound to be generated. To convey the emotions and expressions residing in the input text, Prosody needs to be applied.  By Prosody we mean Amplitude + Pitch + Duration
  • 22. Speech Production  This phase is responsible for the actual output of the speech.  The phase uses the phonetic and prosodic information provided by the previous phase.  Various approaches exist for the production of speech.
  • 23. Different ways for Speech Production  Three widely used approaches for speech production are • Articulatory Synthesis • Source-Filter Synthesis • Concatenative Synthesis  The speech production part of a TTS System is generally referred to as the speech engine.
  • 24. Use cases  Having understood the structure of TTS Systems, we realized that all three phases are required in order to develop a complete TTS for Gujarati.  At the topmost level of abstraction, a use case can be conceived for fulfilling the requirement of having a TTS System for Gujarati.
  • 25. Use cases  The topmost use case can then be divided into three further use cases, each fulfilling the requirement of one of the three phases.  During the project we tried to realize each use case one by one.
  • 26. Pilot Project  As we approached the various requirements and use cases to be realized, we found that developing a Preprocessor is not as significant as developing the other two phases, so we decided to develop it later.  We decided to develop the Phonetic-Prosodic Translation phase first, since it could easily be plugged into any already built … speech
  • 27. Pilot Project … speech engine that takes its input as IPA.  FreeTTS, IBMJS, Dhvani and Narad were studied.  We used the Java Speech API with IBMJS as the speech engine.  The input to the engine was provided through the Java Speech API Markup Language (JSML)
  • 28. Pilot Project : Objective  To develop a TTS System using an already available Speech Engine, supplying the engine with the IPA transcription of the target Gujarati Unicode text.
  • 29. Pilot Project : S/W Requirement  A Speech Engine Component which takes IPA and speaks it out.
  • 30. Pilot Project : Design  A number of use cases were conceived, and their implementation was provided as separate Java classes.
  • 31. Pilot Project : Conclusion  We cannot continue developing a TTS System with an “outsider” speech engine, as the accent and other characteristics need to be Gujarati in nature.
  • 32. Starting GTTS from Scratch  From the result of the Pilot Project we concluded that the Speech Engine must be developed with Gujarati in mind.  The Concatenative approach was chosen since it provides naturalness and has a proven track record.
  • 33. Concatenation  In the Concatenative approach, already stored segments of sound are joined together to produce the complete speech.  Such segments are known as concatenation units.  We used Partnemes as our concatenation unit.
  • 34. Partnemes  A partneme is a very small segment of sound whose typical length ranges from 8 ms to 100 ms. We get partnemes by cutting up recorded speech.  But before understanding what a partneme is, we have to understand human speech in greater detail, especially the relation between speech and syllable.
  • 35. How we speak ?  During normal breathing, the breath-in and breath-out periods of a complete breath cycle are of comparable length.  But when we start speaking, the breath-in period becomes much shorter, making way for a longer breath-out period.  This is because to speak (anything) we need some air-flow. We use the air-flow …
  • 36. How we speak ? : Human Vocal Tract … powered by the lungs during breath-out.  This air-flow is modified at various points of the Human Vocal Tract, ending up as one or another kind of speech sound (phone).  The Human Vocal Tract comprises various organs which, in one way or another, change the air-flow.  Human Vocal Tract …
  • 39. How we speak ? : Syllable and Speech  During one complete breath cycle we can speak more than one phone.  All the phones spoken in just one breath cycle constitute a syllable.  A continuous sequence of such syllables forms speech.
  • 40. How we speak ? : Syllable Structure  It is important to know the structure of a syllable in order to understand partnemes.  Typically a syllable is made up of a vowel as its nucleus with consonants around it.  Gujarati employs the following syllable structure: < C + C + C + V + V̯ + C + C >
  • 41. How we speak ? : Syllable Structure  < C + C + C + V + V̯ + C + C > where C - consonant, V - vowel, V̯ - non-syllabic vowel  An utterance (spoken word) is made up of a series of such syllables.
  • 42. How we speak ? : Syllable Structure  રામ - ɾam is made up of a single syllable; here the structure becomes < ɾ(C) + a(V) + m(C) >.  પત્ર - pətɾ is also made up of a single syllable; here the structure becomes < p(C) + ə(V) + t(C) + ɾ(C) >.  લશ્કર - ləʃkəɾ is made up of two syllables; here the structure becomes < l(C) + ə(V) + ʃ(C) > < k(C) + ə(V) + ɾ(C) >.
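The syllable template can be checked mechanically. The sketch below assumes that every slot except the vowel nucleus is optional (the slides show the template only as a maximum shape), uses a small hypothetical vowel inventory, and marks a non-syllabic vowel with the combining inverted breve below (U+032F); all names are illustrative.

```python
import re

VOWELS = {"a", "ə", "i", "u", "e", "o", "ɛ", "ɔ"}  # hypothetical inventory
NON_SYLLABIC = "\u032f"                             # combining inverted breve below

# Up to three onset consonants, one vowel, an optional non-syllabic vowel (G),
# and up to two coda consonants.
TEMPLATE = re.compile(r"C{0,3}VG?C{0,2}")

def shape(phones):
    """Map a list of phones to a string over C, V and G."""
    labels = []
    for phone in phones:
        if phone.endswith(NON_SYLLABIC) and phone[0] in VOWELS:
            labels.append("G")
        elif phone in VOWELS:
            labels.append("V")
        else:
            labels.append("C")
    return "".join(labels)

def fits_template(phones):
    """True if the phones form one syllable under the assumed template."""
    return TEMPLATE.fullmatch(shape(phones)) is not None

print(shape(["ɾ", "a", "m"]), fits_template(["ɾ", "a", "m"]))
print(shape(["l", "ə", "ʃ", "k", "ə", "ɾ"]), fits_template(["l", "ə", "ʃ", "k", "ə", "ɾ"]))
```

A word such as ləʃkəɾ fails the single-syllable check because it contains two nuclei, which matches its two-syllable analysis.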
  • 43. How we speak ? : Consonants and Vowels  Consonants and vowels are two different kinds of speech sound with different acoustic parameters.  To know the exact difference between consonants and vowels, we have to understand how a single vocal tract is capable of producing so many different sounds.
  • 44. How we speak ? : Articulation  The air-flow is modified by the articulation of various speech organs of the vocal tract.  The exact nature of the speech sound produced during breath-out is determined by 1 Place of Articulation 2 Manner of Articulation
  • 45. How we speak ? : Place of articulation  Place of articulation refers to the exact point in the human vocal tract where the articulation happens. e.g. [p] - two lips [k] - back of tongue with velum [ɾ] - tip of tongue with alveolar ridge
  • 46. How we speak ? : Manner of articulation  Manner of articulation refers to the degree of constriction made during the articulation. e.g. [p] - stop or plosive [ʧ] - affricate [ɾ] - tap [j] - glide [o] - vowel (no constriction)
  • 47. How we speak ? : Voicedness  If the vocal cords are vibrating (and thus changing the air-flow) as the air-flow passes through the glottis, we get a voiced sound. e.g. [g] - voiced [k] - unvoiced
  • 48. How we speak ? : Aspiration  Aspiration refers to the state of the vocal cords during the final stage of producing a phone. When we speak an aspirated phone, the vocal cords gradually approach the vibrating state as time goes on (irrespective of the phone's voicedness). e.g. [kʰ] - aspirated [k] - unaspirated
  • 49. Segmentation and Partneme  Partnemes are obtained by segmenting the recorded syllable.  Shown is the sound waveform for ગમન built with partnemes. Red lines mark the separations.
  • 50. Partnemes  As shown, a syllable is logically divided into  null sound to consonant transition  core consonant  consonant to vowel transition  core vowel  vowel to consonant transition  core consonant  consonant to null sound transition
  • 51. Partnemes  If we can provide the partnemes for each vowel and consonant, we can join them accordingly to produce any complete syllable and hence any utterance. e.g. કરણ - kəɾəɳ : 0_k;k;k_ə;ə;ə_ɾ;ɾ;ɾ_ə;ə;ə_ɳ;ɳ;ɳ_0
  • 52. ભારત - bʰaɾət : 0_bʰ;bʰ;bʰ_a;a;a_ɾ;ɾ;ɾ_ə;ə;ə_t;t;t_0
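The pattern in these examples (a transition from the null sound, then each core phone followed by the transition to its neighbour, ending with a transition back to the null sound) can be generated from a list of phones. The function name `partneme_sequence` and the list-of-strings representation are assumptions for illustration.

```python
def partneme_sequence(phones, null="0"):
    """Expand phones into core units and transition units.

    A transition is written as left_right, with `null` standing for silence.
    """
    if not phones:
        return []
    units = [f"{null}_{phones[0]}"]          # silence into the first phone
    followers = list(phones[1:]) + [null]    # what comes after each phone
    for current, following in zip(phones, followers):
        units.append(current)                # core part of the phone
        units.append(f"{current}_{following}")
    return units

print(";".join(partneme_sequence(["k", "ə", "ɾ", "ə", "ɳ"])))
```

For n phones this yields 2n + 1 units, so the size of the unit inventory depends on how many distinct phone pairs actually occur in the language.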
  • 53. Core Engine  The speech engine we developed to concatenate such a partneme sequence for a given IPA input uses a pair of files.  One, called the Voice File, contains the audio data of all the partnemes.  The other, called the Voice Info File, serves as an index into the Voice File: it contains the position and length of each partneme in the Voice File.
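The slides do not give the layout of either file, so the following is a sketch under invented assumptions: the Voice Info File is taken to be plain text with one `name offset length` entry per line (offsets and lengths in bytes), and the Voice File is treated as raw bytes. The names `parse_voice_info` and `concatenate` are hypothetical.

```python
def parse_voice_info(text):
    """Parse hypothetical 'name offset length' lines into a lookup table."""
    index = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue                          # skip blank lines and comments
        name, offset, length = line.split()
        index[name] = (int(offset), int(length))
    return index

def concatenate(units, index, voice_data):
    """Join the stored audio segments of the given partnemes, in order."""
    output = bytearray()
    for unit in units:
        offset, length = index[unit]          # KeyError if a unit is missing
        output += voice_data[offset:offset + length]
    return bytes(output)

# Toy data standing in for real audio samples.
voice_data = b"AAAABBCCC"
index = parse_voice_info("0_k 0 4\nk 4 2\nk_0 6 3")
speech = concatenate(["0_k", "k", "k_0"], index, voice_data)
```

Keeping the index separate from the audio means the engine itself needs no knowledge of any particular language; only the two files change.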
  • 54. Core Engine  The Core Engine realizes the use case of having a speech engine.
  • 55. Language Dependent Components  Since the Core Engine only understands an IPA sequence, we have to provide a component which translates Gujarati text into an IPA sequence.  Preprocessing capabilities also need to be developed for a complete TTS System.  Unlike the Core Engine, both of these components are specific to a particular language and …
  • 56. Language Dependent Components … are therefore kept aside as language dependent components.  Preprocessor : As preprocessing should be highly customizable by the end user, we have provided a text file which can be edited to control the functionality of the preprocessor.
  • 57.  IPATranscriptor : This component currently provides only the phonetic translation of the given Gujarati text, as complete rules for prosodic translation are not available.
  • 58. Thanks  Prof. Bhartiben Modi  Mr. Ajay Sarvaiya  Mr. Irshad Shaikh  Mr. Mihir Trivedi
  • 59. Sloka Having grasped the meanings through the intellect, the self engages the mind with the desire to speak. The mind kindles the bodily fire, and that (bodily fire) impels the vital breath. That impelled breath, striking against the mūrdhā (the head), reaches the mouth and, passing through the various places of articulation, gives rise to the five kinds of speech sounds (varṇa) through the contribution of accent (svara), duration (kāla), place (sthāna), and external and internal effort (prayatna). - Pāṇinīya Śikṣā, tenth chapter, kārikā 6, 9.