Ry pyconjp2015 karaoke
1. PyCon JP 2015
Renyuan Lyu
呂仁園
Chun-Han Lai
賴俊翰
Karaoke-style Read-aloud System
Chang Gung Univ.
Taiwan
Saturday, Oct. 10, 2 p.m.–2:30 p.m., in 会議室1 (Conference Room 1)
2. CguTextKaraoke
a Karaoke-style Read-aloud System
Using Speech Alignment and Text-to-Speech Technology
Chun-Han Lai (賴俊翰)
Renyuan Lyu (呂仁園)
Chang Gung University (長庚大學)
Taiwan (台灣)
3. Abstract
• A procedure to create a speech-to-text synchronization file from an original text-only file
– can be used to show highlighted text, just like a karaoke machine
– very useful for language learning.
• Cloud-based TTS (text-to-speech) technology, like Google TTS
• Speech-recognition technology, like HTK, for temporal alignment
4. Introduction
• Starting from a text-only file, we use a cloud-based text-to-speech (TTS) technology, like Google Translate/TTS, and a speech-recognition technology, like the Hidden Markov Model Toolkit (HTK), to generate the associated timed-text file, which aligns the text with the speech waveform file along the time axis.
• Python is used not only as a glue to link the different software resources, like Google Translate and HTK, but also as a powerful tool for all the text-processing tasks in this project.
• From such a timed-text file, we also provide a JavaScript-based web app and a Python GUI program that demonstrate time-aligned, highlighted text at the word level, like a karaoke machine, which we consider very useful for language learning.
5. A Karaoke-style Text Read-aloud System
https://www.youtube-nocookie.com/embed/9a5KoXNCagM?start=180
• Karaoke (カラオケ) is a form of interactive
entertainment in which an amateur singer sings
along with recorded music.
• Lyrics are usually displayed on a video screen, along
with a moving symbol, changing color, or music
video images, to guide the singer.
• Here is one of my favorite examples.
https://en.wikipedia.org/wiki/Karaoke
6. Speech Shadowing Technique
for Language Learning
• The motivation of this project
» https://en.wikipedia.org/wiki/Speech_shadowing
– Speech shadowing
• is a language-learning technique in which subjects repeat speech immediately after hearing it.
– A demonstration can be viewed at the following YouTube link.
• “English Speaking Practice: How to improve your English Speaking and Fluency: SHADOWING”
• https://www.youtube.com/watch?v=GVWFGIyNswI
7. Text-to-Speech Synthesis
Given: a piece of text, e.g.,
Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information.
The goal is to obtain its speech.
8. Google TTS API
in a Python module
• pip install gTTS
from gtts import gTTS
aText= 'Wikipedia is a multilingual, ...'
aLang= 'en'
tts= gTTS(text= aText, lang= aLang)
tts.save("aSpeech.mp3")
aText → aSpeech.mp3
https://github.com/pndurette/gTTS
9. FFmpeg
• About FFmpeg
– [https://en.wikipedia.org/wiki/FFmpeg]
– FFmpeg is a free software project that produces libraries and programs for handling multimedia data.
– It is one of the leading multimedia frameworks, able to do many DSP tasks, including ...
• decode, encode,
• transcode, mux, demux, stream, filter and play
10. FFmpeg: converting MP3 to WAV
ffmpeg -i aSpeech.mp3 -y -vn -acodec pcm_s16le -ac 1 -ar 16000 -f wav aSpeech.wav
aSpeech.mp3 → aSpeech.wav
• -acodec pcm_s16le: PCM, 16 bits/sample, little endian
• -ac 1: 1 (mono) channel
• -ar 16000: 16000 samples/sec
Verifying by seeing and hearing:
ffplay aSpeech.wav
Or using an interactive audio tool, like Audacity.
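The conversion can also be driven and checked from Python. The following is a minimal sketch, assuming `ffmpeg` is installed and on the PATH; the helper names `ffmpeg_to_wav_cmd`, `wav_format`, and `convert` are invented here for illustration and are not part of the project. It builds the same command line and then reads the WAV header with the standard `wave` module, as a complement to checking by eye and ear.

```python
import subprocess
import wave


def ffmpeg_to_wav_cmd(src, dst, rate=16000):
    """Build the ffmpeg argument list for mono, 16-bit PCM WAV output."""
    return ["ffmpeg", "-i", src, "-y", "-vn",
            "-acodec", "pcm_s16le", "-ac", "1",
            "-ar", str(rate), "-f", "wav", dst]


def wav_format(path):
    """Return (channels, bits per sample, samples per second) of a WAV file."""
    with wave.open(path, "rb") as w:
        return w.getnchannels(), w.getsampwidth() * 8, w.getframerate()


def convert(src, dst):
    """Run ffmpeg, then confirm the result is mono, 16-bit, 16000 Hz."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst), check=True)
    return wav_format(dst) == (1, 16, 16000)
```

A call such as `convert("aSpeech.mp3", "aSpeech.wav")` would return True when the output has the expected format.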
11. Audacity (audio editor)
• Audacity is a powerful, free, open-source digital audio editor
– Its features include:
• Recording and playing back sounds
• Importing and exporting WAV, MP3, ...
• Viewing and editing via cut, copy, and paste, ...
[Figure: aSpeech.mp3 and aSpeech.wav]
12. Text-to-Speech Alignment
Given: a piece of text and its speech, e.g.,
Wikipedia is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. The name "Wikipedia" is a portmanteau of the words wiki (a technology for creating collaborative websites, from the Hawaiian word wiki, meaning "quick") and encyclopedia. Wikipedia's articles provide links designed to guide the user to related pages with additional information.
The goal is to obtain a ‘Timed-Text’:
start end label (times in seconds)
0.000 0.080 sil
0.080 0.870 wikipedia
0.870 0.990 is
0.990 1.080 a
1.080 2.010 multilingual
2.010 2.140 sil
2.160 2.240 sil
2.240 3.020 webbased
3.020 3.180 sil
3.204 3.354 sil
3.354 4.284 freecontent
4.284 5.374 encyclopedia
5.374 5.774 project
5.774 6.454 supported
6.454 6.754 by
6.754 6.904 the
6.904 7.574 wikimedia
7.574 8.414 foundation
8.414 8.514 sil
8.532 8.622 sil
8.622 8.852 and
8.852 9.242 based
9.242 9.382 on
9.382 9.432 a
9.432 9.982 model
9.982 10.032 of
10.032 10.592 openly
10.592 11.212 editable
11.212 11.802 content
11.802 11.932 sil
:
:
:
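A timed-text file like this is all a karaoke-style display needs: at playback time t, highlight the word whose interval contains t. Here is a minimal sketch, assuming each line holds a start time, an end time (both in seconds), and a label separated by whitespace; the function names `parse_timed_text` and `word_at` are made up for this example and are not the project's own.

```python
import bisect

SAMPLE = """\
0.000 0.080 sil
0.080 0.870 wikipedia
0.870 0.990 is
0.990 1.080 a
1.080 2.010 multilingual
2.010 2.140 sil
2.160 2.240 sil
2.240 3.020 webbased
"""


def parse_timed_text(text):
    """Parse 'start end label' lines into a list of (start, end, label)."""
    rows = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        rows.append((float(parts[0]), float(parts[1]), parts[2]))
    return rows


def word_at(rows, t):
    """Return the label whose interval [start, end) contains t, else None."""
    starts = [row[0] for row in rows]
    i = bisect.bisect_right(starts, t) - 1
    if i >= 0 and rows[i][0] <= t < rows[i][1]:
        return rows[i][2]
    return None


rows = parse_timed_text(SAMPLE)
print(word_at(rows, 0.5))   # a time inside the 'wikipedia' interval
```

A player would call `word_at` with the current playback position on every screen refresh; times that fall in a gap between sentences return None.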
13. Wav splitting
At the sentence level, this is straightforward: extract the time information from the TTS MP3 files, which are received sentence by sentence.
Sentence boundaries
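Once the duration of each sentence clip is known, the sentence boundaries are just running sums. A minimal sketch, assuming the durations (in seconds) have already been measured; how they are read from the MP3 files is left open, and the name `sentence_offsets` is hypothetical. The sample durations are inferred from the timed-text example, where the second and third sentences start at 2.160 s and 3.204 s.

```python
from itertools import accumulate


def sentence_offsets(durations):
    """Return the start time of each sentence in the concatenated audio."""
    ends = list(accumulate(durations))        # running total = sentence end times
    starts = [0.0] + ends[:-1]                # each sentence starts where the previous ended
    return [round(s, 3) for s in starts]      # rounded to milliseconds


durations = [2.16, 1.044, 5.328]              # seconds per sentence clip (assumed input)
print(sentence_offsets(durations))
```

Adding a sentence's offset to the word times found inside that sentence gives word times on the time axis of the whole recording.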
14. Phonetic Transcription
• Speech recognition technology needs to transcribe text into
phonetic symbols, in order to build up phone models.
Original English Text: (ASCII only, perhaps!)
“Wikipedia is a multilingual, web-based, free-content encyclopedia project.”
Transcription in IPA: (needs Unicode)
“wikipedia ɪz ə məltilɪŋwəl, wɛb- best, fri- kɑntɛnt ənsɑjkləpidiə prɑdʒɛkt.”
Transcription in SAMPA: (ASCII only, including non-alphabet symbols)
“wikipedia Iz @ m@ltilINw@l, wEb- best, fri- kAntEnt @nsAykl@pidi@ prAdZEkt.”
http://upodn.com/phon.asp
15. Post-processing of phonetic transcription
• To map or simply clean out all undesired symbols from the various styles of output
– (usually Unicode characters or non-alphabetic symbols)
• For plain English (en),
– The original text is used as an approximation of the phone sequence.
– Although this seems too simple, it has worked well so far.
• For Traditional Chinese (zh-tw),
– Google Translate was used to get phonetic symbols in Pinyin (拼音, pīnyīn), which were then reduced to plain romanized text (tone marks removed).
• For Japanese (ja),
– MeCab has been used recently to get the Katakana (片仮名, カタカナ).
– Romkan has been used to transform katakana to romaji (kunrei).
• Thanks to Python, which does most of the work during this stage of processing!!
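Removing tone marks from Pinyin is one of the cleanup jobs that Python's standard library handles well. This is a minimal sketch of one way to do it, not necessarily the way the project does: decompose each character with `unicodedata` and drop the combining marks. The function name `strip_tone_marks` is invented for illustration.

```python
import unicodedata


def strip_tone_marks(pinyin):
    """Drop combining diacritics so that only plain letters remain."""
    decomposed = unicodedata.normalize("NFD", pinyin)   # 'ī' becomes 'i' + combining macron
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_tone_marks("pīnyīn"))
print(strip_tone_marks("wéijī bǎikē").replace(" ", "_"))
```

Note that this also removes the diaeresis of ü, so 'lǜ' becomes 'lu'; whether that is acceptable depends on how the phone models are defined.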
16. Phonetic transcription for English
– Using regular expression module
import re
enText= '''Wikipedia is a multilingual, web-based,
free-content encyclopedia project.'''
phn= text2phn_en(enText)
# phn == 'wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project'
# cleanup step: strip quotes, hyphens, commas, periods, parentheses,
# and any leading or trailing underscore (metacharacters must be escaped)
pats= r'\'|"|-|^_|_$|,|\.|\(|\)'
phn= re.sub(pats, '', phn)
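For reference, a complete function with this behavior could look like the sketch below. The slide only shows the call, the input, the output, and the cleanup pattern, so the body here is a guess at one reasonable implementation (lower-casing and joining words with underscores are assumptions based on the example output).

```python
import re

# Characters to delete: quotes, hyphen, comma, period, parentheses.
_PUNCT = re.compile(r"""['"\-,.()]""")


def text2phn_en(text):
    """Approximate an English phone sequence by the cleaned-up text itself."""
    phn = "_".join(text.lower().split())   # words joined by underscores
    phn = _PUNCT.sub("", phn)              # remove punctuation
    phn = re.sub(r"_+", "_", phn)          # collapse runs left by removed tokens
    return phn.strip("_")                  # no leading or trailing underscore


enText = '''Wikipedia is a multilingual, web-based,
free-content encyclopedia project.'''
print(text2phn_en(enText))
# wikipedia_is_a_multilingual_webbased_freecontent_encyclopedia_project
```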
17. Phonetic transcription for Traditional Chinese
– Using Google Translate/TTS API
import urllib.request
tcText= '維基百科是一個自由內容'
# ("Wikipedia is a free content ...")
phn= text2phn_tc(tcText)
# phn == 'weiji_baike_shi_yige_ziyou_neirong'
GOOGLE_TTS_URL= 'https://translate.google.com.tw/translate_a/single?dt=bd&dt=ex&dt=at&'
req= urllib.request.Request(GOOGLE_TTS_URL + data)
18. Phonetic transcription for Japanese
– Using MeCab and Romkan
import MeCab
import romkan
jpText= '''ウィキペディアは、
信頼されるフリーなオンライン百科事典、'''
# ("Wikipedia is a trusted, free online encyclopedia,")
phn= text2phn_jp(jpText)
# phn == 'wikipedyia_wa_sil_sinrai_sa_reru_furi-_na_onrain_hyakka_ziten'
y= MeCab.Tagger().parse(text)
...
kun= romkan.to_kunrei(phn)
20. Speech recognition technology
• HMM Toolkit (HTK)
– http://htk.eng.cam.ac.uk/
– Given a speech utterance and its phone sequence, the speech can be aligned with the phones by the ‘forced alignment’ technique of the HMM approach.
– HTK, the Hidden Markov Model Toolkit, provides a convenient set of tools for applying the HMM approach.
22. HTK processing (abstract) ....
• #[00] setting the working dir
• #[01] creating the (hmm) model prototype
• #[02] label processing
• #[03] feature extraction
• #[04] model initialization
• #[05] model training
• #[06] forced alignment
• #[07] post file moving operation
23. HTK processing (detail)....
#[00] setting the working dir
dirName= ./_wav/
#[01] creating the (hmm) model prototype
CreateHProto....
myHmmPro
N = 3 M = 6
#[02] label processing
000, 0,----> ._htkhled -A -i spLab00.mlf -n spLab00.lst -S spLab.scp hL
001, 0,----> ._htkhled -A -i spLab.mlf -n spLab.lst -S spLab.scp hLed.l
002, 0,----> ._htkhled -A -i spLab_p.mlf -n spLab_p.lst -S spLab.scp -I
#[03] feature extraction
003, 0,----> ._htkHCopy -A -C hCopy.conf -S spWav2Mfc.scp 1>> 1.htk.out 2>>
#[04] model initialization
004, 1,----> mkdir hmms_p
005, 0,----> ._htkHCompV -A -m -C hInit.conf -S spMfc.scp -I spLab_p.mlf -M
#[05] model training
006, 0,----> ._htkHERest -A -C hErest.conf -S spMfc.scp -p 1 -t 2000.0 -w 3
007, 0,----> ._htkHERest -A -C hErest.conf -p 0 -t 2000.0 -w 3 -v 0.05 -I sp
: (repeating several times...)
:
#[06] forced alignment
016, 0,----> ._htkHVite -A -a -C hVite.conf -S spMfc.scp -d hmms_p/ -i s
#[07] post file moving operation
017, 1,----> mkdir outDir
018, 1,----> copy spLab_aligned.mlf outDir./_wav_aligned.mlf
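Running such a sequence of external tools is where Python serves as the glue. The sketch below is a generic step runner built on `subprocess`, not the project's actual script: the function name `run_steps` is invented, and the commands are harmless placeholders, because the real HTK command lines are truncated in the log above.

```python
import subprocess
import sys


def run_steps(steps):
    """Run (title, argv) pairs in order; stop at the first non-zero exit code."""
    codes = []
    for title, argv in steps:
        done = subprocess.run(argv, capture_output=True, text=True)
        print(f"{title}: exit {done.returncode}")
        codes.append(done.returncode)
        if done.returncode != 0:
            break                      # later steps depend on earlier ones
    return codes


# Placeholder commands; a real run would list the HTK tool invocations here.
demo_steps = [
    ("#[03] feature extraction",
     [sys.executable, "-c", "print('HCopy would run here')"]),
    ("#[04] model initialization",
     [sys.executable, "-c", "print('HCompV would run here')"]),
]

if __name__ == "__main__":
    run_steps(demo_steps)
```

Stopping at the first failure is a design choice for this sketch: each HTK stage consumes the files written by the previous one, so continuing after an error would only produce misleading output.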
29. HTK summary
HTK Tools: HLed, HCopy, HCompV, HERest, HVite
#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
8700000 9900000 is -855.988770
9900000 10800000 a -693.554871
10800000 20100000 multilingual -7268.197266
20100000 21400000 sil -791.746216
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
8600000 10200000 sil -1048.225220
.
"./_wav/SN2.rec"
0 1500000 sil -1100.892822
1500000 10800000 freecontent -7094.197266
10800000 21700000 encyclopedia -8148.633789
21700000 25700000 project -3247.493896
25700000 32500000 supported -5594.979492
32500000 35500000 by -2412.487305
35500000 37000000 the -1176.310547
37000000 43700000 wikimedia -5128.852051
43700000 52100000 foundation -5995.618164
52100000 53100000 sil -695.872864
.
.
.
spLab_aligned.mlf
wavDir/
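The aligned MLF can be turned into the timed-text format with a few lines of Python. HTK label files give times in units of 100 ns, so dividing by 10,000,000 yields seconds. The sketch below assumes the per-sentence start offsets are supplied by the caller (0.0 and 2.16 match the example data); the names `parse_mlf` and `to_timed_text` are hypothetical.

```python
SAMPLE_MLF = '''#!MLF!#
"./_wav/SN0.rec"
0 800000 sil -578.044434
800000 8700000 wikipedia -5636.368652
.
"./_wav/SN1.rec"
0 800000 sil -541.083069
800000 8600000 webbased -5977.622070
.
'''


def parse_mlf(text):
    """Return {utterance name: [(start_s, end_s, label), ...]}."""
    utterances, current = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "#!MLF!#":
            continue
        if line.startswith('"'):            # a quoted name opens an utterance
            current = line.strip('"')
            utterances[current] = []
        elif line == ".":                   # a lone period closes it
            current = None
        elif current is not None:
            start, end, label = line.split()[:3]   # 4th field (score) is ignored
            utterances[current].append((int(start) / 1e7, int(end) / 1e7, label))
    return utterances


def to_timed_text(utterances, offsets):
    """Shift each utterance by its offset and flatten into one list of rows."""
    rows = []
    for segments, offset in zip(utterances.values(), offsets):
        for start, end, label in segments:
            rows.append((round(offset + start, 3), round(offset + end, 3), label))
    return rows


for row in to_timed_text(parse_mlf(SAMPLE_MLF), [0.0, 2.16]):
    print("%.3f %.3f %s" % row)
```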
30. The major algorithm in HTK
'Holiday Shopping' = 'h'+'o'+'l'+'i'+'d'+'ay'+'sil'+'sh'+'o'+'p'+'I'+'ng'
[Figure: phone units 'h', 'o', ..., 'ng']
• Forced Alignment in HTK
– 1. Given a speech signal
– 2. Do the pronunciation transcription
• Pronunciation symbols must be ASCII only!!
– 3. Train to get the HMM models
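The idea behind forced alignment can be illustrated with a small dynamic-programming sketch. This is not HTK's implementation: it assumes one state per phone, no transition probabilities, and a ready-made table of per-frame log-likelihoods (`loglik`, invented here), whereas HVite works with trained multi-state HMMs. It only shows the core step, a Viterbi search that assigns frames to a fixed left-to-right sequence of units.

```python
def forced_align(loglik):
    """Align frames to units that must appear in a fixed order.

    loglik[t][s] is the log-likelihood of frame t under unit s.
    Returns one (first_frame, end_frame) pair per unit, end exclusive.
    """
    n_frames, n_units = len(loglik), len(loglik[0])
    if n_frames < n_units:
        raise ValueError("need at least one frame per unit")
    neg_inf = float("-inf")
    best = [[neg_inf] * n_units for _ in range(n_frames)]
    back = [[0] * n_units for _ in range(n_frames)]
    best[0][0] = loglik[0][0]              # the path must start in the first unit
    for t in range(1, n_frames):
        for s in range(n_units):
            stay = best[t - 1][s]                            # remain in unit s
            move = best[t - 1][s - 1] if s > 0 else neg_inf  # advance from s-1
            if move > stay:
                best[t][s], back[t][s] = move + loglik[t][s], s - 1
            else:
                best[t][s], back[t][s] = stay + loglik[t][s], s
    # Trace back from the last unit at the last frame.
    path = [n_units - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    # Turn the frame-by-frame path into one segment per unit.
    segments, start = [], 0
    for t in range(1, n_frames + 1):
        if t == n_frames or path[t] != path[t - 1]:
            segments.append((start, t))
            start = t
    return segments


# Toy scores: each frame clearly prefers one unit (0.0 = match, -5.0 = mismatch).
truth = [0, 0, 1, 1, 1, 2]
toy_loglik = [[0.0 if s == u else -5.0 for s in range(3)] for u in truth]
print(forced_align(toy_loglik))   # [(0, 2), (2, 5), (5, 6)]
```

With 10 ms frames, a segment such as (2, 5) would correspond to the interval from 0.02 s to 0.05 s.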
34. A Browser in JavaScript and HTML
for Text-KaraOke
• https://youtu.be/11-ltx0yv_o
35. A Browser in Python using Tkinter
for Text-KaraOke
36. Conclusion & Future Work
• Make the process more automatic.
• Make the user interface friendlier.
• Make the program more robust.
• We welcome your help in improving it.
• Thank you for listening!
37. Thank you for Listening.
ご聴取 有り難う 御座いました。
感謝您的收聽。