Speereo Software provides speech recognition technologies including automatic speech recognition (ASR), text-to-speech (TTS), and speech compression algorithms optimized for embedded devices and mobile phones. Their speech recognition engine achieves high accuracy even in noisy environments while requiring minimal CPU and memory resources. Speereo also offers a speech development SDK to easily integrate speech capabilities into applications.
General Speereo Technology
2. Speereo Software, 2009
www.speereo.com
Speereo Speech Recognition Technologies
Konstantin Lamin, CEO, lamin@speereo.com
Oleg Maleev, CTO, VP of R&D, maleev@speereo.com
Daniel Ischenko, VP of Business Development, d_ischenko@speereo.com
3. What are speech technologies needed for?
User friendliness
Speech is the most natural way for humans to communicate.
A speech interface is therefore the most natural way to interact
with a mobile device.
Mobility
With a speech interface, the user's hands and eyes are free for
any other activity.
Device novelty
A speech interface gives the user an easy-to-use device that is
not burdened by numerous keys or large screens.
4. Automatic Speech Recognition System (ASR)
ASR is the conversion of a speech signal into text or control
commands. ASR makes it possible to manufacture devices with
speech control, i.e. a speech interface.
Voice -> ASR -> Command ID
5. Speech Synthesizer (TTS)
Text-to-speech (TTS) is the conversion of text into a speech
signal in accordance with the pronunciation norms of the
language. It makes it possible to create 'speaking' devices.
Text -> TTS -> Speech
6. Speech signal compression
Allows a speech signal to be recorded in a small amount of memory.
Speech -> Packing -> Packed data
Packed data -> Unpacking -> Speech
7. Speech technology on desktop PC
ASR: Pentium 4 2.0 GHz, 64 MB
Memory bandwidth: 1.2 GB/s
TTS: Pentium 4 2.0 GHz, 100-500 MB
Standard solutions are not acceptable for embedded
and mobile devices. Special approaches that reduce
CPU and memory usage must therefore be applied.
8. Requirements for embedded devices
(low footprint)
Compactness: less than 1-2 MB of memory used.
Ability to run on CPUs under 100 MIPS (a 300 MHz XScale
is 12 or more times slower than a 2.0 GHz Pentium 4).
Low memory bandwidth (XScale delivers only 64 MB/s).
9. Embedded Speech SDK
Intuitive, simple API that non-specialists in speech
technology can use.
Scalable and portable software design.
Can be used with various operating systems, or on devices
with no OS.
Software only! No additional hardware is required.
10. Speech Recognition Technology Characteristics
Speaker-Dependent or Speaker-Independent?
Is training necessary? Mandatory training annoys users.
Recognition on the phonetic level (any dictionary size) or
on the whole-word level (small dictionaries only)?
Large (>10000 words) or small vocabulary?
Can the set of recognizable commands be changed
dynamically?
11. Optimal System of Speech Control
Speaker-Independent.
Flexible large vocabulary whose set of recognizable words
and phrases can be changed 'on the fly'.
Noise robustness: the device can be used in different
conditions (in a car, outdoors, in a crowd).
Robustness to pronunciation variation, including non-native
speakers.
12. Speereo Speech Recognition Engine
(Block diagram. Labeled elements: Speech, Acoustic environment,
Acoustic Front-End, Phone Models, Real-Time Transcriber,
Very Large Vocabulary, Decoder, Recognition result.)
13. Acoustic Front-End
Feature set of 41 coefficients.
Adaptation to the acoustic environment.
Special algorithm that adapts automatically to the
microphone type (far-field or close-talk), recording
conditions, and channel distortion.
Special algorithm for operating the system in a car.
14. Acoustic model Decoder
Continuous-density Hidden Markov Models (more precise).
Discrete Hidden Markov Models (faster).
For English: 63 HMMs comprising 2446 Gaussian mixture
components.
HMM parameters were estimated statistically, subject to
a priori phonetic constraints.
Enhanced decoding algorithm for faster operation.
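The slides do not say which decoding algorithm the engine uses. As a rough illustration of what an HMM decoder does, the sketch below runs textbook Viterbi decoding over a tiny discrete HMM. The two-state model, its probabilities, and every name in the code are invented for the example and have nothing to do with Speereo's 63 models. Plain probabilities are multiplied for readability; a practical decoder would normally work with log-probabilities to avoid underflow on long inputs.

```c
#include <assert.h>
#include <stddef.h>

#define N_STATES  2   /* hypothetical toy model: two hidden states  */
#define N_SYMBOLS 2   /* discrete observation alphabet of size two  */
#define MAX_T     16  /* longest observation sequence supported     */

/* Toy model parameters, invented for illustration only. */
static const double start_p[N_STATES] = {0.6, 0.4};
static const double trans_p[N_STATES][N_STATES] = {
    {0.7, 0.3},
    {0.4, 0.6},
};
static const double emit_p[N_STATES][N_SYMBOLS] = {
    {0.9, 0.1},
    {0.2, 0.8},
};

/* Fills path[0..t_len-1] with the most probable hidden state sequence
   for obs[0..t_len-1] and returns the probability of that sequence. */
double viterbi(const int *obs, size_t t_len, int *path)
{
    double score[MAX_T][N_STATES]; /* best probability ending in state s at time t */
    int back[MAX_T][N_STATES];     /* predecessor state on that best path          */

    assert(t_len > 0 && t_len <= MAX_T);
    for (size_t t = 0; t < t_len; t++)
        assert(obs[t] >= 0 && obs[t] < N_SYMBOLS);

    for (int s = 0; s < N_STATES; s++) {
        score[0][s] = start_p[s] * emit_p[s][obs[0]];
        back[0][s] = 0;
    }

    for (size_t t = 1; t < t_len; t++) {
        for (int s = 0; s < N_STATES; s++) {
            int best_prev = 0;
            double best = score[t - 1][0] * trans_p[0][s];
            for (int p = 1; p < N_STATES; p++) {
                double cand = score[t - 1][p] * trans_p[p][s];
                if (cand > best) {
                    best = cand;
                    best_prev = p;
                }
            }
            score[t][s] = best * emit_p[s][obs[t]];
            back[t][s] = best_prev;
        }
    }

    /* Pick the best final state, then follow the back-pointers. */
    int last = 0;
    for (int s = 1; s < N_STATES; s++)
        if (score[t_len - 1][s] > score[t_len - 1][last])
            last = s;

    double best_score = score[t_len - 1][last];
    for (size_t t = t_len; t-- > 0; ) {
        path[t] = last;
        last = back[t][last];
    }
    return best_score;
}
```

Calling `viterbi` with the observation sequence `{0, 0, 1}` and a three-element `path` array yields the state sequence 0, 0, 1 for this toy model.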
15. Real-Time Transcriber
Converts written English words and phrases to a form
suitable for recognition.
Unlimited dictionary. Out-of-Vocabulary problem solved.
Recognition of first and last names.
Recognition of geographic names.
16. Accuracy
Test 1: Long-phrase recognition
Test conditions: sample of 1680 utterances, 626
unique phrases. Language – English.
Recognition accuracy – 99.9%.
Test 2: Short-word recognition
Test conditions: numeric vocabulary database (including
indistinctly pronounced words), 11 unique words.
Language – English: recognition accuracy – 99.2%.
Language – Russian: recognition accuracy – 98.5%.
18. Speereo Speech Engine in a Car
Test 4: Long-phrase recognition in a noisy environment
Test conditions: sample of 1632 utterances, 626
unique phrases. Noise sample – moving vehicle with windows
rolled down.
Language – English.
Recognition accuracy – 97.6%.
Thanks to special algorithms, the Speereo Recognition Engine
demonstrates good robustness in a car.
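To put the percentages in perspective, the sketch below converts an accuracy figure and a sample size into an approximate number of misrecognized utterances. It assumes that "recognition accuracy" means the share of utterances recognized correctly, which the slides do not state explicitly. The function name and the per-mille encoding of the accuracy are choices made for this example.

```c
#include <assert.h>

/* Estimated number of misrecognized utterances, rounded to nearest.
   accuracy_permille is the accuracy in tenths of a percent,
   so 99.9% is passed as 999 and 97.6% as 976. */
long estimated_errors(long utterances, long accuracy_permille)
{
    assert(utterances >= 0);
    assert(accuracy_permille >= 0 && accuracy_permille <= 1000);
    return (utterances * (1000 - accuracy_permille) + 500) / 1000;
}
```

Under that assumption, Test 1 (1680 utterances at 99.9%) corresponds to about 2 errors and the in-car Test 4 (1632 utterances at 97.6%) to about 39.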
19. Comparison of Recognition Systems
Number of mistakes in tests 1 and 2 (lower is better)
(Bar chart, scale 0-80: Phrases and Digits series for
Philips, Microsoft, IBM, and Speereo.)
The following products were used in testing:
Philips FreeSpeech 2000, Microsoft Speech Recognition
Engine 4.0, IBM ViaVoice 7.0, Speereo Speech Engine 2.0
20. Speereo Speech Recognition Technology
Features
High accuracy speech recognition
Speaker-Independence
Large vocabulary (>100000 words)
Short latency
Noise robustness
Excellent compatibility
Ease of use
21. CPU and Memory requirements
Speereo Speech Engine currently supports a wide
variety of processors, such as SHx, TMPR39XX, NEC
VR4122, MIPS, ARM, XScale, etc.
Speereo Speech Engine runs on CPUs with performance
from 40 MIPS (80 recommended) and with memory
from 700 KB.
22. Speereo Speech Recognition SDK
Simple API that requires no skills in speech technology
development.
Supports Windows Mobile, Symbian, Java, other platforms,
and embedded devices with no OS.
For Windows Mobile and Symbian, ready-made audio
input/output support is provided, so there is no need to
program audio input/output.
Supports smartphones based on Series 60, UIQ, and Windows
Mobile, and mobile devices with J2ME.
23. Speereo Speech Engine Windows CE Version
(Diagram elements: Audio Input-Output; speech commands
pronounced by user; Speereo Speech Engine; list of speech
commands; Application 1, Application 2, ... Application N.)
24. Use of Speereo Speech Engine (SE)
Operation of SE can be divided into 2 major stages:
1. The application defines the operating mode of SE and, if
necessary, sends the list of speech commands to SE.
2. When the user pronounces a phrase (command), SE determines
the most probable phrase from the list of received speech
commands and sends its ID to the application.
The developer does not need to track the moment at which a
phrase is pronounced. All that is needed is to process the
Speereo Speech Engine message that contains the ID of the
command pronounced by the user.
25. Recognition modes
SE currently implements 3 recognition modes:
1. Recognition of phrases made up of words that SE knows and
that are included in the vocabulary.
2. Recognition of phrases containing words unknown to SE (mostly
personal names, etc.). Unknown words are transcribed
automatically.
3. Recognition of numbers from 1 to 31. A special mode
improves the recognition accuracy of ordinal numbers.
26. Speereo Speech Engine Initialization
To use the speech interface, an application must first
register itself with the Speereo Speech Engine by calling
the AddRegisterApplication function.
The function prototype is:
UINT AddRegisterApplication (HWND hWnd)
where hWnd is the handle of the application window that
receives messages from SE.
27. Speech Commands List Creation
The speech command list is built by calling the AddPhrase
function once for each speech command.
void AddPhrase (LPCTSTR pszText, DWORD dwId)
where pszText is the speech command in orthographic form and
dwId is the identifier that SE returns when that speech
command is pronounced.
28. Response receipt from SE
The WM_SRT_ACCEPTHYPO message passes the identifier of the
recognized speech command in its wParam parameter.
SE sends the message to the application window whose handle
hWnd was passed to AddRegisterApplication.
Example:
    case WM_SRT_ACCEPTHYPO:
        /* wParam holds the ID of the recognized command */
        MakeHypo(wParam);
        return TRUE;
MakeHypo is the developer's own function that implements the
behavior of the speech commands.
29. Defining Speech Commands Example
    AddPhrase(_T("Open Window"), ID_OPEN_WINDOW);
    AddPhrase(_T("Close Window"), ID_CLOSE_WINDOW);
This passes two speech commands (“Open Window” and
“Close Window”) to SE with the identifiers
ID_OPEN_WINDOW and ID_CLOSE_WINDOW respectively.
30. It’s That Simple!
To build a speech interface into an application with
Speereo Speech Engine, take the following three
simple steps:
1. Initialize Speereo Speech Engine.
2. Define the list of speech commands.
3. Define the application's reaction to speech commands.
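The three steps can be sketched in portable C as shown below. This is a mock for illustration, not the Speereo SDK. Only the two function prototypes and the message name come from the slides. The typedefs stand in for the Win32 types, the function bodies are placeholders, the numeric value of WM_SRT_ACCEPTHYPO and the command IDs are invented, plain char strings replace LPCTSTR and _T(), and SimulateRecognition is a hypothetical helper that plays the part of the engine by using exact string matching in place of real recognition.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-ins for Win32 types so the sketch compiles anywhere (assumption). */
typedef void *HWND;
typedef unsigned int UINT;
typedef unsigned long DWORD;
typedef const char *LPCTSTR;        /* the real SDK takes TCHAR strings */

#define WM_SRT_ACCEPTHYPO 0x8001u   /* placeholder value, not from the SDK */
#define ID_OPEN_WINDOW    1UL       /* example command IDs                 */
#define ID_CLOSE_WINDOW   2UL
#define MAX_PHRASES       32

static HWND g_app_window;           /* window registered in step 1 */
static struct { LPCTSTR text; DWORD id; } g_phrases[MAX_PHRASES];
static size_t g_phrase_count;
static DWORD g_last_command;        /* set by the application handler */

/* Step 1 (mock): remember which window receives engine messages. */
UINT AddRegisterApplication(HWND hWnd)
{
    g_app_window = hWnd;
    return 0;  /* the slides do not document the return value */
}

/* Step 2 (mock): add one speech command to the list. */
void AddPhrase(LPCTSTR pszText, DWORD dwId)
{
    assert(g_phrase_count < MAX_PHRASES);
    g_phrases[g_phrase_count].text = pszText;
    g_phrases[g_phrase_count].id = dwId;
    g_phrase_count++;
}

/* Step 3: the application's reaction, in the role of a window procedure.
   A real Win32 handler would receive WPARAM and LPARAM arguments. */
int AppWindowProc(HWND hWnd, UINT msg, DWORD wParam)
{
    (void)hWnd;
    switch (msg) {
    case WM_SRT_ACCEPTHYPO:
        g_last_command = wParam;  /* real code would call MakeHypo(wParam) */
        return 1;
    default:
        return 0;
    }
}

/* Hypothetical helper: pretend the engine recognized `spoken` and
   deliver the matching command ID to the registered application. */
int SimulateRecognition(const char *spoken)
{
    for (size_t i = 0; i < g_phrase_count; i++) {
        if (strcmp(g_phrases[i].text, spoken) == 0)
            return AppWindowProc(g_app_window, WM_SRT_ACCEPTHYPO,
                                 g_phrases[i].id);
    }
    return 0;  /* nothing matched, so no message is sent */
}
```

After registering a window and adding the "Open Window" and "Close Window" phrases, calling `SimulateRecognition("Close Window")` leaves `ID_CLOSE_WINDOW` in `g_last_command`, which mirrors how the application only ever sees command IDs and never handles audio itself.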
31. Speereo Speech Engine Additional Features
1. Microphone and speaker controls.
2. Ability to interact with several applications simultaneously.
3. Ability to record sound and voice signals via the microphone
with real-time compression.
4. Ability to play sound and voice signals for the user/speaker.
5. Speech signal detector selection (continuous monitoring of
the speech signal, or recognition launched by a key press).
32. Speereo Speech Engine
Implementation Possibilities
Home appliances
Consumer electronics (audio/video systems)
Computer hardware and software (all operations)
Portable devices (mobile phones, smartphones)
Voice mail system
Other embedded devices
Using Speereo Speech Interface can greatly contribute to
functionality, accessibility, and innovative appeal of any
product by making it fully interactive, easy to control, and
therefore more productive and enjoyable.
33. Example 1: operating a phonebook
Instead of selecting from the menu (Menu > Names > Search
“Samantha” > Call), the feature can be accessed with one
short phrase: “Call Samantha”.
Send a voice message via e-mail/MMS:
Say “Send E-mail” or “Send MMS”.
You will be prompted for the name of the recipient. Once the
name is spoken, the system finds it in the database and
offers to send a voice message.
34. Example 2: Mobile Voice Interface
A voice interface for mobile services is in high demand in
the mobile community.
Weather, Maps, Dictionaries, Ticket booking,
Exchange rates, Information, E-commerce, Humor
35. Example 3: GPS Voice Control
Speech menu Search P.O.I.
Map navigation Route indication
36. Speereo Voice Translator
Speaking in a foreign language? Nothing could be simpler!
Speereo Voice Translator is an
innovative mobile phrase book that
understands a spoken phrase in English
(even when pronounced with a strong accent)
and immediately reads back the same
phrase in Arabic, Chinese (Traditional or
Simplified), Danish, English, Finnish,
French, German, Italian, Korean, Polish,
Russian, Spanish or Turkish.
37. Speereo Voice Organizer
Manage your personal information,
send e-mails and set your schedule
using only voice commands with our
Stylus Free Concept – Speereo Voice
Organizer!
Free your hands and don't stop to operate
your mobile device: the application will find
and dial numbers, write e-mails and
remind you of your appointments,
following your voice commands!
38. Use our unique skills!
A speech interface brings a new level of convenience to the
user. We have the knowledge needed to implement speech
technology successfully.
"A voice-operated scheduler is a very good idea and Speereo has made it
an impressive and enjoyable reality. Perhaps the best thing about Voice
Organizer is that you can access all of its features with one hand. One
touch of your Pocket PC's record button and your voice does all the work:
switching between days, week and month views of your events; adding new
events; or adding vocal notes to your phone contacts."
Voice Recognition Programs for the Pocket PC
By John Mierau, Pocket PC Magazine, November 2002, Vol. 5 No. 5
39. Speech Synthesizers Types (TTS)
Whole-words TTS
Text -> Phrases compiler (uses Words DB) -> Speech
Phonemic TTS
Text -> Transcriber -> Phones -> Phrases compiler
(uses Phones DB and Prosody) -> Speech
40. TTS Requirements
Whole-words TTS
Vocabulary (up to 2-3 thousand words) predefined at the
system development stage.
CPU from 40 MIPS, RAM from 0.5 MB. Requires a narrator to
record every word in the vocabulary.
Phonemic TTS
Large dictionaries possible (over 100 thousand words).
CPU from 80 MIPS, RAM from 2 MB. Does not require tuning
for a particular dictionary.
41. TTS Language Support
Whole-words TTS
Any language may be used. A narrator is needed to create the
word database. Development time: 1-2 weeks, depending on
the dictionary.
Phonemic TTS
English, Spanish, German and Italian are currently
supported.
Development period for a new language: 3 months.
42. Speech Compression Algorithms
Uncompressed speech signal, 16 bit / 8 kHz:
1 minute takes 960 KB of memory.
ADPCM (Adaptive Differential Pulse Code Modulation) records
only the difference between samples and adjusts
the coding scale dynamically:
1 minute takes 240 KB of memory.
Any sound signal can be compressed this way.
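The idea of coding only the difference between samples with a dynamically adjusted scale can be sketched as follows. This is a deliberately simplified 4-bit adaptive delta coder written for illustration. It is not a standard ADPCM variant such as IMA or ITU-T G.726, and it is not Speereo's codec. The state layout, the step limits, and the adaptation rule (double the step after large codes, halve it after small ones) are all assumptions of this example. Four bits per sample in place of sixteen gives the 4:1 reduction that is consistent with the 960 KB and 240 KB figures.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int predicted;  /* last reconstructed sample, used as the prediction */
    int step;       /* current quantizer step size (the "coding scale")  */
} AdpcmState;

void adpcm_init(AdpcmState *st)
{
    st->predicted = 0;
    st->step = 16;
}

/* Shared by encoder and decoder so both stay in sync: applies one
   4-bit code (bit 3 = sign, bits 0-2 = magnitude) to the state and
   returns the reconstructed sample. */
static int adpcm_apply(AdpcmState *st, uint8_t code)
{
    int magnitude = code & 7;
    int delta = magnitude * st->step + st->step / 2;
    if (code & 8)
        delta = -delta;

    st->predicted += delta;
    if (st->predicted > 32767)  st->predicted = 32767;
    if (st->predicted < -32768) st->predicted = -32768;

    /* Adapt the scale: grow after large codes, shrink after small ones. */
    if (magnitude >= 4)
        st->step *= 2;
    else if (magnitude <= 1)
        st->step /= 2;
    if (st->step < 4)    st->step = 4;
    if (st->step > 4096) st->step = 4096;

    return st->predicted;
}

/* Encodes one 16-bit sample as a 4-bit code. */
uint8_t adpcm_encode_sample(AdpcmState *st, int16_t sample)
{
    int diff = sample - st->predicted;
    uint8_t code = 0;
    if (diff < 0) {
        code = 8;
        diff = -diff;
    }
    int magnitude = diff / st->step;
    if (magnitude > 7)
        magnitude = 7;
    code |= (uint8_t)magnitude;
    adpcm_apply(st, code);  /* track what the decoder will reconstruct */
    return code;
}

/* Decodes one 4-bit code back to a 16-bit sample. */
int16_t adpcm_decode_sample(AdpcmState *st, uint8_t code)
{
    assert(code < 16);
    return (int16_t)adpcm_apply(st, code);
}
```

The encoder and decoder each keep their own `AdpcmState`, initialized with `adpcm_init`. Because the encoder updates its state through the same `adpcm_apply` routine the decoder uses, the two never drift apart even though the reconstruction is lossy.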
43. Speech Compression Special Algorithms
Exploiting the features of the speech signal makes a higher
compression ratio possible:
GSM compression
1 minute takes about 100 KB of memory. Optimal compression
for speech signals only.
Speereo advanced compression
1 minute takes about 10.25 KB of memory.
More than 1.5 hours of speech can be recorded
in 1 MB of space.
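The storage figures can be checked with a few lines of arithmetic. The sketch below assumes 1 KB = 1000 bytes and 1 MB = 1,000,000 bytes, since the 960 KB figure for 16 bit / 8 kHz audio only works out exactly with decimal units. It also assumes that "GSM compression" refers to the GSM full-rate codec at 13 kbit/s, which the slides do not specify. The function names are made up for the example.

```c
#include <assert.h>

/* Bytes needed to store one minute of audio at the given bit rate. */
long bytes_per_minute(long bits_per_second)
{
    assert(bits_per_second >= 0);
    return bits_per_second * 60 / 8;
}

/* Whole minutes of audio that fit into capacity_bytes. */
long minutes_that_fit(long capacity_bytes, long bytes_per_min)
{
    assert(capacity_bytes >= 0);
    assert(bytes_per_min > 0);
    return capacity_bytes / bytes_per_min;
}
```

With these helpers, `bytes_per_minute(16 * 8000)` gives 960000 bytes for uncompressed audio, `bytes_per_minute(4 * 8000)` gives 240000 bytes for 4-bit ADPCM, and `bytes_per_minute(13000)` gives 97500 bytes for GSM full rate, in line with "about 100 KB". At 10.25 KB (10250 bytes) per minute, `minutes_that_fit(1000000, 10250)` gives 97 minutes, which matches the claim of more than 1.5 hours per megabyte and implies a bit rate of roughly 1.4 kbit/s.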
44. Speereo Advanced Compression
Real-time compression and decompression with the Speereo
algorithm requires a processor with performance of 60 MIPS
and 200 KB of memory.
Real-time decompression alone requires a processor with
performance of 40 MIPS and 200 KB of memory.
45. Speereo Compression Algorithms Usage
Playback of preinstalled voice commands on mobile and
embedded devices (decompression only).
Creation of users' voice commands on a PC, followed by
transfer to mobile and embedded devices
(compression on the desktop PC, decompression on the
embedded device).
Recording and playback of users' commands on mobile and
embedded devices (compression and decompression on the
embedded device).
46. Conclusion
Speereo Speech Technology for embedded devices:
Automatic Speech Recognition (ASR): from 40 MIPS (80 MIPS
recommended), from 700 KB of memory.
Speech synthesizer (TTS): from 40 (80) MIPS, from 500 KB
(2 MB) of memory.
Speech signal compression: from 40 MIPS, from 200 KB
of memory.
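As a quick way to apply these minimums, the sketch below checks which technologies a given device could host. The numbers are taken from the slides, with the two TTS figures mapped to the whole-word (40 MIPS, 500 KB) and phonemic (80 MIPS, 2 MB) variants as on the TTS requirements slide. Treating 2 MB as 2048 KB is an assumption, and the type and function names are invented for the example. Each technology is checked on its own; running several at once would need their requirements added together.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *name;
    int min_mips;
    int min_memory_kb;
} Requirement;

/* Minimum requirements as stated in the slides (2 MB taken as 2048 KB). */
static const Requirement REQUIREMENTS[] = {
    {"ASR",                40, 700},
    {"TTS, whole-word",    40, 500},
    {"TTS, phonemic",      80, 2048},
    {"Speech compression", 40, 200},
};
#define N_REQUIREMENTS (sizeof REQUIREMENTS / sizeof REQUIREMENTS[0])

/* True if a device with the given CPU and memory meets one requirement. */
bool meets(const Requirement *req, int mips, int memory_kb)
{
    assert(req != NULL);
    return mips >= req->min_mips && memory_kb >= req->min_memory_kb;
}

/* Number of technologies whose minimums the device satisfies. */
size_t count_supported(int mips, int memory_kb)
{
    size_t n = 0;
    for (size_t i = 0; i < N_REQUIREMENTS; i++)
        if (meets(&REQUIREMENTS[i], mips, memory_kb))
            n++;
    return n;
}
```

For example, a 60 MIPS device with 1024 KB of memory satisfies three of the four entries (everything except phonemic TTS).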
47. Speereo Speech Technology
Technology that understands your language
QUESTIONS? COMMENTS?
Speereo Software UK
www.speereo.com