Voice Browsing And Multimodal Interaction In 2009
1. Voice Browser and Multimodal Interaction In 2009
Paolo Baggia
Director of International Standards
March 6th, 2009
Google TechTalk
Google TechTalk – Mar 6th, 2009 Paolo Baggia 11
2. Overview
A Bit of History
W3C Speech Interaction Framework Today
ASR/DTMF
TTS
Lexicons
Voice Dialog and Call Control
Voice Platforms and Next Evolutions
W3C Multimodal Interaction Today
MMI Architecture
EMMA and InkML
A language for Emotions
Next Future
3. Company Profile
Privately held company (fully owned by Telecom Italia), founded in 2001 as a
spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and
expertise in voice processing.
Global company, leader in Europe and South America for award-winning, high
quality voice technologies (synthesis, recognition, authentication and
identification), available in 26 languages and 62 voices.
Multilingual, proprietary technologies protected by over 100 patents worldwide.
Financially robust: break-even reached in 2004,
revenues and earnings growing year on year.
Growth-plan investment approved for the evolution of products and services.
Headquarters in Torino; local representative sales offices in Rome,
New York, Madrid, Paris, London, Munich.
Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.
4. International Awards
“2008 Frost & Sullivan European Telematics and Infotainment
Emerging Company of the Year” Award
Winner of “Market leader-Best Speech Engine” Speech
Industry Award 2007 and 2008
Loquendo MRCP Server: Winner of 2008 IP Contact
Center Technology Pioneer Award
“Best Innovation in Automotive Speech Synthesis” Prize
AVIOS-SpeechTEK West 2007
“Best Innovation in Expressive Speech Synthesis” Prize
AVIOS-SpeechTEK West 2006
“Best Innovation in Multi-Lingual Speech Synthesis”
Prize AVIOS-SpeechTEK West 2005
5. A Bit of History
6. Standard Bodies
Two main standard bodies:
W3C – World Wide Web Consortium
Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full
potential. Staff based at MIT (USA), ERCIM (France) and Keio Univ. (Japan).
400 members all over the world; 50 Working, Interest and Coordination Groups.
W3C is where the framework of today's Web is developed (HTML, CSS, XML, DOM,
SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web
Accessibility, Device Independence).
IETF – Internet Engineering Task Force
Founded in 1986; it grew from 1991 under the Internet Society. 1300 members.
HTTP, SIP, RTP and many other protocols. The Media Resource Control Protocol
(MRCP) is very relevant for speech platforms.
Two industrial forums:
VoiceXML Forum (www.voicexml.org)
Inventors of VoiceXML 1.0, then submitted to W3C for standardization.
Current goal is to promote, disseminate and support VoiceXML and related standards.
SALT Forum (www.saltforum.org)
Supported by Microsoft to define a lightweight markup for telephony and multimodal
applications.
Other relevant bodies:
3GPP, OMA, ETSI, NIST
7. The (r)evolution of VoiceXML
Timeline, 1998 to 2009:
1998: W3C Voice Browser Workshop
1999: W3C charters the Voice Browser WG; VoiceXML Forum founded by AT&T, IBM,
Lucent, Motorola
2000: VoiceXML 1.0 released by the Forum and submitted to W3C for
standardization
2002: W3C charters the Multimodal Interaction WG; SALT Forum founded by Cisco,
Comverse, Intel, Microsoft, Philips, SpeechWorks
2004: VoiceXML 2.0, SRGS 1.0 and SSML 1.0 become W3C Recommendations
2007: SISR 1.0 W3C Recommendation
2008: PLS 1.0 W3C Recommendation
2009: EMMA 1.0 W3C Recommendation

Preparing to announce VoiceXML 1.0
Friday Feb. 25th, 2000
Lucent, Naperville, Illinois
Left to right: Gerald Karam (AT&T), Linda Boyer (IBM),
Ken Rehor (Lucent), Bruce Lucas (IBM),
Pete Danielsen (Lucent), Jim Ferrans (Motorola),
Dave Ladd (Motorola).
8. Speech Interface Framework in 2000
(by Jim Larson)
[Diagram: the user speaks over the telephone to the system. Input side: ASR,
driven by the Speech Recognition Grammar Spec. (SRGS) with Semantic
Interpretation for Speech Recognition (SISR) and an N-gram Grammar ML, plus a
DTMF tone recognizer; language understanding produces results (Natural
Language Semantics ML / EMMA) for context interpretation. A Dialog Manager
(VoiceXML 2.0 / VoiceXML 2.1) connects to the World Wide Web. Output side:
language generation, a pre-recorded audio player and TTS driven by the Speech
Synthesis Markup Language (SSML); the Pronunciation Lexicon Specification
(PLS) is shared by ASR and TTS; telephony and reusable components are handled
via Call Control XML (CCXML) and media planning.]
9. Speech Interface Framework - Today
(by Jim Larson)
[Same framework diagram as on the previous slide, with EMMA now shown as
EMMA 1.0, reflecting the state of the standards today.]
10. Speech Interface Framework - End of 2009
(by Jim Larson)
[Same framework diagram again, reflecting the expected state of the standards
at the end of 2009.]
12. Architectural Changes
Traditional (proprietary) architecture:
The user talks to a proprietary platform (ASR/DTMF and TTS/Audio) that
directly runs a proprietary speech application, built with a proprietary SCE.

VoiceXML architecture:
The user talks to a VoiceXML platform (ASR/DTMF and TTS/Audio) whose VoiceXML
Browser fetches the application from a Web application over HTTP: .vxml pages,
grammars (.grxml/.gram), lexicons (.pls) and prompts (.ssml, .wav/.mp3).
13. The VoiceXML Impact
VoiceXML changed the landscape of IVRs and speech application creation:
from proprietary to standards-based speech applications.

Before:
• Proprietary platforms (HW & SW)
• Proprietary applications (built with proprietary SCEs)
• Mainly DTMF and pre-recorded prompts
• First attempts to add speech into IVRs

After:
• Standard VoiceXML platforms
• Standards for speech technologies
• Standard tools for VoiceXML applications
• Integration of DTMF and ASR
• Still a predominance of DTMF, but more and more speech applications
14. Overview
A Bit of History
W3C Speech Interaction Framework Today
ASR/DTMF
TTS
Lexicons
Voice Dialog and Call Control
Voice Platforms and Next Evolutions
W3C Multimodal Interaction Today
MMI Architecture
EMMA and InkML
A language for Emotions
Next Future
15. Standards for ASR and DTMF
SRGS 1.0, SISR 1.0
16. W3C Standards for Speech/DTMF Grammars
SYNTAX (SRGS): defines constraints on the admissible sentences for a specific
recognition turn. Two syntaxes, ABNF and XML, covering both voice and DTMF
grammars.
http://www.w3.org/TR/speech-grammar/

SEMANTICS (SISR): describes how to produce results after an utterance is
recognized. Two flavors, literal and script, again for both voice and DTMF
grammars.
http://www.w3.org/TR/semantic-interpretation/
17. SRGS/SISR Grammars for “Torino”
SRGS XML (SISR literal tags):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>

SRGS ABNF (SISR literal tags):

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0-literals>;
public $main = Torino {10100};

SRGS XML (SISR script tags):

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0">
  <tag>var unused=7;</tag>
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>out="10100";</tag>
  </rule>
</grammar>

SRGS ABNF (SISR script tags):

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0>;
{var unused=7;};
public $main = Torino {out="10100";};
18. SRGS/SISR Standards – Pros
Powerful syntax (CFG) and very powerful semantics (ECMAScript)
DTMF and voice input are transparent to the application
Wide and consistent adoption among technology vendors
The two syntaxes, XML and ABNF, are great!
Developers can choose (XML validation vs. compact format)
Transformations are possible:
XML → ABNF (easy, a simple XSLT)
ABNF → XML (requires an ABNF parser)
Open-source tools might be created to:
Validate grammar syntax
Transform grammars
Debug grammars on written input
Coverage tests: enumerate covered sentences, GenSem, SemTester, etc.
19. SRGS/SISR Standards – Small Issues
Semantics declaration: the tag-format attribute
Value "semantics/1.0":
mandates SISR script semantics inside semantic tags
Value "semantics/1.0-literals":
mandates SISR literal semantics inside semantic tags
Missing value:
unclear! Risk of interoperability troubles
SISR script semantics
Clumsy default assignment: returns the last referenced rule only
The developer must properly propagate results upwards
Be careful when redefining "out":
assigning a scalar value might result in errors
SISR literal semantics
Only useful for very simple word-list rules
No support for encapsulating rules:
SISR literal grammars can be used as external references ONLY!
20. SRGS/SISR – Encapsulated Grammars
[Diagram: a tree of grammar references. A script grammar Gr1.grxml references
a literal grammar Gr2.gram and a script grammar Gr3.grxml; Gr3.grxml in turn
references a literal grammar Gr41.grxml and a script grammar Gr42.gram.
Literal grammars appear only as leaf external references.]
21. SRGS/SISR Standards – Rich XML Results
Section 7 of the SISR 1.0 specification:
http://www.w3.org/TR/semantic-interpretation/#SI7
Serialization rules from SISR ECMAScript results into XML.
Edge cases:
Arrays
Special variables "_attributes" and "_value"
Creation of namespaces and prefixes

ECMAScript result:

{
  drink: {
    _nsdecl: {
      _prefix:"n1",
      _name:"http://www.example.com/n1"
    },
    _nsprefix:"n1",
    liquid: {
      _nsdecl: {
        _prefix:"n2",
        _name:"http://www.example.com/n2"
      },
      _attributes: {
        color: {
          _nsprefix:"n2",
          _value:"black"
        }
      },
      _value:"coke"
    },
    size:"medium"
  }
}

XML serialization:

<n1:drink xmlns:n1="http://www.example.com/n1">
  <liquid n2:color="black"
          xmlns:n2="http://www.example.com/n2">coke</liquid>
  <size>medium</size>
</n1:drink>
22. SRGS/SISR Standards – Next Steps
Adoption of the PLS 1.0 lexicon
Clear entry point into PLS lexicons: the <token> element
A role attribute is still missing in <token> to allow homograph
disambiguation
Next extensions via errata
XML 1.1 support and IR
Update normative references
No major extensions are needed!
23. Speech Synthesis
SSML 1.0/1.1
24. TTS – Functional Architecture and
Markup/Non-Markup support
The TTS functional pipeline: Structure Analysis → Text Normalization →
Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production.

Structure analysis
Markup support: <p>, <s>
Non-markup support: infer the structure by automatic text analysis
Text normalization
Markup support: <say-as> for dates, times, phone numbers, numbers;
<sub> for acronyms and transliterations
Non-markup support: automatically identify and convert constructs
Text-to-phoneme conversion
Markup support: <phoneme>, <lexicon>
Non-markup support: look up in a pronunciation dictionary
Prosody analysis
Markup support: <emphasis>, <break>, <prosody>
Non-markup support: automatically generate prosody through analysis of
document structure and sentence syntax
Waveform production
Markup support: <voice>, <audio>
http://www.w3.org/TR/speech-synthesis/
25. SSML 1.0 – Language description (I)
Document structure
<speak> root element, with a version attribute and the SSML namespace

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>

Languages: xml:lang, as in the example above
Processing and pronunciation
– <p> and <s> (paragraph and sentence)
to give a structure to the text
– <say-as> element
to indicate the type of text construct contained within the element,
e.g. dates, numbers, etc.
– <phoneme> element
to provide a phonetic pronunciation (in IPA) for the contained text
– <sub> element
to provide substitutions, e.g. for expanding acronyms into a sequence of
words
http://www.w3.org/TR/speech-synthesis/
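A hedged sketch combining these pronunciation elements in one document (not
from the talk; the IPA string is illustrative, and interpret-as/format values
follow common say-as conventions that may vary by engine):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <s>
      Your appointment is on
      <!-- say-as tells the engine this is a date, not bare digits -->
      <say-as interpret-as="date" format="mdy">3/6/2009</say-as>.
    </s>
    <s>
      <!-- sub expands an abbreviation into spoken words -->
      Directions to <sub alias="Sepulveda Boulevard">Sepulveda Blvd.</sub>:
      <!-- phoneme forces an exact IPA pronunciation -->
      <phoneme alphabet="ipa" ph="səˈpʌlvɪdə">Sepulveda</phoneme>.
    </s>
  </p>
</speak>
```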
26. SSML 1.0 – Language description (II)
Style
- <voice> element

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  The moon is rising on the beach, when John says,
  looking Mary in the eyes:
  <voice name="simon">I love you!</voice>
  but she suddenly replies:
  <voice name="susan">Please, be serious!</voice>
</speak>

The voice selection attributes are:
name, xml:lang, gender, age, and variant
- <emphasis> element
requests that the contained text be spoken with emphasis
the level attribute can be set to strong, moderate, reduced, or none
- <break> element
controls the pausing between words
time attribute, taking time expressions such as "5s", "20ms"
strength attribute, with values:
none, x-weak, weak, medium (default value), strong, or x-strong
http://www.w3.org/TR/speech-synthesis/
27. SSML 1.0 – Language description (III)
Prosody
<prosody> element
permits control of the pitch, speaking rate and volume of the
speech output.
The attributes are:
volume: the volume for the contained text.
rate: the speaking rate in words-per-minute for the contained text.
duration: a value in seconds or milliseconds for the desired time to take
to read the element contents.
pitch: the baseline pitch for the contained text.
range: the pitch range (variability) for the contained text in Hertz.
contour: sets the actual pitch contour for the contained text.
Other elements
<audio> element - to play an audio file
<mark> element - to place a marker into the text/tag sequence
<desc> element - to provide a description of a non-speech audio
source in <audio>
http://www.w3.org/TR/speech-synthesis/
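A hedged sketch putting <prosody>, <break>, <audio>, <desc> and <mark>
together (not from the talk; the audio URL is illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- louder, slower and lower-pitched for the warning -->
  <prosody volume="loud" rate="slow" pitch="low">
    Please listen carefully.
  </prosody>
  <break time="500ms"/>
  <!-- play an audio file; the element content is the fallback text
       and <desc> describes the sound for non-audio rendering -->
  <audio src="http://www.example.com/jingle.wav">
    <desc>station jingle</desc>
    The audio could not be played.
  </audio>
  <mark name="after-jingle"/>
  Thank you.
</speak>
```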
28. Towards SSML 1.1 – Motivations
Internationalization needs:
Three Workshops: Beijing (Nov’05), Crete (May’06), Hyderabad (Jan’07)
Results:
No major needs for Eastern and Western European languages
Many issues for Far East languages (Mandarin, Japanese, Korean)
Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many
Indian languages
Mark input with or without vowels
Mark the transliteration schema used for input
Extensions required by the Voice Browser Working Group:
More powerful error handling, selection of fall-back strategies
Trimming attributes
volume attribute now adopts a logarithmic scale (it was linear before)
Alignment with PLS 1.0 specification for user lexicons
http://www.w3.org/TR/speech-synthesis11/
29. SSML 1.1 – Language Changes
Lexicon extensions
<w> (token) element, to delimit tokens in the input text
<lookup> element, to scope the application of a referenced lexicon to the
contained text
Phonetic Alphabet Registry creation and adoption
"ipa" for the International Phonetic Alphabet
Registration policy for other phonetic alphabets, similar to LTRU for
language tags
Candidates:
Pinyin for Mandarin Chinese
JEITA for Japanese
X-SAMPA, an ASCII transliteration of IPA codes
http://www.w3.org/TR/speech-synthesis11/
31. Pronunciation Lexicons
Pronunciation Lexicon
A mapping between words (or short phrases), their written representations,
and their pronunciations suitable for use by an ASR engine or a TTS
engine
Pronunciation lexicons are not only useful for voice browsers:
they have also proven to be effective mechanisms to support accessibility for
the differently abled, as well as greater usability for all users.
They are used to good effect in screen readers and in user agents supporting
multimodal interfaces.
The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is
designed to enable interoperable specification of pronunciation
lexicons
http://www.w3.org/TR/pronunciation-lexicon/
32. PLS 1.0 – Language Overview
A PLS document is a container (<lexicon>) of several lexical entries
(<lexeme>)
Each lexical entry contains
One or more spellings (<grapheme>)
One or more pronunciations (<phoneme>) or substitutions (<alias>)
Each PLS document is related to a single unique language (xml:lang)
SSML 1.0 and SRGS 1.0 documents can reference one or more PLS
documents
Current version doesn’t include morphological, syntactic and semantic
information associated with pronunciations
http://www.w3.org/TR/pronunciation-lexicon/
33. PLS 1.0 – An Example
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
      http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>səˈpʌlvɪdə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>
http://www.w3.org/TR/pronunciation-lexicon/
34. PLS 1.0 – Used for TTS
SSML 1.0

<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" … xml:lang="en-US">
  <lexicon uri="http://www.example.com/SSMLexample.pls"/>
  The title of the movie is: "La vita è bella" (Life is beautiful),
  which is directed by Benigni.
</speak>

PLS 1.0

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>La vita è bella</grapheme>
    <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>
http://www.w3.org/TR/pronunciation-lexicon/
35. PLS 1.0 – Used for ASR
SRGS 1.0

<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
  <lexicon uri="http://www.example.com/SRGSexample.pls"/>
  <rule id="movies" scope="public">
    <one-of>
      <item>Terminator 2: Judgment Day</item>
      <item>Pluto's Judgement Day</item>
    </one-of>
  </rule>
</grammar>

PLS 1.0

<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
</lexicon>
http://www.w3.org/TR/pronunciation-lexicon/
36. Examples of Use
Multiple pronunciations for the same orthography
Multiple orthographies
Homophones
Homographs
Acronyms, Abbreviations, etc.
Detailed descriptions can be found in:
W3C specification, Wikipedia
Paolo Baggia, SpeechTEK 2008 & Voice Search 2009
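A hedged PLS sketch illustrating several of these cases in one lexicon (the
words and IPA strings are illustrative examples, not from the talk):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <!-- multiple orthographies sharing one pronunciation -->
  <lexeme>
    <grapheme>color</grapheme>
    <grapheme>colour</grapheme>
    <phoneme>ˈkʌlɚ</phoneme>
  </lexeme>
  <!-- one orthography with multiple pronunciations -->
  <lexeme>
    <grapheme>tomato</grapheme>
    <phoneme>təˈmeɪtoʊ</phoneme>
    <phoneme>təˈmɑːtoʊ</phoneme>
  </lexeme>
  <!-- acronym expanded via alias -->
  <lexeme>
    <grapheme>IVR</grapheme>
    <alias>interactive voice response</alias>
  </lexeme>
</lexicon>
```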
37. PLS 1.0 – Open Issues
No wide support for IPA in speech engines
Changes are slowly under way
The Phonetic Alphabet Registry will open the door to other alphabets in a
controlled and interoperable way
Integration in ASR/TTS
SSML 1.1 will interoperate with PLS 1.0
SRGS 1.0 is still missing support for the role attribute for PLS 1.0
No matching algorithm inside PLS, because it is mainly a data format
http://www.w3.org/TR/pronunciation-lexicon/
39. International Phonetic Alphabet
Pronunciation is represented by a phonetic alphabet
Standard phonetic alphabets
International Phonetic Alphabet (IPA)
Well known phonetic alphabet
SAMPA - ASCII based (simple to write)
Pinyin (Chinese Mandarin), JEITA (Japanese), etc.
Proprietary phonetic alphabets
International Phonetic Alphabet (IPA)
Created by the International Phonetic Association (active since 1886),
collaborative effort by all the major phoneticians around the world
Universally agreed system of notation for sounds of languages
Covers all languages
Requires UNICODE to write it
Normatively referenced by PLS
40. IPA – Chart
The International Phonetic Association was founded in 1886 and is the major
international association of phoneticians.
The IPA alphabet provides symbols making possible the phonemic transcription
of all known languages.
IPA characters can be encoded in Unicode by supplementing ASCII with
characters from other ranges, particularly:
IPA Extensions (0250–02AF)
Latin Extended-A (0100–017F)
See the detailed charts:
http://www.unicode.org/charts
41. Phonetic Alphabets – Issues
The real problem is how to write pronunciations reliably, unless you are a
trained phonetician.
There are issues with fonts, authoring and browsers, but Unicode fonts today
support the IPA extensions, see:
http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm
There are very few tools to help write pronunciations and to let you listen
to what you have written.
Pronunciations should be made available in IPA or other general phonetic
alphabets.
42. Voice Dialog languages:
VoiceXML 2.0
VoiceXML 2.1
43. VoiceXML 2.0 – Features, Elements
Input
Speech recognition: <grammar>
Recording: <record>
Keypad: <grammar mode="dtmf">
Output
Audio files: <audio>
Text-To-Speech: <prompt>
Variables (ECMA-262)
<var>, <assign>, <script>
scoping rules
Events
<nomatch>, <noinput>, <help>, <catch>, <throw>
Menus, forms, sub-dialogs
<menu>, <form>, <subdialog>
Transition and submission
<goto>, <submit>
Telephony
Connection control: <transfer>, <disconnect>
Telephony information
Platform specifics
<object>
Performance
Fetch
Properties
http://www.w3.org/TR/voicexml20/
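As a sketch of how these elements fit together, here is a minimal (assumed,
not from the talk) VoiceXML 2.0 form; the grammar and submit URIs are
illustrative:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml"
      xml:lang="en-US">
  <form id="city">
    <field name="destination">
      <prompt>Which city do you want to fly to?</prompt>
      <!-- an SRGS grammar constrains what can be recognized -->
      <grammar src="http://www.example.com/cities.grxml"
               type="application/srgs+xml"/>
      <noinput>Sorry, I didn't hear you.</noinput>
      <nomatch>Sorry, I didn't understand.</nomatch>
      <filled>
        <!-- submit the recognized value to the web application -->
        <submit next="http://www.example.com/book"
                namelist="destination"/>
      </filled>
    </field>
  </form>
</vxml>
```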
44. VoiceXML 2.0 – Execution Model
Execution is synchronous
Only the disconnect event is handled (somewhat) asynchronously
Execution is always in a single dialog: <form> or <menu>
Form Interpretation Algorithm for <field> selection
Prompts are queued
Played only when a waiting state is reached
Played before a fetchaudio is started
Processing is always in one of two states:
Waiting for input in an input item:
<field>, <record>, <transfer>, etc.
Transitioning between input items in response to an input
Event-driven:
user input event handling
<nomatch>, <noinput>
generalized event mechanism
<catch>, <throw>
call event handling
connection.*
error event handling
error.*
http://www.w3.org/TR/voicexml20/
45. VoiceXML 2.1 – Extended Features
Dynamically referencing grammars and scripts:
<grammar expr="…">, <script expr="…">
Record the user's utterance during form filling:
recordutterance property
new shadow variables: recording, recordingsize, recordingduration
Detect barge-in during prompt playback (SSML <mark>):
markexpr attribute
new shadow variables: markname and marktime
Fetch XML data without transition:
read-only subset of DOM
Dynamically concatenate prompts: <foreach>
iterate through ECMAScript arrays and execute content
Send data upon disconnect:
<disconnect namelist="…">
Additional transfer type:
<transfer type="consultation">
http://www.w3.org/TR/voicexml21/
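For instance, the new expr attributes let a grammar URI be computed at
runtime from an ECMAScript variable; a minimal sketch with an illustrative
URI scheme (not from the talk):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="pick">
    <!-- region would normally be set earlier in the dialog -->
    <var name="region" expr="'north'"/>
    <field name="city">
      <prompt>Say a city.</prompt>
      <!-- grammar URI computed at runtime: cities-north.grxml -->
      <grammar expr="'http://www.example.com/cities-' + region + '.grxml'"/>
    </field>
  </form>
</vxml>
```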
46. VoiceXML Applications
Static VoiceXML applications
The VoiceXML page is always the same, and so is the user experience
No personalization or customization
Dynamic VoiceXML applications
User experience is customized:
• After authentication (PIN)
• Using caller-id or SIP-id
Data driven
Dynamic pages generated at runtime, e.g. JSP, ASP, etc.
http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
47. A Drawback of VoiceXML 2.0
A drawback of VoiceXML is that the transition from one VoiceXML page to
another is a costly activity:
Fetch the new page, if not cached
Parse the page
Initialize the context, possibly loading and initializing a new application
root document
Load or pre-compile scripts
Transitions are also the only way to return data to the Web application
(if the VoiceXML is dynamic):
pages must be created to include dynamic data
VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a
running VoiceXML page
http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
48. Advantages of VoiceXML 2.1 - AJAX
Two of the eight new features in VoiceXML 2.1 help create more dynamic
VoiceXML applications:
<data> element
<foreach> element
A static VoiceXML document can fetch user-specific data at runtime, without
changing the VoiceXML document
The <data> element allows retrieval of arbitrary XML data without VoiceXML
document transitions
The returned XML data are accessible through a subset of DOM primitives
<foreach> extends prompts to iterate over a dynamic array of information and
create a dynamic prompt
This is similar to AJAX programming for HTML services
It decouples the presentation layer (VoiceXML) from the business logic
(accessed via <data>)
http://www.w3.org/TR/voicexml21/
49. VoiceXML 2.1 – <data> Element
Attributes:
name: the variable to be filled with the DOM of the retrieved data
src or srcexpr: the URI of the location of the XML data
namelist: the list of variables to be submitted
method: either 'get' or 'post'
enctype: media encoding
fetch and caching attributes
Like <var>, it may appear in executable content and in <form> and <vxml>
The value of name must be a declared variable; the platform fills that
variable with the DOM of the fetched XML data
The <data> element is synchronous (the service stops to get the data)
http://www.w3.org/TR/voicexml21/
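A minimal sketch of <data> in use (not from the talk), assuming the
illustrative service returns a document like <balance>42</balance>:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="balance">
    <var name="account" expr="'12345'"/>
    <block>
      <!-- fetch XML without leaving the page; the result is a DOM -->
      <data name="info" src="http://www.example.com/balance.xml"
            namelist="account" method="get"/>
      <!-- read the fetched XML with the read-only DOM subset -->
      <prompt>
        Your balance is
        <value expr="info.documentElement.firstChild.data"/> euros.
      </prompt>
    </block>
  </form>
</vxml>
```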
50. VoiceXML 2.1 – <foreach> Element
Attributes:
array: an ECMAScript expression that must evaluate to an ECMAScript array
item: the variable that stores the array element currently being processed
<foreach> allows the application to iterate over an ECMAScript array and to
execute the content for each element
<foreach> may appear:
In executable content (all executable-content elements may appear as content
of <foreach>)
In <prompt> (restrictions on the content apply)
<foreach> allows sophisticated concatenation of prompts
http://www.w3.org/TR/voicexml21/
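A hedged sketch of <foreach> inside a prompt (not from the talk); in a real
service the array would typically come from a <data> fetch rather than a
hard-coded <script>:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="flights">
    <!-- illustrative data; normally filled from <data> -->
    <script>
      var flights = [
        {time: 'nine a m', dest: 'Torino'},
        {time: 'noon',     dest: 'Rome'}
      ];
    </script>
    <block>
      <prompt>
        <!-- one prompt fragment per array element -->
        <foreach item="f" array="flights">
          Flight at <value expr="f.time"/> to <value expr="f.dest"/>.
          <break time="300ms"/>
        </foreach>
      </prompt>
    </block>
  </form>
</vxml>
```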
51. VoiceXML – Final Remarks
The changed landscape for speech application development:
Virtually all IVRs today support VoiceXML
New options related to VoiceXML:
SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
Large-scale hosting of speech applications (TellMe, Voxeo)
Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
Further changes may come from CCXML adoption
… but:
Mostly system-driven applications are deployed in practice
New challenges, such as incorporating more powerful, mixed-initiative dialog
strategies, are under discussion.
http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
52. VoiceXML Resources
Voice Browser Working Group (spec, FAQ, implementations, resources):
http://www.w3.org/Voice/
VoiceXML Forum site (resources, education, interest groups):
http://www.voicexml.org/
VoiceXML Forum Review:
http://www.voicexmlreview.org/
Interesting articles related to VoiceXML and more
Example code in the sections "First Words" and "Speak & Listen"
Ken Rehor’s World of VoiceXML
http://www.kenrehor.com/voicexml
Online documentation related to VoiceXML Platforms
Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
Many books on VoiceXML:
Jim Larson, "VoiceXML: Introduction to Developing Speech Applications",
Prentice-Hall, 2002
A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002
53. Call Control:
CCXML 1.0
54. CCXML 1.0 – Highlights
Asynchronous event processing
Acceptance or refusal of an incoming call
Management of different types of call transfer
Outbound call activation (interaction with an external entity)
Use of ECMAScript adding scripting capabilities to call control
applications
VoiceXML modularization
Conferencing management
55. CCXML 1.0 – Elements Relationship
57. CCXML 1.0 – connection.alerting Event
Basic telephony information is retrieved on the alerting event and is
available in the CCXML document:
Local URI, remote URI, protocol used, redirection info, etc.
Based on this information, CCXML can accept or reject the incoming call,
even before contacting the dialog server.
Any error that occurs during the phone call can be managed by the CCXML
service (connection.failed, error.connection events).
[Sequence: the Call Control Adapter sends connection.alerting to the CCXML
Interpreter, which analyzes the event$ content and replies with <accept/> or
<reject/>.]
http://www.w3.org/TR/ccxml
58. CCXML 1.0 – How to activate a new dialog
CCXML actions:
Receives the alerting event from the Call Control Adapter
Asks the dialog server to prepare a new dialog
Waits for the preparation
If the dialog has been successfully prepared, accepts the call
Asks the dialog server to start the prepared dialog
[Sequence between Call Control Adapter, CCXML Interpreter and VoiceXML
Interpreter: alerting → prepare a new dialog → dialog prepared → call
accepted → connected → start the prepared dialog → dialog started.]
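The steps above can be sketched as CCXML transitions; this is an assumed
illustration (element and event names follow the CCXML 1.0 draft; the dialog
URI is illustrative and details vary by platform):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="preparedid"/>
  <eventprocessor>
    <!-- incoming call: prepare the dialog before accepting -->
    <transition event="connection.alerting">
      <dialogprepare src="'http://www.example.com/app.vxml'"/>
    </transition>
    <!-- dialog is ready: remember its id and accept the call -->
    <transition event="dialog.prepared">
      <assign name="preparedid" expr="event$.dialogid"/>
      <accept/>
    </transition>
    <!-- caller is connected: start the prepared dialog -->
    <transition event="connection.connected">
      <dialogstart prepareddialogid="preparedid"/>
    </transition>
    <!-- dialog ended: hang up -->
    <transition event="dialog.exit">
      <disconnect/>
    </transition>
  </eventprocessor>
</ccxml>
```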
59. Call transfer
CCXML supports call transfers of different modalities: "bridge", "blind",
"consultation".
Depending on the modality, the CCXML language drives the expected
interaction with the Call Control Adapter to correctly perform the transfer.
During the different phases of a transfer, the CCXML service can receive any
asynchronous event and manage it correctly, interrupting the call if
requested.
[Sequence: the CCXML Interpreter performs a transfer by exchanging commands
and answers with the Call Control Adapter until the transfer completes.]
60. External Events
The CCXML Interpreter Context can receive events from any external entity
able to use the HTTP protocol.
Events generated in this way must be sent to the CCXML Interpreter via an
HTTP POST request.
Such an event can be addressed:
to a new session, whose creation is requested by the event
to an existing session, by specifying its ID in the request
[Sequence: the external entity posts a basic HTTP event to the CCXML
Interpreter, which manages the event and returns the result.]
http://www.w3.org/TR/ccxml
61. External event on a new session:
the Outbound Call
A request arrives at Call Control from an external entity.
The CCXML service associated with the received event is started, and a set
of operations between the Call Control Adapter, Call Control and the Dialog
Server is activated: the outbound call is placed.
[Sequence: outbound call request → create a call → connection progressing →
prepare a dialog → prepared → connection connected → start the prepared
dialog.]
Google TechTalk – Mar 6th, 2009 Paolo Baggia 61
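An outbound-call sketch in CCXML. The event name app.placeCall, the event field event$.number and the document outbound.vxml are assumptions for illustration:

```xml
<!-- Place an outbound call on an external request, then run a dialog -->
<transition event="app.placeCall">
  <createcall dest="event$.number"/>
</transition>
<transition event="connection.connected">
  <dialogstart src="'outbound.vxml'" connectionid="event$.connectionid"/>
</transition>
```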
62. External event on a session:
dialog termination request
An external entity performs an HTTP POST request towards the CCXML
Interpreter Context, specifying a sessionid and requesting the
termination of a particular dialog;
The CCXML Interpreter checks the session id; if it is valid, the
received event is injected into the session;
The CCXML service has a transition on that event and performs the
termination of the dialog identified by the given dialog identifier.

Sequence (Call Control Adapter – CCXML Interpreter – VoiceXML Interpreter):
dialog termination request -> dialogterminate (dialogid) -> dialog.exit ->
event management (e.g. disconnect(connectionid) or a new dialogprepare)
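A sketch of the termination handling. The event name app.stopDialog is a hypothetical application-defined name, and the sketch assumes the dialog.exit event carries the connection id:

```xml
<!-- Terminate a dialog on request of an external entity -->
<transition event="app.stopDialog">
  <dialogterminate dialogid="event$.dialogid" immediate="true"/>
</transition>
<!-- When the dialog exits, release the connection -->
<transition event="dialog.exit">
  <disconnect connectionid="event$.connectionid"/>
</transition>
```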
63. Loading different CCXML documents:
<fetch> and <goto> elements
The <fetch> and <goto> elements are used respectively to asynchronously
fetch the content identified by the attributes of <fetch>, and to move
into a fetched document once it has been successfully loaded;
Benefits: modularization, source exemplification, more readability.

<fetch
  next="'http://../Fetch/doc1.ccxml'"
  type="'application/ccxml+xml'"
  fetchid="result"/>

The fetch of document "doc1.ccxml" raises fetch.done or error.fetch;
the first event that occurs in a new document is ccxml.loaded;
on fetch.done the interpreter can goto into the new document, or
continue to work in the current one.
http://www.w3.org/TR/ccxml
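The completion events can be handled as in this sketch, which reuses the fetchid variable result from the example above:

```xml
<!-- Move into the fetched document once loading succeeds -->
<transition event="fetch.done" cond="event$.fetchid == result">
  <goto fetchid="event$.fetchid"/>
</transition>
<transition event="error.fetch">
  <log expr="'fetch failed: ' + event$.reason"/>
</transition>
```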
65. CCXML 1.0 – Next Steps
The CCXML specification is a Last Call Working Draft; all the feature
requests and clarifications have been addressed;
An Implementation Report test suite is under development;
It is very close to being published as a W3C Candidate Recommendation;
Companies, inside and outside the Working Group, will be invited to send
implementation reports on their CCXML platforms;
After that, the CCXML 1.0 specification can move to Proposed
Recommendation and then to W3C Recommendation.
http://www.w3.org/TR/ccxml
67. Speech Interface Framework - End of 2009
(by Jim Larson)
(Diagram: the W3C Speech Interface Framework. The user speaks over the
telephone to a media system; ASR uses the Speech Recognition Grammar
Specification (SRGS), the N-gram Grammar ML and Semantic Interpretation
for Speech Recognition (SISR); language understanding and context
interpretation produce EMMA 1.0; the Dialog Manager runs VoiceXML 2.0 /
VoiceXML 2.1 and connects to the World Wide Web; a DTMF tone recognizer
and the Pronunciation Lexicon Specification (PLS) feed recognition;
output goes through language generation, a pre-recorded audio player and
TTS driven by the Speech Synthesis Markup Language (SSML); telephony is
controlled by Call Control XML (CCXML); reusable components tie the
pieces together.)
68. Architectural Changes
(Diagram: the VoiceXML architecture. The user talks to the VoiceXML
platform; the VoiceXML browser fetches .vxml documents from the web
application over HTTP; ASR/DTMF resources load .grxml/.gram grammars and
.pls lexicons; TTS/Audio output uses .ssml, .wav/.mp3 and .pls
resources.)
69. VoxNauta – Internal Architecture
70. Loquendo MRCP Server/LSS 7.0 Architecture
(Diagram: clients reach the server through a load balancer, speaking
MRCPv1 over RTSP/RTP or MRCPv2 over SIP (SDP)/RTP; RTSP, SIP and SDP
parsers feed the MRCP v1/v2 server; management components include an
SNMP management process, a graphic management console and configuration
files; an audio provider and a logger (log files) support the server;
the TTS & ASR interface exposes TTS and ASR APIs towards LTTS, LASR and
LASR-SV, returning results as NLSML / EMMA; the stack runs on
Win32/Linux.)
71. IETF MRCP Protocols
The Media Resource Control Protocol (MRCP) specifications are IETF
standards:
MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on
RTSP/RTP;
MRCPv2 is an Internet Draft,
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on
SIP/RTP, offering new audio recording and Speaker Verification
functionalities.
An optimized client-server solution for the large-scale deployment of
speech technologies in the telephony field: call centers, CRM, news and
email reading, self-service applications, etc.
Allows a standard interface to speech technologies in all IVR platforms.
For more information read:
Dave Burke, Speech Processing for IP Networks: Media Resource Control
Protocol (MRCP), Wiley.
72. VoiceXML in a Call Center
(Diagram: calls from the fixed/mobile network reach a PBX; an optional
voice gateway adapts non-SIP PBXs; the VoxNauta IVR connects to web, CTI
and data servers, while an ACD routes calls to human operators.)
73. VoiceXML in the IMS Architecture
(Diagram: the fixed/mobile network enters through a voice gateway
speaking TDM protocols; inside the IP network, SIP protocols and RTP
connect the gateway to the VoxNauta MRF, which fetches VoiceXML over
HTTPS from the application server.)
74. Overview
A Bit of History
W3C Speech Interaction Framework Today
ASR/DTMF
TTS
Lexicons
Voice Dialog and Call Control
Voice Platforms and Next Evolutions
W3C Multimodal Interaction Today
MMI Architecture
EMMA and InkML
A language for Emotions
Next Future
75. Modes, Modalities and Technologies
Speech
Audio
Stylus
Touch
Accelerometer
Keyboard/keypad
Mouse/touchpad
Camera
Geolocation
Handwriting recognition
Speaker verification
Signature verification
Fingerprint identification
….
76. Complement and Supplement
Speech: transient, linear, hands- and eyes-free, suffers noise.
Visual: persistent, spatial, engages the eyes, suffers light conditions.
Enables users to choose among different modalities, or to mix them;
Adaptable to different social and environmental conditions, or to user
preference.
77. GUI VUI MUI
or MMUI
78. MMI has an Intrinsic Complexity
(Diagram, from Deborah Dahl, Voice Search 2009: an Interaction Manager
must derive user intent from many inputs at once: speech, text, mouse,
handwriting, accelerometer, fingerprint, face identification,
geolocation, speaker verification, vital signs, sensor identification,
video, photograph, audio recording, drawing.)
79. MMI can Include Many Different Technologies
(Diagram, from Deborah Dahl, Voice Search 2009: the Interaction Manager
connects technologies such as speech recognition, touchscreen,
accelerometer, geolocation, keypad, fingerprint recognition and
handwriting recognition.)
80. Uniform Representation for MMI
Getting everything to work together is complicated.
One simplification is to represent the same information from different
modalities in the same format:
we need a common language for representing the same information coming
from different modalities.
EMMA (Extensible MultiModal Annotation) 1.0:
a uniform representation for multimodal information.
81. (Diagram, from Deborah Dahl, Voice Search 2009: the same components
as the previous figure, but every modality (speech recognition,
touchscreen, accelerometer, geolocation, keypad, fingerprint
recognition, handwriting recognition) now sends EMMA to the Interaction
Manager.)
82. EMMA Structural Elements
EMMA elements provide containers for application semantics and for
multimodal annotation:
emma:emma, emma:interpretation, emma:one-of, emma:group, emma:sequence,
emma:lattice

<emma:emma ...>
  <emma:one-of>
    <emma:interpretation>
      ...
    </emma:interpretation>
    <emma:interpretation>
      ...
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
http://www.w3.org/TR/emma/
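The skeleton above can be filled in as an N-best list. A minimal sketch following the EMMA 1.0 syntax; the payload element <destination> and the confidence values are illustrative assumptions:

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="r1">
    <!-- Two competing recognition hypotheses, best first -->
    <emma:interpretation id="int1" emma:confidence="0.75">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.20">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
```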
83. EMMA Annotations
Characteristics and processing of input, e.g.:
emma:tokens: tokens of input
emma:process: reference to processing
emma:no-input: lack of input
emma:uninterpreted: uninterpretable input
emma:lang: human language of input
emma:signal: reference to signal
emma:media-type: media type
emma:confidence: confidence scores
emma:source: annotation of input source
emma:start, emma:end: timestamps (absolute/relative)
emma:medium, emma:mode, emma:function: medium, mode and function of input
emma:hook: hook
http://www.w3.org/TR/emma/
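Several of these annotations can appear together on one interpretation. A hedged sketch; the payload, token string, timestamps and confidence value are illustrative assumptions:

```xml
<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <!-- A voice input, annotated with its tokens, timing and confidence -->
  <emma:interpretation id="int1"
      emma:medium="acoustic" emma:mode="voice"
      emma:confidence="0.82"
      emma:tokens="to boston"
      emma:start="1236340800000" emma:end="1236340802500"
      emma:lang="en-US">
    <destination>Boston</destination>
  </emma:interpretation>
</emma:emma>
```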
84. EMMA 1.0 – Example Travel Application
INPUT:
"I want to go from Boston
to Denver on March 11"
http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
85. EMMA 1.0 – Same meaning
Speech:
<emma:interpretation medium="acoustic" mode="voice" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>

Mouse:
<emma:interpretation medium="tactile" mode="gui" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>
http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
86. EMMA 1.0 – Handwriting Input
<emma:interpretation medium="tactile" mode="ink" id="int1">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>
http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
88. EMMA 1.0 – Representing Lattices
Speech recognizers, handwriting recognizers and other input processing
components may provide lattice output:
a graph encoding a range of possible recognition results or
interpretations.

(Diagram: a word lattice over nodes 1..8 with arcs
1->2 flights, 2->3 to, 3->4 boston|austin, 4->5 from,
5->6 portland|oakland, 6->7 today, 7->8 please, 6->8 tomorrow.)

From Michael Johnston, AT&T Research
http://www.w3.org/TR/emma/
89. EMMA 1.0 – Representing Lattices
Lattices can be represented using the EMMA elements
<emma:lattice emma:initial="?" emma:final="?"> and
<emma:arc emma:from="?" emma:to="?">:

<emma:emma version="1.0"
    xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:interpretation>
    <emma:lattice emma:initial="1" emma:final="8">
      <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
      <emma:arc emma:from="2" emma:to="3">to</emma:arc>
      <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
      <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
      <emma:arc emma:from="4" emma:to="5">from</emma:arc>
      <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
      <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
      <emma:arc emma:from="6" emma:to="7">today</emma:arc>
      <emma:arc emma:from="7" emma:to="8">please</emma:arc>
      <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>

From Michael Johnston, AT&T Research
http://www.w3.org/TR/emma/
90. EMMA in Multimodal Framework
http://www.w3.org/TR/mmi-framework
92. InkML 1.0 – Status and Advances
Rich annotation for ink:
traces, trace formats and trace collections;
contextual information;
canvases;
etc.
The result of classifying InkML traces may be a semantic representation
in EMMA 1.0.
Current status is Last Call Working Draft; next it will become a
Candidate Recommendation, with the release of an Implementation Report
test suite.
Raising interest from major industries.
http://www.w3.org/TR/InkML/
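A minimal InkML sketch of the trace concept; the point coordinates are illustrative assumptions:

```xml
<ink xmlns="http://www.w3.org/2003/InkML">
  <!-- One pen stroke: a sequence of x y points from the digitizer -->
  <trace>
    10 0, 9 14, 8 28, 7 42, 6 56
  </trace>
  <!-- A second stroke crossing the first -->
  <trace>
    0 30, 20 30
  </trace>
</ink>
```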
93. MMI Architecture Specification
“Multimodal Architecture and Interfaces“, W3C Working Draft,
http://www.w3.org/TR/mmi-arch/
The Runtime Framework provides the basic infrastructure and controls
communication among the constituents.
The Interaction Manager (IM) coordinates the Modality Components (MCs)
through life-cycle events and contains the shared data (context).
Communication between the IM and the MCs is event-based.

(Diagram: the Runtime Framework hosts the Delivery Context, the
Interaction Manager and the Data Component; Modality Components 1..N
plug in through the Modality Component API.)
http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
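A life-cycle event can be sketched in markup. This is a hedged sketch based on Working Draft material; the namespace, element and attribute names may differ across drafts, and the identifiers (IM-1, voiceMC, ctx-1, req-42, dialog.vxml) are hypothetical:

```xml
<!-- The IM asks a modality component to start running a document -->
<mmi:mmi xmlns:mmi="http://www.w3.org/2008/04/mmi-arch" version="1.0">
  <mmi:startRequest source="IM-1" target="voiceMC"
      context="ctx-1" requestID="req-42">
    <mmi:contentURL href="dialog.vxml"/>
  </mmi:startRequest>
</mmi:mmi>
```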
94. MMI Arch – Laboratory Implementation
Implementation of components using W3C markup languages.
(Diagram: the Runtime Framework contains the Delivery Context, an
SCXML-based Interaction Manager and the Data Component; an HTML Modality
Component for the GUI and a VoiceXML Modality Component for the VUI plug
in through the Modality Component API.)
http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
95. MMI Arch – Laboratory Implementation
SCXML based Interaction Manager.
VoiceXML + HTML modality components.
(Diagram: on the server, an SCXML interpreter with an HTTP I/O processor
acts as Interaction Manager; the GUI modality component is an HTML
browser on the client, reached through a Modality Component API of
HTTP + XML (using AJAX); the voice modality component is a
CCXML/VoiceXML server with a telephony interface to the user's phone,
reached through HTTP + XML (EMMA).)
http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
96. MMI Architecture – Open Issues
Profiles
Start-up, Registration, Delegation
in distributed environment
Transport of Events
Extensibility of Events
http://www.w3.org/TR/mmi-arch/
97. Emotion in Wikipedia
From Wikipedia definition:
“An emotion is a mental and physiological state associated with a
wide variety of feelings, thoughts, and behaviours. It is a prime
determinant of the sense of subjective well-being and appears to play
a central role in many human activities. As a result of this generality,
the subject has been explored in many, if not all of the human
sciences and art forms. There is much controversy concerning how
emotions are defined and classified.”
General goal: Make interaction between humans and machines more
natural for the humans
Machines should become able:
• to register human emotions (and related states)
• to convey emotions (and related states)
• to “understand” the emotional relevance of events