Voice Browsing And Multimodal Interaction In 2009

Paolo Baggia
Director of International Standards

March 6th, 2009

Google TechTalk

Overview

A Bit of History

W3C Speech Interaction Framework Today
ASR/DTMF
TTS
Lexicons
Voice Dialog and Call Control
Voice Platforms and Next Evolutions

W3C Multimodal Interaction Today
MMI Architecture
EMMA and InkML
A language for Emotions

Next Future

Company Profile

Privately held company (fully owned by Telecom Italia), founded in 2001 as a
spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and
expertise in voice processing.
Global company, leader in Europe and South America for award-winning,
high-quality voice technologies (synthesis, recognition, authentication and
identification) available in 26 languages and 62 voices.
Multilingual, proprietary technologies protected by over 100 patents worldwide.
Financially robust: break-even reached in 2004, revenues and earnings growing
year on year.
Growth-plan investment approved for the evolution of products and services.
Offices in New York. Headquarters in Torino, local representative sales
offices in Rome, Madrid, Paris, London, Munich.
Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.

International Awards

“2008 Frost & Sullivan European Telematics and Infotainment
Emerging Company of the Year” Award

Winner of “Market leader-Best Speech Engine” Speech
Industry Award 2007 and 2008

Loquendo MRCP Server: Winner of 2008 IP Contact
Center Technology Pioneer Award

“Best Innovation in Automotive Speech Synthesis” Prize
AVIOS-SpeechTEK West 2007

“Best Innovation in Expressive Speech Synthesis” Prize
AVIOS-SpeechTEK West 2006

“Best Innovation in Multi-Lingual Speech Synthesis”
Prize AVIOS-SpeechTEK West 2005


A Bit of History


Standard Bodies
Two main standard bodies:
W3C – World Wide Web Consortium
Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full
potential. Staff based at MIT (USA), ERCIM (France), Keio Univ (Japan).
400 members all over the world, 50 Working, Interest and Coordination Groups.
W3C is where the framework of today’s Web is developed (HTML, CSS, XML, DOM,
SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web
Accessibility, Device Independence)
IETF – Internet Engineering Task Force
Founded in 1986, then grew from 1991 as part of the Internet Society. 1300 members.
HTTP, SIP, RTP and many other protocols. Media Resource Control Protocol (MRCP)
is very relevant for speech platforms.

Two industrial forums:
VoiceXML Forum (www.voicexml.org)
Inventors of VoiceXML 1.0, then submitted to W3C for standardization.
Current goal is to promote, disseminate and support VoiceXML and related standards.
SALT Forum (www.saltforum.org)
Supported by Microsoft to define a lightweight markup for telephony and multimodal
applications.

Other relevant bodies:
3GPP, OMA, ETSI, NIST


The (r)evolution of VoiceXML
1998 - 2004

[Timeline figure, with year marks at 1998, 1999, 2000, 2002, 2004, 2007, 2008 and 2009. Milestones shown:
W3C Voice Browser Workshop
VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola)
W3C charters the Voice Browser WG
VoiceXML 1.0 released
SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks)
W3C charters the Multimodal Interaction WG
W3C Recommendations: VoiceXML 2.0, SRGS 1.0, SSML 1.0, SISR 1.0, PLS 1.0, EMMA 1.0]

Preparing to announce VoiceXML 1.0
Friday Feb. 25th, 2000
Lucent, Naperville, Illinois

Left to right: Gerald Karam (AT&T), Linda Boyer (IBM),
Ken Rehor (Lucent), Bruce Lucas (IBM),
Pete Danielsen (Lucent), Jim Ferrans (Motorola),
Dave Ladd (Motorola).


Speech Interface Framework in 2000
(by Jim Larson)

[Framework diagram. Components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, World Wide Web, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, Reusable Components.
Specifications annotated on the diagram: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, EMMA, Pronunciation Lexicon Specification (PLS), VoiceXML 2.0, VoiceXML 2.1, Speech Synthesis Markup Language (SSML), Call Control XML (CCXML).]

Speech Interface Framework - Today
(by Jim Larson)


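[Framework diagram, updated to today's status. Labels recoverable from the figure: VoiceXML 2.0, VoiceXML 2.1, N-gram Grammar ML, EMMA 1.0, and the component boxes ASR, Language Understanding, Context Interpretation, World Wide Web, Telephone System, Media Planning, Language Generation, TTS, Reusable Components.]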


Speech Interface Framework - End of 2009
(by Jim Larson)


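[Framework diagram, as expected at the end of 2009. Labels recoverable from the figure: VoiceXML 2.0, VoiceXML 2.1, N-gram Grammar ML, EMMA 1.0, and the component boxes ASR, Language Understanding, Context Interpretation, World Wide Web, Telephone System, Media Planning, Language Generation, TTS, Reusable Components.]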


W3C Process


Architectural Changes

Traditional (proprietary) architecture

[Diagram: the User talks to a Proprietary platform (ASR / DTMF, TTS / Audio) that runs a Speech Applic. built with a Proprietary SCE.]

VoiceXML architecture

[Diagram: the User talks to a VoiceXML platform, whose VoiceXML Browser drives ASR / DTMF (.grxml/.gram, .pls) and TTS / Audio (.ssml, .wav/.mp3, .pls) and fetches .vxml pages over HTTP from a Web Applic.]


The VoiceXML Impact

VoiceXML changed the landscape of IVRs and speech application
creation
From proprietary to standard-based speech applications

Before
• Proprietary platforms (HW & SW)
• Proprietary applications (by proprietary SCE)
• Mainly DTMF and pre-recorded prompts
• First attempts to add speech into IVR

After
• Standard VoiceXML platforms
• Standards for Speech Technologies
• Standard tools for VoiceXML applications
• Integration of DTMF and ASR
• Still predominance of DTMF, but more and more speech applications


Overview

A Bit of History

ASR/DTMF
TTS
Lexicons

MMI Architecture
EMMA and InkML

Next Future

Standards for ASR and DTMF
SRGS 1.0, SISR 1.0


W3C Standards for Speech/DTMF Grammars

SYNTAX: SRGS
Defines constraints on admissible sentences for a specific recognition turn
Formats: ABNF, XML
Modes: voice, dtmf
http://www.w3.org/TR/speech-grammar/

SEMANTICS: SISR
Describes how to produce results after an utterance is recognized
Formats: literal, script
http://www.w3.org/TR/semantic-interpretation/


SRGS/SISR Grammars for “Torino”

SISR literal, SRGS XML

<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
  xmlns="http://www.w3.org/2001/06/grammar"
  tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>

SISR literal, SRGS ABNF

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0-literals>;
public $main = Torino {10100} ;

SISR script, SRGS XML

<grammar xml:lang="en-US" version="1.0"
  xmlns="http://www.w3.org/2001/06/grammar"
  tag-format="semantics/1.0">
  <tag>var unused=7;</tag>
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>out="10100";</tag>
  </rule>
</grammar>

SISR script, SRGS ABNF

#ABNF 1.0 iso-8859-1;
mode voice;
tag-format <semantics/1.0>;
{var unused=7;};
public $main = Torino {out="10100";} ;
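
To make the literal case concrete, here is a minimal Python sketch that reads the SRGS XML grammar with literal semantics and pairs each token with the literal tag that follows it. It is only an illustration of what a recognizer would return for "Torino"; the function name literal_results is an assumption of this sketch, and real SRGS processing (rule references, alternatives, weights) is not covered.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "http://www.w3.org/2001/06/grammar"

# The literal-semantics grammar from the slide.
GRAMMAR = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
         xmlns="http://www.w3.org/2001/06/grammar"
         tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>"""


def literal_results(grammar_xml):
    """Map each token to the literal tag that follows it (hypothetical helper)."""
    root = ET.fromstring(grammar_xml.encode("utf-8"))
    if root.get("tag-format") != "semantics/1.0-literals":
        raise ValueError("not a literal-semantics grammar")
    results = {}
    for rule in root.findall(f"{{{SRGS_NS}}}rule"):
        token = None
        for child in rule:
            name = child.tag.split("}")[-1]  # strip the namespace part
            if name == "token":
                token = (child.text or "").strip()
            elif name == "tag" and token is not None:
                results[token] = (child.text or "").strip()
    return results


RESULTS = literal_results(GRAMMAR)
```

With the grammar above, saying "Torino" would yield the literal string 10100 as the semantic result.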


SRGS/SISR Standards – Pros

Powerful syntax (CFG) and very powerful semantics (ECMAScript)
DTMF and Voice input are transparent to the application
Wide and consistent adoption among technology vendors

Two syntaxes, XML and ABNF, are great!
Developers can choose (XML validation vs. compact format)

Transformations are possible
XML → ABNF (easy, simple XSLT)
ABNF → XML (requires an ABNF parser)

Open Source tools might be created to:
Validate grammar syntax
Transform grammars
Debug grammars on written input
Coverage tests: explode covered sentences, GenSem, SemTester, etc.


SRGS/SISR Standards – Small Issues

Semantics declaration: tag-format attribute
If value “semantics/1.0”?
Mandate SISR Script semantics inside semantic tags
If value “semantics/1.0-literals”?
Mandate SISR Literal semantics inside semantic tags
If missing?
Unclear! Risk of interoperability troubles

SISR Script Semantics
Clumsy default assignment: returns last referenced rule only
Developer must explicitly propagate results up through the rules
Be careful when redefining “out”
Assigning a scalar value to it might result in errors

SISR Literal Semantics
Only useful for very simple word-list rules
No support for encapsulating rules
SISR Literal grammars as external references ONLY!


SRGS/SISR – Encapsulated Grammars

[Diagram of grammars referencing each other, each labelled with its semantics type:
Gr1.grxml: Script
Gr2.gram: Literal
Gr3.grxml: Script
Gr41.grxml: Literal
Gr42.gram: Script]


SRGS/SISR Standards – Rich XML Results
Section 7 of SISR 1.0 specification
http://www.w3.org/TR/semantic-interpretation/#SI7
Serialization rules from SISR ECMA results into XML
Edge cases:
Arrays
Special variables “_attributes” and “_value”
Creation of namespaces and prefixes

ECMAScript result:
{
  drink: {
    _nsdecl: {
      _prefix:"n1",
      _name:"http://www.example.com/n1"
    },
    _nsprefix:"n1",
    liquid: {
      _nsdecl: {
        _prefix:"n2",
        _name:"http://www.example.com/n2"
      },
      _attributes: {
        color: {
          _nsprefix:"n2",
          _value:"black"
        }
      },
      _value:"coke"
    },
    size:"medium"
  }
}

Serialized XML:
<n1:drink xmlns:n1="http://www.example.com/n1">
  <liquid n2:color="black"
    xmlns:n2="http://www.example.com/n2">coke</liquid>
  <size>medium</size>
</n1:drink>
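
The mapping from the ECMAScript result to XML can be sketched in Python. This is a minimal sketch that handles only the properties used in the drink example (_nsdecl, _nsprefix, _attributes, _value); arrays and the remaining edge cases of Section 7 are deliberately left out, and the function name serialize is an illustrative choice, not something defined by SISR.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

# The result object from the slide, written as a Python dict.
RESULT = {
    "drink": {
        "_nsdecl": {"_prefix": "n1", "_name": "http://www.example.com/n1"},
        "_nsprefix": "n1",
        "liquid": {
            "_nsdecl": {"_prefix": "n2", "_name": "http://www.example.com/n2"},
            "_attributes": {"color": {"_nsprefix": "n2", "_value": "black"}},
            "_value": "coke",
        },
        "size": "medium",
    }
}


def serialize(name, node):
    """Serialize one property of the result into an XML element string."""
    if isinstance(node, list):
        raise NotImplementedError("arrays are left out of this sketch")
    if not isinstance(node, dict):
        return f"<{name}>{escape(str(node))}</{name}>"
    prefix = node.get("_nsprefix")
    tag = f"{prefix}:{name}" if prefix else name
    attrs = []
    decl = node.get("_nsdecl")
    if decl:  # namespace declaration on this element
        attrs.append(f'xmlns:{decl["_prefix"]}={quoteattr(decl["_name"])}')
    for attr_name, attr in node.get("_attributes", {}).items():
        if isinstance(attr, dict):
            attr_prefix, value = attr.get("_nsprefix"), attr.get("_value", "")
        else:
            attr_prefix, value = None, attr
        full = f"{attr_prefix}:{attr_name}" if attr_prefix else attr_name
        attrs.append(f"{full}={quoteattr(str(value))}")
    # Properties without a leading underscore become child elements.
    children = "".join(
        serialize(key, value) for key, value in node.items()
        if not key.startswith("_")
    )
    text = escape(str(node.get("_value", "")))
    head = " ".join([tag] + attrs)
    return f"<{head}>{text}{children}</{tag}>"


XML_TEXT = serialize("drink", RESULT["drink"])
PARSED = ET.fromstring(XML_TEXT)  # checks that the output is well-formed
```

Parsing the output back is a cheap way to confirm that prefixes and namespace declarations line up.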


SRGS/SISR Standards – Next Steps

Adoption of the PLS 1.0 lexicon
Clear entry point into PLS lexicons, <token> element
Missing role attribute in <token> to allow homographs
disambiguation

Next extensions via Errata
XML 1.1 support and IR
Update normative references

No Major Extensions are needed!


Speech Synthesis
SSML 1.0/1.1


TTS – Functional Architecture and
Markup/Non-Markup support

Processing steps: Structure Analysis → Text Normalization → Text-to-Phoneme Conversion → Prosody Analysis → Waveform Production

Structure Analysis
Markup support: <p>, <s>
Non-Markup support: infer the structure by automatic text analysis

Text Normalization
Markup support: <say-as> for date, time, phone number, numbers; <sub> for acronyms and transliterations
Non-Markup support: automatically identify and convert constructs

Text-to-Phoneme Conversion
Markup support: <phoneme>, <lexicon>
Non-Markup support: look up in pronunciation dictionary

Prosody Analysis
Markup support: <emphasis>, <break>, <prosody>
Non-Markup support: automatically generate prosody through analysis of document structure and sentence syntax

Waveform Production
Markup support: <voice>, <audio>
http://www.w3.org/TR/speech-synthesis/

SSML 1.0 – Language description (I)

Document Structure
– <speak> root element, with the version attribute and the SSML namespace attribute
Languages
– xml:lang on <speak> and on inner elements such as <p>

<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
  xml:lang="en-US">
  <p>I don't speak Japanese.</p>
  <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
</speak>

Processing and Pronunciation
– <p> and <s> (paragraph and sentence)
to give a structure to the text
– <say-as> element
to indicate the type of text construct contained within the element,
e.g. date, numbers, etc.
– <phoneme> element
to provide a phonetic pronunciation for the contained text in IPA
– <sub> element
to provide substitutions, e.g. expanding acronyms into a sequence of
words

SSML 1.0 – Language description (II)
Style
- <voice> element
<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.0"
  xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">

  The moon is rising on the beach, when John says,
  looking Mary in the eyes:
  <voice name="simon">I love you!</voice>
  but she suddenly replies:
  <voice name="susan"> Please, be serious! </voice>
</speak>

Voice selection attributes are:
name, xml:lang, gender, age, and variant

- <emphasis> element
requests that the contained text be spoken with emphasis
level attribute can set it to strong, moderate, reduced, or none
- <break> element
controls the pausing between words
time attribute with time expressions such as “5s” or “20ms”
strength attribute with values:
none, x-weak, weak, medium (default value), strong, or x-strong

SSML 1.0 – Language description (III)

Prosody
<prosody> element
permits control of the pitch, speaking rate and volume of the
speech output.

The attributes are:
volume: the volume for the contained text.
rate: a change in the speaking rate for the contained text (a label such as slow or fast, or a multiplier of the default rate).
duration: a value in seconds or milliseconds for the desired time to take
to read the element contents.
pitch: the baseline pitch for the contained text.
range: the pitch range (variability) for the contained text in Hertz.
contour: sets the actual pitch contour for the contained text.

Other elements
<audio> element - to play an audio file
<mark> element - to place a marker into the text/tag sequence
<desc> element - to provide a description of a non-speech audio
source in <audio>
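
As an illustration of how these elements combine, the following Python sketch assembles a small SSML 1.0 document with <emphasis>, <break> and <prosody>. The sentence text and the function name build_ssml are assumptions made for the example; only the element and attribute names come from the slides.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def build_ssml(lang="en-US"):
    """Build a small SSML 1.0 document and return it as a string."""
    ET.register_namespace("", SSML_NS)  # serialize SSML as the default namespace
    speak = ET.Element(f"{{{SSML_NS}}}speak", {"version": "1.0", XML_LANG: lang})

    sentence = ET.SubElement(speak, f"{{{SSML_NS}}}s")
    sentence.text = "Your flight leaves at "
    emphasis = ET.SubElement(sentence, f"{{{SSML_NS}}}emphasis", {"level": "strong"})
    emphasis.text = "nine"
    emphasis.tail = " tomorrow."

    # Half a second of silence before the closing remark.
    ET.SubElement(speak, f"{{{SSML_NS}}}break", {"time": "500ms"})

    prosody = ET.SubElement(
        speak, f"{{{SSML_NS}}}prosody", {"volume": "soft", "pitch": "low"}
    )
    prosody.text = "Please arrive early."
    return ET.tostring(speak, encoding="unicode")


SSML_TEXT = build_ssml()
SSML_DOC = ET.fromstring(SSML_TEXT)
```

Building the markup through an XML library rather than string concatenation keeps the document well-formed when prompt text contains characters such as & or <.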

Towards SSML 1.1 – Motivations

Internationalization needs:
Three Workshops: Beijing (Nov’05), Crete (May’06), Hyderabad (Jan’07)
Results:
No major needs for Eastern and Western European languages
Many issues for Far East languages (Mandarin, Japanese, Korean)
Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many
Indian languages
Mark input with or without vowels
Mark the transliteration schema used for input

Extensions required by Voice Browser:
More powerful error handling, selection of fall-back strategies
Trimming attributes
Volume attribute to adopt a logarithmic scale (before was linear)

Alignment with PLS 1.0 specification for user lexicons
http://www.w3.org/TR/speech-synthesis11/

SSML 1.1 – Language Changes

<w> element

Lexicon extensions
<lookup> element
selects, through its ref attribute, which loaded lexicon applies to the
contained text.

Phonetic Alphabet Registry creation and adoption
"ipa" for International Phonetic Alphabet
Registering policy for other phonetic alphabets, similar to LTRU for
Language tags
Candidates:
PinYin for Mandarin Chinese
JEITA for Japanese
X-SAMPA, ASCII transliteration of IPA codes


Pronunciation Lexicon
PLS 1.0


Pronunciation Lexicons

Pronunciation Lexicon
A mapping between words (or short phrases), their written representations,
and their pronunciations suitable for use by an ASR engine or a TTS
engine

Pronunciation lexicons are not only useful for voice browsers
They have also proven effective mechanisms to support accessibility for the
differently able as well as greater usability for all users
They are used to good effect in screen readers and user agents supporting
multimodal interfaces

The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is
designed to enable interoperable specification of pronunciation
lexicons

http://www.w3.org/TR/pronunciation-lexicon/

PLS 1.0 – Language Overview

A PLS document is a container (<lexicon>) of several lexical entries
(<lexeme>)

Each lexical entry contains
One or more spellings (<grapheme>)
One or more pronunciations (<phoneme>) or substitutions (<alias>)

Each PLS document is related to a single unique language (xml:lang)

SSML 1.0 and SRGS 1.0 documents can reference one or more PLS
documents

Current version doesn’t include morphological, syntactic and semantic
information associated with pronunciations


PLS 1.0 – An Example

<lexicon version="1.0"
  xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
    http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
  alphabet="ipa" xml:lang="en-US">

  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>səˈpʌlvɪdə</phoneme>
  </lexeme>

  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>

</lexicon>
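
A consumer of such a lexicon essentially needs a lookup from grapheme to pronunciations or aliases. The Python sketch below shows one way to build that table from the example document; the function name load_lexicon and the shape of the returned dictionary are assumptions of this sketch, since PLS itself defines only the data format and no matching algorithm.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

# Trimmed copy of the example lexicon (schema location omitted).
PLS_DOC = """<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>s\u0259\u02c8p\u028clv\u026ad\u0259</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>"""


def load_lexicon(pls_xml):
    """Return {grapheme: {"phoneme": [...], "alias": [...]}} for a PLS document."""
    root = ET.fromstring(pls_xml)
    entries = {}
    for lexeme in root.findall(f"{{{PLS_NS}}}lexeme"):
        phonemes = [p.text for p in lexeme.findall(f"{{{PLS_NS}}}phoneme")]
        aliases = [a.text for a in lexeme.findall(f"{{{PLS_NS}}}alias")]
        # Every spelling of the lexeme shares the same pronunciations.
        for grapheme in lexeme.findall(f"{{{PLS_NS}}}grapheme"):
            entries[grapheme.text] = {"phoneme": phonemes, "alias": aliases}
    return entries


LEXICON = load_lexicon(PLS_DOC)
```

A TTS front end would consult the phoneme list directly, while an alias sends the replacement text back through normal text processing.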


PLS 1.0 – Used for TTS

SSML 1.0
<speak version="1.0" … xml:lang="en-US">
  <lexicon uri="http://www.example.com/SSMLexample.pls"/>
  The title of the movie is: "La vita è bella" (Life is beautiful),
  which is directed by Benigni.
</speak>

PLS 1.0
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>La vita è bella</grapheme>
    <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>Benigni</grapheme>
    <phoneme>bɛˈniːnji</phoneme>
  </lexeme>
</lexicon>

PLS 1.0 – Used for ASR

SRGS 1.0
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
  <lexicon uri="http://www.example.com/SRGSexample.pls"/>
  <rule id="movies" scope="public">
    <one-of>
      <item>Terminator 2: Judgment Day</item>
      <item>Pluto's Judgement Day</item>
    </one-of>
  </rule>
</grammar>

PLS 1.0
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>judgment</grapheme>
    <grapheme>judgement</grapheme>
    <phoneme>ˈdʒʌdʒ.mənt</phoneme>
  </lexeme>
</lexicon>

Examples of Use

Multiple pronunciations for the same orthography

Multiple orthographies

Homophones

Homographs

Acronyms, Abbreviations, etc.

Detailed descriptions can be found in:
W3C specification, Wikipedia
Paolo Baggia, SpeechTEK 2008 & Voice Search 2009


PLS 1.0 – Open Issues

No wide support of IPA in speech engines
Changes are slowly under way
Phonetic Alphabet Registry will open doors to other alphabets in a
controlled and interoperable way

Integration in ASR/TTS
SSML 1.1 will interoperate with PLS 1.0
SRGS 1.0 still missing support of role attribute for PLS 1.0

No matching algorithm inside PLS, because it is mainly a data
format


Pronunciation Alphabets
IPA, SAMPA


International Phonetic Alphabet

Pronunciation is represented by a phonetic alphabet
Standard phonetic alphabets
International Phonetic Alphabet (IPA)
Well known phonetic alphabet
SAMPA - ASCII based (simple to write)
Pinyin (Chinese Mandarin), JEITA (Japanese), etc.
Proprietary phonetic alphabets

International Phonetic Alphabet (IPA)
Created by the International Phonetic Association (active since 1886),
collaborative effort by all the major phoneticians around the world
Universally agreed system of notation for sounds of languages
Covers all languages
Requires UNICODE to write it
Normatively referenced by PLS


IPA – Chart
IPA was founded in 1886
It is the major international
association of phoneticians
The IPA alphabet provides
symbols making possible the
phonemic transcription of all
known languages

IPA characters can be encoded in
Unicode by supplementing
ASCII with characters from
other ranges, particularly:
IPA extensions (0250–02AF)
Latin Extended-A (0100-017F)
See the detailed:
http://www.unicode.org/charts
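
A quick way to see which Unicode ranges a transcription draws on is to classify each character by code point, as in the Python sketch below. The two ranges from the slide are used as given; Basic Latin and Spacing Modifier Letters (02B0–02FF, where the stress mark ˈ lives) are added here as an assumption of the sketch, and the helper names are illustrative.

```python
# (block name, first code point, last code point)
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),               # plain ASCII
    ("Latin Extended-A", 0x0100, 0x017F),          # from the slide
    ("IPA Extensions", 0x0250, 0x02AF),            # from the slide
    ("Spacing Modifier Letters", 0x02B0, 0x02FF),  # added for this sketch
]


def block_of(char):
    """Return the name of the block containing char, or "other"."""
    code = ord(char)
    for name, low, high in BLOCKS:
        if low <= code <= high:
            return name
    return "other"


def blocks_used(transcription):
    """Return the sorted list of blocks used by a transcription."""
    return sorted({block_of(char) for char in transcription})


# IPA transcription of "Sepulveda", written with escapes to survive any editor.
SEPULVEDA = "s\u0259\u02c8p\u028clv\u026ad\u0259"
USED = blocks_used(SEPULVEDA)
```

Even this short transcription mixes three blocks, which is why a Unicode-capable toolchain is needed from authoring through to the speech engine.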


Phonetic Alphabets – Issues

The real problem is how to write pronunciations reliably unless
you are a trained phonetician
Issues with fonts and authoring, browsers, but Unicode fonts today
support IPA extensions, see:
http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm

There are very few tools to help writing pronunciations and to let
you listen to what you have written

Make pronunciations available in IPA or other general phonetic
alphabets.


Voice Dialog languages:
VoiceXML 2.0
VoiceXML 2.1


VoiceXML 2.0 – Features, Elements

Menus, forms, sub-dialogs
<menu>, <form>, <subdialog>

Input
Speech recognition: <grammar>
Recording: <record>
Keypad: <grammar mode="dtmf">

Output
Audio files: <audio>
Text-To-Speech: <prompt>

Variables (ECMA-262)
<var>, <assign>, <script>
scoping rules

Events
<nomatch>, <noinput>, <help>, <catch>, <throw>

Transition and submission
<goto>, <submit>

Telephony
Connection control: <transfer>, <disconnect>
Telephony information

Platform specifics
<object>

Performance
Fetch
Properties

http://www.w3.org/TR/voicexml20/

VoiceXML 2.0 – Execution Model

Execution is synchronous
Only the disconnect event is handled (somewhat) asynchronously

Execution is always in a single dialog: <form> or <menu>
Form Interpretation Algorithm for <field> selection

Prompts are queued
Played only when encountering a waiting state
Played before a fetchaudio is started

Processing is always in one of two states:
Waiting for input in an input item:
<field>, <record>, <transfer>, etc.
Transitioning between input items in response to an input

Event-driven:
user’s input event handling
<nomatch>, <noinput>
generalized event mechanism
<catch>, <throw>
call event handling
connection.*
error event handling
error.*
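
The interplay between item selection and prompt queueing can be sketched as a loop. The Python below is a heavily simplified model, not the Form Interpretation Algorithm of the specification: it assumes only two kinds of form items ("block" and "field"), ignores events, grammars and guard conditions, and uses hypothetical names throughout.

```python
def run_form(items, answers):
    """Simulate a form.

    items: list of (kind, name, prompt), kind being "block" or "field".
    answers: dict mapping field names to the value the caller would give.
    Returns the filled variables and the prompts played at each waiting state.
    """
    filled = {name: None for _, name, _ in items}
    queue, flushes = [], []
    while True:
        # Select phase: first form item whose variable is still undefined.
        pending = [item for item in items if filled[item[1]] is None]
        if not pending:
            break
        kind, name, prompt = pending[0]
        queue.append(prompt)               # prompts are queued, not played
        if kind == "block":
            filled[name] = True            # a block runs once, without waiting
            continue
        flushes.append(list(queue))        # waiting state: the queue is played
        queue.clear()
        if answers.get(name) is None:
            raise ValueError(f"no input available for field {name!r}")
        filled[name] = answers[name]       # Process phase: the field is filled
    return filled, flushes


ITEMS = [
    ("block", "intro", "Welcome."),
    ("field", "city", "Which city?"),
    ("field", "date", "Which date?"),
]
FILLED, FLUSHES = run_form(ITEMS, {"city": "Torino", "date": "tomorrow"})
```

The welcome prompt from the block is not heard on its own: it is played together with the first field prompt, when the interpreter first waits for input.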

VoiceXML 2.1 – Extended Features
Dynamically referencing grammars and scripts:
<grammar srcexpr="…">, <script srcexpr="…">

Record user’s utterance during form filling
recordutterance property
Add new shadow variables: recording, recordingsize, recordingduration

Detect barge-in during prompt playback (SSML <mark>)
Add markexpr attribute
Add new shadow variables: markname and marktime

Fetch XML data without transition <data>
Use read-only subset of DOM
Dynamically concatenate prompts <foreach>
Iterate through ECMAScript arrays and execute content

Send data upon disconnect
<disconnect namelist="…">
Additional transfer type
<transfer type="consultation">

VoiceXML Applications

Static VoiceXML applications
The VoiceXML page is always the same, and so is the user experience
No personalization or customization

Dynamic VoiceXML applications
User experience is customized
• After authentication (PIN)
• Using caller-id or SIP-id
Data driven
Dynamic pages generated at runtime
e.g. JSP, ASP, etc.

http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/

A Drawback of VoiceXML 2.0

A drawback of VoiceXML is that the transition from a VoiceXML page
to another is a costly activity:
Fetch the new page, if not cached
Parse the page
Initialize the context, possibly loading and initializing a new application
root document
Load or pre-compile scripts

The transitions are the only way to return data to the Web Application
(if the VoiceXML is dynamic)

Pages must be created to include dynamic data

VoiceXML 2.1 addresses part of this drawback by feeding dynamic
data to a running VoiceXML page


Advantages of VoiceXML 2.1 - AJAX

Two of the eight new features in VoiceXML 2.1 help to create
more dynamic VoiceXML applications:
<data> element
<foreach> element

Static VoiceXML document can fetch user-specific data at runtime,
without changing the VoiceXML document
<data> element allows retrieval of arbitrary XML data without
VoiceXML document transitions
Returned XML data are accessible by a subset of DOM primitives
<foreach> extends prompts to allow iteration over a dynamic
array of information to create a dynamic prompt

This is similar to AJAX programming for HTML services
It decouples presentation layer (VoiceXML) from business logic
(accessed via <data>)

VoiceXML 2.1 – <data> Element

Attributes:
name: the variable to be filled with the DOM of the retrieved data
src or srcexpr: the URI of the location of the XML data
namelist: the list of variables to be submitted
method: either ‘get’ or ‘post’
enctype: media encoding
fetch and caching attributes

Like <var>, it may appear in executable content, in <form> and in <vxml>
The value of name must be a declared variable
The platform will fill the variable with the DOM of the fetched XML data
<data> element is synchronous (execution waits until the data has been fetched)


VoiceXML 2.1 – <foreach> Element

Attributes:
array: ECMAScript expression that must evaluate to an ECMAScript array
item: the variable that stores the element to be processed

<foreach> allows the application to iterate on an ECMAScript array and
to execute the content
<foreach> may appear:
In executable content (all executable content elements may appear as
content of <foreach>)
In <prompt> (restrictions on the content are applied)
<foreach> allows sophisticated concatenation of prompts
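
The combination of <data> and <foreach> can be mimicked outside a voice browser. The Python sketch below is only an analogy: it parses an XML payload through DOM calls (as <data> exposes it) and iterates over the resulting array to concatenate prompts (as <foreach> does). The payload, its element names and the function names are all hypothetical.

```python
from xml.dom.minidom import parseString

# Hypothetical XML returned by the business-logic server.
FETCHED = """<flights>
  <flight><number>AZ 610</number><time>10:30</time></flight>
  <flight><number>LH 1901</number><time>14:05</time></flight>
</flights>"""


def text_of(node, tag):
    """Return the text content of the first child element named tag."""
    return node.getElementsByTagName(tag)[0].firstChild.data


def build_prompts(xml_payload):
    """Turn the fetched XML into one prompt per flight."""
    dom = parseString(xml_payload)  # the role of <data name="...">
    flights = dom.documentElement.getElementsByTagName("flight")
    prompts = []
    for flight in flights:          # the role of <foreach array="..." item="...">
        number = text_of(flight, "number")
        time = text_of(flight, "time")
        prompts.append(f"Flight {number} leaves at {time}.")
    return prompts


PROMPTS = build_prompts(FETCHED)
```

The VoiceXML page stays static; only the fetched data changes from call to call.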


VoiceXML – Final Remarks

The changed landscape for speech application development:
Virtually all the IVRs today support VoiceXML
New options related to VoiceXML:
SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
Large hosting of speech applications (TellMe, Voxeo)
Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
Further changes may come from the CCXML adoption

… but:
Mainly system driven applications are actually deployed
New challenges to incorporate more powerful dialog strategies,
mixed-initiative are under discussion.


VoiceXML Resources

Voice Browser Working Group (spec, FAQ, implementations, resources):
http://www.w3.org/Voice/

VoiceXML Forum site (resources, education, interest groups):
http://www.voicexml.org/
VoiceXML Forum Review:
http://www.voicexmlreview.org/
Interesting articles related to VoiceXML and more
Example code in the sections "First Words" and "Speak & Listen"

Ken Rehor’s World of VoiceXML
http://www.kenrehor.com/voicexml

Online documentation related to VoiceXML Platforms
Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
Many books on VoiceXML:
Jim Larson, "VoiceXML: Introduction to Developing Speech Applications", Prentice-Hall,
2002.
A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002


Call Control:
CCXML 1.0


CCXML 1.0 – Highlights

Asynchronous event processing

Acceptance or refusal of an incoming call

Management of different types of call transfer

Outbound call activation (interaction with an external entity)

Use of ECMAScript adding scripting capabilities to call control
applications

VoiceXML modularization

Conferencing management


CCXML 1.0 – Elements Relationship


CCXML 1.0 – Incoming Call
The CCXML Interpreter receives connection.alerting as event$:
name:'connection.alerting';
connectionid:'0239023901903993';
eventid:'00001'; ....

CCXML document: event catching and processing
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0">

[…]

<transition event="connection.alerting">
[…]
</transition>

<transition event="connection.disconnected">
[…]
</transition>
…..

http://www.w3.org/TR/ccxml


CCXML 1.0 – connection.alerting Event

Basic telephony information is retrieved on the alerting event and
is available in the CCXML document:
Local URI, remote URI, protocol used, redirection info, etc.

Based on this information, CCXML can accept or refuse the
incoming call, even before contacting the dialog server;

Any error that occurs during the phone call can be managed by the
CCXML service (connection.failed, error.connection events)

[Sequence diagram: the Call Control Adapter sends connection.alerting to the CCXML Interpreter, which analyzes the event$ content and answers with <accept/> or <reject/>; the VoiceXML Interpreter is not involved at this stage.]



CCXML 1.0 – How to activate a new dialog
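CCXML actions:
Receives the alerting event from the Call Control Adapter
Asks the dialog server to prepare a new dialog
Waits for the preparation
If the dialog has been successfully prepared, accepts the call
Asks the dialog server to start the prepared dialog

Message sequence:
Call Control Adapter → CCXML Interpreter: alerting
CCXML Interpreter → VoiceXML Interpreter: prepare a new dialog
VoiceXML Interpreter → CCXML Interpreter: dialog prepared
CCXML Interpreter → Call Control Adapter: call accepted
Call Control Adapter → CCXML Interpreter: connected
CCXML Interpreter → VoiceXML Interpreter: start the prepared dialog
VoiceXML Interpreter → CCXML Interpreter: dialog started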


Call transfer

CCXML supports call transfers of different types: "bridge", "blind",
"consultation";
Depending on the features of each type, the CCXML language allows the expected
interaction with the Call Control Adapter to correctly perform the transfer;
During the different phases of a transfer, CCXML can receive
any asynchronous event and correctly manage it, interrupting the call if
requested

[Sequence diagram: to perform a transfer, the CCXML Interpreter and the Call Control Adapter exchange a series of commands and answers (command1, answer1, […]) until the transfer is complete.]


External Events

The CCXML Interpreter Context can receive events from any external entity
able to use the HTTP protocol;
Events generated in this way must be sent to CCXML with an HTTP POST
request
The event is then delivered in one of two ways:
It can be addressed to a new session, whose creation must be requested
It can be addressed to an existing session, specifying its ID in the
request

[Sequence diagram: the External Entity sends a basic HTTP event to the CCXML Interpreter, which performs the event management and returns the event management result.]
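
From the external entity's side, injecting an event amounts to building a form-encoded POST. The Python sketch below only constructs the request and does not send it. It assumes that the platform's basic HTTP event processor reads fields named name and sessionid, which should be checked against the platform documentation; the endpoint URL, the event name and the session identifier are hypothetical.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_event_request(endpoint, event_name, session_id=None, payload=None):
    """Build (without sending) an HTTP POST that injects an event into CCXML.

    Field names "name" and "sessionid" are assumptions of this sketch.
    """
    fields = {"name": event_name}
    if session_id is not None:
        fields["sessionid"] = session_id  # target an existing session
    fields.update(payload or {})          # extra application data, if any
    body = urlencode(fields).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


# Hypothetical endpoint, event name and session identifier.
REQUEST = build_event_request(
    "http://ccxml.example.com/basichttp",
    "dialog.terminate.request",
    session_id="sess-42",
)
SENT_FIELDS = parse_qs(REQUEST.data.decode("ascii"))
```

Passing the request to urllib.request.urlopen would actually deliver it; leaving out session_id corresponds to the case where a new session has to be requested.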



External event on a new session:
the Outbound Call

A request arrives at Call Control from an external entity;
The CCXML service associated with the received event is started and
a set of operations between Call Control Adapter, Call Control and Dialog
Server is activated, so that the outbound call is placed

Message sequence:
External entity → CCXML Interpreter: outbound call request
CCXML Interpreter → Call Control Adapter: create a call
Call Control Adapter → CCXML Interpreter: connection progressing …
CCXML Interpreter → VoiceXML Interpreter: prepare a dialog
VoiceXML Interpreter → CCXML Interpreter: prepared
Call Control Adapter → CCXML Interpreter: connection connected
CCXML Interpreter → VoiceXML Interpreter: start the prepared dialog


External event on a session:
dialog termination request
An external entity performs an HTTP POST request towards the CCXML
Interpreter Context, specifying a sessionid and requesting the termination of a
particular dialog;
The CCXML Interpreter checks the session id and, if it is valid,
injects the received event into the session;
The CCXML service has a transition on that event and performs the dialog
termination on a particular dialog identifier;

[Sequence diagram: after the dialog termination request, the CCXML Interpreter issues dialogterminate(dialogid) to the VoiceXML Interpreter and receives dialog.exit; depending on how the service manages the dialog.exit event, it then issues disconnect(connId) towards the Call Control Adapter or a new dialogprepare.]


Loading different CCXML documents:
<fetch> and <goto> elements

<fetch> and <goto> are used, respectively, to asynchronously fetch the
content identified by the attributes of <fetch> and to move into the fetched
document once it has been successfully loaded;

Benefits:
- Modularization
- Source simplification
- More readability

<fetch
  next="'http://../Fetch/doc1.ccxml'"
  type="'application/ccxml+xml'"
  fetchid="result"/>

Sequence in the CCXML Interpreter:
fetch the document "doc1.ccxml"
fetch.done / error.fetch
goto into the new document /
continue to work on the same dialog
The first event raised in the new document is ccxml.loaded



Simple CCXML Document
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentState"/>
  <var name="myDialogId"/>
  <var name="myConnId"/>
  <eventprocessor statevariable="currentState">
    <transition event="connection.alerting">
      <assign name="myConnId" expr="event$.connectionid"/>
      <accept connectionid="event$.connectionid"/>
    </transition>
    <transition event="connection.connected">
      <dialogstart src="'http://www.example.com/flight.vxml'"
        connectionid="myConnId" dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.started">
      <log expr="'VoiceXML appl is running now'"/>
    </transition>
    <transition event="connection.disconnected">
      <dialogterminate dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.exit">
      <disconnect connectionid="myConnId"/>
    </transition>
    <transition event="*">
      <log expr="'Closing, unexpected:'+ event$.name"/>
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
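
The event-driven logic of this document can be followed with a small simulation. The Python sketch below mirrors the six transitions as branches and records which CCXML action each event would trigger. It is an illustration of the control flow only: the identifier "dialog-1" and the event dictionaries are hypothetical, and no telephony or dialog server is involved.

```python
def run_session(events):
    """Replay a list of events and return the actions the document would take."""
    state = {"conn": None, "dialog": None}
    actions = []
    for event in events:
        name = event["name"]
        if name == "connection.alerting":
            state["conn"] = event["connectionid"]        # <assign>
            actions.append(("accept", state["conn"]))    # <accept>
        elif name == "connection.connected":
            state["dialog"] = "dialog-1"                 # hypothetical dialog id
            actions.append(
                ("dialogstart", "http://www.example.com/flight.vxml", state["conn"])
            )
        elif name == "dialog.started":
            actions.append(("log", "VoiceXML appl is running now"))
        elif name == "connection.disconnected":
            actions.append(("dialogterminate", state["dialog"]))
        elif name == "dialog.exit":
            actions.append(("disconnect", state["conn"]))
        else:                                            # the "*" transition
            actions.append(("log", "Closing, unexpected:" + name))
            actions.append(("exit",))
            break
    return actions


NORMAL_CALL = run_session([
    {"name": "connection.alerting", "connectionid": "c1"},
    {"name": "connection.connected"},
    {"name": "dialog.started"},
    {"name": "connection.disconnected"},
    {"name": "dialog.exit"},
])
UNEXPECTED = run_session([{"name": "error.semantic"}])
```

The first trace follows a caller who hangs up during the dialog; the second shows the catch-all transition closing the session.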


CCXML 1.0 – Next Steps

CCXML specification is a Last Call Working Draft, all the feature
requests and clarifications have been addressed;

An Implementation Report test suite is under development;

It is very close to being published as a W3C Candidate Recommendation;

Internal or external companies will be invited to send implementation
reports on their CCXML platforms;

After that, CCXML 1.0 specification will be able to become Proposed
Recommendation and then W3C Recommendation.



Speech Interface Framework
Tour Complete!



VoxNauta – Internal Architecture


Loquendo MRCP Server/LSS 7.0 Architecture

[Architecture diagram. Elements shown:
Load Balancer
Protocol front ends: RTSP (MRCPv1), SIP (SDP) for MRCP v2, RTP for media
Parsers: RTSP Parser, SIP parser, SDP, MRCP v1 Parser, MRCP v2 Parser
MRCP v1/v2 Server
Management (SNMP) with Graphic Management Console
Configuration (Config files)
Logger (Log files)
Audio Provider API
OS interface (Win32/Linux)
NLSML / EMMA results
TTS & ASR interface, with TTS and ASR APIs towards LTTS, LASR and LASR-SV]

IETF MRCP Protocols

The Media Resource Control Protocol (MRCP) is specified by the IETF
MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on
RTSP/RTP
MRCPv2 is an Internet Draft,
http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP,
adding new audio recording and Speaker Verification
functionalities
Optimized client-server solution for the large-scale deployment of
speech technologies in the telephony field, such as call centers,
CRM, news and email-reading, self-service applications, etc.
Provides a standard interface to speech technologies across IVR platforms

For more information read:
Dave Burke, Speech Processing for IP Networks: Media
Resource Control Protocol (MRCP), Wiley


VoiceXML in a Call Center
[Figure: call center deployment. Labels: Fixed/Mobile Network, PBX, optional Voice Gateway for non-SIP PBX, VOXNAUTA IVR, ACD, Operators, Web Server, CTI Server, Data Server.]

VoiceXML in the IMS Architecture

[Figure: IMS deployment. Labels: Fixed/Mobile Network, TDM protocols, VOICE GATEWAY, SIP protocols, RTP, IP Network, VOXNAUTA MRF, VoiceXML on HTTPS, Application Server.]

Overview

A Bit of History

ASR/DTMF
TTS
Lexicons

MMI Architecture
EMMA and InkML

Next Future

Modes, Modalities and Technologies

Speech
Audio
Stylus
Touch
Accelerometer
Keyboard/keypad
Mouse/touchpad
Camera
Geolocation
Handwriting recognition
Speaker verification
Signature verification
Fingerprint identification
….


Complement and Supplement

Speech                   Visual
- Transient              - Persistent
- Linear                 - Spatial
- Hands and Eyes-Free    - Eyes
- Suffers Noise          - Suffers Light Conditions

Lets users choose among different modalities or mix
them
Adaptable to different social and environmental conditions or
to user preference


GUI VUI MUI
or
MMUI


MMI has an Intrinsic Complexity

[Figure: many inputs converge on the Interaction Manager: speech, text, mouse, handwriting, fingerprint, face identification, speaker verification, geolocation, accelerometer, vital signs, video, photograph, audio recording, drawing. Group labels: User intent, Identification, Sensor, Recording.]

Deborah Dahl, Voice Search 2009

MMI can Include Many Different Technologies

[Figure: the Interaction Manager surrounded by Touchscreen, Accelerometer, Speech recognition, Geolocation, Fingerprint recognition, Keypad and Handwriting recognition.]



Uniform Representation for MMI

Getting everything to work together is complicated.
One simplification is to represent the same information
from different modalities in the same format.
This requires a common language for representing the
same information from different modalities

EMMA (Extensible MultiModal Annotation) 1.0
A uniform representation for multimodal information


[Figure: Touchscreen, Accelerometer, Speech recognition, Geolocation, Fingerprint recognition, Keypad and Handwriting recognition, each connected to the Interaction Manager through EMMA.]



EMMA Structural Elements

EMMA Elements
Provide containers for application semantics and for multimodal
annotation:
emma:emma
emma:interpretation
emma:one-of
emma:group
emma:sequence
emma:lattice

<emma:emma …>
<emma:one-of>
<emma:interpretation>
…
</emma:interpretation>
<emma:interpretation>
…
</emma:interpretation>
</emma:one-of>
</emma:emma>

http://www.w3.org/TR/emma/

EMMA Annotations

Characteristics and processing of input, e.g.:

emma:tokens                              tokens of input
emma:process                             reference to processing
emma:no-input                            lack of input
emma:uninterpreted                       uninterpretable input
emma:lang                                human language of input
emma:signal                              reference to signal
emma:media-type                          media type
emma:confidence                          confidence scores
emma:source                              annotation of input source
emma:start, emma:end                     timestamps (absolute/relative)
emma:medium, emma:mode, emma:function    medium, mode, and function of input
emma:hook                                hook


EMMA 1.0 – Example Travel Application

INPUT:
"I want to go from Boston
to Denver on March 11"

http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009


EMMA 1.0 – Same meaning

Speech:
<emma:interpretation emma:medium="acoustic" emma:mode="voice"
id="int1">
<origin>Boston</origin>
<destination>Denver</destination>
<date>11032009</date>
</emma:interpretation>

Mouse:
<emma:interpretation emma:medium="tactile" emma:mode="gui"
id="int1">
<date>11032009</date>
</emma:interpretation>



EMMA 1.0 – Handwriting Input

<emma:interpretation emma:medium="tactile" emma:mode="ink"
id="int1">
<date>11032009</date>
</emma:interpretation>
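The point of these slides, that speech, mouse and ink input can carry the same application semantics and differ only in their EMMA annotations, can be checked with a short script. This is a minimal sketch using Python's standard `xml.etree.ElementTree`. The helper names `emma_doc` and `read_interpretation` are hypothetical, the fragments are wrapped in a full `emma:emma` element so they parse, and all three inputs are assumed to carry the complete origin/destination/date payload.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Application semantics shared by all three inputs (assumed complete here).
PAYLOAD = ("<origin>Boston</origin>"
           "<destination>Denver</destination>"
           "<date>11032009</date>")


def emma_doc(medium, mode, body=PAYLOAD):
    """Wrap an application payload in a minimal EMMA document."""
    return (
        f'<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">'
        f'<emma:interpretation id="int1" '
        f'emma:medium="{medium}" emma:mode="{mode}">'
        f'{body}'
        '</emma:interpretation></emma:emma>'
    )


def read_interpretation(xml_text):
    """Return (annotations, semantics) for the first emma:interpretation."""
    root = ET.fromstring(xml_text)
    interp = root.find(f"{{{EMMA_NS}}}interpretation")
    annotations = {
        "medium": interp.get(f"{{{EMMA_NS}}}medium"),
        "mode": interp.get(f"{{{EMMA_NS}}}mode"),
    }
    # Child elements hold the application-specific meaning.
    semantics = {child.tag: child.text for child in interp}
    return annotations, semantics


speech = read_interpretation(emma_doc("acoustic", "voice"))
mouse = read_interpretation(emma_doc("tactile", "gui"))
ink = read_interpretation(emma_doc("tactile", "ink"))

# Same meaning, different modality annotations.
same_meaning = speech[1] == mouse[1] == ink[1]
```

An interaction manager written this way can act on `semantics` without caring which modality produced it, and consult `annotations` only when the modality matters.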



EMMA 1.0 – Biometrics Input

Photograph:
<emma:emma version="1.0">
<emma:interpretation
id="int1"
emma:confidence=".75"
emma:medium="visual"
emma:mode="photograph"
emma:verbal="false"
emma:function="identification">
<person>12345</person>
<name>Mary Smith</name>
</emma:interpretation>
</emma:emma>

Voice:
<emma:emma version="1.0">
<emma:interpretation
id="int1"
emma:confidence=".80"
emma:medium="acoustic"
emma:mode="voice"
emma:verbal="false"
emma:function="identification">
<person>12345</person>
<name>Mary Smith</name>
</emma:interpretation>
</emma:emma>



EMMA 1.0 – Representing Lattices

Speech recognizers, handwriting recognizers and other input
processing components may provide lattice output:

A graph encoding a range of possible recognition results or
interpretations

[Figure: word lattice over nodes 1 to 8, with arcs labelled flights, to, boston / austin, from, portland / oakland, today, please, tomorrow.]

From Michael Johnston, AT&T Research

EMMA 1.0 – Representing Lattices
Lattices can be represented using EMMA elements:
<emma:lattice initial="?" final="?">
<emma:arc from="?" to="?">

<emma:emma version="1.0"
xmlns:emma="http://www.w3.org/2003/04/emma">
<emma:interpretation>
<emma:lattice initial="1" final="8">
<emma:arc from="1" to="2">flights</emma:arc>
<emma:arc from="2" to="3">to</emma:arc>
<emma:arc from="3" to="4">boston</emma:arc>
<emma:arc from="3" to="4">austin</emma:arc>
<emma:arc from="4" to="5">from</emma:arc>
<emma:arc from="5" to="6">portland</emma:arc>
<emma:arc from="5" to="6">oakland</emma:arc>
<emma:arc from="6" to="7">today</emma:arc>
<emma:arc from="7" to="8">please</emma:arc>
<emma:arc from="6" to="8">tomorrow</emma:arc>
</emma:lattice>
</emma:interpretation>
</emma:emma>
From Michael Johnston, AT&T Research
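To make the lattice concrete, the following sketch expands an EMMA lattice into the word sequences it encodes. It assumes an acyclic lattice and uses only Python's standard library. The helper names `attr` and `lattice_paths` are hypothetical, and as a robustness assumption the node attributes (`initial`, `final`, `from`, `to`) are accepted with or without the `emma:` prefix.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

EMMA_NS = "http://www.w3.org/2003/04/emma"

LATTICE_XML = f"""
<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="int1">
    <emma:lattice initial="1" final="8">
      <emma:arc from="1" to="2">flights</emma:arc>
      <emma:arc from="2" to="3">to</emma:arc>
      <emma:arc from="3" to="4">boston</emma:arc>
      <emma:arc from="3" to="4">austin</emma:arc>
      <emma:arc from="4" to="5">from</emma:arc>
      <emma:arc from="5" to="6">portland</emma:arc>
      <emma:arc from="5" to="6">oakland</emma:arc>
      <emma:arc from="6" to="7">today</emma:arc>
      <emma:arc from="7" to="8">please</emma:arc>
      <emma:arc from="6" to="8">tomorrow</emma:arc>
    </emma:lattice>
  </emma:interpretation>
</emma:emma>
"""


def attr(element, name):
    """Read an attribute whether or not it carries the emma: prefix."""
    value = element.get(name)
    return value if value is not None else element.get(f"{{{EMMA_NS}}}{name}")


def lattice_paths(xml_text):
    """List every word sequence from the initial node to a final node."""
    root = ET.fromstring(xml_text)
    lattice = root.find(f".//{{{EMMA_NS}}}lattice")
    initial = attr(lattice, "initial")
    finals = set(attr(lattice, "final").split())

    # Adjacency list: node -> [(next node, word), ...] in document order.
    arcs = defaultdict(list)
    for arc in lattice.findall(f"{{{EMMA_NS}}}arc"):
        arcs[attr(arc, "from")].append((attr(arc, "to"), arc.text.strip()))

    def walk(node, words):
        if node in finals:
            yield " ".join(words)
        for nxt, word in arcs.get(node, []):
            yield from walk(nxt, words + [word])

    return list(walk(initial, []))


paths = lattice_paths(LATTICE_XML)
```

With two city choices at each end and two ways to finish the sentence, this lattice compactly stands for eight alternative recognition results.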

EMMA in Multimodal Framework
http://www.w3.org/TR/mmi-framework

EMMA


InkML 1.0 – Digital Ink

Ink Markup Language (InkML), http://www.w3.org/TR/InkML
Data format for representing digital ink (pen, stylus, etc.)
Allows the input and processing of handwriting, gestures, sketches,
music, etc.
<ink>
<trace>
10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140,
13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135,
58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205
</trace>
<trace>
130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
150 208, 163 210, 178 208, 192 201, 205 192, 214 180
</trace>
<trace>
227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
230 148, 234 162, 235 176, 238 190, 241 204
</trace>
<trace>
282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129,
291 143, 294 157, 294 171, 294 185, 296 199, 300 213
</trace>
<trace>
366 130, 359 143, 354 157, 349 171, 352 185, 359 197,
371 204, 385 205, 398 202, 408 191, 413 177, 413 163,
405 150, 392 143, 378 141, 365 150
</trace>
</ink>

http://www.w3.org/TR/InkML/
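As an illustration of how simple traces like these can be consumed, the sketch below parses `<trace>` elements into lists of points and computes a bounding box. It is a minimal sketch under explicit assumptions: no XML namespace, the default X Y channel layout, and plain absolute values only (InkML's richer trace formats and difference encodings are not handled). The function names `parse_traces` and `bounding_box` are hypothetical, and the sample data is shortened from the slide.

```python
import xml.etree.ElementTree as ET

# Shortened sample: the first few points of two traces.
INK_XML = """<ink>
  <trace>10 0, 9 14, 8 28, 7 42</trace>
  <trace>130 155, 144 159, 158 160</trace>
</ink>"""


def parse_traces(xml_text):
    """Return a list of traces, each a list of (x, y) float pairs."""
    root = ET.fromstring(xml_text)
    traces = []
    for trace in root.iter("trace"):
        points = []
        # Samples are comma-separated; each sample is "X Y".
        for sample in trace.text.split(","):
            x, y = sample.split()
            points.append((float(x), float(y)))
        traces.append(points)
    return traces


def bounding_box(traces):
    """Return (min_x, min_y, max_x, max_y) over all points."""
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    return min(xs), min(ys), max(xs), max(ys)


traces = parse_traces(INK_XML)
box = bounding_box(traces)
```

A handwriting recognizer would typically start from point lists like these before producing a semantic result.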

InkML 1.0 – Status and Advances

Rich annotation for Ink:
Trace, Trace formats and Trace collections
Contextual information
Canvases
Etc.

The result of classifying InkML traces may be a semantic
representation in EMMA 1.0

Current status is Last Call Working Draft; next will be Candidate
Recommendation with the release of an Implementation Report test suite
Growing interest from major industries

http://www.w3.org/TR/InkML/

MMI Architecture Specification

“Multimodal Architecture and Interfaces”, W3C Working Draft,
http://www.w3.org/TR/mmi-arch/

Runtime Framework provides the basic infrastructure and
controls communication among the constituents.

Interaction Manager (IM) coordinates Modality Components (MCs)
by life-cycle events and contains the shared data (context).

Event-based communication between IM and MCs.

[Figure: the Runtime Framework hosts the Delivery Context Component, the Interaction Manager and the Data Component; Modality Components 1 to N attach through the Modality Component API.]

http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
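The division of labour described above, where an Interaction Manager drives Modality Components through life-cycle events and keeps the shared data, can be sketched in plain Python. This is an illustration only, not the architecture's API. The event names `StartRequest`, `StartResponse` and `DoneNotification` follow my reading of the MMI Architecture drafts and should be checked against the specification. The classes, fields and canned results are hypothetical, and events are passed by direct method calls rather than over a real transport.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str      # life-cycle event name (assumed, see lead-in)
    source: str
    target: str
    context: str   # identifies the interaction the event belongs to
    data: dict = field(default_factory=dict)


class ModalityComponent:
    def __init__(self, name, result):
        self.name = name
        self.result = result  # canned result standing in for real input

    def handle(self, event):
        """Answer a StartRequest with a StartResponse and a DoneNotification."""
        if event.name != "StartRequest":
            return []
        reply = dict(source=self.name, target=event.source,
                     context=event.context)
        return [Event("StartResponse", **reply),
                Event("DoneNotification", data=self.result, **reply)]


class InteractionManager:
    def __init__(self, components):
        self.components = components
        self.shared_data = {}  # the shared data (context) kept by the IM
        self.log = []          # event names in the order they were seen

    def run(self, context_id):
        for mc in self.components:
            start = Event("StartRequest", "IM", mc.name, context_id)
            self.log.append(start.name)
            for event in mc.handle(start):
                self.log.append(event.name)
                if event.name == "DoneNotification":
                    self.shared_data[event.source] = event.data
        return self.shared_data


im = InteractionManager([
    ModalityComponent("voice", {"destination": "Denver"}),
    ModalityComponent("gui", {"date": "11032009"}),
])
shared = im.run("ctx-1")
```

Keeping the modality components unaware of each other, with all coordination in the Interaction Manager, is what lets new modalities be added without changing existing ones.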


MMI Arch – Laboratory Implementation

Implementation of components using W3C markup languages.

[Figure: within the Runtime Framework (Delivery Context Component, Interaction Manager, Data Component) the Interaction Manager is implemented in SCXML; through the Modality Component API, Modality Component 1 uses HTML for the GUI and Modality Component N uses VoiceXML for the VUI.]



MMI Arch – Laboratory Implementation

SCXML-based Interaction Manager.
VoiceXML + HTML modality components.

[Figure: an SCXML interpreter with an HTTP I/O Processor runs on the server. The GUI modality component is an HTML Browser on the client, reached through the Modality Component API as HTTP + XML (using AJAX). The Voice modality component is a CCXML/VoiceXML Browser on the server with a telephony interface to a Phone client, reached through the Modality Component API as HTTP + XML (EMMA).]



MMI Architecture – Open Issues

Profiles

Start-up, Registration, Delegation
in a distributed environment

Transport of Events

Extensibility of Events

http://www.w3.org/TR/mmi-arch/


Emotion in Wikipedia

From Wikipedia definition:

“An emotion is a mental and physiological state associated with a
wide variety of feelings, thoughts, and behaviours. It is a prime
determinant of the sense of subjective well-being and appears to play
a central role in many human activities. As a result of this generality,
the subject has been explored in many, if not all of the human
sciences and art forms. There is much controversy concerning how
emotions are defined and classified.”

General goal: Make interaction between humans and machines more
natural for the humans

Machines should become able:
• to register human emotions (and related states)
• to convey emotions (and related states)
• to “understand” the emotional relevance of events

