Introduction to VoiceXml and Voice Web Architecture
- 2. Session Overview
• Voice Web Architecture
– Components of a Voice Web Application
• Voice Standards
– W3C Speech Interface Framework
• VoiceXML
– Language features
– Execution model - Form Interpretation Algorithm (FIA)
• Application Design Techniques
– Static vs. dynamic VoiceXML
– Performance Considerations
• CCXML, VoiceXML and VoIP
• Application Deployment Models
• New Technologies
– Speaker Biometrics, Video, Multimodal, VoiceXML 3.0
© 2007 Ken Rehor. All Rights Reserved. 2
- 3. Simplifying Voice Services programming
• Web-based architecture for interactive speech services
– Exploit web technologies to simplify voice service creation and deployment
– Enable consolidation of voice and web services
– Separate service logic from user interaction
• High-level programming languages
– Control speech and telephony resources in uniform manner
– Shield application programmers from implementation details
• No need to know ASR, TTS, telephony APIs
– Create portable applications
• Run on enterprise system or in telephone network
• Run on a variety of platforms, ASR agnostic
- 5. Key Ideas
• Standard/Common high-level language
– Designed for the task
• Leverage open, known technology
– Web protocols, servers, networks, development tools, expertise
• Phone number mapped to URL
– Phone number associated with URL of voice service
- 6. Voice / Web Application Architecture
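[Diagram: any phone connects over the PSTN or VoIP to a VoiceXML browser, and a web browser connects directly; both fetch content over HTTP across the Internet or an intranet from the same application (web) server. The VoiceXML browser fetches <vxml> documents, <grxml> grammars, .wav audio files, and scripts; the web browser fetches <html> pages, images, audio files, and scripts. The application server provides application logic, content and data, transaction processing, and a database interface.]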
- 7. Voice Application Architecture and Components
[Diagram: a caller reaches the VoiceXML platform over the PSTN and hears a prompt such as "Welcome to Acme products customer service, please…". The platform fetches <vxml>, <grxml>, and .wav files over HTTP from a web server across the Internet or an intranet. Platform components: VoiceXML interpreter, OA&M, middleware, telephony, DTMF, audio, ASR, TTS.]
- 8. Application Backend Architecture
[Diagram: the application (web) server delivers <vxml> documents, grammars, audio files, and scripts over HTTP across the Internet or an intranet. It hosts application logic, content and data, transaction processing, and a database interface, and connects across an intranet or the Internet to a transaction server, a content database, and a web service.]
- 9. Components of a Voice Solution
• Traditional phone, VoIP phone, mobile phone, or multimodal device
• Telephone network
– Circuit-switched PSTN or packet-switched VoIP
– Connects caller’s telephone with Telephony Server
• Voice User Interface
– Dialog structure / flow
– Prompts – what the application says to the user
– Speech grammars – what the user can say
• Application logic that executes on an application server
– Web "back-end"
– Database, or database interface
• VoiceXML Server that executes dialogs
– Controls resources such as ASR, SIV, TTS, etc
• Data network to connect application server and VoiceXML server
- 10. Inbound or Outbound calls
• VoiceXML application works the same for inbound and
outbound calls
– Additional call progress detection generally required for outbound
• Simple protocol for initiating outbound calls
– No firm standards, but most vendors follow similar techniques
– HTTP, Web Services, etc.
- 12. Value of Open Standards
• Non-proprietary interfaces between components
• Allow choice of best components for the task
• User interface languages
– W3C Speech Interface Framework: VoiceXML, SRGS, SSML, SI
– W3C: HTML, XHTML, SMIL, X+V
– OMA: WAP
• Communication protocols
– W3C: CCXML for 3rd-party telephony call control
– W3C: HTTP, HTTPS, SOAP, WSDL
– IETF: SIP, MRCP, MSCP
– 3GPP: IMS
– ITU: T1, ISDN
- 13. Visual vs. Voice markup
Web app UI
• HTML
– Structure
– Layout
– Input declaration
– Transitions
• Images
• Audio files / streams
• Video
• Text
• Scripts
Voice Web app UI
• VoiceXML
– Structure
– Dialog flow
– Input declaration
– Transitions
• Audio files
• Video, Images
• Text (for TTS)
• Scripts
- 14. Protocols
Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• …
Voice Web applications
• HTTP, HTTPS
• RTP
• SOAP
• WSDL
• SIP
• …
- 15. Voice Standards Activities
• Speech Interface Framework
• Network protocols
– SIP, MRCP v2, etc.
• Platform Certification, Developer Certification,
Speaker Biometrics, Architecture, Tools
- 16. Voice Application Standards
[Diagram: standards used across a voice application deployment; some were still in progress at the time.]
• Caller and network: VoIP phone, gateway; T1 / E1, ISDN, SS7; SIP; RTP
• Call control application: CCXML and scripts, fetched over HTTP / HTTPS by the CCXML browser
• VoiceXML application: VXML, GRXML, SSML, audio, and scripts, fetched over HTTP / HTTPS by the VoiceXML browser; SOAP
• VoiceXML browser: VoiceXML 2.0, VoiceXML 2.1, ECMAScript 262; DTMF (RFC 2833); audio (G.711, WAV, .au, mp3, etc.)
• Speech resources: MRCP client to TTS (SSML), ASR (GRXML), and SIV servers; MRCP v1, MRCP v2
• Conference / media server (media mixer) behind the Media Control Interface: SIP Netann, MSCML, MOML / MSML, MSCP, DMSP, MGCP, etc.
• Telephony Control Interface: SIP, etc.
• Dialog Control Interface: SIP, MSCP, etc.
- 18. Voice Application Components
• Dialog – flow control of the inputs, outputs, next steps
• Input grammars
– Control input constraints for DTMF and speech recognition
• Output formatting
– Pronunciation, timing, sequencing
- 19. W3C Speech Interface Framework
• VoiceXML
• SRGS
• SSML
• Semantic Interpretation
• Pronunciation Lexicon
• Call Control
For more information, see:
W3C Voice Browser Working Group http://www.w3.org/Voice/
- 20. Voice User Interface - Dialog
• W3C VoiceXML 2.0
– W3C Recommendation March 2004
– Widely implemented
• Approximately 4 dozen platforms
• Many service providers worldwide
– VoiceXML Forum certification program
• Nearly two dozen certified platforms, more coming
• W3C VoiceXML 2.1
– Candidate Recommendation Sept 2006
– Test suite under development; Certification Program to follow
– Many platform vendors are implementing
• W3C VoiceXML 3.0
– Early stages of development
– SCXML – state chart markup language designed as a controller for V3 and
CCXML 2.0 ("Working Draft" Jan 2006)
- 21. User Interaction – Input / Output Control
• Input grammars W3C SRGS 1.0
– W3C Recommendation
– Widely implemented
• Output formatting W3C SSML 1.0
– W3C Recommendation
– Widely implemented, but real support is limited
(most TTS engines ignore the SSML instructions)
• Semantic Interpretation for Speech Recognition W3C SISR 1.0
– Nearing Candidate Recommendation
– Implementation gaining acceptance
- 22. W3C Speech Interface Framework
Semantic Interpretation
- 23. W3C Speech Recognition Grammar Specification
• Markup language to control input constraints
– Finite-state speech recognition
– DTMF recognition
• Two variations
– XML (GRXML)
– ABNF
• Version 1.0: W3C Recommendation – March 2004
• Implemented and supported by numerous vendors
- 24. GRXML ASR example
<grammar type="application/srgs+xml" root="r2" version="1.0">
<rule id="r2" scope="public">
<one-of>
<item>coffee</item>
<item>tea</item>
<item>milk</item>
<item>nothing</item>
</one-of>
</rule>
</grammar>
- 25. GRXML DTMF example
<?xml version="1.0"?>
<grammar mode="dtmf" version="1.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/06/grammar
http://www.w3.org/TR/speech-grammar/grammar.xsd"
xmlns="http://www.w3.org/2001/06/grammar">
<rule id="digit">
<one-of>
<item> 0 </item>
<item> 1 </item>
<item> 2 </item>
<item> 3 </item>
<item> 4 </item>
<item> 5 </item>
<item> 6 </item>
<item> 7 </item>
<item> 8 </item>
<item> 9 </item>
</one-of>
</rule>
<rule id="pin" scope="public">
<one-of>
<item>
<item repeat="4"><ruleref uri="#digit"/></item>
#
</item>
</one-of>
</rule>
</grammar>
- 26. W3C Speech Synthesis Markup Language
• Markup language to control spoken and audio output
• Version 1.0: W3C Recommendation – Sept 2004
• Implemented and supported by numerous vendors
• Version 1.1: under development
– Adds support for tonal languages
– First public Working Draft published January 2007
- 27. SSML Functions
• Audio output
– <audio>
• Text-to-Speech output
– Contained within SSML constructs
• Pronunciation controls
– <say-as>
• Interpret-as
• Format
• Detail
– <emphasis>
• Timing
– <break>
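The functions listed above could be combined roughly as follows. This is a minimal sketch, not taken from the deck: the file name and wording are made up, and the `interpret-as` / `format` values are assumptions, since SSML 1.0 leaves the set of supported values to the platform.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Audio output: the enclosed text is spoken only if the file cannot be played -->
  <audio src="welcome.wav">Welcome to Acme.</audio>
  <!-- Timing: pause for half a second -->
  <break time="500ms"/>
  Your order will ship on
  <!-- Pronunciation control: hint that the text is a date (values are platform-dependent) -->
  <say-as interpret-as="date" format="mdy">3/15/2007</say-as>.
  <!-- Pronunciation control: stress one word -->
  <emphasis level="strong">Please</emphasis> keep your confirmation number.
</speak>
```

Plain text outside any element is simply handed to the TTS engine, which is what "contained within SSML constructs" amounts to in practice.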
- 28. SSML Functions (cont’d)
• Spoken language
– xml:lang
• Voice selection
– <voice>
• Gender
• Age
• Name
• Prosody
– <prosody>
• Pitch
• Contour
• Range
• Rate
• Duration
• Volume
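A short sketch of voice selection and prosody control, under the assumption that the engine honours these hints (many do so only partially). The voice name "Riley" and the prompt text are hypothetical.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <!-- Ask for a voice by characteristics rather than by name -->
  <voice gender="female" age="30">
    Thank you for calling.
    <!-- Slow down and raise the volume for an important sentence -->
    <prosody rate="slow" volume="loud">
      Please listen carefully, as our menu options have changed.
    </prosody>
  </voice>
  <!-- Ask for a specific, platform-defined voice (hypothetical name) -->
  <voice name="Riley">
    <prosody pitch="high" range="high">Have a great day!</prosody>
  </voice>
  <!-- Switch the spoken language for one passage -->
  <voice xml:lang="fr-FR">Au revoir.</voice>
</speak>
```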
- 29. SSML Functions (cont’d)
• Sentence structure
– <p>
– <s>
• Modify text
– <phoneme> - specify pronunciation
– <sub> - substitute text
• Location identification
– <mark>
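The structural and text-modification elements might be used like this. It is an illustrative sketch only: the sentences, the IPA transcription, and the mark name are assumptions.

```xml
<?xml version="1.0"?>
<speak version="1.0" xml:lang="en-US"
       xmlns="http://www.w3.org/2001/10/synthesis">
  <p>
    <!-- <sub>: speak the alias instead of the written text -->
    <s>Welcome to the <sub alias="World Wide Web Consortium">W3C</sub> information line.</s>
    <!-- <phoneme>: supply an explicit pronunciation (IPA, written as character references) -->
    <s>Today's word is <phoneme alphabet="ipa" ph="t&#x259;&#x2C8;me&#x26A;to&#x28A;">tomato</phoneme>.</s>
  </p>
  <!-- <mark>: named location the platform can report once playback reaches it -->
  <mark name="intro_done"/>
  <p>
    <s>Please choose from the following options.</s>
  </p>
</speak>
```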
- 31. VoiceXML Scope
• Human-machine interaction provided by voice response
systems:
– Output
• play audio files
• produce synthesized speech
– Input
• record spoken input
• recognize spoken input
• collect character input
– Control flow
– Telephony
• transfer a user to another destination, such as a live agent
• disconnect a user
- 32. VoiceXML Goals
• Separate user interaction from service logic
– Creates new possible business models
• Service developer can be separate from telephony platform provider
• Enable service portability across implementation platforms
– Assume common set of platform capabilities
– Provide common language for:
• Content providers, Tool providers, Platform providers
• Safely handle shared network-based applications
– deterministic behavior
• Easy to build common types of applications
• Features to build complex types of applications
• Shield application authors from low-level platform-specific
details
– Promotes portability, ease of service creation
- 33. VoiceXML 2.0 Basic Functions
• Input
– <field>, <menu> recognition
– <record> audio recording
• Output
– <prompt> container for TTS or prerecorded audio
– <audio> prerecorded audio
• Control Flow
– <if>, <else>, <elseif> basic conditional logic
– <script> complex scripts using ECMAScript
– <goto> transition to a new document
– <submit> submit data to a web application
• Telephony
– <disconnect>
– <transfer>
- 34. VoiceXML Execution Model
• Form Interpretation Algorithm <form>
• Execution is synchronous (mostly)
– Disconnect events are handled (somewhat) asynchronously
• Audio is queued
– Played only when encountering a waiting state
• Processing is always in one of two states:
– Waiting for input in an input item
• such as <field>, <record>, or <transfer>
– Transitioning between input items in response to an input
• Event-driven
– <catch>, <throw> generalized event mechanism
– <nomatch>, <noinput> short-hand user-input event handling
– <error> short-hand error event handling
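The execution model above can be illustrated with a small two-field form. This is a sketch under stated assumptions: the grammar files, the `order.jsp` target, and the `com.example.giveup` event name are all hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Generalized handler for an application-defined event -->
  <catch event="com.example.giveup">
    <prompt>Sorry we could not help you. Goodbye.</prompt>
    <exit/>
  </catch>
  <form id="order">
    <!-- The FIA visits the first form item whose variable is still undefined -->
    <field name="drink">
      <!-- Prompts are queued here and played when the field starts waiting for input -->
      <prompt>Would you like coffee, tea, or milk?</prompt>
      <grammar src="drink.grxml" type="application/srgs+xml"/>
      <noinput>I did not hear you. <reprompt/></noinput>
      <nomatch>I did not understand. <reprompt/></nomatch>
      <!-- The handler with the highest matching count wins on the third nomatch -->
      <nomatch count="3"><throw event="com.example.giveup"/></nomatch>
    </field>
    <field name="size">
      <prompt>Small, medium, or large?</prompt>
      <grammar src="size.grxml" type="application/srgs+xml"/>
    </field>
    <!-- Selected only after both fields above have been filled -->
    <block>
      <submit next="order.jsp" namelist="drink size"/>
    </block>
  </form>
</vxml>
```

Each pass through the form is one "transition, then wait" cycle: the interpreter picks the next unfilled item, queues its prompts, and waits until input or an event arrives.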
- 35. Key Points
• Architecture leverages all things "internet"
– Languages, protocols, servers, developers, etc.
• Separation of concerns
– Application logic / database vs. telephony / speech resources
– Enables new business models
• Voice ASP
• Prepackaged applications
• URL (application) associated with phone number
– Calling party or Called party
– Share resources among many applications (VoiceASP)
• High-level languages, specific to domain / task
– Simplify development and maintenance
- 36. VoiceXML <form> and <field>
• <form>
– Dialog container
– "Form Interpretation Algorithm" (FIA) specifies default behavior
• <field>
– Collect input from caller
– <grammar> specifies input 'constraints'
• <prompt>
– Container for <audio> and text
- 37. Example
<?xml version="1.0"?>
<vxml version="2.0">
<form>
<field name="main_menu">
<prompt>
<audio src="welcome.wav"> Welcome to Acme.
You can choose sales, repair, or order status.</audio>
</prompt>
<grammar src="main_menu.grxml"/>
</field>
<block>
<submit next="http://acme.com/route... " method="get"/>
</block>
</form>
</vxml>
main.vxml
Note: Code simplified for demonstration purposes…
- 38. User Input - Grammars
• Grammars can be speech or DTMF (touchtone)
– Both types can be active simultaneously
• Specified by SRGS
– XML grammars are normative (aka GRXML)
– ABNF grammars are more concise but more complex to author
• Grammars may be specified inline or sourced externally
• External grammars are referenced by URI
• Multiple grammars may be active simultaneously.
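As a rough illustration of these points, the field below activates an inline speech grammar, an inline DTMF grammar, and an external grammar at the same time. The form, rule names, and `synonyms.grxml` are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml" xml:lang="en-US">
  <form id="confirm">
    <field name="answer">
      <prompt>Say yes or no, or press 1 for yes and 2 for no.</prompt>
      <!-- Inline speech grammar -->
      <grammar mode="voice" root="yesno" version="1.0" type="application/srgs+xml">
        <rule id="yesno" scope="public">
          <one-of>
            <item>yes</item>
            <item>no</item>
          </one-of>
        </rule>
      </grammar>
      <!-- Inline DTMF grammar, active at the same time as the speech grammar -->
      <grammar mode="dtmf" root="keys" version="1.0" type="application/srgs+xml">
        <rule id="keys" scope="public">
          <one-of>
            <item>1</item>
            <item>2</item>
          </one-of>
        </rule>
      </grammar>
      <!-- External grammar referenced by URI (hypothetical file) -->
      <grammar src="synonyms.grxml" type="application/srgs+xml"/>
    </field>
  </form>
</vxml>
```

Without semantic interpretation tags the field receives the raw words or digits, so the application would still have to map "1" and "yes" to the same meaning.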
- 39. Grammars can get very complicated:
There are many ways to say the same thing…
Sales
I'd like to place an order
I need to talk to a salesman
Repair
repair department
service
service department
customer service
Order status
where's my order?
track my order
track my shipment
where the hell is my stuff?
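One way to keep such variety manageable, sketched here as an assumption rather than as the deck's own grammar, is to let several phrasings return a single value through Semantic Interpretation (SISR) tags. The rule name and returned values are made up, and only some of the phrases are included.

```xml
<?xml version="1.0"?>
<grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en-US"
         version="1.0" mode="voice" root="dept" tag-format="semantics/1.0">
  <rule id="dept" scope="public">
    <one-of>
      <!-- Every sales phrasing yields the value "sales" -->
      <item>
        <one-of>
          <item>sales</item>
          <item>I'd like to place an order</item>
          <item>I need to talk to a salesman</item>
        </one-of>
        <tag>out = "sales";</tag>
      </item>
      <!-- Every repair phrasing yields the value "repair" -->
      <item>
        <one-of>
          <item>repair</item>
          <item>repair department</item>
          <item>service department</item>
          <item>customer service</item>
        </one-of>
        <tag>out = "repair";</tag>
      </item>
      <!-- Every order-status phrasing yields the value "status" -->
      <item>
        <one-of>
          <item>order status</item>
          <item>where's my order</item>
          <item>track my order</item>
          <item>track my shipment</item>
        </one-of>
        <tag>out = "status";</tag>
      </item>
    </one-of>
  </rule>
</grammar>
```

The dialog then branches on three values instead of a dozen utterances, and new phrasings can be added to the grammar without touching the VoiceXML.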
- 40. Basic GRXML grammar example
<grammar …xml:lang="en-US" version="1.0">
<rule id="dept" scope="public">
<one-of>
<item>sales</item>
<item>repair</item>
<item>order status</item>
</one-of>
</rule>
</grammar>
main_menu.grxml
- 41. VoiceXML example – next step
<form>
<field name="sales_menu">
<prompt>
<audio src="sales_menu.wav">
You've reached Acme's sales department.
To place an order, say sales. To speak to
an associate, say I'd like to speak to someone.
</audio>
</prompt>
<grammar src="sales_menu.grxml"/>
</field>
<block>
<submit next="http://acme.com/... " method="get"/>
</block>
</form>
sales.vxml
- 42. VoiceXML example with error handling
<form>
<field name="main_menu">
<prompt>
<audio src="welcome.wav"> Welcome to Acme.
You can choose sales, repair, or order status.</audio>
</prompt>
<grammar src="main_menu.grxml"/>
</field>
<noinput> You must say something. </noinput>
<block>
<submit next="http://acme.com/route... " method="get"/>
</block>
</form>
newmain.vxml
- 43. VoiceXML example with error handling
<form>
<field name="main_menu">
<prompt>
<audio src="welcome.wav"> Welcome to Acme.
You can choose sales, repair, or order status.</audio>
</prompt>
<grammar src="main_menu.grxml"/>
</field>
<noinput> You must say something. </noinput>
<nomatch> I didn't understand you. Please try again. </nomatch>
<block>
<submit next="http://acme.com/route... " method="get"/>
</block>
</form>
newmain.vxml
- 44. VoiceXML example with error handling
<form>
<field name="main_menu">
<prompt>
<audio src="welcome.wav"> Welcome to Acme.
You can choose sales, repair, or order status.</audio>
</prompt>
<grammar src="main_menu.grxml"/>
</field>
<help> You can say sales, repair, or order status. </help>
<noinput> You must say something. </noinput>
<nomatch> I didn't understand you. Please try again. </nomatch>
<block>
<submit next="http://acme.com/route... " method="get"/>
</block>
</form>
newmain.vxml
- 45. Basic VoiceXML menu using <option>
<field name="maincourse">
<prompt>
Please select an entree. Today, we are featuring <enumerate/>
</prompt>
<option dtmf="1" value="fish"> swordfish </option>
<option dtmf="2" value="beef"> roast beef </option>
<option dtmf="3" value="chicken"> frog legs </option>
<filled>
<submit next="/cgi-bin/maincourse.cgi"
method="post" namelist="maincourse"/>
</filled>
</field>
maincourse.vxml
- 46. Set platform features via <property>
• Input modes: type of input from a caller
DTMF-only <property name="inputmodes" value="dtmf"/>
Voice-only <property name="inputmodes" value="voice"/>
Both <property name="inputmodes" value="dtmf voice"/>
• Timeouts
<property name="timeout" value="1450ms"/>
<property name="termtimeout" value="2500ms"/>
...
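Properties can be set at several levels, with the innermost setting taking precedence. The sketch below assumes a simple PIN-entry dialog; the form name, prompt, and timeout values are illustrative only.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document level: defaults for every dialog in this document -->
  <property name="inputmodes" value="dtmf voice"/>
  <property name="timeout" value="5s"/>
  <form id="pin_entry">
    <!-- Form level: this form accepts touchtone input only -->
    <property name="inputmodes" value="dtmf"/>
    <field name="pin" type="digits?length=4">
      <!-- Field level: allow up to 3 seconds between key presses, end entry with # -->
      <property name="interdigittimeout" value="3s"/>
      <property name="termchar" value="#"/>
      <prompt>Please enter your four digit PIN, followed by the pound key.</prompt>
    </field>
  </form>
</vxml>
```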
- 47. Call processing: <transfer>
• Blind
– Go somewhere but don't return
• Bridge
– Add on another party, resume
execution when done talking
- 48. Call processing: <transfer>
• Blind transfer
<form id="xfer">
<block>
<prompt> Calling Riley. Please wait. </prompt>
</block>
<transfer name="mycall" dest="tel:+1-555-123-4567" >
</transfer>
</form>
- 49. Call processing: <transfer>
• Bridge transfer
<form id="xfer">
<block> <prompt> Calling Riley. Please wait. </prompt> </block>
<transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true" >
</transfer>
</form>
- 50. Call processing: <transfer>
• Bridge transfer with cancel feature
<form id="xfer">
<block> <prompt> Calling Riley. Please wait. </prompt> </block>
<transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true" >
<prompt> Say cancel at any time to disconnect this call.</prompt>
<grammar src="cancel.grxml" type="application/srgs+xml"/>
</transfer>
</form>
- 51. Call processing: <transfer>
<form id="xfer">
<var name="mydur" expr="0"/>
<block> <prompt> Calling Riley. Please wait. </prompt> </block>
<transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true" >
<prompt> Say cancel at any time to disconnect this call.</prompt>
<grammar src="cancel.grxml" type="application/srgs+xml"/>
<filled>
<assign name="mydur" expr="mycall$.duration"/>
<if cond="mycall == 'busy'">
<prompt> Riley's line is busy. Try again later. </prompt>
<elseif cond="mycall == 'noanswer'"/>
<prompt> Riley didn't answer the phone.
Please call back another time. </prompt>
</if>
</filled>
</transfer>
</form>
- 52. Call processing: <transfer>
<form id="xfer">
<var name="mydur" expr="0"/>
<block> <prompt> Calling Riley. Please wait. </prompt> </block>
<transfer name="mycall" dest="tel:+1-555-123-4567" bridge="true"
transferaudio="music.wav" connecttimeout="60s" >
<prompt> Say cancel at any time to disconnect this call.</prompt>
<grammar src="cancel.grxml" type="application/srgs+xml"/>
<filled>
<assign name="mydur" expr="mycall$.duration"/>
<if cond="mycall == 'busy'">
<prompt> Riley's line is busy. Try back later. </prompt>
<elseif cond="mycall == 'noanswer'"/>
<prompt> Riley didn't answer the phone. Please call
back another time. </prompt>
</if>
</filled>
</transfer>
</form>
- 54. New Features in VoiceXML 2.1
• Dynamically referencing grammars and scripts
– <grammar srcexpr="…"> <script srcexpr="…">
• Detect Barge-in During Prompt Playback: enhance SSML 1.0 <mark>
– Add nameexpr attribute
– Add markname and marktime to application.lastresult$ object
• Fetch (XML) data without transition: <data>
– Uses read-only subset of DOM
• Dynamically concatenate prompts: <foreach>
– Iterate through ECMAScript array and execute content
• Record user’s utterance while attempting ASR
– recordutterance property
– Add shadow variables: recording, recordingsize, recordingduration
• Send data upon disconnect
– <disconnect namelist="…" >
• Additional <transfer> types
– <transfer type="…" …/>
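Several of these features could appear together as in the sketch below. The array contents, grammar path, and `order.jsp` target are assumptions, and platform support for individual 2.1 features varied at the time.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <script><![CDATA[
    // Data that a real application might receive at runtime
    var specials = ["swordfish", "roast beef", "frog legs"];
    var grammarName = "maincourse";
  ]]></script>
  <form id="menu">
    <!-- Keep a recording of what the caller said while recognition runs -->
    <property name="recordutterance" value="true"/>
    <field name="choice">
      <!-- Grammar URI computed from an ECMAScript expression -->
      <grammar srcexpr="'grammars/' + grammarName + '.grxml'"
               type="application/srgs+xml"/>
      <prompt>
        Today we are featuring
        <!-- Build the prompt by iterating over the array -->
        <foreach item="dish" array="specials">
          <value expr="dish"/>
          <break time="300ms"/>
        </foreach>
      </prompt>
      <filled>
        <!-- The utterance recording is exposed through a shadow variable -->
        <var name="utterance" expr="choice$.recording"/>
        <submit next="order.jsp" method="post"
                enctype="multipart/form-data" namelist="choice utterance"/>
      </filled>
    </field>
  </form>
</vxml>
```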
- 56. VoiceXML Application Structure
• Static
– User experience is the same for everyone
• Information doesn’t change frequently
• No customization per user, time of day, etc.
• Pages are created once and used many times
• Dynamic
– User experience is customized by:
• User: e.g. my.yahoo.com, amazon.com (especially once you log in)
• Situation: e.g. travel specials on expedia.com
– Data driven, e.g. inventory system, airline reservations
– Generated by a program at runtime
• JSP, ASP
• App servers such as BEA, IBM Websphere, Oracle 9iAS
- 57. VoiceXML 2.1 and AJAX
• VoiceXML + ECMAScript + <data> + XML
• <data> element allows retrieval of arbitrary XML data
without document transition
• Static VoiceXML document can fetch user-specific data at
runtime
• Decouple presentation layer from business logic
• Performance improvements due to:
– Cache-able VoiceXML
– No need to generate entirely new pages for each dialog when only the
content is new
– Less network traffic
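A minimal sketch of the pattern: a static, cacheable document that pulls caller-specific XML at runtime. The URL, the `orderid` value, and the shape of the returned XML (a `state` element somewhere in the reply) are assumptions for illustration.

```xml
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
  <form id="status">
    <!-- In a real application this would come from an earlier dialog -->
    <var name="orderid" expr="'12345'"/>
    <block>
      <!-- Fetch XML without leaving this document; orderid is sent as a query parameter -->
      <data name="order" src="http://example.com/orderstatus" namelist="orderid"/>
      <!-- Read the first state element through the read-only DOM subset -->
      <var name="state"
           expr="order.documentElement.getElementsByTagName('state').item(0).firstChild.data"/>
      <prompt>Your order is <value expr="state"/>.</prompt>
    </block>
  </form>
</vxml>
```

Because the VoiceXML page itself never changes, the platform can cache it, and only the small XML reply crosses the network on each call.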
- 58. Dynamic Application Considerations
Execution of VoiceXML is running a program on your server…
• Must guarantee quality of dynamically-generated VoiceXML
documents and ASR grammars
– Catch parse errors, execution errors
– What does the caller hear if there is an error?
• not “Could not parse VoiceXML document”
• Runtime performance
– Parse and interpretation time of large documents
– Inefficient scripts and speech grammars
• Security implications
– Exploit a bug in a particular implementation? Make free phone calls?
– Could there be a VoiceXML virus? Will all platforms protect against them?
Careful application design, testing and monitoring are essential
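To control what the caller hears when something goes wrong, a document can declare its own error handlers. The sketch below is one possible approach; the prompt wording and the `next.jsp` page are assumptions.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- A fetch or parse failure of the next document surfaces here as error.badfetch -->
  <catch event="error.badfetch">
    <log>Fetch failed: <value expr="_message"/></log>
    <prompt>We are having trouble reaching our systems. Please call back shortly.</prompt>
    <exit/>
  </catch>
  <!-- Catch-all for any other error event; listed second so the specific handler wins -->
  <catch event="error">
    <log>Unhandled <value expr="_event"/>: <value expr="_message"/></log>
    <prompt>Sorry, something went wrong. Goodbye.</prompt>
    <exit/>
  </catch>
  <form id="start">
    <block>
      <!-- Dynamically generated page; if it is malformed, the handler above runs -->
      <goto next="next.jsp"/>
    </block>
  </form>
</vxml>
```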
- 59. Dynamic Application Considerations
• A mix of different simultaneous applications means variable
platform load and execution profile
– Parse time of VoiceXML document
– Fetching VoiceXML documents, grammars, audio from remote web servers
– Load Balancing
– How to protect platform from harmful application? (intentional or otherwise?)
• Max size of document
• Max size of grammar
• Complexity measurement of document or grammar (statically checked before
execution?)
Platforms, networks, and applications must be carefully engineered
- 61. Load Balancing for Performance and Reliability
• CPU/memory utilization
– Grammar compilation
– ASR load
– TTS load
• Telephony Network
– Channel balancing
– Dead channel
• Incoming/Outgoing channel assignment / mix
- 62. Performance: Caching
• Fetched documents, grammars, audio files, streams
• Local or distributed cache?
• Effects of prefetching
• Where to cache generated grammars?
– Per system
– In-network
• Use external grammar compilation server?
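VoiceXML 2.0 exposes some of these caching decisions to the application through fetch properties and attributes. The values below (in seconds) are illustrative assumptions, not recommendations, and the file names are hypothetical.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <!-- Document-wide defaults for fetched resources -->
  <property name="audiofetchhint" value="prefetch"/>
  <property name="audiomaxage" value="86400"/>
  <property name="grammarmaxage" value="3600"/>
  <!-- Always revalidate VoiceXML documents with the server -->
  <property name="documentmaxage" value="0"/>
  <form id="main">
    <field name="dept">
      <prompt>
        <!-- Per-resource override: accept a cached copy up to one day old -->
        <audio src="welcome.wav" maxage="86400" maxstale="0">Welcome to Acme.</audio>
      </prompt>
      <!-- Fetch the grammar only when it is needed, cache it for ten minutes -->
      <grammar src="main_menu.grxml" type="application/srgs+xml"
               fetchhint="safe" maxage="600"/>
    </field>
  </form>
</vxml>
```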
- 64. Application Monitoring and Maintenance
• Runtime logs
– Web / application server
– Voice server
– Call Detail Reporting
• Utterance recordings and logs
– Useful for grammar and dialog tuning
– Security of recordings may be an issue
– Disk space: full-call recordings may be prohibitively large
Usage data must be continually monitored to improve user experience
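On the voice server side, the `<log>` element is one way to capture tuning data for each recognition. This is a sketch; the label, field name, and grammar file are assumptions, and where the log output ends up is platform-specific.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="main">
    <field name="dept">
      <prompt>You can choose sales, repair, or order status.</prompt>
      <grammar src="main_menu.grxml" type="application/srgs+xml"/>
      <nomatch>
        <!-- Record every miss so the grammar can be tuned later -->
        <log label="dept_menu">nomatch</log>
        I didn't understand you. <reprompt/>
      </nomatch>
      <filled>
        <!-- Shadow variables carry the raw utterance and the recognizer's confidence -->
        <log label="dept_menu">value=<value expr="dept"/>
          utterance=<value expr="dept$.utterance"/>
          confidence=<value expr="dept$.confidence"/></log>
      </filled>
    </field>
  </form>
</vxml>
```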
- 65. Operations, Administration, Maintenance,
Provisioning
• System Monitoring
– Interfacing to existing Telco OSSs
– Web-based for ISP environment
• Provisioning
– Application, Customer
• DN-URI mapping
– Telephony
• Call origination/transfer
• Max call timeout
• Max number of concurrent calls
– Platform-specific VoiceXML features
• ECMAScript allowed?
• Telephony control allowed?
• Max grammar size
- 66. Billing
Logging and Charging for usage of resources
• "platform time"
– Usage of server resources
• Toll Free usage
– It's toll free, not free
• Transferred calls
– Inbound minutes
– Outbound minutes
– Network features, e.g. Network Redirect
• Outbound calls
Accurate billing information is a critical factor in application cost or profitability
- 68. Build vs. Outsource?
Deployment Options Enable a Variety of Business Models
• Completely in-house
– Maintain complete control for security
– Development and deployment systems can be identical
• Outsourced VoiceXML/Telephony
– Large-scale distributed networks without major capital investment
– Grow quickly and incrementally
• Completely outsourced hosting
– All components and systems managed by 3rd party
• Packaged software
– VoiceXML application integrated with existing apps
- 69. Completely In-House
• Local control of all systems
• Voice server, app server, database can be on local network
• Development and deployment systems can be identical
• Physical security: in-house team “owns” it
• Failover, reliability, scalability must be locally managed
• Redundant power, networks, etc. are required
- 70. VoiceXML On-premises Deployment
using TDM or VoIP carrier connection
[Diagram: co-location facility. Calls arrive from the PSTN over TDM (DS3, multiple PRI, etc.) or over a VoIP "pipe" through a gateway, PBX, etc. Inside the facility: Cisco IPCC, VoiceXML browsers, ASR servers, web applications, and a database.]
- 71. Outsourced VoiceXML / Telephony
• Telephony and VoiceXML servers outsourced to "Voice
Service Provider" (VSP)
• Application remains in your data center(s)
– Geographically distributed
– May be dedicated to specific customers
• Many carrier-grade vendors to choose from
- 72. Outsourced VoiceXML / Telephony
• Architecture is identical to in-house deployment
• Secure IP connection used between facilities
[Diagram: the Voice Service Provider's carrier-grade outsourcing facility contains the VoIP gateway (facing the PSTN), Cisco IPCC, VoiceXML browsers, and ASR servers. It connects across the Internet to the co-location facility that contains the web applications and database.]
- 73. Advantages of Outsourcing to a VSP
• Choice of many vendors: one for all customers, or choose the
best one for each customer
• Add capacity by adding multiple vendors
• No capital investment
• Pay-as-you-go pricing models
• Failover, reliability, scalability simplified
• Physical security of equipment and networks managed by VSP
• VPN or dedicated data connection to your backend systems
- 74. Distribute Load to Multiple VSPs
[Diagram: calls from the PSTN are distributed across several VSP facilities, each with Cisco IPCC, VoiceXML browsers, and ASR servers. All of them connect over the Internet to the customer co-location facility, which hosts the web applications and database.]
Multiple co-lo facilities can be deployed for geographic redundancy and enhanced capacity.
- 75. Completely Outsourced
• Deploy hardware & software systems at customer-managed co-location facilities
• Deploy complete systems at co-location facilities managed
by 3rd party
• Deploy pre-packaged VoiceXML application integrated
with customer's call center (managed by customer)
- 76. Combination of In-house and Outsourced
Several ways to balance resources
• Primary in-house, with overflow or failover to a VSP
– Local control of resources
– Overflow to VSP during peak usage
– Backup for failover / disaster recovery
• In-house development, with primary deployment via VSP
– In-house development and trials
– “Push to the network” when ready to deploy
- 78. Inbound call using TDM connections
• 1st-party call control: VoiceXML server handles call
routing/setup/answer
[Diagram: the caller reaches the VoiceXML server directly through the PSTN.]
- 79. Inbound call using VoIP (SIP and RTP)
• 1st-party call control: VoIP gateway routes call to VoiceXML
server, which handles call routing/setup/answer
[Diagram: the customer's call arrives from the PSTN at the VoIP gateway. Step 1: INVITE between the gateway and the VoiceXML server. Step 2: RTP media between the gateway and the VoiceXML server.]
- 80. Why VoIP?
• Flexible network topology
• Simplified integration of voice dialog resources
• Vendor independence for network elements
• Separation of concerns: voice dialog resources vs. call
control
- 81. Inbound Call using 3rd Party Call Control
• 3rd party application handles call routing/setup/answer
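[Diagram: the caller's call arrives from the PSTN at the VoIP gateway. Step 1: INVITE between the gateway and the call routing application. Step 2: INVITE between the call routing application and the VoiceXML server. Step 3: RTP media between the gateway and the VoiceXML server.]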
- 82. Outbound call using 3rd Party Call Control
• 3rd party application handles outbound call
initiation/setup/routing
• “Attaches” VoiceXML dialog to connection
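In CCXML terms, an outbound calling application might look roughly like the sketch below. The destination number and `reminder.vxml` are hypothetical, and the event object name `event$` follows CCXML 1.0 as later finalized; earlier drafts and some platforms differ.

```xml
<?xml version="1.0"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <!-- Receives the id of the outbound connection -->
  <var name="outid"/>
  <eventprocessor>
    <!-- Place the call as soon as the CCXML document has loaded -->
    <transition event="ccxml.loaded">
      <createcall dest="'tel:+15551234567'" connectionid="outid"/>
    </transition>
    <!-- Called party answered: attach a VoiceXML dialog to the connection -->
    <transition event="connection.connected">
      <dialogstart src="'reminder.vxml'" connectionid="outid"/>
    </transition>
    <!-- Busy, no answer, or network failure -->
    <transition event="connection.failed">
      <log expr="'Outbound call failed: ' + event$.reason"/>
      <exit/>
    </transition>
    <!-- Dialog finished: hang up -->
    <transition event="dialog.exit">
      <disconnect connectionid="outid"/>
    </transition>
    <!-- Connection is gone: end the session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```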
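[Diagram: the outbound calling application sits above the VoIP gateway and the VoiceXML server. Step 1: INVITE between the application and the gateway, which reaches the caller through the PSTN. Step 2: INVITE between the application and the VoiceXML server. Step 3: RTP media between the gateway and the VoiceXML server.]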
- 83. What is CCXML?
• XML-based language that manages the connections and
resources used in phone calls
• Designed for 3rd-party call control applications
• Allows for easy integration into back-end web applications,
using a model very similar to VoiceXML’s
• Uses the finite state machine model
– Event handlers move from one state to the next using markup tags
• CCXML provides commands to run a “dialog” on a call leg
- 84. Why is CCXML Needed?
• VoiceXML was designed primarily for voice dialogs
– 1st-party call control: <disconnect> and several predefined common
<transfer> types
• Connection management requires full asynchronous event
handling
– Connection/telephony events can occur any time during a call and must be
handled
– VoiceXML specifically limits asynchronous events to simplify the execution
and programming model
• 1st-party Call Control can be useful but has limited flexibility
– VoiceXML 2.1 <transfer> adds "consultation" feature for network
redirect
- 85. CCXML System Architecture
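[Diagram: the CCXML server fetches CCXML documents over HTTP from the telephony web application; the dialog server fetches VXML documents over HTTP from the voice web application. The CCXML server controls the telephony interface (facing the caller and the PSTN) through the Telephony Control Interface and the dialog server through the Dialog Control Interface; a conference server is also shown. Media flows between the telephony interface and the dialog server.]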
- 86. CCXML features
• Telephony channel control: voice paths and signaling
– <createcall>, <accept>, <disconnect>,
<reject>, <redirect>
• Media control: Conference Bridges and Mixers
– <join>, <unjoin>, <createconference>,
<destroyconference>
• Dialog control: Add a VoiceXML (or other dialog)
resource to a connection
– <dialogstart>, <dialogprepare>,
<dialogterminate>
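A minimal inbound session using a few of these elements could be sketched as follows. State names and `main.vxml` are hypothetical, and `event$` follows the finalized CCXML 1.0 syntax, which may differ from drafts current when the slides were written.

```xml
<?xml version="1.0"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="state" expr="'idle'"/>
  <var name="callid"/>
  <var name="dlg"/>
  <!-- Transitions are selected by the current state and the incoming event -->
  <eventprocessor statevariable="state">
    <!-- Telephony channel control: answer the incoming call -->
    <transition state="idle" event="connection.alerting">
      <assign name="callid" expr="event$.connectionid"/>
      <assign name="state" expr="'answering'"/>
      <accept connectionid="callid"/>
    </transition>
    <!-- Dialog control: attach a VoiceXML dialog to the connected call -->
    <transition state="answering" event="connection.connected">
      <assign name="state" expr="'in_dialog'"/>
      <dialogstart src="'main.vxml'" connectionid="callid" dialogid="dlg"/>
    </transition>
    <!-- Dialog finished normally: hang up -->
    <transition state="in_dialog" event="dialog.exit">
      <assign name="state" expr="'ending'"/>
      <disconnect connectionid="callid"/>
    </transition>
    <!-- Caller or platform disconnected, in any state: end the session -->
    <transition event="connection.disconnected">
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
```

Unlike the VoiceXML form model, nothing here waits for a particular input; each transition simply reacts to whichever event arrives next.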
- 87. Integration of CCXML and VoiceXML
• Dialogs are created using <dialogstart>
– You pass the URL of the document that you want to run
• Dialogs can be ended using <dialogterminate>
– This allows CCXML to end a dialog based on an external event such as
someone calling you on a second line
• Dialogs can return data back to the CCXML platform
– In VoiceXML use <exit namelist="a b c"/>
– This is exposed in the CCXML dialog.exit event
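On the VoiceXML side, returning data could look like the sketch below. The field and variable names are made up; how the values appear in CCXML (for example as `event$.values.pin` on the `dialog.exit` event) is an assumption based on CCXML 1.0 and may differ by platform or draft.

```xml
<?xml version="1.0"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form id="collect">
    <field name="pin" type="digits">
      <prompt>Please enter your PIN.</prompt>
    </field>
    <block>
      <var name="result" expr="'ok'"/>
      <!-- End the dialog and hand both variables back to the CCXML session -->
      <exit namelist="pin result"/>
    </block>
  </form>
</vxml>
```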
- 88. W3C CCXML 1.0 status
• Nearing "Candidate Recommendation" status
– Language complete
– Test suite under development
– Certification Program under consideration
• Growing support throughout the world
• Several open source projects underway
– See http://www.sourceforge.net
- 90. Next-Generation Technologies
• Speaker Biometrics-based authentication
– Speaker Identification
– Speaker Verification
• Video IVR --VoiceXML augmented with video
– Early stages of commercial deployment now
– Simple extension to standard platforms
– Straightforward step towards full multimodal
• Multimodal
– Multiple input modalities: speech recognition, keypad, handwriting,
biometrics (voice, fingerprint, iris, etc.), geolocation, motion
– Multiple output modalities: graphics, audio (speech, TTS, music,
polyphonic tones)
- 92. Why Speaker Biometrics?
• Identify an individual for remote transactions
• Text / DTMF PINs are inadequate
– Easily compromised
– Easily forgotten
– Does not identify an individual
• US Federal Regulations
– FFIEC guidelines for financial services
- 93. Speaker Identification and Verification (SIV)
• Authentication
– The process of confirming one or more identities.
• Speaker Identification (one-to-many)
– Authentication with multiple identity claims.
• Speaker Verification (one-to-one)
– Authentication with a single identity claim.
- 94. Types of SIV
• Text independent
– SIV technology that can operate on any freeform or structured spoken input.
• Text dependent
– SIV technology (usually verification technology) that requires the voice input
of one or more specific passwords or pass phrases (having been enrolled).
• Text prompted
– SIV technology (usually verification) that randomly selects words and/or
phrases and prompts the speaker to repeat them. The term is also called
challenge-response.
- 95. Fundamental Phases of SIV
• Enrollment
– Capture one or more user utterances to ‘train’ the system
• Verification
– Capture one or more user utterances to make an identity claim
• Adaptation & Scoring
– Judge how close the user’s verification utterance is to the enrolled
utterance
– Refine the existing enrolled utterance with information from the
verification utterance
- 97. “Video” VoiceXML
• Video extensions to VoiceXML
– 3G Wireless
– VoIP phones
• VoiceXML is just a dialog language
– Initially only for voice input/output
• Example
– Videomail is a dialog application very similar to voicemail
• Video and audio are somewhat analogous
– VoiceXML can be ‘hacked’ to handle video now:
• <audio src="foo.au"/> could “play” a video file
via <audio src="foo.mpeg4"/>
– VoiceXML 3.0 might add a new language feature
• e.g. <video src="foo.avi"> or <media src="foo.mpeg4">
- 98. “Video” VoiceXML
Deployment and Standardization
• Simple extension to standard platforms
– Easy integration with current platforms
– Doesn’t “break” existing functionality
– Well aligned with “VoiceXML model”
• Early stages of commercial deployment
– Several vendors have deployed large-scale commercial systems
• Step towards full multimodal
- 99. Multimodal Applications
• W3C Multimodal Interaction Working Group
– Defining new standards based on extensive industry experience
• IBM / Motorola / Opera X+V 1.2
– Early stages of commercial deployment
– Freely available from Opera http://dev.opera.com/articles/voice/
For more information, see:
W3C Multimodal Interaction Working Group http://www.w3.org/2002/mmi
- 101. VoiceXML 3.0
• Modularization
– Cleanly separate functions to enable integration with other modalities
– Enables code reuse
• New media processing
– Video
– Voice processing
– Navigation
– Speaker biometrics
• Separation of data, control flow and presentation
– Control flow embodied in new language: SCXML
• Clean data model
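To give a feel for control flow expressed separately from presentation, here is a small state chart in SCXML. It uses the syntax of the SCXML specification as eventually published, which differs in places from the 2007 working drafts; the state and event names are hypothetical, and it does not show how VoiceXML 3.0 would bind dialogs to states.

```xml
<?xml version="1.0"?>
<scxml xmlns="http://www.w3.org/2005/07/scxml" version="1.0" initial="main_menu">
  <!-- Each state stands for one dialog; events carry the caller's choice -->
  <state id="main_menu">
    <transition event="choice.sales" target="sales"/>
    <transition event="choice.repair" target="repair"/>
    <transition event="caller.hangup" target="done"/>
  </state>
  <state id="sales">
    <transition event="dialog.done" target="main_menu"/>
    <transition event="caller.hangup" target="done"/>
  </state>
  <state id="repair">
    <transition event="dialog.done" target="main_menu"/>
    <transition event="caller.hangup" target="done"/>
  </state>
  <!-- Reaching a final state ends the state machine -->
  <final id="done"/>
</scxml>
```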
- 102. References
• W3C Voice Browser Working Group http://www.w3.org/voice
– VoiceXML 2.0 Recommendation
• http://www.w3.org/TR/voicexml20/
– VoiceXML 2.1 Working Draft
• http://www.w3.org/TR/voicexml21/
– Semantic Interpretation Working Draft
• http://www.w3.org/TR/semantic-interpretation/
– SRGS 1.0 Recommendation
• http://www.w3.org/TR/speech-grammar/
– SSML
• 1.0 Recommendation http://www.w3.org/TR/speech-synthesis/
• 1.1 Working Draft http://www.w3.org/TR/speech-synthesis11/
– CCXML 1.0
• http://www.w3.org/TR/ccxml/
– SCXML
• http://www.w3.org/TR/scxml/
• IETF http://www.ietf.org
Editor's Notes
- This DTMF grammar accepts a 4-digit PIN followed by a pound terminator