Voice Browser and Multimodal Interaction In 2009


   Paolo Baggia
   Director of International Standards

   March 6th, 2009


   Google TechTalk




Google TechTalk – Mar 6th, 2009                     Paolo Baggia   11
Overview

      A Bit of History

      W3C Speech Interaction Framework Today
          ASR/DTMF
          TTS
          Lexicons
          Voice Dialog and Call Control
          Voice Platforms and Next Evolutions

      W3C Multimodal Interaction Today
          MMI Architecture
          EMMA and InkML
          A language for Emotions

      Next Future
Company Profile

    Privately held company (fully owned by Telecom Italia), founded in 2001 as
    spin-off from Telecom Italia Labs, capitalizing on 30yrs experience and
    expertise in voice processing.
    Global Company, leader in Europe and South America for award-winning, high
    quality voice technologies (synthesis, recognition, authentication and
    identification) available in 26 languages and 62 voices.
    Multilingual, proprietary technologies protected by
    over 100 patents worldwide.
    Financially robust, break-even reached in 2004,
    revenues and earnings growing year on year.
    Growth-plan investment approved for
    the evolution of products and services.
    Offices in New York. Headquarters in Torino,
    local representative sales offices in Rome,
    Madrid, Paris, London, Munich.
    Flexible: About 100 employees, plus a
    vibrant ecosystem of local freelancers.
International Awards

                   “2008 Frost & Sullivan European Telematics and Infotainment
                   Emerging Company of the Year” Award

                   Winner of “Market leader-Best Speech Engine” Speech
                   Industry Award 2007 and 2008

                   Loquendo MRCP Server: Winner of 2008 IP Contact
                   Center Technology Pioneer Award

                   “Best Innovation in Automotive Speech Synthesis” Prize
                   AVIOS-SpeechTEK West 2007

                   “Best Innovation in Expressive Speech Synthesis” Prize
                   AVIOS-SpeechTEK West 2006

                   “Best Innovation in Multi-Lingual Speech Synthesis”
                   Prize AVIOS-SpeechTEK West 2005

A Bit of History




Standard Bodies
      Two main standard bodies:
      W3C – World Wide Web Consortium
               Founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full
               potential. Staff based at MIT (USA), ERCIM (France), Keio Univ (Japan).
               400 members all over the world, 50 Working, Interest and Coordination Groups.
               W3C is where the framework of today’s Web is developed (HTML, CSS, XML, DOM,
               SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web
               Accessibility, Device Independence)
      IETF – Internet Engineering Task Force
               Founded in 1986, with growth from 1991 as part of the Internet Society. 1300 members.
               HTTP, SIP, RTP and many other protocols. Media Resource Control Protocol (MRCP)
               is very relevant for speech platforms.

      Two industrial forums:
      VoiceXML Forum (www.voicexml.org)
               Inventors of VoiceXML 1.0, then submitted to W3C for standardization.
               Current goal is to promote, disseminate and support VoiceXML and related standards.
      SALT Forum (www.saltforum.org)
               Supported by Microsoft to define a lightweight markup for telephony and multimodal
               applications.


      Other relevant bodies:
         3GPP, OMA, ETSI, NIST

The (r)evolution of VoiceXML
 1998 - 2004

            1998   W3C Voice Browser Workshop
            1999   W3C charters Voice Browser WG
                   VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola)
            2000   VoiceXML 1.0 released
            2002   W3C charters Multimodal Interaction WG
                   SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks)
            2004   VoiceXML 2.0, SRGS 1.0, SSML 1.0: W3C Rec
            2007   SISR 1.0, VoiceXML 2.1: W3C Rec
            2008   PLS 1.0: W3C Rec
            2009   EMMA 1.0: W3C Rec

            Photo: Preparing to announce VoiceXML 1.0
            Friday Feb. 25th, 2000
            Lucent, Naperville, Illinois

            Left to right: Gerald Karam (AT&T), Linda Boyer (IBM),
            Ken Rehor (Lucent), Bruce Lucas (IBM),
            Pete Danielsen (Lucent), Jim Ferrans (Motorola),
            Dave Ladd (Motorola).


Speech Interface Framework in 2000
 (by Jim Larson)


       [Figure: block diagram of the W3C Speech Interface Framework]

       Processing components
           Input side:    ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation
           Dialog:        Dialog Manager
           Output side:   Media Planning, Language Generation, TTS, Pre-recorded Audio Player
           Connected to:  User, World Wide Web, Telephone System

       Specifications shown
           Speech Recognition Grammar Spec. (SRGS)
           N-gram Grammar ML
           Semantic Interpretation for Speech Recognition (SISR)
           Natural Language Semantics ML
           EMMA
           Pronunciation Lexicon Specification (PLS)
           Speech Synthesis Markup Language (SSML)
           VoiceXML 2.0, VoiceXML 2.1
           Call Control XML (CCXML)
           Reusable Components




Speech Interface Framework - Today
 (by Jim Larson)

       [Figure: block diagram of the W3C Speech Interface Framework]

       Processing components
           Input side:    ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation
           Dialog:        Dialog Manager
           Output side:   Media Planning, Language Generation, TTS, Pre-recorded Audio Player
           Connected to:  User, World Wide Web, Telephone System

       Specifications shown
           Speech Recognition Grammar Spec. (SRGS)
           N-gram Grammar ML
           Semantic Interpretation for Speech Recognition (SISR)
           Natural Language Semantics ML
           EMMA 1.0
           Pronunciation Lexicon Specification (PLS)
           Speech Synthesis Markup Language (SSML)
           VoiceXML 2.0, VoiceXML 2.1
           Call Control XML (CCXML)
           Reusable Components




Speech Interface Framework - End of 2009
 (by Jim Larson)

       [Figure: block diagram of the W3C Speech Interface Framework]

       Processing components
           Input side:    ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation
           Dialog:        Dialog Manager
           Output side:   Media Planning, Language Generation, TTS, Pre-recorded Audio Player
           Connected to:  User, World Wide Web, Telephone System

       Specifications shown
           Speech Recognition Grammar Spec. (SRGS)
           N-gram Grammar ML
           Semantic Interpretation for Speech Recognition (SISR)
           Natural Language Semantics ML
           EMMA 1.0
           Pronunciation Lexicon Specification (PLS)
           Speech Synthesis Markup Language (SSML)
           VoiceXML 2.0, VoiceXML 2.1
           Call Control XML (CCXML)
           Reusable Components




W3C Process




Architectural Changes

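           Traditional (proprietary) architecture
               User <-> ASR / DTMF and TTS / Audio (proprietary platform) <-> Speech Application
               Speech Application built with a proprietary SCE

           VoiceXML architecture
               User <-> ASR / DTMF and TTS / Audio (VoiceXML platform) <-> VoiceXML Browser
               VoiceXML Browser <-> Web Application: .vxml documents over HTTP
               ASR / DTMF resources: .grxml/.gram, .pls
               TTS / Audio resources: .ssml, .wav/.mp3, .pls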

The VoiceXML Impact

       VoiceXML changed the landscape of IVRs and speech application
       creation
         From proprietary to standard-based speech applications

           Before
            • Proprietary platforms (HW & SW)
            • Proprietary applications (by proprietary SCE)
            • Mainly DTMF and pre-recorded prompts
            • First attempts to add speech into IVR

           After
            • Standard VoiceXML platforms
            • Standards for Speech Technologies
            • Standard tools for VoiceXML applications
            • Integration of DTMF and ASR
            • Still predominance of DTMF, but more and more speech applications




Standards for ASR and DTMF
                   SRGS 1.0, SISR 1.0




W3C Standards for Speech/DTMF Grammars


   Speech grammar
      SYNTAX: SRGS
         Defines constraints on admissible sentences for a specific recognition turn
         Forms: ABNF, XML
         Modes: voice, dtmf
         http://www.w3.org/TR/speech-grammar/
      SEMANTICS: SISR
         Describes how to produce results after an utterance is recognized
         Forms: literal, script
         http://www.w3.org/TR/semantic-interpretation/

SRGS/SISR Grammars for “Torino”


  SISR literal, SRGS XML

             <?xml version="1.0" encoding="UTF-8"?>
             <grammar xml:lang="en-US" version="1.0"
               xmlns="http://www.w3.org/2001/06/grammar"
               tag-format="semantics/1.0-literals">
                 <rule id="main" scope="public">
                     <token>Torino</token>
                     <tag>10100</tag>
                 </rule>
             </grammar>

  SISR literal, SRGS ABNF

             #ABNF 1.0 iso-8859-1;
             mode voice;
             tag-format <semantics/1.0-literals>;
             public $main = Torino {10100} ;

  SISR script, SRGS XML

             <?xml version="1.0" encoding="UTF-8"?>
             <grammar xml:lang="en-US" version="1.0"
               xmlns="http://www.w3.org/2001/06/grammar"
               tag-format="semantics/1.0">
                 <tag>var unused=7;</tag>
                 <rule id="main" scope="public">
                     <token>Torino</token>
                     <tag>out="10100";</tag>
                 </rule>
             </grammar>

  SISR script, SRGS ABNF

             #ABNF 1.0 iso-8859-1;
             mode voice;
             tag-format <semantics/1.0>;
             {var unused=7;};
             public $main = Torino {out="10100";} ;
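
As a rough illustration of the literal form of the “Torino” grammar, the Python sketch below loads an SRGS XML grammar and returns the literal tag for a matching utterance. It is not a recognizer: the interpret helper is hypothetical, it handles only flat rules made of <token> and <tag> elements, and it assumes that the last literal tag in a matched rule is the result.

```python
import xml.etree.ElementTree as ET

SRGS_NS = "{http://www.w3.org/2001/06/grammar}"

# The literal-semantics "Torino" grammar from the slide.
GRAMMAR = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
  xmlns="http://www.w3.org/2001/06/grammar"
  tag-format="semantics/1.0-literals">
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>10100</tag>
  </rule>
</grammar>"""


def interpret(grammar_xml, utterance):
    """Return the literal tag of the first flat rule whose tokens match.

    Hypothetical helper: supports only <token>/<tag> children, and assumes
    the last literal tag of the rule is its semantic result.
    """
    root = ET.fromstring(grammar_xml.encode("utf-8"))
    if root.get("tag-format") != "semantics/1.0-literals":
        raise ValueError("this sketch only handles SISR literal semantics")
    words = utterance.split()
    for rule in root.findall(SRGS_NS + "rule"):
        tokens = [t.text.strip() for t in rule.findall(SRGS_NS + "token")]
        tags = [t.text.strip() for t in rule.findall(SRGS_NS + "tag")]
        if tokens == words:
            return tags[-1] if tags else None
    return None


result = interpret(GRAMMAR, "Torino")
print(result)
```

Matching is done on whole-word equality, so anything beyond a fixed word sequence (alternatives, repeats, rule references) is outside this sketch.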



SRGS/SISR Standards – Pros

      Powerful syntax (CFG) and very powerful semantics (ECMAScript)
      DTMF and Voice input are transparent to the application
      Wide and consistent adoption among technology vendors

      Two syntaxes, XML and ABNF, are great!
          Developers can choose (XML validation vs. compact format)

          Transformations are possible
          XML -> ABNF (easy, simple XSLT)
          ABNF -> XML (requires an ABNF parser)

          Open Source tools might be created to:
               Validate grammar syntax
               Transform grammars
               Debug grammars on written input
               Coverage tests: explode covered sentences, GenSem, SemTester, etc.
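
The XML-to-ABNF direction can be sketched in a few lines. The slide suggests XSLT; the example below uses Python instead, purely for illustration. The function names are hypothetical, and only a small subset of SRGS is covered (rule, token, tag, item, one-of, local ruleref, plain text); repeats, weights and quoted tokens are ignored.

```python
import xml.etree.ElementTree as ET

NS = "{http://www.w3.org/2001/06/grammar}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TORINO_SCRIPT = """<?xml version="1.0" encoding="UTF-8"?>
<grammar xml:lang="en-US" version="1.0"
  xmlns="http://www.w3.org/2001/06/grammar"
  tag-format="semantics/1.0">
  <tag>var unused=7;</tag>
  <rule id="main" scope="public">
    <token>Torino</token>
    <tag>out="10100";</tag>
  </rule>
</grammar>"""


def expand(node):
    """Render the content of an SRGS XML element as an ABNF expansion."""
    parts = []
    if node.text and node.text.strip():
        parts.append(node.text.strip())
    for child in node:
        name = child.tag.replace(NS, "")
        if name == "token":
            parts.append(child.text.strip())
        elif name == "tag":
            parts.append("{" + child.text.strip() + "}")
        elif name == "item":
            parts.append(expand(child))
        elif name == "one-of":
            parts.append("(" + " | ".join(expand(i) for i in child) + ")")
        elif name == "ruleref":
            parts.append("$" + child.get("uri").lstrip("#"))
        else:
            raise ValueError("unsupported element in this sketch: " + name)
        if child.tail and child.tail.strip():
            parts.append(child.tail.strip())
    return " ".join(parts)


def srgs_xml_to_abnf(xml_text):
    """Convert a simple SRGS XML grammar into ABNF text (subset only)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    lines = ["#ABNF 1.0 UTF-8;"]
    if root.get(XML_LANG):
        lines.append("language " + root.get(XML_LANG) + ";")
    lines.append("mode " + root.get("mode", "voice") + ";")
    if root.get("tag-format"):
        lines.append("tag-format <" + root.get("tag-format") + ">;")
    for child in root:
        name = child.tag.replace(NS, "")
        if name == "tag":  # grammar-level tag
            lines.append("{" + child.text.strip() + "};")
        elif name == "rule":
            scope = "public " if child.get("scope") == "public" else ""
            lines.append(scope + "$" + child.get("id") + " = " + expand(child) + ";")
    return "\n".join(lines)


abnf = srgs_xml_to_abnf(TORINO_SCRIPT)
print(abnf)
```

The reverse direction is harder for the reason the slide gives: ABNF has no ready-made parser, whereas any XML library can read the XML form.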


SRGS/SISR Standards – Small Issues

        Semantics declaration: tag-format attribute
            If value “semantics/1.0”?
                Mandates SISR Script semantics inside semantic tags
            If value “semantics/1.0-literals”?
                Mandates SISR Literal semantics inside semantic tags
            If missing?
                Unclear! Risk of interoperability troubles

        SISR Script Semantics
            Clumsy default assignment: returns the result of the last referenced rule only
               Developer must explicitly propagate results up to the enclosing rule
            Be careful when redefining “out”
               Assigning a scalar value might result in errors

        SISR Literal Semantics
            Only useful for very simple word-list rules
            No support for encapsulating rules
              SISR Literal grammars as external references ONLY!
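
One way to guard against the missing tag-format case is a small pre-deployment check. The sketch below is an assumption about how such a check could look, not part of any standard tool; semantics_mode is a hypothetical name, and the sample grammars are minimal stand-ins.

```python
import xml.etree.ElementTree as ET


def semantics_mode(grammar_xml):
    """Classify an SRGS XML grammar by its tag-format attribute."""
    root = ET.fromstring(grammar_xml.encode("utf-8"))
    fmt = root.get("tag-format")
    if fmt == "semantics/1.0":
        return "script"
    if fmt == "semantics/1.0-literals":
        return "literal"
    if fmt is None:
        # The interoperability risk flagged on the slide: behavior is
        # left to the platform when the attribute is absent.
        return "unspecified"
    return "unknown"


# Minimal stand-in grammars (no rules), used only to exercise the check.
SCRIPT = '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" tag-format="semantics/1.0"/>'
LITERAL = '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" tag-format="semantics/1.0-literals"/>'
MISSING = '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0"/>'

for sample in (SCRIPT, LITERAL, MISSING):
    print(semantics_mode(sample))
```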

SRGS/SISR – Encapsulated Grammars



         [Figure: grammars referencing one another, each with its own semantics type]

             Gr1.grxml     Script
             Gr2.gram      Literal
             Gr3.grxml     Script
             Gr41.grxml    Literal
             Gr42.gram     Script




SRGS/SISR Standards – Rich XML Results
   Section 7 of SISR 1.0 specification
          http://www.w3.org/TR/semantic-interpretation/#SI7
          Serialization rules from SISR ECMA results into XML
          Edge cases:
                 Arrays
                 Special variables “_attributes” and “_value”
                 Creation of namespaces and prefixes
   SISR ECMAScript result:

     {
         drink: {
           _nsdecl: {
              _prefix:"n1",
              _name:"http://www.example.com/n1"
           },
           _nsprefix:"n1",
           liquid: {
              _nsdecl: {
                 _prefix:"n2",
                 _name:"http://www.example.com/n2"
              },
              _attributes: {
                 color: {
                   _nsprefix:"n2",
                   _value:"black"
                 }
              },
              _value:"coke"
           },
           size:"medium"
         }
     }

   Serialized XML:

     <n1:drink xmlns:n1="http://www.example.com/n1">
       <liquid n2:color="black"
               xmlns:n2="http://www.example.com/n2">coke</liquid>
       <size>medium</size>
     </n1:drink>
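
The mapping in the drink example can be approximated with a short recursive serializer. This is a sketch under assumptions, not the normative algorithm of SISR Section 7: to_xml is a hypothetical helper, arrays are not handled, and the order of attributes in the output may differ from the slide.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

RESULT = {
    "drink": {
        "_nsdecl": {"_prefix": "n1", "_name": "http://www.example.com/n1"},
        "_nsprefix": "n1",
        "liquid": {
            "_nsdecl": {"_prefix": "n2", "_name": "http://www.example.com/n2"},
            "_attributes": {"color": {"_nsprefix": "n2", "_value": "black"}},
            "_value": "coke",
        },
        "size": "medium",
    }
}


def to_xml(name, value):
    """Serialize one property of a SISR-style result as an XML string."""
    if not isinstance(value, dict):
        return "<%s>%s</%s>" % (name, escape(str(value)), name)
    prefix = value.get("_nsprefix")
    qname = prefix + ":" + name if prefix else name
    attrs = []
    decl = value.get("_nsdecl")
    if decl:
        attrs.append("xmlns:%s=%s" % (decl["_prefix"], quoteattr(decl["_name"])))
    for aname, aval in value.get("_attributes", {}).items():
        if isinstance(aval, dict):
            aprefix = aval.get("_nsprefix")
            aqname = aprefix + ":" + aname if aprefix else aname
            attrs.append("%s=%s" % (aqname, quoteattr(str(aval["_value"]))))
        else:
            attrs.append("%s=%s" % (aname, quoteattr(str(aval))))
    # Text content first, then child elements (keys without a leading "_").
    body = escape(str(value["_value"])) if "_value" in value else ""
    body += "".join(to_xml(k, v) for k, v in value.items() if not k.startswith("_"))
    return "<%s>%s</%s>" % (" ".join([qname] + attrs), body, qname)


xml_text = to_xml("drink", RESULT["drink"])
parsed = ET.fromstring(xml_text)  # round-trip to confirm well-formedness
print(xml_text)
```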


SRGS/SISR Standards – Next Steps

      Adoption of the PLS 1.0 lexicon
           Clear entry point into PLS lexicons, <token> element
           Missing role attribute in <token> to allow homograph
           disambiguation


      Next extensions via Errata
           XML 1.1 support and IR
           Update normative references



       No Major Extensions are needed!




Speech Synthesis
                               SSML 1.0/1.1




TTS – Functional Architecture and
Markup/Non-Markup support

  1. Structure Analysis
       Markup support: <p>, <s>
       Non-Markup support: infer the structure by automatic text analysis

  2. Text Normalization
       Markup support: <say-as> for date, time, phone number, numbers
                       <sub> for acronyms and transliterations
       Non-Markup support: automatically identify and convert constructs

  3. Text-to-Phoneme Conversion
       Markup support: <phoneme>, <lexicon>
       Non-Markup support: look up in pronunciation dictionary

  4. Prosody Analysis
       Markup support: <emphasis>, <break>, <prosody>
       Non-Markup support: automatically generate prosody through analysis of
                           document structure and sentence syntax

  5. Waveform Production
       Markup support: <voice>, <audio>
       Non-Markup support:




  http://www.w3.org/TR/speech-synthesis/
SSML 1.0 – Language description (I)
        Document Structure
          <speak> root element, with the version attribute and the SSML namespace attribute
          xml:lang marks the languages
              <?xml version="1.0" encoding="ISO-8859-1"?>
              <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
              xml:lang="en-US">
              <p>I don't speak Japanese.</p>
              <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
              </speak>


         Processing and Pronunciation
          – <p> and <s> (paragraph and sentence)
            to give a structure to the text
          – <say-as> element
            to indicate the type of text construct contained within the element,
            e.g. date, numbers, etc.
          – <phoneme> element
            to provide a phonetic pronunciation for the contained text in IPA
          – <sub> element
            to provide substitutions, e.g. expanding acronyms into a sequence of
            words
  http://www.w3.org/TR/speech-synthesis/
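
For readers who generate SSML programmatically, the document structure above can be produced with Python's standard ElementTree module. This is a minimal sketch: build_ssml is a hypothetical helper, and it covers only <speak> and <p> with xml:lang.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def build_ssml(paragraphs, default_lang="en-US"):
    """Build an SSML 1.0 document from (text, language) pairs.

    A language of None means the paragraph inherits the document language.
    """
    # Serialize the SSML namespace as the default namespace (no prefix).
    ET.register_namespace("", SSML_NS)
    speak = ET.Element(
        "{%s}speak" % SSML_NS,
        {"version": "1.0", "{%s}lang" % XML_NS: default_lang},
    )
    for text, lang in paragraphs:
        p = ET.SubElement(speak, "{%s}p" % SSML_NS)
        if lang and lang != default_lang:
            p.set("{%s}lang" % XML_NS, lang)
        p.text = text
    return ET.tostring(speak, encoding="unicode")


doc = build_ssml([
    ("I don't speak Japanese.", None),
    ("Nihongo-ga wakarimasen.", "ja"),
])
print(doc)
```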
SSML 1.0 – Language description (II)
       Style
         - <voice> element
             <?xml version="1.0" encoding="ISO-8859-1"?>
             <speak version="1.0"
               xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">

               The moon is rising on the beach, when John says,
                  looking Mary in the eyes:
                  <voice name="simon">I love you!</voice>
                but she suddenly replies:
                  <voice name="susan"> Please, be serious! </voice>
             </speak>

           Voice selection attributes are:
               name, xml:lang, gender, age, and variant

         - <emphasis> element
           requests that the contained text be spoken with emphasis
               level attribute can set it to strong, moderate, reduced, or none
         - <break> element
           controls the pausing between words
               time attribute with time expressions in seconds or milliseconds, e.g. “5s”, “20ms”
               strength attribute with values:
                 none, x-weak, weak, medium (default value), strong, or x-strong
  http://www.w3.org/TR/speech-synthesis/
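
The two <break> attributes lend themselves to a small validation helper. The sketch below is illustrative only: break_time_ms and make_break are hypothetical names, and the time pattern assumes a non-negative number followed by "s" or "ms".

```python
import re

# Strength values as listed on the slide.
STRENGTHS = ("none", "x-weak", "weak", "medium", "strong", "x-strong")

# Assumed shape of a time expression: digits, optional fraction, then s or ms.
_TIME = re.compile(r"^(\d+(?:\.\d+)?|\.\d+)(ms|s)$")


def break_time_ms(value):
    """Convert a time expression such as "5s" or "20ms" to milliseconds."""
    match = _TIME.match(value)
    if not match:
        raise ValueError("not a time expression: %r" % value)
    number, unit = match.groups()
    return float(number) * (1000.0 if unit == "s" else 1.0)


def make_break(time=None, strength=None):
    """Render a <break/> element after validating its attributes."""
    attrs = ""
    if time is not None:
        break_time_ms(time)  # raises ValueError on malformed input
        attrs += ' time="%s"' % time
    if strength is not None:
        if strength not in STRENGTHS:
            raise ValueError("unknown strength: %r" % strength)
        attrs += ' strength="%s"' % strength
    return "<break%s/>" % attrs


print(break_time_ms("5s"), break_time_ms("20ms"))
print(make_break(time="20ms"), make_break(strength="x-strong"))
```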
SSML 1.0 – Language description (III)

          Prosody
            <prosody> element
              permits control of the pitch, speaking rate and volume of the
              speech output.

               The attributes are:
                 volume: the volume for the contained text.
                 rate: the speaking rate in words-per-minute for the contained text.
                 duration: a value in seconds or milliseconds for the desired time to take
                   to read the element contents.
                 pitch: the baseline pitch for the contained text.
                 range: the pitch range (variability) for the contained text in Hertz.
                 contour: sets the actual pitch contour for the contained text.

          Other elements
            <audio> element          - to play an audio file
            <mark> element           - to place a marker into the text/tag sequence
            <desc> element           - to provide a description of a non-speech audio
                                       source in <audio>
  http://www.w3.org/TR/speech-synthesis/
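
As a small illustration of the <prosody> attributes listed above, the sketch below wraps text in a <prosody> element and rejects attribute names that are not in that list. The prosody helper is hypothetical, and attribute values are passed through without validation.

```python
from xml.sax.saxutils import escape, quoteattr

# Attribute names taken from the slide.
PROSODY_ATTRS = ("pitch", "contour", "range", "rate", "duration", "volume")


def prosody(text, **attrs):
    """Wrap text in a <prosody> element with the given attributes."""
    unknown = sorted(set(attrs) - set(PROSODY_ATTRS))
    if unknown:
        raise ValueError("not a prosody attribute: " + ", ".join(unknown))
    rendered = "".join(
        " %s=%s" % (name, quoteattr(str(attrs[name])))
        for name in PROSODY_ATTRS
        if name in attrs
    )
    return "<prosody%s>%s</prosody>" % (rendered, escape(text))


snippet = prosody("Please, be serious!", rate="slow", volume="loud")
print(snippet)
```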
Towards SSML 1.1 – Motivations

    Internationalization needs:
         Three Workshops: Beijing (Nov’05), Crete (May’06), Hyderabad (Jan’07)
         Results:
             No major needs for Eastern and Western European languages
             Many issues for Far East languages (Mandarin, Japanese, Korean)
             Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many
             Indian languages
                   Mark input with or without vowels
                   Mark the transliteration schema used for input


    Extensions required by Voice Browser:
         More powerful error handling, selection of fall-back strategies
         Trimming attributes
         Volume attribute to adopt a logarithmic scale (previously linear)

    Alignment with PLS 1.0 specification for user lexicons
  http://www.w3.org/TR/speech-synthesis11/
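
To make the move to a logarithmic volume scale concrete, the sketch below converts a decibel change into an amplitude ratio using the standard relation ratio = 10^(dB/20). The slide only says the scale is logarithmic; reading it as decibels written like "+6dB" is an assumption here, and both function names are hypothetical.

```python
def parse_volume_db(value):
    """Parse a string such as "+6dB" or "-3.5dB" into a float number of dB."""
    if not value.endswith("dB"):
        raise ValueError("expected a value ending in dB: %r" % value)
    return float(value[:-2])


def db_to_amplitude_ratio(db):
    """Amplitude ratio for a change of `db` decibels: 10 ** (db / 20)."""
    return 10.0 ** (db / 20.0)


for text in ("+6dB", "0dB", "-6dB", "-20dB"):
    print(text, round(db_to_amplitude_ratio(parse_volume_db(text)), 3))
```

On this scale +6dB roughly doubles the amplitude and -6dB roughly halves it, which is why equal steps in dB sound like equal steps in loudness far more than equal linear steps do.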
SSML 1.1 – Language Changes

        <w> element

        Lexicon extensions
          <lookup> element
            selects which of the referenced lexicons apply to the contained
            text.

        Phonetic Alphabet Registry creation and adoption
             quot;ipaquot; for International Phonetic Alphabet
             Registering policy for other phonetic alphabets, similar to LTRU for
             Language tags
             Candidates:
                 PinYin for Mandarin Chinese
                 JEITA for Japanese
                 X-SAMPA, ASCII transliteration of IPA codes
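
As an illustration of X-SAMPA being an ASCII transliteration of IPA, the sketch below converts a few X-SAMPA symbols to their IPA equivalents. The table is deliberately partial (only the symbols needed for the examples in this talk), and the function name is hypothetical.

```python
# Partial X-SAMPA -> IPA table; a complete table is much larger.
XSAMPA_TO_IPA = {
    "r\\": "ɹ",   # two-character X-SAMPA symbol
    '"': "ˈ",     # primary stress
    ":": "ː",     # length mark
    "@": "ə",
    "V": "ʌ",
    "I": "ɪ",
    "E": "ɛ",
    "A": "ɑ",
    "Z": "ʒ",
}


def xsampa_to_ipa(text: str) -> str:
    """Greedy longest-match conversion; unknown symbols pass through unchanged."""
    keys = sorted(XSAMPA_TO_IPA, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(XSAMPA_TO_IPA[key])
                i += len(key)
                break
        else:
            # No table entry starts here: keep the character as it is.
            out.append(text[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(xsampa_to_ipa('s@"pVlvId@'))
```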


  http://www.w3.org/TR/speech-synthesis11/
Pronunciation Lexicon
                              PLS 1.0




Pronunciation Lexicons

    Pronunciation Lexicon
      A mapping between words (or short phrases), their written representations,
        and their pronunciations suitable for use by an ASR engine or a TTS
        engine


    Pronunciation lexicons are not only useful for voice browsers
      They have also proven effective mechanisms to support accessibility for the
        differently abled as well as greater usability for all users
      They are used to good effect in screen readers and user agents supporting
        multimodal interfaces


    The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is
    designed to enable interoperable specification of pronunciation
    lexicons


  http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Language Overview

      A PLS document is a container (<lexicon>) of several lexical entries
      (<lexeme>)

      Each lexical entry contains
       One or more spellings (<grapheme>)
       One or more pronunciations (<phoneme>) or substitutions (<alias>)

      Each PLS document is related to a single unique language (xml:lang)

      SSML 1.0 and SRGS 1.0 documents can reference one or more PLS
      documents

      Current version doesn’t include morphological, syntactic and semantic
      information associated with pronunciations


  http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – An Example

  <?xml version="1.0" encoding="UTF-8"?>
  <lexicon version="1.0"
   xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
      http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
    alphabet="ipa" xml:lang="en-US">

        <lexeme>
            <grapheme>Sepulveda</grapheme>
            <phoneme>səˈpʌlvɪdə</phoneme>
        </lexeme>

        <lexeme>
            <grapheme>W3C</grapheme>
            <alias>World Wide Web Consortium</alias>
        </lexeme>

  </lexicon>
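
A PLS document like the one above is plain XML, so it can be read with any XML parser. The following is a minimal sketch using Python's standard library; the function name and the returned dictionary layout are illustrative choices, not something defined by PLS.

```python
import xml.etree.ElementTree as ET

PLS_NS = "http://www.w3.org/2005/01/pronunciation-lexicon"

PLS_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0"
    xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
    alphabet="ipa" xml:lang="en-US">
  <lexeme>
    <grapheme>Sepulveda</grapheme>
    <phoneme>səˈpʌlvɪdə</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>W3C</grapheme>
    <alias>World Wide Web Consortium</alias>
  </lexeme>
</lexicon>
"""


def load_lexicon(xml_bytes: bytes) -> dict:
    """Map each grapheme to the phonemes and aliases of its lexeme."""
    root = ET.fromstring(xml_bytes)
    ns = {"pls": PLS_NS}
    entries = {}
    for lexeme in root.findall("pls:lexeme", ns):
        phonemes = [p.text for p in lexeme.findall("pls:phoneme", ns)]
        aliases = [a.text for a in lexeme.findall("pls:alias", ns)]
        # A lexeme may list several spellings; each one gets the same entry.
        for grapheme in lexeme.findall("pls:grapheme", ns):
            entries[grapheme.text] = {"phonemes": phonemes, "aliases": aliases}
    return entries


if __name__ == "__main__":
    lexicon = load_lexicon(PLS_DOC.encode("utf-8"))
    for word, entry in lexicon.items():
        print(word, entry)
```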

  http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for TTS

SSML 1.0
<?xml version="1.0" encoding="UTF-8"?>
<speak version="1.0" … xml:lang="en-US">
    <lexicon uri="http://www.example.com/SSMLexample.pls"/>
    The title of the movie is: "La vita è bella" (Life is beautiful),
    which is directed by Benigni.
</speak>


PLS 1.0
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
     <lexeme>
          <grapheme>La vita è bella</grapheme>
          <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
     </lexeme>
     <lexeme>
          <grapheme>Benigni</grapheme>
          <phoneme>bɛˈniːnji</phoneme>
     </lexeme>
</lexicon>
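
The effect of the lexicon on synthesis can be pictured as if every matching grapheme in the SSML text were wrapped in an explicit SSML <phoneme> element. The sketch below does exactly that with a plain dictionary; it is only an illustration of the idea, since a real TTS engine performs the lookup internally after its own tokenization. The function name and the dictionary are hypothetical.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Grapheme -> IPA pronunciation, as in the PLS document above.
LEXICON = {
    "La vita è bella": "ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə",
    "Benigni": "bɛˈniːnji",
}


def annotate(text: str, lexicon: dict) -> str:
    """Wrap every lexicon grapheme found in text in an SSML <phoneme> element."""
    # Longest graphemes first, so multi-word entries win over shorter ones.
    graphemes = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(g) for g in graphemes))
    parts = []
    pos = 0
    for match in pattern.finditer(text):
        parts.append(escape(text[pos:match.start()]))
        ph = quoteattr(lexicon[match.group(0)])
        parts.append(
            f'<phoneme alphabet="ipa" ph={ph}>{escape(match.group(0))}</phoneme>'
        )
        pos = match.end()
    parts.append(escape(text[pos:]))
    return "".join(parts)


if __name__ == "__main__":
    print(annotate("which is directed by Benigni.", LEXICON))
```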
 http://www.w3.org/TR/pronunciation-lexicon/
PLS 1.0 – Used for ASR

SRGS 1.0
<?xml version="1.0" encoding="UTF-8"?>
<grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
    <lexicon uri="http://www.example.com/SRGSexample.pls"/>
    <rule id="movies" scope="public">
        <one-of>
            <item>Terminator 2: Judgment Day</item>
            <item>Pluto's Judgement Day</item>
        </one-of>
    </rule>
</grammar>

PLS 1.0
<?xml version="1.0" encoding="UTF-8"?>
<lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
     <lexeme>
           <grapheme>judgment</grapheme>
           <grapheme>judgement</grapheme>
           <phoneme>ˈdʒʌdʒ.mənt</phoneme>
     </lexeme>
</lexicon>
  http://www.w3.org/TR/pronunciation-lexicon/
Examples of Use

    Multiple pronunciations for the same orthography

    Multiple orthographies

    Homophones

    Homographs

    Acronyms, Abbreviations, etc.



        Detailed descriptions can be found in:
        W3C specification, Wikipedia
        Paolo Baggia, SpeechTEK 2008 & Voice Search 2009

PLS 1.0 – Open Issues


      No wide support of IPA in speech engines
           Changes are slowly under way
           Phonetic Alphabet Registry will open doors to other alphabets in a
           controlled and interoperable way

      Integration in ASR/TTS
           SSML 1.1 will interoperate with PLS 1.0
           SRGS 1.0 still lacks support for the PLS 1.0 role attribute

      No matching algorithm inside PLS, because it is mainly a data
      format




  http://www.w3.org/TR/pronunciation-lexicon/
Pronunciation Alphabets
                          IPA, SAMPA




International Phonetic Alphabet

    Pronunciation is represented by a phonetic alphabet
         Standard phonetic alphabets
           International Phonetic Alphabet (IPA)
         Well-known phonetic alphabets
           SAMPA - ASCII based (simple to write)
           Pinyin (Chinese Mandarin), JEITA (Japanese), etc.
         Proprietary phonetic alphabets


    International Phonetic Alphabet (IPA)
         Created by the International Phonetic Association (active since 1886),
         a collaborative effort by all the major phoneticians around the world
         Universally agreed system of notation for sounds of languages
         Covers all languages
         Requires Unicode to write it
         Normatively referenced by PLS


IPA – Chart
   IPA was founded in 1886
   It is the major international
        association of phoneticians
   The IPA alphabet provides
        symbols making possible the
        phonemic transcription of all
        known languages




   IPA characters can be encoded in
      Unicode by supplementing
      ASCII with characters from
      other ranges, particularly:
        IPA extensions (0250–02AF)
        Latin Extended-A (0100-017F)
   See the detailed charts:
      http://www.unicode.org/charts
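
To see which Unicode ranges an IPA transcription actually draws on, each character can be classified by its code point. This is a small sketch: the two ranges quoted on the slide are included, plus Basic Latin and Spacing Modifier Letters (02B0–02FF), which is an addition here because stress and length marks live in that block. The function names are illustrative.

```python
# (first code point, last code point, block name)
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin (ASCII)"),
    (0x0100, 0x017F, "Latin Extended-A"),
    (0x0250, 0x02AF, "IPA Extensions"),
    (0x02B0, 0x02FF, "Spacing Modifier Letters"),
]


def block_of(ch: str) -> str:
    """Return the name of the listed Unicode block containing ch, or 'other'."""
    code_point = ord(ch)
    for first, last, name in BLOCKS:
        if first <= code_point <= last:
            return name
    return "other"


def classify(ipa: str) -> list:
    """List (character, code point, block) for every character of the string."""
    return [(ch, f"U+{ord(ch):04X}", block_of(ch)) for ch in ipa]


if __name__ == "__main__":
    for row in classify("səˈpʌlvɪdə"):
        print(*row)
```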



Phonetic Alphabets – Issues


      The real problem is how to write pronunciations reliably unless
      you are a trained phonetician
      There are issues with fonts, authoring tools and browsers, although today's
      Unicode fonts support the IPA extensions; see:
          http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm

      There are very few tools to help writing pronunciations and to let
      you listen to what you have written


      Make pronunciations available in IPA or other general phonetic
      alphabets.



Voice Dialog languages:
                          VoiceXML 2.0
                          VoiceXML 2.1




VoiceXML 2.0 – Features, Elements

   Menus, forms, sub-dialogs
     <menu>, <form>, <subdialog>
   Input
     Speech recognition: <grammar>
     Recording: <record>
     Keypad: <grammar mode="dtmf">
   Output
     Audio files: <audio>
     Text-To-Speech: <prompt>
   Variables (ECMA-262)
     <var>, <assign>, <script>
     scoping rules
   Events
     <nomatch>, <noinput>, <help>, <catch>, <throw>
   Transition and submission
     <goto>, <submit>
   Telephony
     Connection control: <transfer>, <disconnect>
     Telephony information
   Platform specifics
     <object>
   Performance
     Fetch
     Properties
  http://www.w3.org/TR/voicexml20/
VoiceXML 2.0 – Execution Model

      Execution is synchronous
           Only the disconnect event is handled (somewhat) asynchronously

      Execution is always in a single dialog: <form> or <menu>
           Form Interpretation Algorithm for <field> selection

      Prompts are queued
           Played only when encountering a waiting state
           Played before a fetchaudio is started

      Processing is always in one of two states:
           Waiting for input in an input item:
           <field>, <record>, <transfer>, etc.
           Transitioning between input items in response to an input

      Event-driven:
           <nomatch>, <noinput>      user's input event handling
           <catch>, <throw>          generalized event mechanism
           connection.*              call event handling
           error.*                   error event handling
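
The select / wait / process cycle described above can be sketched in a few lines. This is a heavily simplified model of the Form Interpretation Algorithm, assuming only <field> items, one prompt per field and a grammar given as a set of accepted utterances; the data layout and names are hypothetical, and None stands for silence from the caller.

```python
_END = object()  # marks the end of the simulated caller input


def run_form(fields, utterances):
    """Fill the fields in document order, recording prompts and events."""
    values = {f["name"]: None for f in fields}
    trace = []
    source = iter(utterances)
    while True:
        # Select: the first field whose variable is still unfilled.
        field = next((f for f in fields if values[f["name"]] is None), None)
        if field is None:
            break
        # Collect: the queued prompt is played on entering the waiting state.
        trace.append(("prompt", field["prompt"]))
        heard = next(source, _END)
        if heard is _END:
            break
        # Process: fill the field or throw the matching event.
        if heard is None:
            trace.append(("event", "noinput"))
        elif heard not in field["grammar"]:
            trace.append(("event", "nomatch"))
        else:
            values[field["name"]] = heard
    return values, trace


if __name__ == "__main__":
    form = [
        {"name": "city", "prompt": "Which city?", "grammar": {"Rome", "Turin"}},
        {"name": "day", "prompt": "Which day?", "grammar": {"Monday"}},
    ]
    print(run_form(form, [None, "Paris", "Rome", "Monday"]))
```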
  http://www.w3.org/TR/voicexml20/
VoiceXML 2.1 – Extended Features
    Dynamically referencing grammars and scripts:
      <grammar srcexpr="…">, <script srcexpr="…">

    Record user’s utterance during form filling
      recordutterance property
      Add new shadow variables: recording, recordingsize, recordingduration

    Detect barge-in during prompt playback (SSML <mark>)
      Add markexpr attribute
      Add new shadow variables: markname and marktime


    Fetch XML data without transition
      Use read-only subset of DOM
    Dynamically concatenate prompts <foreach>
      Iterate through ECMAScript arrays and execute content

    Send data upon disconnect
      <disconnect namelist="…">
    Additional transfer type
      <transfer type="consultation">
                                                        http://www.w3.org/TR/voicexml21/
VoiceXML Applications

     Static VoiceXML applications
         The VoiceXML page is always the same, and so is the user experience
         No personalization or customization


     Dynamic VoiceXML applications
         User experience is customized
           • After authentication (PIN)
           • Using caller-id or SIP-id
         Data driven
         Dynamic pages generated at runtime
         e.g. JSP, ASP, etc.
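
A dynamic page can be generated by any server-side technology. The sketch below shows the idea in Python rather than JSP or ASP: it builds a small VoiceXML 2.0 page whose prompt depends on the caller. The CUSTOMERS table, the SIP address and the function name are hypothetical placeholders for a real customer database.

```python
import xml.etree.ElementTree as ET

VXML_NS = "http://www.w3.org/2001/vxml"

# Hypothetical customer data, keyed by caller-id or SIP-id.
CUSTOMERS = {"sip:alice@example.com": "Alice"}


def build_greeting_page(caller_id: str) -> str:
    """Return a VoiceXML page with a greeting customized for the caller."""
    name = CUSTOMERS.get(caller_id)
    greeting = f"Welcome back, {name}." if name else "Welcome."

    ET.register_namespace("", VXML_NS)  # serialize without a namespace prefix
    vxml = ET.Element(f"{{{VXML_NS}}}vxml", {"version": "2.0"})
    form = ET.SubElement(vxml, f"{{{VXML_NS}}}form")
    block = ET.SubElement(form, f"{{{VXML_NS}}}block")
    prompt = ET.SubElement(block, f"{{{VXML_NS}}}prompt")
    prompt.text = greeting
    return ET.tostring(vxml, encoding="unicode")


if __name__ == "__main__":
    print(build_greeting_page("sip:alice@example.com"))
```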




  http://www.w3.org/TR/voicexml20/                http://www.w3.org/TR/voicexml21/
A Drawback of VoiceXML 2.0

    A drawback of VoiceXML is that the transition from one VoiceXML page
    to another is a costly activity:
         Fetch the new page, if not cached
         Parse the page
         Initialize the context, possibly loading and initializing a new application
         root document
         Load or pre-compile scripts

    The transitions are the only way to return data to the Web Application
    (if the VoiceXML is dynamic)

    Pages must be created to include dynamic data

    VoiceXML 2.1 addresses part of this drawback by feeding dynamic
    data to a running VoiceXML page

  http://www.w3.org/TR/voicexml20/                      http://www.w3.org/TR/voicexml21/
Advantages of VoiceXML 2.1 - AJAX

    Two of the eight new features in VoiceXML 2.1 help to create
    more dynamic VoiceXML applications:
         <data> element
         <foreach> element

    A static VoiceXML document can fetch user-specific data at runtime,
    without changing the VoiceXML document
    <data> element allows retrieval of arbitrary XML data without
    VoiceXML document transitions
    Returned XML data are accessible by a subset of DOM primitives
    <foreach> extends prompts to allow iteration over a dynamic
    array of information to create a dynamic prompt

    This is similar to AJAX programming for HTML services
    It decouples presentation layer (VoiceXML) from business logic
    (accessed via <data>)
                                              http://www.w3.org/TR/voicexml21/
VoiceXML 2.1 – <data> Element

      Attributes:
           name              the variable to be filled with the DOM of the retrieved data
           src or srcexpr    the URI of the location of the XML data
           namelist          the list of variables to be submitted
           method            either 'get' or 'post'
           enctype           media encoding
           fetch and caching attributes


      Like <var>, it may appear in executable content or as a child of <form> or <vxml>
      The value of name must be a declared variable
      The platform will fill the variable with the DOM of the fetched XML data
      <data> element is synchronous (execution waits until the data is fetched)
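
In VoiceXML the fetched data is read from ECMAScript through DOM calls. The same kind of read-only DOM access can be tried out in Python with the standard minidom module, as in the sketch below. The XML payload, the tag names and the helper function are hypothetical; only the DOM calls themselves (getElementsByTagName, item, firstChild, data) are standard DOM.

```python
from xml.dom import minidom

# Hypothetical data a Web application might return to a <data> request.
FLIGHTS_XML = (
    "<flights><flight><number>AZ610</number><gate>B12</gate></flight></flights>"
)


def text_of(dom, tag):
    """Return the text of the first element named tag, or None if absent."""
    nodes = dom.getElementsByTagName(tag)
    if nodes.length == 0:
        return None
    return nodes.item(0).firstChild.data


if __name__ == "__main__":
    flights = minidom.parseString(FLIGHTS_XML)
    print(flights.documentElement.nodeName)
    print(text_of(flights, "number"), text_of(flights, "gate"))
```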




                                                      http://www.w3.org/TR/voicexml21/
VoiceXML 2.1 – <foreach> Element

    Attributes:
         array     ECMAScript expression that must evaluate to an ECMAScript array
         item      the variable that stores the element being processed


    <foreach> allows the application to iterate on an ECMAScript array and
    to execute the content
    <foreach> may appear:
         In executable content (all executable content elements may appear as
         content of <foreach>)
         In <prompt> (restrictions on the content are applied)
    <foreach> allows sophisticated concatenation of prompts
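
The prompt concatenation done by <foreach> amounts to running the same prompt body once per array element. A minimal Python analogue is sketched below; the flight data and function names are hypothetical.

```python
def foreach_prompt(array, render):
    """Run render once per element, like the body of <foreach>, and join the results."""
    segments = []
    for item in array:
        segments.append(render(item))
    return " ".join(segments)


def flight_segment(flight):
    """Prompt text for one array element."""
    return f"Flight {flight['number']} departs from gate {flight['gate']}."


if __name__ == "__main__":
    flights = [
        {"number": "AZ610", "gate": "B12"},
        {"number": "LH1902", "gate": "A3"},
    ]
    print(foreach_prompt(flights, flight_segment))
```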




                                                    http://www.w3.org/TR/voicexml21/
VoiceXML – Final Remarks

       The changed landscape for speech application development:
            Virtually all the IVRs today support VoiceXML
            New options related to VoiceXML:
                 SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie)
                 Large hosting of speech applications (TellMe, Voxeo)
                 Development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.)
            Further changes may come from the CCXML adoption


   … but:
            Mostly system-driven applications are actually deployed
            How to incorporate more powerful dialog strategies, such as
            mixed initiative, is still under discussion.




  http://www.w3.org/TR/voicexml20/                        http://www.w3.org/TR/voicexml21/
VoiceXML Resources

   Voice Browser Working Group (spec, FAQ, implementations, resources):
       http://www.w3.org/Voice/

   VoiceXML Forum site (resources, education, interest groups):
       http://www.voicexml.org/
   VoiceXML Forum Review:
       http://www.voicexmlreview.org/
        Interesting articles related to VoiceXML and more
        Example code in the sections "First Words" and "Speak & Listen"

   Ken Rehor’s World of VoiceXML
       http://www.kenrehor.com/voicexml

   Online documentation related to VoiceXML Platforms
        Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
   Many books on VoiceXML:
        Jim Larson, "VoiceXML: Introduction to Developing Speech Applications", Prentice-Hall,
           2002.
        A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002


Call Control:
                                  CCXML 1.0




CCXML 1.0 – Highlights


     Asynchronous event processing

     Acceptance or refusal of an incoming call

     Management of different types of call transfer

     Outbound call activation (interaction with an external entity)

     Use of ECMAScript adding scripting capabilities to call control
     applications

     VoiceXML modularization

     Conferencing management



CCXML 1.0 – Elements Relationship




CCXML 1.0 – Incoming Call
Event catching and processing

[Diagram: a connection.alerting event is delivered to the CCXML Interpreter,
 which exposes it to the document as the event$ object, for example:
        name:'connection.alerting';
        connectionid:'0239023901903993';
        eventid:'00001'; ....]

CCXML document

<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0">

[…]

 <transition event="connection.alerting">
 […]
 </transition>

 <transition event="connection.disconnected">
 […]
 </transition>


 http://www.w3.org/TR/ccxml

CCXML 1.0 – connection.alerting Event

      Basic telephony information is retrieved on the alerting event and
      is available to the CCXML document:
        Local URI, remote URI, protocol used, redirection info, etc.

      Based on this information, CCXML can accept or refuse the
      incoming call, even before contacting the dialog server

      Any error that occurs during the phone call can be managed by the
      CCXML service (connection.failed, error.connection events)


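[Sequence diagram, participants: Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter
   1. Call Control Adapter -> CCXML Interpreter: connection.alerting
   2. CCXML Interpreter: analyzes the event$ content
   3. CCXML Interpreter -> Call Control Adapter: <accept/> | <reject/>]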

 http://www.w3.org/TR/ccxml

CCXML 1.0 – How to activate a new dialog
CCXML actions:
  Receives the alerting event from the Call Control Adapter
  Asks the dialog server to prepare a new dialog
  Waits for the preparation to complete
  If the dialog has been successfully prepared, accepts the call
  Asks the dialog server to start the prepared dialog

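[Sequence diagram, participants: Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter
   1. Call Control Adapter -> CCXML Interpreter: alerting
   2. CCXML Interpreter -> VoiceXML Interpreter: prepare a new dialog
   3. VoiceXML Interpreter -> CCXML Interpreter: dialog prepared
   4. CCXML Interpreter -> Call Control Adapter: call accepted
   5. Call Control Adapter -> CCXML Interpreter: connected
   6. CCXML Interpreter -> VoiceXML Interpreter: start the prepared dialog
   7. VoiceXML Interpreter -> CCXML Interpreter: dialog started]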
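
The prepare, accept, start sequence can be modelled as a function from incoming events to call-control actions. The sketch below is a simulation of that logic in Python, not a CCXML implementation: the event and action names mirror CCXML ones, while the state dictionary, the dialog URI and the choice to reject the call when preparation fails are assumptions made for the example.

```python
def handle(event, state):
    """Return the list of CCXML-like actions triggered by one incoming event."""
    name = event["name"]
    if name == "connection.alerting":
        # Keep the caller alerting while the dialog is prepared.
        state["connectionid"] = event["connectionid"]
        return [("dialogprepare", state["dialog_uri"])]
    if name == "dialog.prepared":
        # Only now that the dialog is ready is the call accepted.
        state["dialogid"] = event["dialogid"]
        return [("accept", state["connectionid"])]
    if name == "connection.connected":
        return [("dialogstart", state["dialogid"])]
    if name == "error.dialog.notprepared":
        # Assumption: refuse the call if the dialog could not be prepared.
        return [("reject", state["connectionid"])]
    return []


if __name__ == "__main__":
    session = {"dialog_uri": "http://www.example.com/flight.vxml"}
    incoming = [
        {"name": "connection.alerting", "connectionid": "c1"},
        {"name": "dialog.prepared", "dialogid": "d1"},
        {"name": "connection.connected", "connectionid": "c1"},
    ]
    for ev in incoming:
        print(ev["name"], "->", handle(ev, session))
```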



Call transfer

    CCXML supports different types of call transfer: "bridge", "blind",
    "consultation"
    Depending on the transfer type, the CCXML language provides the
    interaction with the Call Control Adapter needed to perform the transfer correctly
    During the different phases of setting up the transfer, CCXML can receive
    any asynchronous event and handle it correctly, interrupting the call if
    requested

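[Sequence diagram, participants: Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter
   While performing a transfer, the CCXML Interpreter exchanges a series of
   commands and answers (command1, answer1, […]) with the Call Control Adapter
   until the transfer is complete]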




External Events

    The CCXML Interpreter Context can receive events from any external entity
    able to use the HTTP protocol
    Events generated in this way must be sent to CCXML with an HTTP POST
    request
    An event sent this way can be addressed:
         To a new session, whose creation must be requested
         To an existing session, by specifying its ID in the
         request
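[Sequence diagram: the External Entity sends a basic HTTP event to the CCXML
 Interpreter; the CCXML Interpreter performs the event management and returns
 the event management result to the External Entity]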
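
An external entity could build such a request as sketched below. This is an illustration under assumptions: the endpoint URL and the event name are hypothetical, and the form field names name and sessionid reflect one reading of the CCXML basic HTTP event processor, so they should be checked against the specification and the platform documentation. The request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_event_request(endpoint, event_name, session_id=None, **payload):
    """Build (without sending) an HTTP POST that injects an event into CCXML."""
    fields = {"name": event_name, **payload}
    if session_id is not None:
        # Target an existing session; omit to ask for a new one (assumption).
        fields["sessionid"] = session_id
    body = urlencode(fields).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


if __name__ == "__main__":
    request = build_event_request(
        "http://ccxml.example.com/events",   # hypothetical endpoint
        "dialog.terminate.request",          # hypothetical event name
        session_id="s42",
        dialogid="d7",
    )
    print(request.get_method(), request.full_url)
    print(request.data.decode("ascii"))
```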




 http://www.w3.org/TR/ccxml

External event on a new session:
the Outbound Call

    A request arrives at Call Control from an external entity
    The CCXML service associated with the received event is started and
    a set of operations between Call Control Adapter, Call Control and Dialog
    Server is activated: this is how the outbound call is placed
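[Sequence diagram, participants: Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter
   1. External entity -> CCXML Interpreter: outbound call request
   2. CCXML Interpreter -> Call Control Adapter: create a call
   3. Call Control Adapter -> CCXML Interpreter: connection progressing …
   4. CCXML Interpreter -> VoiceXML Interpreter: prepare a dialog
   5. VoiceXML Interpreter -> CCXML Interpreter: prepared
   6. Call Control Adapter -> CCXML Interpreter: connection connected
   7. CCXML Interpreter -> VoiceXML Interpreter: start the prepared dialog]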




External event on a session:
dialog termination request
    An external entity sends an HTTP POST request to the CCXML
    Interpreter Context, specifying a sessionid and requesting the termination of a
    particular dialog
    CCXML checks the session ID; if it is valid, the CCXML Interpreter
    injects the received event into the session
    The CCXML service has a transition on that event and terminates the
    dialog with the given dialog identifier
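[Sequence diagram, participants: Call Control Adapter, CCXML Interpreter, VoiceXML Interpreter
   1. External entity -> CCXML Interpreter: dialog termination request
   2. CCXML Interpreter -> VoiceXML Interpreter: dialogterminate (dialogid)
   3. VoiceXML Interpreter -> CCXML Interpreter: dialog.exit
   4. Depending on how the dialog.exit event is managed, the CCXML Interpreter
      issues disconnect(connId) to the Call Control Adapter or a new
      dialogprepare to the VoiceXML Interpreter]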



Loading different CCXML documents:
     <fetch> and <goto> elements

     The <fetch> element asynchronously fetches the content identified by its
     attributes; the <goto> element then moves execution into the fetched
     document, if it was successfully loaded

     Benefits: modularization, source exemplification, better readability

<fetch
  next="'http://../Fetch/doc1.ccxml'"
  type="'application/ccxml+xml'"
  fetchid="result"/>

[Sequence diagram: the CCXML Interpreter fetches the document "doc1.ccxml" and
 receives fetch.done or error.fetch; it then either does a goto into the new
 document or continues to work on the same dialog. The first event that occurs
 in a new document is ccxml.loaded]



  http://www.w3.org/TR/ccxml

Simple CCXML Document
<?xml version="1.0" encoding="UTF-8"?>
<ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
  <var name="currentState"/>
  <var name="myDialogId"/>
  <var name="myConnId"/>
  <eventprocessor statevariable="currentState">
    <!-- Incoming call: remember the connection and accept it -->
    <transition event="connection.alerting">
      <assign name="myConnId" expr="event$.connectionid"/>
      <accept connectionid="event$.connectionid"/>
    </transition>
    <!-- Call answered: start the VoiceXML dialog on this connection -->
    <transition event="connection.connected">
      <dialogstart src="'http://www.example.com/flight.vxml'"
         connectionid="myConnId" dialogid="myDialogId"/>
    </transition>
    <transition event="dialog.started">
      <log expr="'VoiceXML appl is running now'"/>
    </transition>
    <!-- Caller hung up: stop the dialog -->
    <transition event="connection.disconnected">
      <dialogterminate dialogid="myDialogId"/>
    </transition>
    <!-- Dialog finished: drop the call -->
    <transition event="dialog.exit">
      <disconnect connectionid="myConnId"/>
    </transition>
    <!-- Anything else is unexpected: log it and end the session -->
    <transition event="*">
      <log expr="'Closing, unexpected:'+ event$.name"/>
      <exit/>
    </transition>
  </eventprocessor>
</ccxml>
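
The way the event processor picks a transition can be simulated outside a CCXML platform. The sketch below is a Python model, assuming that transitions are tried in document order and that the first one whose event pattern matches is selected; fnmatch-style wildcards stand in for CCXML's event name matching, and the handlers only return the action they would perform.

```python
from fnmatch import fnmatchcase


def make_processor(transitions):
    """transitions: ordered (event_pattern, handler) pairs; the first match wins."""
    def dispatch(event, session):
        for pattern, handler in transitions:
            if fnmatchcase(event["name"], pattern):
                return handler(event, session)
        return None
    return dispatch


def on_alerting(event, session):
    session["myConnId"] = event["connectionid"]
    return ("accept", event["connectionid"])


# Same order as the <transition> elements in the document above.
TRANSITIONS = [
    ("connection.alerting", on_alerting),
    ("connection.connected",
     lambda ev, s: ("dialogstart", "http://www.example.com/flight.vxml")),
    ("dialog.started", lambda ev, s: ("log", "VoiceXML appl is running now")),
    ("connection.disconnected",
     lambda ev, s: ("dialogterminate", s.get("myDialogId"))),
    ("dialog.exit", lambda ev, s: ("disconnect", s.get("myConnId"))),
    ("*", lambda ev, s: ("exit", ev["name"])),
]

if __name__ == "__main__":
    dispatch = make_processor(TRANSITIONS)
    session = {}
    for name in ("connection.alerting", "connection.connected", "dialog.exit"):
        print(name, "->", dispatch({"name": name, "connectionid": "c1"}, session))
```

Because the catch-all "*" pattern comes last, it only fires for events that no earlier transition handles.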

CCXML 1.0 – Next Steps

    The CCXML specification is a Last Call Working Draft; all the feature
    requests and clarifications have been addressed

    An Implementation Report test suite is under development

    It is very close to being published as a W3C Candidate Recommendation

    Companies, internal or external, will be invited to submit implementation
    reports on their CCXML platforms

    After that, the CCXML 1.0 specification will be able to become a Proposed
    Recommendation and then a W3C Recommendation



 http://www.w3.org/TR/ccxml

Speech Interface Framework
                     Tour Complete!




Speech Interface Framework - End of 2009
 (by Jim Larson)

[Figure: Speech Interface Framework. Components between the User and the
 World Wide Web / Telephone System, with the related specifications:
   Input path: ASR and DTMF Tone Recognizer -> Language Understanding ->
     Context Interpretation -> Dialog Manager
   Output path: Dialog Manager -> Media Planning -> Language Generation ->
     TTS and Pre-recorded Audio Player
   Specifications shown: Speech Recognition Grammar Spec. (SRGS), N-gram
     Grammar ML, Semantic Interpretation for Speech Recognition (SISR),
     EMMA 1.0, Natural Language Semantics ML, Pronunciation Lexicon
     Specification (PLS), Speech Synthesis Markup Language (SSML),
     VoiceXML 2.0, VoiceXML 2.1, Reusable Components, Call Control XML (CCXML)]




Architectural Changes




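[Figure: VoiceXML architecture. The User talks to the VoiceXML platform, which
 consists of ASR / DTMF, TTS / Audio and the VoiceXML Browser. The VoiceXML
 Browser fetches .vxml pages from the Web Application over HTTP. Resources
 used: .grxml/.gram and .pls on the ASR / DTMF side; .ssml, .wav/.mp3 and .pls
 on the TTS / Audio side]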




VoxNauta – Internal Architecture




Loquendo MRCP Server/LSS 7.0 Architecture

[Figure: Loquendo MRCP Server/LSS 7.0 architecture
   Front end: Load Balancer
   Protocols: RTSP (MRCPv1), SIP (SDP), MRCP v2; RTP for audio
   Parsers: RTSP Parser, MRCP v1 Parser, SIP, SDP, MRCP v2 parser
   Core: MRCP v1/v2 Server, with Audio Provider (AP API, AP Interf.),
     NLSML / EMMA, and the TTS & ASR interface
   Services: MP with Management (SNMP) and Graphic Management Console;
     Config with configuration files; Logger with log files; OS (Win32/Linux)
   Engines behind the TTS and ASR API: LTTS, LASR, LASR-SV]

IETF MRCP Protocols


        The Media Resource Control Protocol (MRCP) is defined by the IETF
              MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on
              RTSP/RTP
              MRCPv2 is an Internet Draft,
              http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP,
              adding audio recording and speaker verification
        Optimized client-server solution for large-scale deployment of
        speech technologies in telephony: call centers, CRM, news and
        email reading, self-service applications, etc.
        Provides a standard interface to speech technologies across IVR platforms



                         For more information, read:
            Dave Burke, Speech Processing for IP Networks: Media
                Resource Control Protocol (MRCP), Wiley

VoiceXML in a Call Center

  [Diagram] Fixed/Mobile Network, PBX, ACD, VOXNAUTA IVR,
            WEB Server, CTI Server, Data Server, Operators
            Optional: Voice Gateway for non-SIP PBX

VoiceXML in the IMS Architecture



  [Diagram] Fixed/Mobile Network, VOICE GATEWAY, VOXNAUTA MRF,
            IP Network, Application Server
            Connection types: TDM protocols, SIP protocols, RTP,
            VoiceXML on HTTPS

Overview

      A Bit of History

      W3C Speech Interaction Framework Today
          ASR/DTMF
          TTS
          Lexicons
          Voice Dialog and Call Control
          Voice Platforms and Next Evolutions

      W3C Multimodal Interaction Today
          MMI Architecture
          EMMA and InkML
          A language for Emotions

      Next Future
Modes, Modalities and Technologies


     Speech
     Audio
     Stylus
     Touch
     Accelerometer
     Keyboard/keypad
     Mouse/touchpad
     Camera
     Geolocation
     Handwriting recognition
     Speaker verification
     Signature verification
     Fingerprint identification
     ….



Complement and Supplement


                  Speech                      Visual
         - Transient                   - Persistent
         - Linear                      - Spatial
         - Hands- and eyes-free        - Requires the eyes
         - Suffers from noise          - Suffers from poor lighting




        Multimodality lets users choose among modalities or mix them
        Adapts to different social and environmental conditions and
         to user preference



GUI                    VUI   MUI
                                         or
                                        MMUI




MMI has an Intrinsic Complexity


  [Diagram] Inputs converging on the Interaction Manager:

      User intent:     speech, text, mouse, handwriting, accelerometer
      Sensor:          geolocation, vital signs
      Identification:  fingerprint, face identification, speaker verification
      Recording:       video, photograph, audio recording, drawing

                                                             Deborah Dahl, Voice Search 2009
MMI can Include Many Different Technologies




  [Diagram] Interaction Manager at the center, connected to:
            Touchscreen, Accelerometer, Speech recognition, Geolocation,
            Fingerprint recognition, Keypad, Handwriting recognition

                                                                      Deborah Dahl, Voice Search 2009


Uniform Representation for MMI



        Getting everything to work together is complicated.
        One simplification is to represent the same information
        from different modalities in the same format.
        Hence the need for a common language for representing the
        same information from different modalities.



        EMMA (Extensible MultiModal Annotation) 1.0
        A uniform representation for multimodal information




  [Diagram] Each component (Touchscreen, Accelerometer, Speech recognition,
            Geolocation, Fingerprint recognition, Keypad, Handwriting
            recognition) sends its results to the Interaction Manager as EMMA.

                                                                        Deborah Dahl, Voice Search 2009


EMMA Structural Elements


  EMMA elements provide containers for application semantics and for
  multimodal annotation:

      emma:emma
      emma:interpretation
      emma:one-of
      emma:group
      emma:sequence
      emma:lattice

  <emma:emma …>
      <emma:one-of>
          <emma:interpretation>
           …
          </emma:interpretation>
          <emma:interpretation>
           …
          </emma:interpretation>
      </emma:one-of>
  </emma:emma>




  http://www.w3.org/TR/emma/
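
To make the role of emma:one-of concrete, here is a minimal Python sketch that picks the most confident of several competing interpretations. The sample document, the `destination` payload element and the helper names (`q`, `best_interpretation`) are assumptions made for illustration; only the EMMA element names, the emma:confidence annotation and the namespace URI come from the slides.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# Hypothetical recognizer output: two competing interpretations (an N-best list).
DOC = """
<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="nbest">
    <emma:interpretation id="int1" emma:confidence="0.75">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.60">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>
"""


def q(name):
    """Qualify a local name with the EMMA namespace, ElementTree style."""
    return "{%s}%s" % (EMMA_NS, name)


def best_interpretation(xml_text):
    """Return the emma:interpretation with the highest emma:confidence."""
    root = ET.fromstring(xml_text)
    candidates = root.findall("./%s/%s" % (q("one-of"), q("interpretation")))
    # Interpretations without a confidence score are ranked last.
    return max(candidates, key=lambda el: float(el.get(q("confidence"), "0")))


best = best_interpretation(DOC)
print(best.get("id"), best.findtext("destination"))
```

A real Interaction Manager would probably not discard the lower-ranked alternatives outright, since they are useful for confirmation dialogs.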
EMMA Annotations

  Characteristics and processing of input, e.g.:

      Annotation                        Meaning
      emma:tokens                       tokens of input
      emma:process                      reference to processing
      emma:no-input                     lack of input
      emma:uninterpreted                uninterpretable input
      emma:lang                         human language of input
      emma:signal                       reference to signal
      emma:media-type                   media type
      emma:confidence                   confidence scores
      emma:source                       annotation of input source
      emma:start, emma:end              timestamps (absolute/relative)
      emma:medium, emma:mode,           medium, mode, and
        emma:function                     function of input
      emma:hook                         hook



  http://www.w3.org/TR/emma/
EMMA 1.0 – Example Travel Application




   INPUT:
   "I want to go from Boston
    to Denver on March 11"




  http://www.w3.org/TR/emma/            Deborah Dahl, Voice Search 2009


EMMA 1.0 – Same meaning


 Speech:

 <emma:interpretation emma:medium="acoustic" emma:mode="voice"
   id="int1">
         <origin>Boston</origin>
         <destination>Denver</destination>
         <date>11032009</date>
 </emma:interpretation>


 Mouse:

 <emma:interpretation emma:medium="tactile" emma:mode="gui"
   id="int1">
         <origin>Boston</origin>
         <destination>Denver</destination>
         <date>11032009</date>
 </emma:interpretation>

  http://www.w3.org/TR/emma/                           Deborah Dahl, Voice Search 2009
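
The point of this slide can be checked mechanically: once the modality annotations are set aside, the application payload is identical. The Python sketch below is illustrative only; the `payload` helper and the template are assumptions, and the attributes are written with the emma: prefix as on the biometrics slide.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# One template, two modalities: only the annotations differ.
TEMPLATE = """
<emma:interpretation xmlns:emma="http://www.w3.org/2003/04/emma"
    id="int1" emma:medium="{medium}" emma:mode="{mode}">
  <origin>Boston</origin>
  <destination>Denver</destination>
  <date>11032009</date>
</emma:interpretation>
"""


def payload(xml_text):
    """Return (mode, application data) for a single emma:interpretation."""
    el = ET.fromstring(xml_text)
    mode = el.get("{%s}mode" % EMMA_NS)
    data = {child.tag: child.text for child in el}
    return mode, data


speech_mode, speech_data = payload(TEMPLATE.format(medium="acoustic", mode="voice"))
gui_mode, gui_data = payload(TEMPLATE.format(medium="tactile", mode="gui"))

# The application sees the same travel request whichever modality produced it.
print(speech_mode, gui_mode, speech_data == gui_data)
```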


EMMA 1.0 – Handwriting Input

   <emma:interpretation emma:medium="tactile" emma:mode="ink"
     id="int1">
           <origin>Boston</origin>
           <destination>Denver</destination>
           <date>11032009</date>
   </emma:interpretation>




  http://www.w3.org/TR/emma/                          Deborah Dahl, Voice Search 2009


EMMA 1.0 – Biometrics Input

Photograph:

<emma:emma version="1.0">
   <emma:interpretation
      id="int1"
      emma:confidence=".75"
      emma:medium="visual"
      emma:mode="photograph"
      emma:verbal="false"
      emma:function="identification">
         <person>12345</person>
         <name>Mary Smith</name>
   </emma:interpretation>
</emma:emma>

Voice:

<emma:emma version="1.0">
   <emma:interpretation
      id="int1"
      emma:confidence=".80"
      emma:medium="acoustic"
      emma:mode="voice"
      emma:verbal="false"
      emma:function="identification">
         <person>12345</person>
         <name>Mary Smith</name>
   </emma:interpretation>
</emma:emma>




  http://www.w3.org/TR/emma/                               Deborah Dahl, Voice Search 2009

EMMA 1.0 – Representing Lattices


     Speech recognizers, handwriting recognizers and other input
     processing components may provide lattice output: a graph
     encoding a range of possible recognition results or
     interpretations.


     [Lattice diagram, nodes 1 to 8]
     1 -flights-> 2 -to-> 3 -austin|boston-> 4 -from-> 5 -portland|oakland-> 6
     6 -today-> 7 -please-> 8
     6 -tomorrow-> 8




                                                                       From Michael Johnston, AT&T Research
  http://www.w3.org/TR/emma/
EMMA 1.0 – Representing Lattices
    Lattices can be represented using EMMA elements:
     <emma:lattice emma:initial="?" emma:final="?">
     <emma:arc emma:from="?" emma:to="?">

  <emma:emma version="1.0"
      xmlns:emma="http://www.w3.org/2003/04/emma">
    <emma:interpretation>
      <emma:lattice emma:initial="1" emma:final="8">
        <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
        <emma:arc emma:from="2" emma:to="3">to</emma:arc>
        <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
        <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
        <emma:arc emma:from="4" emma:to="5">from</emma:arc>
        <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
        <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
        <emma:arc emma:from="6" emma:to="7">today</emma:arc>
        <emma:arc emma:from="7" emma:to="8">please</emma:arc>
        <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
      </emma:lattice>
    </emma:interpretation>
  </emma:emma>
                                                       From Michael Johnston, AT&T Research
  http://www.w3.org/TR/emma/
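
A short Python sketch of how a consumer might expand this lattice into the word sequences it encodes. It is an illustration, not part of EMMA: the helper names (`q`, `lattice_paths`) are hypothetical, and the walk assumes the lattice is acyclic, as it is here.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"

# The lattice from the slide, with the namespace declared on the fragment.
LATTICE = """
<emma:lattice xmlns:emma="http://www.w3.org/2003/04/emma"
    emma:initial="1" emma:final="8">
  <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
  <emma:arc emma:from="2" emma:to="3">to</emma:arc>
  <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
  <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
  <emma:arc emma:from="4" emma:to="5">from</emma:arc>
  <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
  <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
  <emma:arc emma:from="6" emma:to="7">today</emma:arc>
  <emma:arc emma:from="7" emma:to="8">please</emma:arc>
  <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
</emma:lattice>
"""


def q(name):
    """Qualify a local name with the EMMA namespace, ElementTree style."""
    return "{%s}%s" % (EMMA_NS, name)


def lattice_paths(xml_text):
    """List every word sequence from the initial node to a final node."""
    root = ET.fromstring(xml_text)
    arcs = {}  # source node -> [(target node, word), ...] in document order
    for arc in root.findall(q("arc")):
        arcs.setdefault(arc.get(q("from")), []).append((arc.get(q("to")), arc.text))
    finals = set(root.get(q("final")).split())  # tolerate several final nodes

    def walk(node, words):
        if node in finals:
            yield " ".join(words)
        for target, word in arcs.get(node, []):
            yield from walk(target, words + [word])

    return list(walk(root.get(q("initial")), []))


paths = lattice_paths(LATTICE)
print(len(paths))
for path in paths:
    print(path)
```

Ten arcs stand in for eight full sentences, which is why recognizers prefer a lattice to a flat N-best list when alternatives multiply.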
EMMA in Multimodal Framework
 http://www.w3.org/TR/mmi-framework




                                  EMMA




InkML 1.0 – Digital Ink

Ink Markup Language (InkML), http://www.w3.org/TR/InkML
   Data format for representing digital ink (pen, stylus, etc.)
   Supports input and processing of handwriting, gestures, sketches,
   music, etc.
                                  <ink>
                                     <trace>
                                       10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140,
                                       13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135,
                                       58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205
                                     </trace>
                                     <trace>
                                       130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125,
                                       152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200,
                                       150 208, 163 210, 178 208, 192 201, 205 192, 214 180
                                     </trace>
                                     <trace>
                                       227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
                                       230 148, 234 162, 235 176, 238 190, 241 204
                                     </trace>
                                     <trace>
                                       282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129,
                                       291 143, 294 157, 294 171, 294 185, 296 199, 300 213
                                     </trace>
                                     <trace>
                                       366 130, 359 143, 354 157, 349 171, 352 185, 359 197,
                                       371 204, 385 205, 398 202, 408 191, 413 177, 413 163,
                                       405 150, 392 143, 378 141, 365 150
                                     </trace>
                                  </ink>


  http://www.w3.org/TR/InkML/
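
A minimal Python sketch of reading such traces, assuming every trace is a comma-separated list of absolute "X Y" pairs as on this slide. InkML's abbreviated difference encodings and the InkML namespace are deliberately not handled, and `parse_traces` and `bounding_box` are hypothetical helper names. The sample reuses the third trace from the slide.

```python
import xml.etree.ElementTree as ET

INK = """
<ink>
  <trace>
    227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134,
    230 148, 234 162, 235 176, 238 190, 241 204
  </trace>
</ink>
"""


def parse_traces(xml_text):
    """Return one list of (x, y) float tuples per <trace> element."""
    root = ET.fromstring(xml_text)
    traces = []
    for trace in root.iter("trace"):
        points = []
        for pair in trace.text.split(","):
            x, y = pair.split()  # split() also absorbs line breaks and indentation
            points.append((float(x), float(y)))
        traces.append(points)
    return traces


def bounding_box(points):
    """Return (min_x, min_y, max_x, max_y) for a list of points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


traces = parse_traces(INK)
print(len(traces[0]), bounding_box(traces[0]))
```

A tall, narrow bounding box like this one is the kind of cheap feature a handwriting recognizer might compute before classifying a stroke.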
InkML 1.0 – Status and Advances

    Rich annotation for Ink:
         Trace, Trace formats and Trace collections
         Contextual information
         Canvases
         Etc.

    The result of classifying InkML traces may be a semantic
    representation in EMMA 1.0

    Current status is Last Call Working Draft; next is Candidate
    Recommendation, with the release of an Implementation Report test suite
    Attracting growing interest from major industry players




  http://www.w3.org/TR/InkML/
MMI Architecture Specification

"Multimodal Architecture and Interfaces", W3C Working Draft,
    http://www.w3.org/TR/mmi-arch/


     The Runtime Framework provides the basic infrastructure and
     controls communication among the constituents.

     The Interaction Manager (IM) coordinates the Modality
     Components (MCs) through life-cycle events and holds the
     shared data (context).

     Communication between the IM and the MCs is event-based.

     [Diagram] Runtime Framework containing the Delivery Context
     Component, the Interaction Manager and the Data Component; below
     it, the Modality Component API connects Modality Component 1
     through Modality Component N.

 http://www.w3.org/TR/mmi-arch/                               Ingmar Kliche, SpeechTEK 2008
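
The event-based coordination described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the API of the specification: the classes, the in-process method calls standing in for event transport, and the canned results are all hypothetical. The event names StartRequest and DoneNotification are taken from the MMI life-cycle events, but the exchange is simplified, with the component answering a StartRequest directly with a DoneNotification.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    name: str      # life-cycle event name, e.g. "StartRequest"
    source: str    # sender
    target: str    # receiver
    data: dict = field(default_factory=dict)


class ModalityComponent:
    """Toy MC: answers a StartRequest with a DoneNotification."""

    def __init__(self, name, result):
        self.name = name
        self._result = result  # canned result standing in for real recognition

    def handle(self, event):
        if event.name == "StartRequest":
            return Event("DoneNotification", self.name, event.source, dict(self._result))
        raise ValueError(f"unsupported event: {event.name}")


class InteractionManager:
    """Toy IM: drives the MCs with events and holds the shared context."""

    def __init__(self):
        self.name = "IM"
        self.context = {}
        self.components = {}
        self.log = []

    def register(self, component):
        self.components[component.name] = component

    def start_all(self):
        for component in self.components.values():
            request = Event("StartRequest", self.name, component.name)
            self.log.append(f"{request.name} -> {component.name}")
            reply = component.handle(request)
            self.log.append(f"{reply.name} <- {component.name}")
            self.context.update(reply.data)  # merge results into shared data
        return self.context


im = InteractionManager()
im.register(ModalityComponent("voice", {"destination": "Denver"}))
im.register(ModalityComponent("gui", {"date": "11032009"}))
print(im.start_all())
print(im.log)
```

The modality components never talk to each other; all data flows through the Interaction Manager, which keeps each component replaceable.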

MMI Arch – Laboratory Implementation


    Implementation of components using W3C markup languages.



    [Diagram] Runtime Framework containing the Delivery Context
    Component, the Interaction Manager (SCXML) and the Data Component.
    Through the Modality Component API:
        HTML Modality Component 1, for GUI
        VoiceXML Modality Component N, for VUI




 http://www.w3.org/TR/mmi-arch/                                                     Ingmar Kliche, SpeechTEK 2008

MMI Arch – Laboratory Implementation

     SCXML-based Interaction Manager.
     VoiceXML + HTML modality components.

     [Diagram]
     Interaction Manager (server): SCXML interpreter, HTTP I/O Processor
     GUI modality component: HTML Browser (client)
         Modality Component API: HTTP + XML (using AJAX)
     Voice modality component: CCXML/VoiceXML Browser (server),
         Telephony interface, Phone (client)
         Modality Component API: HTTP + XML (EMMA)




 http://www.w3.org/TR/mmi-arch/                                                              Ingmar Kliche, SpeechTEK 2008

MMI Architecture – Open Issues


        Profiles

        Start-up, Registration, Delegation
        in a distributed environment

        Transport of Events

        Extensibility of Events




 http://www.w3.org/TR/mmi-arch/

Emotion in Wikipedia

 From the Wikipedia definition:

     “An emotion is a mental and physiological state associated with a
     wide variety of feelings, thoughts, and behaviours. It is a prime
     determinant of the sense of subjective well-being and appears to play
     a central role in many human activities. As a result of this generality,
     the subject has been explored in many, if not all of the human
     sciences and art forms. There is much controversy concerning how
     emotions are defined and classified.”

General goal: Make interaction between humans and machines more
  natural for the humans

Machines should become able:
      • to register human emotions (and related states)
      • to convey emotions (and related states)
      • to “understand” the emotional relevance of events

Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009
Voice Browsing And Multimodal Interaction In 2009

Weitere ähnliche Inhalte

Andere mochten auch

Bedrijfspresentatie Connect It
Bedrijfspresentatie Connect ItBedrijfspresentatie Connect It
Bedrijfspresentatie Connect It
erikrijke
 
高筠芝
高筠芝高筠芝
高筠芝
nice567
 
蘇富惠
蘇富惠蘇富惠
蘇富惠
nice567
 
鄭倩銣
鄭倩銣鄭倩銣
鄭倩銣
nice567
 
Fotos Varias 1
Fotos Varias 1Fotos Varias 1
Fotos Varias 1
HOME
 
Cuba Turistica
Cuba TuristicaCuba Turistica
Cuba Turistica
HOME
 
Petits Angescc
Petits AngesccPetits Angescc
Petits Angescc
HOME
 
陳珮甄
陳珮甄陳珮甄
陳珮甄
nice567
 
耳科學之雲端計劃
 耳科學之雲端計劃 耳科學之雲端計劃
耳科學之雲端計劃
David Yeh
 
Dubious risk assessments
Dubious risk assessmentsDubious risk assessments
Dubious risk assessments
gnicho
 
Business Intelligence Portfolio
Business Intelligence PortfolioBusiness Intelligence Portfolio
Business Intelligence Portfolio
dklawson
 

Andere mochten auch (20)

Bedrijfspresentatie Connect It
Bedrijfspresentatie Connect ItBedrijfspresentatie Connect It
Bedrijfspresentatie Connect It
 
Show 63 | Websites Are Dead | Edge of the Web Radio
Show 63 | Websites Are Dead | Edge of the Web RadioShow 63 | Websites Are Dead | Edge of the Web Radio
Show 63 | Websites Are Dead | Edge of the Web Radio
 
醫學的無限(線)未來
醫學的無限(線)未來醫學的無限(線)未來
醫學的無限(線)未來
 
Vedic Presentation New
Vedic Presentation NewVedic Presentation New
Vedic Presentation New
 
高筠芝
高筠芝高筠芝
高筠芝
 
蘇富惠
蘇富惠蘇富惠
蘇富惠
 
鄭倩銣
鄭倩銣鄭倩銣
鄭倩銣
 
Wc To2009
Wc To2009Wc To2009
Wc To2009
 
Fotos Varias 1
Fotos Varias 1Fotos Varias 1
Fotos Varias 1
 
Cuba Turistica
Cuba TuristicaCuba Turistica
Cuba Turistica
 
Petits Angescc
Petits AngesccPetits Angescc
Petits Angescc
 
Save Your Client\'s Money With MSA
Save Your Client\'s Money With MSASave Your Client\'s Money With MSA
Save Your Client\'s Money With MSA
 
陳珮甄
陳珮甄陳珮甄
陳珮甄
 
Social Media Twitter
Social Media TwitterSocial Media Twitter
Social Media Twitter
 
Literatura acuerdo 653_2013
Literatura acuerdo 653_2013Literatura acuerdo 653_2013
Literatura acuerdo 653_2013
 
耳科學之雲端計劃
 耳科學之雲端計劃 耳科學之雲端計劃
耳科學之雲端計劃
 
Paisajes
PaisajesPaisajes
Paisajes
 
Dubious risk assessments
Dubious risk assessmentsDubious risk assessments
Dubious risk assessments
 
Business Intelligence Portfolio
Business Intelligence PortfolioBusiness Intelligence Portfolio
Business Intelligence Portfolio
 
Joy Montgomery Vip Services (2)
Joy Montgomery Vip Services (2)Joy Montgomery Vip Services (2)
Joy Montgomery Vip Services (2)
 

Ähnlich wie Voice Browsing And Multimodal Interaction In 2009

Innovation for Participation - Paul De Decker, Sun Microsystems
Innovation for Participation - Paul De Decker, Sun MicrosystemsInnovation for Participation - Paul De Decker, Sun Microsystems
Innovation for Participation - Paul De Decker, Sun Microsystems
robinwauters
 
WS-* Specifications Update 2007
WS-* Specifications Update 2007WS-* Specifications Update 2007
WS-* Specifications Update 2007
Jorgen Thelin
 
Seaside — Agile Software Development
Seaside — Agile Software DevelopmentSeaside — Agile Software Development
Seaside — Agile Software Development
Lukas Renggli
 
Webinar WebRTC HTML5 (english)
Webinar WebRTC HTML5 (english)Webinar WebRTC HTML5 (english)
Webinar WebRTC HTML5 (english)
Quobis
 

Ähnlich wie Voice Browsing And Multimodal Interaction In 2009 (20)

OSGi - Four Years and Forward - J Barr
OSGi - Four Years and Forward - J BarrOSGi - Four Years and Forward - J Barr
OSGi - Four Years and Forward - J Barr
 
ibm språkbanken websphere
ibm språkbanken websphereibm språkbanken websphere
ibm språkbanken websphere
 
OMG Introduction Dr. Richard Mark Soley
OMG Introduction Dr. Richard Mark SoleyOMG Introduction Dr. Richard Mark Soley
OMG Introduction Dr. Richard Mark Soley
 
Developer Jam Session - Intro to Voxeo Products
Developer Jam Session - Intro to Voxeo ProductsDeveloper Jam Session - Intro to Voxeo Products
Developer Jam Session - Intro to Voxeo Products
 
Innovation for Participation - Paul De Decker, Sun Microsystems
Innovation for Participation - Paul De Decker, Sun MicrosystemsInnovation for Participation - Paul De Decker, Sun Microsystems
Innovation for Participation - Paul De Decker, Sun Microsystems
 
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangean...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangean...TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangean...
TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE, Paris, Manuel Herranz, Pangean...
 
Workshop oracle
Workshop oracleWorkshop oracle
Workshop oracle
 
Webrtc - rich communication - quobis - victor pascual
Webrtc  - rich communication - quobis - victor pascualWebrtc  - rich communication - quobis - victor pascual
Webrtc - rich communication - quobis - victor pascual
 
AnywhereYouGo - The global mobile wireless development community
AnywhereYouGo - The global mobile wireless development communityAnywhereYouGo - The global mobile wireless development community
AnywhereYouGo - The global mobile wireless development community
 
Open Source Telecom Software Landscape by Alan Quayle
Open Source Telecom Software Landscape by Alan QuayleOpen Source Telecom Software Landscape by Alan Quayle
Open Source Telecom Software Landscape by Alan Quayle
 
Mikko Puhakka: Open Source Business Models
Mikko Puhakka: Open Source Business ModelsMikko Puhakka: Open Source Business Models
Mikko Puhakka: Open Source Business Models
 
WS-* Specifications Update 2007
WS-* Specifications Update 2007WS-* Specifications Update 2007
WS-* Specifications Update 2007
 
Agile Seaside
Agile SeasideAgile Seaside
Agile Seaside
 
Seaside — Agile Software Development
Seaside — Agile Software DevelopmentSeaside — Agile Software Development
Seaside — Agile Software Development
 
Html5 Seminario Tid
Html5  Seminario TidHtml5  Seminario Tid
Html5 Seminario Tid
 
Upperside Webinar - WebRTC Standards Update
Upperside Webinar - WebRTC Standards UpdateUpperside Webinar - WebRTC Standards Update
Upperside Webinar - WebRTC Standards Update
 
KITE Network Instrumentation: Advanced WebRTC Testing
KITE Network Instrumentation: Advanced WebRTC TestingKITE Network Instrumentation: Advanced WebRTC Testing
KITE Network Instrumentation: Advanced WebRTC Testing
 
Emerging SOA + BPM Standards, Software and Platforms
Emerging SOA + BPM Standards,Software and PlatformsEmerging SOA + BPM Standards,Software and Platforms
Emerging SOA + BPM Standards, Software and Platforms
 
Webinar WebRTC HTML5 (english)
Webinar WebRTC HTML5 (english)Webinar WebRTC HTML5 (english)
Webinar WebRTC HTML5 (english)
 
CCXML For Advanced Communications Applications
CCXML For Advanced Communications ApplicationsCCXML For Advanced Communications Applications
CCXML For Advanced Communications Applications
 

Voice Browsing And Multimodal Interaction In 2009

  • 1. Voice Browser and Multimodal Interaction In 2009. Paolo Baggia, Director of International Standards. Google TechTalk – Mar 6th, 2009
  • 2. Overview
      A Bit of History
      W3C Speech Interaction Framework Today
          ASR/DTMF
          TTS
          Lexicons
          Voice Dialog and Call Control
          Voice Platforms and Next Evolutions
      W3C Multimodal Interaction Today
          MMI Architecture
          EMMA and InkML
          A language for Emotions
      Next Future
  • 3. Company Profile
      Privately held company (fully owned by Telecom Italia), founded in 2001 as a spin-off from Telecom Italia Labs, capitalizing on 30 years of experience and expertise in voice processing.
      Global company, leader in Europe and South America for award-winning, high-quality voice technologies (synthesis, recognition, authentication and identification), available in 26 languages and 62 voices.
      Multilingual, proprietary technologies protected by over 100 patents worldwide.
      Financially robust: break-even reached in 2004, revenues and earnings growing year on year.
      Growth-plan investment approved for the evolution of products and services.
      Offices in New York. Headquarters in Torino; local representative sales offices in Rome, Madrid, Paris, London and Munich.
      Flexible: about 100 employees, plus a vibrant ecosystem of local freelancers.
      [Map: office locations in New York, London, Paris, Madrid, Munich, Torino and Rome]
  • 4. International Awards
      “2008 Frost & Sullivan European Telematics and Infotainment Emerging Company of the Year” Award
      Winner of “Market leader-Best Speech Engine” Speech Industry Award, 2007 and 2008
      Loquendo MRCP Server: winner of the 2008 IP Contact Center Technology Pioneer Award
      “Best Innovation in Automotive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2007
      “Best Innovation in Expressive Speech Synthesis” Prize, AVIOS-SpeechTEK West 2006
      “Best Innovation in Multi-Lingual Speech Synthesis” Prize, AVIOS-SpeechTEK West 2005
  • 5. A Bit of History
  • 6. Standard Bodies
      Two main standard bodies:
          W3C (World Wide Web Consortium): founded in 1994 by Tim Berners-Lee, with a mission to lead the Web to its full potential. Staff based at MIT (USA), ERCIM (France) and Keio University (Japan). 400 members all over the world; 50 Working, Interest and Coordination Groups. W3C is where the framework of today's Web is developed (HTML, CSS, XML, DOM, SOAP, RDF, OWL, VoiceXML, SVG, XSLT, P3P, Internationalization, Web Accessibility, Device Independence).
          IETF (Internet Engineering Task Force): founded in 1986; it grew from the early 1990s under the Internet Society. 1300 members. HTTP, SIP, RTP and many other protocols. The Media Resource Control Protocol (MRCP) is very relevant for speech platforms.
      Two industrial forums:
          VoiceXML Forum (www.voicexml.org): inventors of VoiceXML 1.0, then submitted to W3C for standardization. Its current goal is to promote, disseminate and support VoiceXML and related standards.
          SALT Forum (www.saltforum.org): supported by Microsoft to define a lightweight markup for telephony and multimodal applications.
      Other relevant bodies: 3GPP, OMA, ETSI, NIST
  • 7. The (r)evolution of VoiceXML 1998 - 2004
      [Timeline figure with year marks 1998, 1999, 2000, 2002, 2004, 2007, 2008 and 2009. Milestones shown: W3C Voice Browser Workshop; VoiceXML Forum birth (by AT&T, IBM, Lucent, Motorola); W3C charters the Voice Browser WG; VoiceXML 1.0 released; SALT Forum birth (by Cisco, Comverse, Intel, Microsoft, Philips, SpeechWorks); W3C charters the Multimodal Interaction WG; W3C Recommendations for VoiceXML 2.0, SRGS 1.0, SSML 1.0, SISR 1.0, PLS 1.0 and EMMA 1.0]
      [Photo: preparing to announce VoiceXML 1.0, Friday Feb. 25th, 2000, Lucent, Naperville, Illinois. Left to right: Gerald Karam (AT&T), Linda Boyer (IBM), Ken Rehor (Lucent), Bruce Lucas (IBM), Pete Danielsen (Lucent), Jim Ferrans (Motorola), Dave Ladd (Motorola)]
  • 8. Speech Interface Framework in 2000 (by Jim Larson)
      [Diagram. Components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, World Wide Web. Languages: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, EMMA, VoiceXML 2.0, VoiceXML 2.1, Pronunciation Lexicon Specification (PLS), Speech Synthesis Markup Language (SSML), Call Control XML (CCXML), Reusable Components]
  • 9. Speech Interface Framework - Today (by Jim Larson)
      [Diagram. Components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, World Wide Web. Languages: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, EMMA 1.0, VoiceXML 2.0, VoiceXML 2.1, Pronunciation Lexicon Specification (PLS), Speech Synthesis Markup Language (SSML), Call Control XML (CCXML), Reusable Components]
  • 10. Speech Interface Framework - End of 2009 (by Jim Larson)
      [Diagram. Components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, World Wide Web. Languages: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, EMMA 1.0, VoiceXML 2.0, VoiceXML 2.1, Pronunciation Lexicon Specification (PLS), Speech Synthesis Markup Language (SSML), Call Control XML (CCXML), Reusable Components]
  • 11. W3C Process
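  • 12. Architectural Changes
      Traditional (proprietary) architecture: the user interacts through ASR/DTMF and TTS/audio with a proprietary speech application, built with a proprietary SCE, on a proprietary platform.
      VoiceXML architecture: the user interacts through ASR/DTMF and TTS/audio with a VoiceXML browser on a VoiceXML platform. The browser fetches .vxml documents over HTTP from a Web application, together with grammar and lexicon files for ASR/DTMF (.grxml/.gram, .pls) and prompt resources for TTS/audio (.ssml, .wav/.mp3, .pls).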
  • 13. The VoiceXML Impact
      VoiceXML changed the landscape of IVRs and speech application creation: from proprietary to standards-based speech applications.
      Before
          • Proprietary platforms (HW & SW)
          • Proprietary applications (by proprietary SCE)
          • Mainly DTMF and pre-recorded prompts
          • First attempts to add speech into IVR
      After
          • Standard VoiceXML platforms
          • Standards for Speech Technologies
          • Standard tools for VoiceXML applications
          • Integration of DTMF and ASR
          • Still predominance of DTMF, but more and more speech applications
  • 14. Overview
      A Bit of History
      W3C Speech Interaction Framework Today
          ASR/DTMF
          TTS
          Lexicons
          Voice Dialog and Call Control
          Voice Platforms and Next Evolutions
      W3C Multimodal Interaction Today
          MMI Architecture
          EMMA and InkML
          A language for Emotions
      Next Future
  • 15. Standards for ASR and DTMF: SRGS 1.0, SISR 1.0
  • 16. W3C Standards for Speech/DTMF Grammars
      A speech grammar combines SYNTAX and SEMANTICS.
      SYNTAX (SRGS): defines constraints on admissible sentences for a specific recognition turn. Formats: ABNF and XML. Modes: voice and dtmf. http://www.w3.org/TR/speech-grammar/
      SEMANTICS (SISR): describes how to produce results after an utterance is recognized. Variants: literal and script. http://www.w3.org/TR/semantic-interpretation/
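The voice/dtmf distinction above can be illustrated with a small SRGS XML grammar in DTMF mode. This is only a sketch: the four-digit PIN task and the rule names `pin` and `digit` are assumptions made for illustration, not taken from the talk.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch of an SRGS 1.0 grammar in DTMF mode (hypothetical PIN task). -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" mode="dtmf" root="pin">

  <!-- A single keypad digit. -->
  <rule id="digit">
    <one-of>
      <item>0</item> <item>1</item> <item>2</item> <item>3</item> <item>4</item>
      <item>5</item> <item>6</item> <item>7</item> <item>8</item> <item>9</item>
    </one-of>
  </rule>

  <!-- Exactly four digits in a row. -->
  <rule id="pin" scope="public">
    <item repeat="4"><ruleref uri="#digit"/></item>
  </rule>
</grammar>
```

The same rule structure with `mode="voice"`, an `xml:lang` attribute and spoken tokens would define the equivalent speech grammar.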
  • 17. SRGS/SISR Grammars for “Torino”
      SISR literal, SRGS XML:
          <?xml version="1.0" encoding="UTF-8"?>
          <grammar xml:lang="en-US" version="1.0"
                   xmlns="http://www.w3.org/2001/06/grammar"
                   tag-format="semantics/1.0-literals">
            <rule id="main" scope="public">
              <token>Torino</token>
              <tag>10100</tag>
            </rule>
          </grammar>
      SISR literal, SRGS ABNF:
          #ABNF 1.0 iso-8859-1;
          mode voice;
          tag-format <semantics/1.0-literals>;
          public $main = Torino {10100} ;
      SISR script, SRGS XML:
          <?xml version="1.0" encoding="UTF-8"?>
          <grammar xml:lang="en-US" version="1.0"
                   xmlns="http://www.w3.org/2001/06/grammar"
                   tag-format="semantics/1.0">
            <tag>var unused=7;</tag>
            <rule id="main" scope="public">
              <token>Torino</token>
              <tag>out="10100";</tag>
            </rule>
          </grammar>
      SISR script, SRGS ABNF:
          #ABNF 1.0 iso-8859-1;
          mode voice;
          tag-format <semantics/1.0>;
          {var unused=7;};
          public $main = Torino {out="10100";} ;
  • 18. SRGS/SISR Standards – Pros
      Powerful syntax (CFG) and very powerful semantics (ECMAScript)
      DTMF and voice input are transparent to the application
      Wide and consistent adoption among technology vendors
      Two syntaxes, XML and ABNF, are great!
          Developers can choose (XML validation vs. compact format)
          Transformations are possible
              XML to ABNF (easy, simple XSLT)
              ABNF to XML (requires an ABNF parser)
      Open Source tools might be created to:
          Validate grammar syntax
          Transform grammars
          Debug grammars on written input
          Run coverage tests: explode covered sentences, GenSem, SemTester, etc.
  • 19. SRGS/SISR Standards – Small Issues
      Semantics declaration: the tag-format attribute
          If the value is "semantics/1.0": mandates SISR Script semantics inside semantic tags
          If the value is "semantics/1.0-literals": mandates SISR Literal semantics inside semantic tags
          If missing: unclear! Risk of interoperability troubles
      SISR Script Semantics
          Clumsy default assignment: returns the last referenced rule only
          The developer must explicitly propagate results up through the rules
          Be careful when redefining "out"
          Assigning a scalar value might result in errors
      SISR Literal Semantics
          Only useful for very simple word-list rules
          No support for encapsulating rules
          SISR Literal grammars can be used as external references ONLY!
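The point about propagating results in SISR Script semantics can be sketched with a two-rule grammar. The "from X to Y" task, the rule names and the `code`, `origin` and `destination` properties are hypothetical; the code for Roma is an illustrative value, while `rules.latest()` and `out` are the standard SISR variables.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: explicit result propagation with SISR 1.0 script semantics. -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         version="1.0" xml:lang="en-US" root="trip"
         tag-format="semantics/1.0">

  <rule id="trip" scope="public">
    <item>from</item>
    <ruleref uri="#city"/>
    <!-- Copy the result of the rule just matched into a named property.
         Without this tag, the first city would be lost, because the
         default result is taken from the last referenced rule only. -->
    <tag>out.origin = rules.latest().code;</tag>
    <item>to</item>
    <ruleref uri="#city"/>
    <tag>out.destination = rules.latest().code;</tag>
  </rule>

  <rule id="city">
    <one-of>
      <!-- "out" starts as an empty object; setting a property keeps it
           an object instead of replacing it with a scalar. -->
      <item>Torino <tag>out.code = "10100";</tag></item>
      <item>Roma <tag>out.code = "00100";</tag></item>
    </one-of>
  </rule>
</grammar>
```

For the utterance "from Torino to Roma", the `trip` rule would return an object with both `origin` and `destination` set, rather than only the result of the last city reference.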
  • 20. SRGS/SISR – Encapsulated Grammars
      [Diagram: grammars referencing one another (Gr1.grxml, Gr2.gram, Gr3.grxml, Gr41.grxml, Gr42.gram), each labeled as using Literal or Script semantics]
  • 21. SRGS/SISR Standards – Rich XML Results
      Section 7 of the SISR 1.0 specification: http://www.w3.org/TR/semantic-interpretation/#SI7
      Serialization rules from SISR ECMAScript results into XML
      Edge cases:
          Arrays
          Special variables "_attributes" and "_value"
          Creation of namespaces and prefixes
      ECMAScript result:
          {
            drink: {
              _nsdecl: {
                _prefix:"n1",
                _name:"http://www.example.com/n1"
              },
              _nsprefix:"n1",
              liquid: {
                _nsdecl: {
                  _prefix:"n2",
                  _name:"http://www.example.com/n2"
                },
                _attributes: {
                  color: {
                    _nsprefix:"n2",
                    _value:"black"
                  }
                },
                _value:"coke"
              },
              size:"medium"
            }
          }
      Serialized XML:
          <n1:drink xmlns:n1="http://www.example.com/n1">
            <liquid n2:color="black"
                    xmlns:n2="http://www.example.com/n2">coke</liquid>
            <size>medium</size>
          </n1:drink>
  • 22. SRGS/SISR Standards – Next Steps
      Adoption of the PLS 1.0 lexicon
          Clear entry point into PLS lexicons: the <token> element
          Missing role attribute in <token> to allow disambiguation of homographs
      Next extensions via Errata
          XML 1.1 support and IRIs
          Update normative references
      No major extensions are needed!
  • 23. Speech Synthesis: SSML 1.0/1.1
  • 24. TTS – Functional Architecture and Markup/Non-Markup support
      Processing stages: Structure Analysis, Text Normalization, Text-to-Phoneme Conversion, Prosody Analysis, Waveform Production
      Structure Analysis
          Markup support: <p>, <s>
          Non-markup support: infer the structure by automatic text analysis
      Text Normalization
          Markup support: <say-as> for date, time, phone number, numbers; <sub> for acronyms and transliterations
          Non-markup support: automatically identify and convert constructs
      Text-to-Phoneme Conversion
          Markup support: <phoneme>, <lexicon>
          Non-markup support: look up in pronunciation dictionary
      Prosody Analysis
          Markup support: <emphasis>, <break>, <prosody>
          Non-markup support: automatically generate prosody through analysis of document structure and sentence syntax
      Waveform Production
          Markup support: <voice>, <audio>
      http://www.w3.org/TR/speech-synthesis/
  • 25. SSML 1.0 – Language description (I)
      Document Structure
          <?xml version="1.0" encoding="ISO-8859-1"?>
          <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
                 xml:lang="en-US">
            <p>I don't speak Japanese.</p>
            <p xml:lang="ja">Nihongo-ga wakarimasen.</p>
          </speak>
          Callouts: the version attribute, the SSML namespace attribute, the <speak> root element, and languages (xml:lang)
      Processing and Pronunciation
          – <p> and <s> (paragraph and sentence) give a structure to the text
          – <say-as> indicates the type of text construct contained within the element, e.g. date, numbers, etc.
          – <phoneme> provides a phonetic pronunciation in IPA for the contained text
          – <sub> provides substitutions, for example expanding an acronym into a sequence of words
      http://www.w3.org/TR/speech-synthesis/
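A short sketch of the processing and pronunciation elements listed above. The sentences are invented for illustration, and the `interpret-as` and `format` values are an assumption: SSML 1.0 does not define a normative set of them, so supported values depend on the synthesizer.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: <p>, <s>, <sub>, <say-as> and <phoneme> in one SSML 1.0 document. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- <sub> replaces the written form with the words to be spoken. -->
    <s>The <sub alias="World Wide Web Consortium">W3C</sub> talk is on
       <!-- interpret-as/format values are engine dependent (assumption). -->
       <say-as interpret-as="date" format="mdy">3/6/2009</say-as>.</s>
    <!-- <phoneme> gives an explicit IPA transcription for the contained text. -->
    <s>You say <phoneme alphabet="ipa" ph="təˈmeɪtoʊ">tomato</phoneme>.</s>
  </p>
</speak>
```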
  • 26. SSML 1.0 – Language description (II)
      Style
      - <voice> element
          <?xml version="1.0" encoding="ISO-8859-1"?>
          <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
                 xml:lang="en-US">
            The moon is rising on the beach, when John says, looking Mary in the eyes:
            <voice name="simon">I love you!</voice>
            but she suddenly replies:
            <voice name="susan">Please, be serious!</voice>
          </speak>
          Voice selection attributes are: name, xml:lang, gender, age, and variant
      - <emphasis> element requests that the contained text be spoken with emphasis
          The level attribute can be set to strong, moderate, reduced, or none
      - <break> element controls the pausing between words
          time attribute: a time expression such as "5s" or "20ms"
          strength attribute: none, x-weak, weak, medium (default value), strong, or x-strong
      http://www.w3.org/TR/speech-synthesis/
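The `<emphasis>` and `<break>` attributes described above might be combined as in the following sketch; the prompt wording is illustrative only.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: emphasis levels and the two ways of specifying a pause. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  Take a deep breath, <break/> then continue.
  <!-- An absolute pause given as a time expression. -->
  Press 1 or wait for the tone. <break time="3s"/>
  <!-- A relative pause given as a strength label. -->
  I did not hear you! <break strength="weak"/> Please repeat.
  That is a <emphasis level="strong">huge</emphasis> bank account!
</speak>
```

A bare `<break/>` leaves the choice of pause to the engine, while `time` and `strength` give the author absolute or relative control.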
  • 27. SSML 1.0 – Language description (III)
      Prosody
      The <prosody> element permits control of the pitch, speaking rate and volume of the speech output. The attributes are:
          volume: the volume for the contained text.
          rate: the speaking rate for the contained text (a label such as "slow" or "fast", or a number acting as a multiplier of the default rate).
          duration: a value in seconds or milliseconds for the desired time to take to read the element contents.
          pitch: the baseline pitch for the contained text.
          range: the pitch range (variability) for the contained text.
          contour: sets the actual pitch contour for the contained text.
      Other elements
          <audio> element: to play an audio file
          <mark> element: to place a marker into the text/tag sequence
          <desc> element: to provide a description of a non-speech audio source in <audio>
      http://www.w3.org/TR/speech-synthesis/
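As a rough illustration of the prosody attributes and the other elements above, a document could look like this. The audio URL, the marker name and the prompt text are hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: <prosody>, <audio> with <desc> and fallback text, and <mark>. -->
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- If the file cannot be played, the fallback text is spoken instead. -->
  <audio src="http://www.example.com/prompts/welcome.wav">
    <desc>welcome jingle</desc>
    Welcome.
  </audio>
  <!-- A named marker that the platform can report when it is reached. -->
  <mark name="after_welcome"/>
  The price is <prosody rate="slow">forty five dollars</prosody>.
  <prosody pitch="high" volume="loud">Attention please.</prosody>
  <!-- Contour: (position in the text, pitch target) pairs. -->
  <prosody contour="(0%,+20Hz) (10%,+30%) (40%,+10Hz)">good morning</prosody>
</speak>
```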
  • 28. Towards SSML 1.1 – Motivations
      Internationalization needs:
          Three workshops: Beijing (Nov '05), Crete (May '06), Hyderabad (Jan '07)
          Results:
              No major needs for Eastern and Western European languages
              Many issues for Far East languages (Mandarin, Japanese, Korean)
              Some specific issues for Semitic languages (Arabic, Hebrew), Farsi and many Indian languages
                  Mark input with or without vowels
                  Mark the transliteration schema used for input
      Extensions required by Voice Browser:
          More powerful error handling, selection of fall-back strategies
          Trimming attributes
          Volume attribute to adopt a logarithmic scale (previously linear)
          Alignment with the PLS 1.0 specification for user lexicons
      http://www.w3.org/TR/speech-synthesis11/
  • 29. SSML 1.1 – Language Changes
      <w> element, to mark word boundaries explicitly
      Lexicon extensions
          <lookup> element, to control which referenced lexicons apply to the contained text
      Phonetic Alphabet Registry creation and adoption
          "ipa" for the International Phonetic Alphabet
          Registration policy for other phonetic alphabets, similar to LTRU for language tags
          Candidates:
              PinYin for Mandarin Chinese
              JEITA for Japanese
              X-SAMPA, an ASCII transliteration of IPA codes
      http://www.w3.org/TR/speech-synthesis11/
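A sketch of the lexicon extension, based on the SSML 1.1 drafts: a `<lexicon>` is given an `xml:id`, and `<lookup>` limits its use to the contained text. The lexicon URL and the id `movies` are hypothetical, and element details may differ between the draft current at the time of the talk and later versions.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: scoping a PLS lexicon with <lookup> in SSML 1.1. -->
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- Declare the lexicon once and give it an identifier. -->
  <lexicon uri="http://www.example.com/movies.pls" xml:id="movies"/>

  <!-- Tokens inside <lookup> are searched in the referenced lexicon. -->
  <lookup ref="movies">
    The title of the movie is: "La vita è bella", directed by Benigni.
  </lookup>

  <!-- Text outside <lookup> does not use the movies lexicon. -->
  Thank you for calling.
</speak>
```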
  • 30. Pronunciation Lexicon: PLS 1.0
  • 31. Pronunciation Lexicons
      Pronunciation lexicon: a mapping between words (or short phrases), their written representations, and their pronunciations, suitable for use by an ASR engine or a TTS engine
      Pronunciation lexicons are not only useful for voice browsers
          They have also proven effective mechanisms to support accessibility for the differently able, as well as greater usability for all users
          They are used to good effect in screen readers and in user agents supporting multimodal interfaces
      The W3C Pronunciation Lexicon Specification (PLS) Version 1.0 is designed to enable interoperable specification of pronunciation lexicons
      http://www.w3.org/TR/pronunciation-lexicon/
  • 32. PLS 1.0 – Language Overview
      A PLS document is a container (<lexicon>) of several lexical entries (<lexeme>)
      Each lexical entry contains:
          One or more spellings (<grapheme>)
          One or more pronunciations (<phoneme>) or substitutions (<alias>)
      Each PLS document is related to a single language (xml:lang)
      SSML 1.0 and SRGS 1.0 documents can reference one or more PLS documents
      The current version does not include morphological, syntactic or semantic information associated with pronunciations
      http://www.w3.org/TR/pronunciation-lexicon/
  • 33. PLS 1.0 – An Example
          <?xml version="1.0" encoding="UTF-8"?>
          <lexicon version="1.0"
                xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2005/01/pronunciation-lexicon
                  http://www.w3.org/TR/pronunciation-lexicon/pls.xsd"
                alphabet="ipa" xml:lang="en-US">
            <lexeme>
              <grapheme>Sepulveda</grapheme>
              <phoneme>səˈpʌlvɪdə</phoneme>
            </lexeme>
            <lexeme>
              <grapheme>W3C</grapheme>
              <alias>World Wide Web Consortium</alias>
            </lexeme>
          </lexicon>
      http://www.w3.org/TR/pronunciation-lexicon/
  • 34. PLS 1.0 – Used for TTS
      SSML 1.0:
          <?xml version="1.0" encoding="UTF-8"?>
          <speak version="1.0" … xml:lang="en-US">
            <lexicon uri="http://www.example.com/SSMLexample.pls"/>
            The title of the movie is: "La vita è bella" (Life is beautiful),
            which is directed by Benigni.
          </speak>
      PLS 1.0:
          <?xml version="1.0" encoding="UTF-8"?>
          <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
            <lexeme>
              <grapheme>La vita è bella</grapheme>
              <phoneme>ˈlɑ ˈviːɾə ˈʔeɪ ˈbɛlə</phoneme>
            </lexeme>
            <lexeme>
              <grapheme>Benigni</grapheme>
              <phoneme>bɛˈniːnji</phoneme>
            </lexeme>
          </lexicon>
      http://www.w3.org/TR/pronunciation-lexicon/
  • 35. PLS 1.0 – Used for ASR
      SRGS 1.0:
          <?xml version="1.0" encoding="UTF-8"?>
          <grammar version="1.0" xml:lang="en-US" root="movies" mode="voice">
            <lexicon uri="http://www.example.com/SRGSexample.pls"/>
            <rule id="movies" scope="public">
              <one-of>
                <item>Terminator 2: Judgment Day</item>
                <item>Pluto's Judgement Day</item>
              </one-of>
            </rule>
          </grammar>
      PLS 1.0:
          <?xml version="1.0" encoding="UTF-8"?>
          <lexicon version="1.0" … alphabet="ipa" xml:lang="en-US">
            <lexeme>
              <grapheme>judgment</grapheme>
              <grapheme>judgement</grapheme>
              <phoneme>ˈdʒʌdʒ.mənt</phoneme>
            </lexeme>
          </lexicon>
      http://www.w3.org/TR/pronunciation-lexicon/
  • 36. Examples of Use
      Multiple pronunciations for the same orthography
      Multiple orthographies
      Homophones
      Homographs
      Acronyms, abbreviations, etc.
      Detailed descriptions can be found in:
          W3C specification, Wikipedia
          Paolo Baggia, SpeechTEK 2008 & Voice Search 2009
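Some of the cases listed above can be sketched in a single PLS 1.0 lexicon. The words and transcriptions are illustrative choices, and the `pos` namespace used for the `role` values is hypothetical: PLS only requires role values to be QNames and does not define a tag set.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch: common PLS 1.0 use cases in one lexicon. -->
<lexicon version="1.0"
      xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
      xmlns:pos="http://www.example.com/pos-tags"
      alphabet="ipa" xml:lang="en-US">

  <!-- Multiple pronunciations for the same orthography; "prefer" marks
       the one a TTS engine should pick. -->
  <lexeme>
    <grapheme>either</grapheme>
    <phoneme prefer="true">ˈiːðɚ</phoneme>
    <phoneme>ˈaɪðɚ</phoneme>
  </lexeme>

  <!-- Homographs: same spelling, pronunciation depends on the role. -->
  <lexeme role="pos:present">
    <grapheme>read</grapheme>
    <phoneme>riːd</phoneme>
  </lexeme>
  <lexeme role="pos:past">
    <grapheme>read</grapheme>
    <phoneme>rɛd</phoneme>
  </lexeme>

  <!-- Homophones: different spellings, same pronunciation. -->
  <lexeme>
    <grapheme>seen</grapheme>
    <phoneme>siːn</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>scene</grapheme>
    <phoneme>siːn</phoneme>
  </lexeme>

  <!-- Acronym expanded through an alias. -->
  <lexeme>
    <grapheme>IETF</grapheme>
    <alias>Internet Engineering Task Force</alias>
  </lexeme>
</lexicon>
```

Multiple orthographies are handled as in the judgment/judgement example of the talk, with several `<grapheme>` elements in one `<lexeme>`.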
  • 37. PLS 1.0 – Open Issues
      No wide support of IPA in speech engines
          Changes are slowly under way
          The Phonetic Alphabet Registry will open doors to other alphabets in a controlled and interoperable way
      Integration in ASR/TTS
          SSML 1.1 will interoperate with PLS 1.0
          SRGS 1.0 is still missing support for the role attribute of PLS 1.0
      No matching algorithm inside PLS, because it is mainly a data format
      http://www.w3.org/TR/pronunciation-lexicon/
  • 38. Pronunciation Alphabets: IPA, SAMPA
  • 39. International Phonetic Alphabet
      Pronunciation is represented by a phonetic alphabet
          Standard phonetic alphabets
              International Phonetic Alphabet (IPA)
              Well-known phonetic alphabets: SAMPA (ASCII based, simple to write), Pinyin (Mandarin Chinese), JEITA (Japanese), etc.
          Proprietary phonetic alphabets
      International Phonetic Alphabet (IPA)
          Created by the International Phonetic Association (active since 1886), a collaborative effort by all the major phoneticians around the world
          Universally agreed system of notation for the sounds of languages
          Covers all languages
          Requires Unicode to write it
          Normatively referenced by PLS
  • 40. IPA – Chart. The International Phonetic Association (IPA) was founded in 1886 and is the major international association of phoneticians. The IPA alphabet provides symbols that make possible the phonemic transcription of all known languages. IPA characters can be encoded in Unicode by supplementing ASCII with characters from other ranges, particularly IPA Extensions (0250–02AF) and Latin Extended-A (0100–017F). See the detailed charts: http://www.unicode.org/charts
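The Unicode ranges named on this slide can be checked mechanically. The following is a minimal sketch, not part of the talk: `classify` is a hypothetical helper, and the Spacing Modifier Letters range (02B0–02FF, where the stress mark lives) is added here as an assumption beyond the two ranges the slide lists.

```python
# Code point ranges; the first two come from the slide, the third is an added assumption.
IPA_EXTENSIONS = range(0x0250, 0x02B0)            # U+0250..U+02AF
LATIN_EXTENDED_A = range(0x0100, 0x0180)          # U+0100..U+017F
SPACING_MODIFIER_LETTERS = range(0x02B0, 0x0300)  # U+02B0..U+02FF


def classify(ch):
    """Name the Unicode range a single character of a transcription falls in."""
    cp = ord(ch)
    if cp < 0x80:
        return "ASCII"
    if cp in IPA_EXTENSIONS:
        return "IPA Extensions"
    if cp in LATIN_EXTENDED_A:
        return "Latin Extended-A"
    if cp in SPACING_MODIFIER_LETTERS:
        return "Spacing Modifier Letters"
    return "other"


def ranges_used(pronunciation):
    """Return the set of ranges needed to write a whole pronunciation string."""
    return {classify(ch) for ch in pronunciation}
```

For example, the velar nasal "ŋ" (U+014B) is classified as Latin Extended-A and "ʒ" (U+0292) as IPA Extensions, which shows why a plain ASCII toolchain cannot carry IPA transcriptions.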
  • 41. Phonetic Alphabets – Issues. The real problem is how to write pronunciations reliably unless you are a trained phonetician. There are issues with fonts, authoring and browsers, but Unicode fonts today support the IPA extensions, see: http://www.phon.ucl.ac.uk/home/wells/phoneticsymbols.htm There are very few tools that help you write pronunciations and let you listen to what you have written. Pronunciations should be made available in IPA or in other general phonetic alphabets.
  • 42. Voice Dialog languages: VoiceXML 2.0 VoiceXML 2.1
  • 43. VoiceXML 2.0 – Features, Elements
      Menus, forms, sub-dialogs: <menu>, <form>, <subdialog>
      Input:
        Speech recognition: <grammar>
        Recording: <record>
        Keypad: <grammar mode="dtmf">
      Output:
        Audio files: <audio>
        Text-To-Speech: <prompt>
      Variables (ECMA-262): <var>, <assign>, <script>, scoping rules
      Events: <nomatch>, <noinput>, <help>, <catch>, <throw>
      Transition and submission: <goto>, <submit>
      Telephony:
        Connection control: <transfer>, <disconnect>
        Telephony information
      Platform specifics: <object>
      Performance: Fetch, Properties
    http://www.w3.org/TR/voicexml20/
  • 44. VoiceXML 2.0 – Execution Model
      Execution is synchronous; only the disconnect event is handled (somewhat) asynchronously.
      Execution is always within a single dialog (<form> or <menu>); the Form Interpretation Algorithm selects the next <field>.
      Prompts are queued: they are played only when a waiting state is reached, and before a fetchaudio is started.
      Processing is always in one of two states: waiting for input in an input item (<field>, <record>, <transfer>, etc.), or transitioning between input items in response to an input.
      Event-driven: user input events (<nomatch>, <noinput>), a generalized event mechanism (<catch>, <throw>), call events (connection.*), and error events (error.*).
    http://www.w3.org/TR/voicexml20/
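The execution model can be approximated with a heavily simplified loop in the spirit of the Form Interpretation Algorithm. This is a sketch under stated assumptions, not the normative algorithm: `run_form`, the field tuples, and the reprompt texts are hypothetical, a "grammar" is reduced to a set of accepted strings, and prompt counters, guard conditions and mixed initiative are left out.

```python
from collections import deque

_HANGUP = object()  # sentinel: the caller hung up (no more input will arrive)


def run_form(fields, inputs):
    """fields: list of (name, prompt, accepted_inputs); inputs: user turns, None = silence."""
    values = {name: None for name, _, _ in fields}
    queue = deque()   # prompts wait here until a waiting state is reached
    log = []
    turns = iter(inputs)
    while True:
        # Select: first input item that is still unfilled, in document order.
        item = next((f for f in fields if values[f[0]] is None), None)
        if item is None:
            break
        name, prompt, accepted = item
        # Collect: queue the prompt, then flush the queue on entering the waiting state.
        queue.append(prompt)
        while queue:
            log.append("play: " + queue.popleft())
        heard = next(turns, _HANGUP)
        # Process: turn the result into an event or fill the field.
        if heard is _HANGUP:
            log.append("event: connection.disconnect.hangup")
            break
        if heard is None:
            log.append("event: noinput")
            queue.append("Sorry, I did not hear you.")
        elif heard not in accepted:
            log.append("event: nomatch")
            queue.append("Sorry, I did not understand.")
        else:
            values[name] = heard
    return values, log


CITIES = {"boston", "denver"}
FIELDS = [("origin", "Where from?", CITIES), ("destination", "Where to?", CITIES)]
VALUES, LOG = run_form(FIELDS, [None, "paris", "boston", "denver"])
```

Note how the apology queued by a noinput or nomatch handler is not played immediately: it is emitted together with the field prompt the next time the loop reaches a waiting state.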
  • 45. VoiceXML 2.1 – Extended Features
      Dynamically referencing grammars and scripts: <grammar srcexpr="…">, <script srcexpr="…">
      Record the user's utterance during form filling: recordutterance property; new shadow variables recording, recordingsize, recordingduration
      Detect barge-in during prompt playback (SSML <mark>): new markexpr attribute; new shadow variables markname and marktime
      Fetch XML data without a transition: <data>, using a read-only subset of DOM
      Dynamically concatenate prompts: <foreach> iterates through ECMAScript arrays and executes its content
      Send data upon disconnect: <disconnect namelist="…">
      Additional transfer type: <transfer type="consultation">
    http://www.w3.org/TR/voicexml21/
  • 46. VoiceXML Applications. Static VoiceXML applications: the VoiceXML page is always the same, and so is the user experience; there is no personalization or customization. Dynamic VoiceXML applications: the user experience is customized, after authentication (PIN) or using caller-id or SIP-id; they are data driven, with dynamic pages generated at runtime (e.g. JSP, ASP, etc.). http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
  • 47. A Drawback of VoiceXML 2.0. The transition from one VoiceXML page to another is a costly activity: fetch the new page, if not cached; parse the page; initialize the context, possibly loading and initializing a new application root document; load or pre-compile scripts. Transitions are the only way to return data to the Web application (if the VoiceXML is dynamic), and pages must be created to include dynamic data. VoiceXML 2.1 addresses part of this drawback by feeding dynamic data to a running VoiceXML page. http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
  • 48. Advantages of VoiceXML 2.1 – AJAX. Two of the eight new features in VoiceXML 2.1 help to create more dynamic VoiceXML applications: the <data> element and the <foreach> element. A static VoiceXML document can fetch user-specific data at runtime, without changing the VoiceXML document. The <data> element allows retrieval of arbitrary XML data without VoiceXML document transitions; the returned XML data is accessible through a subset of DOM primitives. <foreach> extends prompts to allow iteration over a dynamic array of information, creating a dynamic prompt. This is similar to AJAX programming for HTML services: it decouples the presentation layer (VoiceXML) from the business logic (accessed via <data>). http://www.w3.org/TR/voicexml21/
  • 49. VoiceXML 2.1 – <data> Element
    Attributes:
      name: the variable to be filled with the DOM of the retrieved data
      src or srcexpr: the URI of the location of the XML data
      namelist: the list of variables to be submitted
      method: either 'get' or 'post'
      enctype: media encoding
      fetch and caching attributes
    Like <var>, it may appear in executable content or as a child of <form> or <vxml>.
    The value of name must be a declared variable; the platform fills that variable with the DOM of the fetched XML data.
    The <data> element is synchronous (the service waits until the data has been fetched).
    http://www.w3.org/TR/voicexml21/
  • 50. VoiceXML 2.1 – <foreach> Element
    Attributes:
      array: an ECMAScript expression that must evaluate to an ECMAScript array
      item: the variable that stores the element being processed
    <foreach> allows the application to iterate over an ECMAScript array and to execute the content.
    <foreach> may appear in executable content (all executable content elements may appear as content of <foreach>) and in <prompt> (with restrictions on the content).
    <foreach> allows sophisticated concatenation of prompts.
    http://www.w3.org/TR/voicexml21/
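The AJAX-like pattern enabled by <data> and <foreach> can be mimicked outside a voice browser. The sketch below is only an analogy in Python, not VoiceXML: `data`, `foreach_prompt` and `fake_fetch` are hypothetical stand-ins, and the flight records are invented sample data.

```python
import xml.etree.ElementTree as ET

# Invented sample payload, standing in for what the Web application would return.
FLIGHTS_XML = """<flights>
  <flight><number>AZ 610</number><gate>B12</gate></flight>
  <flight><number>LH 403</number><gate>A4</gate></flight>
</flights>"""


def data(fetch, src):
    """Stand-in for <data>: fetch XML synchronously and expose it as a parsed tree."""
    return ET.fromstring(fetch(src))


def foreach_prompt(array, render):
    """Stand-in for <foreach> inside <prompt>: one prompt fragment per array item."""
    return " ".join(render(item) for item in array)


def fake_fetch(src):
    # A real platform would issue an HTTP request here; the page itself never changes.
    return FLIGHTS_XML


dom = data(fake_fetch, "http://www.example.com/flights.xml")
# Script step: turn the read-only tree into a plain array of records.
flights = [
    {"number": f.findtext("number"), "gate": f.findtext("gate")}
    for f in dom.findall("flight")
]
PROMPT = foreach_prompt(
    flights, lambda f: f"Flight {f['number']} departs from gate {f['gate']}."
)
```

The same static "page" (the code above) produces a different prompt whenever the fetched data changes, which is the decoupling of presentation from business logic that the slides describe.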
  • 51. VoiceXML – Final Remarks. The landscape for speech application development has changed: virtually all IVRs today support VoiceXML. New options related to VoiceXML: SIP-based VoiceXML platforms (Loquendo, Voxpilot, Voxeo, VoiceGenie); large-scale hosting of speech applications (TellMe, Voxeo); development tools (VoiceObjects, Audium, SpeechVillage, Syntellect, etc.). Further changes may come from the adoption of CCXML. However, mainly system-driven applications are actually deployed, and new challenges, such as incorporating more powerful dialog strategies and mixed initiative, are under discussion. http://www.w3.org/TR/voicexml20/ http://www.w3.org/TR/voicexml21/
  • 52. VoiceXML Resources
      Voice Browser Working Group (spec, FAQ, implementations, resources): http://www.w3.org/Voice/
      VoiceXML Forum site (resources, education, interest groups): http://www.voicexml.org/
      VoiceXML Forum Review: http://www.voicexmlreview.org/ (interesting articles related to VoiceXML and more; example code in the sections "First Words" and "Speak & Listen")
      Ken Rehor's World of VoiceXML: http://www.kenrehor.com/voicexml
      Online documentation related to VoiceXML platforms: Loquendo Café, Voxeo (http://www.vxml.org/), TellMe, VoiceGenie
      Many books on VoiceXML: Jim Larson, "VoiceXML Introduction to Developing Speech Applications", Prentice-Hall, 2002; A. Hocek, D. Cuddihy, "Definitive VoiceXML", Prentice-Hall, 2002.
  • 53. Call Control: CCXML 1.0
  • 54. CCXML 1.0 – Highlights: asynchronous event processing; acceptance or refusal of an incoming call; management of different types of call transfer; outbound call activation (interaction with an external entity); use of ECMAScript, adding scripting capabilities to call control applications; VoiceXML modularization; conferencing management.
  • 55. CCXML 1.0 – Elements Relationship
  • 56. CCXML 1.0 – Incoming Call
    CCXML document (event catching and processing):
      <?xml version="1.0" encoding="UTF-8"?>
      <ccxml version="1.0">
        […]
        <transition event="connection.alerting">
          […]
        </transition>
        <transition event="connection.disconnected">
          […]
        </transition>
        …
    Diagram: the CCXML Interpreter receives connection.alerting and exposes it to the document as event$:
      name: 'connection.alerting'; connectionid: '0239023901903993'; eventid: '00001'; …
    http://www.w3.org/TR/ccxml
  • 57. CCXML 1.0 – connection.alerting Event. Basic telephony information is retrieved on the alerting event and is available in the CCXML document: local URI, remote URI, protocol used, redirection info, etc. Based on this information, CCXML can accept or refuse the incoming call, even before contacting the dialog server. Any error that occurs during the phone call can be managed by the CCXML service (connection.failed, error.connection events). Diagram: the Call Control Adapter sends connection.alerting to the CCXML Interpreter, which analyzes the event$ content and answers with <accept/> or <reject/>. http://www.w3.org/TR/ccxml
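The accept-or-reject decision can be pictured as a small function over the event content. This is an illustrative sketch only: `on_alerting` and `BLOCKED_CALLERS` are hypothetical, the event is modelled as a plain dictionary, and the `connection.remote` field name is an assumption about how the remote URI is exposed in event$.

```python
# Hypothetical screening policy; a real service might query a database instead.
BLOCKED_CALLERS = {"sip:spam@example.com"}


def on_alerting(event):
    """Decide what to do with an incoming call before any dialog is started.

    Returns the name of the CCXML action to perform and the connection it applies to.
    """
    remote = event["connection"]["remote"]
    action = "reject" if remote in BLOCKED_CALLERS else "accept"
    return action, event["connectionid"]


ALERTING = {
    "name": "connection.alerting",
    "connectionid": "0239023901903993",
    "connection": {"local": "sip:ivr@example.com", "remote": "sip:alice@example.com"},
}
DECISION = on_alerting(ALERTING)
```

Because the decision is taken in the call control layer, a rejected call never consumes a dialog server resource.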
  • 58. CCXML 1.0 – How to activate a new dialog. CCXML actions: receive the alerting event from the Call Control Adapter; ask the dialog server to prepare a new dialog; wait for the preparation; if the dialog has been successfully prepared, accept the call; ask the dialog server to start the prepared dialog. Message sequence among CCXML Interpreter, Call Control Adapter and VoiceXML Interpreter: alerting; prepare a new dialog; dialog prepared; call accepted; connected; start the prepared dialog; dialog started.
  • 59. Call transfer. CCXML supports call transfer in different modes: "bridge", "blind", "consultation". Depending on the features of each mode, the CCXML language drives the expected interaction with the Call Control Adapter to perform the transfer correctly. During the different phases of setting up the transfer, CCXML can receive any asynchronous event and handle it correctly, interrupting the call if requested. Message sequence among CCXML Interpreter, Call Control Adapter and VoiceXML Interpreter: performing a transfer; command1; answer1; […]; transfer complete; …
  • 60. External Events. The CCXML Interpreter Context can receive events from any external entity able to use the HTTP protocol. Events generated in this way must be sent to CCXML with an HTTP POST command. Such an event can be addressed to a new session, whose creation must be requested, or to an existing session, by specifying its ID in the request. Message sequence between External Entity and CCXML Interpreter: basic http event; event management; event management result. http://www.w3.org/TR/ccxml
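An external entity could build such a request as sketched below. Everything specific here is an assumption for illustration: the endpoint URL, the event name, the `dialogid` payload field, and the form-field names `name` and `sessionid`, which should be checked against the platform's documentation. The request is only constructed, never sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_event_request(endpoint, name, sessionid=None, **params):
    """Build (but do not send) an HTTP POST that injects an event into CCXML.

    endpoint, and the field names 'name' and 'sessionid', are assumptions.
    """
    fields = {"name": name, **params}
    if sessionid is not None:
        # Target an existing session; omit it when a new session is requested.
        fields["sessionid"] = sessionid
    body = urlencode(fields).encode("ascii")
    return Request(
        endpoint,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


REQ = build_event_request(
    "http://ccxml.example.com/events",   # hypothetical platform endpoint
    "dialog.terminate.request",          # hypothetical application-defined event
    sessionid="s1",
    dialogid="d42",
)
```

Passing `REQ` to `urllib.request.urlopen` would deliver the event; the CCXML document then needs a transition for that event name.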
  • 61. External event on a new session: the Outbound Call. A request arrives at Call Control from an external entity. The CCXML service associated with the received event is started, and a set of operations among Call Control Adapter, Call Control and Dialog Server is activated: this is how the outbound call is placed. Message sequence among Call Control Adapter, CCXML Interpreter and VoiceXML Interpreter: outbound call request; create a call; connection progressing; …; prepare a dialog; prepared; connection connected; start the prepared dialog.
  • 62. External event on a session: dialog termination request. An external entity performs an HTTP POST request towards the CCXML Interpreter Context, specifying a sessionid and requesting the termination of a particular dialog. CCXML checks the session id; if it is valid, the CCXML Interpreter injects the received event into the session. The CCXML service has a transition on that event and terminates the dialog with the given dialog identifier. Message sequence among Call Control Adapter, CCXML Interpreter and VoiceXML Interpreter: dialog termination request; dialogterminate(dialogid); dialog.exit; what follows depends on how the service handles dialog.exit, e.g. disconnect(connId) or dialogprepare.
  • 63. Loading different CCXML documents: <fetch> and <goto> elements
    <fetch> asynchronously fetches the content identified by its attributes; <goto> moves into the fetched document, if it was loaded successfully.
    Benefits: modularization, source exemplification, more readability.
      <fetch next="'http://../Fetch/doc1.ccxml'" type="'application/ccxml+xml'" fetchid="result"/>
    Sequence: fetch the document "doc1.ccxml"; on fetch.done, goto into the new document, where the first event raised is ccxml.loaded; on error.fetch, continue to work on the same dialog.
    http://www.w3.org/TR/ccxml
  • 64. Simple CCXML Document
      <?xml version="1.0" encoding="UTF-8"?>
      <ccxml version="1.0" xmlns="http://www.w3.org/2002/09/ccxml">
        <var name="currentState"/>
        <var name="myDialogId"/>
        <var name="myConnId"/>
        <eventprocessor statevariable="currentState">
          <transition event="connection.alerting">
            <assign name="myConnId" expr="event$.connectionid"/>
            <accept connectionid="event$.connectionid"/>
          </transition>
          <transition event="connection.connected">
            <dialogstart src="'http://www.example.com/flight.vxml'"
                         connectionid="myConnId" dialogid="myDialogId"/>
          </transition>
          <transition event="dialog.started">
            <log expr="'VoiceXML appl is running now'"/>
          </transition>
          <transition event="connection.disconnected">
            <dialogterminate dialogid="myDialogId"/>
          </transition>
          <transition event="dialog.exit">
            <disconnect connectionid="myConnId"/>
          </transition>
          <transition event="*">
            <log expr="'Closing, unexpected:'+ event$.name"/>
            <exit/>
          </transition>
        </eventprocessor>
      </ccxml>
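The behaviour of this document can be traced with a small dispatcher that picks the first transition, in document order, whose event pattern matches. This is a sketch only: the `TRANSITIONS` table is transcribed by hand from the document above, `select` is a hypothetical helper, and `fnmatchcase` is used as an approximation of CCXML event-name matching rather than the matching rules of the specification.

```python
from fnmatch import fnmatchcase

# (event pattern, actions) in document order, transcribed from the CCXML above.
TRANSITIONS = [
    ("connection.alerting", ["assign", "accept"]),
    ("connection.connected", ["dialogstart"]),
    ("dialog.started", ["log"]),
    ("connection.disconnected", ["dialogterminate"]),
    ("dialog.exit", ["disconnect"]),
    ("*", ["log", "exit"]),
]


def select(event_name):
    """Return the actions of the first transition matching the event name."""
    for pattern, actions in TRANSITIONS:
        if fnmatchcase(event_name, pattern):
            return actions
    return []


# A normal call in which the caller hangs up while the dialog is running.
CALL = [
    "connection.alerting",
    "connection.connected",
    "dialog.started",
    "connection.disconnected",
]
TRACE = [(name, select(name)) for name in CALL]
```

Any event without a dedicated transition, for example error.fetch, falls through to the final wildcard transition, which logs the event and exits the session.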
  • 65. CCXML 1.0 – Next Steps. The CCXML specification is a Last Call Working Draft; all feature requests and clarifications have been addressed. An Implementation Report test suite is under development. The specification is very close to being published as a W3C Candidate Recommendation. Internal and external companies will be invited to submit implementation reports on their CCXML platforms. After that, the CCXML 1.0 specification will be able to become a Proposed Recommendation and then a W3C Recommendation. http://www.w3.org/TR/ccxml
  • 66. Speech Interface Framework Tour Complete!
  • 67. Speech Interface Framework – End of 2009 (by Jim Larson). Diagram of the framework. Components: User, Telephone System, ASR, DTMF Tone Recognizer, Language Understanding, Context Interpretation, Dialog Manager, World Wide Web, Media Planning, Language Generation, TTS, Pre-recorded Audio Player, Reusable Components. Specifications: Speech Recognition Grammar Spec. (SRGS), N-gram Grammar ML, Semantic Interpretation for Speech Recognition (SISR), Natural Language Semantics ML, EMMA 1.0, VoiceXML 2.0, VoiceXML 2.1, Pronunciation Lexicon Specification (PLS), Speech Synthesis Markup Language (SSML), Call Control XML (CCXML).
  • 68. Architectural Changes. Diagram of the VoiceXML architecture: the user interacts with the VoiceXML platform through ASR/DTMF (.grxml/.gram, .pls) and TTS/audio (.ssml, .wav/.mp3, .pls); the VoiceXML browser fetches .vxml pages from the Web application over HTTP.
  • 69. VoxNauta – Internal Architecture
  • 70. Loquendo MRCP Server/LSS 7.0 Architecture. Diagram labels: load balancer; RTSP (MRCPv1), SIP (SDP) and MRCPv2 signalling, RTP media; SIP, SDP, RTSP, MRCPv1 and MRCPv2 parsers; MRCP v1/v2 server interface; management (SNMP) with graphic management console; configuration and config files; logger and log files; audio provider API; NLSML / EMMA; TTS and ASR interface and APIs over LASR-SV, LASR and LTTS; Win32/Linux OS.
  • 71. IETF MRCP Protocols. The Media Resource Control Protocols (MRCP) are IETF standards. MRCPv1 is RFC 4463, http://www.ietf.org/rfc/rfc4463.txt, based on RTSP/RTP. MRCPv2 is an Internet Draft, http://tools.ietf.org/html/draft-ietf-speechsc-mrcpv2-17, based on SIP/RTP and offering new audio recording and speaker verification functionalities. MRCP is an optimized client-server solution for the large-scale deployment of speech technologies in the telephony field, such as call centers, CRM, news and email reading, self-service applications, etc. It allows a standard interface to speech technologies in all IVR platforms. For more information read: Dave Burke, Speech Processing for IP Networks: Media Resource Control Protocol (MRCP), Wiley.
  • 72. VoiceXML in a Call Center. Diagram labels: fixed/mobile network; PBX; optional voice gateway for non-SIP PBX; VOXNAUTA IVR; ACD server; CTI server; Web/data server; operators.
  • 73. VoiceXML in the IMS Architecture. Diagram labels: fixed/mobile network; voice gateway (TDM protocols, SIP protocols, RTP); IP network; VOXNAUTA as MRF; VoiceXML on HTTPS; application server.
  • 74. Overview. A Bit of History. W3C Speech Interaction Framework Today: ASR/DTMF; TTS; Lexicons; Voice Dialog and Call Control; Voice Platforms and Next Evolutions. W3C Multimodal Interaction Today: MMI Architecture; EMMA and InkML; A language for Emotions. Next Future.
  • 75. Modes, Modalities and Technologies: speech; audio; stylus; touch; accelerometer; keyboard/keypad; mouse/touchpad; camera; geolocation; handwriting recognition; speaker verification; signature verification; fingerprint identification; …
  • 76. Complement and Supplement
      Speech: transient; linear; hands- and eyes-free; suffers from noise.
      Visual: persistent; spatial; requires the eyes; suffers from light conditions.
    Users can choose among different modalities or mix them, adapting to different social or environmental conditions or to user preference.
  • 77. GUI VUI MUI or MMUI
  • 78. MMI has an Intrinsic Complexity. Diagram: many inputs converge on the Interaction Manager, grouped as user intent, sensor, identification and recording: speech, text, mouse, handwriting, drawing, geolocation, accelerometer, vital signs, fingerprint, face identification, speaker verification, video, photograph, audio recording. Deborah Dahl, Voice Search 2009
  • 79. MMI can Include Many Different Technologies. Diagram: the Interaction Manager is surrounded by touchscreen, accelerometer, speech recognition, geolocation, fingerprint recognition, keypad and handwriting recognition. Deborah Dahl, Voice Search 2009
  • 80. Uniform Representation for MMI. Getting everything to work together is complicated. One simplification is to represent the same information from different modalities in the same format. This requires a common language for representing the same information from different modalities: EMMA (Extensible MultiModal Annotation) 1.0, a uniform representation for multimodal information.
  • 81. Diagram: touchscreen, accelerometer, speech recognition, geolocation, fingerprint recognition, keypad and handwriting recognition each send EMMA to the Interaction Manager. Deborah Dahl, Voice Search 2009
  • 82. EMMA Structural Elements
    EMMA elements provide containers for application semantics and for multimodal annotation: emma:emma, emma:interpretation, emma:one-of, emma:group, emma:sequence, emma:lattice.
      <emma:emma …>
        <emma:one-of>
          <emma:interpretation> … </emma:interpretation>
          <emma:interpretation> … </emma:interpretation>
        </emma:one-of>
      </emma:emma>
    http://www.w3.org/TR/emma/
  • 83. EMMA Annotations
    Characteristics and processing of input, e.g.:
      emma:tokens: tokens of input
      emma:process: reference to processing
      emma:no-input: lack of input
      emma:uninterpreted: uninterpretable input
      emma:lang: human language of input
      emma:signal: reference to signal
      emma:media-type: media type
      emma:confidence: confidence scores
      emma:source: annotation of input source
      emma:start, emma:end: timestamps (absolute/relative)
      emma:medium, emma:mode, emma:function: medium, mode, and function of input
      emma:hook: hook
    http://www.w3.org/TR/emma/
  • 84. EMMA 1.0 – Example Travel Application. INPUT: "I want to go from Boston to Denver on March 11" http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
  • 85. EMMA 1.0 – Same meaning
    Speech:
      <emma:interpretation medium="acoustic" mode="voice" id="int1">
        <origin>Boston</origin>
        <destination>Denver</destination>
        <date>11032009</date>
      </emma:interpretation>
    Mouse:
      <emma:interpretation medium="tactile" mode="gui" id="int1">
        <origin>Boston</origin>
        <destination>Denver</destination>
        <date>11032009</date>
      </emma:interpretation>
    http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
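The benefit of a uniform representation is that a consumer can read the application payload without caring which modality produced it. The sketch below illustrates this under some assumptions: the documents are wrapped in an emma:emma root with the EMMA namespace, the medium and mode annotations are written with the emma: prefix (as in the biometrics example of the talk), and `emma_doc`, `payload` and `modality` are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"


def emma_doc(medium, mode):
    """Build a small EMMA document for the travel example with the given modality."""
    return f"""<emma:emma version="1.0" xmlns:emma="{EMMA_NS}">
  <emma:interpretation id="int1" emma:medium="{medium}" emma:mode="{mode}">
    <origin>Boston</origin>
    <destination>Denver</destination>
    <date>11032009</date>
  </emma:interpretation>
</emma:emma>"""


def _interpretation(doc):
    return ET.fromstring(doc).find(f"{{{EMMA_NS}}}interpretation")


def payload(doc):
    """Application semantics only: the same dictionary whatever the modality."""
    return {child.tag: child.text for child in _interpretation(doc)}


def modality(doc):
    """The annotations that record how the input was actually produced."""
    interp = _interpretation(doc)
    return interp.get(f"{{{EMMA_NS}}}medium"), interp.get(f"{{{EMMA_NS}}}mode")


SPEECH = emma_doc("acoustic", "voice")
MOUSE = emma_doc("tactile", "gui")
```

An interaction manager written against `payload` works unchanged when a new modality is added, while `modality` remains available for logging or for modality-specific confirmation strategies.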
  • 86. EMMA 1.0 – Handwriting Input
      <emma:interpretation medium="tactile" mode="ink" id="int1">
        <origin>Boston</origin>
        <destination>Denver</destination>
        <date>11032009</date>
      </emma:interpretation>
    http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
  • 87. EMMA 1.0 – Biometrics Input
    Photograph input:
      <emma:emma version="1.0">
        <emma:interpretation id="int1"
            emma:confidence=".75"
            emma:medium="visual"
            emma:mode="photograph"
            emma:verbal="false"
            emma:function="identification">
          <person>12345</person>
          <name>Mary Smith</name>
        </emma:interpretation>
      </emma:emma>
    Voice input:
      <emma:emma version="1.0">
        <emma:interpretation id="int1"
            emma:confidence=".80"
            emma:medium="acoustic"
            emma:mode="voice"
            emma:verbal="false"
            emma:function="identification">
          <person>12345</person>
          <name>Mary Smith</name>
        </emma:interpretation>
      </emma:emma>
    http://www.w3.org/TR/emma/ Deborah Dahl, Voice Search 2009
  • 88. EMMA 1.0 – Representing Lattices. Speech recognizers, handwriting recognizers and other input processing components may provide lattice output: a graph encoding a range of possible recognition results or interpretations. Diagram: a lattice over nodes 1 to 8 with arcs labelled flights, to, boston/austin, from, portland/oakland, today, please, tomorrow. From Michael Johnston, AT&T Research http://www.w3.org/TR/emma/
  • 89. EMMA 1.0 – Representing Lattices
    Lattices can be represented using EMMA elements: <emma:lattice emma:initial="?" emma:final="?"> and <emma:arc emma:from="?" emma:to="?">
      <emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
        <emma:interpretation>
          <emma:lattice emma:initial="1" emma:final="8">
            <emma:arc emma:from="1" emma:to="2">flights</emma:arc>
            <emma:arc emma:from="2" emma:to="3">to</emma:arc>
            <emma:arc emma:from="3" emma:to="4">boston</emma:arc>
            <emma:arc emma:from="3" emma:to="4">austin</emma:arc>
            <emma:arc emma:from="4" emma:to="5">from</emma:arc>
            <emma:arc emma:from="5" emma:to="6">portland</emma:arc>
            <emma:arc emma:from="5" emma:to="6">oakland</emma:arc>
            <emma:arc emma:from="6" emma:to="7">today</emma:arc>
            <emma:arc emma:from="7" emma:to="8">please</emma:arc>
            <emma:arc emma:from="6" emma:to="8">tomorrow</emma:arc>
          </emma:lattice>
        </emma:interpretation>
      </emma:emma>
    From Michael Johnston, AT&T Research http://www.w3.org/TR/emma/
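To see how compactly a lattice encodes alternatives, the sketch below enumerates every word sequence from the initial to the final node. The `ARCS` list is transcribed by hand from the emma:arc elements above, and `paths` is a hypothetical helper that assumes the lattice is acyclic.

```python
# (from node, to node, word), transcribed from the emma:arc elements above.
ARCS = [
    (1, 2, "flights"), (2, 3, "to"),
    (3, 4, "boston"), (3, 4, "austin"),
    (4, 5, "from"),
    (5, 6, "portland"), (5, 6, "oakland"),
    (6, 7, "today"), (7, 8, "please"),
    (6, 8, "tomorrow"),
]


def paths(arcs, node, final):
    """Yield every word sequence leading from node to final (acyclic lattices only)."""
    if node == final:
        yield []
        return
    for src, dst, word in arcs:
        if src == node:
            for rest in paths(arcs, dst, final):
                yield [word] + rest


SENTENCES = [" ".join(p) for p in paths(ARCS, 1, 8)]
```

Ten arcs stand for eight complete hypotheses (two destinations, two origins, two endings), which would otherwise need eight separate emma:interpretation elements inside an emma:one-of.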
  • 90. EMMA in Multimodal Framework. Diagram showing where EMMA sits in the framework: http://www.w3.org/TR/mmi-framework
  • 91. InkML 1.0 – Digital Ink
    Ink Markup Language (InkML), http://www.w3.org/TR/InkML
    Data format for representing digital ink (pen, stylus, etc.).
    Allows the input and processing of handwriting, gestures, sketches, music, etc.
      <ink>
        <trace> 10 0, 9 14, 8 28, 7 42, 6 56, 6 70, 8 84, 8 98, 8 112, 9 126, 10 140, 13 154, 14 168, 17 182, 18 188, 23 174, 30 160, 38 147, 49 135, 58 124, 72 121, 77 135, 80 149, 82 163, 84 177, 87 191, 93 205 </trace>
        <trace> 130 155, 144 159, 158 160, 170 154, 179 143, 179 129, 166 125, 152 128, 140 136, 131 149, 126 163, 124 177, 128 190, 137 200, 150 208, 163 210, 178 208, 192 201, 205 192, 214 180 </trace>
        <trace> 227 50, 226 64, 225 78, 227 92, 228 106, 228 120, 229 134, 230 148, 234 162, 235 176, 238 190, 241 204 </trace>
        <trace> 282 45, 281 59, 284 73, 285 87, 287 101, 288 115, 290 129, 291 143, 294 157, 294 171, 294 185, 296 199, 300 213 </trace>
        <trace> 366 130, 359 143, 354 157, 349 171, 352 185, 359 197, 371 204, 385 205, 398 202, 408 191, 413 177, 413 163, 405 150, 392 143, 378 141, 365 150 </trace>
      </ink>
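Traces in this simple form are easy to turn into coordinates. The following sketch assumes the default two-channel layout (X then Y, comma-separated points, plain decimal values) and no XML namespace, as on the slide; `parse_trace` and `bounding_box` are hypothetical helpers, and the sample uses only the first points of two of the traces above.

```python
import xml.etree.ElementTree as ET

# Shortened excerpt: the first points of the first and third trace from the slide.
INK = """<ink>
  <trace> 10 0, 9 14, 8 28, 7 42, 6 56 </trace>
  <trace> 227 50, 226 64, 225 78 </trace>
</ink>"""


def parse_trace(text):
    """Turn 'x y, x y, ...' into a list of (x, y) float tuples."""
    return [
        tuple(float(value) for value in point.split())
        for point in text.split(",")
        if point.strip()
    ]


def bounding_box(traces):
    """Return (min_x, min_y, max_x, max_y) over all points of all traces."""
    xs = [x for trace in traces for x, _ in trace]
    ys = [y for trace in traces for _, y in trace]
    return min(xs), min(ys), max(xs), max(ys)


TRACES = [parse_trace(t.text) for t in ET.fromstring(INK).findall("trace")]
BOX = bounding_box(TRACES)
```

A handwriting recognizer would typically normalize the strokes with such a bounding box before classification; richer InkML features (extra channels, trace formats, value prefixes) are outside this sketch.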
  • 92. InkML 1.0 – Status and Advances. Rich annotation for ink: traces, trace formats and trace collections; contextual information; canvases; etc. The result of classifying InkML traces may be a semantic representation in EMMA 1.0. The current status is Last Call Working Draft; the next step will be Candidate Recommendation, with the release of an Implementation Report test suite. Interest from major industries is rising. http://www.w3.org/TR/InkML/
  • 93. MMI Architecture Specification. "Multimodal Architecture and Interfaces", W3C Working Draft, http://www.w3.org/TR/mmi-arch/ The Runtime Framework provides the basic infrastructure and controls communication among the constituents. The Interaction Manager (IM) coordinates Modality Components (MCs) by life-cycle events and contains the shared data (context). Communication between IM and MCs is event-based. Diagram: Delivery Context Component, Interaction Manager and Data Component on top of the Runtime Framework, connected through the Modality Component API to Modality Component 1 … Modality Component N. Ingmar Kliche, SpeechTEK 2008
  • 94. MMI Arch – Laboratory Implementation. Implementation of components using W3C markup languages. Diagram: Delivery Context Component, Interaction Manager (SCXML) and Data Component on top of the Runtime Framework; through the Modality Component API, an HTML modality component for the GUI and a VoiceXML modality component for the VUI. http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
  • 95. MMI Arch – Laboratory Implementation. SCXML-based Interaction Manager; VoiceXML + HTML modality components. Diagram: on the server, an SCXML interpreter with an HTTP I/O processor. GUI modality component: HTML browser on the client, Modality Component API over HTTP + XML (using AJAX). Voice modality component: CCXML/VoiceXML browser on the server with a telephony interface to the client phone, Modality Component API over HTTP + XML (EMMA). http://www.w3.org/TR/mmi-arch/ Ingmar Kliche, SpeechTEK 2008
  • 96. MMI Architecture – Open Issues: profiles; start-up, registration and delegation in a distributed environment; transport of events; extensibility of events. http://www.w3.org/TR/mmi-arch/
  • 97. Emotion in Wikipedia. From the Wikipedia definition: “An emotion is a mental and physiological state associated with a wide variety of feelings, thoughts, and behaviours. It is a prime determinant of the sense of subjective well-being and appears to play a central role in many human activities. As a result of this generality, the subject has been explored in many, if not all of the human sciences and art forms. There is much controversy concerning how emotions are defined and classified.” General goal: make interaction between humans and machines more natural for the humans. Machines should become able to register human emotions (and related states), to convey emotions (and related states), and to “understand” the emotional relevance of events.