Speech Recognition on
  embedded devices

      Louis-Marie Aubert
ECIT – Queen’s University Belfast

    DevDays – Belfast – April 24, 2009
What should we expect from
speech recognition?
Speech Recognition success?
•   Natural continuous speech
•   Real-time
•   Large vocabulary (up to 100,000 words)
•   No training (speaker independent)
•   Adaptive to speaker accent
•   Robust against
    – Background noise
    – Audio frontend imperfections
• N-best hypotheses with confidence value
What are the solutions on the
market?
Existing solutions
• Server-based

  – Telephony, IVR

  – Dictation (Health care industry)

  – Audio indexing


    Either offline or with significant delays
Existing solutions
• Desktop-based

  – Real-time dictation

  – Language learning

    Requires a good setup, powerful computer,
   quiet environment
    Very good accuracy, no training required
Existing solutions
• Embedded applications

  – Simple voice commands
    (‘Call-mum’ type command)

  – Disconnected word recognition

     Small vocabulary and lack
    of naturalness restrict the
    range of applications
Is it so difficult?
Technical challenge


[Diagram: Speech waveform → Speech Recognizer → Transcription ‘Hello world’]
Technical challenge

[Diagram: Speech waveform → Spectral Analyser → Acoustic feature vectors (~40 coeff. every 10 ms)]
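
As a rough illustration of the front end, the sketch below cuts a waveform into one analysis frame every 10 ms. Only the 10 ms step and the ~40 coefficients per frame come from the slide; the 16 kHz sample rate, the 25 ms window and the name `frame_signal` are assumptions made for the example, and the spectral analysis itself is left out.

```python
def frame_signal(samples, sample_rate=16000, hop_ms=10, window_ms=25):
    """Split a waveform into overlapping analysis frames, one every hop_ms.

    sample_rate and window_ms are illustrative assumptions; the slide only
    states that a feature vector is produced every 10 ms.
    """
    hop = sample_rate * hop_ms // 1000        # samples between frame starts
    window = sample_rate * window_ms // 1000  # samples per frame
    frames = []
    start = 0
    while start + window <= len(samples):
        frames.append(samples[start:start + window])
        start += hop
    return frames


# One second of silence stands in for real audio.
one_second = [0.0] * 16000
frames = frame_signal(one_second)

# Each frame would then be reduced to ~40 coefficients by the spectral analyser.
COEFFS_PER_FRAME = 40
coeffs_per_second = (1000 // 10) * COEFFS_PER_FRAME
```

At one vector every 10 ms, the recognizer receives 100 feature vectors per second, so roughly 4000 coefficients per second of speech.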
Technical challenge


[Diagram: Acoustic feature vectors → Recognizer (Senone calculation → Viterbi decoding) → Transcription ‘Hello world’. The recognizer uses Acoustic Models, a Phoneme Lexicon, a Word Lexicon and a Statistical Language Model.]
Technical challenge

Acoustic Models
• 4000 acoustic models
• Sub-acoustic unit
• Functions that score 10 ms of speech
• Sets of mean and variance vectors (40-long) of Gaussian mixtures (16)
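
A minimal sketch of what "a function that scores 10 ms of speech" can look like, assuming each acoustic model is a mixture of diagonal-covariance Gaussians with its own weights. The figures 4000, 16 and 40 are from the slide; the function names, the diagonal-covariance form and the untied parameter count are assumptions for illustration.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * total


def senone_score(x, weights, means, variances):
    """Log-likelihood of one frame under a Gaussian mixture (log-sum-exp)."""
    logs = [math.log(w) + log_gaussian_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    peak = max(logs)
    return peak + math.log(sum(math.exp(l - peak) for l in logs))


DIM, MIXTURES, MODELS = 40, 16, 4000  # figures quoted on the slide

# Toy check: a single unit-variance component centred on the frame.
frame = [0.0] * DIM
score = senone_score(frame, [1.0], [[0.0] * DIM], [[1.0] * DIM])

# Assumed storage: one mean vector, one variance vector and one weight
# per mixture component, with no sharing between models.
params_per_model = MIXTURES * (2 * DIM + 1)
total_params = MODELS * params_per_model
```

Under these assumptions the acoustic models alone hold about 5.2 million parameters, and every one of them may be touched for each 10 ms frame.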
Technical challenge

Phoneme
• 50 in English
• Differentiable sounds
• Represent a sequence of senones: HMM (Hidden Markov Model)

  ‘ah’:  ah1  ah2  ah3
  ‘l’:   l1   l2   l3
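
The three-state layout shown for ‘ah’ and ‘l’ can be sketched as a small graph. The state names follow the slide; the left-to-right topology with self-loops is the usual choice for phoneme HMMs but is an assumption here, and `phoneme_hmm` is a hypothetical helper.

```python
def phoneme_hmm(phoneme, n_states=3):
    """Build a left-to-right HMM skeleton for one phoneme.

    Returns the state names and the allowed transitions. Each state may
    repeat (self-loop) or advance to the next state; this topology is an
    assumption, the slide only shows three states per phoneme.
    """
    states = [f"{phoneme}{i}" for i in range(1, n_states + 1)]
    arcs = []
    for i, state in enumerate(states):
        arcs.append((state, state))              # stay in the same state
        if i + 1 < len(states):
            arcs.append((state, states[i + 1]))  # move to the next state
    return states, arcs


ah_states, ah_arcs = phoneme_hmm("ah")
l_states, l_arcs = phoneme_hmm("l")
```

The self-loops are what let one phoneme stretch over a variable number of 10 ms frames.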
Technical challenge

Triphone
• 2500 in English
• Differentiable sounds in their context: continuous speech

  ‘hh-ah+l’:  ah1  ah2  ah3
  ‘ah-l+ow’:  l1   l2   l3
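
The ‘left-centre+right’ naming used above can be generated from a plain phoneme sequence. This is a small sketch; `to_triphones` is a hypothetical helper, and dropping the missing context at the start and end of the sequence is an assumption (real systems often use a silence or word-boundary context instead).

```python
def to_triphones(phonemes):
    """Rewrite each phoneme as 'left-centre+right' using its neighbours."""
    triphones = []
    for i, centre in enumerate(phonemes):
        name = centre
        if i > 0:
            name = f"{phonemes[i - 1]}-{name}"   # left context
        if i + 1 < len(phonemes):
            name = f"{name}+{phonemes[i + 1]}"   # right context
        triphones.append(name)
    return triphones


hello_triphones = to_triphones(["hh", "ah", "l", "ow"])
```

For ‘hello’ this yields the two units shown on the slide, ‘hh-ah+l’ and ‘ah-l+ow’, plus the two edge units.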
Technical challenge


[Diagram: Acoustic feature vectors → Recognizer (Senone calculation → Viterbi decoding) → Transcription ‘Hello world’. The recognizer uses Acoustic Models, a Triphone Lexicon, a Word Lexicon and a Statistical Language Model.]
Technical challenge

Word
• Large vocabulary: 64000
• Represent a sequence of phonemes/triphones

  ‘hello’:  hh  ah  l  ow
  ‘world’:  w  er  l  d
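
A word lexicon of this kind is essentially a lookup table from words to phoneme sequences. The two entries below are the ones on the slide; the dictionary layout and the helper `words_to_phonemes` are assumptions for illustration.

```python
# Two entries copied from the slide; a real lexicon would hold ~64000 words.
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],
}


def words_to_phonemes(words, lexicon=LEXICON):
    """Concatenate the phoneme sequences of a list of words."""
    phonemes = []
    for word in words:
        phonemes.extend(lexicon[word.lower()])
    return phonemes


sequence = words_to_phonemes(["Hello", "world"])

# With three HMM states per phoneme, as on the phoneme slide:
state_count = 3 * len(sequence)
```

Even this two-word utterance expands to 8 phonemes and 24 HMM states, which hints at how quickly a 64000-word network grows.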
Technical challenge

Statistical language model
• Bi-gram / Tri-gram
• Gives the probability of a sequence of 2/3 words
• 64000 words lead to roughly 10 million states / 50 million arcs

[Diagram: arcs from ‘hello’ to ‘mum’ (0.3), ‘dad’ (0.2) and ‘world’ (0.05)]
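
The bigram arcs in the diagram can be read as a table of conditional probabilities. The three probabilities are taken from the slide; the table layout, the helper names and the floor value for unseen word pairs are assumptions for this sketch (real language models use proper back-off or smoothing).

```python
import math

# P(next word | previous word), values as shown on the slide.
BIGRAMS = {
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.2,
    ("hello", "world"): 0.05,
}


def sequence_log_prob(words, bigrams=BIGRAMS, floor=1e-6):
    """Sum of log bigram probabilities over consecutive word pairs.

    Unseen pairs get an arbitrary small floor probability (an assumption).
    """
    log_prob = 0.0
    for prev, word in zip(words, words[1:]):
        log_prob += math.log(bigrams.get((prev, word), floor))
    return log_prob


def most_likely_next(prev, bigrams=BIGRAMS):
    """Return the word with the highest probability after prev."""
    candidates = {w: p for (h, w), p in bigrams.items() if h == prev}
    return max(candidates, key=candidates.get)


best = most_likely_next("hello")
hello_world_lp = sequence_log_prob(["hello", "world"])
```

Taken alone, the language model prefers ‘hello mum’; ‘hello world’ only wins if the acoustic scores favour it strongly enough to outweigh the lower prior.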
Technical challenge


[Diagram: Acoustic feature vectors → Recognizer (Senone calculation → Viterbi decoding) → Transcription ‘Hello world’. The recognizer uses Acoustic Models, a Triphone Lexicon, a Word Lexicon and a Statistical Language Model.]

~ 25 million states / 250 million arcs
Technical challenge

Viterbi decoding
• Token passing algorithm
• 5000/10000 tokens to propagate every 10 ms
• Select the most promising tokens and output the associated sequence of:
  senones, triphones, words, sentence

[Diagram: decoding network of HMM states (v1 v2 v3, l1 l2 l3, ow1 ow2 ow3, s1 s2 s3, ey1 ey2 ey3, d1 d2 d3)]

~ 25 million states / 250 million arcs
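
Token passing can be sketched in a few lines: every 10 ms each live token is copied along the outgoing arcs of its state, the best token per state is kept, and only the most promising ones survive. The toy network, the emission scores and all names below are assumptions for illustration, not the decoder described in the talk.

```python
import math


def token_passing(frames, arcs, start, emit, max_tokens=10000):
    """Viterbi search by token passing with a simple top-N beam.

    frames: one observation per 10 ms step
    arcs:   {state: [(next_state, log_prob), ...]}
    emit:   function (state, frame) -> log score
    Assumes at least one arc is reachable at every step.
    """
    tokens = {start: (0.0, [])}  # state -> (score, path so far)
    for frame in frames:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, arc_lp in arcs.get(state, []):
                cand = score + arc_lp + emit(nxt, frame)
                # Keep only the best token arriving in each state.
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        # Beam pruning: keep the max_tokens highest-scoring tokens.
        ranked = sorted(new_tokens.items(),
                        key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(ranked[:max_tokens])
    state, (score, path) = max(tokens.items(), key=lambda kv: kv[1][0])
    return path, score


# Toy network: two states for sound "a", then one state for sound "b".
HALF = math.log(0.5)
ARCS = {
    "start": [("a1", 0.0)],
    "a1": [("a1", HALF), ("a2", HALF)],
    "a2": [("a2", HALF), ("b1", HALF)],
    "b1": [("b1", 0.0)],
}


def toy_emit(state, frame):
    """High score when the state's sound matches the observed label."""
    return math.log(0.9) if state[0] == frame else math.log(0.1)


path, score = token_passing(["a", "a", "b"], ARCS, "start", toy_emit)
```

With 5000 to 10000 tokens per 10 ms frame, a real decoder performs at least 500,000 to 1,000,000 token propagations per second, each touching the large state/arc network.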
Challenges in embedded systems
• Low computational resources
• Power consumption constraints
• Noisy environment, poor audio quality

     For a truly embedded speech recognition
    engine that works, we must move away from
    the pure software approach:
     • Make the best of all hardware acceleration available
     • Dedicated chip (accelerator) to offload the CPU and
       relax memory constraints
Why do we want speech
recognition on embedded
devices anyway?
Applications on mobiles
• Complement touch screen interface with
  speech interface
• Speech-enable existing mobile applications
  – Browse complex menus
  – Easily find items in large libraries,
    local or online (contacts, music…)
  – Browse Web and search maps
  – Games
  – Compose text-messages,
    emails…
Applications on mobiles
• Speech-enable mobile applications




       Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008
Applications on mobiles
• Key to safety when driving
  – Text-messaging
  – Satellite-Navigation function

• Voice Memo
  – Shopping list
  – Activity scheduler

• Market of Speech technology in embedded
  devices
  – $125 million in 2006
  – $500 million in 2010
    Opus Research report, March 2007
Other markets
• Developing countries
   – Access to information technology for illiterate people
       • Administrative tasks
       • Education
       • Social integration


• Health-care at home
  (self-manage diseases)
   – Exploding market
       • Chronic diseases
       • Elderly people (Baby Boomers reach retirement age)
       • Market for home health care products is valued at $4.3 billion today
   – Place for Speech recognition
       • Inexperience of patients with electronic interfaces
       • Poor physical condition (e.g. low vision)
       • Illiteracy
       Medical device today, March 2009
Other applications
• Speech translation
  – IraqCom
Okay, I can’t wait!
Is there anything I can use now?
Upcoming solutions
• Voicemail accessible via text-message,
  email or dedicated application




  – Server-based
  – Require agreement and implementation by the
    carriers
Upcoming solutions
• Nuance Voice Control 2
  – Online search
  – Text-messaging

    • Embedded software for
      simple voice command
    • Server-based engine for large
      vocabulary speech recognition


• Speech Recognition API
  on Android 1.5
So?
Conclusion
• A truly embedded speech recognition system
  – A range of exciting applications
     • Real-time dictation with no perceived delay
     • Natural language interface (ASR + TTS)
     • Applications independent of the carrier
  – But… not available yet!


• New speech recognition APIs are arriving soon
  – Rely on network/server availability
  – Can still lead to innovative applications
Conclusion
• Keys to success
  – Robustness, accuracy
  – Fast to load and execute
  – Well designed interface
     • Speech cannot be used on its own
     • Should be cleverly combined with other interfaces
         – Graphical
         – Touch
         – …


  – Don’t put customers off with clumsy speech recognition
    widgets again!
Questions?

Dev Days, Speech Recognition, LM Aubert

  • 1. Speech Recognition on embedded devices Louis-Marie Aubert ECIT – Queen’s University Belfast DevDays – Belfast – April 24, 2009
  • 2. What should we expect from speech recognition?
  • 3. Speech Recognition success? • Natural continuous speech • Real-time • Large vocabulary (up to 100,000 words) • No training (speaker independent) • Adaptive to speaker accent • Robust against – Background noise – Audio frontend imperfections • N-best hypotheses with confidence value
  • 4. What are the solutions on the market?
  • 5. Existing solutions • Server-based – Telephony, IVR – Dictation (Health care industry) – Audio indexing Either offline or with significant delays
  • 6. Existing solutions • Desktop-based – Real-time dictation – Language learning Requires a good setup, powerful computer, quiet environment Very good accuracy, no training required
  • 7. Existing solutions • Embedded applications – Simple voice commands (‘Call-mum’ type command) – Disconnected word recognition Small vocabulary and lack of naturalness restrict the range of applications
  • 8. Is it so difficult?
  • 9. Technical challenge [Diagram: speech waveform → Speech Recognizer → transcription ‘Hello world’]
  • 10. Technical challenge [Diagram: speech waveform → Spectral Analyser → acoustic feature vectors, ~40 coeff. every 10 ms]
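The front end on slide 10 turns the waveform into one feature vector every 10 ms. The sketch below only shows the framing step, assuming a 16 kHz sample rate and non-overlapping frames (neither is stated in the slides); `split_frames` and `log_energy` are hypothetical helper names, and log energy is a stand-in for the ~40 coefficients, not the talk's actual spectral analysis.

```python
import math

def split_frames(samples, sample_rate=16000, frame_ms=10):
    """Cut a waveform into consecutive, non-overlapping frames of frame_ms."""
    size = sample_rate * frame_ms // 1000  # samples per frame
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]

def log_energy(frame):
    """One simple per-frame feature; a real front end would emit ~40 coefficients."""
    return math.log(sum(s * s for s in frame) + 1e-10)

# One second of a 440 Hz tone as a toy input signal.
tone = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
frames = split_frames(tone)
features = [log_energy(f) for f in frames]
```

At 10 ms per frame, one second of audio yields 100 frames, so every later stage of the recognizer has to keep up with 100 feature vectors per second.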
  • 11. Technical challenge [Diagram: acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → transcription ‘Hello world’; knowledge sources: acoustic models, phoneme lexicon, word lexicon, statistical language model]
  • 13. Technical challenge: Acoustic models • 4000 acoustic models • Sub-acoustic unit • Functions that score 10 ms of speech • Sets of vectors of mean and variance, 40-long • Gaussian mixtures (16) [Diagram: the acoustic models feed the multi-dimensional Gaussian mixture calculation stage of the recognizer]
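Slide 13 describes each acoustic model as a function that scores 10 ms of speech with a Gaussian mixture over a 40-long vector. A minimal sketch of such a scoring function, assuming diagonal covariances and 16 components per model (the slide only says "Gaussian mixtures (16)"); `log_gaussian_diag`, `senone_score` and the toy parameters are illustrative names and values, not the talk's implementation.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def senone_score(x, weights, means, variances):
    """Log-likelihood of one feature vector under a Gaussian mixture."""
    logs = [
        math.log(w) + log_gaussian_diag(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(logs)
    # log-sum-exp keeps the sum stable when the individual terms are tiny
    return top + math.log(sum(math.exp(value - top) for value in logs))

DIM, MIXTURES = 40, 16  # sizes taken from the slide; values below are toy data
frame = [0.0] * DIM
weights = [1.0 / MIXTURES] * MIXTURES
means = [[0.1 * k] * DIM for k in range(MIXTURES)]
variances = [[1.0] * DIM for _ in range(MIXTURES)]
score = senone_score(frame, weights, means, variances)
```

If all 4000 models were evaluated on every frame, that would be 4000 × 16 × 40 = 2,560,000 per-dimension terms every 10 ms, or roughly 256 million per second, which gives a feel for the computational load on an embedded device.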
  • 16. Technical challenge: Phoneme • 50 in English • Differentiable sounds • Represent a sequence of senones: HMM (Hidden Markov Model) ‘ah’: ah1 ah2 ah3 ‘l’: l1 l2 l3
  • 17. Technical challenge: Triphone • 2500 in English • Differentiable sounds in their context; continuous speech ‘hh-ah+l’: ah1 ah2 ah3 ‘ah-l+ow’: l1 l2 l3
  • 18. Technical challenge [Diagram: acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → transcription ‘Hello world’; knowledge sources: acoustic models, triphone lexicon, word lexicon, statistical language model]
  • 20. Technical challenge: Word • Large vocabulary: 64000 • Represent a sequence of phonemes/triphones ‘hello’: hh ah l ow ‘world’: w er l d
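The lexicon layers on slides 16, 17 and 20 can be chained: a word expands to phonemes, each phoneme to a context-dependent triphone, and each unit to a short sequence of HMM states. A small sketch under stated assumptions: the `sil` boundary context and three states per unit are illustrative choices, the ‘world’ pronunciation is read from a garbled slide, and `to_triphones` / `to_states` are hypothetical helper names.

```python
# Toy word lexicon using the pronunciations shown on the slides.
LEXICON = {
    "hello": ["hh", "ah", "l", "ow"],
    "world": ["w", "er", "l", "d"],  # assumed reading of the garbled slide text
}

def to_triphones(phones, boundary="sil"):
    """Rewrite each phoneme as left-centre+right, padding the word edges."""
    padded = [boundary] + phones + [boundary]
    return [
        f"{padded[i - 1]}-{padded[i]}+{padded[i + 1]}"
        for i in range(1, len(padded) - 1)
    ]

def to_states(phones, states=3):
    """Expand each phoneme into its numbered HMM states, e.g. ah1 ah2 ah3."""
    return [f"{p}{k}" for p in phones for k in range(1, states + 1)]

hello_triphones = to_triphones(LEXICON["hello"])
hello_states = to_states(LEXICON["hello"])
```

For ‘hello’, the two middle triphones come out as `hh-ah+l` and `ah-l+ow`, the same labels used on slide 17.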
  • 23. Technical challenge: Statistical language model • Bi-gram / Tri-gram • Give the probability of a sequence of 2/3 words • 64000 words leads to roughly 10 million states / 50 million arcs [Diagram: ‘hello’ followed by ‘mum’ (0.3), ‘dad’ (0.2), ‘world’ (0.05)]
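The bigram idea on slide 23 can be sketched as a lookup table of word-pair probabilities. The three probabilities are the ones shown in the slide's diagram (as best they can be read from the transcript); the `floor` value for unseen pairs is a placeholder assumption standing in for proper smoothing or back-off, and `bigram_logprob` is a hypothetical helper name.

```python
import math

# P(next word | previous word), values taken from the slide's example.
BIGRAMS = {
    ("hello", "mum"): 0.3,
    ("hello", "dad"): 0.2,
    ("hello", "world"): 0.05,
}

def bigram_logprob(words, table, floor=1e-6):
    """Sum of log P(w_i | w_{i-1}); unseen pairs get a small floor value."""
    total = 0.0
    for prev, cur in zip(words, words[1:]):
        total += math.log(table.get((prev, cur), floor))
    return total

# Rank candidate continuations of "hello" by language-model score.
candidates = ["world", "dad", "mum"]
ranked = sorted(
    candidates,
    key=lambda w: bigram_logprob(["hello", w], BIGRAMS),
    reverse=True,
)
```

During decoding, scores like these are added to the acoustic scores, so an acoustically ambiguous word is resolved in favour of the more likely word sequence.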
  • 25. Technical challenge [Diagram: acoustic feature vectors → Recognizer (senone calculation, Viterbi decoding) → transcription ‘Hello world’; knowledge sources: acoustic models, triphone lexicon, word lexicon, statistical language model; combined network of ~25 million states / 250 million arcs]
  • 27. Technical challenge: Viterbi decoding • Token passing algorithm • 5000/10000 tokens to propagate every 10 ms • Select the most promising tokens and output the associated sequence of: senones, triphones, words, sentence [Diagram: tokens moving through senone chains l1 l2 l3, ow1 ow2 ow3, s1 s2 s3, ey1 ey2 ey3, d1 d2 d3, v1 v2 v3; ~25 million states / 250 million arcs]
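The token passing scheme on slide 27 can be illustrated on a toy network. This is a sketch only: the three-state chain, the emission scores and the beam width of 2 are made-up values (the slide mentions 5000/10000 tokens on a network of millions of states), and `token_passing` and `emit` are hypothetical names.

```python
import math

def token_passing(frames, arcs, emit, start, beam=2):
    """Propagate tokens frame by frame, keeping only the `beam` best ones.

    arcs maps a state to a list of (next_state, log_transition_prob);
    emit(state, obs) returns the log-likelihood of obs in that state.
    Returns the (score, path) of the best surviving token.
    """
    tokens = {start: (0.0, [start])}  # state -> (log score, state history)
    for obs in frames:
        new_tokens = {}
        for state, (score, path) in tokens.items():
            for nxt, log_p in arcs.get(state, []):
                cand = score + log_p + emit(nxt, obs)
                # Viterbi recombination: keep the best token per state
                if nxt not in new_tokens or cand > new_tokens[nxt][0]:
                    new_tokens[nxt] = (cand, path + [nxt])
        # Beam pruning: keep only the most promising tokens
        best = sorted(new_tokens.items(), key=lambda kv: kv[1][0], reverse=True)
        tokens = dict(best[:beam])
    return max(tokens.values(), key=lambda t: t[0])

HALF = math.log(0.5)
ARCS = {
    "start": [("l1", 0.0)],
    "l1": [("l1", HALF), ("l2", HALF)],
    "l2": [("l2", HALF), ("l3", HALF)],
    "l3": [("l3", HALF)],
}

def emit(state, obs):
    """Toy acoustic score: high when the frame label matches the state."""
    return math.log(0.8) if state == obs else math.log(0.1)

score, path = token_passing(["l1", "l1", "l2", "l3"], ARCS, emit, "start")
```

Here each token carries its full state history for readability; a practical decoder would more likely store back-pointers and only record word boundaries.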
  • 29. Challenges in embedded systems • Low computational resources • Power consumption constraints • Noisy environment, poor audio quality For a truly embedded speech recognition engine that works, we must move away from the pure software approach: • Make the best of all hardware acceleration available • Dedicated chip (accelerator) to offload the CPU and relax memory constraints
  • 30. Why do we want speech recognition on embedded devices anyway?
  • 31. Applications on mobiles • Complement touch screen interface with speech interface • Speech enable existing mobile applications – Browse complex menus – Easily find items in large libraries, local or online (contacts, music…) – Browse Web and search maps – Games – Compose text-messages, emails…
  • 32. Applications on mobiles • Speech enable mobile applications Rubicon, "The Apple iPhone: Successes and Challenges for the Mobile Industry", 31 March 2008
  • 33. Applications on mobiles • Key to safety when driving – Text-messaging – Satellite-Navigation function • Voice Memo – Shopping list – Activity scheduler • Market of Speech technology in embedded devices – $125 million in 2006 – $500 million in 2010 Opus Research report, March 2007
  • 34. Other markets • Developing countries – Access to information technology for illiterate people • Administrative tasks • Education • Social integration • Health-care at home (self-manage diseases) – Exploding market • Chronic diseases • Elderly people (Baby Boomers reach retirement age) • Market for home health care products is valued at $4.3 billion today – Place for Speech recognition • Inexperience of patients with electronic interfaces • Poor physical condition (e.g. low vision) • Illiteracy Medical device today, March 2009
  • 35. Other applications • Speech translation – IraqCom
  • 36. Okay, I can’t wait! Is there anything I can use now?
  • 37. Upcoming solutions • Voicemail accessible via text-message, email or dedicated application – Server-based – Require agreement and implementation by the carriers
  • 38. Upcoming solutions • Nuance Voice Control 2 – Online search – Text-messaging • Embedded software for simple voice command • Server-based engine for large vocabulary speech recognition • Speech Recognition API on Android 1.5
  • 39. So?
  • 40. Conclusion • A truly embedded speech recognition system – A range of exciting applications • Real-time dictation with no perceived delay • Natural language interface (ASR + TTS) • Applications independent of the carrier – But… not available yet! • New speech recognition APIs are arriving soon – Rely on network/server availability – Can still lead to innovative applications
  • 41. Conclusion • Keys to success – Robustness, accuracy – Fast to load and execute – Well designed interface • Speech cannot be used on its own • Should be cleverly combined with other interfaces – Graphical – Touch – … – Don’t put customers off with clumsy speech recognition widgets, again!