Diese Präsentation wurde erfolgreich gemeldet.
Wir verwenden Ihre LinkedIn Profilangaben und Informationen zu Ihren Aktivitäten, um Anzeigen zu personalisieren und Ihnen relevantere Inhalte anzuzeigen. Sie können Ihre Anzeigeneinstellungen jederzeit ändern.

Making an on-device personal assistant a reality

123 Aufrufe

Veröffentlicht am

Artificial intelligence (AI) is reshaping our lives, and it’s a fantastic time and opportunity for developers working in this area. One example is in the voice UI and virtual assistant revolution, where machine learning and machine speech recognition approaches the accuracy of humans. The AI powering key voice UI components, such as automatic speech recognition and natural language processing, has traditionally run in the cloud due to computing, storage, and power constraints. However, on-device processing of voice UI provides unique benefits, such as instant response, reliability, and privacy. And developers can take advantage of on-device processing by fusing multiple on-device sensor inputs, such as camera and accelerometers, in addition to microphones to add a level of personalization to your AI development project.

Veröffentlicht in: Technologie
  • Als Erste(r) kommentieren

Making an on-device personal assistant a reality

  1. 1. Making an on-device personal assistant a reality Jilei Hou Sr. Director, Engineering Qualcomm Technologies, Inc. @qualcomm_techJuly 10, 2018 San Diego, CA
  2. 2. 2 Reasoning Learn, infer context, and anticipate Perception Hear, see, and observe Action Act intuitively, interact naturally, and protect privacy AI brings human-like understanding and behaviors to the machines
  3. 3. 3 Advancing AI research to make on-device AI ubiquitous A common platform is fundamental to scaling AI internally and across the industry Power efficiency Model design, compression, quantization, activation, algorithms, and efficient hardware Efficient learning Robust learning through minimal data, unsupervised learning, and on-device learning Personalization Continuous learning, model adaptation, and privacy-preserved distributed learning System architecture Multi-task and multi-modal learning, sensor fusion, and cloud-edge systems Action Reinforcement learning for decision making Reasoning Scene understanding, language understanding, behavior prediction Perception Object detection, speech recognition, contextual fusion AutomotiveIoT Mobile
  4. 4. 4 A true personal assistant One of many use cases requiring a broad set of AI capabilities Power efficiency Model design, compression, quantization, activation, algorithms, and efficient hardware Efficient learning Robust learning through minimal data, unsupervised learning, and on-device learning Personalization Continuous learning, model adaptation, and privacy-preserved distributed learning AutomotiveIoT Mobile Action Reinforcement learning for decision making Reasoning Scene understanding, language understanding, behavior prediction Perception Object detection, speech recognition, contextual fusion System architecture Multi-task and multi-modal learning, sensor fusion, and cloud-edge systems
  5. 5. 5 v Designed to be: Always-on Conversational Personal Private Critical to create a true virtual assistant Voice is the transformative user interface (UI) we’ve been waiting for
  6. 6. 6 Voice UI components required for an end-to-end solution Text-to-speech Speech synthesis Natural language generation Dialog management Natural language understanding Natural language processing Speech-to-text Speech recognition “Alexa,” “Hey Snapdragon” Always-on keyword detection Voice activation Echo cancellation Speech denoising Speech pre-processing Signal acquisition and playback Front-end processing Machine speech chain: listener and speaker Qualcomm Snapdragon is a product of Qualcomm Technologies, Inc. and/or its subsidiaries.
  7. 7. 77 Machine learning has ignited the voice UI revolution “As speech recognition accuracy goes from say 95% to 99%, all of us in the room will go from barely using it today to using it all the time. Most people underestimate the difference between 95% and 99% accuracy— 99% is a gamechanger. No one wants to wait 10 seconds for a response. Accuracy, followed by latency, are the two key metrics for a production speech system.” — Andrew Ng GMM: Gaussian Mixture Model, CNN: Convolutional Neural Network, RNN: Recurrent Neural Network Human accuracy GMM RNN + CNN Machine automatic speech recognition accuracy 202020102000199019801970 50% CNN 55% 60% 62% 70% 95%
  8. 8. 8 TV Smart speakers Headphones and headsets IoT Smartphones and tablets In-car entertainment systems PCs and laptops Wearables XR Voice UI is proliferating across product categories
  9. 9. 9 Moving voice UI functionality to the end device An end-to-end solution powered by machine learning Automatic speech recognition (ASR) Text-to-speech (TTS) News SMS Music Maps Wikipedia Weather Stocks Voice activation Service manager Multi-mic echo cancellation, beamforming, and speech denoising Natural language understanding (NLU) On-device processing (always-on and real-time) Cloud processing (services) Cloud centric (today)
  10. 10. 10 Moving voice UI functionality to the end device An end-to-end solution powered by machine learning Automatic speech recognition (ASR) Text-to-speech (TTS) News SMS Music Maps Wikipedia Weather Stocks Voice activation Service manager Multi-mic echo cancellation, beamforming, and speech denoising Natural language understanding (NLU) On-device processing (always-on and real-time) Cloud processing (services) On-device centric (future)
  11. 11. 11 Cloud tasks Complex voice fallback Training and model update Knowledge base Services On-device processing of voice UI Provides unique benefits complementing the cloud Challenge Providing the voice UI functionality within the power/thermal envelope Machine learning models Offline raw data Queries On-device tasks Automatic speech recognition Natural language processing Always-on audio cognition On-device training Benefits Privacy Instant response Always-on Device context
  12. 12. 12 Noisy speech spectrogram Clean speech spectrogram “If people were more generous, there would be no need for welfare” 0.5 1 1.5 2 2.5 3 4000 3000 2000 1000 0 Frequency Time 0.5 1 1.5 2 2.5 3 4000 3000 2000 1000 0 Frequency Time DL-based denoising model trained with extensive speech noise databases DL-based denoising Speech denoising • Single or multiple mics • Applicable for ◦ Two-way conversation ◦ Voice/speaker recognition ◦ Keyword spotting • Deep learning (DL) significantly improves the performance over traditional methods • Robust in challenging interference and noise scenarios “If people were more generous, there would be no need for welfare”
  13. 13. 13 Qualcomm Voice Activation supports: Qualcomm® Voice Activation (VA) High accuracy, robust to background noise, and supports multiple languages Deep learning is improving performance Among state-of-the-art in terms of performance vs. power consumption -47% -11% 2014 2015 2016 2017 QualcommVApowerconsumption WCD9330 WCD9335 WCD9340 Amazon Alexa Baidu DUEROS Microsoft Cortana Google Assistant Qualcomm Voice Activation, Qualcomm WCD9330, Qualcomm WCD9335, and Qualcomm WCD9340 are products of Qualcomm Technologies, Inc. and/or its subsidiaries.
  14. 14. 14 Transcribe the audio to text Deep learning gives state-of-the-art accuracy on a mobile device Personalization—adaptation to individual accent and acoustic environment Automatic speech recognition Natural language understanding (NLU) Acoustic features Acoustic model Reduce input audio to essential information Language model Deep learning converts input into linguistic units Adapted to each user’s accent and environment Uses context and language statistics for best utterance estimation Adapted to each user’s speaking tendencies Allows the same intention to be expressed in multiple ways Adapted to each user’s intent expressions “Turn  on  the  light” On-device automatic speech recognition (ASR) User intention
  15. 15. 1515 An end-to-end on-device voice UI example for smart homes 99% on-device intent accuracy is achieved for domain specific command sets when adapted to accent and environmental condition Demo of automatic speech recognition and natural language understanding Large command set Turn on the living room lights Click the kitchen lights off Turn off all lights Switch on the ceiling fan Shut off the sprinklers Start music Pause song Next track Go back one Play previous song Turn speaker off Increase temperature Intent understanding Turn on the kitchen light Click kitchen light on Switch on light in the kitchen Turn the light on in the kitchen NLU: These four phrases map to the same intent
  16. 16. 16 A true virtual assistant A “digital me” sitting on the device: context aware and personalized
  17. 17. 1717 Calendar Messaging Apps On-device data Contextual intelligence is required for personalization The fusion of many types of sensors and personal information 17 Low power sensing, processing, and connectivity Efficient, heterogeneous architectures Sensor fusion and machine learning Integrated, always-on data capturing Low-energy wireless technologies (e.g. BT-LE, 5G NR IoT) Cloud data Off-device data IoT data Sensor fusion Gyroscope Compass Camera Ambient light Temperature Iris scanEnvironment Pulse HumidityMicrophone Sensor data C-V2X
  18. 18. 18 Various kitchen noise samples Posterior-gram ML-based acoustic event classification Object rustling Object snapping Cupboard Cutlery Dishes Drawer Glass jingling Object impact Walking Washing dishes Water tap running 2 4 6 8 10 12 50 100 150 200 250 300 350 400 450 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 Acoustic event detection • ML techniques are used to ◦ Classify acoustic signals into a set of predefined events ◦ Infer acoustic environment • Low power, always-on
  19. 19. 1919 Creating personalized memories Essential for a true virtual assistant History, number of people, identity After the party, strolling on the beach at sunset in La Jolla talking with my son and laughing Live sentiment analysis Strolling on the beach at sunset in La Jolla talking with my son and laughing Activity analysis Strolling on the beach at sunset in La Jolla talking with my son GPS location La Jolla, California Visual analysis A sunset over the ocean in La Jolla Sound analysis Talking with my son at sunset in La Jolla
  20. 20. 20 A true personal assistant is responsive and proactive “Remember the time I was strolling with my son after the party at La Jolla beach?” “Yes I do, here is a picture you took of the sunset. Should I share it with your family group on WeChat?” Responsive Decision-making and conversation based on contextual analysis and prompting (e.g. finding memories) Proactive Decision-making and conversation based on contextual analysis without prompting (e.g. automatically sharing memories) “I noticed that you are tired and stressed, I’m turning on the Rocky III soundtrack and navigating you to the gym for a workout and sauna.” “This music gets my blood going and a workout and sauna will help me relieve stress.”
  21. 21. 21 Multi-mic echo cancellation, beamforming, and speech denoising The first step to an on-device virtual assistant Enabling on-device voice UI News SMS Music Maps Wikipedia Weather Stocks Voice activation Service manager Automatic speech recognition (ASR) Natural language understanding (NLU) On-device processing Cloud processing Text-to-speech (TTS)
  22. 22. 22 Adding an “AI agent” to create a true virtual assistant The on-device AI agent continuously learns personal knowledge and acts intuitively On-device processing Cloud processing (services) Cloud knowledge graph Automatic speech recognition (ASR) Multi-mic echo cancellation and beamforming Sensors Text-to-speech (TTS) Natural language understanding (NLU) Voice activation AI agent
  23. 23. 23 Adding an “AI agent” to create a true virtual assistant Contextualization allows personalization at acoustic, intent, and behavior levels Cloud processing (services) Automatic speech recognition (ASR) Voice activation Multi-mic echo cancellation and beamforming Cloud knowledge graph On-device processing Text-to-speech (TTS) Sensors Natural language understanding (NLU) AI agent Dialog management Speaker identification Acoustic event detection Gender and age detection Voice activity detection Emotion classification Contextual fusion and learning Local knowledge graph
  24. 24. 2424 We are advancing AI research to make on-device AI ubiquitous We are creating AI platform innovations that are fundamental to scaling AI across the industry We provide the low-power end-to-end on-device solution for a true personal assistant
  25. 25. Follow us on: For more information, visit us at: www.qualcomm.com & www.qualcomm.com/blog Thank you! Nothing in these materials is an offer to sell any of the components or devices referenced herein. ©2018 Qualcomm Technologies, Inc. and/or its affiliated companies. All Rights Reserved. Qualcomm and Snapdragon are trademarks of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners. References in this presentation to “Qualcomm” may mean Qualcomm Incorporated, Qualcomm Technologies, Inc., and/or other subsidiaries or business units within the Qualcomm corporate structure, as applicable. Qualcomm Incorporated includes Qualcomm’s licensing business, QTL, and the vast majority of its patent portfolio. Qualcomm Technologies, Inc., a wholly-owned subsidiary of Qualcomm Incorporated, operates, along with its subsidiaries, substantially all of Qualcomm’s engineering, research and development functions, and substantially all of its product and services businesses, including its semiconductor business, QCT.

×