AWS re:Invent
THE SCIENCE BEHIND THE ALEXA
PRIZE: MEETING THE AI CHALLENGES
Ashwin Ram
Spyros Matsoukas
ALX320
November 28, 2017
CONVERSATION
Language Understanding · Dialog Management · Response Generation · Knowledge

User: Alexa, is it hot outside?
Alexa: It’s sunny and warm today.
User: How about this weekend?
Alexa: Clouds are rolling in. It will be overcast.
User: Will I need an umbrella?
Alexa: Depends on where you’ll be.
User: I was thinking of hiking the Dish.
Alexa: I’d take a rain jacket. There’s a 60% chance of rain in the Stafford Hills.
NORTH STAR
User: Alexa, let’s chat about the Mars mission.
Alexa: There are multiple Mars missions, some public and some private.
User: Who do you think will succeed?
Alexa: I think more than one will succeed. The first one to get to Mars will open the doors for others.
User: I’d love to go to Mars.
Alexa: Me too. Luckily I can hitch a ride on any computer.
User: That’s very funny.
Alexa: The biggest challenge isn’t technology, it’s funding. It will cost $500 billion to send humans to Mars.
User: With the current political climate, Congress isn’t going to approve this kind of budget.
Alexa: It will need to be privately funded.
User: Alexa, stop.
SCIENCE CHALLENGES
Conversation Is Hard!
• Speech recognition: Free-form conversational ASR
• Language understanding: Understanding the meaning of utterances
• Context modeling: Tracking context across multiple turns
• Dialog planning: Determining the best response
• Language generation: Deciding how to phrase the response
• Personalization: Adapting to individual users
• Knowledge ingestion: Learning real-world context (example: latest news)
• Common sense reasoning: Making inferences
CUSTOMER EXPERIENCE CHALLENGES
• Icebreakers: Conversation starters
• Pauses: Topic suggestions
• Deep dives: Leading conversations
• Personal questions: Opinions or controversy
• Frustration: Nonanswers
$2.5M INAUGURAL COMPETITION TO ADVANCE
THE FIELD OF CONVERSATIONAL AI
ALEXA, LET’S CHAT.
Challenge
Create a socialbot that can converse coherently
and engagingly on popular topics for 20 minutes
By the numbers
15 teams
Public launch on May 8; top 10 Alexa skill
40,000+ hours of conversations; millions of interactions
3 finalists; 24% improvement in ratings from start to end
ALEXA SPOKEN LANGUAGE
UNDERSTANDING
THE ALEXA SERVICE
SUPPORTED BY TWO POWERFUL FRAMEWORKS
Lives in the cloud: automatic speech recognition (ASR), natural language understanding (NLU), always learning
Alexa Voice Service (AVS): unparalleled distribution. AVS allows your content to be everywhere.
Alexa Skills Kit (ASK): create great content. ASK is how you connect to your consumer.
WHAT CAN ALEXA DO?
Entertainment: Music, books, video…
Household: Timers, to-do lists, shopping, calendar
Communication: Messaging, calling
Information: Weather, traffic, news…
Smart home: Lights, thermostats, switches…
25,000+ skills: Developed by third parties
Pipeline: user speech → ASR → text → NLU → labels → dialog manager → actions → TTS → speech output
Component | Input | Output | Example
Automatic speech recognition (ASR) | Speech | Text (1-best or lattice of alternatives) | “Play ‘Two Steps Behind’ by Def Leppard.”
Natural language understanding (NLU) | Text | Slots and intent type | Intent: PlayMusicIntent; Slots: Artist name: Def Leppard, Song: Two Steps Behind
Dialog manager (DM) | Labels and context | Dialog actions, responses | Ask the application to play the song or clarify.
Text-to-speech (TTS) | Text | Speech | “Which artist?” or “Playing ‘Two Steps Behind’ by Def Leppard.”
OVERVIEW OF SPOKEN LANGUAGE UNDERSTANDING SYSTEMS
(The slide diagram also shows a clarification dialog loop back to the user.)
COMMON THEME: DATA-DRIVEN MACHINE LEARNING
Training: input (training data) and answers (truth) → models
Inference: input and models → answers
Task: ASR. Input: speech. Output: sequence of words (text)
Task: NLU. Input: text. Output: intents and slots
• Probabilistic modeling that is robust to noise and ambiguity
• Relatively inexpensive ground truth creation
• Portable to new domains and languages
DATA-DRIVEN ML FOR CONTINUOUS IMPROVEMENT
Time-varying live production data drives a loop: unlabeled audio → data sampling (active learning) → ground truth generation (audio and transcripts; transcripts and NLU labels) → ASR and NLU model training (trainer) → ASR and NLU model validation → models deployed (decoder).
AUTOMATIC SPEECH RECOGNITION
FAR-FIELD SPEECH CHALLENGES
• Noisy environment (example: room reverberation, ambient noise, background speech)
• Large vocabulary, high perplexity domains
• Difficult to predict spoken forms for catalog entries and their associated pronunciations (example: artist
names such as Psy, P!nk)
• Acoustically confusable titles, “sun daze” or “Sundays”
SIGNAL PROCESSING
Beamforming: the microphone array forms multiple beams (1 to 6) to separate speech from noise.
Acoustic echo cancellation (AEC)
• Applied to each beam
• Reduces echo of device playback
• (Before/after AEC waveforms were shown on the slide.)
SPEECH RECOGNITION
Pipeline: sound → feature extraction → feature vectors [4.7, 2.3, -1.4, …] → acoustic model → phonetic probabilities [0.1, 0.1, 0.4, …] → decoder (inference, guided by the language model) → words (“increase to seventy degrees”) → post-processing → text (“Increase to 70°”)
CONVERSATIONAL LANGUAGE MODELING
• Conversational free forms are inherently high perplexity
• Need to increase language model (LM) capacity
• For n-gram models this means increasing n-gram order
and number of n-grams (model size)
• Can result in increased latency
• Designed and developed a new representation for n-gram language models and a new decoder
• Can now accommodate over 10x increase in
LM size without significant impact on latency
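A minimal illustration of the order-vs-size trade-off described above: counting n-grams in pure Python. The toy corpus, counts, and probability estimate are invented for illustration; the production representation and decoder are far more sophisticated.

```python
from collections import defaultdict

def count_ngrams(tokens, n):
    """Count all n-grams of order n in a token sequence."""
    counts = defaultdict(int)
    for i in range(len(tokens) - n + 1):
        counts[tuple(tokens[i:i + n])] += 1
    return counts

corpus = "alexa let s chat about the mars mission alexa let s chat about music".split()
unigrams = count_ngrams(corpus, 1)
bigrams = count_ngrams(corpus, 2)
trigrams = count_ngrams(corpus, 3)

def bigram_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[(w1,)]

# Model size is the number of distinct n-grams stored; raising the order
# (while keeping lower orders for backoff) multiplies storage, which is
# why a compact representation matters for latency.
model_size = len(unigrams) + len(bigrams) + len(trigrams)
```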
SPEAKER ADAPTATION
Garimella et al., “Robust i-Vector Based Adaptation of DNN Acoustic Model for Speech Recognition,” Interspeech 2015
A DNN acoustic model combines speech features with speaker features to produce phonetic probabilities.
5% to 7% relative reduction in word error rate compared to speaker-independent model
SPEECH ENDPOINTING
Challenge
Low latency detection of end of speech
Approach
Real time detection of sentence end
combining acoustic and linguistic information
“Play music by [PAUSE] Sting [PAUSE].”
Liu et al., “Accurate End-Pointing with Expected Pause Duration”, Interspeech 2015
Low end-of-sentence LM probability → endpointProb=LOW
High end-of-sentence LM probability → endpointProb=HIGH
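A hedged sketch of the idea: combine an acoustic cue (pause duration) with a linguistic cue (the LM's end-of-sentence probability). The thresholds and combination rule below are invented for illustration and are not the published method.

```python
# Invented thresholds and combination rule, for illustration only.
def should_endpoint(pause_ms, eos_lm_prob,
                    base_pause_ms=600, eos_threshold=0.5):
    """Endpoint sooner when the LM says the sentence looks complete,
    and wait longer when it looks like a mid-sentence pause."""
    if eos_lm_prob >= eos_threshold:
        return pause_ms >= base_pause_ms / 2   # likely done: endpoint eagerly
    return pause_ms >= base_pause_ms * 2       # likely mid-sentence: be patient
```

For "Play music by [PAUSE]", the end-of-sentence probability is low, so the rule keeps listening through a long pause instead of cutting the user off.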
ANCHORED SPEECH DETECTION
Problem: Background speech interfering with device-directed speech
• Delayed end-point detection
• Word insertion errors
“Alexa, play music.”
[speaker 1]
“Take the dog out.”
[speaker 2]
Approach: DNN frame classifier, using the anchor word as a reference for the desired speaker
Targets: 0 = non-speech/undesired speech; 1 = desired speech
19% relative reduction in frame classification error and 9% reduction in WER compared to baseline
Maas et al., “Anchored Speech Detection,” Interspeech 2016; King et al., “Robust Speech Recognition Via Anchor Word Representations,” Interspeech 2017
NATURAL LANGUAGE
UNDERSTANDING
NATURAL LANGUAGE UNDERSTANDING (NLU)
Goal: Understand the spoken intent and associated salient elements (aka slots)
Challenges
• Cross-domain intent recognition (example: “Play remind me.” or “Remind me to go to the play.”)
• Robustness to ASR errors (example: “Play songs by Prince.” or “Pray songs by Prince.”)
• User correction in context, “No, The Rolling Stones.”
• Rejecting out-of-domain utterances
• Need for high precision and recall, because there is no GUI for the user to select an action
INTENT CLASSIFICATION
What’s the weather?
Tell me the forecast.
Will it rain today?
Please, um, tell me the weather for, uh, Seattle.
Do I need an umbrella?
What’s the expected low temperature today?
GetWeatherForecastIntent
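As an illustration only, here is a toy intent classifier that scores an utterance against bag-of-words prototypes built from example utterances. The intents and examples mirror the slide; Alexa's NLU uses trained statistical models, not this heuristic.

```python
import math
from collections import Counter

# Hypothetical training utterances per intent, echoing the slide's examples.
TRAINING = {
    "GetWeatherForecastIntent": [
        "what's the weather", "tell me the forecast",
        "will it rain today", "do i need an umbrella",
    ],
    "PlayMusicIntent": [
        "play some music", "play songs by prince",
    ],
}

def bow(text):
    """Bag-of-words representation of an utterance."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# One prototype vector per intent: the sum of its example bags.
PROTOTYPES = {
    intent: sum((bow(u) for u in utts), Counter())
    for intent, utts in TRAINING.items()
}

def classify(utterance):
    """Return the intent whose prototype is most similar to the utterance."""
    return max(PROTOTYPES, key=lambda i: cosine(bow(utterance), PROTOTYPES[i]))
```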
NAMED ENTITY RECOGNITION
March 27th
A week from Sunday
Eight days from now
Easter Sunday
Two days after Good Friday
Date
ENTITY RESOLUTION
Entity type | Spoken tokens | Entity value
WeatherLocationCity | LA | Los Angeles
WeatherDate | Tomorrow | 2017-11-29
Generate canonical representation for entities by searching in catalogs
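A minimal sketch of catalog-based resolution: a hypothetical mini-catalog for city names, plus a toy relative-date rule resolved against a reference date (chosen here to match the slide's example).

```python
import datetime

# Hypothetical mini-catalog mapping spoken forms to canonical values.
CITY_CATALOG = {
    "la": "Los Angeles",
    "l.a.": "Los Angeles",
    "new york": "New York City",
}

def resolve_city(spoken):
    """Look up the canonical city name for a spoken form."""
    return CITY_CATALOG.get(spoken.lower().strip())

def resolve_date(spoken, today):
    """Resolve a relative date expression against a reference date (toy rule)."""
    offsets = {"today": 0, "tomorrow": 1}
    key = spoken.lower().strip()
    if key in offsets:
        return (today + datetime.timedelta(days=offsets[key])).isoformat()
    return None

# Reference date matching the slide's example (talk given November 28, 2017).
today = datetime.date(2017, 11, 28)
```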
TEXT-TO-SPEECH SYNTHESIS
CHALLENGES IN TEXT-TO-SPEECH
• Homographs: Words written identically that have different pronunciation
• “I live in Seattle.” or “This presentation broadcasts live from Seattle.”
• Text normalization: Disambiguation of abbreviations, acronyms, units
(example: “m” expanded as “minutes” or “miles” or “meters” or even “medium”)
• Conversion of text to phonemes (grapheme-to-phoneme) in languages with complex mapping such as
English
• Variance of pronunciation sometimes depending on geo-location
• Foreign words, proper names, slang
TEXT-TO-SPEECH APPROACH
Pipeline: text → text normalization → grapheme-to-phoneme conversion → waveform generation → speech
“She has $20 in her pocket.”
→ “she has twenty dollars in her pocket”
→ ˈʃi ˈhæz ˈtwɛn.ti ˈdɑ.ɫəɹz ˈɪn ˈhɝɹ ˈpɑ.kət
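The normalization step can be sketched with a toy rule that expands a currency amount into its spoken form. The number-word lookup is a tiny invented table; real text normalization covers full numerals, units, abbreviations, and much more.

```python
import re

# Toy normalizer: expands "$<n>" to "<number word> dollars" and lowercases.
# NUMBER_WORDS is a tiny invented lookup, not a complete numeral expander.
NUMBER_WORDS = {20: "twenty", 70: "seventy"}

def normalize(text):
    def spell_currency(m):
        n = int(m.group(1))
        return NUMBER_WORDS.get(n, str(n)) + " dollars"
    text = re.sub(r"\$(\d+)", spell_currency, text)  # "$20" -> "twenty dollars"
    return text.lower().rstrip(".")
```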
DIALOG MANAGEMENT
Goal: Understand and satisfy the user’s need through a sequence of interactions, while minimizing accumulated user frustration
Challenges:
• User’s need/goal can evolve through the dialog
• Understanding in context; also taking into account user preferences
• Coordination with multiple applications/skills may be necessary
• Robustness to ASR and NLU errors
• Handling ambiguity and eliciting proper clarification feedback from the user
• Generating prompts/responses that are natural and engaging
• Maintaining a balance between eliciting the required information and introducing friction
DIALOG MANAGEMENT
Markov Decision Process
• S: Set of dialog states
• A: Set of actions from the current state
• π: Dialog policy, mapping states to actions
• P(s′ | s, a): Probability distribution over the next state s′ given the current state and action (s, a)
• R(s, a, s′): Reward function (user satisfaction)
• γ: Discount factor (short-term vs. long-term reward)
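The formulation above can be made concrete with a tiny dialog MDP solved by value iteration. The states, actions, transition probabilities, and rewards below are all invented for illustration.

```python
# Toy dialog MDP solved with value iteration; everything here is hypothetical.
GAMMA = 0.9  # discount factor

# P[state][action] = list of (next_state, probability, reward)
P = {
    "ambiguous": {
        "clarify": [("resolved", 0.9, 1.0), ("ambiguous", 0.1, -0.1)],
        "guess":   [("resolved", 0.5, 1.0), ("failed", 0.5, -1.0)],
    },
    "resolved": {},  # terminal: user satisfied
    "failed": {},    # terminal: user frustrated
}

def value_iteration(P, gamma, iters=100):
    """Iterate the Bellman optimality update until values settle."""
    V = {s: 0.0 for s in P}
    for _ in range(iters):
        for s, actions in P.items():
            if actions:
                V[s] = max(
                    sum(p * (r + gamma * V[s2]) for s2, p, r in outcomes)
                    for outcomes in actions.values()
                )
    return V

def best_action(s, V, gamma=GAMMA):
    """Greedy policy: pick the action with the highest expected return."""
    return max(P[s], key=lambda a: sum(p * (r + gamma * V[s2])
                                       for s2, p, r in P[s][a]))

V = value_iteration(P, GAMMA)
```

With these numbers, asking a clarifying question beats guessing: the small cost of an extra turn is outweighed by the much higher chance of resolving the request.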
DIALOG MANAGEMENT
Reinforcement Learning
• From a sequence of interactions (state, action, reward), estimate the expected cumulative reward given the policy
• Over time:
  • Explore: Take actions that are deemed suboptimal in hope of finding better policies
  • Exploit: Take actions that maximize the current expected cumulative reward (deemed optimal)
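A minimal tabular sketch of the explore/exploit loop: epsilon-greedy action selection plus a one-step Q-learning update. The states and actions ("greet", "ask_topic") are hypothetical; the teams' actual systems were far richer.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit

def q_update(Q, s, a, r, s2, next_actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step from an observed (state, action, reward)."""
    best_next = max((Q.get((s2, b), 0.0) for b in next_actions), default=0.0)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)

# One hypothetical interaction: a greeting turn that earned a good rating.
Q = {}
q_update(Q, "greet", "ask_topic", 1.0, "end", [])
```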
ALEXA PRIZE COMPETITION
SUMMARY OF APPROACHES
Conversational ASR
• We created a customized ASR language model with conversational text and speech transcripts
• University teams used ASR N-best results for downstream robustness to speech recognition errors
Conversational NLU
• We introduced “conversational intent” in Alexa NLU to identify utterances such as
"Alexa, let’s chat” and “Alexa, would you like to have a conversation about Mars Mission?”
• University teams used open-source knowledge bases and graphs (example: Evi, Freebase, Wikidata, IMDB)
to extract and link entities
• University teams did anaphora resolution, sentence completion (“Yes.” → “Yes, I like rockets.”), named entity
extraction, topic linking, and more
SUMMARY OF APPROACHES
Dialog Management
University teams used a range of approaches, including:
• Macro- and micro-bot system combination strategies
• Micro-bots with varying focus/specialization:
• Topics (examples: sports, politics, fashion)
• Data (examples: Reddit, Twitter, Washington Post)
• Intent (examples: “chitchat,” “opinion,” “knowledge”)
• State graph to track dialog context, conversation state, feedback, user sentiment, personalization
• Engagement and customer experience modules to help drive conversations, leading to better engagement,
longer dialogs, and improved ratings
• Detecting and handling profanity and offensive speech
SUMMARY OF APPROACHES
Response Generation
University teams used a range of approaches, including:
• Rules: Templated response using AIML, ELIZA systems
• Retrieval: Nearest neighbor responses from a response bank (built from Reddit, Twitter, Washington Post)
based on similarity using TF-IDF, Word2Vec, skip-thoughts, entity matching
• Generative: Learned response generation models using LSTM, memory networks,
hierarchical NN, attention and other seq2seq approaches. Effective in handling complex,
ambiguous, and phatic responses
• Hybrid: Ensemble of retrieval and generative models, such as retrieval of responses followed by ranking
using generative model; use of generative model as a backup when retrieval has low confidence
• Multi-bot strategies: Different techniques for different micro-bots, such as rules and templates for Q&A,
generative for chitchat
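A toy version of the retrieval approach described above: TF-IDF vectors over a tiny invented response bank, with cosine similarity for nearest-neighbor lookup. The bank entries are made up; the teams indexed large corpora such as Reddit, Twitter, and the Washington Post.

```python
import math
from collections import Counter

# Hypothetical (prompt, response) bank; real systems indexed large corpora.
BANK = [
    ("do you like space", "I love talking about rockets and Mars."),
    ("tell me about sports", "The playoffs were exciting this year."),
    ("what music do you like", "I enjoy classic rock, especially Def Leppard."),
]

def tf(text):
    return Counter(text.lower().split())

DOCS = [tf(prompt) for prompt, _ in BANK]
N = len(DOCS)
# Inverse document frequency over the prompt side of the bank.
IDF = {t: math.log(N / sum(1 for d2 in DOCS if t in d2))
       for d in DOCS for t in d}

def tfidf(bag):
    return {t: c * IDF.get(t, 0.0) for t, c in bag.items()}

def cosine(a, b):
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(utterance):
    """Return the response whose prompt is nearest to the utterance."""
    q = tfidf(tf(utterance))
    scores = [cosine(q, tfidf(d)) for d in DOCS]
    return BANK[scores.index(max(scores))][1]
```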
SUMMARY OF APPROACHES
Ranking and Selection
• Micro-bots provide multiple candidate responses
• Select from the pool the candidate response expected to lead to the highest rating, strongest engagement, and longer conversations
• Train a ranker module based on conversations and corresponding ratings for the given state of the
conversation
• Features used to train the ranker include sentiment, utterance-response coherence, relevance, user feedback,
n-gram and topical match
• Best strategy: Continually update the ranker based on user feedback and rating using reinforcement learning
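As a sketch of the idea, here is a hand-weighted linear ranker over a few invented features of a (context, response) pair. Real rankers were trained from conversations and their ratings; the feature names and weights below are purely illustrative.

```python
# Hypothetical linear ranker; weights would normally be learned from ratings.
WEIGHTS = {"topical_overlap": 1.5, "sentiment": 0.5, "length_penalty": -0.2}

def features(context, response):
    """Compute toy features of a (context, response) pair."""
    c, r = set(context.lower().split()), response.lower().split()
    return {
        "topical_overlap": len(c & set(r)),        # shared words with context
        "sentiment": 1.0 if any(w in r for w in ("love", "great", "fun")) else 0.0,
        "length_penalty": max(0, len(r) - 20),     # discourage rambling replies
    }

def score(context, response):
    f = features(context, response)
    return sum(WEIGHTS[k] * f[k] for k in WEIGHTS)

def select(context, candidates):
    """Pick the highest-scoring candidate response."""
    return max(candidates, key=lambda r: score(context, r))
```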
INCORPORATING FEEDBACK FROM ALEXA USERS AND AP TEAM
• Alexa users were asked to give feedback at the end of each conversation. University teams used this feedback to improve their systems
• Report cards were shared with the finalists, providing feedback in the following areas:
1. Topic conversational quality
2. Response error rate
3. Coherence
4. Engagement
5. Customer experience gaps
• Utterance-response pairs were shared in cases of incorrect responses, as annotated internally
• Teams used all this information to build a better experience during the semifinal and final phases
KEY LEARNINGS
• Users were generally interested in conversing with socialbots
• Users appreciated acknowledgement of their request from the socialbots, even if it wasn’t supported
• Users did not like topics being randomly switched within the conversation
• Beyond social conversation about popular topics, some socialbots engaged users in games, quizzes, and
jokes
EVALUATION METRICS
Socialbots from university teams were evaluated based on the following metrics:
• Coherence: Turn-level response quality annotations by data analysts
• Engagement: User ratings, number of turns, duration
• Domain coverage: Entropy analysis across top five competition domains: Entertainment, Politics, Sports,
Fashion, Technology
• Topical diversity: Analysis of topic frequency, vocabulary, and variation in topics across the conversations
for each socialbot
• Conversational depth: Ability to have multi-turn conversations about a topic before switching to another
topic
Finalist selection
• Top two finalists selected based on customer ratings
• Third finalist selected by Amazon based on above metrics, a large pool of internal evaluators, and science
paper reviews
RESULTS: KEY METRICS
~24% improvement in quality through the competition, as noted by ratings (on a 1 to 5 scale):

Entire competition:
Metric | All socialbots | Finalist socialbots
Average rating | 3.05 | 3.29
Response error rate | 20.6% | 15.1%
Average #turns | 12 | 13
Median dialog duration | 101 sec | 115 sec

Semifinals:
Metric | All socialbots | Finalist socialbots
Average rating | 2.92 | 3.31
Response error rate | 24.2% | 16.5%
Average #turns | 11 | 13
Median dialog duration | 97 sec | 125 sec

~41% reduction in ASR-induced failures through improved conversational modeling
CONCLUSIONS AND NEXT STEPS
Conclusions
• Good progress this year, but the problem is far from solved
• Customer ratings depend on many factors in addition to ASR and response correctness
• High quality, relevant data is critical to developing a good dialog system
• Teams spent a significant amount of effort on the engineering pipeline
Next steps
• Share this year’s learnings with the research community through published proceedings
• Continue improving conversational ASR accuracy through context dependent modeling
• Provide next year’s Alexa Prize contestants with additional engineering support and tools
Thank you!
Weitere ähnliche Inhalte

Ähnlich wie ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges

What if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos SilveiraWhat if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos SilveiraThoughtworks
 
What if-your-application-could-speak
What if-your-application-could-speakWhat if-your-application-could-speak
What if-your-application-could-speakMarcos Vinícius
 
Design Patterns for Future Content
Design Patterns for Future Content Design Patterns for Future Content
Design Patterns for Future Content Don Day
 
Assistive Technology Reading and Writing Tools
Assistive Technology Reading and Writing ToolsAssistive Technology Reading and Writing Tools
Assistive Technology Reading and Writing ToolsATLA
 
Conversational AI with Rasa - PyData Workshop
Conversational AI with Rasa - PyData WorkshopConversational AI with Rasa - PyData Workshop
Conversational AI with Rasa - PyData WorkshopTom Bocklisch
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and ChallengesJens Lehmann
 
Wreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionWreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionStephen Marquard
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingTed Xiao
 
70 C o m m u n i C at i o n s o f t h E a C m j u.docx
70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx
70 C o m m u n i C at i o n s o f t h E a C m j u.docxevonnehoggarth79783
 
Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Paul Houle
 
Should we be afraid of Transformers?
Should we be afraid of Transformers?Should we be afraid of Transformers?
Should we be afraid of Transformers?Dominik Seisser
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdfSoha82
 
Generating Natural-Language Text with Neural Networks
Generating Natural-Language Text with Neural NetworksGenerating Natural-Language Text with Neural Networks
Generating Natural-Language Text with Neural NetworksJonathan Mugan
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Yuriy Guts
 
Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017
Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017
Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017Amazon Web Services
 
ALX326_Applying Alexa’s Natural Language to Your Challenges
ALX326_Applying Alexa’s Natural Language to Your ChallengesALX326_Applying Alexa’s Natural Language to Your Challenges
ALX326_Applying Alexa’s Natural Language to Your ChallengesAmazon Web Services
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Seth Grimes
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...Seth Grimes
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendationsBalázs Hidasi
 

Ähnlich wie ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges (20)

What if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos SilveiraWhat if-your-application-could-speak, by Marcos Silveira
What if-your-application-could-speak, by Marcos Silveira
 
What if-your-application-could-speak
What if-your-application-could-speakWhat if-your-application-could-speak
What if-your-application-could-speak
 
Design Patterns for Future Content
Design Patterns for Future Content Design Patterns for Future Content
Design Patterns for Future Content
 
Assistive Technology Reading and Writing Tools
Assistive Technology Reading and Writing ToolsAssistive Technology Reading and Writing Tools
Assistive Technology Reading and Writing Tools
 
Speech totext
Speech totextSpeech totext
Speech totext
 
Conversational AI with Rasa - PyData Workshop
Conversational AI with Rasa - PyData WorkshopConversational AI with Rasa - PyData Workshop
Conversational AI with Rasa - PyData Workshop
 
Question Answering - Application and Challenges
Question Answering - Application and ChallengesQuestion Answering - Application and Challenges
Question Answering - Application and Challenges
 
Wreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognitionWreck a nice beach: adventures in speech recognition
Wreck a nice beach: adventures in speech recognition
 
A Panorama of Natural Language Processing
A Panorama of Natural Language ProcessingA Panorama of Natural Language Processing
A Panorama of Natural Language Processing
 
70 C o m m u n i C at i o n s o f t h E a C m j u.docx
70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx70    C o m m u n i C at i o n s  o f  t h E  a C m       j u.docx
70 C o m m u n i C at i o n s o f t h E a C m j u.docx
 
Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6Chatbots in 2017 -- Ithaca Talk Dec 6
Chatbots in 2017 -- Ithaca Talk Dec 6
 
Should we be afraid of Transformers?
Should we be afraid of Transformers?Should we be afraid of Transformers?
Should we be afraid of Transformers?
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
Generating Natural-Language Text with Neural Networks
Generating Natural-Language Text with Neural NetworksGenerating Natural-Language Text with Neural Networks
Generating Natural-Language Text with Neural Networks
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017
Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017
Applying Alexa’s Natural Language to Your Challenges - ALX326 - re:Invent 2017
 
ALX326_Applying Alexa’s Natural Language to Your Challenges
ALX326_Applying Alexa’s Natural Language to Your ChallengesALX326_Applying Alexa’s Natural Language to Your Challenges
ALX326_Applying Alexa’s Natural Language to Your Challenges
 
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
Preposition Semantics: Challenges in Comprehensive Corpus Annotation and Auto...
 
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
The Ins and Outs of Preposition Semantics:
 Challenges in Comprehensive Corpu...
 
Deep learning: the future of recommendations
Deep learning: the future of recommendationsDeep learning: the future of recommendations
Deep learning: the future of recommendations
 

Mehr von Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...

ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges

  • 1. AWS re:Invent. THE SCIENCE BEHIND THE ALEXA PRIZE: MEETING THE AI CHALLENGES. Ashwin Ram, Spyros Matsoukas. ALX320. November 28, 2017
  • 2.
  • 3. CONVERSATION Language Understanding Dialog Management Response Generation Knowledge Alexa, is it hot outside? How about this weekend? Will I need an umbrella? I was thinking of hiking the Dish. It’s sunny and warm today. Clouds are rolling in. It will be overcast. Depends on where you’ll be. I’d take a rain jacket. There’s a 60% chance of rain in the Stafford Hills.
  • 4. NORTH STAR User: Alexa, let’s chat about the Mars mission. Alexa: There are multiple Mars missions, some public and some private. User: Who do you think will succeed? Alexa: I think more than one will succeed. The first one to get to Mars will open the doors for others. User: I’d love to go to Mars. Alexa: Me too. Luckily I can hitch a ride on any computer. User: That’s very funny. Alexa: The biggest challenge isn’t technology, it’s funding. It will cost $500 billion to send humans to Mars. User: With the current political climate, Congress isn’t going to approve this kind of budget. Alexa: It will need to be privately funded. User: Alexa, stop.
  • 5. SCIENCE CHALLENGES: Conversation Is Hard! Speech recognition: free-form conversational ASR. Language understanding: understanding the meaning of utterances. Context modeling: tracking context across multiple turns. Dialog planning: determining the best response. Language generation: deciding how to phrase the response. Personalization: adapting to individual users. Knowledge ingestion: learning real-world context (example: latest news). Common-sense reasoning: making inferences.
  • 6. CUSTOMER EXPERIENCE CHALLENGES Icebreakers: conversation starters. Pauses: topic suggestions. Deep dives: leading conversations. Personal questions: opinions or controversy. Frustration: nonanswers.
  • 7. $2.5M INAUGURAL COMPETITION TO ADVANCE THE FIELD OF CONVERSATIONAL AI. ALEXA, LET’S CHAT. Challenge: create a socialbot that can converse coherently and engagingly on popular topics for 20 minutes. By the numbers: 15 teams. Public launch on May 8; top 10 Alexa skill. 40,000+ hours of conversations; millions of interactions. 3 finalists; 24% improvement in ratings from start to end.
  • 9. THE ALEXA SERVICE SUPPORTED BY TWO POWERFUL FRAMEWORKS Lives in the cloud Automatic speech recognition (ASR) Natural language understanding (NLU) Always learning Alexa Voice Service Unparalleled distribution: AVS allows your content to be everywhere. Alexa Skills Kit Create great content: ASK is how you connect to your consumer.
  • 10. WHAT CAN ALEXA DO? Entertainment: Music, books, video… Household: Timers, to-do lists, shopping, calendar Communication: Messaging, calling Information: Weather, traffic, news… Smart home: Lights, thermostats, switches… 25,000+ skills: Developed by third parties
  • 11. OVERVIEW OF SPOKEN LANGUAGE UNDERSTANDING SYSTEMS
Pipeline: user speech → ASR → text → NLU → labels → dialog manager → actions → TTS → speech output, with a clarification dialog loop back to the user.
Component                              Input               Output                                    Example
Automatic speech recognition (ASR)     Speech              Text (1-best or lattice of alternatives)  “Play ‘Two Steps Behind’ by Def Leppard.”
Natural language understanding (NLU)   Text                Slots and intent type                     Intent: PlayMusicIntent; Slots: Artist name: Def Leppard, Song: Two Steps Behind
Dialog manager (DM)                    Labels and context  Dialog actions, responses                 Ask the application to play the song, or clarify.
Text-to-speech (TTS)                   Text                Speech                                    “Which artist?” or “Playing ‘Two Steps Behind’ by Def Leppard.”
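The four-stage pipeline above can be sketched as a chain of components. This is a minimal illustrative stub, not Alexa's implementation; all of the intent patterns and prompts here are toy placeholders.

```python
# Minimal sketch of an ASR -> NLU -> DM -> TTS chain.
# Every component here is illustrative stub logic, not a real system.

def asr(audio: str) -> str:
    # A real ASR system decodes audio to text; here we just pass the
    # transcript through, lowercased.
    return audio.lower()

def nlu(text: str) -> dict:
    # Toy intent/slot extraction for a single "play <song> by <artist>" form.
    if text.startswith("play "):
        rest = text[len("play "):]
        if " by " in rest:
            song, artist = rest.split(" by ", 1)
            return {"intent": "PlayMusicIntent",
                    "slots": {"song": song, "artist": artist}}
        return {"intent": "PlayMusicIntent", "slots": {"song": rest}}
    return {"intent": "Unknown", "slots": {}}

def dialog_manager(interpretation: dict) -> str:
    # Ask a clarifying question when a required slot is missing.
    if interpretation["intent"] == "PlayMusicIntent":
        slots = interpretation["slots"]
        if "artist" not in slots:
            return "Which artist?"
        return f"Playing '{slots['song']}' by {slots['artist']}."
    return "Sorry, I didn't understand that."

def tts(text: str) -> bytes:
    # A real TTS engine synthesizes a waveform; we just encode the text.
    return text.encode("utf-8")

def handle_utterance(audio: str) -> str:
    return tts(dialog_manager(nlu(asr(audio)))).decode("utf-8")
```

The clarification loop from the slide shows up as the "Which artist?" branch: when NLU cannot fill a required slot, the dialog manager responds with a question instead of an action.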
  • 12. COMMON THEME: DATA-DRIVEN MACHINE LEARNING. Training: input (training data) plus answers (truth) yield models via a trainer. Inference: input plus models yield answers via a decoder. Task: ASR. Input: speech. Output: sequence of words (text). Task: NLU. Input: text. Output: intents and slots. Probabilistic modeling that is robust to noise and ambiguity. Relatively inexpensive ground-truth creation. Portable to new domains and languages.
  • 13. TIME VARYING LIVE PRODUCTION DATA Audio and transcripts Transcripts and NLU labels ASR model validation NLU model validation Unlabeled audio Models Models Models deployed DATA-DRIVEN ML FOR CONTINUOUS IMPROVEMENT Data sampling (active learning) Ground truth generation ASR model training NLU model training
  • 15. FAR-FIELD SPEECH CHALLENGES • Noisy environment (example: room reverberation, ambient noise, background speech) • Large-vocabulary, high-perplexity domains • Difficult to predict spoken forms for catalog entries and their associated pronunciations (example: artist names such as Psy, P!nk) • Acoustically confusable titles (example: “sun daze” vs. “Sundays”)
  • 16. SIGNAL PROCESSING Beamforming: the microphone array forms multiple beams (1–6) to separate the speech direction from noise. Acoustic echo cancellation (AEC) • Applied to each beam • Reduces echo of device playback • (Audio demo: before AEC vs. after AEC)
  • 17. SPEECH RECOGNITION Sound → feature extraction → feature vectors [4.7, 2.3, -1.4, …] → acoustic model → phonetic probabilities [0.1, 0.1, 0.4, …] → decoder (inference, guided by a language model) → words “increase to seventy degrees” → post-processing → text “Increase to 70°”
  • 18. CONVERSATIONAL LANGUAGE MODELING • Conversational free forms are inherently high perplexity • Need to increase language model (LM) capacity • For n-gram models this means increasing n-gram order and number of n-grams (model size) • Can result in increased latency • Designed and developed a new representation for n-gram language models and a new decoder • Can now accommodate over 10x increase in LM size without significant impact on latency
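The slide's point about n-gram order and model size can be made concrete with a toy bigram language model. This is a standard textbook construction with add-one smoothing, not the production representation the slide describes; the training sentences are made up.

```python
import math
from collections import Counter

# Toy bigram language model with add-one (Laplace) smoothing. Conversational
# text is "high perplexity": this model assigns high perplexity to word
# sequences unlike its training data. Capacity grows with n-gram order and
# with the number of distinct n-grams stored, which is the latency/size
# tradeoff the slide refers to.

class BigramLM:
    def __init__(self, sentences):
        self.unigrams = Counter()   # counts of bigram contexts
        self.bigrams = Counter()
        self.vocab = set()
        for s in sentences:
            tokens = ["<s>"] + s.split() + ["</s>"]
            self.vocab.update(tokens)
            self.unigrams.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))

    def prob(self, prev, word):
        # Add-one smoothing gives unseen bigrams nonzero probability.
        return (self.bigrams[(prev, word)] + 1) / \
               (self.unigrams[prev] + len(self.vocab))

    def perplexity(self, sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        logp = sum(math.log(self.prob(p, w))
                   for p, w in zip(tokens, tokens[1:]))
        return math.exp(-logp / (len(tokens) - 1))
```

A sequence matching the training data gets lower perplexity than the same words in an unseen order, which is exactly what the decoder exploits when ranking hypotheses.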
  • 19. SPEAKER ADAPTATION Garimella, et al. “Robust i-Vector Based Adaptation of DNN Acoustic Model for Speech Recognition,” Interspeech 2015 Speech Features Speaker Features Phonetic Probabilities 5% to 7% relative reduction in word error rate compared to speaker independent model
  • 20. SPEECH ENDPOINTING Challenge: low-latency detection of the end of speech. Approach: real-time detection of sentence end, combining acoustic and linguistic information. “Play music by [PAUSE] Sting [PAUSE].” Low end-of-sentence LM probability → endpointProb=LOW; high end-of-sentence LM probability → endpointProb=HIGH. Liu et al., “Accurate Endpointing with Expected Pause Duration”, Interspeech 2015
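The combination of acoustic and linguistic evidence above can be sketched as a simple decision rule. The thresholds and scoring below are made-up illustrative values, not those of the cited system.

```python
# Hypothetical end-of-speech decision combining the observed pause length
# with the language model's end-of-sentence probability, in the spirit of
# the approach on the slide. Both thresholds (300 ms, 900 ms) and the
# 0.5 probability cutoff are invented for illustration.

def should_endpoint(pause_ms: float, eos_lm_prob: float) -> bool:
    # If the LM believes the sentence is likely complete, a short pause is
    # enough to endpoint; if not, require a much longer pause before
    # cutting the user off.
    required_pause_ms = 300 if eos_lm_prob >= 0.5 else 900
    return pause_ms >= required_pause_ms
```

This captures the slide's example: after "Play music by [PAUSE]" the LM's end-of-sentence probability is low, so the recognizer waits; after "... Sting [PAUSE]" it is high, so the same pause length triggers the endpoint.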
  • 21. ANCHORED SPEECH DETECTION Problem: Background speech interfering with device-directed speech • Delayed end-point detection • Word insertion errors “Alexa, play music.” [speaker 1] “Take the dog out.” [speaker 2]
  • 22. ANCHORED SPEECH DETECTION Targets: 0 = non-speech/undesired speech; 1 = desired speech (relative to the anchor word). 19% relative reduction in frame classification error and 9% reduction in WER compared to the baseline DNN. Maas et al., “Anchored Speech Detection”, Interspeech 2016; King et al., “Robust Speech Recognition Via Anchor Word Representations”, Interspeech 2017
  • 24. NATURAL LANGUAGE UNDERSTANDING (NLU) Goal: understand the spoken intent and associated salient elements (aka slots) Challenges • Cross-domain intent recognition (example: “Play remind me.” vs. “Remind me to go to the play.”) • Robustness to ASR errors (example: “Play songs by Prince.” recognized as “Pray songs by Prince.”) • User correction in context: “No, The Rolling Stones.” • Rejecting out-of-domain utterances • Need for high precision and recall, because there is no GUI on which the user can select an action
  • 25. INTENT CLASSIFICATION What’s the weather? Tell me the forecast. Will it rain today? Please, um, tell me the weather for, uh, Seattle. Do I need an umbrella? What’s the expected low temperature today? GetWeatherForecastIntent
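The many-phrasings-to-one-intent mapping on the slide can be illustrated with a toy bag-of-words classifier. Production NLU uses trained statistical models; the nearest-prototype scoring and the tiny training set below are invented for illustration only.

```python
from collections import Counter

# Toy intent classifier: score an utterance against a bag-of-words
# "prototype" built from example phrasings for each intent, and return
# the intent with the largest word overlap. The examples mirror the
# slide's weather phrasings; the music intent is an invented contrast.

TRAINING = {
    "GetWeatherForecastIntent": [
        "what's the weather", "tell me the forecast", "will it rain today",
        "do i need an umbrella", "what's the expected low temperature today",
    ],
    "PlayMusicIntent": [
        "play some music", "play songs by prince", "put on some jazz",
    ],
}

def classify(utterance: str) -> str:
    words = Counter(utterance.lower().split())

    def overlap(intent):
        proto = Counter(w for ex in TRAINING[intent] for w in ex.split())
        # Count how many utterance words appear in the intent's examples.
        return sum(min(words[w], proto[w]) for w in words)

    return max(TRAINING, key=overlap)
```

Even disfluent or indirect phrasings ("Do I need an umbrella?") land on the weather intent because they share vocabulary with its examples, which is the intuition behind statistical intent classification.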
  • 26. NAMED ENTITY RECOGNITION March 27th A week from Sunday Eight days from now Easter Sunday Two days after Good Friday Date
  • 27. ENTITY RESOLUTION Entity type Spoken tokens Entity value WeatherLocationCity LA Los Angeles WeatherDate Tomorrow 2017-11-29 Generate canonical representation for entities by searching in catalogs
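The catalog-search idea on the slide reduces to a lookup from spoken tokens to canonical values. The catalogs below are tiny made-up stand-ins; real entity resolution searches large catalogs with fuzzy matching.

```python
import datetime

# Sketch of entity resolution: map spoken tokens to canonical entity
# values. The city catalog is an invented example; the date resolver
# handles only a couple of relative forms.

CITY_CATALOG = {
    "la": "Los Angeles",
    "l.a.": "Los Angeles",
    "new york": "New York City",
    "nyc": "New York City",
}

def resolve_city(spoken):
    # Canonicalize via catalog lookup; None when unresolvable.
    return CITY_CATALOG.get(spoken.strip().lower())

def resolve_date(spoken, today):
    # Resolve relative dates against a reference date.
    spoken = spoken.strip().lower()
    if spoken == "today":
        return today.isoformat()
    if spoken == "tomorrow":
        return (today + datetime.timedelta(days=1)).isoformat()
    return None
```

With the talk's date as reference, this reproduces the slide's rows: "LA" resolves to "Los Angeles" and "Tomorrow" to "2017-11-29".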
  • 29. CHALLENGES IN TEXT-TO-SPEECH • Homographs: Words written identically that have different pronunciation • “I live in Seattle.” or “This presentation broadcasts live from Seattle.” • Text normalization: Disambiguation of abbreviations, acronyms, units (example: “m” expanded as “minutes” or “miles” or “meters” or even “medium”) • Conversion of text to phonemes (grapheme-to-phoneme) in languages with complex mapping such as English • Variance of pronunciation sometimes depending on geo-location • Foreign words, proper names, slang
  • 30. TEXT-TO-SPEECH APPROACH Text Text normalization Grapheme-to-phoneme conversion Waveform generation Speech She has $20 in her pocket. she has twenty dollars in her pocket ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t
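The text-normalization stage above can be sketched for the slide's "$20" example. This handles only small dollar amounts; a real normalizer covers abbreviations, units, dates, and much more.

```python
import re

# Toy TTS text normalization: expand dollar amounts (0-99) into words
# before grapheme-to-phoneme conversion, matching the slide's example.

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]

def number_to_words(n):
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + ("" if rest == 0 else " " + ONES[rest])
    raise ValueError("only 0-99 supported in this sketch")

def normalize(text):
    def expand(match):
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        return f"{number_to_words(amount)} {unit}"
    # Expand "$<digits>", lowercase, and drop sentence-final punctuation,
    # mirroring the normalized form shown on the slide.
    return re.sub(r"\$(\d+)", expand, text).lower().rstrip(".")
```

Running it on the slide's input reproduces the normalized string that then feeds grapheme-to-phoneme conversion.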
  • 33. DIALOG MANAGEMENT Goal: Understand and satisfy the user’s need through a sequence of interactions, while minimizing notion of accumulated frustration Challenges: • User’s need/goal can evolve through the dialog • Understanding in context; also taking into account user preferences • Coordination with multiple applications/skills may be necessary • Robustness to ASR and NLU errors • Handling ambiguity and eliciting proper clarification feedback from the user • Generating prompts/responses that are natural and engaging • Maintaining a balance between eliciting the required information and introducing friction
  • 34. DIALOG MANAGEMENT Markov Decision Process • S: set of dialog states • A: set of actions available from the current state • π: dialog policy, mapping states to actions • P_a(s' | s): probability distribution over the next state s', given the current state s and action a • R_a(s, s'): reward function (user satisfaction) • γ: discount factor (short-term vs. long-term reward)
  • 35. DIALOG MANAGEMENT Reinforcement Learning • From a sequence of interactions (state, action, reward), estimate the expected cumulative reward given the policy • Over time, balance: • Explore: take actions deemed suboptimal, in the hope of finding better policies • Exploit: take actions that maximize the current expected cumulative reward (deemed optimal)
  • 37. SUMMARY OF APPROACHES Conversational ASR • We created a customized ASR language model with conversational text and speech transcripts • University teams used ASR N-best results for downstream robustness to speech recognition errors Conversational NLU • We introduced “conversational intent” in Alexa NLU to identify utterances such as "Alexa, let’s chat” and “Alexa, would you like to have a conversation about Mars Mission?” • University teams used open-source knowledge bases and graphs (example: Evi, Freebase, Wikidata, IMDB) to extract and link entities • University teams did anaphora resolution, sentence completion (“Yes.”  “Yes, I like rockets.”), named entity extraction, topic linking, and more
  • 38. SUMMARY OF APPROACHES Dialog Management University teams used a range of approaches, including: • Macro- and micro-bot system combination strategies • Micro-bots with varying focus/specialization: • Topics (examples: sports, politics, fashion) • Data (examples: Reddit, Twitter, Washington Post) • Intent (examples: “chitchat,” “opinion,” “knowledge”) • State graph to track dialog context, conversation state, feedback, user sentiment, personalization • Engagement and customer experience modules to help drive conversations, leading to better engagement, longer dialogs, and improved ratings • Detecting and handling profanity and offensive speech
  • 39. SUMMARY OF APPROACHES Response Generation University teams used a range of approaches, including: • Rules: Templated response using AIML, ELIZA systems • Retrieval: Nearest neighbor responses from a response bank (built from Reddit, Twitter, Washington Post) based on similarity using TF-IDF, Word2Vec, skip-thoughts, entity matching • Generative: Learned response generation models using LSTM, memory networks, hierarchical NN, attention and other seq2seq approaches. Effective in handling complex, ambiguous, and phatic responses • Hybrid: Ensemble of retrieval and generative models, such as retrieval of responses followed by ranking using generative model; use of generative model as a backup when retrieval has low confidence • Multi-bot strategies: Different techniques for different micro-bots, such as rules and templates for Q&A, generative for chitchat
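The retrieval approach above (nearest-neighbor responses ranked by TF-IDF similarity) can be sketched over a toy response bank. The bank entries below are invented; the teams built theirs from sources such as Reddit, Twitter, and the Washington Post.

```python
import math
from collections import Counter

# Toy TF-IDF retrieval: each bank entry pairs a context utterance with a
# canned response; retrieval returns the response whose context is most
# similar (cosine over TF-IDF vectors) to the user's utterance.

BANK = [
    ("i love space too rockets are amazing", "Me too! Rockets fascinate me."),
    ("what a great football game last night", "That game was incredible."),
    ("this weather is so gloomy", "I hope the sun comes out soon."),
]

def tfidf(text, idf):
    tf = Counter(text.split())
    return {w: tf[w] * idf.get(w, 0.0) for w in tf}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(utterance):
    contexts = [ctx for ctx, _ in BANK]
    n = len(contexts)
    df = Counter(w for c in contexts for w in set(c.split()))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    query = tfidf(utterance.lower(), idf)
    scores = [(cosine(query, tfidf(ctx, idf)), resp) for ctx, resp in BANK]
    return max(scores)[1]
```

The hybrid strategies mentioned on the slide would then re-rank or replace this retrieved response with a generative model's output when the retrieval score is low.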
  • 40. SUMMARY OF APPROACHES Ranking and Selection • Micro-bots provide multiple candidate responses • Select from the pool of candidate responses leading to highest rating, engagement, and longer conversation • Train a ranker module based on conversations and corresponding ratings for the given state of the conversation • Features used to train the ranker include sentiment, utterance-response coherence, relevance, user feedback, n-gram and topical match • Best strategy: Continually update the ranker based on user feedback and rating using reinforcement learning
  • 41. INCORPORATING FEEDBACK FROM ALEXA USERS AND THE ALEXA PRIZE TEAM • Alexa users were asked to give feedback at the end of each conversation; university teams used this feedback to improve their systems • Report cards were shared with the finalists, providing feedback on: 1. Topic conversational quality 2. Response error rate 3. Coherence 4. Engagement 5. Customer experience gaps • Utterance-response pairs were shared for incorrect responses, as annotated internally • Teams used all of this information to build a better experience during the semifinals and finals phases
  • 42. KEY LEARNINGS • Users were generally interested in conversing with socialbots • Users appreciated acknowledgement of their request from the socialbots, even if it wasn’t supported • Users did not like topics being randomly switched within the conversation • Beyond social conversation about popular topics, some socialbots engaged users in games, quizzes, and jokes
  • 43. EVALUATION METRICS Socialbots from university teams were evaluated based on the following metrics: • Coherence: Turn-level response quality annotations by data analysts • Engagement: User ratings, number of turns, duration • Domain coverage: Entropy analysis across top five competition domains: Entertainment, Politics, Sports, Fashion, Technology • Topical diversity: Analysis of topic frequency, vocabulary, and variation in topics across the conversations for each socialbot • Conversational depth: Ability to have multi-turn conversations about a topic before switching to another topic Finalist selection • Top two finalists selected based on customer ratings • Third finalist selected by Amazon based on above metrics, a large pool of internal evaluators, and science paper reviews
  • 44. RESULTS: KEY METRICS ~24% improvement in quality through the competition, as noted by ratings (on a 1-to-5 scale):
Entire competition:
  Metric                  All socialbots   Finalist socialbots
  Average rating          3.05             3.29
  Response error rate     20.6%            15.1%
  Average #turns          12               13
  Median dialog duration  101 sec          115 sec
Semifinals:
  Metric                  All socialbots   Finalist socialbots
  Average rating          2.92             3.31
  Response error rate     24.2%            16.5%
  Average #turns          11               13
  Median dialog duration  97 sec           125 sec
~41% reduction in ASR-induced failures through improved conversational modeling
  • 45. CONCLUSIONS AND NEXT STEPS Conclusions • Good progress this year, but the problem is far from solved • Customer ratings depend on many factors in addition to ASR and response correctness • High-quality, relevant data is critical to developing a good dialog system • Teams spent a significant amount of effort on the engineering pipeline Next steps • Share this year’s learnings with the research community through published proceedings • Continue improving conversational ASR accuracy through context-dependent modeling • Provide next year’s Alexa Prize contestants with additional engineering support and tools