ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges

AWS re:Invent
THE SCIENCE BEHIND THE ALEXA
PRIZE: MEETING THE AI CHALLENGES
A s h w i n R a m
S p y r o s M a t s o u k a s
A L X 3 2 0
November 28, 2017

CONVERSATION
Language
Understanding
Dialog
Management
Response
Generation
Knowledge
Alexa, is it
hot outside?
How about
This weekend?
Will I need an
umbrella?
I was thinking of
hiking the Dish.
It’s sunny and warm
today.
Clouds are rolling in.
It will be overcast.
Depends on where
you’ll be.
I’d take a rain jacket.
There’s a 60% chance of
rain in the Stafford Hills.

NORTH STAR
User: Alexa, let’s chat about the Mars mission.
Alexa: There are multiple Mars missions, some public and some private.
User: Who do you think will succeed?
Alexa: I think more than one will succeed. The first one to get to Mars will open the doors for others.
User: I’d love to go to Mars.
Alexa: Me too. Luckily I can hitch a ride on any computer.
User: That’s very funny.
Alexa: The biggest challenge isn’t technology, it’s funding. It will cost $500 billion to send humans to Mars.
User: With the current political climate, Congress isn’t going to approve this kind of budget.
Alexa: It will need to be privately funded.
User: Alexa, stop.

Language generation
Deciding how to phrase the response
Personalization
Adapting to individual users
Knowledge ingestion
Learning real world context (example: latest news)
Common sense reasoning
Making inferences
SCIENCE CHALLENGES
Conversation Is Hard!
Speech recognition
Free-form conversational ASR
Language understanding
Understanding meaning of utterances
Context modeling
Tracking context across multiple turns
Dialog planning
Determining the best response

Icebreakers
Conversation starters
Pauses
Topic suggestions
Deep dives
Leading conversations
Personal questions
Opinions or controversy
Frustration
Nonanswers
CUSTOMER EXPERIENCE CHALLENGES

$2.5M INAGURAL COMPETITION TO
ADVANCE
THE FIELD OF CONVERSATIONAL AI
A L E X A ,
L E T ’ S
C H A T .
Challenge
Create a socialbot that can converse coherently
and engagingly on popular topics for 20 minutes
By the numbers
15 teams
Public launch on May 8; top 10 Alexa skill
40,000+ hours of conversations; millions of interactions
3 finalists; 24% improvement in ratings from start to end

ALEXA SPOKEN LANGUAGE
UNDERSTANDING

THE ALEXA SERVICE
SUPPORTED BY TWO POWERFUL FRAMEWORKS
Lives in the cloud
Automatic speech
recognition (ASR)
Natural language
understanding (NLU)
Always learning
Alexa
Voice Service
Unparalleled distribution:
AVS allows your content
to be everywhere.
Alexa
Skills Kit
Create great content:
ASK is how you
connect
to your consumer.

WHAT CAN ALEXA DO?
Entertainment: Music, books, video…
Household: Timers, to-do lists, shopping, calendar
Communication: Messaging, calling
Information: Weather, traffic, news…
Smart home: Lights, thermostats, switches…
25,000+ skills: Developed by third parties

Dialog
manager
ASR NLU TTS
User
speech
Text Labels Actions
Speech
output
Component Input Output Example
Automatic speech
recognition (ASR)
Speech
Text (1-best or lattice of
alternatives)
“Play ‘Two Steps Behind’ by Def Leppard.”
Natural language
understanding (NLU)
Text Slots and intent type
Intent:
PlayMusicIntent
Slots:
Artist name: Def Leppard
Song: Two Steps Behind
Dialog manager
(DM)
Labels and
context
Dialog actions,
responses
Ask the application to play the song or clarify.
Text-to-speech
(TTS)
Text Speech
“Which artist?” or “Playing ‘Two Steps Behind’ by Def
Leppard.”
OVERVIEW OF SPOKEN LANGUAGE
UNDERSTANDING SYSTEMS
Clarification
dialog

COMMON THEME: DATA-DRIVEN
MACHINE LEARNING
Input
Input (training data)
Answers (truth)
Models
Training
Answers
Inference
Input: Speech Output: Sequence of words (text)Task: ASR
Input: Text Output: Intents and slotsTask: NLU
Probabilistic Modeling
That Is Robust to Noise
and Ambiguity
Relatively Inexpensive
Ground Truth Creation
Portable to New
Domains
and Languages
Trainer
Decoder

TIME VARYING LIVE PRODUCTION DATA
Audio and
transcripts
Transcripts and
NLU labels
ASR model
validation
NLU model
validation
Unlabeled audio
Models
Models
Models
deployed
DATA-DRIVEN ML FOR
CONTINUOUS IMPROVEMENT
Data sampling
(active
learning)
Ground truth
generation
ASR model
training
NLU model
training

FAR-FIELD SPEECH CHALLENGES
• Noisy environment (example: room reverberation, ambient noise, background speech)
• Large vocabulary, high perplexity domains
• Difficult to predict spoken forms for catalog entries and their associated pronunciations (example: artist
names such as Psy, P!nk)
• Acoustically confusable titles, “sun daze” or “Sundays”

Noise
Speech
1
2
3
4
5
6
Acoustic echo cancellation (AEC)
• Applied to each beam
• Reduces echo of device playback
• Before AEC: After AEC:
SIGNAL PROCESSING
Beamforming

SPEECH RECOGNITION
Feature
extraction
Acoustic
model
Decoder
(inference)
Post
processing
Feature
vectors
[4.7, 2.3, -1.4, …]
Phonetic
probabilities
[0.1, 0.1, 0.4, …]
Words
increase
to seventy
degrees
Text
Increase
to 70⁰
Sound
Language
model

CONVERSATIONAL LANGUAGE MODELING
• Conversational free forms are inherently high perplexity
• Need to increase language model (LM) capacity
• For n-gram models this means increasing n-gram order
and number of n-grams (model size)
• Can result in increased latency
• Designed and developed a new representation for n-gram language models and a new decoder
• Can now accommodate over 10x increase in
LM size without significant impact on latency

SPEAKER ADAPTATION
Garimella, et al. “Robust i-Vector Based Adaptation of DNN Acoustic Model for Speech Recognition,” Interspeech 2015
Speech
Features
Speaker
Features
Phonetic
Probabilities
5% to 7% relative
reduction in word error
rate compared to
speaker independent
model

SPEECH ENDPOINTING
Challenge
Low latency detection of end of speech
Approach
Real time detection of sentence end
combining acoustic and linguistic information
“Play music by [PAUSE] Sting [PAUSE].”
Liu et al., “Accurate End-Pointing with Expected Pause Duration”, Interspeech 2015
Low end-of-
sentence LM
probability 
endpointProb=LOW
High end-of-sentence
LM probability 
endpointProb=HIGH

ANCHORED SPEECH DETECTION
Problem: Background speech interfering with device-directed speech
• Delayed end-point detection
• Word insertion errors
“Alexa, play music.”
[speaker 1]
“Take the dog out.”
[speaker 2]

ANCHORED SPEECH DETECTION
targets:
0 = non-speech/undesired speech
1 = desired speech
anchor word
19% relative reduction in
frame classification error
and 9% reduction in WER
compared to baseline
DNN
Maas et al., “Anchored Speech Detection”, Interspeech 2016; King et al., “Robust Speech Recognition Via Anchor Word Represen tations”, Interspeech 2017

NATURAL LANGUAGE
UNDERSTANDING

NATURAL LANGUAGE UNDERSTANDING (NLU)
Goal: Understand the spoken intent and associated salient elements (aka slots)
Challenges
• Cross-domain intent recognition (example: “Play remind me.” or “Remind me to go to the play.”)
• Robustness to ASR errors (example: “Play songs by Prince.” or “Pray songs by Prince.”)
• User correction in context, “No, The Rolling Stones.”
• Rejecting out-of-domain utterances
• Need for high precision and recall because there is no reliance on the GUI for user to select an action

INTENT CLASSIFICATION
What’s the weather?
Tell me the forecast.
Will it rain today?
Please, um, tell me the weather for, uh, Seattle.
Do I need an umbrella?
What’s the expected low temperature today?
GetWeatherForecastIntent

NAMED ENTITY RECOGNITION
March 27th
A week from Sunday
Eight days from now
Easter Sunday
Two days after Good Friday
Date

ENTITY RESOLUTION
Entity type Spoken tokens Entity value
WeatherLocationCity LA Los Angeles
WeatherDate Tomorrow 2017-11-29
Generate canonical representation for entities by searching in catalogs

CHALLENGES IN TEXT-TO-SPEECH
• Homographs: Words written identically that have different pronunciation
• “I live in Seattle.” or “This presentation broadcasts live from Seattle.”
• Text normalization: Disambiguation of abbreviations, acronyms, units
(example: “m” expanded as “minutes” or “miles” or “meters” or even “medium”)
• Conversion of text to phonemes (grapheme-to-phoneme) in languages with complex mapping such as
English
• Variance of pronunciation sometimes depending on geo-location
• Foreign words, proper names, slang

TEXT-TO-SPEECH APPROACH
Text
Text normalization
Grapheme-to-phoneme
conversion
Waveform generation
Speech
She has $20 in her pocket.
she has twenty dollars in her pocket
ˈ ʃ i ˈ h æ z ˈ t w ɛ n . t i ˈ d ɑ . ɫ ə ɹ z ˈ ɪ n ˈ h ɝ ɹ ˈ p ɑ . k ə t

DIALOG MANAGEMENT
Goal: Understand and satisfy the user’s need through a sequence of interactions, while minimizing notion of
accumulated frustration
Challenges:
• User’s need/goal can evolve through the dialog
• Understanding in context; also taking into account user preferences
• Coordination with multiple applications/skills may be necessary
• Robustness to ASR and NLU errors
• Handling ambiguity and eliciting proper clarification feedback from the user
• Generating prompts/responses that are natural and engaging
• Maintaining a balance between eliciting the required information and introducing friction

DIALOG MANAGEMENT
Markov Decision Process
• 𝒮: Set of dialog states
• 𝒜: Set of actions from current state
• 𝜋: Dialog policy, mapping states to actions
• 𝑃𝑠𝑠′
𝑎
: Probability distribution over next state 𝑠′
given
current state and action (s,a)
• 𝑅 𝑠𝑠′
𝑎
: Reward function: user satisfaction
• 𝛾: Discount factor (short-term or long-term reward)

DIALOG MANAGEMENT
Reinforcement Learning
• From sequence of interactions
(state, action, reward)
• Estimate expected cumulative
reward given the policy
• Over time,
• Explore: Take actions that are deemed
suboptimal in hope of finding better
policies
• Exploit: Take actions that maximize the
current expected cumulative reward
(deemed optimal)

SUMMARY OF APPROACHES
Conversational ASR
• We created a customized ASR language model with conversational text and speech transcripts
• University teams used ASR N-best results for downstream robustness to speech recognition errors
Conversational NLU
• We introduced “conversational intent” in Alexa NLU to identify utterances such as
"Alexa, let’s chat” and “Alexa, would you like to have a conversation about Mars Mission?”
• University teams used open-source knowledge bases and graphs (example: Evi, Freebase, Wikidata, IMDB)
to extract and link entities
• University teams did anaphora resolution, sentence completion (“Yes.”  “Yes, I like rockets.”), named entity
extraction, topic linking, and more

Dialog Management
University teams used a range of approaches, including:
• Macro- and micro-bot system combination strategies
• Micro-bots with varying focus/specialization:
• Topics (examples: sports, politics, fashion)
• Data (examples: Reddit, Twitter, Washington Post)
• Intent (examples: “chitchat,” “opinion,” “knowledge”)
• State graph to track dialog context, conversation state, feedback, user sentiment, personalization
• Engagement and customer experience modules to help drive conversations, leading to better engagement,
longer dialogs, and improved ratings
• Detecting and handling profanity and offensive speech

Response Generation
University teams used a range of approaches, including:
• Rules: Templated response using AIML, ELIZA systems
• Retrieval: Nearest neighbor responses from a response bank (built from Reddit, Twitter, Washington Post)
based on similarity using TF-IDF, Word2Vec, skip-thoughts, entity matching
• Generative: Learned response generation models using LSTM, memory networks,
hierarchical NN, attention and other seq2seq approaches. Effective in handling complex,
ambiguous, and phatic responses
• Hybrid: Ensemble of retrieval and generative models, such as retrieval of responses followed by ranking
using generative model; use of generative model as a backup when retrieval has low confidence
• Multi-bot strategies: Different techniques for different micro-bots, such as rules and templates for Q&A,
generative for chitchat

Ranking and Selection
• Micro-bots provide multiple candidate responses
• Select from the pool of candidate responses leading to highest rating, engagement, and longer conversation
• Train a ranker module based on conversations and corresponding ratings for the given state of the
conversation
• Features used to train the ranker include sentiment, utterance-response coherence, relevance, user feedback,
n-gram and topical match
• Best strategy: Continually update the ranker based on user feedback and rating using reinforcement learning

• Alexa users were asked to give feedback at the end
of each conversation. University teams used such
feedback to improve their system
• Report cards were shared with the finalists, in
which feedback on the following areas was
provided:
1. Topic conversational quality
2. Response error rate
3. Coherence
4. Engagement
5. Customer experience gaps
• Utterance-response pairs were shared in case of
incorrect responses as annotated internally
• Teams have used all this information to build a
better experience during semifinals and finals phase
INCORPORATING FEEDBACK
FROM ALEXA USERS AND AP TEAM

KEY LEARNINGS
• Users were generally interested in conversing with socialbots
• Users appreciated acknowledgement of their request from the socialbots, even if it wasn’t supported
• Users did not like topics being randomly switched within the conversation
• Beyond social conversation about popular topics, some socialbots engaged users in games, quizzes, and
jokes

EVALUATION METRICS
Socialbots from university teams were evaluated based on the following metrics:
• Coherence: Turn-level response quality annotations by data analysts
• Engagement: User ratings, number of turns, duration
• Domain coverage: Entropy analysis across top five competition domains: Entertainment, Politics, Sports,
Fashion, Technology
• Topical diversity: Analysis of topic frequency, vocabulary, and variation in topics across the conversations
for each socialbot
• Conversational depth: Ability to have multi-turn conversations about a topic before switching to another
topic
Finalist selection
• Top two finalists selected based on customer ratings
• Third finalist selected by Amazon based on above metrics, a large pool of internal evaluators, and science
paper reviews

~24% improvement in quality through the competition
as noted by ratings (on a 1 to 5 scale):
KEY METRICSRESULTS
Metric
All
socialbots
Finalist
socialbots
Average rating 3.05 3.29
Response error rate 20.6% 15.1%
Average #turns 12 13
Median dialog
duration
101 sec 115 sec
Metric
All
socialbots
Finalist
socialbots
Average rating 2.92 3.31
Response error rate 24.2% 16.5%
Average #turns 11 13
Median dialog
duration
97 sec 125 sec
Entire competition:
Semifinals:
~41% reduction in ASR-induced failures through
improved conversational modeling

CONCLUSIONS AND NEXT STEPS
Conclusions
• Good progress this year, but the problem is far from solved
• Customer ratings depend on many factors in addition to ASR and response correctness
• High quality, relevant data is critical to developing a good dialog system
• Teams spent a significant amount of effort on engineering pipeline
Next steps
• Share this year’s learnings with the research community through published proceedings
• Continue improving conversational ASR accuracy through context dependent modeling
• Provide next year’s Alexa Prize contestants with additional engineering support and tools

ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Ähnlich wie ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges

Ähnlich wie ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges (20)

Mehr von Amazon Web Services

Mehr von Amazon Web Services (20)

ALX320_The Science Behind the Alexa Prize Meeting The AI Challenges