1
Errudite: Scalable, Reproducible, and
Testable Error Analysis
Tongshuang (Sherry) Wu @tongshuangwu
University of Washington
Marco Tulio Ribeiro
Microsoft Research
Jeffrey Heer @jeffrey_heer
Daniel S. Weld @dsweld
University of Washington
2
Motivation & Contributions
3
Error analysis is important for:
Uncovering bugs
Improving the state of the art
Safeguarding deployments
4
Where We Are
“We performed an error analysis on a sample of 100 questions.” (Fader et al., ACL’13)
“We sample 100 incorrect predictions and try to find common error categories.” (Chen et al., ACL’16)
“We randomly select 50 incorrect questions and categorize them into 6 classes.” (Wadhwa et al., ACL’18)
5
Where We Are
“We randomly select 50-100 incorrect questions (based on EM) and
roughly label them into N error groups.”
7
Where We Are & Our Contribution
“We randomly select 50-100 incorrect questions (based on EM) and
roughly label them into N error groups.”
Biased conclusion due to:
Subjectively defined hypotheses
+
Small samples
+
Focus exclusively on errors
+
No test of the true cause
Principles & Errudite:
Precise & reproducible hypotheses
+
Scale up to the entire dev set
+
Cover errors & correct instances
+
Test via counterfactual analysis
10
Design & Use Scenario
Examine the distractor hypothesis on BiDAF (Seo et al., 2016), with SQuAD (10,570 instances; Rajpurkar et al., 2016)
Independently tested by 4 (out of 10) participants in the user study
11
The Scenario: Analyze BiDAF
In the context of Machine Comprehension (MC)
Context: John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
Question: Who created the 2005 theme for Doctor Who?
Answer: Murray Gold
12
Why the incorrect prediction?
Question: Who created the 2005 theme for Doctor Who?
Context: John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
13
Why the mistake? Distractor hypothesis
Context: John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
Question: Who created the 2005 theme for Doctor Who?
BiDAF:
Matches entity types: knows to find a PERSON
Does not find the exact answer span: distracted by other PERSON spans
14
Does BiDAF make similar distractor mistakes often?
15
Biased conclusion due to: subjectively defined hypotheses
Too ambiguous to reproduce
Example context (prediction and groundtruth spans highlighted on the slide): “… started his writing journey during his college year. But, he didn’t finish his first book until 1996.”
16
What is the distractor hypothesis?
The groundtruth and the prediction:
D1: Are roughly the same entity type
D2: Have the same recognizable entity type
DATE (YEAR) matches “when”!
Not a recognizable DATE!
17
Biased conclusion due to: subjectively defined hypotheses
Errudite: quantify instances with a domain-specific language
18
Precise DSL (Domain-Specific Language)
DSL = Attribute Extractor + Target + Operators
Extract: instance attributes
Filter: instance groups
Example: length(q) > 20
19
Build distractor groups with DSL
ENT(g) != ""
and count(token(c, pattern=ENT(g))) >
count(token(g, pattern=ENT(g)))
and ENT(g) == ENT(p(m))
and f1(m) == 0
20
Build distractor groups with DSL
is_entity: ENT(g) != ""
“The groundtruth is an ENTity.”
ENT(Murray Gold) == PERSON
21
Build distractor groups with DSL
has_distractor: count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g)))
“There are more tokens matching the groundtruth entity type (ENT(g)) in the whole context than in the groundtruth.”
count(PERSON : Murray Gold, John Debney, Ron Grainer) == 3
count(PERSON : Murray Gold) == 1
22
Build distractor groups with DSL
correct_type: ENT(g) == ENT(p(m))
“The model prediction ENTity type matches the groundtruth ENTity type.”
ENT(John Debney) == PERSON
23
Build distractor groups with DSL
is_distracted: f1(m) == 0
“The model prediction is incorrect.”
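The four clauses can be read as nested filters. The following is a minimal Python sketch of that reading, not Errudite's implementation: the `Instance` class, the `ent` helper, and the hand-written entity table are all hypothetical stand-ins for a real named-entity tagger, and entities are counted whole (as in the PERSON count on the slide) rather than token by token.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Toy stand-in for one QA instance (hypothetical structure)."""
    context_entities: dict  # entity text -> entity type, for the whole context c
    groundtruth: str        # g
    prediction: str         # p(m)
    f1: float               # f1(m)


def ent(inst, span):
    """ENT(span): the entity type of a span, or "" if it is not a known entity."""
    return inst.context_entities.get(span, "")


def is_entity(inst):
    # ENT(g) != ""
    return ent(inst, inst.groundtruth) != ""


def has_distractor(inst):
    # More entities of the groundtruth type in the context than in the groundtruth.
    g_type = ent(inst, inst.groundtruth)
    in_context = sum(1 for t in inst.context_entities.values() if t == g_type)
    in_groundtruth = 1  # assumption: the groundtruth is a single entity
    return is_entity(inst) and in_context > in_groundtruth


def correct_type(inst):
    # ENT(g) == ENT(p(m))
    return has_distractor(inst) and ent(inst, inst.groundtruth) == ent(inst, inst.prediction)


def is_distracted(inst):
    # f1(m) == 0
    return correct_type(inst) and inst.f1 == 0


doctor_who = Instance(
    context_entities={
        "John Debney": "PERSON", "Ron Grainer": "PERSON", "Murray Gold": "PERSON",
        "1996": "DATE", "2005": "DATE", "1963": "DATE",
    },
    groundtruth="Murray Gold",
    prediction="John Debney",
    f1=0.0,
)
print(is_entity(doctor_who), has_distractor(doctor_who),
      correct_type(doctor_who), is_distracted(doctor_who))
```

Because each group builds on the previous one, the same functions also pick out the correct instances in `correct_type`, which matters once the analysis covers more than errors.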
24
Build distractor groups with DSL
Groups: is_entity, has_distractor, correct_type, is_distracted
[Chart legend: Correct / Incorrect]
D1: Are roughly the same entity type
D2: Have the same recognizable entity type
25
Biased conclusion due to: small samples
Errudite: scale up to the entire dev set
100 << 2000+ errors in total
26
Build distractor groups with DSL
[Chart: correct vs. incorrect instances in is_entity, has_distractor, correct_type, is_distracted]
5.7% of all BiDAF errors:
The distractor hypothesis seems correct!
28
Biased conclusion due to: focus exclusively on errors
Wrongly prioritizes groups that are well handled on average.
Errudite: cover errors & correct instances
29
Build distractor groups with DSL
[Chart: correct vs. incorrect instances in all_instance, is_entity, has_distractor, correct_type, is_distracted]
88% EM > 68% EM:
BiDAF performs better when there are distractors and the entity type is matched than it does overall.
Reject / revise the hypothesis!
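The comparison behind this conclusion can be sketched as follows. This is an illustrative Python sketch on made-up data, not the study's numbers: the `em` and `correct_type` fields and the helper names are assumptions. It reports a group's exact-match rate next to the overall rate, plus the group's share of all errors, so that a group with many errors is not mistaken for a group the model handles badly.

```python
def exact_match_rate(instances):
    """Fraction of instances whose prediction exactly matches the groundtruth."""
    if not instances:
        return 0.0
    return sum(inst["em"] for inst in instances) / len(instances)


def summarize_group(all_instances, in_group):
    """Compare one group against the whole dataset, errors and correct alike."""
    group = [inst for inst in all_instances if in_group(inst)]
    errors = [inst for inst in all_instances if not inst["em"]]
    group_errors = [inst for inst in group if not inst["em"]]
    return {
        "group_em": exact_match_rate(group),
        "overall_em": exact_match_rate(all_instances),
        "share_of_errors": len(group_errors) / len(errors) if errors else 0.0,
    }


# Hypothetical toy data: 4 instances in the group, 6 outside it.
toy = (
    [{"correct_type": True, "em": True}] * 3
    + [{"correct_type": True, "em": False}] * 1
    + [{"correct_type": False, "em": True}] * 3
    + [{"correct_type": False, "em": False}] * 3
)
summary = summarize_group(toy, lambda inst: inst["correct_type"])
print(summary)
```

In this toy data the group holds a quarter of the errors yet scores above the overall rate, which is the same pattern the slide reports for BiDAF.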
30
Biased conclusion due to: small samples + focus exclusively on errors
Errudite: scale up to the entire dev set + cover errors & correct instances
31
Context: John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
Question: Who created the 2005 theme for Doctor Who?
is_distracted
Distractor entity? Multi-sentence reasoning?
HAS distractor prediction
!=
IS WRONG due to distractor prediction
32
Biased conclusion due to: no test of the true cause
HAS distractor prediction != IS WRONG due to distractor prediction
33
Are the 192 instances really wrong because of the distractor?
Would BiDAF work perfectly if we remove the distractors?
Answer what-if questions with counterfactual analysis!
34
Counterfactual Analysis with Rewrite Rules
rewrite(target, from → to)
Rewrite the target part of an instance by replacing from with to
38
Counterfactual Analysis with Rewrite Rules
Would BiDAF work perfectly if we remove the distractors?
rewrite(c, string(p(m)) → "#")
Rewrite the context part of an instance by replacing the model-predicted distractor string with a placeholder token “#”
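As a rough illustration of what this rule does to a single instance, here is a minimal Python sketch. The function name `rewrite_context` is hypothetical and plain string replacement is an assumption; the actual tool may match spans differently.

```python
def rewrite_context(context, predicted, placeholder="#"):
    """Replace every occurrence of the predicted string in the context.

    Mirrors rewrite(c, string(p(m)) -> "#") under the assumption that
    a literal string match is good enough to locate the prediction.
    """
    return context.replace(predicted, placeholder)


context = (
    "John Debney created a new arrangement of Ron Grainer's original theme "
    "for Doctor Who in 1996. For the return of the series in 2005, "
    "Murray Gold provided a new arrangement."
)
rewritten = rewrite_context(context, "John Debney")
print(rewritten)
```

The rewritten context is then fed back to the model with the unchanged question, and the new prediction is compared with the old one.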
39
Counterfactual Analysis with Rewrite Rules
Would BiDAF work perfectly if we remove the distractors?
rewrite(c, string(p(m)) → "#")
Q: Who created the 2005 theme for Doctor Who?
C: … # created a new arrangement of Ron Grainer’s … Murray Gold provided a new arrangement … (“John Debney” rewritten to “#”)
Incorrect → Incorrect
Another distractor is still confusing the model!
42
Counterfactual Analysis with Rewrite Rules
rewrite(c, string(p(m)) → "#")
p(m) for the 192 rewritten is_distracted instances are:
29% Incorrect → Incorrect: Another distractor is still confusing the model!
48% Incorrect → Correct: The distractor was fooling the model!
23% Unchanged: Other factors are at play! Example: “age of 18, 10.5% # from 18 to 24 …”
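The three outcome categories can be tallied with a short script. This is a sketch on hypothetical records, not the study's data: it assumes every original prediction was wrong (as in is_distracted), uses exact string equality as the correctness check, and treats a new prediction equal to the old one as "unchanged".

```python
from collections import Counter


def classify_outcome(old_prediction, new_prediction, groundtruth):
    """Label what happened to an originally incorrect prediction after the rewrite."""
    if new_prediction == groundtruth:
        return "incorrect -> correct"    # the distractor was fooling the model
    if new_prediction == old_prediction:
        return "unchanged"               # other factors are at play
    return "incorrect -> incorrect"      # another distractor took over


def outcome_shares(records):
    """Share of each outcome over (old prediction, new prediction, groundtruth) triples."""
    counts = Counter(classify_outcome(*record) for record in records)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical triples for illustration only.
records = [
    ("John Debney", "Ron Grainer", "Murray Gold"),
    ("John Debney", "Murray Gold", "Murray Gold"),
    ("18 to 24", "18 to 24", "under the age of 18"),
    ("1996", "2005", "2005"),
]
shares = outcome_shares(records)
print(shares)
```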

43
Biased conclusion due to: no test of the true cause
Errudite: test via counterfactual analysis
44
Deliverables: Precise + Reproducible + Re-applicable
Attribute: ENT(g)
Groups: all_instance, is_entity, has_distractor, correct_type, is_distracted
Rewrite rule: rewrite(c, string(p(m)) → "#")
45
Deliverables: Precise + Reproducible + Re-applicable
Attribute + Groups + Rewrite rule, applied to BiDAF:
BiDAF is not particularly bad at distractors.
Seemingly distractor errors can be due to other factors.
46
Deliverables: Precise + Reproducible + Re-applicable
Attribute + Groups + Rewrite rule, re-applied to other datasets & other models:
… ? at handling distractors.
47
Biased conclusion due to:
Subjectively defined hypotheses
+
Small samples
+
Focus exclusively on errors
+
No test of the true cause
Principles & Errudite:
Precise & reproducible hypotheses
+
Scale up to the entire dev set
+
Cover errors & correct instances
+
Test via counterfactual analysis
48
Errudite improves error analysis:
Precise
Reproducible
Scalable
Testable
Better error analysis will then:
Uncover bugs
Improve the state of the art
Safeguard deployments
49
Thank you!
Video demo: https://tinyurl.com/errudite-video
Opensource: https://github.com/uwdata/errudite
50
Backup
51
Domain Specific Language (DSL)
52
Precise DSL (Domain-Specific Language)
Qualitative description: The model is bad on long questions
Quantitative description: questions with more than 20 tokens, i.e. length(q) > 20
Attribute Extractor: length, question_type, answer_type
Target: question, context, groundtruth, prediction (model), token, sentence
Operators: >, !=, in, has_any
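To make the extractor / target / operator split concrete, here is a small Python sketch of the `length(q) > 20` filter. It is only an illustration: whitespace tokenization and the `question` dictionary key are assumptions, and Errudite's own tokenizer may count tokens differently.

```python
def length(text):
    """Attribute extractor: number of whitespace-separated tokens (an assumption)."""
    return len(text.split())


def long_questions(instances, threshold=20):
    """Filter: keep instances whose question (the target) satisfies length(q) > threshold."""
    return [inst for inst in instances if length(inst["question"]) > threshold]


instances = [
    {"question": "Who created the 2005 theme for Doctor Who?"},
    {"question": " ".join(["word"] * 21)},  # hypothetical 21-token question
]
selected = long_questions(instances)
print(len(selected))
```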
53
DSL (Domain-Specific Language) Attribute Extractor
Basic attributes: length
General-purpose linguistic features: LEMMA, POS, ENT
Standard prediction performance metrics: f1, accuracy
Between-target relations: overlap(t1, t2)
Domain-specific attributes: answer_type, question_type
54
DSL (Domain-Specific Language) Attribute Extractor
Context: John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
Question: Who created the 2005 theme for Doctor Who?
55
User Interface Functionality
56
Suggestions via programming-by-demonstration
Context: John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.
Question: Who created the 2005 theme for Doctor Who?
starts_with(p(m), pattern="NNP")
starts_with(p(m), pattern="PERSON")
answer_type(g) == answer_type(p(m))
exact_match(m) == 0
is_correct_sent(m) == False
overlap(q, sentence(p(m))) > overlap(q, sentence(g))
57
Suggestions via programming-by-demonstration
“Who” → “What person”: What person created the 2005 theme for Doctor Who?
58
Check the distribution of attribute ENT(g) in groups
[Chart: ENT(g) distribution in is_entity and is_distracted; legend: Correct / Incorrect]
62
Additional Case on Preciseness
63
User study: What are imprecise answer boundaries?
“The model is making predictions with missing or additional words …?”
D1: No exact match, but high overlap
exact_match(p(m)) == 0
and f1(p(m)) > 0.7
D2: Off by at most 2 tokens both on the left and right
exact_match(p(m)) == 0
and abs(answer_offset(p(m),"left")) <= 2
and abs(answer_offset(p(m),"right")) <= 2
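The two definitions sound alike but select different instances. The Python sketch below shows one way they can disagree. It rests on assumptions: spans are half-open `(start, end)` token indices into the context, F1 is computed from positional overlap, and `answer_offset` is taken to be the signed difference between predicted and groundtruth boundaries. None of these are confirmed details of the DSL.

```python
def span_f1(pred, gold):
    """Token-level F1 between two (start, end) half-open token spans."""
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    if overlap == 0:
        return 0.0
    precision = overlap / (pred[1] - pred[0])
    recall = overlap / (gold[1] - gold[0])
    return 2 * precision * recall / (precision + recall)


def d1_high_overlap(pred, gold):
    """D1: no exact match, but F1 above 0.7."""
    return pred != gold and span_f1(pred, gold) > 0.7


def d2_small_offset(pred, gold):
    """D2: no exact match, and each boundary is off by at most 2 tokens."""
    return (pred != gold
            and abs(pred[0] - gold[0]) <= 2
            and abs(pred[1] - gold[1]) <= 2)


# Short answer, one extra token on each side: small offsets, low F1.
short_pred, short_gold = (9, 13), (10, 12)
# Long answer, three tokens missing on the left: large offset, high F1.
long_pred, long_gold = (13, 30), (10, 30)

print(d1_high_overlap(short_pred, short_gold), d2_small_offset(short_pred, short_gold))
print(d1_high_overlap(long_pred, long_gold), d2_small_offset(long_pred, long_gold))
```

Short answers tend to fall under D2 only and long answers under D1 only, which is one plausible reason two analysts working from the same verbal description end up with groups of different sizes.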
64
User study: What are imprecise answer boundaries?
Example (prediction and groundtruth spans highlighted on the slide):
… the polynomial time hierarchy collapses.
… believed that the polynomial hierarchy does..
66
User Study
10 participants = NLP graduate students + QA engineers from industry
Examine BiDAF (Seo et al., 2016) on SQuAD (Rajpurkar et al., 2016)
One-hour session: Replicate prior error analysis + Freely explore the model
67
Study #1 Replication: Is Errudite flexible enough?
Read BiDAF error analysis: 50 errors, hand-labeled into 6 classes
Recreate 4 classes with Errudite on the entire dataset
Rate closeness: Recreated groups == originals (semantically)?
68
Users were able to express their intended groups
well with Errudite.
69
Study #1 Replication: Easy group ≠ reproducible!
How close does the approximation match the paper definition?
Most confident, an easy group
[Chart: Closeness rating (1 to 5) for the Boundary group]
How many errors are covered by user-built Imprecise Error Boundary?
13.8% to 45.8%: groups with low inter-agreement!
[Chart: Error coverage (0% to 50%) for the Boundary group]
70
Study #1 Replication: Easy group ≠ reproducible!
D1: No exact match, but high overlap
exact_match(m) == 0
and f1(m) > 0.7
Coverage = 13.8%
D2: Off by at most 2 tokens both on the left and right
exact_match(m) == 0
and abs(answer_offset(p(m),"left")) <= 2
and abs(answer_offset(p(m),"right")) <= 2
Coverage = 22.1%
72
Study #1 Replication: Easy group ≠ reproducible!
D1: No exact match, but high overlap (Coverage = 13.8%)
D2: Off by at most 2 tokens both on the left and right (Coverage = 22.1%)
Example spans, each marked D1 / D2 on the slide:
… commercial, scientific, and cultural growth
… from Karakorum in Mongolia to Khanbaliq …
… the polynomial time hierarchy collapses.
… believed that the polynomial hierarchy does..
73
Users were able to express their intended groups
well with Errudite.
Ambiguous manual labels prevent consistent
replication, even when users thought they had succeeded!
74
Biased conclusion due to: subjectively defined group
Errudite: precise & reproducible grouping
75
Study #2 Exploration: Is Errudite useful enough?
Freely explore BiDAF with Errudite, think aloud
Describe their observations / insights on BiDAF
Rate insights on importance, confidence, relative easiness
76
Study #2 Exploration: Is Errudite useful enough?
Users reported μ = 2.1, σ = 0.94 findings:
Confirmed prior hypotheses
Extended previous knowledge
Rejected prior hypotheses
Users rated their insights on a 1 to 5 scale.
[Chart: Score (1 to 5) for Importance, Fidelity, Easiness, Quality]
Users learned more about the model (μ = 3.9, σ = 0.94).
77
User Feedback: Did they like Errudite?
Enhanced their error analysis experience.
Systematically scaled up the analysis.
Precise and inspiring more confidence.
Much faster exploration.

My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 

KĂŒrzlich hochgeladen (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024SQL Database Design For Developers at php[tek] 2024
SQL Database Design For Developers at php[tek] 2024
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Google AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAGGoogle AI Hackathon: LLM based Evaluator for RAG
Google AI Hackathon: LLM based Evaluator for RAG
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 

Rsqrd AI: Errudite- Scalable, Reproducible, and Testable Error Analysis

  • 1. 1 Errudite: Scalable, Reproducible, and Testable Error Analysis Tongshuang (Sherry) Wu @tongshuangwu University of Washington Marco Tulio Ribeiro Microsoft Research Jeffrey Heer @jeffrey_heer Daniel S. Weld @dsweld University of Washington
  • 3. 3 Error analysis is important for: uncovering bugs, improving the state of the art, and safeguarding deployments.
  • 4. 4 Where We Are: “We performed an error analysis on a sample of 100 questions” (Fader et al., ACL’13). “We sample 100 incorrect predictions and try to find common error categories.” (Chen et al., ACL’16). “We randomly select 50 incorrect questions and categorize them into 6 classes.” (Wadhwa et al., ACL’18).
  • 5. 5 Where We Are, in short: “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.”
  • 6. 6 Where We Are: “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” Biased conclusion due to: subjectively defined hypotheses + small samples + focus exclusively on errors + no test on true cause.
  • 7. 7 Where We Are & Our Contribution. Biased conclusion due to: subjectively defined hypotheses + small samples + focus exclusively on errors + no test on true cause. Principles & Errudite: precise & reproducible hypotheses + scale up to the entire dev set + cover errors & correct instances + test via counterfactual analysis.
  • 10. 10 Design & Use Scenario: examine the distractor hypothesis on BiDAF (Seo et al., 2016) with SQuAD (10570 instances; Rajpurkar et al., 2016). The hypothesis was independently tested by 4 (out of 10) participants in the user study.
  • 11. 11 The Scenario: Analyze BiDAF in the context of Machine Comprehension (MC). Context: “...John Debney created a new arrangement of Ron Grainer’s original theme for Doctor Who in 1996. For the return of the series in 2005, Murray Gold provided a new arrangement... featured samples from the 1963 original.” + Question: “Who created the 2005 theme for Doctor Who?” ↓ Answer: Murray Gold.
  • 12. 12 Why the incorrect prediction? Question: “Who created the 2005 theme for Doctor Who?” Context: the Doctor Who passage from slide 11, where the groundtruth is Murray Gold but BiDAF predicts John Debney.
  • 13. 13 Why the mistake? Distractor hypothesis, on the same Doctor Who question and context. BiDAF matches entity types: it knows to find a PERSON. Finding the exact answer span: BiDAF is distracted by other PERSON spans.
  • 14. 14 Why the incorrect prediction? Question: “Who created the 2005 theme for Doctor Who?” Does BiDAF make similar distractor mistakes often?
  • 15. 15 Biased conclusion due to subjectively defined hypotheses: “We randomly select 50-100 incorrect questions (based on EM) and roughly label them into N error groups.” is too ambiguous to reproduce.
  • 16. 16 What is the distractor hypothesis? Two readings (D1, D2) of “the groundtruth and the prediction...”: “are roughly the same entity type” vs. “have the same recognizable entity type”. Example: “...started his writing journey during his college year. But, he didn’t finish his first book until 1996.”, with one span marked as the prediction and another as the groundtruth. Slide annotations: “DATE (YEAR) matches ‘when’!” and “Not a recognizable DATE!”
  • 17. 17 Biased conclusion due to subjectively defined hypotheses → Errudite: precise & reproducible hypotheses. Quantify instances with a domain-specific language.
  • 18. 18 Precise DSL (Domain-Specific Language): DSL = Attribute Extractor + Target + Operators. Extract: instance attributes. Filter: instance groups. Example: length(q) > 20.
  • 19. 19 Build distractor groups with DSL: ENT(g) != "" and count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))) and ENT(g) == ENT(p(m)) and f1(m) == 0
  • 20. 20 Build distractor groups with DSL, group is_entity: ENT(g) != "". “The groundtruth is an ENTity.” Example: ENT(Murray Gold) == PERSON.
  • 21. 21 Build distractor groups with DSL, group has_distractor: count(token(c, pattern=ENT(g))) > count(token(g, pattern=ENT(g))). “There are more tokens matching the groundtruth entity type (ENT(g)) in the whole context than in the groundtruth.” Example: count(PERSON: Murray Gold, John Debney, Ron Grainer) == 3 in the context vs. count(PERSON: Murray Gold) == 1 in the groundtruth.
  • 22. 22 Build distractor groups with DSL, group correct_type: ENT(g) == ENT(p(m)). “The model prediction ENTity type matches the groundtruth ENTity type.” Example: ENT(John Debney) == PERSON.
  • 23. 23 Build distractor groups with DSL, group is_distracted: f1(m) == 0. “The model prediction is incorrect.”
  • 24. 24 Build distractor groups with DSL: the groups is_entity, has_distractor, correct_type, and is_distracted are each shown split into correct and incorrect predictions. Of the two readings of the hypothesis (D1, D2), the DSL pins down “have the same recognizable entity type” rather than “are roughly the same entity type”.
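The four clauses of the distractor filter can be mimicked in plain Python to see how each group narrows the previous one. This is only a minimal sketch, not Errudite's API: the `Instance` class, the `ent` helper, the pre-tagged entity lists, and the assumption that the groups are cumulative are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    # Entity types are assumed to be pre-tagged; a real tool would run an NER model.
    context_ents: List[str]      # entity types found in the context
    groundtruth_ents: List[str]  # entity types found in the groundtruth span
    prediction_ents: List[str]   # entity types found in the predicted span
    f1: float                    # F1 of the prediction against the groundtruth


def ent(tags: List[str]) -> str:
    """Toy stand-in for ENT(...): the single entity type of a span, or "" if none."""
    return tags[0] if len(set(tags)) == 1 else ""


def is_entity(x: Instance) -> bool:
    # ENT(g) != ""
    return ent(x.groundtruth_ents) != ""


def has_distractor(x: Instance) -> bool:
    # More spans of the groundtruth's type in the context than in the groundtruth.
    t = ent(x.groundtruth_ents)
    return is_entity(x) and x.context_ents.count(t) > x.groundtruth_ents.count(t)


def correct_type(x: Instance) -> bool:
    # ENT(g) == ENT(p(m))
    return has_distractor(x) and ent(x.groundtruth_ents) == ent(x.prediction_ents)


def is_distracted(x: Instance) -> bool:
    # f1(m) == 0
    return correct_type(x) and x.f1 == 0


# Doctor Who example: three PERSON spans in the context, one in the groundtruth.
doctor_who = Instance(
    context_ents=["PERSON", "PERSON", "DATE", "DATE", "PERSON", "DATE"],
    groundtruth_ents=["PERSON"],
    prediction_ents=["PERSON"],
    f1=0.0,
)
# Book example: the prediction is not a recognizable DATE, so it is not in the group.
book = Instance(context_ents=["DATE"], groundtruth_ents=["DATE"], prediction_ents=[], f1=0.0)

print(is_distracted(doctor_who), is_distracted(book))
```

Because every group is a deterministic predicate, two analysts running it over the same dev set get the same members, which is the reproducibility the slides argue for.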
  • 25. 25 Biased conclusion due to subjectively defined hypotheses + small samples → Errudite: precise & reproducible hypotheses + scale up to the entire dev set. A sample of 100 << 2000+ errors in total.
  • 26. 26 Build distractor groups with DSL: the is_distracted group covers 5.7% of all BiDAF errors. The distractor hypothesis seems correct!
  • 27. 27 Biased conclusion due to focusing exclusively on errors: this wrongly prioritizes groups that are well handled on average.
  • 28. 28 Biased conclusion due to focusing exclusively on errors (wrongly prioritizing groups that are well handled on average) → Errudite: cover errors & correct instances.
  • 29. 29 Build distractor groups with DSL (groups: is_entity, has_distractor, correct_type, is_distracted, all_instance; correct vs. incorrect): 88% EM > 68% EM. BiDAF performs better when there are distractors and the entity type is matched than it does overall. Reject / revise the hypothesis!
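The comparison on this slide amounts to computing exact match (EM) inside a group and over all instances, rather than looking at errors alone. Below is a minimal sketch on made-up toy data; the function name `exact_match_rate` and the dictionary keys are hypothetical. The last lines are a rough consistency check, assuming the rounded 68% EM and the 192 is_distracted instances mentioned on slide 33, so the derived error count is approximate rather than a figure from the slides.

```python
def exact_match_rate(instances, in_group=lambda x: True):
    """EM over the instances selected by in_group (all instances by default)."""
    selected = [x for x in instances if in_group(x)]
    return sum(x["em"] for x in selected) / len(selected)


# Toy data: 8 instances with a type-matched distractor, 12 without.
toy = (
    [{"em": 1, "typed_distractor": True}] * 7
    + [{"em": 0, "typed_distractor": True}] * 1
    + [{"em": 1, "typed_distractor": False}] * 5
    + [{"em": 0, "typed_distractor": False}] * 7
)

overall_em = exact_match_rate(toy)
group_em = exact_match_rate(toy, lambda x: x["typed_distractor"])
# A group that beats the overall EM is not a weakness, however many errors it contains.
print(f"group EM = {group_em:.3f}, overall EM = {overall_em:.3f}")

# Consistency check on the slide numbers (approximate: 68% EM is rounded).
approx_errors = round(10570 * (1 - 0.68))
share_of_errors = 192 / approx_errors
print(f"~{approx_errors} errors, is_distracted share = {share_of_errors:.1%}")
```

The check shows why the 5.7% share and the 88% EM can both be true: a group can hold a visible slice of the errors and still be handled better than average.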
  • 30. 30 Biased conclusion due to small samples + focus exclusively on errors → Errudite: scale up to the entire dev set + cover errors & correct instances.
  • 31. 31 An is_distracted instance: the Doctor Who passage from slide 11 with the question “Who created the 2005 theme for Doctor Who?” Is the cause a distractor entity, or multi-sentence reasoning? HAS a distractor prediction != IS WRONG due to the distractor prediction.
  • 32. 32 Biased conclusion due to no test on true cause: HAS a distractor prediction != IS WRONG due to the distractor prediction.
  • 33. 33 Are the 192 instances really wrong because of the distractor? Would BiDAF work perfectly if we remove the distractors? Answer what-if questions with counterfactual analysis!
  • 34. 34 Counterfactual Analysis with Rewrite Rules: rewrite(target, from → to). Rewrite the target part of an instance by replacing from with to.
  • 35. 35 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(target, from → to). Rewrite the target part of an instance by replacing from with to.
  • 36. 36 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, from → to). Rewrite the context part of an instance by replacing from with to.
  • 37. 37 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → to). Rewrite the context part of an instance by replacing the model-predicted distractor string with to.
  • 38. 38 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → "#"). Rewrite the context part of an instance by replacing the model-predicted distractor string with a placeholder token “#”.
  • 39. 39 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → "#"). Q: Who created the 2005 theme for Doctor Who? C: “...# created a new arrangement of Ron Grainer’s ... Murray Gold provided a new arrangement...” (the predicted string “John Debney” is replaced by “#”). Prediction before → after the rewrite: Incorrect → Incorrect.
  • 40. 40 Counterfactual Analysis with Rewrite Rules: Would BiDAF work perfectly if we remove the distractors? rewrite(c, string(p(m)) → "#"). Incorrect → Incorrect: another distractor is still confusing the model!
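The rewrite rule on these slides can be approximated with plain string replacement. The sketch below is an assumption about the mechanics, not Errudite's implementation: `rewrite_context` is a hypothetical helper, and re-running the model on the rewritten context (the step that produces the new prediction) is not shown.

```python
def rewrite_context(context: str, predicted: str, placeholder: str = "#") -> str:
    """Replace every occurrence of the predicted span in the context with a placeholder."""
    return context.replace(predicted, placeholder)


context = (
    "John Debney created a new arrangement of Ron Grainer's original theme "
    "for Doctor Who in 1996. For the return of the series in 2005, "
    "Murray Gold provided a new arrangement."
)

# string(p(m)) is the model's (wrong) prediction, here "John Debney".
rewritten = rewrite_context(context, "John Debney")
print(rewritten)
```

Only the predicted distractor is masked; the other PERSON spans (Ron Grainer, Murray Gold) stay in place, which is why a second distractor can still mislead the model after the rewrite.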
  • 41. 41 Counterfactual Analysis with Rewrite Rules: p(m) for the 192 rewritten is_distracted instances, after rewrite(c, string(p(m)) → "#"). Incorrect → Incorrect: another distractor is still confusing the model!
  • 42. 42 Counterfactual Analysis with Rewrite Rules: p(m) for the 192 rewritten is_distracted instances, after rewrite(c, string(p(m)) → "#"). Incorrect → Incorrect, 29%: another distractor is still confusing the model! Incorrect → Correct, 48%: the distractor was fooling the model! Unchanged, 23%: other factors are at play! (example on the slide: “age of 18, 10.5% # from 18 to 24...”)
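The three-way breakdown above can be tallied once each rewritten instance has been re-predicted. This is a sketch under stated assumptions: the slides do not say how “unchanged” is decided, so the `same_span` flag is a hypothetical input, the function names are invented, and the toy counts are made up rather than taken from the 192 instances.

```python
from collections import Counter


def outcome(after_correct: bool, same_span: bool) -> str:
    """Label one rewritten instance by what happened to the model's prediction."""
    if after_correct:
        return "incorrect -> correct"    # the distractor was fooling the model
    if same_span:
        return "unchanged"               # other factors are at play
    return "incorrect -> incorrect"      # another distractor took over


def breakdown(results):
    """Percentage of instances per outcome, rounded to whole percents."""
    counts = Counter(outcome(correct, same) for correct, same in results)
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


# 25 toy instances as (after_correct, same_span) pairs.
toy_results = [(True, False)] * 12 + [(False, False)] * 7 + [(False, True)] * 6
print(breakdown(toy_results))
```

Only the "incorrect -> correct" share supports the claim that the distractor itself caused the error; the other two buckets point to causes the original grouping could not distinguish.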

  • 43. 43 Biased conclusion due to no test on true cause → Errudite: test via counterfactual analysis.
  • 44. 44 Deliverables: precise + reproducible + re-applicable. Attribute: ENT(g). Groups: is_entity, has_distractor, correct_type, is_distracted, all_instance. Rewrite rule: rewrite(c, string(p(m)) → "#").
  • 45. 45 Deliverables: precise + reproducible + re-applicable. Attribute + groups + rewrite rule, applied to BiDAF: BiDAF is not particularly bad at distractors. Seeming distractor errors can be due to other factors.

  • 46. 46 Deliverables: precise + reproducible + re-applicable. Attribute + groups + rewrite rule, re-applied to other datasets & other models: how good are they at handling distractors?
  • 47. 47 Biased conclusion due to: subjectively defined hypotheses + small samples + focus exclusively on errors + no test on true cause. Principles & Errudite: precise & reproducible hypotheses + scale up to the entire dev set + cover errors & correct instances + test via counterfactual analysis.
  • 48. 48 Errudite improves error analysis: precise, reproducible, scalable, testable. Better error analysis will then uncover bugs, improve the state of the art, and safeguard deployments.
  • 49. 49 Thank you! Video demo: https://tinyurl.com/errudite-video Open source: https://github.com/uwdata/errudite
  • 52. 52 Precise DSL (Domain-Specific Language). Qualitative description: “The model is bad on long questions.” Quantitative description: questions with more than 20 tokens, i.e. length(q) > 20. Attribute extractors: length, question_type, answer_type. Targets: question, context, groundtruth, prediction (model), token, sentence. Operators: >, !=, in, has_any.
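Turning the qualitative claim into the filter length(q) > 20 can be illustrated in a few lines of Python. This is a minimal sketch with assumptions: `length` here is a hypothetical stand-in that counts whitespace-separated tokens (the real DSL presumably uses a proper tokenizer), and the instance dictionaries are invented for illustration.

```python
def length(text: str) -> int:
    """Toy stand-in for the DSL's length(...): number of whitespace-separated tokens."""
    return len(text.split())


def long_question(instance: dict) -> bool:
    # Quantitative version of "long question": length(q) > 20
    return length(instance["question"]) > 20


instances = [
    {"question": "Who created the 2005 theme for Doctor Who?"},
    {"question": " ".join(["word"] * 21) + "?"},  # 21 tokens
]

long_group = [x for x in instances if long_question(x)]
print(len(long_group), "of", len(instances), "questions are long")
```

The threshold is explicit, so anyone can rerun the filter, change 20 to another value, and see exactly how the group changes.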
  • 53. 53 DSL (Domain-Specific Language), Attribute Extractor. Basic attributes: length. General-purpose linguistic features: LEMMA, POS, ENT. Standard prediction performance metrics: f1, accuracy. Between-target relations: overlap(t1, t2). Domain-specific attributes: answer_type, question_type.
  • 54. 54 DSL (Domain-Specific Language), Attribute Extractor: illustrated on the Doctor Who passage from slide 11 and the question “Who created the 2005 theme for Doctor Who?”
  • 56. 56 Suggestions via programming-by-demonstration, for the Doctor Who passage and question: starts_with(p(m), pattern="NNP"); starts_with(p(m), pattern="PERSON"); answer_type(g) == answer_type(p(m)); exact_match(m) == 0; is_correct_sent(m) == False; overlap(q, sentence(p(m))) > overlap(q, sentence(g)).
  • 57. 57 Suggestions via programming-by-demonstration: editing the question “Who created the 2005 theme for Doctor Who?” into “What person created the 2005 theme for Doctor Who?” (Who → What person).
  • 58. 58 Check attribute distribution: ENT(g) in groups is_entity and is_distracted (correct vs. incorrect).
  • 62. 62 Additional Case on Preciseness
  • 63. 63 User study: what are imprecise answer boundaries? “The model is making predictions with missing or additional words...?” Two user definitions (D1, D2). “Off by at most 2 tokens both on the left and right”: exact_match(m) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2. “No exact match, but high overlap”: exact_match(m) == 0 and f1(m) > 0.7.
  • 65. 65 User study: what are imprecise answer boundaries? The two definitions (“off by at most 2 tokens both on the left and right”; “no exact match, but high overlap”) are compared on a prediction/groundtruth pair: “...the polynomial time hierarchy collapses.” and “...believed that the polynomial hierarchy does..”
  • 66. 66 User Study: 10 participants (NLP graduate students + QA engineers from industry) examine BiDAF (Seo et al., 2016) on SQuAD (Rajpurkar et al., 2016). One-hour session: replicate a prior error analysis + freely explore the model.
  • 67. 67 Study #1 Replication: Errudite flexible enough? Read the BiDAF error analysis (50 errors, hand-labeled into 6 classes). Recreate 4 classes with Errudite on the entire dataset. Rate closeness: recreated groups == originals, semantically?
  • 68. 68 Users were able to express their intended groups well with Errudite.
  • 69. 69 Study #1 Replication: Easy group ≠ reproducible! How close does the approximation match the paper definition? For the boundary group, users were most confident: an easy group. How many errors are covered by the user-built imprecise-boundary groups? Between 13.8% and 45.8%: groups with low inter-agreement! [Charts: closeness rating (1-5) and error coverage (0%-50%) for the boundary group.]
  • 70. 70 Study #1 Replication: Easy group ≠ reproducible! “Off by at most 2 tokens both on the left and right”: exact_match(m) == 0 and abs(answer_offset(p(m),"left")) <= 2 and abs(answer_offset(p(m),"right")) <= 2; coverage = 22.1%. “No exact match, but high overlap”: exact_match(m) == 0 and f1(m) > 0.7; coverage = 13.8%.
  • 72. 72 Study #1 Replication: Easy group ≠ reproducible! The two definitions (“off by at most 2 tokens both on the left and right”; “no exact match, but high overlap”; coverage = 22.1% and 13.8%) are compared on example spans: “...commercial, scientific, and cultural growth...”; “...from Karakorum in Mongolia to Khanbaliq...”; “...the polynomial time hierarchy collapses.” vs. “...believed that the polynomial hierarchy does..”
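Why two reasonable definitions of “imprecise answer boundaries” cover different shares of the errors can be seen on a handful of toy spans. The sketch below rests on assumptions: spans are hypothetical (start, end) token offsets with an exclusive end, F1 is computed from positional overlap (a simplification of the bag-of-tokens F1 used for SQuAD), and the five error pairs are invented, so the coverages do not reproduce the 22.1% and 13.8% on the slides.

```python
def span_f1(pred, gold):
    """Token-overlap F1 between two (start, end) spans, end exclusive."""
    overlap = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    if overlap == 0:
        return 0.0
    precision = overlap / (pred[1] - pred[0])
    recall = overlap / (gold[1] - gold[0])
    return 2 * precision * recall / (precision + recall)


def off_by_two(pred, gold):
    # No exact match, and each boundary is off by at most 2 tokens.
    return pred != gold and abs(pred[0] - gold[0]) <= 2 and abs(pred[1] - gold[1]) <= 2


def high_overlap(pred, gold):
    # No exact match, but F1 above 0.7.
    return pred != gold and span_f1(pred, gold) > 0.7


def coverage(errors, in_group):
    """Fraction of error instances that fall into the group."""
    return sum(in_group(p, g) for p, g in errors) / len(errors)


# (prediction span, groundtruth span) for five toy errors.
errors = [
    ((10, 12), (11, 12)),  # one extra token on a one-token answer: close, low F1
    ((0, 10), (0, 13)),    # three tokens short of a long answer: far, high F1
    ((5, 9), (5, 10)),     # one token short: satisfies both definitions
    ((0, 3), (20, 25)),    # unrelated span: satisfies neither
    ((3, 4), (3, 6)),      # two tokens short of a short answer: close, low F1
]

print(coverage(errors, off_by_two), coverage(errors, high_overlap))
```

Short answers tend to satisfy the offset definition but fail the F1 threshold, while long answers do the opposite, which is one plausible source of the disagreement between users.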
  • 73. 73 Users were able to express their intended groups well with Errudite. Ambiguous manual labels prevent consistent replication, even when users thought they had replicated them!
  • 74. 74 Biased conclusion due to subjectively defined groups → Errudite: precise & reproducible grouping.
  • 75. 75 Study #2 Exploration: Errudite Useful Enough? Freely explore BiDAF with Errudite, thinking aloud. Describe observations / insights on BiDAF. Rate insights on importance, confidence, and relative easiness.
  • 76. 76 Study #2 Exploration: Errudite Useful Enough? Users reported μ = 2.1, σ = 0.94 findings, which confirmed prior hypotheses, extended previous knowledge, or rejected prior hypotheses. Users thought their insights are... [chart: score 1-5 for importance, fidelity, easiness, quality]. Users learned more about the model (μ = 3.9, σ = 0.94).
  • 77. 77 User Feedback: Did they like Errudite? Enhanced their error analysis experience. Systematically scaled up the analysis. Precise and inspiring more confidence. Much faster exploration.