Anna Chaney
Twitter: @anna_seg
Watson Applied Research
Rinse and Repeat:
The Spiral of Applied Machine Learning
DISCLAIMER
This presentation shows numerical results from machine learning
systems. None of the systems contained Watson technology or used
any data from an IBM customer.
The results come from data sets obtained from internal systems,
using open source libraries for sentiment analysis and
classification. All results may be scaled or shifted for
illustrative purposes and are not meant to be representative of
actual system performance.
Analyze and Improve Performance of Machine Learning in Four Easy Steps
Step 0. Deploy your machine learning application
Step 1. Assess performance of app using human judgement
Step 2. Analyze and optimize operating thresholds
Step 3. Retrain machine learning with golden examples from humans
Step 4. Go to Step 0 with new changes
In machine learning, we need data, data, and more data. In your application you need to make sure
you record:
• All human input into the app
• Each machine learning component's top x responses
• The confidence score of each of those x responses
• If there are multiple ML components, the subsystem each response came from
• Timestamp and system revision (the system revision should be traceable to the data used to train the
ML component)
Instrumentation Is Key! 🔑
Note: for purposes of this talk, we assume that the design and use
case of your application have been clearly articulated and agreed
upon with all of the stakeholders in the project.
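A minimal sketch of what one logged interaction might look like, assuming a JSON-lines log. The class and field names (`InteractionRecord`, `CandidateResponse`, `subsystem`, `system_revision`) are hypothetical and only mirror the items listed on this slide; they do not come from any particular product.

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone


@dataclass
class CandidateResponse:
    text: str          # response produced by an ML component
    confidence: float  # confidence score reported for that response
    subsystem: str     # which ML component produced it


@dataclass
class InteractionRecord:
    user_input: str       # raw human input into the app
    responses: list       # top-x CandidateResponse objects
    system_revision: str  # should be traceable to the training data used
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def to_log_line(record: InteractionRecord) -> str:
    """Serialize one interaction as a single JSON line."""
    return json.dumps(asdict(record), sort_keys=True)


# Hypothetical example values, for illustration only.
record = InteractionRecord(
    user_input="How do I reset my password?",
    responses=[
        CandidateResponse("Use the reset link on the login page.", 0.82, "faq"),
        CandidateResponse("Contact your administrator.", 0.11, "faq"),
    ],
    system_revision="rev-example-001",
)
line = to_log_line(record)
```

Writing one self-contained line per interaction keeps everything a human judge needs (input, candidates, scores, revision) together in the log.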
Experts* say:
• The operative question in evaluating a machine learning system is the extent to which it produces
the results for which it was designed.
• The most straightforward way to evaluate a machine learning system is to recruit human subjects
and ask them to assess system output along some predetermined criteria.
The customer (or technical owner) decides the criteria that the system responses will be evaluated
against. All of the information to judge the system response needs to be available in the logs that are
created by the system instrumentation.
Measuring Performance Using Humans
*Source: The Handbook of Computational Linguistics and Natural Language Processing. Clark, Fox, and Lappin
Depends on the type of machine learning response
• Open field textual response ➡ ask human to rate response as one of the following categories,
given all of the original context information:
– Wrong - the answer and the question are completely unrelated
– Poor - the answer and the question might be related, but the answer does not satisfy the
question
– Decent - the answer relates to the question, but could be better
– Perfect - the answer directly relates to the question and is phrased clearly
• Classification response, e.g., measuring sentiment as {positive, negative, neutral} ➡ ask human
to classify response given all of the original context relevant to the classification
Note: even though services like Watson Conversation are classifiers under the hood, the response
received by the user is a textual response, not the label of the intent, and thus should be evaluated as a
textual response.
Design of Experiments – Evaluation Metrics
Overview
Give the 'what' and 'why' of a task in less than 200 characters. The overview should give a clear
high-level picture of what a worker will be doing and why what they are doing is valuable.
Steps
Describe the process by which humans will complete the task. This should be a discrete list of
steps to use to complete the task. Each step should begin with an action verb in bold.
Rules and Tips
Use green headers for positive/"Do This", yellow for warning/"Be Careful Of", and red for bad/"Do Not".
Examples
Provide at least three examples of your job to contributors. This will help them perform better on
the job.
Thank you!
Humans really appreciate a customized thank you note!
Design of Experiments – Instructions for Human*
*Source: https://success.crowdflower.com/hc/en-us/articles/201855779-Instructions-Template
The quickest way to judge lots of data is to involve as many humans as possible. However, as you
judge the answers to many of these questions, you will find that many of the responses are open to
interpretation. Have a small group of 2-5 people who can definitively assess whether a response meets the
criteria your team has decided on for the project.
Have this team discuss and agree on the correct judgment for around 100 responses. This is your
Golden Standard. Write very clear reasons explaining why you have selected each answer. These
reasons will be shown to humans if they get a golden question incorrect. This is a great tool to explain
to your contributors how you’ve reached your answer. By explaining the answer, humans can learn
the rules and intricacies of a job in more depth than is provided in the instructions.
Golden questions should have an appropriate answer distribution that reflects your dataset. An even
answer distribution will train contributors on every possible answer instead of biasing them towards
one answer.
Before the human can contribute to evaluating the results of the job, they must pass a quiz of the
golden standard questions.
Design of Experiments – Create Golden Standard
You’ve got your golden questions; now load all of the data that you want annotated by human
judgement.
For Q/A tasks, a general rule of thumb is 20 man-hours per 1000 questions.
1000 responses is the minimum number of responses I would want to judge in a given experiment,
and the maximum depends only on time, cost, and resources.
Run the Experiment
"He that knows not,
and knows not that he knows not
is a fool.
Shun him.
He that knows not,
and knows that he knows not
is a pupil.
Teach him.
He that knows,
and knows not that he knows
is asleep.
Wake him.
He that knows,
and knows that he knows
is a teacher.
Follow him."
Question Answering: Optimize for Human-Computer Interaction
How much does it hurt my reputation to answer a question incorrectly?
\[
\text{Reputation} = \frac{|\text{correct answer}| - |\text{incorrect answer}|}{|\text{asked questions}|}
\]
Results are notional and not meant to be representative of any Watson API
Help Desk Question and Answer
Huge Gains in Perceived Performance
Assume 1000 questions. With the threshold set at 0.7, you’ve attempted to help
534 people, giving 486 of them a correct answer.
Now assume you wanted to answer 81% of the questions with machine learning.
You’ve attempted to help 810 people, giving 648 people a correct answer.
Adding 162 happy customers saves me
$XXXX in level > 0 support time using
the EXACT SAME SYSTEM
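As a worked check of the numbers on this slide, the sketch below applies the reputation formula to both operating points. The function name `reputation` is just an illustrative label; the counts are the notional ones quoted above, and unanswered questions count neither for nor against the score, as the formula implies.

```python
def reputation(correct: int, incorrect: int, asked: int) -> float:
    """(|correct| - |incorrect|) / |asked|, following the reputation formula."""
    return (correct - incorrect) / asked


asked = 1000

# Threshold 0.7: 534 questions answered, 486 of them correctly.
strict = reputation(correct=486, incorrect=534 - 486, asked=asked)

# Lower threshold: 810 questions answered, 648 of them correctly.
relaxed = reputation(correct=648, incorrect=810 - 648, asked=asked)

extra_correct = 648 - 486

print(f"threshold 0.7:  reputation = {strict:.3f}")
print(f"81% answered:   reputation = {relaxed:.3f}")
print(f"additional correct answers: {extra_correct}")
```

The share of attempted answers that are correct drops from about 91% to 80%, yet the reputation score rises from 0.438 to 0.486, because the gain in correct answers outweighs the added incorrect ones.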
• Assume a sentiment score s ∈ [−1, 1]
– sentiment is negative if s < 0
– sentiment is neutral if s = 0
– sentiment is positive if s > 0
Ideally, the closer to 1 the sentiment is, the more confident the classification algorithm is that the text
is positive, and the closer to -1 the sentiment score is, the more confident the classification
algorithm is that the text is negative.
Let’s measure this using the crowdsourcing platform CrowdFlower!
Sentiment Analysis
• Detailed Instructions
• Human must pass a quiz to enter the job
• 1 out of every 10 judgements is a test
question; a human must maintain 80%
agreement with our test questions to
remain in the job
• First round of Twitter data: 3314 samples,
3238 had human agreement
• Second round, Twitter and news data:
6471 samples, 5642 had human
agreement
• Following analysis only considers
samples with human agreement
Collecting Data from the Crowd Using CrowdFlower
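The 80% agreement rule on test questions can be sketched as a simple check. This is an illustration of the rule only, not CrowdFlower's API: the platform applies the rule itself, and the function name `passes_quality_bar` and the sample data are hypothetical.

```python
def passes_quality_bar(test_results: list, min_agreement: float = 0.80) -> bool:
    """True if the share of test questions answered correctly meets the bar.

    test_results holds one boolean per test question a contributor has seen.
    """
    if not test_results:
        return False  # no evidence yet, so do not trust the contributor
    return sum(test_results) / len(test_results) >= min_agreement


# Hypothetical contributors: 9 of 10 and 7 of 10 test questions correct.
contributor_a = [True] * 9 + [False]
contributor_b = [True] * 7 + [False] * 3

keep_a = passes_quality_bar(contributor_a)
keep_b = passes_quality_bar(contributor_b)

# Samples with human agreement across both rounds, from the slide.
agreed_samples = 3238 + 5642
```

The two rounds together give 8880 samples with human agreement, which is the total analyzed in the experiment results.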
Experiment Results
TOTAL SAMPLES: 8880
Number of Crowd Negative: 788
Number of Crowd Neutral: 6283
Number of Crowd Positive: 1809
Number of ML Negative: 2111
Number of ML Neutral: 2156
Number of ML Positive: 4613
Partial (neg only) Agreement: 565, 71.70%
Partial (neu only) Agreement: 1757, 27.96%
Partial (pos only) Agreement: 1083, 59.87%
Total Judgements Agreement: 3405, 38.34%
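The agreement percentages above are consistent with dividing the number of matching labels by the crowd count for each class. A short sketch that reproduces them from the counts on this slide (the variable names are illustrative):

```python
# Counts taken from the experiment results above.
crowd_counts = {"negative": 788, "neutral": 6283, "positive": 1809}
agreed = {"negative": 565, "neutral": 1757, "positive": 1083}

# Per-class agreement: ML matches divided by crowd labels of that class.
per_class = {
    label: agreed[label] / crowd_counts[label] for label in crowd_counts
}

# Overall agreement: all matches divided by all samples.
total = sum(agreed.values()) / sum(crowd_counts.values())

for label, rate in per_class.items():
    print(f"{label}: {rate:.2%}")
print(f"total: {total:.2%}")
```

Neutral is both the largest class and the one with the weakest agreement, which is why the overall figure sits far below the negative and positive rates.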
Redefining Neutral Using the Sentiment Score – Effect on Positive and Negative Classification Performance
[Figure: sentiment score axis from -1.0 to 1.0; scores from -x to x are treated as neutral, with negative to the left and positive to the right.]
Redefining Neutral Using the Sentiment Score – Effect on Neutral Classification Performance
[Figure: sentiment score axis from -1.0 to 1.0; scores from -x to x are treated as neutral, with negative to the left and positive to the right.]
Redefining Neutral Using the Sentiment Score – Effect on Total Performance (All Categories)
[Figure: sentiment score axis from -1.0 to 1.0; scores from -x to x are treated as neutral, with negative to the left and positive to the right.]
• We obviously don’t want to maximize for total correctness: because neutral makes up
approximately 70% of the data, a naïve classifier that calls everything neutral would score well.
• Generated a heuristic cost function that values a correct positive or negative
sentiment call 4 times higher than a correct neutral call.
• Maximum of the cost function occurs at a sentiment score threshold of x = 0.4, now implemented
in our client side code.
• Total accuracy over all three categories goes from 38% to 57%
Adding in a Cost Function
Final Results: weighted cost function normalized to appear on the same axis
Augmenting Classifier Systems Golden Training
[Flowchart nodes: System Logs; System Judgment; Good Intent Answer; Bad Intent Answer; decision "Correct intent exists?", with Yes leading to "Add 1 question to the proper intent's training bin" and No leading to "Create New Intent".]
Note: when you train the classifier, the size of the training bin affects the probability of the intent
being returned, so you may want to downsample some training bins to avoid unwanted bias. Or you may
consider that bias a good thing, depending on the use case.
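The training-bin step and the downsampling note can be sketched as follows. This assumes training data is held as a mapping from intent name to example questions; the function names, intent names, and the cap of 20 are hypothetical choices for illustration.

```python
import random
from collections import defaultdict


def add_judged_example(bins, question, correct_intent):
    """Add a human-judged question to its intent's training bin.

    With a defaultdict, a missing intent gets a new bin automatically,
    which covers the "create new intent" branch.
    """
    bins[correct_intent].append(question)


def downsample(bins, max_per_bin, seed=0):
    """Cap every bin at max_per_bin examples, sampling without replacement."""
    rng = random.Random(seed)
    return {
        intent: (
            rng.sample(examples, max_per_bin)
            if len(examples) > max_per_bin
            else list(examples)
        )
        for intent, examples in bins.items()
    }


bins = defaultdict(list)
bins["reset_password"] = [f"reset question {i}" for i in range(50)]

add_judged_example(bins, "I forgot my login", "reset_password")  # intent exists
add_judged_example(bins, "Where is my invoice?", "billing")      # new intent

balanced = downsample(bins, max_per_bin=20)
```

Whether to cap bin sizes at all is a design decision: skipping the `downsample` step keeps the bias toward frequent intents, which the note above says may be desirable for some use cases.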
Real Data; Mocked Dashboard – Alpha denotes performance after incorporation of new data
from log files
User Experience: 76%
Performance Prediction: 80%
Note that the performance threshold (indicated by *) is fixed at runtime
We incorporated the human-labeled data into the other training sets, and then ran tests
across all of the labeled data sets available to us.
Performance on our data set jumped from 57% to 70%.
It is also worth noting that performance on other data sets that were completely distinct from our use
case also improved. A true win-win!
Training Sentiment Model
The output of a machine learning system can always
be improved. Better training data, algorithms more
suited to your use case, and system improvements
based on threshold setting can all be employed.
However, you will find that after each iteration, the
system will improve less and less, much like the
radius of a spiral as it makes rotations around the
origin.
Depending on your use case, you may decide to
stop iterating at some point, or you may need to
never stop iterating (especially true of systems that
contain golden samples that can change over time).
Rinse and Repeat:
The Spiral of Applied Machine Learning

Weitere ähnliche Inhalte

Was ist angesagt?

Scott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SF
MLconf
 
copy for Gary Chin.
copy for Gary Chin.copy for Gary Chin.
copy for Gary Chin.
Teng Xiaolu
 
MLlecture1.ppt
MLlecture1.pptMLlecture1.ppt
MLlecture1.ppt
butest
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Simplilearn
 
optimizing_site_performance
optimizing_site_performanceoptimizing_site_performance
optimizing_site_performance
Bryan Farrow
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
QuestionPro
 

Was ist angesagt? (20)

Project Report
Project ReportProject Report
Project Report
 
Barga Data Science lecture 2
Barga Data Science lecture 2Barga Data Science lecture 2
Barga Data Science lecture 2
 
Real-world Reinforcement Learning
Real-world Reinforcement LearningReal-world Reinforcement Learning
Real-world Reinforcement Learning
 
Scott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SFScott Clark, Software Engineer, Yelp at MLconf SF
Scott Clark, Software Engineer, Yelp at MLconf SF
 
LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019LinkedIn talk at Netflix ML Platform meetup Sep 2019
LinkedIn talk at Netflix ML Platform meetup Sep 2019
 
RecSys Challenge 2016
RecSys Challenge 2016RecSys Challenge 2016
RecSys Challenge 2016
 
copy for Gary Chin.
copy for Gary Chin.copy for Gary Chin.
copy for Gary Chin.
 
Andrew NG machine learning
Andrew NG machine learningAndrew NG machine learning
Andrew NG machine learning
 
Machine Learning Landscape
Machine Learning LandscapeMachine Learning Landscape
Machine Learning Landscape
 
MLlecture1.ppt
MLlecture1.pptMLlecture1.ppt
MLlecture1.ppt
 
Intro to machine learning
Intro to machine learningIntro to machine learning
Intro to machine learning
 
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
Supervised and Unsupervised Learning In Machine Learning | Machine Learning T...
 
Barga DIDC'14 Invited Talk
Barga DIDC'14 Invited TalkBarga DIDC'14 Invited Talk
Barga DIDC'14 Invited Talk
 
The current state of prediction in neuroimaging
The current state of prediction in neuroimagingThe current state of prediction in neuroimaging
The current state of prediction in neuroimaging
 
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
Why big data didn’t end causal inference by Totte Harinen at Big Data Spain 2017
 
optimizing_site_performance
optimizing_site_performanceoptimizing_site_performance
optimizing_site_performance
 
Finding Some "Good" iOS Interview Questions for Employers
Finding Some "Good" iOS Interview Questions for EmployersFinding Some "Good" iOS Interview Questions for Employers
Finding Some "Good" iOS Interview Questions for Employers
 
Machine learning and_buzzwords
Machine learning and_buzzwordsMachine learning and_buzzwords
Machine learning and_buzzwords
 
Agile Deep Learning
Agile Deep LearningAgile Deep Learning
Agile Deep Learning
 
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing SlidesIntroduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
Introduction to MaxDiff Scaling of Importance - Parametric Marketing Slides
 

Ähnlich wie Rinse and Repeat : The Spiral of Applied Machine Learning

Descriptive Statistics and Interpretation Grading GuideQNT5.docx
Descriptive Statistics and Interpretation Grading GuideQNT5.docxDescriptive Statistics and Interpretation Grading GuideQNT5.docx
Descriptive Statistics and Interpretation Grading GuideQNT5.docx
theodorelove43763
 
Prompt it, not Google it - Prompt Engineering for Data Scientists
Prompt it, not Google it - Prompt Engineering for Data ScientistsPrompt it, not Google it - Prompt Engineering for Data Scientists
Prompt it, not Google it - Prompt Engineering for Data Scientists
Kevin Lee
 
School customer service presentation
School customer service presentationSchool customer service presentation
School customer service presentation
steve muzzy
 

Ähnlich wie Rinse and Repeat : The Spiral of Applied Machine Learning (20)

How ml can improve purchase conversions
How ml can improve purchase conversionsHow ml can improve purchase conversions
How ml can improve purchase conversions
 
Unit 1-ML (1) (1).pptx
Unit 1-ML (1) (1).pptxUnit 1-ML (1) (1).pptx
Unit 1-ML (1) (1).pptx
 
Week 2 Sentiment Analysis Using Machine Learning
Week 2 Sentiment Analysis Using Machine Learning Week 2 Sentiment Analysis Using Machine Learning
Week 2 Sentiment Analysis Using Machine Learning
 
Machine Learning Interview Questions
Machine Learning Interview QuestionsMachine Learning Interview Questions
Machine Learning Interview Questions
 
introduction to machine learning
introduction to machine learningintroduction to machine learning
introduction to machine learning
 
Descriptive Statistics and Interpretation Grading GuideQNT5.docx
Descriptive Statistics and Interpretation Grading GuideQNT5.docxDescriptive Statistics and Interpretation Grading GuideQNT5.docx
Descriptive Statistics and Interpretation Grading GuideQNT5.docx
 
Prompt it, not Google it - Prompt Engineering for Data Scientists
Prompt it, not Google it - Prompt Engineering for Data ScientistsPrompt it, not Google it - Prompt Engineering for Data Scientists
Prompt it, not Google it - Prompt Engineering for Data Scientists
 
Systems development life cycle
Systems development life cycleSystems development life cycle
Systems development life cycle
 
Architecting a Better Knowledgebase
Architecting a Better KnowledgebaseArchitecting a Better Knowledgebase
Architecting a Better Knowledgebase
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Barga Data Science lecture 9
Barga Data Science lecture 9Barga Data Science lecture 9
Barga Data Science lecture 9
 
School customer service presentation
School customer service presentationSchool customer service presentation
School customer service presentation
 
Cause and effect diagrams
Cause and effect diagramsCause and effect diagrams
Cause and effect diagrams
 
Traditional versus adaptive techniques
Traditional versus adaptive techniquesTraditional versus adaptive techniques
Traditional versus adaptive techniques
 
NLP Techniques for Question Answering.docx
NLP Techniques for Question Answering.docxNLP Techniques for Question Answering.docx
NLP Techniques for Question Answering.docx
 
End-to-End Machine Learning Project
End-to-End Machine Learning ProjectEnd-to-End Machine Learning Project
End-to-End Machine Learning Project
 
Introduction To Pc Security
Introduction To Pc SecurityIntroduction To Pc Security
Introduction To Pc Security
 
Introduction To Pc Security
Introduction To Pc SecurityIntroduction To Pc Security
Introduction To Pc Security
 
Introduction To Pc Security
Introduction To Pc SecurityIntroduction To Pc Security
Introduction To Pc Security
 
The 8 Step Data Mining Process
The 8 Step Data Mining ProcessThe 8 Step Data Mining Process
The 8 Step Data Mining Process
 

Kürzlich hochgeladen

Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
amitlee9823
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
amitlee9823
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
amitlee9823
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
amitlee9823
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
amitlee9823
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
gajnagarg
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
amitlee9823
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
amitlee9823
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
gajnagarg
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
amitlee9823
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
amitlee9823
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
gajnagarg
 

Kürzlich hochgeladen (20)

5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night StandCall Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Shivaji Nagar ☎ 7737669865 🥵 Book Your One night Stand
 
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men  🔝mahisagar🔝   Esc...
➥🔝 7737669865 🔝▻ mahisagar Call-girls in Women Seeking Men 🔝mahisagar🔝 Esc...
 
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
Escorts Service Kumaraswamy Layout ☎ 7737669865☎ Book Your One night Stand (B...
 
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get CytotecAbortion pills in Doha Qatar (+966572737505 ! Get Cytotec
Abortion pills in Doha Qatar (+966572737505 ! Get Cytotec
 
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
Just Call Vip call girls Erode Escorts ☎️9352988975 Two shot with one girl (E...
 
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
Thane Call Girls 7091864438 Call Girls in Thane Escort service book now -
 
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Thane West Call On 9920725232 With Body to body massage...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
Just Call Vip call girls kakinada Escorts ☎️9352988975 Two shot with one girl...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men  🔝Dindigul🔝   Escor...
➥🔝 7737669865 🔝▻ Dindigul Call-girls in Women Seeking Men 🔝Dindigul🔝 Escor...
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
Just Call Vip call girls Bellary Escorts ☎️9352988975 Two shot with one girl ...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 

Rinse and Repeat : The Spiral of Applied Machine Learning

  • 1. Anna Chaney Twitter: @anna_seg Watson Applied Research Rinse and Repeat: The Spiral of Applied Machine Learning
  • 2. 2 DISCLAIMER This presentation shows numerical results of machine learning systems, and none of the systems contained Watson technology, or used any data from an IBM customer. These results are obtained from data sets obtained from internal systems, using technologies from open source libraries for sentiment analysis, and classification. All results may be scaled or shifted for illustrative purposes and are not meant to be representative of actual system performance.
  • 3. 3 Analyze and Improve Performance of Machine Learning in Four Easy Steps Step 0. Deploy your machine learning application Step 1. Assess performance of app using human judgement Step 2. Analyze and optimize operating thresholds Step 3. Retrain machine learning with golden examples from humans Step 4. Go to Step 0 with new changes
  • 4. In machine learning, we need data, data, and more data. In your application you need to make sure you record: • All human input into the app • All of the machine learning components top 𝑥 responses • The confidence score of those 𝑥 responses • If multiple ML components, then record subsystem where the responses came from • Timestamp and system revision (system revision should be traceable to the data used to train the ML component) 4 Instrumentation, is Key!! 🔑 Note: for purposes of this talk, we assume that the design and use case of your application have been clearly articulated and agreed upon with all of the stakeholders in the project.
  • 5. Analyze and Improve Performance of Machine Learning in Four Easy Steps. Step 0. Deploy your machine learning application. Step 1. Assess performance of app using human judgement. Step 2. Analyze and optimize operating thresholds. Step 3. Retrain machine learning with golden examples from humans. Step 4. Go to Step 0 with new changes.
  • 6. Measuring Performance Using Humans. Experts* say: • The operative question in evaluating a machine learning system is the extent to which it produces the results for which it was designed. • The most straightforward way to evaluate a machine learning system is to recruit human subjects and ask them to assess system output along some predetermined criteria. The customer (or technical owner) decides the criteria that the system responses will be evaluated against. All of the information needed to judge the system response must be available in the logs created by the system instrumentation. *Source: The Handbook of Computational Linguistics and Natural Language Processing. Clark, Fox, and Lappin
  • 7. Design of Experiments – Evaluation Metrics. The metric depends on the type of machine learning response. • Open field textual response ➡ ask a human to rate the response as one of the following categories, given all of the original context information: – Wrong - the answer and the question are completely unrelated – Poor - the answer and the question might be related, but the answer does not satisfy the question – Decent - the answer relates to the question, but could be better – Perfect - the answer directly relates to the question and is phrased clearly • Classification response, e.g., measuring sentiment as {positive, negative, neutral} ➡ ask a human to classify the response given all of the original context relevant to the classification. Note: even though services like Watson Conversation are classifiers under the hood, the response received by the user is a textual response, not the label of the intent, and thus should be evaluated as a textual response.
  • 8. Design of Experiments – Instructions for Human*. Overview: Give the 'what' and 'why' of a task in less than 200 characters. The overview should give a clear high-level picture of what a worker will be doing and why what they are doing is valuable. Steps: Describe the process by which humans will complete the task. This should be a discrete list of steps to use to complete the task. Each step should begin with an action verb in bold. Rules and Tips: Use green headers for positive/"Do This", yellow for warning/"Be Careful Of", red for bad/"Do Not". Examples: Provide at least three examples of your job to contributors. This will help them perform better on the job. Thank you! Humans really appreciate a customized thank you note! *Source: https://success.crowdflower.com/hc/en-us/articles/201855779-Instructions-Template
  • 9. Design of Experiments – Create Golden Standard. The quickest way to judge lots of data is to involve as many humans as possible. However, as you judge the answers to many of these questions, you will find that many of the responses are open to interpretation. Have a small group of 2-5 people who can definitively assess whether a response meets the criteria your team has decided on for the project. Have this team discuss and agree on the correct judgment for around 100 responses. This is your Golden Standard. Write very clear reasons explaining why you have selected each answer. These reasons will be shown to humans if they get a golden question incorrect. This is a great tool to explain to your contributors how you've reached your answer. By explaining the answer, humans can learn the rules and intricacies of a job in more depth than is provided in the instructions. Golden questions should have an appropriate answer distribution that reflects your dataset. An even answer distribution will train contributors on every possible answer instead of biasing them towards one answer. Before a human can contribute to evaluating the results of the job, they must pass a quiz of the golden standard questions.
  • 10. Run the Experiment. You've got your golden questions; now load all of the data that you want annotated for human judgement. For Q/A tasks, a general rule of thumb is 20 man-hours per 1000 questions. 1000 responses is the minimum number of responses I would want to judge in a given experiment; the maximum depends only on time, cost, and resources.
  • 11. Analyze and Improve Performance of Machine Learning in Four Easy Steps. Step 0. Deploy your machine learning application. Step 1. Assess performance of app using human judgement. Step 2. Analyze and optimize operating thresholds. Step 3. Retrain machine learning with golden examples from humans. Step 4. Go to Step 0 with new changes.
  • 12. Question and Answering: Optimize for Human Computer Interaction. "He that knows not, and knows not that he knows not, is a fool. Shun him. He that knows not, and knows that he knows not, is a pupil. Teach him. He that knows, and knows not that he knows, is asleep. Wake him. He that knows, and knows that he knows, is a teacher. Follow him." How much does it hurt my reputation to answer a question incorrectly? $\text{Reputation} = \dfrac{|\text{correct answer}| - |\text{incorrect answer}|}{|\text{asked questions}|}$
  • 15. Huge Gains in Perceived Performance. Assume 1000 questions. When the threshold is set at 0.7, you've attempted to help 534 people, giving 486 of them a correct answer. Now assume you wanted to answer 81% of the questions with machine learning. You've attempted to help 810 people, giving 648 of them a correct answer. The added 162 happy customers save me $XXXX in level > 0 support time using the EXACT SAME SYSTEM.
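The two operating points on this slide can be plugged into the reputation measure from slide 12 as a worked check. This is a sketch under stated assumptions: the function name is hypothetical, every attempted answer that is not correct is counted as incorrect, and questions the system declines to answer count as neither. The transcript does not say that the presenter applied the formula to these exact figures.

```python
def reputation(correct: int, incorrect: int, asked: int) -> float:
    """(|correct| - |incorrect|) / |asked|, following the slide 12 formula."""
    if asked <= 0:
        raise ValueError("asked must be positive")
    return (correct - incorrect) / asked


asked = 1000

# Threshold 0.7: 534 questions attempted, 486 answered correctly.
strict = reputation(correct=486, incorrect=534 - 486, asked=asked)

# Answering 81% of questions: 810 attempted, 648 answered correctly.
relaxed = reputation(correct=648, incorrect=810 - 648, asked=asked)

# Additional people who received a correct answer.
extra_correct = 648 - 486

print(f"strict:  {strict:.3f}")    # 0.438
print(f"relaxed: {relaxed:.3f}")   # 0.486
print(f"extra correct answers: {extra_correct}")  # 162
```

Under these assumptions the relaxed operating point gives more correct answers (162 more) and also more incorrect ones (162 instead of 48), so the measure rises only modestly; how heavily to penalize a wrong answer remains a per-application choice.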
  • 16. Sentiment Analysis. • Assume a sentiment score 𝑠 ∈ [−1, 1] – sentiment is negative if 𝑠 < 0 – sentiment is neutral if 𝑠 = 0 – sentiment is positive if 𝑠 > 0. Ideally, the closer to 1 the sentiment score is, the more confident the classification algorithm is that the text is positive, and the closer to -1 the sentiment score is, the more confident the classification algorithm is that the text is negative. Let's measure this using the crowdsourcing platform CrowdFlower!
  • 17. Collecting Data from the Crowd Using CrowdFlower. • Detailed instructions • Human must pass a quiz to enter the job • 1 out of every 10 judgements is a test question; a human must maintain 80% agreement with our test questions to remain in the job • First round of Twitter data: 3314 samples, 3238 had human agreement • Second round, Twitter and news data: 6471 samples, 5642 had human agreement • The following analysis only considers samples with human agreement
  • 18. Experiment Results. ML total samples: 8880. Crowd labels: negative 788, neutral 6283, positive 1809. ML labels: negative 2111, neutral 2156, positive 4613. Partial agreement (negative only): 565, 71.70%. Partial agreement (neutral only): 1757, 27.96%. Partial agreement (positive only): 1083, 59.87%. Total judgements agreement: 3405, 38.34%. *Results are notional and not meant to be representative of any Watson API
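The percentages on this slide can be reproduced from the counts. The sketch below shows that each partial agreement figure equals the number of agreeing samples divided by the number of crowd labels for that class, and that the total divides by all 8880 samples. The variable names are illustrative; the numbers are the slide's own.

```python
# Counts taken from the slide.
crowd_counts = {"negative": 788, "neutral": 6283, "positive": 1809}
agree_counts = {"negative": 565, "neutral": 1757, "positive": 1083}

total_samples = sum(crowd_counts.values())

# Per-class agreement: of the samples the crowd labeled as a class,
# the share that the ML system labeled the same way.
per_class_pct = {
    label: round(100 * agree_counts[label] / crowd_counts[label], 2)
    for label in crowd_counts
}

# Overall agreement across all three classes.
total_agree = sum(agree_counts.values())
overall_pct = round(100 * total_agree / total_samples, 2)

print(total_samples)   # 8880
print(per_class_pct)   # {'negative': 71.7, 'neutral': 27.96, 'positive': 59.87}
print(total_agree, overall_pct)  # 3405 38.34
```

Read this way, the low overall figure is driven by the neutral class: the crowd calls about 70% of samples neutral, while the ML system agrees on only about 28% of those.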
  • 19. Redefining Neutral Using the Sentiment Score – Effect on Positive and Negative Classification Performance. [Figure: sentiment score axis from -1.0 to 1.0, divided into negative, neutral (-x to x), and positive regions.] *Results are notional and not meant to be representative of any Watson API
  • 20. Redefining Neutral Using the Sentiment Score – Effect on Neutral Classification Performance. [Figure: sentiment score axis from -1.0 to 1.0, divided into negative, neutral (-x to x), and positive regions.] *Results are notional and not meant to be representative of any Watson API
  • 21. Redefining Neutral Using the Sentiment Score – Effect on Total Performance (All Categories). [Figure: sentiment score axis from -1.0 to 1.0, divided into negative, neutral (-x to x), and positive regions.] *Results are notional and not meant to be representative of any Watson API
  • 22. Adding in a Cost Function. • We obviously don't want to maximize total correctness, because the class imbalance toward neutral data (approx. 70% of the data) would reward a naïve classifier that calls everything neutral. • We generated a heuristic cost function that values a correct positive or negative sentiment call 4 times higher than a correct neutral call. • The maximum of the cost function occurs at a sentiment score of 0.4, now implemented in our client-side code. • Total accuracy over all three categories goes from 38% to 57%. *Results are notional and not meant to be representative of any Watson API
  • 23. Final Results. Weighting cost function normalized to appear on the same axis. *Results are notional and not meant to be representative of any Watson API
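The threshold search described on slides 19 to 22 can be sketched as a sweep over the half-width x of the neutral band, scoring each candidate with a weighted count of correct calls. This is an illustration under assumptions, not the presenter's code: the function names are hypothetical, the 4:1 weighting is the only detail taken from the slide, and the samples are toy data constructed for this example rather than the talk's data set.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, str]  # (sentiment score in [-1, 1], human label)


def classify(score: float, x: float) -> str:
    """Scores inside [-x, x] are neutral; outside that band they are polar."""
    if score > x:
        return "positive"
    if score < -x:
        return "negative"
    return "neutral"


def weighted_score(samples: List[Sample], x: float,
                   polar_weight: float = 4.0,
                   neutral_weight: float = 1.0) -> float:
    """Sum of weights over correctly classified samples (higher is better)."""
    total = 0.0
    for score, human_label in samples:
        if classify(score, x) == human_label:
            total += neutral_weight if human_label == "neutral" else polar_weight
    return total


def best_threshold(samples: List[Sample], candidates: Iterable[float]) -> float:
    """Candidate x with the highest weighted score (first one wins on ties)."""
    return max(candidates, key=lambda x: weighted_score(samples, x))


# Toy data, invented for illustration only.
toy_samples = [
    (0.9, "positive"), (0.6, "positive"), (0.5, "neutral"),
    (0.3, "neutral"), (0.2, "neutral"), (-0.1, "neutral"),
    (-0.35, "neutral"), (-0.7, "negative"),
]
candidates = [0.0, 0.2, 0.4, 0.6, 0.8]

for x in candidates:
    print(x, weighted_score(toy_samples, x))
print("best x:", best_threshold(toy_samples, candidates))
```

Because the weighting favors positive and negative calls, widening the neutral band stops paying off once it starts swallowing correctly labeled polar samples, which is what produces an interior maximum rather than "call everything neutral".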
  • 24. Analyze and Improve Performance of Machine Learning in Four Easy Steps. Step 0. Deploy your machine learning application. Step 1. Assess performance of app using human judgement. Step 2. Analyze and optimize operating thresholds. Step 3. Retrain machine learning with golden examples from humans. Step 4. Go to Step 0 with new changes.
  • 25. Augmenting Classifier Systems. [Flowchart; node labels: System Logs, System Judgment, Good Intent Answer, Bad Intent Answer, Golden Training, Correct intent exists? (Yes / No), Add 1 question to proper intent training bin, Create New Intent.] Note: when you train the classifier, the size of the training bin affects the probability of the intent being returned, so you may want to downsample some training bins to avoid unwanted bias. Or you may consider that bias a good thing, depending on the use case.
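The bin maintenance this slide describes might look like the sketch below. It assumes one plausible reading of the flowchart (a logged question goes into the bin of its correct intent, and a new bin is created when no suitable intent exists) and shows the downsampling mentioned in the note. The function names, the intent names, and the cap of 2 are hypothetical.

```python
import random
from typing import Dict, List

Bins = Dict[str, List[str]]  # intent name -> training questions


def add_logged_question(bins: Bins, question: str, intent: str) -> None:
    """Add a judged question to its intent's bin, creating the bin if needed."""
    bins.setdefault(intent, []).append(question)


def downsample_bins(bins: Bins, max_per_bin: int, seed: int = 0) -> Bins:
    """Return a copy where no bin holds more than max_per_bin questions.

    Oversized bins are reduced by uniform random sampling without replacement.
    """
    rng = random.Random(seed)
    result: Bins = {}
    for intent, questions in bins.items():
        if len(questions) <= max_per_bin:
            result[intent] = list(questions)
        else:
            result[intent] = rng.sample(questions, max_per_bin)
    return result


# Hypothetical training bins.
bins: Bins = {
    "reset_password": ["forgot my password", "cannot log in", "reset login"],
    "billing": ["where is my invoice"],
}
add_logged_question(bins, "password expired", "reset_password")  # existing intent
add_logged_question(bins, "cancel my account", "close_account")  # new intent

balanced = downsample_bins(bins, max_per_bin=2)
print({intent: len(qs) for intent, qs in balanced.items()})
```

Whether to cap bin sizes at all is the judgment call the slide raises: a larger bin makes its intent more likely to be returned, which may or may not match how often users actually ask for it.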
  • 26. Real Data; Mocked Dashboard – Alpha denotes performance after incorporation of new data from log files. User Experience: 76%. Performance Prediction: 80%. Note that the performance threshold (indicated by *) is fixed at runtime. *Results are notional and not meant to be representative of any Watson API
  • 27. Training Sentiment Model. We incorporated the human-labeled data into their other training sets, and then ran tests across all of the labeled data sets available to us. Performance on our data set jumped from 57% to 70%. It is also worth noting that performance on other data sets that were completely distinct from our use case also improved. A true win-win! *Results are notional and not meant to be representative of any Watson API
  • 28. Analyze and Improve Performance of Machine Learning in Four Easy Steps. Step 0. Deploy your machine learning application. Step 1. Assess performance of app using human judgement. Step 2. Analyze and optimize operating thresholds. Step 3. Retrain machine learning with golden examples from humans. Step 4. Go to Step 0 with new changes.
  • 29. Rinse and Repeat: The Spiral of Applied Machine Learning. The output of a machine learning system can always be improved. Better training data, algorithms more suited to your use case, and system improvements based on threshold setting can all be employed. However, you will find that after each iteration the system improves less and less, much like the radius of a spiral as it makes rotations around the origin. Depending on your use case, you may decide to stop iterating at some point, or you may need to never stop iterating (especially true of systems that contain golden samples that can change over time).