Today's state-of-the-art machine learning models are more powerful and easier to use than ever before, but they require massive amounts of training data. Traditionally, these training datasets require slow and often prohibitively expensive manual labeling by domain experts.
In Snorkel, users instead write "labeling functions" to heuristically label data; Snorkel then uses modern, theoretically grounded modeling techniques to clean and integrate the resulting training labels, without requiring any manual labeling. In a wide range of applications, from medical image monitoring to text information extraction to industrial deployments over web data, Snorkel provides a radically faster and more flexible way to build machine learning applications by letting users programmatically build and manipulate training data rather than label it by hand.
Website: https://fwdays.com/en/event/data-science-fwdays-2019/review/creating-and-managing-data-with-snorkel
3. ML Application = Model + Data + Hardware

from pytorch_transformers import BertModel as model
from pytorch_transformers import GPT2Model as model

aws ec2 run-instances --instance-type p3.2xlarge
aws ec2 run-instances --instance-type p3.16xlarge

State-of-the-art models and hardware are commodities.
Training data is not.
13. The Snorkel Pipeline

Users write labeling functions to heuristically label data.

def LF_pneumo(x):
    if re.search(r"pneumo.*", x.text):
        return "ABNORMAL"

def LF_short_report(x):
    if len(x.words) < 15:
        return "NORMAL"

def LF_ontology(x):
    if DISEASES & x.words:
        return "ABNORMAL"

def LF_off_shelf_classifier(x):
    if off_shelf_classifier(x) == 1:
        return "NORMAL"

[Figure: a DOMAIN EXPERT writes LABELING FUNCTIONS over UNLABELED DATA]

Labeling Functions (LFs) are simply black-box functions that heuristically label some portion of the data.
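The LFs above can be sketched as plain, runnable Python to show how they produce one row of votes per example; the label constants, the DISEASES set, and the sample reports below are assumptions added for illustration:

```python
import re

# Illustrative sketch of the slide's LFs; constants and data are made up.
ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1
DISEASES = {"pneumothorax", "effusion", "consolidation"}

def LF_pneumo(x):
    # Pattern-matching LF: any "pneumo..." mention suggests an abnormal report.
    return ABNORMAL if re.search(r"pneumo", x["text"], re.I) else ABSTAIN

def LF_short_report(x):
    # Heuristic LF: very short reports tend to be normal.
    return NORMAL if len(x["text"].split()) < 15 else ABSTAIN

def LF_ontology(x):
    # Ontology LF: label abnormal if any known disease term appears.
    words = {w.strip(".,:").lower() for w in x["text"].split()}
    return ABNORMAL if DISEASES & words else ABSTAIN

reports = [
    {"text": "Findings: Pneumothorax. Operation recommended."},
    {"text": "No acute abnormality."},
]
LFS = (LF_pneumo, LF_short_report, LF_ontology)
# Each row holds one report's (possibly conflicting, possibly abstaining) votes.
L = [[lf(x) for lf in LFS] for x in reports]
```

Note that the first report already shows a disagreement (two ABNORMAL votes and one NORMAL vote), which is exactly the signal the later pipeline stages exploit.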
14. Example Labeling Function: Spam

def LF_need_money(x):
    if re.search(r"needs.*money", x.text):
        return SPAM

"My name is Braden, a Nigerian prince in need of money!" → SPAM
"Hi Braden, do you need money, dear? Love, Grandma." → SPAM

Note: We expect our labeling functions to be noisy!
15. Labeling Functions in Many Flavors

Pattern Matching: If a phrase like "send money" is in the email
Boolean Search: If unknown_sender AND (foreign_source OR num_links > 3)
Heuristics: If SpellChecker finds 3+ spelling errors
Legacy System: If LegacySystem votes spam
Third-Party Model: If TweetSpamDetector votes spam
DB Lookup: If sender is in our Blacklist.db
SQL Query: If sender is in SELECT sender FROM emails GROUP BY sender HAVING SUM(flagged_spam) > 5;
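As a sketch of the "DB Lookup" / "SQL Query" flavors, a labeling function can wrap a database query directly; the in-memory table, schema, and sender addresses below are made up for illustration:

```python
import sqlite3

SPAM, ABSTAIN = 1, -1

# Toy in-memory stand-in for the slide's emails table / Blacklist.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (sender TEXT, flagged_spam INTEGER)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?)",
    [("prince@scam.example", 1)] * 6 + [("grandma@example.net", 0)],
)

def LF_flagged_sender(x):
    # SQL-query LF: vote SPAM if this sender was flagged more than 5 times.
    row = conn.execute(
        "SELECT sender FROM emails WHERE sender = ? "
        "GROUP BY sender HAVING SUM(flagged_spam) > 5",
        (x["sender"],),
    ).fetchone()
    return SPAM if row else ABSTAIN
```

Like every other flavor, this is just a black-box function from an example to a label or an abstain.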
16. The Snorkel Pipeline

Users write labeling functions to heuristically label data; Snorkel then cleans and combines the LF labels.

[Figure: DOMAIN EXPERT + UNLABELED DATA → LABELING FUNCTIONS (as on slide 13) → LABEL MODEL → PROBABILISTIC LABELS]
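A toy stand-in for the combining step might look like the sketch below. The real Label Model *learns* per-LF accuracies and reweights votes accordingly; this simplified version just normalizes raw vote counts into a probability distribution:

```python
from collections import Counter

ABSTAIN = -1

def soft_vote(row, cardinality=2):
    """Toy stand-in for Snorkel's Label Model: turn one row of the label
    matrix into a probability distribution via normalized vote counts.
    If every LF abstains, fall back to a uniform prior."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return [1.0 / cardinality] * cardinality
    counts = Counter(votes)
    return [counts.get(y, 0) / len(votes) for y in range(cardinality)]
```

For example, a row of votes [ABNORMAL, NORMAL, ABNORMAL] becomes a 2/3-confidence probabilistic label rather than a hard label.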
17. Key idea: Learn from the agreements & disagreements between the labeling functions.

[Figure: a grid of Yes/No votes from multiple LFs across examples; votes that disagree with the consensus are flagged *Probably Wrong]

*We assume only that our labeling functions are non-adversarial on average.
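One way to see what agreements and disagreements buy us: pairwise agreement rates between LFs are observable without any ground-truth labels, and they carry signal about each LF's unobserved accuracy. A minimal sketch of computing that statistic (helper name is ours):

```python
ABSTAIN = -1

def agreement_rate(L, j, k):
    """Fraction of examples, among those where LFs j and k both vote,
    on which they output the same label. Overlap statistics like this
    are what lets the Label Model estimate LF accuracies without any
    ground-truth labels; returns None if the pair never overlaps."""
    both = [(row[j], row[k]) for row in L
            if row[j] != ABSTAIN and row[k] != ABSTAIN]
    if not both:
        return None
    return sum(a == b for a, b in both) / len(both)
```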
18. The Snorkel Pipeline

Users write labeling functions to heuristically label data; Snorkel cleans and combines the LF labels; the resulting probabilistic labels are used to train an ML model.

[Figure: DOMAIN EXPERT + UNLABELED DATA → LABELING FUNCTIONS → LABEL MODEL → PROBABILISTIC LABELS → CLASSIFIER]

Use a commodity model for your problem!
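Training on probabilistic labels typically means a noise-aware loss: cross-entropy of the classifier's predictions against the Label Model's soft labels rather than hard one-hot labels. A minimal sketch in plain Python (not the Snorkel API):

```python
import math

def soft_cross_entropy(pred_probs, target_probs, eps=1e-9):
    """Noise-aware training loss: mean cross-entropy of predicted class
    distributions against *probabilistic* target labels. A confident
    soft label pulls the classifier hard; a 50/50 label barely does."""
    total = 0.0
    for p, t in zip(pred_probs, target_probs):
        total -= sum(ti * math.log(pi + eps) for pi, ti in zip(p, t))
    return total / len(pred_probs)
```

Any commodity model whose trainer accepts soft targets (or a weighted variant of hard targets) can consume these labels.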
19. Why can't I just use my Label Model as a classifier directly?
20. Reason #1: Improved Generalization

LABEL MODEL: High Precision, Limited Coverage
CLASSIFIER: Generalizes beyond the LFs
21. Reason #1: Improved Generalization

Task: identify disease-causing chemicals

Phrases mentioned in Labeling Functions: "treats", "causes", "induces", "prevents", …
Phrases given large weights by end model: "could produce a", "support diagnosis of", …

The classifier learned to take advantage of features that were helpful for prediction, but never explicitly mentioned in the LFs.
22. Reason #2: Scaling with Unlabeled Data

Add more unlabeled data (without changing the LFs) and performance improves!
27. Chest X-Ray Classification: Write LFs over TEXT to create training labels for an IMAGE classifier!

Report 47:
Indication: Chest pain. Findings: Pneumothorax. Operation recommended.

[Figure: the labeling functions from slide 13 vote ABNORMAL on the report text, and the paired chest X-ray image inherits the ABNORMAL training label]
28. Chest X-Ray Classification

Indication: Chest pain. Findings: Mediastinal contours are within normal limits. Heart size is within normal limits. No focal consolidation, pneumothorax or pleural effusion. Impression: No acute cardiopulmonary abnormality.

[Figure: a timeline contrasting months to years of hand labeling with writing 20 Labeling Functions]
29. Chest X-Ray Classification

[Figure: the same report and timeline as slide 28; with 20 Labeling Functions, the labeling effort drops from months or years to days]
37. 1. Write Labeling Functions (LFs)

3rd-Party Classifier: TextBlob is an off-the-shelf pre-trained sentiment classifier. We apply it as a "preprocessor" to add a "polarity" score to all examples.
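The preprocessor-plus-LF pattern can be sketched without TextBlob itself (which may not be installed); the toy lexicon, label constants, and threshold below are assumptions standing in for its pre-trained polarity score:

```python
# Toy stand-in for TextBlob's polarity preprocessor; word lists are made up.
HAM, SPAM, ABSTAIN = 0, 1, -1
POSITIVE = {"love", "great", "thanks", "dear"}
NEGATIVE = {"urgent", "prince", "transfer"}

def add_polarity(x):
    # Preprocessor: attach a polarity score in [-1, 1] to the example.
    words = set(x["text"].lower().split())
    hits = len(words & POSITIVE) - len(words & NEGATIVE)
    x["polarity"] = hits / max(len(words), 1)
    return x

def LF_polarity(x):
    # LF over the preprocessed field: clearly positive messages are HAM.
    return HAM if x["polarity"] > 0.2 else ABSTAIN
```

The point of the pattern is separation of concerns: the preprocessor runs once per example, and any number of LFs can then vote on its output.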
38. 1. Write Labeling Functions (LFs)

No LF has sufficient coverage on its own; the majority of our LFs have too low *accuracy.

*Based on a small sample of ~200 labeled examples.
39. 1. Write Labeling Functions (LFs)

M labeling functions applied to N data points yields an N x M label matrix (L).
40. 2. Clean and Combine LF Labels

The Label Model outputs confidence-weighted probabilistic labels for the train set.
41. 3. Train a Classifier

Simple bag-of-ngrams features; simple Keras logistic regression model.
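The bag-of-ngrams featurization step might be sketched as follows (whitespace tokenization is a simplification, and the function name is ours):

```python
from collections import Counter

def bag_of_ngrams(text, max_n=2):
    """Simple bag-of-ngrams featurizer (here unigrams and bigrams), the
    kind of commodity representation the slide pairs with a logistic
    regression model."""
    tokens = text.lower().split()
    grams = list(tokens)  # start with unigrams
    for n in range(2, max_n + 1):
        grams += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return Counter(grams)
```

These sparse counts would then be vectorized and fed to the logistic regression model trained with the noise-aware loss.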
42. Results

Use majority vote of LFs as classifier: 84.2%
Use label model trained on LFs as classifier: 86.7%
Use classifier trained on labels generated by label model: 94.4%
45. Join the Open-Source Community!

• Learn on the website: snorkel.org
• Contribute on the repo: github.com/snorkel-team/snorkel
• Practice on the tutorials: github.com/snorkel-team/snorkel-tutorials
• Discuss in the forum: spectrum.chat/snorkel
• Reference the docs: snorkel.readthedocs.io
• Follow on Twitter: @SnorkelML
Thank you!