Today's state-of-the-art machine learning models are more powerful and easier to use than ever before, but they require massive amounts of training data. Traditionally, these training datasets require slow and often prohibitively expensive manual labeling by domain experts.
In Snorkel, users instead write "labeling functions" to heuristically label data; Snorkel then uses modern, theoretically grounded modeling techniques to clean and integrate the resulting training labels, without requiring any manual labeling. In a wide range of applications, from medical image monitoring to text information extraction to industrial deployments over web data, Snorkel provides a radically faster and more flexible way to build machine learning applications by letting users programmatically build and manipulate training data rather than label it by hand.
Website: https://fwdays.com/en/event/data-science-fwdays-2019/review/creating-and-managing-data-with-snorkel
3. ML Application = Model + Data + Hardware

from pytorch_transformers import BertModel as model
from pytorch_transformers import GPT2Model as model

aws ec2 run-instances --instance-type p3.2xlarge
aws ec2 run-instances --instance-type p3.16xlarge

State-of-the-art models and hardware are commodities.
Training data is not.
13. The Snorkel Pipeline

Users write labeling functions to heuristically label data.

def LF_pneumo(x):
    if re.search(r"pneumo.*", x.text):
        return "ABNORMAL"

def LF_short_report(x):
    if len(x.words) < 15:
        return "NORMAL"

def LF_ontology(x):
    if DISEASES & x.words:
        return "ABNORMAL"

def LF_off_shelf_classifier(x):
    if off_shelf_classifier(x) == 1:
        return "NORMAL"

[Figure: a DOMAIN EXPERT writes LABELING FUNCTIONS over UNLABELED DATA]

Labeling Functions (LFs) are simply black-box functions that heuristically label some portion of the data.
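The LFs above can be sketched as plain, runnable Python to show how they produce one row of votes per example; the label constants, the DISEASES set, and the sample reports below are assumptions added for illustration:

```python
import re

# Illustrative sketch of the slide's LFs; constants and data are made up.
ABSTAIN, NORMAL, ABNORMAL = -1, 0, 1
DISEASES = {"pneumothorax", "effusion", "consolidation"}

def LF_pneumo(x):
    # Pattern-matching LF: any "pneumo..." mention suggests an abnormal report.
    return ABNORMAL if re.search(r"pneumo", x["text"], re.I) else ABSTAIN

def LF_short_report(x):
    # Heuristic LF: very short reports tend to be normal.
    return NORMAL if len(x["text"].split()) < 15 else ABSTAIN

def LF_ontology(x):
    # Ontology LF: label abnormal if any known disease term appears.
    words = {w.strip(".,:").lower() for w in x["text"].split()}
    return ABNORMAL if DISEASES & words else ABSTAIN

reports = [
    {"text": "Findings: Pneumothorax. Operation recommended."},
    {"text": "No acute abnormality."},
]
LFS = (LF_pneumo, LF_short_report, LF_ontology)
# Each row holds one report's (possibly conflicting, possibly abstaining) votes.
L = [[lf(x) for lf in LFS] for x in reports]
```

Note that the first report already shows a disagreement (two ABNORMAL votes and one NORMAL vote), which is exactly the signal the later pipeline stages exploit.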
14. Example Labeling Function: Spam

def LF_need_money(x):
    if re.search(r"needs.*money", x.text):
        return SPAM

"My name is Braden, a Nigerian prince in need of money!" → SPAM
"Hi Braden, do you need money, dear? Love, Grandma." → SPAM

Note: We expect our labeling functions to be noisy!
15. Labeling Functions in Many Flavors

Pattern Matching: If a phrase like "send money" is in the email
Boolean Search: If unknown_sender AND (foreign_source OR num_links > 3)
Heuristics: If SpellChecker finds 3+ spelling errors
Legacy System: If LegacySystem votes spam
Third-Party Model: If TweetSpamDetector votes spam
DB Lookup: If sender is in our Blacklist.db
SQL Query: If sender is in SELECT sender FROM emails GROUP BY sender HAVING SUM(flagged_spam) > 5;
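As a sketch of the "DB Lookup" / "SQL Query" flavors, a labeling function can wrap a database query directly; the in-memory table, schema, and sender addresses below are made up for illustration:

```python
import sqlite3

SPAM, ABSTAIN = 1, -1

# Toy in-memory stand-in for the slide's emails table / Blacklist.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emails (sender TEXT, flagged_spam INTEGER)")
conn.executemany(
    "INSERT INTO emails VALUES (?, ?)",
    [("prince@scam.example", 1)] * 6 + [("grandma@example.net", 0)],
)

def LF_flagged_sender(x):
    # SQL-query LF: vote SPAM if this sender was flagged more than 5 times.
    row = conn.execute(
        "SELECT sender FROM emails WHERE sender = ? "
        "GROUP BY sender HAVING SUM(flagged_spam) > 5",
        (x["sender"],),
    ).fetchone()
    return SPAM if row else ABSTAIN
```

Like every other flavor, this is just a black-box function from an example to a label or an abstain.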
16. The Snorkel Pipeline

Users write labeling functions to heuristically label data; Snorkel then cleans and combines the LF labels.

[Figure: DOMAIN EXPERT + UNLABELED DATA → LABELING FUNCTIONS (as on slide 13) → LABEL MODEL → PROBABILISTIC LABELS]
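A toy stand-in for the combining step might look like the sketch below. The real Label Model *learns* per-LF accuracies and reweights votes accordingly; this simplified version just normalizes raw vote counts into a probability distribution:

```python
from collections import Counter

ABSTAIN = -1

def soft_vote(row, cardinality=2):
    """Toy stand-in for Snorkel's Label Model: turn one row of the label
    matrix into a probability distribution via normalized vote counts.
    If every LF abstains, fall back to a uniform prior."""
    votes = [v for v in row if v != ABSTAIN]
    if not votes:
        return [1.0 / cardinality] * cardinality
    counts = Counter(votes)
    return [counts.get(y, 0) / len(votes) for y in range(cardinality)]
```

For example, a row of votes [ABNORMAL, NORMAL, ABNORMAL] becomes a 2/3-confidence probabilistic label rather than a hard label.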
17. Key idea: Learn from the agreements & disagreements between the labeling functions.

[Figure: a grid of Yes/No votes from multiple LFs across examples; votes that disagree with the consensus are flagged *Probably Wrong]

*We assume only that our labeling functions are non-adversarial on average.
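One way to see what agreements and disagreements buy us: pairwise agreement rates between LFs are observable without any ground-truth labels, and they carry signal about each LF's unobserved accuracy. A minimal sketch of computing that statistic (helper name is ours):

```python
ABSTAIN = -1

def agreement_rate(L, j, k):
    """Fraction of examples, among those where LFs j and k both vote,
    on which they output the same label. Overlap statistics like this
    are what lets the Label Model estimate LF accuracies without any
    ground-truth labels; returns None if the pair never overlaps."""
    both = [(row[j], row[k]) for row in L
            if row[j] != ABSTAIN and row[k] != ABSTAIN]
    if not both:
        return None
    return sum(a == b for a, b in both) / len(both)
```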
18. The Snorkel Pipeline

Users write labeling functions to heuristically label data; Snorkel cleans and combines the LF labels; the resulting probabilistic labels are used to train an ML model.

[Figure: DOMAIN EXPERT + UNLABELED DATA → LABELING FUNCTIONS → LABEL MODEL → PROBABILISTIC LABELS → CLASSIFIER]

Use a commodity model for your problem!
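Training on probabilistic labels typically means a noise-aware loss: cross-entropy of the classifier's predictions against the Label Model's soft labels rather than hard one-hot labels. A minimal sketch in plain Python (not the Snorkel API):

```python
import math

def soft_cross_entropy(pred_probs, target_probs, eps=1e-9):
    """Noise-aware training loss: mean cross-entropy of predicted class
    distributions against *probabilistic* target labels. A confident
    soft label pulls the classifier hard; a 50/50 label barely does."""
    total = 0.0
    for p, t in zip(pred_probs, target_probs):
        total -= sum(ti * math.log(pi + eps) for pi, ti in zip(p, t))
    return total / len(pred_probs)
```

Any commodity model whose trainer accepts soft targets (or a weighted variant of hard targets) can consume these labels.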
19. Why can't I just use my Label Model as a classifier directly?
20. Reason #1: Improved Generalization

LABEL MODEL: High Precision, Limited Coverage
CLASSIFIER: Generalizes beyond the LFs
21. Reason #1: Improved Generalization

Task: identify disease-causing chemicals

Phrases mentioned in Labeling Functions: "treats", "causes", "induces", "prevents", …
Phrases given large weights by end model: "could produce a", "support diagnosis of", …

The classifier learned to take advantage of features that were helpful for prediction, but never explicitly mentioned in the LFs.
22. Reason #2: Scaling with Unlabeled Data

Add more unlabeled data (without changing the LFs) and performance improves!
27. Chest X-Ray Classification: Write LFs over TEXT to create training labels for an IMAGE classifier!

Report 47:
Indication: Chest pain. Findings: Pneumothorax. Operation recommended.

[Figure: the labeling functions from slide 13 vote ABNORMAL on the report text, and the paired chest X-ray image inherits the ABNORMAL training label]
28. Chest X-Ray Classification

Indication: Chest pain. Findings: Mediastinal contours are within normal limits. Heart size is within normal limits. No focal consolidation, pneumothorax or pleural effusion. Impression: No acute cardiopulmonary abnormality.

[Figure: a timeline contrasting months to years of hand labeling with writing 20 Labeling Functions]
29. Chest X-Ray Classification

[Figure: the same report and timeline as slide 28; with 20 Labeling Functions, the labeling effort drops from months or years to days]
37. 1. Write Labeling Functions (LFs)

3rd-Party Classifier: TextBlob is an off-the-shelf pre-trained sentiment classifier. We apply it as a "preprocessor" to add a "polarity" score to all examples.
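The preprocessor-plus-LF pattern can be sketched without TextBlob itself (which may not be installed); the toy lexicon, label constants, and threshold below are assumptions standing in for its pre-trained polarity score:

```python
# Toy stand-in for TextBlob's polarity preprocessor; word lists are made up.
HAM, SPAM, ABSTAIN = 0, 1, -1
POSITIVE = {"love", "great", "thanks", "dear"}
NEGATIVE = {"urgent", "prince", "transfer"}

def add_polarity(x):
    # Preprocessor: attach a polarity score in [-1, 1] to the example.
    words = set(x["text"].lower().split())
    hits = len(words & POSITIVE) - len(words & NEGATIVE)
    x["polarity"] = hits / max(len(words), 1)
    return x

def LF_polarity(x):
    # LF over the preprocessed field: clearly positive messages are HAM.
    return HAM if x["polarity"] > 0.2 else ABSTAIN
```

The point of the pattern is separation of concerns: the preprocessor runs once per example, and any number of LFs can then vote on its output.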
38. 1. Write Labeling Functions (LFs)

No LF has sufficient coverage on its own; the majority of our LFs have too low *accuracy.

*Based on a small sample of ~200 labeled examples.
39. 1. Write Labeling Functions (LFs)

M labeling functions applied to N data points yields an N x M label matrix (L).
40. 2. Clean and Combine LF Labels

The Label Model outputs confidence-weighted probabilistic labels for the train set.
41. 3. Train a Classifier

Simple bag-of-ngrams features; simple Keras logistic regression model.
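The bag-of-ngrams featurization step might be sketched as follows (whitespace tokenization is a simplification, and the function name is ours):

```python
from collections import Counter

def bag_of_ngrams(text, max_n=2):
    """Simple bag-of-ngrams featurizer (here unigrams and bigrams), the
    kind of commodity representation the slide pairs with a logistic
    regression model."""
    tokens = text.lower().split()
    grams = list(tokens)  # start with unigrams
    for n in range(2, max_n + 1):
        grams += [" ".join(tokens[i:i + n])
                  for i in range(len(tokens) - n + 1)]
    return Counter(grams)
```

These sparse counts would then be vectorized and fed to the logistic regression model trained with the noise-aware loss.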
42. Results

Use majority vote of LFs as classifier: 84.2%
Use label model trained on LFs as classifier: 86.7%
Use classifier trained on labels generated by label model: 94.4%
45. Join the Open-Source Community!

• Learn on the website: snorkel.org
• Contribute on the repo: github.com/snorkel-team/snorkel
• Practice on the tutorials: github.com/snorkel-team/snorkel-tutorials
• Discuss in the forum: spectrum.chat/snorkel
• Reference the docs: snorkel.readthedocs.io
• Follow on Twitter: @SnorkelML
Thank you!