The document describes an approach called Snorkel that can generate training data for machine learning models from unlabeled text documents without requiring manual labeling. It works by encoding domain knowledge into labeling functions or rules and using those rules to assign weak labels to candidate examples. These weak labels are then used to train an underlying machine learning model like logistic regression. The approach is presented as an alternative to manual labeling that scales more easily. Key steps include writing rules, validating rules, running learning algorithms on the weakly labeled data, and iterating to improve the rules. Examples of using Snorkel for relationship extraction tasks are also provided.
3. The generalized approach to extracting text: Parsing
Tokenization → Normalization → Parsing → Lemmatization
Tokenization: separating sentences and words, removing special characters, phrase detection
Normalization: lower-casing words, word-sense disambiguation
Parsing: detecting parts of speech: nouns, verbs, etc.
Lemmatization: reducing plurals and other inflected word forms to a single word (found in the dictionary)
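The four steps above can be sketched in plain Python. This is a minimal illustration, not a real NLP toolkit: the `LEMMAS` dictionary and the noun-suffix heuristic are stand-ins for a proper dictionary-backed lemmatizer and POS tagger.

```python
import re

# Tiny stand-ins for real linguistic resources (illustrative only).
LEMMAS = {"treats": "treat", "treating": "treat", "seizures": "seizure"}
NOUN_SUFFIXES = ("tion", "ness", "ity", "itis")  # crude POS heuristic

def tokenize(text):
    """Split into words, dropping special characters."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Lower-case each token."""
    return [t.lower() for t in tokens]

def parse(tokens):
    """Tag each token with a (very rough) part of speech."""
    return [(t, "NOUN" if t.endswith(NOUN_SUFFIXES) else "OTHER") for t in tokens]

def lemmatize(tokens):
    """Map inflected forms back to a single dictionary form."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = lemmatize(normalize(tokenize("Carbamazepine treats seizures!")))
print(tokens)  # ['carbamazepine', 'treat', 'seizure']
```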
4. The generalized approach to extracting text: ML
Extract sentences that contain the specific attribute.
POS-tag and extract unigrams, bigrams, and trigrams centered on nouns.
Extract features from the words around nouns: bag of words / word vectors, position of the noun, and length of the sentence.
Map the training data to create a balanced positive and negative training set.
Train a machine learning model to predict which unigrams, bigrams, or trigrams satisfy the specific relationship: for example, the drug-disease treatment relationship.
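The n-gram and feature-extraction steps can be sketched as follows; the POS tags are assumed to come from an upstream tagger, and the feature set (bag of surrounding words, noun position, sentence length) follows the slide.

```python
def ngrams_around_nouns(tagged, n=3):
    """Collect n-grams centered on each noun in a POS-tagged sentence."""
    words = [w for w, _ in tagged]
    out = []
    for i, (w, tag) in enumerate(tagged):
        if tag == "NOUN":
            lo = max(0, i - n // 2)
            out.append(tuple(words[lo:lo + n]))
    return out

def features(tagged, noun_index):
    """Bag of surrounding words, position of the noun, sentence length."""
    words = [w for w, _ in tagged]
    return {
        "bag": set(words[:noun_index] + words[noun_index + 1:]),
        "position": noun_index,
        "length": len(words),
    }

tagged = [("aspirin", "NOUN"), ("treats", "VERB"), ("headache", "NOUN")]
print(ngrams_around_nouns(tagged))
# [('aspirin', 'treats', 'headache'), ('treats', 'headache')]
```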
5. The generalized approach to extracting text: ML
How do we generate this training data?
7. The Snorkel approach to entity extraction
Extract sentences that contain the specific attribute.
POS-tag and extract unigrams, bigrams, and trigrams centered on nouns.
Write rules: encode your domain knowledge into rules.
Validate rules: coverage, conflicts, accuracy.
Run learning: logistic regression, LSTM, …
Iterate: examine a random set of candidates and create new rules; observe the lowest-accuracy (highest-conflict) rules and edit them.
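The write/validate loop above can be sketched with a few hypothetical labeling functions for the drug-disease treatment relation (each returns +1 for "treats", -1 for "does not", 0 to abstain), plus the coverage and conflict statistics used to decide which rules to edit.

```python
# Hypothetical rules; the keyword lists are illustrative, not from Snorkel.
def lf_treats(s):    return 1 if "treats" in s or "indicated for" in s else 0
def lf_relief(s):    return 1 if "relief of" in s else 0
def lf_negation(s):  return -1 if "not indicated" in s else 0

LFS = [lf_treats, lf_relief, lf_negation]

def apply_lfs(sentences):
    return [[lf(s) for lf in LFS] for s in sentences]

def coverage(votes):
    """Fraction of candidates labeled by at least one rule."""
    return sum(any(v != 0 for v in row) for row in votes) / len(votes)

def conflicts(votes):
    """Fraction of candidates on which rules disagree."""
    return sum(1 in row and -1 in row for row in votes) / len(votes)

sents = ["aspirin treats headache",
         "for the relief of depression symptoms",
         "not indicated for children"]
votes = apply_lfs(sents)
# The third sentence triggers both lf_treats ("indicated for" is a
# substring) and lf_negation: exactly the kind of high-conflict rule
# the iterate step would flag for editing.
print(coverage(votes), conflicts(votes))
```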
8. Training Data | Rules
[Figure: illustrative scatter of labeled points versus rules, captioned "Planetary Orbits"]
9. How does Snorkel work without training data?
Write rules: encode your domain knowledge into rules.
The rules are modeled as a Naive Bayes model, which assumes that the rules are conditionally independent. Even though this is usually not true, in practice it generates a pretty good training set, with a probability of belonging to either class for each candidate.
These probabilities are fed into a machine learning algorithm, logistic regression in the simplest case, to create a model used to make future predictions.
http://arxiv.org/pdf/1512.06474v2.pdf
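A toy version of that generative step: here the per-rule accuracies are simply assumed (Snorkel learns them from agreements and disagreements among the rules), votes are weighted by each rule's log-odds of being correct, and the weighted sum is squashed into a class probability.

```python
import math

def rule_weight(accuracy):
    """Log-odds that a rule with the given accuracy is correct."""
    return math.log(accuracy / (1.0 - accuracy))

def prob_positive(votes, accuracies):
    """P(positive) from weighted rule votes; abstentions (0) are skipped."""
    z = sum(v * rule_weight(a) for v, a in zip(votes, accuracies) if v != 0)
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid of the weighted vote

accs = [0.9, 0.7, 0.8]                  # assumed per-rule accuracies
print(prob_positive([1, 1, 0], accs))   # two rules agree: high probability
print(prob_positive([1, 0, -1], accs))  # rules conflict: closer to 0.5
```

These soft labels, one probability per candidate, are what gets handed to the discriminative model (logistic regression in the simplest case).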
11. Data Dive: FDA Drug Labels
"It is indicated for treating respiratory disorder caused due to allergy."
"For the relief of symptoms of depression."
"Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:"
"When oral therapy is not feasible and the strength, dosage form, and route of administration of the drug reasonably lend the preparation to the treatment of the condition"
12. Candidate Extraction
Using domain knowledge and language structure, collect a candidate set with high recall and low precision. Typically this set might have around 80% recall and 20% precision.
60% accuracy: too specific, need to make it more general.
30% accuracy: this looks fine.
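A hedged sketch of such a high-recall / low-precision filter: keep any sentence that mentions a known drug together with a disease term, even though most matches will not be true treatment relations. The `DRUGS` and `DISEASES` sets are illustrative stand-ins for real dictionaries.

```python
import re

DRUGS = {"carbamazepine", "aspirin"}          # illustrative dictionaries
DISEASES = {"seizure", "headache", "depression"}

def candidates(sentences):
    """Emit every (drug, disease, sentence) co-occurrence as a candidate."""
    out = []
    for s in sentences:
        words = set(re.findall(r"[a-z]+", s.lower()))
        for drug in words & DRUGS:
            for disease in words & DISEASES:
                out.append((drug, disease, s))
    return out

sents = ["Carbamazepine is studied in seizure patients.",
         "Aspirin pricing rose last year."]
print(candidates(sents))  # only the first sentence yields a candidate
```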
23. Relationship extraction
• Is person X married to person Y?
• Does drug X cure disease Y?
• Does software X (example: Snorkel) run on programming language Y (example: python3)?
Define filters for candidate extraction for a pair (X, Y), for example: (snorkel, python2.7), (snorkel, python3.1), …
Once you have the pairs, examine them using an annotation tool.
Write rules -> observe their performance against the annotated data.
Iterate.
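The (X, Y) workflow above can be sketched end to end: generate candidate pairs, annotate a small sample by hand, then score a rule against those annotations. Pairs, labels, and the rule itself are illustrative assumptions, not real data.

```python
candidate_pairs = [("snorkel", "python2.7"), ("snorkel", "python3.1"),
                   ("snorkel", "java")]
annotations = {("snorkel", "python2.7"): True,   # hand-labeled sample
               ("snorkel", "python3.1"): True,
               ("snorkel", "java"): False}

def rule_runs_on(pair):
    """Hypothetical rule: the software runs on any Python version."""
    return pair[1].startswith("python")

def rule_accuracy(rule, pairs, gold):
    """Fraction of annotated pairs on which the rule agrees with the label."""
    return sum(rule(p) == gold[p] for p in pairs) / len(pairs)

print(rule_accuracy(rule_runs_on, candidate_pairs, annotations))  # 1.0
```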
24. Crowdsourced training data
In some cases, training data is generated on the same dataset by multiple people.
In Snorkel, each source can be incorporated as a separate rule function.
The model for the rules figures out the relative weight of each person and creates cleaner training data.
25. Why Docker?
• Portability: develop here, run there: internal clusters, AWS, Google Cloud, etc.; reusable by the team and clients.
• Isolation: the host OS and the Docker container are isolated from each other's bugs.
• Fast.
• Easy virtualization: no hardware emulation or fully virtualized OS needed.
• Lightweight.
Python stack on Docker
26. FROM ubuntu:latest
# MAINTAINER Sanghamitra Deb <sangha123@gmail.com>
CMD echo Installing Accenture Tech Labs Scientific Python Enviro
RUN apt-get update && apt-get upgrade -y
RUN apt-get install python -y
RUN apt-get install curl -y
RUN apt-get install emacs -y
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN rm get-pip.py
RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
RUN apt-get install python-setuptools build-essential python-dev -y
RUN apt-get install gfortran swig -y
RUN apt-get install libatlas-dev liblapack-dev -y
RUN apt-get install libfreetype6 libfreetype6-dev -y
RUN apt-get install libxft-dev -y
RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev -y
RUN apt-get install python-numpy -y
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -q
Dockerfile
scipy
matplotlib
ipython
jupyter
pandas
Bottleneck
patsy
pymc
statsmodels
scikit-learn
BeautifulSoup
seaborn
gensim
fuzzywuzzy
xmltodict
untangle
nltk
flask
enum34
requirements.txt
docker build -t sangha/python .
docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
docker exec -it pharmaExtraction0.1 bash
docker exec -d pharmaExtraction0.1 python /root/pycodes/rest_api.py
Building the Dockerfile
27. Typical ML pipeline vs Snorkel
(1) Candidate extraction
(2) Rule functions
(3) Hyperparameter tuning
28. Snorkel:
Pros:
• Very little training data necessary.
• Do not have to think about feature generation.
• Do not need deep knowledge of machine learning.
• Convenient UI for data annotation.
• Creates structured databases from unstructured text.
Cons:
• The code is new, so it may not be robust in all situations.
• Doing online prediction is difficult.
• Not much transparency in the internal workings.
29. Applicability across a variety of industries and use cases
• Banks: loan approval
• Paleontology
• Design of clinical trials
• Legal investigation
• Market research reports
• Human trafficking
• Skills extraction from resumes
• Content marketing
• Product descriptions and reviews
• Pharmaceutical industry
30. Where to get it?
https://github.com/HazyResearch/snorkel
http://arxiv.org/pdf/1512.06474v2.pdf