The document describes an approach called Snorkel that can generate training data for machine learning models from unlabeled text documents without requiring manual labeling. It works by encoding domain knowledge into labeling functions or rules and using those rules to assign weak labels to candidate examples. These weak labels are then used to train an underlying machine learning model like logistic regression. The approach is presented as an alternative to manual labeling that scales more easily. Key steps include writing rules, validating rules, running learning algorithms on the weakly labeled data, and iterating to improve the rules. Examples of using Snorkel for relationship extraction tasks are also provided.
3. The generalized approach to extracting text: Parsing
Tokenization → Normalization → Parsing → Lemmatization
Tokenization: separating sentences and words, removing special characters, phrase detection
Normalization: lower-casing words, word-sense disambiguation
Parsing: detecting parts of speech: nouns, verbs, etc.
Lemmatization: reducing plurals and other inflected word forms to a single word (found in the dictionary)
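The four steps above can be sketched in plain Python. This is a minimal illustration, not a real NLP toolkit: the `LEMMAS` dictionary and the noun-suffix heuristic are stand-ins for a proper dictionary-backed lemmatizer and POS tagger.

```python
import re

# Tiny stand-ins for real linguistic resources (illustrative only).
LEMMAS = {"treats": "treat", "treating": "treat", "seizures": "seizure"}
NOUN_SUFFIXES = ("tion", "ness", "ity", "itis")  # crude POS heuristic

def tokenize(text):
    """Split into words, dropping special characters."""
    return re.findall(r"[A-Za-z0-9']+", text)

def normalize(tokens):
    """Lower-case each token."""
    return [t.lower() for t in tokens]

def parse(tokens):
    """Tag each token with a (very rough) part of speech."""
    return [(t, "NOUN" if t.endswith(NOUN_SUFFIXES) else "OTHER") for t in tokens]

def lemmatize(tokens):
    """Map inflected forms back to a single dictionary form."""
    return [LEMMAS.get(t, t) for t in tokens]

tokens = lemmatize(normalize(tokenize("Carbamazepine treats seizures!")))
print(tokens)  # ['carbamazepine', 'treat', 'seizure']
```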
4. The generalized approach to extracting text: ML
Extract sentences that contain the specific attribute.
POS-tag and extract unigrams, bigrams, and trigrams centered on nouns.
Extract features from the words around nouns: bag of words / word vectors, position of the noun, and length of the sentence.
Map the training data to create a balanced positive and negative training set.
Train a machine learning model to predict which unigrams, bigrams, or trigrams satisfy the specific relationship: for example, the drug-disease treatment relationship.
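The n-gram and feature-extraction steps can be sketched as follows; the POS tags are assumed to come from an upstream tagger, and the feature set (bag of surrounding words, noun position, sentence length) follows the slide.

```python
def ngrams_around_nouns(tagged, n=3):
    """Collect n-grams centered on each noun in a POS-tagged sentence."""
    words = [w for w, _ in tagged]
    out = []
    for i, (w, tag) in enumerate(tagged):
        if tag == "NOUN":
            lo = max(0, i - n // 2)
            out.append(tuple(words[lo:lo + n]))
    return out

def features(tagged, noun_index):
    """Bag of surrounding words, position of the noun, sentence length."""
    words = [w for w, _ in tagged]
    return {
        "bag": set(words[:noun_index] + words[noun_index + 1:]),
        "position": noun_index,
        "length": len(words),
    }

tagged = [("aspirin", "NOUN"), ("treats", "VERB"), ("headache", "NOUN")]
print(ngrams_around_nouns(tagged))
# [('aspirin', 'treats', 'headache'), ('treats', 'headache')]
```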
5. The generalized approach to extracting text: ML
How do we generate this training data?
7. The Snorkel approach to entity extraction
Extract sentences that contain the specific attribute.
POS-tag and extract unigrams, bigrams, and trigrams centered on nouns.
Write rules: encode your domain knowledge into rules.
Validate rules: coverage, conflicts, accuracy.
Run learning: logistic regression, LSTM, …
Iterate: examine a random set of candidates and create new rules; observe the lowest-accuracy (highest-conflict) rules and edit them.
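The write/validate loop above can be sketched with a few hypothetical labeling functions for the drug-disease treatment relation (each returns +1 for "treats", -1 for "does not", 0 to abstain), plus the coverage and conflict statistics used to decide which rules to edit.

```python
# Hypothetical rules; the keyword lists are illustrative, not from Snorkel.
def lf_treats(s):    return 1 if "treats" in s or "indicated for" in s else 0
def lf_relief(s):    return 1 if "relief of" in s else 0
def lf_negation(s):  return -1 if "not indicated" in s else 0

LFS = [lf_treats, lf_relief, lf_negation]

def apply_lfs(sentences):
    return [[lf(s) for lf in LFS] for s in sentences]

def coverage(votes):
    """Fraction of candidates labeled by at least one rule."""
    return sum(any(v != 0 for v in row) for row in votes) / len(votes)

def conflicts(votes):
    """Fraction of candidates on which rules disagree."""
    return sum(1 in row and -1 in row for row in votes) / len(votes)

sents = ["aspirin treats headache",
         "for the relief of depression symptoms",
         "not indicated for children"]
votes = apply_lfs(sents)
# The third sentence triggers both lf_treats ("indicated for" is a
# substring) and lf_negation: exactly the kind of high-conflict rule
# the iterate step would flag for editing.
print(coverage(votes), conflicts(votes))
```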
8. Training Data | Rules
[Figure: illustrative scatter of labeled points versus rules, captioned "Planetary Orbits"]
9. How does Snorkel work without training data?
Write rules: encode your domain knowledge into rules.
The rules are modeled as a Naive Bayes model, which assumes that the rules are conditionally independent. Even though this is usually not true, in practice it generates a pretty good training set, with a probability of belonging to either class for each candidate.
These probabilities are fed into a machine learning algorithm, logistic regression in the simplest case, to create a model used to make future predictions.
http://arxiv.org/pdf/1512.06474v2.pdf
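A toy version of that generative step: here the per-rule accuracies are simply assumed (Snorkel learns them from agreements and disagreements among the rules), votes are weighted by each rule's log-odds of being correct, and the weighted sum is squashed into a class probability.

```python
import math

def rule_weight(accuracy):
    """Log-odds that a rule with the given accuracy is correct."""
    return math.log(accuracy / (1.0 - accuracy))

def prob_positive(votes, accuracies):
    """P(positive) from weighted rule votes; abstentions (0) are skipped."""
    z = sum(v * rule_weight(a) for v, a in zip(votes, accuracies) if v != 0)
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid of the weighted vote

accs = [0.9, 0.7, 0.8]                  # assumed per-rule accuracies
print(prob_positive([1, 1, 0], accs))   # two rules agree: high probability
print(prob_positive([1, 0, -1], accs))  # rules conflict: closer to 0.5
```

These soft labels, one probability per candidate, are what gets handed to the discriminative model (logistic regression in the simplest case).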
11. Data Dive: FDA Drug Labels
"It is indicated for treating respiratory disorder caused due to allergy."
"For the relief of symptoms of depression."
"Evidence supporting efficacy of carbamazepine as an anticonvulsant was derived from active drug-controlled studies that enrolled patients with the following seizure types:"
"When oral therapy is not feasible and the strength, dosage form, and route of administration of the drug reasonably lend the preparation to the treatment of the condition"
12. Candidate Extraction
Using domain knowledge and language structure, collect a candidate set with high recall and low precision. Typically this set might have around 80% recall and 20% precision.
60% accuracy: too specific, need to make it more general.
30% accuracy: this looks fine.
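A hedged sketch of such a high-recall / low-precision filter: keep any sentence that mentions a known drug together with a disease term, even though most matches will not be true treatment relations. The `DRUGS` and `DISEASES` sets are illustrative stand-ins for real dictionaries.

```python
import re

DRUGS = {"carbamazepine", "aspirin"}          # illustrative dictionaries
DISEASES = {"seizure", "headache", "depression"}

def candidates(sentences):
    """Emit every (drug, disease, sentence) co-occurrence as a candidate."""
    out = []
    for s in sentences:
        words = set(re.findall(r"[a-z]+", s.lower()))
        for drug in words & DRUGS:
            for disease in words & DISEASES:
                out.append((drug, disease, s))
    return out

sents = ["Carbamazepine is studied in seizure patients.",
         "Aspirin pricing rose last year."]
print(candidates(sents))  # only the first sentence yields a candidate
```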
23. Relationship extraction
• Is person X married to person Y?
• Does drug X cure disease Y?
• Does software X (example: Snorkel) run on programming language Y (example: python3)?
Define filters for candidate extraction for a pair (X, Y), for example: (snorkel, python2.7), (snorkel, python3.1), …
Once you have the pairs, examine them using an annotation tool.
Write rules -> observe their performance against the annotated data.
Iterate.
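The (X, Y) workflow above can be sketched end to end: generate candidate pairs, annotate a small sample by hand, then score a rule against those annotations. Pairs, labels, and the rule itself are illustrative assumptions, not real data.

```python
candidate_pairs = [("snorkel", "python2.7"), ("snorkel", "python3.1"),
                   ("snorkel", "java")]
annotations = {("snorkel", "python2.7"): True,   # hand-labeled sample
               ("snorkel", "python3.1"): True,
               ("snorkel", "java"): False}

def rule_runs_on(pair):
    """Hypothetical rule: the software runs on any Python version."""
    return pair[1].startswith("python")

def rule_accuracy(rule, pairs, gold):
    """Fraction of annotated pairs on which the rule agrees with the label."""
    return sum(rule(p) == gold[p] for p in pairs) / len(pairs)

print(rule_accuracy(rule_runs_on, candidate_pairs, annotations))  # 1.0
```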
24. Crowdsourced training data
In some cases, training data is generated on the same dataset by multiple people.
In Snorkel, each source can be incorporated as a separate rule function.
The model for the rules figures out the relative weight of each person and creates cleaner training data.
25. Why Docker?
• Portability: develop here, run there: internal clusters, AWS, Google Cloud, etc.; reusable by the team and clients.
• Isolation: the host OS and the Docker container are isolated from each other's bugs.
• Fast.
• Easy virtualization: no hardware emulation or fully virtualized OS needed.
• Lightweight.
Python stack on Docker
26. FROM ubuntu:latest
# MAINTAINER Sanghamitra Deb <sangha123@gmail.com>
CMD echo Installing Accenture Tech Labs Scientific Python Enviro
RUN apt-get update && apt-get upgrade -y
RUN apt-get install python -y
RUN apt-get install curl -y
RUN apt-get install emacs -y
RUN curl -O https://bootstrap.pypa.io/get-pip.py
RUN python get-pip.py
RUN rm get-pip.py
RUN echo "export PATH=~/.local/bin:$PATH" >> ~/.bashrc
RUN apt-get install python-setuptools build-essential python-dev -y
RUN apt-get install gfortran swig -y
RUN apt-get install libatlas-dev liblapack-dev -y
RUN apt-get install libfreetype6 libfreetype6-dev -y
RUN apt-get install libxft-dev -y
RUN apt-get install libxml2-dev libxslt-dev zlib1g-dev -y
RUN apt-get install python-numpy -y
ADD requirements.txt /tmp/requirements.txt
RUN pip install -r /tmp/requirements.txt -q
Dockerfile
scipy
matplotlib
ipython
jupyter
pandas
Bottleneck
patsy
pymc
statsmodels
scikit-learn
BeautifulSoup
seaborn
gensim
fuzzywuzzy
xmltodict
untangle
nltk
flask
enum34
requirements.txt
docker build -t sangha/python .
docker run -it -p 1108:1108 -p 1106:1106 --name pharmaExtraction0.1 -v /location/in/hadoop/ sangha/python bash
docker exec -it pharmaExtraction0.1 bash
docker exec -d pharmaExtraction0.1 python /root/pycodes/rest_api.py
Building the Dockerfile
27. Typical ML pipeline vs Snorkel
(1) Candidate extraction
(2) Rule functions
(3) Hyperparameter tuning
28. Snorkel:
Pros:
• Very little training data necessary.
• Do not have to think about feature generation.
• Do not need deep knowledge of machine learning.
• Convenient UI for data annotation.
• Creates structured databases from unstructured text.
Cons:
• The code is new, so it may not be robust in all situations.
• Doing online prediction is difficult.
• Not much transparency in the internal workings.
29. Applicability across a variety of industries and use cases
• Banks: loan approval
• Paleontology
• Design of clinical trials
• Legal investigation
• Market research reports
• Human trafficking
• Skills extraction from resumes
• Content marketing
• Product descriptions and reviews
• Pharmaceutical industry
30. Where to get it?
https://github.com/HazyResearch/snorkel
http://arxiv.org/pdf/1512.06474v2.pdf