THE DEEPDIVE FRAMEWORK
LEO ZHANG
STEP-BY-STEP ILLUSTRATION
The Stanford DeepDive, developed by Professor Chris Ré and a team of PhDs, is a powerful data management and preparation platform that allows users to build highly sophisticated end-to-end data pipelines.
This presentation covers the technicalities of the inference and learning engine behind DeepDive, including how DeepDive differs from traditional data management systems, how to build an application on DeepDive, and how exactly DeepDive works.
“We are just an advanced breed of monkeys
on a minor planet of a very average star. But
we can understand the Universe. That makes
us special”
- Stephen Hawking
THE DEEPDIVE OVERVIEW
How Is DeepDive Different?
Source: www.deepdive.stanford.edu
DeepDive is an end-to-end framework for building KBC systems.
[Figure: the DeepDive pipeline. Input: unstructured docs (e.g. "B. Obama and his wife M. Obama") → Candidate Generation & Feature Extraction → Supervision → Learning & Inference → Output (e.g. Has Spouse). After error analysis, developers add new docs and new feature extraction, supervision, and inference rules to improve quality.]
How Does DeepDive Work?
•  Candidate Generation and Feature Extraction
    •  Input data is saved in a relational database
    •  Feature extractors: a set of user-defined functions
•  Supervision
    •  The DeepDive language is based on Markov Logic
    •  Training data serves the same function as it does in supervised learning
•  Learning and Inference
    •  Factor graph
•  Error Analysis
    •  Determines whether the user needs to inspect the errors
DeepDive Design
Features that make it convenient for non-computer scientists to use:
i)  No reference to the underlying machine learning algorithm. Probabilistic semantics provide a way to debug the system independently of the algorithm
ii)  Allows users to write extra features in Python, SQL, and Scala
iii)  Fits into the familiar SQL stack, and therefore allows standard tools to inspect and visualize the data
Source: Incremental Knowledge Base Construction Using DeepDive
Output: structured knowledge base
•  Feature Engineering: allows developers to think about features rather than algorithms
•  High Quality: applications have achieved higher quality than human volunteers
•  Calibration: computes a calibrated probability for every assertion it makes
•  Variety of Sources: can extract data from documents, PDFs, web pages, tables, and figures
•  Domain Knowledge: lets developers integrate domain knowledge by writing simple rules to improve quality
•  Distant Supervision: does not require tedious training for every prediction
DEVELOPMENT PROCESS OF
DEEPDIVE APPLICATIONS
Writing The Application
•  Define the data flow in a DDlog schema that describes the input data and the data to be produced
•  Write User-Defined Functions (data transformation rules)
•  Specify a statistical model in DDlog
Running The Application
•  The user can compile and run the application incrementally
•  Actual data is loaded into the database and queried, and User-Defined Functions are executed incrementally
•  The model’s parameters can be learned, or reused to make predictions
Evaluate / Debug
•  Formal error analysis is supported by interactive tools
•  DeepDive contains a suite of tools and guides to label data products, browse data, and monitor descriptive statistics, calibration, etc.
Comments
# DDlog is a higher-level language for writing DeepDive applications in a succinct, Datalog-like syntax
# Variable declarations + Scoping and supervision rules + Inference rules
# A core set of commands supports precise control of execution
# Several commands operate on the statistical model, such as its creation, parameter estimation, computation of probabilities, and keeping and reusing the parameters
# User-Defined Functions can be written in any standard programming language
# Produces calibration plots to evaluate the iterative workflow
Start with a basic first version and improve iteratively
Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction
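The calibration plots mentioned above compare predicted probabilities against how often those predictions turn out to be correct. The following is a minimal sketch of that bucketing idea, not DeepDive's own tooling: the function name `calibration_buckets`, the bucket count, and the sample predictions are all assumptions made for illustration.

```python
def calibration_buckets(predictions, num_buckets=10):
    """Group (probability, is_correct) pairs into equal-width probability buckets.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, count, observed_accuracy).
    For a well-calibrated system, observed_accuracy falls inside the bucket range.
    """
    buckets = [[0, 0] for _ in range(num_buckets)]  # [count, correct] per bucket
    for probability, is_correct in predictions:
        # Clamp so that probability == 1.0 lands in the last bucket.
        index = min(int(probability * num_buckets), num_buckets - 1)
        buckets[index][0] += 1
        buckets[index][1] += int(is_correct)

    rows = []
    for index, (count, correct) in enumerate(buckets):
        if count:
            low = index / num_buckets
            high = (index + 1) / num_buckets
            rows.append((low, high, count, correct / count))
    return rows


# Made-up predictions on a hypothetical held-out labeled set.
predictions = [
    (0.95, True), (0.92, True), (0.97, False),
    (0.15, False), (0.12, False),
    (0.55, True), (0.58, False),
]
rows = calibration_buckets(predictions)
for low, high, count, accuracy in rows:
    print(f"[{low:.1f}, {high:.1f}): n={count}, accuracy={accuracy:.2f}")
```

Buckets whose observed accuracy falls far outside their probability range are a natural place to start error analysis in the next iteration.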
“It’s okay to have your eggs in one basket as
long as you control what happens to that
basket”
- Elon Musk
THE DEEPDIVE FRAMEWORK
[Figure: End-To-End Framework For Building KBC Systems. Input → Candidate Generation & Feature Extraction → Supervision → Learning & Inference → Output. Error analysis feeds new docs and new feature extraction, supervision, and inference rules back into the pipeline.]
Source: Incremental Knowledge Base Construction Using DeepDive
Knowledge Base Construction (KBC) Systems
The input to a KBC system is a heterogeneous collection of unstructured, semi-structured, and structured data.
The output is a relational database containing facts extracted from the input and put into the appropriate schema.
The KBC Model
The standard KBC model seeks to extract four types of objects from input documents:
•  Entity: a real person, place, or thing
•  Relation: an association between two (or more) entities
•  Mention: a span of text in an input document that refers to an entity or relation
•  Relation Mention: a phrase that connects two mentions that participate in a relation
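The four object types can be pictured as plain records. The sketch below is only an illustration of the definitions above, not DeepDive's actual schema: all class and field names are assumptions, and the entity names are filled in for the "B. Obama and his wife M. Obama" example sentence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A real person, place, or thing."""
    entity_id: str
    name: str


@dataclass(frozen=True)
class Relation:
    """An association between two (or more) entities."""
    name: str
    entity_ids: tuple


@dataclass(frozen=True)
class Mention:
    """A span of text in an input document that refers to an entity."""
    doc_id: str
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str
    entity_id: str


@dataclass(frozen=True)
class RelationMention:
    """A phrase that connects two mentions participating in a relation."""
    relation_name: str
    mentions: tuple
    phrase: str


def find_mention(doc_id, doc, surface, entity_id):
    """Locate the first occurrence of `surface` in `doc` and wrap it as a Mention."""
    start = doc.index(surface)
    return Mention(doc_id, start, start + len(surface), surface, entity_id)


doc = "B. Obama and his wife M. Obama"
barack = Entity("e1", "Barack Obama")      # illustrative entity names
michelle = Entity("e2", "Michelle Obama")

m1 = find_mention("d1", doc, "B. Obama", barack.entity_id)
m2 = find_mention("d1", doc, "M. Obama", michelle.entity_id)

has_spouse = Relation("HasSpouse", (barack.entity_id, michelle.entity_id))
# The text between the two mentions is the phrase that connects them.
relation_mention = RelationMention("HasSpouse", (m1, m2), doc[m1.end:m2.start].strip())
print(relation_mention.phrase)
```

Keeping mentions separate from entities is what lets several different spans of text, across many documents, resolve to the same real-world object.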
THE DEEPDIVE FRAMEWORK:
STEP-BY-STEP
Source: Incremental Knowledge Base Construction Using DeepDive
Candidate Generation & Feature Extraction
All data is stored in a relational database. This phase populates the database using a set of SQL queries and User-Defined Functions (feature extractors).
By default, DeepDive stores all documents in the database with one sentence per row, along with markup produced by standard NLP pre-processing tools, including HTML stripping, part-of-speech tagging, and linguistic parsing.
DeepDive then executes two types of queries:
•  Candidate mappings: SQL queries that produce possible mentions, entities, and relations
•  Feature extractors: queries that associate features with candidates
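To make the two query types concrete, here is a small sketch in plain Python standing in for the SQL queries and User-Defined Functions described above. The regular expression, the function names, and the feature names are all hypothetical and deliberately naive; a real application would rely on the NLP markup stored with each sentence.

```python
import re
from itertools import combinations

# Hypothetical, deliberately naive person pattern: an initial followed by a surname.
PERSON = re.compile(r"\b[A-Z]\.\s?[A-Z][a-z]+")


def person_mentions(sentence):
    """Candidate mapping: return (start, end, text) for each possible person mention."""
    return [(m.start(), m.end(), m.group()) for m in PERSON.finditer(sentence)]


def spouse_candidates(sentence):
    """Candidate mapping: pair up every two person mentions in the same sentence."""
    return list(combinations(person_mentions(sentence), 2))


def extract_features(sentence, candidate):
    """Feature extractor: associate simple textual features with one candidate pair."""
    (_, end1, _), (start2, _, _) = candidate
    words_between = sentence[end1:start2].split()
    features = [f"WORD_BETWEEN_{word.lower()}" for word in words_between]
    features.append(f"NUM_WORDS_BETWEEN_{len(words_between)}")
    return features


sentence = "B.Obama and his wife M. Obama"
candidates = spouse_candidates(sentence)
features = extract_features(sentence, candidates[0])
print(candidates[0])
print(features)
```

Candidate generation is intentionally permissive here: it proposes every pair, and it is left to the features and the later learning step to decide which pairs are likely to be real relations.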
“A breakthrough in machine learning would be
worth ten Microsofts”
- Bill Gates
Source: Incremental Knowledge Base Construction Using DeepDive
Supervision
Just as in Markov Logic, DeepDive can use training data or evidence about any relation.
Each user relation is associated with an evidence relation that indicates whether each entry is true or false.
Two standard techniques generate training data: hand-labeling and distant supervision.
Distant Supervision
Traditional machine learning techniques require a set of training data. In distant supervision, DeepDive takes an existing database (e.g. a domain-specific database) that already contains examples of the relations DeepDive wants to extract, then uses these examples to automatically generate the training data.
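Distant supervision can be sketched as a lookup of each candidate against existing databases. The example below is a minimal illustration under stated assumptions: the function name `distant_supervision`, the two small knowledge bases, and the rule that known siblings count as negative examples are all hypothetical.

```python
def distant_supervision(candidates, known_true, known_false):
    """Label candidate pairs by looking them up in existing databases.

    Returns {candidate: True | False | None}. None means there is no evidence,
    so that candidate is left for inference rather than used for training.
    """
    labels = {}
    for pair in candidates:
        key = frozenset(pair)  # order-insensitive: (a, b) matches (b, a)
        if key in known_true:
            labels[pair] = True
        elif key in known_false:
            labels[pair] = False
        else:
            labels[pair] = None
    return labels


# Hypothetical knowledge bases with illustrative entries only.
married = {frozenset(("B. Obama", "M. Obama"))}
siblings = {frozenset(("Person X", "Person Y"))}  # assumed not to be spouses

candidates = [
    ("M. Obama", "B. Obama"),
    ("Person X", "Person Y"),
    ("Person X", "M. Obama"),
]
labels = distant_supervision(candidates, married, siblings)
for pair, label in labels.items():
    print(pair, label)
```

Labels produced this way are noisy, since a sentence can mention a married couple without expressing the marriage, which is one reason the error analysis step matters.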
Source: Incremental Knowledge Base Construction Using DeepDive
Learning & Inference
In this phase, DeepDive generates a factor graph: a probabilistic graphical model that is the abstraction used for learning. DeepDive relies heavily on factor graphs.
An example factor graph: there is one user relation containing all tokens, and there are two correlation relations, for adjacent-token correlation (F1) and same-word correlation (F2) respectively.
Raw Data
He said that he would come.
In-database Representation
User Relations
Token   Word
A       He
B       Said
C       That
D       He
Correlation Relations
F1 (adjacent-token)     F2 (same-word)
Rx    Vars              Rx    Vars
i     (A,B)             iv    (A,D)
ii    (B,C)
iii   (C,D)
Factor Graph
Variables A (He), B (Said), C (That), D (He). The adjacent-token factors i, ii, and iii connect neighboring tokens, and the same-word factor iv connects A and D.
Assignment Example
Token   Assignment
A       1
B       0
C       0
D       1
Partition Function
For the assignment above, the term contributed to Z is
f1(1,0) × f1(0,0) × f1(0,1) × f2(1,1)
where the first three factors are in F1 and the last factor is in F2.
Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction
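The example factor graph is small enough to evaluate by brute force. The sketch below is an illustration rather than DeepDive's inference engine: the factor functions `f1` and `f2` and their weights are assumptions (the slides do not define them), and exhaustive enumeration is only feasible for tiny graphs like this one.

```python
import math
from itertools import product

variables = ["A", "B", "C", "D"]                      # tokens He, Said, That, He
factors_f1 = [("A", "B"), ("B", "C"), ("C", "D")]     # adjacent-token factors i, ii, iii
factors_f2 = [("A", "D")]                             # same-word factor iv

# Assumed weights and factor forms: each factor rewards agreeing neighbors.
W1, W2 = 0.5, 1.0


def f1(x, y):
    return math.exp(W1 * (x == y))


def f2(x, y):
    return math.exp(W2 * (x == y))


def weight(assignment):
    """Unnormalized weight of one assignment: the product of all factor values."""
    value = 1.0
    for u, v in factors_f1:
        value *= f1(assignment[u], assignment[v])
    for u, v in factors_f2:
        value *= f2(assignment[u], assignment[v])
    return value


def all_assignments():
    """Enumerate every 0/1 assignment to the four variables."""
    for values in product([0, 1], repeat=len(variables)):
        yield dict(zip(variables, values))


# The partition function sums the weight of every possible assignment.
Z = sum(weight(a) for a in all_assignments())

example = {"A": 1, "B": 0, "C": 0, "D": 1}
probability = weight(example) / Z

# Marginal probability that token A is assigned 1.
marginal_a = sum(weight(a) for a in all_assignments() if a["A"] == 1) / Z

print(f"weight = {weight(example):.4f}, Z = {Z:.4f}, P(example) = {probability:.4f}")
```

Dividing an assignment's weight by Z turns it into a probability, and summing over the assignments that agree on one variable gives that variable's marginal probability.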
“Problems worthy of attack prove their worth
by fighting back”
- Paul Erdős
REFERENCES
Shin, Jaeho, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. "Incremental Knowledge Base Construction Using DeepDive." Proceedings of the VLDB Endowment 8.11 (2015): 1310-1321.
Zhang, Ce. "DeepDive: A Data Management System for Automatic Knowledge Base Construction." Ph.D. dissertation, University of Wisconsin-Madison, 2015.

More Related Content

What's hot

Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesSpringPeople
 
Tools for Unstructured Data Analytics
Tools for Unstructured Data AnalyticsTools for Unstructured Data Analytics
Tools for Unstructured Data AnalyticsRavi Teja
 
Python's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPython's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPeter Wang
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryInside Analysis
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Caserta
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big datakk1718
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)heba_ahmad
 
Power of the Run Graph
Power of the Run GraphPower of the Run Graph
Power of the Run GraphVaticle
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamGreg Goltsov
 
Protecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UKProtecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UKUlf Mattsson
 
Maximize the Value of Your Data: Neo4j Graph Data Platform
Maximize the Value of Your Data: Neo4j Graph Data PlatformMaximize the Value of Your Data: Neo4j Graph Data Platform
Maximize the Value of Your Data: Neo4j Graph Data PlatformNeo4j
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceCaserta
 
Make AI & BI work at Scale
Make AI & BI work at ScaleMake AI & BI work at Scale
Make AI & BI work at ScaleSteve Nouri
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSPhilip Filleul
 
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthLessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthHostedbyConfluent
 

What's hot (20)

Top Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practicesTop Big data Analytics tools: Emerging trends and Best practices
Top Big data Analytics tools: Emerging trends and Best practices
 
Bigdata analytics
Bigdata analyticsBigdata analytics
Bigdata analytics
 
Tools for Unstructured Data Analytics
Tools for Unstructured Data AnalyticsTools for Unstructured Data Analytics
Tools for Unstructured Data Analytics
 
Big data mining
Big data miningBig data mining
Big data mining
 
Python's Role in the Future of Data Analysis
Python's Role in the Future of Data AnalysisPython's Role in the Future of Data Analysis
Python's Role in the Future of Data Analysis
 
Data Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data DiscoveryData Wrangling and the Art of Big Data Discovery
Data Wrangling and the Art of Big Data Discovery
 
Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)Introduction to Data Science (Data Summit, 2017)
Introduction to Data Science (Data Summit, 2017)
 
Data mining with big data
Data mining with big dataData mining with big data
Data mining with big data
 
BigData Analysis
BigData AnalysisBigData Analysis
BigData Analysis
 
Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)Introduction to data science intro,ch(1,2,3)
Introduction to data science intro,ch(1,2,3)
 
Power of the Run Graph
Power of the Run GraphPower of the Run Graph
Power of the Run Graph
 
Jobs Complexity
Jobs ComplexityJobs Complexity
Jobs Complexity
 
Full-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data TeamFull-Stack Data Science: How to be a One-person Data Team
Full-Stack Data Science: How to be a One-person Data Team
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Protecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UKProtecting data privacy in analytics and machine learning ISACA London UK
Protecting data privacy in analytics and machine learning ISACA London UK
 
Maximize the Value of Your Data: Neo4j Graph Data Platform
Maximize the Value of Your Data: Neo4j Graph Data PlatformMaximize the Value of Your Data: Neo4j Graph Data Platform
Maximize the Value of Your Data: Neo4j Graph Data Platform
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Make AI & BI work at Scale
Make AI & BI work at ScaleMake AI & BI work at Scale
Make AI & BI work at Scale
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, StealthLessons from building a stream-first metadata platform | Shirshanka Das, Stealth
Lessons from building a stream-first metadata platform | Shirshanka Das, Stealth
 

Viewers also liked

Deepdive presentation GBAF20 primary care
Deepdive presentation GBAF20 primary careDeepdive presentation GBAF20 primary care
Deepdive presentation GBAF20 primary careMatthew Cunningham
 
DeepDive - Azure AD Identity Protection
DeepDive - Azure AD Identity ProtectionDeepDive - Azure AD Identity Protection
DeepDive - Azure AD Identity ProtectionMaxime Rastello
 
Silverlight2 Deepdive Mix08 External
Silverlight2 Deepdive Mix08 ExternalSilverlight2 Deepdive Mix08 External
Silverlight2 Deepdive Mix08 ExternalMartha Rotter
 
Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15
Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15
Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15MLconf
 
O365 Saturday - Deepdive SharePoint Client Side Rendering
O365 Saturday - Deepdive SharePoint Client Side RenderingO365 Saturday - Deepdive SharePoint Client Side Rendering
O365 Saturday - Deepdive SharePoint Client Side RenderingRiwut Libinuko
 
Presentation about the main ideas of the DeepDive (Stanford University)
Presentation about the main ideas of the DeepDive (Stanford University)Presentation about the main ideas of the DeepDive (Stanford University)
Presentation about the main ideas of the DeepDive (Stanford University)RealSpeaker 2.0
 
Fibromyalgia-2016_Brochure
Fibromyalgia-2016_BrochureFibromyalgia-2016_Brochure
Fibromyalgia-2016_BrochureSuresh Sriramulu
 

Viewers also liked (7)

Deepdive presentation GBAF20 primary care
Deepdive presentation GBAF20 primary careDeepdive presentation GBAF20 primary care
Deepdive presentation GBAF20 primary care
 
DeepDive - Azure AD Identity Protection
DeepDive - Azure AD Identity ProtectionDeepDive - Azure AD Identity Protection
DeepDive - Azure AD Identity Protection
 
Silverlight2 Deepdive Mix08 External
Silverlight2 Deepdive Mix08 ExternalSilverlight2 Deepdive Mix08 External
Silverlight2 Deepdive Mix08 External
 
Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15
Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15
Ce Zhang, Postdoctoral Researcher, Stanford University at MLconf ATL - 9/18/15
 
O365 Saturday - Deepdive SharePoint Client Side Rendering
O365 Saturday - Deepdive SharePoint Client Side RenderingO365 Saturday - Deepdive SharePoint Client Side Rendering
O365 Saturday - Deepdive SharePoint Client Side Rendering
 
Presentation about the main ideas of the DeepDive (Stanford University)
Presentation about the main ideas of the DeepDive (Stanford University)Presentation about the main ideas of the DeepDive (Stanford University)
Presentation about the main ideas of the DeepDive (Stanford University)
 
Fibromyalgia-2016_Brochure
Fibromyalgia-2016_BrochureFibromyalgia-2016_Brochure
Fibromyalgia-2016_Brochure
 

Similar to Stanford DeepDive Framework

Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8dallemang
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data scienceShilpaKrishna6
 
Scalable constrained spectral clustering
Scalable constrained spectral clusteringScalable constrained spectral clustering
Scalable constrained spectral clusteringNishanth Harapanahalli
 
Case Study: Big Data Analytics
Case Study: Big Data AnalyticsCase Study: Big Data Analytics
Case Study: Big Data AnalyticsAbhinav Das
 
Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016Shrey Kumar
 
Agile Testing Days 2017 Intoducing AgileBI Sustainably - Excercises
Agile Testing Days 2017 Intoducing AgileBI Sustainably - ExcercisesAgile Testing Days 2017 Intoducing AgileBI Sustainably - Excercises
Agile Testing Days 2017 Intoducing AgileBI Sustainably - ExcercisesRaphael Branger
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseLeMeniz Infotech
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge GraphTrey Grainger
 
Qiagram
QiagramQiagram
Qiagramjwppz
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer OverlordsIan Foster
 
Computer Science Related Questions
Computer Science Related QuestionsComputer Science Related Questions
Computer Science Related QuestionsBravoLulu1
 
Overview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company indiaOverview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company indiaJignesh Aakoliya
 

Similar to Stanford DeepDive Framework (20)

Sem tech 2011 v8
Sem tech 2011 v8Sem tech 2011 v8
Sem tech 2011 v8
 
Data science | What is Data science
Data science | What is Data scienceData science | What is Data science
Data science | What is Data science
 
Scalable constrained spectral clustering
Scalable constrained spectral clusteringScalable constrained spectral clustering
Scalable constrained spectral clustering
 
Mrithyunjaya_V_Sarangmath
Mrithyunjaya_V_SarangmathMrithyunjaya_V_Sarangmath
Mrithyunjaya_V_Sarangmath
 
Case Study: Big Data Analytics
Case Study: Big Data AnalyticsCase Study: Big Data Analytics
Case Study: Big Data Analytics
 
Introduction
IntroductionIntroduction
Introduction
 
Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016Shrey_Kumar_Resume_01072016
Shrey_Kumar_Resume_01072016
 
Agile Testing Days 2017 Intoducing AgileBI Sustainably - Excercises
Agile Testing Days 2017 Intoducing AgileBI Sustainably - ExcercisesAgile Testing Days 2017 Intoducing AgileBI Sustainably - Excercises
Agile Testing Days 2017 Intoducing AgileBI Sustainably - Excercises
 
Resume
ResumeResume
Resume
 
PrachiSharma
PrachiSharmaPrachiSharma
PrachiSharma
 
RESUME_RAVI
RESUME_RAVIRESUME_RAVI
RESUME_RAVI
 
K anonymity for crowdsourcing database
K anonymity for crowdsourcing databaseK anonymity for crowdsourcing database
K anonymity for crowdsourcing database
 
ChandraSekhar CV
ChandraSekhar CVChandraSekhar CV
ChandraSekhar CV
 
The Semantic Knowledge Graph
The Semantic Knowledge GraphThe Semantic Knowledge Graph
The Semantic Knowledge Graph
 
Qiagram
QiagramQiagram
Qiagram
 
So Long Computer Overlords
So Long Computer OverlordsSo Long Computer Overlords
So Long Computer Overlords
 
Computer Science Related Questions
Computer Science Related QuestionsComputer Science Related Questions
Computer Science Related Questions
 
SurajResume
SurajResumeSurajResume
SurajResume
 
Bigdataanalytics
BigdataanalyticsBigdataanalytics
Bigdataanalytics
 
Overview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company indiaOverview of entity framework by software outsourcing company india
Overview of entity framework by software outsourcing company india
 

Recently uploaded

原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查ydyuyu
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdfMatthew Sinclair
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsMonica Sydney
 
一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理F
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdfMatthew Sinclair
 
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...kumargunjan9515
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdfMatthew Sinclair
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...kajalverma014
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样ayvbos
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Roommeghakumariji156
 
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime BalliaBallia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Balliameghakumariji156
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"growthgrids
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsMonica Sydney
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfJOHNBEBONYAP1
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdfMatthew Sinclair
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsPriya Reddy
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样ayvbos
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查ydyuyu
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasDigicorns Technologies
 

Recently uploaded (20)

原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
原版制作美国爱荷华大学毕业证(iowa毕业证书)学位证网上存档可查
 
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
20240507 QFM013 Machine Intelligence Reading List April 2024.pdf
 
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi EscortsRussian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
Russian Escort Abu Dhabi 0503464457 Abu DHabi Escorts
 
一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理一比一原版奥兹学院毕业证如何办理
一比一原版奥兹学院毕业证如何办理
 
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
20240510 QFM016 Irresponsible AI Reading List April 2024.pdf
 
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...Local Call Girls in Seoni  9332606886 HOT & SEXY Models beautiful and charmin...
Local Call Girls in Seoni 9332606886 HOT & SEXY Models beautiful and charmin...
 
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
20240509 QFM015 Engineering Leadership Reading List April 2024.pdf
 
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
best call girls in Hyderabad Finest Escorts Service 📞 9352988975 📞 Available ...
 
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
一比一原版(Flinders毕业证书)弗林德斯大学毕业证原件一模一样
 
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac RoomVip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
Vip Firozabad Phone 8250092165 Escorts Service At 6k To 30k Along With Ac Room
 
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime BalliaBallia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
Ballia Escorts Service Girl ^ 9332606886, WhatsApp Anytime Ballia
 
APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53APNIC Updates presented by Paul Wilson at ARIN 53
APNIC Updates presented by Paul Wilson at ARIN 53
 
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency""Boost Your Digital Presence: Partner with a Leading SEO Agency"
"Boost Your Digital Presence: Partner with a Leading SEO Agency"
 
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi EscortsIndian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
Indian Escort in Abu DHabi 0508644382 Abu Dhabi Escorts
 
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdfpdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
pdfcoffee.com_business-ethics-q3m7-pdf-free.pdf
 
20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf20240508 QFM014 Elixir Reading List April 2024.pdf
20240508 QFM014 Elixir Reading List April 2024.pdf
 
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call GirlsMira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
Mira Road Housewife Call Girls 07506202331, Nalasopara Call Girls
 
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
一比一原版(Curtin毕业证书)科廷大学毕业证原件一模一样
 
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查在线制作约克大学毕业证(yu毕业证)在读证明认证可查
在线制作约克大学毕业证(yu毕业证)在读证明认证可查
 
Best SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency DallasBest SEO Services Company in Dallas | Best SEO Agency Dallas
Best SEO Services Company in Dallas | Best SEO Agency Dallas
 

Stanford DeepDive Framework

  • 1. THE DEEPDIVE FRAMEWORK LEO ZHANG STEP-BY-STEP ILLUSTRATION
  • 2. The Stanford DeepDive, developed by Professor Chris Ré and a team of PhDs, is a powerful data management and preparation platform that allows users to build highly sophisticated end-to-end data pipelines This presentation covers the technicalities of the inference and learning engine behind DeepDive; including how DeepDive is different from traditional data management systems, how to build an application on DeepDive, as well as how exactly does DeepDive work. “We are just an advanced breed of monkeys on a minor planet of a very average star. But we can understand the Universe. That makes us special”” - Stephen Hawking
  • 3. THE DEEPDIVE OVERVIEW How Is DeepDive Different? Source: www.deepdive.stanford.edu DeepDive is an end-to-end framework for building KBC systems. B.Obama and his wife M. Obama Candidate Generation & Feature Extraction Super- vision Learning & Inference Has Spouse Input Output Newdocs FeatureExt. rules Supervision rules Inference rules Erroranalysis Input: Unstructured Docs Developers will add new rules to improve quality How Does DeepDive Work? •  Candidate Generation and Feature Extraction •  Save input data in relational database •  Feature Extractors: a set of user-defined functions •  Supervision •  DeepDive language is based on Markov Logic •  Can use training data to mirror the same function it serves under supervised learning •  Learning and Inference •  Factor graph •  Error Analysis •  Determine if the user needs to inspect the errors DeepDive Design Features that makes it convenient for non-computer scientists to use: i)  No reference to underlying machine learning algorithm. Probabilistic semantics provide a way to debug the system independently of algorithm ii)  Allows users to write extra features in Python, SQL and Scala iii)  Fits into the familiar SQL stack, therefore allows standard tools to inspect and visualize data Source: Incremental Knowledge Base Construction Using DeepDive Output: structured knowledge base Feature Engineering High Quality Allows developers to think about features rather than algorithms Applications have achieved higher quality than human volunteers Calibration Variety of Sources Computes calibrated probability for every assertion it makes Can extract data from documents, PDFs, web pages, tables and figures Domain Knowledge Distant Supervision Integrates with writing sample rules to improve quality Does not require tedious training for every prediction
  • 4. DEVELOPMENT PROCESS OF DEEPDIVE APPLICATIONS
    Start with a basic first version and improve iteratively.
    Writing The Application
    •  Define the data flow in DDlog, with a schema that describes the input data and the data to be produced
    •  Write User-Defined Functions (data transformation rules)
    •  Specify a statistical model in DDlog
    Running The Application
    •  The user can compile and run the application incrementally
    •  Actual data is loaded into the database and queried -> User-Defined Functions are executed incrementally
    •  The model’s parameters can be learned or reused to make predictions
    Evaluate / Debug
    •  Formal error analysis is supported by interactive tools
    •  DeepDive contains a suite of tools and guides: label data products, browse data, monitor descriptive statistics, calibration, etc.
    Comments
    # DDlog is a higher-level language for writing DeepDive applications in a succinct, Datalog-like syntax
    # Variable declarations + scoping and supervision rules + inference rules
    # A core set of commands supports precise control of execution
    # Several commands operate on the statistical model: its creation, parameter estimation, computation of probabilities, and keeping and reusing the parameters
    # User-Defined Functions can be written in any standard programming language
    # Produces calibration plots to evaluate the iterative workflow
    Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction
    “It’s okay to have your eggs in one basket as long as you control what happens to that basket” - Elon Musk
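The calibration check mentioned on this slide can be illustrated with a minimal sketch. It is not DeepDive's own tooling: the function name `calibration_buckets` and the held-out predictions below are made up for illustration. The idea is to group predicted probabilities into buckets and compare each bucket with the fraction of assertions in it that are actually true; in a well-calibrated system the two roughly agree.

```python
def calibration_buckets(predictions, num_buckets=5):
    """Group (probability, is_true) pairs into equal-width probability buckets.

    Returns one row per non-empty bucket:
    (bucket_low, bucket_high, count, fraction_true).
    """
    buckets = [[] for _ in range(num_buckets)]
    for prob, is_true in predictions:
        # A probability of exactly 1.0 falls into the last bucket.
        idx = min(int(prob * num_buckets), num_buckets - 1)
        buckets[idx].append(is_true)
    rows = []
    for i, labels in enumerate(buckets):
        if labels:
            rows.append((i / num_buckets, (i + 1) / num_buckets,
                         len(labels), sum(labels) / len(labels)))
    return rows


# Hypothetical held-out assertions: (predicted probability, true label).
held_out = [(0.95, True), (0.9, True), (0.85, False),
            (0.55, True), (0.5, False),
            (0.1, False), (0.05, False)]

for low, high, count, frac in calibration_buckets(held_out):
    print(f"[{low:.1f}, {high:.1f}): {count} assertions, {frac:.2f} true")
```

A calibration plot is these rows drawn as a curve: bucket midpoint on one axis, fraction true on the other.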
  • 5. THE DEEPDIVE FRAMEWORK
    End-To-End Framework For Building KBCs
    [Figure: DeepDive pipeline, from Input through Candidate Generation & Feature Extraction, Supervision, and Learning & Inference to Output, with new docs, feature extraction rules, supervision rules, and inference rules added after error analysis. Source: Incremental Knowledge Base Construction Using DeepDive]
    Knowledge Base Construction Systems
    The input to a KBC system is a heterogeneous collection of unstructured, semi-structured, and structured data. The output is a relational database containing facts extracted from the input and put into the appropriate schema.
    The KBC Model
    The standard KBC model seeks to extract four types of objects from input documents:
    •  Entity: A real person, place, or thing
    •  Relation: Associates two (or more) entities
    •  Mention: A span of text in an input document that refers to an entity or relation
    •  Relation Mention: A phrase that connects two mentions that participate in a relation
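The four object types of the KBC model can be sketched as plain data structures. This is an illustration only, not DeepDive's schema: the class and field names are assumptions, and the sentence is the "B. Obama and his wife M. Obama" example used earlier in the deck.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """A real person, place, or thing."""
    name: str


@dataclass(frozen=True)
class Relation:
    """An association between two entities."""
    name: str
    arg1: Entity
    arg2: Entity


@dataclass(frozen=True)
class Mention:
    """A span of text in an input document that refers to an entity."""
    doc_id: str
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str


@dataclass(frozen=True)
class RelationMention:
    """A phrase connecting two mentions that participate in a relation."""
    left: Mention
    right: Mention
    phrase: str


def mention_at(doc_id, document, start, end):
    """Build a Mention by slicing the document text at the given offsets."""
    return Mention(doc_id, start, end, document[start:end])


document = "B. Obama and his wife M. Obama"
left = mention_at("doc1", document, 0, 8)
right = mention_at("doc1", document, 22, 30)
relation_mention = RelationMention(left, right, document[left.end:right.start].strip())
fact = Relation("HasSpouse", Entity("B. Obama"), Entity("M. Obama"))

print(relation_mention.phrase)
print(fact)
```

Mentions and relation mentions live at the text level; entities and relations are what ends up in the output knowledge base.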
  • 6. THE DEEPDIVE FRAMEWORK: STEP-BY-STEP
    Candidate Generation & Feature Extraction (Source: Incremental Knowledge Base Construction Using DeepDive)
    All data is stored in a relational database. This phase populates the database using a set of SQL queries and User-Defined Functions (feature extractors).
    By default, DeepDive stores all documents in the database with one sentence per row, together with markup produced by standard NLP pre-processing tools, including HTML stripping, part-of-speech tagging, and linguistic parsing.
    DeepDive then executes two types of queries:
    •  Candidate mappings: SQL queries that produce possible mentions, entities, and relations
    •  Feature extractors: associate features with candidates
    “A breakthrough in machine learning would be worth ten Microsofts” - Bill Gates
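The two query types can be sketched in plain Python. This is a simplified illustration under stated assumptions, not DeepDive's API: the row layout, the hand-written NER tags, and the function names `person_mentions`, `spouse_candidates`, and `extract_features` are all hypothetical, and in DeepDive the candidate mappings are expressed as SQL queries.

```python
from itertools import combinations

# One row per sentence, loosely mirroring a sentence table with NLP markup.
# Tokens and NER tags are hand-written here; a real pipeline would get them
# from NLP pre-processing.
sentence_row = {
    "sentence_id": "doc1_s0",
    "tokens": ["B.", "Obama", "and", "his", "wife", "M.", "Obama"],
    "ner_tags": ["PERSON", "PERSON", "O", "O", "O", "PERSON", "PERSON"],
}


def person_mentions(row):
    """Candidate mapping: group consecutive PERSON tokens into mention spans."""
    spans, start = [], None
    # The trailing "O" is a sentinel that closes a span ending the sentence.
    for i, tag in enumerate(row["ner_tags"] + ["O"]):
        if tag == "PERSON" and start is None:
            start = i
        elif tag != "PERSON" and start is not None:
            spans.append((start, i))  # half-open token span [start, i)
            start = None
    return spans


def spouse_candidates(row):
    """Candidate mapping: every pair of person mentions in the sentence."""
    return list(combinations(person_mentions(row), 2))


def extract_features(row, candidate):
    """Feature extractor: describe the tokens between the two mentions."""
    (_, left_end), (right_start, _) = candidate
    between = row["tokens"][left_end:right_start]
    return ["WORDS_BETWEEN=" + "_".join(between),
            "NUM_WORDS_BETWEEN=%d" % len(between)]


for candidate in spouse_candidates(sentence_row):
    print(candidate, extract_features(sentence_row, candidate))
```

Candidate generation is deliberately permissive here: every pair of person mentions becomes a candidate, and it is left to the later learning and inference phase to decide which candidates are true.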
  • 7. THE DEEPDIVE FRAMEWORK: STEP-BY-STEP
    Supervision (Source: Incremental Knowledge Base Construction Using DeepDive)
    Just as in Markov Logic, DeepDive can use training data or evidence about any relation. Each user relation is associated with an evidence relation that indicates whether each entry is true or false.
    Two standard techniques generate training data: hand-labeling and distant supervision.
    Distant Supervision
    Traditional machine learning techniques require a set of training data. In distant supervision, DeepDive uses an existing database (e.g. a domain-specific database) to collect examples of the relations it wants to extract, then uses these examples to generate the training data automatically.
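Distant supervision can be sketched as a lookup against an existing database. The sketch below is an assumption-laden illustration, not DeepDive code: the two small sets stand in for an external knowledge base, "Person X" and "Person Y" are placeholder names, and `distant_label` is a hypothetical helper.

```python
# Hypothetical existing database of known married couples (unordered pairs).
known_spouses = {frozenset({"B. Obama", "M. Obama"})}
# Hypothetical pairs known NOT to be spouses, used as negative examples.
known_non_spouses = {frozenset({"M. Obama", "Person X"})}


def distant_label(person1, person2):
    """Return True or False when the existing database decides the pair,
    and None when it says nothing about it."""
    pair = frozenset({person1, person2})
    if pair in known_spouses:
        return True
    if pair in known_non_spouses:
        return False
    return None  # unlabeled: left for learning and inference to decide


candidates = [("B. Obama", "M. Obama"),
              ("M. Obama", "Person X"),
              ("B. Obama", "Person Y")]

# Evidence relation: same columns as the candidates plus a true/false field.
evidence = [(p1, p2, distant_label(p1, p2)) for p1, p2 in candidates]
for row in evidence:
    print(row)
```

Only the first two candidates receive labels; the third stays unlabeled, which is why no hand-labeling is needed for every prediction.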
  • 8. THE DEEPDIVE FRAMEWORK: STEP-BY-STEP
    Learning & Inference (Source: Incremental Knowledge Base Construction Using DeepDive)
    In this phase, DeepDive generates a factor graph: a probabilistic graphical model that is the abstraction used for learning. DeepDive relies heavily on factor graphs.
    [Figure: An example factor graph. There is one user relation containing all tokens, and there are two correlation relations for adjacent-token correlation (F1) and same-word correlation (F2) respectively. Source: DeepDive: A Data Management System for Automatic Knowledge Base Construction]
    Raw data: He said that he would come.
    In-database representation:
    •  User relation (Token, Word): (A, He), (B, Said), (C, That), (D, He)
    •  Correlation relation F1 (adjacent-token): i (A,B), ii (B,C), iii (C,D)
    •  Correlation relation F2 (same-word): iv (A,D)
    Assignment example: A = 1, B = 0, C = 0, D = 1
    Partition function: Z = f1(1,0) x f1(0,0) x f1(0,1) x f2(1,1), where the first three terms are the factors in F1 and the last term is the factor in F2
    “Problems worthy of attack prove their worth by fighting back” - Paul Erdős
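The factor graph example can be reproduced in a few lines. This is a minimal sketch under assumptions: the slide does not define f1 or f2, so the two functions below are hypothetical, chosen only so the arithmetic is easy to follow. The slide labels the product for one assignment as Z; the sketch calls that product a weight and, as an extra step not shown on the slide, sums it over all 16 assignments, which is the usual way to turn weights into probabilities.

```python
from itertools import combinations, product

# User relation from the slide: token id -> word.
words = {"A": "He", "B": "Said", "C": "That", "D": "He"}
order = ["A", "B", "C", "D"]

# Correlation relations, derived from the token table.
f1_pairs = list(zip(order, order[1:]))           # adjacent tokens (F1)
f2_pairs = [(u, v) for u, v in combinations(order, 2)
            if words[u] == words[v]]             # same word (F2)


# Hypothetical factor functions: both reward pairs that take the same value.
def f1(x, y):
    return 2.0 if x == y else 1.0


def f2(x, y):
    return 3.0 if x == y else 1.0


def weight(assignment):
    """Product of all factor values under one 0/1 assignment to the tokens."""
    w = 1.0
    for u, v in f1_pairs:
        w *= f1(assignment[u], assignment[v])
    for u, v in f2_pairs:
        w *= f2(assignment[u], assignment[v])
    return w


slide_assignment = {"A": 1, "B": 0, "C": 0, "D": 1}
slide_weight = weight(slide_assignment)  # f1(1,0) * f1(0,0) * f1(0,1) * f2(1,1)

# Sum of weights over all 2^4 assignments, used for normalization.
total = sum(weight(dict(zip(order, values)))
            for values in product([0, 1], repeat=len(order)))

print(f1_pairs, f2_pairs)
print(slide_weight, total, slide_weight / total)
```

Enumerating every assignment only works for toy graphs like this one; with four variables there are 16 assignments, and the count doubles with each added variable.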
  • 9. REFERENCES
    Shin, Jaeho, Sen Wu, Feiran Wang, Christopher De Sa, Ce Zhang, and Christopher Ré. "Incremental Knowledge Base Construction Using DeepDive." Proceedings of the VLDB Endowment 8.11 (2015): 1310-321.
    Zhang, Ce. "DeepDive: A Data Management System for Automatic Knowledge Base Construction." Proceedings of the VLDB Endowment 8.13 (2015): 1310-321.