Abstract:
The data needed to answer queries is often available through Web-based APIs. Indeed, for a given query there may be many Web-based sources which can be used to answer it, with the sources overlapping in their vocabularies, and differing in their access restrictions (required arguments) and cost.
We introduce PDQ (Proof-Driven Query Answering), a system for determining a query plan in the presence of Web-based sources. It is: (i) constraint-aware -- exploiting relationships between sources to rewrite an expensive query into a cheaper one, (ii) access-aware -- abiding by any access restrictions known in the sources, and (iii) cost-aware -- making use of any cost information that is available about services.
PDQ takes the novel approach of generating query plans from proofs that a query is answerable. We demonstrate the use of PDQ and its effectiveness in generating low-cost plans.
1. A short intro to PDQ: Proof-driven Querying
Michael Benedikt
with Julien Leblay, Efi Tsamoura, and Michael Vanden Boom
2. Background
DBOnto: Semantics for a better world
Exploit semantics of data:
within a single source, among distributed sources, across data models
• Enable new applications
• Deliver better performance for current data-intensive tasks
• Diminish effort in integrating complex data sources
3. Background
Dimensions of Semantic Data
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints]
5. Background
Semantic Data Technology
Semantic Web
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints]
• RDF data model, description logic constraints
• Inherently incomplete sources
• Certain answer semantics
• Wide range of target implementations
6. Background
Semantic Data Technology
Query Optimization with Constraints
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints]
• Relational data model and constraints
• Complete information
• Access via lookup indices in sources
• Compile to plan language of DBMS
7. Background
Semantic Data Technology
Query Optimization with Constraints via Reformulation
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints]
• Relational data model and constraints
• Complete sources
• Compile to query language (e.g. SQL)
8. Background
Semantic Data Technology
Query Rewriting with Exact Views
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints]
• Relational sources and constraints
• Base data may not be accessible
• Can still look for exact answers to queries
• Compile to query language (e.g. SQL)
9. Background
Semantic Data Technology
Federated Querying Over Web-based Sources
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints]
• Model sources and constraints relationally
• Complete information on subset of sources
• Distributed sources with mix of access regimes
• Compile to middleware plan
10. Background
Long-term PDQ vision
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints. Figure label: PDQ]
11. Functionality
PDQ: what it is today
System for answering queries Q in the presence of semantic relationships and access restrictions on sources
Targets:
• Relational data model and constraints
• Sufficient accessible information assumption: there is sufficient accessible data to obtain the exact answers to the query Q
• Compilation into a “static plan” (reformulation, physical plan, middleware plan)
Unified framework for:
• Query Optimization/Reformulation with Constraints
• Querying with Materialized Views
• Federated Querying with Complete Information
12. Functionality
PDQ: what it is
Inputs to the PDQ planner:
• Metadata, including a description D of access to sources and integrity constraints C
• Cost information (e.g. cost function on plans)
• Query Q
Output of the PDQ planner:
• Pbest: plan using the access model described by D, with minimal cost, giving the exact answer to Q for databases satisfying constraints C
PDQ runtime: executes plans on top of Web-based or local data sources
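To make the planner's inputs concrete, here is a minimal sketch of how access restrictions and a cost function on plans could be represented. All names (AccessMethod, Access, is_executable, plan_cost, the Directory and Profile sources) and the additive per-call cost are illustrative assumptions, not PDQ's actual data structures or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AccessMethod:
    """One way to call a source; hypothetical stand-in for an entry of D."""
    relation: str
    arity: int
    input_positions: frozenset  # 0-based argument positions that must be supplied
    cost: float                 # assumed per-call cost

@dataclass(frozen=True)
class Access:
    """One plan step: call `method` with these argument names."""
    method: AccessMethod
    args: tuple

def is_executable(plan, known):
    """True if each step's required inputs are bound by `known` or earlier steps."""
    bound = set(known)
    for step in plan:
        needed = {step.args[i] for i in step.method.input_positions}
        if not needed <= bound:
            return False
        bound.update(step.args)  # values returned by this access become available
    return True

def plan_cost(plan):
    """A deliberately simple cost function on plans: sum of per-access costs."""
    return sum(step.method.cost for step in plan)

# Hypothetical sources: a free-access directory and a lookup that requires an id.
directory = AccessMethod("Directory", 2, frozenset(), 5.0)
profile = AccessMethod("Profile", 2, frozenset({0}), 1.0)

good = [Access(directory, ("id", "name")), Access(profile, ("id", "phone"))]
bad = [Access(profile, ("id", "phone")), Access(directory, ("id", "name"))]
```

In this sketch the plan `good` respects the access restrictions because the directory supplies the id that the profile lookup requires, while `bad` tries the lookup before any id is known.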
13. Under the hood
PDQ: how it works (sort of)
Key observation: under the sufficient accessible information assumption on Q, C, D, there is always a “static plan” PQ (e.g. a relational algebra query) that can be run to answer Q
We can find such a PQ by looking for a “proof that there is sufficient information to answer Q”.
• First main component: procedures to turn “proofs of answerability” into plans
• Proof-to-plan procedure works for an extremely rich class of integrity constraints
• Adaptable to different target implementations (SQL query, physical plan, distributed plan…)
• These “proof-to-plan” procedures are coupled with a reasoning system for finding the proofs of answerability.
• Plug-in architecture: Chase procedure, Tableau-based FO theorem-prover, …
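The proof-to-plan idea can be illustrated with a toy sketch, under strong simplifying assumptions: constraints are restricted to inclusion dependencies without existential variables, the "proof" is a greedy saturation that marks facts as accessible, and each proof step is recorded as one access. The function names, the tuple encodings, and the Profile/Directory sources are hypothetical; PDQ's actual procedures handle a far richer class of constraints.

```python
def chase(facts, inclusions):
    """Close `facts` under inclusion dependencies given as (src, positions, dst)."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for rel, args in list(facts):
            for src, positions, dst in inclusions:
                if rel == src:
                    new = (dst, tuple(args[p] for p in positions))
                    if new not in facts:
                        facts.add(new)
                        changed = True
    return facts

def proof_to_plan(query_atoms, inclusions, methods, constants=()):
    """Return an ordered list of accesses exposing every query atom, or None.

    methods: (relation, input_positions) pairs describing the allowed accesses.
    Each plan step is (relation, names of the values passed as inputs).
    """
    facts = chase(query_atoms, inclusions)   # facts implied by the query
    accessible = set(constants)              # values we are able to supply
    exposed, plan = set(), []
    progress = True
    while progress and not set(query_atoms) <= exposed:
        progress = False
        for fact in sorted(facts - exposed):
            rel, args = fact
            for m_rel, inputs in methods:
                if m_rel == rel and all(args[i] in accessible for i in inputs):
                    plan.append((rel, tuple(args[i] for i in inputs)))
                    exposed.add(fact)
                    accessible.update(args)  # returned values become usable
                    progress = True
                    break
    return plan if set(query_atoms) <= exposed else None

# Query: names in Profile(i, n), where Profile can only be called with an id.
query = [("Profile", ("i", "n"))]
methods = [("Directory", ()), ("Profile", (0,))]
# Assumed constraint: every Profile id also appears in Directory.
constraint = [("Profile", (0,), "Directory")]

plan = proof_to_plan(query, constraint, methods)
no_plan = proof_to_plan(query, [], methods)
```

With the constraint, the sketch yields a plan that first reads Directory and then calls Profile with each id; without it there is no proof that the ids can be obtained, so no plan is returned.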
14. Under the hood
PDQ: how it works in a bit more detail
Inputs to the PDQ planner:
• Metadata, including a description D of access to sources and integrity constraints C
• Query Q
• Cost information (e.g. cost function on plans)
Inside the PDQ planner:
• Reasoning system for finding “proofs of answerability”
• Proof-to-Plan conversion
15. Under the hood
PDQ: how it works, still more
We can find a static plan PQ getting the exact answer to Q by looking for a “proof that Q is answerable” and then applying a proof-to-plan procedure.
Last component, the search strategy: we can find a good PQ by searching for a proof that
1. witnesses that Q is answerable
2. generates a low-cost plan
Search is directed by proof goal and cost
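A search directed by both proof goal and cost can be sketched as a uniform-cost search over partial proofs. This is only an illustration under assumed simplifications: a proof state is the set of facts exposed so far, cost is additive and non-negative per access, and the candidate facts are given up front. The function cheapest_plan and the source names are hypothetical and do not reflect PDQ's actual search strategy or cost functions.

```python
import heapq
from itertools import count

def cheapest_plan(facts, goal, methods, constants=()):
    """Uniform-cost search over sets of exposed facts.

    facts:   candidate facts (relation, args), e.g. the result of a chase
    goal:    facts that must be exposed for the query to be answered
    methods: (relation, input_positions, cost) triples, cost >= 0
    Returns (total_cost, plan) or None if the goal is unreachable.
    """
    goal = frozenset(goal)
    tie = count()  # tie-breaker so the heap never has to compare plans
    frontier = [(0.0, next(tie), frozenset(), [])]
    seen = set()
    while frontier:
        cost, _, exposed, plan = heapq.heappop(frontier)
        if goal <= exposed:          # proof goal reached at minimal cost
            return cost, plan
        if exposed in seen:
            continue
        seen.add(exposed)
        accessible = set(constants) | {a for _, args in exposed for a in args}
        for fact in facts:
            if fact in exposed:
                continue
            rel, args = fact
            for m_rel, inputs, m_cost in methods:
                if m_rel == rel and all(args[i] in accessible for i in inputs):
                    step = (rel, tuple(inputs))
                    heapq.heappush(
                        frontier,
                        (cost + m_cost, next(tie), exposed | {fact}, plan + [step]),
                    )
    return None

# Two hypothetical ways to obtain ids, with different assumed costs.
facts = [("Directory", ("i",)), ("CheapIndex", ("i",)), ("Profile", ("i", "n"))]
methods = [("Directory", (), 5.0), ("CheapIndex", (), 1.0), ("Profile", (0,), 1.0)]
result = cheapest_plan(facts, [("Profile", ("i", "n"))], methods)
```

Because states are expanded in order of accumulated cost, the first state that satisfies the goal corresponds to a cheapest plan under this additive cost model; here the search prefers the cheaper id source.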
17. Status
PDQ today and tomorrow
• Theoretical basis given in PODS 2014 paper
• Demonstration over web services presented at VLDB 2014
• Implementation generates SQL reformulation over relational sources (run on top of Postgres)
Moving forward:
• Pilot project beginning Oct 2014 to explore “native implementation” of PDQ on top of the plan language of the LogicBlox DBMS
• Large EPSRC-funded project 2015-2020 to explore diverse uses of PDQ
18. Status
PDQ today and tomorrow
[Figure axes: completeness of sources / source access model; target implementation; data model for queries and constraints. Figure labels: PDQ 2014, PDQ 2020]
19. Next Steps
PDQ: Next Steps
• More info at http://cs.ox.ac.uk/pdq
• See the demo!