Drill / SQL / Optiq
     Julian Hyde
 Apache Drill User Group
     2013-03-13
SQL
SQL: Pros & cons
Fact:

  SQL is older than Macaulay Culkin

Less interesting but more relevant:

  Can be written by (lots of) humans

  Can be written by machines

  Requires query optimization

  Allows query optimization

  Based on “flat” relations and basic relational
  operations
Quick intro to Optiq
Introducing Optiq
Framework
Derived from LucidDB
Minimal query mediator:

    No storage

    No runtime

    No metadata

    Query planning engine

    Core operators & rewrite rules

    Optional SQL parser/validator
Expression tree

  SELECT p."product_name", COUNT(*) AS c
  FROM "splunk"."splunk" AS s
  JOIN "mysql"."products" AS p
    ON s."product_id" = p."product_id"
  WHERE s."action" = 'purchase'
  GROUP BY p."product_name"
  ORDER BY c DESC

Plan: a scan (Splunk, table splunk) and a scan (MySQL, table products)
feed a join (key: product_id); then filter (condition: action =
'purchase'), group (key: product_name, agg: count), and sort (key: c
DESC).
Expression tree (optimized)

  SELECT p."product_name", COUNT(*) AS c
  FROM "splunk"."splunk" AS s
  JOIN "mysql"."products" AS p
    ON s."product_id" = p."product_id"
  WHERE s."action" = 'purchase'
  GROUP BY p."product_name"
  ORDER BY c DESC

Plan: the filter (condition: action = 'purchase') is pushed into
Splunk, directly above the scan of table splunk; the filtered stream
joins MySQL's scan of table products (key: product_id), then group
(key: product_name, agg: count) and sort (key: c DESC).
Metadata SPI

    interface Table
    −   RelDataType getRowType()

    interface TableFunction
    −   List<Parameter> getParameters()
    −   Table apply(List arguments)
    −   e.g. ViewTableFunction

    interface Schema
     − Map<String, List<TableFunction>>
       getTableFunctions()
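The shapes above can be restated as a minimal, self-contained Java sketch. These are simplified stand-ins, not the real Optiq interfaces: in particular, the real Table returns a RelDataType rather than a plain map, and the class and table names here (MetadataSpiSketch, DonutsFunction, DONUTS) are invented for illustration.

```java
import java.util.*;

public class MetadataSpiSketch {
    // Stand-in for the real interface: row type as column name -> type name.
    interface Table {
        Map<String, String> getRowType();
    }

    interface TableFunction {
        List<String> getParameters();
        Table apply(List<?> arguments);
    }

    interface Schema {
        Map<String, List<TableFunction>> getTableFunctions();
    }

    // A zero-argument table function that always yields the DONUTS table.
    static class DonutsFunction implements TableFunction {
        public List<String> getParameters() {
            return Collections.emptyList();
        }
        public Table apply(List<?> arguments) {
            return () -> {
                Map<String, String> rowType = new LinkedHashMap<>();
                rowType.put("name", "VARCHAR");
                rowType.put("ppu", "DOUBLE");
                return rowType;
            };
        }
    }

    public static void main(String[] args) {
        Schema schema = () ->
            Map.of("DONUTS", List.of((TableFunction) new DonutsFunction()));
        Table donuts =
            schema.getTableFunctions().get("DONUTS").get(0).apply(List.of());
        System.out.println(donuts.getRowType());  // {name=VARCHAR, ppu=DOUBLE}
    }
}
```

A view is just a TableFunction whose apply() expands a stored query, which is why ViewTableFunction fits the same interface.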
Operators and rules

    Rule: interface RelOptRule

    Operator: interface RelNode

    Core operators: TableAccess, Project,
    Filter, Join, Aggregate, Order, Union,
    Intersect, Minus, Values

    Some rules: MergeFilterRule,
    PushAggregateThroughUnionRule,
    RemoveCorrelationForScalarProjectRule
    + 100 more
Planning algorithm

    Start with a logical plan and a set of
    rewrite rules

    Form a list of rewrite rules that are
    applicable (based on pattern-matching)

    Fire the rule that is likely to do the most
    good

    Rule generates an expression that is
    equivalent (and hopefully cheaper)

    Queue up new rule matches

    Repeat until cheap enough
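The loop above can be sketched as a toy greedy planner. Expr, Rule, and Node here are hypothetical stand-ins with a trivial cost model; the real engine uses RelNode and RelOptRule with equivalence sets, and does not simply hill-climb.

```java
import java.util.*;

public class PlannerSketch {
    interface Expr { double cost(); }

    interface Rule {
        boolean matches(Expr e);   // pattern-matching step
        Expr apply(Expr e);        // generate an equivalent expression
    }

    static final class Node implements Expr {
        final String desc;
        final double cost;
        Node(String desc, double cost) { this.desc = desc; this.cost = cost; }
        public double cost() { return cost; }
    }

    // Repeatedly fire the applicable rule that promises the biggest
    // improvement, until the plan is cheap enough or no rule helps.
    static Expr plan(Expr root, List<Rule> rules, double cheapEnough) {
        Expr best = root;
        boolean improved = true;
        while (best.cost() > cheapEnough && improved) {
            improved = false;
            final Expr current = best;
            Optional<Expr> candidate = rules.stream()
                .filter(r -> r.matches(current))   // applicable rules
                .map(r -> r.apply(current))        // equivalent expressions
                .min(Comparator.comparingDouble(Expr::cost));
            if (candidate.isPresent() && candidate.get().cost() < best.cost()) {
                best = candidate.get();
                improved = true;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One made-up rule: pushing a filter into the scan halves the cost.
        Rule pushFilter = new Rule() {
            public boolean matches(Expr e) {
                return ((Node) e).desc.equals("filter-above-scan");
            }
            public Expr apply(Expr e) {
                return new Node("filter-pushed-into-scan", e.cost() / 2);
            }
        };
        Expr optimized =
            plan(new Node("filter-above-scan", 100), List.of(pushFilter), 60);
        System.out.println(optimized.cost());  // 50.0
    }
}
```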
Concepts

    Cost

    Equivalence sets

    Calling convention

    Logical vs Physical

    Traits

    Implementation
Outside the kernel

Diagram: a JDBC client connects to a JDBC server, which feeds the
optional SQL parser / validator and the core query planner; plugged
into the core are the Metadata SPI, pluggable rules, pluggable
3rd-party ops, and 3rd-party data sources.

Components outside the kernel:

    SQL parser/validator

    JDBC driver

    SQL function library (validation + code-generation)

    Lingual (Cascading adapter)

    Splunk adapter

    Drill adapter
Optiq roadmap

    Building blocks for analytic DB:
    −   In-memory tables in a distributed cache
    −   Materialized views
    −   Partitioned tables

    Faster planning

    Easier rule development

    ODBC driver

    Adapters for XXX, YYY
Applying Optiq to Drill
  1. Enhance SQL
 2. Query translation
Drill vs Traditional SQL

    SQL:
    −   Flat data
    −   Schema up front

    Drill:
    −   Nested data (list & map)
    −   No schema

    We'd like to write:
    −   SELECT name, toppings[2] FROM donuts
        WHERE ppu > 0.6

    Solution: ARRAY, MAP, ANY types
ARRAY & MAP SQL types

  ARRAY is like java.util.List

  MAP is like java.util.LinkedHashMap
Examples:

  VALUES ARRAY ['a', 'b', 'c']

  VALUES MAP ['Washington', 1, 'Obama', 44]

  SELECT name, address[1], address[2], state
  FROM Employee

  SELECT * FROM Donuts WHERE
  CAST(donuts['ppu'] AS DOUBLE) > 0.6
ANY SQL type

    ANY means “type to be determined at runtime”

    Validator narrows down possible type based
    on operators used

    Similar to converting Java's type system into
    JavaScript's. (Not easy.)
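A toy illustration of the narrowing idea, with made-up type names; this is not Optiq's validator, just the principle it applies.

```java
public class AnyTypeSketch {
    enum SqlType { ANY, DOUBLE, VARCHAR, BOOLEAN }

    // If an ANY operand meets a comparison against a typed operand,
    // the validator can narrow ANY to that operand's type.
    static SqlType narrow(SqlType declared, String operator, SqlType other) {
        boolean comparison = operator.equals(">") || operator.equals("<")
            || operator.equals("=");
        if (declared == SqlType.ANY && comparison && other != SqlType.ANY) {
            return other;
        }
        return declared;  // no information yet: stays as declared
    }

    public static void main(String[] args) {
        // "WHERE ppu > 0.6": ppu is ANY, 0.6 is DOUBLE, so ppu narrows to DOUBLE.
        System.out.println(narrow(SqlType.ANY, ">", SqlType.DOUBLE));  // DOUBLE
    }
}
```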
Sugaring the donut
Query:

  SELECT c['ppu'], c['toppings'][1] FROM
  Donuts
Additional syntactic sugar:

  c.x means c['x']
So:

  CREATE TABLE Donuts(c ANY)

  SELECT c.ppu, c.toppings[1] FROM
  Donuts
Better:
UNNEST
Employees nested inside departments:

  CREATE TYPE employee (empno INT, name
  VARCHAR(30));

  CREATE TABLE dept (deptno INT, name
  VARCHAR(30),
   employees EMPLOYEE ARRAY);
Unnest:

  SELECT d.deptno, d.name, e.empno, e.name
  FROM dept AS d
   CROSS JOIN UNNEST(d.employees) AS e
The SQL standard provides other operations on collections.
Applying Optiq to Drill
   1. Enhance SQL
2. Query translation
Query translation
SQL:

  select d['name'] as name, d['xx'] as xx
  from (
    select _MAP['donuts'] as d from donuts)
  where cast(d['ppu'] as double) > 0.6

Drill:

  { head: { … },
    storage: { … },
    query: [ {
      op: "sequence", do: [
        { op: "scan", … selection: { path:
Planner log
Original rel:

  AbstractConverter(subset=[rel#14:Subset#3.ARRAY], convention=[ARRAY])
    ProjectRel(subset=[rel#10:Subset#3.NONE], NAME=[ITEM($0, 'name')], XX=[ITEM($0, 'xx')])
      FilterRel(subset=[rel#8:Subset#2.NONE], condition=[>(CAST(ITEM($0, 'ppu')):DOUBLE NOT NULL, 0.6)])
        ProjectRel(subset=[rel#6:Subset#1.NONE], D=[ITEM($0, 'donuts')])
          DrillScan(subset=[rel#4:Subset#0.DRILL],
Next

    Translate join, aggregate, sort, set ops

    Operator overloading with ANY

    Mondrian on Drill
Thank you!


https://github.com/julianhyde/share/tree/master/slides
https://github.com/julianhyde/incubator-drill
https://github.com/julianhyde/optiq
http://incubator.apache.org/drill

@julianhyde

Speaker notes

  It's much more efficient if we push filters and aggregations to Splunk. But the user writing SQL shouldn't have to worry about that. This is not about processing data; it is about processing expressions, reformulating the question. The question is the parse tree of a query, and the parse tree is a data flow. In Splunk, a data flow looks like a pipeline of Linux commands. SQL systems have pipelines too (sometimes they are dataflow trees), built up of the basic relational operators. Think of the SQL SELECT, WHERE, JOIN, GROUP BY, ORDER BY clauses.