Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

A talk given at ACM SIGMOD 2018 in support of the paper "Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources" (https://arxiv.org/abs/1802.10233).

Apache Calcite is a foundational software framework that provides query processing, optimization, and query language support to many popular open-source data processing systems such as Apache Hive, Apache Storm, Apache Flink, Druid, and MapD. Calcite's architecture consists of a modular and extensible query optimizer with hundreds of built-in optimization rules, a query processor capable of processing a variety of query languages, an adapter architecture designed for extensibility, and support for heterogeneous data models and stores (relational, semi-structured, streaming, and geospatial). This flexible, embeddable, and extensible architecture is what makes Calcite an attractive choice for adoption in big-data frameworks. It is an active project that continues to add support for new types of data sources, query languages, and approaches to query processing and optimization.

Published in: Software

  1. Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
     Edmon Begoli, Jesús Camacho-Rodríguez, Julian Hyde, Michael J. Mior, Daniel Lemire
     SIGMOD 2018, Houston, Texas, USA
  2. Outline
     Background and History
     Architecture
     Adapter Design
     Optimizer and Planner
     Adoption
     Uses in Research and Scholastic Potential
     Roadmap and Future Work
  3. What is Calcite?
     Apache Calcite is an extensible framework for building data management systems. It is an open source project governed by the Apache Software Foundation, is written in Java, and is used by dozens of projects and companies, as well as several research projects.
  4. Origins and Design Principles
     Origins: 2004 – LucidEra and SQLstream were each building SQL systems; 2012 – pare down code base, enter Apache as an incubator project.
     Problem: Building a high-quality database requires ~20 person-years of effort and 5 years elapsed.
     Solution: Create an open source framework that a community can contribute to and use to build their own DBMSs.
     Design principles: Flexible → relational algebra; extensible/composable → Volcano-style planner; easy to contribute to → Java, FP style.
     Alternatives: PostgreSQL, Apache Spark, AsterixDB.
  5. Architecture
     Core – Operator expressions (relational algebra) and planner (based on Volcano/Cascades)
     External – Data storage, algorithms and catalog
     Optional – SQL parser, JDBC & ODBC drivers
     Extensible – Planner rewrite rules, statistics, cost model, algebra, UDFs
  6. Adapter Design
     A pattern that defines how Calcite incorporates diverse data sources for general access.
     Model – specification of the physical properties of the data source.
     Schema – definition of the data (format and layouts) found in the model.
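As a rough illustration of the model/schema split, the sketch below builds a small model document in the JSON layout that Calcite's documentation uses for model files. The schema name `SALES` and the directory `sales` are placeholders chosen for this example, and the factory class is the one from Calcite's CSV adapter tutorial; the slide itself does not show a model file.

```python
import json

# Hypothetical model: "SALES" and the "sales" directory are placeholders.
# The factory names the adapter's schema factory (here, the CSV adapter's).
model = {
    "version": "1.0",
    "defaultSchema": "SALES",
    "schemas": [
        {
            "name": "SALES",
            "type": "custom",
            "factory": "org.apache.calcite.adapter.csv.CsvSchemaFactory",
            "operand": {"directory": "sales"},
        }
    ],
}


def schema_names(m):
    """List the schema names declared in a model document."""
    return [s["name"] for s in m["schemas"]]


# Serialize the model as it would be stored in a model file.
model_json = json.dumps(model, indent=2)
```

The model says where the data lives and which factory to use; the tables and their row types are then supplied by the schema that the factory creates.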
  7. Represent query as relational algebra
     Query:
       select p.productName, count(*) as c
       from splunk.splunk as s
       join mysql.products as p on s.productId = p.productId
       where s.action = 'purchase'
       group by p.productName
       order by c desc
     Plan diagram labels: MySQL, Splunk; scan (Table: products); scan (Table: splunk); join (Key: productId); filter (Condition: action = 'purchase'); group (Key: productName, Agg: count); sort (Key: c desc)
  8. Optimize query by applying transformation rules
     Query:
       select p.productName, count(*) as c
       from splunk.splunk as s
       join mysql.products as p on s.productId = p.productId
       where s.action = 'purchase'
       group by p.productName
       order by c desc
     Plan diagram labels: MySQL, Splunk; scan (Table: splunk); scan (Table: products); join (Key: productId); filter (Condition: action = 'purchase'); group (Key: productName, Agg: count); sort (Key: c desc)
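The optimized plan diagram does not survive in this transcript, so the following is only a sketch of what one transformation rule might look like. It uses filter pushdown through an inner join as an example rule. The `Scan`, `Filter` and `Join` classes and the `push_filter_into_join` function are hypothetical toy types written for this example and are not Calcite's `RelNode` or `RelOptRule` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    table: str            # e.g. "splunk" or "products"


@dataclass(frozen=True)
class Filter:
    input: object
    column: str           # qualified as "table.column"
    value: str


@dataclass(frozen=True)
class Join:               # assumed to be an inner join
    left: object
    right: object
    key: str


def tables(node):
    """Return the set of table names reachable from a node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Filter):
        return tables(node.input)
    return tables(node.left) | tables(node.right)


def push_filter_into_join(node):
    """Toy rule: Filter(Join(l, r)) becomes Join(Filter(l), r) when the
    predicate references only l (symmetrically for r)."""
    if not (isinstance(node, Filter) and isinstance(node.input, Join)):
        return node       # pattern does not match, leave the plan alone
    join = node.input
    table = node.column.split(".")[0]
    if table in tables(join.left):
        return Join(Filter(join.left, node.column, node.value), join.right, join.key)
    if table in tables(join.right):
        return Join(join.left, Filter(join.right, node.column, node.value), join.key)
    return node


# The slide's query: filter on action = 'purchase' above the join.
before = Filter(Join(Scan("splunk"), Scan("products"), "productId"),
                "splunk.action", "purchase")
after = push_filter_into_join(before)
```

After the rewrite the filter sits directly above the `splunk` scan, so fewer rows reach the join.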
  9. Conventions
     1. Plans start as logical nodes.
     2. Assign each Scan its table's native convention.
     3. Fire rules to propagate conventions to other nodes.
     4. The best plan may use an engine not tied to any native format. To implement, generate a program that calls out to query1 and query2.
     [Plan diagrams built from Join, Filter and Scan nodes]
  10. Conventions & adapters
      Convention provides a uniform representation for hybrid queries.
      Like ordering and distribution, convention is a physical property of nodes.
      Adapter = schema factory (lists tables) + convention + rules to convert nodes to convention
      [Plan diagram built from Scan, Join and Filter nodes]
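The four steps listed under "Conventions" can be mimicked with a much-simplified, deterministic sketch. It assumes that conventions are plain strings, that a node adopts its inputs' convention whenever they all agree, and that a generic convention (called `enumerable` here) is used otherwise. A real planner chooses among alternatives by cost, and the `Node` class below is a hypothetical stand-in rather than Calcite's API.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumption: "enumerable" stands in for a generic engine that is not tied
# to any data source's native format.
GENERIC = "enumerable"


@dataclass
class Node:
    op: str
    inputs: List["Node"] = field(default_factory=list)
    table_convention: Optional[str] = None   # set only on scans
    convention: str = "logical"              # step 1: plans start as logical nodes


def assign_conventions(node):
    """Assign conventions bottom-up and return the same node."""
    for child in node.inputs:
        assign_conventions(child)
    if node.op == "scan":
        # Step 2: each scan takes its table's native convention.
        node.convention = node.table_convention
    else:
        seen = {child.convention for child in node.inputs}
        # Step 3: propagate a convention shared by all inputs.
        # Step 4: fall back to the generic engine when inputs disagree.
        node.convention = seen.pop() if len(seen) == 1 else GENERIC
    return node


splunk_scan = Node("scan", table_convention="splunk")
mysql_scan = Node("scan", table_convention="jdbc:mysql")
purchases = Node("filter", [splunk_scan])
plan = assign_conventions(Node("join", [purchases, mysql_scan]))
```

In this toy plan the filter ends up in the `splunk` convention, while the join, whose inputs come from two different systems, falls back to the generic one.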
  11. Streaming SQL
      Stream ~= append-only table
      Streaming queries return deltas
      Stream-table duality: Orders is used as both stream and table
      Our contributions:
      ➢ Popularize streaming SQL
      ➢ SQL parser / validator / rules
      ➢ Reference implementation & TCK
      Example: "Show me real-time orders whose size is larger than the average for that product over the preceding year"
        select stream *
        from Orders as o
        where units > (
          select avg(units)
          from Orders as h
          where h.productId = o.productId
          and h.rowtime > o.rowtime - interval '1' year)
  12. Uses and Adoption
  13. Uses in Research
      ● Polystore research – use as lightweight heterogeneous data processing platform
      ● Optimization and query profiling – general performance, and optimizer research
      ● Reasoning over Streams, Graphs – under consideration
      ● Open-source, production grade learning and research platform
  14. Future Work and Roadmap
      ● Support its use as a standalone engine – DDL, materialized views, indexes and constraints.
      ● Improvements to the design and extensibility of the planner (modularity, pluggability)
      ● Incorporation of new parametric approaches into the design of the optimizer.
      ● Support for an extended set of SQL commands, functions, and utilities, including full compliance with OpenGIS (spatial).
      ● New adapters for non-relational data sources such as array databases.
      ● Improvements to performance profiling and instrumentation.
  15. Thank you! Questions?
      @ApacheCalcite
      https://calcite.apache.org
      https://arxiv.org/abs/1802.10233
  16. Extra slides
  17. Calcite framework
      Cost, statistics: RelOptCost; RelOptCostFactory; RelMetadataProvider (• RelMdColumnUniqueness • RelMdDistinctRowCount • RelMdSelectivity)
      SQL parser: SqlNode; SqlParser; SqlValidator
      Transformation rules: RelOptRule (• FilterMergeRule • AggregateUnionTransposeRule • 100+ more)
      Global transformations: • Unification (materialized view) • Column trimming • De-correlation
      Relational algebra: RelNode (operator) (• TableScan • Filter • Project • Union • Aggregate • …); RelDataType (type); RexNode (expression); RelTrait (physical property) (• RelConvention (calling-convention) • RelCollation (sortedness) • RelDistribution (partitioning)); RelBuilder
      JDBC driver
      Metadata: Schema; Table; Function (• TableFunction • TableMacro); Lattice
  18. Avatica
      ● Database connectivity stack
      ● Self-contained sub-project of Calcite
      ● Fast, open, stable
      ● Protobuf or JSON over HTTP
      ● Powers Phoenix Query Server
  19. Lattice (optimized)
      Tiles (dimension set: row count):
      raw: 1m
      (z, s, g, y, m): 912k
      (z, g, y, m): 909k
      (z, s, y, m): 831k
      (z, s, g, m): 644k
      (z, s, g, y): 392k
      (s, g, y, m): 6k
      (z, s, g): 83.6k
      (g, y, m): 120
      (z, s): 43.4k
      (y, m): 60
      (g, m): 24
      (g, y): 10
      (z): 43k
      (s): 50
      (m): 12
      (y): 5
      (g): 2
      (): 1
      Key: z = zipcode (43k), s = state (50), g = gender (2), y = year (5), m = month (12)
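One way to read the lattice figures: a query that groups by a set of dimensions can be answered from any tile whose dimension set contains those dimensions, and the smallest such tile is the cheapest to scan. The sketch below applies that idea to a few of the row counts from the slide. Which tiles are actually materialized is an assumption made for this example, and `best_tile` is a hypothetical helper, not a Calcite function.

```python
# Row counts taken from the slide, keyed by dimension set
# (z = zipcode, s = state, g = gender, y = year, m = month).
TILES = {
    frozenset("zsgym"): 912_000,
    frozenset("sgym"): 6_000,
    frozenset("zsg"): 83_600,
    frozenset("gym"): 120,
    frozenset("z"): 43_000,
}

# Assumption: only these tiles have been materialized.
MATERIALIZED = [frozenset("zsgym"), frozenset("sgym"),
                frozenset("gym"), frozenset("z")]


def best_tile(dims, materialized):
    """Return the smallest materialized tile covering dims, or None if no
    tile covers them (the query then has to read the raw table)."""
    wanted = set(dims)
    candidates = [tile for tile in materialized if wanted <= tile]
    if not candidates:
        return None
    return min(candidates, key=lambda tile: TILES[tile])


by_gender_year = best_tile({"g", "y"}, MATERIALIZED)   # covered by (g, y, m)
by_state = best_tile({"s"}, MATERIALIZED)              # covered by (s, g, y, m)
```

A query grouped by gender and year reads the 120-row (g, y, m) tile instead of the one-million-row raw table.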
  20. Aggregation and windows on streams
      GROUP BY aggregates multiple rows into sub-totals
      ➢ In regular GROUP BY each row contributes to exactly one sub-total
      ➢ In multi-GROUP BY (e.g. HOP, GROUPING SETS) a row can contribute to more than one sub-total
      Window functions (OVER) leave the number of rows unchanged, but compute extra expressions for each row (based on neighboring rows)
      [Diagram panels: GROUP BY, Multi GROUP BY, Window functions]
  21. Tumbling, hopping & session windows in SQL
      Tumbling window:
        select stream … from Orders group by floor(rowtime to hour)
        select stream … from Orders group by tumble(rowtime, interval '1' hour)
      Hopping window:
        select stream … from Orders group by hop(rowtime, interval '1' hour, interval '2' hour)
      Session window:
        select stream … from Orders group by session(rowtime, interval '1' hour)
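To make the three window types concrete, here is a minimal sketch that assigns integer timestamps (say, minutes) to windows. It assumes half-open windows `[start, end)` aligned to multiples of the interval, reads the first interval of `hop` as the emit period and the second as the window length, and closes a session once the gap between consecutive rows reaches the given interval. These are assumptions for illustration; the function names only mirror the SQL functions on the slide.

```python
def tumble(ts, size):
    """The single fixed-size window [start, start + size) containing ts."""
    start = ts - ts % size
    return (start, start + size)


def hop(ts, emit, size):
    """Every window of length `size`, starting at multiples of `emit`,
    that contains ts. A row can land in more than one window."""
    windows = []
    start = ts - ts % emit            # latest window start at or before ts
    while start + size > ts:          # window still covers ts
        windows.append((start, start + size))
        start -= emit
    return sorted(windows)


def sessions(timestamps, gap):
    """Group timestamps into sessions as (first, last) pairs; a new session
    starts when the distance to the previous row is at least `gap`."""
    result = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][1] < gap:
            result[-1] = (result[-1][0], ts)    # extend the current session
        else:
            result.append((ts, ts))             # open a new session
    return result


one_window = tumble(90, 60)
two_windows = hop(90, 60, 120)
grouped = sessions([0, 10, 100, 130, 300], 60)
```

With a 60-minute tumble, minute 90 falls into exactly one window; with a 60-minute hop over 120-minute windows it falls into two, which is the sense in which a row can contribute to more than one sub-total.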
  22. Controlling when data is emitted
      Early emission is the defining characteristic of a streaming query.
      The emit clause is a SQL extension inspired by Apache Beam's "trigger" notion. (Still experimental… and evolving.)
      A relational (non-streaming) query is just a query with the most conservative possible emission strategy.
        select stream productId, count(*) as c
        from Orders
        group by productId, floor(rowtime to hour)
        emit at watermark, early interval '2' minute, late limit 1;

        select * from Orders
        emit when complete;
