A smarter Pig: Building a SQL interface to Apache Pig using Apache Calcite

What if Apache Pig had a SQL front-end and query optimizer? What if Apache Calcite was able to use Pig and MapReduce to run queries? In this project, we aimed to answer both questions by adding a Pig adapter for Calcite. In this talk, we describe Calcite's adapter framework, how we used it to write a Pig adapter, and how you can use this SQL interface to Pig for interactive and long-running queries.

A talk given by Eli Levine and Julian Hyde at Apache: Big Data, Miami, on May 17th, 2017.

Published in: Software

  1. A Smarter Pig: Building a SQL interface to Pig using Apache Calcite
     Eli Levine & Julian Hyde
     Apache: Big Data, Miami, 2017/05/17

  2. About us
     Julian Hyde (@julianhyde): original developer of Calcite; PMC member of Calcite, Drill, Eagle, Kylin; ASF member
     Eli Levine (@teleturn): PMC member of Phoenix; ASF member

  3. Apache Calcite
     Apache top-level project since October, 2015
     Query planning framework
       ➢ Relational algebra, rewrite rules
       ➢ Cost model & statistics
       ➢ Federation via adapters
       ➢ Extensible
     Packaging
       ➢ Library
       ➢ Optional SQL parser, JDBC server
       ➢ Community-authored rules, adapters

  4. Apache Pig
     Apache top-level project
     Platform for Analyzing Large Datasets
       ➢ Uses Pig Latin language
         ○ Relational operators (join, filter)
         ○ Functional operators (mapreduce)
       ➢ Runs as MapReduce (also Tez)
       ➢ ETL
       ➢ Extensible
         ○ LOAD/STORE
         ○ UDFs

  5. Outline
     Batch compute on Force.com Platform (Eli Levine)
     Apache Calcite deep dive (Julian Hyde)
     Building Pig adapter for Calcite (Eli Levine)
     Q&A

  6. Salesforce Platform
     Object-relational data model in the cloud
     Contains standard objects that users can customize or add their own
     SQL-like query language SOQL
       - Real-time
       - Batch compute
     Federated data store: Oracle, HBase, external
     User queries span data sources (federated joins)
     SELECT DEPT.NAME FROM EMPLOYEE WHERE FIRST_NAME = 'Eli'

  7. Salesforce Platform - Batch Compute
     Called Async SOQL
       - REST API
       - Users supply SOQL and info about where to deposit results
     SOQL -> Pig Latin script
     Pig loaders move data/computation to HDFS for federated query execution
     Own SOQL parsing, no Calcite

  8. Query Planning in Async SOQL
     (Diagram labels: Current, Next generation)

  9. Apache Calcite for Next-Gen Optimizer
       - Strong relational algebra foundation
       - Support for different physical engines
       - Pluggable cost model
       - Optimization rules
       - Federation-aware

  10. Architecture
     (Diagram labels: Conventional database, Calcite)

  11. Calcite design
     Design goals:
       ● Not-just-SQL front end
       ● Federation
       ● Extensibility
       ● Caching / hybrid storage
       ● Materialized views
     Design points:
       ● Relational algebra
       ● Composable transformation rules
     (Diagram labels: tables on disk, in-memory materializations)
     select x, sum(y) from t group by x

  12. Planning queries
     select p.productName, count(*) as c
     from splunk.splunk as s
       join mysql.products as p
       on s.productId = p.productId
     where s.action = 'purchase'
     group by p.productName
     order by c desc
     (Plan diagram labels: MySQL, Splunk)
       join    Key: productId
       group   Key: productName, Agg: count
       filter  Condition: action = 'purchase'
       sort    Key: c desc
       scan    Table: products
       scan    Table: splunk

  13. Optimized query
     select p.productName, count(*) as c
     from splunk.splunk as s
       join mysql.products as p
       on s.productId = p.productId
     where s.action = 'purchase'
     group by p.productName
     order by c desc
     (Plan diagram labels: MySQL, Splunk)
       join    Key: productId
       group   Key: productName, Agg: count
       filter  Condition: action = 'purchase'
       sort    Key: c desc
       scan    Table: splunk
       scan    Table: products

  14. Calcite framework
     Cost, statistics
       RelOptCost
       RelOptCostFactory
       RelMetadataProvider
         • RelMdColumnUniqueness
         • RelMdDistinctRowCount
         • RelMdSelectivity
     SQL parser
       SqlNode
       SqlParser
       SqlValidator
     Transformation rules
       RelOptRule
         • FilterMergeRule
         • AggregateUnionTransposeRule
         • 100+ more
     Global transformations
       Unification (materialized view)
       Column trimming
       De-correlation
     Relational algebra
       RelNode (operator)
         • TableScan
         • Filter
         • Project
         • Union
         • Aggregate
         • …
       RelDataType (type)
       RexNode (expression)
       RelTrait (physical property)
         • RelConvention (calling-convention)
         • RelCollation (sortedness)
         • RelDistribution (partitioning)
       RelBuilder
     JDBC driver
     Metadata
       Schema
       Table
       Function
         • TableFunction
         • TableMacro
       Lattice

  15. Adapter
     Implement SchemaFactory interface
     Connect to a data source using parameters
     Extract schema - return a list of tables
     Push down processing to the data source:
       ● A set of planner rules
       ● Calling convention (optional)
       ● Query model & query generator (optional)

     "schemas": [
       {
         "name": "HR",
         "type": "custom",
         "factory": "org.apache.calcite.adapter.file.FileSchemaFactory",
         "operand": {
           "directory": "hr-csv"
         }
       }
     ]

     $ ls -l hr-csv
     -rw-r--r-- 1 jhyde staff 62 Mar 29 12:57 DEPTS.csv
     -rw-r--r-- 1 jhyde staff 262 Mar 29 12:57 EMPS.csv.gz
     $ ./sqlline -u jdbc:calcite:model=hr.json -n scott -p tiger
     sqlline> select count(*) as c from emp;
     'C'
     '5'
     1 row selected (0.135 seconds)

  16. Calcite Pig Adapter
     Pig Latin:
       EMPLOYEE = LOAD 'EMPLOYEE' ... ;
       EMPLOYEE = GROUP EMPLOYEE BY (DEPT_ID);
       EMPLOYEE = FOREACH EMPLOYEE GENERATE COUNT(EMPLOYEE.DEPT_ID) as DEPT_ID__COUNT_, group as DEPT_ID;
       EMPLOYEE = FILTER EMPLOYEE BY (DEPT_ID__COUNT_ > 10);
     SQL:
       SELECT DEPT_ID FROM EMPLOYEE GROUP BY DEPT_ID HAVING COUNT(DEPT_ID) > 10

  17. Building the Pig Adapter
     1. Implement Pig-specific RelNodes, e.g. PigFilter
     2. RelNode factories
     3. Write RelOptRules for converting abstract RelNodes to Pig RelNodes
     4. Schema implementation
     5. Unit tests run local Pig

  18. Lessons Learned
     Calcite is very flexible (both good and bad)
       ● “Recipe list” would be useful
       ● Lots of examples if you delve into existing adapters - e.g. Druid and Cassandra
     Lots available out of the box
     Dynamic code generation using Janino -- cryptic errors
     RelBuilder was really useful (if you are building non-SQL engine)

  19. Florida Calcite

  20. Thank you!
     Eli Levine @teleturn
     Julian Hyde @julianhyde
     http://calcite.apache.org
     http://pig.apache.org
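
Slides 12 and 13 contrast an initial plan with an optimized plan for the same Splunk/MySQL query, but the transcript does not preserve the layout of the plan diagrams. The Python sketch below assumes the optimization is the familiar one of moving the filter on `action` below the join, so that it runs against the Splunk scan before any rows are joined. The classes `Scan`, `Join`, `Filter` and the function `push_filter_below_join` are hypothetical names for illustration only; they are not Calcite's API, where rules are Java subclasses of `RelOptRule`.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    table: str  # e.g. "splunk" or "products"


@dataclass(frozen=True)
class Join:
    left: "Node"
    right: "Node"
    key: str


@dataclass(frozen=True)
class Filter:
    input: "Node"
    table: str      # table whose column the condition refers to (assumed known)
    condition: str


Node = Union[Scan, Join, Filter]


def tables(node: Node) -> set:
    """Return the names of all tables scanned beneath a node."""
    if isinstance(node, Scan):
        return {node.table}
    if isinstance(node, Join):
        return tables(node.left) | tables(node.right)
    return tables(node.input)


def push_filter_below_join(node: Node) -> Node:
    """Toy rewrite rule: Filter(Join(l, r)) becomes a Join with the filter on one side."""
    if isinstance(node, Filter) and isinstance(node.input, Join):
        join = node.input
        if node.table in tables(join.left):
            pushed = Filter(join.left, node.table, node.condition)
            return Join(pushed, join.right, join.key)
        if node.table in tables(join.right):
            pushed = Filter(join.right, node.table, node.condition)
            return Join(join.left, pushed, join.key)
    return node  # the rule does not match; leave the plan unchanged


# Initial plan (assumed shape): the filter sits above the join.
before = Filter(
    Join(Scan("splunk"), Scan("products"), "productId"),
    "splunk",
    "action = 'purchase'",
)
after = push_filter_below_join(before)
print(after)
```

A real planner would apply many such rules and use the cost model from slide 14 to choose among the resulting plans; this sketch applies a single rule once and does no costing.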

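Slide 16 pairs a SQL query with the Pig Latin script the adapter produces for it. As a rough illustration of how a tree of relational operators could be walked to emit such a script, here is a minimal Python sketch. The node classes `Scan`, `CountGroupBy`, `Filter` and the function `to_pig` are hypothetical and are not the adapter's actual classes (the real adapter is written in Java against Calcite's `RelNode` API). The loader details of the `LOAD` statement are elided on the slide, so the `...` is carried through unchanged rather than guessed.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Scan:
    table: str


@dataclass
class CountGroupBy:
    input: "Node"
    key: str          # column to group by and to count


@dataclass
class Filter:
    input: "Node"
    condition: str    # already expressed in Pig Latin syntax


Node = Union[Scan, CountGroupBy, Filter]


def to_pig(node: Node) -> Tuple[str, List[str]]:
    """Return (alias, statements) for the plan rooted at node.

    The alias is the base table name and is reassigned at every step,
    following the style of the script on slide 16.
    """
    if isinstance(node, Scan):
        # The slide elides the loader and schema, so "..." is kept verbatim.
        return node.table, [f"{node.table} = LOAD '{node.table}' ... ;"]

    alias, statements = to_pig(node.input)
    if isinstance(node, CountGroupBy):
        count_col = f"{node.key}__COUNT_"
        statements.append(f"{alias} = GROUP {alias} BY ({node.key});")
        statements.append(
            f"{alias} = FOREACH {alias} GENERATE "
            f"COUNT({alias}.{node.key}) as {count_col}, group as {node.key};"
        )
    elif isinstance(node, Filter):
        statements.append(f"{alias} = FILTER {alias} BY ({node.condition});")
    return alias, statements


# Plan for: SELECT DEPT_ID FROM EMPLOYEE GROUP BY DEPT_ID HAVING COUNT(DEPT_ID) > 10
plan = Filter(CountGroupBy(Scan("EMPLOYEE"), "DEPT_ID"), "DEPT_ID__COUNT_ > 10")
alias, script = to_pig(plan)
print("\n".join(script))
```

The HAVING clause becomes an ordinary `FILTER` over the aggregated relation, which is why the filter condition refers to the generated column `DEPT_ID__COUNT_` rather than to `COUNT(DEPT_ID)`.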