Querying Druid in SQL with Superset

Druid is a high-performance, column-oriented distributed data store that is widely used at Oath for big data analysis. Druid uses a JSON schema as its query language, which makes it difficult for new users unfamiliar with the schema to start querying Druid quickly. The JSON schema is designed to work with Druid's data ingestion methods, so it can provide high-performance features such as data aggregations in JSON, but many users are unable to take advantage of these features because they are not familiar with the specifics of optimizing Druid queries. However, most new Druid users at Yahoo are already very familiar with SQL, and the queries they want to write for Druid can be converted to concise SQL.
We found that our data analysts wanted an easy way to issue ad-hoc Druid queries and view the results in a BI tool in a way that is presentable to nontechnical stakeholders. To achieve this, we had to bridge the gap between Druid, SQL, and our BI tools such as Apache Superset. In this talk, we will explore different ways to query a Druid datasource in SQL and discuss which methods were most appropriate for our use cases. We will also discuss our open source contributions so others can utilize our work.

Speakers: GURUGANESH KOTTA, Software Dev Eng, Oath and JUNXIAN WU, Software Engineer, Oath Inc.

Published in: Technology

Querying Druid in SQL with Superset

  1. Druid SQL Interface Calcite
  2. Problem
     - Druid is used extensively on our team and at Oath
     - Druid is hard to interact with due to its JSON input format
     - Many at Oath are not familiar with how to optimize Druid queries
  3. Why use Druid?
     - Able to ingest and serve data in real-time with low latency
     - Good for ad-hoc queries
     - Good for storing aggregate data
     - Scalable to ingest millions of events/sec
  4. Using SQL to bridge the gap
     ● SQL is the lingua franca of data
     ● Most at Oath are already familiar with SQL
     ● SQL is easier to write and more concise than JSON
     ● All BI tools we use support SQL
  5. SQL vs Druid JSON
     Here is a sample SQL query for a given dataset:
       SELECT SUM("store_sales") filter (where "store_state" = 'CA'),
              SUM("store_cost") filter (where "store_state" = 'OR')
       FROM "foodmart"
       WHERE "the_month" == 'October'
       LIMIT 10
     The same query in Druid JSON format is much less readable.
  6. SQL vs Druid JSON
     {
       "queryType": "groupBy",
       "dataSource": "foodmart",
       "granularity": "all",
       "dimensions": [],
       "limitSpec": { "type": "default", "limit": 10, "columns": [] },
       "filter": {
         "type": "and",
         "fields": [
           {
             "type": "or",
             "fields": [
               { "type": "selector", "dimension": "store_state", "value": "CA" },
               { "type": "selector", "dimension": "store_state", "value": "OR" }
             ]
           },
           {
             "type": "not",
             "field": { "type": "selector", "dimension": "the_month", "value": "October" }
           }
         ]
       },
       "aggregations": [
         {
           "type": "filtered",
           "filter": { "type": "selector", "dimension": "store_state", "value": "CA" },
           "aggregator": { "type": "doubleSum", "name": "EXPR$0", "fieldName": "store_sales" }
         },
         {
           "type": "filtered",
           "filter": { "type": "selector", "dimension": "store_state", "value": "OR" },
           "aggregator": { "type": "doubleSum", "name": "EXPR$1", "fieldName": "store_cost" }
         }
       ],
       "intervals": ["1900-01-09T00:00:00.000/2992-01-10T00:00:00.000"]
     }
  7. Pre-existing Solutions
     - Druid SQL services
     - Hive Druid connection
     - Apache Calcite
  8. Druid SQL Services
     - Druid has SQL support via Apache Calcite
     - Pros:
       - Significantly simplifies query JSON
       - Already supported in Druid
     - Cons:
       - Support is experimental
       - Doesn’t support DataSketch aggregators
     curl -XPOST -H 'Content-Type: application/json' http://BROKER:8082/druid/v2/sql/ -d @query.json
     {
       "query" : "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' AND __time > TIMESTAMP '2000-01-01 00:00:00'",
       "context" : {"sqlTimeZone" : "America/Los_Angeles"}
     }
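
The curl call on this slide can also be issued from a script. The following is a minimal sketch using only the Python standard library; the helper name `build_sql_request` is hypothetical, and `BROKER` is the same placeholder host used on the slide. The request is only constructed here, not sent.

```python
import json
import urllib.request


def build_sql_request(broker_url, sql, time_zone="America/Los_Angeles"):
    """Build (but do not send) a POST request for the Druid SQL endpoint.

    Hypothetical helper: it mirrors the curl command and query.json body
    shown on the slide.
    """
    payload = {"query": sql, "context": {"sqlTimeZone": time_zone}}
    return urllib.request.Request(
        url=broker_url.rstrip("/") + "/druid/v2/sql/",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sql_request(
    "http://BROKER:8082",
    "SELECT COUNT(*) FROM data_source WHERE foo = 'bar' "
    "AND __time > TIMESTAMP '2000-01-01 00:00:00'",
)
# To actually run the query against a reachable broker:
#     with urllib.request.urlopen(request) as response:
#         rows = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before it reaches a broker.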
  9. Hive Druid Connection
     - Hive also has some level of Druid support via Apache Calcite
     - Pros:
       - Many BI tools already support Hive
     - Cons:
       - Lacks support for sketches
  10. Apache Calcite
      - Translator between SQL and Druid JSON
      - Industry-standard SQL parser
      - Represent your query in relational algebra, transform using planning rules, and optimize according to a cost model
      - Open source
  11. Our Solution
      - Use Apache Calcite directly
      - Address the deficiencies of Calcite and contribute back to the open source community
  12. Calcite relational algebra
      - Relational logic tree translated from SQL query
      - Each node has its cost based on context
      - SELECT SUM(a) as c FROM table1 WHERE b=1 ORDER BY c
      Diagram (plan tree, leaf to root): TableScan on table1 -> Filter (b=1) -> Project (table1.a -> a) -> Aggregate (sum(a)) -> Sort on c
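
To make the tree on this slide concrete, here is a small illustrative model of it. This is not Calcite's API: the `RelNode` class and `explain` function below are hypothetical stand-ins that only reproduce the operator chain for `SELECT SUM(a) as c FROM table1 WHERE b=1 ORDER BY c`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RelNode:
    """Hypothetical single-input relational operator (not a Calcite class)."""
    op: str
    detail: str
    child: Optional["RelNode"] = None


def explain(node: RelNode, depth: int = 0) -> List[str]:
    """Render the plan root-first, indenting each input one level deeper."""
    lines = ["  " * depth + f"{node.op}({node.detail})"]
    if node.child is not None:
        lines.extend(explain(node.child, depth + 1))
    return lines


# Operators listed on the slide, nested so that the scan is the leaf.
plan = RelNode("Sort", "c",
       RelNode("Aggregate", "c = SUM(a)",
       RelNode("Project", "table1.a -> a",
       RelNode("Filter", "b = 1",
       RelNode("TableScan", "table1")))))

print("\n".join(explain(plan)))
```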
  13. Query Planning
      - Apply rules on the relational logic tree
      - Transform certain logic subtrees into a Druid Query node
      Diagram: TableScan -> Filter -> Project -> Aggregate -> Sort becomes either (Druid GroupBy Query node -> Sort) or (Druid TopN Query node)
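
The idea of a planning rule can be sketched as a rewrite over the operator chain. The code below is a deliberately simplified illustration, not Calcite's rule engine: the `PUSHABLE` set and `push_down` function are hypothetical, and the plan is modeled as a flat leaf-to-root list of operator names.

```python
# Operators this toy rule is willing to fold into a single Druid query node.
# The set is an assumption made for illustration.
PUSHABLE = {"TableScan", "Filter", "Project", "Aggregate"}


def push_down(operators):
    """Collapse the leading run of pushable operators into one Druid node.

    `operators` is a leaf-to-root list of operator names. Anything after
    the first non-pushable operator stays outside the Druid node.
    """
    if not operators or operators[0] != "TableScan":
        return list(operators)
    n = 0
    while n < len(operators) and operators[n] in PUSHABLE:
        n += 1
    druid_node = "DruidQuery[" + " > ".join(operators[:n]) + "]"
    return [druid_node] + list(operators[n:])


rewritten = push_down(["TableScan", "Filter", "Project", "Aggregate", "Sort"])
print(rewritten)
```

In this sketch the Sort stays outside the Druid node, which corresponds to the "GroupBy query node plus Sort" alternative on the slide; a further rule could fold the Sort in as well to produce the TopN alternative.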
  14. Optimization
      - Use a cost model to estimate the performance of the different transformed logic trees
      - The basic idea is to push more computation into Druid
      Diagram: Druid GroupBy Query node, Sort, Druid TopN Query node; Cost = 10, Cost = 10, Cost = 15
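
Cost-based selection between the two candidate plans can be sketched as follows. The mapping of costs to nodes is an assumption: the slide's three cost labels are read here as GroupBy = 10, Sort = 10, and TopN = 15, which is one plausible reading of the flattened diagram. The function names are hypothetical.

```python
# Candidate plans as lists of (node, cost) pairs.
# ASSUMPTION: cost figures are assigned to nodes in the order they appear.
candidates = {
    "groupBy + sort": [("Druid GroupBy Query node", 10), ("Sort", 10)],
    "topN": [("Druid TopN Query node", 15)],
}


def total_cost(plan):
    """Sum the per-node costs of one candidate plan."""
    return sum(cost for _node, cost in plan)


def cheapest(plans):
    """Return the name of the candidate with the lowest total cost."""
    return min(plans, key=lambda name: total_cost(plans[name]))


print(cheapest(candidates))
```

Under this reading, the single TopN node wins because it does all the work inside Druid for less than the combined cost of a GroupBy followed by a separate Sort.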
  15. Renderer
      - Render the Druid JSON query to be sent out
      - If any computation cannot be pushed into the JSON query, run it locally in Calcite
      Diagram: Druid TopN Query node -> { "queryType":"topN", "dataSource":"foodmart", "granularity":"all", …
  16. Major Problems
      - Did not support Post-Aggregation
        - AVERAGE function
        - Could run out of memory
      - Did not support Filtered Aggregations
        - Could cause all rows to be queried from Druid and processed in memory
      - Did not support Distinct Count Aggregators using ThetaSketches
        - Calcite will always try to give the user exact results
        - Distinct count aggregations are not pushed to Druid
  17. Post Aggregation Support
      - New rule to merge the post-aggregation node
      - New renderer that can generate a Druid query with post-aggregation
      Diagram: plan trees built from TableScan, Project, Aggregate, and Druid GroupBy Query nodes; the stacked Aggregate nodes are merged into the Druid GroupBy Query node (labeled "New Rule")
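
As an illustration of what a rendered post-aggregation can look like, the sketch below expresses an average as a sum aggregator and a count aggregator combined by a Druid arithmetic post-aggregator. The helper name and the generated field names are hypothetical, and this is not necessarily the exact JSON the authors' renderer emits. Note that Druid's `count` aggregator counts Druid rows, so on rolled-up data the result can differ from a SQL `AVG` over raw events.

```python
import json


def avg_as_post_aggregation(column, output_name="avg"):
    """Return Druid query fragments that compute sum(column) / count.

    Hypothetical helper for illustration only.
    """
    sum_name = output_name + "_sum"
    count_name = output_name + "_count"
    return {
        "aggregations": [
            {"type": "doubleSum", "name": sum_name, "fieldName": column},
            {"type": "count", "name": count_name},
        ],
        "postAggregations": [
            {
                "type": "arithmetic",
                "name": output_name,
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": sum_name},
                    {"type": "fieldAccess", "fieldName": count_name},
                ],
            }
        ],
    }


fragment = avg_as_post_aggregation("store_sales")
print(json.dumps(fragment, indent=2))
```

Because the division happens inside Druid, only the final averaged value travels back to Calcite instead of the raw rows.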
  18. Filtered Aggregations Support
      - New rule to move the Filter operation from Calcite to Druid
      - Optimization on filters
        - New rule to extract a common filter into the outer filter
        - New rule to combine filters with logical ORs into the outer filter
      Diagram: plan trees built from TableScan, Filter1, Project, Aggregate, and Filter2 nodes, before and after Filter2 is merged into the Aggregate
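
The "combine filters with logical ORs" optimization can be sketched as below. If every aggregation is filtered, rows matching none of those filters contribute nothing, so the OR of the aggregation filters can be added to the query-level filter. This matches the shape of the outer `"or"` in the Druid JSON on the "SQL vs Druid JSON" slide. The function name is hypothetical and this is a simplified illustration, not the authors' Calcite rule.

```python
import json


def outer_filter(aggregations, where=None):
    """Derive a query-level filter from per-aggregation filters.

    Returns `where` unchanged if any aggregation is unfiltered, because
    such an aggregation needs every row.
    """
    if not aggregations or any(a.get("type") != "filtered" for a in aggregations):
        return where
    unique, seen = [], set()
    for agg in aggregations:
        key = json.dumps(agg["filter"], sort_keys=True)
        if key not in seen:  # a filter shared by all aggregations appears once
            seen.add(key)
            unique.append(agg["filter"])
    combined = unique[0] if len(unique) == 1 else {"type": "or", "fields": unique}
    if where is None:
        return combined
    return {"type": "and", "fields": [combined, where]}


def state_is(value):
    return {"type": "selector", "dimension": "store_state", "value": value}


aggs = [
    {"type": "filtered", "filter": state_is("CA"),
     "aggregator": {"type": "doubleSum", "name": "EXPR$0", "fieldName": "store_sales"}},
    {"type": "filtered", "filter": state_is("OR"),
     "aggregator": {"type": "doubleSum", "name": "EXPR$1", "fieldName": "store_cost"}},
]
print(json.dumps(outer_filter(aggs), indent=2))
```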
  19. Performance
      - Avoids unnecessary row scans in Druid
      - Greatly reduces query runtime when filters are involved
  20. Why ThetaSketch
      - Sketches are a class of streaming, stochastic algorithms
      - Trade off accuracy for speed: orders of magnitude faster
      - Exact up to configurable thresholds and approximate after
      - Mathematically provable error bounds
      - Bounded in space
      - Set operations: union, intersect, difference
      Sketches logo from http://datasketches.github.io
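
The properties listed on this slide can be seen in a K-minimum-values (KMV) sketch, a simpler relative of the theta sketch. The class below is a teaching sketch under that substitution, not the DataSketches implementation: it keeps only the `k` smallest hash values, is exact while fewer than `k` distinct values have been seen, and estimates afterwards. Intersection and difference are omitted.

```python
import hashlib
import heapq


class KMVSketch:
    """Toy K-minimum-values distinct counter (not the DataSketches library)."""

    def __init__(self, k=1024):
        self.k = k
        self._heap = []        # negated hashes, so the largest retained is on top
        self._members = set()  # retained hashes, for duplicate detection

    @staticmethod
    def _hash(value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")  # 64-bit integer hash

    def _offer(self, h):
        if h in self._members:
            return
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def update(self, value):
        self._offer(self._hash(value))

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))      # exact below the threshold
        theta = -self._heap[0] / 2 ** 64       # k-th smallest hash, scaled to (0, 1)
        return (self.k - 1) / theta

    def union(self, other):
        merged = KMVSketch(min(self.k, other.k))
        for h in self._members | other._members:
            merged._offer(h)
        return merged


sketch = KMVSketch(k=1024)
for i in range(20000):
    sketch.update("user-%d" % i)
print(round(sketch.estimate()))
```

Memory stays bounded by `k` regardless of how many values are fed in, and two sketches can be unioned without revisiting the raw data, which is what makes pre-aggregated sketch columns useful.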
  21. ThetaSketch Support
      - New rule to translate the distinct count aggregator node into a ThetaSketch node
      - Allows users to configure whether approximate cardinality is allowed
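
The translation described on this slide can be sketched as a function that emits a Druid `thetaSketch` aggregator (from Druid's DataSketches extension) only when the user has opted in to approximate results. The function name and the `allow_approximate` flag are hypothetical stand-ins for the authors' configuration option.

```python
def translate_count_distinct(column, output_name, allow_approximate):
    """Map COUNT(DISTINCT column) to a Druid thetaSketch aggregator.

    Returns None when approximation is not allowed, meaning the distinct
    count is not pushed down and has to be computed exactly elsewhere.
    """
    if not allow_approximate:
        return None
    return {"type": "thetaSketch", "name": output_name, "fieldName": column}


pushed = translate_count_distinct("user_id", "unique_users", allow_approximate=True)
kept_local = translate_count_distinct("user_id", "unique_users", allow_approximate=False)
print(pushed, kept_local)
```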
  22. Performance
      - Reduced the running time of queries with a count distinct aggregator when cardinality estimation is allowed
      - Sketch columns can now be utilized
      - With post-aggregation support, more operations can be applied
  23. User Interface
      - Superset is commonly used with Druid
      - Superset SQL Lab is popular for SQL-like databases
      From Superset documentation: https://superset.incubator.apache.org/
  24. Superset Calcite Connection
      - Superset is a Python application
      - A standard Python DBAPI interface was created
      - SQL Lab can now run ad-hoc queries on Druid
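
For readers unfamiliar with the Python DB-API (PEP 249), the following shows the general shape of such an interface. It is a hypothetical skeleton, not the authors' driver: `run_query` stands in for whatever transport reaches Calcite, and `fake_backend` returns canned data so the example runs without any server.

```python
# Module-level attributes required by PEP 249.
apilevel = "2.0"
threadsafety = 1
paramstyle = "qmark"


class Cursor:
    def __init__(self, run_query):
        self._run_query = run_query  # callable: sql -> (column_names, rows)
        self._rows = iter(())
        self.description = None
        self.rowcount = -1

    def execute(self, operation, parameters=None):
        if parameters:
            raise NotImplementedError("parameter binding is omitted in this sketch")
        columns, rows = self._run_query(operation)
        # PEP 249 describes each column with a 7-item sequence; only the name is known here.
        self.description = [(name, None, None, None, None, None, None) for name in columns]
        self.rowcount = len(rows)
        self._rows = iter(rows)
        return self

    def fetchone(self):
        return next(self._rows, None)

    def fetchall(self):
        return list(self._rows)

    def close(self):
        self._rows = iter(())


class Connection:
    def __init__(self, run_query):
        self._run_query = run_query

    def cursor(self):
        return Cursor(self._run_query)

    def commit(self):
        pass  # read-only queries: nothing to commit

    def close(self):
        pass


def connect(run_query):
    return Connection(run_query)


def fake_backend(sql):
    """Stand-in for the real SQL backend; always returns one canned row."""
    return ["cnt"], [(42,)]


cursor = connect(fake_backend).cursor()
cursor.execute('SELECT COUNT(*) AS cnt FROM "foodmart"')
rows = cursor.fetchall()
```

Tools built on this interface only need `connect`, `cursor`, `execute`, and the fetch methods, so a thin layer like this is enough to plug a new SQL backend into them.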
  25. Architecture diagram: User -> Superset (SQL Lab) -> Calcite JDBC -> Calcite (performs query parsing, planning, and internal computation; Druid adapter) -> Druid; output is returned to the user
  26. Questions
