Presto @ Zalando - Big Data Tech Warsaw 2020

A cloud journey for Europe’s leading online fashion retailer: the evolution of the Zalando Data Lake and the role of Starburst Presto in the Data Lake landscape.

Published in: Technology


  1. Presto @ Zalando: A cloud journey for Europe’s leading online fashion retailer ● Max Schultze - max.schultze@zalando.de (@mcs1408) ● Wojciech Biela - wojciech.biela@starburstdata.com (@wbiela) ● Piotr Findeisen - piotr.findeisen@starburstdata.com (@findepi) ● 27-02-2020
  2. Who are we? Max Schultze ● Lead Data Engineer ● MSc in Computer Science ● Took part in early development of Apache Flink ● Retired semi-professional Magic: The Gathering player. Wojciech Biela ● Senior Engineering Director ● Starburst Co-founder ● MSc in Computer Science ● Prev: Engineering lead at Hadapt (interactive SQL-on-Hadoop pioneer) ● Prev: Head of engineering @ Empik.com
  3. Who are we? Max Schultze ● Lead Data Engineer ● MSc in Computer Science ● Took part in early development of Apache Flink ● Retired semi-professional Magic: The Gathering player. Piotr Findeisen ● Presto Committer & maintainer ● Starburst Co-founder ● MSc in Computer Science ● Prev: Presto Engineer at Teradata
  4. Table of Contents ● Zalando Analytics Cloud Journey ● The Evolution of Presto ● Advanced Analytical Infrastructure
  5. Zalando Analytics Cloud Journey
  6. Legacy Analytics DWH
  7. Messaging Bus Data Lake Legacy Evolving
  8. Zalando’s Data Lake Ingestion Storage Serving
  9. Zalando’s Data Lake Web Tracking Event Bus DWH Data Center Ingestion Storage Serving
  10. Zalando’s Data Lake Web Tracking Event Bus DWH Data Center Ingestion Storage Serving Metastore
  11. Zalando’s Data Lake Data Catalog Web Tracking Event Bus DWH Data Center Ingestion Storage Serving Metastore Fast Query Layer Processing Platform
  12. The Evolution of Presto
  13. What is Presto? ● Community-driven open source project ● High performance ANSI SQL engine ● Separation of compute and storage ● No vendor lock-in
  14. What is Presto? ● Community-driven open source project ● Separation of compute and storage ● No vendor lock-in ○ No Hadoop distro vendor lock-in ○ No storage engine vendor lock-in ○ No cloud vendor lock-in ● High performance ANSI SQL engine ○ Proven scalability ○ High concurrency
  15. What is Presto? ● Community-driven open source project ● No vendor lock-in ○ No Hadoop distro vendor lock-in ○ No storage engine vendor lock-in ○ No cloud vendor lock-in ● High performance ANSI SQL engine ● Separation of compute and storage
  16. What is Presto? ● Community-driven open source project ● High performance ANSI SQL engine ○ Proven scalability ○ High concurrency ● No vendor lock-in ○ No Hadoop distro vendor lock-in ○ No storage engine vendor lock-in ○ No cloud vendor lock-in ● Separation of compute and storage
  17. Many Well Known Presto Users
  18. Presto Architecture ● Coordinator: Parser, Optimizer, Scheduler ● Workers: Processor ● Data sources: Azure SQL Database, ADLS, Blob Storage, S3
  19. Presto Extensibility with Connectors ● Presto Coordinator: Metadata SPI, Data Statistics SPI ● Presto Worker: Data Stream SPI ● Data Location SPI ● Each SPI is shown with the connectors Hive, Cassandra, Kafka, MySQL, Custom
  20. Query Execution Performance • In-memory processing, pipelined execution across nodes (MPP-style) • Vectorized columnar processing • Multithreaded execution keeps all CPU cores busy • Presto is written in highly tuned Java ○ Efficient data structures (minimize GC) ○ Very careful coding of inner loops ○ Runtime bytecode generation • Optimized ORC & Parquet readers
  21. Apache Hive Connector • Access data stored in scalable and cost-effective storage ○ HDFS ○ AWS S3 ○ Google GCS ○ Azure Blob & ADLS (Gen 1 and 2) ○ S3-compatible (e.g. MinIO) • Schema information stored in Hive Metastore or AWS Glue Data Catalog • Uses “Hive-style” table format • Partitions and bucketing are recognized and used • Does not use the Hive runtime for execution
  22. Relational Database Connectors (JDBC based) • Presto workers use the relational database's JDBC driver to connect to the data source • Filtering is pushed down into the database for a performance benefit • MySQL • PostgreSQL • Redshift • SQL Server • Google BigQuery • Oracle • DB2 • Teradata • Snowflake
  23. Non-Relational Data Sources • Apache Accumulo • Apache Cassandra • Apache Phoenix • Elasticsearch • Apache Kafka • Apache Kudu • MongoDB • Redis
  24. SQL Support • Presto's development is guided by the SQL standard • Most major SQL features are covered • All TPC-H & TPC-DS queries run
  25. Security ● User authentication (CLI/ODBC/JDBC) ○ Basic ○ Kerberos / LDAP ● Pluggable user authorization schemes (access control) ● User impersonation (Hive, JDBC connectors) ● Support for kerberized HDFS/Hive metastore ● SSL on the wire ○ client to Presto ○ between Presto nodes ● Sentry and Ranger support ○ column and row level security
  26. JDBC & ODBC Connectivity • Presto provides an open source JDBC driver: https://prestosql.io/download.html • Commercial JDBC and ODBC drivers are available from Starburst • Do not confuse these drivers with the drivers Presto internally uses to connect to JDBC data sources (e.g. MySQL, SQL Server, etc.)
  27. End-User Tools • Starburst provides enterprise-grade ODBC and JDBC drivers, allowing you to use your favorite tools with Starburst ○ Power BI ○ MicroStrategy ○ Tableau ○ Qlik ○ Looker ○ Periscope ○ DBeaver ○ And more…
  28. The Presto Fan Club * Multiple clusters (10,000+ nodes) * 300 PB in HDFS, MySQL, and Raptor * 1000s of users, 100s of concurrent queries
  29. The Presto Fan Club * 300+ AWS nodes * 100+ PB in S3 (Parquet) * 650+ users with 6K+ queries daily
  30. The Presto Fan Club * 150+ PB in HDFS (Parquet/ORC) * 2,000+ nodes (clusters on premises) * 160K+ queries/day over HDFS
  31. The Presto Fan Club * 2,000+ nodes (several clusters on premises and GCP) * 20K+ queries daily (Parquet)
  32. The Presto Fan Club * 100 Presto VMs (on premises) * 1K+ HDFS nodes * ORC data * Starburst support
  33. The Presto Fan Club * interactive * 400+ nodes in AWS * 100K+ queries/day * 20+ PB in S3 (Parquet)
  34. The Presto Fan Club * 200+ nodes (on premises) * HDFS, ObjectStore, and Cassandra * Starburst support
  35. The Presto Fan Club * 120+ nodes in AWS * 4 PB in S3 * 200+ users * Starburst support
  36. Starburst Overview • Founded 2017 ○ Founding team includes many of the largest committers to the open source Presto project, working on Presto since 2015 ○ Former Teradata, Vertica, Hadapt, Netezza, and Ab Initio • Enterprise Presto Offering ○ Azure, AWS, GCP, On Premises, Kubernetes • Headquartered in Boston • Customers globally
  37. Key Presto contributions from Starburst • Mission Control: For easy installation & management of Presto • Security Integrations: Kerberos, LDAP, Ranger and in-transit encryption • ANSI SQL: Enhancements to fully support SQL • ODBC and JDBC drivers: To enable BI tools such as Power BI, Tableau, Qlik, etc. • Presto Connectors: Teradata, Oracle, Hive Cloud Storage, Snowflake • Autoscaling: Presto Autoscaling in the cloud (AWS CFT, K8s, …) • Query Performance: Cost-Based Query Optimizer providing performance boost; improved performance in query execution engine
  38. Key upcoming developments from Starburst • Consumption Tracking: Understand your consumption and spend on the cloud • Delta Lake Integration: Read data from Delta Lake • Presto Insights: Tuning suggestions for Presto cluster and queries • Okta Support: Integrate with the Okta IdP • Distributed Caching: Speed up queries on hot datasets • IAM Passthrough: Leverage IAM roles • Integrated Apache Ranger: Automatically deploy Ranger in Presto for the security stack • Kubernetes support: Advanced K8s ecosystem support
  39. Try Starburst Enterprise-Grade Presto in the Cloud and On-Premises ● Azure, AWS, GCP, On Premises, & Kubernetes ● www.starburstdata.com/presto-enterprise
  40. Advanced Analytical Infrastructure
  41. Analytical Infrastructure
  42. Analytical Infrastructure
  43. Analytical Infrastructure
  44. $$ Analytical Infrastructure
  45. Advanced Analytical Infrastructure
  46. Advanced Analytical Infrastructure $$
  47. Advanced Analytical Infrastructure
  48. Advanced Analytical Infrastructure ● Presto Gateway
  49. Infrastructure Support ● Expedite Learning
  50. Infrastructure Support ● Expedite Learning ● Fine Tuning Infrastructure
  51. Infrastructure Support ● Expedite Learning ● Fine Tuning Infrastructure ● New Features
  52. Next Up
  53. Next Steps
  54. Next Steps
  55. Next Steps
  56. Presto @ Zalando: A cloud journey for Europe’s leading online fashion retailer ● Max Schultze - max.schultze@zalando.de (@mcs1408) ● Wojciech Biela - wojciech.biela@starburstdata.com (@wbiela)
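
Slide 19 names four connector SPIs (Metadata, Data Statistics, Data Location, Data Stream). The following is a minimal Python sketch of how such a split of responsibilities could look. It is not Presto's actual Java SPI: the names `Connector`, `Split`, `InMemoryConnector` and `scan` are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Protocol, Tuple

Row = Tuple[object, ...]


@dataclass(frozen=True)
class Split:
    """A slice of a table that can be read independently (hypothetical shape)."""
    table: str
    start: int
    end: int


class Connector(Protocol):
    """Hypothetical stand-in for the four SPIs named on slide 19."""

    def columns(self, table: str) -> List[str]: ...               # Metadata
    def row_count(self, table: str) -> int: ...                   # Data Statistics
    def splits(self, table: str, size: int) -> List[Split]: ...   # Data Location
    def read(self, split: Split) -> Iterator[Row]: ...            # Data Stream


class InMemoryConnector:
    """Toy "Custom" connector backed by Python lists."""

    def __init__(self, tables: Dict[str, Tuple[List[str], List[Row]]]) -> None:
        self._tables = tables

    def columns(self, table: str) -> List[str]:
        return self._tables[table][0]

    def row_count(self, table: str) -> int:
        return len(self._tables[table][1])

    def splits(self, table: str, size: int) -> List[Split]:
        n = self.row_count(table)
        return [Split(table, i, min(i + size, n)) for i in range(0, n, size)]

    def read(self, split: Split) -> Iterator[Row]:
        return iter(self._tables[split.table][1][split.start:split.end])


def scan(connector: Connector, table: str, split_size: int = 2) -> List[Row]:
    # Coordinator role: ask the connector how the table is cut into splits.
    planned = connector.splits(table, split_size)
    # Worker role: read each split on its own (sequentially in this sketch).
    rows: List[Row] = []
    for split in planned:
        rows.extend(connector.read(split))
    return rows


demo = InMemoryConnector({"orders": (["id", "amount"], [(1, 10), (2, 20), (3, 30)])})
print(demo.columns("orders"), scan(demo, "orders"))
```

The point of the sketch is only the division of labour: planning needs metadata, statistics and split locations, while reading needs nothing but a split.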
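
Slide 21 says that the Hive connector recognizes and uses partitions in the "Hive-style" table format. In that layout, partition values are encoded as `key=value` directory names, so a filter on a partition column can skip whole directories. The sketch below illustrates the idea in plain Python. The bucket name, paths and the helpers `parse_partition` and `prune` are made up for the example and are not part of Presto.

```python
from typing import Dict, List


def parse_partition(path: str) -> Dict[str, str]:
    """Extract key=value directory segments from a Hive-style path."""
    parts: Dict[str, str] = {}
    # The last segment is the file name, so only directories are inspected.
    for segment in path.split("/")[:-1]:
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts


def prune(paths: List[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only files whose partition values match every wanted key."""
    return [
        p for p in paths
        if all(parse_partition(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for a table partitioned by dt and country.
files = [
    "s3://example-bucket/sales/dt=2020-02-26/country=de/part-0.parquet",
    "s3://example-bucket/sales/dt=2020-02-27/country=de/part-0.parquet",
    "s3://example-bucket/sales/dt=2020-02-27/country=pl/part-0.parquet",
]

print(prune(files, {"dt": "2020-02-27", "country": "pl"}))
```

A query filtering on `dt` and `country` would, under these assumptions, need to open only one of the three files.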
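
Slide 22 states that filtering is pushed down into the database for a performance benefit. The sketch below shows the difference between filtering in the engine and filtering at the source. It uses Python's built-in `sqlite3` as a stand-in for a remote relational database, and the `orders` table and both helper functions are hypothetical. It does not reproduce how Presto's JDBC connectors generate SQL.

```python
import sqlite3
from typing import List, Tuple

Row = Tuple[int, str, float]

# In-memory SQLite plays the role of the remote database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "DE", 10.0), (2, "PL", 20.0), (3, "DE", 30.0)],
)


def scan_without_pushdown(conn: sqlite3.Connection, country: str) -> Tuple[int, List[Row]]:
    # Every row crosses the wire, then the engine filters locally.
    rows = conn.execute(
        "SELECT id, country, amount FROM orders ORDER BY id"
    ).fetchall()
    return len(rows), [r for r in rows if r[1] == country]


def scan_with_pushdown(conn: sqlite3.Connection, country: str) -> Tuple[int, List[Row]]:
    # The filter is part of the query, so the database returns fewer rows.
    rows = conn.execute(
        "SELECT id, country, amount FROM orders WHERE country = ? ORDER BY id",
        (country,),
    ).fetchall()
    return len(rows), rows


# Each call returns (rows transferred, matching rows).
print(scan_without_pushdown(db, "DE"))
print(scan_with_pushdown(db, "DE"))
```

Both functions return the same matching rows. Only the number of rows fetched from the source differs, which is where the benefit comes from on large tables.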
