Performance Analysis of Apache Spark and Presto in Cloud Environments

Today, users have multiple options for big data analytics, spanning open-source and proprietary systems as well as cloud computing service providers. To obtain the best value for their money in a SaaS cloud environment, users need to be aware of the performance of each service and its associated costs, while also taking into account usability aspects such as monitoring, interoperability, and administration capabilities.

We present an independent analysis of two mature and well-known data analytics systems, Apache Spark and Presto. Both run on the Amazon EMR platform; for Apache Spark, we also analyze the Databricks Unified Analytics Platform and its associated runtime and optimization capabilities. Our analysis is based on running the TPC-DS benchmark and thus focuses on SQL performance, which remains indispensable for data scientists and engineers. In our talk we will present quantitative results that we expect to be valuable for end users, accompanied by an in-depth look at the advantages and disadvantages of each alternative.

Thus, attendees will be better informed about the current big data analytics landscape and in a better position to avoid common pitfalls when deploying data analytics at scale.

Published in: Data & Analytics

Performance Analysis of Apache Spark and Presto in Cloud Environments

  2. Víctor Cuevas-Vicenttín, Barcelona Supercomputing Center. Performance Analysis of Apache Spark and Presto in Cloud Environments. #UnifiedDataAnalytics #SparkAISummit
  3. About BSC
      The Barcelona Supercomputing Center (BSC) is the Spanish national supercomputing facility, and a top EU research institution, established in 2005 by the Spanish government, the Catalan government and the UPC/BarcelonaTECH university. The mission of BSC is to be at the service of the international scientific community and of industry in need of HPC resources. BSC's research lines are developed within the framework of European Union research funding programmes, and the centre also does basic and applied research in collaboration with companies like IBM, Microsoft, Intel, Nvidia, Repsol, and Iberdrola.
  4. 13.7 Petaflops
  5. TPC-DS Benchmark Work
      The BSC collaborated with Databricks to run benchmark comparisons of large-scale analytics computations, using the TPC-DS Toolkit v2.10.1rc3. The Transaction Processing Performance Council (TPC) Benchmark DS (1) has the objective of evaluating decision support systems, which process large volumes of data in order to provide answers to real-world business questions. Our results are not official TPC Benchmark DS results. Databricks provided BSC an account and credits, which BSC then independently used for the benchmarking study with other analytics products on the market.
      (1) The TPC is a non-profit corporation focused on developing data-centric benchmark standards and disseminating objective, verifiable performance data to the industry.
  6. Context and motivation
      • Need to adopt data analytics in a cost-effective manner
          – SQL still very relevant
          – Open-source based analytics platforms
          – On-demand computing resources from the Cloud
      • Evaluate Cloud-based SQL engines
  7. Systems Under Test (SUTs)
      • Databricks Unified Analytics Platform
          – Based on Apache Spark but with optimized Databricks Runtime
          – Notebooks for interactive development and production Jobs
          – JDBC and custom API access
          – Delta storage layer supporting ACID transactions
  8. Systems Under Test (SUTs)
      • AWS EMR Presto
          – Distributed SQL engine created by Facebook
          – Connectors for non-relational and relational sources
          – JDBC and CLI access
          – Based on in-memory, pipelined parallel execution
      • AWS EMR Spark
          – Based on open-source Apache Spark
  9. Plan
      • TPC Benchmark DS
      • Hardware and software configuration
      • Benchmarking infrastructure
      • Benchmark results and their analysis
      • Usability and developer productivity
      • Conclusions
  10. TPC Benchmark DS
      • Created around 2006 to evaluate decision support systems
      • Based on a retailer with several channels of distribution
      • Process large volumes of data to answer real-world business questions
  11. TPC Benchmark DS
      • Snowflake schema: fact tables associated with multiple dimension tables
      • Data produced by data generator
      • 99 queries of various types
          – reporting
          – ad hoc
          – iterative
          – data mining
  13. TPC Benchmark DS
      • Load Test (1 TB): .dat files converted to ORC or Parquet
      • Power Test: Query 1, Query 2, ..., Query 99
      • Data Refresh
      • Throughput Test: n streams, from Query 1,1 ... Query 1,99 through Query n,1 ... Query n,99
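
The difference between the Power Test and the Throughput Test can be sketched as follows. This is a minimal illustration, not the benchmark harness used in the study: `run_query` is a hypothetical callable standing in for whatever submits a query to the system under test, and the per-stream rotation is a simplification of the query orderings the official benchmark prescribes.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def power_test(query_ids, run_query):
    """Run every query once, one after another; return elapsed seconds."""
    start = time.perf_counter()
    for qid in query_ids:
        run_query(qid)
    return time.perf_counter() - start


def throughput_test(query_ids, run_query, num_streams):
    """Run num_streams concurrent streams, each executing every query once."""

    def stream(stream_id):
        # Assumption: rotate the order per stream so that streams do not all
        # issue the same query at the same moment.
        offset = stream_id % len(query_ids)
        for qid in query_ids[offset:] + query_ids[:offset]:
            run_query(qid)

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=num_streams) as pool:
        # list() forces completion and re-raises any exception from a stream.
        list(pool.map(stream, range(num_streams)))
    return time.perf_counter() - start
```

With 99 queries and 4 streams, the Power Test issues 99 queries sequentially, while the Throughput Test issues 4 x 99 queries with up to 4 running at once.
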
  14. Hardware configuration
      • Instance type: i3.2xlarge
      • vCPUs: 8 (2.3 GHz Intel Xeon E5 2686 v4)
      • Memory: 61 GiB
      • Local storage: 1 x 1,900 GB NVMe SSD
      • Cluster: 1 master node, 8 worker nodes
  15. Software configuration
      • Databricks
          – Versions: Runtime 5.5, Spark 2.4.3, Scala 2.11
          – Configuration parameters: spark.sql.broadcastTimeout: 7200; spark.sql.crossJoin.enabled: true
      • EMR Presto
          – Versions: emr-5.26.0, Presto 0.220
          – Configuration parameters: hive.allow-drop-table: true; hive.compression-codec: SNAPPY; hive.s3-file-system-type: PRESTO; query.max-memory: 240 GB
      • EMR Spark
          – Versions: emr-5.26.0, Spark 2.4.3
          – Configuration parameters: spark.sql.broadcastTimeout: 7200; spark.driver.memory: 5692M
  16. [Diagram: benchmarking infrastructure with client application, cluster execution, and analysis stages; artifacts: SQL, .dat, Parquet, ORC, JAR, .log, .XLSX; AWS Glue Metastore]
  17. [Figure: Benchmark execution time (base)]
  18. Cost-Based Optimizer (CBO) stats
      • Collect table and column-level statistics to create optimized query evaluation plans
          – distinct count, min, max, null count
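
As a sketch of how such statistics could be requested in Spark SQL, the helper below builds `ANALYZE TABLE ... COMPUTE STATISTICS` statements. The slides do not show the commands actually used, so the helper name and the choice of table and columns (`store_sales` with two of its TPC-DS columns) are illustrative assumptions.

```python
def analyze_statements(table, columns):
    """Build Spark SQL statements that collect table- and column-level stats."""
    # Table-level statistics (row count, size in bytes).
    statements = [f"ANALYZE TABLE {table} COMPUTE STATISTICS"]
    if columns:
        # Column-level statistics (distinct count, min, max, null count).
        column_list = ", ".join(columns)
        statements.append(
            f"ANALYZE TABLE {table} COMPUTE STATISTICS FOR COLUMNS {column_list}"
        )
    return statements


# Hypothetical usage: print the statements instead of sending them to a cluster.
for statement in analyze_statements("store_sales", ["ss_sold_date_sk", "ss_item_sk"]):
    print(statement)
```
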
  19. [Figure: Benchmark execution time (stats). Annotation: CBO enabled: ↑ 27.11]
  20. [Figure: Speedup with table and column stats. Annotation: CBO enabled: ↓ 0.60]
  21. [Figure: TPC-DS Power Test, geometric mean]
  22. [Figure: TPC-DS Power Test, arithmetic mean]
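
The two Power Test charts summarize per-query times with different means. The sketch below shows why both are worth reporting; the sample times are made up for illustration and are not results from the study.

```python
import math


def arithmetic_mean(times):
    """Sum of the times divided by their count; dominated by slow queries."""
    return sum(times) / len(times)


def geometric_mean(times):
    """n-th root of the product of the times, computed via logarithms."""
    return math.exp(sum(math.log(t) for t in times) / len(times))


# Made-up per-query times in seconds: one slow query among fast ones.
sample_times = [1.0, 10.0, 100.0]
print(arithmetic_mean(sample_times))
print(geometric_mean(sample_times))
```

A single long-running query pulls the arithmetic mean up sharply, while the geometric mean weights each query's relative change equally, so the two can rank systems differently.
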
  23. Additional configuration for Presto
      • Query-specific configuration parameters
          – Queries 5, 75, 78, and 80: join_distribution_type: PARTITIONED
          – Queries 78 and 85: join_reordering_strategy: NONE
          – Query 67: task_concurrency: 32
          – Query 18: join_reordering_strategy: ELIMINATE_CROSS_JOINS
      • Session configuration for all queries
          – query_max_stage_count: 102
          – join_reordering_strategy: AUTOMATIC
          – join_distribution_type: AUTOMATIC
      • Query modifications (carried on to all systems)
          – Query 72: manual join re-ordering
          – Query 95: add distinct clause
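
One way to apply this per-query tuning is to emit Presto `SET SESSION` statements before each query. The sketch below encodes the values listed on the slide; the dictionary layout and function names are assumptions for illustration, not the tooling used in the study.

```python
# Session defaults applied to all queries (values from the slide).
SESSION_DEFAULTS = {
    "query_max_stage_count": 102,
    "join_reordering_strategy": "AUTOMATIC",
    "join_distribution_type": "AUTOMATIC",
}

# Per-query overrides (values from the slide), keyed by TPC-DS query number.
QUERY_OVERRIDES = {
    5: {"join_distribution_type": "PARTITIONED"},
    18: {"join_reordering_strategy": "ELIMINATE_CROSS_JOINS"},
    67: {"task_concurrency": 32},
    75: {"join_distribution_type": "PARTITIONED"},
    78: {"join_distribution_type": "PARTITIONED", "join_reordering_strategy": "NONE"},
    80: {"join_distribution_type": "PARTITIONED"},
    85: {"join_reordering_strategy": "NONE"},
}


def format_value(value):
    """Quote string values; leave numbers unquoted."""
    return f"'{value}'" if isinstance(value, str) else str(value)


def session_statements(query_id):
    """Return the SET SESSION statements to issue before the given query."""
    settings = {**SESSION_DEFAULTS, **QUERY_OVERRIDES.get(query_id, {})}
    return [f"SET SESSION {name} = {format_value(v)}" for name, v in settings.items()]
```

Query 78 appears in two rows of the slide, so it receives both the partitioned join distribution and the disabled join reordering.
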
  24. TPC-DS Power Test – Query 72
      • Manually modified join order
          catalog_sales ⋈ date_dim ⋈ date_dim ⋈ inventory ⋈ date_dim ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⟕ promotion ⟕ catalog_returns
      • Databricks optimized join order, no stats
          Same as modified join order + pushed down selections and projections
      • Original benchmark join order
          catalog_sales ⋈ inventory ⋈ warehouse ⋈ item ⋈ customer_demographics ⋈ household_demographics ⋈ date_dim ⋈ date_dim ⋈ date_dim ⟕ promotion ⟕ catalog_returns
  25. TPC-DS Power Test – Query 72
      • Databricks optimized join order with stats
          (((((((catalog_sales ⋈ household_demographics) ⋈ date_dim) ⋈ customer_demographics) ⋈ item) ⋈ (((date_dim ⋈ date_dim) ⋈ inventory) ⋈ warehouse)) ⟕ promotion) ⟕ catalog_returns) + pushed down selections and projections
      • EMR Spark optimized join order with stats and CBO enabled/disabled
          Same as modified join order + pushed down selections and projections, but different physical plans
  26. Dynamic data partitioning
      • Splits a table based on the value of a particular column
          – Split only 7 largest tables by date surrogate keys
          – One S3 bucket folder for each value
      • Databricks and EMR Spark: limit number of files per partition
      • EMR Presto: out of memory error for largest table
          – Use Hive with TEZ to load data
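
The idea of one folder per value of the partitioning column can be sketched in a few lines. The example assumes the Hive-style `column=value` directory naming; the bucket name and the sample rows are hypothetical, while `store_sales` and `ss_sold_date_sk` are a TPC-DS fact table and its date surrogate key.

```python
from collections import defaultdict


def partition_rows(rows, partition_column):
    """Group rows by the value of the partitioning column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[partition_column]].append(row)
    return dict(partitions)


def partition_path(base, table, column, value):
    """Build the folder for one partition value (Hive-style naming assumed)."""
    return f"{base}/{table}/{column}={value}/"


# Hypothetical rows keyed by date surrogate key.
rows = [
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 1},
    {"ss_sold_date_sk": 2451546, "ss_item_sk": 2},
    {"ss_sold_date_sk": 2451545, "ss_item_sk": 3},
]
for value, group in partition_rows(rows, "ss_sold_date_sk").items():
    path = partition_path("s3://example-bucket/tpcds", "store_sales", "ss_sold_date_sk", value)
    print(path, len(group))
```

A query that filters on the date key then only needs to read the matching folders, which is where the speedup from partitioning comes from.
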
  27. [Figure: Benchmark execution time (partitioning + stats). Annotations: Power Test: 2 failed queries; Throughput Test: 6 failed queries]
  28. [Figure: Speedup with partitioning and stats]
  29. [Figure: TPC Benchmark total execution time]
  30. TPC Benchmark DS metric
      • The modified primary performance metric is
          $QphDS@SF = \frac{SF \cdot Q}{\sqrt[3]{T_{LD} \cdot T_{PT} \cdot T_{TT}}}$
          – $SF$: scale factor
          – $Q$: number of weighted queries (num. streams x 99)
          – $T_{LD}$: load factor (0.1 x num. streams x load time)
          – $T_{PT}$, $T_{TT}$: Power Test and Throughput Test times
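
A minimal sketch of the metric computation follows. It assumes the denominator is the cube root of the product of the load factor, the Power Test time, and the Throughput Test time, and that all times are given in the same unit (hours here); the function and parameter names are illustrative, and the sample inputs are made up rather than taken from the study.

```python
def qphds(scale_factor, num_streams, load_time, power_time, throughput_time):
    """Modified QphDS@SF, assuming a cube root over the three time terms."""
    # Number of weighted queries: each stream runs all 99 queries.
    weighted_queries = num_streams * 99
    # Load factor: 0.1 x number of streams x load time.
    load_factor = 0.1 * num_streams * load_time
    denominator = (load_factor * power_time * throughput_time) ** (1.0 / 3.0)
    return scale_factor * weighted_queries / denominator


# Made-up inputs: scale factor 1000 (1 TB), 4 streams, times in hours.
print(qphds(1000, 4, load_time=2.5, power_time=1.0, throughput_time=1.0))
```

Because the three times sit under a root, halving any one of them improves the metric by the same factor, so load time counts as much as query time.
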
  31. [Figure: TPC Benchmark DS metric]
  32. System costs
      • Benchmark cost = num. nodes × node cost per hour × exec. time in hours
      • Node cost per hour = node hardware costs + node software costs
      • Costs per node per hour
          – EMR Presto: hardware $0.624, software $0.156
          – EMR Spark: hardware $0.624, software $0.156
          – Databricks: hardware $0.624, software $0.3
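
The cost formula on this slide can be worked through directly. The sketch below uses the per-node hourly prices from the slide; counting 9 nodes (1 master plus 8 workers) and the one-hour execution time are assumptions made for the example.

```python
# (hardware, software) cost in dollars per node per hour, from the slide.
NODE_COST_PER_HOUR = {
    "EMR Presto": (0.624, 0.156),
    "EMR Spark": (0.624, 0.156),
    "Databricks": (0.624, 0.3),
}


def benchmark_cost(system, num_nodes, exec_hours):
    """Cost = number of nodes x node cost per hour x execution time in hours."""
    hardware, software = NODE_COST_PER_HOUR[system]
    return num_nodes * (hardware + software) * exec_hours


# Assumed cluster of 9 nodes running for a hypothetical 1 hour.
for system in NODE_COST_PER_HOUR:
    print(system, round(benchmark_cost(system, num_nodes=9, exec_hours=1.0), 3))
```

Under these prices a Databricks node costs more per hour than an EMR node, so Databricks only comes out cheaper overall when its execution time is sufficiently shorter.
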
  33. [Figure: TPC Benchmark DS cost]
  34. [Figure: TPC-DS price-performance]
  35. Disk utilization
      • Databricks
          – Automatically caches hot input data
          – Requires machines with NVMe SSDs
      • EMR Presto
          – Experimental spilling of state to disk
          – “we do not configure any of the Facebook deployments to spill…local disks would increase hardware costs…”
      Raghav Sethi et al. Presto: SQL on Everything. ICDE 2019: 1802-1813
  36. [Figure panels: Databricks, EMR Presto]
  37. [Figure panels: Databricks, EMR Presto]
  38. Usability and developer productivity
      • Easy and flexible cluster creation: EMR Presto ✓, EMR Spark ✓, Databricks ✓
      • Framework configuration at cluster creation time: EMR Presto ✓, EMR Spark ✓, Databricks ✓
      • Direct distributed file system support: EMR Presto ✗, EMR Spark ✗, Databricks ✓
      • Independent data catalog (metastore): EMR Presto ✓, EMR Spark ✓, Databricks ✓
      • Support for notebooks: EMR Presto ✓, EMR Spark ✓, Databricks ✓
      • Integrated Web GUI: EMR Presto ✗, EMR Spark ✗, Databricks ✓
  39. Usability and developer productivity
      • JDBC access: EMR Presto ✓, EMR Spark ✓, Databricks ✓
      • Programmatic interface: EMR Presto ✗, EMR Spark ✓, Databricks ✓
      • Job creation and management infrastructure: EMR Presto ✗, EMR Spark ✗, Databricks ✓
      • Customized visualization of query plan execution: EMR Presto ✓, EMR Spark ✓, Databricks ✓
      • Resource utilization monitoring with Ganglia and CloudWatch: EMR Presto ✓, EMR Spark ✓, Databricks ✓
  40. Conclusions
      • Databricks is about 4x faster than EMR Presto without statistics
          – About 3x faster with them
      • Difference smaller with EMR Spark
          – Databricks still more cost-effective
          – More efficient runtime, cache, and CBO
      • Databricks and EMR Spark deal better with concurrency and benefit from data partitioning
  41. Conclusions
      • EMR Presto requires significantly more tuning
          – Minimal for Databricks and EMR Spark
      • Functionality of Databricks and EMR Presto/Spark for SQL very similar
          – Databricks more user friendly in some aspects