
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters


Many organizations today have already migrated Hadoop workloads to cloud infrastructure, or they are actively planning such a migration. A common question in this scenario is "Which instance types should I use for my Hadoop cluster?" There are nuances to cloud infrastructure that require careful consideration when deciding which instance types to use. This session presents the results of a performance comparison of Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance types commonly used in Hadoop clusters. More importantly, we will discuss the relative cost of these instance types to show which AWS instances offer the best price-to-performance ratio on standard benchmarks. Attendees of this session will leave with a better understanding of the performance of AWS EC2 instance types when used for Hadoop workloads and will be able to make more informed decisions about which instance types make the most sense for their needs.

Speakers
Michael Young, Senior Solutions Engineer, Hortonworks
Marcus Waineo, Principal Solutions Engineer, Hortonworks

Published in: Technology

Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters

  1. Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters. Michael Young – Senior Solutions Engineer @ Hortonworks. Marcus Waineo – Principal Solutions Engineer @ Hortonworks
  2. Agenda ⬢ Objective ⬢ Overview of Benchmark Tools ⬢ Overview of Cluster Configurations ⬢ TestDFSIO Results ⬢ TPC-DS Results ⬢ TeraSort Results ⬢ Analysis ⬢ Recommendations
  3. Objective ⬢ Evaluate the relative performance of each cluster configuration ⬢ Focus on the longer-running queries in the TPC-DS test suite – q4, q11, q29, q59, q74, q75, q78, q93, q97 ⬢ Perform an instance cost analysis of each cluster configuration to determine “cost per job” – considering both storage and instance type ⬢ Determine whether a more “expensive” server configuration yields a lower long-term cost due to performance improvements
  4. Benchmark Tools ⬢ TPC-DS – 5 sequential runs for each query – Average of the 5 runs reported – Tested 50GB and 100GB data sets ⬢ TestDFSIO – 1 run – Tested 50GB and 100GB ⬢ TeraGen & TeraSort – 1 run – Tested 50GB and 100GB
  5. Cluster Configurations - Hardware ⬢ All clusters configured with 14 nodes – 3 master nodes – m4.4xlarge – 1 client node – m4.4xlarge – 10 slave nodes (either 10 data nodes, or 6 compute and 4 data nodes) – Tested using standard, gp2 and st1 EBS storage types ⬢ All clusters tested in the same AWS Availability Zone (us-east-1a) ⬢ Tested the following instance configurations: – 10x m4.xlarge (4c, 16GB, 6x200GB) – 10x m4.4xlarge (16c, 64GB, 6x200GB) – 10x m5.xlarge (4c, 16GB, 6x200GB) – 10x m5.4xlarge (16c, 64GB, 6x200GB) – 6x r4.8xlarge (32c, 244GB) compute, 4x d2.2xlarge (8c, 61GB, 6x2TB) storage – 6x r4.8xlarge (32c, 244GB) compute, 4x i3.2xlarge (8c, 61GB, 1x1.9TB NVMe) storage
  6. Cluster Configurations - Software ⬢ All clusters were built using: – Cloudbreak 2.4.1 – Ambari 2.6.1.3 – HDP 2.6.4.5-2 ⬢ All clusters were built from a basic blueprint – No performance optimizations were implemented
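The actual blueprint is not reproduced in this transcript; the following is a minimal illustrative sketch of what a basic, untuned Ambari blueprint registered through the Ambari REST API could look like. The blueprint name, host group layout, and component list are assumptions for illustration, not the blueprint used in the tests.

cat > hdp-basic-blueprint.json <<'EOF'
{
  "Blueprints": { "blueprint_name": "hdp-basic", "stack_name": "HDP", "stack_version": "2.6" },
  "host_groups": [
    { "name": "master", "cardinality": "3",
      "components": [ {"name": "NAMENODE"}, {"name": "SECONDARY_NAMENODE"},
                      {"name": "RESOURCEMANAGER"}, {"name": "ZOOKEEPER_SERVER"},
                      {"name": "HIVE_SERVER"}, {"name": "HIVE_METASTORE"} ] },
    { "name": "worker", "cardinality": "10",
      "components": [ {"name": "DATANODE"}, {"name": "NODEMANAGER"} ] }
  ]
}
EOF
# Register the blueprint with Ambari (hypothetical host, default credentials)
curl -u admin:admin -H "X-Requested-By: ambari" -X POST \
  -d @hdp-basic-blueprint.json http://AMBARI_HOST:8080/api/v1/blueprints/hdp-basic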
  7. Cloudbreak/Ambari Blueprint
  8. Cloudbreak/Ambari Blueprint - With Compute
  9. Cluster Overview - All Data Nodes
  10. TestDFSIO
  11. TestDFSIO Commands ⬢ TestDFSIO 50GB: – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 50 -fileSize 1000 – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 50 -fileSize 1000 ⬢ TestDFSIO 100GB: – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 100 -fileSize 1000 – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 100 -fileSize 1000
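By default, each TestDFSIO run appends a summary to TestDFSIO_results.log in the local working directory, and the generated HDFS data can be removed between runs with the -clean option. A small sketch of how the throughput and execution-time figures shown on the following slides could be pulled out, assuming that default log location:

grep -E "Throughput mb/sec|Average IO rate mb/sec|Test exec time sec" TestDFSIO_results.log
# Remove the generated benchmark data from HDFS between runs
hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -clean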
  12. TestDFSIO Write 50GB
  13. TestDFSIO Read 50GB
  14. TestDFSIO Write 100GB
  15. TestDFSIO Read 100GB
  16. TPC-DS
  17. TPC-DS Commands ⬢ Repo: https://github.com/Jaraxal/cloud-hadoop-benchmarks – git clone https://github.com/jaraxal/cloud-hadoop-benchmarks – cd cloud-hadoop-benchmarks ⬢ TPC-DS 50GB: – ./tpcds-build.sh – ./tpcds-setup.sh 50 – ./runSuite.pl tpcds 50 ⬢ TPC-DS 100GB: – ./tpcds-build.sh (only needed once per cluster) – ./tpcds-setup.sh 100 – ./runSuite.pl tpcds 100
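The Objective slide focuses on the longer-running TPC-DS queries. A hedged sketch of how those queries could be run five times each and averaged, assuming the repository follows the usual hive-testbench layout (query files under sample-queries-tpcds/ and a tpcds_bin_partitioned_orc_<scale> Hive database); the directory, database, and file names here are assumptions, not taken from the slides:

SCALE=50
DB=tpcds_bin_partitioned_orc_${SCALE}     # assumed hive-testbench database naming
for q in 4 11 29 59 74 75 78 93 97; do    # long-running queries from the Objective slide
  total=0
  for run in 1 2 3 4 5; do                # 5 sequential runs per query
    start=$(date +%s)
    hive --database "$DB" -f sample-queries-tpcds/query${q}.sql > /dev/null
    total=$(( total + $(date +%s) - start ))
  done
  echo "query${q}: average $(( total / 5 )) seconds over 5 runs"
done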
  18. TPC-DS 50GB
  19. TPC-DS 100GB
  20. TeraGen & TeraSort
  21. TeraGen & TeraSort Commands ⬢ TeraGen/TeraSort 50GB – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 500000000 /user/cloudbreak/terasort-input – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output ⬢ TeraGen/TeraSort 100GB – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 1000000000 /user/cloudbreak/terasort-input – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output
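TeraGen writes fixed 100-byte rows, so 500,000,000 rows correspond to 50 GB and 1,000,000,000 rows to 100 GB, which is how the row counts above map to the data set sizes. An optional validation pass and clean-up between runs, using the same examples jar and the output paths from the commands above:

# Validate that the TeraSort output is globally sorted
time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar \
  teravalidate /user/cloudbreak/terasort-output /user/cloudbreak/terasort-validate
# Remove the data sets so each configuration starts from a clean HDFS state
hdfs dfs -rm -r -skipTrash /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output /user/cloudbreak/terasort-validate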
  22. TeraGen & TeraSort 50GB
  23. TeraGen & TeraSort 100GB
  24. Cost/Performance Analysis
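The charts in this section express cost per query and per job. A minimal sketch of the arithmetic behind such a figure, using illustrative on-demand prices; the dollar amounts and runtime below are placeholders rather than measured values (actual prices come from the AWS pricing pages on the Resources slide, and EBS storage cost would be added the same way):

# Hypothetical on-demand prices (USD/hour) -- illustrative placeholders only
WORKER_PRICE=0.768      # e.g. one m5.4xlarge worker node
WORKERS=10
MASTER_PRICE=0.80       # e.g. one m4.4xlarge master/client node
MASTERS=4               # 3 masters + 1 client
RUNTIME_SEC=420         # measured wall-clock time of one query or job

CLUSTER_PER_HOUR=$(echo "$WORKER_PRICE * $WORKERS + $MASTER_PRICE * $MASTERS" | bc -l)
COST_PER_JOB=$(echo "$CLUSTER_PER_HOUR * $RUNTIME_SEC / 3600" | bc -l)
printf 'cluster: $%.2f/hour  cost per job: $%.4f\n' "$CLUSTER_PER_HOUR" "$COST_PER_JOB"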
  25. Hive Cost Per Query - 50GB
  26. Hive Cost Per Query - 100GB
  27. Hive Cost Total - 50GB
  28. Hive Cost Total - 100GB
  29. MapReduce Cost Per Job - 50GB
  30. MapReduce Cost Per Job - 100GB
  31. Recommendations
  32. Instance Recommendations ⬢ Consider workload type ⬢ Consider performance SLAs ⬢ Hive – The m5.4xlarge instance type provides the best overall price-to-performance ratio ⬢ MapReduce/YARN – The m5.xlarge is the cheapest instance type and provides competitive performance for larger data volumes – The m5.4xlarge is faster and not much more expensive; if SLAs demand it, use the m5.4xlarge ⬢ When in doubt, use the m5.4xlarge ⬢ d2 and i3 instance types – While the d2 and i3 instance types offer much faster raw I/O throughput, that does not translate directly into faster Hive and MapReduce/YARN workloads – Use with caution: ephemeral storage is not well suited for long-running clusters
  33. Storage Recommendations ⬢ st1 storage offers comparable performance to gp2 storage at half the storage cost for Hadoop workloads ⬢ For master, compute and client servers, use gp2 storage ⬢ For data nodes, use st1 storage
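A hedged sketch of how this recommendation could be applied when launching instances with the AWS CLI; the AMI ID, device names, instance types, and volume sizes are placeholders (note that st1 volumes have a 500 GiB minimum size):

# Master/compute/client node: gp2 EBS volume
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m4.4xlarge --count 1 \
  --block-device-mappings '[{"DeviceName":"/dev/xvdb","Ebs":{"VolumeSize":200,"VolumeType":"gp2"}}]'
# Data node: st1 EBS volume for bulk sequential I/O
aws ec2 run-instances --image-id ami-xxxxxxxx --instance-type m5.4xlarge --count 1 \
  --block-device-mappings '[{"DeviceName":"/dev/xvdb","Ebs":{"VolumeSize":500,"VolumeType":"st1"}}]'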
  34. Sample Cluster Metrics
  35. Tuning ⬢ Never tune an environment for a benchmark and expect it to be optimal for your actual workload ⬢ There is room for tuning and optimization that could further improve performance
  36. Resources ⬢ Amazon EC2 Pricing: https://aws.amazon.com/ec2/pricing/on-demand/ ⬢ Amazon EBS Pricing: https://aws.amazon.com/ebs/pricing/ ⬢ Amazon EBS-Optimized Instances: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
  37. Questions?
  38. Thank you!
