Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters

© Hortonworks Inc. 2011 – 2017. All Rights Reserved
Michael Young – Senior Solutions Engineer @ Hortonworks
Marcus Waineo – Principal Solutions Engineer @ Hortonworks

Agenda
⬢ Objective
⬢ Overview of Benchmark Tools
⬢ Overview of Cluster Configurations
⬢ TestDFSIO Results
⬢ TPC-DS Results
⬢ TeraSort Results
⬢ Analysis
⬢ Recommendations

Objective
⬢ Evaluate relative performance of each cluster configuration
⬢ Focused on the longer running queries in the TPC-DS test suite
– q4, q11, q29, q59, q74, q75, q78, q93, q97
⬢ Perform instance cost analysis of each cluster configuration to determine “cost per job”
– consider storage and instance type
⬢ Determine whether a more “expensive” server configuration yields a lower long-term cost due to performance improvements
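
The “cost per job” idea can be sketched as the cluster's combined hourly price multiplied by the job's runtime. This is a minimal sketch under that assumption (linear, per-second pricing); the function name `cost_per_job` and the sample numbers are hypothetical and are not taken from the benchmark results.

```shell
#!/usr/bin/env bash
# Hypothetical helper: cost of one job = hourly cluster cost * runtime in hours.
# Arguments: $1 = total cluster cost in USD per hour (instances plus storage),
#            $2 = job runtime in seconds.
cost_per_job() {
  LC_ALL=C awk -v rate="$1" -v secs="$2" \
    'BEGIN { printf "%.4f\n", rate * secs / 3600 }'
}

# Example with made-up numbers: a 10.00 USD/hour cluster running a 30-minute job.
cost_per_job 10.00 1800
```

Under this model, a pricier cluster is cheaper per job whenever it cuts runtime by a larger factor than it raises the hourly rate.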

Benchmark Tools
⬢ TPC-DS
– 5 runs, sequential for each query
– Average of 5 runs
– Tested 50GB and 100GB data sets
⬢ TestDFSIO
– 1 run
– Tested 50GB and 100GB
⬢ TeraGen & TeraSort
– 1 run
– Tested 50GB and 100GB
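
The TPC-DS figure for each query is the average of 5 sequential runs. A minimal sketch of that averaging step follows; the helper name `average_runs` and the sample runtimes are hypothetical, and the actual suite may compute its averages differently.

```shell
#!/usr/bin/env bash
# Hypothetical helper: arithmetic mean of the runtimes passed as arguments.
# Returns non-zero if no runtimes are given.
average_runs() {
  [ "$#" -gt 0 ] || return 1
  printf '%s\n' "$@" |
    LC_ALL=C awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}

# Example with made-up runtimes (seconds) for five runs of one query.
average_runs 41.2 39.8 40.5 40.1 40.9
```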

Cluster Configurations - Hardware
⬢ All clusters configured as 14 nodes
– 3 master nodes – m4.4xlarge
– 1 client node – m4.4xlarge
– 10 slave nodes (either 10 data nodes or 6 compute and 4 data nodes)
– Tested using standard, gp2 and st1 storage types
⬢ All clusters tested in the same AWS Availability Zone (us-east-1a).
⬢ Tested the following instance configurations:
– 10x m4.xlarge (4c, 16GB, 6x200GB)
– 10x m4.4xlarge (16c, 64GB, 6x200GB)
– 10x m5.xlarge (4c, 16GB, 6x200GB)
– 10x m5.4xlarge (16c, 64GB, 6x200GB)
– 6x r4.8xlarge (32c, 244GB) compute, 4x d2.2xlarge (8c, 61GB, 6x2TB) storage
– 6x r4.8xlarge (32c, 244GB) compute, 4x i3.2xlarge (8c, 61GB, 1x1.9TB NVMe) storage

Cluster Configurations - Software
⬢ All clusters were built using:
– Cloudbreak 2.4.1
– Ambari 2.6.1.3
– HDP 2.6.4.5-2
⬢ All clusters were built using a basic blueprint.
– No performance optimizations were implemented

Cloudbreak/Ambari Blueprint

Cloudbreak/Ambari Blueprint - With Compute

Cluster Overview - All Data Nodes

TestDFSIO

TestDFSIO Commands
⬢ TestDFSIO 50GB:
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 50 -fileSize 1000
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 50 -fileSize 1000
⬢ TestDFSIO 100GB:
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 100 -fileSize 1000
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 100 -fileSize 1000
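
The write and read commands differ only in the phase flag and the file count, so they can be generated from the data set size. The following is a sketch, not part of the original test procedure: the helper name `dfsio_cmd` is hypothetical, and it only prints the command rather than running it. It relies on the fact that each file is 1000 MB, so the number of files equals the nominal size in GB.

```shell
#!/usr/bin/env bash
# Jar path as used in the TestDFSIO commands above.
DFSIO_JAR=/usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar

# Hypothetical helper: print (not run) the TestDFSIO command for one phase.
# Arguments: $1 = "write" or "read", $2 = nominal data set size in GB.
dfsio_cmd() {
  local mode="$1" size_gb="$2"
  echo "time hadoop jar $DFSIO_JAR TestDFSIO -D mapred.output.compress=false" \
       "-$mode -nrFiles $size_gb -fileSize 1000"
}

# Print the 50GB write and read commands.
dfsio_cmd write 50
dfsio_cmd read 50
```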

TestDFSIO Write 50GB

TestDFSIO Read 50GB

TestDFSIO Write 100GB

TestDFSIO Read 100GB

TPC-DS

TPC-DS Commands
⬢ Repo: https://github.com/Jaraxal/cloud-hadoop-benchmarks
– git clone https://github.com/jaraxal/cloud-hadoop-benchmarks
– cd cloud-hadoop-benchmarks
⬢ TPC-DS 50GB:
– ./tpcds-build.sh
– ./tpcds-setup.sh 50
– ./runSuite.pl tpcds 50
⬢ TPC-DS 100GB:
– ./tpcds-build.sh (only needed once per cluster)
– ./tpcds-setup.sh 100
– ./runSuite.pl tpcds 100

TPC-DS 50GB

TPC-DS 100GB

TeraGen & TeraSort

TeraGen & TeraSort Commands
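⬢ TeraGen/TeraSort 50GB:
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 500000000 /user/cloudbreak/terasort-input
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output
⬢ TeraGen/TeraSort 100GB:
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 1000000000 /user/cloudbreak/terasort-input
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output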
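
The first argument to teragen is a row count, not a byte count. TeraGen writes 100-byte rows, so the row counts above follow from the target size. A small sketch of that arithmetic, assuming 1 GB = 10^9 bytes; the helper name `teragen_rows` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: number of 100-byte TeraGen rows for a target size in GB
# (decimal gigabytes, 10^9 bytes each).
teragen_rows() {
  local size_gb="$1"
  echo $(( size_gb * 1000000000 / 100 ))
}

teragen_rows 50    # row count for the 50GB data set
teragen_rows 100   # row count for the 100GB data set
```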

TeraGen & TeraSort 50GB

TeraGen & TeraSort 100GB

Cost/Performance Analysis

Hive Cost Per Query - 50GB

Hive Cost Per Query - 100GB

Hive Cost Total - 50GB

Hive Cost Total - 100GB

MapReduce Cost Per Job - 50GB

MapReduce Cost Per Job - 100GB

Recommendations

Instance Recommendations
⬢ Consider workload type
⬢ Consider performance SLAs
⬢ Hive
– The m5.4xlarge instance type provides the best overall price to performance ratio.
⬢ MapReduce/YARN
– The m5.xlarge is the cheapest instance type. It provides competitive performance for larger data volumes.
– The m5.4xlarge is faster and not much more expensive. If SLAs demand, use the m5.4xlarge.
⬢ When in doubt, use the m5.4xlarge
⬢ d2.x and i3.x instance types
– While the d2 and i3 instance types offer much faster I/O throughput, that doesn’t translate directly into faster Hive and MapReduce/YARN workloads.
– Use with caution. Ephemeral storage is not suited for long-running clusters.

Storage Recommendations
⬢ For Hadoop workloads, st1 storage offers performance comparable to gp2 storage at half the storage cost.
⬢ For master, compute and client servers, use gp2 storage.
⬢ For data nodes, use st1 storage.
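
To see what “half the storage cost” means for the data-node layout tested here (10 nodes, 6 volumes of 200GB each), the monthly EBS bill can be sketched as total provisioned capacity times the per-GB-month price. The helper name `ebs_monthly_cost` is hypothetical, and the two prices below are placeholders chosen only to illustrate a 2:1 ratio; check the Amazon EBS pricing page for current figures.

```shell
#!/usr/bin/env bash
# Hypothetical helper: monthly EBS cost for a group of identical nodes.
# Arguments: $1 = node count, $2 = volumes per node, $3 = GB per volume,
#            $4 = price in USD per GB-month (placeholder, not a quoted price).
ebs_monthly_cost() {
  LC_ALL=C awk -v n="$1" -v v="$2" -v gb="$3" -v price="$4" \
    'BEGIN { printf "%.2f\n", n * v * gb * price }'
}

# 10 data nodes x 6 volumes x 200GB at two placeholder prices.
ebs_monthly_cost 10 6 200 0.10
ebs_monthly_cost 10 6 200 0.05
```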

Sample Cluster Metrics

Tuning
⬢ Never tune an environment for a benchmark and expect it to be optimal for your workload, whatever that workload is.
⬢ There is room for tuning and optimization that could improve performance.

Resources
⬢ Amazon EC2 Pricing: https://aws.amazon.com/ec2/pricing/on-demand/
⬢ Amazon EBS Pricing: https://aws.amazon.com/ebs/pricing/
⬢ Amazon EBS-Optimized Instances: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html

Questions?

Thank you!
