1 © Hortonworks Inc. 2011 – 2017. All Rights Reserved
Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters
Michael Young – Senior Solutions Engineer @ Hortonworks
Marcus Waineo – Principal Solutions Engineer @ Hortonworks
Agenda
⬢ Objective
⬢ Overview of Benchmark Tools
⬢ Overview of Cluster Configurations
⬢ TestDFSIO Results
⬢ TPC-DS Results
⬢ TeraSort Results
⬢ Analysis
⬢ Recommendations
Objective
⬢ Evaluate the relative performance of each cluster configuration
⬢ Focus on the longer-running queries in the TPC-DS test suite
– q4, q11, q29, q59, q74, q75, q78, q93, q97
⬢ Perform an instance cost analysis of each cluster configuration to determine “cost per job”
– Consider storage and instance type
⬢ Determine whether a more “expensive” server configuration yields a lower long-term cost due to performance improvements
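The “cost per job” idea above can be read as the cluster's hourly cost multiplied by the job's runtime. The following is a minimal sketch under that assumption, with billing prorated per second; the function names and the prices are illustrative placeholders, not figures from this study.

```python
def hourly_cluster_cost(node_groups):
    """Sum the hourly cost over (node_count, price_per_node_hour) pairs."""
    return sum(count * price for count, price in node_groups)


def cost_per_job(cluster_cost_per_hour, runtime_seconds):
    """Prorate an hourly cluster cost over a job's wall-clock runtime."""
    if cluster_cost_per_hour < 0 or runtime_seconds < 0:
        raise ValueError("cost and runtime must be non-negative")
    return cluster_cost_per_hour * runtime_seconds / 3600.0


# Hypothetical prices in USD per node-hour, for illustration only.
cheap = hourly_cluster_cost([(10, 0.20)])   # 10 smaller worker nodes
pricey = hourly_cluster_cost([(10, 0.80)])  # 10 larger worker nodes

# A pricier cluster only wins on cost per job if it is proportionally faster.
print(cost_per_job(cheap, 1800))   # 30-minute job on the cheaper cluster
print(cost_per_job(pricey, 300))   # 5-minute job on the pricier cluster
```

With these placeholder numbers the larger cluster costs four times as much per hour but finishes six times sooner, so its cost per job comes out lower. That is the trade-off the objective sets out to measure.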
Benchmark Tools
⬢ TPC-DS
– 5 sequential runs of each query
– Results reported as the average of the 5 runs
– Tested 50GB and 100GB data sets
⬢ TestDFSIO
– 1 run
– Tested 50GB and 100GB
⬢ TeraGen & TeraSort
– 1 run
– Tested 50GB and 100GB
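The averaging step for the TPC-DS runs can be sketched as follows. This is a minimal illustration only: the query timings are placeholders rather than measurements from this study, and `average_runtimes` is a hypothetical helper name.

```python
from statistics import mean


def average_runtimes(runs_by_query):
    """Map each query name to the mean of its recorded runtimes in seconds."""
    return {query: mean(times) for query, times in runs_by_query.items() if times}


# Placeholder timings in seconds for 5 runs per query (not real results).
sample = {
    "q4": [120.0, 118.0, 122.0, 119.0, 121.0],
    "q11": [60.0, 62.0, 58.0, 61.0, 59.0],
}
print(average_runtimes(sample))
```

TestDFSIO and TeraGen/TeraSort were each run only once, so their numbers have no such averaging behind them.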
Cluster Configurations - Hardware
⬢ All clusters configured with 14 nodes
– 3 master nodes – m4.4xlarge
– 1 client node – m4.4xlarge
– 10 slave nodes (either 10 data nodes, or 6 compute and 4 data nodes)
– Tested using the standard, gp2 and st1 EBS volume types
⬢ All clusters tested in the same AWS Availability Zone (us-east-1a).
⬢ Tested the following instance configurations:
– 10x m4.xlarge (4 vCPU, 16 GB, 6x200GB)
– 10x m4.4xlarge (16 vCPU, 64 GB, 6x200GB)
– 10x m5.xlarge (4 vCPU, 16 GB, 6x200GB)
– 10x m5.4xlarge (16 vCPU, 64 GB, 6x200GB)
– 6x r4.8xlarge (32 vCPU, 244 GB) compute, 4x d2.2xlarge (8 vCPU, 61 GB, 6x2TB) storage
– 6x r4.8xlarge (32 vCPU, 244 GB) compute, 4x i3.2xlarge (8 vCPU, 61 GB, 1x1.9TB NVMe) storage
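Although every configuration has 10 worker nodes, the aggregate compute and memory differ widely, which is worth keeping in mind when comparing runtimes. Here is a small sketch that totals the worker resources for three of the configurations listed above; `NodeGroup` and `totals` are hypothetical helper names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeGroup:
    count: int       # number of instances in the group
    vcpus: int       # vCPUs per instance
    memory_gb: int   # memory per instance in GB


def totals(groups):
    """Return (total vCPUs, total memory in GB) across all node groups."""
    return (
        sum(g.count * g.vcpus for g in groups),
        sum(g.count * g.memory_gb for g in groups),
    )


# Worker-node specs as listed on this slide.
configs = {
    "10x m4.xlarge": [NodeGroup(10, 4, 16)],
    "10x m4.4xlarge": [NodeGroup(10, 16, 64)],
    "6x r4.8xlarge + 4x d2.2xlarge": [NodeGroup(6, 32, 244), NodeGroup(4, 8, 61)],
}

for name, groups in configs.items():
    print(name, totals(groups))
```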
Cluster Configurations - Software
⬢ All clusters were built using:
– Cloudbreak 2.4.1
– Ambari 2.6.1.3
– HDP 2.6.4.5-2
⬢ All clusters were built using a basic blueprint.
– No performance optimizations were implemented
Cloudbreak/Ambari Blueprint
Cloudbreak/Ambari Blueprint - With Compute
Cluster Overview - All Data Nodes
TestDFSIO
TestDFSIO Commands
⬢ TestDFSIO 50GB:
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 50 -fileSize 1000
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 50 -fileSize 1000
⬢ TestDFSIO 100GB:
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 100 -fileSize 1000
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 100 -fileSize 1000
TestDFSIO Write 50GB
TestDFSIO Read 50GB
TestDFSIO Write 100GB
TestDFSIO Read 100GB
TPC-DS
TPC-DS Commands
⬢ Repo: https://github.com/Jaraxal/cloud-hadoop-benchmarks
– git clone https://github.com/jaraxal/cloud-hadoop-benchmarks
– cd cloud-hadoop-benchmarks
⬢ TPC-DS 50GB:
– ./tpcds-build.sh
– ./tpcds-setup.sh 50
– ./runSuite.pl tpcds 50
⬢ TPC-DS 100GB:
– ./tpcds-build.sh (only needed once per cluster)
– ./tpcds-setup.sh 100
– ./runSuite.pl tpcds 100
TPC-DS 50GB
TPC-DS 100GB
TeraGen & TeraSort
TeraGen & TeraSort Commands
⬢ TeraGen/TeraSort 50GB
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 500000000 /user/cloudbreak/terasort-input
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output
⬢ TeraGen/TeraSort 100GB
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 1000000000 /user/cloudbreak/terasort-input
– time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output
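The first argument to `teragen` is a row count, not a byte size. TeraGen writes fixed 100-byte rows, so the row counts above follow from the target data size. A minimal sketch, assuming decimal gigabytes; `teragen_rows` is a hypothetical helper name.

```python
ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(size_gb):
    """Number of TeraGen rows needed for a data set of size_gb decimal GB."""
    return size_gb * 10**9 // ROW_BYTES


print(teragen_rows(50))   # row count for the 50GB run
print(teragen_rows(100))  # row count for the 100GB run
```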
TeraGen & TeraSort 50GB
TeraGen & TeraSort 100GB
Cost/Performance Analysis
Hive Cost Per Query - 50GB
Hive Cost Per Query - 100GB
Hive Cost Total - 50GB
Hive Cost Total - 100GB
MapReduce Cost Per Job - 50GB
MapReduce Cost Per Job - 100GB
Recommendations
Instance Recommendations
⬢ Consider workload type
⬢ Consider performance SLAs
⬢ Hive
– The m5.4xlarge instance type provides the best overall price-to-performance ratio.
⬢ MapReduce/YARN
– The m5.xlarge is the cheapest instance type and provides competitive performance for larger data volumes.
– The m5.4xlarge is faster and not much more expensive. If SLAs demand it, use the m5.4xlarge.
⬢ When in doubt, use the m5.4xlarge
⬢ d2 and i3 instance types
– While the d2 and i3 instance types offer much faster I/O throughput, that doesn’t translate directly into faster Hive and MapReduce/YARN workloads.
– Use with caution: their ephemeral (instance store) storage is not suited to long-running clusters.
Storage Recommendations
⬢ For Hadoop workloads, st1 storage offers performance comparable to gp2 storage at half the storage cost.
⬢ For master, compute and client servers, use gp2 storage.
⬢ For data nodes, use st1 storage.
Sample Cluster Metrics
Tuning
⬢ Never tune an environment for a benchmark and expect it to be optimal for your own workload, whatever that workload is.
⬢ These clusters were built without performance optimizations, so there is room for tuning that could improve performance.
Resources
⬢ Amazon EC2 Pricing: https://aws.amazon.com/ec2/pricing/on-demand/
⬢ Amazon EBS Pricing: https://aws.amazon.com/ebs/pricing/
⬢ Amazon EBS-Optimized Instances: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
Questions?
Thank you!

Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 

Kürzlich hochgeladen (20)

Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 

Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters

  • 1. Comparative Performance Analysis of AWS EC2 Instance Types Commonly Used for Hadoop Clusters
    Michael Young – Senior Solutions Engineer @ Hortonworks
    Marcus Waineo – Principal Solutions Engineer @ Hortonworks
    © Hortonworks Inc. 2011 – 2017. All Rights Reserved
  • 2. Agenda
    ⬢ Objective
    ⬢ Overview of Benchmark Tools
    ⬢ Overview of Cluster Configurations
    ⬢ TestDFSIO Results
    ⬢ TPC-DS Results
    ⬢ TeraSort Results
    ⬢ Analysis
    ⬢ Recommendations
  • 3. Objective
    ⬢ Evaluate relative performance of each cluster configuration
    ⬢ Focus on the longer-running queries in the TPC-DS test suite
      – q4, q11, q29, q59, q74, q75, q78, q93, q97
    ⬢ Perform instance cost analysis of each cluster configuration to determine “cost per job”
      – consider storage and instance type
    ⬢ Determine whether a more “expensive” server configuration yields a lower long-term cost due to performance improvements
  • 4. Benchmark Tools
    ⬢ TPC-DS
      – 5 runs, sequential for each query
      – Average of 5 runs
      – Tested 50GB and 100GB data sets
    ⬢ TestDFSIO
      – 1 run
      – Tested 50GB and 100GB
    ⬢ TeraGen & TeraSort
      – 1 run
      – Tested 50GB and 100GB
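
The "average of 5 runs" figure used for TPC-DS can be sketched as a plain arithmetic mean of per-run wall-clock times. This is a minimal sketch, not the deck's tooling: the function name `average_runtime` and the timings are hypothetical and are not measured results.

```python
from statistics import mean


def average_runtime(run_seconds):
    """Return the arithmetic mean of per-run wall-clock times, in seconds."""
    if not run_seconds:
        raise ValueError("at least one run is required")
    return mean(run_seconds)


# Hypothetical timings for five sequential runs of one query (not measured data).
example_runs = [41.0, 39.5, 40.2, 40.8, 39.5]
avg = average_runtime(example_runs)
print(f"average over {len(example_runs)} runs: {avg:.1f} s")
```

Single-run benchmarks (TestDFSIO, TeraGen and TeraSort here) have no such averaging, so their numbers are more exposed to run-to-run noise.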
  • 5. Cluster Configurations - Hardware
    ⬢ All clusters configured as 14 nodes
      – 3 master nodes – m4.4xlarge
      – 1 client node – m4.4xlarge
      – 10 slave nodes (either 10 data nodes or 6 compute and 4 data nodes)
      – Tested using standard, gp2 and st1 storage types
    ⬢ All clusters tested in the same AWS Availability Zone (us-east-1a).
    ⬢ Tested the following instance configurations:
      – 10x m4.xlarge (4 vCPU, 16 GB RAM, 6x200GB)
      – 10x m4.4xlarge (16 vCPU, 64 GB RAM, 6x200GB)
      – 10x m5.xlarge (4 vCPU, 16 GB RAM, 6x200GB)
      – 10x m5.4xlarge (16 vCPU, 64 GB RAM, 6x200GB)
      – 6x r4.8xlarge (32 vCPU, 244 GB RAM) compute, 4x d2.2xlarge (8 vCPU, 61 GB RAM, 6x2TB) storage
      – 6x r4.8xlarge (32 vCPU, 244 GB RAM) compute, 4x i3.2xlarge (8 vCPU, 61 GB RAM, 1x1.9TB NVMe) storage
  • 6. Cluster Configurations - Software
    ⬢ All clusters were built using:
      – Cloudbreak 2.4.1
      – Ambari 2.6.1.3
      – HDP 2.6.4.5-2
    ⬢ All clusters were built using a basic blueprint.
      – No performance optimizations were implemented
  • 7. Cloudbreak/Ambari Blueprint
  • 8. Cloudbreak/Ambari Blueprint - With Compute
  • 9. Cluster Overview - All Data Nodes
  • 10. TestDFSIO
  • 11. TestDFSIO Commands
    ⬢ TestDFSIO 50GB:
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 50 -fileSize 1000
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 50 -fileSize 1000
    ⬢ TestDFSIO 100GB:
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -write -nrFiles 100 -fileSize 1000
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-client-jobclient-2.7.3.2.6.4.5-2-tests.jar TestDFSIO -D mapred.output.compress=false -read -nrFiles 100 -fileSize 1000
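
The 50GB and 100GB labels follow from the TestDFSIO arguments: the number of files times the per-file size, which TestDFSIO reads in megabytes. The sketch below shows that arithmetic and assembles the argument list used in the commands above. The helper names `dfsio_total_gb` and `dfsio_args` are hypothetical, and 1 GB is taken as 1000 MB to match the deck's labels.

```python
def dfsio_total_gb(nr_files, file_size_mb):
    """Total data volume for one TestDFSIO pass, treating 1000 MB as 1 GB."""
    return nr_files * file_size_mb / 1000


def dfsio_args(mode, nr_files, file_size_mb):
    """Build the TestDFSIO argument list (everything after the jar path)."""
    if mode not in ("write", "read"):
        raise ValueError("mode must be 'write' or 'read'")
    return [
        "TestDFSIO",
        "-D", "mapred.output.compress=false",
        f"-{mode}",
        "-nrFiles", str(nr_files),
        "-fileSize", str(file_size_mb),
    ]


for files in (50, 100):
    print(files, "files ->", dfsio_total_gb(files, 1000), "GB:",
          " ".join(dfsio_args("write", files, 1000)))
```

The read pass reuses the files produced by the write pass, so the write command has to run first with the same `-nrFiles` and `-fileSize` values.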
  • 12. TestDFSIO Write 50GB
  • 13. TestDFSIO Read 50GB
  • 14. TestDFSIO Write 100GB
  • 15. TestDFSIO Read 100GB
  • 16. TPC-DS
  • 17. TPC-DS Commands
    ⬢ Repo: https://github.com/Jaraxal/cloud-hadoop-benchmarks
      – git clone https://github.com/jaraxal/cloud-hadoop-benchmarks
      – cd cloud-hadoop-benchmarks
    ⬢ TPC-DS 50GB:
      – ./tpcds-build.sh
      – ./tpcds-setup.sh 50
      – ./runSuite.pl tpcds 50
    ⬢ TPC-DS 100GB:
      – ./tpcds-build.sh (only needed once per cluster)
      – ./tpcds-setup.sh 100
      – ./runSuite.pl tpcds 100
  • 18. TPC-DS 50GB
  • 19. TPC-DS 100GB
  • 20. TeraGen & TeraSort
  • 21. TeraGen & TeraSort Commands
    ⬢ TeraGen/TeraSort 50GB
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 500000000 /user/cloudbreak/terasort-input
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output
    ⬢ TeraGen/TeraSort 100GB
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar teragen 1000000000 /user/cloudbreak/terasort-input
      – time hadoop jar /usr/hdp/2.6.4.5-2/hadoop-mapreduce/hadoop-mapreduce-examples-2.7.3.2.6.4.5-2.jar terasort /user/cloudbreak/terasort-input /user/cloudbreak/terasort-output
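
The row counts passed to teragen map onto the 50GB and 100GB labels because TeraGen writes fixed-size 100-byte rows. A minimal sketch of that conversion follows; the helper name `teragen_rows` is hypothetical and 1 GB is taken as 10^9 bytes.

```python
TERAGEN_ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(target_gb):
    """Number of rows to request from teragen for a target size in decimal GB."""
    if target_gb <= 0:
        raise ValueError("target size must be positive")
    return target_gb * 10**9 // TERAGEN_ROW_BYTES


for size_gb in (50, 100):
    print(f"{size_gb} GB -> teragen {teragen_rows(size_gb)}")
```

Running this prints 500000000 rows for 50 GB and 1000000000 rows for 100 GB, matching the arguments in the commands above.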
  • 22. TeraGen & TeraSort 50GB
  • 23. TeraGen & TeraSort 100GB
  • 24. Cost/Performance Analysis
  • 25. Hive Cost Per Query - 50GB
  • 26. Hive Cost Per Query - 100GB
  • 27. Hive Cost Total - 50GB
  • 28. Hive Cost Total - 100GB
  • 29. MapReduce Cost Per Job - 50GB
  • 30. MapReduce Cost Per Job - 100GB
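
The transcript does not preserve the charts or the exact formula behind "cost per query" and "cost per job". One plausible reading is the cluster's hourly cost (instances plus storage) multiplied by the job's runtime. The sketch below assumes that reading. All rates and the 900-second runtime are hypothetical placeholders, not AWS prices or measured results, and the function names are illustrative only.

```python
def cluster_hourly_cost(node_groups):
    """Sum hourly cost over node groups.

    Each group is (node_count, instance_rate_per_hour, storage_rate_per_hour),
    where the storage rate is the per-node hourly share of its attached volumes.
    """
    return sum(count * (instance_rate + storage_rate)
               for count, instance_rate, storage_rate in node_groups)


def cost_per_job(hourly_cost, runtime_seconds):
    """Cost attributed to one job: cluster hourly cost times runtime in hours."""
    if runtime_seconds < 0:
        raise ValueError("runtime must be non-negative")
    return hourly_cost * runtime_seconds / 3600.0


# Hypothetical rates in USD per node-hour (placeholders, not real pricing).
example_cluster = [
    (4, 0.80, 0.02),   # 3 master nodes + 1 client node
    (10, 0.20, 0.05),  # 10 worker nodes
]
hourly = cluster_hourly_cost(example_cluster)
job_cost = cost_per_job(hourly, 900)  # a hypothetical 15-minute job
print(f"cluster: ${hourly:.2f}/hour, job: ${job_cost:.3f}")
```

Under this model a pricier instance type comes out cheaper per job whenever its runtime reduction outweighs its rate increase, which is the trade-off the deck sets out to test.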
  • 31. Recommendations
  • 32. Instance Recommendations
    ⬢ Consider workload type
    ⬢ Consider performance SLAs
    ⬢ Hive
      – The m5.4xlarge instance type provides the best overall price-to-performance ratio.
    ⬢ MapReduce/YARN
      – The m5.xlarge is the cheapest instance type. It provides competitive performance for bigger data volumes.
      – The m5.4xlarge is faster and not much more expensive. If SLAs demand it, use the m5.4xlarge.
    ⬢ When in doubt, use the m5.4xlarge
    ⬢ d2.x and i3.x instance types
      – While the d2 and i3 instance types offer much faster I/O throughput, that doesn’t translate directly to Hive and MapReduce/YARN workloads.
      – Use with caution. Ephemeral storage is not suited for long-running clusters.
  • 33. Storage Recommendations
    ⬢ The st1 storage offers comparable performance to gp2 storage at half the storage cost for Hadoop workloads.
    ⬢ For master, compute and client servers, use gp2 storage.
    ⬢ For data nodes, use st1 storage.
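
To see what "half the storage cost" means at the scale of the tested clusters, the sketch below prices the data-node volumes from the hardware slide (10 nodes, 6 volumes each, 200 GB per volume). The two per-GB-month prices are hypothetical placeholders chosen only to reflect the 2:1 ratio the slide states. Current figures should come from the Amazon EBS pricing page.

```python
def monthly_storage_cost(nodes, volumes_per_node, volume_gb, price_per_gb_month):
    """Monthly cost of identical EBS volumes attached across a node group."""
    return nodes * volumes_per_node * volume_gb * price_per_gb_month


# Hypothetical prices in USD per GB-month, chosen to mirror the stated 2:1 ratio.
GP2_PRICE = 0.10
ST1_PRICE = 0.05

gp2_cost = monthly_storage_cost(10, 6, 200, GP2_PRICE)
st1_cost = monthly_storage_cost(10, 6, 200, ST1_PRICE)
print(f"gp2: ${gp2_cost:.2f}/month, st1: ${st1_cost:.2f}/month, "
      f"saving: ${gp2_cost - st1_cost:.2f}/month")
```

With these placeholder prices the 12,000 GB of data-node storage costs half as much on st1, and the saving grows linearly with the volume provisioned.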
  • 34. Sample Cluster Metrics
  • 35. Tuning
    ⬢ Never tune an environment for a benchmark and expect it to be optimal for your own workload, whatever that workload is.
    ⬢ There is room for tuning and optimization that could improve performance.
  • 36. Resources
    ⬢ Amazon EC2 Pricing: https://aws.amazon.com/ec2/pricing/on-demand/
    ⬢ Amazon EBS Pricing: https://aws.amazon.com/ebs/pricing/
    ⬢ Amazon EBS-Optimized Instances: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSOptimized.html
  • 37. Questions?
  • 38. Thank you!