Deciding on the deployment model is critical when enterprises adopt Hadoop. Initially, the bare metal model (an on-premise cluster of physical servers) was popular because it avoids the I/O overhead of virtualized environments. These days, however, the cloud is also a contending option, with compelling cost savings and ease of operation. To aid in assessing the deployment options, Accenture Technology Labs developed the Accenture Data Platform Benchmark suite and a total cost of ownership (TCO) model, then tuned and compared the performance of bare metal Hadoop clusters and a Hadoop cloud service. Interestingly enough, the study discovered that price/performance ratio is not a critical factor in making a Hadoop deployment decision. Employing empirical and systemic analyses, the study found comparable price/performance ratios for bare metal Hadoop clusters and Hadoop-as-a-Service. Moreover, with cheaper purchasing options (e.g., long-term contracts), the cloud service provides a better ratio than bare metal in many cases. This result debunks the idea that the cloud is not suitable for Hadoop MapReduce workloads because of their heavy I/O requirements. Furthermore, the study finds that the Hadoop default configuration leaves ample headroom for performance tuning, and that the cloud infrastructure opens up even further tuning opportunities.
Introduction: Michael Wendt, R&D Developer in the Data Insights R&D Group at ATL
Accenture Technology Labs: the forward-looking R&D group of Accenture, in San Jose and 4 other locations globally
When enterprises decide to adopt Hadoop, they have to answer the question: where to deploy Hadoop, bare metal or cloud?
Four main deployment models for businesses:
- On-premise full custom: purchase commodity hardware, install software and operate it themselves -> gives businesses full control of the Hadoop cluster
- Hadoop appliance: preconfigured Hadoop cluster -> bypass detailed technical configuration and jumpstart data analysis
Transitioning outside of the corporation…
- Hadoop hosting: similar to ISP model -> rely on a service provider to deploy and operate Hadoop clusters
- Hadoop-as-a-Service: instant access to Hadoop clusters, pay-per-use consumption model -> provides greater business agility
Deciding which deployment model is appropriate depends on the five key areas below:
- Price-Performance Ratio: with a limited budget, how can we get the biggest ROI?
-- BM: requires a larger upfront investment, limiting scale
-- CL: can scale with demand
- Data Privacy: concerns with corporate data
-- BM: security, contains all data in-house
-- CL: need for comprehensive cloud-data privacy strategy
- Data Gravity: once data volume grows, physical migration becomes slow -> locked into current platform
-- need to consider portability, future growth and location of data
- Data Enrichment: leveraging multiple datasets to uncover new insights, determining where to host, co-locate data
- Productivity: ability to test ideas, "sandbox", deploy to production
-- CL: advantage for deploying test clusters
For this study we focus on the extreme ends of the spectrum: On-premise & HaaS
Dive deeper into Price-Performance Ratio
Price-Performance Ratio has two divergent views for Hadoop:
1. Virtualized Hadoop cluster is slower because Hadoop's workload has intensive I/O operations
2. Cloud-based model provides compelling cost savings: nodes are less expensive; Hadoop is horizontally scalable
In the Hadoop Deployment Comparison Study, we compare the price-performance ratio of a bare-metal Hadoop cluster with Hadoop-as-a-Service at the matched total cost of ownership (TCO) level, using real-world applications modeled by the Accenture Data Platform Benchmark
Let's first take a look at the TCO analysis
*3 times replication factor
Server hardware: depreciation accounted for over 3 years; full details in white paper
Data center: tier-3 data center, 10,000 sq. ft; full details in white paper
Tech support: third-party vendors
Staff: 3 full-time employees
Staff: one full-time employee; reduced need
Tech support: AWS Premium Support
Different needs based on cloud environment; no need for data center
Storage services: Amazon S3
No need for servers, only virtual instances of Hadoop service: Amazon EMR
Subtracted from budget to determine number of affordable instances
Calculated the
Time and cost prohibitive to test all 42 combinations
Selected these three instance types since they were the largest of their respective instance families
Assumed 50% utilization
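The budget-matching step described in these notes can be sketched roughly as follows. This is a minimal illustration, not the study's actual TCO model: the function name, the 730-hour month, and reading "50% utilization" as the fraction of hours the instances are billed for are all assumptions, and any dollar figures passed in are placeholders rather than numbers from the white paper.

```python
import math

HOURS_PER_MONTH = 730  # assumed average number of hours in a month


def affordable_instances(monthly_budget, staff_cost, support_cost,
                         storage_cost, hourly_instance_price,
                         utilization=0.5):
    """Return how many cloud instances the remaining budget can pay for.

    Fixed monthly costs (staff, tech support, storage) are subtracted
    from the budget first; what is left is divided by the monthly cost
    of one instance, billed only for the utilized fraction of hours
    (an assumed interpretation of the 50% utilization figure).
    """
    remaining = monthly_budget - staff_cost - support_cost - storage_cost
    if remaining <= 0:
        return 0
    cost_per_instance = hourly_instance_price * HOURS_PER_MONTH * utilization
    return math.floor(remaining / cost_per_instance)


# Hypothetical example: 20,000 budget, 10,000 in fixed costs,
# instances priced at 1.00 per hour.
count = affordable_instances(20000, 8000, 1000, 1000, 1.0)
```

Repeating the calculation for each candidate instance type and purchasing option gives the cluster sizes that match the bare-metal TCO.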
Now let's look at the Accenture Data Platform Benchmark
Sessionization: Constructing session from raw log data. One of several prerequisite steps for log analysis use cases (individual website optimization, infrastructure optimization, security analytics, etc.).
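As a rough single-machine illustration of what sessionization does (not the benchmark's MapReduce implementation), the sketch below groups log events by user and starts a new session whenever the gap between consecutive events exceeds a timeout. The `(user_id, timestamp)` event shape and the 30-minute timeout are assumptions made for the example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # hypothetical inactivity timeout


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user_id, timestamp) events into per-user sessions.

    A new session starts when the time since the user's previous event
    exceeds `gap` seconds. Returns {user_id: [[ts, ...], ...]}.
    """
    # Collect timestamps per user (the role the shuffle plays in MapReduce).
    by_user = defaultdict(list)
    for user_id, ts in events:
        by_user[user_id].append(ts)

    sessions = {}
    for user_id, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        # Walk consecutive pairs and split wherever the gap is too large.
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > gap:
                user_sessions.append([cur])
            else:
                user_sessions[-1].append(cur)
        sessions[user_id] = user_sessions
    return sessions


log = [("a", 0), ("b", 50), ("a", 100), ("a", 5000)]
result = sessionize(log)
```

In a Hadoop job the grouping by user would typically happen in the shuffle and the gap-splitting loop in the reducer, which is why the workload is sensitive to sort and I/O performance.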
Filtering algorithms: basic and simple, yet widely used.
*3 TB compressed
Experiment setup, how did everything come together?
Let's switch gears…
8x improvement relative to default parameter settings
Each iteration took about ½ to 1 full day, including performance analysis, tuning, and execution
The merit of Starfish is that it achieves performance increases at much lower cost than manual tuning.