Presentation for the sudoers Barcelona group, Oct 06 2015, on benchmarking Hadoop with the ALOJA open-source benchmarking platform. The presentation was mostly a live demo; these slides are posted for the people who could not attend.
http://lanyrd.com/2015/sudoers-barcelona-october/
5. Hadoop design
Hadoop is designed to process complex data
Structured and unstructured
With [close to] linear scalability
Simplifying the programming model
From MPI, OpenMP, CUDA, …
Operates as a blackbox for data analysts
Image source: Hadoop, the definitive guide
6. Hadoop parameters
100+ tunable parameters
mapred.map/reduce.tasks.speculative.execution
obscure and interrelated
io.sort.mb 100 (300)
io.sort.record.percent 5% (15%)
io.sort.spill.percent 80% (95 – 100%)
Number of Mappers and Reducers
Rule of thumb 0.5 - 2 per CPU core
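The rule of thumb above can be sketched as a quick shell calculation (a minimal sketch: the 0.5 and 2 per-core factors are the bounds from the slide, and the 8-core count is an assumed example; on a real node you would use `$(nproc)`):

```shell
# Hypothetical sizing helper: derive task counts from the
# "0.5 - 2 tasks per CPU core" rule of thumb.
CORES=8                   # assumed example node; use $(nproc) in practice
MAPPERS=$((CORES * 2))    # upper bound: 2 tasks per core
REDUCERS=$((CORES / 2))   # lower bound: 0.5 tasks per core
echo "mappers=$MAPPERS reducers=$REDUCERS"
```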
7. Hadoop stack for tuning
Image source: Intel® Distribution for Apache Hadoop
8. Hadoop highly-scalable but…
Not a high-performance solution!
Requires
Design,
Cluster topology
Setup,
OS, Hadoop config
and tuning required
Iterative approach
Time consuming
And extensive benchmarking!
9. Hadoop ecosystem
Large and spread out
Dominated by big players
Custom patches
Default values not ideal
Product claims
Cloud vs. On-premise
IaaS
PaaS
EMR, HDInsight
Needs standardization
and auditing!
11. Too many choices?
[Figure: options positioned along cost/performance and cloud/on-premise axes: remote volumes, rotational HDDs, JBODs, RAID, large vs. small VMs, GbEthernet vs. InfiniBand, replication, high availability]
And where is my system configuration positioned on each of these axes?
12. Project ALOJA
Open initiative to produce mechanisms for an
automated characterization of cost-effectiveness
of Big Data deployments
Results from a growing need of the community to
understand job execution details and create transparency
Explore different configuration deployment options and
their tradeoffs
Both software and hardware
Cloud services and on-premise
Seeks to provide knowledge, tools, and an online service
with which users can make better-informed decisions and
reduce the TCO of their Big Data infrastructures
Guide the future development and deployment of Big Data clusters
and applications
14. Challenges (circa end 2013)
Test different cluster architectures
On-premise
Commodity, high-end, appliance, low-power
Cloud IaaS
32 different VMs in Azure, similar in other
providers
Cloud PaaS
HDInsight, EMR, CloudBigData
Different access levels
Full admin, user-only, request-to-install,
everything ready, queuing systems (SGE)
Different versions
Hadoop, JVM, Spark, Hive, etc…
Dev environments and testing
Big Data usually requires a cluster to
develop and test
15. Benchmarking vs. Production envs
Need to compare different executions
Not how the system is doing right now
This is the main difference with production products
Data does not change (non-OLTP)
Temporary data for benchmarks vs. Important data
Fast iteration vs. Reliability
Iterates configurations vs. fixed config
Many fast, experimental changes
Security can be relaxed
Management for Hadoop
Vendor lock-in
Lack of system support (Azure, on-premise, low-power)
Hadoop is our use case, not the only one
Leave no traces on the benchmarked system
16. Available options: (circa end 2013)
Deployment
jclouds
foreman
Puppet
Ambari
Config and deploy
Ambari (hadoop only)
Use Configuration
Management (CM)
Puppet, chef, ansible…
Monitoring
Ganglia, Zabbix
Ambari
Cloudera Manager
Kibana, GraphD…
Problems
All systems designed for PROD
Not for comparison
No Azure support
Many different packages
No one-fits-all solution
Solution
Custom implementation
Based on simple components
Wrapping commands
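The "wrapping commands" idea can be sketched as below (hypothetical helper name; ALOJA's actual scripts differ): each benchmark step is a plain system command, wrapped to capture its output, exit status, and elapsed time, so no agent has to be installed on the benchmarked system.

```shell
# Hypothetical command wrapper: run one step, log its stdout/stderr
# to a file, and report exit status and wall-clock time.
run_step() {
  local name="$1"; shift
  local start end status
  start=$(date +%s)
  "$@" > "/tmp/${name}.log" 2>&1
  status=$?
  end=$(date +%s)
  echo "step=$name status=$status elapsed=$((end - start))s"
  return $status
}

# Example step: record the benchmarked host's name.
run_step check_host hostname
```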
17. ALOJA Platform main components
2 Online Repository
•Explore results
•Execution details
•Cluster details
•Costs
•Data sharing
3 Web Analytics
•Data views and evaluations
•Aggregates
•Abstracted Metrics
•Job characterization
•Machine Learning
•Predictions and clustering
1 Big Data Benchmarking
•Deploy & Provision
•Conf Management
•Parameter selection & Queuing
•Perf counters
•Low-level instrumentation
•App logs
Tech stack: NGINX, PHP, MySQL; BASH, Unix tools, CLIs; R, SQL, JS
18. Workflow in ALOJA
Cluster(s) definition
• VM sizes
• # nodes
• OS, disks
• Capabilities
Execution plan
• Start cluster
• Exec benchmarks
• Gather results
• Cleanup
Import data
• Convert perf metrics
• Parse logs
• Import into DB
Evaluate data
• Data views in Vagrant VM
• Or http://hadoop.bsc.es
PA and KD
• Predictive Analytics
• Knowledge Discovery
Historic repo (in progress)
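A minimal sketch of what an execution plan expands to (parameter values taken from the configuration slides later in the deck; the echo stands in for the actual benchmark invocation, which in ALOJA lives in run_benchs.sh):

```shell
# Hypothetical execution plan: enumerate every combination of the
# parameters under study; each line stands for one benchmark run.
for maps in 4 6 8 10; do
  for comp in none zlib bzip2 snappy; do
    echo "run: terasort maps=$maps compression=$comp"
  done
done
```

With 4 mapper counts and 4 compression codecs this expands to 16 runs, which is why benchmarking every combination quickly becomes expensive.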
21. Running benchmarks in ALOJA
Example of submitting a job to run:
https://github.com/Aloja/aloja/blob/master/aloja-bench/run_benchs.sh
To queue jobs and control results:
https://github.com/Aloja/aloja/blob/master/shell/exeq.sh
23. ALOJA Online Benchmark Repository
Entry point for exploring the results collected from the executions
Index of executions
Quick glance of executions
Searchable, sortable
Execution details
Performance charts and histograms
Hadoop counters
Jobs and task details
Data management of benchmark executions
Data importing from different clusters
Execution validation
Data management and backup
Cluster definitions
Cluster capabilities (resources)
Cluster costs
Sharing results
Download executions
Add external executions
Documentation and References
Papers, links, and feature documentation
Available at: http://aloja.bsc.es
24. Impact of SW configurations in Speedup
(4 node clusters)
[Figure, two panels: speedup by number of mappers (4m, 6m, 8m, 10m) and by compression algorithm (no compression, ZLIB, BZIP2, snappy); speedup, higher is better]
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
25. Impact of HW configurations in Speedup
[Figure, two panels: disks and network (HDD vs. SSD over Ethernet vs. InfiniBand) and cloud remote volumes (local only, 1-3 remote volumes, with or without /tmp local); speedup, higher is better]
Results using: http://hadoop.bsc.es/configimprovement
Details: https://raw.githubusercontent.com/Aloja/aloja/master/publications/BSC-MSR_ALOJA.pdf
26. Speedup: all disk configurations, SSD vs. JBOD
For DFSIOEread, DFSIOEwrite, and Terasort
URL:
http://hadoop.bsc.es/configimprovement?datefrom=&dateto=&benchs%5B%5D=dfsioe_read&benchs%5B%5D=dfsioe_write&benchs%5B%5D=terasort&id_clusters%5B%5D=21&nets%5B%5D=None&disks%5B%5D=HD2&disks%5B%5D=HD3&disks%5B%5D=HD4&disks%5B%5D=HD5&disks%5B%5D=HDD&disks%5B%5D=HS5&disks%5B%5D=RL1&disks%5B%5D=RL2&disks%5B%5D=RL3&disks%5B%5D=RL4&disks%5B%5D=RL5&disks%5B%5D=RL6&disks%5B%5D=RR1&disks%5B%5D=SS2&disks%5B%5D=SSD&mapss%5B%5D=None&comps%5B%5D=None&replications%5B%5D=None&blk_sizes%5B%5D=None&iosfs%5B%5D=None&iofilebufs%5B%5D=None&datanodess%5B%5D=None&bench_types%5B%5D=HDI&bench_types%5B%5D=HiBench&vm_sizes%5B%5D=None&vm_coress%5B%5D=None&vm_RAMs%5B%5D=None&hadoop_versions%5B%5D=None&types%5B%5D=None&filters%5B%5D=valid&filters%5B%5D=filters&allunchecked=
[Figure: speedup for disk configurations: 1-5 SATA (JBOD), 1-2 SSDs, and SATA with /tmp on SSD; higher is better. Annotations mark the fastest config (2 SSDs), a high-capacity and fast config, and high-capacity but slow configs]
27. Speedup by disk configuration in the Cloud
(higher is better)
URL:
http://104.130.159.92/configimprovement?benchs%5B%5D=terasort&disks%5B%5D=HDD&disks%5B%5D=RL1&disks%5B%5D=RL2&disks%5B%5D=RL3&disks%5B%5D=RR1&disks%5B%5D=RR2&disks%5B%5D=RR3&disks%5B%5D=RR4&disks%5B%5D=RR5&disks%5B%5D=RR6&disks%5B%5D=RS1&disks%5B%5D=RS6&disks%5B%5D=SSD&bench_types%5B%5D=HiBench&filters%5B%5D=valid&filters%5B%5D=filters&allunchecked=&selected-groups=disk&datefrom=&dateto=&minexetime=150&maxexetime=1500
[Figure: speedup for 1-6 remote volumes, for 1 and 6 remotes with /tmp on SSD, and for SSD only; higher is better]
29. Preview: Cost/Performance Scalability
This shows a sample of a new screen (with sample data) to find the most cost-effective cluster size
X axis: number of datanodes (cluster size)
Left Y: execution time (lower is better)
Right Y: execution cost
[Figure: execution time and execution cost curves, with the recommended size marked]
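A toy version of the cost model behind such a chart (all numbers are assumptions for illustration, not ALOJA data): execution cost = execution time x number of nodes x hourly price per node, so cost can rise again even while execution time keeps falling.

```shell
# Hypothetical cost model: time falls with cluster size, but total
# cost (hours x nodes x price) can rise again past some point.
price_per_node_hour=1                 # assumed price, arbitrary units
for spec in "4:12" "8:5" "16:4"; do   # nodes:hours (made-up run times)
  nodes=${spec%%:*}; hours=${spec##*:}
  cost=$((hours * nodes * price_per_node_hour))
  echo "nodes=$nodes hours=$hours cost=$cost"
done
```

In this made-up series, 16 nodes is fastest but 8 nodes is cheapest, which is exactly the trade-off the recommended-size marker captures.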
31. Open questions:
Is BASH good enough?
PROs, CONs, and alternatives
PROs:
Simple and fast
Well known (the basics, at least)
Easy to hack
Most of the work requires running system commands
CONs:
Custom implementation problems
Missing some systems
Too simple: missing objects, inheritance, types, data structures, testing
Alternatives: Python? Perl? Puppet? Ansible?
We'll stick to bash for now…
What’s missing for
incubating in Apache?
32. More info:
ALOJA Benchmarking platform and online repository
http://aloja.bsc.es
Benchmarking Big Data by Nicolas Poggi
http://www.slideshare.net/ni_po/benchmarking-hadoop
Big Data Benchmarking Community (BDBC) mailing list
(~200 members from ~80 organizations)
http://clds.sdsc.edu/bdbc/community
Workshop Big Data Benchmarking (WBDB)
Next: http://clds.sdsc.edu/wbdb2015.ca
SPEC Research Big Data working group
http://research.spec.org/working-groups/big-data-working-group.html
Slides and video:
Michael Frank on Big Data benchmarking
http://www.tele-task.de/archive/podcast/20430/
Tilmann Rabl: Big Data Benchmarking Tutorial
http://www.slideshare.net/tilmann_rabl/ieee2014-tutorialbarurabl