Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?
1. Greenplum Analytics Workbench
APURVA DESAI
© Copyright 2012 EMC Corporation. All rights reserved. 1
3. What is Hadoop?
What is Hadoop?
– Distributed computing paradigm
– File system – HDFS
– Processing framework – MapReduce
– Languages – Pig, Hive
– Key Value Store – HBase
Why is it important?
– BIG Data is everywhere
– BIG Data is mostly unstructured
– Need affordable, scalable NoSQL processing
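To make the MapReduce item on this slide concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases, using word counting as the example. It is an illustration only and does not use the Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical and chosen for this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values by key, as the framework does between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Chain the three phases in one process (hypothetical helper)."""
    return reduce_phase(shuffle(map_phase(lines)))


if __name__ == "__main__":
    print(word_count(["big data", "Big cluster"]))
```

On a real cluster the map and reduce functions run on many hosts in parallel and the shuffle moves data across the network; this sketch only shows the data flow.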
4. Analytics Workbench - Motivation
Open source
– Hadoop industry is nascent
– BIG Data development needs scale
Greenplum
– Innovation & Experimentation platform
– Contribute to the community
– GPDB & GPHD - Mixed mode environment
6. Buildout Pre-requisites
Hardware systems integration
Hadoop experience
Program Management
Partner ecosystem
Greenplum has in-house expertise
7. Team Introduction
System Integration
– Greg, Eric, Don, Dave, Patrick
Program Management
– Mike, Joe
Hadoop
– Apurva, Judes, Clinton, Chandra, Ashwin
8. Partners
Intel
– 2000 Westmere CPUs
Mellanox
– 1,000+ NICs
– 72 IB switches
Micron
– 6,000 8GB DRAM
Seagate
– 12,000 2TB Drives
Supermicro
– 1,000 Chassis/MB
9. Partners
Switch
– Hosting Facilities
VMware
– Operational Support
– Rubicon
10. Peek @ the Cluster
11. Cluster Statistics
Largest cluster for Apache Hadoop validation!
# Of Physical Hosts : > 1,000 (> 10,000 with VMs)
# Of Racks : 54 (50 just for the DataNodes)
# Of Processors : > 24,000
Amount Of RAM : > 48TB
Amount of Disk Capacity : > 24PB
– “Equivalent to nearly half of the entire written works of
mankind from the beginning of recorded history”
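The RAM and disk totals on this slide can be cross-checked against the component counts on the Partners slide. The sketch below does that arithmetic. It assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB) and, for the usable-capacity figure, the HDFS default replication factor of 3; neither assumption is stated in the deck.

```python
# Component counts taken from the Partners slide.
CPUS = 2_000
DIMMS = 6_000
DIMM_GB = 8
DRIVES = 12_000
DRIVE_TB = 2

# Lower bound reported on the Cluster Statistics slide.
REPORTED_PROCESSORS = 24_000

# Assumption: HDFS default replication factor (not stated in the deck).
HDFS_REPLICATION = 3


def total_ram_tb() -> float:
    """Total DRAM in TB, assuming 1 TB = 1,000 GB."""
    return DIMMS * DIMM_GB / 1_000


def raw_disk_pb() -> float:
    """Raw disk capacity in PB, assuming 1 PB = 1,000 TB."""
    return DRIVES * DRIVE_TB / 1_000


def processors_per_cpu() -> float:
    """Reported processor count divided by the number of CPU packages."""
    return REPORTED_PROCESSORS / CPUS


def usable_hdfs_pb() -> float:
    """Capacity left for unique data if every block is stored 3 times."""
    return raw_disk_pb() / HDFS_REPLICATION


if __name__ == "__main__":
    print(f"RAM: {total_ram_tb()} TB")
    print(f"Raw disk: {raw_disk_pb()} PB")
    print(f"Processors per CPU: {processors_per_cpu()}")
    print(f"Usable HDFS capacity at 3x replication: {usable_hdfs_pb()} PB")
```

The component counts reproduce the 48 TB and 24 PB figures, and the processor count works out to 12 per CPU package, which suggests the slide counts something finer-grained than physical CPUs.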
18. Initial Use Cases
Apache Hadoop Validation
Mellanox UDA
Terasort Benchmark
19. Apache Hadoop Validation
Purpose
– Run Apache Hadoop Validation at Scale
– Validate cluster configuration
Various Configurations Validated
– Standard Out Of The Box Configs
– Configs Modified For IO Intensive Processing
20. Apache Hadoop Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation. Y-axis: Execution Time (Min), 0 to 1.2. Series: 1000 Nodes.]
21. Apache Hadoop Findings
Apache BigTop for integration tests
Functional validation passed as expected
Next Steps
– Identify integration cases
– Contribute back to BigTop
– Stabilize Hadoop 0.23
22. Mellanox UDA - Overview
RDMA in Hadoop Shuffle stage
Register Map & Reduce task buffer
Hadoop JobTracker (JT) for task completion
Copy sorted map task output to reduce input
Perform in-memory merge at reduce
Avoid disk spills for large inputs
Reduce CPU load for sort & merge
GP + Mellanox collaboration
– Open Sourcing UDA
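The "in-memory merge at reduce" step can be pictured as a k-way merge of map outputs that are each already sorted by key. The sketch below shows that idea with Python's standard `heapq.merge`. It is not Mellanox UDA code and does not involve RDMA; the `Record` type and the `merge_sorted_runs` function are hypothetical names used only for illustration.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# A (key, value) pair as it might arrive from one map task (illustrative type).
Record = Tuple[str, str]


def merge_sorted_runs(runs: Iterable[Iterable[Record]]) -> Iterator[Record]:
    """Lazily merge several key-sorted runs into one key-sorted stream.

    Each run must already be sorted by key, as map task output is.
    heapq.merge keeps only one pending record per run in its heap,
    so the merge itself never writes intermediate files.
    """
    return heapq.merge(*runs, key=lambda record: record[0])


if __name__ == "__main__":
    map_output_a = [("apple", "1"), ("dog", "4")]
    map_output_b = [("bird", "2"), ("cat", "3")]
    for key, value in merge_sorted_runs([map_output_a, map_output_b]):
        print(key, value)
```

Because every input run is already sorted, the merge does no further sorting of its own, which is consistent with the slide's point about reducing CPU load for sort and merge.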
23. Mellanox UDA Preliminary Results
Preliminary UDA results provided by Mellanox
Show improvement with UDA vs Vanilla Hadoop.
Better CPU utilization
Reduced execution time
Next Steps
– Run on Analytics Workbench, scheduled for June 2012
– Configuration on the workbench to turn it on/off
24. TeraSort Benchmark
Industry standard benchmark
Good validation of configuration
3 Steps
– Teragen – Generate 1TB of data
– Terasort – Sort generated data
– Teravalidate – Validate the sort
Measure time for each step
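The three steps and the per-step timing can be sketched as follows. TeraGen writes fixed 100-byte rows, so 1 TB corresponds to 10 billion rows. The jar name `hadoop-examples-1.0.0.jar` and the HDFS base directory `/benchmarks/terasort` are assumptions for illustration; adjust both to the actual installation. The default run only prints the commands and does not invoke Hadoop.

```python
import time
from typing import Callable, Dict, List

ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def teragen_rows(terabytes: int) -> int:
    """Number of rows TeraGen must write for the given size (1 TB = 10**12 bytes)."""
    return terabytes * 10**12 // ROW_BYTES


def build_commands(
    terabytes: int,
    jar: str = "hadoop-examples-1.0.0.jar",   # assumed jar name
    base_dir: str = "/benchmarks/terasort",   # hypothetical HDFS path
) -> Dict[str, List[str]]:
    """Build the argv for each benchmark step, in execution order."""
    gen_dir = f"{base_dir}/gen"
    sort_dir = f"{base_dir}/sorted"
    report_dir = f"{base_dir}/report"
    prefix = ["hadoop", "jar", jar]
    return {
        "teragen": prefix + ["teragen", str(teragen_rows(terabytes)), gen_dir],
        "terasort": prefix + ["terasort", gen_dir, sort_dir],
        "teravalidate": prefix + ["teravalidate", sort_dir, report_dir],
    }


def time_steps(
    commands: Dict[str, List[str]],
    run: Callable[[List[str]], object],
) -> Dict[str, float]:
    """Run each step through `run` and record wall-clock seconds per step."""
    timings: Dict[str, float] = {}
    for name, argv in commands.items():
        start = time.perf_counter()
        run(argv)
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    time_steps(build_commands(1), lambda argv: print(" ".join(argv)))
```

On a host with a configured Hadoop client, passing `subprocess.check_call` as the `run` argument would execute the steps for real and return the time taken by each.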
25. TeraSort Benchmark Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation - TeraSort. Y-axis: Execution Time in Sec. X-axis: # of TB Generated and Sorted (1 TB, 10 TB). Series: TeraGen, TeraSort.]
26. TeraSort Benchmark Findings
Minimal tuning of configuration
Results are within expected range.
Next Steps
– Tune the cluster for optimal performance
– Use the benchmark for every new release
28. Buildout Progress
[Chart: Number of nodes by month, Dec '11 to April '12. Y-axis: Number of nodes, 0 to 1200. X-axis: Month. Legend: racked, ready.]
30. Categories
Racking & Stacking
Networking
Non Hadoop Hosts
Base OS Setup
Hadoop Deployment
Post deployment
Process
32. Upcoming work
Workbench Tasks
– Load various data sets
– Load GPDB, Hive, HBase, ZooKeeper, etc.
– Load Chorus, Command Center, UAP stack
– VM provisioning
– Various audits
On-boarding candidates
– HD Education
– Apache Hadoop Build & Validate
– Mellanox UDA
– Intel HiBench
– Big data benchmarking
– High-resolution image processing, etc.
33. A day in the life @ Switch
35. Other Relevant Greenplum Sessions
Session | Presenter | Times
Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00, Thurs 1:00-2:00
Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30, Wed 10:00-11:00
Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00, Wed 4:15-5:15
Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00, Thurs 10:00-11:00
Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30, Thurs 10:00-11:00
Analytics on Hadoop | Don Miner | Tues 11:30-12:30, Thurs 8:30-9:30
Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O’Leary | Mon 4:00-5:00, Tues 4:15-5:15
Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15, Thurs 11:30-12:30
Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00, Wed 2:45-3:45
Disruptive Data Science - How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15, Thurs 11:30-12:30