Greenplum Analytics Workbench

Apurva Desai




© Copyright 2012 EMC Corporation. All rights reserved.            1
Overview




What is Hadoop?
• What is Hadoop?
        – Distributed computing paradigm
        – File system – HDFS
        – Processing framework – MapReduce
        – Languages – Pig, Hive
        – Key-value store – HBase
• Why is it important?
        – Big Data is everywhere
        – Big Data is mostly unstructured
        – Need affordable, scalable NoSQL processing
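
To make the MapReduce processing model concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases using word count. It is only an illustration in plain Python, not Hadoop code; the function names are hypothetical and nothing here reflects how the Workbench jobs were written.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all values by key and order the keys, as reducers expect."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run on different hosts and the shuffle moves data over the network, which is the stage the Mellanox UDA work later in this deck targets.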


Analytics Workbench - Motivation
• Open source
        – Hadoop industry is nascent
        – Big Data development needs scale

• Greenplum
        – Innovation & experimentation platform
        – Contribute to the community
        – GPDB & GPHD – mixed-mode environment




Greenplum Vision




Buildout Pre-requisites
• Hardware systems integration

• Hadoop experience

• Program management

• Partner ecosystem

Greenplum has in-house expertise

Team Introduction
System Integration
 – Greg, Eric, Don, Dave, Patrick

Program Management
 – Mike, Joe

Hadoop
 – Apurva, Judes, Clinton, Chandra, Ashwin




Partners
Intel
 – 2,000 Westmere CPUs

Mellanox
 – 1,000+ NICs
 – 72 IB switches

Micron
 – 6,000 8GB DRAM modules

Seagate
 – 12,000 2TB drives

Supermicro
 – 1,000 chassis/motherboards


Partners
Switch
 – Hosting facilities

VMware
 – Operational support
 – Rubicon




Peek @ the Cluster




Cluster Statistics
• Largest cluster for Apache Hadoop validation!

• Number of physical hosts: > 1,000 (> 10,000 with VMs)
• Number of racks: 54 (50 just for the DataNodes)
• Number of processors: > 24,000
• Amount of RAM: > 48TB
• Amount of disk capacity: > 24PB
        – “Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history”
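
As a quick sanity check, the headline figures can be reproduced from the partner hardware counts given earlier in the deck. The sketch below is a back-of-the-envelope calculation only: the six cores per CPU, two hardware threads per core, and the HDFS replication factor of 3 are assumptions that the slides do not state, and decimal units (1 PB = 1,000 TB) are assumed.

```python
# Hardware counts taken from the Partners slide.
DRIVES = 12_000          # Seagate drives
DRIVE_TB = 2             # capacity per drive, in TB
DIMMS = 6_000            # Micron DRAM modules
DIMM_GB = 8              # capacity per module, in GB
CPUS = 2_000             # Intel Westmere CPUs

# Assumptions, not stated in the slides.
CORES_PER_CPU = 6        # assumed six-core Westmere parts
THREADS_PER_CORE = 2     # assumed Hyper-Threading enabled


def raw_disk_pb() -> float:
    """Raw disk capacity in PB (decimal units)."""
    return DRIVES * DRIVE_TB / 1000


def ram_tb() -> float:
    """Total RAM in TB (decimal units)."""
    return DIMMS * DIMM_GB / 1000


def logical_processors() -> int:
    """Logical processors under the core and thread assumptions above."""
    return CPUS * CORES_PER_CPU * THREADS_PER_CORE


def usable_hdfs_pb(replication: int = 3) -> float:
    """Approximate usable HDFS capacity; replication=3 is an assumed setting."""
    return raw_disk_pb() / replication
```

Under these assumptions the counts line up with the lower bounds on the slide (24 PB of raw disk, 48 TB of RAM, 24,000 logical processors), while usable HDFS space would be closer to 8 PB with three-way replication.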



Namenode




Job Tracker




CPU




Use Cases




Hadoop Review




Hadoop Shuffle




Initial Use Cases
• Apache Hadoop Validation
• Mellanox UDA
• TeraSort Benchmark




Apache Hadoop Validation
• Purpose
        – Run Apache Hadoop validation at scale
        – Validate cluster configuration

• Various configurations validated
        – Standard out-of-the-box configs
        – Configs modified for I/O-intensive processing




Apache Hadoop Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation. Y-axis: execution time (min), 0 to 1.2. Series: 1000 Nodes.]




Apache Hadoop Findings
• Apache BigTop for integration tests
• Functional validation passed as expected

• Next Steps
        – Identify integration cases
        – Contribute back to BigTop
        – Stabilize Hadoop 0.23




Mellanox UDA - Overview
RDMA in the Hadoop shuffle stage
Register map and reduce task buffers
Hadoop JT for task completion
Copy sorted map task output to reduce input
Perform in-memory merge at the reduce
Avoid disk spills for large inputs
Reduce CPU load for sort & merge
GP + Mellanox collaboration
 – Open sourcing UDA
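
The in-memory merge at the reduce side can be pictured as a k-way merge over already-sorted map outputs. The sketch below illustrates only that merge idea in plain Python; it is not UDA's implementation, it involves no RDMA, and the record type and function name are hypothetical.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

# Hypothetical record shape: (key, value), each segment already sorted by key.
Record = Tuple[str, str]


def merge_map_outputs(segments: Iterable[List[Record]]) -> Iterator[Record]:
    """Lazily merge sorted map-output segments into one key-ordered stream.

    heapq.merge keeps only one pending record per segment in its heap, so
    the merged stream can feed the reducer without an intermediate sort
    or a spill of the combined data.
    """
    return heapq.merge(*segments, key=lambda record: record[0])


if __name__ == "__main__":
    map_task_1 = [("apple", "1"), ("cherry", "1")]
    map_task_2 = [("banana", "1"), ("date", "1")]
    for key, value in merge_map_outputs([map_task_1, map_task_2]):
        print(key, value)
```

Because each map task has already sorted its output, the reducer never needs a full sort, which is one reason the merge step can lower CPU load.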




Mellanox UDA Preliminary Results
• Preliminary UDA results provided by Mellanox
• Show improvement with UDA vs. vanilla Hadoop
• Better CPU utilization
• Reduced execution time

• Next Steps
        – Run on the Analytics Workbench, scheduled for June 2012
        – Configuration on the Workbench to turn it on/off




TeraSort Benchmark
• Industry-standard benchmark
• Good validation of configuration
• 3 Steps
        – TeraGen – Generate 1TB of data
        – TeraSort – Sort generated data
        – TeraValidate – Validate the sort
• Measure time for each step
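
The three steps and their per-step timing can be sketched as a small driver. This is a minimal sketch, not the script used on the Workbench: the jar path and HDFS directories are placeholders, and the command runner is passed in so the timing logic can be exercised without a cluster. `teragen`, `terasort`, and `teravalidate` are the program names in the Hadoop examples jar, and TeraGen writes 100-byte rows.

```python
import time
from typing import Callable, Dict, List

ROW_BYTES = 100  # TeraGen writes fixed 100-byte rows


def terasort_commands(total_bytes: int, jar: str, base_dir: str) -> Dict[str, List[str]]:
    """Build the three benchmark commands; jar and base_dir are placeholders."""
    rows = total_bytes // ROW_BYTES
    gen_dir = f"{base_dir}/gen"
    sort_dir = f"{base_dir}/sorted"
    report_dir = f"{base_dir}/report"
    return {
        "teragen": ["hadoop", "jar", jar, "teragen", str(rows), gen_dir],
        "terasort": ["hadoop", "jar", jar, "terasort", gen_dir, sort_dir],
        "teravalidate": ["hadoop", "jar", jar, "teravalidate", sort_dir, report_dir],
    }


def time_steps(commands: Dict[str, List[str]],
               run: Callable[[List[str]], None]) -> Dict[str, float]:
    """Run each step in order and record its wall-clock time in seconds."""
    timings: Dict[str, float] = {}
    for step, command in commands.items():
        start = time.perf_counter()
        run(command)
        timings[step] = time.perf_counter() - start
    return timings
```

On a cluster, the runner could be `lambda cmd: subprocess.run(cmd, check=True)`; for the 1TB case, `total_bytes=10**12` yields 10,000,000,000 rows.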




TeraSort Benchmark Preliminary Results
[Chart: Apache Hadoop-1.0.0 validation - TeraSort. Y-axis: execution time in sec, 0 to 9. X-axis: number of TB generated and sorted (1 TB, 10 TB). Series: TeraGen, TeraSort.]




TeraSort Benchmark Findings
• Minimal tuning of configuration
• Results are within expected range
• Next Steps
        – Tune the cluster for optimal performance
        – Use the benchmark for every new release




Lessons Learnt




Buildout Progress
[Chart: Buildout progress. Y-axis: number of nodes, 0 to 1,200. X-axis: month, Dec '11 to April '12. Series: racked, ready.]




“Real” Hadoop Cluster




Categories
• Racking & Stacking
• Networking
• Non-Hadoop Hosts
• Base OS Setup
• Hadoop Deployment
• Post-deployment
• Process




In Closing




Upcoming work
• Workbench Tasks
        – Load various data sets
        – Load GPDB, Hive, HBase, ZooKeeper, etc.
        – Load Chorus, Command Center, UAP stack
        – VM provisioning
        – Various audits
• On-boarding candidates
        – HD Education
        – Apache Hadoop Build & Validate
        – Mellanox UDA
        – Intel HiBench
        – Big data benchmarking
        – High-resolution image processing, etc.



A day in the life @ Switch




Q&A




Other Relevant Greenplum Sessions
Session                                                  Presenter          Times
Unified Analytics Platform Introduction                  Brian Wilson       Tues 10:00-11:00   Thurs 1:00-2:00
Greenplum Database Overview                              Michael Crutcher   Mon 8:30-9:30      Wed 10:00-11:00
Greenplum Hadoop Overview                                Susheel Kaushik    Mon 10:00-11:00    Wed 4:15-5:15
Greenplum DCA Overview                                   Hanxi Chen         Mon 4:00-5:00      Thurs 10:00-11:00
Greenplum Analytics Workbench                            Apurva Desai       Wed 8:30-9:30      Thurs 10:00-11:00
Analytics on Hadoop                                      Don Miner          Tues 11:30-12:30   Thurs 8:30-9:30
Optimizing Greenplum Database on VMware                  Kevin O’Leary      Mon 4:00-5:00      Tues 4:15-5:15
Virtualized Infrastructure
Big Data Driven Businesses in Action:                    Mike Maxey         Wed 4:15-5:15      Thurs 11:30-12:30
Creating Real Business Value Using
Greenplum UAP (Panel w/4 Customers)
Analytics for Business Value: Collaboration              Josh Klahr         Mon 10:00-11:00    Wed 2:45-3:45
Disruptive Data Science — How Data                       Annika Jimenez     Tues 4:15-5:15     Thurs 11:30-12:30
Science and Big Data are Transforming                    David Dietrich
Business, IT and People




Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 

Greenplum Analytics Workbench - What Can a Private Hadoop Cloud Do For You?

  • 1. Greenplum Analytics Workbench. Apurva Desai. © Copyright 2012 EMC Corporation. All rights reserved.
  • 2. Overview
  • 3. What is Hadoop?
      ▪ What is Hadoop?
          – Distributed computing paradigm
          – File system – HDFS
          – Processing framework – MapReduce
          – Languages – Pig, Hive
          – Key-value store – HBase
      ▪ Why is it important?
          – Big Data is everywhere
          – Big Data is mostly unstructured
          – Need affordable, scalable NoSQL processing
  • 4. Analytics Workbench - Motivation
      ▪ Open source
          – Hadoop industry is nascent
          – Big Data development needs scale
      ▪ Greenplum
          – Innovation & experimentation platform
          – Contribute to the community
          – GPDB & GPHD - mixed-mode environment
  • 5. Greenplum Vision
  • 6. Buildout Pre-requisites
      ▪ Hardware systems integration
      ▪ Hadoop experience
      ▪ Program management
      ▪ Partner ecosystem
      Greenplum has in-house expertise
  • 7. Team Introduction
      ▪ System Integration
          – Greg, Eric, Don, Dave, Patrick
      ▪ Program Management
          – Mike, Joe
      ▪ Hadoop
          – Apurva, Judes, Clinton, Chandra, Ashwin
  • 8. Partners
      ▪ Intel
          – 2,000 Westmere CPUs
      ▪ Mellanox
          – 1,000+ NICs
          – 72 IB switches
      ▪ Micron
          – 6,000 8GB DRAM
      ▪ Seagate
          – 12,000 2TB drives
      ▪ Supermicro
          – 1,000 chassis/MB
  • 9. Partners
      ▪ Switch
          – Hosting facilities
      ▪ VMware
          – Operational support
          – Rubicon
  • 10. Peek @ the Cluster
  • 11. Cluster Statistics
      Largest cluster for Apache Hadoop validation!
      ▪ Number of physical hosts: > 1,000 (> 10,000 with VMs)
      ▪ Number of racks: 54 (50 just for the DataNodes)
      ▪ Number of processors: > 24,000
      ▪ Amount of RAM: > 48TB
      ▪ Amount of disk capacity: > 24PB
          – “Equivalent to nearly half of the entire written works of mankind from the beginning of recorded history”
  • 12. Namenode
  • 13. Job Tracker
  • 14. CPU
  • 15. Use Cases
  • 16. Hadoop Review
  • 17. Hadoop Shuffle
  • 18. Initial Use Cases
      ▪ Apache Hadoop Validation
      ▪ Mellanox UDA
      ▪ Terasort Benchmark
  • 19. Apache Hadoop Validation
      ▪ Purpose
          – Run Apache Hadoop validation at scale
          – Validate cluster configuration
      ▪ Various configurations validated
          – Standard out-of-the-box configs
          – Configs modified for IO-intensive processing
  • 20. Apache Hadoop Preliminary Results
      [Chart: Apache Hadoop-1.0.0 validation; y-axis: Execution Time (Min); series: 1000 Nodes]
  • 21. Apache Hadoop Findings
      ▪ Apache BigTop for integration tests
      ▪ Functional validation passed as expected
      ▪ Next Steps
          – Identify integration cases
          – Contribute back to BigTop
          – Stabilize Hadoop 0.23
  • 22. Mellanox UDA - Overview
      ▪ RDMA in the Hadoop shuffle stage
      ▪ Register map & reduce task buffers
      ▪ Hadoop JobTracker (JT) for task completion
      ▪ Copy sorted map task output to reduce input
      ▪ Perform in-memory merge at the reducer
      ▪ Avoid disk spills for large inputs
      ▪ Reduce CPU load for sort & merge
      ▪ GP + Mellanox collaboration
          – Open sourcing UDA
  • 23. Mellanox UDA Preliminary Results
      ▪ Preliminary UDA results provided by Mellanox
      ▪ Show improvement with UDA vs. vanilla Hadoop
      ▪ Better CPU utilization
      ▪ Reduced execution time
      ▪ Next Steps
          – Run on Analytics Workbench, scheduled for June 2012
          – Configuration on the workbench to turn it on/off
  • 24. TeraSort Benchmark
      ▪ Industry-standard benchmark
      ▪ Good validation of configuration
      ▪ 3 steps
          – Teragen – Generate 1TB of data
          – Terasort – Sort generated data
          – Teravalidate – Validate the sort
      ▪ Measure time for each step
  • 25. TeraSort Benchmark Preliminary Results
      [Chart: Apache Hadoop-1.0.0 validation - TeraSort; y-axis: Execution Time in Sec; x-axis: # of TB Generated and Sorted (1 TB, 10 TB); series: TeraGen, TeraSort]
  • 26. TeraSort Benchmark Findings
      ▪ Minimal tuning of configuration
      ▪ Results are within expected range
      ▪ Next Steps
          – Tune the cluster for optimal performance
          – Use the benchmark for every new release
  • 27. Lessons Learnt
  • 28. Buildout Progress
      [Chart: y-axis: Number of nodes (0 to 1200); x-axis: Month (Dec '11 to April '12); series: racked, ready]
  • 29. “Real” Hadoop Cluster
  • 30. Categories
      ▪ Racking & Stacking
      ▪ Hadoop Deployment
      ▪ Networking
      ▪ Post deployment
      ▪ Non Hadoop Hosts
      ▪ Process
      ▪ Base OS Setup
  • 31. In Closing
  • 32. Upcoming work
      ▪ Workbench Tasks
          – Load various data sets
          – Load GPDB, Hive, HBase, ZooKeeper, etc.
          – Load Chorus, Command Center, UAP stack
          – VM provisioning
          – Various audits
      ▪ On-boarding candidates
          – HD Education
          – Apache Hadoop Build & Validate
          – Mellanox UDA
          – Intel HiBench – Big data benchmarking
          – High-resolution image processing, etc.
  • 33. A day in the life @ Switch
  • 34. Q&A
  • 35. Other Relevant Greenplum Sessions
      Session | Presenter | Times
      Unified Analytics Platform Introduction | Brian Wilson | Tues 10:00-11:00; Thurs 1:00-2:00
      Greenplum Database Overview | Michael Crutcher | Mon 8:30-9:30; Wed 10:00-11:00
      Greenplum Hadoop Overview | Susheel Kaushik | Mon 10:00-11:00; Wed 4:15-5:15
      Greenplum DCA Overview | Hanxi Chen | Mon 4:00-5:00; Thurs 10:00-11:00
      Greenplum Analytics Workbench | Apurva Desai | Wed 8:30-9:30; Thurs 10:00-11:00
      Analytics on Hadoop | Don Miner | Tues 11:30-12:30; Thurs 8:30-9:30
      Optimizing Greenplum Database on VMware Virtualized Infrastructure | Kevin O’Leary | Mon 4:00-5:00; Tues 4:15-5:15
      Big Data Driven Businesses in Action: Creating Real Business Value Using Greenplum UAP (Panel w/4 Customers) | Mike Maxey | Wed 4:15-5:15; Thurs 11:30-12:30
      Analytics for Business Value: Collaboration | Josh Klahr | Mon 10:00-11:00; Wed 2:45-3:45
      Disruptive Data Science: How Data Science and Big Data are Transforming Business, IT and People | Annika Jimenez, David Dietrich | Tues 4:15-5:15; Thurs 11:30-12:30
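
The three TeraSort steps from slide 24 (Teragen, Terasort, Teravalidate, with the time measured for each) could be driven by a small script along the following lines. This is only a sketch, not the tooling used on the Analytics Workbench: the jar name `hadoop-examples.jar`, the HDFS directory layout, and the helper names `rows_for`, `terasort_commands`, and `time_steps` are assumptions made for illustration. It relies on TeraGen writing 100-byte rows, so 1TB (taken here as 10^12 bytes) corresponds to 10^10 rows.

```python
import time

ROW_BYTES = 100  # TeraGen writes fixed-size 100-byte rows


def rows_for(size_bytes):
    """Number of TeraGen rows needed to produce size_bytes of data."""
    return size_bytes // ROW_BYTES


def terasort_commands(size_bytes, base_dir, examples_jar="hadoop-examples.jar"):
    """Build the three benchmark commands in order.

    examples_jar and the directory names under base_dir are hypothetical;
    adjust them to the Hadoop installation being tested.
    """
    gen = f"{base_dir}/gen"
    out = f"{base_dir}/sorted"
    report = f"{base_dir}/report"
    prefix = ["hadoop", "jar", examples_jar]
    return [
        ("teragen", prefix + ["teragen", str(rows_for(size_bytes)), gen]),
        ("terasort", prefix + ["terasort", gen, out]),
        ("teravalidate", prefix + ["teravalidate", out, report]),
    ]


def time_steps(commands, run):
    """Run each step with the supplied callable and record wall-clock seconds."""
    timings = {}
    for name, argv in commands:
        start = time.perf_counter()
        run(argv)
        timings[name] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for name, argv in terasort_commands(10**12, "/benchmarks/terasort"):
        print(name, " ".join(argv))
```

On a real cluster, passing `subprocess.check_call` as `run` would execute each step and stop at the first failure; passing `print` gives a dry run.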