Elastic, Multi-tenant
Hadoop on Demand
Richard McDougall,
Chief Architect, Application Infrastructure and Big Data, VMware, Inc
@richardmcdougll
ApacheCon Europe, 2012

http://www.vmware.com/hadoop
http://cto.vmware.com/
http://projectserengeti.org
http://github.com/vmware-serengeti

                                                                         © 2009 VMware Inc. All rights reserved
Broad Application of Hadoop technology
Horizontal Use Cases
•  Log processing / click stream analytics
•  Machine learning / sophisticated data mining
•  Web crawling / text processing
•  Extract Transform Load (ETL) replacement
•  Image / XML message processing
•  General archiving / compliance

Vertical Use Cases
•  Financial Services
•  Internet Retailer
•  Pharmaceutical / Drug Discovery
•  Mobile / Telecom
•  Scientific Research
•  Social Media

Hadoop’s ability to handle large unstructured data affordably and efficiently makes
 it a valuable tool kit for enterprises across a number of applications and fields.
How does Hadoop enable parallel processing?
•  A framework for distributed processing of large data sets across clusters of computers using a simple programming model.




                                Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
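As a rough illustration of the "simple programming model" (this is not code from the talk), the sketch below imitates the map, shuffle and reduce phases in a single Python process, using word count as the job. All function names are illustrative assumptions; real Hadoop jobs are written against the MapReduce API and run in parallel across many hosts.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data big clusters", "data locality"]))
```

Because each call to `map_phase` depends only on its own line, and each call to `reduce_phase` only on its own key, both phases can be spread over many machines without coordination between tasks.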
Hadoop System Architecture
•  MapReduce: Programming framework for highly parallel data processing
•  Hadoop Distributed File System (HDFS): Distributed data storage
Job Tracker Schedules Tasks Where the Data Resides
[Diagram: a Job is submitted to the Job Tracker. The Input File is divided into Split 1, Split 2 and Split 3 (64 MB each). Host 1, Host 2 and Host 3 each run a Task Tracker and a DataNode. Task 1, Task 2 and Task 3 run on Host 1, Host 2 and Host 3, whose DataNodes store Block 1, Block 2 and Block 3 (64 MB each) of the Input File.]
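The scheduling idea on this slide can be sketched in a few lines. This is a simplified model under stated assumptions, not the JobTracker's actual algorithm: each host is given a single task slot, and the helper names `make_splits` and `schedule` are hypothetical.

```python
SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB, the split size shown on the slide


def make_splits(file_size, split_size=SPLIT_SIZE):
    """Return (offset, length) pairs that cover the whole file."""
    return [(offset, min(split_size, file_size - offset))
            for offset in range(0, file_size, split_size)]


def schedule(splits, block_locations, free_hosts):
    """Assign each split to a host, preferring one that stores its block.

    block_locations maps a split index to the hosts holding that block.
    Each host runs at most one task here (a simplifying assumption).
    Splits are left unassigned once no free host remains.
    """
    assignment = {}
    free = list(free_hosts)
    for index, _ in enumerate(splits):
        local = [h for h in block_locations.get(index, []) if h in free]
        if local:
            host = local[0]      # data-local placement
        elif free:
            host = free[0]       # fall back to any free host
        else:
            break
        assignment[index] = host
        free.remove(host)
    return assignment


if __name__ == "__main__":
    splits = make_splits(3 * SPLIT_SIZE)
    locations = {0: ["host1"], 1: ["host2"], 2: ["host3"]}
    print(schedule(splits, locations, ["host1", "host2", "host3"]))
```

With one block per host, as in the diagram, every task lands on the host that already stores its input, so no block has to cross the network before processing starts.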
Hadoop Distributed File System
Hadoop Data Locality and Replication
The Right Big Data Tools for the Right Job…
•  Real Time Streams (Social, sensors)
•  Real-Time Processing (s4, storm, spark)
•  Machine Learning (Mahout, etc…)
•  Data Visualization (Excel, Tableau)
•  ETL (Informatica, Talend, Spring Integration)
•  Real Time Database (Shark, Gemfire, hBase, Cassandra)
•  Interactive Analytics (Impala, Greenplum, AsterData, Netezza…)
•  Batch Processing (Map-Reduce), HIVE
•  Structured and Unstructured Data (HDFS, MAPR)
•  Cloud Infrastructure: Compute, Storage, Networking
So yes, there’s a lot more than just Map-Reduce…

Compute layer
•  Hadoop batch analysis
•  HBase real-time queries
•  Big SQL – Impala
•  NoSQL – Cassandra, Mongo, etc
•  Other: Spark, Shark, Solr, Platfora, etc…
Data layer
•  HDFS
Some sort of distributed, resource management OS + Filesystem
Host | Host | Host | Host | Host | Host | Host
Elasticity Enables Sharing of Resources
Containers with Isolation are a Tried and Tested Approach
[Diagram: Hungry Workload 1, Reckless Workload 2 and Sneaky Workload 3, each in its own container]
Some sort of distributed, resource management OS + Filesystem
Host | Host | Host | Host | Host | Host | Host
Mixing Workloads: Three big types of Isolation are Required
•  Resource Isolation
  -  Control the greedy noisy neighbor
  -  Reserve resources to meet needs
•  Version Isolation
  -  Allow concurrent OS, App, Distro versions
•  Security Isolation
  -  Provide privacy between users/groups
  -  Runtime and data privacy required

Some sort of distributed, resource management OS + Filesystem
Host | Host | Host | Host | Host | Host | Host
Community activity in Isolation and Resource Management
•  YARN
  -  Goal: Support workloads other than M-R on Hadoop
  -  Initial need is for MPI/M-R from Yahoo
  -  Not quite ready for prime-time yet?
  -  Non-POSIX file system self-selects workload types
•  Mesos
  -  Distributed Resource Broker
  -  Mixed Workloads with some RM
  -  Active project, in use at Twitter
  -  Leverages OS Virtualization – e.g. cgroups
•  Virtualization
  -  Virtual machine as the primary isolation, resource management and versioned deployment container
  -  Basis for Project Serengeti
Project Serengeti – Hadoop on Virtualization

Simple to Operate
•  Rapid deployment
•  Unified operations across enterprise
•  Easy Clone of Cluster

Highly Available
•  No more single point of failure
•  One click to setup
•  High availability for MR Jobs

Elastic Scaling
•  Shrink and expand cluster on demand
•  Resource Guarantee
•  Independent scaling of Compute and data



Serengeti is an Open Source Project to automate deployment
of Hadoop on virtual platforms

http://projectserengeti.org
http://github.com/vmware-serengeti
Common Infrastructure for Big Data
[Diagram: separate Hadoop, HBase and MPP DB clusters on the left; MPP DB, HBase and Hadoop consolidated on a shared Virtualization Platform on the right]

Cluster Sprawling
Single purpose clusters for various business applications lead to cluster sprawl.

Cluster Consolidation
•  Simplify
  -  Single Hardware Infrastructure
  -  Unified operations
•  Optimize
  -  Shared Resources = higher utilization
  -  Elastic resources = faster on-demand access
Evolution of Hadoop on VMs

Hadoop in VM
[Diagram: one VM per slave node, running current Hadoop with combined storage/compute]
-  VM lifecycle determined by Datanode
-  Limited elasticity
-  Limited to Hadoop Multi-Tenancy

Separate Storage
[Diagram: a compute VM and a separate storage VM]
-  Separate compute from data
-  Elastic compute
-  Enable shared workloads
-  Raise utilization

Separate Compute Clusters
[Diagram: compute VMs T1 and T2 alongside a storage VM]
-  Separate virtual clusters per tenant
-  Stronger VM-grade security and resource isolation
-  Enable deployment of multiple Hadoop runtime versions
In-house Hadoop as a Service “Enterprise EMR” – (Hadoop + Hadoop)

Compute layer:  Ad hoc data mining | Production recommendation engine | Production ETL of log files
Data layer:     HDFS | HDFS
Virtualization platform
Host | Host | Host | Host | Host | Host
Integrated Hadoop and Webapps – (Hadoop + Other Workloads)

Compute layer:  Short-lived Hadoop compute cluster | Hadoop compute cluster | Web servers for ecommerce site
Data layer:     HDFS
Virtualization platform
Host | Host | Host | Host | Host | Host
Integrated Big Data Production – (Hadoop + other big data)

Compute layer
•  Hadoop batch analysis
•  HBase real-time queries
•  Big SQL – Impala
•  NoSQL – Cassandra, Mongo, etc
•  Other: Spark, Shark, Solr, Platfora, etc…
Data layer
•  HDFS
Virtualization
Host | Host | Host | Host | Host | Host | Host
Deploy a Hadoop Cluster in under 30 Minutes
Step 1: Deploy Serengeti virtual appliance on vSphere.
  •  Deploy vHelperOVF to vSphere
Step 2: A few simple commands to stand up Hadoop Cluster.
  •  Select compute, memory, storage and network
  •  Select configuration template
  •  Automate deployment
  •  Done
A Tour Through Serengeti

$ ssh serengeti@serengeti-vm

$ serengeti

serengeti>
A Tour Through Serengeti
serengeti> cluster create --name dcsep

serengeti> cluster list

name: dcsep, distro: apache, status: RUNNING
  NAME     ROLES                                  INSTANCE  CPU  MEM(MB)  TYPE
  -------------------------------------------------------------------------------
  master   [hadoop_namenode, hadoop_jobtracker]   1         6    2048     LOCAL  10
  data     [hadoop_datanode]                      1         2    1024     LOCAL  10
  compute  [hadoop_tasktracker]                   8         2    1024     LOCAL  10
  client   [hadoop_client, pig, hive]             1         1    3748     LOCAL  10
Serengeti Spec File
Callouts: "distro" is the choice of distro, "ha" is the HA option, and "storage" is the choice of shared storage or local disk.
{
  "distro": "apache",
  "nodeGroups": [
    {
      "name": "master",
      "roles": [
        "hadoop_namenode",
        "hadoop_jobtracker"
      ],
      "instanceNum": 1,
      "instanceType": "MEDIUM",
      "ha": true
    },
    {
      "name": "worker",
      "roles": [
        "hadoop_datanode", "hadoop_tasktracker"
      ],
      "instanceNum": 5,
      "instanceType": "SMALL",
      "storage": {
        "type": "LOCAL",
        "sizeGB": 10
      }
    }
  ]
}
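Before handing a spec file to the CLI, it can help to sanity-check what it will create. The sketch below is a hypothetical helper, not part of Serengeti: it assumes the node groups sit under a "nodeGroups" key and reuses the field names shown on the slide ("instanceNum", "storage", "sizeGB").

```python
import json

# Example spec mirroring the slide; the "nodeGroups" wrapper key is an assumption.
SPEC = """
{
  "distro": "apache",
  "nodeGroups": [
    {"name": "master",
     "roles": ["hadoop_namenode", "hadoop_jobtracker"],
     "instanceNum": 1, "instanceType": "MEDIUM", "ha": true},
    {"name": "worker",
     "roles": ["hadoop_datanode", "hadoop_tasktracker"],
     "instanceNum": 5, "instanceType": "SMALL",
     "storage": {"type": "LOCAL", "sizeGB": 10}}
  ]
}
"""


def summarize(spec_text):
    """Return the distro, total VM count and total local disk a spec asks for."""
    spec = json.loads(spec_text)
    groups = spec["nodeGroups"]
    local_gb = sum(
        group["instanceNum"] * group["storage"]["sizeGB"]
        for group in groups
        if group.get("storage", {}).get("type") == "LOCAL"
    )
    return {
        "distro": spec["distro"],
        "total_vms": sum(group["instanceNum"] for group in groups),
        "local_storage_gb": local_gb,
    }


if __name__ == "__main__":
    print(summarize(SPEC))
```

Parsing with `json.loads` also catches problems such as trailing commas or typographic quotes, which are easy to introduce when a spec is copied from a slide.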
Fully Customizable Configuration Profile
•  Tune Hadoop cluster config in Serengeti spec file
     "configuration": {
      "hadoop": {
       "mapred-site.xml": {
        "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler"
        …

•  Control the placement of Hadoop nodes
     "placementPolicies": {
          "instancePerHost": 2,
          "groupRacks": {
            "type": "ROUNDROBIN",
            "racks": ["rack1", "rack2", "rack3"]
            …
•  Set up physical racks/hosts mapping topology
     > topology upload --fileName <topology file name>
     > topology list
•  Create Hadoop clusters using HVE topology
     > cluster create --name XXX --topology HVE --distro <HVE-supported_distro>
Getting to Insights
•  Point compute only cluster to existing HDFS
     … "externalHDFS": "hdfs://hostname-of-namenode:8020", …
•  Interact with HDFS from Serengeti CLI
     > fs ls /tmp
     > fs put --from /tmp/local.data --to /tmp/hdfs.data
•  Launch MapReduce/Pig/Hive jobs from Serengeti CLI
     > cluster target --name myHadoop
     > mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar
       --mainclass org.apache.hadoop.examples.PiEstimator --args "100 1000000000"
•  Deploy Hive Server for ODBC/JDBC services
     "name": "client",
        "roles": [
          "hadoop_client",
          "hive",
          "hive_server",
          "pig"
        ], …
Configuring Distros

{
  "name" : "cdh",
  "version" : "3u3",
  "packages" : [
    {
      "roles" : ["hadoop_namenode", "hadoop_jobtracker",
                 "hadoop_tasktracker", "hadoop_datanode",
                 "hadoop_client"],
      "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz"
    },
    {
      "roles" : ["hive"],
      "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz"
    },
    {
      "roles" : ["pig"],
      "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz"
    }
  ]
}
Serengeti Demo
•  Deploy Serengeti vApp on vSphere
•  Deploy a Hadoop cluster in 10 Minutes
•  Run MapReduce
•  Scale out the Hadoop cluster
•  Create a Customized Hadoop cluster
•  Use Your Favorite Hadoop Distribution
Serengeti Architecture
http://github.com/vmware-serengeti
[Architecture diagram; the legend distinguishes Java and Ruby components]
•  Serengeti CLI (spring-shell)
•  Serengeti Server
  -  Serengeti web service
  -  DB, Resource Mgr, Cluster Mgr, Network mgr, Task mgr, Distro mgr
  -  Deploy Engine Proxy (bash script): shell command to trigger deployment; report deployment progress and summary
•  RabbitMQ (share with chef server)
•  Chef Orchestration Layer (Ironfan): service provisioning inside vms; Knife cluster cli
•  Cloud Manager
  -  Cluster provision engine (vm CRUD)
  -  Fog
  -  vSphere Cloud provider resource services
•  Chef server: cookbooks/roles, data bags; Chef bootstrap nodes
•  Package server: download packages
•  vCenter
•  Hadoop nodes and Client VM, each running Chef-client (download cookbook/recipes)
Use Local Disk where it’s Needed

                 SAN Storage     NAS Filers     Local Storage
Cost             $2 - $10/GB     $1 - $5/GB     $0.05/GB
$1M gets:
  Capacity       0.5 PB          1 PB           10 PB
  IOPS           200,000         200,000        400,000
  Bandwidth      8 GB/sec        10 GB/sec      250 GB/sec
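The capacity figures follow from dividing the budget by the cost per gigabyte. The sketch below redoes that arithmetic from the slide's price ranges, assuming decimal units (1 PB = 1,000,000 GB); the function names are illustrative.

```python
GB_PER_PB = 1_000_000  # decimal units, an assumption

# (low, high) price per gigabyte in USD, taken from the slide.
PRICES = {
    "SAN": (2.0, 10.0),
    "NAS": (1.0, 5.0),
    "Local": (0.05, 0.05),
}


def petabytes_for_budget(budget_usd, usd_per_gb):
    """Raw capacity in petabytes that a budget buys at a given price."""
    return budget_usd / usd_per_gb / GB_PER_PB


def capacity_ranges(budget_usd=1_000_000):
    """Return {tier: (min_pb, max_pb)} for the given budget."""
    return {
        tier: (petabytes_for_budget(budget_usd, high),
               petabytes_for_budget(budget_usd, low))
        for tier, (low, high) in PRICES.items()
    }


if __name__ == "__main__":
    for tier, (low_pb, high_pb) in capacity_ranges().items():
        print(f"{tier}: {low_pb:g} to {high_pb:g} PB per $1M")
```

The SAN and NAS figures on the slide match the cheap end of each range (0.5 PB and 1 PB). For local disk the raw division gives 20 PB, twice the slide's 10 PB; the slide does not say what else is counted in its figure.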
Rules of Thumb: Sizing for Hadoop
•  Disk
  -  Provide about 50 MB/sec of disk bandwidth per core
  -  If using SATA, that’s about one disk per core
•  Network
  -  Provide about 200 Mbit/sec of aggregate network bandwidth per core
•  Memory
  -  Use a memory:core ratio of about 4 GB per core
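Applied to a concrete host, the rules of thumb reduce to a few multiplications. The sketch below is illustrative only: `size_host` is a hypothetical helper, and the 50 MB/sec per SATA disk is inferred from the "one disk per core" rule rather than stated as a measured figure.

```python
import math


def size_host(cores,
              disk_mb_s_per_core=50,
              net_mbit_s_per_core=200,
              mem_gb_per_core=4,
              sata_disk_mb_s=50):
    """Estimate disk, network and memory needs for one Hadoop host."""
    disk_bandwidth = cores * disk_mb_s_per_core
    return {
        "disk_bandwidth_mb_s": disk_bandwidth,
        # Round up: a fraction of a disk still means buying a whole one.
        "sata_disks": math.ceil(disk_bandwidth / sata_disk_mb_s),
        "network_mbit_s": cores * net_mbit_s_per_core,
        "memory_gb": cores * mem_gb_per_core,
    }


if __name__ == "__main__":
    print(size_host(12))
```

For a 12-core host this gives 600 MB/sec of disk bandwidth (12 SATA disks), 2,400 Mbit/sec of network bandwidth and 48 GB of memory.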
Extend Virtual Storage Architecture to Include Local Disk
•  Shared Storage: SAN or NAS
  -  Easy to provision
  -  Automated cluster rebalancing
•  Hybrid Storage
  -  SAN for boot images, VMs, other workloads
  -  Local disk for Hadoop & HDFS
  -  Scalable Bandwidth, Lower Cost/GB
[Diagram: six hosts, each running a mix of Hadoop VMs and other VMs]
Hadoop Using Local Disks
[Diagram: a virtualization host running another workload and a Hadoop virtual machine with a Task Tracker and a Datanode; the VM has an OS image VMDK plus three data VMDKs, each formatted as Ext4; shared storage is shown below the host]
Native versus Virtual Platforms, 24 hosts, 12 disks/host
[Bar chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs; y-axis from 0 to 450]
Local vs Various SAN Storage Configurations
16 x HP DL380G7, EMC VNX 7500, 96 physical disks
[Bar chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5; y-axis from 0 to 4.5]
Hadoop Virtualization Extensions: Topology Awareness
Virtual Topologies
Hadoop Topology Changes for Virtualization
Hadoop Virtualization Extensions for Topology
HVE extends Hadoop (HDFS, MapReduce, Hadoop Common) with:
•  Task Scheduling Policy Extension
•  Balancer Policy Extension
•  Replica Choosing Policy Extension
•  Replica Placement Policy Extension
•  Replica Removal Policy Extension
•  Network Topology Extension

JIRAs: HADOOP-8468 (Umbrella JIRA), HADOOP-8469, HDFS-3495, MAPREDUCE-4310, HDFS-3498, MAPREDUCE-4309, HADOOP-8470, HADOOP-8472

Terasort locality          Data Local   Node-group Local   Rack Local
Normal                     392          -                  8
Normal with HVE            397          2                  1
D/C separation             0            -                  400
D/C separation with HVE    0            400                0
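The table makes more sense with the extra topology layer in mind: a node group sits between rack and host, so VMs on the same physical server can be recognised as close to each other. The sketch below is a simplified model of that classification, not HVE's implementation; the `/rack/nodegroup/host` path layout and all names are illustrative assumptions.

```python
# Locality levels from closest to farthest.
LEVELS = ["data-local", "node-group-local", "rack-local", "off-rack"]


def _parse(path):
    """Split a /rack/nodegroup/host path into its three parts."""
    parts = path.strip("/").split("/")
    if len(parts) != 3:
        raise ValueError(f"expected /rack/nodegroup/host, got {path!r}")
    return parts


def locality(task_path, replica_path):
    """Classify how close a task is to one replica of its input block."""
    rack_t, group_t, host_t = _parse(task_path)
    rack_r, group_r, host_r = _parse(replica_path)
    if (rack_t, group_t, host_t) == (rack_r, group_r, host_r):
        return "data-local"
    if (rack_t, group_t) == (rack_r, group_r):
        return "node-group-local"
    if rack_t == rack_r:
        return "rack-local"
    return "off-rack"


def best_locality(task_path, replica_paths):
    """Return the closest level reachable across all replicas."""
    return min((locality(task_path, p) for p in replica_paths),
               key=LEVELS.index)


if __name__ == "__main__":
    # Data/compute separation: compute and data VMs share a physical host.
    print(locality("/rack1/ng1/compute-vm1", "/rack1/ng1/data-vm1"))
```

With data/compute separation a task never runs in the same VM as its data, which is why the "Data Local" column is 0 in both D/C rows. Without the node-group layer the scheduler can only see such a task as rack-local; with it, all 400 tasks in the table are placed node-group-local.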
Why Virtualize Hadoop?

Simple to Operate
•  Rapid deployment
•  Unified operations across enterprise
•  Easy Clone of Cluster

Highly Available
•  No more single point of failure
•  One click to setup
•  High availability for MR Jobs

Elastic Scaling
•  Shrink and expand cluster on demand
•  Resource Guarantee
•  Independent scaling of Compute and data
Live Machine Migration Reduces Planned Downtime


Description:
Enables the live migration of virtual
machines from one host to another
with continuous service availability.

Benefits:
•    Revolutionary technology that is the
     basis for automated virtual machine
     movement
•    Meets service level and performance
     goals
vSphere High Availability (HA) - protection against unplanned downtime

Overview
•  Protection against host and VM failures
•  Automatic failure detection (host, guest OS)
•  Automatic virtual machine restart in minutes, on any available host in cluster
•  OS and application-independent, does not require complex configuration changes
Example HA Failover for Hadoop
[Diagram: a Serengeti Server and two Namenode VMs linked by vSphere HA, above four worker nodes, each running TaskTracker, HDFS Datanode, Hive and hBase]
vSphere Fault Tolerance provides continuous protection

Overview
•  Single identical VMs running in lockstep on separate hosts
•  Zero downtime, zero data loss failover for all virtual machines in case of hardware failures
•  Integrated with VMware HA/DRS
•  No complex clustering or specialized hardware required
•  Single common mechanism for all applications and operating systems
[Diagram: App/OS virtual machines on two VMware ESX hosts, protected by HA and FT]

Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters
High Availability for the Hadoop Stack
[Diagram of the Hadoop stack and its management components]
•  ETL Tools, BI Reporting, RDBMS
•  Pig (Data Flow), Hive (SQL), HCatalog
•  Hive MetaDB, HCatalog MDB
•  ZooKeeper (Coordination)
•  MapReduce (Job Scheduling/Execution System), Jobtracker
•  HBase (Key-Value store)
•  HDFS (Hadoop Distributed File System), Namenode
•  Management Server
Performance Effect of FT for Master Daemons
•  NameNode and JobTracker placed in separate UP VMs
•  Small overhead: Enabling FT causes 2-4% slowdown for TeraSort
•  8 MB case places similar load on NN & JT as >200 hosts with 256 MB
[Chart: TeraSort elapsed time ratio to FT off (y-axis from 1 to 1.04) by HDFS block size in MB: 256, 64, 16, 8]
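The last bullet rests on simple arithmetic: for a fixed data set, smaller blocks mean more blocks for the NameNode to track and, by default, roughly one map task per block for the JobTracker to schedule. A minimal sketch follows; the 1 TiB data size is an illustrative assumption, not the size used in the benchmark.

```python
MB = 1024 * 1024


def block_count(data_bytes, block_mb):
    """Number of HDFS blocks needed to hold data_bytes (rounded up)."""
    block_bytes = block_mb * MB
    return -(-data_bytes // block_bytes)  # ceiling division


def relative_load(block_mb, baseline_mb=256):
    """How many times more blocks a block size creates than the baseline."""
    return baseline_mb / block_mb


if __name__ == "__main__":
    one_tib = 1024 * 1024 * MB
    for size in (256, 64, 16, 8):
        print(size, block_count(one_tib, size), relative_load(size))
```

Going from 256 MB to 8 MB blocks multiplies the block count by 32, so a small test cluster can put roughly the same metadata and scheduling load on the master daemons as a much larger cluster running with 256 MB blocks.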
“Time Share”
[Diagram: Hadoop VMs run alongside other VMs on three hosts, each with HDFS storage, on VMware vSphere with Serengeti.]

While existing apps run during the day to support business
operations, Hadoop batch jobs kick off at night to conduct
deep analysis of data.
Hadoop Task Tracker and Data Node in a VM


[Diagram: a virtualization host runs another workload alongside one virtual Hadoop node, which contains a Task Tracker with slots and a Datanode backed by a VMDK. Annotations: Add/Remove Slots? Grow/Shrink by tens of GB?]

Grow/Shrink of a VM is one approach
Add/remove Virtual Nodes


[Diagram: a virtualization host runs another workload alongside two virtual Hadoop nodes, each containing a Task Tracker with slots and a Datanode backed by its own VMDK.]

Just add/remove more virtual nodes?
But State makes it hard to power-off a node

[Diagram: a virtualization host runs another workload alongside one virtual Hadoop node, which contains a Task Tracker with slots and a Datanode backed by a VMDK.]




  Powering off the Hadoop VM
  would in effect fail the datanode
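
To make the statefulness concrete, here is a minimal sketch in Python (a toy model, not Hadoop code) of what losing a datanode means for HDFS: every block that had a replica on the powered-off node drops below its target replication and has to be re-copied elsewhere. The node names, block IDs, and the replication factor of 3 are illustrative assumptions.

```python
from typing import Dict, List, Set


def under_replicated(placement: Dict[str, Set[str]],
                     failed_node: str,
                     target_replicas: int = 3) -> List[str]:
    """Return block IDs whose live replica count falls below the target
    once `failed_node` is powered off."""
    affected = []
    for block_id, nodes in sorted(placement.items()):
        live = nodes - {failed_node}
        if len(live) < target_replicas:
            affected.append(block_id)
    return affected


# Hypothetical placement: block ID -> datanodes holding a replica.
placement = {
    "blk_1": {"dn1", "dn2", "dn3"},
    "blk_2": {"dn2", "dn3", "dn4"},
    "blk_3": {"dn1", "dn3", "dn4"},
}

print(under_replicated(placement, "dn1"))  # ['blk_1', 'blk_3']
```

In this toy placement, powering off `dn1` leaves two of the three blocks under-replicated, which is why a combined compute/data VM cannot simply be switched off to free resources.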
Adding a node needs data…


[Diagram: a second virtual Hadoop node, with its own Task Tracker, Datanode, and VMDK, sits next to the first on the virtualization host, alongside the other workload.]




Adding a node would require TBs of
data replication
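
As a rough back-of-the-envelope illustration of that cost, the sketch below estimates how long it takes to copy data onto a newly added datanode. The data volume and the sustained throughput are assumed values chosen for the arithmetic, not measurements from the talk.

```python
def replication_hours(data_tb: float, throughput_mb_s: float) -> float:
    """Hours needed to copy `data_tb` terabytes at a sustained
    `throughput_mb_s` megabytes per second (decimal units)."""
    if throughput_mb_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = data_tb * 1_000_000  # 1 TB = 1,000,000 MB
    return total_mb / throughput_mb_s / 3600  # seconds -> hours


# Hypothetical: 2 TB copied to the new node at a sustained 100 MB/s.
print(round(replication_hours(2, 100), 2))  # 5.56
```

Under these assumptions a single node takes hours to populate, so adding combined compute/data nodes is far from an instant scaling operation.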
Separated Compute and Data


[Diagram: on one virtualization host, several virtual Hadoop nodes each run only a Task Tracker with slots, while a separate virtual Hadoop node runs the Datanode backed by VMDKs; another workload shares the host.]

Truly Elastic Hadoop: Scalable through virtual nodes
Dataflow with separated Compute/Data


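[Diagram: on one virtualization host, a virtual Hadoop node running a NodeManager with slots and a second virtual Hadoop node running the Datanode (backed by a VMDK) each have a virtual NIC; both connect to the virtual switch, which sits above the NIC drivers.]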
Elastic Compute
!  Set number of active TaskTracker nodes
     > cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8
!  Enable all the TaskTrackers in the cluster
     > cluster unlimit --name myHadoop
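
The two commands above pair naturally with the day/night "time share" pattern described earlier. The following is a minimal sketch, assuming a business window of 08:00 to 20:00 and a daytime cap of 8 compute nodes; it only builds the command strings shown on the slide and does not invoke the Serengeti CLI. The function name and the window hours are hypothetical.

```python
def elastic_command(hour: int, cluster: str = "myHadoop",
                    node_group: str = "worker",
                    daytime_nodes: int = 8,
                    day_start: int = 8, day_end: int = 20) -> str:
    """Return the Serengeti CLI command (as a string) for the given hour.

    During the assumed business window the TaskTracker count is capped;
    outside it every TaskTracker in the cluster is enabled.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be in 0..23")
    if day_start <= hour < day_end:
        return (f"cluster limit --name {cluster} --nodeGroup {node_group} "
                f"--activeComputeNodeNum {daytime_nodes}")
    return f"cluster unlimit --name {cluster}"


print(elastic_command(10))  # daytime: cap the active compute nodes
print(elastic_command(23))  # night: enable all TaskTrackers
```

A scheduler could call such a function once an hour and pass the result to the CLI; how the command is actually submitted is left out because it depends on the deployment.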
Performance Analysis of Separation
Combined mode: 1 Combined Compute/Datanode VM per Host
Split Mode: 1 Datanode VM, 1 Compute node VM per Host

[Diagram: in combined mode each of the two hosts runs one VM holding both the Task Tracker and the Datanode; in split mode each host runs a Task Tracker VM and a separate Datanode VM.]




 Workload: Teragen, Terasort, Teravalidate
 HW Configuration: 8 cores, 96GB RAM, 16 disks per host x 2 nodes
Performance Analysis of Separation

Minimal performance impact with separation of compute and data

[Chart: elapsed time as a ratio to combined mode (y-axis, 0 to 1.2) for Teragen, Terasort, and Teravalidate, comparing Combined and Split.]
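
The metric on this slide, elapsed time as a ratio to combined mode, can be computed as in the sketch below. The timings are placeholder values chosen only to show the arithmetic; they are not the measurements behind the chart.

```python
from typing import Dict


def ratio_to_combined(combined_s: Dict[str, float],
                      split_s: Dict[str, float]) -> Dict[str, float]:
    """Normalize split-mode elapsed times by the combined-mode times.

    A ratio of 1.0 means no difference; above 1.0 means split mode is slower.
    """
    return {job: split_s[job] / combined_s[job] for job in combined_s}


# Placeholder timings in seconds, NOT the results from the talk.
combined = {"Teragen": 100.0, "Terasort": 200.0, "Teravalidate": 50.0}
split = {"Teragen": 105.0, "Terasort": 210.0, "Teravalidate": 50.0}

for job, ratio in ratio_to_combined(combined, split).items():
    print(f"{job}: {ratio:.2f}")
```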
Freedom of Choice and Open Source
             Distributions                  Community Projects




•  Flexibility to choose from major distributions
      cluster create --name myHadoop --distro apache

•  Support for multiple projects
•  Open architecture to welcome industry participation
•  Contributing Hadoop Virtualization Extensions (HVE) to the open
   source community

Weitere ähnliche Inhalte

Was ist angesagt?

Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopBrock Noland
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data ApplicationsRichard McDougall
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesDataWorks Summit
 
Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentGlusterFS
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastuctureDataWorks Summit
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGlusterFS
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisScaleOut Software
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...lucenerevolution
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHortonworks
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaMark Kerzner
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRVijay Rayapati
 
Introduction to Gruter and Gruter's BigData Platform
Introduction to Gruter and Gruter's BigData PlatformIntroduction to Gruter and Gruter's BigData Platform
Introduction to Gruter and Gruter's BigData PlatformGruter
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Ryu Kobayashi
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR Technologies
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbaseRavi Veeramachaneni
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projectsaf83
 
Cloumon enterprise
Cloumon enterpriseCloumon enterprise
Cloumon enterpriseGruter
 

Was ist angesagt? (20)

Common and unique use cases for Apache Hadoop
Common and unique use cases for Apache HadoopCommon and unique use cases for Apache Hadoop
Common and unique use cases for Apache Hadoop
 
Building Big Data Applications
Building Big Data ApplicationsBuilding Big Data Applications
Building Big Data Applications
 
Hadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual MachinesHadoop in the Clouds, Virtualization and Virtual Machines
Hadoop in the Clouds, Virtualization and Virtual Machines
 
Cloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and DeploymentCloud Storage Adoption, Practice, and Deployment
Cloud Storage Adoption, Practice, and Deployment
 
Big data on virtualized infrastucture
Big data on virtualized infrastuctureBig data on virtualized infrastucture
Big data on virtualized infrastucture
 
Gluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFSGluster Webinar: Introduction to GlusterFS
Gluster Webinar: Introduction to GlusterFS
 
Using Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data AnalysisUsing Distributed In-Memory Computing for Fast Data Analysis
Using Distributed In-Memory Computing for Fast Data Analysis
 
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
The Search Is Over: Integrating Solr and Hadoop in the Same Cluster to Simpli...
 
Google Compute and MapR
Google Compute and MapRGoogle Compute and MapR
Google Compute and MapR
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of ClouderaHouston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
Houston Hadoop Meetup Presentation by Vikram Oberoi of Cloudera
 
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMRBig Data and Hadoop in Cloud - Leveraging Amazon EMR
Big Data and Hadoop in Cloud - Leveraging Amazon EMR
 
Introduction to Gruter and Gruter's BigData Platform
Introduction to Gruter and Gruter's BigData PlatformIntroduction to Gruter and Gruter's BigData Platform
Introduction to Gruter and Gruter's BigData Platform
 
Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014Treasure Data on The YARN - Hadoop Conference Japan 2014
Treasure Data on The YARN - Hadoop Conference Japan 2014
 
MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211MapR LucidWorks Joint Webinar 121211
MapR LucidWorks Joint Webinar 121211
 
Geo-based content processing using hbase
Geo-based content processing using hbaseGeo-based content processing using hbase
Geo-based content processing using hbase
 
The 25 Most Promising Open Source Projects
The 25 Most Promising Open Source ProjectsThe 25 Most Promising Open Source Projects
The 25 Most Promising Open Source Projects
 
Cloumon enterprise
Cloumon enterpriseCloumon enterprise
Cloumon enterprise
 
Cosbench apac
Cosbench apacCosbench apac
Cosbench apac
 
cosbench-openstack.pdf
cosbench-openstack.pdfcosbench-openstack.pdf
cosbench-openstack.pdf
 

Andere mochten auch

VMworld 2013: What's New in VMware vSphere?
VMworld 2013: What's New in VMware vSphere? VMworld 2013: What's New in VMware vSphere?
VMworld 2013: What's New in VMware vSphere? VMworld
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsRichard McDougall
 
Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Richard McDougall
 
Virtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareVirtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareRichard McDougall
 
VMware Performance Troubleshooting
VMware Performance TroubleshootingVMware Performance Troubleshooting
VMware Performance Troubleshootingglbsolutions
 
Denver VMUG nov 2011
Denver VMUG nov 2011Denver VMUG nov 2011
Denver VMUG nov 2011Dan Brinkmann
 
Citrix Remote Access Solution Soup
Citrix Remote Access Solution SoupCitrix Remote Access Solution Soup
Citrix Remote Access Solution SoupDan Brinkmann
 
VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingDan Brinkmann
 
VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4Vepsun Technologies
 
VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6Vepsun Technologies
 
Reference Architecture-Validated & Tested Approach to Define Network Design
Reference Architecture-Validated & Tested Approach to Define Network DesignReference Architecture-Validated & Tested Approach to Define Network Design
Reference Architecture-Validated & Tested Approach to Define Network DesignDataWorks Summit
 

Andere mochten auch (18)

VMworld 2013: What's New in VMware vSphere?
VMworld 2013: What's New in VMware vSphere? VMworld 2013: What's New in VMware vSphere?
VMworld 2013: What's New in VMware vSphere?
 
Message passing interface
Message passing interfaceMessage passing interface
Message passing interface
 
Making of the Burner Board
Making of the Burner BoardMaking of the Burner Board
Making of the Burner Board
 
Big Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure ConsiderationsBig Data/Hadoop Infrastructure Considerations
Big Data/Hadoop Infrastructure Considerations
 
Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009Solaris Internals Preso circa 2009
Solaris Internals Preso circa 2009
 
Virtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMwareVirtualizing Oracle Databases with VMware
Virtualizing Oracle Databases with VMware
 
Hadoop I/O Analysis
Hadoop I/O AnalysisHadoop I/O Analysis
Hadoop I/O Analysis
 
VMware Performance Troubleshooting
VMware Performance TroubleshootingVMware Performance Troubleshooting
VMware Performance Troubleshooting
 
Denver VMUG nov 2011
Denver VMUG nov 2011Denver VMUG nov 2011
Denver VMUG nov 2011
 
Citrix Remote Access Solution Soup
Citrix Remote Access Solution SoupCitrix Remote Access Solution Soup
Citrix Remote Access Solution Soup
 
VMware vSphere Performance Troubleshooting
VMware vSphere Performance TroubleshootingVMware vSphere Performance Troubleshooting
VMware vSphere Performance Troubleshooting
 
VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5VMware Advance Troubleshooting Workshop - Day 5
VMware Advance Troubleshooting Workshop - Day 5
 
VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2VMware Advance Troubleshooting Workshop - Day 2
VMware Advance Troubleshooting Workshop - Day 2
 
VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3VMware Advance Troubleshooting Workshop - Day 3
VMware Advance Troubleshooting Workshop - Day 3
 
VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4VMware Advance Troubleshooting Workshop - Day 4
VMware Advance Troubleshooting Workshop - Day 4
 
VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6VMware Advance Troubleshooting Workshop - Day 6
VMware Advance Troubleshooting Workshop - Day 6
 
Reference Architecture-Validated & Tested Approach to Define Network Design
Reference Architecture-Validated & Tested Approach to Define Network DesignReference Architecture-Validated & Tested Approach to Define Network Design
Reference Architecture-Validated & Tested Approach to Define Network Design
 
IdP, SAML, OAuth
IdP, SAML, OAuthIdP, SAML, OAuth
IdP, SAML, OAuth
 

Ähnlich wie Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand

App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)outstanding59
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephantsOvidiu Dimulescu
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Ranjith Sekar
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache HadoopAjit Koti
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoopShashwat Shriparv
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwielerlucenerevolution
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoopDataWorks Summit
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosLester Martin
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Chris Baglieri
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data Mindgrub Technologies
 

Ähnlich wie Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand (20)

Cloud computing era
Cloud computing eraCloud computing era
Cloud computing era
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Zh tw cloud computing era
Zh tw cloud computing eraZh tw cloud computing era
Zh tw cloud computing era
 
Handling not so big data
Handling not so big dataHandling not so big data
Handling not so big data
 
Hadoop Family and Ecosystem
Hadoop Family and EcosystemHadoop Family and Ecosystem
Hadoop Family and Ecosystem
 
Hadoop on Azure, Blue elephants
Hadoop on Azure,  Blue elephantsHadoop on Azure,  Blue elephants
Hadoop on Azure, Blue elephants
 
Hadoop programming
Hadoop programmingHadoop programming
Hadoop programming
 
Hadoop and BigData - July 2016
Hadoop and BigData - July 2016Hadoop and BigData - July 2016
Hadoop and BigData - July 2016
 
Apache Hadoop
Apache HadoopApache Hadoop
Apache Hadoop
 
Big data
Big dataBig data
Big data
 
Introduction to apache hadoop
Introduction to apache hadoopIntroduction to apache hadoop
Introduction to apache hadoop
 
Architecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric BaldeschwielerArchitecting the Future of Big Data & Search - Eric Baldeschwieler
Architecting the Future of Big Data & Search - Eric Baldeschwieler
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Searching conversations with hadoop
Searching conversations with hadoopSearching conversations with hadoop
Searching conversations with hadoop
 
Big Data Concepts
Big Data ConceptsBig Data Concepts
Big Data Concepts
 
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive DemosHadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
Hadoop Demystified + MapReduce (Java and C#), Pig, and Hive Demos
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
Finding the needles in the haystack. An Overview of Analyzing Big Data with H...
 
BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data BW Tech Meetup: Hadoop and The rise of Big Data
BW Tech Meetup: Hadoop and The rise of Big Data
 

Kürzlich hochgeladen

Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 


Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand

  • 1. Elastic, Multi-tenant Hadoop on Demand. Richard McDougall, Chief Architect, Application Infrastructure and Big Data, VMware, Inc. @richardmcdougll. ApacheCon Europe, 2012. http://www.vmware.com/hadoop http://cto.vmware.com/ http://projectserengeti.org http://github.com/vmware-serengeti © 2009 VMware Inc. All rights reserved
  • 2. Broad Application of Hadoop technology. Horizontal Use Cases: Log Processing / Click Stream Analytics; Machine Learning / sophisticated data mining; Web crawling / text processing; Extract Transform Load (ETL) replacement; Image / XML message processing; General archiving / compliance. Vertical Use Cases: Financial Services; Internet Retailer; Pharmaceutical / Drug Discovery; Mobile / Telecom; Scientific Research; Social Media. Hadoop’s ability to handle large unstructured data affordably and efficiently makes it a valuable tool kit for enterprises across a number of applications and fields.
  • 3. How does Hadoop enable parallel processing? A framework for distributed processing of large data sets across clusters of computers using a simple programming model. Source: http://architects.dzone.com/articles/how-hadoop-mapreduce-works
  • 4. Hadoop System Architecture. MapReduce: Programming framework for highly parallel data processing. Hadoop Distributed File System (HDFS): Distributed data storage.
  • 5. Job Tracker Schedules Tasks Where the Data Resides. Job Tracker; Job Input File: Split 1 – 64MB, Split 2 – 64MB, Split 3 – 64MB. Host 1, Host 2, Host 3: each runs a Task Tracker (Task 1, Task 2, Task 3) and a DataNode. Input File: Block 1 – 64MB, Block 2 – 64MB, Block 3 – 64MB.
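The scheduling idea on slide 5 can be sketched in a few lines of Python. This is a simplified illustration, not the JobTracker's actual algorithm: the function name `schedule_splits`, the host names, and the slot counts are all assumptions made for the example, and the real scheduler reacts to TaskTracker heartbeats rather than planning everything up front.

```python
def schedule_splits(block_locations, free_slots):
    """Assign each input split to a host, preferring hosts that store its block.

    block_locations: dict mapping split id -> list of hosts holding a replica
    free_slots: dict mapping host -> number of free task slots
    Returns a dict mapping split id -> (host, is_data_local).
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment = {}
    for split, hosts in block_locations.items():
        # First choice: a replica holder with a free slot (data-local task).
        local = [h for h in hosts if slots.get(h, 0) > 0]
        if local:
            host, is_local = max(local, key=lambda h: slots[h]), True
        else:
            # Fallback: any host with a free slot; the block is read remotely.
            candidates = [h for h, n in slots.items() if n > 0]
            if not candidates:
                raise RuntimeError("no free task slots")
            host, is_local = max(candidates, key=lambda h: slots[h]), False
        slots[host] -= 1
        assignment[split] = (host, is_local)
    return assignment


# Three 64MB splits, one block per host, one free slot per host (hypothetical).
blocks = {"split1": ["host1"], "split2": ["host2"], "split3": ["host3"]}
plan = schedule_splits(blocks, {"host1": 1, "host2": 1, "host3": 1})

# Two splits on the same host but only one local slot: the second runs remotely.
contended = schedule_splits({"a": ["host1"], "b": ["host1"]},
                            {"host1": 1, "host2": 1})
```

In the first call every task lands on the host that already stores its block, so no input data crosses the network; the second call shows the remote-read fallback when local slots run out.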
  • 7. Hadoop Data Locality and Replication
  • 8. The Right Big Data Tools for the Right Job… Real Time Streams (Social, sensors); Machine Learning (Mahout, etc…); Real-Time Processing (s4, storm, spark); Data Visualization (Excel, Tableau); ETL (Informatica, Talend, Spring Integration); Real Time Database (Shark, Gemfire, hBase, Cassandra); Interactive Analytics (Impala, Greenplum, AsterData, Netezza…); HIVE; Batch Processing (Map-Reduce). Structured and Unstructured Data (HDFS, MAPR). Cloud Infrastructure: Compute, Storage, Networking.
  • 9. So yes, there’s a lot more than just Map-Reduce… Compute layer: Hadoop batch analysis; HBase real-time queries; Big SQL – Impala; NoSQL – Cassandra, Mongo, etc; Other – Spark, Shark, Solr, Platfora, Etc… Data layer: HDFS. Underneath: some sort of distributed, resource management OS + Filesystem across the hosts.
  • 11. Containers with Isolation are a Tried and Tested Approach Reckless Workload 2 Hungry Workload 1 Sneaky Workload 3 Some sort of distributed, resource management OS + Filesystem Host Host Host Host Host Host Host
  • 12. Mixing Workloads: Three big types of Isolation are Required. Resource Isolation: control the greedy noisy neighbor; reserve resources to meet needs. Version Isolation: allow concurrent OS, App, Distro versions. Security Isolation: provide privacy between users/groups; runtime and data privacy required. Underneath: some sort of distributed, resource management OS + Filesystem across the hosts.
  • 13. Community activity in Isolation and Resource Management. YARN: goal is to support workloads other than M-R on Hadoop; initial need is for MPI/M-R from Yahoo; not quite ready for prime-time yet?; non-posix file system self selects workload types. Mesos: distributed resource broker; mixed workloads with some RM; active project, in use at Twitter; leverages OS virtualization, e.g. cgroups. Virtualization: virtual machine as the primary isolation, resource management and versioned deployment container; basis for Project Serengeti.
  • 14. Project Serengeti – Hadoop on Virtualization. Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster. Highly Available: no more single point of failure; one click to setup; high availability for MR jobs. Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data. Serengeti is an Open Source Project to automate deployment of Hadoop on virtual platforms. http://projectserengeti.org http://github.com/vmware-serengeti
  • 15. Common Infrastructure for Big Data. Cluster Sprawling: single purpose clusters (Hadoop, HBase, MPP DB) for various business applications lead to cluster sprawl. Cluster Consolidation: MPP DB, HBase, and Hadoop on a common Virtualization Platform. Simplify: single hardware infrastructure; unified operations. Optimize: shared resources = higher utilization; elastic resources = faster on-demand access.
  • 16. Evolution of Hadoop on VMs. Current Hadoop: combined storage/compute on each slave node. Hadoop in VM: VM lifecycle determined by Datanode; limited elasticity; limited to Hadoop multi-tenancy. Separate Storage: separate compute from data; elastic compute; enable shared workloads; raise utilization. Separate Compute Clusters: separate virtual clusters per tenant; stronger VM-grade security and resource isolation; enable deployment of multiple Hadoop runtime versions.
  • 17. In-house Hadoop as a Service, “Enterprise EMR” – (Hadoop + Hadoop). Compute layer: production ETL of log files; ad hoc data mining; production recommendation engine. Data layer: HDFS, HDFS. Underneath: virtualization platform across the hosts.
  • 18. Integrated Hadoop and Webapps – (Hadoop + Other Workloads). Compute layer: short-lived Hadoop compute cluster; Hadoop compute cluster; web servers for ecommerce site. Data layer: HDFS. Underneath: virtualization platform across the hosts.
  • 19. Integrated Big Data Production – (Hadoop + other big data). Compute layer: Hadoop batch analysis; HBase real-time queries; Big SQL – Impala; NoSQL – Cassandra, Mongo, etc; Other – Spark, Shark, Solr, Platfora, Etc… Data layer: HDFS. Underneath: virtualization across the hosts.
  • 20. Deploy a Hadoop Cluster in under 30 Minutes Step 1: Deploy Serengeti virtual appliance on vSphere. Deploy vHelperOVF to vSphere Step 2: A few simple commands to stand up Hadoop Cluster. Select Compute, memory, storage and network Select configuration template Automate deployment Done
  • 21. A Tour Through Serengeti $ ssh serengeti@serengeti-vm $ serengeti serengeti>
  • 22. A Tour Through Serengeti serengeti> cluster create --name dcsep serengeti> cluster list name: dcsep, distro: apache, status: RUNNING NAME ROLES INSTANCE CPU MEM(MB) TYPE ----------------------------------------------------------------------------- master [hadoop_namenode, hadoop_jobtracker] 1 6 2048 LOCAL 10 data [hadoop_datanode] 1 2 1024 LOCAL 10 compute [hadoop_tasktracker] 8 2 1024 LOCAL 10 client [hadoop_client, pig, hive] 1 1 3748 LOCAL 10
  • 23. Serengeti Spec File [ "distro":"apache", { "name": "master", "roles": [ "hadoop_NameNode", "hadoop_jobtracker" ], "instanceNum": 1, "instanceType": "MEDIUM", "ha":true }, { "name": "worker", "roles": [ "hadoop_datanode", "hadoop_tasktracker" ], "instanceNum": 5, "instanceType": "SMALL", "storage": { "type": "LOCAL", "sizeGB": 10 } } ] Callouts: "distro" is the Choice of Distro; "ha" is the HA Option; "storage" is the Choice of Shared Storage or Local Disk.
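The spec fragment on slide 23 is an excerpt rather than a complete JSON document. As a rough sketch, the same fields can be assembled in Python and serialized to well-formed JSON. The field names and values are copied from the slide; the enclosing `nodeGroups` key and the helper `total_local_storage_gb` are assumptions added for illustration and should be checked against the Serengeti documentation before use.

```python
import json

# Field names and values are taken from the slide; the "nodeGroups" wrapper
# is an assumption, since the slide shows only a fragment.
spec = {
    "distro": "apache",  # choice of distro
    "nodeGroups": [
        {
            "name": "master",
            "roles": ["hadoop_NameNode", "hadoop_jobtracker"],
            "instanceNum": 1,
            "instanceType": "MEDIUM",
            "ha": True,  # HA option
        },
        {
            "name": "worker",
            "roles": ["hadoop_datanode", "hadoop_tasktracker"],
            "instanceNum": 5,
            "instanceType": "SMALL",
            # choice of shared storage or local disk
            "storage": {"type": "LOCAL", "sizeGB": 10},
        },
    ],
}


def total_local_storage_gb(cluster_spec):
    """Sum the local disk requested across all node groups (hypothetical helper)."""
    total = 0
    for group in cluster_spec["nodeGroups"]:
        storage = group.get("storage")
        if storage and storage["type"] == "LOCAL":
            total += group["instanceNum"] * storage["sizeGB"]
    return total


spec_text = json.dumps(spec, indent=2)  # well-formed JSON text
local_gb = total_local_storage_gb(spec)
```

Building the spec as a data structure and serializing it avoids the trailing commas and typographic quotes that creep into hand-edited files.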
  • 24. Fully Customizable Configuration Profile. Tune Hadoop cluster config in Serengeti spec file: "configuration": { "hadoop": { "mapred-site.xml": { "mapred.jobtracker.taskScheduler": "org.apache.hadoop.mapred.FairScheduler" … Control the placement of Hadoop nodes: "placementPolicies": { "instancePerHost": 2, "groupRacks": { "type": "ROUNDROBIN", "racks": ["rack1", "rack2", "rack3"] … Set up physical racks/hosts mapping topology: > topology upload --fileName <topology file name> > topology list Create Hadoop clusters using HVE topology: > cluster create --name XXX --topology HVE --distro <HVE-supported_distro>
  • 25. Getting to Insights. Point compute only cluster to existing HDFS: … "externalHDFS": "hdfs://hostname-of-namenode:8020", … Interact with HDFS from Serengeti CLI: > fs ls /tmp > fs put --from /tmp/local.data --to /tmp/hdfs.data Launch MapReduce/Pig/Hive jobs from Serengeti CLI: > cluster target --name myHadoop > mr jar --jarfile /opt/serengeti/cli/lib/hadoop-examples-1.0.1.jar --mainclass org.apache.hadoop.examples.PiEstimator --args "100 1000000000" Deploy Hive Server for ODBC/JDBC services: "name": "client", "roles": [ "hadoop_client", "hive", "hive_server", "pig" ], …
  • 26. Configuring Distro’s { "name" : "cdh", "version" : "3u3", "packages" : [ { "roles" : ["hadoop_NameNode", "hadoop_jobtracker", "hadoop_tasktracker", "hadoop_datanode", "hadoop_client"], "tarball" : "cdh/3u3/hadoop-0.20.2-cdh3u3.tar.gz" }, { "roles" : ["hive"], "tarball" : "cdh/3u3/hive-0.7.1-cdh3u3.tar.gz" }, { "roles" : ["pig"], "tarball" : "cdh/3u3/pig-0.8.1-cdh3u3.tar.gz" } ] },
  • 27. Serengeti Demo Deploy Serengeti vApp on vSphere Deploy a Hadoop cluster in 10 Minutes Run MapReduce Serengeti Demo Scale out the Hadoop cluster Create a Customized Hadoop cluster Use Your Favorite Hadoop Distribution
  • 28. Serengeti Architecture Java Serengeti CLI http://github.com/vmware-serengeti (spring-shell) Ruby Serengeti Server Serengeti web service DB Resource Cluster Network Task Distro Mgr Mgr mgr mgr mgr Shell command to trigger Report deployment Deploy Engine Proxy deployment progress and summary (bash script) Knife cluster cli RabbitMQ (share with chef server) Chef Orchestration Layer (Ironfan) service provisioning inside vms package server Cloud Manager download packages cookbooks/roles Cluster provision engine data bags (vm CRUD) Chef server Chef bootstrap nodes Fog vSphere Cloud provider download cookbook/ resource services recipes vCenter Hadoop node Hadoop node Hadoop node Hadoop node Client VM Chef-client Chef-client Chef-client Chef-client Chef-client
  • 29. Use Local Disk where it’s Needed. SAN Storage: $2 - $10/Gigabyte; $1M gets 0.5 Petabytes, 200,000 IOPS, 8 Gbyte/sec. NAS Filers: $1 - $5/Gigabyte; $1M gets 1 Petabyte, 200,000 IOPS, 10 Gbyte/sec. Local Storage: $0.05/Gigabyte; $1M gets 10 Petabytes, 400,000 IOPS, 250 Gbytes/sec.
  • 30. Rules of Thumb: Sizing for Hadoop. Disk: provide about 50 Mbytes/sec of disk bandwidth per core; if using SATA, that’s about one disk per core. Network: provide about 200 Mbits/sec of aggregate network bandwidth per core. Memory: use a memory:core ratio of about 4 Gbytes:core.
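The rules of thumb on slide 30 are per-core ratios, so sizing a host is simple multiplication. The sketch below applies them to a hypothetical 8-core host; the names `HostSizing` and `size_host` are made up for the example, and the ratios are the approximate figures from the slide, not hard limits.

```python
from dataclasses import dataclass

# Per-core ratios from the slide (all approximate).
DISK_MB_PER_SEC_PER_CORE = 50      # disk bandwidth
SATA_DISKS_PER_CORE = 1            # about one SATA disk per core
NETWORK_MBIT_PER_SEC_PER_CORE = 200  # aggregate network bandwidth
MEMORY_GB_PER_CORE = 4             # memory:core ratio


@dataclass
class HostSizing:
    cores: int
    disk_mb_per_sec: int
    sata_disks: int
    network_mbit_per_sec: int
    memory_gb: int


def size_host(cores: int) -> HostSizing:
    """Scale the per-core rules of thumb to a host with the given core count."""
    return HostSizing(
        cores=cores,
        disk_mb_per_sec=cores * DISK_MB_PER_SEC_PER_CORE,
        sata_disks=cores * SATA_DISKS_PER_CORE,
        network_mbit_per_sec=cores * NETWORK_MBIT_PER_SEC_PER_CORE,
        memory_gb=cores * MEMORY_GB_PER_CORE,
    )


sizing = size_host(8)  # hypothetical 8-core host
```

For the 8-core example this works out to roughly 400 Mbytes/sec of disk bandwidth (about 8 SATA disks), 1,600 Mbits/sec of network bandwidth, and 32 Gbytes of memory.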
  • 31. Extend Virtual Storage Architecture to Include Local Disk. Shared Storage (SAN or NAS): easy to provision; automated cluster rebalancing. Hybrid Storage: SAN for boot images, VMs, other workloads; local disk for Hadoop & HDFS; scalable bandwidth, lower cost/GB. [Diagram: hosts running Hadoop VMs alongside other VMs.]
  • 32. Hadoop Using Local Disks. Diagram labels: Hadoop Virtual Machine (Task Tracker, Datanode); Other Workload; Ext4 file systems; Virtualization Host; OS Image VMDK; VMDKs; Shared Storage.
  • 33. Native versus Virtual Platforms, 24 hosts, 12 disks/host. [Chart: elapsed time in seconds (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Native, 1 VM, 2 VMs, 4 VMs.]
  • 34. Local vs Various SAN Storage Configurations. 16 x HP DL380G7, EMC VNX 7500, 96 physical disks. [Chart: elapsed time ratio to local disks (lower is better) for TeraGen, TeraSort, and TeraValidate; series: Local disks, SAN JBOD, SAN RAID-0 with 16 KB page size, SAN RAID-0, SAN RAID-5.]
  • 35. Hadoop Virtualization Extensions: Topology Awareness
  • 37. Hadoop Topology Changes for Virtualization
  • 38. Hadoop Virtualization Extensions for Topology. HVE extensions across Hadoop HDFS, MapReduce, and Hadoop Common: Task Scheduling Policy Extension; Balancer Policy Extension; Replica Choosing Policy Extension; Replica Placement Policy Extension; Replica Removal Policy Extension; Network Topology Extension. JIRAs: HADOOP-8468 (Umbrella JIRA), HADOOP-8469, HDFS-3495, MAPREDUCE-4310, HDFS-3498, MAPREDUCE-4309, HADOOP-8470, HADOOP-8472. Terasort locality (Data Local / Node-group Local / Rack Local): Normal 392 / - / 8; Normal with HVE 397 / 2 / 1; D/C separation 0 / - / 400; D/C separation with HVE 0 / 400 / 0.
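The locality table on slide 38 makes more sense with the extra topology layer spelled out. The sketch below is an illustration only, not HVE code: it assumes a topology path of the form `/rack/nodegroup/node`, where the node group stands for the physical host and the node is a VM, and the function name `locality` and the path strings are hypothetical.

```python
def locality(task_node: str, replica_node: str) -> str:
    """Classify how close a task is to a block replica.

    Both arguments are topology paths of the assumed form
    '/rack/nodegroup/node'.
    """
    t_rack, t_group, t_node = task_node.strip("/").split("/")
    r_rack, r_group, r_node = replica_node.strip("/").split("/")
    if (t_rack, t_group, t_node) == (r_rack, r_group, r_node):
        return "data-local"        # same VM holds the replica
    if (t_rack, t_group) == (r_rack, r_group):
        return "node-group-local"  # different VM, same physical host
    if t_rack == r_rack:
        return "rack-local"        # same rack, different host
    return "off-rack"


# Compute VM and data VM separated but placed on the same physical host.
same_host = locality("/rack1/hostA/compute-vm1", "/rack1/hostA/data-vm1")
# Compute VM on another host in the same rack.
same_rack = locality("/rack1/hostB/compute-vm2", "/rack1/hostA/data-vm1")
```

Without the node-group layer, a compute VM and a data VM on the same physical host look no closer than any two nodes in the rack, which is consistent with the slide's D/C separation row showing 400 rack-local tasks; with the extra layer the scheduler can recognize them as node-group-local.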
  • 39. Why Virtualize Hadoop? Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster. Highly Available: no more single point of failure; one click to setup; high availability for MR jobs. Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data.
  • 40. Live Machine Migration Reduces Planned Downtime Description: Enables the live migration of virtual machines from one host to another with continuous service availability. Benefits: •  Revolutionary technology that is the basis for automated virtual machine movement •  Meets service level and performance goals
  • 41. vSphere High Availability (HA) - protection against unplanned downtime Overview •  Protection against host and VM failures •  Automatic failure detection (host, guest OS) •  Automatic virtual machine restart in minutes, on any available host in cluster •  OS and application-independent, does not require complex configuration changes
  • 42. Example HA Failover for Hadoop Serengeti vSphere HA Namenode Namenode Server TaskTracker TaskTracker TaskTracker TaskTracker HDFS HDFS HDFS HDFS Datanode Datanode Datanode Datanode Hive Hive Hive Hive hBase hBase hBase hBase
  • 43. vSphere Fault Tolerance provides continuous protection. Overview: single identical VMs running in lockstep on separate hosts; zero downtime, zero data loss failover for all virtual machines in case of hardware failures; integrated with VMware HA/DRS; no complex clustering or specialized hardware required; single common mechanism for all applications and operating systems. Zero downtime for Name Node, Job Tracker and other components in Hadoop clusters. [Diagram: App/OS virtual machines protected by HA and FT on VMware ESX hosts.]
  • 44. High Availability for the Hadoop Stack. Diagram labels: ETL Tools; BI Reporting; RDBMS; Pig (Data Flow); Hive (SQL); HCatalog; Zookeeper (Coordination); Hive MetaDB; HCatalog MDB; Management Server; MapReduce (Job Scheduling/Execution System); HBase (Key-Value store); Jobtracker; Namenode; HDFS (Hadoop Distributed File System); Server.
  • 45. Performance Effect of FT for Master Daemons. NameNode and JobTracker placed in separate UP VMs. Small overhead: enabling FT causes 2-4% slowdown for TeraSort. The 8 MB case places similar load on NN & JT as >200 hosts with 256 MB. [Chart: TeraSort elapsed time ratio to FT off (axis 1 to 1.04) versus HDFS block size of 256, 64, 16, and 8 MB.]
  • 46. Why Virtualize Hadoop? Simple to Operate: rapid deployment; unified operations across enterprise; easy clone of cluster. Highly Available: no more single point of failure; one click to setup; high availability for MR jobs. Elastic Scaling: shrink and expand cluster on demand; resource guarantee; independent scaling of compute and data.
  • 47. “Time Share”. [Diagram: hosts running other VMs and Hadoop VMs on VMware vSphere with Serengeti, and HDFS on each host.] While existing apps run during the day to support business operations, Hadoop batch jobs kick off at night to conduct deep analysis of data.
  • 48. Hadoop Task Tracker and Data Node in a VM Add/Remove Slot Slots? Slot Virtual Task Tracker Other Hadoop Workload Node Datanode Grow/Shrink by tens of GB? Virtualization Host VMDK Grow/Shrink of a VM is one approach
  • 49. Add/remove Virtual Nodes Slot Slot Slot Slot Virtual Task Tracker Virtual Task Tracker Other Hadoop Hadoop Workload Node Node Datanode Datanode Virtualization Host VMDK VMDK Just add/remove more virtual nodes?
  • 50. But State makes it hard to power-off a node Slot Slot Virtual Task Tracker Other Hadoop Workload Node Datanode Virtualization Host VMDK Powering off the Hadoop VM would in effect fail the datanode
  • 51. Adding a node needs data… Slot Slot Slot Slot Virtual Task Tracker Virtual Task Tracker Other Hadoop Hadoop Workload Node Node Datanode Datanode Virtualization Host VMDK VMDK Adding a node would require TBs of data replication
  • 52. Separated Compute and Data Slot Slot Virtual Slot Virtual Hadoop Slot Virtual Slot Virtual Hadoop Slot Hadoop Node Hadoop Node Node Node Task Tracker Other Task Tracker Task Tracker Workload Virtual Hadoop Datanode Node Virtualization Host VMDK VMDK Truly Elastic Hadoop: Scalable through virtual nodes
  • 53. Dataflow with separated Compute/Data Slot Virtual Slot Virtual Hadoop Hadoop Node Node Datanode NodeManager Virtual NIC Virtual NIC Virtualization Host Virtual Switch VMDK NIC Drivers
  • 54. Elastic Compute. Set number of active TaskTracker nodes: > cluster limit --name myHadoop --nodeGroup worker --activeComputeNodeNum 8 Enable all the TaskTrackers in the cluster: > cluster unlimit --name myHadoop
  • 55. Performance Analysis of Separation. Combined mode: 1 combined Compute/Datanode VM per host. Split mode: 1 Datanode VM, 1 Compute node VM per host. Workload: Teragen, Terasort, Teravalidate. HW Configuration: 8 cores, 96GB RAM, 16 disks per host x 2 nodes.
  • 56. Performance Analysis of Separation. Minimum performance impact with separation of compute and data. [Chart: elapsed time as a ratio to combined mode, Combined versus Split, for Teragen, Terasort, and Teravalidate.]
  • 57. Freedom of Choice and Open Source Distributions Community Projects •  Flexibility to choose from major distributions cluster create --name myHadoop --distro apache •  Support for multiple projects •  Open architecture to welcome industry participation •  Contributing Hadoop Virtualization Extensions (HVE) to open source community
  • 58. Elastic, Multi-tenant Hadoop on Demand. Richard McDougall, Chief Architect, Application Infrastructure and Big Data, VMware, Inc. @richardmcdougll. ApacheCon Europe, 2012. http://www.vmware.com/hadoop http://cto.vmware.com/ http://projectserengeti.org http://github.com/vmware-serengeti © 2009 VMware Inc. All rights reserved