Fei Dong
Duke University
 April 6, 2012
•   Introduction
  •   Optimizing Multi-Job Workflows
  •   Optimizing Iterative Workflows
  •   Optimizing Key-Value Stores
  •   Alidade: A Real-Life Application
  •   Summary
  •   Questions and Answers


10/4/2012                  2             Starfish-E
Typical Hadoop Stack versus New Software and Model
• High Level
   – Typical: MapReduce Programs (Java), Hadoop Streaming (Python, Ruby)
   – New: Pig, Hive, Cascading, Oozie, Sqoop, Jaql
• Hadoop Core
   – Typical: MapReduce Execution Engine, Distributed File System
   – New: Iterative Model, EMR, MRv2, HBase, ElephantDB
• Physical Level
   – Typical: Physical Machine (CPU, SATA Disk)
   – New: Virtual Machine (EC2 Unit), SSD


• Optimizers (Job, Workflow, Workload, Data layout, Cluster)
   – Search through the space of tuning choices
• Profiler
   – Collects concise summaries of execution
• What-if Engine
   – Estimates the impact of hypothetical changes on execution

Starfish limitation: it focuses on individual MapReduce jobs on Hadoop.
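
The interplay of the three components can be illustrated with a minimal sketch. It is not Starfish code: the `JobProfile` fields, the toy cost formula in `what_if_runtime`, and the parameter names `map_slots`, `reduce_tasks`, and `compress_shuffle` are all assumptions made for illustration. The profile stands in for the Profiler's summary, the cost function for the What-if Engine, and the exhaustive search for an Optimizer.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class JobProfile:
    """Concise summary of one profiled run (all fields are illustrative)."""
    input_mb: float           # data read by the map phase, in MB
    map_mb_per_sec: float     # observed throughput of one map slot
    shuffle_mb: float         # intermediate data sent to reducers, in MB
    reduce_mb_per_sec: float  # observed throughput of one reduce task


def what_if_runtime(profile, map_slots, reduce_tasks, compress_shuffle):
    """Toy what-if model: predicted runtime in seconds for a hypothetical setting."""
    # Assumption: compression halves the shuffled data.
    shuffle_mb = profile.shuffle_mb * (0.5 if compress_shuffle else 1.0)
    map_time = profile.input_mb / (profile.map_mb_per_sec * map_slots)
    reduce_time = shuffle_mb / (profile.reduce_mb_per_sec * reduce_tasks)
    startup = 2.0 * reduce_tasks  # assumed fixed overhead per reduce task
    return map_time + reduce_time + startup


def optimize(profile, space):
    """Enumerate every setting in `space` and return (best_config, predicted_cost)."""
    names = list(space)
    best = None
    for values in product(*(space[name] for name in names)):
        config = dict(zip(names, values))
        cost = what_if_runtime(profile, **config)
        if best is None or cost < best[1]:
            best = (config, cost)
    return best
```

A real optimizer would search a far larger space with a calibrated model; the point here is only that no candidate setting has to be executed to be compared.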

Starfish-Extended


High-level layers such as Pig, Hive, and Cascading have evolved
  on top of Hadoop to support comprehensive workflows.




       Can we optimize such workflows with Starfish?

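• Data is processed iteratively.
 • The MapReduce framework does not directly support
   iterations.

[Figure: Input → J1 → J2 → J3, with J4 producing the final Output.
Output1 of J1 is Input2 of J2, Output2 of J2 is Input3 of J3, and
Output3 of J3 feeds back as Input2 for n loop iterations.]

  Can we support iterative execution in a workflow?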
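
The loop in the figure is typically driven from outside MapReduce. A minimal sketch of such a driver follows, assuming one reading of the figure: J1 runs once, J2 and J3 form the loop body, and J4 runs once at the end. Each "job" is a plain Python function standing in for a MapReduce job, and `run_iterative_workflow` is a hypothetical name.

```python
def run_iterative_workflow(data, setup_job, loop_jobs, final_job, n):
    """Run the setup job once, the loop body n times, then the final job."""
    current = setup_job(data)      # J1: Input -> Output1
    for _ in range(n):             # Loop: n
        for job in loop_jobs:      # J2, J3: each output is the next job's input
            current = job(current)
    return final_job(current)      # J4: produces the final Output
```

Because the driver only passes one job's output to the next, every iteration pays the full job start-up and I/O cost, which is what makes iterative workflows a target for optimization.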
• HDFS: Replication, Fault tolerance, Scalability
  • HBase: Hosts very large tables (billions of rows
    × millions of columns).

     Can we optimize a storage system like HBase?

• Rule-Based Optimization (RBO)
     – Uses a set of rules to determine how to execute a
       plan.
  • Cost-Based Optimization (CBO)
     – Chooses the cheapest plan, i.e. the one that uses the
       least amount of resources.
  • Starfish employs a CBO approach for MapReduce
    programs.
          Can we put RBO and CBO together?


1. MapReduce Workflow Optimizer in Cascading

  2. Iterative Workflow Optimizer

  3. Key-Value Store Optimizer using rule-based
     techniques




• Cascading
     – Data processing API on Hadoop
     – Flows of operations, not jobs




• Replace the old Hadoop API with the new API
  • Cascading Profiler
     – Job Graph + Conf Graph to represent a workflow
  • Cascading What-if Engine
  • Cascading Optimizer
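
One possible reading of "Job Graph + Conf Graph" is a dependency graph over jobs paired with a per-job configuration map. The sketch below illustrates that reading only; the `Workflow` class and its methods are hypothetical and are not part of Cascading or Starfish. It uses `graphlib` from the Python standard library (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Workflow:
    # Job graph: job name -> set of job names it depends on.
    deps: dict = field(default_factory=dict)
    # Conf graph: job name -> configuration parameters for that job.
    confs: dict = field(default_factory=dict)

    def add_job(self, name, depends_on=(), conf=None):
        """Register a job, its upstream jobs, and its configuration."""
        self.deps[name] = set(depends_on)
        self.confs[name] = dict(conf or {})

    def execution_order(self):
        """Return the jobs in an order that respects all dependencies."""
        return list(TopologicalSorter(self.deps).static_order())
```

Keeping the two structures side by side lets an optimizer change one job's configuration without touching the shape of the workflow.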




• The jobs have the same execution behavior
    across iterations → we can use a single
    “iterative” profile.
  • Combine MapReduce jobs into a logical unit of
    work (inspired by Oozie)
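
The idea of a single "iterative" profile can be sketched as follows, assuming the loop body is profiled once and that measurement is reused for every iteration. The function name, the profile format (job name mapped to seconds), and the optional overhead term are illustrative assumptions.

```python
def estimate_iterative_runtime(profile_iteration, n, per_iteration_overhead=0.0):
    """Profile the loop body once and reuse that profile for all n iterations."""
    profile = profile_iteration()        # e.g. {"J2": 120.0, "J3": 45.0}, in seconds
    one_pass = sum(profile.values())     # the loop body treated as one logical unit
    return n * (one_pass + per_iteration_overhead)
```

The estimate is only as good as the assumption stated on the slide: the jobs behave the same way in every iteration.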




• PageRank: 10G page graphs

[Figure: running time (s), on an axis from 0 to 3500, versus total iterations (1, 2, 4, 6, 10), comparing Original and Optimization.]


High
  5. HBase Process: e.g. splits, compactions
  4. HBase Schema: e.g. compression, bloom filter
  3. HBase Configuration: e.g. garbage collection, heap
  2. Hadoop Configuration: e.g. xciever, handlers
  1. Operating System: e.g. ulimit, nproc
Low



• JVM Settings:
     – "-server -XX:+UseParallelGC -XX:ParallelGCThreads=8
       -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError".
     – The parallel GC leverages multiple CPUs.




• Recommended instance types for running HBase: c1.xlarge, m2.xlarge.
  • Isolate the HBase cluster to avoid memory competition with other
    services.
  • Factors affecting write performance, in order of impact:
    HLog > Split > Compact.
  • If applications do not require strict data durability, disabling
    the HLog can yield a 2X speedup.
  • Compression saves storage space. Snappy provides high
    speed and reasonable compression.
  • In a read-heavy system, using bloom filters that match the
    update or read patterns can save a huge amount of I/O.
  • …
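
The recommendations above lend themselves to a rule-based check. The sketch below is a hypothetical illustration, not the optimizer from the talk: the `Workload` fields, the 0.8 read-fraction threshold, and the advice strings are assumptions, and it only prints advice rather than changing any HBase setting.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    read_fraction: float     # share of operations that are reads, 0.0 to 1.0
    needs_durability: bool   # must every write survive a crash?
    shares_nodes: bool       # does HBase share machines with other services?
    compression: str = "none"


# Each rule is (condition, advice); rules fire independently of each other.
RULES = [
    (lambda w: w.shares_nodes,
     "Isolate the HBase cluster to avoid memory competition."),
    (lambda w: not w.needs_durability,
     "Consider disabling the HLog for faster writes."),
    (lambda w: w.compression == "none",
     "Enable compression, for example Snappy."),
    (lambda w: w.read_fraction >= 0.8,  # assumed threshold for "read-heavy"
     "Enable bloom filters that match the access pattern."),
]


def recommend(workload):
    """Return the advice of every rule whose condition holds."""
    return [advice for condition, advice in RULES if condition(workload)]
```

Unlike a cost-based search, such rules need no profile of the running system, which is why they suit settings whose effect is hard to model.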


• Alidade is a constraint-based geolocation
    system.
  • Alidade has two phases.
     – Preprocessing data
     – Iterative geolocation




1. Iterative Model
  2. Heavy Computation
     – Represent polygons on a spherical surface.
     – Calculate intersections of polygons.
  3. Large Scale Data
  4. Limited Resource Allocation
     – Depends on many services such as
       HDFS, JobTracker, TaskTracker, and HBase.


•   Hadoop CDH3U3 based on 0.20.2
  •   HBase CDH3U3 based on 0.90.4
  •   11 m1.large nodes, 11 m2.xlarge nodes
  •   30 map slots and 20 reduce slots
  •   Workflow:
      – YCSB generates 10M records and a set of
        read/write workloads.
      – Alidade generates 70M records after translating.


Configuration         11 m1.large   21 m1.large   11 m2.xlarge
Write Capacity        43593/s       87012/s       58273/s
CPU                   44            88            72
Storage Capacity      8.5T          17T           17T
Nodes cost per hour   $3.5          $7.1          $5.0
Traffic Cost          $4            $8            $4
Setup Duration        2hr           2hr           2hr
AWS Billed Duration   19hr          11hr          12hr
Total Cost            $68.6         $82.8         $68.8

• Alidade is a CPU-intensive job.
    “IntersectionWritable.solve” accounts for most
    of the execution time (> 70%).
  • Currently, the Starfish Optimizer is better suited to
    I/O-intensive jobs.
  • Alidade helped improve Starfish profiling
    (reduced overhead for “sequencefiles”).
  • Memory issue for HBase


• Extended Starfish to support the evolving
    Hadoop system
     – Automatic tuning of Cascading workflows, which can
       boost performance by 20% to 200%.
     – Support for iterative workflows using a simple syntax.
     – Optimization of key-value stores in Hadoop.
     – Combined the cost-based optimizer and the rule-based
       optimizer to get good performance in a complex
       real-life workflow.


Thanks
              




Extend Starfish to Support the Growing Hadoop Ecosystem

  • 1. Fei Dong, Duke University, April 6, 2012
  • 2. Introduction • Optimizing Multi-Job Workflows • Optimizing Iterative Workflows • Optimizing Key-Value Stores • Alidade: A Real-Life Application • Summary • Questions and Answers 10/4/2012 2 Starfish-E
  • 4. Typical Hadoop Stack (left) and New Software and Model (right)
      – High Level: MapReduce Programs (Java), Hadoop Streaming (Python, Ruby) | Oozie, Sqoop, Jaql, Pig, Hive, Cascading
      – Hadoop Core: MapReduce Execution Engine | Iterative Model, EMR, MRv2
      – Hadoop Core: Distributed File System | HBase, ElephantDB
      – Physical Level: Physical Machine, Virtual Machine; CPU, SATA Disk, EC2 Unit, SSD
  • 5. Starfish components:
      – Optimizers: search through the space of tuning choices (Job, Workflow, Workload, Data layout, Cluster)
      – Profiler: collects concise summaries of execution
      – What-if Engine: estimates the impact of hypothetical changes on execution
      – Starfish limitation: focus on individual MapReduce jobs on Hadoop
  • 7. High-level layers such as Pig, Hive, and Cascading have evolved over Hadoop to support comprehensive workflows. Can we optimize such workflows with Starfish?
  • 8. • Data is processed iteratively. • The MapReduce framework does not directly support iterations. [Figure: a loop of n iterations over jobs J1, J2, J3, where each job's output is the next job's input, followed by J4, which produces the final output.] Can we support iterative execution in a workflow?
  • 9. • HDFS: replication, fault tolerance, scalability • HBase: hosts very large tables – billions of rows × millions of columns. Can we optimize a storage system like HBase?
  • 10. • Rule-Based Optimization (RBO) – uses a set of rules to determine how to execute a plan. • Cost-Based Optimization (CBO) – the cheapest plan is the one that uses the least amount of resources. • Starfish employs the CBO approach for MapReduce programs. Can we put RBO + CBO together?
  • 11. 1. MapReduce Workflow Optimizer in Cascading 2. Iterative Workflow Optimizer 3. Key-Value Stores Optimizer using rule-based technology
  • 12. • Cascading – Data processing API on Hadoop – A flow of operations, not jobs
  • 13. • Replace the old Hadoop API with the new API • Cascading Profiler – Job Graph + Conf Graph to represent a workflow • Cascading What-if Engine • Cascading Optimizer
  • 16. • The jobs have the same execution behavior across iterations → we can use a single “iterative” profile. • Combine MapReduce jobs into a logical unit of work (inspired by Oozie).
  • 17. • PageRank: 10G page graphs [Chart: Running Time (s), 0 to 3500, against Total Iteration (1, 2, 4, 6, 10), comparing Original and Optimization.]
  • 18. HBase tuning layers, from high to low:
      – 5. HBase Process, e.g. splits, compactions
      – 4. HBase Schema, e.g. compression, bloom filter
      – 3. HBase Configuration, e.g. garbage collection, heap
      – 2. Hadoop Configuration, e.g. xciever, handlers
      – 1. Operating System, e.g. ulimit, nproc
  • 19. • JVM Settings: – "-server -XX:+UseParallelGC -XX:ParallelGCThreads=8 -XX:+AggressiveHeap -XX:+HeapDumpOnOutOfMemoryError". – The parallel GC leverages multiple CPUs.
  • 23. • Recommended instance types for running HBase: c1.xlarge, m2.xlarge. • Isolate the HBase cluster to avoid memory competition with other services. • Factors that affect writes: HLog > Split > Compact. • If applications do not require strict data durability, disabling the HLog can give a 2X speedup. • Compression saves storage space. Snappy provides high speed and reasonable compression. • In a read-busy system, using bloom filters that match the update or read patterns can save a huge amount of I/O. • …
  • 24. • Alidade is a constraint-based geolocation system. • Alidade has two phases. – Preprocessing data – Iterative geolocation
  • 26. 1. Iterative Model 2. Heavy Computation – Represent polygons on a spherical surface. – Calculate the intersection of polygons. 3. Large-Scale Data 4. Limited Resource Allocation – Depends on many services such as HDFS, JobTracker, TaskTracker, HBase, etc.
  • 27. • Hadoop CDH3U3 based on 0.20.2 • HBase CDH3U3 based on 0.90.4 • 11 m1.large nodes, 11 m2.xlarge nodes • 30 map slots and 20 reduce slots • Workflow: – YCSB generates 10M records and some read/write workloads. – Alidade generates 70M records after translating.
  • 32.
      Metric                  11 m1.large   21 m1.large   11 m2.xlarge
      Write Capacity          43593/s       87012/s       58273/s
      CPU                     44            88            72
      Storage Capacity        8.5T          17T           17T
      Nodes cost per hour     $3.5          $7.1          $5.0
      Traffic Cost            $4            $8            $4
      Setup Duration          2hr           2hr           2hr
      AWS Billed Duration     19hr          11hr          12hr
      Total Cost              $68.6         $82.8         $68.8
  • 33. • Alidade is a CPU-intensive job. “IntersectionWritable.solve” accounts for most of the execution time (> 70%). • Currently, the Starfish Optimizer is better suited to I/O-intensive jobs. • Alidade helped Starfish improve profiling (reduced overhead for “sequencefiles”). • Memory issue for HBase.
  • 34. • Extended Starfish to support the evolving Hadoop ecosystem – Automatic tuning of Cascading workflows. We can boost performance by 20% to 200%. – Support for iterative workflows using a simple syntax. – Optimization of key-value stores in Hadoop. – Leveraged the cost-based optimizer and the rule-based optimizer to get good performance in a complex real-life workflow.
  • 36. Thanks
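
Slides 8 and 16 describe iterative workflows: the same jobs run in a loop, each job's output feeds the next job's input, and because the jobs behave the same way across iterations a single "iterative" profile is enough. The following is a minimal Python sketch of one way to read that idea, not Starfish-E's actual implementation. The function `run_iterative_workflow`, the `run_job` callback, and the path naming are all hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, List, Tuple


def run_iterative_workflow(
    loop_jobs: List[str],
    iterations: int,
    run_job: Callable[[str, str, bool], str],
    initial_input: str,
) -> Tuple[str, Dict[str, int]]:
    """Run loop_jobs in order, `iterations` times, chaining outputs to inputs.

    Each job is profiled only during the first iteration. Later iterations
    reuse that single "iterative" profile and run without profiling.
    Returns the final output path and the number of profiled runs per job.
    """
    profiled_runs: Dict[str, int] = {name: 0 for name in loop_jobs}
    current = initial_input
    for i in range(iterations):
        for name in loop_jobs:
            with_profiling = i == 0  # assumption: first iteration is representative
            if with_profiling:
                profiled_runs[name] += 1
            # The output of this job becomes the input of the next one.
            current = run_job(name, current, with_profiling)
    return current, profiled_runs


def fake_run_job(name: str, input_path: str, with_profiling: bool) -> str:
    """Hypothetical stand-in for submitting a MapReduce job; returns an output path."""
    return f"{input_path}/{name}"


final_output, profiled = run_iterative_workflow(
    ["J1", "J2", "J3"], 2, fake_run_job, "/input"
)
```

With three jobs and two iterations, six jobs run but only three are profiled, which is where the saving in profiling overhead would come from under this reading.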

Editor's notes

  1. Welcome to my defense. I am Fei Dong from the computer science department. My project title is "Extend Starfish to Support the Growing Hadoop Ecosystem". In the background, you see an elephant and a starfish. They seem to have little connection in biology; we will figure out some magic connection in my lecture.
  2. Data is growing fast, from KB to PB. Many companies face big-data problems and challenges: scalability, reliability, performance. How do we store data and retrieve it efficiently?
  3. Hadoop history: it started from Nutch, by Doug Cutting, and grew at Yahoo; Cloudera and Hortonworks focus on it. It is a large-scale batch data processing platform. Some ideas come from papers published by Google: GFS, MapReduce. It is an open-source top-level project of Apache.
  4. How to use Hadoop? You can follow some tutorials, but we also care about performance. However, there are more than 190 parameters, which are hard to tune manually to get good performance. Starfish is a research project led by Prof. Shivnath Babu. It has had some impact in academia.
  5. The main difference with Cascading is its Java API, so users can easily pick it up without any burden.
  6. E.g., the PageRank and k-means algorithms.
  7. BigTable. Think about a database: it is not scalable, while HBase is a NoSQL store that scales out with auto-sharding.
  8. E.g., Cloudera suggests setting the number of reducers close to the number of reduce slots the cluster owns. Cost model.
  9. Encapsulates abstract operations on data: Each, GroupBy, … Connected by a Pipe.
  10. DAG: directed acyclic graph.
  11. HBase often has more concurrent clients: increase dfs.datanode.max.xcievers from 256 to 1024. nproc: maximum number of processes. Handlers: number of I/O threads. Xciever: an upper bound on the number of files that a DataNode will serve at any one time.
  12. A lot of misses during reads. Speed up reads by cutting down internal lookups.
  13. Slots: capacity, the number of concurrently running processes.
  14. It is not always good to increase reduce slots, due to the server bottleneck.
  15. Synchronized time
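
Note 8 pairs a rule (keep the number of reducers close to the cluster's reduce slots) with a cost model, which is the RBO + CBO combination the slides ask about. The following Python sketch shows one possible way to combine them: the rule first prunes the candidate configurations, then a cost function picks the cheapest survivor. Everything here is an assumption made for illustration: the function names, the reading of "close" as between half and all of the slots, and the toy cost formula, which is not the Starfish What-if Engine.

```python
from typing import Callable, Dict, List

Config = Dict[str, int]


def rule_reducers_near_slots(cfg: Config, reduce_slots: int) -> bool:
    """Rule-based check: reducer count should be close to the reduce slots.

    Assumption: "close" means between half of the slots and all of them.
    """
    return reduce_slots // 2 <= cfg["reducers"] <= reduce_slots


def choose_config(
    candidates: List[Config],
    reduce_slots: int,
    estimate_cost: Callable[[Config], float],
) -> Config:
    """Apply the rule to prune candidates, then pick the cheapest one by cost."""
    allowed = [c for c in candidates if rule_reducers_near_slots(c, reduce_slots)]
    if not allowed:
        # No candidate satisfies the rule: fall back to pure cost-based choice.
        allowed = list(candidates)
    return min(allowed, key=estimate_cost)


def toy_cost(cfg: Config) -> float:
    """Hypothetical cost model: more reducers means less work per reducer,
    plus a small per-reducer overhead. It ignores slot limits on purpose."""
    return 1200.0 / cfg["reducers"] + cfg["reducers"]


candidates = [{"reducers": r} for r in (5, 10, 20, 40)]
# 20 reduce slots, as in the cluster described on slide 27.
best = choose_config(candidates, reduce_slots=20, estimate_cost=toy_cost)
cost_only = min(candidates, key=toy_cost)
```

In this toy setup the cost function alone prefers 40 reducers because it knows nothing about slots, while the rule restricts the choice to what the 20 slots can run in one wave, so the combined choice is 20 reducers. The design point is that a rule can encode cluster knowledge that a simple cost model lacks.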