Beyond Map/Reduce: Getting Creative with Parallel Processing

Ed Kohlwey
@ekohlwey
kohlwey_edmund@bah.com
Overview
•  Within the last year:
  –  Two cluster schedulers have been released
  –  Two BSP frameworks have been released
  –  An in-memory Map/Reduce has been
     released
  –  Accumulo has been released
•  More importantly
  –  We have been given the tools to program in
     something besides Map/Reduce and MPI
What About…
•  This talk covers a few specific frameworks
•  There’s lots more out there
Motivations for Schedulers

The cornerstone of new cluster computing environments
Different Tasks Have Different Needs
[Figure: Tasks A, B, and C, each shown with its own mix of CPU, RAM, and number of hosts.]
Clusters Often Don’t Accommodate This
[Figure: two charts for Tasks A, B, and C, "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task". Types of hosts in cluster: Type 1.]
This is How It Should Look
[Figure: two charts for Tasks A, B, and C, "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task". Types of hosts in cluster: Type 1 and Type 2.]
Economic Reasons
[Figure: Power Consumption plotted against Load.]
Simple Example: a Work Queue
•  Data scientists execute serial
   implementations of machine learning
   algorithms
•  Some are expensive, some are not
•  Scientists aren’t running analyses all the time
•  Solution 1:
   –  Give all the analysts a big workstation
•  Solution 2:
   –  Give all the analysts thin clients and let them
      share a cluster
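Solution 2 amounts to a shared work queue: analysts submit jobs, and a fixed set of hosts works through them. A minimal single-machine sketch in Java, assuming a thread pool stands in for the cluster hosts; the class name SharedWorkQueue and the sumTo job are hypothetical and only illustrate the idea, not any scheduler's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a fixed pool of worker threads stands in for the shared cluster hosts.
final class SharedWorkQueue {
    // Queues every analysis on a pool of `hosts` workers and returns the results in submission order.
    static List<Long> runAll(List<Callable<Long>> analyses, int hosts) {
        ExecutorService cluster = Executors.newFixedThreadPool(hosts);
        try {
            List<Long> results = new ArrayList<>();
            // invokeAll blocks until every queued job has finished.
            for (Future<Long> done : cluster.invokeAll(analyses)) {
                results.add(done.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("analysis failed", e);
        } finally {
            cluster.shutdown();
        }
    }

    // Stand-in for a serial analysis job: sums the integers 1..n.
    static Callable<Long> sumTo(final long n) {
        return () -> {
            long sum = 0;
            for (long i = 1; i <= n; i++) {
                sum += i;
            }
            return sum;
        };
    }
}
```

Cheap and expensive jobs share the same workers, so adding one more worker raises capacity for every analyst at once.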
Advantages of Moving to a Thin Client/Cluster Model
•  Scalability
  –  All analyst capabilities can be enhanced by
     adding one host
•  Increases resource utilization
  –  Workstations are expensive, and will be
     highly under-utilized
•  Increases availability
  –  Using a distributed file system to store data
Desirable Scheduler Features
Feature                                                     YARN   Mesos
Operate on heterogeneous clusters                           Y      Y
Highly available                                            Y      Y
Pluggable scheduling policies                               Y      Y
Authentication                                              Y      N
Task artifact distribution                                  Y      P
Scheduling policy based on multiple resources (RAM, CPU)    N      Y
Multiple queues                                             Y      N
Fast accept/reject model                                    N      P
Reusable method of describing resource requirements         Y      N
Pluggable isolation                                         N      Y
“Compute Units”                                             N      N
New Compute Environments

BSP, In-Memory Map/Reduce, and Streaming Processing
(Hadoop) Map/Reduce Pros & Cons
•  Map/Reduce implements partitioned,
   parallel sorting
  –  Many algorithms (relational ones, for example) are expressed well this way
  –  Creates O(n lg(n)) runtime constraints for
     some problems that wouldn’t otherwise have
     them
•  Hadoop M/R is good for bulk jobs
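To make the sorting cost concrete, here is a minimal sketch in plain Java (not Hadoop code) that sums values per key in two ways: by sorting all pairs first, as the Map/Reduce shuffle does, and by hashing, which needs only one pass. The class and method names are assumptions made for illustration:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SortVersusHash {
    // Convenience constructor for a (key, value) pair.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<String, Integer>(key, value);
    }

    // Map/Reduce style: sort every pair by key, then reduce each run of equal keys.
    // The sort is the O(n lg(n)) step.
    static Map<String, Integer> sumBySorting(List<Map.Entry<String, Integer>> pairs) {
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(pairs);
        sorted.sort(Map.Entry.<String, Integer>comparingByKey());
        Map<String, Integer> out = new LinkedHashMap<>();
        int i = 0;
        while (i < sorted.size()) {
            String key = sorted.get(i).getKey();
            int sum = 0;
            // Consume the whole run of records that share this key.
            while (i < sorted.size() && sorted.get(i).getKey().equals(key)) {
                sum += sorted.get(i).getValue();
                i++;
            }
            out.put(key, sum);
        }
        return out;
    }

    // Hash-based alternative: a single pass with expected O(n) cost and no sort.
    static Map<String, Integer> sumByHashing(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> out = new HashMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            out.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return out;
    }
}
```

Both methods return the same totals; only the first one pays for a sort the problem itself does not require.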
In-Memory Map/Reduce
•  Memory is fast
•  Often, after the map phase, a whole data
   set can fit in the memory of the cluster
•  Spark provides this, as well as a very
   succinct programming environment
   courtesy of Scala and its closures
In-Memory Performance
[Figure: Logistic Regression Performance Comparison, Hadoop vs. Spark. x-axis: Iterations (5, 10, 20, 30); y-axis: Time (s), 0 to 4000.]
*Numbers taken from http://spark-project.org
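Logistic regression is iterative: every iteration scans the same data set again, so a data set that stays in memory is read from storage only once. A minimal single-machine sketch in plain Java (not Spark code), assuming a one-dimensional model trained by full-batch gradient descent; the class name, the model, and the learning rate are assumptions for illustration:

```java
final class InMemoryIterations {
    // One full-batch gradient step for 1-D logistic regression with labels 0 or 1.
    static double step(double w, double[] xs, int[] ys, double rate) {
        double gradient = 0.0;
        for (int i = 0; i < xs.length; i++) {
            double predicted = 1.0 / (1.0 + Math.exp(-w * xs[i]));
            gradient += (predicted - ys[i]) * xs[i];
        }
        return w - rate * gradient / xs.length;
    }

    // Every iteration reuses the same in-memory arrays; nothing is reloaded between passes.
    static double train(double[] xs, int[] ys, int iterations, double rate) {
        double w = 0.0;
        for (int it = 0; it < iterations; it++) {
            w = step(w, xs, ys, rate);
        }
        return w;
    }
}
```

In a disk-based Map/Reduce job, each call to step would correspond to a separate job that rereads its input.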
Spark Wordcount
val file = spark.textFile("hdfs://...")
file.flatMap(line => line.split(" "))
    .map(word => (word, 1))
    .reduceByKey(_ + _)
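The three steps (split lines into words, pair each word with 1, add the counts per word) can be mimicked on one machine with Java streams. This is a local sketch of what the pipeline computes, not Spark code; the class name LocalWordCount is an assumption:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class LocalWordCount {
    static Map<String, Integer> count(List<String> lines) {
        return lines.stream()
            // flatMap: one line becomes many words
            .flatMap(line -> Arrays.stream(line.split(" ")))
            // map to (word, 1) and reduce by key in one collector: counts for equal words are added
            .collect(Collectors.toMap(word -> word, word -> 1, Integer::sum));
    }
}
```

For example, LocalWordCount.count(Arrays.asList("a b a", "b c")) maps "a" to 2, "b" to 2, and "c" to 1.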
Hadoop Wordcount
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  // Map: emit (word, 1) for every token in the input line.
  public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {

     private final static IntWritable one = new IntWritable(1);
     private Text word = new Text();

     public void map(Object key, Text value, Context context
                       ) throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
           word.set(itr.nextToken());
           context.write(word, one);
        }
     }
  }

  // Reduce (also used as the combiner): sum the counts for one word.
  public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
     private IntWritable result = new IntWritable();

     public void reduce(Text key, Iterable<IntWritable> values,
                          Context context
                          ) throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
           sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
     }
  }

  // Driver: configure the job and submit it.
  public static void main(String[] args) throws Exception {
     Configuration conf = new Configuration();
     String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
     if (otherArgs.length != 2) {
        System.err.println("Usage: wordcount <in> <out>");
        System.exit(2);
     }
     Job job = new Job(conf, "word count");
     job.setJarByClass(WordCount.class);
     job.setMapperClass(TokenizerMapper.class);
     job.setCombinerClass(IntSumReducer.class);
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
     job.setOutputValueClass(IntWritable.class);
     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Streaming Processing: Accumulo
•  Accumulo is a BigTable implementation
•  Idea: accumulate values in a column
    –  “map” using the ETL process
•  Summarize values (stored in sorted order) at read-time
    –  “reduce” process
•  No control over partitioning outside a row
    –  Accumulo doesn’t suffer from the column family problem that HBase
       has, so this is ok
•  Less consistent than Map/Reduce because race conditions can
   occur with respect to the scan cursor
•  Iterator programming environment allows you to compose “reduce”
   operations
•  Implementing streaming Map/Reduce over a BigTable
   implementation is a hybrid of in-memory and disk-based
   approaches
•  Allows revision of figures due to data provenance issues
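The accumulate-then-summarize idea can be sketched without any Accumulo dependency. The class below is a hypothetical stand-in for a sorted key/value table and does not use the Accumulo API: ingest plays the role of the ETL "map", and read folds the stored values at read time, with the "reduce" operation passed in so that different summaries can be applied to the same stored data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;
import java.util.function.BinaryOperator;

// Hypothetical stand-in for a sorted key/value table; this is not the Accumulo API.
final class ReadTimeSummary {
    // Keys are kept in sorted order, as in a BigTable-style store.
    private final TreeMap<String, List<Long>> cells = new TreeMap<>();

    // "map": the ETL process appends a raw value under row:column.
    void ingest(String row, String column, long value) {
        cells.computeIfAbsent(row + ":" + column, k -> new ArrayList<>()).add(value);
    }

    // "reduce": fold the values stored for one cell at the moment it is read.
    long read(String row, String column, long identity, BinaryOperator<Long> combine) {
        long acc = identity;
        for (long v : cells.getOrDefault(row + ":" + column, Collections.<Long>emptyList())) {
            acc = combine.apply(acc, v);
        }
        return acc;
    }
}
```

Because the raw values are kept, a figure can be recomputed later with a different combine function, for example Long::sum for a total or Long::max for a peak.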
BSP

Generalizing Map/Reduce for graph processing
BSP
•  First proposed by Valiant in 1990
•  Good at expressing iterative computation
•  Good at expressing graph algorithms
•  Concerned with passing messages
   between virtual processors
•  Perhaps the most famous implementation
   is Pregel
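A BSP program repeats three phases, called a superstep: compute, exchange messages, synchronize. The sketch below simulates this on one thread with a common teaching example, propagating the maximum value through a graph. It is an assumption-laden illustration, not code from Pregel or any BSP framework; the class name is hypothetical, and every vertex that appears in an edge list is assumed to have an initial value.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BspMaxValue {
    // Runs supersteps until no messages are in flight; returns each vertex's final value.
    static Map<String, Integer> run(Map<String, Integer> initial, Map<String, List<String>> edges) {
        Map<String, Integer> value = new HashMap<>(initial);
        // Superstep 0: every vertex announces its value to its neighbours.
        Map<String, List<Integer>> inbox = new HashMap<>();
        for (String v : value.keySet()) {
            send(inbox, edges, v, value.get(v));
        }
        while (!inbox.isEmpty()) {
            Map<String, List<Integer>> outbox = new HashMap<>();
            // Compute: a vertex with mail keeps the largest value seen so far.
            for (Map.Entry<String, List<Integer>> mail : inbox.entrySet()) {
                String v = mail.getKey();
                int best = value.get(v);
                for (int m : mail.getValue()) {
                    best = Math.max(best, m);
                }
                if (best > value.get(v)) {
                    value.put(v, best);
                    // Exchange: only vertices whose value changed send messages.
                    send(outbox, edges, v, best);
                }
            }
            // Synchronize: the barrier. Messages become visible in the next superstep only.
            inbox = outbox;
        }
        return value;
    }

    private static void send(Map<String, List<Integer>> box, Map<String, List<String>> edges,
                             String from, int payload) {
        for (String to : edges.getOrDefault(from, Collections.<String>emptyList())) {
            box.computeIfAbsent(to, k -> new ArrayList<>()).add(payload);
        }
    }
}
```

The loop stops by itself once a superstep produces no messages, which is how iterative graph algorithms typically terminate in this model.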
MR Graph Traversal
[Figure: a pipeline with the stages Map, Sort + Shuffle, and Reduce. Vertex records A_n, B_n, and C_n enter the Map stage.]
MR Graph Traversal
[Figure: the Map, Sort + Shuffle, Reduce pipeline with vertex records A_n, B_n, and C_n entering the Map stage. A callout at A reads: "I want to send a message to C!"]
MR Graph Traversal
[Figure: Map output in the Map, Sort + Shuffle, Reduce pipeline. A emits its own record A_n plus a message record C_m; B emits B_n; C emits C_n.]
MR Graph Traversal
[Figure: after Sort + Shuffle in the Map, Sort + Shuffle, Reduce pipeline, the records are grouped by vertex: A_n; B_n; C_n together with the message C_m.]
MR Graph Traversal
[Figure: Reduce output in the Map, Sort + Shuffle, Reduce pipeline. Each group is reduced to one vertex record (A_n, B_n, C_n). A callout reads: "I got it!"]
MR Graph Traversal
[Figure: the complete Map, Sort + Shuffle, Reduce pipeline for vertices A, B, and C, with the Sort + Shuffle stage annotated O((n+m) lg(n+m)).]
MR Graph Traversal
[Figure: the complete Map, Sort + Shuffle, Reduce pipeline for vertices A, B, and C, annotated "This can be optimized to O(m)".]
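One round of this traversal can be sketched in plain Java (not Hadoop code). The point to notice is that the Map stage re-emits every vertex's state alongside the messages, so n + m records go through the sort. The class names and the combine rule (a message adds the sender's value to the receiver's state) are assumptions for illustration:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class MapReduceGraphRound {
    // One shuffled record: a vertex id plus either that vertex's state or a message addressed to it.
    static final class Rec {
        final String vertex;
        final int payload;

        Rec(String vertex, int payload) {
            this.vertex = vertex;
            this.payload = payload;
        }
    }

    // Map: every vertex re-emits its own state (n records) and one record per outgoing message (m records).
    static List<Rec> map(Map<String, Integer> state, Map<String, List<String>> edges) {
        List<Rec> out = new ArrayList<>();
        for (Map.Entry<String, Integer> v : state.entrySet()) {
            out.add(new Rec(v.getKey(), v.getValue())); // state record
            for (String to : edges.getOrDefault(v.getKey(), Collections.<String>emptyList())) {
                out.add(new Rec(to, v.getValue())); // message record
            }
        }
        return out;
    }

    // Sort + shuffle, then reduce: order all n + m records by vertex and fold each vertex's records together.
    static Map<String, Integer> reduce(List<Rec> records) {
        List<Rec> sorted = new ArrayList<>(records);
        sorted.sort(Comparator.comparing((Rec r) -> r.vertex)); // the O((n+m) lg(n+m)) step
        Map<String, Integer> next = new LinkedHashMap<>();
        for (Rec r : sorted) {
            next.merge(r.vertex, r.payload, Integer::sum); // assumed combine rule: add
        }
        return next;
    }
}
```

With three vertices and a single message from A to C, map produces four records even though only one of them carries new information.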
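The BSP Version
[Figure: a pipeline with the stages Compute, Exchange Messages, and Synchronize. A_n sends a message C_m and C_n sends a message B_m; after synchronization the vertex records are A_n, B_n, and C_n.]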
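The BSP Version
Notice A and C’s message exchange isn’t closely coupled, providing better I/O utilization.
[Figure: the Compute, Exchange Messages, Synchronize pipeline. A_n sends a message C_m and C_n sends a message B_m; after synchronization the vertex records are A_n, B_n, and C_n.]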
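The BSP Version
Also, notice we don’t necessarily have to copy the entire graph state. We just send whatever messages need to be sent.
[Figure: the Compute, Exchange Messages, Synchronize pipeline. A_n sends a message C_m and C_n sends a message B_m; after synchronization the vertex records are A_n, B_n, and C_n.]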
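A single superstep that moves only messages can be sketched as follows. Vertex state stays in place and is updated where it lives; the method returns how many messages were exchanged. This is a single-threaded illustration, not a BSP framework API; the class name and the combine rule (a message adds the sender's value to the receiver's state) are assumptions, and every vertex named in an edge list is assumed to have state.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BspGraphStep {
    // One superstep over mutable vertex state; returns the number of messages exchanged.
    static int superstep(Map<String, Integer> state, Map<String, List<String>> edges) {
        Map<String, Integer> inbox = new HashMap<>();
        int sent = 0;
        // Compute + exchange: each vertex sends its current value along its outgoing edges.
        for (Map.Entry<String, List<String>> e : edges.entrySet()) {
            for (String to : e.getValue()) {
                inbox.merge(to, state.get(e.getKey()), Integer::sum);
                sent++;
            }
        }
        // Synchronize: after the barrier, each receiver folds its messages into its local state.
        for (Map.Entry<String, Integer> m : inbox.entrySet()) {
            state.merge(m.getKey(), m.getValue(), Integer::sum);
        }
        return sent;
    }
}
```

With three vertices and one message from A to C, exactly one record is exchanged, and the untouched state of A and B is never copied.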
BSP Implementations
•  Giraph
  –  Currently an Apache Incubator project
  –  Has a growing community
  –  Runs during the Hadoop Map phase
•  GoldenOrb
  –  Not actively maintained since the summer
•  Both implementations are in-memory,
   modeled after Pregel
Contact Info
Ed Kohlwey
Booz | Allen | Hamilton
@ekohlwey
kohlwey_edmund@bah.com

Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 

Kürzlich hochgeladen (20)

Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 

Beyond Map/Reduce: Getting Creative With Parallel Processing

  • 1. Beyond Map/Reduce: Getting Creative with Parallel Processing
      Ed Kohlwey, @ekohlwey, kohlwey_edmund@bah.com
  • 2. Overview
      •  Within the last year:
        –  Two cluster schedulers have been released
        –  Two BSP frameworks have been released
        –  An in-memory Map/Reduce has been released
        –  Accumulo has been released
      •  More importantly:
        –  We have been given the tools to program in something besides Map/Reduce and MPI
  • 3. What About…
      •  This talk covers a few specific frameworks
      •  There’s lots more out there
  • 4. Motivations for Schedulers
      The cornerstone of new cluster computing environments
  • 5. Different Tasks Have Different Needs
      [Figure: Task A, Task B, and Task C, each labeled with CPU, RAM, and a set of hosts (Host 1 through Host 7)]
  • 6. Clusters Often Don't Accommodate This
      [Figure: two charts over Task A, Task B, and Task C, "Percentage of Cluster Load" and "Expense of Hosts Required to Execute Task"; types of hosts in cluster: Type 1]
  • 7. This is How It Should Look
      [Figure: the same two charts; types of hosts in cluster: Type 1, Type 2]
  • 9. Simple Example: a Work Queue
      •  Data scientists execute serial implementations of machine learning algorithms
      •  Some are expensive, some are not
      •  Scientists aren’t running analyses all the time
      •  Solution 1:
        –  Give all the analysts a big workstation
      •  Solution 2:
        –  Give all the analysts thin clients and let them share a cluster
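A minimal sketch of Solution 2, assuming a fixed thread pool stands in for the shared cluster. The class `SharedWorkQueue`, the `analysis` helper, and the cost values are hypothetical names chosen for illustration; they are not from the talk.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SharedWorkQueue {

    /** Runs every submitted analysis on a shared pool; results come back in submission order. */
    static List<Long> runAll(List<Callable<Long>> analyses, int workers) {
        // The pool plays the role of the shared cluster (an assumption of this sketch).
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Long> results = new ArrayList<>();
            // invokeAll queues all tasks and returns futures in the same order as the input.
            for (Future<Long> f : pool.invokeAll(analyses)) {
                results.add(f.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    /** Stand-in for a serial analysis; "cost" controls how much work it does. */
    static Callable<Long> analysis(int cost) {
        return () -> {
            long sum = 0;
            for (int i = 1; i <= cost; i++) {
                sum += i;
            }
            return sum;
        };
    }

    public static void main(String[] args) {
        // One cheap, one expensive, one cheap job share two workers.
        List<Callable<Long>> jobs =
                Arrays.asList(analysis(10), analysis(1000000), analysis(100));
        System.out.println(runAll(jobs, 2));
    }
}
```

The cheap jobs do not need a large machine of their own; they wait in the queue only while the workers are busy, which is the utilization argument the slide makes.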
  • 10. Advantages of Moving to a Thin Client/Cluster Model
      •  Scalability
        –  All analysts' capabilities can be enhanced by adding one host
      •  Increased resource utilization
        –  Workstations are expensive and will be highly under-utilized
      •  Increased availability
        –  Data is stored in a distributed file system
  • 11. Desirable Scheduler Features
      Feature                                                     YARN   Mesos
      Operate on heterogeneous clusters                           Y      Y
      Highly available                                            Y      Y
      Pluggable scheduling policies                               Y      Y
      Authentication                                              Y      N
      Task artifact distribution                                  Y      P
      Scheduling policy based on multiple resources (RAM, CPU)    N      Y
      Multiple queues                                             Y      N
      Fast accept/reject model                                    N      P
      Reusable method of describing resource requirements         Y      N
      Pluggable isolation                                         N      Y
      “Compute Units”                                             N      N
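To make "scheduling policy based on multiple resources (RAM, CPU)" concrete, here is a minimal first-fit sketch. It is not YARN or Mesos code: the class `ResourceFit`, its `Host` and `Task` types, and the capacities used are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class ResourceFit {

    /** A host with remaining capacity (hypothetical units: cores and megabytes). */
    static final class Host {
        final String name;
        int freeCpus;
        int freeRamMb;

        Host(String name, int freeCpus, int freeRamMb) {
            this.name = name;
            this.freeCpus = freeCpus;
            this.freeRamMb = freeRamMb;
        }
    }

    /** A task's request along both resource dimensions. */
    static final class Task {
        final String name;
        final int cpus;
        final int ramMb;

        Task(String name, int cpus, int ramMb) {
            this.name = name;
            this.cpus = cpus;
            this.ramMb = ramMb;
        }
    }

    /** First fit: accept on the first host with enough CPU and enough RAM, else reject. */
    static Optional<String> place(Task task, List<Host> hosts) {
        for (Host h : hosts) {
            if (h.freeCpus >= task.cpus && h.freeRamMb >= task.ramMb) {
                h.freeCpus -= task.cpus;
                h.freeRamMb -= task.ramMb;
                return Optional.of(h.name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<Host> hosts = new ArrayList<>();
        hosts.add(new Host("small", 2, 4096));
        hosts.add(new Host("big", 16, 65536));
        System.out.println(place(new Task("A", 1, 1024), hosts));   // fits on "small"
        System.out.println(place(new Task("B", 4, 32768), hosts));  // needs "big"
        System.out.println(place(new Task("C", 32, 1024), hosts));  // rejected
    }
}
```

Checking both dimensions at once is what lets a heterogeneous cluster send cheap tasks to cheap hosts; a policy that looks at only one resource would place Task B on a host that cannot hold it.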
  • 12. New Compute Environments
      BSP, In-Memory Map/Reduce, and Streaming Processing
  • 13. (Hadoop) Map/Reduce Pros & Cons
      •  Map/Reduce implements partitioned, parallel sorting
        –  Many algorithms (relational) express well
        –  Creates O(n lg(n)) runtime constraints for some problems that wouldn’t otherwise have them
      •  Hadoop M/R is good for bulk jobs
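The point that Map/Reduce is built around a sort can be seen in a small in-process sketch. This is not Hadoop code: the class `MiniMapReduce` and its three methods are hypothetical, and a `TreeMap` stands in for the distributed sort and shuffle.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {

    /** Map: emit a (word, 1) pair for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
                }
            }
        }
        return pairs;
    }

    /** Sort + shuffle: group values by key in sorted key order; each insert costs O(lg n). */
    static TreeMap<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    /** Reduce: sum the grouped values for each key. */
    static TreeMap<String, Integer> reduce(TreeMap<String, List<Integer>> grouped) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            counts.put(e.getKey(), sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {a=2, b=2, c=1}
        System.out.println(reduce(shuffle(map(Arrays.asList("a b a", "b c")))));
    }
}
```

Counting words does not inherently require sorted keys, yet the middle stage sorts them anyway; that is the O(n lg(n)) constraint the slide refers to.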
  • 14. In-Memory Map/Reduce
      •  Memory is fast
      •  Often, after the map phase, a whole data set can fit in the memory of the cluster
      •  Spark provides this, as well as a very succinct programming environment courtesy of Scala and its closures
  • 15. In-Memory Performance
      [Chart: "Logistic Regression Performance Comparison"; x-axis: Iterations (5, 10, 20, 30); y-axis: Time (s), 0 to 4000; series: Hadoop, Spark]
      *Numbers taken from http://spark-project.org
  • 16. Spark Wordcount
      val file = spark.textFile("hdfs://...")
      file.flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
  • 17. Hadoop Wordcount
      import java.io.IOException;
      import java.util.StringTokenizer;

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.Mapper;
      import org.apache.hadoop.mapreduce.Reducer;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.util.GenericOptionsParser;

      public class WordCount {

        // Mapper: emits (word, 1) for every token in the input line.
        public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(Object key, Text value, Context context
                          ) throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
              word.set(itr.nextToken());
              context.write(word, one);
            }
          }
        }

        // Reducer (also used as the combiner): sums the counts for each word.
        public static class IntSumReducer
            extends Reducer<Text,IntWritable,Text,IntWritable> {

          private IntWritable result = new IntWritable();

          public void reduce(Text key, Iterable<IntWritable> values, Context context
                             ) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
          }
        }

        // Driver: wires the mapper, combiner, and reducer into a job.
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
          if (otherArgs.length != 2) {
            System.err.println("Usage: wordcount <in> <out>");
            System.exit(2);
          }
          Job job = new Job(conf, "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
          FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }
  • 18. Streaming Processing: Accumulo
      •  Accumulo is a BigTable implementation
      •  Idea: accumulate values in a column
        –  “map” using the ETL process
      •  Summarize values (stored in sorted order) at read-time
        –  “reduce” process
      •  No control over partitioning outside a row
        –  Accumulo doesn’t suffer from the column family problem that HBase has, so this is ok
      •  Less consistent than Map/Reduce because race conditions can occur with respect to the scan cursor
      •  Iterator programming environment allows you to compose “reduce” operations
      •  Implementing streaming Map/Reduce over a BigTable implementation is a hybrid of in-memory and disk-based approaches
      •  Allows revision of figures due to data provenance issues
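The accumulate-then-summarize idea can be sketched without any Accumulo dependency. The class `ReadTimeCombiner`, its `put` and `scan` methods, and the `row:column` key format are hypothetical and only mimic the pattern; this is not the Accumulo iterator API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongBinaryOperator;

public class ReadTimeCombiner {

    // Sorted key -> raw values appended at ingest time (the "map" side, done by ETL).
    private final TreeMap<String, List<Long>> cells = new TreeMap<>();

    /** Ingest: append a value under "row:column" without reading existing state. */
    void put(String row, String column, long value) {
        cells.computeIfAbsent(row + ":" + column, k -> new ArrayList<>()).add(value);
    }

    /** Scan: fold the stored values for each key at read time (the "reduce" side). */
    TreeMap<String, Long> scan(LongBinaryOperator combiner) {
        TreeMap<String, Long> summary = new TreeMap<>();
        for (Map.Entry<String, List<Long>> e : cells.entrySet()) {
            List<Long> values = e.getValue();   // never empty: created only by put
            long acc = values.get(0);
            for (int i = 1; i < values.size(); i++) {
                acc = combiner.applyAsLong(acc, values.get(i));
            }
            summary.put(e.getKey(), acc);
        }
        return summary;
    }

    public static void main(String[] args) {
        ReadTimeCombiner table = new ReadTimeCombiner();
        table.put("page1", "hits", 3);
        table.put("page1", "hits", 4);
        table.put("page2", "hits", 5);
        System.out.println(table.scan(Long::sum));   // prints {page1:hits=7, page2:hits=5}
        System.out.println(table.scan(Math::max));   // prints {page1:hits=4, page2:hits=5}
    }
}
```

Because the raw values are kept and the summary is computed on read, the same stored data can be scanned with a different combiner, and a bad input can be corrected without recomputing everything else.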
  • 20. BSP
      •  First proposed by Valiant in 1990
      •  Good at expressing iterative computation
      •  Good at expressing graph algorithms
      •  Concerned with passing messages between virtual processors
      •  Perhaps the most famous implementation is Pregel
  • 21. MR Graph Traversal
      [Diagram: three stages, Map, Sort + Shuffle, Reduce; vertices A, B, and C each enter Map with state n]
  • 22. MR Graph Traversal
      [Same diagram; callout at A: "I want to send a message to C!"]
  • 23, 24. MR Graph Traversal
      [Map output: A emits its state n plus a message m keyed to C; B and C emit their state n]
  • 25. MR Graph Traversal
      [Sort + Shuffle output, grouped by key: A: n; B: n; C: n, m]
  • 26. MR Graph Traversal
      [Reduce output: A: n; B: n; C: n; callout at C: "I got it!"]
  • 27. MR Graph Traversal
      [Same diagram; callout: O((n+m) lg(n+m))]
  • 28. MR Graph Traversal
      [Same diagram; callout: "This can be optimized to O(m)"]
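One Map/Reduce pass of the traversal above can be sketched in-process. This is an illustration, not Hadoop code: the class `MrTraversal`, the `Vertex` type, and the choice of a breadth-first distance as the vertex state are assumptions, and a `TreeMap` stands in for the sort and shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MrTraversal {
    static final int INF = Integer.MAX_VALUE;   // "not reached yet"

    /** Vertex state n: current distance plus outgoing edges. */
    static final class Vertex {
        final int distance;
        final List<String> edges;

        Vertex(int distance, List<String> edges) {
            this.distance = distance;
            this.edges = edges;
        }
    }

    /** One pass: map emits state and messages, the sort groups them, reduce merges them. */
    static TreeMap<String, Vertex> iterate(Map<String, Vertex> graph) {
        // Map + sort/shuffle: every vertex re-emits its own state (n) and, if reached,
        // one message (m) per edge. All n + m records go through the sorted grouping.
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Vertex> e : graph.entrySet()) {
            Vertex v = e.getValue();
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(v.distance);
            if (v.distance != INF) {
                for (String target : v.edges) {
                    grouped.computeIfAbsent(target, k -> new ArrayList<>()).add(v.distance + 1);
                }
            }
        }
        // Reduce: each vertex keeps the smallest distance it received.
        TreeMap<String, Vertex> next = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            Vertex old = graph.get(e.getKey());
            List<String> edges = old == null ? Collections.<String>emptyList() : old.edges;
            next.put(e.getKey(), new Vertex(Collections.min(e.getValue()), edges));
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, Vertex> graph = new HashMap<>();
        graph.put("A", new Vertex(0, Arrays.asList("C")));            // A will message C
        graph.put("B", new Vertex(INF, Collections.<String>emptyList()));
        graph.put("C", new Vertex(INF, Collections.<String>emptyList()));
        TreeMap<String, Vertex> next = iterate(graph);
        System.out.println("C distance after one pass: " + next.get("C").distance);
    }
}
```

Note that B takes part in the map, the sort, and the reduce even though it neither sends nor receives a message; the whole graph state is copied on every pass.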
  • 29. The BSP Version
      [Diagram: three phases, Compute, Exchange Messages, Synchronize; vertices A, B, and C keep their state n, and A's message m is delivered to C]
  • 30. The BSP Version
      [Same diagram] Notice A and C’s message exchange isn’t closely coupled, providing better I/O utilization
  • 31. The BSP Version
      [Same diagram] Also, notice we don’t necessarily have to copy the entire graph state. We just send whatever messages need to be sent
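The compute, exchange, synchronize cycle can be sketched as a loop of supersteps. This is a single-process illustration, not Giraph or Pregel code: the class `BspTraversal`, the `run` method, and the breadth-first distance used as vertex state are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BspTraversal {
    static final int INF = Integer.MAX_VALUE;   // "not reached yet"

    /** Runs supersteps until no messages are in flight; returns the superstep count. */
    static int run(Map<String, Integer> distance, Map<String, List<String>> edges, String source) {
        Map<String, List<Integer>> inbox = new HashMap<>();
        List<Integer> seed = new ArrayList<>();
        seed.add(0);
        inbox.put(source, seed);
        int supersteps = 0;
        while (!inbox.isEmpty()) {
            Map<String, List<Integer>> outbox = new HashMap<>();
            // Compute: only vertices that received messages do any work.
            for (Map.Entry<String, List<Integer>> e : inbox.entrySet()) {
                String vertex = e.getKey();
                int best = Collections.min(e.getValue());
                if (best < distance.getOrDefault(vertex, INF)) {
                    distance.put(vertex, best);   // state is updated in place, never copied
                    // Exchange: send messages only along this vertex's edges.
                    for (String target : edges.getOrDefault(vertex, Collections.<String>emptyList())) {
                        outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(best + 1);
                    }
                }
            }
            // Synchronize: the barrier; messages become visible in the next superstep.
            inbox = outbox;
            supersteps++;
        }
        return supersteps;
    }

    public static void main(String[] args) {
        Map<String, Integer> distance = new HashMap<>();
        distance.put("A", INF);
        distance.put("B", INF);
        distance.put("C", INF);
        Map<String, List<String>> edges = new HashMap<>();
        edges.put("A", Arrays.asList("C"));       // A sends its message to C
        int steps = run(distance, edges, "A");
        System.out.println(steps + " supersteps, C distance " + distance.get("C"));
    }
}
```

In this sketch B is never touched: work per superstep is proportional to the messages sent rather than to the size of the graph, and nothing is sorted.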
  • 32. BSP Implementations
      •  Giraph
        –  Currently an Apache Incubator project
        –  Has a growing community
        –  Runs during the Hadoop Map phase
      •  GoldenOrb
        –  Not actively maintained since the summer
      •  Both implementations are in-memory, modeled after Pregel
  • 33. Contact Info
      Ed Kohlwey, Booz | Allen | Hamilton
      @ekohlwey, kohlwey_edmund@bah.com