Big ML

Tools & techniques for large-scale ML

Joaquin Vanschoren
The Why

• A lot of ML problems concern really large datasets

  • Huge graphs: internet, social networks, protein interactions

  • Sensor data, large sensor networks

  • Data is just getting bigger, collected faster...
The What

• Map-Reduce (Hadoop) -> I/O-bound processes
  • Tutorial
• Grid computing -> CPU-bound processes
  • Short intro
Map-reduce: distribute I/O over many nodes

[Diagram, repeated over six slides with one step highlighted at a time: (1) the user program forks a master and many workers; (2) the master assigns map and reduce tasks; (3) map workers read the input splits and (4) write intermediate files to their local disks; (5) reduce workers read these remotely and (6) write the output files.]

The highlighted steps:
• Split data into chunks
• Send chunks to worker nodes
• MAP data to intermediate key-value pairs
• Shuffle key-value pairs: same keys to same nodes (+ sort)
• REDUCE key-value data to target output
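To make the flow concrete, here is a minimal word-count job, the canonical MapReduce example. This sketch is not part of the original deck; the class and method names are illustrative, but it uses the standard Hadoop Java API.

    // A sketch, not from the slides: the canonical word-count job.
    // MAP emits <word, 1>; the shuffle groups equal words; REDUCE sums the counts.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        public static class TokenMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // MAP: one <word, 1> pair per word occurrence
                for (String word : value.toString().toLowerCase().split("\\W+"))
                    if (!word.isEmpty())
                        context.write(new Text(word), ONE);
            }
        }

        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text word, Iterable<IntWritable> counts, Context context)
                    throws IOException, InterruptedException {
                // REDUCE: sum the grouped counts per word
                int sum = 0;
                for (IntWritable c : counts) sum += c.get();
                context.write(word, new IntWritable(sum));
            }
        }
    }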
Map-reduce: Intuition (thanks to Saliya Ekanayake)

One apple: one person with a knife and a blender suffices.
A fruit basket: more knives!
• many mappers (people with knives)
• 1 reducer (person with a blender)
Map-reduce: Intuition

A fruit container: more blenders too!

Map input:     <a, apple> <o, orange> <p, pear>, ...
Map output:    <a’, apple pieces> <o’, orange pieces> <p’, pear pieces>, ...
(Shuffle)
Reduce input:  <a’, [apple pieces from all mappers]>
Reduce output: <a’, apple juice>

A peer-review analogy also works: reviewers are the mappers, editors the reducers.
Example: inverted indexing (e.g. webpages)

• Input: documents/webpages
  • Doc 1: “Why did the chicken cross the road?”
  • Doc 2: “The chicken and egg problem”
  • Doc 3: “Kentucky Fried Chicken”
• Output: an index of words
  • E.g., the word “the” occurs twice in Doc 1 (positions 3 and 6) and once in Doc 2 (position 1)

Simply write the mapper and reducer methods: often just a few lines of code.
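As an illustration of how short this can be, here is a sketch of such a mapper and reducer (hypothetical names, not the slides' code; it assumes the input key is the document id and the value its text, e.g. via KeyValueTextInputFormat):

    // A sketch, not from the slides: inverted indexing in Hadoop.
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class InvertedIndex {
        // MAP: emit <word, "docId:position"> for every word occurrence
        public static class IndexMapper extends Mapper<Text, Text, Text, Text> {
            @Override
            public void map(Text docId, Text doc, Context context)
                    throws IOException, InterruptedException {
                String[] words = doc.toString().toLowerCase().split("\\W+");
                for (int pos = 0; pos < words.length; pos++)
                    if (!words[pos].isEmpty())
                        context.write(new Text(words[pos]),
                                      new Text(docId + ":" + (pos + 1)));
            }
        }

        // REDUCE: collect all postings of a word into one index entry
        public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
            @Override
            public void reduce(Text word, Iterable<Text> postings, Context context)
                    throws IOException, InterruptedException {
                StringBuilder entry = new StringBuilder();
                for (Text p : postings) entry.append(p).append(' ');
                context.write(word, new Text(entry.toString().trim()));
            }
        }
    }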
Map-reduce on time series data

A simple operation: aggregation

• Input: a table with sensors in columns and timestamps in rows

  2008-10-24 06:15:04.559, -6.293695, -1.1263204, 2.985364, 43449.957, 2.3577218, 38271.21
  2008-10-24 06:15:04.570, -6.16952, -1.3805857, 2.6128333, 43449.957, 2.4848552, 37399.26
  2008-10-24 06:15:04.580, -5.711255, -0.8897944, 3.139107, 43449.957, 2.1744132, 38281.0

• Desired output: aggregated measures per timestamp

  2008-10-24 06:15:04.559 min    -524.0103
  2008-10-24 06:15:04.559 max     38271.21
  2008-10-24 06:15:04.559 avg    447.37795925930226
  2008-10-24 06:15:04.570 min    -522.7882
  2008-10-24 06:15:04.570 max     37399.26
  2008-10-24 06:15:04.570 avg    437.1266600847675
Map-reduce on time series data

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // MAP data to <timestamp, value> pairs
            String values[] = value.toString().split("\t");
            Text time = new Text(values[0]);
            for (int i = 1; i <= nrStressSensors; i++)
                context.write(time, new Text(values[i]));
        }

        // (the shuffle/sort phase groups all values per timestamp)

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // REDUCE data per timestamp to aggregates (min, max, avg)
            double sum = 0, min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
            long count = 0;
            double d;
            for (Text v : values) {
                d = Double.valueOf(v.toString());
                sum += d;
                min = Math.min(min, d);
                max = Math.max(max, d);
                count++;
            }
            context.write(new Text(key + " min"), new Text(Double.toString(min)));
            context.write(new Text(key + " max"), new Text(Double.toString(max)));
            context.write(new Text(key + " avg"), new Text(Double.toString(sum / count)));
        }
Shuffling

[Diagram: how the intermediate <timestamp, value> pairs are shuffled to the reducers.]

Map-reduce on time series data

[Charts: the resulting Min, Avg, and Max series plotted over time.]
ML with map-reduce

• What part is I/O intensive (related to the data points), and can be parallelized?
  • E.g. k-Means? The calculation of the distance function!
    • Split data in chunks
    • Choose initial centers
    • MAP: calculate distances to all centers: <point, [distances]>
    • REDUCE: calculate new centers
    • Repeat
• Others: SVM, NB, NN, LR, ...
  • Chu et al., Map-Reduce for ML on Multicore, NIPS ’06

[Diagram: iterative MapReduce for k-Means; the centers of version i are sent to mappers working on the point chunks, and reducers produce the centers of version i+1.]
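A sketch of one such iteration follows (class names and the center-passing mechanism are assumptions, not the slides' code). The mapper assigns each point to its nearest center; the reducer averages the assigned points into the new centers.

    // A sketch, not from the slides: one k-Means iteration as a MapReduce job.
    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class KMeansIteration {
        public static class AssignMapper
                extends Mapper<LongWritable, Text, IntWritable, Text> {
            private double[][] centers;  // "centers version i"

            @Override
            protected void setup(Context context) {
                // For this sketch, centers are passed as a config string "x,y;x,y;..."
                String[] rows = context.getConfiguration().get("kmeans.centers").split(";");
                centers = new double[rows.length][];
                for (int r = 0; r < rows.length; r++) {
                    String[] cols = rows[r].split(",");
                    centers[r] = new double[cols.length];
                    for (int i = 0; i < cols.length; i++)
                        centers[r][i] = Double.parseDouble(cols[i]);
                }
            }

            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] parts = value.toString().split(",");
                double[] point = new double[parts.length];
                for (int i = 0; i < parts.length; i++)
                    point[i] = Double.parseDouble(parts[i]);
                // MAP: assign the point to its nearest center (squared Euclidean distance)
                int best = 0;
                double bestDist = Double.POSITIVE_INFINITY;
                for (int c = 0; c < centers.length; c++) {
                    double dist = 0;
                    for (int i = 0; i < point.length; i++) {
                        double diff = point[i] - centers[c][i];
                        dist += diff * diff;
                    }
                    if (dist < bestDist) { bestDist = dist; best = c; }
                }
                context.write(new IntWritable(best), value);
            }
        }

        public static class RecenterReducer
                extends Reducer<IntWritable, Text, IntWritable, Text> {
            @Override
            public void reduce(IntWritable center, Iterable<Text> points, Context context)
                    throws IOException, InterruptedException {
                // REDUCE: the new center is the mean of all points assigned to it
                double[] sum = null;
                long count = 0;
                for (Text p : points) {
                    String[] parts = p.toString().split(",");
                    if (sum == null) sum = new double[parts.length];
                    for (int i = 0; i < parts.length; i++)
                        sum[i] += Double.parseDouble(parts[i]);
                    count++;
                }
                StringBuilder newCenter = new StringBuilder();  // "centers version i+1"
                for (int i = 0; i < sum.length; i++) {
                    if (i > 0) newCenter.append(',');
                    newCenter.append(sum[i] / count);
                }
                context.write(center, new Text(newCenter.toString()));
            }
        }
    }

Note that this variant emits the nearest-center assignment rather than the full <point, [distances]> vector described above; either way, a driver reruns the job with the new centers until they converge.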
Map-reduce on graph data

• Find the nearest feature on a graph: for each node, what is the nearest labeled node within distance d?

Input:        graph of (node, label) pairs
Map:          for every node, search the graph with radius d; emit <node, {label, distance}>
Shuffle/Sort: by node id
Reduce:       <node, [{label, distance}, {label, distance}, ...]> -> min()
Output:       <node, nearest label> pairs: a marked graph

Be smart about mapping: load nearby nodes on the same computing node, i.e. choose a good representation.
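A sketch of the reduce step only (the "label:distance" encoding and the class name are assumptions, not the slides' code): per node, keep the labeled feature with the smallest distance.

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class NearestFeatureReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        public void reduce(Text nodeId, Iterable<Text> candidates, Context context)
                throws IOException, InterruptedException {
            String bestLabel = null;
            double bestDist = Double.POSITIVE_INFINITY;
            for (Text c : candidates) {                 // each candidate encoded as "label:distance"
                String[] parts = c.toString().split(":");
                double dist = Double.parseDouble(parts[1]);
                if (dist < bestDist) { bestDist = dist; bestLabel = parts[0]; }
            }
            if (bestLabel != null)                      // -> min(): the nearest labeled node
                context.write(nodeId, new Text(bestLabel + ":" + bestDist));
        }
    }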
The How?

Hadoop: The Definitive Guide (Tom White, Yahoo!)

Where to run it:
• Amazon Elastic Compute Cloud (EC2) + Amazon Simple Storage Service (S3): $0.085 per CPU-hour, $0.10 per GB-month
• or... your university’s HPC center
• or... install your own cluster

Apache Mahout, a Hadoop-based ML library:
• Classification: LR, NB, SVM*, NN*, RF
• Clustering: kMeans, canopy, EM, spectral
• Regression: LWLR*
• Pattern mining: Top-k FPGrowth
• Dim. reduction, collaborative filtering
(* under development)
Part II: Grid computing (CPU bound)

• Disambiguation:
  • Supercomputer (shared memory)
    • CPU + memory bound; you’re rich
    • Identical nodes
  • Cluster computing
    • Parallelizable over many nodes (MPI); you’re rich/patient
    • Similar nodes
  • Grid computing
    • Parallelizable over few nodes, or embarrassingly parallel
    • Very heterogeneous nodes, but loads of ‘free’ computing power
Grid computing

• Grid certificate: you’ll be part of a Virtual Organization

• (Complex) middleware commands to move data/jobs to computing nodes:
  • Submit job: glite-wms-job-submit -d $USER -o <jobId> <jdl_file>.jdl
  • Job status: glite-wms-job-status -i <jobID>
  • Copy output to dir: glite-wms-job-output --dir <dirname> -i <jobID>
  • Show storage elements: lcg-infosites --vo ncf se
  • ls: lfc-ls -l $LFC_HOME/joaquin
  • Upload file (copy-register): lcg-cr --vo ncf -d srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/data/ncf/joaquin/<remotename> -l lfn:/grid/ncf/joaquin/<remotename> "file://$PWD/<localname>"
  • Retrieve file from SE: lcg-cp --vo ncf lfn:/grid/ncf/joaquin/<remotename> file://$PWD/<localname>
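The <jdl_file>.jdl referenced above describes the job itself. A minimal sketch with hypothetical file names (Executable, the sandboxes, and VirtualOrganisation are standard gLite JDL attributes):

    Executable          = "run_experiment.sh";
    Arguments           = "fold1";
    StdOutput           = "std.out";
    StdError            = "std.err";
    InputSandbox        = {"run_experiment.sh", "experiment.jar"};
    OutputSandbox       = {"std.out", "std.err", "results.csv"};
    VirtualOrganisation = "ncf";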
Token pool servers
Video
Danke · Hvala · Thanks · Xie Xie · Diolch · Toda · Merci · Grazie · Spasiba · Efharisto · Obrigado · Arigato · Köszönöm · Tesekkurler · Dank U · Dhanyavaad · Gracias
