Introduction to Hadoop
   The bazooka for your data


       Iván de Prado Alonso // @ivanprado // @datasalt
Datasalt

  Focused on Big Data
  –   Open Source contributions
  –   Consulting & development
  –   Training
BIG
“MAC”
 DATA



Anatomy of a Big Data project


              Acquisition

             Processing

               Serving
Types of Big Data systems

●   Offline
    –   Latency is not an issue
●   Online
    –   Data freshness matters
●   Mixed
    –   The most common case

                Offline                       Online
    MapReduce                     NoSQL databases
    Hadoop                        Search engines
    Distributed RDBMS
“Swiss army knife of the 21st century”
                                       Media Guardian Innovation Awards

http://www.guardian.co.uk/technology/2011/mar/25/media-guardian-innovation-awards-apache-hadoop
History

●   2004-2006
    –   Google publishes the GFS and MapReduce papers
    –   Doug Cutting implements an open-source version in Nutch
●   2006-2008
    –   Hadoop is split out of Nutch
    –   Web scale is reached in 2008
●   2008-present
    –   Hadoop becomes popular and starts to be exploited commercially.

                               Source: Hadoop: a brief history. Doug Cutting
Hadoop

     “The Apache Hadoop software library is a framework that allows for
      the distributed processing of large data sets across clusters of
      computers using a simple programming model”

               From the Hadoop website
Distributed File System

●   Distributed file system (HDFS)
    –   Large blocks: 64 MB
         ●   Stored on the operating system's own file system
    –   Fault tolerant (replication)
    –   Common formats:
         ●   Plain-text files (CSV)
         ●   SequenceFiles
              –   Sequences of [key, value] pairs (see the sketch below)
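To make the SequenceFile format above concrete, here is a minimal sketch (not part of the original deck) that writes and reads back a few [key, value] pairs with Hadoop's SequenceFile API; the /tmp path and the Text/IntWritable types are assumptions chosen for the example.

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.io.IntWritable;
   import org.apache.hadoop.io.SequenceFile;
   import org.apache.hadoop.io.Text;

   public class SequenceFileExample {

       public static void main(String[] args) throws Exception {
           Configuration conf = new Configuration();
           // Hypothetical path, just for the example
           Path path = new Path("/tmp/example.seq");
           FileSystem fs = FileSystem.get(conf);

           // Write a few [key, value] pairs (word -> count)
           SequenceFile.Writer writer =
               SequenceFile.createWriter(fs, conf, path, Text.class, IntWritable.class);
           try {
               writer.append(new Text("esto"), new IntWritable(2));
               writer.append(new Text("es"), new IntWritable(1));
           } finally {
               writer.close();
           }

           // Read them back in the same order they were written
           SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
           try {
               Text key = new Text();
               IntWritable value = new IntWritable();
               while (reader.next(key, value)) {
                   System.out.println(key + "\t" + value);
               }
           } finally {
               reader.close();
           }
       }
   }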
MapReduce

●   Two functions (Map and Reduce)
    –   Map(k, v) : [z, w]*
    –   Reduce(k, v*) : [z, w]*
●   Example: counting words
    –   Map([document, null]) -> [word, 1]*
    –   Reduce(word, 1*) -> [word, total]
●   MapReduce and SQL
    –   SELECT word, count(*) GROUP BY word
●   Distributed execution on a cluster with horizontal scalability
The classic Word Count

  Input:
      Esto es una linea
      Esto también

  Map:
      map(“Esto es una linea”) = [esto, 1], [es, 1], [una, 1], [linea, 1]
      map(“Esto también”) = [esto, 1], [también, 1]

  Reduce:
      reduce(es, {1}) = [es, 1]
      reduce(esto, {1, 1}) = [esto, 2]
      reduce(linea, {1}) = [linea, 1]
      reduce(también, {1}) = [también, 1]
      reduce(una, {1}) = [una, 1]

  Result:
      es, 1
      esto, 2
      linea, 1
      también, 1
      una, 1
Word Count in Hadoop

   public class WordCountHadoop extends Configured implements Tool {

       public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

           private final static IntWritable one = new IntWritable(1);
           private Text word = new Text();

           public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
               StringTokenizer itr = new StringTokenizer(value.toString());
               while(itr.hasMoreTokens()) {
                   word.set(itr.nextToken());
                   context.write(word, one);
               }
           }
       }

       public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

           private IntWritable result = new IntWritable();

           public void reduce(Text key, Iterable<IntWritable> values, Context context)
               throws IOException, InterruptedException {
               int sum = 0;
               for(IntWritable val : values) {
                   sum += val.get();
               }
               result.set(sum);
               context.write(key, result);
           }
       }

       @Override
       public int run(String[] args) throws Exception {

           if(args.length != 2) {
               System.err.println("Usage: wordcount-hadoop <in> <out>");
               System.exit(2);
           }

           Path output = new Path(args[1]);
           HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

           Job job = new Job(getConf(), "word count hadoop");
           job.setJarByClass(WordCountHadoop.class);
           job.setMapperClass(TokenizerMapper.class);
           job.setCombinerClass(IntSumReducer.class);
           job.setReducerClass(IntSumReducer.class);
           job.setOutputKeyClass(Text.class);
           job.setOutputValueClass(IntWritable.class);
           FileInputFormat.addInputPath(job, new Path(args[0]));
           FileOutputFormat.setOutputPath(job, new Path(args[1]));
           job.waitForCompletion(true);

           return 0;
       }

       public static void main(String[] args) throws Exception {
           ToolRunner.run(new WordCountHadoop(), args);
       }
   }

 Let's go through it piece by piece!
Word Count in Hadoop - Mapper


    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {

            StringTokenizer itr = new StringTokenizer(value.toString());
            while(itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }
Word Count in Hadoop - Reducer


    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
            int sum = 0;
            for(IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }
Word Count in Hadoop – Configuration and execution

         if(args.length != 2) {
             System.err.println("Usage: wordcount-hadoop <in> <out>");
             System.exit(2);
         }

         Path output = new Path(args[1]);
         HadoopUtils.deleteIfExists(FileSystem.get(output.toUri(), getConf()), output);

         Job job = new Job(getConf(), "word count hadoop");
         job.setJarByClass(WordCountHadoop.class);
         job.setMapperClass(TokenizerMapper.class);
         job.setCombinerClass(IntSumReducer.class);
         job.setReducerClass(IntSumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));
         FileOutputFormat.setOutputPath(job, new Path(args[1]));
         job.waitForCompletion(true);
Executing a MapReduce job

  [Diagram: the blocks of the input file are read by mappers running on the
  cluster nodes (Node 1, Node 2, ...); the intermediate data is shuffled to
  reducers on those nodes, which write the final result.]
Serialization

 ●   Writables
     • Hadoop's native serialization
     • Very low level
     • Basic types: IntWritable, Text, etc. (see the custom Writable sketch below)
 ●   Others
     • Thrift, Avro, Protostuff
     • Backwards compatibility.
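As a concrete example of how low-level Writables are, here is a minimal sketch (not part of the original deck) of a custom Writable; the class and its fields are made up for illustration.

   import java.io.DataInput;
   import java.io.DataOutput;
   import java.io.IOException;
   import org.apache.hadoop.io.Writable;

   // Hypothetical example: a (url, timestamp) record serialized the Hadoop way
   public class UrlVisitWritable implements Writable {

       private String url;
       private long timestamp;

       // A no-arg constructor is required so Hadoop can instantiate it via reflection
       public UrlVisitWritable() {}

       public UrlVisitWritable(String url, long timestamp) {
           this.url = url;
           this.timestamp = timestamp;
       }

       @Override
       public void write(DataOutput out) throws IOException {
           // Field-by-field, low-level serialization
           out.writeUTF(url);
           out.writeLong(timestamp);
       }

       @Override
       public void readFields(DataInput in) throws IOException {
           // Fields must be read back in exactly the same order
           url = in.readUTF();
           timestamp = in.readLong();
       }
   }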
Hadoop's
learning curve
is steep
Tuple MapReduce

●   A simpler MapReduce
    –   Tuples instead of key/value pairs
    –   At the job level you declare
         ●   The fields to group by
         ●   The fields to sort by
    –   Tuple MapReduce-join
Pangool
●   An implementation of Tuple MapReduce
    –   Developed by Datasalt
    –   Open source
    –   Performance comparable to Hadoop
●   Goal: to replace the plain Hadoop API
●   If you want to learn Hadoop, start with Pangool
    (a word-count sketch follows below)
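To make the tuple-based API concrete, here is a sketch (not part of the original deck) of the earlier word count written with Pangool. It only reuses calls that appear in the URL-resolution example later in this deck (Schema, Field.create, TupleMapper, TupleReducer, TupleMRBuilder); the class names, the "word"/"count" fields and the Type.INT field type are assumptions of this sketch, not code from the presentation.

   // Sketch only: imports omitted, as on the other slides (Hadoop + com.datasalt.pangool classes).
   public class WordCountPangool {

       // One intermediate schema: a (word, count) tuple
       static Schema getSchema() {
           List<Field> fields = new ArrayList<Field>();
           fields.add(Field.create("word", Type.STRING));
           fields.add(Field.create("count", Type.INT));
           return new Schema("wordCount", fields);
       }

       public static class Tokenizer extends TupleMapper<LongWritable, Text> {

           private Tuple tuple = new Tuple(getSchema());

           @Override
           public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
               throws IOException, InterruptedException {
               for(String word : value.toString().split("\\s+")) {
                   tuple.set("word", word);
                   tuple.set("count", 1);
                   collector.write(tuple);   // emit a tuple instead of a key/value pair
               }
           }
       }

       public static class Sum extends TupleReducer<Text, IntWritable> {

           @Override
           public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
               throws IOException, InterruptedException, TupleMRException {
               int count = 0;
               for(ITuple tuple : tuples) {
                   count += (Integer) tuple.get("count");
               }
               collector.write(new Text(group.get("word").toString()), new IntWritable(count));
           }
       }

       public static void main(String[] args) throws Exception {
           TupleMRBuilder mr = new TupleMRBuilder(new Configuration(), "Pangool Word Count");
           mr.addIntermediateSchema(getSchema());
           mr.setGroupByFields("word");   // the fields to group by, declared at job level
           mr.setTupleReducer(new Sum());
           mr.setOutput(new Path(args[1]),
               new HadoopOutputFormat(TextOutputFormat.class), Text.class, IntWritable.class);
           mr.addInput(new Path(args[0]),
               new HadoopInputFormat(TextInputFormat.class), new Tokenizer());
           mr.createJob().waitForCompletion(true);
       }
   }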
Pangool performance
●   Comparable to Hadoop

    See http://pangool.net/benchmark.html
Pangool – URL resolution

●   A join example
    –   Very hard in plain Hadoop. Easy in Pangool.
●   Problem:
    –   There are many URL shorteners and redirections
    –   For analysis it is often useful to replace each URL with its canonical URL
    –   Assume we have both datasets:
        ●   A map with URL → canonical URL entries
        ●   A dataset with the URLs to resolve plus other fields.
    –   The following Pangool job solves the problem in a scalable way.
URL Resolution – Defining the schemas


    static Schema getURLRegisterSchema() {
        List<Field> urlRegisterFields = new ArrayList<Field>();
        urlRegisterFields.add(Field.create("url",Type.STRING));
        urlRegisterFields.add(Field.create("timestamp",Type.LONG));
        urlRegisterFields.add(Field.create("ip",Type.STRING));
        return new Schema("urlRegister", urlRegisterFields);
    }

    static Schema getURLMapSchema() {
        List<Field> urlMapFields = new ArrayList<Field>();
        urlMapFields.add(Field.create("url",Type.STRING));
        urlMapFields.add(Field.create("canonicalUrl",Type.STRING));
        return new Schema("urlMap", urlMapFields);
    }




URL Resolution – Loading the file to resolve


    public static class UrlProcessor extends TupleMapper<LongWritable, Text> {

        private Tuple tuple = new Tuple(getURLRegisterSchema());

        @Override
        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

            String[] fields = value.toString().split("\t");
            tuple.set("url", fields[0]);
            tuple.set("timestamp", Long.parseLong(fields[1]));
            tuple.set("ip", fields[2]);
            collector.write(tuple);
        }
    }
URL Resolution – Loading the URL map


    public static class UrlMapProcessor extends TupleMapper<LongWritable, Text> {

        private Tuple tuple = new Tuple(getURLMapSchema());

        @Override
        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {

            String[] fields = value.toString().split("\t");
            tuple.set("url", fields[0]);
            tuple.set("canonicalUrl", fields[1]);
            collector.write(tuple);
        }
    }
URL Resolution – Resolving in the reducer

    public static class Handler extends TupleReducer<Text, NullWritable> {

        private Text result;

        @Override
        public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException, TupleMRException {
            if (result == null) {
                result = new Text();
            }
            String canonicalUrl = null;
            for(ITuple tuple : tuples) {
                if("urlMap".equals(tuple.getSchema().getName())) {
                    canonicalUrl = tuple.get("canonicalUrl").toString();
                } else {
                    result.set(canonicalUrl + "\t" + tuple.get("timestamp") + "\t" + tuple.get("ip"));
                    collector.write(result, NullWritable.get());
                }
            }
        }
    }
URL Resolution – Configuring and launching the job
  String input1 = args[0];
  String input2 = args[1];
  String output = args[2];

  deleteOutput(output);

  TupleMRBuilder mr = new TupleMRBuilder(conf,"Pangool Url Resolution");
  mr.addIntermediateSchema(getURLMapSchema());
  mr.addIntermediateSchema(getURLRegisterSchema());
  mr.setGroupByFields("url");
  mr.setOrderBy(
      new OrderBy().add("url", Order.ASC).addSchemaOrder(Order.ASC));
  mr.setTupleReducer(new Handler());
  mr.setOutput(new Path(output),
      new HadoopOutputFormat(TextOutputFormat.class),
      Text.class,
      NullWritable.class);
  mr.addInput(new Path(input1),
      new HadoopInputFormat(TextInputFormat.class),
      new UrlMapProcessor());
  mr.addInput(new Path(input2),
      new HadoopInputFormat(TextInputFormat.class),
      new UrlProcessor());
  mr.createJob().waitForCompletion(true);


