SlideShare ist ein Scribd-Unternehmen logo
1 von 42
Downloaden Sie, um offline zu lesen
Processing Data with Map Reduce



Allahbaksh Mohammedali Asadullah
       Infosys Labs, Infosys Technologies




                                            1
Content
Map Function
Reduce Function
Why Hadoop
HDFS
Map Reduce – Hadoop
Some Questions


                      2
What is Map Function
Map is a classic primitive of Functional
Programming.
Map means apply a function to a list of
elements and return the modified list.
    function List Map(Function func, List elements){

        List newElement;
        foreach element in elements{
            newElement.put(apply(func, element))
        }
        return newElement
                                                       3
    }
Example Map Function
    function double increaseSalary(double salary){
            return salary* (1+0.15);
    }
    function double Map(Function increaseSalary, List<Employee>
 employees){
                List<Employees> newList;
                foreach employee in employees{
                    Employee tempEmployee = new (
                    newList.add(tempEmployee.income=increaseSalary(
                    tempEmployee.income)
                }
    }                                                                 4
Fold or Reduce Funtion
Fold/Reduce reduces a list of values to one.
Fold means apply a function to a list of
elements and return a resulting element
  function Element Reduce(Function func, List elements){

         Element earlierResult;
         foreach element in elements{
             func(element, earlierResult)
         }   return earlierResult;
     }


                                                           5
Example Reduce Function
function double add(double number1, double number2){
      return number1 + number2;
}


function double Reduce (Function add, List<Employee> employees){
      double totalAmout=0.0;
      foreach employee in employees{
          totalAmount =add(totalAmount,emplyee.income);
      }
      return totalAmount
}
                                                                   6
I know Map and Reduce, How do I
use it

I will use    some   library   or
framework.


                               7
Why some framework?
Lazy to write boiler plate code
For modularity
Code reusability




                                  8
What is best choice




                      9
Why Hadoop?




              10
11
Programming Language Support



                   C++



                               12
Who uses it




              13
Strong Community




                                     14
Image Courtesy http://goo.gl/15Nu3
Commercial Support




                     15
Hadoop




         16
Hadoop HDFS



              17
Hadoop Distributed File System
Large Distributed File          System     on
commudity hardware
4k nodes, Thousands of files, Petabytes of
data
Files are replicated so that hard disk failure
can be handled easily
One NameNode and many DataNode

                                             18
Hadoop Distributed File System
                  HDFS ARCHITECTURE

                                                 Metadata (Name,replicas,..):
                          Namenode
   Metadata ops

        Client                                   Block ops
 Read              Data Nodes                                    Data Nodes
                                         Replication

                                                                         Blocks
                                              Write

                 Rack 1                                        Rack 2
                                     Client


                                                                                  19
NameNode
Meta-data in RAM
    The entire metadata is in main memory.
  Metadata consist of
    •   List of files
    •   List of Blocks for each file
    •   List of DataNodes for each block
    •   File attributes, e.g creation time
    •   Transaction Log
NameNode uses heartbeats to detect DataNode failure
                                                      20
Data Node
Data Node stores the data in file system
Stores meta-data of a block
Serves data and meta-data to Clients
Pipelining of Data i.e forwards data to other
specified DataNodes
DataNodes send heartbeat to the NameNode
every three sec.


                                           21
HDFS Commands
Accessing HDFS
   hadoop dfs –mkdir myDirectory
    hadoop dfs -cat myFirstFile.txt
Web Interface
   http://host:port/dfshealth.jsp



                                      22
Hadoop MapReduce



                   23
Map Reduce Diagramtically
                Mapper
                                                 Reducer

                                                           Output Files

 Input Files

Input Split 0
Input Split 1
Input Split 2
Input Split 3
Input Split 4
Input Split 5




                  Intermediate file is divided into R
                  partitions, by partitioning function               24
Input Format
InputFormat descirbes the input sepcification to a MapReduce
job. That is how the data is to be read from the File System .
Split up the input file into logical InputSplits, each of which is
assigned to an Mapper
Provide the RecordReader implementation to be used to collect
input record from logical InputSplit for processing by Mapper
RecordReader, typically, converts the byte-oriented view of the
input, provided by the InputSplit, and presents a record-oriented
view for the Mapper & Reducer tasks for processing. It thus
assumes the responsibility of processing record boundaries and
presenting the tasks with keys and values.
 
                                                                25
Creating a your Mapper
The mapper should implements .mapred.Mapper
Earlier version use to extend class .mapreduce.Mapper class
Extend .mapred.MapReduceBase class which provides default
implementation of close and configure method.
The Main method is map ( WritableComparable key, Writable
value, OutputCollector<K2,V2> output, Reporter reporter)
One instance of your Mapper is initialized per task. Exists in separate process from all
other instances of Mapper – no data sharing. So static variables will be different for
different map task.
Writable -- Hadoop defines a interface called Writable which is Serializable.
Examples IntWritable, LongWritable, Text etc.

WritableComparables can be compared to each other, typically via Comparators. Any
type which is to be used as a key in the Hadoop Map-Reduce framework should
implement this interface.
InverseMapper swaps the key and value                                           26
Combiner
Combiners are used to optimize/minimize the number
of key value pairs that will be shuffled across the
network between mappers and reducers.
Combiner are sort of mini reducer that will be applied
potentially several times still during the map phase
before to send the new set of key/value pairs to the
reducer(s).
Combiners should be used when the function you want
to apply is both commutative and associative.
Example: WordCount and Mean value computation
Reference   http://goo.gl/iU5kR
                                                    27
Partitioner
Partitioner controls the partitioning of the keys of the
intermediate map-outputs.
The key (or a subset of the key) is used to derive the
partition, typically by a hash function.
The total number of partitions is the same as the
number of reduce tasks for the job.
Some Partitioner are BinaryPartitioner, HashPartitioner,
KeyFieldBasedPartitioner, TotalOrderPartitioner



                                                      28
Creating a your Reducer
The mapper should implements .mapred.Reducer
Earlier version use to extend class .mapreduce.Reduces class
Extend .mapred.MapReduceBase class which provides default
implementation of close and configure method.
The Main method is reduce(WritableComparable key, Iterator
values, OutputCollector output, Reporter reporter)
Keys & values sent to one partition all goes to the same reduce task

Iterator.next() always returns the same object, different data

HashPartioner partition it based on Hash function written

IdentityReducer is default implementation of the Reducer


                                                                       29
Output Format
OutputFormat is similar to InputFormat
Different type of output formats are
   TextOutputFormat
   SequenceFileOutputFormat
   NullOutputFormat




                                         30
Mechanics of whole process
Configure the Input and Output
Configure the Mapper and Reducer
Specify other parameters like number Map
job, number of reduce job etc.
Submit the job to client




                                       31
Example
JobConf conf = new JobConf(WordCount.class);

conf.setJobName("wordcount");

conf.setMapperClass(Map.class);

conf.setCombinerClass(Reduce.class);

conf.setReducerClass(Reduce.class);

conf.setOutputKeyClass(Text.class);

conf.setOutputValueClass(IntWritable.class);

conf.setInputFormat(TextInputFormat.class);

conf.setOutputFormat(TextOutputFormat.class);

FileInputFormat.setInputPaths(conf, new
    Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1]));

JobClient.runJob(conf); //JobClient.submit

                                                                              32
Job Tracker & Task Tracker
                   Master Node

                     Job Tracker




 Slave Node                             Slave Node
   Task Tracker                           Task Tracker
                   .....           ..
 Task       Task                        Task       Task

                                                          33
Job Launch Process
JobClient determines proper division of
input into InputSplits
Sends job data to master JobTracker server.
Saves the jar and JobConf (serialized to
XML) in shared location and posts the job
into a queue.



                                            34
                                34
Job Launch Process Contd..
TaskTrackers running on slave nodes
periodically query JobTracker for
work.
Get the job jar from the Master node
to the data node.
Launch the main class in separate
JVM queue.
TaskTracker.Child.main()   35      35
Small File Problem
What should I do if I have lots of small
 files?
One word answer is SequenceFile.
               SequenceFile Layout
        Key     Value   Key   Value   Key   Value   Key   Value


     File Name File Content




Tar to SequenceFile     http://goo.gl/mKGC7
                                                                  36
Consolidator      http://goo.gl/EVvi7
Problem of Large File
What if I have single big file of 20Gb?
One word answer is There is no problems with
large files




                                               37
SQL Data
What is way to access SQL data?
One word answer is DBInputFormat.
DBInputFormat provides a simple method of scanning entire tables from a
database, as well as the means to read from arbitrary SQL queries performed
against the database.
DBInputFormat provides a simple method of scanning entire tables from a
database, as well as the means to read from arbitrary SQL queries performed
against the database.


Database Access with Hadoop       http://goo.gl/CNOBc



                                                                              38
JobConf conf = new JobConf(getConf(), MyDriver.class);

     conf.setInputFormat(DBInputFormat.class);

      DBConfiguration.configureDB(conf, “com.mysql.jdbc.Driver”,
“jdbc:mysql://localhost:port/dbNamee”);

     String [] fields = { “employee_id”, "name" };

      DBInputFormat.setInput(conf, MyRow.class, “employees”, null /* conditions */, “employee_id”,
fields);




                                                                                                     39
public class MyRow implements Writable, DBWritable {

         private int employeeNumber;

         private String employeeName;

         public void write(DataOutput out) throws IOException {

                    out.writeInt(employeeNumber);                 out.writeChars(employeeName);

         }

         public void readFields(DataInput in) throws IOException {

                    employeeNumber= in.readInt();                 employeeName = in.readUTF();

         }

         public void write(PreparedStatement statement) throws SQLException {

              statement.setInt(1, employeeNumber); statement.setString(2, employeeName);

         }

         public void readFields(ResultSet resultSet) throws SQLException {

              employeeNumber = resultSet.getInt(1); employeeName = resultSet.getString (2);

         }
                                                                                                  40
}
Question
   &
Answer
           41
Thanks You


             42

Weitere ähnliche Inhalte

Was ist angesagt?

Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examplesAndrea Iacono
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelTakahiro Inoue
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Rajesh Ananda Kumar
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXAndrea Iacono
 
Map Reduce
Map ReduceMap Reduce
Map Reduceschapht
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce APITom Croucher
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...CloudxLab
 
Unit 3 writable collections
Unit 3 writable collectionsUnit 3 writable collections
Unit 3 writable collectionsvishal choudhary
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advancedChirag Ahuja
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tipsSubhas Kumar Ghosh
 

Was ist angesagt? (20)

Mapreduce by examples
Mapreduce by examplesMapreduce by examples
Mapreduce by examples
 
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing ModelMongoDB & Hadoop: Flexible Hourly Batch Processing Model
MongoDB & Hadoop: Flexible Hourly Batch Processing Model
 
Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop Anatomy of classic map reduce in hadoop
Anatomy of classic map reduce in hadoop
 
Unit 2
Unit 2Unit 2
Unit 2
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
Unit 2 part-2
Unit 2 part-2Unit 2 part-2
Unit 2 part-2
 
Map Reduce Online
Map Reduce OnlineMap Reduce Online
Map Reduce Online
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
Upgrading To The New Map Reduce API
Upgrading To The New Map Reduce APIUpgrading To The New Map Reduce API
Upgrading To The New Map Reduce API
 
Unit 4-apache pig
Unit 4-apache pigUnit 4-apache pig
Unit 4-apache pig
 
Unit 3
Unit 3Unit 3
Unit 3
 
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
Writing MapReduce Programs using Java | Big Data Hadoop Spark Tutorial | Clou...
 
Scalding
ScaldingScalding
Scalding
 
Lec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptxLec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptx
 
Hadoop 2
Hadoop 2Hadoop 2
Hadoop 2
 
Unit 3 writable collections
Unit 3 writable collectionsUnit 3 writable collections
Unit 3 writable collections
 
Map/Reduce intro
Map/Reduce introMap/Reduce intro
Map/Reduce intro
 
Mapreduce advanced
Mapreduce advancedMapreduce advanced
Mapreduce advanced
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
Hadoop performance optimization tips
Hadoop performance optimization tipsHadoop performance optimization tips
Hadoop performance optimization tips
 

Andere mochten auch

Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsLynn Langit
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Lu Wei
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Rohit Agrawal
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceMahantesh Angadi
 

Andere mochten auch (6)

Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop MapReduce Fundamentals
Hadoop MapReduce FundamentalsHadoop MapReduce Fundamentals
Hadoop MapReduce Fundamentals
 
Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]Resource Aware Scheduling for Hadoop [Final Presentation]
Resource Aware Scheduling for Hadoop [Final Presentation]
 
Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3Hadoop MapReduce framework - Module 3
Hadoop MapReduce framework - Module 3
 
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduceBIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
BIGDATA- Survey on Scheduling Methods in Hadoop MapReduce
 
Hadoop and Big Data Security
Hadoop and Big Data SecurityHadoop and Big Data Security
Hadoop and Big Data Security
 

Ähnlich wie Processing Data with Map Reduce: "MapReduce Data Processing

MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxHARIKRISHNANU13
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview questionpappupassindia
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data trainingagiamas
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkDatabricks
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaDesing Pathshala
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectMao Geng
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comsoftwarequery
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkPatrick Wendell
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionSubhas Kumar Ghosh
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop IntroductionSNEHAL MASNE
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiUnmesh Baile
 
Meethadoop
MeethadoopMeethadoop
MeethadoopIIIT-H
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 

Ähnlich wie Processing Data with Map Reduce: "MapReduce Data Processing (20)

Hadoop pig
Hadoop pigHadoop pig
Hadoop pig
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
 
mapReduce.pptx
mapReduce.pptxmapReduce.pptx
mapReduce.pptx
 
Hadoop interview question
Hadoop interview questionHadoop interview question
Hadoop interview question
 
Hadoop and big data training
Hadoop and big data trainingHadoop and big data training
Hadoop and big data training
 
hadoop.ppt
hadoop.ppthadoop.ppt
hadoop.ppt
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
 
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design PathshalaAdvance Map reduce - Apache hadoop Bigdata training by Design Pathshala
Advance Map reduce - Apache hadoop Bigdata training by Design Pathshala
 
Hadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log projectHadoop and HBase experiences in perf log project
Hadoop and HBase experiences in perf log project
 
Hadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.comHadoop interview questions - Softwarequery.com
Hadoop interview questions - Softwarequery.com
 
Tuning and Debugging in Apache Spark
Tuning and Debugging in Apache SparkTuning and Debugging in Apache Spark
Tuning and Debugging in Apache Spark
 
Hadoop first mr job - inverted index construction
Hadoop first mr job - inverted index constructionHadoop first mr job - inverted index construction
Hadoop first mr job - inverted index construction
 
Hadoop Introduction
Hadoop IntroductionHadoop Introduction
Hadoop Introduction
 
Big-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbaiBig-data-analysis-training-in-mumbai
Big-data-analysis-training-in-mumbai
 
Meethadoop
MeethadoopMeethadoop
Meethadoop
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 

Mehr von IndicThreads

Http2 is here! And why the web needs it
Http2 is here! And why the web needs itHttp2 is here! And why the web needs it
Http2 is here! And why the web needs itIndicThreads
 
Understanding Bitcoin (Blockchain) and its Potential for Disruptive Applications
Understanding Bitcoin (Blockchain) and its Potential for Disruptive ApplicationsUnderstanding Bitcoin (Blockchain) and its Potential for Disruptive Applications
Understanding Bitcoin (Blockchain) and its Potential for Disruptive ApplicationsIndicThreads
 
Go Programming Language - Learning The Go Lang way
Go Programming Language - Learning The Go Lang wayGo Programming Language - Learning The Go Lang way
Go Programming Language - Learning The Go Lang wayIndicThreads
 
Building Resilient Microservices
Building Resilient Microservices Building Resilient Microservices
Building Resilient Microservices IndicThreads
 
App using golang indicthreads
App using golang  indicthreadsApp using golang  indicthreads
App using golang indicthreadsIndicThreads
 
Building on quicksand microservices indicthreads
Building on quicksand microservices  indicthreadsBuilding on quicksand microservices  indicthreads
Building on quicksand microservices indicthreadsIndicThreads
 
How to Think in RxJava Before Reacting
How to Think in RxJava Before ReactingHow to Think in RxJava Before Reacting
How to Think in RxJava Before ReactingIndicThreads
 
Iot secure connected devices indicthreads
Iot secure connected devices indicthreadsIot secure connected devices indicthreads
Iot secure connected devices indicthreadsIndicThreads
 
Real world IoT for enterprises
Real world IoT for enterprisesReal world IoT for enterprises
Real world IoT for enterprisesIndicThreads
 
IoT testing and quality assurance indicthreads
IoT testing and quality assurance indicthreadsIoT testing and quality assurance indicthreads
IoT testing and quality assurance indicthreadsIndicThreads
 
Functional Programming Past Present Future
Functional Programming Past Present FutureFunctional Programming Past Present Future
Functional Programming Past Present FutureIndicThreads
 
Harnessing the Power of Java 8 Streams
Harnessing the Power of Java 8 Streams Harnessing the Power of Java 8 Streams
Harnessing the Power of Java 8 Streams IndicThreads
 
Building & scaling a live streaming mobile platform - Gr8 road to fame
Building & scaling a live streaming mobile platform - Gr8 road to fameBuilding & scaling a live streaming mobile platform - Gr8 road to fame
Building & scaling a live streaming mobile platform - Gr8 road to fameIndicThreads
 
Internet of things architecture perspective - IndicThreads Conference
Internet of things architecture perspective - IndicThreads ConferenceInternet of things architecture perspective - IndicThreads Conference
Internet of things architecture perspective - IndicThreads ConferenceIndicThreads
 
Cars and Computers: Building a Java Carputer
 Cars and Computers: Building a Java Carputer Cars and Computers: Building a Java Carputer
Cars and Computers: Building a Java CarputerIndicThreads
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache SparkIndicThreads
 
Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker
 Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker
Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & DockerIndicThreads
 
Speed up your build pipeline for faster feedback
Speed up your build pipeline for faster feedbackSpeed up your build pipeline for faster feedback
Speed up your build pipeline for faster feedbackIndicThreads
 
Unraveling OpenStack Clouds
 Unraveling OpenStack Clouds Unraveling OpenStack Clouds
Unraveling OpenStack CloudsIndicThreads
 
Digital Transformation of the Enterprise. What IT leaders need to know!
Digital Transformation of the Enterprise. What IT  leaders need to know!Digital Transformation of the Enterprise. What IT  leaders need to know!
Digital Transformation of the Enterprise. What IT leaders need to know!IndicThreads
 

Mehr von IndicThreads (20)

Http2 is here! And why the web needs it
Http2 is here! And why the web needs itHttp2 is here! And why the web needs it
Http2 is here! And why the web needs it
 
Understanding Bitcoin (Blockchain) and its Potential for Disruptive Applications
Understanding Bitcoin (Blockchain) and its Potential for Disruptive ApplicationsUnderstanding Bitcoin (Blockchain) and its Potential for Disruptive Applications
Understanding Bitcoin (Blockchain) and its Potential for Disruptive Applications
 
Go Programming Language - Learning The Go Lang way
Go Programming Language - Learning The Go Lang wayGo Programming Language - Learning The Go Lang way
Go Programming Language - Learning The Go Lang way
 
Building Resilient Microservices
Building Resilient Microservices Building Resilient Microservices
Building Resilient Microservices
 
App using golang indicthreads
App using golang  indicthreadsApp using golang  indicthreads
App using golang indicthreads
 
Building on quicksand microservices indicthreads
Building on quicksand microservices  indicthreadsBuilding on quicksand microservices  indicthreads
Building on quicksand microservices indicthreads
 
How to Think in RxJava Before Reacting
How to Think in RxJava Before ReactingHow to Think in RxJava Before Reacting
How to Think in RxJava Before Reacting
 
Iot secure connected devices indicthreads
Iot secure connected devices indicthreadsIot secure connected devices indicthreads
Iot secure connected devices indicthreads
 
Real world IoT for enterprises
Real world IoT for enterprisesReal world IoT for enterprises
Real world IoT for enterprises
 
IoT testing and quality assurance indicthreads
IoT testing and quality assurance indicthreadsIoT testing and quality assurance indicthreads
IoT testing and quality assurance indicthreads
 
Functional Programming Past Present Future
Functional Programming Past Present FutureFunctional Programming Past Present Future
Functional Programming Past Present Future
 
Harnessing the Power of Java 8 Streams
Harnessing the Power of Java 8 Streams Harnessing the Power of Java 8 Streams
Harnessing the Power of Java 8 Streams
 
Building & scaling a live streaming mobile platform - Gr8 road to fame
Building & scaling a live streaming mobile platform - Gr8 road to fameBuilding & scaling a live streaming mobile platform - Gr8 road to fame
Building & scaling a live streaming mobile platform - Gr8 road to fame
 
Internet of things architecture perspective - IndicThreads Conference
Internet of things architecture perspective - IndicThreads ConferenceInternet of things architecture perspective - IndicThreads Conference
Internet of things architecture perspective - IndicThreads Conference
 
Cars and Computers: Building a Java Carputer
 Cars and Computers: Building a Java Carputer Cars and Computers: Building a Java Carputer
Cars and Computers: Building a Java Carputer
 
Scrap Your MapReduce - Apache Spark
 Scrap Your MapReduce - Apache Spark Scrap Your MapReduce - Apache Spark
Scrap Your MapReduce - Apache Spark
 
Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker
 Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker
Continuous Integration (CI) and Continuous Delivery (CD) using Jenkins & Docker
 
Speed up your build pipeline for faster feedback
Speed up your build pipeline for faster feedbackSpeed up your build pipeline for faster feedback
Speed up your build pipeline for faster feedback
 
Unraveling OpenStack Clouds
 Unraveling OpenStack Clouds Unraveling OpenStack Clouds
Unraveling OpenStack Clouds
 
Digital Transformation of the Enterprise. What IT leaders need to know!
Digital Transformation of the Enterprise. What IT  leaders need to know!Digital Transformation of the Enterprise. What IT  leaders need to know!
Digital Transformation of the Enterprise. What IT leaders need to know!
 

Kürzlich hochgeladen

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 

Kürzlich hochgeladen (20)

Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 

Processing Data with Map Reduce: "MapReduce Data Processing

  • 1. Processing Data with Map Reduce Allahbaksh Mohammedali Asadullah Infosys Labs, Infosys Technologies 1
  • 2. Content Map Function Reduce Function Why Hadoop HDFS Map Reduce – Hadoop Some Questions 2
  • 3. What is Map Function Map is a classic primitive of Functional Programming. Map means apply a function to a list of elements and return the modified list. function List Map(Function func, List elements){ List newElement; foreach element in elements{ newElement.put(apply(func, element)) } return newElement 3 }
  • 4. Example Map Function function double increaseSalary(double salary){ return salary* (1+0.15); } function double Map(Function increaseSalary, List<Employee> employees){ List<Employees> newList; foreach employee in employees{ Employee tempEmployee = new ( newList.add(tempEmployee.income=increaseSalary( tempEmployee.income) } } 4
  • 5. Fold or Reduce Funtion Fold/Reduce reduces a list of values to one. Fold means apply a function to a list of elements and return a resulting element function Element Reduce(Function func, List elements){ Element earlierResult; foreach element in elements{ func(element, earlierResult) } return earlierResult; } 5
  • 6. Example Reduce Function function double add(double number1, double number2){ return number1 + number2; } function double Reduce (Function add, List<Employee> employees){ double totalAmout=0.0; foreach employee in employees{ totalAmount =add(totalAmount,emplyee.income); } return totalAmount } 6
  • 7. I know Map and Reduce, How do I use it I will use some library or framework. 7
  • 8. Why some framework? Lazy to write boiler plate code For modularity Code reusability 8
  • 9. What is best choice 9
  • 11. 11
  • 14. Strong Community 14 Image Courtesy http://goo.gl/15Nu3
  • 16. Hadoop 16
  • 18. Hadoop Distributed File System Large Distributed File System on commudity hardware 4k nodes, Thousands of files, Petabytes of data Files are replicated so that hard disk failure can be handled easily One NameNode and many DataNode 18
  • 19. Hadoop Distributed File System HDFS ARCHITECTURE Metadata (Name,replicas,..): Namenode Metadata ops Client Block ops Read Data Nodes Data Nodes Replication Blocks Write Rack 1 Rack 2 Client 19
  • 20. NameNode Meta-data in RAM The entire metadata is in main memory. Metadata consist of • List of files • List of Blocks for each file • List of DataNodes for each block • File attributes, e.g creation time • Transaction Log NameNode uses heartbeats to detect DataNode failure 20
  • 21. Data Node Data Node stores the data in file system Stores meta-data of a block Serves data and meta-data to Clients Pipelining of Data i.e forwards data to other specified DataNodes DataNodes send heartbeat to the NameNode every three sec. 21
  • 22. HDFS Commands Accessing HDFS hadoop dfs –mkdir myDirectory hadoop dfs -cat myFirstFile.txt Web Interface http://host:port/dfshealth.jsp 22
  • 24. Map Reduce Diagramtically Mapper Reducer Output Files Input Files Input Split 0 Input Split 1 Input Split 2 Input Split 3 Input Split 4 Input Split 5 Intermediate file is divided into R partitions, by partitioning function 24
  • 25. Input Format InputFormat descirbes the input sepcification to a MapReduce job. That is how the data is to be read from the File System . Split up the input file into logical InputSplits, each of which is assigned to an Mapper Provide the RecordReader implementation to be used to collect input record from logical InputSplit for processing by Mapper RecordReader, typically, converts the byte-oriented view of the input, provided by the InputSplit, and presents a record-oriented view for the Mapper & Reducer tasks for processing. It thus assumes the responsibility of processing record boundaries and presenting the tasks with keys and values.   25
  • 26. Creating a your Mapper The mapper should implements .mapred.Mapper Earlier version use to extend class .mapreduce.Mapper class Extend .mapred.MapReduceBase class which provides default implementation of close and configure method. The Main method is map ( WritableComparable key, Writable value, OutputCollector<K2,V2> output, Reporter reporter) One instance of your Mapper is initialized per task. Exists in separate process from all other instances of Mapper – no data sharing. So static variables will be different for different map task. Writable -- Hadoop defines a interface called Writable which is Serializable. Examples IntWritable, LongWritable, Text etc. WritableComparables can be compared to each other, typically via Comparators. Any type which is to be used as a key in the Hadoop Map-Reduce framework should implement this interface. InverseMapper swaps the key and value 26
  • 27. Combiner Combiners are used to optimize/minimize the number of key value pairs that will be shuffled across the network between mappers and reducers. Combiner are sort of mini reducer that will be applied potentially several times still during the map phase before to send the new set of key/value pairs to the reducer(s). Combiners should be used when the function you want to apply is both commutative and associative. Example: WordCount and Mean value computation Reference http://goo.gl/iU5kR 27
  • 28. Partitioner Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Some Partitioner are BinaryPartitioner, HashPartitioner, KeyFieldBasedPartitioner, TotalOrderPartitioner 28
  • 29. Creating a your Reducer The mapper should implements .mapred.Reducer Earlier version use to extend class .mapreduce.Reduces class Extend .mapred.MapReduceBase class which provides default implementation of close and configure method. The Main method is reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter) Keys & values sent to one partition all goes to the same reduce task Iterator.next() always returns the same object, different data HashPartioner partition it based on Hash function written IdentityReducer is default implementation of the Reducer 29
  • 30. Output Format OutputFormat is similar to InputFormat Different type of output formats are TextOutputFormat SequenceFileOutputFormat NullOutputFormat 30
  • 31. Mechanics of whole process Configure the Input and Output Configure the Mapper and Reducer Specify other parameters like number Map job, number of reduce job etc. Submit the job to client 31
  • 32. Example JobConf conf = new JobConf(WordCount.class); conf.setJobName("wordcount"); conf.setMapperClass(Map.class); conf.setCombinerClass(Reduce.class); conf.setReducerClass(Reduce.class); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); FileInputFormat.setInputPaths(conf, new Path(args[0]));FileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); //JobClient.submit 32
  • 33. Job Tracker & Task Tracker Master Node Job Tracker Slave Node Slave Node Task Tracker Task Tracker ..... .. Task Task Task Task 33
  • 34. Job Launch Process JobClient determines proper division of input into InputSplits Sends job data to master JobTracker server. Saves the jar and JobConf (serialized to XML) in shared location and posts the job into a queue. 34 34
  • 35. Job Launch Process Contd.. TaskTrackers running on slave nodes periodically query JobTracker for work. Get the job jar from the Master node to the data node. Launch the main class in separate JVM queue. TaskTracker.Child.main() 35 35
  • 36. Small File Problem What should I do if I have lots of small files? One word answer is SequenceFile. SequenceFile Layout Key Value Key Value Key Value Key Value File Name File Content Tar to SequenceFile http://goo.gl/mKGC7 36 Consolidator http://goo.gl/EVvi7
  • 37. Problem of Large File What if I have single big file of 20Gb? One word answer is There is no problems with large files 37
  • 38. SQL Data What is way to access SQL data? One word answer is DBInputFormat. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. DBInputFormat provides a simple method of scanning entire tables from a database, as well as the means to read from arbitrary SQL queries performed against the database. Database Access with Hadoop http://goo.gl/CNOBc 38
  • 39. JobConf conf = new JobConf(getConf(), MyDriver.class); conf.setInputFormat(DBInputFormat.class); DBConfiguration.configureDB(conf, “com.mysql.jdbc.Driver”, “jdbc:mysql://localhost:port/dbNamee”); String [] fields = { “employee_id”, "name" }; DBInputFormat.setInput(conf, MyRow.class, “employees”, null /* conditions */, “employee_id”, fields); 39
  • 40. public class MyRow implements Writable, DBWritable { private int employeeNumber; private String employeeName; public void write(DataOutput out) throws IOException { out.writeInt(employeeNumber); out.writeChars(employeeName); } public void readFields(DataInput in) throws IOException { employeeNumber= in.readInt(); employeeName = in.readUTF(); } public void write(PreparedStatement statement) throws SQLException { statement.setInt(1, employeeNumber); statement.setString(2, employeeName); } public void readFields(ResultSet resultSet) throws SQLException { employeeNumber = resultSet.getInt(1); employeeName = resultSet.getString (2); } 40 }
  • 41. Question & Answer 41