Overview of Hadoop and MapReduce
                         Ganesh Neelakanta Iyer
      Research Scholar, National University of Singapore
About Me


I have 3 years of Industry work experience
   - Sasken Communication Technologies Ltd, Bangalore
   - NXP Semiconductors Pvt Ltd (Formerly Philips Semiconductors), Bangalore
I completed my Master's in Electrical and Computer Engineering at NUS (National
    University of Singapore) in 2008.
Currently a Research Scholar at NUS under the guidance of A/P Bharadwaj Veeravalli.


Research Interests: Cloud computing, Game theory, Resource Allocation and Pricing
Personal Interests: Kathakali, Teaching, Travelling, Photography
Agenda
• Introduction to Hadoop

• Introduction to HDFS

• MapReduce Paradigm

• Some practical MapReduce examples

• MapReduce in Hadoop

• Concluding remarks
Introduction to Hadoop
Data!
• Facebook hosts approximately 10 billion photos, taking up one
  petabyte of storage

• The New York Stock Exchange generates about one terabyte of
  new trade data per day

• In the last week alone, I took 15 GB of photos while
  travelling. Imagine the storage required for all the photos
  taken around the world in a single day!
Hadoop
• Open-source framework developed under the Apache Software Foundation

• Reliable shared storage and analysis system

• Uses a distributed file system (HDFS), similar to the Google File System (GFS)

• Can be used for a variety of applications
Typical Hadoop Cluster




  [Figure: typical Hadoop cluster, from "Pro Hadoop" by Jason Venner]
Typical Hadoop Cluster
  [Figure: rack switches connected to an aggregation switch]




  40 nodes/rack, 1000-4000 nodes in cluster
  1 Gbps bandwidth within rack, 8 Gbps out of rack
  Node specs (Yahoo terasort):
     8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
  Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf
  Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf
Introduction to HDFS
HDFS – Hadoop Distributed File System
Very Large Distributed File System
   – 10K nodes, 100 million files, 10 PB
Assumes Commodity Hardware
   – Files are replicated to handle hardware failure
   – Detect failures and recover from them
Optimized for Batch Processing
   – Data locations exposed so that computations can move to where data
   resides
   – Provides very high aggregate bandwidth
User Space, runs on heterogeneous OS



                                          http://www.gartner.com/it/page.jsp?id=1447613
Distributed File System
   Data Coherency
      – Write-once-read-many access model
      – Client can only append to existing files

   Files are broken up into blocks
       – Typically 128 MB block size
       – Each block replicated on multiple DataNodes

   Intelligent Client
      – Client can find location of blocks
      – Client accesses data directly from DataNode
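
The block arithmetic above can be sketched in a few lines. This is only an illustration: the 128 MB block size is the typical value quoted on the slide, the replication factor of 3 is assumed here (it is the usual HDFS default, but it is configurable), and block_layout is a hypothetical helper, not part of any Hadoop API.

```python
import math

BLOCK_SIZE_MB = 128   # typical block size quoted on the slide
REPLICATION = 3       # assumption: the usual HDFS default, configurable per cluster


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (number of blocks, number of block replicas stored) for one file."""
    blocks = math.ceil(file_size_mb / block_size_mb)
    return blocks, blocks * replication


# A hypothetical 1 GB (1024 MB) file
print(block_layout(1024))
```

With these assumptions, a 1024 MB file splits into 8 blocks and the cluster stores 24 block replicas across its DataNodes.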
MapReduce Paradigm
MapReduce
Simple data-parallel programming model designed for scalability and
   fault-tolerance

Framework for distributed processing of large data sets

Originally designed by Google

Pluggable user code runs in generic framework

Google reported processing about 20 petabytes of data per day with it
What is MapReduce used for?
At Google:
    Index construction for Google Search
    Article clustering for Google News
    Statistical machine translation
At Yahoo!:
    “Web map” powering Yahoo! Search
    Spam detection for Yahoo! Mail
At Facebook:
    Data mining
    Ad optimization
    Spam detection
What is MapReduce used for?
In research:
    Astronomical image analysis (Washington)
    Bioinformatics (Maryland)
    Analyzing Wikipedia conflicts (PARC)
    Natural language processing (CMU)
    Particle physics (Nebraska)
    Ocean climate simulation (Washington)
    <Your application here>
MapReduce Programming Model
Data type: key-value records

Map function:
              (K_in, V_in) -> list(K_inter, V_inter)

Reduce function:
              (K_inter, list(V_inter)) -> list(K_out, V_out)
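
To make the two signatures concrete, here is a minimal single-process sketch of the model. It assumes everything fits in memory and ignores distribution entirely; run_mapreduce, map_fn, reduce_fn and the sample records are hypothetical names and data chosen for illustration, not part of Hadoop.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, shuffle & sort, and reduce sequentially in one process."""
    # Map: each (k_in, v_in) record yields zero or more (k_inter, v_inter) pairs
    groups = defaultdict(list)
    for k_in, v_in in records:
        for k_inter, v_inter in map_fn(k_in, v_in):
            groups[k_inter].append(v_inter)   # shuffle: group values by key
    # Reduce: each (k_inter, list(v_inter)) yields zero or more (k_out, v_out) pairs
    output = []
    for k_inter in sorted(groups):            # sort: keys reach reduce in order
        output.extend(reduce_fn(k_inter, groups[k_inter]))
    return output


def map_fn(_line_number, line):
    # Input value is "name value"; emit (name, value)
    name, value = line.split()
    yield name, int(value)


def reduce_fn(name, values):
    # Keep the largest value seen for each name
    yield name, max(values)


records = [(0, "a 3"), (1, "b 7"), (2, "a 5")]
print(run_mapreduce(records, map_fn, reduce_fn))
```

The user supplies only map_fn and reduce_fn; grouping and ordering belong to the framework, which is the division of labour the model describes.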
Example: Word Count
  def mapper(line):
      # Emit (word, 1) for every word in the line
      for word in line.split():
          output(word, 1)


  def reducer(key, values):
      # values holds every count emitted for this word
      output(key, sum(values))
Input -> Map -> Shuffle & Sort -> Reduce -> Output

Input splits and map output:
  "the quick brown fox"    ->  the, 1   quick, 1   brown, 1   fox, 1
  "the fox ate the mouse"  ->  the, 1   fox, 1   ate, 1   the, 1   mouse, 1
  "how now brown cow"      ->  how, 1   now, 1   brown, 1   cow, 1

Reduce output (after shuffle & sort):
  Reducer 1:  brown, 2   fox, 2   how, 1   now, 1   the, 3
  Reducer 2:  ate, 1   cow, 1   mouse, 1   quick, 1
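
The word count flow can be traced locally with a short script. This is a sketch under the assumption of a single process and a single reducer; the generator-based mapper and the grouped dictionary stand in for what the framework does during shuffle & sort.

```python
from collections import defaultdict


def mapper(line):
    # Emit (word, 1) for every word in the line
    for word in line.split():
        yield word, 1


def reducer(key, values):
    # Sum every count emitted for one word
    return key, sum(values)


lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Shuffle & sort: collect every count emitted for the same word
grouped = defaultdict(list)
for line in lines:
    for word, count in mapper(line):
        grouped[word].append(count)

counts = dict(reducer(word, values) for word, values in sorted(grouped.items()))
print(counts)
```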
MapReduce Execution Details
Single master controls job execution on multiple slaves

Mappers preferentially placed on same node or same rack as their
  input block
   Minimizes network usage

Mappers save outputs to local disk before serving them to reducers
  Allows recovery if a reducer crashes
  Allows having more reducers than nodes
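
The placement preference for mappers can be sketched as a three-step lookup. This is an illustrative simplification, not the actual Hadoop scheduler: pick_node, the node names and the rack_of mapping are all hypothetical.

```python
def pick_node(block_replicas, free_nodes, rack_of):
    """Choose a node for a map task, preferring data locality.

    block_replicas: set of nodes that hold a replica of the input block
    free_nodes:     nodes with a free map slot, in scheduler order
    rack_of:        dict mapping node name -> rack id
    """
    # 1. Node-local: a free node that already stores the block
    for node in free_nodes:
        if node in block_replicas:
            return node, "node-local"
    # 2. Rack-local: a free node in the same rack as some replica
    replica_racks = {rack_of[n] for n in block_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "rack-local"
    # 3. Off-rack: any free node; the block must cross the aggregation switch
    if free_nodes:
        return free_nodes[0], "off-rack"
    return None, "no free node"


rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
print(pick_node({"n1", "n3"}, ["n2", "n3"], rack_of))
print(pick_node({"n1"}, ["n4", "n2"], rack_of))
```

Each step down the list costs more network traffic, which is why the scheduler tries them in this order.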
Fault Tolerance in MapReduce
1. If a task crashes:
     Retry on another node
          OK for a map because it has no dependencies
          OK for reduce because map outputs are on disk
     If the same task fails repeatedly, fail the job or ignore that input
   block (user-controlled)
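
The retry policy for a crashed task might look like the following sketch. The limit of 4 attempts and the skip_on_failure switch are assumptions made for illustration; run_with_retries and the node names are hypothetical and do not correspond to a real Hadoop API.

```python
def run_with_retries(task, nodes, max_attempts=4, skip_on_failure=False):
    """Run task(node), moving to another node after each failure."""
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]   # retry on a different node
        try:
            return task(node)
        except Exception as err:
            print(f"attempt {attempt + 1} on {node} failed: {err}")
    if skip_on_failure:
        return None                          # user chose to ignore this input block
    raise RuntimeError("task failed repeatedly; failing the job")


def flaky_task(node):
    # Hypothetical task that crashes only on one bad node
    if node == "node-1":
        raise IOError("disk error")
    return f"done on {node}"


print(run_with_retries(flaky_task, ["node-1", "node-2"]))
```

Retrying is safe here because a task only reads its input and writes its own output, so running it twice has no side effects on other tasks.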
Fault Tolerance in MapReduce

2. If a node crashes:
     Re-launch its current tasks on other nodes
     Re-run any maps the node previously ran
         Necessary because their output files were lost along with the
       crashed node
Fault Tolerance in MapReduce
3. If a task is going slowly (straggler):
     Launch second copy of task on another node (“speculative
   execution”)
     Take the output of whichever copy finishes first, and kill the other

  Surprisingly important in large clusters
   Stragglers occur frequently due to failing hardware, software bugs,
  misconfiguration, etc
   Single straggler may noticeably slow down a job
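
Speculative execution can be imitated with threads. This sketch simplifies in two ways, both assumptions: it starts both copies at once instead of waiting until the first looks slow, and it cannot really kill the losing copy (Python threads cannot be killed), so it just stops waiting for it. run_speculatively and the node names are hypothetical.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_speculatively(task, nodes):
    """Run the same task on every given node and return the first result."""
    pool = ThreadPoolExecutor(max_workers=len(nodes))
    futures = [pool.submit(task, node) for node in nodes]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    for future in pending:
        future.cancel()          # best effort; a real framework kills the slower attempt
    pool.shutdown(wait=False)    # do not block on the straggler
    return next(iter(done)).result()


def task(node):
    # Hypothetical task: one node is a straggler
    time.sleep(0.5 if node == "slow-node" else 0.01)
    return f"output from {node}"


print(run_speculatively(task, ["slow-node", "fast-node"]))
```

Taking whichever copy finishes first is only valid because both copies produce the same output for the same input.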
Takeaways
By providing a data-parallel programming model, MapReduce can
   control job execution in useful ways:
    Automatic division of job into tasks
    Automatic placement of computation near data
    Automatic load balancing
    Recovery from failures & stragglers

User focuses on application, not on complexities of distributed
  computing
Some practical MapReduce examples
1. Search
Input: (lineNumber, line) records
Output: lines matching a given pattern

Map:
          if line matches pattern:
              output(line)

Reduce: identity function
   Alternative: no reducer (map-only job)
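
A map-only search can be sketched as follows, assuming the pattern is a regular expression. The search_mapper name and the sample records are hypothetical.

```python
import re


def search_mapper(line_number, line, pattern):
    """Map step: emit the line only if it matches the pattern."""
    if re.search(pattern, line):
        yield line


records = [(1, "error: disk full"), (2, "all good"), (3, "warning then error")]

# No reducer is needed: the job output is simply the concatenated map output
matches = [out for number, line in records
           for out in search_mapper(number, line, r"error")]
print(matches)
```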
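2. Sort
Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: Pick partitioning function h such that
   k1 < k2 => h(k1) <= h(k2)
   (order-preserving, so the reducer outputs concatenate into one sorted list)

Map output, routed by partition:
   Map 1:  ant, bee -> Reduce [A-M]             zebra -> Reduce [N-Z]
   Map 2:  cow -> Reduce [A-M]                  pig -> Reduce [N-Z]
   Map 3:  aardvark, elephant -> Reduce [A-M]   sheep, yak -> Reduce [N-Z]

Reduce output:
   Reduce [A-M]:  aardvark, ant, bee, cow, elephant
   Reduce [N-Z]:  pig, sheep, yak, zebra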
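
The partitioning trick can be sketched with a range partitioner. This assumes lowercase string keys and a single hand-picked boundary "n" separating [A-M] from [N-Z]; partition and distributed_sort are hypothetical names, and a real job would choose boundaries by sampling the keys.

```python
from bisect import bisect_right


def partition(key, boundaries):
    """Order-preserving partitioner: smaller keys never land in a later partition."""
    return bisect_right(boundaries, key)


def distributed_sort(map_outputs, boundaries):
    reducers = [[] for _ in range(len(boundaries) + 1)]
    for output in map_outputs:          # identity map: keys pass through unchanged
        for key in output:
            reducers[partition(key, boundaries)].append(key)
    # Each reducer sorts only its own key range
    return [sorted(keys) for keys in reducers]


map_outputs = [["ant", "bee", "zebra"], ["cow", "pig"],
               ["aardvark", "elephant", "sheep", "yak"]]
parts = distributed_sort(map_outputs, boundaries=["n"])
print(parts)

# Concatenating the partitions in order gives the globally sorted result
merged = [key for part in parts for key in part]
```

Because the partitioner preserves order, no merge step is needed after the reducers finish.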
3. Inverted Index
Input: (filename, text) records
Output: list of files containing each word

Map:
      def map(filename, text):
          for word in text.split():
              output(word, filename)

Combine: remove duplicate filenames for each word

Reduce:
      def reduce(word, filenames):
          output(word, sorted(filenames))
Inverted Index Example
Input files:
   hamlet.txt:  "to be or not to be"
   12th.txt:    "be not afraid of greatness"

Map output:
   hamlet.txt ->  to, hamlet.txt    be, hamlet.txt    or, hamlet.txt    not, hamlet.txt
   12th.txt   ->  be, 12th.txt    not, 12th.txt    afraid, 12th.txt    of, 12th.txt    greatness, 12th.txt

Reduce output:
   afraid, (12th.txt)
   be, (12th.txt, hamlet.txt)
   greatness, (12th.txt)
   not, (12th.txt, hamlet.txt)
   of, (12th.txt)
   or, (hamlet.txt)
   to, (hamlet.txt)
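
The inverted index example can be reproduced locally. This is a single-process sketch: the combiner is applied once per input file, standing in for one mapper's output, and the grouped dictionary plays the role of the shuffle.

```python
from collections import defaultdict


def mapper(filename, text):
    # Emit (word, filename) for every word in the file
    for word in text.split():
        yield word, filename


def combiner(pairs):
    """Drop duplicate (word, filename) pairs from one mapper's output."""
    return sorted(set(pairs))


def reducer(word, filenames):
    return word, sorted(filenames)


files = {"hamlet.txt": "to be or not to be",
         "12th.txt": "be not afraid of greatness"}

grouped = defaultdict(list)
for filename, text in files.items():
    for word, name in combiner(mapper(filename, text)):
        grouped[word].append(name)

index = dict(reducer(word, names) for word, names in sorted(grouped.items()))
print(index)
```

The combiner matters for hamlet.txt, where "to" and "be" each occur twice: without it, the duplicates would travel across the network only to be discarded later.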
4. Most Popular Words
Input: (filename, text) records
Output: top 100 words occurring in the most files

Two-stage solution:
   Job 1:
       Create inverted index, giving (word, list(file)) records
   Job 2:
       Map each (word, list(file)) to (count, word)
       Sort these records by count as in sort job
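
The two-stage solution can be sketched as two chained functions. This is an in-memory illustration only; job1_inverted_index and job2_top_words are hypothetical names, ties are broken alphabetically as an arbitrary choice, and the sample asks for the top 2 words rather than the top 100.

```python
from collections import defaultdict


def job1_inverted_index(files):
    """Job 1: build (word, set(files)) records."""
    index = defaultdict(set)
    for filename, text in files.items():
        for word in text.split():
            index[word].add(filename)
    return index


def job2_top_words(index, n):
    """Job 2: map each record to (count, word), then sort by count, descending."""
    counted = [(len(names), word) for word, names in index.items()]
    counted.sort(key=lambda pair: (-pair[0], pair[1]))
    return counted[:n]


files = {"hamlet.txt": "to be or not to be",
         "12th.txt": "be not afraid of greatness"}
top = job2_top_words(job1_inverted_index(files), 2)
print(top)
```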
MapReduce in Hadoop
MapReduce in Hadoop

Three ways to write jobs in Hadoop:
   Java API
   Hadoop Streaming (for Python, Perl, etc)
   Pipes API (C++)
Word Count in Python with Hadoop Streaming
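Mapper.py:
    import sys

    # Emit "word<TAB>1" for every word read from standard input
    for line in sys.stdin:
        for word in line.split():
            print(word.lower() + "\t" + "1")


Reducer.py:
    import sys

    counts = {}
    # Each input line is "word<TAB>count"; add up the counts per word
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        counts[word] = counts.get(word, 0) + int(count)
    for word, count in counts.items():
        print(word + "\t" + str(count))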
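
Because Hadoop Streaming hands the reducer its input sorted by key, the reducer does not have to hold every word in a dictionary. The sketch below is one possible alternative that processes one word at a time; reduce_sorted is a hypothetical helper, and the sample list stands in for sys.stdin, which a real streaming script would pass instead.

```python
from itertools import groupby


def reduce_sorted(lines):
    """Yield 'word<TAB>total' for input lines that are already sorted by word."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # groupby only merges adjacent equal keys, which is enough for sorted input
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word + "\t" + str(sum(int(count) for _, count in group))


# Sample input in the form the mapper emits, after sorting by key
sample = ["brown\t1\n", "brown\t1\n", "fox\t1\n", "the\t1\n", "the\t1\n", "the\t1\n"]
print(list(reduce_sorted(sample)))
```

Memory use stays constant regardless of how many distinct words the reducer receives, at the cost of depending on the sorted-input guarantee.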
Concluding remarks
Conclusions
MapReduce programming model hides the complexity of work
  distribution and fault tolerance

Principal design philosophies:
    Make it scalable, so you can throw hardware at problems
    Make it cheap, lowering hardware, programming and admin costs

MapReduce is not suitable for all problems, but when it works, it may
  save you quite a bit of time

Cloud computing makes it straightforward to start using Hadoop (or
   other parallel software) at scale
What next?
MapReduce has limitations: it does not fit every application

Some developments:
  • Pig started at Yahoo research
  • Hive developed at Facebook
  • Amazon Elastic MapReduce
Resources
Hadoop: http://hadoop.apache.org/core/
Pig: http://hadoop.apache.org/pig
Hive: http://hadoop.apache.org/hive
Video tutorials: http://www.cloudera.com/hadoop-training

Amazon Web Services: http://aws.amazon.com/
Amazon Elastic MapReduce guide:
  http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

Slides of the talk delivered by Matei Zaharia, EECS, University of
   California, Berkeley
Thank you!
ganesh.iyer@nus.edu.sg
http://ganeshniyer.com

Weitere ähnliche Inhalte

Was ist angesagt? (20)

MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Map Reduce basics
Map Reduce basicsMap Reduce basics
Map Reduce basics
 
Hadoop-Introduction
Hadoop-IntroductionHadoop-Introduction
Hadoop-Introduction
 
The Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphXThe Pregel Programming Model with Spark GraphX
The Pregel Programming Model with Spark GraphX
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
Apache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce TutorialApache Hadoop MapReduce Tutorial
Apache Hadoop MapReduce Tutorial
 
Spark and shark
Spark and sharkSpark and shark
Spark and shark
 
Shark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at ScaleShark SQL and Rich Analytics at Scale
Shark SQL and Rich Analytics at Scale
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
MapReduce
MapReduceMapReduce
MapReduce
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Lec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptxLec_4_1_IntrotoPIG.pptx
Lec_4_1_IntrotoPIG.pptx
 
Unit 1
Unit 1Unit 1
Unit 1
 
The Hadoop Ecosystem
The Hadoop EcosystemThe Hadoop Ecosystem
The Hadoop Ecosystem
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
 
Hadoop introduction 2
Hadoop introduction 2Hadoop introduction 2
Hadoop introduction 2
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark20130912 YTC_Reynold Xin_Spark and Shark
20130912 YTC_Reynold Xin_Spark and Shark
 
Hadoop 3
Hadoop 3Hadoop 3
Hadoop 3
 

Andere mochten auch

Communicating State Machines
Communicating State MachinesCommunicating State Machines
Communicating State Machinessrirammalhar
 
Hadoop and MapReduce
Hadoop and MapReduceHadoop and MapReduce
Hadoop and MapReduceamreshkr19
 
What is big data
What is big dataWhat is big data
What is big dataCnu Federer
 
What is hadoop and how it works?
What is hadoop and how it works?What is hadoop and how it works?
What is hadoop and how it works?Cnu Federer
 
An introduction to Apache Cassandra
An introduction to Apache CassandraAn introduction to Apache Cassandra
An introduction to Apache CassandraMike Frampton
 
Putting Hadoop To Work In The Enterprise
Putting Hadoop To Work In The EnterprisePutting Hadoop To Work In The Enterprise
Putting Hadoop To Work In The EnterpriseDataWorks Summit
 
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...TheInevitableCloud
 
Intro to big data and hadoop ubc cs lecture series - g fawkes
Intro to big data and hadoop   ubc cs lecture series - g fawkesIntro to big data and hadoop   ubc cs lecture series - g fawkes
Intro to big data and hadoop ubc cs lecture series - g fawkesgfawkesnew2
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2IMC Institute
 
A (very) short intro to Hadoop
A (very) short intro to HadoopA (very) short intro to Hadoop
A (very) short intro to HadoopKen Krugler
 
Li-Fi Technology (Perfect slides)
Li-Fi Technology (Perfect slides)Li-Fi Technology (Perfect slides)
Li-Fi Technology (Perfect slides)UzmaRuhy
 
ppt on LIFI TECHNOLOGY
ppt on LIFI TECHNOLOGYppt on LIFI TECHNOLOGY
ppt on LIFI TECHNOLOGYtanshu singh
 

Andere mochten auch (16)

Communicating State Machines
Communicating State MachinesCommunicating State Machines
Communicating State Machines
 
Hadoop and MapReduce
Hadoop and MapReduceHadoop and MapReduce
Hadoop and MapReduce
 
What is big data
What is big dataWhat is big data
What is big data
 
Cloud Computing
Cloud ComputingCloud Computing
Cloud Computing
 
What is hadoop and how it works?
What is hadoop and how it works?What is hadoop and how it works?
What is hadoop and how it works?
 
Hadoop - How It Works
Hadoop - How It WorksHadoop - How It Works
Hadoop - How It Works
 
An introduction to Apache Cassandra
An introduction to Apache CassandraAn introduction to Apache Cassandra
An introduction to Apache Cassandra
 
Putting Hadoop To Work In The Enterprise
Putting Hadoop To Work In The EnterprisePutting Hadoop To Work In The Enterprise
Putting Hadoop To Work In The Enterprise
 
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
Intro to Big Data and Apache Hadoop by Dr. Amr Awadallah at CLOUD WEEKEND '13...
 
Intro to big data and hadoop ubc cs lecture series - g fawkes
Intro to big data and hadoop   ubc cs lecture series - g fawkesIntro to big data and hadoop   ubc cs lecture series - g fawkes
Intro to big data and hadoop ubc cs lecture series - g fawkes
 
Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2Hadoop Hand-on Lab: Installing Hadoop 2
Hadoop Hand-on Lab: Installing Hadoop 2
 
Nuclear Weapons
Nuclear WeaponsNuclear Weapons
Nuclear Weapons
 
Hyperloop
HyperloopHyperloop
Hyperloop
 
A (very) short intro to Hadoop
A (very) short intro to HadoopA (very) short intro to Hadoop
A (very) short intro to Hadoop
 
Li-Fi Technology (Perfect slides)
Li-Fi Technology (Perfect slides)Li-Fi Technology (Perfect slides)
Li-Fi Technology (Perfect slides)
 
ppt on LIFI TECHNOLOGY
ppt on LIFI TECHNOLOGYppt on LIFI TECHNOLOGY
ppt on LIFI TECHNOLOGY
 

Ähnlich wie Introduction to Hadoop and MapReduce

Cloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoopCloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoopVeda Vyas
 
EMC2, Владимир Суворов
EMC2, Владимир СуворовEMC2, Владимир Суворов
EMC2, Владимир СуворовEYevseyeva
 
Intro to Big Data using Hadoop
Intro to Big Data using Hadoop Intro to Big Data using Hadoop
Intro to Big Data using Hadoop Sergejus Barinovas
 
Intro to threp
Intro to threpIntro to threp
Intro to threpHong Wu
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptMaruthiPrasad96
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightGert Drapers
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
 
WOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph MiningWOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph Miningaravindan_raghu
 
Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using DiscoJim Roepcke
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansattilacsordas
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindEMC
 
Large Scale Data Processing & Storage
Large Scale Data Processing & StorageLarge Scale Data Processing & Storage
Large Scale Data Processing & StorageIlayaraja P
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupCloudera, Inc.
 

Ähnlich wie Introduction to Hadoop and MapReduce (20)

Cloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoopCloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoop
 
EMC2, Владимир Суворов
EMC2, Владимир СуворовEMC2, Владимир Суворов
EMC2, Владимир Суворов
 
Intro to Big Data using Hadoop
Intro to Big Data using Hadoop Intro to Big Data using Hadoop
Intro to Big Data using Hadoop
 
Intro to threp
Intro to threpIntro to threp
Intro to threp
 
L19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .pptL19CloudMapReduce introduction for cloud computing .ppt
L19CloudMapReduce introduction for cloud computing .ppt
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
The Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsightThe Fundamentals Guide to HDP and HDInsight
The Fundamentals Guide to HDP and HDInsight
 
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)
 
WOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph MiningWOOster: A Map-Reduce based Platform for Graph Mining
WOOster: A Map-Reduce based Platform for Graph Mining
 
Introduction to MapReduce using Disco
Introduction to MapReduce using DiscoIntroduction to MapReduce using Disco
Introduction to MapReduce using Disco
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Hadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticiansHadoop 101 for bioinformaticians
Hadoop 101 for bioinformaticians
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Big Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilindBig Data Analytics with Hadoop with @techmilind
Big Data Analytics with Hadoop with @techmilind
 
Large Scale Data Processing & Storage
Large Scale Data Processing & StorageLarge Scale Data Processing & Storage
Large Scale Data Processing & Storage
 
Hadoop london
Hadoop londonHadoop london
Hadoop london
 
Apache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's GroupApache Hadoop & Friends at Utah Java User's Group
Apache Hadoop & Friends at Utah Java User's Group
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 

Mehr von Dr Ganesh Iyer

SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignSRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignDr Ganesh Iyer
 
SRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overviewSRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overviewDr Ganesh Iyer
 
SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2Dr Ganesh Iyer
 
SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1 SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1 Dr Ganesh Iyer
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLADr Ganesh Iyer
 
SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2Dr Ganesh Iyer
 
SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1Dr Ganesh Iyer
 
SRE Demystified - 09 - Simplicity
SRE Demystified - 09 - SimplicitySRE Demystified - 09 - Simplicity
SRE Demystified - 09 - SimplicityDr Ganesh Iyer
 
SRE Demystified - 07 - Practical Alerting
SRE Demystified - 07 - Practical AlertingSRE Demystified - 07 - Practical Alerting
SRE Demystified - 07 - Practical AlertingDr Ganesh Iyer
 
SRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed MonitoringSRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed MonitoringDr Ganesh Iyer
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationDr Ganesh Iyer
 
SRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement ModelSRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement ModelDr Ganesh Iyer
 
SRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOsSRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOsDr Ganesh Iyer
 
Machine Learning for Statisticians - Introduction
Machine Learning for Statisticians - IntroductionMachine Learning for Statisticians - Introduction
Machine Learning for Statisticians - IntroductionDr Ganesh Iyer
 
Making Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approachMaking Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approachDr Ganesh Iyer
 
Game Theory and Engineering Applications
Game Theory and Engineering ApplicationsGame Theory and Engineering Applications
Game Theory and Engineering ApplicationsDr Ganesh Iyer
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its ApplicationsDr Ganesh Iyer
 
How to become a successful entrepreneur
How to become a successful entrepreneurHow to become a successful entrepreneur
How to become a successful entrepreneurDr Ganesh Iyer
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetesDr Ganesh Iyer
 

Mehr von Dr Ganesh Iyer (20)

SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System DesignSRE Demystified - 16 - NALSD - Non-Abstract Large System Design
SRE Demystified - 16 - NALSD - Non-Abstract Large System Design
 
SRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overviewSRE Demystified - 14 - SRE Practices overview
SRE Demystified - 14 - SRE Practices overview
 
SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2SRE Demystified - 13 - Docs that matter -2
SRE Demystified - 13 - Docs that matter -2
 
SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1 SRE Demystified - 12 - Docs that matter -1
SRE Demystified - 12 - Docs that matter -1
 
SRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLASRE Demystified - 01 - SLO SLI and SLA
SRE Demystified - 01 - SLO SLI and SLA
 
SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2SRE Demystified - 11 - Release management-2
SRE Demystified - 11 - Release management-2
 
SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1SRE Demystified - 10 - Release management-1
SRE Demystified - 10 - Release management-1
 
SRE Demystified - 09 - Simplicity
SRE Demystified - 09 - SimplicitySRE Demystified - 09 - Simplicity
SRE Demystified - 09 - Simplicity
 
SRE Demystified - 07 - Practical Alerting
SRE Demystified - 07 - Practical AlertingSRE Demystified - 07 - Practical Alerting
SRE Demystified - 07 - Practical Alerting
 
SRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed MonitoringSRE Demystified - 06 - Distributed Monitoring
SRE Demystified - 06 - Distributed Monitoring
 
SRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil EliminationSRE Demystified - 05 - Toil Elimination
SRE Demystified - 05 - Toil Elimination
 
SRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement ModelSRE Demystified - 04 - Engagement Model
SRE Demystified - 04 - Engagement Model
 
SRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOsSRE Demystified - 03 - Choosing SLIs and SLOs
SRE Demystified - 03 - Choosing SLIs and SLOs
 
Machine Learning for Statisticians - Introduction
Machine Learning for Statisticians - IntroductionMachine Learning for Statisticians - Introduction
Machine Learning for Statisticians - Introduction
 
Making Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approachMaking Decisions - A Game Theoretic approach
Making Decisions - A Game Theoretic approach
 
Cloud and Industry4.0
Cloud and Industry4.0Cloud and Industry4.0
Cloud and Industry4.0
 
Game Theory and Engineering Applications
Game Theory and Engineering ApplicationsGame Theory and Engineering Applications
Game Theory and Engineering Applications
 
Machine Learning and its Applications
Machine Learning and its ApplicationsMachine Learning and its Applications
Machine Learning and its Applications
 
How to become a successful entrepreneur
How to become a successful entrepreneurHow to become a successful entrepreneur
How to become a successful entrepreneur
 
Dockers and kubernetes
Dockers and kubernetesDockers and kubernetes
Dockers and kubernetes
 


Introduction to Hadoop and MapReduce

  • 1. Overview of Hadoop and MapReduce Ganesh Neelakanta Iyer Research Scholar, National University of Singapore
  • 2. About Me I have 3 years of Industry work experience - Sasken Communication Technologies Ltd, Bangalore - NXP Semiconductors Pvt Ltd (Formerly Philips Semiconductors), Bangalore I have finished my Masters in Electrical and Computer Engineering from NUS (National University of Singapore) in 2008. Currently Research Scholar in NUS under the guidance of A/P. Bharadwaj Veeravalli. Research Interests: Cloud computing, Game theory, Resource Allocation and Pricing Personal Interests: Kathakali, Teaching, Travelling, Photography
  • 3. Agenda • Introduction to Hadoop • Introduction to HDFS • MapReduce Paradigm • Some practical MapReduce examples • MapReduce in Hadoop • Concluding remarks
  • 5. Data! • Facebook hosts approximately 10 billion photos, taking up one petabyte of storage • The New York Stock Exchange generates about one terabyte of new trade data per day • In the last week alone, I personally took 15 GB of photos while travelling. Imagine the storage requirements for all the photos taken worldwide in a single day!
  • 6. Hadoop • Open-source framework maintained by the Apache Software Foundation • Reliable shared storage and analysis system • Uses a distributed file system (HDFS), modeled on the Google File System (GFS) • Can be used for a variety of applications
  • 7. Typical Hadoop Cluster Pro-Hadoop by Jason Venner
  • 8. Typical Hadoop Cluster Aggregation switch Rack switch 40 nodes/rack, 1000-4000 nodes in cluster 1 Gbps bandwidth within rack, 8 Gbps out of rack Node specs (Yahoo terasort): 8 x 2GHz cores, 8 GB RAM, 4 disks (= 4 TB?) Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf
  • 11. HDFS – Hadoop Distributed File System Very Large Distributed File System – 10K nodes, 100 million files, 10 PB Assumes Commodity Hardware – Files are replicated to handle hardware failure – Detect failures and recover from them Optimized for Batch Processing – Data locations exposed so that computations can move to where data resides – Provides very high aggregate bandwidth User Space, runs on heterogeneous OS http://www.gartner.com/it/page.jsp?id=1447613
  • 12. Distributed File System Data Coherency – Write-once-read-many access model – Client can only append to existing files Files are broken up into blocks – Typically 128 MB block size – Each block replicated on multiple DataNodes Intelligent Client – Client can find location of blocks – Client accesses data directly from DataNode
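The block and replication figures above can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch, not HDFS code: `block_layout` is a hypothetical helper, and the replication factor of 3 is an assumption (the slide only says "multiple DataNodes").

```python
import math

def block_layout(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, total block replicas, raw storage in MB).

    block_size_mb follows the slide's "typically 128 MB";
    replication=3 is an assumed value for illustration.
    """
    num_blocks = math.ceil(file_size_mb / block_size_mb)
    replicas = num_blocks * replication
    # The last block only occupies as much space as it holds,
    # so raw storage is the file size times the replication factor.
    raw_storage_mb = file_size_mb * replication
    return num_blocks, replicas, raw_storage_mb

blocks, replicas, raw_mb = block_layout(1024)
print(blocks, replicas, raw_mb)
```

Under these assumptions a 1 GB file is split into 8 blocks, stored as 24 block replicas spread over the DataNodes.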
  • 14. MapReduce Simple data-parallel programming model designed for scalability and fault-tolerance Framework for distributed processing of large data sets Pluggable user code runs in generic framework Pioneered by Google, which processes 20 petabytes of data per day with it
  • 15. What is MapReduce used for? At Google: Index construction for Google Search Article clustering for Google News Statistical machine translation At Yahoo!: “Web map” powering Yahoo! Search Spam detection for Yahoo! Mail At Facebook: Data mining Ad optimization Spam detection
  • 16. What is MapReduce used for? In research: Astronomical image analysis (Washington) Bioinformatics (Maryland) Analyzing Wikipedia conflicts (PARC) Natural language processing (CMU) Particle physics (Nebraska) Ocean climate simulation (Washington) <Your application here>
  • 17. MapReduce Programming Model Data type: key-value records Map function: (Kin, Vin) → list(Kinter, Vinter) Reduce function: (Kinter, list(Vinter)) → list(Kout, Vout)
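To make the two signatures concrete, here is a minimal single-process sketch of the model. `run_mapreduce` is a hypothetical helper written for illustration, not part of Hadoop; it groups intermediate values by key in memory, which a real cluster does across machines.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map_fn to each (key, value) record, group the intermediate
    pairs by key, then apply reduce_fn to each (key, list of values)."""
    groups = defaultdict(list)
    for k_in, v_in in records:
        for k_inter, v_inter in map_fn(k_in, v_in):
            groups[k_inter].append(v_inter)
    output = []
    # Process intermediate keys in sorted order, as the shuffle & sort step would.
    for k_inter in sorted(groups):
        output.extend(reduce_fn(k_inter, groups[k_inter]))
    return output

def identity_map(key, value):
    # (Kin, Vin) -> list(Kinter, Vinter)
    return [(key, value)]

def max_reduce(key, values):
    # (Kinter, list(Vinter)) -> list(Kout, Vout)
    return [(key, max(values))]

result = run_mapreduce([("a", 3), ("b", 5), ("a", 7)], identity_map, max_reduce)
print(result)
```

The example job (maximum value per key) is only a placeholder; any pair of functions with these shapes can be plugged in.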
  • 18. Example: Word Count
        # output() stands for the framework call that emits a key-value pair
        def mapper(line):
            for word in line.split():
                output(word, 1)

        def reducer(key, values):
            output(key, sum(values))
  • 19. Word count data flow (figure): Input → Map → Shuffle & Sort → Reduce → Output. Input lines: "the quick brown fox", "the fox ate the mouse", "how now brown cow". Each mapper emits (word, 1) pairs; the reducers output brown, 2; fox, 2; how, 1; now, 1; the, 3 and ate, 1; cow, 1; mouse, 1; quick, 1.
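The data flow in this figure can be reproduced locally. This is a minimal sketch that runs all three phases in one process; the dictionary `shuffled` stands in for the shuffle & sort step, and the names are chosen for illustration only.

```python
from collections import defaultdict

def mapper(line):
    # Emit (word, 1) for every word in the input line.
    for word in line.split():
        yield word, 1

def reducer(key, values):
    # Sum the counts collected for one word.
    return key, sum(values)

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]

# Shuffle & sort: group every emitted value under its key.
shuffled = defaultdict(list)
for line in lines:
    for word, one in mapper(line):
        shuffled[word].append(one)

counts = dict(reducer(word, ones) for word, ones in sorted(shuffled.items()))
print(counts)
```

The resulting counts match the reducer outputs shown in the figure, for example `the` appearing 3 times.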
  • 20. MapReduce Execution Details Single master controls job execution on multiple slaves Mappers preferentially placed on same node or same rack as their input block Minimizes network usage Mappers save outputs to local disk before serving them to reducers Allows recovery if a reducer crashes Allows having more reducers than nodes
  • 21. Fault Tolerance in MapReduce 1. If a task crashes: Retry on another node OK for a map because it has no dependencies OK for reduce because map outputs are on disk If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)
  • 22. Fault Tolerance in MapReduce 2. If a node crashes: Re-launch its current tasks on other nodes Re-run any maps the node previously ran Necessary because their output files were lost along with the crashed node
  • 23. Fault Tolerance in MapReduce 3. If a task is going slowly (straggler): Launch second copy of task on another node (“speculative execution”) Take the output of whichever copy finishes first, and kill the other Surprisingly important in large clusters Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc Single straggler may noticeably slow down a job
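The first case above (a crashed task retried on another node) can be sketched as a simple retry loop. This is an illustrative simulation only: `run_with_retries`, `flaky_map_task`, the node names, and the limit of three attempts are all assumptions, not Hadoop's scheduler.

```python
def run_with_retries(task, nodes, max_attempts=3):
    """Run `task` on successive nodes until one attempt succeeds.

    Raises RuntimeError if every attempt fails, mirroring
    "if the same task fails repeatedly, fail the job".
    """
    errors = []
    for attempt in range(max_attempts):
        node = nodes[attempt % len(nodes)]
        try:
            return task(node)
        except RuntimeError as exc:
            errors.append((node, str(exc)))
    raise RuntimeError(f"task failed {max_attempts} times: {errors}")

def flaky_map_task(node):
    # Simulated map task that always crashes on node-1.
    if node == "node-1":
        raise RuntimeError("simulated crash")
    return f"map output written on {node}"

result = run_with_retries(flaky_map_task, ["node-1", "node-2", "node-3"])
print(result)
```

Retrying is safe here because the map task has no dependencies: rerunning it on a different node produces the same output.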
  • 24. Takeaways By providing a data-parallel programming model, MapReduce can control job execution in useful ways: Automatic division of job into tasks Automatic placement of computation near data Automatic load balancing Recovery from failures & stragglers User focuses on application, not on complexities of distributed computing
  • 26. 1. Search Input: (lineNumber, line) records Output: lines matching a given pattern Map: if (line matches pattern): output(line) Reduce: identity function Alternative: no reducer (map-only job)
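A map-only version of the search job could look like the sketch below. The sample records and the use of a regular expression for "matches pattern" are assumptions made for illustration.

```python
import re

def search_mapper(line_number, line, pattern):
    # Emit the line only if it matches the pattern; there is no reduce step.
    if re.search(pattern, line):
        yield line

records = [
    (1, "hadoop stores data in HDFS"),
    (2, "the quick brown fox"),
    (3, "MapReduce runs on hadoop"),
]

matches = [out
           for number, line in records
           for out in search_mapper(number, line, r"hadoop")]
print(matches)
```

Because each line is handled independently, the mappers' outputs can simply be concatenated.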
  • 27. 2. Sort Input: (key, value) records Output: same records, sorted by key Map: identity function Reduce: identity function Trick: Pick partitioning function h such that k1<k2 => h(k1)<=h(k2) (Figure: mappers emit ant, bee, zebra, cow, pig, aardvark, elephant, sheep, yak; reducer [A-M] outputs aardvark, ant, bee, cow, elephant; reducer [N-Z] outputs pig, sheep, yak, zebra)
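The partitioning trick can be checked with a small simulation. In this sketch the hypothetical `partition` function sends keys starting with a-m to reducer 0 and n-z to reducer 1, matching the [A-M] and [N-Z] reducers in the figure; each reducer sorts only its own keys.

```python
def partition(key):
    """Order-preserving partitioner: a smaller key never goes to a later reducer."""
    return 0 if key[0].lower() <= "m" else 1

words = ["ant", "bee", "zebra", "cow", "pig", "aardvark", "elephant", "sheep", "yak"]

# Shuffle: route each key to its reducer.
buckets = [[], []]
for word in words:
    buckets[partition(word)].append(word)

# Each reducer sorts locally; concatenating reducer outputs in order
# gives a globally sorted result without any further merge.
sorted_output = [word for bucket in buckets for word in sorted(bucket)]
print(sorted_output)
```

With an arbitrary hash partitioner the per-reducer outputs would each be sorted, but their concatenation would not be.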
  • 28. 3. Inverted Index Input: (filename, text) records Output: list of files containing each word Map: foreach word in text.split(): output(word, filename) Combine: uniquify filenames for each word Reduce: def reduce(word, filenames): output(word, sort(filenames))
  • 29. Inverted Index Example (figure): hamlet.txt contains "to be or not to be"; 12th.txt contains "be not afraid of greatness". Map output: (to, hamlet.txt), (be, hamlet.txt), (or, hamlet.txt), (not, hamlet.txt), (be, 12th.txt), (not, 12th.txt), (afraid, 12th.txt), (of, 12th.txt), (greatness, 12th.txt). Reduce output: afraid, (12th.txt); be, (12th.txt, hamlet.txt); greatness, (12th.txt); not, (12th.txt, hamlet.txt); of, (12th.txt); or, (hamlet.txt); to, (hamlet.txt)
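The inverted index example can be run locally with the two files from the figure. This is a minimal sketch: for brevity, duplicate filenames are removed inside the reducer rather than in a separate combiner, which is a simplification of the slide's three-step description.

```python
from collections import defaultdict

def index_mapper(filename, text):
    # Emit (word, filename) for every word in the file.
    for word in text.split():
        yield word, filename

def index_reducer(word, filenames):
    # Deduplicate and sort the filenames for one word.
    return word, sorted(set(filenames))

docs = [
    ("hamlet.txt", "to be or not to be"),
    ("12th.txt", "be not afraid of greatness"),
]

# Shuffle: collect the filenames emitted for each word.
grouped = defaultdict(list)
for filename, text in docs:
    for word, fname in index_mapper(filename, text):
        grouped[word].append(fname)

index = dict(index_reducer(word, fnames) for word, fnames in sorted(grouped.items()))
print(index)
```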
  • 30. 4. Most Popular Words Input: (filename, text) records Output: top 100 words occurring in the most files Two-stage solution: Job 1: Create inverted index, giving (word, list(file)) records Job 2: Map each (word, list(file)) to (count, word) Sort these records by count as in sort job
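The second job of this two-stage solution can be sketched as follows. The inverted index below is written out by hand (it is what Job 1 would produce for the two example files), and the top 2 words are taken instead of the top 100 to keep the output short; `count_mapper` is a hypothetical name.

```python
# Output of Job 1: (word, list(file)) records.
inverted_index = {
    "afraid": ["12th.txt"],
    "be": ["12th.txt", "hamlet.txt"],
    "greatness": ["12th.txt"],
    "not": ["12th.txt", "hamlet.txt"],
    "of": ["12th.txt"],
    "or": ["hamlet.txt"],
    "to": ["hamlet.txt"],
}

def count_mapper(word, files):
    # Job 2 map: turn (word, list(file)) into (count, word).
    yield len(files), word

pairs = [pair
         for word, files in inverted_index.items()
         for pair in count_mapper(word, files)]

# Sort by descending count (ties broken alphabetically) and keep the top entries.
top = sorted(pairs, key=lambda p: (-p[0], p[1]))[:2]
print(top)
```

On a cluster, the final sort would reuse the sort job with the count as the key rather than sorting in memory.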
  • 32. MapReduce in Hadoop Three ways to write jobs in Hadoop: Java API Hadoop Streaming (for Python, Perl, etc) Pipes API (C++)
  • 33. Word Count in Python with Hadoop Streaming
    Mapper.py:
        import sys

        # Emit "word<TAB>1" for every word read from standard input.
        for line in sys.stdin:
            for word in line.split():
                print(word.lower() + "\t" + "1")
    Reducer.py:
        import sys

        # Sum the counts received for each word.
        counts = {}
        for line in sys.stdin:
            word, count = line.strip().split("\t")
            counts[word] = counts.get(word, 0) + int(count)
        for word, count in counts.items():
            print(word + "\t" + str(count))
  • 35. Conclusions MapReduce programming model hides the complexity of work distribution and fault tolerance Principal design philosophies: Make it scalable, so you can throw hardware at problems Make it cheap, lowering hardware, programming and admin costs MapReduce is not suitable for all problems, but when it works, it may save you quite a bit of time Cloud computing makes it straightforward to start using Hadoop (or other parallel software) at scale
  • 36. What next? MapReduce has limitations: not every application fits the model. Some developments: • Pig, started at Yahoo! Research • Hive, developed at Facebook • Amazon Elastic MapReduce
  • 37. Resources Hadoop: http://hadoop.apache.org/core/ Pig: http://hadoop.apache.org/pig Hive: http://hadoop.apache.org/hive Video tutorials: http://www.cloudera.com/hadoop-training Amazon Web Services: http://aws.amazon.com/ Amazon Elastic MapReduce guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/ Slides of the talk delivered by Matei Zaharia, EECS, University of California, Berkeley