Introduction to Hadoop
                 23 Feb 2012, Sydney

                                 Mark Fei
                          mark.fei@cloudera.com




Copyright © 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.   201109   01-1
Agenda

During this talk, we’ll cover:
§  Why Hadoop is needed
§  The basic concepts of HDFS and MapReduce
§  What sort of problems can be solved with Hadoop
§  What other projects are included in the Hadoop ecosystem
§  What resources are available to assist in managing your Hadoop
    installation




Traditional Large-Scale Computation

§  Traditionally, computation has been processor-bound
      –  Relatively small amounts of data
      –  Significant amount of complex processing performed on that data
§  For decades, the primary push was to increase the computing
    power of a single machine
     –  Faster processor, more RAM




Distributed Systems

  §  Moore’s Law: roughly stated, processing power doubles every
      two years
  §  Even that hasn’t always proved adequate for very CPU-intensive
      jobs
  §  Distributed systems evolved to allow developers to use multiple
      machines for a single job
        –  MPI
        –  PVM
        –  Condor




MPI: Message Passing Interface
PVM: Parallel Virtual Machine
Data Becomes the Bottleneck

§  Getting the data to the processors becomes the bottleneck
§  Quick calculation
     –  Typical disk data transfer rate: 75MB/sec
     –  Time taken to transfer 100GB of data to the processor: approx 22
        minutes!
          –  Assuming sustained reads
          –  Actual time will be worse, since most servers have less than
             100GB of RAM available
§  A new approach is needed
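The transfer-time figure above can be checked with simple arithmetic (a sketch, assuming decimal units, i.e. 100 GB = 100,000 MB, and a perfectly sustained read):

```python
# Back-of-the-envelope check of the claim above: moving 100 GB to the
# processor at a sustained 75 MB/sec.
data_mb = 100_000            # 100 GB, decimal units assumed
rate_mb_per_sec = 75         # typical disk transfer rate
minutes = data_mb / rate_mb_per_sec / 60
print(f"{minutes:.1f} minutes")  # 22.2 minutes
```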




Hadoop’s History

§  Hadoop is based on work done by Google in the early 2000s
     –  Specifically, on papers describing the Google File System (GFS)
        published in 2003, and MapReduce published in 2004
§  This work takes a radical new approach to the problem of
    distributed computing
      –  Meets all the requirements we have for reliability, scalability etc
§  Core concept: distribute the data as it is initially stored in the
    system
      –  Individual nodes can work on data local to those nodes
           –  No data transfer over the network is required for initial
              processing




Core Hadoop Concepts

§  Applications are written in high-level code
     –  Developers do not worry about network programming, temporal
        dependencies etc
§  Nodes talk to each other as little as possible
     –  Developers should not write code which communicates between
        nodes
     –  ‘Shared nothing’ architecture
§  Data is spread among machines in advance
     –  Computation happens where the data is stored, wherever
        possible
          –  Data is replicated multiple times on the system for increased
             availability and reliability


Hadoop Components

§  Hadoop consists of two core components
     –  The Hadoop Distributed File System (HDFS)
     –  The MapReduce software framework
§  There are many other projects based around core Hadoop
     –  Often referred to as the ‘Hadoop Ecosystem’
     –  Pig, Hive, HBase, Flume, Oozie, Sqoop, etc
         –  Many are discussed later in the course
§  A set of machines running HDFS and MapReduce is known as a
    Hadoop Cluster
      –  Individual machines are known as nodes
      –  A cluster can have as few as one node or as many as several
         thousand
           –  More nodes = better performance!

Hadoop Components: HDFS

§  HDFS, the Hadoop Distributed File System, is responsible for
    storing data on the cluster
§  Data files are split into blocks and distributed across multiple
    nodes in the cluster
§  Each block is replicated multiple times
     –  Default is to replicate each block three times
     –  Replicas are stored on different nodes
     –  This ensures both reliability and availability
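The block-and-replica layout above can be illustrated with a quick sizing calculation (a sketch, assuming a 64 MB block size, the default in Hadoop of this era, and the default replication factor of 3):

```python
import math

# Illustrative only: how a 1 GB file is laid out in HDFS, assuming
# 64 MB blocks and 3x replication.
file_mb = 1024            # a 1 GB file
block_mb = 64
replication = 3

blocks = math.ceil(file_mb / block_mb)   # 16 logical blocks
replicas = blocks * replication          # 48 block copies across the cluster
raw_mb = file_mb * replication           # 3072 MB of raw disk consumed
print(blocks, replicas, raw_mb)          # 16 48 3072
```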




Hadoop Components: MapReduce

§  MapReduce is the system used to process data in the Hadoop
    cluster
§  Consists of two phases: Map, and then Reduce
§  Each Map task operates on a discrete portion of the overall
    dataset
      –  Typically one HDFS data block
§  After all Maps are complete, the MapReduce system distributes
    the intermediate data to nodes which perform the Reduce phase




MapReduce Example: Word Count

§  Count the number of occurrences of each word in a large
    amount of input data

Map(input_key, input_value)
   foreach word w in input_value:
       emit(w, 1)


§  Input to the Mapper

(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')

§  Output from the Mapper

('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
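The Map pseudocode above translates directly into a small Python generator (a sketch for illustration; production Hadoop mappers are typically written in Java or run via Hadoop Streaming):

```python
def map_word_count(input_key, input_value):
    # Mirror of the Map pseudocode above: the key (a byte offset) is
    # ignored, and (word, 1) is emitted for each whitespace-separated word.
    for word in input_value.split():
        yield (word, 1)

pairs = list(map_word_count(3414, 'the cat sat on the mat'))
# [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]
```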
Map Phase

Mapper Input:

    The cat sat on the mat
    The aardvark sat on the sofa

Mapper Output:

    The, 1
    cat, 1
    sat, 1
    on, 1
    the, 1
    mat, 1
    The, 1
    aardvark, 1
    sat, 1
    on, 1
    the, 1
    sofa, 1




MapReduce: The Reducer

§  After the Map phase is over, all the intermediate values for a
    given intermediate key are combined together into a list
§  This list is given to a Reducer
     –  There may be a single Reducer, or multiple Reducers
     –  All values associated with a particular intermediate key are
        guaranteed to go to the same Reducer
     –  The intermediate keys, and their value lists, are passed to the
        Reducer in sorted key order
     –  This step is known as the ‘shuffle and sort’
§  The Reducer outputs zero or more final key/value pairs
     –  These are written to HDFS
     –  In practice, the Reducer usually emits a single key/value pair for
        each input key
Example Reducer: Sum Reducer

§  Add up all the values associated with each intermediate key:

 reduce(output_key, intermediate_vals)
    set count = 0
    foreach v in intermediate_vals:
        count += v
    emit(output_key, count)

 Reducer Input              Reducer Output
 aardvark, 1                aardvark, 1
 cat, 1                     cat, 1
 mat, 1                     mat, 1
 on, [1, 1]                 on, 2
 sat, [1, 1]                sat, 2
 sofa, 1                    sofa, 1
 the, [1, 1, 1, 1]          the, 4

§  Reducer output:

 ('aardvark', 1)
 ('cat', 1)
 ('mat', 1)
 ('on', 2)
 ('sat', 2)
 ('sofa', 1)
 ('the', 4)
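The sum Reducer pseudocode above can likewise be sketched as a small Python generator (illustrative only, not the Hadoop API):

```python
def reduce_sum(output_key, intermediate_vals):
    # Mirror of the reduce pseudocode above: sum the value list for one key
    # and emit a single (key, count) pair.
    count = 0
    for v in intermediate_vals:
        count += v
    yield (output_key, count)

print(list(reduce_sum('the', [1, 1, 1, 1])))  # [('the', 4)]
```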


Putting It All Together


The overall word count process:

 Mapper Input:
    The cat sat on the mat
    The aardvark sat on the sofa

 Mapping:
    The, 1   cat, 1        sat, 1   on, 1   the, 1   mat, 1
    The, 1   aardvark, 1   sat, 1   on, 1   the, 1   sofa, 1

 Shuffling:
    aardvark, 1
    cat, 1
    mat, 1
    on, [1, 1]
    sat, [1, 1]
    sofa, 1
    the, [1, 1, 1, 1]

 Reducing:
    aardvark, 1
    cat, 1
    mat, 1
    on, 2
    sat, 2
    sofa, 1
    the, 4

 Final Result:
    aardvark, 1
    cat, 1
    mat, 1
    on, 2
    sat, 2
    sofa, 1
    the, 4
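The whole pipeline on these two input lines can be simulated in a few lines of Python, with `sorted()` plus `itertools.groupby` standing in for Hadoop's shuffle and sort (an illustrative sketch; lower-casing words is an assumption here so that 'The' and 'the' land under the same key, matching the final result above):

```python
from itertools import groupby

lines = ['The cat sat on the mat', 'The aardvark sat on the sofa']

# Map: emit (word, 1) for every word.
mapped = [(w.lower(), 1) for line in lines for w in line.split()]

# Shuffle and sort: bring all values for each key together, in key order.
mapped.sort(key=lambda kv: kv[0])
grouped = [(k, [v for _, v in g]) for k, g in groupby(mapped, key=lambda kv: kv[0])]

# Reduce: sum each key's value list.
result = [(k, sum(vs)) for k, vs in grouped]
print(result)
# [('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)]
```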




Why Do We Care About Counting Words?

§  Word count is challenging over massive amounts of data
     –  Using a single compute node would be too time-consuming
     –  Using distributed nodes requires moving data
     –  The number of unique words can easily exceed available RAM
         –  Would need a hash table on disk
     –  Would need to partition the results (sort and shuffle)
§  Many fundamental statistics are simple aggregate functions
§  Most aggregation functions have a distributive nature
     –  e.g., max, min, sum, count
§  MapReduce breaks complex tasks down into smaller elements
    which can be executed in parallel
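The distributive property mentioned above is what lets each node aggregate its own partition and a final step combine the partials. A minimal illustration (the data values are made up for the example):

```python
data = [3, 41, 7, 19, 28, 5]
partitions = [data[:3], data[3:]]  # pretend the halves live on two nodes

# A distributive aggregate computed per-partition, then combined,
# equals the same aggregate computed over all the data at once.
assert max(max(p) for p in partitions) == max(data)    # 41
assert min(min(p) for p in partitions) == min(data)    # 3
assert sum(sum(p) for p in partitions) == sum(data)    # 103
```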



What is Common Across Hadoop-able Problems?
§  Nature of the data
     –  Complex data
     –  Multiple data sources
     –  Lots of it




§  Nature of the analysis
     –  Batch processing
     –  Parallel execution
     –  Spread data over a cluster of servers and take the computation
        to the data

Where Does Data Come From?

§  Science
     –  Medical imaging, sensor data, genome sequencing, weather
        data, satellite feeds, etc.
§  Industry
      –  Financial, pharmaceutical, manufacturing, insurance, online,
         energy, retail data
§  Legacy
      –  Sales data, customer behavior, product databases, accounting
         data, etc.
§  System Data
     –  Log files, health & status feeds, activity streams, network
        messages, Web Analytics, intrusion detection, spam filters


What Analysis is Possible With Hadoop?

§  Text mining                                                             §  Collaborative filtering

§  Index building                                                          §  Prediction models

§  Graph creation and analysis                                             §  Sentiment analysis

§  Pattern recognition                                                     §  Risk assessment




Benefits of Analyzing With Hadoop

§  Previously impossible/impractical to do this analysis
§  Analysis conducted at lower cost
§  Analysis conducted in less time
§  Greater flexibility
§  Linear scalability




Facts about Apache Hadoop

§  Open source (using the Apache license)
§  Around 40 core Hadoop committers from ~10 companies
     –  Cloudera, Yahoo!, Facebook, Apple and more
§  Hundreds of contributors writing features, fixing bugs
§  Many related projects, applications, tools, etc.




A large ecosystem




Who uses Hadoop?




Vendor integration




About Cloudera

§  Cloudera is “The commercial Hadoop company”
§  Founded by leading experts on Hadoop from Facebook, Google,
    Oracle and Yahoo
§  Provides consulting and training services for Hadoop users
§  Staff includes several committers to Hadoop projects




Cloudera Software (All Open-Source)

§  Cloudera’s Distribution including Apache Hadoop (CDH)
      –  A single, easy-to-install package from the Apache Hadoop core
         repository
      –  Includes a stable version of Hadoop, plus critical bug fixes and
         solid new features from the development version
§  Components
     –  Apache Hadoop
     –  Apache Hive
     –  Apache Pig
     –  Apache HBase
     –  Apache Zookeeper
     –  Flume, Hue, Oozie, and Sqoop



A Coherent Platform

(Platform diagram: Hue and the Hue SDK, Oozie, Hive, Pig/Hive, Flume/Sqoop, HBase, and Zookeeper.)
Cloudera Software (Licensed)

§  Cloudera Enterprise
      –  Cloudera’s Distribution including Apache Hadoop (CDH)
      –  Management applications
      –  Production support
§  Management Apps
     –  Authorization Management and Provisioning
     –  Resource Management
     –  Integration, configuration, and monitoring




Conclusion

§  Apache Hadoop is a fast-growing data framework
§  Cloudera’s Distribution including Apache Hadoop offers a free,
    cohesive platform that encapsulates:
      –  Data integration
      –  Data processing
      –  Workflow scheduling
      –  Monitoring





Weitere ähnliche Inhalte

Ähnlich wie Introduction to Hadoop

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Cloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoopCloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoopVeda Vyas
 
EMC2, Владимир Суворов
EMC2, Владимир СуворовEMC2, Владимир Суворов
EMC2, Владимир СуворовEYevseyeva
 
Tackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomTackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomBTI360
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraSomnath Mazumdar
 
Hadoop M/R Pig Hive
Hadoop M/R Pig HiveHadoop M/R Pig Hive
Hadoop M/R Pig Hivezahid-mian
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for MobileBugSense
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoopfvanvollenhoven
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoopmarklpollack
 
Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2ovarene
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoopdatasalt
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelt3rmin4t0r
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderDmitry Makarchuk
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an exampleNikita Kesharwani
 
Buzz words
Buzz wordsBuzz words
Buzz wordscwensel
 

Ähnlich wie Introduction to Hadoop (20)

Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
 
Cloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoopCloud computing-with-map reduce-and-hadoop
Cloud computing-with-map reduce-and-hadoop
 
EMC2, Владимир Суворов
EMC2, Владимир СуворовEMC2, Владимир Суворов
EMC2, Владимир Суворов
 
Tackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the RoomTackling Big Data with the Elephant in the Room
Tackling Big Data with the Elephant in the Room
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
 
Hadoop M/R Pig Hive
Hadoop M/R Pig HiveHadoop M/R Pig Hive
Hadoop M/R Pig Hive
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
Big Data for Mobile
Big Data for MobileBig Data for Mobile
Big Data for Mobile
 
GOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x HadoopGOTO 2011 preso: 3x Hadoop
GOTO 2011 preso: 3x Hadoop
 
Pivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache HadoopPivotal HD and Spring for Apache Hadoop
Pivotal HD and Spring for Apache Hadoop
 
Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2Big Data @ Orange - Dev Day 2013 - part 2
Big Data @ Orange - Dev Day 2013 - part 2
 
Introducción a hadoop
Introducción a hadoopIntroducción a hadoop
Introducción a hadoop
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Tez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthelTez: Accelerating Data Pipelines - fifthel
Tez: Accelerating Data Pipelines - fifthel
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hadoop and mysql by Chris Schneider
Hadoop and mysql by Chris SchneiderHadoop and mysql by Chris Schneider
Hadoop and mysql by Chris Schneider
 
Hadoop installation with an example
Hadoop installation with an exampleHadoop installation with an example
Hadoop installation with an example
 
Buzz words
Buzz wordsBuzz words
Buzz words
 

Kürzlich hochgeladen

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 

Kürzlich hochgeladen (20)

WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Introduction to Hadoop

  • 1. Introduction to Hadoop
    23 Feb 2012, Sydney
    Mark Fei (mark.fei@cloudera.com)
    Copyright © 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent.
  • 2. Agenda
    During this talk, we’ll cover:
    §  Why Hadoop is needed
    §  The basic concepts of HDFS and MapReduce
    §  What sorts of problems can be solved with Hadoop
    §  What other projects are included in the Hadoop ecosystem
    §  What resources are available to assist in managing your Hadoop installation
  • 3. Traditional Large-Scale Computation
    §  Traditionally, computation has been processor-bound
      –  Relatively small amounts of data
      –  Significant amounts of complex processing performed on that data
    §  For decades, the primary push was to increase the computing power of a single machine
      –  Faster processors, more RAM
  • 4. Distributed Systems
    §  Moore’s Law: roughly stated, processing power doubles every two years
    §  Even that hasn’t always proved adequate for very CPU-intensive jobs
    §  Distributed systems evolved to allow developers to use multiple machines for a single job
      –  MPI (Message Passing Interface)
      –  PVM (Parallel Virtual Machine)
      –  Condor
  • 5. Data Becomes the Bottleneck
    §  Getting the data to the processors becomes the bottleneck
    §  Quick calculation
      –  Typical disk data transfer rate: 75MB/sec
      –  Time taken to transfer 100GB of data to the processor: approx 22 minutes!
      –  Assuming sustained reads
      –  Actual time will be worse, since most servers have less than 100GB of RAM available
    §  A new approach is needed
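The slide's "quick calculation" can be checked directly (a back-of-the-envelope sketch, assuming decimal units, i.e. 100GB = 100,000MB):

```python
# Verify the slide's transfer-time estimate for moving data to a processor.
disk_rate_mb_per_s = 75        # typical sustained disk transfer rate from the slide
data_size_mb = 100 * 1000      # 100 GB, expressed in MB (decimal units assumed)

seconds = data_size_mb / disk_rate_mb_per_s
minutes = seconds / 60
print(f"{minutes:.1f} minutes")  # approx 22 minutes, as the slide states
```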
  • 6. Hadoop’s History
    §  Hadoop is based on work done by Google in the early 2000s
      –  Specifically, on papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
    §  This work takes a radically new approach to the problem of distributed computing
      –  Meets all the requirements we have for reliability, scalability, etc.
    §  Core concept: distribute the data as it is initially stored in the system
      –  Individual nodes can work on data local to those nodes
      –  No data transfer over the network is required for initial processing
  • 7. Core Hadoop Concepts
    §  Applications are written in high-level code
      –  Developers do not worry about network programming, temporal dependencies, etc.
    §  Nodes talk to each other as little as possible
      –  Developers should not write code which communicates between nodes
      –  ‘Shared nothing’ architecture
    §  Data is spread among machines in advance
      –  Computation happens where the data is stored, wherever possible
      –  Data is replicated multiple times on the system for increased availability and reliability
  • 8. Hadoop Components
    §  Hadoop consists of two core components
      –  The Hadoop Distributed File System (HDFS)
      –  The MapReduce software framework
    §  There are many other projects based around core Hadoop
      –  Often referred to as the ‘Hadoop Ecosystem’
      –  Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
    §  A set of machines running HDFS and MapReduce is known as a Hadoop cluster
      –  Individual machines are known as nodes
      –  A cluster can have as few as one node or as many as several thousand
      –  More nodes = better performance!
  • 9. Hadoop Components: HDFS
    §  HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
    §  Data files are split into blocks and distributed across multiple nodes in the cluster
    §  Each block is replicated multiple times
      –  The default is to replicate each block three times
      –  Replicas are stored on different nodes
      –  This ensures both reliability and availability
  • 10. Hadoop Components: MapReduce
    §  MapReduce is the system used to process data in the Hadoop cluster
    §  It consists of two phases: Map, and then Reduce
    §  Each Map task operates on a discrete portion of the overall dataset
      –  Typically one HDFS data block
    §  After all Maps are complete, the MapReduce system distributes the intermediate data to nodes which perform the Reduce phase
  • 11. MapReduce Example: Word Count
    §  Count the number of occurrences of each word in a large amount of input data

        map(input_key, input_value)
          foreach word w in input_value:
            emit(w, 1)

    §  Input to the Mapper
        (3414, 'the cat sat on the mat')
        (3437, 'the aardvark sat on the sofa')
    §  Output from the Mapper
        ('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1),
        ('the', 1), ('aardvark', 1), ('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
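The slide's pseudocode translates directly into runnable Python. In a real Hadoop job this logic would live in a Mapper class (Java) or a streaming script; here it is a plain function so it is easy to test on its own.

```python
# Runnable Python version of the slide's mapper pseudocode:
# emit (word, 1) for every word in the input line.
def map_words(input_key, input_value):
    return [(word, 1) for word in input_value.split()]

pairs = map_words(3414, 'the cat sat on the mat')
print(pairs)
# [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]
```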
  • 12. Map Phase
    Mapper Input:
        The cat sat on the mat
        The aardvark sat on the sofa
    Mapper Output:
        (The, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1)
        (The, 1) (aardvark, 1) (sat, 1) (on, 1) (the, 1) (sofa, 1)
  • 13. MapReduce: The Reducer
    §  After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
    §  This list is given to a Reducer
      –  There may be a single Reducer, or multiple Reducers
      –  All values associated with a particular intermediate key are guaranteed to go to the same Reducer
      –  The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
      –  This step is known as the ‘shuffle and sort’
    §  The Reducer outputs zero or more final key/value pairs
      –  These are written to HDFS
      –  In practice, the Reducer usually emits a single key/value pair for each input key
  • 14. Example Reducer: Sum Reducer
    §  Add up all the values associated with each intermediate key:

        reduce(output_key, intermediate_vals)
          set count = 0
          foreach v in intermediate_vals:
            count += v
          emit(output_key, count)

    §  Reducer input and output:
        aardvark, 1         ->  aardvark, 1
        cat, 1              ->  cat, 1
        mat, 1              ->  mat, 1
        on, [1, 1]          ->  on, 2
        sat, [1, 1]         ->  sat, 2
        sofa, 1             ->  sofa, 1
        the, [1, 1, 1, 1]   ->  the, 4
    §  Reducer output:
        ('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)
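As with the mapper, the sum reducer pseudocode maps one-to-one onto runnable Python:

```python
# Runnable Python version of the slide's sum reducer:
# sum all the values associated with one intermediate key.
def reduce_sum(output_key, intermediate_vals):
    count = 0
    for v in intermediate_vals:
        count += v
    return (output_key, count)

print(reduce_sum('the', [1, 1, 1, 1]))  # ('the', 4)
```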
  • 15. Putting It All Together
    The overall word count process:
    §  Mapper Input:
        The cat sat on the mat
        The aardvark sat on the sofa
    §  Mapping:
        (The, 1) (cat, 1) (sat, 1) (on, 1) (the, 1) (mat, 1)
        (The, 1) (aardvark, 1) (sat, 1) (on, 1) (the, 1) (sofa, 1)
    §  Shuffling:
        aardvark, 1 / cat, 1 / mat, 1 / on, [1, 1] / sat, [1, 1] / sofa, 1 / the, [1, 1, 1, 1]
    §  Reducing (Final Result):
        aardvark, 1 / cat, 1 / mat, 1 / on, 2 / sat, 2 / sofa, 1 / the, 4
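The whole flow can be simulated end to end on a single machine. Real Hadoop distributes each phase across nodes, but the per-step logic is the same. Note that the slide's final result folds "The" into "the"; this sketch makes that explicit by lowercasing the input.

```python
from collections import defaultdict

# Single-machine simulation of the slides' word count flow:
# map, then shuffle and sort, then reduce.
def word_count(lines):
    # Map phase: emit (word, 1) pairs (lowercased, matching the slide's result)
    pairs = [(w.lower(), 1) for line in lines for w in line.split()]
    # Shuffle and sort: group values by key, visit keys in sorted order
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: sum the value list for each key
    return [(key, sum(groups[key])) for key in sorted(groups)]

result = word_count(['The cat sat on the mat',
                     'The aardvark sat on the sofa'])
print(result)
# [('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2), ('sat', 2), ('sofa', 1), ('the', 4)]
```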
  • 16. Why Do We Care About Counting Words?
    §  Word count is challenging over massive amounts of data
      –  Using a single compute node would be too time-consuming
      –  Using distributed nodes requires moving data
      –  The number of unique words can easily exceed the available RAM
      –  Would need a hash table on disk
      –  Would need to partition the results (sort and shuffle)
    §  The fundamentals of statistics are often simple aggregate functions
    §  Most aggregation functions are distributive in nature
      –  e.g., max, min, sum, count
    §  MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
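The "distributive" property is what makes these aggregates parallelize cleanly: each partition can be aggregated where the data lives, and combining the partial results gives the same answer as a single global pass. A minimal sketch (the data and three-way split are invented for illustration):

```python
# Why distributive aggregates (sum, max, min, count) parallelize well:
# aggregate each partition independently, then combine the partials.
data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
partitions = [data[0:4], data[4:7], data[7:10]]  # as if spread over 3 nodes

partial_sums = [sum(p) for p in partitions]    # computed "where the data is"
partial_maxes = [max(p) for p in partitions]

assert sum(partial_sums) == sum(data)    # combining partials == global sum
assert max(partial_maxes) == max(data)   # same for max
print(sum(partial_sums), max(partial_maxes))  # 39 9
```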
  • 17. What is Common Across Hadoop-able Problems?
    §  Nature of the data
      –  Complex data
      –  Multiple data sources
      –  Lots of it
    §  Nature of the analysis
      –  Batch processing
      –  Parallel execution
      –  Spread data over a cluster of servers and take the computation to the data
  • 18. Where Does Data Come From?
    §  Science
      –  Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
    §  Industry
      –  Financial, pharmaceutical, manufacturing, insurance, online, energy and retail data
    §  Legacy
      –  Sales data, customer behavior, product databases, accounting data, etc.
    §  System data
      –  Log files, health and status feeds, activity streams, network messages, web analytics, intrusion detection, spam filters
  • 19. What Analysis is Possible With Hadoop?
    §  Text mining
    §  Collaborative filtering
    §  Index building
    §  Prediction models
    §  Graph creation and analysis
    §  Sentiment analysis
    §  Pattern recognition
    §  Risk assessment
  • 20. Benefits of Analyzing With Hadoop
    §  Analysis that was previously impossible or impractical becomes feasible
    §  Analysis conducted at lower cost
    §  Analysis conducted in less time
    §  Greater flexibility
    §  Linear scalability
  • 21. Facts About Apache Hadoop
    §  Open source (under the Apache license)
    §  Around 40 core Hadoop committers from roughly 10 companies
      –  Cloudera, Yahoo!, Facebook, Apple and more
    §  Hundreds of contributors writing features and fixing bugs
    §  Many related projects, applications, tools, etc.
  • 22. A large ecosystem (diagram)
  • 23. Who uses Hadoop? (diagram)
  • 24. Vendor integration (diagram)
  • 25. About Cloudera
    §  Cloudera is “The commercial Hadoop company”
    §  Founded by leading experts on Hadoop from Facebook, Google, Oracle and Yahoo
    §  Provides consulting and training services for Hadoop users
    §  Staff includes several committers to Hadoop projects
  • 26. Cloudera Software (All Open Source)
    §  Cloudera’s Distribution including Apache Hadoop (CDH)
      –  A single, easy-to-install package from the Apache Hadoop core repository
      –  Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
    §  Components
      –  Apache Hadoop
      –  Apache Hive
      –  Apache Pig
      –  Apache HBase
      –  Apache ZooKeeper
      –  Flume, Hue, Oozie and Sqoop
  • 27. A Coherent Platform (diagram: Hue and the Hue SDK, Oozie, Pig/Hive, Flume/Sqoop, HBase, ZooKeeper)
  • 28. Cloudera Software (Licensed)
    §  Cloudera Enterprise
      –  Cloudera’s Distribution including Apache Hadoop (CDH)
      –  Management applications
      –  Production support
    §  Management applications
      –  Authorization management and provisioning
      –  Resource management
      –  Integration configuration and monitoring
  • 29. Conclusion
    §  Apache Hadoop is a fast-growing data framework
    §  Cloudera’s Distribution including Apache Hadoop offers a free, cohesive platform that encapsulates:
      –  Data integration
      –  Data processing
      –  Workflow scheduling
      –  Monitoring