- 1. Introduction to Hadoop
23 Feb 2012, Sydney
Mark Fei
mark.fei@cloudera.com
Copyright © 2010-2012 Cloudera. All rights reserved. Not to be reproduced without prior written consent. 201109 01-1
- 2. Agenda
During this talk, we'll cover:
§ Why Hadoop is needed
§ The basic concepts of HDFS and MapReduce
§ What sort of problems can be solved with Hadoop
§ What other projects are included in the Hadoop ecosystem
§ What resources are available to assist in managing your Hadoop installation
- 3. Traditional Large-Scale Computation
§ Traditionally, computation has been processor-bound
  - Relatively small amounts of data
  - Significant amounts of complex processing performed on that data
§ For decades, the primary push was to increase the computing power of a single machine
  - Faster processors, more RAM
- 4. Distributed Systems
§ Moore's Law: roughly stated, processing power doubles every two years
§ Even that hasn't always proved adequate for very CPU-intensive jobs
§ Distributed systems evolved to allow developers to use multiple machines for a single job
  - MPI (Message Passing Interface)
  - PVM (Parallel Virtual Machine)
  - Condor
- 5. Data Becomes the Bottleneck
§ Getting the data to the processors becomes the bottleneck
§ Quick calculation
  - Typical disk data transfer rate: 75 MB/sec
  - Time taken to transfer 100 GB of data to the processor: approximately 22 minutes!
    - Assuming sustained reads
    - Actual time will be worse, since most servers have less than 100 GB of RAM available
§ A new approach is needed
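The arithmetic behind the 22-minute estimate above is easy to verify (a quick sketch; MB and GB are treated as decimal units here):

```python
# Back-of-the-envelope check of the transfer-time claim on this slide.
data_bytes = 100 * 1000**3           # 100 GB of data
rate_bytes_per_sec = 75 * 1000**2    # 75 MB/sec sustained disk read

seconds = data_bytes / rate_bytes_per_sec
minutes = seconds / 60
print(round(minutes, 1))  # roughly 22 minutes
```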
- 6. Hadoop's History
§ Hadoop is based on work done by Google in the early 2000s
  - Specifically, on the papers describing the Google File System (GFS), published in 2003, and MapReduce, published in 2004
§ This work takes a radically new approach to the problem of distributed computing
  - Meets all the requirements we have for reliability, scalability, etc.
§ Core concept: distribute the data as it is initially stored in the system
  - Individual nodes can work on data local to those nodes
  - No data transfer over the network is required for initial processing
- 7. Core Hadoop Concepts
§ Applications are written in high-level code
  - Developers do not worry about network programming, temporal dependencies, etc.
§ Nodes talk to each other as little as possible
  - Developers should not write code which communicates between nodes
  - "Shared nothing" architecture
§ Data is spread among machines in advance
  - Computation happens where the data is stored, wherever possible
  - Data is replicated multiple times on the system for increased availability and reliability
- 8. Hadoop Components
§ Hadoop consists of two core components
  - The Hadoop Distributed File System (HDFS)
  - The MapReduce software framework
§ There are many other projects based around core Hadoop
  - Often referred to as the "Hadoop Ecosystem"
  - Pig, Hive, HBase, Flume, Oozie, Sqoop, etc.
  - Many are discussed later in the course
§ A set of machines running HDFS and MapReduce is known as a Hadoop cluster
  - Individual machines are known as nodes
  - A cluster can have as few as one node or as many as several thousand
  - More nodes = better performance!
- 9. Hadoop Components: HDFS
§ HDFS, the Hadoop Distributed File System, is responsible for storing data on the cluster
§ Data files are split into blocks and distributed across multiple nodes in the cluster
§ Each block is replicated multiple times
  - The default is to replicate each block three times
  - Replicas are stored on different nodes
  - This ensures both reliability and availability
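The placement idea above can be illustrated with a toy sketch. This is not how HDFS actually chooses nodes (real placement is rack-aware and handled by the NameNode); the function name and round-robin strategy are invented here purely to show that each block's replicas end up on distinct nodes:

```python
import itertools

def assign_replicas(num_blocks, nodes, replication=3):
    """Toy block placement: assign each block to `replication`
    distinct nodes, chosen round-robin. Illustrative only --
    real HDFS placement is rack-aware."""
    ring = itertools.cycle(nodes)
    return {block: [next(ring) for _ in range(replication)]
            for block in range(num_blocks)}

# Four blocks spread over a five-node cluster, three replicas each
placement = assign_replicas(4, ["node1", "node2", "node3", "node4", "node5"])
for block, replicas in placement.items():
    print(block, replicas)
```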
- 10. Hadoop Components: MapReduce
§ MapReduce is the system used to process data in the Hadoop cluster
§ It consists of two phases: Map, and then Reduce
§ Each Map task operates on a discrete portion of the overall dataset
  - Typically one HDFS data block
§ After all Maps are complete, the MapReduce system distributes the intermediate data to the nodes which perform the Reduce phase
- 11. MapReduce Example: Word Count
§ Count the number of occurrences of each word in a large amount of input data

Map(input_key, input_value)
  foreach word w in input_value:
    emit(w, 1)

§ Input to the Mapper:
(3414, 'the cat sat on the mat')
(3437, 'the aardvark sat on the sofa')
§ Output from the Mapper:
('the', 1), ('cat', 1), ('sat', 1), ('on', 1),
('the', 1), ('mat', 1), ('the', 1), ('aardvark', 1),
('sat', 1), ('on', 1), ('the', 1), ('sofa', 1)
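The Mapper pseudocode above translates directly into a few lines of runnable Python (a stand-in sketch; a real Hadoop job would implement a Java Mapper class):

```python
def word_count_map(input_key, input_value):
    """Emit (word, 1) for every word in the input line, as in the
    pseudocode above. input_key (e.g. a byte offset) is unused."""
    for word in input_value.split():
        yield (word, 1)

pairs = list(word_count_map(3414, 'the cat sat on the mat'))
# → [('the', 1), ('cat', 1), ('sat', 1), ('on', 1), ('the', 1), ('mat', 1)]
```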
- 12. Map Phase
Mapper Input                    Mapper Output
The cat sat on the mat          The, 1
The aardvark sat on the sofa    cat, 1
                                sat, 1
                                on, 1
                                the, 1
                                mat, 1
                                The, 1
                                aardvark, 1
                                sat, 1
                                on, 1
                                the, 1
                                sofa, 1
- 13. MapReduce: The Reducer
§ After the Map phase is over, all the intermediate values for a given intermediate key are combined together into a list
§ This list is given to a Reducer
  - There may be a single Reducer, or multiple Reducers
  - All values associated with a particular intermediate key are guaranteed to go to the same Reducer
  - The intermediate keys, and their value lists, are passed to the Reducer in sorted key order
  - This step is known as the "shuffle and sort"
§ The Reducer outputs zero or more final key/value pairs
  - These are written to HDFS
  - In practice, the Reducer usually emits a single key/value pair for each input key
- 14. Example Reducer: Sum Reducer
§ Add up all the values associated with each intermediate key:

reduce(output_key, intermediate_vals)
  set count = 0
  foreach v in intermediate_vals:
    count += v
  emit(output_key, count)

Reducer Input        Reducer Output
aardvark, 1          aardvark, 1
cat, 1               cat, 1
mat, 1               mat, 1
on [1, 1]            on, 2
sat [1, 1]           sat, 2
sofa, 1              sofa, 1
the [1, 1, 1, 1]     the, 4

§ Reducer output:
('aardvark', 1)
('cat', 1)
('mat', 1)
('on', 2)
('sat', 2)
('sofa', 1)
('the', 4)
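The Sum Reducer pseudocode above, written as runnable Python (again a sketch standing in for the Java Reducer a real job would use):

```python
def sum_reduce(output_key, intermediate_vals):
    """The Sum Reducer from the slide: add up all the values
    associated with one intermediate key."""
    count = 0
    for v in intermediate_vals:
        count += v
    yield (output_key, count)

result = list(sum_reduce('the', [1, 1, 1, 1]))
# → [('the', 4)]
```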
- 15. Putting It All Together
The overall word count process:

Mapper Input                  Mapping        Shuffling           Reducing     Final Result
The cat sat on the mat        The, 1         aardvark, 1         aardvark, 1  aardvark, 1
The aardvark sat on the sofa  cat, 1         cat, 1              cat, 1       cat, 1
                              sat, 1         mat, 1              mat, 1       mat, 1
                              on, 1          on [1, 1]           on, 2        on, 2
                              the, 1         sat [1, 1]          sat, 2       sat, 2
                              mat, 1         sofa, 1             sofa, 1      sofa, 1
                              The, 1         the [1, 1, 1, 1]    the, 4       the, 4
                              aardvark, 1
                              sat, 1
                              on, 1
                              the, 1
                              sofa, 1
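The full pipeline above can be sketched end-to-end in one function. This is a single-process stand-in: Hadoop distributes each phase across nodes, but the data flow (map, shuffle and sort, reduce) is the same:

```python
from collections import defaultdict

def run_word_count(records):
    """Single-process sketch of the whole word count job."""
    # Map phase: emit (word, 1) for each word in each input line
    intermediate = [(word, 1)
                    for _key, line in records
                    for word in line.split()]
    # Shuffle and sort: group values by key; keys are sorted below
    groups = defaultdict(list)
    for word, one in intermediate:
        groups[word].append(one)
    # Reduce phase: sum each key's value list, in sorted key order
    return [(word, sum(vals)) for word, vals in sorted(groups.items())]

counts = run_word_count([(3414, 'the cat sat on the mat'),
                         (3437, 'the aardvark sat on the sofa')])
# → [('aardvark', 1), ('cat', 1), ('mat', 1), ('on', 2),
#    ('sat', 2), ('sofa', 1), ('the', 4)]
```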
- 16. Why Do We Care About Counting Words?
§ Word count is challenging over massive amounts of data
  - Using a single compute node would be too time-consuming
  - Using distributed nodes requires moving data
  - The number of unique words can easily exceed available RAM
    - Would need a hash table on disk
    - Would need to partition the results (sort and shuffle)
§ The fundamentals of statistics are often simple aggregate functions
§ Most aggregation functions are distributive in nature
  - e.g., max, min, sum, count
§ MapReduce breaks complex tasks down into smaller elements which can be executed in parallel
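The distributive property mentioned above is what lets each node aggregate its own partition independently: combining per-partition results gives the same answer as aggregating all the data at once. A small demonstration (the partition values are made up for illustration):

```python
# Each inner list stands for the data held by one node/partition.
partitions = [[3, 1, 4], [1, 5], [9, 2, 6]]
flat = [x for part in partitions for x in part]

# Distributive functions: combine per-partition results to get the
# global result -- no single node ever needs to see all of the data.
partial_sums = [sum(p) for p in partitions]
partial_maxes = [max(p) for p in partitions]

assert sum(partial_sums) == sum(flat)    # sum is distributive
assert max(partial_maxes) == max(flat)   # so is max
```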
- 17. What is Common Across Hadoop-able Problems?
§ Nature of the data
  - Complex data
  - Multiple data sources
  - Lots of it
§ Nature of the analysis
  - Batch processing
  - Parallel execution
  - Spread data over a cluster of servers and take the computation to the data
- 18. Where Does Data Come From?
§ Science
  - Medical imaging, sensor data, genome sequencing, weather data, satellite feeds, etc.
§ Industry
  - Financial, pharmaceutical, manufacturing, insurance, online, energy, and retail data
§ Legacy
  - Sales data, customer behavior, product databases, accounting data, etc.
§ System data
  - Log files, health and status feeds, activity streams, network messages, web analytics, intrusion detection, spam filters
- 19. What Analysis is Possible With Hadoop?
§ Text mining
§ Index building
§ Graph creation and analysis
§ Pattern recognition
§ Collaborative filtering
§ Prediction models
§ Sentiment analysis
§ Risk assessment
- 20. Benefits of Analyzing With Hadoop
§ Analysis that was previously impossible or impractical becomes feasible
§ Analysis conducted at lower cost
§ Analysis conducted in less time
§ Greater flexibility
§ Linear scalability
- 21. Facts about Apache Hadoop
§ Open source (under the Apache license)
§ Around 40 core Hadoop committers from roughly 10 companies
  - Cloudera, Yahoo!, Facebook, Apple, and more
§ Hundreds of contributors writing features and fixing bugs
§ Many related projects, applications, tools, etc.
- 22. A large ecosystem
- 23. Who uses Hadoop?
- 24. Vendor integration
- 25. About Cloudera
§ Cloudera is "The commercial Hadoop company"
§ Founded by leading experts on Hadoop from Facebook, Google, Oracle, and Yahoo!
§ Provides consulting and training services for Hadoop users
§ Staff includes several committers to Hadoop projects
- 26. Cloudera Software (All Open-Source)
§ Cloudera's Distribution including Apache Hadoop (CDH)
  - A single, easy-to-install package from the Apache Hadoop core repository
  - Includes a stable version of Hadoop, plus critical bug fixes and solid new features from the development version
§ Components
  - Apache Hadoop
  - Apache Hive
  - Apache Pig
  - Apache HBase
  - Apache ZooKeeper
  - Flume, Hue, Oozie, and Sqoop
- 27. A Coherent Platform
[Diagram: the CDH platform components working together — Hue and the Hue SDK, Oozie, Pig/Hive, Flume/Sqoop, HBase, and ZooKeeper]
- 28. Cloudera Software (Licensed)
§ Cloudera Enterprise
  - Cloudera's Distribution including Apache Hadoop (CDH)
  - Management applications
  - Production support
§ Management applications
  - Authorization management and provisioning
  - Resource management
  - Integration, configuration, and monitoring
- 29. Conclusion
§ Apache Hadoop is a fast-growing data framework
§ Cloudera's Distribution including Apache Hadoop offers a free, cohesive platform that encapsulates:
  - Data integration
  - Data processing
  - Workflow scheduling
  - Monitoring