A comprehensive overview of Hadoop operations and tools: cluster management, coordination, ingestion, streaming, formats, storage, resources, processing, workflow, analysis, search and visualization
4. What is Hadoop?
• Hadoop – open-source implementation of MapReduce (MR)
• Runs MR jobs quickly and efficiently
Goal
• Generate value from large datasets that cannot be analyzed using traditional technologies
5. Hadoop Concepts
Requirements
• Linear horizontal scalability
• Jobs run in isolation
• Simple programming model
Challenges and solutions
• Ch1: Data access bottleneck
• Sol: Store and process data on the same node
• Ch2: Distributed programming is difficult
• Sol: Use high-level language APIs
6. Hadoop Timeline
• 2003 Oct – Google File System paper released
• 2004 Dec – “MapReduce: Simplified Data Processing on Large Clusters” paper released
• 2006 – Hadoop project created at Apache
• 2007 Oct – Yahoo Labs creates Pig
• 2008 Oct – Cloudera, a Hadoop distributor, is founded
• 2010 Sep – Hive and Pig graduate to top-level Apache projects
• 2011 Jan – ZooKeeper graduates
• 2013 Mar – YARN deployed at Yahoo
• 2014 Feb – Apache Spark becomes a top-level Apache project
9. Storage / HDFS
• “Hadoop Distributed File System”
• Design:
• Write once – read many times access pattern
• Runs on cheap commodity hardware
• Optimized for high throughput rather than low-latency data access
• Concepts:
• Block – files are split into 128 MB blocks, replicated 3 times by default
• NameNode (Master) – one per cluster – holds the file system namespace and block mapping (single point of failure)
• DataNode (Worker) – one per node – stores and retrieves blocks
• Functions:
• High availability – run a standby NameNode
• Block caching – a block is normally cached in the memory of only one DataNode
• Locality – rack-aware block placement based on network topology
• File permissions – POSIX-like – r/w/x for owner/group/other, on files and directories
• Interfaces – HTTP (proxy or direct), Java API
• Cluster balance – spread blocks evenly across the cluster
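The block concept above can be sketched in a few lines. This is a minimal illustration, not HDFS code: it only computes how a file would split into 128 MB blocks and round-robins replicas across hypothetical DataNode names (real HDFS placement is rack-aware).

```python
# Toy sketch (not the HDFS implementation): split a file into fixed-size
# blocks and assign each block's replicas across DataNodes.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the HDFS default
REPLICATION = 3                 # default replication factor

def plan_blocks(file_size, datanodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, [replica_nodes]) placements."""
    replication = min(replication, len(datanodes))  # can't exceed node count
    n_blocks = -(-file_size // block_size)          # ceiling division
    plan = []
    for i in range(n_blocks):
        # Simple round-robin placement; real HDFS considers racks and load.
        replicas = [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
        plan.append((i, replicas))
    return plan

plan = plan_blocks(300 * 1024 * 1024, ["dn1", "dn2", "dn3", "dn4"])
# 300 MB -> 3 blocks (128 + 128 + 44 MB), each placed on 3 DataNodes
```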
10. [Diagram: HDFS file distribution – a file’s Blocks 1–3 are each replicated on three DataNodes spread across Rack 1 and Rack 2; the NameNode tracks block locations, and the client accesses the file through an HDFS proxy.]
12. Resource Management / YARN
• “Yet Another Resource Negotiator”
• Manages and schedules cluster resources
• Daemons:
• ResourceManager – per cluster – manages resources across the cluster
• NodeManager – per node – launches and monitors containers
• Container – executes an application process
• Resource requests for containers:
• Amount of compute resources (CPU & memory)
• Locality (node/rack)
• Lifespan: one application per user job, or long-running applications shared by users
• Scheduling:
• Allocate resources by policy: FIFO, Capacity (per-organisation queues), Fair
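To make the scheduling bullet concrete, here is a toy FIFO scheduler: requests are granted strictly in arrival order, on the first node with enough free CPU and memory. All names (nodes, apps, capacities) are invented for illustration; YARN's real schedulers are far more elaborate.

```python
# Toy FIFO container scheduler (illustrative, not YARN's implementation).
class Node:
    def __init__(self, name, vcores, mem_mb):
        self.name, self.vcores, self.mem_mb = name, vcores, mem_mb

def fifo_schedule(requests, nodes):
    """requests: list of (app, vcores, mem_mb); returns [(app, node_name)]."""
    granted = []
    for app, vcores, mem_mb in requests:
        for node in nodes:
            if node.vcores >= vcores and node.mem_mb >= mem_mb:
                node.vcores -= vcores      # reserve the container's resources
                node.mem_mb -= mem_mb
                granted.append((app, node.name))
                break                      # first fitting node wins
    return granted

nodes = [Node("nm1", 4, 8192), Node("nm2", 4, 8192)]
reqs = [("app1", 2, 4096), ("app2", 4, 8192), ("app3", 2, 4096)]
granted = fifo_schedule(reqs, nodes)
# app1 fits on nm1; app2 needs a whole node, so it lands on nm2;
# app3 fits in nm1's remaining capacity.
```

A Capacity or Fair scheduler would instead divide the same pool between queues or running applications rather than serving requests strictly in order.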
19. Storage / HBase
• Distributed column-oriented database on top of HDFS
• Real-time read/write random access to large datasets
• Region – tables are split into regions by row ranges
• Phoenix – SQL on HBase
• HBase data model: RowKey → column families → columns → versioned data
• e.g. RowKey | Column Family 1 (Col 1.1: versioned data, Col 1.2, Col 1.3: versioned data) | Column Family 2
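The data model above is essentially a sorted, nested map. A minimal sketch in plain Python (row key, family, qualifier and values are all made-up examples, and real HBase keeps cells sorted and bounded by a version limit):

```python
# Toy sketch of the HBase data model:
# table -> row key -> column family -> column qualifier -> {timestamp: value}
table = {}

def put(row, family, qualifier, value, ts):
    """Write a cell; older versions of the same cell are kept."""
    cell = table.setdefault(row, {}).setdefault(family, {}).setdefault(qualifier, {})
    cell[ts] = value

def get_latest(row, family, qualifier):
    """Read a cell; the newest timestamp wins."""
    versions = table[row][family][qualifier]
    return versions[max(versions)]

put("user#42", "cf1", "name", "Ada", ts=1)
put("user#42", "cf1", "name", "Ada L.", ts=2)   # new version; old one kept
put("user#42", "cf2", "email", "ada@example.com", ts=1)
latest = get_latest("user#42", "cf1", "name")   # "Ada L."
```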
21. Coordination / ZooKeeper
• Hadoop’s distributed coordination service
• Coordinates read/write actions on shared data
• Highly available filesystem-like data store
• Implementation:
• Data model:
• Tree built from znodes (each holds up to 1 MB of data)
• Znode – data, change notifications, ACL (access control list)
• Leader – performs writes and broadcasts updates
• Follower – forwards write requests to the leader
• Lock service
• User groups
• Replicated mode
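The znode tree is easy to picture with a toy in-memory version. This sketch only models the hierarchical namespace and the 1 MB data limit; real ZooKeeper replicates this tree across an ensemble, with the leader serializing all writes. Paths and data are illustrative.

```python
# Toy znode tree sketch (illustrative; not the ZooKeeper implementation).
ZNODE_LIMIT = 1024 * 1024  # each znode holds at most 1 MB of data

class ZNode:
    def __init__(self, data=b""):
        assert len(data) <= ZNODE_LIMIT
        self.data = data
        self.children = {}

root = ZNode()

def create(path, data=b""):
    """Create a znode; parent znodes must already exist (as in ZooKeeper)."""
    node = root
    parts = path.strip("/").split("/")
    for part in parts[:-1]:
        node = node.children[part]
    node.children[parts[-1]] = ZNode(data)

def get(path):
    node = root
    for part in path.strip("/").split("/"):
        node = node.children[part]
    return node.data

create("/locks")
create("/locks/job-1", b"owner=worker-7")
data = get("/locks/job-1")  # b"owner=worker-7"
```

A lock service is built on exactly this: clients race to create a well-known znode, and whoever succeeds holds the lock.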
24. Row-Based / Avro
• Language-neutral data serialization system
• Shares data in a common format across many programming languages
• Splittable and sortable – enables easy MapReduce processing
• Rich schema resolution – flexible schema evolution
• Other row-based formats
• SequenceFile – logfile format
• MapFile – sorted SequenceFile
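Schema resolution is Avro's standout feature: data written with an old (writer) schema can be read with a newer (reader) schema, with missing fields filled from the reader's defaults. A minimal sketch of the idea, not the Avro library itself (field names and defaults are invented):

```python
# Toy sketch of Avro-style schema resolution (illustrative, not Avro's API).
def resolve(record, reader_schema):
    """reader_schema: list of (field_name, default).
    Fields absent from the written record take the reader's default;
    fields the reader doesn't know about are dropped."""
    return {name: record.get(name, default) for name, default in reader_schema}

written = {"id": 7, "name": "Ada"}  # written with an older schema: id, name
reader_schema = [("id", None), ("name", ""), ("email", "unknown")]  # newer schema

resolved = resolve(written, reader_schema)
# -> {"id": 7, "name": "Ada", "email": "unknown"}
```

Real Avro resolves by matching the full writer and reader schemas (including types and aliases), but the default-filling behavior is the same.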
33. Data Integration / Streaming
• Stream processing
• Kafka Streams – process and analyze data stored in Kafka
• Storm – real-time computation
• Spark Streaming – processes live data and can apply Spark MLlib and GraphX
[Diagram: Flume Agents 1 and 2 feed data into Kafka Topics A and B; Spark Streaming and Storm consume the topics and write results to HDFS.]
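A common stream-processing pattern across these engines is counting events per key over fixed time windows. A minimal sketch of tumbling-window counting, using invented event data and without any of the Kafka/Storm/Spark APIs:

```python
# Toy tumbling-window counter (illustrative; a micro-batch engine like
# Spark Streaming computes something similar per batch interval).
from collections import Counter

def tumbling_counts(events, window_s):
    """events: iterable of (timestamp_s, key).
    Yields (window_start, Counter of keys) per window."""
    windows = {}
    for ts, key in events:
        start = (ts // window_s) * window_s  # bucket by window start time
        windows.setdefault(start, Counter())[key] += 1
    for start in sorted(windows):
        yield start, windows[start]

events = [(1, "click"), (2, "click"), (3, "view"), (11, "click")]
result = list(tumbling_counts(events, window_s=10))
# window starting at 0 holds {"click": 2, "view": 1}; window 10 holds {"click": 1}
```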
37. Scripting / Pig
• Data-flow programming language – a MapReduce abstraction
• Supports: user-defined functions (UDFs), streaming, nested data
• Does not support: random read/write
• Pig Latin – the scripting language
• LOAD, STORE, FILTER, GROUP, JOIN, ORDER, UNION and SPLIT, UDFs, COGROUP
• Modes
• Local – small datasets
• MR mode – run on a cluster
• Execution – script, Grunt (shell), embedded (Java)
• Parameter substitution – run a script with different parameters
• Similar
• Crunch – MR pipelines in Java (no UDFs)
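To show what "data-flow programming" means, here is the shape of a typical Pig Latin pipeline (LOAD → FILTER → GROUP → aggregate) expressed in plain Python on invented data. Pig would compile these same steps into MapReduce jobs over HDFS files.

```python
# Toy Pig-style dataflow (illustrative; not Pig Latin, which Pig compiles
# into MapReduce jobs).
from itertools import groupby
from operator import itemgetter

records = [("alice", 3), ("bob", 7), ("alice", 5), ("bob", 1)]   # LOAD

filtered = [r for r in records if r[1] > 2]                      # FILTER score > 2
filtered.sort(key=itemgetter(0))                                 # ORDER BY name
grouped = {name: [score for _, score in group]                   # GROUP BY name
           for name, group in groupby(filtered, key=itemgetter(0))}
totals = {name: sum(scores) for name, scores in grouped.items()} # SUM per group
# totals -> {"alice": 8, "bob": 7}
```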
41. Workflow / Oozie
• Schedules Hadoop jobs
• Job types:
• Workflows – a sequence of jobs expressed as directed acyclic graphs (DAGs)
• Coordinators – trigger jobs by time or data availability
[Diagram: an example Oozie workflow with start and end nodes, a Sqoop action, a fork into Pig, MR and a sub-workflow, a join, an FS (HDFS) action and an Email action; control-flow nodes are distinguished from action nodes.]
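Executing such a workflow means running actions in an order that respects the DAG's dependencies. A toy executor, with action names borrowed from the diagram but the dependency map invented for illustration (real Oozie submits Hadoop jobs and tracks their state):

```python
# Toy Oozie-style workflow executor: topological ordering of a DAG.
def run_workflow(deps):
    """deps: {action: set of prerequisite actions}; returns execution order."""
    order, done = [], set()
    pending = dict(deps)
    while pending:
        # An action is ready once all its prerequisites have run.
        ready = [a for a, reqs in pending.items() if reqs <= done]
        if not ready:
            raise ValueError("cycle in workflow DAG")
        for action in sorted(ready):  # sorted for a deterministic order
            order.append(action)
            done.add(action)
            del pending[action]
    return order

wf = {"sqoop": set(), "pig": {"sqoop"}, "mr": {"sqoop"}, "join": {"pig", "mr"}}
order = run_workflow(wf)
# -> ["sqoop", "mr", "pig", "join"]: sqoop first, the fork's two branches
#    next (either order is valid), the join last
```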
47. Cluster Management / Cloudera
• 100% open source
• The most complete and tested distribution of Hadoop
• Integrates all the Hadoop projects
• Express – free, end-to-end administration
• Enterprise – extra features and support
HDFS manages the file system across a network of machines.
Designed to store big files.
Master-worker pattern.
The NameNode maintains the directory tree; it does not persist block locations, but reconstructs them from DataNode reports after a reboot.
The NameNode is the most critical component in the cluster: if it is lost, access to the entire cluster is lost, so high availability can be configured by running a standby NameNode.
Designed to support MapReduce, but also used for other workloads.