5. “Big Data” Definition
HADOOP
“Big data is high volume, high velocity, and/or
high variety information assets that require new
forms of processing to enable enhanced decision
making, insight discovery and process optimization”
(Gartner, 2012)
6. Problem & Solutions
Storage
• HDFS, NoSQL databases, Google File System (GFS),
Bigtable
Data Processing
• MapReduce; stream processing (Storm);
Dremel / Drill
7. Big Data Examples
Twitter
- Data in 2010: “1 trillion tweets. Today, we are seeing
50 million tweets per day”.
- Platform: Hadoop, Pig, Protocol Buffers
- Type of applications: Analysis, People search …
See full: http://goo.gl/y7rEw7
8. Big Data Examples
Facebook data warehouse
- Data: 200 GB per day in March 2008;
12+ TB (compressed) of raw data per day in 2010.
- Platform: Hadoop, Hive
- Type of applications: Reporting, Analysis, Machine
learning…
See full: http://goo.gl/XUHD9k
12. HDFS High Level
HDFS is a distributed file system designed to store
very large files with streaming data-access patterns,
running on clusters of commodity hardware.
14. HDFS Block
In a file system, a block is the minimum amount of data
that can be read or written.
Local File system | HDFS
normally 512 bytes | 64 MB by default
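The difference in block size can be made concrete with a back-of-the-envelope calculation. This is only a sketch: the 64 MB figure is Hadoop's classic default and is configurable per cluster, and the helper function below is illustrative, not part of any Hadoop API.

```python
import math

LOCAL_BLOCK = 512          # bytes, typical local file system block
HDFS_BLOCK = 64 * 1024**2  # 64 MB, Hadoop's historical default

def blocks_needed(file_size, block_size):
    """Number of blocks required to store file_size bytes."""
    return math.ceil(file_size / block_size)

one_gb = 1024**3
print(blocks_needed(one_gb, LOCAL_BLOCK))  # 2097152 local blocks
print(blocks_needed(one_gb, HDFS_BLOCK))   # 16 HDFS blocks
```

The large block size keeps per-file metadata small and favors long sequential reads, which matches the streaming access pattern HDFS is designed for.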
16. HDFS Archives
Hadoop Archives (HAR) are a file archiving facility that
packs files into HDFS blocks more efficiently, thereby
reducing namenode memory usage.
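Why archiving helps can be estimated with a rough calculation. This sketch assumes the common rule of thumb of roughly 150 bytes of namenode heap per file system object (file or block); the exact figure varies, and the function is hypothetical, purely for illustration.

```python
# Back-of-the-envelope: why many small files strain the namenode.
# ~150 bytes per namenode object is a rule of thumb, not an exact figure.
OBJ_BYTES = 150

def namenode_bytes(num_files, blocks_per_file=1):
    # each file costs one file object plus one object per block
    return num_files * OBJ_BYTES * (1 + blocks_per_file)

million_small_files = namenode_bytes(1_000_000)       # one block each
one_archive = namenode_bytes(1, blocks_per_file=16)   # one HAR, 16 blocks
print(million_small_files, one_archive)
```

A million tiny files cost on the order of hundreds of megabytes of namenode memory, while the same data packed into a single archive costs a few kilobytes, which is the motivation behind HAR.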
19. MapReduce
A programming model and an associated implementation for
processing and generating large data sets that hides the
messy details of parallelization, fault tolerance, data
distribution, and load balancing.
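The model boils down to a user-supplied map function, a framework-managed shuffle, and a user-supplied reduce function. A minimal in-memory sketch of word count (real Hadoop jobs are written against the Java MapReduce API and run distributed; this only mirrors the programming model):

```python
from collections import defaultdict

def map_phase(doc):
    # emit (word, 1) for every word in the document
    for word in doc.split():
        yield word.lower(), 1

def shuffle(pairs):
    # group values by key -- the framework does this between phases
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # combine all values emitted for one key
    return key, sum(values)

docs = ["Hadoop stores big data", "MapReduce processes big data"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts["big"], counts["data"])  # 2 2
```

Because map and reduce are side-effect-free functions over key-value pairs, the framework is free to run them on many machines in parallel and to re-execute them on failure.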