1. Copyright © 2014 Big Data Partnership Ltd. All rights reserved.
Apache Hadoop Masterclass
An introduction to concepts & ecosystem for all audiences
Agenda
1. What is Hadoop? -- Why is it so popular?
2. Hardware
3. Ecosystem Tools
4. Summary & Recap
1. What is Hadoop?
How is it different? Why is it so popular?
1.1 First, a question…
Conceptual design exercise
What is Big Data?
Why people are turning to Hadoop
[Chart: Data Complexity (Variety & Velocity) vs. Data Size (MB → GB → TB → PB). The "Big Data" region is driven by newer sources: log files, spatial & GPS coordinates, data market feeds, eGov feeds, weather, text/image, click stream, wikis/blogs, sensors/RFID/devices, social sentiment, audio/video, Web 2.0. These sit on top of established ones: web logs, digital marketing, search marketing, recommendations, advertising, mobile, collaboration, eCommerce, ERP/CRM, payables, payroll, inventory, contacts, deal tracking, sales pipeline.]
Problem:
Legacy database, experiencing huge growth in data volumes
[Diagram: scaling UP a large application database or data warehouse. As data volume grows from TB toward PB, cost per GB climbs steeply (€ → €€ → €€€ → €€€€ / GB) while performance degrades.]
How would you re-design from the basics?
Scale OUT
[Diagram: the dataset is split into shards of 1/x each across many nodes.]
✓ Commodity hardware
✓ Low, predictable cost per unit
How do we store data?
Requirements: less cost, less complexity, less risk; more scalable, more reliable, more flexible
[Diagram: key ranges 0-5, 6-10, 11-15, 16-20, 21-25 are distributed, with duplicates, across the nodes.]
How do we query data?
[Diagram: the same query Q runs on every node in parallel; each node costs only €.]
✓ Distributed Filesystem
✓ Distributed Computation
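The scale-out idea above, each node holding 1/x of the data with queries running locally on every shard, can be sketched in a few lines. This is a minimal illustration, not Hadoop code; the node count and hash-based placement rule are assumptions made for the example:

```python
# Sketch of scale-out sharding: records are spread across nodes by key hash,
# and a query runs locally on every shard, with results combined at the end.

NUM_NODES = 5  # illustrative cluster size


def shard_for(key: int) -> int:
    """Pick the node that stores this key (simple hash placement)."""
    return key % NUM_NODES


# Distribute records 0..24 across the nodes: each node holds 1/x of the data.
nodes = {n: [] for n in range(NUM_NODES)}
for record in range(25):
    nodes[shard_for(record)].append(record)


def query(predicate):
    """Run the same query on each shard locally, then merge the partials."""
    partials = [[r for r in shard if predicate(r)] for shard in nodes.values()]
    return sorted(x for part in partials for x in part)


evens = query(lambda r: r % 2 == 0)
```

In a real cluster each shard's scan would run in parallel on its own node; here the loop stands in for that.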
Engineering for efficiency & reliability
What issues exist for coordinating between nodes?
▪ Limited network bandwidth ⇒ Do more locally before crossing the network
▪ Data loss/re-transmission ⇒ Use reliable transfers
▪ Server death ⇒ Make servers repeatable elsewhere
▪ Failover scenarios ⇒ Make nodes identity-less (any other can take over); job co-ordination
These are our system requirements
Engineering for efficiency & reliability
What issues exist within each node?
▪ Hardware MTBF (Mean Time Between Failure) ⇒ Expect failure: nodes have no identity, data is replicated
▪ Software crash ⇒ as above
▪ Disk seek time ⇒ Avoid random seeks, read sequentially (10x faster than seeking)
▪ Disk contention ⇒ Use multiple disks in parallel, with multiple cores
These are our system requirements
Engineering for flexibility
What issues exist for a query interface?
▪ Schema not obvious up front ⇒ Support both schema and schema-less operations
▪ Schema changes after implementation ⇒ as above + support sparse/columnar storage
▪ Language may not suit all users ⇒ Plug-in query framework and/or programming API
▪ Language may not support advanced use cases ⇒ Generic programming API
These are our system requirements
To sum up:
▪ Original assumptions of legacy, traditional enterprise systems:
▪ Hardware can be made reliable by spending more on it
▪ Machines each have a unique identity
▪ Datasets must be centralised
▪ New paradigm for data management with Hadoop:
▪ Distribute + duplicate chunks of each data file across many nodes
▪ Use local compute resource to process each chunk in parallel
▪ Minimise elapsed seek time
▪ Expect failure
▪ Handle failover elegantly, and automatically
1.2 What is Hadoop?
Architecture Overview
What is Hadoop?
Logical Architecture
▪ Storage: HDFS (Hadoop Distributed Filesystem)
▪ Computation: MapReduce
▪ Around the core: Data Management, Data Analysis and User Tools layers
▪ Core distribution = HDFS + MapReduce: a distributed system of many nodes joined by a fast, private network
What is Hadoop?
Physical Architecture
▪ Worker nodes each run: DataNode (storage), TaskTracker (computation), RegionServer (HBase)
▪ Master nodes: NameNode (HDFS), JobTracker (MapReduce), HMaster (HBase)
▪ Clients access the cluster through user tools and client libraries
1.3 HDFS
Distributed Storage
Storage: writing data
[Diagram: six 8 TB nodes pooled into a single 48 TB HDFS filesystem; written files are split into chunks across the nodes.]
Storage: reading data
[Diagram: the same 48 TB HDFS pool; reads are served in parallel from the six 8 TB nodes.]
1.4 MapReduce
Distributed Computation
Computation
[Diagram: MapReduce ships a function F(x) to each node of the 48 TB HDFS pool, so computation runs locally next to the data.]
Computation: MapReduce
High-level MapReduce process
Input file (split into chunks) → Map → Shuffle & Sort → Reduce → Output files (one per Reducer)
Computation: MapReduce
High-level MapReduce process
Worked example (word count):

Input (a text fragment): "Then there were two cousins laid up; when the one should be lamed with reasons and the other mad without any. But is all this for your father? No, some of it is for my child's father. O, how full of briers is this working day world! They are but burs, cousin, thrown upon thee in holiday foolery: if we walk not in the trodden paths our very …"

Map: each mapper emits (word, 1) pairs:
the,1 other,1 … were,1 the,1 … this,1 some,1 … father,1 this,1 … they,1 upon,1 … the,1 paths,1 …

Shuffle & Sort: pairs are grouped and sorted by key:
father,{1} other,{1} paths,{1} some,{1} the,{1,1,1} they,{1} this,{1,1} upon,{1} were,{1}

Reduce: each group's values are summed, giving the Output:
father,1 other,1 paths,1 some,1 the,3 they,1 this,2 upon,1 were,1
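The three phases above map directly onto a few lines of Python. This is a local simulation of Map, Shuffle & Sort, and Reduce, not actual Hadoop API code:

```python
from collections import defaultdict


def map_phase(text):
    """Map: emit a (word, 1) pair for every word in the input split."""
    return [(word.lower().strip(".,;:!?"), 1) for word in text.split()]


def shuffle_sort(pairs):
    """Shuffle & sort: group values by key, e.g. the -> [1, 1, 1]."""
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: sum each group's values, e.g. the,{1,1,1} -> the,3."""
    return {key: sum(values) for key, values in groups.items()}


counts = reduce_phase(shuffle_sort(map_phase(
    "Then there were two cousins laid up; when the one should be lamed "
    "with reasons and the other mad without any.")))
```

On a cluster, many mappers run `map_phase` in parallel on different chunks, and the framework performs the shuffle across the network before the reducers sum the groups.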
1.5 Key Concepts
Points to Remember
Linear Scalability
[Diagram/table: growing the cluster to 150% of its size (+50% nodes) grows HDFS storage capacity and MapReduce compute capacity by the same +50%, while compute time for a fixed workload falls proportionally; hardware €, software €, staffing €, total € and € / TB all scale predictably.]
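Linear scalability is easy to state numerically. A minimal sketch, with illustrative per-node figures rather than vendor specifications:

```python
# Linear scalability: capacity grows in direct proportion to node count,
# and time for a fixed workload shrinks in the same proportion.

STORAGE_PER_NODE_TB = 8    # illustrative
THROUGHPUT_PER_NODE = 1.0  # relative compute units, illustrative


def cluster_profile(nodes, workload=600.0):
    """Storage, compute capacity and job time for a cluster of `nodes` machines."""
    return {
        "storage_tb": nodes * STORAGE_PER_NODE_TB,
        "compute_capacity": nodes * THROUGHPUT_PER_NODE,
        "compute_time": workload / (nodes * THROUGHPUT_PER_NODE),
    }


small = cluster_profile(6)   # baseline cluster
bigger = cluster_profile(9)  # +50% nodes

# +50% nodes gives +50% storage, +50% compute capacity,
# and the same workload finishes in 2/3 of the time.
```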
HDFS for Storage
What do I need to remember?
Functional Qualities:
▪ Store many files into a large pooled filesystem
▪ Familiar interface: standard UNIX filesystem
▪ Organise by any arbitrary naming scheme: files/directories
Non-Functional Qualities:
▪ Files can be very very large (>> single disk)
▪ Distributed, fault-tolerant, reliable – replicas of chunks are
stored
▪ High aggregate throughput: parallel reads/writes
▪ Scalable
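The "replicas of chunks" idea can be sketched as follows. Block size, replication factor and the round-robin placement rule are simplified stand-ins for HDFS's real defaults and placement policy:

```python
# Sketch of HDFS-style storage: a large file is cut into fixed-size blocks,
# and each block is replicated onto several distinct nodes.

BLOCK_SIZE = 64  # MB, illustrative (HDFS defaults have varied by version)
REPLICATION = 3
NODES = ["node%d" % i for i in range(6)]


def split_into_blocks(file_size_mb):
    """Number of blocks needed for a file of the given size (ceiling division)."""
    return (file_size_mb + BLOCK_SIZE - 1) // BLOCK_SIZE


def place_replicas(block_id):
    """Choose REPLICATION distinct nodes for a block (toy round-robin placement)."""
    return [NODES[(block_id + r) % len(NODES)] for r in range(REPLICATION)]


n_blocks = split_into_blocks(200)  # a 200 MB file
placement = {b: place_replicas(b) for b in range(n_blocks)}
```

Because every block lives on several nodes, losing one node loses no data and reads can be served from any surviving replica.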
Computation: Not only MapReduce
▪ With Hadoop 2.x, Hadoop introduced YARN
▪ Enhanced MapReduce:
▪ Makes MapReduce just one of many pluggable framework
modules
▪ Existing MapReduce programs continue to run as before
▪ Previous MapReduce API versions run side-by-side
▪ Takes us way beyond MapReduce:
▪ Any new programming model can now be plugged into a
running Hadoop cluster on demand
▪ Including: MPI, BSP, graph, others …
1.6 Why is it so popular?
Key Differentiators
Why is Hadoop so popular?
The story so far
1. Scalability
▪ Democratisation - the ability to store and manage petabytes
of files and analyse them across thousands of nodes has
never been available to the general population before
▪ Why should the HPC world care?
More users = more problems solved for you by others
▪ Predictable scalability curve
2. Cost
▪ It can be done on cheap hardware, with minimal staff
▪ Free software
3. Ease of use
▪ Familiar Filesystem interface
▪ Simple programming model
Why is Hadoop so popular?
▪ What about flexibility?
▪ How does it compare to a traditional database system?
▪ Tables consisting of rows, columns => 2D
▪ Foreign keys linking tables together
▪ One-to-one, one-to-many, many-to-many relations
Traditional Database Approach
▪ Data model tightly coupled to storage
▪ Have to model like this up-front before anything can be inserted
▪ Only choice of model
▪ May not be appropriate for your application
▪ We've grown up with this as the normal general-purpose approach, but is it really?
[Diagram: Customers → Orders → Order Lines. SCHEMA ON WRITE]
Traditional database approach
Sometimes it doesn’t fit
▪ What if we want to store graph data?
▪ Network of nodes & edges
▪ e.g. social graph
▪ Tricky!
▪ Need to store lots of self-references to far away rows
▪ Slow performance
▪ What if we need changing attributes for different node types?
▪ Loss of intuitive schema design
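The awkwardness of self-referencing rows disappears if the graph is stored natively. A minimal adjacency-list sketch; the people and edges are invented for illustration:

```python
# A social graph as an adjacency list: no foreign keys and no self-joins,
# since a node's neighbours are reached by direct lookup.

graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
}


def friends_of_friends(person):
    """People two hops away, excluding the person and their direct friends."""
    direct = graph[person]
    two_hops = set()
    for friend in direct:
        two_hops |= graph[friend]
    return two_hops - direct - {person}
```

The same two-hop query in a relational model needs a self-join of the edge table against itself, which is exactly the "self-references to far away rows" problem above.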
Traditional database approach
Sometimes it doesn’t fit
▪ What if we want to store a sparse matrix?
▪ To avoid a very complicated schema, would need to store a
table with NULL for all empty matrix cells
▪ Lots of NULLs
▪ Waste memory + disk space
▪ Very poor performance
▪ Loss of intuitive schema design
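A dictionary-of-keys representation shows why sparse storage beats a table full of NULLs: only the occupied cells cost anything. Dimensions and values here are illustrative:

```python
# Sparse matrix as a dict of (row, col) -> value. Empty cells are simply
# absent, unlike a relational table that must store a NULL per empty cell.

ROWS, COLS = 1000, 1000

sparse = {(0, 2): 7.5, (31, 999): 1.0, (500, 500): -2.0}


def cell(row, col):
    """Read a cell; missing entries are implicitly zero."""
    return sparse.get((row, col), 0.0)


cells_stored = len(sparse)          # 3 entries, not 1,000,000 rows of NULLs
density = cells_stored / (ROWS * COLS)
```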
Traditional database approach
Sometimes it doesn’t fit
▪ What if we want to store binary files from a scientific
instrument… images… videos?
▪ Could store metadata in first few cols, then pack binary
data into one huge column?
▪ Inefficient
▪ Slow
▪ Might run out of space in the binary column
▪ No internal structure
▪ Database driver is a bottleneck
Hadoop data model
Types of data
▪ Structured: pre-defined schema, highly structured. Example: relational database system
▪ Semi-structured: inconsistent structure; cannot be stored in rows and tables in a typical database. Examples: logs, tweets, sensor feeds
▪ Unstructured: lacks structure, or parts of it lack structure. Examples: free-form text, reports, customer feedback forms
SCHEMA ON READ
Hadoop approach
A more flexible option?
▪ Just load your data files into the cluster as they are
▪ Don’t worry about the format right now
▪ Leave files in their native/application format
▪ Since they are now in a cluster with distributed computing
power local to each storage shard anyway:
▪ Query the data files in-place
▪ Bring the computation to the data
▪ Saves shipping data around
▪ Use a more flexible model – don't worry about the schema up-front, figure it out later
▪ Use alternative programming models to query data – MapReduce is more abstract than SQL, so it can support a wider range of customisation
▪ Beyond that, can replace MapReduce altogether
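Schema-on-read in miniature: raw lines are loaded as-is, and a schema is imposed only at query time, so it can be changed later without reloading anything. The log format below is invented for illustration:

```python
# Schema-on-read: store raw lines untouched; parse them only when querying.

raw_lines = [
    "2014-03-01 10:02:11 GET /index.html 200",
    "2014-03-01 10:02:14 GET /missing 404",
    "2014-03-01 10:02:20 POST /login 200",
]


def read_with_schema(line):
    """Impose today's schema at read time; tomorrow's schema can differ."""
    date, time, method, path, status = line.split()
    return {"date": date, "time": time, "method": method,
            "path": path, "status": int(status)}


# "Query" the raw files in place: which paths produced errors?
errors = [r["path"] for r in map(read_with_schema, raw_lines)
          if r["status"] >= 400]
```

Contrast with schema-on-write, where the parse (and any mistake in it) is baked in at load time.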
3. Hardware Considerations
50,000 ft picture
Hardware Considerations
The Basics
▪ Cluster infrastructure design is a complex topic
▪ Full treatment is beyond the scope of this talk
▪ But, essentially:
▪ Prefer cheap commodity hardware
▪ Failure is planned and expected – Hadoop will deal with it
gracefully
▪ The only exception to this is a service called the
“NameNode” which coordinates the filesystem – for some
distributions this requires a high-availability node
▪ The more memory the better
Hardware Considerations
Performance
▪ Try to pack as many disks and cores into a node as is
reasonable:
▪ 8 or more disks
▪ Cheap SATA drives – NOT expensive SAS drives
▪ Cheap CPUs, not expensive server-grade high performance
chips
▪ Match one disk to one CPU core, generally
▪ So for 12 disks, get 12 CPU cores
▪ Adding more cores => add more memory
Hardware Considerations
Bottlenecks
▪ You will hit bottlenecks in the following order:
1. Disk I/O throughput
2. Network throughput
3. Running out of memory
4. CPU speed
▪ Therefore, money will be wasted on expensive high-speed
CPUs, and they will increase power consumption costs
▪ Disk I/O is best addressed by following the one-disk-to-CPU-core ratio guideline
4. Hadoop Tools
Ecosystem tools & utilities
Hadoop Subprojects & Tools
Overview
▪ Distributed Storage (HDFS)
▪ Distributed Processing (MapReduce)
▪ Scripting (Pig)
▪ Query (Hive)
▪ Metadata Management (HCatalog)
▪ NoSQL Column DB (HBase)
▪ Data Extraction & Load (Sqoop, WebHDFS, 3rd-party tools)
▪ Workflow & Scheduling (Oozie)
▪ Machine Learning & Predictive Analytics (Mahout)
▪ Management & Monitoring (Ambari, ZooKeeper, Nagios, Ganglia)
Pig
What do I need to remember?
▪ High-level dataflow scripting tool
▪ Best for exploring data quickly or prototyping
▪ When you load data, you can:
▪ Specify a schema at load time
▪ Not specify one and let Pig guess
▪ Not specify one and tell Pig later once you work it out
▪ Pig eats any kind of data (hence the name “Pig”)
▪ Pluggable load/save adapters available (e.g. Text, Binary, HBase)
Pig
Sample script
> A = LOAD 'student' USING PigStorage() AS (name:chararray, year:int, gpa:float);
> DUMP A;
(John,2005,4.0F)
(John,2006,3.9F)
(Mary,2006,3.8F)
(Mary,2007,2.7F)
(Bill,2009,3.9F)
(Joe,2010,3.8F)
> B = GROUP A BY name;
> C = FOREACH B GENERATE group, AVG(A.gpa);
> STORE C INTO 'grades' USING PigStorage();
Hive
What do I need to remember?
▪ Data Warehouse-like front end & SQL for Hadoop
▪ Best for managing known data with fixed schema
▪ Stores data in HDFS, metadata in a local database (e.g. MySQL)
▪ Supports two table types:
▪ Internal tables (Hive manages the data files & format)
▪ External tables (you manage the data files & format)
▪ Not fully safe to hand to inexperienced native-SQL users
Hive
Sample script
> CREATE TABLE page_view(viewTime INT, userid BIGINT,
>   page_url STRING, referrer_url STRING,
>   ip STRING COMMENT 'IP Address of the User')
> COMMENT 'This is the page view table'
> PARTITIONED BY(dt STRING, country STRING)
> ROW FORMAT DELIMITED
> FIELDS TERMINATED BY '1'
> STORED AS SEQUENCEFILE;
> SELECT * FROM page_view WHERE viewTime > 1400;
> SELECT * FROM page_view
> JOIN users ON page_view.userid = users.userid
> ORDER BY users.subscription_date;
Oozie
What do I need to remember?
▪ Workflow engine & scheduler for Hadoop
▪ Best for building repeatable production workflows of
common Hadoop jobs
▪ Great for composing smaller jobs into larger more
complex ones
▪ Stores job metadata in a local database (e.g. MySQL)
▪ Complex recurring schedules easy to create
▪ Some counter-intuitive cluster setup steps required, due
to the way job actions for Hive and Pig are launched
Sqoop
What do I need to remember?
▪ Data import/export tool for Hadoop + external database
▪ Suitable for importing/exporting flat simple tables
▪ Suitable for importing arbitrary query as table
▪ Does not have any configuration or state
▪ Can easily overload a remote database if too much parallelism is requested (think DDoS attack)
▪ Can load directly to/from Hive tables as well as HDFS files
5. Summary
Summary
Hadoop as a concept
▪ Driving factors:
▪ Why it was designed the way it was
▪ Why that is important to you as a user
▪ What makes it popular
▪ What it isn’t so suitable for
Summary
Hadoop as a platform
▪ Key concepts:
▪ Distributed Storage
▪ Distributed Computation
▪ Bring the computation to the data
▪ Variety in data formats (store files, analyse in-place later)
▪ Variety in data analysis (more abstract programming
models)
▪ Linear Scalability
Summary
Hadoop Ecosystem
▪ Multiple vendors
▪ Brings choice and competition
▪ Feature differentiators
▪ Variety of licensing models
▪ Rich set of user tools
▪ Exploration
▪ Querying
▪ Workflow & ETL
▪ Distributed Database
▪ Machine Learning
Further Training
▪ Apache Hadoop 2.0 Developing Java Applications (4 days)
▪ Apache Hadoop 2.0 Development for Data Analysts (4
days)
▪ Apache Hadoop 2.0 Operations Management (3 days)
▪ MapR Hadoop Fundamentals of Administration (3 days)
▪ Apache Cassandra DevOps Fundamentals (3 days)
▪ Apache Hadoop Masterclass (1 day)
▪ Big Data Concepts Masterclass (1 day)
▪ Machine Learning at scale with Apache Mahout (1 day)
Contact Details
Tim Seears
CTO
Big Data Partnership
info@bigdatapartnership.com
@BigDataExperts
Editor's notes
Imagine you have a legacy database or DW
=> Data volumes growing rapidly
=> Running out of space
=> We scale up a little, cost goes up a little – nothing too serious
=> Scale up a little more, cost/GB starts to look a little concerning
=> Scale up again, suddenly we hit a discontinuity
=> cost/GB is prohibitive
=> Or performance drops dramatically
=> Or maybe you want to go to TB-PB range and it’s not even possible
=> Problem is, as data increases we have to SCALE UP because of the architecture choices
=> Cost per GB also increases, so it becomes exponentially more expensive to grow
=> Interestingly, this also broadly holds for complexity of integrating more business data sources (Variety) (BI problem),
not just adding more Volume (DW problem)
=> Because again the architecture dictates we have increasingly higher cost to add more Variety
The net result of storing large schema-driven databases is that the individual machines they’re housed on must be:
* high quality
* redundant
* highly available
This translates into:
* high costs for servers, infrastructure and support
Let’s imagine we could scale OUT, rather than UP.
=> What would that look like?
* Assume database going to be clustered
(but not in the usual sense)
* What if we cluster only part of data on each node?
* What effect would this have on the system?
* What problems would ensue?
=> sharding
=> each node only has part of data
=> local queries on each node
One of the most important properties of Hadoop is that it scales completely linearly.
This is in stark contrast to traditional database systems, which usually suffer from uneven scaling properties.
So if we start out with cluster of certain size => increase by +50% by simply adding new nodes.
What happens is that the distributed filesystem capacity increases by exactly the same percentage,
and so, at the same time, does the compute capacity (i.e., how much data we can process *without* increasing the computation time).
Think about this for a second – the emergence of Hadoop is actually the first time that virtually-unlimited, linear, scalability has become available to everyone.
For free.
That’s actually really important.
The only other place you might have been able to achieve something comparable was probably building a proprietary grid computing platform, so maybe those of you with that kind of background are wondering why this matters to you?
Well, because Hadoop means this is now becoming mainstream:
– and that irreversibly changes it from a highly specialist field where you have to develop all your applications from scratch
– taking a long time
– to one where other people are out there producing applications and libraries for you,
– solving problems for you,
– and you can leapfrog ahead on your own work schedule by leveraging the work of other people, rather than starting from nothing.
– I think history has shown that’s always a really positive impact for everyone, regardless of which sector you work in.