Big Data & Hadoop Framework


Who am I?
- Name: Pham Phuong Tu
- Works as an R&D developer at VC Corp
- Related projects: muachung.vn, enbac.com, rongbay.com, …
- Interested in system & software architecture, big data, data statistics & analytics
- Email: phamptu@gmail.com

Statistics System
Measuring Marketing Effectiveness

User Activity Recorder
- Mouse click
- Mouse scroll

Table of contents
- Challenges
- Big Data
  - Overview
  - Data types
  - What – Who – Why
  - Extract, transform, load data
  - Data operation
  - Big data platform
- Hadoop Framework
  - Overview
  - History
  - Users
  - Architecture
  - MapReduce
  - Hadoop environment
- Q&A
- Demo

Analyze All Available Data
[Figure: data warehouse together with social media, website, billing, ERP, CRM, devices, and network switches as data sources]

Type of Data
- Plain text data (Web)
- Semi-structured data (XML, JSON)
- Relational data (tables / transactions / legacy data)
- Graph data
  - Social network, Semantic Web (RDF), …
- Multimedia data
  - Image, video, …
- Streaming data
  - You can only scan the data once
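
The single-scan constraint on streaming data means any statistic has to be updated incrementally. A minimal Python sketch, assuming a numeric stream and using Welford's algorithm (a technique swapped in for illustration, not taken from the slides); `RunningStats` and `stream` are hypothetical names:

```python
class RunningStats:
    """Single-pass mean and variance (Welford's algorithm)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, value):
        """Fold one new value into the running statistics."""
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.count if self.count else 0.0


def stream():
    """Hypothetical stream: a generator can only be consumed once."""
    yield from [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]


stats = RunningStats()
for value in stream():  # each value is seen exactly once and never stored
    stats.update(value)
```

Memory use stays constant no matter how long the stream runs, which is the point of a one-pass design.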

What is collecting all this data?
- Web browsers
- Web sites (search engines, social networks, e-commerce platforms, …)
- Applications
- Computers, smartphones, tablets, game consoles
- Other systems (banking, phone, medical, GPS)
- Internet service providers

Who is collecting your data?
- Government agencies
- Companies
- Service providers
- Big stores

Why are they collecting your data?
- Search
- Recommendation systems
- Target marketing
- Video and image analysis
- Data warehouse
- Log analytics
- Business strategy
Example companies:
- New York Times, Eyealike
- Facebook, AOL
- Facebook, Yahoo, Google
- Facebook, Amazon
- Yahoo, Amazon, Zvents
- Google Ads, Facebook Ads
- Walmart

ETL
- Extract: Converts the data into a single format appropriate for transformation processing.
- Transform: Applies a series of rules or functions to the data extracted from the source.
- Load: Loads the data into the end target, usually the data warehouse.
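
The three ETL stages can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two sample sources, the `sales` table, and the field names are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample sources; the field names are assumptions for illustration.
CSV_SOURCE = "user,amount\nalice,10.5\nbob,3\n"
JSON_SOURCE = '[{"user": "Carol", "amount": "7.25"}]'


def extract():
    """Extract: read both sources into one common format (a list of dicts)."""
    rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    rows.extend(json.loads(JSON_SOURCE))
    return rows


def transform(rows):
    """Transform: apply rules (normalize names, convert amounts to float)."""
    return [(row["user"].strip().lower(), float(row["amount"])) for row in rows]


def load(records, conn):
    """Load: write the transformed records into the target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (user TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(transform(extract()), conn)
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

Keeping each stage as its own function makes it easy to add validation or staging steps between them later.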

Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
- Cycle initiation
- Build reference data
- Extract (from sources)
- Validate
- Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
- Stage (load into staging tables, if used)
- Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair)
- Publish (to target tables)
- Archive
- Clean up

Operational data with the current implementation

Operational data with a big data implementation

Big Data Platform
- Understand and navigate federated big data sources: Federated Discovery and Navigation
- Manage & store huge volumes of any data: Hadoop File System, MapReduce
- Structure and control data: Data Warehousing
- Manage streaming data: Stream Computing
- Analyze unstructured data: Text Analytics Engine
- Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

Single-node architecture
[Figure: a single machine with CPU, memory, and disk; labels: Machine Learning, Statistics; “Classical” Data Mining]

Cluster Architecture
[Figure: racks of nodes, each node with its own CPU, memory, and disk, connected through switches]
- 2-10 Gbps backbone between racks
- 1 Gbps between any pair of nodes in a rack
- Each rack contains 16-64 nodes

Hadoop History
- Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003)
- July 2005 – Nutch uses MapReduce
- Feb 2006 – Hadoop becomes a Lucene subproject
- Apr 2007 – Yahoo! runs Hadoop on a 1000-node cluster
- Jan 2008 – Hadoop becomes an Apache Top Level Project
- Jul 2008 – A 4000-node test cluster

Who uses Hadoop?
- Google, Yahoo, Bing
- Amazon, eBay, Alibaba
- Facebook, Twitter
- IBM, HP, Toshiba, Intel
- New York Times, BBC
- Line, WeChat
- VC Corp, FPT, VNG, VTC

Hadoop Components
- Distributed file system (HDFS)
  - Single namespace for entire cluster
  - Replicates data 3x for fault-tolerance
- MapReduce framework
  - Executes user jobs specified as “map” and “reduce” functions
  - Manages work distribution & fault-tolerance

Goals of HDFS
- Very large distributed file system
  - 10K nodes, 100 million files, 10 PB
- Assumes commodity hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
- Optimized for batch processing
  - Data locations exposed so that computations can move to where data resides
  - Provides very high aggregate bandwidth
- User space, runs on heterogeneous OS

NameNode
- Metadata in memory
  - The entire metadata is in main memory
  - No demand paging of metadata
- Types of metadata
  - List of files
  - List of blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
- A transaction log
  - Records file creations, file deletions, etc.
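
The NameNode metadata types map naturally onto a few in-memory structures. The Python sketch below is a toy model only: `ToyNameNode`, `FileMeta`, and all method names are hypothetical and are not the real HDFS classes or API.

```python
import time
from dataclasses import dataclass, field


@dataclass
class FileMeta:
    """File attributes plus the ordered list of block IDs for one file."""
    replication: int
    creation_time: float = field(default_factory=time.time)
    blocks: list = field(default_factory=list)


class ToyNameNode:
    """Hypothetical in-memory model; not the real HDFS classes."""

    def __init__(self):
        self.files = {}            # path -> FileMeta (list of files)
        self.block_locations = {}  # block ID -> set of DataNode names
        self.transaction_log = []  # append-only record of namespace changes

    def create_file(self, path, replication=3):
        self.files[path] = FileMeta(replication)
        self.transaction_log.append(("create", path))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)
        self.transaction_log.append(("add_block", path, block_id))

    def delete_file(self, path):
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.transaction_log.append(("delete", path))

    def locate(self, path):
        """Return (block ID, DataNodes) pairs, as a client read would need."""
        return [(b, sorted(self.block_locations[b]))
                for b in self.files[path].blocks]


nn = ToyNameNode()
nn.create_file("/logs/clicks.txt")
nn.add_block("/logs/clicks.txt", "blk_1", ["dn1", "dn4", "dn5"])
locations = nn.locate("/logs/clicks.txt")
```

Because every lookup is a dictionary access in main memory, metadata operations stay fast, at the cost of the NameNode's RAM bounding the number of files and blocks.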

DataNode
- A block server
  - Stores data in the local file system (e.g. ext3)
  - Stores metadata of a block
  - Serves data and metadata to clients
- Block report
  - Periodically sends a report of all existing blocks to the NameNode
- Facilitates pipelining of data
  - Forwards data to other specified DataNodes

Block Placement
- Current strategy
  - One replica on the local node
  - Second replica on a remote rack
  - Third replica on the same remote rack
  - Additional replicas are randomly placed
- Clients read from the nearest replica
- Would like to make this policy pluggable
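
The block placement strategy (local node first, then two nodes on one remote rack, then random nodes) can be sketched in Python as follows. The function name `place_replicas` and the rack layout are hypothetical, and the sketch assumes the cluster has at least two racks.

```python
import random


def place_replicas(local_node, racks, replication=3, rng=random):
    """Pick DataNodes for one block following the block placement strategy.

    racks maps a rack name to its list of node names (hypothetical layout).
    Assumes at least two racks exist.
    """
    local_rack = next(r for r, nodes in racks.items() if local_node in nodes)
    chosen = [local_node]  # first replica: the writer's own node

    if replication >= 2:
        remote_rack = rng.choice([r for r in racks if r != local_rack])
        # second and third replicas: two different nodes on one remote rack
        count = min(replication - 1, 2, len(racks[remote_rack]))
        chosen += rng.sample(racks[remote_rack], count)

    # any additional replicas: random nodes not used yet
    remaining = [n for nodes in racks.values() for n in nodes if n not in chosen]
    extra = max(0, min(replication - len(chosen), len(remaining)))
    chosen += rng.sample(remaining, extra)
    return chosen


racks = {
    "rack1": ["dn1", "dn2", "dn3"],
    "rack2": ["dn4", "dn5", "dn6"],
    "rack3": ["dn7", "dn8", "dn9"],
}
replicas = place_replicas("dn1", racks, rng=random.Random(0))
```

Putting two replicas on the same remote rack survives the loss of a whole rack while sending the block across the inter-rack backbone only once.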

Data Correctness
- Use checksums to validate data
  - Use CRC32
- File creation
  - Client computes a checksum per 512 bytes
  - DataNode stores the checksum
- File access
  - Client retrieves the data and checksum from the DataNode
  - If validation fails, the client tries other replicas
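
A minimal Python sketch of the per-512-byte CRC32 scheme, using the standard `zlib.crc32`. The helper names `compute_checksums` and `read_verified` are hypothetical, and replicas are modeled as in-memory byte strings rather than real DataNode reads.

```python
import zlib

BYTES_PER_CHECKSUM = 512  # chunk size from the slide


def compute_checksums(data):
    """CRC32 of every 512-byte chunk, as the client does on file creation."""
    return [zlib.crc32(data[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(data), BYTES_PER_CHECKSUM)]


def read_verified(replicas, checksums):
    """Return data from the first replica whose checksums all match."""
    for data in replicas:
        if compute_checksums(data) == checksums:
            return data
    raise IOError("all replicas failed checksum validation")


block = bytes(range(256)) * 5        # 1280 bytes, so 3 chunks
stored = compute_checksums(block)    # kept alongside the block
corrupted = b"\xff" + block[1:]      # simulated corruption in one replica

# The corrupted replica is rejected and the read falls back to the good one.
verified = read_verified([corrupted, block], stored)
```

Checksumming small chunks instead of whole blocks lets a reader validate just the range it fetched.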

NameNode Failure
- A single point of failure
- Transaction log stored in multiple directories
  - A directory on the local file system
  - A directory on a remote file system (NFS/CIFS)
- Need to develop a real HA solution

Data Pipelining
- Client retrieves a list of DataNodes on which to place replicas of a block
- Client writes the block to the first DataNode
- The first DataNode forwards the data to the next DataNode in the pipeline
- When all replicas are written, the client moves on to write the next block in the file
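
The write pipeline can be simulated in Python as a chain of forwards. This is a toy model: `ToyDataNode`, `write_file`, and the node-picking lambda are hypothetical, and whole blocks are forwarded at once, which simplifies whatever granularity a real cluster uses.

```python
class ToyDataNode:
    """Hypothetical DataNode that stores blocks in a dict."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it along the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])


def write_file(chunks, pick_pipeline):
    """Client side: write blocks one at a time, each through its own pipeline."""
    for block_id, data in enumerate(chunks):
        pipeline = pick_pipeline(block_id)  # list of DataNodes for this block
        pipeline[0].write(block_id, data, pipeline[1:])


nodes = [ToyDataNode(f"dn{i}") for i in range(1, 5)]
# Assumed stand-in for the NameNode's answer: three consecutive nodes per block.
write_file([b"block-0", b"block-1"],
           lambda b: [nodes[(b + k) % len(nodes)] for k in range(3)])
```

The client sends each block only once; the DataNodes themselves carry the replication traffic.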

Rebalancer
- Goal: % disk full on DataNodes should be similar
  - Usually run when new DataNodes are added
  - Cluster is online when Rebalancer is active
  - Rebalancer is throttled to avoid network congestion
  - Command line tool
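
One way to make "similar % disk full" concrete is to compare each DataNode with the cluster-wide average. The Python sketch below is an illustration only: `classify_nodes`, the usage figures, and the 10-point threshold are assumptions, and it only finds the imbalanced nodes without planning any block moves.

```python
def classify_nodes(usage, threshold=10.0):
    """Split DataNodes into over- and under-utilized lists.

    usage maps node name -> (used_bytes, capacity_bytes). A node counts as
    balanced when its % full is within `threshold` percentage points of the
    cluster-wide % full. The threshold value is an assumption.
    """
    total_used = sum(used for used, _ in usage.values())
    total_capacity = sum(cap for _, cap in usage.values())
    cluster_pct = 100.0 * total_used / total_capacity

    over, under = [], []
    for node, (used, cap) in usage.items():
        pct = 100.0 * used / cap
        if pct > cluster_pct + threshold:
            over.append(node)
        elif pct < cluster_pct - threshold:
            under.append(node)
    return cluster_pct, over, under


# dn4 was just added, so it is nearly empty while the others are full.
usage = {"dn1": (80, 100), "dn2": (75, 100), "dn3": (70, 100), "dn4": (5, 100)}
cluster_pct, over, under = classify_nodes(usage)
```

A rebalancer would then move blocks from the over-utilized nodes to the under-utilized ones until both lists are empty.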

What is MapReduce?
- Simple data-parallel programming model designed for scalability and fault-tolerance
- Pioneered by Google
  - Processes 20 petabytes of data per day
- Popularized by the open-source Hadoop project
  - Used at Yahoo!, Facebook, Amazon, …


Example: Word count process
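
The word count process can be sketched in plain Python as a map, shuffle, and reduce step. This runs in a single process and only mimics what the framework does; the function names and sample lines are hypothetical and this is not the Hadoop API.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(word, counts):
    """Reduce: sum the counts collected for one word."""
    return word, sum(counts)


# Sample input; on a cluster each line could be handled by a different mapper.
lines = ["big data hadoop", "hadoop map reduce", "big data big cluster"]
mapped = [pair for line in lines for pair in map_phase(line)]
word_counts = dict(reduce_phase(w, c) for w, c in shuffle(mapped).items())
```

Because each map call sees one line and each reduce call sees one word, both phases can be spread across many machines without changing the two functions.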
