BIG DATA & HADOOP FRAMEWORK

Who am I?
• Name: Pham Phuong Tu
• Works as an R&D developer at VC Corp
• Related projects: muachung.vn, enbac.com, rongbay.com, …
• Interested in system & software architecture, big data, data statistics & analytics
• Email: phamptu@gmail.com

Statistic System

Statistic System
Measuring Marketing Effectiveness

User Activity Recorder
Mouse Click

User Activity Recorder
Mouse Click

User Activity Recorder
Mouse Scroll

Mining System Log

Table of contents
• Challenges
• Big Data
  - Overview
  - Data type
  - What – Who – Why
  - Extract, transform, load data
  - Data operation
  - Big data platform
• Hadoop Framework
  - Overview
  - History
  - User
  - Architecture
  - Map Reduce
  - Hadoop Environment
• Q&A
• Demo

CHALLENGES

BIG DATA

Analyze All Available Data
[Diagram: Data warehouse, Social Media, Website, Billing, ERP, CRM, Devices, Network Switches]

Type of Data
• Plain Text Data (Web)
• Semi-structured Data (XML, JSON)
• Relational Data (Tables/Transaction/Legacy Data)
• Graph Data
  - Social Network, Semantic Web (RDF), …
• Multimedia Data
  - Image, Video, …
• Streaming Data
  - You can only scan the data once
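
The "scan once" constraint on streaming data means any statistic has to be updated incrementally as values arrive. A minimal sketch, assuming a made-up stream of numeric measurements (the `click_latencies` generator and its values are hypothetical, not from the slides):

```python
from typing import Iterable, Iterator


class RunningStats:
    """Keeps count, mean, and maximum while seeing each value exactly once."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0
        self.maximum = float("-inf")

    def update(self, value: float) -> None:
        self.count += 1
        # Incremental mean: earlier values never need to be stored.
        self.mean += (value - self.mean) / self.count
        self.maximum = max(self.maximum, value)


def summarize(stream: Iterable[float]) -> RunningStats:
    stats = RunningStats()
    for value in stream:  # single pass; the stream is never rewound
        stats.update(value)
    return stats


def click_latencies() -> Iterator[float]:
    """Hypothetical stream; a generator can only be consumed once."""
    yield from (120.0, 80.0, 100.0, 140.0)


stats = summarize(click_latencies())
print(stats.count, stats.mean, stats.maximum)
```

Memory use stays constant no matter how long the stream runs, which is the property that matters when the data cannot be stored and re-read.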

What is collecting all this data?
• Web Browsers
• Web Sites (Search Engine, Social Network, E-commerce Platform…)
• Applications
• Computer, Smartphone, Tablet, Games Boxes
• Other System (Banking, Phone, Medical, GPS)
• Internet Service Providers

Who is collecting your data?
• Government Agencies
• Companies
• Service Provider
• Big Stores

Why are they collecting your data?
• Search
• Recommendation Systems
• Target Marketing
• Video and Image Analysis
• Data Warehouse
• Log analytics
• Business strategy
Companies named on the slide: New York Times, Eyealike; Facebook, AOL; Facebook, Yahoo, Google; Facebook, Amazon; Yahoo, Amazon, Zvents; Google Ads, Facebook Ads; Walmart

ETL
• Extract: Converts the data into a single format appropriate for transformation processing.
• Transform: Applies a series of rules or functions to the data extracted from the source.
• Load: Loads the data into the end target, usually the Data Warehouse.
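
The three ETL stages can be sketched as three small functions. This is only an illustration: the CSV content, the column names, the `payments` table, and the use of an in-memory SQLite database in place of a data warehouse are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical raw export; the columns are made up for illustration.
RAW_CSV = """user_id,amount,currency
1, 10.50 ,usd
2,abc,USD
3,7,usd
"""


def extract(text):
    """Extract: read the source into one uniform format (a list of dicts)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: trim values, normalize currency case, drop unparseable amounts."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # rule: skip rows whose amount is not a number
        clean.append((int(row["user_id"]), amount, row["currency"].strip().upper()))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments (user_id INTEGER, amount REAL, currency TEXT)"
    )
    conn.executemany("INSERT INTO payments VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(transform(extract(RAW_CSV)), conn)
total = conn.execute("SELECT COUNT(*), SUM(amount) FROM payments").fetchone()
print(total)
```

Keeping each stage a separate function makes it possible to test the transformation rules without touching the source or the target.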

Real-life ETL cycle
The typical real-life ETL cycle consists of the following execution steps:
1. Cycle initiation
2. Build reference data
3. Extract (from sources)
4. Validate
5. Transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates)
6. Stage (load into staging tables, if used)
7. Audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose/repair)
8. Publish (to target tables)
9. Archive
10. Clean up

Operational data
with current implementation

Operational data
with big data implementation

Big Data Platform
• Understand and navigate federated big data sources: Federated Discovery and Navigation
• Manage & store huge volume of any data: Hadoop File System, MapReduce
• Structure and control data: Data Warehousing
• Manage streaming data: Stream Computing
• Analyze unstructured data: Text Analytics Engine
• Integrate and govern all data sources: Integration, Data Quality, Security, Lifecycle Management, MDM

HADOOP FRAMEWORK

Single-node architecture
[Diagram: CPU, Memory, Disk; labels: Machine Learning, Statistics; “Classical” Data Mining]

Cluster Architecture
• 2-10 Gbps backbone between racks
• 1 Gbps between any pair of nodes in a rack
• Each rack contains 16-64 nodes
[Diagram: racks of nodes (CPU, Mem, Disk), each rack behind a switch, with the rack switches joined by a backbone switch]

Hadoop History
• Dec 2004 – Google MapReduce paper published (the GFS paper had appeared in 2003)
• July 2005 – Nutch uses MapReduce
• Feb 2006 – Hadoop becomes a Lucene subproject
• Apr 2007 – Yahoo! on 1000-node cluster
• Jan 2008 – An Apache Top Level Project
• Jul 2008 – A 4000-node test cluster

Who uses Hadoop?
• Google, Yahoo, Bing
• Amazon, eBay, Alibaba
• Facebook, Twitter
• IBM, HP, Toshiba, Intel
• New York Times, BBC
• Line, WeChat
• VC Corp, FPT, VNG, VTC

Hadoop Components
• Distributed file system (HDFS)
  - Single namespace for entire cluster
  - Replicates data 3x for fault-tolerance
• MapReduce framework
  - Executes user jobs specified as “map” and “reduce” functions
  - Manages work distribution & fault-tolerance

Goals of HDFS
• Very Large Distributed File System
  - 10K nodes, 100 million files, 10 PB
• Assumes Commodity Hardware
  - Files are replicated to handle hardware failure
  - Detects failures and recovers from them
• Optimized for Batch Processing
  - Data locations exposed so that computations can move to where data resides
  - Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS

HDFS Multi Cluster

Hadoop 1.0 vs 2.0

NameNode
• Meta-data in Memory
  - The entire metadata is in main memory
  - No demand paging of meta-data
• Types of Metadata
  - List of files
  - List of Blocks for each file
  - List of DataNodes for each block
  - File attributes, e.g. creation time, replication factor
• A Transaction Log
  - Records file creations, file deletions, etc.

DataNode
• A Block Server
  - Stores data in the local file system (e.g. ext3)
  - Stores meta-data of a block
  - Serves data and meta-data to Clients
• Block Report
  - Periodically sends a report of all existing blocks to the NameNode
• Facilitates Pipelining of Data
  - Forwards data to other specified DataNodes
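
The NameNode metadata and the DataNode block report described above can be pictured as a few in-memory maps plus an append-only log. The sketch below is a toy model only: the class names, method names, and paths are hypothetical and are not Hadoop's real classes or API.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class FileEntry:
    block_ids: List[int]   # list of blocks for this file
    replication: int       # file attribute: replication factor
    created_at: float      # file attribute: creation time


@dataclass
class NameNodeSketch:
    """Toy in-memory model of NameNode metadata (not Hadoop's implementation)."""
    files: Dict[str, FileEntry] = field(default_factory=dict)
    block_locations: Dict[int, Set[str]] = field(default_factory=dict)
    edit_log: List[Tuple[str, str]] = field(default_factory=list)
    next_block_id: int = 0

    def create_file(self, path: str, num_blocks: int, replication: int = 3) -> FileEntry:
        ids = list(range(self.next_block_id, self.next_block_id + num_blocks))
        self.next_block_id += num_blocks
        entry = FileEntry(ids, replication, time.time())
        self.files[path] = entry
        for block_id in ids:
            self.block_locations[block_id] = set()  # filled in by block reports
        self.edit_log.append(("create", path))      # transaction log record
        return entry

    def delete_file(self, path: str) -> None:
        entry = self.files.pop(path)
        for block_id in entry.block_ids:
            self.block_locations.pop(block_id, None)
        self.edit_log.append(("delete", path))

    def block_report(self, datanode: str, block_ids: List[int]) -> None:
        """A DataNode reports the blocks it holds; unknown blocks are ignored."""
        for block_id in block_ids:
            if block_id in self.block_locations:
                self.block_locations[block_id].add(datanode)


nn = NameNodeSketch()
nn.create_file("/logs/a.log", num_blocks=2)   # blocks 0 and 1
nn.create_file("/logs/b.log", num_blocks=1)   # block 2
nn.block_report("dn1", [0, 1, 2])
nn.block_report("dn2", [1, 99])               # 99 is not a known block
nn.delete_file("/logs/b.log")
print(sorted(nn.files), nn.block_locations, nn.edit_log)
```

In this model the block-to-DataNode map is never written to the log; it is rebuilt from the reports, which is one reason a periodic block report is useful.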

Block Placement
• Current Strategy
  - One replica on local node
  - Second replica on a remote rack
  - Third replica on same remote rack
  - Additional replicas are randomly placed
• Clients read from nearest replica
• Would like to make this policy pluggable
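
The placement strategy above can be written as a short selection function. This is a sketch under stated assumptions, not HDFS code: the function name, the `racks` mapping, and the node names are hypothetical, and it assumes at least one remote rack with two or more nodes.

```python
import random
from typing import Dict, List, Optional


def place_replicas(writer: str, racks: Dict[str, List[str]], replication: int = 3,
                   rng: Optional[random.Random] = None) -> List[str]:
    """Pick DataNodes for one block following the strategy on the slide.

    `racks` maps a rack name to the node names it contains.
    """
    rng = rng or random.Random()
    local_rack = next(r for r, nodes in racks.items() if writer in nodes)
    chosen = [writer]  # 1st replica: the local node

    if replication >= 2:
        remote_racks = [r for r in racks if r != local_rack and len(racks[r]) >= 2]
        remote_rack = rng.choice(remote_racks)
        # 2nd and 3rd replicas: two different nodes on the same remote rack
        chosen += rng.sample(racks[remote_rack], min(2, replication - 1))

    # Additional replicas: random nodes that have not been used yet
    extra = replication - len(chosen)
    if extra > 0:
        remaining = [n for nodes in racks.values() for n in nodes if n not in chosen]
        chosen += rng.sample(remaining, extra)
    return chosen


racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5", "n6"]}
replicas = place_replicas("n1", racks, replication=3, rng=random.Random(0))
print(replicas)
```

Putting two replicas on one remote rack means a block survives the loss of a whole rack while the write crosses the inter-rack backbone only once.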

Data Correctness
• Use Checksums to validate data
  - Use CRC32
• File Creation
  - Client computes checksum per 512 bytes
  - DataNode stores the checksum
• File access
  - Client retrieves the data and checksum from DataNode
  - If Validation fails, Client tries other replicas
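
The checksum scheme above can be tried out with the CRC32 function in Python's standard `zlib` module. A minimal sketch: the helper names and the in-memory "replicas" are hypothetical, and real HDFS stores the checksums alongside the block on the DataNode.

```python
import zlib
from typing import List, Tuple

CHUNK_SIZE = 512  # bytes covered by each checksum, as on the slide


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """CRC32 of each consecutive chunk of `data`."""
    return [zlib.crc32(data[i:i + chunk_size]) for i in range(0, len(data), chunk_size)]


def verify(data: bytes, expected: List[int]) -> bool:
    """True when every chunk of `data` matches its stored checksum."""
    return chunk_checksums(data) == expected


def read_block(replicas: List[Tuple[str, bytes]], expected: List[int]) -> Tuple[str, bytes]:
    """Return the first replica that validates; try the next one on failure."""
    for name, data in replicas:
        if verify(data, expected):
            return name, data
    raise IOError("all replicas failed checksum validation")


block = bytes(range(256)) * 5                       # 1280 bytes: chunks of 512, 512, 256
checksums = chunk_checksums(block)                  # computed by the client at file creation
corrupted = block[:600] + b"\x00" + block[601:]     # one damaged byte in the second chunk
source, data = read_block([("dn1", corrupted), ("dn2", block)], checksums)
print(len(checksums), source)
```

Checksumming small chunks instead of the whole block lets a reader validate a partial read without fetching the entire block.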

NameNode Failure
• A single point of failure
• Transaction Log stored in multiple directories
  - A directory on the local file system
  - A directory on a remote file system (NFS/CIFS)
• Need to develop a real HA solution
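
Writing the transaction log to several directories can be illustrated with a few lines of file handling. This is a sketch of the idea only: the `ReplicatedLog` class and the `edits.log` filename are hypothetical, and two temporary directories stand in for the local disk and the NFS/CIFS mount.

```python
import os
import tempfile
from typing import List


class ReplicatedLog:
    """Appends every record to a log file in each configured directory."""

    def __init__(self, directories: List[str], filename: str = "edits.log") -> None:
        self.paths = [os.path.join(d, filename) for d in directories]

    def append(self, record: str) -> None:
        for path in self.paths:
            with open(path, "a", encoding="utf-8") as handle:
                handle.write(record + "\n")
                handle.flush()
                os.fsync(handle.fileno())  # push the record to stable storage

    def recover(self) -> List[str]:
        """Read the log back from the first directory that is still readable."""
        for path in self.paths:
            try:
                with open(path, encoding="utf-8") as handle:
                    return handle.read().splitlines()
            except OSError:
                continue
        raise OSError("no copy of the transaction log could be read")


local_dir = tempfile.mkdtemp()    # stands in for a local file system directory
remote_dir = tempfile.mkdtemp()   # stands in for a remote (NFS/CIFS) directory
log = ReplicatedLog([local_dir, remote_dir])
log.append("create /logs/a.log")
log.append("delete /logs/a.log")
os.remove(os.path.join(local_dir, "edits.log"))  # simulate losing the local copy
recovered = log.recover()
print(recovered)
```

Extra copies protect the log itself, but the NameNode process remains a single point of failure, which is why the slide still asks for a real HA solution.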

Data Pipelining
• Client retrieves a list of DataNodes on which to place replicas of a block
• Client writes block to the first DataNode
• The first DataNode forwards the data to the next DataNode in the Pipeline
• When all replicas are written, the Client moves on to write the next block in file
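
The write pipeline above can be modelled with objects that store a block and hand it to the next node. A toy sketch: `DataNodeSketch` and `write_file` are hypothetical names, the list of DataNodes is passed in directly instead of being requested from the NameNode, and real HDFS streams a block in small packets instead of forwarding it whole.

```python
from typing import Dict, List


class DataNodeSketch:
    """Toy DataNode: stores a block, then forwards it down the pipeline."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.blocks: Dict[int, bytes] = {}

    def write(self, block_id: int, data: bytes, downstream: List["DataNodeSketch"]) -> int:
        """Store the block and return how many replicas were written from here on."""
        self.blocks[block_id] = data
        if downstream:
            # Forward to the next node; it forwards to the rest in turn.
            return 1 + downstream[0].write(block_id, data, downstream[1:])
        return 1  # last node in the pipeline


def write_file(data: bytes, block_size: int, pipeline: List[DataNodeSketch]) -> int:
    """Client side: send each block to the first DataNode only, one block at a time."""
    blocks_written = 0
    for block_id, start in enumerate(range(0, len(data), block_size)):
        block = data[start:start + block_size]
        acks = pipeline[0].write(block_id, block, pipeline[1:])
        if acks != len(pipeline):
            raise IOError(f"block {block_id}: only {acks} replicas written")
        blocks_written += 1  # all replicas written, move on to the next block
    return blocks_written


nodes = [DataNodeSketch(name) for name in ("dn1", "dn2", "dn3")]
count = write_file(b"x" * 10, block_size=4, pipeline=nodes)
print(count, [sorted(node.blocks) for node in nodes])
```

The client uploads each block once; the replication traffic is spread over the DataNodes instead of leaving the client three times.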

Rebalancer
• Goal: % disk full on DataNodes should be similar
  - Usually run when new DataNodes are added
  - Cluster is online when Rebalancer is active
  - Rebalancer is throttled to avoid network congestion
  - Command line tool
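
The Rebalancer's goal can be stated as a simple check: compare each DataNode's "% disk full" with the cluster average. The sketch below only classifies nodes and is not the Rebalancer's actual algorithm; the function name, the 10-point threshold, and the sample numbers are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def plan_rebalance(used: Dict[str, float], capacity: Dict[str, float],
                   threshold_pct: float = 10.0) -> Tuple[float, List[str], List[str]]:
    """Classify DataNodes against the cluster-wide average utilization.

    A node counts as over- or under-utilized when its % full differs from the
    cluster average by more than `threshold_pct` percentage points.
    """
    cluster_pct = 100.0 * sum(used.values()) / sum(capacity.values())
    over, under = [], []
    for node in sorted(used):
        node_pct = 100.0 * used[node] / capacity[node]
        if node_pct > cluster_pct + threshold_pct:
            over.append(node)      # candidate source of blocks
        elif node_pct < cluster_pct - threshold_pct:
            under.append(node)     # candidate destination for blocks
    return cluster_pct, over, under


used = {"dn1": 800.0, "dn2": 700.0, "dn3": 0.0}            # GB used; dn3 was just added
capacity = {"dn1": 1000.0, "dn2": 1000.0, "dn3": 1000.0}   # GB total
cluster_pct, over, under = plan_rebalance(used, capacity)
print(cluster_pct, over, under)
```

A freshly added, empty DataNode pulls the average down, which is why the Rebalancer is usually run right after new nodes join.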

What is MapReduce?
• Simple data-parallel programming model designed for scalability and fault-tolerance
• Pioneered by Google
  - Processes 20 petabytes of data per day
• Popularized by open-source Hadoop project
  - Used at Yahoo!, Facebook, Amazon, …

Map Reduce Data Flow

Execution

Parallel Execution

Example: Word count process
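
The word count process can be reproduced on a single machine to show the map, shuffle, and reduce phases. This is a plain-Python sketch of the programming model, not Hadoop's API; the function names and input lines are made up for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(word: str, counts: List[int]) -> Tuple[str, int]:
    """Reduce: sum the values collected for one word."""
    return word, sum(counts)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    mapped = (pair for line in lines for pair in map_words(line))
    return dict(reduce_counts(word, counts) for word, counts in shuffle(mapped).items())


result = word_count(["big data", "Hadoop handles big data", "hadoop"])
print(sorted(result.items()))
```

Each call to `map_words` depends on one line only and each call to `reduce_counts` on one key only, which is what lets a framework run them in parallel across many nodes.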

Hadoop Environment
49

More Related Content

What's hot

Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)Raid Data Recovery
 
Lecture 10 distributed database management system
Lecture 10   distributed database management systemLecture 10   distributed database management system
Lecture 10 distributed database management systememailharmeet
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and HadoopFlavio Vit
 
Os Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual MemoryOs Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual Memorysgpraju
 
Message passing ( in computer science)
Message   passing  ( in   computer  science)Message   passing  ( in   computer  science)
Message passing ( in computer science)Computer_ at_home
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memoryAshish Kumar
 
Memory organization (Computer architecture)
Memory organization (Computer architecture)Memory organization (Computer architecture)
Memory organization (Computer architecture)Sandesh Jonchhe
 
Multiple Access Protocal
Multiple Access ProtocalMultiple Access Protocal
Multiple Access Protocaltes31
 
Ethernet and Token ring (Computer Networks)
Ethernet and Token ring (Computer Networks)Ethernet and Token ring (Computer Networks)
Ethernet and Token ring (Computer Networks)Shail Nakum
 
Analog and Digital Transmission
Analog and Digital TransmissionAnalog and Digital Transmission
Analog and Digital TransmissionAnushiya Ram
 
Control Units : Microprogrammed and Hardwired:control unit
Control Units : Microprogrammed and Hardwired:control unitControl Units : Microprogrammed and Hardwired:control unit
Control Units : Microprogrammed and Hardwired:control unitabdosaidgkv
 
Unit 4 ca-input-output
Unit 4 ca-input-outputUnit 4 ca-input-output
Unit 4 ca-input-outputBBDITM LUCKNOW
 
Lecture 6 -_presentation_layer
Lecture 6 -_presentation_layerLecture 6 -_presentation_layer
Lecture 6 -_presentation_layerSerious_SamSoul
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATAGauravBiswas9
 

What's hot (20)

Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
Understanding RAID Levels (RAID 0, RAID 1, RAID 2, RAID 3, RAID 4, RAID 5)
 
Lecture 10 distributed database management system
Lecture 10   distributed database management systemLecture 10   distributed database management system
Lecture 10 distributed database management system
 
Big Data and Hadoop
Big Data and HadoopBig Data and Hadoop
Big Data and Hadoop
 
Token bus
Token busToken bus
Token bus
 
Os Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual MemoryOs Swapping, Paging, Segmentation and Virtual Memory
Os Swapping, Paging, Segmentation and Virtual Memory
 
Message passing ( in computer science)
Message   passing  ( in   computer  science)Message   passing  ( in   computer  science)
Message passing ( in computer science)
 
distributed shared memory
 distributed shared memory distributed shared memory
distributed shared memory
 
Message passing in Distributed Computing Systems
Message passing in Distributed Computing SystemsMessage passing in Distributed Computing Systems
Message passing in Distributed Computing Systems
 
Memory organization (Computer architecture)
Memory organization (Computer architecture)Memory organization (Computer architecture)
Memory organization (Computer architecture)
 
Multiple Access Protocal
Multiple Access ProtocalMultiple Access Protocal
Multiple Access Protocal
 
Physical layer ppt
Physical layer pptPhysical layer ppt
Physical layer ppt
 
Ethernet and Token ring (Computer Networks)
Ethernet and Token ring (Computer Networks)Ethernet and Token ring (Computer Networks)
Ethernet and Token ring (Computer Networks)
 
Analog and Digital Transmission
Analog and Digital TransmissionAnalog and Digital Transmission
Analog and Digital Transmission
 
Control Units : Microprogrammed and Hardwired:control unit
Control Units : Microprogrammed and Hardwired:control unitControl Units : Microprogrammed and Hardwired:control unit
Control Units : Microprogrammed and Hardwired:control unit
 
Unit 4 ca-input-output
Unit 4 ca-input-outputUnit 4 ca-input-output
Unit 4 ca-input-output
 
Desktop and multiprocessor systems
Desktop and multiprocessor systemsDesktop and multiprocessor systems
Desktop and multiprocessor systems
 
Lecture 6 -_presentation_layer
Lecture 6 -_presentation_layerLecture 6 -_presentation_layer
Lecture 6 -_presentation_layer
 
Cache memory
Cache memoryCache memory
Cache memory
 
Map reduce in BIG DATA
Map reduce in BIG DATAMap reduce in BIG DATA
Map reduce in BIG DATA
 
Parallel processing
Parallel processingParallel processing
Parallel processing
 

Viewers also liked

Lecture 16
Lecture 16Lecture 16
Lecture 16Shani729
 
Data warehouse solutions
Data warehouse solutionsData warehouse solutions
Data warehouse solutionsTu Pham
 
Big data in action
Big data in actionBig data in action
Big data in actionTu Pham
 
Recommendation system for ecommerce
Recommendation system for ecommerceRecommendation system for ecommerce
Recommendation system for ecommerceTu Pham
 
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUBTu Pham
 
Database, data storage, hosting with Firebase
Database, data storage, hosting with FirebaseDatabase, data storage, hosting with Firebase
Database, data storage, hosting with FirebaseTu Pham
 
Building Reactive Applications With Akka And Java
Building Reactive Applications With Akka And JavaBuilding Reactive Applications With Akka And Java
Building Reactive Applications With Akka And JavaTu Pham
 
Understanding Kubernetes
Understanding KubernetesUnderstanding Kubernetes
Understanding KubernetesTu Pham
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop TutorialEdureka!
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?sudhakara st
 
Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能Etu Solution
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop DeveloperEdureka!
 
Big data initiative justification and prioritization framework
Big data initiative justification and prioritization frameworkBig data initiative justification and prioritization framework
Big data initiative justification and prioritization frameworkNeerajsabhnani
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataCyanny LIANG
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAPaolo Platter
 
Guagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoopGuagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadooppengshanzhang
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_MeetupPaolo Platter
 

Viewers also liked (20)

Lecture 16
Lecture 16Lecture 16
Lecture 16
 
Data warehouse solutions
Data warehouse solutionsData warehouse solutions
Data warehouse solutions
 
Big data in action
Big data in actionBig data in action
Big data in action
 
Recommendation system for ecommerce
Recommendation system for ecommerceRecommendation system for ecommerce
Recommendation system for ecommerce
 
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB�
MILLIONS EVENT DELIVERY WITH CLOUD PUB / SUB
 
Hadoop technology
Hadoop technologyHadoop technology
Hadoop technology
 
Database, data storage, hosting with Firebase
Database, data storage, hosting with FirebaseDatabase, data storage, hosting with Firebase
Database, data storage, hosting with Firebase
 
Building Reactive Applications With Akka And Java
Building Reactive Applications With Akka And JavaBuilding Reactive Applications With Akka And Java
Building Reactive Applications With Akka And Java
 
Understanding Kubernetes
Understanding KubernetesUnderstanding Kubernetes
Understanding Kubernetes
 
Big Data & Hadoop Tutorial
Big Data & Hadoop TutorialBig Data & Hadoop Tutorial
Big Data & Hadoop Tutorial
 
Hadoop introduction , Why and What is Hadoop ?
Hadoop introduction , Why and What is  Hadoop ?Hadoop introduction , Why and What is  Hadoop ?
Hadoop introduction , Why and What is Hadoop ?
 
Big data and Hadoop
Big data and HadoopBig data and Hadoop
Big data and Hadoop
 
Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能Etu DW Offload 解放資料倉儲的運算效能
Etu DW Offload 解放資料倉儲的運算效能
 
Hadoop Developer
Hadoop DeveloperHadoop Developer
Hadoop Developer
 
eCommerce Justification Presentation
eCommerce Justification Presentation eCommerce Justification Presentation
eCommerce Justification Presentation
 
Big data initiative justification and prioritization framework
Big data initiative justification and prioritization frameworkBig data initiative justification and prioritization framework
Big data initiative justification and prioritization framework
 
Hadoop distributed computing framework for big data
Hadoop distributed computing framework for big dataHadoop distributed computing framework for big data
Hadoop distributed computing framework for big data
 
Agile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKAAgile Lab_BigData_Meetup_AKKA
Agile Lab_BigData_Meetup_AKKA
 
Guagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoopGuagua an iterative computing framework on hadoop
Guagua an iterative computing framework on hadoop
 
Agile Lab_BigData_Meetup
Agile Lab_BigData_MeetupAgile Lab_BigData_Meetup
Agile Lab_BigData_Meetup
 

Similar to Big data & hadoop framework

A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataRobert Grossman
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)Prashant Gupta
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom IndustryCloudera, Inc.
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...i_scienceEU
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big DataFrank Kienle
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentHostedbyConfluent
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptxElsonPaul2
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadatamarkgrover
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Jim Dowling
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataRobert Grossman
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsKamalika Dutta
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...MLconf
 
Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)jwnoteboom
 
Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008bosc_2008
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 

Similar to Big data & hadoop framework (20)

A Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate DataA Gen3 Perspective of Disparate Data
A Gen3 Perspective of Disparate Data
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
Hw09   Hadoop Based Data Mining Platform For The Telecom IndustryHw09   Hadoop Based Data Mining Platform For The Telecom Industry
Hw09 Hadoop Based Data Mining Platform For The Telecom Industry
 
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
Kave Salamatian, Universite de Savoie and Eiko Yoneki, University of Cambridg...
 
Hadoop
HadoopHadoop
Hadoop
 
Introduction Big Data
Introduction Big DataIntroduction Big Data
Introduction Big Data
 
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, ConfluentApache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
Apache Kafka and the Data Mesh | Ben Stopford and Michael Noll, Confluent
 
Bigdata
BigdataBigdata
Bigdata
 
Big Data Session 1.pptx
Big Data Session 1.pptxBig Data Session 1.pptx
Big Data Session 1.pptx
 
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
 
Matlab, Big Data, and HDF Server
Matlab, Big Data, and HDF ServerMatlab, Big Data, and HDF Server
Matlab, Big Data, and HDF Server
 
Data Discovery and Metadata
Data Discovery and MetadataData Discovery and Metadata
Data Discovery and Metadata
 
Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019 Hopsworks in the cloud Berlin Buzzwords 2019
Hopsworks in the cloud Berlin Buzzwords 2019
 
My Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big DataMy Other Computer is a Data Center: The Sector Perspective on Big Data
My Other Computer is a Data Center: The Sector Perspective on Big Data
 
Big Data Analytics for Real Time Systems
Big Data Analytics for Real Time SystemsBig Data Analytics for Real Time Systems
Big Data Analytics for Real Time Systems
 
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
Anusua Trivedi, Data Scientist at Texas Advanced Computing Center (TACC), UT ...
 
Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)Sensor metadata management with SWM (SMWCon fall 2013)
Sensor metadata management with SWM (SMWCon fall 2013)
 
BigData
BigDataBigData
BigData
 
Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008Smith T Bio Hdf Bosc2008
Smith T Bio Hdf Bosc2008
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 

More from Tu Pham

Go from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptxGo from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptxTu Pham
 
Secure your app against DDOS, API Abuse, Hijacking, and Fraud
 Secure your app against DDOS, API Abuse, Hijacking, and Fraud Secure your app against DDOS, API Abuse, Hijacking, and Fraud
Secure your app against DDOS, API Abuse, Hijacking, and FraudTu Pham
 
Challenges In Implementing SRE
Challenges In Implementing SREChallenges In Implementing SRE
Challenges In Implementing SRETu Pham
 
IT Strategy
IT Strategy IT Strategy
IT Strategy Tu Pham
 
Set up Learn and Development program
Set up Learn and Development programSet up Learn and Development program
Set up Learn and Development programTu Pham
 
Cost Management For IT Project / Product
Cost Management For IT Project / ProductCost Management For IT Project / Product
Cost Management For IT Project / ProductTu Pham
 
Minimum Viable Product 101
Minimum Viable Product 101Minimum Viable Product 101
Minimum Viable Product 101Tu Pham
 
Understand your customers
Understand your customersUnderstand your customers
Understand your customersTu Pham
 
Let's build great products for mid-size companies
Let's build great products for mid-size companiesLet's build great products for mid-size companies
Let's build great products for mid-size companiesTu Pham
 
Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns Tu Pham
 
End To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google CloudEnd To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google CloudTu Pham
 
High Output Tech Management
High Output Tech Management High Output Tech Management
High Output Tech Management Tu Pham
 
Big Data Driven At Eway
Big Data Driven At Eway Big Data Driven At Eway
Big Data Driven At Eway Tu Pham
 
Security On The Cloud
Security On The CloudSecurity On The Cloud
Security On The CloudTu Pham
 
Eway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding GuidelinesEway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding GuidelinesTu Pham
 
End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud Tu Pham
 
Eway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge SharingEway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge SharingTu Pham
 
Php 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparisonPhp 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparisonTu Pham
 
System Security on Cloud
System Security on CloudSystem Security on Cloud
System Security on CloudTu Pham
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNOTu Pham
 

More from Tu Pham (20)

Go from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptxGo from idea to app with no coding using AppSheet.pptx
Go from idea to app with no coding using AppSheet.pptx
 
Secure your app against DDOS, API Abuse, Hijacking, and Fraud
 Secure your app against DDOS, API Abuse, Hijacking, and Fraud Secure your app against DDOS, API Abuse, Hijacking, and Fraud
Secure your app against DDOS, API Abuse, Hijacking, and Fraud
 
Challenges In Implementing SRE
Challenges In Implementing SREChallenges In Implementing SRE
Challenges In Implementing SRE
 
IT Strategy
IT Strategy IT Strategy
IT Strategy
 
Set up Learn and Development program
Set up Learn and Development programSet up Learn and Development program
Set up Learn and Development program
 
Cost Management For IT Project / Product
Cost Management For IT Project / ProductCost Management For IT Project / Product
Cost Management For IT Project / Product
 
Minimum Viable Product 101
Minimum Viable Product 101Minimum Viable Product 101
Minimum Viable Product 101
 
Understand your customers
Understand your customersUnderstand your customers
Understand your customers
 
Let's build great products for mid-size companies
Let's build great products for mid-size companiesLet's build great products for mid-size companies
Let's build great products for mid-size companies
 
Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns Latency Control And Supervision In Resilience Design Patterns
Latency Control And Supervision In Resilience Design Patterns
 
End To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google CloudEnd To End Business Intelligence On Google Cloud
End To End Business Intelligence On Google Cloud
 
High Output Tech Management
High Output Tech Management High Output Tech Management
High Output Tech Management
 
Big Data Driven At Eway
Big Data Driven At Eway Big Data Driven At Eway
Big Data Driven At Eway
 
Security On The Cloud
Security On The CloudSecurity On The Cloud
Security On The Cloud
 
Eway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding GuidelinesEway Tech Talk #2 Coding Guidelines
Eway Tech Talk #2 Coding Guidelines
 
End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud End To End Machine Learning With Google Cloud
End To End Machine Learning With Google Cloud
 
Eway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge SharingEway Tech Talk #0 Knowledge Sharing
Eway Tech Talk #0 Knowledge Sharing
 
Php 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparisonPhp 5.6 vs Php 7 performance comparison
Php 5.6 vs Php 7 performance comparison
 
System Security on Cloud
System Security on CloudSystem Security on Cloud
System Security on Cloud
 
Big Data at DYNO
Big Data at DYNOBig Data at DYNO
Big Data at DYNO
 


Big data & hadoop framework

  • 2. Who am I? Name: Pham Phuong Tu; works as an R&D developer at VC Corp; related projects: muachung.vn, enbac.com, rongbay.com, …; interested in system & software architecture, big data, data statistics & analytics; email: phamptu@gmail.com
  • 9. Table of contents: Challenges; Big Data (overview, data types, what – who – why, extract/transform/load data, data operations, big data platform); Hadoop Framework (overview, history, users, architecture, MapReduce, Hadoop environment); Q&A; Demo
  • 16. Analyze All Available Data: data warehouse, social media, website, billing, ERP, CRM, devices, network switches
  • 17. Types of Data: plain text data (web); semi-structured data (XML, JSON); relational data (tables/transactions/legacy data); graph data (social networks, Semantic Web (RDF), …); multimedia data (image, video, …); streaming data (you can only scan the data once)
  • 18. What is collecting all this data? Web browsers; web sites (search engines, social networks, e-commerce platforms, …); applications; computers, smartphones, tablets, game consoles; other systems (banking, phone, medical, GPS); Internet service providers
  • 19. Who is collecting your data? Government agencies; companies; service providers; big stores
  • 20. Why are they collecting your data? Search: Yahoo, Amazon, Zvents; recommendation systems: Facebook, Amazon; log analytics: Facebook, Yahoo, Google; data warehouse: Facebook, AOL; video and image analysis: New York Times, Eyealike; target marketing: Google Ads, Facebook Ads; business strategy: Walmart
  • 21. ETL. Extract: to convert the data into a single format appropriate for transformation processing. Transform: applies a series of rules or functions to the data extracted from the source. Load: loads the data into the end target, usually the data warehouse.
  • 22. Real-life ETL cycle. The typical real-life ETL cycle consists of the following execution steps: (1) cycle initiation; (2) build reference data; (3) extract (from sources); (4) validate; (5) transform (clean, apply business rules, check for data integrity, create aggregates or disaggregates); (6) stage (load into staging tables, if used); (7) audit reports (for example, on compliance with business rules; in case of failure, they also help to diagnose and repair); (8) publish (to target tables); (9) archive; (10) clean up.
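The extract, validate, transform, and publish steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline from the talk: the record fields (`user`, `amount`), the validation rule, and the in-memory list standing in for a warehouse table are all hypothetical.

```python
import csv
import io

# Hypothetical source data; a real cycle would read from files or databases.
RAW_CSV = "user,amount\nalice,10\nbob,not-a-number\nalice,5\n"

def extract(text):
    """Extract: parse the source into a single record format (dicts)."""
    return list(csv.DictReader(io.StringIO(text)))

def validate(rows):
    """Validate: keep only rows whose amount is a whole number."""
    return [r for r in rows if r["amount"].isdigit()]

def transform(rows):
    """Transform: apply a business rule, here an aggregate per user."""
    totals = {}
    for r in rows:
        totals[r["user"]] = totals.get(r["user"], 0) + int(r["amount"])
    return totals

def load(totals, warehouse):
    """Load/publish: append the result to the target (a list stands in for a table)."""
    warehouse.extend(sorted(totals.items()))
    return warehouse

warehouse = load(transform(validate(extract(RAW_CSV))), [])
print(warehouse)
```

The invalid `bob` row is dropped at the validate step, so only the aggregate for `alice` reaches the target.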
  • 25. Big Data Platform. Understand and navigate federated big data sources: federated discovery and navigation. Manage & store huge volumes of any data: Hadoop File System, MapReduce. Structure and control data: data warehousing. Manage streaming data: stream computing. Analyze unstructured data: text analytics engine. Integrate and govern all data sources: integration, data quality, security, lifecycle management, MDM.
  • 27. Single-node architecture (diagram): CPU, memory, disk; labels: machine learning, statistics; “classical” data mining
  • 28. Cluster architecture (diagram): each rack contains 16-64 nodes (CPU, memory, disk) behind a switch; 1 Gbps between any pair of nodes in a rack; 2-10 Gbps backbone between racks
  • 29. Hadoop History: Dec 2004 – Google MapReduce paper published (the GFS paper appeared in 2003); July 2005 – Nutch uses MapReduce; Feb 2006 – becomes a Lucene subproject; Apr 2007 – Yahoo! runs it on a 1000-node cluster; Jan 2008 – an Apache Top Level Project; Jul 2008 – a 4000-node test cluster
  • 30. Who uses Hadoop? Google, Yahoo, Bing; Amazon, eBay, Alibaba; Facebook, Twitter; IBM, HP, Toshiba, Intel; New York Times, BBC; Line, WeChat; VC Corp, FPT, VNG, VTC
  • 31. Hadoop Components. Distributed file system (HDFS): single namespace for the entire cluster; replicates data 3x for fault tolerance. MapReduce framework: executes user jobs specified as “map” and “reduce” functions; manages work distribution & fault tolerance.
  • 33. Goals of HDFS. Very large distributed file system: 10K nodes, 100 million files, 10 PB. Assumes commodity hardware: files are replicated to handle hardware failure; detects failures and recovers from them. Optimized for batch processing: data locations are exposed so that computations can move to where the data resides; provides very high aggregate bandwidth. Runs in user space on heterogeneous operating systems.
  • 36. NameNode. Metadata in memory: the entire metadata is in main memory; no demand paging of metadata. Types of metadata: list of files; list of blocks for each file; list of DataNodes for each block; file attributes, e.g. creation time, replication factor. A transaction log: records file creations, file deletions, etc.
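The kinds of metadata listed for the NameNode can be pictured as a few in-memory maps plus an append-only log. The sketch below is a toy model for illustration only; `ToyNameNode`, `FileEntry`, and their methods are hypothetical names, not Hadoop's actual classes.

```python
from dataclasses import dataclass, field
import time

@dataclass
class FileEntry:
    blocks: list = field(default_factory=list)         # ordered block IDs of the file
    replication: int = 3                                # file attribute: replication factor
    created: float = field(default_factory=time.time)  # file attribute: creation time

class ToyNameNode:
    """Toy model: all metadata lives in plain in-memory dictionaries."""

    def __init__(self):
        self.files = {}            # path -> FileEntry (list of files, blocks per file)
        self.block_locations = {}  # block ID -> set of DataNode names
        self.log = []              # transaction log of creations and deletions

    def create(self, path, replication=3):
        self.files[path] = FileEntry(replication=replication)
        self.log.append(("create", path))

    def add_block(self, path, block_id, datanodes):
        self.files[path].blocks.append(block_id)
        self.block_locations[block_id] = set(datanodes)

    def delete(self, path):
        for block_id in self.files.pop(path).blocks:
            self.block_locations.pop(block_id, None)
        self.log.append(("delete", path))

    def locate(self, path):
        """Return (block ID, DataNodes) pairs in file order."""
        return [(b, sorted(self.block_locations[b])) for b in self.files[path].blocks]

nn = ToyNameNode()
nn.create("/logs/a.txt")
nn.add_block("/logs/a.txt", "blk_1", ["dn1", "dn2", "dn3"])
print(nn.locate("/logs/a.txt"))
```

Keeping everything in dictionaries mirrors the slide's point that lookups never touch disk; only the log would need to be persisted.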
  • 37. DataNode. A block server: stores data in the local file system (e.g. ext3); stores metadata of a block; serves data and metadata to clients. Block report: periodically sends a report of all existing blocks to the NameNode. Facilitates pipelining of data: forwards data to other specified DataNodes.
  • 38. Block Placement. Current strategy: one replica on the local node; second replica on a remote rack; third replica on the same remote rack; additional replicas are randomly placed. Clients read from the nearest replica. Would like to make this policy pluggable.
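The placement strategy on this slide can be sketched as a small function. This is a simplified illustration assuming a hypothetical `topology` dictionary (rack name to node names) and a hypothetical `place_replicas` helper; it is not Hadoop's placement code and ignores node load and free space.

```python
import random

def rack_of(node, topology):
    """Return the rack that contains the given node."""
    return next(rack for rack, nodes in topology.items() if node in nodes)

def place_replicas(writer, topology, replication=3, rng=random):
    """Pick DataNodes for one block following the slide's strategy."""
    chosen = [writer]  # first replica on the local node
    remote_racks = [r for r in topology if r != rack_of(writer, topology)]
    if replication >= 2 and remote_racks:
        # second and third replicas on two nodes of one remote rack
        candidates = list(topology[rng.choice(remote_racks)])
        rng.shuffle(candidates)
        chosen.extend(candidates[: min(2, replication - 1)])
    # any additional replicas are placed randomly on unused nodes
    unused = [n for nodes in topology.values() for n in nodes if n not in chosen]
    rng.shuffle(unused)
    chosen.extend(unused[: replication - len(chosen)])
    return chosen

topology = {"rack1": ["a1", "a2"], "rack2": ["b1", "b2", "b3"], "rack3": ["c1", "c2"]}
replicas = place_replicas("a1", topology, rng=random.Random(0))
print(replicas)
```

Putting two replicas on the same remote rack keeps cross-rack traffic to a single transfer while still surviving the loss of the writer's whole rack.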
  • 39. Data Correctness. Use checksums to validate data: use CRC32. File creation: the client computes a checksum per 512 bytes; the DataNode stores the checksum. File access: the client retrieves the data and checksum from the DataNode; if validation fails, the client tries other replicas.
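A minimal sketch of the checksum scheme described on this slide, using Python's standard `zlib.crc32`. The helper names `checksums` and `read_verified`, and the idea of passing replicas as plain byte strings, are assumptions made for the example rather than the HDFS client API.

```python
import zlib

CHUNK = 512  # bytes covered by each checksum, as stated on the slide

def checksums(data: bytes):
    """Compute one CRC32 per 512-byte chunk (done by the client on file creation)."""
    return [zlib.crc32(data[i:i + CHUNK]) for i in range(0, len(data), CHUNK)]

def read_verified(replicas, expected):
    """Return the first replica whose checksums match; otherwise try the next one."""
    for copy in replicas:
        if checksums(copy) == expected:
            return copy
    raise IOError("all replicas failed checksum validation")

block = bytes(range(256)) * 5      # 1280 bytes, so three chunks
stored = checksums(block)          # what the DataNode would keep next to the block
corrupted = b"X" + block[1:]       # first replica has one damaged byte
print(read_verified([corrupted, block], stored) == block)
```

Checksumming small chunks rather than whole blocks means a reader can verify only the range it actually reads.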
  • 40. NameNode Failure. A single point of failure. Transaction log stored in multiple directories: a directory on the local file system; a directory on a remote file system (NFS/CIFS). Need to develop a real HA solution.
  • 41. Data Pipelining. The client retrieves a list of DataNodes on which to place replicas of a block. The client writes the block to the first DataNode. The first DataNode forwards the data to the next DataNode in the pipeline. When all replicas are written, the client moves on to write the next block in the file.
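The write pipeline described above can be modelled with a few lines of Python. `ToyDataNode`, `write_file`, and the `pick_pipeline` callback (which stands in for asking the NameNode where to place replicas) are hypothetical names for this sketch; real DataNodes stream packets over the network rather than calling each other directly.

```python
class ToyDataNode:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block ID -> bytes stored on this node

    def write(self, block_id, data, downstream):
        """Store the block locally, then forward it to the next node in the pipeline."""
        self.blocks[block_id] = data
        if downstream:
            downstream[0].write(block_id, data, downstream[1:])

def write_file(data, block_size, pick_pipeline):
    """Client side: split the file into blocks and write each block through its pipeline."""
    for index, start in enumerate(range(0, len(data), block_size)):
        pipeline = pick_pipeline(index)  # list of DataNodes for this block
        # The call returns only after every replica is written,
        # so the client then moves on to the next block.
        pipeline[0].write(f"blk_{index}", data[start:start + block_size], pipeline[1:])

nodes = [ToyDataNode(f"dn{i}") for i in range(3)]
write_file(b"abcdefghij", 4, lambda index: nodes)
print(nodes[2].blocks)
```

The client sends each block only once; replication traffic is carried by the DataNodes themselves.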
  • 42. Rebalancer. Goal: the percentage of disk used on the DataNodes should be similar. Usually run when new DataNodes are added; the cluster stays online while the Rebalancer is active; the Rebalancer is throttled to avoid network congestion; command line tool.
  • 43. What is MapReduce? A simple data-parallel programming model designed for scalability and fault tolerance. Pioneered by Google: processes 20 petabytes of data per day. Popularized by the open-source Hadoop project: used at Yahoo!, Facebook, Amazon, …
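To make the programming model concrete, here is a single-process sketch of a word-count job expressed as a map function and a reduce function. The `run_job` driver is a hypothetical stand-in for the framework: it runs everything in one Python process and provides none of the distribution or fault tolerance that Hadoop adds.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in the input record."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(key, values):
    """Reduce: summarize all values seen for one key."""
    return key, sum(values)

def run_job(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)   # stands in for the shuffle step
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))

counts = run_job(["big data", "Big Hadoop"], map_fn, reduce_fn)
print(counts)
```

Because `map_fn` looks at one record at a time and `reduce_fn` at one key at a time, a framework is free to run many copies of each in parallel on different machines.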

Editor's Notes

  1. Talk about Google Analytics.
  2. The increasingly rapid growth of data (in volume and variety), shown through detailed figures. The value big data brings to industries (especially retail, consulting, transportation, construction, …). Difficulties (57.6% of organizations report difficulties). The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s;[14] as of 2012, every day 2.5 exabytes (2.5×10^18 bytes) of data were created.[15] The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.[16]
  3. Common difficulties in big data (data complexity, data volume, performance, staff skills, data growth, cost).
  4. The three big problems with big data.
  5. A summary of what big data is. Big data[1][2] is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage,[3] search, sharing, transfer, analysis,[4] and visualization. The trend to larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to "spot business trends, determine quality of research, prevent diseases, link legal citations, combat crime, and determine real-time roadway traffic conditions."
  6. Big data = transactions + interactions + tracking. ERP (Enterprise Resource Planning), CRM (Customer Relationship Management). Go deeper into logs, user click streams, affiliate networks, and behavioral targeting.
  7. Website, billing, ERP, devices, network, social media.
  8. Detailed examples of each type of data.
  9. The sources from which data is collected.
  10. The organizations that collect data.
  11. The purposes of collecting data.
  12. Extract: each separate system may also use a different data organization and/or format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures. Transform (pre-processing): selecting only certain columns to load (or selecting null columns not to load); translating coded values; encoding free-form values; deriving a new calculated value; sorting; joining data from multiple sources; aggregation; generating surrogate-key values; transposing or pivoting; splitting a column into multiple columns; disaggregation of repeating columns into a separate detail table; looking up and validating the relevant data from tables or referential files for slowly changing dimensions; applying any form of simple or complex data validation. Load: depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; frequently, updating extracted data is done on a daily, weekly, or monthly basis. Other data warehouses (or even other parts of the same data warehouse) may add new data in a historical form at regular intervals, for example hourly. To understand this, consider a data warehouse that is required to maintain sales records of the last year. This data warehouse overwrites any data older than a year with newer data. However, the entry of data for any one-year window is made in a historical manner.
  13. Reference data is data from outside the organization (often from standards organizations) which is, apart from occasional revisions, static. This non-dynamic data is sometimes also known as "standing data".[1] Examples would be currency codes, countries (in this case covered by the global standard ISO 3166-1), etc. Reference data should be distinguished[2] from "master data", which is also relatively static but originates from within the organization, e.g. products, departments, even customers. A staging table is just a regular SQL Server table. For example, if you have a process that imports some data from, say, .CSV files, then you put this data in a staging table. You may then decide to apply some data cleaning or business rules to the data and move it to a different staging table, and so on.
  14. Worst case: no logs. Common case: only a small part of the current data can be analyzed; no data warehouse.
  15. Long-term storage in the data warehouse, in support of data analysis.
  16. Issues with scaling up servers, fault tolerance, and performance.
  17. Issues with managing failures (devices, systems), RAID, network devices, and performance.
  18. What Hadoop is used for. The directions (core, distributions).
  19. MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster.[1] A MapReduce program is composed of a Map() procedure that performs filtering and sorting (such as sorting students by first name into queues, one queue for each name) and a Reduce() procedure that performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The "MapReduce System" (also called "infrastructure" or "framework") orchestrates by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance. The model is inspired by the map and reduce functions commonly used in functional programming,[2] although their purpose in the MapReduce framework is not the same as in their original forms.[3] Furthermore, the key contributions of the MapReduce framework are not the actual map and reduce functions, but the scalability and fault tolerance achieved for a variety of applications by optimizing the execution engine once. MapReduce libraries have been written in many programming languages, with different levels of optimization. A popular open-source implementation is Apache Hadoop. The name MapReduce originally referred to the proprietary Google technology but has since been genericized.
  20. View from the task perspective.
  21. View from the scheduled machine perspective.
  22. Introduce the sub-projects.
  23. Conclusion: the future of big data. Q&A.