Hadoop - How It Works
Ing. Vladimír Hanušniak
University of Žilina, March 2014
 Brief review
 Parallel processing
 Hadoop
◦ HDFS (Hadoop Distributed File System)
◦ MapReduce
 Example
2
 Brief review
 Parallel processing
 Hadoop
◦ HDFS (Hadoop Distributed File System)
◦ MapReduce
 Example
3
4
With no signs of slowing, Big Data keeps growing.
5
 X-ray – 30MB
 3D CT scan – 1GB
 3D MRI – 150MB
 Mammograms – 120MB
 Growing – 20-40%/year
 Preemies' health
◦ University of Ontario & IBM
◦ 16 different data streams
◦ 1260 data points per second
 Early treatment
 Data structure and storage
 Analytical methods & Processing power
 Parallelization needed
6
 Brief review
 Parallel processing
 Hadoop
◦ HDFS (Hadoop Distributed File System)
◦ MapReduce
 Example
7
 Task decomposition (HPC Uniza)
◦ Computationally expensive tasks
◦ Move the data to processing
◦ Execution order
◦ Shared data storage
8
 Slow HDD read speed!
 HDD reading speed ~100 MB/s
◦ Reading 1000 GB => 10,000 s (166.6 min)
 100 machines reading in parallel => 100 s (~1.7 min)
9
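The slide's arithmetic can be checked directly; a quick sketch using the slide's own figures (100 MB/s per disk, 1000 GB of data, 100 machines):

```python
DATA_MB = 1_000_000            # 1000 GB expressed in MB
RATE_MB_S = 100                # ~100 MB/s per HDD
MACHINES = 100

single_disk_s = DATA_MB / RATE_MB_S       # one machine: 10,000 s (~166.6 min)
parallel_s = single_disk_s / MACHINES     # 100 machines reading in parallel
print(single_disk_s, parallel_s)          # 10000.0 100.0
```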
 Data decomposition (Hadoop)
◦ Data has regular structure (type, size)
◦ Move processing to data
10
 Brief review
 Parallel processing
 Hadoop
◦ HDFS (Hadoop Distributed File System)
◦ MapReduce
 Example
11
 Hadoop – framework for processing Big Data
 Two main components:
◦ HDFS
◦ MapReduce
 Thousands of nodes in a cluster
12
 Distributed fault-tolerant file system designed
to run on commodity hardware
 Main characteristics
◦ Scalability
◦ High availability
◦ Large files
◦ Commodity hardware
◦ Streaming data access – write once, read many times
13
 NameNode
◦ Master
◦ Control storage
◦ Store metadata about files
 Name, path, size, block size, block IDs, ...
 DataNode
◦ Slave
◦ Store data in blocks
14
15
 Files are stored in blocks
◦ Large files are split
 Size: 64, 128, 256 MB …
 Metadata is stored in NameNode memory
◦ Limiting factor
 150 bytes per file/directory or block object
◦ 3 GB of memory ≈ 10 million one-block files
16
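The 3 GB figure is reproducible if each one-block file costs two NameNode objects – one file object plus one block object. That factor of two is an assumption used to reconcile the slide's bullets:

```python
BYTES_PER_OBJECT = 150        # per file, directory, or block object
FILES = 10_000_000            # one-block files
OBJECTS_PER_FILE = 2          # assumed: 1 file object + 1 block object

heap_bytes = FILES * OBJECTS_PER_FILE * BYTES_PER_OBJECT
print(heap_bytes / 10**9)     # 3.0 -> 3 GB of NameNode memory
```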
 Seek time – ~10 ms
 Transfer rate – ~100 MB/s
 Block size – ~100 MB, so that seek time is only ~1% of transfer time
 Number of map and reduce tasks depends on block size
17
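A sketch of the reasoning behind the ~100 MB block size: the block should be large enough that the one-off seek cost is only a small fraction (~1%) of the time spent streaming the block.

```python
SEEK_S = 0.010          # ~10 ms average seek time
RATE_MB_S = 100         # ~100 MB/s transfer rate
BLOCK_MB = 100

transfer_s = BLOCK_MB / RATE_MB_S      # 1.0 s to stream one block
overhead = SEEK_S / transfer_s         # seek as a fraction of transfer time
print(overhead)                        # 0.01 -> seek is ~1% of transfer
```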
18
19
 First – same node as the client
 Second – off-rack
 Third – same rack as the second, different node
 Next… – random nodes (tries to avoid placing too many replicas on the same rack)
20
 Brief review
 Parallel processing
 Hadoop
◦ HDFS (Hadoop Distributed File System)
◦ MapReduce
 Example
21
 Programming model for data processing
◦ Functional programming – directed acyclic graph
 Hadoop supports: Java, Ruby, Python, C++
 Associative array
◦ <key,value> pairs
22
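The <key,value> dataflow can be illustrated with a toy single-process simulation – this is just the programming model (map, shuffle/sort by key, reduce), not the Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Simulate the MapReduce dataflow on one machine."""
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))      # map phase
    intermediate.sort(key=itemgetter(0))             # shuffle & sort by key
    return [reducer(k, [v for _, v in grp])          # reduce phase
            for k, grp in groupby(intermediate, key=itemgetter(0))]

# Word count, the classic example: the mapper emits <word, 1>,
# the reducer sums the counts for each word.
counts = run_mapreduce(
    [(0, "big data"), (9, "big cluster")],
    mapper=lambda offset, line: [(w, 1) for w in line.split()],
    reducer=lambda word, ones: (word, sum(ones)))
print(counts)   # [('big', 2), ('cluster', 1), ('data', 1)]
```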
 Job - unit of work
◦ Input data
◦ Map & Reduce program
◦ Configuration information
 A job is divided into tasks
◦ Map tasks
◦ Reduce tasks
23
 JobTracker
◦ Coordinates all jobs by scheduling tasks to run on TaskTrackers
◦ Keeps job progress records
◦ Reschedules tasks in case of failure
 TaskTrackers
◦ Run tasks
◦ Send progress reports to the JobTracker
24
25
 Hadoop divides the input to a MapReduce job into fixed-size pieces of work – input splits
 Creates one map task per split
◦ Runs the user-defined map function
 Split size tends to be the size of an HDFS block
26
 Data locality optimization
◦ Run the map task on a node where the input data
resides in HDFS.
◦ Data-local (a), rack-local (b), and off-rack (c) map
tasks.
27
 Output – <Key, Value> pairs
 Written to local disk – NOT to HDFS!
◦ Map output is processed by reduce tasks to produce the final output
◦ No replication needed
 <Key, Value> pairs are sorted
 If a node fails before reduce => run its maps again
28
 TaskTrackers read region files remotely (RPC)
 Invoke the reduce function (aggregate)
 Output is stored in HDFS
 Reduce tasks don't have the advantage of data locality
◦ Input to reduce – the output of all mappers
29
30
 Minimize the data transferred between map
and reduce tasks
 Run on the map output
 “Reduce on Map side”
 max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
 mean(0, 20, 10, 25, 15) = 14, but
 mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15 – a combiner is valid for max but not for mean
31
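A small sketch of why the two examples above behave differently: max distributes over partitions, so a combiner is safe, while a naive "mean of means" changes the answer – a mean combiner must instead pre-aggregate (sum, count) pairs.

```python
parts = [[0, 20, 10], [25, 15]]   # map outputs on two nodes

# max is associative and commutative, so a combiner is safe:
combined_max = max(max(p) for p in parts)
print(combined_max)               # 25, same as max over all values

# "mean of means" is NOT the global mean:
naive = sum(sum(p) / len(p) for p in parts) / len(parts)
print(naive)                      # 15.0, but the true mean is 14.0

# correct combiner for mean: combine (sum, count), divide once at the end
total = sum(sum(p) for p in parts)
count = sum(len(p) for p in parts)
print(total / count)              # 14.0
```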
 Java (Ruby, Python, C++)
◦ Good for programmers
 Pig
◦ Scripting language with a focus on dataflows
◦ Uses the Pig Latin language
◦ Allows merging, filtering, applying functions
 Hive
◦ Uses HiveQL – similar to SQL (used by Facebook)
◦ Provides a database query interface
 HBase
32
33
 Brief review
 Parallel processing
 Hadoop
◦ HDFS (Hadoop Distributed File System)
◦ MapReduce
 Example
34
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
 Data Set
 Find the maximum temperature by year
1901 - 317
1902 - 244
1903 - 289
1904 - 256
1905 - 283
...
35
#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done
% ./max_temperature.sh
1901 317
1902 244
1903 289
1904 256
1905 283
...
36
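The same computation, sketched in Python in MapReduce style: a parse (map) step emitting (year, temperature) and a per-year maximum (reduce) step. The temperature and quality column offsets mirror the awk script; the year offset (columns 16-19) is an assumption based on the NCDC format, since the sample records above are elided.

```python
def parse_record(line):
    """Extract (year, temperature) from one fixed-width NCDC record.
    Temperature (columns 88-92) and quality (column 93) offsets mirror
    the awk script; the year offset (columns 16-19) is an assumption."""
    year = line[15:19]
    temp = int(line[87:92])            # signed, in tenths of a degree
    quality = line[92]
    if temp != 9999 and quality in "01459":
        return year, temp
    return None                        # missing or suspect reading

def max_temp_by_year(lines):
    """The reduce step: maximum valid reading per year."""
    maxima = {}
    for line in lines:
        parsed = parse_record(line)
        if parsed:
            year, temp = parsed
            maxima[year] = max(temp, maxima.get(year, temp))
    return maxima
```

In Hadoop this splits naturally into a mapper that emits <year, temperature> pairs and a reducer that takes the maximum per key.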
37
 Run parts of the program in parallel
◦ Process different years in different processes
 Problems
◦ Unequal-size pieces
◦ Combining partial results needs processing time
◦ Single-machine processing limit
◦ Long processing time
38
39
40
41
42
Source: Infoware 1-2/2014
43
 Hadoop: The Definitive Guide, 3rd Edition
◦ http://it-ebooks.info/book/635/
 Big Data: A Revolution That Will Transform How We Live, Work, and Think
 http://hadoop.apache.org/
 http://architects.dzone.com/articles/how-hadoop-mapreduce-works
44