Chandan Rajah
Hadoop and Friends
Overview
• What is Hadoop?
- Reliable, Scalable, Distributed, Parallel, Fault Tolerant
- Built for unreliable, heterogeneous commodity H/W
• Where did it come from?
- Community + Yahoo!
• Why Hadoop? – Business Benefits
- Scales linearly (>500k nodes), Ultra Flexible & Versatile
- Carrier Grade (99.999% availability), High Performance, Runs on any H/W
- Low TCO per Node (a Node is assumed to have > 8 CPUs with 8 cores each)
- Open Source, no S/W License Fee
- ~ $4k/Node (vs. DWH $1m – $100m/node, Oracle > $20k/Core) [1]
- ~ < $250/TB (vs. RDBMS > $10,000/TB) [1]
- ~ 200 GFLOPS/Node at 65% (= $20/GFLOP) in processing
- ~ > 20 GFLOPS/Watt in power
- Running cost ~ $32/hour (vs. Oracle ~ $100/hour) [2]
• Is it Mature?
- Yahoo manages >200 petabytes across >50,000 nodes [1]
[1] Source: Forbes - http://www.forbes.com/sites/ciocentral/2012/04/16/the-big-cost-of-big-data/
[2] Source: Computerworld - http://news.idg.no/cw/art.cfm?id=AEF8309A-FDB9-3AFA-6F188DFCA9B24083
To name just a few…
Who is using it?
So, what does it give you?
Few important things...
• HDFS - Distributed Resilient Filesystem
- Highly Scalable (> 50 PB in production)
- Replicated data storage, self-healing, fault tolerant
- Master/Slave (Name Node/Data Nodes)
• MapReduce - Distributed Parallel Processing
- Schema-less data stream processing
- Paradigm shift: moves processing to where the data exists
- Master/Slave (Job Tracker/Task Trackers)
• Hive - SQL Database with ODBC/JDBC Access
- SQL database for ETL and Data Warehousing
- Oracle Business Objects connectors
• HBase - NoSQL Database with REST/Thrift Access
- Sub-second reads/writes
HDFS - Overview
• Very Large Distributed File System
- >10K nodes, >100 million files, >50 PB
• Assumes Commodity Hardware
- Files are replicated to handle hardware failure
- Detects failures and recovers from them
- 128 MB blocks, each replicated at least three times
- CRC32 checksums on each node detect corrupt blocks, which are then restored from a replica
• Optimized for Batch Processing
- Computation moves to where the data resides
- Provides very high aggregate bandwidth
• User Space, runs on heterogeneous OS
• Transaction Log stored on Name Node
- Replicated locally and on NFS/CIFS
• Zookeeper Quorum of Name Nodes
- No single point of failure (SPOF)
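The block and checksum figures above lend themselves to a small worked example. The following is a minimal Python sketch, not HDFS code: the function names are hypothetical, the 128 MB block size and replication factor of 3 are taken from the slide, and `zlib.crc32` stands in for the per-block checksum.

```python
import math
import zlib

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size quoted on the slide
REPLICATION = 3                 # "at least thrice" (assumed to be exactly 3 here)


def block_count(file_size: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to hold a file of file_size bytes."""
    return math.ceil(file_size / block_size)


def raw_storage(file_size: int, replication: int = REPLICATION) -> int:
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


def checksum(data: bytes) -> int:
    """CRC32 of a block, computed when the block is written."""
    return zlib.crc32(data)


def is_corrupt(data: bytes, expected: int) -> bool:
    """True when a block read back from disk no longer matches its checksum."""
    return zlib.crc32(data) != expected
```

Under these assumptions a 1 GiB file occupies 8 blocks and 3 GiB of raw disk across the cluster. A block that fails the check would be discarded and re-read from one of its other replicas.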
HDFS - Architecture
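HDFS - Write
[Diagram: a Client, the Name Node and the Data Nodes. The Client (1) creates metadata on the Name Node and (2) puts blocks on the Data Nodes, which hold replicated copies of blocks 1, 2 and 3. The Name Node handles control / monitoring of the Data Nodes.]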
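HDFS - Read
[Diagram: a Client, the Name Node and the Data Nodes. The Client (1) gets metadata from the Name Node and (2) fetches blocks from the Data Nodes, which hold replicated copies of blocks 1 to 4. The Name Node handles control / monitoring of the Data Nodes.]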
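The two-step write and read paths shown on the HDFS slides can be illustrated with a toy in-memory model. This is a sketch under loud assumptions: `ToyCluster` and its methods are hypothetical names, the round-robin placement is made up for illustration, and none of this is the real HDFS client API.

```python
class ToyCluster:
    """In-memory stand-in for a Name Node plus Data Nodes (hypothetical, not HDFS)."""

    def __init__(self, node_names, block_size=4, replication=3):
        self.block_size = block_size
        self.replication = replication
        self.data_nodes = {name: {} for name in node_names}  # node -> {block_id: bytes}
        self.metadata = {}                                    # path -> [(block_id, [nodes])]

    def write(self, path, data):
        names = list(self.data_nodes)
        copies = min(self.replication, len(names))
        placements = []
        for offset in range(0, len(data), self.block_size):
            index = offset // self.block_size
            block_id = f"{path}#{index}"
            # Step 1 (Create Metadata): pick the nodes that will hold this block.
            targets = [names[(index + k) % len(names)] for k in range(copies)]
            # Step 2 (Put Blocks): store a replica on every chosen Data Node.
            for node in targets:
                self.data_nodes[node][block_id] = data[offset:offset + self.block_size]
            placements.append((block_id, targets))
        self.metadata[path] = placements

    def read(self, path, failed=()):
        out = b""
        # Step 1 (Get Metadata): look up block locations for the path.
        for block_id, nodes in self.metadata[path]:
            live = [n for n in nodes if n not in failed]
            if not live:
                raise IOError(f"all replicas of {block_id} are lost")
            # Step 2 (Fetch Blocks): read from the first replica that is still up.
            out += self.data_nodes[live[0]][block_id]
        return out
```

With four nodes and three replicas per block, a read still succeeds when any two nodes are marked as failed, which is the fault tolerance the replication bullets describe.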
MapReduce - Overview
• Pioneered by Google
• Moves processing to data
• Works like a Unix pipeline (pipelines are called Jobs)
- cat input | grep | sort | uniq -c | cat > output
- input | Map | Shuffle/Sort | Reduce | Output
• Job Pipeline made of Mappers and Reducers
- Mappers: map input to key-value pairs
- Reducers: receive all values for a key and output an aggregate
• Jobs submitted to JobTracker (JT) for execution
- Mappers/Reducers sent to the DataNode that holds the data block
- Mapper/Reducer state managed and maintained by JT
- Innate fault tolerance: JT restarts tasks that fail
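The input | Map | Shuffle/Sort | Reduce | Output pipeline can be mimicked in a few lines of single-process Python. This is only a sketch of the programming model, using word count as the job; the function names are illustrative and nothing here uses the Hadoop API or runs in parallel.

```python
from collections import defaultdict


def mapper(line):
    """Map: turn one input line into (word, 1) key-value pairs."""
    for word in line.split():
        yield word.lower(), 1


def shuffle(pairs):
    """Shuffle/Sort: group all values by key and order the keys."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce: receive all values for one key and output an aggregate."""
    return key, sum(values)


def run_job(lines):
    """Run the whole pipeline over an iterable of input lines."""
    mapped = (pair for line in lines for pair in mapper(line))
    return [reducer(key, values) for key, values in shuffle(mapped)]


if __name__ == "__main__":
    print(run_job(["a b a", "b c"]))  # [('a', 2), ('b', 2), ('c', 1)]
```

In a real cluster each mapper would run next to the block it reads, and the shuffle would move data across the network between mappers and reducers.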
MapReduce – Job Submit
[Diagram: the Client (1) sets up a job with the Job Tracker. The Job Tracker handles control / monitoring of five Task Trackers, each shown running four map (M) and two reduce (R) tasks.]
MapReduce – Pipeline Schematic
MapReduce – Example
Hive - Overview
• Data warehouse infrastructure built on Hadoop
- Provides data summarization, query, analysis, etc.
- ETL, structure, and access to different storage types (HDFS, HBase)
- Query execution via MapReduce
• Key Building Principles:
- SQL queries
- Web based UI for ad-hoc queries (Hue)
- Extensibility – Types, Functions, Formats, Scripts
- Performance, Scalability, Reliability
• Data Units are Databases, Tables, Partitions & Clusters
• ODBC and JDBC connectors readily available
• Used extensively by Facebook, Reuters, UBS
Hive - Components
[Diagram: component labels are Hive CLI (DDL, Queries, Browsing), Mgmt. Web UI, Hue Web UI, ODBC, Thrift API, MetaStore, Hive QL (Parser, Planner, Execution), SerDe (Thrift, Jute, JSON..), Map Reduce, HDFS and Data Warehouse.]
Hive – Data Warehousing at Facebook
> 800M users, > 600TB of data, > 2TB a day, > 4000 queries a day
HBase - Overview
• Very large scale analytic processing
- Big queries – typically range or table scans
- Big databases (100s of TB)
- Sub-second query response time
• RDBMS performance is good for transaction processing but very inefficient for very large scale analytic processing
• Column oriented database
- No SQL queries
- Tables have one primary key and index
- No join operations
- Data is unstructured and not typed
- Data is versioned with timestamp
- Ultra fast lookup with row key and optional timestamp
- Full table scans & range scans have no performance impact
• Built on HDFS
- Horizontal scalability, highly available, high performance
HBase – Data Model
Row key          | Time Stamp | Column “contents:” | Column “anchor:”
“com.apache.www” | t12        | “<html>…”          |
                 | t11        | “<html>…”          |
                 | t10        |                    | “anchor:apache.com” “APACHE”
“com.cnn.www”    | t15        |                    | “anchor:cnnsi.com” “CNN”
                 | t13        |                    | “anchor:my.look.ca” “CNN.com”
                 | t6         | “<html>…”          |
                 | t5         | “<html>…”          |
                 | t3         | “<html>…”          |
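The data model in this table (row key, then column, then timestamped versions) can be pictured as nested dictionaries. The sketch below is an illustration only: `ToyTable` is a hypothetical class rather than the HBase client API, integer timestamps stand in for t3 to t15, and the page bodies in the usage example are made-up placeholders.

```python
class ToyTable:
    """Nested-dict picture of a versioned table: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}

    def put(self, row, column, ts, value):
        """Store one version of a cell."""
        self.rows.setdefault(row, {}).setdefault(column, {})[ts] = value

    def get(self, row, column, ts=None):
        """Lookup by row key; with ts, return the newest version at or before ts."""
        versions = self.rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if ts is None or t <= ts]
        if not candidates:
            return None
        return versions[max(candidates)]

    def scan(self, start, stop):
        """Range scan: row keys in sorted order with start <= key < stop."""
        return [key for key in sorted(self.rows) if start <= key < stop]


if __name__ == "__main__":
    table = ToyTable()
    table.put("com.cnn.www", "contents:", 3, "<html>v3")
    table.put("com.cnn.www", "contents:", 5, "<html>v5")
    table.put("com.cnn.www", "contents:", 6, "<html>v6")
    table.put("com.cnn.www", "anchor:cnnsi.com", 15, "CNN")
    table.put("com.apache.www", "anchor:apache.com", 10, "APACHE")
    print(table.get("com.cnn.www", "contents:"))        # <html>v6
    print(table.get("com.cnn.www", "contents:", ts=4))  # <html>v3
    print(table.scan("com.a", "com.b"))                 # ['com.apache.www']
```

Reversed domain names such as “com.cnn.www” keep pages from the same site adjacent in row-key order, which is what makes range scans over one site cheap.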
HBase – Architecture
Big Data – More tools
• Pig – English-like “Pig Latin” to build queries
- Rapid prototyping
• Sqoop – Structured data ingest
- Highly available structured data ingest
- Supports JDBC connections to Oracle, MySQL, etc.
• Flume – Real time event ingest
- Ingests real time events, scaling to >100k tps
• Scalding – Pipeline construction in Scala
- Created by Twitter, built on Cascading
- Production ready code with TDD and CI
• Oozie – Job scheduler and coordinator
- Ideal for building job dependencies
- Schedules jobs to run automatically
• Mahout – Distributed machine learning algorithms
• R & RMR for statistical analysis and insight
Big Data Stack
Questions ?!

Editor's Notes

  1. Hadoop is topology aware: blocks are replicated on multiple racks.
  2. Combiners – map-side reducers. Partitioners – decide which reducer receives each key, so that all values of the same key are grouped together.
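The second note can be made concrete with a short Python sketch. It is an illustration under assumptions rather than Hadoop code: `combine` and `partition` are hypothetical helper names, and CRC32 is used here only as a convenient deterministic hash for choosing a reducer.

```python
import zlib
from collections import Counter


def combine(pairs):
    """Map-side reduce: pre-aggregate (key, count) pairs before they leave the mapper."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def partition(key, num_reducers):
    """Pick a reducer for a key; equal keys always land on the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


if __name__ == "__main__":
    mapper_output = [("a", 1), ("b", 1), ("a", 1)]
    print(combine(mapper_output))  # [('a', 2), ('b', 1)]
    print(partition("a", 4) == partition("a", 4))  # True
```

Combining on the map side cuts the number of pairs sent over the network during the shuffle, while the partitioner guarantees that every value for a given key reaches a single reducer.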