Hadoop 101
Introducing:

The Modern Data Operating System
Hadoop is ...
A scalable, fault-tolerant distributed system for data storage and processing (open source under the Apache license)
Core Hadoop has two main systems:
● Hadoop Distributed FileSystem (HDFS): self-healing, high-bandwidth clustered storage
● MapReduce: distributed, fault-tolerant resource management and scheduling, coupled with a scalable data programming abstraction
Hadoop Origins

GFS >>> HDFS
Map/Reduce >>> MapReduce
BigTable >>> [target not captured in the extracted text]
Hadoop Chronicles

[Figure labels: GFS, Map/Reduce, BigTable, Doug Cutting]
Etymology
● Hadoop was created in 2004 by Douglass (Doug) Cutting
● It implemented the ideas in Google's File System and MapReduce papers
● He intended it to index the web, Google-style, for the startup search engine 'Nutch'
● He named it after his son's favourite toy, an elephant-shaped toy called Hadoop
What is Big Data?
"In information technology, big data is a loosely defined term used to describe data sets so large and complex that they become awkward to work with using on-hand database management tools."
Wikipedia
How big is big?
● 2008: Google processes 20 PB a day
● 2012: Facebook ingests 500 TB of data a day
● 2009: eBay has 6.5 PB of user data + 50 TB a day
● 2011: Yahoo! has 180-200 PB of data
Limitations of Existing Analytics Architecture
Layers, top to bottom:
- BI Reports + Online Apps
- RDBMS (aggregated data)
- ETL (Extract, Transform & Load)
- Storage Grid
- Data Collection (mostly append)
- Instrumentation (Raw Data Sources)

Problems:
- Can't explore the original raw data
- Moving data from storage to compute doesn't scale!
- Archiving = premature death
Why Hadoop?
Challenge: Read 1 TB of data

1 Machine
- 4 IO channels
- Each channel: 100 MB/s
- Read time: about 45 minutes

10 Machines
- 4 IO channels each
- Each channel: 100 MB/s
- Read time: about 4.5 minutes
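The arithmetic behind these figures can be checked with a short sketch. It assumes decimal units (1 TB = 1,000,000 MB) and perfectly parallel reads with no coordination overhead; under those assumptions the result is a little under 42 minutes, which the slide rounds to about 45. The function name `scan_minutes` is made up for this illustration.

```python
# Back-of-the-envelope scan time, using the figures from the slide.
# Assumptions: decimal units and perfectly parallel reads on every channel.

def scan_minutes(data_mb: float, machines: int, channels: int, mb_per_s: float) -> float:
    """Minutes needed to read data_mb when every channel streams in parallel."""
    aggregate_mb_per_s = machines * channels * mb_per_s
    return data_mb / aggregate_mb_per_s / 60

ONE_TB_IN_MB = 1_000_000

one_machine = scan_minutes(ONE_TB_IN_MB, machines=1, channels=4, mb_per_s=100)
ten_machines = scan_minutes(ONE_TB_IN_MB, machines=10, channels=4, mb_per_s=100)

print(f"1 machine:   {one_machine:.1f} min")
print(f"10 machines: {ten_machines:.1f} min")
```

The point of the comparison is that read time falls linearly with the number of machines only if each machine reads its own local share of the data.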
Hadoop and Friends
The Key Benefit: Agility/Flexibility
Schema-On-Write (RDBMS)
- Schema must be created before any data can be loaded
- An explicit load operation has to take place, which transforms the data into the DB's internal structure
- New columns must be added explicitly before new data for such columns can be loaded into the database
- Reads are fast
- Standards / Governance

Schema-On-Read (Hadoop)
- Data is simply copied to the file store; no transformations are needed
- A SerDe (Serializer/Deserializer) is applied at read time to extract the required columns (late binding)
- New data can start flowing at any time and will appear retroactively once the SerDe is updated to parse it
- Load is fast
- Flexibility / Agility
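A minimal sketch of the schema-on-read idea, in plain Python rather than any real SerDe API. The tab-separated log format, the field names and the functions `load`, `read`, `serde_v1` and `serde_v2` are all hypothetical; the parser functions only stand in for what a SerDe does.

```python
# Schema-on-read sketch: raw lines are stored untouched, and a parser
# (standing in for a SerDe) is applied only when the data is read.
# The log format and field names are hypothetical.

raw_store = []  # plays the role of the file store: append-only raw lines

def load(line: str) -> None:
    """Loading is just a copy; no schema is checked or applied."""
    raw_store.append(line)

def serde_v1(line: str) -> dict:
    """First parser: only knows about the first two fields."""
    fields = line.split("\t")
    return {"user": fields[0], "url": fields[1]}

def serde_v2(line: str) -> dict:
    """Updated parser: also extracts an optional third field."""
    fields = line.split("\t")
    row = {"user": fields[0], "url": fields[1]}
    row["referrer"] = fields[2] if len(fields) > 2 else None
    return row

def read(serde) -> list:
    """The schema is bound late, at read time."""
    return [serde(line) for line in raw_store]

load("alice\t/home")
load("bob\t/search\t/home")  # a new field starts flowing before any parser knows it

rows_v1 = read(serde_v1)
rows_v2 = read(serde_v2)     # the new column appears retroactively
```

Nothing was reloaded between the two reads: swapping the parser was enough to expose the extra column, including for rows stored earlier.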
Hadoop Components
Master/Slave Architecture
- HDFS: Name Node (master), Data Nodes (slaves)
- MapReduce: Job Tracker (master), Task Trackers (slaves)
r=3 (hdfs-site.xml: dfs.replication)

NameNode
File metadata:
/kenshoo/data1.txt ---> 1,2,3
/kenshoo/data2.txt ---> 4,5

Data Nodes
[Diagram: blocks 1-5 spread across the Data Nodes, with three replicas of each block]
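The bookkeeping in this picture can be sketched as a toy model: files map to block ids, and each block maps to r distinct Data Nodes. The round-robin placement, the node names `dn1` to `dn5` and the function `place_blocks` are assumptions made for illustration; this is not how HDFS actually chooses nodes.

```python
from itertools import cycle

def place_blocks(files: dict, data_nodes: list, replication: int = 3) -> dict:
    """Return {block_id: [data nodes]} with `replication` distinct nodes per block.

    Toy round-robin placement, not the real HDFS policy.
    """
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    ring = cycle(data_nodes)
    placement = {}
    for blocks in files.values():
        for block in blocks:
            # Consecutive picks from the ring are distinct as long as
            # replication <= len(data_nodes).
            placement[block] = [next(ring) for _ in range(replication)]
    return placement

# File metadata from the slide; the five node names are hypothetical.
files = {"/kenshoo/data1.txt": [1, 2, 3], "/kenshoo/data2.txt": [4, 5]}
data_nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]

placement = place_blocks(files, data_nodes, replication=3)
for block, nodes in sorted(placement.items()):
    print(block, nodes)
```

With r=3, any single Data Node can fail and every block still has two live copies.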
Underlying FS options

ext3
- released in 2001
- used by Yahoo!
- bootstrap + format are slow
- set:
  - noatime
  - tune2fs (to turn off reserved blocks)

ext4
- released in 2008
- used by Google
- as fast as XFS
- set:
  - delayed allocation off
  - noatime
  - tune2fs (to turn off reserved blocks)

XFS
- released in 1993
- fast
- drawbacks:
  - deleting a large number of files
Sample HDFS shell Commands
bin/hadoop fs -ls
bin/hadoop fs -mkdir
bin/hadoop fs -copyFromLocal
bin/hadoop fs -copyToLocal
bin/hadoop fs -moveToLocal
bin/hadoop fs -rm
bin/hadoop fs -tail
bin/hadoop fs -chmod
bin/hadoop fs -setrep -w 4 -R /dir1/s-dir

Mounting using FUSE:
hadoop-fuse-dfs dfs://10.73.9.50 /hdfs
Network Topology

Yahoo! Installation
- 8 core switches
- 100 racks
- 40 servers/rack
- 1 GBit in rack
- 10 GBit among racks
- Total 11 PB

[Diagram: Name Node, Job Tracker and HBase Master, plus Data Nodes in Rack 1, Rack 2 and Rack 3 holding three replicas each of blocks 2-5]
Rack Awareness

Name Node, Job Tracker, HBase Master

NameNode metadata:
file.txt =
Blk A: DN 2,7,8
Blk B: DN 9,12,14

[Diagram: Data Nodes grouped into Rack 1, Rack 2 and Rack 3, holding the replicas of blocks A and B listed above]
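The placements in the metadata (block A on DN 2, 7, 8; block B on DN 9, 12, 14) fit a simple rule: one replica on one rack and the other two on a single different rack, so that losing a whole rack never loses a block. The sketch below illustrates that rule under stated assumptions. The rack membership in `RACKS` is read off the diagram residue and may not match the original figure, and `place_replicas` is a simplified stand-in, not the HDFS implementation.

```python
import random

# Assumed rack membership, reconstructed from the diagram labels.
RACKS = {
    "rack1": [2, 3, 4, 5],
    "rack2": [7, 8, 9, 10],
    "rack3": [12, 13, 14, 15],
}

def place_replicas(local_rack: str, racks: dict, rng: random.Random) -> list:
    """Pick 3 data nodes: one in local_rack, two in a single other rack."""
    first = rng.choice(racks[local_rack])
    remote_rack = rng.choice([r for r in racks if r != local_rack])
    second, third = rng.sample(racks[remote_rack], 2)
    return [first, second, third]

def racks_used(nodes: list, racks: dict) -> set:
    """Names of the racks that hold at least one of the given nodes."""
    return {rack for rack, members in racks.items() for n in nodes if n in members}

rng = random.Random(0)  # seeded only to make the sketch repeatable
block_a = place_replicas("rack1", RACKS, rng)
print(block_a, racks_used(block_a, RACKS))
```

Under the assumed rack membership, the slide's own placements follow the same pattern: `racks_used([2, 7, 8], RACKS)` spans rack1 and rack2, and `racks_used([9, 12, 14], RACKS)` spans rack2 and rack3.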
HDFS Writes
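Client (file split into blocks A, B, C)

NameNode metadata:
file.txt =
Blk A: DN 2,7,9

[Diagram: the Client writes block A through the Core to Data Nodes in Rack 1 and Rack 2; three copies of block A end up on the Data Nodes]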
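A toy model of the write path, assuming the commonly described pipeline: the client asks the NameNode where block A should go, sends the block to the first Data Node, and each Data Node stores it and forwards it to the next. The classes `DataNode` and `NameNode` and their methods are hypothetical, and the target list 2, 7, 9 is hard-coded from the slide rather than chosen by any placement logic.

```python
class DataNode:
    def __init__(self, node_id: int):
        self.node_id = node_id
        self.blocks = {}

    def write(self, block_id: str, data: bytes, downstream: list) -> list:
        """Store the block, forward it down the pipeline, return the nodes that stored it."""
        self.blocks[block_id] = data
        stored = downstream[0].write(block_id, data, downstream[1:]) if downstream else []
        return [self.node_id] + stored

class NameNode:
    def __init__(self):
        self.metadata = {}  # file name -> {block id -> [data node ids]}

    def allocate(self, file_name: str, block_id: str, targets: list) -> list:
        """Record where a block will live and hand the targets to the client."""
        self.metadata.setdefault(file_name, {})[block_id] = list(targets)
        return list(targets)

# Data nodes and targets taken from the slide (Blk A: DN 2,7,9).
nodes = {i: DataNode(i) for i in (2, 7, 9)}
name_node = NameNode()

targets = name_node.allocate("file.txt", "A", [2, 7, 9])
pipeline = [nodes[i] for i in targets]
stored_on = pipeline[0].write("A", b"block A contents", pipeline[1:])
print(stored_on)
```

The client only talks to the first Data Node in the pipeline; the block data itself never passes through the NameNode, which holds metadata only.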
Reading Files
Client: "I want to read file1.txt"

NameNode: File1.txt parts:
Blk A: 2,7,8
Blk B: 9,12,14

NameNode metadata:
file.txt =
Blk A: DN 2,7,8
Blk B: DN 9,12,14

[Diagram: Client, Core, and Data Nodes in Rack 1, Rack 2 and Rack 3 holding the replicas of blocks A and B]
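The read path can be sketched in the same toy style: the client first gets the block list and replica locations (the NameNode's answer), then fetches each block directly from a Data Node, falling back to another replica if a node is unreachable. The block contents, the `down` parameter and the function `read_file` are hypothetical, and picking the first live replica is a simplification of any real replica-selection logic.

```python
# Block locations as returned by the NameNode on the slide.
BLOCK_LOCATIONS = {"file1.txt": [("A", [2, 7, 8]), ("B", [9, 12, 14])]}

# Hypothetical block contents held by each data node.
DATA_NODES = {dn: {"A": b"hello "} for dn in (2, 7, 8)}
DATA_NODES.update({dn: {"B": b"world"} for dn in (9, 12, 14)})

def read_file(name: str, down: frozenset = frozenset()) -> bytes:
    """Fetch every block from the first listed replica that is reachable."""
    parts = []
    for block_id, replicas in BLOCK_LOCATIONS[name]:  # step 1: ask the "NameNode"
        for dn in replicas:                           # step 2: go to a data node
            if dn not in down:
                parts.append(DATA_NODES[dn][block_id])
                break
        else:
            raise IOError(f"no live replica for block {block_id}")
    return b"".join(parts)

print(read_file("file1.txt"))
print(read_file("file1.txt", down=frozenset({2, 9})))  # still readable
```

With three replicas per block, the file stays readable until all three holders of some block are down at the same time.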
