Intro to Hadoop
TriHUG, July 2010


Jeff Turner
Bronto Software
Who am I?

Director of Platform Engineering at Bronto

Former Googler/FeedBurner(er)

Web Analytics background

Still working this out in therapy
What is a Hadoop?
Open-source distributed computing framework written in Java

Named by Doug Cutting (of Apache Lucene) after his son’s toy elephant

Main components: HDFS and MapReduce

Heavily used and sponsored by Yahoo

Also used by Facebook, Twitter, Rackspace, LinkedIn, countless others

Tremendous community and growing popularity
What does Hadoop do?
Networks nodes together to combine storage and computing power

Scales to petabytes of storage

Manages fault tolerance and data replication automagically

Excels at processing semi-structured and unstructured data

Provides framework for analyzing data in parallel (MapReduce)
What does Hadoop not do?
No random access (it’s not a database)

Not real-time (it’s batch oriented)

Make things obvious (there’s a learning curve)
Where do we start?
1. HDFS & MapReduce

2. ???

3. Profit
Hadoop’s Filesystem (HDFS)
Hadoop Distributed File System, based on Google’s GFS whitepaper

Data stored in blocks across cluster

Hadoop manages replication, node failure, rebalancing

Namenode is the master; Datanodes are slaves

Data is stored on disk, but not accessible via the local file system; use the Hadoop API/tools (quick sketch below)
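
A minimal sketch of reading an HDFS file through the Java API (org.apache.hadoop.fs), assuming a 0.20-era client; the /logs/access.log path is hypothetical. The hadoop fs command-line tools (-ls, -cat, -put, ...) wrap the same API.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CatHdfsFile {
      public static void main(String[] args) throws Exception {
        // Reads fs.default.name from core-site.xml on the classpath,
        // so FileSystem.get() returns an HDFS client, not the local FS
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Stream the file's blocks (fetched from Datanodes) to stdout
        IOUtils.copyBytes(fs.open(new Path("/logs/access.log")),
                          System.out, 4096, false);
      }
    }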
How HDFS stores data

[diagram: a Hadoop client, the Namenode, and Datanodes 1-4; the Namenode’s table maps each file to the Datanodes holding its blocks, e.g. file001 → Datanodes 1, 2, 3, while each Datanode’s local filesystem holds its share of the blocks]

Hadoop Client/API talks to the Namenode

Namenode looks up block locations and returns which Datanodes have the data

Hadoop Client/API talks to the Datanodes to read the file data

This is the only way to access HDFS data

HDFS data on the local file system is stored in blocks all over the cluster
About that Namenode ...

[diagram: a single Namenode surrounded by four Datanodes]

Namenode manages filesystem and file metadata; Datanodes store the actual blocks of data

Namenode keeps track of available Datanodes and file locations across the cluster

Namenode is a SPOF

If you lose Namenode metadata, Hadoop has no idea which files are in which blocks
HDFS Tips & Tricks
Write Namenode metadata to multiple local devices and a remote one (NFS mount); a config sketch follows below

No RAID, use JBOD. More disks == more disk I/O

Mount disks with noatime (skip writing last accessed time on file reads)

LZO compression; saves space, speeds network transfer

Tweak and test settings with included JARs: TestDFSIO, sort example
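
A minimal hdfs-site.xml sketch for the first tip, assuming a 0.20-era config (the property was later renamed dfs.namenode.name.dir); the directory paths are hypothetical:

    <?xml version="1.0"?>
    <configuration>
      <!-- The Namenode writes its metadata to every directory listed;
           one local disk plus an NFS mount keeps an off-machine copy -->
      <property>
        <name>dfs.name.dir</name>
        <value>/data/1/dfs/nn,/mnt/nfs/dfs/nn</value>
      </property>
    </configuration>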
Quick break before we move on to MapReduce
Hadoop’s MapReduce
Framework for running tasks in parallel, based on Google’s MapReduce whitepaper

JobTracker is the master; it schedules tasks on nodes, monitors them, and retries failures

TaskTrackers are the slaves; they run the specified task against the specified bits of data on HDFS

Map/Reduce functions operate on smaller parts of the problem, distributed across multiple nodes
Oversimplified MapReduce Example

18.106.61.94 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 354 company.com "-" "User agent"
77.220.219.58 - [18/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 238 company.com "-" "User agent"
121.41.7.104 - [18/Jul/2010:07:02:42 -0400] "GET /index2 HTTP/1.1" 200 2079 company.com "-" "User agent"
42.7.64.102 - - [20/Jul/2010:07:02:42 -0400] "GET /index1 HTTP/1.1" 200 173 company.com "-" "User agent"

1. Each line of the log file is input into the map function. The map parses the line and emits a key/value pair representing the page, and that it was viewed once:

    mapper (filename, file-contents):
      for each line in file-contents:
        page = parsePage(line)
        emit(page, 1)

2. The reducer is given a key and all occurrences of values for that key. The reducer sums the values and outputs a key/value pair that represents the page and a total number of views:

    reduce (key, values):
      int views = 0
      for each value in values:
        views++
      emit(key, views)

3. The result is a count of how many times each webpage has appeared in this log file:

    (index1, 3)
    (index2, 1)
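
The same job in real code - a minimal sketch against the 0.20-era org.apache.hadoop.mapreduce API. The request-path parsing is a crude stand-in for the parsePage() above; everything else mirrors the pseudocode:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class PageViews {

      public static class PageMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // Crude parsePage(): grab the token after "GET
          String s = line.toString();
          int start = s.indexOf("\"GET ");
          if (start < 0) return;                // skip unparseable lines
          int end = s.indexOf(' ', start + 5);
          if (end < 0) return;
          page.set(s.substring(start + 5, end));
          context.write(page, ONE);             // emit(page, 1)
        }
      }

      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text page, Iterable<IntWritable> counts,
            Context context) throws IOException, InterruptedException {
          int views = 0;
          for (IntWritable c : counts) {
            views += c.get();                   // sum the 1s
          }
          context.write(page, new IntWritable(views));
        }
      }
    }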
Hadoop MapReduce data flow

InputFormat controls where data comes from, breaks it into InputSplits

RecordReader knows how to read an InputSplit, passes data to the map function

Mappers do their thing, output intermediate data to local disk

Hadoop shuffles and sorts the keys in the map output so all occurrences of the same key are passed to the reducer together

Reducers do their thing, send output to the OutputFormat

OutputFormat controls where data goes

[chart from the Yahoo! Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/index.html]
Input/Output Formats

TextInputFormat - Reads text files, each line is an input

TextOutputFormat - Writes output from Hadoop to plain text

DBInputFormat - Reads JDBC sources, rows map to custom DBWritable

DBOutputFormat - Writes to JDBC sources, again using DBWritable

ColumnFamilyInputFormat - Reads rows from a Cassandra ColumnFamily
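
Wiring formats into a job - a hedged sketch of a driver for the PageViews classes above, again on the 0.20-era API; the input/output paths are hypothetical. Swapping the two set*FormatClass() calls is all it takes to read from or write to a different source:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PageViewsDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "page views");
        job.setJarByClass(PageViewsDriver.class);
        job.setMapperClass(PageViews.PageMapper.class);
        job.setReducerClass(PageViews.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Swap these for DBInputFormat/DBOutputFormat etc. to use
        // JDBC sources instead of text files
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/logs/access"));
        FileOutputFormat.setOutputPath(job, new Path("/out/pageviews"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }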
MapReduce Tips & Tricks
You don’t have to do it in Java; current MapReduce abstractions are awesome

Pig, Hive - performance is close enough to native MR, with a big productivity boost

Hadoop Streaming - passes data through stdin/stdout so you can use any language; Ruby and Python are popular choices (invocation sketch below)

Amazon’s Elastic MapReduce - on-demand MR jobs on EC2 instances
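
A hedged sketch of a Streaming invocation, assuming a 0.20-era layout (the streaming jar’s name and path vary by version/distribution) and hypothetical mapper.py / reducer.py scripts:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
      -input /logs/access \
      -output /out/pageviews \
      -mapper mapper.py \
      -reducer reducer.py \
      -file mapper.py \
      -file reducer.py

-file ships the scripts to every node; each mapper reads input lines on stdin and emits tab-separated key/value pairs on stdout.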
Hadoop at Bronto
5 node cluster, adding 8 more; each node has 4x 1TB drives, 16GB memory, 8 cores

Mostly Pig scripts, some Java utility MR jobs

Jobs process raw data/mail logs; store aggregate stats in Cassandra

Ad-hoc scripts analyze internal logs for app monitoring/debugging

Using Cassandra with Hadoop (we’re rolling our own InputFormat)
Summary
Hadoop excels at big data, analytics, batch processing

Not real-time, no random access; not a database

HDFS makes it all possible: massively scalable, fault tolerant file system

MapReduce provides framework for processing data on HDFS

Pig, Hive: easy to use, big productivity gain, close-enough performance in most cases
Questions?
      email: jeff.turner@bronto.com
    twitter: twitter.com/jefft

We’re hiring: http://bronto.com/company/careers

Editor’s notes

  1. Yahoo: 38K nodes, 4K-node cluster; Facebook: 2K-node cluster, 21 PB
  2. More than 60% of jobs at Yahoo are Pig