SlideShare ist ein Scribd-Unternehmen logo
1 von 25
Introduction to Hadoop
Ron Sher
Agenda

•
•
•
•
•

Big data - big issues
Hadoop to the rescue
Storage - HDFS
Processing - MapReduce
Hadoop ecosystem
Big Data - Big Issues
● Volume, Velocity, Variability
● Lots of data - logs, sensors, social, pictures,
video, etc.
● May not fit a single machine
● Access to data is slow
● Hardware may fail
● Network errors happen
Hadoop to the rescue

•
•
•
•
•
•

Distributed “operating system”
Scalable - many servers of commodity hardware
with lots of cores and disks
Reliable - detect failures, redundant storage
Fault-tolerant - auto-retry, self-healing
Simple - use many servers as one really big
computer
Suitable for batch processing (throughput over
Storage - HDFS

•
•

•
•

Hadoop Distributed File System
Replicated (3 default) fixed size blocks
(64MB default)
runs on large clusters of commodity
machines
Optimized for write once - read many
throughput of large files
HDFS Architecture
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png
Useful HDFS commands
•
•
•
•
•
•
•
•

hdfs dfs -get <file name> - copy a file from hdfs to local
hdfs dfs -put <file name> [destination]- copy a file from local
to hdfs in the specified destination
hdfs dfs -cat <file name> - prints a file to stdout
hdfs dfs -ls <dir name> - show all files under the specified
directory
hdfs dfs -mv <file name> <changed name> - rename a file
hdfs dfs -rm <file name> - remove a file
hdfs dfs -rmr <directory name> - remove a directory
hdfs dfs -mkdir <dir name> - creates a directory
Processing - MapReduce

•
•

•
•

A distributed data processing model and execution
environment that runs on large clusters of commodity
machines
Responsible for running a job in parallel on many
servers
Handles re-trying a task that fails, validating complete
results
Computation moved to the data
MapReduce Sample - Word Count
input

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini
MapReduce Sample - Word Count
input

splitting

Ini Mini Miny

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

Mo Mo Miny

Ini Mo Mini
MapReduce Sample - Word Count
input

splitting

Ini Mini Miny

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

Mo Mo Miny

Ini Mo Mini

mapping
Ini, 1
Mini, 1
Miny,1
Mo, 1
Mo, 1
Miny,1

Ini, 1
Mo, 1
Mini, 1
MapReduce Sample - Word Count
input

splitting

Ini Mini Miny

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

Mo Mo Miny

Ini Mo Mini

mapping
Ini, 1
Mini, 1
Miny,1
Mo, 1
Mo, 1
Miny,1

Ini, 1
Mo, 1
Mini, 1

shuffling
Ini, 1
Ini, 1
Mini, 1
Mini, 1
Miny, 1
Miny, 1
Mo, 1
Mo, 1
Mo, 1
MapReduce Sample - Word Count
input

splitting

Ini Mini Miny

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

Mo Mo Miny

Ini Mo Mini

mapping
Ini, 1
Mini, 1
Miny,1
Mo, 1
Mo, 1
Miny,1

Ini, 1
Mo, 1
Mini, 1

shuffling

reducing

Ini, 1
Ini, 1

Ini, [1,1]

Mini, 1
Mini, 1

Mini, [1,1]

Miny, 1
Miny, 1

Miny, [1,1]

Mo, 1
Mo, 1
Mo, 1

Mo, [1,1,1]
MapReduce Sample - Word Count
input

splitting

Ini Mini Miny

Ini Mini Miny
Mo Mo Miny
Ini Mo Mini

Mo Mo Miny

Ini Mo Mini

mapping
Ini, 1
Mini, 1
Miny,1
Mo, 1
Mo, 1
Miny,1

Ini, 1
Mo, 1
Mini, 1

shuffling

reducing

Ini, 1
Ini, 1

Ini, [1,1]

Mini, 1
Mini, 1

Mini, [1,1]

Miny, 1
Miny, 1

Miny, [1,1]

Mo, 1
Mo, 1
Mo, 1

Mo, [1,1,1]

final result

Ini, 2
Mini, 2
Miny,2
Mo, 3
http://answers.oreilly.com/uploads/monthly_10_2009/post-118-125676084924_thumb.png

How a MapReduce Job Runs in Hadoop
Monitoring MR jobs (machine:50030)
Monitoring MR jobs (machine:50030)
Monitoring MR jobs (machine:50030)
Monitoring MR jobs (machine:50030)
Useful Commands

•
•

mapred job -kill <job id> - kill a running job
mapred job -status <job id> - show status
of a job
Useful Commands

•
•

mapred job -kill <job id> - kill a running job
mapred job -status <job id> - show status
of a job
Word Count Mapper
public static class Map extends MapReduceBase implements
Mapper<LongWritable, Text, Text, IntWritable> {
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(LongWritable key, Text value, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws IOException {
String line = value.toString();
StringTokenizer tokenizer = new StringTokenizer(line);
while (tokenizer.hasMoreTokens()) {
word.set(tokenizer.nextToken());
output.collect(word, one);
}
}
}
Word Count Reducer
public static class Reduce extends MapReduceBase implements Reducer<Text,
IntWritable, Text, IntWritable> {
public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text,
IntWritable> output, Reporter reporter) throws IOException {
int sum = 0;
while (values.hasNext()) {
sum += values.next().get();
}
output.collect(key, new IntWritable(sum));
}
}
Hadoop Ecosystem

•
•
•
•
•
•
•
•

Hive - SQL like language over big data using MR
HBase - distributed, column-oriented database
ZooKeeper - coordination service
Avro - cross language serialization
Pig - language for exploring big data
Impala - SQL like directly over HDFS
Sqoop - tool for moving data from DBs to HDFS
Mahout - machine learning and data mining library
Some resources

•
•
•
•
•
•

Motivation about hadoop and where it’s
going video and whitepaper
HDFS Architecture Guide
How MapReduce Works With Hadoop
HDFS shell commands
VM
MapReduce tutorial

Weitere ähnliche Inhalte

Was ist angesagt?

Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopMohamed Elsaka
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and PigRicardo Varela
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesDataWorks Summit/Hadoop Summit
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoopAbhi Goyan
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsLeila panahi
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceDr Ganesh Iyer
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Cloudera, Inc.
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitTyler Treat
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-ReduceBrendan Tierney
 
【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)
【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)
【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)Unity Technologies Japan K.K.
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part IMarin Dimitrov
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementKyong-Ha Lee
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduceNewvewm
 

Was ist angesagt? (19)

Introduction to MapReduce and Hadoop
Introduction to MapReduce and HadoopIntroduction to MapReduce and Hadoop
Introduction to MapReduce and Hadoop
 
introduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pigintroduction to data processing using Hadoop and Pig
introduction to data processing using Hadoop and Pig
 
GoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with DependenciesGoodFit: Multi-Resource Packing of Tasks with Dependencies
GoodFit: Multi-Resource Packing of Tasks with Dependencies
 
MapReduce
MapReduceMapReduce
MapReduce
 
MapReduce and Hadoop
MapReduce and HadoopMapReduce and Hadoop
MapReduce and Hadoop
 
Big data & hadoop
Big data & hadoopBig data & hadoop
Big data & hadoop
 
MapReduce Scheduling Algorithms
MapReduce Scheduling AlgorithmsMapReduce Scheduling Algorithms
MapReduce Scheduling Algorithms
 
06 pig etl features
06 pig etl features06 pig etl features
06 pig etl features
 
Introduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduceIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce
 
An Introduction To Map-Reduce
An Introduction To Map-ReduceAn Introduction To Map-Reduce
An Introduction To Map-Reduce
 
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
Hadoop Summit 2012 | Bayesian Counters AKA In Memory Data Mining for Large Da...
 
Probabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profitProbabilistic algorithms for fun and pseudorandom profit
Probabilistic algorithms for fun and pseudorandom profit
 
Introduction to Map-Reduce
Introduction to Map-ReduceIntroduction to Map-Reduce
Introduction to Map-Reduce
 
【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)
【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)
【Unite 2017 Tokyo】C#ジョブシステムによるモバイルゲームのパフォーマンス向上テクニック(note付き)
 
Priority queue
Priority queuePriority queue
Priority queue
 
Large Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part ILarge Scale Data Analysis with Map/Reduce, part I
Large Scale Data Analysis with Map/Reduce, part I
 
Hadoop job chaining
Hadoop job chainingHadoop job chaining
Hadoop job chaining
 
MapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvementMapReduce: A useful parallel tool that still has room for improvement
MapReduce: A useful parallel tool that still has room for improvement
 
Hadoop & MapReduce
Hadoop & MapReduceHadoop & MapReduce
Hadoop & MapReduce
 

Andere mochten auch

Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersXiao Qin
 
IBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission ShowcaseIBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission ShowcaseIBM Analytics
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesVasia Kalavri
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 

Andere mochten auch (7)

HDFS Issues
HDFS IssuesHDFS Issues
HDFS Issues
 
Heirarchy
HeirarchyHeirarchy
Heirarchy
 
Resume2015 copy
Resume2015 copyResume2015 copy
Resume2015 copy
 
Performance Issues on Hadoop Clusters
Performance Issues on Hadoop ClustersPerformance Issues on Hadoop Clusters
Performance Issues on Hadoop Clusters
 
IBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission ShowcaseIBM Big Data for Social Good Challenge - Submission Showcase
IBM Big Data for Social Good Challenge - Submission Showcase
 
MapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open IssuesMapReduce: Optimizations, Limitations, and Open Issues
MapReduce: Optimizations, Limitations, and Open Issues
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 

Ähnlich wie Introduction to Hadoop: An Overview of Storage, Processing, and the Hadoop Ecosystem

Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introductionSandeep Singh
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM Joy Rahman
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.pptSathish24111
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNATomas Cervenka
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentationJoseph Adler
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overviewharithakannan
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerCodemotion
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Uwe Printz
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectBYOUNG GON KIM
 
Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Christopher Hogue
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkThoughtWorks
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataDhanashri Yadav
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerancePallav Jha
 
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class Chris Gates
 

Ähnlich wie Introduction to Hadoop: An Overview of Storage, Processing, and the Hadoop Ecosystem (20)

Hadoop-Quick introduction
Hadoop-Quick introductionHadoop-Quick introduction
Hadoop-Quick introduction
 
MapReduce on Zero VM
MapReduce on Zero VM MapReduce on Zero VM
MapReduce on Zero VM
 
Hadoop tutorial
Hadoop tutorialHadoop tutorial
Hadoop tutorial
 
Hadoop Tutorial.ppt
Hadoop Tutorial.pptHadoop Tutorial.ppt
Hadoop Tutorial.ppt
 
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNAFirst Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
First Hive Meetup London 2012-07-10 - Tomas Cervenka - VisualDNA
 
Big data week presentation
Big data week presentationBig data week presentation
Big data week presentation
 
MapReduce basics
MapReduce basicsMapReduce basics
MapReduce basics
 
Hadoop and Distributed Computing
Hadoop and Distributed ComputingHadoop and Distributed Computing
Hadoop and Distributed Computing
 
Hadoop bigdata overview
Hadoop bigdata overviewHadoop bigdata overview
Hadoop bigdata overview
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop
HadoopHadoop
Hadoop
 
Hadoop - Introduction to HDFS
Hadoop - Introduction to HDFSHadoop - Introduction to HDFS
Hadoop - Introduction to HDFS
 
Introduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe SeilerIntroduction to the hadoop ecosystem by Uwe Seiler
Introduction to the hadoop ecosystem by Uwe Seiler
 
Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)Introduction to the Hadoop Ecosystem (SEACON Edition)
Introduction to the Hadoop Ecosystem (SEACON Edition)
 
OpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo ProjectOpenSource Big Data Platform - Flamingo Project
OpenSource Big Data Platform - Flamingo Project
 
Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013Manta Unleashed BigDataSG talk 2 July 2013
Manta Unleashed BigDataSG talk 2 July 2013
 
HadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software FrameworkHadoopThe Hadoop Java Software Framework
HadoopThe Hadoop Java Software Framework
 
Hadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big DataHadoop: A distributed framework for Big Data
Hadoop: A distributed framework for Big Data
 
Hadoop fault tolerance
Hadoop  fault toleranceHadoop  fault tolerance
Hadoop fault tolerance
 
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
The Dirty Little Secrets They Didn’t Teach You In Pentesting Class
 

Kürzlich hochgeladen

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DaySri Ambati
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 

Kürzlich hochgeladen (20)

The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo DayH2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
H2O.ai CEO/Founder: Sri Ambati Keynote at Wells Fargo Day
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 

Introduction to Hadoop: An Overview of Storage, Processing, and the Hadoop Ecosystem

  • 2. Agenda • • • • • Big data - big issues Hadoop to the rescue Storage - HDFS Processing - MapReduce Hadoop ecosystem
  • 3. Big Data - Big Issues ● Volume, Velocity, Variability ● Lots of data - logs, sensors, social, pictures, video, etc. ● May not fit a single machine ● Access to data is slow ● Hardware may fail ● Network errors happen
  • 4. Hadoop to the rescue • • • • • • Distributed “operating system” Scalable - many servers of commodity hardware with lots of cores and disks Reliable - detect failures, redundant storage Fault-tolerant - auto-retry, self-healing Simple - use many servers as one really big computer Suitable for batch processing (throughput over
  • 5. Storage - HDFS • • • • Hadoop Distributed File System Replicated (3 default) fixed size blocks (64MB default) runs on large clusters of commodity machines Optimized for write once - read many throughput of large files
  • 7. Useful HDFS commands • • • • • • • • hdfs dfs -get <file name> - copy a file from hdfs to local hdfs dfs -put <file name> [destination]- copy a file from local to hdfs in the specified destination hdfs dfs -cat <file name> - prints a file to stdout hdfs dfs -ls <dir name> - show all files under the specified directory hdfs dfs -mv <file name> <changed name> - rename a file hdfs dfs -rm <file name> - remove a file hdfs dfs -rmr <directory name> - remove a directory hdfs dfs -mkdir <dir name> - creates a directory
  • 8. Processing - MapReduce • • • • A distributed data processing model and execution environment that runs on large clusters of commodity machines Responsible for running a job in parallel on many servers Handles re-trying a task that fails, validating complete results Computation moved to the data
  • 9. MapReduce Sample - Word Count input Ini Mini Miny Mo Mo Miny Ini Mo Mini
  • 10. MapReduce Sample - Word Count input splitting Ini Mini Miny Ini Mini Miny Mo Mo Miny Ini Mo Mini Mo Mo Miny Ini Mo Mini
  • 11. MapReduce Sample - Word Count input splitting Ini Mini Miny Ini Mini Miny Mo Mo Miny Ini Mo Mini Mo Mo Miny Ini Mo Mini mapping Ini, 1 Mini, 1 Miny,1 Mo, 1 Mo, 1 Miny,1 Ini, 1 Mo, 1 Mini, 1
  • 12. MapReduce Sample - Word Count input splitting Ini Mini Miny Ini Mini Miny Mo Mo Miny Ini Mo Mini Mo Mo Miny Ini Mo Mini mapping Ini, 1 Mini, 1 Miny,1 Mo, 1 Mo, 1 Miny,1 Ini, 1 Mo, 1 Mini, 1 shuffling Ini, 1 Ini, 1 Mini, 1 Mini, 1 Miny, 1 Miny, 1 Mo, 1 Mo, 1 Mo, 1
  • 13. MapReduce Sample - Word Count input splitting Ini Mini Miny Ini Mini Miny Mo Mo Miny Ini Mo Mini Mo Mo Miny Ini Mo Mini mapping Ini, 1 Mini, 1 Miny,1 Mo, 1 Mo, 1 Miny,1 Ini, 1 Mo, 1 Mini, 1 shuffling reducing Ini, 1 Ini, 1 Ini, [1,1] Mini, 1 Mini, 1 Mini, [1,1] Miny, 1 Miny, 1 Miny, [1,1] Mo, 1 Mo, 1 Mo, 1 Mo, [1,1,1]
  • 14. MapReduce Sample - Word Count input splitting Ini Mini Miny Ini Mini Miny Mo Mo Miny Ini Mo Mini Mo Mo Miny Ini Mo Mini mapping Ini, 1 Mini, 1 Miny,1 Mo, 1 Mo, 1 Miny,1 Ini, 1 Mo, 1 Mini, 1 shuffling reducing Ini, 1 Ini, 1 Ini, [1,1] Mini, 1 Mini, 1 Mini, [1,1] Miny, 1 Miny, 1 Miny, [1,1] Mo, 1 Mo, 1 Mo, 1 Mo, [1,1,1] final result Ini, 2 Mini, 2 Miny,2 Mo, 3
  • 16. Monitoring MR jobs (machine:50030)
  • 17. Monitoring MR jobs (machine:50030)
  • 18. Monitoring MR jobs (machine:50030)
  • 19. Monitoring MR jobs (machine:50030)
  • 20. Useful Commands • • mapred job -kill <job id> - kill a running job mapred job -status <job id> - show status of a job
  • 21. Useful Commands • • mapred job -kill <job id> - kill a running job mapred job -status <job id> - show status of a job
  • 22. Word Count Mapper public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer tokenizer = new StringTokenizer(line); while (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken()); output.collect(word, one); } } }
  • 23. Word Count Reducer public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } }
  • 24. Hadoop Ecosystem • • • • • • • • Hive - SQL like language over big data using MR HBase - distributed, column-oriented database ZooKeeper - coordination service Avro - cross language serialization Pig - language for exploring big data Impala - SQL like directly over HDFS Sqoop - tool for moving data from DBs to HDFS Mahout - machine learning and data mining library
  • 25. Some resources • • • • • • Motivation about hadoop and where it’s going video and whitepaper HDFS Architecture Guide How MapReduce Works With Hadoop HDFS shell commands VM MapReduce tutorial