Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 1 of 5
Course Outline
What is Hadoop?
• Open-source data storage and processing API
• Massively scalable, automatically parallelizable
  - Based on work from Google
    - GFS + MapReduce + BigTable
  - Current distributions based on open source and vendor work
    - Apache Hadoop
    - Cloudera – CDH4 w/ Impala
    - Hortonworks
    - MapR
    - AWS
    - Windows Azure HDInsight
Why Use Hadoop?
• Cheaper
  - Scales to petabytes or more
• Faster
  - Parallel data processing
• Better
  - Suited for particular types of BigData problems
What types of business problems for Hadoop?
Source: Cloudera “Ten Common Hadoopable Problems”
Companies Using Hadoop
• Facebook
• Yahoo
• Amazon
• eBay
• American Airlines
• The New York Times
• Federal Reserve Board
• IBM
• Orbitz
Forecast growth of Hadoop Job Market
Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
Hadoop is a set of Apache Frameworks and more…
• Data storage (HDFS)
  - Runs on commodity hardware (usually Linux)
  - Horizontally scalable
• Processing (MapReduce)
  - Parallelized (scalable) processing
  - Fault tolerant
• Other tools / frameworks
  - Data access
    - HBase, Hive, Pig, Mahout
  - Tools
    - Hue, Sqoop
  - Monitoring
    - Greenplum, Cloudera
Hadoop Core - HDFS
MapReduce API
Data Access
Tools & Libraries
Monitoring & Alerting
What are the core parts of a Hadoop distribution?
Hadoop Cluster HDFS (Physical) Storage
MapReduce Job – Logical View
Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
Hadoop Ecosystem
Hadoop MapReduce Fundamentals
Common Hadoop Distributions
• Open source
  - Apache
• Commercial
  - Cloudera
  - Hortonworks
  - MapR
  - AWS MapReduce
  - Microsoft HDInsight (Beta)
A View of Hadoop (from Hortonworks)
Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
Setting up Hadoop Development
Demo – Setting up Cloudera Hadoop
Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 2 of 5
So, what’s the problem?
• “I can just use some ‘SQL-like’ language to query Hadoop, right?”
• “Yeah, SQL-on-Hadoop… that’s what I want.”
• “I don’t want to learn a new query language and…”
• “I want massive scale for my shiny, new BigData.”
Ways to MapReduce
Libraries Languages
Note: Java is most common, but other languages can be used
Demo – Using Hive QL on CDH4
What is Hive?
• A data warehouse system for Hadoop that
  - facilitates easy data summarization
  - supports ad-hoc queries (still batch, though…)
  - was created by Facebook
• A mechanism to project structure onto this data and query it using a SQL-like language, HiveQL
  - Interactive console, or
  - Execute scripts
  - Kicks off one or more MapReduce jobs in the background
• An ability to use indexes and built-in or user-defined functions
Is HQL == ANSI SQL? – NO!
-- non-equality joins ARE allowed in ANSI SQL
-- but are NOT allowed in Hive (HQL)
SELECT a.*
FROM a
JOIN b ON (a.id <> b.id)
Note: Joins are quite different in MapReduce; more on that coming up…
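One way to see why equality matters is a reduce-side join. The following is a minimal Python sketch, not Hive's actual implementation: rows from both tables are keyed by the join column, so only rows with equal keys ever meet in the same reduce group. The table contents and the function name `reduce_side_join` are hypothetical.

```python
from collections import defaultdict

def reduce_side_join(table_a, table_b):
    """Join two lists of (id, value) rows on equal ids, MapReduce style."""
    groups = defaultdict(lambda: ([], []))
    # "Map" phase: emit every row under its id, remembering its source table.
    for row_id, value in table_a:
        groups[row_id][0].append(value)
    for row_id, value in table_b:
        groups[row_id][1].append(value)
    # "Reduce" phase: each key is processed in isolation, so only rows that
    # share the same id can be paired.
    joined = []
    for row_id in sorted(groups):
        a_values, b_values = groups[row_id]
        for a_value in a_values:
            for b_value in b_values:
                joined.append((row_id, a_value, b_value))
    return joined

a = [(1, "ann"), (2, "bob")]      # hypothetical table a
b = [(1, "sales"), (3, "ops")]    # hypothetical table b
print(reduce_side_join(a, b))     # [(1, 'ann', 'sales')]
```

A condition such as `a.id <> b.id` offers no single key to group on, which is one intuition for why it does not fit this pattern.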
Preparing for MapReduce
Common Hadoop Shell Commands
hadoop fs -cat file:///file2
hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
hadoop fs -copyFromLocal <fromDir> <toDir>
hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
hadoop fs -ls /user/hadoop/dir1
hadoop fs -cat hdfs://nn1.example.com/file1
hadoop fs -get /user/hadoop/file <localfile>
Tips
-- ‘sudo’ means ‘run as administrator’ (superuser)
-- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths differ for the former (see the link included for more detail)
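When these commands are scripted, it can help to build the argument list in one place. The sketch below is a hypothetical helper (`hadoop_fs` is not part of Hadoop) that only assembles the command and does not run it; it assumes the `hadoop` executable would be on the PATH.

```python
import shlex

def hadoop_fs(command, *args):
    """Build the argument list for a `hadoop fs` call (does not run it)."""
    # Options need a plain ASCII hyphen; a typographic dash pasted from a
    # slide is not recognized as an option.
    return ["hadoop", "fs", "-" + command, *args]

cmd = hadoop_fs("mkdir", "/user/hadoop/dir1", "/user/hadoop/dir2")
print(shlex.join(cmd))  # hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
```

On a machine with Hadoop installed, the list could be passed to `subprocess.run(cmd, check=True)`.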
Demo – Working with Files and HDFS
Thinking in MapReduce
• Hint: “It’s Functional”
Understanding MapReduce – P1/3
• Map >>
  - (K1, V1) →
    - Info in
    - Input split
  - list (K2, V2)
    - Key / value out (intermediate values)
    - One list per local node
    - Can implement local Reducer (or Combiner)
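The Map step can be sketched in plain Python. This is an illustration only, not the Hadoop API: the names `map_words` and `combine` and the sample input split are assumptions, with (K1, V1) taken to be (byte offset, line of text).

```python
from collections import Counter

def map_words(offset, line):
    """Map: (K1, V1) = (byte offset, line) -> list of (K2, V2) = (word, 1)."""
    return [(word.lower(), 1) for word in line.split()]

def combine(pairs):
    """Combiner: pre-aggregate one node's map output before it is shuffled."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

# Hypothetical input split held by a single node.
split = [(0, "the quick fox"), (14, "the lazy dog")]
local_pairs = [pair for offset, line in split for pair in map_words(offset, line)]
print(combine(local_pairs))
# [('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]
```

The combiner shrinks the list that leaves the node: six intermediate pairs become five here, and the saving grows with repeated keys.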
Understanding MapReduce – P2/3
• Map >>
  - (K1, V1) →
    - Info in
    - Input split
  - list (K2, V2)
    - Key / value out (intermediate values)
    - One list per local node
    - Can implement local Reducer (or Combiner)
• Shuffle/Sort >>
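The Shuffle/Sort step can be pictured as merging each node's list of (K2, V2) pairs and grouping the values by key. The following is a simplified single-process sketch; the function name `shuffle_sort` and the sample node outputs are hypothetical, and a real cluster also partitions keys across reducers.

```python
from itertools import groupby
from operator import itemgetter

def shuffle_sort(map_outputs):
    """Merge per-node (K2, V2) lists into key-sorted (K2, list(V2)) groups."""
    merged = sorted(
        (pair for node in map_outputs for pair in node),
        key=itemgetter(0),
    )
    return [
        (key, [value for _, value in group])
        for key, group in groupby(merged, key=itemgetter(0))
    ]

node_1 = [("fox", 1), ("the", 2)]   # hypothetical output of one map node
node_2 = [("dog", 1), ("the", 1)]   # hypothetical output of another
print(shuffle_sort([node_1, node_2]))
# [('dog', [1]), ('fox', [1]), ('the', [2, 1])]
```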
Understanding MapReduce – P3/3
• Map >>
  - (K1, V1) →
    - Info in
    - Input split
  - list (K2, V2)
    - Key / value out (intermediate values)
    - One list per local node
    - Can implement local Reducer (or Combiner)
• Shuffle/Sort >>
• Reduce
  - (K2, list(V2)) →
    - Shuffle / Sort phase precedes Reduce phase
    - Combines Map output into a list
  - list (K3, V3)
    - Usually aggregates intermediate values
(input) <k1, v1> → map → <k2, v2> → combine → <k2, v2> → reduce → <k3, v3> (output)
Image from: http://blog.jteam.nl/wp-content/uploads/2009/08/MapReduceWordCountOverview1.png
MapReduce Example - WordCount
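As a rough stand-in for the WordCount diagram, here is a minimal in-memory Python sketch of the whole flow (map, shuffle/sort, reduce). It is not Hadoop code; `mapper`, `reducer`, `run_job` and the input lines are illustrative assumptions.

```python
from collections import defaultdict

def mapper(_key, line):
    """Emit (word, 1) for every word in the line."""
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    """Sum all counts collected for one word."""
    yield word, sum(counts)

def run_job(records):
    """Run map, then group by key (shuffle/sort), then reduce."""
    groups = defaultdict(list)
    for key, value in records:
        for k2, v2 in mapper(key, value):
            groups[k2].append(v2)
    output = []
    for k2 in sorted(groups):
        output.extend(reducer(k2, groups[k2]))
    return output

lines = ["Hello Hadoop", "hello MapReduce"]   # hypothetical input
result = run_job(enumerate(lines))
print(result)  # [('hadoop', 1), ('hello', 2), ('mapreduce', 1)]
```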
MapReduce Objects
Each daemon spawns a new JVM
Demo – Running MapReduce WordCount
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 3 of 5
Ways to run MapReduce Jobs
• Configure JobConf options
• From a development environment (IDE)
• From a GUI utility
  - Cloudera – Hue
  - Microsoft Azure – HDInsight console
• From the command line
  - hadoop jar <filename.jar> input output
Setting up Hadoop On Windows Azure
• About HDInsight
Demo – MapReduce in the Cloud
• WordCount MapReduce using HDInsight
MapReduce (WordCount) with JavaScript
Note: JavaScript is part of the Azure Hadoop distribution
Common Data Sources for MapReduce Jobs
Where is your Data coming from?
• On premises
  - Local file system
  - Local HDFS instance
• Private cloud
  - Cloud storage
• Public cloud
  - Input storage buckets
  - Script / code buckets
  - Output buckets
Common Data Jobs for MapReduce
Demo – Other Types of MapReduce
Tip: Review the Java MapReduce code in these samples as well.
Methods to write MapReduce Jobs
• Typical – usually written in Java
  - MapReduce 2.0 API
  - MapReduce 1.0 API
• Streaming
  - Uses stdin and stdout
  - Can use any language to write Map and Reduce functions
    - C#, Python, JavaScript, etc.
• Pipes
  - Often used with C++
• Abstraction libraries
  - Hive, Pig, etc.: write in a higher-level language, generate one or more MapReduce jobs
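The Streaming approach can be illustrated with a small Python sketch. It assumes the usual convention of one tab-separated key/value pair per text line; the function names and the sample input are hypothetical, and the local `sorted` call stands in for the framework's shuffle/sort.

```python
from itertools import groupby

def stream_map(lines):
    """Mapper: read raw text lines, emit 'word<TAB>1' lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def stream_reduce(sorted_lines):
    """Reducer: read key-sorted 'word<TAB>count' lines, emit totals."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

# Local dry run: sorted() plays the role of the shuffle/sort phase.
mapped = sorted(stream_map(["to be or", "not to be"]))
print(list(stream_reduce(mapped)))
# ['be\t2', 'not\t1', 'or\t1', 'to\t2']
```

In an actual streaming job, each function would live in its own script that iterates over `sys.stdin` and prints each result line to stdout.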
Demo – MapReduce via C# & PowerShell
Using AWS MapReduce
Note: You can select Apache or MapR Hadoop distributions to run your MapReduce job on the AWS Cloud
What is Pig?
• ETL library for HDFS developed at Yahoo
  - Pig runtime
  - Pig language
  - Generates MapReduce jobs
• ETL steps
  - LOAD <file>
  - FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…
  - DUMP {to screen for testing} → STORE <newFile>
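To make the ETL steps concrete, the sketch below mirrors LOAD, FILTER, GROUP BY, FOREACH … GENERATE COUNT and DUMP as ordinary Python functions over in-memory data. It is an analogy rather than Pig Latin, and the sample log rows and function names are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Hypothetical log data: user, source, HTTP status.
RAW = "alice,web,200\nbob,web,404\nalice,api,200\ncarol,web,200\n"

def load(text):
    """LOAD: parse comma-separated rows into typed tuples."""
    return [(user, source, int(status))
            for user, source, status in csv.reader(io.StringIO(text))]

def filter_ok(rows):
    """FILTER: keep only rows with status 200."""
    return [row for row in rows if row[2] == 200]

def group_by_user(rows):
    """GROUP BY: collect rows under their user field."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    return groups

def count_per_group(groups):
    """FOREACH ... GENERATE COUNT: one (user, count) pair per group."""
    return sorted((user, len(rows)) for user, rows in groups.items())

result = count_per_group(group_by_user(filter_ok(load(RAW))))
print(result)  # DUMP: [('alice', 2), ('carol', 1)]
```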
MapReduce Python Sample
Remember that white space matters in Python!
Demo – Using AWS MapReduce with Pig
Note: You can select Apache or MapR Hadoop distributions to run your MapReduce job on the AWS Cloud
AWS Data Pipeline with HIVE
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 4 of 5
Better MapReduce - Optimizations
Optimization BEFORE running a MapReduce Job
More about Input File Compression
• From Cloudera…
• Their version of LZO is ‘splittable’
Type   File      Size (GB)   Compress   Decompress
None   Log       8.0         -          -
Gzip   Log.gz    1.3         241        72
LZO    Log.lzo   2.0         55         35
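The trade-off in the table can be worked out with a few lines of arithmetic. The figures below are copied from the table; the table states no units for the Compress and Decompress columns, so this sketch assumes they are durations where lower is better.

```python
# Figures copied from the table above (sizes in GB).
ORIGINAL_GB = 8.0
codecs = {
    "gzip": {"size_gb": 1.3, "compress": 241, "decompress": 72},
    "lzo": {"size_gb": 2.0, "compress": 55, "decompress": 35},
}

def ratio(codec):
    """How many times smaller the compressed log is than the original."""
    return ORIGINAL_GB / codecs[codec]["size_gb"]

for name in codecs:
    print(f"{name}: {ratio(name):.1f}x smaller")
# gzip: 6.2x smaller
# lzo: 4.0x smaller

# Assumption: the Compress column is a duration, so a smaller value is faster.
compress_speedup = codecs["gzip"]["compress"] / codecs["lzo"]["compress"]
print(f"LZO compresses about {compress_speedup:.1f}x faster than gzip")
# LZO compresses about 4.4x faster than gzip
```

Under that assumption, gzip yields the smaller file while LZO is quicker in both directions.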
Optimization WITHIN a MapReduce Job
Mapper Task Optimization
Data Types
• Writable
  - Text (String)
  - IntWritable
  - LongWritable
  - FloatWritable
  - BooleanWritable
• WritableComparable for keys
• Custom types supported – write a RawComparator
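The idea behind a custom key type with a raw comparator can be sketched in Python. This is an analogy, not the Hadoop `Writable` interface: `IntPair`, its byte layout (two big-endian 32-bit ints) and `raw_compare` are all assumptions made for illustration.

```python
import struct
from functools import cmp_to_key

class IntPair:
    """Hypothetical custom key: two ints serialized as 8 big-endian bytes."""

    def __init__(self, first, second):
        self.first, self.second = first, second

    def write(self):
        """Serialize the key to bytes."""
        return struct.pack(">ii", self.first, self.second)

    @classmethod
    def read_fields(cls, data):
        """Rebuild a key from its serialized bytes."""
        return cls(*struct.unpack(">ii", data))

def raw_compare(bytes_a, bytes_b):
    """Order two serialized keys without constructing IntPair objects."""
    a = struct.unpack(">ii", bytes_a)
    b = struct.unpack(">ii", bytes_b)
    return (a > b) - (a < b)

keys = [IntPair(2, 1).write(), IntPair(1, 9).write(), IntPair(1, -3).write()]
ordered = [struct.unpack(">ii", k) for k in sorted(keys, key=cmp_to_key(raw_compare))]
print(ordered)  # [(1, -3), (1, 9), (2, 1)]
```

Comparing on the serialized form avoids creating an object for every key during the sort, which is the motivation for writing a raw comparator.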
Reducer Task Optimization
MapReduce Job Optimization
Demo – Unit Testing MapReduce
• Using MRUnit + Asserts
• Optionally using ApprovalTests
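MRUnit is a Java library, so the following is only a Python analogy of the same idea: test the map and reduce functions in isolation with plain asserts, without a cluster. The functions `map_words` and `reduce_counts` are hypothetical word-count helpers.

```python
import unittest

def map_words(line):
    """Mapper under test: emit (word, 1) per word."""
    return [(word.lower(), 1) for word in line.split()]

def reduce_counts(word, counts):
    """Reducer under test: sum the counts for one word."""
    return (word, sum(counts))

class WordCountTest(unittest.TestCase):
    def test_mapper_emits_one_pair_per_word(self):
        self.assertEqual(
            map_words("Cat cat dog"),
            [("cat", 1), ("cat", 1), ("dog", 1)],
        )

    def test_reducer_sums_counts(self):
        self.assertEqual(reduce_counts("cat", [1, 1]), ("cat", 2))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCountTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test supplies a known input and asserts on the exact output, so a mapper or reducer bug shows up before a job is submitted.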
Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
A note about MapReduce 2.0
• Splits the existing JobTracker’s roles
  - resource management
  - job lifecycle management
• MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as
  - better scalability through distributed job lifecycle management
  - support for multiple Hadoop MapReduce API versions in a single cluster
What is Mahout?
• Library with common machine learning algorithms
• Over 20 algorithms
  - Recommendation (likelihood – Pandora)
  - Classification (known data and new data – spam ID)
  - Clustering (new groups of similar data – Google News)
• Can non-statisticians find value using this library?
Mahout Algorithms
Setting up Hadoop on Windows
• For local development
• Install binaries from the Web Platform Installer
• Install .NET Azure SDK (for Azure BLOB storage)
• Install other tools
  - Neudesic Azure Storage Viewer
Demo – Mahout
• Using HDInsight
What about the output?
Clients (Visualizations) for HDFS
• Many clients use Hive
  - Often included in GUI console tools for Hadoop distributions as well
• Microsoft includes clients in Office (Excel 2013)
  - Direct Hive client
  - Connect using ODBC
  - PowerPivot – data mashups and presentation
  - Data Explorer – connect, transform, mashup and filter
  - Hadoop SDK on CodePlex
• Other popular clients
  - QlikView
  - Tableau
  - Karmasphere
Demo – Executing Hive Queries
Demo – Using HDFS output in Excel 2013
To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
About Visualization
Demo – New Visualizations – D3
Hadoop MapReduce Fundamentals
@LynnLangit
a five part series – Part 5 of 5
Limitations of MapReduce
Comparing: RDBMS vs. Hadoop
                      Traditional RDBMS           Hadoop / MapReduce
Data Size             Gigabytes (Terabytes)       Petabytes (Exabytes)
Access                Interactive and Batch       Batch – NOT Interactive
Updates               Read / Write many times     Write once, Read many times
Structure             Static Schema               Dynamic Schema
Integrity             High (ACID)                 Low
Scaling               Nonlinear                   Linear
Query Response Time   Can be near immediate       Has latency (due to batch processing)
Microsoft alternatives to MapReduce
• Use existing relational system
  - Scale via cloud or edition (e.g. Enterprise or PDW)
• Use in-memory OLAP
  - SQL Server Analysis Services Tabular Models
• Use “productized” Dremel
  - Microsoft PolyBase – status = beta?
Looking Forward - Dremel or Apache Drill
• Based on original research from Google
Apache Drill Architecture
In-market MapReduce Alternatives
Cloudera
• Impala
Google
• BigQuery
Demo – Google’s BigQuery
• Dremel for the rest of us
Hadoop MapReduce Call to Action
More MapReduce Developer Resources
• Based on the distribution – on premises
  - Apache
    - MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
  - Cloudera
    - Cloudera University - http://university.cloudera.com/
    - Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
  - Hortonworks
  - MapR
• Based on the distribution – cloud
  - AWS MapReduce
    - Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
  - Windows Azure HDInsight
    - Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
    - More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
The Changing Data Landscape

Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Will Schroeder
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdfPedro Manuel
 

Kürzlich hochgeladen (20)

Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024Artificial Intelligence & SEO Trends for 2024
Artificial Intelligence & SEO Trends for 2024
 
Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1Secure your environment with UiPath and CyberArk technologies - Session 1
Secure your environment with UiPath and CyberArk technologies - Session 1
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
20230202 - Introduction to tis-py
20230202 - Introduction to tis-py20230202 - Introduction to tis-py
20230202 - Introduction to tis-py
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URL
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6
 
VoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBXVoIP Service and Marketing using Odoo and Asterisk PBX
VoIP Service and Marketing using Odoo and Asterisk PBX
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online CollaborationCOMPUTER 10: Lesson 7 - File Storage and Online Collaboration
COMPUTER 10: Lesson 7 - File Storage and Online Collaboration
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
Apres-Cyber - The Data Dilemma: Bridging Offensive Operations and Machine Lea...
 
Nanopower In Semiconductor Industry.pdf
Nanopower  In Semiconductor Industry.pdfNanopower  In Semiconductor Industry.pdf
Nanopower In Semiconductor Industry.pdf
 

Hadoop MapReduce Fundamentals

  • 1. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 1 of 5
  • 3. What is Hadoop? Open-source data storage and processing API. Massively scalable, automatically parallelizable. Based on work from Google: GFS + MapReduce + BigTable. Current distributions based on open source and vendor work: Apache Hadoop, Cloudera (CDH4 w/ Impala), Hortonworks, MapR, AWS, Windows Azure HDInsight.
  • 4. Why Use Hadoop?  Cheaper  Scales to Petabytes or more  Faster  Parallel data processing  Better  Suited for particular types of BigData problems
  • 5. What types of business problems for Hadoop? Source: Cloudera “Ten Common Hadoopable Problems”
  • 6. Companies Using Hadoop  Facebook  Yahoo  Amazon  eBay  American Airlines  The New York Times  Federal Reserve Board  IBM  Orbitz
  • 7. Forecast growth of Hadoop Job Market Source: Indeed -- http://www.indeed.com/jobtrends/Hadoop.html
  • 8. Hadoop is a set of Apache Frameworks and more… Data storage (HDFS): runs on commodity hardware (usually Linux); horizontally scalable. Processing (MapReduce): parallelized (scalable) processing; fault tolerant. Other tools / frameworks: data access (HBase, Hive, Pig, Mahout); tools (Hue, Sqoop); monitoring (Greenplum, Cloudera). Diagram labels: Hadoop Core - HDFS; MapReduce API; Data Access; Tools & Libraries; Monitoring & Alerting.
  • 9. What are the core parts of a Hadoop distribution?
  • 10. Hadoop Cluster HDFS (Physical) Storage
  • 11. MapReduce Job – Logical View Image from - http://mm-tom.s3.amazonaws.com/blog/MapReduce.png
  • 14. Common Hadoop Distributions  Open Source  Apache  Commercial  Cloudera  Hortonworks  MapR  AWS MapReduce  Microsoft HDInsight (Beta)
  • 15. A View of Hadoop (from Hortonworks) Source: “Intro to Map Reduce” -- http://www.youtube.com/watch?v=ht3dNvdNDzI
  • 16. Setting up Hadoop Development
  • 17. Demo – Setting up Cloudera Hadoop Note: Demo VMs can be downloaded from - https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
  • 18. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 2 of 5
  • 19. So, what’s the problem? “I can just use some ‘SQL-like’ language to query Hadoop, right?” “Yeah, SQL-on-Hadoop… that’s what I want.” “I don’t want to learn a new query language, and…” “I want massive scale for my shiny, new BigData.”
  • 20. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 21. Demo – Using Hive QL on CDH4
  • 22. What is Hive?  a data warehouse system for Hadoop that  facilitates easy data summarization  supports ad-hoc queries (still batch though…)  created by Facebook  a mechanism to project structure onto this data and query the data using a SQL-like language – HiveQL  Interactive-console –or-  Execute scripts  Kicks off one or more MapReduce jobs in the background  an ability to use indexes, built-in user-defined functions
  • 23. Is HQL == ANSI SQL? – NO!
      -- non-equality joins ARE allowed in ANSI SQL
      -- but are NOT allowed in Hive (HQL)
      SELECT a.* FROM a JOIN b ON (a.id <> b.id)
    Note: Joins are quite different in MapReduce, more on that coming up…
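One way to see why equality joins fit MapReduce more naturally than `<>` joins: the shuffle routes records by key, so rows from both tables that share a join key meet in the same reduce call. The following is a minimal sketch of a reduce-side equi-join in plain Python, not Hive's actual planner; the function names and the `"a"`/`"b"` tags are assumptions made for illustration.

```python
from collections import defaultdict


def map_tagged(table_name, rows, key_field):
    """Map: emit (join key, (source tag, row)) for every row."""
    return [(row[key_field], (table_name, row)) for row in rows]


def reduce_join(key, tagged_rows):
    """Reduce: pair every 'a' row with every 'b' row that shares this key."""
    left = [row for tag, row in tagged_rows if tag == "a"]
    right = [row for tag, row in tagged_rows if tag == "b"]
    return [(left_row, right_row) for left_row in left for right_row in right]


def equi_join(a_rows, b_rows, key_field="id"):
    """Simulate map -> shuffle (group by key) -> reduce for a JOIN ... ON a.id = b.id."""
    groups = defaultdict(list)
    tagged = map_tagged("a", a_rows, key_field) + map_tagged("b", b_rows, key_field)
    for key, value in tagged:
        groups[key].append(value)  # the shuffle: same key, same group
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


if __name__ == "__main__":
    a = [{"id": 1, "x": "a1"}, {"id": 2, "x": "a2"}]
    b = [{"id": 2, "y": "b2"}, {"id": 3, "y": "b3"}]
    print(equi_join(a, b))
```

A condition such as `a.id <> b.id` gives the shuffle no single key to group on, so in this model every row of one table would have to be compared with every row of the other.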
  • 25. Common Hadoop Shell Commands
      hadoop fs -cat file:///file2
      hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
      hadoop fs -copyFromLocal <fromDir> <toDir>
      hadoop fs -put <localfile> hdfs://nn.example.com/hadoop/hadoopfile
      sudo hadoop jar <jarFileName> <method> <fromDir> <toDir>
      hadoop fs -ls /user/hadoop/dir1
      hadoop fs -cat hdfs://nn1.example.com/file1
      hadoop fs -get /user/hadoop/file <localfile>
    Tips
      -- ‘sudo’ means ‘run as administrator’ (super user)
      -- some Hadoop configurations use ‘hadoop dfs’ rather than ‘hadoop fs’; file paths to Hadoop differ for the former, see the link included for more detail
  • 26. Demo – Working with Files and HDFS
  • 27. Thinking in MapReduce  Hint: “It’s Functional”
  • 28. Understanding MapReduce – P1/3. Map>> (K1, V1) -> list(K2, V2). (K1, V1): info in, from the input split. list(K2, V2): key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner).
  • 29. Understanding MapReduce – P2/3. Map>> (K1, V1) -> list(K2, V2). (K1, V1): info in, from the input split. list(K2, V2): key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner). Shuffle/Sort>>
  • 30. Understanding MapReduce – P3/3. Map>> (K1, V1) -> list(K2, V2). (K1, V1): info in, from the input split. list(K2, V2): key / value out (intermediate values); one list per local node; can implement a local Reducer (or Combiner). Shuffle/Sort>> Reduce: (K2, list(V2)) -> list(K3, V3). The Shuffle / Sort phase precedes the Reduce phase and combines Map output into a list; Reduce usually aggregates intermediate values. Overall: (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
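The flow on slides 28 to 30 can be traced with a small single-process simulation. This is only a sketch of the data flow for a word count, not Hadoop code: the function names are assumptions, and each inner list in `splits` stands in for one input split handled by one mapper.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def map_phase(offset, line):
    """Map: (K1, V1) -> list(K2, V2). K1 is a line offset, V1 the line text."""
    return [(word.lower(), 1) for word in line.split()]


def combine(pairs):
    """Combiner: a local reduce over one mapper's output, before the shuffle."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return list(totals.items())


def shuffle_sort(mapper_outputs):
    """Shuffle/Sort: gather all pairs, sort by key, group into (K2, list(V2))."""
    all_pairs = sorted(
        (pair for output in mapper_outputs for pair in output),
        key=itemgetter(0),
    )
    return [
        (key, [value for _, value in group])
        for key, group in groupby(all_pairs, key=itemgetter(0))
    ]


def reduce_phase(key, values):
    """Reduce: (K2, list(V2)) -> list(K3, V3). Here it sums the counts."""
    return [(key, sum(values))]


def word_count(splits):
    """Run map -> combine -> shuffle/sort -> reduce over all splits."""
    mapper_outputs = []
    for split in splits:
        pairs = []
        for offset, line in enumerate(split):
            pairs.extend(map_phase(offset, line))
        mapper_outputs.append(combine(pairs))
    results = []
    for key, values in shuffle_sort(mapper_outputs):
        results.extend(reduce_phase(key, values))
    return results


if __name__ == "__main__":
    splits = [["the quick brown fox", "the lazy dog"], ["the end"]]
    print(word_count(splits))
```

Because the combiner runs per split, the reducer for `the` receives `[2, 1]` rather than `[1, 1, 1]`, which is the saving a local Reducer provides.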
  • 32. MapReduce Objects Each daemon spawns a new JVM
  • 33. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 34. Demo – Running MapReduce WordCount
  • 35. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 3 of 5
  • 36. Ways to run MapReduce Jobs  Configure JobConf options  From Development Environment (IDE)  From a GUI utility  Cloudera – Hue  Microsoft Azure – HDInsight console  From the command line  hadoop jar <filename.jar> input output
  • 37. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 38. Setting up Hadoop On Windows Azure  About HDInsight
  • 39. Demo – MapReduce in the Cloud  WordCount MapReduce using HDInsight
  • 40. MapReduce (WordCount) with JavaScript Note: JavaScript is part of the Azure Hadoop distribution
  • 41. Common Data Sources for MapReduce Jobs
  • 42. Where is your Data coming from?  On premises  Local file system  Local HDFS instance  Private Cloud  Cloud storage  Public Cloud  Input Storage buckets  Script / Code buckets  Output buckets
  • 43. Common Data Jobs for MapReduce
  • 44. Demo – Other Types of MapReduce Tip: Review the Java MapReduce code in these samples as well.
  • 45. Methods to write MapReduce Jobs  Typical – usually written in Java  MapReduce 2.0 API  MapReduce 1.0 API  Streaming  Uses stdin and stdout  Can use any language to write Map and Reduce Functions  C#, Python, JavaScript, etc…  Pipes  Often used with C++  Abstraction libraries  Hive, Pig, etc… write in a higher level language, generate one or more MapReduce jobs
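The Streaming option on slide 45 relies only on stdin and stdout, so a mapper and reducer can be ordinary scripts. Below is a minimal word-count sketch in Python; the tab-separated `word<TAB>count` line format and the `map` / `reduce` command-line argument are assumptions chosen for this example, not something the slides specify.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated 'word<TAB>1' line per word read from the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input must already be sorted by key, as after the shuffle."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical convention: the first argument selects the role.
    step = mapper if sys.argv[1] == "map" else reducer
    for out in step(sys.stdin):
        print(out)
```

Saved under a hypothetical name such as `wc.py`, the pair can be checked locally with a pipe that mimics the shuffle: `cat input.txt | python wc.py map | sort | python wc.py reduce`.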
  • 46. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 47. Demo – MapReduce via C# & PowerShell
  • 48. Ways to MapReduce Libraries Languages Note: Java is most common, but other languages can be used
  • 49. Using AWS MapReduce Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • 50. What is Pig?  ETL Library for HDFS developed at Yahoo  Pig Runtime  Pig Language  Generates MapReduce Jobs  ETL steps  LOAD <file>  FILTER, JOIN, GROUP BY, FOREACH, GENERATE, COUNT…  DUMP {to screen for testing}  STORE <newFile>
  • 51. MapReduce Python Sample Remember that white space matters in Python!
  • 52. Demo – Using AWS MapReduce with Pig Note: You can select Apache or MapR Hadoop Distributions to run your MapReduce job on the AWS Cloud
  • 53. AWS Data Pipeline with HIVE
  • 54. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 4 of 5
  • 55. Better MapReduce - Optimizations
  • 56. Optimization BEFORE running a MapReduce Job
  • 57. More about Input File Compression. From Cloudera… their version of LZO is ‘splittable’.
      Type  File     Size (GB)  Compress  Decompress
      None  Log      8.0        -         -
      Gzip  Log.gz   1.3        241       72
      LZO   Log.lzo  2.0        55        35
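The size versus speed trade-off in the table above can be explored on a small scale. The sketch below is not a reproduction of Cloudera's figures: LZO is not in the Python standard library, so `gzip` and `bz2` are used as stand-ins, and the log data is synthetic and hypothetical.

```python
import bz2
import gzip
import time


def make_log(lines=20000):
    """Build a synthetic, repetitive log as bytes (made-up data, not the slide's file)."""
    row = "2013-04-01 12:00:{:02d} INFO request id={} status=200\n"
    return "".join(row.format(i % 60, i) for i in range(lines)).encode("utf-8")


def measure(name, compress, decompress, data):
    """Return the compression ratio and timings for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    compress_s = time.perf_counter() - start
    start = time.perf_counter()
    restored = decompress(packed)
    decompress_s = time.perf_counter() - start
    assert restored == data  # the round trip must be lossless
    return {
        "codec": name,
        "ratio": len(packed) / len(data),
        "compress_s": compress_s,
        "decompress_s": decompress_s,
    }


def compare(data):
    """Measure both stand-in codecs on the same input."""
    return [
        measure("gzip", gzip.compress, gzip.decompress, data),
        measure("bz2", bz2.compress, bz2.decompress, data),
    ]


if __name__ == "__main__":
    for row in compare(make_log()):
        print(row)
```

Timings vary by machine and data, so the numbers are only useful relative to each other on the same input.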
  • 58. Optimization WITHIN a MapReduce Job
  • 61. Data Types  Writable  Text (String)  IntWritable  LongWritable  FloatWritable  BooleanWritable  WritableComparable for keys  Custom Types supported – write RawComparator
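To make the Writable and WritableComparable roles concrete, here is a Python analogue of a custom key type: it serializes its fields in a fixed order and defines a sort order. This is a sketch of the idea only, not the Hadoop Java interfaces; `YearTempKey`, `round_trip`, and the two-integer layout are hypothetical.

```python
import io
import struct
from functools import total_ordering


@total_ordering
class YearTempKey:
    """Hypothetical composite key: (year, temp), each stored as a 32-bit int."""

    def __init__(self, year=0, temp=0):
        self.year = year
        self.temp = temp

    def write(self, out):
        """Serialize fields in a fixed order (big-endian, like Java's DataOutput)."""
        out.write(struct.pack(">ii", self.year, self.temp))

    def read_fields(self, inp):
        """Deserialize fields in the same order they were written."""
        self.year, self.temp = struct.unpack(">ii", inp.read(8))

    def __eq__(self, other):
        return (self.year, self.temp) == (other.year, other.temp)

    def __lt__(self, other):
        # The comparison that a sort phase would use to order keys.
        return (self.year, self.temp) < (other.year, other.temp)

    def __hash__(self):
        return hash((self.year, self.temp))

    def __repr__(self):
        return f"YearTempKey({self.year}, {self.temp})"


def round_trip(key):
    """Write a key to a byte buffer and read it back into a fresh object."""
    buffer = io.BytesIO()
    key.write(buffer)
    buffer.seek(0)
    copy = YearTempKey()
    copy.read_fields(buffer)
    return copy


if __name__ == "__main__":
    keys = [YearTempKey(2013, 30), YearTempKey(2012, 45), YearTempKey(2013, -5)]
    print(sorted(round_trip(k) for k in keys))
```

Values only need the serialization half; keys also need the ordering, because the framework sorts by key between map and reduce.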
  • 64. Demo – Unit Testing MapReduce  Using MRUnit + Asserts  Optionally using ApprovalTests Image from http://c0de-x.com/wp-content/uploads/2012/10/staredad_english.png
  • 65. A note about MapReduce 2.0  Splits the existing JobTracker’s roles  resource management  job lifecycle management  MapReduce 2.0 provides many benefits over the existing MapReduce framework, such as better scalability  through distributed job lifecycle management  support for multiple Hadoop MapReduce API versions in a single cluster
  • 66. What is Mahout?  Library with common machine learning algorithms  Over 20 algorithms  Recommendation (likelihood – Pandora)  Classification (known data and new data – spam id)  Clustering (new groups of similar data – Google news)  Can non-statisticians find value using this library?
  • 68. Setting up Hadoop on Windows  For local development  Install from binaries from Web Platform Installer  Install .NET Azure SDK (for Azure BLOB storage)  Install other tools  Neudesic Azure Storage Viewer
  • 69. Demo – Mahout  Using HDInsight
  • 70. What about the output?
  • 71. Clients (Visualizations) for HDFS  Many clients use Hive  Often included in GUI console tools for Hadoop distributions as well  Microsoft includes clients in Office (Excel 2013)  Direct Hive client  Connect using ODBC  PowerPivot – data mashups and presentation  Data Explorer – connect, transform, mashup and filter  Hadoop SDK on Codeplex  Other popular clients  Qlikview  Tableau  Karmasphere
  • 72. Demo – Executing Hive Queries
  • 73. Demo – Using HDFS output in Excel 2013 To download Data Explorer: http://www.microsoft.com/en-us/download/details.aspx?id=36803
  • 75. Demo – New Visualizations – D3
  • 76. Hadoop MapReduce Fundamentals @LynnLangit a five part series – Part 5 of 5
  • 78. Comparing: RDBMS vs. Hadoop
                           Traditional RDBMS        Hadoop / MapReduce
      Data Size            Gigabytes (Terabytes)    Petabytes (Exabytes)
      Access               Interactive and Batch    Batch – NOT Interactive
      Updates              Read / Write many times  Write once, Read many times
      Structure            Static Schema            Dynamic Schema
      Integrity            High (ACID)              Low
      Scaling              Nonlinear                Linear
      Query Response Time  Can be near immediate    Has latency (due to batch processing)
  • 79. Microsoft alternatives to MapReduce  Use existing relational system  Scale via cloud or edition (i.e. Enterprise or PDW)  Use in memory OLAP  SQL Server Analysis Services Tabular Models  Use “productized” Dremel  Microsoft Polybase – status = beta?
  • 80. Looking Forward - Dremel or Apache Drill  Based on original research from Google
  • 82. In-market MapReduce Alternatives. Cloudera: Impala. Google: BigQuery.
  • 83. Demo – Google’s BigQuery  Dremel for the rest of us
  • 85. More MapReduce Developer Resources
      Based on the distribution – on premises
        Apache: MapReduce tutorial - http://hadoop.apache.org/docs/r1.0.4/mapred_tutorial.html
        Cloudera: Cloudera University - http://university.cloudera.com/
        Cloudera: Cloudera Developer Course (4 day) - *RECOMMENDED* - http://university.cloudera.com/training/apache_hadoop/developer.html
        Hortonworks
        MapR
      Based on the distribution – cloud
        AWS MapReduce: Tutorial - http://aws.amazon.com/elasticmapreduce/training/#gs
        Windows Azure HDInsight: Tutorial - http://www.windowsazure.com/en-us/manage/services/hdinsight/using-mapreduce-with-hdinsight/
        More resources - http://www.windowsazure.com/en-us/develop/net/tutorials/intro-to-hadoop/
  • 86. The Changing Data Landscape

Editor's Notes

  1. http://en.wikipedia.org/wiki/MapReduce
  2. http://allthingsd.com/files/2012/04/big-numbers.jpg
  3. http://www.cloudera.com/content/dam/cloudera/Resources/PDF/cloudera_White_Paper_Ten_Hadoopable_Problems_Real_World_Use_Cases.pdf Also -- http://gigaom.com/2012/06/05/10-ways-companies-are-using-hadoop-to-do-more-than-serve-ads/
  4. Image: http://siliconangle.com/files/2012/08/hadoop-300x300.jpg
  5. http://www.platfora.com/wp-content/themes/PlatforaV2.0/img/enter/deployment_pick_graphic.png
  6. http://indoos.files.wordpress.com/2010/08/hadoop_map1.png?w=819&h=612
  7. http://datameer2.datameer.com/blog/wp-content/uploads/2012/06/hadoop_ecosystem_d3_photoshop.jpg http://datameer2.datameer.com/blog/wp-content/uploads/2013/01/hadoop_ecosystem_clean.png http://www.datameer.com/blog/perspectives/hadoop-ecosystem-as-of-january-2013-now-an-app.html
  8. Image from: http://vichargrave.com/wp-content/uploads/2013/02/Hadoop-Development.png http://wiki.apache.org/hadoop/HowToSetupYourDevelopmentEnvironment https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM+for+CDH4
  9. https://ccp.cloudera.com/display/SUPPORT/CDH+Downloads
  10. http://queryio.com/hadoop-big-data-images/hadoop-sql.jpg
  11. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  12. http://hive.apache.org/ https://cwiki.apache.org/confluence/display/Hive/GettingStarted
  13. https://cwiki.apache.org/confluence/display/Hive/LanguageManual http://en.wikipedia.org/wiki/Apache_Hive
  14. http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html http://nsinfra.blogspot.in/2012/06/difference-between-hadoop-dfs-and.html
  15. http://www.fincher.org/tips/General/SoftwareEngineering/FunctionalProgramming.shtml http://rbxbx.info/images/fault-tolerance.png
  16. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  17. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  18. The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job.
  19. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  20. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  21. http://www.windowsazure.com/en-us/manage/services/hdinsight/get-started-hdinsight/
  22. Image from http://curiousellie.typepad.com/.a/6a0133ec911c1f970b0168ebe6a2e4970c-500wi
  23. http://hadoop.apache.org/docs/r1.1.2/streaming.html How to run and compile a Hadoop Java program -- https://sites.google.com/site/hadoopandhive/home/how-to-run-and-compile-a-hadoop-program Sample code to compile a Java class (on Linux, where classpath entries are separated by ':'): javac -classpath ~/hadoop/hadoop-core-1.0.1.jar:commons-cli-1.2.jar -d classes <nameOfJavaFile>.java && jar -cvf <nameOfJarFile>.jar -C classes/ .
  24. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  25. http://blogs.msdn.com/b/carlnol/archive/2013/02/05/submitting-hadoop-mapreduce-jobs-using-powershell.aspx
  26. http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/
  27. About: Pig - http://en.wikipedia.org/wiki/Pig_(programming_tool) PigLatin language reference - http://pig.apache.org/docs/r0.10.0/start.html#pl-statements
  28. http://blog.cloudera.com/blog/2009/11/hadoop-at-twitter-part-1-splittable-lzo-compression/
  29. http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ http://www.slideshare.net/cloudera/mr-perf
  30. http://4.bp.blogspot.com/-2S6IuPD71A8/TZiNw8AyWkI/AAAAAAAAB0k/tS5QTP9SzHA/s1600/Detailed%2BHadoop%2BMapreduce%2BData%2BFlow.png
  31. The key and value classes have to be serializable by the framework and hence need to implement the Writable interface. Additionally, the key classes have to implement the WritableComparable interface to facilitate sorting by the framework.
  32. Tips from Cloudera -- http://blog.cloudera.com/blog/2009/12/7-tips-for-improving-mapreduce-performance/ and http://www.slideshare.net/Hadoop_Summit/optimizing-mapreduce-job-performance
  33. http://blog.cloudera.com/blog/2012/02/mapreduce-2-0-in-hadoop-0-23/ http://hadoop.apache.org/docs/r0.23.6/api/index.html
  34. http://mahout.apache.org/
  35. Download local Hadoop via the Web Platform Installer. Also download the Azure .NET SDK for VS 2012. Link to download Windows Azure storage explorer: http://azurestorageexplorer.codeplex.com/ Link for downloading .NET SDK for Hadoop: http://hadoopsdk.codeplex.com/wikipage?title=roadmap&referringTitle=Home
  36. Image from - http://bluewatersql.files.wordpress.com/2013/04/image12.png
  37. http://www.research-live.com/Journals/1/Files/2013/1/11/covermania.jpg
  38. https://github.com/mbostock/d3/wiki/Gallery
  39. Original Reference: Tom White’s Hadoop: The Definitive Guide (I made some modifications based on my experience)
  40. http://research.google.com/pubs/pub36632.html
  41. https://docs.google.com/document/d/1QTL8warUYS2KjldQrGUse7zp8eA72VKtLOHwfXy6c7I/edit
  42. http://cloudera.com/content/cloudera/en/campaign/introducing-impala.html GigaOm ‘The Future…of Hadoop is real-time’ -- http://gigaom.com/2013/03/07/5-reasons-why-the-future-of-hadoop-is-real-time-relatively-speaking/ http://devopsangle.com/2012/08/20/googles-dremel-here-comes-a-new-challenger-to-yarnhadoop/