R + 15 minutes = Hadoop cluster

The document shows how to use the R programming language and Amazon's Elastic MapReduce service to create a Hadoop cluster on Amazon Web Services in about 15 minutes. It demonstrates a stochastic simulation that estimates π by distributing 1,000 simulations across the Hadoop cluster and combining the results. The total cost of running the cluster was only $0.15, showing how inexpensive it can be to use Hadoop's capabilities.

useR Vignette:

R + 15 minutes = Hadoop cluster

Greater Boston useR Group
February 2011

by

Jeffrey Breen
jbreen@cambridge.aero

Agenda

● What's Hadoop?
● But I don't have Big Data
● Building the cluster
● Estimating π stochastically
● Want to know more?

useR Vignette: R + 15 minutes = Hadoop Cluster Greater Boston useR Meeting, February 2011 Slide 2

MapReduce, Hadoop and Big Data

● Hadoop is an open source implementation of Google's MapReduce-based data processing infrastructure
● Designed to process huge data sets
– “huge” = “all of Facebook's web logs”
– Yahoo! sorted 1 TB in 62 seconds in May 2009
– HDFS distributed file system makes replication decisions based on knowledge of network topology
● Amazon Elastic MapReduce is a full Hadoop stack on EC2

MapReduce = Map + shuffle + Reduce

Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
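
The three phases can be illustrated without a cluster. The following is a minimal sketch in base R, assuming a toy word-count task; the names `documents`, `mapper`, `grouped` and `wordCounts` are made up for illustration and are not part of Hadoop or of the original slides.

```r
# Toy word count showing the three MapReduce phases in base R (no Hadoop involved).
documents <- list("big data big cluster", "small data")

# Map: each input record emits (key, value) pairs; here one (word, 1) pair per word.
mapper <- function(doc) {
  words <- strsplit(doc, " ", fixed = TRUE)[[1]]
  lapply(words, function(w) list(key = w, value = 1))
}
pairs <- unlist(lapply(documents, mapper), recursive = FALSE)

# Shuffle: group all values that share the same key.
keys <- vapply(pairs, function(p) p$key, character(1))
values <- vapply(pairs, function(p) p$value, numeric(1))
grouped <- split(values, keys)

# Reduce: collapse each key's values into a single result.
wordCounts <- vapply(grouped, sum, numeric(1))
print(wordCounts)
```

On a real cluster the map and reduce steps run on different nodes and the framework performs the shuffle; here everything runs in one R session purely to show the data flow.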

But I don't have Big Data

● Agricultural economist J.D. Long doesn't either, but he does have a bunch of simulations to run
● He had a key insight: the input could be a small amount of data (like 1:1000) to serve as random seeds for simulation code in the “mapper” function
● Enjoy Hadoop's infrastructure for job scheduling, fault tolerance, inter-node communication, etc.
● Use Amazon's cloud to scale up quickly as needed
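
The seeds-as-input idea can be tried on a single machine first. This is a rough sketch, assuming a stand-in simulation (`simulateOnce`, a hypothetical name) that only averages random draws; the point is that each input record is just a seed, so the input stays tiny while the work happens inside the function.

```r
# Hypothetical stand-in simulation: the mean of 1,000 standard normal draws.
simulateOnce <- function(seed) {
  set.seed(seed)   # the input record is only a seed
  mean(rnorm(1000))
}

seeds <- as.list(1:1000)                 # tiny input, like 1:1000 on the slide
results <- lapply(seeds, simulateOnce)   # runs serially on the local machine

# The same seed always gives the same result, so a failed task can be rerun safely.
print(identical(simulateOnce(42), results[[42]]))
```

Because every task depends only on its own seed, the tasks are independent and can be handed to different nodes in any order.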

Load the segue library
> library(segue)
Loading required package: rJava
Loading required package: caTools
Loading required package: bitops
Segue did not find your AWS credentials. Please run
the setCredentials() function.

> setCredentials('YOUR_ACCESS_KEY_ID',
'YOUR_SECRET_ACCESS_KEY')

Start the cluster
> myCluster <- createCluster(numInstances=5)
STARTING - 2011-01-04 15:07:53
[…]
BOOTSTRAPPING - 2011-01-04 15:11:28
[…]
WAITING - 2011-01-04 15:15:35
Your Amazon EMR Hadoop Cluster is ready for action.
Remember to terminate your cluster with
stopCluster().
Amazon is billing you!

Estimate π stochastically
> estimatePi <- function(seed){
    set.seed(seed)
    numDraws <- 1e6
    r <- .5 #radius
    x <- runif(numDraws, min=-r, max=r)
    y <- runif(numDraws, min=-r, max=r)
    inCircle <- ifelse( (x^2 + y^2)^.5 < r , 1, 0)
    return(sum(inCircle) / length(inCircle) * 4)
  }
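
The estimator rests on a ratio of areas: a circle of radius r has area πr², the square that encloses it has area 4r², so the fraction of uniformly drawn points that land inside the circle approaches π/4. Below is a minimal local sketch of the same idea. The name `estimatePiLocal` and the smaller default `numDraws` are assumptions for illustration; the sketch compares squared distances against r² and takes the mean of a logical vector, which should be equivalent to the slide's square root and `ifelse` formulation.

```r
# Local variant of the slide's estimator (hypothetical name, fewer draws by default).
estimatePiLocal <- function(seed, numDraws = 1e5) {
  set.seed(seed)
  r <- 0.5                                  # radius of the circle
  x <- runif(numDraws, min = -r, max = r)   # points uniform in the enclosing square
  y <- runif(numDraws, min = -r, max = r)
  inCircle <- (x^2 + y^2) < r^2             # TRUE where the point falls in the circle
  mean(inCircle) * 4                        # fraction inside is about pi / 4
}

estimate <- estimatePiLocal(1)
print(estimate)
```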

Run the simulation
> seedList <- as.list(1:1e3)
> myEstimates <- emrlapply( myCluster, seedList,
estimatePi )
RUNNING - 2011-01-04 15:22:28
[…]
WAITING - 2011-01-04 15:32:18
> myPi <- Reduce(sum, myEstimates) / length(myEstimates)
> format(myPi, digits=10)
[1] "3.141586544"
> format(pi, digits=10)
[1] "3.141592654"
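
Before paying for a cluster, the same pipeline can be dry-run locally. The sketch below is an assumption-laden stand-in, not the author's code: base R's `lapply` takes the place of `emrlapply`, the per-seed function `piFromSeed` is a hypothetical compact estimator with far fewer draws, and only 100 seeds are used. It also shows that `Reduce(sum, ...) / length(...)` is simply the mean of the per-seed estimates, and adds a standard error computed from the spread across seeds.

```r
# Compact per-seed estimator (fewer draws than the slide, to keep the dry run fast).
piFromSeed <- function(seed, numDraws = 1e4) {
  set.seed(seed)
  x <- runif(numDraws, min = -0.5, max = 0.5)
  y <- runif(numDraws, min = -0.5, max = 0.5)
  4 * mean(x^2 + y^2 < 0.25)
}

seedList <- as.list(1:100)
localEstimates <- lapply(seedList, piFromSeed)   # lapply stands in for emrlapply

# Reduce(sum, ...) / length(...) is the mean of the per-seed estimates.
viaReduce <- Reduce(sum, localEstimates) / length(localEstimates)
viaMean <- mean(unlist(localEstimates))

# Standard error of the combined estimate, from the spread across seeds.
stdError <- sd(unlist(localEstimates)) / sqrt(length(localEstimates))
print(c(estimate = viaReduce, stdError = stdError))
```

If the local run behaves as expected, swapping `lapply` back for the cluster call is the only change the slide's workflow needs.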

Won't break the bank

● Total cost: $0.15

Standard On-Demand Instances   Amazon EC2 price per hour   Amazon Elastic MapReduce
                               (On-Demand Instances)       price per hour
Small (Default)                $0.085                      $0.015
Large                          $0.34                       $0.06
Extra Large                    $0.68                       $0.12

Want to know more?

● JD Long's segue package
– http://code.google.com/p/segue/
● Hadoop
– http://hadoop.apache.org/
– Book: http://oreilly.com/catalog/0636920010388
● My blog
– http://jeffreybreen.wordpress.com/2011/01/10/segue-r-to-a
