Workshop: From Zero to _
Budapest DW Forum 2014
Agenda today
1. Some setup before we start
2. (Back to the) introduction
3. Our workshop today
4. Part 1: a simple Pig Latin job on EMR
5. Part 2: a simple Scalding job on EMR
6. Part 3: a more complex Scalding job on EMR
Some setup before we start
There is a lot to copy and paste – so let’s all join a Google
Hangout chat
http://bit.ly/1xgSQId
• If I forget to paste some content into the chat room, just shout
out and remind me
First, let’s all download and set up VirtualBox and Vagrant
http://docs.vagrantup.com/v2/installation/index.html
https://www.virtualbox.org/wiki/Downloads
Now let’s set up our development environment
$ vagrant plugin install vagrant-vbguest
If you have git already installed:
$ git clone --recursive https://github.com/snowplow/dev-environment.git
If not:
$ wget https://github.com/snowplow/dev-environment/archive/temp.zip
$ unzip temp.zip
$ wget https://github.com/snowplow/ansible-playbooks/archive/temp.zip
$ unzip temp.zip
Now let’s set up our development environment
$ cd dev-environment
$ vagrant up
$ vagrant ssh
Final step for now, let’s install some software
$ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
$ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
(Back to the) introduction
Snowplow is an open-source web and event analytics platform,
built on Hadoop
• Co-founders Alex Dean and Yali Sassoon met at
OpenX, the open-source ad technology business
in 2008
• We released Snowplow as a skunkworks
prototype at the start of 2012:
github.com/snowplow/snowplow
• We built Snowplow on top of Hadoop from the
very start
We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your
specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your
data
Data pipeline -> Data warehouse -> Analyse your data in any analysis tool
And we saw the potential of new “big data” technologies and
services to solve these problems in a scalable, low-cost manner
These tools make it possible to capture, transform, store and analyse all your
granular, event-level data, so you can perform any analysis
CloudFront · Amazon S3 · Amazon EMR · Amazon Redshift
Our Snowplow event processing flow runs on Hadoop,
specifically Amazon’s Elastic MapReduce hosted Hadoop service
Snowplow Hadoop data pipeline: Website / webapp (JavaScript event tracker) -> CloudFront-based or Clojure-based event collector -> Amazon S3 -> Scalding-based enrichment on Hadoop -> Amazon Redshift / PostgreSQL
Why did we pick Hadoop?
• Scalability: we have customers processing 350m Snowplow events a day in Hadoop – the job runs in <2 hours
• Easy to reprocess data: if business rules change, we can fire up a large cluster and re-process all historical raw Snowplow events
• Highly testable: we write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop
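To make the “highly testable” point concrete: here is a minimal sketch (not verbatim from our codebase) of the kind of Specs2 test that Scalding’s JobTest harness lets us run locally, against the word count job we will build in Part 2:

import com.twitter.scalding._
import org.specs2.mutable.Specification

class WordCountJobSpec extends Specification {

  "A WordCount job" should {
    // Run the job in-memory: fake source in, captured sink out
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob")
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      .source(TextLine("inputFile"), List((0, "hack hack hack and hack")))
      .sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val outMap = outputBuffer.toMap
        "count words correctly" in {
          outMap("hack") must be_==(4)
          outMap("and") must be_==(1)
        }
      }
      .run
      .finish
  }
}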
And why Amazon’s Elastic MapReduce (EMR)?
• No need to run our own cluster: running your own Hadoop cluster is a huge pain – not for the fainthearted. By contrast, EMR just works (most of the time!)
• Elastic: Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after
• Interop with other AWS services: EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift (hosted columnar database) too
Our workshop today
Hadoop is complicated…
… for our workshop today, we will stick to using Elastic
MapReduce and try to avoid any unnecessary complexity
… and we will learn by doing!
• Lots of books and articles about Hadoop and the theory of
MapReduce
• We will learn by doing – no theory unless it’s required to
directly explain the jobs we are creating
• Our priority is to get you up-and-running on Elastic
MapReduce, and confident enough to write your own
Hadoop jobs
Part 1: a simple Pig Latin job on EMR
What is Pig (Latin)?
• Pig is a high-level platform for creating MapReduce jobs which can run
on Hadoop
• The language you write Pig jobs in is called Pig Latin
• For quick-and-dirty scripts, Pig just works
(The stack, bottom to top: Hadoop DFS -> Hadoop MapReduce -> Java -> Crunch / Hive / Pig / Cascading)
Let’s all come up with a unique name for ourselves
• Lowercase letters, no spaces or hyphens or anything
• E.g. I will be alexsnowplow – please come up with a unique name for
yourself!
• It will be visible to other participants so choose something you don’t
mind being public :-)
• In the rest of this workshop, wherever you see YOURNAME, replace it
with your unique name
Let’s restart our Vagrant and do some setup
$ mkdir zero2hadoop
$ aws configure
// And type in:
AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ
AWS Secret Access Key [None]: KMVdr/bsq4FDTI5H143K3gjt4ErG2oTjd+1+a+ou
Default region name [None]: eu-west-1
Default output format [None]:
Let’s create some buckets in Amazon S3 – this is where our data
and our apps will live
$ aws s3 mb s3://zero2hadoop-in-YOURNAME
$ aws s3 mb s3://zero2hadoop-out-YOURNAME
$ aws s3 mb s3://zero2hadoop-jobs-YOURNAME
// Check those worked
$ aws s3 ls
Let’s get some source data uploaded
$ mkdir -p ~/zero2hadoop/part1/in
$ cd ~/zero2hadoop/part1/in
$ wget https://raw.githubusercontent.com/snowplow/scalding-example-project/master/data/hello.txt
$ cat hello.txt
Hello world
Goodbye world
$ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt
Let’s get our EMR command-line tools installed (1/2)
$ /vagrant/emr-cli/elastic-mapreduce
$ rvm install ruby-1.8.7-head
$ rvm use 1.8.7
$ alias emr=/vagrant/emr-cli/elastic-mapreduce
Let’s get our EMR command-line tools installed (2/2)
Add this file:
{
  "access_id": "AKIAI55OSYYRLYWLXH7A",
  "private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP",
  "region": "eu-west-1"
}
to: /vagrant/emr-cli/credentials.json
(If AWS rejects your requests because the VM’s clock has drifted, resync it: sudo sntp -s 24.56.178.140)
Let’s check our EMR command-line tools are working
// This should work fine now:
$ emr --list
<no output>
Let’s do some local file work
$ mkdir -p ~/zero2hadoop/part1/pig
$ cd ~/zero2hadoop/part1/pig
$ wget https://gist.githubusercontent.com/alexanderdean/d8371cebdf00064591ae/raw/cb3030a6c48b85d101e296ccf27331384df3288d/wordcount.pig
// The original: https://gist.github.com/alexanderdean/d8371cebdf00064591ae
Now upload to S3
$ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/
$ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/
2014-06-06 09:10:31        674 part1/wordcount.pig
And now we run our Pig script
$ emr --create --name "part1 YOURNAME" 
--set-visible-to-all-users true 
--pig-script s3n://zero2hadoop-jobs-
YOURNAME/part1/wordcount.pig 
--ami-version 2.0 
--args "-p,INPUT=s3n://zero2hadoop-in-
YOURNAME/part1, 
-p,OUTPUT=s3n://zero2hadoop-out-
YOURNAME/part1"
Let’s check out the jobs running in Elastic MapReduce – first at
the console
$ emr --list
j-1HR90SWPP40M4     STARTING      part1 YOURNAME
   PENDING         Setup Pig
   PENDING         Run Pig Script
and also in the UI
Okay let’s check the output of our job! (1/2)
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part1
2014-06-06 09:57:53          0 part1/_SUCCESS
2014-06-06 09:57:50         26 part1/part-r-00000
Okay let’s check the output of our job! (2/2)
$ mkdir -p ~/zero2hadoop/part1/out
$ cd ~/zero2hadoop/part1/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part1 .
$ ls
part-r-00000 _SUCCESS
$ cat part-r-00000
2 world
1 Hello
1 Goodbye
Part 2: a simple Scalding job on EMR
What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building
data processing pipelines on Hadoop:
(The stack, bottom to top: Hadoop DFS -> Hadoop MapReduce -> Java -> Cascading, alongside Pig and others -> Scalding / Cascalog / PyCascading / cascading.jruby on top of Cascading)
Cascading has a “plumbing” abstraction over vanilla MapReduce
which should be quite comfortable to DW practitioners
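To illustrate the plumbing metaphor, here is a hypothetical Scalding snippet (the sources and field names are invented for this sketch): joining two TSV pipes and aggregating reads much like a SQL JOIN plus GROUP BY:

import com.twitter.scalding._

class RevenuePerCountryJob(args: Args) extends Job(args) {

  // Two source pipes...
  val users  = Tsv(args("users"),  ('userId, 'country))
  val orders = Tsv(args("orders"), ('orderId, 'buyerId, 'value))

  // ...plumbed together, grouped, and written out to a sink
  users.joinWithSmaller('userId -> 'buyerId, orders)
    .groupBy('country) { _.sum[Double]('value -> 'revenue) }
    .write(Tsv(args("output")))
}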
Scalding improves further on Cascading by reducing boilerplate
and making more complex pipelines easier to express
• Scalding is written in Scala – this removes a lot of boilerplate versus
vanilla Cascading, making it easier to look at a job in its entirety
and see what it does
• Scalding was created and is supported by Twitter, who use it
throughout their organization
• We believe that data pipelines should be as strongly
typed as possible – all the other DSLs/APIs on top of
Cascading encourage dynamic typing
Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an
unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
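(The slide shows a snippet of Snowplow’s enrichment code; in its place, here is a minimal hypothetical sketch in the same spirit – assuming Scalaz’s Validation, which Snowplow’s enrichment code uses, and invented event types:)

import scalaz._
import Scalaz._

case class RawEvent(page: Option[String], tstamp: String)
case class CanonicalEvent(pageUri: String, tstamp: Long)

// The signature is unambiguous: the caller gets back either a CanonicalEvent
// or a non-empty list of error messages - never an exception
def enrich(raw: RawEvent): ValidationNel[String, CanonicalEvent] = {
  val page: ValidationNel[String, String] = raw.page match {
    case Some(p) => Success(p)
    case None    => Failure(NonEmptyList("Missing page parameter"))
  }
  val tstamp: ValidationNel[String, Long] =
    try Success(raw.tstamp.toLong)
    catch { case _: NumberFormatException =>
      Failure(NonEmptyList("Bad timestamp: " + raw.tstamp)) }
  (page |@| tstamp)(CanonicalEvent.apply) // accumulates errors from both fields
}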
Okay let’s get started!
• Head to https://github.com/snowplow/scalding-example-project
Let’s get this code down locally and build it
$ mkdir -p ~/zero2hadoop/part2
$ cd ~/zero2hadoop/part2
$ git clone git://github.com/snowplow/scalding-example-project.git
$ cd scalding-example-project
$ sbt assembly
Here is our MapReduce code
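(The slide shows the project’s WordCountJob – reproduced below with added comments; treat the repo as canonical if the two ever disagree:)

package com.snowplowanalytics.hadoop.scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {

  // Read each input line, split it into words, count
  // occurrences per word, and write out word<TAB>count
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Split a piece of text into individual words,
  // lowercasing and stripping punctuation along the way
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}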
Good, tests are passing, now let’s upload this to S3 so it’s
available to our EMR job
$ aws s3 cp target/scala-2.10/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
// If that doesn’t work:
$ aws s3 cp s3://snowplow-hosted-assets/third-party/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
$ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/
And now we run it!
$ emr --create --name "part2 YOURNAME" \
  --set-visible-to-all-users true \
  --jar s3n://zero2hadoop-jobs-YOURNAME/part2/scalding-example-project-0.0.5.jar \
  --arg com.snowplowanalytics.hadoop.scalding.WordCountJob \
  --arg --hdfs \
  --arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \
  --arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2
Let’s check out the jobs running in Elastic MapReduce – first at
the console
$ emr --list
j-1M62IGREPL7I      STARTING      scalding-example-project
   PENDING         Example Jar Step
and also in the UI
Okay let’s check the output of our job!
$ aws s3 ls --recursive s3://zero2hadoop-out-YOURNAME/part2
$ mkdir -p ~/zero2hadoop/part2/out
$ cd ~/zero2hadoop/part2/out
$ aws s3 cp --recursive s3://zero2hadoop-out-YOURNAME/part2 .
$ ls
$ cat part-00000
goodbye 1
hello 1
world 2
// Note: unlike the Pig output, words come first and are lowercased –
// that’s the tokenize function in WordCountJob at work
Part 3: a more complex Scalding job on EMR
Let’s explore another tutorial together
https://github.com/sharethrough/scalding-emr-tutorial
Questions?
http://snowplowanalytics.com
https://github.com/snowplow/snowplow
@snowplowdata
To talk offline – @alexcrdean on Twitter or
alex@snowplowanalytics.com