Hadoop is everywhere these days, but it can seem like a complex, intimidating ecosystem to those who have yet to jump in.
In this hands-on workshop, Alex Dean, co-founder of Snowplow Analytics, will take you "from zero to Hadoop", showing you how to run a variety of simple (but powerful) Hadoop jobs on Elastic MapReduce, Amazon's hosted Hadoop service. Alex will start with a no-nonsense overview of what Hadoop is, explaining its strengths and weaknesses and why it's such a powerful platform for data warehouse practitioners. Then Alex will help get you set up with EMR and Amazon S3, before leading you through a very simple job in Pig, a high-level language for writing Hadoop jobs. After this we will move on to writing a more advanced job in Scalding, Twitter's Scala API for writing Hadoop jobs. For our final job, we will consolidate everything we have learnt by building a more sophisticated job in Scalding.
4. There is a lot to copy and paste – so let's all join a Google Hangout chat: http://bit.ly/1xgSQId
• If I forget to paste some content into the chat room, just shout out and remind me
5. First, let's all download and set up VirtualBox and Vagrant
http://docs.vagrantup.com/v2/installation/index.html
https://www.virtualbox.org/wiki/Downloads
6. Now let's set up our development environment
$ vagrant plugin install vagrant-vbguest
If you have git already installed:
$ git clone --recursive https://github.com/snowplow/dev-environment.git
If not:
$ wget https://github.com/snowplow/dev-environment/archive/temp.zip
$ unzip temp.zip
$ wget https://github.com/snowplow/ansible-playbooks/archive/temp.zip
$ unzip temp.zip
7. Now let's set up our development environment
$ cd dev-environment
$ vagrant up
$ vagrant ssh
8. Final step for now, let's install some software
$ ansible-playbook /vagrant/ansible-playbooks/aws-tools.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
$ ansible-playbook /vagrant/ansible-playbooks/scala-sbt.yml --inventory-file=/home/vagrant/ansible_hosts --connection=local
10. Snowplow is an open-source web and event analytics platform, built on Hadoop
• Co-founders Alex Dean and Yali Sassoon met at OpenX, the open-source ad technology business, in 2008
• We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow
• We built Snowplow on top of Hadoop from the very start
11. We wanted to take a fresh approach to web analytics
• Your own web event data -> in your own data warehouse
• Your own event data model
• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions
• Plug in the broadest possible set of analysis tools to drive value from your data
[Diagram: data pipeline -> data warehouse -> analyse your data in any analysis tool]
12. And we saw the potential of new "big data" technologies and services to solve these problems in a scalable, low-cost manner
• These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis
[Logos: Amazon EMR, Amazon S3, CloudFront, Amazon Redshift]
13. Our Snowplow event processing flow runs on Hadoop, specifically Amazon's Elastic MapReduce hosted Hadoop service
[Diagram of the Snowplow Hadoop data pipeline: website / webapp with JavaScript event tracker -> CloudFront-based or Clojure-based event collector -> Scalding-based enrichment on Hadoop -> Amazon S3 -> Amazon Redshift / PostgreSQL]
14. Why did we pick Hadoop?
• Scalability – we have customers processing 350m Snowplow events a day in Hadoop; the job runs in <2 hours
• Easy to reprocess data – if business rules change, we can fire up a large cluster and re-process all historical raw Snowplow events
• Highly testable – we write unit and integration tests for our jobs and run them locally, giving us confidence that our jobs will run correctly at scale on Hadoop
15. And why Amazon's Elastic MapReduce (EMR)?
• No need to run our own cluster – running your own Hadoop cluster is a huge pain, not for the fainthearted. By contrast, EMR just works (most of the time!)
• Elastic – Snowplow runs as a nightly (sometimes more frequent) batch job. We spin up the EMR cluster to run the job, and shut it down straight after
• Interop with other AWS services – EMR works really well with Amazon S3 as a file store. We are big fans of Amazon Redshift (hosted columnar database) too
18. … for our workshop today, we will stick to using Elastic MapReduce and try to avoid any unnecessary complexity
19. … and we will learn by doing!
• Lots of books and articles about Hadoop and the theory of MapReduce
• We will learn by doing – no theory unless it's required to directly explain the jobs we are creating
• Our priority is to get you up-and-running on Elastic MapReduce, and confident enough to write your own Hadoop jobs
21. What is Pig (Latin)?
• Pig is a high-level platform for creating MapReduce jobs which can run on Hadoop
• The language you write Pig jobs in is called Pig Latin
• For quick-and-dirty scripts, Pig just works
[Diagram: Crunch, Hive, Pig, Cascading and plain Java MapReduce all sit on top of Hadoop MapReduce and the Hadoop DFS]
22. Let's all come up with a unique name for ourselves
• Lowercase letters, no spaces or hyphens or anything
• E.g. I will be alexsnowplow – please come up with a unique name for yourself!
• It will be visible to other participants so choose something you don't mind being public
• In the rest of this workshop, wherever you see YOURNAME, replace it with your unique name
23. Let's restart our Vagrant and do some setup
$ mkdir zero2hadoop
$ aws configure
// And type in:
AWS Access Key ID [None]: AKIAILD6DCBTFI642JPQ
AWS Secret Access Key [None]: KMVdr/bsq4FDTI5H143K3gjt4ErG2oTjd+1+a+ou
Default region name [None]: eu-west-1
Default output format [None]:
24. Let's create some buckets in Amazon S3 – this is where our data and our apps will live
$ aws s3 mb s3://zero2hadoop-in-YOURNAME
$ aws s3 mb s3://zero2hadoop-out-YOURNAME
$ aws s3 mb s3://zero2hadoop-jobs-YOURNAME
// Check those worked
$ aws s3 ls
25. Let's get some source data uploaded
$ mkdir -p ~/zero2hadoop/part1/in
$ cd ~/zero2hadoop/part1/in
$ wget https://raw.githubusercontent.com/snowplow/scalding-example-project/master/data/hello.txt
$ cat hello.txt
Hello world
Goodbye world
$ aws s3 cp hello.txt s3://zero2hadoop-in-YOURNAME/part1/hello.txt
26. Let's get our EMR command-line tools installed (1/2)
$ /vagrant/emr-cli/elastic-mapreduce
$ rvm install ruby-1.8.7-head
$ rvm use 1.8.7
$ alias emr=/vagrant/emr-cli/elastic-mapreduce
27. Let's get our EMR command-line tools installed (2/2)
Add this file:
{
  "access_id": "AKIAI55OSYYRLYWLXH7A",
  "private_key": "SHRXNIBRdfWuLPbCt57ZVjf+NMKUjm9WTknDHPTP",
  "region": "eu-west-1"
}
to: /vagrant/emr-cli/credentials.json
(If AWS rejects requests because the VM's clock has drifted, sync it first: sudo sntp -s 24.56.178.140)
28. Let’s get our EMR command-line tools installed (2/2)
// This should work fine now:
$ emr --list
<no output>
29. Let's do some local file work
$ mkdir -p ~/zero2hadoop/part1/pig
$ cd ~/zero2hadoop/part1/pig
$ wget https://gist.githubusercontent.com/alexanderdean/d8371cebdf00064591ae/raw/cb3030a6c48b85d101e296ccf27331384df3288d/wordcount.pig
// The original:
// https://gist.github.com/alexanderdean/d8371cebdf00064591ae
30. Now upload to S3
$ aws s3 cp wordcount.pig s3://zero2hadoop-jobs-YOURNAME/part1/
$ aws s3 ls --recursive s3://zero2hadoop-jobs-YOURNAME/part1/
2014-06-06 09:10:31        674 part1/wordcount.pig
31. And now we run our Pig script
$ emr --create --name "part1 YOURNAME" \
    --set-visible-to-all-users true \
    --pig-script s3n://zero2hadoop-jobs-YOURNAME/part1/wordcount.pig \
    --ami-version 2.0 \
    --args "-p,INPUT=s3n://zero2hadoop-in-YOURNAME/part1,-p,OUTPUT=s3n://zero2hadoop-out-YOURNAME/part1"
32. Let's check out the jobs running in Elastic MapReduce – first at the console
$ emr --list
j-1HR90SWPP40M4     STARTING
    part1 YOURNAME
    PENDING     Setup Pig
    PENDING     Run Pig Script
37. What is Scalding?
• Scalding is a Scala API over Cascading, the Java framework for building data processing pipelines on Hadoop:
[Diagram: Scalding, Cascalog, PyCascading and cascading.jruby are APIs on top of Cascading (written in Java), which sits alongside Pig and others on top of Hadoop MapReduce and the Hadoop DFS]
38. Cascading has a "plumbing" abstraction over vanilla MapReduce which should be quite comfortable to DW practitioners
39. Scalding improves further on Cascading by reducing boilerplate and making more complex pipelines easier to express
• Scalding is written in Scala – this removes a lot of boilerplate versus vanilla Cascading. It's easier to look at a job in its entirety and see what it does (see the sketch below)
• Scalding was created and is supported by Twitter, who use it throughout their organization
• We believe that data pipelines should be as strongly typed as possible – all the other DSLs/APIs on top of Cascading encourage dynamic typing
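To make that concrete, here is roughly what the job we will run later in this workshop – com.snowplowanalytics.hadoop.scalding.WordCountJob from the scalding-example-project – looks like (lightly paraphrased; check the repository for the exact source):

package com.snowplowanalytics.hadoop.scalding

import com.twitter.scalding._

class WordCountJob(args: Args) extends Job(args) {
  // Read lines, split them into words, count occurrences per word, write TSV
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => tokenize(line) }
    .groupBy('word) { _.size }
    .write(Tsv(args("output")))

  // Lowercase, strip punctuation, split on whitespace
  def tokenize(text: String): Array[String] =
    text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
}

The whole MapReduce job – read lines, tokenize, group by word, count, write – is visible at a glance.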
40. Strongly typed data pipelines – why?
• Catch errors as soon as possible – and report them in a strongly typed way too
• Define the inputs and outputs of each of your data processing steps in an unambiguous way
• Forces you to formally address the data types flowing through your system
• Lets you write code like this:
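The snippet from the original slide is not reproduced in this transcript. As an illustration only – my own sketch using Scalding's typed API, with a hypothetical PageView record and ViewsPerUserJob that are not from the slide – a strongly typed pipeline step might look like this:

import com.twitter.scalding._

// Hypothetical record type – not from the original slide
case class PageView(userId: String, path: String, durationMs: Long)

class ViewsPerUserJob(args: Args) extends Job(args) {
  // Input and output schemas are fixed at compile time: feeding in the
  // wrong shape of data, or writing the wrong tuple type, is a compile error
  TypedPipe.from(TypedTsv[(String, String, Long)](args("input")))
    .map { case (user, path, ms) => PageView(user, path, ms) }
    .groupBy(_.userId)   // Grouped[String, PageView]
    .size                // page views per user: (String, Long)
    .toTypedPipe
    .write(TypedTsv[(String, Long)](args("output")))
}

Because the record type and the sink's tuple type are declared up front, passing the wrong shape of data into a step fails at compile time rather than halfway through a Hadoop run.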
41. Okay let’s get started!
• Head to https://github.com/snowplow/scalding-example-project
42. Let's get this code down locally and build it
$ mkdir -p ~/zero2hadoop/part2
$ cd ~/zero2hadoop/part2
$ git clone git://github.com/snowplow/scalding-example-project.git
$ cd scalding-example-project
$ sbt assembly
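sbt assembly runs the project's test suite before building the fat jar for EMR. The test uses Scalding's JobTest harness, which runs the job entirely locally against in-memory sources and sinks – a simplified sketch (names abbreviated; see the repository for the exact test):

import org.specs._
import com.twitter.scalding._

class WordCountJobTest extends Specification with TupleConversions {
  "A WordCount job" should {
    JobTest("com.snowplowanalytics.hadoop.scalding.WordCountJob")
      .arg("input", "inputFile")
      .arg("output", "outputFile")
      // Fake input source: a single line of text
      .source(TextLine("inputFile"), List((0, "hack hack hack and hack")))
      // Assert on the in-memory output sink
      .sink[(String, Int)](Tsv("outputFile")) { outputBuffer =>
        val counts = outputBuffer.toMap
        "count words correctly" in {
          counts("hack") must be_==(4)
          counts("and") must be_==(1)
        }
      }
      .run
      .finish
  }
}

This is what slide 14 meant by "highly testable": the same job class runs unchanged against in-memory test data on your laptop and against real data on an EMR cluster.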
44. Good, tests are passing, now let's upload this to S3 so it's available to our EMR job
$ aws s3 cp target/scala-2.10/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
// If that doesn't work:
$ aws s3 cp s3://snowplow-hosted-assets/third-party/scalding-example-project-0.0.5.jar s3://zero2hadoop-jobs-YOURNAME/part2/
$ aws s3 ls s3://zero2hadoop-jobs-YOURNAME/part2/
45. And now we run it!
$ emr --create --name "part2 YOURNAME" \
    --set-visible-to-all-users true \
    --jar s3n://zero2hadoop-jobs-YOURNAME/part2/scalding-example-project-0.0.5.jar \
    --arg com.snowplowanalytics.hadoop.scalding.WordCountJob \
    --arg --hdfs \
    --arg --input --arg s3n://zero2hadoop-in-YOURNAME/part1/hello.txt \
    --arg --output --arg s3n://zero2hadoop-out-YOURNAME/part2
46. Let's check out the jobs running in Elastic MapReduce – first at the console
$ emr --list
j-1M62IGREPL7I     STARTING
    scalding-example-project
    PENDING     Example Jar Step