Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift

Big Data Infrastructure workshop
A hands-on introduction
Saturday, December 6, 2014

Agenda
08:30 AM Breakfast
09:00 AM Introduction and Strengths of Technologies
10:00 AM Start an EMR Cluster
10:15 AM break + set up query tool
10:30 AM Hadoop hands-on
10:55 AM break
11:10 AM Redshift hands-on
11:40 AM Operationalizing your code
12:00 PM adjourn

DataKitchen Leadership
Chris Bergh (Executive Chef)
Gil Benghiat (VP Product)
Eric Estabrooks (VP Cloud and Data Services)
Software development origins and executive experience delivering enterprise software focused on the Marketing and Health Care sectors.
Deep analytic experience: spent the past decade solving the analytic data preparation problem.
New approach to data preparation and production: focused on the analysts.

Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production

This creates an expectation gap
[Figure: how time is split among Prepare Data, Analyze, and Communicate: the business customer's expectation versus the analyst's reality]
The business does not think that analysts are preparing data. (Analysts don’t want to prepare data.)

What Analysts Really Want:
An Integrated Data Set Ready For Analysis
With: Autonomy & Agility
Without: All the Work & Anxiety

DataKitchen solves this problem.
We are on a mission to prepare data to make analysts successful.


Experience of Audience
• Who considers themselves a(n):
  • Analyst
  • Data scientist
  • Programmer / Scripter
  • On the Business side
• Who knows SQL – can write a simple select?
• Who had an AWS account before today?

What Is Apache Hadoop?
• Software framework
• Large scale processing
• Network of commodity hardware
• Handles hardware failures
http://hadoop.apache.org/

What is Hadoop good for?
• Problems that are huge (batch), but not hard, and can be run in parallel over immutable data
• NOT OLTP (e.g. backend to e-commerce site)
• Providing a Map Reduce framework

Map Reduce
http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf
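The three phases of a map reduce job can be sketched in plain Python using word count. This is a single-process illustration only: the function names are made up for this sketch and are not part of Hadoop, and a real cluster would run the map and reduce calls in parallel on many machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # Each line is mapped independently, which is what makes the job parallel.
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data lake"])
print(counts)
```

Because each line is mapped on its own and each key is reduced on its own, the work splits cleanly across machines.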

You can write map reduce jobs in your favorite language
Streaming Interface
• Lets you specify mappers and reducers
• Supports
  • Java
  • Python
  • Ruby
  • Unix Shell
  • R
  • Any executable
Map Reduce “generators”
• Results in map reduce jobs
• PIG
• Hive
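With the streaming interface, a mapper and a reducer are ordinary programs that read lines and write tab-separated key/value lines, and the reducer receives its input sorted by key. A minimal sketch in Python, assuming a hypothetical log format whose last field is an HTTP status code; `run_locally` is a made-up stand-in for the cluster, not a Hadoop function.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'status<TAB>1' for each log line (the status is the last field)."""
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            yield f"{fields[-1]}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; relies on the input being sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for status, group in groupby(parsed, key=lambda pair: pair[0]):
        total = sum(int(count) for _, count in group)
        yield f"{status}\t{total}"


def run_locally(lines):
    """Stand-in for the cluster: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


log_lines = [
    "2014-12-06 GET /index.html 200",
    "2014-12-06 GET /missing 404",
    "2014-12-06 GET /about.html 200",
]
status_counts = run_locally(log_lines)
print(status_counts)
```

On a cluster, each function would live in its own script that reads `sys.stdin` and prints its output, and the two scripts would be passed to Hadoop Streaming as the mapper and the reducer.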

Applications that lend themselves to map reduce
• Word Count
• PDF Generation (NY Times 11,000,000 articles)
• Analysis of stock market historical data (ROI and standard deviation)
• Geographical Data (Finding intersections, rendering map files)
• Log file querying and analysis
• Statistical machine translation
• Spam detection
• Analyzing Tweets

Would you use an excavator to plant a tomato?

Another use …
Some people use a Hadoop cluster for a “data lake”
• Store all your raw data
• Cook it on demand

[Figure: Hadoop ecosystem diagram, including Impala]
http://pixgood.com/hadoop-ecosystem-diagram.html

Pig
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
• Pig Latin - the scripting language
• Grunt – Shell for executing Pig Commands

http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009
This is what it would be in Java

Hive
You write SQL! Well, almost: it is HiveQL.
SELECT user.*
FROM user
WHERE user.active = 1;
[Diagram: SQL Workbench connecting over JDBC]
The first hands-on session will focus on this.
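To see what the query above returns without a cluster, it can be tried against a tiny in-memory table. This sketch uses Python's built-in `sqlite3` module purely as a stand-in for Hive; the `id` and `name` columns and the sample rows are invented for illustration, and only the `active` column comes from the slide.

```python
import sqlite3

# In-memory database standing in for a Hive table (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (id INTEGER, name TEXT, active INTEGER)")
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [(1, "ana", 1), (2, "bo", 0), (3, "cy", 1)],
)

# The query from the slide: keep only the active users.
active_users = conn.execute(
    "SELECT user.* FROM user WHERE user.active = 1"
).fetchall()
conn.close()

print(sorted(active_users))
```

Only the rows with `active = 1` come back; the inactive user is filtered out.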

In Amazon, the common workflow for batch processing starts and ends with s3.
[Diagram: s3, then a Hive script, then s3 again]

Impala
• Uses SQL very similar to HiveQL
• Runs 10-100x faster than Hive
• Runs in memory, so it does not scale up as well as Hive
• Great for developing your code on a small data set
• Can use interactively with Tableau and other BI tools
• Some batch jobs run faster on Impala than Hive

What is EMR?
• Hadoop offered by Amazon
• EMR = Elastic Map Reduce
• Amazon does almost all of the work to create a cluster

Three ways to pay for EMR
• On Demand - highest price, by the hour, no commitment
  • m1.small $0.055 per hour
  • i2.8xlarge $7.09 per hour
  • (29 different machine options)
• Reservation - 1 and 3 year terms (No, All, & Partial Upfront)
• Spot - lowest price, machine can be taken away
Do I leave my cluster up all the time?
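A little arithmetic helps answer that question. The sketch below uses the two on-demand prices quoted above and assumes that a partial hour is billed as a full hour; the cluster size, job length, and function name are hypothetical.

```python
import math

# On-demand prices per machine-hour, as quoted on the slide.
ON_DEMAND_PRICE_PER_HOUR = {"m1.small": 0.055, "i2.8xlarge": 7.09}


def cluster_cost(instance_type, node_count, hours_per_run, runs=1):
    """Cost in dollars of running a cluster `runs` times.

    Assumption: billing is by the hour, so each run rounds up to whole hours.
    """
    billed_hours = math.ceil(hours_per_run) * runs
    price = ON_DEMAND_PRICE_PER_HOUR[instance_type]
    return round(price * node_count * billed_hours, 2)


# Ten m1.small machines left running for a 30-day month ...
always_on = cluster_cost("m1.small", 10, hours_per_run=24 * 30)
# ... versus started once a day for a 1.5-hour job, then shut down.
nightly_job = cluster_cost("m1.small", 10, hours_per_run=1.5, runs=30)

print(always_on, nightly_job)
```

Under these assumptions, the always-on cluster costs twelve times as much as the one that is started for the job and then shut down.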

Adding machines: Time down, Cost up
Cost in ECU

What Is Redshift?
• Columnar database
• Great for reads
• Scale by adding machines
• Two ways to pay
  • On Demand
  • Reservation
• Good for SQL-based ETL too
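Why a columnar layout is great for reads can be shown with a toy example. This is only an illustration of the idea, not Redshift's actual storage format; the table and its values are invented.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
    {"order_id": 3, "region": "east", "amount": 12.5},
]

# Columnar layout: one contiguous list per column.
columns = {name: [row[name] for row in rows] for name in rows[0]}


def total_from_rows(rows):
    """Must visit every record, even though only one field is needed."""
    return sum(row["amount"] for row in rows)


def total_from_columns(columns):
    """Reads just the one column the query asks for."""
    return sum(columns["amount"])


print(total_from_rows(rows), total_from_columns(columns))
```

Both functions return the same total, but the columnar version never touches `order_id` or `region`, which is what makes aggregate queries over wide tables cheap to read.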
http://hadoop.apache.org/

Redshift Machine Options (on demand prices)
Petabyte scale
Remember: Amazon charges for s3 storage too

Redshift usage pattern
• Load data to s3 first
• Use BI tools to send in SQL
• Amazon Redshift is based on PostgreSQL
The second hands-on session will focus on this.
[Diagram: SQL Workbench connecting over JDBC]


Should I use Redshift or EMR?
Redshift for
• Structured data
• Interactive queries
• Speed
Hadoop for
• Data format flexibility
• Computation flexibility
• Super Big Data
• Try both
• Compare costs
• If it works in Redshift, start there

Performance comparison (3. Join Query)
https://amplab.cs.berkeley.edu/benchmark/

Recap
• Started a Hadoop cluster via the AWS Console (Web UI)
• Loaded Data
• Wrote some queries
• Same for Redshift
Eventually, you will do this for real and have a script that has value.
Now what?

To run your data job you need to …
• Wait for the new data to arrive
• Move it to s3
• Start a cluster
• Load the data
• Run your SQL scripts
• Wait for it to finish
• Shut down your cluster
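The checklist above can be sketched as a small script. Every step here is a hypothetical placeholder that only records its name; a real job would call AWS and run SQL instead. The one design point worth showing is the `try`/`finally`, which shuts the cluster down even when a step fails.

```python
events = []  # records which steps actually ran


def step(name, fail=False):
    """Build a hypothetical placeholder step that only records its name."""
    def run():
        if fail:
            raise RuntimeError(f"{name} failed")
        events.append(name)
    return run


def run_data_job(before_cluster, start_cluster, on_cluster, shut_down_cluster):
    """Run the steps in order; always shut down once the cluster has started."""
    for action in before_cluster:
        action()
    start_cluster()
    try:
        for action in on_cluster:
            action()
    finally:
        # Runs on success and on failure, so a broken job does not keep billing.
        shut_down_cluster()


run_data_job(
    before_cluster=[step("wait for new data"), step("move data to s3")],
    start_cluster=step("start cluster"),
    on_cluster=[step("load data"), step("run SQL scripts"), step("wait for finish")],
    shut_down_cluster=step("shut down cluster"),
)
print(events)
```

If any of the steps on the cluster raises an error, the error still propagates to the caller, but only after the shutdown step has run.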

And hope …
• The new data is in the right format
• Assumptions you made during development are still true
• Someone did not mess up your code with an “easy change”
• The new data transfers run successfully
• A table you depend on has been updated correctly
• The new data has not been truncated by the source
• No data quality issues with the source data
Wouldn’t it be great to turn your hopes into tests?
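A few of these hopes can be written down as checks that run before the job does. The sketch below is one possible shape, not DataKitchen's implementation: the column names, the 50% threshold, and the function name are all assumptions made for illustration.

```python
EXPECTED_COLUMNS = ["order_id", "region", "amount"]  # hypothetical schema


def check_batch(rows, previous_row_count, min_ratio=0.5):
    """Return a list of failed checks; an empty list means the batch passed."""
    failures = []
    # Hope: the new data is in the right format.
    if any(set(row) != set(EXPECTED_COLUMNS) for row in rows):
        failures.append("unexpected columns")
    # Hope: the new data has not been truncated by the source.
    if len(rows) < previous_row_count * min_ratio:
        failures.append("row count dropped sharply")
    # Hope: no data quality issues with the source data.
    if any(row.get("amount") is None or row["amount"] < 0 for row in rows):
        failures.append("missing or negative amount")
    return failures


good_batch = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
]
bad_batch = [{"order_id": 3, "region": "east", "amount": -5.0}]

print(check_batch(good_batch, previous_row_count=2))
print(check_batch(bad_batch, previous_row_count=10))
```

A job would refuse to load a batch whose list of failures is not empty, which turns a silent surprise into an explicit, early error.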

DataKitchen: We produce the data
SQL, tests and the checklist go into a Recipe
Your data are Ingredients
The results are Servings

DataKitchen brings reality in line with expectations
[Figure: time split among Prepare Data, Analyze, and Communicate: business customer expectation, analyst reality, and with DataKitchen]

The story of our first Recipe
With DataKitchen, we got 75% of our time back!
… and we don’t have to remember to shut down our cluster.

Remember to shut down your clusters

Thank you!
Send us an email to receive our newsletter or to give us feedback.
info@datakitchen.io
