Big Data Infrastructure workshop 
A hands-on introduction 
Saturday, December 6, 2014
Agenda 
08:30 AM Breakfast 
09:00 AM Introduction and Strengths of Technologies 
10:00 AM Start an EMR Cluster 
10:15 AM break + set up query tool 
10:30 AM Hadoop hands-on 
10:55 AM break 
11:10 AM Redshift hands-on 
11:40 AM Operationalizing your code 
12:00 PM adjourn 
Background on your presenters
DataKitchen Leadership 
• Chris Bergh (Executive Chef)
• Gil Benghiat (VP Product)
• Eric Estabrooks (VP Cloud and Data Services)
Software development origins and executive experience delivering enterprise software focused on the Marketing and Health Care sectors.
Deep Analytic Experience: spent the past decade solving the analytic data preparation problem.
New Approach To Data Preparation and Production: focused on the Analysts.
Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production
This creates an expectation gap 
[Figure: two bars, “Business Customer Expectation” and “Analyst Reality”, each split into Prepare Data, Analyze, and Communicate]
The business does not think that Analysts are preparing data.
(Analysts don’t want to prepare data.)
What Analysts Really Want:
An Integrated Data Set Ready For Analysis 
With: Autonomy & Agility 
Without: All the Work & Anxiety
DataKitchen solves this problem.
We are on a mission to prepare data to make analysts successful.
Experience of Audience 
• Who considers themselves one of the following?
  • Analyst
  • Data scientist
  • Programmer / Scripter
  • On the Business side
• Who knows SQL – can write a simple select? 
• Who had an AWS account before today? 
Hadoop & Redshift
What Is Apache Hadoop? 
• Software framework 
• Large scale processing 
• Network of commodity hardware 
• Handles hardware failures 
http://hadoop.apache.org/
What is Hadoop good for? 
• Problems that are huge (batch) but not hard, and that can be run in parallel over immutable data
• NOT OLTP (e.g. the backend of an e-commerce site)
• Providing a Map Reduce framework
Map Reduce 
http://www.cs.berkeley.edu/~matei/talks/2010/amp_mapreduce.pdf 
You can write map reduce jobs in your favorite language 
Streaming Interface 
• Lets you specify mappers and reducers
• Supports 
• Java 
• Python 
• Ruby 
• Unix Shell 
• R 
• Any executable 
Map Reduce “generators” 
• Results in map reduce jobs 
• Pig
• Hive
Applications that lend themselves to map reduce 
• Word Count 
• PDF Generation (NY Times 11,000,000 articles) 
• Analysis of stock market historical data (ROI and standard deviation) 
• Geographical Data (Finding intersections, rendering map files) 
• Log file querying and analysis 
• Statistical machine translation 
• Spam detection 
• Analyzing Tweets 
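Word count is the usual first example. The sketch below imitates, in plain Python, what a streaming job does: a mapper emits tab-separated (word, 1) records, the framework sorts them by key, and a reducer sums the counts for each word. The names `mapper`, `reducer`, and `run_local` are illustrative assumptions, not part of Hadoop; on a real cluster the mapper and reducer would be separate scripts reading standard input.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.lower().split():
            yield f"{word}\t1"


def reducer(sorted_records):
    """Sum the counts for each word; input must already be sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(keyed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Stand-in for the framework: map, sort by key (the shuffle), reduce."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["the quick brown fox", "the lazy dog"]))
```

Because the reducer only ever sees one key's records at a time, the same two functions would work unchanged if the sorted records were split across many machines.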
Would you use an excavator to plant a tomato? 
Another use … 
Some people use a Hadoop cluster for a “data lake”
• Store all your raw data
• Cook it on demand
[Figure: Hadoop ecosystem diagram, including Impala]
http://pixgood.com/hadoop-ecosystem-diagram.html
Pig 
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 
• Pig Latin - the scripting language 
• Grunt – Shell for executing Pig Commands 
http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 
This is what it would be in Java 
Hive 
You write SQL! Well, almost: it is HiveQL.
SELECT user.* 
FROM user 
WHERE 
user.active = 1; 
[Diagram: SQL Workbench connecting over JDBC]
The first hands-on session will focus on this.
On Amazon, the common workflow for batch processing starts and ends with S3.
[Diagram: a Hive script between S3 input and S3 output]
Impala 
• Uses SQL very similar to HiveQL 
• Runs 10-100x faster than Hive
• Runs in memory, so it does not scale up as well
• Great for developing your code on a small data set 
• Can use interactively with Tableau and other BI tools 
• Some batch jobs run faster on Impala than Hive 
What is EMR? 
• Hadoop offered by Amazon 
• EMR = Elastic MapReduce
• Amazon does almost all of the work to create a cluster 
Three ways to pay for EMR 
• On Demand - highest price, by the hour, no commitment 
  • m1.small: $0.055 per hour
  • i2.8xlarge: $7.09 per hour
  • (29 different machine options)
• Reservation - 1 and 3 year terms (No, All, & Partial Upfront) 
• Spot - lowest price, machine can be taken away 
Do I leave my cluster up all the time? 
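One way to reason about that question is simple arithmetic. The sketch below compares leaving a cluster up all month with starting it only for a daily job, using the m1.small on-demand price quoted above. The node count, the job length, and the helper `cluster_cost` are hypothetical, and the sketch assumes each started hour is billed in full, in line with the by-the-hour pricing.

```python
import math

M1_SMALL_PER_HOUR = 0.055  # on-demand price quoted on the slide


def cluster_cost(nodes, hours_up, price_per_hour):
    """On-demand cost, assuming every started hour is billed in full."""
    return nodes * math.ceil(hours_up) * price_per_hour


if __name__ == "__main__":
    nodes = 10        # hypothetical cluster size
    days = 30
    job_hours = 1.5   # hypothetical length of the daily job

    always_on = cluster_cost(nodes, 24 * days, M1_SMALL_PER_HOUR)
    transient = days * cluster_cost(nodes, job_hours, M1_SMALL_PER_HOUR)
    print(f"always on: ${always_on:.2f}  transient: ${transient:.2f}")
```

Storage kept in S3 between runs is not included here and would be charged separately.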
Adding machines: Time down, Cost up 
[Chart: cost in ECU as machines are added]
What Is Redshift? 
• Columnar database 
• Great for reads 
• Scale by adding machines 
• Two ways to pay 
• On Demand 
• Reservation 
• Good for SQL-based ETL too 
https://aws.amazon.com/redshift/
Redshift Machine Options (on demand prices) 
Petabyte scale 
Remember: Amazon charges for S3 storage too.
Redshift usage pattern 
• Load data to S3 first
• Use BI tools to send in SQL 
• Amazon Redshift is based on PostgreSQL 
The second hands-on session will focus on this.
[Diagram: SQL Workbench connecting over JDBC]
Should I use Redshift or EMR? 
Redshift for 
• Structured data 
• Interactive queries 
• Speed 
Hadoop for 
• Data format flexibility 
• Computation flexibility 
• Super Big Data 
• Try both 
• Compare costs 
• If it works in Redshift, start there 
Performance comparison (3. Join Query) 
https://amplab.cs.berkeley.edu/benchmark/
Recap 
• Started a Hadoop cluster via the AWS Console (Web UI) 
• Loaded Data 
• Wrote some queries 
• Same for Redshift 
Eventually, you will do this for real and have a script that has value. 
Now what? 
To run your data job you need to … 
• Wait for the new data to arrive 
• Move it to S3
• Start a cluster 
• Load the data 
• Run your SQL scripts 
• Wait for it to finish 
• Shut down your cluster 
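The checklist above can be sketched as a script that runs each step in order and always attempts the shutdown, even when an earlier step fails. Every name here is a hypothetical placeholder; a real version would call AWS tooling and your SQL client at those points, which this sketch deliberately leaves out.

```python
def run_data_job(steps, shutdown):
    """Run steps in order; always call shutdown, even if a step raises."""
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
    finally:
        shutdown()
    return completed


if __name__ == "__main__":
    # Placeholder steps that only print; swap in real work for each one.
    def placeholder(message):
        return lambda: print(message)

    steps = [
        ("wait_for_data", placeholder("new data has arrived")),
        ("move_to_s3", placeholder("data copied to S3")),
        ("start_cluster", placeholder("cluster started")),
        ("load_data", placeholder("data loaded")),
        ("run_sql", placeholder("SQL scripts finished")),
    ]
    done = run_data_job(steps, shutdown=placeholder("cluster shut down"))
    print(done)
```

Putting the shutdown in a `finally` clause is the design choice that matters: a failed load or query should not leave a cluster running and billing.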
And hope … 
• The new data is in the right format 
• Assumptions you made during development are still true 
• Someone did not mess up your code with an “easy change”
• The new data transfers run successfully 
• A table you depend on has been updated correctly 
• The new data has not been truncated by the source 
• No data quality issues with the source data 
Wouldn’t it be great to turn your hopes into tests? 
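As a minimal illustration of that idea, the sketch below turns three of the hopes above (right format, not truncated, no quality issues) into checks that run before any SQL does. The column names, the row-count threshold, and the function `check_new_data` are assumptions made up for the example.

```python
import csv
import io

EXPECTED_COLUMNS = ["user_id", "active", "signup_date"]  # hypothetical schema
MIN_ROWS = 2  # hypothetical floor; fewer rows suggests a truncated file


def check_new_data(csv_text):
    """Return a list of failed checks; an empty list means the data passed."""
    reader = csv.DictReader(io.StringIO(csv_text))
    failures = []
    if reader.fieldnames != EXPECTED_COLUMNS:
        # Without the expected columns the remaining checks are meaningless.
        failures.append("unexpected columns")
        return failures
    rows = list(reader)
    if len(rows) < MIN_ROWS:
        failures.append("too few rows, file may be truncated")
    if any(row["active"] not in ("0", "1") for row in rows):
        failures.append("active must be 0 or 1")
    return failures


if __name__ == "__main__":
    good = "user_id,active,signup_date\n1,1,2014-12-01\n2,0,2014-12-02\n"
    print(check_new_data(good))
```

A job would stop, or alert someone, whenever the returned list is not empty, instead of loading questionable data and hoping.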
DataKitchen: We produce the data 
SQL, tests, and the checklist go into a Recipe.
Your data are Ingredients.
The results are Servings.
DataKitchen brings reality in line with expectations 
[Figure: three bars, “Business Customer Expectation”, “Analyst Reality”, and “With DataKitchen”, each split into Prepare Data, Analyze, and Communicate]
The story of our first Recipe 
With DataKitchen, we got 75% of our time back! 
… and we don’t have to remember to shut down our cluster. 
Remember to shut down your clusters
Thank you!
Send us an email to receive our newsletter or to give us feedback.
info@datakitchen.io

React Server Component in Next.js by Hanief Utama
 
A healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdfA healthy diet for your Java application Devoxx France.pdf
A healthy diet for your Java application Devoxx France.pdf
 
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdfInnovate and Collaborate- Harnessing the Power of Open Source Software.pdf
Innovate and Collaborate- Harnessing the Power of Open Source Software.pdf
 
Implementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with AzureImplementing Zero Trust strategy with Azure
Implementing Zero Trust strategy with Azure
 
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
20240415 [Container Plumbing Days] Usernetes Gen2 - Kubernetes in Rootless Do...
 
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
Catch the Wave: SAP Event-Driven and Data Streaming for the Intelligence Ente...
 
Cyber security and its impact on E commerce
Cyber security and its impact on E commerceCyber security and its impact on E commerce
Cyber security and its impact on E commerce
 

Introduction to Big Data Technologies: Hadoop/EMR/Map Reduce & Redshift

  • 1. Big Data Infrastructure workshop A hands-on introduction Saturday, December 6, 2014
  • 2. Agenda 08:30 AM Breakfast 09:00 AM Introduction and Strengths of Technologies 10:00 AM Start an EMR Cluster 10:15 AM break + set up query tool 10:30 AM Hadoop hands-on 10:55 AM break 11:10 AM Redshift hands-on 11:40 AM Operationalizing your code 12:00 PM adjourn 12/6/2014 2
  • 3. Background on your presenters
  • 4. DataKitchen Leadership: Chris Bergh (Executive Chef), Gil Benghiat (VP Product), Eric Estabrooks (VP Cloud and Data Services). Software development origins and executive experience delivering enterprise software focused on the Marketing and Health Care sectors. Deep Analytic Experience: spent the past decade solving the analytic data preparation problem. New Approach To Data Preparation and Production: focused on the Analysts.
  • 5. Analysts And Their Teams Are Spending 60-80% Of Their Time On Data Preparation And Production
  • 6. This creates an expectation gap. [Diagram: Business Customer Expectation vs. Analyst Reality, split into Analyze, Prepare Data, and Communicate] The business does not think that Analysts are preparing data. (Analysts don’t want to prepare data.)
  • 7. What Analysts Really Want: An Integrated Data Set Ready For Analysis. With: Autonomy & Agility. Without: All the Work & Anxiety.
  • 8. DataKitchen solves this problem. We are on a mission to prepare data to make analysts successful.
  • 9. Agenda 08:30 AM Breakfast 09:00 AM Introduction and Strengths of Technologies 10:00 AM Start an EMR Cluster 10:15 AM break + set up query tool 10:30 AM Hadoop hands-on 10:55 AM break 11:10 AM Redshift hands-on 11:40 AM Operationalizing your code 12:00 PM adjourn
  • 10. Experience of Audience • Who considers themselves: • Analyst • Data scientist • Programmer / Scripter • On the Business side • Who knows SQL – can write a simple select? • Who had an AWS account before today?
  • 12. What Is Apache Hadoop? • Software framework • Large scale processing • Network of commodity hardware • Handles hardware failures http://hadoop.apache.org/
  • 13. What is Hadoop good for? • Problems that are huge (batch), but not hard, and can be run in parallel over immutable data • NOT OLTP (e.g. backend to e-commerce site) • Providing a Map Reduce framework
  • 16. You can write map reduce jobs in your favorite language. Streaming Interface: • Lets you specify mappers and reducers • Supports: Java, Python, Ruby, Unix Shell, R, any executable. Map Reduce “generators”: • Result in map reduce jobs • Pig • Hive
  • 17. Applications that lend themselves to map reduce • Word Count • PDF Generation (NY Times 11,000,000 articles) • Analysis of stock market historical data (ROI and standard deviation) • Geographical Data (Finding intersections, rendering map files) • Log file querying and analysis • Statistical machine translation • Spam detection • Analyzing Tweets
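To make the streaming interface from slide 16 and the Word Count application from slide 17 concrete, here is a minimal sketch in Python. It is not taken from the workshop material. The function names (`mapper`, `reducer`, `run_locally`) are illustrative, and the shuffle/sort that Hadoop performs between the two phases is simulated with `sorted` so the logic can be tried on a laptop.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated "word<TAB>1" record per word, streaming-style."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word. Assumes records arrive sorted by key,
    which is what the framework provides between map and reduce."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_locally(lines):
    """Simulate map -> shuffle/sort -> reduce in a single process."""
    return list(reducer(sorted(mapper(lines))))
```

In a real streaming job the mapper and the reducer would typically be two separate scripts that read standard input and write standard output. Keeping them as plain functions here only makes the sketch easy to test.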
  • 18. Would you use an excavator to plant a tomato?
  • 19. Another use … Some people use a Hadoop cluster for a “data lake” • Store all your raw data • Cook it on demand
  • 21. Pig http://www.slideshare.net/kevinweil/hadoop-pig-and-twitter-nosql-east-2009 • Pig Latin - the scripting language • Grunt – Shell for executing Pig Commands
  • 23. Hive: You write SQL! Well, almost; it is HiveQL. SELECT user.* FROM user WHERE user.active = 1; [Diagram labels: JDBC, SQL Workbench] The first hands-on session will focus on this.
  • 24. In Amazon, the common workflow for batch processing starts and ends with S3. [Diagram label: Hive Script]
  • 25. Impala • Uses SQL very similar to HiveQL • Runs 10-100x faster • Runs in memory so it does not scale up as well • Great for developing your code on a small data set • Can use interactively with Tableau and other BI tools • Some batch jobs run faster on Impala than Hive
  • 26. What is EMR? • Hadoop offered by Amazon • EMR = Elastic Map Reduce • Amazon does almost all of the work to create a cluster
  • 27. Three ways to pay for EMR • On Demand - highest price, by the hour, no commitment • m1.small $0.055 per hour • i2.8xlarge $7.09 per hour • (29 different machine options) • Reservation - 1 and 3 year terms (No, All, & Partial Upfront) • Spot - lowest price, machine can be taken away. Do I leave my cluster up all the time?
  • 28. Adding machines: Time down, Cost up [Chart; axis label: Cost in ECU]
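The "time down, cost up" trade-off can be illustrated with a little arithmetic. The sketch below uses the two on-demand prices quoted on slide 27. It assumes that every started hour is billed in full on each node (the slide only says "by the hour") and that a job's runtime shrinks in proportion to the node count, which real jobs rarely achieve exactly. The function name `cluster_cost` is illustrative.

```python
import math

# On-demand prices quoted on slide 27 (USD per instance-hour, December 2014).
PRICE_PER_HOUR = {"m1.small": 0.055, "i2.8xlarge": 7.09}


def cluster_cost(instance_type, nodes, runtime_hours):
    """Cost of one run, assuming each started hour is billed in full per node."""
    billed_hours = math.ceil(runtime_hours)
    return nodes * billed_hours * PRICE_PER_HOUR[instance_type]
```

Under these assumptions, a job that needs 10 node-hours on m1.small costs about $0.55 on 2 nodes (5 hours each). On 4 nodes it finishes in 2.5 hours, but each node is billed for 3 hours, so the cost rises to about $0.66.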
  • 29. What Is Redshift? • Columnar database • Great for reads • Scale by adding machines • Two ways to pay • On Demand • Reservation • Good for SQL-based ETL too http://hadoop.apache.org/
  • 30. Redshift Machine Options (on demand prices) • Petabyte scale • Remember: Amazon charges for S3 storage too
  • 31. Redshift usage pattern • Load data to S3 first • Use BI tools to send in SQL • Amazon Redshift is based on PostgreSQL [Diagram labels: JDBC, SQL Workbench] The second hands-on session will focus on this.
  • 32. Agenda 08:30 AM Breakfast 09:00 AM Introduction and Strengths of Technologies 10:00 AM Start an EMR Cluster 10:15 AM break + set up query tool 10:30 AM Hadoop hands-on 10:55 AM break 11:10 AM Redshift hands-on 11:40 AM Operationalizing your code 12:00 PM adjourn
  • 33. Should I use Redshift or EMR? Redshift for: • Structured data • Interactive queries • Speed. Hadoop for: • Data format flexibility • Computation flexibility • Super Big Data. • Try both • Compare costs • If it works in Redshift, start there
  • 34. Performance comparison (3. Join Query) https://amplab.cs.berkeley.edu/benchmark/
  • 35. Recap • Started a Hadoop cluster via the AWS Console (Web UI) • Loaded Data • Wrote some queries • Same for Redshift. Eventually, you will do this for real and have a script that has value. Now what?
  • 36. To run your data job you need to … • Wait for the new data to arrive • Move it to S3 • Start a cluster • Load the data • Run your SQL scripts • Wait for it to finish • Shut down your cluster
  • 37. And hope … • The new data is in the right format • Assumptions you made during development are still true • Someone did not mess up your code with an “easy change” • The new data transfers run successfully • A table you depend on has been updated correctly • The new data has not been truncated by the source • No data quality issues with the source data. Wouldn’t it be great to turn your hopes into tests?
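As one way to picture "turning hopes into tests", the sketch below checks a newly arrived CSV file for three of the hopes listed above: the right format (expected columns), no truncation (a minimum row count), and basic data quality (a required field is never empty). It is a hypothetical illustration, not DataKitchen's implementation. The function name `check_new_data` and its parameters are assumptions made for this example.

```python
import csv
import io


def check_new_data(csv_text, expected_columns, min_rows, required_field):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    reader = csv.DictReader(io.StringIO(csv_text))

    # Hope: "the new data is in the right format".
    if reader.fieldnames != expected_columns:
        failures.append(f"unexpected columns: {reader.fieldnames}")
        return failures

    rows = list(reader)

    # Hope: "the new data has not been truncated by the source".
    if len(rows) < min_rows:
        failures.append(f"only {len(rows)} rows, expected at least {min_rows}")

    # Hope: "no data quality issues with the source data".
    missing = sum(1 for row in rows if not row[required_field])
    if missing:
        failures.append(f"{missing} rows missing {required_field}")

    return failures
```

A job could run such checks right after the data lands and refuse to start a cluster when the returned list is not empty.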
  • 38. DataKitchen: We produce the data. SQL, tests and the check list go into a Recipe. Your data are Ingredients. The results are Servings.
  • 39. DataKitchen brings reality in line with expectations. [Diagram: Business Customer Expectation, Analyst Reality, and With DataKitchen, each split into Analyze, Prepare Data, and Communicate]
  • 40. The story of our first Recipe
  • 41. The story of our first Recipe: With DataKitchen, we got 75% of our time back! … and we don’t have to remember to shut down our cluster.
  • 42. Remember to shut down your clusters
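The reminder above, together with the step list on slide 36, suggests wrapping the job so that shutdown happens even when a step fails. The sketch below shows that pattern with `try`/`finally`. The step callables and the `shut_down_cluster` callback are hypothetical placeholders for whatever tooling actually starts, loads, and stops the cluster. The sketch also assumes that calling shutdown when no cluster was started is harmless.

```python
def run_data_job(steps, shut_down_cluster):
    """Run each (name, callable) step in order and always shut the cluster down.

    Returns the names of the completed steps. If a step raises, the exception
    propagates to the caller, but only after shut_down_cluster() has run.
    """
    completed = []
    try:
        for name, step in steps:
            step()
            completed.append(name)
    finally:
        # Runs on success and on failure alike, so a broken load or query
        # does not leave a paid cluster running.
        shut_down_cluster()
    return completed
```

The steps would follow slide 36 in order: wait for data, move it to S3, start a cluster, load, run the SQL scripts, and wait for completion.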
  • 43. Thank you! Send us an email to receive our newsletter or to give us feedback. info@datakitchen.io