Scaling Your Analytics
with Amazon Elastic MapReduce
Peter Sirota, General Manager - Amazon Elastic MapReduce
November 14, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Agenda
• Amazon EMR: Hadoop in the cloud
• Hadoop Ecosystem on Amazon EMR
• Customer Use Cases
Hadoop is the right system for Big Data
• Scalable and fault tolerant
• Flexibility for multiple languages and data formats
• Open source
• Ecosystem of tools
• Batch and real-time analytics
Challenges with Hadoop

On premise:
• Manage HDFS, upgrades, and system administration
• Pay for expensive support contracts
• Select hardware in advance and stick with predictions

On Amazon EC2:
• Difficult to integrate with AWS storage services
• Independently manage and monitor clusters
Amazon EMR is the easiest way to run Hadoop in the cloud
Why Amazon EMR?
• Managed services
• Easy to tune clusters and trim costs
• Support for multiple data stores
• Unique features and ecosystem support
[Diagram, built up over several slides: input data from S3, DynamoDB, or Redshift plus your code are submitted to Elastic MapReduce; EMR provisions a name node that manages an elastic cluster backed by S3/HDFS; queries and BI tools connect via JDBC, Pig, and Hive; output flows back to S3, DynamoDB, or Redshift.]
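This flow maps directly onto the EMR API. Below is a minimal sketch using boto3, the current AWS SDK for Python (the 2013-era equivalent was boto or the EMR CLI); the bucket names, paths, release label, instance types, and streaming-step arguments are illustrative assumptions, not details from the talk.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Launch a transient cluster that reads input from S3 and writes output back to S3.
response = emr.run_job_flow(
    Name="analytics-cluster",
    ReleaseLabel="emr-5.36.0",                 # assumed: any recent EMR release
    LogUri="s3://my-bucket/emr-logs/",         # hypothetical bucket
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the steps finish
    },
    Steps=[{
        "Name": "process-input",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", "s3://my-bucket/input/",
                     "-output", "s3://my-bucket/output/",
                     "-mapper", "cat",
                     "-reducer", "wc"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```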
Elastic clusters
Customize size and type to reduce costs
Choose your instance types
Try out different configurations to find your optimal architecture
• CPU: c1.xlarge, cc1.4xlarge, cc2.8xlarge
• Memory: m1.large, m2.2xlarge, m2.4xlarge
• Disk: hs1.8xlarge
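If you want finer control over the hardware mix, the same cluster can be described as instance groups. A hedged sketch of the Instances argument for boto3's run_job_flow; the group names and counts are made up, and the 2013-era instance types shown on the slide may not be offered on current EMR releases.

```python
# Sketch of an InstanceGroups layout mixing CPU-, memory-, and disk-oriented types.
instance_groups = [
    {"Name": "master", "InstanceRole": "MASTER",
     "InstanceType": "m2.2xlarge", "InstanceCount": 1},
    {"Name": "core-cpu", "InstanceRole": "CORE",
     "InstanceType": "c1.xlarge", "InstanceCount": 4},
    {"Name": "task-disk", "InstanceRole": "TASK",
     "InstanceType": "hs1.8xlarge", "InstanceCount": 2},
]

instances = {
    "InstanceGroups": instance_groups,
    "KeepJobFlowAliveWhenNoSteps": True,   # keep the cluster up while you experiment
}
# Pass Instances=instances to emr.run_job_flow(...) as in the earlier sketch.
```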
Long running or transient clusters
Easy to run Hadoop clusters short-term or 24/7, and only pay for what you need
Resizable clusters
Easy to add and remove compute capacity on your cluster
[Diagram, built up across slides: a 10-hour job, the same job in 6 hours with more nodes, sizing to peak capacity, and matching compute demands with cluster sizing.]
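Resizing is a single API call against the cluster's core or task instance group. A minimal boto3 sketch; the cluster ID and the target size of 20 nodes are hypothetical.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Look up the TASK instance group of a running cluster (cluster ID is hypothetical).
groups = emr.list_instance_groups(ClusterId="j-XXXXXXXXXXXXX")["InstanceGroups"]
task_group = next(g for g in groups if g["InstanceGroupType"] == "TASK")

# Grow the group to 20 nodes; shrink it back the same way once the peak has passed.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{"InstanceGroupId": task_group["Id"], "InstanceCount": 20}],
)
```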
Use Spot and Reserved Instances
Minimize costs by supplementing on-demand pricing
Easy to use Spot Instances
Name-your-price supercomputing to minimize costs
• Spot Instances for task nodes: up to 90% off Amazon EC2 on-demand pricing
• On-demand Instances for core nodes: standard Amazon EC2 pricing for on-demand capacity
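One way to apply this split is to keep the master and core groups on-demand and attach a Spot task group to a running cluster. A boto3 sketch; the cluster ID, instance type, count, and bid price are illustrative.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add a Spot-priced task instance group alongside the on-demand core nodes.
emr.add_instance_groups(
    JobFlowId="j-XXXXXXXXXXXXX",
    InstanceGroups=[{
        "Name": "spot-task-nodes",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "Market": "SPOT",
        "BidPrice": "0.10",   # maximum price per instance-hour, in USD
    }],
)
```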
24/7 clusters on Reserved Instances
Minimize cost for consistent capacity
• Reserved Instances for long running clusters: up to 65% off on-demand pricing
Your data, your choice
Easy to integrate Amazon EMR with your data stores
Using Amazon S3 and HDFS
• Data from the data sources is aggregated and stored in Amazon S3
• Ad-hoc query: a long-running EMR cluster holds data in HDFS for Hive interactive queries
• Weekly report: a transient EMR cluster runs batch map/reduce jobs for daily reports
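The two patterns differ mainly in whether the cluster stays alive between steps. A sketch of the relevant run_job_flow flags (boto3 key names; everything else about the cluster definition is as in the earlier sketches).

```python
# Long-running cluster holding data in HDFS for interactive Hive queries.
persistent_instances = {
    "KeepJobFlowAliveWhenNoSteps": True,   # stays up for ad-hoc work
    "TerminationProtected": True,          # guard against accidental shutdown
}

# Transient cluster for the scheduled batch reports.
transient_instances = {
    "KeepJobFlowAliveWhenNoSteps": False,  # terminates when the last step finishes
}
# Merge either dict into the Instances argument of emr.run_job_flow(...).
```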
Use Amazon EMR with Amazon Redshift and Amazon S3
• Daily data from the data sources is aggregated in Amazon S3
• An Amazon EMR cluster is used to process the data
• Processed data is loaded into the Amazon Redshift data warehouse
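Once the EMR output lands in Amazon S3, a Redshift COPY statement loads it into the warehouse. A sketch using the psycopg2 driver; the endpoint, credentials, table, bucket path, and IAM role are all made up for illustration.

```python
import psycopg2  # assumes the psycopg2 PostgreSQL driver is installed

conn = psycopg2.connect(
    host="analytics.abc123xyz0.us-east-1.redshift.amazonaws.com",  # hypothetical
    port=5439, dbname="dw", user="loader", password="...")
cur = conn.cursor()

# Load one day of EMR output files from S3 into a fact table.
cur.execute("""
    COPY daily_metrics
    FROM 's3://my-bucket/emr-output/2013-11-14/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    DELIMITER ',' GZIP;
""")
conn.commit()
conn.close()
```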
Use the Hadoop Ecosystem
on Amazon EMR
Leverage a diverse set of tools to get the most out of your data
Hadoop 2.x
• Databases
• Machine learning
• Metadata stores
• Exchange formats
• Diverse query languages
...and much more
Use Hive on Amazon EMR to interact with your data in HDFS and Amazon S3
• Data warehouse for Hadoop
• Integration with Amazon S3 for better read and write performance
• SQL-like query language to make iterative queries easier
• Easy to scale in HDFS on a persistent Amazon EMR cluster
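On a running cluster, a Hive script stored in S3 can be submitted as a step. A boto3 sketch using the command-runner pattern from recent EMR releases; the cluster ID, script location, and INPUT/OUTPUT variables are hypothetical.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Submit a Hive script from S3 as a step on an existing cluster.
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[{
        "Name": "daily-hive-report",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hive-script", "--run-hive-script", "--args",
                     "-f", "s3://my-bucket/hive/daily_report.q",
                     "-d", "INPUT=s3://my-bucket/input/",
                     "-d", "OUTPUT=s3://my-bucket/output/"],
        },
    }],
)
```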
Use HBase on a persistent Amazon EMR cluster as a column-oriented, scalable data store
• Billions of rows and millions of columns
• Backup to and restore from Amazon S3
• Flexible datatypes
• Modify your HBase tables as you add new data to your system
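To illustrate the column-oriented, flexible-schema model (this is generic HBase usage, not an EMR-specific API), here is a sketch with the happybase client talking to the cluster's HBase Thrift server; the hostname, table, and column names are invented.

```python
import happybase  # assumes the HBase Thrift server is running on the master node

conn = happybase.Connection("ec2-master-node.example.com")  # hypothetical host

# One column family; columns inside it can be added freely, row by row.
conn.create_table("viewing_events", {"d": dict()})
table = conn.table("viewing_events")

# Row keys encode user and date; the column set can differ from row to row.
table.put(b"user123|2013-11-14", {b"d:title_id": b"70143836", b"d:device": b"ps3"})
table.put(b"user456|2013-11-14", {b"d:title_id": b"70242311"})

print(table.row(b"user123|2013-11-14"))
```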
Use ad-hoc queries on your cluster to drive insights in real-time

Spark / Shark
• In-memory MapReduce for faster queries
• Use HiveQL to interact with your data

Impala (coming soon!)
• Parallel database engine for Hadoop
• Use SQL to query data in HDFS on your cluster in real-time
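For the Spark path, an ad-hoc aggregation can be cached in memory and re-queried without re-reading S3. A minimal PySpark sketch; the clickstream path and its tab-separated layout are assumptions.

```python
from pyspark import SparkContext

sc = SparkContext(appName="adhoc-clickstream")

# Hypothetical tab-separated clickstream logs in S3: user_id <tab> url <tab> timestamp
events = sc.textFile("s3://my-bucket/clickstream/2013/11/14/")
hits_per_user = (events
                 .map(lambda line: (line.split("\t")[0], 1))
                 .reduceByKey(lambda a, b: a + b)
                 .cache())            # keep the result in memory for follow-up queries

# Fast follow-up query against the cached RDD.
top_users = hits_per_user.takeOrdered(10, key=lambda kv: -kv[1])
print(top_users)
sc.stop()
```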
“Hadoop-as-a-Service [Amazon EMR] offers a
better price-performance ratio [than bare-metal Hadoop].”

1. Elastic clusters and cost optimization
2. Rapid, tuned provisioning
3. Agility for experimentation
4. Easy integration with diverse datastores
Diverse set of partners to build on Amazon EMR
[Partner logo grid, grouped by category: BI / visualization, Hadoop distribution, monitoring, business intelligence, data transformation, data transfer, and performance tuning; available on AWS Marketplace: BI / visualization, ETL tool, and graphical IDE; available as a distribution in Amazon Elastic MapReduce: BI / visualization, encryption, and graphical IDE.]
Thousands of customers
How Netflix scales Big Data Platform on
Amazon EMR
Eva Tse, Director of Big Data Platform, Netflix
November 14, 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Hadoop ecosystem as
our Data Analytics platform
in the cloud
How did we get here?
How do we scale?
Separate compute and
storage layers
Amazon S3 as our DW
[Diagram: Amazon S3 as the source of truth, S3mper-enabled.]
Multiple clusters
[Diagram, built up across slides: ad hoc and SLA clusters in separate zones (zone x, zone y), later joined by "bonus" clusters in zones x, y, and z, all working off the S3 source of truth.]
Unified and global big data collection pipeline
[Diagram: cloud apps feed the events pipeline (Suro, Ursula) and the dimension pipeline (Aegisthus) into the S3 source of truth, which serves the SLA, ad hoc, and bonus clusters.]
Innovate – services and tools
• Sting
• CLIs
• Gateways
Putting it into perspective …
• Billions of viewing hours of data
• ~3,000-node clusters
• Hundred billion events / day
• Few petabytes DW on Amazon S3
• Thousands of jobs / day
[Diagram of platform use cases: ad hoc querying, simple reporting, ETL, analytics and statistical modeling, and Open Connect.]
What works for us?

Scalability
What works for us?

Hadoop integration on Amazon EC2 / AWS
What works for us?

Let us focus on innovation and build a solution
What works for us?

Tight engagement with Amazon EMR & Amazon
EC2 teams for tactical issues and strategic
roadmap
Next Steps …
• Heterogeneous node clusters
• Auto expand/shrink
• Richer monitoring infrastructure
We strive to build the best-in-class big data platform in the cloud
Big Data at Channel 4
Amazon Elastic MapReduce for Competitive Advantage
Bob Harris – Channel 4 Television
14th November 2013

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Channel 4 – Background
• Channel 4 is a public service, commercially funded, not-for-profit broadcaster.
• We have a remit to deliver innovative, experimental, distinctive, and diverse content across television, film, and digital media.
• We are funded predominantly by television advertising, competing with the other established UK commercial broadcasters and, increasingly, with emerging Internet-based providers.
• Our content is available across our portfolio of around 10 core and time-shift channels, and our on-demand service 4oD is accessible across multiple devices and platforms.
Why Big Data at C4
Business Intelligence at C4
• Well established Business Intelligence capability
• Based on industry standard proprietary products
• Real-time data warehousing
• Comprehensive business reporting
• Excellent internal skills
• Good external skills availability
Big Data Technology at C4
• 2011 - Embarked on Big Data initiative
  – Ran in-house and cloud-based PoCs
  – Selected Amazon EMR
• 2012 - Ran Amazon EMR in parallel with conventional BI
  – Hive deployed to Data Analysts
  – Amazon EMR workflows deployed to production
• 2013 – Amazon EMR confirmed as primary Big Data platform
  – Amazon EMR usage growing, focus on automation
  – Experimenting with Mahout for Machine Learning
What problems are we solving?
• Single view of the viewer: recognising them across devices and serving relevant content
• Personalising the viewer experience
How are we doing this?
• Principal tasks…
  – Audience segmentation
  – Personalisation
  – Recommendations
• What data do we process…
  – Website clickstream logs
  – 4oD activity and viewing history
  – Over 9m registered users
  – Majority of activity now from “logged-in” users
High-Level Architecture
• Amazon EMR and existing BI technology are complementary
• Process billions of data rows in Amazon EMR, store millions of result rows in an RDBMS
• No need to “rip and replace”: existing technology investment is protected
• Amazon EMR will continue to underpin major growth in data volumes and processing complexity
Where Next?
• Continued growth in usage of Amazon EMR
• Migrate to Hadoop 2.x
• Adopt Amazon Redshift
• Improved integration between C4 and AWS
• Shift toward “near real-time” processing
Please give us your feedback on this
presentation

BDT301
As a thank you, we will select prize
winners daily for completed surveys!
