© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
Design Patterns and Best Practices
for Data Analytics with Amazon EMR
Jonathan Fritz, Principal Product Manager, Amazon EMR
Anya Bida, Senior Member of Technical Staff, Salesforce
ABD305
Overview
- Intro and architectures
- Using Amazon EC2 Spot and Auto Scaling
- Security overview
- Ad-hoc and advanced workflows
- Apache Spark and Amazon EMR at Salesforce
What is Amazon EMR?
Easy to use: Launch a cluster in minutes
Low cost: Pay per-second
Open-source variety: Latest versions of software
Managed: Spend less time monitoring
Secure: Easy to enable options
Flexible: Full customization and control
Open-source applications
Storage
S3 (EMRFS), HDFS
YARN
Cluster Resource Management
Batch
MapReduce
Interactive
Tez
In Memory
Spark
Applications
Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop
HBase/Phoenix
Presto
Streaming
Flink
Zeppelin - Notebooks
MXNet
Hue - SQL
Ganglia - Monitoring
Livy – Job Server
Connectors to AWS services
Amazon EMR service
Use the AWS Glue Data Catalog
• Support for Spark, Hive, and Presto
• Auto-generate schema and partitions
• Managed table updates
HBase for random access at massive scale
Real-time and batch processing
New – Deep learning with GPU instances
Use P3 instances with GPUs
Tips to lower your costs
Transient or long-running clusters
• Amazon Linux AMI with preinstalled customizations for faster cluster creation
• Auto Scaling to minimize costs for long-running clusters
EC2 Spot and instance fleets
• EMR will select optimal EC2 AZ
• Provision across instance types
• Switch to on-demand
Use Auto Scaling
Scaling options
Threshold
CloudWatch or custom metric
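A threshold rule on a CloudWatch metric can be sketched as a small policy document. The Python below is a minimal sketch, not a definitive policy: the field names follow the EMR automatic scaling policy format as best understood here and should be checked against the EMR API reference, and the capacity limits, threshold, cooldown, and rule name are all illustrative assumptions.

```python
import json


def scale_out_rule(metric_name, threshold, unit, adjustment=1, cooldown_s=300):
    """Return one scale-out rule that fires when the metric drops below threshold.

    All numeric defaults here are illustrative assumptions, not recommendations.
    """
    return {
        "Name": f"ScaleOut-{metric_name}",
        "Description": f"Add capacity when {metric_name} < {threshold}",
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,
                "CoolDown": cooldown_s,
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "EvaluationPeriods": 1,
                "MetricName": metric_name,
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": unit,
                "Dimensions": [{"Key": "JobFlowId", "Value": "${emr.clusterId}"}],
            }
        },
    }


# Hypothetical limits: keep between 2 and 10 instances in the group.
policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},
    "Rules": [scale_out_rule("YARNMemoryAvailablePercentage", 15.0, "PERCENT")],
}

if __name__ == "__main__":
    print(json.dumps(policy, indent=2))
```

A matching scale-in rule would use a negative adjustment and the opposite comparison; it is left out to keep the sketch short.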
Auto Scaling
• EMR scales in at YARN task completion
  - Selectively removes nodes with no running tasks
  - yarn.resourcemanager.decommissioning.timeout
  - Default timeout is one hour
• Spark scale-in contributions
  - Spark-specific blacklisting of tasks
  - Unregistering cached data and shuffle blocks
  - Advanced error handling
Tips to secure your cluster
Encryption
• Spark
• Tez
• MapReduce
• Presto
• HBase
• Hive
• Pig
Authentication LDAP
HiveServer2
Presto Coordinator
Spark Thrift Server
Hue Server
Zeppelin Server
AWS credentials
EMR Step (EMR API)
EC2 key pair
SSH as “hadoop”
New – Authentication with Kerberos
Microsoft
Active Directory
KDC
Users
YARN RMDoAs
Service principals for
all cluster nodes
Master Node
Authorization
• Storage-based
• EMRFS/S3
• HDFS
• HiveServer2 and Presto (SQL-based)
• HBase
• YARN queues
• Fine-grained access control by cluster tag (IAM)
• Apache Ranger on edge node (using CloudFormation)
New – EMRFS fine-grained authorization
Context
User: aduser
Group: analyst
IAM role: analytics_prod
Can map IAM roles to user, group, or S3 prefix
Context
User: aduser2
Group: dev
IAM role: analytics_dev
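The two contexts above could be expressed as EMRFS role mappings in a security configuration. This is a minimal sketch under stated assumptions: the account ID is a placeholder, mapping by group (rather than by user or S3 prefix) is a choice made for illustration, and the JSON layout follows the EMR security configuration format as best understood here, so verify it against the EMR documentation before use.

```python
import json

ACCOUNT_ID = "123456789012"  # hypothetical AWS account ID


def role_mapping(role_name, identifier_type, identifiers):
    """Map an IAM role to users, groups, or S3 prefixes for EMRFS requests."""
    return {
        "Role": f"arn:aws:iam::{ACCOUNT_ID}:role/{role_name}",
        "IdentifierType": identifier_type,  # "User", "Group", or "Prefix"
        "Identifiers": list(identifiers),
    }


security_configuration = {
    "AuthorizationConfiguration": {
        "EmrFsConfiguration": {
            "RoleMappings": [
                # Group names and role names are taken from the slide.
                role_mapping("analytics_prod", "Group", ["analyst"]),
                role_mapping("analytics_dev", "Group", ["dev"]),
            ]
        }
    }
}

if __name__ == "__main__":
    print(json.dumps(security_configuration, indent=2))
```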
Security configuration demo
Tips to submit workflows
Use Livy as an ad-hoc Spark job server
Custom Application
Oozie and Airflow for DAGs of jobs
Select Amazon EMR customers
Apache Spark and Amazon EMR at Salesforce
abida@salesforce.com @anyabida1
Anya Bida, Senior Member of Technical Staff at Salesforce
Fastest Growing Top 5 Enterprise Software Company
Revenue by fiscal year: FY11 $1.7B, FY12 $2.3B, FY13 $3.1B, FY14 $4.1B, FY15 $5.4B, FY16 $6.7B, FY17 $8.4B
FY18 Q2 revenue: $2.56B
[Figure: award badges with year ranges between 2009 and 2017, including "The world's most innovative companies" and "Innovator of the Decade"]
Overview
Our Goal
Getting started with EMR
Spark primer
Monitor multiple viewpoints
Use AWS Identity and Access Management (IAM) roles
Isolate Environments
Complete ML Pipeline
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time
Selecting a tool
ETL
Feature Engineering
Model Training
Model Evaluation
Deploy & Operationalize Models
Score & Update Models
Support Batch & Real Time
Supports each step of our ML pipeline
Scales for small & large jobs
Good ML Libraries
Active user base
Ability to deploy production ready code
We wanted Spark…now how to deploy it?
Need: Management
EC2
• Support for batch / streaming
• Integrates with our tooling
• Spin up / down clusters
• Larger / smaller clusters
• Support for different versions of Hadoop, Spark
• Storage & Compute options
Provision EMR
Simplest approach
Input bucket
Monitoring
aws emr create-cluster
  --applications Name=Hadoop Name=Spark Name=Ganglia
  # tag for cost analysis by project
  --tags
  --ec2-attributes
  --release-label
  # send logs to S3
  --log-uri
  …
  # use naming conventions for service discovery
  # <region>-<project>-<version>-<env>
  --name
  # CORE nodes used for writes to HDFS
  # TASK nodes used for compute; try Spot Instances here for starters
  --instance-groups
  --region
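The `<region>-<project>-<version>-<env>` naming convention can be sketched as a small helper that builds and parses cluster names. Everything here is illustrative: the `ClusterName` type, the function names, and the sample values are assumptions, not part of the talk. Because region names such as `us-west-2` contain hyphens, the sketch parses from the right and requires the other three fields to be hyphen-free.

```python
from typing import NamedTuple


class ClusterName(NamedTuple):
    """Fields of the <region>-<project>-<version>-<env> convention."""
    region: str
    project: str
    version: str
    env: str

    def render(self) -> str:
        # Only the region may contain hyphens; otherwise parsing is ambiguous.
        for field in (self.project, self.version, self.env):
            if "-" in field or not field:
                raise ValueError(f"field must be non-empty and hyphen-free: {field!r}")
        return "-".join(self)


def parse_cluster_name(name: str) -> ClusterName:
    """Split from the right so a hyphenated region stays intact."""
    parts = name.rsplit("-", 3)
    if len(parts) != 4:
        raise ValueError(f"not a <region>-<project>-<version>-<env> name: {name!r}")
    return ClusterName(*parts)


if __name__ == "__main__":
    # Hypothetical project, version, and environment values.
    name = ClusterName("us-west-2", "churn", "v3", "prod").render()
    print(name)
    print(parse_cluster_name(name))
```

A service-discovery script could then filter the names returned by the cluster listing call on `env` or `project` without a separate registry.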
Simplest approach
Input bucket
EMR cluster
Output bucket
Logs bucket
Spark primer
Apache Spark
Driver Program: SparkContext
Cluster Manager
Node: Executor (Cache, Task, Task)
Node: Executor (Cache, Task, Task)
Spark on Amazon EMR
Apache Spark
Master Node: Cluster Manager (ResourceManager), NameNode
Core Node: NodeManager, DataNode
Task Node: NodeManager, NodeManager
Driver Program: SparkContext
Executor (three shown): Cache, Task, Task
Properties related to Dynamic Allocation
Property Value
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.minExecutors 5
spark.dynamicAllocation.maxExecutors 17
spark.dynamicAllocation.initialExecutors 0
spark.dynamicAllocation.executorIdleTimeout 60s
spark.dynamicAllocation.schedulerBacklogTimeout 5s
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 5s
Optional
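One way to apply these settings is through a `spark-defaults` configuration classification at cluster creation, or as `--conf` arguments at submit time. The sketch below uses the standard Spark spellings of the property names with the values from the slide; the helper name `as_submit_args` is an assumption made for illustration.

```python
import json

# Values are the ones shown on the slide; tune them for your workload.
dynamic_allocation = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.initialExecutors": "0",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}

# EMR configuration object for the spark-defaults classification.
configurations = [
    {"Classification": "spark-defaults", "Properties": dynamic_allocation}
]


def as_submit_args(properties):
    """Render the same properties as spark-submit --conf arguments."""
    args = []
    for key, value in properties.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(json.dumps(configurations, indent=2))
    print(" ".join(as_submit_args(dynamic_allocation)))
```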
Where do I find Metrics?
Ganglia
windowing, dashboarding
CloudWatch
YARNMemoryAvailablePercentage
Logs?
Anatomy of a Spark Job
Source: High Performance Spark, Karau & Warren, O’Reilly
Spark Application: Spark Context/Spark Session object
Job: Actions (e.g., collect, saveAsTextFile)
Stage: Wide transformations (sort, groupByKey)
Task: Computation to evaluate one partition (combine narrow transforms)
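The stage boundary rule above can be illustrated with a toy model in plain Python. This is a deliberate simplification, not how Spark's scheduler is implemented: it assumes a linear pipeline, treats only the listed operations as wide, and starts a new stage at every wide transformation while narrow transformations are combined into the current stage.

```python
# Assumption: only these operations are treated as wide (shuffle) transformations.
WIDE_TRANSFORMATIONS = {"sortByKey", "groupByKey"}


def split_into_stages(transformations):
    """Group a linear list of transformation names into stages.

    Narrow transformations are appended to the current stage; each wide
    transformation opens a new stage because it needs a shuffle.
    """
    stages = [[]]
    for op in transformations:
        if op in WIDE_TRANSFORMATIONS:
            stages.append([])
        stages[-1].append(op)
    return stages


if __name__ == "__main__":
    pipeline = ["map", "filter", "groupByKey", "mapValues", "sortByKey"]
    for index, stage in enumerate(split_into_stages(pipeline)):
        print(index, stage)
```

In this model a job triggered by one action has one more stage than it has wide transformations, and each stage runs one task per partition.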
Simplest approach
Input bucket
EMR cluster
Output bucket
Reads from S3
• Jar files too!
Write Intermediate files
• MEM or Disk?
• Local? HDFS? Amazon S3?
Writes to S3
• Data available after cluster is terminated
RDD Re-use
| | Cache | Persist | Checkpoint | Local Checkpoint |
| local mem cache | MEM | MEM | - | MEM |
| local disk | - | DISK | - | DISK |
| HDFS / S3 | - | - | Specify dir | - |
| If exec is decommed, are writes available? | No | No | Yes | No |
| If job finishes, are writes available? | No | No | Yes | No |
| Preserve lineage graph? | Yes | Yes | No | No |
Persist to improve speed, Checkpoint to improve fault tolerance
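The yes/no rows of the RDD re-use comparison can be encoded as a lookup, which makes the trade-off easy to query. This is a minimal sketch: the dictionary keys and the `pick` helper are names invented for illustration, and only the three yes/no rows from the slide are represented.

```python
# Yes/No rows from the slide, keyed by re-use mechanism.
REUSE = {
    "cache": {"survives_executor_loss": False, "survives_job_end": False, "keeps_lineage": True},
    "persist": {"survives_executor_loss": False, "survives_job_end": False, "keeps_lineage": True},
    "checkpoint": {"survives_executor_loss": True, "survives_job_end": True, "keeps_lineage": False},
    "localCheckpoint": {"survives_executor_loss": False, "survives_job_end": False, "keeps_lineage": False},
}


def pick(**required):
    """Return the mechanisms whose properties match every requirement."""
    return [
        name
        for name, props in REUSE.items()
        if all(props[key] == value for key, value in required.items())
    ]


if __name__ == "__main__":
    # Fault tolerance: data must outlive a decommissioned executor.
    print(pick(survives_executor_loss=True))
    # Speed with recomputation as a fallback: lineage is kept.
    print(pick(keeps_lineage=True))
```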
Monitor multiple viewpoints
https://light.co/camera
Understand resource allocation
Source: Understanding Memory Management in Spark For Fun And Profit, Shivnath Babu (Duke University, Unravel Data Systems) and Mayuresh Kunjir (Duke University)
[Figure: container memory (8 GB) inside node memory]
Can my 8 GB container launch on this cluster?
[Figure: three nodes, each showing node memory; labels "4 GB used" and "8 GB total"; an 8 GB container waiting to be placed]
Scale-out Rule: Num Containers Pending
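The question on this slide can be worked through in a few lines. The sketch assumes one reading of the diagram, namely three nodes with 8 GB total and 4 GB used each; the `Node` class and function names are illustrative. It shows why a cluster-wide memory percentage can look healthy while a large container still cannot be placed, which is the motivation for a scale-out rule on pending containers.

```python
from dataclasses import dataclass


@dataclass
class Node:
    total_gb: float
    used_gb: float

    @property
    def free_gb(self) -> float:
        return self.total_gb - self.used_gb


def can_place(container_gb, nodes):
    """A container must fit entirely on a single node."""
    return any(node.free_gb >= container_gb for node in nodes)


def memory_available_pct(nodes):
    """Cluster-wide free memory as a percentage of total memory."""
    total = sum(node.total_gb for node in nodes)
    free = sum(node.free_gb for node in nodes)
    return 100.0 * free / total


if __name__ == "__main__":
    # Assumed reading of the diagram: three nodes, 8 GB total, 4 GB used each.
    cluster = [Node(total_gb=8, used_gb=4) for _ in range(3)]
    print(memory_available_pct(cluster))  # half of the cluster memory is free
    print(can_place(8, cluster))          # yet no single node has 8 GB free
    print(can_place(4, cluster))
```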
Monitor multiple viewpoints
Connectivity viewer
https://www.linkedin.com/in/vaibhavt/
Vaibhav Tandon
Use IAM roles
Every user, service, & job should have specific, auditable permissions.
New: EMRFS fine-grained access control!!
[Figure: Cluster Manager and Scheduler, with IAM attached to the In, Out, and Logs buckets]
IAM Roles
• User has an IAM Role
• Job has an IAM Role
• IAM Roles determine read / write access to data
aws emr create-cluster
  …
  --service-role
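A job-level role that reads from the input bucket and writes to the output and logs buckets could carry a policy along these lines. This is a minimal sketch: the bucket names are hypothetical, the statement IDs are invented for readability, and the set of S3 actions is an assumption that a real job may need to widen or narrow.

```python
import json


def job_policy(input_bucket, output_bucket, logs_bucket):
    """Build an IAM policy document: read In, write Out and Logs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ListInput",
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{input_bucket}"],
            },
            {
                "Sid": "ReadInput",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{input_bucket}/*"],
            },
            {
                "Sid": "WriteOutputAndLogs",
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [
                    f"arn:aws:s3:::{output_bucket}/*",
                    f"arn:aws:s3:::{logs_bucket}/*",
                ],
            },
        ],
    }


if __name__ == "__main__":
    # Hypothetical bucket names.
    policy = job_policy("example-in", "example-out", "example-logs")
    print(json.dumps(policy, indent=2))
```

Keeping one such document per job makes the permissions specific and auditable, in line with the slide's recommendation.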
Isolate environments
Need: Build and release? Multitenancy?
[Figure: Cluster Manager and Scheduler in VPC subnets inside a virtual private cloud, with security groups; In, Out, and Logs buckets]
aws emr create-cluster
  …
  --ec2-attributes '{
    "KeyName":"",
    "InstanceProfile":"",
    "ServiceAccessSecurityGroup":"",
    "SubnetId":"",
    "EmrManagedSlaveSecurityGroup":"",
    "EmrManagedMasterSecurityGroup":""}'
Isolate environments
Need: Build and release? Multitenancy?
[Figure: an Environment groups IAM, security groups, and VPC subnets inside a virtual private cloud; environments shown: Dev, Staging, Canary, Prod]
Automation
• Use CloudFormation or Terraform
• Upgrades use the same provisioning script + DNS upsert
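Reusing one provisioning script across environments can be sketched by generating the `--ec2-attributes` JSON per environment instead of hand-editing it, which also avoids quoting mistakes. The attribute keys are the ones shown in the talk's `aws emr create-cluster` example; every ID, key name, and profile name below is a hypothetical placeholder, and the `ENVIRONMENTS` table is an assumption for illustration.

```python
import json
import shlex

# Hypothetical per-environment network settings; real IDs will differ.
ENVIRONMENTS = {
    "dev": {"subnet": "subnet-dev-example", "sg_prefix": "sg-dev"},
    "staging": {"subnet": "subnet-staging-example", "sg_prefix": "sg-staging"},
    "canary": {"subnet": "subnet-canary-example", "sg_prefix": "sg-canary"},
    "prod": {"subnet": "subnet-prod-example", "sg_prefix": "sg-prod"},
}


def ec2_attributes(env):
    """Build the --ec2-attributes document for one environment."""
    settings = ENVIRONMENTS[env]
    prefix = settings["sg_prefix"]
    return {
        "KeyName": f"{env}-keypair",                 # hypothetical key pair name
        "InstanceProfile": f"{env}-emr-ec2-profile",  # hypothetical instance profile
        "ServiceAccessSecurityGroup": f"{prefix}-service-example",
        "SubnetId": settings["subnet"],
        "EmrManagedSlaveSecurityGroup": f"{prefix}-slave-example",
        "EmrManagedMasterSecurityGroup": f"{prefix}-master-example",
    }


def ec2_attributes_arg(env):
    """Return the JSON as a single shell-safe argument."""
    return shlex.quote(json.dumps(ec2_attributes(env)))


if __name__ == "__main__":
    for env in ENVIRONMENTS:
        print(env, ec2_attributes_arg(env))
```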
Isolate environments
Need: Build and release? Multitenancy?
[Figure: Environment (IAM, security groups, VPC subnets, virtual private cloud) for Dev, Staging, Canary, and Prod, shown across multiple Availability Zones and regions]
Did we just automate ourselves out of our jobs?
Nope. Now we have time to take on new projects and grow…
THANK YOU!
Jonathan Fritz: jonfritz@amazon.com
Anya Bida: abida@salesforce.com
aws.amazon.com/emr
aws.amazon.com/blogs/big-data
aws.amazon.com/blogs/ai
Extra slides
`spark.history.fs.cleaner.enabled=true`

Accredited-Transport-Cooperatives-Jan-2021-Web.pdf
 
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 nightCheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
Cheap Rate Call girls Sarita Vihar Delhi 9205541914 shot 1500 night
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
ALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptxALSO dropshipping via API with DroFx.pptx
ALSO dropshipping via API with DroFx.pptx
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 

Design patterns and best practices for data analytics with amazon emr (ABD305)

  • 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT Design Patterns and Best Practices for Data Analytics with Amazon EMR (ABD305). Jonathan Fritz, Principal Product Manager, Amazon EMR. Anya Bida, Senior Member of Technical Staff, Salesforce.
  • 2. Overview - Intro and architectures - Using Amazon EC2 Spot and Auto Scaling - Security overview - Ad-hoc and advanced workflows - Apache Spark and Amazon EMR at Salesforce
  • 3. What is Amazon EMR? Easy to use: launch a cluster in minutes. Low cost: pay per-second. Open-source variety: latest versions of software. Managed: spend less time monitoring. Secure: easy to enable options. Flexible: full customization and control.
  • 4. Open-source applications. Storage: S3 (EMRFS), HDFS. YARN: cluster resource management. Batch: MapReduce. Interactive: Tez. In memory: Spark. Applications: Hive, Pig, Spark SQL/Streaming/ML, Flink, Mahout, Sqoop; HBase/Phoenix; Presto; Streaming: Flink; Zeppelin (notebooks); MXNet; Hue (SQL); Ganglia (monitoring); Livy (job server). Connectors to AWS services. Amazon EMR service.
  • 5. Use the AWS Glue Data Catalog • Support for Spark, Hive and Presto • Auto-generate schema and partitions • Managed table updates
  • 6. HBase for random access at massive scale
  • 7. Real-time and batch processing
  • 8. New – Deep learning with GPU instances. Use P3 instances with GPUs.
  • 9. Tips to lower your costs
  • 10. Transient or long-running clusters • Amazon Linux AMI with preinstalled customizations for faster cluster creation • Auto Scaling to minimize costs for long-running clusters
  • 11. EC2 Spot and instance fleets • EMR will select optimal EC2 AZ • Provision across instance types • Switch to on-demand
  • 12. Use Auto Scaling. Scaling options. Threshold: CloudWatch or custom metric
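A threshold rule on a CloudWatch metric can be sketched as a plain Python dictionary. This is a minimal sketch, assuming the rule follows the general shape of an EMR automatic scaling policy (constraints plus rules, each with an action and a CloudWatch alarm trigger); the capacity limits, cooldown, threshold, and rule name are illustrative values, not taken from the slides, and the field names should be checked against the EMR API reference before use.

```python
def scale_out_rule(metric_name, threshold, adjustment=1):
    """Build one scale-out rule triggered when a metric drops below a threshold."""
    return {
        "Name": f"scale-out-on-{metric_name}",  # hypothetical rule name
        "Action": {
            "SimpleScalingPolicyConfiguration": {
                "AdjustmentType": "CHANGE_IN_CAPACITY",
                "ScalingAdjustment": adjustment,  # add this many instances
                "CoolDown": 300,                  # seconds between scaling activities
            }
        },
        "Trigger": {
            "CloudWatchAlarmDefinition": {
                "ComparisonOperator": "LESS_THAN",
                "EvaluationPeriods": 1,
                "MetricName": metric_name,
                "Namespace": "AWS/ElasticMapReduce",
                "Period": 300,
                "Statistic": "AVERAGE",
                "Threshold": threshold,
                "Unit": "PERCENT",
            }
        },
    }


# Illustrative policy: grow the instance group when available YARN memory is low.
policy = {
    "Constraints": {"MinCapacity": 2, "MaxCapacity": 10},  # assumed limits
    "Rules": [scale_out_rule("YARNMemoryAvailablePercentage", 15.0)],
}
```

A matching scale-in rule would use a negative adjustment and a comparison in the other direction.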
  • 13. Auto Scaling • EMR scales in at YARN task completion • Selectively removes nodes with no running tasks • yarn.resourcemanager.decommissioning.timeout • Default timeout is one hour • Spark scale-in contributions • Spark-specific blacklisting of tasks • Unregistering cached data and shuffle blocks • Advanced error handling
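The decommissioning timeout named on this slide could be overridden through a cluster configuration object. The sketch below builds one in Python; it assumes the property belongs to the `yarn-site` classification and that its value is expressed in seconds (so the one-hour default is 3600). Both points are assumptions to verify for the EMR release in use.

```python
import json

ONE_HOUR_SECONDS = 60 * 60  # the default quoted on the slide, assumed to be in seconds


def yarn_decommission_config(timeout_seconds=ONE_HOUR_SECONDS):
    """Return an EMR-style configuration list that sets the scale-in timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return [
        {
            "Classification": "yarn-site",  # assumed classification
            "Properties": {
                "yarn.resourcemanager.decommissioning.timeout": str(timeout_seconds)
            },
        }
    ]


# Example: let nodes leave after 15 minutes instead of waiting a full hour.
config = yarn_decommission_config(15 * 60)
print(json.dumps(config, indent=2))
```

A shorter timeout releases nodes sooner at the cost of possibly interrupting long tasks.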
  • 14. Tips to secure your cluster
  • 15. Encryption • Spark • Tez • MapReduce • Presto • HBase • Hive • Pig
  • 16. Authentication. LDAP: HiveServer2, Presto Coordinator, Spark Thrift Server, Hue Server, Zeppelin Server. AWS credentials: EMR Step (EMR API). EC2 key pair: SSH as “hadoop”
  • 17. New – Authentication with Kerberos. Microsoft Active Directory, KDC, Users, YARN RM, DoAs, Service principals for all cluster nodes, Master Node
  • 18. Authorization • Storage-based • EMRFS/S3 • HDFS • HiveServer2 and Presto (SQL-based) • HBase • YARN queues • Fine-grained access control by cluster tag (IAM) • Apache Ranger on edge node (using CloudFormation)
  • 19. New – EMRFS fine-grained authorization. Can map IAM roles to user, group, or S3 prefix. Context: User: aduser, Group: analyst, IAM role: analytics_prod. Context: User: aduser2, Group: dev, IAM role: analytics_dev
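The idea of mapping a request context to an IAM role can be illustrated with a small lookup. This is a toy sketch in plain Python, not the EMRFS implementation: the mapping format, the first-match-wins order, and the `default_role` fallback are assumptions made for illustration. Only the user, group, and role names come from the slide.

```python
# Illustrative mappings: each entry ties an identifier type and values to a role name.
ROLE_MAPPINGS = [
    {"type": "Group", "identifiers": ["analyst"], "role": "analytics_prod"},
    {"type": "Group", "identifiers": ["dev"], "role": "analytics_dev"},
]


def resolve_role(user, groups, s3_prefix, mappings, default_role):
    """Return the role of the first mapping that matches the request context."""
    for mapping in mappings:
        ids = mapping["identifiers"]
        if mapping["type"] == "User" and user in ids:
            return mapping["role"]
        if mapping["type"] == "Group" and any(g in ids for g in groups):
            return mapping["role"]
        if mapping["type"] == "Prefix" and any(s3_prefix.startswith(p) for p in ids):
            return mapping["role"]
    return default_role  # hypothetical fallback, e.g. the cluster's instance role


role_1 = resolve_role("aduser", ["analyst"], "s3://bucket/a/", ROLE_MAPPINGS, "cluster_default")
role_2 = resolve_role("aduser2", ["dev"], "s3://bucket/b/", ROLE_MAPPINGS, "cluster_default")
```

With the two contexts from the slide, `role_1` resolves to `analytics_prod` and `role_2` to `analytics_dev`.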
  • 20. Security configuration demo
  • 21. Tips to submit workflows
  • 22. Use Livy as an ad-hoc Spark job server. Custom Application
  • 23. Oozie and Airflow for DAGs of jobs
  • 24. Select Amazon EMR customers
  • 25. Apache Spark and Amazon EMR at Salesforce abida@salesforce.com @anyabida1 Anya Bida, Senior Member of Technical Staff at Salesforce
  • 26. Fastest Growing Top 5 Enterprise Software Company. Revenue: $1.7B FY11, $2.3B FY12, $3.1B FY13, $4.1B FY14, $5.4B FY15, $6.7B FY16, $8.4B FY17; $2.56B FY18 Q2 revenue. 2009 • 2010 • 2011 2012 • 2013 • 2014 2015 • 2016 • 2017 September 2016 2011 • 2012 • 2013 2014 • 2015 • 2016 • 2017 The world’s most innovative companies “Innovator of the Decade”
  • 27. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use AWS Identity and Access Management (IAM) roles Isolate Environments
  • 28. Complete ML Pipeline ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time
  • 29. Selecting a tool ETL Feature Engineering Model Training Model Evaluation Deploy & Operationalize Models Score & Update Models Support Batch & Real Time Supports each step of our ML pipeline Scales for small & large jobs Good ML Libraries Active user base Ability to deploy production ready code
  • 31. We wanted Spark…now how to deploy it? EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & compute options
  • 32. We wanted Spark…now how to deploy it? EC2 • Support for batch / streaming • Integrates with our tooling • Spin up / down clusters • Larger / smaller clusters • Support for different versions of Hadoop, Spark • Storage & Compute options Need: Management
  • 34. Provision EMR Simplest approach Input bucket Monitoring aws emr create-cluster --applications Name=Hadoop Name=Spark Name=Ganglia --tags --ec2-attributes --release-label --log-uri … --name --instance-groups --region
  • 35. Provision EMR. Simplest approach. Input bucket. Monitoring.
      aws emr create-cluster
        --applications Name=Hadoop Name=Spark Name=Ganglia
        # tag for cost analysis by project
        --tags
        --ec2-attributes
        --release-label
        # send logs to S3
        --log-uri …
        # use naming conventions for service discovery
        # <region>-<project>-<version>-<env>
        --name
        # CORE nodes used for writes to HDFS
        # TASK nodes used for compute - try spot instances here for starters
        --instance-groups
        --region
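The `<region>-<project>-<version>-<env>` naming convention can be enforced with a small helper. The sketch below is one possible approach in Python: the function names and the rule that project, version, and environment must not contain hyphens are assumptions added so that a name can be parsed back unambiguously even though region names such as `us-west-2` contain hyphens themselves.

```python
def cluster_name(region, project, version, env):
    """Join the four parts into <region>-<project>-<version>-<env>."""
    for label, part in (("project", project), ("version", version), ("env", env)):
        if not part or "-" in part:
            raise ValueError(f"{label} must be non-empty and contain no hyphen: {part!r}")
    if not region:
        raise ValueError("region must be non-empty")
    return f"{region}-{project}-{version}-{env}"


def parse_cluster_name(name):
    """Split a name from the right so hyphens inside the region are preserved."""
    region, project, version, env = name.rsplit("-", 3)
    return {"region": region, "project": project, "version": version, "env": env}


# Hypothetical project and version, used only to show the round trip.
name = cluster_name("us-west-2", "churn", "5.10.0", "dev")
parsed = parse_cluster_name(name)
```

A service-discovery script could then filter cluster listings by the parsed `project` and `env` fields.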
  • 36. Simplest approach Input bucket EMR cluster Output bucket Logs bucket
  • 38. Spark Primer. Apache Spark: Driver Program (SparkContext); Cluster Manager; Node (Executor, Cache, Task, Task); Node (Executor, Cache, Task, Task)
  • 39. Spark on Amazon EMR. Apache Spark: Master Node, Core Node, Cluster Manager, ResourceManager, NameNode, NodeManager, DataNode, Task Node, NodeManager, NodeManager
  • 40. Spark on Amazon EMR. Apache Spark: Master Node, Core Node, Executor, Cache, Task, Task, Cluster Manager, ResourceManager, NameNode, Task Node, Executor, Cache, Task, Task, Driver Program, SparkContext, Executor, Cache, Task, Task, NodeManager, DataNode, NodeManager, NodeManager
  • 41. Properties related to Dynamic Allocation (property = value): spark.dynamicAllocation.enabled = true; spark.shuffle.service.enabled = true; spark.dynamicAllocation.minExecutors = 5; spark.dynamicAllocation.maxExecutors = 17; spark.dynamicAllocation.initialExecutors = 0; spark.dynamicAllocation.executorIdleTimeout = 60s; spark.dynamicAllocation.schedulerBacklogTimeout = 5s; spark.dynamicAllocation.sustainedSchedulerBacklogTimeout = 5s. Optional
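The dynamic allocation properties from this slide can be kept in one place and rendered either as a `spark-defaults` configuration object or as `--conf` arguments. The sketch below uses the values shown on the slide and the standard Spark property spellings; the helper names are hypothetical.

```python
# Values as listed on the slide.
DYNAMIC_ALLOCATION = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "5",
    "spark.dynamicAllocation.maxExecutors": "17",
    "spark.dynamicAllocation.initialExecutors": "0",
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    "spark.dynamicAllocation.schedulerBacklogTimeout": "5s",
    "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": "5s",
}


def spark_defaults_classification(props):
    """Wrap the properties in an EMR-style configuration list."""
    return [{"Classification": "spark-defaults", "Properties": dict(props)}]


def as_submit_args(props):
    """Render the properties as a flat list of --conf key=value arguments."""
    args = []
    for key, value in props.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = as_submit_args(DYNAMIC_ALLOCATION)
```

Keeping one dictionary avoids drift between cluster-level defaults and per-job overrides.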
  • 42. Where do I find metrics? Ganglia: windowing, dashboarding. CloudWatch: YARNMemoryAvailablePercentage. Logs?
  • 43. Anatomy of a Spark Job (High Performance Spark, Karau & Warren, O’Reilly). Spark Application: Spark Context/Spark Session object. Job: actions (e.g., collect, saveAsTextFile). Stage: wide transformations (sort, groupByKey). Task: computation to evaluate one partition (combine narrow transforms).
  • 44. Simplest approach Input bucket EMR cluster Output bucket Reads from S3 • Jar files too! Write Intermediate files • MEM or Disk? • Local? HDFS? Amazon S3? Writes to S3 • Data available after cluster is terminated
  • 45. Cache Persist Checkpoint Local Checkpoint local mem cache MEM MEM MEM local disk DISK DISK HDFS / S3 Specify dir If exec is decommed, are writes available? No No Yes No If job finishes are writes available? No No Yes No Preserve lineage graph? Yes Yes No No RDD Re-use
  • 46. Cache Persist Checkpoint Local Checkpoint local mem cache MEM MEM MEM local disk DISK DISK HDFS / S3 Specify dir If exec is decommed, are writes available? No No Yes No If job finishes are writes available? No No Yes No Preserve lineage graph? Yes Yes No No RDD Re-use Persist to improve speed, Checkpoint to improve fault tolerance
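The three yes/no rows of the comparison above can be captured as a lookup table. This is only a restatement of the slide's answers in Python (the dictionary keys and the helper name are made up for illustration); it does not call Spark.

```python
# Answers copied from the slide, one entry per RDD re-use option.
PERSISTENCE = {
    "cache":            {"survives_executor_decommission": False, "survives_job_end": False, "keeps_lineage": True},
    "persist":          {"survives_executor_decommission": False, "survives_job_end": False, "keeps_lineage": True},
    "checkpoint":       {"survives_executor_decommission": True,  "survives_job_end": True,  "keeps_lineage": False},
    "local_checkpoint": {"survives_executor_decommission": False, "survives_job_end": False, "keeps_lineage": False},
}


def options_where(prop):
    """List the re-use options for which the given property is true."""
    return [name for name, answers in PERSISTENCE.items() if answers[prop]]


fault_tolerant = options_where("survives_executor_decommission")
lineage_kept = options_where("keeps_lineage")
```

Only `checkpoint` survives a decommissioned executor, which matters on clusters that scale in or run on Spot instances.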
  • 47. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 49. Understand resource allocation Understanding Memory Management in Spark For Fun And Profit Shivnath Babu (Duke University, Unravel Data Systems) Mayuresh Kunjir (Duke University) Node Memory Container Memory 8Gb Node Memory Container Memory 8Gb
  • 50. Can my 8Gb container launch on this cluster? Node Memory Node Memory Node Memory 4Gb used 8Gb total 8Gb
  • 51. Can my 8Gb container launch on this cluster? Scale-out Rule: Num Containers Pending Node Memory Node Memory Node Memory 4Gb used 8Gb total 8Gb
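The question on this slide comes down to per-node free memory rather than cluster-wide free memory. The sketch below assumes three nodes of 8 GB each with 4 GB already in use on every node, which is one reading of the diagram; the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Node:
    total_gb: int
    used_gb: int

    @property
    def free_gb(self) -> int:
        return self.total_gb - self.used_gb


def can_launch(nodes, container_gb):
    """A container must fit on a single node; free memory cannot be pooled."""
    return any(node.free_gb >= container_gb for node in nodes)


# Assumed cluster state: every node is half full.
cluster = [Node(8, 4), Node(8, 4), Node(8, 4)]
total_free = sum(node.free_gb for node in cluster)
fits = can_launch(cluster, 8)
```

Here the cluster has 12 GB free in aggregate, yet the 8 GB container stays pending. This is why a scale-out rule based on pending containers can react where a memory-percentage rule would not.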
  • 52. Monitor multiple viewpoints Connectivity viewer https://www.linkedin.com/in/vaibhavt/ Vaibhav Tandon
  • 55. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 56. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role
  • 57. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role IAM
  • 58. Use IAM roles Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role • IAM Roles determine read / write access to data IAM Out Logs IAM In
  • 59. Use IAM roles Every user, service, & job should have specific, auditable permissions. New: EMRFS fine-grained access control!! Cluster Manager Scheduler IAM IAM Roles • User has an IAM Role • Job has an IAM Role • IAM Roles determine read / write access to data IAM Out Logs IAM aws emr create-cluster … --service-role In
  • 60. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 61. Isolate environments Need: Build and release? Multitenancy? Cluster Manager Scheduler Out Logs In
  • 62. Cluster Manager Scheduler virtual private cloud Out Logs In Isolate environments Need: Build and release? Multitenancy?
  • 63. Cluster Manager Scheduler VPC subnet virtual private cloud VPC subnet Out Logs In Isolate environments Need: Build and release? Multitenancy?
  • 64. Cluster Manager Scheduler VPC subnet virtual private cloud security group security group security group VPC subnet Out Logs In Isolate environments Need: Build and release? Multitenancy?
  • 65. Cluster Manager Scheduler VPC subnet virtual private cloud security group security group security group VPC subnet Out Logs In aws emr create-cluster … --ec2-attributes '{ "KeyName":"", "InstanceProfile":"", "ServiceAccessSecurityGroup":"", "SubnetId":"", "EmrManagedSlaveSecurityGroup":"", "EmrManagedMasterSecurityGroup":""}' Isolate environments Need: Build and release? Multitenancy?
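Hand-written JSON inside shell quotes is easy to break, so the `--ec2-attributes` value could be generated instead. The sketch below builds and shell-quotes the JSON in Python using the six keys shown on the slide. All the identifier values are placeholders, and the helper name is hypothetical.

```python
import json
import shlex

EC2_ATTRIBUTE_KEYS = (
    "KeyName",
    "InstanceProfile",
    "ServiceAccessSecurityGroup",
    "SubnetId",
    "EmrManagedSlaveSecurityGroup",
    "EmrManagedMasterSecurityGroup",
)


def ec2_attributes_arg(values):
    """Return a shell-safe JSON string for --ec2-attributes, rejecting empty fields."""
    missing = [key for key in EC2_ATTRIBUTE_KEYS if not values.get(key)]
    if missing:
        raise ValueError(f"missing values for: {', '.join(missing)}")
    payload = json.dumps({key: values[key] for key in EC2_ATTRIBUTE_KEYS})
    return shlex.quote(payload)


# Placeholder identifiers for one environment; real IDs come from your VPC setup.
dev_env = {
    "KeyName": "example-key",
    "InstanceProfile": "example-instance-profile",
    "ServiceAccessSecurityGroup": "sg-EXAMPLE1",
    "SubnetId": "subnet-EXAMPLE",
    "EmrManagedSlaveSecurityGroup": "sg-EXAMPLE2",
    "EmrManagedMasterSecurityGroup": "sg-EXAMPLE3",
}
ec2_arg = ec2_attributes_arg(dev_env)
```

One such dictionary per environment (dev, staging, canary, prod) keeps subnets and security groups from being mixed up across environments.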
  • 66. Cluster Manager Scheduler In Out Logs VPC subnet virtual private cloud security group IAM security group security group VPC subnet IAM Environment Isolate environments Need: Build and release? Multitenancy?
  • 67. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Dev Staging Canary Prod Isolate environments Need: Build and release? Multitenancy?
  • 68. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Dev Staging Canary Prod Automation • Use CloudFormation or Terraform • Upgrades use the same provisioning script + DNS Upsert Isolate environments Need: Build and release? Multitenancy?
  • 69. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Availability Zone region Dev Staging Canary Prod Isolate environments Need: Build and release? Multitenancy?
  • 70. VPC subnet virtual private cloud security group security group security group VPC subnet IAM Environment Availability Zone region Dev Staging Canary Prod Availability Zone region Dev Staging Canary Prod Availability Zone Availability Zone Isolate environments Need: Build and release? Multitenancy?
  • 71. Overview Our Goal Getting started with EMR Spark primer Monitor multiple viewpoints Use IAM Roles Isolate Environments
  • 72. Did we just automate ourselves out of our jobs? Nope. Now we have time to take on new projects and grow…
  • 73. THANK YOU! Jonathan Fritz: jonfritz@amazon.com. Anya Bida: abida@salesforce.com. aws.amazon.com/emr, aws.amazon.com/blogs/big-data, aws.amazon.com/blogs/ai