©2015, Amazon Web Services, Inc. or its affiliates. All rights reserved
Deep Dive: Amazon EMR
Anthony Nguyen
Senior Big Data Consultant
AWS
aanwin@amazon.com
Four key takeaways about EMR from this talk
1. Open source is awesome
2. Decoupling of compute and storage
– HDFS vs S3
– Reliability and Durability
– Cost
– Multiple clusters
– Logical separation of jobs/clusters/teams
3. Programmable long running or transient clusters
4. Treat compute as a fungible commodity
What is EMR?
Hosted framework that allows you to run Hadoop,
Spark and other distributed compute frameworks
on AWS
Makes it easy to quickly and cost-effectively
process vast amounts of data.
Just some of the organizations using EMR:
Why Amazon EMR?
Easy to Use
Launch a cluster in minutes
Low Cost
Pay an hourly rate
Elastic
Easily add or remove capacity
Reliable
Spend less time monitoring
Secure
Managed firewalls
Flexible
You control the cluster
The Hadoop ecosystem can run in Amazon EMR
Architecture of Amazon EMR
Storage: Amazon S3 (EMRFS), HDFS
Cluster resource management: YARN
Processing frameworks: Batch (MapReduce), Interactive (Tez), In-memory (Spark)
Applications: Hive, Pig, Spark SQL/Streaming/ML, Mahout, Sqoop, HBase/Phoenix, Presto, Hue (SQL interface / metastore management), Zeppelin (interactive notebook), Ganglia (monitoring), HiveServer2 / Spark Thrift Server (JDBC/ODBC)
All provisioned and managed by the Amazon EMR service
Unlike on-prem or EC2-fleet Hadoop,
EMR Decouples Storage & Compute
Traditional Hadoop: tightly coupled compute & storage → inflexibility
Amazon EMR: decoupled compute & storage → flexible storage, right-sized compute
In an On-premises Environment, Compute and Storage Are Tightly Coupled
Compute and storage grow together:
• Storage grows along with compute
• Compute requirements vary
Underutilized or Scarce Resources
[Chart: cluster usage over 26 weeks, showing a steady state baseline, weekly peaks, and occasional re-processing spikes]
[Chart: the same 26 weeks against a flat provisioned-capacity line, highlighting the underutilized capacity between peaks]
Contention for the Same Resources
Some jobs are compute bound, others are memory bound
EMR: Aggregate all Data in S3 as your Data Lake
Surrounded by a collection of the right tools: EMR, Kinesis, Redshift, DynamoDB, RDS, Data Pipeline, Spark Streaming, Storm, and Import/Export Snowball, all centered on Amazon S3
EMRFS: Amazon S3 as your persistent data store
• Decouple compute & storage
• Shut down EMR clusters with no data
loss
• Right-size EMR cluster
independently of storage
• Multiple Amazon EMR clusters can
concurrently use same data in Amazon
S3
[Diagram: multiple EMR clusters concurrently reading the same data in Amazon S3]
HDFS is still there if you need it
• Iterative workloads
– If you’re processing the same dataset more than
once
– Consider using Spark & RDDs for this too
• Disk I/O intensive workloads
• Persist data on Amazon S3 and use S3DistCp to
copy to/from HDFS for processing
Provisioning clusters
What Do I Need to Build a Cluster?
1. Choose instances
2. Choose your software
3. Choose your access method
EMR is Easy to Deploy
AWS Management Console, AWS CLI, or the EMR API with your favorite SDK
Choose your instance types
Try different configurations to find your optimal architecture:
CPU: c3 family, cc1.4xlarge, cc2.8xlarge
Memory: m2 family, r3 family, m4 family
Disk/IO: d2 family, i2 family
General: m1 family, m3 family
Match the family to the workload: batch processing, machine learning, Spark and interactive analysis, or large HDFS
Use multiple EMR instance groups
Master Node
r3.2xlarge
Example Amazon
EMR Cluster
Slave Group - Core
c3.2xlarge
Slave Group – Task
m3.xlarge (EC2 Spot)
Slave Group – Task
m3.2xlarge (EC2 Spot)
Core nodes run HDFS
(DataNode). Task nodes do
not run HDFS. Core and
Task nodes each run YARN
(NodeManager).
Master node runs
NameNode (HDFS),
ResourceManager (YARN),
and serves as a gateway.
EMR 5.0 - Applications
Easy to monitor and debug
Monitor: integrated with Amazon CloudWatch; monitor cluster, node, I/O, and 20+ custom Hadoop metrics
Debug: task-level debugging information is already pushed into a datastore
EMR logging to S3 makes logs easily available
Logs live in S3; you can use ELK to visualize them
Resizable clusters
Easy to add and remove compute capacity on your cluster
Match compute demands with cluster sizing (see the resize sketch below)
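As a sketch, resizing is a single CLI call; the instance group ID below is a placeholder you would read from describe-cluster:

# Grow (or shrink) a task/core instance group to 10 instances
aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=10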
Easy to use Spot Instances
Spot for task nodes: up to 90% off EC2 on-demand pricing
On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity
Meet your SLA at predictable cost; exceed your SLA at lower cost
The spot advantage
• Let's say a 7-hour job needs 100 nodes, each at $1/hour: 7 x 100 x $1 = $700
• Scale up to 200 nodes and the job finishes in about 4 hours:
– 4 x 100 x $1 = $400 on-demand
– 4 x 100 x $0.50 = $200 on Spot
– Save $100 ($600 vs. $700) and finish the job in 4 hours instead of 7
• Run on-demand capacity for the worst acceptable SLA
– Scale up beyond that with Spot Instances (see the sketch below)
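A hedged sketch of that scale-up with the CLI; the cluster ID, instance count, and bid price are placeholders:

# Add a Spot task instance group to a running cluster
aws emr add-instance-groups --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceCount=100,InstanceGroupType=TASK,InstanceType=m3.xlarge,BidPrice=0.50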
Options to submit jobs – Off Cluster
Amazon EMR
Step API
Submit a Spark
application
Amazon EMR
AWS Data Pipeline
Airflow, Luigi, or other
schedulers on EC2
Create a pipeline
to schedule job
submission or create
complex workflows
AWS Lambda
Use AWS Lambda to
submit applications to
EMR Step API or directly
to Spark on your cluster
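For example, submitting a Spark application through the Step API might look like the following sketch; the cluster ID is a placeholder, and the SparkPi example jar path assumes the Spark install that ships on EMR:

# Submit a Spark application as an EMR step
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=SparkPi,ActionOnFailure=CONTINUE,Args=[--class,org.apache.spark.examples.SparkPi,/usr/lib/spark/lib/spark-examples.jar,10]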
Options to submit jobs – On Cluster
Web UIs: Hue SQL editor,
Zeppelin notebooks,
R Studio, and more!
Connect with ODBC/JDBC using HiveServer2 / Spark Thrift Server (start it with start-thriftserver.sh)
Use Spark actions in your Apache Oozie workflow to create DAGs of jobs
Or, use the native APIs and CLIs for each application
EMR Security
EMRFS client-side encryption
Amazon S3 encryption clients, with EMRFS enabled for Amazon S3 client-side encryption
Key vendor: AWS KMS or your custom key vendor
Objects are stored client-side encrypted in Amazon S3
Use Identity and Access Management (IAM) roles with
your Amazon EMR cluster
• IAM roles give fine-grained control over delegating permissions to AWS services and access to AWS resources
• EMR uses two IAM roles:
– EMR service role is for the Amazon EMR
control plane
– EC2 instance profile is for the actual
instances in the Amazon EMR cluster
• Default IAM roles can be easily created and used from the AWS Console and AWS CLI (see the sketch below)
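A minimal sketch with the AWS CLI:

# Creates EMR_DefaultRole (service role) and EMR_EC2_DefaultRole (instance
# profile) if they do not already exist
aws emr create-default-roles
# Then launch clusters with --use-default-roles, as in the earlier examples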
EMR Security Groups: default and custom
• A security group is a virtual firewall which controls access
to the EC2 instances in your Amazon EMR cluster
– There is a single default master and default slave
security group across all of your clusters
– The master security group has port 22 access for
SSHing to your cluster
• You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and further limit ingress and egress policies (see the sketch below).
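A hedged example of attaching additional security groups at launch; the group IDs are placeholders:

# Attach custom security groups alongside the defaults
aws emr create-cluster --name SecureCluster --release-label emr-5.0.0 \
  --instance-type m3.xlarge --instance-count 3 --use-default-roles \
  --ec2-attributes KeyName=MyKeyPair,AdditionalMasterSecurityGroups=[sg-1111aaaa],AdditionalSlaveSecurityGroups=[sg-2222bbbb]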
Encryption
• EMRFS encryption options
– S3 server-side encryption
– S3 client-side encryption (use AWS Key Management Service keys or custom keys)
• CloudTrail integration
– Track Amazon EMR API calls for auditing
• Launch your Amazon EMR clusters in a VPC
– A logically isolated section of the AWS cloud (Virtual Private Cloud)
– Enhanced networking on certain instance types
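As a sketch, EMRFS server-side encryption can be requested at launch via the --emrfs option (client-side encryption takes provider arguments instead, as the NASDAQ example below shows); names and sizes are placeholders:

# Launch with EMRFS S3 server-side encryption enabled
aws emr create-cluster --name EncryptedCluster --release-label emr-5.0.0 \
  --instance-type m3.xlarge --instance-count 3 --use-default-roles \
  --ec2-attributes KeyName=MyKeyPair \
  --emrfs Encryption=ServerSide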
NASDAQ
• Encrypt data in the cloud, but the keys need to stay on-premises
• SafeNet Luna SA as the key management service (KMS)
• EMRFS
– Utilizes the S3 encryption client's envelope encryption and provides a hook
– Custom encryption materials providers
NASDAQ
• Write your custom encryption materials provider
• --emrfs Encryption=ClientSide,ProviderType=Custom,CustomProviderLocation=s3://mybucket/myfolder/myprovider.jar,CustomProviderClass=providerclassname
• EMR pulls and installs the JAR on every node
• More info @ “NASDAQ AWS Big Data blog”
• Code samples
Amazon EMR integration with Amazon Kinesis
Read data directly into Hive, Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams
No intermediate data persistence required
Simple way to introduce real-time sources into batch-oriented systems
Multi-application support & automatic checkpointing (see the sketch below)
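A minimal sketch of the Hive side, assuming a hypothetical stream named AccessLogStream and a deliberately simplified one-column schema:

# Map a Kinesis stream to a Hive table via the EMR Kinesis storage handler
hive -e "
CREATE TABLE apache_log (line STRING)
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES ('kinesis.stream.name' = 'AccessLogStream');"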
Use Hive with EMR to query data in DynamoDB
• Export data stored in DynamoDB to
Amazon S3
• Import data in Amazon S3 to
DynamoDB
• Query live DynamoDB data using SQL-
like statements (HiveQL)
• Join data stored in DynamoDB and
export it or query against the joined data
• Load DynamoDB data into HDFS and use it in your EMR job (see the sketch below)
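A hedged sketch of the Hive-side mapping, with a hypothetical DynamoDB table named Orders:

# Expose a DynamoDB table to HiveQL via the DynamoDB storage handler
hive -e "
CREATE EXTERNAL TABLE orders (order_id STRING, total DOUBLE)
STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
TBLPROPERTIES ('dynamodb.table.name' = 'Orders',
  'dynamodb.column.mapping' = 'order_id:OrderId,total:Total');"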
Use AWS Data Pipeline and EMR to transform data
and load into Amazon Redshift
Unstructured Data Processed Data
Pipeline orchestrated and scheduled by AWS Data Pipeline
Install an iPython Notebook using Bootstrap
aws emr create-cluster --name iPythonNotebookEMR \
  --ami-version 3.2.3 \
  --instance-type m3.xlarge --instance-count 3 \
  --ec2-attributes KeyName=<<MYKEY>> \
  --use-default-roles \
  --bootstrap-actions Path=s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook,Name=Install_iPython_NB \
  --termination-protected
Install Hadoop Applications with Bootstrap Actions
GitHub repository
Amazon EMR – Design Patterns
EMR example #1: Batch Processing
GB of logs pushed to
S3 hourly
Daily EMR cluster
using Hive to process
data
Input and output
stored in S3
250 Amazon EMR jobs per day, processing 30 TB of data
http://aws.amazon.com/solutions/case-studies/yelp/
EMR example #2: Long-running cluster
Data pushed to S3; a daily EMR cluster ETLs the data into a database
A 24/7 EMR cluster running HBase holds the last 2 years of data
A front-end service uses the HBase cluster to power a dashboard with high concurrency
EMR example #3: Interactive query
TBs of logs sent daily; logs stored in Amazon S3; Hive Metastore on Amazon EMR
Interactive query using Presto on a multi-petabyte warehouse
http://nflx.it/1dO7Pnt
Why Spark? Functional Programming Basics
Spark (easy to parallelize):
messages = textFile(...).filter(lambda s: "ERROR" in s) \
    .map(lambda s: s.split('\t')[2])
Sequential processing:
for (int i = 0; i <= n; i++) {
  if (s[i].contains("ERROR")) {
    messages[i] = split(s[i], '\t')[2]
  }
}
RDDs (and now DataFrames) and Fault Tolerance
• RDDs track the transformations used to build them (their lineage) to recompute lost data
• E.g.:
messages = textFile(...).filter(lambda s: "ERROR" in s) \
    .map(lambda s: s.split('\t')[2])
HadoopRDD (path = hdfs://…) → FilteredRDD (func = contains(…)) → MappedRDD (func = split(…))
Why Presto? SQL on Unstructured and Unlimited
Data
• Dynamic Catalog
• Rich ANSI SQL
• Connectors as Plugins
• High Concurrency
• Batch (ETL)/Interactive Clusters
EMR example #4: Streaming data processing
TBs of logs sent daily; logs stored in Amazon Kinesis
Consumers: Amazon Kinesis Client Library, AWS Lambda, Amazon EMR, Amazon EC2
EMR Best Practices
File formats
• Row oriented
– Text files
– Sequence files (writable objects)
– Avro data files (described by a schema)
• Columnar formats
– Optimized Row Columnar (ORC)
– Parquet
[Diagram: a logical table laid out row-oriented vs. column-oriented]
S3 File Format: Parquet
• Parquet file format: http://parquet.apache.org
• Self-describing columnar file format
• Supports nested structures (Dremel "record shredding" algorithm)
• Emerging standard data format for Hadoop
– Supported by: Presto, Spark, Drill, Hive, Impala, etc.
Parquet vs ORC
• Evaluated Parquet and ORC (competing open columnar formats)
• ORC encrypted performance is currently a problem
– 15x slower vs. unencrypted (94% slower)
– 8 CPUs on 2 nodes: ~900 MB/sec vs. ~60 MB/sec encrypted
– Encrypted Parquet is ~27% slower vs. unencrypted
– Parquet: ~100MB/sec from S3 per CPU core (encrypted)
Factors to consider
• Processing and query tools
– Hive, Impala and Presto
• Evolution of schema
– Avro for schema and Presto for storage
• File format "splittability"
– Avoid JSON/XML files, or treat each file as a single record
File sizes
• Avoid small files
– Anything smaller than 100MB
• Each mapper is a single JVM
– CPU time is required to spawn JVMs/mappers
• Fewer files, matching closely to block size
– fewer calls to S3
– fewer network/HDFS requests
Dealing with small files
• Reduce the HDFS block size, e.g. to 1MB (default is 128MB):
--bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576"
• Better: use S3DistCp to combine smaller files together
– S3DistCp takes a pattern and target path to combine smaller input files into larger ones
– Supply a target size and compression codec (see the sketch below)
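A hedged sketch of running S3DistCp as a step on an EMR 4.x+ cluster; the bucket names, grouping pattern, and cluster ID are placeholders:

# Combine small S3 files into ~128MB objects via an S3DistCp step
aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=CUSTOM_JAR,Name=CombineSmallFiles,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3://mybucket/small-files/,--dest,s3://mybucket/combined/,--groupBy,'.*/(\w+)/.*',--targetSize,128]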
Compression
• Always compress data files on Amazon S3
– Reduces network traffic between Amazon S3 and Amazon EMR
– Speeds up your job
• Compress mapper and reducer output
• Amazon EMR compresses inter-node traffic with
LZO with Hadoop 1, and Snappy with Hadoop 2
Choosing the right compression
• Time sensitive? Faster compression is the better choice
• Large amounts of data? Use space-efficient compression
• Combined workload? Use gzip
Algorithm | Splittable? | Compression ratio | Compress + decompress speed
Gzip (DEFLATE) | No | High | Medium
bzip2 | Yes | Very high | Slow
LZO | Yes | Low | Fast
Snappy | No | Low | Very fast
Cost saving tips for Amazon EMR
• Use S3 as your persistent data store – query it using Presto, Hive,
Spark, etc.
• Only pay for compute when you need it
• Use Amazon EC2 Spot instances to save >80%
• Use Amazon EC2 Reserved instances for steady workloads
• Use CloudWatch alerts to notify you if a cluster is underutilized, then shut it down, e.g. 0 mappers running for >N hours (see the sketch below)
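A hedged sketch of such an alert using EMR's IsIdle CloudWatch metric; the cluster ID and SNS topic ARN are placeholders (this alarms after the cluster reports idle for an hour):

# Alarm when an EMR cluster has been idle for 12 x 5-minute periods
aws cloudwatch put-metric-alarm --alarm-name emr-cluster-idle \
  --namespace AWS/ElasticMapReduce --metric-name IsIdle \
  --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
  --statistic Average --period 300 --evaluation-periods 12 \
  --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:notify-me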
Holy Grail Question
What if I have small file
issues?
S3DistCP Options
Option
--src,LOCATION
--dest,LOCATION
--srcPattern,PATTERN
--groupBy,PATTERN
--targetSize,SIZE
--appendToLastFile
--outputCodec,CODEC
--s3ServerSideEncryption
--deleteOnSuccess
--disableMultipartUpload
--multipartUploadChunkSize,SIZE
--numberFiles
--startingIndex,INDEX
--outputManifest,FILENAME
--previousManifest,PATH
--requirePreviousManifest
--copyFromManifest
--s3Endpoint ENDPOINT
--storageClass CLASS
• Most Important Options
• --src
• --srcPattern
• --dest
• --groupBy
• --outputCodec
Persistent vs. Transient Clusters
Persistent Clusters
• Very similar to traditional Hadoop
deployments
• Cluster stays around after the job
is done
• Data persistence model:
– Amazon S3
– Amazon S3 copied to HDFS
– HDFS, with Amazon S3 as backup
Persistent Clusters
• Always keep data safe on Amazon S3, even if you're using HDFS for primary storage
• Get in the habit of shutting down your cluster and starting a new one, once a week or month
• Design your data processing workflow to account for failure
• You can use workflow management tools such as AWS Data Pipeline
Benefits of Persistent Clusters
• Ability to share data between multiple jobs
• Always On for Analyst Access
[Diagram: a transient cluster reads from and writes to Amazon S3 directly; long-running clusters keep data hot in HDFS with Amazon S3 as the persistent store]
Benefit of Persistent Clusters
• Cost effective for repetitive jobs
When to use Persistent clusters?
If (data load time + processing time) x number of jobs > 24 hours, use persistent clusters; else use transient clusters
E.g. (20 min data load + 1 hour processing time) x 20 jobs ≈ 26 hours
26 hours > 24 hours, therefore use persistent clusters
Transient Clusters
• Cluster lives only for the duration of the job
• Shut down the cluster when the job is done
• Data persisted on Amazon S3
• Input & output data on Amazon S3
Benefits of Transient Clusters
1. Control your cost
2. Minimum maintenance
• Cluster goes away when job is done
3. Best Flexibility of Tools
4. Practice cloud architecture
• Pay for what you use
• Data processing as a workflow
When to use Transient clusters?
If (data load time + processing time) x number of jobs < 24 hours, use transient clusters; else use persistent clusters
E.g. (20 min data load + 1 hour processing time) x 10 jobs ≈ 13 hours
13 hours < 24 hours, therefore use transient clusters
Mix Spot and On-Demand instances to reduce cost and
accelerate computation while protecting against interruption
Reducing Costs with Spot Instances
Scenario #1: Cost without Spot
Job flow duration: 14 hours
4 instances x 14 hrs x $0.50 = $28
Scenario #2: Cost with Spot
Job flow duration: 7 hours
4 on-demand instances x 7 hrs x $0.50 = $14, plus 5 Spot instances x 7 hrs x $0.25 = $8.75
Total = $22.75
Time savings: 50%; cost savings: ~20%
Other EMR + Spot use cases:
• Run the entire cluster on Spot for the biggest cost savings
• Reduce the cost of application testing
Amazon EMR Nodes and Sizes
• Use m1 and c1 family for functional testing
• Use m3 and c3 xlarge and larger nodes for
production workloads
• Use cc2/c3 for memory and CPU intensive jobs
• hs1, hi1, i2 instances for HDFS workloads
• Prefer a smaller cluster of larger nodes
Holy Grail Question
How many nodes do I
need?
Cluster Sizing Calculation
1. Estimate the number of mappers
your job requires.
Cluster Sizing Calculation
2. Pick an EC2 instance and note down the
number of mappers it can run in parallel
e.g. m1.xlarge = 8 mappers in parallel
Cluster Sizing Calculation
3. We need to pick some sample data
files to run a test workload. The
number of sample files should be the
same number from step #2.
Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single
core node and process your sample files
from #3. Note down the amount of time
taken to process your sample files.
Estimated Number of Nodes = (Total Mappers * Time to Process Sample Files) / (Instance Mapper Capacity * Desired Processing Time)
Example: Cluster Sizing Calculation
1. Estimate the number of mappers your job
requires
150
2. Pick an instance and note down the number of
mappers it can run in parallel
m1.xlarge with 8 mapper capacity per instance
Example: Cluster Sizing Calculation
3. We need to pick some sample data files to run a
test workload. The number of sample files should
be the same number from step #2.
8 files selected for our sample test
Example: Cluster Sizing Calculation
4. Run an Amazon EMR cluster with a single core
node and process your sample files from #3.
Note down the amount of time taken to process
your sample files.
3 min to process 8 files
Example: Cluster Sizing Calculation
Estimated number of nodes = (Total Mappers for Your Job * Time to Process Sample Files) / (Per-Instance Mapper Capacity * Desired Processing Time)
= (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge nodes (see the sketch below)
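As a quick sanity check, the same arithmetic as a shell one-liner (bash integer division floors 11.25 to 11; in practice, round up):

# Sizing formula from this example, evaluated in bash
total_mappers=150; sample_time=3; mapper_capacity=8; desired_time=5
echo $(( (total_mappers * sample_time) / (mapper_capacity * desired_time) ))   # prints 11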
xxxxxx@amazon.com
Xxxxxxx

Weitere ähnliche Inhalte

Was ist angesagt?

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architectureAdam Doyle
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRIsrael AWS User Group
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangDatabricks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSAmazon Web Services
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsAlluxio, Inc.
 
Snowflake Architecture.pptx
Snowflake Architecture.pptxSnowflake Architecture.pptx
Snowflake Architecture.pptxchennakesava44
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Databricks
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQLDatabricks
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path ForwardAlluxio, Inc.
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroDatabricks
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekVenkata Naga Ravi
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesNishith Agarwal
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...Databricks
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Flink Forward
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Ryan Blue
 

Was ist angesagt? (20)

Delta lake and the delta architecture
Delta lake and the delta architectureDelta lake and the delta architecture
Delta lake and the delta architecture
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Scaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMRScaling your analytics with Amazon EMR
Scaling your analytics with Amazon EMR
 
Masterclass Live: Amazon EMR
Masterclass Live: Amazon EMRMasterclass Live: Amazon EMR
Masterclass Live: Amazon EMR
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang WangApache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
Apache Spark Data Source V2 with Wenchen Fan and Gengliang Wang
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Apache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic DatasetsApache Iceberg - A Table Format for Hige Analytic Datasets
Apache Iceberg - A Table Format for Hige Analytic Datasets
 
Snowflake Architecture.pptx
Snowflake Architecture.pptxSnowflake Architecture.pptx
Snowflake Architecture.pptx
 
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
Hudi: Large-Scale, Near Real-Time Pipelines at Uber with Nishith Agarwal and ...
 
Physical Plans in Spark SQL
Physical Plans in Spark SQLPhysical Plans in Spark SQL
Physical Plans in Spark SQL
 
Apache Hudi: The Path Forward
Apache Hudi: The Path ForwardApache Hudi: The Path Forward
Apache Hudi: The Path Forward
 
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/AvroThe Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
The Rise of ZStandard: Apache Spark/Parquet/ORC/Avro
 
Processing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeekProcessing Large Data with Apache Spark -- HasGeek
Processing Large Data with Apache Spark -- HasGeek
 
Introduction to Amazon Redshift
Introduction to Amazon RedshiftIntroduction to Amazon Redshift
Introduction to Amazon Redshift
 
Hudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilitiesHudi architecture, fundamentals and capabilities
Hudi architecture, fundamentals and capabilities
 
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
The Top Five Mistakes Made When Writing Streaming Applications with Mark Grov...
 
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
Jim Dowling - Multi-tenant Flink-as-a-Service on YARN
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)Iceberg: A modern table format for big data (Strata NY 2018)
Iceberg: A modern table format for big data (Strata NY 2018)
 

Ähnlich wie Amazon EMR Deep Dive & Best Practices

Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts Julien SIMON
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduceAmazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Amazon Web Services
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRAmazon Web Services
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Web Services
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceAmazon Web Services
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesVladimir Simek
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkAmazon Web Services
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Amazon Web Services
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAmazon Web Services
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSAmazon Web Services
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivAmazon Web Services
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Amazon Web Services
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 GamingAmazon Web Services Korea
 
Cost Optimization with Spot Instances
Cost Optimization with Spot InstancesCost Optimization with Spot Instances
Cost Optimization with Spot InstancesArun Sirimalla
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석Amazon Web Services Korea
 

Ähnlich wie Amazon EMR Deep Dive & Best Practices (20)

Amazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the conceptsAmazon Elastic Map Reduce: the concepts
Amazon Elastic Map Reduce: the concepts
 
AWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMRAWS May Webinar Series - Getting Started with Amazon EMR
AWS May Webinar Series - Getting Started with Amazon EMR
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce(BDT208) A Technical Introduction to Amazon Elastic MapReduce
(BDT208) A Technical Introduction to Amazon Elastic MapReduce
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
Apache Hadoop and Spark on AWS: Getting started with Amazon EMR - Pop-up Loft...
 
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMRSpark and the Hadoop Ecosystem: Best Practices for Amazon EMR
Spark and the Hadoop Ecosystem: Best Practices for Amazon EMR
 
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
Amazon Elastic MapReduce Deep Dive and Best Practices (BDT404) | AWS re:Inven...
 
Deep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduceDeep Dive: Amazon Elastic MapReduce
Deep Dive: Amazon Elastic MapReduce
 
How to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutesHow to run your Hadoop Cluster in 10 minutes
How to run your Hadoop Cluster in 10 minutes
 
Lighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache SparkLighting your Big Data Fire with Apache Spark
Lighting your Big Data Fire with Apache Spark
 
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
Tune your Big Data Platform to Work at Scale: Taking Hadoop to the Next Level...
 
EMR Training
EMR TrainingEMR Training
EMR Training
 
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best PracticesAWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
 
Apache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWSApache Spark and the Hadoop Ecosystem on AWS
Apache Spark and the Hadoop Ecosystem on AWS
 
Big data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel AvivBig data with amazon EMR - Pop-up Loft Tel Aviv
Big data with amazon EMR - Pop-up Loft Tel Aviv
 
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
Best Practices for Managing Hadoop Framework Based Workloads (on Amazon EMR) ...
 
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
데이터 마이그레이션 AWS와 같이하기 - 김일호 솔루션즈 아키텍트:: AWS Cloud Track 3 Gaming
 
Cost Optimization with Spot Instances
Cost Optimization with Spot InstancesCost Optimization with Spot Instances
Cost Optimization with Spot Instances
 
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석AWS Summit Seoul 2015 -  AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
AWS Summit Seoul 2015 - AWS 클라우드를 활용한 빅데이터 및 실시간 스트리밍 분석
 

Mehr von Amazon Web Services

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Amazon Web Services
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Amazon Web Services
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateAmazon Web Services
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSAmazon Web Services
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Amazon Web Services
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Amazon Web Services
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...Amazon Web Services
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsAmazon Web Services
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareAmazon Web Services
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSAmazon Web Services
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAmazon Web Services
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareAmazon Web Services
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWSAmazon Web Services
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckAmazon Web Services
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without serversAmazon Web Services
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...Amazon Web Services
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceAmazon Web Services
 

Mehr von Amazon Web Services (20)

Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
Come costruire servizi di Forecasting sfruttando algoritmi di ML e deep learn...
 
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
Big Data per le Startup: come creare applicazioni Big Data in modalità Server...
 
Esegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS FargateEsegui pod serverless con Amazon EKS e AWS Fargate
Esegui pod serverless con Amazon EKS e AWS Fargate
 
Costruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWSCostruire Applicazioni Moderne con AWS
Costruire Applicazioni Moderne con AWS
 
Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot Come spendere fino al 90% in meno con i container e le istanze spot
Come spendere fino al 90% in meno con i container e le istanze spot
 
Open banking as a service
Open banking as a serviceOpen banking as a service
Open banking as a service
 
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
Rendi unica l’offerta della tua startup sul mercato con i servizi Machine Lea...
 
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...OpsWorks Configuration Management: automatizza la gestione e i deployment del...
OpsWorks Configuration Management: automatizza la gestione e i deployment del...
 
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows WorkloadsMicrosoft Active Directory su AWS per supportare i tuoi Windows Workloads
Microsoft Active Directory su AWS per supportare i tuoi Windows Workloads
 
Computer Vision con AWS
Computer Vision con AWSComputer Vision con AWS
Computer Vision con AWS
 
Database Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatareDatabase Oracle e VMware Cloud on AWS i miti da sfatare
Database Oracle e VMware Cloud on AWS i miti da sfatare
 
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJSCrea la tua prima serverless ledger-based app con QLDB e NodeJS
Crea la tua prima serverless ledger-based app con QLDB e NodeJS
 
API moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e webAPI moderne real-time per applicazioni mobili e web
API moderne real-time per applicazioni mobili e web
 
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatareDatabase Oracle e VMware Cloud™ on AWS: i miti da sfatare
Database Oracle e VMware Cloud™ on AWS: i miti da sfatare
 
Tools for building your MVP on AWS
Tools for building your MVP on AWSTools for building your MVP on AWS
Tools for building your MVP on AWS
 
How to Build a Winning Pitch Deck
How to Build a Winning Pitch DeckHow to Build a Winning Pitch Deck
How to Build a Winning Pitch Deck
 
Building a web application without servers
Building a web application without serversBuilding a web application without servers
Building a web application without servers
 
Fundraising Essentials
Fundraising EssentialsFundraising Essentials
Fundraising Essentials
 
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
AWS_HK_StartupDay_Building Interactive websites while automating for efficien...
 
Introduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container ServiceIntroduzione a Amazon Elastic Container Service
Introduzione a Amazon Elastic Container Service
 

Kürzlich hochgeladen

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 

Kürzlich hochgeladen (20)

Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 

Amazon EMR Deep Dive & Best Practices

  • 22. Use multiple EMR instance groups. Example Amazon EMR cluster: Master node (r3.2xlarge); Slave group – Core (c3.2xlarge); Slave group – Task (m3.xlarge, EC2 Spot); Slave group – Task (m3.2xlarge, EC2 Spot). Core nodes run HDFS (DataNode); task nodes do not run HDFS. Core and task nodes each run YARN (NodeManager). The master node runs the HDFS NameNode and YARN ResourceManager, and serves as the cluster gateway.
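A minimal sketch of launching such a cluster with the AWS CLI's --instance-groups shorthand; the instance counts, bid price, and key name are illustrative placeholders, not values from the deck:

  aws emr create-cluster --name "multi-group-cluster" \
    --ami-version 3.8.0 \
    --instance-groups \
      InstanceGroupType=MASTER,InstanceCount=1,InstanceType=r3.2xlarge \
      InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.2xlarge \
      InstanceGroupType=TASK,InstanceCount=4,InstanceType=m3.xlarge,BidPrice=0.10 \
    --ec2-attributes KeyName=<<MYKEY>> --use-default-roles

Supplying BidPrice marks the task group as Spot; omit it to run the group on-demand.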
  • 23. EMR 5.0 - Applications
  • 24. Easy to monitor and debug. Monitoring: integrated with Amazon CloudWatch; monitor cluster, node, and I/O metrics, plus 20+ custom Hadoop metrics. Debugging: task-level debugging information is automatically pushed to a datastore.
  • 25. EMR logging to S3 makes logs easily available. Logs are stored in S3 and can be visualized with an ELK (Elasticsearch, Logstash, Kibana) stack.
  • 26. Resizable clusters: easy to add and remove compute capacity on your cluster. Match compute demands with cluster sizing.
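As a sketch, resizing is one CLI call; the cluster and instance group IDs below are placeholders:

  aws emr describe-cluster --cluster-id j-XXXXXXXXXXXXX   # look up the instance group ID
  aws emr modify-instance-groups \
    --instance-groups InstanceGroupId=ig-XXXXXXXXXXXXX,InstanceCount=10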
  • 27. Easy to use Spot Instances. Spot for task nodes: up to 90% off EC2 on-demand pricing; exceed SLA at lower cost. On-demand for core nodes: standard Amazon EC2 pricing for on-demand capacity; meet SLA at predictable cost.
  • 28. The Spot advantage • Say a 7-hour job needs 100 nodes at $1 per node-hour: 7 x 100 x $1 = $700 • Scale out to 200 nodes and the job finishes in about 4 hours: – 4 x 100 x $1 = $400 on-demand – 4 x 100 x $0.50 = $200 on Spot – total $600: save $100 and finish the job 3 hours sooner • Run on-demand capacity sized for the worst acceptable SLA – scale up beyond that with Spot instances
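A hedged sketch of adding that Spot capacity to a running cluster; the cluster ID, count, instance type, and bid price are illustrative:

  aws emr add-instance-groups --cluster-id j-XXXXXXXXXXXXX \
    --instance-groups InstanceCount=100,InstanceGroupType=TASK,InstanceType=m3.xlarge,BidPrice=0.50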
  • 30. Options to submit jobs – off cluster: Amazon EMR Step API – submit a Spark application to Amazon EMR; AWS Data Pipeline, Airflow, Luigi, or other schedulers on EC2 – create a pipeline to schedule job submission or create complex workflows; AWS Lambda – use AWS Lambda to submit applications to the EMR Step API or directly to Spark on your cluster.
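For instance, a Step API submission might look like the following sketch (valid on release 4.x and later); the cluster ID, class, and jar location are placeholders:

  aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=Spark,Name=MySparkApp,ActionOnFailure=CONTINUE,Args=[--class,com.example.MyApp,s3://mybucket/myapp.jar]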
  • 31. Options to submit jobs – on cluster: Web UIs (Hue SQL editor, Zeppelin notebooks, RStudio, and more); connect with ODBC/JDBC using HiveServer2 or the Spark Thriftserver (started with start-thriftserver.sh); use Spark actions in your Apache Oozie workflow to create DAGs of jobs; or use the native APIs and CLIs for each application.
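A minimal sketch of the JDBC path, run on the master node; the Thriftserver port can vary by release (10001 is the common EMR default for Spark, with HiveServer2 on 10000):

  sudo /usr/lib/spark/sbin/start-thriftserver.sh
  beeline -u jdbc:hive2://localhost:10001 -e 'show tables;'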
  • 33. EMRFS client-side encryption: EMRFS, enabled for Amazon S3 client-side encryption, uses the Amazon S3 encryption clients together with a key vendor (AWS KMS or your custom key vendor) to read and write client-side encrypted objects in Amazon S3.
  • 34. Use Identity and Access Management (IAM) roles with your Amazon EMR cluster • IAM roles give fine-grained control over delegating permissions to AWS services and access to AWS resources • EMR uses two IAM roles: – the EMR service role, for the Amazon EMR control plane – the EC2 instance profile, for the actual instances in the Amazon EMR cluster • Default IAM roles can be easily created and used from the AWS Console and AWS CLI
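As a sketch, the default roles are one command away:

  aws emr create-default-roles   # creates EMR_DefaultRole (service role) and EMR_EC2_DefaultRole (instance profile)
  aws emr create-cluster --use-default-roles ...   # then reference them at launch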
  • 35. EMR security groups: default and custom • A security group is a virtual firewall that controls access to the EC2 instances in your Amazon EMR cluster – There is a single default master and default slave security group across all of your clusters – The master security group has port 22 open for SSHing to your cluster • You can add additional security groups to the master and slave groups on a cluster to separate them from the default master and slave security groups, and to further limit ingress and egress policies
  • 36. Encryption • EMRFS encryption options – S3 server-side encryption – S3 client-side encryption (use AWS Key Management Service keys or custom keys) • CloudTrail integration – track Amazon EMR API calls for auditing • Launch your Amazon EMR clusters in a VPC – a logically isolated section of the cloud (Amazon Virtual Private Cloud) – Enhanced networking on certain instance types
  • 37. NASDAQ • Encrypts data in the cloud, but the keys need to stay on-premises • SafeNet Luna SA as the KMS • EMRFS – uses the S3 encryption client's envelope encryption and provides a hook – custom encryption materials providers
  • 39. NASDAQ • Write your custom encryption materials provider • --emrfs Encryption=ClientSide,ProviderType=Custom,CustomProviderLocation=s3://mybucket/myfolder/myprovider.jar,CustomProviderClass=providerclassname • EMR pulls and installs the JAR on every node • More info in the NASDAQ post on the AWS Big Data blog, including code samples
  • 40. Amazon EMR integration with Amazon Kinesis • Read data directly into Hive, Pig, Hadoop Streaming, and Cascading from Amazon Kinesis streams • No intermediate data persistence required • Simple way to introduce real-time sources into batch-oriented systems • Multi-application support and automatic checkpointing
  • 41. Use Hive with EMR to query DynamoDB data • Export data stored in DynamoDB to Amazon S3 • Import data in Amazon S3 to DynamoDB • Query live DynamoDB data using SQL-like statements (HiveQL) • Join data stored in DynamoDB and export it, or query against the joined data • Load DynamoDB data into HDFS and use it in your EMR job
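A hedged sketch of the Hive-side setup, run on the master node via the EMR DynamoDB storage handler; the table name and column mapping are hypothetical:

  hive -e "
  CREATE EXTERNAL TABLE ddb_orders (order_id string, total double)
  STORED BY 'org.apache.hadoop.hive.dynamodb.DynamoDBStorageHandler'
  TBLPROPERTIES ('dynamodb.table.name' = 'Orders',
                 'dynamodb.column.mapping' = 'order_id:OrderId,total:Total');
  SELECT order_id, total FROM ddb_orders LIMIT 10;"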
  • 42. Use AWS Data Pipeline and EMR to transform unstructured data into processed data and load it into Amazon Redshift; the pipeline is orchestrated and scheduled by AWS Data Pipeline.
  • 43. Install an iPython Notebook using a bootstrap action: aws emr create-cluster --name iPythonNotebookEMR --ami-version 3.2.3 --instance-type m3.xlarge --instance-count 3 --ec2-attributes KeyName=<<MYKEY>> --use-default-roles --bootstrap-actions Path=s3://elasticmapreduce.bootstrapactions/ipython-notebook/install-ipython-notebook,Name=Install_iPython_NB --termination-protected
  • 44. Install Hadoop applications with bootstrap actions from the GitHub repository
  • 45. Amazon EMR – Design Patterns
  • 46. EMR example #1: batch processing. GBs of logs pushed to S3 hourly; a daily EMR cluster uses Hive to process the data; input and output stored in S3; 250 Amazon EMR jobs per day, processing 30 TB of data. http://aws.amazon.com/solutions/case-studies/yelp/
  • 47. EMR example #2: long-running cluster. Data pushed to S3; a daily EMR cluster ETLs data into a database; a 24/7 EMR cluster running HBase holds the last 2 years of data; a front-end service uses the HBase cluster to power a dashboard with high concurrency.
  • 48. EMR example #3: interactive query. TBs of logs sent daily; logs stored in Amazon S3; Hive Metastore on Amazon EMR; interactive query using Presto on a multi-petabyte warehouse. http://nflx.it/1dO7Pnt
  • 49. Why Spark? Functional programming basics. Easy to parallelize (Spark, Python): messages = sc.textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split("\t")[2]) versus sequential processing (imperative loop): for (int i = 0; i < n; i++) { if (s[i].contains("ERROR")) { messages[i] = s[i].split("\t")[2]; } }
  • 50. RDDs (and now DataFrames) and fault tolerance • RDDs track the transformations used to build them (their lineage) to recompute lost data • E.g.: messages = sc.textFile(...).filter(lambda s: "ERROR" in s).map(lambda s: s.split("\t")[2]) has the lineage HadoopRDD (path = hdfs://…) -> FilteredRDD (func = contains(...)) -> MappedRDD (func = split(…))
  • 51. Why Presto? SQL on Unstructured and Unlimited Data • Dynamic Catalog • Rich ANSI SQL • Connectors as Plugins • High Concurrency • Batch (ETL)/Interactive Clusters
  • 52. EMR example #4: streaming data processing. TBs of logs sent daily; logs stored in Amazon Kinesis; consumed by the Amazon Kinesis Client Library, AWS Lambda, Amazon EMR, and Amazon EC2.
  • 54. File formats • Row oriented – text files – sequence files (Writable objects) – Avro data files (described by a schema) • Columnar formats – Optimized Row Columnar (ORC) – Parquet. A logical table can be laid out row oriented or column oriented.
  • 55. S3 file format: Parquet • Parquet file format: http://parquet.apache.org • Self-describing columnar file format • Supports nested structures (Dremel "record shredding" algorithm) • Emerging standard data format for Hadoop – supported by Presto, Spark, Drill, Hive, Impala, etc.
  • 56. Parquet vs ORC • Evaluated Parquet and ORC (competing open columnar formats) • ORC encrypted performance is currently a problem – about 15x slower vs. unencrypted (94% slower) – 8 CPUs on 2 nodes: ~900 MB/sec unencrypted vs. ~60 MB/sec encrypted • Encrypted Parquet is ~27% slower vs. unencrypted – Parquet: ~100 MB/sec from S3 per CPU core (encrypted)
  • 57. Factors to consider • Processing and query tools – Hive, Impala, and Presto • Evolution of schema – Avro for evolving schemas, Parquet for storage • File format "splittability" – avoid large JSON/XML files; they are not splittable, so treat each file as a single record if you must use them
  • 58. File sizes • Avoid small files – anything smaller than 100 MB • Each mapper is a single JVM – CPU time is required to spawn JVMs/mappers • Prefer fewer files, closely matching the block size – fewer calls to S3 – fewer network/HDFS requests
  • 59. Dealing with small files • Reduce HDFS block size, e.g. 1 MB (default is 128 MB): --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop --args "-m,dfs.block.size=1048576" • Better: use S3DistCp to combine smaller files together – S3DistCp takes a pattern and target path to combine smaller input files into larger ones – supply a target size and compression codec; see the sketch below.
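A hedged sketch of an S3DistCp step on a release 4.x+ cluster; the bucket, paths, group-by pattern, and cluster ID are illustrative, and --targetSize is in MB:

  aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
    --steps Type=CUSTOM_JAR,Name=CombineSmallFiles,Jar=command-runner.jar,Args=[s3-dist-cp,--src,s3://mybucket/logs/,--dest,hdfs:///combined/,--groupBy,.*(events).*,--targetSize,128]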
  • 60. Compression • Always compress data files on Amazon S3 – reduces network traffic between Amazon S3 and Amazon EMR – speeds up your job • Compress mapper and reducer output (see the sketch below) • Amazon EMR compresses inter-node traffic with LZO on Hadoop 1 and Snappy on Hadoop 2
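One hedged way to turn on intermediate (map output) compression, in the same configure-hadoop bootstrap action style shown above; the codec choice is illustrative:

  --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
    --args "-m,mapreduce.map.output.compress=true,-m,mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec"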
  • 61. Choosing the right compression • Time sensitive? Faster compression is a better choice • Large amount of data? Use a space-efficient compression • Combined workload? Use gzip

  Algorithm        Splittable?   Compression ratio   Compress + decompress speed
  Gzip (DEFLATE)   No            High                Medium
  bzip2            Yes           Very high           Slow
  LZO              Yes           Low                 Fast
  Snappy           No            Low                 Very fast
  • 62. Cost saving tips for Amazon EMR • Use S3 as your persistent data store – query it using Presto, Hive, Spark, etc. • Only pay for compute when you need it • Use Amazon EC2 Spot instances to save >80% • Use Amazon EC2 Reserved instances for steady workloads • Use CloudWatch alerts to notify you if a cluster is underutilized, then shut it down – e.g. 0 mappers running for >N hours (see the sketch below)
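A minimal sketch of such an alert using EMR's IsIdle CloudWatch metric; the cluster ID, SNS topic ARN, and thresholds are illustrative placeholders:

  aws cloudwatch put-metric-alarm \
    --alarm-name emr-cluster-idle \
    --namespace AWS/ElasticMapReduce --metric-name IsIdle \
    --dimensions Name=JobFlowId,Value=j-XXXXXXXXXXXXX \
    --statistic Average --period 3600 --evaluation-periods 2 \
    --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:ops-alerts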
  • 63. Holy Grail question: what if I have small file issues?
  • 66. Persistent clusters • Very similar to traditional Hadoop deployments • The cluster stays around after the job is done • Data persistence models: – Amazon S3 – Amazon S3 copied to HDFS – HDFS, with Amazon S3 as backup
  • 67. Persistent clusters • Always keep data safe on Amazon S3, even if you're using HDFS for primary storage • Get in the habit of shutting down your cluster and starting a new one, once a week or month • Design your data processing workflow to account for failure • You can use workflow management tools such as AWS Data Pipeline
  • 68. Benefits of persistent clusters • Ability to share data between multiple jobs (via HDFS on the long-running cluster, rather than through Amazon S3 as with transient clusters) • Always on for analyst access
  • 69. Benefit of persistent clusters • Cost effective for repetitive jobs
  • 70. When to use persistent clusters? If (data load time + processing time) x number of jobs per day > 24 hours, use persistent clusters; else use transient clusters.
  • 71. When to use persistent clusters? e.g. (20 min data load + 1 hour processing time) x 20 jobs = ~26 hours. This is > 24 hours, therefore use persistent clusters.
  • 72. Transient clusters • The cluster lives only for the duration of the job • Shut down the cluster when the job is done • Data persisted on Amazon S3 • Input and output data on Amazon S3
  • 73. Benefits of transient clusters 1. Control your cost 2. Minimal maintenance – the cluster goes away when the job is done 3. Best flexibility of tools 4. Practice cloud architecture – pay for what you use – data processing as a workflow
  • 74. When to use transient clusters? If (data load time + processing time) x number of jobs per day < 24 hours, use transient clusters; else use persistent clusters.
  • 75. When to use transient clusters? e.g. (20 min data load + 1 hour processing time) x 10 jobs = ~13 hours. This is < 24 hours, so use transient clusters. A sketch of the arithmetic follows.
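A minimal sketch of the decision rule as shell arithmetic, using the persistent-cluster example's numbers as illustrative inputs:

  LOAD_MIN=20; PROC_MIN=60; JOBS=20                    # minutes per job, jobs per day
  if (( (LOAD_MIN + PROC_MIN) * JOBS > 1440 )); then   # 24 hours = 1440 minutes
    echo "use a persistent cluster"
  else
    echo "use transient clusters"
  fi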
  • 76. Reducing costs with Spot instances: mix Spot and on-demand instances to reduce cost and accelerate computation while protecting against interruption. Scenario #1 – cost without Spot: job flow duration 14 hours; 4 instances x 14 hrs x $0.50 = $28. Scenario #2 – cost with Spot: add 5 Spot instances and the job flow duration drops to 7 hours; 4 instances x 7 hrs x $0.50 = $14, plus 5 instances x 7 hrs x $0.25 = $8.75; total = $22.75. Time savings: 50%; cost savings: ~20%. Other EMR + Spot use cases: run the entire cluster on Spot for the biggest cost savings; reduce the cost of application testing.
  • 77. Amazon EMR nodes and sizes • Use m1 and c1 families for functional testing • Use m3 and c3 xlarge and larger nodes for production workloads • Use cc2/c3 for memory- and CPU-intensive jobs • Use hs1, hi1, and i2 instances for HDFS workloads • Prefer a smaller cluster of larger nodes
  • 78. Holy Grail question: how many nodes do I need?
  • 79. Cluster Sizing Calculation 1. Estimate the number of mappers your job requires.
  • 80. Cluster Sizing Calculation 2. Pick an EC2 instance and note down the number of mappers it can run in parallel e.g. m1.xlarge = 8 mappers in parallel
  • 81. Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2.
  • 82. Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files.
  • 83. Cluster sizing calculation: Estimated number of nodes = (total mappers * time to process sample files) / (per-instance mapper capacity * desired processing time)
  • 84. Example: Cluster Sizing Calculation 1. Estimate the number of mappers your job requires 150 2. Pick an instance and note down the number of mappers it can run in parallel m1.xlarge with 8 mapper capacity per instance
  • 85. Example: Cluster Sizing Calculation 3. We need to pick some sample data files to run a test workload. The number of sample files should be the same number from step #2. 8 files selected for our sample test
  • 86. Example: Cluster Sizing Calculation 4. Run an Amazon EMR cluster with a single core node and process your sample files from #3. Note down the amount of time taken to process your sample files. 3 min to process 8 files
  • 87. Example: cluster sizing calculation. Estimated number of nodes = (total mappers for your job * time to process sample files) / (per-instance mapper capacity * desired processing time) = (150 * 3 min) / (8 * 5 min) ≈ 11 m1.xlarge nodes. The sketch below spells out the arithmetic.
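As a minimal sketch, the same estimate as shell arithmetic, with all values taken from the example above:

  TOTAL_MAPPERS=150; SAMPLE_TIME_MIN=3       # from steps 1 and 4
  MAPPERS_PER_NODE=8; DESIRED_TIME_MIN=5     # from step 2 and the target processing time
  echo $(( (TOTAL_MAPPERS * SAMPLE_TIME_MIN) / (MAPPERS_PER_NODE * DESIRED_TIME_MIN) ))
  # prints 11 (450 / 40, integer division), matching the 11 m1.xlarge estimate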

Editor's notes

  1. Data lacks structure; analyzing streams of information; processing large datasets; warehousing large datasets; flexibility for undefined ad hoc analysis; speed of queries on large datasets
  2. Four main reasons to use Amazon EMR
  3. Amazon EMR is more than just MapReduce. Bootstrap actions available on GitHub
  4. Create a managed Hadoop cluster in just a few clicks and use easy monitoring and debugging tools
  5. In the next few slides, we'll talk about data persistence models with EMR. The first pattern is Amazon S3 as HDFS. With this data persistence model, data gets stored on Amazon S3; HDFS does not play any role in storing data. As a matter of fact, HDFS is only there for temporary storage. Another common thing I hear is that storing data on Amazon S3 instead of HDFS slows a job down a lot because data has to get copied to HDFS/disk first before processing starts. That's incorrect. If you tell Hadoop that your data is on Amazon S3, Hadoop reads directly from Amazon S3 and streams data to mappers without touching the disk. To be completely correct, data does touch HDFS when it has to shuffle from mappers to reducers, but as I mentioned, HDFS acts as the temp space and nothing more. EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like Amazon S3 server-side encryption, read-after-write consistency, and list consistency.
  6. Decompression is less intensive
  7. Decompression is less intensive
  8. Decompression is less intensive
  9. EMR example #3: EMR for ETL and query engine for investigations which require all raw data
  10. Show recomputing
  11. Show recomputing
  12. EMR example #4: Streaming processing with Spark Streaming or Flink
  13. Basically, your use cases drive your formats. Keep your Hive metastore off-cluster, so you can experiment with this easily. Golden rule of analytic workloads – test, test, and test some more.
  14. Avoid small files at all costs. Small files can cause a lot of issues – I call them the termites of big data. The reason small files are trouble is that each file, as discussed previously, eventually becomes a mapper, and each mapper is a single Java JVM. Small files cause a Java JVM to be spawned, but as soon as the mapper is up and running, the content of the file is processed in a short amount of time and the mapper goes away. That's a waste of CPU and memory. Ideally you want your mapper to do as much processing as possible before shutting down.
  15. Decompression is less intensive
  16. Give guidance
  17. Transient clusters are clusters that are only around for the duration of the job. Once the job is done, the cluster shuts down. This model of running Hadoop clusters is very different from traditional Hadoop deployments. With a traditional Hadoop deployment, the cluster stays up and running regardless of whether there are any jobs to process, mostly due to the fact that the Hadoop cluster also hosts HDFS storage. With HDFS storage, you don't have the luxury of shutting down the cluster. With transient EMR clusters, your data persists on S3, meaning that you don't use HDFS to store your data. Once data is safe and secure on S3, you have the ability to shut down the cluster after your job is done, knowing that you won't lose data.
  18. I hear this a lot: "EMR is only good for short-lived/transient clusters; do I need to run my own Hadoop on EC2 if I need long-running clusters?" That's not true at all. EMR can be designed for longer-running clusters. You would still want to keep your data on S3 for durability, or you can copy data from S3 to HDFS first, or you can use HDFS as your primary storage and use S3 as the data store backup.
  19. Notice that we don't remove S3 from our design. It's important to keep your data safe on S3 just in case you experience cluster failures. In fact, always plan for failure. Just because you're running a long-running cluster doesn't mean you won't see failures, so architecting your data processing workflow to be able to deal with cluster failures is super important. One way to do that is to use a workflow management tool such as AWS Data Pipeline.
  20. There are two important benefits of running long-running clusters. One is the ability to share data between multiple jobs. With transient clusters you use S3 to share data between two jobs. If you have a high degree of dependency between two or more jobs, where each has to use the output of the previous job, running a long-running cluster is a better choice.
  21. The other benefit is cost effectiveness: if you're running multiple jobs in a day, it can be more cost effective to keep the cluster around. For example, if you're running two jobs every hour, instead of running two clusters per hour, you can keep one cluster around for a longer period of time.
  22. So it's probably obvious to you by now that long-running clusters make sense when the total time spent processing your data for a given day is more than 24 hours.
  23. Example of when a long running cluster makes more sense.
  24. Transient clusters are clusters that are only around for the duration of the job. Once the job is done, the cluster shuts down. This model of running Hadoop clusters is very different from traditional Hadoop deployments. With a traditional Hadoop deployment, the cluster stays up and running regardless of whether there are any jobs to process, mostly due to the fact that the Hadoop cluster also hosts HDFS storage. With HDFS storage, you don't have the luxury of shutting down the cluster. With transient EMR clusters, your data persists on S3, meaning that you don't use HDFS to store your data. Once data is safe and secure on S3, you have the ability to shut down the cluster after your job is done, knowing that you won't lose data.
  25. There are many reasons you want to use EMR transient clusters. 1) Cost: shutting down resources you don't need is the best path towards optimizing your workload for cost efficiency. Don't pay for what you're not using; at AWS that's all we talk about. 2) No maintenance: obviously, if you're shutting down the cluster, then you're only maintaining the cluster for the duration of your job. This helps tremendously in reducing the headache of maintaining long-running clusters. 3) Practice cloud architecture best practices: it'll also make you a better cloud practitioner. Again, you're getting in the habit of paying for what you're using, which is great. You'll also get into the habit of thinking of your data processing as a workflow where resources come and go as needed.
  26. When does it make sense to use transient clusters? It's simple. All you need to do is figure out whether (data load time + processing time) multiplied by the number of jobs you run per day is less than 24 hours. Here's the math to make it clear.
  27. Lets run through a simple example
  28. Do not use smaller nodes for production workloads unless you're 100% sure you know what you're doing. The majority of jobs I've seen require more CPU and memory than the smaller instances have to offer, and most of the time jobs fail if the cluster is not fine-tuned. Instead of spending time fine-tuning small nodes, get a larger node and run your workload with peace of mind. Anything m1.xlarge and larger is a good candidate: m1.xlarge, c1.xlarge, m2.4xlarge, and all cluster compute instances are good choices.
  29. This is my favorite question: given this much data, how many nodes do I need?