This document summarizes Scott Miao's presentation on Analytic Engine (AE), a common big data computation service on AWS. AE provides a RESTful API for users to create AWS EMR clusters, submit jobs to clusters, and delete clusters. It handles job scheduling and delivery to clusters to optimize usage of AWS resources. Using AE and AWS services like EMR and S3 allows Trend Micro to scale their data and computation needs elastically with reduced operational overhead compared to managing infrastructure on their own.
2. Who am I
• Scott Miao
• RD, SPN, Trend Micro
• Worked on the Hadoop ecosystem for about 6 years
• Worked on AWS for big data for about 3 years
• Expertise in HDFS/MR/HBase
• Speaker at several Hadoop-related conferences
• @takeshi.miao
3. Agenda
• What problems did we suffer?
• Why AWS?
• Analytic Engine
• The benefits AWS brings to AE
• AE roadmap on AWS
7. Return on Investment
• On traditional infrastructure, we put a lot of effort into service operations
• On the Cloud, we can leverage its elasticity to automate our services
• More focus on innovation!
[Chart: Revenue vs. Cost over Time]
8. AWS is a leader in the IaaS market
Source: Gartner (May 2015), https://www.gartner.com/doc/reprints?id=1-2G2O5FC&ct=150519&st=sb
11. High Level Architecture
[Architecture diagram: RDs, Researchers, and Services call two common cloud services over RESTful APIs — the Analytic Engine (AE), the common computation service (createCluster / submitJob / deleteCluster), which drives AWS EMR clusters; and Cloud Storage (CS), the common storage service, which EMR reads input from and writes output to.]
12. Common Cloud Services in Trend
Analytic Engine
• Computation service for Trenders
• Based on AWS EMR
• Simple RESTful API calls
• Computing on demand
• Short-lived or long-running clusters
• No operation effort
• Pay by computing resources used
Cloud Storage
• Storage service for Trenders
• Based on AWS S3
• Simple RESTful API calls
• Share data with everyone in one place
• Metadata search for files
• No operation effort
• Pay by storage size used
13. Why do we use AE instead of EMR directly?
• Abstraction
• Avoid lock-in
• Hide implementation details behind the scenes
• AWS EMR was not designed for long-running jobs
• >= AMI-3.1.1 – 256 ACTIVE or PENDING jobs (STEPs)
• < AMI-3.1.1 – 256 jobs in total
• Better integration with other common services
• Keep our hands off AWS-native code
• Centralized Authentication & Authorization
• No AWS/SSH keys for users
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/AddingStepstoaJobFlow.html
14. Common use cases for AE
• User creates a cluster
• User can create multiple clusters
• User submits a job to a target cluster
• AE delivers the job to a secondary cluster for the user
• User wants to know their cost
15. Use case #1 – User creates a cluster
1. User invokes createCluster
2. AE launches an EMR cluster for the user, with tags attached
   tags: 'sched:routine', 'env:prod'; instances: m3.xlarge * 10
User: "It is a RESTful API, so I can use any client I am familiar with!"
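The createCluster call above might be built like the following sketch. The field names and payload shape here are assumptions for illustration only; the real AE API schema is not shown in the slides.

```python
import json

def build_create_cluster_request(tags, instance_type, instance_count):
    """Build a hypothetical createCluster request body for the AE RESTful API.

    The field names are illustrative assumptions, not the real AE schema.
    """
    return {
        "tags": tags,                     # e.g. {"sched": "routine", "env": "prod"}
        "instanceType": instance_type,    # EMR instance type, e.g. "m3.xlarge"
        "instanceCount": instance_count,  # number of instances in the cluster
    }

# The cluster from Use case #1: a routine production cluster of 10 m3.xlarge nodes.
body = build_create_cluster_request(
    tags={"sched": "routine", "env": "prod"},
    instance_type="m3.xlarge",
    instance_count=10,
)
print(json.dumps(body))
```

Because AE is a plain RESTful API, this body could be POSTed with any HTTP client the user is familiar with.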
16. Use case #2 – User can create multiple clusters as needed
1. User invokes createCluster
2. AE launches another new EMR cluster for the user, with tags attached
   new cluster tags: 'sched:adhoc', 'env:prod'; instances: c3.4xlarge * 20
   existing cluster tags: 'sched:routine', 'env:prod'; instances: m3.xlarge * 10
3. The user can create as many clusters as they need
17. Use case #3 – User submits a job to a target cluster
1. User invokes submitJob with clusterCriteria: [['sched:adhoc', 'env:prod'], ['env:prod']]
2. AE matches the job and delivers it to the target cluster (tags: 'sched:adhoc', 'env:prod')
3. AE submits the job
4. EMR pulls input data from CS
5. The job runs on the target cluster
6. EMR writes the result to CS
7. AE sends a message to an SNS topic, if the user specified one
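A submitJob request with the clusterCriteria above might look like this sketch. Everything except the clusterCriteria value (which is taken from the slide) is an assumption — the job name, the input/output URIs, and the field names are hypothetical.

```python
# Sketch of a submitJob request body for the AE RESTful API.
# clusterCriteria lists tag groups in priority order: AE tries the
# first group, then falls back to the broader second group.
job_request = {
    "jobName": "daily-report",        # hypothetical job name
    "clusterCriteria": [
        ["sched:adhoc", "env:prod"],  # preferred: the ad-hoc production cluster
        ["env:prod"],                 # fallback: any production cluster
    ],
    "input": "cs://bucket/input/",    # hypothetical Cloud Storage input URI
    "output": "cs://bucket/output/",  # hypothetical Cloud Storage output URI
    "snsTopic": None,                 # optional job-end notification topic
}
```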
18. Use case #4 – AE delivers the job to a secondary cluster if the target cluster is down
1. User invokes submitJob with clusterCriteria: [['sched:adhoc', 'env:prod'], ['env:prod']]
2. AE matches the job; the preferred 'sched:adhoc' cluster is down, so AE delivers the job to the secondary cluster (tags: 'sched:routine', 'env:prod'), which still satisfies the broader ['env:prod'] criteria
3. AE submits the job
4. EMR pulls input data from CS
5. The job runs on the secondary cluster
6. EMR writes the result to CS
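The fallback behavior can be sketched as: try each criteria group in priority order, and pick the first live cluster whose tags contain every tag in the group. This mirrors Genie-style tag matching; the data shapes and status values below are assumptions, not AE's actual internals.

```python
def match_cluster(cluster_criteria, clusters):
    """Return the first live cluster matching the earliest criteria group.

    cluster_criteria: list of tag groups, highest priority first.
    clusters: list of dicts with a "tags" set and a "status" string.
    """
    for group in cluster_criteria:  # priority order: first matching group wins
        for cluster in clusters:
            if cluster["status"] == "UP" and set(group) <= cluster["tags"]:
                return cluster
    return None  # no live cluster matches any group

clusters = [
    {"name": "adhoc",   "tags": {"sched:adhoc", "env:prod"},   "status": "DOWN"},
    {"name": "routine", "tags": {"sched:routine", "env:prod"}, "status": "UP"},
]
criteria = [["sched:adhoc", "env:prod"], ["env:prod"]]
chosen = match_cluster(criteria, clusters)
# The ad-hoc cluster is down, so the broader ['env:prod'] group
# matches the routine cluster instead.
```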
19. Use case #5 – User wants to know their current cost
Billing & Cost management -> Cost Explorer -> Launch Cost Explorer
20. Middle Level Architecture
[Architecture diagram, Oregon (us-west-2): services in the IDC reach AE over the Internet via HTTPS (HTTP Basic auth/VPN). AE API servers run in an Auto Scaling group behind an internal ELB across multiple AZs (AZa/AZb/AZc), alongside RDS, Eureka, and Auto Scaling groups of workers. VPC peering connects to the EMR clusters; cross-account S3 buckets hold input/output; Cloud Storage is reached over HTTPS (HTTP Basic); Amazon SNS delivers notifications.]
22. Pros & Cons
Aspect | IDC | AWS
Data Capacity | Limited by physical rack space | No limitation within a reasonable amount
Computation Capacity | Limited by physical rack space | No limitation within a reasonable amount
DevOps | Hard, due to physical machine/VM farm | Easy, because code is everything (Continuous Deployment)
Scalability | Hard, due to physical machine/VM farm | Easy, relying on ELB and Auto Scaling groups from AWS
23. Pros & Cons (cont.)
Aspect | IDC | AWS
Disaster Recovery | Hard, due to physical machine/VM farm | Easy, because code is everything
Data Location | Limited to the IDC location | Varied and easy, thanks to AWS's multiple regions
Cost | Implied in Total Cost of Ownership | Acceptable cost with a Cloud-optimized design
29. Provision the AE SaaS by CI/CD
[Architecture diagram, Oregon (us-west-2): step 4 — CI/CD provisions the AE API servers, RDS, Eureka, and worker Auto Scaling groups behind an internal ELB across AZa/AZb/AZc, with VPC peering to EMR and cross-account S3 buckets for input/output.]
30. [Architecture diagram, Oregon (us-west-2): the same multi-AZ AE deployment, plus IDC services, Cloud Storage, and Amazon SNS]
5. Users can access via VPN; firewall open for Trend
6. Input from CS or S3
7. Computation in an AWS EMR cluster
8. Output to CS or S3
9. Job-end message to AWS SNS (optional)
31. What is Netflix Genie?
• A practice from Netflix
• A Hadoop client for submitting different kinds of jobs
• Flexible data model design to accommodate different kinds of clusters
• Flexible job/cluster matching design (based on tags)
• Cloud characteristics built into the design
• e.g. auto-scaling, load balancing, etc.
• Its goal is plain & simple
• We use it as an internal component
https://github.com/Netflix/genie/wiki
32. What is Netflix Eureka?
• A RESTful service
• Built by Netflix
• A critical component that lets Genie do load balancing and failover
[Diagram: API clients discover Genie instances through Eureka]