This document discusses building scalable big data solutions. It begins with an introduction from Bob Rogers, Chief Data Scientist for Intel. The next section features a presentation from Durga Nemani of AOL on their hybrid approach to big data. Key aspects of AOL's architecture include separating compute and storage across Amazon EMR and Amazon S3, using transient clusters of varying sizes, and optimizing for cost by using Spot instances. The presentation provides data and insights such as processing 150 TB of compressed data per day across 350 EMR clusters. It breaks down AWS costs, with storage being the largest recurring fee. Best practices are suggested around resource tagging, infrastructure as code, security, and scaling. The presentation concludes by discussing next steps.
19. The Three Vs
• Volume
• Multiple Terabytes per day
• Variety
• Delimited, Avro, JSON
• Velocity
• Hourly, Batch
20. Workload Management
• “One size fits all” model does not work.
• Specific infrastructure tuned to needs and requirements
• A variety of EMR clusters, matched to the needs of each dataset
[Figure labels: Workloads with significant diversity of needs; Resources with lowest common denominator; Resources for workloads with significant diversity of needs]
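The idea of matching cluster shape to each workload, rather than one size for all, can be sketched as a small lookup. This is only an illustration: the tier names, volume ceilings, instance types, and node counts below are assumptions made up for the example and do not come from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterProfile:
    """Hypothetical cluster shape; names and sizes are illustrative only."""
    name: str
    instance_type: str
    core_nodes: int


# Illustrative tiers: (largest hourly volume in GB the tier should handle, profile).
# None of these numbers come from the presentation.
PROFILES = [
    (50, ClusterProfile("small", "m4.xlarge", 2)),
    (500, ClusterProfile("medium", "m4.2xlarge", 10)),
    (float("inf"), ClusterProfile("large", "r4.4xlarge", 40)),
]


def pick_profile(hourly_volume_gb: float) -> ClusterProfile:
    """Return the first tier whose volume ceiling covers the dataset."""
    if hourly_volume_gb < 0:
        raise ValueError("volume must be non-negative")
    for ceiling_gb, profile in PROFILES:
        if hourly_volume_gb <= ceiling_gb:
            return profile
    raise AssertionError("unreachable: the last tier has no ceiling")


if __name__ == "__main__":
    for volume in (5, 120, 3000):
        print(volume, "GB/hour ->", pick_profile(volume).name)
```

In practice the tiers would be tuned per dataset from observed run times and cost, which is the point of the slide: each workload gets its own infrastructure.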
26. Key Features
• Separation of Compute and Storage: Amazon S3 and Amazon EMR
• Transient clusters: No permanent cluster. Different size clusters for different datasets
• Separation of duties: Independent jobs for processing, extracting, loading, and monitoring
• Parallelism: Process the smallest chunk of data possible in parallel to reduce dependencies
• Scalability: Hundreds of Amazon EMR clusters in multiple regions and Availability Zones
• Cost optimized: All Spot instances. Launch in the Availability Zone with the lowest Spot price
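The cost-optimization item above (launch in the Availability Zone with the lowest Spot price) can be sketched as follows. This is a minimal sketch, not AOL's implementation: it assumes price records shaped like the entries of an EC2 Spot price history response (an `AvailabilityZone` key and a string `SpotPrice` key), averages them per zone, and uses made-up sample prices rather than calling AWS.

```python
from collections import defaultdict
from statistics import mean


def cheapest_zone(price_records):
    """Return (zone, average_price) for the zone with the lowest mean Spot price.

    Each record is a dict with 'AvailabilityZone' and 'SpotPrice' keys; the
    price is a string, as in EC2 Spot price history entries.
    """
    by_zone = defaultdict(list)
    for record in price_records:
        by_zone[record["AvailabilityZone"]].append(float(record["SpotPrice"]))
    if not by_zone:
        raise ValueError("no price records supplied")
    averages = {zone: mean(prices) for zone, prices in by_zone.items()}
    zone = min(averages, key=averages.get)
    return zone, averages[zone]


# Made-up sample data for illustration; real prices vary by instance type and time.
SAMPLE_PRICES = [
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0820"},
    {"AvailabilityZone": "us-east-1a", "SpotPrice": "0.0900"},
    {"AvailabilityZone": "us-east-1b", "SpotPrice": "0.0610"},
    {"AvailabilityZone": "us-east-1b", "SpotPrice": "0.0650"},
    {"AvailabilityZone": "us-east-1c", "SpotPrice": "0.0700"},
]

if __name__ == "__main__":
    zone, price = cheapest_zone(SAMPLE_PRICES)
    print(f"launch in {zone} (average ${price:.4f}/hour)")
```

Averaging over recent history is one possible choice; using only the latest price per zone would be an equally reasonable variant.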
32. Best practices
• Tag all resources
• Infrastructure as Code
• Command Line Interface
• JSON as configuration files
• AWS Identity and Access Management (IAM) roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management
• S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• EC2 Spot instances
• SNS notifications for failures
• Loosely coupled apps
• Scale horizontally
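Several of the practices above (JSON as configuration files, tagging all resources, use of an application ID, Spot instances, transient clusters) can be combined in one small sketch. It turns a JSON configuration into a request dictionary whose field names follow the EMR `RunJobFlow` request shape as exposed by boto3; no AWS call is made. The configuration keys, cluster name, bucket name, and tag values are hypothetical, chosen only for the example.

```python
import json

# Hypothetical configuration file contents; every value here is illustrative.
CONFIG_JSON = """
{
  "name": "hourly-clickstream",
  "release_label": "emr-5.36.0",
  "instance_type": "m4.xlarge",
  "core_nodes": 4,
  "log_uri": "s3://example-logs-bucket/emr/",
  "tags": {"application_id": "app-0001", "environment": "dev"}
}
"""


def build_cluster_request(config):
    """Build a RunJobFlow-style request dict from a parsed JSON config."""
    if "application_id" not in config["tags"]:
        raise ValueError("every cluster must carry an application_id tag")
    tags = [{"Key": k, "Value": v} for k, v in sorted(config["tags"].items())]

    def group(role, count):
        # All instance groups use the Spot market, per the cost practice above.
        return {
            "InstanceRole": role,
            "Market": "SPOT",
            "InstanceType": config["instance_type"],
            "InstanceCount": count,
        }

    return {
        "Name": config["name"],
        "ReleaseLabel": config["release_label"],
        "LogUri": config["log_uri"],
        "Instances": {
            "InstanceGroups": [group("MASTER", 1), group("CORE", config["core_nodes"])],
            # Transient cluster: terminate once all steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Tags": tags,
    }


if __name__ == "__main__":
    request = build_cluster_request(json.loads(CONFIG_JSON))
    print(json.dumps(request, indent=2))
```

Keeping the cluster definition in JSON means it can be versioned alongside the code and reviewed like any other change.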
34. Next steps
Database on cloud
• Database on AWS
• Options: Amazon RDS, Amazon Redshift, or others using Amazon EC2
Event-driven design
• Kick off code based on events
• Run downstream processes as soon as upstream completes
• Options: AWS Lambda, Amazon SQS, Amazon SWF, or AWS Data Pipeline
Data analytics
• Implement massively parallel processing technologies
• Options: Spark, Impala, or Presto
DevOps on cloud
• Rapidly and automatically deploy new code
• Continuous Integration/Continuous Deployment
• Options: AWS CodeDeploy, AWS CodeCommit, or AWS CodePipeline
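The event-driven design item above (run downstream processes as soon as upstream completes) can be illustrated with a tiny in-process event bus. This is a sketch of the pattern only: the `EventBus` class, the event names, and the three stages are hypothetical, and in a real deployment a service such as AWS Lambda or Amazon SQS would play the role of the bus.

```python
from collections import defaultdict


class EventBus:
    """Minimal publish/subscribe dispatcher, standing in for a managed service."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, event_name, handler):
        self._handlers[event_name].append(handler)

    def publish(self, event_name, payload):
        # Iterate over a copy so handlers may subscribe during dispatch.
        for handler in list(self._handlers[event_name]):
            handler(payload)


def run_pipeline(dataset):
    """Wire three stages so each starts when the previous one completes."""
    bus = EventBus()
    completed = []

    def process(payload):
        completed.append("process")
        bus.publish("processed", payload)  # signals the load stage

    def load(payload):
        completed.append("load")
        bus.publish("loaded", payload)  # signals the notify stage

    def notify(payload):
        completed.append("notify")

    bus.subscribe("arrived", process)
    bus.subscribe("processed", load)
    bus.subscribe("loaded", notify)

    # The only explicit trigger: new data arriving.
    bus.publish("arrived", {"dataset": dataset})
    return completed


if __name__ == "__main__":
    print(run_pipeline("clickstream-hourly"))
```

No stage polls or waits on a schedule; each one reacts to the completion event of the stage before it, which keeps the jobs loosely coupled.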