(BDT210) Building Scalable Big Data Solutions: Intel & AOL

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Bob Rogers, PhD, Chief Data Scientist for Big Data Solutions, Intel
Durga Nemani, System Architect, AOL Inc.
October 2015

Building Scalable Big Data Solutions
October 2015
Bob Rogers, PhD
Chief Data Scientist for Big Data Solutions
Intel

@scientistBob
What does Big Data have to do with Intel?
Trusted Analytics Platform

Intel contributions to Apache Hadoop
Encryption
Intel® AES-NI

Use case:
Assemble an accurate patient problem list
Why?
• To improve patient outcomes
KPI
• False negatives in problem list

What does a patient look like to a data scientist?

My first enterprise data hub

Poll: What percent of the key clinical data do you think is missing from the problem list?
• 0-25%
• 25-50%
• 50-75%
• 75-100%

Poll answer: more than 63% of the key clinical data is missing from the problem list.

Real patient example
• Coded Data
• Free Text
• Scanned Documents
• Other Data Silos

Missing information

What did we learn?
• Start with what you know
• Leverage existing technologies
• Use simple tools
• Measure your results

Powerful Big Data analytics reveal the truth about your…
…customers
…products
…ecosystem
…opportunities

Thank you
bob.rogers@intel.com
@scientistBob

Building Scalable Big Data Solutions
Durga Nemani – AOL Inc.

The Three Vs
• Volume: Multiple terabytes per day
• Variety: Delimited, Avro, JSON
• Velocity: Hourly, batch

Workload Management
• A “one size fits all” model does not work.
• Specific infrastructure is tuned to needs and requirements.
• A variety of EMR clusters, chosen according to each dataset's needs
[Figure: workloads with significant diversity of needs; resources with lowest common denominator versus resources for workloads with significant diversity of needs]

AWS Services: EC2, EMR, S3
Open Source Technologies: Apache Hive, Apache Pig, Apache Hadoop
Open Source Data Formats: JSON, Avro, Parquet

Separation of Compute and Storage

SEE, SPOT, SQUEEZE
• Just enough spot instances to finish the job in 59 minutes.
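
The 59-minute target presumably reflects the hourly EC2 billing in effect in 2015: a cluster that finishes inside the hour pays for one instance-hour per node. A minimal sketch of the sizing arithmetic follows. It assumes the work parallelizes perfectly and that the total work in instance-minutes is known; the function name and the sample workload are hypothetical and do not come from the talk.

```python
import math

def instances_needed(total_instance_minutes: float, deadline_minutes: int = 59) -> int:
    """Smallest instance count that finishes the work within the deadline.

    Assumes perfectly parallel work (an idealization) and ignores
    cluster startup time.
    """
    if total_instance_minutes <= 0:
        return 0
    return math.ceil(total_instance_minutes / deadline_minutes)

# Hypothetical job: 1,000 instance-minutes of work.
print(instances_needed(1000))  # 1000 / 59 is about 16.9, so 17 instances
```

Real jobs rarely scale linearly, so a figure like this is better treated as a starting point to refine against measured run times.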

Key Features
• Separation of Compute and Storage: Amazon S3 and Amazon EMR
• Transient Clusters: No permanent cluster. Different size clusters for
different datasets
• Separation of duties: Independent jobs for processing,
extracting, loading, and monitoring.
• Parallelism: Process the smallest chunk of data possible in
parallel to reduce dependencies
• Scalability: Hundreds of Amazon EMR clusters in multiple
regions and Availability Zones
• Cost optimized: All Spot instances. Launch in Availability Zone
with lowest spot prices.
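
The cost-optimization item above can be sketched as a simple selection over current prices. This is an illustration only: the function name is hypothetical, the prices are made-up placeholders, and in practice the prices might come from the EC2 spot price history API rather than a hard-coded dictionary.

```python
from typing import Dict

def cheapest_zone(spot_prices: Dict[str, float]) -> str:
    """Return the Availability Zone with the lowest current spot price."""
    if not spot_prices:
        raise ValueError("no spot prices supplied")
    # min() over the dict's keys, ordered by each key's price.
    return min(spot_prices, key=spot_prices.get)

# Hypothetical prices in USD per instance-hour, not real quotes.
prices = {"us-east-1a": 0.042, "us-east-1b": 0.035, "us-east-1c": 0.051}
print(cheapest_zone(prices))  # us-east-1b
```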

Cloud Facts
• Total compressed Amazon S3 data size: 150 TB
• Uncompressed raw data per day: 2-3 TB
• Amazon EMR clusters per day: 350
• Amazon S3 data retention period: 13-24 months

Restatement Use Case
• 150 terabytes of raw data
• 10 Availability Zones
• 550 EMR clusters
• 24,000 EC2 instances

AWS Cost Breakout
[Pie chart. Categories: EC2 cost, EMR fee, S3 cost. Segment sizes: 44%, 40%, 16%.]
** Storage cost recurs every month at $2.85 per 100 GB
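
To put the storage footnote in perspective, the rate of $2.85 per 100 GB can be applied to the 150 TB compressed S3 footprint quoted earlier in the deck. Combining the two numbers is an assumption made here for illustration, as is the use of decimal terabytes (1 TB = 1,000 GB); the function name is hypothetical.

```python
RATE_USD_PER_100_GB = 2.85  # monthly rate from the slide footnote

def monthly_storage_cost(size_tb: float, gb_per_tb: int = 1000) -> float:
    """Monthly storage cost in USD at the footnote's flat rate."""
    size_gb = size_tb * gb_per_tb
    return size_gb / 100 * RATE_USD_PER_100_GB

# 150 TB is 150,000 GB, or 1,500 blocks of 100 GB: 1,500 * 2.85 = 4,275
print(round(monthly_storage_cost(150), 2))  # 4275.0
```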

• Tag all resources
• Infrastructure as Code
• Command Line Interface
• JSON as configuration files
• AWS Identity and Access Management (IAM) roles and policies
• Use of application ID
• Enable CloudTrail
• S3 lifecycle management
• S3 versioning
• Separate code/data/logs buckets
• Keyless EMR clusters
• Hybrid model
• Enable debugging
• Create multiple CLI profiles
• Multi-factor authentication
• CloudWatch billing alarms
• EC2 Spot instances
• SNS notifications for failures
• Loosely coupled apps
• Scale horizontally

Database on cloud
• Database on AWS
• Options: Amazon RDS, Amazon Redshift, or others using
Amazon EC2
Event-driven design
• Kick off code based on events
• Run downstream processes as soon as upstream completes
• Options: AWS Lambda, Amazon SQS, Amazon SWF or AWS
Data Pipeline
Data analytics
• Implement massive parallel processing technologies
• Options: Spark, Impala or Presto
DevOps on cloud
• Rapidly and automatically deploy new code
• Continuous Integration/Continuous Deployment
• Options: AWS CodeDeploy, AWS CodeCommit, or AWS
CodePipeline
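
As a sketch of the event-driven design item above, an AWS Lambda function subscribed to S3 object-created notifications could identify which upstream output has landed and then start the downstream step. The handler below only extracts the bucket and key from the standard S3 notification structure; what to launch next is left as a comment because the talk does not specify it, and the bucket and key in the sample event are hypothetical.

```python
from urllib.parse import unquote_plus

def handler(event, context):
    """Collect (bucket, key) pairs from an S3 event notification."""
    objects = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        bucket = s3["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 notifications.
        key = unquote_plus(s3["object"]["key"])
        objects.append((bucket, key))
    # A real pipeline would start the downstream job here,
    # for example by launching a cluster or queueing a message.
    return objects

# Hypothetical notification for a newly written upstream file.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "raw-data"},
                "object": {"key": "2015/10/08/part+1.avro"}}}
    ]
}
print(handler(sample_event, None))
```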

Thank you
Recommended session:
BDT208 - A Technical Introduction to
Amazon Elastic MapReduce
Thursday, Oct 8, 12:15 PM - 1:15 PM, Titian 2201B

Remember to complete your
evaluations!
