2. What we’ll cover:
1. Big data challenges
2. How to simplify big data processing
3. What technologies should you use?
• Why?
• How?
4. Reference architecture
5. Design patterns
6. Plethora of Tools
[Diagram: a collage of AWS big data services – Amazon Glacier, Amazon S3, Amazon DynamoDB, Amazon RDS, Amazon EMR, Amazon Redshift, AWS Data Pipeline, Amazon Kinesis, Amazon Kinesis Streams apps, AWS Lambda, Amazon ML, Amazon SQS, Amazon ElastiCache, Amazon DynamoDB Streams, Amazon Elasticsearch Service, Amazon Kinesis Analytics.]
7. Big Data Challenges
Is there a reference architecture?
What tools should I use?
How?
Why?
8. Architectural Principles
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable / elastic, available, reliable, secure, no / low admin
Use Lambda architecture ideas
• Immutable (append-only) log; batch / real-time / serving layers
Big data ≠ big cost
9. Conceptual Framework to Simplify Big Data Processing
COLLECT → STORE → PROCESS / ANALYZE → CONSUME
Evaluate each stage on: time to answer (latency), throughput, and cost.
13. Data Characteristics: Hot, Warm, Cold

             | Hot data   | Warm data | Cold data
Volume       | MB–GB      | GB–TB     | PB–EB
Item size    | B–KB       | KB–MB     | KB–TB
Latency      | ms         | ms, sec   | min, hrs
Durability   | Low–high   | High      | Very high
Request rate | Very high  | High      | Low
Cost/GB      | $$–$       | $–¢¢      | ¢
18. What About Messaging?
• Decouple producers & consumers
• Persistent buffer
• Collect multiple streams
• No client ordering
• No streaming MapReduce
• No parallel consumption for Amazon SQS consumers
• Amazon SNS can publish to multiple subscribers (SQS queues or Lambda functions)
[Diagram: a publisher sends to an Amazon SNS topic, which fans out to an AWS Lambda function and to Amazon SQS queues, each drained by its own subscriber; producers feed the left side, consumers the right.]
(A minimal SNS fan-out sketch follows.)
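A minimal fan-out sketch with boto3, under assumptions: the topic and queue names are hypothetical, and a real setup also needs an SQS queue policy that allows the topic to send.

```python
import json
import boto3

sns = boto3.client("sns")
sqs = boto3.client("sqs")

# Create a topic and a queue, then fan out topic messages to the queue.
topic_arn = sns.create_topic(Name="clickstream-events")["TopicArn"]
queue_url = sqs.create_queue(QueueName="clickstream-consumer")["QueueUrl"]
queue_arn = sqs.get_queue_attributes(
    QueueUrl=queue_url, AttributeNames=["QueueArn"]
)["Attributes"]["QueueArn"]

# A real setup also needs an SQS queue policy allowing this topic to send.
sns.subscribe(TopicArn=topic_arn, Protocol="sqs", Endpoint=queue_arn)
sns.publish(TopicArn=topic_arn, Message=json.dumps({"event": "page_view"}))
```

Subscribing a Lambda function to the same topic gives the second fan-out path shown in the diagram.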
19. What Stream / Message Storage Should I Use?

Storage                 | AWS managed | Guaranteed ordering | Delivery (deduping) | Data retention | Availability | Scale / throughput      | Parallel clients | Stream MapReduce | Row/object size             | Cost
Amazon DynamoDB Streams | Yes         | Yes                 | Exactly-once        | 24 hours       | 3 AZ         | No limit / ~table IOPS  | Yes              | Yes              | 400 KB                      | Higher (table cost)
Amazon Kinesis Streams  | Yes         | Yes                 | At-least-once       | 7 days         | 3 AZ         | No limit / ~shards      | Yes              | Yes              | 1 MB                        | Low
Amazon Kinesis Firehose | Yes         | No                  | At-least-once       | N/A            | 3 AZ         | No limit / automatic    | No               | N/A              | Destination row/object size | Low
Apache Kafka            | No          | Yes                 | At-least-once       | Configurable   | Configurable | No limit / ~nodes       | Yes              | Yes              | Configurable                | Low (+admin)
Amazon SQS              | Yes         | No                  | At-least-once       | 14 days        | 3 AZ         | No limits / automatic   | No               | N/A              | 256 KB                      | Low–medium

(The options span hot storage on the left to warm on the right. A minimal Kinesis producer sketch follows.)
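To make the Kinesis Streams column concrete, a minimal producer sketch with boto3; the stream name and payload are assumptions.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

event = {"sensor_id": "s-42", "temperature": 21.7}  # example payload

# The partition key picks the shard; records sharing a key keep their
# relative order within that shard (the "guaranteed ordering" row).
kinesis.put_record(
    StreamName="iot-stream",              # hypothetical stream name
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["sensor_id"],
)
```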
20. COLLECT → STORE
[Diagram: collection sources mapped to store-stage destinations, grouped by transport type.]
• Applications – mobile apps, web apps, data centers (via AWS Direct Connect) → RECORDS → database
• Logging – Amazon CloudWatch, AWS CloudTrail → DOCUMENTS / FILES → search and file storage on Amazon S3 (bulk transfer via AWS Import/Export Snowball)
• Messaging → MESSAGES → Amazon SQS message queue
• IoT – devices, sensors & IoT platforms (AWS IoT) → STREAMS → hot stream storage: Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams
(A minimal Firehose producer sketch follows.)
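For the stream-to-file path (e.g., log lines landing in Amazon S3), a minimal Kinesis Firehose producer sketch; it assumes a delivery stream already configured with an S3 destination, and the names are hypothetical.

```python
import json
import boto3

firehose = boto3.client("firehose")

# Firehose batches incoming records and delivers them to whatever
# destination the delivery stream was configured with (S3 here).
log_line = json.dumps({"level": "INFO", "msg": "service started"}) + "\n"
firehose.put_record(
    DeliveryStreamName="logs-to-s3",      # hypothetical delivery stream
    Record={"Data": log_line.encode("utf-8")},
)
```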
21. Why is Amazon S3 Good for Big Data?
• Native support in big data frameworks (Spark, Hive, Presto, etc.)
• No need to run compute clusters just for storage (unlike HDFS)
• Supports transient Hadoop clusters & Amazon EC2 Spot Instances
• Multiple distinct clusters (Spark, Hive, Presto) can use the same data
• Elastic: unlimited number of objects
• Very high bandwidth – no aggregate throughput limit
• Highly available – can tolerate AZ failure
• Designed for 99.999999999% durability
• Tiered storage (Standard, Standard-IA, Amazon Glacier) via lifecycle policies
• Secure – SSL in transit, client/server-side encryption at rest
• Low cost
22. What About HDFS & Amazon Glacier?
• Use HDFS for very frequently accessed (hot) data
• Use Amazon S3 Standard for frequently accessed data
• Use Amazon S3 Standard-IA for infrequently accessed data
• Use Amazon Glacier for archiving cold data
(A lifecycle-policy sketch that automates the S3 tiering follows.)
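A sketch of automating the S3 side of this tiering with boto3; the bucket name, prefix, and day thresholds are assumptions for illustration.

```python
import boto3

s3 = boto3.client("s3")

# Tier data down as it cools: Standard -> Standard-IA after 30 days,
# -> Glacier after 90 days, delete after ~5 years. All thresholds and
# the bucket/prefix are assumptions.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-data-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-then-archive",
                "Status": "Enabled",
                "Filter": {"Prefix": "logs/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```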
23. Cache, Database, Search
[Diagram: the same COLLECT → STORE picture as slide 20 (applications → RECORDS; logging via Amazon CloudWatch and AWS CloudTrail → DOCUMENTS / FILES, with AWS Import/Export Snowball for bulk transfer; messaging → MESSAGES → Amazon SQS; devices, sensors & IoT platforms via AWS IoT → STREAMS → Apache Kafka, Amazon Kinesis Streams, Amazon Kinesis Firehose, Amazon DynamoDB Streams), now with the data tier filled in:]
• Search – Amazon Elasticsearch Service
• NoSQL – Amazon DynamoDB
• File – Amazon S3
• Cache – Amazon ElastiCache
• SQL – Amazon RDS
25. Best Practice – Use the Right Tool for the Job
Data tier options:
• Search – Amazon Elasticsearch Service
• Cache – Amazon ElastiCache (Redis, Memcached)
• SQL – Amazon Aurora, MySQL, PostgreSQL, Oracle, SQL Server
• NoSQL – Amazon DynamoDB, Cassandra, HBase, MongoDB
26. What Data Store Should I Use?
Data structure? → Fixed schema, JSON, key-value
Access patterns? → Store data in the format you will access it in
Data / access characteristics? → Hot, warm, cold
Cost? → The right cost for the workload
27. Data Structure and Access Patterns

Access patterns                      | What to use?
Put/Get (key, value)                 | Cache, NoSQL
Simple relationships → 1:N, M:N      | NoSQL
Cross-table joins, transactions, SQL | SQL
Faceting, search                     | Search

Data structure     | What to use?
Fixed schema       | SQL, NoSQL
Schema-free (JSON) | NoSQL, Search
(Key, value)       | Cache, NoSQL
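A minimal sketch of the Put/Get (key, value) row, using boto3 against a hypothetical DynamoDB table keyed on user_id:

```python
import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-profiles")  # hypothetical table, partition key: user_id

# Put/Get (key, value): write and read a single item by its key.
table.put_item(Item={"user_id": "u-123", "plan": "pro", "visits": 42})
item = table.get_item(Key={"user_id": "u-123"}).get("Item")
print(item)
```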
29. What Data Store Should I Use?

Store                | Average latency       | Typical data stored | Typical item size | Request rate          | Storage cost /GB-month | Durability     | Availability
Amazon ElastiCache   | ms                    | GB                  | B–KB              | High – very high      | $$                     | Low – moderate | High, 2 AZ
Amazon DynamoDB      | ms                    | GB–TB (no limit)    | KB (400 KB max)   | Very high (no limit)  | ¢¢                     | Very high      | Very high, 3 AZ
Amazon RDS/Aurora    | ms, sec               | GB–TB (64 TB max)   | KB (64 KB max)    | High                  | ¢¢                     | Very high      | Very high, 3 AZ
Amazon Elasticsearch | ms, sec               | GB–TB               | KB (2 GB max)     | High                  | ¢¢                     | High           | High, 2 AZ
Amazon S3            | ms, sec, min (~ size) | MB–PB (no limit)    | KB–TB (5 TB max)  | Low – high (no limit) | ¢                      | Very high      | Very high, 3 AZ
Amazon Glacier       | hrs                   | GB–PB (no limit)    | GB (40 TB max)    | Very low              | ¢/10                   | Very high      | Very high, 3 AZ

(Rows run from hot data at the top, through warm, to cold data at the bottom.)
30. Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
“I’m currently scoping out a project that will greatly increase my team’s use of Amazon S3. Hoping you could answer some questions. The current iteration of the design calls for many small files, perhaps up to a billion during peak. The total size would be on the order of 1.5 TB per month…”

Request rate (writes/sec) | Object size (bytes) | Total size (GB/month) | Objects per month
300                       | 2,048               | 1,483                 | 777,600,000
35. Cost Conscious Design
Example: Should I use Amazon S3 or Amazon DynamoDB?
Simple Monthly Calculator: https://calculator.s3.amazonaws.com/index.html
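A sketch of the arithmetic behind the request-rate table, plus an illustrative S3 cost estimate. The per-request and per-GB prices are assumptions for illustration only (use the calculator for current, region-specific rates); the takeaway is that with many small objects, per-request charges can dominate storage charges.

```python
# Back-of-the-envelope for the example above.
writes_per_sec = 300
object_bytes = 2048
seconds_per_month = 30 * 24 * 3600

objects_per_month = writes_per_sec * seconds_per_month       # 777,600,000
total_gb = objects_per_month * object_bytes / 1024**3        # ~1,483 GB/month

s3_put_per_1k = 0.005      # assumed $ per 1,000 S3 PUT requests
s3_gb_month = 0.03         # assumed $ per GB-month, S3 Standard

request_cost = objects_per_month / 1000 * s3_put_per_1k      # ~$3,888
storage_cost = total_gb * s3_gb_month                        # ~$44
print(f"{objects_per_month:,} objects, {total_gb:,.0f} GB/month")
print(f"S3: ${request_cost:,.0f} requests + ${storage_cost:,.0f} storage")
```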
41. What Stream & Message Processing Technology Should I Use?

Technology                   | AWS managed             | Serverless | Scale / throughput      | Availability | Programming languages                     | Uses                        | Reliability
Amazon EMR (Spark Streaming) | Yes (Amazon EMR)        | No         | No limits / ~nodes      | Single AZ    | Java, Python, Scala                       | Multistage processing       | KCL and Spark checkpoints
Apache Storm                 | No (do it yourself)     | No         | No limits / ~nodes      | Configurable | Almost any language via Thrift            | Multistage processing       | Framework managed
KCL application              | No (EC2 + Auto Scaling) | No         | No limits / ~nodes      | Multi-AZ     | Java, others via MultiLangDaemon          | Single-stage processing     | Managed by KCL
Amazon Kinesis Analytics     | Yes                     | Yes        | Up to 8 KPU / automatic | Multi-AZ     | ANSI SQL with extensions                  | Multistage processing       | Managed by Amazon Kinesis Analytics
AWS Lambda                   | Yes                     | Yes        | No limits / automatic   | Multi-AZ     | Node.js, Java, Python                     | Simple event-based triggers | Managed by AWS Lambda
Amazon SQS application       | No (EC2 + Auto Scaling) | No         | No limits / ~nodes      | Multi-AZ     | AWS SDK languages (Java, .NET, Python, …) | Simple event-based triggers | Managed by SQS visibility timeout

(A minimal AWS Lambda consumer sketch follows.)
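To make the AWS Lambda row concrete, a minimal Python handler wired to a Kinesis event source; Lambda delivers batches of base64-encoded records in this event shape.

```python
import base64
import json

def handler(event, context):
    """Minimal AWS Lambda consumer for a Kinesis event source.

    Lambda polls the stream and invokes the function with a batch of
    records; each record's payload arrives base64-encoded.
    """
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        # Stateless per-event logic goes here (filter, enrich, alert, ...).
        print(payload)
```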
42. Which Analysis Tool Should I Use?

                   | Amazon Redshift                               | Amazon EMR (Presto, Spark, Hive)
Use case           | Optimized for data warehousing                | Presto: interactive query; Spark: general purpose (iterative ML, RT, …); Hive: batch (slow)
Scale / throughput | ~Nodes                                        | ~Nodes
Storage            | Local storage                                 | Amazon S3, HDFS
Optimization       | Columnar storage, data compression, zone maps | Framework dependent
Metadata           | Amazon Redshift managed                       | Hive metastore
BI tool support    | Yes (JDBC/ODBC)                               | Yes (JDBC/ODBC & custom)
Access controls    | Users, groups, and access controls            | Integration with LDAP
UDF support        | Yes (scalar)                                  | Yes

(A short PySpark-on-S3 sketch follows.)
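A minimal PySpark sketch of the EMR column's key property: the cluster reads shared data directly from Amazon S3, so multiple engines can work over the same objects. The path and schema are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("clickstream-analysis").getOrCreate()

# EMR clusters read shared data directly from S3 (via EMRFS) --
# no cluster-local copy is needed, so multiple clusters can share it.
df = spark.read.json("s3://my-data-lake/clickstream/2016/01/")  # hypothetical path
df.groupBy("page").count().orderBy("count", ascending=False).show(10)
```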
46. STORE → PROCESS / ANALYZE → CONSUME
[Diagram: the CONSUME stage – Amazon QuickSight plus apps & services, split into analysis & visualization for business users and notebooks, IDEs, and APIs for data scientists and developers; COLLECT and ETL sit upstream.]
56. Lambda Architecture
[Diagram: data arrives via Amazon Kinesis and forks into three layers.]
• Batch & interactive layer – store in Amazon S3; process with Amazon Redshift and Amazon EMR (Presto, Hive, Pig, Spark)
• Real-time layer – Amazon Kinesis Analytics, AWS Lambda, KCL, Storm and Spark Streaming on Amazon EMR, Amazon ML
• Serving layer – Amazon ElastiCache, Amazon DynamoDB, Amazon RDS, Amazon ES, feeding applications
57. Summary
Decoupled “data bus”
• Data → Store → Process → Store → Analyze → Answers
Use the right tool for the job
• Data structure, latency, throughput, access patterns
Leverage AWS managed services
• Scalable/elastic, available, reliable, secure, no/low admin
Use Lambda architecture ideas
• Immutable (append-only) log, batch/real-time/serving layer
Big data ≠ big cost
Principal Architect and Senior Manager, Big Data Solutions Architecture, Amazon Web Services
Abstract:
The world is producing an ever-increasing volume, velocity, and variety of big data. Consumers and businesses are demanding up-to-the-second (or even millisecond) analytics on their fast-moving data, in addition to classic batch processing. AWS delivers many services for solving big data problems. But which service should you use, why, when, and how? In this session, we simplify big data processing as a data bus comprising various stages: ingest, store, process, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide a reference architecture, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
Volume – 100 to 150 TB a day
Velocity – 1 million reads and writes per second is becoming the norm
Variety:
• IoT / log / streaming data
• Transactional data
• File data
• Fixed schema: CSV, Parquet, Avro
• Schema-free: JSON
• Key-value
• Small files, large files
Weekly / monthly bill: what you spent this billing cycle
Hourly server logs: were your systems misbehaving an hour ago?
Daily customer-preferences report from your web site’s click stream: what deal or ad to try next time
Daily fraud reports: was there fraud yesterday?
Real-time alerts: what went wrong now
Real-time spending caps: prevent overspending now
Real-time analysis: what to offer the current customer now
Real-time detection: block fraudulent use now
I need to harness big data, fast
I want more happy customers
I want to save / make more money
Is there a reference architecture? What tools should I use? How? Why?
Go faster – only key points
Before we go into solving the big data architecture, I want to introduce some tried-and-tested architecture principles.
Here at AWS we believe you should use the right tool for the job: instead of using a big Swiss Army knife to drive a screw, it is best to use a screwdriver. This is especially important for big data architectures. We’ll talk about this more.
Decoupled architecture (http://whatis.techtarget.com/definition/decoupled-architecture): in general, a decoupled architecture is a framework for complex work that allows components to remain completely autonomous and unaware of each other. This approach has been tried and battle tested.
Managed services: this one is relatively new. Should I install Cassandra or MongoDB or CouchDB on AWS? You obviously can, and sometimes there are good reasons for doing so; many customers still do. Netflix is a great example: they run a multi-region Cassandra and are a poster child for how to do this. But for most customers, delegating this task to AWS makes more sense. You are better off spending your time building features for your customers rather than building highly scalable distributed systems.
Lambda architecture: see slide 56.
throughput = f(volume, request rate)
latency
cost
event-to-action/answer latency?
Types of Data
• Heavily read, app-defined data structures: database records
• Search documents
• Log files
• Messaging events
• Devices / sensors / IoT stream
Huge buffer…
http://calculator.s3.amazonaws.com/index.html#r=IAD&key=calc-BE3BA3E4-1AC5-4E7A-B542-015056D8EDAF
Cost comparison (100 rps × 35 KB):
• Kinesis → $52.14 per month
• SQS → $133.42 per month for puts, roughly ×2 ≈ $266/month for put + get, or ~$400/month including put, get, and delete
• DynamoDB → $3,809.88 per month (10 TB of storage alone costs ~$2,500/month)
Amazon DynamoDB Service (US-East) line items:
• Provisioned throughput capacity: $120
• Indexed data storage: $2,560.90
• DynamoDB Streams: $1.30
Amazon SQS Service (US-East): see calculator.
Amazon Kinesis pricing example:
Let’s assume that our data producers put 100 records per second in aggregate, and each record is 35KB. In this case, the total data input rate is 3.4MB/sec (100 records/sec*35KB/record). For simplicity, we assume that the throughput and data size of each record are stable and constant throughout the day. Please note that we can dynamically adjust the throughput of our Amazon Kinesis stream at any time.
We first calculate the number of shards needed for our stream to achieve the required throughput. As one shard provides a capacity of 1MB/sec data input and supports 1000 records/sec, four shards provide a capacity of 4MB/sec data input and support 4000 records/sec. So a stream with four shards satisfies our required throughput of 3.4MB/sec at 100 records/sec.
We then calculate our monthly Amazon Kinesis costs using Amazon Kinesis pricing in the US-East Region:
Shard Hour: One shard costs $0.015 per hour, or $0.36 per day ($0.015*24). Our stream has four shards so that it costs $1.44 per day ($0.36*4). For a month with 31 days, our monthly Shard Hour cost is $44.64 ($1.44*31).
PUT Payload Unit (25KB): As our record is 35KB, each record contains two PUT Payload Units. Our data producers put 100 records or 200 PUT Payload Units per second in aggregate. That is 267,840,000 records or 535,680,000 PUT Payload Units per month. As one million PUT Payload Units cost $0.014, our monthly PUT Payload Units cost is $7.499 ($0.014*535.68).
Adding the Shard Hour and PUT Payload Unit costs together, our total Amazon Kinesis costs are $1.68 per day, or $52.14 per month. For $1.68 per day, we have a fully-managed streaming data infrastructure that enables us to continuously ingest 4MB of data per second, or 337GB of data per day in a reliable and elastic manner.
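The same walkthrough as straight-line code, using the US-East prices quoted above (current prices may differ):

```python
import math

records_per_sec = 100
record_kb = 35

# Shards: one shard ingests 1 MB/sec and 1,000 records/sec.
shards = math.ceil(records_per_sec * record_kb / 1024)             # 4 shards
shard_hour_cost = shards * 0.015 * 24 * 31                         # $44.64/month

# PUT payload units: billed per 25 KB chunk of each record.
units_per_record = math.ceil(record_kb / 25)                       # 2 units
units_per_month = records_per_sec * units_per_record * 86400 * 31  # 535,680,000
put_cost = units_per_month / 1_000_000 * 0.014                     # ~$7.50/month

print(f"{shards} shards -> ${shard_hour_cost + put_cost:.2f}/month")  # $52.14
```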
No need to run compute clusters for storage (unlike HDFS)
Can run transient Hadoop clusters & Amazon EC2 Spot Instances
Multiple distinct (Spark, Hive, Presto) clusters can use the same data
Highly durable, highly available, highly scalable
Secure
Designed for 99.999999999% durability
https://aws.amazon.com/s3/storage-classes/
Lifecycle policies: objects can be migrated to S3 Standard-IA, archived to Amazon Glacier, or deleted after a specified period of time!
No need to run a Hadoop cluster for storage (unlike HDFS)
Unlimited number of objects
Object size up to 5 TB
Very high bandwidth
Versioning
Lifecycle policies to archive to Glacier
Server-side encryption
Highly available – SLA 99.99% (Standard)
Highly scalable
Low cost
Event notification
Cross-region replication
Natively supported by almost all big data frameworks & tools
2×2 matrix: structure on one axis, level of query (from none to complex) on the other
Draw down the slide
Analysis of data is a process of inspecting, cleaning, transforming, and modeling data with the goal of discovering useful information, suggesting conclusions, and supporting decision-making.
Examples:
• Data warehousing: monthly bill, interactive dashboards
• Machine learning: sentiment analysis, prediction
• Stream processing: alerts, 1-minute metrics, etc.
Forecasting Web Page Views: Methods and Observations – http://www.jmlr.org/papers/volume9/li08a/li08a.pdf
Sentiment analysis: http://nlp.stanford.edu/sentiment/
https://en.wikipedia.org/wiki/Data_analysis
Add connector
Directed acyclic graphs (DAGs)?
Exactly-once processing & DAGs – how do you do this?
https://storm.apache.org/documentation/Rationale.html
http://www.slideshare.net/ptgoetz/apache-storm-vs-spark-streaming
Applications & API
Analysis and Visualization
Notebooks
IDE
Similar to multi-tier web-app data architectures
Concept of a “data bus” or “data pipeline”
This slide summarizes all six design patterns together – the available solutions organized by the temperature of the data and the data-processing latency requirements.
Hive – one year’s worth of click-stream data
Spark – one year of click-stream data: what people frequently buy together
Redshift – reporting with an enterprise reporting tool, SQL heavy
Impala – same space as Redshift
Presto – same league as Impala: interactive SQL analytics for those with an existing Hadoop installed base
NoSQL – analytics on NoSQL data
Best practices
BDT318 – Netflix Keystone: How Netflix Handles Data Streams Up to 8 Million Events Per Second
Use Kinesis or Kafka for stream storage
Use the appropriate stream processing tool:
• KCL – simple, only for Kinesis
• Apache Storm – general purpose
• Spark Streaming – windowing, MLlib, stateful
• AWS Lambda – fully managed, no servers, stateless
Use Amazon SNS for alerts
Use Amazon ML for predictions
Use a managed database/cache for state management
Use a visualization tool for KPI visualization
Lambda Architecture best practices
Use Kinesis or Kafka for stream storage
Maintain an immutable (append-only) raw master dataset in Amazon S3 – use the Amazon Kinesis S3 connector or Firehose
Use appropriate batch processing technologies for creating batch views
Use appropriate stream processing technology for real-time views
Use a serving layer with an appropriate database to serve downstream applications