2. Agenda
• Why Real-time Data Streaming and Analytics?
• How to Build?
• Where to Store Streaming Data?
• How to Ingest Streaming Data?
• How to Process Streaming Data?
• Delivering Streaming Data
• Dive into Stream Processing Frameworks
• Transform, Aggregate, Join Streaming Data
• Case Studies
• Key Takeaways
5. Data Loses Value Over Time
[Chart: value of data to decision-making (y-axis) against time (x-axis: real time, seconds, minutes, hours, days, months). Time-critical decisions: preventive/predictive, actionable. Traditional "batch" business intelligence: reactive, historical.]
* Source: Mike Gualtieri, Forrester, Perishable insights
7. Batch vs Real-time
Difference | Batch | Real-time
Continuity | Arbitrary or periodic | Constant
Method of analysis | Store → Process (Hadoop MapReduce, Hive, Pig, Spark) | Process → Store (Spark Streaming, Flink, Apache Storm)
Data size per unit | Small - Huge (KB~TB) | Small (B~KB)
Query latency | Low - High (minutes to hours) | Low (milliseconds to minutes)
Request rate | Low - High (hourly/daily/monthly) | Very High - High (in seconds, minutes)
Durability | High - Very high | Low - High
Cost/GB | ¢~$ (Amazon S3, Glacier) | $$~$ (Redis, Memcached)
8. From Batch to Real-time: Lambda Architecture
[Diagram: Data Source → Stream Ingestion → Stream Storage (streaming data). Speed Layer: Stream Process → Real-time View. Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View. Service Layer: Query & Merge Results → Consumer.]
10. Key Components of Real-time Analytics
Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Data Source: devices and/or applications that produce real-time data at high velocity
• Stream Ingestion: data from tens of thousands of data sources can be written to a single stream
• Stream Storage: data are stored in the order they were received for a set duration of time and can be replayed indefinitely during that time
• Stream Process: records are read in the order they are produced, enabling real-time analytics or streaming ETL
• Data Sink: data lake (most common), database (least common)
11. Where to Store Streaming Data?
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
17. Amazon Kinesis Data Streams vs Amazon Managed Streaming for Kafka
Amazon Managed Streaming for Kafka
• Operational Considerations
  - Number of clusters?
  - Number of brokers per cluster?
  - Number of topics per broker?
  - Number of partitions per topic?
• The number of partitions can only be increased; it can't be decreased
• Integration with a few AWS services, such as Kinesis Data Analytics for Java
Amazon Kinesis Data Streams
• Operational Considerations
  - Number of Kinesis Data Streams?
  - Number of shards per stream?
• The number of shards can be increased or decreased
• Full integration with AWS services such as Lambda functions, Kinesis Data Analytics, etc.
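The "number of shards per stream" question can be approached with a back-of-the-envelope calculation. The following is a minimal sketch, assuming the per-shard limits of provisioned-mode Kinesis Data Streams (1 MB/s or 1,000 records/s for writes, 2 MB/s for shared-throughput reads). The function name `estimate_shards` and the sample workload are hypothetical, and the limits should be checked against the current service quotas.

```python
import math

# Per-shard limits assumed for provisioned-mode Kinesis Data Streams.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def estimate_shards(records_per_sec, avg_record_kb, consumers=1):
    """Return the smallest shard count that satisfies the write and read limits."""
    write_mb = records_per_sec * avg_record_kb / 1024.0
    read_mb = write_mb * consumers  # every consumer reads the full stream
    return max(
        1,
        math.ceil(write_mb / WRITE_MB_PER_SEC),
        math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC),
        math.ceil(read_mb / READ_MB_PER_SEC),
    )


if __name__ == "__main__":
    # Hypothetical workload: 5,000 records/s of 2 KB each, read by 3 consumers.
    print(estimate_shards(5000, 2, consumers=3))
```

In this sketch the read side dominates once several consumers share a stream, which is one reason the shard count may need to change over time.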
18. Metrics to Monitor: MSK (Kafka)
• RequestQueue
  - Length
  - WaitTime
• ResponseQueue
  - Length
  - WaitTime
• Network
  - Packet drop?
• Produce/consume rate imbalance
• Who is the leader?
• Disk full?
• Too many topics?
19. Metrics to Monitor: MSK (Kafka)
Metric | Level | Description
ActiveControllerCount | DEFAULT | Only one controller per cluster should be active at any given time.
OfflinePartitionsCount | DEFAULT | Total number of partitions that are offline in the cluster.
GlobalPartitionCount | DEFAULT | Total number of partitions across all brokers in the cluster.
GlobalTopicCount | DEFAULT | Total number of topics across all brokers in the cluster.
KafkaAppLogsDiskUsed | DEFAULT | The percentage of disk space used for application logs.
KafkaDataLogsDiskUsed | DEFAULT | The percentage of disk space used for data logs.
RootDiskUsed | DEFAULT | The percentage of the root disk used by the broker.
PartitionCount | PER_BROKER | The number of partitions for the broker.
LeaderCount | PER_BROKER | The number of leader replicas.
UnderMinIsrPartitionCount | PER_BROKER | The number of under minIsr partitions for the broker.
UnderReplicatedPartitions | PER_BROKER | The number of under-replicated partitions for the broker.
FetchConsumerTotalTimeMsMean | PER_BROKER | The mean total time in milliseconds that consumers spend on fetching data from the broker.
ProduceTotalTimeMsMean | PER_BROKER | The mean produce time in milliseconds.
20. How about monitoring Kinesis Data Streams?
[Diagram: Data → GetRecords() → Consumer Application]
How long does a record stay in a shard?
21. Metrics to Monitor: Kinesis Data Streams
Metric | Description
GetRecords.IteratorAgeMilliseconds | Age of the last record in all GetRecords calls
ReadProvisionedThroughputExceeded | Number of GetRecords calls throttled
WriteProvisionedThroughputExceeded | Number of PutRecord(s) calls throttled
PutRecord.Success, PutRecords.Success | Number of successful PutRecord(s) operations
GetRecords.Success | Number of successful GetRecords operations
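One way to read GetRecords.IteratorAgeMilliseconds is to compare it with the stream's retention period: if the iterator age approaches the retention period, records may expire before the consumer reads them. The sketch below illustrates that comparison. The function name, the status labels, and the 50% warning ratio are hypothetical choices, and the 24-hour default retention is an assumption that should be checked against the stream's actual setting.

```python
def consumer_lag_status(iterator_age_ms, retention_hours=24, warn_ratio=0.5):
    """Classify consumer lag from a GetRecords.IteratorAgeMilliseconds value."""
    retention_ms = retention_hours * 60 * 60 * 1000
    if iterator_age_ms >= retention_ms:
        # Records are older than the retention period before being read.
        return "DATA_LOSS_RISK"
    if iterator_age_ms >= warn_ratio * retention_ms:
        # The consumer is falling behind the producers.
        return "FALLING_BEHIND"
    return "OK"


if __name__ == "__main__":
    # Hypothetical datapoint: the consumer is 13 hours behind.
    print(consumer_lag_status(13 * 60 * 60 * 1000))
```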
23. How to Ingest Streaming Data?
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
24. Stream Ingestion
• AWS SDKs
  - Publish directly from application code via APIs
  - AWS Mobile SDK
• Kinesis Agent
  - Monitors log files and forwards lines as messages to Kinesis Data Streams
• Kinesis Producer Library (KPL)
  - Background process aggregates and batches messages
• 3rd party and open source
  - Kafka Connect (kinesis-kafka-connector)
  - fluentd (aws-fluent-plugin-kinesis)
  - Log4J Appender (kinesis-log4j-appender)
  - and more …
[Diagram: Data Source → Stream Ingestion → Stream Storage (Amazon Kinesis Data Streams) → Stream Process → Data Sink]
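The idea of batching messages before publishing them can be sketched in a few lines. This is a simplified illustration, not the KPL's actual aggregation format: it only groups records so that each group fits the PutRecords limits assumed here (up to 500 records and 5 MB per call, 1 MB per record including the partition key). The function name `batch_records` is hypothetical.

```python
# Limits assumed for a single PutRecords call; check the current quotas.
MAX_RECORDS_PER_CALL = 500
MAX_BYTES_PER_CALL = 5 * 1024 * 1024
MAX_BYTES_PER_RECORD = 1024 * 1024


def batch_records(records):
    """Group (partition_key, data_bytes) pairs into PutRecords-sized batches."""
    batches, current, current_bytes = [], [], 0
    for key, data in records:
        size = len(key.encode("utf-8")) + len(data)
        if size > MAX_BYTES_PER_RECORD:
            raise ValueError("record exceeds the per-record size limit")
        batch_full = (
            len(current) == MAX_RECORDS_PER_CALL
            or current_bytes + size > MAX_BYTES_PER_CALL
        )
        if current and batch_full:
            batches.append(current)
            current, current_bytes = [], 0
        current.append({"PartitionKey": key, "Data": data})
        current_bytes += size
    if current:
        batches.append(current)
    return batches


if __name__ == "__main__":
    sample = [("user-%d" % i, b"click") for i in range(1200)]
    print([len(batch) for batch in batch_records(sample)])
```

Each batch has the shape expected by the `Records` parameter of the boto3 Kinesis client's `put_records` call, so a producer could send one batch per request.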
25. How to Process Streaming Data?
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
28. Pre-built Data Transformation Blueprints
Blueprint | Description
General Processing | For custom transformation logic
Apache Log to JSON | Parses and converts Apache log lines to JSON objects using predefined JSON field names
Apache Log to CSV | Parses and converts Apache log lines to CSV format
Syslog to JSON | Parses and converts Syslog lines to JSON objects using predefined JSON field names
Syslog to CSV | Parses and converts Syslog lines to CSV format
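To make the "Apache Log to JSON" idea concrete, here is a minimal sketch of a transformation Lambda function in the record format that Kinesis Data Firehose uses for data transformation (base64-encoded `data`, and a `result` of `Ok` or `ProcessingFailed` per `recordId`). It is not the blueprint's actual code: the regular expression and the JSON field names are assumptions chosen for illustration.

```python
import base64
import json
import re

# Apache Common Log Format, for example:
# 127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /a.gif HTTP/1.0" 200 2326
# The field names below are assumed, not taken from the blueprint.
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\S+)'
)


def lambda_handler(event, context):
    """Convert each Apache log line in the batch to a JSON object."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        match = LOG_PATTERN.match(line)
        if match:
            payload = json.dumps(match.groupdict()) + "\n"
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(payload.encode("utf-8")).decode("utf-8"),
            })
        else:
            # Unparseable lines are returned unchanged and marked as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

The handler can be exercised locally by passing a dictionary with a `records` list, which is how the sketch was designed to be tested without any AWS resources.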
29. Pre-built Data Conversion
• Convert the format of your input data from JSON to a columnar data format (Apache Parquet or Apache ORC) before storing the data in Amazon S3
• Works in conjunction with the transform features to convert other formats to JSON before the data conversion
[Diagram: Data Source → Kinesis Data Firehose (JSON data; schema from the AWS Glue Data Catalog; convert to columnar format) → Amazon S3, with a /failed path]
30. Failure and Error Handling
• S3 Destination
  - Pause and retry for up to 24 hours (maximum data retention period)
  - If data delivery fails for more than 24 hours, your data is lost.
• Redshift Destination
  - Configurable retry duration (0-2 hours)
  - After retry, skip and load error manifest files to S3's errors/ folder
• Elasticsearch Destination
  - Configurable retry duration (0-2 hours)
  - After retry, skip and load failed records to S3's elasticsearch_failed/ folder
31. Stream Process
• Transform
  - Filter, Enrich, Convert
• Aggregation
  - Windowed Queries
  - Top-K Contributor
• Join
  - Stream-Stream Join
  - Stream-(External) Table Join
Services: AWS Lambda, Amazon Kinesis Data Analytics, AWS Glue, Amazon EMR
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
45. What about (Stream) SQL?
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream SQL Process → Data Sink]
"It's raining cats and dogs!" →
[("It's", 1),
("raining", 1),
("cats", 1),
("and", 1),
("dogs!", 1)]
→ rows:
It's 1
raining 1
cats 1
and 1
dogs! 1
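The mapping from a sentence to (word, 1) pairs and then to rows can be reproduced with a short sketch. The function names `to_pairs` and `count_words` are hypothetical; the second function shows the aggregation that a SQL GROUP BY over such rows would perform.

```python
from collections import Counter


def to_pairs(sentence):
    """Split a sentence on whitespace and emit one (word, 1) pair per word."""
    return [(word, 1) for word in sentence.split()]


def count_words(pairs):
    """Sum the counts per word, as a GROUP BY word would."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


if __name__ == "__main__":
    pairs = to_pairs("It's raining cats and dogs!")
    for word, n in pairs:
        print(word, n)
```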
46. Kinesis Data Analytics (SQL)
• STREAM (in-application): a continuously updated entity that you can SELECT from and INSERT into like a TABLE
• PUMP: an entity used to continuously 'SELECT ... FROM' a source STREAM, and INSERT SQL results into an output STREAM
• Create an output stream, which can be used to send results to a destination
[Diagram: Source → SOURCE STREAM → INSERT & SELECT (PUMP) → DESTINATION STREAM → Destination; example rows: [("It's", 1), ("raining", 1), ("cats", 1), ("and", 1), ("dogs!", 1)]]
54. Kinesis Data Analytics (SQL): Preprocessing Data
https://aws.amazon.com/ko/blogs/big-data/preprocessing-data-in-amazon-kinesis-analytics-with-aws-lambda/
55. Integration of Stream Process and Stream Storage
Stream Storage | AWS Lambda | Kinesis Data Analytics (SQL) | Kinesis Data Analytics (Java) | Glue | EMR
Kinesis Data Firehose | O | O | X | X | X
Kinesis Data Streams | O | O | O | O | O
Managed Streaming for Kafka (MSK) | X | X | O | O | O
[Diagram: Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink]
57. Stream Process: Aggregation
• Aggregations (count, sum, min, ...) take granular real-time data and turn it into insights
• Data is continuously processed, so you need to tell the application when you want results
• Windowed Queries
a. Sliding Windows (with Overlap)
b. Tumbling Windows (No Overlap)
c. Custom Windows
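The difference between windows with and without overlap can be sketched on a small list of timestamped events. This is an illustration only, not how any particular service implements windows: the overlapping variant here advances by a fixed slide (sometimes called a hopping window), and the function names and sample timestamps are hypothetical.

```python
from collections import defaultdict


def tumbling_counts(events, size):
    """Count events per non-overlapping window; events are (timestamp, key)."""
    counts = defaultdict(int)
    for ts, _key in events:
        window_start = (ts // size) * size
        counts[window_start] += 1
    return dict(sorted(counts.items()))


def sliding_counts(events, size, slide):
    """Count events per overlapping window [start, start + size)."""
    counts = defaultdict(int)
    for ts, _key in events:
        # Begin at the latest window start not after ts, then walk back
        # while the window still covers ts.
        start = (ts // slide) * slide
        while start > ts - size and start >= 0:
            counts[start] += 1
            start -= slide
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    events = [(1, "a"), (4, "b"), (7, "a"), (12, "c")]
    print(tumbling_counts(events, size=10))
    print(sliding_counts(events, size=10, slide=5))
```

With overlap, a single event contributes to more than one window, which is why the overlapping counts sum to more than the number of events.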
61. Why is Stream-Stream Join so Difficult?
• Timing
• Skewed data
[Diagram: two Data Source → Stream Storage pipelines feed one Stream Process → Data Sink; records arrive at times t0, t1, t2, ..., tN, separated by varying intervals Δt]
62. How about Stream-Join by Partition Key?
[Diagram: several Data Source → Stream Storage pipelines feed Stream Process; records with timestamps such as t1, t2, t3, t5 are distributed across shard-1, shard-2, and shard-3]
Each shard will be filled with records coming from fast data producers
63. Lastly, how about Stream-Join by Hash Table?
[Diagram: several Data Source → Stream Storage → Stream Process pipelines share a Key-Value Storage that serves as the join table]
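A join through a shared hash table can be sketched as follows: each arriving record looks up its key in the other stream's table, emits a joined result on a hit, and otherwise stores itself for a later match. This is a minimal sketch under simplifying assumptions (one-to-one matches, no expiry of unmatched records), with in-memory dictionaries standing in for the external key-value storage. The class name `HashJoiner` is hypothetical.

```python
class HashJoiner:
    """Join two streams on a key by way of per-stream hash tables."""

    def __init__(self):
        # In-memory stand-in for external key-value storage.
        self.tables = {"left": {}, "right": {}}

    def on_record(self, side, key, value):
        """Process one record; return (key, left, right) when a match exists."""
        other = "right" if side == "left" else "left"
        match = self.tables[other].pop(key, None)
        if match is None:
            # No partner yet: remember this record until the partner arrives.
            self.tables[side][key] = value
            return None
        if side == "left":
            return (key, value, match)
        return (key, match, value)


if __name__ == "__main__":
    joiner = HashJoiner()
    print(joiner.on_record("left", "user-1", "click"))
    print(joiner.on_record("right", "user-1", "purchase"))
```

A production version would also need a time-to-live on stored records, since a partner that never arrives would otherwise stay in the table indefinitely.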
68. EMR vs Glue vs Kinesis Data Analytics
[Chart: Kinesis Data Analytics (SQL), EMR, Glue, and Kinesis Data Analytics (Java) plotted by Operational Excellence against Degree of Freedom / Complexity]
69. Comparing stream processing services
AWS Lambda: Simple programming interface and scaling
• Serverless functions
• Six languages (Java, Python, Golang, Node.js, Ruby, C#)
• Event-based, stateless processing
• Continuous and simple scaling mechanism
Amazon Kinesis Data Analytics: Easy and powerful stream processing
• Serverless applications
• Supports SQL and Java (Apache Flink)
• Stateful processing with automatic backups
• Stream operators make building apps easy
AWS Glue: Simple, flexible, and cost-effective ETL & Data Catalog
• Serverless applications
• Can use the transforms native to Apache Spark Structured Streaming
• Automatically discovers new data and extracts schema definitions
• Automatically generates the ETL code
Amazon EMR: Flexibility and choice for your needs
• Choose your instances
• Use your favorite open-source framework
• Fine-grained control over cluster, debugging tools, and more
• Deep open-source tool integrations with AWS
71. Example Usage Pattern 1: Web Analytics and Leaderboards
[Diagram: Lightweight JS client code (with Amazon Cognito) OR web server on Amazon EC2 → Amazon Kinesis Data Streams (ingest web app data) → Amazon Kinesis Data Analytics (compute top 10 users) → Lambda function → Amazon DynamoDB (persist to feed live apps)]
https://aws.amazon.com/solutions/implementations/real-time-web-analytics-with-kinesis/
73. Example Usage Pattern 3: Analyzing AWS CloudTrail Event Logs
[Diagram: AWS CloudTrail → CloudWatch Events trigger → Kinesis Data Firehose (ingest raw log data; S3 bucket for raw data) → Kinesis Data Analytics (compute operational metrics) → Lambda function → DynamoDB table → Chart.JS dashboard (deliver to real-time dashboards and archival)]
https://aws.amazon.com/solutions/implementations/real-time-insights-account-activity/
75. From Batch to Real-time: Lambda Architecture
[Diagram: Data Source → Stream Ingestion → Stream Storage (streaming data). Speed Layer: Stream Process → Real-time View. Batch Layer: Stream Delivery → Raw Data Storage → Batch Process → Batch View. Service Layer: Query & Merge Results → Consumer.]
76. Key Components of Real-time Analytics
Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Stream Ingestion: AWS SDKs, Kinesis Agent, KPL, Kafka Connect
• Stream Storage: Kinesis Data Firehose, Kinesis Data Streams, Managed Streaming for Kafka
• Stream Process: AWS Lambda, Kinesis Data Analytics, Glue, EMR
Real-Time Applications
- Aggregation
- Top-K Contributor
- Anomaly Detection
Streaming ETL
- Filter, Enrich, Convert
- Join
77. Key Takeaways
• Build decoupled systems
  - Data → Store → Process → Store → Analyze → Answers
  - Data Source → Stream Ingestion → Stream Storage → Stream Process → Data Sink
• Follow the principle of "extract data once and reuse multiple times" to power new customer experiences
• Use the right tool for the job
  - Know the AWS services' soft and hard limits
• Leverage managed and serverless services (DevOps!)
  - Scalable/elastic, available, reliable, secure, no/low admin
78. Where To Go Next?
• AWS Analytics Immersion Day - Build BI System from Scratch
  - Workshop - https://tinyurl.com/yapgwv77
  - Slides - https://tinyurl.com/ybxkb74b
• Writing SQL on Streaming Data with Amazon Kinesis Analytics - Part 1, 2
  - Part 1 - https://tinyurl.com/y8vo8q7o
  - Part 2 - https://tinyurl.com/ycbv7wel
• Streaming Analytics Workshop - Kinesis Data Analytics for Java (Flink)
  https://streaming-analytics.labgui.de/
• Amazon MSK Labs
  https://amazonmsk-labs.workshop.aws/
• Querying Amazon Kinesis Streams Directly with SQL and Spark Streaming
  https://tinyurl.com/y7hklyff
• AWS Glue Streaming ETL - Scala Script Example
  https://tinyurl.com/y79x6jda
79. Appendix
• Amazon Managed Streaming for Apache Kafka: Best Practices
  https://docs.aws.amazon.com/msk/latest/developerguide/bestpractices.html
• Optimizing Your Apache Kafka® Deployment
  https://www.confluent.io/blog/optimizing-apache-kafka-deployment/
• Monitoring Kafka performance metrics
  https://www.datadoghq.com/blog/monitoring-kafka-performance-metrics/