Stream processing and managing real-time data

© 2019, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Javier Ramirez
@supercoco9
AWS Tech Evangelist
A N T 2
Prakash Sethuraman
Chief Architect Digital Technologies, HSBC

An Overview of Data Streaming Technology
Source: Devices and/or applications that produce real-time data at high velocity.
Stream Ingestion: The process by which data is ingested into the stream.
Stream Storage: The way in which the data is stored within the stream.
Stream Processing: The way in which the stream is processed to provide analytical insight.
Destination: Databases where data is stored for near real-time or longer-term analysis.
Destination
Real-Time Applications (seconds): Analyze streaming data to generate real-time insights and notifications
Streaming ETL (minutes): Compress, encrypt and transform data in near real-time before it is delivered to its destination
Stream Storage
Stream Ingestion
[Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Mobile device
Metering
Click streams
IoT sensors
Logs

Apache Kafka
A distributed streaming platform

Apache Kafka Anatomy 101
Producer
Broker
Broker
Broker
Data Consumer
Cluster
Zookeeper
Producer

Apache Kafka Anatomy. Topics and partitions
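Topic with 3 partitions: Partition 1, Partition 2, Partition 3
Each partition is an ordered sequence of offsets (0, 1, 2, 3, ...) running from oldest data to newest data
Writes from Producers are appended at the newest end
Consumers in a Consumer Group read from the partitions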
= next consumer offset
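
The topic, partition and offset model above can be sketched as a toy in-memory structure. This is only an illustration in plain Python, not the Kafka client API: the `Topic` class, the `crc32` key hash (Kafka's default partitioner uses a different hash) and the group names are all assumptions made for the example.

```python
from collections import defaultdict
from zlib import crc32


class Topic:
    """Toy in-memory topic: each partition is an append-only list."""

    def __init__(self, num_partitions):
        self.partitions = [[] for _ in range(num_partitions)]
        # Next offset to read, tracked per (consumer group, partition).
        self.group_offsets = defaultdict(int)

    def produce(self, key, value):
        # Records with the same key always land in the same partition.
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append(value)
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def consume(self, group, partition, max_records=10):
        start = self.group_offsets[(group, partition)]
        records = self.partitions[partition][start:start + max_records]
        # Advance the group's next consumer offset past what was read.
        self.group_offsets[(group, partition)] = start + len(records)
        return records


topic = Topic(num_partitions=3)
p, first_offset = topic.produce("device-1", "reading-a")
_, second_offset = topic.produce("device-1", "reading-b")
first_read = topic.consume("analytics", p)       # both records
second_read = topic.consume("analytics", p)      # nothing new yet
other_group_read = topic.consume("archiver", p)  # independent offset
```

Each consumer group keeps its own offset, so a second group can read the same records without affecting the first.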

Challenges operating Apache Kafka
Difficult to set up
Hard to achieve high availability
Tricky to scale
AWS integrations = development
No console, no visible metrics

Amazon Managed Streaming for Apache Kafka (Amazon MSK)

Getting started with Amazon MSK is easy
• Fully compatible with Apache Kafka v1.1.1 and v2.1.0
• AWS Management Console and AWS API for provisioning
• Clusters are set up automatically
• Provision Apache Kafka brokers and storage
• Create and tear down clusters on-demand
• Deeply integrated with AWS services
• Amazon MSK is committed to improving open-source Apache Kafka

Amazon Elasticsearch Service is a fully
managed service that makes it easy to
deploy, manage, and scale
Elasticsearch and Kibana
AMAZON ELASTICSEARCH SERVICE
A fully managed, scalable, secure Elasticsearch service

Data is stored in indexes, distributed across shards
[Diagram: each document (an ID plus field: value pairs) is stored in an index; the index is split into 5 shards, each holding 1/5 of all docs]
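
One way to picture how documents end up spread across shards is hash-based routing on the document ID. The sketch below is a simplified illustration, not the service's implementation: Elasticsearch applies its own hash function to a routing value, and `crc32`, `shard_for` and the `doc-N` IDs are stand-ins chosen for the example.

```python
from zlib import crc32

NUM_PRIMARY_SHARDS = 5  # matches the five shards in the diagram


def shard_for(doc_id, num_shards=NUM_PRIMARY_SHARDS):
    """Pick a shard from a stable hash of the document ID."""
    return crc32(doc_id.encode("utf-8")) % num_shards


# Count how many of 10,000 hypothetical documents land on each shard.
docs_per_shard = [0] * NUM_PRIMARY_SHARDS
for i in range(10_000):
    docs_per_shard[shard_for(f"doc-{i}")] += 1
```

Because the shard is derived from the ID alone, the same document is always routed to the same shard.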

Use a replica for redundancy
[Diagram: document (an ID plus field: value pairs), Index, Shards]

Benefits of Amazon Elasticsearch Service
Supports Open Source APIs
and Tools
Drop-in replacement with no
need to learn new APIs or
skills
Easy to Use
Deploy a production-ready
Elasticsearch cluster in
minutes
Scalable
Resize your cluster with a
few clicks or a single API
call
Secure
Deploy into your VPC and
restrict access using security
groups and IAM policies
Highly Available
Replicate across Availability
Zones, with monitoring and
automated self-healing
Tightly Integrated with
Other AWS Services
Seamless data ingestion,
security, auditing and
orchestration

Amazon ES cluster
1
3
Instance 1
2
1 2
Instance 2
3
2
1
Instance 3
Availability Zone 1 Availability Zone 2
2
1
Instance 4
3
3

Software & Internet Education Technology BioTech and Pharma
Media and Entertainment Financial Services Social Media
Telecommunications Travel & Transportation Real Estate
Logistics & Operations Publishing Other

Amazon Go
video analytics
Amazon.com
online catalog
Amazon
CloudWatch
logs
Amazon
S3 events
AWS
metering

Amazon Kinesis Data Firehose
• Zero administration and seamless elasticity
• Direct-to-data store integration
• Serverless continuous data transformations
• Near real-time

Ingest Transform Deliver
Amazon S3
Amazon Redshift
Amazon Elasticsearch Service
AWS IoT
Amazon Kinesis Agent
Amazon Kinesis Streams
Amazon CloudWatch Logs
Amazon CloudWatch Events
Apache Kafka

Kinesis Firehose AWS Lambda

Filter, enrich and convert data while it is streaming
data
producer Kinesis Data
Firehose
Elasticsearch
Service
[Wed Oct 11 14:32:52 2017] [error] [client 127.0.0.1]
[Wed Oct 11 14:32:53 2017] [info] [client 127.0.0.1]
geo-IP
service
{
"date": "2017/10/11 14:32:52",
"status": "error",
"source": "127.0.0.1",
"city": "Boston",
"state": "MA"
}
{
"recordId": "1",
"result": "Ok",
"data": {
"date": "2017/10/11 14:32:52",
"status": "error",
"source": "127.0.0.1",
"city": "Boston",
"state": "MA"
}
},
{
"recordId": "2",
"result": "Dropped"
}
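
The filter-and-enrich flow on this slide could look roughly like the following transformation function. It is a minimal sketch: the `records` / `recordId` / `result` / `data` fields follow the request and response shape shown above, while `lookup_location` is a hypothetical stand-in for the geo-IP service and the log pattern only covers the two sample lines. Unlike the slide, the date is passed through unchanged rather than reformatted.

```python
import base64
import json
import re

# Matches lines such as: [Wed Oct 11 14:32:52 2017] [error] [client 127.0.0.1]
LOG_PATTERN = re.compile(
    r"\[(?P<date>[^\]]+)\] \[(?P<status>\w+)\] \[client (?P<source>[^\]]+)\]"
)


def lookup_location(ip):
    """Hypothetical stand-in for a call to a geo-IP service."""
    return {"city": "Boston", "state": "MA"}


def handler(event, context):
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8")
        match = LOG_PATTERN.match(line)
        if match is None or match.group("status") != "error":
            # Filter: mark the record as dropped so it is not delivered.
            output.append({"recordId": record["recordId"], "result": "Dropped"})
            continue
        doc = match.groupdict()
        doc.update(lookup_location(doc["source"]))  # enrich with location
        payload = (json.dumps(doc) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```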

Blueprint: Description
General Processing: For custom transformation logic
Apache Log to JSON: Parses and converts Apache log lines to JSON objects using predefined JSON field names
Apache Log to CSV: Parses and converts Apache log lines to CSV format
Syslog to JSON: Parses and converts Syslog lines to JSON objects using predefined JSON field names
Syslog to CSV: Parses and converts Syslog lines to CSV format

• Convert the format of your input data from JSON to the columnar data formats Apache Parquet or Apache ORC before storing the data in Amazon S3.
• Works in conjunction with the transform feature, which can convert other formats to JSON before the data conversion

S3 Destination
• Pause and retry for up to 24 hours (maximum data retention period)
Redshift Destination
• Configurable retry duration (0-2 hours)
• After retry, skip and load error manifest files to S3’s errors/ folder
Elasticsearch Destination
• Configurable retry duration (0-2 hours)
• After retry, skip and load failed records to S3’s elasticsearch_failed/ folder

Amazon Kinesis Data Streams
• Easy administration and low cost
• Real-time, elastic performance
• Secure, durable storage
• Available to multiple real-time analytics applications

Amazon Kinesis - Firehose vs. Streams
Amazon Kinesis Data Streams is for use cases that require custom processing per incoming record, with sub-second processing latency, and a choice of stream processing frameworks. It allows multiple consumers, different consumer patterns, and stream replay.
Amazon Kinesis Data Firehose is for use cases that require zero administration, the ability to use existing analytics tools based on Amazon S3, Amazon Redshift, and Amazon ES, and a data latency of 60 seconds or higher.

Data is stored in the order it was received for a set duration
of time, and can be replayed indefinitely during this time.

• AT_SEQUENCE_NUMBER - Start reading from the position denoted by a specific sequence number, provided in the value StartingSequenceNumber.
• AFTER_SEQUENCE_NUMBER - Start reading right after the position denoted by a specific sequence number, provided in the value StartingSequenceNumber.
• AT_TIMESTAMP - Start reading from the position denoted by a specific time stamp, provided in the value Timestamp.
• TRIM_HORIZON - Start reading at the last untrimmed record in the shard in the system, which is the oldest data record in the shard.
• LATEST - Start reading just after the most recent record in the shard, so that you always read the most recent data in the shard.
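
The iterator types above differ mainly in which extra parameter they need. The helper below is a small sketch that builds the request parameters for each case; `shard_iterator_request`, the stream name `clickstream` and the shard ID are hypothetical, and the resulting dictionary is intended to be passed to the GetShardIterator call (for example through boto3, which is not imported here).

```python
from datetime import datetime, timezone


def shard_iterator_request(stream, shard_id, iterator_type,
                           sequence_number=None, timestamp=None):
    """Build keyword arguments for a Kinesis GetShardIterator call."""
    request = {
        "StreamName": stream,
        "ShardId": shard_id,
        "ShardIteratorType": iterator_type,
    }
    if iterator_type in ("AT_SEQUENCE_NUMBER", "AFTER_SEQUENCE_NUMBER"):
        if sequence_number is None:
            raise ValueError(f"{iterator_type} requires a sequence number")
        request["StartingSequenceNumber"] = sequence_number
    elif iterator_type == "AT_TIMESTAMP":
        if timestamp is None:
            raise ValueError("AT_TIMESTAMP requires a timestamp")
        request["Timestamp"] = timestamp
    elif iterator_type not in ("TRIM_HORIZON", "LATEST"):
        raise ValueError(f"unknown iterator type: {iterator_type}")
    return request


# Replay everything still retained in the shard.
replay_all = shard_iterator_request(
    "clickstream", "shardId-000000000000", "TRIM_HORIZON")

# Replay from a point in time.
replay_from = shard_iterator_request(
    "clickstream", "shardId-000000000000", "AT_TIMESTAMP",
    timestamp=datetime(2019, 1, 1, tzinfo=timezone.utc))

# With boto3 installed and credentials configured, one possible use is:
#   boto3.client("kinesis").get_shard_iterator(**replay_all)
```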

Amazon S3
Amazon Redshift
Amazon Elasticsearch
Splunk
Real-Time Applications (seconds)
Streaming ETL (minutes)
Stream Ingestion
[Wed Oct 11 14:32:52 2018] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
Mobile device
Metering
Click streams
IoT sensors
Logs
AWS SDKs
Amazon Kinesis Agent
Amazon Kinesis Producer Library
Amazon Kinesis Consumer Library

Fully managed service for real-time processing of streaming data
Cost-effective: $0.014 per 1,000,000 PUT Payload Units
Millions of sources producing hundreds of terabytes per hour
Amazon Web Services
Front End
AZ AZ AZ
Authentication, authorization
Durable, highly consistent storage replicates data across three data centers (Availability Zones)
Ordered stream of events supports multiple readers
Amazon Kinesis
Client Library
on EC2
Amazon Kinesis
Data Firehose
Amazon Kinesis
Data Analytics
AWS Lambda

Time-based
seek

The KPL and record aggregation
Aggregation refers to the storage of multiple records in a Kinesis Data Streams record.
Aggregation allows customers to increase the number of records sent per API call, which
effectively increases producer throughput.
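
A rough way to see the effect of aggregation is to pack many small user records into fewer, larger stream records and compare the counts. The sketch below is an illustration only: it ignores the KPL's actual aggregated record format and its overhead, and it assumes a PUT Payload Unit covers a 25 KB chunk (taken here as 25 * 1024 bytes). The record sizes are made up.

```python
import math

PAYLOAD_UNIT_BYTES = 25 * 1024  # assumption: one PUT Payload Unit per 25 KB chunk


def payload_units(record_sizes):
    """Each stream record is counted in 25 KB chunks, rounded up."""
    return sum(math.ceil(size / PAYLOAD_UNIT_BYTES) for size in record_sizes)


def aggregate(record_sizes, max_bytes=PAYLOAD_UNIT_BYTES):
    """Greedily pack user records into larger stream records."""
    batches, current = [], 0
    for size in record_sizes:
        if current and current + size > max_bytes:
            batches.append(current)
            current = 0
        current += size
    if current:
        batches.append(current)
    return batches


user_records = [500] * 1000                   # 1,000 user records of 500 bytes
plain_units = payload_units(user_records)     # one unit per small record
aggregated = aggregate(user_records)          # sizes of the packed records
aggregated_units = payload_units(aggregated)
```

In this example the 1,000 small records pack into 20 stream records, so both the number of records sent and the payload units drop from 1,000 to 20.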

Processing a data stream with Apache Spark
https://spark.apache.org/docs/2.3.1/streaming-kinesis-integration.html

Processing a data stream with AWS Lambda
data
producer
Kinesis Data
Streams
Amazon
SNS
Continuously stream data
Lambda
service
Lambda function A
Lambda function B
Continuously polls for new data,
1 poll per second
Automatically invokes your
function(s) when data found
• Stateless
• Lambda polls each shard once per second
• Scales with your data

HSBC - The World’s
Leading International
Bank

Reasons to communicate
We’re here to make customers’ lives simple, so they can focus on what matters

Solution Overview
HSBC UK
Mainframes
Mapper
EMR
Spark
Kinesis Streams
Direct Connect
Customer Preferences
DynamoDB Lambda API Gateway
Data Service
Aurora EMR DynamoDB API Gateway Kinesis Streams
Event Engine
Kinesis Streams
Lambda
Push Notifications
Notification Service
API Gateway Kinesis Streams
Lambda
Message Service
API Gateway DynamoDB Kinesis Streams
Lambda
JSON
ASCII
Dead Letter Queues
SNS SQS VPC CloudWatch KMS
Common Services
EU-West-1
AVRO
EBCDIC
Kafka
AVRO
EBCDIC

Lambda and Kinesis Data Streams Lessons Learned
• Increasing the number of Kinesis Data Streams shards may not increase system performance; batch size matters. Perform load tests.
• Consider the impact of language and VPC usage on Lambda startup time vs. Lambda execution time
• Java-based functions start slower than Python/Node functions but execute faster
• 3 GB of memory isn’t always fastest for VPC-attached Lambdas. The optimal memory allocation for Java-based functions was 1 GB. Consider ENI reuse.
• Consider pre-warming VPC attached functions to achieve your latency SLA

Key Takeaways
• Follow the principle of “extract data once and reuse multiple times” to power new customer experiences
• Generating a repeatable correlation ID from source is critical in a distributed system
• Perform load tests to fine-tune your system and identify choke points
• Know the soft and hard limits of the AWS services you use
• Plan your network architecture to provide service isolation and to support production scale
• Consider how to unify your existing and cloud operation model – logging, monitoring and
alerting
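
One possible reading of "repeatable correlation ID" is an ID derived deterministically from fields set at the source, so that every service computes the same value for the same event. The sketch below uses name-based UUIDs for this; the namespace, the field names and the separator are assumptions made for illustration and are not taken from the HSBC solution.

```python
import uuid

# Hypothetical namespace; any fixed UUID shared by all services would do.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "events.example.com")


def correlation_id(source_system, account, sequence):
    """Derive the same ID every time from fields set at the source."""
    name = f"{source_system}|{account}|{sequence}"
    return str(uuid.uuid5(NAMESPACE, name))


first = correlation_id("mainframe", "12345678", 42)
again = correlation_id("mainframe", "12345678", 42)  # identical to `first`
other = correlation_id("mainframe", "12345678", 43)  # a different event
```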

Amazon Kinesis Data Analytics
• Interact with streaming data in real-time using SQL or integrated Java applications
• Build fully managed and elastic stream processing applications

Kinesis Data Analytics
Easily write SQL or Java code to process
streaming data
Connect to streaming source
Continuously deliver results

KDA for Java for sophisticated applications
Utilizes Apache Flink, a framework and distributed engine for stateful processing of data streams
Simple programming: Easy to use and flexible APIs make building apps fast
High performance: In-memory computing provides low latency & high throughput
Stateful processing: Durable application state saves
Strong data integrity: Exactly-once processing and consistent state

Kinesis Data Analytics – Java Applications
1. Build Java applications using open source (Apache Flink)
2. Upload your application code to Kinesis Data Analytics
3. Run your application in a fully managed and elastic service

Apache Flink supports over 25 operators
Example Operators: Typical usage
Map, FlatMap, Filter, Iterative: Basic transformations
Key By, Split, Shuffle, Custom Partition: Change logical or physical structure of the stream
Window, Reduce, Fold, Sum, Min, Max: Analytics and aggregations
Join, Union, coGroup: Combine multiple data streams
… and much, much more.

How do you build an application?
Streaming operators are applied to data streams in a pipeline
Source
Sink
DataStream
KeyedDataStream
DataStream
Sink
keyBy,
window
filter
apply
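
The idea of chaining operators between a source and a sink can be emulated in a few lines of plain Python. This is not the Apache Flink API and it processes a finite list rather than an unbounded stream; the sample events, the 60-second window and the function names are all assumptions chosen to mirror the filter, keyBy, window and apply steps in the diagram.

```python
from collections import defaultdict


def source():
    # (timestamp_seconds, key, value) tuples; hypothetical sample data.
    yield from [(1, "a", 5), (2, "b", -1), (30, "a", 7), (61, "a", 2), (65, "b", 4)]


def filter_positive(stream):
    """filter: keep only events with a positive value."""
    return (event for event in stream if event[2] > 0)


def key_by_and_window(stream, size=60):
    """keyBy + window: group values by key and tumbling window start."""
    groups = defaultdict(list)
    for ts, key, value in stream:
        groups[(key, ts // size * size)].append(value)
    return groups


def apply_sum(groups):
    """apply: reduce each (key, window) group to a single result."""
    return {group: sum(values) for group, values in groups.items()}


# The "sink" here is just a dictionary of results per (key, window start).
sink = apply_sum(key_by_and_window(filter_positive(source())))
```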

Extensible integrations with AWS services
• Easily add sources and sinks to an application
• Build custom connectors for other data sources and sinks
Example Sources: Apache Kafka, RabbitMQ
Example Destinations (Sinks): Apache Kafka, RabbitMQ, Elasticsearch, Apache Cassandra

Automatically backup your application
Create and restore your application to a previous
point-in-time (snapshots)
Running application state is automatically backed
up by default (checkpoints)

Application scaling – resources and parallelism
Resources
• Kinesis Processing Units (KPUs) are used to run code
• Each KPU is 1 vCPU and 4 GB memory
• 50 GB of running application storage per KPU
• Automatic or provisioned scaling
Parallelism
• Number of instances of a task
• Default versus operator parallelism
• Maximum defines the largest possible
parallelism for an application
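
The per-KPU figures above make it easy to estimate what an application gets at a given size. The sketch below simply multiplies them out; the `Resources` type and `resources_for` function are names invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    vcpus: int
    memory_gb: int
    storage_gb: int


def resources_for(kpus):
    """Each KPU provides 1 vCPU, 4 GB memory and 50 GB of application storage."""
    return Resources(vcpus=kpus * 1, memory_gb=kpus * 4, storage_gb=kpus * 50)


four_kpus = resources_for(4)  # 4 vCPUs, 16 GB memory, 200 GB storage
```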

KDA for SQL for simple and fast use cases
• Sub-second end-to-end processing latencies
• SQL steps can be chained together in serial or parallel steps
• Build applications with one or hundreds of queries
• Pre-built functions include everything from sum and count
distinct to machine learning algorithms
• Aggregations run continuously using window operators

Easily connect to Kinesis Data Streams and Kinesis Data Firehose delivery streams
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Firehose

AWS Lambda function
raw data
Amazon Kinesis Data Analytics application
transformed
data
SQL
code
source destination

Writing Streaming SQL
Streams (in memory tables)
CREATE STREAM calls_per_ip_stream(
eventTimeStamp TIMESTAMP,
computationType VARCHAR(256),
category VARCHAR(1024),
subCategory VARCHAR(1024),
unit VARCHAR(256),
unitValue BIGINT
);

Writing Streaming SQL
Pumps (continuous query)
CREATE OR REPLACE PUMP calls_per_ip_pump AS
INSERT INTO calls_per_ip_stream
SELECT STREAM "eventTimestamp",
COUNT(*),
"sourceIPAddress"
FROM source_sql_stream_001 ctrail
GROUP BY "sourceIPAddress",
STEP(ctrail.ROWTIME BY INTERVAL '1' MINUTE),
STEP(ctrail."eventTimestamp" BY INTERVAL '1'
MINUTE);

Aggregating Streaming Data?
• Aggregations (count, sum, min,…) take granular real time data and turn it
into insights
• Data is continuously processed so you need to tell the application when
you want results
• Tumbling windows, sliding windows, and custom windows
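
The difference between tumbling and sliding windows can be illustrated by counting the same events with two window settings. This is a simplified model in plain Python: windows start at time 0 and advance in fixed steps, which is an assumption for the example and may not match the exact window semantics of the service.

```python
def window_counts(timestamps, size, slide):
    """Count events per window [start, start + size); a new window starts every `slide` seconds."""
    if not timestamps:
        return {}
    counts = {}
    start = 0
    last = max(timestamps)
    while start <= last:
        counts[start] = sum(1 for t in timestamps if start <= t < start + size)
        start += slide
    return counts


events = [5, 20, 65, 70, 110]  # hypothetical event times in seconds

# Tumbling: windows do not overlap, so each event is counted once.
tumbling = window_counts(events, size=60, slide=60)

# Sliding: windows overlap, so an event can fall into several windows.
sliding = window_counts(events, size=60, slide=30)
```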

In-application stream
Amazon Kinesis Data Analytics application
SQL code joining
table and stream
streaming source destination
Amazon
S3
In-application table
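
Joining a stream against a reference table amounts to looking up each incoming record and attaching the matching reference fields. The sketch below shows the idea in plain Python rather than streaming SQL; the ticker data and field names are made up, and the table is a literal dictionary instead of a file loaded from Amazon S3.

```python
# Hypothetical reference data; in the diagram this would be loaded from Amazon S3.
reference_table = {"AAPL": "Apple Inc.", "AMZN": "Amazon.com, Inc."}


def enrich(stream, table):
    """Inner join: emit only records whose ticker exists in the table."""
    for record in stream:
        company = table.get(record["ticker"])
        if company is not None:
            yield {**record, "company": company}


incoming = [
    {"ticker": "AMZN", "price": 10},
    {"ticker": "XXXX", "price": 20},  # no match, so it is dropped
]
joined = list(enrich(incoming, reference_table))
```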

Example Usage Pattern 1: Web Analytics and
Leaderboards
Amazon
Kinesis Data
Analytics
AWS
Lambda
function
Amazon
Cognito
Lightweight JS
client code
Web Server on
Amazon EC2
OR
Amazon
DynamoDB
Table
Amazon
Kinesis Data
Streams
Ingest web app data
Compute top 10 users
Persist to feed live apps
https://aws.amazon.com/answers/web-applications/real-time-web-analytics-with-kinesis/

Example Usage Pattern 2: Monitoring IoT Devices
IoT sensors AWS IoT Amazon RDS
MySQL DB
instance
Amazon Kinesis
Data Streams
Amazon Kinesis
Data Analytics
AWS Lambda
function
Ingest sensor data
Compute avg temp every 10 sec
Persist time series data aggregations
Amazon
CloudWatch
https://aws.amazon.com/answers/iot/real-time-iot-device-monitoring-with-kinesis/

Example Usage Pattern 3: Analyzing AWS
CloudTrail Event Logs
AWS
CloudTrail
Amazon
CloudWatch
events trigger
Amazon Kinesis
Data Analytics
AWS
Lambda
function
Amazon S3
bucket for raw
data
Amazon
DynamoDB
Table(s)
Chart.JS
Dashboard
Ingest raw log data
Compute operational metrics
Deliver to real-time dashboards and archival
Amazon Kinesis
Data Firehose
https://aws.amazon.com/answers/account-management/real-time-insights-account-activity/

aws.amazon.com/kinesis
aws.amazon.com/kinesis/getting-started
aws.amazon.com/msk
aws.amazon.com/msk/getting-started
