Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data so you can get timely insights and react quickly to new information. In this session, you’ll learn how AWS customers are transitioning from batch to stream processing with Kinesis, and how to get started. We will provide an overview of streaming applications and introduce the Kinesis capabilities. We will walk through a production use case to demonstrate how to ingest streaming data, prepare it, and analyze it to gain actionable insights in real time using Kinesis. We will also provide pointers to tutorials and other resources so you can quickly get started with your own streaming data application.
2. What to Expect from the Session
• Streaming data overview
• Amazon Kinesis overview
• Customer stories
• Streaming data architectures
• Example – Twitter feed analysis
• Getting started resources
4. Most data is produced continuously
• Mobile apps • Web clickstream • Application logs
• Metering records • IoT sensors • Smart buildings
Example application log record:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
8. Amazon Kinesis makes it easy to work with real-time streaming data
• Kinesis Streams: stores data as a continuous, replayable stream for custom applications
• Kinesis Firehose: loads streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
• Kinesis Analytics: analyzes data streams using standard SQL queries
9. Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
10. Amazon Kinesis Firehose
• Reliably ingest, transform, and deliver batched, compressed, and encrypted data to S3, Redshift, and Elasticsearch
• Point-and-click setup with zero administration and seamless elasticity
11. Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed, elastic stream processing applications that process data for real-time visualizations and alarms
20. Generate time series analytics
• Compute key performance indicators over time periods
• Combine with static or historical data in S3 or Amazon Redshift
Diagram: Kinesis Streams → Kinesis Analytics → Kinesis Streams/Firehose → Amazon Redshift, S3, or custom real-time destinations
21. Feed real-time dashboards
• Validate and transform raw data, then process it to calculate meaningful statistics
• Send processed data downstream for presentation in BI and visualization services
Diagram: Kinesis Streams → Kinesis Analytics → Kinesis Firehose → Amazon QuickSight, Amazon ES, Amazon Redshift, or Amazon RDS
22. Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications
Diagram: Kinesis Streams → Kinesis Analytics → Kinesis Streams/Firehose → Amazon SNS, Amazon CloudWatch, or Lambda
24. Example Scenario Requirements
Data to capture
• Filter for NHL-related tweets
• Total number of tweets per hour that contain hashtags for NHL teams
• Top 5 NHL-related hashtags per hour
Output requirements
• Filtered tweets are saved to Amazon S3
• Hourly aggregate counts are saved to Amazon Elasticsearch Service
• Full team names of the top 5 hashtags are saved to Amazon Elasticsearch Service
25. Why use Kinesis Analytics for this solution?
Challenges
• The Twitter stream can be noisy
• Tweet structure is complex, with several levels of nested JSON
• NHL-related tweet volume is cyclical
With Kinesis Analytics
• Easily filter out unwanted tweets
• Normalize the tweet schema for simple SQL queries
• Automatically scale to meet demand
27. Twitter Data Input
Diagram: EC2 instance → Amazon Kinesis stream
Source JSON data
• Hundreds of attributes per tweet
• ~5 KB per tweet
• Most attributes are unnecessary for the use case
Kinesis stream input
• 5 attributes per record
• ~250 B per record
• Only the relevant attributes
28. How are tweets mapped to a schema?
Diagram: Amazon Kinesis stream → Amazon Kinesis Analytics
{
"id": 795296435386388500,
"text": "#Canes game tonight! #NHL",
"created_at": "11-06-2016 16:07:00",
"tags": [{
"tag": "Canes"
}, {
"tag": "NHL"
}]
}
id text created_at tag
795… #Canes… 11-06-2016… Canes
795… #Canes… 11-06-2016… NHL
Source Data for Kinesis Analytics
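The mapping above can be mimicked with a few lines of Python: one tweet with a nested tags array becomes one row per hashtag. This is a sketch of the flattening Kinesis Analytics performs for you, not its implementation.

```python
import json

# Sketch of the schema mapping from slide 28: a tweet with a nested "tags"
# array is flattened into one (id, text, created_at, tag) row per hashtag.
def tweet_to_rows(raw):
    tweet = json.loads(raw)
    return [
        (tweet["id"], tweet["text"], tweet["created_at"], t["tag"])
        for t in tweet.get("tags", [])
    ]

raw = json.dumps({
    "id": 795296435386388500,
    "text": "#Canes game tonight! #NHL",
    "created_at": "11-06-2016 16:07:00",
    "tags": [{"tag": "Canes"}, {"tag": "NHL"}],
})
rows = tweet_to_rows(raw)  # two rows: one for "Canes", one for "NHL"
```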
29. How is streaming data accessed with SQL?
STREAM
• Analogous to a TABLE
• Represents continuous data flow
CREATE OR REPLACE STREAM "NHL_TWEET_STREAM" (
ID BIGINT, TWEET_TEXT VARCHAR(140), HASHTAG VARCHAR(140));
PUMP
• Continuous INSERT query
• Inserts data from one in-application stream to another
CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
INSERT INTO "NHL_TWEET_STREAM"
SELECT STREAM * FROM . . .
30. How do we model our data?
Diagram of in-application streams:
• SOURCE_STREAM (id, text, tag): fed from the Kinesis stream
• NHL_TWEET_STREAM (ID, TEXT, HASHTAG): filtered tweets
• TOTAL_TWEETS_STREAM (TWEET_COUNT): delivered via Kinesis Firehose
• MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT): delivered via Kinesis Firehose
• TeamName reference table (hashtag, team)
Filter query on SOURCE_STREAM:
SELECT STREAM *
FROM "SOURCE_STREAM"
WHERE "HASHTAG" NOT IN <exclude list>
31. How do we get team name from the hashtag?
• Create a CSV file in S3 that maps hashtags to team names
• Configure the Kinesis Analytics application to import the file as reference data
• Reference data appears as a table
• Join streaming data on reference data
hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks
s3://mybucket/team_map.csv
32. Use Reference Data in Query
SELECT STREAM tn."team"
FROM "NHL_TWEET_STREAM" tweets
INNER JOIN "TeamName" tn
  ON tweets."HASHTAG" = LOWER(tn."hashtag")
Diagram: NHL_TWEET_STREAM (ID, TEXT, HASHTAG) is joined with the TeamName reference table (hashtag, team) loaded from s3://mybucket/team_map.csv
Output:
New York Rangers
New York Rangers
Carolina Hurricanes
Edmonton Oilers
San Jose Sharks
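The same lookup can be sketched in memory in Python; the dictionary stands in for the TeamName reference table imported from S3, and the sketch lowercases both sides of the comparison for a case-insensitive match.

```python
# In-memory sketch of the reference-data join: the dictionary plays the
# role of the TeamName table imported from s3://mybucket/team_map.csv.
team_map = {
    "nyrangers": "New York Rangers",
    "nyr": "New York Rangers",
    "canes": "Carolina Hurricanes",
    "oilers": "Edmonton Oilers",
    "sharks": "San Jose Sharks",
}

def resolve_teams(hashtags):
    """Inner join: return the full team name for each hashtag in the map."""
    return [team_map[h.lower()] for h in hashtags if h.lower() in team_map]

teams = resolve_teams(["Canes", "NHL", "NYR"])  # unmapped "NHL" drops out
```

As with the SQL inner join, hashtags without a matching reference row simply produce no output.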
33. How do we aggregate streaming data?
• A common requirement in streaming analytics is to perform set-based operations (count, average, max, min, …) over events that arrive within a specified period of time
• We cannot simply aggregate over an entire table as we would in a typical static database
• How do we define a subset of a potentially infinite stream? Windowing functions!
34. Windowing Concepts
• Windows can be tumbling or sliding
• Windows are fixed length
• The output record has the timestamp of the end of the window
Diagram: a stream of values on a time axis (t1…t6) is split into fixed-length windows (Window1, Window2, Window3); an aggregate function (sum) over each window emits one output event per window (here, 18 and 14)
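The tumbling-sum behavior in the diagram can be reproduced with a few lines of Python; the timestamps and window size below are made up for illustration.

```python
from collections import defaultdict

# Tumbling-window sum: each (timestamp, value) event falls into exactly one
# fixed-length window, and the sum of each window is emitted once.
def tumbling_sums(events, window_size):
    sums = defaultdict(int)
    for ts, value in events:
        sums[ts // window_size] += value  # integer window index
    return dict(sums)

# Two full windows of size 10, matching the 18 and 14 in the diagram.
events = [(1, 1), (2, 5), (3, 4), (4, 2), (5, 6), (11, 8), (12, 6)]
print(tumbling_sums(events, 10))  # {0: 18, 1: 14}
```

Because every event lands in exactly one window, each input value contributes to exactly one output record.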
35. Comparing Types of Windows
• Tumbling window: aggregate per time interval; output is created at the end of the window and is a single event, based on the aggregate function used
• Sliding window: windows are constantly re-evaluated
36. How do we aggregate team mentions per hour?
• Use the TOP_K_ITEMS_TUMBLING function
• Pass a cursor to the team name stream
• Define a window size of 3600 seconds
INSERT INTO "MENTION_COUNT_STREAM"
SELECT STREAM *
FROM TABLE(TOP_K_ITEMS_TUMBLING(
CURSOR(SELECT STREAM tn."team"... ),
'team', -- name of column to aggregate
5, -- number of top items
3600 -- tumbling window size in seconds
));
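A rough Python analogue of the top-K aggregation over a single window: `Counter.most_common` does the counting and ranking that TOP_K_ITEMS_TUMBLING performs once per tumbling window. This is a sketch of the semantics, not the function's implementation.

```python
from collections import Counter

# Top-K over one window's worth of values: count each value's occurrences
# and keep the k most frequent, analogous to TOP_K_ITEMS_TUMBLING.
def top_k(window_values, k):
    return Counter(window_values).most_common(k)

window = ["Canes", "Canes", "NYR", "Oilers", "Canes", "NYR"]
print(top_k(window, 2))  # [('Canes', 3), ('NYR', 2)]
```

In the real application the window is one hour of team names, and k is 5.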
37. Output to Kinesis Firehose
Diagram: MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT) → Amazon Kinesis Firehose → Amazon Elasticsearch Service → Kibana
{
"aggregationtime": "2016-11-06T14:42:03.335",
"teamname": "New York Rangers",
"mentioncount": 604
}
5 records, every hour