Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data so you can get timely insights and react quickly to new information. In this session, you’ll learn how AWS customers are transitioning from batch to stream processing with Kinesis, and how to get started. We will provide an overview of streaming applications and introduce the Kinesis capabilities. We will walk through a production use case to demonstrate how to ingest streaming data, prepare it, and analyze it to gain actionable insights in real time using Kinesis. We will also provide pointers to tutorials and other resources so you can quickly get started with your own streaming data application.
2. What to Expect from the Session
• Streaming data overview
• Amazon Kinesis overview
• Customer stories
• Streaming data architectures
• Example – Twitter feed analysis
• Getting started resources
4. Most data is produced continuously
• Mobile apps • Web clickstream • Application logs
• Metering records • IoT sensors • Smart buildings
Example application log record:
[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
8. Amazon Kinesis makes it easy to work with real-time streaming data
• Kinesis Streams: stores data as a continuous, replayable stream for custom applications
• Kinesis Firehose: loads streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service
• Kinesis Analytics: analyzes data streams using standard SQL queries
9. Amazon Kinesis Streams
• Reliably ingest and durably store streaming data at low cost
• Build custom real-time applications to process streaming data
10. Amazon Kinesis Firehose
• Reliably ingest, transform, and deliver batched, compressed, and encrypted data to S3, Redshift, and Elasticsearch
• Point-and-click setup with zero administration and seamless elasticity
11. Amazon Kinesis Analytics
• Interact with streaming data in real time using SQL
• Build fully managed, elastic stream processing applications that process data for real-time visualizations and alarms
20. Generate time series analytics
• Compute key performance indicators over time periods
• Combine with static or historical data in S3 or Amazon Redshift
Diagram: Kinesis Streams → Kinesis Analytics → Kinesis Streams/Firehose → Amazon Redshift, S3, or custom real-time destinations
21. Feed real-time dashboards
• Validate and transform raw data, then process it to calculate meaningful statistics
• Send processed data downstream for presentation in BI and visualization services
Diagram: Kinesis Streams → Kinesis Analytics → Kinesis Firehose → Amazon QuickSight, Amazon ES, Amazon Redshift, or Amazon RDS
22. Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the data through alarms and notifications
Diagram: Kinesis Streams → Kinesis Analytics → Kinesis Streams/Firehose → Amazon SNS, Amazon CloudWatch, or Lambda
24. Example Scenario Requirements
Data to capture
• Filter for NHL-related tweets
• Total number of tweets per hour that contain hashtags for NHL teams
• Top 5 NHL-related hashtags per hour
Output requirements
• Filtered tweets are saved to Amazon S3
• Hourly aggregate counts are saved to Amazon Elasticsearch Service
• Full team names of the top 5 hashtags are saved to Amazon Elasticsearch Service
25. Why use Kinesis Analytics for this solution?
Challenges
• The Twitter stream can be noisy
• Tweet structure is complex, with several levels of nested JSON
• NHL-related tweet volume is cyclical
With Kinesis Analytics
• Easily filter out unwanted tweets
• Normalize the tweet schema for simple SQL queries
• Automatically scale to meet demand
27. Twitter Data Input
Diagram: EC2 instance → Amazon Kinesis stream
Source JSON data
• Hundreds of attributes per tweet
• ~5 KB per tweet
• Most attributes are unnecessary for the use case
Kinesis stream input
• 5 attributes per record
• ~250 B per record
• Only the relevant attributes
28. How are tweets mapped to a schema?
Diagram: Amazon Kinesis stream → Amazon Kinesis Analytics
{
"id": 795296435386388500,
"text": "#Canes game tonight! #NHL",
"created_at": "11-06-2016 16:07:00",
"tags": [{
"tag": "Canes"
}, {
"tag": "NHL"
}]
}
id text created_at tag
795… #Canes… 11-06-2016… Canes
795… #Canes… 11-06-2016… NHL
Source Data for Kinesis Analytics
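The mapping above can be mimicked with a few lines of Python: one tweet with a nested tags array becomes one row per hashtag. This is a sketch of the flattening Kinesis Analytics performs for you, not its implementation.

```python
import json

# Sketch of the schema mapping from slide 28: a tweet with a nested "tags"
# array is flattened into one (id, text, created_at, tag) row per hashtag.
def tweet_to_rows(raw):
    tweet = json.loads(raw)
    return [
        (tweet["id"], tweet["text"], tweet["created_at"], t["tag"])
        for t in tweet.get("tags", [])
    ]

raw = json.dumps({
    "id": 795296435386388500,
    "text": "#Canes game tonight! #NHL",
    "created_at": "11-06-2016 16:07:00",
    "tags": [{"tag": "Canes"}, {"tag": "NHL"}],
})
rows = tweet_to_rows(raw)  # two rows: one for "Canes", one for "NHL"
```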
29. How is streaming data accessed with SQL?
STREAM
• Analogous to a TABLE
• Represents continuous data flow
CREATE OR REPLACE STREAM "NHL_TWEET_STREAM" (
ID BIGINT, TWEET_TEXT VARCHAR(140), HASHTAG VARCHAR(140));
PUMP
• Continuous INSERT query
• Inserts data from one in-application stream to another
CREATE OR REPLACE PUMP "NHL_TWEET_PUMP" AS
INSERT INTO "NHL_TWEET_STREAM"
SELECT STREAM * FROM . . .
30. How do we model our data?
Diagram of in-application streams:
• SOURCE_STREAM (id, text, tag): fed from the Kinesis stream
• NHL_TWEET_STREAM (ID, TEXT, HASHTAG): filtered tweets
• TOTAL_TWEETS_STREAM (TWEET_COUNT): delivered via Kinesis Firehose
• MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT): delivered via Kinesis Firehose
• TeamName reference table (hashtag, team)
Filter query on SOURCE_STREAM:
SELECT STREAM *
FROM "SOURCE_STREAM"
WHERE "HASHTAG" NOT IN <exclude list>
31. How do we get team name from the hashtag?
• Create a CSV file in S3 that maps hashtags to team names
• Configure the Kinesis Analytics application to import the file as reference data
• Reference data appears as a table
• Join streaming data on reference data
hashtag,team
NYRangers,New York Rangers
NYR,New York Rangers
Canes,Carolina Hurricanes
Oilers,Edmonton Oilers
Sharks,San Jose Sharks
s3://mybucket/team_map.csv
32. Use Reference Data in Query
SELECT STREAM tn."team"
FROM "NHL_TWEET_STREAM" tweets
INNER JOIN "TeamName" tn
  ON tweets."HASHTAG" = LOWER(tn."hashtag")
Diagram: NHL_TWEET_STREAM (ID, TEXT, HASHTAG) is joined with the TeamName reference table (hashtag, team) loaded from s3://mybucket/team_map.csv
Output:
New York Rangers
New York Rangers
Carolina Hurricanes
Edmonton Oilers
San Jose Sharks
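The same lookup can be sketched in memory in Python; the dictionary stands in for the TeamName reference table imported from S3, and the sketch lowercases both sides of the comparison for a case-insensitive match.

```python
# In-memory sketch of the reference-data join: the dictionary plays the
# role of the TeamName table imported from s3://mybucket/team_map.csv.
team_map = {
    "nyrangers": "New York Rangers",
    "nyr": "New York Rangers",
    "canes": "Carolina Hurricanes",
    "oilers": "Edmonton Oilers",
    "sharks": "San Jose Sharks",
}

def resolve_teams(hashtags):
    """Inner join: return the full team name for each hashtag in the map."""
    return [team_map[h.lower()] for h in hashtags if h.lower() in team_map]

teams = resolve_teams(["Canes", "NHL", "NYR"])  # unmapped "NHL" drops out
```

As with the SQL inner join, hashtags without a matching reference row simply produce no output.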
33. How do we aggregate streaming data?
• A common requirement in streaming analytics is to perform set-based operations (count, average, max, min, …) over events that arrive within a specified period of time
• We cannot simply aggregate over an entire table as we would in a typical static database
• How do we define a subset of a potentially infinite stream? Windowing functions!
34. Windowing Concepts
• Windows can be tumbling or sliding
• Windows are fixed length
• The output record has the timestamp of the end of the window
Diagram: a stream of values on a time axis (t1…t6) is split into fixed-length windows (Window1, Window2, Window3); an aggregate function (sum) over each window emits one output event per window (here, 18 and 14)
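The tumbling-sum behavior in the diagram can be reproduced with a few lines of Python; the timestamps and window size below are made up for illustration.

```python
from collections import defaultdict

# Tumbling-window sum: each (timestamp, value) event falls into exactly one
# fixed-length window, and the sum of each window is emitted once.
def tumbling_sums(events, window_size):
    sums = defaultdict(int)
    for ts, value in events:
        sums[ts // window_size] += value  # integer window index
    return dict(sums)

# Two full windows of size 10, matching the 18 and 14 in the diagram.
events = [(1, 1), (2, 5), (3, 4), (4, 2), (5, 6), (11, 8), (12, 6)]
print(tumbling_sums(events, 10))  # {0: 18, 1: 14}
```

Because every event lands in exactly one window, each input value contributes to exactly one output record.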
35. Comparing Types of Windows
• Tumbling window: aggregate per time interval; output is created at the end of the window and is a single event, based on the aggregate function used
• Sliding window: windows are constantly re-evaluated
36. How do we aggregate team mentions per hour?
• Use the TOP_K_ITEMS_TUMBLING function
• Pass a cursor to the team name stream
• Define a window size of 3600 seconds
INSERT INTO "MENTION_COUNT_STREAM"
SELECT STREAM *
FROM TABLE(TOP_K_ITEMS_TUMBLING(
CURSOR(SELECT STREAM tn."team"... ),
'team', -- name of column to aggregate
5, -- number of top items
3600 -- tumbling window size in seconds
));
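A rough Python analogue of the top-K aggregation over a single window: `Counter.most_common` does the counting and ranking that TOP_K_ITEMS_TUMBLING performs once per tumbling window. This is a sketch of the semantics, not the function's implementation.

```python
from collections import Counter

# Top-K over one window's worth of values: count each value's occurrences
# and keep the k most frequent, analogous to TOP_K_ITEMS_TUMBLING.
def top_k(window_values, k):
    return Counter(window_values).most_common(k)

window = ["Canes", "Canes", "NYR", "Oilers", "Canes", "NYR"]
print(top_k(window, 2))  # [('Canes', 3), ('NYR', 2)]
```

In the real application the window is one hour of team names, and k is 5.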
37. Output to Kinesis Firehose
Diagram: MENTION_COUNT_STREAM (TEAMNAME, MENTIONCOUNT) → Amazon Kinesis Firehose → Amazon Elasticsearch Service → Kibana
{
"aggregationtime": "2016-11-06T14:42:03.335",
"teamname": "New York Rangers",
"mentioncount": 604
}
5 records, every hour