1. Anomaly Detection Algorithm and Streaming SQL in Amazon Kinesis Analytics
Denis Batalov, PhD
@dbatalov
Sr. Solutions Architect
ML and AI specialist
Amazon Web Services
Luxembourg
4. Today you will learn about
1. The Random Cut Forest anomaly detection algorithm
2. The Amazon EC2 Spot market for virtual machines
3. Streaming SQL for stream processing
4. Detecting Spot market price anomalies with Amazon Kinesis Analytics
5. Random Cut Tree
Repeat: cutting stops when every point is isolated
Cutting the long side
Lots of data
Bad cut
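The construction on this slide can be sketched in Python. This is a minimal illustration rather than the Kinesis Analytics implementation: it assumes all points are distinct, picks the cut dimension with probability proportional to the side length of the bounding box (so long sides are cut more often), and retries a cut that leaves one side empty. The names build_tree and leaves are hypothetical.

```python
import random

def build_tree(points, rng):
    """Recursively cut the bounding box until every point sits alone in a leaf.

    Assumes all points are distinct tuples of equal length.
    """
    if len(points) == 1:
        return {"leaf": points[0]}
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    sides = [hi - lo for lo, hi in zip(lows, highs)]
    # Longer sides of the bounding box are more likely to be cut.
    d = rng.choices(range(dims), weights=sides)[0]
    while True:
        cut = rng.uniform(lows[d], highs[d])
        left = [p for p in points if p[d] <= cut]
        right = [p for p in points if p[d] > cut]
        if left and right:  # retry a cut that separates nothing
            break
    return {"dim": d, "cut": cut,
            "left": build_tree(left, rng),
            "right": build_tree(right, rng)}

def leaves(tree, depth=0):
    """Yield (point, depth) for every leaf of the tree."""
    if "leaf" in tree:
        yield tree["leaf"], depth
    else:
        yield from leaves(tree["left"], depth + 1)
        yield from leaves(tree["right"], depth + 1)

# Usage: a small cluster plus one far-away point.
points = [(0.1, 0.2), (0.3, 0.1), (0.2, 0.4), (0.4, 0.3), (9.0, 9.0)]
for point, depth in leaves(build_tree(points, random.Random(42))):
    print(point, depth)
```

Points that are far from the rest tend to be isolated after only a few cuts, so they usually end up close to the root.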
6. Random Cut Forest
Each tree is built on a random sample
…
7. Random sampling from a stream
“Reservoir sampling” [Vitter]
How do we draw a random sample of 5 values from a stream?
7th value: keep with probability 5/7, discard with probability 2/7
6th value: keep with probability 5/6, discard with probability 1/6
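A minimal Python sketch of reservoir sampling, assuming the basic variant in which the n-th value is kept with probability k/n and, when kept, replaces a uniformly chosen slot of the reservoir. The function name reservoir_sample is hypothetical, and the sampling used inside Kinesis Analytics may differ.

```python
import random

def reservoir_sample(stream, k, rng):
    """Return a uniform random sample of k values from a stream of unknown length."""
    reservoir = []
    for n, value in enumerate(stream, start=1):
        if n <= k:
            # The first k values always fill the reservoir.
            reservoir.append(value)
        elif rng.random() < k / n:
            # Keep the n-th value with probability k/n, evicting a random slot.
            reservoir[rng.randrange(k)] = value
        # Otherwise the value is discarded, with probability 1 - k/n.
    return reservoir

# Usage: sample 5 values from a stream of 1000.
print(reservoir_sample(range(1000), 5, random.Random(0)))
```

The reservoir never holds more than k values, so memory use does not depend on the length of the stream.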
13. Anomaly score
A value is anomalous if inserting it into the tree substantially increases the size of the tree, that is, the sum of the lengths of all branches (or the description length of the data)
[Figure: a normal value]
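One simplified reading of this definition can be sketched in Python. When a value is inserted, every point in its sibling subtree moves one level deeper, so the size of that sibling subtree measures how much the insertion lengthens the other branches. The sketch averages this quantity over many random cut trees. It assumes distinct points, follows only the branch that contains the value, and is not the exact score that Kinesis Analytics computes. The names displacement and average_displacement are hypothetical.

```python
import random

def displacement(points, x, rng):
    """Sibling subtree size at the moment x is isolated in one random cut tree."""
    current = [p for p in points if p != x] + [x]
    dims = len(x)
    while len(current) > 1:
        lows = [min(p[d] for p in current) for d in range(dims)]
        highs = [max(p[d] for p in current) for d in range(dims)]
        sides = [hi - lo for lo, hi in zip(lows, highs)]
        d = rng.choices(range(dims), weights=sides)[0]
        cut = rng.uniform(lows[d], highs[d])
        left = [p for p in current if p[d] <= cut]
        right = [p for p in current if p[d] > cut]
        if not left or not right:
            continue  # the cut separated nothing; try again
        mine, other = (left, right) if x[d] <= cut else (right, left)
        if len(mine) == 1:
            return len(other)  # x is isolated; its sibling holds these points
        current = mine
    return 0

def average_displacement(points, x, trees=100, seed=0):
    """Average the displacement of x over several independent random trees."""
    rng = random.Random(seed)
    return sum(displacement(points, x, rng) for _ in range(trees)) / trees

# Usage: a tight cluster and one far-away value.
gen = random.Random(0)
cluster = [(gen.uniform(-1, 1), gen.uniform(-1, 1)) for _ in range(50)]
print(average_displacement(cluster, (10.0, 10.0)))  # far from the cluster
print(average_displacement(cluster, cluster[0]))    # inside the cluster
```

A value far from the cluster is usually cut off early, while almost all other points are still on the other side of the cut, so its average is much larger than that of a value inside the cluster.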
21. Dig deeper
“Robust Random Cut Forest Based Anomaly Detection on Streams”
[Guha, Mishra, Roy, Schrijvers]
http://docs.aws.amazon.com/kinesisanalytics/latest/dev/app-anomaly-detection.html
22. Compute Purchasing Models
On-Demand: Pay for compute capacity by the hour with no long-term commitments. For spiky workloads, or to define needs.
Reserved: Make a low, one-time payment and receive a significant discount on the hourly charge. For committed utilization.
Spot: Bid for unused capacity, charged at a Spot Price which fluctuates based on supply and demand. For time-insensitive or transient workloads.
Dedicated: Launch instances within Amazon VPC that run on hardware dedicated to a single customer. For highly sensitive or compliance-related workloads.
Free Tier: Get started on AWS with free usage and no commitment. For POCs and getting started.
23. Reserved Instances (RI)
For example:
Reserve capacity for one or three years
Pay a low, one-time fee for the capacity reservation
Receive a significant discount on the hourly charge for your instance
24. Reserved Instance Payment Options Explained
No Upfront option:
• Up to a 55% discount compared to On-Demand
• Does not require upfront payment
• Low hourly rate for the RI on an ongoing hourly basis
Partial Upfront option:
• Balances the payments of an RI between upfront and hourly
• Provides a higher discount (up to 76%) compared to the No Upfront option
• Pay a very low hourly rate for every hour in the term regardless of usage
All Upfront option:
• Highest discount compared to On-Demand (up to 77% off)
25. Reserved Instance vs. On-Demand
[Chart: m3.xlarge 1yr OD/RI Break Even Utilization. X-axis: Utilization Over a Year (30% to 100%). Y-axis: cost ($0 to $3,000). Series: On Demand, No Upfront, Partial Upfront, All Upfront]
What are the “break-even” points of each of these options in relation to purchasing instances On-Demand?
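The break-even question can be worked through with a small Python sketch. It assumes a one-year term of 8,760 hours, that the RI hourly rate is charged for every hour of the term regardless of usage, and that On-Demand is charged only for the hours used. All prices below are hypothetical placeholders, not real m3.xlarge prices, and break_even_utilization is a hypothetical helper name.

```python
HOURS_PER_YEAR = 8760

def break_even_utilization(on_demand_hourly, ri_upfront, ri_hourly):
    """Fraction of the year an instance must run for the RI to cost the same as On-Demand."""
    # The RI cost is fixed: upfront fee plus the hourly rate for every hour of the term.
    ri_annual = ri_upfront + ri_hourly * HOURS_PER_YEAR
    # On-Demand cost grows linearly with utilization.
    return ri_annual / (on_demand_hourly * HOURS_PER_YEAR)

# Hypothetical prices, for illustration only.
on_demand = 0.30  # dollars per hour
options = {
    "No Upfront": (0.0, 0.20),       # (upfront fee, hourly rate)
    "Partial Upfront": (800.0, 0.08),
    "All Upfront": (1500.0, 0.0),
}
for name, (upfront, hourly) in options.items():
    share = break_even_utilization(on_demand, upfront, hourly)
    print(f"{name}: break-even at {share:.0%} utilization")
```

Above the break-even utilization the reservation is cheaper than On-Demand; below it, On-Demand is cheaper.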
26. Spot instances
What are Spot instances?
• Spare EC2 instances that you bid on
• Billed in hourly increments, one hour at a time
• Behave exactly like regular instances
Cost Benefits
• Up to 92% off regular on-demand prices per hour
What is the trade-off?
• May be interrupted if EC2 needs the capacity back
• No charge for any partial hour due to termination
30. Amazon Kinesis Streams
Easy administration: Create a stream, set capacity level with shards. Scale to
match your data throughput rate & volume.
Build real-time applications: Process streaming data with Kinesis Client
Library (KCL), Apache Spark/Storm, AWS Lambda, ....
Low cost: Cost-efficient for workloads of any scale.
31. Amazon Kinesis Firehose
Zero administration: Capture and deliver streaming data to Amazon S3, Amazon
Redshift, or Amazon Elasticsearch Service without writing an app or managing
infrastructure.
Direct-to-data-store integration: Batch, compress, and encrypt streaming data for
delivery in as little as 60 seconds.
Seamless elasticity: Seamlessly scales to match data throughput without
intervention.
Capture and submit
streaming data to Firehose
Analyze streaming data using your
favorite BI tools
Firehose loads streaming data
continuously into S3, Amazon Redshift,
and Amazon ES
32. Amazon Kinesis Analytics
Apply SQL on streams: Easily connect to a Kinesis stream or Firehose
delivery stream and apply SQL skills.
Build real-time applications: Perform continual processing on streaming big
data with sub-second processing latencies.
Easy scalability: Elastically scales to match data throughput.
Connect to Kinesis streams,
Firehose delivery streams
Run standard SQL queries
against data streams
Kinesis Analytics can send processed data
to analytics tools so you can create alerts
and respond in real time
33. Amazon Kinesis: streaming data made easy
Services make it easy to capture, deliver, and process streams on
AWS
Kinesis Analytics
For all developers, data scientists
Easily analyze data streams
using standard SQL queries
Kinesis Firehose
For all developers, data scientists
Easily load massive
volumes of streaming data
into S3, Amazon Redshift,
or Amazon ES
Kinesis Streams
For Technical Developers
Collect and stream data
for ordered, replayable,
real-time processing
35. Kinesis Analytics
Pay for only what you use
Automatic elasticity
Standard SQL for analytics
Real-time processing
Easy to use
36. Use SQL to build real-time applications
Easily write SQL code to process
streaming data
Connect to streaming source
Continuously deliver SQL results
37. Connect to streaming source
• Streaming data sources include Firehose or
Streams
• Input formats include JSON, .csv, variable
column, unstructured text
• Each input has a schema; the schema is inferred,
but you can edit it
• Reference data sources (S3) for data
enrichment
38. Write SQL code
• Build streaming applications with one-to-many
SQL statements
• Robust SQL support and advanced analytic
functions
• Extensions to the SQL standard to work
seamlessly with streaming data
• Support for at-least-once processing
semantics
39. Continuously deliver SQL results
• Send processed data to multiple destinations
• S3, Amazon Redshift, Amazon ES (through
Firehose)
• Streams (with AWS Lambda integration for
custom destinations)
• End-to-end processing speed as low as sub-
second
• Separation of processing and data delivery
40. Generate time series analytics
• Compute key performance indicators over time windows
• Combine with historical data in S3 or Amazon Redshift
[Diagram labels: Analytics; Streams; Firehose; Amazon Redshift; S3; Streams; Firehose; custom real-time destinations]
41. Feed real-time dashboards
• Validate and transform raw data, and then process to calculate
meaningful statistics
• Send processed data downstream for visualization in BI and
visualization services
[Diagram labels: Amazon QuickSight; Analytics; Amazon ES; Amazon Redshift; Amazon RDS; Streams; Firehose]
42. Create real-time alarms and notifications
• Build sequences of events from the stream, like user sessions in a
clickstream or app behavior through logs
• Identify events (or a series of events) of interest, and react to the
data through alarms and notifications
[Diagram labels: Analytics; Streams; Firehose; Streams; Amazon SNS; Amazon CloudWatch; Lambda]
43. SQL on streaming data
• SQL is an API to your data
• Ask for what you want, system decides how to get it
• For all data, not just “flat” data in a database
• Opportunity for novel data organization and algorithms
• A standard (ANSI 2008, 2011) and the most commonly
used data manipulation language
44. A simple streaming query
• Tweets about the AWS NYC Summit
• Selecting from a STREAM of tweets, an in-application
stream
• Each row has a corresponding ROWTIME
SELECT STREAM ROWTIME, author, text
FROM Tweets
WHERE text LIKE '%#AWSNYCSummit%';
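To make the idea of a continuous query concrete, here is a rough Python analogy. It treats the stream as an iterator of (rowtime, author, text) tuples and emits matching rows as they arrive. The function name select_stream and the sample rows are hypothetical, and this is only an illustration of the semantics, not how Kinesis Analytics executes the query.

```python
def select_stream(tweets, hashtag="#AWSNYCSummit"):
    """Emit each row whose text contains the hashtag, as soon as the row arrives."""
    for rowtime, author, text in tweets:
        if hashtag in text:  # case-sensitive substring match
            yield rowtime, author, text

# Usage with made-up rows; a real stream would never end.
rows = [
    (1, "alice", "Heading to #AWSNYCSummit today"),
    (2, "bob", "Lunch time"),
    (3, "carol", "Great keynote at #AWSNYCSummit"),
]
for row in select_stream(rows):
    print(row)
```

Like the SQL query, the generator never waits for the end of the input: each row is tested and forwarded on its own.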
45. A streaming table is a STREAM
• In relational databases, you work with SQL tables
• With Analytics, you work with STREAMS
• SELECT, INSERT, and CREATE can be used with STREAMs
CREATE STREAM Tweets
(author VARCHAR(20),
text VARCHAR(140));
INSERT INTO Tweets
SELECT …
46. Writing queries on unbounded data sets
• Streams are unbounded data sets
• Need continuous queries, row-by-row or across rows
• WINDOWs define a start and end to the query
SELECT STREAM author,
count(author) OVER ONE_MINUTE
FROM Tweets
WINDOW ONE_MINUTE AS
(PARTITION BY author
RANGE INTERVAL '1' MINUTE PRECEDING);
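The windowed count above can be approximated in Python to show what the window does. The sketch assumes rows arrive in time order as (rowtime, author) pairs with rowtime in seconds, and that the window covers the current row plus that author's rows from the preceding 60 seconds, boundary included. The name windowed_counts is hypothetical.

```python
from collections import defaultdict, deque

def windowed_counts(rows, window_seconds=60):
    """Yield (rowtime, author, count) with count taken over the author's sliding window."""
    recent = defaultdict(deque)  # one queue of rowtimes per author (PARTITION BY author)
    for rowtime, author in rows:
        times = recent[author]
        times.append(rowtime)
        # Drop rows that fell out of the preceding one-minute range.
        while times[0] < rowtime - window_seconds:
            times.popleft()
        yield rowtime, author, len(times)

# Usage with made-up rows.
rows = [(0, "a"), (10, "b"), (30, "a"), (70, "a"), (95, "b")]
for result in windowed_counts(rows):
    print(result)
```

Each input row produces one output row, which is why the query can run continuously on an unbounded stream.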
47. Anomalies in Spot prices
CREATE OR REPLACE PUMP "WEIGHTED_FAMILY_STREAM_PUMP" AS
INSERT INTO "WEIGHTED_FAMILY_STREAM"
SELECT STREAM "ts", "availabilityzone", "instancetype",
"family", "size", "magnitude", "spotprice"/"magnitude" as
"weightedprice", "spotprice"
FROM
(SELECT STREAM "ts", "availabilityzone", "instancetype",
instance_family("instancetype") as "family",
instance_size("instancetype") as "size",
instance_magnitude("instancetype") as "magnitude",
"spotprice"
FROM "SOURCE_SQL_STREAM_001"
WHERE "productdescription" = 'Linux/UNIX')
WHERE "family" = 'C4';
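The slide does not show how instance_family, instance_size and instance_magnitude are defined. The Python sketch below is one guess at their behavior, assuming instance types of the form family.size such as c4.2xlarge, an upper-cased family name to match the 'C4' filter, and a magnitude that expresses the size in units of "large" (large = 1, xlarge = 2, Nxlarge = 2N). That magnitude scale is an assumption made for illustration only.

```python
def instance_family(instance_type):
    """'c4.2xlarge' -> 'C4' (assumed to be upper-cased, matching the filter on 'C4')."""
    return instance_type.split(".")[0].upper()

def instance_size(instance_type):
    """'c4.2xlarge' -> '2xlarge'."""
    return instance_type.split(".")[1]

def instance_magnitude(instance_type):
    """Size in units of 'large'; this scale is an assumption, not taken from the slides."""
    size = instance_size(instance_type)
    if size == "large":
        return 1
    if size == "xlarge":
        return 2
    if size.endswith("xlarge"):
        return 2 * int(size[: -len("xlarge")])
    raise ValueError(f"unsupported size: {size}")

# Usage: normalise a made-up Spot price to a per-'large' price.
spot_price = 0.40
instance_type = "c4.2xlarge"
weighted_price = spot_price / instance_magnitude(instance_type)
print(instance_family(instance_type), instance_size(instance_type), weighted_price)
```

Dividing the price by the magnitude puts instances of different sizes within one family on a comparable scale, which is what the weightedprice column in the query is used for.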
48. Anomalies in Spot prices
CREATE OR REPLACE PUMP "AZ_PRICE_STREAM_PUMP" AS
INSERT INTO "AZ_PRICE_STREAM"
SELECT STREAM "ts", "eu-west-1a-price", "eu-west-1b-price",
"eu-west-1c-price", "ANOMALY_SCORE" as "anomaly_score"
FROM TABLE(RANDOM_CUT_FOREST(CURSOR(SELECT STREAM
"ts",
avg(case when "availabilityzone" = 'eu-west-1a' then
"weightedprice" else null end) over w1 as "eu-west-1a-price",
avg(case when "availabilityzone" = 'eu-west-1b' then
"weightedprice" else null end) over w1 as "eu-west-1b-price",
avg(case when "availabilityzone" = 'eu-west-1c' then
"weightedprice" else null end) over w1 as "eu-west-1c-price"
FROM "WEIGHTED_FAMILY_STREAM"
WINDOW W1 AS (RANGE INTERVAL '10' MINUTE PRECEDING)),
100, 100, 10000, 10)); -- numberOfTrees, subSampleSize, timeDecay, shingleSize
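The inner query turns rows for individual availability zones into one row with a price column per zone. A rough Python equivalent of that pivot is sketched below. It assumes rows of (ts, zone, weighted_price) arriving in time order with ts in seconds, and uses None where a zone has no rows in the preceding 10 minutes, mirroring the avg over nulls. The name az_window_averages and the sample rows are hypothetical, and the anomaly scoring step itself is not reproduced here.

```python
from collections import deque

def az_window_averages(rows, zones, window_seconds=600):
    """Yield (ts, avg price per zone...) over a sliding 10-minute window."""
    window = deque()
    for ts, zone, price in rows:
        window.append((ts, zone, price))
        # Keep only rows from the preceding ten minutes.
        while window[0][0] < ts - window_seconds:
            window.popleft()
        averages = []
        for z in zones:
            prices = [p for _, wz, p in window if wz == z]
            averages.append(sum(prices) / len(prices) if prices else None)
        yield tuple([ts] + averages)

# Usage with made-up prices for two zones.
rows = [(0, "eu-west-1a", 1.0), (60, "eu-west-1b", 3.0),
        (120, "eu-west-1a", 2.0), (800, "eu-west-1b", 5.0)]
for result in az_window_averages(rows, ["eu-west-1a", "eu-west-1b"]):
    print(result)
```

Each output row is one multi-dimensional point, with one dimension per availability zone, which is the shape of input that the anomaly detection function scores.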