1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
SRV304 Building High-Throughput
Serverless Data Processing Pipelines
Cecilia Deng, Software Developer on AWS Lambda
November 28, 2017
2.
Me
• Canadian
• UBC
• EA Canada
• AWS Lambda
3.
Goal
Data processing that is
• High-throughput (> 1 GB/s)
• Serverless (no servers to manage)
• Real-time (pipeline)
4.
What to Expect
• Why streams?
• What’s AWS Lambda?
• What’s Amazon Kinesis?
• What does serverless stream processing look like?
• How does Lambda process streams?
• Example use cases
5.
WHY STREAMS?
6.
Stream Processing
Goal
• High-throughput (> 1 GB/s)
• Serverless (managed compute)
• Real-time (pipeline)
Streams
• Data size constraint
• Data time constraint
• Have access to recent data
• Processing time constraint
Batch
• No size constraint
• No time constraint (not real-time)
• Have access to all data
• Long running processing (reports)
7.
Stream Processing
Because you have data that is:
• Generated continuously and simultaneously by thousands of data sources
• Typically small in size (KBs)
And needs to be processed either:
• Sequentially and incrementally
• Or over sliding windows
within some real-time constraint
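To make "over sliding windows" concrete, here is a minimal sketch in Python. It is not from the talk: the class name, the 10-second window, and the averaging logic are assumptions chosen only to show incremental processing over recent data.

```python
from collections import deque


class SlidingWindowAverage:
    """Average of the readings seen in the last `window_seconds` seconds."""

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.readings = deque()  # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def add(self, timestamp, value):
        # Incremental update: add the new reading, then evict expired ones.
        self.readings.append((timestamp, value))
        self.total += value
        cutoff = timestamp - self.window_seconds
        while self.readings[0][0] <= cutoff:
            _, old_value = self.readings.popleft()
            self.total -= old_value
        return self.total / len(self.readings)


window = SlidingWindowAverage(window_seconds=10)
window.add(0, 10.0)
window.add(5, 20.0)
latest_average = window.add(12, 30.0)  # the reading at t=0 has left the window
```

Each new reading updates the result without rescanning older data, which is what keeps per-record processing time bounded.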
8.
WHAT’S LAMBDA?
9.
Lambda: What Is It?
• It’s your function: your libraries, your code, your executable
• With a programming model: easy to start with blueprints and tutorials, monitoring, and logging
• That runs stateless: infrastructure abstracted, persist data using Amazon DynamoDB, Amazon S3, or ElastiCache
• And integrated security model: IAM resource policies and roles, VPC support
• And flexible resource model: choose your memory and we allocate proportional CPU, network bandwidth, disk I/O
10.
Lambda: How Do I Trigger It?
ASYNCHRONOUS PUSH MODEL
• Amazon S3
• Amazon SNS
• Event invocation
SYNCHRONOUS PUSH MODEL
• Amazon Alexa
• AWS IoT
• RequestResponse invocation
HOW IT WORKS
• Mapping owned by event source
• Event source triggers Lambda via Invoke APIs
• Resource-based policy permissions
11.
Lambda: How Do I Trigger It?
STREAM PULL MODEL
• Amazon DynamoDB
• Amazon Kinesis
• Polled batch, RequestResponse invocation
• Mapping owned by Lambda
• Lambda polls the streams
• Lambda function is invoked when new records are found on the stream
• Lambda execution role policy permissions
12.
Lambda
EVENT SOURCE: AWS CloudFormation, Amazon API Gateway, Amazon SNS, Amazon Kinesis
FUNCTION: Node.js, Python, Java, C#
ENDPOINT: Database, Cloud Service, Anything
13.
Kinesis
EVENT SOURCE: Amazon Kinesis, fed by IoT Data, Financial Data, Log Data
FUNCTION: Node.js, Python, Java, C#
ENDPOINT: Database, Cloud Service, Anything
14.
WHAT’S KINESIS?
15.
Kinesis: What Is It?
• It’s storage: for real-time data that’s only stored for a limited time
• Where new data is made available quickly: typically less than 1 second put-to-get delay
• That uses a checkpoint model: supports multiple concurrent in-order processing
• As a managed service: with APIs that let you easily create and configure the stream and put and retrieve data
16.
Kinesis: How do I Process It?
[Diagram: Source → PutRecords → Shards → GetRecords]
• Poll for work
• Checkpoint for progress
• Separate checkpoints for multiple consumers
• Use the KCL library
Scale Amazon Kinesis by splitting or merging shards
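The poll-and-checkpoint model above can be sketched with an in-memory stand-in for a single shard. This is an illustration only: the `Shard` and `Consumer` classes and the consumer names are hypothetical, and this is neither the Kinesis API nor the KCL.

```python
class Shard:
    """In-memory stand-in for one shard: an append-only, ordered list of records."""

    def __init__(self):
        self.records = []

    def put_record(self, data):
        self.records.append(data)
        return len(self.records) - 1  # list position acts as a sequence number

    def get_records(self, position, limit):
        return self.records[position:position + limit]


class Consumer:
    """Polls a shard and keeps its own checkpoint, independent of other consumers."""

    def __init__(self, name, shard):
        self.name = name
        self.shard = shard
        self.checkpoint = 0  # position of the next record to read
        self.processed = []

    def poll(self, limit):
        batch = self.shard.get_records(self.checkpoint, limit)
        for record in batch:
            self.processed.append(record)  # real processing would happen here
        # Checkpoint only after the whole batch has been processed.
        self.checkpoint += len(batch)
        return len(batch)


shard = Shard()
for i in range(5):
    shard.put_record(f"event-{i}")

archiver = Consumer("archiver", shard)
alerter = Consumer("alerter", shard)
archiver.poll(limit=3)  # archiver has read 3 records
alerter.poll(limit=5)   # alerter has read all 5, unaffected by the archiver
```

Because each consumer stores its own position, a slow consumer never holds back a fast one, and either can resume from its checkpoint after a restart.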
17.
Streams: How do I Process It?
[Diagram: DDB events → Shards → GetRecords]
• Poll for work
• Checkpoint for progress
• Separate checkpoints for multiple consumers
• Use the KCL library
Scale Amazon Kinesis by splitting or merging shards
18.
SEEMS HARD. CAN I NOT?
19.
Processing Streams: Kinesis Firehose
• Manages stream:
  • No shard configuration
  • No partition key or order
• Manages stream processing:
  • Polls for records
  • Dumps to one of
    • Amazon S3
    • Amazon Redshift
    • Amazon Elasticsearch Service
  • Compute power default 8 * (1 vCPU + 4 GB) KPU
  • Choose a Lambda transform function
    • JSON/CSV to whatever
    • Apache Log to JSON/CSV
    • Syslog to JSON/CSV
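A Lambda transform function for Firehose might look like the sketch below. The event and response shape (`records`, `recordId`, base64 `data`, and a `result` of `Ok` or `ProcessingFailed`) follows the Firehose data transformation contract. The three-column CSV schema (`device_id,metric,value`) is a hypothetical example, not something from the talk.

```python
import base64
import json


def lambda_handler(event, context):
    """Firehose transform: one CSV line in, one JSON object out."""
    output = []
    for record in event["records"]:
        line = base64.b64decode(record["data"]).decode("utf-8").strip()
        fields = line.split(",")
        try:
            # Hypothetical schema: device_id,metric,value
            device_id, metric, value = fields
            doc = {"device_id": device_id, "metric": metric, "value": float(value)}
        except ValueError:
            # Wrong column count or non-numeric value: hand the record back unchanged.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
            continue
        payload = (json.dumps(doc) + "\n").encode("utf-8")
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode(payload).decode("utf-8"),
        })
    return {"records": output}
```

Every incoming `recordId` is returned exactly once, so Firehose can match each result back to its input record.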
20.
Processing Streams: Kinesis Firehose
21.
Processing Streams: Kinesis Firehose
22.
Processing Streams: Kinesis Analytics
• Does not manage stream:
  • Need to configure Kinesis Stream
• Manages stream processing:
  • From Amazon Kinesis or Kinesis Firehose
  • Polls for records
  • Uses a SQL model to continuously:
    • Map record data to internal “stream tables” (aggregation)
    • Query the internal “stream tables” for desired results (filter)
    • Output the desired results to
      • Additional internal “stream tables” (further aggregation) or
      • External Kinesis Stream or Kinesis Firehose (destination store)
  • Compute power default 8 * (1 vCPU + 4 GB) KPU
23.
Processing Streams: Kinesis Analytics
24.
Processing Streams: Kinesis Analytics
25.
Processing Streams: Kinesis
26.
Processing Streams: Lambda
• Does not manage stream:
  • Need to configure Kinesis Stream
• Manages stream processing:
  • From Amazon Kinesis or DynamoDB streams
  • Polls for records
  • Sends for invocation to a Lambda function
  • Compute power default 1000 * (configured memory and associated sized CPU)
  • Set up with Lambda createEventSourceMapping
• Lambda:
  • Preserves order
  • Soft concurrent limit of 1000 invocations * (max 3 GB memory and associated sized CPU)
  • Completely customized model and functionality
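Setting up the mapping could look like the following sketch, which uses the boto3 `create_event_source_mapping` call. The function name `ProcessKinesisRecords` and the batch size of 100 are assumptions for illustration. The stream ARN is the placeholder ARN from the sample event in this deck.

```python
def build_mapping_params(stream_arn, function_name, batch_size=100):
    """Arguments for Lambda's CreateEventSourceMapping API."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,             # max records per invocation
        "StartingPosition": "TRIM_HORIZON",  # or "LATEST" to skip existing records
    }


def create_mapping(params):
    # Imported here so the builder above can be used without the AWS SDK installed.
    import boto3

    client = boto3.client("lambda")
    return client.create_event_source_mapping(**params)


params = build_mapping_params(
    stream_arn="arn:aws:kinesis:us-west-2:35667example:stream/examplestream",
    function_name="ProcessKinesisRecords",  # hypothetical function name
)
# create_mapping(params)  # needs AWS credentials and an execution role that can read the stream
```

Because the mapping is owned by Lambda, the permissions to read the stream belong on the function's execution role rather than on a resource policy.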
27.
Processing Streams: Lambda
28.
Processing Streams: Lambda
[Diagram: Source → Amazon Kinesis shards → Lambda → Destination 1, Destination 2]
• Scale Amazon Kinesis by splitting or merging shards
• Polls a batch
• Waits for response
• Lambda will scale automatically
29.
STREAM PROCESSING BY LAMBDA
30.
Processing Streams: Lambda
31.
Processing Streams: Lambda
[Diagram: Source → Shards, with positions labeled Trim horizon, Checkpoint, Checkpoint, Latest Checkpoint]
32.
Processing Streams: Lambda
Event received by Lambda function is a collection of records from the stream:
{
  "Records": [
    {
      "kinesis": {
        "partitionKey": "partitionKey-3",
        "kinesisSchemaVersion": "1.0",
        "data": "SGVsbG8sIHRoaXMgaXMgYSB0ZXN0IDEyMy4=",
        "sequenceNumber": "49545115243490985018280067714973144582180062593244200961"
      },
      "eventSource": "aws:kinesis",
      "eventID": "shardId-000000000000:49545115243490985018280067714973144582180062593244200961",
      "invokeIdentityArn": "arn:aws:iam::account-id:role/testLEBRole",
      "eventVersion": "1.0",
      "eventName": "aws:kinesis:record",
      "eventSourceARN": "arn:aws:kinesis:us-west-2:35667example:stream/examplestream",
      "awsRegion": "us-west-2"
    }
  ]
}
33.
Processing Streams: Lambda
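A function that consumes the Kinesis event shown earlier has to base64-decode each record's `data` field before using it. The handler below is a minimal sketch. The returned dictionary shape is an assumption for illustration, and real code would write to a destination instead of returning the messages.

```python
import base64


def lambda_handler(event, context):
    """Decode every Kinesis record in the batch, preserving order."""
    messages = []
    for record in event["Records"]:
        kinesis = record["kinesis"]
        payload = base64.b64decode(kinesis["data"])  # data arrives base64-encoded
        messages.append({
            "partition_key": kinesis["partitionKey"],
            "sequence_number": kinesis["sequenceNumber"],
            "text": payload.decode("utf-8"),
        })
    return messages


sample_event = {
    "Records": [{
        "kinesis": {
            "partitionKey": "partitionKey-3",
            "kinesisSchemaVersion": "1.0",
            "data": "SGVsbG8sIHRoaXMgaXMgYSB0ZXN0IDEyMy4=",
            "sequenceNumber": "49545115243490985018280067714973144582180062593244200961",
        },
        "eventSource": "aws:kinesis",
    }]
}
decoded = lambda_handler(sample_event, None)
```

For the sample payload, the decoded text is "Hello, this is a test 123."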
34.
Processing Streams: Lambda
Per shard:
▪ Lambda calls GetRecords with the max limit from Kinesis (10,000 records or 10 MB)
▪ If there are no records, wait some time (1 s)
▪ Sub-batch in-memory and format records into Lambda payload
▪ Invoke Lambda with synchronous invoke
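The sub-batching step can be sketched as follows. This illustrates the idea and is not Lambda's internal poller: `process_poll_result` and its `invoke` callback are hypothetical names, and records are plain Python objects rather than real Kinesis records.

```python
def sub_batches(records, batch_size):
    """Split one GetRecords result into ordered chunks of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process_poll_result(records, batch_size, invoke):
    """Invoke the function once per sub-batch, waiting for each call to return."""
    invocations = 0
    for batch in sub_batches(records, batch_size):
        invoke({"Records": batch})  # synchronous: the next batch waits for this one
        invocations += 1
    return invocations


batch_sizes_seen = []
count = process_poll_result(
    records=list(range(250)),
    batch_size=100,
    invoke=lambda event: batch_sizes_seen.append(len(event["Records"])),
)
```

With 250 polled records and a batch size of 100, the function is invoked three times in order, with 100, 100, and 50 records.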
[Diagram: Source → Amazon Kinesis shards → Lambda → Destination 1, Destination 2. Labels: Polls a batch; Waits for response; Scale Amazon Kinesis by splitting or merging shards; Lambda will scale automatically]
35.
Processing Streams: Lambda
▪ Lambda blocks on ordered processing for each individual shard
▪ Increasing # of shards with even distribution allows increased concurrency
▪ Batch size may impact duration if the Lambda function takes longer to process more records
[Diagram: Source → Amazon Kinesis shards → Lambda → Destination 1, Destination 2. Labels: Polls a batch; Waits for response; Scale Amazon Kinesis by splitting or merging shards; Lambda will scale automatically]
36.
Processing Streams: Lambda
Polls and blocks on synchronous invocation per shard
If put/ingestion rate is greater than the theoretical throughput, your processing is at risk of falling behind
Maximum theoretical throughput
# shards * 2 MB / Lambda function duration (s)
Effective theoretical throughput
# shards * batch size (MB) / Lambda function duration (s)
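The two formulas above are easy to check with a few lines of Python. The numbers used here (100 shards, 0.5 MB batches, 0.25 s function duration, a 1024 MB/s target) are made-up inputs, and `shards_needed` is simply the effective-throughput formula rearranged.

```python
import math


def max_throughput_mb_s(shards, duration_s, per_shard_mb=2.0):
    """Maximum theoretical throughput: # shards * 2 MB / function duration (s)."""
    return shards * per_shard_mb / duration_s


def effective_throughput_mb_s(shards, batch_size_mb, duration_s):
    """Effective theoretical throughput: # shards * batch size (MB) / function duration (s)."""
    return shards * batch_size_mb / duration_s


def shards_needed(target_mb_s, batch_size_mb, duration_s):
    """Effective-throughput formula solved for the number of shards."""
    return math.ceil(target_mb_s * duration_s / batch_size_mb)


ceiling = max_throughput_mb_s(shards=100, duration_s=0.25)
effective = effective_throughput_mb_s(shards=100, batch_size_mb=0.5, duration_s=0.25)
required = shards_needed(target_mb_s=1024, batch_size_mb=0.5, duration_s=0.25)
```

With these inputs the ceiling is 800 MB/s, the effective rate is 200 MB/s, and reaching 1024 MB/s would take 512 shards. If the ingestion rate exceeds the effective rate, the options are more shards, larger batches, or a faster function.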
37.
Processing Streams: Lambda
Retries
Will retry on execution failures until the record is expired
Throttles and errors impact duration and directly impact throughput
Best practice
Retry with exponential backoff
Effective theoretical throughput with retries
( # shards * batch size (MB) ) / ( function duration (s) * retries until expiry)
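The sketch below puts numbers on the retry formula and shows one way to retry a downstream call with exponential backoff inside the function. Two assumptions: `attempts` is read as the total number of tries per batch (1 means no retry), and `call_with_backoff` is a hypothetical helper, not part of any AWS SDK.

```python
import time


def throughput_with_retries_mb_s(shards, batch_size_mb, duration_s, attempts):
    """(# shards * batch size (MB)) / (function duration (s) * attempts)."""
    return (shards * batch_size_mb) / (duration_s * attempts)


def call_with_backoff(operation, max_attempts=5, base_delay_s=0.1, sleep=time.sleep):
    """Run operation(), doubling the wait after each failure; re-raise on the last try."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay_s * 2 ** attempt)


# A batch that needs 4 attempts cuts 200 MB/s of effective throughput to 50 MB/s.
degraded = throughput_with_retries_mb_s(shards=100, batch_size_mb=0.5, duration_s=0.25, attempts=4)

# A flaky destination that fails twice before succeeding; delays are recorded, not slept.
delays = []
calls = {"count": 0}


def flaky_write():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("destination throttled")
    return "written"


outcome = call_with_backoff(flaky_write, sleep=delays.append)
```

Backing off inside the function lets a throttled destination recover, but the waiting still counts toward function duration, so it lowers throughput for that shard.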
[Diagram: Source → Amazon Kinesis shards → Lambda → Destination 1, Destination 2. Labels: Polls a batch; Receives success; Receives error; Receives error]
38.
MORE EXAMPLES
39.
Real-time Ad Serving
40.
The Assembly Line
41.
Anomaly Detection
42.
Anomaly Detection
43.
Game Analytics
• Store real-time player scores and stats
• Send to Lambda for further aggregation like top scores or longest runs
• Surface leaderboards
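The aggregation step could be sketched like this for a single batch of Kinesis records. The record payload (JSON with `player` and `score` fields) and the `top_scores` helper are assumptions for illustration. A real leaderboard would also merge each batch's result into a store such as DynamoDB.

```python
import base64
import heapq
import json


def top_scores(event, n=3):
    """Return the n best (player, score) pairs found in one batch of records."""
    best = {}
    for record in event["Records"]:
        stat = json.loads(base64.b64decode(record["kinesis"]["data"]))
        player, score = stat["player"], stat["score"]  # hypothetical payload fields
        best[player] = max(score, best.get(player, score))
    return heapq.nlargest(n, best.items(), key=lambda item: item[1])


def make_record(player, score):
    """Build a minimal Kinesis-style record for local experiments."""
    data = json.dumps({"player": player, "score": score}).encode("utf-8")
    return {"kinesis": {"data": base64.b64encode(data).decode("utf-8")}}


batch = {"Records": [
    make_record("ana", 120),
    make_record("bo", 300),
    make_record("ana", 450),
    make_record("cy", 90),
]}
leaders = top_scores(batch, n=2)
```

Here `leaders` holds ana with 450 followed by bo with 300, since only each player's best score in the batch is kept.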
44.
Game Analytics
45.
QUESTIONS?
46.
Thank you!
@cicikendiggit (mostly me complaining to airlines)