SRV304_Building High-Throughput Serverless Data Processing Pipelines

© 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
AWS re:INVENT
SRV304 Building High-Throughput
Serverless Data Processing Pipelines
Cecilia Deng, Software Developer on AWS Lambda
November 28, 2017

Me
• Canadian
• UBC
• EA Canada
• AWS Lambda

Goal
Data processing that is
• High-throughput ( > 1 GB/s)
• Serverless (no servers to manage)
• Real-time (pipeline)

What to Expect
• Why streams?
• What’s AWS Lambda?
• What’s Amazon Kinesis?
• What does serverless stream processing look like?
• How does Lambda process streams?
• Example use cases

WHY STREAMS?

Stream Processing
Goal
• High-throughput ( > 1 GB/s)
• Serverless (managed compute)
• Real-time (pipeline)
Streams
• Data size constraint
• Data time constraint
• Have access to recent data
• Processing time constraint
Batch
• No size constraint
• No time constraint (not real-time)
• Have access to all data
• Long running processing (reports)

Stream Processing
Because you have data that is:
• Generated continuously and simultaneously by thousands of data sources
• Typically small in size (KBs)
And needs to be processed either:
• Sequentially and incrementally
• Or over sliding windows
within some real-time constraint
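
As a rough illustration of processing "over sliding windows", here is a minimal Python sketch that keeps only the most recent values and updates an average incrementally as each value arrives. The function name, window size, and sample readings are assumptions made for this example and are not from the talk.

```python
from collections import deque


def sliding_window_averages(values, window_size):
    """Yield the mean of the most recent `window_size` values after each new value."""
    window = deque(maxlen=window_size)
    running_total = 0.0
    for value in values:
        if len(window) == window_size:
            # The oldest value is about to be evicted by append().
            running_total -= window[0]
        window.append(value)
        running_total += value
        yield running_total / len(window)


readings = [10, 20, 30, 40, 50]  # hypothetical sensor readings
averages = list(sliding_window_averages(readings, window_size=3))
print(averages)  # [10.0, 15.0, 20.0, 30.0, 40.0]
```

Only the window is held in memory, which matches the "have access to recent data" constraint of streams.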

WHAT’S LAMBDA?

Lambda: What Is It?
• It’s your function: your libraries, your code, your executable
• With a programming model: easy to start, with blueprints and tutorials, monitoring, and logging
• That runs stateless: infrastructure abstracted; persist data using Amazon DynamoDB, Amazon S3, or ElastiCache
• And integrated security model: IAM resource policies and roles, VPC support
• And flexible resource model: choose your memory and we allocate proportional CPU, network bandwidth, disk I/O

Lambda: How Do I Trigger It?
ASYNCHRONOUS PUSH MODEL (Event invocation)
• Amazon S3
• Amazon SNS
SYNCHRONOUS PUSH MODEL (RequestResponse invocation)
• Amazon Alexa
• AWS IoT
HOW IT WORKS
• Mapping owned by event source
• Event source triggers Lambda via Invoke APIs
• Resource-based policy permissions

Lambda: How Do I Trigger It?
STREAM PULL MODEL
• Amazon DynamoDB
• Amazon Kinesis
HOW IT WORKS
• Mapping owned by Lambda
• Lambda polls the streams
• Lambda function invokes when new records are found on stream
• Polled batch is sent with a RequestResponse invocation
• Lambda execution role policy permissions

[Diagram] EVENT SOURCE (AWS CloudFormation, Amazon API Gateway, Amazon SNS, Amazon Kinesis) → FUNCTION on Lambda (Node.js, Python, Java, C#) → ENDPOINT (database, cloud service, anything)

[Diagram] IoT data, financial data, log data → EVENT SOURCE (Amazon Kinesis) → FUNCTION (Node.js, Python, Java, C#) → ENDPOINT (database, cloud service, anything)

WHAT’S KINESIS?

Kinesis: What Is It?
• It’s storage: for real-time data that’s only stored for a limited time
• Where new data is made available quickly: typically less than 1 second put-to-get delay
• That uses a checkpoint model: supports multiple concurrent consumers with in-order processing
• As a managed service: with APIs that let you easily create and configure the stream and put and retrieve data

Kinesis: How Do I Process It?
[Diagram] Source → PutRecords → Shards → GetRecords
• Poll for work
• Checkpoint for progress
• Separate checkpoints for multiple consumers
• Use the Kinesis Client Library (KCL)
Scale Amazon Kinesis by splitting or merging shards
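
The poll-and-checkpoint idea can be sketched in a few lines of Python. This is a toy in-memory model, not the KCL: the `ShardConsumer` class, its method names, and the list standing in for a shard are all hypothetical. It shows that each consumer keeps its own checkpoint, so two consumers can read the same shard at different speeds.

```python
class ShardConsumer:
    """Hypothetical consumer that tracks its own checkpoint into one shard."""

    def __init__(self, name):
        self.name = name
        self.checkpoint = 0  # index of the next unread record

    def poll(self, shard, limit):
        """Return up to `limit` unread records without advancing the checkpoint."""
        return shard[self.checkpoint:self.checkpoint + limit]

    def commit(self, count):
        """Record progress after `count` records were processed successfully."""
        self.checkpoint += count


shard = ["r0", "r1", "r2", "r3", "r4"]  # a list stands in for an ordered shard
fast = ShardConsumer("fast")
slow = ShardConsumer("slow")

batch = fast.poll(shard, limit=3)
fast.commit(len(batch))
batch = slow.poll(shard, limit=1)
slow.commit(len(batch))

print(fast.poll(shard, limit=10))  # ['r3', 'r4']
print(slow.poll(shard, limit=10))  # ['r1', 'r2', 'r3', 'r4']
```

Because `poll` does not move the checkpoint, a batch that fails before `commit` is simply read again on the next poll.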

Streams: How Do I Process It?
[Diagram] DynamoDB (DDB) events → Shards → GetRecords
• Poll for work
• Checkpoint for progress
• Separate checkpoints for multiple consumers
• Use the Kinesis Client Library (KCL)

SEEMS HARD. CAN I NOT?

Processing Streams: Kinesis Firehose
• Manages stream:
  • No shard configuration
  • No partition key or order
• Manages stream processing:
  • Polls for records
  • Dump to one of
    • Amazon S3
    • Amazon Redshift
    • Amazon Elasticsearch Service
  • Compute power default 8 * (1 vCPU + 4GB) KPU
  • Choose a Lambda transform function
    • JSON/CSV to whatever
    • Apache Log to JSON/CSV
    • Syslog to JSON/CSV
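
A Lambda transform function for Firehose might look roughly like the Python sketch below. It relies on the Firehose data-transformation contract (each input record carries a `recordId` and base64 `data`, and each output record returns the same `recordId` with a `result` and re-encoded `data`). The JSON-to-CSV mapping and the `sensor` and `value` field names are assumptions chosen for illustration.

```python
import base64
import json


def lambda_handler(event, context):
    """Convert each JSON record to one CSV line; flag unparseable records as failed."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # Hypothetical schema: every record has "sensor" and "value" fields.
            line = f'{payload["sensor"]},{payload["value"]}\n'
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Return the original bytes so the failed record is not lost.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Every input `recordId` appears exactly once in the response, which is what lets Firehose match transformed records back to the originals.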

Processing Streams: Kinesis Firehose

Processing Streams: Kinesis Analytics
• Does not manage stream:
  • Need to configure Kinesis Stream
  • From Amazon Kinesis or Kinesis Firehose
• Uses a SQL model to continuously:
  • Map record data to internal “stream tables” (aggregation)
  • Query the internal “stream tables” for desired results (filter)
  • Output the desired results to
    • Additional internal “stream tables” (further aggregation) or
    • External Kinesis Stream or Kinesis Firehose (destination store)
• Compute power default 8 * (1 vCPU + 4GB) KPU

Processing Streams: Kinesis Analytics

Processing Streams: Kinesis

Processing Streams: Lambda
• Does not manage stream:
  • Need to configure Kinesis Stream
• From Amazon Kinesis or DynamoDB streams
• Sends for invocation to a Lambda function
• Compute power default 1000 * (configured memory and associated sized CPU)
• Set up with Lambda createEventSourceMapping
• Lambda:
  • Preserves order
  • Soft concurrency limit of 1000 invocations * (max 3 GB memory and associated sized CPU)
  • Completely customized model and functionality
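
Setting up the mapping could be sketched in Python with boto3's `create_event_source_mapping`, which wraps the createEventSourceMapping API. The helper names, the function name `my-stream-processor`, and the batch size of 100 are assumptions for illustration; the stream ARN is the example ARN that appears in the sample event later in this deck. The actual API call is wrapped in a function that is not invoked here because it needs AWS credentials.

```python
def build_mapping_request(stream_arn, function_name, batch_size=100):
    """Build keyword arguments for Lambda's CreateEventSourceMapping API."""
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "BatchSize": batch_size,
        "StartingPosition": "TRIM_HORIZON",  # start from the oldest available record
        "Enabled": True,
    }


def create_mapping(request):
    """Send the request with boto3 (needs AWS credentials; not called in this sketch)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed
    return boto3.client("lambda").create_event_source_mapping(**request)


request = build_mapping_request(
    "arn:aws:kinesis:us-west-2:35667example:stream/examplestream",
    "my-stream-processor",  # hypothetical function name
)
print(request)
```

`StartingPosition` can also be `LATEST` to skip records already on the stream.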

[Diagram] Source → Amazon Kinesis (shards) → Lambda → Destination 1, Destination 2. Lambda polls a batch and waits for the response; Lambda will scale automatically.

STREAM PROCESSING BY LAMBDA

[Diagram] Source → Shards, with stream positions labeled: Trim horizon, Checkpoint, Checkpoint, Latest Checkpoint

Event received by Lambda function is a collection of records from the stream:
{
  "Records": [
    {
      "kinesis": {
        "partitionKey": "partitionKey-3",
        "kinesisSchemaVersion": "1.0",
        "data": "SGVsbG8sIHRoaXMgaXMgYSB0ZXN0IDEyMy4=",
        "sequenceNumber": "49545115243490985018280067714973144582180062593244200961"
      },
      "eventSource": "aws:kinesis",
      "eventID": "shardId-000000000000:49545115243490985018280067714973144582180062593244200961",
      "invokeIdentityArn": "arn:aws:iam::account-id:role/testLEBRole",
      "eventVersion": "1.0",
      "eventName": "aws:kinesis:record",
      "eventSourceARN": "arn:aws:kinesis:us-west-2:35667example:stream/examplestream",
      "awsRegion": "us-west-2"
    }
  ]
}

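A function receiving the event shown above has to base64-decode each record's `data` field before using it. The Python sketch below is a minimal handler, assuming the payloads are UTF-8 text; the handler name and the trimmed-down sample event are illustrative.

```python
import base64


def lambda_handler(event, context):
    """Decode every Kinesis record in the batch and return the payloads in order."""
    payloads = []
    for record in event["Records"]:
        data = base64.b64decode(record["kinesis"]["data"])
        payloads.append(data.decode("utf-8"))
    return payloads


# Trimmed copy of the sample event: only the fields this handler reads.
sample_event = {"Records": [{"kinesis": {
    "partitionKey": "partitionKey-3",
    "data": "SGVsbG8sIHRoaXMgaXMgYSB0ZXN0IDEyMy4=",
}}]}
print(lambda_handler(sample_event, None))  # ['Hello, this is a test 123.']
```
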
Per shard:
▪ Lambda calls GetRecords with the maximum limit from Kinesis (10,000 records or 10 MB)
▪ If there are no records, wait some time (1 s)
▪ Sub-batch in memory and format records into the Lambda payload
▪ Invoke Lambda with a synchronous invoke
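
The sub-batching and synchronous invoke steps above can be modeled with a small Python simulation. This is not Lambda's internal poller: `sub_batches`, `process_poll`, and the `invoke` callback are hypothetical names, and a plain list of integers stands in for the records returned by one GetRecords call.

```python
def sub_batches(records, batch_size):
    """Split one polled list of records into ordered sub-batches of at most batch_size."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def process_poll(records, batch_size, invoke):
    """Invoke the function one sub-batch at a time, waiting for each call to return."""
    invocations = 0
    for batch in sub_batches(records, batch_size):
        invoke({"Records": batch})  # blocks until the function returns
        invocations += 1
    return invocations


seen = []  # sizes of the batches the stand-in function received
count = process_poll(
    list(range(250)),
    batch_size=100,
    invoke=lambda event: seen.append(len(event["Records"])),
)
print(count, seen)  # 3 [100, 100, 50]
```

Because each invoke blocks, the next sub-batch for a shard is not sent until the previous one finishes, which is how order is preserved.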
[Diagram] Source → Amazon Kinesis (shards) → Lambda → Destination 1, Destination 2. Lambda polls a batch and waits for the response. Lambda will scale automatically; scale Amazon Kinesis by splitting or merging shards.

▪ Lambda blocks on ordered processing for each individual shard
▪ Increasing # of shards with even distribution allows increased concurrency
▪ Batch size may impact duration if the Lambda function takes longer to process more records
[Diagram] Source → Amazon Kinesis (shards) → Lambda → Destination 1, Destination 2. Lambda polls a batch and waits for the response. Lambda will scale automatically; scale Amazon Kinesis by splitting or merging shards.

Polls and blocks on synchronous invocation per shard
If put/ingestion rate is greater than the theoretical throughput, your processing is at risk of falling behind
Maximum theoretical throughput
(# shards * 2 MB) / Lambda function duration (s)
Effective theoretical throughput
(# shards * batch size (MB)) / Lambda function duration (s)
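
The two formulas above translate directly into Python. The function names and the sample numbers (100 shards, 1 MB batches, 0.5 s duration, 300 MB/s ingestion) are assumptions picked to make the arithmetic easy to follow.

```python
def max_theoretical_throughput(shards, duration_s):
    """MB/s using the slide's formula: # shards * 2 MB / function duration (s)."""
    return shards * 2 / duration_s


def effective_throughput(shards, batch_size_mb, duration_s):
    """MB/s using the slide's formula: # shards * batch size (MB) / function duration (s)."""
    return shards * batch_size_mb / duration_s


def is_falling_behind(ingest_mb_per_s, throughput_mb_per_s):
    """True when data arrives faster than it can be processed."""
    return ingest_mb_per_s > throughput_mb_per_s


print(max_theoretical_throughput(shards=100, duration_s=0.5))  # 400.0
effective = effective_throughput(shards=100, batch_size_mb=1, duration_s=0.5)
print(effective)  # 200.0
print(is_falling_behind(300, effective))  # True
```

With these numbers, a 300 MB/s ingestion rate fits under the maximum but exceeds the effective throughput, so the options are more shards, larger batches, or a faster function.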

Retries
Will retry on execution failures until the record is expired
Throttles and errors impact duration and directly impact throughput
Best practice
Retry with exponential backoff
Effective theoretical throughput with retries
( # shards * batch size (MB) ) / ( function duration (s) * retries until expiry)
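
One way to picture the retry advice and the formula above is the Python sketch below. `backoff_delays` produces a capped exponential delay schedule and `throughput_with_retries` applies the slide's formula; both names, the base and cap values, and the sample numbers are assumptions for illustration.

```python
def backoff_delays(base_s, attempts, cap_s):
    """Exponential backoff delays: base, 2*base, 4*base, ... each capped at cap_s."""
    return [min(cap_s, base_s * 2 ** attempt) for attempt in range(attempts)]


def throughput_with_retries(shards, batch_size_mb, duration_s, retries):
    """MB/s per the slide: (# shards * batch size) / (duration * retries until expiry)."""
    return (shards * batch_size_mb) / (duration_s * retries)


print(backoff_delays(base_s=1, attempts=5, cap_s=10))  # [1, 2, 4, 8, 10]
print(throughput_with_retries(shards=100, batch_size_mb=1, duration_s=0.5, retries=4))  # 50.0
```

In this example, a batch that takes four attempts cuts the effective throughput from 200 MB/s to 50 MB/s, which is why errors and throttles matter so much for stream processing.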
[Diagram] Source → Amazon Kinesis (shards) → Lambda → Destination 1, Destination 2. Lambda polls a batch; invocations are labeled "Receives success", "Receives error", "Receives error".

MORE EXAMPLES

Real-time Ad Serving

The Assembly Line

Anomaly Detection

Game Analytics
• Store real-time player scores and stats
• Send to Lambda for further aggregation like top scores or longest runs
• Surface leaderboards

Game Analytics

QUESTIONS?

Thank you!
@cicikendiggit (mostly me complaining to airlines)
