Store stream data on Data Lake
Principal Data Architect at Home24
Data Services: Search, Recommendations, Ranking
Worked on: Here Maps, Sapo.pt, DataJet, Xing, …
Scala, Perl, Prolog, Java, SQL, R, …
AWS: Step Functions, Lambda Functions, EMR, EC2, Batch, SQS, SNS, Firehose, Athena, API Gateway, ...
home24.tech.blog
● 15 people of 12 nationalities
● Serverless lovers
● AWS technologies: Step Functions, CloudFormation, Lambda Functions, Athena, EMR, Redshift, S3, ...

For data ingestion we have:
                              Production                Development
Number of Lambdas             625                       2311
Number of Step Functions      113                       490
Consumed time (a month)       3,383,525 sec (39 days)   5,371,037 sec (62 days)
Number of requests (a month)  2,014,203                 3,300,118
● The majority of our streams have a low message rate
● The big stream does not have an easily predictable message rate and can
peak at 100 messages/sec
● We will have many more low-rate streams
Main requirements
● Store new stream data in a Raw S3 Bucket
● Refine the Raw S3 Bucket data into a Refined S3 Bucket
● Incorrectly formatted messages shall not stop the flow
● A notification shall be sent on bad data
● Data must be refined in less than 10 minutes
Other
● Able to replay many days of data fast
● For development, every developer shall be able to deploy their version
independently
Requirements
● Collect data from SNS
● The data must be stored in S3 exactly as received
● File sizes must be easy to process on
Lambda (< 10 MB)
● At least 1 file per minute must be created
Architecture
● An SQS queue collects all data from the SNS topic
● A Lambda copies the data from SQS to a
Firehose delivery stream
● The Lambda Function is invoked once a
minute via a CloudWatch Event
● Firehose merges the data and creates files
on the Raw S3 Bucket
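As a rough sketch of the glue described above (the queue URL and delivery stream name are placeholders, not the real resources), the once-a-minute Lambda could look like this:

```python
# Hypothetical resource names -- the real queue URL and delivery stream
# are not given in the slides.
QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/raw-queue"
STREAM_NAME = "raw-delivery-stream"

def to_record_batches(bodies, batch_size=500):
    """Shape message bodies into Firehose put_record_batch payloads.
    Firehose accepts at most 500 records per call, so we chunk."""
    records = [{"Data": (body + "\n").encode("utf-8")} for body in bodies]
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]

def handler(event, context):
    """Invoked once a minute by a CloudWatch Event rule: drain SQS, copy
    to Firehose, then delete only what was copied successfully."""
    import boto3  # imported here so to_record_batches stays testable offline
    sqs = boto3.client("sqs")
    firehose = boto3.client("firehose")
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            return
        for batch in to_record_batches([m["Body"] for m in messages]):
            firehose.put_record_batch(DeliveryStreamName=STREAM_NAME,
                                      Records=batch)
        # Deleting after the copy is what pushes failed batches to the DLQ.
        sqs.delete_message_batch(
            QueueUrl=QUEUE_URL,
            Entries=[{"Id": m["MessageId"], "ReceiptHandle": m["ReceiptHandle"]}
                     for m in messages])
```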
Requirement
● When a message is not
processable, send a notification.
Architecture
● The data is deleted from the SQS
queue only after a successful copy to
Firehose
● In case of error, the messages end
up in the Dead-Letter Queue
● A non-empty Dead-Letter Queue means
there is an error in the data
● After fixing the Lambda Function, one
can always copy the messages back
to the raw SQS queue
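The "copy back" step can be a tiny script. A minimal sketch, with placeholder queue names; the `client` parameter exists only so the loop can be exercised without AWS:

```python
def redrive(dlq_url, raw_url, client=None):
    """Move everything from the Dead-Letter Queue back to the raw queue
    once the Lambda bug is fixed. Returns the number of messages moved."""
    if client is None:
        import boto3
        client = boto3.client("sqs")
    moved = 0
    while True:
        resp = client.receive_message(QueueUrl=dlq_url, MaxNumberOfMessages=10)
        messages = resp.get("Messages", [])
        if not messages:
            return moved
        for message in messages:
            # Send first, delete after: a crash mid-loop re-delivers a
            # message instead of losing it.
            client.send_message(QueueUrl=raw_url, MessageBody=message["Body"])
            client.delete_message(QueueUrl=dlq_url,
                                  ReceiptHandle=message["ReceiptHandle"])
        moved += len(messages)
```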
Requirements
● Decompress data (zip, deflate, gz,
base64, ...)
● Normalize fields (dates, for example)
● Add metadata
● Convert everything to JSON
● Store the result on S3
Architecture
● When a new file is created on the Raw S3
Bucket, a message is sent to SQS via
SNS
● The Lambda Function is invoked once
a minute via a CloudWatch Event and
processes all unprocessed files
● A file with the same key as the raw file is
created on the Refined S3 Bucket
● Messages that fail to process end up
in the Dead-Letter Queue
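The refiner's core (decompress, normalize, add metadata, emit JSON) can be sketched with the standard library alone. The `date` field name and its epoch-seconds format are illustrative assumptions, not the real schema:

```python
import base64
import gzip
import json
import zlib
from datetime import datetime, timezone

def decompress(payload):
    """Best-effort decoding of the formats we receive
    (base64-wrapped or not; gzip, deflate, or plain)."""
    if isinstance(payload, str):
        try:
            payload = base64.b64decode(payload, validate=True)
        except Exception:
            payload = payload.encode("utf-8")
    if payload[:2] == b"\x1f\x8b":       # gzip magic number
        return gzip.decompress(payload)
    try:
        return zlib.decompress(payload)  # zlib/deflate
    except zlib.error:
        return payload                   # already plain

def refine(raw_line, source):
    """Normalize one record: ISO-8601 date, metadata, JSON output.
    The 'date' field (epoch seconds) is a hypothetical example."""
    record = json.loads(decompress(raw_line))
    ts = record.get("date")
    if ts is not None:
        record["date"] = datetime.fromtimestamp(ts, tz=timezone.utc).isoformat()
    record["_meta"] = {"source": source}
    return json.dumps(record, sort_keys=True)
```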
Requirements
● Replay multiple days of data
Architecture
● A Lambda Function lists the files on the
Raw S3 Bucket and sends
messages to SQS
● Since the files in Raw and Refined
have the same key, the replayed files
always overwrite the existing ones
● The execution time of the Refiner
Lambda rises and the Refiner
Lambdas work in parallel

Parallelism in practice:
● our Lambda goes up to ~190 sec, with 3 Lambdas
running in parallel
● 9,198 S3 objects
● 30 GB of gzip data, at 10 GB/hour
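A sketch of the replay Lambda; the `raw/YYYY/MM/DD/` key layout and the idea of sending just the key as the message body are assumptions for illustration:

```python
from datetime import date, timedelta

def day_prefixes(start, end, template="raw/{d:%Y/%m/%d}/"):
    """S3 key prefixes for every day in [start, end], inclusive.
    The date-partitioned layout is illustrative, not the real one."""
    days = (end - start).days
    return [template.format(d=start + timedelta(days=i)) for i in range(days + 1)]

def replay(start, end, bucket, queue_url):
    """List raw files for a date range and re-enqueue their keys,
    so the Refiner reprocesses (and overwrites) them."""
    import boto3  # imported here so day_prefixes stays testable offline
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    paginator = s3.get_paginator("list_objects_v2")
    for prefix in day_prefixes(start, end):
        for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
            for obj in page.get("Contents", []):
                sqs.send_message(QueueUrl=queue_url, MessageBody=obj["Key"])
```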
Requirement
● Developers shall be able to
deploy their own Stream
Processors
● No interaction with an external
team shall be required
Architecture
● We created an internal SNS topic
where we clone the external
messages
● SNS can write to multiple
SQS queues
● With the same CloudFormation magic,
every developer can
deploy their own environment
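The per-developer environment can be expressed as a small CloudFormation fragment. Everything here is illustrative (the `Developer` parameter, the exported topic name), and a real template would also need an SQS queue policy allowing the SNS topic to send:

```yaml
# Illustrative sketch -- names and the imported topic ARN are assumptions.
Parameters:
  Developer:
    Type: String
Resources:
  DevRawQueue:
    Type: AWS::SQS::Queue
    Properties:
      QueueName: !Sub "raw-${Developer}"
  DevSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !ImportValue InternalStreamTopic
      Protocol: sqs
      Endpoint: !GetAtt DevRawQueue.Arn
```

Deploying the same template with a different `Developer` value gives each person an isolated queue fed by the same internal SNS topic.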
              EC2                                       Lambda
CPU / Price   1 t2.nano (5% vCPU and 500 MB):           3 seconds a minute at the highest memory
              0.0063*24*30 = 4.536$/month               (2 vCPU and 1536 MB):
                                                        3*60*24*30*10*(0.000002501+0.0000002)
                                                        = 3.5$/month
DevOps        Higher                                    Low
Scale         Scales while it has credits, up to        Out of the box up to a certain level:
              1 vCPU; more vCPUs require more           2 vCPU * 5 Lambdas = 10 vCPUs
              expensive instance types or autoscaling
Price-wise, Lambda looks like a good solution. For our problem, 10 vCPUs is
clearly more than enough.
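The two estimates above can be reproduced directly. Note the slide folds the $0.0000002 per-request price into the per-100ms sum, which slightly overcounts requests but is fine for an estimate:

```python
HOURS_PER_MONTH = 24 * 30

# t2.nano on-demand, $0.0063/hour, as used on the slide.
ec2 = 0.0063 * HOURS_PER_MONTH

# Lambda at 1536 MB: 3 s of execution per minute, billed in 100 ms units
# at $0.000002501 each, plus the request price folded in as on the slide.
units = 3 * 60 * HOURS_PER_MONTH * 10          # 100 ms units per month
lam = units * (0.000002501 + 0.0000002)

print(round(ec2, 3), round(lam, 2))            # 4.536 3.5
```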
We analyzed our 2 types of data streams:
● Slow stream: 1 message/sec (2.6 million requests/month)
● Fast stream: 25 messages/sec (64.8 million requests/month),
with spikes of 100 messages/sec

On SQS you pay for PUTs and GETs; on Kinesis you pay only for PUTs.

              Kinesis                                    SQS
Slow stream   2 shards: 24.5$/month; PUTs: 0.042$/month  Requests: 2.07$/month
Fast stream   3 shards: 36.7$/month; PUTs: 1.1$/month    Requests: 51.8$/month
Errors        Have to be controlled externally           Go to the Dead-Letter Queue
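The SQS request column is simple arithmetic, assuming the then-current price of $0.40 per million requests and counting one PUT plus one GET per message:

```python
# Assumed 2017-era SQS price; every message is sent once and received once.
SQS_PRICE_PER_MILLION = 0.40

def sqs_monthly_cost(million_messages):
    """Monthly SQS request cost for a given message volume."""
    return million_messages * 2 * SQS_PRICE_PER_MILLION

slow = sqs_monthly_cost(2.6)     # slow stream: 2.6 M messages/month
fast = sqs_monthly_cost(64.8)    # fast stream: 64.8 M messages/month
print(round(slow, 2), round(fast, 2))   # 2.08 51.84 -- close to the table
```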
● You pay only for what you use
● Scalability is not an issue at our message volume (at most 100
messages/sec)
○ SQS and Firehose can easily handle that volume of messages
○ Multiple Lambdas can work in parallel in case of high traffic or a
replay
● Separate Lambdas per stream make the logs easier to understand
● Separate environments simplify the developers' work
● The data is on S3, where it can be queried via Athena, EMR, Redshift
Spectrum, ...
Questions & Answers
Weitere ähnliche Inhalte

Was ist angesagt?

Samza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelSamza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next Level
Martin Kleppmann
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Alexey Kharlamov
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
Demi Ben-Ari
 

Was ist angesagt? (20)

ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015 ApacheCon BigData Europe 2015
ApacheCon BigData Europe 2015
 
Benchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per nodeBenchmarking Apache Samza: 1.2 million messages per sec per node
Benchmarking Apache Samza: 1.2 million messages per sec per node
 
Netflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipelineNetflix Keystone—Cloud scale event processing pipeline
Netflix Keystone—Cloud scale event processing pipeline
 
Unbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxiniUnbounded bounded-data-strangeloop-2016-monal-daxini
Unbounded bounded-data-strangeloop-2016-monal-daxini
 
DataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and AnalyticsDataStax and Esri: Geotemporal IoT Search and Analytics
DataStax and Esri: Geotemporal IoT Search and Analytics
 
Samza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next LevelSamza at LinkedIn: Taking Stream Processing to the Next Level
Samza at LinkedIn: Taking Stream Processing to the Next Level
 
Air traffic controller - Streams Processing meetup
Air traffic controller  - Streams Processing meetupAir traffic controller  - Streams Processing meetup
Air traffic controller - Streams Processing meetup
 
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARNApache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
Apache Samza: Reliable Stream Processing Atop Apache Kafka and Hadoop YARN
 
Harvesting the Power of Samza in LinkedIn's Feed
Harvesting the Power of Samza in LinkedIn's FeedHarvesting the Power of Samza in LinkedIn's Feed
Harvesting the Power of Samza in LinkedIn's Feed
 
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
Building large-scale analytics platform with Storm, Kafka and Cassandra - NYC...
 
Spark streaming: Best Practices
Spark streaming: Best PracticesSpark streaming: Best Practices
Spark streaming: Best Practices
 
Spark Streaming into context
Spark Streaming into contextSpark Streaming into context
Spark Streaming into context
 
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
PlayStation and Searchable Cassandra Without Solr (Dustin Pham & Alexander Fi...
 
Thinking Functionally with Clojure
Thinking Functionally with ClojureThinking Functionally with Clojure
Thinking Functionally with Clojure
 
New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015New Analytics Toolbox DevNexus 2015
New Analytics Toolbox DevNexus 2015
 
Netflix Data Pipeline With Kafka
Netflix Data Pipeline With KafkaNetflix Data Pipeline With Kafka
Netflix Data Pipeline With Kafka
 
MongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: ShardingMongoDB for Time Series Data: Sharding
MongoDB for Time Series Data: Sharding
 
stream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samzastream-processing-at-linkedin-with-apache-samza
stream-processing-at-linkedin-with-apache-samza
 
Scala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache sparkScala like distributed collections - dumping time-series data with apache spark
Scala like distributed collections - dumping time-series data with apache spark
 
So you think you can stream.pptx
So you think you can stream.pptxSo you think you can stream.pptx
So you think you can stream.pptx
 

Ähnlich wie Store stream data on Data Lake

Real Time Data Processing Using AWS Lambda - DevDay Austin 2017
Real Time Data Processing Using AWS Lambda - DevDay Austin 2017Real Time Data Processing Using AWS Lambda - DevDay Austin 2017
Real Time Data Processing Using AWS Lambda - DevDay Austin 2017
Amazon Web Services
 
Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021
Max Lapan
 

Ähnlich wie Store stream data on Data Lake (20)

Real-Time Event Processing
Real-Time Event ProcessingReal-Time Event Processing
Real-Time Event Processing
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
Cloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark AnalyticsCloud Security Monitoring and Spark Analytics
Cloud Security Monitoring and Spark Analytics
 
Raleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS LambdaRaleigh DevDay 2017: Real time data processing using AWS Lambda
Raleigh DevDay 2017: Real time data processing using AWS Lambda
 
Riga dev day: Lambda architecture at AWS
Riga dev day: Lambda architecture at AWSRiga dev day: Lambda architecture at AWS
Riga dev day: Lambda architecture at AWS
 
TenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience SharingTenMax Data Pipeline Experience Sharing
TenMax Data Pipeline Experience Sharing
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark  - Demi Be...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Be...
 
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
S3, Cassandra or Outer Space? Dumping Time Series Data using Spark - Demi Ben...
 
Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2Netflix Open Source Meetup Season 4 Episode 2
Netflix Open Source Meetup Season 4 Episode 2
 
Real Time Data Processing Using AWS Lambda - DevDay Austin 2017
Real Time Data Processing Using AWS Lambda - DevDay Austin 2017Real Time Data Processing Using AWS Lambda - DevDay Austin 2017
Real Time Data Processing Using AWS Lambda - DevDay Austin 2017
 
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
AWS re:Invent 2016: Case Study: How Startups Like Smartsheet and Quantcast Ac...
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
Serverlessusecase workshop feb3_v2
Serverlessusecase workshop feb3_v2Serverlessusecase workshop feb3_v2
Serverlessusecase workshop feb3_v2
 
Real-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS LambdaReal-time Data Processing Using AWS Lambda
Real-time Data Processing Using AWS Lambda
 
Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021Scalable crawling with Kafka, scrapy and spark - November 2021
Scalable crawling with Kafka, scrapy and spark - November 2021
 
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
Getting Maximum Performance from Amazon Redshift (DAT305) | AWS re:Invent 2013
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
SMC303 Real-time Data Processing Using AWS Lambda
SMC303 Real-time Data Processing Using AWS LambdaSMC303 Real-time Data Processing Using AWS Lambda
SMC303 Real-time Data Processing Using AWS Lambda
 
Collecting 600M events/day
Collecting 600M events/dayCollecting 600M events/day
Collecting 600M events/day
 
NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1NetflixOSS Meetup season 3 episode 1
NetflixOSS Meetup season 3 episode 1
 

Mehr von Marcos Rebelo (6)

Coordinating external data importer services using AWS step functions
Coordinating external data importer services using AWS step functionsCoordinating external data importer services using AWS step functions
Coordinating external data importer services using AWS step functions
 
Mojolicious
MojoliciousMojolicious
Mojolicious
 
Perl5i
Perl5iPerl5i
Perl5i
 
Modern Perl
Modern PerlModern Perl
Modern Perl
 
Perl Introduction
Perl IntroductionPerl Introduction
Perl Introduction
 
Perl In The Command Line
Perl In The Command LinePerl In The Command Line
Perl In The Command Line
 

Kürzlich hochgeladen

AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
ellan12
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
soniya singh
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Sheetaleventcompany
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Call Girls In Delhi Whatsup 9873940964 Enjoy Unlimited Pleasure
 

Kürzlich hochgeladen (20)

Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Green Park Escort Service Delhi N.C.R.
 
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
Call Girls Dubai Prolapsed O525547819 Call Girls In Dubai Princes$
 
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptxAWS Community DAY Albertini-Ellan Cloud Security (1).pptx
AWS Community DAY Albertini-Ellan Cloud Security (1).pptx
 
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service AvailableCall Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
Call Girls Ludhiana Just Call 98765-12871 Top Class Call Girl Service Available
 
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
Call Girls In Pratap Nagar Delhi 💯Call Us 🔝8264348440🔝
 
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
Call Now ☎ 8264348440 !! Call Girls in Shahpur Jat Escort Service Delhi N.C.R.
 
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
✂️ 👅 Independent Andheri Escorts With Room Vashi Call Girls 💃 9004004663
 
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 26 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...Pune Airport ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready...
Pune Airport ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready...
 
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night StandHot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
Hot Call Girls |Delhi |Hauz Khas ☎ 9711199171 Book Your One night Stand
 
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
All Time Service Available Call Girls Mg Road 👌 ⏭️ 6378878445
 
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting  High Prof...
VIP Model Call Girls Hadapsar ( Pune ) Call ON 9905417584 Starting High Prof...
 
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
WhatsApp 📞 8448380779 ✅Call Girls In Mamura Sector 66 ( Noida)
 
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
Call Girls Service Chandigarh Lucky ❤️ 7710465962 Independent Call Girls In C...
 
On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024On Starlink, presented by Geoff Huston at NZNOG 2024
On Starlink, presented by Geoff Huston at NZNOG 2024
 
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...Top Rated  Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
Top Rated Pune Call Girls Daund ⟟ 6297143586 ⟟ Call Me For Genuine Sex Servi...
 
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
DDoS In Oceania and the Pacific, presented by Dave Phelan at NZNOG 2024
 
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...Nanded City ( Call Girls ) Pune  6297143586  Hot Model With Sexy Bhabi Ready ...
Nanded City ( Call Girls ) Pune 6297143586 Hot Model With Sexy Bhabi Ready ...
 
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
𓀤Call On 7877925207 𓀤 Ahmedguda Call Girls Hot Model With Sexy Bhabi Ready Fo...
 
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No AdvanceRohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
Rohini Sector 6 Call Girls Delhi 9999965857 @Sabina Saikh No Advance
 

Store stream data on Data Lake

  • 1.
  • 2. Principal Data Architect at Home24 Data Services: Search, Recommendations, Ranking Worked on: Here Maps, Sapo.pt, DataJet, Xing, … Scala, Perl, Prolog, Java, SQL, R, … AWS: Step-Functions, Lambda Function, EMR, EC2, Batch, SQS, SNS, Firehose, Athena, API Gateway, ...
  • 4.
  • 5. ● 15 persons of 12 Nationalities ● Serverless Lovers. For data ingestion we have: ● AWS Technologies: Step-Functions, Cloud-Formation, Lambda Functions, Athena, EMR, Redshift, S3, ... Production Development Number of Lambdas 625 2311 Number of Step Function 113 490 Consumed time (a month) 3,383,525 sec (39 days) 5,371,037 sec (62 days) Number of requests (a month) 2,014,203 Requests 3,300,118 Requests
  • 6. ● Majority of our Streams are low rate messages ● The Big Stream doesn’t have an easily predictable rate of messages and can peak to 100 messages/sec ● We will have many more low rate Streams
  • 7. Main requirements ● Store new Stream Data in Raw S3 Bucket ● Refine Raw S3 Bucket data to a Refined S3 Bucket ● Wrong formatted messages shall not stop the flow ● Notification shall be sent on bad data ● Data must be refined in less than 10 minutes Other ● Able to replay many days of data fast ● For development, every developer shall be able to deploy his version independently
  • 8. Requirements ● Collect data from SNS ● The data must be stored as received in S3. ● Files size must be easy to process on Lambda (< 10MB) ● At least 1 file per minute must be created
  • 9. Requirements ● Collect data from SNS ● The data must be stored as received in S3. ● Files size must be easy to process on Lambda (< 10MB) ● At least 1 file per minute must be created Architecture ● A SQS Queue collects all data from the SNS
  • 10. Requirements ● Collect data from SNS ● The data must be stored as received in S3. ● Files size must be easy to process on Lambda (< 10MB) ● At least 1 file per minute must be created Architecture ● A SQS Queue collects all data from the SNS ● A Lambda copies the data from the SQS to a Firehose ● The Lambda Function is invoked once a minute via CloudWatch Event
  • 11. Requirements ● Collect data from SNS ● The data must be stored as received in S3. ● Files size must be easy to process on Lambda (< 10MB) ● At least 1 file per minute must be created Architecture ● A SQS Queue collects all data from the SNS ● A Lambda copies the data from the SQS to a Firehose ● The Lambda Function is invoked once a minute via CloudWatch Event ● Firehose merges the data and creates files on Raw S3 Bucket
  • 12. Requirement ● When some message are not processable, send a notification.
  • 13. Requirement ● When some message are not processable, send a notification. Architecture ● The data is deleted from the SQS Queue after successful copy to Firehose
  • 14. Requirement ● When some message are not processable, send a notification. Architecture ● The data is deleted from the SQS Queue after successful copy to Firehose ● On case of error, the messages will end on the Dead-Letter Queue
  • 15. Requirement ● When some message are not processable, send a notification. Architecture ● The data is deleted from the SQS Queue after successful copy to Firehose ● On case of error, the messages will end on the Dead-Letter Queue ● Non empty Dead-Letter SQS means there is an error on the data
  • 16. Requirement ● When some message are not processable, send a notification. Architecture ● The data is deleted from the SQS Queue after successful copy to Firehose ● On case of error, the messages will end on the Dead-Letter Queue ● Non empty Dead-Letter SQS means there is an error on the data ● After fixing the Lambda function, one can always copy the messages back to the Raw SQS
  • 17. Requirements ● Decompress data (zip, deflate, gz, base64, ...) ● Normalize fields (dates, for example) ● Add metadata ● Convert everything to JSON ● Store on S3
  • 18. Requirements ● Decompress data (zip, deflate, gz, base64, ...) ● Normalize fields (dates, for example) ● Add metadata ● Convert everything to JSON ● Store on S3 Architecture ● When a new file is created on the Raw S3 Bucket, a message is sent to SQS via SNS
  • 19. Requirements ● Decompress data (zip, deflate, gz, base64, ...) ● Normalize fields (dates, for example) ● Add metadata ● Convert everything to JSON ● Store on S3 Architecture ● When a new file is created on the Raw S3 Bucket, a message is sent to SQS via SNS ● The Lambda Function is invoked once a minute via a CloudWatch Event and processes all unprocessed files
  • 20. Requirements ● Decompress data (zip, deflate, gz, base64, ...) ● Normalize fields (dates, for example) ● Add metadata ● Convert everything to JSON ● Store on S3 Architecture ● When a new file is created on the Raw S3 Bucket, a message is sent to SQS via SNS ● The Lambda Function is invoked once a minute via a CloudWatch Event and processes all unprocessed files ● A file with the same key as the Raw file is created on the Refined S3 Bucket
  • 21. Requirements ● Decompress data (zip, deflate, gz, base64, ...) ● Normalize fields (dates, for example) ● Add metadata ● Convert everything to JSON ● Store on S3 Architecture ● When a new file is created on the Raw S3 Bucket, a message is sent to SQS via SNS ● The Lambda Function is invoked once a minute via a CloudWatch Event and processes all unprocessed files ● A file with the same key as the Raw file is created on the Refined S3 Bucket ● Messages that fail to process end up in the Dead-Letter Queue
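The Refiner's per-record work (decompress, normalize dates, add metadata, emit JSON) can be sketched as a single function. Assumptions are labeled: the field names (`ts`, `source_key`) and the base64-wrapped-gzip codec are illustrative, not our real schema; other codecs (zip, deflate) would get their own branches.

```python
# Illustrative sketch of the Refiner's per-record transformation.
import base64
import gzip
import json
from datetime import datetime, timezone


def refine_record(raw: str, source_key: str) -> str:
    """Decompress one raw record, normalize its date, attach metadata."""
    # Assumed codec: base64-wrapped gzip (the real Lambda also handles
    # zip and deflate).
    payload = gzip.decompress(base64.b64decode(raw))
    record = json.loads(payload)
    # Normalize the epoch timestamp ("ts" is an illustrative field name)
    # to ISO-8601 UTC so downstream tools see one date format.
    ts = datetime.fromtimestamp(record["ts"], tz=timezone.utc)
    record["ts"] = ts.isoformat()
    # Metadata: remember which Raw S3 object the record came from.
    record["source_key"] = source_key
    return json.dumps(record, sort_keys=True)
```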
  • 23. Requirements ● Replay multiple days of data Architecture ● A Lambda Function lists the files on the Raw S3 Bucket and sends messages to SQS
  • 24. Requirements ● Replay multiple days of data Architecture ● A Lambda Function lists the files on the Raw S3 Bucket and sends messages to SQS ● Since the files in Raw and Refined have the same key, replayed files always overwrite the existing ones
  • 25. Requirements ● Replay multiple days of data Architecture ● A Lambda Function lists the files on the Raw S3 Bucket and sends messages to SQS ● Since the files in Raw and Refined have the same key, replayed files always overwrite the existing ones ● The execution time of the Refiner Lambda rises and the Refiner Lambdas work in parallel
  • 26. Requirements ● Replay multiple days of data Architecture ● A Lambda Function lists the files on the Raw S3 Bucket and sends messages to SQS ● Since the files in Raw and Refined have the same key, replayed files always overwrite the existing ones ● The execution time of the Refiner Lambda rises and the Refiner Lambdas work in parallel Parallelism: ● Our Lambda goes to ~190 sec, with 3 Lambdas running in parallel ● 9198 S3 objects ● 30 GB of GZip data, 10 GB/hour
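The replay Lambda described above amounts to listing keys and enqueueing them. A hypothetical sketch, with `RAW_BUCKET` as a placeholder; the batching respects SQS's `send_message_batch` cap of 10 entries per call.

```python
# Illustrative replay sketch: list Raw objects and push their keys to
# the Refiner's SQS queue. RAW_BUCKET is a placeholder name.
RAW_BUCKET = "raw-bucket"


def to_sqs_batches(keys, batch_size=10):
    """Turn S3 keys into send_message_batch entry lists (max 10 each)."""
    entries = [{"Id": str(i), "MessageBody": key} for i, key in enumerate(keys)]
    return [entries[i:i + batch_size] for i in range(0, len(entries), batch_size)]


def replay(prefixes, queue_url):
    import boto3  # deferred: only needed inside the real Lambda

    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")
    for prefix in prefixes:  # e.g. one prefix per day to replay
        pages = s3.get_paginator("list_objects_v2").paginate(
            Bucket=RAW_BUCKET, Prefix=prefix
        )
        keys = [obj["Key"] for page in pages for obj in page.get("Contents", [])]
        for batch in to_sqs_batches(keys):
            sqs.send_message_batch(QueueUrl=queue_url, Entries=batch)
```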
  • 27. Requirement ● Developers shall be able to deploy their Stream Processors ● No interaction with external teams shall be required
  • 28. Requirement ● Developers shall be able to deploy their Stream Processors ● No interaction with external teams shall be required Architecture ● We created an internal SNS where we clone the external messages
  • 29. Requirement ● Developers shall be able to deploy their Stream Processors ● No interaction with external teams shall be required Architecture ● We created an internal SNS where we clone the external messages ● SNS can write to multiple SQS queues
  • 30. Requirement ● Developers shall be able to deploy their Stream Processors ● No interaction with external teams shall be required Architecture ● We created an internal SNS where we clone the external messages ● SNS can write to multiple SQS queues ● With the same CloudFormation magic, every developer can deploy their own environment
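The SNS-to-many-SQS fan-out can be expressed as a small CloudFormation fragment. This is an illustrative sketch with placeholder logical names, not our actual template; each developer environment would stamp out its own queue, subscription, and queue policy.

```yaml
# Illustrative CloudFormation fragment: internal SNS topic fanning out
# to one SQS queue per developer environment (names are placeholders).
Resources:
  InternalTopic:
    Type: AWS::SNS::Topic

  DevRawQueue:
    Type: AWS::SQS::Queue

  DevSubscription:
    Type: AWS::SNS::Subscription
    Properties:
      TopicArn: !Ref InternalTopic
      Protocol: sqs
      Endpoint: !GetAtt DevRawQueue.Arn

  # SNS may only deliver to the queue if the queue policy allows it.
  DevQueuePolicy:
    Type: AWS::SQS::QueuePolicy
    Properties:
      Queues: [!Ref DevRawQueue]
      PolicyDocument:
        Statement:
          - Effect: Allow
            Principal: "*"
            Action: sqs:SendMessage
            Resource: !GetAtt DevRawQueue.Arn
            Condition:
              ArnEquals:
                aws:SourceArn: !Ref InternalTopic
```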
  • 31. EC2 vs Lambda — CPU / Price: EC2: 1 t2.nano (5% of a vCPU and 500MB), 0.0063*24*30 = 4.536$/month; Lambda: considering 3 seconds a minute at the highest memory (2 vCPU and 1536 MB), 3*60*24*30*10*(0.000002501+0.0000002) = 3.5$/month
  • 32. EC2 vs Lambda — CPU / Price: EC2: 1 t2.nano (5% of a vCPU and 500MB), 0.0063*24*30 = 4.536$/month; Lambda: considering 3 seconds a minute at the highest memory (2 vCPU and 1536 MB), 3*60*24*30*10*(0.000002501+0.0000002) = 3.5$/month DevOps: EC2: Higher; Lambda: Low
  • 33. EC2 vs Lambda — CPU / Price: EC2: 1 t2.nano (5% of a vCPU and 500MB), 0.0063*24*30 = 4.536$/month; Lambda: considering 3 seconds a minute at the highest memory (2 vCPU and 1536 MB), 3*60*24*30*10*(0.000002501+0.0000002) = 3.5$/month DevOps: EC2: Higher; Lambda: Low Scale: EC2: scales, while it has credits, up to 1 vCPU; to get more vCPUs you need more expensive instance types or autoscaling; Lambda: out of the box up to a certain level, 2 vCPU * 5 Lambdas = 10 vCPUs
  • 34. EC2 vs Lambda — CPU / Price: EC2: 1 t2.nano (5% of a vCPU and 500MB), 0.0063*24*30 = 4.536$/month; Lambda: considering 3 seconds a minute at the highest memory (2 vCPU and 1536 MB), 3*60*24*30*10*(0.000002501+0.0000002) = 3.5$/month DevOps: EC2: Higher; Lambda: Low Scale: EC2: scales, while it has credits, up to 1 vCPU; to get more vCPUs you need more expensive instance types or autoscaling; Lambda: out of the box up to a certain level, 2 vCPU * 5 Lambdas = 10 vCPUs Price-wise, Lambda seems a good solution. For our problem, 10 vCPUs is clearly more than enough.
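The slide's cost arithmetic can be checked step by step, using the prices it quotes (t2.nano at $0.0063/hour; Lambda at $0.000002501 per 100 ms at 1536 MB plus $0.0000002 per request). Note the slide's one-liner folds the per-request price into the per-100ms term, which slightly overstates it; priced strictly per request the total comes to about $3.25/month instead of $3.5, still well below the t2.nano.

```python
# Re-running the slide's EC2 vs Lambda cost arithmetic.
ec2_monthly = 0.0063 * 24 * 30          # one t2.nano, hours per month

invocations = 60 * 24 * 30              # once a minute, for a month
units_100ms = invocations * 3 * 10      # 3 s per run = 30 units of 100 ms
lambda_compute = units_100ms * 0.000002501   # 1536 MB duration price
lambda_requests = invocations * 0.0000002    # one request per invocation
lambda_monthly = lambda_compute + lambda_requests

print(f"EC2: ${ec2_monthly:.3f}/month, Lambda: ${lambda_monthly:.2f}/month")
```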
  • 35. Kinesis vs SQS — Slow stream: Kinesis: 2 Shards, 24.5$/month; SQS: Puts 0.042$/month, Requests 2.07$/month. We analyzed our 2 types of data streams: ● Slow Stream: 1 message/sec (2.6 million requests/month). On SQS you pay for PUTs and GETs; on Kinesis you pay for shard hours and PUTs
  • 36. Kinesis vs SQS — Slow stream: Kinesis: 2 Shards, 24.5$/month; SQS: Puts 0.042$/month, Requests 2.07$/month. Fast stream: Kinesis: 3 Shards, 36.7$/month; SQS: Puts 1.1$/month, Requests 51.8$/month. We analyzed our 2 types of data streams: ● Slow Stream: 1 message/sec (2.6 million requests/month) ● Fast Stream: 25 messages/second (64.8 million requests/month) with spikes of 100 messages/second. On SQS you pay for PUTs and GETs; on Kinesis you pay for shard hours and PUTs
  • 37. Kinesis vs SQS — Slow stream: Kinesis: 2 Shards, 24.5$/month; SQS: Puts 0.042$/month, Requests 2.07$/month. Fast stream: Kinesis: 3 Shards, 36.7$/month; SQS: Puts 1.1$/month, Requests 51.8$/month. Errors: Kinesis: errors have to be controlled externally; SQS: errors go to the Dead-Letter Queue. We analyzed our 2 types of data streams: ● Slow Stream: 1 message/sec (2.6 million requests/month) ● Fast Stream: 25 messages/second (64.8 million requests/month) with spikes of 100 messages/second. On SQS you pay for PUTs and GETs; on Kinesis you pay for shard hours and PUTs
  • 38. ● You just pay for what you use ● Scalability is not an issue at our message volume (peaks of 100 messages/second) ○ SQS and Firehose can easily process that volume of messages ○ Multiple Lambdas can work in parallel in case of high traffic or replay ● Separate Lambdas per Stream make the logs easier to understand ● Separate environments simplify developers' work ● Data is on S3 and can be queried via Athena, EMR, Redshift Spectrum, ...