Processing Terabytes of data every day
… and sleeping at night
@katavic_d - @loige
London, 04/07/2019
- loige.link/tera-inf -
Domagoj Katavic Senior Software Engineer
🐦 @katavic_d
😸 github.com/dkatavic
Luciano Mammino Cloud Architect
🐦 @loige
😸 github.com/lmammino
🌍 loige.co
4.7 out of 5 stars
on Amazon.com
With @mariocasciaro
Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process
AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall
Cognito Detect
On-premises solution
(soon also for the cloud!)
● Analyzing network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour driven & Host centric
● Provides threat context and most relevant
attack details
Cognito Recall
● Collects network metadata
and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
A Vectra product for Incident Response
Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer
(up to ~3 TB per day per customer)
● Investigation tool:
Flexible data exploration
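As a quick sanity check on the two ingestion figures, a sustained ~2 GB/min works out to just under 3 TB per day. A minimal sketch, assuming decimal units (1 TB = 1000 GB), which the slides do not specify; `daily_volume_tb` is a hypothetical helper name:

```python
MINUTES_PER_DAY = 24 * 60


def daily_volume_tb(gb_per_minute: float) -> float:
    """Convert a sustained ingestion rate in GB/min to TB/day (decimal units)."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


print(daily_volume_tb(2))  # 2.88
```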
Our first iteration
Control plane
Centralised Logging & Metrics
Security
● Separate VPCs
● Strict Security Groups (whitelisting)
● Red, amber, green subnets
● Encryption at rest through AWS services
● Client Certificates + TLS
● Pentest
Let’s start the beta!
Warning: different timezones!
(Yeah, we actually look that cute when we sleep!)
Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (max)
● We are receiving files every minute
(containing 1 minute of network traffic)
● During peak hours for the biggest customer, files
can be too big to be processed within the timeout
limit
Splitter lambda
Message-aware splitting
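A minimal sketch of what message-aware splitting could look like, assuming newline-delimited messages (the slides do not specify the file format). `split_messages` and `max_chunk_bytes` are hypothetical names. The idea is that chunks are cut only at message boundaries, so each chunk can be handed to a separate Lambda invocation with a predictable size:

```python
from typing import Iterable, Iterator, List


def split_messages(lines: Iterable[bytes], max_chunk_bytes: int) -> Iterator[List[bytes]]:
    """Group whole messages into chunks of at most max_chunk_bytes.

    A message is never cut in half. A single message larger than the
    limit is emitted as its own (oversized) chunk.
    """
    chunk: List[bytes] = []
    size = 0
    for line in lines:
        # Close the current chunk if adding this message would exceed the limit.
        if chunk and size + len(line) > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


messages = [b"aaaa\n", b"bb\n", b"cccccc\n", b"d\n"]
chunks = list(split_messages(messages, max_chunk_bytes=8))
```

Because each chunk has a bounded size, processing time per invocation becomes more predictable, and chunks can be processed in parallel.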
Lessons learned
● Predictable data input for
predictable performance
● Data ingestion parallelization
(exploiting serverless
capabilities)
Lambdas IP starvation incident
● Spinning up many Lambdas consumed
all the available IPs in a subnet
● New Elasticsearch (ES) machines
failed to get an IP
● Elasticsearch could not scale up
● Solution: separate Elasticsearch and
Lambda subnets
Lessons learned
● Every running lambda inside a VPC uses an ENI
(Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the
number of running lambdas and you might run out of
IPs in the subnet!
● Consider putting lambdas in their dedicated subnet
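To make the risk concrete, here is a rough capacity check, assuming (as the slides describe) one private IP per concurrently running Lambda. The CIDR blocks and concurrency numbers are hypothetical; the only AWS-specific fact used is that 5 addresses are reserved in every subnet:

```python
import ipaddress

# AWS reserves the first four addresses and the last one in every subnet.
AWS_RESERVED_IPS_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in the subnet that can actually be assigned."""
    network = ipaddress.ip_network(cidr)
    return network.num_addresses - AWS_RESERVED_IPS_PER_SUBNET


def fits(cidr: str, peak_concurrent_lambdas: int, other_hosts: int = 0) -> bool:
    """True if the subnet can hold the peak Lambda count plus other hosts."""
    return peak_concurrent_lambdas + other_hosts <= usable_ips(cidr)


print(usable_ips("10.0.1.0/24"))               # 251
print(fits("10.0.1.0/24", 300))                # False
print(fits("10.0.0.0/22", 300, other_hosts=20))  # True
```

Running this kind of check against the account's concurrency limit gives a worst-case bound, which is the number that matters when a bug causes a spike.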
Missing data incident
● A new Lambda version triggered insertion failures
● Elasticsearch was rejecting inserts and logging errors
● Our log reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/Alerting failed
Resolution:
● Fixed the mismatching schema
● Scaled out the centralised logging system
Why didn’t we receive the page?
Alerting on lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your lambdas
● Alert on queue size with
CloudWatch!
● Visibility on Lambda retries
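A sketch of the "alert on queue size" idea, assuming boto3 and an SNS topic that feeds the paging system. The queue name, topic ARN, alarm name and thresholds are hypothetical; the metric and namespace are the standard SQS ones published to CloudWatch:

```python
def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Build the arguments for CloudWatch's PutMetricAlarm call.

    The alarm fires as soon as at least one message is visible in the DLQ.
    """
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_dlq_alarm(queue_name: str, sns_topic_arn: str) -> None:
    """Create (or update) the alarm; requires AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**dlq_alarm_params(queue_name, sns_topic_arn))
```

Unlike log-based alerting, this keeps working when the logging pipeline itself is the component that is down.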
Fast retry at peak times
● Lambda retry logic is not configurable
loige.link/lambda-retry
● Most events will be retried 2 times
● Time between retry attempts is not clearly defined
(observed to be in the order of a few seconds)
● What if all retry attempts happen at peak times?
Processing in this range of time is likely to succeed
Processing in this range of time is likely to fail
If retries (1st retry, 2nd retry) are in the same zone, the message will fail and go to the DLQ
Can we extend the retry period
in case of failure?
Extended retry period
1. We normally trigger our ingestion Lambda when a new file is stored in S3
2. If the Lambda fails, the event is automatically retried, up to 2 times
3. If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
4. At this point, our Lambda can receive an SQS event from the DLQ (custom retry logic)
5. If the processing still fails, we can extend the VisibilityTimeout (event delay) (x3)
6. If the processing still fails, we eventually drop the message and alert for manual intervention
Lessons learned
● Cannot always rely on the default retry logic
● SQS events + DLQ =
custom SERVERLESS retry logic
● Now we only alert on custom metrics when
we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry
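The custom retry logic on top of SQS can be sketched roughly as follows. This is an illustration under assumptions, not the authors' implementation: it assumes a batch size of 1, an injected boto3 SQS client, and hypothetical `process` and `alert` callbacks. The delay values are made up; only the 12-hour cap is an SQS limit.

```python
MAX_ATTEMPTS = 3                       # mirrors the "x3" in the retry flow
BASE_DELAY_SECONDS = 15 * 60           # hypothetical: long enough to move past a peak
MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60  # SQS upper bound for VisibilityTimeout


def retry_delay(receive_count: int) -> int:
    """Delay before the next attempt, doubling with every receive."""
    return min(MAX_VISIBILITY_TIMEOUT, BASE_DELAY_SECONDS * 2 ** (receive_count - 1))


def handle_record(record: dict, sqs, queue_url: str, process, alert) -> str:
    """Process one SQS record from a Lambda event and report what happened."""
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        process(record["body"])
        return "done"
    except Exception as error:
        if receive_count >= MAX_ATTEMPTS:
            # Give up: returning normally lets the message be deleted.
            alert(record, error)
            return "dropped"
        # Hide the message for longer, then fail so SQS redelivers it later.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=retry_delay(receive_count),
        )
        raise
```

With these numbers a failing event is retried after 15 and then 30 minutes, which spreads the attempts over a much longer window than the default retries.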
AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ 100ms charging blocks
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
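A back-of-the-envelope cost sketch that shows why wait time and bugs can get expensive. The prices are assumptions (list prices that may differ by region or have changed since), and the 100 ms rounding follows the "charging blocks" bullet; check the pricing page linked in the slide for current values.

```python
import math

PRICE_PER_REQUEST = 0.20 / 1_000_000  # USD, assumed list price
PRICE_PER_GB_SECOND = 0.0000166667    # USD, assumed list price
BILLING_BLOCK_MS = 100                # the "100ms charging blocks" from the slide


def billed_ms(duration_ms: float) -> int:
    """Round the duration up to the next billing block."""
    return math.ceil(duration_ms / BILLING_BLOCK_MS) * BILLING_BLOCK_MS


def invocation_cost(duration_ms: float, memory_mb: int) -> float:
    """Estimated cost in USD of a single invocation."""
    gb_seconds = (memory_mb / 1024) * (billed_ms(duration_ms) / 1000)
    return PRICE_PER_REQUEST + gb_seconds * PRICE_PER_GB_SECOND


print(billed_ms(101))  # 200
# A 1024 MB function that waits until the 15-minute timeout:
print(round(invocation_cost(15 * 60 * 1000, 1024), 4))  # 0.015
```

At roughly 1.5 cents per stuck invocation, a bug that makes a thousand invocations per minute hang until the timeout would cost about 15 USD per minute under these assumptions.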
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services
● Limits everywhere!
○ Soft vs hard limits
○ Take them into account in your design
● https://loige.link/aws-service-limits
Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions
Pages
● A page is an alarm for people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS,
Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is
automatically escalated
● If the incident is customer-facing (e.g. service not available), the customer is notified
Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective
Incidents Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Review
TL;DR: NOT A BLAME GAME!
Incidents Retrospective
● Summary
● Events timeline
● Contributing Factors
● Remediation / Solution
● Actions for the future
● Transparency
Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop Starting, Start Finishing”
● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work into small units (cards)
○ High throughput
○ More predictability
● Bugs take priority over features!
● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data
Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”, one code base for all customers
● Feature flags:
○ Special features
○ Soft releases
● Automated tests before release
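One way to combine a single code base with per-customer feature flags is a default flag set plus per-customer overrides. This is only an illustrative sketch: the flag names, customer IDs and the `is_enabled` helper are hypothetical, and the slides do not say how the flags are actually stored or evaluated.

```python
# Defaults apply to every customer; new features start switched off.
DEFAULT_FLAGS = {"new_search_ui": False, "extended_retention": False}

# Hypothetical per-customer overrides.
CUSTOMER_OVERRIDES = {
    "customer-a": {"extended_retention": True},  # special feature
    "beta-tester": {"new_search_ui": True},      # soft release
}


def is_enabled(flag: str, customer_id: str) -> bool:
    """Resolve a flag for a customer, falling back to the default value."""
    if flag not in DEFAULT_FLAGS:
        raise KeyError(f"unknown feature flag: {flag}")
    overrides = CUSTOMER_OVERRIDES.get(customer_id, {})
    return overrides.get(flag, DEFAULT_FLAGS[flag])


print(is_enabled("new_search_ui", "beta-tester"))  # True
print(is_enabled("new_search_ui", "customer-a"))   # False
```

The same deployment serves every customer; only the flag data differs, so there are no per-customer "snowflake" branches to maintain.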
Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Allow your team space to fail
● Always review and strive for improvement
● Monitor/Instrument as much as you can
● Use managed services to reduce the operational overhead
(but learn their nuances)
We are hiring …
Talk to us!
Thank you!
- loige.link/tera-inf -
Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic

Processing Terabytes of data every day … and sleeping at night (infiniteConf 2019)

  • 1. Processing Terabytes of data every day … and sleeping at night @katavic_d - @loige London, 04/07/2019 - loige.link/tera-inf -
  • 2. Domagoj KatavicSenior Software Engineer 🐦 @katavic_d 😸 github.com/dkatavic
  • 3. Luciano Mammino Cloud Architect 🐦 @loige 😸 github.com/lmammino 🌍 loige.co 4.7 out of 5 stars on Amazon.com With @mariocasciaro
  • 4. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 5. AI to detect and hunt for cyber attackers Cognito Platform ● Detect ● Recall @katavic_d - @loige
  • 6. Cognito Detect on premise solution (soon also for the cloud!) ● Analyzing network traffic and logs ● Uses AI to deliver real-time attack visibility ● Behaviour driven & Host centric ● Provides threat context and most relevant attack details @katavic_d - @loige
  • 8. Cognito Recall ● Collects network metadata and stores it in “the cloud” ● Data is processed, enriched and standardised ● Data is made searchable @katavic_d - @loige A Vectra product for Incident Response
  • 9. Recall requirements ● Data isolation ● Ingestion speed: ~2GB/min x customer (up ~3TB x day per customer) ● Investigation tool: Flexible data exploration @katavic_d - @loige
  • 10. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 12. @katavic_d - @loige Control plane Centralised Logging & Metrics
  • 13. Security ● Separate VPCs ● Strict Security Groups (whitelisting) ● Red, amber, green subnets ● Encryption at rest through AWS services ● Client Certificates + TLS ● Pentest @katavic_d - @loige
  • 14. Let’s start the beta! @katavic_d - @loige
  • 15. Warning: different timezones! A cu m Our ne * @katavic_d - @loige *yeah, we actually look that cute when we sleep!
  • 16. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 19. Lambda timeouts incident ● AWS Lambda timeout: 15 minutes (max) ● We are receiving files every minute (containing 1 minute of network traffic) ● During peak hours for the biggest customer, files can be too big to be processed within timeout limits @katavic_d - @loige
  • 22. Lessons learned ● Predictable data input for predictable performance ● Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige
  • 24. Lambdas IP starvation incident ● Spinning up many lambdas consumed all the available IPs in a subnet ● Failure to get an IP for the new ES machines ● ElasticSearch cannot scale up ● Solution: separate ElasticSearch and Lambda subnets @katavic_d - @loige GI IP!
  • 25. Lessons learned ● Every running lambda inside a VPC uses an ENI (Elastic Network Interface) ● Every ENI takes a private IP address ● Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! ● Consider putting lambdas in their dedicated subnet @katavic_d - @loige
  • 30. ● New lambda version triggered insertion failures ● ElasticSearch rejecting inserts and logging errors ● Our log reporting agents got stuck (we DDoS’d ourselves!) ● Monitoring/Alerting failed Resolution: ● Fix mismatching schema ● Scaled out centralised logging system Why didn’t we receive the page?
  • 31. Alerting on lambda failures Using logs: ● Best case: no logs ● Worst case: no logs (available)! A better approach: ● Attach a DLQ to your lambdas ● Alert on queue size with CloudWatch! ● Visibility on Lambda retries
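Alerting on DLQ size could look roughly like the sketch below, which builds the arguments for a CloudWatch alarm on the standard SQS metric ApproximateNumberOfMessagesVisible. The queue name, topic ARN, alarm name, period and threshold are all assumptions chosen for illustration; the talk does not show its actual alarm configuration.

```python
def dlq_alarm_params(queue_name, sns_topic_arn):
    """Build put_metric_alarm arguments that fire when the DLQ is not empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,            # assumed evaluation window, in seconds
        "EvaluationPeriods": 1,
        "Threshold": 0,           # any message in the DLQ triggers the alarm
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


params = dlq_alarm_params(
    "ingest-dlq",                                   # hypothetical queue
    "arn:aws:sns:eu-west-1:123456789012:on-call",   # hypothetical topic
)
# With boto3 installed and credentials configured, the alarm would be created with:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```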
  • 35. Fast retry at peak times ● Lambda retry logic is not configurable (loige.link/lambda-retry) ● Most events will be retried 2 times ● Time between retry attempts is not clearly defined (observed to be on the order of a few seconds) ● What if all retry attempts happen at peak times?
  • 36. Fast retry at peak times
  • 37. Fast retry at peak times Processing in this range of time is likely to succeed
  • 38. Fast retry at peak times
  • 39. Fast retry at peak times Processing in this range of time is likely to fail
  • 40. Fast retry at peak times If retries are in the same zone, the message will fail and go to the DLQ 1st retry 2nd retry
  • 41. Can we extend the retry period in case of failure?
  • 42. Extended retry period We normally trigger our ingestion Lambda when a new file is stored in S3
  • 43. Extended retry period If the Lambda fails, the event is automatically retried, up to 2 times
  • 44. Extended retry period If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
  • 45. Extended retry period At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
  • 46. Extended retry period If the processing still fails, we can extend the VisibilityTimeout (event delay) x3
  • 47. Extended retry period If the processing still fails, we eventually drop the message and alert for manual intervention. x3
  • 48. Lessons learned ● Cannot always rely on the default retry logic ● SQS events + DLQ = custom SERVERLESS retry logic ● Now we only alert on custom metrics when we are sure the event will fail (logic error) ● https://loige.link/async-lambda-retry
  • 49. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process
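The extended retry flow from the slides above (retry from the DLQ, push the message further out by extending its VisibilityTimeout, give up after 3 attempts) can be sketched as below. This is a sketch under stated assumptions, not the authors' implementation: the delay values, the doubling schedule, the function names and a batch size of 1 are all illustrative. The `sqs` argument stands for a boto3 SQS client, injected so the logic can be exercised without AWS.

```python
MAX_ATTEMPTS = 3                    # mirrors the "x3" on the slides
BASE_DELAY_SECONDS = 300            # assumed delay before the next attempt
SQS_MAX_VISIBILITY_TIMEOUT = 43200  # SQS upper bound: 12 hours


def next_delay(receive_count):
    """Delay before the next attempt, doubling each time and capped by SQS."""
    delay = BASE_DELAY_SECONDS * 2 ** (receive_count - 1)
    return min(delay, SQS_MAX_VISIBILITY_TIMEOUT)


def handle_record(record, process, sqs, queue_url, alert):
    """Process one SQS record delivered to Lambda from the DLQ."""
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        process(record["body"])
        return "done"
    except Exception:
        if receive_count >= MAX_ATTEMPTS:
            # Give up: alert for manual intervention and return normally,
            # so that Lambda deletes the message from the queue.
            alert(record)
            return "dropped"
        # Hide the message for longer, then re-raise so Lambda leaves it
        # on the queue for a later attempt.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=next_delay(receive_count),
        )
        raise
```

Spreading the attempts over minutes or hours, instead of the few seconds of the built-in retries, gives a message a chance to be reprocessed outside the peak window.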
  • 50. AWS nuances ● Serverless is generally cheap, but be careful! ○ You are paying for wait time ○ Bugs may be expensive ○ 100ms charging blocks ● https://loige.link/lambda-pricing ● https://loige.link/serverless-costs-all-wrong
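The effect of 100ms charging blocks is easy to see with a little arithmetic: a function that runs for 101 ms is billed for 200 ms. A minimal sketch, where the price per GB-second is a caller-supplied assumption rather than a quoted AWS price, and the function names are illustrative.

```python
def billed_ms(duration_ms, block_ms=100):
    """Round a duration up to the next charging block (at least one block)."""
    blocks = max(1, -(-duration_ms // block_ms))  # ceiling division
    return blocks * block_ms


def invocation_cost(duration_ms, memory_mb, price_per_gb_second):
    """Cost of one invocation; the price is an assumption supplied by the caller."""
    gb_seconds = billed_ms(duration_ms) / 1000 * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


# Time spent waiting on I/O is billed like any other time,
# and one extra millisecond can cost a whole extra block.
print(billed_ms(100), billed_ms(101))
```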
  • 51. AWS nuances ● Not every service/feature is available in every region or AZ ○ SQS FIFO :( ○ Not all AWS regions have 3 AZs ○ Not all instance types are available in every availability zone ● https://loige.link/aws-regional-services
  • 52. AWS nuances ● Limits everywhere! ○ Soft vs hard limits ○ Take them into account in your design ● https://loige.link/aws-service-limits
  • 53. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process
  • 54. Process How to deal with incidents ● Page ● Engineers on call ● Incident Retrospective ● Actions
  • 55. Pages ● A page is an alarm for people on call (PagerDuty) ● Rotate ops & devs (share the pain) ● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) ● When a page is received, it needs to be acknowledged or it is automatically escalated ● If customer facing (e.g. service not available), the customer is notified
  • 56. Engineers on call 1. Use operational handbook 2. Might escalate to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective
  • 57. Incidents Retrospective "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Review TL;DR: NOT A BLAMING GAME!
  • 58. Incidents Retrospective ● Summary ● Events timeline ● Contributing Factors ● Remediation / Solution ● Actions for the future ● Transparency
  • 59. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process
  • 60. Development best practices ● Regular Retrospectives (not just for incidents) ○ What’s good ○ What’s bad ○ Actions to improve ● Kanban Board ○ All work visible ○ One card at a time ○ Work In Progress limit ○ “Stop Starting, Start Finishing”
  • 61. Development best practices ● Clear acceptance criteria ○ Collectively defined (3 amigos) ○ Make sure you know when a card is done ● Split the work in small units of work (cards) ○ High throughput ○ More predictability ● Bugs take priority over features!
  • 62. Development best practices ● Pair programming ○ Share the knowledge/responsibility ○ Improve team dynamics ○ Enforced by low WIP limit ● Quality over deadlines ● Don’t estimate without data
  • 63. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process
  • 64. Release process ● Infrastructure as code ○ Deterministic deployments ○ Infrastructure versioning using git ● No “snowflakes”, one code base for all customers ● Feature flags: ○ Special features ○ Soft releases ● Automated tests before release
  • 65. Conclusion We are still waking up at night sometimes, but we are definitely sleeping a lot more and better! Takeaways: ● Have healthy and clear processes ● Allow your team space to fail ● Always review and strive for improvement ● Monitor/Instrument as much as you can ● Use managed services to reduce the operational overhead (but learn their nuances)
  • 66. We are hiring … Talk to us! Thank you! - loige.link/tera-inf -
  • 67. Credits Pictures from Unsplash Huge thanks for support and reviews to: ● All the Vectra team ● Yan Cui (@theburningmonk) ● Paul Dolan ● @gbinside ● @augeva ● @Podgeypoos79 ● @PawrickMannion ● @micktwomey ● Vedran Jukic