Richard Ney, Principal Engineer
Over 30 years of experience, predominantly with event pipelines and data
retrieval.
He currently works as a platform architect and principal developer at Lookout Inc.
on the Ingestion Pipeline and Query Services team, working on the next
scale of data ingestion.
■ Provides security scanning for mobile devices for Enterprise and
Consumer markets
■ Founded in 2004 when the original founders discovered a
vulnerability in Bluetooth and Nokia phones
■ Demonstrated the need for mobile security at the 2005 Academy
Awards by downloading information from celebrity phones
1.5 miles away from the venue
■ Enterprise customers have the ability to apply corporate policies
against devices registered in their enterprise
■ To apply these policies Lookout ingests data about device
configuration and applications installed on devices
■ Functions as a proxy for all mobile devices in the Lookout fleet
■ Device telemetry is sent at various intervals for these categories
● Software
● Hardware
● Client
● Filesystem
● Configuration
● Binary Manifest
● Risky Configuration
● Personal Content Protection (safe browsing)
● Device Settings
● Device Permissions
● Activation Status
■ Easy to set up and maintain
■ Scaling is easy
■ Cost Effective
■ Simple to handle the Unexpected
■ Some of the components are “single region” (EMR)
■ As the system grows the costs increase significantly (DynamoDB)
■ Limits on Primary Key (PK) and Sort Key (SK) for DynamoDB - Not
designed for time series data
A highly scalable and fault tolerant streaming framework that can process messages (for
example Device Telemetry Messages) and persist these messages into a scalable, fault
tolerant persistent store and support operational queries.
Key Requirements:
■ Infrastructure should scale to support 100M devices
■ Cost effective ingestion, storage and querying at this scale
■ Low Latency, High Availability at scale (up/down)
■ Failure handling (no loss of data)
■ Ease of deployment and management
■ A NoSQL database that implements almost all the features of Apache Cassandra
■ Written in C++14 instead of Java to increase performance.
■ Uses a shared-nothing approach and the Seastar framework to shard requests by
core - http://seastar.io/
■ Scylla’s close-to-the-hardware design significantly reduces the number of instances
needed.
■ Scales out horizontally and is fault tolerant like Apache Cassandra, but delivers 10X the
throughput and consistent, low single-digit millisecond latencies.
■ Supports tunable job prioritization to sustain extremely high read and write
throughput (a problem that Cassandra has not yet solved). Throughput is especially high
on instances with NVMe volumes (compared to EBS or other non-NVMe volumes).
■ The amount of storage available for data depends on the compaction strategy
selected.
● Leveled compaction - Half of the storage must be reserved for compaction - not
recommended
● Size-tiered compaction - Half of the storage must be reserved for compaction
● Time-window compaction - Depends on the number of tables and record size -
normally around half must be reserved for compaction
● Incremental compaction - Possible to use up to 85% of the storage for data, so storage
needs must be planned carefully - Enterprise Edition
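The rule of thumb above can be turned into a rough capacity estimate. The following is a minimal sketch, not a sizing tool: the usable fractions are the approximate figures from the bullets, while the node count, raw disk size, and replication factor in the example are hypothetical inputs chosen only for illustration.

```python
# Fraction of raw disk usable for data under each compaction strategy,
# taken from the rough figures in the bullets above (not exact vendor numbers).
USABLE_FRACTION = {
    "leveled": 0.50,
    "size_tiered": 0.50,
    "time_window": 0.50,   # "normally around half"
    "incremental": 0.85,   # Enterprise Edition, upper bound
}


def usable_storage_tb(raw_tb_per_node: float, nodes: int,
                      replication_factor: int, strategy: str) -> float:
    """Estimate how much unique (pre-replication) data a cluster can hold, in TB."""
    raw_total = raw_tb_per_node * nodes
    return raw_total * USABLE_FRACTION[strategy] / replication_factor


if __name__ == "__main__":
    # Hypothetical cluster: 3 nodes, 4 TB raw disk each, replication factor 3.
    for strategy in USABLE_FRACTION:
        print(strategy, round(usable_storage_tb(4.0, 3, 3, strategy), 2))
```

Under these assumed inputs, moving from a strategy that reserves half the disk to incremental compaction raises the estimate from 2 TB to roughly 3.4 TB of unique data on the same hardware.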
■ May not be a good choice if storage requirements are very large relative to
transaction volume, as compute is wasted on nodes added only for storage.
■ Note that this assumes you do not plan to use low-cost EBS volumes, which have much lower
throughput.
■ No FedRAMP-certified version of Scylla Cloud is available today, so a self-managed
cluster must be deployed.
■ No autoscaling support; nodes must be provisioned and data rebalanced through
scripts/UI.
■ Not suitable for ad-hoc queries or table-scan queries, and does not support joins.
■ Each worker instance is stateless and coordinates
with the others via internal Kafka topics.
■ Kafka Connect automatically detects failures and
rebalances work over the remaining processes.
■ Suitable for streaming data to and from Kafka, but
not for complex operations such as aggregations and
windowing that frameworks like Apache Spark
or Apache Flink support.
■ The maximum number of tasks is limited to the
number of partitions.
■ Exposes a REST API to create, modify and monitor
connectors and tasks
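As an illustration of the last two points, the sketch below builds the JSON body that a client would POST to a Kafka Connect worker's `/connectors` endpoint, capping `tasks.max` at the partition count. The keys `connector.class`, `tasks.max` and `topics` are standard Kafka Connect settings; the connector name, topic name and connector class are hypothetical placeholders, and a real sink connector would need its own connector-specific settings as well.

```python
import json


def build_sink_connector(name: str, topic: str, partitions: int,
                         requested_tasks: int, connector_class: str) -> dict:
    """Build the JSON body for POST /connectors on a Kafka Connect worker."""
    # Each sink task consumes whole partitions, so tasks beyond the
    # partition count would sit idle; cap the request accordingly.
    tasks_max = min(requested_tasks, partitions)
    return {
        "name": name,
        "config": {
            "connector.class": connector_class,
            "tasks.max": str(tasks_max),
            "topics": topic,
        },
    }


if __name__ == "__main__":
    body = build_sink_connector(
        name="device-settings-sink",                        # hypothetical name
        topic="device_settings",                            # hypothetical topic
        partitions=150,
        requested_tasks=200,
        connector_class="com.example.ScyllaSinkConnector",  # placeholder class
    )
    print(json.dumps(body, indent=2))
```

The function only builds the payload; sending it (for example with `urllib.request`) is left out so the sketch stays independent of any particular cluster address.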
■ Kafka
● 6 Kafka Brokers - R5.xlarge
● 6 ZooKeeper nodes - M5.large
● 3 Schema Registries - M5.large
● 6 Kafka Connect Workers - C5.xlarge
● 1 Control Center - M5.2xlarge
● Split over 3 AZs
● # partitions
    ■ Loaded Libraries - 120 partitions
    ■ Device Settings - 150 partitions
    ■ Other topics - 60 partitions
■ ScyllaDB
● 4 ScyllaDB instances - I3.4xlarge
● Split over 2 AZs
■ Load
● 12 different device telemetry types emulated
● Messages sent in Apache Avro format
● 14 instances generating load - C5d.4xlarge
■ The default partitioner that comes with Kafka (<murmur2 hash> mod <# partitions>)
did not shard evenly as the number of partitions grew (approximately 50% of the
partitions were idle).
■ We replaced it with a murmur3 hash fed through a consistent hashing
algorithm (jump hash) to get an even distribution across all partitions (we used Google’s
Guava library). - “A Fast, Minimal Memory, Consistent Hash Algorithm” -
https://arxiv.org/pdf/1406.2294.pdf
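To make the jump hash step concrete, here is a minimal Python sketch of the algorithm from the cited paper. It is not the team's partitioner: the talk used murmur3 and Guava in Java, whereas this sketch derives the 64-bit key with `hashlib.blake2b` as a stand-in (an assumption made only to keep the example dependency-free), and the `device-…` key format is hypothetical.

```python
import hashlib
from collections import Counter


def jump_hash(key: int, num_buckets: int) -> int:
    """Jump consistent hash (Lamping and Veach): map a 64-bit key to a bucket."""
    b, j = -1, 0
    while j < num_buckets:
        b = j
        # 64-bit linear congruential step, as in the paper's reference code.
        key = (key * 2862933555777941757 + 1) & 0xFFFFFFFFFFFFFFFF
        j = int((b + 1) * (float(1 << 31) / float((key >> 33) + 1)))
    return b


def partition_for(device_id: str, num_partitions: int) -> int:
    """Pick a partition for a device ID (blake2b stands in for murmur3 here)."""
    digest = hashlib.blake2b(device_id.encode("utf-8"), digest_size=8).digest()
    return jump_hash(int.from_bytes(digest, "big"), num_partitions)


if __name__ == "__main__":
    # Hypothetical device IDs spread over 150 partitions (the Device Settings topic size).
    counts = Counter(partition_for(f"device-{i}", 150) for i in range(30000))
    print("partitions used:", len(counts))
    print("min/max per partition:", min(counts.values()), max(counts.values()))
```

A useful property of jump hash is that growing from n to n+1 partitions only moves keys into the new partition; every other key keeps its assignment.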
■ We emulated approximately 38
million devices generating a total
of 109,668 messages/second, or
394 million messages/hr.
■ On average, a device generated
253 messages/day.
■ We did not expect querying to have
much impact, so it was not included
in the load.
■ The load test ran for 96 hrs.
Telemetry Type       # Device Telemetry    # Device Telemetry           Avg size in
                     emulated/second       emulated/day (per device)    Bytes/Telemetry
Celldata                      760                  1.72                       83
Client                        760                  1.72                      166
Configuration               13908                 31.62                      396
Device Change                2280                  6.91                      218
Device Permissions           1520                  3.45                       74
Device Settings             45600                103.68                       75
Hardware                      760                  1.72                      254
Loaded Libraries            38000                 86.40                      219
Risk Configuration           1900                  5.18                      261
Software                      760                  1.72                      375
Binary                       1520                  3.45                      219
File System                  1900                  5.18                      219
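The headline load figures can be cross-checked against the per-type rows of the table. The short sketch below sums the per-second and per-day columns and, as an extra estimate not stated in the slides, multiplies each rate by its average size to approximate raw message bytes per second (ignoring Kafka overhead and replication).

```python
# (telemetry type, messages/second, messages/day per device, avg bytes/message)
LOAD = [
    ("Celldata", 760, 1.72, 83),
    ("Client", 760, 1.72, 166),
    ("Configuration", 13908, 31.62, 396),
    ("Device Change", 2280, 6.91, 218),
    ("Device Permissions", 1520, 3.45, 74),
    ("Device Settings", 45600, 103.68, 75),
    ("Hardware", 760, 1.72, 254),
    ("Loaded Libraries", 38000, 86.40, 219),
    ("Risk Configuration", 1900, 5.18, 261),
    ("Software", 760, 1.72, 375),
    ("Binary", 1520, 3.45, 219),
    ("File System", 1900, 5.18, 219),
]

total_per_second = sum(rate for _, rate, _, _ in LOAD)
messages_per_hour = total_per_second * 3600
per_device_per_day = sum(per_day for _, _, per_day, _ in LOAD)
bytes_per_second = sum(rate * size for _, rate, _, size in LOAD)

if __name__ == "__main__":
    print("messages/second:", total_per_second)
    print("messages/hour:", messages_per_hour)
    print("messages/day per device:", round(per_device_per_day, 2))
    print("approx. MB/second:", round(bytes_per_second / 1e6, 1))
```

The per-second column sums to the 109,668 messages/second quoted for the test, and the per-day column sums to about 253 messages per device.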
■ Message latency averaged in the
milliseconds unless the
system was overtaxed.
■ Repairs added load and were
generally taxing on the system
(CPU at 100%), but the cluster
continued to function.
■ Latency increased when Kafka
Connect tasks failed (while repairs
were running on ScyllaDB).
■ The ScyllaDB cluster was running near
capacity (CPU between 75% and 90%).
■ Overall, the results were very
positive.
■ Kafka Connect provided a quick and easy way to add new ingestion pipelines
■ DataMountaineer’s Kafka Connect connector for Cassandra was easier to
implement than the Confluent connector
■ ScyllaDB CPU shot up during repairs and timeouts occurred, but Scylla’s ability to reserve
capacity for maintenance tasks ensured that repairs completed, something not available in
Cassandra.
■ As the complexity of the data ingestion increased, the solution leaned more towards
a custom Kafka → Scylla worker cluster for debugging and maintenance
reasons
■ The cost benefit over the current architecture grew significantly as our volume
increased.
■ This does not include:
● Query load and associated costs.
● DynamoDB Streams and its equivalent on Scylla, and associated costs.
               DynamoDB                       Scylla
               # Devices      $ Cost/Mo       # Devices      $ Cost/Mo
On Demand      38,000,000     $304,400.00     38,000,000     $14,564.24
               100,000,000    $801,052.63     100,000,000    $38,303.95
Provisioned    38,000,000     $55,610.00
               100,000,000    $146,342.11
Scylla: +20% Engineer cost (Maintenance)
Chart labels (100,000,000 devices): $801,052 (DynamoDB On Demand), $146,342 (DynamoDB Provisioned), $38,303 (Scylla)
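The size of the gap is easier to see as a ratio. The sketch below divides each DynamoDB monthly figure by the Scylla figure for the same device count. It uses the numbers exactly as listed and makes no adjustment for the +20% engineering cost or for the excluded query and streams costs, so the ratios are indicative only.

```python
# Monthly cost in USD, copied from the comparison above.
MONTHLY_COST = {
    38_000_000: {
        "dynamodb_on_demand": 304_400.00,
        "dynamodb_provisioned": 55_610.00,
        "scylla": 14_564.24,
    },
    100_000_000: {
        "dynamodb_on_demand": 801_052.63,
        "dynamodb_provisioned": 146_342.11,
        "scylla": 38_303.95,
    },
}


def cost_ratio(devices: int, option: str) -> float:
    """How many times more the given option costs than Scylla at this scale."""
    costs = MONTHLY_COST[devices]
    return costs[option] / costs["scylla"]


if __name__ == "__main__":
    for devices in MONTHLY_COST:
        for option in ("dynamodb_on_demand", "dynamodb_provisioned"):
            print(f"{devices:>11,} {option}: {cost_ratio(devices, option):.1f}x")
```

On these figures DynamoDB on demand is roughly 21 times the Scylla cost and provisioned DynamoDB is a little under 4 times, at both device counts.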
Richard Ney
richard.ney@lookout.com
@rney_home

Lookout on Scaling Security to 100 Million Devices
