Cloud Security Monitoring
and Spark Analytics
Boston Spark Meetup
Threat Stack
Andre Mesarovic
10 December 2015
Threat Stack - Who We Are
• Leadership team with deep security, SaaS, and big data
experience
• Launched on stage at 2014 AWS re:Invent
• Founded by principal engineers from Mandiant in 2012
• Based in Boston's Innovation District
• 27 employees and hiring
• On Track for 100+ Customers and 10,000 Monitored
Servers by Year-End 2015
• Funded by Accomplice (Atlas) and .406 Ventures
Threat Stack - Use Cases
• Insider Threat Detection
• External Threat Detection
• Data Loss Detection
• Regulatory Compliance Support - HIPAA, PCI
Threat Stack - Key Workload Questions
• What processes are running on all my servers?
• Did a process suddenly start making outbound
connections?
• Who is logged into my servers and what are they
running?
• Has anyone logged in from non-standard locations?
• Are any critical system and data files being changed?
• What happened on a transient server 7 weeks ago?
• Who is changing our Cloud infrastructure?
Threat Stack - Features
• Deep OS Auditing
• Behavior-based Intrusion Detection
• DVR Capabilities
• Customizable Alerts
• File Integrity Monitoring
• DevOps Enabled Deployment
Threat Stack - Tech Stack
• RabbitMQ
• Nginx
• Cassandra
• Elasticsearch
• MongoDB
• Redis - ElastiCache
• Postgres - RDS
• Languages: Node.js, C, Scala and a bit of Lua
• Chef
• Librato, Grafana, Sensu, Sentry, PagerDuty
• Slack
Spark Cluster
• Spark 1.4.1
• Spark standalone cluster manager - no Mesos or Yarn
• One long running Spark job - running over 2 months
• Separate driver node
– Since driver has different workload it can be scaled
independently of the workers
• We like our cluster to be a homogeneous set of worker nodes
– One executor per worker
• Monitored by Grafana
• Custom Codahale metrics consumed by Grafana
– Only implemented for Driver - for Worker it’s a TODO
Spark Cluster Hardware
Threat Stack Overall Architecture
Spark Analytics Architecture
Spark Web UI - Master
Spark Web UI - Jobs
Event Pipeline Statistics
Mean event is 700 bytes
                Second    10 Min Interval   Day       Month
Mean events     75 K      45 M              6.48 B    194 B
Spike events    125 K     75 M              10.8 B    324 B
Mean bytes      52.5 MB   31.5 GB           4.5 TB    136 TB
Spike bytes     87.5 MB   52.5 GB           7.6 TB    227 TB
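The figures above follow from two inputs: the per-second event rate and the 700-byte mean event size. A minimal Python sketch of that arithmetic, assuming decimal units (1 MB = 10^6 bytes) and a 30-day month; the function and constant names are illustrative and not from the talk:

```python
MEAN_EVENT_BYTES = 700  # mean event size, from the slide

# Seconds in each reporting period (30-day month is an assumption).
SECONDS = {
    "second": 1,
    "minute": 60,
    "interval_10min": 600,
    "day": 86_400,
    "month_30d": 30 * 86_400,
}

def pipeline_volume(events_per_second):
    """Return {period: (events, bytes)} for a constant event rate."""
    return {
        period: (events_per_second * secs,
                 events_per_second * secs * MEAN_EVENT_BYTES)
        for period, secs in SECONDS.items()
    }

mean = pipeline_volume(75_000)    # mean rate from the slide
spike = pipeline_volume(125_000)  # spike rate from the slide
```

Each entry is an (events, bytes) pair, so `mean["day"]` gives the daily event count and byte volume at the mean rate.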
Problem that Spark Analytics Addresses
• Overview
– Spark replaced home-grown rollups and Elasticsearch facets
– Original solutions did not scale well
• Home-grown rollups of streaming data
– Used eep.js - subset of CEP that adds aggregate functions and
windowed stream operations to Node.js.
– Postgres stored procedures to upsert rolled up values
– Problem: way too many Postgres transactions
• Elasticsearch facets
– Great for initial moderate volume
– Running into scaling issues as we grow
Why not Spark Streaming?
• We first tried to use Spark Streaming
• Ran OK in dev env but failed in prod env - 20x
• Too many endurance and scaling problems
• Ran out of file descriptors on workers very quickly
– Sure, we can write a cron job but do we want to?
– Zillions of 24 byte files that were never cleaned up
• Too many out-of-memory errors on workers
– Intermittent and random OOMs
– Workers crashed in 3 days due to tiny memory leak
• No robust RabbitMQ receiver - everyone is focused on Kafka
• Love the idea, but just wasn’t ready for prime time
Current Spark Solution
• Decouple event consumption and Spark processing
• Two processes: Event Writer and Spark Analytics
• Event Writer consumes events from RabbitMQ firehose
– Writes batches to scratch store every 10 min interval
• Spark job wakes up every 10 min to roll up events by
different criteria into Postgres
– For example, at 10:20 Spark job processes the data
from 10:10 to 10:20
• Spark then deletes the interval data of 10:10 to 10:20
• Spark uptime: 64 days since Oct. 7, 2015
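The "wake up at 10:20, process 10:10 to 10:20" rule amounts to flooring the current time to a 10-minute boundary and stepping back one interval. A minimal sketch, assuming boundaries aligned to the clock; `previous_interval` is a hypothetical helper, not the actual job code:

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=10)

def previous_interval(now, interval=INTERVAL):
    """Return (start, end) of the last fully closed interval before `now`."""
    epoch = datetime(1970, 1, 1)
    elapsed = (now - epoch) // interval   # whole intervals since the epoch
    end = epoch + elapsed * interval      # floor `now` to an interval boundary
    return end - interval, end

# A job waking up a few seconds after 10:20 processes 10:10 to 10:20.
start, end = previous_interval(datetime(2015, 12, 10, 10, 20, 5))
```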
Basic Workflow
• Event Writer consumes RMQ messages and writes them to S3
• RMQ messages are in MessagePack format
• Each message is one doc per org/agent/type: a header plus an
array of events
• Event Writer flattens this into a batch of events
• Output is gzip JSON sequence file - one JSON object per line
• Event Writer writes fixed sized output batches of events to S3
• Current memory buffer for the batch is 100 MB
• This compresses down to 3.5 MB - 28x compression
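The flatten-and-compress step can be sketched as follows. This assumes the message has already been decoded from MessagePack into a dict, and the `header` and `events` keys are hypothetical names for the two parts of the doc; it is an illustration, not the Event Writer's actual code:

```python
import gzip
import io
import json

def flatten(message):
    """Copy the header fields of one decoded message onto each of its events."""
    header = message["header"]
    return [{**header, **event} for event in message["events"]]

def encode_batch(events):
    """Serialize events as gzip-compressed JSON, one object per line."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for event in events:
            gz.write(json.dumps(event).encode("utf-8") + b"\n")
    return buf.getvalue()

# Made-up message with two events sharing one header.
message = {
    "header": {"organization_id": "org-1", "agent_id": "agent-1", "_type": "audit"},
    "events": [{"syscall": "execve", "pid": 7829}, {"syscall": "connect", "pid": 7830}],
}
blob = encode_batch(flatten(message))
lines = gzip.decompress(blob).decode("utf-8").splitlines()
```

The resulting `blob` is what would be written as one object to the scratch store; a reader gets one self-describing JSON event per line.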
Advantages of Current Solution
• Separation of concerns - each process is focused on doing one
thing best
• Event Writer is concerned with non-trivial RMQ flow control
• Spark simply reads event sequences from scratch storage
• Thus Spark has more resources to compute rollups
• Each app can scale independently
• Spark Streaming was trying to do too much - both handle
RMQ ingestion and analytics processing
• Current solution is more robust
Capacity and Scaling
• Good news - Spark scales linearly for us
• We ran tests with different numbers of workers and results
were linear
• Elasticity: we can independently scale the Event Writers and
the Spark cluster
• With Spark Streaming we could not dynamically add more
RMQ receivers without restarting the app
Event Writer Stats
• One Event Writer per RabbitMQ exchange
• We have 3 RMQ exchanges
• 10 minute interval for buffering events
• 100 MB in-memory event buffer compresses down to 3.5 MB
• Compression factor of 28 x
• 600 S3 objects per interval (compressed)
• 2.1 GB per interval (uncompressed would be 58.8 GB)
• Need 2 intervals present - current and previous - 4.2 GB (118
GB)
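The scratch-store sizing follows from the buffer size, the compressed object size, and the object count. A short sketch of the arithmetic, assuming 1 GB = 1000 MB and the compression factor rounded down to 28 as on the slide; variable names are illustrative:

```python
BUFFER_MB = 100             # uncompressed in-memory batch, from the slide
COMPRESSED_MB = 3.5         # size of one compressed S3 object, from the slide
OBJECTS_PER_INTERVAL = 600
INTERVALS_RETAINED = 2      # current and previous

compression = BUFFER_MB / COMPRESSED_MB                     # roughly 28.6
interval_gb = OBJECTS_PER_INTERVAL * COMPRESSED_MB / 1000   # compressed GB per interval
retained_gb = INTERVALS_RETAINED * interval_gb              # compressed GB kept at once
# Uncompressed equivalents, using the rounded factor of 28.
uncompressed_interval_gb = interval_gb * int(compression)
uncompressed_retained_gb = retained_gb * int(compression)
```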
Event Types
• audit - accept, bind, connect, exit, etc.
• login - login, logout
• host
• file
• network
Event Example
{
"organization_id" : "3d0c49e818bac99c72b7088665342daf30a3bcd7",
"agent_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
"arguments" : "/usr/sbin/sshd -D -R",
"_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
"_type" : "audit",
"_insert_time" : 1429902593,
"args" : [ "/usr/sbin/sshd", "-D", "-R" ],
"user" : "root", "group" : "root",
"path" : [ "/usr/sbin/sshd", null ],
"exe" : "/usr/sbin/sshd",
"timestamp" : 1429902590000,
"type" : "start",
"syscall" : "execve",
"command" : "sshd",
"uid" : 0, "euid" : 0,
"gid" : 0, "egid" : 0, "exit" : 0,
"session" : 4294967295,
"pid" : 7829, "ppid" : 873,
"success" :,
"parent_process" : {
"pid" : 873,
"exe" : "/usr/sbin/sshd",
"command" : "sshd",
"args" : [ "/usr/sbin/sshd", "-D" ],
"loginuid" : 4294967295,
"timestamp" : 1427337850230,
"uid" : 0,
"gid" : 0,
"ppid" : 1
},
Spark Event Count Rollups
• total counts - org and agent
• user counts - org, agent, user and exe
• IP counts that access Maxmind geo DB file on each worker
– IP source counts - org, exe, ip, country, city, lat, lon
– IP destination counts - same fields as IP source counts
• host counts - org, comment
• port source counts - org, exe and port
• port destination counts
• CloudTrail events of various (four) kinds
Sample Rollups Table
insert_time | event_time | org_id | agent_id | count
---------------------+---------------------+--------------------------+--------------------------+--------
2015-11-08 15:41:18 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 216652
2015-11-08 20:01:24 | 2015-11-08 19:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 207962
2015-11-08 15:31:17 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 160354
2015-11-08 15:01:34 | 2015-11-08 14:00:00 | 5522d0276c15919d69000y01 | 563bd15419d2f85c2c9085c1 | 160098
2015-11-07 21:51:31 | 2015-11-07 20:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 149813
2015-11-08 03:08:53 | 2015-11-08 00:00:00 | 533af57f41e9885820006771 | 5632c6431612b6096d195d02 | 144999
2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e988582000a7b1 | 55fc8beb7f8ce68d5052b6c9 | 143072
2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f989dacc155d6d5e2627cf | 141468
2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f98b41cc155d6d5e262811 | 137778
2015-11-17 15:21:11 | 2015-11-17 15:00:00 | 5522d0276c15919d69000x01 | 566f217100229a8b2bdce000 | 128375
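A total-counts rollup like the one in this table can be sketched in plain Python. This is a stand-in for the Spark job, not its actual code: in Spark the same keyed count might be a map to a key followed by `reduceByKey`. It assumes `event_time` is the event's `timestamp` field (epoch milliseconds) truncated to the hour in UTC:

```python
from collections import Counter
from datetime import datetime, timezone

def hour_bucket(timestamp_ms):
    """Truncate an epoch-millisecond timestamp to the start of its UTC hour."""
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return ts.replace(minute=0, second=0, microsecond=0)

def total_counts(events):
    """Count events per (event hour, organization, agent)."""
    return Counter(
        (hour_bucket(e["timestamp"]), e["organization_id"], e["agent_id"])
        for e in events
    )

# Made-up events; only the fields used by the rollup key are shown.
events = [
    {"timestamp": 1429902590000, "organization_id": "org-1", "agent_id": "a-1"},
    {"timestamp": 1429902599000, "organization_id": "org-1", "agent_id": "a-1"},
    {"timestamp": 1429906200000, "organization_id": "org-1", "agent_id": "a-2"},
]
counts = total_counts(events)
```

Each key of `counts` corresponds to one row of the rollups table; writing the rows to Postgres is left out of the sketch.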
Scratch Event Data
• S3
– Easy to get started with Spark S3 support (gzip support)
– Mean write time is 350 ms - 99.9 percentile is 2.3 sec!
– This clogs up our processing pipeline
– S3 is “eventually consistent” - there are no SLAs
guaranteeing when a written object is available
• Alternatives
– NoSQL store such as Redis - under active exploration now
– AWS Elastic File System - when will it arrive (April blog)?
– HDFS
S3 Write Percentiles
Percentile Millis
50.00 349
90.00 560
99.00 1413
99.50 2081
99.90 23,898
99.99 50,281
max 139,596
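A percentile table like this can be derived from raw write latencies. The sketch below uses the nearest-rank definition, which is an assumption (the talk does not say how its percentiles were computed), and the sample latencies are made up for illustration:

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

# Made-up write latencies in milliseconds.
latencies_ms = [310, 349, 352, 401, 560, 298, 1413, 333, 347, 2081]
summary = {p: percentile(latencies_ms, p) for p in (50, 90, 99)}
```

With a long-tailed distribution such as this one, the high percentiles sit far above the median, which is why the tail matters more to the pipeline than the mean.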
S3 vs Redis Write Latencies
All write latencies are in milliseconds.
The “10-minute intervals” column refers to the sample size.
                 Mean    Max       10-min intervals
S3               349     139,596   15,172
Redis            43      168       7,313
Speedup factor   8       831
Data Expiration
• The problem of big data is how to efficiently delete data
• Every byte costs - AWS is not cheap
• Big data at scale costs big bucks
• In the real world, companies have to deal with data retention
• Deleting objects
– Spark
• After processing S3 objects, Spark deletes them
• Backup with AWS life-cycle expiration (1 day)
– Redis
• Use Redis TTLs
RabbitMQ Flow Control - Message Ack-ing
Flow control is fun!
• Fast publisher - slow consumer
Message Ack-ing
• MultipleRmqAckManager - Acknowledge all messages up to
and including the supplied delivery tag
• SingleRmqAckManager - Acknowledge just the supplied
delivery tag
• When we have written an S3 object, we ack all the RMQ
messages in that batch
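The two acknowledgement styles correspond to the `multiple` flag on AMQP `basic.ack`. The sketch below only models the semantics in memory; the class is hypothetical and is not the `MultipleRmqAckManager` or `SingleRmqAckManager` from the talk, nor a real RabbitMQ client:

```python
class UnackedTags:
    """Tracks delivery tags that were received but not yet acknowledged."""

    def __init__(self):
        self.pending = set()

    def deliver(self, tag):
        """Record a newly delivered message."""
        self.pending.add(tag)

    def ack_single(self, tag):
        """Acknowledge just the supplied delivery tag."""
        self.pending.discard(tag)

    def ack_multiple(self, tag):
        """Acknowledge every pending tag up to and including `tag`."""
        self.pending = {t for t in self.pending if t > tag}

channel = UnackedTags()
for tag in range(1, 6):
    channel.deliver(tag)
channel.ack_single(2)     # only message 2 is acknowledged
channel.ack_multiple(4)   # messages 1, 3 and 4 are acknowledged together
```

Acknowledging a whole batch with one multiple-ack after a successful write keeps the ack traffic low and means unwritten messages stay unacknowledged if the write fails.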
RabbitMQ Prefetch Count
• Limit the number of unacknowledged messages on a channel
• Important for Event Writer to handle so as not to OOM during
traffic surges
• Sadly RMQ doesn’t implement AMQP prefetch for byte size
• Only supports prefetch count for number of messages
• This works if the messages are roughly the same size
• Fortunately this is the case for us
Fault Tolerance
• Created generic fault tolerance manager
• Used for retrying RabbitMQ consumer and S3 writes
• Pluggable retry algorithm - linear backoff, exponential
backoff, whatever you wish
• Looked at third-party packages (e.g. Spring Retry) but they
didn’t quite fit our particular needs
• RMQ reads rarely fail
• Do see the occasional S3 write failure
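A retry helper with a pluggable backoff policy might look like the sketch below. It illustrates the idea only and is not the fault tolerance manager from the talk; all names, the attempt limit, and the delay values are assumptions:

```python
import time

def linear_backoff(base_seconds):
    """Delay grows by a fixed step: base, 2*base, 3*base, ..."""
    return lambda attempt: base_seconds * attempt

def exponential_backoff(base_seconds, factor=2):
    """Delay is multiplied by `factor` each time: base, base*factor, ..."""
    return lambda attempt: base_seconds * factor ** (attempt - 1)

def retry(operation, backoff, max_attempts=5, sleep=time.sleep):
    """Call `operation` until it succeeds or `max_attempts` is reached."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last failure
            sleep(backoff(attempt))

# Simulated write that fails twice, then succeeds.
calls = {"n": 0}

def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated S3 write failure")
    return "ok"

delays = []  # capture the delays instead of actually sleeping
result = retry(flaky_write, exponential_backoff(0.1), sleep=delays.append)
```

Passing the backoff as a function keeps the retry loop independent of the policy, so the same helper can wrap both the RabbitMQ consumer and the S3 writes.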
Spark and Metrics
• Metrics and monitoring are vital to Threat Stack
• Any production app must have a way of allowing for
app-specific metrics
• Spark’s custom metrics are very rudimentary
• Custom metrics capabilities - driver and/or worker?
• Spark Codahale custom metrics - we apparently have to
extend a private Spark class!
• You need to extend org.apache.spark.metrics.source.Source
and include it in your jar!

GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 

Cloud Security Monitoring and Spark Analytics

  • 1. Cloud Security Monitoring and Spark Analytics Boston Spark Meetup Threat Stack Andre Mesarovic 10 December 2015
  • 2. Threat Stack - Who We Are • Leadership team with deep security, SaaS, and big data experience • Launched on stage at 2014 AWS re:Invent • Founded by principal engineers from Mandiant in 2012 • Based in Boston's Innovation District • 27 employees and hiring • On Track for 100+ Customers and 10,000 Monitored Servers by Year-End 2015 • Funded by Accomplice (Atlas) and .406 Ventures
  • 3. Threat Stack - Use Cases • Insider Threat Detection • External Threat Detection • Data Loss Detection • Regulatory Compliance Support - HIPAA, PCI
  • 4. Threat Stack - Key Workload Questions • What processes are running on all my servers? • Did a process suddenly start making outbound connections? • Who is logged into my servers and what are they running? • Has anyone logged in from non-standard locations? • Are any critical system and data files being changed? • What happened on a transient server 7 weeks ago? • Who is changing our Cloud infrastructure?
  • 5. Threat Stack - Features • Deep OS Auditing • Behavior-based Intrusion Detection • DVR Capabilities • Customizable Alerts • File Integrity Monitoring • DevOps Enabled Deployment
  • 6. Threat Stack - Tech Stack • RabbitMQ • Nginx • Cassandra • Elasticsearch • MongoDB • Redis - ElastiCache • Postgres - RDS • Languages: Node.js, C, Scala and a bit of Lua • Chef • Librato, Grafana, Sensu, Sentry, PagerDuty • Slack
  • 7. Spark Cluster • Spark 1.4.1 • Spark standalone cluster manager - no Mesos or YARN • One long-running Spark job - running over 2 months • Separate driver node – Since the driver has a different workload, it can be scaled independently of the workers • We like our cluster to be a homogeneous set of worker nodes – One executor per worker • Monitored by Grafana • Custom Codahale metrics consumed by Grafana – Only implemented for the driver - for the workers it’s a TODO
  • 9. Threat Stack Overall Architecture
  • 11. Spark Web UI - Master
  • 12. Spark Web UI - Jobs
  • 13. Event Pipeline Statistics • Mean event is 700 bytes
                 | Second  | 10 Min Interval | Day    | Month
    -------------+---------+-----------------+--------+--------
    Mean events  | 75 K    | 45 M            | 6.48 B | 194 B
    Spike events | 125 K   | 75 M            | 10.8 B | 324 B
    Mean bytes   | 52.5 MB | 31.5 GB         | 4.5 TB | 136 TB
    Spike bytes  | 87.5 MB | 52.5 GB         | 7.6 TB | 227 TB
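The figures on this slide follow from the per-second event rates and the 700-byte mean event size. A minimal sketch of that arithmetic, assuming a 30-day month and decimal units (neither is stated on the slide); the names `WINDOW_SECONDS` and `scale` are illustrative:

```python
MEAN_EVENT_BYTES = 700  # mean event size from the slide

# Window lengths in seconds. The 30-day month is an assumption.
WINDOW_SECONDS = {
    "second": 1,
    "10 min": 600,
    "day": 86_400,
    "month": 30 * 86_400,
}


def scale(events_per_second):
    """Return {window: (event_count, byte_count)} for a steady per-second rate."""
    return {
        window: (events_per_second * secs,
                 events_per_second * secs * MEAN_EVENT_BYTES)
        for window, secs in WINDOW_SECONDS.items()
    }


mean = scale(75_000)    # mean rate from the slide
spike = scale(125_000)  # spike rate from the slide

for label, table in (("mean", mean), ("spike", spike)):
    for window, (events, size) in table.items():
        print(f"{label:5} {window:7} {events:>16,} events {size:>20,} bytes")
```

Scaling the byte counts the same way as the event counts makes it easy to cross-check any one column against the others.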
  • 14. Problem that Spark Analytics Addresses • Overview – Spark replaced home-grown rollups and Elasticsearch facets – Original solutions did not scale well • Home-grown rollups of streaming data – Used eep.js - subset of CEP that adds aggregate functions and windowed stream operations to Node.js. – Postgres stored procedures to upsert rolled up values – Problem: way too many Postgres transactions • Elasticsearch facets – Great for initial moderate volume – Running into scaling issues as we grow
  • 15. Why not Spark Streaming? • We first tried to use Spark Streaming • Ran OK in dev env but failed in prod env - 20x • Too many endurance and scaling problems • Ran out of file descriptors on workers very quickly – Sure, we can write a cron job but do we want to? – Zillions of 24 byte files that were never cleaned up • Too many out-of-memory errors on workers – Intermittent and random OOMs – Workers crashed in 3 days due to tiny memory leak • No robust RabbitMQ receiver - everyone is focused on Kafka • Love the idea, but just wasn’t ready for prime time
  • 16. Current Spark Solution • Decouple event consumption and Spark processing • Two processes: Event Writer and Spark Analytics • Event Writer consumes events from RabbitMQ firehose – Writes batches to scratch store every 10 min interval • Spark job wakes up every 10 min to roll up events by different criteria into Postgres – For example, at 10:20 Spark job processes the data from 10:10 to 10:20 • Spark then deletes the interval data of 10:10 to 10:20 • Spark uptime: 64 days since Oct. 7, 2015
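The wake-up rule on this slide (at 10:20, process 10:10 to 10:20) amounts to flooring the current time to a 10-minute boundary and stepping back one interval. A minimal sketch, assuming the job derives the window from the wall clock; `previous_interval` and `interval_prefix` are hypothetical names, and the key layout is invented for illustration:

```python
from datetime import datetime, timedelta

INTERVAL = timedelta(minutes=10)


def previous_interval(now: datetime) -> tuple[datetime, datetime]:
    """Return (start, end) of the most recently completed 10-minute interval."""
    # Floor `now` to the nearest 10-minute boundary at or before it.
    end = now.replace(minute=now.minute - now.minute % 10,
                      second=0, microsecond=0)
    return end - INTERVAL, end


def interval_prefix(start: datetime) -> str:
    """Hypothetical scratch-store key prefix for one interval's batches."""
    return f"events/{start:%Y%m%d-%H%M}/"


start, end = previous_interval(datetime(2015, 12, 10, 10, 20, 5))
print(start, end, interval_prefix(start))
```

Keying the scratch data by interval start also makes the later delete step a single prefix operation.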
  • 17. Basic Workflow • Event Writer consumes RMQ messages and writes them to S3 • RMQ messages are in MessagePack format • Each message is one doc per org/agent/type: a header plus an array of events • Event Writer flattens this into a batch of events • Output is a gzipped JSON sequence file - one JSON object per line • Event Writer writes fixed-size output batches of events to S3 • Current memory buffer for the batch is 100 MB • This compresses down to 3.5 MB - 28x compression
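The flatten-and-compress step described on this slide could look roughly like the following. This is a sketch under assumptions, not the Event Writer's actual code: the message is taken as already decoded from MessagePack into a dict, and the field names `header` and `events` are hypothetical.

```python
import gzip
import io
import json


def flatten(message: dict) -> list[dict]:
    """Copy the per-message header fields into every event of the message."""
    header = message["header"]  # hypothetical field name
    return [{**header, **event} for event in message["events"]]


def encode_batch(events: list[dict]) -> bytes:
    """Serialize events as gzipped JSON, one object per line."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as gz:
        for event in events:
            gz.write(json.dumps(event).encode("utf-8") + b"\n")
    return buf.getvalue()


def decode_batch(blob: bytes) -> list[dict]:
    """Inverse of encode_batch: what a reader of the batch would do."""
    return [json.loads(line) for line in gzip.decompress(blob).splitlines()]


message = {
    "header": {"organization_id": "org-1", "agent_id": "agent-1", "_type": "audit"},
    "events": [{"syscall": "execve", "pid": 7829}, {"syscall": "connect", "pid": 7830}],
}
blob = encode_batch(flatten(message))
print(len(blob), decode_batch(blob)[0])
```

One JSON object per line keeps the output splittable by line, which is what lets a reader process a decompressed batch record by record.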
  • 18. Advantages of Current Solution • Separation of concerns - each process is focused on doing one thing best • Event Writer is concerned with non-trivial RMQ flow control • Spark simply reads event sequences from scratch storage • Thus Spark has more resources to compute rollups • Each app can scale independently • Spark Streaming was trying to do too much - both handle RMQ ingestion and analytics processing • Current solution is more robust
  • 19. Capacity and Scaling • Good news - Spark scales linearly for us • We ran tests with different numbers of workers and results were linear • Elasticity: we can independently scale the Event Writers and the Spark cluster • With Spark Streaming we could not dynamically add more RMQ receivers without restarting the app
  • 20. Event Writer Stats • One Event Writer per RabbitMQ exchange • We have 3 RMQ exchanges • 10 minute interval for buffering events • 100 MB in-memory event buffer compresses down to 3.5 MB • Compression factor of 28x • 600 S3 objects per interval (compressed) • 2.1 GB per interval (uncompressed would be 58.8 GB) • Need 2 intervals present - current and previous - 4.2 GB (118 GB)
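The storage figures on this slide can be reproduced from the object count and object size. A small sketch, assuming decimal units (1 GB = 1000 MB) and the slide's rounded compression factor of 28:

```python
OBJECTS_PER_INTERVAL = 600   # from the slide
COMPRESSED_OBJECT_MB = 3.5   # from the slide
COMPRESSION_FACTOR = 28      # rounded factor quoted on the slide
INTERVALS_RETAINED = 2       # current and previous

compressed_gb = OBJECTS_PER_INTERVAL * COMPRESSED_OBJECT_MB / 1000
uncompressed_gb = compressed_gb * COMPRESSION_FACTOR

print(f"per interval: {compressed_gb:.1f} GB compressed, "
      f"{uncompressed_gb:.1f} GB uncompressed")
print(f"retained:     {compressed_gb * INTERVALS_RETAINED:.1f} GB compressed, "
      f"{uncompressed_gb * INTERVALS_RETAINED:.1f} GB uncompressed")
```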
  • 21. Event Types • audit - accept, bind, connect, exit, etc. • login - login, logout • host • file • network
  • 22. Event Example
      {
        "organization_id" : "3d0c49e818bac99c72b7088665342daf30a3bcd7",
        "agent_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
        "arguments" : "/usr/sbin/sshd -D -R",
        "_id" : "835af48534bfd4bc60f8c5882dd565c5a84e4b94",
        "_type" : "audit",
        "_insert_time" : 1429902593,
        "args" : [ "/usr/sbin/sshd", "-D", "-R" ],
        "user" : "root",
        "group" : "root",
        "path" : [ "/usr/sbin/sshd", null ],
        "exe" : "/usr/sbin/sshd",
        "timestamp" : 1429902590000,
        "type" : "start",
        "syscall" : "execve",
        "command" : "sshd",
        "uid" : 0,
        "euid" : 0,
        "gid" : 0,
        "egid" : 0,
        "exit" : 0,
        "session" : 4294967295,
        "pid" : 7829,
        "ppid" : 873,
        "success" :,
        "parent_process" : {
          "pid" : 873,
          "exe" : "/usr/sbin/sshd",
          "command" : "sshd",
          "args" : [ "/usr/sbin/sshd", "-D" ],
          "loginuid" : 4294967295,
          "timestamp" : 1427337850230,
          "uid" : 0,
          "gid" : 0,
          "ppid" : 1
        },
  • 23. Spark Event Count Rollups • total counts - org and agent • user counts - org, agent, user and exe • IP counts, which use the MaxMind geo DB file on each worker – IP source counts - org, exe, ip, country, city, lat, lon – IP destination counts - same fields • host counts - org, comment • port source counts - org, exe and port • port destination counts • CloudTrail events of various (four) kinds
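Each rollup on this slide is a count of events grouped by a key. The sketch below shows the "total counts" rollup (org and agent) in plain Python as a stand-in for the Spark job, which would express the same thing as a map to key/1 pairs followed by a reduce by key. The hourly bucket and UTC time zone are assumptions, and the field names are taken from the event example.

```python
from collections import Counter
from datetime import datetime, timezone


def hour_bucket(timestamp_ms: int) -> str:
    """Truncate an epoch-millisecond timestamp to the start of its UTC hour."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return dt.replace(minute=0, second=0, microsecond=0).strftime("%Y-%m-%d %H:%M:%S")


def total_counts(events) -> Counter:
    """Count events per (event_time bucket, org, agent)."""
    counts = Counter()
    for event in events:
        key = (hour_bucket(event["timestamp"]),
               event["organization_id"],
               event["agent_id"])
        counts[key] += 1
    return counts


events = [
    {"organization_id": "org-1", "agent_id": "a-1", "timestamp": 1429902590000},
    {"organization_id": "org-1", "agent_id": "a-1", "timestamp": 1429902591000},
    {"organization_id": "org-1", "agent_id": "a-2", "timestamp": 1429902590000},
]
for key, count in total_counts(events).items():
    print(key, count)
```

The other rollups differ only in the key: adding `user` and `exe` gives the user counts, and so on.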
  • 24. Sample Rollups Table
         insert_time     |     event_time      |          org_id          |         agent_id         | count
    ---------------------+---------------------+--------------------------+--------------------------+--------
     2015-11-08 15:41:18 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 216652
     2015-11-08 20:01:24 | 2015-11-08 19:00:00 | 5522d0276c15919d69000x01 | 563bd15419d2f85c2c9085c1 | 207962
     2015-11-08 15:31:17 | 2015-11-08 15:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 160354
     2015-11-08 15:01:34 | 2015-11-08 14:00:00 | 5522d0276c15919d69000y01 | 563bd15419d2f85c2c9085c1 | 160098
     2015-11-07 21:51:31 | 2015-11-07 20:00:00 | 5522d0276c15919d69000x01 | 5665c53b04d674f048e0892e | 149813
     2015-11-08 03:08:53 | 2015-11-08 00:00:00 | 533af57f41e9885820006771 | 5632c6431612b6096d195d02 | 144999
     2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e988582000a7b1 | 55fc8beb7f8ce68d5052b6c9 | 143072
     2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f989dacc155d6d5e2627cf | 141468
     2015-11-08 03:08:53 | 2015-11-08 01:00:00 | 533af57f41e9885820006771 | 55f98b41cc155d6d5e262811 | 137778
     2015-11-17 15:21:11 | 2015-11-17 15:00:00 | 5522d0276c15919d69000x01 | 566f217100229a8b2bdce000 | 128375
  • 25. Scratch Event Data • S3 – Easy to get started with Spark S3 support (gzip support) – Mean write time is 350 ms - 99.9 percentile is 2.3 sec! – This clogs up our processing pipeline – S3 is “eventually consistent” - there are no SLAs guaranteeing when a written object is available • Alternatives – NoSQL store such as Redis - under active exploration now – AWS Elastic File System - when will it arrive (April blog)? – HDFS
  • 26. S3 Write Percentiles
     Percentile | Millis
    ------------+---------
          50.00 |     349
          90.00 |     560
          99.00 |   1,413
          99.50 |   2,081
          99.90 |  23,898
          99.99 |  50,281
            max | 139,596
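A percentile table like this one can be derived from raw write-latency samples. The slides do not say how the percentiles were computed; the sketch below uses the nearest-rank definition as one reasonable choice, and the sample data is made up to show how a small slow tail dominates the high percentiles.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up latencies in milliseconds: mostly fast writes with a slow tail.
latencies_ms = [350] * 990 + [2_000] * 9 + [24_000]

for p in (50, 90, 99, 99.9):
    print(f"{p:>5} {percentile(latencies_ms, p):>7,}")
print(f"  max {max(latencies_ms):>7,}")
```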
  • 27. S3 vs Redis Write Latencies • All write latencies are in milliseconds. The “10-minute intervals” column refers to the sample size.
                   | Mean | Max     | 10-min intervals
    ---------------+------+---------+------------------
    S3             | 349  | 139,596 | 15,172
    Redis          | 43   | 168     | 7,313
    Speedup factor | 8    | 831     |
  • 28. Data Expiration • The problem of big data is how to efficiently delete data • Every byte costs - AWS is not cheap • Big data at scale costs big bucks • In the real world, companies have to deal with data retention • Deleting objects – Spark • After processing S3 objects, Spark deletes them • Backup with AWS life-cycle expiration (1 day) – Redis • Use Redis TTLs
  • 29. RabbitMQ Flow Control - Message Ack-ing • Flow control is fun! – Fast publisher, slow consumer • Message ack-ing – MultipleRmqAckManager - acknowledges all messages up to and including the supplied delivery tag – SingleRmqAckManager - acknowledges just the supplied delivery tag • When we have written an S3 object, we ack all the RMQ messages in that batch
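The two ack strategies named on this slide differ only in how they use AMQP's `multiple` flag on `basic.ack`. The sketch below is not Threat Stack's implementation: the class names are hypothetical Python analogues, and `RecordingChannel` is a stand-in that records acks instead of talking to a broker (its `basic_ack(delivery_tag, multiple)` signature mirrors the AMQP method).

```python
class RecordingChannel:
    """Stand-in for an AMQP channel; records acks rather than sending them."""

    def __init__(self):
        self.acks = []

    def basic_ack(self, delivery_tag, multiple=False):
        self.acks.append((delivery_tag, multiple))


class SingleAckManager:
    """Acknowledge each delivery tag in the batch individually."""

    def __init__(self, channel):
        self.channel = channel

    def ack_batch(self, delivery_tags):
        for tag in delivery_tags:
            self.channel.basic_ack(delivery_tag=tag, multiple=False)


class MultipleAckManager:
    """Acknowledge everything up to and including the highest tag in one call."""

    def __init__(self, channel):
        self.channel = channel

    def ack_batch(self, delivery_tags):
        if delivery_tags:
            self.channel.basic_ack(delivery_tag=max(delivery_tags), multiple=True)


batch = [11, 12, 13]  # tags of the messages written into one S3 object

single = RecordingChannel()
SingleAckManager(single).ack_batch(batch)

multi = RecordingChannel()
MultipleAckManager(multi).ack_batch(batch)

print(single.acks)
print(multi.acks)
```

The multiple-ack variant costs one round trip per batch, but it also acknowledges any earlier unacked message on the same channel, so it is only safe when every outstanding message up to that tag has been durably written.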
  • 30. RabbitMQ Prefetch Count • Limits the number of unacknowledged messages on a channel • Important for the Event Writer to set so that it does not OOM during traffic surges • Sadly, RMQ doesn’t implement AMQP prefetch by byte size • It only supports a prefetch count, i.e. a number of messages • This works if the messages are roughly the same size • Fortunately this is the case for us
  • 31. Fault Tolerance • Created generic fault tolerance manager • Used for retrying RabbitMQ consumer and S3 writes • Pluggable retry algorithm - linear backoff, exponential backoff, whatever you wish • Looked at third party packages (e.g. Spring Retry) but didn’t quite fit our particular needs • RMQ reads rarely fail • Do see the occasional S3 write failure
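A generic retry helper with a pluggable backoff policy, in the spirit of this slide, might look like the following. This is a minimal sketch, not the fault tolerance manager itself: all names are hypothetical, and the backoff policy is modelled as an iterator of delays so that linear, exponential, or any other schedule can be swapped in.

```python
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def linear_backoff(step: float) -> Iterator[float]:
    """Yield step, 2*step, 3*step, ... seconds."""
    delay = step
    while True:
        yield delay
        delay += step


def exponential_backoff(base: float, factor: float = 2.0) -> Iterator[float]:
    """Yield base, base*factor, base*factor**2, ... seconds."""
    delay = base
    while True:
        yield delay
        delay *= factor


def with_retries(operation: Callable[[], T], delays: Iterator[float],
                 max_attempts: int, sleep: Callable[[float], None] = time.sleep) -> T:
    """Run `operation`, sleeping per `delays` between failed attempts.

    Re-raises the last exception once `max_attempts` is exhausted.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(next(delays))
    raise AssertionError("unreachable")


# Usage: a write that fails twice before succeeding (simulated).
calls = {"n": 0}
slept = []


def flaky_write():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("simulated S3 write failure")
    return "ok"


result = with_retries(flaky_write, exponential_backoff(1.0), max_attempts=5,
                      sleep=slept.append)
print(result, slept)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; production code would leave the default `time.sleep`.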
  • 32. Spark and Metrics • Metrics and monitoring are vital to Threat Stack • Any production app must have a way of exposing app-specific metrics • Spark’s custom metrics are very rudimentary • Custom metrics capabilities - driver and/or worker? • Spark Codahale custom metrics - we apparently have to extend a Spark private class! • You need to extend org.apache.spark.metrics.source.Source and include it in your jar!