2. Threat Stack - Who We Are
• Leadership team with deep security, SaaS, and big data experience
• Launched on stage at 2014 AWS re:Invent
• Founded by principal engineers from Mandiant in 2012
• Based in Boston's Innovation District
• 27 employees and hiring
• On track for 100+ customers and 10,000 monitored servers by year-end 2015
• Funded by Accomplice (formerly Atlas Venture) and .406 Ventures
3. Threat Stack - Use Cases
• Insider Threat Detection
• External Threat Detection
• Data Loss Detection
• Regulatory Compliance Support - HIPAA, PCI
4. Threat Stack - Key Workload Questions
• What processes are running on all my servers?
• Did a process suddenly start making outbound connections?
• Who is logged into my servers and what are they running?
• Has anyone logged in from non-standard locations?
• Are any critical system and data files being changed?
• What happened on a transient server 7 weeks ago?
• Who is changing our Cloud infrastructure?
5. Threat Stack - Features
• Deep OS Auditing
• Behavior-based Intrusion Detection
• DVR Capabilities
• Customizable Alerts
• File Integrity Monitoring
• DevOps Enabled Deployment
6. Threat Stack - Tech Stack
• RabbitMQ
• Nginx
• Cassandra
• Elasticsearch
• MongoDB
• Redis - ElastiCache
• Postgres - RDS
• Languages: Node.js, C, Scala and a bit of Lua
• Chef
• Librato, Grafana, Sensu, Sentry, PagerDuty
• Slack
7. Spark Cluster
• Spark 1.4.1
• Spark standalone cluster manager - no Mesos or YARN (config sketch below)
• One long-running Spark job - up for over 2 months
• Separate driver node
– Since the driver has a different workload, it can be scaled independently of the workers
• We like our cluster to be a homogeneous set of worker nodes
– One executor per worker
• Monitored by Grafana
• Custom Codahale metrics consumed by Grafana
– Only implemented for the driver; worker-side metrics are still a TODO
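As a minimal sketch (not Threat Stack's actual configuration), a long-running job on a standalone cluster might be set up like this; the master URL, app name, and resource numbers are illustrative:

    // Sketch: SparkConf for a long-running job on a standalone cluster.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("EventRollups")              // hypothetical app name
      .setMaster("spark://spark-master:7077")  // standalone cluster manager, no Mesos/YARN
      .set("spark.cores.max", "32")            // cap total cores across the homogeneous workers
      .set("spark.executor.memory", "24g")     // standalone mode: one executor per worker per app
    val sc = new SparkContext(conf)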
13. Event Pipeline Statistics
Mean event size is 700 bytes

                Second    10-Min Interval   Day      Month
  Mean events   75 K      45 M              6.48 B   194 B
  Spike events  125 K     75 M              10.8 B   324 B
  Mean bytes    52.5 MB   31.5 GB           4.5 TB   136 TB
  Spike bytes   87.5 MB   52.5 GB           7.6 TB   227 TB
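These columns follow directly from the per-second rates (assuming a 30-day month), for example:

    75 K events/s × 600 s     = 45 M events per 10-min interval
    75 K events/s × 86,400 s  = 6.48 B events per day
    6.48 B events/day × 30    ≈ 194 B events per month
    6.48 B events × 700 bytes ≈ 4.5 TB per day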
14. Problem that Spark Analytics Addresses
• Overview
– Spark replaced home-grown rollups and Elasticsearch facets
– Original solutions did not scale well
• Home-grown rollups of streaming data
– Used eep.js - a subset of CEP (complex event processing) that adds aggregate functions and windowed stream operations to Node.js
– Postgres stored procedures to upsert rolled up values
– Problem: way too many Postgres transactions
• Elasticsearch facets
– Great at our initial, moderate volume
– Running into scaling issues as we grow
15. Why not Spark Streaming?
• We first tried to use Spark Streaming
• Ran OK in the dev env but failed in the prod env, which runs at roughly 20x the volume
• Too many endurance and scaling problems
• Ran out of file descriptors on workers very quickly
– Sure, we could write a cron job to clean them up, but do we want to?
– Zillions of 24-byte files that were never cleaned up
• Too many out-of-memory errors on workers
– Intermittent and random OOMs
– Workers crashed within 3 days due to a tiny memory leak
• No robust RabbitMQ receiver - everyone is focused on Kafka
• Love the idea, but it just wasn’t ready for prime time
16. Current Spark Solution
• Decouple event consumption and Spark processing
• Two processes: Event Writer and Spark Analytics
• Event Writer consumes events from RabbitMQ firehose
– Writes batches to a scratch store every 10-minute interval
• Spark job wakes up every 10 min to roll up events by different criteria into Postgres (sketch below)
– For example, at 10:20 the Spark job processes the data from 10:10 to 10:20
• Spark then deletes the interval data for 10:10 to 10:20
• Spark uptime: 64 days since Oct. 7, 2015
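A minimal sketch of one such wake-up, assuming a hypothetical S3 key layout and helper names (the real rollups are listed on slide 23):

    // Sketch: one 10-minute wake-up - read the previous interval from scratch
    // storage, roll up, persist, then delete the interval's data.
    import org.apache.spark.SparkContext

    def processInterval(sc: SparkContext, interval: String): Unit = {
      // e.g. at 10:20, interval = "2015-12-10-1010" covers 10:10 to 10:20
      val lines = sc.textFile(s"s3n://scratch-bucket/events/$interval/*.gz")

      // parse each JSON line and compute the rollups (see slide 23);
      // a count stands in for the real work here
      val eventCount = lines.count()

      // persist rollups to Postgres, then clean up the processed interval
      println(s"$interval: $eventCount events processed")
      // deleteIntervalFromS3(interval)   // hypothetical AWS SDK cleanup
    }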
17. Basic Workflow
• Event Writer consumes RMQ messages and writes them to S3
• RMQ messages are in MessagePack format
• Each message is one doc per org/agent/type: a header specifying those fields plus an array of events
• Event Writer flattens this into a batch of events
• Output is a gzipped JSON sequence file - one JSON object per line
• Event Writer writes fixed-size output batches of events to S3 (batching sketch below)
• Current memory buffer for the batch is 100 MB
• This compresses down to 3.5 MB - 28x compression
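A sketch of the batching idea - accumulate flattened events as JSON lines and gzip the batch at the size threshold (the S3 upload is omitted; the character count is used as a byte-size proxy):

    import java.io.ByteArrayOutputStream
    import java.util.zip.GZIPOutputStream

    val maxBatchBytes = 100 * 1024 * 1024   // 100 MB in-memory buffer
    val buffer = new StringBuilder

    def append(eventJson: String): Option[Array[Byte]] = {
      buffer.append(eventJson).append('\n')  // one JSON object per line
      if (buffer.length >= maxBatchBytes) Some(flush()) else None
    }

    def flush(): Array[Byte] = {
      val out = new ByteArrayOutputStream()
      val gz = new GZIPOutputStream(out)
      gz.write(buffer.toString.getBytes("UTF-8"))
      gz.close()
      buffer.clear()
      out.toByteArray   // ~3.5 MB for a 100 MB batch at 28x compression
    }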
18. Advantages of Current Solution
• Separation of concerns - each process is focused on doing one thing well
• Event Writer is concerned with non-trivial RMQ flow control
• Spark simply reads event sequences from scratch storage
• Thus Spark has more resources to compute rollups
• Each app can scale independently
• Spark Streaming was trying to do too much - handling both RMQ ingestion and analytics processing
• Current solution is more robust
19. Capacity and Scaling
• Good news - Spark scales linearly for us
• We ran tests with different numbers of workers and the results were linear
• Elasticity: we can independently scale the Event Writers and the Spark cluster
• With Spark Streaming we could not dynamically add more RMQ receivers without restarting the app
20. Event Writer Stats
• One Event Writer per RabbitMQ exchange
• We have 3 RMQ exchanges
• 10 minute interval for buffering events
• 100 MB in-memory event buffer compresses down to 3.5 MB
• Compression factor of 28x
• 600 S3 objects per interval (compressed)
• 2.1 GB per interval (uncompressed would be 58.8 GB)
• Need 2 intervals present - current and previous - 4.1 GB (118 GB uncompressed)
23. Spark Event Count Rollups
• total counts - org and agent
• user counts - org, agent, user and exe
• IP counts, geo-enriched via a MaxMind geo DB file on each worker
– IP source counts - org, exe, ip, country, city, lat, lon
– IP destination counts - same dimensions
• host counts - org, comment
• port source counts - org, exe and port
• port destination counts
• Four kinds of CloudTrail events
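A sketch of two of these rollups over one interval's events; the Event fields and the per-worker geo lookup are hypothetical placeholders:

    import org.apache.spark.rdd.RDD

    case class Event(org: String, agent: String, exe: String, srcIp: String)

    // stand-in for a lookup against the MaxMind geo DB file on each worker
    def geo(ip: String): (String, String) = ("US", "Boston")   // (country, city)

    def rollups(events: RDD[Event]) = {
      // total counts - org and agent
      val totals = events.map(e => ((e.org, e.agent), 1L)).reduceByKey(_ + _)

      // IP source counts - org, exe, ip, country, city
      val ipSource = events.map { e =>
        val (country, city) = geo(e.srcIp)
        ((e.org, e.exe, e.srcIp, country, city), 1L)
      }.reduceByKey(_ + _)

      (totals, ipSource)
    }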
25. Scratch Event Data
• S3
– Easy to get started - Spark reads S3 and gzip out of the box
– Mean write time is 350 ms, but the 99.9th percentile is 2.3 sec!
– This clogs up our processing pipeline
– S3 is “eventually consistent” - there are no SLAs guaranteeing when a written object becomes available
• Alternatives
– NoSQL store such as Redis - under active exploration now
– AWS Elastic File System - announced in an April blog post, but when will it arrive?
– HDFS
27. S3 vs Redis Write Latencies
All write latencies are in milliseconds.
The “10-minute intervals” column refers to the sample size.
                  Mean   Max       10-min intervals
  S3              349    139,596   15,172
  Redis           43     168       7,313
  Speedup factor  8x     831x
28. Data Expiration
• The problem of big data is how to efficiently delete data
• Every byte costs - AWS is not cheap
• Big data at scale costs big bucks
• In the real world, companies have to deal with data retention
• Deleting objects
– Spark
• After processing an interval’s S3 objects, Spark deletes them
• Backstopped by AWS lifecycle expiration (1 day)
– Redis
• Use Redis TTLs
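For the Redis path, expiration falls out of the write itself by attaching a TTL; a sketch using the Jedis client (host, key name, and TTL are illustrative):

    import redis.clients.jedis.Jedis

    val jedis = new Jedis("redis-scratch", 6379)   // hypothetical host
    val ttlSeconds = 30 * 60        // long enough for Spark to process the interval
    val batchJson: String = "..."   // placeholder for the serialized batch

    // SETEX: Redis deletes the key on its own when the TTL elapses
    jedis.setex("events:2015-12-10-1010", ttlSeconds, batchJson)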
29. RabbitMQ Flow Control - Message Ack-ing
Flow control is fun!
• Fast publisher - slow consumer
Message Ack-ing
• MultipleRmqAckManager - acknowledges all messages up to and including the supplied delivery tag
• SingleRmqAckManager - acknowledges just the supplied delivery tag
• When we have written an S3 object, we ack all the RMQ messages in that batch
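A sketch of the two ack styles using the RabbitMQ Java client; the manager names come from this slide, with the bodies reduced to the essential basicAck call:

    import com.rabbitmq.client.Channel

    class MultipleRmqAckManager(channel: Channel) {
      // multiple = true: ack everything up to and including this delivery tag,
      // e.g. after a whole batch has been written to S3
      def ack(deliveryTag: Long): Unit = channel.basicAck(deliveryTag, true)
    }

    class SingleRmqAckManager(channel: Channel) {
      // multiple = false: ack only this one message
      def ack(deliveryTag: Long): Unit = channel.basicAck(deliveryTag, false)
    }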
30. RabbitMQ Prefetch Count
• Limit the number of unacknowledged messages on a channel
• Important for the Event Writer to handle so it doesn’t OOM during traffic surges
• Sadly RMQ doesn’t implement the AMQP byte-size prefetch limit
• It only supports a prefetch count for the number of messages
• This works if the messages are of roughly the same size
• Fortunately that is the case for us
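Setting the count is a one-liner on the channel (the value is illustrative):

    import com.rabbitmq.client.{Channel, ConnectionFactory}

    val channel: Channel = new ConnectionFactory().newConnection().createChannel()

    // Cap the number of unacknowledged messages outstanding on this channel;
    // with roughly uniform message sizes this acts as a de facto byte limit.
    channel.basicQos(1000)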
31. Fault Tolerance
• Created generic fault tolerance manager
• Used for retrying RabbitMQ consumer and S3 writes
• Pluggable retry algorithm - linear backoff, exponential backoff, whatever you wish (sketch below)
• Looked at third-party packages (e.g. Spring Retry) but they didn’t quite fit our particular needs
• RMQ reads rarely fail
• We do see the occasional S3 write failure
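A sketch of such a manager with a pluggable backoff strategy (shape and names are illustrative, not the actual implementation):

    import scala.util.{Failure, Success, Try}

    type Backoff = Int => Long   // attempt number -> sleep in ms
    val linear: Backoff      = attempt => 500L * attempt
    val exponential: Backoff = attempt => 100L * (1L << attempt)

    def retry[T](maxAttempts: Int, backoff: Backoff)(op: => T): T = {
      def go(n: Int): T = Try(op) match {
        case Success(v)                    => v
        case Failure(_) if n < maxAttempts =>
          Thread.sleep(backoff(n)); go(n + 1)
        case Failure(e)                    => throw e
      }
      go(1)
    }

    // e.g. wrap an S3 write, the one place we see occasional failures:
    // retry(5, exponential) { s3Client.putObject(bucket, key, payload) }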
32. Spark and Metrics
• Metrics and monitoring are vital to Threat Stack
• Any production app must have a way to expose app-specific metrics
• Spark’s custom metrics are very rudimentary
• Custom metrics capabilities - driver and/or worker?
• Spark Codahale custom metrics - we apparently have to extend a Spark private class!
• You need to extend org.apache.spark.metrics.source.Source and include it in your jar!
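A sketch of what that looks like in Spark 1.x: because the Source trait is private[spark], the class has to live in a matching package inside your jar (the metric names are illustrative):

    package org.apache.spark.metrics.source

    import com.codahale.metrics.MetricRegistry

    class AnalyticsSource extends Source {
      override val sourceName = "analytics"
      override val metricRegistry = new MetricRegistry()

      // e.g. count rolled-up intervals and time the Postgres writes
      val intervalsProcessed = metricRegistry.counter("intervalsProcessed")
      val postgresWriteTimer = metricRegistry.timer("postgresWrite")
    }

    // Registered on the driver with:
    //   SparkEnv.get.metricsSystem.registerSource(new AnalyticsSource)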