This document summarizes Christian Beedgen's presentation on using AWS to build a scalable machine data analytics service. The presentation covers the architecture of Sumo Logic's service, which ingests machine-generated log data from customers in near real-time and performs analytics. It discusses how the service is built as loosely coupled microservices deployed across AWS with automation. Challenges of scaling such a distributed system are also addressed.
2. Who Am I
• Co-Founder & CTO, Sumo Logic since 2010
– Cloud-based Machine Data Analytics Service
– Applications, Operations, Security
• Server guy, Chief Architect, ArcSight, 2001-2009
– Major SIEM player in the enterprise space
– Log Management for security & compliance
5. Agenda
• Introduction To Logs & Logging
• Why We Are Building This Service
• Architecture Of The Service
• Deployment Automation
• Loosely Coupled Components
• Lessons Learned
• Cost & Business Value
7. What Is Machine Data?
• Actually, Machine Generated Data
Curt Monash:
“Data that was produced entirely by machines OR data that is more about observing humans than recording their choices.”
Daniel Abadi:
“Machine-generated data is data that is generated as a result of a decision of an independent computational agent or a measurement of an event that is not caused by a human action.”
8. Examples Of Machine Data
• Computer, network, and other equipment logs
• Satellite and similar telemetry (espionage or science)
• Location data, RFID chip readings, GPS system output
• Temperature and other environmental sensor readings
• Sensor readings from factories, pipelines, etc.
• Output from many kinds of medical devices
9. What Are Logs?
• Logs are a kind of Machine Data
• Time-stamped bits and pieces of text
• Whispers & utterances of your infrastructure
• Written to disk to a log file by applications
• Sent over the network by devices
11. A Wealth Of Information
• Like Twitter for your infrastructure
• Machine data analytics…
• …is sentiment analysis for machines
• Free data of tremendous value
• Don’t forget to manage and analyze it
19. Anatomy Of A Log
• Timestamp with time zone!
• Log level
• Host ID & module name (process/service)
• Code location or class
• Authentication context
• Key-value pairs
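To make the anatomy above concrete, here is a minimal sketch that pulls those fields out of one log line. The line layout, the field names, and the sample line are assumptions made up for illustration; the slides do not prescribe a format, and the authentication context is simply treated as one of the key-value pairs.

```python
import re
from datetime import datetime

# Hypothetical layout, chosen only for this sketch:
# <ISO-8601 timestamp with offset> <LEVEL> <host> <module> [<class>] key=value ...
LINE = re.compile(
    r"(?P<ts>\S+) (?P<level>[A-Z]+) (?P<host>\S+) (?P<module>\S+) "
    r"\[(?P<location>[^\]]+)\] (?P<rest>.*)"
)
PAIR = re.compile(r"(\w+)=(\S+)")

def parse(line):
    """Return a dict of parsed fields, or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    # The offset in the timestamp is what makes it unambiguous across hosts.
    fields["ts"] = datetime.fromisoformat(fields["ts"])
    # Everything after the code location is read as key=value pairs,
    # including the authentication context (user=...).
    fields["pairs"] = dict(PAIR.findall(fields.pop("rest")))
    return fields

sample = ("2013-11-14T10:15:30-08:00 INFO host-17 receiver "
          "[com.example.Receiver] user=alice customer=42 bytes=1024")
record = parse(sample)
```

A line without a time zone offset would still parse here, but the resulting timestamp would be ambiguous, which is the point of the first bullet.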
20. Use Cases
• Availability & Performance
– Prevent downtime by proactive analytics, alerting
– Reduce MTTR by having all required data at your fingertips
• Application Release
– Derive metrics from development and staging systems pre-deploy
– Baselining before and comparing after deploy quickly shows errors
• Security & Compliance
– Compliance starts with having all security related logs in one place
– Analytics across all data facilitates detecting breaches and problems
21. Customer Metrics
Use Case | Customer Example | Metric
Security & Compliance | Apigee | Reduced compliance audit costs by ~50%
Availability and Performance | Ink | Saves nearly $500K annually
Application Release | Intacct | Reduced errors by 4X
22. Machine Data Is Big Data
• Volume
– Machine Data is voluminous and will continue to grow
– Our own application easily creates 1 TB of logs per week
• Velocity
– Machine Data occurs in real-time, and it is time-stamped
– Needs to be processed in real-time as well
• Variety
– Machine Data is unstructured, or poly-structured at best
– Some standard schemas exist, but certainly not for your applications
26. Legacy Products Fall Short
• Volume leads to scalability issues
– Every Log Management system will fail – I have seen it
– Why should you bother with scaling yet one more system?
• Velocity challenges processing pipelines
– What good are dashboards if they are not real-time?
– Streaming query engines are an absolute must
• Variety isn’t being embraced
– All data should be allowed into the system
– No vendor will ever know your application’s log schema
27. AWS Enables Innovation
• Attending Werner’s talk at Stanford in 2008
• First parking lot discussion
• This can apply to our space!
• Datacenter as API
• Massive power up to scraggly devs
28. AWS Enables Sumo Logic
• Entering an existing market
– Existing & established competition, some of it huge
– Catch up & differentiate at the same time
• A Big Data service
– Scaling on premise is hard and leaves the hard part to the customer
– Now we build one single system to deal with all customers
• This data is important
– Regulatory compliance is among the big drivers for collecting it
– HA & DR concerns all over the place → Amazon S3
32. Development Approach
• Developed in Scala because we like it
• Many small cohesive modules, low coupling
• Maven-based build system
• Layers of modules combined into applications
• Different applications for different concerns
• Internal Service-Oriented Architecture
• Communication via documented protocols
33. Basic Concerns
• Data ingestion
– Receiving data
– Raw storage
– Full-text indexing
• Data analysis
– Interactive analytics
– Scheduled queries
– Machine learning
– Continuous query evaluation
34. Concerns Map To Clusters
• A cluster is multiple instances of the same application
• Deployed on multiple Amazon EC2 instances
• Deployed across multiple availability zones
• Instances within a cluster are oblivious of each other
• Receive from upstream, talk to downstream
• Receive from message bus, or talk RPC
36. Receiver
• HTTPS endpoint behind Elastic Load Balancing
• Decompress messages from Collector
• Extract timestamps from messages
• Aggregate messages per-customer into blocks
• Flush blocks to message bus
• Ack to Collector
• “Statelessly stateful”/“Statefully stateless”
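The aggregate-and-flush step of the Receiver can be sketched roughly as follows. This is an illustration under assumptions, not Sumo Logic's code: the `BlockAggregator` class, the size-based flush threshold, and the `flush` callback standing in for the message bus are all hypothetical.

```python
from collections import defaultdict

class BlockAggregator:
    """Buffers messages per customer and flushes a block once it is full."""

    def __init__(self, flush, max_block_size=3):
        self.flush = flush                    # callable(customer, messages); stands in for the message bus
        self.max_block_size = max_block_size  # hypothetical flush threshold
        self.blocks = defaultdict(list)

    def add(self, customer, message):
        block = self.blocks[customer]
        block.append(message)
        if len(block) >= self.max_block_size:
            self.flush_customer(customer)

    def flush_customer(self, customer):
        block = self.blocks.pop(customer, [])
        if block:
            self.flush(customer, block)

    def flush_all(self):
        # Flush whatever is still buffered, for example on shutdown.
        for customer in list(self.blocks):
            self.flush_customer(customer)

flushed = []
agg = BlockAggregator(lambda c, b: flushed.append((c, b)), max_block_size=2)
for customer, msg in [("a", "m1"), ("b", "m2"), ("a", "m3"), ("b", "m4"), ("a", "m5")]:
    agg.add(customer, msg)
agg.flush_all()
```

The buffered blocks are the only state the Receiver holds in this sketch, which is one way to read “statelessly stateful”: losing an instance loses at most the unflushed, not yet acknowledged blocks.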
37. Raw
• Receive message blocks from message bus
• Encrypt message blocks
• Different key for every day for every customer
• Flush encrypted message blocks to Amazon S3
• Copy blocks as CSV to customer’s Amazon S3 bucket
• Ack to message bus
• Fully stateless
38. Index
• Receive message blocks from message bus
• Cache message block on disk and ack to message bus
• Add message blocks to Lucene indexes
• Deal with wildly varying timestamps
• Flush index shards to Amazon S3
• Update metadata database with index shard info
• Stateful
39. Continuous Query
• Receive message blocks from message bus
• Evaluate each message against all search expressions
• Push matching messages into respective pipelines
• Ack to message bus
• Flush results periodically for pickup by client
• Persist checkpoints periodically to Amazon S3
• Stateful, with checkpoint recovery
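The core loop of continuous query evaluation, matching every incoming message against every registered search, can be sketched as below. Search expressions are modelled as plain Python predicates and pipelines as lists; both are simplifying assumptions, since the slides do not describe the real expression language or pipeline type.

```python
def evaluate(messages, searches):
    """Route each message to every search whose predicate matches it."""
    pipelines = {name: [] for name in searches}
    for message in messages:
        for name, matches in searches.items():
            if matches(message):
                pipelines[name].append(message)
    return pipelines

# Hypothetical registered searches.
searches = {
    "errors": lambda m: "ERROR" in m,
    "logins": lambda m: "login" in m,
}
block = ["INFO login ok", "ERROR login failed", "INFO heartbeat"]
pipelines = evaluate(block, searches)
```

Note that one message can land in several pipelines, and that the cost per message grows with the number of registered searches.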
41. Query
• Fully distributed streaming query engine
• Materialize messages matching search expression
• Push messages through a pipeline of operators
• First stage – non-aggregation operators
• Second stage – aggregation operators
• Present both raw message results as well as aggregates
• Results update periodically for interactive UI experience
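A toy version of the two-stage pipeline might look like this. The operator names (`where`, `run`), the message shape, and the choice of a count as the only aggregation are assumptions for illustration; the real engine is distributed and streaming, which this single-process sketch is not.

```python
from collections import Counter

def where(predicate):
    """First-stage (non-aggregation) operator: filters message by message."""
    def op(messages):
        return [m for m in messages if predicate(m)]
    return op

def run(messages, first_stage, group_key):
    """Apply the first-stage operators, then aggregate by counting per key."""
    raw = list(messages)
    for op in first_stage:
        raw = op(raw)
    # Second stage: the aggregation needs the whole first-stage output.
    aggregates = Counter(group_key(m) for m in raw)
    return raw, aggregates

messages = [
    {"level": "ERROR", "host": "a"},
    {"level": "INFO", "host": "a"},
    {"level": "ERROR", "host": "b"},
    {"level": "ERROR", "host": "a"},
]
raw, counts = run(messages,
                  [where(lambda m: m["level"] == "ERROR")],
                  lambda m: m["host"])
```

Returning both `raw` and `counts` mirrors the slide's point that the UI presents raw message results alongside the aggregates.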
43. Why Deployment Automation
• Add 1 part developers, 1 part Datacenter-as-API, stir…
• Aim for fully integrated continuous deployment
• Checkin → unit test → integration test → deployment
• Jenkins automates it all – using AWS instances
• Deployment doesn’t mean production
• Nite → Stag → Long → Prod deployments
• There are humans involved as well!
44. Automation Enables Scale
• The goal is 100% - accept no less
• Why you need automation
– Number of deployments grows (staging, per-developer)
– Number of AWS resources per deployment grows
– Number of operators/developers grows
– Frequency of deployments, changes increases
45. Current Deployment Stats
• 4 Deployments running 24/7, 50 for development
• 20+ clusters per deployment
• 25+ software components deployed
• Hundreds of instances in production
• Less than 10 minutes to deploy from scratch
• Less than 4 minutes to restart hundreds of components
46. dsh: Another AWS deployment tool
• Model-driven, describe desired state, run to make it so
• High performance due to parallelization
• Covers all layers of the stack – AWS, OS, Sumo Logic
• Easy to use and extend, scriptable CLI
• Developer-friendly, Scala-based, high-level APIs
51. Differential Deployment
• Start by finding existing resources
– Use tagging where it is available
– Name prefixes (“prod_xxx”) where it isn’t (security groups, IAM, …)
• Fix differences to model
– Start “missing” instances
– Change security group rules, missing IAM users
• Proceed with caution
– Never delete anything that holds data
– Amazon EBS, Amazon DynamoDB, Amazon S3, Amazon RDS
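The differential approach above boils down to diffing a desired model against what already exists. The sketch below is not dsh; the `plan` function, the resource names, and the rule that non-data extras may be deleted are assumptions. Only the list of data-holding services comes from the slide.

```python
# From the slide: never delete anything that holds data.
DATA_HOLDING = {"ebs", "dynamodb", "s3", "rds"}

def plan(desired, existing):
    """Both arguments map resource name -> resource kind.

    Returns (to_create, to_delete, to_keep): resources missing from the
    deployment, extras that are safe to remove, and extras that hold data
    and are therefore left alone.
    """
    to_create = sorted(set(desired) - set(existing))
    extra = set(existing) - set(desired)
    to_delete = sorted(n for n in extra if existing[n] not in DATA_HOLDING)
    to_keep = sorted(n for n in extra if existing[n] in DATA_HOLDING)
    return to_create, to_delete, to_keep

# Hypothetical model and discovered state, using the "prod_" name prefix.
desired = {"prod_receiver_1": "instance", "prod_receiver_2": "instance", "prod_raw": "s3"}
existing = {"prod_receiver_1": "instance", "prod_old_index": "instance",
            "prod_old_vol": "ebs", "prod_raw": "s3"}
to_create, to_delete, to_keep = plan(desired, existing)
```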
55. Making It Fast
• Parallelize all the things
– Upload to Amazon S3 while booting instances while creating IAM users while setting up security groups while…
– Hyper-concurrent rolling restarts
• Fast enough for development
– Write new code or fix a bug, compile locally
– Push code to development deployment and make it live
• Optimize data transfers
– Use Amazon S3 hashes to only transfer new files
– Only upload changed JARs
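One way to read “use Amazon S3 hashes to only transfer new files” is to compare a local MD5 with the object's ETag. The sketch below assumes the ETags have already been fetched and stripped of their surrounding quotes, and that the objects were uploaded in a single part without KMS encryption, since only then is the ETag the plain MD5 of the content. The function names and file names are hypothetical.

```python
import hashlib

def md5_hex(data):
    return hashlib.md5(data).hexdigest()

def files_to_upload(local_files, remote_etags):
    """local_files: name -> bytes; remote_etags: name -> ETag already in the bucket.

    Returns the names whose content is missing remotely or differs from it.
    """
    return sorted(
        name for name, data in local_files.items()
        if remote_etags.get(name) != md5_hex(data)
    )

local = {"a.jar": b"one", "b.jar": b"two", "c.jar": b"three"}
remote = {"a.jar": md5_hex(b"one"), "b.jar": md5_hex(b"old build")}
changed = files_to_upload(local, remote)
```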
56. Making It Reliable
• Check prerequisites before you even try
– Does Prod account have room for this many instances?
– Do I have the required permissions for the AWS APIs?
– Any model discrepancies I can’t automatically resolve? Too many Amazon EBS volumes?
• Handle common failures automatically
– No m1.large in us-east-1b? Move Amazon EBS volumes to us-west-1c and try there
– Hitting the AWS API rate limit? Throttle and try again
– SSH didn’t come up on the instance? Kill it and launch another
– Eventual consistency in AWS: query until it has the expected state (tags)
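“Throttle and try again” is commonly implemented as retry with exponential backoff. The following is a generic sketch, not the tool's actual behaviour: the `Throttled` exception stands in for whatever rate-limit error an API client raises, and the attempt count and delays are arbitrary.

```python
import time

class Throttled(Exception):
    """Stand-in for a rate-limit error from an API client."""

def call_with_backoff(call, attempts=5, base_delay=1, sleep=time.sleep):
    """Retry `call` on Throttled, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except Throttled:
            if attempt == attempts - 1:
                raise  # out of attempts, surface the error
            sleep(base_delay * 2 ** attempt)

# Demo: an API call that is throttled twice before succeeding.
outcomes = [Throttled(), Throttled(), "ok"]
def flaky():
    outcome = outcomes.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome

delays = []
result = call_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
```

The same loop shape also covers the eventual-consistency case: poll until the expected state (for example a tag) is visible, waiting a little longer each time.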
57. Making It Secure
• Different AWS accounts
– Per developer
– Production
• account.xml
– All credentials for one AWS account (AWS keys, SSH keys)
– Password-protected
• IAM
– One user per Sumo component
– Minimal IAM policy
– Inject AWS credentials
• Security Groups
– Part of the model
– Minimal privileges
58. Making It Safe
• Let mistakes happen at most once
• Add safeguards to prevent operator mistakes
• Type in the deployment name before deleting anything
• Disallow risky operations in production (shutdown Prod)
• Don’t allow -SNAPSHOT code to be deployed in production
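Two of the safeguards above are easy to express as small checks. The function names and the convention that the production deployment is literally called "prod" are assumptions for this sketch.

```python
def may_delete(deployment, typed_name):
    """The operator must type the exact deployment name to confirm deletion."""
    return typed_name == deployment

def may_deploy(deployment, version):
    """Reject -SNAPSHOT builds for production; allow them elsewhere."""
    is_prod = deployment == "prod"  # hypothetical naming convention
    return not (is_prod and version.endswith("-SNAPSHOT"))

checks = [
    may_delete("stag", "stag"),           # confirmed
    may_delete("stag", "prod"),           # typo or wrong target, refused
    may_deploy("prod", "1.4.2"),          # release build in production
    may_deploy("prod", "1.5.0-SNAPSHOT"), # snapshot in production, refused
    may_deploy("dev-alice", "1.5.0-SNAPSHOT"),
]
```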
59. Making It Easy
• Automate best practices
– Distribute instances over availability zones evenly
– Register instances in Elastic Load Balancing and match AZs to instances
– Tag all resources consistently
• Consistent naming
– Generate SSH with logical names
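Distributing instances evenly over availability zones can be as simple as round-robin assignment. The instance and zone names below are made up, and the sketch ignores capacity limits in individual zones.

```python
from collections import Counter
from itertools import cycle

def assign_zones(instances, zones):
    """Assign instances to zones round-robin so counts differ by at most one."""
    return dict(zip(instances, cycle(zones)))

instances = [f"index-{i}" for i in range(1, 6)]
zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
placement = assign_zones(instances, zones)
per_zone = Counter(placement.values())
```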
60. Making It Affordable
• Developers forget to shut stuff down
– Deployment reaper automatically shuts down deployments
– Daily cost emails
• Per-team budgets
– Manager responsible to keep within budget
61. Pitfalls
• Base AMI plus scripted installation prevents auto scaling
• Security group updates cause TCP disconnects
• This is fixed in the VPC stack, however
• Parallelism can cause stampedes (for example, Amazon DynamoDB)
• Tagging API rate limits are easy to hit
63. Loose Coupling In The Large
• A deployment is made up of many things
• Some of these things need to talk to each other
• Some of these things come and go
• Don’t pass in a huge list of static dependencies
• Start each application with one parameter
$ bin/receiver prod.service-registry.sumologic.com
64. Service Registry
• Service Registry is a concept, enables discovery
• A client-side library accessing a ZooKeeper cluster
• Services are abstracted into types
• Application provides and consumes different services
• Sumo Logic services (RPC)
• Third-party services (message bus)
• AWS services (Amazon ElastiCache, Amazon RDS)
65. The Perils Of Horizontal Scale
• Scaling out a multi-tenant processing system
• 1000s of customers, 1000s of machines
• Parallelism is good, but locality has to be considered
• 1 customer distributed over 1000 machines is bad
• No single machine getting enough load for that customer
• Batches & shards will become too small
• Metadata and in-memory structures grow out of proportion
66-72. The Perils Of Horizontal Scale
[Diagram sequence over a grid of 25 Index nodes. Slides 66-69: customers are added (1, then 2, up to 8) and each customer is spread across every node. Slides 70-72: each customer is instead placed on only a few of the nodes.]
73. Customer Partitioning
• Each cluster elects a leader node via ZooKeeper
• Leader runs the partitioning logic
• Set[Customer], Set[Instance] → Map[Instance, Set[Customer]]
• Partitioning written to ZooKeeper
• Example: indexer node knows which customer’s message blocks to pull from message bus
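The signature Set[Customer], Set[Instance] → Map[Instance, Set[Customer]] can be illustrated with a deliberately simple partitioner. The slides do not say how the real partitioning logic decides; the round-robin placement and the `replicas` parameter (how many instances serve each customer) are assumptions.

```python
def partition(customers, instances, replicas=2):
    """Map each instance to the set of customers it should handle.

    Each customer is placed on `replicas` consecutive instances, so no
    customer is spread over the whole cluster.
    """
    ordered = sorted(instances)
    assignment = {instance: set() for instance in ordered}
    for i, customer in enumerate(sorted(customers)):
        for r in range(min(replicas, len(ordered))):
            assignment[ordered[(i + r) % len(ordered)]].add(customer)
    return assignment

assignment = partition({"c1", "c2", "c3", "c4"}, {"i1", "i2", "i3"}, replicas=2)
```

Sorting both inputs makes the result deterministic, so any node elected leader would compute the same map from the same sets.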
75. Some Tips On AWS S3
• Use the TransferManager class from the AWS Java SDK
– Multi-part uploads and downloads
– Multi-threaded, overall latency reduction
• Use random prefixes for keynames in Amazon S3 buckets
– Amazon S3 partitions by keyname prefix
– http://aws.typepad.com/aws/2012/03/amazon-s3-performance-tips-tricks-seattle-hiring-event.html
• Endpoint URL for Amazon S3
– s3.amazonaws.com might go to Virginia, or Pacific Northwest (!)
– If you are in us-east, use s3-external-1.amazonaws.com instead
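The random-prefix tip is often implemented with a hash prefix rather than a truly random one, so that the key can be recomputed from the logical name when reading. This sketch assumes that variant; the function name, the prefix length, and the sample key are made up.

```python
import hashlib

def prefixed_key(key, prefix_len=4):
    """Prepend a short hash of the key so keynames spread across prefixes."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"

logical = "logs/2013/11/14/block-0001"
stored = prefixed_key(logical)
```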
76. Elastic Block Store
• RAID-0 makes Amazon EBS faster
– Use LVM RAID-0 if heavy I/O is required
– Align stripe sizes with file system block sizes
• Snapshotting Amazon EBS volumes
– Snapshots eat performance
– Even for volumes with provisioned IOPS
• Overlapping snapshots
– Can be scheduled too close together, like every minute
– I/Os start taking 30+ seconds
78. Somebody Has To Pay For Lunch
• On-demand resources are very sexy
• Automation gives developers their own sandbox
• Compute is the most easily incurred cost
• You need an automated reaper
• Or just raise another round… :)
79. Elasticity Is Not An Arbitrary Need
• At least in our system, there’s baseline load
• At least in our system, the cost is in compute
• Alert-based scaling can be safe & effective
• Measure your spend with tools that are out there
• We actually use Sumo Logic for that!
• Look for a moving average of resource consumption
• Buy Reserved Instances, don’t fret the instance types
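A moving average of resource consumption is straightforward to compute. The sketch below uses a simple trailing window; the window size and the sample instance counts are made up.

```python
from collections import deque

def moving_average(values, window):
    """Trailing moving average; early entries average over fewer samples."""
    recent = deque(maxlen=window)
    averages = []
    for v in values:
        recent.append(v)
        averages.append(sum(recent) / len(recent))
    return averages

# Hypothetical daily instance counts.
usage = [10, 20, 30, 40]
smoothed = moving_average(usage, window=2)
```

The smoothed series is a reasonable place to read off the baseline load that Reserved Instances should cover.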
81. Amazon CloudTrail
• Logs! From AWS! The eagle has landed!
• Amazon CloudTrail logs your API activity to Amazon S3
• Sumo Logic will read from Amazon S3, allow analysis
83. Please give us your feedback on this presentation
BDT401
As a thank you, we will select prize winners daily for completed surveys!
Thank You