Having many different technologies within an organization can be problematic for developers and operations alike. Structuring those systems into discrete modules not only abstracts away a lot of the complexity of a heterogeneous architecture, it also allows the evolution of systems using common access and storage patterns. This session will discuss how to think about, architect, and maintain a service architecture for a big data system.
4. Big Architectures for Big
Data
Eric Lubow @elubow
#Cassandra13
The Real Truth
Even with the right tools, 80% of the work of building a big data system is acquiring and refining
SimpleReach
• Millions of URLs per day
• Over 1.25 billion page views per month
• 500m events per day (~6k events/second)
• Auto-scale 125-160 machines depending on traffic
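The per-second figure follows from the daily total. A quick check of the arithmetic, assuming events are spread evenly over the day (real traffic is burstier, so peaks will be higher):

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def events_per_second(events_per_day: float) -> float:
    """Average event rate, assuming a perfectly even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


rate = events_per_second(500_000_000)
print(f"{rate:,.0f} events/second")  # a little under 6k on average
```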
And It Goes Like This...
[Diagram: C*, Vertica]
Goals
• Consistent non-data storage layer access patterns
• Data accuracy across storage engines
• Minimize downtime/Minimize cost of downtime
• High availability
• Allow access to many toolsets (for all languages, DBs, Engines)
• Clients should have minimal architecture knowledge
Consistent Access Patterns
[Diagram: realtime_score, (‘score’, ‘realtime’)]
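One way to read the pairing on this slide is that an endpoint name resolves to a (resource, variant) key, so clients ask by name and never need to know which storage engine answers. The sketch below is illustrative only: the `<variant>_<resource>` naming convention, `parse_endpoint`, `register`, `call`, and the stub handler are all assumptions, not the talk's actual Internal API.

```python
from typing import Callable, Dict, Tuple

Key = Tuple[str, str]  # (resource, variant), e.g. ("score", "realtime")


def parse_endpoint(name: str) -> Key:
    """Split 'realtime_score' into ('score', 'realtime').

    Assumed convention: '<variant>_<resource>'. The slide only shows the pair.
    """
    variant, sep, resource = name.partition("_")
    if not sep or not variant or not resource:
        raise ValueError(f"expected '<variant>_<resource>', got {name!r}")
    return (resource, variant)


# Hypothetical registry: the service, not the client, picks the backend.
HANDLERS: Dict[Key, Callable[[str], dict]] = {}


def register(key: Key):
    """Decorator that files a handler under its (resource, variant) key."""
    def wrap(fn):
        HANDLERS[key] = fn
        return fn
    return wrap


@register(("score", "realtime"))
def realtime_score(content_id: str) -> dict:
    # Stub: a real handler would query whichever engine holds this data.
    return {"content_id": content_id, "score": 0.0, "source": "stub"}


def call(name: str, content_id: str) -> dict:
    """Look up the handler by endpoint name and invoke it."""
    return HANDLERS[parse_endpoint(name)](content_id)
```

With this shape, swapping the storage engine behind `("score", "realtime")` changes one handler and no client code.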
Authentication, Tracking,
• Per service access keys
• Track call volume by access key
• Prevent internal denial of service
• Monitor availability and performance
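These ideas fit together in a small gate that sits in front of an internal API. The sketch below is a minimal illustration under stated assumptions: the class name `AccessKeyGate`, the sliding-window limit, and the injectable `clock` are hypothetical choices, not the mechanism used in the talk.

```python
import time
from collections import defaultdict, deque


class AccessKeyGate:
    """Per-key call counting plus a sliding-window rate limit."""

    def __init__(self, keys, max_calls, window_seconds, clock=time.monotonic):
        self._keys = dict(keys)           # access key -> service name
        self._max_calls = max_calls       # allowed calls per window, per key
        self._window = window_seconds
        self._clock = clock               # injectable so tests can fake time
        self._recent = defaultdict(deque)  # key -> timestamps inside the window
        self.call_counts = defaultdict(int)  # key -> total accepted calls

    def allow(self, access_key):
        """Return True if the call is authenticated and within its limit."""
        if access_key not in self._keys:
            return False  # unknown key: reject before doing any bookkeeping
        now = self._clock()
        recent = self._recent[access_key]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self._window:
            recent.popleft()
        if len(recent) >= self._max_calls:
            return False  # one noisy service cannot starve the others
        recent.append(now)
        self.call_counts[access_key] += 1
        return True
```

The running `call_counts` gives call volume by key, and the per-key window is what keeps one internal client from causing a denial of service for the rest.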
Controlled Data Flow
[Diagram: Social Event Collector, Social Data, Batch & Write Processed Data, Batch & Write Raw Data, Calculate Score, Write, NSQ Multicast, NSQ, NSQ]
NSQ by Bit.ly
• Distributed and de-centralized topology
• At least once delivery guaranteed
• Multicast style message routing
• Runtime discovery for consumers to find producers
• Allow for maintenance windows with no downtime
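Two of these properties, multicast routing and at-least-once delivery, can be illustrated with a toy in-memory model. This is not the NSQ client API: `Topic`, `channel`, `publish`, and `consume` are hypothetical names, and the channel names in the usage note are made up for illustration.

```python
from collections import deque


class Topic:
    """Toy stand-in for a topic: every channel receives its own copy."""

    def __init__(self):
        self.channels = {}

    def channel(self, name):
        """Create the channel on first use and return its queue."""
        return self.channels.setdefault(name, deque())

    def publish(self, message):
        # Multicast: one publish fans out to all channels.
        for queue in self.channels.values():
            queue.append(message)


def consume(queue, handler, max_attempts=3):
    """Drain a channel, requeueing messages whose handler raised.

    A requeued message is handed to the handler again, so handlers may
    see the same message more than once (at-least-once delivery).
    Messages must be hashable in this toy version.
    """
    attempts = {}
    delivered = []
    while queue:
        message = queue.popleft()
        attempts[message] = attempts.get(message, 0) + 1
        try:
            handler(message)
            delivered.append(message)
        except Exception:
            if attempts[message] < max_attempts:
                queue.append(message)  # requeue for another try
    return delivered
```

For example, creating channels named `write_raw` and `calculate_score` on one topic and publishing once leaves a copy of the message in each, so each consumer group processes the full stream independently. Because redelivery is possible, handlers should be idempotent.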
Path of a Packet
[Diagram: Internet, EC, Internal API, Solr, C*, Mongo, Redis, Vertica, API, Fire Hos, SC, Consumers, Queue]
Evolution Takes Work
• Know your access patterns
• Service Oriented Architecture (Internal API)
• Data accuracy checks: visual and programmatic
• Built framework for testing out engines (Storage, Queueing, etc)
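A programmatic accuracy check can be as simple as comparing the same counts read from two engines. The sketch below assumes the counts have already been fetched into plain dictionaries; `compare_counts`, the engine names, and the sample keys are hypothetical.

```python
def compare_counts(counts_by_engine, tolerance=0.0):
    """Return keys whose counts disagree across engines.

    counts_by_engine maps engine name -> {key: count}. A key is reported
    when the relative gap between its highest and lowest count exceeds
    `tolerance`. A key missing from an engine counts as 0 there.
    """
    all_keys = set()
    for counts in counts_by_engine.values():
        all_keys.update(counts)

    mismatches = {}
    for key in sorted(all_keys):
        values = {
            engine: counts.get(key, 0)
            for engine, counts in counts_by_engine.items()
        }
        high, low = max(values.values()), min(values.values())
        if high and (high - low) / high > tolerance:
            mismatches[key] = values
    return mismatches


snapshot = {
    "cassandra": {"url-a": 100, "url-b": 50},
    "vertica": {"url-a": 100, "url-b": 48},
}
print(compare_counts(snapshot))
```

A small nonzero `tolerance` is useful when engines are written at slightly different times and exact agreement is not expected at every instant.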
Homogeneous Machines at Base
[Diagram: three stacks labeled Base Image Layout, Producer, and Consumer, with an Application Group label]
Base Image Layout: Application, Organizational Base, Base AMI
Producer: Event Collection, NSQ, Mongos, App Config, Users, Monitoring, Amazon Linux
Consumer: Consumer, NSQ, Mongos, App Config, Users, Monitoring, Amazon Linux
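The layering on this slide can be expressed as data: every machine shares the same lower layers and differs only at the top. The grouping of labels into layers below is an assumption read from the slide's labels, and `build_role` is a hypothetical helper.

```python
# Assumed grouping of the slide's labels into layers (bottom to top).
BASE_AMI = ["Amazon Linux"]
ORGANIZATIONAL_BASE = ["NSQ", "Mongos", "App Config", "Users", "Monitoring"]


def build_role(application):
    """Stack the three layers, with the application on top."""
    return BASE_AMI + ORGANIZATIONAL_BASE + [application]


producer = build_role("Event Collection")
consumer = build_role("Consumer")

# Everything below the application layer is identical across roles.
print(producer[:-1] == consumer[:-1])
```

Keeping the shared layers in one place means a change to the organizational base reaches every role without per-machine edits.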
DevOps Wizardry
• Extensive use of AWS
• Monitor: Nagios, StatsD, and Graphite
• Manage: Chef, OpsWorks, csshX, Vagrant
• Deployments
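As a small illustration of the monitoring side, StatsD accepts plain-text metrics of the form `name:value|type` over UDP, conventionally on port 8125. The sketch below formats and sends such lines; the helper names and the metric names in the usage note are hypothetical, and the host and port are assumptions about a local daemon.

```python
import socket

# Metric types in the StatsD line format: counter, timer (ms), gauge.
METRIC_TYPES = ("c", "ms", "g")


def format_metric(name, value, metric_type):
    """Build one StatsD line, e.g. 'iapi.requests:1|c'."""
    if metric_type not in METRIC_TYPES:
        raise ValueError(f"unsupported metric type: {metric_type!r}")
    return f"{name}:{value}|{metric_type}"


def send_metric(line, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; no reply is expected from the daemon."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode("ascii"), (host, port))
```

A service might call `send_metric(format_metric("iapi.requests", 1, "c"))` once per request and `send_metric(format_metric("iapi.latency", 12, "ms"))` with the measured duration. Because the transport is UDP, a missing or slow collector does not block the caller.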
Evolving Amazon Tools
• Full Featured API
• OpsWorks
• CloudFormation
• S3 / CloudFront
• Elastic Beanstalk
• Elastic MapReduce
Service
[Diagram: Internal API, Solr, Real-time, C*, C*, Vertica]
Service Architecture Machines
[Diagram: three stacks labeled Base Image Layout, Proxy Machines, and Storage Machines, with an Application Group label]
Base Image Layout: Application, Organizational Base, Base AMI
Proxy Machines: iAPI Front End, nginx, App Config, Users, Monitoring, Amazon Linux
Storage Machines: Data Store, App Config, Users, Monitoring, Amazon Linux
Anatomy of an Endpoint
[Diagram: endpoints "hourly content" and "tenminute content", each shown with Mongo, Mongo, Vertica, C*, C*; Querying Machines: Helen, Helen, PyVertic, PyMon, PyMon, PyVertic]
Endpoint Breakout
• Availability
• Consistent Access Patterns
• Minimal downtime changes
• Smaller code deploys
• Non-monolithic code base
Architecture Distribution
US-EAST-1a
MONGO-SHARD-0001-B
MONGO-SHARD-0000-A
CASSANDRA-0001
CASSANDRA-0010
REDIS-0001A
VERTICA-0001
iAPI-0001
US-EAST-1b
MONGO-SHARD-0002-B
MONGO-SHARD-0001-A
CASSANDRA-0002
CASSANDRA-0011
REDIS-0001B
iAPI-0002
US-EAST-1e
MONGO-SHARD-0002-A
MONGO-SHARD-0000-B
CASSANDRA-0003
CASSANDRA-0012
VERTICA-0003
iAPI-0003
VERTICA-0002
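A layout like this can be checked programmatically: each service should appear in more than one availability zone so that losing a zone does not take it offline. The sketch below uses the host names from the slide. VERTICA-0002 is left out because the slide text does not make its zone clear, and the helpers `service_of`, `zones_per_service`, and `single_zone_services` are hypothetical.

```python
import re
from collections import defaultdict

PLACEMENT = {
    "US-EAST-1a": [
        "MONGO-SHARD-0001-B", "MONGO-SHARD-0000-A", "CASSANDRA-0001",
        "CASSANDRA-0010", "REDIS-0001A", "VERTICA-0001", "iAPI-0001",
    ],
    "US-EAST-1b": [
        "MONGO-SHARD-0002-B", "MONGO-SHARD-0001-A", "CASSANDRA-0002",
        "CASSANDRA-0011", "REDIS-0001B", "iAPI-0002",
    ],
    "US-EAST-1e": [
        "MONGO-SHARD-0002-A", "MONGO-SHARD-0000-B", "CASSANDRA-0003",
        "CASSANDRA-0012", "VERTICA-0003", "iAPI-0003",
    ],
}


def service_of(host):
    """Strip the four-digit instance id: 'CASSANDRA-0001' -> 'CASSANDRA'."""
    match = re.match(r"(.+?)-\d{4}", host)
    if match is None:
        raise ValueError(f"unrecognized host name: {host!r}")
    return match.group(1)


def zones_per_service(placement):
    """Map each service to the set of zones it runs in."""
    zones = defaultdict(set)
    for zone, hosts in placement.items():
        for host in hosts:
            zones[service_of(host)].add(zone)
    return dict(zones)


def single_zone_services(placement):
    """Services that would be lost entirely if one zone failed."""
    return sorted(
        service
        for service, zones in zones_per_service(placement).items()
        if len(zones) < 2
    )


print(single_zone_services(PLACEMENT))
```

The same check could run in monitoring or as part of provisioning, so a new host that breaks the distribution is caught early.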
The Schrute of the Problem
New Service Questions
• Can its host be completely homogeneous?
• Can it accept downtime (and what should downtime look like)?
• Does it fit into an existing service?
• Does it require datacenter distribution?
Summary
• Solutions Require Evolution
• Build, Use, and Integrate Tools
• Abstraction
• Homogeneous Distribution
• Monitoring & Automation
We’re
(Ask about Food Coma Fridays)
Questions are guaranteed in life.
Answers aren’t.
Eric Lubow
@elubow
elubow@simplereach.co
Thank you.