At Comcast Silicon Valley we have developed a general-purpose message bus for the cloud. The service is API-compatible with Amazon’s SQS/SNS and is built on Cassandra and Redis with the goal of linear horizontal scalability. In this webinar we will explore the architecture of the system and how we employ Cassandra as a central component to meet key requirements. We will also take a look at the latest performance numbers.
2. CMB
A Message Bus for the Cloud
CQS – Queuing Service
CNS – Topic based Pub Sub Service
3. Why did we build our own?
• General-purpose message bus to replace project-driven one-off solutions
• Smooth data center failover, maybe even “active-active” queues
• Must scale to millions of queues and 1000s of messages/sec (for example 1 queue per STB)
• Tight latency requirements (“10ms response time 95th pct”)
• Evaluated other options and arrived at AWS SQS/SNS
4. AWS SQS Primer
“Simple Queue Service”
• Focus on guaranteed delivery
• Best effort on ordered delivery and on avoiding duplicates
• Few simple core APIs:
– CreateQueue() / DeleteQueue()
– SendMessage()
– ReceiveMessage()
– DeleteMessage()
• Do not trust message recipients
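The receive/delete contract above can be illustrated with a minimal in-memory sketch. It is not the CQS or SQS implementation: the class name `ToyQueue`, the injectable `clock` parameter, and deleting by message ID (real SQS deletes by receipt handle) are simplifying assumptions made only for illustration. It shows why recipients need not be trusted: a received message is merely hidden for a visibility timeout and reappears unless it is explicitly deleted.

```python
import time
import uuid


class ToyQueue:
    """Hypothetical in-memory model of SQS-style receive/delete semantics."""

    def __init__(self, visibility_timeout=30.0, clock=None):
        self.visibility_timeout = visibility_timeout
        self._clock = clock or time.monotonic   # injectable for deterministic demos
        self._messages = {}         # message id -> body (dicts keep insertion order)
        self._invisible_until = {}  # message id -> time it becomes visible again

    def send_message(self, body):
        msg_id = str(uuid.uuid4())
        self._messages[msg_id] = body
        return msg_id

    def receive_message(self):
        """Return the oldest visible (id, body) and hide it, or None if none is visible."""
        now = self._clock()
        for msg_id, body in self._messages.items():
            if self._invisible_until.get(msg_id, 0.0) <= now:
                self._invisible_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete_message(self, msg_id):
        """Acknowledge a message; only now is it really gone."""
        self._invisible_until.pop(msg_id, None)
        return self._messages.pop(msg_id, None) is not None
```

A consumer that crashes after `receive_message()` and never calls `delete_message()` simply lets the timeout expire, after which another consumer receives the same message again.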
5. Why did we build our own?
AWS SQS
Guaranteed Delivery +
Simple, Standard API +
Horizontally Scalable +
Active-Active ?
DC Failover ?
Latency ?
Limitations (Msg Size, # Artifacts, …) ?
6. “Build a horizontally scalable queuing service on top of Cassandra (and Redis) which is API compatible with AWS SQS API”
7. CQS over Cassandra & Redis
• Cassandra
– Cross-DC persistence and replication
– Proven horizontal scalability
• Redis
– Meet latency requirements
– Help with best effort ordering
– Handle Visibility Timeout (VTO)
8. Cassandra Data Modeling
• How to represent queued messages in Cassandra?
– Single Column Queue
– Single Row Queue
– Multi-Row Queue
19. CQS Architecture
Recap
• Cassandra Persistence Layer
– Messages sharded across 100 rows per queue
• Avoid wide rows (> 500K)
• Minimize churn (Tombstones)
• Distribute queue among Cassandra nodes
• Redis Caching Layer
– To meet latency requirements
• Payload cache (kicks in after first miss, pre-load next 10k)
– Improve FIFOness by storing Msg IDs in Redis List
– Handle message visibility entirely in Redis (Hashtable)
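As a rough sketch of the sharding idea in this recap, the snippet below spreads one queue over 100 Cassandra row keys. Only the shard count comes from the slide; the row key format (`<queue id>_<shard>`), the random choice of shard on write, and all function names are assumptions for illustration and may differ from what CQS actually does.

```python
import random

NUM_SHARDS = 100  # from the slide: messages sharded across 100 rows per queue


def shard_row_key(queue_id, shard):
    """Hypothetical row key layout: one Cassandra row per (queue, shard)."""
    return f"{queue_id}_{shard}"


def pick_write_row(queue_id, rng=random):
    """Assumed write policy: pick a shard at random so no single row grows
    too wide and the queue is spread across the Cassandra ring."""
    return shard_row_key(queue_id, rng.randrange(NUM_SHARDS))


def all_rows(queue_id):
    """Rows a reader would have to consider to see every message of the queue."""
    return [shard_row_key(queue_id, s) for s in range(NUM_SHARDS)]
```

Because messages land in different rows, strict ordering across the whole queue is lost at the persistence layer, which is one reason the slide keeps message IDs in a Redis list to improve FIFOness.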
20. Cassandra Data Modeling
Key Cassandra Features
• Persistence and failover
– Cross-DC replication in combination with Local Quorum Reads/Writes (tunable consistency)
• Millions of queues, spiky traffic patterns
– Massive horizontal scalability
• Message order (FIFOness) / future dated messages
– Wide rows, composite column keys / TimeUUID and column sort order
• Message retention period (expiration)
– TTL
• Fast lookup of static metadata (Queues, Users etc.)
– Row Cache, Secondary Indexes
21. Cassandra Data Modeling
Lessons Learned
• Coming from RDBMS background…
– Forget the table analogy, rather:
• CF = HashMap<RowKey, TreeMap<ColKey, ColValue>>
– No need to specify column names in advance
• Wide rows, value-less columns, composite keys
– No unique constraints, no foreign keys, no joins:
• Design schema around your queries
• Use de-normalization where needed
– No inserts (everything is an update!)
• Design for idempotent operations
• Use globally unique identifiers
– But, there are indexes
• Use secondary indexes
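The "map of sorted maps" mental model from this slide can be sketched in a few lines. This is a toy, not Cassandra: `ToyColumnFamily` and its methods are hypothetical names, and sorting on read stands in for the sorted column storage a real column family provides. It illustrates three of the lessons: columns need not be declared in advance, writes are idempotent upserts, and composite column keys come back in sort order.

```python
from collections import defaultdict


class ToyColumnFamily:
    """Hypothetical model of CF = HashMap<RowKey, TreeMap<ColKey, ColValue>>."""

    def __init__(self):
        self._rows = defaultdict(dict)  # row key -> {column key: value}

    def upsert(self, row_key, col_key, value=b""):
        # There is no separate insert: writing an existing column overwrites it,
        # so repeating the same write leaves the row unchanged (idempotent).
        self._rows[row_key][col_key] = value

    def slice(self, row_key, start=None, limit=None):
        """Return (column key, value) pairs in column-key order, optionally
        starting at `start` and capped at `limit` columns."""
        cols = sorted(self._rows.get(row_key, {}).items())
        if start is not None:
            cols = [(k, v) for k, v in cols if k >= start]
        return cols[:limit]
```

With tuples such as `(timestamp, message_id)` as column keys, a slice over one row yields messages oldest first, which mirrors how composite keys and column sort order support message ordering.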
22. CQS Scalability and Availability
• Scalability
– Send(), Receive(), Delete()
• Scale with Cassandra Ring, API Servers (stateless) and Redis Shards
• Are constant-time operations
– Queues not sharded across Redis servers!
• Availability
– Depends on availability of Cassandra
– Service functions without Redis!
30. CNS Architecture
• CQS Queue preserves messages when Publish Workers are down or overloaded
• CQS Visibility Timeout takes care of guaranteed delivery
• Retry policy and guaranteed delivery
– http://docs.aws.amazon.com/sns/latest/gsg/DeliveryPolicies.html
• Publish Workers hardened for rogue endpoints
– Fail endpoints, slow endpoints, …
31. Differences SQS/SNS and CQS/CNS
• Goal: Full API compatibility
• Current state:
– All APIs implemented, most parameters supported
– Can use AWS Java SDK and others
• Limitations:
– AWS4 signatures not supported (V1 and V2 ok)
– SMS endpoints not supported, limited email support
• Enhancements:
– Additional APIs for monitoring and management: PeekMessage(), HealthCheck(), GetWorkerState(), ManageWorker(), ManageService(), GetAPIStats()
– Unlimited number of queues, topics and subscriptions
– Adjustable message size and other parameters (SNS <= 64KB, SQS <= 64KB, LP <= 20 sec, DS <= 900 sec, RP, …)