This session covers various use cases for Cassandra at eBay. It starts with an overview of eBay’s heterogeneous data platform, which comprises SQL & NoSQL databases, and shows where Cassandra fits into it. For each use case, Jay will detail the system design, data model & multi-datacenter deployment. To conclude, Jay will summarize the best practices that guide Cassandra utilization at eBay.
2. eBay Marketplaces
Thousands of servers
Petabytes of data
Billions of SQLs/day
24x7x365
99.98+% Availability
Turning over a TB every second
Multiple Datacenters
Near-Real-time
Always online
400+ million items for sale
$75 billion+ per year in goods are sold on eBay
Big Data
112 million active users
Billions of page views/day
3. eBay Site Data Infrastructure
Don’t force!
One size does not fit all.
It’s a mixture of multiple SQL & NoSQL databases.
We use the right database for the right problem.
4. eBay Site Data Infrastructure
A heterogeneous mixture
Thousands of nodes
> 2K sharded logical hosts
> 16K tables
> 27K indexes
> 140 billion SQLs/day
> 5 PB provisioned
Hundreds of nodes
Persistent & in-memory
> 40 billion SQLs/day
10+ clusters, 100+ nodes
> 250 TB provisioned
(local HDD + shared SSD)
> 9 billion writes/day
> 5 billion reads/day
Hundreds of nodes
> 50 TB
> 2 billion ops/day
Thousands of nodes
The world’s largest cluster, with 2K+ nodes
Dozens of nodes
5. How do we scale RDBMS?
Shard
– Patterns: Modulus, lookup-based, range, etc.
– Application sees only logical shard/database
Replicate
– Disaster recovery, read availability & read scalability
Big NOs
– No transactions
– No joins
– No referential integrity constraints
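The three sharding patterns named on this slide can be sketched as small routing functions. This is a minimal illustration, not eBay's implementation: the shard count, the lookup directory, the range boundaries, and the host names are all hypothetical.

```python
import bisect

NUM_SHARDS = 4  # hypothetical number of logical shards


def modulus_shard(key: int, num_shards: int = NUM_SHARDS) -> int:
    """Modulus pattern: the shard is the key modulo the shard count."""
    return key % num_shards


# Lookup-based pattern: an explicit directory maps each key to a shard.
LOOKUP_DIRECTORY = {"alice": 0, "bob": 2}  # hypothetical entries


def lookup_shard(key: str, default: int = 0) -> int:
    """Return the shard recorded in the directory, or a default shard."""
    return LOOKUP_DIRECTORY.get(key, default)


# Range pattern: shard i holds keys below RANGE_UPPER_BOUNDS[i];
# the last shard holds everything at or above the final bound.
RANGE_UPPER_BOUNDS = [1000, 2000, 3000]  # hypothetical boundaries


def range_shard(key: int) -> int:
    """Find the first range whose upper bound is greater than the key."""
    return bisect.bisect_right(RANGE_UPPER_BOUNDS, key)


# The application only sees logical shard numbers; a separate map
# (hypothetical host names) resolves them to physical databases.
LOGICAL_TO_PHYSICAL = {0: "db-host-a", 1: "db-host-a", 2: "db-host-b", 3: "db-host-c"}


def physical_host(logical_shard: int) -> str:
    return LOGICAL_TO_PHYSICAL[logical_shard]
```

Keeping the logical-to-physical map separate means logical shards can be moved between hosts without changing the routing functions.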
6. Why Cassandra?
Multi-datacenter (active-active)
Always Available - No SPOF
Easy to scale up & down
Write performance
Distributed counters
Hadoop support
Not replacing RDBMS, but complementing!
Some use cases don’t fit well in RDBMS - sparse data, big data,
flexible schema, real-time analytics, …
Many use cases don’t need top-tier set-ups.
10. System Overview
Business Event Stream
Checkout Shipping Refund & Recoup …
Order placed
(bin/bid)
Paid Shipped Refunded
Raw data
Simple in-memory aggregations +/
Complex Event Processing +/
Cassandra’s distributed counters
Label printed per day per user
User segmentation for affiliate pricing
Orders per hour, …
Multiple Cassandra clusters
Payment
Act in real-time
Fraud Prevention
Affiliate Pricing Engine
(eBay Partner Network)
Order tracking
Real-time reporting
…
(Kept from several months to years)
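The counters mentioned on this slide (labels printed per day per user, orders per hour) can be pictured as increments against a key that includes the user and a time bucket. The sketch below is an in-memory stand-in, not Cassandra client code: the class name, method names, and metric names are hypothetical, and a real deployment would issue counter updates to the cluster instead.

```python
from collections import Counter
from datetime import datetime, timezone


class CounterStore:
    """In-memory stand-in for a counter table keyed by (metric, user, day)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def increment(self, metric: str, user_id: str, ts: datetime, delta: int = 1) -> None:
        # Bucket the event by calendar day so each user gets one counter per day.
        day = ts.strftime("%Y-%m-%d")
        self._counts[(metric, user_id, day)] += delta

    def get(self, metric: str, user_id: str, day: str) -> int:
        # Missing counters read as zero, as with an untouched counter.
        return self._counts[(metric, user_id, day)]


store = CounterStore()
event_time = datetime(2013, 6, 11, 9, 30, tzinfo=timezone.utc)
store.increment("labels_printed", "user-1", event_time)
store.increment("labels_printed", "user-1", event_time)
```

Changing the bucket format string (for example to include the hour) gives the per-hour variant.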
11. A glimpse of the Data Model
Historic & real-time insights per user per carrier.
Sudden & drastic change might be suspicious.
User bucketing based on historic & real-time buying activity.
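One way to read "sudden & drastic change might be suspicious" is as a comparison between a real-time count and a historic baseline. The check below is only a sketch of that idea: the function name, the multiplier, and the minimum count are hypothetical, and the slides do not describe the actual rule.

```python
from statistics import mean
from typing import Sequence


def is_suspicious(history: Sequence[int], today: int,
                  factor: float = 5.0, min_count: int = 10) -> bool:
    """Flag a count that is far above the historic daily average.

    history   -- past daily counts for one user and carrier
    today     -- the real-time count for the current day
    factor    -- hypothetical multiplier that defines "drastic"
    min_count -- ignore small absolute counts to avoid noisy flags
    """
    if today < min_count:
        return False
    baseline = mean(history) if history else 0.0
    # Use a floor of 1.0 so users with no history are still comparable.
    return today > factor * max(baseline, 1.0)
```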
14. System Overview
Cassandra
Fraud Detection & Prevention System
Sign-in info
Business events
(checkout, sell,…)
StaaS
Oracle
Checkout Shipping … Payment Selling
Real-time
Beacons data
Real-time Insights
Other data
Machine Learned Models
15. A glimpse of the Data Model
Collected at sign-in & stored as key-value.
Pulled periodically to StaaS for training machine learned models.
17. System Overview
Transport (HTTP, …)
Scalable NIO servers based on Netty
Thousands of production machines
Cassandra
Stats for CPU, Memory, Disk, …
…
agent agent agent agent …
Server Server Server Server Server
In-memory grid (Hazelcast) for rollups
18. A glimpse of the Data Model
Granular data points
Rolled up metrics for various time intervals
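Rolling granular data points up into coarser time intervals can be sketched as bucketing by interval start and summarizing each bucket. This is an illustrative in-process version only: the function name and the chosen summary fields are assumptions, and the slides indicate the real rollups run in an in-memory grid.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def rollup(points: Iterable[Tuple[int, float]], interval_s: int) -> Dict[int, dict]:
    """Group (epoch_seconds, value) points into fixed-size time buckets.

    Returns a dict mapping each bucket's start time to summary statistics.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align the timestamp down to the start of its interval.
        buckets[ts - ts % interval_s].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical CPU samples: (epoch seconds, percent busy).
samples = [(0, 10.0), (30, 20.0), (60, 30.0), (125, 50.0)]
per_minute = rollup(samples, 60)
per_five_minutes = rollup(samples, 300)
```

Calling the same function with several interval sizes yields one rolled-up series per interval, which matches the idea of keeping metrics at various time granularities.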
21. System Overview
Business Event Stream
Recommendation system
Taste Graph
Taste Vector
1. Item purchased.
2a. Write purchase edge.
2b. Read other edges for this user & item.
4. Req. recommendations.
5. Finds other items close to user’s coordinates.
6. Reco. shown to user
More: http://www.slideshare.net/planetcassandra/e-bay-nyc
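Step 5, finding items close to the user's coordinates, can be sketched as a nearest-neighbor lookup over item vectors. The slides do not state the distance metric or the vector dimensions, so Euclidean distance, the two-dimensional vectors, and all names below are assumptions made for illustration.

```python
import heapq
import math
from typing import Dict, List, Sequence, Set


def recommend(user_vec: Sequence[float],
              item_vecs: Dict[str, Sequence[float]],
              already_owned: Set[str],
              k: int = 3) -> List[str]:
    """Return the k items whose vectors are nearest to the user's vector."""
    candidates = (
        (math.dist(user_vec, vec), item)
        for item, vec in item_vecs.items()
        if item not in already_owned  # skip items the user already bought
    )
    return [item for _, item in heapq.nsmallest(k, candidates)]


# Hypothetical two-dimensional taste vectors.
items = {"a": (0.0, 0.0), "b": (1.0, 1.0), "c": (5.0, 5.0), "d": (0.5, 0.0)}
top_two = recommend((0.0, 0.0), items, already_owned={"a"}, k=2)
```

A linear scan like this is only reasonable for small catalogs; at scale an index or precomputed neighbors would be needed.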
22. Real-time Personalization Data Service
User performs search using keyword
User gets personalized pages based on implicit/explicit profile
23. System Overview
Personalization Data Service
CacheMesh
(write-back cache)
Heavy writes
eBay site pages (personalized)
Every few mins
in-memory
MySQL
& XMP DB
Cassandra
Oracle
(scaled out)
Heavy reads
Cache miss
user profiles
Application SOA services (multiple)
Data Warehouse
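The write-back cache in this diagram absorbs heavy writes in memory and persists them periodically, while a cache miss falls through to the backing store. The class below is a minimal sketch of that pattern, not CacheMesh itself: the class and method names are hypothetical and a plain dict stands in for the persistent store.

```python
from typing import Any, Dict, Optional, Set


class WriteBackCache:
    """Minimal write-back cache: writes stay in memory until flushed."""

    def __init__(self, backing_store: Dict[str, Any]) -> None:
        self._store = backing_store   # dict stand-in for the persistent store
        self._cache: Dict[str, Any] = {}
        self._dirty: Set[str] = set()  # keys written but not yet persisted

    def put(self, key: str, value: Any) -> None:
        # Writes touch only memory, which keeps the write path cheap.
        self._cache[key] = value
        self._dirty.add(key)

    def get(self, key: str) -> Optional[Any]:
        if key not in self._cache:
            # Cache miss: load from the backing store if present.
            if key not in self._store:
                return None
            self._cache[key] = self._store[key]
        return self._cache[key]

    def flush(self) -> int:
        """Persist all dirty entries; intended to run every few minutes."""
        for key in self._dirty:
            self._store[key] = self._cache[key]
        flushed = len(self._dirty)
        self._dirty.clear()
        return flushed


profiles = {"user-1": "old-profile"}
cache = WriteBackCache(profiles)
```

The trade-off of this pattern is that writes made since the last flush are lost if the cache node fails before persisting them.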
24. Data Model
• Keep column names short.
• Don’t overload one CF with all the data:
- Split hot & cold data in separate CF.
- Splitting & sharding can help compaction.
Static column families
26. Manage signals via “Your Favorites”
Whole page is served by Cassandra
More: http://www.slideshare.net/jaykumarpatel/cassandra-at-ebay-13920376
27. Multi-Datacenter Deployment
Topology - NetworkTopologyStrategy (NTS)
RF - 1:1 or 2:2 or 3:3
Read CL - ONE/QUORUM
Write CL - ONE
Data is backed up periodically to protect against human or software error
User request has no datacenter affinity
Non-sticky load balancing
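The interaction between replication factor and consistency level on this slide can be worked through with the usual quorum arithmetic: QUORUM needs a majority of all replicas across datacenters, and a read is guaranteed to overlap the latest write only when read replicas plus write replicas exceed the total replica count. The helper names and datacenter names below are hypothetical, and only the ONE and QUORUM levels listed on the slide are modeled.

```python
from typing import Dict


def quorum(total_rf: int) -> int:
    """Majority of replicas: floor(total_rf / 2) + 1."""
    return total_rf // 2 + 1


def required_replicas(level: str, rf_per_dc: Dict[str, int]) -> int:
    """Number of replicas that must respond for the given consistency level."""
    total_rf = sum(rf_per_dc.values())
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total_rf)
    raise ValueError(f"unsupported consistency level: {level}")


def read_overlaps_write(read_cl: str, write_cl: str, rf_per_dc: Dict[str, int]) -> bool:
    """True when every read set must intersect every write set (R + W > RF)."""
    total_rf = sum(rf_per_dc.values())
    needed = required_replicas(read_cl, rf_per_dc) + required_replicas(write_cl, rf_per_dc)
    return needed > total_rf


rf_3_3 = {"dc1": 3, "dc2": 3}  # hypothetical datacenter names, RF 3:3
```

With RF 3:3, QUORUM requires 4 of 6 replicas, so QUORUM reads combined with ONE writes do not guarantee overlap; that combination favors write latency and availability over strong consistency.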
29. Lessons & Best Practices
• One size does not fit all
– Use Cassandra for the right use cases.
• Choose proper Replication Factor and Consistency Level
– They alter latency, availability, durability, consistency and cost.
– Cassandra supports tunable consistency, but remember strong consistency is not free.
• Many ways to model data in Cassandra
– The best way depends on your use case and query patterns.
• De-normalize and duplicate for read performance
– But don’t de-normalize if you don’t need to.
http://www.slideshare.net/jaykumarpatel/cassandra-data-modeling-best-practices
30. Are you excited? Come Join Us!
Thank You
@pateljay3001
#cassandra13