We describe the architecture of a personalization platform that captures customer profiles and behavioral data. A Cassandra cluster serves as an intermediate storage backend that replicates updates to profile records and timeline events across multiple data centers. A caching tier serves the user data and provides a real-time execution environment in which predictive models can, for example, calculate propensities or update category histograms. We cover the metrics used to track replication performance and data freshness, and we discuss applications and features, such as user badges, that are powered by this new P13N platform.
3. Bullseye
Bullseye Functional Architecture
[Diagram. Components: Tracking; Near Real Time Event Collection; Recent User Data, 1-5 days (Cassandra); Long Term User Data (Local SSD); Real Time Model Evaluation & Caching (sharded/full user state in memory); Client Access; Offline Analysis; Offline Database/Batch Processing.]
4. Why Cassandra?
Great write performance
Great replication performance
Reasonable read performance
Reasonable cost
Client controlled consistency settings
Bullseye
5. Cassandra Setup
Cassandra Version 1.2.9
We use Replication
- Cassandra rings deployed to 3 datacenters
Cassandra clients
- We use both the Datastax Java and C++ Beta clients
Using CQL Table specifications and commands
Not on SSDs
6. Cassandra Usage
Column Family Design:
- Avoid Tombstones
- Avoid Compaction
With Focus on Short Term Storage:
- Turn off automatic compaction / only manual compaction
- Use unique column key names to avoid tombstones
- Clear out old data with truncation
7. Cache Miss Flow (New Session)
CREATE TABLE DAY_N (USER_ID TEXT, RECORD_NAME TEXT,
RECORD_VALUE BLOB, PRIMARY KEY (USER_ID, RECORD_NAME));
Write to the active day column family, keyed by user ID.
Truncate the oldest day column family.
When rolling over from one day to the next, run a manual compaction on the old day.
On read, pull the user's records from all column families newer than the local SSD data.
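The rotation described above can be sketched in Python. This is only an illustration under stated assumptions: the number of day tables (`NUM_DAY_TABLES`), the `DAY_<n>` modulo naming, and all helper names are hypothetical, since the slides do not say how many day tables exist or how they are named. Days are represented as plain integer indexes.

```python
NUM_DAY_TABLES = 5  # assumption: one rotating table per day slot


def table_for(day: int) -> str:
    """Map a day index to its rotating table name (hypothetical naming)."""
    return f"DAY_{day % NUM_DAY_TABLES}"


def rotation_plan(today: int) -> dict:
    """Which table to write, truncate, and manually compact on a given day."""
    return {
        "write_to": table_for(today),      # active day column family
        "truncate": table_for(today + 1),  # oldest slot, reused tomorrow
        "compact": table_for(today - 1),   # yesterday's table, now read-only
    }


def tables_to_read(today: int, ssd_day: int) -> list:
    """Tables to query on a cache miss: every live day newer than the
    last day already covered by the local SSD copy (ssd_day)."""
    oldest_live = today - (NUM_DAY_TABLES - 2)  # the truncated slot is excluded
    first = max(ssd_day + 1, oldest_live)
    return [table_for(d) for d in range(first, today + 1)]
```

For example, with five tables and day index 12, writes go to `DAY_2`, `DAY_3` is truncated ahead of reuse, and `DAY_1` is compacted. If the SSD copy is current through day 10, a cache miss reads only `DAY_1` and `DAY_2`.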
8. Queuing Flow (Ongoing Activity)
CREATE TABLE HOUR_N (ID TEXT, RECORD_NAME TEXT,
RECORD_VALUE BLOB, PRIMARY KEY (ID, RECORD_NAME));
Read from and write to the active hour column family, keyed by the timestamp rounded to the nearest second.
Store the one-hour-old column family to the offline DB.
Truncate the two-hour-old column family.
Asynchronously probe the record for the current second, as well as recent seconds, until the state is captured. Data may be read 1-3 times, or more if replication is lagging.
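A minimal Python sketch of the hourly rotation and the per-second probing described above. The table count (`NUM_HOUR_TABLES`), the `HOUR_<n>` modulo naming, the probe look-back window (`PROBE_LOOKBACK_SECONDS`), and the helper names are assumptions made for illustration; the slides do not specify them.

```python
NUM_HOUR_TABLES = 3          # assumption: active, one hour old, two hours old
PROBE_LOOKBACK_SECONDS = 2   # assumption: how many recent seconds to re-probe


def second_key(ts: float) -> int:
    """Row key: the timestamp rounded to the nearest second."""
    return int(round(ts))


def hour_table(ts: float) -> str:
    """Rotating hour table that holds the given timestamp (hypothetical naming)."""
    return f"HOUR_{(second_key(ts) // 3600) % NUM_HOUR_TABLES}"


def hourly_plan(ts: float) -> dict:
    """Which table is active, archived to the offline DB, or truncated."""
    hour = second_key(ts) // 3600

    def name(h: int) -> str:
        return f"HOUR_{h % NUM_HOUR_TABLES}"

    return {"active": name(hour), "archive": name(hour - 1), "truncate": name(hour - 2)}


def probe_targets(now: float) -> list:
    """(table, row key) pairs to probe: the current second, then recent seconds."""
    current = second_key(now)
    return [
        (hour_table(s), str(s))
        for s in range(current, current - PROBE_LOOKBACK_SECONDS - 1, -1)
    ]
```

Because the look-back can cross an hour boundary, each probe target carries its own table name rather than assuming the active hour.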
9. Cassandra Issues and Resolutions
Issues with C++ Datastax Cassandra beta client
- open sourced, so could apply fixes
Performance issues with the cache miss query
- increased heap size
- reduced replication factor
- turned off cross-colo read repair
- deployed data center aware policy for C++
10. Personalization Applications
Ranjan Sinha, PhD
Lead Research Scientist
April 7, 2014
Disclaimer: Some of the content in this talk is based on my personal opinion. It does not reflect the views of eBay.
12. Why Personalize?
Enable more relevant experience
Retention of existing users
New user acquisition
Reactivating churned users
Increasing activity per user
Improving conversion from visits to transactions
Personalization Applications
13. P13N Platform: Introduction
Maintains activity timeline information
Enables near-real-time event processing
Enables in-session personalization
Provides environment for predictive model evaluation
Backup and restore to and from Hadoop/HBase
14. P13N Platform: Conceptual Architecture
[Diagram. Labels: Tracking Event Source; CEP Processor; Filters and forwards events; In-memory Cache + Model Evaluation; Activity Timeline + User Badges; Model Executor (models m1, m2, m3, …, mn); Cassandra; Hadoop/HBase; Offline Modeling Platform; User Badges; Client Access.]
15. P13N Platform: Modeling stages
Realtime
- In-session user intent
- Contextual Models
Nearline
- Update propensity models (aka User Badges)
Offline
- Bootstrap propensity models by mining long-term behavior history
16. Application (1): User Badges
Name            Description
SaleType        Auction vs. Buy-it-now
ItemCondition   New vs. Used
Category        Preference of categories
Price           Price range of purchasing activity
Deals           Propensity to purchase deals
Social Share    Propensity to share items in social media
Profile based on long-term behavior
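As an illustration of how a badge such as ItemCondition could be derived from long-term behavior, the sketch below estimates a propensity as a smoothed fraction of past events. The slides do not describe the actual models; the event field names, the add-one smoothing, and the `propensity` helper are all assumptions made for this example.

```python
from collections import Counter


def propensity(events, attribute, value, prior=1.0, num_values=2):
    """Smoothed share of events whose `attribute` equals `value`.

    `prior` adds pseudo-counts so that a user with no history gets a
    neutral score of 1 / num_values (an assumption, not from the slides).
    """
    counts = Counter(e[attribute] for e in events if attribute in e)
    total = sum(counts.values())
    return (counts[value] + prior) / (total + prior * num_values)


# Hypothetical purchase history: three used items and one new item.
purchases = [{"item_condition": "used"}] * 3 + [{"item_condition": "new"}]
used_badge = propensity(purchases, "item_condition", "used")
```

Here `used_badge` is (3 + 1) / (4 + 2), so the user leans toward used items, while a user with no purchases would score 0.5.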
17. Application (2): Search Ranking …
Should all queries be personalized in the same manner?
- For some queries (ebay or google), everyone would like the same results
- For other queries, different people may want completely different results
Query: “big ben puzzles”

Not_P13N Rank   P13N Rank   Sold   IsNew   Title
1               1           No     No      LOT OF 7 BIG BEN PUZZLES 5/1000PC. 2/1500 PUZZLES EUC
2               3           No     Yes     1000 Pc MB Big Ben Jigsaw Puzzle Mount Shuksan North Cascades National Park WA
3               2           Yes    No      COMPLETE Fishing Village,Smalls Island MB Big Ben Puzzle 1000 Piece Puzzle Size!

User: always buys used items
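The reordering in the table above can be reproduced with a toy re-ranker that moves items matching the user's condition preference ahead of the rest while keeping the original order within each group. This is not the ranking function used in production, which the slides do not describe; the `personalize` helper and the record fields are hypothetical.

```python
def personalize(results, prefers_new):
    """Re-rank so that items matching the user's new/used preference come
    first; ties keep their non-personalized order."""
    return sorted(results, key=lambda r: (r["is_new"] != prefers_new, r["rank"]))


# The three results from the table, with their non-personalized ranks.
results = [
    {"rank": 1, "is_new": False, "title": "LOT OF 7 BIG BEN PUZZLES"},
    {"rank": 2, "is_new": True, "title": "1000 Pc MB Big Ben Jigsaw Puzzle"},
    {"rank": 3, "is_new": False, "title": "COMPLETE Fishing Village MB Big Ben Puzzle"},
]

# A user who always buys used items sees the two used listings first.
reranked = personalize(results, prefers_new=False)
```

The resulting order of original ranks is 1, 3, 2, which matches the P13N Rank column.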
18. Application (3): Contextual models …
Infer categories that user is interested in within the current session
Long and Short term behavior
- Historic behavior may provide benefits at the start of the session
- Short-term behavior may contribute gains in an extended search session
- Combination of session and historic behavior may outperform using either alone
[Timeline figure. Events e1, e2, e3, … arrive from the event source along the time axis t. Processing stages: Online, in-session; Nearline, after session expiry; Offline, historical.]
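One simple way to combine historic and in-session behavior is to blend two category histograms, with a session weight that grows as the session accumulates events. This is a sketch of that idea only, not the model from the talk; the weighting formula, the constant `k`, and the helper names are assumptions.

```python
def normalize(hist):
    """Convert raw category counts to a probability distribution."""
    total = sum(hist.values())
    return {c: n / total for c, n in hist.items()} if total else {}


def blend(historic, session, session_events, k=5.0):
    """Mix historic and in-session category interest.

    The session weight is session_events / (session_events + k): zero at
    the start of a session, approaching one in a long session.
    """
    w = session_events / (session_events + k)
    h, s = normalize(historic), normalize(session)
    return {
        c: (1.0 - w) * h.get(c, 0.0) + w * s.get(c, 0.0)
        for c in set(h) | set(s)
    }


# Hypothetical user: long-term interest in Toys and Books, but five
# in-session events all in Puzzles.
start = blend({"Toys": 3, "Books": 1}, {}, session_events=0)
later = blend({"Toys": 3, "Books": 1}, {"Puzzles": 5}, session_events=5)
```

At session start the estimate equals the historic profile. After five events the session weight is 0.5, so Puzzles takes half of the mass and the historic categories share the other half.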