Cassandra Day SV 2014: Building a Personalization Platform with Cassandra at eBay

Bullseye P13n Platform
April 7, 2014
Charles Bracher
Bullseye Dev Manager
Ranjan Sinha, PhD
Lead Research Scientist
Bullseye

Outline
P13n Platform
Why Cassandra?
Cassandra Setup
Cassandra Usage
Cassandra Issues and Resolutions
Hand over to Ranjan for the Data Science Perspective

Bullseye Functional Architecture
[Diagram; component labels:]
– Tracking
– Near Real Time Event Collection
– Recent User Data, 1-5 days (Cassandra)
– Long Term User Data (Local SSD)
– Real Time Model Evaluation & Caching (sharded/full user state in memory)
– Offline Database / Batch Processing
– Offline Analysis
– Client Access

Why Cassandra?
Great write performance
Great replication performance
Reasonable read performance
Reasonable cost
Client controlled consistency settings

Cassandra Setup
Cassandra Version 1.2.9
We use replication
–  Cassandra rings deployed to 3 datacenters
Cassandra clients
–  We use both the DataStax Java client and the DataStax C++ beta client
Using CQL table specifications and commands
Not on SSDs

Cassandra Usage
Column Family Design:
– Avoid Tombstones
– Avoid Compaction
With Focus on Short Term Storage:
– Turn off automatic compaction; run manual compaction only
– Use unique column key names to avoid tombstones
– Clear out old data with truncation

Cache Miss Flow (New Session)
CREATE TABLE DAY_N (USER_ID TEXT, RECORD_NAME TEXT,
RECORD_VALUE BLOB, PRIMARY KEY (USER_ID, RECORD_NAME));
Write to the active day's column family, keyed by user ID.
Truncate the oldest day's column family.
When rolling over from one day to the next, run a manual compaction on the previous day's column family.
On read, pull the user's records from all column families newer than the local SSD data.
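
The day rotation described above can be sketched roughly as follows. This is a minimal illustration, not the Bullseye implementation: the ring size `NUM_DAY_TABLES`, the round-robin mapping from calendar day to `DAY_N`, and all function names are assumptions made for the example; the slides only state that one table is active per day, the oldest is truncated, and reads cover every table newer than the local SSD data.

```python
from datetime import date, timedelta
from typing import List

# Assumption: a ring of 5 day tables (the slides mention 1-5 days of recent data).
NUM_DAY_TABLES = 5


def day_table(d: date, num_tables: int = NUM_DAY_TABLES) -> str:
    """Map a calendar day onto one of the DAY_N tables, round-robin."""
    return f"DAY_{d.toordinal() % num_tables}"


def table_to_truncate(today: date, num_tables: int = NUM_DAY_TABLES) -> str:
    """The slot that tomorrow will reuse currently holds the oldest day."""
    return day_table(today + timedelta(days=1), num_tables)


def tables_to_read(today: date, ssd_covers_through: date,
                   num_tables: int = NUM_DAY_TABLES) -> List[str]:
    """Tables for days strictly newer than the local SSD data, oldest first."""
    days = max(0, min((today - ssd_covers_through).days, num_tables))
    return [day_table(today - timedelta(days=i), num_tables)
            for i in range(days - 1, -1, -1)]
```

For example, if the local SSD data covers everything through two days ago, `tables_to_read` returns the tables for yesterday and today, and a cache-miss read would query the user ID in each of them.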

Queuing Flow (Ongoing Activity)
CREATE TABLE HOUR_N (ID TEXT, RECORD_NAME TEXT,
RECORD_VALUE BLOB, PRIMARY KEY (ID, RECORD_NAME));
Read from and write to the active hour's column family; the key is the timestamp rounded to the nearest second.
Store the column family that is one hour old to the offline DB.
Truncate the column family that is two hours old.
Asynchronously probe the record for the current second, and for recent seconds, until the state is captured. Data may be read 1-3 times, or more if replication is lagging.
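
A rough sketch of the keying and probing described above, under stated assumptions: a ring of three hour tables (active, one hour old, two hours old), a `lookback` of two seconds for the probe, and hypothetical function names. The slides do not specify these details, and the asynchronous reads themselves are omitted.

```python
import math
from typing import List

# Assumption: three hour tables (active, being archived, being truncated).
NUM_HOUR_TABLES = 3


def rounded_second(ts: float) -> int:
    """Round a Unix timestamp to the nearest whole second (half rounds up)."""
    return math.floor(ts + 0.5)


def hour_table(ts: float, num_tables: int = NUM_HOUR_TABLES) -> str:
    """Pick the HOUR_N table that is active for this timestamp."""
    return f"HOUR_{(rounded_second(ts) // 3600) % num_tables}"


def second_key(ts: float) -> str:
    """Row key (the ID column) for records written at this timestamp."""
    return str(rounded_second(ts))


def probe_keys(now: float, lookback: int = 2) -> List[str]:
    """Keys to probe: the current second first, then the most recent seconds."""
    base = rounded_second(now)
    return [str(base - i) for i in range(lookback + 1)]
```

A reader would issue one probe per key returned by `probe_keys`, which matches the note that the same data may be read more than once, especially when replication lags.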

Cassandra Issues and Resolutions
Issues with the DataStax C++ beta client for Cassandra
– open sourced, so we could apply fixes
Performance issues with the cache miss query
– increased heap size
– reduced replication factor
– turned off cross-datacenter (cross-colo) read repair
– deployed a datacenter-aware policy for the C++ client

Personalization Applications
Ranjan Sinha, PhD
Lead Research Scientist
April 7, 2014
Disclaimer: Some of the content in this talk is based on my personal opinion. It does not reflect the views of eBay.

Outline
Why Personalize?
P13N Platform
– Introduction
– Conceptual architecture
– Modeling stages
P13N Applications
– User badges
– Search ranking
– Contextual models
– Deals

Why Personalize?
Enable more relevant experience
Retention of existing users
New user acquisition
Reactivating churned users
Increasing activity per user
Improving conversion from visits to transactions

P13N Platform: Introduction
Maintains activity timeline information
Enables event processing at near real-time
Enables in-session personalization
Provides environment for predictive model evaluation
Backup and restore to and from Hadoop/HBase

P13N Platform: Conceptual Architecture
[Diagram; component labels:]
– Tracking Event Source
– CEP Processor
– Filters and forwards events
– Model Executor (models m1, m2, m3, …, mn)
– Activity Timeline + User Badges
– In-memory Cache + Model Evaluation
– Cassandra
– Hadoop/HBase
– Offline Modeling Platform
– User Badges
– Client Access

P13N Platform: Modeling stages
Realtime
– In-session user intent
– Contextual Models
Nearline
– Update propensity models (aka User Badges)
Offline
– Bootstrap propensity models by mining long-term behavior history

Application (1): User Badges
Profile based on long-term behavior

Name            Description
SaleType        Auction vs. Buy-it-now
ItemCondition   New vs. Used
Category        Preference of categories
Price           Price range of purchasing activity
Deals           Propensity to purchase deals
Social Share    Propensity to share items in social media

Application (2): Search Ranking …
Should all queries be personalized in the same manner?
– For some queries (e.g. “ebay” or “google”), everyone wants the same results
– For other queries, different people may want completely different results
Query: “big ben puzzles”

Not_P13N Rank   P13N Rank   Sold   IsNew   Title
1               1           No     No      LOT OF 7 BIG BEN PUZZLES 5/1000PC. 2/1500 PUZZLES EUC
2               3           No     Yes     1000 Pc MB Big Ben Jigsaw Puzzle Mount Shuksan North Cascades National Park WA
3               2           Yes    No      COMPLETE Fishing Village,Smalls Island MB Big Ben Puzzle 1000 Piece Puzzle Size!

User: always buys used items
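
As a toy illustration of the example above (not eBay's ranking model), the sketch below re-ranks results with a stable sort so that listings matching the user's condition preference move ahead of those that do not, while ties keep their non-personalized order. The `Listing` type and `personalize` function are hypothetical names introduced for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Listing:
    title: str
    is_new: bool
    sold: bool


def personalize(results: List[Listing], prefers_new: bool) -> List[Listing]:
    """Move listings that match the user's new/used preference to the front.

    sorted() is stable, so listings that tie keep their original relative order.
    """
    return sorted(results, key=lambda r: r.is_new != prefers_new)


# The three results from the slide, in non-personalized rank order.
results = [
    Listing("LOT OF 7 BIG BEN PUZZLES 5/1000PC. 2/1500 PUZZLES EUC", False, False),
    Listing("1000 Pc MB Big Ben Jigsaw Puzzle Mount Shuksan", True, False),
    Listing("COMPLETE Fishing Village,Smalls Island MB Big Ben Puzzle", False, True),
]
reranked = personalize(results, prefers_new=False)
```

For a user who always buys used items, the new listing drops from rank 2 to rank 3, which is the reordering shown in the table.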

Application (3): Contextual models …
Infer categories that user is interested in within the current session
Long and Short term behavior
– Historic behavior may provide benefits at the start of the session
– Short-term behavior may contribute gains in an extended search session
– Combination of session and historic behavior may outperform using either alone
[Diagram: timeline (t) of events e1, e2, e3, … from the event source, with three modeling phases labeled: Offline, historical; Online, in-session; Nearline, after session expiry.]
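
One simple way to combine historic and session behavior is a weighted blend whose session weight grows as more in-session events arrive. The sketch below is an assumption-laden illustration, not the model used in the talk: the weighting formula, the parameter `k`, and the function name are all hypothetical.

```python
from typing import Dict


def blend_category_scores(historic: Dict[str, float],
                          session: Dict[str, float],
                          session_events: int,
                          k: float = 5.0) -> Dict[str, float]:
    """Blend per-category interest scores from long-term and in-session behavior.

    The session weight is n / (n + k): 0 at the start of a session (historic
    only), 0.5 after k events, and approaching 1 in a long session.
    """
    w = session_events / (session_events + k)
    categories = set(historic) | set(session)
    return {c: (1 - w) * historic.get(c, 0.0) + w * session.get(c, 0.0)
            for c in categories}
```

With this weighting, historic behavior dominates at the start of a session and short-term behavior takes over in an extended session, mirroring the bullets above.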

Application (4): Deals
– Personalize categories
– Personalize modules
– Personalize tabs
– Personalize items

fin
