SAS Intelligent Advertising changed its ad-serving platform from using Datastax Cassandra clusters to Scylla clusters for its real-time visitor data storage. This presentation describes how this migration was executed with no downtime and with no loss of data, even as data was constantly being created or updated.
SAS Institute on Changing All Four Tires While Driving an AdTech Engine at Full Speed
1. Changing All Four Tires
while Driving an Ad Tech
Engine at Full Speed
David Blythe, Principal Software Developer
3. Presenter
David Blythe, Principal Software Developer
Backend architect/developer (C++, scripting) for real-time ad
decisioning platform
✧ Activation of data used for targeting (visitor, geo, device)
✧ Application of business logic to decisions
✧ Publication of ad metadata to servers
✧ Collection and processing of serving data
✧ Automation of daily processing tasks
5. SAS: Leader in analytics since 1976
■ Provides general purpose software tools for data analytics
■ Provides specialized solutions for vertical markets
● E.g. healthcare, banking, marketing
■ Largest privately held software company in the world
■ 2018 revenue: $3.27 billion
■ 14000 employees worldwide
■ Headquartered in Cary, NC
6. SAS Customer Intelligence (CI)
One of the vertical markets applying SAS analytics
■ Provides services to digital marketers and web publishers
● Analytics for A/B testing, product recommendations, tracking customer
engagement
● Hosted services to provide real-time customized decisions for content/ads
■ Multichannel engagement (web, mobile, video, e-mail)
■ Complex business rules (targeting, limited exposure, competitive exclusion)
■ Wide variety of data (page, geo, device, behavioral, visitor attributes)
■ Many billions of content decisions monthly
■ Response time generally <10ms
■ No downtime
8. SAS CI’s Use of NoSql
■ Key-value store
● Key = visitor ID, Value = serialized data
■ Store data per visitor
● Static attributes (e.g. gender, interests)
■ Updated infrequently by non-real-time servers
● Analytical data (e.g. product recommendations)
■ Updated periodically by non-real-time servers
● Real-time data (e.g. recent decisions)
■ Updated constantly by real-time decisioning servers
■ Data read at start of a visitor session, held in memory during session
■ One NoSql cluster per AWS region
● Co-located in the same network with decisioning servers
■ Hundreds of millions of rows per cluster
9. SAS CI NoSql Encapsulation
Applications use a custom API to NoSql
■ Abstract OO interface
● To support multiple implementations
■ Encapsulates the business-level function
● E.g. “Get selected data rows for a visitor”
● Not CQL
■ Instance of concrete class held per tenant (customer)
● To allow different tenants to use different implementations
■ Instance is held behind a mutex
● To allow a tenant’s instance to change on-the-fly, through configuration
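The per-tenant instance held behind a mutex can be sketched roughly as follows. This is a minimal illustration, not the SAS code: the names VisitorStore, FixedStore, TenantStores, configure and lookup are all hypothetical, and the stub implementation exists only so the sketch compiles on its own.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical business-level interface: callers ask for visitor data and
// never see CQL. The name and signature are illustrative assumptions.
class VisitorStore {
public:
    virtual ~VisitorStore() = default;
    virtual std::string getVisitorData(const std::string &visitor_id) = 0;
};

// Stub implementation so the sketch is self-contained: it answers every
// request with a fixed value that identifies which backend served it.
class FixedStore : public VisitorStore {
public:
    explicit FixedStore(std::string value) : value_(std::move(value)) {}
    std::string getVisitorData(const std::string &) override { return value_; }
private:
    std::string value_;
};

// Holds one store per tenant behind a mutex so that a configuration change
// can swap a tenant's implementation while requests are in flight.
class TenantStores {
public:
    void configure(const std::string &tenant, std::shared_ptr<VisitorStore> store) {
        assert(store != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        stores_[tenant] = std::move(store);
    }

    // Returns a shared_ptr copy: a request that already holds the previous
    // instance can finish with it even if the tenant is reconfigured.
    std::shared_ptr<VisitorStore> lookup(const std::string &tenant) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = stores_.find(tenant);
        if (it == stores_.end())
            return nullptr;
        return it->second;
    }

private:
    std::mutex mutex_;
    std::map<std::string, std::shared_ptr<VisitorStore>> stores_;
};
```

Calling configure() a second time for the same tenant replaces the implementation for subsequent lookups only, which is the property a no-downtime switch relies on.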
11. SAS CI NoSql Implementations
■ Memory-based
● For automated unit testing
■ Flat file-based
● For manual testing on local machines
■ Cassandra/Thrift – Schema A
● In production 2010-2014 with open source Cassandra distribution
■ Datastax Cassandra/CQL – Schema A
● In production 2015-2018 with licensed Datastax distribution
■ Datastax Cassandra/CQL – Schema B
● In production 2018 with licensed Datastax distribution
● In production 2019-present with licensed Scylla distribution (no code changes !)
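A memory-based implementation for unit testing, like the first one in the list above, might look like this sketch. The get/update signatures mirror the ones shown later in the talk; the class name MemoryNoSqlConnection, the status convention (0 means success, a missing row leaves data_value empty), and the decision to ignore timeout_ms and ttl are assumptions made for illustration. In the real code such a class would presumably derive from the abstract connection interface.

```cpp
#include <cassert>
#include <cstddef>
#include <ctime>
#include <map>
#include <string>
#include <tuple>

// Hypothetical in-memory stand-in for a NoSql connection, for unit tests only.
class MemoryNoSqlConnection {
public:
    // Assumed convention: returns 0 on success and leaves data_value empty
    // when no row exists. timeout_ms is ignored because memory never blocks.
    int get(std::size_t /*timeout_ms*/, const std::string &customer,
            const std::string &id, const std::string &data_type,
            std::string &data_value) {
        data_value.clear();
        auto it = rows_.find(std::make_tuple(customer, id, data_type));
        if (it != rows_.end())
            data_value = it->second;
        return 0;
    }

    // ttl is accepted for signature compatibility but not enforced here.
    int update(std::size_t /*timeout_ms*/, const std::string &customer,
               const std::string &id, const std::string &data_type,
               const std::string &data_value, std::time_t /*ttl*/) {
        assert(!id.empty());  // a test double should fail loudly on bad keys
        rows_[std::make_tuple(customer, id, data_type)] = data_value;
        return 0;
    }

    std::size_t rowCount() const { return rows_.size(); }

private:
    // Rows are keyed by (customer, visitor ID, data type).
    using Key = std::tuple<std::string, std::string, std::string>;
    std::map<Key, std::string> rows_;
};
```

Because the store lives entirely in the test process, each test can start from an empty instance and needs no cluster.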
13. Migration challenge
Scenarios requiring data migration
■ Schema needs to change within existing cluster
■ New NoSql vendor is to be used
■ Data must be transformed or moved
How to maintain 24/7 access during migration?
■ Must be able to read old data
■ Must be able to write new/updated data
■ Must be able to finish migration and decommission old
schema/vendor
14. Migration solution
A “migrating” implementation of the NoSql API
■ Wraps the two services that implement the old and new
schema/vendor
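class MigratingNoSqlService : public NoSqlService
{
public:
MigratingNoSqlService(NoSqlService &old_service, NoSqlService &new_service);
// Each migrating connection wraps one connection to the old service and one to the new
NoSqlConnection* getConnection() {
return new MigratingNoSqlConnection(old_service.getConnection(), new_service.getConnection());
}
private:
NoSqlService &old_service;
NoSqlService &new_service;
};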
class MigratingNoSqlConnection : public NoSqlConnection
{
public:
MigratingNoSqlConnection(NoSqlConnection *old_service_connection, NoSqlConnection *new_service_connection);
int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type,
std::string &data_value);
// etc...
};
16. Lazy writing strategy
■ Reading always delegates to just old service
■ Writing delegates to both old and new services
...
int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type,
std::string &data_value) {
return old_service.get(timeout_ms, customer, id, data_type, data_value);
}
int update(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, const
std::string &data_value, time_t ttl) {
old_service.update(timeout_ms, customer, id, data_type, data_value, ttl);
return new_service.update(timeout_ms, customer, id, data_type, data_value, ttl);
}
...
17. Lazy writing strategy
Summary
■ old service continues to be maintained with all data, while new
service accumulates just new/updated data
■ old service is decommissioned after new service is deemed to have
“enough” data
■ Advantages
● Simple
■ Disadvantages
● Sacrifices data for visitors not engaged during migration period
● Prolongs time until decommissioning old service
18. Lazy reading strategy
■ Reading delegates to new service first, and if no data found,
delegates to old service
■ Writing delegates to just new service
...
int get(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type,
std::string &data_value) {
int status = new_service.get(timeout_ms, customer, id, data_type, data_value);
if (data_value.empty())
status = old_service.get(timeout_ms, customer, id, data_type, data_value);
return status;
}
int update(size_t timeout_ms, const std::string &customer, const std::string &id, const std::string &data_type, const
std::string &data_value, time_t ttl) {
return new_service.update(timeout_ms, customer, id, data_type, data_value, ttl);
}
...
19. Lazy reading strategy
Summary
■ old service is no longer updated at all, while new service accumulates
just new/updated data
■ old service is decommissioned after new service is deemed to have
“enough” data
■ Advantages and Disadvantages
● Similar to Lazy writing
● Added disadvantage: Can require two reads
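The double-read cost can be made concrete with a small simulation. This sketch replaces both clusters with std::map stand-ins and counts backend lookups; LazyReadingStore and backend_reads are hypothetical names, and the simplified one-argument key is an assumption for brevity.

```cpp
#include <cassert>
#include <map>
#include <string>

using Rows = std::map<std::string, std::string>;  // visitor ID -> serialized data

// Lazy reading over two in-memory stand-ins for the old and new clusters.
struct LazyReadingStore {
    Rows old_rows;
    Rows new_rows;
    int backend_reads = 0;  // counts lookups against either store

    // Try the new store first; fall back to the old store on a miss.
    std::string get(const std::string &visitor_id) {
        ++backend_reads;
        auto hit = new_rows.find(visitor_id);
        if (hit != new_rows.end())
            return hit->second;
        ++backend_reads;  // miss in the new store costs a second read
        auto fallback = old_rows.find(visitor_id);
        return fallback == old_rows.end() ? std::string() : fallback->second;
    }

    // Writes go only to the new store; the old store is never touched again.
    void update(const std::string &visitor_id, const std::string &data) {
        assert(!visitor_id.empty());
        new_rows[visitor_id] = data;
    }
};
```

A visitor who has not yet been written to the new store costs two lookups per read; after the first update for that visitor, reads cost one lookup and the old row is simply left behind.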
20. Aggressive lazy reading strategy
■ Like Lazy reading strategy
● Reading delegates to new service first, and if no data found, delegates to old service
● Writing delegates to just new service
● Old service is no longer updated
■ A separate one-off script walks the old service’s keyspace and copies
every row that does not yet exist in the new service
● …or multiple identical scripts working on different slices in parallel
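The core of such a copy script can be sketched as below. The talk's actual script was written in Python and ran against real clusters; this C++ version uses std::map stand-ins to match the code style of the other slides. The function name copyMissingRows and the hash-based slicing are assumptions, and a real script would more likely slice by token range.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>

using RowMap = std::map<std::string, std::string>;  // visitor ID -> serialized data

// Copies the rows of one slice from the old store to the new store, skipping
// rows that already exist there. Returns the number of rows copied.
std::size_t copyMissingRows(const RowMap &old_rows, RowMap &new_rows,
                            std::size_t slice_index, std::size_t slice_count) {
    assert(slice_count > 0 && slice_index < slice_count);
    std::size_t copied = 0;
    for (const auto &row : old_rows) {
        // Each worker handles only the keys that hash into its slice.
        if (std::hash<std::string>{}(row.first) % slice_count != slice_index)
            continue;
        // emplace() inserts only when the key is absent, so rows already
        // written to the new store by live traffic are never overwritten.
        if (new_rows.emplace(row.first, row.second).second)
            ++copied;
    }
    return copied;
}
```

Running the function once per slice index covers the whole keyspace, and running it again copies nothing, so a failed worker can be restarted safely. Against a real cluster the check and the copy must not race with live writes; a conditional CQL insert (INSERT ... IF NOT EXISTS) is one way to get that guarantee, though the talk does not say which mechanism its script used.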
21. Aggressive lazy reading strategy
Summary
■ old service is no longer updated at all, while new service accumulates
just new/updated data
■ old service is decommissioned after one-off scripts have finished
copying all rows to new service
■ Advantages
● No data is sacrificed
● Migration time is minimized, so time needed for double-reads is limited
■ Disadvantages
● Requires careful attention to the one-off script
● Increases load during migration period
22. SAS CI NoSql Scylla Migration
■ 2-week window between Scylla license approval and Datastax license
expiration
■ Prior to Scylla approval…
● Stood up Scylla test cluster
● Wrote and tested MigratingNoSqlService using Aggressive lazy reading strategy
● Wrote and tested one-off migration script in Python
■ After Scylla approval, per cluster…
● Stood up production Scylla cluster
● Switched on the MigratingNoSqlService for all tenants
● Ran multiple copies of the one-off script in parallel
● When scripts finished, switched on regular DatastaxNoSqlService, now pointed at the Scylla cluster, for all tenants
● Tore down production Datastax cluster
23. SAS CI NoSql Scylla Migration Results
■ Complete migration took from a few hours to two days per cluster
■ All clusters completed well within the licensing window
■ No downtime
■ No lost data
■ No operational headaches
■ No customer complaints
■ Read performance improved because… Scylla
24. Take Away
Leverage the strengths of OO in your NoSql applications
■ Create your own business-specific API
● Lays groundwork for testing and shifting infrastructure
■ Implement concrete test classes
● Enables robust unit and application testing
■ Implement concrete production classes, as needed
● Provides some vendor independence
● Can encapsulate much of the migration process
25. Thank you. Stay in touch.
Any questions?
David Blythe
david.blythe@sas.com
@BlytheDavid
davidblythe