Using Aerospike and
Machine Learning
Brian Bulkowski
CTO, Founder
@bbulkow
© 2016 Aerospike Inc. All rights reserved.
What is Aerospike?
Large-scale DHT Database ( 10B ++ objects, 100T++, O(1) get / put )
… with queries, data structures, UDF, fast clients ...
... On Linux ...
High availability clustering & rebalancing ( proven 5 9’s, no load balancer )
Very high performance C code – reads and writes
( 2M++ TPS from Flash, 4M++ TPS from DRAM PER SERVER )
KVS++ provides query, UDF, table/columns, aggregations, SQL
Direct attach storage; persistence through replication and Flash
Cloud-savvy – runs with EC2, GCE others; Docker, more …
Dual License: Open Source for devs, Enterprise for deployment
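The "DHT" and "O(1) get / put" claims above can be illustrated with a toy sketch: a key is hashed to a partition, and the partition determines which node holds the record, so a lookup never scans the cluster. This is only an illustration of the general technique, not Aerospike's actual partitioning scheme; the partition count, the hash function, and the names `partition_of` and `TinyDHT` are all assumptions made for this example.

```python
import hashlib

N_PARTITIONS = 4096  # hypothetical partition count, chosen for illustration


def partition_of(key: str) -> int:
    """Map a key to a partition with a fixed-cost hash (assumed: SHA-1)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS


class TinyDHT:
    """Toy in-process stand-in for a cluster: one dict per 'node'."""

    def __init__(self, n_nodes: int) -> None:
        self.nodes = [dict() for _ in range(n_nodes)]

    def _node_for(self, key: str) -> dict:
        # Partition-to-node assignment is a plain modulo here; a real
        # cluster would keep a partition map and rebalance it on changes.
        return self.nodes[partition_of(key) % len(self.nodes)]

    def put(self, key: str, value) -> None:
        # One hash plus one dict write: constant work per record.
        self._node_for(key)[key] = value

    def get(self, key: str):
        # One hash plus one dict read; returns None for a missing key.
        return self._node_for(key).get(key)
```

Because the owning node is computed from the key alone, a client can go straight to it, which is consistent with the "no load balancer" point on the slide.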
Architecture Overview – Flash based system of engagement
LEGACY DATABASE
(Mainframe)
XDR
Decisioning Engine
DATA WAREHOUSE/
DATA LAKE
LEGACY RDBMS
HDFS BASED
BUSINESS
TRANSACTIONS
Web views
( Payments )
( Mobile Queries )
( Recommendation )
( And More )
High Performance NoSQL
“REAL-TIME BIG DATA”
“DECISIONING”
500 Business Trans per sec
x 5000 Calculations per sec
= 2.5 M Database Transactions per sec
CREDIT CARD
PROCESSING SYSTEM
FRAUD DETECTION &
PROTECTION APP
ACCOUNT
BEHAVIOR
ACCOUNT
STATISTICS
STATIC DATA
RULE 1 – PASSED ✔
RULE 2 – PASSED ✔
RULE 3 – FAILED ✗
HISTORICAL
DATA
RULES
RULE 1
RULE 2
RULE 3
…
Challenge
■ Overall SLA 750 ms
■ Loss of Business due to latency
■ Every Credit Card transaction requires
hundreds of DB reads/writes
Need to scale reliably
■ 10 → 100 TB
■ 10B → 100B objects
■ 200k → 1 Million+ TPS
Selected NoSQL
■ Built for Flash
■ Predictable Low latency at High Throughput
■ Immediate consistency, no data loss
■ Cross data center (XDR) support
■ 20 Server Cluster
■ Dell 730xd w/ 4NVMe SSDs
Example - Fraud Prevention
■ 3 node cluster, Intel S3700 SSDs
■ Followed religiously all DataStax recommendations
■ Standard YCSB, includes instructions to reproduce for your workload
■ http://www.aerospike.com/blog/comparing-nosql-databases-aerospike-and-cassandra/
Aerospike vs Cassandra ( 2016 )
Online Learning
Leveraging Aerospike to Power Real-time Analytics
Nielsen Marketing Cloud Webinar
Brent Keator
VP Infrastructure
Nielsen Marketing Cloud
Kevin Lyons
Senior VP Data Science
Nielsen Marketing Cloud
YouTube: Nielsen Marketing Cloud Aerospike Webinar 2016
Aerospike: https://aerospike.com/webinars
Models that build profitable marketing audiences at scale...
Finding more of your best
customers: High-income business
professional
The Modeling Process, simplified
2012: 30 - 40 models, leveraging billions of events, creating 100 million+ scores
2015: over 1000 models, ‘leveraging’ trillions of events, creating 150 billion+ scores / day
The Challenge
A system that creates as many models as we want, when
we want them, and that dynamically adapts in real time
to changing conditions
▪ Automatically creates, validates, ships, and
monitors models, with a capacity that scales
to tens of thousands of models
The Opportunity
What we really need:
In other words, we simply need ….
Online models evolve &
adapt over time, in
reaction to a changing
environment with each
and every event
Given a complete
data set, a batch
model is created in
entirety all at once
Introducing Online Learning
Batch Online Learning
Creation Evolution
Batch Models (Offline):
● large-scale data storage
● large-scale data movement
● painful data aggregation
● lots of manual everything
● Harder to build models, but easier to evaluate
Online Learning:
● limited data storage, mostly for monitoring
● event-level data streams
● light data aggregation
● lots of automatic everything
● Easier to build, but harder to evaluate (& support)
Batch Models (Offline) vs. Online Learning
● Outperformed both L2 and Elastic Net
● Leverages small (‘micro’) batches
● Validates and monitors models in real time
● Alerts team when models are not behaving
Some Techno Mumbo Jumbo
Stochastic gradient descent with L1 regularization
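As a rough illustration of "stochastic gradient descent with L1 regularization" on micro-batches, here is a minimal sketch of an online logistic-regression learner. It is not the eXelate implementation: the choice of logistic regression, the soft-threshold (proximal) form of the L1 step, the class name `OnlineLogisticL1`, and the default `lr` and `l1` values are all assumptions made for this example.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    # Clamp the input so math.exp cannot overflow.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(w: float, t: float) -> float:
    """Shrink w toward zero by t; small weights become exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


class OnlineLogisticL1:
    """Logistic regression updated one micro-batch at a time."""

    def __init__(self, n_features: int, lr: float = 0.1, l1: float = 0.01) -> None:
        self.w: List[float] = [0.0] * n_features
        self.b = 0.0
        self.lr = lr  # learning rate (assumed value)
        self.l1 = l1  # L1 penalty strength (assumed value)

    def predict_proba(self, x: Sequence[float]) -> float:
        return sigmoid(self.b + sum(wi * xi for wi, xi in zip(self.w, x)))

    def partial_fit(self, batch: Sequence[Tuple[Sequence[float], int]]) -> None:
        """Apply one SGD step from a micro-batch of (features, label) pairs."""
        n = len(batch)
        grad_w = [0.0] * len(self.w)
        grad_b = 0.0
        for x, y in batch:
            err = self.predict_proba(x) - y  # gradient of log-loss w.r.t. the logit
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        self.b -= self.lr * grad_b / n
        for j in range(len(self.w)):
            stepped = self.w[j] - self.lr * grad_w[j] / n
            # L1 step: shrink each weight, which drives weak features to zero.
            self.w[j] = soft_threshold(stepped, self.lr * self.l1)
```

Each call to `partial_fit` touches only the current micro-batch, so the model can keep adapting as events arrive, and the L1 shrinkage acts as a simple form of automatic feature selection.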
eXelate.com @eXelate
Technical Solutions
How do we do it?
eXpresso Serving Cluster
10B+ events/day
300+ nodes across
4 data centers
eXtream Modeling Cluster
160B models/day
100+ nodes across
4 data centers
JGroups Distributed Messaging
Serving Layer
Our Aerospike “Citrusleaf” Use-Cases
Unique User DataStore
53 Servers across
4 data centers
Specs
Memory: 512GB
CPU: e5-2620v2 (Dual-Socket)
Disk: Intel S3710 (13-15 1.2TB SSDs)
Network: Aggregated 10GB NICs
2-Namespaces
Online Learning (Models DataStore)
9 Servers across
3 data centers
Specs
Memory: 32GB
CPU: e5-2620 (Dual-Socket)
Disk: 1-240GB SSDs
Network: Aggregated 1GB NICs
1-Namespace
Online Learning
Batch Models (Offline):
● Batch
● Predefined ratio
● Predefined feature selection
● One time validation
Online Learning:
● Streaming
● Downsampling
● Automated feature selection
● Ongoing data cleaning
● Ongoing validation
The Online Learning Challenge
● All necessary data already exists in eXtream
● The cluster’s processing resources can be better utilized
● eXtream addresses most performance / scalability requirements
● Scoring mechanism already exists
eXtream as a Framework for Online Learning
Why it works...
Online Learning Flow
● Labeling Mechanism - customer defined target audience
Events Classification
● Downsampling mechanism
● Burst tolerance
● Duplicate entries
Dataset Preparation
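The dataset preparation step could look something like the sketch below, which drops duplicate entries and downsamples the majority (not-converted) class as events stream in. The event shape `(event_id, label)`, the function name `prepare_stream`, and the parameters `negative_keep_prob` and `dedup_window` are hypothetical; the slides do not describe the actual mechanism.

```python
import random
from collections import deque
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, int]  # (event_id, label): hypothetical event shape


def prepare_stream(
    events: Iterable[Event],
    negative_keep_prob: float,
    seed: int = 0,
    dedup_window: int = 10_000,
) -> Iterator[Event]:
    """Yield de-duplicated events, keeping only a sample of negatives."""
    rng = random.Random(seed)
    seen = set()     # ids inside the current de-duplication window
    order = deque()  # same ids in arrival order, used for eviction
    for event_id, label in events:
        if event_id in seen:
            continue  # duplicate entry: skip it
        seen.add(event_id)
        order.append(event_id)
        if len(order) > dedup_window:
            # Forget the oldest id so memory stays bounded.
            seen.discard(order.popleft())
        # Keep every positive; keep negatives with the given probability.
        if label == 1 or rng.random() < negative_keep_prob:
            yield (event_id, label)
```

Setting `negative_keep_prob` below 1.0 thins out the not-converted events so the training stream moves toward a chosen class ratio.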
● Blacklist
● Whitelist
● Automatic Tuning
Features Selection
● Sliding window of recent events
● 60/40 not-converted/converted ratio
● Various accuracy metrics (lift, precision, recall, confusion matrix)
● Decide if the model is ready for making predictions
Model Validation
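The accuracy metrics named above can be computed from a confusion matrix over a window of recent labeled events, roughly as sketched here. Lift is taken as precision divided by the base conversion rate, which is one common definition and an assumption on my part; the readiness thresholds in `is_ready` are purely illustrative.

```python
from typing import Dict, Sequence


def validation_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Confusion-matrix counts plus precision, recall, and lift."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    base_rate = (tp + fn) / total if total else 0.0
    # Lift: how much better than random targeting the predicted audience is.
    lift = precision / base_rate if base_rate else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": precision, "recall": recall, "lift": lift,
    }


def is_ready(metrics: Dict[str, float], min_lift: float = 1.5, min_recall: float = 0.5) -> bool:
    """Decide whether a model may start making predictions (assumed thresholds)."""
    return metrics["lift"] >= min_lift and metrics["recall"] >= min_recall
```

Running this over a sliding window of the most recent events gives an ongoing validation signal rather than a one-time check.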
● Two phases (Scoring, Re-code)
● Scale vs Accuracy tradeoff
Predictions Mechanism
Scalability / Performance
Thousands of Concurrent Models: training, validation, scoring
High Throughput: billions of training events per day
Why do we need it?
● Store the models in one common place
● Persistence
● Built-in replication
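Storing models "in one common place" might look like the following sketch, which writes model weights as a record and reads them back. The helpers accept any client object exposing `put(key, bins)` and `get(key)` in the shape used by the Aerospike Python client, where a key is a `(namespace, set, primary_key)` tuple and `get` returns `(key, meta, bins)`. The namespace, set, and bin names, and the JSON encoding of the weights, are assumptions made for this example.

```python
import json
from typing import Any, Dict, List, Tuple

NAMESPACE = "models"          # hypothetical namespace name
SET_NAME = "online_learning"  # hypothetical set name


def model_key(model_id: str) -> Tuple[str, str, str]:
    """Build a (namespace, set, primary key) tuple for one model."""
    return (NAMESPACE, SET_NAME, model_id)


def save_model(client: Any, model_id: str, weights: List[float], version: int) -> None:
    """Write the model as one record; weights are JSON-encoded into a bin."""
    bins = {"weights": json.dumps(weights), "version": version}
    client.put(model_key(model_id), bins)


def load_model(client: Any, model_id: str) -> Dict[str, Any]:
    """Read the record back and decode the weights."""
    _key, _meta, bins = client.get(model_key(model_id))
    return {"weights": json.loads(bins["weights"]), "version": bins["version"]}
```

With the real client, the object would typically come from something like `aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()`, with the host and port adjusted to the cluster. Persistence and replication are then handled by the database rather than by this code.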
Scalability / Performance
Why do we need it?
XDR Replication Map
Inter-DC Network
Bare-Metal Cloud
LVS/GSLB/XDR = HA
Online Learning Datastore
Replication
Monitoring - Why do we need it?
thousands of models
automatically created by users
some models won’t converge
Monitoring - Real Time
Monitoring - Aggregation
Monitoring - DS Bot
eXelate.com @eXelate
Thank You!

More Related Content

What's hot

Amazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS SummitAmazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS SummitAmazon Web Services
 
HPC on Azure for Reserach
HPC on Azure for ReserachHPC on Azure for Reserach
HPC on Azure for ReserachJürgen Ambrosi
 
What's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS SummitWhat's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS SummitAmazon Web Services
 
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...Anand Haridass
 
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...EMC
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Amazon Web Services
 
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...Amazon Web Services
 
High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101Amazon Web Services
 
Disaggregated Hadoop Stacks
Disaggregated Hadoop StacksDisaggregated Hadoop Stacks
Disaggregated Hadoop StacksDataWorks Summit
 
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitFoundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitAmazon Web Services
 
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AIAWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AIAmazon Web Services
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAlluxio, Inc.
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudCloudera, Inc.
 
The Future of Database Migration is Cloud, AWS Federal Pop-Up Loft
The Future of Database Migration is Cloud, AWS Federal Pop-Up LoftThe Future of Database Migration is Cloud, AWS Federal Pop-Up Loft
The Future of Database Migration is Cloud, AWS Federal Pop-Up LoftAmazon Web Services
 
Randall's re:Invent Recap
Randall's re:Invent RecapRandall's re:Invent Recap
Randall's re:Invent RecapRandall Hunt
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017Cloudera Japan
 
NetApp CTO Predictions 2018
NetApp CTO Predictions 2018NetApp CTO Predictions 2018
NetApp CTO Predictions 2018NetApp
 

What's hot (20)

Amazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS SummitAmazon EC2 Foundations - SRV319 - Anaheim AWS Summit
Amazon EC2 Foundations - SRV319 - Anaheim AWS Summit
 
HPC on Azure for Reserach
HPC on Azure for ReserachHPC on Azure for Reserach
HPC on Azure for Reserach
 
Amazon EC2 Foundations
Amazon EC2 FoundationsAmazon EC2 Foundations
Amazon EC2 Foundations
 
What's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS SummitWhat's new with Amazon Redshift - ADB203 - New York AWS Summit
What's new with Amazon Redshift - ADB203 - New York AWS Summit
 
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
2016 Sept 1st - IBM Consultants & System Integrators Interchange - Big Data -...
 
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
Building Hadoop-as-a-Service with Pivotal Hadoop Distribution, Serengeti, & I...
 
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
Introducing Amazon EC2 P3 Instance - Featuring the Most Powerful GPU for Mach...
 
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
EUT302_Data Ingestion at Seismic Scale Best Practices for Processing Petabyte...
 
High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101High Performance Computing (HPC) on AWS 101
High Performance Computing (HPC) on AWS 101
 
Disaggregated Hadoop Stacks
Disaggregated Hadoop StacksDisaggregated Hadoop Stacks
Disaggregated Hadoop Stacks
 
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS SummitFoundations of Amazon EC2 - SRV319 - Chicago AWS Summit
Foundations of Amazon EC2 - SRV319 - Chicago AWS Summit
 
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AIAWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
AWS Initiate Day Manchester 2019 – AWS Big Data Meets AI
 
Achieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloadsAchieving compute and storage independence for data-driven workloads
Achieving compute and storage independence for data-driven workloads
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in Cloud
 
Cisco NetApp VMware - Long Distance VMotion
Cisco NetApp VMware - Long Distance VMotionCisco NetApp VMware - Long Distance VMotion
Cisco NetApp VMware - Long Distance VMotion
 
The Future of Database Migration is Cloud, AWS Federal Pop-Up Loft
The Future of Database Migration is Cloud, AWS Federal Pop-Up LoftThe Future of Database Migration is Cloud, AWS Federal Pop-Up Loft
The Future of Database Migration is Cloud, AWS Federal Pop-Up Loft
 
Amazon EC2 Foundations
Amazon EC2 FoundationsAmazon EC2 Foundations
Amazon EC2 Foundations
 
Randall's re:Invent Recap
Randall's re:Invent RecapRandall's re:Invent Recap
Randall's re:Invent Recap
 
How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017How to go into production your machine learning models? #CWT2017
How to go into production your machine learning models? #CWT2017
 
NetApp CTO Predictions 2018
NetApp CTO Predictions 2018NetApp CTO Predictions 2018
NetApp CTO Predictions 2018
 

Similar to Aerospike for machine learning

Real-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian BulkowskiReal-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian BulkowskiData Con LA
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalAvere Systems
 
Deep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceDeep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceAmazon Web Services
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...Amazon Web Services
 
Virtual Storage Center
Virtual Storage CenterVirtual Storage Center
Virtual Storage CenterIBM Danmark
 
Oracle Exec Summary 7000 Unified Storage
Oracle Exec Summary 7000 Unified StorageOracle Exec Summary 7000 Unified Storage
Oracle Exec Summary 7000 Unified StorageDavid R. Klauser
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Riccardo Zamana
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
AI Scalability for the Next Decade
AI Scalability for the Next DecadeAI Scalability for the Next Decade
AI Scalability for the Next DecadePaula Koziol
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Precisely
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Amazon Web Services
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8MongoDB
 
Best Practices for Building Open Source Data Layers
Best Practices for Building Open Source Data LayersBest Practices for Building Open Source Data Layers
Best Practices for Building Open Source Data LayersIBMCompose
 
Aerospike: Enabling Your Digital Transformation
Aerospike: Enabling Your Digital TransformationAerospike: Enabling Your Digital Transformation
Aerospike: Enabling Your Digital TransformationBrillix
 
Taking SharePoint to the Cloud
Taking SharePoint to the CloudTaking SharePoint to the Cloud
Taking SharePoint to the CloudAaron Saikovski
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADXRiccardo Zamana
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overviewKognitio
 

Similar to Aerospike for machine learning (20)

Real-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian BulkowskiReal-Time Analytics in Transactional Applications by Brian Bulkowski
Real-Time Analytics in Transactional Applications by Brian Bulkowski
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
Optimiser votre infrastructure SQL Server avec Azure
Optimiser votre infrastructure SQL Server avec AzureOptimiser votre infrastructure SQL Server avec Azure
Optimiser votre infrastructure SQL Server avec Azure
 
Solving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute finalSolving enterprise challenges through scale out storage & big compute final
Solving enterprise challenges through scale out storage & big compute final
 
Deep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance PerformanceDeep Dive on Delivering Amazon EC2 Instance Performance
Deep Dive on Delivering Amazon EC2 Instance Performance
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
Choosing the Right EC2 Instance and Applicable Use Cases - AWS June 2016 Webi...
 
Virtual Storage Center
Virtual Storage CenterVirtual Storage Center
Virtual Storage Center
 
Oracle Exec Summary 7000 Unified Storage
Oracle Exec Summary 7000 Unified StorageOracle Exec Summary 7000 Unified Storage
Oracle Exec Summary 7000 Unified Storage
 
Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020Azure Data Explorer deep dive - review 04.2020
Azure Data Explorer deep dive - review 04.2020
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
AI Scalability for the Next Decade
AI Scalability for the Next DecadeAI Scalability for the Next Decade
AI Scalability for the Next Decade
 
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
Big Data Goes Airborne. Propelling Your Big Data Initiative with Ironcluster ...
 
Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”Building Analytic Apps for SaaS: “Analytics as a Service”
Building Analytic Apps for SaaS: “Analytics as a Service”
 
Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8Webinar: High Performance MongoDB Applications with IBM POWER8
Webinar: High Performance MongoDB Applications with IBM POWER8
 
Best Practices for Building Open Source Data Layers
Best Practices for Building Open Source Data LayersBest Practices for Building Open Source Data Layers
Best Practices for Building Open Source Data Layers
 
Aerospike: Enabling Your Digital Transformation
Aerospike: Enabling Your Digital TransformationAerospike: Enabling Your Digital Transformation
Aerospike: Enabling Your Digital Transformation
 
Taking SharePoint to the Cloud
Taking SharePoint to the CloudTaking SharePoint to the Cloud
Taking SharePoint to the Cloud
 
Time Series Analytics Azure ADX
Time Series Analytics Azure ADXTime Series Analytics Azure ADX
Time Series Analytics Azure ADX
 
Kognitio - an overview
Kognitio - an overviewKognitio - an overview
Kognitio - an overview
 

Recently uploaded

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
DEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamDEV meet-up UiPath Document Understanding May 7 2024 Amsterdam
DEV meet-up UiPath Document Understanding May 7 2024 AmsterdamUiPathCommunity
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
DBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDBX First Quarter 2024 Investor Presentation
DBX First Quarter 2024 Investor PresentationDropbox
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 

Recently uploaded (20)

Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Corporate and higher education May webinar.pptx

Aerospike for machine learning

  • 1. Using Aerospike and Machine Learning Brian Bulkowski CTO, Founder @bbulkow
  • 2. © 2016 Aerospike Inc. All rights reserved. What is Aerospike? Large-scale DHT database (10B++ objects, 100T++, O(1) get/put), with queries, data structures, UDF, fast clients, on Linux. High-availability clustering & rebalancing (proven five 9’s, no load balancer). Very high performance C code for reads and writes (2M++ TPS from Flash, 4M++ TPS from DRAM, per server). KVS++ provides query, UDF, table/columns, aggregations, SQL. Direct-attach storage; persistence through replication and Flash. Cloud-savvy: runs with EC2, GCE, others; Docker, more. Dual license: open source for devs, Enterprise for deployment.
  • 3. Architecture Overview: Flash-based system of engagement. Diagram labels: legacy database (mainframe); XDR; decisioning engine; data warehouse / data lake; legacy RDBMS; HDFS based; business transactions (web views, payments, mobile queries, recommendation, and more); high performance NoSQL (“real-time big data”, “decisioning”). 500 business transactions per sec × 5000 calculations per sec = 2.5M database transactions per sec.
  • 4. Example: Fraud Prevention. Diagram labels: credit card processing system; fraud detection & protection app; account behavior; account statistics; static data; historical data; rules (Rule 1 passed ✔, Rule 2 passed ✔, Rule 3 failed ✗, …). Challenge ■ Overall SLA 750 ms ■ Loss of business due to latency ■ Every credit card transaction requires hundreds of DB reads/writes. Need to scale reliably ■ 10 → 100 TB ■ 10B → 100B objects ■ 200k → 1 million+ TPS. Selected NoSQL ■ Built for Flash ■ Predictable low latency at high throughput ■ Immediate consistency, no data loss ■ Cross data center (XDR) support ■ 20 server cluster ■ Dell 730xd w/ 4 NVMe SSDs
  • 5. Aerospike vs Cassandra (2016) ■ 3 node cluster, Intel S3700 SSDs ■ Followed religiously all DataStax recommendations ■ Standard YCSB, includes instructions to reproduce for your workload ■ http://www.aerospike.com/blog/comparing-nosql-databases-aerospike-and-cassandra/
  • 6. Aerospike vs Cassandra (2016)
  • 7. Aerospike vs Cassandra (2016)
  • 8. Online Learning: Leveraging Aerospike to Power Real-time Analytics
  • 9. Nielsen Marketing Cloud Webinar. Brent Keator, VP Infrastructure, Nielsen Marketing Cloud; Kevin Lyons, Senior VP Data Science, Nielsen Marketing Cloud. YouTube: Nielsen Marketing Cloud Aerospike Webinar 2016. Aerospike: https://aerospike.com/webinars
  • 10. Models that build profitable marketing audiences at scale... Finding more of your best customers: High-income business professional
  • 11. The Modeling Process, simplified
  • 12. The Challenge. 2012: 30-40 models leveraging billions of events, creating 100 million+ scores. 2015: over 1000 models ‘leveraging’ trillions of events, creating 150 billion+ scores / day.
  • 13. The Opportunity. What we really need: a system that creates as many models as we want, when we want them, and that dynamically adapts in real time to changing conditions ▪ Automatically creates, validates, ships, and monitors models, with a capacity that scales to tens of thousands of models
  • 14. In other words, we simply need ….
  • 15. Introducing Online Learning. Batch (creation): given a complete data set, a batch model is created in its entirety, all at once. Online learning (evolution): online models evolve & adapt over time, in reaction to a changing environment, with each and every event.
  • 16. Batch Models (Offline) vs. Online Learning. Batch models (offline): large-scale data storage; large-scale data movement; painful data aggregation; lots of manual everything; harder to build models, but easier to evaluate. Online learning: limited data storage, mostly for monitoring; event-level data streams; light data aggregation; lots of automatic everything; easier to build, but harder to evaluate (& support).
  • 17. Some Techno Mumbo Jumbo: Stochastic gradient descent with L1 regularization ● Outperformed both L2 and Elastic Net ● Leverages small (‘micro’) batches ● Validates and monitors models in real time ● Alerts team when models are not behaving
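
To make the micro-batch idea on slide 17 concrete, here is a minimal sketch of stochastic gradient descent with L1 regularization for a logistic-regression model. The slides do not say which SGD variant, loss, or hyperparameters were used, so everything here is an assumption for illustration: the proximal (soft-thresholding) update, the log loss, the names `sgd_l1_step` and `train`, the values of `lr`, `l1`, and `batch_size`, and the synthetic event stream.

```python
import math
import random


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(w, t):
    # Proximal operator of t * |w|: shrink toward zero, clamp small values to 0.
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0


def sgd_l1_step(weights, batch, lr=0.1, l1=0.01):
    """One micro-batch update; batch is a list of (features, label) pairs."""
    grad = [0.0] * len(weights)
    for x, y in batch:
        err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(batch)
    # Gradient step on the log loss, then shrinkage for the L1 penalty.
    return [soft_threshold(w - lr * g, lr * l1) for w, g in zip(weights, grad)]


def train(stream, n_features, batch_size=8, lr=0.1, l1=0.01):
    # Consume events one at a time and update once per full micro-batch.
    weights = [0.0] * n_features
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            weights = sgd_l1_step(weights, batch, lr, l1)
            batch = []
    return weights


# Synthetic stream (made up): feature 0 drives the label,
# feature 1 is pure noise, feature 2 is a constant bias term.
rng = random.Random(0)
events = []
for _ in range(4000):
    x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]
    events.append((x, 1 if x[0] > 0 else 0))

weights = train(events, n_features=3)
print(weights)
```

The L1 shrinkage pulls the weight of the uninformative feature toward zero while the informative one keeps growing, which is the sparsity effect that usually motivates L1 over L2. Because the model only ever holds the current weights and one small batch, it can keep adapting as events arrive.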
  • 19. eXpresso Serving Cluster: 10B+ events/day, 300+ nodes across 4 data centers. eXtream Modeling Cluster: 160B models/day, 100+ nodes across 4 data centers. JGroups Distributed Messaging. Serving Layer.
  • 20. Our Aerospike “Citrusleaf” Use-Cases. Unique User DataStore: 53 servers across 4 data centers. Specs: Memory: 512GB; CPU: e5-2620v2 (Dual-Socket); Disk: Intel S3710 (13-15 1.2TB SSDs); Network: Aggregated 10GB NICs; 2 Namespaces. Online Learning (Models DataStore): 9 servers across 3 data centers. Specs: Memory: 32GB; CPU: e5-2620 (Dual-Socket); Disk: 1-240GB SSDs; Network: Aggregated 1GB NICs; 1 Namespace.
  • 21. The Online Learning Challenge. Batch models (offline): batch; predefined ratio; predefined feature selection; one-time validation. Online learning: streaming; downsampling; automated feature selection; ongoing data cleaning; ongoing validation.
  • 22. eXtream as a Framework for Online Learning. Why it works... ● All necessary data already exists in eXtream ● The cluster’s processing resources can be better utilized ● eXtream addresses most performance / scalability requirements ● Scoring mechanism already exists
  • 24. Events Classification ● Labeling mechanism: customer-defined target audience
  • 25. Dataset Preparation ● Downsampling mechanism ● Burst tolerance ● Duplicate entries
  • 26. Feature Selection ● Blacklist ● Whitelist ● Automatic tuning
  • 27. Model Validation ● Sliding window of recent events ● 60/40 not-converted/converted ratio ● Various accuracy metrics (lift, precision, recall, confusion matrix) ● Decide if the model is ready for making predictions
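
The validation loop on slide 27 can be sketched as a fixed-size window of recent (predicted, actual) pairs from which the confusion matrix, precision, recall, and lift are recomputed on demand. This is an illustrative sketch, not the presenters' implementation: the class name `SlidingValidator`, the readiness thresholds, and the definition of lift as precision divided by the conversion rate in the window are all assumptions. The sample data simply uses the 60/40 not-converted/converted mix mentioned on the slide.

```python
from collections import deque


class SlidingValidator:
    """Keeps (predicted, actual) pairs for the most recent events only."""

    def __init__(self, window=1000):
        # deque with maxlen drops the oldest event automatically.
        self.events = deque(maxlen=window)

    def record(self, predicted, actual):
        self.events.append((bool(predicted), bool(actual)))

    def confusion(self):
        # Returns (true pos, false pos, false neg, true neg) for the window.
        tp = sum(1 for p, a in self.events if p and a)
        fp = sum(1 for p, a in self.events if p and not a)
        fn = sum(1 for p, a in self.events if not p and a)
        tn = sum(1 for p, a in self.events if not p and not a)
        return tp, fp, fn, tn

    def metrics(self):
        tp, fp, fn, tn = self.confusion()
        total = tp + fp + fn + tn
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        base_rate = (tp + fn) / total if total else 0.0
        # Assumed definition: how much better than picking users at random.
        lift = precision / base_rate if base_rate else 0.0
        return {"precision": precision, "recall": recall, "lift": lift}

    def is_ready(self, min_events=100, min_precision=0.7, min_lift=1.5):
        # Hypothetical thresholds; real values would be tuned per model.
        if len(self.events) < min_events:
            return False
        m = self.metrics()
        return m["precision"] >= min_precision and m["lift"] >= min_lift


# Ten made-up events: 4 converted, 6 not converted (the 60/40 mix).
validator = SlidingValidator(window=10)
sample = (
    [(True, True)] * 3      # correctly predicted conversions
    + [(False, True)]       # missed conversion
    + [(True, False)]       # false alarm
    + [(False, False)] * 5  # correctly predicted non-conversions
)
for predicted, actual in sample:
    validator.record(predicted, actual)

print(validator.confusion())
print(validator.metrics())
print(validator.is_ready(min_events=10))
```

Recomputing metrics from a bounded window keeps memory constant per model, which matters when thousands of models are validated concurrently, and it lets a model that drifts fall back out of the "ready" state.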
  • 28. Predictions Mechanism ● Two phases (Scoring, Re-code) ● Scale vs. accuracy tradeoff
  • 29. Scalability / Performance. Thousands of concurrent models: training, validation, scoring. High throughput: billions of training events per day.
  • 30. Scalability / Performance: Why do we need it? ● Store the models in one common place ● Persistence ● Built-in replication
  • 31. Online Learning Datastore Replication. XDR replication map: inter-DC network; bare-metal; cloud; LVS/GSLB/XDR = HA.
  • 32. Monitoring: Why do we need it? Thousands of models are automatically created by users; some models won’t converge.