SlideShare ist ein Scribd-Unternehmen logo
1 von 53
Downloaden Sie, um offline zu lesen
Building a Real-Time Fraud
Prevention Engine Using Open
Source (Big Data) Software
Kees Jan de Vries
Booking.com
Who am I?
About me
• Physics PhD Imperial College
• Data Scientist at Booking.com for 1.5 year
▸ Security Department
• linkedin.com/in/kees-jan-de-vries-93767240
About Booking.com
• World leader in connecting travellers with the widest
variety of great places to stay
• Part of The Priceline Group, the world’s 3rd largest
e-commerce company (by market capitalisation)
• Employing 14,000 people in 180 countries
• Each day, over 1,200,000 room nights are reserved on
Booking.com
Contents.
• Motivation
• Running Example
▸ Probability to Book
• Real Time Prediction Engine: Lessons Learnt
▸ Aggregate Features
▸ Models Training and Deployment
▸ Interpretation of Individual Scores
Motivation
Motivation.
booking.com
Motivation.
booking.com
Booking.com
• Empower people to experience the
world
Security
• Provide secure reservation system
(facilitating hundreds of thousands of
reservations on a daily basis)
Fraud Prevention
• Protect customers and property
partners, ensuring the best possible
experience
Motivation for this Talk.
Train awesome Model
Serve in Real Time
at Low Latency using
aggregate features
Running Example
Probability to Book
Disclaimer: although the running example
presented in these slides may seem realistic, it is
only intended to highlight some lessons learnt
about building a real time prediction engine.
Probability to Book.
Let’s book a hotel for
the Spark Summit
East 2017 in Boston
Probability to Book.
OK. Let’s click on the
first hit.
Probability to Book.
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Probability to Book.
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Let’s help the customer
make the best choice
for them!
Probability to Book.
We’ll calculate the
probability of booking
when the Customer
clicks on an
accommodation
Nice. Here’s all the
information I need.
But maybe I’ll browse
a few more, to make
the best choice.
Behind the Scenes.
User id
Page id
...
Action
PredictionEngine
Click!
Behind the Scenes.
Labels
• Booked? Yes/No
Features
• Simple
▸ Time of day
▸ ...
• Profile
▸ User
▸ Circumstantial
Click!
Behind the Scenes.
Labels
• Booked? Yes/No
Features
• Simple
▸ Time of day
▸ ...
• Profile
▸ User
▸ Circumstantial
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Click!
Behind the Scenes.
Features
Model
Action
User id
...
Click!
Action
Behind the Scenes.
Features
Model
Action
User id
...
Click!
Aggregate
features
“ ”
Action
Aggregate Features
Aggregate Features.
Time
Requests Data Streams
Click Booking
Aggregate Features.
Time
Requests Data Streams
Click Booking
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features.
Users
Time
...
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Requests
Clicks (Requests)
Aggregate Features.
Users
Time
...
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Requests
Clicks (Requests)
Clicks coincide
with requests, so
aggregation is
preferably instant
Aggregate Features.
Users
Time
...
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Requests
Clicks (Requests)
Clicks coincide
with requests, so
aggregation is
preferably instant
Small time
window, so limited
amount of data
Aggregate Features.
SELECT
COUNT(DISTINCT hotel_id)
FROM cliks.win:time(30 min)
GROUP BY user_id
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
In memory Complex Event Processing
• No lag: instant aggregation
• Scalability: see Esper website
• Not persistent
Aggregate Features.
SELECT
COUNT(DISTINCT hotel_id)
FROM cliks.win:time(30 min)
GROUP BY user_id
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...From http://espertech.com/esper/faq_esper.php#streamprocessing
“Complex Event Processing and Esper are standing queries and latency to the answer is
usually below 10us with more than 99% predictability.”
“The first component of scaling is the throughput that can be achieved running
single-threaded. For Esper we think this number is very high and likely between 10k to
200k events per second.”
In memory Complex Event Processing
Aggregate Features.
Users
Time
...
Requests
Bookings
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features.
Users
Time
...
Requests
Bookings
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Very High Cardinality:
features for every user who
made at least one booking
Aggregate Features
High Cardinality Features with Cassandra
• Very scalable: reads & writes
• None-instant aggregations
▸ Consistency fundamentally bound
by gossip protocol
• Persistent
( )
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features
High Cardinality Features with Cassandra
( )
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
From http://cassandra.apache.org/
● Proven
● Fault Tolerant
● Performant
● Scalable
● Elastic
● ...
“Some of the largest production
deployments include Apple's, with over
75,000 nodes storing over 10 PB of
data, Netflix (2,500 nodes, 420 TB, over
1 trillion requests per day), …”
Aggregate Features.
HotelPages
Time
...
Requests
Bookings/Clicks
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features.
HotelPages
Time
...
Requests
Bookings/Clicks
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Medium Cardinality: limited
by number of hotels
Aggregate Features.
HotelPages
Time
...
Requests
Bookings/Clicks
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Medium Cardinality: limited
by number of hotels
Larger lag is ok with time
window of 3 months and
large number of events per
page.
Aggregate Features
SELECT
page_id
, COUNT(*) AS res_count
FROM reservations
GROUP BY page_id
Low to Medium Cardinality
• No need to go “fancy”
• Batch processing
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Models
Training and deployment
Aggregate Features for Training.
Time
Requests Data Streams
Click Reservation
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features for Training.
Time
Requests Data Streams
Click Reservation
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
Aggregate Features for Training.
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
SELECT
request.request_id
, COUNT DISTINCT event.page_id
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch
AND request.epoch + 30*60
GROUP BY request.request_id
Aggregate Features for Training.
Aggregate
• User
▸ # (distinct) hotels
viewed in last 30
minutes
▸ # bookings so far
▸ ...
• Hotel Page
▸ % booking per page
view last 3 months
▸ ...
SELECT
request.request_id
, COUNT DISTINCT event.page_id
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch
AND request.epoch + 30*60
GROUP BY request.request_id
Warning: stragglers
ahead! Distribute wisely!
Aggregate Features for Training.
SELECT
request.request_id
, COUNT DISTINCT event.page_id
FROM request
JOIN event ON
request.user_id = event.user_id
WHERE event.epoch BETWEEN
request.epoch
AND request.epoch + 30*60
GROUP BY request.request_id
Scalable Technologies
• Simple SQL
▸ Hive
• Facing stragglers
▸ Spark
Model Training & Iteration
y X
Train Model
Predict
YES
Predict
NO
Actual YES TP FN
Actual NO FP TN
Investigate
Evaluate
• Features
▸ Engineering
▸ Selection
• Hyper parameters
Improve Model
Sorry: H2O?
The benchmark shown on this slide can be found in:
http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html
More benchmarks:
https://github.com/szilard/benchm-ml
Time to Deploy
model = H2ORandomForestEstimator(ntrees=50, max_depth=10)
model.train(x=inputCols, y="label",
training_frame=trainingDf,
validation_frame=validationDf)
model.download_pojo("/my/model/path/")
Time to Deploy.
model = H2ORandomForestEstimator(ntrees=50, max_depth=10)
model.train(x=inputCols, y="label",
training_frame=trainingDf,
validation_frame=validationDf)
model.download_pojo("/my/model/path/")
AWESOME!!!
POJO: Plain Old Java Object. Model export in plain Java code for real- time scoring
on the JVM. Very fast!
Model Interpretation
Score Interpretation.
feature 15 > 9
feature 2 > 0.56 feature 4 > 200
yes no
yes no yes no
feature 11 > 6
yes no
Logistic Regression
Decision Trees
feature 1 feature 2+ 13.9 x + …x = -5.12 x
Decision tree interpretation based on:
http://blog.datadive.net/interpreting-random-forests
Contribution Feature i:
i
xi
p0 Contribution Feature i:
pnode
- pparent node
p2
p1
p3
p4
p6
p7
p5
p8
Example:
p3
= p0
+ (p1
-p0
) + (p3
-p1
)
p0
: bias
(p1
-p0
): feature 15 contr.
(p3
-p1
): feature 2 contr.
Score Interpretation.
feature 15 > 9
feature 2 > 0.56 feature 4 > 200
yes no
yes no yes no
feature 11 > 6
yes no
Logistic Regression
Decision Trees
feature 1 feature 2+ 13.9 x + …x = -5.12 x
Decision tree interpretation based on:
http://blog.datadive.net/interpreting-random-forests
Contribution Feature i:
i
xi
p0 Contribution Feature i:
pnode
- pparent node
p2
p1
p3
p4
p6
p7
p5
p8
We hacked this
interpretation into Spark
ML during a Hackathon
at Booking:) H2O does
not offer it (yet?).
Probability to Book.
The probability of
booking is high; largest
contribution from
#distinct property
pages. Let’s show a
summary!
Probability to Book.
Awesome! This app
really helps me book!
The probability of
booking is high; largest
contribution from
#distinct property
pages. Let’s show a
summary!
Summary
Summary.
• How to make Aggregate Features available?
▸ Long/short time windows
▸ High and Low cardinality dimensions
• Model Training and Deployment
▸ H2O POJO for fast Real-Time scoring
• Interpretation
▸ Contributions per Feature per Score
Thank You.
Questions?
We’re hiring!
Kees Jan de Vries from Booking.com
linkedin.com/in/kees-jan-de-vries-93767240

Weitere ähnliche Inhalte

Was ist angesagt?

Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsDataWorks Summit
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowDremio Corporation
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & FeaturesDataStax Academy
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Databricks
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeDatabricks
 
Modern Data Flow
Modern Data FlowModern Data Flow
Modern Data Flowconfluent
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiDatabricks
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.Taras Matyashovsky
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Flink Forward
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...HostedbyConfluent
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceDatabricks
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design PatternsJohn Yeung
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureDatabricks
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayDataWorks Summit
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheDremio Corporation
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesDatabricks
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)KafkaZone
 

Was ist angesagt? (20)

Ozone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objectsOzone: scaling HDFS to trillions of objects
Ozone: scaling HDFS to trillions of objects
 
Building a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache ArrowBuilding a Virtual Data Lake with Apache Arrow
Building a Virtual Data Lake with Apache Arrow
 
Amazon Aurora
Amazon AuroraAmazon Aurora
Amazon Aurora
 
Cassandra Introduction & Features
Cassandra Introduction & FeaturesCassandra Introduction & Features
Cassandra Introduction & Features
 
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
Spark SQL Adaptive Execution Unleashes The Power of Cluster in Large Scale wi...
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta LakeSimplify CDC Pipeline with Spark Streaming SQL and Delta Lake
Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake
 
Modern Data Flow
Modern Data FlowModern Data Flow
Modern Data Flow
 
A Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and HudiA Thorough Comparison of Delta Lake, Iceberg and Hudi
A Thorough Comparison of Delta Lake, Iceberg and Hudi
 
From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.From cache to in-memory data grid. Introduction to Hazelcast.
From cache to in-memory data grid. Introduction to Hazelcast.
 
Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...Tame the small files problem and optimize data layout for streaming ingestion...
Tame the small files problem and optimize data layout for streaming ingestion...
 
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
Apache Pinot Case Study: Building Distributed Analytics Systems Using Apache ...
 
Cosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle ServiceCosco: An Efficient Facebook-Scale Shuffle Service
Cosco: An Efficient Facebook-Scale Shuffle Service
 
Big Data Architecture and Design Patterns
Big Data Architecture and Design PatternsBig Data Architecture and Design Patterns
Big Data Architecture and Design Patterns
 
Productionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices ArchitectureProductionizing Machine Learning with a Microservices Architecture
Productionizing Machine Learning with a Microservices Architecture
 
How Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per dayHow Uber scaled its Real Time Infrastructure to Trillion events per day
How Uber scaled its Real Time Infrastructure to Trillion events per day
 
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational CacheUsing Apache Arrow, Calcite, and Parquet to Build a Relational Cache
Using Apache Arrow, Calcite, and Parquet to Build a Relational Cache
 
The Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization OpportunitiesThe Parquet Format and Performance Optimization Opportunities
The Parquet Format and Performance Optimization Opportunities
 
Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)Stream processing with Apache Flink (Timo Walther - Ververica)
Stream processing with Apache Flink (Timo Walther - Ververica)
 
The delta architecture
The delta architectureThe delta architecture
The delta architecture
 

Andere mochten auch

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Spark Summit
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...Spark Summit
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkDatabricks
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Spark Summit
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Spark Summit
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...Spark Summit
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Spark Summit
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Spark Summit
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...Spark Summit
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Spark Summit
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Spark Summit
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark Summit
 
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Spark Summit
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Spark Summit
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Spark Summit
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Spark Summit
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Spark Summit
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Spark Summit
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Spark Summit
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Spark Summit
 

Andere mochten auch (20)

Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
Fighting Cybercrime: A Joint Task Force of Real-Time Data and Human Analytics...
 
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
BigDL: A Distributed Deep Learning Library on Spark: Spark Summit East talk b...
 
Tuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache SparkTuning and Monitoring Deep Learning on Apache Spark
Tuning and Monitoring Deep Learning on Apache Spark
 
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
Netflix's Recommendation ML Pipeline Using Apache Spark: Spark Summit East ta...
 
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
Horizontally Scalable Relational Databases with Spark: Spark Summit East talk...
 
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
A New “Sparkitecture” for Modernizing your Data Warehouse: Spark Summit East ...
 
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
Going Real-Time: Creating Frequently-Updating Datasets for Personalization: S...
 
Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...Realtime Analytical Query Processing and Predictive Model Building on High Di...
Realtime Analytical Query Processing and Predictive Model Building on High Di...
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
Problem Solving Recipes Learned from Supporting Spark: Spark Summit East talk...
 
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
Analytics at the Real-Time Speed of Business: Spark Summit East talk by Manis...
 
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
Spark SQL: Another 16x Faster After Tungsten: Spark Summit East talk by Brad ...
 
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
Secured (Kerberos-based) Spark Notebook for Data Science: Spark Summit East t...
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
Feature Hashing for Scalable Machine Learning: Spark Summit East talk by Nick...
 
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
Global Empire-Building for Fun and Profit: Spark Summit East talk by Michelle...
 
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
Migrating from Redshift to Spark at Stitch Fix: Spark Summit East talk by Sky...
 
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
Building Deep Learning Powered Big Data: Spark Summit East talk by Jiao Wang ...
 
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
Insights Without Tradeoffs Using Structured Streaming keynote by Michael Armb...
 
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
Bulletproof Jobs: Patterns for Large-Scale Spark Processing: Spark Summit Eas...
 

Ähnlich wie Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

Reliably Measuring Responsiveness
Reliably Measuring ResponsivenessReliably Measuring Responsiveness
Reliably Measuring ResponsivenessNicholas Jansma
 
AI techniques for tourism-oriented applications
AI techniques for tourism-oriented applicationsAI techniques for tourism-oriented applications
AI techniques for tourism-oriented applicationsAmine Bendahmane
 
Your hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the futureYour hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the futureSiteMinder
 
Genn.ai introduction for Buzzwords
Genn.ai introduction for BuzzwordsGenn.ai introduction for Buzzwords
Genn.ai introduction for BuzzwordsTakeshi Nakano
 
Big Data and the Next Best Offer
Big Data and the Next Best OfferBig Data and the Next Best Offer
Big Data and the Next Best OfferMichel Bruley
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews, Inc.
 
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...Srinath Perera
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream Michel Bruley
 
From model to experiment Tel Aviv
From model to experiment Tel AvivFrom model to experiment Tel Aviv
From model to experiment Tel AvivElena Sokolova, PhD
 
SiteMinder. Attract, reach and convert guests globally.
SiteMinder.  Attract, reach and convert guests globally.SiteMinder.  Attract, reach and convert guests globally.
SiteMinder. Attract, reach and convert guests globally.Mark Abramowitz
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEuropean Data Forum
 
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic FoxMicroservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic FoxOpenCredo
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetupPredicSis
 
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...Amazon Web Services
 

Ähnlich wie Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries (15)

Reliably Measuring Responsiveness
Reliably Measuring ResponsivenessReliably Measuring Responsiveness
Reliably Measuring Responsiveness
 
AI techniques for tourism-oriented applications
AI techniques for tourism-oriented applicationsAI techniques for tourism-oriented applications
AI techniques for tourism-oriented applications
 
Your hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the futureYour hotel's direct booking strategy: Targeting the guests of the future
Your hotel's direct booking strategy: Targeting the guests of the future
 
Genn.ai introduction for Buzzwords
Genn.ai introduction for BuzzwordsGenn.ai introduction for Buzzwords
Genn.ai introduction for Buzzwords
 
Big Data and the Next Best Offer
Big Data and the Next Best OfferBig Data and the Next Best Offer
Big Data and the Next Best Offer
 
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
SmartNews TechNight Vol.5 : AD Data Engineering in practice: SmartNews Ads裏のデ...
 
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
View, Act, and React: Shaping Business Activity with Analytics, BigData Queri...
 
Web log & clickstream
Web log & clickstream Web log & clickstream
Web log & clickstream
 
From model to experiment Tel Aviv
From model to experiment Tel AvivFrom model to experiment Tel Aviv
From model to experiment Tel Aviv
 
SiteMinder. Attract, reach and convert guests globally.
SiteMinder.  Attract, reach and convert guests globally.SiteMinder.  Attract, reach and convert guests globally.
SiteMinder. Attract, reach and convert guests globally.
 
EDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko GrobelnikEDF2013: Big Data Tutorial: Marko Grobelnik
EDF2013: Big Data Tutorial: Marko Grobelnik
 
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic FoxMicroservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
Microservices Manchester: Concursus - Event Sourcing Evolved By Domonic Fox
 
Feature surfacing - meetup
Feature surfacing  - meetupFeature surfacing  - meetup
Feature surfacing - meetup
 
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
BDA 301 An Introduction to Amazon Rekognition, for Deep Learning-based Comput...
 
Big Data Tutorial V4
Big Data Tutorial V4Big Data Tutorial V4
Big Data Tutorial V4
 

Mehr von Spark Summit

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang Spark Summit
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...Spark Summit
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang WuSpark Summit
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya RaghavendraSpark Summit
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...Spark Summit
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...Spark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingSpark Summit
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...Spark Summit
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakSpark Summit
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimSpark Summit
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraSpark Summit
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Spark Summit
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...Spark Summit
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spark Summit
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovSpark Summit
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Spark Summit
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkSpark Summit
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Spark Summit
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...Spark Summit
 

Mehr von Spark Summit (20)

FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
FPGA-Based Acceleration Architecture for Spark SQL Qi Xie and Quanfu Wang
 
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
VEGAS: The Missing Matplotlib for Scala/Apache Spark with DB Tsai and Roger M...
 
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang WuApache Spark Structured Streaming Helps Smart Manufacturing with  Xiaochang Wu
Apache Spark Structured Streaming Helps Smart Manufacturing with Xiaochang Wu
 
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data  with Ramya RaghavendraImproving Traffic Prediction Using Weather Data  with Ramya Raghavendra
Improving Traffic Prediction Using Weather Data with Ramya Raghavendra
 
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
A Tale of Two Graph Frameworks on Spark: GraphFrames and Tinkerpop OLAP Artem...
 
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark Marcin ...
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
Apache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim DowlingApache Spark and Tensorflow as a Service with Jim Dowling
Apache Spark and Tensorflow as a Service with Jim Dowling
 
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
MMLSpark: Lessons from Building a SparkML-Compatible Machine Learning Library...
 
Next CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub WozniakNext CERN Accelerator Logging Service with Jakub Wozniak
Next CERN Accelerator Logging Service with Jakub Wozniak
 
Powering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin KimPowering a Startup with Apache Spark with Kevin Kim
Powering a Startup with Apache Spark with Kevin Kim
 
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya RaghavendraImproving Traffic Prediction Using Weather Datawith Ramya Raghavendra
Improving Traffic Prediction Using Weather Datawith Ramya Raghavendra
 
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
Hiding Apache Spark Complexity for Fast Prototyping of Big Data Applications—...
 
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...How Nielsen Utilized Databricks for Large-Scale Research and Development with...
How Nielsen Utilized Databricks for Large-Scale Research and Development with...
 
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
Spline: Apache Spark Lineage not Only for the Banking Industry with Marek Nov...
 
Goal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim SimeonovGoal Based Data Production with Sim Simeonov
Goal Based Data Production with Sim Simeonov
 
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
Preventing Revenue Leakage and Monitoring Distributed Systems with Machine Le...
 
Getting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir VolkGetting Ready to Use Redis with Apache Spark with Dvir Volk
Getting Ready to Use Redis with Apache Spark with Dvir Volk
 
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
Deduplication and Author-Disambiguation of Streaming Records via Supervised M...
 
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
MatFast: In-Memory Distributed Matrix Computation Processing and Optimization...
 

Kürzlich hochgeladen

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanMYRABACSAFRA2
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...ssuserf63bd7
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in collegessuser7a7cd61
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...Florian Roscheck
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxBoston Institute of Analytics
 

Kürzlich hochgeladen (20)

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024Generative AI for Social Good at Open Data Science East 2024
Generative AI for Social Good at Open Data Science East 2024
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
Identifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population MeanIdentifying Appropriate Test Statistics Involving Population Mean
Identifying Appropriate Test Statistics Involving Population Mean
 
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
Statistics, Data Analysis, and Decision Modeling, 5th edition by James R. Eva...
 
While-For-loop in python used in college
While-For-loop in python used in collegeWhile-For-loop in python used in college
While-For-loop in python used in college
 
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...From idea to production in a day – Leveraging Azure ML and Streamlit to build...
From idea to production in a day – Leveraging Azure ML and Streamlit to build...
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Top 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In QueensTop 5 Best Data Analytics Courses In Queens
Top 5 Best Data Analytics Courses In Queens
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptxNLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
NLP Project PPT: Flipkart Product Reviews through NLP Data Science.pptx
 

Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software: Spark Summit East talk by Kees Jan de Vries

  • 1. Building a Real-Time Fraud Prevention Engine Using Open Source (Big Data) Software Kees Jan de Vries Booking.com
  • 2. Who am I? About me • Physics PhD Imperial College • Data Scientist at Booking.com for 1.5 year ▸ Security Department • linkedin.com/in/kees-jan-de-vries-93767240 About Booking.com • World leader in connecting travellers with the widest variety of great places to stay • Part of The Priceline Group, the world’s 3rd largest e-commerce company (by market capitalisation) • Employing 14,000 people in 180 countries • Each day, over 1,200,000 room nights are reserved on Booking.com
  • 3. Contents. • Motivation • Running Example ▸ Probability to Book • Real Time Prediction Engine: Lessons Learnt ▸ Aggregate Features ▸ Models Training and Deployment ▸ Interpretation of Individual Scores
  • 6. Motivation. booking.com Booking.com • Empower people to experience the world Security • Provide secure reservation system (facilitating hundreds of thousands of reservations on a daily basis) Fraud Prevention • Protect customers and property partners, ensuring the best possible experience
  • 7. Motivation for this Talk. Train awesome Model Serve in Real Time at Low Latency using aggregate features
  • 9. Disclaimer: although the running example presented in these slides may seem realistic, it is only intended to highlight some lessons learnt about building a real time prediction engine.
  • 10. Probability to Book. Let’s book a hotel for the Spark Summit East 2017 in Boston
  • 11. Probability to Book. OK. Let’s click on the first hit.
  • 12. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  • 13. Probability to Book. Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice. Let’s help the customer make the best choice for them!
  • 14. Probability to Book. We’ll calculate the probability of booking when the Customer clicks on an accommodation Nice. Here’s all the information I need. But maybe I’ll browse a few more, to make the best choice.
  • 15. Behind the Scenes. User id Page id ... Action PredictionEngine Click!
  • 16. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Click!
  • 17. Behind the Scenes. Labels • Booked? Yes/No Features • Simple ▸ Time of day ▸ ... • Profile ▸ User ▸ Circumstantial Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Click!
  • 19. Behind the Scenes. Features Model Action User id ... Click! Aggregate features “ ” Action
  • 21. Aggregate Features. Time Requests Data Streams Click Booking
  • 22. Aggregate Features. Time Requests Data Streams Click Booking Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 23. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests)
  • 24. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant
  • 25. Aggregate Features. Users Time ... Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Requests Clicks (Requests) Clicks coincide with requests, so aggregation is preferably instant Small time window, so limited amount of data
  • 26. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM cliks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... In memory Complex Event Processing • No lag: instant aggregation • Scalability: see Esper website • Not persistent
  • 27. Aggregate Features. SELECT COUNT(DISTINCT hotel_id) FROM cliks.win:time(30 min) GROUP BY user_id Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...From http://espertech.com/esper/faq_esper.php#streamprocessing “Complex Event Processing and Esper are standing queries and latency to the answer is usually below 10us with more than 99% predictability.” “The first component of scaling is the throughput that can be achieved running single-threaded. For Esper we think this number is very high and likely between 10k to 200k events per second.” In memory Complex Event Processing
  • 28. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 29. Aggregate Features. Users Time ... Requests Bookings Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Very High Cardinality: features for every user who made at least one booking
  • 30. Aggregate Features High Cardinality Features with Cassandra • Very scalable: reads & writes • None-instant aggregations ▸ Consistency fundamentally bound by gossip protocol • Persistent ( ) Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 31. Aggregate Features High Cardinality Features with Cassandra ( ) Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... From http://cassandra.apache.org/ ● Proven ● Fault Tolerant ● Performant ● Scalable ● Elastic ● ... “Some of the largest production deployments include Apple's, with over 75,000 nodes storing over 10 PB of data, Netflix (2,500 nodes, 420 TB, over 1 trillion requests per day), …”
  • 32. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 33. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels
  • 34. Aggregate Features. HotelPages Time ... Requests Bookings/Clicks Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... Medium Cardinality: limited by number of hotels Larger lag is ok with time window of 3 months and large number of events per page.
  • 35. Aggregate Features SELECT page_id , COUNT(*) AS res_count FROM reservations GROUP BY page_id Low to Medium Cardinality • No need to go “fancy” • Batch processing Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 37. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 38. Aggregate Features for Training. Time Requests Data Streams Click Reservation Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ...
  • 39. Aggregate Features for Training. Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id
  • 40. Aggregate Features for Training. Aggregate • User ▸ # (distinct) hotels viewed in last 30 minutes ▸ # bookings so far ▸ ... • Hotel Page ▸ % booking per page view last 3 months ▸ ... SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id Warning: stragglers ahead! Distribute wisely!
  • 41. Aggregate Features for Training. SELECT request.request_id , COUNT DISTINCT event.page_id FROM request JOIN event ON request.user_id = event.user_id WHERE event.epoch BETWEEN request.epoch AND request.epoch + 30*60 GROUP BY request.request_id Scalable Technologies • Simple SQL ▸ Hive • Facing stragglers ▸ Spark
  • 42. Model Training & Iteration y X Train Model Predict YES Predict NO Actual YES TP FN Actual NO FP TN Investigate Evaluate • Features ▸ Engineering ▸ Selection • Hyper parameters Improve Model
  • 43. Sorry: H2O? The benchmark shown on this slide can be found in: http://h2o-release.s3.amazonaws.com/h2o/rel-lambert/5/docs-website/resources/h2odatasheet.html More benchmarks: https://github.com/szilard/benchm-ml
  • 44. Time to Deploy model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/")
  • 45. Time to Deploy. model = H2ORandomForestEstimator(ntrees=50, max_depth=10) model.train(x=inputCols, y="label", training_frame=trainingDf, validation_frame=validationDf) model.download_pojo("/my/model/path/") AWESOME!!! POJO: Plain Old Java Object. Model export in plain Java code for real- time scoring on the JVM. Very fast!
  • 47. Score Interpretation. feature 15 > 9 feature 2 > 0.56 feature 4 > 200 yes no yes no yes no feature 11 > 6 yes no Logistic Regression Decision Trees feature 1 feature 2+ 13.9 x + …x = -5.12 x Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests Contribution Feature i: i xi p0 Contribution Feature i: pnode - pparent node p2 p1 p3 p4 p6 p7 p5 p8 Example: p3 = p0 + (p1 -p0 ) + (p3 -p1 ) p0 : bias (p1 -p0 ): feature 15 contr. (p3 -p1 ): feature 2 contr.
  • 48. Score Interpretation. feature 15 > 9 feature 2 > 0.56 feature 4 > 200 yes no yes no yes no feature 11 > 6 yes no Logistic Regression Decision Trees feature 1 feature 2+ 13.9 x + …x = -5.12 x Decision tree interpretation based on: http://blog.datadive.net/interpreting-random-forests Contribution Feature i: i xi p0 Contribution Feature i: pnode - pparent node p2 p1 p3 p4 p6 p7 p5 p8 We hacked this interpretation into Spark ML during a Hackathon at Booking:) H2O does not offer it (yet?).
  • 49. Probability to Book. The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  • 50. Probability to Book. Awesome! This app really helps me book! The probability of booking is high; largest contribution from #distinct property pages. Let’s show a summary!
  • 52. Summary. • How to make Aggregate Features available? ▸ Long/short time windows ▸ High and Low cardinality dimensions • Model Training and Deployment ▸ H2O POJO for fast Real-Time scoring • Interpretation ▸ Contributions per Feature per Score
  • 53. Thank You. Questions? We’re hiring! Kees Jan de Vries from Booking.com linkedin.com/in/kees-jan-de-vries-93767240