This document summarizes a talk on deploying machine learning models for scoring in streaming applications with Apache Spark. It shows how ML Pipelines and Structured Streaming can be combined to build an application that monitors web sessions for bots in real time. In Spark 2.2, two issues blocked this: two-pass transformers and incomplete handling of invalid data. Spark 2.3 includes fixes that allow most Transformers and Models to work for both batch and streaming scoring, and improves handling of invalid values. The talk closes with tips on updating pipelines to work with streaming and on testing them.
3. About Databricks
TEAM: Started the Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT: Unified Analytics Platform
MISSION: Making Big Data Simple
Try for free today: databricks.com
4. App: monitoring web sessions for bots
[Architecture diagram: a streaming web app reads Web Activity Logs, computes features, runs a prediction, and stores Cached Predictions; an API Check against the cached predictions can Kill a User's Login Session.]
5. App: monitoring web sessions for bots
[Same diagram, highlighting the two ML steps this talk focuses on: Compute Features and Run Prediction.]
7. Challenge: teams & environments
[Diagram: the Data Science / ML team serializes models; Prediction Servers deserialize them and make predictions, returning results to End Users.]
8. Challenge: featurization logic
[Same diagram, with featurization added: a chain of Feature Logic steps feeds the Model, and every one of those steps must be replicated identically on both the Data Science / ML side and the Prediction Servers.]
9. Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future
10. In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
11. ML Pipelines in Apache Spark
The original dataset:

  Text                     Label
  I bought the game...     4
  Do NOT bother try...     1
  this shirt is aweso...   5
  never got it. Seller...  1
  I ordered this to...     3
12. ML Pipelines: featurization
Feature extraction, applied to the original dataset:

  Text                     Label   Words                Features
  I bought the game...     4       "i", "bought", ...   [1, 0, 3, 9, ...]
  Do NOT bother try...     1       "do", "not", ...     [0, 0, 11, 0, ...]
  this shirt is aweso...   5       "this", "shirt"      [0, 2, 3, 1, ...]
  never got it. Seller...  1       "never", "got"       [1, 2, 0, 0, ...]
  I ordered this to...     3       "i", "ordered"       [1, 0, 0, 3, ...]
13. ML Pipelines: model
The predictive model then adds Prediction and Probability columns on top of feature extraction:

  Text                     Label   Words                Features             Prediction   Probability
  I bought the game...     4       "i", "bought", ...   [1, 0, 3, 9, ...]    4            0.8
  Do NOT bother try...     1       "do", "not", ...     [0, 0, 11, 0, ...]   2            0.6
  this shirt is aweso...   5       "this", "shirt"      [0, 2, 3, 1, ...]    5            0.9
  never got it. Seller...  1       "never", "got"       [1, 2, 0, 0, ...]    1            0.7
  I ordered this to...     3       "i", "ordered"       [1, 0, 0, 3, ...]    4            0.7
14. ML Pipelines: successes
• Apache Spark integration simplifies:
  • Deployment
  • ETL
  • Integration into complete analytics pipelines with SQL (& streaming!)
• Scalability & speed
• Pipelines for featurization, modeling & tuning
15. ML Pipelines: adoption
• 1000s of commits
• 100s of contributors
• 10,000s of users
(on Databricks alone)
• Many production use cases
16. Structured Streaming
A single API (Dataset / DataFrame) for both batch & streaming
End-to-end exactly-once guarantees
• The guarantees extend into the sources/sinks, e.g. MySQL, S3
Understands external event-time
• Handles late-arriving data
• Supports sessionization based on event time
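These properties can be seen in a minimal streaming query. A sketch, assuming a hypothetical JSON source at `/logs/web-activity` with `eventTime` and `userId` columns and a predefined `eventSchema`:

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// Batch uses spark.read; streaming uses spark.readStream --
// the DataFrame operations afterwards are identical.
val events = spark.readStream
  .schema(eventSchema)            // streams require an explicit schema
  .json("/logs/web-activity")

// Event-time windowing; the watermark bounds how late data may arrive
val counts = events
  .withWatermark("eventTime", "10 minutes")
  .groupBy(window($"eventTime", "5 minutes"), $"userId")
  .count()

counts.writeStream
  .format("memory").queryName("counts")
  .outputMode("update")
  .start()
```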
17. Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future
Addressed in this talk by: ML Pipeline persistence · Apache Spark deployments · featurization inside Pipelines · backwards compatibility.
18. In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
19. 2-pass Transformers
Algorithmic pattern:
• Scan the data to collect stats
• Collect the stats to the driver
• Scan the data again to apply the transform (using the stats)

Example: VectorAssembler
• Find the lengths of the Vector columns
• Compute the total # of features
• Create a new Vector column (of length # features)
Scan-collect-scan pattern fails with Structured Streaming.
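The failure is easy to reproduce: the first "scan" is an eager action, which streaming DataFrames reject. A sketch, assuming hypothetical `batchDF` / `streamingDF` with a Vector column `userFeatures`:

```scala
import org.apache.spark.ml.linalg.Vector

// Batch: an eager action can scan the data to discover a Vector's length...
val size = batchDF.select("userFeatures").head.getAs[Vector](0).size

// ...but the same action on a streaming DataFrame fails at analysis time:
// streamingDF.select("userFeatures").head
//   => AnalysisException: Queries with streaming sources must be
//      executed with writeStream.start()
```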
20. Handling invalid values
Invalid values include:
• NaN and null values
• Out-of-bounds values (e.g., for Bucketizer)
• Incorrect Vector lengths (e.g., for VectorAssembler)
Robust deployments must handle invalid data.
ML Pipelines use the handleInvalid Param,
with options "skip" / "keep" / "error",
but in Spark 2.2 its coverage is only partial.
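As a concrete example, Bucketizer exposes the Param directly. A sketch, with hypothetical column names:

```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("amount")
  .setOutputCol("amountBucket")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 100.0, Double.PositiveInfinity))
  // "error" (the default) throws on NaN; "skip" drops the row;
  // "keep" places invalid values into an extra bucket
  .setHandleInvalid("keep")
```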
21. In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
22. Most Transformers & Models “just work”
As of Apache Spark 2.3, batch & streaming scoring/transform
are basically identical:
• PipelineModel.transform() works on Streaming
Datasets and DataFrames.
• New unit test framework covers batch & streaming tests.
Fixes & tests tracked in SPARK-21926 & SPARK-22644.
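"Just works" means the streaming scoring path is the same `transform()` call as in batch. A sketch, assuming a hypothetical saved model path, Parquet source, and `sessionSchema`:

```scala
import org.apache.spark.ml.PipelineModel

// Load a model that was trained & saved in a batch job
val model = PipelineModel.load("/models/bot-detector")

val sessions = spark.readStream
  .schema(sessionSchema)
  .parquet("/logs/sessions")

// Identical to batch scoring: no streaming-specific API on the model
val predictions = model.transform(sessions)

predictions.writeStream
  .format("parquet")
  .option("checkpointLocation", "/checkpoints/bot-detector")
  .start("/output/predictions")
```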
23. Fixes for 2-pass Transformers
VectorAssembler
• Assembles multiple columns into one feature Vector
• Needs the lengths of the Vector columns, either:
  • extracted from column metadata (added by, e.g., OneHotEncoder), or
  • computed from the data, which fails with Structured Streaming

VectorSizeHint
• Manually adds the Vector length to column metadata
• Required only for Structured Streaming
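In practice, a VectorSizeHint stage is placed before the VectorAssembler in the Pipeline. A sketch, with hypothetical column names and size:

```scala
import org.apache.spark.ml.feature.{VectorAssembler, VectorSizeHint}

// Declare each Vector input's size up front, so VectorAssembler
// never needs a data scan to discover it
val sizeHint = new VectorSizeHint()
  .setInputCol("userFeatures")
  .setSize(3)
  .setHandleInvalid("error")   // fail fast on vectors of the wrong length

val assembler = new VectorAssembler()
  .setInputCols(Array("hourOfDay", "userFeatures"))
  .setOutputCol("features")

// Pipeline stages: ..., sizeHint, assembler, ...
```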
24. Fixes for 2-pass Transformers
OneHotEncoder
• Transforms a categorical column into a 0/1 Vector
• Needs the # of categories, either:
  • extracted from column metadata (added by, e.g., StringIndexer), or
  • computed from the data (hidden state), which is a bug if train & test data have different categories

OneHotEncoderEstimator
• fit() stores the categories for use in transform()
• Matches behavior at training & test time
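The Estimator version follows the standard fit/transform split. A sketch, assuming hypothetical `trainingDF` / `testDF` with an indexed `categoryIndex` column:

```scala
import org.apache.spark.ml.feature.OneHotEncoderEstimator

val encoder = new OneHotEncoderEstimator()
  .setInputCols(Array("categoryIndex"))
  .setOutputCols(Array("categoryVec"))
  .setHandleInvalid("keep")   // categories unseen at fit() time get an extra slot

// fit() records the number of categories; transform() always uses that
// fixed size, so training & scoring behave identically
val encoderModel = encoder.fit(trainingDF)
val encoded = encoderModel.transform(testDF)
```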
25. Handling invalid values
Improvements in Spark 2.3
• VectorIndexer, StringIndexer, OneHotEncoderEstimator
• Bucketizer, QuantileDiscretizer
• RFormula
• Most coverage handles NaN. Some handles null.
Fixes targeted for Spark 2.4
• VectorAssembler
• RFormula: Pass handleInvalid to all sub-stages
27. In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
28. Cheat sheet: fixing your Pipeline to work
with Structured Streaming
• Update uses of OneHotEncoder and VectorAssembler
(RFormula should be OK).
• Check how invalid values are handled.
• Beware handleInvalid="skip", which silently drops invalid Rows.
• Test!
• In custom logic (custom SQL, Transformers, Models),
beware of 2-pass Transformers (hidden state).
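One way to "Test!" is to score the same rows in batch and as a replayed stream, then compare. A sketch, assuming a hypothetical `model`, a small `testDF` with an `id` column, and a scratch path:

```scala
// Batch scoring as the reference result
val batchResult = model.transform(testDF).select("id", "prediction")

// Replay the same rows as a stream from files
testDF.write.mode("overwrite").parquet("/tmp/test-input")
val streamIn = spark.readStream.schema(testDF.schema).parquet("/tmp/test-input")

val query = model.transform(streamIn)
  .select("id", "prediction")
  .writeStream
  .format("memory").queryName("streamResult")
  .outputMode("append")
  .start()
query.processAllAvailable()   // block until all test rows are scored

// Join on id and assert the prediction columns match row-for-row
val streamResult = spark.table("streamResult")
assert(batchResult.join(streamResult, "id")
  .filter(batchResult("prediction") =!= streamResult("prediction"))
  .isEmpty)
```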
29. Remaining work
• Locality Sensitive Hashing (LSH) Models do not work (SPARK-24465)
  • Requires Spark SQL support for nested UDTs (SPARK-12878)
• VectorAssemblerEstimator: a nicer API than VectorSizeHint (SPARK-24467)
• Handling invalid values
• Expanded support
• Better defaults for handleInvalid Param
30. Beyond this talk
This talk covers deployment in streaming. Related concerns beyond its scope: deployment outside of Spark, deployment in batch jobs, model management, feature management, experiment management, monitoring, A/B testing, and serving APIs.
31. Resources
Overview of productionizing Apache Spark ML models
Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-your-machine-learning-models

Batch scoring
Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-loading-pipelines

Streaming scoring
Guide and example notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-stuctured-streaming.html

Sub-second scoring
Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-apache-spark-mllib-models-for-real-time-prediction-serving
33. Thank You!
Questions?
Shout out to Bago Amirbekian,
Weichen Xu, and to the many
other contributors on this work.
Office hours today @ 3:50pm at
Databricks booth