SlideShare ist ein Scribd-Unternehmen logo
1 von 33
Downloaden Sie, um offline zu lesen
Deploying MLlib for Scoring in
Structured Streaming
Joseph Bradley
June 5, 2018
Spark + AI Summit
About me
Joseph Bradley
• Software engineer at Databricks
• Apache Spark committer & PMC member
TEAM
About Databricks
Started Spark project (now Apache Spark) at UC Berkeley in 2009
PRODUCT
Unified Analytics Platform
MISSION
Making Big Data Simple
Try for free today.
databricks.com
App: monitoring web sessions for bots
Web Activity Logs
Compute Features
Kill User’s Login
Session
Run Prediction
API Check
Cached
Predictions
streaming web app
App: monitoring web sessions for bots
Web Activity Logs
Kill User’s Login
Session
Compute Features Run Prediction
API Check
Cached
Predictions
streaming web app
Productionizing Machine Learning
Data Science / ML
Prediction Servers
models results
Serialize
Deserialize
Make
predictions
End Users
Challenge: teams & environments
Data Science / ML
Prediction Servers
models results
Serialize
Deserialize
Make
predictions
End Users
Challenge: featurization logic
Data Science / ML
Prediction Servers
models results
Serialize
Deserialize
Make
predictions
Feature
Logic
↓
Feature
Logic
↓
Feature
Logic
↓
Model
End Users
Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
ML Pipelines in Apache Spark
Original
dataset
11
Text Label
I bought the game... 4
Do NOT bother try... 1
this shirt is aweso... 5
never got it. Seller... 1
I ordered this to... 3
ML Pipelines: featurization
12
Feature
extraction
Original
dataset
Text Label Words Features
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...]
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...]
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...]
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...]
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
ML Pipelines: model
13
Text Label Words Features Prediction Probability
I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8
Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6
this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9
never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7
I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7
Feature
extraction
Original
dataset
Predictive
model
ML Pipelines: successes
• Apache Spark integration simplifies
• Deployment
• ETL
• Integration into complete analytics pipelines with SQL (& streaming!)
• Scalability & speed
• Pipelines for featurization, modeling & tuning
ML Pipelines: adoption
• 1000s of commits
• 100s of contributors
• 10,000s of users
(on Databricks alone)
• Many production use cases
Structured Streaming
One single API Dataset / DataFrame for batch & streaming
End-to-end exactly-once guarantees
• The guarantees extend into the sources/sinks, e.g. MySQL, S3
Understands external event-time
• Handles late-arriving data
• Supports sessionization based on event time
Challenges in productionizing ML
Sharing models across teams
and across systems & environments
while maintaining identical behavior
both now and in the future
ML Pipeline
Persistence
Apache Spark
deployments
Featurization in
Pipelines
Backwards
compatibility
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
2-pass Transformers
Algorithmic pattern
• Scan data to collect stats
• Collect stats to driver
• Scan data to apply transform
(using stats)
VectorAssembler
• Find lengths of Vector cols
• Compute total # features
• Create new Vector column
(of length # features)
Scan-collect-scan pattern fails with Structured Streaming.
Handling invalid values
Invalid values include:
• NaN and null values
• Out-of-bounds values (e.g., for Bucketizer)
• Incorrect Vector lengths (e.g., for VectorAssembler)
Robust deployments must handle invalid data.
ML Pipelines use the handleInvalid Param
with options “skip” / “keep” / “error”
— but have only partial coverage.
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
Most Transformers & Models “just work”
As of Apache Spark 2.3, batch & streaming scoring/transform
are basically identical:
• PipelineModel.transform() works on Streaming
Datasets and DataFrames.
• New unit test framework covers batch & streaming tests.
Fixes & tests tracked in SPARK-21926 & SPARK-22644.
Fixes for 2-pass Transformers
VectorAssembler
• Assemble multiple columns
into 1 feature Vector
• Needs lengths of Vector
columns
• Extract from metadata (added
by, e.g., OneHotEncoder)
• Compute from data
Fails with Structured
Streaming
VectorSizeHint
• Manually adds Vector
length to column metadata
• Required only for
Structured Streaming
Fixes for 2-pass Transformers
OneHotEncoder
• Transform categorical
column to 0/1 Vector
• Needs # categories:
• Extract from metadata (added
by, e.g., StringIndexer)
• Compute from data
OneHotEncoderEstimator
• fit() stores categories for
use in transform()
• Match behavior at training &
test time
Bug if train & test data have
different categories (state)
Handling invalid values
Improvements in Spark 2.3
• VectorIndexer, StringIndexer, OneHotEncoderEstimator
• Bucketizer, QuantileDiscretizer
• RFormula
• Most coverage handles NaN. Some handles null.
Fixes targeted for Spark 2.4
• VectorAssembler
• RFormula: Pass handleInvalid to all sub-stages
Demo: Streaming Scoring in 2.3
In this talk
Our toolkit: ML Pipelines & Structured Streaming
Issues in Apache Spark 2.2
Fixes in Apache Spark 2.3
Tips & resources
Cheat sheet: fixing your Pipeline to work
with Structured Streaming
• Update uses of OneHotEncoder, VectorAssembler.
(RFormula should be OK).
• Check how invalid values are handled.
• Beware using handleInvalid=“skip”, which drops invalid Rows.
• Test!
• In custom logic (custom SQL, Transformers, Models),
beware of 2-pass Transformers (hidden state).
Remaining work
• Locality Sensitive Hashing (LSH) Models do not work
(SPARK-24465)
• Require Spark SQL to support nested UDTs (SPARK-12878)
• VectorAssemblerEstimator: nicer API than VectorSizeHint
(SPARK-24467)
• Handling invalid values
• Expanded support
• Better defaults for handleInvalid Param
Beyond this talk
This talk:
Deployment
in streaming
Deployment
outside of
Spark
Deployment
in batch jobs
Model
management
Feature
management
Experiment
management
Monitoring A/B testing Serving APIs
Resources
Overview of productionizing Apache Spark ML models
Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize-
your-machine-learning-models
Batch scoring
Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and-
loading-pipelines
Streaming scoring
Guide and example notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and-
stuctured-streaming.html
Sub-second scoring
Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing-
apache-spark-mllib-models-for-real-time-prediction-serving
https://databricks.com/careers
Thank You!
Questions?
Shout out to Bago Amirbekian,
Weichen Xu, and to the many
other contributors on this work.
Office hours today @ 3:50pm at
Databricks booth

Weitere ähnliche Inhalte

Was ist angesagt?

SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015Lance Co Ting Keh
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentDataWorks Summit
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...Databricks
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model ServingDatabricks
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...Databricks
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformDatabricks
 
PandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesPandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesDatabricks
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....Databricks
 
Degrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files SyndromeDegrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files SyndromeDatabricks
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...Spark Summit
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleDatabricks
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkDatabricks
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionDatabricks
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment Databricks
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Databricks
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksAnyscale
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Databricks
 
Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh LalwaniDatabricks
 
Structured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at AppleStructured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at AppleDatabricks
 

Was ist angesagt? (20)

SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015SparkApplicationDevMadeEasy_Spark_Summit_2015
SparkApplicationDevMadeEasy_Spark_Summit_2015
 
Semi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial EnvironmentSemi-Supervised Learning In An Adversarial Environment
Semi-Supervised Learning In An Adversarial Environment
 
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
A Predictive Analytics Workflow on DICOM Images using Apache Spark with Anahi...
 
MLflow Model Serving
MLflow Model ServingMLflow Model Serving
MLflow Model Serving
 
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & Deep Learning ...
 
Using Databricks as an Analysis Platform
Using Databricks as an Analysis PlatformUsing Databricks as an Analysis Platform
Using Databricks as an Analysis Platform
 
PandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled EnsemblesPandasUDFs: One Weird Trick to Scaled Ensembles
PandasUDFs: One Weird Trick to Scaled Ensembles
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Degrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files SyndromeDegrading Performance? You Might be Suffering From the Small Files Syndrome
Degrading Performance? You Might be Suffering From the Small Files Syndrome
 
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk b...
 
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life CycleMLflow: Infrastructure for a Complete Machine Learning Life Cycle
MLflow: Infrastructure for a Complete Machine Learning Life Cycle
 
Extending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySparkExtending Machine Learning Algorithms with PySpark
Extending Machine Learning Algorithms with PySpark
 
Data Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet EncryptionData Security at Scale through Spark and Parquet Encryption
Data Security at Scale through Spark and Parquet Encryption
 
Apache Spark Model Deployment
Apache Spark Model Deployment Apache Spark Model Deployment
Apache Spark Model Deployment
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
Simplify Distributed TensorFlow Training for Fast Image Categorization at Sta...
 
Jump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on DatabricksJump Start with Apache Spark 2.0 on Databricks
Jump Start with Apache Spark 2.0 on Databricks
 
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
Deploying Python Machine Learning Models with Apache Spark with Brandon Hamri...
 
Whirlpools in the Stream with Jayesh Lalwani
 Whirlpools in the Stream with Jayesh Lalwani Whirlpools in the Stream with Jayesh Lalwani
Whirlpools in the Stream with Jayesh Lalwani
 
Structured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at AppleStructured Streaming Use-Cases at Apple
Structured Streaming Use-Cases at Apple
 

Ähnlich wie Deploying MLlib for Scoring in Structured Streaming

Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache SparkDatabricks
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16BigMine
 
Foundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache SparkFoundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache SparkDatabricks
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...Jose Quesada (hiring)
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at TwitterAlex Payne
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Libraryjeykottalam
 
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScyllaDB
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
 
JavaScript Robotics
JavaScript RoboticsJavaScript Robotics
JavaScript RoboticsAnna Gerber
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitDatabricks
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkDataWorks Summit/Hadoop Summit
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++Mike Acton
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...PAPIs.io
 
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...sparktc
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossAndrew Flatters
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark MLHolden Karau
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkDatabricks
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionDatabricks
 
Build Java Web Application Using Apache Struts
Build Java Web Application Using Apache Struts Build Java Web Application Using Apache Struts
Build Java Web Application Using Apache Struts weili_at_slideshare
 

Ähnlich wie Deploying MLlib for Scoring in Structured Streaming (20)

Distributed ML in Apache Spark
Distributed ML in Apache SparkDistributed ML in Apache Spark
Distributed ML in Apache Spark
 
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
Foundations for Scaling ML in Apache Spark by Joseph Bradley at BigMine16
 
Foundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache SparkFoundations for Scaling ML in Apache Spark
Foundations for Scaling ML in Apache Spark
 
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
A full Machine learning pipeline in Scikit-learn vs in scala-Spark: pros and ...
 
The Why and How of Scala at Twitter
The Why and How of Scala at TwitterThe Why and How of Scala at Twitter
The Why and How of Scala at Twitter
 
MLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning LibraryMLlib: Spark's Machine Learning Library
MLlib: Spark's Machine Learning Library
 
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and BeyondScaling Up Machine Learning Experimentation at Tubi 5x and Beyond
Scaling Up Machine Learning Experimentation at Tubi 5x and Beyond
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
JavaScript Robotics
JavaScript RoboticsJavaScript Robotics
JavaScript Robotics
 
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML ToolkitAugmenting Machine Learning with Databricks Labs AutoML Toolkit
Augmenting Machine Learning with Databricks Labs AutoML Toolkit
 
Combining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache SparkCombining Machine Learning frameworks with Apache Spark
Combining Machine Learning frameworks with Apache Spark
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
Building machine learning service in your business — Eric Chen (Uber) @PAPIs ...
 
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
Creating an end-to-end Recommender System with Apache Spark and Elasticsearch...
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
 
Introduction to Spark ML
Introduction to Spark MLIntroduction to Spark ML
Introduction to Spark ML
 
Combining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache SparkCombining Machine Learning Frameworks with Apache Spark
Combining Machine Learning Frameworks with Apache Spark
 
Apache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and ProductionApache Spark MLlib 2.0 Preview: Data Science and Production
Apache Spark MLlib 2.0 Preview: Data Science and Production
 
Build Java Web Application Using Apache Struts
Build Java Web Application Using Apache Struts Build Java Web Application Using Apache Struts
Build Java Web Application Using Apache Struts
 

Mehr von Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Mehr von Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized Platform
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data Science
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML Monitoring
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI Integration
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorch
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on Kubernetes
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature Aggregations
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and Spark
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction Queries
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache Spark
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta Lake
 

Kürzlich hochgeladen

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Jack DiGiovanna
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一F sss
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhijennyeacort
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxMike Bennett
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort servicejennyeacort
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfgstagge
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Cantervoginip
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...dajasot375
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一fhwihughh
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsVICTOR MAESTRE RAMIREZ
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理e4aez8ss
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch
 

Kürzlich hochgeladen (20)

Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
Building on a FAIRly Strong Foundation to Connect Academic Research to Transl...
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝DelhiRS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
RS 9000 Call In girls Dwarka Mor (DELHI)⇛9711147426🔝Delhi
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Semantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptxSemantic Shed - Squashing and Squeezing.pptx
Semantic Shed - Squashing and Squeezing.pptx
 
Heart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis ProjectHeart Disease Classification Report: A Data Analysis Project
Heart Disease Classification Report: A Data Analysis Project
 
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
9711147426✨Call In girls Gurgaon Sector 31. SCO 25 escort service
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
RadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdfRadioAdProWritingCinderellabyButleri.pdf
RadioAdProWritingCinderellabyButleri.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
Indian Call Girls in Abu Dhabi O5286O24O8 Call Girls in Abu Dhabi By Independ...
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
办理学位证纽约大学毕业证(NYU毕业证书)原版一比一
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
Advanced Machine Learning for Business Professionals
Advanced Machine Learning for Business ProfessionalsAdvanced Machine Learning for Business Professionals
Advanced Machine Learning for Business Professionals
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]GA4 Without Cookies [Measure Camp AMS]
GA4 Without Cookies [Measure Camp AMS]
 

Deploying MLlib for Scoring in Structured Streaming

  • 1. Deploying MLlib for Scoring in Structured Streaming Joseph Bradley June 5, 2018 Spark + AI Summit
  • 2. About me Joseph Bradley • Software engineer at Databricks • Apache Spark committer & PMC member
  • 3. TEAM About Databricks Started Spark project (now Apache Spark) at UC Berkeley in 2009 PRODUCT Unified Analytics Platform MISSION Making Big Data Simple Try for free today. databricks.com
  • 4. App: monitoring web sessions for bots Web Activity Logs Compute Features Kill User’s Login Session Run Prediction API Check Cached Predictions streaming web app
  • 5. App: monitoring web sessions for bots Web Activity Logs Kill User’s Login Session Compute Features Run Prediction API Check Cached Predictions streaming web app
  • 6. Productionizing Machine Learning Data Science / ML Prediction Servers models results Serialize Deserialize Make predictions End Users
  • 7. Challenge: teams & environments Data Science / ML Prediction Servers models results Serialize Deserialize Make predictions End Users
  • 8. Challenge: featurization logic Data Science / ML Prediction Servers models results Serialize Deserialize Make predictions Feature Logic ↓ Feature Logic ↓ Feature Logic ↓ Model End Users
  • 9. Challenges in productionizing ML Sharing models across teams and across systems & environments while maintaining identical behavior both now and in the future
  • 10. In this talk Our toolkit: ML Pipelines & Structured Streaming Issues in Apache Spark 2.2 Fixes in Apache Spark 2.3 Tips & resources
  • 11. ML Pipelines in Apache Spark Original dataset 11 Text Label I bought the game... 4 Do NOT bother try... 1 this shirt is aweso... 5 never got it. Seller... 1 I ordered this to... 3
  • 12. ML Pipelines: featurization 12 Feature extraction Original dataset Text Label Words Features I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...]
  • 13. ML Pipelines: model 13 Text Label Words Features Prediction Probability I bought the game... 4 “i", “bought”,... [1, 0, 3, 9, ...] 4 0.8 Do NOT bother try... 1 “do”, “not”,... [0, 0, 11, 0, ...] 2 0.6 this shirt is aweso... 5 “this”, “shirt” [0, 2, 3, 1, ...] 5 0.9 never got it. Seller... 1 “never”, “got” [1, 2, 0, 0, ...] 1 0.7 I ordered this to... 3 “i”, “ordered” [1, 0, 0, 3, ...] 4 0.7 Feature extraction Original dataset Predictive model
  • 14. ML Pipelines: successes • Apache Spark integration simplifies • Deployment • ETL • Integration into complete analytics pipelines with SQL (& streaming!) • Scalability & speed • Pipelines for featurization, modeling & tuning
  • 15. ML Pipelines: adoption • 1000s of commits • 100s of contributors • 10,000s of users (on Databricks alone) • Many production use cases
  • 16. Structured Streaming One single API Dataset / DataFrame for batch & streaming End-to-end exactly-once guarantees • The guarantees extend into the sources/sinks, e.g. MySQL, S3 Understands external event-time • Handles late-arriving data • Supports sessionization based on event time
  • 17. Challenges in productionizing ML Sharing models across teams and across systems & environments while maintaining identical behavior both now and in the future ML Pipeline Persistence Apache Spark deployments Featurization in Pipelines Backwards compatibility
  • 18. In this talk Our toolkit: ML Pipelines & Structured Streaming Issues in Apache Spark 2.2 Fixes in Apache Spark 2.3 Tips & resources
  • 19. 2-pass Transformers Algorithmic pattern • Scan data to collect stats • Collect stats to driver • Scan data to apply transform (using stats) VectorAssembler • Find lengths of Vector cols • Compute total # features • Create new Vector column (of length # features) Scan-collect-scan pattern fails with Structured Streaming.
  • 20. Handling invalid values Invalid values include: • NaN and null values • Out-of-bounds values (e.g., for Bucketizer) • Incorrect Vector lengths (e.g., for VectorAssembler) Robust deployments must handle invalid data. ML Pipelines use the handleInvalid Param with options “skip” / “keep” / “error” — but have only partial coverage.
  • 21. In this talk Our toolkit: ML Pipelines & Structured Streaming Issues in Apache Spark 2.2 Fixes in Apache Spark 2.3 Tips & resources
  • 22. Most Transformers & Models “just work” As of Apache Spark 2.3, batch & streaming scoring/transform are basically identical: • PipelineModel.transform() works on Streaming Datasets and DataFrames. • New unit test framework covers batch & streaming tests. Fixes & tests tracked in SPARK-21926 & SPARK-22644.
  • 23. Fixes for 2-pass Transformers VectorAssembler • Assemble multiple columns into 1 feature Vector • Needs lengths of Vector columns • Extract from metadata (added by, e.g., OneHotEncoder) • Compute from data Fails with Structured Streaming VectorSizeHint • Manually adds Vector length to column metadata • Required only for Structured Streaming
  • 24. Fixes for 2-pass Transformers OneHotEncoder • Transform categorical column to 0/1 Vector • Needs # categories: • Extract from metadata (added by, e.g., StringIndexer) • Compute from data OneHotEncoderEstimator • fit() stores categories for use in transform() • Match behavior at training & test time Bug if train & test data have different categories (state)
  • 25. Handling invalid values Improvements in Spark 2.3 • VectorIndexer, StringIndexer, OneHotEncoderEstimator • Bucketizer, QuantileDiscretizer • RFormula • Most coverage handles NaN. Some handles null. Fixes targeted for Spark 2.4 • VectorAssembler • RFormula: Pass handleInvalid to all sub-stages
  • 27. In this talk Our toolkit: ML Pipelines & Structured Streaming Issues in Apache Spark 2.2 Fixes in Apache Spark 2.3 Tips & resources
  • 28. Cheat sheet: fixing your Pipeline to work with Structured Streaming • Update uses of OneHotEncoder, VectorAssembler. (RFormula should be OK). • Check how invalid values are handled. • Beware using handleInvalid=“skip”, which drops invalid Rows. • Test! • In custom logic (custom SQL, Transformers, Models), beware of 2-pass Transformers (hidden state).
  • 29. Remaining work • Locality Sensitive Hashing (LSH) Models do not work (SPARK-24465) • Require Spark SQL to support nested UDTs (SPARK-12878) • VectorAssemblerEstimator: nicer API than VectorSizeHint (SPARK-24467) • Handling invalid values • Expanded support • Better defaults for handleInvalid Param
  • 30. Beyond this talk This talk: Deployment in streaming Deployment outside of Spark Deployment in batch jobs Model management Feature management Experiment management Monitoring A/B testing Serving APIs
  • 31. Resources Overview of productionizing Apache Spark ML models Webinar with Richard Garris: http://go.databricks.com/apache-spark-mllib-2.x-how-to-productionize- your-machine-learning-models Batch scoring Apache Spark docs: https://spark.apache.org/docs/latest/ml-pipeline.html#ml-persistence-saving-and- loading-pipelines Streaming scoring Guide and example notebook: https://docs.databricks.com/spark/latest/mllib/mllib-pipelines-and- stuctured-streaming.html Sub-second scoring Webinar with Sue Ann Hong: https://www.brighttalk.com/webcast/12891/268455/productionizing- apache-spark-mllib-models-for-real-time-prediction-serving
  • 33. Thank You! Questions? Shout out to Bago Amirbekian, Weichen Xu, and to the many other contributors on this work. Office hours today @ 3:50pm at Databricks booth