Spark Summit Europe 2017
Applying Multiple ML Pipelines to heterogeneous data streams
This talk explains how we adapted Spark MLlib to deploy hundreds of ML pipelines in a single streaming job that makes real-time predictions on heterogeneous data streams.
9. Spark Transformers
High level interface for all operations on Datasets
Act as stages in Pipelines
Require various configuration parameters: inputColumn, outputColumn, data, …
Dataset is the only input/output format
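The Transformer contract described on this slide can be sketched without Spark. The plain-Python example below is only an illustration, not Spark's API: `MiniTransformer`, `MiniPipeline`, and the list-of-dicts `Dataset` are hypothetical stand-ins. It shows a stage configured with an input and output column, taking a dataset in and returning a dataset out, and acting as a stage in a pipeline.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, object]
Dataset = List[Row]  # stand-in for a Spark Dataset; NOT the real type


@dataclass
class MiniTransformer:
    """Hypothetical stage: reads input_col, writes output_col."""
    input_col: str
    output_col: str
    fn: Callable[[object], object]

    def transform(self, dataset: Dataset) -> Dataset:
        # Return a new dataset; the input rows are left untouched.
        return [{**row, self.output_col: self.fn(row[self.input_col])}
                for row in dataset]


@dataclass
class MiniPipeline:
    """Hypothetical pipeline: applies its stages in order."""
    stages: List[MiniTransformer]

    def transform(self, dataset: Dataset) -> Dataset:
        for stage in self.stages:
            dataset = stage.transform(dataset)
        return dataset


# Toy usage: tokenize a text column, then count the tokens.
pipeline = MiniPipeline([
    MiniTransformer("text", "tokens", lambda s: s.lower().split()),
    MiniTransformer("tokens", "n_tokens", len),
])
result = pipeline.transform([{"text": "Hello Spark Summit"}])
```

Because every stage consumes and produces the same dataset type, stages compose freely, but the whole dataset always flows through the same chain of stages.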
14. Possible (deployment) solutions
Serialisation
✓ Ability to use any streaming technology
x Increased complexity
x Potentially increased deployment latency
Spark ML Pipelines
✓ Simplicity - one technology for training and evaluation
x Support only for the “one model scenario”
15. Heterogeneous datasets in Spark
Spark Pipelines can be applied only to Datasets
Spark Pipelines cannot be combined into a composite pipeline that is applied on a Pipeline-per-row basis
Questions:
How to apply different Pipeline(s) to each row in a Dataset?
How to identify which Pipelines should be applied to a row?
27. Summary
Applying Pipelines to heterogeneous data streams is hard.
Spark Pipelines cannot be combined into a composite pipeline and applied on a Pipeline-per-row basis
Our solution:
Extend Transformer API to apply different Pipeline(s) for each row
Extend Streams to include Pipeline composition
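As a rough illustration of the per-row idea, here is a minimal plain-Python sketch. It is not the authors' implementation and uses no Spark APIs. It assumes each row carries a hypothetical `pipeline_ids` column naming the pipelines to apply, and that fitted pipelines live in a hypothetical `registry` keyed by ID. The toy scoring functions and data are made up for the example.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
# Hypothetical: one fitted pipeline, applied to a single row.
RowPipeline = Callable[[Row], Row]


def apply_per_row(rows: List[Row],
                  registry: Dict[str, RowPipeline],
                  key_col: str = "pipeline_ids") -> List[Row]:
    """Apply to each row every pipeline that its key column names."""
    out = []
    for row in rows:
        result = dict(row)  # copy so the input row is not mutated
        for pid in row.get(key_col, []):
            pipeline = registry.get(pid)
            if pipeline is None:
                continue  # unknown ID: skip rather than fail the stream
            result = pipeline(result)
        out.append(result)
    return out


# Toy registry: two stand-in "pipelines" that each add one column.
registry = {
    "churn": lambda r: {**r, "churn_score": 0.1 * r["visits"]},
    "lead": lambda r: {**r, "is_lead": r["visits"] > 3},
}
rows = [
    {"customer": "a", "visits": 5, "pipeline_ids": ["churn", "lead"]},
    {"customer": "b", "visits": 2, "pipeline_ids": ["lead"]},
]
scored = apply_per_row(rows, registry)
```

Rows for different customers share one job but end up with different output columns, which is what a single composite Pipeline over the whole Dataset cannot express. Skipping unknown IDs is a design choice made for this sketch; a real job might log or reject such rows instead.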
macdab@altocloud.com & gevorg@altocloud.com