@s_kontopoulos
Machine Learning at Scale: Challenges and Solutions
Stavros Kontopoulos
Senior Software Engineer @ Lightbend, M.Sc.
Who am I?
- Senior Software Engineer @ Lightbend, Fast Data Team
- Contributor at Apache Flink
- Online handles: skonto, s_kontopoulos, stavroskontopoulos (SlideShare)
All trademarks and registered trademarks are property of their respective holders.
Agenda
- ML in the Enterprise
- ML from development to production
- Key technologies: Apache Spark as a case study
ML in the Enterprise
ML is a key tool that fuels the effort of coupling business monitoring (BI) with
predictive and prescriptive analytics.
business insights -> business optimization -> data monetization
ML in the Enterprise - The Data-Science LifeCycle
Identify Business Question
Identify and collect related Data
Data cleansing, feature extraction (Data pre-processing)
Experiment planning
Model Building
Model Evaluation
Model Deployment/Management in Production
Model Optimization - Performance
Machine Learning Model
A model is a function that maps inputs to outputs and essentially expresses a
mathematical abstraction.
Linear Regression:
Neural Network:
Random Forest:
Function composition
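As a minimal sketch of the idea that a model is a function, and that larger models are built by function composition, the plain-Python example below defines a linear regression and composes it with a non-linearity to get a one-neuron "network". The helper names (`linear_regression`, `compose`, `relu`) and the weights are illustrative assumptions, not taken from the slides.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], float]


def linear_regression(weights: Vector, bias: float) -> Model:
    """Return a model computing y = w . x + b."""
    def predict(x: Vector) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict


def compose(*stages: Callable) -> Callable:
    """Chain functions left to right: compose(f, g)(x) == g(f(x))."""
    def run(x):
        for stage in stages:
            x = stage(x)
        return x
    return run


def relu(z: float) -> float:
    return max(0.0, z)


# A one-neuron "network": a linear map followed by a non-linearity.
# The weights and bias are made-up values for illustration.
neuron = compose(linear_regression([2.0, -1.0], 0.5), relu)

print(neuron([1.0, 1.0]))  # 2 - 1 + 0.5 = 1.5
print(neuron([0.0, 3.0]))  # -3 + 0.5 = -2.5, clipped to 0.0
```

Because every stage is just a function, the same `compose` helper could stack several layers or wrap a model with pre- and post-processing.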
Model Evolution
- Models can be either pre-computed (e.g. trained off-line) or updated on-line.
- Online ML with streaming:
- Pure online learning uses only the latest data point to update the model. In practice, though, models are usually updated per batch/window (e.g. online k-means).
- An interesting case is when we sample the stream and train a model only when the distribution changes.
- Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling
- Re-train the model from scratch, ignoring the previous one.
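The "pure online" case can be sketched as one SGD step per arriving data point. The example below is a hedged illustration in plain Python: the synthetic stream (`y = 3x + 1`, no noise), the learning rate and the function names are assumptions made for the sketch, not part of the talk.

```python
import random


def sgd_step(weights, bias, x, y, lr=0.05):
    """One pure-online update for squared loss on a single (x, y) point."""
    error = sum(w * xi for w, xi in zip(weights, x)) + bias - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return new_weights, bias - lr * error


def stream(n, seed=42):
    """Hypothetical data stream generated from y = 3x + 1 (no noise)."""
    rng = random.Random(seed)
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        yield [x], 3.0 * x + 1.0


weights, bias = [0.0], 0.0
for x, y in stream(5000):
    # The model only ever sees the latest point; nothing is stored.
    weights, bias = sgd_step(weights, bias, x, y)

print(weights, bias)  # approaches [3.0] and 1.0
```

Updating per batch or window, as the slide mentions, would amount to averaging the gradient over the points of the window before applying the step.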
Machine Learning Pipeline
A machine learning pipeline in production describes all the steps from data
pre-processing, before the data is fed to the model, to processing of the
model output (post-processing).
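A pipeline of this kind can be sketched as an ordered list of steps applied to every input. The example below is a toy illustration only: the `Pipeline` class, the scaling bounds, the fixed-weight "model" and the decision threshold are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Pipeline:
    """Runs a fixed sequence of named steps on every input."""
    steps: List[Tuple[str, Callable]]

    def run(self, value):
        for _name, step in self.steps:
            value = step(value)
        return value


def scale(features):
    # Pre-processing: scale with hypothetical bounds 0..10.
    return [f / 10.0 for f in features]


def score(features):
    # Stand-in model: a fixed-weight linear score.
    return 0.8 * features[0] + 0.2 * features[1]


def to_label(value):
    # Post-processing: turn the raw score into a decision.
    return "accept" if value >= 0.5 else "reject"


pipeline = Pipeline([("scale", scale), ("model", score), ("label", to_label)])

print(pipeline.run([9.0, 2.0]))  # accept
print(pipeline.run([1.0, 4.0]))  # reject
```

Keeping the steps in one object means training, testing and serving can all call the same `run`, so the data always goes through identical transformations.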
Machine Learning Pipeline in Libraries
Pros:
- Training data and test data go through the same steps
- As with a CI (continuous integration) pipeline, people can reason about the
data transformations
- Caching of computations
- Model serving is easier
Multiple Models in a Pipeline
Within the same pipeline it is also possible to run multiple models:
a) Model Segmentation
b) Model Ensemble
c) Model Chaining
d) Model Composition
http://dmg.org/pmml/v4-1/MultipleModels.html
http://dl.acm.org/citation.cfm?id=1859403
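Three of these patterns can be sketched as higher-order functions over models. This is a plain-Python illustration with made-up stand-in models; it does not follow the PMML schema, and the helper names are assumptions.

```python
from statistics import mean


def ensemble(models):
    """Model ensemble: average the predictions of all models."""
    return lambda x: mean(m(x) for m in models)


def chain(first, second):
    """Model chaining: the output of one model is the input of the next."""
    return lambda x: second(first(x))


def segment(selector, models_by_segment):
    """Model segmentation: pick one model based on a property of the input."""
    return lambda x: models_by_segment[selector(x)](x)


# Hypothetical stand-in models on a single numeric input.
def double(x):
    return 2.0 * x


def plus_one(x):
    return x + 1.0


print(ensemble([double, plus_one])(3.0))  # mean(6.0, 4.0) = 5.0
print(chain(double, plus_one)(3.0))       # 2 * 3 + 1 = 7.0

by_sign = segment(lambda x: "neg" if x < 0 else "pos",
                  {"neg": double, "pos": plus_one})
print(by_sign(-2.0), by_sign(2.0))        # -4.0 3.0
```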
Model Development & Production
[Figure: Data Scientist and Data Engineer]
Model Standardization
[Figure: an ML framework exports a model definition, which is then imported for evaluation; data goes in and predictions come out.]
PFA: Portable Format for Analytics
Model Standardization
- PFA or PMML won’t break the pipeline. PFA is more flexible than PMML.
“Unlike PMML, PFA has control structures to direct program flow, a true type system for both
model parameters and data, and its statistical functions are much more finely grained and can
accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/)
- Custom model definitions and implementations are more flexible or more
optimized but could break the pipeline.
- Some Implementations:
- https://github.com/jpmml/jpmml-evaluator-spark
- https://github.com/jpmml
- https://github.com/opendatagroup/hadrian
Model Lifecycle
Some concerns about model lifecycle:
- Model evolution
- Model release practices
- Model versioning
- Model update process
Model Governance
A model should be:
● governed by the company's policies and procedures, laws and regulations,
and the organization's goals
● searchable across the company
● transparent, explainable, traceable and interpretable for auditors and
regulators. Example GDPR requirements:
https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/
● subject to an approval and release process
Model Server
“A model server is a system which handles the lifecycle of a model and provides
the required APIs for deploying a model/pipeline.”
Examples: Clipper (image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/) and TensorFlow Serving (image: https://www.tensorflow.org/serving/)
Model Serving - Requirements
Other requirements:
- Response time: the time to calculate a prediction. Could be a few milliseconds.
- Throughput: predictions per second.
- Support for running multiple models. It is very common to run hundreds of models,
e.g. a telecom operator with one model per customer, or an IoT deployment with one
model per site/sensor.
Model Serving - Requirements
- Multiple versions of the same machine learning pipeline within the system.
One reason can be A/B testing.
- Model update: how quickly and easily can a model be updated?
- Uptime/reliability
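The versioning and A/B testing requirements can be illustrated with a tiny in-memory registry. This is a hedged sketch, not the API of any real model server: `ModelRegistry`, its methods, the hashing scheme and the two stand-in models are all hypothetical.

```python
import hashlib


class ModelRegistry:
    """Keeps several versions of a model and splits traffic between two."""

    def __init__(self):
        self._versions = {}

    def deploy(self, version, model):
        # Adding or replacing a version does not require a restart.
        self._versions[version] = model

    def predict(self, version, features):
        return self._versions[version](features)

    def ab_predict(self, request_id, features, a, b, share_b=0.5):
        """Route a stable share of requests to version b."""
        digest = hashlib.sha256(request_id.encode("utf-8")).digest()
        bucket = digest[0] / 256.0  # deterministic value in [0, 1)
        chosen = b if bucket < share_b else a
        return chosen, self.predict(chosen, features)


registry = ModelRegistry()
registry.deploy("v1", lambda f: sum(f))
registry.deploy("v2", lambda f: sum(f) * 1.1)

version, prediction = registry.ab_predict("user-17", [1.0, 2.0], "v1", "v2")
print(version, prediction)
```

Hashing the request id, rather than drawing a random number, keeps each caller on the same version across requests, which makes A/B results easier to compare.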
TensorFlow Serving Issues
Not all systems cover these requirements. For example:
● Metadata not available: https://github.com/tensorflow/serving/issues/612
● No new models at runtime: https://github.com/tensorflow/serving/issues/422
● Can be hard to build from scratch: https://github.com/tensorflow/serving/issues/327
Model Serving with Apache Flink
Apache Flink: a streaming engine based on the Beam model, with low latency
compared to Spark.
Model Serving with Apache Flink
Idea: exploit Flink's low-latency capabilities for serving models. Focus on offline-trained
models that are loaded from permanent storage and updated without interruption.
FLIP proposal:
https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8
Combines different efforts: https://github.com/FlinkML
● https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/)
● https://github.com/FlinkML/flink-modelServer (Boris Lublinsky)
● https://github.com/FlinkML/flink-tensorflow (Eron Wright)
Model Serving with Apache Flink
Use a control stream and a data stream. Keep the model in the operator's state and join the two streams.
Flink provides two ways of implementing low-level joins: a key-based join based on CoProcessFunction and a
partition-based join based on RichCoFlatMapFunction.
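The control-stream/data-stream pattern can be simulated without Flink. The sketch below is plain Python and does not use the Flink API: the class and method names are hypothetical, and it assumes the control stream carries new models while the data stream carries records to score.

```python
class ModelServingOperator:
    """Mimics a two-input operator that keeps the current model in state."""

    def __init__(self):
        self.model = None  # operator state: the latest model

    def process_control(self, new_model):
        # Control stream element: swap the model without stopping the job.
        self.model = new_model
        return []

    def process_data(self, record):
        # Data stream element: score with whatever model is current.
        if self.model is None:
            return []  # no model yet, so emit nothing
        return [self.model(record)]


def run(operator, events):
    """Feed interleaved ("control" | "data", payload) events in order."""
    out = []
    for kind, payload in events:
        handler = (operator.process_control if kind == "control"
                   else operator.process_data)
        out.extend(handler(payload))
    return out


events = [
    ("data", 1.0),                      # dropped: no model loaded yet
    ("control", lambda x: x * 2.0),     # first model
    ("data", 2.0),
    ("control", lambda x: x + 100.0),   # updated model
    ("data", 3.0),
]
print(run(ModelServingOperator(), events))  # [4.0, 103.0]
```

In a real job the state would be managed and checkpointed by the engine, and a key-based join would keep one such model per key.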
Model Serving with Apache Flink
More here:
https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
Data Lakes
How can we work with data so as to cover future needs and use cases? We need a
robust ML framework plus flexible infrastructure. Data warehouses will not work;
data lakes come to the rescue.
“A data lake is a method of storing data within a system or repository, in its natural
format, that facilitates the collocation of data in various schemata and structural
forms, usually object blobs or files.”
- Wikipedia
Data Lakes
● Agility. It can be seen as a tool that makes data accessible to different users
and facilitates ML.
● Designed for low-cost storage
● Schema on read
● Security and governance still maturing.
Data Lake Issues
“Through 2018, 80% of data lakes will not include effective metadata management
capabilities, making them inefficient.”
- Gartner
Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM
Watson Platform etc.
Notebooks
Very convenient for the data scientist or the analyst.
Production, however, is usually based on traditional deployment methods.
- Spark Notebook
- Apache Zeppelin
- Jupyter
ML with Apache Spark
“A popular big data framework for ML and data-science.”
- You can work locally and move to production fast
- ETL/Feature Engineering
- Hyper-parameter tuning
- Rich Model support
- Multiple language support (Scala, Java, Python, R)
Apache Spark - Intro
A framework for distributed in-memory data processing.
Apache Spark - Intro
- The user defines computations/operations (map, flatMap, etc.) on the data sets
(bounded or not) as a DAG.
- The DAG is shipped to the nodes where the data lies, the computation is executed
there, and the results are sent back to the user.
- The data sets are treated as immutable distributed data: a Resilient Distributed
Dataset (RDD) is an immutable distributed collection of objects.
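The idea of recording operations first and executing them later can be sketched in a few lines of plain Python. This is not Spark's API: `LazyDataset` is a hypothetical single-process class, and it models a linear chain of steps rather than a general DAG.

```python
from itertools import chain


class LazyDataset:
    """Records transformations and only runs them when collect() is called."""

    def __init__(self, source, ops=()):
        self._source = source
        self._ops = tuple(ops)  # the lineage: an immutable chain of steps

    def map(self, fn):
        # Returns a new dataset; the original is never modified.
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def flat_map(self, fn):
        return LazyDataset(self._source, self._ops + (("flat_map", fn),))

    def collect(self):
        data = iter(self._source)
        for kind, fn in self._ops:
            if kind == "map":
                data = map(fn, data)
            else:
                data = chain.from_iterable(map(fn, data))
        return list(data)


lines = LazyDataset(["a b", "c"])
words = lines.flat_map(str.split).map(str.upper)  # nothing runs yet

print(words.collect())  # ['A', 'B', 'C']
print(lines.collect())  # ['a b', 'c'] (the source dataset is unchanged)
```

Because each dataset only stores its source and the list of steps, a lost result can be recomputed by replaying those steps, which is the intuition behind the "resilient" in RDD.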
Apache Spark - Basic Example in Scala
Basic statistics: a hello world for ML.
Apache Spark - Intro
There are three APIs: RDD, DataFrames, Datasets
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html

                  RDD        DataFrames (SQL)   Datasets
Syntax Errors     Runtime    Compile Time       Compile Time
Analysis Errors   Runtime    Runtime            Compile Time
Apache Spark - Intro
“Datasets support encoders which allow to map semi-structured formats (eg
JSON) to constructs of type safe languages (Scala, Java). Also they have better
performance compared to java serialization or kryo.”
MLlib
A library for machine learning on top of Spark. It has two APIs:
- RDD-based (spark.mllib).
- Dataset/DataFrame-based (spark.ml).
The latter is relatively new and makes it easier to construct an ML pipeline or run an
algorithm. The former is older and has more features.
MLlib
“As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered
maintenance mode.”
What are the implications?
● MLlib will still support the RDD-based API in spark.mllib with bug fixes.
● MLlib will not add new features to the RDD-based API.
● In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
● After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated.
● The RDD-based API is expected to be removed in Spark 3.0.
MLlib
Supports different categories of ML algorithms:
● Basic statistics (correlations etc)
● Pipelines (LSH, TF-IDF)
● Extracting, transforming and selecting features
● Classification and Regression (Random forests, Gradient boosted trees)
● Clustering (K-means, LDA, etc)
● Collaborative filtering
● Frequent Pattern Mining
● Model selection and tuning
Use cases it enables include fraud detection, recommendation engines, ...
MLlib Local
A new package is available for production use of the algorithms without the need
of Spark itself. How about PMML vs this method?
https://issues.apache.org/jira/browse/SPARK-13944
https://issues.apache.org/jira/browse/SPARK-16365
MLlib - Unsupervised Learning Example
Our data set: https://www.kaggle.com/danielpanizzo/wine-quality/data
Describes wine quality along different dimensions, such as chlorides, sugar, etc.
We will apply k-means to identify different clusters of wine quality.
Both the mllib and ml implementations are available as Spark notebooks.
Normalize Data -> K-means -> PCA -> Visualize
MLlib - Unsupervised Learning Example
parse data
train k-means with different k
MLlib - Unsupervised Learning Example
Counting errors for elbow method
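The elbow method can be sketched without Spark: train k-means for several values of k and look at where the within-set sum of squared errors (WSSSE) stops dropping sharply. The example below is plain Python on a made-up nine-point data set, not the wine data, and the deterministic farthest-point initialisation is an assumption chosen to keep the sketch reproducible.

```python
def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def init_centers(points, k):
    """Deterministic farthest-point initialisation (a choice of this sketch)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iterations=20):
    centers = init_centers(points, k)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if empty.
        centers = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers


def wssse(points, centers):
    """Within-set sum of squared errors: the quantity plotted for the elbow."""
    return sum(min(sq_dist(p, c) for c in centers) for p in points)


# Three well-separated groups of three points each (made-up data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),
          (20.0, 0.0), (20.0, 1.0), (21.0, 0.0)]

costs = {k: wssse(points, kmeans(points, k)) for k in range(1, 5)}
for k, cost in costs.items():
    print(k, round(cost, 2))
```

On this data the error falls steeply up to k=3 and barely improves afterwards, so the "elbow" sits at k=3, matching the three groups.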
MLlib - Unsupervised Learning Example
PCA analysis to verify k-means with k=2
MLlib - Unsupervised Learning Example
PCA K=2
MLlib - Unsupervised Learning Example
Available with the mllib implementation
Spark Deep Learning Pipelines
- People know SQL
- Models are productized as SQL UDFs.
Predictions as a SQL statement:
SELECT my_custom_keras_model_udf(image) as predictions from my_spark_image_table
https://github.com/databricks/spark-deep-learning
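The "model as a SQL UDF" idea can be tried out locally with Python's built-in sqlite3 module, which also lets a Python function be called from SQL. This is only an analogy to the Spark approach above: the table, the column names and the stand-in "model" (an aspect-ratio rule) are assumptions made for the sketch.

```python
import sqlite3


def my_model_udf(width, height):
    """Stand-in model: classify an image by its aspect ratio."""
    return "landscape" if width >= height else "portrait"


conn = sqlite3.connect(":memory:")
# Register the Python function so SQL statements can call it by name.
conn.create_function("my_model_udf", 2, my_model_udf)
conn.execute("CREATE TABLE images (name TEXT, width INTEGER, height INTEGER)")
conn.executemany(
    "INSERT INTO images VALUES (?, ?, ?)",
    [("a.png", 640, 480), ("b.png", 300, 900)],
)

rows = conn.execute(
    "SELECT name, my_model_udf(width, height) AS prediction "
    "FROM images ORDER BY name"
).fetchall()
print(rows)  # [('a.png', 'landscape'), ('b.png', 'portrait')]
conn.close()
```

The appeal is the same as on the slide: anyone who knows SQL can request predictions without touching the model code.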
BigDL
● Developed by Intel.
● It does not use GPUs; it is optimized for Intel processors.
“It is orders of magnitude faster than out-of-box open source Caffe, Torch or
TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).”
● It is implemented as a standalone package on Spark.
● Can be used with existing Spark or Hadoop clusters.
● High-performance powered by Intel MKL and multi-threaded programming.
● Easily scaled-out
● Appropriate for users who are not DL experts.
BigDL
● Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and
testing machine learning models.
● Many useful features: loss functions, layer support, etc.
● Implements a parameter server for distributed training of DL models
● Supports visualization via TensorBoard:
https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
BigDL in practice
For a cool example of using BigDL on Mesos, check our blog:
http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/
Thank you! Questions?
https://github.com/skonto/talks/blob/master/big-data-italy-2017/ml/references.md
48

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (19)

What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0What’s New in the Upcoming Apache Spark 3.0
What’s New in the Upcoming Apache Spark 3.0
 
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
Vectorized Deep Learning Acceleration from Preprocessing to Inference and Tra...
 
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
Data Science for Dummies - Data Engineering with Titanic dataset + Databricks...
 
Scaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with DatabricksScaling and Modernizing Data Platform with Databricks
Scaling and Modernizing Data Platform with Databricks
 
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...Fully Automated QA System For Large Scale Search And Recommendation Engines U...
Fully Automated QA System For Large Scale Search And Recommendation Engines U...
 
Practical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on HadoopPractical Distributed Machine Learning Pipelines on Hadoop
Practical Distributed Machine Learning Pipelines on Hadoop
 
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
Taming the Search: A Practical Way of Enforcing GDPR and CCPA in Very Large D...
 
Building Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta LakesBuilding Data Intensive Analytic Application on Top of Delta Lakes
Building Data Intensive Analytic Application on Top of Delta Lakes
 
Jethro data meetup index base sql on hadoop - oct-2014
Jethro data meetup    index base sql on hadoop - oct-2014Jethro data meetup    index base sql on hadoop - oct-2014
Jethro data meetup index base sql on hadoop - oct-2014
 
Building Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous DataBuilding Identity Graphs over Heterogeneous Data
Building Identity Graphs over Heterogeneous Data
 
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark StreamingTiny Batches, in the wine: Shiny New Bits in Spark Streaming
Tiny Batches, in the wine: Shiny New Bits in Spark Streaming
 
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
Apache Spark for Machine Learning with High Dimensional Labels: Spark Summit ...
 
Pandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySparkPandas UDF: Scalable Analysis with Python and PySpark
Pandas UDF: Scalable Analysis with Python and PySpark
 
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-PlatformDelight: An Improved Apache Spark UI, Free, and Cross-Platform
Delight: An Improved Apache Spark UI, Free, and Cross-Platform
 
Geospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache SparkGeospatial Analytics at Scale with Deep Learning and Apache Spark
Geospatial Analytics at Scale with Deep Learning and Apache Spark
 
Data Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflowData Versioning and Reproducible ML with DVC and MLflow
Data Versioning and Reproducible ML with DVC and MLflow
 
Simplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta LakeSimplify and Scale Data Engineering Pipelines with Delta Lake
Simplify and Scale Data Engineering Pipelines with Delta Lake
 
Scaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache SparkScaling Machine Learning with Apache Spark
Scaling Machine Learning with Apache Spark
 
Big Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely headingBig Data is changing abruptly, and where it is likely heading
Big Data is changing abruptly, and where it is likely heading
 

Ähnlich wie Machine learning at scale challenges and solutions

Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
Peyman Mohajerian
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
Gianmario Spacagna
 

Ähnlich wie Machine learning at scale challenges and solutions (20)

ArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning MetadataArangoML Pipeline Cloud - Managed Machine Learning Metadata
ArangoML Pipeline Cloud - Managed Machine Learning Metadata
 
Use Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdfUse Case Patterns for LLM Applications (1).pdf
Use Case Patterns for LLM Applications (1).pdf
 
Media_Entertainment_Veriticals
Media_Entertainment_VeriticalsMedia_Entertainment_Veriticals
Media_Entertainment_Veriticals
 
DevOps for DataScience
DevOps for DataScienceDevOps for DataScience
DevOps for DataScience
 
Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?Is Spark the right choice for data analysis ?
Is Spark the right choice for data analysis ?
 
Notes on Deploying Machine-learning Models at Scale
Notes on Deploying Machine-learning Models at ScaleNotes on Deploying Machine-learning Models at Scale
Notes on Deploying Machine-learning Models at Scale
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Enabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standardsEnabling the digital thread using open OSLC standards
Enabling the digital thread using open OSLC standards
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Tech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning productsTech leaders guide to effective building of machine learning products
Tech leaders guide to effective building of machine learning products
 
Tutorial4
Tutorial4Tutorial4
Tutorial4
 
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's DataFrom Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
From Pandas to Koalas: Reducing Time-To-Insight for Virgin Hyperloop's Data
 
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
How to Productionize Your Machine Learning Models Using Apache Spark MLlib 2....
 
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed DeployedCrossing the Analytics Chasm and Getting the Models You Developed Deployed
Crossing the Analytics Chasm and Getting the Models You Developed Deployed
 
Deploying Data Science Engines to Production
Deploying Data Science Engines to ProductionDeploying Data Science Engines to Production
Deploying Data Science Engines to Production
 
databricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineeringdatabricks ml flow demonstration using automatic features engineering
databricks ml flow demonstration using automatic features engineering
 
Serverless machine learning architectures at Helixa
Serverless machine learning architectures at HelixaServerless machine learning architectures at Helixa
Serverless machine learning architectures at Helixa
 
Fighting Fraud with Apache Spark
Fighting Fraud with Apache SparkFighting Fraud with Apache Spark
Fighting Fraud with Apache Spark
 
Big Data Engineering for Machine Learning
Big Data Engineering for Machine LearningBig Data Engineering for Machine Learning
Big Data Engineering for Machine Learning
 
Use of standards and related issues in predictive analytics
Use of standards and related issues in predictive analyticsUse of standards and related issues in predictive analytics
Use of standards and related issues in predictive analytics
 

Mehr von Stavros Kontopoulos

ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
Stavros Kontopoulos
 

Mehr von Stavros Kontopoulos (10)

Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdfServerless Machine Learning Model Inference on Kubernetes with KServe.pdf
Serverless Machine Learning Model Inference on Kubernetes with KServe.pdf
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
 
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...ML At the Edge:  Building Your Production Pipeline With Apache Spark and Tens...
ML At the Edge: Building Your Production Pipeline With Apache Spark and Tens...
 
Apache Flink London Meetup - Let's Talk ML on Flink
Apache Flink London Meetup - Let's Talk ML on FlinkApache Flink London Meetup - Let's Talk ML on Flink
Apache Flink London Meetup - Let's Talk ML on Flink
 
Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)Spark Summit EU Supporting Spark (Brussels 2016)
Spark Summit EU Supporting Spark (Brussels 2016)
 
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big DataVoxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
Voxxed days thessaloniki 21/10/2016 - Streaming Engines for Big Data
 
Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016Trivento summercamp masterclass 9/9/2016
Trivento summercamp masterclass 9/9/2016
 
Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016Trivento summercamp fast data 9/9/2016
Trivento summercamp fast data 9/9/2016
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Cassandra at Pollfish
Cassandra at PollfishCassandra at Pollfish
Cassandra at Pollfish
 

Kürzlich hochgeladen

%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
masabamasaba
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
masabamasaba
 
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Abortion Pills In Pretoria ](+27832195400*)[ 🏥 Women's Abortion Clinic In Pre...
Medical / Health Care (+971588192166) Mifepristone and Misoprostol tablets 200mg
 
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
%+27788225528 love spells in Knoxville Psychic Readings, Attraction spells,Br...
masabamasaba
 
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
%+27788225528 love spells in Colorado Springs Psychic Readings, Attraction sp...
masabamasaba
 

Kürzlich hochgeladen (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
tonesoftg
tonesoftgtonesoftg
tonesoftg
 
%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare%in Harare+277-882-255-28 abortion pills for sale in Harare
%in Harare+277-882-255-28 abortion pills for sale in Harare
 
%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand%in Midrand+277-882-255-28 abortion pills for sale in midrand
%in Midrand+277-882-255-28 abortion pills for sale in midrand
 
VTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learnVTU technical seminar 8Th Sem on Scikit-learn
VTU technical seminar 8Th Sem on Scikit-learn
 
Architecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the pastArchitecture decision records - How not to get lost in the past
Architecture decision records - How not to get lost in the past
 
What Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the SituationWhat Goes Wrong with Language Definitions and How to Improve the Situation
What Goes Wrong with Language Definitions and How to Improve the Situation
 
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
OpenChain - The Ramifications of ISO/IEC 5230 and ISO/IEC 18974 for Legal Pro...
 
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
%+27788225528 love spells in Boston Psychic Readings, Attraction spells,Bring...
 
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
%in Bahrain+277-882-255-28 abortion pills for sale in Bahrain
 
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
%in Hazyview+277-882-255-28 abortion pills for sale in Hazyview
 
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
%+27788225528 love spells in Huntington Beach Psychic Readings, Attraction sp...
 
WSO2Con2024 - From Code To Cloud: Fast Track Your Cloud Native Journey with C...

Machine learning at scale challenges and solutions

  • 1. Machine Learning at Scale: Challenges and Solutions. Stavros Kontopoulos (@s_kontopoulos), Senior Software Engineer @ Lightbend, M.Sc.
  • 2. Who am I? Senior Software Engineer @ Lightbend, Fast Data Team. Apache Flink contributor. Handles: skonto, s_kontopoulos; SlideShare: stavroskontopoulos. All trademarks and registered trademarks are property of their respective holders.
  • 3. Agenda: ML in the Enterprise; ML from development to production; Key technologies: Apache Spark as a case study.
  • 4. ML in the Enterprise: ML is a key tool that fuels the effort of coupling business monitoring (BI) with predictive and prescriptive analytics. Business insights -> business optimization -> data monetization.
  • 5. ML in the Enterprise - The Data-Science Lifecycle: identify the business question; identify and collect related data; data cleansing and feature extraction (data pre-processing); experiment planning; model building; model evaluation; model deployment/management in production; model optimization (performance).
  • 6. Machine Learning Model: A model is a function that maps inputs to outputs and essentially expresses a mathematical abstraction. Examples listed on the slide (formulas not preserved in this transcript): Linear Regression, Neural Network, Random Forest. Function composition.
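To illustrate the "model is a function" view, here is a minimal Python sketch. It is not taken from the slides, and the names `linear_model`, `scale` and `compose` are hypothetical. It builds a linear regression as a plain function and composes it with a pre-processing step, which is one way to read the "function composition" item.

```python
from typing import Callable, List, Sequence

# A model is just a function from a feature vector to a prediction.
Model = Callable[[Sequence[float]], float]


def linear_model(weights: Sequence[float], bias: float) -> Model:
    """Return the function x -> w . x + b (hypothetical helper)."""
    def predict(x: Sequence[float]) -> float:
        return sum(w * xi for w, xi in zip(weights, x)) + bias
    return predict


def scale(factor: float) -> Callable[[Sequence[float]], List[float]]:
    """A pre-processing step that multiplies every feature by a constant."""
    return lambda x: [factor * xi for xi in x]


def compose(first, second):
    """Function composition: apply `first`, then feed its output to `second`."""
    return lambda x: second(first(x))


model = linear_model([2.0, -1.0], 0.5)
scaled_model = compose(scale(0.5), model)  # pre-processing followed by the model
```

Calling `scaled_model([2.0, 6.0])` gives the same result as `model([1.0, 3.0])`, because the scaling step runs before the model.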
  • 7. Model Evolution: Models can be either pre-computed (e.g. trained offline) or updated online. Online ML with streaming: pure online learning uses only the latest arrived data point to update the model, though in practice models are usually updated per batch/window (e.g. online k-means). An interesting case is when we sample the stream and train a model only when the distribution changes. Adaptive supervised learning: SGD (Stochastic Gradient Descent) + random sampling. Another option is to re-train the model from scratch, ignoring the previous one.
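The "pure online" case can be sketched as one SGD step per arriving data point. This is a minimal Python illustration under stated assumptions: a linear model with squared-error loss, a fixed learning rate, and hypothetical names `sgd_update` and `train_online`. It is not code from the talk.

```python
from typing import Iterable, List, Sequence, Tuple


def sgd_update(weights: List[float], bias: float,
               x: Sequence[float], y: float,
               lr: float = 0.1) -> Tuple[List[float], float]:
    """One SGD step for squared-error loss on a single (x, y) point."""
    prediction = sum(w * xi for w, xi in zip(weights, x)) + bias
    error = prediction - y
    new_weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    new_bias = bias - lr * error
    return new_weights, new_bias


def train_online(stream: Iterable[Tuple[Sequence[float], float]],
                 n_features: int, lr: float = 0.1) -> Tuple[List[float], float]:
    """Update the model with each point as it arrives; no data is stored."""
    weights, bias = [0.0] * n_features, 0.0
    for x, y in stream:
        weights, bias = sgd_update(weights, bias, x, y, lr)
    return weights, bias


# Toy stream generated from y = 2x + 1, replayed 200 times.
points = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0)]
weights, bias = train_online(points * 200, n_features=1)
```

A batch/window variant would average the gradient over the points in each window before applying the step.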
  • 8. Machine Learning Pipeline: A machine learning pipeline in production describes all steps from data pre-processing (before feeding the model) to model output processing (post-processing).
  • 9. Machine Learning Pipeline in Libraries. Pros: training data and test data go through the same steps; like a CI (continuous integration) pipeline, people can reason about the data transformations; caching of computations; easier model serving.
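The first advantage, that training and test data go through the same steps, can be sketched as follows. This is a toy Python illustration with hypothetical classes (`MeanCenter`, `SlopeModel`, `Pipeline`); it is not the API of any particular library.

```python
class MeanCenter:
    """Pre-processing step: subtract the mean learned from the training data."""

    def fit(self, xs):
        self.mean = sum(xs) / len(xs)
        return self

    def transform(self, xs):
        return [x - self.mean for x in xs]


class SlopeModel:
    """Toy model: least-squares line through the origin, y = slope * x."""

    def fit(self, xs, ys):
        self.slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        return self

    def predict(self, xs):
        return [self.slope * x for x in xs]


class Pipeline:
    """Chains pre-processing steps, a model and a post-processing function."""

    def __init__(self, steps, model, postprocess=lambda y: y):
        self.steps, self.model, self.postprocess = steps, model, postprocess

    def _prepare(self, xs, fit):
        # The same steps run at training and prediction time;
        # parameters are only learned when fit=True.
        for step in self.steps:
            if fit:
                step.fit(xs)
            xs = step.transform(xs)
        return xs

    def fit(self, xs, ys):
        self.model.fit(self._prepare(xs, fit=True), ys)
        return self

    def predict(self, xs):
        ys = self.model.predict(self._prepare(xs, fit=False))
        return [self.postprocess(y) for y in ys]


pipeline = Pipeline([MeanCenter()], SlopeModel(), postprocess=lambda y: round(y, 2))
pipeline.fit([1.0, 2.0, 3.0], [-2.0, 0.0, 2.0])
```

Because the mean is stored inside the fitted pipeline, serving only needs the pipeline object: `pipeline.predict([4.0])` centers the input with the training mean before applying the model.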
  • 10. Multiple Models in a Pipeline: Within the same pipeline it is also possible to run multiple models: a) model segmentation, b) model ensemble, c) model chaining, d) model composition. See http://dmg.org/pmml/v4-1/MultipleModels.html and http://dl.acm.org/citation.cfm?id=1859403
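Three of these patterns can be sketched with plain functions. This is a simplified Python reading of the terms, with hypothetical helper names (`segmented`, `ensemble`, `chain`). It assumes an ensemble that averages its members; PMML defines several other combination methods.

```python
def segmented(segments, default):
    """Model segmentation: the first segment whose predicate matches handles the input."""
    def predict(x):
        for applies, model in segments:
            if applies(x):
                return model(x)
        return default(x)
    return predict


def ensemble(models):
    """Model ensemble: average the predictions of all member models."""
    return lambda x: sum(m(x) for m in models) / len(models)


def chain(first, second):
    """Model chaining: the output of the first model is the input of the second."""
    return lambda x: second(first(x))


# Toy "models" used only for illustration.
double = lambda x: 2 * x
increment = lambda x: x + 1

averaged = ensemble([double, increment])
chained = chain(double, increment)
by_sign = segmented([(lambda x: x < 0, lambda x: 0.0)], default=double)
```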
  • 11. Model Development & Production (diagram labels: Data Scientist, GO, Data Engineer).
  • 12. Model Standardization (diagram labels: ML Framework, Model Definition, Evaluation, Data, Predictions, Export, Import). PFA - Portable Format for Analytics.
  • 13. Model Standardization: PFA or PMML won’t break the pipeline. PFA is more flexible than PMML: “Unlike PMML, PFA has control structures to direct program flow, a true type system for both model parameters and data, and its statistical functions are much more finely grained and can accept callbacks to modify their behavior” (http://dmg.org/pfa/docs/motivation/). Custom model definitions and implementations are more flexible or more optimized but could break the pipeline. Some implementations: https://github.com/jpmml/jpmml-evaluator-spark, https://github.com/jpmml, https://github.com/opendatagroup/hadrian
  • 14. Model Lifecycle: Some concerns about the model lifecycle: model evolution; model release practices; model versioning; model update process.
  • 15. Model Governance: A model should be governed by the company’s policies and procedures, laws and regulations, and the organization’s goals; be searchable across the company; be transparent, explainable, traceable and interpretable for auditors and regulators (example GDPR requirements: https://iapp.org/news/a/is-there-a-right-to-explanation-for-machine-learning-in-the-gdpr/); and have an approval and release process.
  • 16. Model Server: “A model server is a system which handles the lifecycle of a model and provides the required APIs for deploying a model/pipeline.” Examples: Clipper (image: https://rise.cs.berkeley.edu/blog/low-latency-model-serving-clipper/) and TensorFlow Serving (image: https://www.tensorflow.org/serving/).
  • 17. Model Serving - Requirements. Other requirements: Response time: the time to calculate a prediction, which could be a few milliseconds. Throughput: predictions per second. Support for running multiple models (it is very common to run hundreds of models, e.g. a telecom operator with one model per customer, or in IoT one model per site/sensor).
  • 18. Model Serving - Requirements: Support for multiple versions of the same machine learning pipeline within the system; one reason can be A/B testing. Model update: how quickly and easily can a model be updated? Uptime/reliability.
  • 19. TensorFlow Serving Issues: Not all systems cover the requirements. For example: metadata not available (https://github.com/tensorflow/serving/issues/612); no new models at runtime (https://github.com/tensorflow/serving/issues/422); can be hard to build from scratch (https://github.com/tensorflow/serving/issues/327).
  • 20. Model Serving with Apache Flink: Apache Flink is a streaming engine based on the Beam model, with low latency compared to Spark.
  • 21. Model Serving with Apache Flink: Idea: exploit Flink’s low-latency capabilities for serving models. Focus on offline models loaded from permanent storage and update them without interruption. FLIP proposal: https://docs.google.com/document/d/1ON_t9S28_2LJ91Fks2yFw0RYyeZvIvndu8oGRPsPuk8. Combines different efforts (https://github.com/FlinkML): https://github.com/FlinkML/flink-jpmml (https://radicalbit.io/); https://github.com/FlinkML/flink-modelServer (Boris Lublinsky); https://github.com/FlinkML/flink-tensorflow (Eron Wright).
  • 22. Model Serving with Apache Flink: Use a control stream and a data stream. Keep the model in the operator’s state. Join the streams. Flink provides two ways of implementing low-level joins: a key-based join based on CoProcessFunction and a partition-based join based on RichCoFlatMapFunction.
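The control-stream/data-stream idea can be simulated without Flink. The sketch below is plain Python, not the Flink API: the class `KeyedModelServer` and the event tuples are hypothetical stand-ins for a keyed two-input operator that keeps the current model in its state.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Model = Callable[[float], float]
Event = Tuple[str, str, object]  # (kind, key, payload)


class KeyedModelServer:
    """Keeps one model per key in state; control events replace it, data events use it."""

    def __init__(self) -> None:
        self._models: Dict[str, Model] = {}

    def on_control(self, key: str, model: Model) -> None:
        # A new model arriving on the control stream replaces the old one.
        self._models[key] = model

    def on_data(self, key: str, value: float) -> Optional[float]:
        # Score with the model currently in state; None if no model has arrived yet.
        model = self._models.get(key)
        return None if model is None else model(value)


def run(events: Iterable[Event]) -> List[Tuple[str, Optional[float]]]:
    """Process an interleaved sequence of control and data events in arrival order."""
    server = KeyedModelServer()
    results = []
    for kind, key, payload in events:
        if kind == "control":
            server.on_control(key, payload)
        else:
            results.append((key, server.on_data(key, payload)))
    return results


events = [
    ("data", "a", 1.0),                    # no model yet for key "a"
    ("control", "a", lambda x: 2 * x),     # first model version
    ("data", "a", 1.0),
    ("control", "a", lambda x: x + 10),    # updated model, serving is not interrupted
    ("data", "a", 1.0),
    ("data", "b", 1.0),                    # key "b" never received a model
]
results = run(events)
```

In a real deployment the state would be managed and checkpointed by the stream processor, so a model update survives failures.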
  • 23. Model Serving with Apache Flink: More here: https://info.lightbend.com/ebook-serving-machine-learning-models-register.html
  • 24. Data Lakes: How can we work with data to cover future needs and use cases? We need a robust ML framework plus flexible infrastructure. Data warehouses will not work. Data lake to the rescue. “A data lake is a method of storing data within a system or repository, in its natural format, that facilitates the collocation of data in various schemata and structural forms, usually object blobs or files.” - Wikipedia
  • 25. Data Lakes: Agility: a data lake can be seen as a tool that makes data accessible to different users and facilitates ML. Designed for low-cost storage. Schema on read. Security and governance are still maturing.
  • 26. Data Lake Issues: “Through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” - Gartner. Several vendors try to deliver end-to-end solutions: Databricks Delta platform, IBM Watson Platform etc.
  • 27. Notebooks: Very convenient for the data scientist or the analyst. Production is usually based on traditional deployment methods. Examples: Spark Notebook, Apache Zeppelin, Jupyter.
  • 28. ML with Apache Spark: “A popular big data framework for ML and data-science.” You can work locally and move to production fast; ETL/feature engineering; hyper-parameter tuning; rich model support; multiple language support (Scala, Java, Python, R).
  • 29. Apache Spark - Intro: A framework for distributed in-memory data processing.
  • 30. Apache Spark - Intro: The user defines computations/operations (map, flatMap etc.) on data sets (bounded or not) as a DAG. The DAG is shipped to the nodes where the data lie, the computation is executed and the results are sent back to the user. The data sets are treated as immutable distributed data (RDDs). A Resilient Distributed Dataset (RDD) is an immutable distributed collection of objects.
  • 31. Apache Spark - Basic Example in Scala: basic statistics, a hello world for ML.
  • 32. Apache Spark - Intro: There are three APIs: RDD, DataFrames, Datasets (https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html). When errors are detected, by column RDD / DataFrames (SQL) / Datasets: syntax errors: Runtime / Compile Time / Compile Time; analysis errors: Runtime / Runtime / Compile Time.
  • 33. Apache Spark - Intro: “Datasets support encoders which allow to map semi-structured formats (eg JSON) to constructs of type safe languages (Scala, Java). Also they have better performance compared to java serialization or kryo.”
  • 34. MLlib: A library for machine learning on top of Spark. It has two APIs: RDD-based (spark.mllib) and Datasets/DataFrames-based (spark.ml). The latter is relatively new and makes it easier to construct an ML pipeline or run an algorithm. The former is older and has more features.
  • 35. MLlib: “As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode.” What are the implications? MLlib will still support the RDD-based API in spark.mllib with bug fixes. MLlib will not add new features to the RDD-based API. In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API. After reaching feature parity (roughly estimated for Spark 2.3), the RDD-based API will be deprecated. The RDD-based API is expected to be removed in Spark 3.0.
  • 36. MLlib: Supports different categories of ML algorithms: basic statistics (correlations etc.); pipelines (LSH, TF-IDF); extracting, transforming and selecting features; classification and regression (random forests, gradient-boosted trees); clustering (k-means, LDA etc.); collaborative filtering; frequent pattern mining; model selection and tuning. It allows implementing fraud detection, recommendation engines, ...
  • 37. MLlib Local: A new package is available for production use of the algorithms without the need for Spark itself. How about PMML vs this method? https://issues.apache.org/jira/browse/SPARK-13944 https://issues.apache.org/jira/browse/SPARK-16365
  • 38. MLlib - Unsupervised Learning Example: Our data set (https://www.kaggle.com/danielpanizzo/wine-quality/data) describes wine quality along different dimensions such as chlorides, sugar etc. We will apply k-means to identify different clusters of wine quality. Both the mllib and ml implementations are available as Spark notebooks. Steps: normalize data, k-means, PCA, visualize.
  • 39. MLlib - Unsupervised Learning Example: parse the data; train k-means with different k.
  • 40. MLlib - Unsupervised Learning Example: counting errors for the elbow method.
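The elbow method compares the within-cluster sum of squared errors for several values of k and looks for the point where the improvement flattens. The notebook code from the slides is not preserved here, so the following is only a toy Python sketch. It assumes one-dimensional data, a deterministic initialisation and hypothetical function names (`kmeans_1d`, `wssse`); it does not use Spark.

```python
def kmeans_1d(points, k, iterations=20):
    """Lloyd's algorithm on 1-D data with a simple deterministic initialisation."""
    ordered = sorted(points)
    n = len(ordered)
    # Pick k evenly spaced points from the sorted data as initial centers.
    centers = [ordered[i * n // k] for i in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers


def wssse(points, centers):
    """Within-cluster sum of squared errors: distance of each point to its nearest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)


# Toy data with two obvious groups, around 1.0 and around 5.0.
data = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
costs = {k: wssse(data, kmeans_1d(data, k)) for k in range(1, 5)}
```

On this toy data the cost drops sharply from k=1 to k=2 and only marginally afterwards, which is the "elbow" that suggests k=2.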
  • 41. MLlib - Unsupervised Learning Example: PCA analysis to verify k-means with k=2.
  • 42. MLlib - Unsupervised Learning Example: PCA, k=2.
  • 43. MLlib - Unsupervised Learning Example: available with the mllib implementation.
  • 44. Spark Deep Learning Pipelines: People know SQL. Models are productized as SQL UDFs. Predictions as a SQL statement: SELECT my_custom_keras_model_udf(image) AS predictions FROM my_spark_image_table. https://github.com/databricks/spark-deep-learning
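The "model as a SQL UDF" idea can be tried without Spark. The sketch below swaps in Python's built-in `sqlite3` module as a stand-in for Spark SQL, and a trivial function `toy_model` as a stand-in for a Keras model; both are assumptions made for illustration and not part of Spark Deep Learning Pipelines.

```python
import sqlite3


def toy_model(x: float) -> float:
    """Stand-in for a trained model: a fixed linear function."""
    return 2.0 * x + 1.0


conn = sqlite3.connect(":memory:")
# Register the model as a one-argument SQL function (a UDF).
conn.create_function("toy_model_udf", 1, toy_model)

conn.execute("CREATE TABLE inputs (x REAL)")
conn.executemany("INSERT INTO inputs (x) VALUES (?)", [(0.0,), (1.5,), (3.0,)])

# Predictions are now just a SQL statement.
rows = conn.execute(
    "SELECT toy_model_udf(x) AS prediction FROM inputs ORDER BY x"
).fetchall()
predictions = [row[0] for row in rows]
```

Anyone who can write the `SELECT` statement can use the model, without knowing how it was trained.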
  • 45. BigDL: Developed by Intel. It does not use GPUs; it is optimized for Intel processors. “It is orders of magnitude faster than out-of-box open source Caffe, Torch or TensorFlow on a single-node Xeon (i.e., comparable with mainstream GPU).” It is implemented as a standalone package on Spark. It can be used with existing Spark or Hadoop clusters. High performance, powered by Intel MKL and multi-threaded programming. Easily scaled out. Appropriate for users who are not DL experts.
  • 46. BigDL: Offers a user-friendly, idiomatic Scala and Python 2.7/3.5 API for training and testing machine learning models. A lot of useful features: loss functions, layers support etc. Implements a parameter server for distributed training of DL models. Supports visualization via TensorBoard: https://intel-analytics.github.io/bigdl-doc/UserGuide/visualization-with-tensorboard
  • 47. BigDL in practice: For a cool example of using BigDL on Mesos, check our blog: http://developer.lightbend.com/blog/2017-06-22-bigdl-on-mesos/