This document discusses building highly available and scalable machine learning products. It begins with an introduction to data-driven products and machine learning concepts such as supervised and unsupervised learning. It then discusses six key challenges in building machine learning products at iyzico: 1) models need testing on real data before production, 2) response times must stay under 0.1 seconds, 3) data is dynamic, 4) high availability and fail-fast behavior are required, 5) machine learning models need continuous delivery, and 6) aggregated features must be simulated from batch data. Finally, it gives examples of techniques used at iyzico to address these challenges, such as Spark for predictions, schemaless databases, circuit breakers, DevOps for machine learning, and Redis for time-series aggregated features.
3. Agenda
1. What is Data-Driven Product?
a) Introduction
b) Examples
2. Machine Learning
a) Term Definitions
b) A Visual Example
c) Supervised Learning
d) Unsupervised Learning
e) Cross Validation
f) Feature Extraction
3. Machine Learning in iyzico
5. Data Driven Product
• Data-driven is the future!!!
• It’s the ‘right’ way of doing things!!! ..etc.
• But what does “data-driven” actually mean?
• Is Facebook a data-driven product?
• Is Uber a data-driven product?
• We can say that “all” of these are data-driven products.
• All of them work with data.
• But are they really data-driven products?
6. Data Driven Product
• Experimentation:
• Data-Driven: Making design decisions based on
behavioral evidence from users.
• Example: Picking a green button for your website
because conversion metrics are significantly improved
over the purple button
7. Data Driven Product
• Machine Learning : Building systems that learn from
behavioral data generated by users
• Examples:
• Recommendation
• Personalized Ranking
• People-you-may-know
• Products-you-may-like
8. Data Driven Product
• Databases or APIs
• They just use the data.
• Seen that way, their systems also look data-driven.
• But they are NOT data-driven.
• They don’t use behavioral data generated by users.
9. Examples
• A mobile app that gives information about public transport around you.
• It pulls data from transport operators or their APIs, merges it, and shows it to you.
• Nothing really data-driven there.
• A data-driven version of this app would:
• Learn which parts of the transport network are relevant to you.
• Predict when cycling is better than walking.
• Predict waiting times.
• Predict transport delays.
10. Examples
• A website that provides blogging services to users
• Write posts, subscribe to other posts, etc.
• A data-driven version of this blog would:
• Recommend who to follow based on your previous likes
• Auto-tag your content so people can quickly find it
• Create a relevance-sorted feed of posts.
12. Term Definitions
• Machine Learning: “Field of study that gives computers the ability to
learn without being explicitly programmed” Arthur Samuel
• Arthur Samuel: A pioneer in the field of computer gaming
and artificial intelligence. He coined the term "machine learning"
in 1959.
• Feature: In machine learning and pattern recognition, a feature is an
individual measurable property of a phenomenon being observed.
13. Term Definitions
• Data Sampling: Data sampling is a statistical
analysis technique used to select,
manipulate and analyze a representative
subset of data points in order to identify
patterns in the larger data set being
examined.
14. Term Definitions
• Training Set: A training set is a set of data used to discover potentially predictive
relationships.
• ML Model: You can use the ML model to get predictions on new data for which you do not
know the target.
• Cross Validation: A model validation technique for assessing how the results of a statistical
analysis will generalize to an independent data set.
21. Supervised Learning
• Input data is called training data and has a known
label or result such as spam/not-spam or a stock price
at a time.
• Example problems are classification and regression.
• Example algorithms include Logistic Regression and
the Back Propagation Neural Network.
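The logistic-regression example above can be sketched in plain Python. This toy gradient-descent classifier and its two-feature dataset are illustrative only, not the algorithm or data used at iyzico:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log loss w.r.t. the pre-activation
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

# Toy labeled training data: label 1 when both feature values are "large"
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

The known labels are exactly what makes this supervised: each wrong prediction produces an error signal that corrects the model.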
37. Unsupervised Learning
• Input data is not labeled and does not have a known
result.
• Example problems are clustering, dimensionality
reduction and association rule learning.
• Example algorithms include: the Apriori algorithm and
k-Means.
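A minimal k-Means sketch in plain Python, using naive deterministic seeding and a made-up 2-D dataset; production code would use a library such as Spark MLlib or scikit-learn:

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids.
    Naive deterministic seeding (evenly spaced points); real code would use k-means++."""
    n = len(points)
    centroids = [points[i * (n - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

# Two obvious groups of unlabeled 2-D points: no labels, structure is deduced
pts = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
centroids = sorted(kmeans(pts, 2))
```

Unlike the supervised case, nothing tells the algorithm which group a point belongs to; the two centroids emerge from the data alone.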
43. Cross Validation
• A model validation technique for
assessing how the results of
a statistical analysis will generalize to
an independent data set.
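The fold split underlying cross validation can be sketched in a few lines; index-based folds like these are what a cross-validation job iterates over (illustrative only):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation over n samples."""
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over early folds
        test = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        start += size
        yield train, test

# 10 samples, 5 folds: every sample is held out exactly once
folds = list(k_fold_indices(10, 5))
```

Each model is trained on the train indices and scored on the held-out test indices, so every sample contributes one out-of-sample prediction.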
46. Feature Extraction
• Feature extraction starts from an initial set of measured data and builds derived values (features)
intended to be informative and non-redundant.
• Feature extraction involves reducing the amount of resources required to describe a large set of
data.
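As a sketch, derived features can be computed from raw records like this; the field names (amount, hour) and the derived features are hypothetical, not iyzico's schema:

```python
from statistics import mean

def extract_features(transactions):
    """Reduce a set of raw transaction dicts to a few informative, non-redundant values."""
    amounts = [t["amount"] for t in transactions]
    return {
        "txn_count": len(amounts),
        "avg_amount": mean(amounts),
        "max_amount": max(amounts),
        # fraction of transactions placed between midnight and 6 a.m.
        "night_ratio": sum(1 for t in transactions if t["hour"] < 6) / len(transactions),
    }

raw = [
    {"amount": 10.0, "hour": 14},
    {"amount": 250.0, "hour": 3},
    {"amount": 40.0, "hour": 22},
    {"amount": 300.0, "hour": 2},
]
features = extract_features(raw)
```

Four raw rows collapse into four numbers, which is exactly the resource reduction the definition above describes.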
53. Machine Learning Model Release Pipeline
Model 1.0.2 (local) → Model 1.0.1 (listen) → Model 1.0.0 (production)
• New model developed and tested on local environment.
• Tech stack: Anaconda, Jupyter, Python, R, Scala
• New model tested on Listen Mode Server with real transaction data.
• Tech stack: Spark, Scala, Java 8
• Cost Matrix reported with real data
• Response Time reported with real data
55. Optimize Spark Cluster
• Use Spark Cluster for Training
• Use Standalone Spark for
Predictions
• Load Balancer for High
Availability
• Increase Spark total executor
core count
• Decrease Spark max executor
memory (in MB)
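As a rough sketch, these knobs map onto standard `spark-submit` options; the master URL and script names below are placeholders:

```shell
# Train on the cluster with more executor cores and bounded executor memory
spark-submit \
  --master spark://spark-master:7077 \
  --total-executor-cores 16 \
  --executor-memory 2g \
  train_model.py

# Serve predictions from a standalone local Spark context behind a load balancer
spark-submit --master "local[4]" predict_service.py
```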
57. Schemaless Database with MySQL
• Multiple features developed
each week
• All features stored and reported
• Data is really dynamic
• Schema management is really
difficult
• e.g. Uber, FriendFeed, etc.
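The pattern can be sketched with a single opaque JSON column: new features need no `ALTER TABLE`. SQLite stands in here for MySQL, and the table and field names are hypothetical:

```python
import json
import sqlite3

# One opaque JSON body per row, as in the FriendFeed/Uber schemaless pattern
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_store (entity_id TEXT PRIMARY KEY, body TEXT)")

def save_features(entity_id, features):
    conn.execute(
        "INSERT OR REPLACE INTO feature_store VALUES (?, ?)",
        (entity_id, json.dumps(features)),
    )

def load_features(entity_id):
    row = conn.execute(
        "SELECT body FROM feature_store WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

# A new feature ships this week: just write it, no schema migration needed
save_features("txn-1", {"avg_amount": 150.0})
save_features("txn-1", {"avg_amount": 150.0, "night_ratio": 0.5})
```

The trade-off is that the database can no longer index or validate individual fields; that work moves into application code.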
62. Continuous Delivery and Machine Learning
• Training-job DevOps scripts implemented and automated for the
Continuous Integration environment
• Cross-validation jobs automated on Spark with millions of
transactions
• Probability calibration is implemented.
• Data sampling is automated (clustering-based sampling)
64. Aggregated Features with Batch Data
• Time-based aggregated features need to be simulated before
production
• Ex: a buyer’s payment behavior over the last 1 hour
• Redis is used for time-series data (ZRANGE functions)
• ZRANGE and ZREVRANGE retrieve elements from
a Sorted Set based on their sorted position
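The windowed aggregation can be sketched in plain Python; this in-memory stand-in mirrors the Redis calls (`add()` ~ ZADD with an epoch-seconds score, `window()` ~ ZRANGEBYSCORE), and the amounts are made up:

```python
import bisect

class TimeSeries:
    """Minimal stand-in for a Redis Sorted Set keyed by timestamp."""

    def __init__(self):
        self._items = []  # kept sorted by (timestamp, payload), like a Sorted Set

    def add(self, ts, payload):
        bisect.insort(self._items, (ts, payload))  # ~ ZADD key ts payload

    def window(self, start, end):
        # ~ ZRANGEBYSCORE key start end: all payloads with start <= ts <= end
        lo = bisect.bisect_left(self._items, (start, -float("inf")))
        hi = bisect.bisect_right(self._items, (end, float("inf")))
        return [p for _, p in self._items[lo:hi]]

# A buyer's payments keyed by epoch seconds; aggregate the last hour on the fly
now = 1_700_000_000
series = TimeSeries()
for offset, amount in [(-7200, 80.0), (-1800, 120.0), (-60, 40.0)]:
    series.add(now + offset, amount)

last_hour = series.window(now - 3600, now)
count, total = len(last_hour), sum(last_hour)
```

The two-hour-old payment falls outside the window, so only the most recent two contribute to the aggregated feature.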
Supervised learning: a model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves the desired level of accuracy on the training data.
Unsupervised learning: a model is prepared by deducing structures present in the input data. This may be to extract general rules, to systematically reduce redundancy through a mathematical process, or to organize data by similarity.