This document discusses building highly available and scalable machine learning products. It begins with an introduction to data-driven products and machine learning concepts such as supervised and unsupervised learning. It then discusses six key challenges in building machine learning products at iyzico: 1) models need testing on real data before production, 2) response times must stay under 0.1 seconds, 3) data is dynamic, 4) high availability and fail-fast behavior are required, 5) machine learning models need continuous delivery, and 6) aggregated features must be simulated from batch data. Finally, it gives examples of techniques used at iyzico to address these challenges, such as Spark for predictions, schemaless databases, circuit breakers, DevOps for machine learning, and Redis for time-series aggregated features.
3. Agenda
1. What is Data-Driven Product?
a) Introduction
b) Examples
2. Machine Learning
a) Term Definitions
b) A Visual Example
c) Supervised Learning
d) Unsupervised Learning
e) Cross Validation
f) Feature Extraction
3. Machine Learning in iyzico
5. Data Driven Product
• Data-driven is the future!!!
• It’s the ‘right’ way of doing things!!! ..etc.
• But what does “data-driven” actually mean?
• Is Facebook a data-driven product?
• Is Uber a data-driven product?
• We can say that “all” of these are data-driven products.
• All of them work with data.
• But are they really data-driven products?
6. Data Driven Product
• Experimentation:
• Data-Driven: Making design decisions based on
behavioral evidence from users.
• Example: Picking a green button for your website
because conversion metrics are significantly improved
over the purple button
7. Data Driven Product
• Machine Learning : Building systems that learn from
behavioral data generated by users
• Examples:
• Recommendation
• Personalized Ranking
• People-you-may-know
• Products-you-may-like
8. Data Driven Product
• Databases or APIs
• They just use the data.
• Seen that way, their systems also look data-driven.
• But they are NOT data-driven.
• They don’t use behavioral data generated by users.
9. Examples
• A mobile app that gives information about public transport around you.
• It pulls data from transport operators or their APIs, merges it, and shows it to you.
• Nothing really data-driven there.
• A data-driven version of this app would:
• Learn which parts of the transport network are relevant to you.
• Predict when cycling is better than walking.
• Predict waiting times.
• Predict transport delays.
10. Examples
• A website that provides blogging services to users
• Write posts, subscribe to other posts, etc.
• A data-driven version of this blog would:
• Recommend who to follow based on your previous likes
• Auto-tag your content so people can quickly find it
• Create a relevance-sorted feed of posts.
12. Term Definitions
• Machine Learning: “Field of study that gives computers the ability to
learn without being explicitly programmed” Arthur Samuel
• Arthur Samuel: A pioneer in the field of computer gaming
and artificial intelligence. He coined the term "machine learning"
in 1959.
• Feature: In machine learning and pattern recognition, a feature is an
individual measurable property of a phenomenon being observed.
13. Term Definitions
• Data Sampling: Data sampling is a statistical
analysis technique used to select,
manipulate and analyze a representative
subset of data points in order to identify
patterns in the larger data set being
examined.
14. Term Definitions
• Training Set: A training set is a set of data used to discover potentially predictive
relationships.
• ML Model: You can use the ML model to get predictions on new data for which you do not
know the target.
• Cross Validation: A model validation technique for assessing how the results of a statistical
analysis will generalize to an independent data set.
21. Supervised Learning
• Input data is called training data and has a known
label or result such as spam/not-spam or a stock price
at a time.
• Example problems are classification and regression.
• Example algorithms include Logistic Regression and
the Back Propagation Neural Network.
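The logistic-regression example above can be sketched in plain Python. This toy gradient-descent classifier and its two-feature dataset are illustrative only, not the algorithm or data used at iyzico:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=1000):
    """Fit weights and bias by stochastic gradient descent on log loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of log loss w.r.t. the pre-activation
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, xi):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) >= 0.5 else 0

# Toy labeled training data: label 1 when both feature values are "large"
X = [[0.1, 0.2], [0.2, 0.1], [0.9, 0.8], [0.8, 0.9]]
y = [0, 0, 1, 1]
w, b = fit_logistic(X, y)
predictions = [predict(w, b, xi) for xi in X]
```

The known labels are exactly what makes this supervised: each wrong prediction produces an error signal that corrects the model.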
37. Unsupervised Learning
• Input data is not labeled and does not have a known
result.
• Example problems are clustering, dimensionality
reduction and association rule learning.
• Example algorithms include: the Apriori algorithm and
k-Means.
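A minimal k-Means sketch in plain Python, using naive deterministic seeding and a made-up 2-D dataset; production code would use a library such as Spark MLlib or scikit-learn:

```python
def kmeans(points, k, iters=20):
    """Plain k-means: assign points to the nearest centroid, then recompute centroids.
    Naive deterministic seeding (evenly spaced points); real code would use k-means++."""
    n = len(points)
    centroids = [points[i * (n - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster went empty
                centroids[i] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids

# Two obvious groups of unlabeled 2-D points: no labels, structure is deduced
pts = [(0.0, 0.1), (0.1, 0.0), (0.1, 0.1), (5.0, 5.1), (5.1, 5.0), (5.0, 5.0)]
centroids = sorted(kmeans(pts, 2))
```

Unlike the supervised case, nothing tells the algorithm which group a point belongs to; the two centroids emerge from the data alone.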
43. Cross Validation
• A model validation technique for
assessing how the results of
a statistical analysis will generalize to
an independent data set.
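The fold split underlying cross validation can be sketched in a few lines; index-based folds like these are what a cross-validation job iterates over (illustrative only):

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross validation over n samples."""
    base, extra = divmod(n, k)
    start = 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # spread the remainder over early folds
        test = list(range(start, start + size))
        train = [j for j in range(n) if j < start or j >= start + size]
        start += size
        yield train, test

# 10 samples, 5 folds: every sample is held out exactly once
folds = list(k_fold_indices(10, 5))
```

Each model is trained on the train indices and scored on the held-out test indices, so every sample contributes one out-of-sample prediction.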
46. Feature Extraction
• Feature extraction starts from an initial set of measured data and builds derived values (features)
intended to be informative and non-redundant.
• Feature extraction involves reducing the amount of resources required to describe a large set of
data.
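As a sketch, derived features can be computed from raw records like this; the field names (amount, hour) and the derived features are hypothetical, not iyzico's schema:

```python
from statistics import mean

def extract_features(transactions):
    """Reduce a set of raw transaction dicts to a few informative, non-redundant values."""
    amounts = [t["amount"] for t in transactions]
    return {
        "txn_count": len(amounts),
        "avg_amount": mean(amounts),
        "max_amount": max(amounts),
        # fraction of transactions placed between midnight and 6 a.m.
        "night_ratio": sum(1 for t in transactions if t["hour"] < 6) / len(transactions),
    }

raw = [
    {"amount": 10.0, "hour": 14},
    {"amount": 250.0, "hour": 3},
    {"amount": 40.0, "hour": 22},
    {"amount": 300.0, "hour": 2},
]
features = extract_features(raw)
```

Four raw rows collapse into four numbers, which is exactly the resource reduction the definition above describes.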
53. Machine Learning Model Release Pipeline
Model 1.0.2 (local) → Model 1.0.1 (listen) → Model 1.0.0 (production)
• New model developed and tested on local environment.
• Tech stack: Anaconda, Jupyter, Python, R, Scala
• New model tested on Listen Mode Server with real transaction data.
• Tech stack: Spark, Scala, Java 8
• Cost Matrix reported with real data
• Response Time reported with real data
55. Optimize Spark Cluster
• Use Spark Cluster for Training
• Use Standalone Spark for
Predictions
• Load Balancer for High
Availability
• Increase Spark total executor
core count
• Decrease Spark max executor
memory (in MB)
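As a rough sketch, these knobs map onto standard `spark-submit` options; the master URL and script names below are placeholders:

```shell
# Train on the cluster with more executor cores and bounded executor memory
spark-submit \
  --master spark://spark-master:7077 \
  --total-executor-cores 16 \
  --executor-memory 2g \
  train_model.py

# Serve predictions from a standalone local Spark context behind a load balancer
spark-submit --master "local[4]" predict_service.py
```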
57. Schemaless Database with MySQL
• Multiple features developed
each week
• All features stored and reported
• Data is really dynamic
• Schema management is really
difficult
• e.g. Uber, FriendFeed, etc.
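The pattern can be sketched with a single opaque JSON column: new features need no `ALTER TABLE`. SQLite stands in here for MySQL, and the table and field names are hypothetical:

```python
import json
import sqlite3

# One opaque JSON body per row, as in the FriendFeed/Uber schemaless pattern
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feature_store (entity_id TEXT PRIMARY KEY, body TEXT)")

def save_features(entity_id, features):
    conn.execute(
        "INSERT OR REPLACE INTO feature_store VALUES (?, ?)",
        (entity_id, json.dumps(features)),
    )

def load_features(entity_id):
    row = conn.execute(
        "SELECT body FROM feature_store WHERE entity_id = ?", (entity_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None

# A new feature ships this week: just write it, no schema migration needed
save_features("txn-1", {"avg_amount": 150.0})
save_features("txn-1", {"avg_amount": 150.0, "night_ratio": 0.5})
```

The trade-off is that the database can no longer index or validate individual fields; that work moves into application code.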
62. Continuous Delivery and Machine Learning
• Training-job DevOps scripts implemented and automated for the
Continuous Integration environment
• Cross-validation jobs automated on Spark with millions of
transactions
• Probability calibration is implemented.
• Data sampling is automated (clustering-based sampling)
64. Aggregated Features with Batch Data
• Time-based aggregated features need to be simulated before
production
• Ex: a buyer’s payment behavior over the last 1 hour
• Redis is used for time-series data (ZRANGE functions)
• ZRANGE and ZREVRANGE retrieve elements from
a Sorted Set based on their sorted position
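The windowed aggregation can be sketched in plain Python; this in-memory stand-in mirrors the Redis calls (`add()` ~ ZADD with an epoch-seconds score, `window()` ~ ZRANGEBYSCORE), and the amounts are made up:

```python
import bisect

class TimeSeries:
    """Minimal stand-in for a Redis Sorted Set keyed by timestamp."""

    def __init__(self):
        self._items = []  # kept sorted by (timestamp, payload), like a Sorted Set

    def add(self, ts, payload):
        bisect.insort(self._items, (ts, payload))  # ~ ZADD key ts payload

    def window(self, start, end):
        # ~ ZRANGEBYSCORE key start end: all payloads with start <= ts <= end
        lo = bisect.bisect_left(self._items, (start, -float("inf")))
        hi = bisect.bisect_right(self._items, (end, float("inf")))
        return [p for _, p in self._items[lo:hi]]

# A buyer's payments keyed by epoch seconds; aggregate the last hour on the fly
now = 1_700_000_000
series = TimeSeries()
for offset, amount in [(-7200, 80.0), (-1800, 120.0), (-60, 40.0)]:
    series.add(now + offset, amount)

last_hour = series.window(now - 3600, now)
count, total = len(last_hour), sum(last_hour)
```

The two-hour-old payment falls outside the window, so only the most recent two contribute to the aggregated feature.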
Supervised learning: a model is prepared through a training process in which it is required to make predictions and is corrected when those predictions are wrong. The training process continues until the model achieves the desired level of accuracy on the training data.
Unsupervised learning: a model is prepared by deducing structures present in the input data. This may be to extract general rules, to systematically reduce redundancy through a mathematical process, or to organize data by similarity.