TransmogrifAI
Automate Machine Learning Workflow with the power of Scala and
Spark at massive scale.
By: Chetan Khatri (@khatri_chetan),
Lead - Technology,
Accion Labs
About me
Lead - Data Science @ Accion labs India Pvt. Ltd.
Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark
HBase Connectors.
Data Engineering @: Nazara Games, Eccella Corporation.
Advisor - Data Science Lab, University of Kachchh, India.
M.Sc. - Computer Science from University of Kachchh, India.
About Accion Labs
We Are A Product Engineering Company Helping Transform Businesses Through
Emerging Technologies.
What we do? Data Engineering, Machine Learning, NLP, Microservices etc.
Case studies: https://www.accionlabs.com/accion-work
Contact: https://www.accionlabs.com/accion-bangalore-contact-1
Agenda
● What is TransmogrifAI ?
● Why you need TransmogrifAI ?
● Automation of the machine learning life cycle, from development to deployment.
○ Feature Inference
○ Transformation
○ Automated Feature validation
○ Automated Model Selection
○ Hyperparameter Optimization
● Type safety in Spark and TransmogrifAI.
● Example code: the Titanic Kaggle problem.
What is TransmogrifAI ?
● TransmogrifAI is open sourced by Salesforce.com, Inc. in June, 2018
● An end-to-end automated machine learning workflow library for structured
data, built on top of Scala and SparkML.
What is TransmogrifAI ?
● TransmogrifAI automates much of the machine learning model life cycle:
feature selection, transformation, automated feature validation, automated
model selection, and hyperparameter optimization.
● It enforces compile-time type safety, modularity, and reuse.
● Through automation, it achieves accuracies close to hand-tuned models with
an almost 100x reduction in time.
Why you need TransmogrifAI ?
Features:
● Automation: numerous Transformers and Estimators.
● Modularity and reuse: enforces a strict separation between ML workflow
definitions and data manipulation.
● Compile-time type safety: workflows are strongly typed, which gives code
completion during development and fewer runtime errors.
● Transparency: model insights leverage stored feature metadata and lineage
to help debug models.
Why you need TransmogrifAI ?
Use TransmogrifAI if you need a machine learning library to:
● Build production ready machine learning applications in hours, not months
● Build machine learning models without getting a Ph.D. in machine learning
● Build modular, reusable, strongly typed machine learning workflows
Read more in the documentation: https://transmogrif.ai/
Why Machine Learning is hard ?! Really! ...
For example, this may be using a linear
classifier when your true decision
boundaries are non-linear.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
Why Machine Learning is hard ?! Really! ...
Fast and effective debugging is the skill most required for
implementing modern machine learning pipelines.
Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
Real time Machine Learning takes time to Productionize
TransmogrifAI automates the entire ML life cycle to accelerate developer
productivity.
Under the Hood
Automated Feature Engineering
Automated Feature Selection
Automated Model Selection
Automated Feature Engineering
Automatic Derivation of new features based on existing features.
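Raw features and the features derived from them:
● Email -> Email is Spam
● Phone -> Country Code
● Age -> buckets [0-20], [21-30], [> 30]
● Subject -> Stop words, Top terms (TF-IDF), Detect Language
● Zip Code -> Average Income, House Price, School Quality, Shopping, Transportation
● DOB -> Age, Day of Week, Week of Year, Quarter, Month, Year, Hour
● Gender -> To Binary
All derived features are combined into a single Feature Vector.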
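A minimal sketch of two of the derivations listed above, in plain Python rather than TransmogrifAI: bucketing an age and extracting calendar features from a date of birth. The function names, the sample date, and the choice to pass `today` explicitly are assumptions made for illustration.

```python
from datetime import date

def age_bucket(age):
    """Map an age in years to one of the three buckets shown on the slide."""
    if age <= 20:
        return "[0-20]"
    if age <= 30:
        return "[21-30]"
    return "[> 30]"

def date_features(dob, today):
    """Derive age and calendar features from a date of birth."""
    # Whole years: subtract one if the birthday has not happened yet this year.
    age = today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))
    return {
        "age": age,
        "age_bucket": age_bucket(age),
        "day_of_week": dob.isoweekday(),       # 1 = Monday ... 7 = Sunday
        "week_of_year": dob.isocalendar()[1],  # ISO week number
        "quarter": (dob.month - 1) // 3 + 1,
        "month": dob.month,
        "year": dob.year,
    }

# Hypothetical sample record
features = date_features(date(1990, 5, 17), today=date(2020, 1, 1))
print(features)
```

Each derived value becomes one more column in the feature vector; an automated system applies such derivations based on the declared feature type instead of hand-written calls.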
Automated Feature Engineering
● Analyze every feature column and compute descriptive statistics.
○ Number of nulls, number of empty strings, mean, max, min, standard deviation.
● Handle missing values / noisy values.
○ Ex. fillna with the mean, the median, or nearby values.
import numpy
from sklearn.impute import SimpleImputer  # replaces the removed sklearn.preprocessing.Imputer

# fill missing values with a sentinel value
patient_details = patient_details.fillna(-1)
data['City_Type'] = data['City_Type'].fillna('Z')
# impute missing values with the column median
imp = SimpleImputer(missing_values=numpy.nan, strategy='median', copy=False)
data_total_imputed = imp.fit_transform(data_total)
# mark zero values as missing or NaN
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.nan)
# fill missing values with mean column values
dataset.fillna(dataset.mean(), inplace=True)
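The two steps above (describe each column, then fill the gaps) can be sketched with the standard library alone. This is an illustration under assumptions, not how TransmogrifAI implements it: `None` stands for a missing value, and the function names and sample column are made up.

```python
from statistics import mean, median, pstdev

def describe(column):
    """Descriptive statistics for one numeric column; None marks a missing value."""
    present = [v for v in column if v is not None]
    return {
        "nulls": len(column) - len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
        "std": pstdev(present),  # population standard deviation
    }

def impute(column, strategy="mean"):
    """Replace missing values with the mean or median of the observed values."""
    present = [v for v in column if v is not None]
    fill = mean(present) if strategy == "mean" else median(present)
    return [fill if v is None else v for v in column]

# Hypothetical sample column
ages = [22, None, 35, 41, None, 30]
stats = describe(ages)
filled = impute(ages, strategy="median")
print(stats["nulls"], filled)
```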
Automated Feature Engineering
● Do the features have acceptable ranges and contain valid values ?
● Could the feature be a leaker ?
○ Is it usually filled out after the predicted field is ?
○ Is it highly correlated with the predicted field ?
● Does the feature contain outliers ?
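One of the checks above, correlation with the predicted field, can be sketched as follows. The 0.95 threshold, the feature names, and the sample values are assumptions chosen for illustration; a real leakage check would combine several signals.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)

def suspected_leakers(features, label, threshold=0.95):
    """Names of features whose absolute correlation with the label is suspiciously high."""
    return [name for name, values in features.items()
            if abs(pearson(values, label)) >= threshold]

# Hypothetical data: 'refund_issued' is only filled in after the outcome is known.
label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [25, 31, 47, 22, 38, 29],
    "refund_issued": [0, 1, 0, 1, 1, 0],
}
flagged = suspected_leakers(features, label)
print(flagged)
```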
Automated Feature Selection / Data Pre-processing
● Data Type of Features, Automatic Data Pre-processing.
○ MinMaxScaler
○ Normalizer
○ Binarizer
○ Label Encoding
○ One Hot Encoding
● Auto Data Pre-Processing based on chosen ML Model.
● An algorithm like XGBoost specifically requires dummy-encoded data, while an
algorithm like a decision tree doesn’t seem to care at all (sometimes)!
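Two of the pre-processing steps listed above, written out by hand as a rough sketch: min-max scaling for a numeric column and one-hot (dummy) encoding for a categorical one. The helper names and sample values are illustrative assumptions; libraries such as Spark ML and scikit-learn provide their own implementations.

```python
def min_max_scale(values):
    """Rescale numeric values linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # a constant column maps to zeros
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """Return the sorted categories and one 0/1 indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical sample columns
scaled = min_max_scale([10, 20, 40])
categories, encoded = one_hot(["S", "C", "S", "Q"])
print(scaled, categories, encoded)
```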
Auto Data Pre-processing
● Numeric - Imputation, Track Null Value, Log Transformation for large range,
Scaling, Smart Binning.
● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy
Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category
Embedding.
● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis,
Language Detection.
● Temporal - Time Difference, Circular Statistics, Time Extraction(Day, Week,
Month, Year).
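A rough sketch of two ideas from this list. It assumes that "top K pivot" means one-hot encoding the K most frequent categories and grouping the rest into an "OTHER" column, and that the circular treatment of time means mapping an hour onto the unit circle; both readings, and all names below, are assumptions for illustration.

```python
from collections import Counter
from math import cos, pi, sin

def top_k_pivot(values, k):
    """One-hot encode the k most frequent categories; all others share an OTHER column."""
    top = [category for category, _ in Counter(values).most_common(k)]
    columns = top + ["OTHER"]
    rows = []
    for v in values:
        key = v if v in top else "OTHER"
        rows.append([1 if key == c else 0 for c in columns])
    return columns, rows

def hour_to_circle(hour):
    """Map an hour of the day (0-23) to a point on the unit circle."""
    angle = 2 * pi * hour / 24
    return sin(angle), cos(angle)

# Hypothetical sample column
columns, rows = top_k_pivot(["a", "b", "a", "c", "a", "b", "d"], k=2)
print(columns, rows)
print(hour_to_circle(23), hour_to_circle(0))
```

On the circle, 23:00 and 00:00 end up close together, which a plain integer encoding of the hour would not capture.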
Auto Selection of Best Model with Hyper Parameter Tuning
● Machine Learning Model
○ Learning Rate
○ Epochs
○ Batch Size
○ Optimizer
○ Activation Function
○ Loss Function
● Search Algorithms to find best model and optimal hyper parameters.
○ Ex. Grid Search, Random Search, Bandit Methods
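Grid search and random search differ only in how they pick candidates, which a small sketch makes concrete. The `validation_error` function is a hypothetical stand-in for "train a model and measure its validation error", and the search space is made up.

```python
import itertools
import random

def validation_error(learning_rate, batch_size):
    """Hypothetical stand-in for training a model and returning its validation error."""
    return (learning_rate - 0.1) ** 2 + ((batch_size - 64) / 64) ** 2

def grid_search(grid):
    """Evaluate every combination in the grid and keep the best (score, params)."""
    names = list(grid)
    best = None
    for combo in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = validation_error(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

def random_search(space, n_trials, seed=0):
    """Evaluate n_trials randomly sampled combinations and keep the best."""
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = validation_error(**params)
        if best is None or score < best[0]:
            best = (score, params)
    return best

grid = {"learning_rate": [0.01, 0.1, 1.0], "batch_size": [32, 64, 128]}
best_score, best_params = grid_search(grid)
random_score, random_params = random_search(grid, n_trials=5)
print(best_params, random_params)
```

Grid search costs one training run per combination (nine here), while random search caps the cost at `n_trials` and may miss the best combination.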
Examples - Hyper parameter tuning
XGBoost:
import numpy as np
import xgboost as xgb

# booster hyperparameters; 'verbosity': 0 replaces the deprecated 'silent': 1
params = {'booster':'gbtree', 'objective':'binary:logistic', 'max_depth':9, 'eval_metric':'logloss',
          'eta':0.02, 'verbosity':0, 'nthread':4, 'subsample': 0.9, 'colsample_bytree':0.9,
          'scale_pos_weight':1, 'min_child_weight':3, 'max_delta_step':3}
num_rounds = 400
params['seed'] = 523264626346 # 0.85533
# train on the labelled data, treating NaN as missing
dtrain = xgb.DMatrix(train, labels, missing=np.nan)
clf = xgb.train(params, dtrain, num_rounds)
# predict probabilities for the test set
dtest = xgb.DMatrix(test, missing = np.nan)
test_preds = clf.predict(dtest)
Examples - Hyper parameter tuning
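from sklearn.ensemble import (ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# random forest; n_tree, mtry, criterion and max_depth are chosen by the caller
rf = RandomForestClassifier(n_estimators= int(n_tree), max_features= int(mtry),
                            criterion = criterion, max_depth = max_depth, n_jobs = -1, oob_score = True)
rf.fit(X_train, y_train)
# gradient boosting (GradientBoostingClassifier takes no n_jobs argument)
gbm = GradientBoostingClassifier(n_estimators = n_tree, max_features = max_features,
                                 subsample = subsample)
# 10-fold stratified cross-validation scored by F1
cv = StratifiedKFold(n_splits = 10)
scores = cross_val_score(gbm, X_train, y_train, cv = cv, n_jobs = 4, scoring= 'f1')
# extremely randomized trees, scored the same way
ext = ExtraTreesClassifier(n_estimators = n_tree, criterion = criterion,
                           max_features = max_features, n_jobs = -1)
cv = StratifiedKFold(n_splits = 10)
scores = cross_val_score(ext, X_train, y_train, cv = cv, n_jobs = 4, scoring= 'f1')
# exhaustive grid search over random forest hyperparameters
param_grid = { 'n_estimators' : [500,1000,1500,2000,2500,3000],
               'criterion' : ['gini', 'entropy'],
               'max_features' : [15,20,25,30],
               'max_depth' : [4,5,6]
}
gs_cv = GridSearchCV(rf, param_grid, scoring = 'f1', n_jobs = -1, verbose = 2).fit(X_train[subspace], y_train)
gs_cv.best_params_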
Ensemble Modeling
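from scipy.stats import rankdata

# ens holds one column of predicted 'Disbursed' scores per model
ens['XGB1'] = xgb1_pred['Disbursed']  # assumed: this assignment is missing from the slide
ens['XGB2'] = xgb2_pred['Disbursed']
ens['RF'] = rf_pred['Disbursed']
ens['FTRL'] = ftrl_pred['Disbursed']
# convert each model's scores to ranks so they share one scale
ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
ens['RF_Rank'] = rankdata(ens['RF'], method='min')
ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
# weighted blend of ranks; the slide used the raw ens['FTRL'] score here, which is not on the rank scale
ens['Final'] = (0.75*ens['XGB_Rank'] + 0.25*ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL_Rank']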
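Rank averaging, as used above, can be sketched without pandas or SciPy. The helper `rank_min` is a hand-written stand-in that mimics `rankdata(..., method='min')`; the scores and weights are made-up illustration values.

```python
def rank_min(values):
    """Rank values ascending; tied values share the lowest rank of their group."""
    ordered = sorted(values)
    # index() returns the first position of a value, so ties get the same rank.
    return [ordered.index(v) + 1 for v in values]

def blend(rank_lists, weights):
    """Weighted sum of several rank lists, computed row by row."""
    n_rows = len(rank_lists[0])
    return [sum(w * ranks[i] for w, ranks in zip(weights, rank_lists))
            for i in range(n_rows)]

# Hypothetical predicted probabilities from two models for four rows
xgb_scores = [0.90, 0.20, 0.55, 0.20]
rf_scores = [0.80, 0.10, 0.60, 0.30]
final = blend([rank_min(xgb_scores), rank_min(rf_scores)], weights=[0.75, 0.25])
print(final)
```

Blending ranks instead of raw scores keeps a model with a wider probability range from dominating the ensemble.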
Type Safety: Integration with Apache Spark and Scala
● Modular, Reusable, Strongly typed Machine learning workflow on top of
Apache Spark.
● Type safety in Apache Spark with the Dataset API.
Structured Data in Apache Spark
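Structured APIs in Spark: DataFrames and Datasets
Unification of APIs in Apache Spark 2.0
● Before 2.0: DataFrame (untyped API) and Dataset (typed API) were separate.
● Since Spark 2.0 (2016): a single Dataset API.
○ Untyped API: DataFrame = Dataset[Row] (type alias)
○ Typed API: Dataset[T]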
Why Dataset ?
● Strong typing.
● Ability to use powerful lambda functions.
● Spark SQL’s optimized execution engine (Catalyst, Tungsten).
● Can be constructed from JVM objects and manipulated using functional
transformations (map, filter, flatMap etc).
● A DataFrame is a Dataset organized into named columns.
● DataFrame is simply a type alias of Dataset[Row].
DataFrame API Code
// convert RDD -> DF with column names
val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
//filter, groupBy, sum, and then agg()
parsedDF.filter($"project" === "finance").
groupBy($"sprint").
agg(sum($"numStories").as("count")).
limit(100).
show(100)
project sprint numStories
finance 3 20
finance 4 22
DataFrame -> SQL View -> SQL Query
parsedDF.createOrReplaceTempView("audits")
val results = spark.sql(
"""SELECT sprint, sum(numStories)
AS count FROM audits WHERE project = 'finance' GROUP BY sprint
LIMIT 100""")
results.show(100)
project sprint numStories
finance 3 20
finance 4 22
Why Structured APIs ?
// DataFrame
data.groupBy("dept").avg("age")
// SQL
select dept, avg(age) from data group by 1
// RDD
data.map { case (dept, age) => dept -> (age, 1) }
.reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
.map { case (dept, (age, c)) => dept -> age / c }
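The RDD version has to spell out how an average is computed: carry a (sum, count) pair per key, merge the pairs, then divide. A plain Python sketch of that same logic (not Spark code; the sample data is made up) shows what the one-line DataFrame and SQL versions hide:

```python
from collections import defaultdict

def average_by_key(pairs):
    """Average values per key by accumulating (sum, count) pairs, then dividing."""
    totals = defaultdict(lambda: (0, 0))
    for dept, age in pairs:
        running_sum, count = totals[dept]
        totals[dept] = (running_sum + age, count + 1)
    return {dept: s / c for dept, (s, c) in totals.items()}

# Hypothetical (dept, age) records
data = [("eng", 30), ("eng", 40), ("ops", 50)]
averages = average_by_key(data)
print(averages)
```

With the structured APIs the engine knows the intent is "average per group", so it can plan the aggregation itself instead of running opaque user functions.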
Catalyst in Spark
SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan ->
Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
Dataset API in Spark 2.x
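import org.apache.spark.sql.Dataset
import spark.implicits._  // encoders needed by .as[Employee]

val employeesDF = spark.read.json("employees.json")
// Convert data to domain objects.
// Spark infers JSON integers as Long, so age is declared as Long.
case class Employee(name: String, age: Long)
val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
val filterDS = employeesDS.filter(p => p.age > 3)
Type-safe: operate on domain objects with compiled lambda functions.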
Structured APIs in Apache Spark
                  SQL       DataFrames     Datasets
Syntax Errors     Runtime   Compile Time   Compile Time
Analysis Errors   Runtime   Runtime        Compile Time
Analysis errors are caught before a job runs on cluster
Spark SQL API - Analysis Error example.
TransmogrifAI - Type Safety is Everywhere!
● Value operations
● Feature operations
● Transformation Pipelines (aka Workflows)
// Typed value operations
def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
// Typed feature operations
val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
val tokens: Feature[TextList] = title.map(tokenize)
// Transformation pipelines
new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
Example Code
Ref. https://github.com/fosscoder/transmogrifai-demo
A Case Story - Functional Flow - Spark as a SaaS
● User Interface
○ Build workflow: Source, Target, Transformations, filter, aggregation, Joins,
Expressions, Machine Learning Algorithms
● Store metadata of the workflow in a document-based NoSQL store, e.g. MongoDB
(ReactiveMongo)
● Scala / Spark job reads the metadata from NoSQL, e.g. MongoDB
● Run on the cluster
● Schedule using Airflow SparkSubmit Operator
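The flow above is metadata-driven: the UI stores a description of the workflow, and a generic job interprets it. A toy sketch of that idea in plain Python follows; the document shape, the operation names, and the sample rows are all hypothetical, and the real system would read the document from MongoDB and run the steps on Spark.

```python
from collections import defaultdict

# Hypothetical workflow metadata, shaped like a document the UI might store.
workflow = {
    "steps": [
        {"op": "filter", "column": "project", "equals": "finance"},
        {"op": "aggregate", "group_by": "sprint", "sum": "numStories", "as": "count"},
    ]
}

def run_workflow(meta, rows):
    """Apply each step described in the metadata to a list of row dicts."""
    for step in meta["steps"]:
        if step["op"] == "filter":
            rows = [r for r in rows if r[step["column"]] == step["equals"]]
        elif step["op"] == "aggregate":
            totals = defaultdict(int)
            for r in rows:
                totals[r[step["group_by"]]] += r[step["sum"]]
            rows = [{step["group_by"]: key, step["as"]: total}
                    for key, total in sorted(totals.items())]
        else:
            raise ValueError(f"unknown op: {step['op']}")
    return rows

# Hypothetical input rows
rows = [
    {"project": "finance", "sprint": 3, "numStories": 20},
    {"project": "finance", "sprint": 4, "numStories": 22},
    {"project": "hr", "sprint": 3, "numStories": 5},
]
result = run_workflow(workflow, rows)
print(result)
```

Because the job only interprets metadata, new workflows need a new document rather than a new deployment.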
A Case Story - High Level Technical Architecture - Spark as a SaaS
● User Interface
● Middleware
● Akka HTTP Web Services
Apache Livy Configuration
Apache Livy Integration
Questions ?
Thank you!
@khatri_chetan
chetan.khatri@accionlabs.com
References
[1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows
on Spark from Salesforce Engineering
[online] https://transmogrif.ai
[2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator
[online] https://github.com/rssanders3/airflow-spark-operator-plugin
[3] Apache Spark - Unified Analytics Engine for Big Data
[online] https://spark.apache.org/
[4] Apache Livy
[online] https://livy.incubator.apache.org/
[5] Zayd's Blog - Why is machine learning 'hard'?
[online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
[6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions
[online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s
[7] Auto-Machine Learning: The Magic Behind Einstein
[online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s

More Related Content

What's hot

Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsKrishna Sankar
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ FyberDaniel Hen
 
Approaching (almost) Any NLP Problem
Approaching (almost) Any NLP ProblemApproaching (almost) Any NLP Problem
Approaching (almost) Any NLP ProblemAbhishek Thakur
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature HashingWush Wu
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellDatabricks
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...Databricks
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David SzakallasDatabricks
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoostJoonyoung Yi
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKOlivier Grisel
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹Wayne Chen
 
Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...
Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...
Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...Sri Ambati
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret packageVivian S. Zhang
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David SzakallasDatabricks
 
Hyperparameter optimization landscape Berlin ML Group meetup 8/2019
Hyperparameter optimization landscape Berlin ML Group meetup 8/2019Hyperparameter optimization landscape Berlin ML Group meetup 8/2019
Hyperparameter optimization landscape Berlin ML Group meetup 8/2019Jakub Czakon
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to MahoutTed Dunning
 
From Python to PySpark and Back Again – Unifying Single-host and Distributed ...
From Python to PySpark and Back Again – Unifying Single-host and Distributed ...From Python to PySpark and Back Again – Unifying Single-host and Distributed ...
From Python to PySpark and Back Again – Unifying Single-host and Distributed ...Databricks
 

What's hot (20)

Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Deploying Machine Learning Models to Production
Deploying Machine Learning Models to ProductionDeploying Machine Learning Models to Production
Deploying Machine Learning Models to Production
 
Data Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science CompetitionsData Wrangling For Kaggle Data Science Competitions
Data Wrangling For Kaggle Data Science Competitions
 
XGBoost @ Fyber
XGBoost @ FyberXGBoost @ Fyber
XGBoost @ Fyber
 
Xgboost
XgboostXgboost
Xgboost
 
Approaching (almost) Any NLP Problem
Approaching (almost) Any NLP ProblemApproaching (almost) Any NLP Problem
Approaching (almost) Any NLP Problem
 
Introduction of Feature Hashing
Introduction of Feature HashingIntroduction of Feature Hashing
Introduction of Feature Hashing
 
Demystifying Xgboost
Demystifying XgboostDemystifying Xgboost
Demystifying Xgboost
 
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van HovellAn Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
An Introduction to Higher Order Functions in Spark SQL with Herman van Hovell
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 
Spark schema for free with David Szakallas
Spark schema for free with David SzakallasSpark schema for free with David Szakallas
Spark schema for free with David Szakallas
 
Introduction to XGBoost
Introduction to XGBoostIntroduction to XGBoost
Introduction to XGBoost
 
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTKStatistical Machine Learning for Text Classification with scikit-learn and NLTK
Statistical Machine Learning for Text Classification with scikit-learn and NLTK
 
Ensembling & Boosting 概念介紹
Ensembling & Boosting  概念介紹Ensembling & Boosting  概念介紹
Ensembling & Boosting 概念介紹
 
Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...
Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...
Leakage in Meta Modeling And Its Connection to HCC Target-Encoding - Mathias ...
 
Data mining with caret package
Data mining with caret packageData mining with caret package
Data mining with caret package
 
Spark Schema For Free with David Szakallas
 Spark Schema For Free with David Szakallas Spark Schema For Free with David Szakallas
Spark Schema For Free with David Szakallas
 
Hyperparameter optimization landscape Berlin ML Group meetup 8/2019
Hyperparameter optimization landscape Berlin ML Group meetup 8/2019Hyperparameter optimization landscape Berlin ML Group meetup 8/2019
Hyperparameter optimization landscape Berlin ML Group meetup 8/2019
 
Introduction to Mahout
Introduction to MahoutIntroduction to Mahout
Introduction to Mahout
 
From Python to PySpark and Back Again – Unifying Single-host and Distributed ...
From Python to PySpark and Back Again – Unifying Single-host and Distributed ...From Python to PySpark and Back Again – Unifying Single-host and Distributed ...
From Python to PySpark and Back Again – Unifying Single-host and Distributed ...
 

Similar to Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...Databricks
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionChetan Khatri
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Fosdem2017 Scientific computing on Jruby
Fosdem2017  Scientific computing on JrubyFosdem2017  Scientific computing on Jruby
Fosdem2017 Scientific computing on JrubyPrasun Anand
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupJim Dowling
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Karthik Murugesan
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkFaisal Siddiqi
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowDatabricks
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Databricks
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streamingAdam Doyle
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...Amazon Web Services
 
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...GeeksLab Odessa
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Spark Summit
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowChetan Khatri
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkDatabricks
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchDatabricks
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine LearningRebecca Bilbro
 
Streaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFXStreaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFXDatabricks
 

Similar to Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala (20)

A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B...
 
No more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in productionNo more struggles with Apache Spark workloads in production
No more struggles with Apache Spark workloads in production
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Fosdem2017 Scientific computing on Jruby
Fosdem2017  Scientific computing on JrubyFosdem2017  Scientific computing on Jruby
Fosdem2017 Scientific computing on Jruby
 
Ml ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science MeetupMl ops and the feature store with hopsworks, DC Data Science Meetup
Ml ops and the feature store with hopsworks, DC Data Science Meetup
 
Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018Netflix Machine Learning Infra for Recommendations - 2018
Netflix Machine Learning Infra for Recommendations - 2018
 
ML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talkML Infra for Netflix Recommendations - AI NEXTCon talk
ML Infra for Netflix Recommendations - AI NEXTCon talk
 
Building an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflowBuilding an ML Platform with Ray and MLflow
Building an ML Platform with Ray and MLflow
 
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
Building Deep Reinforcement Learning Applications on Apache Spark with Analyt...
 
ProgrammingPrimerAndOOPS
ProgrammingPrimerAndOOPSProgrammingPrimerAndOOPS
ProgrammingPrimerAndOOPS
 
Spark ml streaming
Spark ml streamingSpark ml streaming
Spark ml streaming
 
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
AWS November Webinar Series - Advanced Analytics with Amazon Redshift and the...
 
C3 w1
C3 w1C3 w1
C3 w1
 
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
AI&BigData Lab 2016. Руденко Петр: Особенности обучения, настройки и использо...
 
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
Learnings Using Spark Streaming and DataFrames for Walmart Search: Spark Summ...
 
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-AirflowPyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
PyconZA19-Distributed-workloads-challenges-with-PySpark-and-Airflow
 
Best Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache SparkBest Practices for Building and Deploying Data Pipelines in Apache Spark
Best Practices for Building and Deploying Data Pipelines in Apache Spark
 
Reproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorchReproducible AI using MLflow and PyTorch
Reproducible AI using MLflow and PyTorch
 
(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning(Py)testing the Limits of Machine Learning
(Py)testing the Limits of Machine Learning
 
Streaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFXStreaming Inference with Apache Beam and TFX
Streaming Inference with Apache Beam and TFX
 

More from Chetan Khatri

Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...
Data Science for Beginner by Chetan Khatri and Deptt. of Computer Science, Ka...Chetan Khatri
 
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Demystify Information Security & Threats for Data-Driven Platforms With Cheta...
Demystify Information Security & Threats for Data-Driven Platforms With Cheta...Chetan Khatri
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
Automate ml workflow_transmogrif_ai-_chetan_khatri_berlin-scala

  • 1. TransmogrifAI: Automate Machine Learning Workflow with the power of Scala and Spark at massive scale. By: Chetan Khatri (@khatri_chetan), Lead - Technology, Accion Labs
  • 2. About me Lead - Data Science @ Accion labs India Pvt. Ltd. Open Source Contributor @ Apache Spark, Apache HBase, Elixir Lang, Spark HBase Connectors. Data Engineering @: Nazara Games, Eccella Corporation. Advisor - Data Science Lab, University of Kachchh, India. M.Sc. - Computer Science from University of Kachchh, India.
  • 3. About Accion Labs We Are A Product Engineering Company Helping Transform Businesses Through Emerging Technologies. What we do? Data Engineering, Machine Learning, NLP, Microservices etc. Case studies: https://www.accionlabs.com/accion-work Contact: https://www.accionlabs.com/accion-bangalore-contact-1
  • 4. Agenda ● What is TransmogrifAI ? ● Why you need TransmogrifAI ? ● Automation of Machine learning life Cycle - from development to deployment. ○ Feature Inference ○ Transformation ○ Automated Feature validation ○ Automated Model Selection ○ Hyperparameter Optimization ● Type Safety in Spark, TransmogrifAI. ● Example: Code - Titanic kaggle problem.
  • 5. What is TransmogrifAI ? ● TransmogrifAI was open sourced by Salesforce.com, Inc. in June 2018. ● An end-to-end automated machine learning workflow library for structured data, built on top of Scala and SparkML.
  • 6. What is TransmogrifAI ? ● TransmogrifAI automates much of the machine learning model life cycle: Feature Selection, Transformation, Automated Feature Validation, Automated Model Selection, and Hyperparameter Optimization. ● It enforces compile-time type safety, modularity, and reuse. ● Through automation, it achieves accuracies close to hand-tuned models with an almost 100x reduction in time.
  • 7. Why you need TransmogrifAI ? Features: ● AUTOMATION: Numerous Transformers and Estimators. ● MODULARITY AND REUSE: Enforces a strict separation between ML workflow definitions and data manipulation. ● COMPILE-TIME TYPE SAFETY: Workflows are strongly typed, with code completion during development and fewer runtime errors. ● TRANSPARENCY: Model insights leverage stored feature metadata and lineage to help debug models.
  • 8. Why you need TransmogrifAI ? Use TransmogrifAI if you need a machine learning library to: ● Build production-ready machine learning applications in hours, not months ● Build machine learning models without getting a Ph.D. in machine learning ● Build modular, reusable, strongly typed machine learning workflows. Read more in the documentation: https://transmogrif.ai/
  • 9. Why Machine Learning is hard ?! Really! ... For example, this may be using a linear classifier when your true decision boundaries are non-linear. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  • 10. Why Machine Learning is hard ?! Really! ... fast and effective debugging is the skill that is most required for implementing modern day machine learning pipelines. Ref. http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html
  • 11. Real time Machine Learning takes time to Productionize TransmogrifAI Automates entire ML Life Cycle to accelerate developer’s productivity.
  • 12. Under the Hood Automated Feature Engineering Automated Feature Selection Automated Model Selection
  • 13. Automated Feature Engineering Automatic derivation of new features based on existing features. Raw features: Email, Phone, Age, Subject, Zip Code, DOB, Gender. Derived features: Email is Spam; Country Code; age buckets [0-20], [21-30], [> 30]; Stop words, Top terms (TF-IDF), Detect Language; Average Income, House Price, School Quality, Shopping, Transportation; To Binary; Age, Day of Week, Week of Year, Quarter, Month, Year, Hour. All derived features are combined into a single Feature Vector.
  • 14. Automated Feature Engineering ● Analyze every feature column and compute descriptive statistics. ○ Number of nulls, number of empty strings, mean, max, min, standard deviation. ● Handle missing values / noisy values. ○ Ex. fillna by mean (average) or nearby values.
        patient_details = patient_details.fillna(-1)
        data['City_Type'] = data['City_Type'].fillna('Z')
        imp = Imputer(missing_values='NaN', strategy='median', axis=0, copy=False)
        data_total_imputed = imp.fit_transform(data_total)
        # mark zero values as missing or NaN
        dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, numpy.NaN)
        # fill missing values with mean column values
        dataset.fillna(dataset.mean(), inplace=True)
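The per-column statistics and mean imputation described on this slide can be sketched without any library. This is a minimal illustration, not TransmogrifAI's implementation: the helper names `describe` and `impute_mean` and the sample `ages` column are hypothetical, and `None` is assumed to mark a missing value.

```python
from statistics import mean, stdev


def describe(values):
    """Summarise one numeric column; None marks a missing value (assumption)."""
    present = [v for v in values if v is not None]
    return {
        "nulls": len(values) - len(present),
        "mean": mean(present),
        "min": min(present),
        "max": max(present),
        "std": stdev(present),  # sample standard deviation
    }


def impute_mean(values):
    """Replace every missing entry with the mean of the observed entries."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


# Hypothetical column with two missing entries.
ages = [22.0, None, 38.0, 26.0, None, 34.0]
stats = describe(ages)
filled = impute_mean(ages)
```

Computing the statistics on the observed values only, and then filling with that mean, keeps the column mean unchanged after imputation.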
  • 15. Automated Feature Engineering ● Does each feature have an acceptable range and contain valid values? ● Could the feature be a leaker? ○ Is it usually filled out after the predicted field is? ○ Is it highly correlated with the predicted field? ● Does the feature contain outliers?
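One of the leakage checks above, "is it highly correlated with the predicted field?", can be sketched as a plain Pearson correlation test. This is an illustrative sketch only: the function names, the sample columns, and the 0.95 threshold are assumptions, not values taken from TransmogrifAI.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(vx * vy)


def flag_leakers(features, label, threshold=0.95):
    """Names of features whose |correlation| with the label meets the threshold."""
    return sorted(
        name
        for name, column in features.items()
        if abs(pearson(column, label)) >= threshold
    )


# Hypothetical data: "refund_issued" is filled in only after the outcome is known.
label = [0, 1, 0, 1, 1, 0]
features = {
    "age": [22, 38, 26, 35, 54, 2],
    "refund_issued": [0, 1, 0, 1, 1, 0],
    "constant": [1, 1, 1, 1, 1, 1],
}
suspects = flag_leakers(features, label)
```

A near-perfect correlation is only a hint; a flagged feature still needs a human check of when it is populated relative to the label.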
  • 16. Automated Feature Selection / Data Pre-processing ● Data type of features, automatic data pre-processing. ○ MinMaxScaler ○ Normalizer ○ Binarizer ○ Label Encoding ○ One Hot Encoding ● Auto data pre-processing based on the chosen ML model. ● An algorithm like XGBoost specifically requires dummy-encoded data, while an algorithm like a decision tree doesn’t seem to care at all (sometimes)!
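Three of the pre-processing steps listed above (min-max scaling, binarizing, label encoding) are simple enough to sketch in plain Python. The function names and sample values below are hypothetical; a real pipeline would use the library transformers named on the slide.

```python
def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Map values strictly above the threshold to 1, everything else to 0."""
    return [1 if v > threshold else 0 for v in values]


def label_encode(values):
    """Assign each distinct category an integer code in sorted order."""
    mapping = {cat: code for code, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


# Hypothetical columns.
fares = [10.0, 25.0, 40.0]
scaled = min_max_scale(fares)
flags = binarize(fares, 20.0)
codes, mapping = label_encode(["male", "female", "male"])
```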
  • 17. Auto Data Pre-processing ● Numeric - Imputation, Track Null Value, Log Transformation for large range, Scaling, Smart Binning. ● Categorical - Imputation, Track Null Value, One Hot Encoding / Dummy Encoding, Dynamic Top K Pivot, Smart Binning, Label Encoding, Category Embedding. ● Text - Tokenization, Hash Encoding, TF-IDF, Word2Vec, Sentiment Analysis, Language Detection. ● Temporal - Time Difference, Circular Statistics, Time Extraction (Day, Week, Month, Year).
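Two of the type-specific treatments above can be illustrated briefly: a top-K pivot with null tracking for a categorical column, and a circular (sine/cosine) encoding for a temporal value. This is a sketch under assumptions: the column names `OTHER` and `NULL`, the helper names, and the sample data are invented for illustration and do not come from TransmogrifAI.

```python
import math
from collections import Counter


def top_k_one_hot(values, k):
    """One-hot encode the k most frequent categories.

    Rarer categories share an OTHER column and missing values (None)
    get their own NULL column, so nulls stay visible to the model.
    """
    counts = Counter(v for v in values if v is not None)
    top = [cat for cat, _ in counts.most_common(k)]
    columns = top + ["OTHER", "NULL"]
    rows = []
    for v in values:
        key = "NULL" if v is None else (v if v in top else "OTHER")
        rows.append([1 if col == key else 0 for col in columns])
    return columns, rows


def encode_hour(hour):
    """Place an hour of the day on the unit circle so 23:00 sits next to 00:00."""
    angle = 2 * math.pi * hour / 24
    return math.sin(angle), math.cos(angle)


# Hypothetical categorical column with one missing value.
embarked = ["S", "C", "S", "Q", None, "S", "C"]
columns, rows = top_k_one_hot(embarked, k=2)
midnight = encode_hour(0)
```

The circular encoding matters because a raw hour value treats 23 and 0 as far apart, while on the circle they are neighbours.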
  • 18. Auto Selection of Best Model with Hyper Parameter Tuning ● Machine Learning Model ○ Learning Rate ○ Epochs ○ Batch Size ○ Optimizer ○ Activation Function ○ Loss Function ● Search algorithms to find the best model and optimal hyper parameters. ○ Ex. Grid Search, Random Search, Bandit Methods
  • 19. Examples - Hyper parameter tuning XGBoost:
        params = {'booster':'gbtree', 'objective':'binary:logistic', 'max_depth':9, 'eval_metric':'logloss',
                  'eta':0.02, 'silent':1, 'nthread':4, 'subsample':0.9, 'colsample_bytree':0.9,
                  'scale_pos_weight':1, 'min_child_weight':3, 'max_delta_step':3}
        num_rounds = 400
        params['seed'] = 523264626346 # 0.85533
        dtrain = xgb.DMatrix(train, labels, missing=np.nan)
        clf = xgb.train(params, dtrain, num_rounds)
        dtest = xgb.DMatrix(test, missing=np.nan)
        test_preds = clf.predict(dtest)
  • 20. Examples - Hyper parameter tuning
        rf = RandomForestClassifier(n_estimators=int(n_tree), max_features=int(mtry), criterion=criterion,
                                    max_depth=max_depth, n_jobs=-1, oob_score=True)
        rf.fit(X_train, y_train)
        gbm = GradientBoostingClassifier(n_estimators=n_tree, max_features=max_features, subsample=subsample, n_jobs=-1)
        cv = StratifiedKFold(y_train, 10)
        scores = cross_val_score(rf, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')
        ext = ExtraTreesClassifier(n_estimators=n_tree, criterion=criterion, max_features=max_features, n_jobs=-1)
        cv = StratifiedKFold(y_train, 10)
        scores = cross_val_score(rf, X_train, y_train, cv=cv, n_jobs=4, scoring='f1')
        param_grid = { 'n_estimators': [500,1000,1500,2000,2500,3000], 'criterion': ['gini', 'entropy'],
                       'max_features': [15,20,25,30], 'max_depth': [4,5,6] }
        gs_cv = GridSearchCV(rf, param_grid, scoring='f1', n_jobs=-1, verbose=2).fit(X_train[subspace], y_train)
        gs_cv.best_params_
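The difference between the Grid Search and Random Search strategies named in the slides can be shown on a toy problem. In this sketch `validation_score` is a hypothetical stand-in for a cross-validated metric (it is not a real model), and the parameter names and grid values are assumptions chosen for illustration.

```python
import itertools
import random


def validation_score(params):
    """Hypothetical stand-in for a cross-validated metric.

    Peaks at eta=0.1 and max_depth=6; a real workflow would train and
    evaluate a model here instead.
    """
    return 1.0 - abs(params["eta"] - 0.1) - 0.01 * abs(params["max_depth"] - 6)


def grid_search(grid, score):
    """Evaluate every combination in the grid and return the best one."""
    names = list(grid)
    candidates = (
        dict(zip(names, combo)) for combo in itertools.product(*grid.values())
    )
    return max(candidates, key=score)


def random_search(space, score, n_iter, seed=0):
    """Evaluate n_iter randomly sampled combinations and return the best one."""
    rng = random.Random(seed)
    candidates = [
        {name: rng.choice(options) for name, options in space.items()}
        for _ in range(n_iter)
    ]
    return max(candidates, key=score)


grid = {"eta": [0.02, 0.1, 0.3], "max_depth": [4, 6, 9]}
best_grid = grid_search(grid, validation_score)                # 9 evaluations
best_random = random_search(grid, validation_score, n_iter=5)  # 5 evaluations
```

Grid search is exhaustive, so its cost grows multiplicatively with each added parameter; random search trades the guarantee of finding the grid optimum for a fixed evaluation budget.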
  • 21. Ensemble Modeling
        ens['XGB2'] = xgb2_pred['Disbursed']
        ens['RF'] = rf_pred['Disbursed']
        ens['FTRL'] = ftrl_pred['Disbursed']
        ens['XGB1_Rank'] = rankdata(ens['XGB1'], method='min')
        ens['XGB2_Rank'] = rankdata(ens['XGB2'], method='min')
        ens['XGB_Rank'] = 0.5 * ens['XGB1_Rank'] + 0.5 * ens['XGB2_Rank']
        ens['RF_Rank'] = rankdata(ens['RF'], method='min')
        ens['FTRL_Rank'] = rankdata(ens['FTRL'], method='min')
        ens['Final'] = (0.75*ens['XGB_Rank'] + 0.25*ens['RF_Rank']) * 0.75 + 0.25 * ens['FTRL']
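The rank-averaging idea behind this slide can be sketched without pandas or SciPy. Here `rank_min` is a hypothetical helper that mimics ranking with ties sharing the smallest rank, and the prediction values and weights are made up for illustration.

```python
def rank_min(scores):
    """Rank scores ascending from 1; tied scores share the smallest rank."""
    ordered = sorted(scores)
    return [ordered.index(s) + 1 for s in scores]


def blend_ranks(rank_lists, weights):
    """Weighted sum of several models' ranks, row by row."""
    return [
        sum(w * r for w, r in zip(weights, ranks))
        for ranks in zip(*rank_lists)
    ]


# Hypothetical predicted probabilities from two models for four rows.
xgb_pred = [0.10, 0.80, 0.35, 0.80]
rf_pred = [0.20, 0.60, 0.70, 0.90]

xgb_rank = rank_min(xgb_pred)
rf_rank = rank_min(rf_pred)
final = blend_ranks([xgb_rank, rf_rank], [0.75, 0.25])
```

Blending ranks rather than raw probabilities makes models with differently calibrated outputs comparable, which suits ranking metrics such as AUC.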
  • 22. Type Safety: Integration with Apache Spark and Scala ● Modular, reusable, strongly typed machine learning workflows on top of Apache Spark. ● Type safety in Apache Spark with the Dataset API.
  • 23. Structured Data in Apache Spark Structured in Spark DataFrames Datasets
  • 24. Unification of APIs in Apache Spark 2.0 The untyped API (DataFrame) and the typed API (Dataset[T]) were unified under Dataset (2016): DataFrame = Dataset[Row], a type alias.
  • 25. Why Dataset ? ● Strong typing. ● Ability to use powerful lambda functions. ● Spark SQL’s optimized execution engine (Catalyst, Tungsten). ● Can be constructed from JVM objects and manipulated using functional transformations (map, filter, flatMap etc). ● A DataFrame is a Dataset organized into named columns. ● DataFrame is simply a type alias of Dataset[Row].
  • 26. DataFrame API Code
        // convert RDD -> DF with column names
        val parsedDF = parsedRDD.toDF("project", "sprint", "numStories")
        // filter, groupBy, sum, and then agg()
        parsedDF.filter($"project" === "finance").
          groupBy($"sprint").
          agg(sum($"numStories").as("count")).
          limit(100).
          show(100)
        project  sprint  numStories
        finance  3       20
        finance  4       22
  • 27. DataFrame -> SQL View -> SQL Query
        parsedDF.createOrReplaceTempView("audits")
        val results = spark.sql(
          """SELECT sprint, sum(numStories) AS count
             FROM audits
             WHERE project = 'finance'
             GROUP BY sprint
             LIMIT 100""")
        results.show(100)
        project  sprint  numStories
        finance  3       20
        finance  4       22
  • 28. Why Structured APIs ?
        // DataFrame
        data.groupBy("dept").avg("age")
        // SQL
        select dept, avg(age) from data group by 1
        // RDD
        data.map { case (dept, age) => dept -> (age, 1) }
          .reduceByKey { case ((a1, c1), (a2, c2)) => (a1 + a2, c1 + c2) }
          .map { case (dept, (age, c)) => dept -> age / c }
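To see why the RDD version on this slide is harder to read than the one-line DataFrame or SQL form, it helps to trace the same three steps by hand. The following is a plain-Python analogue of the map / reduceByKey / map chain, not Spark code; the sample rows are hypothetical.

```python
# Hypothetical (dept, age) rows.
rows = [("eng", 30), ("eng", 40), ("ops", 50)]

# Step 1 (map): pair every age with a count of 1.
pairs = [(dept, (age, 1)) for dept, age in rows]

# Step 2 (reduceByKey): add up ages and counts per department.
totals = {}
for dept, (age, count) in pairs:
    age_sum, n = totals.get(dept, (0, 0))
    totals[dept] = (age_sum + age, n + count)

# Step 3 (map): divide the summed age by the count.
avg_age = {dept: age_sum / n for dept, (age_sum, n) in totals.items()}
```

The structured forms state only the intent (group by department, average the age), which leaves the engine free to choose how the aggregation is carried out.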
  • 29. Catalyst in Spark SQL AST / DataFrame / Datasets -> Unresolved Logical Plan -> Logical Plan -> Optimized Logical Plan -> Physical Plans -> Cost Model -> Selected Physical Plan -> RDD
  • 30. Dataset API in Spark 2.x
        val employeesDF = spark.read.json("employees.json")
        // Convert data to domain objects.
        case class Employee(name: String, age: Int)
        val employeesDS: Dataset[Employee] = employeesDF.as[Employee]
        val filterDS = employeesDS.filter(p => p.age > 3)
    Type-safe: operate on domain objects with compiled lambda functions.
  • 31. Structured APIs in Apache Spark
                          SQL      DataFrames    Datasets
        Syntax Errors     Runtime  Compile Time  Compile Time
        Analysis Errors   Runtime  Runtime       Compile Time
    With Datasets, analysis errors are caught before a job runs on the cluster.
  • 32. Spark SQL API - Analysis Error example.
  • 33. Spark SQL API - Analysis Error example.
  • 34. TransmogrifAI - Type Safety is Everywhere! ● Value operations ● Feature operations ● Transformation Pipelines (aka Workflows)
        // Typed value operations
        def tokenize(t: Text): TextList = t.map(_.split(" ")).toTextList
        // Typed feature operations
        val title = FeatureBuilder.Text[Book].extract(_.title).asPredictor
        val tokens: Feature[TextList] = title.map(tokenize)
        // Transformation pipelines
        new OpWorkflow().setInput(books).setResultFeatures(tokens.vectorize())
  • 36. A Case Story - Functional Flow - Spark as a SaaS User Interface: build workflow (Source, Target, Transformations: filter, aggregation, Joins, Expressions, Machine Learning Algorithms) -> Store metadata of the workflow in a document-based NoSQL store, ex. MongoDB (ReactiveMongo) -> Scala / Spark job reads metadata from NoSQL, ex. MongoDB -> Run on the cluster -> Schedule using the Airflow SparkSubmit Operator
  • 37. A Case Story - High Level Technical Architecture - Spark as a SaaS User Interface, Middleware: Akka HTTP Web Services
  • 45. References [1] TransmogrifAI - AutoML library for building modular, reusable, strongly typed machine learning workflows on Spark from Salesforce Engineering [online] https://transmogrif.ai [2] A plugin to Apache Airflow to allow you to run Spark Submit Commands as an Operator [online] https://github.com/rssanders3/airflow-spark-operator-plugin [3] Apache Spark - Unified Analytics Engine for Big Data [online] https://spark.apache.org/ [4] Apache Livy [online] https://livy.incubator.apache.org/ [5] Zayd's Blog - Why is machine learning 'hard'? [online] http://ai.stanford.edu/~zayd/why-is-machine-learning-hard.html [6] Meet TransmogrifAI, Open Source AutoML That Powers Einstein Predictions [online] https://www.youtube.com/watch?v=uMapcWtzwyA&t=106s [7] Auto-Machine Learning: The Magic Behind Einstein [online] https://www.youtube.com/watch?v=YDw1GieW4cw&t=564s