ModelDB: A System to Manage Machine Learning Models: Spark Summit East talk by Manasi Vartak

ModelDB: A system to manage machine learning models
Manasi Vartak
PhD Student, MIT DB Group

People
Manasi Vartak (PhD student, MIT)
Srinidhi Viswanathan (MEng, MIT)
Samuel Madden (Faculty, MIT)
Matei Zaharia (Faculty, Stanford)
Harihar Subramanyam (MEng, MIT)
Wei-En Lee (MEng student, MIT)

Building a credit recommendation algorithm

Person          Profession         Credit History                    Risk of Default
Barack Obama    Politician         Reasonable                        0.3
Lindsay Lohan   Struggling artist  Poor                              0.7
Warren Buffett  Investor           Has more money than our company   0.0
…               …                  …                                 …

Accuracy: 58%
Model 2
RandomForestClassifier
val udf1: (Int => Int) = (delayed..)
df.withColumn("timesDelayed", udf1)

Accuracy: 63%
RandomForestClassifier
.withColumn("percentPaid", udf2)
val lrGrid = new ParamGridBuilder()
  .addGrid(rf.maxDepth, Array(5, 10, 15))
  .addGrid(rf.numTrees, Array(50, 100))
Model 5
credit-default-clean.csv
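
A grid like the one on this slide multiplies quickly, which is part of why the model count climbs from Model 2 to Model 50. The sketch below is plain Scala with no Spark dependency; `GridSketch`, `expand`, and `slideGrid` are hypothetical names, and the code only mimics what a parameter grid enumerates rather than reproducing Spark's own implementation.

```scala
// Plain-Scala sketch (no Spark): expand a parameter grid into every
// combination, the way a grid search enumerates candidate models.
object GridSketch {
  type ParamMap = Map[String, Int]

  // Cartesian product of the value lists, one ParamMap per combination.
  def expand(grid: Seq[(String, Seq[Int])]): Seq[ParamMap] =
    grid.foldLeft(Seq(Map.empty[String, Int])) { case (acc, (name, values)) =>
      for (partial <- acc; v <- values) yield partial + (name -> v)
    }

  // Values taken from the slide: three depths and two tree counts.
  val slideGrid: Seq[(String, Seq[Int])] = Seq(
    "maxDepth" -> Seq(5, 10, 15),
    "numTrees" -> Seq(50, 100)
  )

  def main(args: Array[String]): Unit = {
    val combos = expand(slideGrid)
    println(s"${combos.size} candidate models")
    combos.foreach(println)
  }
}
```

The slide's grid yields 3 × 2 = 6 combinations for a single pipeline; every extra feature or preprocessing variant multiplies that again.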

.withColumn("percentPaid", udf2)
.withColumn("creditUsed", udf3)
…
val lrGrid = new ParamGridBuilder()
  .addGrid(lr.elasticNetParam, Array(0.01, 0.1, 0.5, 0.7))
val scaler = new StandardScaler()
  .setInputCol("features")
…
val labelIndexer1 = new LabelIndexer()
val labelIndexer2 = new LabelIndexer()
…
Model 50
val udf1: (Int => Int) = (delayed..)
val udf2: (String, Int) = …
credit-default-clean.csv

Why is this a problem?
• No record of model history: "Did my colleague do that already?"
• Insights lost along the way: "How did normalization affect my ROC?"
• Difficult to reproduce results: "What params did I use?"
• Cannot search for or query models: "Where's the LR model I tried last week with featureX?"
• Difficult to collaborate: "How does someone review your model?"

Requirements for a model management tool
• Experiment tracking
• Versioning
• Reproducibility
• Comparisons, queries, search
• Collaboration
*With minimal effort

ModelDB: a system to manage machine learning models
https://github.com/mitdbg/modeldb
http://modeldb.csail.mit.edu

ModelDB: model management system
• Ingest models, metadata
• Model, Pipeline Storage
• Versioning
• Query
• Collaboration, Reproducibility

User quotes
"I should have had this in my self-driving cars class; it would have made things so much easier"
"…it can really help with reproducibility … and collaboration in multi-person teams…"
"I used it to track models for a research project; it was so simple"

ModelDB Architecture & Design Decisions
1. Support for diverse languages and environments
2. Minimal changes to existing workflows
3. Rich visual interface
4. Support for complex queries

"Oh, but why not git?"
• All code treated equal
• Some elements are special: data sources, parameters, metrics, models
• Difficult to tease that out
• No semantics, so can’t run interesting queries
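
To make the contrast with git concrete, here is a minimal sketch of what recording those special elements as structured data could look like. `ModelRun`, its fields, and the example runs are hypothetical illustrations, not ModelDB's actual schema or API, and the values are made up.

```scala
// Hypothetical record of one modeling run. Field names are illustrative
// assumptions, not ModelDB's actual schema.
object RunLogSketch {
  final case class ModelRun(
      id: String,
      modelType: String,
      dataSource: String,
      features: Set[String],
      params: Map[String, Double],
      metrics: Map[String, Double]
  )

  // Made-up example runs, for illustration only.
  val runs: Seq[ModelRun] = Seq(
    ModelRun("run-1", "RandomForestClassifier", "credit-default-clean.csv",
      Set("timesDelayed"), Map("maxDepth" -> 5.0), Map("accuracy" -> 0.58)),
    ModelRun("run-2", "LogisticRegression", "credit-default-clean.csv",
      Set("percentPaid", "featureX"), Map("elasticNetParam" -> 0.1),
      Map("accuracy" -> 0.63)),
    ModelRun("run-3", "LogisticRegression", "credit-default-clean.csv",
      Set("percentPaid"), Map("elasticNetParam" -> 0.5), Map("accuracy" -> 0.61))
  )

  // Because model type and features are fields rather than text buried in
  // a commit, "the LR model with featureX" becomes a simple filter.
  def find(runs: Seq[ModelRun], modelType: String, feature: String): Seq[ModelRun] =
    runs.filter(r => r.modelType == modelType && r.features.contains(feature))

  def main(args: Array[String]): Unit =
    find(runs, "LogisticRegression", "featureX").foreach(println)
}
```

A timestamp field would cover the "last week" part of the earlier question; it is left out here to keep the sketch short.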

ModelDB Features (currently available)
• Log models, params, pipelines etc. via ModelDB API
• Versioning: every modeling run = version
• Reproducibility: all pipeline details, params logged
• Model search, query, comparison via frontend
• Collaboration: central repository of models; review models, annotate

ModelDB Features (ongoing)
• Unified Querying of Modeling Artifacts
Base data, intermediates, models, predictions, metadata
"How did the GBDTs do on married customers who are interested in gardening?"
Base Data: is_married=T
Intermediates: gardening=T
Metadata: type=GBDT
Models: ids={..}
Predictions: accuracy(…)
What query language?
How to persist data?
ModelDB Features (ongoing)
• Mining data in ModelDB

Model  Features   Params   Metric
M13    X3,X9...   l1=0.3   0.63
M22    X1,X4,X7   l2=0.7   0.8
M34    X11,X13    l1=0.7   0.55
…      …          …        …

Given model history, what should we try next?
Bayesian Modeling/AutoML
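
As a starting point for mining, the history table can be treated as ordinary data. The sketch below loads the three rows shown on the slide and ranks them by metric; `HistoryRow` and `ranked` are hypothetical names, and sorting is a deliberately naive stand-in for the Bayesian modeling or AutoML the slide has in mind.

```scala
// Naive sketch: treat the model history as data and rank it.
// This is not Bayesian optimization; it only shows the input such a
// method would consume. Names are hypothetical.
object HistorySketch {
  final case class HistoryRow(model: String, features: String, param: String, metric: Double)

  // Rows copied from the slide; the elided feature list stays elided.
  val history: Seq[HistoryRow] = Seq(
    HistoryRow("M13", "X3,X9...", "l1=0.3", 0.63),
    HistoryRow("M22", "X1,X4,X7", "l2=0.7", 0.8),
    HistoryRow("M34", "X11,X13", "l1=0.7", 0.55)
  )

  // Best-first ordering by metric.
  def ranked(rows: Seq[HistoryRow]): Seq[HistoryRow] =
    rows.sortBy(r => -r.metric)

  def main(args: Array[String]): Unit =
    ranked(history).foreach(r => println(s"${r.model} ${r.param} ${r.metric}"))
}
```

A real recommender would model how the metric depends on features and parameters instead of only sorting past runs.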

ModelDB Features (ongoing)
• Full model lifecycle management
Model performance degrades
Retrain model over time

ModelDB available now!
https://github.com/mitdbg/modeldb
*MIT License

• Download, try it out!
• Tell us what you think; what can we do better?
• Contribute! (see Issues on repo for some ideas)

ModelDB: a system to manage machine learning models
mvartak@csail.mit.edu | @DataCereal
http://modeldb.csail.mit.edu
*Icons from FlatIcon
