© 2018 IBM Corporation
Applying Software Engineering Practices for the Data Science & ML Lifecycle
DataWorks Summit, San Jose 2018
Sriram Srinivasan
Architect - IBM Data Science & Machine Learning, Cloud Private for Data
IBM Data Science Experience
Overview
 Enterprises, as usual, want quick return on investments in Data Science
 But with a shrinking dev -> prod cycle: application of new techniques and course corrections on a
continuous basis is the norm
 Data Science & Machine Learning are increasingly cross-team endeavors
 Data Engineers, Business Analysts, DBAs and Data Stewards are frequently involved in the lifecycle
 The “Cloud” has been a great influence
 Economies of scale with regards to infrastructure cost, quick re-assignments of resources are expected
 Automation, APIs, Repeatability, Reliability and Elasticity are essential for “operations”.
 Data Science & ML need to exhibit the same maturity as other enterprise class apps :
 Compliance & Regulations are still a critical mandate for large Enterprises to adhere to
 Security, audit-ability and governance need to be in-place from the start and not an after-thought.
Data Scientist Concerns
 Where is the data I need to drive business insights?
 I don’t want to do all the plumbing – connect to databases, Hadoop etc.
 How do I collaborate and share my work with others?
 What visualization techniques exist to tell my story?
 How do I bring my familiar R/Python libraries to this new Data Science platform?
 How do I use the latest libraries/Technique or newer versions ?
 How do I procure compute resources for my experimentation ?
 With specialized compute such as GPUs
 How are my Machine Learning Models performing & how can I improve them?
 I have this Machine Learning Model, how do I help deploy it in production?
Data Science Experience
Access to libraries & tools.. an ever growing list..
 Multiple programming languages – Python, R, Scala..
 Modern Data Scientists are programmers/ software developers too !
 Build your favorite libraries or experiment with new ones
 Modularization via packages & dependency management are problems just as with any Software development
 Publish apps and expose APIs.. – share & collaborate
 Work with a variety of data sources and technologies.. easily..
Machine Learning Environments
Deep Learning Environments
SPSS Modeler
….
Challenges for the Enterprise
 Ensure secure data access & auditability - for governance and compliance
 Control and Curate access to data and for all open source libraries used
 Explainability and reproducibility of machine learning activities
 Improve trust in analytics and predictions
 Efficient Collaboration and versioning of all source, sample data and models
 Easy teaming with accountability
 Establish Continuous integration practices just as with any Enterprise software
 Agility in delivery and problem resolutions in production
 Publish/Share and identify provenance/ lineage with confidence
 Visibility and Access control
 Effective Resource utilization and ability to scale-out on demand
 Guarantee SLAs for production work, balance resources amongst different data scientists, machine learning
practitioners' workloads
 Goal: Operationalize Data Science !
5 tenets for operationalizing data science
• Managed - Analytics-Ready Data
• Trusted - Quality, Provenance and Explainability
• Resilient - At Scale & Always On
• Measurable - Monitor + Measure
• Evolution - Deliver & Improve
Managed: Analytics-Ready Data
Where’s my data?
• access to data, with techniques to track & deal with sensitive content
• data virtualization
• automatable pipelines for data preparation & transformations
Need: an Enterprise Catalog & Data Integration capabilities
Trusted: Quality, Provenance and Explainability
How can I convince you to use this model? How was the model built?
• provenance of data used to train & test
• lineage of the model - transforms, features & labels
• model explainability - algorithm, size of data, compute resources used to train & test, evaluation thresholds, repeatability
Need: an Enterprise Catalog for Analytics & Model assets
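The provenance and lineage items above could be captured as one catalog record per model version. The following is only a minimal sketch: the `ModelLineage` class, its field names and the `fingerprint` helper are hypothetical and are not taken from any IBM catalog API.

```python
from dataclasses import dataclass, field, asdict
import hashlib
import json


@dataclass
class ModelLineage:
    """Hypothetical catalog entry describing how a model was built."""
    model_name: str
    version: str
    algorithm: str
    training_data: str   # reference to the data source, not the data itself
    test_data: str
    features: list
    label: str
    transforms: list = field(default_factory=list)
    evaluation_threshold: float = 0.0

    def fingerprint(self) -> str:
        """Stable hash of the record, so two builds can be compared."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


record = ModelLineage(
    model_name="churn-predictor",
    version="v2",
    algorithm="logistic regression",
    training_data="db2://warehouse/customers_2017",  # illustrative reference only
    test_data="db2://warehouse/customers_2018q1",    # illustrative reference only
    features=["tenure", "monthly_usage"],
    label="churned",
    transforms=["drop_nulls", "scale_numeric"],
    evaluation_threshold=0.8,
)
```

Two records with identical fields yield the same fingerprint, which gives a cheap check that a model was rebuilt from the same inputs.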
Resilient: At Scale & Always On
Dependable for your business: ML infused in real-time, critical business processes
• reliable & performant for (re-)training
• highly available, low-latency model serving in real time, even with sophisticated data prep
• outage-free model/version upgrades in production
Must have: a platform for elasticity, reliability & load-balancing
Measurable: Monitor + Measure
Is the model still good enough?
• latency metrics for real-time scoring
• frequent accuracy evaluations with thresholds
• health monitoring for model decay
Desired: continuous model evaluations & instrumentation
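A threshold-based evaluation of the kind listed above might look like the sketch below. The function names, the three-evaluation window and the use of plain accuracy as the metric are assumptions made for illustration; the slides do not prescribe a particular metric or decay rule.

```python
from statistics import mean


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and equal in length")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def needs_retraining(accuracy_history, threshold, window=3):
    """Flag decay when the mean of the last `window` evaluations drops below `threshold`."""
    recent = accuracy_history[-window:]
    return bool(recent) and mean(recent) < threshold
```

Averaging over a short window rather than reacting to a single evaluation is one way to avoid re-training on a noisy batch; the window size would need tuning per model.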
Evolution: Deliver & Improve
Growth & Maturity
• versioning: champion/challenger, experimentation and hyper-parameterization
• process efficiencies: automated re-training, auto deployments (with curation & approvals)
Must have: delta deployments & outage-free upgrades
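The champion/challenger idea can be reduced to a small decision rule. This is a sketch under stated assumptions: models are represented as plain dictionaries with an `accuracy` key, and the `min_gain` margin and `approved` flag are hypothetical stand-ins for the curation and approval step mentioned above.

```python
def pick_champion(champion, challenger, min_gain=0.01, approved=False):
    """Return the model that should serve traffic.

    The challenger replaces the champion only when it beats it by at least
    `min_gain` and a human (or policy) approval has been recorded.
    """
    gain = challenger["accuracy"] - champion["accuracy"]
    if approved and gain >= min_gain:
        return challenger
    return champion


current = {"name": "churn-predictor", "version": "v1", "accuracy": 0.80}
candidate = {"name": "churn-predictor", "version": "v2", "accuracy": 0.85}
```

Keeping the approval as an explicit argument means an automated re-training job can propose a challenger without being able to promote it on its own.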
A git and Docker/Kubernetes based approach
from lessons learnt during the implementation of :
IBM Data Science Experience Local & Desktop
https://www.ibm.com/products/data-science-experience
and
IBM Cloud Private for Data
http://ibm.biz/data4ai
Part 1: Establish a way to organize Models, scripts & all other assets
 A “Data Science Project”
o just a folder of assets grouped together
o Contains “models” (say .pickle/.joblib or R objects with metadata)
o scripts
o for data wrangling and transformations
o used for training & testing, evaluations, batch scoring
o interactive notebooks and apps (R Shiny etc.)
o sample data sets & references to remote data sources
o perhaps even your own home-grown libraries & modules..
o is a git repository
• why ? Easy to share, version & publish – across different teams or users with different Roles
• track history of commit changes, setup approval practices & version.
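As a rough illustration of such a project, the sketch below scaffolds a folder of assets and turns it into a git repository when git is installed. The folder names, the `project.json` manifest and the `create_project` helper are assumptions for this example, not the layout used by DSX Local.

```python
from pathlib import Path
import json
import shutil
import subprocess

# Hypothetical asset folders; the slides only list the kinds of assets.
LAYOUT = ("models", "scripts", "notebooks", "datasets", "libs")


def create_project(root, name):
    """Create a project folder with one sub-folder per asset type."""
    project = Path(root) / name
    for sub in LAYOUT:
        (project / sub).mkdir(parents=True, exist_ok=True)
        # git does not track empty directories, so add a placeholder file
        (project / sub / ".gitkeep").touch()
    manifest = {"name": name, "assets": list(LAYOUT)}
    (project / "project.json").write_text(json.dumps(manifest, indent=2))
    if shutil.which("git"):
        # Best effort: if this fails, the project is still a usable plain folder
        subprocess.run(["git", "init", "--quiet", str(project)], check=False)
    return project
```

Because the result is an ordinary folder plus git metadata, the same project can be opened from a laptop, a notebook server or a batch job without conversion.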
Familiar concept - Projects exist in most IDEs & tools
Open & Portable - even works on your laptop
Sharable & Versionable - courtesy of git
 Use Projects with all tools/IDEs and services – one container for all artifacts & for tracking dependencies
Part 2: Provide reproducible Environments
 Enabled by Docker & Kubernetes
o A Docker image represents a “Runtime” and essentially enables repeatability by other users
• For example - a Python 2.7 Runtime with Anaconda 4.x, or an R 3.4.3 Runtime environment
• Can also include IDEs or tooling, such as Jupyter/JupyterLab, Zeppelin or RStudio, exposed via an HTTP address
• Allows many different packages & package versions to be surfaced. Compatibility of package versions can be maintained, avoiding typical package conflict issues
o Docker containers with Project volumes provide reproducible compute environments
o With Kubernetes - orchestrate & scale out compute for users. Automate via Kubernetes CronJobs
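A scheduled job that runs a project script inside a pinned Runtime image could be described with a CronJob manifest along these lines. This is a sketch only: the image name, claim name and mount path are placeholders, and `batch/v1` is the CronJob API group in current Kubernetes releases (clusters from 2018 would have used a beta version).

```python
import json


def cron_job_manifest(name, image, command, schedule, claim="project-pvc"):
    """Build a Kubernetes CronJob manifest that mounts the project as a volume."""
    container = {
        "name": name,
        "image": image,  # the pinned Runtime image
        "command": command,
        "volumeMounts": [{"name": "project", "mountPath": "/project"}],
    }
    pod_spec = {
        "restartPolicy": "Never",
        "containers": [container],
        "volumes": [
            {"name": "project", "persistentVolumeClaim": {"claimName": claim}}
        ],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard cron syntax
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


manifest = cron_job_manifest(
    name="nightly-evaluation",
    image="registry.example/python-runtime:1.0",  # placeholder image
    command=["python", "/project/scripts/evaluate.py"],
    schedule="0 2 * * *",
)
print(json.dumps(manifest, indent=2))
```

The JSON output can be passed to `kubectl apply -f -`, since Kubernetes accepts JSON as well as YAML manifests.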
Example - a JupyterLab container
• Port forwarding + auth
• Project (git repo) mounted as a volume - or just git cloned & pushed when work is done
Example - a scoring service for a specific version of a Model (Churn predictor - v2)
• Auth-Proxy, Kube svc (load balance), scoring server pods
• Port forwarding
• Create replicas for scale/load-balancing
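The scoring-service example (replicas of one model version behind a Kubernetes service) could be declared roughly as follows. The helper name, image reference and port numbers are assumptions; the rolling-update settings are one common way to aim for outage-free upgrades, not necessarily what the product uses.

```python
def scoring_deployment(model, version, image, replicas=3, port=8080):
    """Build a Deployment and a Service for one version of a model."""
    name = f"{model}-{version}"
    labels = {"app": model, "version": version}
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            # Start a new pod before stopping an old one during upgrades
            "strategy": {
                "type": "RollingUpdate",
                "rollingUpdate": {"maxUnavailable": 0, "maxSurge": 1},
            },
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": "scorer",
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        # The Service load-balances across all pods carrying these labels
        "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": port}]},
    }
    return deployment, service


deployment, service = scoring_deployment(
    "churn-predictor", "v2", "registry.example/churn-predictor:v2"  # placeholder image
)
```

Including the version in the labels lets several versions of the same model run side by side, each behind its own service.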
Part 3: A dev-ops process & an Enterprise Catalog
 Establish a “Release” to production mechanism
o git tag the state of a Data Science project to identify a version that is a release candidate
o Take that released project tag through a conventional dev->stage->production pipeline
o An update to the release would simply translate to a “pull” from a new git tag.
o Stand up Docker containers in Kubernetes (deployments + svc) to deploy Data Science artifacts, expose Web Services or simply use Kube Jobs to run scripts as needed.
 A catalog for metadata and lineage
o All asset metadata is recorded in this repository, including all data assets in the Enterprise
• Enables tracking of relationships between assets – including references to data sources such as Relational
tables/views or HDFS files etc.
• Manage Projects and versions in development, Releases in production
• Track APIs / URL end-points and Apps being exposed (and consumers)
o Establish policies for governance & compliance (apart from just access control)
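The release mechanism described above, where a git tag moves through dev, stage and production, can be modelled in a few lines. The `ReleasePipeline` class is a hypothetical in-memory sketch; a real pipeline would pull the tagged project from git and record each promotion in the catalog.

```python
STAGES = ("dev", "stage", "production")


class ReleasePipeline:
    """Tracks which git tag of a project is deployed in each stage."""

    def __init__(self):
        self.deployed = {stage: None for stage in STAGES}

    def release(self, tag):
        """A new release candidate always enters at the first stage."""
        self.deployed[STAGES[0]] = tag

    def promote(self, stage):
        """Copy the tag deployed in `stage` to the next stage and return that stage."""
        index = STAGES.index(stage)
        if index == len(STAGES) - 1:
            raise ValueError(f"{stage} is the last stage")
        tag = self.deployed[stage]
        if tag is None:
            raise ValueError(f"nothing is deployed in {stage}")
        next_stage = STAGES[index + 1]
        self.deployed[next_stage] = tag
        return next_stage


pipeline = ReleasePipeline()
pipeline.release("churn-v2.0")
pipeline.promote("dev")
pipeline.promote("stage")
```

An update to a release is then just another `release` call with a newer tag, followed by the same promotions.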
Summary: The Governed Data Science lifecycle is a team sport
Roles
• Exec - sets goals & measures results
• Data Engineer - collects: builds data lakes and warehouses, gets data ready for analytics
• CDO (Data Steward) - organizes: Data & Analytics Asset Enterprise Catalog, lineage, governance of data & models, audits & policies, enables model explainability
• Data Scientist - analyzes: explores, shapes data & trains models; experiments
• App. Developer - infuses ML in apps & processes (real-time apps & business processes)
• Admin/Ops
Flow (development & prototyping, then production)
Problem statement or target opportunity -> Finds data -> Explores & understands data -> Collects, preps & persists data -> Extracts features for ML -> Train models -> Deploy & monitor accuracy -> Trusted predictions
• Refine; auto-retrain & upgrade
• Features, lineage/relationships recorded in Governance Catalog
PRODUCTION
• Secure & Governed
• Monitor & Instrument
• High Throughput - load balance & scale
• Reliable Deployments - outage-free upgrades
June 2018 / © 2018 IBM Corporation
Backup
IBM Cloud Private for Data & Data Science Experience Local
Build & Collaborate
• Collaborate within git-backed projects
• Sample data
• Understand data distributions & profile
• Understand, Analyze, Train ML models
Environments
• Jupyter notebook Environment - Python 2.7/3.5 with Anaconda, Scala, R
• RStudio Environment with > 300 packages, R Markdown, R Shiny
• Zeppelin notebook with Python 2.7 with Anaconda
Data Scientist - analyzes
• Experiments, trains models
• Evaluates model accuracy
• Publish Apps & models
• Explores & understands data, distributions: ML feature engineering, visualizations, notebooks, train models, dashboards, apps
• Experiments: features, lineage recorded in Governance Catalog
• Publish to the Catalog: models, scoring services & apps
Self service Compute Environments
• Servers/IDEs - lifecycle easily controlled by each Data Scientist
• Self-serve reservations of compute resources
• Worker compute resources - for batch jobs run on-demand or on schedule
• Environments are essentially Kubernetes pods - with High Availability & compute scale-out baked in (load-balancing/auto-scaling is being planned for a future sprint)
• On demand or leased compute
Extend - Roll your own Environments
• Add libs/packages to the existing Jupyter, RStudio, Zeppelin IDE Environments or introduce new Job “Worker” environments
https://content-dsxlocal.mybluemix.net/docs/content/local/images.html
• DSX Local provides a Docker Registry (replicated for HA) as well. These images are managed by DSX and are used to help build out custom Environments
• Plug-n-Play extensibility
• Reproducibility, courtesy of Docker images
Automate
• Jobs - trigger on-demand or by a schedule, such as for Model Evaluations, Batch scoring or even continuous (re-)training
• Monitor models through a dashboard
• Model versioning, evaluation history
• Publish versions of models, supporting dev/stage/production paradigm
• Monitor scalability through cluster dashboard
• Adapt scalability by redistributing compute/memory/disk resources
Deploy, monitor and manage
Deployment manager - Project Releases
• Project releases: deployed & (delta) updatable
• Current git tag
• Develop in one DSX Local instance & deploy/manage in another (or in the same one)
• Easy support for Hybrid use cases - develop & train on-prem, deploy in the cloud (or vice versa)
Bring in a new “release” to production
• New Releases - from a “Source” Project in the same cluster
• New Releases - from a “Source” Project pulled from github/bitbucket
• New Releases - from a “Source” Project created from a .tar.gz package
Expose an ML model via a REST API
• replicas for load balancing
• pick a version to expose (multiple deployments are possible too)
• Optionally reserve compute
• scoring end-point
• Model pre-loaded into memory inside scoring containers
Expose Python and R scripts as a Web Service
• Custom scripts can be externalized as a REST service - say for custom prediction functions
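A custom prediction function exposed as a REST service might look like the sketch below, which uses only the Python standard library. The `predict` rule, the `monthly_usage` field and the port are invented for illustration; in practice the function would call a model loaded into memory once at start-up (for example from a .pickle or .joblib file).

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def predict(features):
    """Hypothetical stand-in for a pre-loaded model: flags churn when usage is low."""
    return {"churn": features.get("monthly_usage", 0) < 10}


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        try:
            features = json.loads(self.rfile.read(length) or b"{}")
            status, body = 200, predict(features)
        except (ValueError, AttributeError):
            # Invalid JSON, or JSON that is not an object
            status, body = 400, {"error": "expected a JSON object"}
        payload = json.dumps(body).encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Run the scoring end-point until interrupted."""
    HTTPServer(("0.0.0.0", port), ScoringHandler).serve_forever()
```

Calling `serve()` inside a container and placing replicas of it behind a Kubernetes service gives the load-balanced scoring end-point described above; authentication is assumed to be handled by a proxy in front.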

Weitere ähnliche Inhalte

Was ist angesagt?

Secure code practices
Secure code practicesSecure code practices
Secure code practicesHina Rawal
 
Penetration testing reporting and methodology
Penetration testing reporting and methodologyPenetration testing reporting and methodology
Penetration testing reporting and methodologyRashad Aliyev
 
OWASP Top 10 A4 – Insecure Direct Object Reference
OWASP Top 10 A4 – Insecure Direct Object ReferenceOWASP Top 10 A4 – Insecure Direct Object Reference
OWASP Top 10 A4 – Insecure Direct Object ReferenceNarudom Roongsiriwong, CISSP
 
Software Development Methodologies
Software Development MethodologiesSoftware Development Methodologies
Software Development MethodologiesNicholas Davis
 
Chapter 13 software testing strategies
Chapter 13 software testing strategiesChapter 13 software testing strategies
Chapter 13 software testing strategiesSHREEHARI WADAWADAGI
 
Software Engineering unit 2
Software Engineering unit 2Software Engineering unit 2
Software Engineering unit 2Abhimanyu Mishra
 
Evolution of Our Software Architecture
Evolution of Our Software ArchitectureEvolution of Our Software Architecture
Evolution of Our Software ArchitecturePaul Lam
 
Identity and Access Management (IAM)
Identity and Access Management (IAM)Identity and Access Management (IAM)
Identity and Access Management (IAM)Identacor
 
One agent, one click, and the future of data ingest with Elastic
One agent, one click, and the future of data ingest with ElasticOne agent, one click, and the future of data ingest with Elastic
One agent, one click, and the future of data ingest with ElasticElasticsearch
 
Patch Management Best Practices
Patch Management Best Practices Patch Management Best Practices
Patch Management Best Practices Ivanti
 
Virus and its CounterMeasures -- Pruthvi Monarch
Virus and its CounterMeasures                         -- Pruthvi Monarch Virus and its CounterMeasures                         -- Pruthvi Monarch
Virus and its CounterMeasures -- Pruthvi Monarch Pruthvi Monarch
 
System Requirements
System Requirements System Requirements
System Requirements Alaa Al Nouri
 
Web Application Security and Awareness
Web Application Security and AwarenessWeb Application Security and Awareness
Web Application Security and AwarenessAbdul Rahman Sherzad
 
Role of system analyst
Role of system analystRole of system analyst
Role of system analystprachi90501
 
Employing Enterprise Application Integration (EAI)
Employing Enterprise Application Integration (EAI)Employing Enterprise Application Integration (EAI)
Employing Enterprise Application Integration (EAI)elliando dias
 

Was ist angesagt? (20)

Software process
Software processSoftware process
Software process
 
Secure code practices
Secure code practicesSecure code practices
Secure code practices
 
Stepwise planning
Stepwise planningStepwise planning
Stepwise planning
 
Penetration testing reporting and methodology
Penetration testing reporting and methodologyPenetration testing reporting and methodology
Penetration testing reporting and methodology
 
OWASP Top 10 A4 – Insecure Direct Object Reference
OWASP Top 10 A4 – Insecure Direct Object ReferenceOWASP Top 10 A4 – Insecure Direct Object Reference
OWASP Top 10 A4 – Insecure Direct Object Reference
 
Software Development Methodologies
Software Development MethodologiesSoftware Development Methodologies
Software Development Methodologies
 
Secure coding practices
Secure coding practicesSecure coding practices
Secure coding practices
 
Chapter 13 software testing strategies
Chapter 13 software testing strategiesChapter 13 software testing strategies
Chapter 13 software testing strategies
 
Software Engineering unit 2
Software Engineering unit 2Software Engineering unit 2
Software Engineering unit 2
 
Evolution of Our Software Architecture
Evolution of Our Software ArchitectureEvolution of Our Software Architecture
Evolution of Our Software Architecture
 
Identity and Access Management (IAM)
Identity and Access Management (IAM)Identity and Access Management (IAM)
Identity and Access Management (IAM)
 
One agent, one click, and the future of data ingest with Elastic
One agent, one click, and the future of data ingest with ElasticOne agent, one click, and the future of data ingest with Elastic
One agent, one click, and the future of data ingest with Elastic
 
Patch Management Best Practices
Patch Management Best Practices Patch Management Best Practices
Patch Management Best Practices
 
Octave
OctaveOctave
Octave
 
Unit 7
Unit 7Unit 7
Unit 7
 
Virus and its CounterMeasures -- Pruthvi Monarch
Virus and its CounterMeasures                         -- Pruthvi Monarch Virus and its CounterMeasures                         -- Pruthvi Monarch
Virus and its CounterMeasures -- Pruthvi Monarch
 
System Requirements
System Requirements System Requirements
System Requirements
 
Web Application Security and Awareness
Web Application Security and AwarenessWeb Application Security and Awareness
Web Application Security and Awareness
 
Role of system analyst
Role of system analystRole of system analyst
Role of system analyst
 
Employing Enterprise Application Integration (EAI)
Employing Enterprise Application Integration (EAI)Employing Enterprise Application Integration (EAI)
Employing Enterprise Application Integration (EAI)
 

Ähnlich wie Software engineering practices for the data science and machine learning lifecycle

Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in ProductionDataWorks Summit
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-useltonrodriguez11
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...Robert Grossman
 
DevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleDevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleatSistemas
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningProvectus
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerProvectus
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenGoDataDriven
 
Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Tushar Katarki
 
Data Science Salon: Applying Machine Learning to Modernize Business Processes
Data Science Salon: Applying Machine Learning to Modernize Business ProcessesData Science Salon: Applying Machine Learning to Modernize Business Processes
Data Science Salon: Applying Machine Learning to Modernize Business ProcessesFormulatedby
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndCloudera, Inc.
 
Experimentation to Industrialization: Implementing MLOps
Experimentation to Industrialization: Implementing MLOpsExperimentation to Industrialization: Implementing MLOps
Experimentation to Industrialization: Implementing MLOpsDatabricks
 
BBBT Watson Data Platform Presentation
BBBT Watson Data Platform PresentationBBBT Watson Data Platform Presentation
BBBT Watson Data Platform PresentationRitika Gunnar
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise AnalyticsDATAVERSITY
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsAnyscale
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learningRajesh Muppalla
 
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0minseok kim
 
Implementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architectureImplementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architectureDavide Veronese
 

Ähnlich wie Software engineering practices for the data science and machine learning lifecycle (20)

Machine Learning Models in Production
Machine Learning Models in ProductionMachine Learning Models in Production
Machine Learning Models in Production
 
DevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-usDevOps for Machine Learning overview en-us
DevOps for Machine Learning overview en-us
 
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
AnalyticOps: Lessons Learned Moving Machine-Learning Algorithms to Production...
 
DevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-OracleDevOps Spain 2019. Olivier Perard-Oracle
DevOps Spain 2019. Olivier Perard-Oracle
 
resume4
resume4resume4
resume4
 
Feature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine LearningFeature Store as a Data Foundation for Machine Learning
Feature Store as a Data Foundation for Machine Learning
 
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMakerMLOps and Reproducible ML on AWS with Kubeflow and SageMaker
MLOps and Reproducible ML on AWS with Kubeflow and SageMaker
 
Microsoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDrivenMicrosoft DevOps for AI with GoDataDriven
Microsoft DevOps for AI with GoDataDriven
 
Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes Scaling AI/ML with Containers and Kubernetes
Scaling AI/ML with Containers and Kubernetes
 
Data Science Salon: Applying Machine Learning to Modernize Business Processes
Data Science Salon: Applying Machine Learning to Modernize Business ProcessesData Science Salon: Applying Machine Learning to Modernize Business Processes
Data Science Salon: Applying Machine Learning to Modernize Business Processes
 
Part 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to EndPart 3: Models in Production: A Look From Beginning to End
Part 3: Models in Production: A Look From Beginning to End
 
Experimentation to Industrialization: Implementing MLOps
Experimentation to Industrialization: Implementing MLOpsExperimentation to Industrialization: Implementing MLOps
Experimentation to Industrialization: Implementing MLOps
 
BBBT Watson Data Platform Presentation
BBBT Watson Data Platform PresentationBBBT Watson Data Platform Presentation
BBBT Watson Data Platform Presentation
 
2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics2022 Trends in Enterprise Analytics
2022 Trends in Enterprise Analytics
 
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning ModelsApache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
Apache ® Spark™ MLlib 2.x: How to Productionize your Machine Learning Models
 
Reshma Resume 2016
Reshma Resume 2016Reshma Resume 2016
Reshma Resume 2016
 
Continuous delivery for machine learning
Continuous delivery for machine learningContinuous delivery for machine learning
Continuous delivery for machine learning
 
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
Pivotal korea transformation_strategy_seminar_enterprise_dev_ops_20160630_v1.0
 
Ml ops on AWS
Ml ops on AWSMl ops on AWS
Ml ops on AWS
 
Implementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architectureImplementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architecture
 

Mehr von DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Software engineering practices for the data science and machine learning lifecycle

  • 1. Applying Software Engineering Practices for the Data Science & ML Lifecycle
      Data Works Summit, San Jose 2018
      Sriram Srinivasan, Architect - IBM Data Science & Machine Learning, Cloud Private for Data
      IBM Data Science Experience
      © 2018 IBM Corporation
  • 2. Overview
      o Enterprises, as usual, want a quick return on investments in Data Science
          • But with a shrinking dev -> prod cycle, applying new techniques and making course corrections on a continuous basis is the norm
      o Data Science & Machine Learning are increasingly cross-team endeavors
          • Data Engineers, Business Analysts, DBAs and Data Stewards are frequently involved in the lifecycle
      o The “Cloud” has been a great influence
          • Economies of scale in infrastructure cost and quick re-assignment of resources are expected
          • Automation, APIs, Repeatability, Reliability and Elasticity are essential for “operations”
      o Data Science & ML need to exhibit the same maturity as other enterprise-class apps:
          • Compliance & Regulations are still a critical mandate for large Enterprises
          • Security, auditability and governance need to be in place from the start, not as an afterthought
  • 3. Data Scientist Concerns
      o Where is the data I need to drive business insights?
          • I don’t want to do all the plumbing: connecting to databases, Hadoop etc.
      o How do I collaborate and share my work with others?
      o What visualization techniques exist to tell my story?
      o How do I bring my familiar R/Python libraries to this new Data Science platform?
          • How do I use the latest libraries/techniques or newer versions?
      o How do I procure compute resources for my experimentation?
          • Including specialized compute such as GPUs
      o How are my Machine Learning models performing, and how do I improve them?
      o I have this Machine Learning model; how do I help deploy it in production?
  • 4. Access to libraries & tools: an ever-growing list (Data Science Experience)
      o Multiple programming languages: Python, R, Scala..
      o Modern Data Scientists are programmers/software developers too!
          • Build your favorite libraries or experiment with new ones
          • Modularization via packages and dependency management are problems, just as in any software development
      o Publish apps and expose APIs: share & collaborate
      o Work with a variety of data sources and technologies, easily
      o Environments shown: Machine Learning Environments, Deep Learning Environments, SPSS Modeler, ...
  • 5. Challenges for the Enterprise
      o Ensure secure data access & auditability, for governance and compliance
          • Control and curate access to data and to all open source libraries used
      o Explainability and reproducibility of machine learning activities
          • Improve trust in analytics and predictions
      o Efficient collaboration and versioning of all source, sample data and models
          • Easy teaming with accountability
      o Establish Continuous Integration practices, just as with any Enterprise software
          • Agility in delivery and in resolving problems in production
      o Publish/share and identify provenance/lineage with confidence
          • Visibility and access control
      o Effective resource utilization and the ability to scale out on demand
          • Guarantee SLAs for production work; balance resources amongst the workloads of different data scientists and machine learning practitioners
      o Goal: Operationalize Data Science!
  • 6. 5 tenets for operationalizing data science
      o Analytics-Ready Data: Managed
      o Trusted: Quality, Provenance and Explainability
      o Resilient: At Scale & Always On
      o Measurable: Monitor + Measure
      o Evolution: Deliver & Improve
  • 7. Analytics-Ready Data (Managed): Where’s my data?
      o Access to data, with techniques to track & deal with sensitive content
      o Data virtualization
      o Automatable pipelines for data preparation and transformations
      o Need: an Enterprise Catalog & Data Integration capabilities
  • 8. Trusted (Quality, Provenance and Explainability): How can I convince you to use this model? How was the model built?
      o Provenance of the data used to train & test
      o Lineage of the model: transforms, features & labels
      o Model explainability: algorithm, size of data, compute resources used to train & test, evaluation thresholds, repeatability
      o Need: an Enterprise Catalog for Analytics & Model assets
  • 9. Resilient (At Scale & Always On): Dependable for your business
      o ML infused in real-time, critical business processes
      o Reliable & performant for (re-)training
      o Highly available, low-latency model serving in real time, even with sophisticated data prep
      o Outage-free model/version upgrades in production
      o Must have: a platform for elasticity, reliability & load-balancing
  • 10. Measurable (Monitor + Measure): Is the model still good enough?
      o Latency metrics for real-time scoring
      o Frequent accuracy evaluations with thresholds
      o Health monitoring for model decay
      o Desired: continuous model evaluations & instrumentation
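The "frequent accuracy evaluations with thresholds" idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the mechanism used by the IBM products: the names `Evaluation`, `evaluate` and `is_decaying`, the use of plain accuracy as the metric, and the "three misses in a row" decay rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Evaluation:
    accuracy: float
    below_threshold: bool


def evaluate(y_true: Sequence[int], y_pred: Sequence[int], threshold: float) -> Evaluation:
    """Compute plain accuracy and flag it when it falls below the threshold."""
    if len(y_true) == 0 or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    accuracy = correct / len(y_true)
    return Evaluation(accuracy=accuracy, below_threshold=accuracy < threshold)


def is_decaying(history: List[Evaluation], window: int = 3) -> bool:
    """Hypothetical decay rule: the last `window` evaluations all missed the threshold."""
    recent = history[-window:]
    return len(recent) == window and all(e.below_threshold for e in recent)
```

A scheduled job could call `evaluate` on freshly labeled data, append the result to a history, and trigger re-training when `is_decaying` returns true.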
  • 11. Evolution (Deliver & Improve): Growth & Maturity
      o Versioning: champion/challenger, experimentation and hyper-parameterization
      o Process efficiencies: automated re-training, auto deployments (with curation & approvals)
      o Must have: delta deployments & outage-free upgrades
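A champion/challenger decision with an approval gate might look roughly like the sketch below. The `ModelVersion` type, the `min_gain` margin and the `approved` flag are hypothetical names chosen for this example; the slides do not specify how the comparison or the approval step is actually implemented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelVersion:
    name: str
    version: str
    accuracy: float  # assumed to come from the same held-out evaluation set


def select_champion(
    champion: ModelVersion,
    challenger: ModelVersion,
    min_gain: float = 0.01,
    approved: bool = False,
) -> ModelVersion:
    """Return the version that should serve production traffic.

    The challenger replaces the champion only if it beats it by at least
    `min_gain` AND a human (or policy) approval has been recorded.
    """
    gain = challenger.accuracy - champion.accuracy
    if gain >= min_gain and approved:
        return challenger
    return champion
```

Requiring both a measurable gain and an explicit approval mirrors the "auto deployments (with curation & approvals)" wording: automation proposes the upgrade, governance confirms it.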
  • 12. A git and Docker/Kubernetes based approach, from lessons learnt during the implementation of:
      o IBM Data Science Experience Local & Desktop: https://www.ibm.com/products/data-science-experience
      o IBM Cloud Private for Data: http://ibm.biz/data4ai
  • 13. Part 1: Establish a way to organize models, scripts & all other assets
      o A “Data Science Project”
          • is just a folder of assets grouped together
          • contains “models” (say .pickle/.joblib or R objects with metadata)
          • contains scripts
              - for data wrangling and transformations
              - used for training & testing, evaluations, batch scoring
          • contains interactive notebooks and apps (R Shiny etc.)
          • contains sample data sets & references to remote data sources
          • perhaps even contains your own home-grown libraries & modules
          • is a git repository
              - Why? Easy to share, version & publish across different teams or users with different Roles
              - Track the history of commits, set up approval practices & version
      o Familiar concept: Projects exist in most IDEs & tools
      o Open & portable: even works on your laptop
      o Sharable & versionable: courtesy of git
      o Use Projects with all tools/IDEs and services: one container for all artifacts & for tracking dependencies
  • 14. Part 2: Provide reproducible Environments
      o Enabled by Docker & Kubernetes
          • A Docker image represents a “Runtime” and essentially enables repeatability by other users
              - For example, a Python 2.7 Runtime with Anaconda 4.x, or an R 3.4.3 Runtime environment
              - Can also include IDEs or tooling, such as Jupyter/JupyterLab, Zeppelin or RStudio, exposed via an HTTP address
              - Allows many different packages and package versions to be surfaced. Compatibility of package versions can be maintained, avoiding typical package conflict issues
          • Docker containers with Project volumes provide reproducible compute environments
          • With Kubernetes: orchestrate & scale out compute, and scale for users. Automate via Kube Cron jobs
      o Example: a JupyterLab container
          • Port forwarding + auth
          • Project (.git repo) mounted as a volume, or just git cloned & pushed when work is done
      o Example: a scoring service for a specific version of a model (Churn predictor v2)
          • Auth-Proxy and port forwarding in front of a Kube svc that load-balances across scoring server pods
          • Create replicas for scale/load-balancing
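To make the scoring-service example more concrete, the sketch below builds a Kubernetes Deployment and Service manifest as plain Python dictionaries and prints them as JSON (which `kubectl apply -f` accepts). The field names follow the standard `apps/v1` Deployment and `v1` Service schemas; everything else (the helper names, the image name, the port, the `/project` mount path and the PersistentVolumeClaim naming) is an assumption for illustration and not taken from DSX Local.

```python
import json


def scoring_deployment(model_name: str, version: str, image: str,
                       replicas: int = 2, port: int = 8080) -> dict:
    """Deployment running `replicas` scoring pods for one model version."""
    labels = {"app": f"{model_name}-scoring", "version": version}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": f"{model_name}-{version}", "labels": labels},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": "scoring-server",
                        "image": image,  # the Docker image is the reproducible "Runtime"
                        "ports": [{"containerPort": port}],
                        # Hypothetical mount: the project volume, read-only for scoring
                        "volumeMounts": [{"name": "project", "mountPath": "/project",
                                          "readOnly": True}],
                    }],
                    "volumes": [{
                        "name": "project",
                        "persistentVolumeClaim": {"claimName": f"{model_name}-project"},
                    }],
                },
            },
        },
    }


def scoring_service(model_name: str, version: str, port: int = 8080) -> dict:
    """Service that load-balances requests across the scoring pods."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": f"{model_name}-{version}-svc"},
        "spec": {
            "selector": {"app": f"{model_name}-scoring", "version": version},
            "ports": [{"port": 80, "targetPort": port}],
        },
    }


deployment = scoring_deployment("churn-predictor", "v2",
                                "registry.example.com/scoring:py3", replicas=3)
print(json.dumps(deployment, indent=2))
```

Pinning the image and labeling pods with the model version is what lets a second version run side by side with its own replicas.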
  • 15. Part 3: A dev-ops process & an Enterprise Catalog
      o Establish a “Release” to production mechanism
          • git tag the state of a Data Science project to identify a version that is a release candidate
          • Take that released project tag through a conventional dev -> stage -> production pipeline
          • An update to the release simply translates to a “pull” from a new git tag
          • Stand up Docker containers in Kubernetes (deployments + svc) to deploy Data Science artifacts and expose Web Services, or simply use Kube Jobs to run scripts as needed
      o A catalog for metadata and lineage
          • All asset metadata is recorded in this repository, including all data assets in the Enterprise
              - Enables tracking of relationships between assets, including references to data sources such as relational tables/views or HDFS files
              - Manage Projects and versions in development, Releases in production
              - Track APIs/URL end-points and Apps being exposed (and their consumers)
          • Establish policies for governance & compliance (beyond just access control)
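The release mechanism described on this slide can be modeled as a small state machine: a release is a project pinned to a git tag, and it moves through dev, stage and production. The sketch below is a simplified illustration; the `Release` type and the `promote` and `update` helpers are hypothetical, and restarting an updated release at `dev` is an assumption of this example rather than something the slides state.

```python
from dataclasses import dataclass, replace

STAGES = ("dev", "stage", "production")


@dataclass(frozen=True)
class Release:
    project: str
    git_tag: str          # the tagged state of the project repository
    stage: str = "dev"


def promote(release: Release) -> Release:
    """Move a release one step along dev -> stage -> production."""
    idx = STAGES.index(release.stage)
    if idx == len(STAGES) - 1:
        raise ValueError(f"{release.git_tag} is already in production")
    return replace(release, stage=STAGES[idx + 1])


def update(release: Release, new_tag: str) -> Release:
    """An update is a 'pull' from a new git tag.

    Assumption for this sketch: the updated release re-enters the pipeline at dev.
    """
    return replace(release, git_tag=new_tag, stage="dev")
```

Because the release is identified only by project and tag, the same record can be stored in the catalog to track which version is running at which stage.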
  • 16. Summary: The Governed Data Science lifecycle is a team sport
      o Roles
          • Data Engineer (Collects): builds data lakes and warehouses; gets data ready for analytics
          • CDO / Data Steward (Organizes): Data & Analytics Asset Enterprise Catalog; lineage; governance of data & models; audits & policies; enables model explainability
          • Data Scientist (Analyzes, Experiments): explores and shapes data & trains models
          • Exec: sets goals & measures results
          • App. Developer: infuses ML in apps & processes
      o Development & prototyping flow: problem statement or target opportunity -> find data -> explore & understand data -> collect, prep & persist data -> extract features for ML -> train models -> deploy & monitor accuracy -> trusted predictions for real-time apps & business processes
          • Refine; auto-retrain & upgrade
          • Features and lineage/relationships are recorded in the Governance Catalog
      o PRODUCTION (Admin/Ops)
          • Secure & governed
          • Monitor & instrument
          • High throughput: load-balance & scale
          • Reliable deployments: outage-free upgrades
  • 17. Backup: IBM Cloud Private for Data & Data Science Experience Local (June 2018, © 2018 IBM Corporation)
  • 18. Build & Collaborate
      o Collaborate within git-backed projects
      o Sample data
      o Understand data distributions & profile
  • 19. Understand, Analyze, Train ML models
      o Environments
          • Jupyter notebook Environment: Python 2.7/3.5 with Anaconda, Scala, R
          • RStudio Environment with > 300 packages, R Markdown, R Shiny
          • Zeppelin notebook with Python 2.7 and Anaconda
      o Data Scientist: analyzes, experiments, trains models, evaluates model accuracy, publishes apps & models
          • Explores & understands data and distributions: ML feature engineering, visualizations, notebooks, train models, dashboards, apps
          • Features and lineage are recorded in the Governance Catalog
          • Models, scoring services & apps are published to the Catalog
  • 20. Self-service Compute Environments
      o Servers/IDEs: lifecycle easily controlled by each Data Scientist
      o Self-serve reservations of compute resources
      o Worker compute resources for batch jobs, run on demand or on a schedule
      o On-demand or leased compute
      o Environments are essentially Kubernetes pods, with High Availability & compute scale-out baked in (load-balancing/auto-scaling is being planned for a future sprint)
  • 21. Extend: roll your own Environments
      o Add libs/packages to the existing Jupyter, RStudio and Zeppelin IDE Environments, or introduce new Job “Worker” environments: https://content-dsxlocal.mybluemix.net/docs/content/local/images.html
      o DSX Local also provides a Docker Registry (replicated for HA). These images are managed by DSX and are used to help build out custom Environments
      o Plug-n-play extensibility
      o Reproducibility, courtesy of Docker images
  • 22. Automate
      o Jobs: trigger on demand or on a schedule, for example for model evaluations, batch scoring or even continuous (re-)training
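A scheduled job of this kind maps naturally onto a Kubernetes CronJob. The sketch below generates such a manifest as a Python dictionary; it follows the `batch/v1` CronJob schema (clusters from the 2018 era exposed it as `batch/v1beta1` instead). The helper name, image, script path and schedule are made-up values for illustration and do not describe how DSX Local schedules jobs internally.

```python
import json
from typing import List


def scheduled_job(name: str, image: str, command: List[str], schedule: str) -> dict:
    """CronJob that runs `command` inside `image` on a cron `schedule`."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,  # standard five-field cron expression
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": name,
                                "image": image,
                                "command": command,
                            }],
                            # Jobs require Never or OnFailure here
                            "restartPolicy": "OnFailure",
                        }
                    }
                }
            },
        },
    }


# Hypothetical nightly evaluation of a model at 02:00
nightly_eval = scheduled_job(
    name="churn-eval",
    image="registry.example.com/worker:py3",
    command=["python", "/project/scripts/evaluate.py"],
    schedule="0 2 * * *",
)
print(json.dumps(nightly_eval, indent=2))
```

The same shape works for batch scoring or re-training: only the command and the schedule change.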
  • 23. Deploy, monitor and manage
      o Monitor models through a dashboard
      o Model versioning, evaluation history
      o Publish versions of models, supporting a dev/stage/production paradigm
      o Monitor scalability through the cluster dashboard
      o Adapt scalability by redistributing compute/memory/disk resources
  • 24. Deployment manager: Project Releases
      o Project releases are deployed & (delta) updatable, tracked by their current git tag
      o Develop in one DSX Local instance & deploy/manage in another (or in the same one)
      o Easy support for Hybrid use cases: develop & train on-prem, deploy in the cloud (or vice versa)
  • 25. Bring in a new “release” to production
      o New Releases from a “Source” Project in the same cluster
      o New Releases from a “Source” Project pulled from github/bitbucket
      o New Releases from a “Source” Project created from a .tar.gz package
  • 26. Expose an ML model via a REST API
      o Pick a version to expose (multiple deployments are possible too)
      o Replicas for load balancing
      o Optionally reserve compute
      o Scoring end-point
      o Model pre-loaded into memory inside scoring containers
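The "model pre-loaded into memory" pattern can be illustrated with Python's standard `http.server` module. This is a minimal sketch, not the DSX scoring server: the `/score` path, the JSON payload shape, the feature names and the stand-in linear model in `load_model` are all assumptions made for the example. A real service would unpickle an actual model in `load_model` and sit behind the auth proxy and load balancer.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def load_model():
    """Stand-in for loading a real model from the project; returns a scorer."""
    weights = {"tenure_months": -0.02, "support_calls": 0.15}  # made-up weights

    def predict(features: dict) -> float:
        raw = 0.5 + sum(weights.get(k, 0.0) * float(v) for k, v in features.items())
        return min(1.0, max(0.0, raw))  # clamp to a probability-like range

    return predict


MODEL = load_model()  # loaded once at start-up, not on every request


def score(payload: bytes) -> bytes:
    """Decode a JSON feature object, score it, and encode the JSON response."""
    features = json.loads(payload)
    return json.dumps({"churn_probability": MODEL(features)}).encode("utf-8")


class ScoringHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/score":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            body = score(self.rfile.read(length))
        except (ValueError, TypeError, AttributeError):
            self.send_error(400, "invalid JSON payload")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Blocking call; run one such process per replica."""
    HTTPServer(("0.0.0.0", port), ScoringHandler).serve_forever()
```

Calling `serve()` starts the end-point; a client would then POST a JSON object such as `{"tenure_months": 10, "support_calls": 2}` to `/score`. Keeping `score` separate from the HTTP handler makes the scoring logic testable without a network.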
  • 27. Expose Python and R scripts as a Web Service
      o Custom scripts can be externalized as a REST service, say for custom prediction functions