With the advent of newer frameworks and toolkits, data scientists are now more productive than ever and are proving indispensable to enterprises. Typical organizations have large teams of data scientists who build key analytics assets that are used daily and are an integral part of live transactions. However, the state of the industry also introduces quite a lot of chaos and complexity. Many packages used by data scientists are open source, and even when they are well curated, there is a growing tendency to pick cutting-edge or unstable packages and frameworks to accelerate analytics. Different data scientists may use different runtimes, different Python or R versions, or even different versions of the same packages. Data scientists predominantly work on their laptops, and it becomes difficult to reproduce their environments for use by others. Since data science is now a team sport across multiple personas, involving non-practitioners, traditional application developers, execs, and IT operators, how does an enterprise create a platform for productive cross-role collaboration?
Enterprises need a reliable and repeatable process, especially when it results in something that affects their production environments. They also require a well-managed approach that enables the graduation of an asset from development through a testing and staging process to production. Given the pace of business today, the process needs to be agile and flexible too, even enabling an easy path to reversing a change. Compliance and audit processes require clear lineage and history as well as approval chains.
In the traditional software engineering world, this lifecycle is well understood and best practices have been followed for ages. But what does it mean when you have non-programmers, or users who are not trained in software engineering philosophies, or who perceive all of this as "big process" roadblocks in their daily work? How do we engage them productively while still supporting enterprise requirements for reliability, tracking, and a clear continuous integration and delivery practice? In this session, the presenters will share techniques based on their user research, real-life customer interviews, and productized best practices. They also invite the audience to share their own stories and best practices to make this a lively conversation.
Speaker
Sriram Srinivasan, Senior Technical Staff Member, Analytics Platform Architect, IBM
4. Data Science Experience
Access to libraries & tools – an ever-growing list
Multiple programming languages – Python, R, Scala…
Modern data scientists are programmers/software developers too!
Build your favorite libraries or experiment with new ones
Modularization via packages & dependency management are problems, just as with any software development
Publish apps and expose APIs – share & collaborate
Work with a variety of data sources and technologies, easily
[Diagram] Runtime environments: Machine Learning Environments, Deep Learning Environments, SPSS Modeler, …
6. 5 tenets for operationalizing data science
• Analytics-Ready Data – Managed
• Trusted – Quality, Provenance and Explainability
• Resilient – At Scale & Always On
• Measurable – Monitor + Measure
• Evolution – Deliver & Improve
7. Where's my data?
Analytics-Ready Data – Managed
• access to data, with techniques to track & deal with sensitive content
• data virtualization
• automatable pipelines for data preparation and transformations
Need: an Enterprise Catalog & Data Integration capabilities
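As a minimal sketch of what an automatable, repeatable data-prep pipeline can look like (the churn-style column names and CSV files are illustrative assumptions, not from the deck):

    import pandas as pd

    def prepare(raw: pd.DataFrame) -> pd.DataFrame:
        # repeatable prep step: same raw input -> same analytics-ready output
        df = raw.dropna(subset=["customer_id"]).copy()   # drop unusable rows
        df["tenure_years"] = df["tenure_months"] / 12.0  # derived feature
        df["region"] = df["region"].str.upper()          # normalize categories
        return df

    if __name__ == "__main__":
        raw = pd.read_csv("customers.csv")               # hypothetical source extract
        prepare(raw).to_csv("customers_prepared.csv", index=False)

Because the step is a plain script, it can be scheduled or re-run by anyone, which is what makes the data "analytics-ready" in a managed way.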
8. How can I convince you to use this model?
Trusted – Quality, Provenance and Explainability
How was the model built?
• provenance of the data used to train & test
• lineage of the model – transforms, features & labels
• model explainability – algorithm, size of data, compute resources used to train & test, evaluation thresholds; repeatable
Need: an Enterprise Catalog for Analytics & Model assets
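A minimal sketch of the kind of provenance record such a catalog could hold, written next to the model artifact (the field names and files are illustrative assumptions, not a DSX schema):

    import hashlib, json

    train_file = "customers_prepared.csv"            # hypothetical training extract
    with open(train_file, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()

    metadata = {
        "model": "churn-predictor", "version": "v2",
        "algorithm": "LogisticRegression",           # how the model was built
        "training_data": {"file": train_file, "sha256": data_hash},  # provenance
        "features": ["tenure_years", "region"],      # lineage: model inputs
        "label": "churned",
        "evaluation": {"metric": "auc", "threshold": 0.80},
    }
    with open("churn-predictor-v2.meta.json", "w") as f:
        json.dump(metadata, f, indent=2)

Hashing the training file makes the provenance of the data used to train & test checkable later, not just asserted.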
9. Dependable for your business
Resilient – At Scale & Always On
ML infused in real-time, critical business processes
• reliable & performant for (re-)training
• highly available, low-latency model serving in real time, even with sophisticated data prep
• outage-free model/version upgrades in production
Must have: a platform for elasticity, reliability & load-balancing
10. Is the model still good enough?
Measurable – Monitor + Measure
• latency metrics for real-time scoring
• frequent accuracy evaluations against thresholds
• health monitoring for model decay
Desired: continuous model evaluations & instrumentation
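A minimal sketch of a threshold-based accuracy check that a scheduled job could run (the model artifact, holdout file and 0.80 threshold are hypothetical):

    import joblib
    import pandas as pd
    from sklearn.metrics import roc_auc_score

    model = joblib.load("churn-predictor-v2.joblib")   # hypothetical artifact
    holdout = pd.read_csv("holdout.csv")               # freshly labeled data

    scores = model.predict_proba(holdout.drop(columns=["churned"]))[:, 1]
    auc = roc_auc_score(holdout["churned"], scores)

    THRESHOLD = 0.80
    if auc < THRESHOLD:
        print(f"ALERT: AUC {auc:.3f} fell below {THRESHOLD} - possible model decay")
    else:
        print(f"Model healthy: AUC {auc:.3f}")

Run frequently, the same check doubles as the decay monitor: a downward trend in the logged metric is the early warning.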
12. A git and Docker/Kubernetes based approach
from lessons learned during the implementation of:
IBM Data Science Experience Local & Desktop
https://www.ibm.com/products/data-science-experience
and
IBM Cloud Private for Data
http://ibm.biz/data4ai
13. Part 1: Establish a way to organize Models, scripts & all other assets
A “Data Science Project”:
o just a folder of assets grouped together
o contains “models” (say .pickle/.joblib or R objects with metadata)
o scripts
  o for data wrangling and transformations
  o used for training & testing, evaluations, batch scoring
o interactive notebooks and apps (R Shiny etc.)
o sample data sets & references to remote data sources
o perhaps even your own home-grown libraries & modules
o is a git repository
  • why? Easy to share, version & publish – across different teams or users with different Roles
  • track the history of commits, set up approval practices & versions
Familiar concept – Projects exist in most IDEs & tools
Open & Portable – even works on your laptop
Sharable & Versionable – courtesy of .git
Use Projects with all tools/IDEs and services – one container for all artifacts & for tracking dependencies (see the sketch below)
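As an illustration, a minimal sketch that scaffolds such a project folder and puts it under git (the layout and names are our own assumptions, not the DSX Local project format):

    import pathlib, subprocess

    project = pathlib.Path("churn-project")
    for sub in ["models", "scripts", "notebooks", "data"]:
        (project / sub).mkdir(parents=True, exist_ok=True)
    (project / "README.md").write_text("Churn prediction project assets\n")

    # one folder == one git repo: everything is shared and versioned together
    subprocess.run(["git", "init"], cwd=project, check=True)
    subprocess.run(["git", "add", "-A"], cwd=project, check=True)
    subprocess.run(["git", "commit", "-m", "initial project layout"], cwd=project, check=True)

(The commit assumes git user.name/user.email are configured.) From here, sharing is a push and reproducing a teammate's work is a clone.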
14. Part 2: Provide reproducible Environments
Enabled by Docker & Kubernetes
o A Docker image represents a “Runtime” – and essentially enables repeatability by other users
  • for example, a Python 2.7 Runtime with Anaconda 4.x, or an R 3.4.3 Runtime environment
  • can also include IDEs or tooling, such as Jupyter/Jupyterlab, Zeppelin or RStudio – exposed via an http address
  • allows many different types of packages & versions of packages to be surfaced; compatibility of package versions can be maintained, avoiding typical package-conflict issues
o Docker containers with Project volumes provide reproducible compute environments
o With Kubernetes – orchestrate & scale out compute, and scale for users; automate via Kube Cron jobs (see the sketch at the end of this slide)
[Diagram] Example: a Jupyterlab container, reached via port forwarding + auth, with the Project (.git repo) mounted as a volume – or just git-cloned & pushed when work is done.
[Diagram] Example: a scoring service for a specific version of a model (“Churn predictor v2”) – scoring-server pods behind a Kube svc and an auth proxy, with port forwarding and replicas created for scale/load-balancing.
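As one hedged illustration of the automation bullet above, a sketch that uses the official Kubernetes Python client to run a project script as a Kube Job (the image, namespace and command are hypothetical; a CronJob is the scheduled analogue):

    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside the cluster

    job = client.V1Job(
        metadata=client.V1ObjectMeta(name="churn-batch-score"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(
                    restart_policy="Never",
                    containers=[client.V1Container(
                        name="worker",
                        image="registry.example.com/py27-anaconda:4.x",  # hypothetical runtime image
                        command=["python", "/project/scripts/batch_score.py"],
                    )],
                )
            )
        ),
    )
    client.BatchV1Api().create_namespaced_job(namespace="dsx", body=job)

Because the image pins the runtime and the Project volume/clone pins the code, the same Job re-run later reproduces the same environment.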
15. Part 3: A dev-ops process & an Enterprise Catalog
Establish a “Release” to production mechanism
o git tag the state of a Data Science project to identify a version that is a release candidate (see the sketch after this list)
o take that released project tag through a conventional dev -> stage -> production pipeline
o an update to the release simply translates to a “pull” from a new git tag
o stand up Docker containers in Kubernetes (deployments + svc) to deploy Data Science artifacts and expose web services, or simply use Kube Jobs to run scripts as needed
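A minimal sketch of the tagging step (the tag name is illustrative):

    import subprocess

    TAG = "release-v2.0"  # hypothetical release-candidate tag
    subprocess.run(["git", "tag", "-a", TAG, "-m", "release candidate"], check=True)
    subprocess.run(["git", "push", "origin", TAG], check=True)
    # Staging/production check out exactly this state; an update is a pull of a newer tag:
    #   git fetch --tags && git checkout release-v2.1

The tag is immutable, so the same bits move through dev -> stage -> production, and rolling back is just checking out the previous tag.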
A catalog for metadata and lineage
o all asset metadata is recorded in this repository, including all data assets in the Enterprise
  • enables tracking of relationships between assets – including references to data sources such as relational tables/views or HDFS files etc.
  • manage Projects and versions in development, and Releases in production
  • track APIs/URL end-points and Apps being exposed (and their consumers)
o establish policies for governance & compliance (apart from just access control)
16. Summary: The Governed Data Science lifecycle is a team sport
[Diagram] The lifecycle loops from a problem statement to trusted predictions in production:
• Exec – states the problem or target opportunity, sets goals & measures results.
• Data Engineer – Collects: builds data lakes and warehouses; gets data ready for analytics.
• CDO (Data Steward) – Organizes: the Data & Analytics Asset Enterprise Catalog, lineage, governance of data & models, audits & policies; enables model explainability.
• Data Scientist – Analyzes: finds data, explores & understands it, collects/preps/persists data, extracts features for ML, experiments, trains models, then deploys & monitors accuracy; features and lineage/relationships are recorded in the Governance Catalog.
• App Developer – infuses ML into real-time apps & business processes, consuming trusted predictions.
• Production (Admin/Ops) – secure & governed; monitored & instrumented; high throughput with load balancing & scale; reliable deployments with outage-free upgrades; auto-retrain & upgrade, then refine.
The flow moves from development & prototyping into production.
19. Understand, Analyze, Train ML models
Environments:
• Jupyter notebook environment – Python 2.7/3.5 with Anaconda, Scala, R
• RStudio environment – with 300+ packages, R Markdown, R Shiny
• Zeppelin notebook – with Python 2.7 with Anaconda
The Data Scientist analyzes: explores & understands data and distributions, does ML feature engineering, visualizations and notebooks, experiments, trains models and evaluates model accuracy, builds dashboards and apps, and publishes models, scoring services & apps to the Catalog – with features and lineage recorded in the Governance Catalog.
20. Self-service Compute Environments
• on-demand or leased compute
• Server/IDE lifecycle easily controlled by each Data Scientist
• self-serve reservations of compute resources
• Worker compute resources – for batch jobs run on demand or on a schedule
Environments are essentially Kubernetes pods – with high availability & compute scale-out baked in (load-balancing/auto-scaling is planned for a future sprint).
21. Extend – roll your own Environments
Add libs/packages to the existing Jupyter, RStudio or Zeppelin IDE Environments, or introduce new Job “Worker” environments.
https://content-dsxlocal.mybluemix.net/docs/content/local/images.html
DSX Local also provides a Docker Registry (replicated for HA). These images are managed by DSX and are used to help build out custom Environments.
Plug-n-play extensibility; reproducibility, courtesy of Docker images.
22. Automate
Jobs – triggered on demand or on a schedule, such as for model evaluations, batch scoring, or even continuous (re-)training.
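A stdlib-only sketch of what such a scheduled job amounts to (in the product this would be a DSX Job schedule or a Kube CronJob, not a hand-rolled loop; the script path and daily interval are hypothetical):

    import subprocess, time

    INTERVAL_S = 24 * 60 * 60   # hypothetical daily schedule
    while True:
        # could equally run batch scoring or a (re-)training script
        subprocess.run(["python", "scripts/evaluate_model.py"], check=False)
        time.sleep(INTERVAL_S)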
23. Deploy, monitor and manage
Monitor models through a dashboard:
• model versioning, evaluation history
• publish versions of models, supporting the dev/stage/production paradigm
Monitor scalability through the cluster dashboard; adapt scalability by redistributing compute/memory/disk resources.
24. Deployment manager – Project Releases
[Screenshot] Project releases are deployed & (delta-)updatable, with the current git tag shown.
• Develop in one DSX Local instance & deploy/manage in another (or the same one)
• Easy support for hybrid use cases – develop & train on-prem, deploy in the cloud (or vice versa)
25. Bring in a new “release” to production
New Releases come:
• from a “Source” Project in the same cluster
• from a “Source” Project pulled from github/bitbucket
• from a Project created from a .tar.gz package
26. Expose a ML model via a REST API
• pick a version to expose (multiple deployments are possible too)
• replicas for load balancing
• optionally reserve compute
• scoring end-point, with the model pre-loaded into memory inside the scoring containers (see the sketch below)
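As a rough sketch of such a scoring service (Flask is our illustrative choice, not the DSX serving stack; the model path and payload shape are hypothetical) – the model loads once at startup, so every request scores against an in-memory copy:

    import joblib
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    model = joblib.load("models/churn-predictor-v2.joblib")  # pre-loaded into memory

    @app.route("/score", methods=["POST"])
    def score():
        features = request.get_json()["features"]   # e.g. [[4.5, 1.0], ...]
        return jsonify({"predictions": model.predict(features).tolist()})

    if __name__ == "__main__":
        app.run(host="0.0.0.0", port=8080)  # replicate the container for load balancing

Pre-loading keeps per-request latency down; scale comes from running more replicas behind the service, not from a bigger single process.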
27. Expose Python and R scripts as a Web Service
Custom scripts can be externalized as a REST service – say, for custom prediction functions.