Presentation at Data Innovation Summit 2019 in Stockholm, Sweden.
ABSTRACT
Microscopes are capable of producing vast amounts of data, and when used in automated laboratories both the number and size of images present many challenges for storing, categorizing, analyzing, annotating, and transforming the data into actionable information that can be used for decision making, whether by humans or machines. In this presentation I will describe the informatics system we have established at the Department of Pharmaceutical Biosciences at Uppsala University, which consists of computational hardware (CPUs, GPUs, storage), middleware (Kubernetes), an imaging database (OMERO), and a workflow system (Pachyderm) to perform online prioritization of new data, as well as a continuous analytics system to automate the process from captured images to continuously updated and deployed AI models. The AI methodologies include deep learning models trained on image data and conventional machine learning models trained on features extracted from images or chemical structures. Thanks to its microservice architecture, the system is scalable and can be expanded into hybrid architectures with cloud computing resources. The informatics system serves a robotized cell profiling setup with incubators, liquid handling, and high-content microscopy. The lab is quite young and targets applications primarily in drug screening and toxicity assessment, with the aim of improving research using AI and intelligent design of experiments.
1. Automating the process of continuously prioritising
data, updating and deploying AI models in a robotised
lab for drug discovery
Ola Spjuth <ola.spjuth@farmbio.uu.se>
Department of Pharmaceutical Biosciences, Uppsala University
www.pharmb.io
Lead Scientist AI at
2. Objective: Accelerate drug discovery using
AI and intelligent design of experiments.
• Predict safety concerns (fail early)
• Explain drug mechanisms
• Screen for new drugs
3. Traditional hypothesis testing vs. predictive modeling
[Diagram: hypothesis → experiments → analysis and interpretation → insight → revised hypothesis]
Traditional hypothesis testing:
• Iterative
• Flexible
• Mostly manual
• Slow
Predictive modeling (data generation → database → modeling and prediction):
• Retrospective analysis
• Hopefully predictive
• Expensive
• Limited for hypothesis testing
[Diagram: traditional processing (data → database, queried via request/response) vs. stream processing (data → real-time analytics → results; model → prediction)]
4. Data-driven science
[Diagram: data → stream processing / real-time analytics → results; models are evaluated to yield predictions and insight; hypotheses are generated and tested against the data]
• Hard problem!
• Poor accuracy? Generate new data
5. Intelligent experimentation
Current fact finding: analyze data in motion, before it is stored (continuous analytics).
[Diagram: experiments → data → stream processing / real-time analytics → results → scientist → intelligent design of experiments, supported by the informatics system]
• What experiments should we do, and how?
• Can we reduce the search space?
• How do we store only interesting data?
• Can we replace experiments with predictions?
6. High-content cell profiling
[Diagram: genetic or chemical perturbations → experiments in multi-well plates → imaging → features → hypotheses; a convolutional neural network produces predictions directly from the images]
Bray et al. (2016). "Cell Painting, a High-Content Image-Based Assay for Morphological Profiling Using Multiplexed Fluorescent Dyes." Nature Protocols 11 (9): 1757–74.
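As a rough illustration of the convolutional-neural-network step on the slide, here is a minimal PyTorch sketch of a classifier over multi-channel microscopy images. The architecture, layer sizes, and the use of 5 input channels (Cell Painting stains 5 fluorescent channels) are illustrative assumptions, not the network actually used in the lab.

```python
import torch
import torch.nn as nn

class TinyCellCNN(nn.Module):
    """Toy CNN for multi-channel microscopy images (illustrative only)."""

    def __init__(self, in_channels=5, num_classes=3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling, so any image size works
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, x):
        h = self.features(x).flatten(1)  # (batch, 32)
        return self.classifier(h)        # (batch, num_classes)

# Example: a batch of two 5-channel 64x64 images
logits = TinyCellCNN()(torch.randn(2, 5, 64, 64))
print(logits.shape)  # torch.Size([2, 3])
```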
7. Automated cell-based experiments
[Diagram of the current setup: automated plate handling connected to the informatics system (GPU cluster, CPU server, storage); next step is to automate more protocols]
An open-source robotized cell lab.
11. Dealing with large-scale data
• High volume, high velocity
• Continuously process data, train models, serve models
• Embrace scalable virtual infrastructures (cloud) and microservices (containers)
• Intelligently prioritize what to store
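The "intelligently prioritize what to store" bullet can be sketched as an online top-k filter over the incoming stream; the interestingness score below is a made-up stand-in for a real prioritization model.

```python
import heapq
import random

def interestingness(features):
    """Hypothetical score, e.g. deviation from untreated controls.
    Here simply the mean feature value, as a stand-in."""
    return sum(features) / len(features)

def prioritize(stream, keep=100):
    """Keep only the `keep` most interesting items from an unbounded stream."""
    top = []  # min-heap of (score, id): the worst kept item sits at the root
    for item_id, features in stream:
        score = interestingness(features)
        if len(top) < keep:
            heapq.heappush(top, (score, item_id))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, item_id))
    return sorted(top, reverse=True)  # best first

random.seed(0)
stream = ((i, [random.random() for _ in range(8)]) for i in range(10_000))
kept = prioritize(stream, keep=5)
```

Memory stays bounded by `keep`, so the filter can run indefinitely against an instrument that never stops producing images.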
12. Designing a flexible and scalable informatics system
[Diagram: on-premise GPU cluster, CPU server, and storage, extendable to cloud and HPC resources]
13. Our infrastructure
[Diagram: Kubernetes clusters (administered with Rancher) with master and worker nodes on Server1 (KVM), Server2 (KVM), and a GPU server with GPU workers; separate Develop and Other clusters; a fileserver (NFS); external nodes join via VPN/IPsec; serves internal users and development as well as external users]
14. System configuration, IaC (Infrastructure as Code)
[Diagram: the same Kubernetes and OpenShift clusters (master/worker nodes on Server1 (KVM), Server2 (KVM), and the GPU server, administered with Rancher, plus the Develop and Other clusters), defined and provisioned as code]
15. Multi-cluster Kubernetes management
• Kubernetes setup and administration in the cluster
• High availability
• Kubernetes extras such as networking, Helm, and an Nginx load balancer, with tested and matched versions
17. AI/Machine Learning life cycle
• Structured processing steps
• Governance
• Versioning
• Streamline analysis on high-performance e-infrastructures
• Support reproducible data analysis
• Enable large-scale data analysis
18. DevOps for machine learning
A systematic way of working that accelerates the path from initial data to production AI services: continuous integration and continuous delivery across physical, virtual, and cloud resources.
[Pipeline: data scientist → data → pre-processing → modeling → training → testing → container → testing → model serving]
21. FAIR data and services
FAIR¹:
• Findable: semantic service discovery (JSON-LD)
• Accessible: web UI; programmatic API (OpenAPI)
• Interoperable: OpenAPI; semantic annotations (JSON-LD)
• Reproducible: published scientific workflows (Pachyderm)
Services:
• Portable (containers)
• Resilient (Kubernetes)
• Scalable (Kubernetes and cloud computing)
¹ Wilkinson, Mark D., et al. "The FAIR Guiding Principles for scientific data management and stewardship." Scientific Data 3 (2016).
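A minimal sketch of the kind of JSON-LD annotation that makes a service findable and interoperable; the schema.org terms and the service description below are illustrative assumptions, not the lab's actual annotations.

```python
import json

# Hypothetical JSON-LD description of a prediction service; the vocabulary
# and field values are illustrative only.
service = {
    "@context": {
        "schema": "https://schema.org/",
        "name": "schema:name",
        "url": "schema:url",
        "description": "schema:description",
    },
    "@type": "schema:WebAPI",
    "name": "metpred",
    "url": "https://metpred.service.pharmb.io/",
    "description": "Predicts site-of-metabolism and reaction types.",
}

doc = json.dumps(service, indent=2)
```

Because the `@context` maps plain keys to shared vocabulary terms, a crawler that knows schema.org can discover and interpret the service without any service-specific code.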
22. Examples of what we serve
• Target (safety) profiles: http://ptp.service.pharmb.io/
• Site-of-metabolism and reaction types: https://metpred.service.pharmb.io/draw/
24. Key components and takeaways
• IaC: infrastructure is portable, testable, and simpler to maintain
• Containers: components become portable and scalable
• Kubernetes: resilient container orchestration over multiple nodes
  • Logging and monitoring: profiling
  • Hybrid cloud: elasticity
• Pachyderm: workflows of containers on Kubernetes, with data versioning
• CI/CD: streamlined development and testing
• Open source: only open-source components
→ DevOps: software developers, data engineers, and data scientists working together in the same infrastructure
25. Implications: continuous analytics
• We can handle continuous data processing from instruments with robust, resilient data pipelines
• We can continuously re-train models as data is updated
• We can continuously publish data and models
→ An agile research group with different competencies:
• Scientists have access to the infrastructure
• DevOps roles rather than a dedicated sysadmin, tool developer, or data engineer
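Continuous re-training "as data is updated" can be sketched as a version-triggered loop; the content hash below is a stand-in for the commit IDs that a data-versioning system such as Pachyderm provides.

```python
import hashlib

def data_version(records):
    """Content hash of a dataset: a stand-in for a data-versioning commit ID."""
    h = hashlib.sha256()
    for r in records:
        h.update(repr(r).encode())
    return h.hexdigest()[:12]

def retrain_if_changed(records, last_version, train):
    """Re-train only when the data has actually changed."""
    version = data_version(records)
    if version == last_version:
        return last_version, None  # nothing to do
    return version, train(records)

train = lambda rs: sum(rs) / len(rs)  # toy "model": the mean of the data
v1, model = retrain_if_changed([1, 2, 3], None, train)
v2, again = retrain_if_changed([1, 2, 3], v1, train)  # unchanged: no retrain
```

Keying retraining on the data version rather than on a timer means models are exactly as fresh as the data, and no compute is wasted when nothing changed.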
26. Tools we develop and use
• Easy deployment of virtual infrastructures on IaaS
• Containerized tools and microservices orchestrated with workflow systems on top of Kubernetes and OpenShift
KubeNow (kubeadm, Terraform, Packer, kubectl, Cloudflare): https://github.com/kubenow
Capuccini et al. On-Demand Virtual Research Environments Using Microservices. https://arxiv.org/abs/1805.06180
Pachyderm, a data pipelining and data versioning layer for Kubernetes: https://www.pachyderm.io
Novella et al. Container-based bioinformatics with Pachyderm. Bioinformatics 35 (5): 839–846 (2018). DOI: 10.1093/bioinformatics/bty699
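For concreteness, a hedged sketch of what a Pachyderm-style pipeline specification could look like, built here as a Python dict; the repo, container image, and command names are hypothetical, not the lab's actual configuration.

```python
import json

# Hypothetical Pachyderm-style pipeline spec: run a (made-up)
# feature-extraction container over every new commit in an images repo.
pipeline_spec = {
    "pipeline": {"name": "extract-features"},
    "transform": {
        "image": "pharmbio/feature-extractor:latest",  # hypothetical image
        "cmd": ["python3", "/extract.py", "/pfs/images", "/pfs/out"],
    },
    # one datum per top-level entry in the repo, e.g. one per plate
    "input": {"pfs": {"repo": "images", "glob": "/*"}},
}

spec_json = json.dumps(pipeline_spec, indent=2)
# A spec like this would be submitted with pachctl, e.g.
#   pachctl create pipeline -f pipeline.json
```

Because the input is a versioned repo, every new commit of images triggers the transform, and every output is traceable back to the exact data that produced it.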