Large federated data ecosystems require diverse teams that can design, build, and integrate a broad range of services to support scientific workflows. Our collaborative team operates at the intersection of science, technology, and data to assess, implement, and teach the key capabilities and capacities that modern healthcare and the life sciences need. Learn the data management techniques, tools, platforms, and frameworks proven effective at solving complex problems at scale.
The BioTeam
Virtual company founded in 2002
Staffed by scientists turned technologists
Technology agnostic and vendor independent
Pioneers of open-source distributed computing
Translate scientific drivers into innovative solutions
Providing strategic guidance and deep collaboration
Assess > Design > Build > Implement > Train > Support
BioTeam is independent and committed to Science
The Central Problems
Our primary mission is to solve complex problems at the intersection of science, technology, and data.
Most of our clients are struggling with central problems:
Science is changing faster than IT
Advanced infrastructure increases complexity
Distributed data is difficult to manage at scale
Our data is not findable
Our data is not accessible
Our data is not interoperable
Our data is not reusable
The Data Ecosystem
A data ecosystem is a set of infrastructure and services that empowers a community of scientists and engineers.
Key features of a healthy Life Sciences data ecosystem:
Data Discoverability
Data Integrity at the Origin
Common Languages
Pipelines and Infrastructure as Code
Microservices and frontends
Experiment tracking and shared Workspaces
Continuous Delivery mindset for ML and Discovery
Science at the Speed of Light
Science is rate limited by our ability to generate and test hypotheses.
Consider the foundational layers of your ecosystem. Primarily we look at the Science Network to understand the data movement challenges and access patterns.
We recommend you plan ahead and provide fast data paths between the lab instruments generating data and your analysis tools. Bring compute to the data and data to the compute.
In the worst case scenario, inferior networking halts experiments in progress and destroys your scientific potential. In the best case scenario, you have a loss-free, high-speed network designed to match the capabilities and capacities of your science.
Data Discoverability
The primary goal of a data scientist is to locate data, make sense of it, and evaluate whether it is trustworthy.
Datasets often diverge into silos, which become problematic. Human nature creates silos. Applications and databases create silos. Business units and geography create silos.
Searching and finding data is usually our primary objective.
Assessing the quality is a secondary supporting objective.
Need: Globally Unique IDs and resource resolver services.
Need: Defined metadata at the point of data instantiation.
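The two needs above can be sketched together: mint a globally unique ID and capture metadata at the moment a data object is instantiated, then resolve the ID back to the object. This is a minimal stdlib sketch; the `guid:` prefix, in-memory registry, and metadata fields are illustrative stand-ins for a real resolver service (e.g. identifiers.org or a DRS server).

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DataObject:
    """A data object with a globally unique ID and metadata
    captured at the point of instantiation."""
    uri: str        # location of the raw bytes
    metadata: dict  # descriptive fields defined up front
    guid: str = field(default_factory=lambda: f"guid:{uuid.uuid4()}")
    created: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Toy resolver: maps GUIDs back to objects, standing in for a
# production resource-resolver service.
REGISTRY: dict[str, DataObject] = {}

def register(obj: DataObject) -> str:
    REGISTRY[obj.guid] = obj
    return obj.guid

def resolve(guid: str) -> DataObject:
    return REGISTRY[guid]

sample = DataObject(
    uri="s3://lab-bucket/run42/reads.fastq.gz",  # hypothetical path
    metadata={"assay": "RNA-seq", "organism": "Homo sapiens"},
)
guid = register(sample)
assert resolve(guid).uri == sample.uri
```

Because the ID is minted and the metadata recorded in the same step, no object can enter the registry undescribed.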
Data Integrity at the Origin
Applying ML algorithms effectively requires the highest level of data integrity.
https://github.com/lyft/amundsen
Data objects should come with metadata that conforms to a dictionary or ontology. A rich data store is harmonized, indexed in various databases, discoverable, and queryable.
Good data hygiene is paramount. Promote upstream integrity of the data objects to empower your downstream analytics.
Automatically infer partial metadata from information in silos.
We see increased usage of graph databases such as Neo4j, alongside scale-first storage systems like Redshift and SciDB.
The best case scenario is high-quality curated datasets for training more accurate models and algorithms.
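Checking metadata against a data dictionary at the origin can be sketched in a few lines. The field names and controlled vocabularies below are illustrative, not drawn from any real ontology; a production system would validate against a published dictionary such as a Gen3 data dictionary.

```python
# A minimal data dictionary: each field maps to a controlled
# vocabulary (a set of allowed values) or an expected type.
DICTIONARY = {
    "organism": {"Homo sapiens", "Mus musculus"},
    "assay": {"RNA-seq", "ATAC-seq", "WGS"},
    "read_length": int,
}

def validate(metadata: dict) -> list[str]:
    """Return a list of integrity problems; empty means clean."""
    problems = []
    for name, rule in DICTIONARY.items():
        if name not in metadata:
            problems.append(f"missing required field: {name}")
        elif isinstance(rule, set) and metadata[name] not in rule:
            problems.append(f"{name}={metadata[name]!r} not in vocabulary")
        elif isinstance(rule, type) and not isinstance(metadata[name], rule):
            problems.append(f"{name} should be {rule.__name__}")
    return problems

clean = {"organism": "Homo sapiens", "assay": "RNA-seq", "read_length": 150}
dirty = {"organism": "human", "assay": "RNA-seq"}
assert validate(clean) == []
assert len(validate(dirty)) == 2  # bad vocabulary term + missing field
```

Rejecting `"human"` in favor of `"Homo sapiens"` at ingest is exactly the upstream hygiene that makes downstream harmonization and indexing possible.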
Common Languages
Controlled Vocabularies, Ontologies, and Data Dictionaries
Cross-functional teams require efficient communication and alignment up and down the chain of command.
Adopt and align around standard semantics, APIs, and formats such as GA4GH, OpenAPI, HL7, and Parquet.
Establish new domain-specific languages to avoid sharp edges.
Choose your programming language wisely. Adopt a language with the broadest compatibility across your tools and platforms. We primarily recommend Python, Go, or JavaScript.
Gen3 Data Dictionary
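Aligning on a standard such as OpenAPI can look like the fragment below: a minimal OpenAPI 3.0 description of a dataset-lookup endpoint. The service, path, and schema fields are illustrative, not any real API.

```yaml
openapi: "3.0.3"
info:
  title: Dataset Lookup API   # hypothetical service
  version: "0.1.0"
paths:
  /datasets/{guid}:
    get:
      summary: Resolve a dataset by its globally unique ID
      parameters:
        - name: guid
          in: path
          required: true
          schema: { type: string }
      responses:
        "200":
          description: Dataset metadata
          content:
            application/json:
              schema:
                type: object
                properties:
                  guid: { type: string }
                  organism: { type: string }
                  assay: { type: string }
```

Because the contract is machine-readable, every team can generate clients, servers, and documentation from the same source of truth.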
Pipelines and IaC
Informatics pipelines are benefiting from advances in software development.
Our team continues to use Ansible playbooks and Chef cookbooks for server configuration, along with Terraform and CloudFormation for cloud provisioning and overall environment integration.
This is even more critical in hybrid cloud scenarios, where significant gaps exist in core infrastructure components.
In AI and ML projects we expect an increase in Kubernetes tooling and frameworks such as Helm and Kubeflow.
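Infrastructure as code in the Terraform style mentioned above can be sketched as below: a bucket for pipeline outputs declared in version-controlled HCL. The bucket name, region, and tags are illustrative; adapt naming and access policies to your environment.

```hcl
# Minimal Terraform sketch: an S3 bucket for pipeline outputs.
terraform {
  required_providers {
    aws = { source = "hashicorp/aws" }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_s3_bucket" "pipeline_results" {
  bucket = "example-lab-pipeline-results"  # hypothetical name
  tags = {
    project    = "rnaseq-pipeline"
    managed_by = "terraform"
  }
}
```

Because the environment is declared rather than hand-built, `terraform plan` shows drift and the same definition reproduces the environment on-premises or in another account.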
Microservices and Micro Frontends
“Serverless” architecture trend creates new design patterns.
https://blog.acolyer.org/2020/03/02/firecracker/
A Berkeley View on Serverless Computing
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-3.pdf
Patterns for Serverless Functions
Data Lakes, internal/robust API, state machines
Event patterns, sidecars, eventual consistency
Formal Foundations of Serverless Computing
Composition and new abstractions focused on reuse
See also: TLA+
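The serverless function pattern above can be sketched as a stateless handler. This follows the AWS Lambda calling convention (event dict in, JSON-serializable dict out); the event shape and the `query` field are illustrative.

```python
import json

def handler(event: dict, context=None) -> dict:
    """A minimal serverless-style function: stateless, event in,
    response out. Durable state lives in a data lake or database,
    never in the function itself."""
    query = event.get("query", "")
    if not query:
        return {"statusCode": 400,
                "body": json.dumps({"error": "query is required"})}
    # Stateless work happens here; this toy handler just echoes.
    return {"statusCode": 200,
            "body": json.dumps({"echo": query.upper()})}

resp = handler({"query": "brca1"})
assert resp["statusCode"] == 200
assert json.loads(resp["body"])["echo"] == "BRCA1"
```

Keeping every function stateless is what makes the event patterns and eventual-consistency designs listed above composable.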
Experiment Tracking and Workspaces
Data science methodology is iterative and requires collaboration.
The Jupyter Project continues to see mainstream adoption as the go-to for computational notebooks and literate programming.
JupyterHub as a multi-user notebook server is the most popular analysis and visualization component among our clients.
Start off with shared spreadsheets or docs in a repo or wiki.
The objective is tracking experimental outcomes, performance, parameters, data provenance, and access control authorizations.
Improve the UX of working with GPUs and accelerators.
See also: Sagemaker, Colab, Nextflow, Cromwell, Tensorboard
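The "start with a shared spreadsheet" advice above can be sketched with the stdlib: one append-only CSV row per run, recording parameters and outcome. Field names are illustrative; teams typically graduate to a tool like Tensorboard or Sagemaker once this outgrows a file.

```python
import csv
import io
from datetime import datetime, timezone

FIELDS = ["timestamp", "run_id", "learning_rate", "epochs", "accuracy"]

def log_run(buffer, run_id: str, params: dict, accuracy: float) -> None:
    """Append one experiment run to the shared log."""
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writerow({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "run_id": run_id,
        "accuracy": accuracy,
        **params,
    })

# In-memory stand-in for a CSV file in a repo or wiki.
log = io.StringIO()
csv.DictWriter(log, fieldnames=FIELDS).writeheader()
log_run(log, "run-001", {"learning_rate": 0.01, "epochs": 10}, accuracy=0.91)
log_run(log, "run-002", {"learning_rate": 0.001, "epochs": 20}, accuracy=0.94)

# The log is now queryable: find the best run so far.
rows = list(csv.DictReader(io.StringIO(log.getvalue())))
best = max(rows, key=lambda r: float(r["accuracy"]))
assert best["run_id"] == "run-002"
```

Even this humble format delivers the core objective: outcomes, parameters, and provenance are recorded once and comparable across the team.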
Continuous Delivery for ML and Data Science
The discipline of bringing DevOps principles and practices to ML.
DevOps teams should bridge the gap between ML training environments and deploying models using CI/CD techniques.
Eliminate manual handoffs between teams and reduce the cycle time between training models and deploying them.
Automate the end-to-end process: versioning, testing, and deployment of the ML components (data, model, and code).
There is a trend toward model explainability as a selection criterion.
An explainable model allows us to say how a decision was made.
Critical to understanding fundamental biology and chemistry.
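Versioning the three ML components named above (data, model, and code) can be sketched with content hashes: each artifact gets a fingerprint, and the release ID is the hash of all three, so a change to any one produces a new release. The artifact contents and manifest shape are illustrative; real pipelines use tools like DVC or Git for the same idea.

```python
import hashlib
import json

def fingerprint(payload: bytes) -> str:
    """Short content hash used as a version identifier."""
    return hashlib.sha256(payload).hexdigest()[:12]

# Toy stand-ins for the three ML artifacts.
data = b"sample,label\nACGT,1\nTTAG,0\n"
model = b"\x00\x01fake-model-weights"
code = b"def predict(x): ..."

manifest = {
    "data": fingerprint(data),
    "model": fingerprint(model),
    "code": fingerprint(code),
}
# The release ID covers all three, so CI/CD can tell exactly
# which data, model, and code produced a deployment.
release_id = fingerprint(json.dumps(manifest, sort_keys=True).encode())

# Changing any one artifact changes the release ID.
manifest2 = dict(manifest, data=fingerprint(data + b"GGCC,1\n"))
release_id2 = fingerprint(json.dumps(manifest2, sort_keys=True).encode())
assert release_id != release_id2
```

Tying the release ID to all three artifacts is what eliminates the "which data was this model trained on?" handoff between teams.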
The 10x Engineer Pitfall
The “Unicorn” AI or ML specialist is a red flag that should be avoided. Data Science is a Team Sport!
Teams of expert generalists with solid leadership principles are the most successful.
Diversity is key in high-performance teams.
Recruit people with mixed talent and experience.
Include clinicians, lawyers, and other outside expertise.
Continuous learning and improvement.
Every member of the team has an opportunity to lead.
Requires discipline at first and strong communication.
Check your ego, work hard, and put the team first.