Building Data Ecosystems for Accelerated Discovery
April 29, 2020
Adam Kraut
@adamkraut
adam@bioteam.net
2|
The BioTeam
Virtual company founded in 2002
Staffed by scientists turned technologists
Technology agnostic and vendor independent
Pioneers of open-source distributed computing
Translate scientific drivers into innovative solutions
Providing strategic guidance and deep collaboration
Assess > Design > Build > Implement > Train > Support
BioTeam is independent and committed to Science
3|
The Central Problems
Our primary mission is to solve complex problems at the
intersection of science, technology, and data
Most of our clients are struggling with central problems:
Science is changing faster than IT
Advanced infrastructure increases complexity
Distributed data is difficult to manage at scale
Our data is not findable
Our data is not accessible
Our data is not interoperable
Our data is not reusable
4|
The Data Ecosystem
A data ecosystem is a set of infrastructure and services that
empowers a community of scientists and engineers.
Key features of a healthy Life Sciences data ecosystem:
Data Discoverability
Data Integrity at the Origin
Common Languages
Pipelines and Infrastructure as Code
Microservices and frontends
Experiment tracking and shared Workspaces
Continuous Delivery mindset for ML and Discovery
5|
Science at the Speed of Light
Science is rate limited by our ability to generate and test a
hypothesis
Consider the foundational layers of your ecosystem. Primarily we
look at the Science Network to understand the data movement
challenges and access patterns.
We recommend you plan ahead and have faster data paths
between lab instruments generating data and your analysis tools.
Bring compute to the data and data to the compute.
In a worst-case scenario, inferior networking halts experiments in
progress and destroys your potential.
In a best-case scenario, you have a loss-free, high-speed network
designed to match the capabilities and capacities of your science.
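As a rough illustration of why the data path between instruments and analysis tools matters, the sketch below estimates how long a dataset takes to cross a link. The dataset size, link speeds, and the efficiency factor are hypothetical values chosen for illustration, not figures from this talk.

```python
def transfer_hours(size_tb: float, link_gbps: float, efficiency: float = 0.8) -> float:
    """Estimate the hours needed to move size_tb terabytes over a link.

    efficiency is an assumed fraction of line rate actually achieved
    (protocol overhead, packet loss, disk speed); 0.8 is a placeholder.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be > 0 and efficiency in (0, 1]")
    bits = size_tb * 1e12 * 8                      # decimal terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical 10 TB instrument run over three link speeds.
    for gbps in (1, 10, 100):
        print(f"{gbps:>3} Gb/s: {transfer_hours(10, gbps):.1f} h")
```

Under these assumptions a 10 TB run needs more than a day at 1 Gb/s and roughly 17 minutes at 100 Gb/s, which is the gap between a stalled experiment and a usable one.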
photo: Ann Lingard
6|
Data Discoverability
The primary goal of a data scientist is to locate data, make
sense of it, and evaluate whether it is trustworthy.
Datasets often diverge into silos which become problematic.
Human nature creates silos.
Applications and databases create silos.
Businesses and geography create silos.
Searching and finding data is usually our primary objective.
Assessing the quality is a secondary supporting objective.
Need: Globally Unique IDs and resource resolver services.
Need: Defined metadata at the point of data instantiation.
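The two needs above can be sketched together: mint a globally unique ID when a data object is created, refuse to register it without its metadata, and resolve the ID back to a location later. This is a minimal in-memory sketch; the `Resolver` class, the required field names, and the example location are hypothetical and stand in for a real resolver service.

```python
import uuid
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DataRecord:
    guid: str
    location: str      # where the bytes live (URL or path)
    metadata: dict
    created: str       # ISO 8601 timestamp, UTC


class Resolver:
    """Toy in-memory resolver that maps a GUID to its record."""

    REQUIRED = ("instrument", "sample_id")   # hypothetical required fields

    def __init__(self):
        self._records = {}

    def register(self, location: str, metadata: dict) -> str:
        # Enforce metadata at the point of instantiation, not afterwards.
        missing = [key for key in self.REQUIRED if key not in metadata]
        if missing:
            raise ValueError(f"missing metadata at creation: {missing}")
        guid = str(uuid.uuid4())
        created = datetime.now(timezone.utc).isoformat()
        self._records[guid] = DataRecord(guid, location, dict(metadata), created)
        return guid

    def resolve(self, guid: str) -> DataRecord:
        return self._records[guid]
```

A caller would keep only the returned GUID and ask the resolver for the current location, so data can move between stores without breaking references.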
7|
Data Integrity at the Origin
Applying ML algorithms requires the highest level of data
integrity to be effective.
https://github.com/lyft/amundsen
Data objects should come with metadata that conforms to a
dictionary or ontology. A rich data store is harmonized, indexed
in various databases, discoverable, and queryable.
Good data hygiene is paramount. Promote upstream integrity of
the data objects to empower your downstream analytics.
Automatically infer partial metadata from information in silos.
We see increased usage of graph databases such as Neo4j and
other scale-first storage systems like Redshift and SciDB.
The best case scenario is high-quality curated datasets for training
more accurate models and algorithms.
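One way to promote integrity upstream is to check each data object against a data dictionary and record a checksum before it leaves the instrument or lab. The sketch below assumes a tiny hypothetical dictionary (`DATA_DICTIONARY`) with three fields; a real dictionary or ontology would be far richer.

```python
import hashlib

# Hypothetical data dictionary: field name -> (type, allowed values or None).
DATA_DICTIONARY = {
    "sample_id": (str, None),
    "tissue": (str, {"blood", "liver", "lung"}),
    "read_length": (int, None),
}


def validate(metadata: dict) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, (expected_type, allowed) in DATA_DICTIONARY.items():
        if name not in metadata:
            problems.append(f"missing field: {name}")
            continue
        value = metadata[name]
        if not isinstance(value, expected_type):
            problems.append(f"{name}: expected {expected_type.__name__}")
        elif allowed is not None and value not in allowed:
            problems.append(f"{name}: {value!r} not in dictionary")
    for name in metadata:
        if name not in DATA_DICTIONARY:
            problems.append(f"unknown field: {name}")
    return problems


def checksum(payload: bytes) -> str:
    """SHA-256 digest recorded at the origin so later copies can be verified."""
    return hashlib.sha256(payload).hexdigest()
```

Rejecting or flagging a record at this stage is cheaper than discovering the problem after it has been copied into several downstream stores.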
8|
Common Languages
Controlled Vocabularies, Ontologies, and Data Dictionaries
Cross-functional teams require more efficient communication and
alignment up and down the chain of command.
Adopt and align around standard semantics, APIs, and formats
such as GA4GH, OpenAPI, HL7, Parquet.
Establish new domain-specific languages to avoid sharp edges.
Choose a programming language wisely. Adopt a language with the
broadest compatibility across your tools and platforms.
We primarily recommend Python, Go, or JavaScript.
Gen3 Data Dictionary
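A controlled vocabulary can be as simple as a mapping from the labels people actually type to one preferred term. The sketch below is illustrative only: the `VOCABULARY` contents and the `normalize` helper are hypothetical and not taken from Gen3 or any standard named above.

```python
# Hypothetical controlled vocabulary: preferred term -> accepted synonyms.
VOCABULARY = {
    "homo sapiens": {"human", "h. sapiens", "homo sapiens"},
    "mus musculus": {"mouse", "m. musculus", "mus musculus"},
}

# Invert once so each lookup is a single dictionary access.
_SYNONYM_TO_TERM = {
    synonym: term for term, synonyms in VOCABULARY.items() for synonym in synonyms
}


def normalize(label: str) -> str:
    """Map a free-text label to its preferred term, or raise if unknown."""
    key = " ".join(label.lower().split())   # case-fold and collapse whitespace
    try:
        return _SYNONYM_TO_TERM[key]
    except KeyError:
        raise ValueError(f"{label!r} is not in the controlled vocabulary") from None
```

Raising on unknown labels, rather than passing them through, keeps new free-text variants from silently entering the data store.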
9|
Pipelines and IaC
Informatics pipelines are benefitting from advances in software development
Our team continues to use Ansible playbooks and Chef
cookbooks for server configuration, along with Terraform
and CloudFormation for cloud provisioning and overall
environment integration.
This is even more critical in Hybrid Cloud scenarios where
significant gaps exist in core infrastructure components.
In AI and ML projects we expect an increase in Kubernetes tooling
and frameworks such as Helm and Kubeflow.
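The common idea behind these tools is that the pipeline or environment is declared as data and an engine works out the execution order. The sketch below shows that idea with the Python standard library only; the step names are hypothetical, and it is not how Ansible, Terraform, or Kubeflow are implemented.

```python
from graphlib import TopologicalSorter   # standard library, Python 3.9+

# Hypothetical pipeline: each step lists the steps it depends on.
PIPELINE = {
    "fetch_reads": set(),
    "quality_control": {"fetch_reads"},
    "align": {"quality_control"},
    "call_variants": {"align"},
    "report": {"quality_control", "call_variants"},
}


def execution_order(pipeline: dict) -> list:
    """Return an order in which every step runs after its dependencies."""
    return list(TopologicalSorter(pipeline).static_order())


def run(pipeline: dict, actions: dict) -> list:
    """Run each step's action in dependency order; return the steps run."""
    done = []
    for step in execution_order(pipeline):
        actions[step]()
        done.append(step)
    return done
```

Because the declaration lives in version control, a change to the pipeline can be reviewed and tested like any other code change.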
10|
Microservices and micro frontends
“Serverless” architecture trend creates new design patterns.
https://blog.acolyer.org/2020/03/02/firecracker/
A Berkeley View on Serverless Computing
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-3.pdf
Patterns for Serverless Functions
Data Lakes, internal/robust API, state machines
Event patterns, sidecars, eventual consistency
Formal Foundations of Serverless Computing
Composition and new abstractions focused on reuse
See also: TLA+
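Two of the patterns listed above, state machines and event handling under eventual consistency, can be sketched as a function that is safe to invoke more than once with the same event. The states, actions, and the `Store` class are hypothetical; in practice the store would be a durable service rather than an in-memory object.

```python
# Hypothetical workflow: (current state, action) -> next state.
TRANSITIONS = {
    ("uploaded", "validate"): "validated",
    ("validated", "index"): "indexed",
}


class Store:
    """Stand-in for a durable key-value store shared by function invocations."""

    def __init__(self):
        self.state = {}      # object id -> current state
        self.seen = set()    # event ids that were already applied


def handle(store: Store, event: dict) -> str:
    """Apply one event; safe to call again if the platform redelivers it."""
    obj = event["object_id"]
    current = store.state.get(obj, "uploaded")
    if event["event_id"] in store.seen:
        return current                       # duplicate delivery: no-op
    next_state = TRANSITIONS.get((current, event["action"]))
    if next_state is None:
        raise ValueError(f"{event['action']!r} not allowed from {current!r}")
    store.state[obj] = next_state
    store.seen.add(event["event_id"])
    return next_state
```

Making the handler idempotent is a design choice that tolerates at-least-once delivery, which is a common assumption for event-driven functions.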
11|
Experiment Tracking and Workspaces
Data science methodology is iterative and requires
collaboration
Project Jupyter continues to see mainstream adoption as a go-to for
computational notebooks and literate programming.
JupyterHub as a multi-user notebook server is the most popular
analysis and visualization component among our clients.
Start off with shared spreadsheets or docs in a repo or wiki.
The objective is tracking experimental outcomes, performance,
parameters, data provenance, and access control authorizations.
Improving the UX of using GPUs and Accelerators.
See also: SageMaker, Colab, Nextflow, Cromwell, TensorBoard
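One step up from a shared spreadsheet is an append-only log that a script writes for every run. The sketch below records parameters, metrics, the user, and a hash of the input data in a JSON Lines file; the function names and record fields are hypothetical, and none of the tools named above are used.

```python
import hashlib
import json
import time


def log_run(path, params: dict, metrics: dict, data: bytes, user: str) -> dict:
    """Append one experiment record to a JSON Lines file and return it."""
    record = {
        "timestamp": time.time(),
        "user": user,
        "params": params,
        "metrics": metrics,
        # Hash of the input data ties the result to the exact bytes used.
        "data_sha256": hashlib.sha256(data).hexdigest(),
    }
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def best_run(path, metric: str) -> dict:
    """Return the logged record with the highest value of the given metric."""
    with open(path, encoding="utf-8") as handle:
        records = [json.loads(line) for line in handle if line.strip()]
    return max(records, key=lambda record: record["metrics"][metric])
```

A plain text log like this can live in the same repo or wiki as the team's notes and be replaced by a dedicated tracking service later.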
12|
Continuous Delivery for ML and Data Science
Discipline of bringing DevOps principles and practices to ML
DevOps teams should bridge the gap between ML training
environments and deploying models using CI/CD techniques.
Eliminate manual handoffs between teams, reduce cycle time
between training models and deploying them.
Automate the end-to-end process. Versioning, Testing,
Deployments of ML components: data, model, and code.
Trend towards explainability of models as selection criteria.
An explainable model allows us to say how a decision was made.
Critical to understanding fundamental biology and chemistry.
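The versioning and testing steps above can be sketched as two small functions that a CI/CD job might call: one derives a single version ID from the data, model, and code, and one decides whether a candidate model may be promoted. The function names, the `tests_passed` and `accuracy` fields, and the threshold are hypothetical.

```python
import hashlib


def fingerprint(data: bytes, model: bytes, code: bytes) -> str:
    """One version ID covering the three ML components: data, model, code."""
    digest = hashlib.sha256()
    for part in (data, model, code):
        # Hash each part first so component boundaries cannot blur together.
        digest.update(hashlib.sha256(part).digest())
    return digest.hexdigest()[:12]


def should_deploy(candidate: dict, baseline: dict, min_gain: float = 0.0) -> bool:
    """Automated gate: promote only if tests pass and accuracy does not regress."""
    return bool(
        candidate["tests_passed"]
        and candidate["accuracy"] >= baseline["accuracy"] + min_gain
    )
```

Replacing a manual handoff with a gate like this makes the promotion rule explicit and repeatable, and the fingerprint records exactly which data, model, and code were deployed.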
13|
The 10x Engineer pitfall
The “Unicorn” AI or ML specialist is a red flag that should be
avoided. Data Science is a Team Sport!
Teams of expert generalists with solid leadership principles
are the most successful.
Diversity is key in high-performance teams.
Recruit people with mixed talent and experience.
Include clinicians, lawyers, and other outside expertise.
Continuous learning and improvement.
Every member of the team has an opportunity to lead.
Requires discipline at first and strong communication.
Check your ego, work hard, and put the team first.
Thanks!

Building Data Ecosystems for Accelerated Discovery

  • 1. Building Data Ecosystems for Accelerated Discovery. April 29, 2020. Adam Kraut, @adamkraut, adam@bioteam.net
  • 2. The BioTeam: Virtual company founded in 2002. Staffed by scientists turned technologists. Technology agnostic and vendor independent. Pioneers of open-source distributed computing. Translate scientific drivers into innovative solutions. Providing strategic guidance and deep collaboration. Assess > Design > Build > Implement > Train > Support. BioTeam is independent and committed to Science.
  • 3. The Central Problems: Our primary mission is to solve complex problems at the intersection of science, technology, and data. Most of our clients are struggling with central problems: science is changing faster than IT; advanced infrastructure increases complexity; distributed data is difficult to manage at scale; our data is not findable; our data is not accessible; our data is not interoperable; our data is not reusable.
  • 4. The Data Ecosystem: A data ecosystem is a set of infrastructure and services that empowers a community of scientists and engineers. Key features of a healthy Life Sciences data ecosystem: Data Discoverability; Data Integrity at the Origin; Common Languages; Pipelines and Infrastructure as Code; Microservices and frontends; Experiment tracking and shared Workspaces; Continuous Delivery mindset for ML and Discovery.
  • 5. Science at the Speed of Light: Science is rate-limited by our ability to generate and test a hypothesis. Consider the foundational layers of your ecosystem. Primarily we look at the Science Network to understand the data movement challenges and access patterns. We recommend you plan ahead and have faster data paths between the lab instruments generating data and your analysis tools. Bring compute to the data and data to the compute. In a worst-case scenario, you actually halt experiments in progress and destroy your potential with inferior networking. In a best-case scenario, you have a loss-free high-speed network designed to match the capabilities and capacities of your science. Photo: Ann Lingard
  • 6. Data Discoverability: The primary goal of a data scientist is to locate data, make sense of it, and evaluate whether it is trustworthy. Datasets often diverge into silos, which become problematic. Human nature creates silos. Applications and databases create silos. Businesses and geography create silos. Searching and finding data is usually our primary objective. Assessing the quality is a secondary supporting objective. Need: Globally Unique IDs and resource resolver services. Need: Defined metadata at the point of data instantiation.
  • 7. Data Integrity at the Origin: Applying ML algorithms requires the highest level of data integrity to be effective. Data objects should come with metadata that conforms to a dictionary or ontology. A rich data store is harmonized, indexed in various databases, discoverable, and queryable. Good data hygiene is paramount. Promote upstream integrity of the data objects to empower your downstream analytics. Automatically infer partial metadata from information in silos. We see an increased usage of graph databases such as Neo4j and other scale-first storage systems like Redshift and SciDB. The best-case scenario is high-quality curated datasets for training more accurate models and algorithms. See: https://github.com/lyft/amundsen
  • 8. Common Languages: Controlled Vocabularies, Ontologies, and Data Dictionaries. Cross-functional teams require more efficient communication and alignment up and down the chain of command. Adopt and align around standard semantics, APIs, and formats such as GA4GH, OpenAPI, HL7, and Parquet. Establish new domain-specific languages to avoid sharp edges. Choose your programming language wisely. Adopt a language with the broadest compatibility across your tools and platforms. We primarily recommend Python, Go, or JavaScript. Example: Gen3 Data Dictionary.
  • 9. Pipelines and IaC: Informatics pipelines are benefiting from advances in software development. Our team continues to use Ansible playbooks and Chef cookbooks for server configuration, along with Terraform and CloudFormation for cloud provisioning and overall environment integration. This is even more critical in Hybrid Cloud scenarios where significant gaps exist in core infrastructure components. In AI and ML projects we expect an increase in Kubernetes tooling and frameworks such as Helm and Kubeflow.
  • 10. Microservices and micro frontends: The “Serverless” architecture trend creates new design patterns. See: https://blog.acolyer.org/2020/03/02/firecracker/ A Berkeley View on Serverless Computing: https://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-3.pdf Patterns for Serverless Functions: data lakes, internal/robust API, state machines, event patterns, sidecars, eventual consistency. Formal Foundations of Serverless Computing: composition and new abstractions focused on reuse. See also: TLA+
  • 11. Experiment Tracking and Workspaces: Data science methodology is iterative and requires collaboration. The Jupyter Project continues to see mainstream adoption as a go-to for computational notebooks and literate programming. JupyterHub as a multi-user notebook server is the most popular analysis and visualization component among our clients. Start off with shared spreadsheets or docs in a repo or wiki. The objective is tracking experimental outcomes, performance, parameters, data provenance, and access control authorizations. Improving the UX of using GPUs and accelerators is a further goal. See also: SageMaker, Colab, Nextflow, Cromwell, TensorBoard
  • 12. Continuous Delivery for ML and Data Science: The discipline of bringing DevOps principles and practices to ML. DevOps teams should bridge the gap between ML training environments and deploying models using CI/CD techniques. Eliminate manual handoffs between teams, and reduce cycle time between training models and deploying them. Automate the end-to-end process: versioning, testing, and deployment of ML components (data, model, and code). There is a trend towards explainability of models as a selection criterion. An explainable model allows us to say how a decision was made. This is critical to understanding fundamental biology and chemistry.
  • 13. The 10x Engineer pitfall: The “Unicorn” AI or ML specialist is a red flag that should be avoided. Data Science is a Team Sport! Teams of expert generalists with solid leadership principles are the most successful. Diversity is key in high-performance teams. Recruit people with mixed talent and experience. Include clinicians, lawyers, and other outside expertise. Continuous learning and improvement. Every member of the team has an opportunity to lead. Requires discipline at first and strong communication. Check your ego, work hard, and put the team first.
  • 14. Thanks! April 29, 2020. Adam Kraut, @adamkraut, adam@bioteam.net
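
Slides 6 and 7 name three needs: globally unique IDs, a resolver service, and metadata that conforms to a dictionary at the point of data instantiation. The Python below is a minimal sketch of how those pieces might fit together, not the speaker's implementation. The dictionary fields, the in-memory `registry`, and the function names are all hypothetical; a real ecosystem would presumably back this with a persistent resolver service and a shared ontology.

```python
import uuid
from datetime import datetime, timezone

# Hypothetical data dictionary: field name -> (required type, controlled vocabulary or None).
DATA_DICTIONARY = {
    "sample_type": (str, {"blood", "tissue", "saliva"}),
    "instrument": (str, None),
    "read_count": (int, None),
}


def validate_metadata(metadata):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for field, (expected_type, allowed) in DATA_DICTIONARY.items():
        if field not in metadata:
            problems.append(f"missing field: {field}")
            continue
        value = metadata[field]
        if not isinstance(value, expected_type):
            problems.append(f"{field}: expected {expected_type.__name__}")
        elif allowed is not None and value not in allowed:
            problems.append(f"{field}: {value!r} not in controlled vocabulary")
    # Reject fields the dictionary does not define, so silos cannot invent their own.
    for field in metadata:
        if field not in DATA_DICTIONARY:
            problems.append(f"unknown field: {field}")
    return problems


def register_data_object(metadata, registry):
    """Validate metadata at creation time, then mint a globally unique ID for it."""
    problems = validate_metadata(metadata)
    if problems:
        raise ValueError("; ".join(problems))
    guid = str(uuid.uuid4())
    registry[guid] = {
        "metadata": dict(metadata),
        "created": datetime.now(timezone.utc).isoformat(),
    }
    return guid


def resolve(guid, registry):
    """Stand-in for a resolver service: look up a data object's record by ID."""
    return registry[guid]
```

Used this way, a record either enters the registry with a resolvable ID and conforming metadata, or it is rejected at the origin before it can reach downstream analytics:

```python
registry = {}
guid = register_data_object(
    {"sample_type": "blood", "instrument": "seq-01", "read_count": 1000}, registry
)
record = resolve(guid, registry)
```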