Integrating large, fast-moving, and
heterogeneous data sets in biology.


              C. Titus Brown
            Asst Prof, CSE and
               Microbiology;
           BEACON NSF STC
         Michigan State University
              ctb@msu.edu
Introduction
 Background:
   Modeling & data analysis undergrad =>
   Open source software development + software
    engineering +
   developmental biology + genomics PhD =>
   Bio + computer science faculty =>
   Data driven biology


 Currently working with next-gen sequencing data
  (mRNAseq, metagenomics, difficult genomes).
 Thinking hard about how to do data-driven
  modeling & model-driven data analysis.
Goal & outline
     Address challenges and opportunities of
   heterogeneous data integration: 1000 ft view.

Outline:
 What types of analysis and discovery do we want
  to enable?
 What are the technical challenges, common
  solutions, and common failure points?
 Where might we look for success stories, and
  what lessons can we port to biology?
 My conclusions.
Specific types of questions
 “I have a known chemical/gene interaction; do I see it
  in this other data set?”
 “I have a known chemical/gene interaction; what other
  gene expression is affected?”
 “What does chemical X do to overall phenotype, effect
  on gene expression, altered protein localization, and
  patterns of histone modification?”
 More complex/combinatorial interactions:
   What does this chemical do in this genetic background?
   What kind of additional gene expression changes are
    generated by the combination of these two chemicals?
   What are common effects of this class of chemicals?
What general behavior do we want to
enable?
 Reuse of data by groups that did not/could not
  produce it.

 Publication of reusable/“fork”able data analysis
  pipelines and models.

 Integration of data and models.


 Serendipitous uses and cross-referencing of data sets
  (“mashups”).

 Rapid scientific exploration and hypothesis generation
  in data space.
(Executable papers & data reuse)
 ENCODE
   All data is available; all processing scripts for
   papers are available on a virtual machine.

 QIIME (microbial ecology)
  Amazon virtual machine containing software and data
  for:
  “Collaborative cloud-enabled tools allow rapid,
  reproducible biological insights.” (pmid 23096404)

 Digital normalization paper
  Amazon virtual machine, again:
  http://arxiv.org/abs/1203.4802
Executable papers can support easy
replication & reuse of code, data.


(IPython Notebook; also see RStudio)
http://ged.msu.edu/papers/2012-diginorm/notebook/
An entertaining digression --
  A mashup of Facebook “top 10 books by college” and per-college SAT rank




                                   http://booksthatmakeyoudumb.virgil.gr/
Technical obstacles
 Syntactic incompatibility
   The first 90% of bioinformatics: your IDs are different
    from my IDs.
 Semantic incompatibility
   The second 90% of bioinformatics: what does “gene”
    mean in your database?
 Impedance mismatch
   SQL is notoriously bad at representing intervals and
    hierarchies
   Genomes consist of intervals; ontologies consist of
    hierarchies!
   …SQL databases dominate (vs graph or object DBs).
 Data volume & velocity
   Large & expanding data sets just make everything
    harder.
 Unstructured data
   aka “publications” – most scientific knowledge is “locked
Typical solutions
 “Entity resolution”
    Accession numbers or other common identifiers
  …requires global naming system OR translators.

 Top down imposition of structure
   Centralized DB;
   “Here is the schema you will all use”;
  …limits flexibility, prevents use of unstructured data, heavyweight.

 Ontologies to enable “correct” communication
   Centrally coordinated vocabulary
  …slow, hard to get right, doesn’t solve unstructured data problem.
  Balancing theoretical rigor with practical applicability is particularly
  hard.

 Ad hoc entity resolution (“winging it”)
   Common solution
  …doesn’t work that well.
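As a rough illustration of the "translator" flavor of entity resolution, here is a minimal sketch in Python. The lookup table, the ID formats, and the function names are all hypothetical, invented for this example. Note that IDs with no translation are reported rather than silently dropped, which is where ad hoc approaches tend to go wrong.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical mapping from one database's gene IDs to another's.
# Every identifier here is invented for illustration.
ID_TRANSLATION: Dict[str, str] = {
    "labA:g001": "labB:GENE_17",
    "labA:g002": "labB:GENE_42",
}


def translate_id(source_id: str) -> Optional[str]:
    """Return the matching ID in the other database, or None if unmapped."""
    return ID_TRANSLATION.get(source_id.strip())


def join_on_translated_ids(
    a: Dict[str, float], b: Dict[str, float]
) -> Tuple[Dict[str, Tuple[float, float]], List[str]]:
    """Pair values from two data sets and list the IDs that did not resolve."""
    matched: Dict[str, Tuple[float, float]] = {}
    unresolved: List[str] = []
    for source_id, value in a.items():
        target_id = translate_id(source_id)
        if target_id is not None and target_id in b:
            matched[target_id] = (value, b[target_id])
        else:
            unresolved.append(source_id)
    return matched, unresolved


expr_a = {"labA:g001": 2.5, "labA:g002": 0.4, "labA:g003": 1.1}
expr_b = {"labB:GENE_17": 2.1, "labB:GENE_42": 0.6}
matched, unresolved = join_on_translated_ids(expr_a, expr_b)
```

In practice the translation table is the hard part: it has to be built, curated, and kept in sync with both databases.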
Are better standards the
solution?




                       http://xkcd.com/927/
Rephrasing technical goals
How can we best provide a platform or platforms to
    support flexible data integration and data
investigation across a wide range of data sets and
               data types in biology?


My interests:
 Avoid master data manager and centralization
 Support federated roll-out of new data and
  functionality
 Provide flexible extensibility of ontologies and
  hierarchies
 Support diverse “ecology” of databases,
Success stories outside of
biology?
 Look for domains:
   with really large amounts of heterogeneous data,
   that are continually increasing in size,
   that are being effectively mined on an ongoing basis,
   that have widely used programmatic interfaces that
    support “mashups” and other cross-database stuff,
   and that are intentional, with principles that we can
    steal or adapt.
                        Amazon.
Amazon:
 > 50 million users, > 1 million product partners,
    billions of reviews, dozens of compute services …
   Continually changing/updating data sets.
   Explicitly adopted a service-oriented architecture
    that enables both internal and external use of this
    data.
   For example, the amazon.com Web site is itself
    built from over 150 independent services…
   Amazon routinely deploys new services and
    functionality.
Sources:
The Platform Rant (Steve Yegge) -- in which he
compares the Google and Amazon approaches:
https://plus.google.com/112678702228711889851/posts/eVeouesvaVX

A summary at HighScalability.com:
http://highscalability.com/amazon-architecture

 (They are both long and tech-y, note, but the first
            is especially entertaining.)
A brief summary of core
principles
Mandates from the CEO:

1. All teams must expose data and functionality
   solely through a service interface.
2. All communication between teams happens
   through that service interface.
3. All service interfaces must be designed so that
   they can be exposed to the outside world.
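One way to picture these mandates in a research setting is the sketch below. It is only an illustration under assumed names: `ExpressionService`, `InMemoryExpressionService`, and `top_gene` are hypothetical and do not come from Amazon or from any existing biology platform. The point is that the consumer code touches nothing but the published interface, so the same code would work for an internal team or an outside user.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class ExpressionService(ABC):
    """The only route by which other teams, or outside users, reach the data."""

    @abstractmethod
    def genes(self) -> List[str]:
        """List the gene IDs this service knows about."""

    @abstractmethod
    def expression(self, gene_id: str) -> float:
        """Return the expression value for one gene."""


class InMemoryExpressionService(ExpressionService):
    """Toy implementation; the storage is private and never handed out."""

    def __init__(self, table: Dict[str, float]) -> None:
        self._table = dict(table)

    def genes(self) -> List[str]:
        return sorted(self._table)

    def expression(self, gene_id: str) -> float:
        return self._table[gene_id]


def top_gene(service: ExpressionService) -> str:
    """A consumer written only against the public interface."""
    return max(service.genes(), key=service.expression)


service = InMemoryExpressionService({"geneA": 1.0, "geneB": 3.5, "geneC": 2.0})
best = top_gene(service)
```

Swapping the in-memory table for a remote database would not change `top_gene` at all, which is the practical payoff of the mandates.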
More colloquially:
         “You should eat your own dogfood.”

  Design and implement the database and database
functionality to meet your own needs; and only use the
    functionality you’ve explicitly made available to
                       everyone.

To adapt to research: database functionality should be
designed in tight integration with researchers who are
      using it, both at a user interface level and
                    programmatically.

(Genome databases have done a really good job of this,
      albeit generally in a centralized model.)
If the “customers” aren’t integrated
into the development loop:
A platform view?
[Diagram of platform components. Services: metabolic model; differential gene expression query; data exploration (WWW); gene ID translator; chemical relationships; expression normalization; isoform resolution/comparison. Data sources: expression data (tiling); expression data (microarray); expression data (mRNAseq); expression data II (mRNAseq).]
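A platform of small services could be composed along these lines. This is a hedged sketch, not a real pipeline: `normalize` is a stand-in that only rescales values to sum to 1, the two-fold threshold is arbitrary, the gene names and numbers are invented, and control values are assumed to be nonzero.

```python
from typing import Dict

Expression = Dict[str, float]


def normalize(raw: Expression) -> Expression:
    """Rescale values so they sum to 1 (a stand-in for real normalization)."""
    total = sum(raw.values())
    return {gene: value / total for gene, value in raw.items()}


def differential_expression(
    control: Expression, treated: Expression, min_ratio: float = 2.0
) -> Dict[str, float]:
    """Return treated/control ratios for genes that changed at least min_ratio-fold.

    Only genes present in both inputs are compared; control values are
    assumed to be nonzero.
    """
    changed: Dict[str, float] = {}
    for gene in control.keys() & treated.keys():
        ratio = treated[gene] / control[gene]
        if ratio >= min_ratio or ratio <= 1.0 / min_ratio:
            changed[gene] = ratio
    return changed


# Two hypothetical data sources on very different raw scales.
control_source = {"geneA": 10.0, "geneB": 10.0, "geneC": 20.0}
treated_source = {"geneA": 300.0, "geneB": 100.0, "geneC": 200.0}

result = differential_expression(
    normalize(control_source), normalize(treated_source)
)
```

Each function could sit behind its own service, so that a new data source only has to feed the normalization step to become queryable.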
A few points
 Open source and agile software development
 approaches can be surprisingly effective and
 inexpensive.

 Developing services in small groups that include
 “customer-facing developers” helps ensure utility.

 Implementing services in the “cloud” (e.g. virtual
 machines, or on top of “infrastructure as a
 service” services) gives developer flexibility in
 tools, approaches, implementation; also enables
 scaling and reusability.
Combining modelling with data
 Data-driven modeling: connections and parameters
  can be, to some extent, determined from data.

 Model-driven data investigation: data that doesn’t fit
  the “known” model is particularly interesting.

The second approach is essentially how particle
physicists work with accelerator data: build a model &
then interpret the data using the model.

(In biology, models are less constraining, though; more
                      unknowns.)
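Model-driven data investigation can be sketched as "predict, compare, and flag what does not fit". The linear dose-response model, the tolerance, and the observations below are all hypothetical, chosen only to show the pattern.

```python
from typing import Callable, List, Tuple

Observation = Tuple[float, float]  # (dose, measured response)


def linear_model(dose: float) -> float:
    """Hypothetical 'known' model: response rises linearly with dose."""
    return 1.0 + 0.5 * dose


def surprising_points(
    observations: List[Observation],
    model: Callable[[float], float],
    tolerance: float,
) -> List[Observation]:
    """Return the observations that differ from the model by more than tolerance."""
    return [(x, y) for x, y in observations if abs(y - model(x)) > tolerance]


observations = [(0.0, 1.1), (1.0, 1.4), (2.0, 3.9), (3.0, 2.6)]
flagged = surprising_points(observations, linear_model, tolerance=0.5)
```

The flagged points are the ones worth a closer look: either the measurement is off or the model is missing something.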
Using developmental models




             Davidson et al., http://sugp.caltech.edu/endomes
Using developmental models


    Models can contain useful abstractions of
specific processes; here, the direct effects of
 blocking nuclearization of β-catenin can be
   predicted by following the connections.




Models provide a common language for (dis)agreement
                    within a community.
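"Following the connections" can be sketched as a walk over a directed graph. The network below is a toy with invented node names; it is not the network from the figure, and real regulatory models also carry activation/repression logic that this sketch ignores.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical regulatory network: each key regulates the genes in its list.
NETWORK: Dict[str, List[str]] = {
    "inputA": ["geneB", "geneC"],
    "geneB": ["geneD"],
    "geneC": ["geneD", "geneE"],
    "geneD": [],
    "geneE": [],
    "geneF": ["geneE"],
}


def direct_targets(network: Dict[str, List[str]], node: str) -> Set[str]:
    """Genes immediately regulated by node."""
    return set(network.get(node, []))


def downstream(network: Dict[str, List[str]], node: str) -> Set[str]:
    """All genes reachable from node, found by breadth-first traversal."""
    seen: Set[str] = set()
    queue = deque(network.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(network.get(current, []))
    return seen


first_order = direct_targets(NETWORK, "inputA")
affected = downstream(NETWORK, "inputA")
```

Blocking `inputA` in this toy model would be predicted to affect `geneB` and `geneC` directly, and `geneD` and `geneE` indirectly; `geneF` is untouched.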
Social obstacles
 Training of biologically aware software developers
 is lacking.

 Molecular biologists are still very much of a
 computationally naïve mindset: “give me the
 answer so I can do the real work”

 Incentives for data sharing, much less useful
 data sharing, are not yet very strong.
   Pubs, grants, respect...


 Patterns for useful data sharing are still not well
 understood, in general.
Other places to look
 NEON and other NSF centers (e.g. NCEAS) are
 collecting vast heterogeneous data sets, and are
 explicitly tackling the data
 management/use/integration/reuse problem.

 SBML (“Systems Biology Markup Language”) is a
 modeling descriptive language that enables
 interoperability of modeling software.

 Software Carpentry runs free workshops on
 effective use of computation for science.
My conclusions…
 We need a “platform” mentality to make the most use
  of our data, even if we don’t completely embrace
  loose coupling and distribution.

 Agile and end-user focused software development
  methodologies have worked well in other areas; much
  of the hard technical space has already been
  explored in Internet companies (and probably social
  networking companies, too).

 Data is most useful in the context of an explicit model;
  models can be generated from data, and models can
  feed back into data gathering.
Things I didn’t discuss
 Database maintenance and active curation are
  incredibly important.

 Most data only makes sense in the context of other
  data (think: controls; wild type vs knockout; other
  backgrounds; etc.) – so we will need lots more data to
  interpret the data we already have.

 “Deep learning” is a promising field for extracting
  correlations from multiple large data sets.

 All of these technical problems are easier to solve
  than the social problems (incentives; training).
Thanks --

This talk and ancillary notes will be available on my
                     blog ~soon:
                 http://ivory.idyll.org/blog/

Please do contact me at ctb@msu.edu if you have
            questions or comments.

WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...
WuXi NextCODE Scales up Genomic Sequencing on AWS (ANT210-S) - AWS re:Invent ...
 
Stephen Friend Dana Farber Cancer Institute 2011-10-24
Stephen Friend Dana Farber Cancer Institute 2011-10-24Stephen Friend Dana Farber Cancer Institute 2011-10-24
Stephen Friend Dana Farber Cancer Institute 2011-10-24
 
Mining Big Data using Genetic Algorithm
Mining Big Data using Genetic AlgorithmMining Big Data using Genetic Algorithm
Mining Big Data using Genetic Algorithm
 
Pine education-platform
Pine education-platformPine education-platform
Pine education-platform
 

Mehr von c.titus.brown

Mehr von c.titus.brown (19)

2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial2015 beacon-metagenome-tutorial
2015 beacon-metagenome-tutorial
 
2015 msu-code-review
2015 msu-code-review2015 msu-code-review
2015 msu-code-review
 
2015 mcgill-talk
2015 mcgill-talk2015 mcgill-talk
2015 mcgill-talk
 
2015 pycon-talk
2015 pycon-talk2015 pycon-talk
2015 pycon-talk
 
2015 opencon-webcast
2015 opencon-webcast2015 opencon-webcast
2015 opencon-webcast
 
2015 vancouver-vanbug
2015 vancouver-vanbug2015 vancouver-vanbug
2015 vancouver-vanbug
 
2015 osu-metagenome
2015 osu-metagenome2015 osu-metagenome
2015 osu-metagenome
 
2015 pag-chicken
2015 pag-chicken2015 pag-chicken
2015 pag-chicken
 
2015 pag-metagenome
2015 pag-metagenome2015 pag-metagenome
2015 pag-metagenome
 
2014 nyu-bio-talk
2014 nyu-bio-talk2014 nyu-bio-talk
2014 nyu-bio-talk
 
2014 anu-canberra-streaming
2014 anu-canberra-streaming2014 anu-canberra-streaming
2014 anu-canberra-streaming
 
2014 aus-agta
2014 aus-agta2014 aus-agta
2014 aus-agta
 
2014 mmg-talk
2014 mmg-talk2014 mmg-talk
2014 mmg-talk
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
2014 wcgalp
2014 wcgalp2014 wcgalp
2014 wcgalp
 
2014 moore-ddd
2014 moore-ddd2014 moore-ddd
2014 moore-ddd
 
2014 ismb-extra-slides
2014 ismb-extra-slides2014 ismb-extra-slides
2014 ismb-extra-slides
 
2014 bosc-keynote
2014 bosc-keynote2014 bosc-keynote
2014 bosc-keynote
 
2014 marine-microbes-grc
2014 marine-microbes-grc2014 marine-microbes-grc
2014 marine-microbes-grc
 

2013 nas-ehs-data-integration-dc

  • 1. Integrating large, fast-moving, and heterogeneous data sets in biology. C. Titus Brown Asst Prof, CSE and Microbiology; BEACON NSF STC Michigan State University ctb@msu.edu
  • 2. Introduction  Background:  Modeling & data analysis undergrad =>  Open source software development + software engineering +  developmental biology + genomics PhD =>  Bio + computer science faculty =>  Data driven biology  Currently working with next-gen sequencing data (mRNAseq, metagenomics, difficult genomes).  Thinking hard about how to do data-driven modeling & model-driven data analysis.
  • 3. Goal & outline Address challenges and opportunities of heterogeneous data integration: 1000 ft view. Outline:  What types of analysis and discovery do we want to enable?  What are the technical challenges, common solutions, and common failure points?  Where might we look for success stories, and what lessons can we port to biology?  My conclusions.
  • 4. Specific types of questions  “I have a known chemical/gene interaction; do I see it in this other data set?”  “I have a known chemical/gene interaction; what other gene expression is affected?”  “What does chemical X do to overall phenotype, effect on gene expression, altered protein localization, and patterns of histone modification?”  More complex/combinatorial interactions:  What does this chemical do in this genetic background?  What kind of additional gene expression changes are generated by the combination of these two chemicals?  What are common effects of this class of chemicals?
  • 5. What general behavior do we want to enable?  Reuse of data by groups that did not/could not produce it.  Publication of reusable/“fork”able data analysis pipelines and models.  Integration of data and models.  Serendipitous uses and cross-referencing of data sets (“mashups”).  Rapid scientific exploration and hypothesis generation in data space.
  • 6. (Executable papers & data reuse)  ENCODE  All data is available; all processing scripts for papers are available on a virtual machine.  QIIME (microbial ecology) Amazon virtual machine containing software and data for: “Collaborative cloud-enabled tools allow rapid, reproducible biological insights.” (pmid 23096404)  Digital normalization paper Amazon virtual machine, again: http://arxiv.org/abs/1203.4802
  • 7. Executable papers can support easy replication & reuse of code, data. (IPython Notebook; also see RStudio) http://ged.msu.edu/papers/2012-diginorm/notebook/
  • 8. What general behavior do we want to enable?  Reuse of data by groups that did not/could not produce it.  Publication of reusable/“fork”able data analysis pipelines and models.  Integration of data and models.  Serendipitous uses and cross-referencing of data sets (“mashups”).  Rapid scientific exploration and hypothesis generation in data space.
  • 9. An entertaining digression -- A mashup of Facebook “top 10 books by college” and per-college SAT rank http://booksthatmakeyoudumb.virgil.gr/
  • 10. Technical obstacles  Syntactic incompatibility  The first 90% of bioinformatics: your IDs are different from my IDs.  Semantic incompatibility  The second 90% of bioinformatics: what does “gene” mean in your database?  Impedance mismatch  SQL is notoriously bad at representing intervals and hierarchies  Genomes consist of intervals; ontologies consist of hierarchies!  …SQL databases dominate (vs graph or object DBs).  Data volume & velocity  Large & expanding data sets just make everything harder.  Unstructured data  aka “publications” – most scientific knowledge is “locked
  • 11. Typical solutions  “Entity resolution”  Accession numbers or other common identifiers …requires global naming system OR translators.  Top down imposition of structure  Centralized DB;  “Here is the schema you will all use”; …limits flexibility, prevents use of unstructured data, heavyweight.  Ontologies to enable “correct” communication  Centrally coordinated vocabulary …slow, hard to get right, doesn’t solve unstructured data problem. Balancing theoretical rigor with practical applicability is particularly hard.  Ad hoc entity resolution (“winging it”)  Common solution …doesn’t work that well.
  • 12. Are better standards the solution? http://xkcd.com/927/
  • 13. Rephrasing technical goals How can we best provide a platform or platforms to support flexible data integration and data investigation across a wide range of data sets and data types in biology? My interests:  Avoid master data manager and centralization  Support federated roll-out of new data and functionality  Provide flexible extensibility of ontologies and hierarchies  Support diverse “ecology” of databases,
  • 14. Success stories outside of biology?  Look for domains:  with really large amounts of heterogeneous data,  that are continually increasing in size,  are being effectively mined on an ongoing basis,  have widely used programmatic interfaces that support “mashups” and other cross-database stuff,  and are intentional, with principles that we can steal or adapt.
  • 15. Success stories outside of biology?  Look for domains:  with really large amounts of heterogeneous data,  that are continually increasing in size,  are being effectively mined on an ongoing basis,  have widely used programmatic interfaces that support “mashups” and other cross-database stuff,  and are intentional, with principles that we can steal or adapt. Amazon.
  • 16. Amazon:  > 50 million users, > 1 million product partners, billions of reviews, dozens of compute services …  Continually changing/updating data sets.  Explicitly adopted a service-oriented architecture that enables both internal and external use of this data.  For example, the amazon.com Web site is itself built from over 150 independent services…  Amazon routinely deploys new services and functionality.
  • 17. Sources: The Platform Rant (Steve Yegge) -- in which he compares the Google and Amazon approaches: https://plus.google.com/112678702228711889851/posts/eVeouesvaVX A summary at HighScalability.com: http://highscalability.com/amazon-architecture (Note that both are long and tech-y, but the first is especially entertaining.)
  • 18. A brief summary of core principles Mandates from the CEO: 1. All teams must expose data and functionality solely through a service interface. 2. All communication between teams happens through that service interface. 3. All service interfaces must be designed so that they can be exposed to the outside world.
  • 19. More colloquially: “You should eat your own dogfood.” Design and implement the database and database functionality to meet your own needs; and only use the functionality you’ve explicitly made available to everyone. To adapt to research: database functionality should be designed in tight integration with the researchers who are using it, both at a user interface level and programmatically. (Genome databases have done a really good job of this, albeit generally in a centralized model.)
  • 20. If the “customers” aren’t integrated into the development loop:
  • 21. A platform view? [Diagram of services; box labels: Diff'n gene expression query; Data exploration (WWW); Metabolic model; Gene ID translator; Isoform resolution/comparison; Chemical relationships; Expression normalization; Expression data (tiling); Expression data (microarray); Expression data (mRNAseq); Expression data II (mRNAseq)]
  • 22. A few points  Open source and agile software development approaches can be surprisingly effective and inexpensive.  Developing services in small groups that include “customer-facing developers” helps ensure utility.  Implementing services in the “cloud” (e.g. virtual machines, or on top of “infrastructure as a service” services) gives developers flexibility in tools, approaches, implementation; also enables scaling and reusability.
  • 23. Combining modelling with data  Data-driven modeling: connections and parameters can be, to some extent, determined from data.  Model-driven data investigation: data that doesn’t fit the “known” model is particularly interesting. The second approach is essentially how particle physicists work with accelerator data: build a model & then interpret the data using the model. (In biology, models are less constraining, though; more unknowns.)
  • 24.
  • 25. Using developmental models Davidson et al., http://sugp.caltech.edu/endomes
  • 26. Using developmental models Models can contain useful abstractions of specific processes; here, the direct effects of blocking nuclearization of β-catenin can be predicted by following the connections. Models provide a common language for (dis)agreement within a community.
  • 27. Using developmental models Davidson et al., http://sugp.caltech.edu/endomes
  • 28. Social obstacles  Training of biologically aware software developers is lacking.  Molecular biologists are still very much of a computationally naïve mindset: “give me the answer so I can do the real work”  Incentives for data sharing, much less useful data sharing, are not yet very strong.  Pubs, grants, respect...  Patterns for useful data sharing are still not well understood, in general.
  • 29. Other places to look  NEON and other NSF centers (e.g. NCEAS) are collecting vast heterogeneous data sets, and are explicitly tackling the data management/use/integration/reuse problem.  SBML (“Systems Biology Markup Language”) is a model description language that enables interoperability of modeling software.  Software Carpentry runs free workshops on effective use of computation for science.
  • 30. My conclusions…  We need a “platform” mentality to make the most use of our data, even if we don’t completely embrace loose coupling and distribution.  Agile and end-user focused software development methodologies have worked well in other areas; much of the hard technical space has already been explored in Internet companies (and probably social networking companies, too).  Data is most useful in the context of an explicit model; models can be generated from data, and models can feed back into data gathering.
  • 31. Things I didn’t discuss  Database maintenance and active curation are incredibly important.  Most data only makes sense in the context of other data (think: controls; wild type vs knockout; other backgrounds; etc.) – so we will need lots more data to interpret the data we already have.  “Deep learning” is a promising field for extracting correlations from multiple large data sets.  All of these technical problems are easier to solve than the social problems (incentives; training).
  • 32. Thanks -- This talk and ancillary notes will be available on my blog ~soon: http://ivory.idyll.org/blog/ Please do contact me at ctb@msu.edu if you have questions or comments.
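
The “translator” option for entity resolution on slide 11, and the “Gene ID translator” box on slide 21, can be illustrated with a small sketch. This is not code from the talk: the identifier table, the data set names, and both function names are hypothetical, and a real translator would sit behind a service rather than a hard-coded dictionary.

```python
# Hypothetical mapping from database-specific IDs to a shared ID.
# None of these identifiers come from the talk.
ID_TRANSLATIONS = {
    "labA:g001": "GENE_X",
    "labB:0042": "GENE_X",
    "labA:g002": "GENE_Y",
}


def translate(local_id):
    """Return the shared ID for a database-specific ID, or None if unknown."""
    return ID_TRANSLATIONS.get(local_id)


def merge_by_shared_id(*datasets):
    """Combine (name, {local_id: value}) data sets under shared IDs.

    Returns (merged, unresolved): merged maps shared ID -> {name: value};
    unresolved lists (name, local_id) pairs the translator could not map.
    """
    merged = {}
    unresolved = []
    for name, data in datasets:
        for local_id, value in data.items():
            shared = translate(local_id)
            if shared is None:
                # Keep track of failures instead of silently dropping them.
                unresolved.append((name, local_id))
                continue
            merged.setdefault(shared, {})[name] = value
    return merged, unresolved


merged, unresolved = merge_by_shared_id(
    ("microarray", {"labA:g001": 2.5, "labA:g999": 1.0}),
    ("mRNAseq", {"labB:0042": 3.1}),
)
```

Reporting the unresolved IDs explicitly is a design choice in this sketch: “your IDs are different from my IDs” is then visible to the person integrating the data rather than hidden in a shrinking result set.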

Editor's notes

  1. Separation of concerns; multiple implementations possible; when you publish, you don’t have to talk to anybody to get “your method” integrated; recognition that everything is changing. Embrace chaos.
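
As a rough illustration of “separation of concerns; multiple implementations possible” and of the service-interface mandates on slide 18, the sketch below defines one interface with two interchangeable implementations. Every class and function name here is an assumption made for illustration; the talk does not prescribe any particular API.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, Optional


class ExpressionService(ABC):
    """Hypothetical service interface: the only route to the data."""

    @abstractmethod
    def get_expression(self, gene_id: str) -> Optional[float]:
        """Return the expression value for gene_id, or None if absent."""


class InMemoryExpressionService(ExpressionService):
    """One implementation: values held in a dictionary."""

    def __init__(self, values: Dict[str, float]) -> None:
        self._values = dict(values)

    def get_expression(self, gene_id: str) -> Optional[float]:
        return self._values.get(gene_id)


class Log2ExpressionService(ExpressionService):
    """Another implementation, layered on any backend via the same interface."""

    def __init__(self, backend: ExpressionService) -> None:
        self._backend = backend

    def get_expression(self, gene_id: str) -> Optional[float]:
        raw = self._backend.get_expression(gene_id)
        if raw is None or raw <= 0:
            return None  # log2 is undefined for missing or non-positive values
        return math.log2(raw)


def log2_fold_change(treated: ExpressionService,
                     control: ExpressionService,
                     gene_id: str) -> Optional[float]:
    """Client code that relies only on the interface, not on storage details."""
    a = treated.get_expression(gene_id)
    b = control.get_expression(gene_id)
    if a is None or b is None:
        return None
    return a - b


treated = Log2ExpressionService(InMemoryExpressionService({"GENE_X": 8.0}))
control = Log2ExpressionService(InMemoryExpressionService({"GENE_X": 2.0}))
change = log2_fold_change(treated, control, "GENE_X")
```

Because `log2_fold_change` sees only `ExpressionService`, a new backend or normalization method can be published without coordinating with existing callers, which is the point the note makes.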