Challenges with data quality,
     sharing, and versioning
      David Dooling <ddooling@wustl.edu>
                              GIA 2009
Production Centers
• Tony Cox, Sanger
  Sequencing
  Scale
  Infrastructure
  Data flow
• David Dooling, WUStL
  Scale
  Quality
  Sharing
  Versioning
• Toby Bloom, Broad
  Quality
  Integration
  Standards
  Sharing

<ddooling@wustl.edu>
sub scale {



Moore’s Law
[Chart: Personnel, Moore’s Law, CPU, Storage, and Sequence plotted by year, 2000 to 2010]

Images
                 200 TB/week
                 10 PB/year
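
The step from the weekly image volume to the yearly figure is a plain unit conversion. A minimal sketch, assuming 52 weeks per year and decimal units (1 PB = 1000 TB); the function name is hypothetical:

```python
WEEKS_PER_YEAR = 52
TB_PER_PB = 1000  # assumption: decimal (SI) units


def weekly_tb_to_yearly_pb(tb_per_week: float) -> float:
    """Convert a weekly volume in TB into a yearly volume in PB."""
    return tb_per_week * WEEKS_PER_YEAR / TB_PER_PB


# 200 TB/week of images, as quoted on the slide
images_pb_per_year = weekly_tb_to_yearly_pb(200)
print(f"{images_pb_per_year:.1f} PB/year")
```

Under these assumptions the result is slightly above 10 PB, which the slide rounds to 10 PB/year.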




Perspective
                   20 PB/day
                   2 PB/s

FASTQ
  @HWI-EAS404:5:1:6:180#0/1
  GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT
  +HWI-EAS404:5:1:6:180#0/1
  aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a
  @HWI-EAS404:5:1:6:396#0/1
  TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA
  +HWI-EAS404:5:1:6:396#0/1
  Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ
  @HWI-EAS404:5:1:6:1344#0/1
  GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG
  +HWI-EAS404:5:1:6:1344#0/1
  aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[
  @HWI-EAS404:5:1:6:1814#0/1
  AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC
  +HWI-EAS404:5:1:6:1814#0/1
  aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X


                               7 TB/week
                               350 TB/year
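
Each FASTQ record above spans four lines: an `@` header, the bases, a `+` separator, and one quality character per base. A minimal reader sketch follows. The offset of 64 is an assumption (Illumina pipelines of this period used it, and some versions used a Solexa rather than Phred scale), and the record in `EXAMPLE` is a made-up toy read, not data from the talk.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    qualities: List[int]


def read_fastq(handle: TextIO, offset: int = 64) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    offset is the ASCII offset of the quality encoding (assumed 64 here).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:  # end of file (or a blank line) ends the stream
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(char) - offset for char in quality]
        yield FastqRecord(header[1:], sequence, scores)


# Hypothetical toy record for illustration only
EXAMPLE = "@toy_read/1\nGCTGGTTT\n+toy_read/1\naaaa`]aa\n"

for record in read_fastq(io.StringIO(EXAMPLE)):
    print(record.name, record.sequence, record.qualities)
```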

Mapping
                       2 TB/week
                       100 TB/year
                       42,000 core-hr/week
                       5 core-yr/week
                       250 core cluster
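
The three compute figures for mapping are the same quantity in different units. A short sketch of the arithmetic, assuming a 365-day year and a cluster that runs at full utilization around the clock (both are assumptions, not statements from the talk):

```python
HOURS_PER_WEEK = 7 * 24    # 168
HOURS_PER_YEAR = 365 * 24  # 8760

core_hours_per_week = 42_000  # figure quoted on the slide

# One core running for a year delivers HOURS_PER_YEAR core-hours.
core_years_per_week = core_hours_per_week / HOURS_PER_YEAR

# Cores needed to keep pace if every core is busy all week.
cores_needed = core_hours_per_week / HOURS_PER_WEEK

print(f"{core_years_per_week:.2f} core-yr/week, {cores_needed:.0f} cores")
```

Any real cluster would need headroom above the 250-core minimum, since the calculation leaves no slack for idle time, failures, or reruns.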




The Weakest Link




The Balanced PC
• Clock speed
• AGP
• Front-side bus
• Hypertransport
• 1 Gbps
• PCI-X
• SATA
• PCI-Express
• Infiniband
• Multi-core
• Front-side bus
• GPU
• 10 Gbps

The balanced PS¹

        10   gosub     get(sequencers)
        20   gosub     get(disk)
        30   gosub     get(backup_capacity)
        40   gosub     get(network_capacity)
        50   gosub     get(cluster_nodes)

¹ Pipeline for Sequencing

The unbalanced PS



        10   gosub get(sequencers)
        20   gosub get(disk)
        30   gosub get(backup_capacity)
        40   gosub get(network_capacity)
        50   gosub get(cluster_nodes)
        60   goto 10




The GHz race




} # scale



sub quality {



Honda




Ford




Quality is Job 1




...must be more than just a slogan



Quality missteps
          Initial low fidelity between predicted base
              quality values and observed quality

                       Tsonev, S. SEP 2007

An aside




            “basecall calibration predicted vs. observed”

Cult of traces




Quality is the key
Need high fidelity between predicted and observed quality

                 50 bytes per base


                 20 bytes per base


                  2 bytes per base


                       3 bits per base
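
To make the four per-base costs concrete, the sketch below converts each into total storage for a fixed amount of sequence. The 10^12-base (1 terabase) input is an arbitrary assumption chosen for illustration, and the slide does not say which data representation each figure corresponds to, so the labels simply repeat the slide text.

```python
# Per-base storage cost from the slide, expressed in bits
BITS_PER_BASE = {
    "50 bytes per base": 50 * 8,
    "20 bytes per base": 20 * 8,
    "2 bytes per base": 2 * 8,
    "3 bits per base": 3,
}


def storage_tb(bases: int, bits_per_base: int) -> float:
    """Total storage in decimal terabytes for the given number of bases."""
    return bases * bits_per_base / 8 / 1e12


ASSUMED_BASES = 10**12  # hypothetical workload: one terabase

for label, bits in BITS_PER_BASE.items():
    print(f"{label:>18}: {storage_tb(ASSUMED_BASES, bits):7.3f} TB")
```

Going from 50 bytes to 3 bits per base is a reduction of more than 100-fold, which is only safe if the quality values that remain can be trusted.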

The down side




http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf




                                         http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg




} # quality



sub sharing {



1000 Genomes




3.8 Tb

~50 B/b

190 TB
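
The three figures above multiply out directly. A minimal sketch, reading "Tb" as terabases (10^12 bases) and "B/b" as bytes per base, and using decimal terabytes; those readings are assumptions drawn from context:

```python
bases = 3.8e12        # 3.8 Tb, read as terabases of sequence
bytes_per_base = 50   # ~50 B/b, read as bytes stored per base

total_bytes = bases * bytes_per_base
total_tb = total_bytes / 1e12  # decimal terabytes

print(f"{total_tb:.0f} TB")
```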




Submitted to central
         repositories



... and replicated
            across the pond



The goal of this project is to provide a system
for storing and retrieving huge amounts of
data, distributed among a large number of
heterogenous server nodes, under a single
virtual filesystem tree with a variety of
standard access methods.




Write-only databases




          Search limited to sequence and
           values of specific XML entities
              submitted as metadata

Speaking of XML
<?xml version="1.0" encoding="UTF-8"?>
<STUDY_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <STUDY alias="LowSalternSDbayVir111005" accession="SRP000145">
    <DESCRIPTOR>
      <STUDY_TITLE>Solar Salterns, viral fraction from low salinity saltern in San Diego, CA </STUDY_TITLE>
      <STUDY_TYPE existing_study_type="Metagenomics"/>
      <STUDY_ABSTRACT>Viral community from a "low" salinity saltern and sequenced at 454 Life Sciences. </STUDY_ABSTRACT>
      <CENTER_NAME>SDSU</CENTER_NAME>
      <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</CENTER_PROJECT_NAME>
      <PROJECT_ID>28373</PROJECT_ID>
    </DESCRIPTOR>
    <STUDY_ATTRIBUTES>
      <STUDY_ATTRIBUTE><TAG>NCBI parent project ID</TAG><VALUE>28725</VALUE></STUDY_ATTRIBUTE>
    </STUDY_ATTRIBUTES>
  </STUDY>
</STUDY_SET>

<?xml version="1.0" encoding="UTF-8"?>
<SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <SAMPLE alias="28373" accession="SRS000373">
    <SAMPLE_NAME>
      <TAXON_ID>496920</TAXON_ID>
      <COMMON_NAME>saltern metagenome</COMMON_NAME>
    </SAMPLE_NAME>
    <DESCRIPTION>viral fraction from low salinity saltern in San Diego, CA </DESCRIPTION>
    <SAMPLE_ATTRIBUTES>
      <SAMPLE_ATTRIBUTE><TAG>collection_date</TAG><VALUE>11/10/05</VALUE></SAMPLE_ATTRIBUTE>
      <SAMPLE_ATTRIBUTE><TAG>lat_lon</TAG><VALUE>32.599040, -117.107356</VALUE></SAMPLE_ATTRIBUTE>
    </SAMPLE_ATTRIBUTES>
  </SAMPLE>
</SAMPLE_SET>

<?xml version="1.0" encoding="UTF-8"?>
<EXPERIMENT_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <EXPERIMENT alias="LowSalternSDbayVir111005_experiment" expected_number_runs="2" accession="SRX000217">
    <TITLE>454 sequencing of saltern metagenome fragment library</TITLE>
    <STUDY_REF accession="SRP000145" refname="LowSalternSDbayVir111005"/>
    <DESIGN>
      <DESIGN_DESCRIPTION>454 Sequencing of viral fraction from low salinity saltern in San Diego, CA</DESIGN_DESCRIPTION>
      <SAMPLE_DESCRIPTOR accession="SRS000373" refname="28373"/>
      <LIBRARY_DESCRIPTOR>
        <LIBRARY_NAME>lowSalternSDbayVir111005</LIBRARY_NAME>
        <LIBRARY_STRATEGY>OTHER</LIBRARY_STRATEGY>
        <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE>
        <LIBRARY_SELECTION>RANDOM</LIBRARY_SELECTION>
        <LIBRARY_LAYOUT><SINGLE/></LIBRARY_LAYOUT>
        <LIBRARY_CONSTRUCTION_PROTOCOL>none provided</LIBRARY_CONSTRUCTION_PROTOCOL>
      </LIBRARY_DESCRIPTOR>
      <SPOT_DESCRIPTOR>
        <SPOT_DECODE_SPEC>
          <NUMBER_OF_READS_PER_SPOT>2</NUMBER_OF_READS_PER_SPOT>
          <READ_SPEC><READ_INDEX>0</READ_INDEX><READ_CLASS>Technical Read</READ_CLASS><READ_TYPE>Adapter</READ_TYPE><BASE_COORD>1</BASE_COORD></READ_SPEC>
          <READ_SPEC><READ_INDEX>1</READ_INDEX><READ_CLASS>Application Read</READ_CLASS><READ_TYPE>Forward</READ_TYPE><BASE_COORD>5</BASE_COORD></READ_SPEC>
        </SPOT_DECODE_SPEC>
      </SPOT_DESCRIPTOR>
    </DESIGN>
    <PLATFORM>
      <LS454>
        <INSTRUMENT_MODEL>GS 20</INSTRUMENT_MODEL>
        <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</FLOW_SEQUENCE>
        <FLOW_COUNT>168</FLOW_COUNT>
      </LS454>
    </PLATFORM>
    <PROCESSING>
      <BASE_CALLS>
        <SEQUENCE_SPACE>Base Space</SEQUENCE_SPACE>
        <BASE_CALLER>454BaseCaller</BASE_CALLER>
      </BASE_CALLS>
      <QUALITY_SCORES qtype="phred">
        <QUALITY_SCORER>454BaseCaller</QUALITY_SCORER>
        <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS>
        <MULTIPLIER>1</MULTIPLIER>
      </QUALITY_SCORES>
    </PROCESSING>
  </EXPERIMENT>
</EXPERIMENT_SET>

<?xml version="1.0" encoding="UTF-8"?>
<RUN_SET xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <RUN alias="D0IIGP3" instrument_model="454 GS 20" run_date="2006-03-17T09:39:51Z" run_file="D0IIGP3" run_center="454MSC" total_data_blocks="1" accession="SRR001053">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D0IIGP3" region="1" total_spots="51121" total_reads="51121" number_channels="1" format_code="1" sector="0">
      <FILES><FILE filename="D0IIGP301.sff" filetype="sff"/></FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE><TAG>flow_count</TAG><VALUE>168</VALUE></RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE><TAG>flow_sequence</TAG><VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE></RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE><TAG>key_sequence</TAG><VALUE>TCAG</VALUE></RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
  <RUN alias="D1LDSHL" instrument_model="454 GS 20" run_date="2006-04-06T09:25:19Z" run_file="D1LDSHL" run_center="454MSC" total_data_blocks="1" accession="SRR001054">
    <EXPERIMENT_REF accession="SRX000217" refname="LowSalternSDbayVir111005_experiment"/>
    <DATA_BLOCK name="D1LDSHL" region="1" total_spots="70935" total_reads="70935" number_channels="1" format_code="1" sector="0">
      <FILES><FILE filename="D1LDSHL01.sff" filetype="sff"/></FILES>
    </DATA_BLOCK>
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE><TAG>flow_count</TAG><VALUE>168</VALUE></RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE><TAG>flow_sequence</TAG><VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</VALUE></RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE><TAG>key_sequence</TAG><VALUE>TCAG</VALUE></RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
</RUN_SET>
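
Because run metadata is stored as generic TAG/VALUE pairs, a consumer has to walk the XML to get anything out of it. A minimal sketch using Python's standard `xml.etree.ElementTree`; the `RUN_XML` string is a trimmed excerpt modeled on the run document above, and the helper name `run_attributes` is hypothetical:

```python
import xml.etree.ElementTree as ET
from typing import Dict

# Trimmed excerpt for illustration; most attributes and elements omitted
RUN_XML = """<RUN_SET>
  <RUN alias="D0IIGP3" accession="SRR001053">
    <RUN_ATTRIBUTES>
      <RUN_ATTRIBUTE><TAG>flow_count</TAG><VALUE>168</VALUE></RUN_ATTRIBUTE>
      <RUN_ATTRIBUTE><TAG>key_sequence</TAG><VALUE>TCAG</VALUE></RUN_ATTRIBUTE>
    </RUN_ATTRIBUTES>
  </RUN>
</RUN_SET>"""


def run_attributes(xml_text: str) -> Dict[str, Dict[str, str]]:
    """Map each run accession to its TAG -> VALUE attribute pairs."""
    root = ET.fromstring(xml_text)
    result: Dict[str, Dict[str, str]] = {}
    for run in root.iter("RUN"):
        pairs = {}
        for attribute in run.iter("RUN_ATTRIBUTE"):
            pairs[attribute.findtext("TAG")] = attribute.findtext("VALUE")
        result[run.get("accession")] = pairs
    return result


print(run_attributes(RUN_XML))
```

All values come back as strings, so a numeric field such as `flow_count` still needs converting by the caller.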




} # sharing



sub versioning {



The Cathedral and the Bazaar
Linux overturned much of what I thought I
knew. I had been preaching the Unix gospel of
small tools, rapid prototyping and evolutionary
programming for years. But I also believed
there was a certain critical complexity above
which a more centralized, a priori approach was
required. I believed that the most important
software (operating systems and really large
tools like the Emacs programming editor)
needed to be built like cathedrals, carefully
crafted by individual wizards or small bands of
mages working in splendid isolation, with no
beta to be released before its time.

The Vatican and the Reformation




The popes




                   Will this scale?

GenBank genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/

git genome




http://betterexplained.com/articles/intro-to-distributed-version-control-illustrated/
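
One property a git-style approach would bring to a reference sequence is content addressing: the identifier of a version is derived from its bytes, so any edit yields a new identifier. The sketch below computes the object ID git assigns to a blob (SHA-1 over a `blob <length>` header, a NUL byte, and the content). Storing a genome as a single blob is an assumption made for illustration, and the two sequences are made-up examples.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git would assign to this content as a blob."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical reference fragments differing by a single base
version_a = b">chr_example\nAATAACTATATAAGT\n"
version_b = b">chr_example\nAATAACTATATAAGA\n"

print(git_blob_id(version_a))
print(git_blob_id(version_b))
```

Two copies of the reference with the same ID are byte-identical, which is a guarantee a release label alone does not give.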

The Human Reference
>7 dna:chromosome chromosome:NCBI36:7:1:158821424:1
...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG
GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT
TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT
GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT
ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA
AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA
TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA
CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA
TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA
AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT
TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC
AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...


The Human Reference
[Figure: graph diagrams in panels (a), (b), and (c), with nodes labeled A through H and numeric labels]
                                                       139                                                                                                D
                                                                                                                                                                                                                                                                                         F
                                                                          13(2)                55(3)
                      E                                                                                              D                                          21                                                                                                    158
                                                                                                                                                                                                              32(3)               45(3)
                                                                                                                                                          A
                                                   13(2)                                                                                                                                                                                                                                 C
                                                                                                                                                                                                  s5766
                                                                                                                                                                                   13(2)
                                                                          38(6)
                                                                                                                                                                                                              20(2)
                                                                                                                                                                18
                      F                                                                                              G
                                                                                                                                                          B
                                                                                         8
                                                       8(5)                                                                                                                                                                                                                              A
                                                                                                                                                                                        81
                                                                          18(6)                58(7)                 E
                                                       171                                                                                                C
                      G                                                                                                                                                           123(2)                                                          82
                                                                                                                                                                                                                                                                                         B




             D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in
             large genomes. Genome Biology 2006, 7:R7
} # versioning



sub thank {"you"}



Challenges with Data Quality, Sharing, and Versioning in Next-Generation Sequencing

  • 1. Challenges with data quality, sharing, and versioning David Dooling <ddooling@wustl.edu> GIA 2009
  • 2. Production Centers • Tony Cox, Sanger • David Dooling, WUStL Sequencing Scale Scale Quality Infrastructure Sharing Data flow Versioning • Toby Bloom, Broad Quality Integration Standards Sharing <ddooling@wustl.edu>
  • 4. Moore’s Law ,-./011-2# 300.-4/#567# 8,9# :;0.6<-# :-=>-1?-# !quot;quot;quot;# !quot;quot;$# !quot;quot;!# !quot;quot;%# !quot;quot;&# !quot;quot;'# !quot;quot;(# !quot;quot;)# !quot;quot;*# !quot;quot;+# !quot;$quot;# <ddooling@wustl.edu>
  • 5. Images 200 TB/week <ddooling@wustl.edu>
  • 6. Images 10 PB/year <ddooling@wustl.edu>
  • 7. Perspective 20 PB/day <ddooling@wustl.edu>
  • 8. Perspective 2 PB/s <ddooling@wustl.edu>
  • 9. FASTQ @HWI-EAS404:5:1:6:180#0/1 GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT +HWI-EAS404:5:1:6:180#0/1 aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a @HWI-EAS404:5:1:6:396#0/1 TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA +HWI-EAS404:5:1:6:396#0/1 Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ @HWI-EAS404:5:1:6:1344#0/1 GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG +HWI-EAS404:5:1:6:1344#0/1 aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[ @HWI-EAS404:5:1:6:1814#0/1 AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC +HWI-EAS404:5:1:6:1814#0/1 aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X 7 TB/week <ddooling@wustl.edu>
  • 10. FASTQ @HWI-EAS404:5:1:6:180#0/1 GCTGGTTTAACTCGAGTATTTGTCCATTCTACTAATTTGAGTGTCTGCTGTGGAAAGGTGTTTGTCATGTATTTT +HWI-EAS404:5:1:6:180#0/1 aaaa`]aaaa`aa^aa]aaaa^`_``____`W]a_`T[[b__`YXUW][MSTNZX^[[`_Z[^``X`^a @HWI-EAS404:5:1:6:396#0/1 TATTTACTCTATCCCATTATATACATATTATGATTTCAAAATAACAATGCCAATATAAAAACTAACAATATGATA +HWI-EAS404:5:1:6:396#0/1 Yaaa_baa`^a]Wa___aaa^I^V]^]NQ_`^ZPP[__^_a`^a`JYQWVNFFMRQSX_X^a_Y[`^a^NZ @HWI-EAS404:5:1:6:1344#0/1 GAGGACTTGCATGCTAGGTTTGGTTCTTGGCTGAATTGCTGAAACTGTCCAAGTATCAGTAGCAAAACATGGGTG +HWI-EAS404:5:1:6:1344#0/1 aabaaa__]^a`[^`]]Y``[ST_]`]WW]]WZ]`^ZT[_X```_WVNYWKDNLTW[YXSVZ^ZTZZVRUX[ @HWI-EAS404:5:1:6:1814#0/1 AAAGCTTACTGCTGTTTAGAATTCTTGCTACAGTCAGGAGAAAGCCGAAAGCTGAACGGGTACTGAATCTTCTAC +HWI-EAS404:5:1:6:1814#0/1 aa````aa^a`_^``a`XY`^ZX^YW^[XUWUYOMVZZ_W^^XXTSMHMLLNTTDWU__[WVVY]Y_]X 350 TB/year <ddooling@wustl.edu>
  • 11. Mapping 2 TB/week <ddooling@wustl.edu>
  • 12. Mapping 100 TB/year <ddooling@wustl.edu>
  • 13. Mapping 42,000 core-hr/week <ddooling@wustl.edu>
  • 14. Mapping 5 core-yr/week <ddooling@wustl.edu>
  • 15. Mapping 250 core cluster <ddooling@wustl.edu>
  • 17. The Balanced PC • Clock speed • AGP • Front-side bus • Hypertransport • 1 Gbps • PCI-X • SATA • PCI-Express • Infiniband • Multi-core • Front-side bus • GPU • 10 Gbps <ddooling@wustl.edu>
  • 18. The balanced PS 1 10 gosub get(sequencers) 20 gosub get(disk) 30 gosub get(backup_capacity) 40 gosub get(network_capacity) 50 gosub get(cluster_nodes) 1 - Pipeline for Sequencing <ddooling@wustl.edu>
  • 19. The unbalanced PS 10 gosub get(sequencers) 20 gosub get(disk) 30 gosub get(backup_capacity) 40 gosub get(network_capacity) 50 gosub get(cluster_nodes) 60 goto 10 <ddooling@wustl.edu>
  • 33. Quality is Job 1 <ddooling@wustl.edu>
  • 34. ...must be more than just a slogan <ddooling@wustl.edu>
  • 35. Quality missteps Initial low fidelity between base quality values and quality Tsonev, S. SEP 2007 <ddooling@wustl.edu>
  • 36. An aside “basecall calibration predicted vs. observed” <ddooling@wustl.edu>
  • 38. Quality is the key Need high fidelity between prediction and observed 50 bytes per base 20 bytes per base 2 bytes per base 3 bits per base <ddooling@wustl.edu>
  • 39. The down side http://www3.appliedbiosystems.com/cms/ groups/mcb_marketing/documents/ generaldocuments/cms_057559.pdf http://mammoth.psu.edu/labPhotos/imageOfFlowgram.jpg <ddooling@wustl.edu>
  • 46. Submitted to central repositories <ddooling@wustl.edu>
  • 47. ... and replicated across the pond <ddooling@wustl.edu>
  • 48. The goal of this project is to provide a system for storing and retrieving huge amounts of data, distributed among a large number of heterogenous server nodes, under a single virtual filesystem tree with a variety of standard access methods. <ddooling@wustl.edu>
  • 49. Write-only databases Search limited to sequence and values of specific XML entities submitted as metadata <ddooling@wustl.edu>
  • 50. Write-only databases x Search limited to sequence and values of specific XML entities submitted as metadata <ddooling@wustl.edu>
  • 51. Speaking of XML <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <?xml version=quot;1.0quot; encoding=quot;UTF-8quot;?> <LS454> <STUDY_SET xmlns:xsi=quot;http://www.w3.org/2001/ <EXPERIMENT_SET xmlns:xsi=quot;http://www.w3.org/ <INSTRUMENT_MODEL>GS 20</ <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT XMLSchema-instancequot;> 2001/XMLSchema-instancequot;> INSTRUMENT_MODEL> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <STUDY alias=quot;LowSalternSDbayVir111005quot; <EXPERIMENT ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT accession=quot;SRP000145quot;> alias=quot;LowSalternSDbayVir111005_experimentquot; <FLOW_SEQUENCE>TACGTACGTACGTACGTACGTACGTACGTACGT ACGTACGTACGTACGTACGTACGTACGTACG</VALUE> <DESCRIPTOR> expected_number_runs=quot;2quot; accession=quot;SRX000217quot;> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT </RUN_ATTRIBUTE> <STUDY_TITLE>Solar Salterns, viral <TITLE>454 sequencing of saltern metagenome ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <RUN_ATTRIBUTE> fraction from low salinity saltern in San Diego, fragment library</TITLE> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACG</ <TAG>key_sequence</TAG> CA </STUDY_TITLE> <STUDY_REF accession=quot;SRP000145quot; FLOW_SEQUENCE> <VALUE>TCAG</VALUE> <STUDY_TYPE refname=quot;LowSalternSDbayVir111005quot;/> <FLOW_COUNT>168</FLOW_COUNT> </RUN_ATTRIBUTE> existing_study_type=quot;Metagenomicsquot;/> <DESIGN> </LS454> </RUN_ATTRIBUTES> <STUDY_ABSTRACT>Viral community from a <DESIGN_DESCRIPTION>454 Sequencing of </PLATFORM> </RUN> quot;lowquot; salinity saltern and sequenced at 454 Life viral fraction from low salinity saltern in San <PROCESSING> <RUN alias=quot;D1LDSHLquot; instrument_model=quot;454 GS Sciences. 
</STUDY_ABSTRACT> Diego, CA</DESIGN_DESCRIPTION> <BASE_CALLS> 20" run_date="2006-04-06T09:25:19Z" <CENTER_NAME>SDSU</CENTER_NAME> <SAMPLE_DESCRIPTOR accession="SRS000373" <SEQUENCE_SPACE>Base Space</ run_file="D1LDSHL" run_center="454MSC" refname="28373"/> SEQUENCE_SPACE> total_data_blocks="1" accession="SRR001054"> <CENTER_PROJECT_NAME>LowSalternSDbayVir111005</ <LIBRARY_DESCRIPTOR> <BASE_CALLER>454BaseCaller</BASE_CALLER> <EXPERIMENT_REF accession="SRX000217" CENTER_PROJECT_NAME> <LIBRARY_NAME>lowSalternSDbayVir111005</ </BASE_CALLS> refname="LowSalternSDbayVir111005_experiment"/> <PROJECT_ID>28373</PROJECT_ID> LIBRARY_NAME> <QUALITY_SCORES qtype="phred"> <DATA_BLOCK name="D1LDSHL" region="1" </DESCRIPTOR> <LIBRARY_STRATEGY>OTHER</ <QUALITY_SCORER>454BaseCaller</ total_spots="70935" total_reads="70935" <STUDY_ATTRIBUTES> LIBRARY_STRATEGY> QUALITY_SCORER> number_channels="1" format_code="1" sector="0"> <STUDY_ATTRIBUTE> <LIBRARY_SOURCE>OTHER</LIBRARY_SOURCE> <NUMBER_OF_LEVELS>64</NUMBER_OF_LEVELS> <FILES> <TAG>NCBI parent project ID</TAG> <LIBRARY_SELECTION>RANDOM</ <MULTIPLIER>1</MULTIPLIER> <FILE filename="D1LDSHL01.sff" <VALUE>28725</VALUE> LIBRARY_SELECTION> </QUALITY_SCORES> filetype="sff"/> </STUDY_ATTRIBUTE> <LIBRARY_LAYOUT> </PROCESSING> </FILES> </STUDY_ATTRIBUTES> <SINGLE/> </EXPERIMENT> </DATA_BLOCK> </STUDY> </LIBRARY_LAYOUT> </EXPERIMENT_SET> <RUN_ATTRIBUTES> </STUDY_SET> <LIBRARY_CONSTRUCTION_PROTOCOL> <RUN_ATTRIBUTE> none provided <TAG>flow_count</TAG> <?xml version="1.0" encoding="UTF-8"?> </LIBRARY_CONSTRUCTION_PROTOCOL> <?xml version="1.0" encoding="UTF-8"?> <VALUE>168</VALUE> <SAMPLE_SET xmlns:xsi="http://www.w3.org/2001/ </LIBRARY_DESCRIPTOR> <RUN_SET xmlns:xsi="http://www.w3.org/2001/ </RUN_ATTRIBUTE> XMLSchema-instance"> <SPOT_DESCRIPTOR>
XMLSchema-instance"> <RUN_ATTRIBUTE> <SAMPLE alias="28373" accession="SRS000373"> <SPOT_DECODE_SPEC> <RUN alias="D0IIGP3" instrument_model="454 GS <TAG>flow_sequence</TAG> <SAMPLE_NAME> <NUMBER_OF_READS_PER_SPOT>2</ 20" run_date="2006-03-17T09:39:51Z" <TAXON_ID>496920</TAXON_ID> NUMBER_OF_READS_PER_SPOT> run_file="D0IIGP3" run_center="454MSC" <VALUE>TACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT <COMMON_NAME>saltern metagenome</ <READ_SPEC> total_data_blocks="1" accession="SRR001053"> ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT COMMON_NAME> <READ_INDEX>0</READ_INDEX> <EXPERIMENT_REF accession="SRX000217" ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT </SAMPLE_NAME> <READ_CLASS>Technical Read</ refname="LowSalternSDbayVir111005_experiment"/> ACGTACGTACGTACGTACGTACGTACGTACG</VALUE> <DESCRIPTION>viral fraction from low READ_CLASS> <DATA_BLOCK name="D0IIGP3" region="1" </RUN_ATTRIBUTE> salinity saltern in San Diego, CA </ <READ_TYPE>Adapter</READ_TYPE> total_spots="51121" total_reads="51121" <RUN_ATTRIBUTE> DESCRIPTION> <BASE_COORD>1</BASE_COORD> number_channels="1" format_code="1" sector="0"> <TAG>key_sequence</TAG> <SAMPLE_ATTRIBUTES> </READ_SPEC> <FILES> <VALUE>TCAG</VALUE> <SAMPLE_ATTRIBUTE> <READ_SPEC> <FILE filename="D0IIGP301.sff" </RUN_ATTRIBUTE> <TAG>collection_date</TAG> <READ_INDEX>1</READ_INDEX> filetype="sff"/> </RUN_ATTRIBUTES> <VALUE>11/10/05</VALUE> <READ_CLASS>Application Read</ </FILES> </RUN> </SAMPLE_ATTRIBUTE> READ_CLASS> </DATA_BLOCK> </RUN_SET> <SAMPLE_ATTRIBUTE> <READ_TYPE>Forward</READ_TYPE> <RUN_ATTRIBUTES> <TAG>lat_lon</TAG> <BASE_COORD>5</BASE_COORD> <RUN_ATTRIBUTE> <VALUE>32.599040, -117.107356</VALUE> </READ_SPEC> <TAG>flow_count</TAG> </SAMPLE_ATTRIBUTE> </SPOT_DECODE_SPEC> <VALUE>168</VALUE> </SAMPLE_ATTRIBUTES> </SPOT_DESCRIPTOR> </RUN_ATTRIBUTE> </SAMPLE> </DESIGN>
<RUN_ATTRIBUTE> </SAMPLE_SET> <PLATFORM> <TAG>flow_sequence</TAG>
  • 54. The Cathedral and the Bazaar: “Linux overturned much of what I thought I knew. I had been preaching the Unix gospel of small tools, rapid prototyping and evolutionary programming for years. But I also believed there was a certain critical complexity above which a more centralized, a priori approach was required. I believed that the most important software (operating systems and really large tools like the Emacs programming editor) needed to be built like cathedrals, carefully crafted by individual wizards or small bands of mages working in splendid isolation, with no beta to be released before its time.”
  • 55. The Vatican and the Reformation
  • 56. The popes: Will this scale?
  • 59. The Human Reference >7 dna:chromosome chromosome:NCBI36:7:1:158821424:1 ...AATAACTATATAAGTAAATAAGCAAGCTGTATGAATATACAAAGCTCTCTGGTAAAG GTAAATACATAAACAAACATAAAAACAGTCCTATTGTAATTTTGGTTTGTAACTCTGCTT TTTATTTTCTACATAATTTAAAAGGCAAATGCATAAAATGTAATTGTAAATCTGTTAGCT GGTATACAATGAATAAAGATATAATTTGTCACATCAATAACATAAAAAGAGTAGAGCTAT ATATATAGCAGTAGAATTTTGGTATGTGATTGAACTTAAGTTGAAATAAATTCAAATTAA AATGTTATAACTCTAGGATGTTATATGTAATTCTCATAGTAACCAAAAATGAAATATACA TAGAATATAAACAAAAGGAAATGAGACTAGAAACAAAATGTGTCACTACAAAAAAATCAA CTAAAGATAAAAAAGAAATAATTGAGAAAATGATTGGCAAAAATCAGTAACTCTGACGTA TTAAAACTTTCCATGCTACATAAATCTGAAAACTCTATTTCACATAAAACTGGAGCTGAA AGAAACAAATATTTACCTATAAAGTTAAAAGTTATATAGGGAACAAACACTAATTTTTTT TAGAAAAAATTATAAAAAGAGTAAAAATATGCCTTATACTACCGTAATTTCATGTTTTAC AGCTCTGGGAAAATAGAAAATAAAATGTTCTGTTAGCATGAATCCCTCTGTGCCCCC...
  • 61. The Human Reference [Figure: repeat graphs, panels (a), (b), (c), with nodes labeled A to H and numbered edges] D Zhi, BJ Raphael, AL Price, H Tang and PA Pevzner. Identifying repeat domains in large genomes. Genome Biology 2006, 7:R7

Editor's Notes

  1. What challenges are the large genome centers facing now that the typical researcher will be facing soon? Do not store images. Do not store SRF. Keep FASTQ.
  2. This acceleration breaks everything
  3. 3.4*125/75*35 = 198.333333333333
  4. We need to stop having to deal with images. It should be transparent to the end user.
  5. LHC http://atlasexperiment.org/
  6. (90*2+90/125*50)*35 = 7560, uncompressed
  7. For a 75-base read, you need 200 bytes; 25% of that is the headers. Save 12.5% by simply not repeating the sequence header.
  8. 8*90/12*35 = 2100
  9. Cost of software
  10. The chain is only as strong as its weakest link. Images: Assembly line backing up? Keystone Cops piling up? Stooges? Transition: a situation not unlike that faced by PC manufacturers over the past decade.
  11. This analogy works on another level as well...
  12. Intel convinced everyone that the speed of a computer was equal to the clock speed of its processor. Many people believed this, even when using a 56k modem, even when the AMD Opteron came out, even when Intel went to multi-core and lower clock speeds. A cautionary tale for those joining the Gb race. Which wraps up the scale-up...
  13. ... and leads us into quality
  14. ... and leads us into quality
  15. Make the best small engine in the world
  16. Made high-quality cars for years. Recognized after years of consistent performance.
  17. Now enjoy premium cost and high resale value. Everyone I know has a Honda Odyssey.
  18. Money from the T-bird allowed them to design, develop, and introduce the...
  19. It’s gotten better
  20. Google image search, second or third result. Draw your own conclusions.
  21. This distrust of base calls and quality values has reinforced the cult of traces. This does not scale for human resources, disk space, etc. This leads to a very bad situation for those of us responsible for the computing, storage, and network infrastructure.
  22. Quality is at the core of all other issues: storage, compute, throughput, etc. If it’s a bad base, call it a bad base. Don’t forget the GHz race.
  23. Reducing data to base calls and quality values does reduce its value, especially for data not natively in “base space”. Is there a richness in this data that is lost? But you gain not having to have custom tool chains for each native data type.
  24. 2 bits/base is the absolute minimum
  25. Grid
  26. No one ever feels lucky
  27. No one ever feels lucky
  28. They have learned their lesson by creating an incredible amount of XML to submit: Study, Sample, Experiment, Run
  29. He may know a lot about software, but he does not know anything about building cathedrals
  30. Currently, revisions are tightly controlled by central repositories: NCBI, UCSC, EBI
  31. Push and pull around diffs. Balance curation with rapid advances. Debian web of trust.
  32. How far will FASTA get you? C. elegans: part of the genome repeat structure, http://genomebiology.com/2006/7/1/R7. Can you use the current de Bruijn graph assembly engines for alignment?
  33. Talk to me
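
The byte estimates in note 7 can be checked with a few lines of Python. This is only a rough sketch: the read name comes from the FASTQ slide, while the 75-character sequence and quality strings are synthetic placeholders and the helper `fastq_record` is a hypothetical name introduced here for illustration.

```python
# Read name taken from the FASTQ slide; sequence and quality strings are
# synthetic 75-character placeholders (assumptions, not real data).
header = "HWI-EAS404:5:1:6:180#0/1"
sequence = "ACGT" * 18 + "ACG"   # 75 bases
quality = "a" * 75               # one quality character per base


def fastq_record(header, sequence, quality, repeat_header=True):
    """Build one four-line FASTQ record, optionally repeating the name on the '+' line."""
    plus_line = "+" + header if repeat_header else "+"
    return f"@{header}\n{sequence}\n{plus_line}\n{quality}\n"


full = fastq_record(header, sequence, quality)
compact = fastq_record(header, sequence, quality, repeat_header=False)

full_bytes = len(full.encode("ascii"))
compact_bytes = len(compact.encode("ascii"))

# Bytes spent on the '@' and '+' lines, excluding newlines.
header_bytes = 2 * (1 + len(header))
header_fraction = header_bytes / full_bytes
saving_fraction = (full_bytes - compact_bytes) / full_bytes

print(f"full record:    {full_bytes} bytes")
print(f"compact record: {compact_bytes} bytes")
print(f"headers:        {header_fraction:.1%} of the full record")
print(f"saving:         {saving_fraction:.1%} by not repeating the header")
```

With this particular read name the numbers land close to the round figures in the note: about 200 bytes per record, roughly a quarter of it in the two header lines, and roughly an eighth saved by leaving the '+' line bare.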
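
Note 24's floor of 2 bits per base follows from having four possible bases. A minimal sketch of such a packing is shown below; `pack` and `unpack` are hypothetical helper names, and the scheme assumes only A, C, G and T (no N calls) and drops quality values entirely, so it illustrates the lower bound rather than a practical storage format.

```python
# Two-bit code per base; assumes the read contains only A, C, G, T.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq):
    """Pack a base string into bytes, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODES[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data, n_bases):
    """Recover the first n_bases bases from packed bytes."""
    return "".join(
        BASES[(data[i // 4] >> (2 * (i % 4))) & 0b11] for i in range(n_bases)
    )


read = "ACGT" * 18 + "ACG"   # synthetic 75-base read
packed = pack(read)

print(len(read), "bases as text:", len(read), "bytes")
print(len(read), "bases packed: ", len(packed), "bytes")
assert unpack(packed, len(read)) == read
```

The read length has to be stored separately, since the last byte may be only partly used.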