AceDB - A Caenorhabditis elegans database; object-oriented. Example records:

Author "Patel B"
  Full_name "Bala Patel"
  Laboratory CB
  Paper [cgc1011]
  Paper [cgc533]
  Mail "Laboratory of Molecular Biology"
  Mail "Hills Road, Cambridge"
  Fax "050 3456789"

Paper [cgc533]
  Title "Yet more of those Genes"
  Journal "Cell Reports"
  Volume 3
  Year 1993
Not an expert; the reason will be explained later in the presentation.
* personal ideas/opinions; not necessarily Sanger’s
My background
There are hurdles for adoption in academia
Many people in the institute; many ways of doing things + many tools
Data is often transitory. Apart from the raw sequencing data (served by e.g. the EBI), data can often be archived once the paper is written.
Because data is transitory, a one-off script that takes 5 minutes to write and a day to run is often preferable to one that takes a day to write and 5 minutes to run.
Because we want to focus on the research, not the tools. If the available tools get the work done, they will suffice.
In many smaller labs, data management is not part of the initial grants. This is starting to change with next-generation sequencing data.
"Genome hacker": very broad, from the person who knows how to record macros in Word to hardcore mathematicians.
"Need IT support for heavier work": e.g. setting up a MongoDB server => what if you need a sharded cluster? => requires investment from IT.
"Creating legacy": if it's something that will be used after you're gone (typical postdoc contract: 3-5 yrs), you don't want to use a technology that is not supported or actively used within the organization.
"Often self-taught"!!!
"Normalization?": it is overkill to try to persuade them to use databases if you first have to teach them normal forms.
What does the data look like?
Very difficult to parse without custom libraries (bio*)
"//" => separator that starts a new record
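Splitting such a flat file on its "//" record separator is exactly the kind of five-minute script the notes above describe. A minimal sketch (the sample field names are hypothetical, not from any real file):

```python
def records(lines):
    """Yield records from a flat-file stream where a line containing
    only '//' separates records (as in GenBank/EMBL-style flat files)."""
    record = []
    for line in lines:
        if line.strip() == "//":
            if record:  # emit the record accumulated so far
                yield record
            record = []
        else:
            record.append(line.rstrip("\n"))
    if record:  # trailing record without a final '//'
        yield record
```

For anything beyond this (feature tables, sequence blocks), the bio* libraries mentioned above are the safer route.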
State of the art: VCF. It is tab-delimited, but not really.
"##": file header
"#": column headers
INFO field: ';'-separated tag-value pairs (tag and value themselves separated with a '=')
FORMAT field: colon-separated; necessary to know what is in the NA00001 column
Not really tab-delimited anymore, because too structured. Self-taught => simple scripting languages!
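The structure hiding inside those "tab-delimited" columns is easy to see in code. A sketch of the two sub-parsers a typical one-off script needs, using the standard VCF encodings described above (the example tag names are illustrative):

```python
def parse_vcf_info(info):
    """Split a VCF INFO field into a dict: ';'-separated tag=value
    pairs; flag tags without a '=' map to True."""
    out = {}
    for item in info.split(";"):
        if "=" in item:
            tag, value = item.split("=", 1)
            out[tag] = value
        else:
            out[item] = True  # flag, e.g. DB
    return out

def parse_vcf_sample(fmt, sample):
    """Pair the colon-separated FORMAT keys with one sample column."""
    return dict(zip(fmt.split(":"), sample.split(":")))

# Example: INFO "NS=3;DP=14;DB" plus FORMAT "GT:GQ" against sample "0|0:48"
print(parse_vcf_info("NS=3;DP=14;DB"))
print(parse_vcf_sample("GT:GQ", "0|0:48"))
```

Each tab-separated column thus needs its own mini-parser, which is why the format is "not really" tab-delimited.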
New technologies + existing technologies improved + decreasing cost of data generation
Would benefit most. "Bench-based scientists":
- are more and more learning Perl and working with tab-delimited files
- to go from Excel to a database: JSON looks more like how they think than having to cope with normalization steps in a relational database
"Big data": auto-sharding, mapreduce, …
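To illustrate why JSON matches how they think: the author record from the AceDB example fits in one nested document, where a normalized relational schema would need several tables (authors, papers, and a join table). A sketch, with the field names chosen here for illustration:

```python
import json

# One nested document instead of authors/papers/author_paper tables.
author = {
    "name": "Patel B",
    "full_name": "Bala Patel",
    "laboratory": "CB",
    "mail": ["Laboratory of Molecular Biology", "Hills Road, Cambridge"],
    "papers": [
        {"id": "cgc533", "title": "Yet more of those Genes",
         "journal": "Cell Reports", "volume": 3, "year": 1993},
        {"id": "cgc1011"},
    ],
}
print(json.dumps(author, indent=2))
```

The document maps one-to-one onto how the record is already written down, which is the appeal over a normalization step.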
In-road into research: via the department bioinformatician, who is constantly looking for new things. Least effort to implement and least costly if it fails.
Focus is often on data exchange => a lot of effort goes into exchange file formats.