1. www.ebi.ac.uk/arrayexpress
EBI is an Outstation of the European Molecular Biology Laboratory.
MINSEQE standard, data formats, and storage
Helen Parkinson, PhD
EMBL-EBI
2. Scope: UHTS Standardization at multiple levels:
From sequence reads to interpreting results for transcriptomics experiments
• Sequence reads: Sequence Read Format (SRF), http://srf.sourceforge.net/
• Alignment / Assembly / Finishing: proposed Assembly and Alignment Format (AAAF)
• Local data storage
• Public repositories, e.g., DDBJ/EMBL/GenBank: experimental details for sequence identification
  • Minimum Information About a Genome Sequence (MIGS)
  • Minimum Information about a Metagenomic Sequence/Sample (MIMS)
• Public repositories, e.g., ArrayExpress, GEO: experimental details for expression, binding, modifications, sequence changes
  • Minimal Information about a high-throughput SEQuencing Experiment (MINSEQE)
3. MINSEQE Status
• Proposal available
  • http://www.mged.org/minseqe
• Data deposition support:
  • ArrayExpress
  • GEO
  • Short read archives
• Builds on SRF and proposed AAAF
  • SRF: sequence reads and IDs (the raw DNA sequences comprise base calls, quality values and platform-specific information)
  • AAAF: define and format reference genome, consensus genome, absolute and relative assembly (Asim Siddiqui, Gabor Marth and Paul Flicek)
• UHTS Quality Metrics Workgroup started
  • Metrics for assessing quality for the different platforms
  • Marc Salit (NIST)
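To make the description of a sequence read above concrete, here is a minimal Python sketch of a record holding base calls, per-base quality values and platform-specific information. It is an illustration only and does not reproduce the SRF container layout; the class name `ReadRecord`, its fields and the example values are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ReadRecord:
    """Hypothetical in-memory view of one sequence read (not the SRF file layout)."""

    read_id: str
    bases: str                      # base calls, e.g. "ACGT"
    qualities: List[int]            # one quality value per base call
    platform_info: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Base calls and quality values must line up one-to-one.
        if len(self.bases) != len(self.qualities):
            raise ValueError("each base call needs exactly one quality value")

    def mean_quality(self) -> float:
        """Average quality over the read; 0.0 for an empty read."""
        if not self.qualities:
            return 0.0
        return sum(self.qualities) / len(self.qualities)


# Made-up example read.
read = ReadRecord("read_0001", "ACGT", [30, 32, 28, 30], {"platform": "example"})
print(read.mean_quality())
```

A per-read summary such as `mean_quality` is one simple kind of quantity a quality-metrics effort might compare across platforms; the slides do not specify which metrics the workgroup uses.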
4. MINSEQE Implementation
• ArrayExpress support from Jan 2011; algorithm designed and implemented for submissions from AE and GEO
• SequenceScape, the Sanger data management system; working with other centres
• ENA/EGA @ EBI: submission is split between ENA and ArrayExpress, with in-scope data coming direct to AE for curation and scoring
• Bioconductor pipeline for UHTS data using MINSEQE information from AE: ArrayExpressHTS, submitted to Bioinformatics
5. ArrayExpress implementation for all GEO and AE data
• MINSEQE-compliant templates generated
• Scoring algorithm designed and implemented
• Standards applied at the point of submission by curators @ EBI and NCBI GEO
• All data scored at EBI, including presence/absence of raw data in ENA/SRA and of variables
• GUI modified to include search by MINSEQE scores, e.g., all RNA-based HTP sequencing experiments with raw data in EGA
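The slides do not describe the scoring algorithm itself, so the following Python sketch is only a guess at its general shape: a presence/absence checklist summed into a score, plus a search filter like the GUI example above. The `Experiment` fields, the four checks, the function names and the accessions are all hypothetical and are not taken from ArrayExpress.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Experiment:
    accession: str                     # made-up accessions below
    molecule: str                      # e.g. "RNA" or "DNA"
    raw_data_archive: Optional[str]    # e.g. "ENA", "SRA", "EGA"; None if no raw data found
    has_processed_data: bool
    has_variables: bool
    has_protocols: bool


def minseqe_score(exp: Experiment) -> int:
    """Hypothetical score: one point per checklist item that is present."""
    checks = [
        exp.raw_data_archive is not None,
        exp.has_processed_data,
        exp.has_variables,
        exp.has_protocols,
    ]
    return sum(checks)


def search(experiments: List[Experiment], molecule: str, archive: str,
           min_score: int = 0) -> List[str]:
    """Return accessions matching molecule, raw-data archive and a minimum score."""
    return [
        e.accession
        for e in experiments
        if e.molecule == molecule
        and e.raw_data_archive == archive
        and minseqe_score(e) >= min_score
    ]


experiments = [
    Experiment("E-XXXX-1", "RNA", "EGA", True, True, True),
    Experiment("E-XXXX-2", "RNA", None, True, True, False),
    Experiment("E-XXXX-3", "DNA", "EGA", False, True, True),
]

# "All RNA-based HTP sequencing experiments with raw data in EGA"
print(search(experiments, molecule="RNA", archive="EGA"))
```

Keeping the score as a sum of independent boolean checks makes it easy to add or drop checklist items if the standard is revised.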
6. Future work
• Release MINSEQE-supporting GUI for ArrayExpress
• Implementation of taxon-specific rules for transcriptomics data at the EBI short read archive and SequenceScape
• Likely revision based on changes in the file formats and processing software used in the community (e.g., SRF is not well supported)
• Solicitation of support from journals
• Contributors:
  • FGED (was MGED), scientists, funders, journals, companies, esp. Chris Stoeckert, FGED president
  • Wellcome Trust Sanger Institute: funded by the Wellcome Trust
  • EBI (EGA/ArrayExpress/ERA): EC funding
  • NCBI GEO: NIH
Editor's notes
CAMERA stands for Community Cyberinfrastructure for Advanced Marine Microbial Ecology Research and Analysis
SRF: sequence reads and IDs (the raw DNA sequences comprise base calls, quality values and platform-specific information)
AAAF: define and format reference genome, consensus genome, absolute and relative assembly (Asim Siddiqui, Gabor Marth and Paul Flicek)
UHTS quality workgroup: metrics for assessing quality for the different platforms