Kusarinoko: developing the public next generation sequencing data search interface that works.
1. Kusarinoko: developing the public next generation sequencing data search interface that works.
Tazro Ohta
Database Center for Life Science
Research Organization of Information and Systems
2. Problems for NGS data archive
managing large-scale data
Kusarinoko project: a better way to search and browse
fixing and adding metadata
Inside of Sequence Read Archive
statistics of SRA reveal what it looks like inside
Today’s topics
4. Storing large-scale NGS data causes many problems
data transfer, storage, backup...
Metadata management is one big problem for public NGS databases
metadata: description of sequencing data (sample, sequencer platform, application, etc.)
Fixing metadata is a lifeline for public NGS databases
Cost of storing large-scale sequence data
5. [Figure: a lab / research institute submits sequence data with metadata (Data ID: 000001, organism: mouse, cell: nervous cell, sequencer: 454, date: 2011 12 08) to the Sequence Read Archive. SRA, DRA, and ENA exchange and share data through INSDC (int’l nucleotide seq DB collaboration).]
Public NGS database, Sequence Read Archive
6. Over 55,000 submissions, over 350,000 sequence runs
and the amount and size of the data are still increasing
Metadata is provided separately, and is not always described accurately
submission / study / experiment / sample / run
Fixing metadata and adding extra information is NEEDED
It is not easy to find the data you want
13. Cutting the cost of using public data of SRA
search, browse, download, check
Giving more resources to support using data
is the data really sound?
Aim of Kusarinoko project
14. [Figure: Kusarinoko integrates the SRA metadata (Submission.xml, Study.xml, Experiment.xml, Sample.xml, Run.xml) with PubMed IDs obtained from sra.dbcls.jp and FastQC results (sequence quality calculated from the sequence data by FastQC).]
Integrate metadata, add extra information
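A minimal sketch of what "integrate" could look like, not the actual Kusarinoko implementation. It assumes heavily trimmed stand-ins for Experiment.xml and Run.xml, and the accessions, the PubMed ID, the quality value, and the function names (`load_platforms`, `integrate`) are all hypothetical placeholders for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed stand-ins for Experiment.xml and Run.xml.
EXPERIMENT_XML = """
<EXPERIMENT_SET>
  <EXPERIMENT accession="DRX000001">
    <PLATFORM><LS454/></PLATFORM>
  </EXPERIMENT>
</EXPERIMENT_SET>
"""

RUN_XML = """
<RUN_SET>
  <RUN accession="DRR000001">
    <EXPERIMENT_REF accession="DRX000001"/>
  </RUN>
</RUN_SET>
"""


def load_platforms(experiment_xml):
    """Map experiment accession -> name of the first element under PLATFORM."""
    platforms = {}
    for exp in ET.fromstring(experiment_xml.strip()).iter("EXPERIMENT"):
        platform = exp.find("PLATFORM")
        has_child = platform is not None and len(platform) > 0
        platforms[exp.get("accession")] = platform[0].tag if has_child else None
    return platforms


def integrate(experiment_xml, run_xml, pubmed_ids, quality):
    """Build one record per run: metadata + publication info + quality result."""
    platforms = load_platforms(experiment_xml)
    records = []
    for run in ET.fromstring(run_xml.strip()).iter("RUN"):
        run_acc = run.get("accession")
        ref = run.find("EXPERIMENT_REF")
        exp_acc = ref.get("accession") if ref is not None else None
        records.append({
            "run": run_acc,
            "experiment": exp_acc,
            "platform": platforms.get(exp_acc),
            "pubmed_ids": pubmed_ids.get(exp_acc, []),  # extra info: publications
            "quality": quality.get(run_acc),            # extra info: quality check
        })
    return records


# Placeholder extra information (assumed shapes, not real data).
pubmed_ids = {"DRX000001": ["00000000"]}
quality = {"DRR000001": {"mean_phred": 30.0}}

for record in integrate(EXPERIMENT_XML, RUN_XML, pubmed_ids, quality):
    print(record)
```

Keying everything by accession keeps the original XML untouched and lets the extra information be added or corrected independently.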
15. Covering only the data which has at least one published article
if a paper is not published yet, Kusarinoko cannot find the data. publication info: sra.dbcls.jp
Quality checking is still in beta
still being validated; offering better information will take more time
Limitation and features
22. Statistics of SRA by publication and seq quality
ONLY PUBLIC NGS DATA IN SRA WHICH HAS A PUBLICATION
Detailed statistics will be available online at the project website soon
Statistics for stepping into SRA
23. [Figure: number of submissions, 2007~2011. Blue: Roche, Yellow: Illumina, Green: AB, Pink: Helicos, Red: PacBio]
platform trend statistics
24. [Figure: number of PubMed IDs, colored by library type. Blue: genomic, Red: transcriptomic, Brown: metagenomic, Yellow: synthetic, Purple: Viral RNA, Green: non genomic]
total 97 journals (unidentified) 587
total # of pmid:
Journal statistics
25. quick quality calc: total average quality (Phred)
Blue: Roche, Yellow: Illumina, Green: AB, Pink: Helicos, Red: PacBio
same as max read length
total # of items (run): 16,006 (continuing)
minimum read length vs average quality value
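The two axes of this plot can be illustrated with a small calculation. This is a sketch only, assuming four-line FASTQ records with Phred+33 quality encoding; the function names and the two example reads are made up for illustration, and the talk itself used FastQC for the quality values.

```python
def phred_scores(quality_line, offset=33):
    """Decode one FASTQ quality line into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_line]


def read_length_and_mean_quality(fastq_text):
    """Return (min read length, max read length, mean Phred over all bases)."""
    lines = fastq_text.strip().splitlines()
    lengths = []
    total_quality = 0
    total_bases = 0
    # Each record is 4 lines: header, sequence, separator, quality.
    for i in range(0, len(lines), 4):
        sequence = lines[i + 1]
        scores = phred_scores(lines[i + 3])
        lengths.append(len(sequence))
        total_quality += sum(scores)
        total_bases += len(scores)
    return min(lengths), max(lengths), total_quality / total_bases


# Two made-up reads: 'I' encodes Phred 40, '5' encodes Phred 20.
EXAMPLE_FASTQ = """@read1
ATGC
+
IIII
@read2
ATGCAT
+
555555
"""

print(read_length_and_mean_quality(EXAMPLE_FASTQ))  # (4, 6, 28.0)
```

The mean is taken over all bases, so longer reads weigh more: (4 × 40 + 6 × 20) / 10 = 28.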
26. total N content rate: no correlation with number of reads or library prep methods
total # of items (run): 16,006 (continuing)
total number of reads vs N content
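For reference, an N content rate can be computed as the fraction of base calls reported as N. The sketch below assumes that definition and uses made-up reads; `n_content_rate` is a hypothetical helper, not part of FastQC or Kusarinoko.

```python
def n_content_rate(reads):
    """Fraction of all base calls across the reads that are 'N' (uncalled)."""
    total_bases = sum(len(read) for read in reads)
    if total_bases == 0:
        return 0.0
    n_bases = sum(read.upper().count("N") for read in reads)
    return n_bases / total_bases


# Made-up example: 3 of 10 bases are N.
reads = ["ATGCN", "NNGCA"]
print(len(reads), n_content_rate(reads))  # 2 0.3
```

Plotting the number of reads against this rate for each run gives the kind of scatter shown on the slide.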
27. total sequence duplication
same as previous stat: the number of reads does not seem to affect duplication
total # of items (run): 16,006 (continuing)
total number of reads vs duplication rate
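A simplified illustration of a duplication rate, assuming it means the fraction of reads that are exact copies of a read already seen. FastQC's own duplication estimate is computed differently, so this is a swapped-in simplification with a hypothetical helper name and made-up reads.

```python
from collections import Counter


def duplication_rate(reads):
    """Fraction of reads that exactly duplicate another read in the set."""
    if not reads:
        return 0.0
    distinct = len(Counter(reads))  # number of unique sequences
    return (len(reads) - distinct) / len(reads)


# Made-up example: 4 reads, 2 distinct sequences, so 2 of 4 are duplicates.
reads = ["ATGC", "ATGC", "GCAT", "ATGC"]
print(len(reads), duplication_rate(reads))  # 4 0.5
```

Exact matching ignores near-duplicates caused by sequencing errors, so the real rate may be underestimated.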
29. Developed a service to help searching and browsing SRA data
publication information and quality check results supplement the metadata.
Statistics revealed the inside of SRA and gave some insights
showed NGS trends; some items do not have sufficient quality even though they have a published article.
Detailed results and more at poster presentation: 2P-0132 (today)
Conclusion: for making use of public resources