Brief presentation on the challenges and current state of play in the bioinformatics of a pathogen, M. tuberculosis. Presented at the UWC/UCT Big Data workshop in January 2015
Bioinformatics of TB: A case study in big data
1. Bioinformatics of TB
A case study in big data
Peter van Heusden
pvh@sanbi.ac.za
and Alan Christoffels
South African National Bioinformatics Institute
University of the Western Cape
Bellville, South Africa
January 2015
3. M. tuberculosis
Widespread pathogen, responsible for 1.3 million deaths annually
Genome size ~4 megabases
Illumina NGS sequencing run ~2 gigabytes (uncompressed)
Typical student project (2014):
1. Gather data (on hard disk / over network)
2. Run annotation pipeline (compute time < 1 week, disk used 20 to 40 GB)
3. Examine significance of variation compared to the “reference sequence” (a minimal sketch of this step follows below)
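To make step 3 concrete, here is a minimal sketch that tallies simple SNPs from a VCF of variants called against the H37Rv reference genome. The filename is hypothetical, and the script assumes a standard, uncompressed single-sample VCF from the upstream pipeline.

```python
# Count simple SNPs in a VCF called against the H37Rv reference.
# The input path below is a placeholder, not a real pipeline output.

def count_snps(vcf_path):
    """Return the number of single-base substitutions in a VCF."""
    snps = 0
    with open(vcf_path) as vcf:
        for line in vcf:
            if line.startswith("#"):
                continue  # skip meta-information and column headers
            fields = line.rstrip("\n").split("\t")
            ref, alt = fields[3], fields[4]  # REF and ALT columns
            # a simple SNP: single-base REF and every ALT single-base
            if len(ref) == 1 and all(len(a) == 1 for a in alt.split(",")):
                snps += 1
    return snps

if __name__ == "__main__":
    print(count_snps("sample_vs_H37Rv.vcf"), "SNPs relative to H37Rv")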
5. What’s coming down the pipe
In South Africa alone we have access to samples from several thousand strains of TB
The low cost of sequencing means:
1. More depth: capture the population of pathogens in a single patient
2. More length: study the progression of infection in a patient
3. More breadth: build an in-depth regional or global picture of pathogen sequences
(a back-of-envelope estimate of the resulting data volumes follows below)
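The scale these three directions imply is easy to underestimate, so here is a back-of-envelope calculation using the ~2 GB-per-run figure from earlier; the strain count and the depth/time multipliers are illustrative assumptions, not measured values.

```python
# Rough data-volume arithmetic; all multipliers below are assumptions.
GB_PER_RUN = 2      # uncompressed Illumina run, per the earlier slide
strains = 5000      # "several thousand strains" available in South Africa
depth_runs = 5      # assumed extra runs per patient for within-host depth
timepoints = 4      # assumed longitudinal samples per patient

baseline = strains * GB_PER_RUN
expanded = strains * GB_PER_RUN * depth_runs * timepoints

print(f"one run per strain:  {baseline / 1000:.0f} TB")   # 10 TB
print(f"depth + time series: {expanded / 1000:.0f} TB")   # 200 TB
```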
6. Mapping a virulent TB strain
“Evolutionary history and global spread of the Mycobacterium tuberculosis Beijing lineage”, Merker et al. (2015)
Beijing lineage strains associated with Multi-Drug Resistant (MDR) TB spread worldwide
Studied 4,987 isolates, fully sequenced 110 representatives
Mapped 6 clonal complexes and an ancestral basal sublineage
The paper presents a wealth of different data types (a toy record combining them follows this list):
1. DNA reads
2. Genotyping
3. Phylogeny
4. Geospatial
5. Time series data
6. Metadata on samples and experiments
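One way to see why this mix is hard to manage: a single isolate in a study like this touches all six data types at once. The record below is a sketch only; the field names and values are invented for illustration and are not drawn from Merker et al.

```python
from dataclasses import dataclass

@dataclass
class Isolate:
    """One sample, touching all six data types (illustrative schema)."""
    sample_id: str           # metadata on samples and experiments
    reads_accession: str     # DNA reads, e.g. an ENA/SRA accession
    genotype: dict           # genotyping calls, locus -> allele
    clade: str               # placement in the phylogeny
    latitude: float          # geospatial origin of the sample
    longitude: float
    collection_date: str     # the time series dimension (ISO 8601)

# Entirely made-up example values
isolate = Isolate(
    sample_id="TB-0001",
    reads_accession="ERR000000",
    genotype={"katG": "S315T"},
    clade="Beijing clonal complex 1",
    latitude=55.75,
    longitude=37.62,
    collection_date="2010-03-15",
)
print(isolate.sample_id, isolate.clade)
```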
7. More data: not more of the same
Existing publishing puts the focus on results, not data
Research data is very seldom FAIR:
1. Findable
2. Accessible
3. Interoperable
4. Reusable
(j.mp/fairdata1)
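As a sketch of what FAIR might mean in practice for a single sequencing dataset, the record below pairs each principle with a plausible metadata field. The schema and all values are illustrative assumptions, not any published standard.

```python
# Illustrative FAIR-style metadata for one dataset; every field and
# value here is a made-up example, not a standard schema.
dataset = {
    # Findable: a persistent, globally unique identifier
    "identifier": "doi:10.9999/example-tb-run-001",
    # Accessible: a standard protocol and retrieval location
    "access_url": "https://example.org/data/tb-run-001.fastq.gz",
    # Interoperable: community formats and controlled vocabularies
    "format": "FASTQ",
    "organism": "Mycobacterium tuberculosis",
    # Reusable: clear licence and provenance
    "license": "CC-BY-4.0",
    "derived_from": "clinical isolate; see linked study record",
}
for key, value in dataset.items():
    print(f"{key}: {value}")
```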
8. Change data handling, change research results
In the 21st century, much of the vast volume of scientific data captured by new instruments on a 24/7 basis, along with information generated in the artificial worlds of computer models, is likely to reside forever in a live, substantially publicly accessible, curated state for the purposes of continued analysis. This analysis will result in the development of many new theories! (Jim Gray)
“Big” in “Big Data” is not (only) about data volume
Cheap pathogen sequencing is driving the complexity of the questions that can be asked of the data
...but only if the data is FAIR
10. Why we’re not all riding to work on unicorns
[W]e now have terrible data management tools for most of the science disciplines. . . . When you go and look at what scientists are doing, day in and day out, in terms of data analysis, it is truly dreadful. (Jim Gray)
Who curates your data?
How is it managed?
Where is it analysed?
And who gets access?
11. Future directions for SANBI (data management) research
Research programme is necessarily modest:
1. Cross-institution authentication, authorisation and movement of data
2. New storage technologies
3. Data repositories in addition to filesystems
4. Storing and querying data on sequence collections, not individual samples (a toy collection-level query follows this list)
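Item 4 is the least familiar, so here is a toy version of a collection-level query: a k-mer index over a set of genomes that answers "which samples contain this sequence?" in one lookup. Real systems use compressed structures such as coloured de Bruijn graphs; this brute-force dictionary and the tiny sequences are illustrative only.

```python
from collections import defaultdict

K = 21  # a typical k-mer length for bacterial genome indexes

def build_index(genomes):
    """Map each k-mer to the set of sample names that contain it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for i in range(len(seq) - K + 1):
            index[seq[i:i + K]].add(name)
    return index

# Tiny made-up sequences standing in for ~4 Mb genomes
genomes = {
    "strain_A": "ACGT" * 12,
    "strain_B": "ACGT" * 6 + "TTTT" + "ACGT" * 6,
}
index = build_index(genomes)
query = "ACGT" * 5 + "A"  # one 21-base query k-mer
print(sorted(index.get(query, set())))  # -> ['strain_A', 'strain_B']
```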
Individual institutes can only prototype solutions: the scale of the challenge will require much broader collaborative development