1. Data of Unusual Size in Metagenomics
C. Titus Brown
ctb@msu.edu
Asst Professor, Michigan State University
(Microbiology, Computer Science, and BEACON)
3. Thanks
• My lab, esp. Jason Pell, Arend Hintze, Adina Chuang Howe, Qingpeng Zhang, and Eric McDonald
• Michigan State, USDA and NSF for $$
4. “Three types of data scientists.”
(Bob Grossman, U. Chicago, at XLDB 2012)
1. Your data gathering rate is slower than Moore’s Law.
2. Your data gathering rate matches Moore’s Law.
3. Your data gathering rate exceeds Moore’s Law.
5. Metagenomics
• Randomly sequence DNA from mixed microbial communities, e.g. soil.
• DNA sequencing rates (cost/volume) have been outpacing Moore's Law for ~5 years now… a terabase for ~$10k today.
7. “Shredding libraries” is a good analogy!
• Lots of copies of Dickens's A Tale of Two Cities, SAT study guides, etc.
• Not as many copies of <obscure hipster author>.
• Many different editions with minor differences, plus Reader's Digest versions, excerpts, etc.
• (Although for libraries, we usually know the language.)
8. Two points:
1. If we feed all of the libraries in the world into a paper shredder and mix, how do we recover the book content!?
10. Digression: Data of Unusual Size (aka Big Data) in Scientific Research
• Research is already hard enough:
– Novel, fast-moving, heterogeneous data types.
– Unknown answers.
• Big Data => scaling, which requires good engineering:
– Apply or invent new data structures & algorithms.
– Write usable, functioning, reusable software.
(Hint: academics are not good at one of these things.)
11. The assembly problem
• The N**2 approach: compare all pairs of fragments, looking for overlaps.
• The word-based approach: decompose reads into fixed-length, overlapping, hashable words (k-mers), and find overlaps via shared words (sketched below).
(Only one of these scales…)
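To make the word-based approach concrete, here is a minimal Python sketch of decomposing a read into overlapping k-mers. The function name and the choice of k are illustrative, not the talk's actual code:

    def kmers(read, k=32):
        """Yield all overlapping k-mers of a read."""
        for i in range(len(read) - k + 1):
            yield read[i:i + k]

    print(list(kmers("ATGGCATTGACC", k=5)))
    # ['ATGGC', 'TGGCA', 'GGCAT', 'GCATT', 'CATTG', 'ATTGA', 'TTGAC', 'TGACC']

Because k-mers are fixed-length, they can be hashed and compared in constant time, which is what lets this approach scale where all-vs-all overlap comparison does not.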
12. Shotgun sequencing
"Coverage" is simply the average number of reads that overlap each true base in the genome.
In the slide's figure, the coverage is ~10: just draw a line straight down from the top through all of the reads.
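As a toy illustration of the definition (the numbers here are made up, not from the talk): coverage is total read bases divided by genome length.

    num_reads = 1_000_000
    read_len = 100           # bp per read
    genome_len = 10_000_000  # bp

    coverage = num_reads * read_len / genome_len
    print(coverage)  # 10.0 -- each true base is overlapped by ~10 reads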
13. Reducing read overlaps to k-mer overlaps
Note that k-mer abundance is not properly represented in the slide's figure! Each blue k-mer will be present around 10 times.
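A toy abundance counter makes this concrete; an exact Counter stands in here for the probabilistic counting the real software uses, and the reads and k are illustrative:

    from collections import Counter

    def kmer_abundance(reads, k=5):
        counts = Counter()
        for read in reads:
            for i in range(len(read) - k + 1):
                counts[read[i:i + k]] += 1
        return counts

    reads = ["ATGGCATT"] * 10  # 10x coverage of one tiny "genome"
    print(kmer_abundance(reads))
    # Counter({'ATGGC': 10, 'TGGCA': 10, 'GGCAT': 10, 'GCATT': 10})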
14. Errors create new k-mers
Each single base error generates ~k new k-mers.
Generally, erroneous k-mers show up only once – errors are random.
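A quick sketch of why a single substitution yields ~k novel k-mers: every k-mer spanning the error position differs from the truth. Sequences and k here are illustrative:

    def kmer_set(seq, k):
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    k = 5
    true_seq = "ATGGCATTGACCTTGA"
    bad_seq  = "ATGGCATTCACCTTGA"  # one substitution (G -> C) at position 8

    novel = kmer_set(bad_seq, k) - kmer_set(true_seq, k)
    print(len(novel))  # 5, i.e. ~k erroneous k-mers from one error

Since errors are random, each erroneous k-mer typically appears only once, which is what makes it separable from true k-mers seen roughly coverage-many times.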
15. So, our k-mer data contains both true and false k-mers.
16. Random sampling => deep sampling needed
Typically 10-100x coverage is needed for robust recovery (e.g. a 3 Gbp human genome at 100x is ~300 Gbp of reads).
18. Uneven representation complicates matters.
Since you're sequencing at random, you need to sequence deeply in order to be sensitive to rare hipster books.
These rare hipster books may be important to understanding culture: not only best-sellers have influence!
25. Streaming algorithm for lossy compression of data sets
• Converts random sampling to systematic sampling by building an assembly graph on the fly.
• Can discard up to 99.9% of the data set (and its errors), and still retain all the information necessary for assembly.
• Acts as a prefilter for assemblers; ~5 lines of Python (see the sketch below).
• Each piece of data is only examined once (!)
• Most errors are never collected => low memory.
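For intuition, here is a toy sketch of the streaming idea (digital normalization); an exact Counter stands in for the real low-memory counting structure, and k, C, and all names are illustrative rather than the talk's actual code:

    from collections import Counter
    from statistics import median

    def normalize(reads, k=20, C=20):
        counts = Counter()
        for read in reads:
            if len(read) < k:
                continue
            kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
            if median(counts[km] for km in kmers) < C:
                # Novel coverage: count its k-mers and keep the read.
                for km in kmers:
                    counts[km] += 1
                yield read
            # else: redundant read -- examined once, then discarded,
            # so any erroneous k-mers it carries are never stored.

Only the first ~C-fold coverage of each region is retained; everything past that is thrown away in a single pass, which is where both the heavy data reduction and the low memory come from.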
26. Separately, apply Bloom filters to storing the information/data
In the slide's plot, "Exact" denotes the best possible information-theoretic storage.
Pell et al., PNAS 2012
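A minimal Bloom-filter sketch for k-mer membership, in the spirit of the Pell et al. approach; the sizing and salted-hash scheme here are illustrative, not khmer's actual implementation:

    import hashlib

    class BloomFilter:
        def __init__(self, size=1_000_000, num_hashes=4):
            self.size = size
            self.num_hashes = num_hashes
            self.bits = bytearray(size)  # one byte per bit, for simplicity

        def _positions(self, kmer):
            # Derive num_hashes positions by salting one hash function.
            for i in range(self.num_hashes):
                h = hashlib.sha256(f"{i}:{kmer}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.size

        def add(self, kmer):
            for p in self._positions(kmer):
                self.bits[p] = 1

        def __contains__(self, kmer):
            # May report a false positive, never a false negative.
            return all(self.bits[p] for p in self._positions(kmer))

    bf = BloomFilter()
    bf.add("ATGGCATTGACCTTGAGGAC")
    print("ATGGCATTGACCTTGAGGAC" in bf)  # True

The trade-off is a tunable false-positive rate in exchange for memory well below what exact storage requires.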
27. Some details
• This problem was previously completely intractable.
• Implemented in C++ and Python; "good practice" (?)
• We've changed the scaling behavior: memory now scales with information content rather than data volume.
• Practical scaling for ~soil metagenomics is 10-100x: < 1 TB of RAM for ~2 TB of data, in ~2 weeks.
• Just beginning to explore threading, multicore, etc. (BIG DATA grant proposal)
• Goal is to scale to 50 Tbp of data (~5-50 TB of RAM currently).
28-32. My rules of thumb for Big Data (for a better tomorrow)
1. Write well-understood filters and components, not monolithic programs.
2. Throw away data as quickly as possible.
3. Scripting is an extremely effective way to connect serious software to scientists.
4. Streaming/online approaches are worth the effort to develop them. (OK, this is obvious to this audience.)