I will start off with the bacterial tree of life, taken from the publication of Lasken and McLean. This phylogeny contains all detected bacterial phyla that we know of.
What is striking in this figure is the large amount of red. These are all the phyla that are only known from sequence data. In the paper where this phylogeny was shown, they isolated single cells from several of these “unknown” taxa.
This shows us that an important part of bacterial diversity in the environment is not known. The same goes for the microbial Eukaryota and the Archaea. All the major branches of life contain complete phyla that have not been cultivated. Part of this is because they do not kill us, but part of it is also because we fail to recognize how to look for them. Several groups have recently been shown to be more interesting than we expected, for instance the OD1 and OP11 phyla.
In the paper from Luef et al., they did something most of us would not do. In hospitals, 0.2 µm filtration is used to keep fluids sterile. Here they filtered groundwater, to which acetate was added, through a 0.2 µm filter and collected the cells on a 0.1 µm filter. They obtained DNA, which was used for 16S amplicon sequencing and for shotgun sequencing, or metagenomics. The main groups of bacteria that were identified belonged to the taxa OP11 and OD1. The metagenomic analysis indicated that these bacteria have a limited metabolic potential and probably rely on other bacteria to obtain basic resources and compounds for their metabolism.
So this shows that metagenomics can be really powerful when it comes to understanding diversity. But what is metagenomics exactly?
The term metagenome was first coined by Handelsman in 1998. The metagenome is the collective genome of all the microorganisms in an environment. Metagenomics, then, is the study of genetic material recovered from an environmental sample.
With this technique we can investigate and compare microbial communities.
The technique that is used here is shotgun sequencing, not amplicon sequencing of a target gene.
The two main questions in metagenomics are:
Who is there?
What are they doing?
These questions are about the potential of a community. With DNA we only get to see what a community is capable of; we do not see the actual processes that are active. For that we need to add metatranscriptomics and metaproteomics.
Nonetheless, metagenomics gives us a way of generating hypotheses, and even of testing hypotheses on different communities.
To understand microbial communities, we first want to understand who is there. This gives us unbiased information on the diversity and the complexity of the ecosystem. Most people use amplicon sequencing for this, but with current sequencers we can do this on raw shotgun data or on assembled data.
The second question is often even more interesting, but is also a lot harder to do and understand. You need to get yourself familiar with biochemistry and biogeochemistry.
Now we know what metagenomics is, but how did the field of metagenomics start?
To understand communities we can use a lot of different techniques. We have seen amplicon sequencing of SSU rRNA or ITS during the course so far, but protein-coding genes can be used as well.
Another approach, which is quick, is the use of special microarrays, for instance the GeoChip. That microarray was used to understand changes in microbial metabolism after the Deepwater Horizon accident in the Gulf of Mexico.
But the most interesting method is shotgun sequencing. So why is this technique interesting?
Shotgun metagenomics did not start with this paper, but this was the paper that really put shotgun sequencing of the metagenome on the map.
The data for this paper was generated using Sanger sequencing. What they did was filter 200 liters of seawater from the Sargasso Sea, which is very oligotrophic, with very little biomass. Then they sequenced the extracted DNA with Sanger sequencing. It showed that most of the genes were unknown, and it doubled the NCBI NR database in one go.
Now we no longer use Sanger sequencing for metagenomics, but any of the other platforms.
As Gregor already explained, there are quite a lot of platforms that are used for high-throughput sequencing. This is a reminder of the different platforms, and of just how much has changed in the 12 years since Craig Venter published his metagenome of the Sargasso Sea sample.
Now the throughput of the machines has become pretty extensive and a lot more people have joined the field using either amplicon sequencing or shot gun sequencing.
All these people produce tons of data, which also shows when you start looking at the databases.
For instance MG-RAST, is a database for doing metagenomic analysis. We can play with it tomorrow afternoon for those who like that.
When I made this slide in 2012 there were already 74,462 datasets in MG-RAST, totalling 23 Tbp, which was already quite a good load of sequence data.
Three years later, there are over 160,000 datasets in MG-RAST and the number of base pairs has gone up to almost 70 Tbp. It is not a dramatic increase.
Sadly though, these are not all shotgun datasets; mostly it is amplicon datasets that are analysed. And each dataset is one sample. But still,
This is quite an impressive amount of data.
So what does a typical shotgun sequencing workflow look like?
Here we compare two of the main sequencing methods in microbial ecology: amplicon sequencing and shotgun sequencing, or whole-sample sequencing.
With amplicon sequencing you are either doing a taxonomic assessment of the community or a functional diversity study of one or more protein-coding genes.
With whole shotgun sequencing we start off with raw reads, which can be used directly for taxonomic and functional profiling, or we can use assembly to make larger sequences, or contigs.
With contigs we have the advantage that we can find complete open reading frames for proteins, while that is not the case for single reads. Any idea what the average nucleotide length of a typical bacterial gene is? It is about 1,000 base pairs, so a read might only cover a small part of it.
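To make that point concrete, here is a minimal back-of-the-envelope sketch; the 150 bp read length is an assumed typical Illumina value, not a number from the slides:

```python
# Rough illustration: how much of a typical bacterial gene
# can a single short read cover?
gene_len = 1000   # average bacterial protein-coding gene, ~1 kbp
read_len = 150    # assumed typical Illumina short-read length

fraction = read_len / gene_len
# A single 150 bp read spans at most 15% of a 1 kbp gene,
# so complete open reading frames are rarely found on raw reads.
print(f"A read covers at most {fraction:.0%} of the gene")
```

This is why assembly into contigs matters before gene prediction.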
Both methods need sequence classification for community profiling to answer who is there. With shotgun sequencing we add the functional profiling as well.
When we have contigs or reads we need to annotate those. And we can do that by sequence classification and subsequent annotation of those sequences. So how does that work?
Basically, sequence classification is the process of separating sequence data using specific information. We create bins.
In theory there are two methods we can use to do sequence classification.
The first one is classification using sequence composition.
There are several different methods for metagenomic analysis using sequence composition. We can look at nucleotide frequencies, and especially at short stretches of sequence, for instance tetranucleotides.
Then there is clustering of reads, and even assembly of reads.
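As a toy illustration of composition-based classification, the sketch below computes a tetranucleotide frequency vector for one sequence; real binners compare such 256-dimensional vectors between contigs. The function name and example sequence are made up for illustration:

```python
from collections import Counter
from itertools import product

def tetranucleotide_freqs(seq):
    """Return the normalized frequency of every possible 4-mer (256 in total)
    in a DNA sequence; non-ACGT characters are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    kmers = ["".join(p) for p in product("ACGT", repeat=4)]
    total = sum(counts.get(k, 0) for k in kmers)
    return {k: counts.get(k, 0) / total for k in kmers}

# Toy sequence; in practice this would be a contig of several kbp.
freqs = tetranucleotide_freqs("ACGTACGTACGTGGCC")
```

Contigs from the same genome tend to have similar tetranucleotide profiles, which is what makes this signal usable for binning.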
The last method is differential coverage of contigs. This method appeared in 2013 and uses differences between samples to bin contigs derived from a sequence pool of all samples. One of these methods is GroopM, which I will explain in a bit more detail.
GroopM needs one assembly. This assembly can be generated with a metagenomic assembler, or just a normal whole-genome assembler. The catch is that we pool the reads of all samples into one big assembly run and generate one big set of contigs. After assembly, we map the reads of each sample to the assembly and determine the coverage of each contig per sample. This information is then used to create a set of high-confidence bin cores from long contigs (1 kbp and up). This dataset is then screened for contamination and completeness using a reference dataset with 111 marker genes; this can be a step for manual curation. Finally, small contigs are recruited to the core bins. So at the end we end up with FASTA files for each bin.
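The core idea of differential coverage can be sketched with a toy example: contigs whose coverage rises and falls together across samples probably come from the same genome. The code below is a simplified stand-in, not GroopM's actual algorithm; the contig names, coverage values, distance threshold, and greedy clustering are all invented for illustration:

```python
import math

# Hypothetical per-sample mean coverage for five contigs across three samples.
coverage = {
    "contig_1": [40.0, 5.0, 12.0],
    "contig_2": [41.5, 4.8, 11.5],   # profile similar to contig_1 -> same bin
    "contig_3": [2.0, 30.0, 1.0],
    "contig_4": [2.2, 29.0, 1.2],    # profile similar to contig_3 -> same bin
    "contig_5": [40.8, 5.2, 12.3],
}

def distance(a, b):
    """Euclidean distance between two coverage profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def bin_contigs(cov, threshold=5.0):
    """Greedily group contigs whose coverage profile lies within
    `threshold` of an existing bin's founding profile."""
    bins = []  # list of (founding profile, [contig names])
    for name, profile in cov.items():
        for founder, members in bins:
            if distance(profile, founder) < threshold:
                members.append(name)
                break
        else:
            bins.append((profile, [name]))
    return [members for _, members in bins]

bins = bin_contigs(coverage)
```

With these made-up numbers the sketch recovers two bins, one per "genome"; real tools combine this coverage signal with sequence composition and marker-gene checks, as described above.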
So to give you an impression of the GroopM method, we can take a look at the analysis of a large synthetic metagenome.
In the GroopM paper they created a synthetic shotgun metagenome using 1159 genomes for reference and gave them real abundances based on a OTU table from a soil amplicon study.
After creating the shotgun data they assembled the data and got lots of contigs.
On the left you see what happens when you bin all these contigs using tetranucleotide frequencies alone.
On the right is the binning after GroopM has been applied.
The size of each circle is proportional to the length of the contig.