An assessment with CEGMA showed that 97% and 98% of a conserved
set of eukaryotic genes were at least partially covered in the
pseudochromosome assemblies of two Bayer rice lines, compared to 98% in
both the 93-11 and Nipponbare public genomes. Furthermore, 99% of over
66k rice transcripts could be mapped to the assemblies, indicating high
coverage of the gene space. Finally, repeat analysis revealed that ~9% of
repetitive sequences were missing from the two Bayer assemblies,
accounting for their smaller sizes in comparison with the public genomes.
BCS 1 (y axis) pseudochromosomes vs. 93-11 chromosomes (x axis)
Whole genome de novo assembly of two Bayer elite lines was performed using data from Illumina sequencing of paired-end, mate pair, and
fosmid libraries and PacBio long reads. The assemblies were further improved by the use of a genetic map and alignment to the Nipponbare
genome. The construction of reference genomes for these elite lines provides a valuable resource for marker and gene discovery in our rice
breeding program, as well as for reference-based assemblies of additional Bayer indica lines.
Whole Genome De Novo Assembly of Two Bayer Elite Rice Lines
Joan W. Wong1, Pieter B. F. Ouwerkerk1, Christian Dreischer2, Bjoern Geigle2, and Sebastian J. Schultheiss2
1Bayer CropScience NV, Innovation Center, Technologiepark 38, 9052 Ghent, Belgium
2Computomics GmbH & Co. KG, Christophstr. 32, 72072 Tuebingen, Germany
Computational Life Sciences
CONCLUSION
ABSTRACT
We performed genome sequencing and de novo assembly for two elite
indica rice lines that are parents for a Bayer commercial hybrid. Initial
assemblies were performed using ALLPATHS-LG on Illumina reads from
paired-end and mate pair libraries. Fosmid-end sequences and PacBio long
reads were then used for further scaffolding and gap filling. A genetic map
constructed from sequencing data of 2000 F2 individuals was used to order
and orient >300 scaffolds, representing around 90% of the sequence length of each
assembly. Finally, remaining scaffolds were placed using the public
Nipponbare genome as a reference. The final assemblies comprised 1,244
and 1,522 scaffolds with N50 scaffold sizes of 3.0 and 2.1 Mb and total sizes
of 401 and 404 Mb, respectively. The iterative assembly enabled us to track
the progress with each added dataset and demonstrated the value of the
mate pairs, long reads, and genetic map.
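
The N50 scaffold size quoted above is the length L such that scaffolds of length L or longer together contain at least half of the total assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions, not values from the Bayer assemblies.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    N50 is the length of the shortest sequence in the smallest set of
    longest sequences that together cover at least half the total length.
    """
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Stop as soon as the cumulative length reaches half the total.
        if 2 * running >= total:
            return length


# Toy scaffold lengths (hypothetical, arbitrary units).
toy_scaffolds = [80, 70, 50, 40, 30, 20, 10]
print(n50(toy_scaffolds))  # prints 70
```

Recomputing this statistic after each added dataset is one simple way to track assembly progress across iterative steps.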
BACKGROUND
ALIGNMENT WITH INDICA REFERENCE GENOME
ASSEMBLY EVALUATION
ALIGNMENT WITH JAPONICA REFERENCE GENOME
ASSEMBLY PROCESS
1. ALLPATHS-LG: de novo assemble paired-end, mate-pair, and fosmid reads
2. PBJelly2 and SOAP GapCloser: scaffold and fill gaps with PacBio reads
3. Custom algorithm (Computomics): orient and place scaffolds using genetic map
4. RepARK: generate repeat libraries
5. ABACAS: assemble scaffolds + repeats based on japonica
6. PBJelly2 & GapCloser: fill remaining gaps
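
The poster does not describe the custom Computomics algorithm, so the following is only a minimal sketch of the general idea behind ordering and orienting scaffolds with a genetic map. It assumes each marker is given as a hypothetical tuple of (scaffold, position on scaffold in bp, linkage group, map position in cM). Each scaffold is assigned to a linkage group by majority vote, ordered by its median cM, and oriented by comparing the cM values of its first and last markers. The function name `order_scaffolds` and the example markers are illustrative assumptions.

```python
from collections import Counter, defaultdict
from statistics import median


def order_scaffolds(markers):
    """Order and orient scaffolds from (scaffold, bp, group, cM) markers.

    Returns {linkage_group: [(scaffold, strand), ...]} sorted by map position.
    Strand is "+", "-", or "?" when orientation cannot be inferred.
    """
    by_scaffold = defaultdict(list)
    for scaffold, bp, group, cm in markers:
        by_scaffold[scaffold].append((bp, group, cm))

    placed = defaultdict(list)
    for scaffold, hits in by_scaffold.items():
        # Assign the scaffold to the linkage group most of its markers map to.
        group = Counter(g for _, g, _ in hits).most_common(1)[0][0]
        # Keep only markers from that group, sorted along the scaffold.
        along = sorted((bp, cm) for bp, g, cm in hits if g == group)
        anchor = median(cm for _, cm in along)
        # Orientation: does cM increase or decrease along the scaffold?
        delta = along[-1][1] - along[0][1]
        strand = "+" if delta > 0 else "-" if delta < 0 else "?"
        placed[group].append((anchor, scaffold, strand))

    return {
        group: [(scaffold, strand) for _, scaffold, strand in sorted(items)]
        for group, items in placed.items()
    }


# Hypothetical markers, not taken from the Bayer genetic map.
example_markers = [
    ("scafA", 1000, "chr1", 12.0),
    ("scafA", 90000, "chr1", 10.5),
    ("scafB", 500, "chr1", 2.0),
    ("scafB", 40000, "chr1", 3.1),
    ("scafC", 700, "chr2", 5.0),
]
print(order_scaffolds(example_markers))
```

In this sketch, scaffolds with a single marker, or with no recombination between their markers, can be placed but not oriented, which is one reason a dense map from many individuals helps.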
ASSEMBLY SIZES
[Figure: bar chart of assembly size (Mb, axis 0 to 450) showing contig and scaffold totals for BCS 1 and BCS 2, de novo vs. reference-guided]
% conserved genes found (CEGMA)
Assembly             Complete   Partial
BCS 1                92         5
BCS 2                92         6
BGI indica           93         5
IRGSP japonica v5    93         5
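
The "at least partially covered" percentages reported in the assembly evaluation can be recovered from the chart values. The short sketch below assumes the bars are stacked, so that "Partial" counts genes found only partially and the two categories add up; the variable names are illustrative.

```python
# Percentages read from the chart (assumed stacked: complete + partial-only).
complete = {"BCS 1": 92, "BCS 2": 92, "BGI indica": 93, "IRGSP japonica v5": 93}
partial_only = {"BCS 1": 5, "BCS 2": 6, "BGI indica": 5, "IRGSP japonica v5": 5}

# Genes at least partially covered = complete + partial-only.
at_least_partial = {name: complete[name] + partial_only[name] for name in complete}

for name, pct in at_least_partial.items():
    print(f"{name}: {pct}%")
```

Under that assumption the totals are 97% and 98% for the two Bayer lines and 98% for both public genomes, matching the figures stated in the evaluation text.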