Keegan McAuliffe
MCB 432: Computing in
Molecular Biology
The following is my final presentation for MCB 432, detailing the process our
group undertook to determine the identity of an unknown bacterium. We were
provided with raw sequence reads from the bacterium, which we assembled into
contigs and scaffolds. We then annotated the assembled genome for potential
genes, which allowed us to identify the bacterium as Bacteroides vulgatus
str. 3975.
Keegan McAuliffe
Henry Chen
Andrew Storm
Dominic Gentile
Team 10 Results and Discussion
Introduction:
The advent of high-throughput sequencing has increased our ability to analyze genetic information.
In this project, we demonstrate how to use raw sequence data from sampled organisms for genetic and
genomic analysis. With the raw sequence reads provided by the PI, we assembled a genome for our
unknown microorganism using the A5ud assembler program (Table 1). From the resulting assembly, we
determined the total number of contigs and scaffolds and used them to predict and annotate genes
(Table 2). With the assembled genome in hand, we could search and analyze predicted genes in order
to characterize our unknown organism. We used the Prodigal algorithm for gene prediction. Prodigal
generates gene and protein predictions but does not indicate what those predicted genes and
proteins represent. We therefore needed other programs to annotate our predictions, and because
genes are complex, we had to choose programs suited to each kind of analysis. For instance, EMBOSS
programs search for alignments and patterns between an assembly and databases of well-known genes,
HMM and BLAST searches compare protein homology, and other programs search for features such as
tRNAs and signal peptides. Below, we describe how we applied these tools to our genome and present
our results.
Results: (Optional tasks)
The objective of Optional Task 1 was to determine the GC content of each gene. To obtain this
information, we first had to assemble our reads into contigs and scaffolds, which was the objective
of Mandatory Task 1. We first decompressed the read data using the “gunzip” command and then ran
the A5ud assembler on it. The assembler generated a quality trimming report, an assembly report, an
initial scaffolding report, a final scaffold quality check, error-corrected reads, contigs, crude
scaffolds, broken scaffolds, and final scaffolds. The assembly report contained the GC content for
each contig, which we added to Table 3. The average GC content for all contigs is 0.407. Because
G-C base pairs (three hydrogen bonds) are more thermally stable than A-T base pairs (two hydrogen
bonds), the DNA of our genome is expected to be somewhat less thermally stable than that of a
genome with GC content greater than 0.500.
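As an illustration of how a per-gene GC fraction like those in Table 3 can be computed, here is a minimal Python sketch. It is not the script we ran: the function names are hypothetical, the toy sequence is made up, and it assumes 1-based, inclusive gene coordinates like the Start and Stop columns of Table 3.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous A, C, G, T bases (0.0 if there are none)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


def gene_gc(scaffold: str, start: int, stop: int) -> float:
    """GC fraction of a gene given 1-based, inclusive coordinates on the scaffold."""
    return gc_content(scaffold[start - 1:stop])


if __name__ == "__main__":
    toy_scaffold = "ATGCGCTTAACGGGATAA"  # toy sequence, not from our assembly
    print(round(gene_gc(toy_scaffold, 1, 9), 3))
```

Strand does not need to be handled here, because a region and its reverse complement contain the same number of G-C pairs.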
The objective of Optional Task 3 was to determine the best BlastP match for our proteins against the NR
database. The first step was to determine the proper command to generate a single best match from
the NR database for each predicted protein, with an E-value less than 1e-10, along with the
organism to which it belongs, the accession number, and the percent identity. The command we used was:
blastp -db nr -query TeamProject.faa -out TeamProject.br -evalue 1E-10
-outfmt 6 -max_target_seqs 1
This command gave us the E-value, accession number, and percent identity for the best blastp match
of each predicted protein. However, we still needed the organism name and a description of each
gene. For this, we used the program efetch.pl. Given a list of accession numbers as input,
efetch.pl returned the organism name and gene annotation for each gene of interest. These data were
recorded in Table 5. This task was also instrumental in determining the genus, species, and strain
most closely related to our scaffolds.
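The tabular report written by -outfmt 6 can be reduced to one best hit per query with a short script. The following Python sketch is an illustration rather than the code we used. It assumes the default twelve tab-separated columns of BLAST+ format 6, and the function name best_hits and the toy rows are hypothetical.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(handle, max_evalue=1e-10):
    """Return {query: hit} keeping the highest-bitscore hit at or below max_evalue."""
    best = {}
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        hit = dict(zip(FIELDS, row))
        for key in ("pident", "evalue", "bitscore"):
            hit[key] = float(hit[key])
        if hit["evalue"] > max_evalue:
            continue
        current = best.get(hit["qseqid"])
        if current is None or hit["bitscore"] > current["bitscore"]:
            best[hit["qseqid"]] = hit
    return best


if __name__ == "__main__":
    # Toy rows with placeholder identifiers, not taken from TeamProject.br.
    toy = ("query_1\tsubject_A\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-50\t200\n"
           "query_1\tsubject_B\t100.0\t100\t0\t0\t1\t100\t1\t100\t1e-60\t250\n")
    for query, hit in best_hits(io.StringIO(toy)).items():
        print(query, hit["sseqid"], hit["pident"], hit["evalue"])
```

On a real report the same function could be called with an open file handle, for example `best_hits(open("TeamProject.br"))`.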
The best blastp match for every predicted protein was from the genus Bacteroides, and most of the
matches that resolved to species level were to Bacteroides vulgatus. More specifically, the strain
Bacteroides vulgatus str. 3975 RP4 was the best match for 9 of the 104 predicted proteins, which is
60% of the 15 blast results specific enough to indicate a strain. These data led us to conclude
that Bacteroides vulgatus str. 3975 is the most closely related strain.
The objective of Optional Tasks 4 and 5 was to analyze the CDSs for possible proteins and genes. The
CDSs predicted from the scaffolds were analyzed using PFAM to determine possible protein matches
and TIGRFAM to determine possible gene matches. The hmmscan for the PFAM matches used the Pfam-A
database and TeamProject.faa. The hmmscan for the TIGRFAM matches used the TIGRFAMs_14.0.HMM
database and TeamProject.faa. The results were compiled into Table 6 and Table 7 from
TeamProject_pfam.txt and TeamProject_tigrfam.txt. Only the best match for each CDS was added to
Table 3. The PFAM hmmscan revealed that many of the CDSs had at least one related protein family.
When a CDS had multiple matches, the matched families were all closely related. For example, all
the predicted proteins for the 1_83 CDS are from the Glycosyl transferase family 2. The TIGRFAM
search returned fewer matches: only 33, compared with 191 for the PFAM search. Most CDSs with
TIGRFAM matches have only one match. Only CDSs 1_5, 1_15, 1_39, 1_82, and 1_85 have multiple
matches in Table 7, and each of these has just two, whereas several CDSs had four or five PFAM
matches. For CDSs with both TIGRFAM and PFAM matches, the two searches predicted similar functions.
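Selecting the best match for each CDS, as we did for Table 3, amounts to keeping the lowest E-value per query. The Python sketch below shows one way to do this. It assumes the hmmscan results were saved in HMMER's whitespace-delimited per-target table (the --tblout format), which may differ from the layout of TeamProject_pfam.txt. The function name and toy lines are hypothetical.

```python
def best_family_per_cds(lines):
    """Return {query: (family, evalue, description)} with the lowest full-sequence E-value."""
    best = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the commented header/footer
        # 18 whitespace-delimited fields, then a free-text description.
        parts = line.split(None, 18)
        family, query, evalue = parts[0], parts[2], float(parts[4])
        description = parts[18].strip() if len(parts) > 18 else ""
        if query not in best or evalue < best[query][1]:
            best[query] = (family, evalue, description)
    return best


if __name__ == "__main__":
    # Toy rows with placeholder names, not taken from our hmmscan output.
    toy = [
        "# target name accession query name accession E-value ...",
        "famA PF00001.1 cds_1 - 1e-20 70.0 0.1 2e-20 69.0 0.1 1.0 1 0 0 1 1 1 1 Example family A",
        "famB PF00002.1 cds_1 - 1e-35 120.0 0.0 2e-35 119.0 0.0 1.0 1 0 0 1 1 1 1 Example family B",
    ]
    for cds, (family, evalue, description) in best_family_per_cds(toy).items():
        print(cds, family, evalue, description)
```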
Optional Task 6 used PHYRE2 to analyze CDSs 1.1_1, 1.1_4, 1.1_14, 1.1_19, 1.1_32, 1.1_54, 1.1_57,
1.1_60, 1.1_68, and 2.1_8. All CDSs except 1.1_1 and 1.1_32 had a confidence of 100.0; those two
had confidence values of 61.1 and 49.4, respectively. The PHYRE2 predictions agree with the PFAM
predictions for all CDSs except 1.1_1, 1.1_32, 1.1_57, and 1.1_60. For these four, the other
candidate PHYRE2 matches also differed from the PFAM results. This may be because the structures of
the PFAM matches are not in the PHYRE2 database.
For Optional Task 7, we looked for more specific features such as signal peptides. We compared our
assembled scaffolds (team.fasta) to a reference database of gram-negative prokaryotes, which
allowed us to identify potential signal peptides and determine their lengths. We chose
gram-negative prokaryotes because our previous blast analysis matched our genes and proteins to
those found in the gram-negative genus Bacteroides. The output data (in the file
TeamProj_SigP_Summary.txt) denote the presence or absence of a signal peptide and the cutoff point
of each peptide (C-value). This allowed us to determine the predicted lengths of the peptides. The
results can be found in Table 3.
The objective of Optional Task 8 was to identify rho-independent transcriptional terminators. This
is a useful analysis because an intrinsic terminator downstream of a predicted gene is evidence
that the gene is transcribed. To accomplish this task, we ran our genome assembly (team.fasta)
through a rho-independent terminator search while supplying the search with predicted gene
coordinates. These coordinates were obtained from our EMBOSS infoseq analysis of the predicted
proteins on our assembly and restructured into the TeamProj.coords file for use with the terminator
analysis program. The report generated can be found in the file TeamProj_tt + TeamProj_tt.txt, and
the predicted genes with identifiable rho-independent terminators are listed in Table 3.
Optional Task 9 asked whether we could find any conserved RNA secondary structures in our assembled
genome. As with protein-coding genes, the structure of an RNA such as a tRNA can provide valuable information on
its function and origin, which is useful when characterizing an unknown genome. With our assembled genome in hand
(team.fasta), we searched for matches to conserved RNA structures using a handful of RFAM models: RF00005,
RF00010, RF00023, RF00029, RF00059, RF00174, RF00177, RF01693, RF01694, RF01726, RF01998, and RF02001.
The data can be found in TeamProj_RF*.txt. Our search found only one tRNA match, and we include
information on the matched gene in Table 3.
For Optional Task 14, we aligned our scaffolds with the genome of the bacterial strain with the
most sequence matches, which we determined to be Bacteroides vulgatus str. 3975 RP4. On NCBI, we found 184
contigs from a whole-genome sequencing project for this strain. We concatenated these contigs to create a single
reference sequence, which we compared with our scaffolds using blastn. With that blast report as the comparison
file, we aligned the genomes using “act” and saved a screenshot of part of the alignment as Figure 3.
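Concatenating a multi-FASTA set of contigs into one reference sequence can be done in a few lines. The Python sketch below is one possible approach, not a record of how we did it. The function names are hypothetical, and the optional spacer argument is an added assumption (a run of N characters between contigs is sometimes used to mark the joins).

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def concatenate_contigs(lines, name="concatenated_reference", spacer=""):
    """Join all contig sequences, in file order, into one named sequence."""
    sequences = [seq for _, seq in read_fasta(lines)]
    return name, spacer.join(sequences)


def format_fasta(name, seq, width=60):
    """Return a FASTA record with the sequence wrapped at `width` characters."""
    body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
    return f">{name}\n{body}\n"


if __name__ == "__main__":
    toy = [">contig_1", "ACGT", "ACGT", ">contig_2", "TTGA"]  # toy data only
    print(format_fasta(*concatenate_contigs(toy)), end="")
```

Contigs are joined in file order, so the order and orientation of the contigs in the downloaded file affect how the alignment looks in the viewer.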
Discussion:
As we alluded to in discussing the results of Optional Task 3, we used blastp to determine the best
match of each predicted protein in the NR database. These data, located in Table 5, clearly
indicate that the genus of the closest relative is Bacteroides: according to our blastp results,
the best match of every predicted protein belongs to the genus Bacteroides. We can further assert
that the species is Bacteroides vulgatus. Of the 104 predicted proteins, 43 list Bacteroides
vulgatus as their best match, and of the blast matches that were specific to species, 43 of 49
(87.76%) list Bacteroides vulgatus. We can go further still: the strain Bacteroides vulgatus
str. 3975 RP4 was the best match for 9 of the 104 predicted proteins, or 9 of the 15 blast results
specific enough to indicate a strain. These data led us to conclude that Bacteroides vulgatus
str. 3975 is the most closely related strain.
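The proportions quoted in this discussion can be checked with a few lines of arithmetic. The Python sketch below uses only the counts reported above; the helper name percent and the dictionary labels are our own.

```python
def percent(count: int, total: int) -> float:
    """Return count/total as a percentage rounded to two decimals."""
    return round(100 * count / total, 2)


# (count, total) pairs taken from the Discussion text.
reported = {
    "B. vulgatus among all best hits": (43, 104),
    "B. vulgatus among species-level hits": (43, 49),
    "str. 3975 RP4 among all best hits": (9, 104),
    "str. 3975 RP4 among strain-level hits": (9, 15),
}

if __name__ == "__main__":
    for label, (count, total) in reported.items():
        print(f"{label}: {count}/{total} = {percent(count, total)}%")
```

The species call rests on the 43 of 49 species-level matches (87.76%), and the strain call on the 9 of 15 strain-level matches (60%).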
Appendix
Contains 7 tables containing the raw data used to create our Results and
Discussion sections along with 1 figure showing our genome alignment
Table 1. Genome assembly statistics for Team 10
No. of read pairs 47893
No. of low-quality reads 1763
No. of assembled reads 102640
No. of unassembled reads 2382
No. of contigs 2
No. of scaffolds 2
Total nt length of scaffolds 126196
Contig Length %G+C No. of reads mapped Coverage
Contig 100.0 119,977 40.61% 4851245 6065.0
Contig 100.1 6,219 37.58% 240956 5811.0
Table 2 Gene annotation summary for scaffolds
CDS/ORFs tRNAs other RNAs
scaffold1.1 95 0 0
scaffold2.1 9 1 0
Table 3. Predicted Gene Coordinates
Scaffold Name Type Start Stop Strand NT Length AA Length GC % Signal Peptide? SP Length (AA) Best Blast Hit Blast description
scaffold 1.1 1_1 CDS 3 611 - 609 202 0.406 N gi|496057719|ref|WP_008782226.1| transposase, partial
scaffold 1.1 1_2 CDS 845 3022 - 2178 725 0.405 Y 21 gi|649547948|gb|KDS54658.1| hypothetical protein M099_1756
scaffold 1.1 1_3 CDS 3539 3766 - 228 75 0.403 N gi|649547946|gb|KDS54656.1| glycoside hydrolase family 88 domain protein
scaffold 1.1 1_4 CDS 3949 4905 - 957 318 0.383 N gi|492435030|ref|WP_005843062.1| MULTISPECIES: transcriptional regulator
scaffold 1.1 1_5 CDS 5062 6291 + 1230 409 0.408 N gi|492435027|ref|WP_005843060.1| TonB-dependent receptor
scaffold 1.1 1_6 CDS 6311 7198 + 888 295 0.429 Y 18 gi|492435023|ref|WP_005843058.1| hypothetical protein
scaffold 1.1 1_7 CDS 7536 8942 + 1407 468 0.396 Y 21 gi|649547942|gb|KDS54652.1| ahpC/TSA family protein
scaffold 1.1 1_8 CDS 9027 9767 - 741 246 0.396 N gi|649547941|gb|KDS54651.1| ahpC/TSA family protein
scaffold 1.1 1_9 CDS 10111 12657 + 2547 848 0.421 N gi|495945682|ref|WP_008670261.1| MULTISPECIES: hypothetical protein
scaffold 1.1 1_10 CDS 12750 15755 - 3006 1001 0.36 N gi|495945680|ref|WP_008670259.1| MULTISPECIES: hypothetical protein
scaffold 1.1 1_11 CDS 15884 16252 + 369 122 0.477 Y 19 gi|492458337|ref|WP_005851052.1| alpha-L-fucosidase
scaffold 1.1 1_12 CDS 16394 17275 - 882 293 0.468 N gi|492434987|ref|WP_005843035.1| tRNA dimethylallyltransferase 1
scaffold 1.1 1_13 CDS 17363 18388 - 1026 341 0.429 N gi|492434984|ref|WP_005843033.1| MULTISPECIES: hypothetical protein
scaffold 1.1 1_14 CDS 18424 19740 - 1317 438 0.432 N gi|492434981|ref|WP_005843031.1| MULTISPECIES: UDP-N-acetylglucosamine acyltransferase
scaffold 1.1 1_15 CDS 19846 21519 + 1674 557 0.476 N gi|492458346|ref|WP_005851058.1| MULTISPECIES: hydroxymyristoyl-ACP dehydratase
scaffold 1.1 1_16 CDS 21680 21880 + 201 66 0.454 N gi|492458349|ref|WP_005851060.1| MULTISPECIES: UDP-3-O-acylglucosamine N-acyltransferase
scaffold 1.1 1_17 CDS 22035 22727 + 693 230 0.43 N gi|500644323|ref|WP_011964621.1| phosphohydrolase
scaffold 1.1 1_18 CDS 22796 23239 - 444 147 0.453 N gi|492434969|ref|WP_005843024.1| MULTISPECIES: orotidine 5'-phosphate decarboxylase
scaffold 1.1 1_19 CDS 23255 23524 - 270 89 0.47 N gi|492434967|ref|WP_005843023.1| MULTISPECIES: peptide chain release factor 1
scaffold 1.1 1_20 CDS 23527 23871 - 345 114 0.471 N gi|492458355|ref|WP_005851064.1| MULTISPECIES: phosphoribosylformylglycinamidine cyclo-ligase
scaffold 1.1 1_21 CDS 24081 24527 + 447 148 0.31 N gi|492434963|ref|WP_005843021.1| hypothetical protein
scaffold 1.1 1_22 CDS 24636 24818 + 183 60 0.409 N gi|492434961|ref|WP_005843020.1| MULTISPECIES: toxin Fic
Table 5. Single best blast hit of annotated ORFs from Team 10
Name Gene Identifier Description Organism % identity E-value
1_1 gi|496057719|ref|WP_008782226.1| transposase, partial Bacteroides sp. 3_1_40A 100 8.00E-88
1_2 gi|649547948|gb|KDS54658.1| hypothetical protein M099_1756 Bacteroides vulgatus str. 3975 RP4 100 4.00E-62
1_3 gi|649547946|gb|KDS54656.1| glycoside hydrolase family 88 domain protein Bacteroides vulgatus str. 3975 RP4 100 6.00E-62
1_4 gi|492435030|ref|WP_005843062.1| MULTISPECIES: transcriptional regulator Bacteroides 100 5.00E-82
1_5 gi|492435027|ref|WP_005843060.1| TonB-dependent receptor Bacteroides vulgatus 100 0
1_6 gi|492435023|ref|WP_005843058.1| hypothetical protein Bacteroides vulgatus 100 0
1_7 gi|649547942|gb|KDS54652.1| ahpC/TSA family protein Bacteroides vulgatus str. 3975 RP4 100 0
1_8 gi|649547941|gb|KDS54651.1| ahpC/TSA family protein Bacteroides vulgatus str. 3975 RP4 100 0
1_9 gi|495945682|ref|WP_008670261.1| MULTISPECIES: hypothetical protein Bacteroides 99.61 0
1_10 gi|495945680|ref|WP_008670259.1| MULTISPECIES: hypothetical protein Bacteroides 97.22 2.00E-16
1_11 gi|492458337|ref|WP_005851052.1| alpha-L-fucosidase Bacteroides vulgatus 100 0
1_12 gi|492434987|ref|WP_005843035.1| tRNA dimethylallyltransferase 1 Bacteroides vulgatus 100 0
1_13 gi|492434984|ref|WP_005843033.1| MULTISPECIES: hypothetical protein Bacteroides 100 9.00E-131
1_14 gi|492434981|ref|WP_005843031.1| MULTISPECIES: UDP-N-acetylglucosamine acyltransferase Bacteroides 100 3.00E-180
1_15 gi|492458346|ref|WP_005851058.1| MULTISPECIES: hydroxymyristoyl-ACP dehydratase Bacteroides 100 0
1_16 gi|492458349|ref|WP_005851060.1| MULTISPECIES: UDP-3-O-acylglucosamine N-acyltransferase Bacteroides 100 0
1_17 gi|500644323|ref|WP_011964621.1| phosphohydrolase Bacteroides vulgatus 100 0
1_18 gi|492434969|ref|WP_005843024.1| MULTISPECIES: orotidine 5'-phosphate decarboxylase Bacteroides 100 0
1_19 gi|492434967|ref|WP_005843023.1| MULTISPECIES: peptide chain release factor 1 Bacteroides 100 0
1_20 gi|492458355|ref|WP_005851064.1| MULTISPECIES: phosphoribosylformylglycinamidine cyclo-ligase Bacteroides 100 0
1_21 gi|492434963|ref|WP_005843021.1| hypothetical protein Bacteroides vulgatus 100 6.00E-138
1_22 gi|492434961|ref|WP_005843020.1| MULTISPECIES: toxin Fic Bacteroides 100 0
1_23 gi|492458359|ref|WP_005851066.1| MULTISPECIES: hypothetical protein Bacteroides 100 6.00E-43
1_24 gi|492434958|ref|WP_005843019.1| hypothetical protein Bacteroides vulgatus 99.64 0
1_25 gi|492458364|ref|WP_005851068.1| MULTISPECIES: hypothetical protein Bacteroides 100 0
1_26 gi|492458366|ref|WP_005851069.1| MULTISPECIES: membrane protein Bacteroides 100 2.00E-43
1_27 gi|492458368|ref|WP_005851070.1| MULTISPECIES: hypothetical protein Bacteroides 100 9.00E-114
1_28 gi|492458370|ref|WP_005851071.1| MULTISPECIES: beta-N-acetylhexosaminidase Bacteroides 100 0
1_29 gi|492434942|ref|WP_005843009.1| MULTISPECIES: endonuclease Bacteroides 99.71 0
1_30 gi|511016443|ref|WP_016270813.1| excinuclease ABC subunit A Bacteroides vulgatus 100 0
1_31 gi|492434935|ref|WP_005843004.1| MULTISPECIES: hypothetical protein Bacteroides 100 0
1_32 gi|492434933|ref|WP_005843003.1| MULTISPECIES: chromate transporter Bacteroides 100 1.00E-131
1_33 gi|492434930|ref|WP_005843001.1| MULTISPECIES: chromate transporter Bacteroides 100 1.00E-105
1_34 gi|511016442|ref|WP_016270812.1| hypothetical protein Bacteroides vulgatus 100 0
1_35 gi|511016441|ref|WP_016270811.1| phosphoribosylformylglycinamidine synthase Bacteroides vulgatus 100 0
1_36 gi|492434921|ref|WP_005842995.1| MULTISPECIES: translocator protein, LysE family Bacteroides 100 4.00E-150
1_37 gi|492434917|ref|WP_005842993.1| MULTISPECIES: hypothetical protein Bacteroides 100 5.00E-127
1_38 gi|492458387|ref|WP_005851079.1| MULTISPECIES: dTDP-4-dehydrorhamnose reductase Bacteroides 100 0
1_39 gi|492434911|ref|WP_005842989.1| MULTISPECIES: peptide chain release factor 3 Bacteroides 100 0
1_40 gi|492434907|ref|WP_005842987.1| MULTISPECIES: molecular chaperone DnaJ Bacteroides 100 0
1_41 gi|492434904|ref|WP_005842985.1| dihydrofolate reductase Bacteroides vulgatus 100 0
1_42 gi|548318542|ref|WP_022508241.1| hypothetical protein Bacteroides vulgatus CAG:6 100 1.00E-174
1_43 gi|492434896|ref|WP_005842980.1| hypothetical protein Bacteroides vulgatus 100 0
1_44 gi|492458409|ref|WP_005851092.1| transcriptional regulator Bacteroides vulgatus 99.7 0
1_45 gi|492434890|ref|WP_005842976.1| MULTISPECIES: hypothetical protein Bacteroides 100 1.00E-44
1_46 gi|492434887|ref|WP_005842974.1| hypothetical protein Bacteroides vulgatus 100 0
1_47 gi|500644291|ref|WP_011964611.1| hypothetical protein Bacteroides vulgatus 100 0
Table 6. PFAM domain matches for annotated genes from Team 10
Name PFAM ID Description E value
scaffold1.1_1 PF01610.12 Transposase 2.90E-25
scaffold1.1_2 PF11396.3 Protein of unknown function (DUF2874) 7.80E-15
scaffold1.1_4 PF03965.11 Penicillinase repressor 2.40E-25
scaffold1.1_5 PF03544.9 Gram-negative bacterial TonB protein C-termi 2.50E-23
scaffold1.1_5 PF13715.1 Domain of unknown function (DUF4480) 1.50E-16
scaffold1.1_5 PF05569.6 BlaR1 peptidase M56 1.00E-11
scaffold1.1_5 PF13620.1 Carboxypeptidase regulatory-like domain 2.90E-10
scaffold1.1_5 PF07715.10 TonB-dependent Receptor Plug Domain 2.10E-06
scaffold1.1_6 PF14559.1 Tetratricopeptide repeat 6.20E-13
scaffold1.1_6 PF13414.1 TPR repeat 6.70E-12
scaffold1.1_6 PF07719.12 Tetratricopeptide repeat 2.90E-11
scaffold1.1_6 PF13428.1 Tetratricopeptide repeat 2.00E-10
scaffold1.1_6 PF13432.1 Tetratricopeptide repeat 9.60E-10
scaffold1.1_6 PF13429.1 Tetratricopeptide repeat 5.30E-08
scaffold1.1_6 PF12895.2 Anaphase-promoting complex, cyclosome, subun 1.30E-07
scaffold1.1_6 PF13431.1 Tetratricopeptide repeat 6.80E-06
scaffold1.1_7 PF00578.16 AhpC/TSA family 1.30E-11
scaffold1.1_7 PF00255.14 Glutathione peroxidase 4.20E-08
scaffold1.1_7 PF14289.1 Domain of unknown function (DUF4369) 1.70E-06
scaffold1.1_8 PF13905.1 Thioredoxin-like 1.40E-14
scaffold1.1_8 PF13098.1 Thioredoxin-like domain 1.90E-14
scaffold1.1_8 PF00085.15 Thioredoxin 2.70E-11
scaffold1.1_8 PF08534.5 Redoxin 4.30E-11
scaffold1.1_8 PF00578.16 AhpC/TSA family 1.00E-07
scaffold1.1_11 PF01120.12 Alpha-L-fucosidase 2.60E-87
scaffold1.1_12 PF01715.12 IPP transferase 7.70E-64
scaffold1.1_12 PF01745.11 Isopentenyl transferase 3.00E-12
scaffold1.1_12 PF04851.10 Type III restriction enzyme, res subunit 0.00022
scaffold1.1_13 PF07929.6 Plasmid pRiA4b ORF-3-like protein 4.00E-11
scaffold1.1_14 PF13720.1 Udp N-acetylglucosamine O-acyltransferase; D 1.20E-28
scaffold1.1_14 PF00132.19 Bacterial transferase hexapeptide (six repea 1.10E-25
scaffold1.1_15 PF03331.8 UDP-3-O-acyl N-acetylglycosamine deacetylase 6.00E-74
scaffold1.1_15 PF07977.8 FabA-like domain 1.10E-35
scaffold1.1_16 PF00132.19 Bacterial transferase hexapeptide (six repea 1.10E-29
scaffold1.1_16 PF04613.9 UDP-3-O-[3-hydroxymyristoyl] glucosamine N-a 7.00E-17
scaffold1.1_16 PF14602.1 Hexapeptide repeat of succinyl-transferase 1.20E-10
scaffold1.1_17 PF01966.17 HD domain 2.90E-08
scaffold1.1_18 PF00215.19 Orotidine 5'-phosphate decarboxylase / HUMPS 9.20E-30
scaffold1.1_19 PF03462.13 PCRF domain 3.40E-39
scaffold1.1_19 PF00472.15 RF-1 domain 2.60E-33
scaffold1.1_20 PF02769.17 AIR synthase related protein, C-terminal dom 1.70E-12
scaffold1.1_22 PF13310.1 Virulence protein RhuM family 5.70E-110
scaffold1.1_24 PF02638.10 Glycosyl hydrolase like GH101 1.80E-53
scaffold1.1_24 PF13200.1 Putative glycosyl hydrolase domain 3.40E-07
scaffold1.1_25 PF02554.9 Carbon starvation protein CstA 8.90E-79
scaffold1.1_25 PF13722.1 C-terminal domain on CstA (DUF4161) 2.30E-24
Table 7. TIGRFAM domain matches for annotated genes from Team 10
Name TIGRFAM ID Description E value
scaffold1.1_5 TIGR04057 SusC_RagA_signa: TonB-dependent outer membrane receptor, SusC/RagA subfamily, signature region 2.70E-16
scaffold1.1_5 TIGR01352 tonB_Cterm: TonB family C-terminal domain 2.70E-12
scaffold1.1_12 TIGR00174 miaA: tRNA dimethylallyltransferase 5.90E-75
scaffold1.1_14 TIGR01852 lipid_A_lpxA: acyl-[acyl-carrier-protein]-UDP-N-acetylglucosamine O-acyltransferase 1.70E-92
scaffold1.1_15 TIGR00325 lpxC: UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase 2.50E-56
scaffold1.1_15 TIGR01750 fabZ: beta-hydroxyacyl-(acyl-carrier-protein) dehydratase FabZ 3.90E-49
scaffold1.1_16 TIGR01853 lipid_A_lpxD: UDP-3-O-[3-hydroxymyristoyl] glucosamine N-acyltransferase LpxD 3.60E-105
scaffold1.1_18 TIGR02127 pyrF_sub2: orotidine 5'-phosphate decarboxylase 3.60E-72
scaffold1.1_19 TIGR00019 prfA: peptide chain release factor 1 1.10E-137
scaffold1.1_30 TIGR00630 uvra: excinuclease ABC subunit A 0
scaffold1.1_38 TIGR01214 rmlD: dTDP-4-dehydrorhamnose reductase 1.90E-89
scaffold1.1_39 TIGR00503 prfC: peptide chain release factor 3 6.10E-207
scaffold1.1_39 TIGR00231 small_GTP: small GTP-binding protein domain 2.20E-25
scaffold1.1_49 TIGR02227 sigpep_I_bact: signal peptidase I 1.30E-19
scaffold1.1_52 TIGR01730 RND_mfp: efflux transporter, RND family, MFP subunit 8.80E-48
scaffold1.1_56 TIGR00221 nagA: N-acetylglucosamine-6-phosphate deacetylase 1.30E-81
scaffold1.1_57 TIGR00057 TIGR00057: tRNA threonylcarbamoyl adenosine modification protein, Sua5/YciO/YrdC/YwlC family 1.20E-44
scaffold1.1_59 TIGR00460 fmt: methionyl-tRNA formyltransferase 8.00E-81
scaffold1.1_61 TIGR02937 sigma70-ECF: RNA polymerase sigma factor, sigma-70 family 4.40E-29
scaffold1.1_63 TIGR01163 rpe: ribulose-phosphate 3-epimerase 1.00E-83
scaffold1.1_64 TIGR00360 ComEC_N-term: ComEC/Rec2-related protein 8.50E-27
scaffold1.1_67 TIGR03990 Arch_GlmM: phosphoglucosamine mutase 1.80E-160
scaffold1.1_69 TIGR00539 hemN_rel: putative oxygen-independent coproporphyrinogen III oxidase 4.50E-87
scaffold1.1_71 TIGR00231 small_GTP: small GTP-binding protein domain 1.10E-18
scaffold1.1_76 TIGR00166 S6: ribosomal protein S6 2.00E-25
scaffold1.1_77 TIGR00165 S18: ribosomal protein S18 1.90E-33
scaffold1.1_78 TIGR00158 L9: ribosomal protein L9 1.00E-35
scaffold1.1_82 TIGR01579 MiaB-like-C: MiaB-like tRNA modifying enzyme 3.00E-122
scaffold1.1_82 TIGR00089 TIGR00089: radical SAM methylthiotransferase, MiaB/RimO family 1.10E-113
scaffold1.1_85 TIGR00525 folB: dihydroneopterin aldolase 5.10E-30
Table 8. Phyre2 predicted best crystal structure matches for annotated genes from Team 10
Name PDB best match Pct_identity Confidence Aligned region Description
1.1_1 c3f9kV 22 61.1 89-115 two domain fragment of hiv-2 integrase in complex with ledgf ibd
1.1_4 d1sd4a 19 100 3-120 Penicillinase repressor
1.1_14 c3i3aC 39 100 2-255 transferase, structural basis for the sugar nucleotide and acyl chain selectivity of leptospira interrogans lpxa
1.1_19 c3d5cX 43 100 8-369 peptide chain release factor 1, structural basis for translation termination on the 70s ribosome
1.1_32 c3dboA 29 49.4 36-67 toxin/antitoxin, crystal structure of a member of the vapbc family of toxin-antitoxin systems, vapbc-5, from mycobacterium tuberculosis
1.1_54 c4mt4C 12 100 27-478 transport protein, crystal structure of the campylobacter jejuni cmec outer membrane channel
1.1_57 c2eqaA 23 100 6-191 rna binding protein, crystal structure of the hypothetical sua5 protein from sulfolobus tokodaii
1.1_60 c3k6oA 24 100 29-237 structural genomics, unknown function, crystal structure of protein of unknown function duf1344 (yp_001299214.1) from bacteroides vulgatus atcc 8482
1.1_68 c1upsB 16 100 21-262 glycosyl hydrolase, glcnac[alpha]1-4gal releasing endo-[beta]-galactosidase from clostridium perfringens
Figure 3 is a screenshot of the whole-genome alignment
of our scaffolds against the genome of Bacteroides
vulgatus str. 3975 RP4, which we determined to be the
strain with the most blastp matches against our contigs.

Weitere ähnliche Inhalte

Was ist angesagt?

Scientists devise new way to dramatically raise rna treatment potency
Scientists devise new way to dramatically raise rna treatment potencyScientists devise new way to dramatically raise rna treatment potency
Scientists devise new way to dramatically raise rna treatment potencyDisney Scripps Florida
 
2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekingeProf. Wim Van Criekinge
 
Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...
Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...
Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...UniversitasGadjahMada
 
Kyle Jensen MIT Ph.D. Thesis Defense
Kyle Jensen MIT Ph.D. Thesis DefenseKyle Jensen MIT Ph.D. Thesis Defense
Kyle Jensen MIT Ph.D. Thesis DefenseKyle Jensen
 
E research feb2016 sifting the needles in the haystack
E research feb2016 sifting the needles in the haystackE research feb2016 sifting the needles in the haystack
E research feb2016 sifting the needles in the haystackTom Kelly
 
Cpf1- a new tool for CRISPR genome editing
Cpf1- a new tool for CRISPR genome editingCpf1- a new tool for CRISPR genome editing
Cpf1- a new tool for CRISPR genome editingSachin Bhor
 
2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekinge2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekingeProf. Wim Van Criekinge
 
Genome responses of trypanosome infected cattle
Genome responses of trypanosome infected cattleGenome responses of trypanosome infected cattle
Genome responses of trypanosome infected cattleLaurence Dawkins-Hall
 
Population fitness and genetic load of
Population fitness and genetic load ofPopulation fitness and genetic load of
Population fitness and genetic load ofThanka Elango
 
P53_Final_Presentation
P53_Final_PresentationP53_Final_Presentation
P53_Final_PresentationJonah Kohen
 
In search of tissue specific regulators in periodontium - a bioinformatic ap...
In search of tissue specific regulators in periodontium  - a bioinformatic ap...In search of tissue specific regulators in periodontium  - a bioinformatic ap...
In search of tissue specific regulators in periodontium - a bioinformatic ap...Agnieszka Caruso
 
Tips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI toolsTips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI toolsIntegrated DNA Technologies
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbChris Southan
 
An Investigation Of The Rigor Of Interpretation Rules
An Investigation Of The Rigor Of Interpretation RulesAn Investigation Of The Rigor Of Interpretation Rules
An Investigation Of The Rigor Of Interpretation RulesNick Brown
 
Poster for COS Symposium 2013
Poster for COS Symposium 2013 Poster for COS Symposium 2013
Poster for COS Symposium 2013 Devin Porter
 

Was ist angesagt? (20)

Scientists devise new way to dramatically raise rna treatment potency
Scientists devise new way to dramatically raise rna treatment potencyScientists devise new way to dramatically raise rna treatment potency
Scientists devise new way to dramatically raise rna treatment potency
 
2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge2015 bioinformatics alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge
 
Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...
Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...
Mitochondrial ND-1 gene-specific primer polymerase chain reaction to determin...
 
Kyle Jensen MIT Ph.D. Thesis Defense
Kyle Jensen MIT Ph.D. Thesis DefenseKyle Jensen MIT Ph.D. Thesis Defense
Kyle Jensen MIT Ph.D. Thesis Defense
 
E research feb2016 sifting the needles in the haystack
E research feb2016 sifting the needles in the haystackE research feb2016 sifting the needles in the haystack
E research feb2016 sifting the needles in the haystack
 
Cpf1- a new tool for CRISPR genome editing
Cpf1- a new tool for CRISPR genome editingCpf1- a new tool for CRISPR genome editing
Cpf1- a new tool for CRISPR genome editing
 
2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekinge2016 bioinformatics i_score_matrices_wim_vancriekinge
2016 bioinformatics i_score_matrices_wim_vancriekinge
 
GPCRs_HouseLA
GPCRs_HouseLAGPCRs_HouseLA
GPCRs_HouseLA
 
Genome responses of trypanosome infected cattle
Genome responses of trypanosome infected cattleGenome responses of trypanosome infected cattle
Genome responses of trypanosome infected cattle
 
Poster_Ptndeletions
Poster_PtndeletionsPoster_Ptndeletions
Poster_Ptndeletions
 
Population fitness and genetic load of
Population fitness and genetic load ofPopulation fitness and genetic load of
Population fitness and genetic load of
 
P53_Final_Presentation
P53_Final_PresentationP53_Final_Presentation
P53_Final_Presentation
 
CGI.Paper
CGI.PaperCGI.Paper
CGI.Paper
 
WF_URS 2016 poster-ad
WF_URS 2016 poster-adWF_URS 2016 poster-ad
WF_URS 2016 poster-ad
 
In search of tissue specific regulators in periodontium - a bioinformatic ap...
In search of tissue specific regulators in periodontium  - a bioinformatic ap...In search of tissue specific regulators in periodontium  - a bioinformatic ap...
In search of tissue specific regulators in periodontium - a bioinformatic ap...
 
Tips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI toolsTips for effective use of BLAST and other NCBI tools
Tips for effective use of BLAST and other NCBI tools
 
Peptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdbPeptide Tribulations in GtoPdb
Peptide Tribulations in GtoPdb
 
Church dm grc_workshop
Church dm grc_workshopChurch dm grc_workshop
Church dm grc_workshop
 
An Investigation Of The Rigor Of Interpretation Rules
An Investigation Of The Rigor Of Interpretation RulesAn Investigation Of The Rigor Of Interpretation Rules
An Investigation Of The Rigor Of Interpretation Rules
 
Poster for COS Symposium 2013
Poster for COS Symposium 2013 Poster for COS Symposium 2013
Poster for COS Symposium 2013
 

MCB 432 Final Table PP 01.06.16

  • 1. Keegan McAuliffe MCB 432: Computing in Molecular Biology The following is my final presentation for MCB 432, detailing the process our group undertook to determine the identity of an unknown bacterium. We were provided with raw sequence reads from the bacterium and converted them into contigs and scaffolds. We assembled the data into a complete genome, then annotated potential genes, which allowed us to identify the bacterium as Bacteroides vulgatus str. 3975.
  • 2. Keegan McAuliffe Henry Chen Andrew Storm Dominic Gentile Team 10 Results and Discussion Introduction: The advent of high-throughput sequencing has increased our ability to analyze genetic information. In this project, we demonstrate how to use raw sequence data from sampled organisms for genetic and genomic analysis. With the raw sequence reads provided by the PI, we assembled a genome for our unknown microorganism using the A5ud assembler program (Table 1). From the data generated, we determined the total number of contigs and scaffolds and used these assemblies to predict and annotate genes (Table 2). With the assembled genome in hand, we could search and analyze predicted genes in order to characterize our unknown organism. We used the Prodigal algorithm for gene prediction. Prodigal generates gene and protein predictions, but it does not indicate what those predicted genes and proteins represent. We therefore needed other programs to annotate our predictions, and because genes are so complex, we had to choose those programs carefully. For instance, programs such as EMBOSS search for alignments and patterns between an assembly and databases of well-known genes, HMM and BLAST searches compare protein homology, and many other programs are designed to search for features such as tRNAs and signal peptides. With these tools, we analyzed our genome; below we present how we accomplished these tasks and our results.
  • 3. Results: (Optional tasks) The objective of Optional Task 1 was to determine the GC content of each gene. To obtain this information, we first had to assemble our reads into contigs and scaffolds, which was the objective of Mandatory Task 1. We began by decompressing the read data with the "gunzip" command and then ran the A5ud assembler on it. This generated files for the quality trimming report, assembly report, initial scaffolding report, final scaffold quality check, error-corrected reads, contigs, crude scaffolds, broken scaffolds, and final scaffolds. The assembly report contained the GC content for each contig, which we added to Table 3. The average GC content for all contigs is .407. Because G-C base pairs are more stable than A-T base pairs, we expect our genome to be less thermally stable than a genome with a GC content greater than .500. The objective of Optional Task 3 was to determine the best BlastP match for our proteins against the NR database. The first step was to work out the command that returns a single best match from the NR database for each predicted protein, with an E-value less than 1e-10, along with the organism to which it belongs, the accession number, and the percent identity. The command we used was: blastp -db nr -query TeamProject.faa -out TeamProject.br -evalue 1E-10 -outfmt 6 -max_target_seqs 1 This command gave us the E-value, accession number, and percent identity for the best blastp match of each predicted protein. However, we still needed the organism name and the description of the gene. For this, we used the program efetch.pl. Given a list of accession numbers as input, efetch.pl returned the organism name and gene annotation for each gene of interest. These data were recorded in Table 5. This task was also instrumental in determining the genus, species, and strain most closely related to our scaffolds.
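The per-gene GC calculation described for Optional Task 1 can be sketched as below. This is only an illustration under stated assumptions and not necessarily how the values in Table 3 were produced: the helper names `read_fasta` and `gc_content` and the two toy sequences are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Return the fraction of G and C among the unambiguous A/C/G/T bases."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Ambiguous bases such as N are ignored; an empty sequence gives 0.0.
    return gc / total if total else 0.0


def read_fasta(text: str) -> dict[str, str]:
    """Parse FASTA-formatted text into {record_id: sequence}."""
    records: dict[str, list[str]] = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the record id.
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Toy input (hypothetical sequences, not taken from the assembly).
example = """>gene_a
ATGCGC
ATTA
>gene_b
GGCC
"""

for name, seq in read_fasta(example).items():
    print(name, round(gc_content(seq), 3))
```

Applied to the nucleotide sequences of the predicted genes, the same function would give one GC fraction per CDS, comparable to the GC % column of Table 3.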
  • 4. The best blastp match for each predicted protein belonged to the genus Bacteroides, and the overwhelming majority of species-level matches were to Bacteroides vulgatus. More specifically, the strain Bacteroides vulgatus str. 3975 RP4 was the best match for 9 of the 104 predicted proteins. This represents 60% of the 15 blast results specific enough to indicate a strain. These data led us to conclude that Bacteroides vulgatus str. 3975 is the most closely related strain. The objective of Optional Tasks 4 and 5 was to analyze the CDSs for possible proteins and genes. The scaffold sequences were analyzed using PFAM to determine possible protein matches and TIGRFAM to determine possible gene matches. The hmmscan for the PFAM matches used the Pfam-A database and TeamProject.faa. The hmmscan for the TIGRFAM matches used the TIGRFAMs_14.0.HMM database and TeamProject.faa. The results were compiled into Table 6 and Table 7 from TeamProject_pfam.txt and TeamProject_tigrfam.txt. Only the best match for each CDS was added to Table 3. The PFAM hmmscan revealed that many of the CDSs had at least one related protein. For CDSs with multiple matches, the predicted proteins were all closely related. For example, all the predicted proteins for the 1_83 CDS are from the Glycosyl transferase family 2. The TIGRFAM search returned fewer matches: only 33, compared with 191 for the PFAM search. Most of the CDSs with TIGRFAM matches have only one match. Only CDSs 1_15, 1_39, 1_82, and 1_85 have multiple matches, and each of these had only two, whereas several CDSs had four or five PFAM matches. For CDSs that had both TIGRFAM and PFAM matches, the two searches predicted similar functions.
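Keeping only the best match for each CDS, as described for Table 3, can be sketched as below. This assumes the hmmscan results were written in HMMER's whitespace-delimited per-target table (the `--tblout` layout, where the first, third, and fifth columns hold the target name, query name, and full-sequence E-value). The source does not say which output format was used, and the function name `best_hits` and the sample rows are hypothetical placeholders.

```python
def best_hits(tblout_text: str) -> dict[str, tuple[str, float]]:
    """Map each query to (target_name, e_value) for its lowest E-value hit."""
    best: dict[str, tuple[str, float]] = {}
    for line in tblout_text.splitlines():
        # Skip blank lines and the '#' comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        target, query, evalue = fields[0], fields[2], float(fields[4])
        if query not in best or evalue < best[query][1]:
            best[query] = (target, evalue)
    return best


# Placeholder rows in the assumed column order (made-up names and scores).
sample = """\
# target name  accession  query name  accession  E-value  score  bias
domain_A  ACC_A  cds_1  -  1.3e-11  44.0  0.0
domain_B  ACC_B  cds_1  -  4.2e-08  32.5  0.1
domain_C  ACC_C  cds_2  -  2.6e-87  290.1  0.0
"""

for query, (target, evalue) in best_hits(sample).items():
    print(query, target, evalue)
```

Choosing the lowest full-sequence E-value is one reasonable definition of "best"; ranking by bit score would be an equally defensible alternative.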
  • 5. Optional Task 6 used PHYRE2 to analyze CDSs 1.1_1, 1.1_4, 1.1_14, 1.1_19, 1.1_32, 1.1_54, 1.1_57, 1.1_60, 1.1_68, and 2.1_8. All CDSs except 1.1_1 and 1.1_32 had a confidence of 100.0; those two had values of 61.1 and 49.4, respectively. The PHYRE2 predicted proteins agree with the PFAM predictions for all except 1.1_1, 1.1_32, 1.1_57, and 1.1_60. For these four, the other possible PHYRE2 matches also differed from the PFAM results. This may be because the structures of the PFAM matches are not in the PHYRE2 database. For Optional Task 7 we looked for more specific features such as signal peptides. By comparing our assembled scaffold (team.fasta) to a reference database of gram-negative prokaryotes, we were able to identify potential signal peptides and determine their lengths. We chose gram-negative prokaryotes because our previous blast analysis showed that our genes and proteins matched those found in the gram-negative genus Bacteroides. The output data (located in the file TeamProj_SigP_Summary.txt) denoted the presence or absence of signal peptides and the cutoff points of those peptides (C-value). This allowed us to determine the predicted lengths of the peptides. The results can be found in Table 3. The objective of Optional Task 8 was to analyze the presence of rho-independent transcriptional terminators. This is a particularly useful application, as intrinsic terminators typically denote genes that are actively transcribed. To accomplish this task, we ran our genome assembly (team.fasta) through a rho-independent terminator search while supplying the search with predicted gene coordinates. These coordinates were obtained from our EMBOSS infoseq analysis of the predicted proteins on our assembly and restructured into the TeamProj.coords file for use with our terminator analysis program. The report generated can be found in the file TeamProj_tt + TeamProj_tt.txt, and the predicted genes with identifiable rho-independent terminators are listed in Table 3.
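Restructuring predicted gene coordinates into a coords file can be sketched as below. The source names neither the terminator program nor its input layout, so the tab-separated columns used here (gene name, start, end, sequence id, with start greater than end for minus-strand genes) are an assumption, as are the `Gene` type and the `coords_line` helper. The two example genes use the coordinates listed for 1_1 and 1_5 in Table 3.

```python
from typing import NamedTuple


class Gene(NamedTuple):
    name: str
    start: int
    stop: int
    strand: str  # "+" or "-"
    scaffold: str


def coords_line(gene: Gene) -> str:
    """Format one gene as 'name<TAB>start<TAB>end<TAB>sequence_id'.

    Assumed convention: minus-strand genes are written with the larger
    coordinate first, so that orientation is encoded by the coordinate order.
    """
    low, high = sorted((gene.start, gene.stop))
    first, second = (low, high) if gene.strand == "+" else (high, low)
    return f"{gene.name}\t{first}\t{second}\t{gene.scaffold}"


genes = [
    Gene("1_1", 3, 611, "-", "scaffold1.1"),
    Gene("1_5", 5062, 6291, "+", "scaffold1.1"),
]

for gene in genes:
    print(coords_line(gene))
```

If the terminator program expects a different column order or a separate strand column, only `coords_line` would need to change.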
  • 6. Optional Task 9 asked whether we could find any conserved RNA secondary structures in our assembled genome. RNA structure, such as that of a tRNA, can provide valuable information on the function and origin of a gene, which is especially useful when characterizing an unknown genome. With our assembled genome in hand (team.fasta), we searched for matches to conserved RNA structures using a handful of RFAM entries: RF00005, RF00010, RF00023, RF00029, RF00059, RF00174, RF00177, RF01693, RF01694, RF01726, RF01998, and RF02001. The data can be found as TeamProj_RF*.txt. Our search found only one tRNA match, and we include information on the matched gene in Table 3. For Optional Task 14, we constructed an alignment of our scaffolds with the genome of the bacterial strain with the most sequence matches, which we determined to be Bacteroides vulgatus str. 3975 RP4. On NCBI, we found 184 contigs from a whole-genome sequencing project for this strain. We concatenated these contigs to create a single genome sequence, to which we compared our scaffolds using blastn. With that blast report as a reference, we aligned the genomes using "act" and saved a screenshot of part of the alignment as Figure 3.
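Concatenating the reference contigs into one sequence before the blastn comparison can be sketched as below. This is an illustration only: the function name `concatenate_fasta`, the record id, and the toy contigs are hypothetical, and contigs are simply joined in file order with no spacer.

```python
def concatenate_fasta(text: str, new_id: str = "concatenated") -> str:
    """Join all sequences in a multi-FASTA string into a single FASTA record."""
    parts = []
    for line in text.splitlines():
        line = line.strip()
        # Drop header lines and blanks; keep only sequence lines.
        if line and not line.startswith(">"):
            parts.append(line)
    sequence = "".join(parts)
    # Re-wrap at 60 characters per line, a common FASTA convention.
    wrapped = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return f">{new_id}\n" + "\n".join(wrapped) + "\n"


# Toy contigs (hypothetical sequences).
contigs = """>contig_1
ACGTAC
>contig_2
GGTTAA
"""

print(concatenate_fasta(contigs, "reference_joined"), end="")
```

Because draft contigs have no guaranteed order or orientation, positions in the joined sequence are only meaningful within each original contig, which matters when reading the alignment.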
  • 7. Discussion: As we noted when discussing the results of Optional Task 3, we used blastp to determine the best match of each predicted protein within the NR database. These data, located in Table 5, clearly indicate that the genus of the closest relative is Bacteroides: according to our blastp results, the best match of every predicted protein belongs to the genus Bacteroides. We can further assert that the species is Bacteroides vulgatus. Of the 104 predicted proteins, 43 list Bacteroides vulgatus as their best match, and of the 49 blast matches that were specific to species, 43 (87.76%) list Bacteroides vulgatus. We can narrow the identity of the closest relative further: among the 104 predicted proteins we searched with, the strain Bacteroides vulgatus str. 3975 RP4 was the best match 9 times. Thus, 9 of the 15 blast results specific enough to indicate a strain list Bacteroides vulgatus str. 3975 RP4. These data led us to conclude that Bacteroides vulgatus str. 3975 is the most closely related strain.
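The tallies behind these percentages can be reproduced with a short sketch. The organism labels below are the first eight entries of the Organism column in Table 5, used only as a small sample. The `percent` helper and the rule for spotting strain-level labels (the text " str. " in the name) are assumptions made for illustration.

```python
from collections import Counter


def percent(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Organism of the best hit for CDSs 1_1 to 1_8 (from Table 5).
organisms = [
    "Bacteroides sp. 3_1_40A",
    "Bacteroides vulgatus str. 3975 RP4",
    "Bacteroides vulgatus str. 3975 RP4",
    "Bacteroides",
    "Bacteroides vulgatus",
    "Bacteroides vulgatus",
    "Bacteroides vulgatus str. 3975 RP4",
    "Bacteroides vulgatus str. 3975 RP4",
]

counts = Counter(organisms)
# Assumed rule: a label names a strain if it contains " str. ".
strain_level = sum(n for name, n in counts.items() if " str. " in name)
vulgatus = sum(n for name, n in counts.items() if name.startswith("Bacteroides vulgatus"))

print(counts.most_common())
print("strain-level hits in sample:", strain_level)
print("B. vulgatus hits in sample:", vulgatus)

# Percentages quoted in the discussion, from the full-table counts.
print(percent(43, 49))  # species-level matches listing B. vulgatus
print(percent(9, 15))   # strain-level matches listing str. 3975 RP4
```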
  • 8. Appendix Contains 7 tables containing the raw data used to create our Results and Discussion sections along with 1 figure showing our genome alignment
  • 9. Table 1. Genome assembly statistics for Team 10
    No. of read pairs: 47893
    No. of low-quality reads: 1763
    No. of assembled reads: 102640
    No. of unassembled reads: 2382
    No. of contigs: 2
    No. of scaffolds: 2
    Total nt length of scaffolds: 126196
    Contig | Length | %G+C | No. of reads mapped | Coverage
    Contig 100.0 | 119,977 | 40.61% | 4851245 | 6065.0
    Contig 100.1 | 6,219 | 37.58% | 240956 | 5811.0
  • 10. Table 2. Gene annotation summary for scaffolds
    Scaffold | CDS/ORFs | tRNAs | Other RNAs
    scaffold1.1 | 95 | 0 | 0
    scaffold2.1 | 9 | 1 | 0
  • 11. Table 3. Predicted Gene Coordinates
    Scaffold | Name | Type | Start | Stop | Strand | NT Length | AA Length | GC % | Signal Peptide? (SP Length, AA) | Best Blast Hit | Blast description
    scaffold 1.1 | 1_1 | CDS | 3 | 611 | - | 609 | 202 | 0.406 | N | gi|496057719|ref|WP_008782226.1| | transposase, partial
    scaffold 1.1 | 1_2 | CDS | 845 | 3022 | - | 2178 | 725 | 0.405 | Y (21) | gi|649547948|gb|KDS54658.1| | hypothetical protein M099_1756
    scaffold 1.1 | 1_3 | CDS | 3539 | 3766 | - | 228 | 75 | 0.403 | N | gi|649547946|gb|KDS54656.1| | glycoside hydrolase family 88 domain protein
    scaffold 1.1 | 1_4 | CDS | 3949 | 4905 | - | 957 | 318 | 0.383 | N | gi|492435030|ref|WP_005843062.1| | MULTISPECIES: transcriptional regulator
    scaffold 1.1 | 1_5 | CDS | 5062 | 6291 | + | 1230 | 409 | 0.408 | N | gi|492435027|ref|WP_005843060.1| | TonB-dependent receptor
    scaffold 1.1 | 1_6 | CDS | 6311 | 7198 | + | 888 | 295 | 0.429 | Y (18) | gi|492435023|ref|WP_005843058.1| | hypothetical protein
    scaffold 1.1 | 1_7 | CDS | 7536 | 8942 | + | 1407 | 468 | 0.396 | Y (21) | gi|649547942|gb|KDS54652.1| | ahpC/TSA family protein
    scaffold 1.1 | 1_8 | CDS | 9027 | 9767 | - | 741 | 246 | 0.396 | N | gi|649547941|gb|KDS54651.1| | ahpC/TSA family protein
    scaffold 1.1 | 1_9 | CDS | 10111 | 12657 | + | 2547 | 848 | 0.421 | N | gi|495945682|ref|WP_008670261.1| | MULTISPECIES: hypothetical protein
    scaffold 1.1 | 1_10 | CDS | 12750 | 15755 | - | 3006 | 1001 | 0.36 | N | gi|495945680|ref|WP_008670259.1| | MULTISPECIES: hypothetical protein
    scaffold 1.1 | 1_11 | CDS | 15884 | 16252 | + | 369 | 122 | 0.477 | Y (19) | gi|492458337|ref|WP_005851052.1| | alpha-L-fucosidase
    scaffold 1.1 | 1_12 | CDS | 16394 | 17275 | - | 882 | 293 | 0.468 | N | gi|492434987|ref|WP_005843035.1| | tRNA dimethylallyltransferase 1
    scaffold 1.1 | 1_13 | CDS | 17363 | 18388 | - | 1026 | 341 | 0.429 | N | gi|492434984|ref|WP_005843033.1| | MULTISPECIES: hypothetical protein
    scaffold 1.1 | 1_14 | CDS | 18424 | 19740 | - | 1317 | 438 | 0.432 | N | gi|492434981|ref|WP_005843031.1| | MULTISPECIES: UDP-N-acetylglucosamine acyltransferase
    scaffold 1.1 | 1_15 | CDS | 19846 | 21519 | + | 1674 | 557 | 0.476 | N | gi|492458346|ref|WP_005851058.1| | MULTISPECIES: hydroxymyristoyl-ACP dehydratase
    scaffold 1.1 | 1_16 | CDS | 21680 | 21880 | + | 201 | 66 | 0.454 | N | gi|492458349|ref|WP_005851060.1| | MULTISPECIES: UDP-3-O-acylglucosamine N-acyltransferase
    scaffold 1.1 | 1_17 | CDS | 22035 | 22727 | + | 693 | 230 | 0.43 | N | gi|500644323|ref|WP_011964621.1| | phosphohydrolase
    scaffold 1.1 | 1_18 | CDS | 22796 | 23239 | - | 444 | 147 | 0.453 | N | gi|492434969|ref|WP_005843024.1| | MULTISPECIES: orotidine 5'-phosphate decarboxylase
    scaffold 1.1 | 1_19 | CDS | 23255 | 23524 | - | 270 | 89 | 0.47 | N | gi|492434967|ref|WP_005843023.1| | MULTISPECIES: peptide chain release factor 1
    scaffold 1.1 | 1_20 | CDS | 23527 | 23871 | - | 345 | 114 | 0.471 | N | gi|492458355|ref|WP_005851064.1| | MULTISPECIES: phosphoribosylformylglycinamidine cyclo-ligase
    scaffold 1.1 | 1_21 | CDS | 24081 | 24527 | + | 447 | 148 | 0.31 | N | gi|492434963|ref|WP_005843021.1| | hypothetical protein
    scaffold 1.1 | 1_22 | CDS | 24636 | 24818 | + | 183 | 60 | 0.409 | N | gi|492434961|ref|WP_005843020.1| | MULTISPECIES: toxin Fic
  • 12. Table 5. Single best blast hit of annotated ORFs from Team 10
    Name | Gene Identifier | Description | Organism | % identity | E-value
    1_1 | gi|496057719|ref|WP_008782226.1| | transposase, partial | Bacteroides sp. 3_1_40A | 100 | 8.00E-88
    1_2 | gi|649547948|gb|KDS54658.1| | hypothetical protein M099_1756 | Bacteroides vulgatus str. 3975 RP4 | 100 | 4.00E-62
    1_3 | gi|649547946|gb|KDS54656.1| | glycoside hydrolase family 88 domain protein | Bacteroides vulgatus str. 3975 RP4 | 100 | 6.00E-62
    1_4 | gi|492435030|ref|WP_005843062.1| | MULTISPECIES: transcriptional regulator | Bacteroides | 100 | 5.00E-82
    1_5 | gi|492435027|ref|WP_005843060.1| | TonB-dependent receptor | Bacteroides vulgatus | 100 | 0
    1_6 | gi|492435023|ref|WP_005843058.1| | hypothetical protein | Bacteroides vulgatus | 100 | 0
    1_7 | gi|649547942|gb|KDS54652.1| | ahpC/TSA family protein | Bacteroides vulgatus str. 3975 RP4 | 100 | 0
    1_8 | gi|649547941|gb|KDS54651.1| | ahpC/TSA family protein | Bacteroides vulgatus str. 3975 RP4 | 100 | 0
    1_9 | gi|495945682|ref|WP_008670261.1| | MULTISPECIES: hypothetical protein | Bacteroides | 99.61 | 0
    1_10 | gi|495945680|ref|WP_008670259.1| | MULTISPECIES: hypothetical protein | Bacteroides | 97.22 | 2.00E-16
    1_11 | gi|492458337|ref|WP_005851052.1| | alpha-L-fucosidase | Bacteroides vulgatus | 100 | 0
    1_12 | gi|492434987|ref|WP_005843035.1| | tRNA dimethylallyltransferase 1 | Bacteroides vulgatus | 100 | 0
    1_13 | gi|492434984|ref|WP_005843033.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 9.00E-131
    1_14 | gi|492434981|ref|WP_005843031.1| | MULTISPECIES: UDP-N-acetylglucosamine acyltransferase | Bacteroides | 100 | 3.00E-180
    1_15 | gi|492458346|ref|WP_005851058.1| | MULTISPECIES: hydroxymyristoyl-ACP dehydratase | Bacteroides | 100 | 0
    1_16 | gi|492458349|ref|WP_005851060.1| | MULTISPECIES: UDP-3-O-acylglucosamine N-acyltransferase | Bacteroides | 100 | 0
    1_17 | gi|500644323|ref|WP_011964621.1| | phosphohydrolase | Bacteroides vulgatus | 100 | 0
    1_18 | gi|492434969|ref|WP_005843024.1| | MULTISPECIES: orotidine 5'-phosphate decarboxylase | Bacteroides | 100 | 0
    1_19 | gi|492434967|ref|WP_005843023.1| | MULTISPECIES: peptide chain release factor 1 | Bacteroides | 100 | 0
    1_20 | gi|492458355|ref|WP_005851064.1| | MULTISPECIES: phosphoribosylformylglycinamidine cyclo-ligase | Bacteroides | 100 | 0
    1_21 | gi|492434963|ref|WP_005843021.1| | hypothetical protein | Bacteroides vulgatus | 100 | 6.00E-138
    1_22 | gi|492434961|ref|WP_005843020.1| | MULTISPECIES: toxin Fic | Bacteroides | 100 | 0
    1_23 | gi|492458359|ref|WP_005851066.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 6.00E-43
    1_24 | gi|492434958|ref|WP_005843019.1| | hypothetical protein | Bacteroides vulgatus | 99.64 | 0
    1_25 | gi|492458364|ref|WP_005851068.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 0
    1_26 | gi|492458366|ref|WP_005851069.1| | MULTISPECIES: membrane protein | Bacteroides | 100 | 2.00E-43
    1_27 | gi|492458368|ref|WP_005851070.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 9.00E-114
    1_28 | gi|492458370|ref|WP_005851071.1| | MULTISPECIES: beta-N-acetylhexosaminidase | Bacteroides | 100 | 0
    1_29 | gi|492434942|ref|WP_005843009.1| | MULTISPECIES: endonuclease | Bacteroides | 99.71 | 0
    1_30 | gi|511016443|ref|WP_016270813.1| | excinuclease ABC subunit A | Bacteroides vulgatus | 100 | 0
    1_31 | gi|492434935|ref|WP_005843004.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 0
    1_32 | gi|492434933|ref|WP_005843003.1| | MULTISPECIES: chromate transporter | Bacteroides | 100 | 1.00E-131
    1_33 | gi|492434930|ref|WP_005843001.1| | MULTISPECIES: chromate transporter | Bacteroides | 100 | 1.00E-105
    1_34 | gi|511016442|ref|WP_016270812.1| | hypothetical protein | Bacteroides vulgatus | 100 | 0
    1_35 | gi|511016441|ref|WP_016270811.1| | phosphoribosylformylglycinamidine synthase | Bacteroides vulgatus | 100 | 0
    1_36 | gi|492434921|ref|WP_005842995.1| | MULTISPECIES: translocator protein, LysE family | Bacteroides | 100 | 4.00E-150
    1_37 | gi|492434917|ref|WP_005842993.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 5.00E-127
    1_38 | gi|492458387|ref|WP_005851079.1| | MULTISPECIES: dTDP-4-dehydrorhamnose reductase | Bacteroides | 100 | 0
    1_39 | gi|492434911|ref|WP_005842989.1| | MULTISPECIES: peptide chain release factor 3 | Bacteroides | 100 | 0
    1_40 | gi|492434907|ref|WP_005842987.1| | MULTISPECIES: molecular chaperone DnaJ | Bacteroides | 100 | 0
    1_41 | gi|492434904|ref|WP_005842985.1| | dihydrofolate reductase | Bacteroides vulgatus | 100 | 0
    1_42 | gi|548318542|ref|WP_022508241.1| | hypothetical protein | Bacteroides vulgatus CAG:6 | 100 | 1.00E-174
    1_43 | gi|492434896|ref|WP_005842980.1| | hypothetical protein | Bacteroides vulgatus | 100 | 0
    1_44 | gi|492458409|ref|WP_005851092.1| | transcriptional regulator | Bacteroides vulgatus | 99.7 | 0
    1_45 | gi|492434890|ref|WP_005842976.1| | MULTISPECIES: hypothetical protein | Bacteroides | 100 | 1.00E-44
    1_46 | gi|492434887|ref|WP_005842974.1| | hypothetical protein | Bacteroides vulgatus | 100 | 0
    1_47 | gi|500644291|ref|WP_011964611.1| | hypothetical protein | Bacteroides vulgatus | 100 | 0
  • 13. Table 6. PFAM domain matches for annotated genes from Team 10
    Name | PFAM ID | Description | E value
    scaffold1.1_1 | PF01610.12 | Transposase | 2.90E-25
    scaffold1.1_2 | PF11396.3 | Protein of unknown function (DUF2874) | 7.80E-15
    scaffold1.1_4 | PF03965.11 | Penicillinase repressor | 2.40E-25
    scaffold1.1_5 | PF03544.9 | Gram-negative bacterial TonB protein C-termi | 2.50E-23
    scaffold1.1_5 | PF13715.1 | Domain of unknown function (DUF4480) | 1.50E-16
    scaffold1.1_5 | PF05569.6 | BlaR1 peptidase M56 | 1.00E-11
    scaffold1.1_5 | PF13620.1 | Carboxypeptidase regulatory-like domain | 2.90E-10
    scaffold1.1_5 | PF07715.10 | TonB-dependent Receptor Plug Domain | 2.10E-06
    scaffold1.1_6 | PF14559.1 | Tetratricopeptide repeat | 6.20E-13
    scaffold1.1_6 | PF13414.1 | TPR repeat | 6.70E-12
    scaffold1.1_6 | PF07719.12 | Tetratricopeptide repeat | 2.90E-11
    scaffold1.1_6 | PF13428.1 | Tetratricopeptide repeat | 2.00E-10
    scaffold1.1_6 | PF13432.1 | Tetratricopeptide repeat | 9.60E-10
    scaffold1.1_6 | PF13429.1 | Tetratricopeptide repeat | 5.30E-08
    scaffold1.1_6 | PF12895.2 | Anaphase-promoting complex, cyclosome, subun | 1.30E-07
    scaffold1.1_6 | PF13431.1 | Tetratricopeptide repeat | 6.80E-06
    scaffold1.1_7 | PF00578.16 | AhpC/TSA family | 1.30E-11
    scaffold1.1_7 | PF00255.14 | Glutathione peroxidase | 4.20E-08
    scaffold1.1_7 | PF14289.1 | Domain of unknown function (DUF4369) | 1.70E-06
    scaffold1.1_8 | PF13905.1 | Thioredoxin-like | 1.40E-14
    scaffold1.1_8 | PF13098.1 | Thioredoxin-like domain | 1.90E-14
    scaffold1.1_8 | PF00085.15 | Thioredoxin | 2.70E-11
    scaffold1.1_8 | PF08534.5 | Redoxin | 4.30E-11
    scaffold1.1_8 | PF00578.16 | AhpC/TSA family | 1.00E-07
    scaffold1.1_11 | PF01120.12 | Alpha-L-fucosidase | 2.60E-87
    scaffold1.1_12 | PF01715.12 | IPP transferase | 7.70E-64
    scaffold1.1_12 | PF01745.11 | Isopentenyl transferase | 3.00E-12
    scaffold1.1_12 | PF04851.10 | Type III restriction enzyme, res subunit | 0.00022
    scaffold1.1_13 | PF07929.6 | Plasmid pRiA4b ORF-3-like protein | 4.00E-11
    scaffold1.1_14 | PF13720.1 | Udp N-acetylglucosamine O-acyltransferase; D | 1.20E-28
    scaffold1.1_14 | PF00132.19 | Bacterial transferase hexapeptide (six repea | 1.10E-25
    scaffold1.1_15 | PF03331.8 | UDP-3-O-acyl N-acetylglycosamine deacetylase | 6.00E-74
    scaffold1.1_15 | PF07977.8 | FabA-like domain | 1.10E-35
    scaffold1.1_16 | PF00132.19 | Bacterial transferase hexapeptide (six repea | 1.10E-29
    scaffold1.1_16 | PF04613.9 | UDP-3-O-[3-hydroxymyristoyl] glucosamine N-a | 7.00E-17
    scaffold1.1_16 | PF14602.1 | Hexapeptide repeat of succinyl-transferase | 1.20E-10
    scaffold1.1_17 | PF01966.17 | HD domain | 2.90E-08
    scaffold1.1_18 | PF00215.19 | Orotidine 5'-phosphate decarboxylase / HUMPS | 9.20E-30
    scaffold1.1_19 | PF03462.13 | PCRF domain | 3.40E-39
    scaffold1.1_19 | PF00472.15 | RF-1 domain | 2.60E-33
    scaffold1.1_20 | PF02769.17 | AIR synthase related protein, C-terminal dom | 1.70E-12
    scaffold1.1_22 | PF13310.1 | Virulence protein RhuM family | 5.70E-110
    scaffold1.1_24 | PF02638.10 | Glycosyl hydrolase like GH101 | 1.80E-53
    scaffold1.1_24 | PF13200.1 | Putative glycosyl hydrolase domain | 3.40E-07
    scaffold1.1_25 | PF02554.9 | Carbon starvation protein CstA | 8.90E-79
    scaffold1.1_25 | PF13722.1 | C-terminal domain on CstA (DUF4161) | 2.30E-24
  • 14. Table 7. TIGRFAM domain matches for annotated genes from Team 10
    Name | TIGRFAM ID | Description | E value
    scaffold1.1_5 | TIGR04057 | SusC_RagA_signa: TonB-dependent outer membrane receptor, SusC/RagA subfamily, signature region | 2.70E-16
    scaffold1.1_5 | TIGR01352 | tonB_Cterm: TonB family C-terminal domain | 2.70E-12
    scaffold1.1_12 | TIGR00174 | miaA: tRNA dimethylallyltransferase | 5.90E-75
    scaffold1.1_14 | TIGR01852 | lipid_A_lpxA: acyl-[acyl-carrier-protein]-UDP-N-acetylglucosamine O-acyltransferase | 1.70E-92
    scaffold1.1_15 | TIGR00325 | lpxC: UDP-3-O-[3-hydroxymyristoyl] N-acetylglucosamine deacetylase | 2.50E-56
    scaffold1.1_15 | TIGR01750 | fabZ: beta-hydroxyacyl-(acyl-carrier-protein) dehydratase FabZ | 3.90E-49
    scaffold1.1_16 | TIGR01853 | lipid_A_lpxD: UDP-3-O-[3-hydroxymyristoyl] glucosamine N-acyltransferase LpxD | 3.60E-105
    scaffold1.1_18 | TIGR02127 | pyrF_sub2: orotidine 5'-phosphate decarboxylase | 3.60E-72
    scaffold1.1_19 | TIGR00019 | prfA: peptide chain release factor 1 | 1.10E-137
    scaffold1.1_30 | TIGR00630 | uvra: excinuclease ABC subunit A | 0
    scaffold1.1_38 | TIGR01214 | rmlD: dTDP-4-dehydrorhamnose reductase | 1.90E-89
    scaffold1.1_39 | TIGR00503 | prfC: peptide chain release factor 3 | 6.10E-207
    scaffold1.1_39 | TIGR00231 | small_GTP: small GTP-binding protein domain | 2.20E-25
    scaffold1.1_49 | TIGR02227 | sigpep_I_bact: signal peptidase I | 1.30E-19
    scaffold1.1_52 | TIGR01730 | RND_mfp: efflux transporter, RND family, MFP subunit | 8.80E-48
    scaffold1.1_56 | TIGR00221 | nagA: N-acetylglucosamine-6-phosphate deacetylase | 1.30E-81
    scaffold1.1_57 | TIGR00057 | TIGR00057: tRNA threonylcarbamoyl adenosine modification protein, Sua5/YciO/YrdC/YwlC family | 1.20E-44
    scaffold1.1_59 | TIGR00460 | fmt: methionyl-tRNA formyltransferase | 8.00E-81
    scaffold1.1_61 | TIGR02937 | sigma70-ECF: RNA polymerase sigma factor, sigma-70 family | 4.40E-29
    scaffold1.1_63 | TIGR01163 | rpe: ribulose-phosphate 3-epimerase | 1.00E-83
    scaffold1.1_64 | TIGR00360 | ComEC_N-term: ComEC/Rec2-related protein | 8.50E-27
    scaffold1.1_67 | TIGR03990 | Arch_GlmM: phosphoglucosamine mutase | 1.80E-160
    scaffold1.1_69 | TIGR00539 | hemN_rel: putative oxygen-independent coproporphyrinogen III oxidase | 4.50E-87
    scaffold1.1_71 | TIGR00231 | small_GTP: small GTP-binding protein domain | 1.10E-18
    scaffold1.1_76 | TIGR00166 | S6: ribosomal protein S6 | 2.00E-25
    scaffold1.1_77 | TIGR00165 | S18: ribosomal protein S18 | 1.90E-33
    scaffold1.1_78 | TIGR00158 | L9: ribosomal protein L9 | 1.00E-35
    scaffold1.1_82 | TIGR01579 | MiaB-like-C: MiaB-like tRNA modifying enzyme | 3.00E-122
    scaffold1.1_82 | TIGR00089 | TIGR00089: radical SAM methylthiotransferase, MiaB/RimO family | 1.10E-113
    scaffold1.1_85 | TIGR00525 | folB: dihydroneopterin aldolase | 5.10E-30
  • 15. Table 8. Phyre2 predicted best crystal structure matches for annotated genes from Team 10
Name | PDB best match | Pct_identity | Confidence | Aligned region | Description
1.1_1 | c3f9kV | 22 | 61.1 | 89-115 | two domain fragment of hiv-2 integrase in complex with ledgf ibd
1.1_4 | d1sd4a | 19 | 100 | 3-120 | Penicillinase repressor
1.1_14 | c3i3aC | 39 | 100 | 2-255 | transferase, structural basis for the sugar nucleotide and acyl chain selectivity of leptospira interrogans lpxa
1.1_19 | c3d5cX | 43 | 100 | 8-369 | peptide chain release factor 1, structural basis for translation termination on the 70s ribosome
1.1_32 | c3dboA | 29 | 49.4 | 36-67 | toxin/antitoxin, crystal structure of a member of the vapbc family of toxin-antitoxin systems, vapbc-5, from mycobacterium tuberculosis
1.1_54 | c4mt4C | 12 | 100 | 27-478 | transport protein, crystal structure of the campylobacter jejuni cmec outer membrane channel
1.1_57 | c2eqaA | 23 | 100 | 6-191 | rna binding protein, crystal structure of the hypothetical sua5 protein from sulfolobus tokodaii
1.1_60 | c3k6oA | 24 | 100 | 29-237 | structural genomics, unknown function, crystal structure of protein of unknown function duf1344 (yp_001299214.1) from bacteroides vulgatus atcc 8482
1.1_68 | c1upsB | 16 | 100 | 21-262 | glycosyl hydrolase, glcnac[alpha]1-4gal releasing endo-[beta]-galactosidase from clostridium perfringens
  • 16. Figure 3 is a screenshot of the whole-genome alignment of our scaffolds against the genome of Bacteroides vulgatus str. 3975 RP4, which we determined to be the strain with the most blastp matches against our contigs.
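Choosing the reference strain by "most blastp matches" amounts to a tally of hits per strain. The following is only a minimal Python sketch of that idea, not the procedure the team actually ran: the function name `strain_with_most_matches`, the `(query_id, strain_name)` input format, and the example data are all assumptions made for illustration.

```python
from collections import Counter


def strain_with_most_matches(hits):
    """Return (strain, count) for the strain matched by the most queries.

    `hits` is an iterable of (query_id, strain_name) pairs, for example the
    best BLASTP hit of each predicted protein (assumed input format).
    Returns None when there are no hits.
    """
    # Count each query at most once per strain so that repeated hits
    # from the same query do not inflate the tally.
    unique_pairs = set(hits)
    counts = Counter(strain for _, strain in unique_pairs)
    if not counts:
        return None
    return counts.most_common(1)[0]


# Hypothetical example data, not results from this project.
example_hits = [
    ("scaffold1.1_14", "strain A"),
    ("scaffold1.1_15", "strain A"),
    ("scaffold1.1_15", "strain A"),  # duplicate hit, counted once
    ("scaffold1.1_19", "strain B"),
]
print(strain_with_most_matches(example_hits))
```

In practice the pairs would come from parsed BLASTP output, and ties or near-ties between closely related strains would need a second criterion such as alignment coverage.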