Bc2012 submission 109a
1. Representative Proteomes
and Genomes
A standardized, stable and unbiased set of proteomes and
genomes
http://pir.georgetown.edu/rps/
Raja Mazumder (mazumder@gwu.edu)
4. Procedure to generate the RPs
Compute pair-wise co-membership value (X) in UniRef50 for all proteomes
For each proteome, compute the mean co-membership between
this proteome and the other proteomes
Create ranked proteome list based on the mean co-membership
RPG generation starts
For a given CMT, take the first proteome in the ranked list and
the ones with X ≥ CMT to form an RPG, and remove them from the
ranked list
Is the ranked list empty?
No: repeat the previous step
Yes: RPG generation ends
Select a Representative Proteome for each RPG (manually inspected by curator)
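The flowchart above can be sketched in Python. This is only an illustration, not the PIR implementation: the function name `build_rpgs`, the dictionary layout for the co-membership values X, and the choice to rank by descending mean co-membership are assumptions (the slide does not state the sort direction). The curator's manual selection of the RP is not modeled.

```python
from typing import Dict, FrozenSet, List


def build_rpgs(
    proteomes: List[str],
    x: Dict[FrozenSet[str], float],
    cmt: float,
) -> List[List[str]]:
    """Greedy RPG construction; the first member of each group is its seed."""

    def pair(a: str, b: str) -> float:
        # Co-membership X between two proteomes; missing pairs count as 0.
        return x.get(frozenset((a, b)), 0.0)

    def mean_x(p: str) -> float:
        # Mean co-membership between p and every other proteome.
        others = [q for q in proteomes if q != p]
        return sum(pair(p, q) for q in others) / len(others) if others else 0.0

    # Assumption: highest mean co-membership comes first in the ranked list.
    ranked = sorted(proteomes, key=mean_x, reverse=True)

    groups: List[List[str]] = []
    while ranked:  # "ranked list empty?" decision from the flowchart
        seed = ranked[0]
        group = [seed] + [p for p in ranked[1:] if pair(seed, p) >= cmt]
        groups.append(group)
        members = set(group)
        ranked = [p for p in ranked if p not in members]
    return groups
```

Calling `build_rpgs` with a lower CMT on the same input yields fewer, larger groups, which matches the drop in the number of RPs from RP75 to RP15.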
5. UniRef50
[Diagram labels: Proteome A, Proteome B, UniRef50]
Sequence clusters (UniRef100, 90, 50)
From any organism
Part of UniProt production cycle
PMID: 17379688
6. RPs at Different CMTs
[Figure: # RPs (axis 0 to 1000) and % reduction (axis 0 to 100) plotted against CMT (%) at 75, 55, 35 and 15. Series: # RPs; % Reduction - Proteomes (1); % Reduction - Sequences (2). Data labels: 3.02, 2.69, 2.36 and 2.02 million proteins.]
1 Based on 1144 complete genomes
2 Based on 4.3 million sequences (complete genomes only)
UniProtKB total: 13.46 million sequences
7. • RP at higher level is used to cluster the lower levels
• RPGs are constructed based on co-membership, not
taxonomy
8. Manual mapping of UniProt and
NCBI genomes
The taxonomy ID of each proteome present
in UniProt is mapped to the NCBI
RefSeq/GenBank genome project IDs
When more than one genome is available
for the same taxonomy ID, the genomes are
ranked according to the availability of a
RefSeq genome, number of related
publications, number of citations for each
publication, and date of sequencing.
The highest ranking genome is mapped to
the UniProtKB proteome
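The ranking described on this slide can be expressed as a sort key. The sketch below is hypothetical: the class `GenomeCandidate`, its field names, and the tie-breaking details are assumptions. In particular, the slide does not say how per-publication citations are combined (they are summed here) or whether an earlier or later sequencing date ranks higher (later is preferred here).

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass
class GenomeCandidate:
    project_id: str                        # NCBI genome project ID (illustrative)
    has_refseq: bool                       # is a RefSeq genome available?
    citations_per_publication: List[int]   # one citation count per publication
    sequenced_on: date


def rank_key(g: GenomeCandidate) -> Tuple[bool, int, int, date]:
    # Criteria in slide order: RefSeq availability, number of publications,
    # citations (summed, an assumption), sequencing date (later first, an assumption).
    return (
        g.has_refseq,
        len(g.citations_per_publication),
        sum(g.citations_per_publication),
        g.sequenced_on,
    )


def pick_genome(candidates: List[GenomeCandidate]) -> GenomeCandidate:
    """Return the highest-ranking genome for one taxonomy ID."""
    return max(candidates, key=rank_key)
```

Because tuples compare element by element, a genome with a RefSeq entry always outranks one without, regardless of publication counts.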
9. RefSeq genomes and proteomes
Mapping allows us to retrieve genomes and
proteomes from RefSeq.
ftp://ftp.pir.georgetown.edu/databases/rps/rg
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/
11. Coverage Statistics – RP55
95% of all InterPro families contain at least
one protein from the RP set
InterPro covers ~75% of all proteins in
UniProtKB, and this number holds true as
well for RP55
93% of the experimentally-characterized
proteins are retained in the RP set
15. Visualizing all-against-all proteome
correlation matrix vs. the taxonomy
tree
Developed a method to graphically visualize NCBI's taxonomy
tree and overlay the proteome correlation tree (PCT) to illustrate
genomic similarity between organisms that may otherwise be
considered to have distant ancestry.
Computed all-against-all correlation values between all complete
proteomes
Comparison network can be browsed in Cytoscape network
software to easily identify nodes in the taxonomy tree that are not
supported by PCT data
Development tools: CytoscapeWeb, CytoscapeRPC, Perl
16. [Figure: taxonomy tree compared with correlation-based distances.
Family: Enterobacteriaceae (distance based on taxonomy tree)
Genus: Shigella, Escherichia, Enterobacter, Klebsiella (distance based on taxonomy tree)
Species: ENT38, ECOLI, ESCF3, ECO24, ENTAK, ECOK1, KLEP7, SHIFL, SHIF8 (distance based on correlation table)]
18. Can easily identify genomic neighbors
The top 2 levels are family
and genus nodes arranged
according to taxonomic
position
The bottom nodes are
complete proteomes with a
heuristic force-directed
layout applied according to
all-against-all correlation
Although AGRT5 and
AGRVS share the same
genus, they are relatively
distant from each other
(~28%), compared to
AGRT5 and AGRSH
(~70%).
20. Conclusions
High quality RPs generated computationally and
inspected by curators
A standardized, stable and unbiased set of proteomes
and genomes
Completely integrated into the UniProt/UniRef
production pipeline, with monthly releases
Automatically selects QFO/UniProt RF (if available in
RPG) as RP (provide feedback to QFO and others if
discrepancy)
Extended to RefSeq
(ftp://ftp.pir.georgetown.edu/databases/rps/rg;
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/)
RGs can help placement of unknown metagenomic
sequences into the correct clusters
21. Acknowledgements
Chuming Chen (PIR)
Darren Natale (PIR)
Hongzhan Huang (PIR)
Jian Zhang (PIR)
Peter McGarvey (PIR)
Cathy Wu (PIR)
Mona Motwani (GWU)
Jamal Theodore (GWU)
Robert Finn (HHMI Janelia Farm/Pfam)
Eleanor Stanley (EBI)
Kim Pruitt (NCBI)
Yuri Wolf (NCBI)
UniProt Consortium
Editor's notes
For the purpose of making the Representative Proteome Groups, we created a list of organisms ranked by mean co-membership, meaning that the average connectedness to all other organisms was computed. The list was then used to recruit organisms based on whether or not the co-membership score was greater than the CMT. We did some tests and found that varying the ranking system changed no more than 2% of the RPGs, and none of the RPs (meaning that in those cases, one organism “jumped ship” into another group).
To ensure that the RPs are hierarchical across CMTs, we first calculate RP75, then use that set to generate RP55, and so on. Any RP in a smaller set (such as RP15) will thus be found in all larger sets (such as RP75). The genus change of Vibrio salmonicida to Aliivibrio illustrates why using taxonomy as the mechanism for determining RPGs is a bad idea: taxonomy can change, but the sequences won’t.
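The nesting property can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function names are hypothetical, the seed of each group stands in for the curator-selected RP, and the ranked order is carried over unchanged from one CMT level to the next (the notes do not say whether it is recomputed).

```python
from typing import Callable, Dict, List


def group_by_cmt(
    ranked: List[str],
    x: Callable[[str, str], float],
    cmt: float,
) -> Dict[str, List[str]]:
    """Greedy grouping: maps each seed (used here as the RP) to its members."""
    remaining = list(ranked)
    groups: Dict[str, List[str]] = {}
    while remaining:
        seed = remaining[0]
        members = [p for p in remaining if p == seed or x(seed, p) >= cmt]
        groups[seed] = members
        remaining = [p for p in remaining if p not in members]
    return groups


def hierarchical_rps(
    ranked: List[str],
    x: Callable[[str, str], float],
    cmts: List[float],
) -> Dict[float, List[str]]:
    """Compute RP sets from the highest CMT down, clustering only the previous RPs."""
    current = list(ranked)
    rps_by_cmt: Dict[float, List[str]] = {}
    for cmt in sorted(cmts, reverse=True):
        # Only RPs from the previous (higher) CMT are candidates at this level,
        # so every RP at a lower CMT is also an RP at all higher CMTs.
        current = list(group_by_cmt(current, x, cmt))
        rps_by_cmt[cmt] = current
    return rps_by_cmt
```

Since each level draws its candidates only from the level above, the RP sets are nested by construction.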
Stability of RPs is illustrated by a retrospective study. Using a UniProt release from 2004, there were 116 RPs. All 116 remained RPs throughout the test period. Furthermore, any RP added later (in 2005, 2006, etc.) remained an RP in subsequent releases. Comparing the number of RPs with the number of complete proteomes shows that RPs are growing at a slower rate. The % reduction of sequence space is indicated by the red and purple lines, which show a steady increase in the reduction rate. This rate depends on the type of genomes being sequenced: if many related genomes (yet another E. coli, for example) are added, then the reduction increases (see 2006_7).
The InterPro coverage stats indicate that the RP set covers most of the sequence space in terms of similarity families. They also indicate that there is little significant over- or under-representation of sequences. The missing InterPro families tended to be viral (viruses are not yet included in RP) or lineage-specific. Experimentally-characterized proteins were determined using literature references in Swiss-Prot entries or GOA evidence codes. There are 30,000 such proteins in the examined set.