Bc2012 submission 109a

Representative Proteomes
and Genomes
A standardized, stable and unbiased set of proteomes and
genomes
http://pir.georgetown.edu/rps/

Raja Mazumder (mazumder@gwu.edu)

Nomenclature

 Representative Proteomes
 Primarily computational
 Reference Proteomes/QFO
Proteomes/Blessed Proteomes
 Primarily manual
 Extension of Reference proteomes

Representative Proteomes (RP) and Representative
genomes (RG)
http://pir.georgetown.edu/rps/

Procedure to generate the RPs
Compute pair-wise co-membership value (X) in UniRef50 for all proteomes

For each proteome, compute the mean co-membership between
this proteome and the other proteomes
Create ranked proteome list based on the mean co-membership

RPG generation starts

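For a given co-membership threshold (CMT), take the first proteome in the
ranked list together with the ones with X ≥ CMT to form a Representative
Proteome Group (RPG), and remove them from the ranked list

Ranked list empty? No: repeat the step above. Yes: continue below.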

RPG generation ends

Select a Representative Proteome for each RPG (manually inspected by curator)
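
The flowchart above can be sketched in a few lines of Python. This is a minimal illustration, not the production pipeline: the slides do not define how X is calculated, so co_membership() below uses an assumed definition (percentage of shared UniRef50 clusters relative to the smaller proteome), and the ranked list is assumed to be sorted by descending mean co-membership. The function names and the toy cluster IDs are hypothetical.

```python
def co_membership(clusters_a, clusters_b):
    """ASSUMPTION: X = % of UniRef50 clusters shared, relative to the smaller proteome."""
    if not clusters_a or not clusters_b:
        return 0.0
    shared = len(clusters_a & clusters_b)
    return 100.0 * shared / min(len(clusters_a), len(clusters_b))


def build_rpgs(proteomes, cmt):
    """Greedy RPG generation; proteomes maps a proteome name to its set of cluster IDs."""
    names = sorted(proteomes)
    # Pair-wise co-membership value X for all proteome pairs.
    x = {(a, b): co_membership(proteomes[a], proteomes[b])
         for a in names for b in names if a != b}
    # Mean co-membership between each proteome and all the others.
    others = max(len(names) - 1, 1)
    mean_x = {a: sum(x[a, b] for b in names if b != a) / others for a in names}
    # ASSUMPTION: highest mean first; ties broken by name for reproducibility.
    ranked = sorted(names, key=lambda a: (-mean_x[a], a))

    rpgs = []
    while ranked:  # "ranked list empty?" check from the flowchart
        seed = ranked[0]
        group = [seed] + [p for p in ranked[1:] if x[seed, p] >= cmt]
        rpgs.append(group)
        ranked = [p for p in ranked if p not in group]
    return rpgs


# Toy input: integers stand in for UniRef50 cluster IDs.
toy = {"A": {1, 2, 3, 4}, "B": {1, 2, 3, 5}, "C": {7, 8, 9, 10}, "D": {7, 8, 9, 4}}
print(build_rpgs(toy, cmt=55))  # [['A', 'B'], ['D', 'C']]
print(build_rpgs(toy, cmt=15))  # [['A', 'D', 'B'], ['C']]
```

Lowering the CMT merges more proteomes into each RPG, which gives fewer groups. Choosing the representative inside each group is left out here because that step is inspected manually by a curator.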

[Figure: co-membership of Proteome A and Proteome B in UniRef50 clusters]

UniRef50
 Sequence clusters (UniRef100, 90, 50)
 From any organism
 Part of UniProt production cycle
 PMID: 17379688

RPs at Different CMTs
[Figure: number of RPs (left axis, 0 to 1000) and % reduction (right axis, 0 to 100) plotted against CMT (%) at 75, 55, 35 and 15. Series: # RPs; % reduction in proteomes (1); % reduction in sequences (2). Data labels on the chart: 3.02, 2.69, 2.36 and 2.02 million proteins.]
1 Based on 1144 complete genomes
2 Based on 4.3 million sequences (complete genomes only)
UniProtKB total: 13.46 million sequences

• The RPs at a higher CMT level are used to build the clusters at the lower levels
• RPGs are constructed based on co-membership, not
taxonomy

Manual mapping of UniProt and
NCBI genomes
 The taxonomy ID of each proteome present
in UniProt is mapped to the NCBI
RefSeq/GenBank genome project IDs
 When more than one genome is available
for the same taxonomy ID, the genomes are
ranked according to the availability of a
RefSeq genome, number of related
publications, number of citations for each
publication, and date of sequencing.
 The highest ranking genome is mapped to
the UniProtKB proteome
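
The ranking rule above can be illustrated with a short Python sketch. The slide lists the criteria but not how they are combined, so this sketch assumes they are applied in the listed order as tie-breakers, that citations are summed over publications, and that a more recent sequencing date ranks higher. The Genome class, its field names and pick_genome() are hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Genome:
    project_id: str       # NCBI RefSeq/GenBank genome project ID
    in_refseq: bool       # a RefSeq genome is available
    n_publications: int   # number of related publications
    n_citations: int      # ASSUMPTION: citations summed over all publications
    sequenced: date       # date of sequencing


def pick_genome(candidates):
    """Return the highest-ranking genome for one taxonomy ID.

    ASSUMPTION: criteria are compared in the order given on the slide,
    and a newer sequencing date wins the final tie-break.
    """
    return max(
        candidates,
        key=lambda g: (g.in_refseq, g.n_publications, g.n_citations, g.sequenced),
    )
```

The mapping in the slides is done manually, so a function like this would only pre-sort candidates for a curator.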

RefSeq genomes and proteomes
 Mapping allows us to retrieve genomes and
proteomes from RefSeq.

ftp://ftp.pir.georgetown.edu/databases/rps/rg
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/

RP55 Over Time
[Figure: RP55 over UniProtKB releases 2004_1, 2005_4, 2006_7, 2007_10, 2008_13, 2009_15 and 2010_09. Left axis: # complete proteomes (0 to 1400). Right axis: % (0 to 120). Series: # complete proteomes, # RPGs, % species in multiple RPGs, % stable RPGs.]

Coverage Statistics – RP55

 95% of all InterPro families contain at least
one protein from the RP set
 InterPro covers ~75% of all proteins in
UniProtKB, and this number holds true as
well for RP55
 93% of the experimentally-characterized
proteins are retained in the RP set
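
A statistic such as the first bullet above (share of families with at least one RP protein) could be computed along these lines. This is only a sketch of the arithmetic; the function name, the family names and the protein IDs are made up for illustration and do not come from InterPro or UniProtKB.

```python
def family_coverage(families, rp_proteins):
    """Percent of families that contain at least one protein from the RP set.

    families: dict mapping a family name to the set of its member protein IDs.
    rp_proteins: set of protein IDs that belong to the RP set.
    """
    if not families:
        return 0.0
    covered = sum(1 for members in families.values() if members & rp_proteins)
    return 100.0 * covered / len(families)


# Hypothetical toy data.
families = {
    "fam1": {"p1", "p2"},
    "fam2": {"p3"},
    "fam3": {"p4", "p5"},
    "fam4": {"p6"},
}
rp_set = {"p1", "p4", "p6"}
print(family_coverage(families, rp_set))  # 75.0
```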

Visualizing all-against-all proteome
correlation matrix vs. the taxonomy
tree
 Developed a method to graphically visualize NCBI's taxonomy
tree and overlay the proteome correlation tree (PCT) to illustrate
genomic similarity between organisms that may otherwise be
considered to have distant ancestry.
 Computed all-against-all correlation values between all complete
proteomes
 Comparison network can be browsed in Cytoscape network
software to easily identify nodes in the taxonomy tree that are not
supported by PCT data
 Development tools: CytoscapeWeb, CytoscapeRPC, Perl

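[Figure: layered tree for the family Enterobacteriaceae. Family level: Enterobacteriaceae. Genus level: Shigella, Escherichia, Enterobacter, Klebsiella. Species level (proteomes): ENT38, ECOLI, ESCF3, ECO24, ENTAK, ECOK1, KLEP7, SHIFL, SHIF8. Distances between the family, genus and species levels are based on the taxonomy tree; distances among the proteome nodes are based on the correlation table.]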

Example: Examine correlation scores of
AGRT5

 Agrobacterium tumefaciens (AGRT5)

http://pir.georgetown.edu/cgi-bin/rps_tree.pl?point_id=r15p176299&on=1&on100=1&file_id=122063&p=#-5

Can easily identify genomic neighbors

 The top 2 levels are family
and genus nodes arranged
according to taxonomic
position
 The bottom nodes are
complete proteomes with a
heuristic force-directed
layout applied according to
all-against-all correlation
 Although AGRT5 and
AGRVS share the same
genus, they are relatively
distant from each other
(~28%), compared to
AGRT5 and AGRSH
(~70%).
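
Finding genomic neighbors from the correlation table amounts to sorting one row of the matrix. The sketch below assumes the table is stored as a dictionary keyed by proteome pairs, which is a choice made here for illustration; the two scores are the approximate values quoted above, and neighbors() is a hypothetical helper.

```python
def neighbors(corr, query):
    """Return (proteome, score) pairs for `query`, most similar first.

    corr: dict mapping an unordered pair (a, b) to its correlation score in %.
    """
    hits = []
    for (a, b), score in corr.items():
        if a == query:
            hits.append((b, score))
        elif b == query:
            hits.append((a, score))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Approximate scores taken from the AGRT5 example above.
corr = {("AGRT5", "AGRSH"): 70.0, ("AGRT5", "AGRVS"): 28.0}
print(neighbors(corr, "AGRT5"))  # [('AGRSH', 70.0), ('AGRVS', 28.0)]
```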

Sequence search
 Cleaner BLAST/phmmer results

Conclusions
 High-quality RPs generated computationally and
inspected by curators
 A standardized, stable and unbiased set of proteomes
and genomes
 Completely integrated into the UniProt/UniRef
production pipeline, with monthly releases
 Automatically selects the QFO/UniProt RF (if available in
the RPG) as the RP (feedback is provided to QFO and others
if there is a discrepancy)
 Extended to RefSeq (
ftp://ftp.pir.georgetown.edu/databases/rps/rg;
ftp://ftp.pir.georgetown.edu/databases/rps/rp_in_refseq_sequences/)
 RGs can help placement of unknown metagenomic
sequences into the correct clusters

Acknowledgements
Chuming Chen (PIR)
Darren Natale (PIR)
Hongzhan Huang (PIR)
Jian Zhang (PIR)
Peter McGarvey (PIR)
Cathy Wu (PIR)
Mona Motwani (GWU)
Jamal Theodore (GWU)
Robert Finn (HHMI Janelia Farm/Pfam)
Eleanor Stanley (EBI)
Kim Pruitt (NCBI)
Yuri Wolf (NCBI)
UniProt Consortium
