4. Fragmentation
孙瑞祥
ICT
Electron Transfer Dissociation: Characterization and
Applications in Protein Identification
2010. Improved Peptide Identification for Proteomic Analysis Based on Comprehensive Characterization of
Electron Transfer Dissociation Spectra. J Proteome Res.
Important spectral characteristics of ETD are ignored or underutilized in
popular database search algorithms, such as Mascot, Sequest, OMSSA, or X!
Tandem
Analyzed 461,440 spectra to characterize ETD, finding:
distinct hydrogen rearrangement patterns of +2, +3 and +4 precursors
charge-reduced precursor ions and associated neutral loss peaks
pFind identified 63-122% more unique peptides than Mascot for doubly
charged precursors at 1% FDR cutoff.
5. Labeling Strategy
陆豪杰
Fudan Univ
In vivo termini amino acid labeling for quantitative proteomics
Covers 93% of the proteins deposited in UniProt.
Higher accuracy for identification and quantification.
Dual digest by Arg-C & Lys-N (increases sample complexity)
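The dual digest can be illustrated with a minimal in silico sketch. It assumes the textbook specificities (Arg-C cleaves on the C-terminal side of R, Lys-N on the N-terminal side of K) and ignores missed cleavages and any proline effects. The function name `dual_digest` and the example sequence are hypothetical, not taken from the talk.

```python
def dual_digest(protein):
    """Split a protein after every R (Arg-C) and before every K (Lys-N).

    Simplified assumption: complete cleavage, no missed cleavages.
    """
    if not protein:
        return []
    peptides, start = [], 0
    for i in range(1, len(protein)):
        # A bond between residue i-1 and residue i is cut if the left
        # residue is R (Arg-C) or the right residue is K (Lys-N).
        if protein[i - 1] == "R" or protein[i] == "K":
            peptides.append(protein[start:i])
            start = i
    peptides.append(protein[start:])
    return peptides


# Hypothetical toy sequence
print(dual_digest("MKAAARGGKTTRPP"))
```

Peptides that begin with K and end with R carry a labelable residue at both termini, while the extra cleavage sites also produce many short peptides, which is consistent with the note about increased sample complexity.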
6. De Novo Sequencing
2010. pNovo: De novo Peptide Sequencing and Identification Using HCD Spectra. Journal of Proteome
Research 9:2713-2724.
董梦秋
NIBS
De novo Sequencing of Peptides using HCD Spectra
HCD produces high mass accuracy tandem mass spectra, the majority
of which contain complete ion series. Besides, abundant internal and
immonium ions in the HCD spectra can help differentiate between
similar sequences.
Ascaris suum sperm crawling
related proteins
pNovo
Identify peptide sequences
Blast
Homologs of C. elegans
Design primer for validation
7. De Novo Sequencing
马斌
U of Waterloo
Complete Homology-Assisted MS/MS Protein
Sequencing (CHAMPS)
2009. Automated protein (re)sequencing with MS/MS and a homologous database yields almost full coverage
and accuracy. Bioinformatics 25:2174 -2180.
Novel protein
SPIDER
Homologous sequence
De novo sequences
CHAMPS
Complete protein sequence
(above 99% coverage and 100% accuracy for two standard proteins)
8. De Novo Sequencing
王全会
BIG
From an unknown genome to a measurable proteome: Studying
on the pH-dependent proteomes in N10 bacteria by de novo
sequencing
2009. Exploring membrane and cytoplasm proteomic responses of Alkalimonas amylolytica N10 to different
external pHs with combination strategy of de novo peptide sequencing. Proteomics 9:1254-1273.
Tandem spectra with/without SPITC labeling
PEAKS for automated de novo sequencing
Manual analysis
Combine filtered data
Validation by PCR and Western blot
More than 70% of the differential 2-DE spots were identified
9. Identification
余维川
HKUST
Optimization-Based Peptide Mass Fingerprinting for Protein
Mixture Identification
2010. Optimization-based peptide mass fingerprinting for protein mixture identification. J. Comput. Biol 17:221-
235.
• PMF method has two inherent disadvantages:
– Originally designed for identifying single purified proteins rather than
protein mixtures
– Can’t distinguish different peptides with identical mass
• Heuristic algorithm
– Introduce a scoring function for protein mixture identification
– Local search algorithms for protein mixture identification
• External factors might be optimized to facilitate successful protein
mixture identification
– Mass accuracy
– Sequence coverage
– Noise level
– Protein number in the mixtures
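As a rough illustration of searching for a protein mixture that explains a peak list, here is a greedy sketch. It is a stand-in, not the scoring function or local search from the cited paper: the score is simply the number of newly explained peaks, and `greedy_mixture`, `tol` and `max_proteins` are hypothetical names.

```python
def greedy_mixture(observed, candidates, tol=0.01, max_proteins=3):
    """Greedily pick proteins that explain the most unexplained peaks.

    observed:   list of observed peptide masses (Da)
    candidates: dict mapping protein name -> list of theoretical peptide masses
    tol:        absolute mass tolerance (Da), an assumed parameter
    """
    def explains(name, mass):
        return any(abs(mass - t) <= tol for t in candidates[name])

    unexplained = set(observed)
    chosen = []
    while unexplained and len(chosen) < max_proteins:
        remaining = [n for n in candidates if n not in chosen]
        if not remaining:
            break
        # Score each remaining protein by how many open peaks it matches.
        gains = {n: sum(explains(n, m) for m in unexplained) for n in remaining}
        best = max(remaining, key=lambda n: gains[n])
        if gains[best] == 0:
            break
        chosen.append(best)
        unexplained = {m for m in unexplained if not explains(best, m)}
    return chosen


# Hypothetical toy data
peaks = [500.0, 600.0, 700.0, 800.0]
db = {"A": [500.0, 600.0, 700.0], "B": [800.0, 900.0], "C": [500.0]}
print(greedy_mixture(peaks, db))
```

The sketch makes the listed external factors concrete: a looser `tol` (worse mass accuracy) or more noise peaks lets unrelated proteins pick up spurious matches, and low sequence coverage shrinks the gain of true proteins.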
10. Identification
付岩
ICT
Unrestrictive modification detection based on related spectral
pairs
2009. Efficient discovery of abundant post-translational modifications and spectral pairs using peptide mass and
retention time differences. BMC Bioinformatics 10 (Suppl 1):S50.
• The majority of mass spectra cannot be interpreted at present
– Unexpected or unknown protein PTM
• Detect abundant PTM in high-accuracy peptide mass spectra
– Efficient and sequence database-independent approach
– Based on the observation that the spectra of a modified peptide and its unmodified
counterpart are correlated with each other in their peptide masses and retention time
– Frequently occurring peptide mass differences imply possible modifications
– Small and consistent retention time differences provide orthogonal supporting
evidence
– Use a bivariate Gaussian mixture model to discriminate modification-related spectral
pairs from random ones
• Results
– Experiments on two glycoprotein data sets demonstrate that the method can
effectively detect abundant modifications and spectral pairs.
– By including the discovered modifications into database search, an average of 10%
more spectra are interpreted
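The core observation (frequent mass differences between peptides with similar retention times) can be sketched as a simple tally. This is only the counting step under assumed thresholds; the bivariate Gaussian mixture model from the talk is not reproduced here, and `frequent_mass_shifts`, `max_rt_diff` and `decimals` are hypothetical names.

```python
from collections import Counter
from itertools import combinations


def frequent_mass_shifts(peptides, max_rt_diff=5.0, decimals=2):
    """Count rounded mass differences between peptide pairs that elute
    close together.

    peptides: list of (mass_in_Da, retention_time) tuples
    Returns (mass_difference, count) pairs, most frequent first.
    """
    counts = Counter()
    for (m1, rt1), (m2, rt2) in combinations(peptides, 2):
        if abs(rt1 - rt2) <= max_rt_diff:
            counts[round(abs(m1 - m2), decimals)] += 1
    return counts.most_common()


# Hypothetical toy data: two pairs differ by about 15.99 Da
# (oxidation adds 15.995 Da) and elute close together.
data = [(1000.00, 20.0), (1015.99, 21.0),
        (1200.50, 40.0), (1216.49, 41.5),
        (1500.00, 80.0)]
print(frequent_mass_shifts(data))
```

The all-pairs loop is quadratic in the number of peptides; a real data set would need sorting or binning by mass, which is omitted to keep the sketch short.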
11. Identification
叶明亮
DICP
Development of Methods and Platform for Data Processing in
Mass Spectrometry Based Proteome Research
PMID: 17761002/19551949/18314942/20568719/19522514/20334362
• Un-modified peptide identification
– Implemented a predictive genetic algorithm for optimization of filtering criteria to
maximize the number of identified peptides at fixed FDR for SEQUEST
– Introduced an approach for calculating posterior probability of individual peptide
identification from the “local FDR” by using k nearest neighbors algorithm and Shannon
information entropy
• Phosphopeptide identification
– Developed an automatic validation approach for phosphopeptide identification by
combining consecutive stage MS data and the target-decoy database searching strategy
– Developed a classification filtering strategy to improve the phosphopeptide
identification and phosphorylation site localization
– Proposed a modified target-decoy database search strategy for confident
phosphorylation site analysis of individual phosphoproteins without manual
interpretation of spectra
– Developed the software ArMone for processing and analysis of phosphoproteome data
12. Label free semi-quantitation
邓宁
ZJU
Quantitative Analysis of Mitochondrial Proteomes using
Normalized Spectral Abundance Factor
Samples:
5 human cardiac mitochondrial samples
8 murine cardiac mitochondrial samples
7 murine liver mitochondrial samples
LC-MS/MS
Database search by
SEQUEST and statistically
validated by Scaffold
In-house software to generate NSAF
value for quantitative analysis
Results:
Electron transport chain proteins show the
highest abundances, especially
in heart
Metabolism-related proteins
and urea cycle proteins are
more abundant in the liver
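For reference, the normalized spectral abundance factor is commonly defined as a protein's spectral count divided by its length, normalized over all proteins in the sample:

NSAF_i = (SpC_i / L_i) / sum_j (SpC_j / L_j)

A minimal sketch of that formula follows. It is not the in-house software mentioned above; the function name and the toy numbers are hypothetical.

```python
def nsaf(spectral_counts, lengths):
    """Compute NSAF values for one sample.

    spectral_counts: dict protein -> spectral count (SpC)
    lengths:         dict protein -> protein length in residues (L)
    """
    # Spectral abundance factor: counts corrected for protein length.
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    # Normalize so the values of one sample sum to 1.
    return {p: value / total for p, value in saf.items()}


# Hypothetical toy data: equal counts, but protein B is three times longer.
print(nsaf({"A": 10, "B": 10}, {"A": 100, "B": 300}))
```

Because the values in each sample sum to 1, NSAF can be compared across runs with different total spectral counts, which is what makes it usable for the heart versus liver comparison.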
13. Database Construction
邵晨
PUMC
The urinary protein biomarker database
• Data collection
– Manual search in Pubmed
– Review by Students
• Database construction
• Basic analysis
– Compare different disease types
– Simple descriptive statistical analysis
– Construct a disease-biomarker network and show some basic
topological properties
14. Data Quality Control
朱云平
BPRC
A nonparametric model for quality control of database search
results in shotgun proteomics
2008. A nonparametric model for quality control of database search results in shotgun proteomics. BMC
Bioinformatics 9:29.
• Randomized databases are used for
quality control
• How to combine different database
search scores to improve the sensitivity
of randomized database methods has been ignored
• A multivariate nonlinear discriminant
function (DF) based on the multivariate
nonparametric density estimation
technique was proposed to filter out
false-positive database search results
with a predictable FDR
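The randomized (decoy) database idea can be sketched for a single score as follows. This shows only the basic FDR estimate (decoy hits divided by target hits above a cutoff), which is one common convention; the multivariate nonparametric discriminant function from the talk is not reproduced. The function name and toy scores are hypothetical.

```python
def score_threshold_at_fdr(psms, max_fdr=0.01):
    """Find the lowest score cutoff whose estimated FDR is <= max_fdr.

    psms: list of (score, is_decoy) tuples; higher score means better match.
    FDR is estimated as (#decoys) / (#targets) among PSMs at or above the cutoff.
    Returns None if no cutoff satisfies the limit.
    """
    targets = decoys = 0
    best = None
    for score, is_decoy in sorted(psms, key=lambda x: -x[0]):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            best = score
    return best


# Hypothetical toy data
hits = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
print(score_threshold_at_fdr(hits, max_fdr=0.25))
```

Combining several search scores into one discriminant value before applying such a cutoff is the step the talk adds in order to gain sensitivity at the same FDR.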
15. Data Processing Platform
关慎恒
UCSF
A data processing platform for mammalian proteome dynamics
studies using stable isotope metabolic labeling
2010. Analysis of proteome dynamics in the mouse brain. Proceedings of the National Academy of Sciences
107:14508 -14513.
• Data processing platform
– Integrate a variety of software modules into a workflow
– Specifically developed for 15N metabolic labelling
Cross-extraction of 15N-containing ion intensities from raw data files of varying biosynthetic
incorporation times
Computation of peptide 15N incorporation distributions
Aggregation of multiple peptide relative isotope abundance curves into a protein curve
– Processing parameter optimization and noise reduction procedures are performed in some
necessary processing modules to reduce the propagation errors in a long chain of the
processing steps
16. Data Processing Platform
盛泉虎
SIBS
BuildSummary: A software tool for assembling protein
• Maximize the number of confident proteins above a threshold of FDR
– By integrating results from different peptide search engines for the same dataset
• BuildSummary
– Allows users to combine many independent PSM (peptide-spectrum match) scoring
algorithms, including de novo sequencing and spectrum library search algorithms, provided the same
peptide FDR is applied to each of them using the target-decoy search approach
17. Glycoproteomics
Mass spectrometry database for glycoprotein structures
2009. Identification of N-Glycosylation Sites on Secreted Proteins of Human Hepatocellular Carcinoma Cells with a
Complementary Proteomics Approach. Journal of Proteome Research 8:662-672.
杨芃原
Fudan Univ
• Enrichment
– Hydrophilic affinity enrichment
– PNGase-F release of N-glycan
• Results
– Identified 4000 spectra of intact N-glycopeptides at FDR of 1% in three
2DLC runs for serum sample
– 1500 different glycopeptides, corresponding to 250 glycosylation sites,
were discovered
– Two separate high-confidence databases for serum sample were
constructed:
Naked glycopeptides (de-glycopeptides) database (523 peptides)
N-glycan database (599 glycans)
– The software GRIP was developed for interpretation of spectra from intact
glycopeptides
18. Glycoproteomics
应万涛
BPRC
Establishment of a systematic method coupling consecutive MSn
and software tools for characterizing core-fucosylated glycoproteins
2009. A Strategy for Precise and Large Scale Identification of Core Fucosylated Glycoproteins. Molecular & Cellular
Proteomics 8:913 -923.
• Strategy development
– Novel enrichment step
Combining the use of lectin for CF glycoprotein
enrichment with ultrafiltration for further
enrichment of glycopeptide
– Established a neutral loss-dependent MS3
scan method that specifically captures
partially deglycosylated CF glycopeptides
– Established a novel database-independent
candidate spectrum-filtering method for
selecting partially deglycosylated CF
glycopeptides and a spectrum optimization
method
19. Glycoproteomics
张凯中
UWO
Glycan Structure Sequencing with Tandem Mass Spectrometry
2008. Complexities and algorithms for glycan sequencing using tandem mass spectrometry. J Bioinform Comput
Biol 6:77-91. 2009.
• Glycan de novo sequencing
– Glycan database is rather incomplete
– Determination of novel glycan structures requires de novo
sequencing
• Heuristic algorithm
– First generates many acceptable small subtrees, which are
then joined together in a repetitive process to obtain larger
and larger suboptimal subtrees until reaching the desired
mass
– At each size of the subtree, only limited number of subtrees
are kept for later use
– Experiments on real MS/MS data showed that the heuristic
algorithm can determine glycan structures
• Contribution
– A polynomial time algorithm is provided under a simple
model of glycan de novo sequencing
20. Proteogenomics
谢鹭
SIBS
The discovery of novel protein-coding features in mouse genome
based on mass spectrometry data
• Detect un-annotated protein-coding regions in mouse genome
– Two searchable proteomic databases were constructed
All possible encoded exon junctions (EJCT dataset) for the discovery of
novel exon splice events
Putative encoded exons (ORF database) for finding uninterrupted novel
protein coding regions
– Each of the two datasets was combined with a public full-length protein
dataset (competitive dataset) and searched against
496 high-accuracy tandem MS RAW files from diverse mouse
samples
– 32 unique peptides (matching 149 spectra) from EJCT dataset
were discovered which straddle novel exon junctions
– 104 unique peptides (matching 450 spectra) from ORF dataset
were located in 99 unique protein-coding regions
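One common way to build a searchable ORF database is a six-frame, stop-to-stop translation of the genome. The sketch below assumes that approach and the standard genetic code; the talk does not state how the ORF database was actually constructed. `six_frame_orfs`, `min_len` and the toy sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate DNA in frame 0; unknown codons (e.g. with N) become X."""
    return "".join(CODON.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=7):
    """Return stop-to-stop protein stretches from all six reading frames."""
    dna = dna.upper()
    revcomp = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    orfs = set()
    for strand in (dna, revcomp):
        for frame in range(3):
            for piece in translate(strand[frame:]).split("*"):
                if len(piece) >= min_len:
                    orfs.add(piece)
    return orfs


# Hypothetical toy sequence
print(translate("ATGGCCAAATAA"))
print(sorted(six_frame_orfs("ATGGCCAAATAA", min_len=3)))
```

Searching such a database together with the annotated protein set, as described above, lets known proteins compete for each spectrum, so only spectra that are better explained by an unannotated region are reported as novel.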
21. Proteogenomics
赵屹
ICT
Proteogenomics analysis of Thermoanaerobacter
tengcongensis (Tengchong thermophilic bacterium) at different temperatures
• Genome
– Estimated to encode 2588 theoretical proteins
• Annotating Genome
– By combining proteomics and transcriptomics
Transcriptomic data cover above 70% of 2588 genes
Above 74% of spectra were consistent with transcriptomic data
– Quantitative analysis of gene expression levels at 4 different
temperatures
359 genes were commonly expressed
Uniquely expressed genes were also detected at distinct temperatures
– 80 genes do not belong to the 2588-gene set
2 coding regions were supported by MS
21 coding regions may encode novel non-coding RNA
– The discovery was used to re-annotate 2588 gene set
22. Biological Problem oriented
汪迎春
IGDB
Deciphering the Signaling Network in the Leading Edge of the
Migrating Cells
2007. Profiling signaling polarity in chemotactic cells. Proceedings of the National Academy of Sciences 104:8328
-8333.
Characterization of the Ras/ERK Signaling Pathway in the
PD by Combined Proteome and Phosphoproteome Profiling
23. Biological Problem oriented
王通
JNU
Pathway analysis-assisted study strategy in functional
proteomics
2008. HIV-1 infected astrocytes and the microglial proteome. Journal of neuroimmune pharmacology 3:173-186.
• Biological Questions
– HIV associated neurodegenerative disorders (HAND)
– HIV associated malignancy (HAM)
– Infection and cancer
24. Biological Problem oriented
徐平
BPRC
Data analysis in large scale quantitative proteomics study with
SILAC approach
2009. Quantitative Proteomics Reveals the Function of Unconventional Ubiquitin Chains in Proteasomal
Degradation. Cell 137:133-145.
• Background
– K48-linked chains are mediators of
proteasomal degradation
– K6, K11, K27, K29 and K33 linkages are not
well understood
• Results
– Identified K11 linkage-specific
substrates, including Ubc6, which is
involved in the ERAD pathway (ER
stress response)
25. Protein Structure
张法
ICT
Computational methods in cryo-electron microscopy: image data
processing and 3D structure reconstruction
2009. A framework to refine particle clusters produced by EMAN. Bioinformatics 12:i276-i280.
• EMAN
– One of the most popular software packages for
single particle reconstruction
• Particle reclustering framework (PRF)
– Normalization
– Threshold determination
– Reclustering
26. Data Analysis
卜东坡
ICT
Designing Succinct Structural Alphabets
2008. Designing succinct structural alphabets. Bioinformatics 24:i182 -i189.
• Fragment libraries
– A small number of structural fragments can model protein structures accurately
– The library size and accuracy are dominating factors for modeling and predicting the protein
structures accurately
– A major bottleneck for the fragment-based protein structure prediction methods is designing
succinct and highly accurate structural alphabet
• Contributions
– Introducing structural information items, such as secondary structure, solvent accessibility
and contact capacity, can improve the prediction of structural fragments
– Derive the best combination of both sequence and structural information items, and
significantly reduce the structural alphabet size, at the same level of accuracy by using integer
linear programming
– Significantly improve the protein structure prediction, with all other conditions unchanged
• Scoring function for mapping a sequence segment to a structural fragment
– Consists of mutation score, secondary structure score, contact capacity score, and
environment fitness score.
– Using more scoring items to improve the performance is promising
27. Others
江瑞
Tsinghua
DomainRBF: a Bayesian regression approach to the prioritization
of associations between protein domains and human complex
diseases
2010. Prioritisation of associations between protein domains and complex diseases using domain-domain
interaction networks. Systems Biology, IET 4:212-222.
• DomainRBF (domain Rank with Bayes Factor)
– To prioritize association between candidate domains and human disease
– Ranking score based on ‘guilt-by-association’ principle, which relies on the
assumption that a disease is likely to be caused by a set of genes that have
similar properties
• Data sources
– Domain-disease associations
– Domain-domain interaction networks
• Validation
– Large-scale cross validation experiments on simulated linkage intervals,
random controls and the whole genome
– Results show that areas under ROC curves can be as high as 77.9%
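For readers unfamiliar with the metric, the area under the ROC curve equals the probability that a randomly chosen true association is ranked above a randomly chosen control, with ties counted as half. A minimal sketch of that pairwise definition follows; the function name and scores are hypothetical and unrelated to the DomainRBF data.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute 0.5. Runs in O(len(pos) * len(neg)) time.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical toy scores
print(roc_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for the 77.9% figure above.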
28. Others
张红雨
HZAU
Proteins as molecular fossils
2010. A Universal Molecular Clock of Protein Folds and its Power in Tracing the Early History of Aerobic
Metabolism and Planet Oxygenation. Molecular Biology and Evolution.
• Proteins can also serve as molecular fossils
• Building phylogenies and timelines of
domains at fold and fold superfamily levels
of structural complexity
– Using a phylogenomic structural census in hundreds
of proteomes
– Correlate approximately linearly with geological
timescales
– Dissected the structures and functions of enzymes
in simulated metabolic networks
– The placement of anaerobic and aerobic enzymes in
the timeline revealed that aerobic metabolism
emerged ~2.9 billion years ago
29. Others
张勇
BGI Shenzhen
From NGS Genomics to MS-based Proteomics – BGI’s
bioinformatics activities
• Advertising from BGI Shenzhen
– Introduce BGI’s developmental progress
32. Phosphorylation
Kevan Shokat
UCSF
Kinase-specific phosphorylation analysis
2004. Design and use of analog-sensitive protein kinases. Curr Protoc Mol Biol Chapter 18:Unit 18.11.
The amino acid that must be changed to
construct analog-sensitive (-as) kinase alleles can be most
easily identified using a freely available online
resource at http://kinase.ucsf.edu/ksd/.