Use of Spark for
Proteomic Scoring
Steven M. Lewis PhD
Institute for Systems Biology
EMBL Uninett
http://tinyurl.com/qgtzhkw
Abstract
Tandem mass spectrometry has proven to be a powerful tool for proteomic
analysis. A critical step is scoring a measured spectrum against an existing
database of peptides and potential modifications. The details of proteomic
search are discussed. Such analyses strain the resources of existing machines
and are limited in the number of modifications that can be considered. Apache
Spark is a powerful tool for parallelizing applications. We have developed a
version of Comet, a high-precision scoring algorithm, and implemented it on a
Spark cluster. The cluster outperforms single machines by a factor of more
than ten, allowing searches that take 8 hours to be performed in under 30
minutes. Equally important, search speed scales with the number of cores,
allowing further speedups or increases in the number of modifications by
adding more computing power.
The considerations required to run large jobs in parallel will be discussed.
This is a war story
It describes a large problem
The approaches to parallelize it
The problems encountered
The tools developed to solve them
How did I get into this?
A few years ago I developed a Hadoop
application to do protein search
It was a good, reasonably big problem
We published a paper
I got a note from Gurvinder Singh at Uninett, a
Norwegian cloud provider, asking if I was
interested in implementing what I did in
Spark
Consider a Protein
MTRRSRVGAGLAAIVLALAAVSAAAPIAGAQ
SAGSGAVSVTIGDVDVSPANPTTGTQVLITPS
INNSGSASGSARVNEVTLRGDGLLATEDSLG
RLGAGDSIEVPLSSTFTEPGDHQLSVHVRGL
NPDGSVFYVQRSVYVTVDDRTSDVGVSART
TATNGSTDIQATITQYGTIPIKSGELQVVSDGR
IVERAPVANVSESDSANVTFDGASIPSGELVI
RGEYTLDDEHSTHTTNTTLTYQPQRSADVAL
TGVEASGGGTTYTISGDAANLGSADAASVRV
NAVGDGLSANGGYFVGKIETSEFATFDMTVQ
ADSAVDEIPITVNYSADGQRYSDVVTVDVSGA
SSGSATSPERAPGQQQKRAPSPSNGASGGG
LPLFKIGGAVAVIAIVVVVVRRWRNP
It is a string of amino acids (20 kinds), each designated by one letter
Digestion
●Trypsin breaks proteins after arginine (R) or
lysine (K) except when followed by proline (P)
MTRSVGAGLAAIVLALAAVSAARPIARGAQ
SAGSGAVSVKTIGDVDVSPANPTTGTQVL
Cleaves to:
MTR
SVGAGLAAIVLALAAVSAARPIAR
GAQSAGSGAVSVK
TIGDVDVSPANPTTGTQVL
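The cleavage rule above can be sketched in a few lines of Java. This is a minimal illustration only: the class name TrypticDigest and its method are hypothetical and are not taken from the talk's code base.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative in-silico tryptic digestion (hypothetical helper, not the talk's code). */
final class TrypticDigest {

    /** Cleaves after R or K unless the next residue is P. */
    static List<String> digest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length(); i++) {
            char c = protein.charAt(i);
            boolean cleavable = (c == 'R' || c == 'K');
            boolean last = (i == protein.length() - 1);
            boolean blockedByProline = !last && protein.charAt(i + 1) == 'P';
            // emit a peptide at every allowed cleavage site and at the end of the protein
            if (last || (cleavable && !blockedByProline)) {
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        return peptides;
    }

    public static void main(String[] args) {
        String protein = "MTRSVGAGLAAIVLALAAVSAARPIARGAQSAGSGAVSVKTIGDVDVSPANPTTGTQVL";
        for (String peptide : digest(protein)) {
            System.out.println(peptide);
        }
    }
}
```

Run on the example sequence from this slide, it prints the four peptides listed above; note that the R followed by P is not cut.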
Tandem Mass Spec Proteomics
Proteins are digested into peptides (fragments),
run through a column to separate them, and
analyzed in a mass spectrometer to yield a
spectrum. A database of known proteins is
searched for the best match
Basics of Tandem Mass Spectrometry
http://en.wikipedia.org/wiki/Tandem_mass_spectrometry
Measured Spectrum
From Kinter and Sherman
Proteomic Search
So you went into the lab
Prepared a sample
Ran it through a Tandem Mass Spec
Collected Thousands of spectra
Now we need to search a database of proteins to find matches
Protein Database
● Search Starts with a list of proteins
○ Read From Uniprot
○ Parsed from a known genome
○ Supplied by a researcher
● Protein databases for humans are around 20 million
amino acids
● For search you add the same number of decoy (false)
proteins
● Multi-organism databases may run 500 MB
Moral - databases are fairly big
Protein Database Fasta File
>sp|Q58D72|ATLA1_BOVIN Atlastin-1 OS=Bos taurus GN=ATL1 PE=2 SV=2
MAKNRRDRNSWGGFSEKTYEWSSEEEEPVKKAGPVQVLVVKDDHSFELDETALNRILLSEAVRDKEVVAVSVAGA
FRKGKSFLMDFMLRYMYNQESVDWVGDHNEPLTGFSWRGGSERETTGIQIWSEIFLINKPDGKKVAVLLMDTQGT
FDSQSTLRDSATVFALSTMISSIQVYNLSQNVQEDDLQHLQLFTEYGRLAMEETFLKPFQSLIFLVRDWSFPYEFSY
GSDGGS
>sp|Q58D72_REVERSED|ATLA1_BOVIN Atlastin-1 OS=Bos taurus GN=ATL1 PE=2 SV=2-REVERSED
MKKKESQETSESKPAPFAQHYLHRHTAAASYLKYLAENTSGQDWLAAAVQDIVAGLERYEGSYRIYAWTCLTILTL
MIMNCLSAIIDLGIFGTVGAIVYTIFIVVFLTAPTRAAHFINKSDNHKIYQIYLEDIETELQQLYRRSFEEGGMKKVGRFL
KVSEEKLELHKTQLDNPALFPKDGGCIEEMKKNYTDKATAVAALNNAEATAQLMSKPHPLEEGQYIKIYAKFYEVLG
RCTIKN
>tr|Q58D73|Q58D73_BOVIN Chromosome 20 open reading frame 29 OS=Bos taurus GN=C20orf29 PE=2 SV=1
MVHAFLIHTLRAAKAEEGLCRVLYSCFFGAENSPNDSQPHSAERDRLLRKEQILAVARQVESMYQLQQQACGRHA
VDLQPQSSDDPVALHEAPCGAFRLAPGDPFQEPRTVVWLGVLSIGFALVLDTHENLLLVESTLRLLARLLLDHLRLL
VPGGANLLLRADCIEGILTRFLPHGQLLFLNDQFVQGLEKEFSAAWSH
>tr|Q58D73_REVERSED|Q58D73_BOVIN Chromosome 20 open reading frame 29 OS=Bos taurus GN=C20orf29 PE=2 SV=1-REVERSED
… And so on for the next 20-500 MB
Protein Database
● Start with a database of proteins
● The proteins are digested in silico to produce peptides
● Modifications may be added to produce a list of
peptides to search
● Every potential modification roughly doubles the search
space
IAM[15.995]S[79.966]GS[79.966]S[79.966]S
AIYVR
RGNTVLKDLK
IEFLNEAS[79.966]VMK
1360.63272
TVRAKQPSEK
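To see why each potential modification site roughly doubles the search space, here is a small sketch that enumerates every modified and unmodified combination of a peptide. The class ModificationVariants is hypothetical; the mass shifts 79.966 (on S) and 15.995 (on M) are the ones shown in the peptide list above, and the bracket notation copies that list.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative enumeration of optional modifications (hypothetical helper). */
final class ModificationVariants {

    /** Returns every combination of modified and unmodified sites for the peptide. */
    static List<String> variants(String peptide, Map<Character, String> optionalMods) {
        List<String> results = new ArrayList<>();
        results.add("");
        for (char residue : peptide.toCharArray()) {
            String mod = optionalMods.get(residue);
            List<String> next = new ArrayList<>();
            for (String prefix : results) {
                next.add(prefix + residue); // site left unmodified
                if (mod != null) {
                    next.add(prefix + residue + "[" + mod + "]"); // site modified
                }
            }
            results = next; // doubles whenever the residue can be modified
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Character, String> mods = new LinkedHashMap<>();
        mods.put('S', "79.966");
        mods.put('M', "15.995");
        // one S and one M: 2 * 2 = 4 variants
        for (String v : variants("IEFLNEASVMK", mods)) {
            System.out.println(v);
        }
    }
}
```

A peptide with k modifiable sites yields 2^k variants, which is why real searches limit the number of modifications considered.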
InSilico Digestion
MTRRSRVGAGLAAIVLALAAVSAAAPIAGAQ
SAGSGAVSVTIGDVDVSPANPTTGTQVLITPS
INNSGSASGSARVNEVTLRGDGLLATEDSLG
RLGAGDSIEVPLSSTFTEPGDHQLSVHVRGL
NPDGSVFYVQRSVYVTVDDRTSDVGVSART
TATNGSTDIQATITQYGTIPIKSGELQVVSDGR
IVERAPVANVSESDSANVTFDGASIPSGELVI
RGEYTLDDEHSTHTTNTTLTYQPQRSADVAL
TGVEASGGGTTYTISGDAANLGSADAASVRV
NAVGDGLSANGGYFVGKIETSEFATFDMTVQ
ADSAVDEIPITVNYSADGQRYSDVVTVDVSGA
SSGSATSPERAPGQQQKRAPSPSNGASGGG
LPLFKIGGAVAVIAIVVVVVRRWRNP
Consider a Protein
Digestion
●Trypsin breaks proteins after arginine (R) or
lysine (K) except when followed by proline (P)
MTRSVGAGLAAIVLALAAVSAARPIARGAQ
SAGSGAVSVKTIGDVDVSPANPTTGTQVL
Cleaves to:
MTR
SVGAGLAAIVLALAAVSAARPIAR
GAQSAGSGAVSVK
Well … Almost
●Sometimes cleavages are missed
●Sometimes breaks occur in other places
●Some amino acids are modified chemically
●Samples may be labeled with isotopes to
distinguish before and after proteins
All these changes can push the number of scored peptides
from hundreds of thousands to tens of millions or more
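As one illustration of how the peptide count grows, the sketch below adds missed cleavages by joining adjacent fully cleaved fragments. The class MissedCleavages and the parameter maxMissed are hypothetical names; other effects listed above (non-specific breaks, chemical modifications, isotope labels) are not modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative handling of missed cleavages (hypothetical helper). */
final class MissedCleavages {

    /** Joins up to maxMissed + 1 adjacent fragments into candidate peptides. */
    static List<String> withMissedCleavages(List<String> fragments, int maxMissed) {
        List<String> peptides = new ArrayList<>();
        for (int start = 0; start < fragments.size(); start++) {
            StringBuilder joined = new StringBuilder();
            for (int k = 0; k <= maxMissed && start + k < fragments.size(); k++) {
                joined.append(fragments.get(start + k));
                peptides.add(joined.toString()); // k missed cleavages
            }
        }
        return peptides;
    }

    public static void main(String[] args) {
        List<String> fragments =
                Arrays.asList("MTR", "SVGAGLAAIVLALAAVSAARPIAR", "GAQSAGSGAVSVK");
        // 3 fully cleaved peptides plus 2 peptides with one missed cleavage
        for (String p : withMissedCleavages(fragments, 1)) {
            System.out.println(p);
        }
    }
}
```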
Finding Fragments
● http://db.systemsbiology.net:8080/proteomicsToolkit/FragIonServlet.html
LGAGDSIEVP
B ion Y ion
(Figure: the peptide LGAGDSIEVP broken at each successive bond; the N-terminal piece is the B ion and the C-terminal piece is the Y ion)
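A minimal sketch of how theoretical B and Y ion positions could be computed for a peptide such as LGAGDSIEVP. It assumes singly charged ions and standard monoisotopic residue masses; the class FragmentIons is hypothetical and the real scoring code handles charge states and modifications that are ignored here.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative singly charged b/y ion calculator (hypothetical helper). */
final class FragmentIons {
    static final double WATER = 18.01056;  // monoisotopic H2O
    static final double PROTON = 1.00728;  // mass of one proton
    static final Map<Character, Double> RESIDUE_MASS = new HashMap<>();

    static {
        // monoisotopic residue masses in Daltons, in the same order as the letters
        String letters = "GASPVTCLINDQKEMHFRYW";
        double[] masses = {
            57.02146, 71.03711, 87.03203, 97.05276, 99.06841,
            101.04768, 103.00919, 113.08406, 113.08406, 114.04293,
            115.02694, 128.05858, 128.09496, 129.04259, 131.04049,
            137.05891, 147.06841, 156.10111, 163.06333, 186.07931};
        for (int i = 0; i < masses.length; i++) {
            RESIDUE_MASS.put(letters.charAt(i), masses[i]);
        }
    }

    /** m/z of b1..b(n-1): N-terminal prefixes plus one proton. */
    static double[] bIons(String peptide) {
        double[] out = new double[Math.max(0, peptide.length() - 1)];
        double sum = PROTON;
        for (int i = 0; i < out.length; i++) {
            sum += RESIDUE_MASS.get(peptide.charAt(i));
            out[i] = sum;
        }
        return out;
    }

    /** m/z of y1..y(n-1): C-terminal suffixes plus water and one proton. */
    static double[] yIons(String peptide) {
        double[] out = new double[Math.max(0, peptide.length() - 1)];
        double sum = WATER + PROTON;
        for (int i = 0; i < out.length; i++) {
            sum += RESIDUE_MASS.get(peptide.charAt(peptide.length() - 1 - i));
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        double[] b = bIons("LGAGDSIEVP");
        double[] y = yIons("LGAGDSIEVP");
        for (int i = 0; i < b.length; i++) {
            System.out.printf("b%d=%.4f  y%d=%.4f%n", i + 1, b[i], i + 1, y[i]);
        }
    }
}
```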
Theoretical and Measured Spectra
B ION
Y ION
Cross Correlation
measured=215.36
….
measured=310.17
measured=312.76
measured=312.76 theory=312.18
measured=319.31
measured=344.22
…
measured=354.19 theory=356.17
measured=355.16 theory=356.17
measured=356.08 theory=356.17
measured=355.16
measured=356.08
…
measured=431.21
measured=442.03
measured=442.03 theory=440.24
measured=443.43
…
measured=942.79 theory=944.5
Score is a weighted sum of matching peaks
in the correlation
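The idea of a weighted sum of matching peaks can be sketched as follows. This is a deliberately simplified stand-in and not Comet's cross-correlation: the class PeakMatchScore, the tolerance parameter, and the use of measured intensity as the weight are all assumptions made for illustration, and the intensities in main are made up.

```java
/** Simplified matched-peak score (hypothetical; not the Comet algorithm). */
final class PeakMatchScore {

    /**
     * For each theoretical peak, adds the intensity of the strongest measured
     * peak lying within the tolerance. Unmatched theoretical peaks add nothing.
     */
    static double score(double[] measuredMz, double[] intensity,
                        double[] theoryMz, double tolerance) {
        double total = 0.0;
        for (double theory : theoryMz) {
            double best = 0.0;
            for (int i = 0; i < measuredMz.length; i++) {
                if (Math.abs(measuredMz[i] - theory) <= tolerance) {
                    best = Math.max(best, intensity[i]);
                }
            }
            total += best;
        }
        return total;
    }

    public static void main(String[] args) {
        double[] measured = {310.17, 312.76, 356.08, 431.21};
        double[] intensity = {1.0, 2.0, 3.0, 4.0}; // made-up weights
        double[] theory = {312.18, 356.17, 440.24};
        System.out.println(score(measured, intensity, theory, 1.0));
    }
}
```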
●Scoring is done against all peptides with a similar MZ to
the measured spectrum
●The output is the best scoring peptide and a few of the
"runners-up"
NOTE
In a typical experiment only 15-25% of spectra
will be identified with a peptide in the database
These are used to identify proteins
Why is this a Big Data Problem?
The human body has about 20K proteins
Usually for quality control there is a ‘Decoy’ for every protein
There are optional modifications, each of which increases the peptides by a factor
of 2
A smaller sample will have about 50 M peptides, or 900 M with a
larger database and more modifications
A large run is about 100 K spectra
The search space is proportional to peptides * spectra
Demonstration
spark-submit \
  --class com.lordjoe.distributed.hydra.comet_spark.SparkCometScanScorer \
  ~/SteveSpark.jar \
  ~/SparkClusterEupaG.properties \
  input_searchGUI.xml
http://hwlogin.labs.uninett.no:4040/ Viewer
Political Concerns
To sell the answer to biologists we must copy a
well known algorithm.
This means translating the code to Java from
C++ and accepting the algorithm’s data
structures and memory requirements
Binning
50,000 spectra * 2,000,000,000 peptides is a VERY large number
Fortunately, not all pairs have to be scored
Spectra are measured with a precursor mass and
peptides have a mass, so only peptides and spectra in the same mass
range (bin) need be compared
On modern high-precision instruments the bin is about 0.03 Dalton
This reduces the number of pairs to score to about 2,000 million
On a small sample we score 128 million pairs at about 500
microseconds per scoring
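The numbers on this slide can be checked with a little arithmetic. The sketch below uses only the figures quoted above; the class SearchCost is a hypothetical name.

```java
/** Back-of-envelope cost of scoring, using the figures quoted on the slide. */
final class SearchCost {

    /** CPU hours needed to score the given number of pairs. */
    static double cpuHours(long pairs, double microsPerPair) {
        return pairs * microsPerPair / 1e6 / 3600.0;
    }

    public static void main(String[] args) {
        long allPairs = 50_000L * 2_000_000_000L; // every spectrum against every peptide
        System.out.println("All pairs without binning: " + allPairs);
        // small sample after binning: 128 million pairs at about 500 microseconds each
        System.out.printf("Small sample: %.1f CPU hours%n", cpuHours(128_000_000L, 500.0));
    }
}
```

At 500 microseconds per pair, 128 million pairs come to 64,000 CPU seconds, a little under 18 CPU hours, which is why spreading the scoring across many cores matters.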
Binning
Bins group all peptides and spectra within a specific MZ range
Spectra are put in several bins
Bins can be subdivided for scoring
Bins hold N spectra and K peptides
Currently there are tens of thousands of bins
Scoring fails in larger bins due to excess GC
time
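A minimal sketch of bin assignment, assuming a bin is simply the mass divided by the bin width and rounded down. The class MassBins is hypothetical, and placing each spectrum in its own bin plus the two neighbouring bins is one plausible reading of "several bins", not necessarily what the real code does.

```java
import java.util.Arrays;

/** Illustrative mass binning (hypothetical helper; neighbour rule is an assumption). */
final class MassBins {

    /** Index of the bin that contains the given mass. */
    static int binOf(double mass, double binWidth) {
        return (int) Math.floor(mass / binWidth);
    }

    /** A spectrum goes into its own bin and both neighbours to cover edge effects. */
    static int[] spectrumBins(double precursorMass, double binWidth) {
        int bin = binOf(precursorMass, binWidth);
        return new int[] {bin - 1, bin, bin + 1};
    }

    public static void main(String[] args) {
        double binWidth = 0.03; // Daltons, as quoted for high precision instruments
        System.out.println("peptide bin: " + binOf(1360.63272, binWidth));
        System.out.println("spectrum bins: "
                + Arrays.toString(spectrumBins(1360.63272, binWidth)));
    }
}
```

Peptides and spectra that share a bin index can then be brought together by key, which is the role CoGroup plays in the Spark job.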
Spark Operations
Hadoop Input, CoGroup, FlatMap, PairFlatMap, Sort
Debugging and Performance
This involves taking an unfamiliar problem
running on an unfamiliar platform
Questions
Which operations are taking most time?
How many times is each function called?
Are functions balanced across machines on the
cluster?
When a small number of cases fail, how can you
instrument them?
Did it work the first time?
Hell No
After it stopped crashing and did well on a
trivial problem, a base sample took 30 hours to
run on the cluster, way longer than on a single
machine!
Issues:
- data not like familiar test data
- Hadoop input format bug
Spark Accumulators
Accumulators are like counters but much more
powerful.
Accumulators can track any object supporting
add and zero methods
Sample Code to Accumulate a Set of Strings
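import java.io.Serializable;
import java.util.HashSet;
import java.util.Set;
import org.apache.spark.AccumulatorParam;

// Accumulates a set of strings by taking the union of everything added
public class SetAccumulableParam implements
        AccumulatorParam<Set<String>>, Serializable {
    // merge newly added values into a copy of the running set
    public Set<String> addAccumulator(final Set<String> r, final Set<String> t) {
        HashSet<String> ret = new HashSet<String>(r);
        ret.addAll(t);
        return ret;
    }
    // merge two partial results from different tasks
    public Set<String> addInPlace(final Set<String> r1, final Set<String> r2) {
        return addAccumulator(r1, r2);
    }
    // set union is idempotent, so the initial value can serve as the zero
    public Set<String> zero(final Set<String> initialValue) { return initialValue; }
}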
Sample Accumulator Use
JavaSparkContext ctx = new JavaSparkContext(sparkConf);
// make an accumulator
final Accumulator<Set<String>> wordsUsed = ctx.accumulator(new HashSet<String>(),
        new SetAccumulableParam());
JavaRDD<String> lines = ctx.textFile(args[0]); // read lines
JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
    public Iterable<String> call(String s) {
        List<String> stringList = Arrays.asList(s.split(" "));
        wordsUsed.add(new HashSet<String>(stringList)); // accumulate words
        return stringList;
    }
});
… Finish word count
Function Accumulators
Functions extend AbstractFunctionBase
all reporting code in base class
Functions implement doCall not call
Calls are wrapped for timing and statistics
Data Gathered
total calls
total time
times executed on each MAC address
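The wrapping pattern described above might look roughly like this in plain Java. TimedFunction is a hypothetical stand-in for the talk's base class, not its actual code: it keeps local counters only, whereas on a cluster the totals would have to travel back to the driver through accumulators, and the per-machine breakdown is omitted.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical base class: subclasses implement doCall, callers use call. */
abstract class TimedFunction<T, R> {
    private final AtomicLong totalCalls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** The real work goes here. */
    protected abstract R doCall(T input) throws Exception;

    /** Wraps doCall so every invocation is counted and timed. */
    public final R call(T input) throws Exception {
        long start = System.nanoTime();
        try {
            return doCall(input);
        } finally {
            totalCalls.incrementAndGet();
            totalNanos.addAndGet(System.nanoTime() - start);
        }
    }

    public long getTotalCalls() { return totalCalls.get(); }

    public long getTotalNanos() { return totalNanos.get(); }
}

final class TimedFunctionDemo {
    public static void main(String[] args) throws Exception {
        TimedFunction<String, Integer> length = new TimedFunction<String, Integer>() {
            @Override
            protected Integer doCall(String peptide) {
                return peptide.length();
            }
        };
        length.call("MTR");
        length.call("AIYVR");
        System.out.println("calls=" + length.getTotalCalls()
                + " nanos=" + length.getTotalNanos());
    }
}
```

Keeping the reporting in the base class means individual functions contain no instrumentation code at all.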
Sample Instrumented Function
public static class ChooseBestScanScore extends
        AbstractLoggingFunction2<IScoredScan, IScoredScan, IScoredScan> {
    @Override
    public IScoredScan doCall(final IScoredScan v1, final IScoredScan v2)
            throws Exception {
        ISpectralMatch match1 = v1.getBestMatch();
        ISpectralMatch match2 = v2.getBestMatch();
        // keep whichever scan has the higher-scoring best match
        return (match1.getHyperScore() > match2.getHyperScore()) ? v1 : v2;
    }
}
CombineCometScoringResults totalCalls:69M totalTime:29.05 sec machines:15
variance 0.058
Running Job
Improving Performance
Fix bugs in Hadoop Format for large files
Find most time spent in scoring
Use a Parquet database to store digestion
Discover that repartition is cheaper than
expensive operations
Smart partitioning to balance work in partitions
Use more partitions for larger jobs
Smart Partitioning
a bin is a set of spectra and peptides that score
together
Bin sizes vary by orders of magnitude
Scoring puts pressure on memory
Bin sizes can be counted before scoring step
Partitioning puts larger bins in separate partitions
puts multiple smaller bins in the same partition
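One way to get this behaviour is a greedy assignment: take bins from largest to smallest and always place the next bin in the currently lightest partition. The sketch below is an assumption about how such balancing could work, not the talk's actual partitioner; BinPartitioner is a hypothetical name, and the "size" of a bin could be, for example, its spectra count times its peptide count.

```java
import java.util.Arrays;

/** Greedy largest-first balancing of bins across partitions (hypothetical sketch). */
final class BinPartitioner {

    /** Returns, for each bin, the index of the partition it is assigned to. */
    static int[] assign(long[] binSizes, int numPartitions) {
        Integer[] order = new Integer[binSizes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // visit bins from largest to smallest
        Arrays.sort(order, (a, b) -> Long.compare(binSizes[b], binSizes[a]));

        long[] load = new long[numPartitions];
        int[] partitionOf = new int[binSizes.length];
        for (int bin : order) {
            int lightest = 0;
            for (int p = 1; p < numPartitions; p++) {
                if (load[p] < load[lightest]) {
                    lightest = p;
                }
            }
            partitionOf[bin] = lightest;
            load[lightest] += binSizes[bin];
        }
        return partitionOf;
    }

    public static void main(String[] args) {
        long[] sizes = {1000, 10, 20, 30, 500}; // counted before the scoring step
        System.out.println(Arrays.toString(assign(sizes, 3)));
    }
}
```

With these sizes the two big bins each get a partition to themselves and the three small bins share the third.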
Performance
● A larger test took 4 hours on a single
machine
● On a small 15 node cluster it took
○ 69 minutes real time
○ Used 41 hours of cpu time
○ Scored 2100 million peptides
○ generated 605 million peptides
○ with 4 potential modifications
○ 95% of the time we find the same top
peptides as Comet
Summary
Proteomic Search is a large data problem involving scoring a
large number of spectra against an even larger number of
candidate peptides.
In the future the complexity will increase with more spectra and
more modifications adding more peptides
Spark is a parallel execution environment allowing search to be
performed on a cluster
Performance is superior to existing tools and can be improved by
increasing the size of the cluster
Code Part 1
// Read Spectra
RDD<IMeasuredSpectrum> spectraToScore = SparkScanScorer.getMeasuredSpectra(scoringApplication);
// Condition Spectra
RDD<CometScoredScan> cometSpectraToScore = spectraToScore.map(new
MapToCometSpectrum(comet));
// Assign bins to spectra
PairRDD<BinChargeKey, CometScoredScan> keyedSpectra =
handler.mapMeasuredSpectrumToKeys(cometSpectraToScore);
// read Proteins
RDD<IProtein> proteins = readProteins(jctx);
// Digest to peptides
RDD<IPolypeptide> digested = proteins.flatMap(new DigestProteinFunction(app));
// map to bins
PairRDD<BinChargeKey, IPolypeptide> keyedPeptides =
digested.flatMapToPair(new mapPolypeptidesToBin(application, usedBins));
Code Part 2
// Now collect the contents of spectra and peptide bins
PairRDD<BinChargeKey, Tuple2<Iterable<CometScoredScan>,
Iterable<HashMap<String, IPolypeptide>>>> binContents =
keyedSpectra.cogroup(keyedPeptides);
// do scoring
RDD< IScoredScan> scores =
binContents.flatMap(new ScoreSpectrumAndPeptideWithCogroup(application));
// combine spectrum scoring
RDD< IScoredScan> cometBestScores = handler.combineScanScores(scores);
// write results as a single file
consolidator.writeScores(cometBestScores);
Proteomic Search PseudoCode
RDD<Spectrum> spectra = readSpectra(); // mydata.mzXML
RDD<Proteins> proteins = readDatabase(); // uniprot_swiss.fasta
RDD<Peptides> peptides= digest(proteins );
THESE ARE UNUSED SLIDES
DON’T GO HERE
Pipeline (diagram)
Protein Database -> Digest -> Add Modifications -> MZ Bin (fragments in one bin)
Measured Spectra -> Normalize -> MZ Bin (spectra put in multiple bins)
Cross Product -> Score all pairs
Spark Operations
Hadoop Input, Filter (and write), FlatMap, PairFlatMap, Hadoop Input, Sort, Map
What is Spark
Spark is a Framework for parallel execution
Spark works well on Hadoop clusters (also
has a local mode for testing)
Spark is less formal than Map-Reduce and
multiple operations can run locally
Pipeline (diagram)
Protein Database -> Digest -> Add Modifications -> MZ Bin (fragments in one bin)
Measured Spectra -> Normalize -> MZ Bin (spectra put in multiple bins)
Cross Product -> Score all pairs -> Sort by Spectra -> Report Best Fits
All operations are on a 15 node Spark
Cluster and are performed in parallel with
lazy execution
Most time is spent in the Score all Pairs
Step
Multi Stage Mass Spec
From Kinter and Sherman
A Protein is a Collection of Amino Acids
●Each of the 20 amino acids is indicated by a letter
●Assume we have a sample with a number of
proteins.
●Assume that we can list the possible proteins in
the sample.
●Tandem Mass Spectrometry is similar to
shotgun genomics

Weitere ähnliche Inhalte

Was ist angesagt?

Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignAnubhav Jain
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAnubhav Jain
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)Anubhav Jain
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Dataaimsnist
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...aimsnist
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applicationsaimsnist
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsGaignard Alban
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Anubhav Jain
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningAnubhav Jain
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...Anubhav Jain
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Anubhav Jain
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Anubhav Jain
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceinside-BigData.com
 
Identification of toxicants and metabolites
Identification of toxicants and metabolitesIdentification of toxicants and metabolites
Identification of toxicants and metabolitesSteffen Neumann
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAnubhav Jain
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Anubhav Jain
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaDatabricks
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsAnubhav Jain
 

Was ist angesagt? (20)

Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
Automated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design ProblemsAutomated Machine Learning Applied to Diverse Materials Design Problems
Automated Machine Learning Applied to Diverse Materials Design Problems
 
Overview of DuraMat software tool development (poster version)
Overview of DuraMat software tool development(poster version)Overview of DuraMat software tool development(poster version)
Overview of DuraMat software tool development (poster version)
 
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum DataAutomated Generation of High-accuracy Interatomic Potentials Using Quantum Data
Automated Generation of High-accuracy Interatomic Potentials Using Quantum Data
 
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
Failing Fastest: What an Effective HTE and ML Workflow Enables for Functional...
 
Spark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van HamSpark Summit EU talk by Erwin Datema and Roeland van Ham
Spark Summit EU talk by Erwin Datema and Roeland van Ham
 
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and ApplicationsData Mining to Discovery for Inorganic Solids: Software Tools and Applications
Data Mining to Discovery for Inorganic Solids: Software Tools and Applications
 
Sharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reportsSharing massive data analysis: from provenance to linked experiment reports
Sharing massive data analysis: from provenance to linked experiment reports
 
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Software tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data miningSoftware tools for high-throughput materials data generation and data mining
Software tools for high-throughput materials data generation and data mining
 
The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...The Status of ML Algorithms for Structure-property Relationships Using Matb...
The Status of ML Algorithms for Structure-property Relationships Using Matb...
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...Software tools for data-driven research and their application to thermoelectr...
Software tools for data-driven research and their application to thermoelectr...
 
How HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental scienceHow HPC and large-scale data analytics are transforming experimental science
How HPC and large-scale data analytics are transforming experimental science
 
Identification of toxicants and metabolites
Identification of toxicants and metabolitesIdentification of toxicants and metabolites
Identification of toxicants and metabolites
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Atomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discoveryAtomate: a tool for rapid high-throughput computing and materials discovery
Atomate: a tool for rapid high-throughput computing and materials discovery
 
Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...Software tools for calculating materials properties in high-throughput (pymat...
Software tools for calculating materials properties in high-throughput (pymat...
 
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim PoterbaScaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
Scaling Genetic Data Analysis with Apache Spark with Jon Bloom and Tim Poterba
 
Open Source Tools for Materials Informatics
Open Source Tools for Materials InformaticsOpen Source Tools for Materials Informatics
Open Source Tools for Materials Informatics
 

Andere mochten auch

Proteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsProteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsJuan Antonio Vizcaino
 
Transpilers(Source-to-Source Compilers)
Transpilers(Source-to-Source Compilers)Transpilers(Source-to-Source Compilers)
Transpilers(Source-to-Source Compilers)Shivang Bajaniya
 
Proteomic identification of host and parasite biomarkers in saliva from patie...
Proteomic identification of host and parasite biomarkers in saliva from patie...Proteomic identification of host and parasite biomarkers in saliva from patie...
Proteomic identification of host and parasite biomarkers in saliva from patie...Christian Granda
 
Spark Solution for Rank Product
Spark Solution for Rank ProductSpark Solution for Rank Product
Spark Solution for Rank ProductMahmoud Parsian
 
Docker 基本概念與指令操作
Docker  基本概念與指令操作Docker  基本概念與指令操作
Docker 基本概念與指令操作NUTC, imac
 
Performance in Spark 2.0, PDX Spark Meetup 8/18/16
Performance in Spark 2.0, PDX Spark Meetup 8/18/16Performance in Spark 2.0, PDX Spark Meetup 8/18/16
Performance in Spark 2.0, PDX Spark Meetup 8/18/16pdx_spark
 
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)
Joker'16 Spark 2 (API changes; Structured Streaming; Encoders)Alexey Zinoviev
 
使用 CLI 管理 OpenStack 平台
使用 CLI 管理 OpenStack 平台使用 CLI 管理 OpenStack 平台
使用 CLI 管理 OpenStack 平台NUTC, imac
 
Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Java BigData Full Stack Development (version 2.0)
Java BigData Full Stack Development (version 2.0)Alexey Zinoviev
 
London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015London Spark Meetup Project Tungsten Oct 12 2015
London Spark Meetup Project Tungsten Oct 12 2015Chris Fregly
 
Project Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopProject Tungsten Phase II: Joining a Billion Rows per Second on a Laptop
Project Tungsten Phase II: Joining a Billion Rows per Second on a LaptopDatabricks
 
Proteomics course 1
Proteomics course 1Proteomics course 1
Proteomics course 1utpaltatu
 
Spark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit EU talk by Ted Malaska
Spark Summit EU talk by Ted MalaskaSpark Summit
 
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...
Paris Spark Meetup Oct 26, 2015 - Spark After Dark v1.5 - Best of Advanced Ap...Chris Fregly
 
Spark 巨量資料處理基礎教學
Spark 巨量資料處理基礎教學Spark 巨量資料處理基礎教學
Spark 巨量資料處理基礎教學NUTC, imac
 

Andere mochten auch (20)

Proteomics public data resources: enabling "big data" analysis in proteomics
Proteomics public data resources: enabling "big data" analysis in proteomicsProteomics public data resources: enabling "big data" analysis in proteomics
Use of spark for proteomic scoring seattle presentation

  • 1. Use of Spark for Proteomic Scoring Steven M. Lewis PhD Institute for Systems Biology EMBL Uninett http://tinyurl.com/qgtzhkw
  • 2. Abstract Tandem mass spectrometry has proven to be a powerful tool for proteomic analysis. A critical step is scoring a measured spectrum against an existing database of peptides and potential modifications. The details of proteomic search are discussed. Such analyses strain the resources of existing machines and are limited in the number of modifications that can be considered. Apache Spark is a powerful tool for parallelizing applications. We have developed a version of Comet - a high precision scoring algorithm - and implemented it on a Spark cluster. The cluster outperforms single machines by a factor of greater than ten, allowing searches that take 8 hours to be performed in under 30 minutes. Equally important, search speed scales with the number of cores, allowing further speed ups or increases in the number of modifications by adding more computing power. The considerations required to run large jobs in parallel will be discussed.
  • 3. This is a war story It describes a large problem The approaches to parallelize it The problems encountered The tools developed to solve them
  • 4. How did I get into this? A few years ago I developed a Hadoop application to do protein search. It was a good, reasonably big problem. We published a paper. I got a note from Gurvinder Singh at Uninett, a Norwegian cloud provider, asking if I was interested in implementing what I did in Spark
  • 6. Digestion ●Trypsin breaks proteins after arginine (R) or lysine (K) except when followed by proline (P) MTRSVGAGLAAIVLALAAVSAARPIARGAQ SAGSGAVSVKTIGDVDVSPANPTTGTQVL Cleaves to: MTR SVGAGLAAIVLALAAVSAARPIAR GAQSAGSGAVSVK TIGDVDVSPANPTTGTQVL
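The cleavage rule on this slide can be sketched in a few lines of Java. This is a minimal illustration, not the digester used in the talk: the class name `TrypsinDigestSketch` is hypothetical, and missed cleavages and modifications are ignored.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative in-silico tryptic digest (hypothetical helper, not the talk's code). */
final class TrypsinDigestSketch {

    /** Cleaves after R or K unless the next residue is P; no missed cleavages. */
    static List<String> digest(String protein) {
        List<String> peptides = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < protein.length() - 1; i++) {
            char c = protein.charAt(i);
            boolean cleavable = c == 'R' || c == 'K';
            boolean blockedByProline = protein.charAt(i + 1) == 'P';
            if (cleavable && !blockedByProline) {
                peptides.add(protein.substring(start, i + 1));
                start = i + 1;
            }
        }
        if (start < protein.length()) {
            peptides.add(protein.substring(start)); // C-terminal peptide
        }
        return peptides;
    }

    public static void main(String[] args) {
        // The sequence shown on the slide, joined into one string.
        String protein = "MTRSVGAGLAAIVLALAAVSAARPIARGAQSAGSGAVSVKTIGDVDVSPANPTTGTQVL";
        digest(protein).forEach(System.out::println);
    }
}
```

Run on the slide's sequence, the sketch yields the same four peptides listed above; the R in SAARP is not cleaved because it is followed by P.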
  • 7. Tandem Mass Spec Proteomics Proteins are digested into Peptides (fragments) Run through a column to separate them and analyzed in a Mass Spectrometer to yield a spectrum. A database of known proteins is searched for the best match
  • 8. Basics of Tandem Mass Spectrometry http://en.wikipedia.org/wiki/Tandem_mass_spectrometry
  • 10. Proteomic Search So you went into the lab Prepared a sample Ran it through a Tandem Mass Spec Collected thousands of spectra Now we need to search a database of proteins to find matches
  • 11. Protein Database ● Search Starts with a list of proteins ○ Read From Uniprot ○ Parsed from a known genome ○ Supplied by a researcher ● Protein Databases for Humans are around 20 million amino Acids ● For search you add the same number of decoy (false) proteins ● Multiorganism databases may run 500 MB Moral - databases are fairly big
  • 12. Protein Database Fasta File >sp|Q58D72|ATLA1_BOVIN Atlastin-1 OS=Bos taurus GN=ATL1 PE=2 SV=2 MAKNRRDRNSWGGFSEKTYEWSSEEEEPVKKAGPVQVLVVKDDHSFELDETALNRILLSEAVRDKEVVAVSVAGA FRKGKSFLMDFMLRYMYNQESVDWVGDHNEPLTGFSWRGGSERETTGIQIWSEIFLINKPDGKKVAVLLMDTQGT FDSQSTLRDSATVFALSTMISSIQVYNLSQNVQEDDLQHLQLFTEYGRLAMEETFLKPFQSLIFLVRDWSFPYEFSY GSDGGS >sp|Q58D72_REVERSED|ATLA1_BOVIN Atlastin-1 OS=Bos taurus GN=ATL1 PE=2 SV=2-REVERSED MKKKESQETSESKPAPFAQHYLHRHTAAASYLKYLAENTSGQDWLAAAVQDIVAGLERYEGSYRIYAWTCLTILTL MIMNCLSAIIDLGIFGTVGAIVYTIFIVVFLTAPTRAAHFINKSDNHKIYQIYLEDIETELQQLYRRSFEEGGMKKVGRFL KVSEEKLELHKTQLDNPALFPKDGGCIEEMKKNYTDKATAVAALNNAEATAQLMSKPHPLEEGQYIKIYAKFYEVLG RCTIKN >tr|Q58D73|Q58D73_BOVIN Chromosome 20 open reading frame 29 OS=Bos taurus GN=C20orf29 PE=2 SV=1 MVHAFLIHTLRAAKAEEGLCRVLYSCFFGAENSPNDSQPHSAERDRLLRKEQILAVARQVESMYQLQQQACGRHA VDLQPQSSDDPVALHEAPCGAFRLAPGDPFQEPRTVVWLGVLSIGFALVLDTHENLLLVESTLRLLARLLLDHLRLL VPGGANLLLRADCIEGILTRFLPHGQLLFLNDQFVQGLEKEFSAAWSH >tr|Q58D73_REVERSED|Q58D73_BOVIN Chromosome 20 open reading frame 29 OS=Bos taurus GN=C20orf29 PE=2 SV=1-REVERSED … And so on for the next 20-500 mb
  • 13. Protein Database ● Starting with a database ● These are digested in silico to produce peptides ● Modifications may be added to produce a list of peptides to search ● Every potential modification roughly doubles the search space IAM[15.995]S[79.966]GS[79.966]S[79.966]S AIYVR RGNTVLKDLK IEFLNEAS[79.966]VMK 1360.63272 TVRAKQPSEK
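Why each potential modification roughly doubles the search space can be seen by enumerating the modified forms of one peptide: every modifiable site is either modified or not, so n sites give 2^n forms. The sketch below is an illustration under assumptions; the class name `ModificationSketch` and the choice of modifiable residues are hypothetical, and only the bracketed mass notation is taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative enumeration of modified peptide forms (hypothetical helper). */
final class ModificationSketch {

    /** Returns every combination of modified/unmodified sites, e.g. "S[79.966]". */
    static List<String> variants(String peptide, Map<Character, Double> mods) {
        List<Integer> sites = new ArrayList<>();
        for (int i = 0; i < peptide.length(); i++) {
            if (mods.containsKey(peptide.charAt(i))) {
                sites.add(i);
            }
        }
        if (sites.size() > 20) {
            throw new IllegalArgumentException("too many modifiable sites for this sketch");
        }
        List<String> out = new ArrayList<>();
        // Each bit of mask decides whether the corresponding site is modified.
        for (int mask = 0; mask < (1 << sites.size()); mask++) {
            StringBuilder sb = new StringBuilder();
            int siteIndex = 0;
            for (int i = 0; i < peptide.length(); i++) {
                char c = peptide.charAt(i);
                sb.append(c);
                if (siteIndex < sites.size() && sites.get(siteIndex) == i) {
                    if ((mask & (1 << siteIndex)) != 0) {
                        sb.append('[').append(mods.get(c)).append(']');
                    }
                    siteIndex++;
                }
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```

With S mapped to 79.966 and M mapped to 15.995, the peptide IEFLNEASVMK has two modifiable sites and therefore four forms, one of which is the IEFLNEAS[79.966]VMK entry shown on the slide.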
  • 15. Digestion ●Trypsin breaks proteins after arginine (R) or lysine (K) except when followed by proline (P) MTRSVGAGLAAIVLALAAVSAARPIARGAQ SAGSGAVSVKTIGDVDVSPANPTTGTQVL Cleaves to: MTR SVGAGLAAIVLALAAVSAARPIAR GAQSAGSGAVSVK
  • 16. Well … Almost ●Sometimes cleavages are missed ●Sometimes breaks occur in other places ●Some amino acids are modified chemically ●Samples may be labeled with isotopes to distinguish before and after proteins All these changes can push the number of scored peptides from hundreds of thousands to tens of millions or more
  • 17. Finding Fragments ● http://db.systemsbiology.net:8080/proteomicsToolkit/FragIonServlet.html LGAGDSIEVP B ion Y ion LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP LGAGDSIEVP
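The b and y ion ladder for a peptide such as LGAGDSIEVP can be computed from residue masses. The following is a sketch assuming singly charged fragments and standard monoisotopic residue masses; the class name `FragmentIonSketch` is hypothetical and this is not the toolkit servlet linked on the slide.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative singly charged b/y ion calculator (hypothetical helper). */
final class FragmentIonSketch {
    static final double PROTON = 1.007276;
    static final double WATER = 18.010565;
    static final Map<Character, Double> RESIDUE_MASS = new HashMap<>();

    static {
        String residues = "GASPVTCLINDQKEMHFRYW";
        double[] masses = {
            57.02146, 71.03711, 87.03203, 97.05276, 99.06841, 101.04768, 103.00919,
            113.08406, 113.08406, 114.04293, 115.02694, 128.05858, 128.09496,
            129.04259, 131.04049, 137.05891, 147.06841, 156.10111, 163.06333, 186.07931
        };
        for (int i = 0; i < masses.length; i++) {
            RESIDUE_MASS.put(residues.charAt(i), masses[i]);
        }
    }

    /** b ions: N-terminal prefixes of length 1..n-1, plus one proton. */
    static double[] bIons(String peptide) {
        double[] b = new double[peptide.length() - 1];
        double sum = PROTON;
        for (int i = 0; i < b.length; i++) {
            sum += RESIDUE_MASS.get(peptide.charAt(i));
            b[i] = sum;
        }
        return b;
    }

    /** y ions: C-terminal suffixes of length 1..n-1, plus water and one proton. */
    static double[] yIons(String peptide) {
        double[] y = new double[peptide.length() - 1];
        double sum = PROTON + WATER;
        for (int i = 0; i < y.length; i++) {
            sum += RESIDUE_MASS.get(peptide.charAt(peptide.length() - 1 - i));
            y[i] = sum;
        }
        return y;
    }
}
```

A ten-residue peptide gives nine b ions and nine y ions; these theoretical peaks are what a measured spectrum is compared against.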
  • 18. Theoretical and Measured Spectra B ION Y ION
  • 19. Cross Correlation measured=215.36 …. measured=310.17 measured=312.76 measured=312.76 theory=312.18 measured=319.31 measured=344.22 … measured=354.19 theory=356.17 measured=355.16 theory=356.17 measured=356.08 theory=356.17 measured=355.16 measured=356.08 … measured=431.21 measured=442.03 measured=442.03 theory=440.24 measured=443.43 … measured=942.79 theory=944.5
  • 20. Score is a weighted sum of matching peaks in the correlation ●Scoring is done against all peptides with a similar MZ to the measured spectrum ●The output is the best scoring peptide and a few of the runners-up NOTE In a typical experiment only 15-25% of spectra will be identified with a peptide in the database These are used to identify proteins
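The idea of a weighted sum of matching peaks can be illustrated with a much simpler score than the one Comet uses. The sketch below is a plain shared-peak score, named as such: it sums the intensities of measured peaks that fall within a tolerance of any theoretical peak. The class name, the intensity weighting and the tolerance are assumptions for illustration only, not Comet's cross-correlation.

```java
/** Illustrative shared-peak score (a stand-in, not Comet's cross-correlation). */
final class SharedPeakScoreSketch {

    /** Sums intensities of measured peaks within tolerance of any theoretical peak. */
    static double score(double[] measuredMz, double[] measuredIntensity,
                        double[] theoryMz, double tolerance) {
        double total = 0.0;
        for (int i = 0; i < measuredMz.length; i++) {
            for (double t : theoryMz) {
                if (Math.abs(measuredMz[i] - t) <= tolerance) {
                    total += measuredIntensity[i]; // count each measured peak once
                    break;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A few m/z values from the cross correlation slide; intensities are made up.
        double[] measured = {312.76, 319.31, 356.08};
        double[] intensity = {1.0, 2.0, 3.0};
        double[] theory = {312.18, 356.17};
        System.out.println(score(measured, intensity, theory, 1.0));
    }
}
```

In a search, a score like this would be computed for every candidate peptide in the spectrum's mass bin, keeping the best few.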
  • 21. Why is this a Big Data Problem The human body has about 20K Proteins Usually for quality control there is a ‘Decoy’ for every protein There are optional modifications which increase peptides by a factor of 2 A smaller sample will have about 50 M peptides - 900 M with larger database and more modifications A large run is about 100 K spectra The search space is proportional to peptides * spectra
  • 23. Political Concerns To sell the answer to biologists we must copy a well known algorithm. This means translating the code to Java from C++ and accepting the algorithm’s data structures and memory requirements
  • 24. Binning 50,000 spectra * 2,000,000,000 peptides is a VERY large number. Fortunately all pairs do not have to be scored: spectra are measured with a precursor mass and peptides have a mass, so only peptides and spectra in a specific mass range (bin) need be compared. On modern high precision instruments the bin is about 0.03 Dalton. This reduces the number of pairs to score to 2000 million; on a small sample we score 128 million pairs at about 500 microsec per scoring
  • 25. Binning Bins put all peptides and spectra with a specific MZ range into groups Spectra are put in several bins Bins can be subdivided for scoring Bins hold N Spectra and K peptides Currently there are tens of thousands of bins Scoring fails in larger Bins due to excess GC time
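A minimal sketch of mass binning, assuming a fixed bin width of 0.03 Dalton as mentioned on the previous slide. The class name `MassBinSketch` is hypothetical, and placing each spectrum in its own bin plus the two neighbouring bins is an assumption about what "several bins" means here; the real key in the talk's code is a `BinChargeKey`, which also carries charge.

```java
/** Illustrative mass binning (hypothetical helper; bin width and neighbours are assumptions). */
final class MassBinSketch {
    static final double BIN_WIDTH = 0.03; // Dalton

    /** A peptide goes into exactly one bin, chosen by its mass. */
    static long binOf(double mass) {
        return (long) Math.floor(mass / BIN_WIDTH);
    }

    /** A spectrum goes into its own bin and both neighbours to cover bin edges. */
    static long[] binsForSpectrum(double precursorMass) {
        long center = binOf(precursorMass);
        return new long[] {center - 1, center, center + 1};
    }

    public static void main(String[] args) {
        long peptideBin = binOf(1000.04);
        for (long bin : binsForSpectrum(1000.0)) {
            System.out.println(bin + (bin == peptideBin ? "  <- shares a bin with the peptide" : ""));
        }
    }
}
```

Only spectrum and peptide pairs that share a bin key are scored, which is what makes a cogroup on the bin key a natural fit.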
  • 27. Debugging and Performance This involves taking an unfamiliar problem running on an unfamiliar platform Questions Which operations are taking most time? How many times is each function called? Are functions balanced across machines on the cluster? When a small number of cases fail how can you instrument them
  • 28. Did it work the first time? Hell No After it stopped crashing and did well on a trivial problem a base sample took 30 hours to run on the cluster - Way longer than on a single machine!!! - issues - data not like familiar test data - Hadoop Input format bug
  • 29. Spark Accumulators Accumulators are like counters but much more powerful. Accumulators can track any object supporting add and zero methods
  • 30. Sample Code to Accumulate a Set of Strings public class SetAccumulableParam implements AccumulatorParam<Set<String>>, Serializable { public Set<String> addAccumulator(final Set<String> r, final Set<String> t) { HashSet<String> ret = new HashSet<String>(r); ret.addAll(t); return ret; } public Set<String> addInPlace(final Set<String> r1, final Set<String> r2) { return addAccumulator(r1,r2); } public Set<String> zero(final Set<String> initialValue) { return initialValue; } }
  • 31. Sample Accumulator Use JavaSparkContext ctx = new JavaSparkContext(sparkConf); // make an accumulator final Accumulator<Set<String>> wordsUsed = ctx.accumulator(new HashSet<String>(), new SetAccumulableParam()); JavaRDD<String> lines = ctx.textFile(args[0]); // read lines JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { List<String> stringList = Arrays.asList(s.split(" ")); wordsUsed.add(new HashSet<String>(stringList)); // accumulate words return stringList; } }); … Finish word count
  • 32. Function Accumulators Functions extend AbstractFunctionBase all reporting code in base class Functions implement doCall not call Calls are wrapped for timing and statistics Data Gathered total calls total time times executed on each MAC address
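The wrapping described on this slide can be sketched without Spark: a base class exposes `call`, times it, and delegates to an abstract `doCall`. The class name `TimedFunctionSketch` is hypothetical, and the counters here are local to one JVM; in the talk's design the statistics are reported through Spark accumulators so that they aggregate across the cluster.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Illustrative timing wrapper (hypothetical; local counters stand in for accumulators). */
abstract class TimedFunctionSketch<T, R> {
    private final AtomicLong totalCalls = new AtomicLong();
    private final AtomicLong totalNanos = new AtomicLong();

    /** Subclasses implement the real work here instead of in call(). */
    protected abstract R doCall(T input) throws Exception;

    /** Times every invocation, including ones that throw. */
    public final R call(T input) throws Exception {
        long start = System.nanoTime();
        try {
            return doCall(input);
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            totalCalls.incrementAndGet();
        }
    }

    public long getTotalCalls() {
        return totalCalls.get();
    }

    public long getTotalNanos() {
        return totalNanos.get();
    }
}
```

Because `call` is final, every subclass is instrumented the same way and the reporting code lives in one place.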
  • 33. Sample Instrumented Function public static class ChooseBestScanScore extends AbstractLoggingFunction2<IScoredScan, IScoredScan, IScoredScan> { @Override public IScoredScan doCall(final IScoredScan v1, final IScoredScan v2) throws Exception { ISpectralMatch match1 = v1.getBestMatch(); ISpectralMatch match2 = v2.getBestMatch(); return (match1.getHyperScore() > match2.getHyperScore()) ? v1 : v2; } } CombineCometScoringResults totalCalls:69M totalTime:29.05 sec machines:15 variance 0.058
  • 35. Improving Performance Fix bugs in Hadoop Format for large files Find most time spent in scoring Use a Parquet database to store digestion Discover that repartition is cheaper than expensive operations Smart partitioning to balance work in partitions Use more partitions for larger jobs
  • 36. Smart Partitioning a bin is a set of spectra and peptides that score together Bin sizes vary by orders of magnitude Scoring puts pressure on memory Bin sizes can be counted before scoring step Partitioning puts larger bins in separate partitions puts multiple smaller bins in the same partition
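One way to get the behaviour described here is a greedy assignment: sort bins from largest to smallest and always give the next bin to the least loaded partition. This is a sketch of that general technique, not the partitioner used in the talk; the class name `BinPartitionerSketch` and the use of a single number as the work estimate per bin are assumptions.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Illustrative greedy bin-to-partition assignment (hypothetical helper). */
final class BinPartitionerSketch {

    /** Returns the partition index for each bin; binSizes[i] estimates the work in bin i. */
    static int[] assign(long[] binSizes, int numPartitions) {
        Integer[] order = new Integer[binSizes.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        // Largest bins first so they claim empty partitions.
        Arrays.sort(order, (a, b) -> Long.compare(binSizes[b], binSizes[a]));

        // Queue entries are {currentLoad, partitionIndex}, least loaded first.
        PriorityQueue<long[]> queue = new PriorityQueue<>(
                (x, y) -> x[0] != y[0] ? Long.compare(x[0], y[0]) : Long.compare(x[1], y[1]));
        for (int p = 0; p < numPartitions; p++) {
            queue.add(new long[] {0L, p});
        }

        int[] result = new int[binSizes.length];
        for (int bin : order) {
            long[] least = queue.poll();
            result[bin] = (int) least[1];
            least[0] += binSizes[bin];
            queue.add(least);
        }
        return result;
    }
}
```

With sizes {1000, 1, 1, 1} and two partitions, the large bin gets a partition to itself and the three small bins share the other, which matches the intent of the slide. In Spark the resulting mapping could back a custom `Partitioner`.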
  • 37. Performance ● A larger test took 4 hours on a single machine ● On a small 15 node cluster it took ○ 69 minutes real time ○ Used 41 hours of cpu time ○ Scored 2100 million peptides ○ generated 605 million peptides ○ with 4 potential modifications ○ 95% of the time we find the same top peptides as Comet
  • 38. Summary Proteomic Search is a large data problem involving scoring a large number of spectra against an even larger number of candidate peptides. In the future the complexity will increase with more spectra and more modifications adding more peptides Spark is a parallel execution environment allowing search to be performed on a cluster Performance is superior to existing tools and can be improved by increasing the size of the cluster
  • 39. Code Part 1 // Read Spectra RDD<IMeasuredSpectrum> spectraToScore = SparkScanScorer.getMeasuredSpectra(scoringApplication); // Condition Spectra RDD<CometScoredScan> cometSpectraToScore = spectraToScore.map(new MapToCometSpectrum(comet)); // Assign bins to spectra PairRDD<BinChargeKey, CometScoredScan> keyedSpectra = handler.mapMeasuredSpectrumToKeys(cometSpectraToScore); // read Proteins RDD<IProtein> proteins = readProteins(jctx); // Digest to peptides RDD<IPolypeptide> digested = proteins.flatMap(new DigestProteinFunction(app)); // map to bins PairRDD<BinChargeKey, IPolypeptide> keyedPeptides = digested.flatMapToPair(new mapPolypeptidesToBin(application, usedBins));
  • 40. Code Part 2 // Now collect the contents of spectra and peptide bins PairRDD<BinChargeKey, Tuple2<Iterable<CometScoredScan>, Iterable<HashMap<String, IPolypeptide>>>> binContents = keyedSpectra.cogroup(keyedPeptides); // do scoring RDD< IScoredScan> scores = binContents.flatMap(new ScoreSpectrumAndPeptideWithCogroup(application)); // combine spectrum scoring RDD< IScoredScan> cometBestScores = handler.combineScanScores(scores); // write results as a single file consolidator.writeScores(cometBestScores);
  • 41. Proteomic Search PseudoCode RDD<Spectrum> spectra = readSpectra(); // mydata.mzXML RDD<Proteins> proteins = readDatabase(); // uniprot_swiss.fasta RDD<Peptides> peptides= digest(proteins );
  • 42. THESE ARE UNUSED SLIDES DON’T GO HERE
  • 43. Consider a Protein - a collection of Amino Acids MTRRSRVGAGLAAIVLALAAVSAAAPIAGAQ SAGSGAVSVTIGDVDVSPANPTTGTQVLITPS INNSGSASGSARVNEVTLRGDGLLATEDSLG RLGAGDSIEVPLSSTFTEPGDHQLSVHVRGL NPDGSVFYVQRSVYVTVDDRTSDVGVSART TATNGSTDIQATITQYGTIPIKSGELQVVSDGR IVERAPVANVSESDSANVTFDGASIPSGELVI RGEYTLDDEHSTHTTNTTLTYQPQRSADVAL TGVEASGGGTTYTISGDAANLGSADAASVRV NAVGDGLSANGGYFVGKIETSEFATFDMTVQ ADSAVDEIPITVNYSADGQRYSDVVTVDVSGA SSGSATSPERAPGQQQKRAPSPSNGASGGG LPLFKIGGAVAVIAIVVVVVRRWRNP
  • 44. Protein Database Digest Measured Spectra Normalize Add Modifications MZ Bin Fragments in one bin MZ Bin Spectra put in multiple bins Cross Product Score all pairs Hadoop Input Filter (and write) FlatMap PairFlatMap Hadoop Input Sort Spark Operations Map
  • 45. What is Spark Spark is a Framework for parallel execution Spark works well on Hadoop clusters (also has a local mode for testing) Spark is less formal than Map-Reduce and multiple operations can run locally
  • 46. Protein Database Digest Measured Spectra Normalize Add Modifications MZ Bin Fragments in one bin MZ Bin Spectra put in multiple bins Cross Product Score all pairs Sort by Spectra Report Best Fits All operations are on a 15 node Spark Cluster and are performed in parallel with lazy execution Most time is spent in the Score all Pairs Step
  • 47. Multi Stage Mass Spec From Kinter and Sherman
  • 48. A Protein is a Collection of Amino Acids ●Each (of 20) Amino acid is indicated by a letter ●Assume we have a sample with a number of proteins. ●Assume that we can list the possible proteins in the sample. ●Tandem Mass Spectrometry is similar to shotgun genomics