Trelles_QnormBOSC2009
1. Qnorm: A library of parallel methods for gene-expression Q-normalization. José Manuel Mateos-Duran, Pjotr Prins, Andrés Rodríguez and Oswaldo Trelles. The Bioinformatics Open Source Conference (BOSC)
2.
3. Quantile normalization
   1) Load data to memory
   2) Order each column of R, producing a set of indexes I[G][E]=p (where p is the original position of the value in the column)
   3) Obtain A[G], the average value for each row
   4) Assign the average value to all entries: O[g][e]=A[g], g=1 to G; e=1 to E
   5) Sort each column O[g][E] by the index I[g][E] (reproduce the original order)
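The five steps above can be sketched in plain C as follows. This is a minimal in-memory illustration, not the Qnorm library code: the function name `qnorm`, the `Entry` struct and the row-major layout `R[g*E + e]` are assumptions made for the example. Tied values receive arbitrary adjacent ranks because `qsort` is not stable, and missing values are not handled.

```c
#include <assert.h>
#include <stdlib.h>

/* One value of a column together with its original row position p. */
typedef struct { double value; size_t pos; } Entry;

static int cmp_entry(const void *a, const void *b) {
    double x = ((const Entry *)a)->value;
    double y = ((const Entry *)b)->value;
    return (x > y) - (x < y);
}

/* Quantile-normalize R in place. R holds G rows (genes) by E columns
   (experiments) in row-major order. Returns 0 on success, -1 if a
   memory allocation fails. */
int qnorm(double *R, size_t G, size_t E) {
    assert(R != NULL && G > 0 && E > 0);
    Entry *col = malloc(G * sizeof *col);
    size_t *I = malloc(G * E * sizeof *I);  /* I[g*E + e] = p            */
    double *A = calloc(G, sizeof *A);       /* A[g] = average of rank g  */
    if (!col || !I || !A) { free(col); free(I); free(A); return -1; }

    for (size_t e = 0; e < E; e++) {
        /* Step 2: sort column e, remembering the original positions. */
        for (size_t g = 0; g < G; g++) {
            col[g].value = R[g * E + e];
            col[g].pos = g;
        }
        qsort(col, G, sizeof *col, cmp_entry);
        for (size_t g = 0; g < G; g++) {
            I[g * E + e] = col[g].pos;
            A[g] += col[g].value;           /* Step 3: accumulate rank g */
        }
    }
    for (size_t g = 0; g < G; g++) A[g] /= (double)E;   /* Step 3: average */

    /* Steps 4 and 5: write the rank average back at the original position. */
    for (size_t e = 0; e < E; e++)
        for (size_t g = 0; g < G; g++)
            R[I[g * E + e] * E + e] = A[g];

    free(col); free(I); free(A);
    return 0;
}
```

For a 3 x 2 matrix with columns (3, 1, 2) and (10, 30, 20), the rank averages are 5.5, 11 and 16.5, so the first column becomes (16.5, 5.5, 11) and the second (5.5, 16.5, 11).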
4. Parallel prototype: code reorganization
   {
     nE = LoadProject(fname, fList);
     for (i=0; i<nE; i++) {      // for each Exp [STEP 1]
       LoadFile(fList, i, dataIn);
       Qnorm1(dataIn, dIndex, fList[i].nG);
       PartialRowAccum(AvG, dataIn, nG);
       // Manage the Index in memory or disk
     }
     for (i=0; i<nG; i++)        // Global average
       AvG[i].Av /= AvG[i].num;
     // produce the ORDERED output file [STEP 2]
     Prepare Out file & one column 'dataOut' array
     for (i=0; i<nE; i++) {
       Get the column index (from memory or disk)
       for (j=0; j<nG; j++)      // prepare OUT array
         dataOut[dIndex[j]] = AvG[j].Av;
       File positioning and writing the vector
     }
   }
5. Shared memory version
   {
     nE = LoadProject(fname, fList);
     for (i=0; i<nE; i++) {      // for each Exp
       LoadFile(fList, i, dataIn);
       Qnorm1(dataIn, dIndex, fList[i].nG);
       PartialRowAccum(AvG, dataIn, nG);
       // Manage the Index in memory or disk
     }
     for (i=0; i<nG; i++)        // Global average
       AvG[i].Av /= AvG[i].num;
     // produce the ORDERED output file [STEP 2]
     Prepare Output file and one column 'dataOut' array
     for (i=0; i<nE; i++) {
       Get the column index (from memory or from disk)
       for (j=0; j<nG; j++)      // complete output vector
         dataOut[dIndex[j]] = AvG[j].Av;
       File positioning and writing the vector
     }
   }
   OpenMP directive shown on the slide:
     #pragma omp parallel shared(From, To, Range)   // Open general parallel section
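A minimal OpenMP sketch of the row-accumulation and global-average part of the loop above, assuming the columns have already been sorted and sit in memory. The function name `rank_average`, the column-major layout and the per-thread `partial` buffer are assumptions for illustration. The slide shares `From`, `To` and `Range` between threads, which suggests an explicit range split, whereas this sketch lets `#pragma omp for` distribute the experiments. Compiled without OpenMP support the pragmas are ignored and the code runs serially.

```c
#include <assert.h>
#include <stdlib.h>

/* Average, rank by rank, nE already-sorted columns of length nG.
   sorted is column-major: column e starts at sorted + e*nG.
   avg receives nG values. Returns 0 on success, -1 on allocation failure. */
int rank_average(const double *sorted, size_t nG, size_t nE, double *avg) {
    assert(sorted != NULL && avg != NULL && nG > 0 && nE > 0);
    int failed = 0;
    for (size_t g = 0; g < nG; g++) avg[g] = 0.0;

    #pragma omp parallel shared(sorted, avg, failed)
    {
        /* Each thread accumulates into its own buffer to avoid races. */
        double *partial = calloc(nG, sizeof *partial);
        if (!partial) {
            #pragma omp atomic write
            failed = 1;
        }

        /* Every thread reaches this loop, even after a failed allocation,
           because all threads of a team must encounter a worksharing
           construct. */
        #pragma omp for
        for (long e = 0; e < (long)nE; e++) {
            if (partial) {
                const double *col = sorted + (size_t)e * nG;
                for (size_t g = 0; g < nG; g++) partial[g] += col[g];
            }
        }

        if (partial) {
            /* Merge the per-thread sums one thread at a time. */
            #pragma omp critical
            for (size_t g = 0; g < nG; g++) avg[g] += partial[g];
            free(partial);
        }
    }

    if (failed) return -1;
    for (size_t g = 0; g < nG; g++) avg[g] /= (double)nE;   /* global average */
    return 0;
}
```

The per-thread buffer costs nG doubles per thread, but it keeps the inner loop free of synchronization. The order in which partial sums are merged depends on thread scheduling, so results can differ in the last bits between runs.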
6. Message Passing version
   Master:
     Get Parameters, Initialize
     CalculateBlocks(nP, IniBlocks)
     Broadcast(IniBlocks)
     while (ThereIsBlocks) {
       Receive(ResultBlock, who)
       AverageBlock(ResultBlock)
       if (!SendAllBlocks) {
         CalculateNextBlock(NextBlock)
         Send(who, NextBlock)
       }
     }
     Broadcast(ENDsignal)
     ReportResults
   Slave(s):
     Start with params
     Receive(Block)
     while (!ENDsignal) {
       for each experiment in block {
         LoadExperiment(Exp)
         SortExperiment(Exp)
         AcumulateAverage(Exp);
       }
       AverageBlock(ResultBlock)
       Send(ResultBlock)
       Receive(Block);
     }
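The slide does not say how the master computes the next block. A minimal sketch, assuming fixed-size contiguous blocks of experiments handed out on demand, could look like this. `Block`, `next_block`, `cursor` and `block_size` are hypothetical names, and the message-passing calls are left out so the fragment compiles without an MPI installation.

```c
#include <assert.h>

/* A contiguous range of experiments handed to one worker. */
typedef struct { int first; int count; } Block;

/* Hypothetical stand-in for CalculateNextBlock: hands out the next block
   of at most block_size experiments. *cursor is the index of the first
   experiment not yet assigned. Returns 1 and fills *out, or returns 0
   once all nE experiments have been assigned. */
int next_block(int *cursor, int nE, int block_size, Block *out) {
    assert(cursor != 0 && out != 0 && block_size > 0 && *cursor >= 0);
    if (*cursor >= nE) return 0;
    int remaining = nE - *cursor;
    out->first = *cursor;
    out->count = remaining < block_size ? remaining : block_size;
    *cursor += out->count;
    return 1;
}
```

In the master loop, each received result would trigger one call to `next_block`, and the returned range would be sent back to the worker identified by `who`. Handing out work on demand lets faster nodes take more blocks than slower ones.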
7. GPU version (NVIDIA CUDA programming model)
   CPU:
     nE = LoadProject(fname, fList);
     for (i=0; i<nE; i++) {      // for each Exp
       LoadFile(fList, i, dataIn);
       CopyToGPU(dataIn);
       <<kernel>> QSortGPU(dataIn, dIndex)
       CopyFromGPU(dIndex);
       WriteToDisk(dIndex);
       <<kernel>> RowAccum(dataIn, AvG)
     }
     <<kernel>> GlobalAvg(AvG, nE)
     CopyFromGPU(AvG);
     // Step 2: Produce Output File
     // Using indexes and global average
   GPU kernels:
     QSortGPU(dataIn, dIndex)
     RowAccum(dataIn, AvG)
     GlobalAvg(AvG, nE)
8. Hardware & Data
   Input: Affymetrix raw CEL files (GPL3718) / 6.5M probes x 470 arrays. Convert CEL files: Ben Bolstad's Affyio (part of R/Bioconductor and my Biolib).
   Pablo: Shared Memory Cluster, up to 256 nodes / JS20-IBM, 512 CPUs, 1 TB distributed memory. Each node: 2 CPUs IBM PowerPC single-core 970FX, 64 bits, 2 GHz, and 4 GB RAM. HD: 40 GB (local). Interconnection network: MERINET.
   Picasso: Shared Memory Cluster, up to 64 nodes, Superdome HP, 128 CPUs, 128 GB SM. Each node: 2 CPUs Intel Itanium-2 Dual Core, 1.6 GHz.
   Almeria: CPU: Intel Core 2 Quad Q9450, 2.66 GHz, 1.33 GHz FSB, 12 MB L2. GPU: GeForce 9800 GX2, 600/1500 MHz, 2x1 GHz DDR3, 1 GB & 512 bits. HD: 2 x 72 GB (RAID 0) Western Digital Raptors, 10000 RPM.
9. Benchmarking
   Input: Affymetrix raw CEL files (GPL3718) / 6.5M probes x 470 arrays. Convert CEL files: Ben Bolstad's Affyio (part of R/Bioconductor and my Biolib).
   Result charts: Distributed memory; Shared memory; GPU.
   2.9x total speed-up; 5.5x processing speed-up.
10. Conclusions
   Background
     Application domain: bioinformatics (diverse, disperse, heterogeneous, huge data…)
     I/O and memory oriented applications
     Large collection of sequential code unable to deal with computational demands
   Aims
     Characterize the application domain
     Start up a library of (common) parallel procedures
   Benchmarking
     Performance is strongly related to code dependencies
     Parallel models (shared, distributed, etc.) are appropriate for different code structures
     Shared memory is good but expensive
     GPU-based solutions seem to be a good alternative for local installations
     I/O-bound applications should look for performance in the I/O device
Editor's notes
Now, let's address the use of parallel processing in bioinformatics applications.