Java Thread and Process Performance
for Parallel Machine Learning on
Multicore HPC Clusters
12/08/2016 IEEE BigData 2016 1
IEEE BigData 2016
December 5-8, Washington D.C.
Saliya Ekanayake
Supun Kamburugamuve | Pulasthi Wickramasinghe | Geoffrey Fox
Performance on Multicore HPC Clusters
5/16/2016
[Speedup chart, 48 and 128 nodes: series for the all-MPI approach and for hybrid threads and MPI; >40x speedup with optimizations, against a 64x ideal (if life was so fair!). We'll discuss these optimizations today.]
Multidimensional Scaling (MDS) on an Intel Haswell HPC cluster with 24-core nodes and 40Gbps InfiniBand
2
Performance Factors
• Thread models
• Affinity
• Communication
• Other factors of high-level languages:
 Garbage collection
 Serialization/deserialization
 Memory references and cache
 Data read/write
9/28/2016 Ph.D. Dissertation Defense 3
Thread Models
• Long Running Threads – Fork Join (LRT-FJ).
 Serial work followed by parallel regions.
 A long-running thread pool handles the parallel tasks.
 Threads sleep after the parallel work.
• Long Running Threads – Bulk Synchronous Parallel (LRT-BSP).
 Resembles the classic BSP model of processes.
 A long-running thread pool similar to FJ.
 Threads always occupy their CPUs ("hot" threads).
[Diagram: LRT-FJ alternates serial work with non-trivial parallel regions; LRT-BSP replicates the serial work on every thread and separates non-trivial parallel regions with busy thread synchronization.]
• LRT-FJ vs. LRT-BSP.
 High context-switch overhead in FJ.
 BSP replicates the serial work but has lower synchronization overhead.
 Implicit synchronization in FJ.
 BSP uses explicit busy synchronization.
A minimal sketch of the BSP model follows this slide.
9/28/2016 Ph.D. Dissertation Defense 4
Affinity
• Six affinity patterns.
• E.g., 2x4:
 Two threads per process.
 Four processes per node.
 Assume two 4-core sockets.
9/28/2016 Ph.D. Dissertation Defense 5
The six patterns combine a thread affinity (Inherit, or Explicit per core) with a process affinity (Cores, Socket, or None/All):
 Inherit: CI, SI, NI
 Explicit per core: CE, SE, NE
[Diagrams: process and thread placement for 2x4 on cores C0-C7 across two sockets. CI/CE: each of the four processes is bound to two dedicated cores. SI/SE: two processes share each 4-core socket. NI/NE: all four processes float over all eight cores. With Inherit (I), worker threads inherit the process mask; with Explicit (E), each worker thread is pinned to its own core. Legend: worker threads, background threads (GC and other JVM threads), process (JVM). A core-numbering sketch for the explicit patterns follows this slide.]
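For the "explicit per core" (E) patterns, each worker thread needs a concrete core number. The sketch below only computes that mapping for a TxP decomposition on one node; the names are invented for illustration, and the actual pinning step (not shown) would use a native call or a thread-affinity library such as OpenHFT's Java-Thread-Affinity, which is an assumption here rather than the SPIDAL implementation.

    // Illustrative core mapping for the "explicit per core" (CE) pattern.
    public class CoreMapping {
      /**
       * @param rankOnNode     process index on this node (0 .. processesPerNode-1)
       * @param threadId       worker thread index inside the process (0 .. threadsPerProc-1)
       * @param threadsPerProc threads per process (T in TxP)
       * @param coresPerNode   physical cores on the node
       */
      static int coreForThread(int rankOnNode, int threadId,
                               int threadsPerProc, int coresPerNode) {
        int base = rankOnNode * threadsPerProc;   // contiguous block of cores per process
        int core = base + threadId;
        if (core >= coresPerNode) {
          throw new IllegalArgumentException("over-subscribed: core " + core);
        }
        return core;
      }

      public static void main(String[] args) {
        // 2x4 CE on a node with two 4-core sockets (cores 0-3 on socket 0, 4-7 on socket 1)
        for (int p = 0; p < 4; p++) {
          for (int t = 0; t < 2; t++) {
            System.out.printf("process P%d thread %d -> core C%d%n",
                p, t, coreForThread(p, t, 2, 8));
          }
        }
      }
    }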
A Quick Peek into Performance
9/28/2016 Ph.D. Dissertation Defense 6
[Charts: K-Means 10K performance on 16 nodes. Runtime in ms (roughly 1.5E+4 to 5.0E+4) versus intra-node decomposition (threads per process x processes per node: 1x24, 2x12, 3x8, 4x6, 6x4, 8x3, 12x2, 24x1) for four configurations: LRT-FJ NI (no thread pinning, FJ), LRT-FJ NE (threads pinned to cores, FJ), LRT-BSP NI (no thread pinning, BSP), and LRT-BSP NE (threads pinned to cores, BSP).]
Communication Mechanisms
• Collective communications are expensive.
 Allgather, allreduce, broadcast.
 Frequently used in parallel machine learning.
 E.g., 3 million double values distributed uniformly over 48 nodes (chart).
• The message size per node is identical, yet running 24 MPI ranks per node is ~10 times slower than 1 rank per node.
• This suggests the number of ranks per node should be 1 for the best performance.
• How can this cost be reduced? (A hedged sketch of the collective call follows this slide.)
9/28/2016 Ph.D. Dissertation Defense 7
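For context, one of the collectives above (allreduce) could look roughly like the sketch below in Java. This assumes the Open MPI Java bindings (package mpi); method names such as allReduce, getRank, and getSize differ in older mpiJava-style bindings, and the buffer length is illustrative only, so treat this as a hedged sketch rather than the benchmark code.

    import java.util.Arrays;
    import mpi.MPI;

    public class AllreduceSketch {
      public static void main(String[] args) throws Exception {
        MPI.Init(args);
        int rank = MPI.COMM_WORLD.getRank();
        int size = MPI.COMM_WORLD.getSize();

        // Every rank contributes a vector of the same length; per-node network
        // traffic is therefore similar whether a node runs 1 rank or 24 ranks,
        // yet the measured collective time grows sharply with ranks per node.
        int length = 62_500;                       // illustrative message size
        double[] local = new double[length];
        Arrays.fill(local, rank);
        double[] global = new double[length];

        MPI.COMM_WORLD.allReduce(local, global, length, MPI.DOUBLE, MPI.SUM);

        if (rank == 0) {
          System.out.println("ranks=" + size + ", global[0]=" + global[0]);
        }
        MPI.Finalize();
      }
    }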
Communication Mechanisms
• Shared Memory (SM) for inter-process communication (IPC).
 Custom Java implementation in SPIDAL.
 Uses OpenHFT's Bytes API.
 Reduces network communications to one per node (an illustrative sketch follows this slide).
9/28/2016 Ph.D. Dissertation Defense 8
Java SM architecture
Java DA-MDS 100k communication on 48 of 24-core nodes
Java DA-MDS 200k communication on 48 of 24-core nodes
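A minimal illustration of the shared-memory idea, using java.nio memory mapping rather than the OpenHFT Bytes API that the SPIDAL implementation actually uses; the file path, slot layout, and rank numbers are assumptions made for the example.

    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    public class SharedMemorySketch {
      public static void main(String[] args) throws Exception {
        int ranksOnNode = 24;
        int doublesPerRank = 1024;
        long bytes = (long) ranksOnNode * doublesPerRank * Double.BYTES;

        try (RandomAccessFile raf = new RandomAccessFile("/dev/shm/node_buffer", "rw");
             FileChannel ch = raf.getChannel()) {
          MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_WRITE, 0, bytes);

          // Each rank writes its partial results into its own slot; a designated
          // "node leader" rank then reads all slots and performs the single
          // network communication for the node.
          int myRankOnNode = 0;  // would come from the node-local MPI configuration
          int offset = myRankOnNode * doublesPerRank * Double.BYTES;
          for (int i = 0; i < doublesPerRank; i++) {
            map.putDouble(offset + i * Double.BYTES, i * 0.5);
          }
          map.force();  // flush so other processes mapping the same file see the data
        }
      }
    }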
Performance: K-Means
12/08/2016 IEEE BigData 2016 9
Java K-Means performance for 1 million points and 1k centers on 16 nodes for LRT-FJ and LRT-BSP with varying affinity patterns over varying threads and processes.
Java vs. C K-Means LRT-BSP (affinity CE) performance for 1 million points with 1k, 10k, 50k, 100k, and 500k centers on 16 nodes over varying threads and processes.
Performance: K-Means
9/28/2016 Ph.D. Dissertation Defense 10
Java and C K-Means performance for 1 million points with 100k centers on 16 nodes for LRT-FJ and LRT-BSP over varying intra-node parallelisms. The affinity pattern is CE.
• All-Procs.
 #processes = total parallelism.
 Each process pinned to a separate core.
 Processes balanced across the two sockets.
• Threads Internal.
 1 process per node, pinned to T cores, where T = #threads.
 Threads pinned to separate cores.
 Threads balanced across the two sockets.
• Hybrid.
 2 processes per node, each pinned to T cores.
 Total parallelism = Tx2, where T is 8 or 12.
 Threads pinned to separate cores and balanced across the two sockets.
[Chart series: Java LRT-BSP {All-Procs, Threads Internal, Hybrid} and C.]
K-Means Clustering
• Flink and Spark
9/28/2016 Ph.D. Dissertation Defense 11
[Dataflow: a Data Set <Points> and a Data Set <Initial Centroids> feed a Map stage (nearest-centroid calculation); a Reduce stage updates the centroids, and the resulting Data Set <Updated Centroids> is broadcast back for the next iteration. A plain-Java sketch of one such iteration follows the captions below.]
K-Means total and compute times for 1 million 2D points and 1k, 10k, 50k, 100k, and 500k centroids for Spark, Flink, and MPI Java LRT-BSP CE. Run on 16 nodes as 24x1.
K-Means total and compute times for 100k 2D points and 1k, 2k, 4k, 8k, and 16k centroids for Spark, Flink, and MPI Java LRT-BSP CE. Run on 1 node as 24x1.
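To make the dataflow concrete, here is a plain-Java sketch of one K-Means iteration with the same map/reduce structure (nearest-centroid assignment, then centroid update). It is an illustration, not the benchmarked Spark/Flink/MPI code, and it stores points and centroids in flat 1D arrays (x0, y0, x1, y1, ...) in line with the memory-layout advice later in the deck.

    public class KMeansIteration {
      // points:   n*dim values, laid out point by point
      // centroids: k*dim values, laid out centroid by centroid
      static double[] step(double[] points, double[] centroids, int dim) {
        int k = centroids.length / dim;
        double[] sums = new double[k * dim];
        int[] counts = new int[k];

        // Map: nearest-centroid calculation per point.
        for (int p = 0; p < points.length; p += dim) {
          int best = 0;
          double bestDist = Double.MAX_VALUE;
          for (int c = 0; c < k; c++) {
            double dist = 0;
            for (int d = 0; d < dim; d++) {
              double diff = points[p + d] - centroids[c * dim + d];
              dist += diff * diff;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
          }
          for (int d = 0; d < dim; d++) sums[best * dim + d] += points[p + d];
          counts[best]++;
        }

        // Reduce: update centroids (empty clusters keep their old position).
        double[] updated = centroids.clone();
        for (int c = 0; c < k; c++) {
          if (counts[c] == 0) continue;
          for (int d = 0; d < dim; d++) {
            updated[c * dim + d] = sums[c * dim + d] / counts[c];
          }
        }
        return updated;  // would be broadcast to all workers before the next iteration
      }
    }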
12/08/2016 IEEE BigData 2016
Performance: Multidimensional Scaling
12
Java DA-MDS 50k on 16 of 24-core nodes. Java DA-MDS 100k on 16 of 24-core nodes.
Java DA-MDS speedup comparison for LRT-FJ and LRT-BSP.
Linux perf statistics for a DA-MDS run of 18x2 on 32 nodes. The affinity pattern is CE.
[Chart annotations: 15x, 74x, 2.6x.]
Performance: Multidimensional Scaling
5/16/2016
[Speedup chart, 48 and 128 nodes: series for the all-MPI approach and for hybrid threads and MPI; >40x speedup with optimizations, against a 64x ideal (if life was so fair!).]
Multidimensional Scaling (MDS) on an Intel Haswell HPC cluster with 24-core nodes and 40Gbps InfiniBand
13
Acknowledgement
• Grants: NSF CIF21 DIBBS 1443054 and NSF RaPyDLI 1415459
• Digital Science Center (DSC) at Indiana University
• Network Dynamics and Simulation Science Laboratory (NDSSL) at Virginia Tech
12/08/2016 IEEE BigData 2016 14
Thank you!
Backup Slides
12/08/2016 IEEE BigData 2016 15
Other Factors
• Garbage Collection (GC).
 "Stop the world" events are expensive.
 Especially for parallel machine learning.
 Typical OOP: allocate, use, forget.
 The original SPIDAL code produced frequent garbage of small arrays.
 Solutions.
 GC is unavoidable, but can be reduced.
 Static allocation.
 Object reuse (a minimal sketch follows this slide).
 Advantages.
 Less GC, obviously.
 Scales to larger problem sizes.
 E.g., the original SPIDAL code required a 5GB heap per process (x 24 = 120 GB per node) to handle 200K DA-MDS. The optimized code uses < 1GB of heap and finishes in the same time.
 Note: physical memory is 128GB, so the optimized SPIDAL can now do a 1-million-point MDS within hardware limits.
9/28/2016 Ph.D. Dissertation Defense 16
[Charts: before optimization, the heap size per process reaches -Xmx (2.5GB) early in the computation and GC is frequent; after optimization, the heap stays well below -Xmx at ~1.1GB and there is virtually no GC activity.]
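The static-allocation and object-reuse fix above amounts to hoisting scratch buffers out of the iteration loop. A minimal sketch with invented names, not the SPIDAL code:

    public class ScratchBuffers {
      // Allocated once at startup and reused every iteration, instead of the
      // allocate-use-forget pattern that produces frequent small garbage.
      private final double[] partialDistances;

      ScratchBuffers(int size) {
        this.partialDistances = new double[size];
      }

      void iterate(double[] input) {
        // reuse the preallocated buffer rather than `new double[input.length]` here
        for (int i = 0; i < input.length; i++) {
          partialDistances[i] = input[i] * input[i];
        }
        // ... consume partialDistances before the next call overwrites it
      }
    }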
Other Factors
• Serialization/Deserialization.
 Default implementations are verbose, especially in Java.
 Kryo is by far the best in compactness (a minimal sketch follows this slide).
 Off-heap buffers are another option.
• Memory references and cache.
 Nested structures are expensive.
 Even 1D arrays are preferred over 2D when possible.
 Adopt HPC techniques: loop ordering, blocked arrays.
• Data read/write.
 Stream I/O is expensive for large data.
 Memory mapping is much faster and JNI-friendly in Java.
 Native calls require extra copies because objects move during GC.
 Memory maps live in off-GC space, so no extra copying is necessary.
9/28/2016 Ph.D. Dissertation Defense 17
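As an illustration of the compact-serialization point, a minimal Kryo round trip might look like the sketch below. The file name and the choice of a double[] payload are assumptions for the example, and registration details vary between Kryo versions.

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;

    public class KryoSketch {
      public static void main(String[] args) throws Exception {
        Kryo kryo = new Kryo();
        kryo.register(double[].class);  // explicit registration keeps the stream compact

        double[] vector = {1.0, 2.0, 3.0};
        try (Output out = new Output(new FileOutputStream("vector.bin"))) {
          kryo.writeObject(out, vector);   // far more compact than java.io.Serializable
        }
        try (Input in = new Input(new FileInputStream("vector.bin"))) {
          double[] back = kryo.readObject(in, double[].class);
          System.out.println("read " + back.length + " doubles");
        }
      }
    }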
Speaker notes

1. Note the differences in communication architectures. Times are on a log scale. Bars indicate compute-only times, which are similar across these frameworks; the overhead is dominated by communication in Flink and Spark.