1. Highly Scalable Java Programming for Multi-Core Systems. Zhi Gan (ganzhi@gmail.com), IBM China Development Lab, Next Generation Systems
3. Continuing evolution of multicore: varying trade-offs between thread speed and throughput, and varying assumptions about memory footprint and working sets.

| | Nehalem EX | POWER 7 | UltraSPARC T2 |
|---|---|---|---|
| Max cores per chip | 8 | 8 | 8 |
| Max threads per core | 2 | 4 | 8 |
| Last-level on-chip cache | 24 MB | 32 MB | 4 MB |
| Memory controllers per chip | 2 | 2 | 4 |
| Max chips per system | 8 | 32 | 4 |
| Max system size (threads) | 128 | 1,024 | 256 |
5. NUMA is the new normal. Highest affinity between threads on a core; next highest affinity between cores on a chip; then affinity between a chip and its locally attached DRAM. Example: IBM Power 750 (POWER 7), 32 cores, 128 threads. Note: memory systems on all major platforms have a similar hierarchical structure. [Diagram: per-core L1 and L2 caches and execution units, per-chip L3 caches, and locally attached DIMMs]
6. Balancing I/O and Server Capacity. Ultra-dense DRAM (MAX5): very high speed random r/w, highest cost and power, limited capacity. Enterprise NAND Flash: high speed random reads, lowest cost per IOPS, high capacity (TBs). Parallel disk array: high speed sequential r/w, lowest cost per GB, virtually unlimited capacity (PBs).
31. Why Lock-Free Often Means Better Scalability (I). Lock: all threads wait for one. Lock-free: no waiting, but only one thread can succeed; the others must retry.
32. Why Lock-Free Often Means Better Scalability (II). Lock: all threads wait for one. Lock-free: no waiting, but only one thread can succeed; the others often need to retry.
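The contrast on these two slides can be sketched in Java. This is an illustrative example, not code from the talk: a lock-based counter makes every other thread block on the lock, while a lock-free counter built on compare-and-swap (CAS) never blocks, but a thread that loses the race must retry.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Illustrative sketch (not from the slides): lock-based vs. lock-free counter.
public class CounterDemo {

    static class LockedCounter {
        private int value;
        // Only one thread at a time may enter; the rest block on the lock.
        public synchronized int increment() { return ++value; }
    }

    static class LockFreeCounter {
        private final AtomicInteger value = new AtomicInteger();

        // Explicit CAS retry loop: no thread ever blocks, but a thread
        // that loses the race must retry with the freshly observed value.
        public int increment() {
            for (;;) {
                int current = value.get();
                int next = current + 1;
                if (value.compareAndSet(current, next)) {
                    return next;
                }
                // CAS failed: another thread updated the value first; retry.
            }
        }

        public int get() { return value.get(); }
    }

    public static void main(String[] args) throws InterruptedException {
        LockFreeCounter counter = new LockFreeCounter();
        Thread[] threads = new Thread[4];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < 10_000; j++) counter.increment();
            });
            threads[i].start();
        }
        for (Thread t : threads) t.join();
        System.out.println(counter.get()); // prints 40000: no update is lost
    }
}
```

In practice `AtomicInteger.incrementAndGet()` performs this same CAS loop internally; the explicit loop is spelled out here only to make the "only one succeeds, others retry" behavior visible.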
33. Performance of a Lock-Free Stack. Picture from: http://www.infoq.com/articles/scalable-java-components
34. Performance of a Lock-Free HashMap. Picture from: "A Fast Lock-Free Hash Table" by Cliff Click
What if all of the preceding best practices cannot meet your needs, and you would like to optimize your application manually?
msdk – a tool for detailed performance analysis of concurrent Java applications. It performs an in-depth analysis of the complete execution stack, from the hardware up to the application layer, gathering information from all four layers of the stack: hardware, operating system, JVM, and application.
For multi-threaded applications, the lock-free approach differs from the lock-based approach in several respects. When accessing a shared resource, the lock-based approach allows only one thread into the critical section while the others wait for it. The lock-free approach, on the contrary, allows every thread to attempt to modify the shared state; only one of them can succeed, and all the other threads detect that their action failed, so they retry or choose another action.
The real difference appears when something bad happens to the running thread. If a running thread is paused by the OS scheduler, the two approaches behave differently:
Lock-based approach: all other threads wait for the paused thread, and no one can make progress.
Lock-free approach: the other threads remain free to perform any operation; the paused thread may simply fail its current operation.
From this difference we can see that in a multi-core environment the lock-free approach has the advantage: it scales better because threads do not wait for each other. It does waste some CPU cycles under contention, but for most workloads this is not a problem, since we have more than enough CPU resources.
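The paused-thread scenario above is easiest to see in a concrete data structure. Below is a sketch of the classic Treiber-style lock-free stack (an assumed example, not code from the talk): a thread paused between reading `top` and issuing the compare-and-set simply fails its CAS when it resumes and retries, while every other thread keeps making progress.

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch of a Treiber-style lock-free stack (illustrative, not from the slides).
public class LockFreeStackDemo {

    static class LockFreeStack<T> {
        private static class Node<E> {
            final E item;
            Node<E> next;
            Node(E item) { this.item = item; }
        }

        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        public void push(T item) {
            Node<T> newHead = new Node<>(item);
            Node<T> oldHead;
            do {
                oldHead = top.get();      // observe the current top
                newHead.next = oldHead;   // link the new node behind it
                // If another thread changed top in the meantime, the CAS
                // fails and we simply retry; no thread ever blocks.
            } while (!top.compareAndSet(oldHead, newHead));
        }

        public T pop() {
            Node<T> oldHead;
            Node<T> newHead;
            do {
                oldHead = top.get();
                if (oldHead == null) return null; // empty stack
                newHead = oldHead.next;
            } while (!top.compareAndSet(oldHead, newHead));
            return oldHead.item;
        }
    }

    public static void main(String[] args) {
        LockFreeStack<String> stack = new LockFreeStack<>();
        stack.push("a");
        stack.push("b");
        System.out.println(stack.pop()); // prints "b" (LIFO order)
        System.out.println(stack.pop()); // prints "a"
        System.out.println(stack.pop()); // prints "null": stack is empty
    }
}
```

Note that no thread holds a lock at any point: a thread descheduled in the middle of `push` cannot block anyone else, which is exactly the property the paragraph above describes.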