Virtual machines don't have to be slow; they don't even have to be slower than native code.
All you have to do is write your code, sit back, and let the JVM do its magic!
Learn about the JVM's various runtime optimizations and why it is considered one of the best VMs in the world.
4. Introduction
In the past, the JVM was considered by many to be Java's Achilles' heel. An interpreter?!
The JVM team improved performance by 300 to 3000 times in JDK 1.6 compared to JDK 1.0.
Java is measured to be 50% to 100+% of the speed of C and C++ (see Jake2 vs. Quake2).
How can it be?
5. Java Virtual Machines Zoo
CEE-J, Excelsior JET, Hewlett-Packard, J9 (IBM), Jbed, Jblend, JRockit, MRJ, MicroJvm, MS JVM, OJVM, PERC, Blackdown Java, CVM, Gemstone, Golden Code Development, Intent, Novell, NSIcom CrE-ME, ChaiVM, HotSpot, AegisVM, Apache Harmony, CACAO, Dalvik, IcedTea, IKVM.NET, JAmiga, JamVM, Jaos, JC, Jelatine JVM, JESSICA, Jikes RVM, JNode, JOP, Juice, Jupiter, JX, Kaffe, leJOS, Mika VM, Mysaifu, NanoVM, SableVM, Squawk virtual machine, SuperWaba, TinyVM, VMkit (of the Low Level Virtual Machine project), Wonka VM, Xam
6. HotSpot Virtual Machine
Developed by Longview Technologies back in 1999. Contains:
- Class loader
- Bytecode interpreter
- 2 virtual machines
- 7 garbage collectors
- 2 compilers
- Runtime libraries
7. HotSpot Virtual Machine
Configured by hundreds of -XX flags. Reminder:
- -X options are non-standard
- -XX options have specific system requirements for correct operation
- Both are subject to change without notice
9. GC Is Slow?
GC has a bad performance reputation:
- Reduces throughput
- Introduces pauses
- Unpredictable
- Uncontrolled
- Performance degradation is proportional to object count
Just give me the damn free() and malloc()! I'll be just fine! Is it so?
10. Generational Collectors
Weak generational hypothesis:
- Most objects die young (AKA infant mortality)
- Few old-to-young references
Generations are regions holding objects of different ages. GC is done separately once a generation fills, using different GC algorithms:
- The young (nursery) generation, collected by a "minor garbage collection"
- The old (tenured) generation, collected by a "major (full) garbage collection"
11. GC Magic 101
Young is better than tenured: let your objects die in the young generation, when possible and when it makes sense.
12. GC Magic 101
Swapping is bad: the application's memory footprint should not exceed the available physical memory.
15. Tracking Collector Algorithms
Mark-Sweep collector:
- The mark phase marks each reachable object
- The sweep phase "sweeps" the heap: non-marked objects are reclaimed as garbage
Copying collector:
- The heap is divided into two equal spaces
- When the active space fills, live objects are copied to the unused space; only live objects are examined
- The roles of the spaces are then flipped
16. Compaction
The collector moves all live objects to the bottom of the heap, and the remaining memory is reclaimed. This reduces the cost of object allocation and removes potential fragmentation. The drawback is slower completion of the GC.
17. The Young Generation
Consists of Eden plus two survivor spaces; objects are initially allocated in Eden. All HotSpot young collectors are stop-the-world copying collectors (done in parallel for the parallel garbage collectors). Collections are relatively fast, proportional to the number of live objects.
19. The Tenured Generation
Objects surviving several GC cycles are promoted to the tenured generation (use -XX:MaxTenuringThreshold=# to change). The collector algorithms used are variations of Mark-Sweep, which is more space efficient. Characteristics:
- Lower garbage density
- Bigger heap space
- Fewer GC cycles
24. Garbage First (G1)
New in JDK 1.6u14 (May 29th). All memory is divided into 1MB buckets. G1 calculates object liveness per bucket and drops "dead" buckets; if a bucket is not total garbage, it's not dropped. It collects the buckets with the most garbage first, and pauses only on "mark" (no sweep). The user can provide pause time goals, either in actual seconds or as a percentage of runtime; G1 records bucket collection times and can estimate how many buckets to collect during a pause.
25. Garbage First (G1)
Targets multi-processor machines and large heaps. G1 will be the long-term replacement for the CMS collector. Unlike CMS, it compacts to battle fragmentation: a bucket's space is fully reclaimed, giving better throughput and (with high probability) predictable pauses. Garbage left in buckets with a high live ratio may be collected later.
26. Benefits of G1
- No imbalance of young and tenured generations: generations are only logical, merely sets of buckets
- More predictable GC pauses
- Parallelism and concurrency in collections
- No fragmentation, due to compaction
- Better heap utilization
- Better GC ergonomics
27. Young GCs in G1
Done using evacuation pauses: stop-the-world parallel collections that evacuate surviving objects between sets of buckets.
28. Old GCs in G1
Drops dead buckets, calculates liveness info per bucket, and identifies the best buckets for subsequent evacuation pauses, which are collected piggy-backed on young GCs.
30. GC Ergonomics
The goal of ergonomics is to provide good performance with little or no tuning, better matching the needs of different application types. The HotSpot VM, garbage collector, and heap size are automatically chosen, based on the OS, RAM, and number of CPUs. The server vs. client machine class hints at the characteristics of the application.
32. GC Ergonomics
With the parallel collectors, one can specify performance goals, in contrast to specifying the heap size; this improves performance for large applications.
Max Pause Time Goal: use -XX:MaxGCPauseMillis=<N>, applied to each generation separately (or interpreted as average plus variance). There is no pause time goal by default.
33. GC Ergonomics
Throughput Goal: use -XX:GCTimeRatio=<N>. The ratio of GC vs. application time is 1/(1+N); if N=19, the GC time goal is 1/(1+19), or 5%. The default N is 99, meaning a GC time goal of 1%.
Minimum Footprint Goal.
Priority of goals: (1) maximum pause time goal, (2) throughput goal, (3) minimum footprint goal.
34. GC Ergonomics
Performance goals may not be met: the pause time and throughput goals are somewhat contradictory, since the pause time goal shrinks the generation while the throughput goal grows it. Statistics are kept by the GC, so it is adaptive to changes in application behavior.
36. Heap Size
The larger the heap space, the better, for both the young and old generations. A larger space means less frequent GCs, lower GC overhead, and objects more likely to become garbage; a smaller space means faster GCs (not always! see later). Sometimes the max heap size is dictated by available memory and/or the max space the JVM can address. You have to find a good balance between the young and old generation sizes.
37. Heap Size
Maximize the number of objects reclaimed in the young generation. The application's memory footprint should not exceed the available physical memory: swapping is bad. The above apply to all our GCs.
38. Heap Size
- -Xmx<size>: max heap size (young generation + old generation)
- -Xms<size>: initial heap size (young generation + old generation)
- -Xmn<size>: young generation size
- -XX:PermSize=<size>: permanent generation initial size
- -XX:MaxPermSize=<size>: permanent generation max size
39. Heap Size
When -Xms != -Xmx, heap growth or shrinking requires a Full GC. Set -Xms to the desired heap size, and set -Xmx even higher "just in case": even a full GC is better than an OOM crash. The same applies to -XX:PermSize and -XX:MaxPermSize, and to -XX:NewSize and -XX:MaxNewSize (-Xmn combines both).
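As a sketch, the sizing advice above might translate into a launch line like the following. The sizes and the app.jar name are hypothetical placeholders; real values should come from measuring your application.

```shell
# Hypothetical sizing: -Xms set to the measured working size, -Xmx higher
# "just in case", an explicit young generation, and permanent generation
# headroom. Replace every value after measuring your own workload.
java -Xms2g -Xmx3g -Xmn512m \
     -XX:PermSize=128m -XX:MaxPermSize=256m \
     -jar app.jar
```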
40. Tenuring
Measure tenuring with -XX:+PrintTenuringDistribution. Avoid tenuring for short- or even medium-lived objects: less promotion into the old generation means less frequent old GCs. Promote long-lived objects ASAP (yeah, this conflicts with the previous point): better to copy more than to promote more. -XX:TargetSurvivorRatio=<percent> (e.g., 50) controls how much of the survivor space should be filled; typically leave extra space to deal with "spikes".
41. Permanent Space
Classes aren't unloaded by default; use -XX:+CMSClassUnloadingEnabled to enable it. The classloader must be collected first: it holds references to its classes, and each class holds a reference to its classloader.
43. GC Statistics Options
GC logging has extremely low to non-existent overhead, and it's very helpful when diagnosing production issues. Enable it, in production too:
- -XX:+PrintGC
- -XX:+PrintGCDetails
- -XX:+PrintGCTimeStamps
- -XX:+PrintTenuringDistribution (shows the tenuring threshold and the ages of objects in the new generation)
44. GC Is Slow? – The Answers
- Reduces throughput: you choose
- Introduces pauses: you choose
- Unpredictable: not any more
- Uncontrolled: configurable
- Performance degradation is proportional to object count: not true
- Just give me the damn free() and malloc()! I'll be just fine!: bad idea (see more later)
46. HotSpot Optimizations
- JIT compilation
- Compiler optimizations: generates more performant code than you could write natively by hand
- Adaptive optimization
- Split time verification
- Class data sharing
47. Two Virtual Machines?
Client VM: reduces start-up time and memory footprint (-client command-line flag).
Server VM: maximizes program execution speed (-server command-line flag).
Auto-detection: a server-class machine has 2+ CPUs and >= 2GB of physical memory. Win32 is always detected as client; many 64-bit OSes don't have client VMs.
48. Just-In-Time Compilation
Everyone knows about JIT: hot code is compiled to native. What is "hot"?
- Server VM: 10,000 invocations
- Client VM: 1,500 invocations
Use -XX:CompileThreshold=# to change. More invocations before compiling means better optimizations; fewer invocations means shorter warmup time.
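A minimal sketch of why the threshold matters (the method and iteration counts here are illustrative, not from the slides): the first calls run interpreted, and only after the invocation count passes the compile threshold does HotSpot emit native code, so timings taken before warmup understate steady-state speed.

```java
public class Warmup {
    // A small, hot method: a JIT compilation candidate once its
    // invocation count passes -XX:CompileThreshold.
    static long sumOfSquares(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) s += (long) i * i;
        return s;
    }

    public static void main(String[] args) {
        // Warmup phase: drive the invocation count past the threshold
        // (20,000 exceeds the server VM's default of 10,000).
        for (int i = 0; i < 20_000; i++) sumOfSquares(100);

        // Only now measure: the method should be running as native code.
        long start = System.nanoTime();
        long result = sumOfSquares(1_000_000);
        long micros = (System.nanoTime() - start) / 1_000;
        System.out.println("result=" + result + ", time=" + micros + "us");
    }
}
```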
50. Adaptive Optimization
Allows HotSpot to deoptimize (uncompile) previously compiled code. This lets it perform much more aggressive, even speculative, optimizations, and roll them back if something goes wrong or new data is gathered; e.g., classloading might invalidate inlining.
51. Split Time Verification
Java suffers from a long boot time, and one of the reasons is bytecode verification: valid flow control, type safety, visibility. To ease the load on the weak KVM, J2ME started performing part of the verification at compile time. It's good, so now it's in Java SE 6 too.
52. Class Data Sharing
Helps improve startup time. During JDK installation, part of rt.jar is preloaded into a shared memory file, which is attached at runtime. There is no need to reload and re-verify those classes every time.
54. Two Types of Optimizations
Java has two compilers: the javac bytecode compiler and the HotSpot VM JIT compiler. Both implement similar optimizations, but the bytecode compiler is limited by dynamic linking: it can apply only static optimizations.
55. Warning
Caution! Don't try this at home! The source code you are about to see is not real; it's pseudo-assembly code. Don't write such code! Source code should be readable and object-oriented; the bytecode will become performant automagically.
56. Optimization Rules
- Make the common case fast
- Don't worry about the uncommon/infrequent case
- Defer optimization decisions until you have data
- Revisit decisions if the data warrants it
57. Null Check Elimination
Java is a null-safe language: a pointer can't point to a meaningless portion of memory. Null checks are added by the compiler, and a NullPointerException is thrown on failure. The JVM's profiler can eliminate those checks.
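A sketch of where such a check conceptually lives (the Point class and method are illustrative):

```java
public class NullCheckDemo {
    static class Point {
        int x;
        Point(int x) { this.x = x; }
    }

    static int getX(Point p) {
        // Conceptually the JVM guards the dereference with:
        //   if (p == null) throw new NullPointerException();
        // If profiling shows p is never null here, the explicit check
        // can be eliminated.
        return p.x;
    }

    public static void main(String[] args) {
        System.out.println(getX(new Point(7)));
    }
}
```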
60. Inlining
Love encapsulation? Getters and setters. Love clean and simple code? Small methods. Use static code analysis? Small methods. There is no penalty for using those: the JIT brings the implementation of these methods into the calling method. This optimization is known as "inlining".
61. Inlining
Not just about eliminating call overhead: inlining provides the optimizer with bigger blocks, enabling other optimizations such as hoisting, dead code elimination, code motion, and strength reduction.
62. Inlining
But wait, all public non-final methods in Java are virtual! HotSpot examines the exact case in place: in most cases there is only one implementation, which can be inlined. But wait, more implementations may be loaded later! In such a case HotSpot undoes the inlining; this is speculative inlining. By default, inlining is limited to 35 bytes of bytecode; use -XX:MaxInlineSize=# to change.
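A sketch of a virtual call that HotSpot can inline speculatively (the Shape/Circle names are made up for illustration):

```java
public class InlineDemo {
    interface Shape { double area(); }

    static final class Circle implements Shape {
        final double r;
        Circle(double r) { this.r = r; }
        // Tiny method body, well under the 35-byte bytecode limit:
        // an excellent inlining candidate.
        public double area() { return Math.PI * r * r; }
    }

    static double total(Shape[] shapes) {
        double sum = 0;
        // area() is virtual, but if Circle is the only loaded Shape
        // implementation, HotSpot can inline Circle.area() here, and
        // deoptimize later if another implementation is ever loaded.
        for (Shape s : shapes) sum += s.area();
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(total(new Shape[] { new Circle(1), new Circle(2) }));
    }
}
```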
66. Code Hoisting
Hoist = to raise or lift. A size optimization: eliminate duplicate code in method bodies by hoisting expressions or statements. Note that it is duplicate bytecode, not necessarily duplicate source code.
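A hand-written illustration of the idea (the variable names are invented): both branches contain the same subexpression, which can be hoisted out so the method body carries it only once.

```java
public class HoistDemo {
    // Before: base * rate appears (as bytecode) in both branches.
    static double before(boolean flag, double base, double rate,
                         double bonus, double penalty) {
        if (flag) return base * rate + bonus;
        else      return base * rate - penalty;
    }

    // After hoisting: the common subexpression is computed once,
    // shrinking the method body without changing its result.
    static double after(boolean flag, double base, double rate,
                        double bonus, double penalty) {
        double scaled = base * rate;
        return flag ? scaled + bonus : scaled - penalty;
    }

    public static void main(String[] args) {
        System.out.println(before(true, 10, 1.5, 2, 3) == after(true, 10, 1.5, 2, 3));
    }
}
```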
68. Bounds Check Elimination
Java promises automatic bounds checks for arrays; an exception is thrown on violation. If the programmer's own code already checks the array's bounds, the automatic check can be turned off.
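A sketch of the common case (the method is illustrative): the loop condition itself proves every index is in range, so the JIT can drop the per-access check.

```java
public class BoundsDemo {
    static long sum(int[] a) {
        long s = 0;
        // The loop condition guarantees 0 <= i < a.length, so the JIT
        // can eliminate the automatic bounds check on a[i].
        for (int i = 0; i < a.length; i++) {
            s += a[i];
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(sum(new int[] { 1, 2, 3 }));
    }
}
```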
71. Loop Unrolling
Some loops shouldn't be loops, in the performance sense, not the code readability sense. Those can be unrolled into a sequence of statements; if the boundaries are dynamic, a partial unroll will occur.
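A hand-unrolled sketch of what the JIT may do with a small constant bound (both methods are illustrative):

```java
public class UnrollDemo {
    // What you write: a loop with a small, constant trip count.
    static int looped(int[] a) {
        int acc = 0;
        for (int i = 0; i < 4; i++) acc += a[i];
        return acc;
    }

    // What the JIT may effectively emit: straight-line code with no
    // loop counter, no compare, and no branch per iteration.
    static int unrolled(int[] a) {
        return a[0] + a[1] + a[2] + a[3];
    }

    public static void main(String[] args) {
        int[] a = { 1, 2, 3, 4 };
        System.out.println(looped(a) == unrolled(a));
    }
}
```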
74. Escape Analysis
Escape analysis is not an optimization itself; it is a check that an object does not escape its local scope, e.g., it is created in a private method, assigned to a local variable, and not returned. Escape analysis opens up possibilities for lots of optimizations.
75. Scalar Replacement
Remember the rule "new == always a new object"? False! The JVM can optimize allocations away: the fields are hoisted into registers, and the object itself becomes unneeded. But object creation is cheap! Yep, but GC is not so cheap...
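A sketch of an allocation that escape analysis can remove via scalar replacement (the Vec class is invented for illustration):

```java
public class ScalarDemo {
    static final class Vec {
        final int dx, dy;
        Vec(int dx, int dy) { this.dx = dx; this.dy = dy; }
    }

    static int distSq(int x1, int y1, int x2, int y2) {
        // 'v' never escapes this method: it is not returned and not
        // stored anywhere. The JIT may therefore skip the allocation
        // entirely and keep dx/dy in registers (scalar replacement),
        // leaving nothing for the GC to collect.
        Vec v = new Vec(x2 - x1, y2 - y1);
        return v.dx * v.dx + v.dy * v.dy;
    }

    public static void main(String[] args) {
        System.out.println(distSq(0, 0, 3, 4));
    }
}
```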
79. Lock Coarsening
HotSpot merges adjacent synchronized blocks that use the same lock, and the compiler is allowed to move statements into the merged, coarse blocks. This trades off performance against responsiveness: it reduces the instruction count, but locks are held longer.
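A sketch of the transformation (the class and field names are made up):

```java
public class CoarsenDemo {
    private final Object lock = new Object();
    private int a, b;

    // What you write: two adjacent blocks on the same lock.
    void fineGrained() {
        synchronized (lock) { a++; }
        synchronized (lock) { b++; }
    }

    // What the JIT may effectively produce: one acquire/release pair
    // instead of two, at the cost of holding the lock slightly longer.
    void coarsened() {
        synchronized (lock) {
            a++;
            b++;
        }
    }

    int sum() { return a + b; }

    public static void main(String[] args) {
        CoarsenDemo d = new CoarsenDemo();
        d.fineGrained();
        d.coarsened();
        System.out.println(d.sum());
    }
}
```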
82. Lock Elision
When a thread enters a lock that no other thread will ever synchronize on, the synchronization has no effect. This can be deduced using escape analysis, and such locks can be elided. For example, a method that builds a string with a local StringBuffer makes 4 synchronized calls, all of which can be elided.
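A hypothetical example in the spirit of the slide: a local StringBuffer makes four synchronized calls (three appends plus toString) on an object that never escapes, so every one of those lock operations can be elided.

```java
public class ElisionDemo {
    static String greet(String name) {
        // 'sb' never escapes this method, so escape analysis proves no
        // other thread can ever lock it; all four synchronized calls
        // below can be elided.
        StringBuffer sb = new StringBuffer();
        sb.append("Hello, ");  // synchronized call 1
        sb.append(name);       // synchronized call 2
        sb.append("!");        // synchronized call 3
        return sb.toString();  // synchronized call 4
    }

    public static void main(String[] args) {
        System.out.println(greet("JVM"));
    }
}
```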
84. Constant Folding
A trivial optimization. How many constants are there? More than you think: inlining generates constants, unrolling generates constants, escape analysis generates constants. The JIT determines what is constant at runtime: whatever doesn't change.
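A minimal illustration (the expression is invented): constant expressions written in source are already folded by javac at compile time, and the JIT similarly folds values it observes never change at runtime.

```java
public class FoldDemo {
    static int secondsPerDay() {
        // javac folds this at compile time: the bytecode simply pushes
        // the constant 86400, with no multiplications at runtime.
        return 60 * 60 * 24;
    }

    public static void main(String[] args) {
        System.out.println(secondsPerDay());
    }
}
```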
87. Dead Code Elimination
Dead code: code that has no effect on the outcome of the program execution.

    public static void main(String[] args) {
        long start = System.nanoTime();
        int result = 0;
        for (int i = 0; i < 10 * 1000 * 1000; i++) {
            result += Math.sqrt(i);
        }
        long duration = (System.nanoTime() - start) / 1000000;
        System.out.format("Test duration: %d (ms) %n", duration);
    }
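In that benchmark, result is never used, so the JIT may eliminate the entire loop and report a near-zero duration. A common fix, sketched below (the exact restructuring is our suggestion, not from the slide), is to make the computation's result observable:

```java
public class DceDemo {
    // Returning the accumulated value gives the loop an observable
    // result, so the JIT cannot discard it as dead code.
    static int accumulate(int n) {
        int result = 0;
        for (int i = 0; i < n; i++) {
            result += Math.sqrt(i);  // compound assignment narrows the double back to int
        }
        return result;
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        int result = accumulate(10 * 1000 * 1000);
        long duration = (System.nanoTime() - start) / 1000000;
        // Printing 'result' keeps the computation live end to end.
        System.out.format("Test duration: %d (ms), result: %d %n", duration, result);
    }
}
```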
88. OSR - On-Stack Replacement
Normally, code is switched from interpretation to native at method entry. OSR switches from interpreted to compiled code in the middle of a method, in the local context: the JVM tracks code block execution counts. OSR-compiled code gets fewer optimizations; it may prevent bounds check elimination and loop unrolling.
92. How Can I Help?
The final keyword:
- For fields: allows caching; allows lock coarsening
- For methods: simplifies inlining decisions
Immutable objects die younger.
93. JVM Tuning Tips
Reminder: -XX options are non-standard, added for HotSpot development purposes, mostly tested on Solaris 10, and platform dependent. Some options may contradict each other. Know and experiment with these options.
95. References
- The HotSpot Home Page
- Java HotSpot VM Options
- Dynamic compilation and performance measurement
- Urban performance legends, revisited
- Synchronization optimizations in Mustang
- Robust Java benchmarking
- Garbage Collection Tuning
96. References
JavaOne 2009 Sessions:
- Garbage Collection Tuning in the Java HotSpot™ Virtual Machine
- Under the Hood: Inside a High-Performance JVM™ Machine
- Practical Lessons in Memory Analysis
- Debugging Your Production JVM™ Machine
- Inside Out: A Modern Virtual Machine Revealed