The document discusses various techniques for optimizing Java application performance, including:
1. Classic techniques like tuning JVM options, garbage collection, and memory settings.
2. Code optimizations like avoiding object creation, using primitives over wrappers, and micro-optimizations.
3. Tools for analyzing performance like JVisualVM, JMH, and different profilers.
The document provides examples and comparisons of different optimizations and their performance impacts. It also discusses considerations for optimizing embedded and Android applications differently due to their unique constraints.
Confidential
• The goal is that each processing element be kept as busy as possible
doing useful work. This entails satisfying four requirements: breaking
problems into independent subproblems that can be executed
concurrently, distributing these subproblems appropriately among the
processing elements, making sure that the necessary data is close to its
processing element, and overlapping communication with computation
where possible.
Performance
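The four requirements above map directly onto Java's Fork/Join framework, which breaks a problem into independent subtasks and distributes them across worker threads. A minimal sketch (the class name, threshold, and workload are illustrative):

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

public class ParallelSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000; // below this, sum sequentially
    private final long[] data;
    private final int from, to;

    public ParallelSum(long[] data, int from, int to) {
        this.data = data; this.from = from; this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {       // small enough: do useful work directly
            long sum = 0;
            for (int i = from; i < to; i++) sum += data[i];
            return sum;
        }
        int mid = (from + to) / 2;          // break into independent subproblems
        ParallelSum left = new ParallelSum(data, from, mid);
        ParallelSum right = new ParallelSum(data, mid, to);
        left.fork();                        // distribute the left half to another worker
        long rightSum = right.compute();    // keep this thread busy in the meantime
        return left.join() + rightSum;      // overlap computation with the join
    }

    public static long sum(long[] data) {
        return ForkJoinPool.commonPool().invoke(new ParallelSum(data, 0, data.length));
    }
}
```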
In computer science, the time complexity is the
computational complexity that describes the amount of
time it takes to run an algorithm.
Time complexity
1. A Graph is a non-linear data
structure consisting of nodes
and edges. The nodes are
sometimes also referred to as
vertices and the edges are lines
or arcs that connect any two
nodes in the graph.
Graph
Adjacency list is a collection of
unordered lists used to represent a
finite graph. Each list describes the
set of neighbors of a vertex in the
graph.
Graph: Adjacency list
Input | Output
1 | 5
2 | 6
3 | 2, 5
4 | 5
5 | 1, 4
6 | 2
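The mapping above can be represented directly with a map from each vertex to its neighbor list. A minimal sketch using only JDK collections (the class name is ours):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class AdjacencyListGraph {
    // Each vertex maps to the unordered list of its neighbors
    private final Map<Integer, List<Integer>> adj = new HashMap<>();

    public void addEdge(int from, int to) {
        adj.computeIfAbsent(from, k -> new ArrayList<>()).add(to);
    }

    public List<Integer> neighbors(int v) {
        return adj.getOrDefault(v, List.of());
    }
}
```

Populating it with the edges from the table (1→5, 3→2, 3→5, …) reproduces exactly the Output column per vertex.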
Adjacency matrix is a square matrix used to represent a finite graph.
The elements of the matrix indicate whether pairs of vertices are
adjacent or not in the graph.
Graph: Adjacency matrix
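The same structure as a matrix: a boolean cell per vertex pair, costing O(V²) space but giving O(1) edge lookup. A minimal sketch (class name is ours):

```java
public class AdjacencyMatrixGraph {
    private final boolean[][] adjacent; // adjacent[i][j] == true iff edge i -> j

    public AdjacencyMatrixGraph(int vertexCount) {
        adjacent = new boolean[vertexCount][vertexCount];
    }

    public void addEdge(int from, int to) {
        adjacent[from][to] = true;
    }

    public boolean hasEdge(int from, int to) {
        return adjacent[from][to];
    }
}
```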
• JGraphT if you are more interested in data structures and algorithms.
• JGraph if your primary focus is visualization.
• JUNG, yWorks, and BFG are other options that have seen use.
• Prefuse is best avoided, since most of it has to be rewritten.
• Google Guava if you only need good data structures.
• Apache Commons Graph.
Java & Graph
import java.net.URL;

import org.jgrapht.Graph;
import org.jgrapht.graph.DefaultDirectedGraph;
import org.jgrapht.graph.DefaultEdge;

Graph<URL, DefaultEdge> g = new DefaultDirectedGraph<>(DefaultEdge.class);

// URL's constructor declares MalformedURLException
URL amazon = new URL("http://www.amazon.com");
URL yahoo = new URL("http://www.yahoo.com");
URL ebay = new URL("http://www.ebay.com");

// add the vertices
g.addVertex(amazon);
g.addVertex(yahoo);
g.addVertex(ebay);

// add edges to create the linking structure
g.addEdge(yahoo, amazon);
g.addEdge(yahoo, ebay);
(#8.1)
"You stick it [the option] into your config, it actually improves something, and you award yourself a little star: 'I know how to tune the JVM'."
JVM Options
• The following shows the relationship between the memory size, the number of GC executions, and the GC execution time.
• Large memory size
- decreases the number of GC executions.
- increases the GC execution time.
• Small memory size
- decreases the GC execution time.
- increases the number of GC executions.
• 10 GB is fine if the server resources are good and a Full GC can be completed within 1 second even with the memory set to 10 GB. But most servers are not in that state: when the memory is set to 10 GB, a Full GC takes about 10-30 seconds. Of course, the time may vary according to the object sizes.
Setting Memory Size
• How should we set the memory size?
• Based on the current status before GC tuning, check the memory size left after a Full GC. If there is about 300 MB left after a Full GC, it is good to set the memory to 1 GB (300 MB for default usage + 500 MB minimum for the Old area + 200 MB of free memory). That means you should allow more than 500 MB for the Old area. Therefore, if you have three operation servers, set one server to 1 GB, one to 1.5 GB, and one to 2 GB, and then compare the results.
Setting Memory Size
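The sizing experiment above translates into plain heap flags on each server. A sketch of the three candidate configurations (app.jar is a placeholder):

```shell
# Compare three heap sizes after measuring post-Full-GC usage
java -Xms1g    -Xmx1g    -jar app.jar   # server 1
java -Xms1536m -Xmx1536m -jar app.jar   # server 2
java -Xms2g    -Xmx2g    -jar app.jar   # server 3
```

Setting -Xms equal to -Xmx avoids heap resizing pauses during the comparison.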
• NewRatio is the ratio of the New area to the Old area. If -XX:NewRatio=1, New area:Old area is 1:1. For 1 GB, New area:Old area is 500 MB:500 MB. If NewRatio is 2, New area:Old area is 1:2. Therefore, as the value gets larger, the Old area gets larger and the New area gets smaller.
• If the New area is small, more objects are promoted to the Old area, causing frequent Full GCs that take a long time to handle.
Setting Memory Size: NewRatio
• In the analysis, focus on the following. The most important factor in choosing a GC option is the Full GC execution time.
Analyzing GC Tuning Results
• Case 1: -XX:+UseParallelGC -Xms1536m -Xmx1536m -XX:NewRatio=2
• Case 2: -XX:+UseParallelGC -Xms1536m -Xmx1536m -XX:NewRatio=3
• Case 3: -XX:+UseParallelGC -Xms1g -Xmx1g -XX:NewRatio=3
• Case 4: -XX:+UseParallelOldGC -Xms1536m -Xmx1536m -XX:NewRatio=2
• Case 5: -XX:+UseParallelOldGC -Xms1536m -Xmx1536m -XX:NewRatio=3
• Case 6: -XX:+UseParallelOldGC -Xms1g -Xmx1g -XX:NewRatio=3
GC Cases
• If you try to decrease the Old area size to shorten the Full GC execution time, OutOfMemoryError may occur or the number of Full GCs may increase.
• Alternatively, if you try to decrease the number of Full GCs by increasing the Old area size, each execution will take longer.
GC Tuning: jstat
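jstat attaches to any running JVM by process id, with no startup flags required. A typical invocation (the pid 12345 is a placeholder):

```shell
# Print heap-area utilization (%) and GC counts/times every 1000 ms, 10 samples
jstat -gcutil 12345 1000 10
```

Watching the FGC and FGCT columns over time shows how often Full GCs run and how long they take in total.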
• -verbosegc is a JVM option specified when starting a Java application. While jstat can monitor any JVM application without requiring any startup options, -verbosegc has to be specified up front, so it may look like an unnecessary option (since jstat can be used instead). However, because -verbosegc prints easy-to-understand output whenever a GC occurs, it is very helpful for getting a rough picture of GC behavior.
• HPjmeter
-verbosegc
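Unlike jstat, this flag must be present at startup. A sketch (app.jar is a placeholder; on JDK 9+ the unified-logging form -Xlog:gc is the equivalent):

```shell
# Classic form: print a line on every GC event
java -verbose:gc -jar app.jar

# JDK 9+ unified logging equivalent
java -Xlog:gc -jar app.jar
```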
• In embedded environments we usually have very restricted resources in terms of memory consumption, CPU performance, and battery lifetime. Additionally, there are often other restrictions, like real-time requirements.
Embedded or Android Java Project
Embedded Project
• When I was given the task of writing an
Android application, I decided to research
Dependency Injection Frameworks for
mobiles … It soon dawned on us that the
root of the problem was the DI framework.
It was searching for all injection resources
and references, while the app was starting
and trying to perform all the wiring at the
beginning of the app’s life cycle … Our
solution was to hard-wire the resources.
Even though this gave us an “uglier” app,
the app started up in lightning speed,
solving our performance issue.
Embedded or Android Java Project
Android Java Project
1. Don't use any frameworks for server applications
2. Divide your development process into two separate parts:
1) Designing a classic application using minimal frameworks
2) Applying performance-tuning templates and anti-refactoring
3. Use tools to analyze performance on the target device
Embedded or Android Java Project
Throughput: operations per unit of time
• Format: ops/s
• Higher value is better

AverageTime: average time per operation
• Format: ns/op
• Smaller value is better
Micro optimizations
Summa by Iterator:
int sum = 0;
for (int i = 0; i < vals.length; i++) {
    sum += vals[i];
}

Summa by Lambda Reduction:
OptionalInt summa = Arrays.stream(vals)
    .parallel()
    .reduce((a, b) -> a + b);
Micro optimizations (#8.3)
Find by Lambda Reduction:
int findVal = 3;
int[] ar = new int[] {1, 2, 3};
boolean isFind = IntStream.of(ar)
    .anyMatch(i -> i == findVal);

Find by Iterator:
int findVal = 3;
int[] ar = new int[] {1, 2, 3};
boolean isFind = false;
for (int i = 0; i < ar.length; i++)
    if (ar[i] == findVal) {
        isFind = true;
        break;
    }
Micro optimizations (~10x)
• JMH is a Java harness for building, running, and analyzing nano/micro/milli/macro benchmarks written in Java and other languages targeting the JVM.
• Part of the OpenJDK code-tools project
• Used extensively within OpenJDK to test the internals
• Keeps pace with changes in the JVM
• Brings a scientific approach to benchmarking
Java Benchmarking: org.openjdk.jmh
• A Maven-based project. Bundles the benchmark code with the working jar
• A quick common annotation list
A quick look at JMH working
Annotation | Function
@Benchmark | Marks the method for benchmarking
@BenchmarkMode | Defines the mode of the benchmark, like AverageTime or Throughput
@Warmup | Defines the warm-up cycles
@Measurement | Defines the measurement iterations
@Fork | Number of VM forks
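A real benchmark would use the annotations above and requires the org.openjdk.jmh dependency. As a dependency-free sketch of the same pattern — warm-up iterations first, then measured iterations reporting ops/s, like @Warmup / @Measurement / @BenchmarkMode(Throughput) — the class and method names here are ours:

```java
import java.util.function.LongSupplier;

public class TinyHarness {
    // Warm the workload up (letting the JIT compile it), then measure throughput.
    public static double throughput(LongSupplier workload, int warmupOps, int measuredOps) {
        for (int i = 0; i < warmupOps; i++) {
            workload.getAsLong();                 // warm-up cycles, results discarded
        }
        long start = System.nanoTime();
        long sink = 0;
        for (int i = 0; i < measuredOps; i++) {
            sink += workload.getAsLong();         // keep the result live
        }
        long elapsedNs = System.nanoTime() - start;
        if (sink == 42) System.out.println(sink); // defeat dead-code elimination
        return measuredOps / (elapsedNs / 1e9);   // operations per second
    }
}
```

JMH does far more than this (forked VMs, blackholes, statistical analysis), which is exactly why hand-rolled harnesses like this one are only a teaching sketch.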
• The Simple Logging Facade for Java (slf4j) is a simple facade for various logging frameworks, like JDK logging (java.util.logging), log4j, or logback. It even contains a binding that delegates all logger operations to another well-known logging facade, Jakarta Commons Logging (JCL).
• Logback is the successor of the log4j logger API; in fact, both projects have the same author, but logback offers some advantages over log4j, like better performance and lower memory consumption, automatic reloading of configuration files, or filter capabilities, to cite a few features.
Log
1. Did this ever make sense?
2. Yes, on these assumptions:
- can ignore constant factors
- all instructions have the same duration
- memory doesn’t matter
- instruction execution dominates performance
3. But instruction execution is only one
bottleneck:
- Disk/Network
- Garbage Collection
- Resource Contention and more…
Which List Implementation?
get() add() remove(0)
ArrayList O(1) O(1) O(N)
LinkedList O(N) O(1) O(1)
COWArrayList O(1) O(N) O(N)
Memory access time
action | approximate time (ns)
typical processor instruction | 1
fetch from L1 cache | 0.5
branch misprediction | 5
fetch from L2 cache | 7
mutex lock/unlock | 25
fetch from main memory | 100
send 2 kB over a 1 Gb/s network | 20 000
seek to a new disk location | 8 000 000
read 1 MB sequentially from disk | 20 000 000
• Linked List: node size is 24 bytes
• Running on Intel Core i5:
- L1data: 128K
- L2: 512K
- L3: 3M
• Each new list item is 40 bytes (24 + 16)
- L1 cache will be full at <3K items
• ArrayList is better: each new item is 20 bytes (4 + 16)
What’s Going On?
public final class String ... {
    private final char[] value;
    private int offset;
    private int count;
    public boolean equals(Object anObject) ...
Substring, JDK < 1.7.0_06
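Before JDK 1.7.0_06, substring() shared the parent's char[] through the offset/count fields shown above, so a tiny substring could pin a huge string's backing array in memory. The classic workaround copied the characters out. An illustrative sketch (class and method names are ours):

```java
public class SubstringLeak {
    // Pre-7u06: the substring view kept a reference to the entire backing
    // array of 'big'. Copying into a new String drops that reference, so
    // only the four needed characters stay reachable.
    public static String keyFrom(String big) {
        String view = big.substring(0, 4); // shares big's char[] on old JDKs
        return new String(view);           // copies just the 4 chars
    }
}
```

Since 7u06, substring() copies by itself, so the extra new String(...) is no longer needed.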
charAt:
public int charAt() {
    int r = 0;
    for (int c = 0; c < text.length(); c++) {
        r += text.charAt(c);
    }
    return r;
}

toCharArray:
public int toCharArray() {
    int r = 0;
    char[] chars = text.toCharArray();
    for (int c = 0; c < text.length(); c++) {
        r += chars[c];
    }
    return r;
}
charAt vs toCharArray
charAt:
public int charAt() {
    int r = 0;
    for (int c = 0; c < text.length(); c++) {
        emptyMethod();
        r += text.charAt(c);
    }
    return r;
}

toCharArray:
public int toCharArray() {
    int r = 0;
    char[] chars = text.toCharArray();
    for (int c = 0; c < text.length(); c++) {
        emptyMethod();
        r += chars[c];
    }
    return r;
}
charAt vs toCharArray
• Option #1 – synchronized methods:
public class SynchronizedCounterMethod {
    private int c = 0;
    public synchronized void increment() {
        c++;
        System.out.println("Current count value is " + c);
    }
}
The synchronized keyword on an instance method locks on this: while one thread holds the lock, any other thread calling a synchronized method on the same object blocks.
Synchronization
• Option #2 – synchronized blocks:
public class SynchronizedCounterCode {
    private int c = 0;
    public void increment() {
        synchronized (this) {
            c++;
        }
        System.out.println("Current count value is " + c);
    }
}
When synchronizing a block, the object to lock on must be supplied explicitly (here, this), which also lets the lock cover only the critical section.
Synchronization
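For a simple counter, both options above can be replaced by a lock-free counter from java.util.concurrent.atomic, which uses a single compare-and-swap instead of a monitor lock. A sketch (the wrapper class name is ours):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicCounter {
    private final AtomicInteger c = new AtomicInteger();

    public int increment() {
        return c.incrementAndGet(); // single CAS, no monitor lock, no blocking
    }

    public int value() {
        return c.get();
    }
}
```

Under contention this typically outperforms both synchronized variants, because threads never park waiting for a monitor.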
1) The main responsibility of a scheduler is to share a resource between consumers
2) Linux schedulers:
a) Filesystem (I/O) scheduler
b) Network scheduler
c) CPU scheduler
Linux scheduler
• The Completely Fair Scheduler (CFS) is a process scheduler which was
merged into the 2.6.23 (October 2007) release of the Linux kernel and is
the default scheduler. It handles CPU resource allocation for
executing processes, and aims to maximize overall CPU utilization while
also maximizing interactive performance.
Scheduler of CPU
• To preempt running threads, the OS uses interrupts
• Interrupts are generated by the timer
• CPU interrupt handling:
- Execution of the current instruction completes
- The Program Counter (PC) is saved
- The interrupt handler is invoked
• The interrupt handler launches the scheduler
Interrupts
• Traditionally, the timer fires HZ times a second
• Starting with kernel 2.6.21, the CONFIG_NO_HZ option appeared
• Starting with kernel 3.10, the CONFIG_NO_HZ_FULL option appeared
grep 'CONFIG_HZ=' /boot/config-$(uname -r)
How often does the timer fire?
● Real-time scheduling policies - sched/rt.c
○ SCHED_FIFO - a thread may be preempted only by a higher-priority thread
○ SCHED_RR - a thread may be preempted by a higher-priority thread or when its time quantum has expired
● Non-realtime scheduling policies - sched/fair.c
○ SCHED_BATCH - the thread is always assumed to be CPU-intensive
○ SCHED_IDLE - the thread is executed very rarely
○ SCHED_OTHER - the most commonly used policy
● Controlled by the sched_setscheduler system call
Scheduling policies
● Static priority
○ 0 - ordinary applications
○ 1-99 - real-time applications. Used for SCHED_FIFO and SCHED_RR
● The nice value (-20..19) is only used for SCHED_OTHER and SCHED_BATCH
Static priority and nice value
• We will begin by looking at a single CPU
• The CPU has a runqueue - the queue of tasks in the READY state
• The runqueue is a priority queue implemented as a red-black tree
• vruntime is used as the priority
Completely Fair Scheduler (CFS)
● vruntime = real runtime / task weight + start_min_vruntime
● vruntime is updated every time the timer fires
● Task weight depends on the nice value
● Nice value range: (-20, 19)
● Task weight = 1024 / 1.25^nice
vruntime
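The weight formula above can be checked numerically. A sketch (the kernel actually uses a precomputed prio_to_weight table, of which 1024 / 1.25^nice is the close approximation; the class name is ours):

```java
public class CfsWeight {
    // Approximate CFS task weight from the nice value: each nice step changes
    // the CPU share by about 1.25x, and nice 0 maps to the base weight 1024.
    public static long weight(int nice) {
        return Math.round(1024.0 / Math.pow(1.25, nice));
    }
}
```

So a task at nice -1 (weight 1280) earns vruntime 1.25x more slowly than a nice 0 task, receiving proportionally more CPU time.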
● Each processor core uses an independent runqueue
● Threads must be moved between queues to balance the load
CFS scheduler multicore
• The queue length cannot be used as the balancing criterion, because threads can have different priorities
• The total thread weight cannot be used either, because threads can sleep, so the cores would be loaded unevenly
• A load metric is needed that accounts for both thread weight and CPU utilization.
Balancing of threads between queues
• Another important feature of Java is its ability to load your compiled Java
classes (bytecode) following the start-up of the JVM. Depending on the
size of your application, the class loading process can be intrusive and
significantly degrade the performance of your application under high
load following a fresh restart. This short-term penalty can also be
explained by the fact that the internal JIT compiler has to start over its
optimization work following a restart.
Class Loading
• Profile your application for possible memory leaks using tools such as Plumbr (a Java memory leak detector).
• Performance Tip: Focus your analysis on the biggest Java object accumulation points. It is important to realize that reducing your application's memory footprint will translate into improved performance due to reduced GC activity.
• SET PATH=C:\Tools\plumbr\plumbr\win64;%PATH%
set CATALINA_OPTS=-agentlib:plumbr -javaagent:C:\Tools\plumbr\plumbr\plumbr.jar
Memory leak
• The Eclipse Memory Analyzer is a fast and feature-rich Java heap
analyzer that helps you find memory leaks and reduce memory
consumption.
• Use the Memory Analyzer to analyze production heap dumps with hundreds of millions of objects, quickly calculate the retained sizes of objects, see who is preventing the Garbage Collector from collecting objects, and run a report to automatically extract leak suspects.
• https://www.eclipse.org/mat/
Memory Analyzer (MAT) (#7)
• Memory prices are low and getting lower, and retrieving data from disk
or via a network is still expensive. Caching is certainly one aspect of
application performance we shouldn’t overlook.
• Of course, introducing a stand-alone caching system into the topology
of an application does add complexity to the architecture – so a good
way to start leveraging caching is to make good use of existing caching
capabilities in the libraries and frameworks we’re already using.
Architectural Improvements: Caching
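Before reaching for a stand-alone caching system, a minimal in-process cache needs no extra library at all: LinkedHashMap in access order evicts the least recently used entry. A sketch (the class name and capacity are illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() refreshes an entry
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict the least recently used entry
    }
}
```

Note this class is not thread-safe; for concurrent use, wrap it with Collections.synchronizedMap or move to a dedicated cache library.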
• No matter how much hardware we throw at a single instance, at some
point that won’t be enough. Simply put, scaling up has natural
limitations, and when the system hits these – scaling out is the only way
to grow, evolve and simply handle more load.
• Finally, an additional advantage of scaling with the help of a cluster,
beyond pure Java performance – is that adding new nodes also leads to
redundancy and better techniques of dealing with failure, leading to
overall higher availability of the system.
Architectural Improvements: Scaling Out