3. @TwitterEng 3
Performance Tuning Overview
Top-Down Analysis
- Commonly used when you have the ability to change code at the highest level of the software stack.
1. Monitor target application under load
- System-level diagnostics
- JVM-level diagnostics
2. Profile application under load
3. Identify bottlenecks, analyze, and optimize.
- Make code more efficient
- Reduce allocation rates
4. Repeat
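A hypothetical sketch (not from the talk) of what "reduce allocation rates" can mean in practice: replacing a boxed accumulator, which allocates on every loop iteration, with a primitive one.

```java
import java.util.List;

public class AllocationDemo {
    // Allocates a new Long box on every iteration via autoboxing.
    static long sumBoxed(List<Integer> values) {
        Long total = 0L;            // boxed accumulator
        for (Integer v : values) {
            total += v;             // unbox, add, re-box: one allocation per step
        }
        return total;
    }

    // Same result with no per-iteration allocation.
    static long sumPrimitive(List<Integer> values) {
        long total = 0L;            // primitive accumulator
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> data = List.of(1, 2, 3, 4, 5);
        System.out.println(sumBoxed(data));     // 15
        System.out.println(sumPrimitive(data)); // 15
    }
}
```

A profiler's allocation view makes this kind of hotspot easy to spot; the bottom-up slides call the compiler-side fix for the same pattern "autobox elision".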
Thursday, September 26, 13
4. @TwitterEng 4
Performance Tuning Overview
Bottom-Up Analysis
- Commonly used when you do not have the ability to change code at the highest level of the software stack.
- JVM and OS performance optimization is a common use case.
1. Monitor CPU-level statistics against target application under load
- Use hardware counters (cache misses, path length, etc.)
- HW Profile and map to instructions, OS/JVM, and Scala/Java code
- Use tools when available, otherwise visually inspect assembly code
2. Manipulate static and runtime compilers to address code issues
- Missed optimizations
- Example: autobox elision
3. Manipulate javac / scala compiler
4. Manipulate core platform libraries
5. Identify issues at higher level of the application stack
6. Repeat
10. @TwitterEng 10
Choosing the Right Metrics
Identify Metrics
- What’s important to your users?
- What influences your bottom line?
- What are you willing to trade off?
Define Success
- If it’s not broken ... don’t fix it.
- Perfect is the enemy of done.
11. @TwitterEng 11
Choosing the Right Metrics
We want it all!
- High throughput
- Fast response times
- Small footprint
But …
- There’s no free lunch.
Choose your metrics wisely
- Target metrics that impact your customers first
Use Statistics!
- High variability can render some metrics useless
12. @TwitterEng 12
Throughput Metrics
Transactions per Second (TPS)
- # of transactions / time
- a.k.a. pages/sec, queries/sec, hits/sec
- Good measure of top-end performance
Average Response Time
- Inverse of TPS
- Time / # of transactions
- Sometimes a rolling average
CPU Utilization
- Measure of computational efficiency
- Good for capacity planning, not for development regression testing (new features can increase work).
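As a sketch, the metric definitions above reduce to simple arithmetic (method names are illustrative, not from the deck):

```java
public class ThroughputMetrics {
    // TPS = number of transactions divided by elapsed time in seconds.
    static double tps(long transactions, double elapsedSeconds) {
        return transactions / elapsedSeconds;
    }

    // Average response time is the inverse view: time divided by transactions.
    static double avgResponseSeconds(long transactions, double elapsedSeconds) {
        return elapsedSeconds / transactions;
    }

    public static void main(String[] args) {
        // Example: 5,000 transactions served in 10 seconds.
        System.out.println(tps(5_000, 10.0));                // 500.0 TPS
        System.out.println(avgResponseSeconds(5_000, 10.0)); // ~0.002 s per transaction
    }
}
```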
13. @TwitterEng 13
Latency Metrics
Maximum response time
- Worst case
99% response time
- Drops a few outliers
90% response time
- May drop too many outliers and give a false sense of security
Critical Injection Rate
- Critical jOPs in SPECjbb2013
- Achievable throughput under response time SLA
Not Average Response Time
- Averages hide outliers
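A minimal nearest-rank sketch of the percentile metrics above (illustrative only; production systems usually maintain histograms rather than sorting raw samples):

```java
import java.util.Arrays;

public class LatencyMetrics {
    // Nearest-rank percentile over a sample of latencies.
    static long percentile(long[] latenciesMillis, double pct) {
        long[] sorted = latenciesMillis.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(pct / 100.0 * sorted.length);
        return sorted[Math.max(rank - 1, 0)];
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 11, 14, 13, 250, 16, 12, 14, 13};
        System.out.println(percentile(samples, 100)); // 250: worst case
        System.out.println(percentile(samples, 90));  // 16: drops the single outlier
        System.out.println(percentile(samples, 50));  // 13: median
    }
}
```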
14. @TwitterEng 14
Memory Footprint Metrics
Heap size after Full GC (Live Data Size) - covered in an upcoming slide
Native process size
- # ps aux | grep <PID>
Static footprint
- Size of application binary
- Size of .jar
- Why does it matter?
- Download/deployment speed
- Update/refresh speed
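For a rough in-process complement to the OS-level numbers above, the standard java.lang.Runtime API can be sampled (a sketch; it reports heap footprint, not native process size):

```java
public class FootprintProbe {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long totalBytes = rt.totalMemory();             // heap currently reserved by the JVM
        long usedBytes  = totalBytes - rt.freeMemory(); // heap currently occupied by objects
        long maxBytes   = rt.maxMemory();               // the -Xmx ceiling
        System.out.printf("heap used: %d MB / total: %d MB / max: %d MB%n",
            usedBytes >> 20, totalBytes >> 20, maxBytes >> 20);
    }
}
```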
16. @TwitterEng 16
JVM Tuning Basics
Track size of Old Generation after Full GCs
[GC 435426K->392697K(657920K), 0.1411660 secs]
[Full GC 392697K->390333K(927232K), 0.5547680 secs]
[GC 625853K->592369K(1000960K), 0.1852460 secs]
[GC 831473K->800585K(1068032K), 0.1707610 secs]
[Full GC 800585K->798499K(1456640K), 1.9056030 secs]
Calculating Live Data Size
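Given Full GC lines in the format shown above, live data size (the post-GC old gen occupancy) can be extracted programmatically; an illustrative parser, assuming that exact log format:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LiveDataSize {
    // Matches lines like "[Full GC 392697K->390333K(927232K), 0.5547680 secs]"
    // and captures the post-GC occupancy, i.e. the live data size in KB.
    private static final Pattern FULL_GC =
        Pattern.compile("\\[Full GC \\d+K->(\\d+)K\\(\\d+K\\)");

    static long liveDataKb(String logLine) {
        Matcher m = FULL_GC.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("not a Full GC line: " + logLine);
        }
        return Long.parseLong(m.group(1));
    }

    public static void main(String[] args) {
        System.out.println(liveDataKb(
            "[Full GC 800585K->798499K(1456640K), 1.9056030 secs]")); // 798499
    }
}
```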
17. @TwitterEng 17
JVM Tuning Basics
Track size of Old Generation after Young GCs if no Full GC events occur
2013-09-10T05:39:03.489+0000: [GC[ParNew: 11766264K->18476K(13212096K), 0.0326070 secs] 12330878K->583306K(16357824K), 0.0327090 secs] [Times: user=0.48 sys=0.01, real=0.03 secs]
2013-09-10T05:42:54.666+0000: [GC[ParNew: 11762604K->20088K(13212096K), 0.0270110 secs] 12327434K->585068K(16357824K), 0.0271140 secs] [Times: user=0.39 sys=0.00, real=0.02 secs]
2013-09-10T05:46:41.623+0000: [GC[ParNew: 11764216K->21013K(13212096K), 0.0267490 secs] 12329196K->586133K(16357824K), 0.0268490 secs] [Times: user=0.40 sys=0.00, real=0.03 secs]
Calculating Live Data Size
18. @TwitterEng 18
JVM Tuning Basics
Size of Old Generation
- Good starting point: 2X the size of live data at steady state.
- If the object promotion rate causes frequent CMS cycles, increase the size of the old generation.
- If live data size is 5GB, the starting point should be ~10GB.
- That is the Old Generation size alone, not the total heap.
- Set -Xms and -Xmx to the same value.
- Nobody really needs extra Full GC pauses.
Young and Old Generation Sizing
19. @TwitterEng 19
JVM Tuning Basics
Size of Young Generation
- Young gen = Old gen is a good starting point.
- Young generation size should increase with allocation rate
- Sometimes 2-3x larger than Old Gen
- Young GC times are dominated by copying of live objects to Survivor spaces, not by the size of the overall Young Generation
- Size so that most objects die in Young Generation
- Higher Allocation rates -> Larger Young Generation
Young and Old Generation Sizing
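The rules of thumb on these two slides (Old Gen roughly 2X live data, Young Gen starting equal to Old Gen) amount to simple arithmetic; a hypothetical helper:

```java
public class HeapSizing {
    // Rule of thumb from the slides: old gen ~2x live data size,
    // young gen starts equal to old gen, so total heap ~4x live data.
    static long suggestedOldGenMb(long liveDataMb)   { return 2 * liveDataMb; }
    static long suggestedYoungGenMb(long liveDataMb) { return suggestedOldGenMb(liveDataMb); }
    static long suggestedHeapMb(long liveDataMb) {
        return suggestedOldGenMb(liveDataMb) + suggestedYoungGenMb(liveDataMb);
    }

    public static void main(String[] args) {
        long live = 5 * 1024; // 5 GB of live data, as in the slide's example
        System.out.println(suggestedOldGenMb(live)); // 10240 MB, ~10 GB old gen
        System.out.println(suggestedHeapMb(live));   // 20480 MB starting heap
        // e.g. -Xms20g -Xmx20g -Xmn10g, then adjust based on GC logs
    }
}
```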
20. @TwitterEng 20
JVM Tuning Basics
Example Enterprise Application
- Significant application state
- In-memory cache size: 3.5GB
- Overall live data size: 4GB
- High allocation rate of transient data
- Most objects die in a large young generation
- Suggested initial heap sizing:
- -Xms16g -Xmx16g -Xmn8g
Young and Old Generation Sizing
21. @TwitterEng 21
JVM Tuning Basics
Throughput
- -XX:+UseParallelOldGC
Low server response times?
- CMS
- Older technology
- Can be highly tuned, but tuning can be brittle
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
- G1
- Current development focus
- Young GC times slower than CMS
- -XX:+UseG1GC
Choosing a Garbage Collector
26. @TwitterEng 26
Tuning for Latency
Enable CMS
- -XX:+UseConcMarkSweepGC
Good to have
- -XX:+CMSScavengeBeforeRemark
- -XX:+ParallelRefProcEnabled
- -XX:CMSInitiatingOccupancyFraction=70
Start with Basic Tuning Guidelines
- -XX:PermSize=256m -XX:MaxPermSize=256m
- Old Gen Size is 2X Live Data Size
- Young Gen Size = Old Gen Size
Using CMS
27. @TwitterEng 27
Tuning for Latency
General rules of thumb
- Increase young gen size to handle higher allocation rates.
- Increase young gen size if the promotion rate is high.
- May suffer from premature promotion, i.e. promotions caused by too-frequent young GCs.
- A larger young gen decreases GC frequency and gives objects more time to die.
- Increase Old Gen size if the promotion rate is still high, to avoid allocation failures and concurrent mode failures.
Using CMS
28. @TwitterEng 28
Tuning for Latency
CMS Tuned for Latency
-Xmx18g -Xms18g -XX:PermSize=256m
-XX:MaxPermSize=256m -XX:+CMSScavengeBeforeRemark
-XX:-OmitStackTraceInFastThrow -XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=6 -XX:NewSize=8g
-XX:MaxNewSize=8g -verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
Note: increased Young Gen size, Survivor Ratio tuning
Using CMS
29. @TwitterEng 29
Tuning for Latency
Enable G1
-XX:+UseG1GC -XX:MaxGCPauseMillis=100
- Start with just overall heap size and target pause time.
- Increase Young Generation Size for High Allocation
- Tune to keep remembered set processing low
Using G1GC
30. @TwitterEng 30
Tuning for Latency
G1 Tuning to Consider
-XX:InitiatingHeapOccupancyPercent=90
-XX:G1MixedGCLiveThresholdPercent: the occupancy threshold of live objects in an old region for it to be included in a mixed collection.
-XX:G1HeapWastePercent: the threshold of garbage that you can tolerate in the heap.
-XX:G1MixedGCCountTarget: the target number of mixed garbage collections within which the regions with at most G1MixedGCLiveThresholdPercent live data should be collected.
-XX:G1OldCSetRegionThresholdPercent: a limit on the maximum number of old regions that can be collected during a mixed collection.
Reference: Monica Beckwith’s InfoQ article “G1: One Garbage Collector To Rule Them All”:
http://www.infoq.com/articles/G1-One-Garbage-Collector-To-Rule-Them-All
Using G1GC
31. @TwitterEng 31
Tuning for Latency
G1GC Tuned for Latency
- -XX:+TieredCompilation -XX:InitialCodeCacheSize=256m
-XX:ReservedCodeCacheSize=256m -Xmx18g -Xms18g
-XX:PermSize=256m -XX:MaxPermSize=256m -XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=90
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
Note: MaxGCPauseMillis is the biggest tuning knob.
Don’t start with CMS Tuning!
Using G1GC
33. @TwitterEng 33
Enable ParallelOldGC
- -XX:+UseParallelOldGC
Old Gen needs to be 2-4X live data size (LDS)
Young generation should be ¾ of the heap
Often used when tuning for throughput
- -XX:+AggressiveOpts
- -XX:+TieredCompilation
Disable adaptive sizing and tune survivor spaces directly:
- -XX:-UseAdaptiveSizePolicy -XX:SurvivorRatio=7
-XX:TargetSurvivorRatio=90
Using ParallelOldGC
Tuning for Throughput
34. @TwitterEng 34
Tuning for Throughput
ParallelOldGC tuned for Throughput:
-showversion -server -XX:-UseBiasedLocking
-XX:LargePageSizeInBytes=2m -XX:+AlwaysPreTouch
-XX:+UseLargePages -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+UseLargePages
-Xms29g -Xmx29g -Xmn27g -XX:+UseParallelOldGC
-XX:ParallelGCThreads=24 -XX:SurvivorRatio=16
-XX:TargetSurvivorRatio=90 -XX:-UseAdaptiveSizePolicy
-XX:+AggressiveOpts -XX:InitialCodeCacheSize=160m
-XX:ReservedCodeCacheSize=160m -XX:+TieredCompilation
Using ParallelOldGC
35. @TwitterEng 35
Enable G1
- -XX:+UseG1GC
Old Gen needs to be 2X live data size (LDS)
Young generation should be ¾ the heap
Often used when tuning for throughput
- -XX:+AggressiveOpts
- -XX:+TieredCompilation
Using G1GC
Tuning for Throughput
36. @TwitterEng 36
Tuning for Throughput
G1GC tuned for throughput:
-showversion -server -XX:-UseBiasedLocking
-XX:LargePageSizeInBytes=2m -XX:+AlwaysPreTouch
-XX:+UseLargePages -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+UseLargePages
-Xms28g -Xmx28g -Xmn21g -XX:+UseG1GC
-XX:+AggressiveOpts
-XX:InitialCodeCacheSize=160m -XX:ReservedCodeCacheSize=160m
-XX:+TieredCompilation
Using G1GC
37. @TwitterEng 37
Enable CMS, and tune for throughput
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
- Configure heap to avoid promotion
- Application design should separate stateful and stateless components
to allow targeted tuning.
Young generation should be ¾ the heap
- Young generation should be sized to ensure nearly all objects die young.
- Very large heaps, very large old generation
- Use memory to avoid the need for Full GC.
Tuning survivor spaces manually, etc.
- -XX:SurvivorRatio=7 -XX:+CMSScavengeBeforeRemark
-XX:+ParallelRefProcEnabled
Using CMS
Tuning for Throughput
38. @TwitterEng 38
Tuning for Throughput
CMS Tuned for Throughput
-Xmx18g -Xms18g -XX:PermSize=256m
-XX:MaxPermSize=256m -XX:+CMSScavengeBeforeRemark
-XX:-OmitStackTraceInFastThrow -XX:+AggressiveOpts
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=90
-XX:+UseCMSInitiatingOccupancyOnly
-XX:SurvivorRatio=6 -XX:NewSize=16g
-XX:MaxNewSize=16g -verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:InitialCodeCacheSize=160m -XX:ReservedCodeCacheSize=160m
-XX:+TieredCompilation
Using CMS
40. @TwitterEng 40
Enable ParallelOldGC
- -XX:+UseParallelOldGC
Old Gen needs to be 2X live data size (LDS)
Young generation should start at 1/2 the Old Generation size.
Strategy is to reduce young and old GC sizes independently
until a maximum acceptable end user response time is met.
Definitely not low-pause: trading higher response times and lower throughput for a smaller footprint.
Using ParallelOldGC
Tuning for Footprint
41. @TwitterEng 41
Tuning for Footprint
ParallelOldGC tuned for Footprint
-showversion -server -XX:LargePageSizeInBytes=2m
-XX:+UseLargePages -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+UseLargePages
-Xms8g -Xmx8g -Xmn4g -XX:+UseParallelOldGC
-XX:-UseAdaptiveSizePolicy -XX:+AggressiveOpts
-XX:PermSize=256m -XX:MaxPermSize=256m
Using ParallelOldGC
42. @TwitterEng 42
Enable G1
- -XX:+UseG1GC
Heap should be 3x live data size (LDS)
- Do not tune the size of the young generation
- Allow G1 to adapt the size
- Tune only after observing the minimum size according to G1
Increase the Pause Target to decrease GC overhead
- -XX:MaxGCPauseMillis=400
Strategy is to reduce young and old GC sizes independently
until a maximum acceptable end user response time is met.
Using G1GC
Tuning for Footprint
43. @TwitterEng 43
Tuning for Footprint
G1 Tuned for Footprint
-showversion -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
-Xms12g -Xmx12g -XX:+UseG1GC -XX:InitialCodeCacheSize=160m
-XX:ReservedCodeCacheSize=160m
Using G1GC
44. @TwitterEng 44
Enable CMS, and tune for footprint
- -XX:+UseParNewGC -XX:+UseConcMarkSweepGC
Old Gen needs to be 2X live data size (LDS)
Young generation should start at 1/2 the Old Generation size.
- Young generation should be sized so “enough” objects die in the young generation to reduce the pressure on CMS
- Promotion rate needs to be low enough that the CMS concurrent threads don’t lose the race (concurrent mode failures)
Strategy is to reduce young and old GC sizes independently
until a maximum acceptable end user response time is met.
- Young Generation first, then Old Gen.
Using CMS
Tuning for Footprint
45. @TwitterEng 45
Tuning for Footprint
Example of a highly tuned CMS deploy for footprint:
-Xmx12g -Xms12g -Xmn4g -XX:PermSize=256m
-XX:MaxPermSize=256M -XX:+CMSScavengeBeforeRemark
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC
-XX:CMSInitiatingOccupancyFraction=60
-XX:SurvivorRatio=6 -verbose:gc
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintGCDateStamps -XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
Note: reduced heap and Young Gen size, Survivor Ratio tuning
Using CMS
47. @TwitterEng 47
Common Performance Issues
Size of Permanent Generation
- Perm. Gen. only collects and resizes at Full GC.
Heap before GC invocations=40019 (full 36522):
par new generation total 15354176K, used 14K [0x00000003b9c00000, 0x0000000779c00000,
0x0000000779c00000)
eden space 14979712K, 0% used [0x00000003b9c00000, 0x00000003b9c039a8, 0x000000074c0a0000)
from space 374464K, 0% used [0x000000074c0a0000, 0x000000074c0a0000, 0x0000000762e50000)
to space 374464K, 0% used [0x0000000762e50000, 0x0000000762e50000, 0x0000000779c00000)
concurrent mark-sweep generation total 2097152K, used 588343K [0x0000000779c00000, 0x00000007f9c00000,
0x00000007f9c00000)
concurrent-mark-sweep perm gen total 102400K, used 102399K [0x00000007f9c00000, 0x0000000800000000,
0x0000000800000000)
2013-09-05T17:21:39.530+0000: [Full GC[CMS: 588343K->588343K(2097152K), 1.6166150 secs] 588357K->588343K(17451328K), [CMS Perm : 102399K->102399K(102400K)], 1.6167040 secs] [Times: user=1.57 sys=0.00, real=1.61 secs]
Heap after GC invocations=40020 (full 36523):
par new generation total 15354176K, used 0K [0x00000003b9c00000, 0x0000000779c00000,
0x0000000779c00000)
eden space 14979712K, 0% used [0x00000003b9c00000, 0x00000003b9c00000, 0x000000074c0a0000)
from space 374464K, 0% used [0x000000074c0a0000, 0x000000074c0a0000, 0x0000000762e50000)
to space 374464K, 0% used [0x0000000762e50000, 0x0000000762e50000, 0x0000000779c00000)
concurrent mark-sweep generation total 2097152K, used 588343K [0x0000000779c00000, 0x00000007f9c00000,
0x00000007f9c00000)
concurrent-mark-sweep perm gen total 102400K, used 102399K [0x00000007f9c00000, 0x0000000800000000,
0x0000000800000000)
}
Recommendation: -XX:PermSize=256m -XX:MaxPermSize=256m
In Enterprise Software
48. @TwitterEng 48
Common Performance Issues
Size of Code Cache
- Default size is 64MB, or 96MB if running with TieredCompilation
- Enterprise Applications have lots of code
Aggressively Tune to Avoid Issue
- Tuning without TieredCompilation:
- -XX:InitialCodeCacheSize=128m
-XX:ReservedCodeCacheSize=128m
- Tuning with TieredCompilation:
- -XX:InitialCodeCacheSize=256m
-XX:ReservedCodeCacheSize=256m
In Enterprise Software
50. @TwitterEng 50
What’s up with Twitter and JDK Development?
Twitter runs Java + Scala on the HotSpot JVM
- Most Highly Optimized Managed Runtime
- Open source :-)
- Massive performance gains from moving technologies onto it
Own and Optimize our Platform
- Build out diagnostic tools
- Build, test, and deploy OpenJDK
- Optimize HotSpot Runtime Compilers for Scala, etc.
- Tailored GC for Twitter’s needs
- Extremely low latency requirements (< 10ms)
@TwitterJDK
51. @TwitterEng 51
What’s up with Twitter and JDK Development?
Contribute Back to the Community
- Working closely with Oracle Java Development
- Collaborating with Other OpenJDK contributors
- Posting tools to Github and OpenJDK repositories
Interesting, isn’t it?
- We’re just ramping up now.
- Follow us soon: @TwitterJDK (new idea)
- Follow me at: @dagskeenan
- #jointheflock
@TwitterJDK