This document discusses monitoring a Jenkins continuous integration (CI) system using cloud services. It begins by outlining some common issues that can occur in Jenkins like compilation or test failures. It then evaluates the default Jenkins monitoring capabilities and proposes designing a custom monitoring system using events, FluentD for processing, and InfluxDB for storage. Examples are provided of plugins developed to analyze build failures and improve node utilization. The presentation concludes with a discussion of dashboards used for daily monitoring of the Jenkins CI system.
2. HERE Technologies
HERE Technologies, the Open Location Platform company, enables
people, enterprises and cities to harness the power of location. By
making sense of the world through the lens of location we empower
our customers to achieve better outcomes – from helping a city
manage its infrastructure or an enterprise optimize its assets to
guiding drivers to their destination safely.
To learn more about HERE, including our new generation of cloud-based
location platform services, visit http://360.here.com and www.here.com
3. Context
• Every change goes through pre-submit validation
• Feedback time is 15-40 minutes
• A lot of products and platforms
• 6 Jenkins masters
• Up to 185k runs per day in the biggest one
• 20k runs per day on average
6. What can go wrong?
Compilation is broken
Tests are broken
Network issues
Jenkins master crashed
EC2 plugin does not raise new nodes
No connection to labs
Cannot clean up workspace
AWS S3 is down
Git master dies
Git replica is broken
Compiler cache was invalidated
Hit the limit of API calls to AWS
Job was deleted
UI is blocked
Queue is too big
System.exit(1)
NFS stuck
Deadlock in Jenkins
Staging started to give feedback
Restarted the wrong server
19. Monitoring Plugin (March 2016)
+ Easy to install
+ Nothing to maintain
- When Jenkins is slow, there is no monitoring either
- Monitors mainly JVM stats
- Only one instance
- Not scalable
20. Monitoring Plugin (nowadays)
+ Easy to install
+ Nothing to maintain
- When Jenkins is slow, there is no monitoring either
- Monitors mainly JVM stats
- Only one instance
- Not scalable
+ InfluxDB/CloudWatch/Graphite
33. Design own monitoring (March 2016)
Jenkins -(API)-> Python -(API)-> InfluxDB
+ simple
+ worked for 18 months
- polling
- maintain common code
- not all data is accessible
- extra load
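The Jenkins, Python, InfluxDB pipeline above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the script that was actually used: it assumes the poller reads the standard Jenkins `/queue/api/json` response and emits InfluxDB line protocol. The measurement name `jenkins_queue`, the `master` tag and the function names are hypothetical, and the HTTP calls are passed in as callables so the example stays self-contained.

```python
import json
import time


def queue_to_line(master, queue_json, timestamp_ns):
    """Convert a Jenkins queue API response into one InfluxDB line-protocol record.

    Assumes `master` contains no spaces or commas (they would need escaping).
    """
    items = queue_json.get("items", [])
    stuck = sum(1 for item in items if item.get("stuck"))
    # Line protocol: measurement,tag=value field=value,... timestamp
    # The "i" suffix marks integer fields.
    return "jenkins_queue,master={} length={}i,stuck={}i {}".format(
        master, len(items), stuck, timestamp_ns
    )


def poll_once(master, fetch, write, now_ns=time.time_ns):
    """Run one polling cycle.

    fetch: callable returning the JSON text of the queue endpoint (hypothetical hook).
    write: callable that sends one line to the time-series store (hypothetical hook).
    """
    payload = json.loads(fetch())
    line = queue_to_line(master, payload, now_ns())
    write(line)
    return line
```

Every call to `poll_once` costs the master one API request, which is where the "polling" and "extra load" drawbacks listed above come from.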
81. CCache
• New node - empty local cache
• Old local cache - a lot of misses
+ Distributed cache solves all these problems
- Once a year it spreads a problem across the whole cluster
89. LoadBalancer (solution)
• Default balancer is optimized for cache
• Cron jobs are pinned to different hosts
• Nothing to terminate/stop - no idle nodes
+ Saturate Node Load Balancer: always put all load on the oldest node
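The "saturate the oldest node" rule can be illustrated with a small model. This is a sketch in plain Python, not the Jenkins plugin itself; the `Node` class, its fields and the function names are hypothetical. The idea shown is only the selection rule: among nodes with a free executor, always choose the one that was launched first, so that newer nodes presumably stay empty and can be stopped.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    launched_at: float  # launch time, seconds since epoch
    executors: int      # total executor slots
    busy: int = 0       # slots currently in use

    def has_free_slot(self):
        return self.busy < self.executors


def pick_node(nodes):
    """Return the oldest node that still has a free executor, or None."""
    candidates = [n for n in nodes if n.has_free_slot()]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.launched_at)


def assign(nodes):
    """Place one run on the chosen node and return that node (or None)."""
    node = pick_node(nodes)
    if node is not None:
        node.busy += 1
    return node
```

With three two-executor nodes, the first two runs land on the oldest node, the next two on the second oldest, and the newest node receives work only once the others are full.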
92. Jar Hell (problem)
java.io.InvalidClassException: hudson.util.StreamTaskListener;
local class incompatible: stream classdesc serialVersionUID = 1,
local class serialVersionUID = 294073340889094580
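A failure like this can be recognised by scanning the console log for the exception text. The following is a minimal sketch of that idea, not the plugin used in the talk; the `CAUSES` table and `find_causes` are hypothetical names, and the pattern is only an assumption about what a "Jar Hell" cause would match.

```python
import re

# Hypothetical cause table: cause name -> pattern searched in the console log.
# \s* tolerates the exception message being wrapped after the semicolon.
CAUSES = {
    "Jar Hell": re.compile(
        r"java\.io\.InvalidClassException: [\w.$]+;\s*local class incompatible"
    ),
}


def find_causes(console_log):
    """Return the names of all known causes whose pattern occurs in the log."""
    return [name for name, pattern in CAUSES.items() if pattern.search(console_log)]
```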
97. Jar Hell (explanation)
• Bug in Jenkins Remoting Layer
• If the first run that uses a class is aborted, that class is “lost”
• Does not recover
• Huge impact
98. Jar Hell (“solution”)
// Relabel the node that hit the "Jar Hell" failure cause.
if (cause.getName().equals("Jar Hell")) {
    Node node = build.getBuiltOn();
    // Never relabel the master itself.
    if (node != Jenkins.getInstance()) {
        node.setLabelString("disabled_jar_hell");
    }
}