Monitoring a Cloud CI System Using Jenkins as an Example / Alexander Akbashev (HERE Technologies)

1. Monitoring a Cloud CI System Using Jenkins as an Example. Alexander Akbashev, HERE Technologies

2. HERE Technologies. HERE Technologies, the Open Location Platform company, enables people, enterprises and cities to harness the power of location. By making sense of the world through the lens of location we empower our customers to achieve better outcomes – from helping a city manage its infrastructure or an enterprise optimize its assets to guiding drivers to their destination safely. To learn more about HERE, including our new generation of cloud-based location platform services, visit http://360.here.com and www.here.com

3. Context • Every change goes through pre-submit validation • Feedback time is 15-40 minutes • A lot of products and platforms • 6 Jenkins masters • Up to 185k runs per day on the biggest one • 20k runs per day on average

4. if something goes wrong…

5. What can go wrong? • Compilation is broken • Tests are broken • Network issues

6. What can go wrong? • Compilation is broken • Tests are broken • Network issues • Jenkins master crashed • EC2 plugin does not raise new nodes • No connection to labs • Cannot clean up workspace • AWS S3 is down • Git master dies • Git replica is broken • Compiler cache was invalidated • Hit the limit of API calls to AWS • Job was deleted • UI is blocked • Queue is too big • System.exit(1) • NFS stuck • Deadlock in Jenkins • Staging started to give feedback • Restarted the wrong server

9. Monitoring Jenkins Out of the box

10. Monitoring Jenkins © http://www.jenkinselectric.com/monitoring

11. Monitoring Jenkins https://jenkins.io/doc/book/system-administration/monitoring/

12. Monitoring Jenkins https://wiki.jenkins.io/display/JENKINS/Monitoring

13. Monitoring Plugin (March 2016)

19. Monitoring Plugin (March 2016) + Easy to install + Nothing to maintain - If Jenkins is slow, there is no monitoring - Monitors mainly JVM stats - Only one instance - Not scalable

20. Monitoring Plugin (nowadays) + Easy to install + Nothing to maintain - If Jenkins is slow, there is no monitoring - Monitors mainly JVM stats - Only one instance - Not scalable + InfluxDB/CloudWatch/Graphite

21. Let’s craft our own monitoring!

22. Design own monitoring (March 2016) Jenkins → API → Python → API → InfluxDB

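23. Design own monitoring (March 2016) Jenkins → API → Python → API → InfluxDB
    import influxdb
    import jenkins
    # Poll the Jenkins queue and push one point per queued item to InfluxDB.
    # The measurement name "queue" is an assumption; the slide does not name one.
    j = jenkins.Jenkins("http://jenkins.host")
    influx_server = influxdb.InfluxDBClient(host="influxdb.host", database="stats")
    for q in j.get_queue_info():
        influx_server.write_points([{"measurement": "queue",
                                     "fields": {"name": q["task"]["name"], "reason": q["why"]}}])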

32. Design own monitoring (March 2016) Jenkins → API → Python → API → InfluxDB + simple + worked for 18 months - polling - maintain common code - not all data is accessible - extra load

34. Let’s do event-based monitoring!

36. Jenkins Core
    public abstract class RunListener<R extends Run> implements ExtensionPoint {
        public void onCompleted(R r, TaskListener listener) {}
        public void onFinalized(R r) {}
        public void onStarted(R r, TaskListener listener) {}
        public void onDeleted(R r) {}
    }
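
The difference between polling and the listener hooks on this slide can be illustrated with a small observer registry. This is only a Python analogy, not Jenkins code: the `RunEvents` class, its `subscribe` and `fire` methods, and the dictionary used as a build record are all hypothetical names invented for the sketch. Only the four hook names come from the slide.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a build record; this is not a Jenkins type.
Run = Dict[str, object]


class RunEvents:
    """Tiny observer registry that mirrors the four RunListener hook names."""

    HOOKS = ("onStarted", "onCompleted", "onFinalized", "onDeleted")

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Callable[[Run], None]]] = {
            hook: [] for hook in self.HOOKS
        }

    def subscribe(self, hook: str, callback: Callable[[Run], None]) -> None:
        if hook not in self._listeners:
            raise ValueError(f"unknown hook: {hook}")
        self._listeners[hook].append(callback)

    def fire(self, hook: str, run: Run) -> None:
        # The producer pushes the event; no consumer has to poll for it.
        for callback in self._listeners[hook]:
            callback(run)


events = RunEvents()
seen: List[Run] = []
events.subscribe("onFinalized", seen.append)

events.fire("onStarted", {"number": 1})                          # no subscriber, ignored
events.fire("onFinalized", {"number": 1, "result": "SUCCESS"})   # delivered to `seen`
```

The consumer receives exactly the events it subscribed to at the moment they happen, which is what removes the polling load and the "not all data is accessible" limitation of the API-based design.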

38. Groovy Event Listener Plugin (April 2016) • Runs custom Groovy code for every event • Supports RunListener

39. Groovy Event Listener Plugin (nowadays) • Runs custom Groovy code for every event • Supports RunListener, ComputerListener, ItemListener, QueueListener • Works at scale • Allows a custom classpath

40. Groovy Event Listener Plugin
    import jenkins.metrics.impl.TimeInQueueAction  // provided by the Metrics plugin
    // Log how long each finished build waited in the queue.
    if (event == 'RunListener.onFinalized') {
        def build = Thread.currentThread().executable
        def queueAction = build.getAction(TimeInQueueAction.class)
        def queuing = queueAction.getQueuingDurationMillis()
        log.info "number=$build.number, queue_duration=$queuing"
    }

41. OK, we have events, but how do we fill the database?

42. FluentD

49. FluentD • Processes 13,000 events/second/core • Retry/buffer/routing • Easy to extend • Simple • Reliable • Memory footprint is 30-40 MB • Ruby

52. FluentD Jenkins → JSON → FluentD → JSON → InfluxDB; FluentD → SQL → Postgres; FluentD → Logs

53. FluentD. Config.
    <match **.influx.**>
      type influxdb
      host influxdb.host
      port 8086
      dbname stats
      auto_tags "true"
      timestamp_tag timestamp
      time_precision s
    </match>
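
To make the effect of this config easier to picture, here is a rough Python sketch of how one JSON event might be split into an InfluxDB point. It rests on two assumptions that the slides do not confirm: that `auto_tags` turns string values into tags and leaves everything else as fields, and that `timestamp_tag` names the record key that carries the event time. The function `split_record` and the sample event are hypothetical.

```python
from typing import Any, Dict, Tuple


def split_record(
    record: Dict[str, Any], timestamp_key: str = "timestamp"
) -> Tuple[Dict[str, str], Dict[str, Any], Any]:
    """Split one JSON event into (tags, fields, timestamp).

    Assumption: string values become tags, all other values become fields,
    and the key named by timestamp_key is used as the point's time.
    """
    tags: Dict[str, str] = {}
    fields: Dict[str, Any] = {}
    timestamp = record.get(timestamp_key)
    for key, value in record.items():
        if key == timestamp_key:
            continue
        if isinstance(value, str):
            tags[key] = value
        else:
            fields[key] = value
    return tags, fields, timestamp


# Hypothetical event; the job name and numbers are made up.
event = {"name": "build-linux", "number": 42, "queue_duration": 1500, "timestamp": 1500000000}
tags, fields, ts = split_record(event)
```

Under these assumptions the job name would be indexed as a tag, while the build number and queue duration would be stored as numeric fields.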

58. OK, we have events and we have FluentD, but how do we pass events to it?

59. FluentD Plugin for Jenkins

63. FluentD Plugin for Jenkins • Developed at HERE Technologies • Very simple • Supports JSON • Runs as a post-build step

64. FluentD Plugin for Jenkins https://github.com/jenkinsci/fluentd-plugin

65. Great! Let’s do something with this data!

66. Infra issues

67. Build Failure Analyzer (config)

68. Build Failure Analyzer (code)
    import com.sonyericsson.jenkins.plugins.bfa.model.FailureCauseBuildAction
    // jobName and node are assumed to be defined earlier in the script.
    def bfa = build.getAction(FailureCauseBuildAction.class)
    def causes = bfa.getFailureCauseDisplayData().getFoundFailureCauses()
    for (def cause : causes) {
        // Send one event per detected failure cause under the "influx.bfa" tag.
        final Map<String, Object> data = new HashMap<>()
        data.put("name", jobName)
        data.put("number", build.number)
        data.put("cause", cause.getName())
        data.put("categories", cause.getCategories().join(','))
        data.put("timestamp", build.timestamp.timeInMillis)
        data.put("node", node)
        context.logger.log("influx.bfa", data)
    }
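
Once events of this shape are stored, a dashboard query boils down to counting them per category. The following Python sketch shows that aggregation on in-memory data. The function `count_by_category` and the sample events are hypothetical. Only the field names (`name`, `number`, `cause`, `categories`) come from the slide.

```python
from collections import Counter
from typing import Dict, Iterable, List


def count_by_category(events: Iterable[Dict[str, object]]) -> Counter:
    """Count failure events per category.

    'categories' is a comma-joined string, as in the slide's data map,
    so one event may count toward several categories.
    """
    counts: Counter = Counter()
    for event in events:
        for category in str(event.get("categories", "")).split(","):
            category = category.strip()
            if category:
                counts[category] += 1
    return counts


# Hypothetical sample events; job names, causes, and categories are made up.
sample: List[Dict[str, object]] = [
    {"name": "build-linux", "number": 1, "cause": "Jar Hell", "categories": "infra"},
    {"name": "build-linux", "number": 2, "cause": "Compilation error", "categories": "code"},
    {"name": "build-arm", "number": 7, "cause": "NFS stuck", "categories": "infra,storage"},
]
by_category = count_by_category(sample)
```

Splitting on the comma matters because the Groovy code joins all categories of a cause into one string before sending it.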

74. Build Failure Analyzer (result)

75. Speed up compilation

76. CCache (problem)

77. CCache

81. CCache • New node: empty local cache • Old local cache: a lot of misses + A distributed cache solves all these problems - Once a year it distributes a problem across the whole cluster

82. CCache (result)

83. Improve node utilization

84. LoadBalancer (problem)

85. LoadBalancer (solution)

89. LoadBalancer (solution) • The default balancer is optimized for cache • Cron jobs are pinned to different hosts • Nothing to terminate/stop: no idle nodes + Saturate Node Load Balancer: always puts all load on the oldest node
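
The "oldest node first" rule can be sketched in a few lines of Python. This is an illustration of the idea only, not the plugin's implementation: the `Node` dataclass, its fields, and `pick_node` are hypothetical, and the sketch assumes that node age and free executor slots are the only inputs.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    launched_at: float  # epoch seconds; a smaller value means an older node
    executors: int      # total executor slots
    busy: int = 0       # slots currently in use

    @property
    def free(self) -> int:
        return self.executors - self.busy


def pick_node(nodes: List[Node]) -> Optional[Node]:
    """Return the oldest node that still has a free executor, or None."""
    candidates = [n for n in nodes if n.free > 0]
    return min(candidates, key=lambda n: n.launched_at, default=None)


nodes = [Node("new", 300.0, 2), Node("old", 100.0, 2), Node("mid", 200.0, 2)]
assignments = []
for _ in range(5):
    node = pick_node(nodes)
    node.busy += 1
    assignments.append(node.name)
```

Because older nodes are filled completely before a newer one receives any work, the newest nodes are the ones left idle when load drops, which makes them candidates for termination.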

90. LoadBalancer (result)

91. Minimize impact

92. Jar Hell (problem) java.io.InvalidClassException: hudson.util.StreamTaskListener; local class incompatible: stream classdesc serialVersionUID = 1, local class serialVersionUID = 294073340889094580

93. Jar Hell (explanation)

97. Jar Hell (explanation) • Bug in the Jenkins remoting layer • If the first run that uses a class is aborted, that class is “lost” • Does not recover • Huge impact

98. Jar Hell (“solution”)
    // Relabel the affected agent (never the master itself) so that jobs stop matching it.
    if (cause.getName().equals("Jar Hell")) {
        Node node = build.getBuiltOn()
        if (node != Jenkins.getInstance()) {
            node.setLabelString("disabled_jar_hell")
        }
    }

99. Our daily dashboard

101. Resources

102. Resources • FluentD • InfluxDB plugin for FluentD • JavaGC plugin for FluentD • FluentD Plugin for Jenkins • Groovy Event Listener Plugin • Build Failure Analyzer Plugin • Saturate Node Load Balancer Plugin • CCache with memcache • InfluxDB

103. Q/A? alexander.akbashev@here.com Github: Jimilian
