1. ©2018 RingCentral, Inc. Proprietary & Confidential.
Kapacitor Manager
Yuri Ardulov
Alexey Smirnov
Lyubov Fomicheva
Valery Tishkov
Vyacheslav Shvetsov
2. ©2018 RingCentral, Inc. Proprietary & Confidential.
RingCentral Product
▪ Collaborative Communications
▪ Contact Center
▪ Video & meetings
▪ Cloud PBX
▪ Team messaging
▪ Open Platform
▪ Analytics
▪ Global
▪ User Experience
3. 3 | © 2018 RingCentral, Inc. All rights reserved.
Our Journey
▪ 2008: RingCentral Office
▪ 2014: Office for Enterprise & Video Meetings
▪ 2015: Contact Center, Team Messaging & Open Platform
▪ 2016: Collaborative Communications & Global Office
▪ 2017: Analytics & Quality of Service
▪ 2018: Collaborative Meetings, Collaborative Contact Center & Pulse
4. 4 | © 2018 RingCentral, Inc. All rights reserved.
RingCentral IP Telecommunication Company
▪ 500,000 business customers
▪ 10 data centers across the globe (US, Europe, APAC)
▪ 20K+ servers
▪ 30K simultaneous phone calls
▪ 100K Faxes per day
▪ 20M calls per day
5. 5 | © 2018 RingCentral, Inc. All rights reserved.
Tools Landscape
CMDB
6. 6 | © 2018 RingCentral, Inc. All rights reserved.
From Zabbix to Influx - Starting Points
▪ North America:
• 2 Data Centers
• # of hosts: 10K+
• # of metrics: 2.5M+
• # of triggers: 700K+
▪ Europe and Other:
• Multiple Data Centers
• # of hosts: 5K+
• # of metrics: 700K+
• # of triggers: 250K+
7. 7 | © 2018 RingCentral, Inc. All rights reserved.
Do-It-Yourself (DIY) Framework
8. 8 | © 2018 RingCentral, Inc. All rights reserved.
DIY Framework
▪ Alerts as code
▪ Send any application metrics
▪ Dashboards as code
▪ Horizontally scalable
▪ Sandbox for graduating tasks
▪ Structure independent
▪ Fully automated service
▪ Integration with Deployment systems
9. 9 | © 2018 RingCentral, Inc. All rights reserved.
Goals of the project:
▪ Structure independent
▪ Collect metrics with high granularity
▪ Fully automated service
▪ Integration with Deployment systems
▪ Engineering as self-service:
• Alerting as code
• Dashboards as code
• CD support
▪ HA implementation
▪ Metrics collection through HTTP GET
10. 10 | © 2018 RingCentral, Inc. All rights reserved.
Design of proposed solution
11. 11 | © 2018 RingCentral, Inc. All rights reserved.
Problems with existing Kapacitor (v1.3)
▪ Low efficiency (few tasks per instance)
▪ Not responsive under high load (the API stops functioning)
▪ Streaming tasks (1000+) cause high CPU load
▪ Batch tasks: if not grouped, InfluxDB appears to utilize the whole CPU (to be confirmed)
▪ Low internal concurrency: the alert node stalls on writing into the internal topic (alerts stop being produced)
12. 12 | © 2018 RingCentral, Inc. All rights reserved.
Test Cases
1. Alert node -> .tcp() -> Logstash TCP listener -> Kafka
2. Alert node -> .post() -> Logstash HTTP listener -> Kafka
3. Alert node -> influxDBOut() -> Logstash HTTP listener acting as an InfluxDB cluster -> Kafka
4. Eval node -> influxDBOut() -> InfluxDB cluster
▪ Kapacitor instance:
• CPU cores: 24
• RAM: 256 GB
1000 tasks, exercised with the following flow:
1. Put value=1 to the Kapacitor /write API.
2. Wait for 0.5 seconds.
3. Put value=0 to the Kapacitor /write API.
4. Wait for 0.5 seconds.
5. Repeat steps 1-4 100 times.
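The flow above can be sketched as a small load generator. This is a minimal illustration, not the authors' tool: the Kapacitor address (default port 9092), the database, retention policy, and measurement names are all assumptions, and the field is written as an integer so that it matches a `"value" == 1` comparison in a task.

```python
import time
import urllib.parse
import urllib.request

# Assumptions (not from the slides): address, database, retention policy
# and measurement name are placeholders for illustration only.
KAPACITOR_URL = "http://localhost:9092"
DATABASE = "loadtest"
RETENTION_POLICY = "autogen"
MEASUREMENT = "test_metric"


def line(value):
    """Build one line-protocol point with an integer field named 'value'."""
    return f"{MEASUREMENT} value={value}i"


def http_write(body):
    """POST line-protocol data to Kapacitor's write endpoint."""
    query = urllib.parse.urlencode({"db": DATABASE, "rp": RETENTION_POLICY})
    request = urllib.request.Request(
        f"{KAPACITOR_URL}/kapacitor/v1/write?{query}",
        data=body.encode("utf-8"),
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5):
        pass


def run_flow(write=http_write, sleep=time.sleep, cycles=100, pause=0.5):
    """Write value=1, wait, write value=0, wait; repeat `cycles` times.

    Returns the number of writes issued.
    """
    sent = 0
    for _ in range(cycles):
        for value in (1, 0):
            write(line(value))
            sleep(pause)
            sent += 1
    return sent
```

With the defaults this issues 200 writes over roughly 100 seconds. Calling `run_flow()` with no arguments needs a reachable Kapacitor instance; the `write` and `sleep` parameters exist so the loop can be exercised without one.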
13. 13 | © 2018 RingCentral, Inc. All rights reserved.
Test Cases
▪ Example: Alert node -> .tcp() -> Logstash TCP listener -> Kafka
▪ Task:
    stream
        |from()
        |alert()
            .id('alert-id-{{number}}')
            .message('alert message {{number}}')
            .info(lambda: "value" == 1)
            .details('''{"details": "some details", "fqdn": "test.example.com"}''')
            .tcp('172.17.0.1:25000')
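The `{{number}}` placeholder suggests the task script was rendered once per task. A rough sketch of how that could be done is below; the slides do not show the generator, so the task ID scheme, database, and retention policy names are hypothetical. The payload shape follows Kapacitor's HTTP API for defining tasks (`POST /kapacitor/v1/tasks`).

```python
import json

# Task script from the slide, with {{number}} left as a placeholder.
TASK_TEMPLATE = """stream
    |from()
    |alert()
        .id('alert-id-{{number}}')
        .message('alert message {{number}}')
        .info(lambda: "value" == 1)
        .details('''{"details": "some details", "fqdn": "test.example.com"}''')
        .tcp('172.17.0.1:25000')
"""


def render_task(number):
    """Substitute the task number into the TICKscript template."""
    return TASK_TEMPLATE.replace("{{number}}", str(number))


def task_payload(number, db="loadtest", rp="autogen"):
    """Build the JSON body for creating one stream task.

    The ID scheme, db and rp values are assumptions for illustration.
    """
    return {
        "id": f"alert-task-{number}",
        "type": "stream",
        "dbrps": [{"db": db, "rp": rp}],
        "script": render_task(number),
        "status": "enabled",
    }


# Bodies for 1000 tasks, ready to be sent one request per task.
payloads = [json.dumps(task_payload(n)) for n in range(1000)]
```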
14. 14 | © 2018 RingCentral, Inc. All rights reserved.
Results: Case 1
Result:
• Delay between event generation and receipt of the event by Logstash: up to 7 minutes.
• The likely root cause is the .tcp() method, which opens a new TCP connection for every event.
15. 15 | © 2018 RingCentral, Inc. All rights reserved.
Results: Case 2
Result:
• Delay between event generation and receipt of the event by Logstash: up to 30 seconds.
• Much better, but the delay grows as the number of tasks increases.
16. 16 | © 2018 RingCentral, Inc. All rights reserved.
Results: Case 3
Result:
• Delay between event generation and receipt of the event by Logstash: ~3 seconds.
17. 17 | © 2018 RingCentral, Inc. All rights reserved.
Results: Case 4
5000 tasks generated, using the same test flow.
Result:
• Events are received by the InfluxDB cluster in near real time.
18. 18 | © 2018 RingCentral, Inc. All rights reserved.
Project Goals
▪ Scalable solution: 50K+ alerts per location
▪ Increased efficiency -> maximize the ratio of tasks per instance
▪ Manageable solution -> dynamically change task allocation and balancing
▪ No single point of failure
▪ Housekeeping capabilities
▪ Task sandboxing as a part of the service
▪ RESTful
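The slides do not describe how KM allocates tasks to Kapacitor instances. Purely as an illustration of what "task allocation and balancing" can mean, here is a hypothetical least-loaded assignment; the function and instance names are invented for this sketch and are not KM's API.

```python
import heapq


def allocate(task_ids, instances):
    """Assign each task to the instance currently holding the fewest tasks.

    Hypothetical sketch: task count is the only load measure considered.
    """
    # Min-heap of (task_count, instance_name); ties break on the name.
    heap = [(0, name) for name in instances]
    heapq.heapify(heap)
    allocation = {name: [] for name in instances}
    for task_id in task_ids:
        count, name = heapq.heappop(heap)
        allocation[name].append(task_id)
        heapq.heappush(heap, (count + 1, name))
    return allocation


# Example: five tasks spread over two instances.
result = allocate(["t0", "t1", "t2", "t3", "t4"], ["k1", "k2"])
```

A real balancer would likely weigh CPU load or task type rather than a plain count; the count is used here only to keep the sketch short.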
19. 19 | © 2018 RingCentral, Inc. All rights reserved.
Kapacitor Manager (KM): Functional Description
20. 20 | © 2018 RingCentral, Inc. All rights reserved.
CEP problems
▪ Single-core processing
▪ Low performance
▪ Very poor scalability
21. 21 | © 2018 RingCentral, Inc. All rights reserved.
Open and Closed Events
22. 22 | © 2018 RingCentral, Inc. All rights reserved.
Kapacitor And Event State Change Detection
23. 23 | © 2018 RingCentral, Inc. All rights reserved.
K8s: Deployment Platform for KM and Kapacitor Nodes
24. 24 | © 2018 RingCentral, Inc. All rights reserved.
Future Plans
▪ Sandboxing and routing
▪ Smart rebalancing
▪ Open source
▪ Align with upcoming TICK stack releases
▪ UI
25. 25 | © 2018 RingCentral, Inc. All rights reserved.
THANK YOU