This presentation explains the challenges we face at Criteo in discovering machines and services.
Criteo uses HashiCorp's Consul to discover services. We explain what Consul is, how it works, the challenges we faced, and how we improved it.
We also explain how using it in combination with consul-templaterb lets us apply Inversion of Control across the whole infrastructure, so that we can use Consul as a database and iterate faster.
6. 6 •
Open-Source, 2014
No SPOF / Fault Tolerant
Distributed Agents on all machines
Service-oriented / datacenter (DC) aware
Real-time updates of services
Distributed toolbox (Locks, K/V…)
Easy to integrate (DNS support)
Can work on any IP network
Consul is a Discovery Database with fault-tolerance
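As a rough illustration of what discovery looks like from a client's side, here is a minimal Python sketch that filters a health response down to usable endpoints. It assumes a JSON body shaped like the one returned by Consul's `/v1/health/service/<name>` HTTP endpoint; the sample data (node names, addresses, ports) is made up, and `healthy_endpoints` is a hypothetical helper, not part of any Consul SDK.

```python
import json

# Hand-written sample shaped like a response from Consul's
# /v1/health/service/<name> HTTP endpoint. All values are made up.
SAMPLE_RESPONSE = json.dumps([
    {
        "Node": {"Node": "web-01", "Address": "10.0.0.11", "Datacenter": "dc1"},
        "Service": {"Service": "web", "Address": "", "Port": 8080, "Tags": ["http"]},
        "Checks": [{"Status": "passing"}],
    },
    {
        "Node": {"Node": "web-02", "Address": "10.0.0.12", "Datacenter": "dc1"},
        "Service": {"Service": "web", "Address": "10.1.0.12", "Port": 8081, "Tags": ["http"]},
        "Checks": [{"Status": "critical"}],
    },
])


def healthy_endpoints(raw_json):
    """Return (address, port) pairs for instances whose checks all pass."""
    endpoints = []
    for entry in json.loads(raw_json):
        # Skip any instance that has at least one non-passing check.
        if any(check["Status"] != "passing" for check in entry["Checks"]):
            continue
        service = entry["Service"]
        # An empty service address means "use the address of the node".
        address = service["Address"] or entry["Node"]["Address"]
        endpoints.append((address, service["Port"]))
    return endpoints


print(healthy_endpoints(SAMPLE_RESPONSE))
```

In a real setup the JSON would come from a local agent over HTTP, or the same answer could be obtained through the DNS interface mentioned above.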
8. 8 •
5 people
Create SDKs for other teams (JVM, C#, Python, Ruby)
Handle all infrastructure, on-call 24/7
Architecture patterns
No. 1 contributors to Consul worldwide
The team
10. 10 •
“Criteo has probably the most intensive usage of Consul in the world” (M. Hashimoto)
Discover all instances of all systems in our applications
All load balancing and DNS provisioning
Metrics
Alerting systems
Used with bare metal (Windows, Linux), Mesos, Kubernetes, Hadoop
One of the biggest installations of Consul in the world
11. 11 •
RPC queries per DC: from ~1.5k/s to 9k/s (up to 1,300 qps on a single service)
1 change/s on a large service with many instances
→ 2k req/s if there are 2k observers #RealLife
Consul At Scale
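The fan-out arithmetic on this slide can be written down as a small Python sketch. It assumes the simplest model, in which every change to a service wakes every observer and each observer then issues one request to re-read the service; `watcher_requests_per_second` is a hypothetical name used only for illustration.

```python
def watcher_requests_per_second(changes_per_second, observers):
    """Requests generated when each change makes every observer re-read the service."""
    return changes_per_second * observers


# Figures from the slide: 1 change/s on a service watched by 2k observers.
print(watcher_requests_per_second(1, 2000))  # 2000
```

The load grows with the product of change rate and observer count, which is why a single busy, widely watched service can dominate the request volume.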
12. 12 •
~100 PRs, 70+ merged upstream
~15 PRs for features (Service.Meta, weights…)
~30 PRs merged for performance (DNS, watches)
~10 PRs merged for safety (node registration, memberlist…)
~2 PRs fixing security bugs
OSS UI: https://github.com/criteo/consul-templaterb/
Our pull requests
13. 13 •
Bandwidth: from 1Gb/s to 12k/s
CPU: from 32/32 CPUs at 100% to 1/32 CPU at 100%
From 3/4 notifications/s to 1 notification/10 min for 1 service
From 1 incident / 10 days to no incident in 6 months
From a fragile tool to a database for the whole infrastructure
Prometheus improvements
Metadata for services…
Improvements from our merged PRs
14. 14 •
An OSS scalable UI for Consul (consul-templaterb)
17. 17 •
Services expose semantics:
I want HTTPS
I speak Swagger
Call someone when I lose 40% of capacity
Tools observe, react and provision systems: Consul is an infrastructure database
Inversion of Control
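The idea on this slide can be sketched in Python as follows: a service declares its semantics as metadata when it registers, and a separate tool derives what to provision from that metadata alone. The `Name`, `Port` and `Meta` fields follow the body accepted by Consul's agent service registration endpoint; the metadata keys (`protocol`, `api`, `alert_capacity_loss_pct`), the service name, and both helper functions are hypothetical and chosen only for illustration, not taken from Criteo's setup.

```python
def registration_payload(name, port, meta):
    """Build a service registration body; Consul metadata values must be strings."""
    return {"Name": name, "Port": port, "Meta": {k: str(v) for k, v in meta.items()}}


def plan_actions(services):
    """Derive provisioning actions from what each service declares about itself."""
    actions = []
    for svc in services:
        meta = svc.get("Meta", {})
        if meta.get("protocol") == "https":  # "I want HTTPS"
            actions.append((svc["Name"], "configure HTTPS on the load balancer"))
        if meta.get("api") == "swagger":  # "I speak Swagger"
            actions.append((svc["Name"], "publish API documentation"))
        threshold = meta.get("alert_capacity_loss_pct")
        if threshold is not None:  # "Call someone when I lose N% of capacity"
            actions.append((svc["Name"], f"alert when {threshold}% of capacity is lost"))
    return actions


# Hypothetical service declaring the three semantics listed on the slide.
checkout = registration_payload(
    "checkout", 8443,
    {"protocol": "https", "api": "swagger", "alert_capacity_loss_pct": 40},
)
for service_name, action in plan_actions([checkout]):
    print(service_name, "->", action)
```

The tools never need a per-service configuration file: adding a new service, or changing what it wants, only requires changing what the service itself publishes.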