1. What’s an SRE at Criteo?
Clément Michaud
c.michaud@criteo.com
June 6th, 2018
2. About me
Clément Michaud
SRE building the PaaS at Criteo for 9 months.
Previously C++ software engineer in Finance for 3 years.
clems4ever, @clementmichaud1, clement.michaud
5. What’s Criteo?
Criteo is a global tech company
Leader in online advertising
Manages its own data centers
1.4B shoppers per month
600TB of shopper data per day
6. Criteo’s partners
Publishers & Exchanges
- Bid for ad spaces to display advertisers’ products
- 100 ms max to bid
Advertisers
- Manage e-commerce campaigns
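To make the 100 ms bidding budget concrete, here is a minimal sketch of a bidder that gives up when the deadline is exceeded. It is an illustration only, not Criteo's implementation: `compute_bid`, `bid_or_skip`, the `floor_price` field and the 1.2 multiplier are hypothetical names and values chosen for the example.

```python
import asyncio

BID_DEADLINE_SECONDS = 0.100  # the 100 ms budget mentioned on the slide


async def compute_bid(request: dict, delay: float) -> float:
    """Hypothetical bidder: sleeps to simulate work, then returns a price."""
    await asyncio.sleep(delay)
    return request["floor_price"] * 1.2  # arbitrary illustrative pricing rule


async def bid_or_skip(request: dict, delay: float):
    """Return a bid price, or None when the deadline is exceeded."""
    try:
        return await asyncio.wait_for(
            compute_bid(request, delay), timeout=BID_DEADLINE_SECONDS
        )
    except asyncio.TimeoutError:
        # Too late to take part in the auction: skip this bid.
        return None


def run_example():
    request = {"floor_price": 1.0}
    fast = asyncio.run(bid_or_skip(request, delay=0.001))  # well within budget
    slow = asyncio.run(bid_or_skip(request, delay=0.5))    # misses the deadline
    return fast, slow
```

Calling `run_example()` returns a price for the fast request and `None` for the slow one, which is the behaviour a hard bidding deadline implies: a late answer is worth nothing.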
7. Why did Criteo choose the SRE model?
https://www.youtube.com/watch?v=ncf80_ZvBpo
Great presentation given in 2017 by Nicolas Helleringer at Devoxx
13. What does it solve?
[Figure: skills scale, Dev (Laura) 50% / Ops (Alex) 50%]
- Alex could have helped Laura with the patch using his expertise
- Laura and Alex could have evaluated the risk together
- Alex could have reacted faster if he had been aware of Laura’s intention
14. DevOps is a philosophy and a set of practices designed to break organizational barriers
1. Promote collaboration
2. Failure is normal
3. Make gradual changes
4. Leverage tooling & automation
5. Measure everything
16. Plenty of technical challenges
- Host and power up more than 20k servers: Hosting teams.
- Build & maintain a datacenter network: Network teams.
- Build & maintain the platform running apps: Infra & Core teams, Observability team.
- Build & deliver efficiently (CI/CD): DevTool team, Deployment team.
- Ingest & process large amounts of data: NoSQL team, DBA team.
More than 100 people
17. Organization of SRE teams
Teams: InfraTools, Network, PaaS, Observability, DBA, NoSQL, Infra, LB, IDM, Lake, Rivers
- Small teams
- Service provider
- Reduced scope
- Expertise
18. Meta vision provided by EPMs (Engineering Product Managers)
Ensure coherence of the whole
[Diagram: four EPMs and the SRE teams: InfraTools, Network, PaaS, Observability, DBA, NoSQL, Infra, LB, IDM, Lake, Rivers]
19. Close connections with dev teams
Dev teams: Prediction, Reco, Creator, RTB
SRE teams: Network, Observability, DBA, NoSQL, Infra, LB, IDM, PaaS, InfraTools
[Diagram legend: Dev team, SRE team, Escalation]
21. What SRE means at Criteo
Maintenance & Evolution
Maintain & evolve the platform to make our users’ lives easier.
Dev & Tech
Promote and use production-grade technologies. Keep a technology watch to stay competitive.
Ownership & Responsibility
As a provider of services, the team should take ownership of the services it provides.
Automation
Installation of 20k servers managed with Chef. Infra as code. Automatic failover.
20k+ servers
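As a rough illustration of what "automatic failover" can mean, the sketch below picks the first healthy endpoint from an ordered list, so traffic moves to a backup without human intervention. This is a simplified assumption for illustration, not the mechanism used at Criteo: `pick_healthy`, the endpoint names and the health-check callback are all hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_healthy(
    endpoints: Sequence[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint that passes its health check, or None.

    Endpoints are tried in priority order, so the primary is used whenever
    it is healthy and a backup takes over automatically when it is not.
    """
    for endpoint in endpoints:
        if is_healthy(endpoint):
            return endpoint
    return None


# Hypothetical health status, standing in for a real health-check probe.
status = {"primary.example": False, "backup-1.example": True, "backup-2.example": True}
chosen = pick_healthy(list(status), lambda endpoint: status[endpoint])
```

Here the primary is marked unhealthy, so `chosen` is `"backup-1.example"`. A real setup would replace the dictionary lookup with an actual probe and re-evaluate the choice periodically.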
22. What SRE means at Criteo
Support
Provide the right level of documentation. Answer user requests about the services we provide.
Testability
Ensure new deployments will not break the platform. Made easy with Terraform.
On-call
Participate in level-2 on-call rotations for entire days and during weekends.
Consulting
Help your colleagues build resilient and performant systems by working alongside them.
8 data centers in the world
24. Standard work day
Day 1: Standard
9:00
- Do code reviews
- Check tasks on the Jira board
- Read emails
10:00
- Write code to upgrade Consul
- Test the setup in AWS
- Send code reviews
12:00
Lunch break
13:00
- Deploy code to production
- Make sure the deployment is going well
- Do code reviews
16:00
- Meeting with the deployment team to define SLAs
25. Work day while being on-call
Day 2: On-Call
9:00
- Check emails
- Do code reviews
- Fix a few server failures
12:00
Lunch break
13:00
- Write code to install new servers
- Write code to install new apps
- Write some documentation
- Deploy to prod
17:00
- Fix bugs in a tool we provide
- Send a code review
18:00
Start of on-call shift
26. Work day as interrupt guy
Day 3: Interrupt
18:00
- Get paged because of an incident with Mesos
- Investigate and find the issue is related to the LBs
- Call the on-call guy from the LB team
- Write down the timeline
23:00
End of on-call shift, start of interrupt shift
9:00
- Report incidents to my team
- Write a post-mortem and create tickets to address improvements
10:00
- Provide some support to users in Slack
- Do code reviews
12:00
Go to gym & have lunch!
27. Work day as interrupt guy
Day 3: Interrupt
13:00
- Improve our documentation around Mesos
- Handle a Jira ticket & send reviews
15:00
- Fix a bug in a library reported by a user
- Send reviews
- Deploy the fix to prod
17:30
- Do code reviews
- Prepare a wheel of misfortune
18:30
And the sprint goes on…
1 sprint == 2 weeks
29. Summary
● DevOps is a philosophy
● SRE is an implementation of DevOps
● SRE comes after breaking the silos between Dev and Ops
● This model has allowed Criteo to scale well over the years.
30. Conclusion
● Complex problems need various skills to be solved.
● Devs and Ops WILL solve those problems by working together.
So, take the plunge: implement the DevOps interface
31. Thank you - Questions?
https://www.youtube.com/watch?v=uTEL8Ff1Zvk
https://github.com/apache/mesos
https://github.com/clems4ever