Configuration Management
Evolution at CERN
Gavin McCance
gavin.mccance@cern.ch
@gmccance

Agile Infrastructure
• Why we changed the stack
• Current status
• Technology challenges
• People challenges
• Community

[Figure: "The Agile Infrastructure: Making IT operations better since 2013", with logos for Puppet, OpenStack, Foreman, mcollective, Koji, git, Jenkins and ActiveMQ]

14/10/2013

CHEP 2013

Why?
• Homebrew stack of tools
  • Twice the number of machines, no new staff
  • New remote data-centre
  • Adopting more dynamic Cloud model
• “We’re not special”
  • Existence of open source tool chain: OpenStack, Puppet, Foreman, Kibana
• Staff turnover
  • Use standard tools – we can hire for it and people can be hired for it when they leave

Agile Infrastructure “stack”
• Our current stack has been stable for one year now
  • See plenary talk at last CHEP (Tim Bell et al)
• Virtual server provisioning
  – Cloud “operating system”: OpenStack -> (Belmiro, next)
• Configuration management
  – Puppet + ecosystem as configuration management system
  – Foreman as machine inventory tool and dashboard
• Monitoring improvements
  • Flume + Elasticsearch + Kibana -> (Pedro, next++)

Puppet
• Puppet manages nodes’ configuration via “manifests” written in Puppet DSL
• All nodes check in frequently (~1-2 hours) and ask for configuration
  • Configuration applied frequently to minimise drift
• Using the central puppet master model
  • ..rather than masterless model
  • No shipping of code, central caching and ACLs
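
The "check in and re-apply to minimise drift" idea can be sketched as a toy model. This is not how the Puppet agent is implemented; the file names, contents and function names below are hypothetical and only illustrate that a repeated run corrects drift and then has nothing left to change.

```python
from typing import Dict, List, Tuple

# Hypothetical desired state, standing in for a compiled configuration.
DESIRED: Dict[str, str] = {
    "/etc/motd": "Managed by Puppet\n",
    "/etc/ntp.conf": "server ntp.example.org\n",
}


def find_drift(desired: Dict[str, str], actual: Dict[str, str]) -> List[str]:
    """Return the sorted names of resources whose actual state differs from the desired state."""
    return sorted(name for name, content in desired.items() if actual.get(name) != content)


def apply_run(desired: Dict[str, str], actual: Dict[str, str]) -> Tuple[Dict[str, str], List[str]]:
    """Simulate one run: correct every drifted resource and report what was changed."""
    changed = find_drift(desired, actual)
    converged = dict(actual)  # copy, so the caller's state is not mutated
    for name in changed:
        converged[name] = desired[name]
    return converged, changed


if __name__ == "__main__":
    node = {"/etc/motd": "hand-edited\n"}
    node, changed = apply_run(DESIRED, node)
    print("first run changed:", changed)
    node, changed = apply_run(DESIRED, node)
    print("second run changed:", changed)
```

The second run reports no changes, which is the property that makes frequent runs safe.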

Separation of data and code
• Puppet “Hiera” splits configuration “data” from “code”
  • Treat Puppet manifests really as code
  • More reusable manifests
    • Hiera is quite new: old manifests are catching up
• Hiera can use multiple sources for lookup
  • Currently we store the data in git
  • Investigating DB for “canned” operations
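
A minimal sketch of a hierarchical data lookup in the spirit of Hiera, assuming a three-level hierarchy (node, hostgroup, common) where the most specific level that defines a key wins. The hierarchy, keys and values are hypothetical; the real hierarchy used at CERN is not given in the slides.

```python
from typing import Any, Dict, List

# Hypothetical data, standing in for data files kept in git.
DATA: Dict[str, Dict[str, Any]] = {
    "node/web01.example.org": {"ntp::servers": ["ntp-local.example.org"]},
    "hostgroup/webserver": {"apache::port": 8080},
    "common": {"ntp::servers": ["ntp.example.org"], "apache::port": 80},
}


def hierarchy_for(fqdn: str, hostgroup: str) -> List[str]:
    """Return the lookup levels for a node, most specific first."""
    return [f"node/{fqdn}", f"hostgroup/{hostgroup}", "common"]


def lookup(key: str, fqdn: str, hostgroup: str,
           data: Dict[str, Dict[str, Any]] = DATA, default: Any = None) -> Any:
    """Return the value from the first level that defines the key, else the default."""
    for level in hierarchy_for(fqdn, hostgroup):
        values = data.get(level, {})
        if key in values:
            return values[key]
    return default


if __name__ == "__main__":
    print(lookup("apache::port", "web01.example.org", "webserver"))
    print(lookup("apache::port", "db01.example.org", "database"))
```

Because the manifest only asks for `apache::port`, the same code serves every hostgroup and the differences live entirely in the data.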

Modules and Git
• Manifests (code) and hiera (data) are version controlled
• Puppet can use git’s easy branching to support parallel environments
  • Later…

Foreman
• Lifecycle management tool for VMs and physical servers
• External Node Classifier – tells the puppet master what a node should look like
• Receives reports from Puppet runs and provides dashboard
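
As a rough illustration of what an External Node Classifier returns, the sketch below maps a node name to a hash with `classes`, `parameters` and `environment` keys, which is the shape Puppet expects from an ENC. The inventory, hostgroups and class names are hypothetical, and the output is printed as JSON for readability; a real ENC emits YAML, and Foreman's own implementation is not shown here.

```python
import json
from typing import Any, Dict, List

# Hypothetical inventory, standing in for the machine inventory tool.
INVENTORY: Dict[str, Dict[str, str]] = {
    "web01.example.org": {"hostgroup": "webserver", "environment": "qa"},
    "batch042.example.org": {"hostgroup": "batch", "environment": "production"},
}

# Hypothetical mapping from hostgroup to the classes applied to its nodes.
HOSTGROUP_CLASSES: Dict[str, List[str]] = {
    "webserver": ["base", "apache"],
    "batch": ["base", "batch_worker"],
}


def classify(fqdn: str) -> Dict[str, Any]:
    """Return the classification for a node; raises KeyError for unknown nodes."""
    host = INVENTORY[fqdn]
    return {
        "classes": HOSTGROUP_CLASSES[host["hostgroup"]],
        "parameters": {"hostgroup": host["hostgroup"]},
        "environment": host["environment"],
    }


if __name__ == "__main__":
    print(json.dumps(classify("web01.example.org"), indent=2))
```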

Deployment at CERN
• Puppet 3.2
• Foreman 1.2
• Been in real production for 6 months
• Over 4000 hosts currently managed by Puppet
  • SLC5, SLC6, Windows
  • ~100 distinct hostgroups in CERN IT + Expts
  • New EMI Grid service instances puppetised
  • Batch/Lxplus service moving as fast as we can drain it
  • Data services migrating with new capacity
  • AI services (OpenStack, Puppet, etc.)

Key technical challenges
• Service stability and scaling
• Service monitoring
• Foreman improvements
• Site integration

Scalability experiences
• Most stability issues we had were down to scaling issues
• Puppet masters are easy to load-balance
  • We use standard apache mod_proxy_balancer
  • We currently have 16 masters
  • Fairly high IO and CPU requirements
• Split up services
  • Puppet – critical vs. non critical

[Figure: Puppet masters split into a “Bulk” pool with 12 backend nodes and an “Interactive” pool with 4 backend nodes]
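
The split into two backend pools can be sketched as a small round-robin router. The pool sizes (12 and 4) come from the slide; the host names and the rule that the caller decides whether a request is "interactive" are assumptions made for illustration. The production setup uses Apache mod_proxy_balancer, not Python.

```python
from itertools import cycle
from typing import Dict, Iterator, List


def make_pool(prefix: str, size: int) -> List[str]:
    """Build hypothetical backend host names such as bulk01.example.org."""
    return [f"{prefix}{i:02d}.example.org" for i in range(1, size + 1)]


class Balancer:
    """Round-robin over two separate pools so one kind of load cannot starve the other."""

    def __init__(self, bulk_size: int = 12, interactive_size: int = 4) -> None:
        self.pools: Dict[str, List[str]] = {
            "bulk": make_pool("bulk", bulk_size),
            "interactive": make_pool("interactive", interactive_size),
        }
        self._cursors: Dict[str, Iterator[str]] = {
            name: cycle(members) for name, members in self.pools.items()
        }

    def route(self, interactive: bool) -> str:
        """Return the next backend from the pool matching the request type."""
        pool = "interactive" if interactive else "bulk"
        return next(self._cursors[pool])


if __name__ == "__main__":
    lb = Balancer()
    print([lb.route(interactive=False) for _ in range(3)])
    print(lb.route(interactive=True))
```

Keeping the cursors separate means a burst of bulk requests never advances, or queues behind, the interactive pool.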
Scalability guidelines


Scalability experiences
• Foreman is easy to load-balance
• Also split into different services
  • That way Puppet and Foreman UI don’t get affected by e.g. massive installation bursts

[Figure: a load balancer in front of separate Foreman services: ENC, UI/API and Reports processing]

PuppetDB
• All puppet data sent to PuppetDB
  • Querying at compile time for Puppet manifests
    • e.g. configure load-balancer for all workers
• Scaling is still a challenge
  • Single instance – manual failover for now
  • Postgres scaling
    • Heavily IO bound (we moved to SSDs)
    • Get the book
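
The "configure load-balancer for all workers" example amounts to selecting nodes by some stored property and rendering configuration from the result. The sketch below does this over an in-memory list; it does not call PuppetDB, and the node records, class names and port are hypothetical. `BalancerMember` and `<Proxy balancer://...>` are standard mod_proxy_balancer directives.

```python
from typing import Any, Dict, List

# Hypothetical node records, standing in for data that would be queried from PuppetDB.
NODES: List[Dict[str, Any]] = [
    {"certname": "worker02.example.org", "classes": ["base", "worker"]},
    {"certname": "db01.example.org", "classes": ["base", "database"]},
    {"certname": "worker01.example.org", "classes": ["base", "worker"]},
]


def nodes_with_class(nodes: List[Dict[str, Any]], cls: str) -> List[str]:
    """Return the sorted certnames of all nodes that carry the given class."""
    return sorted(n["certname"] for n in nodes if cls in n["classes"])


def render_balancer(name: str, members: List[str], port: int = 8080) -> str:
    """Render an Apache balancer block with one BalancerMember line per member."""
    lines = [f"<Proxy balancer://{name}>"]
    lines += [f"    BalancerMember http://{member}:{port}" for member in members]
    lines.append("</Proxy>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_balancer("workers", nodes_with_class(NODES, "worker")))
```

Adding a worker then only requires the new node to appear in the stored data; the balancer configuration follows on the next run.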

Monitor response times
• Monitor response, errors and identify bottlenecks
• Currently using Splunk – will likely migrate to Elasticsearch and Kibana
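
One way to picture "monitor response, errors and identify bottlenecks" is a per-endpoint summary of request records. The record format, endpoint names and sample values below are hypothetical; in practice this kind of aggregation would be done in the log analysis tool rather than in a script.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, NamedTuple


class Request(NamedTuple):
    endpoint: str   # hypothetical endpoint label
    seconds: float  # response time
    status: int     # HTTP status code


# Hypothetical sample, standing in for parsed access logs.
SAMPLE: List[Request] = [
    Request("catalog", 4.0, 200),
    Request("catalog", 6.0, 200),
    Request("catalog", 30.0, 500),
    Request("report", 0.5, 200),
    Request("report", 0.7, 200),
]


def summarise(requests: List[Request]) -> Dict[str, Dict[str, float]]:
    """Group requests by endpoint and compute count, median, max and error rate."""
    grouped: Dict[str, List[Request]] = defaultdict(list)
    for request in requests:
        grouped[request.endpoint].append(request)
    summary: Dict[str, Dict[str, float]] = {}
    for endpoint, items in grouped.items():
        errors = sum(1 for r in items if r.status >= 500)
        summary[endpoint] = {
            "count": len(items),
            "median_s": median(r.seconds for r in items),
            "max_s": max(r.seconds for r in items),
            "error_rate": errors / len(items),
        }
    return summary


def slowest(summary: Dict[str, Dict[str, float]]) -> str:
    """Return the endpoint with the highest median response time."""
    return max(summary, key=lambda endpoint: summary[endpoint]["median_s"])


if __name__ == "__main__":
    stats = summarise(SAMPLE)
    print(stats)
    print("bottleneck candidate:", slowest(stats))
```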

Upstream improvements
• CERN strategy is to run the main-line upstream code
  • Any developments we do get pushed upstream
  • e.g. Foreman power operations, CVE reported

[Figure: Foreman Proxy, IPMI and three physical boxes; OpenStack Nova API and three VMs]

Site integration
• Using open source doesn’t completely get you away from coding your own stuff
• We’ve found every time Puppet touches our existing site infrastructure a new “service” or “plugin” is born
  • Implementing our CA audit policy
  • Integrating with our existing PXE setup and burn-in/hardware allocation process - possible convergence on tools in the future – Razor?
  • Implementing Lemon monitoring “masking” use-cases – nothing upstream, yet..

People challenges
• Debugging tools and documentation needed!
  • PuppetDB helpful here
• Can we have X’, Y’ and Z’ ?
  • Just because the old thing did it like that, doesn’t mean it was the only way to do it
  • Real requirements are interesting to others too
  • Re-understanding requirements and documentation and training
• Great tools – how do 150 people use them without stepping on each other?
  • Workflow and process

[Figure: Use git branches to define isolated puppet environments. Labels: “Your test box”, “Your special feature”, “My special feature”, “QA”, “QA” machines, “Production”, “Most machines”]
Easy git cherry pick
Git workflow

Git model and flexible environments
• For simplicity we made it more complex
• Each Puppet module / hostgroup now has its own git repo (~200 in all)
  • Simple git-merge process within module
  • Delegated ACLs to enhance security
• Standard “QA” and “production” branches that machines can subscribe to
• Flexible tool (Jens, to be open-sourced by CERN) for defining “feature” developments
  • Everything from “production” except for the change I’m testing on my module
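
The "everything from production except my change" idea can be sketched as composing an environment from per-module branches. This is not the CERN tool mentioned above; the module names, branch names and commit ids are hypothetical, and a real implementation would read them from the git repositories.

```python
from typing import Dict, Optional

# Hypothetical branch heads per module repository (module -> branch -> commit id).
BRANCH_HEADS: Dict[str, Dict[str, str]] = {
    "apache": {"production": "a1b2c3", "qa": "d4e5f6", "fix_vhosts": "0a0b0c"},
    "ntp": {"production": "111aaa", "qa": "222bbb"},
    "base": {"production": "333ccc", "qa": "333ccc"},
}


def build_environment(default_branch: str,
                      overrides: Optional[Dict[str, str]] = None,
                      heads: Dict[str, Dict[str, str]] = BRANCH_HEADS) -> Dict[str, str]:
    """Pin every module to the default branch, except modules listed in overrides."""
    overrides = overrides or {}
    environment: Dict[str, str] = {}
    for module, branches in heads.items():
        branch = overrides.get(module, default_branch)
        if branch not in branches:
            raise KeyError(f"module {module!r} has no branch {branch!r}")
        environment[module] = branches[branch]
    return environment


if __name__ == "__main__":
    # A feature environment: production everywhere, one module on a feature branch.
    print(build_environment("production", {"apache": "fix_vhosts"}))
```

A test box subscribed to such an environment sees only the one change under test, which keeps feature work isolated from QA and production.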

Strong QA process
• Mandatory QA process for “shared” modules
  • Recommended for non-shared modules
  • Everyone is expected to have some nodes from their service in the QA environment
  • Normally changes are QA’d for at least 1 week. Hit the button if it breaks your box!
• Still iterating on the process
  • Not bound by technology
  • Is one week enough? Can people “freeze”?
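
The promotion rule described above can be written down as a small check. This is a sketch under assumptions: the one-week minimum and the mandatory/recommended split come from the slide, while the function name, its parameters and the modelling of "the button" as a boolean flag are hypothetical.

```python
from datetime import date, timedelta

# Minimum time a change to a shared module spends in QA, per the slide.
MIN_QA_PERIOD = timedelta(days=7)


def can_promote(entered_qa: date, today: date, shared: bool, stop_requested: bool) -> bool:
    """Decide whether a change may move from QA to production.

    stop_requested models someone hitting the button because the change broke their box.
    """
    if stop_requested:
        return False
    if not shared:
        # QA is only recommended for non-shared modules, so it does not block promotion.
        return True
    return today - entered_qa >= MIN_QA_PERIOD


if __name__ == "__main__":
    print(can_promote(date(2013, 10, 1), date(2013, 10, 8), shared=True, stop_requested=False))
    print(can_promote(date(2013, 10, 1), date(2013, 10, 5), shared=True, stop_requested=False))
```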

Community collaboration
• Traditionally one of HEP’s strong points
• There’s a large existing Puppet community with a good model - we can join it and open-source our modules
• New HEPiX working group being formed now
  • Engage with existing Puppet community
  • Advice on best practices
  • Common modules for HEP/Grid-specific software
  • https://twiki.cern.ch/twiki/bin/view/HEPIX/ConfigManagement
  • https://lists.desy.de/sympa/info/hepix-config-wg

http://github.com/cernops
for the modules we share
Pull requests welcome!

Summary
• The Puppet / Foreman / Git / OpenStack model is working well for us
  • 4000 hosts in production, migration ongoing
• Key technical challenges are scaling and integration, which are under control
• Main challenge now is people and process
  • How to maximise the utility of the tools
• The HEP and Puppet communities are both strong and we can benefit if we join them together
https://twiki.cern.ch/twiki/bin/view/HEPIX/ConfigManagement
http://github.com/cernops

Backup slides

[Figure: tool landscape. Labels: mcollective, yum; Bamboo; Puppet; AIMS/PXE; Foreman; JIRA; OpenStack Nova; Koji, Mock; Yum repo; Pulp; Active Directory / LDAP; git; Lemon / Hadoop; Hardware database; Puppet-DB]
