1. CERN Infrastructure Evolution
Tim Bell
CHEP
CERN IT Department
24th May 2012
CH-1211 Genève 23
Switzerland
www.cern.ch/it
2. Agenda
• Problems to address
• Ongoing projects
– Data centre expansion
– Configuration management
– Infrastructure as a Service
– Monitoring
• Timelines
• Summary
3. Problems to address
• CERN data centre is reaching its
limits
• IT staff numbers remain fixed but
more computing capacity is needed
• Tools are high maintenance and
becoming increasingly brittle
• Inefficiencies exist but root cause
cannot be easily identified
4. Data centre extension history
• Studies in 2008 into building a new
computer centre on the CERN
Prevessin site
– Too costly
• In 2011, tender run across CERN
member states for remote hosting
– 16 bids
• In March 2012, Wigner Institute in
Budapest, Hungary selected
5. Data Centre Selection
• Wigner Institute in Budapest, Hungary
7. Data Centre Plans
• Timescales
– Prototyping in 2012
– Testing in 2013
– Production in 2014
• This will be a “hands-off” facility for CERN
– They manage the infrastructure and hands-on work
– We do everything else remotely
• Need to be reasonably dynamic in our usage
– Will pay power bill
• Various possible models of usage – sure to evolve
– The less specific the hardware installed there, the
easier to change function
– Disaster Recovery for key services in Meyrin data
centre becomes a realistic scenario
8. Infrastructure Tools Evolution
• We had to develop our own toolset in 2002
• Nowadays,
– CERN compute capacity is no longer leading edge
– Many options available for open source fabric
management
– We need to scale to meet the upcoming capacity
increase
• If there is a requirement which is not available
through an open source tool, we should question
the need
– If we are the first to need it, contribute it back to the
open source tool
9. Where to Look?
• Large community out there taking the “tool chain”
approach whose scaling needs match ours:
O(100k) servers and many applications
• Become standard and join this community
10. Bulk Hardware Deployment
• Adapt processes for minimal physical
intervention
• New machines register themselves
automatically into network databases
and inventory
• Burn-in test uses standard Scientific
Linux and production monitoring
tools
• Detailed hardware inventory with
serial number tracking to support
accurate analysis of failures
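As a concrete sketch of the self-registration step: the inventory endpoint and payload fields below are hypothetical (the talk does not name the actual CERN network and inventory database APIs), and the third-party requests library is assumed to be available.

    import subprocess

    import requests  # third-party HTTP library, assumed available

    # Hypothetical inventory endpoint; not a real CERN service name.
    INVENTORY_URL = "https://inventory.example.ch/api/hosts"

    def first_mac():
        """Read the MAC address of eth0 from sysfs (Linux)."""
        with open("/sys/class/net/eth0/address") as f:
            return f.read().strip()

    def serial_number():
        """Ask the BIOS for the system serial number via dmidecode."""
        out = subprocess.check_output(
            ["dmidecode", "-s", "system-serial-number"])
        return out.decode().strip()

    if __name__ == "__main__":
        # Self-registration at first boot: no manual MAC address entry.
        requests.post(INVENTORY_URL,
                      json={"mac": first_mac(),
                            "serial": serial_number()},
                      timeout=10).raise_for_status()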
11. Current Configurations
• Many, diverse applications (“clusters”) managed by different teams
• ..and 700+ other “unmanaged” Linux nodes in VMs that could benefit
from a simple configuration system
12. Job Adverts from Indeed.com
• Index of millions of worldwide job posts across thousands of job sites
• These are the sort of posts our departing staff will be applying for
[Chart: Indeed.com job posting trends for Puppet and Quattor]
Agile Infrastructure - Configuration and Operation Tools
13. Configuration Management
• Using off the shelf components
– Puppet – configuration definition
– Foreman – GUI and Data store
– Git – version control
– Mcollective – remote execution
• Integrated with
– CERN Single Sign On
– CERN Certificate Authority
– Installation Server
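As a flavour of how these pieces can be driven programmatically, here is a hedged sketch that lists the hosts known to a Foreman instance over its REST API; the hostname and credentials are placeholders, and the third-party requests library is assumed to be available.

    import requests  # third-party HTTP library, assumed available

    # Placeholder Foreman instance and credentials.
    FOREMAN = "https://foreman.example.ch"

    resp = requests.get(FOREMAN + "/api/hosts",
                        auth=("admin", "secret"),
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    for entry in resp.json():
        # The Foreman v1 API wraps each record as {"host": {...}}.
        print(entry["host"]["name"])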
15. Different Service Models
• Pets are given names like
lsfmaster.cern.ch
• They are unique, lovingly hand
raised and cared for
• When they get ill, you nurse them
back to health
• Cattle are given numbers like
vm0042.cern.ch
• They are almost identical to other
cattle
• When they get ill, you get another
one
• Future application architectures tend towards Cattle
• .. But users now need support for both modes of working
16. Infrastructure as a Service
• Goals
– Improve repair processes with virtualisation
– More efficient use of our hardware
– Better tracking of usage
– Enable remote management for new data centre
– Support potential new use cases (PaaS, Cloud)
– Sustainable support model
• At scale for 2015
– 15,000 servers
– 90% of hardware virtualized
– 300,000 VMs needed
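(Taken together, these targets imply roughly 13,500 hypervisors and an average of about 300,000 / 13,500 ≈ 22 VMs per physical machine.)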
17. OpenStack
• Open source cloud software
• Supported by 173 companies including
IBM, RedHat, Rackspace, HP, Cisco,
AT&T, …
• Vibrant development community and
ecosystem
• Infrastructure as a Service at our scale
• Started in 2010 but maturing rapidly
19. Scheduling and Accounting
• Multiple uses of IaaS
– Server consolidation
– Classic batch (single or multi-core)
– Cloud VMs such as CERNVM
• Scheduling options
– Availability zones for disaster recovery
– Quality of service options to improve efficiency
such as build machines, public login services
– Batch system scalability is likely to be an issue
• Accounting
– Use underlying services of IaaS and
Hypervisors for reporting and quotas
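To illustrate the availability-zone option, here is a sketch of creating a server through the OpenStack Compute (Nova) REST API; the endpoint, token, image and flavour references are placeholders that a real client would first obtain from Keystone and the image service.

    import requests  # third-party HTTP library, assumed available

    # Placeholder endpoint, token and IDs.
    NOVA = "http://nova.example.ch:8774/v2/TENANT_ID"
    TOKEN = "..."

    server = {"server": {
        "name": "vm0042",
        "imageRef": "IMAGE_ID",
        "flavorRef": "1",
        # Pin the instance to one availability zone, e.g. one zone
        # per machine room, so a replica can run in the other zone.
        "availability_zone": "meyrin",
    }}
    resp = requests.post(NOVA + "/servers", json=server,
                         headers={"X-Auth-Token": TOKEN})
    resp.raise_for_status()
    print(resp.json()["server"]["id"])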
20. Monitoring in IT
• >30 monitoring applications
– Number of producers: ~40k
– Input data volume: ~280 GB per day
• Covering a wide range of different resources
– Hardware, OS, applications, files, jobs, etc.
• Application-specific monitoring solutions
– Using different technologies (including commercial
tools)
– Sharing similar needs: aggregate metrics, get alarms,
etc
• Limited sharing of monitoring data
– Hard to implement complex monitoring queries
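(For scale: ~280 GB per day across ~40k producers averages roughly 7 MB per producer per day, i.e. under 100 bytes per producer per second.)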
22. Current Tool Snapshot (Liable to Change)
[Diagram: current tool chain – Puppet, Foreman, mcollective, yum, Jenkins, AIMS/PXE, JIRA, OpenStack Nova, git, SVN, Koji, Mock, Yum repo, Pulp, hardware database, Lemon, Puppet stored config DB]
23. Timelines
• 2012 – Prepare formal project plan; establish IaaS in the CERN Data Centre; monitoring implementation as per the working group; migrate lxcloud users; early adopters to use the new tools
• 2013 (LS1) – Extend IaaS to the remote Data Centre; business continuity for the new Data Centre; migrate CVI users; general migration to the new tools with SLC6 and Windows 8
• 2014 (LS1, to November) – Phase out legacy tools such as Quattor
24. Conclusions
• New data centre provides opportunity to
expand the Tier-0 capacity
• New tools are required to manage these
systems and improve processes given
fixed staff levels
• Integrated monitoring is key to identify
inefficiencies, bottlenecks and usage
• Moving to an open source tool chain has
allowed a rapid proof of concept and will
be a more sustainable solution
25. Background Information
• HEPiX Agile Talks
– http://cern.ch/go/99Ck
• Tier-0 Upgrade
– http://cern.ch/go/NN98
For other information, get in touch:
Tim.Bell@cern.ch
31. Industry Context - Cloud
• Cloud computing models are now standardising
– Facilities as a Service – such as Equinix, Safehost
– Infrastructure as a Service - Amazon EC2, CVI or lxcloud
– Platform as a Service - Microsoft Azure or CERN Web Services
– Software as a Service – Salesforce, Google Mail, Service-Now, Indico
• Different customers want access to different layers
– Both our users and the IT Service Managers
[Diagram: layered stack – Applications / Platform / Infrastructure / Facilities]
The CERN computing infrastructure has a number of immediate problems to solve, chief among them how to continue providing the capacity the LHC experiments need. The CERN computer centre was built in the 1970s for a Cray and a mainframe and has been gradually adapted to meet the needs of modern commodity computing. However, we have reached its limits and can no longer adapt, due to constraints on cooling and electricity. At the same time, the toolset we wrote over the past 10 years to manage the 8,000 servers is getting old. There were no alternatives when we started, so a hand-crafted set of tools such as Quattor, Sindes, SMS and Lemon was developed. These now require more maintenance work than we have resources for. Equally, we know that we are not getting the most out of our installed capacity, but identifying the root causes of inefficiencies is difficult given the vast amounts of data to work through and correlate.
Deployment is staged, starting small in 2013 and growing in gradual steps through 2017. This allows us to debug processes and facilities early, in 2013.
CERN’s computing capacity is no longer leading edge. Companies like Rackspace have over 80,000 servers and Google is rumoured to have 500,000 worldwide. Upcoming changes such as the new data centre, IPv6, scalability pressures and even new operating system versions would require substantial effort to support on our own. We need to share the effort with others and benefit from their work. We therefore started a project to make the CERN tools more agile, with three basic principles: we are not unique, there are solutions out there; if we find our requirements are not met, question the requirement (and question it again); and if we need to develop something, find the nearest tool and submit a patch rather than writing from scratch.
We followed a Google-style toolchain approach of small tools glued together; the aim is to try quickly and change course if we find limitations. Tool selection is based on maintenance, community support, packaging and documentation quality.
We arrange bulk deliveries of thousands of servers a year, and these currently can take weeks to install and get into production. The large web companies do this automatically using network boot and automatic registration. We will do the same, so that the only step needed for installation is to boot the machines on a switch declared as an installation target. The machines then register themselves into the network and inventory tools, saving manual entry of MAC addresses. After this, burn-in tests are started to stress the machines fully. This finds near-failing parts as well as major issues such as firmware problems. The end results are automatically analysed and failures reported to the vendor. All serial numbers and firmware versions are tracked so that failure analysis is easier and statistics on exchanges can be maintained over the life of the machines.
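As an illustration of the automatic analysis of burn-in results, here is a minimal sketch; the one-row-per-test CSV format and the reporting step are invented for the example.

    import csv

    def failed_machines(results_csv):
        """Map serial number -> failed tests from a burn-in results
        file with rows of the (invented) form: serial,test,PASS|FAIL."""
        failures = {}
        with open(results_csv) as f:
            for serial, test, outcome in csv.reader(f):
                if outcome != "PASS":
                    failures.setdefault(serial, []).append(test)
        return failures

    if __name__ == "__main__":
        for serial, tests in sorted(failed_machines("burnin.csv").items()):
            # In production this would open a ticket with the vendor.
            print("%s failed: %s" % (serial, ", ".join(tests)))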
When it comes to configuring the boxes after they have been tested, we have a large variety of configurations. There are hundreds of different configurations in the computer centre, with only a few, such as the batch farm, being large. Many of the smaller ones have fewer than 10 boxes, where the configuration effort is high for newly trained service managers. Over 700 boxes in the computer centre are not under configuration management control at all.
At the same time, the tools we use are not commonly used elsewhere. New arrivals do not learn them at university and require individual training (as no formal training courses are available). By adopting a tool commonly used elsewhere, our staff can learn from books and courses and if they leave after a few years at CERN, they have directly re-usable skills for their CVs.
So we chose a set of software and integrated it with standard CERN services that are stable and scalable. Puppet is key for us, as there are many off-the-shelf components, from NTP through to web services, that can be re-used. We chose Git, although the default source code version control system at CERN is SVN, because the toolset we are investigating comes with out-of-the-box support for Git. Again, we should do what the others do.
Currently, we are managing 160 boxes on the pilot platform.
Managing 15,000 servers is not simple: there are pets, which need looking after, and cattle, which are disposable. The aim is to create infrastructure that supports both pets and cattle, but to manage the hardware like cattle, so that hardware management staff can work without constantly scheduling interventions.
Our objective for the coming years is to have 90% of the hardware resources in the data centres virtualised. In this scenario we will have around 300,000 virtual machines. The CERN IT department currently has two small projects covering this (lxcloud and CVI). The main objective is to build a common solution combining these two projects in terms of scalability and efficiency. We need to support multiple centres, scale to 300K VMs, and reduce running effort via live migration and platform as a service.
After some research into IaaS solutions, we chose OpenStack. It provides a set of tools that let us build our private cloud. The solution is multi-site, scalable, and uses standard cloud interfaces, so clients can build their applications against those interfaces. This positions our cloud as just another cloud provider.
OpenStack has several components, including Nova, Glance, Keystone, Horizon and Swift. This slide shows the architecture of the different services and how they interact with each other, along with the internal components of Glance and Nova. Looking at Nova, the two key components are the queue and the database. Nova is built on a shared-nothing, messaging-based architecture: all the major Nova components can be run on multiple servers, which means that most component-to-component communication must go via the message queue. Our implementation at CERN uses these main components.
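To illustrate the messaging pattern, here is a sketch using the kombu AMQP library (which Nova itself builds on); the exchange, queue and message body are simplified stand-ins for Nova's real RPC envelope.

    from kombu import Connection, Exchange, Queue

    # Simplified stand-ins for Nova's topic exchange and RPC messages.
    nova_exchange = Exchange("nova", type="topic")
    compute_queue = Queue("compute", nova_exchange, routing_key="compute")

    with Connection("amqp://guest:guest@localhost//") as conn:
        producer = conn.Producer(serializer="json")
        # Any API/scheduler node can publish and any compute node can
        # consume: the queue decouples the components (shared-nothing).
        producer.publish({"method": "run_instance",
                          "args": {"name": "vm0042"}},
                         exchange=nova_exchange,
                         routing_key="compute",
                         declare=[compute_queue])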
So, final tool summary …
Early adopters include dev boxes, non-Quattorised boxes and lxplus.