A description of the Sanger Institute's journey with OpenStack to date, covering RHOSP, Ceph, S3, user applications, and future plans. Given at the Sanger Institute's OpenStack Day.
3. What I'll talk about
• The Sanger Institute
• Motivations for using OpenStack
• Our journey
• Some decisions we made (and why)
• Some problems we encountered (and how we addressed them)
• Projects that are using it so far
• Next steps
4. The Sanger Institute
LSF 9
~10,000 cores in main compute farm
~10,000 cores across smaller project-specific farms
13PB Lustre storage
Mostly everything is available everywhere - "isolation" is based on POSIX file permissions
5. Motivations
LSF is great for HPC utilisation but…
• It doesn't address data size/sharing/locality
• It's quicker to move an image (or an image definition) to the data
• Benefit from existing data security arrangements
• Benefit from tenant isolation
LSF isn't going away - it's complementary to cloud-style computing
6. Our journey
• 2015, June: sysadmin training
• July: experiments with RHOSP6 (Juno)
• August: RHOSP7 (Kilo) released
• December: pilot "beta" system opened to testers
• 2016, first half: Science As A Service
• July: pilot "gamma" system opened using proper Ceph hardware
• August: datacentre shutdown
• September: production system hardware installation
• 2017, January: "delta" system opened to early adopters
• February: Sanger Flexible Compute Platform announced
7. Science As A Service
First half of 2016
Proof-of-concept of a user-friendly orchestration portal (CloudForms) on top of OpenStack and VMware
Consultancy and development input from Red Hat
Presented to the Scientific Working Group at the Barcelona summit, October 2016
11. Hardware
We approached current vendors, and SuperMicro via BIOS-IT
Wanted to get most bang for buck
Arista provided seed switch kit and offered VXLAN support
13. Production OpenStack (1)
⢠107 Compute nodes (Supermicro) each with:
⢠512GB of RAM, 2 * 25GB/s network interfaces
⢠1 * 960GB local SSD, 2 * Intel E52690v4 (14 cores @ 2.6Ghz)
⢠6 Control nodes (Supermicro) allow 2 openstack deployments
⢠256 GB RAM, 2 * 100 GB/s network interfaces
⢠1 * 120 GB local SSD, 1 * Intel P3600 NVMe (/var)
⢠2 * Intel E52690v4 (14 cores @ 2.6Ghz)
⢠Total of 53 TB of RAM, 2996 cores, 5992 with hyperthreading
⢠RHOSP8 (Liberty) deployed with Triple-O
14. Production OpenStack (2)
⢠9 Storage nodes (Supermicro) each with:
⢠512GB of RAM
⢠2 * 100GB/s network interfaces,
⢠60 * 6TB SAS discs, 2 system SSD
⢠2 * Intel E52690v4 (14 cores @ 2.6Ghz)
⢠4TB of Intel P3600 NVMe used for journal
⢠Ubuntu Xenial
⢠3 PB of disc space, 1PB usable
⢠Single instance (1.3 GBytes/sec write, 200 MBytes/sec read)
⢠Ceph benchmarks imply 7 GBytes/sec
15. Production OpenStack (3)
⢠3 racks of equipment, 24 KW load per rack
⢠10 Arista 7060CX-32S switches
⢠1U, 32 * 100Gb/s -> 128 * 25Gb/s
⢠Hardware VXLAN support integrated with OpenStack *
⢠Layer two traffic limited to rack, VXLAN used inter-rack
⢠Layer three between racks and interconnect to legacy systems
⢠All network switch software can be upgraded without disruption
⢠True Linux systems
⢠400 Gb/s from racks to spine, 160 Gb/s from spine to legacy systems
* VxLan in ml2 plugin not used in first iteration because of software issues
16. OpenStack installation
RHOSP vs Packstack vs …
• Paid-for support from Red Hat
• Terminology confusion: TripleO undercloud and overcloud
• Need wellness checks of the undercloud and overcloud before each (re)deploy
• Keep deployment configuration in git and deploy with a script for consistency
18. Ceph installation
Integrated or standalone?
• Deployment by RHOSP is easier but ties Ceph to that OpenStack
• A separate, self-supported Ceph was more cost-effective and a better fit for staff knowledge at the time
• It's possible to share a Ceph cluster between multiple OpenStacks
• ceph-ansible is seductive but brings some headaches
  • e.g. --check causes problems like changing the fsid
19. Networking
We wanted VXLAN support in switches to enable metal-as-a-service
Unfortunately we're not there yet…
e.g. ML2 driver bugs: "reserved" is not a valid UUID
We currently have VXLAN double encapsulation
21. Puppet or what?
We chose to use Ansible
• There's only a single Puppet post-deploy hook
• Wider strategic use of Ansible within Sanger IT
• Keep configuration in git
22. Our customisations
⢠scheduler tweaks (stack not spread, CPU/RAM overcommit)
⢠hypervisor tweaks (instance root disk on Ceph or hypervisor)
⢠enable SSL for Horizon and API
⢠change syslog destination
⢠add âMOTDâ to Horizon login page
⢠change session timeouts
⢠register systems with RedHat
⢠and more...
23. Customisation pitfalls
Some customisations become obsolete when moving to a newer version of OpenStack - can't blindly carry them forward
A redeploy (e.g. to add compute nodes) overwrites configuration, so the customisations need to be reapplied - and there's a window when they're absent
Restarting too many services too quickly upsets HAProxy, RabbitMQ...
24. Flavours and host aggregates
Three main flavour types:
1. Standard "m1.*"
  • True cloud-style compute; root disk on the hypervisor; 90% of compute nodes
2. Ceph "c1.*"
  • Root disk on Ceph allows live migration; 6 compute nodes support this
3. Reserved "h1.*"
  • Limited to tenants running essential availability services
25. Flavours and host aggregates
Per-project flavours:
• For the Cancer group: "k1.*" (see the sketch below)
  • True cloud-style compute, like "m1.*"
  • Sized to fit two instances on each hypervisor: half the disk, half the CPUs, half the RAM
• Trying to prevent Ceph "double load" caused by data movement: Ceph → S3 → instance → Cinder volume → Ceph
• Only viable with homogeneous hypervisors and known/predictable resource requirements
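A sketch of how a per-project flavour can be pinned to a host aggregate (names and sizes are hypothetical, and the AggregateInstanceExtraSpecsFilter has to be enabled in the nova scheduler):

  openstack aggregate create cancer
  openstack aggregate add host cancer compute-042.example
  openstack aggregate set --property project=cancer cancer
  openstack flavor create --vcpus 14 --ram 245760 --disk 400 k1.half-node
  openstack flavor set --property aggregate_instance_extra_specs:project=cancer k1.half-node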
26. Deployment thoughts
"Premature optimisation is the root of all evil" - Knuth
"Get it working, then make it faster" - my boss Pete
"Keep it simple (because I'm) stupid" - me
Turn off h/w acceleration (10GbE offloads guilty until proven innocent)
Find some enthusiastic early adopters to shake the problems out
Deploy, monitor, tweak, rinse, repeat
28. Metrics
Find the balance between
"if it moves, graph it"
and
"don't overload the metrics server"
50,000 metrics every 10 seconds is optimistic
29. Architecture
We're using collectd → graphite/carbon → grafana
Modular plugins make it easy to record new metrics, e.g. entropy_avail
Using the collectd libvirt plugin means new instances are automatically measured (config sketch below)
...although the automatic naming isn't great:
openstack_flex2.instance-00000097_bbb85e84-6c0c-4fe8-9b3c-db17a665e7ef.libvirt.virt_cpu_total
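A sketch of the collectd configuration behind this (in recent collectd releases the plugin is named "virt"; HostnameFormat is what drives the naming above):

  LoadPlugin virt
  <Plugin virt>
    Connection "qemu:///system"
    RefreshInterval 60
    HostnameFormat "name uuid"    # produces the instance-00000097_<uuid> style identifiers
  </Plugin>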
34. Logging
We wanted something like Splunk
...but without the £££
We're using ELK
Today as a syslog destination; planning to use rsyslog to watch OpenStack component log files
35. Monitoring
Bare minimum in Opsview (Nagios)
• Horizon and API availability
• Controllers up
• radosgw S3 availability
• Ceph nodes up
We'd like hardware status reporting but SuperMicro IPMI is not helpful
37. "Space," it says, "is big. Really big. You just won't believe how vastly, hugely, mindbogglingly big it is."
There's a substantial OpenStack learning curve for admins and developers
38. Problems with Docker
Docker likes to use 172.17.0.0/16 for its bridge network
Sanger uses 172.16.0.0/12 for its internal network
...oh.
Also problems with bridge MTU > instance MTU and PMTUD not working. Fix: --bip=192.168.3.3/24 --mtu=1400
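The same fix can be made persistent in the Docker daemon configuration (a minimal sketch of /etc/docker/daemon.json; the address is the one from the slide, the MTU matches the instance MTU):

  {
    "bip": "192.168.3.3/24",
    "mtu": 1400
  }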
39. Problems with radosgw
Ceph radosgw implements most but not all AWS S3 features
ACLs are implemented, policies are not
We're trying to implement a write-only bucket using nginx as a proxy to rewrite the auth header
40. Problems with DHCP
On Ceph nodes, the Ubuntu DHCP client doesn't request a default gateway
The Infoblox DHCP server sends the Classless Static Routes option
The DHCP client can override a server-supplied value but not ignore it
The Ceph nodes' default route ends up pointing down the 1GbE management NIC, not the 2x100GbE bond
...oh.
41. Problems with RabbitMQ
RabbitMQ partitions are really painful
We sometimes end up rebooting all the controllers - there must be a better way
Fortunately, running instances aren't affected
42. Problems with deployment
Running the overcloud deployment from the wrong directory is very bad
The deployer doesn't find the file containing the service passwords and proceeds to change them all, which is very tedious to recover from
The deployment script really, really, really needs to have
cd ~stack
to prevent accidents
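A sketch of the kind of wrapper script we mean - the environment file names are illustrative, but the cd is the important part:

  #!/bin/bash
  set -euo pipefail
  cd ~stack                          # never run the deploy from a random directory
  source ~stack/stackrc              # undercloud credentials
  openstack overcloud deploy --templates \
    -e ~stack/templates/network-environment.yaml \
    -e ~stack/templates/site-customisations.yaml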
43. Problems with cinder
When a volume is destroyed, cinder overwrites the volume with zeroes
If a user is running a pipeline which creates and destroys many 1TB volumes, this produces a lot of I/O
Consider setting volume_clear and/or volume_clear_size in cinder.conf (example below)
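A sketch of the relevant cinder.conf settings (values are illustrative; weigh the I/O saving against your data-handling policy before relaxing the wipe):

  [DEFAULT]
  volume_clear = zero        # or "none" to skip the wipe entirely
  volume_clear_size = 100    # only wipe the first 100 MiB of each deleted volume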
45. Prostate cancer analysis
Pan-Prostate builds on previous Pan-Cancer work
Multiple participating institutes using Docker to provide a consistent analysis framework
In the past that required admin time to build an isolated network; now OpenStack gives us that for free - and lets the scientists drive it themselves
48. wr - Workflow Runner
Reimplementation of the Vertebrate Resequencing Group's pipeline manager in Go
Designed to be fast, powerful and easy to use
Can manage LSF like the existing version, and adds OpenStack support
https://github.com/VertebrateResequencing/wr
50. wr - Workflow Runner
Lessons learned:
• "There's a surprising amount of stuff you have to do to get everything working well"
• There are annoying gaps in the Go SDK
• Lots of things can go wrong if end users bring up servers, so handle all the details for them
51. New Pipeline Group
Using s3fs as a shim on top of radosgw S3 speeds development (mount sketch below)
s3fs presents a bucket as a filesystem (but it's turtles all the way down)
In tests, launching up to 240 instances for read-only access to a few GB of reference sequence data, with caching turned on: up to ~8 might get stuck
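A sketch of the kind of mount used in those tests (endpoint, bucket and cache paths are illustrative):

  s3fs refdata /mnt/refdata \
    -o url=https://s3.example.sanger.ac.uk \
    -o use_path_request_style \
    -o passwd_file=/etc/passwd-s3fs \
    -o ro \
    -o use_cache=/var/cache/s3fs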
52. Human Genetics Informatics
Working towards a production Arvados system
Speedbumps around many tools/SDKs assuming real AWS S3, not some S3-alike
Sending patches to open-source projects (Packer, Terraform…)
54. More Ceph
...because 1PB isn't enough…
This has implications for DC placement (due to cooling requirements) and the Ceph CRUSH map (to ensure data replicas are properly separated)
Should we split rbd pools from radosgw pools?
55. OpenStack version upgrade
We will probably skip to RHOSP10 (Newton)
Need Arista driver integrations for VXLAN for metal-as-a-service
We will install a new system alongside the current one and migrate users, then compute nodes
56. $THING-as-a-service
metal - deploy instances on bare metal (Ironic)
key management (Barbican) to enable encrypted volumes
DNS (Designate)
shared filesystem (Manila)
…though many of these can already be achieved with creative use of images/heat/user-data
57. Federation
JISC Assent looks interesting
Lots of internal process to work through first
Open questions about:
⢠scheduling - pre-emptible instances would help
⢠charging - market-based instance pricing?
58. Lustre
We have 13PB of Lustre storage
Consider exposing some of it to tenants using Lustre routers, NID mapping and sub-mounts
59. Little things
⢠expose hypervisor RNG to instances
⢠could make instance key generation go faster
⢠have LogStash report metrics of âlog per hostâ
⢠to spot log volume anomalies
⢠...
60. Thanks
My colleagues at Sanger - both in Systems and across the institute
The OpenStack community
Helpful people on mailing lists