DevOps and HPC: Saudi Aramco HPC use case discusses how DevOps practices like infrastructure as code and configuration management tools like Puppet can help optimize HPC clusters. Benefits include speeding up cluster deployments from days to hours, continuous deployment, drift control, and team collaboration through version control. Containers are also discussed as a potential way to improve portability, scalability and software delivery for HPC workloads. However, challenges include changing processes, kernel requirements, security, and keeping pace with the fast-moving container ecosystem.
1. DevOps and HPC:
Saudi Aramco HPC use case
Walid A. Shaari 20th April 2016
Ahmed Bu-khamsin
2. 2
DISCLAIMER OF ENDORSEMENT
References in this document to any specific commercial product, process, or
service by trade name, trademark, manufacturer, or otherwise do not
necessarily constitute or imply its endorsement, recommendation, or favoring
by Saudi Aramco or the Saudi Aramco HPC group. The ideas and findings of the
authors expressed in any slides or other material should not be construed as
an official Saudi Aramco or HPC team position and shall not be used for
advertising or product endorsement purposes. Information contained in this
document is published in the interest of scientific and technical
information exchange.
3. 3
DevOps
A cultural movement or practice that emphasizes the collaboration and
communication of both application developers and operations professionals.
[Diagram: Development, Business, and Operations intersecting — adaptive, automated, agile]
4. 4
Business Drivers
o Optimization
Effective utilization of data center resources:
• Utilization of systems, storage, network, and services.
• Better use of employees' time and skills.
o Growth (N x R x P)
Increasing infrastructure scale:
• N: number of managed nodes/clusters/environments
• R: number of applications (business roles)
• P: number of technical services (technology profiles)
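As a rough illustration of how the N x R x P model multiplies the configuration surface, consider the sketch below. The numbers are made up for illustration, not Aramco figures:

```python
# Hypothetical scale figures to illustrate the N x R x P growth model.
n_nodes = 9      # N: managed clusters/environments
n_roles = 12     # R: applications (business roles)
n_profiles = 20  # P: technical services (technology profiles)

# Every node/role/profile combination is configuration that has to be
# kept consistent somewhere -- the product grows quickly.
combinations = n_nodes * n_roles * n_profiles
print(combinations)  # 2160
```

Even at modest values of N, R, and P, the combinations quickly outgrow what hand-maintained scripts can keep consistent.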
8. 8
[Diagram: nine separately managed clusters, each with its own scripts, packages, files, services, mounts, and security configuration]
• Different Hardware
• Different Sizes
• Different Users
• Different Operating Systems
9. 9
[Diagram: the same nine clusters, each maintained independently by hand — scripts, packages, files, services, mounts, and security per cluster]
Common Tasks:
• Apply security patches
• Add new storage
• Upgrade the OS
• Install new packages
Common Issues:
• Scalability issues
• Lack of history
• No team collaboration
• No drift control
• Long development and test cycles
10. 10
Solution
• Do it the DevOps way
- Infrastructure as code
• Definition of infrastructure as code:
"Enable the reconstruction of the business from nothing but a source code
repository, an application data backup, and bare metal resources"
11. 11
Configuration Management Tools
• Domain-specific language:
- To describe the infrastructure's desired state
• Data store:
- To store configuration specifications and other data
• Control system:
- To deploy code and apply the required configuration changes
• Version control system:
- To keep history
- To enforce workflow and peer review
- To enable team collaboration
12. 12
Puppet
• Open-source IT automation framework
• Framework to simplify and automate system configuration and provisioning
• Replaces "for loops over ssh" and ad-hoc scripts
• Hundreds of configuration modules available for download
• Supports many Linux distributions, Windows, storage and network devices
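To make the "desired state" idea concrete, a minimal Puppet manifest might look like the sketch below. The class, package, and mount names are illustrative, not Aramco's actual modules:

```puppet
# Minimal sketch: declare the desired state of an NFS scratch mount.
# Puppet converges the node to this state on every run, which is also
# what gives drift control for free.
class hpc_node::nfs_mount {
  package { 'nfs-utils':
    ensure => installed,
  }

  file { '/scratch':
    ensure => directory,
  }

  mount { '/scratch':
    ensure  => mounted,
    device  => 'nfs-server:/export/scratch',
    fstype  => 'nfs',
    options => 'defaults',
    require => [ Package['nfs-utils'], File['/scratch'] ],
  }
}
```

Because the manifest describes state rather than steps, the same code both deploys a fresh node and verifies compliance on an existing one.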
13. 13
Where Puppet Fits
Cluster Deployment Project Plan
• Hardware Delivery
• Power Up and Network Connectivity
• OS Installation
• Aramco Customization
• Benchmarking
• Application Testing
• Production
Provisioning tools: HP CMU, IBM xCAT, Dell Bright
14. 14
Benefits
• Speeds up cluster deployment from days to hours
- Shorter development cycle
- Same code used for deployment and compliance
- Code reuse
15. 15
Benefits
[Charts: contributions during the Puppet deployment project and the first and second cluster deployment projects]
November 13, 2014 – April 22, 2015
Commit statistics for production:
• 697 commits over 160 days
• An average of 4.4 commits per day
• Contributed by 9 authors
16. 16
Benefits
• Automatic and continuous deployment
- Classify the cluster to the right type and Puppet does the rest
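Classification can be expressed, for example, with node definitions in Puppet's site.pp; the hostnames and role classes below are hypothetical:

```puppet
# Hypothetical node classification: matching a node against a cluster
# type pulls in its entire configuration via a single role class.
node /^gpu\d+\.hpc\.example\.com$/ {
  include role::hpc_compute_gpu
}

node /^login\d+\.hpc\.example\.com$/ {
  include role::hpc_login
}

node default {
  include role::base
}
```

In practice the mapping is often delegated to an external node classifier (ENC), such as the Puppet console or Foreman, rather than hard-coded patterns.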
24. 24
Which workloads and frameworks are running on
OpenStack?
Source : https://www.openstack.org/assets/survey/Public-User-Survey-Report.pdf
25. 25
Vendor Trends
HPC in non-bare-metal environments: experimental? Is it mature?
26. 26
Next Generation Provisioning
Puppet Razor . OpenStack Ironic
• No vendor lock-in: open-source availability
• Environment agnostic
• Bare metal, virtual images, and containers
• Uses open standards
• IPMI 2.0, iPXE, DHCP, REST, HTTPS
• Handles end-to-end application provisioning
• Better integration with other tools
• Configuration management, CMDB, monitoring
• Programmable
• On-demand provisioning
• Policy/model based
27. 27
Data Center: Current State
[Diagram: three siloed clusters (Cluster Management A, B, and C), each with its own scheduler and job queue; per-silo utilization swings between 0% and 100%]
28. 28
Data Center: Breaking the Silos
[Diagram: a MetaScheduler distributing jobs across the three per-cluster schedulers]
29. 29
Data Center: Efficient, Secure Allocation of Resources
[Diagram: a data-center scheduler dispatching jobs to per-scheduler virtual clusters — VC1 (Infra), VC2 (HPC), VC3 (BigData)]
2nd Generation Cluster Management
30. 30
Containers
A container encapsulates an application, together with all of its software
dependencies, into a standardized unit of software that is portable across
different platforms.*
Source: https://www.docker.com/what-docker
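As a sketch of what such a "standardized unit" looks like in practice, a Dockerfile can bundle an HPC binary with its runtime libraries. The base image, package, and binary names below are assumptions for illustration, not the presenters' setup:

```dockerfile
# Illustrative only: packages a hypothetical MPI-linked solver together
# with the runtime it depends on.
FROM centos:7

# Install the MPI runtime the solver links against.
RUN yum install -y openmpi && yum clean all
ENV PATH=/usr/lib64/openmpi/bin:$PATH

# Copy the pre-built solver binary into the image.
COPY solver /opt/app/solver

ENTRYPOINT ["/opt/app/solver"]
```

Built once (e.g. `docker build -t hpc/solver .`), the same image can then run unmodified on any Docker-capable host, which is the portability argument made on the following slides.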
31. 31
Containers: Potential Benefits for HPC
o High performance
o Lightweight
o Portable: could solve software packaging, configuration, and delivery
o Visibility of the host kernel and system drivers
o Composable
o Enables more scalable monitoring, logging, and security
o Private in-house repositories
o Workforce separation of concerns (e.g. operations, development, security, users)
o Builds on mature agile application lifecycle management
o Empowers application support teams and developers
o Holistic yet modular ecosystem
o Schedulers and cluster managers
(Traditional: e.g. LSF, UGE, Moab, and Slurm)
(Modern: Mesos, Kubernetes, Nextflow)
34. 34
Possible Workloads on One Host
[Diagram: a single Docker Engine on an enterprise Linux distribution or a minimal VM such as Tiny Core Linux, running mixed containers side by side — RHEL7-based HPC tasks, Alpine-based microservices, an Ubuntu big-data workload, and an ELK stack (Elasticsearch, Logstash, Kibana) with Redis]
35. 35
HPC Host Reality
[Diagram: multiple Docker-capable hosts, each running a Docker Engine with containerized HPC services and jobs carrying their own bins/libs, alongside RHEL7-based HPC tasks, all coordinated by a container cluster management/orchestration layer]
36. 36
Possible HPC Challenges
o Change of processes and mindset
o Linux kernel requirements
o Maturity of cluster management and scheduling solutions
o Keeping up with the container ecosystem
o An extremely fast-moving target
o Several architectural and fundamental decisions to make
o Memory deduplication
o Necessity of automated tool chains
“development, integration, and delivery workflows”
o Security
- Trusted container libraries
40. 40
Mesos
§ Mature, Open Source Apache Project
§ Cluster Resource Manager
§ Scalable to 10,000s of nodes
§ Fault tolerant, no single point of failure
§ Multi-tenancy with strong resource isolation
§ Improved resource utilization