HPC and cloud distributed computing, as a journey
Flexible Compute Environment
• We have successfully been using the Farms for a number of years.
• These are a fantastic tool for our science.
• But now new cloud technologies are available.
• We want to take advantage of these industry-standard technologies in our own private cloud, whilst still retaining the benefits of the Farms.
HPC and Cloud Computing are Complementary
Traditional HPC: the highest possible performance for Sanger workloads.
• A mature and centrally managed compute platform.
• High-performance Lustre filesystems and system libraries optimised for heavy, data-centric workloads.
• Efficient fair-share workload management across large-scale infrastructure.
• The ability to add bespoke or disruptive technologies, such as FPGA or GPU accelerators.
Flexible Compute: a collaborative platform with security and flexibility.
• Full segregation between projects ensures data security throughout computation tasks.
• Offers standard APIs and tools, including Docker, to our developers and community of collaborators.
• Scale-out computing with the ability to overcommit CPU allocation ensures efficient resource utilisation.
• Developers and collaborators are no longer tied to a single operating system or shared libraries; they are free to follow the latest technologies and trends.
Cloud computing: a flexible platform for collaborative research
Cloud computing permits our customers to run their own images within a secured network environment. This means that:
• We are no longer tied to the operating system and libraries that run on today's HPC platforms.
• We can distribute our code in a pre-packaged format that others can share. By sharing these images we can send Sanger pipelines to collaborators without technical knowledge, sending the workloads to where the data resides (not the data to the work).
• Cloud computing provides common APIs and libraries to our research community.
• We can support disruptive technologies, including Docker, in a secure environment, accelerating development of new scientific pipelines and software provenance.
• We can overcommit CPU usage by at least 1.5:1, allowing for greater resource efficiency and cost management.
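
To make the "images that others can share" idea concrete, here is a minimal sketch of publishing a pre-packaged pipeline image via the OpenStack SDK, assuming a clouds.yaml entry; the cloud name, image name and file are hypothetical, and the same idea applies to Docker images pushed to a shared registry.

```python
import openstack

# Hypothetical sketch: upload a pre-built pipeline image so collaborators can launch it.
conn = openstack.connect(cloud="sanger")  # credentials read from clouds.yaml

image = conn.create_image(
    name="sanger-pipeline-v1",
    filename="pipeline.qcow2",      # pre-packaged VM image containing the pipeline
    disk_format="qcow2",
    container_format="bare",
    wait=True,
)
print(image.id, image.status)
```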
Cloud computing collaboration across sites
Open standard APIs and federated AAAI connect WTSI Flexible Compute with the MRC Bioinformatics Centres and public clouds: common collaborative APIs and familiar self-service environments that are secure by design.
Future HPC and cloud computing integration
OpenStack providing both an HPC cluster and a cloud environment per group (CASM, Humgen, Core, …).
Software Defined Networking (Infrastructure as Code, IaC)
It is becoming possible to install and manage traditional HPC infrastructure using the same tools that our cloud environment provides.
This would allow us to:
• Provide a consistent, securable network infrastructure across all computation resources.
• Provision HPC resources in the traditional manner and cloud resources on a per-tenant basis.
• Provide the best of both worlds for our customers.
• Reduce management overheads.
• Expand to meet future demand.
• Accommodate disruptive technologies as they arise.
Build from a menu requires no user knowledge…
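
As an illustration of the infrastructure-as-code idea above, here is a hedged sketch of defining a tenant network through the OpenStack SDK rather than hand-configuring switches; the cloud name, network names and CIDR are hypothetical.

```python
import openstack

# Hypothetical sketch: create a per-tenant network, subnet and router programmatically.
conn = openstack.connect(cloud="sanger")

network = conn.network.create_network(name="casm-private")
subnet = conn.network.create_subnet(
    network_id=network.id,
    name="casm-private-subnet",
    ip_version=4,
    cidr="10.10.0.0/24",
)
router = conn.network.create_router(name="casm-router")
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
```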
Sanger flexible compute platform is coming soon
• Our new flexible compute platform has been under acceptance and early testing for over a month.
• We've had excellent feedback from our research customers.
• On release, this platform will provide:
  • 9000 vCPUs
  • > 50 TB of memory
  • 1 PB of storage, including externally accessible S3 object storage.
Released on Wednesday the 22nd of February 2017.
For more details, please contact Peter Clapham or Tim Cutts.
OpenStack day on Campus
• On March the 10th we are hosting an OpenStack day with guest speakers from the wider community.
• Tim Bell from CERN is a keynote speaker; Tim is responsible for CERN's OpenStack environment of > 200,000 cores.
• Other attendees include representatives from:
Production OpenStack (I)
• 107 compute nodes (Supermicro), each with:
  • 512 GB of RAM, 2 * 25 Gb/s network interfaces,
  • 1 * 960 GB local SSD, 2 * Intel E5-2690 v4 (14 cores @ 2.6 GHz).
• 6 control nodes (Supermicro), allowing two OpenStack instances, each with:
  • 256 GB RAM, 2 * 100 Gb/s network interfaces,
  • 1 * 120 GB local SSD, 1 * Intel P3600 NVMe (/var),
  • 2 * Intel E5-2690 v4 (14 cores @ 2.6 GHz).
• Total of 53 TB of RAM, 2996 cores, 5992 with hyperthreading.
• Red Hat OpenStack (Liberty) deployed with TripleO.
Production OpenStack (II)
• 9 storage nodes (Supermicro), each with:
  • 512 GB of RAM,
  • 2 * 100 Gb/s network interfaces,
  • 60 * 6 TB SAS disks, 2 system SSDs,
  • 2 * Intel E5-2690 v4 (14 cores @ 2.6 GHz),
  • 4 TB of Intel P3600 NVMe used for journals,
  • Ubuntu Xenial.
• 3 PB of raw disk space, 1 PB usable.
• Single instance: 1.3 GB/s write, 200 MB/s read.
• Ceph benchmarks imply 7 GB/s.
Production OpenStack (III)
• 3 racks of equipment, 24 kW load per rack.
• 10 Arista 7060CX-32S switches:
  • 1U, 32 * 100 Gb/s -> 128 * 25 Gb/s,
  • hardware VXLAN support integrated with OpenStack*.
• Layer 2 traffic is limited to the rack; VXLAN is used inter-rack.
• Layer 3 between racks and interconnect to legacy systems.
• All network switch software can be upgraded without disruption.
• True Linux systems.
• 400 Gb/s from racks to spine, 160 Gb/s from spine to legacy systems.
(* VXLAN in the ML2 plugin is not used in the first iteration.)
But what are we providing?
• CloudForms: service-driven access.
• OpenStack Horizon: granular control over instances.
• Direct API access: direct HTTPS access.
• Some of these routes are reachable from anywhere; others are accessible only from within Sanger.
• Ceph object storage (used to provide volume and image storage), with an S3 object storage layer.
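
Because the Ceph object store is exposed through an S3-compatible layer, standard S3 tooling can be pointed at it. The following is a hedged sketch using boto3; the endpoint URL, credentials, bucket and file names are placeholders, not the real service details.

```python
import boto3

# Hypothetical sketch: talk to the Ceph-backed S3 object storage layer with boto3.
s3 = boto3.client(
    "s3",
    endpoint_url="https://s3.example.sanger.ac.uk",  # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)
s3.create_bucket(Bucket="pipeline-results")
s3.upload_file("results.vcf.gz", "pipeline-results", "sample1/results.vcf.gz")
```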
How Does this Fit with Existing Services?
The OpenStack "bubble" (compute and Ceph on a 100 Gb/s SDN network infrastructure) connects to Sanger internal systems over 80 Gb/s links. From within the bubble there is:
• Access to secured services, e.g. iRODS, databases and CIFS (Windows shares).
• S3 API access.
• OpenStack API and GUI access (CloudForms and Horizon interfaces).
• No access to NFS or Lustre.
Efficient Resource Management
OpenStack resources are managed at a tenant-group level.
• Each "tenant" group has an assigned quota for:
  • Disk
  • CPU
  • Memory
Once these limits are reached, tenant members will either have to wait for resources to become available, or shut down or terminate a running instance.
Initial quotas are agreed with the IC before creation.
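
A minimal sketch of how a tenant's assigned quota can be inspected with the OpenStack SDK; the cloud and project names are hypothetical.

```python
import openstack

# Hypothetical sketch: read a tenant group's compute and volume quotas.
conn = openstack.connect(cloud="sanger")

compute_quota = conn.get_compute_quotas("casm")   # CPU and memory limits
volume_quota = conn.get_volume_quotas("casm")     # disk limits
print(compute_quota.cores, compute_quota.ram, volume_quota.gigabytes)
```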
Quotas: they are not all the same
• Some groups require a guaranteed number of slots available for essential services.
• Other groups would like to burst to meet demand as required.
These requirements do not fit well with each other.
The Proposed Workaround
For those projects which require guaranteed access:
• We create a dedicated tenant group with access to a set quota allocation of vCPU, disk and memory.
• This is tied directly to reserved hardware.
This guarantees the requested resource will be available when required, whilst providing security, operating-system flexibility and instance management.
BUT there is no ability to use more than the requested allocation.
Dynamic Workflows
Dynamic workflows can expand to meet demand and collapse when not required, so a quota that matches the initial resource request would leave the system constantly under-allocated.
For the initial release we will start by:
• Overcommitting CPU by 1.5:1 (available total of ~9000 vCPUs).
• Over-allocating quotas so that 115% of the overcommitted vCPUs are available to tenants.
This gives some initial ability to use more of the system than may be available.
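
A short worked example of where those numbers come from, using the core counts quoted on the production hardware slides:

```python
# Worked example of the initial overcommit and quota figures.
physical_cores = 2996                       # 107 compute nodes * 28 cores
hyperthreads = physical_cores * 2           # 5992 hardware threads
vcpus_available = hyperthreads * 1.5        # 1.5:1 overcommit  -> ~8988 (~9000 vCPUs)
quota_allocatable = vcpus_available * 1.15  # 115% over-allocation -> ~10336 vCPUs of quota
print(int(vcpus_available), int(quota_allocatable))
```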
For More Details, see
https://docs.google.com/a/sanger.ac.uk/document/d/17z9urhh3bTLRhQo9b8CcsZW_3O7cxlGY9uiwpAS_GqQ/edit?usp=sharing
or
http://tinyurl.com/zzurp5s
We are adding monitoring and metrics gathering to the system. This will provide a feedback loop for quota and project management.
New Opportunities for Application Development
Cloud application development aims to scale out compute and provide:
• Auto-scaling of key services.
• Pipelines that are cost-effective on commercial platform providers.
• Self-healing of service components that fail.
• Resilient services with reduced impact when service components fail.
• No tie to any one specific environment.
• Sharing of code, images and services with collaborators, which can dramatically reduce the need to copy large data sets around the world and permits running complex pipelines where the data resides.
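
One way to read the "self-healing" point is a small watchdog that restarts failed instances in a tenant. The sketch below is hedged and hypothetical (cloud name, polling interval and restart policy are assumptions); in practice this could equally be done with Heat, monitoring hooks or an external service.

```python
import time
import openstack

# Hypothetical sketch: restart instances in a tenant that have stopped or failed.
conn = openstack.connect(cloud="sanger")

while True:
    for server in conn.compute.servers():
        if server.status == "SHUTOFF":
            conn.compute.start_server(server)                       # bring it back up
        elif server.status == "ERROR":
            conn.compute.reboot_server(server, reboot_type="HARD")  # try a hard reboot
    time.sleep(60)
```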
How do we see Migration?
Initial Early Adopters
We have some early adopters!
1. Mutational signatures
2. Imputation service
3. BLAST service
4. Pan-prostate
We look forward to hearing more from these groups soon!
Mostly Share a Common Approach
A web interface handles data upload, invokes the analysis, runs it, updates the job-status database, presents the data and retains a copy.
Adaptation to Cloud-Based Tools
Stage: current approach -> cloud approach
• User details: local databases, directory services, OAuth -> OAuth, directory services
• Data downloads: Globus or HTTPS -> S3, Globus
• Job status: RDBMS (MySQL, Oracle or PostgreSQL) -> NoSQL (MongoDB, Cassandra or Redis)
• Invoke job analysis: hand-crafted request to LSF -> AMQP
• Run analysis: LSF job submission -> AMQP, Heat orchestration or API call to OpenStack
• Present data: make available via SFTP, Globus or HTTPS web upload -> S3 automatically generated URLs
• Keep data: no consistent approach -> S3, archive as required
• Service failure: await systems -> use IFTTT or add code to the instance to raise or restart an instance as required
• Autoscale options: await systems -> use IFTTT or add code to the instance to raise or restart an instance as required
• Service discovery: manual -> cloud-init, Heat templates, dynamic DNS
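
For the "invoke job analysis" stage, the table's shift from a hand-crafted LSF request to AMQP could look like the hedged sketch below; the broker host, queue name and message fields are hypothetical.

```python
import json
import pika

# Hypothetical sketch: publish an analysis job to an AMQP queue instead of calling LSF.
connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq.example"))
channel = connection.channel()
channel.queue_declare(queue="analysis-jobs", durable=True)

job = {"pipeline": "imputation", "input": "s3://pipeline-results/sample1.vcf.gz"}
channel.basic_publish(
    exchange="",
    routing_key="analysis-jobs",
    body=json.dumps(job),
    properties=pika.BasicProperties(delivery_mode=2),  # persist the message until consumed
)
connection.close()
```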
Editor's Notes
1. To bring HPC and cloud together we need a consistent point of management and a consistent network infrastructure. OpenStack and CloudForms can provide this when integrated with a dynamically managed network. The ultimate aim is to provide a consistent environment on a per-group basis: on logging into their environment, both the HPC and the private cloud environment will be available to all of the group's customers. This is all dynamic and no configuration is required by the customer.
Common build processes: provenance, management and collaboration by nature. Saves migration of data and simplifies data governance. Ability to burst compute to meet the challenge and collapse when it is no longer required. Oversubscribe cores by 50% initially, improving efficiency.
Initially make pre-prepared dev areas available. Large requests will require an application; this will reduce conflicts or oversubscription. The IC can decide on distribution if we hit significant overcapacity in the future; this is not expected any time soon. The initial version is live now. See Scientific Computing, Pete & Tim.