1. Rev Up Your HPC Engine
Fritz Ferstl, CTO Univa Corp, fferstl@univa.com
2. Who is Univa?
Copyright © 2014 Univa Corporation. All Rights Reserved.
• Profile
• Based in Chicago, global reach
• >500 customers in 3 yrs (mostly Fortune 500)
• Products / Technologies:
• Univa Grid Engine
• UniSight
• Univa License Orchestrator
• UniCloud
Data Center Automation Experts
Do more with less in Big Compute and Big Data
Help organizations play a better game of Tetris
3. Challenges for Workload and Resource Management Systems
4. Scalability
• Node counts stay flat or go down, sockets stay flat, cores explode
• With the core explosion, the number of jobs also explodes
• Ever shorter run-times, more applications, more use cases
• Large commercial sites approach or go beyond 100K
• Throughput clusters process >150 million jobs / month
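To put the throughput figure in perspective, the monthly job count can be converted into a sustained dispatch rate. The following is a rough back-of-the-envelope sketch; the 30-day month and the function name are assumptions made for illustration, not figures from the talk.

```python
# Back-of-the-envelope dispatch rate implied by the slide's throughput figure.
JOBS_PER_MONTH = 150_000_000    # ">150 million jobs / month" from the slide
DAYS_PER_MONTH = 30             # assumption: a 30-day month
SECONDS_PER_DAY = 24 * 60 * 60


def sustained_jobs_per_second(jobs_per_month: int, days: int = DAYS_PER_MONTH) -> float:
    """Average number of jobs that must be dispatched every second, around the clock."""
    return jobs_per_month / (days * SECONDS_PER_DAY)


rate = sustained_jobs_per_second(JOBS_PER_MONTH)
print(f"{rate:.1f} jobs/s sustained")  # prints "57.9 jobs/s sustained"
```

Since this is an average over the whole month, peak rates on a real cluster would be expected to be higher.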
5. Heterogeneity
• Hardware
• Multi-sockets, multi-cores
• Partial cluster upgrades
• Evolving memory, network and storage architectures
• Accelerators: GPUs, Phi
• Job Profiles
• Throughput
• Array Jobs
• Large Parallel
• Interactive
• Sessions
• Reservations
• Transactional
• Hybrid
• Dependencies, Workflows
6. Policy Variety
• Automated Transparency?
• Manual overrides
• Preferential access
• Priorities
• Reservations
• Resource Urgencies
• Quotas
• Deadlines
• Conflict Resolution
• E.g. don't starve large parallel jobs while maintaining high utilization
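One common way to reconcile these two goals is backfilling: the large parallel job at the head of the queue gets a reservation, and smaller jobs may start early only if they finish before that reservation begins. The sketch below is a deliberately simplified illustration of that idea, not a description of how any particular product schedules; the `Job` class, the function names and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    cores: int
    runtime: int  # estimated runtime, in arbitrary time units


def reservation_time(head: Job, free: int, running: list[tuple[int, int]]) -> int:
    """Earliest time at which `head` fits; `running` holds (end_time, cores) pairs."""
    t = 0
    for end, cores in sorted(running):
        if free >= head.cores:
            break
        free += cores  # cores released when this running job ends
        t = end
    if free < head.cores:
        raise ValueError("head job can never fit on this cluster")
    return t


def backfill(queue: list[Job], free: int, running: list[tuple[int, int]]) -> list[str]:
    """Names of jobs that may start now (time 0) without delaying the head job."""
    head, rest = queue[0], queue[1:]
    if head.cores <= free:
        return [head.name]  # simplification: a real scheduler would keep going
    shadow = reservation_time(head, free, running)
    started = []
    for job in rest:
        # A job qualifies if it fits into idle cores and ends before the reservation.
        if job.cores <= free and job.runtime <= shadow:
            started.append(job.name)
            free -= job.cores
    return started


queue = [Job("big", 16, 50), Job("small_a", 2, 15), Job("small_b", 4, 30), Job("small_c", 2, 5)]
print(backfill(queue, free=4, running=[(10, 8), (20, 8)]))  # ['small_a', 'small_c']
```

Here the 16-core job cannot start until time 20, so the two small jobs that finish by then fill otherwise idle cores, while `small_b` is held back because it would run past the reservation.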
7. Use Case Variety
• Classical HPC (simulation): large parallel / many mid-size parallel
• Verification / Test: throughput
• From single simulation to parameter study array jobs
• Ultra-short jobs
• Big Data / Data Mining
• Exclusive usage of nodes vs shared usage
8. Geographical Distribution / Clouds
• Resource sharing: servers, licenses, data, other
• Data access latencies
• Security
• File system dependencies
• Pre-/Post-Staging
• Data locality:
• Bring the job to the data
• Or bring the data to the job
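The data locality trade-off above can be framed as a comparison of estimated start times: waiting for a slot at the site that holds the data, versus starting sooner elsewhere after paying the transfer cost. This is a minimal sketch under strong assumptions (a single link with known bandwidth, known queue waits); all names and numbers are hypothetical.

```python
def transfer_seconds(data_gb: float, bandwidth_gbit_s: float) -> float:
    """Time to move `data_gb` gigabytes over a link of `bandwidth_gbit_s` gigabits/s."""
    return data_gb * 8 / bandwidth_gbit_s


def choose_placement(
    data_gb: float,
    bandwidth_gbit_s: float,
    wait_at_data_site_s: float,
    wait_at_remote_site_s: float,
) -> str:
    """Pick the option with the earlier estimated job start."""
    job_to_data = wait_at_data_site_s
    data_to_job = wait_at_remote_site_s + transfer_seconds(data_gb, bandwidth_gbit_s)
    return "job-to-data" if job_to_data <= data_to_job else "data-to-job"


# 500 GB over 10 Gbit/s takes 400 s, which beats a 30-minute queue wait.
print(choose_placement(500, 10, wait_at_data_site_s=1800, wait_at_remote_site_s=0))
# 5000 GB takes 4000 s, so waiting at the data site is faster.
print(choose_placement(5000, 10, wait_at_data_site_s=1800, wait_at_remote_site_s=0))
```

A real decision would also have to weigh the security and file system dependencies listed on this slide, which this comparison ignores.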
10. Evolve
• Architecture Evolution
• More cores / nodes / jobs: make it faster
• Integration with GPUs, Phi, etc
• New Scheduling Algorithms
• Efficient handling of job mixes: parallel / array / sequential jobs
• Scheduling of ultra-short jobs
• More Monitoring, Better Error Tracking
• Reporting, Accounting & Analytics
11. Be Street-Smart
• Simplify where possible!
• Be-all solution can be the most expensive
• Effort
• Poor utilization, slow ROI
• Focus on most important goals
12. Think Different
• Examples:
• Less HA in exchange for more throughput, via fast SSD RAID with regular backups
• Use array jobs wherever possible
• More smaller jobs vs fewer bigger jobs
• All things considered, preemption may be a good option
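As an illustration of the array job advice, a parameter study can be submitted as one array job whose tasks each pick their own parameters from the task index. Grid Engine exposes that index to each task in the `SGE_TASK_ID` environment variable; the parameter grid, the function name and the fallback to task 1 below are assumptions made for this sketch.

```python
import os

# Hypothetical parameter study: one array task per (temperature, pressure) pair.
TEMPERATURES = [280, 300, 320]
PRESSURES = [1.0, 2.0]


def parameters_for_task(task_id: int) -> tuple[int, float]:
    """Map a 1-based array task index onto one point of the parameter grid."""
    index = task_id - 1
    if not 0 <= index < len(TEMPERATURES) * len(PRESSURES):
        raise ValueError(f"task id {task_id} is outside the parameter grid")
    return TEMPERATURES[index // len(PRESSURES)], PRESSURES[index % len(PRESSURES)]


# Outside an array job the variable is missing or non-numeric, so fall back to task 1.
raw = os.environ.get("SGE_TASK_ID", "1")
task_id = int(raw) if raw.isdigit() else 1
temperature, pressure = parameters_for_task(task_id)
print(f"task {task_id}: temperature={temperature} pressure={pressure}")
```

With six grid points, a wrapper around this script would be submitted once with something like `qsub -t 1-6`, so the scheduler tracks a single job entry instead of six separate submissions.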
13. Accept Difference
• Simple: temporarily designate parts of cluster
• Advanced: Cloud-share
• Share resources across separate workload management system instances
• Dynamically re-assign resources (servers) based on demand
• Provides autonomy while maintaining high utilization
• But avoid meta-scheduling where you can!
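The demand-based re-assignment described above could, in its simplest form, give each workload manager instance a guaranteed minimum and split the remaining servers in proportion to pending demand. The sketch below illustrates only that idea and is not how any specific product works; the function name, the instance names and the numbers are hypothetical, and both dictionaries are assumed to share the same keys.

```python
def reassign_nodes(
    total_nodes: int, demand: dict[str, int], minimum: dict[str, int]
) -> dict[str, int]:
    """Each instance keeps its guaranteed minimum; the rest follows pending demand."""
    alloc = dict(minimum)
    spare = total_nodes - sum(minimum.values())
    if spare < 0:
        raise ValueError("guaranteed minimums exceed the number of nodes")
    total_demand = sum(demand.values())
    if total_demand == 0:
        return alloc  # nothing pending: spare nodes stay unassigned
    shares = {name: spare * d / total_demand for name, d in demand.items()}
    for name, share in shares.items():
        alloc[name] += int(share)
    # Nodes lost to rounding down go to the largest fractional remainders.
    leftover = spare - sum(int(share) for share in shares.values())
    by_remainder = sorted(shares, key=lambda n: shares[n] - int(shares[n]), reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


print(reassign_nodes(100, demand={"eda": 300, "cfd": 100}, minimum={"eda": 10, "cfd": 10}))
# {'eda': 70, 'cfd': 30}
```

Each instance keeps scheduling its own jobs on whatever servers it currently holds, which is what preserves autonomy without a meta-scheduler deciding job placement across instances.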
14. Tailored Solutions
• Tailoring & add-ons can make all the difference
• Tailoring such as
• Job Classes
• Customized reports
• Add-ons such as
• Submission portals and wrappers
15. Conclusions
• Workload & Resource Management Systems are needed more than ever
• Specifically in the “new” era of Cloud and Big Data
• Allows you to benefit from 20+ years of experience in HPC workload orchestration and to move beyond
• Clear-cut set of challenges, non-trivial solutions
• Build on best-in-class products, architectures and development teams
• Being “street-smart” about architecting and configuring a cluster has a big impact