2. PROBLEM STATEMENT
“FLEXING BETWEEN THE CLOUDS”
▸ Goals of Virtualization seem universally applicable
▸ !(Vendor Lock-in)
▸ Not all workloads are valued equally
=> IT Magic Anywhere
3. SUCCESS CRITERIA
WIN CONDITIONS
‣ Availability of compute resources is independent of the cloud provider
‣ Batch jobs can be allocated based on point-in-time cost metrics
‣ Work segregation based on compliance qualifications
5. DEFINITIONS: RESOURCE CONTEXT
THE BANE OF TECHNICAL UNDERSTANDING (AKA WORDS):
▸ Region: The isolation boundary of a Nomad Cluster
▸ Datacenter: Low latency, high bandwidth, private network
▸ Resources: The available capacity provided by a node
Common / Comfortable Pattern
Platform   Region        Datacenter
AWS        Continental   AWS_Region
GCE        Continental   GCE_Region
Azure      Location      Location

Ideal Pattern
Platform   Region        Datacenter
AWS        Global        AWS_Region
GCE        Global        GCE_Region
Azure      Global        Sets of Locations
6. NOMAD ARCHITECTURE - SINGLE REGION VIEW
BDFL FOR WORKLOAD DECISIONS
‣ In Nomad, Datacenters can speak to Region-aware Servers
‣ Datacenters don’t need to be the same platform
‣ Default Region is “global”
7. ARCHITECTURE OF SOLUTION
▸ Nomad Clients potentially provide Resources for Jobs
▸ Communication between Datacenters may need to be secured
▸ Nodes run a Consul Agent and a Nomad Client
▸ Nomad Servers “Bin Pack” tasks onto nodes
THREE PICTURES OF THE SAME THING
Single Region / Multi DataCenter
(different Clouds)
8. DEFINITIONS: TASK CONTEXT
WORDS: THE SEQUEL
▸ Task: Desired state declaration of workload
▸ Constraints: Rules limiting where a job can run
▸ Evaluations: Queued requests to compare the desired and present state of work over the region
▸ Caused by a state change event
▸ Job Completion
▸ Node Addition/Subtraction
▸ Job Scheduled
▸ Allocations: Mapping of tasks to resources within constraints
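To make the Evaluation idea concrete, here is a minimal sketch of comparing desired state against present state. It is not Nomad's code: the function name `reconcile` and the dict/list shapes are assumptions chosen for illustration only.

```python
from collections import Counter


def reconcile(desired, running):
    """Compare desired and present state; return (to_start, to_stop).

    desired: dict of task name -> number of instances wanted (hypothetical shape)
    running: list of task names currently allocated (hypothetical shape)
    """
    present = Counter(running)
    to_start = {}
    to_stop = {}
    for name in set(desired) | set(present):
        diff = desired.get(name, 0) - present.get(name, 0)
        if diff > 0:
            to_start[name] = diff      # fewer running than wanted
        elif diff < 0:
            to_stop[name] = -diff      # more running than wanted
    return to_start, to_stop


if __name__ == "__main__":
    # A node was lost (one "web" instance gone) and "old-cron" is no longer wanted.
    print(reconcile({"web": 3, "worker": 1},
                    ["web", "web", "worker", "old-cron"]))
```

Each state change event listed above (job completion, node addition or subtraction, job scheduled) would trigger one such comparison.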
9. JOB TYPES: SERVICE
KEEPING THE SITE UP
▸ Long running jobs that should always be available
▸ Scheduling decisions favor QoS
▸ Example: Ensuring a front end web service is always available
10. JOB TYPES: BATCH
WHAT TO DO WITH ALL THIS DATA?
▸ A set of work spanning a few minutes to a few days
▸ Based on the Berkeley Sparrow Two Choices model
▸ http://people.eecs.berkeley.edu/~keo/publications/sosp13-final17.pdf
▸ Probes a set of nodes which meet constraints and sends work to the "least loaded" nodes
▸ Example: Tasks to manipulate a queue of data when present
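The probing step can be sketched as follows. This is an illustration of the two-choices idea under simplifying assumptions, not the Sparrow or Nomad implementation: `pick_node`, the `load` dict, and the `feasible` callback are hypothetical names.

```python
import random


def pick_node(nodes, load, feasible, probes=2, rng=random):
    """Probe a few feasible nodes at random and return the least loaded one.

    nodes:    list of node names
    load:     dict of node name -> current load (any comparable number)
    feasible: callable returning True if a node meets the job constraints
    """
    candidates = [n for n in nodes if feasible(n)]
    if not candidates:
        return None
    # Sample without replacement; with fewer candidates than probes, probe them all.
    sampled = rng.sample(candidates, min(probes, len(candidates)))
    return min(sampled, key=lambda n: load[n])


if __name__ == "__main__":
    load = {"n1": 7, "n2": 2, "n3": 5, "n4": 9}
    print(pick_node(list(load), load, feasible=lambda n: n != "n4"))
```

Probing only a couple of nodes keeps the decision cheap while still avoiding the most heavily loaded nodes most of the time.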
11. JOB TYPES: SYSTEM
KEEPING THE LIGHTS ON
▸ A unique job type used to declare jobs which should run on every node which meets the job constraints
▸ Are re-evaluated whenever a node joins the cluster
▸ Example: distributing common tasks, which can benefit from rolling updates, job updates, service discovery
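A rough sketch of the "re-evaluated whenever a node joins" behavior, assuming a node is described by a plain dict of attributes. The class `SystemJob` and its method names are hypothetical and exist only for this illustration.

```python
class SystemJob:
    """Tracks which nodes should run one instance of a system job."""

    def __init__(self, constraint):
        self.constraint = constraint   # callable: node attributes -> bool
        self.placements = set()        # names of nodes running the job

    def on_node_join(self, name, attrs):
        """Re-evaluate when a node joins; place the job if constraints are met."""
        if self.constraint(attrs):
            self.placements.add(name)
            return True
        return False

    def on_node_leave(self, name):
        """Forget the placement when a node leaves the cluster."""
        self.placements.discard(name)


if __name__ == "__main__":
    log_shipper = SystemJob(lambda attrs: attrs.get("os") == "linux")
    log_shipper.on_node_join("node-1", {"os": "linux"})
    log_shipper.on_node_join("node-2", {"os": "windows"})
    print(sorted(log_shipper.placements))
```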
12. NOMAD SCHEDULING INTERNALS
GETTING FROM WORK AND RESOURCES TO ACCOMPLISHMENTS
▸ Evaluations read the Job Specification and find constraints
▸ Evaluation Brokers maintain the pending queue, priority, and at-least-once delivery
▸ Schedulers submit an Allocation Plan, which is evaluated first for feasibility, then by priority
▸ Allocations set jobs against resources
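The broker's role (pending queue, priority, at-least-once delivery) can be sketched with a heap plus an un-acknowledged set. This is a toy model under stated assumptions, not Nomad's broker: `EvalBroker`, `ack`, and `nack` are names chosen here for illustration.

```python
import heapq
import itertools


class EvalBroker:
    """Toy priority queue with at-least-once delivery via ack/nack."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal priorities
        self._unacked = {}                 # eval id -> priority, handed out but not confirmed

    def enqueue(self, eval_id, priority):
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), eval_id))

    def dequeue(self):
        """Hand out the highest-priority evaluation, or None if the queue is empty."""
        if not self._heap:
            return None
        neg_priority, _, eval_id = heapq.heappop(self._heap)
        self._unacked[eval_id] = -neg_priority
        return eval_id

    def ack(self, eval_id):
        """Scheduler finished: the evaluation is done and will not be redelivered."""
        self._unacked.pop(eval_id, None)

    def nack(self, eval_id):
        """Scheduler failed: put the evaluation back so it is delivered again."""
        self.enqueue(eval_id, self._unacked.pop(eval_id))


if __name__ == "__main__":
    broker = EvalBroker()
    broker.enqueue("batch-report", 10)
    broker.enqueue("web-frontend", 90)
    first = broker.dequeue()
    broker.nack(first)          # simulate a scheduler crash
    print(first, broker.dequeue())
```

An evaluation stays in the un-acknowledged set until a scheduler confirms it, which is what makes redelivery after a failure possible.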
13. LIKE TETRIS FOR WORKLOADS
▸ Tasks require resources
▸ Nodes have “dimensions” of resources
▸ Allocation fits Tasks inside Nodes
BIN PACKING
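A minimal best-fit sketch of the Tetris analogy: a task needs an amount in each resource dimension, and it goes on the feasible node that would be left with the least free capacity. The scoring rule here is a deliberate simplification and an assumption of this sketch; Nomad's actual scoring is not reproduced.

```python
def fits(free, need):
    """True if the node has enough free capacity in every dimension the task needs."""
    return all(free.get(dim, 0) >= amount for dim, amount in need.items())


def bin_pack(task, nodes):
    """Place `task` on the tightest-fitting node and return its name (or None).

    task:  dict of dimension -> amount required, e.g. {"cpu": 500, "mem": 512}
    nodes: dict of node name -> dict of free capacity per dimension (mutated on success)
    """
    feasible = [name for name, free in nodes.items() if fits(free, task)]
    if not feasible:
        return None

    def leftover(name):
        # Sum of the fraction of free capacity that would remain in each dimension.
        free = nodes[name]
        return sum((free[d] - task.get(d, 0)) / free[d] for d in free if free[d] > 0)

    best = min(feasible, key=leftover)
    for dim, amount in task.items():
        nodes[best][dim] -= amount
    return best


if __name__ == "__main__":
    nodes = {
        "big": {"cpu": 4000, "mem": 8192},
        "small": {"cpu": 1000, "mem": 1024},
    }
    print(bin_pack({"cpu": 500, "mem": 512}, nodes))
```

Filling small nodes first keeps large nodes free for tasks that cannot fit anywhere else.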
14. TASK GROUPS
PREVENTING TASK SEPARATION ANXIETY
▸ Task Groups allow multiple Tasks to require that they are scheduled on the same node
▸ Are created implicitly for single tasks in isolation
▸ Can be used to enforce compliance elements required to run together
▸ Example: Requiring log shipping co-processes
15. CONSTRAINTS
JUST BECAUSE YOU CAN, DOESN’T MEAN YOU SHOULD
▸ Job Constraints limit the resources available for a particular job group
▸ Constraints can map workloads directly to Customized Hardware such as AWS Placement Groups
16. CONSTRAINTS AND COMPLIANCE
SATISFYING COMPLIANCE REQUIREMENTS
▸ Constraints on datacenter can be used for Data Isolation inside National Boundaries.
▸ Healthcare workload that must stay within the EU
▸ Metadata attributes can allow for custom declarations.
▸ E.g. PCI DSS Compliance:
▸ Maintain network firewall
▸ Run Anti-Malware/Anti-Virus protection
▸ Monitor and log access
▸ Regularly test security systems and procedures.
job "sample_service" {
  ...
  # Job-level metadata; values are strings
  meta {
    pci_dss = "true"
  }
  group "webservice" {
    # Matches node metadata, which is set in the client configuration
    constraint {
      attribute = "${meta.pci_dss}"
      value     = "true"
    }
  }
}
Constraint Snippet
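How such constraints narrow the set of eligible nodes can be sketched as a simple filter. The node records, datacenter names, and the function `feasible_nodes` are all hypothetical and only illustrate the matching logic, not Nomad's implementation.

```python
def feasible_nodes(nodes, datacenters, required_meta):
    """Return the sorted names of nodes allowed to run a job.

    nodes:         dict of node name -> {"datacenter": str, "meta": dict of str -> str}
    datacenters:   collection of datacenter names the job may run in
    required_meta: dict of metadata key -> value that every node must match
    """
    matches = []
    for name, node in nodes.items():
        if node["datacenter"] not in datacenters:
            continue  # data must stay inside the permitted boundary
        if all(node["meta"].get(k) == v for k, v in required_meta.items()):
            matches.append(name)
    return sorted(matches)


if __name__ == "__main__":
    # Hypothetical inventory: names and datacenters are made up for this example.
    nodes = {
        "node-a": {"datacenter": "eu-dc-1", "meta": {"pci_dss": "true"}},
        "node-b": {"datacenter": "eu-dc-1", "meta": {}},
        "node-c": {"datacenter": "us-dc-1", "meta": {"pci_dss": "true"}},
    }
    print(feasible_nodes(nodes, {"eu-dc-1"}, {"pci_dss": "true"}))
```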
17. CONSTRAINTS: SATISFYING SPECIAL NEEDS
DIFFERENT THINGS ARE DIFFERENT
▸ Not all platforms are created equal
▸ Platform attributes for specifying Cloud Platforms
job "sample_service" {
  ...
  # Only place this job on nodes whose platform attribute is "aws"
  constraint {
    attribute = "${attr.platform}"
    value     = "aws"
  }
}
▸ ${attr.platform} = aws may be relevant if you need floating-point (GPU) processing, which AWS offers and GCE doesn’t
18. RAW EXECS
CHEKHOV’S TASK DRIVER
▸ Unconstrained, Un-isolated, Disabled by Default
▸ Runs as the user Nomad is running as
client {
  options = {
    # Option keys contain dots, so they must be quoted
    "driver.raw_exec.enable" = "1"
  }
}
“IT SEEMS TO BE A DEEP INSTINCT IN HUMAN BEINGS FOR
MAKING EVERYTHING COMPULSORY THAT ISN'T FORBIDDEN”
~Robert A. Heinlein
19. OPERATOR INTERACTION
RELIABLE MAGIC = OPERATIONS
$ nomad run -address=$nomad_server jobfile.nomad
‣ Operators schedule jobs against a server
‣ Nomad figures out how/where/when to run tasks
‣ Complex solution through iteration