15 Condor – A Distributed Job Scheduler
Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny
Beowulf Cluster Computing with Linux, Thomas Sterling, editor, Oct. 2001.
Summarized by Simon Kim
Contents
• Introduction to Condor
• Using Condor
• Condor Architecture
• Installing Condor under Linux
• Configuring Condor
• Administration Tools
• Cluster Setup Scenarios
Introduction to Condor
• Distributed Job Scheduler
• Condor Research Project at University of
Wisconsin-Madison Department of Computer
Sciences
• Changed name to HTCondor in 2012
– http://research.cs.wisc.edu/htcondor
Introduction to Condor
[Figure: a user's job passes through Condor (queue, policy, monitor progress, report) to nodes that are idle or running jobs, ending with "Complete!"]
Introduction to Condor
• Workload Management System
• Job Queuing Mechanism
• Scheduling Policy
• Priority Scheme
• Resource Monitoring and Management
Condor Features
• Distributed Submission
• User/Job Priorities
• Job Dependency - DAG
• Multiple Job Models – Serial/Parallel Jobs
• ClassAds – Job : Machine Matchmaking
• Job Checkpoint and Migration
• Remote System Calls – Seamless I/O Redirection
• Grid Computing – Interaction with Globus
Resources
ClassAds and Matchmaking
• Job ClassAd
– Looking for Machine
– Requirements: Intel, Linux, Disk Space, …
– Rank: Memory, Kflops, …
• Machine ClassAd
– Looking for Job
– Requirements
– Rank
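The two-sided matching described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: real ClassAds are written in Condor's own expression language, whereas here Python lambdas stand in for the Requirements and Rank expressions, and the `ClassAd` class, `best_machine` function, machine names, and attribute values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class ClassAd:
    """Toy stand-in for a ClassAd: attributes plus two expressions."""
    attrs: dict
    requirements: Callable[[dict], bool]  # must hold for the other party's attributes
    rank: Callable[[dict], float]         # higher value means more preferred


def best_machine(job: ClassAd, machines: List[ClassAd]) -> Optional[ClassAd]:
    """Return the mutually acceptable machine that the job ranks highest."""
    candidates = [
        m for m in machines
        if job.requirements(m.attrs) and m.requirements(job.attrs)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda m: job.rank(m.attrs))


# Job ad: wants Intel, Linux, enough disk; prefers more memory.
job = ClassAd(
    attrs={"Owner": "alice"},
    requirements=lambda m: (m["Arch"] == "INTEL"
                            and m["OpSys"] == "LINUX"
                            and m["Disk"] >= 1000),
    rank=lambda m: m["Memory"],
)

# Machine ads: each machine may also refuse jobs.
machines = [
    ClassAd({"Name": "node1", "Arch": "INTEL", "OpSys": "LINUX",
             "Disk": 5000, "Memory": 256},
            requirements=lambda j: True, rank=lambda j: 0.0),
    ClassAd({"Name": "node2", "Arch": "INTEL", "OpSys": "LINUX",
             "Disk": 5000, "Memory": 512},
            requirements=lambda j: j["Owner"] != "alice", rank=lambda j: 0.0),
    ClassAd({"Name": "node3", "Arch": "INTEL", "OpSys": "LINUX",
             "Disk": 500, "Memory": 1024},
            requirements=lambda j: True, rank=lambda j: 0.0),
]

match = best_machine(job, machines)
print(match.attrs["Name"] if match else "no match")
```

Here node2 has more memory than node1 but refuses this owner, and node3 fails the job's disk requirement, so a match needs both Requirements to hold before Rank is consulted.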
Using Condor
• Roadmap to Using Condor
• Submitting a Job
• User Commands
• Universes
• Standard Universe
– Process Checkpointing
– Remote System Calls
– Relinking
– Limitations
• Data File Access
• DAGMan Scheduler
Using Condor
• Prepare a Job
– Batch Job: STDIN, STDOUT, STDERR
– Universes (Runtime Environment): Standard, Vanilla, PVM, MPI, Grid, Scheduler
– Job Types: Serial Job, Parallel Job, Meta Scheduler
• Write a Submit Description
# Submit Description
universe = vanilla
executable = foo
log = foo.log
input = input.data
output = output.data
queue
• Submit
– $ condor_submit
Status of Submitted Jobs
• $ condor_status -submitters
All jobs in the Queue
• $ condor_q
• Removing Job
– $ condor_rm 350.0
• Changing Job Priority: -20 to 20 (high), default: 0
– $ condor_prio -p -15 350.1
Universes
• Execution Environment - Universe
• Vanilla
– Serial Jobs
– Binary Executable and Scripts
• MPI Universe
– MPI Programs
– Parallel Jobs
– Only on Dedicated Resources
# Submit Description
Universe = mpi
…
machine_count = 8
queue
Universes
• PVM Universe
– Master-worker Style Parallel Programs
• Written for Parallel Virtual Machine Interface
– Both Dedicated and Non-dedicated (workstations)
– Condor Acts as Resource Manager for PVM Daemon
– Dynamic Node Allocation
[Figure: PVM Daemon and Condor, linked by pvm_addhosts()]
# Submit Description
Universe = pvm
…
machine_count = 1..75
queue
Universes
• Scheduler Universe
– Meta-Scheduler
– DAGMan Scheduler
• Complex Interdependencies Between Jobs
A
B C
D
* B and C are executed in parallel
Job Sequence: A -> B and C -> D
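The ordering A -> B and C -> D can be reproduced with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later) and is not how DAGMan itself is implemented; the `waves` helper is a hypothetical name that groups jobs that may run at the same time.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it starts.
dag = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}


def waves(deps):
    """Group jobs into successive waves; jobs in one wave are independent."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every job whose parents are done
        result.append(ready)
        sorter.done(*ready)                 # mark the whole wave as finished
    return result


print(waves(dag))
```

B and C land in the same wave because both depend only on A, which matches the note that they are executed in parallel.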
Universes
• Standard Universe
– Serial Job
– Process Checkpoint, Restart, and Migration
– Remote System Calls
Process Checkpointing
• Checkpoint
– Snapshot of the Program’s Current State
– Preemptive Resume Scheduling
– Periodic Checkpoints – Fault Tolerance
– No Program Source Code Change
• Relinking with Condor System Call Library
– Signal Handler
• Process State Written to a Local/Network File
• Stack/Data Segments, CPU state, Open Files, Signal Handlers
and Pending Signals
– Optional Checkpoint Server
• Checkpoint Repository
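The idea of resuming from a saved snapshot can be illustrated at application level. Note the difference from the slide: Condor's standard universe checkpoints the whole process transparently, without source changes, whereas this sketch saves its own state explicitly with `pickle`. The `run` function, the `stop_after` parameter (which simulates preemption), and the file name `job.ckpt` are assumptions made for the example.

```python
import os
import pickle
import tempfile


def run(total_steps, ckpt_path, stop_after=None):
    """Sum 0..total_steps-1, checkpointing after every step.

    Resumes from ckpt_path if it exists. Returns None when "preempted"
    after stop_after steps, otherwise the final sum.
    """
    if os.path.exists(ckpt_path):
        with open(ckpt_path, "rb") as f:
            state = pickle.load(f)          # restart from the last snapshot
    else:
        state = {"step": 0, "acc": 0}       # fresh start

    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return None                     # simulated preemption
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        # Write to a temporary file, then rename, so a crash during the
        # write never leaves a half-written checkpoint behind.
        tmp_path = ckpt_path + ".tmp"
        with open(tmp_path, "wb") as f:
            pickle.dump(state, f)
        os.replace(tmp_path, ckpt_path)
    return state["acc"]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "job.ckpt")
    first = run(10, path, stop_after=4)     # preempted after four steps
    second = run(10, path)                  # resumes at step 4 and finishes

print(first, second)
```

The second call does not repeat the first four steps, which is the property that makes preemptive-resume scheduling and periodic checkpoints for fault tolerance worthwhile.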
Remote System Calls
• Redirects File I/O
– open(), read(), write() -> Network Socket I/O
– Sent to ‘condor_shadow’ process on Submit Machine
• Handles Actual File I/O
• Note that Job Runs on Remote Machine
• Relinking Condor Remote System Call Library
– $ condor_compile cc myprog.o -o myprog
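The redirection can be pictured with a toy model: a stub on the execute side forwards each file operation over a socket, and a shadow on the submit side performs the real I/O. This is a sketch under heavy assumptions, not Condor's protocol. The JSON message format and the `shadow` and `RemoteFiles` names are invented for illustration, and both ends run in one process, connected by `socket.socketpair()`.

```python
import json
import os
import socket
import tempfile
import threading


def shadow(conn, base_dir):
    """Submit-side loop: perform the file operations the job asks for."""
    files = {}
    requests = conn.makefile("r")
    replies = conn.makefile("w")
    for line in requests:                      # ends when the job side shuts down
        req = json.loads(line)
        if req["op"] == "open":
            fd = len(files)
            files[fd] = open(os.path.join(base_dir, req["name"]), req["mode"])
            reply = {"fd": fd}
        elif req["op"] == "write":
            files[req["fd"]].write(req["data"])
            reply = {"n": len(req["data"])}
        elif req["op"] == "read":
            reply = {"data": files[req["fd"]].read()}
        else:                                  # "close"
            files.pop(req["fd"]).close()
            reply = {}
        replies.write(json.dumps(reply) + "\n")
        replies.flush()


class RemoteFiles:
    """Execute-side stub: forwards each call instead of touching local disk."""

    def __init__(self, conn):
        self._in = conn.makefile("r")
        self._out = conn.makefile("w")

    def call(self, **request):
        self._out.write(json.dumps(request) + "\n")
        self._out.flush()
        return json.loads(self._in.readline())


job_end, shadow_end = socket.socketpair()
with tempfile.TemporaryDirectory() as submit_dir:
    worker = threading.Thread(target=shadow, args=(shadow_end, submit_dir))
    worker.start()

    remote = RemoteFiles(job_end)
    fd = remote.call(op="open", name="out.txt", mode="w")["fd"]
    remote.call(op="write", fd=fd, data="hello")
    remote.call(op="close", fd=fd)

    fd = remote.call(op="open", name="out.txt", mode="r")["fd"]
    text = remote.call(op="read", fd=fd)["data"]
    remote.call(op="close", fd=fd)

    job_end.shutdown(socket.SHUT_WR)           # signal end of requests
    worker.join()

job_end.close()
shadow_end.close()
print(text)
```

The job-side code never opens a file itself; every byte is read or written in `submit_dir`, which mirrors the point that the job runs remotely while the shadow handles the actual file I/O.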
Standard Universe Limitations
• No Multi-Process Jobs
– fork(), exec(), system()
• No IPC
– Pipes, Semaphores, and Shared Memory
• Brief Network Communication
– Long Connection -> Delay Checkpoints and Migration
• No Kernel-level Threads
– User-level Threads Are Allowed
• File Access: Read-only or Write-only
– Read-Write: Hard to Roll Back to Old Checkpoint
• On Linux, Must be Statically Linked
Data Access from a Job
• Remote System Call – Standard Universe
• Shared Network File System
• What About Non-dedicated Machines (Desktops)?
– Condor File Transfer
– Before Run, Input Files Transferred to Remote
– On Completion, Output Files Transferred Back to
Submit Machine
– Requested in Submit Description File
• transfer_input_files = <…>, transfer_output_files=<…>
• transfer_files=<ONEXIT | ALWAYS | NEVER>
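A submit description that requests file transfer could be assembled as below. The keys `transfer_input_files`, `transfer_output_files`, and `transfer_files` are taken from this slide; the `render_submit` helper and the file name `params.cfg` are hypothetical, and the sketch only formats text without invoking Condor.

```python
def render_submit(settings, queue_count=1):
    """Format key/value settings as submit description text ending in 'queue'."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    lines.append("queue" if queue_count == 1 else f"queue {queue_count}")
    return "\n".join(lines) + "\n"


submit_text = render_submit({
    "universe": "vanilla",
    "executable": "foo",
    "log": "foo.log",
    # Copied to the execute machine before the job starts.
    "transfer_input_files": "input.data, params.cfg",
    # Copied back to the submit machine when the job finishes.
    "transfer_output_files": "output.data",
    # One of ONEXIT, ALWAYS, NEVER, as listed on the slide.
    "transfer_files": "ONEXIT",
})

print(submit_text)
```

The resulting text would be saved to a file and passed to condor_submit, as on the earlier "Using Condor" slide.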
Condor Architecture
[Figure: Central Manager Machine runs Negotiator and Collector (plus Startd and Schedd); Machine 1 through Machine N each run Startd and Schedd]
Condor Architecture
[Figure: Central Manager Machine (Negotiator, Collector, Startd, Schedd); Machine 1: Submit (Startd, Schedd, Shadow); Machine N: Execute (Startd, Schedd, Starter, Job); Job and Shadow are linked by Condor Remote System Call]
Cluster Setup Scenarios
• Uniformly Owned Dedicated Cluster
– MPI Jobs on Dedicated Nodes
• Cluster of Multi-Processor Nodes
– 1 VM per Processor
• Cluster of Distributively Owned Nodes
– Jobs from Owner Preferred
• Desktop Submission to Cluster
– Submit-only Node Setup
• Non-Dedicated Computing Resources
– Opportunistic Scheduling and Matchmaking with Process
Checkpointing, Migration, Suspend and Resume
Conclusion
• Distinct Features
– Matchmaking with Job and Machine ClassAds
– Preemptive Scheduling and Migration with
Checkpointing
– Condor Remote System Call
• Powerful Tool for Distributed Job Scheduling
– Within and Beyond Beowulf Clusters
• Unique Combination of Dedicated and
Opportunistic Scheduling
