SlideShare ist ein Scribd-Unternehmen logo
1 von 12
Downloaden Sie, um offline zu lesen
Making	
  HTCondor	
  Energy	
  Efficient	
  
by	
  iden5fying	
  miscreant	
  jobs	
  
Stephen	
  McGough,	
  Ma@hew	
  Forshaw	
  	
  
&	
  Clive	
  Gerrard	
  
Newcastle	
  University	
  
Stuart	
  Wheater	
  
Arjuna	
  Technologies	
  Limited	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
Task	
  lifecycle	
  
5me	
  
HTC	
  user	
  
submits	
  	
  
task	
  
resource	
  
selec5on	
  
Interac5ve	
  
user	
  logs	
  	
  
in	
  
Task	
  
evic5on	
  
resource	
  
selec5on	
  
Computer	
  
reboot	
  
resource	
  
selec5on	
  
Interac5ve	
  
user	
  logs	
  	
  
in	
  
…	
  
resource	
  
selec5on	
  
Task	
  
evic5on	
  
Task	
  
evic5on	
  
Task	
  
comple5on	
  
5me	
  
…	
  
Good	
  
Bad	
  
‘Miscreant’	
  –	
  
has	
  been	
  
evicted	
  but	
  
don’t	
  know	
  if	
  
it’s	
  good	
  or	
  bad	
  	
  
Mo5va5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
Mo5va5on	
  
•  We	
  have	
  run	
  a	
  high-­‐throughput	
  cluster	
  for	
  ~6	
  years	
  
–  Allowing	
  many	
  researchers	
  to	
  perform	
  more	
  work	
  quicker	
  
•  University	
  has	
  strong	
  desire	
  to	
  reduce	
  energy	
  
consump5on	
  and	
  reduce	
  CO2	
  produc5on	
  
–  Currently	
  powering	
  down	
  computer	
  &	
  buying	
  low	
  power	
  PCs	
  
–  “If	
  a	
  computer	
  is	
  not	
  ‘working’	
  it	
  should	
  be	
  powered	
  down”	
  
•  Can	
  we	
  go	
  further	
  to	
  reduce	
  wasted	
  energy?	
  
–  Reduce	
  5me	
  computers	
  spend	
  running	
  work	
  which	
  does	
  not	
  
complete	
  
–  Prevent	
  re-­‐submission	
  of	
  ‘bad’	
  jobs	
  
–  Reduce	
  the	
  number	
  of	
  resubmissions	
  for	
  ‘good’	
  jobs	
  
•  Aims	
  
–  Inves5gate	
  policy	
  for	
  reducing	
  energy	
  consump5on	
  
–  Determine	
  the	
  impact	
  on	
  high-­‐throughput	
  users	
  
Mo5va5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
Can	
  we	
  fix	
  the	
  number	
  of	
  retries?	
  
0 200 400 600 800 1000 1200 1400 1600 1800 200
0
100
200
300
400
500
600
700
800
900
Number of evictions
Cumulativewastedseconds(millions)
Good Jobs
Bad Jobs







    




•  ~57	
  years	
  of	
  compu5ng	
  5me	
  during	
  2010	
  
•  ~39	
  years	
  of	
  wasted	
  5me	
  
–  ~27	
  years	
  for	
  ‘bad’	
  tasks	
  :	
  average	
  45	
  retries	
  :	
  max	
  1946	
  retries	
  
–  ~12	
  years	
  for	
  ‘good’	
  tasks	
  :	
  average	
  1.38	
  retries	
  :	
  max	
  360	
  retries	
  
•  100%	
  ‘good’	
  task	
  comple5on	
  -­‐>	
  360	
  retries	
  
•  S5ll	
  wastes	
  ~13	
  years	
  on	
  ‘bad’	
  tasks	
  
–  95%	
  ‘good’	
  task	
  comple5on	
  -­‐>	
  3	
  retries	
  :	
  9,808	
  good	
  tasks	
  killed	
  (3.32%)	
  
–  99%	
  ‘good’	
  task	
  comple5on	
  -­‐>	
  6	
  retries	
  :	
  	
  2,022	
  good	
  tasks	
  killed	
  (0.68%)	
  
Mo5va5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
0 200 400 600 800 1000 1200 1400
0
200,000
400,000
600,000
800,000
1,000,000
1,200,000
1,400,000
1,600,000
1,800,000
Idle time (minutes)
Cumulativefreeseconds
Can	
  we	
  make	
  tasks	
  short	
  enough?	
  
•  Make	
  tasks	
  short	
  enough	
  to	
  reduce	
  miscreants	
  
•  Average	
  idle	
  interval	
  –	
  371	
  minutes	
  
•  But	
  to	
  ensure	
  availability	
  of	
  intervals	
  
– 95%	
  :	
  need	
  to	
  reduce	
  	
  
5me	
  limit	
  to	
  2	
  minutes	
  
– 99%	
  :	
  need	
  to	
  reduce	
  	
  	
  
5me	
  limit	
  to	
  1	
  minute 	
  	
  
•  Imprac5cal	
  to	
  make	
  
tasks	
  this	
  short	
  
Mo5va5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
Cluster	
  Simula5on	
  
•  High	
  Level	
  Simula5on	
  of	
  Condor	
  
– Trace	
  logs	
  from	
  a	
  twelve	
  month	
  period	
  are	
  used	
  as	
  
input	
  
•  User	
  Logins	
  /	
  Logouts	
  (computer	
  used)	
  
•  Condor	
  Job	
  Submission	
  5mes	
  (‘good’/’bad’	
  and	
  dura5on)	
  
end applications are unlikely to migrate to virtual desktops
or user owned devices due to hardware requirements and
icensing conditions, so we expect to need to maintain a
pool of hardware that will be useful for Condor for some
ime.
PUE values have been assigned at the cluster level with
values in the range of 0.9 to 1.4. These values have not been
empirically evaluated but used here to steer jobs. In most
cases the cluster rooms have a low enough computer density
not to require cooling giving these clusters a PUE value of
1.0. However, two clusters are located in rooms that require
air conditioning, giving these a PUE of 1.4. Likewise, four
clusters are based in a basement room, which is cold all
year round; hence computer heat is used to offset heating
requirements for the room, giving a PUE value of 0.9.
By default computers within the cluster will enter the
sleep state after a given interval of inactivity. This time will
depend on whether the cluster is open or not. During open
hours computers will remain in the idle state for one hour
before entering the sleep state whilst outside of these hours
he idle interval before sleep is reduced to 15 minutes. This
policy (P2) was originally trialled under Windows XP where
he time for computers to resume from the shutdown state
was considerable (sleep was an unreliable option for our
environment). Likewise the time interval before a Condor
ob could start using a computer (M1) was set to be 15
minutes during cluster opening hours and 0 minutes outside
Table I: Computer Types
Type Cores Speed Power Consumption
Active Idle Sleep
Normal 2 ⇠3Ghz 57W 40W 2W
High End 4 ⇠3Ghz 114W 67W 3W
Legacy 2 ⇠2Ghz 100-180W 50-80W 4W
Figure 3 illustrates the interactive logins for this period
showing the high degree of seasonality within the data. It
is easy to distinguish between week and weekends as well
as where the three terms lie along with the vacations. This
represents 1,229,820 interactive uses of the computers.
Figure 4 depicts the profile for the 532,467 job submis-
sions made to Condor during this period. As can be seen
the job submissions follow no clearly definable pattern. Note
that out of these submissions 131,909 were later killed by the
original Condor user. In order to simulate these killed jobs
the simulation assumes that these will be non-terminating
jobs and will keep on submitting them to resources until the
time at which the high-throughput user terminates them. The
graph is clipped on Thursday 03/06/2010 as this date had
93,000 job submissions.
For the simulations we will report on the total power
consumed (in MWh) for the period. In order to determine
the effect on high-throughput users of a policy we will also
report the average overhead observed by jobs submitted to
Condor (in seconds). Where overhead is defined to be the
amount of time in excess of the execution duration of the job.
Other statistics will be reported as appropriate for particular
!"!#$
!"#$
#$
#!$
#!!$
!"#$%&'()'*"$#+,,+(-,'."-/&%/,'
012%'
Figure 4: Condor job submission profile
Ac5ve	
  
User	
  /	
  Condor	
  
Idle	
  
Sleep	
  
WOL
Z
Z
Z
Cycle Stealing Users
Interactive Users
High-Throughput
Management
Z
Z
Z
Management policy
Policy	
  and	
  Simula5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
0 5 10 15 20 25 30
10
1
10
2
10
3
10
4
10
5
Number of retries (n)
Numberofgoodtaskskilled
N1 C1
N1 C2
N1 C3
N2 C1
N2 C2
N2 C3
N3 C1
N3 C2
N3 C3
n	
  realloca5on	
  policies	
  
•  N1(n):	
  Abandon	
  task	
  if	
  deallocated	
  n	
  5mes.	
  
•  N2(n):	
  Abandon	
  task	
  if	
  deallocated	
  n	
  5mes	
  
ignoring	
  interac5ve	
  users.	
  
•  N3(n):	
  Abandon	
  task	
  if	
  deallocated	
  n	
  5mes	
  
ignoring	
  planned	
  machine	
  reboots.	
  
•  C1:	
  Tasks	
  allocated	
  to	
  resources	
  at	
  random,	
  
favouring	
  awake	
  resources	
  
•  C2:	
  Target	
  less	
  used	
  computers	
  (longer	
  idle	
  
5mes)	
  
•  C3:	
  Tasks	
  are	
  allocated	
  to	
  computers	
  in	
  clusters	
  
with	
  least	
  amount	
  of	
  5me	
  used	
  by	
  interac5ve	
  
users	
  
0 5 10 15 20 25 30
3
3.5
4
4.5
5
5.5
6
x 10
6
Number of retries (n)
Energyconsumption(MWh)
N1 C1
N1 C2
N1 C3
N2 C1
N2 C2
N2 C3
N3 C1
N3 C2
N3 C3
0 5 10 15 20 25 30
0
2
4
6
8
10
12
x 10
5
Number of retries (n)
Overheadsonallgoodtasks
N1 C1
N1 C2
N1 C3
N2 C1
N2 C2
N2 C3
N3 C1
N3 C2
N3 C3
Policy	
  and	
  Simula5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
0 10 20 30 40 50 60 70 80 90 100
10
1
10
2
10
3
10
4
10
5
Accrued time (hours)
Numberofgoodtaskskilled
A1 C1
A1 C2
A1 C3
A2 C1
A2 C2
A2 C3
A3 C1
A3 C2
A3 C3
Accrued	
  Time	
  Policies	
  
•  Impose	
  a	
  limit	
  on	
  cumula5ve	
  
execu5on	
  5me	
  for	
  a	
  task.	
  
•  A1(t):	
  Abandon	
  if	
  accrued	
  5me	
  >	
  t	
  
and	
  task	
  deallocated.	
  
•  A2(t):	
  As	
  A1,	
  discoun5ng	
  
interac5ve	
  users..	
  
•  A3(t):	
  As	
  A1,	
  discoun5ng	
  reboots.	
   0 10 20 30 40 50 60 70 80 90 100
3.5
4
4.5
5
5.5
6
x 10
6
Accrued time (hours)
Energyconsumption(MWh)
A1 C1
A1 C2
A1 C3
A2 C1
A2 C2
A2 C3
A3 C1
A3 C2
A3 C3
0 10 20 30 40 50 60 70 80 90 100
3
4
5
6
7
8
9
10
11
12
x 10
5
Accrued time (hours)
Overheadsonallgoodtasks
A1 C1
A1 C2
A1 C3
A2 C1
A2 C2
A2 C3
A3 C1
A3 C2
A3 C3
Policy	
  and	
  Simula5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
Individual	
  Time	
  Policy	
  
•  Impose	
  a	
  limit	
  on	
  individual	
  
execu5on	
  5me	
  for	
  a	
  task.	
  
–  Nightly	
  reboots	
  bound	
  this	
  to	
  24	
  hours.	
  
–  What	
  is	
  the	
  impact	
  of	
  lowering	
  this?	
  
•  I1(t):	
  Abandon	
  if	
  individual	
  5me	
  >	
  t.	
  
0 5 10 15 20 25 30
3.5
4
4.5
5
5.5
6
x 10
6
Individual time (hours)
Energyconsumption(MWh)
I1 C1
I1 C2
I1 C3
0 5 10 15 20 25
10
0
10
1
10
2
10
3
10
4
10
5
Individual time (hours)
Numberofgoodtaskskilled
I1 C1
I1 C2
I1 C3
0 5 10 15 20 25 30
3
4
5
6
7
8
9
10
11
12
x 10
5
Individual time (hours)
Overheadsonallgoodtasks
I1 C1
I1 C2
I1 C3
Policy	
  and	
  Simula5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
20 40 60 80 100 120 140
10
6
10
7
10
8
10
9
10
10
10
11
10
12
Maximum execution duration (hours)
Energyconsumption(MWh)
m=10, n=10
m=10, n=20
m=10, n=30
m=20, n=10
m=20, n=20
m=20, n=30
m=30, n=10
m=30, n=20
m=30, n=30
m=40, n=10
m=40, n=20
m=40, n=30
Dedicated	
  Resources	
  
D1(m,d):	
  Miscreant	
  tasks	
  are	
  
permi@ed	
  to	
  con5nue	
  
execu5ng	
  on	
  a	
  dedicated	
  set	
  of	
  
m	
  resources	
  (without	
  
interac5ve	
  users	
  or	
  reboots),	
  
with	
  a	
  maximum	
  dura5on	
  d.	
  
20 40 60 80 100 120 140
0
2
4
6
8
10
12
14
16
Maximum execution duration (hours)
Numberofgoodtaskskilled
m=10, n=10
m=10, n=20
m=10, n=30
m=20, n=10
m=20, n=20
m=20, n=30
m=30, n=10
m=30, n=20
m=30, n=30
m=40, n=10
m=40, n=20
m=40, n=30
20 40 60 80 100 120 140
1.12
1.14
1.16
1.18
1.2
1.22
1.24
1.26
1.28
1.3
x 10
6
Maximum execution duration (hours)
Overheadsonallgoodtasks
m=10, n=10
m=10, n=20
m=10, n=30
m=20, n=10
m=20, n=20
m=20, n=30
m=30, n=10
m=30, n=20
m=30, n=30
m=40, n=10
m=40, n=20
m=40, n=30
Policy	
  and	
  Simula5on	
  
Mo5va5on	
   Policy	
  and	
  Simula5on	
   Conclusion	
  
Conclusion	
  
•  Simple	
  policies	
  can	
  be	
  used	
  to	
  reduce	
  the	
  effect	
  
of	
  miscreant	
  tasks	
  in	
  a	
  mul5-­‐use	
  cycle	
  stealing	
  
cluster.	
  
–  N2	
  (total	
  evic5ons	
  ignoring	
  users)	
  
•  Order	
  of	
  magnitude	
  reduc5on	
  in	
  energy	
  
consump5on	
  
–  Reduce	
  amount	
  of	
  effort	
  wasted	
  on	
  tasks	
  that	
  will	
  
never	
  complete	
  
•  Policies	
  may	
  be	
  combined	
  to	
  achieve	
  further	
  
improvements.	
  
–  Adding	
  in	
  dedicated	
  computers	
  
Conclusion	
  
Ques5ons?	
  
stephen.mcgough@ncl.ac.uk	
  
m.j.forshaw@ncl.ac.uk	
  	
  
More	
  info:	
  
McGough,	
  A	
  Stephen;	
  Forshaw,	
  Ma@hew;	
  Gerrard,	
  Clive;	
  Wheater,	
  Stuart;	
  	
  
Reducing	
  the	
  Number	
  of	
  Miscreant	
  Tasks	
  ExecuBons	
  in	
  a	
  MulB-­‐use	
  Cluster,	
  	
  
Cloud	
  and	
  Green	
  Compu5ng	
  (CGC),	
  2012	
  

Weitere ähnliche Inhalte

Andere mochten auch

Manual de office LOPEZ PEREZ
Manual de office  LOPEZ PEREZManual de office  LOPEZ PEREZ
Manual de office LOPEZ PEREZESCO
 
Google shopping guide 2.0 bidding advanced segment tracking and cpa campaigns
Google shopping guide 2.0 bidding advanced segment tracking and cpa campaignsGoogle shopping guide 2.0 bidding advanced segment tracking and cpa campaigns
Google shopping guide 2.0 bidding advanced segment tracking and cpa campaignsBullsEye Internet Marketing
 
Lec4 MECH ENG STRucture
Lec4    MECH ENG  STRuctureLec4    MECH ENG  STRucture
Lec4 MECH ENG STRuctureMohamed Yaser
 
Business Analytics Insights
Business Analytics InsightsBusiness Analytics Insights
Business Analytics InsightsChrisHoogendoorn
 
Learning Management System og jobbytelse: En kvalitativ tilnærming
Learning Management System og jobbytelse: En kvalitativ tilnærmingLearning Management System og jobbytelse: En kvalitativ tilnærming
Learning Management System og jobbytelse: En kvalitativ tilnærmingTom Erik Holteng
 
La Escuela del Aire
La Escuela del AireLa Escuela del Aire
La Escuela del AireJose Ponce
 
Research On Hong Kong Tourism Websites
Research On Hong Kong Tourism Websites Research On Hong Kong Tourism Websites
Research On Hong Kong Tourism Websites Tiffanie_Chan
 

Andere mochten auch (9)

Manual de office LOPEZ PEREZ
Manual de office  LOPEZ PEREZManual de office  LOPEZ PEREZ
Manual de office LOPEZ PEREZ
 
Google shopping guide 2.0 bidding advanced segment tracking and cpa campaigns
Google shopping guide 2.0 bidding advanced segment tracking and cpa campaignsGoogle shopping guide 2.0 bidding advanced segment tracking and cpa campaigns
Google shopping guide 2.0 bidding advanced segment tracking and cpa campaigns
 
Lec4 MECH ENG STRucture
Lec4    MECH ENG  STRuctureLec4    MECH ENG  STRucture
Lec4 MECH ENG STRucture
 
Business Analytics Insights
Business Analytics InsightsBusiness Analytics Insights
Business Analytics Insights
 
Improving Coordination in Swiss OGD
Improving Coordination in Swiss OGDImproving Coordination in Swiss OGD
Improving Coordination in Swiss OGD
 
Learning Management System og jobbytelse: En kvalitativ tilnærming
Learning Management System og jobbytelse: En kvalitativ tilnærmingLearning Management System og jobbytelse: En kvalitativ tilnærming
Learning Management System og jobbytelse: En kvalitativ tilnærming
 
Keynote Marc Goffart
Keynote Marc GoffartKeynote Marc Goffart
Keynote Marc Goffart
 
La Escuela del Aire
La Escuela del AireLa Escuela del Aire
La Escuela del Aire
 
Research On Hong Kong Tourism Websites
Research On Hong Kong Tourism Websites Research On Hong Kong Tourism Websites
Research On Hong Kong Tourism Websites
 

Ähnlich wie Cw13 0.01-final

Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xMatthew Gaudet
 
What to do when detect deadlock
What to do when detect deadlockWhat to do when detect deadlock
What to do when detect deadlockSyed Zaid Irshad
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performancePiotr Przymus
 
On the way to low latency (2nd edition)
On the way to low latency (2nd edition)On the way to low latency (2nd edition)
On the way to low latency (2nd edition)Artem Orobets
 
8 Team Backrow Final Presentation
8 Team Backrow Final Presentation8 Team Backrow Final Presentation
8 Team Backrow Final PresentationKenedyThorne
 
Identifying Optimal Trade-Offs between CPU Time Usage and Temporal Constraints
Identifying Optimal Trade-Offs between CPU Time Usage and Temporal ConstraintsIdentifying Optimal Trade-Offs between CPU Time Usage and Temporal Constraints
Identifying Optimal Trade-Offs between CPU Time Usage and Temporal ConstraintsLionel Briand
 
Real-Time Systems Intro.pptx
Real-Time Systems Intro.pptxReal-Time Systems Intro.pptx
Real-Time Systems Intro.pptx20EUEE018DEEPAKM
 
dataprocess using different technology.ppt
dataprocess using different technology.pptdataprocess using different technology.ppt
dataprocess using different technology.pptssuserf6eb9b
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueScyllaDB
 
CPU scheduling in Operating System Explanation
CPU scheduling in Operating System ExplanationCPU scheduling in Operating System Explanation
CPU scheduling in Operating System ExplanationAnitaSofiaKeyser
 
What is transaction processing
What is transaction processingWhat is transaction processing
What is transaction processingramuaa128
 
The centroid method for plant location uses which of the following data
The centroid method for plant location uses which of the following dataThe centroid method for plant location uses which of the following data
The centroid method for plant location uses which of the following dataramuaa128
 
Which of the following is an input to the master production schedule (mps)
Which of the following is an input to the master production schedule (mps)Which of the following is an input to the master production schedule (mps)
Which of the following is an input to the master production schedule (mps)ramuaa130
 
In hau lee's uncertainty framework to classify supply chains
In hau lee's uncertainty framework to classify supply chainsIn hau lee's uncertainty framework to classify supply chains
In hau lee's uncertainty framework to classify supply chainsramuaa127
 
3 process scheduling
3 process scheduling3 process scheduling
3 process schedulingahad alam
 

Ähnlich wie Cw13 0.01-final (20)

Ruby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3xRuby3x3: How are we going to measure 3x
Ruby3x3: How are we going to measure 3x
 
What to do when detect deadlock
What to do when detect deadlockWhat to do when detect deadlock
What to do when detect deadlock
 
Energy saving policies final
Energy saving policies finalEnergy saving policies final
Energy saving policies final
 
What’s eating python performance
What’s eating python performanceWhat’s eating python performance
What’s eating python performance
 
On the way to low latency (2nd edition)
On the way to low latency (2nd edition)On the way to low latency (2nd edition)
On the way to low latency (2nd edition)
 
8 Team Backrow Final Presentation
8 Team Backrow Final Presentation8 Team Backrow Final Presentation
8 Team Backrow Final Presentation
 
Identifying Optimal Trade-Offs between CPU Time Usage and Temporal Constraints
Identifying Optimal Trade-Offs between CPU Time Usage and Temporal ConstraintsIdentifying Optimal Trade-Offs between CPU Time Usage and Temporal Constraints
Identifying Optimal Trade-Offs between CPU Time Usage and Temporal Constraints
 
Section05 scheduling
Section05 schedulingSection05 scheduling
Section05 scheduling
 
Real-Time Systems Intro.pptx
Real-Time Systems Intro.pptxReal-Time Systems Intro.pptx
Real-Time Systems Intro.pptx
 
dataprocess using different technology.ppt
dataprocess using different technology.pptdataprocess using different technology.ppt
dataprocess using different technology.ppt
 
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization TechniqueSquare Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
Square Engineering's "Fail Fast, Retry Soon" Performance Optimization Technique
 
A line in apparel
A line in apparelA line in apparel
A line in apparel
 
CPU Scheduling
CPU SchedulingCPU Scheduling
CPU Scheduling
 
CPU scheduling in Operating System Explanation
CPU scheduling in Operating System ExplanationCPU scheduling in Operating System Explanation
CPU scheduling in Operating System Explanation
 
What is transaction processing
What is transaction processingWhat is transaction processing
What is transaction processing
 
The centroid method for plant location uses which of the following data
The centroid method for plant location uses which of the following dataThe centroid method for plant location uses which of the following data
The centroid method for plant location uses which of the following data
 
Which of the following is an input to the master production schedule (mps)
Which of the following is an input to the master production schedule (mps)Which of the following is an input to the master production schedule (mps)
Which of the following is an input to the master production schedule (mps)
 
In hau lee's uncertainty framework to classify supply chains
In hau lee's uncertainty framework to classify supply chainsIn hau lee's uncertainty framework to classify supply chains
In hau lee's uncertainty framework to classify supply chains
 
UNIT II - CPU SCHEDULING.docx
UNIT II - CPU SCHEDULING.docxUNIT II - CPU SCHEDULING.docx
UNIT II - CPU SCHEDULING.docx
 
3 process scheduling
3 process scheduling3 process scheduling
3 process scheduling
 

Kürzlich hochgeladen

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 

Kürzlich hochgeladen (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 

Cw13 0.01-final

  • 1. Making  HTCondor  Energy  Efficient   by  iden5fying  miscreant  jobs   Stephen  McGough,  Ma@hew  Forshaw     &  Clive  Gerrard   Newcastle  University   Stuart  Wheater   Arjuna  Technologies  Limited  
  • 2. Mo5va5on   Policy  and  Simula5on   Conclusion   Task  lifecycle   5me   HTC  user   submits     task   resource   selec5on   Interac5ve   user  logs     in   Task   evic5on   resource   selec5on   Computer   reboot   resource   selec5on   Interac5ve   user  logs     in   …   resource   selec5on   Task   evic5on   Task   evic5on   Task   comple5on   5me   …   Good   Bad   ‘Miscreant’  –   has  been   evicted  but   don’t  know  if   it’s  good  or  bad     Mo5va5on  
  • 3. Mo5va5on   Policy  and  Simula5on   Conclusion   Mo5va5on   •  We  have  run  a  high-­‐throughput  cluster  for  ~6  years   –  Allowing  many  researchers  to  perform  more  work  quicker   •  University  has  strong  desire  to  reduce  energy   consump5on  and  reduce  CO2  produc5on   –  Currently  powering  down  computer  &  buying  low  power  PCs   –  “If  a  computer  is  not  ‘working’  it  should  be  powered  down”   •  Can  we  go  further  to  reduce  wasted  energy?   –  Reduce  5me  computers  spend  running  work  which  does  not   complete   –  Prevent  re-­‐submission  of  ‘bad’  jobs   –  Reduce  the  number  of  resubmissions  for  ‘good’  jobs   •  Aims   –  Inves5gate  policy  for  reducing  energy  consump5on   –  Determine  the  impact  on  high-­‐throughput  users   Mo5va5on  
  • 4. Mo5va5on   Policy  and  Simula5on   Conclusion   Can  we  fix  the  number  of  retries?   0 200 400 600 800 1000 1200 1400 1600 1800 200 0 100 200 300 400 500 600 700 800 900 Number of evictions Cumulativewastedseconds(millions) Good Jobs Bad Jobs                 •  ~57  years  of  compu5ng  5me  during  2010   •  ~39  years  of  wasted  5me   –  ~27  years  for  ‘bad’  tasks  :  average  45  retries  :  max  1946  retries   –  ~12  years  for  ‘good’  tasks  :  average  1.38  retries  :  max  360  retries   •  100%  ‘good’  task  comple5on  -­‐>  360  retries   •  S5ll  wastes  ~13  years  on  ‘bad’  tasks   –  95%  ‘good’  task  comple5on  -­‐>  3  retries  :  9,808  good  tasks  killed  (3.32%)   –  99%  ‘good’  task  comple5on  -­‐>  6  retries  :    2,022  good  tasks  killed  (0.68%)   Mo5va5on  
  • 5. Mo5va5on   Policy  and  Simula5on   Conclusion   0 200 400 600 800 1000 1200 1400 0 200,000 400,000 600,000 800,000 1,000,000 1,200,000 1,400,000 1,600,000 1,800,000 Idle time (minutes) Cumulativefreeseconds Can  we  make  tasks  short  enough?   •  Make  tasks  short  enough  to  reduce  miscreants   •  Average  idle  interval  –  371  minutes   •  But  to  ensure  availability  of  intervals   – 95%  :  need  to  reduce     5me  limit  to  2  minutes   – 99%  :  need  to  reduce       5me  limit  to  1  minute     •  Imprac5cal  to  make   tasks  this  short   Mo5va5on  
  • 6. Mo5va5on   Policy  and  Simula5on   Conclusion   Cluster  Simula5on   •  High  Level  Simula5on  of  Condor   – Trace  logs  from  a  twelve  month  period  are  used  as   input   •  User  Logins  /  Logouts  (computer  used)   •  Condor  Job  Submission  5mes  (‘good’/’bad’  and  dura5on)   end applications are unlikely to migrate to virtual desktops or user owned devices due to hardware requirements and icensing conditions, so we expect to need to maintain a pool of hardware that will be useful for Condor for some ime. PUE values have been assigned at the cluster level with values in the range of 0.9 to 1.4. These values have not been empirically evaluated but used here to steer jobs. In most cases the cluster rooms have a low enough computer density not to require cooling giving these clusters a PUE value of 1.0. However, two clusters are located in rooms that require air conditioning, giving these a PUE of 1.4. Likewise, four clusters are based in a basement room, which is cold all year round; hence computer heat is used to offset heating requirements for the room, giving a PUE value of 0.9. By default computers within the cluster will enter the sleep state after a given interval of inactivity. This time will depend on whether the cluster is open or not. During open hours computers will remain in the idle state for one hour before entering the sleep state whilst outside of these hours he idle interval before sleep is reduced to 15 minutes. This policy (P2) was originally trialled under Windows XP where he time for computers to resume from the shutdown state was considerable (sleep was an unreliable option for our environment). Likewise the time interval before a Condor ob could start using a computer (M1) was set to be 15 minutes during cluster opening hours and 0 minutes outside Table I: Computer Types Type Cores Speed Power Consumption Active Idle Sleep Normal 2 ⇠3Ghz 57W 40W 2W High End 4 ⇠3Ghz 114W 67W 3W Legacy 2 ⇠2Ghz 100-180W 50-80W 4W Figure 3 illustrates the interactive logins for this period showing the high degree of seasonality within the data. It is easy to distinguish between week and weekends as well as where the three terms lie along with the vacations. This represents 1,229,820 interactive uses of the computers. Figure 4 depicts the profile for the 532,467 job submis- sions made to Condor during this period. As can be seen the job submissions follow no clearly definable pattern. Note that out of these submissions 131,909 were later killed by the original Condor user. In order to simulate these killed jobs the simulation assumes that these will be non-terminating jobs and will keep on submitting them to resources until the time at which the high-throughput user terminates them. The graph is clipped on Thursday 03/06/2010 as this date had 93,000 job submissions. For the simulations we will report on the total power consumed (in MWh) for the period. In order to determine the effect on high-throughput users of a policy we will also report the average overhead observed by jobs submitted to Condor (in seconds). Where overhead is defined to be the amount of time in excess of the execution duration of the job. Other statistics will be reported as appropriate for particular !"!#$ !"#$ #$ #!$ #!!$ !"#$%&'()'*"$#+,,+(-,'."-/&%/,' 012%' Figure 4: Condor job submission profile Ac5ve   User  /  Condor   Idle   Sleep   WOL Z Z Z Cycle Stealing Users Interactive Users High-Throughput Management Z Z Z Management policy Policy  and  Simula5on  
  • 7. Mo5va5on   Policy  and  Simula5on   Conclusion   0 5 10 15 20 25 30 10 1 10 2 10 3 10 4 10 5 Number of retries (n) Numberofgoodtaskskilled N1 C1 N1 C2 N1 C3 N2 C1 N2 C2 N2 C3 N3 C1 N3 C2 N3 C3 n  realloca5on  policies   •  N1(n):  Abandon  task  if  deallocated  n  5mes.   •  N2(n):  Abandon  task  if  deallocated  n  5mes   ignoring  interac5ve  users.   •  N3(n):  Abandon  task  if  deallocated  n  5mes   ignoring  planned  machine  reboots.   •  C1:  Tasks  allocated  to  resources  at  random,   favouring  awake  resources   •  C2:  Target  less  used  computers  (longer  idle   5mes)   •  C3:  Tasks  are  allocated  to  computers  in  clusters   with  least  amount  of  5me  used  by  interac5ve   users   0 5 10 15 20 25 30 3 3.5 4 4.5 5 5.5 6 x 10 6 Number of retries (n) Energyconsumption(MWh) N1 C1 N1 C2 N1 C3 N2 C1 N2 C2 N2 C3 N3 C1 N3 C2 N3 C3 0 5 10 15 20 25 30 0 2 4 6 8 10 12 x 10 5 Number of retries (n) Overheadsonallgoodtasks N1 C1 N1 C2 N1 C3 N2 C1 N2 C2 N2 C3 N3 C1 N3 C2 N3 C3 Policy  and  Simula5on  
  • 8. Mo5va5on   Policy  and  Simula5on   Conclusion   0 10 20 30 40 50 60 70 80 90 100 10 1 10 2 10 3 10 4 10 5 Accrued time (hours) Numberofgoodtaskskilled A1 C1 A1 C2 A1 C3 A2 C1 A2 C2 A2 C3 A3 C1 A3 C2 A3 C3 Accrued  Time  Policies   •  Impose  a  limit  on  cumula5ve   execu5on  5me  for  a  task.   •  A1(t):  Abandon  if  accrued  5me  >  t   and  task  deallocated.   •  A2(t):  As  A1,  discoun5ng   interac5ve  users..   •  A3(t):  As  A1,  discoun5ng  reboots.   0 10 20 30 40 50 60 70 80 90 100 3.5 4 4.5 5 5.5 6 x 10 6 Accrued time (hours) Energyconsumption(MWh) A1 C1 A1 C2 A1 C3 A2 C1 A2 C2 A2 C3 A3 C1 A3 C2 A3 C3 0 10 20 30 40 50 60 70 80 90 100 3 4 5 6 7 8 9 10 11 12 x 10 5 Accrued time (hours) Overheadsonallgoodtasks A1 C1 A1 C2 A1 C3 A2 C1 A2 C2 A2 C3 A3 C1 A3 C2 A3 C3 Policy  and  Simula5on  
  • 9. Mo5va5on   Policy  and  Simula5on   Conclusion   Individual  Time  Policy   •  Impose  a  limit  on  individual   execu5on  5me  for  a  task.   –  Nightly  reboots  bound  this  to  24  hours.   –  What  is  the  impact  of  lowering  this?   •  I1(t):  Abandon  if  individual  5me  >  t.   0 5 10 15 20 25 30 3.5 4 4.5 5 5.5 6 x 10 6 Individual time (hours) Energyconsumption(MWh) I1 C1 I1 C2 I1 C3 0 5 10 15 20 25 10 0 10 1 10 2 10 3 10 4 10 5 Individual time (hours) Numberofgoodtaskskilled I1 C1 I1 C2 I1 C3 0 5 10 15 20 25 30 3 4 5 6 7 8 9 10 11 12 x 10 5 Individual time (hours) Overheadsonallgoodtasks I1 C1 I1 C2 I1 C3 Policy  and  Simula5on  
  • 10. Mo5va5on   Policy  and  Simula5on   Conclusion   20 40 60 80 100 120 140 10 6 10 7 10 8 10 9 10 10 10 11 10 12 Maximum execution duration (hours) Energyconsumption(MWh) m=10, n=10 m=10, n=20 m=10, n=30 m=20, n=10 m=20, n=20 m=20, n=30 m=30, n=10 m=30, n=20 m=30, n=30 m=40, n=10 m=40, n=20 m=40, n=30 Dedicated  Resources   D1(m,d):  Miscreant  tasks  are   permi@ed  to  con5nue   execu5ng  on  a  dedicated  set  of   m  resources  (without   interac5ve  users  or  reboots),   with  a  maximum  dura5on  d.   20 40 60 80 100 120 140 0 2 4 6 8 10 12 14 16 Maximum execution duration (hours) Numberofgoodtaskskilled m=10, n=10 m=10, n=20 m=10, n=30 m=20, n=10 m=20, n=20 m=20, n=30 m=30, n=10 m=30, n=20 m=30, n=30 m=40, n=10 m=40, n=20 m=40, n=30 20 40 60 80 100 120 140 1.12 1.14 1.16 1.18 1.2 1.22 1.24 1.26 1.28 1.3 x 10 6 Maximum execution duration (hours) Overheadsonallgoodtasks m=10, n=10 m=10, n=20 m=10, n=30 m=20, n=10 m=20, n=20 m=20, n=30 m=30, n=10 m=30, n=20 m=30, n=30 m=40, n=10 m=40, n=20 m=40, n=30 Policy  and  Simula5on  
  • 11. Mo5va5on   Policy  and  Simula5on   Conclusion   Conclusion   •  Simple  policies  can  be  used  to  reduce  the  effect   of  miscreant  tasks  in  a  mul5-­‐use  cycle  stealing   cluster.   –  N2  (total  evic5ons  ignoring  users)   •  Order  of  magnitude  reduc5on  in  energy   consump5on   –  Reduce  amount  of  effort  wasted  on  tasks  that  will   never  complete   •  Policies  may  be  combined  to  achieve  further   improvements.   –  Adding  in  dedicated  computers   Conclusion  
  • 12. Ques5ons?   stephen.mcgough@ncl.ac.uk   m.j.forshaw@ncl.ac.uk     More  info:   McGough,  A  Stephen;  Forshaw,  Ma@hew;  Gerrard,  Clive;  Wheater,  Stuart;     Reducing  the  Number  of  Miscreant  Tasks  ExecuBons  in  a  MulB-­‐use  Cluster,     Cloud  and  Green  Compu5ng  (CGC),  2012