The document discusses network-aware data management for large-scale distributed applications. It provides an outline for a presentation on this topic, including the performance of VSAN and VVOL storage in virtualized environments, the PetaShare distributed storage system and Stork data scheduler, data streaming in high-bandwidth networks, and several other related topics such as network reservations and scheduling. The presenter's background and experience working on data transfer scheduling, distributed storage, and high-performance computing networks are also briefly summarized.
Network-aware Data Management for Large Scale Distributed Applications, IBM Research-Almaden, San Jose, CA – June 24, 2015
1. Network-aware Data Management for Large-scale Distributed Applications
June 24, 2015
Mehmet Balman
http://balman.info
Senior Performance Engineer at VMware Inc.
Guest/Affiliate at Berkeley Lab
2. About me:
• 2013: Performance, Central Engineering, VMware, Palo Alto, CA
• 2009: Computational Research Division (CRD) at Lawrence Berkeley National Laboratory (LBNL)
• 2005: Center for Computation & Technology (CCT), Baton Rouge, LA
• Computer Science, Louisiana State University (2010, 2008)
• Bogazici University, Istanbul, Turkey (2006, 2000)
• Data Transfer Scheduling with Advance Reservation and Provisioning, Ph.D.
• Failure-Awareness and Dynamic Adaptation in Data Scheduling, M.S.
• Parallel Tetrahedral Mesh Refinement, M.S.
3. Why Network-aware?
Networking is one of the major components in many of the solutions today:
• Distributed data and compute resources
• Collaboration: data to be shared between remote sites
• Data centers are complex network infrastructures
✓ What further steps are necessary to take full advantage of future networking infrastructure?
✓ How are we going to deal with performance problems?
✓ How can we enhance data management services and make them network-aware?
New collaborations between data management and networking communities.
4. Two major players:
• Abstraction and Programmability
• Rapid Development, Intelligent services
• Orchestrating compute, storage, and network resources together
• Integration and deployment of complex workflows
• Virtualization (+containers)
• Distributed storage (storage wars)
• Open Source (if you can't fix it, you don't own it)
• Performance Gap:
• Limitation in current system software vs foreseen speed:
• Hardware is fast, software is slow
• Latency vs throughput mismatch will lead to new innovations
5. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler: Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
7. VSAN performance work in a nutshell
(Observer image: blog.vmware.com)
• Every write operation needs to go over the network (and the network is not free)
• Each layer (cache, disk, object management, etc.) needs resources (CPU, memory)
• Resource limitations vs latency effect
• Needs to support thousands of VMs
Placement of objects:
• Which host?
• Which disk/SSD in the host?
What if there are failures and migrations, and if we need to rebalance?
8. VVOL: virtual volumes
(VVOL image: blog.vmware.com)
Offloading control operations to the storage array:
• powerOn
• powerOff
• delete
• clone
9. VVOL performance work
• Effect of the latency in the control path
• Linked clone vs VVOL clones
(Diagram: vSphere host, VASA VP, storage array; data path vs control path)
• Optimize service latencies
• Batching (disklib)
• Use concurrent operations
10. PetaShare + Stork Data Scheduler
Aggregation in the data path: advance buffer cache in the Petafs and Petashell clients, aggregating I/O requests to minimize the number of network messages.
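As a rough illustration of that aggregation idea (a minimal sketch, not the Petafs/Petashell code; `AggregatingWriter`, `send_message`, and the flush threshold are illustrative names):

```python
# Minimal sketch of coalescing small, contiguous write requests into one
# network message, so many client writes turn into few messages on the wire.

class AggregatingWriter:
    def __init__(self, send_message, flush_threshold=1 << 20):
        self.send_message = send_message        # callable that ships one message
        self.flush_threshold = flush_threshold  # coalesce up to ~1 MB per message
        self.pending = []                       # list of (offset, bytes) chunks
        self.pending_bytes = 0

    def write(self, offset, data):
        # Merge with the previous chunk when the write is contiguous.
        if self.pending and self.pending[-1][0] + len(self.pending[-1][1]) == offset:
            last_off, last_buf = self.pending[-1]
            self.pending[-1] = (last_off, last_buf + data)
        else:
            self.pending.append((offset, data))
        self.pending_bytes += len(data)
        if self.pending_bytes >= self.flush_threshold:
            self.flush()

    def flush(self):
        # One message carries many aggregated I/O requests.
        if self.pending:
            self.send_message(self.pending)
            self.pending, self.pending_bytes = [], 0
```

With a buffer like this, a client issuing thousands of 4 KB writes produces only a handful of larger messages, which is the effect the slide describes.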
11. Adaptive Tuning + Advanced Buffer
• Adaptive tuning for bulk transfer
• Buffer cache for remote I/O
12. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler: Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
13. 100Gbps networking has finally arrived!
Applications' perspective:
Increasing the bandwidth is not sufficient by itself; we need careful evaluation of high-bandwidth networks from the applications' perspective.
1Gbps to 10Gbps transition (10 years ago): applications did not run 10 times faster simply because more bandwidth was available.
14. ANI 100Gbps Demo
• 100Gbps demo by ESnet and Internet2
• Application design issues and host tuning strategies to scale to 100Gbps rates
• Visualization of remotely located data (cosmology)
• Data movement of large datasets with many files (climate analysis)
15. Earth System Grid Federation (ESGF)
• Over 2,700 sites
• 25,000 users
• IPCC Fifth Assessment Report (AR5): 2PB
• IPCC Fourth Assessment Report (AR4): 35TB
• Remote data analysis
• Bulk data movement
17. The lots-of-small-files problem: file-centric tools?
(Diagram: per-file FTP/RPC exchanges — "request a file / send file" repeated for every file — versus a single "request data / send data" exchange)
• Keep the network pipe full
• We want out-of-order and asynchronous send/receive (see the sketch below)
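To contrast with the per-file request/response pattern above, here is a minimal asyncio sketch of keeping many block requests in flight so that responses can complete out of order; `fetch_block` and the concurrency limit are illustrative, not part of any tool named in the deck:

```python
# Sketch of keeping the network pipe full by issuing many asynchronous,
# out-of-order block requests instead of one file-at-a-time request/response.

import asyncio

async def fetch_block(block_id):
    await asyncio.sleep(0.01)        # stands in for one network round trip
    return block_id, b"x" * 4096     # (id, payload); arrival order may vary

async def transfer(num_blocks, max_in_flight=64):
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(block_id):
        async with sem:
            return await fetch_block(block_id)

    tasks = [asyncio.create_task(bounded(i)) for i in range(num_blocks)]
    for finished in asyncio.as_completed(tasks):   # completion order, not request order
        block_id, payload = await finished
        # reassemble by block_id; ordering is restored at write time, not on the wire

asyncio.run(transfer(1024))
```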
18. Many Concurrent Streams
(a) Total throughput vs. the number of concurrent memory-to-memory transfers; (b) interface traffic, packets per second (blue) and bytes per second, over a single NIC with different numbers of concurrent transfers. Three hosts, each with 4 available NICs, and a total of 10 10Gbps NIC pairs, were used to saturate the 100Gbps pipe in the ANI Testbed. 10 data movement jobs, each corresponding to a NIC pair, started simultaneously at source and destination. Each peak represents a different test; 1, 2, 4, 8, 16, 32, and 64 concurrent streams per job were initiated for 5-minute intervals (e.g., when the concurrency level is 4, there are 40 streams in total).
19. Effects of many concurrent streams
ANI Testbed, 100Gbps (10x10 NICs, three hosts): interrupts/CPU vs. the number of concurrent transfers [1, 2, 4, 8, 16, 32, 64 concurrent jobs, 5-minute intervals]; TCP buffer size is 50M.
20. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'13
(Figure: Sandy Bridge architecture; placement of the receive process)
21. Analysis of Core Affinities (NUMA Effect)
Nathan Hanford et al., NDM'14
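The NDM measurements above show that throughput depends on which core the receive process runs on relative to the NIC's NUMA node. A minimal, Linux-only sketch of pinning a receive process to NIC-local cores is below; the core list is an assumption (in practice it would come from /sys/class/net/&lt;nic&gt;/device/numa_node and that node's cpulist):

```python
# Linux-only sketch of pinning this process to cores on the NIC's NUMA node.

import os

NIC_LOCAL_CORES = {0, 2, 4, 6}   # hypothetical: cores on the NIC's NUMA node

os.sched_setaffinity(0, NIC_LOCAL_CORES)   # 0 = this process
print("running on cores:", sorted(os.sched_getaffinity(0)))
```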
25. Advantages
• Decoupling I/O and network operations
  • front-end (I/O processing)
  • back-end (networking layer)
• Not limited by the characteristics of the file sizes
  • on-the-fly tar approach, bundling and sending many files together
• Dynamic data channel management
Can increase/decrease the parallelism level both in the network communication and in the I/O read/write operations, without closing and reopening the data channel connection (as is done in regular FTP variants).
MemzNet is not file-centric. Bookkeeping information is embedded inside each block.
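A rough sketch of this decoupling idea follows (not the MemzNet code; `Block`, `front_end`, `back_end`, and `send_block` are illustrative names). Front-end threads fill tagged blocks, back-end threads ship whatever blocks are ready, and either side's thread count can change without touching the data channel:

```python
# Sketch of decoupling I/O from networking through a shared queue of tagged
# blocks; the (file_id, offset) bookkeeping travels with each block, so the
# stream does not have to be file-ordered.

import queue
from dataclasses import dataclass

@dataclass
class Block:
    file_id: int
    offset: int
    payload: bytes

block_queue: "queue.Queue[Block]" = queue.Queue(maxsize=256)

def front_end(file_id, path, block_size=4 << 20):
    # I/O-processing side: read a file into blocks; run one or more of these threads.
    with open(path, "rb") as f:
        offset = 0
        while chunk := f.read(block_size):
            block_queue.put(Block(file_id, offset, chunk))
            offset += len(chunk)

def back_end(send_block):
    # Networking side: drain blocks onto the wire; run one thread per TCP stream.
    while True:
        block = block_queue.get()
        send_block(block)          # e.g. write header + payload to a socket
        block_queue.task_done()
```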
27. 100Gbps Demo
• CMIP3 data (35TB) from the GPFS filesystem at NERSC
• Block size 4MB
• Each block's data section was aligned according to the system page size
• 1GB cache both at the client and the server
• At NERSC, 8 front-end threads on each host for reading data files in parallel
• At ANL/ORNL, 4 front-end threads for processing received data blocks
• 4 parallel TCP streams (four back-end threads) were used for each host-to-host connection
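As a small aside on the page-size alignment mentioned above: anonymous mmap regions start on a page boundary, so a minimal sketch of allocating page-aligned 4 MB block buffers could look like this (illustrative only, not the demo's allocator):

```python
# Sketch of allocating a page-aligned 4 MB block buffer via anonymous mmap.

import mmap

BLOCK_SIZE = 4 * 1024 * 1024
assert BLOCK_SIZE % mmap.PAGESIZE == 0   # 4 MB is a multiple of the page size

buf = mmap.mmap(-1, BLOCK_SIZE)          # page-aligned, zero-filled block buffer
```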
28. MemzNet's Performance
TCP buffer size is set to 50MB.
(Throughput plots: MemzNet vs. GridFTP, in the 100Gbps demo and on the ANI Testbed)
29. Challenge?
• High bandwidth brings new challenges!
• We need a substantial amount of processing power and the involvement of multiple cores to fill a 40Gbps or 100Gbps network
• Fine-tuning, both in the network and application layers, to take advantage of the higher network capacity
• Incremental improvement in current tools?
• We cannot expect every application to tune and improve every time we change the link technology or speed
30. MemzNet
• MemzNet: Memory-mapped Network Channel
• High-performance data movement
MemzNet is an initial effort to put a new layer between the application and the transport layer.
• The main goal is to define a network channel so applications can directly use it without the burden of managing/tuning the network communication.
Tech report: LBNL-6177E
31. MemzNet = New Execution Model
• Luigi Rizzo's netmap: proposes a new API to send/receive data over the network
• RDMA programming model: MemzNet as a memory-management component
• IX: Data Plane OS (Adam Belay et al. @ Stanford; similar to MemzNet's model)
• mTCP (event-based; replaces send/receive at user level)
• Tanenbaum et al., minimizing context switches: proposing to use MONITOR/MWAIT for synchronization
32. Outline
• VSAN + VVOL Storage Performance in Virtualized Environments
• PetaShare Distributed Storage + Stork Data Scheduler: Adaptive Tuning + Advanced Buffers
• Data Streaming in High-bandwidth Networks
  • Climate100: Advanced Networking Initiative (ANI) and 100Gbps Demo
  • MemzNet: Memory-Mapped Network Zero-copy Channels
  • Core Affinity and End System Tuning in High-Throughput Flows
• Network Reservation and Online Scheduling (QoS)
  • FlexRes: A Flexible Network Reservation Algorithm
  • SchedSim: Online Scheduling with Advance Provisioning
33. Problem Domain: ESnet's OSCARS
(Map: ESnet backbone hubs — Sunnyvale, Sacramento, Seattle, Boise, Denver, Albuquerque, El Paso, Houston, Kansas City, Chicago, Nashville, Atlanta, Washington DC, New York, Boston — connecting DOE sites such as LBNL, SLAC, PNNL, AMES, FNAL, ANL, ORNL, JLAB, PPPL, BNL, with peerings to US and international R&E networks, CERN/USLHCNet, GÉANT, CANARIE, GLORIAD, AARnet, CLARA/CUDI, SINET, and other Asia-Pacific networks)
• Connecting experimental facilities and supercomputing centers
• On-Demand Secure Circuits and Advance Reservation System (OSCARS)
• Guaranteed bandwidth between collaborating institutions by delivering network-as-a-service
• Co-allocation of storage and network resources (SRM: Storage Resource Manager)
OSCARS provides yes/no answers to a reservation request for (bandwidth, start_time, end_time).
End-to-end reservation: storage + network
34. Reservation Request
• Between edge routers
Need to ensure availability of the requested bandwidth from source to destination for the requested time interval:
• R = {n_source, n_destination, M_bandwidth, t_start, t_end}
  • source/destination end-points
  • requested bandwidth
  • start/end times
Committed reservations between t_start and t_end are examined. The shortest path from source to destination is calculated based on the engineering metric on each link, and a bandwidth-guaranteed path is set up to commit and eventually complete the reservation request for the given time period.
35. Reservation
• Components (graph):
  • node (router), port, link (connecting two ports)
  • engineering metric (~latency)
  • maximum bandwidth (capacity)
• Reservation:
  • source, destination, path, time
  • Reservation 1: (time t1, t3) A -> B -> D (900Mbps)
  • Reservation 2: (time t2, t3) A -> C -> D (400Mbps)
  • Reservation 3: (time t4, t5) A -> B -> D (800Mbps)
(Figure: example topology with nodes A, B, C, D and link capacities 900, 1000, 800, 500, 300 Mbps; timeline of Reservations 1-3 over t1-t5)
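A minimal sketch of this reservation model in code follows; the names are illustrative, and the exact assignment of the listed capacities to individual links is an assumption made for the example (the deck's figure is the authority):

```python
# Sketch of the reservation model: link capacities, committed reservations with
# (path, bandwidth, start, end), and the residual bandwidth left on a link
# during a requested time interval.

from dataclasses import dataclass

@dataclass(frozen=True)
class Reservation:
    path: tuple          # e.g. ("A", "B", "D")
    bandwidth: float     # Mbps
    start: float
    end: float

# Capacities follow the example topology; per-link assignment is assumed.
capacity = {("A", "B"): 900, ("B", "D"): 1000, ("A", "C"): 800,
            ("C", "D"): 500, ("B", "C"): 300}           # Mbps

def residual(link, t_start, t_end, reservations):
    """Capacity minus everything committed on `link` that overlaps [t_start, t_end)."""
    used = sum(r.bandwidth
               for r in reservations
               if link in zip(r.path, r.path[1:]) and r.start < t_end and t_start < r.end)
    return capacity[link] - used
```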
36. Example
Request (time t1, t2): A to D (600Mbps)? NO. A to D (500Mbps)? YES.
Active reservations:
• Reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• Reservation 2: (time t1, t3) A -> C -> D (400Mbps)
• Reservation 3: (time t4, t5) A -> B -> D (800Mbps)
(Figure: each link labelled available / reserved (capacity): 0 / 900 (900), 100 / 900 (1000), 800 / 0 (800), 500 / 0 (500), 300 / 0 (300) Mbps)
37. Example
Request (time t1, t3): A to D (500Mbps)? NO. A to C (500Mbps)? NO (not max-flow!).
Active reservations:
• Reservation 1: (time t1, t3) A -> B -> D (900Mbps)
• Reservation 2: (time t1, t3) A -> C -> D (400Mbps)
• Reservation 3: (time t4, t5) A -> B -> D (800Mbps)
(Figure: each link labelled available / reserved (capacity): 0 / 900 (900), 100 / 900 (1000), 400 / 400 (800), 100 / 400 (500), 300 / 0 (300) Mbps)
38. Alternative Approach: Flexible Reservations
• If the requested bandwidth cannot be guaranteed:
  • trial and error until an available reservation is found
  • the client is not given other possible options
• How can we enhance the OSCARS reservation system?
• Be flexible:
  • submit constraints, and the system suggests possible reservation options satisfying the given requirements
R' = {n_source, n_destination, M_MAXbandwidth, D_dataSize, t_EarliestStart, t_LatestEnd}
The reservation engine finds the reservation R = {n_source, n_destination, M_bandwidth, t_start, t_end} for the earliest completion or for the shortest duration, where M_bandwidth ≤ M_MAXbandwidth and t_EarliestStart ≤ t_start < t_end ≤ t_LatestEnd.
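A small sketch of the flexible request R' and the feasibility test for a concrete candidate R, directly following the constraints above (names are illustrative):

```python
# Sketch of the flexible request R' and the feasibility check for a candidate
# reservation (bandwidth, t_start, t_end) proposed by the engine.

from dataclasses import dataclass

@dataclass
class FlexibleRequest:
    source: str
    destination: str
    max_bandwidth: float    # M_MAXbandwidth
    data_size: float        # D_dataSize, in Mb so that size / bandwidth gives time
    earliest_start: float
    latest_end: float

def feasible(req, bandwidth, t_start, t_end):
    return (bandwidth <= req.max_bandwidth
            and req.earliest_start <= t_start < t_end <= req.latest_end
            and bandwidth * (t_end - t_start) >= req.data_size)
```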
39. Bandwidth Allocation (time-dependent)
Modified Dijkstra's algorithm (max available bandwidth):
• Bottleneck constraint (not additive)
• (additive QoS constraints, as in shortest-path computations, behave differently)
The maximum bandwidth available for allocation from a source node to a destination node.
(Figure: available bandwidth over time steps t1-t6)
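A minimal sketch of that "modified Dijkstra" idea, i.e., the widest (maximum bottleneck bandwidth) path; this is a generic widest-path search under the assumptions above, not the OSCARS/FlexRes implementation:

```python
# Widest-path search: relax with min(path bottleneck, edge bandwidth) and always
# expand the node with the largest bottleneck so far (max-heap via negated keys).

import heapq

def widest_path(adj, src, dst):
    """adj: {node: {neighbor: available_bandwidth}}; returns (bandwidth, path)."""
    best = {src: float("inf")}
    prev = {}
    heap = [(-float("inf"), src)]
    while heap:
        neg_bw, u = heapq.heappop(heap)
        bw = -neg_bw
        if u == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return bw, path[::-1]
        if bw < best.get(u, 0):
            continue                              # stale heap entry
        for v, edge_bw in adj.get(u, {}).items():
            cand = min(bw, edge_bw)               # bottleneck, not additive
            if cand > best.get(v, 0):
                best[v] = cand
                prev[v] = u
                heapq.heappush(heap, (-cand, v))
    return 0, []
```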
40. Analogous Example
• A vehicle travelling from city A to city B
• There are multiple cities between A and B, connected with separate highways.
• Each highway has a specific speed limit (maximum bandwidth)
• But we need to reduce our speed if there is a high traffic load on the road
• We know the load on each highway for every time period (active reservations)
• The first question is which path the vehicle should follow in order to reach city B from city A as early as possible (earliest completion)
• Or, we can delay our journey and start later if the total travel time would be reduced. The second question is to find the route, along with the starting time, for the shortest travel duration (shortest duration)
Advance bandwidth reservation: we have to set the speed limit before starting and cannot change it during the journey.
41. Time steps
• Time steps between t1 and t13
(Timeline figure: Reservations 1-3 over t1-t13; their start/end boundaries t1, t4, t6, t7, t9, t12, t13 cut the interval into time steps ts1-ts4, labelled Res 1; Res 1,2; Res 2; Res 3)
At most (2r + 1) time steps, where r is the number of reservations.
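A sketch of how those time steps can be derived: every committed reservation contributes at most a start and an end boundary inside the search interval, which gives the (2r + 1) bound. The `reservations` objects are assumed to carry `start`/`end` attributes as in the earlier sketch:

```python
# Derive time steps from reservation boundaries within the search interval.

def time_steps(search_start, search_end, reservations):
    points = {search_start, search_end}
    for r in reservations:
        if search_start < r.start < search_end:
            points.add(r.start)
        if search_start < r.end < search_end:
            points.add(r.end)
    edges = sorted(points)
    return list(zip(edges, edges[1:]))    # consecutive (step_start, step_end) pairs
```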
42. Static Graphs
One static graph per time step, holding the bandwidth still available on each link during that step (figure shows four copies of the A/B/C/D topology):
• G(ts1), t1-t4 (Res 1): 0, 100, 800, 500, 300 Mbps
• G(ts2), t4-t6 (Res 1, 2): 0, 100, 400, 100, 300 Mbps
• G(ts3), t6-t7 (Res 2): 900, 1000, 400, 100, 300 Mbps
• G(ts4), t7-t9 (no active reservation): 900, 1000, 800, 500, 300 Mbps
43. Time Windows
A time window combines consecutive time steps; a link's availability over the window is the bottleneck (minimum) across the combined steps:
• G(tw) = G(ts1) x G(ts2) for tw = ts1 + ts2: 0, 100, 400, 100, 300 Mbps
• G(tw) = G(ts3) x G(ts4) for tw = ts3 + ts4: 900, 1000, 400, 100, 300 Mbps
At most (s x (s + 1)) / 2 time windows, where s is the number of time steps.
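A sketch of that bottleneck composition of per-step graphs into a window graph (the dict layout and link keys are illustrative):

```python
# Compose a time-window graph from the static per-step graphs: a link's
# available bandwidth over the window is the minimum across the spanned steps.

def window_graph(step_graphs):
    """step_graphs: list of {link: available_bandwidth} dicts for consecutive steps."""
    links = set().union(*step_graphs)
    return {link: min(g.get(link, 0) for g in step_graphs) for link in links}
```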
44. Time Window List (special data structures)
Time-windows list: [now, infinite]
• new reservation: reservation 1, start t1, end t10 → [now, t1, t10 (Res 1), infinite]
• new reservation: reservation 2, start t12, end t20 → [now, t1, t10 (Res 1), t12, t20 (Res 2), infinite]
Careful software design makes the implementation fast and efficient.
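A minimal sketch of such a time-window list (illustrative only): an ordered list of boundaries from `now` to infinity, where committing a reservation simply inserts its start and end boundaries so later searches over time steps stay cheap:

```python
# Ordered boundary list from `now` to infinity; committing a reservation
# inserts its start/end boundaries, splitting the affected window.

import bisect

class TimeWindowList:
    def __init__(self, now):
        self.boundaries = [now, float("inf")]

    def commit(self, start, end):
        for t in (start, end):
            i = bisect.bisect_left(self.boundaries, t)
            if i == len(self.boundaries) or self.boundaries[i] != t:
                self.boundaries.insert(i, t)

    def steps(self):
        return list(zip(self.boundaries, self.boundaries[1:]))
```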
45. Performance
• max-bandwidth path ~ O(n^2), where n is the number of nodes in the topology graph
• In the worst case, we may need to search all time windows: (s x (s + 1)) / 2, where s is the number of time steps
• If there are r committed reservations in the search period, there can be a maximum of 2r + 1 different time steps in the worst case
• Overall, the worst-case complexity is bounded by O(r^2 n^2)
• Note: r is typically very small compared to the number of nodes n
46. Example
• Reservation 1: (time t1, t6) A -> B -> D (900Mbps)
• Reservation 2: (time t4, t7) A -> C -> D (400Mbps)
• Reservation 3: (time t9, t12) A -> B -> D (700Mbps)
(Figure: topology A, B, C, D with link capacities 900, 1000, 800, 500, 300 Mbps; timeline t1-t13 showing Reservations 1-3)
Request: from A to D (earliest completion), max bandwidth = 200Mbps, volume = 200Mbps x 4 time slots, earliest start = t1, latest finish = t13.
47. Search Order - Time Windows
(Timeline: Res 1; Res 1,2; Res 2; Res 3 over boundaries t1, t4, t6, t7, t9, t12, t13)
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9
Max bandwidth from A to D, in search order:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Resulting reservation: (A to D) 100Mbps, start = t1, end = t9
48. Search Order - Time Windows
Shortest duration?
(Timeline: Res 1; Res 1,2; Res 2; Res 3 over boundaries t1, t4, t6, t7, t9, t12, t13)
Candidate time windows: t9-t13, t12-t13, t9-t12
Max bandwidth from A to D, in search order:
1. 200Mbps (3)
2. 900Mbps (1)
3. 200Mbps (4)
Resulting reservation: (A to D) 200Mbps, start = t9, end = t13
• From A to D: max bandwidth = 200Mbps, volume = 175Mbps x 4 time slots, earliest start = t1, latest finish = t13
• earliest completion: (A to D) 100Mbps, start = t1, end = t8
• shortest duration: (A to D) 200Mbps, start = t9, end = t12.5
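Putting the earlier sketches together, a hedged sketch of the earliest-completion search over candidate windows; the window enumeration order and helper names (`widest_path`, `window_graph`-style `graph_for_window`, `FlexibleRequest` fields) come from the previous sketches and are assumptions, not the FlexRes code:

```python
# Walk candidate time windows in search order, compute the max-bandwidth path
# for each window, and accept the first window that can carry the requested
# volume (earliest completion).

def earliest_completion(req, windows, graph_for_window):
    """windows: [(t_start, t_end), ...] in search order;
    graph_for_window: callable returning the adjacency map for a window."""
    for t_start, t_end in windows:
        bw, path = widest_path(graph_for_window(t_start, t_end),
                               req.source, req.destination)
        bw = min(bw, req.max_bandwidth)
        if bw > 0 and bw * (t_end - t_start) >= req.data_size:
            return path, bw, t_start, t_end
    return None
```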
49. Source > Network > Destination
(Figure: topology A, B, C, D with link capacities 900, 1000, 800, 500, 300 Mbps; end hosts n1 and n2 attached)
Now we have multiple requests.
50. With start/end times
• Each transfer request has start and end times
• n transfer requests are given (each request has a specific amount of profit)
• The objective is to maximize the profit
• If the profit is the same for each request, then the objective is to maximize the number of jobs in a given time period
• Unsplittable Flow Problem (UFP):
  • an undirected graph,
  • route demand from source(s) to destination(s) and maximize/minimize the total profit/cost
The online scheduling method here is inspired by the Gale-Shapley algorithm (also known as the stable marriage problem).
51. Methodology
• Displace other jobs to open space for the new request
  • we can shift at most n jobs
• Never accept a job if it causes other committed jobs to break their criteria
• Planning ahead (gives opportunity for co-allocation)
• Gives a polynomial approximation algorithm
• The preference converts the UFP problem into a Dijkstra path search
• Utilizes time windows/time steps for ranking (better than earliest deadline first)
  • earliest completion + shortest duration
  • minimize concurrency
• Even random ranking would work (relaxation in an NP-hard problem)
A sketch of this admission loop is given below.
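A rough, illustrative sketch of the admission-with-displacement idea (not the SchedSim implementation): try to place the new request directly; if that fails, tentatively re-place a bounded number of committed but flexible jobs within their own constraints and retry, rolling back unless everyone still fits. `place` stands for an earliest-completion style search such as the one sketched earlier:

```python
# Online admission with bounded displacement; place(req, existing) is assumed
# to return a concrete reservation that respects req's constraints, or None.

def admit(new_req, committed, place, max_shift=2):
    direct = place(new_req, committed)
    if direct:
        committed.append(direct)
        return True
    # Try displacing one committed job at a time (bounded effort).
    for victim in list(committed)[:max_shift]:
        others = [r for r in committed if r is not victim]
        placed_new = place(new_req, others)
        moved_victim = placed_new and place(victim, others + [placed_new])
        if placed_new and moved_victim:          # nobody breaks their own criteria
            committed.remove(victim)
            committed.extend([placed_new, moved_victim])
            return True
    return False                                 # reject rather than break commitments
```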
53. Recall Time Windows
(Timeline: Res 1; Res 1,2; Res 2; Res 3 over boundaries t1, t4, t6, t7, t9, t12, t13)
Candidate time windows: t1-t6, t4-t6, t1-t4, t6-t7, t4-t7, t1-t7, t7-t9, t6-t9, t4-t9, t1-t9
Max bandwidth from A to D, in search order:
1. 900Mbps (3)
2. 100Mbps (2)
3. 100Mbps (5)
4. 900Mbps (1)
5. 100Mbps (3)
6. 100Mbps (6)
7. 900Mbps (2)
8. 900Mbps (3)
9. 100Mbps (5)
10. 100Mbps (8)
Reservation: (A to D) 100Mbps, start = t1, end = t9
54. Test
In real life, the number of nodes and the number of reservations in a given search interval are limited.
See the AINA'13 paper for results and a comparison with different preference metrics.
55. Autonomic Provisioning System
• Generate constraints automatically (without user input)
  • volume (elephant flow?)
  • true deadline, if applicable
  • end-host resource availability
  • burst rate (fixed bandwidth, variable bandwidth)
• Update constraints according to feedback and monitoring
• Minimize operational cost
• Alternative to manual traffic engineering
What is the incentive to make correct reservations?
56. Wide-area SDN
(Diagram: Experimental facility A, Data Center 1, Data Center 2, and data node B (web access), connected over a wide-area SDN)
• (1) Experimental facility A generates 30T of data every day, and it needs to be stored in data center 2 before the next run, since local disk space is limited
• (2) There is a reservation made between data centers 1 and 2. It is used to replicate data files, 1P total size, when new data is available in data center 2
• (3) New results are published at data node B; we expect high traffic to download new simulation files for the next couple of months
57. Example
• The experimental facility periodically transfers data (i.e., every night)
• Data replication happens occasionally, and it will take a week to move 1P of data. It could get delayed a couple of hours with no harm
• Wide-area download traffic will increase gradually; most of the traffic will be during the day
• We can dynamically increase the preference for download traffic in the mornings, give high priority to transferring data from the facility at night, and use the rest of the bandwidth for data replication (and allocate some bandwidth to confirm that it would finish within a week as usual)
58. Virtual Circuit Reservation Engine
Autonomic provisioning system + monitoring
Reservation Engine:
– Select the optimal path/time/bandwidth
– Maximize the number of admitted requests
– Increase overall system utilization and network efficiency
– Dynamically update the selected routing path for network efficiency
– Modify existing reservations dynamically to open space/time for new requests
59. THANK YOU
Any questions/comments?
Mehmet Balman
mbalman@lbl.gov
http://balman.info