Eserver pSeries
© 2003 IBM Corporation
"Any sufficiently advanced technology will
have the appearance of magic."
…Arthur C. Clarke
Section 2: The Technology
Section Objectives
 On completion of this unit, you should be able to:
– Describe the relationship between technology and solutions.
– List key IBM technologies that are part of the POWER5 products.
– Describe the functional benefits that these technologies provide.
– Discuss the appropriate use of these technologies.
IBM and Technology
[Figure: Science → Technology → Products → Solutions]
Technology and innovation
 Having technology available is a necessary first
step.
 Finding creative new ways to use the technology
for the benefit of our clients is what innovation is
about.
 Solution design is an opportunity for innovative
application of technology.
When technology won’t ‘fix’ the problem
 When the technology is not related to the problem.
 When the client has unreasonable expectations.
POWER5 Technology
POWER4 and POWER5 Cores
[Figure: POWER4 core vs. POWER5 core die photos]
POWER5
 Designed for entry and high-
end servers
 Enhanced memory subsystem
 Improved performance
 Simultaneous Multi-Threading
 Hardware support for Shared
Processor Partitions (Micro-
Partitioning)
 Dynamic power management
 Compatibility with existing
POWER4 systems
 Enhanced reliability,
availability, serviceability
[Figure: POWER5 chip layout: two SMT cores, 1.9 MB L2 cache, on-chip L3 directory and memory controller, enhanced distributed switch, GX+ bus, and chip-chip/MCM-MCM/SMP link interconnect.]
Enhanced memory subsystem
 Improved L1 cache design
– 2-way set associative i-cache
– 4-way set associative d-cache
– New replacement algorithm (LRU replaces POWER4's FIFO)
 Larger L2 cache
– 1.9 MB, 10-way set associative
 Improved L3 cache design
– 36 MB, 12-way set associative
– L3 on the processor side of the fabric
– Satisfies L2 cache misses more frequently
– Avoids traffic on the interchip fabric
 On-chip L3 directory and memory controller
– L3 directory on the chip reduces off-chip delays
after an L2 miss
– Reduced memory latencies
 Improved pre-fetch algorithms
[Figure: POWER5 chip layout, as shown earlier.]
Enhanced memory subsystem
[Figure: POWER4 vs. POWER5 system structures. POWER4 places the L3 cache and memory controller beyond the fabric controller; POWER5 moves the L3 cache to the processor side of the fabric and puts the L3 directory and memory controller on chip. Callouts: reduced L3 latency, faster access to memory, larger SMPs (64-way), number of chips cut in half.]
Simultaneous Multi-Threading (SMT)
 What is it?
 Why would I want it?
POWER4 pipeline
[Figure: POWER4 instruction pipeline across the branch, load/store, fixed-point, and floating-point pipelines, with out-of-order processing between group dispatch and group commit, plus branch-redirect and interrupt/flush paths. Stage key: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0-D3 = decode stages, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit. The POWER5 pipeline, shown later, shares this stage structure.]
Multi-threading evolution
 Execution unit utilization is low in today's microprocessors.
 Average execution unit utilization is about 25% across a broad spectrum of environments.
[Figure: single-threaded execution; in each processor cycle, most slots across the FX0/FX1, LS0/LS1, FP0/FP1, BFX, and CRL pipelines go unused while the instruction stream waits on memory.]
Coarse-grained multi-threading
 Two instruction streams, but only one thread executes at any instant.
 Hardware swaps in the second thread when a long-latency event occurs.
 A swap requires several cycles.
[Figure: execution unit usage over processor cycles; the hardware swaps between the two instruction streams whenever one stalls on memory.]
Coarse-grained multi-threading (Cont.)
 Processor (for example, RS64-IV) is able to store context for
two threads
– Rapid switching between threads minimizes lost cycles due
to I/O waits and cache misses.
– Can yield ~20% improvement for OLTP workloads.
 Coarse-grained multi-threading is only beneficial when the number of active threads exceeds twice the number of CPUs.
– AIX must create a "dummy" thread if there are too few real threads.
• Unnecessary switches to "dummy" threads can degrade performance by ~20%.
• Does not work with dynamic CPU deallocation.
Fine-grained multi-threading
 Variant of coarse-grained multi-threading.
 Threads execute in round-robin fashion.
 A cycle remains unused when a thread encounters a long-latency event.
[Figure: execution unit usage alternating between the two instruction streams cycle by cycle.]
POWER5 pipeline
[Figure: POWER5 instruction pipeline. The stage structure is identical to the POWER4 pipeline: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0-D3 = decode stages, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit.]
Simultaneous multi-threading (SMT)
 Reducing the number of unused execution unit slots typically yields a 25-40% throughput boost, and sometimes more.
[Figure: with SMT, instructions from both instruction streams fill execution unit slots within the same processor cycle.]
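To see why filling issue slots from two instruction streams raises utilization, here is a toy model in Python. It is purely illustrative: the two issue slots and the stall distribution are invented numbers, not POWER5 parameters, and real SMT gains depend on the workload.

    # Toy model of issue-slot utilization, single-threaded vs. SMT.
    import random

    random.seed(1)
    CYCLES, SLOTS = 100_000, 2

    def ready_instructions() -> int:
        """Instructions one thread can issue this cycle (toy distribution:
        stalled 40% of cycles, one instruction 40%, two instructions 20%)."""
        return random.choices([0, 1, 2], weights=[4, 4, 2])[0]

    def utilization(n_threads: int) -> float:
        filled = 0
        for _ in range(CYCLES):
            available = sum(ready_instructions() for _ in range(n_threads))
            filled += min(SLOTS, available)   # fill slots from all ready threads
        return filled / (CYCLES * SLOTS)

    print(f"1 thread : {utilization(1):.0%}")   # about 40%
    print(f"2 threads: {utilization(2):.0%}")   # about 68%: fewer wasted slots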
Simultaneous multi-threading (SMT) (Cont.)
 Each chip appears as a 4-way SMP to software
– Allows instructions from two threads to execute
simultaneously
 Processor resources optimized for enhanced SMT
performance
– No context switching, no dummy threads
 Hardware, POWER Hypervisor, or OS controlled thread
priority
– Dynamic feedback of shared resources allows for balanced
thread execution
 Dynamic switching between single and multithreaded mode
Dynamic resource balancing
 Threads share many
resources
– Global Completion Table,
Branch History Table,
Translation Lookaside Buffer,
and so on
 Higher performance realized
when resources balanced
across threads
– Tendency to drift toward
extremes accompanied by
reduced performance
Adjustable thread priority
 Instances when unbalanced execution is desirable:
– No work for the opposite thread
– Thread waiting on a lock
– Software-determined non-uniform balance
– Power management
 Control of the instruction decode rate:
– Software/hardware controls eight priority levels for each thread
[Figure: instructions per cycle for thread 0 and thread 1 across hardware priority pairs (thread 0 priority, thread 1 priority) from 0,7 through 7,0, including power-save mode (1,1) and single-threaded operation.]
Single-threaded operation
 Advantageous for execution unit
limited applications
– Floating or fixed point intensive
workloads
 Execution unit limited applications
provide minimal performance
leverage for SMT
– Extra resources necessary for SMT
provide higher performance benefit
when dedicated to single thread
 Determined dynamically on a per
processor basis
[Figure: thread states (dormant, null, active); the transitions are controlled by software, one of them by hardware or software.]
Micro-Partitioning
Micro-Partitioning overview
 Mainframe-inspired technology
 Virtualized resources shared by multiple partitions
 Benefits
– Finer grained resource allocation
– More partitions (Up to 254)
– Higher resource utilization
 New partitioning model
– POWER Hypervisor
– Virtual processors
– Fractional processor capacity partitions
– Operating system optimized for Micro-Partitioning exploitation
– Virtual I/O
Processor terminology
[Figure: processor terminology. Installed physical processors are deconfigured, inactive (CUoD), dedicated, or shared. Dedicated processors back dedicated processor partitions; shared processors form the shared processor pool, whose entitled capacity is presented to shared processor partitions through virtual processors, which appear as logical processors when SMT is on.]
Shared processor partitions
 Micro-Partitioning allows for multiple partitions to
share one physical processor
 Up to 10 partitions per physical processor
 Up to 254 partitions active at the same time
 Partition’s resource definition
– Minimum, desired, and maximum values for each
resource
– Processor capacity
– Virtual processors
– Capped or uncapped
• Capacity weight
– Dedicated memory
• Minimum of 128 MB and 16 MB increments
– Physical or virtual I/O resources
[Figure: six LPARs sharing four physical processors.]
Understanding min/max/desired resource values
 The desired value for a resource is given to a
partition if enough resource is available.
 If there is not enough resource to meet the desired
value, then a lower amount is allocated.
 If there is not enough resource to meet the min
value, the partition will not start.
 The maximum value is only used as an upper limit
for dynamic partitioning operations.
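A minimal sketch of this allocation rule in Python (the real allocation is performed by the POWER Hypervisor and HMC at activation; the maximum value matters only for later dynamic operations):

    from typing import Optional

    def allocate(available: float, minimum: float, desired: float) -> Optional[float]:
        """Processing units granted at activation, or None if the partition
        cannot start because its minimum cannot be met."""
        if available >= desired:
            return desired        # enough resource: full desired amount
        if available >= minimum:
            return available      # less than desired, but enough to start
        return None               # below minimum: the partition does not start

    print(allocate(available=1.2, minimum=0.5, desired=1.0))  # 1.0
    print(allocate(available=0.8, minimum=0.5, desired=1.0))  # 0.8
    print(allocate(available=0.3, minimum=0.5, desired=1.0))  # None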
Partition capacity entitlement
 Processing units
– 1.0 processing unit represents one
physical processor
 Entitled processor capacity
– Commitment of capacity that is
reserved for the partition
– Set upper limit of processor
utilization for capped partitions
– Each virtual processor must be
granted at least 1/10 of a
processing unit of entitlement
 Shared processor capacity is
always delivered in terms of whole
physical processors
[Figure: processing capacity. One physical processor equals 1.0 processing units; example partitions receive 0.5 and 0.4 processing units; the minimum requirement is 0.1 processing units.]
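Because each virtual processor must be granted at least 1/10 of a processing unit, the entitlement caps the usable VP count; a quick sketch:

    def max_virtual_processors(entitlement: float) -> int:
        """Largest VP count for which every VP still gets >= 0.1 units."""
        return int(entitlement / 0.1 + 1e-9)   # epsilon guards float rounding

    print(max_virtual_processors(0.5))   # 5
    print(max_virtual_processors(2.35))  # 23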
Capped and uncapped partitions
 Capped partition
– Not allowed to exceed its entitlement
 Uncapped partition
– Is allowed to exceed its entitlement
 Capacity weight
– Used for prioritizing uncapped partitions
– Value 0-255
– Value of 0 referred to as a “soft cap”
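The weight's role can be sketched as a proportional share of spare capacity among competing uncapped partitions. This is a simplification, not the Hypervisor's actual per-dispatch-window arbitration:

    def share_spare(spare: float, weights: dict) -> dict:
        """Split spare processing units among uncapped partitions in
        proportion to their capacity weights (0-255)."""
        total = sum(weights.values())
        if total == 0:                        # every contender is soft-capped
            return {name: 0.0 for name in weights}
        return {name: spare * w / total for name, w in weights.items()}

    print(share_spare(1.0, {"lparA": 128, "lparB": 64}))
    # lparA gets ~0.67 units of the spare capacity, lparB ~0.33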
Partition capacity entitlement example
 Shared pool has 2.0 processing units
available
 LPARs activated in sequence
 Partition 1 activated
– Min = 1.0, max = 2.0, desired = 1.5
– Starts with 1.5 allocated processing units
 Partition 2 activated
– Min = 1.0, max = 2.0, desired = 1.0
– Does not start
 Partition 3 activated
– Min = 0.1, max = 1.0, desired = 0.8
– Starts with 0.5 allocated processing units
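Tracing the sequence numerically (a sketch of the min/desired rule applied to the 2.0-unit pool):

    pool = 2.0
    pool -= 1.5   # Partition 1: desired 1.5 <= 2.0 -> starts with 1.5 (pool: 0.5)
                  # Partition 2: min 1.0 > 0.5      -> does not start
    pool -= 0.5   # Partition 3: min 0.1 <= 0.5 < desired 0.8
                  #              -> starts with the remaining 0.5
    print(pool)   # 0.0 -> the pool is now fully committed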
Understanding capacity allocation – An example
 A workload is run under different configurations.
 The size of the shared pool (number of physical
processors) is fixed at 16.
 The capacity entitlement for the partition is fixed
at 9.5.
 No other partitions are active.
Uncapped – 16 virtual processors
 16 virtual processors.
 Uncapped.
 Can use all available resource.
 The workload requires 26 minutes to complete.
[Chart: uncapped, 16 physical processors / 16 virtual processors / 9.5 capacity entitlement; processing units consumed over the 30-minute run.]
Uncapped – 12 virtual processors
 12 virtual processors.
 Even though the partition is uncapped, it can only use 12
processing units.
 The workload now requires 27 minutes to complete.
[Chart: uncapped, 16 physical processors / 12 virtual processors / 9.5 capacity entitlement.]
Capped
 The partition is now capped and resource utilization is
limited to the capacity entitlement of 9.5.
– Capping limits the amount of time each virtual processor is
scheduled.
– The workload now requires 28 minutes to complete.
[Chart: capped, 16 physical processors / 12 virtual processors / 9.5 capacity entitlement.]
Dynamic partitioning operations
 Add, move, or remove processor capacity
– Remove, move, or add entitled shared processor capacity
– Change between capped and uncapped processing
– Change the weight of an uncapped partition
– Add and remove virtual processors
• Provided the resulting CE per VP remains at least 0.1
 Add, move, or remove memory
– 16 MB logical memory block
 Add, move, or remove physical I/O adapter slots
 Add or remove virtual I/O adapter slots
 Min/max values defined for LPARs set the bounds within
which DLPAR can work
Dynamic LPAR
 Standard on all new systems.
[Figure: an HMC directs the Hypervisor to move resources between live partitions; four partitions (production, legacy apps, test/dev, file/print) run AIX 5L and Linux.]
Firmware
POWER Hypervisor
POWER Hypervisor strategy
 New Hypervisor for POWER5 systems
– Further convergence with iSeries
– But brands will retain unique value propositions
– Reduced development effort
– Faster time to market
 New capabilities on pSeries servers
– Shared processor partitions
– Virtual I/O
 New capability on iSeries servers
– Can run AIX 5L
POWER Hypervisor component sourcing
[Figure: the POWER5 Hypervisor converges components from the pSeries and iSeries families behind a common H-Call interface: SLIC nucleus, virtual Ethernet/VLAN, Capacity on Demand, shared processor LPAR, virtual I/O, bus recovery, dump, location codes, NVRAM, message passing, FSP load-from-flash, 255 partitions, slot/tower and drawer concurrent maintenance, partition on demand, HMC/HSC management, LAN and SCSI IOAs, and I/O configuration.]
POWER Hypervisor functions
 Same functions as POWER4 Hypervisor.
– Dynamic LPAR
– Capacity Upgrade on Demand
 New, active functions.
– Dynamic Micro-Partitioning
– Shared processor pool
– Virtual I/O
– Virtual LAN
 Machine is always in LPAR mode.
– Even with all resources dedicated to one OS
[Figure: the new active functions illustrated across four POWER5 chips: dynamic Micro-Partitioning, shared processor pools, virtual I/O (disk and LAN), dynamic LPAR, and Capacity Upgrade on Demand (planned vs. actual client capacity growth).]
POWER Hypervisor implementation
 Design enhancements to previous POWER4
implementation enable the sharing of processors
by multiple partitions
– Hypervisor decrementer (HDECR)
– New Processor Utilization Resource Register (PURR)
– Refine virtual processor objects
• Does not include physical characteristics of the processor
– New Hypervisor calls
POWER Hypervisor processor dispatch
 The Hypervisor manages the set of physical processors in the machine's shared processor pool.
 POWER5 generates a 10 ms dispatch window.
– The minimum allocation is 1 ms per physical processor.
 Each virtual processor is guaranteed its entitled share of processor cycles during each 10 ms dispatch window.
– ms per VP = CE × 10 / number of VPs
 The partition entitlement is evenly distributed among the online virtual processors.
 Once a capped partition has received its CE within a dispatch interval, it ceases to be runnable for the rest of the interval.
 A VP dispatched within 1 ms of the end of the dispatch interval receives half its CE at the start of the next dispatch interval.
[Figure: the POWER Hypervisor dispatches the virtual processor capacity entitlement of six shared processor partitions onto a shared pool of four POWER5 CPUs.]
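A worked sketch of the guarantee above (ms per VP = CE × 10 / VPs); the example entitlements match the dispatch example a few slides later:

    def slice_ms(entitlement: float, n_vps: int, window_ms: float = 10.0) -> float:
        """Guaranteed milliseconds each virtual processor receives per window."""
        return entitlement * window_ms / n_vps

    print(slice_ms(0.8, 2))  # 4.0 ms per VP per 10 ms window
    print(slice_ms(0.2, 1))  # 2.0 ms
    print(slice_ms(0.6, 3))  # 2.0 ms per VP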
Dispatching and interrupt latencies
 Virtual processors have dispatch latency.
 Dispatch latency is the time between a virtual
processor becoming runnable and being actually
dispatched.
 Timers and external interrupts are also subject to latency.
Shared processor pool
 Processors not associated with
dedicated processor partitions.
 No fixed relationship between virtual
processors and physical processors.
 The POWER Hypervisor attempts to
use the same physical processor.
– Affinity scheduling
– Home node
[Figure: the shared processor pool spans four POWER5 chips (CPUs 0-3); the Hypervisor's processor dispatch places the virtual processor capacity entitlement of six shared processor partitions onto the pool.]
Affinity scheduling
 When dispatching a VP, the POWER Hypervisor attempts to
preserve affinity by using:
– Same physical processor as before, or
– Same chip, or
– Same MCM
 When a physical processor becomes idle, the POWER
Hypervisor looks for a runnable VP that:
– Has affinity for it, or
– Has affinity to no-one, or
– Is uncapped
 Similar to AIX affinity scheduling
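A hedged sketch of that idle-processor search order, simplified to a single "home CPU" per VP (the real Hypervisor also prefers the same chip and the same MCM before giving up affinity):

    from typing import Optional

    def pick_vp(cpu: int, runnable: list) -> Optional[dict]:
        """Each VP is a dict: {'name', 'home_cpu' (int or None), 'uncapped'}."""
        for vp in runnable:
            if vp["home_cpu"] == cpu:      # 1) VP with affinity for this CPU
                return vp
        for vp in runnable:
            if vp["home_cpu"] is None:     # 2) VP with affinity to no-one
                return vp
        for vp in runnable:
            if vp["uncapped"]:             # 3) any uncapped VP
                return vp
        return None                        # nothing eligible: stay idle

    vps = [{"name": "lpar1-vp1", "home_cpu": None, "uncapped": True},
           {"name": "lpar3-vp0", "home_cpu": 2, "uncapped": False}]
    print(pick_vp(2, vps)["name"])         # lpar3-vp0: affinity wins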
Operating system support
 Micro-Partitioning capable operating systems must be modified to cede a virtual processor when they have no runnable work.
– Failure to do so wastes CPU resource; for example, a partition would burn its CE waiting for I/O.
– Ceding results in better utilization of the pool.
 An OS may confer the remainder of its timeslice to another VP, for example, to a VP holding a lock.
 Ceded or conferred VPs can be redispatched if they become runnable again during the same dispatch interval.
Example
[Figure: two dispatch interval passes (0-10 ms and 10-20 ms) across two physical processors, interleaving the virtual processors of three capped LPARs, with idle gaps. LPAR1: capacity entitlement 0.8 processing units, 2 virtual processors. LPAR2: 0.2 processing units, 1 virtual processor. LPAR3: 0.6 processing units, 3 virtual processors.]
POWER Hypervisor and virtual I/O
 I/O operations without dedicating resources to an individual
partition
 POWER Hypervisor’s virtual I/O related operations
– Provide control and configuration structures for virtual
adapter images required by the logical partitions
– Operations that allow partitions controlled and secure access
to physical I/O adapters in a different partition
– The POWER Hypervisor does not own any physical I/O
devices; they are owned by an I/O hosting partition
 I/O types supported
– SCSI
– Ethernet
– Serial console
Performance monitoring and accounting
 CPU utilization is measured against the capacity entitlement (CE).
– An uncapped partition receiving more than its CE will record 100% while actually using more.
 SMT
– Thread priorities compound the variable execution rate.
– There are twice as many logical CPUs.
 Interval-based accounting may attribute time incorrectly.
– New hardware support is required.
 The Processor Utilization Resource Register (PURR) records the actual clock ticks spent executing a partition.
– Used by performance commands (for example, via new flags) and accounting modules.
– Third-party tools will need to be modified.
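As a hedged sketch, utilization can be derived from PURR ticks relative to the entitled share of timebase ticks. The helper below is invented for illustration; AIX's instrumented performance commands read the register for you.

    def entitled_busy_pct(purr_delta: int, timebase_delta: int,
                          entitlement: float) -> float:
        """Percent of the partition's entitled capacity consumed."""
        return 100.0 * purr_delta / (timebase_delta * entitlement)

    # A 0.5-CE partition that consumed 40% of one processor's cycles:
    print(entitled_busy_pct(400_000, 1_000_000, 0.5))  # 80.0 (% of entitlement)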
Virtual I/O Server
Virtual I/O Server
 Provides an operating environment for virtual I/O administration
– Virtual I/O server administration
– Restricted scriptable command line user interface (CLI)
 Minimum hardware requirements
– POWER5 VIO capable machine
– Hardware management console
– Storage adapter
– Physical disk
– Ethernet adapter
– At least 128 MB of memory
 Capabilities of the Virtual I/O Server
– Ethernet Adapter Sharing
– Virtual SCSI disk
• Virtual I/O Server Version 1.1 supports selected configurations, including specific models of EMC, HDS, and STK disk subsystems attached using Fibre Channel
– Interacts with AIX and Linux partitions
Virtual I/O Server (Cont.)
 Shipped on an installation CD when the Advanced POWER Virtualization feature is ordered
 Configuration approaches for high availability
– Virtual I/O Server
• LVM mirroring
• Multipath I/O
• EtherChannel
– Second virtual I/O server instance in another partition
Virtual SCSI
 Allows sharing of storage devices
 Vital for shared processor partitions
– Overcomes potential limit of adapter slots due to Micro-
Partitioning
– Allows the creation of logical partitions without the need for
additional physical resources
 Allows attachment of previously unsupported storage
solutions
VSCSI server and client architecture overview
 Virtual SCSI is based on a
client/server relationship.
 The virtual I/O resources are assigned
using an HMC.
 Virtual SCSI enables sharing of
adapters as well as disk devices.
 Dynamic LPAR operations allowed.
 Dynamic mapping between physical
and virtual resources on the virtual
I/O server.
[Figure: VSCSI client/server architecture. The Virtual I/O Server partition owns the physical adapter and physical disks (SCSI, FC); its LVM exports logical volumes 1 and 2 through VSCSI server adapters, across the POWER Hypervisor, to VSCSI client adapters in the AIX and Linux client partitions, where each appears as an hdisk.]
Virtual devices
 Are defined as LVs in the I/O Server partition.
– Normal LV rules apply.
 Appear as real devices (hdisks) in the hosted partition.
 Can be manipulated using the Logical Volume Manager just like an ordinary physical disk.
 Can be used as a boot device and as a NIM target.
 Can be shared by multiple clients.
[Figure: an LV on an hdisk in the Virtual I/O Server partition is exported through a VSCSI server adapter and the POWER Hypervisor to a VSCSI client adapter; the client partition's LVM sees it as a virtual disk.]
SCSI RDMA and Logical Remote Direct Memory Access
 SCSI transport protocols define the
rules for exchanging information
between SCSI initiators and targets.
 Virtual SCSI uses the SCSI RDMA
Protocol (SRP).
– SCSI initiators and targets have the
ability to directly transfer information
between their respective address
spaces.
 SCSI requests and responses are
sent using the Virtual SCSI adapters.
 The actual data transfer, however, is
done using the Logical Redirected
DMA protocol.
[Figure: the VSCSI initiator device driver in the AIX client partition exchanges SCSI requests and responses with the VSCSI target device driver in the Virtual I/O Server over the Hypervisor's Reliable Command/Response Transport; the data transfer between the physical adapter and the client's data buffer uses Logical Remote Direct Memory Access, with device mapping held in the server partition.]
Virtual SCSI security
 Only the owning partition has access to its data.
 Data is copied directly from the PCI adapter to the client's memory.
Performance considerations
 Virtual SCSI I/O consumes roughly twice as many processor cycles as locally attached disk I/O (spread evenly across the client partition and the Virtual I/O Server).
– The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request.
– For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice.
 If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.
 If not constrained by CPU performance, dedicated-partition throughput is comparable to doing local I/O.
 Because there is no caching in memory on the server I/O partition, its memory requirements should be modest.
Limitations
 Hosting partition must be available before hosted
partition boot.
 Virtual SCSI supports FC, parallel SCSI, and SCSI
RAID.
 Maximum of 65535 virtual slots in the I/O server
partition.
 Maximum of 256 virtual slots on a single partition.
 Support for all mandatory SCSI commands.
 Not all optional SCSI commands are supported.
Implementation guideline
 Partitions with high performance and disk I/O requirements are not recommended candidates for VSCSI.
 Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a portion of a logical volume.
 Suitable uses include boot disks for the operating system and web servers that will typically cache a lot of data.
LVM mirroring
 This configuration protects virtual disks in a client partition against failure of:
– One physical disk
– One physical adapter
– One Virtual I/O Server
 Many possibilities exist to exploit this function.
[Figure: the client partition mirrors with its LVM across two VSCSI client adapters; each is served by a different Virtual I/O Server partition, each with its own physical SCSI adapter and physical disk.]
Multipath I/O
 This configuration protects virtual disks in a client partition against:
– Failure of one physical FC adapter in one I/O server
– Failure of one Virtual I/O Server
 The physical disk is assigned as a whole to the client partition.
 Many possibilities exist to exploit this function.
[Figure: one ESS physical disk is reached through a SAN switch by physical FC adapters in two Virtual I/O Server partitions; each exports the whole hdisk over a VSCSI server adapter, and the client partition's LVM multipaths across its two VSCSI client adapters.]
Virtual LAN overview
 Virtual network segments on top of
physical switch devices.
 All nodes in the VLAN can
communicate without any L3
routing or inter-VLAN bridging.
 VLANs provide:
– Increased LAN security
– Flexible network deployment over traditional network devices
 VLAN support in AIX is based on
the IEEE 802.1Q VLAN
implementation.
– VLAN ID tagging to Ethernet
frames
– VLAN ID restricted switch ports
[Figure: two VLANs spanning three switches; nodes on VLAN 1 cannot reach nodes on VLAN 2 without L3 routing.]
Virtual Ethernet
 Enables inter-partition communication.
– In-memory point to point connections
 Physical network adapters are not needed.
 Similar to high-bandwidth Ethernet connections.
 Supports multiple protocols (IPv4, IPv6, and ICMP).
 No Advanced POWER Virtualization feature required.
– POWER5 Systems
– AIX 5L V5.3 or appropriate Linux level
– Hardware management console (HMC)
Virtual Ethernet connections
 VLAN technology implementation
– Partitions can only access data directed to
them.
 Virtual Ethernet switch provided by the
POWER Hypervisor
 Virtual LAN adapters appear to the OS as physical adapters.
– The MAC address is generated by the HMC.
 1-3 Gb/s transmission speed
– Support for large MTUs (~64K) on AIX.
 Up to 256 virtual Ethernet adapters
– Up to 18 VLANs.
 Bootable device support for NIM OS
installations
[Figure: AIX and Linux partitions, each with a virtual Ethernet adapter, connected by the virtual Ethernet switch inside the POWER Hypervisor.]
Virtual Ethernet switch
 Based on IEEE 802.1Q VLAN standard
– OSI-Layer 2
– Optional Virtual LAN ID (VID)
– 4094 virtual LANs supported
– Up to 18 VIDs per virtual LAN port
 Switch configuration through HMC
How it works
[Flowchart: frame handling in the Hypervisor's virtual VLAN switch. A frame leaves a virtual Ethernet adapter and arrives at a virtual VLAN switch port; the PHYP caches the source MAC. If the frame carries no IEEE VLAN header, one is inserted from the configured associated switch port; if it carries one, the port must be allowed for that VLAN or the packet is dropped. If the destination MAC is in the table and the VLAN number matches, the frame is delivered; otherwise it is passed to the trunk adapter if one is defined, else dropped.]
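A hedged Python sketch of that delivery decision (the field and table names are invented for illustration; the real switch operates on Ethernet frames inside the Hypervisor):

    def forward(frame: dict, port: dict, mac_table: dict, trunk_defined: bool) -> str:
        if frame.get("vlan_id") is None:
            frame["vlan_id"] = port["pvid"]          # untagged: insert port VLAN ID
        elif frame["vlan_id"] not in port["allowed_vlans"]:
            return "drop"                            # tagged, but port not allowed
        dest = mac_table.get(frame["dest_mac"])
        if dest is not None and dest["vlan_id"] == frame["vlan_id"]:
            return f"deliver to {dest['port']}"      # known MAC on the same VLAN
        return "pass to trunk adapter" if trunk_defined else "drop"

    table = {"aa:bb": {"vlan_id": 1, "port": "lpar2-ent0"}}
    print(forward({"dest_mac": "aa:bb", "vlan_id": None},
                  {"pvid": 1, "allowed_vlans": {1}}, table, trunk_defined=True))
    # -> deliver to lpar2-ent0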
Performance considerations
 Virtual Ethernet performance
– Throughput scales nearly linear with the
allocated capacity entitlement
 Virtual LAN vs. Gigabit Ethernet
throughput
– Virtual Ethernet adapter has higher raw
throughput at all MTU sizes
– In-memory copy is more efficient at larger
MTU
[Charts: virtual Ethernet throughput per 0.1 CPU entitlement at MTU sizes 1500, 9000, and 65394 across entitlements 0.1-1.0; and TCP_STREAM throughput of virtual LAN vs. Gigabit Ethernet, simplex and duplex, at the same MTU sizes.]
Limitations
 Virtual Ethernet can be used in both shared and
dedicated processor partitions provided with the
appropriate OS levels.
 A mixture of Virtual Ethernet connections, real network adapters, or both is permitted within a partition.
 Virtual Ethernet can only connect partitions within a
single system.
 A system’s processor load is increased when using
virtual Ethernet.
Implementation guideline
 Know your environment and the network traffic.
 Choose a high MTU size where it makes sense for the network traffic in the virtual LAN.
 Use the MTU size 65394 if you expect a large amount of
data to be copied inside your Virtual LAN.
 Enable tcp_pmtu_discover and udp_pmtu_discover in
conjunction with MTU size 65394.
 Do not turn off SMT.
 No dedicated CPUs are required for virtual Ethernet
performance.
Connecting Virtual Ethernet to external networks
 Routing
– The partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server.
[Figure: two systems, each with an AIX partition acting as IP router. Inside each system, partitions on the Hypervisor's virtual Ethernet switch use internal subnets (3.1.1.x and 4.1.1.x); the routing partition bridges through a physical adapter to external subnets 1.1.1.x and 2.1.1.x, reaching external AIX and Linux servers via an IP router.]
Shared Ethernet Adapter
 Connects internal and external VLANs using one physical
adapter.
 SEA is a new service that acts as a layer 2 network switch.
– Securely bridges network traffic from a virtual Ethernet
adapter to a real network adapter
 SEA service runs in the Virtual I/O Server partition.
– Advanced POWER Virtualization feature required
– At least one physical Ethernet adapter required
 No physical I/O slot and network adapter required in the
client partition.
Shared Ethernet Adapter (Cont.)
 Virtual Ethernet MAC are visible to outside systems.
 Broadcast/multicast is supported.
 ARP (Address Resolution Protocol) and NDP (Neighbor Discovery
Protocol) can work across a shared Ethernet.
 One SEA can be shared by multiple VLANs and multiple subnets
can connect using a single adapter on the Virtual I/O Server.
 Virtual Ethernet adapter configured in the Shared Ethernet Adapter
must have the trunk flag set.
– The trunk Virtual Ethernet adapter enables a layer-2 bridge to a
physical adapter
 IP fragmentation is performed, or an ICMP "packet too big" message is sent, when the Shared Ethernet Adapter receives IP (or IPv6) packets larger than the MTU of the adapter through which the packet is forwarded.
Virtual Ethernet and Shared Ethernet Adapter security
 VLAN (virtual local area network) tagging is implemented as described in the IEEE 802.1Q standard.
 This VLAN standard ensures that partitions have no access to foreign data.
 Only the network adapters (virtual or physical) that are
connected to a port (virtual or physical) that belongs to the
same VLAN can receive frames with that specific VLAN ID.
Performance considerations
 Virtual I/O-Server
performance
– Adapters stream data at
media speed if the Virtual
I/O server has enough
capacity entitlement.
– CPU utilization per Gigabit
of throughput is higher with
a Shared Ethernet adapter.
[Charts: Virtual I/O Server TCP_STREAM throughput (Mb/s) and normalized CPU utilization (% CPU per Gb) at MTU 1500 and 9000, simplex and duplex.]
Limitations
 System processors are used for all communication
functions, leading to a significant amount of system
processor load.
 One of the virtual adapters in the SEA on the Virtual I/O
server must be defined as a default adapter with a default
PVID.
 Up to 16 Virtual Ethernet adapters with 18 VLANs on each
can be shared on a single physical network adapter.
 Shared Ethernet Adapter requires:
– POWER Hypervisor component of POWER5
systems
– AIX 5L Version 5.3 or appropriate Linux level
Implementation guideline
 Know your environment and the network traffic.
 Use a dedicated network adapter if you expect heavy
network traffic between Virtual Ethernet and local
networks.
 If possible, use dedicated CPUs for the Virtual I/O
Server.
 Choose an MTU size of 9000 if this makes sense for your network traffic.
 Do not use Shared Ethernet Adapter functionality for latency-critical applications.
 With MTU size 1500, you need about one CPU per Gigabit Ethernet adapter streaming at media speed; with MTU size 9000, two Gigabit Ethernet adapters can stream at media speed per CPU (see the sizing sketch below).
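The sizing rule of thumb above, as a tiny sketch. It assumes Gigabit Ethernet adapters streaming at media speed and only the two quoted MTU sizes:

    def vios_cpus_needed(n_adapters: int, mtu: int) -> float:
        cpus_per_adapter = 1.0 if mtu == 1500 else 0.5   # MTU 9000 halves the cost
        return n_adapters * cpus_per_adapter

    print(vios_cpus_needed(3, mtu=1500))  # 3.0 CPUs
    print(vios_cpus_needed(3, mtu=9000))  # 1.5 CPUs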
Shared Ethernet Adapter configuration
 The Virtual I/O Server is
configured with at least one
physical Ethernet adapter.
 One Shared Ethernet Adapter
can be shared by multiple
VLANs.
 Multiple subnets can connect
using a single adapter on the
Virtual I/O Server.
[Figure: a Virtual I/O Server with one Shared Ethernet Adapter on physical adapter ent0 bridges VLAN 1 (10.1.1.x) and VLAN 2 (10.1.2.x) from the Hypervisor's virtual Ethernet switch to external AIX (10.1.1.14) and Linux (10.1.2.15) servers.]
Multiple Shared Ethernet Adapter configuration
 Maximizing throughput
– Using several Shared Ethernet
Adapters
– More queues
– More performance
[Figure: as in the previous configuration, but VLAN 1 and VLAN 2 are bridged through two Shared Ethernet Adapters on separate physical adapters, ent0 and ent1.]
Multipath routing with dead gateway detection
 This configuration protects
your access to the external
network against:
– Failure of one physical
network adapter in one I/O
server
– Failure of one Virtual I/O
server
– Failure of one gateway
[Figure: an AIX partition with interfaces on VLAN 1 (9.3.5.12) and VLAN 2 (9.3.5.22) runs multipath routing with dead gateway detection, with default routes to 9.3.5.10 via 9.3.5.12 and to 9.3.5.20 via 9.3.5.22. Two Virtual I/O Servers (9.3.5.11 on VLAN 1, 9.3.5.21 on VLAN 2) each bridge through their own physical adapter to the external network's two gateways, 9.3.5.10 and 9.3.5.20.]
Shared Ethernet Adapter commands
 Virtual I/O Server commands
– lsdev -type adapter: lists all the virtual and physical adapters.
– Choose the virtual Ethernet adapter you want to map to the physical Ethernet adapter.
– Make sure the physical and virtual interfaces are unconfigured (down or detached).
– mkvdev: maps the physical adapter to the virtual adapter, creates a layer 2 bridge, and defines the default virtual adapter with its default VLAN ID. It creates a new Ethernet interface (for example, ent5).
– mktcpip: configures TCP/IP on the new Ethernet interface (for example, ent5). The full sequence is scripted in the sketch after this list.
 Client partition commands
– No new commands are needed; the typical TCP/IP configuration is
done on the virtual Ethernet interface that it is defined in the client
partition profile on the HMC.
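As a hedged illustration, the Virtual I/O Server sequence above can be scripted. The adapter names (ent0 physical, ent2 virtual), the IP details, and the exact flags are placeholders recalled from VIOS 1.x documentation; verify them on your system before use.

    import subprocess

    steps = [
        "lsdev -type adapter",               # list virtual and physical adapters
        "mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1",
        "mktcpip -hostname vios1 -inetaddr 10.1.1.2 -interface en5 "
        "-netmask 255.255.255.0 -gateway 10.1.1.1",
    ]

    for cmd in steps:
        print(f"$ {cmd}")
        subprocess.run(cmd, shell=True, check=True)  # intended to run on the VIOS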
Virtual SCSI commands
 Virtual I/O Server commands (the LV sequence is scripted in the sketch after this list)
– To map a LV:
• mkvg: Creates the volume group, where a new LV will be created using
the mklv command.
• lsdev: Shows the virtual SCSI server adapters that could be used for
mapping with the LV.
• mkvdev: Maps the virtual SCSI server adapter to the LV.
• lsmap -all: Shows the mapping information.
– To map a physical disk:
• lsdev: Shows the virtual SCSI server adapters that could be used for
mapping with a physical disk.
• mkvdev: Maps the virtual SCSI server adapter to a physical disk.
• lsmap -all: Shows the mapping information.
 Client partition commands
– No new commands needed; the typical device configuration uses
the cfgmgr command.
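A matching hedged sketch of the LV mapping sequence. All names (clientvg, hdisk2, lv01, vhost0) are placeholders, and the flags should be checked against your VIOS level.

    import subprocess

    steps = [
        "mkvg -f -vg clientvg hdisk2",           # volume group for client LVs
        "mklv -lv lv01 clientvg 10G",            # logical volume to export
        "lsdev -virtual",                        # find the VSCSI server adapter
        "mkvdev -vdev lv01 -vadapter vhost0",    # map the LV to the adapter
        "lsmap -all",                            # show the resulting mapping
    ]

    for cmd in steps:
        print(f"$ {cmd}")
        subprocess.run(cmd, shell=True, check=True)  # intended to run on the VIOS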
Section Review Questions
1. Any technology improvement will boost
performance of any client solution.
a. True
b. False
2. The application of technology in a creative way to solve clients' business problems is one definition of innovation.
a. True
b. False
Section Review Questions
3. Client’s satisfaction with your solution can be
enhanced by which of the following?
a. Setting expectations appropriately.
b. Applying technology appropriately.
c. Communicating the benefits of the technology to the
client.
d. All of the above.
Section Review Questions
4. Which of the following are available with
POWER5 architecture?
a. Simultaneous Multi-Threading.
b. Micro-Partitioning.
c. Dynamic power management.
d. All of the above.
Section Review Questions
5. Simultaneous Multi-Threading is the same as hyperthreading; IBM just gave it a different name.
a. True.
b. False.
Section Review Questions
6. In order to bridge network traffic between the
Virtual Ethernet and external networks, the
Virtual I/O Server has to be configured with at
least one physical Ethernet adapter.
a. True.
b. False.
Review Question Answers
1. b
2. a
3. d
4. d
5. b
6. a
Unit Summary
 You should now be able to:
– Describe the relationship between technology and solutions.
– List key IBM technologies that are part of the POWER5 products.
– Describe the functional benefits that these technologies provide.
– Discuss the appropriate use of these technologies.
Reference
 You may find more information here:
– IBM eServer pSeries AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading (white paper)
– Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940
– IBM eServer p5 Virtualization Performance Considerations, SG24-5768
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDELiveplex
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7DianaGray10
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopBachir Benyammi
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfDaniel Santiago Silva Capera
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsSeth Reyes
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarPrecisely
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024SkyPlanner
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Brian Pichman
 

Kürzlich hochgeladen (20)

Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve DecarbonizationUsing IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
Using IESVE for Loads, Sizing and Heat Pump Modeling to Achieve Decarbonization
 
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just MinutesAI Fame Rush Review – Virtual Influencer Creation In Just Minutes
AI Fame Rush Review – Virtual Influencer Creation In Just Minutes
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond Ontologies
 
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration WorkflowsIgniting Next Level Productivity with AI-Infused Data Integration Workflows
Igniting Next Level Productivity with AI-Infused Data Integration Workflows
 
Building AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptxBuilding AI-Driven Apps Using Semantic Kernel.pptx
Building AI-Driven Apps Using Semantic Kernel.pptx
 
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
activity_diagram_combine_v4_20190827.pdfactivity_diagram_combine_v4_20190827.pdf
 
9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team9 Steps For Building Winning Founding Team
9 Steps For Building Winning Founding Team
 
How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?How Accurate are Carbon Emissions Projections?
How Accurate are Carbon Emissions Projections?
 
UiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation DevelopersUiPath Community: AI for UiPath Automation Developers
UiPath Community: AI for UiPath Automation Developers
 
COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDEADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
ADOPTING WEB 3 FOR YOUR BUSINESS: A STEP-BY-STEP GUIDE
 
UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7UiPath Studio Web workshop series - Day 7
UiPath Studio Web workshop series - Day 7
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
 
NIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 WorkshopNIST Cybersecurity Framework (CSF) 2.0 Workshop
NIST Cybersecurity Framework (CSF) 2.0 Workshop
 
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdfIaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
IaC & GitOps in a Nutshell - a FridayInANuthshell Episode.pdf
 
20150722 - AGV
20150722 - AGV20150722 - AGV
20150722 - AGV
 
Computer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and HazardsComputer 10: Lesson 10 - Online Crimes and Hazards
Computer 10: Lesson 10 - Online Crimes and Hazards
 
AI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity WebinarAI You Can Trust - Ensuring Success with Data Integrity Webinar
AI You Can Trust - Ensuring Success with Data Integrity Webinar
 
Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024Salesforce Miami User Group Event - 1st Quarter 2024
Salesforce Miami User Group Event - 1st Quarter 2024
 
Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )Building Your Own AI Instance (TBLC AI )
Building Your Own AI Instance (TBLC AI )
 

Technology

• 11. Simultaneous Multi-Threading (SMT)  What is it?  Why would I want it?
• 12. POWER4 pipeline [Diagram: the POWER4 instruction pipeline, showing the branch, load/store, fixed-point, and floating-point pipelines from instruction fetch through group commit, with branch redirects, interrupts and flushes, and out-of-order processing] (IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit)
• 13. Multi-threading evolution  Execution unit utilization is low in today’s microprocessors  Roughly 25% average execution unit utilization across a broad spectrum of environments [Diagram: a single instruction stream leaving most execution-unit slots (FX0, FX1, LS0, LS1, FP0, FP1, BFX, CRL) unused over successive processor cycles]
• 14. Coarse-grained multi-threading  Two instruction streams, but only one thread executing at any instant  Hardware swaps in the second thread when a long-latency event occurs  A swap requires several cycles [Diagram: two instruction streams alternating on the execution units, with idle cycles at each swap]
• 15. Coarse-grained multi-threading (Cont.)  The processor (for example, RS64-IV) is able to store context for two threads – Rapid switching between threads minimizes cycles lost to I/O waits and cache misses – Can yield roughly 20% improvement for OLTP workloads  Coarse-grained multi-threading is only beneficial when the number of active threads exceeds twice the number of CPUs – AIX must create a “dummy” thread if there are not enough real threads • Unnecessary switches to “dummy” threads can degrade performance by roughly 20% • Does not work with dynamic CPU deallocation
• 16. Fine-grained multi-threading  A variant of coarse-grained multi-threading  Threads execute in round-robin fashion  A cycle remains unused when a thread encounters a long-latency event [Diagram: two instruction streams dispatched in alternate cycles, with slots left empty during long-latency events]
• 17. POWER5 pipeline [Diagram: the POWER5 instruction pipeline, with the same stages as the POWER4 pipeline shown on slide 12, from instruction fetch through group commit; the stage legend is identical]
• 18. Simultaneous multi-threading (SMT)  Reducing the number of unused execution units yields a 25–40% throughput boost, and sometimes more [Diagram: two instruction streams issuing in the same cycle, filling most execution-unit slots]
• 19. Simultaneous multi-threading (SMT) (Cont.)  Each chip appears as a 4-way SMP to software – Allows instructions from two threads to execute simultaneously  Processor resources optimized for enhanced SMT performance – No context switching, no dummy threads  Hardware, POWER Hypervisor, or OS controlled thread priority – Dynamic feedback of shared resources allows for balanced thread execution  Dynamic switching between single- and multi-threaded mode
• 20. Dynamic resource balancing  Threads share many resources – Global Completion Table, Branch History Table, Translation Lookaside Buffer, and so on  Higher performance is realized when resources are balanced across threads – A tendency to drift toward extremes is accompanied by reduced performance
• 21. Adjustable thread priority  Instances when unbalanced execution is desirable – No work for the opposite thread – Thread waiting on a lock – Software-determined non-uniform balance – Power management  Control of the instruction decode rate – Software/hardware controls eight priority levels for each thread [Chart: instructions per cycle for each thread across hardware priority pairs, from power-save mode (1,1) through balanced (7,7) to single-threaded operation (7,0)]
• 22. Single-threaded operation  Advantageous for execution-unit-limited applications – Floating-point or fixed-point intensive workloads  Execution-unit-limited applications provide minimal performance leverage for SMT – The extra resources necessary for SMT provide a higher performance benefit when dedicated to a single thread  Determined dynamically on a per-processor basis [Diagram: thread states (dormant, null, active) with software- and hardware-initiated transitions]
• 23. Micro-Partitioning
• 24. Micro-Partitioning overview  Mainframe-inspired technology  Virtualized resources shared by multiple partitions  Benefits – Finer-grained resource allocation – More partitions (up to 254) – Higher resource utilization  New partitioning model – POWER Hypervisor – Virtual processors – Fractional processor capacity partitions – Operating system optimized for Micro-Partitioning exploitation – Virtual I/O
• 25. Processor terminology [Diagram: installed physical processors divide into deconfigured, inactive (CUoD), dedicated, and shared processors; the shared processor pool backs shared processor partitions through virtual processors and entitled capacity, which appear as logical processors when SMT is on; dedicated processor partitions own their processors outright]
• 26. Shared processor partitions  Micro-Partitioning allows multiple partitions to share one physical processor  Up to 10 partitions per physical processor  Up to 254 partitions active at the same time  Partition’s resource definition – Minimum, desired, and maximum values for each resource – Processor capacity – Virtual processors – Capped or uncapped • Capacity weight – Dedicated memory • Minimum of 128 MB, in 16 MB increments – Physical or virtual I/O resources [Diagram: six LPARs sharing a pool of four physical CPUs]
• 27. Understanding min/max/desired resource values  The desired value for a resource is given to a partition if enough of that resource is available.  If there is not enough resource to meet the desired value, a lower amount is allocated.  If there is not enough resource to meet the min value, the partition will not start.  The maximum value is only used as an upper limit for dynamic partitioning operations.
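The activation rule above can be captured in a few lines. The following Python sketch is illustrative only; it is not part of the course material, and the function name is invented:

```python
def allocate(pool_free, minimum, desired):
    """Activation-time allocation rule for one partition's resource.

    Returns the amount granted, or None if the partition cannot start
    because even its minimum requirement cannot be met.
    """
    if pool_free >= desired:
        return desired       # enough available: the partition gets its desired value
    if pool_free >= minimum:
        return pool_free     # less than desired but at least the minimum: grant the rest
    return None              # below the minimum: the partition does not start
```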
• 28. Partition capacity entitlement  Processing units – 1.0 processing unit represents one physical processor  Entitled processor capacity – A commitment of capacity that is reserved for the partition – Sets the upper limit of processor utilization for capped partitions – Each virtual processor must be granted at least 1/10 of a processing unit of entitlement  Shared processor capacity is always delivered in terms of whole physical processors [Diagram: one physical processor equals 1.0 processing units; example allocations of 0.5 and 0.4 processing units; minimum requirement 0.1 processing units]
• 29. Capped and uncapped partitions  Capped partition – Not allowed to exceed its entitlement  Uncapped partition – Allowed to exceed its entitlement  Capacity weight – Used for prioritizing uncapped partitions – Value 0–255 – A value of 0 is referred to as a “soft cap”
• 30. Partition capacity entitlement example  Shared pool has 2.0 processing units available  LPARs activated in sequence  Partition 1 activated – Min = 1.0, max = 2.0, desired = 1.5 – Starts with 1.5 allocated processing units  Partition 2 activated – Min = 1.0, max = 2.0, desired = 1.0 – Does not start  Partition 3 activated – Min = 0.1, max = 1.0, desired = 0.8 – Starts with 0.5 allocated processing units
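Running the allocate() sketch from slide 27 against this activation sequence reproduces the slide's numbers (again, illustrative code rather than course material):

```python
pool = 2.0  # processing units available in the shared pool
partitions = [("Partition 1", 1.0, 1.5),
              ("Partition 2", 1.0, 1.0),
              ("Partition 3", 0.1, 0.8)]
for name, minimum, desired in partitions:
    granted = allocate(pool, minimum, desired)
    if granted is None:
        print(f"{name}: does not start")
    else:
        pool -= granted
        print(f"{name}: starts with {granted} allocated processing units")
# Partition 1: starts with 1.5 allocated processing units
# Partition 2: does not start
# Partition 3: starts with 0.5 allocated processing units
```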
• 31. Understanding capacity allocation – An example  A workload is run under different configurations.  The size of the shared pool (number of physical processors) is fixed at 16.  The capacity entitlement for the partition is fixed at 9.5.  No other partitions are active.
• 32. Uncapped – 16 virtual processors  16 virtual processors.  Uncapped.  Can use all available resources.  The workload requires 26 minutes to complete. [Chart: processor utilization over elapsed time, uncapped, 16 PPs / 16 VPs / 9.5 CE]
• 33. Uncapped – 12 virtual processors  12 virtual processors.  Even though the partition is uncapped, it can only use 12 processing units.  The workload now requires 27 minutes to complete. [Chart: processor utilization over elapsed time, uncapped, 16 PPs / 12 VPs / 9.5 CE]
• 34. Capped  The partition is now capped, and resource utilization is limited to the capacity entitlement of 9.5. – Capping limits the amount of time each virtual processor is scheduled. – The workload now requires 28 minutes to complete. [Chart: processor utilization over elapsed time, capped, 16 PPs / 12 VPs / 9.5 CE]
• 35. Dynamic partitioning operations  Add, move, or remove processor capacity – Remove, move, or add entitled shared processor capacity – Change between capped and uncapped processing – Change the weight of an uncapped partition – Add and remove virtual processors • Provided CE/VP stays at or above 0.1  Add, move, or remove memory – 16 MB logical memory blocks  Add, move, or remove physical I/O adapter slots  Add or remove virtual I/O adapter slots  Min/max values defined for LPARs set the bounds within which DLPAR can work
• 36. Dynamic LPAR  Standard on all new systems  Move resources between live partitions [Diagram: an HMC managing four partitions (production, legacy apps, test/dev, file/print) running AIX 5L and Linux on the Hypervisor]
• 37. Firmware: POWER Hypervisor
• 38. POWER Hypervisor strategy  New Hypervisor for POWER5 systems – Further convergence with iSeries – But the brands will retain unique value propositions – Reduced development effort – Faster time to market  New capabilities on pSeries servers – Shared processor partitions – Virtual I/O  New capability on iSeries servers – Can run AIX 5L
• 39. POWER Hypervisor component sourcing [Diagram: the H-Call interface combines components of pSeries and iSeries heritage: virtual Ethernet/VLAN, virtual I/O, shared processor LPAR, Capacity on Demand, partition on demand, 255 partitions, bus recovery, dump, location codes, FSP load from flash, NVRAM, message passing, I/O configuration, HMC and HSC management, and drawer and slot/tower concurrent maintenance, layered over the Nucleus (SLIC)]
• 40. POWER Hypervisor functions  Same functions as the POWER4 Hypervisor. – Dynamic LPAR – Capacity Upgrade on Demand  New, active functions. – Dynamic Micro-Partitioning – Shared processor pool – Virtual I/O – Virtual LAN  The machine is always in LPAR mode. – Even with all resources dedicated to one OS [Diagram: dynamic Micro-Partitioning across four CPUs, shared processor pools spanning POWER5 chips, virtual I/O (disk and LAN), dynamic LPAR, and Capacity Upgrade on Demand tracking planned versus actual client capacity growth]
• 41. POWER Hypervisor implementation  Design enhancements over the previous POWER4 implementation enable the sharing of processors by multiple partitions – Hypervisor decrementer (HDECR) – New Processor Utilization Resource Register (PURR) – Refined virtual processor objects • These do not include physical characteristics of the processor – New Hypervisor calls
• 42. POWER Hypervisor processor dispatch  Manages a set of processors on the machine (the shared processor pool).  POWER5 generates a 10 ms dispatch window. – Minimum allocation is 1 ms per physical processor.  Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window. – ms/VP = CE * 10 / VPs  The partition entitlement is evenly distributed among the online virtual processors.  Once a capped partition has received its CE within a dispatch interval, it becomes not runnable.  A VP dispatched within 1 ms of the end of the dispatch interval receives half its CE at the start of the next dispatch interval. [Diagram: virtual processor capacity entitlement for six shared processor partitions dispatched onto a four-CPU shared pool]
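The guaranteed dispatch time per virtual processor follows directly from the ms/VP formula. A small illustrative sketch (names invented; this is not Hypervisor code):

```python
def ms_per_vp(ce, vps, window_ms=10):
    """Milliseconds of physical processor time each virtual processor
    is guaranteed within one Hypervisor dispatch window."""
    if ce / vps < 0.1:
        raise ValueError("each virtual processor needs at least 0.1 processing units")
    return ce * window_ms / vps

print(ms_per_vp(0.8, 2))  # 4.0 ms per VP in every 10 ms window
print(ms_per_vp(0.6, 3))  # 2.0 ms per VP in every 10 ms window
```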
• 43. Dispatching and interrupt latencies  Virtual processors have dispatch latency.  Dispatch latency is the time between a virtual processor becoming runnable and actually being dispatched.  Timers and external interrupts are subject to similar latencies.
• 44. Shared processor pool  Processors not associated with dedicated processor partitions.  No fixed relationship between virtual processors and physical processors.  The POWER Hypervisor attempts to use the same physical processor. – Affinity scheduling – Home node [Diagram: virtual processor capacity entitlement for six shared processor partitions dispatched onto a four-CPU shared pool of POWER5 chips]
• 45. Affinity scheduling  When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using: – The same physical processor as before, or – The same chip, or – The same MCM  When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that: – Has affinity for it, or – Has affinity to no one, or – Is uncapped  Similar to AIX affinity scheduling
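The two preference lists read naturally as an ordered search. The sketch below is a conceptual illustration of that ordering only; the Hypervisor's actual dispatcher is far more involved, and all names here are invented:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VirtualProc:
    home: Optional[int]   # physical processor (or chip/MCM) this VP last ran on
    uncapped: bool

def pick_vp(idle_cpu, runnable):
    """Choose a VP for an idle physical CPU, in affinity-preference order."""
    preferences = (
        lambda vp: vp.home == idle_cpu,   # has affinity for this processor
        lambda vp: vp.home is None,       # has affinity to no one yet
        lambda vp: vp.uncapped,           # uncapped work can run anywhere
    )
    for prefers in preferences:
        for vp in runnable:
            if prefers(vp):
                return vp
    return None  # nothing eligible: the processor stays idle
```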
• 46. Operating system support  Micro-Partitioning capable operating systems need to be modified to cede a virtual processor when they have no runnable work – Failure to do this wastes CPU resources • For example, a partition spends its CE waiting for I/O – Ceding results in better utilization of the pool  May confer the remainder of their timeslice to another VP – For example, to a VP holding a lock  Can be redispatched if they become runnable again during the same dispatch interval
• 47. Example [Diagram: two 10 ms POWER Hypervisor dispatch intervals across two physical processors; LPAR1: capacity entitlement 0.8 processing units, 2 virtual processors (capped); LPAR2: 0.2 processing units, 1 virtual processor (capped); LPAR3: 0.6 processing units, 3 virtual processors (capped); some slots remain idle]
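Applying the 10 ms window rule from slide 42 to the figure's three partitions confirms the timeline (illustrative arithmetic only):

```python
# (capacity entitlement, number of virtual processors) per capped LPAR
lpars = {"LPAR1": (0.8, 2), "LPAR2": (0.2, 1), "LPAR3": (0.6, 3)}
window_ms = 10
for name, (ce, vps) in lpars.items():
    print(f"{name}: {ce * window_ms / vps:.0f} ms per VP, "
          f"{ce * window_ms:.0f} ms total per window")
# Two physical processors provide 20 ms per window; total entitlement
# is 1.6 (16 ms), which is why idle slots appear in the figure.
```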
• 48. POWER Hypervisor and virtual I/O  I/O operations without dedicating resources to an individual partition  The POWER Hypervisor’s virtual I/O related operations – Provide control and configuration structures for the virtual adapter images required by the logical partitions – Allow partitions controlled and secure access to physical I/O adapters in a different partition – The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition  I/O types supported – SCSI – Ethernet – Serial console
• 49. Performance monitoring and accounting  CPU utilization is measured against CE. – An uncapped partition receiving more than its CE will record 100% but will actually be using more.  SMT – Thread priorities compound the variable execution rate. – Twice as many logical CPUs.  For accounting, an interval may be incorrectly allocated. – New hardware support is required.  The Processor Utilization Resource Register (PURR) records actual clock ticks spent executing a partition. – Used by performance commands (for example, new flags) and accounting modules. – Third-party tools will need to be modified.
• 50. Virtual I/O Server
• 51. Virtual I/O Server  Provides an operating environment for virtual I/O administration – Virtual I/O Server administration – Restricted, scriptable command line user interface (CLI)  Minimum hardware requirements – A POWER5 VIO-capable machine – Hardware Management Console – Storage adapter – Physical disk – Ethernet adapter – At least 128 MB of memory  Capabilities of the Virtual I/O Server – Ethernet adapter sharing – Virtual SCSI disk • Virtual I/O Server Version 1.1 supports selected configurations, including specific models of EMC, HDS, and STK disk subsystems attached using Fibre Channel – Interacts with AIX and Linux partitions
• 52. Virtual I/O Server (Cont.)  Installation CD when the Advanced POWER Virtualization feature is ordered  Configuration approaches for high availability – Virtual I/O Server • LVM mirroring • Multipath I/O • EtherChannel – A second Virtual I/O Server instance in another partition
• 53. Virtual SCSI  Allows sharing of storage devices  Vital for shared processor partitions – Overcomes the potential limit on adapter slots due to Micro-Partitioning – Allows the creation of logical partitions without the need for additional physical resources  Allows attachment of previously unsupported storage solutions
• 54. VSCSI server and client architecture overview  Virtual SCSI is based on a client/server relationship.  The virtual I/O resources are assigned using an HMC.  Virtual SCSI enables sharing of adapters as well as disk devices.  Dynamic LPAR operations are allowed.  Dynamic mapping between physical and virtual resources on the Virtual I/O Server. [Diagram: VSCSI server adapters in the Virtual I/O Server partition exporting logical volumes from a physical disk (SCSI, FC) through the POWER Hypervisor to VSCSI client adapters in AIX and Linux client partitions]
• 55. Virtual devices  Are defined as LVs in the I/O server partition – Normal LV rules apply  Appear as real devices (hdisks) in the hosted partition  Can be manipulated with the Logical Volume Manager just like an ordinary physical disk  Can be used as a boot device and as a NIM target  Can be shared by multiple clients [Diagram: a logical volume on the Virtual I/O Server surfacing as a virtual disk (hdisk) in the client partition]
• 56. SCSI RDMA and Logical Remote Direct Memory Access  SCSI transport protocols define the rules for exchanging information between SCSI initiators and targets.  Virtual SCSI uses the SCSI RDMA Protocol (SRP). – SCSI initiators and targets can transfer information directly between their respective address spaces.  SCSI requests and responses are sent using the virtual SCSI adapters.  The actual data transfer, however, is done using the Logical Redirected DMA protocol. [Diagram: reliable command/response transport and Logical Remote Direct Memory Access between the client's VSCSI initiator and the Virtual I/O Server's VSCSI target through the POWER Hypervisor]
• 57. Virtual SCSI security  Only the owning partition has access to its data.  Data is copied directly from the PCI adapter to the client’s memory.
• 58. Performance considerations  Virtual SCSI I/O consumes about twice as many processor cycles as locally attached disk I/O (evenly distributed between the client partition and the Virtual I/O Server). – The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request. – For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice.  If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.  If not constrained by CPU performance, dedicated partition throughput is comparable to doing local I/O.  Because there is no caching in memory on the server I/O partition, its memory requirements should be modest.
• 59. Limitations  The hosting partition must be available before the hosted partition boots.  Virtual SCSI supports FC, parallel SCSI, and SCSI RAID.  Maximum of 65,535 virtual slots in the I/O server partition.  Maximum of 256 virtual slots on a single partition.  Support for all mandatory SCSI commands.  Not all optional SCSI commands are supported.
• 60. Implementation guideline  Partitions with high performance and disk I/O requirements are not recommended for implementing VSCSI.  Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a portion of a logical volume.  Boot disks for the operating system.  Web servers that will typically cache a lot of data.
• 61. LVM mirroring  This configuration protects virtual disks in a client partition against failure of: – One physical disk – One physical adapter – One Virtual I/O Server  Many possibilities exist to exploit this great function! [Diagram: the client partition LVM-mirroring two virtual disks, each backed by a different Virtual I/O Server with its own physical SCSI adapter and physical disk]
• 62. Multipath I/O  This configuration protects virtual disks in a client partition against: – Failure of one physical FC adapter in one I/O server – Failure of one Virtual I/O Server  The physical disk is assigned as a whole to the client partition.  Many possibilities exist to exploit this great function! [Diagram: two Virtual I/O Servers, each with a physical FC adapter to a SAN switch and a shared ESS disk, presenting the same disk to the client over two VSCSI paths]
• 63. Virtual LAN overview  Virtual network segments on top of physical switch devices.  All nodes in a VLAN can communicate without any L3 routing or inter-VLAN bridging.  VLANs provide: – Increased LAN security – Flexible network deployment over traditional network devices  VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation. – VLAN ID tagging of Ethernet frames – VLAN ID restricted switch ports [Diagram: three switches with nodes grouped into VLAN 1 and VLAN 2; traffic between the VLANs is blocked]
• 64. Virtual Ethernet  Enables inter-partition communication. – In-memory point-to-point connections  Physical network adapters are not needed.  Similar to high-bandwidth Ethernet connections.  Supports multiple protocols (IPv4, IPv6, and ICMP).  No Advanced POWER Virtualization feature required. – POWER5 systems – AIX 5L V5.3 or the appropriate Linux level – Hardware Management Console (HMC)
• 65. Virtual Ethernet connections  VLAN technology implementation – Partitions can only access data directed to them.  Virtual Ethernet switch provided by the POWER Hypervisor  Virtual LAN adapters appear to the OS as physical adapters – The MAC address is generated by the HMC.  1–3 Gb/s transmission speed – Support for large MTUs (~64 KB) on AIX.  Up to 256 virtual Ethernet adapters – Up to 18 VLANs each.  Bootable device support for NIM OS installations [Diagram: AIX and Linux partitions connected through virtual Ethernet adapters to the Hypervisor's virtual Ethernet switch]
• 66. Virtual Ethernet switch  Based on the IEEE 802.1Q VLAN standard – OSI layer 2 – Optional Virtual LAN ID (VID) – 4094 virtual LANs supported – Up to 18 VIDs per virtual LAN port  Switch configuration through the HMC
• 67. How it works [Flowchart: for each frame arriving from a virtual Ethernet adapter at a virtual VLAN switch port, the PHYP caches the source MAC; if the frame carries no IEEE VLAN header, one is inserted using the port's configured VLAN ID; otherwise the VLAN header is checked and the frame is dropped unless the port allows that VLAN; if the destination MAC is in the table and its entry matches the VLAN number, the frame is delivered; otherwise it is passed to the trunk adapter if one is defined, else the packet is dropped]
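Read as code, the per-frame decision is roughly the following. This is a deliberately simplified sketch of the flow above, with invented types; it is not the Hypervisor's implementation:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Port:
    pvid: int                    # default (port) VLAN ID
    allowed_vlans: set

@dataclass
class Frame:
    src_mac: str
    dst_mac: str
    vlan: Optional[int] = None   # None means no IEEE VLAN header

def forward(frame, in_port, mac_table, trunk_port=None):
    """Simplified virtual-switch forwarding decision for one frame."""
    mac_table[frame.src_mac] = in_port          # PHYP caches the source MAC
    if frame.vlan is None:
        frame.vlan = in_port.pvid               # untagged: insert a VLAN header
    elif frame.vlan not in in_port.allowed_vlans:
        return "drop"                           # tagged VID not allowed on this port
    out_port = mac_table.get(frame.dst_mac)
    if out_port is not None and frame.vlan in out_port.allowed_vlans:
        return "deliver"                        # known MAC on a matching VLAN
    if trunk_port is not None:
        return "pass to trunk adapter"          # unknown destination: bridge outward
    return "drop"
```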
• 68. Performance considerations  Virtual Ethernet performance – Throughput scales nearly linearly with the allocated capacity entitlement  Virtual LAN vs. Gigabit Ethernet throughput – The virtual Ethernet adapter has higher raw throughput at all MTU sizes – In-memory copy is more efficient at larger MTUs [Charts: TCP_STREAM throughput per 0.1 entitlement across CPU entitlements and MTU sizes (1500, 9000, 65394); VLAN vs. Gigabit Ethernet throughput, simplex and duplex]
• 69. Limitations  Virtual Ethernet can be used in both shared and dedicated processor partitions, provided the appropriate OS levels are installed.  A mixture of virtual Ethernet connections, real network adapters, or both is permitted within a partition.  Virtual Ethernet can only connect partitions within a single system.  A system’s processor load increases when virtual Ethernet is used.
• 70. Implementation guideline  Know your environment and the network traffic.  Choose a high MTU size where it makes sense for the traffic in the virtual LAN.  Use MTU size 65394 if you expect a large amount of data to be copied inside your virtual LAN.  Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394.  Do not turn off SMT.  No dedicated CPUs are required for virtual Ethernet performance.
• 71. Connecting virtual Ethernet to external networks  Routing – The partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server. [Diagram: two systems, each with an AIX partition owning a physical adapter and routing between the internal virtual Ethernet subnets and the external IP subnets through an IP router]
• 72. Shared Ethernet Adapter  Connects internal and external VLANs using one physical adapter.  SEA is a new service that acts as a layer 2 network switch. – Securely bridges network traffic from a virtual Ethernet adapter to a real network adapter  The SEA service runs in the Virtual I/O Server partition. – Advanced POWER Virtualization feature required – At least one physical Ethernet adapter required  No physical I/O slot or network adapter is required in the client partition.
• 73. Shared Ethernet Adapter (Cont.)  Virtual Ethernet MACs are visible to outside systems.  Broadcast/multicast is supported.  ARP (Address Resolution Protocol) and NDP (Neighbor Discovery Protocol) work across a shared Ethernet.  One SEA can be shared by multiple VLANs, and multiple subnets can connect using a single adapter on the Virtual I/O Server.  The virtual Ethernet adapter configured into the Shared Ethernet Adapter must have the trunk flag set. – The trunk virtual Ethernet adapter enables a layer 2 bridge to a physical adapter  IP fragmentation is performed, or an ICMP “packet too big” message is sent, when the Shared Ethernet Adapter receives IP (or IPv6) packets that are larger than the MTU of the adapter the packet is forwarded through.
• 74. Virtual Ethernet and Shared Ethernet Adapter security  VLAN (virtual local area network) tagging follows the IEEE 802.1Q standard.  The implementation of this VLAN standard ensures that the partitions have no access to foreign data.  Only the network adapters (virtual or physical) that are connected to a port (virtual or physical) belonging to the same VLAN can receive frames with that specific VLAN ID.
• 75. Performance considerations  Virtual I/O Server performance – Adapters stream data at media speed if the Virtual I/O Server has enough capacity entitlement. – CPU utilization per gigabit of throughput is higher with a Shared Ethernet Adapter. [Charts: TCP_STREAM throughput and normalized CPU utilization of the Virtual I/O Server at MTU 1500 and 9000, simplex and duplex]
• 76. Limitations  System processors are used for all communication functions, leading to a significant amount of system processor load.  One of the virtual adapters in the SEA on the Virtual I/O Server must be defined as the default adapter with a default PVID.  Up to 16 virtual Ethernet adapters, each with 18 VLANs, can be shared on a single physical network adapter.  Shared Ethernet Adapter requires: – The POWER Hypervisor component of POWER5 systems – AIX 5L Version 5.3 or an appropriate Linux level
• 77. Implementation guideline  Know your environment and the network traffic.  Use a dedicated network adapter if you expect heavy network traffic between the virtual Ethernet and local networks.  If possible, use dedicated CPUs for the Virtual I/O Server.  Choose an MTU size of 9000 if this makes sense for your network traffic.  Do not use Shared Ethernet Adapter functionality for latency-critical applications.  With MTU size 1500, you need about one CPU per Gigabit Ethernet adapter streaming at media speed.  With MTU size 9000, two Gigabit Ethernet adapters can stream at media speed per CPU.
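The last two rules of thumb can be folded into a rough sizing estimate. This is back-of-the-envelope arithmetic based only on the guideline above, not a benchmark:

```python
def sea_cpu_estimate(gigabit_adapters, mtu):
    """Rough CPUs needed for a Virtual I/O Server to drive Gigabit
    Ethernet adapters at media speed through a Shared Ethernet Adapter."""
    cpus_per_adapter = 1.0 if mtu == 1500 else 0.5   # MTU 9000: 2 adapters per CPU
    return gigabit_adapters * cpus_per_adapter

print(sea_cpu_estimate(4, mtu=1500))  # 4.0 CPUs
print(sea_cpu_estimate(4, mtu=9000))  # 2.0 CPUs
```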
• 78. Shared Ethernet Adapter configuration  The Virtual I/O Server is configured with at least one physical Ethernet adapter.  One Shared Ethernet Adapter can be shared by multiple VLANs.  Multiple subnets can connect using a single adapter on the Virtual I/O Server. [Diagram: AIX and Linux partitions on VLAN 1 and VLAN 2 bridged through the Virtual I/O Server's Shared Ethernet Adapter and physical adapter (ent0) to external AIX and Linux servers]
• 79. Multiple Shared Ethernet Adapter configuration  Maximizing throughput – Using several Shared Ethernet Adapters – More queues – More performance [Diagram: the same topology with two physical adapters (ent0, ent1), each backing its own Shared Ethernet Adapter for VLAN 1 and VLAN 2]
• 80. Multipath routing with dead gateway detection  This configuration protects your access to the external network against: – Failure of one physical network adapter in one I/O server – Failure of one Virtual I/O Server – Failure of one gateway [Diagram: an AIX partition with two default routes (to 9.3.5.10 via 9.3.5.12 and to 9.3.5.20 via 9.3.5.22), each through a Shared Ethernet Adapter in a different Virtual I/O Server and a different external gateway]
• 81. Shared Ethernet Adapter commands  Virtual I/O Server commands – lsdev -type adapter: lists all the virtual and physical adapters. – Choose the virtual Ethernet adapter to map to the physical Ethernet adapter. – Make sure the physical and virtual interfaces are unconfigured (down or detached). – mkvdev: maps the physical adapter to the virtual adapter, creates a layer 2 bridge, and defines the default virtual adapter with its default VLAN ID; it creates a new Ethernet interface (for example, ent5). – mktcpip: used for TCP/IP configuration on the new Ethernet interface (for example, ent5).  Client partition commands – No new commands are needed; the usual TCP/IP configuration is done on the virtual Ethernet interface defined in the client partition profile on the HMC.
• 82. Virtual SCSI commands  Virtual I/O Server commands – To map an LV: • mkvg: creates the volume group, in which a new LV is created using the mklv command. • lsdev: shows the virtual SCSI server adapters that could be used for mapping with the LV. • mkvdev: maps the virtual SCSI server adapter to the LV. • lsmap -all: shows the mapping information. – To map a physical disk: • lsdev: shows the virtual SCSI server adapters that could be used for mapping with a physical disk. • mkvdev: maps the virtual SCSI server adapter to a physical disk. • lsmap -all: shows the mapping information.  Client partition commands – No new commands needed; the typical device configuration uses the cfgmgr command.
• 83. Section Review Questions 1. Any technology improvement will boost the performance of any client solution. a. True b. False 2. Applying technology in a creative way to solve a client’s business problems is one definition of innovation. a. True b. False
• 84. Section Review Questions 3. A client’s satisfaction with your solution can be enhanced by which of the following? a. Setting expectations appropriately. b. Applying technology appropriately. c. Communicating the benefits of the technology to the client. d. All of the above.
• 85. Section Review Questions 4. Which of the following are available with the POWER5 architecture? a. Simultaneous Multi-Threading. b. Micro-Partitioning. c. Dynamic power management. d. All of the above.
• 86. Section Review Questions 5. Simultaneous Multi-Threading is the same as hyperthreading; IBM just gave it a different name. a. True. b. False.
• 87. Section Review Questions 6. In order to bridge network traffic between the virtual Ethernet and external networks, the Virtual I/O Server has to be configured with at least one physical Ethernet adapter. a. True. b. False.
• 88. Review Question Answers 1. b 2. a 3. d 4. d 5. b 6. a
• 89. Unit Summary  You should now be able to: – Describe the relationship between technology and solutions. – List the key IBM technologies that are part of the POWER5 products. – Describe the functional benefits that these technologies provide. – Discuss the appropriate use of these technologies.
• 90. Reference  You may find more information here: – AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading (IBM eServer pSeries white paper) – Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940 – IBM eServer p5 Virtualization: Performance Considerations, SG24-5768

Editor’s notes

  1. The pursuit of scientific discovery provides the basis for new technologies which can be incorporated into new and better products which can then enable clients to solve business problems. In this section we will look at some of the technologies that have been introduced recently with the intention of considering how these technologies may be taken into account in our solution design process. This section is divided into two parts and will take approximately two hours to complete. IBM's rich history of discovery and innovation has brought international recognition. In addition to five Nobel prizes, IBM researchers have been recognized with five U.S. National Medals of Technology, five National Medals of Science and 19 memberships in the National Academy of Sciences. IBM Research has more than 46 members of the National Academy of Engineering and well over 300 industry organization fellows.Over the years, we have received international recognition for our discoveries and produced 22,357 patents - nearly 7,000 more than the nearest competitor. But what's more important than the statistics is the effect these discoveries and patents are having in the marketplace -- and that's what really makes something innovative.Our ability to apply advanced technologies rapidly for our clients distinguishes IBM from all other companies. During the past ten years, our notable breakthroughs in technologies such as copper chips, Web caching, data mining and silicon germanium have helped our clients gain competitive advantage.Our continued innovation springs from our creative, dedicated people whose work continues to shape the future for our customers, the I/T industry and the world.
  2. What do you think of when you think of innovation? Can you provide examples from your own experience? How would you define innovation?
  3. That advances in technology can be applied to problems confronting our clients is not at issue; this is what we do! But consider the case where the technology that we provide fails to solve the problem to the client’s satisfaction. When this happens, what might be one of the possible causes of dissatisfaction? It has been demonstrated that the degree of benefit you get from applying a particular technology is directly related to its appropriateness to the situation. Amdahl’s Law shows this relationship. Secondly, the client may have unreasonable expectations. Setting expectations is certainly part of a successful solution design process, but for the purpose of this section we will focus more on the technologies that are available and consider what problems might be solved by them. We will also look at some of the possible misapplications of the technology and their consequences. Can you think of examples in your own experience where expectations were not met by the technology that was provided? Was the reason a failure of the technology, the expectations of the client, or both?
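The relationship can be made concrete with a minimal sketch of Amdahl's Law; the function name and sample numbers below are illustrative, not drawn from the course material:

```python
def amdahl_speedup(improved_fraction, improvement_factor):
    """Overall speedup when a technology improves only part of a workload.

    improved_fraction: share of total runtime the technology affects (0..1)
    improvement_factor: how much faster that share becomes (e.g., 2.0)
    """
    unaffected = 1.0 - improved_fraction
    return 1.0 / (unaffected + improved_fraction / improvement_factor)

# Doubling the speed of only 20% of a workload yields roughly an 11% overall
# gain, which is why appropriateness matters more than raw capability.
print(amdahl_speedup(0.2, 2.0))   # ~1.11
```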
  4. This chart shows the POWER4 and POWER5 chips. POWER4: 415 mm², 115 W @ 1.1 GHz, 156 W @ 1.3 GHz, 174M transistors. POWER4+: 267 mm², 75 W @ 1.2 GHz, 95 W @ 1.45 GHz, 125 W @ 1.7 GHz, 184M transistors. POWER5: 389 mm², 167 W @ 1.65 GHz, 276M transistors.
  5. Featuring single- and multithreaded execution, the POWER5 provides higher performance in the single-threaded mode than its POWER4 predecessor at equivalent frequencies. Enhancements include dynamic resource balancing to efficiently allocate system resources to each thread, software-controlled thread prioritization, and dynamic power management to reduce power consumption without affecting performance. The POWER5 processor supports the 64-bit PowerPC architecture. A single die contains two identical processor cores, each supporting two logical threads. This architecture makes the chip appear as a four-way symmetric multiprocessor to the operating system. Each processor core has a separate 64 KB L1 instruction cache and a 32 KB L1 data cache. The L1 cache is shared by the two hardware threads of the processor core. Both the processor cores in a chip share a 1.88 MB unified L2 cache. The processor chip houses an L3 cache controller, which provides for an L3 cache directory on the chip. However, the L3 cache itself is on a separate Merged Logic DRAM (MLD) cache chip. The L3 is a 36 MB victim cache of the L2 cache. The L3 cache is shared by both the processor cores of the POWER5 chip. Needless to say, the L2 and L3 caches are shared by all the hardware threads of both processor cores on the chip. Unlike POWER4, which was specifically aimed at high-end server applications, design features of POWER5 are targeted at a broad range of applications from low-end 1-2-way servers to high-end 64-way super-servers. SMPLink is a very low-latency, switchless interconnect technology that allows nodes to be interconnected as flat SMPs. The actual SMPLink ports come directly off of the POWER5 chip. When connected, the SMPLinks provide a direct path between each POWER5 chip. With the introduction of SMT, more instructions execute per cycle per processor core, thus increasing the core’s and the chip’s total switching power. The POWER5 was designed to maintain both binary and structural compatibility with existing POWER4 systems to ensure that binaries continue executing properly and all application optimizations carry forward to newer systems. The rest of the improvements and new features, such as enhancements to the memory subsystem and SMT, are discussed on later charts.
  6. The L1 instruction cache is 2-way set associative with LRU (Least Recently Used) replacement policy. The L1 instruction cache is also kept coherent with the L2 cache. The L1 data cache is 4-way set associative with LRU replacement policy. The L1 data cache is a store-through design. It never holds modified data. The POWER5 L2 cache is accessed by both cores of the chip. It maintains full hardware coherence within the system and can supply intervention data to cores on other POWER5 chips. L2 is an in-line cache, unlike L1s, which are store-through. It is fully inclusive of the two L1 data caches and L1 instruction caches (one L1 data and instruction cache per core). The 1.88 MB (1,920 KB) L2 is physically implemented in three slices, each 640 KB in size. Each of these three slices has its own L2 cache controller. Either processor core of the chip can independently access each L2 controller. The L2 slices are 10-way set-associative. 10-way set associativity (vs. 8-way on POWER4) helps to reduce cache contention by allowing more potential storage locations for a given cache line. L3 is a unified 36 MB cache accessed by both cores on the POWER5 processor chip. It maintains full hardware coherence with the system and can supply intervention data to cores on other POWER5 processor chips. Logically, L3 is an inline cache. Actually, L3 is a victim cache of the L2 - that is, all valid cache lines evicted out of the L2 due to associativity (victimized) will be cast out to L3. The L3 is not inclusive of L2; the same line will never reside in both L2 and L3 at the same time. The L3 cache is implemented off-chip as a separate MLD cache chip, but its directory is on the processor chip itself. This helps the processor check the directory after an L2 miss without experiencing off-chip delays. The L3 cache in POWER5 is on the processor side and not on the memory side of the fabric as in POWER4. This is well depicted in the previous chart. This design lets the POWER5 satisfy L2 cache misses more frequently, with hits in the off-chip 36 MB MLD L3, thus avoiding traffic on the interchip fabric. References to data not in the on-chip L2 cause the system to check the L3 cache before sending requests onto the interchip fabric. The memory controller is also on the POWER5 chip and helps to reduce memory latencies by eliminating driver and receiver delays to an external controller.
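To make the associativity figures tangible, here is a small illustrative calculation of cache geometry. The 128-byte cache line size is an assumption adopted for the example and is not stated in this material:

```python
def cache_sets(total_bytes, ways, line_bytes=128):
    """Number of congruence classes (sets) in a set-associative cache.

    line_bytes is an assumed value, used here for illustration only.
    """
    return total_bytes // (ways * line_bytes)

# One 640 KB L2 slice, 10-way set associative, assuming 128-byte lines:
print(cache_sets(640 * 1024, 10))   # 512 sets per slice
```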
  7. The figure shows the high-level structures of POWER4- and POWER5-based systems. The POWER4 handles up to a 32-way symmetric multiprocessor. Going beyond 32 processors increases interprocessor communication, resulting in high traffic on the interconnection fabric. This can cause greater contention and negatively affect system scalability. Moving the level-three (L3) cache from the memory side to the processor side of the fabric allows POWER5 to satisfy level-two (L2) cache misses more frequently, with hits in the 36 MB off-chip L3 cache, avoiding traffic on the interchip fabric. References to data not resident in the on-chip L2 cache cause the system to check the L3 cache before sending requests on to the interconnection fabric. Moving the L3 cache provides significantly more cache on the processor side than previously available, thus reducing traffic on the fabric and allowing POWER5-based systems to scale to higher levels of symmetric multiprocessing. Initial POWER5 systems support 64 physical processors. The POWER4 includes a 1.41 MB on-chip L2 cache. POWER4+ chips are similar in design to the POWER4, but are fabricated in 130 nm technology rather than the POWER4’s 180 nm technology. The POWER4+ includes a 1.5 MB on-chip L2 cache, whereas the POWER5 supports a 1.875 MB on-chip L2 cache. POWER4 and POWER4+ systems both have 32 MB L3 caches, whereas POWER5 systems have a 36 MB L3 cache. The L3 cache operates as a backdoor with separate buses for reads and writes that operate at half processor speed. In POWER4 and POWER4+ systems, the L3 was an inline cache for data retrieved from memory. Because of the higher transistor density of the POWER5’s 130 nm technology, the memory controller was moved on chip, eliminating a chip previously needed for the memory controller function. These two changes in the POWER5 also have the significant side benefits of reducing latency to the L3 cache and main memory, as well as reducing the number of chips necessary to build a system.
  8. Simultaneous Multi-Threading is a new technology that is part of the POWER5 architecture. You need to know how it works and what benefits it can provide to your clients. It is not a cure-all! Being able to articulate the advantages clearly is one part of understanding it; being able to set clients’ expectations appropriately is another. In this topic we will discuss the evolution of SMT, its function, and some guidelines for appropriate use in solution design.
  9. The POWER4 microprocessor is a high-frequency, speculative superscalar machine with out-of-order instruction execution capabilities. Eight independent execution units are capable of executing instructions in parallel, providing a significant performance attribute known as superscalar execution. These include two identical floating-point execution units, each capable of completing a multiply/add instruction each cycle (for a total of four floating-point operations per cycle), two load-store execution units, two fixed-point execution units, a branch execution unit, and a conditional register unit used to perform logical operations on the condition register. To keep these execution units supplied with work, each processor can fetch up to eight instructions per cycle and can dispatch and complete instructions at a rate of up to five per cycle. A processor is capable of tracking over 200 instructions in-flight at any point in time. Instructions may issue and execute out-of-order with respect to the initial instruction stream, but are carefully tracked so as to complete in program order. In addition, instructions may execute speculatively to improve performance when accurate predictions can be made about conditional scenarios. The figure in this chart depicts the POWER4 processor execution pipeline. The deeply pipelined structure of the machine’s design is shown. Each small box represents a stage of the pipeline (a stage is the logic that is performed in a single processor cycle). Note that there is a common pipeline which first handles instruction fetching and group formation, and this then divides into four different pipelines corresponding to four of the five types of execution units in the machine (the CR execution unit, which is similar to the fixed-point execution unit, is not shown). All pipelines have a common termination stage, which is the group completion (CP) stage. Instruction fetch, group formation, and dispatch: The instructions that make up a program are read in from storage and are executed by the processor. During each cycle, up to eight instructions may be fetched from cache according to the address in the instruction fetch address register (IFAR), and the fetched instructions are scanned for branches (corresponding to the IF, IC, and BP stages in the figure). Since instructions may be executed out of order, it is necessary to keep track of the program order of all instructions in-flight. In the POWER4 microprocessor, instructions are tracked in groups of one to five instructions rather than as individual instructions. Groups are formed in the pipeline stages D0, D1, D2, and D3. This requires breaking some of the more complex PowerPC instructions down into two or more simpler instructions.
  10. Modern processors have multiple specialized execution units, each of which is capable of handling a small subset of the instruction set architecture – some will handle integer operations, some floating point, and so on. These execution units are capable of operating in parallel, and so several instructions of a program may be executing simultaneously. However, conventional processors execute instructions from a single instruction stream. Despite microarchitectural advances, execution unit utilization remains low in today’s microprocessors. It is not unusual to see average execution unit utilization rates of approximately 25% across a broad spectrum of environments. To increase execution unit utilization, designers use thread-level parallelism, in which the physical processor core executes instructions from more than one instruction stream. To the operating system, the physical processor core appears as if it is a symmetric multiprocessor containing two logical processors. There are at least three different methods for handling multiple threads: coarse-grained multi-threading, fine-grained multi-threading, and simultaneous multi-threading (SMT). Let's take a look at these methods.
  11. In coarse-grained multi-threading, only one thread executes at any instance. When a thread encounters a long-latency event, such as a cache miss, the hardware swaps in a second thread to use the machine’s resources, rather than letting the machine remain idle. By allowing other work to use what otherwise would be idle cycles, this scheme increases overall system throughput. To conserve resources, both threads share many system resources, such as architectural registers. Hence, swapping program control from one thread to another requires several cycles. IBM implemented coarse-grained multi-threading in the IBM pSeries Model 680.
  12. Coarse-grained multi-threading, also known as hardware multi-threading (HMT), was introduced in IBM’s Star series of processors (for example, the RS64-IV, available in the S85) to improve system performance for many workloads. A multi-threaded processor improves the resource utilization of a processor core by running several hardware threads in parallel. For the Star series, the number of concurrent threads was two. The basic idea is that when one or more threads of a processor are stalled on a long latency event (for example, waiting on a cache miss), other threads try to keep the core busy. However, AIX needed to be aware of the difference between logical and physical processors and had the responsibility for making sure that each logical processor had a dispatchable thread - even to the point of creating idle threads. Note that coarse-grained multi-threading was never widely used by customers. Partly this was due to the fact that it was not enabled by default and required a reboot to activate it. Another reason was that performance was variable and could, in fact, have a negative impact. For workloads with high thread:processor ratios (for example, TPC-C), HMT can deliver ~20% increased performance. In other workloads, for example, Business Intelligence, where the thread:processor ratio is <2:1, AIX must create dummy threads for the processor context switch to take place. Switching to/from these dummy threads costs about six machine cycles, whereas without coarse-grained multi-threading being active, AIX would not have performed a context switch at all. The other disadvantage of coarse-grained multi-threading was that it disabled Dynamic CPU Deallocation.
  13. A variant of coarse-grained multi-threading is fine-grained multi-threading. Machines of this class execute threads in successive cycles, in round-robin fashion. Accommodating this design requires duplicate hardware facilities. When a thread encounters a long-latency event, its cycles remain unused. POWER4 processors implemented an SMP on a chip, but this is not considered fine-grained multi-threading.
  14. The POWER5 processor core supports both enhanced SMT and single-threaded (ST) operation modes. This chart shows the POWER5’s instruction pipeline, which is identical to the POWER4’s. All pipeline latencies in the POWER5, including the branch misprediction penalty and load-to-use latency with an L1 data cache hit, are the same as in the POWER4. The identical pipeline structure lets optimizations designed for POWER4-based systems perform equally well on POWER5-based systems. In SMT mode, the POWER5 uses two separate instruction fetch address registers to store the program counters for the two threads. Instruction fetches (IF stage) alternate between the two threads. In ST mode, the POWER5 uses only one program counter and can fetch instructions for that thread every cycle. It can fetch up to eight instructions from the instruction cache (IC stage) every cycle. The two threads share the instruction cache and the instruction translation facility. In a given cycle, all fetched instructions come from the same thread. Some differences are: There are 120 physical general purpose registers (GPRs) and 120 physical floating-point registers (FPRs). In single-threaded operation, the POWER5 makes all physical registers available to the single thread, allowing higher instruction-level parallelism. Two groups can commit per cycle, one from each thread. The L1 instruction and data caches are the same size as in the POWER4 (64 KB and 32 KB), but their associativity has doubled to two- and four-way. The first-level data translation table is now fully associative, but the size remains at 128 entries.
  15. In simultaneous multi-threading (SMT), as in other multithreaded implementations, the processor fetches instructions from more than one thread. What differentiates this implementation is its ability to schedule instructions for execution from all threads concurrently. With SMT, the system dynamically adjusts to the environment, allowing instructions to execute from each thread if possible, and allowing instructions from one thread to utilize all the execution units if the other thread encounters a long latency event. The POWER5 design implements two-way SMT on each of the chip’s two processor cores. Although a higher level of multi-threading is possible, our simulations showed that the added complexity was unjustified. As designers add simultaneous threads to a single physical processor, the marginal performance benefit decreases. In fact, additional multi-threading might decrease performance because of cache thrashing, as data from one thread displaces data needed by another thread.
  16. Which Workloads are Likely to Benefit From Simultaneous Multi-threading? This is a very difficult question to answer, because the performance benefit of simultaneous multi-threading is workload dependent. Most measurements of commercial workloads have shown a 25-40% boost, and a few have been even greater. These measurements were taken in a dedicated partition. Simultaneous multi-threading is also expected to help shared processor partitions. The extra threads give the partition a boost after it is dispatched, because they enable the partition to recover its working set more quickly. Subsequently, they perform like they would in a dedicated partition. It may be somewhat non-intuitive, but simultaneous multi-threading is at its best when the performance of the cache is at its worst. The question may also be answered with the following generalities. Any workload where the majority of individual software threads highly utilize any resource in the processor or memory will benefit little from simultaneous multi-threading. For example, workloads that are heavily floating-point intensive are likely to gain little from simultaneous multi-threading and are the ones most likely to lose performance. They tend to heavily utilize either the floating-point units or the memory bandwidth, while workloads that have a very high Cycles Per Instruction (CPI) count tend to utilize processor and memory resources poorly and usually see the greatest simultaneous multi-threading benefit. These large CPIs are usually caused by high cache miss rates from a very large working set. Large commercial workloads typically have this characteristic, although it is somewhat dependent upon whether the two hardware threads share instructions or data or are completely distinct. Workloads that share instructions or data, which would include those that run a lot in the operating system or within a single application, tend to have better SMT benefits. Workloads with low CPI and low cache miss rates tend to see a benefit, but a smaller one.
  17. The objective of dynamic resource balancing is to ensure that the two threads executing on the same processor flow smoothly through the system. Dynamic resource-balancing logic monitors resources such as the global completion table (GCT) and the load miss queue to determine if one thread is hogging resources. For example, if one thread encounters multiple L2 cache load misses, dependent instructions can back up in the issue queues, preventing additional groups from dispatching and slowing down the other thread. To prevent this, resource-balancing logic detects that a thread has reached a threshold of L2 cache misses and throttles that thread. The other thread can then flow through the machine without encountering congestion from the stalled thread. The POWER5 resource balancing logic also monitors how many GCT entries each thread is using. If one thread starts to use too many GCT entries, the resource balancing logic throttles it back to prevent its blocking the other thread. Depending on the situation, the POWER5 resource-balancing logic has three thread-throttling mechanisms: reducing the thread’s priority; inhibiting the thread’s instruction decoding until the congestion clears; or flushing all the thread’s instructions that are waiting for dispatch and holding the thread’s decoding until the congestion clears.
  18. Adjustable thread priority lets software determine when one thread should have a greater (or lesser) share of execution resources. (All software layers — operating systems, middleware, and applications — can set the thread priority. Some priority levels are reserved for setting by a privileged instruction only.) Reasons for choosing an imbalanced thread priority include the following: A thread is in a spin loop waiting for a lock. A thread has no immediate work to do and is waiting in an idle loop. One application must run faster than another. The POWER5 microprocessor supports eight software-controlled priority levels for each thread. Level 0 is in effect when a thread is not running. Levels 1 (the lowest) through 7 apply to running threads. The POWER5 chip observes the difference in priority levels between the two threads and gives the one with higher priority additional decode cycles. The figure shows how the difference in thread priority affects the relative performance of each thread. If both threads are at the lowest running priority (level 1), the microprocessor assumes that neither thread is doing meaningful work and throttles the decode rate to conserve power.
  19. Not all applications benefit from SMT. Having two threads executing on the same processor will not increase the performance of applications with execution-unit-limited performance or applications that consume all the chip’s memory bandwidth. For this reason, the POWER5 supports the ST execution mode. In this mode, the POWER5 gives all the physical resources, including the GPR and FPR rename pools, to the active thread, allowing it to achieve higher performance than a POWER4 system at equivalent frequencies. The POWER5 supports two types of single-threaded operation: an inactive thread can be in either a dormant or a null state. From a hardware perspective, the only difference between these states is whether or not the thread awakens on an external or decrementer interrupt. In the dormant state, the operating system boots up in SMT mode, but instructs the hardware to put the thread into the dormant state when there is no work for that thread. To make a dormant thread active, either the active thread executes a special instruction or an external or decrementer interrupt targets the dormant thread. The hardware detects these scenarios and changes the dormant thread to the active state. It is software’s responsibility to restore the architected state of a thread transitioning from the dormant to the active state. When a thread is in the null state, the operating system is unaware of the thread’s existence. As in the dormant state, the operating system does not allocate resources to a null thread. This mode is advantageous if all the system’s executing tasks perform better in ST mode.
  20. Micro-partitioning is a mainframe-inspired technology that is based on two major advances in the area of server virtualization. Physical processors and I/O devices have been virtualized, enabling these resources to be shared by multiple partitions. There are several advantages associated with this technology, including finer-grained resource allocations, more partitions, and higher resource utilization. The virtualization of processors requires a new partitioning model, since it is fundamentally different from the partitioning model used on POWER4 processor-based servers, where whole processors are assigned to partitions. These processors are owned by the partition and are not easily shared with other partitions. They may be assigned through manual dynamic logical partitioning (LPAR) procedures. In the new micro-partitioning model, physical processors are abstracted into virtual processors, which are assigned to partitions. These virtual processor objects cannot be shared, but the underlying physical processors are shared, since they are used to actualize virtual processors at the platform level. This sharing is the primary feature of this new partitioning model, and it happens automatically. Note that the virtual processor abstraction is implemented in the hardware and the POWER Hypervisor, a component of firmware. From an operating system perspective, a virtual processor is indistinguishable from a physical processor, unless the operating system has been enhanced to be aware of the difference. The key benefit of implementing partitioning in the hardware/firmware is to allow any operating system to run on POWER5 technology with little or no change. Optionally, for optimal performance, the operating system can be enhanced to exploit micro-partitioning in more depth, for example, by voluntarily relinquishing CPU cycles to the POWER Hypervisor when they are not needed. AIX 5L V5.3 is the first version of AIX 5L that includes such enhancements. The system administrator defines the number of virtual processors that may be utilized by a partition as well as the actual physical processor capacity that should be applied to actualize those virtual processors. The system administrator may specify that a fraction of a physical processor be applied to a partition, enabling fractional processor capacity partitions to be created.
  21. The diagram in this chart shows the relationships and new concepts in the Micro-Partitioning processor terminology used in this presentation. Virtual processors: These are the whole number of concurrent operations that the operating system can use on a partition. The processing power can be conceptualized as being spread equally across these virtual processors. Selecting the optimal number of virtual processors depends on the workload in the partition. Some partitions benefit from greater concurrency, while other partitions require greater power. The maximum number of virtual processors per partition is 64. Dedicated processors: Dedicated processors are whole processors that are assigned to a single partition. If you choose to assign dedicated processors to a logical partition, you must assign at least one processor to that partition. By default, a powered-off logical partition using dedicated processors will have its processors available to the shared processing pool. When the processors are in the shared processing pool, an uncapped partition that needs more processing power can use the idle processing resources. However, when you power on the dedicated partition while the uncapped partition is using the processors, the activated partition will regain all of its processing resources. If you want to prevent dedicated processors from being used in the shared processing pool, you can disable this function using the logical partition profile properties panels on the Hardware Management Console. Shared processor pool: The POWER Hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions. Deconfigured processor: This is a failing processor left outside the system’s configuration after a dynamic processor deallocation has occurred.
  22. Micro-partitioning allows for multiple partitions to share one physical processor. A partition may be defined with a processor capacity as small as 10 processor units. This represents 1/10 of a physical processor. Each processor can be shared by up to 10 shared processor partitions. The shared processor partitions are dispatched and time-sliced on the physical processors under control of the POWER Hypervisor. Micro-partitioning is supported across the entire POWER5 product line from the entry to the high-end systems. Shared processor partitions still need dedicated memory, but the partition’s I/O requirements can be supported through Virtual Ethernet and Virtual SCSI Server. Utilizing all virtualization features, support for up to 254 shared processor partitions is possible. The shared processor partitions are created and managed by the HMC. When you start creating a partition, you have to choose between a shared processor partition and a dedicated processor partition. When setting up a partition, you have to define the resources that belong to the partition, such as memory and I/O resources. For shared processor partitions, you have to specify the following partition attributes that are used to define the dimensions and performance characteristics of shared partitions: minimum, desired, and maximum processor capacity; minimum, desired, and maximum number of virtual processors; capped or uncapped; and variable capacity weight.
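The partition attributes just listed can be pictured as a simple record. The Python dataclass below is a hypothetical illustration of the shape of a shared-partition definition, not an actual HMC or firmware data structure:

```python
from dataclasses import dataclass

@dataclass
class SharedPartitionProfile:
    """Hypothetical sketch of the attributes of a shared processor partition."""
    # Processor capacity in processing units (1.0 = one physical processor)
    min_capacity: float
    desired_capacity: float
    max_capacity: float
    # Virtual processors presented to the operating system
    min_vps: int
    desired_vps: int
    max_vps: int
    capped: bool
    variable_capacity_weight: int = 128   # 0-255; used by uncapped partitions

# An uncapped partition entitled to roughly one physical processor:
web_partition = SharedPartitionProfile(0.5, 1.0, 2.0, 1, 2, 4, capped=False)
```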
  23. Processor capacity attributes are specified in terms of processing units. 1.0 processing unit represents one physical processor. 1.5 processing units is equivalent to one and a half physical processors. For example, a shared processor partition with 2.2 processing units has the equivalent power of 2.2 physical processors. Processor units are also used; they represent the processor percentage allocated to a partition. One processor unit represents one percent of one physical processor. One hundred processor units is equivalent to one physical processor. Shared processor partitions may be defined with a processor capacity as small as 1/10 of a physical processor. A maximum of 10 partitions may be started for each physical processor in the platform. A maximum of 254 partitions may be active at the same time. When a partition is started, the system chooses the partition’s entitled processor capacity from the specified capacity range. The value that is chosen represents a commitment of capacity that is reserved for the partition. This capacity cannot be used to start another shared partition; otherwise, capacity could be overcommitted. Preference is given to the desired value, but these values cannot always be used, because there may not be enough unassigned capacity in the system. In that event, a different value is chosen, which must be greater than or equal to the minimum capacity attribute. Otherwise, the partition cannot be started. The same basic process applies for selecting the number of online virtual processors with the extra restriction that each virtual processor must be granted at least 1/10 of a processing unit of entitlement. In this way, the entitled processor capacity may affect the number of virtual processors that are automatically brought online by the system during boot. The maximum number of virtual processors per partition is 64. The POWER Hypervisor saves and restores all necessary processor states, when preempting or dispatching virtual processors, which for simultaneous multi-threading-enabled processors means two active thread contexts. The result for shared processors is that two of the logical CPUs will always be scheduled in a physical sense together. These sibling threads are always scheduled in the same partition.
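A minimal sketch, assuming the limits described above (at least 1/10 of a processing unit per virtual processor, at most 64 virtual processors per partition), of how such a configuration might be validated; the function name is illustrative:

```python
def validate_entitlement(entitlement, online_vps):
    """Check a shared-partition configuration against the stated limits."""
    if online_vps > 64:
        raise ValueError("at most 64 virtual processors per partition")
    if entitlement / online_vps < 0.1:
        raise ValueError("each virtual processor needs at least 0.1 units")
    return True

validate_entitlement(2.2, 4)     # fine: 0.55 processing units per VP
# validate_entitlement(0.5, 6)   # would raise: only ~0.083 units per VP
```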
  24. A capped partition is not allowed to exceed its capacity entitlement, while an uncapped partition is. In fact, it may exceed its maximum processor capacity. An uncapped partition is only limited in its ability to consume cycles by the lack of online virtual processors and its variable capacity weight attribute. The variable capacity weight attribute is a number between 0 and 255, which represents the relative share of extra capacity that the partition is eligible to receive. This parameter applies only to uncapped partitions. A partition’s share is computed by dividing its variable capacity weight by the sum of the variable capacity weights for all uncapped partitions. Therefore, a value of 0 may be used to prevent a partition from receiving extra capacity. This is sometimes referred to as a “soft cap”. There is overhead associated with the maintenance of online virtual processors, so clients should carefully consider their capacity requirements before choosing values for these attributes. In general, the value of the minimum, desired, and maximum virtual processor attributes should parallel those of the minimum, desired, and maximum capacity attributes in some fashion. A special allowance should be made for uncapped partitions, since they are allowed to consume more than their entitlement. If the partition is uncapped, then the administrator may want to define the desired and maximum virtual processor attributes x% above the corresponding entitlement attributes. The exact percentage is installation specific, but 25-50% seems like a reasonable number.
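The share computation described above divides a partition's variable capacity weight by the sum of the weights of all uncapped partitions. A small sketch with illustrative partition names and weights:

```python
def excess_capacity_shares(weights, spare_units):
    """Split spare processing units among uncapped partitions by weight."""
    total_weight = sum(weights.values())
    return {name: spare_units * w / total_weight for name, w in weights.items()}

# Three uncapped partitions competing for 2.0 spare processing units.
# A weight of 0 acts as a "soft cap": that partition receives nothing extra.
print(excess_capacity_shares({"prod": 192, "test": 64, "batch": 0}, 2.0))
# {'prod': 1.5, 'test': 0.5, 'batch': 0.0}
```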
  25. The following sequence of charts shows the relationship between the different parameters used for controlling processor capacity attributes for a partition. In the example, the size of the shared pool is fixed – as is the capacity entitlement for the partition in which the workload is running. No other partitions are active – this allows the example workload to use all available resource and means that we are ignoring the effects of Capacity Weights.
  26. This is the baseline for our example. The partition is configured to have 16 virtual processors and is uncapped. Assuming, as we are, that there are no other partitions active, then this workload can use all 16 real processors in the pool. Note that the partition could have more than 16 virtual processors allocated. If that were the case, then all virtual processors would be scheduled and would be time-sliced across the available real processors. We’ll discuss scheduling in detail later. The dark area shows the number of available virtual processors. The lighter area shows the total amount of CPU resource being consumed. The workload completes in 26 minutes.
  27. This is exactly the same workload as before and uses exactly the same total amount of CPU resource. However, the number of virtual processors has been reduced to 12. Consequently, the workload is limited to using the equivalent of 12 real processor’s worth of power, that is, a virtual processor cannot use more than one real processor’s worth of power. Because of the reduced amount of CPU power available within any given time interval, the workload now requires 27 minutes to complete.
  28. Exactly the same workload as before. Now, however, the partition is capped. For the first time, the capacity entitlement becomes effective and the total amount of resource available within any given time interval (actually, every 10 ms) is limited to 9.5 processing units, that is, the equivalent of having 9.5 real processor’s worth of power. Note that all 12 of the virtual processors are being dispatched, but the scheduling algorithm in the POWER Hypervisor limits the amount of time each can be executing. The workload now requires 28 minutes to complete.
  29. One of the advantages of the shared processor architecture is that processor capacity can be changed without impacting applications or middleware. This is accomplished by modifying the entitled capacity or the variable capacity weight of the partition; however, the ability of the partition to utilize this extra capacity is restricted by the number of online virtual processors, so the user may have to increase this number in some cases to take advantage of the extra capacity. The main restriction here is that the CE per VP must remain greater than 0.1. The variable capacity weight parameter applies to uncapped partitions. It controls the ability of the partition to receive cycles beyond its entitlement, which is dependent on there being unutilized capacity at the platform level. The client may want to modify this parameter if a partition is getting too much processing capacity or not enough. Real processors can, of course, only be added or removed from the shared pool itself. If you recall the discussion on defining a partition, you will realize that removal of a processor from the shared pool may mean that the POWER Hypervisor can no longer guarantee CE for all active partitions. Before the DLPAR operation can be honored then, it may be necessary to reduce the CE for some, or all, of the active partitions. Dynamic memory addition and removal is also supported. The only change in this area is that the size of the logical memory block (LMB) has been reduced from 256 MB to 16 MB to allow for thinner partitions. There is no impact associated with these changes. The new LMB size applies to dedicated partitions also. The size of the LMB can be set at the service console. Notification of changes to these parameters will be provided so that applications, such as license managers, performance analysis tools, and high-level schedulers, can monitor and control the allocation and use of system resources in shared processor partitions. This may be accomplished through scripts, APIs, or kernel services. Other DLPAR operations perform as expected.
  30. Allocate processors, memory, and I/O to create virtual servers (minimum 128 MB memory, one CPU, one PCI-X adapter slot). All resources can be allocated independently, and resources can be moved between live partitions, with applications notified of configuration changes. Movement can be automated using the Partition Load Manager. Works with AIX 5.2+ or Linux 2.4+.
  31. This section provides a description of the new POWER Hypervisor.
  32. A major feature of the new POWER5 machines is a new, active Hypervisor that represents a convergence with iSeries systems. iSeries and pSeries machines will now have a common Hypervisor and common functionality, which will mean reduced development effort and faster time to market for new functions. However, each brand will retain a unique value proposition. New functions provided for pSeries are Shared Processor Partitions and Virtual I/O. Both of these have been available for iSeries on POWER4 systems, and pSeries gets the benefit of using tried and tested microcode to implement these functions on POWER5. iSeries benefits from the POWER Hypervisor convergence as well and gains the ability to run AIX in an LPAR (rather than the more limited PASE environment available today). There are some restrictions for the AIX environment on iSeries (for example, device support), and the primary reason for offering this function is to broaden the range of software applications available to iSeries customers.
  33. This is a simplified diagram showing the sourcing of different elements in the converged POWER Hypervisor. The blue boxes show functions that have been sourced either directly from the existing pSeries POWER4 Hypervisor or from the pSeries architecture. Purple boxes (lighter shading) show those sourced directly from the iSeries SLIC (System Licensed Internal Code) – which is part of OS/400. Some boxes are gradated, and these represent functions that combine elements of the pSeries and iSeries implementation models.
  34. The POWER Hypervisor provides the same basic functions as the POWER4 Hypervisor, plus some new functions designed for shared processor LPARs and virtual I/O. Combined with features designed into the POWER5 processor, the POWER Hypervisor delivers functions that enable other system technologies, including micro-partitioning, virtualized processors, IEEE VLAN compatible virtual switch, virtual SCSI adapters, and virtual consoles. The POWER Hypervisor is a component of the system’s firmware that will always be installed and activated, regardless of system configuration. It operates as a hidden partition, with no entitled capacity assigned to it. Newly architected Hypervisor calls (hcalls) provide a means for the operating system to communicate with the POWER Hypervisor, allowing more efficient usage of physical processor capacity by supporting the scheduling heuristic of minimizing idle time. The POWER Hypervisor is a key component to the functions shown in the chart. It performs the following tasks: Provides an abstraction layer between the physical hardware resources and the logical partitions using them Enforces partition integrity by providing a security layer between logical partitions Controls the dispatch of virtual processors to physical processors Saves and restores all processor state information during logical processor context switch Controls hardware I/O interrupts management facilities for logical partitions
  35. The POWER4 processor introduced support for logical partitioning with a new privileged processor state called Hypervisor mode. It is accessed via a Hypervisor call function, which is generated by the operating system kernel running in a partition. Hypervisor mode allows for a secure mode of operation that is required for various system functions where logical partition integrity and security are required. The Hypervisor validates that the partition has ownership of the resources it is attempting to access, such as processor, memory, and I/O, then completes the function. This mechanism allows for complete isolation of partition resources. In the POWER5 processor, further design enhancements are introduced that enable the sharing of processors by multiple partitions. The Hypervisor decrementer (HDECR) is a new hardware facility in the POWER5 design that provides the POWER Hypervisor with a timed interrupt independent of partition activity. HDECR interrupts are routed directly to the POWER Hypervisor, and use only POWER Hypervisor resources to capture state information from the partition. The HDECR is used for fine grained dispatching of multiple partitions on shared processors. It also provides a means for the POWER Hypervisor to dispatch physical processor resources for its own execution. With the addition of shared partitions and SMT, a mechanism was required to track physical processor resource utilization at a processor thread level. System architecture for POWER5 introduces a new register called the processor utilization resource register (PURR) to accomplish this. It provides the partition with an accurate cycle count to measure activity during timeslices dispatched on a physical processor. The PURR is a POWER Hypervisor resource, assigned one per processor thread, that is incremented at a fixed rate whenever the thread running on a virtual processor is dispatched on a physical processor.
  36. Multiple logical partitions configured to run with a pool of shared physical processors require a robust mechanism to guarantee the distribution of available processing cycles. The POWER Hypervisor manages this task in the POWER5 processor-based servers. Each Micro-partition is configured with a specific processor entitlement, based on a quantity of processing units, which is referred to as the partition’s entitled capacity or capacity entitlement (CE). The entitled capacity, along with a defined number of virtual processors, defines the physical processor resource that will be allotted to the partition. The POWER Hypervisor uses the POWER5 HDECR, which is programmed to generate an interrupt every 10 ms, as a timing mechanism for controlling the dispatch of physical processors to system partitions. Each virtual processor is guaranteed to get its entitled share of processor cycles during each 10 ms dispatch window. The minimum amount of resource that the POWER Hypervisor will allocate to a virtual processor, within a dispatch cycle, is 1 ms of execution time per VP. This gives rise to the current restriction of 10 Micro-Partitions per physical processor. The POWER Hypervisor calculates the amount of time each VP will execute by reference to the CE (as shown on the slide). Note that the calculation for uncapped partitions is more complicated and involves their capacity weight and depends on there being unused capacity available. The amount of time that a virtual processor runs before it is timesliced is based on the partition entitlement, which is specified indirectly by the system administrator. The partition entitlement is evenly distributed amongst the online virtual processors, so the number of online virtual processors impacts the length of each virtual processor’s dispatch cycle. The POWER Hypervisor uses the architectural metaphor of a “dispatch wheel” with a fixed rotation period of 10 milliseconds to guarantee that each virtual processor receives its share of the entitlement in a timely fashion. Virtual processors are time sliced through the use of the hardware decrementer much like the operating system time slices threads. In general, the POWER Hypervisor uses a very simple scheduling model. The basic idea is that processor entitlement is distributed with each turn of the POWER Hypervisor’s dispatch wheel, so each partition is guaranteed a relatively constant stream of service.
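The per-window arithmetic can be sketched as follows; this is an illustration of the stated rules, not Hypervisor source code:

```python
DISPATCH_WINDOW_MS = 10   # rotation period of the dispatch wheel
MIN_SLICE_MS = 1          # minimum execution time per virtual processor

def vp_slice_ms(entitlement, online_vps):
    """Guaranteed execution time per virtual processor per 10 ms window."""
    slice_ms = entitlement * DISPATCH_WINDOW_MS / online_vps
    if slice_ms < MIN_SLICE_MS:
        raise ValueError("entitlement per VP falls below the 1 ms minimum")
    return slice_ms

print(vp_slice_ms(0.8, 2))   # 4.0 ms per VP per window
print(vp_slice_ms(0.2, 1))   # 2.0 ms per window
```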
  37. Virtual processors have dispatch latency, since they are scheduled. When a virtual processor is made runnable, it is placed on a run queue by the POWER Hypervisor, where it sits until it is dispatched. The time between these two events is referred to as dispatch latency. The dispatch latency of a virtual processor is a function of the partition entitlement and the number of virtual processors that are online in the partition. Entitlement is equally divided among these online virtual processors, so the number of online virtual processors impacts the length of each virtual processor’s dispatch. The smaller the dispatch cycle, the greater the dispatch latency. Timers have latency issues also. The hardware decrementer is virtualized by the POWER Hypervisor at the virtual processor level, so that timers will interrupt the initiating virtual processor at the designated time. If a virtual processor is not running, then the timer interrupt has to be queued with the virtual processor, since it is delivered in the context of the running virtual processor. External interrupts have latency issues also. External interrupts are routed directly to a partition. When the operating system makes the accept-pending-interrupt Hypervisor call, the POWER Hypervisor, if necessary, dispatches a virtual processor of the target partition to process the interrupt. The POWER Hypervisor provides a mechanism for queuing up external interrupts that is also associated with virtual processors. Whenever this queuing mechanism is used, latencies are introduced. These latency issues are not expected to cause functional problems, but they may present performance problems for real-time applications. To quantify matters, the worst case virtual processor dispatch latency is 18 milliseconds, since the minimum dispatch cycle that is supported at the virtual processor level is one millisecond. This figure is based on the minimum partition entitlement of 1/10 of a physical processor and the 10 millisecond rotation period of the Hypervisor's dispatch wheel. It can be easily visualized by imagining that a virtual processor is scheduled in the first and last portions of two 10 millisecond intervals. In general, if these latencies are too great, then clients may increase entitlement, minimize the number of online virtual processors without reducing entitlement, or use dedicated processor partitions.
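The 18 millisecond figure follows directly from those numbers; a one-line check of the arithmetic:

```python
window_ms, min_slice_ms = 10, 1
# Worst case: a 1 ms slice at the very start of one window, then a 1 ms
# slice at the very end of the next window.
print((window_ms - min_slice_ms) * 2)   # 18 ms between dispatches
```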
  38. The POWER Hypervisor schedules shared processor partitions from a set of physical processors that is called the shared processor pool. By definition, these processors are not associated with dedicated partitions. In shared partitions, there is not a fixed relationship between virtual processors and the physical processors that actualize them. The POWER Hypervisor may use any physical processor in the shared processor pool when it schedules the virtual processor. By default, it attempts to use the same physical processor, but this cannot always be guaranteed. The POWER Hypervisor employs the notion of a home node for virtual processors, enabling it to select the best available physical processor from a memory affinity perspective for the virtual processor that is to be scheduled.
  39. Affinity scheduling is designed to preserve the content of memory caches, so that the working data set of a job can be read or written in the shortest time period possible. Affinity is actively managed by the POWER Hypervisor, since each partition has a completely different context. Currently, there is one shared processor pool, so all virtual processors are implicitly associated with the same pool. The POWER Hypervisor attempts to dispatch work in a way that maximizes processor, cache, and memory affinity. When the POWER Hypervisor is dispatching a VP (for example, at the start of a dispatch interval) it will attempt to use the same physical CPU as this VP was previously dispatched on, or a processor on the same chip, or on the same MCM (or in the same node). If a CPU becomes idle, the POWER Hypervisor will look for work for that processor. Priority will be given to runnable VPs that have an affinity for that processor. If none can be found, then the POWER Hypervisor will select a VP that has affinity to no real processor (for example, because previous affinity has expired) and, finally, will select a VP that is uncapped. The objective of this strategy is to try to improve system scalability by minimizing inter-cache communication.
  40. In general, operating systems and applications running in shared partitions need not be aware that they are sharing processors. However, overall system performance can be significantly improved by minor operating system changes. The main problem here is that the POWER Hypervisor cannot distinguish between the OS doing useful work and, for example, spinning on a lock. The result is that the OS may waste much of its CE doing nothing of value. AIX 5L provides support for optimizing overall system performance of shared processor partitions. An OS therefore needs to be modified so that it can signal to the POWER Hypervisor when it is no longer able to schedule work, and it can give up the remainder of its time. This results in better utilization of the real processors in the shared processor pool. The dispatch mechanism utilizes hcalls to communicate between the operating system and the POWER Hypervisor. When a virtual processor is active on a physical processor and the operating system detects an inability to utilize processor cycles, it may cede or confer its cycles back to the POWER Hypervisor, enabling it to schedule another virtual processor on the physical processor for the remainder of the dispatch cycle. Reasons for a cede or confer may include the virtual processor running out of work and becoming idle, entering a spin loop to wait for a resource to free, or waiting for a long latency access to complete. There is no concept of credit for cycles that are ceded or conferred. Entitled cycles not used during a dispatch interval are lost. A virtual processor that has ceded cycles back to the POWER Hypervisor can be reactivated using a prod Hypervisor call. If the operating system running on another virtual processor within the logical partition detects that work is available for one of its idle processors, it can use the prod Hypervisor call to signal the POWER Hypervisor to make the prodded virtual processor runnable again. Once dispatched, this virtual processor would resume execution at the return from the cede Hypervisor call. The “payback” for the OS is that the POWER Hypervisor will redispatch it if it becomes runnable again during the same dispatch interval – allocating it the remainder of its CE if possible. While not required, the use of these primitives is highly desirable for performance reasons, because they improve locking and minimize idle time. Response time and throughput should be improved if these primitives are used. Their use is not required, because the POWER Hypervisor time slices virtual processors, which enables it to sequence through each virtual processor in a continuous fashion. Forward progress is thus assured without the use of the primitives.
  41. In this example, there are three logical partitions defined, sharing the processor cycles from two physical processors, spanning two 10 ms Hypervisor dispatch intervals. Logical partition 1 is defined with an entitled capacity of 0.8 processing units, with two virtual processors. This allows the partition 80% of one physical processor for each 10 ms dispatch window for the shared processor pool. For each dispatch window, the workload is shown to use 40% of each physical processor during each dispatch interval. It is possible for a virtual processor to be dispatched more than one time during a dispatch interval. Note that in the first dispatch interval, the workload executing on virtual processor 1 is not a continuous utilization of physical processor resource. This can happen if the operating system confers cycles, and is reactivated by a prod Hypervisor call. Logical partition 2 is configured with one virtual processor and a capacity of 0.2 processing units, entitling it to 20% usage of a physical processor during each dispatch interval. In this example, a worst case dispatch latency is shown for this virtual processor, where 2 ms are used at the beginning of dispatch interval 1 and the last 2 ms of dispatch interval 2, leaving 16 ms between processor allocations. Logical partition 3 contains three virtual processors, with an entitled capacity of 0.6 processing units. Each of the partition’s three virtual processors consumes 20% of a physical processor in each dispatch interval, but in the case of virtual processors 0 and 2, the physical processor they run on changes between dispatch intervals. The POWER Hypervisor does attempt to maintain physical processor affinity when dispatching virtual processors. It will always first try to dispatch the virtual processor on the same physical processor as it last ran on, and depending on resource utilization, will broaden its search out to the other processor on the POWER5 chip, then to another chip on the same MCM, then to a chip on another MCM.
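Applying the dispatch-wheel arithmetic from the earlier sketch to this three-partition example reproduces the percentages quoted above (partition labels are those used in the chart):

```python
partitions = {           # (entitled capacity, online virtual processors)
    "LPAR1": (0.8, 2),
    "LPAR2": (0.2, 1),
    "LPAR3": (0.6, 3),
}
for name, (entitlement, vps) in partitions.items():
    per_vp_ms = entitlement * 10 / vps   # ms per VP per 10 ms window
    print(f"{name}: {per_vp_ms:.1f} ms per VP, "
          f"{per_vp_ms * 10:.0f}% of a physical processor each")
# LPAR1: 4.0 ms (40%), LPAR2: 2.0 ms (20%), LPAR3: 2.0 ms (20%)
```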
  42. This chart introduces POWER Hypervisor involvement in the virtual I/O functions described later. With the introduction of micro-partitioning, the ability to dedicate physical hardware adapter slots to each partition becomes impractical. Virtualization of I/O devices allows many partitions to communicate with each other, and access networks and storage devices external to the server, without dedicating I/O to an individual partition. Many of the I/O virtualization capabilities introduced with the POWER5 processor-based IBM eServer products are accomplished by functions designed into the POWER Hypervisor. The POWER Hypervisor does not own any physical I/O devices, and it does not provide virtual interfaces to them. All physical I/O devices in the system are owned by logical partitions. Virtual I/O devices are owned by an I/O hosting partition, which provides access to the real hardware that the virtual device is based on. The POWER Hypervisor implements the following operations required by system partitions to support virtual I/O: providing control and configuration structures for virtual adapter images required by the logical partitions, and operations that allow partitions controlled and secure access to physical I/O adapters in a different partition. Along with the operations listed above, the POWER Hypervisor allows for the virtualization of I/O interrupts. To maintain partition isolation, the POWER Hypervisor controls the hardware interrupt management facilities. Each logical partition is provided controlled access to the interrupt management facilities using hcalls. Virtual I/O adapters and real I/O adapters use the same set of Hypervisor call interfaces. Virtual I/O adapters are defined by system administrators during logical partition definition. Configuration information for the virtual adapters is presented to the partition operating system by the system firmware. Virtual TTY console support: Each partition needs to have access to a system console. Tasks such as operating system install, network setup, and some problem analysis activities require a dedicated system console. The POWER Hypervisor provides a virtual console using a virtual TTY or serial adapter and a set of Hypervisor calls to operate on them. Depending on the system configuration, the operating system console can be provided by the Hardware Management Console (HMC) virtual TTY or from a terminal emulator connected to physical serial ports on the system’s service processor.
43. Processor utilization is a critical component of metering, performance monitoring, and capacity planning. With POWER5, two new advances that will commonly be used together make the concept of utilization much more complex: shared processor partitioning and simultaneous multi-threading. Individually, each adds complexity to this concept; together they multiply it. Performance monitoring and accounting tools require changes to support Micro-Partitioning. One issue is that CPU utilization (using traditional monitoring methods) is recorded against the partition's capacity entitlement (CE); an uncapped partition may exceed its CE and may therefore appear to use more than 100% of its entitlement. Similarly, accounting tools that rely on the 10 ms timer interrupt may incorrectly record resource utilization for partitions that cede part of their dispatch interval (or that have picked up part of another partition's interval via a confer Hypervisor call). The POWER5 processor architecture addresses these issues by introducing a new processor register intended for measuring utilization. This register, the Processor Utilization Resource Register (PURR), is used to approximate the time that a virtual processor actually runs on a physical processor. The register advances automatically, so the operating system can always read an up-to-date value. The Hypervisor saves and restores the register across virtual processor context switches to simulate a monotonically increasing atomic clock at the virtual processor level.
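On AIX 5L V5.3, the standard tools already report PURR-derived metrics, so no special programming is needed to observe them. A minimal illustration (the interval and count values are arbitrary):

    # Report logical partition utilization every 2 seconds, 3 reports.
    # On a shared processor partition, the physc (physical processors
    # consumed) and %entc (percent of entitled capacity) columns are
    # derived from the PURR rather than from timer-interrupt sampling.
    lparstat 2 3

Because %entc is measured against entitlement, an uncapped partition under load can legitimately report values above 100.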
  44. The Virtual I/O server is an appliance that provides virtual storage and shared Ethernet capability to client logical partitions on a POWER5 system. It allows a physical adapter on one partition to be shared by one or more partitions, enabling clients to consolidate and potentially minimize the number of physical adapters.
45. The Virtual I/O Server provides a restricted, scriptable command line user interface (CLI). All aspects of Virtual I/O Server administration are accomplished through the CLI, including:
– Device management (physical, virtual, LVM)
– Network configuration
– Software installation and update
– Security
– User management
– Installation of OEM software
– Maintenance tasks
The creation and deletion of the virtual client and server adapters is managed by the HMC GUI and POWER5 server firmware; the association between a client adapter and a server adapter is defined when the virtual adapters are created. The optional Advanced POWER Virtualization hardware feature, which enables Micro-Partitioning on POWER5 servers, is required to activate the Virtual I/O Server. A small logical partition with enough resources to share with other partitions is required. The minimum hardware requirements to create the Virtual I/O Server partition are:
– A POWER5 server (the Virtual I/O capable machine)
– A Hardware Management Console to create the partition and assign resources
– A storage adapter: the server partition needs at least one
– A physical disk large enough to create sufficiently sized logical volumes on it
– An Ethernet adapter, to allow secure routing of network traffic from a virtual Ethernet to a real network adapter
– At least 128 MB of memory
The Virtual I/O Server provides the Virtual SCSI (VSCSI) target and Shared Ethernet Adapter virtual I/O functions to client partitions. This is accomplished by assigning physical devices to the Virtual I/O Server partition, then configuring virtual adapters on the clients to allow communication between the client and the Virtual I/O Server.
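A few representative commands give a feel for the restricted CLI. This is a hedged sketch: the host name, interface, and address values are hypothetical placeholders, not values from this configuration.

    ioslevel                      # show the Virtual I/O Server software level
    lsdev -virtual                # list the virtual devices this server owns
    lsmap -all                    # show virtual-to-physical device mappings
    # Configure the management IP interface (hypothetical values):
    mktcpip -hostname vios1 -inetaddr 192.0.2.10 -interface en0 \
            -netmask 255.255.255.0 -gateway 192.0.2.1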
46. Installation of the Virtual I/O Server partition is performed from a special mksysb CD provided to customers who order the Advanced POWER Virtualization feature. This software is dedicated to Virtual I/O Server operations and is supported only in Virtual I/O Server partitions. The Virtual I/O Server partition itself is configured using a command line interface; defining partition resources, such as virtual Ethernet or virtual disk connections to client systems, requires the HMC. The Virtual I/O Server supports the following operating systems as virtual I/O clients:
– AIX 5L Version 5.3
– SUSE LINUX Enterprise Server 9 for POWER
– Red Hat Enterprise Linux AS for POWER Version 3
Providing high availability for the Virtual I/O Server means incorporating the I/O resources (physical and virtual) on the Virtual I/O Server, as well as the client partitions, into a configuration designed to eliminate single points of failure. The Virtual I/O Server itself is not highly available: if there is a problem in the Virtual I/O Server, or if it crashes, the client partitions see I/O errors and cannot access the adapters and devices backed by that server. However, redundancy can be built into the configuration of the physical and virtual I/O resources at several stages. Since the Virtual I/O Server is an AIX-based appliance, redundancy for physical devices attached to it can be provided by capabilities such as LVM mirroring, Multipath I/O, and EtherChannel. When running two instances of the Virtual I/O Server, you can use LVM mirroring, Multipath I/O, EtherChannel, or multipath routing with dead gateway detection in the client partition to provide highly available access to virtual resources hosted in the separate Virtual I/O Server partitions.
47. The virtualization features of the POWER5 platform support up to 254 partitions, while the largest planned server provides only up to 160 I/O slots. With each partition requiring at least one I/O slot for disk attachment and another for network attachment, this constrains the number of partitions. To overcome these physical limitations, I/O resources have to be virtualized; Virtual SCSI provides the means to do this for storage devices. Virtual I/O also has a value proposition of its own. It allows the creation of logical partitions without the need for additional physical resources, which facilitates on demand computing and server consolidation, and it provides a more economical I/O model by using physical resources more efficiently through sharing. Furthermore, virtual I/O allows attachment of previously unsupported storage solutions: as long as the Virtual I/O Server supports the attachment of a storage resource, any client partition can access that storage by using virtual SCSI adapters. For example, at the time of writing, there is no native support for EMC storage devices on Linux, but running Linux in a logical partition of a POWER5 server makes this possible: a Linux client partition can access the EMC storage through a virtual SCSI adapter. Requests from the virtual adapters are mapped to the physical resources in the Virtual I/O Server partition, so driver support for the physical resources is needed only in the Virtual I/O Server partition.
48. Virtual SCSI is based on a client/server relationship. The Virtual I/O Server owns the physical resources and acts as the server; the logical partitions access the virtual I/O resources provided by the Virtual I/O Server as clients. The virtual I/O resources are assigned using an HMC. The Virtual I/O Server partition is often referred to as the hosting partition, and the client partitions as hosted partitions. Virtual SCSI enables sharing of adapters as well as disk devices. To make a physical or logical volume available to a client partition, it is assigned to a virtual SCSI server adapter in the Virtual I/O Server partition. The client partition accesses its assigned disks through a virtual SCSI client adapter; it sees standard SCSI devices and LUNs through this virtual adapter. Virtual SCSI resources can be assigned and removed dynamically: on the HMC, virtual SCSI client and server adapters can be added to and removed from a partition using dynamic logical partitioning, and the mapping between physical and virtual resources on the Virtual I/O Server can also be changed dynamically. This chart shows an example where one physical disk is split into two logical volumes inside the Virtual I/O Server. Each of the two client partitions is assigned one logical volume, which it accesses through a virtual I/O adapter (vSCSI client adapter). Inside the partition, the disk is seen as a normal hdisk.
  49. A disk owned by the virtual I/O server can either be exported and assigned to a client partition as a whole or it can be split into several logical volumes. Each of these logical volumes can then be assigned to a different partition. A virtual disk device is mapped by the server VSCSI adapter to a logical volume and presented to the hosted partition as a physical direct access device. There can be many virtual disk devices mapped onto a single physical disk. The system administrator will create a virtual disk device by choosing a logical volume and binding it to a VSCSI server adapter. The virtual I/O adapters are connected to a virtual host bridge, which AIX treats much like a PCI host bridge. It is represented in the ODM as a bus device whose parent is sysplanar0. The virtual I/O adapters are represented as adapter devices with the virtual host bridge as their parent. On the virtual I/O server, each logical volume or physical volume that is exported to a client partition is represented by a virtual target device, which is a child of a virtual SCSI server adapter. On the client partition, the exported disks are visible as normal hdisks; however, they are defined in subclass vscsi. They have a virtual SCSI client adapter as parent. Note that virtual disks can be used as boot devices and as NIM targets. Virtual disks can be shared by multiple clients, allowing for configurations using concurrent LVM, for example.
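On the Virtual I/O Server CLI, creating and exporting such a virtual disk takes two steps: carve out a logical volume, then bind it to a virtual SCSI server adapter as a virtual target device. A minimal sketch, assuming a volume group rootvg_clients and a server adapter vhost0 (all names and the size are hypothetical):

    # Create a logical volume to back the client's virtual disk:
    mklv -lv lv_client1 rootvg_clients 10G
    # Bind it to the virtual SCSI server adapter as a virtual target device:
    mkvdev -vdev lv_client1 -vadapter vhost0 -dev vtd_client1
    # Verify the mapping the client partition will see:
    lsmap -vadapter vhost0

On the client partition, the exported volume then appears as a new hdisk after device configuration runs (for example, cfgmgr on AIX).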
  50. The SCSI family of standards provides many different transport protocols that define the rules for exchanging information between SCSI initiators and targets. Virtual SCSI uses the SCSI RDMA Protocol (SRP), which defines the rules for exchanging SCSI information in an environment where the SCSI initiators and targets have the ability to directly transfer information between their respective address spaces. SCSI requests and responses are sent using the Virtual SCSI adapters that communicate through the POWER Hypervisor. The actual data transfer however is done directly between a data buffer in the client partition and the physical adapter in the Virtual I/O Server by using the Logical Remote Direct Memory Access (LRDMA) protocol. This chart shows how the data transfer using LRDMA works.
51. Using Virtual SCSI means the Virtual I/O Server acts like a storage box providing the data; instead of SCSI or Fibre Channel cabling, the connection is made by the POWER Hypervisor. The Virtual SCSI device drivers of the I/O Server and the POWER Hypervisor ensure that only the owning partition has access to its data: neither other partitions nor the I/O Server itself can make the client's data visible. Only the control information passes through the I/O Server; the data itself is copied directly from the PCI adapter to the client's memory.
52. Enabling VSCSI may not result in a performance benefit. There is an overhead associated with Hypervisor calls, and because several steps are involved in passing I/O requests from the initiator to the target partition, VSCSI uses additional CPU cycles when processing I/O requests. VSCSI devices will therefore not match the performance of dedicated devices: using Virtual SCSI roughly doubles the amount of CPU time needed to perform I/O compared to directly attached storage, with this CPU load split between the Virtual I/O Server and the Virtual SCSI client. Performance is expected to degrade when multiple partitions share a physical disk, and the actual impact on overall system performance will vary by environment; the base case for comparison is one physical disk dedicated to a partition. General performance considerations when using Virtual SCSI:
– Since VSCSI is a client/server model, CPU utilization will always be higher than doing local I/O. A reasonable expectation is a total of twice as many cycles for VSCSI as for locally attached disk I/O, more or less evenly distributed between client and server.
– If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.
– If not constrained by CPU performance, dedicated partition throughput is comparable to doing local I/O.
– There is no data caching in memory on the server partition, so all I/Os that it services are essentially synchronous disk I/Os. For the same reason, the server partition's memory requirements should be modest.
– The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request. For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice. (IBM eServer p5 Virtualization - Performance Considerations, SG24-5768)
53. Supported devices: At the time of writing, virtual SCSI supports Fibre Channel, parallel SCSI, and SCSI RAID devices. Other devices, such as SSA, tape, or CD-ROM, are not supported.
Number of adapters: Virtual SCSI itself has no limitation on the number of supported devices or adapters. However, the Virtual I/O Server partition supports a maximum of 65535 virtual I/O slots, and a maximum of 256 virtual I/O slots can be assigned to a single partition. Every I/O slot needs some resources to be instantiated, so the size of the Virtual I/O Server puts a practical limit on the number of virtual adapters that can be configured.
SCSI commands: The SCSI protocol defines mandatory and optional commands. Virtual SCSI supports all the mandatory commands, but not all optional commands are supported.
54. Implementing VSCSI is not recommended for partitions with high performance and disk I/O requirements. Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a logical volume. Using a logical volume for virtual storage means that the number of partitions is no longer limited by hardware; the trade-off is that some of the partitions will have less than optimal storage performance. Suitable applications for VSCSI include boot disks for the operating system, and Web servers that typically cache a lot of data.
  55. This chart shows a virtual I/O server configuration using LVM mirroring on the client partition. The client partition is LVM mirroring its logical volumes using the two virtual SCSI client adapters. Each of these adapters is assigned to a separate virtual I/O server partition. The two physical disks are each attached to a separate virtual I/O server partition and made available to the client partition through a virtual SCSI server adapter.
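On the AIX client partition, the mirroring in this chart corresponds to standard LVM commands. A hedged sketch, assuming the two virtual disks served by the separate Virtual I/O Servers appear as hdisk0 and hdisk1 and that rootvg is being mirrored (device names are hypothetical):

    extendvg rootvg hdisk1            # add the second virtual disk to the volume group
    mirrorvg rootvg hdisk1            # create a mirror copy of every logical volume
    bosboot -ad /dev/hdisk1           # make the second disk bootable
    bootlist -m normal hdisk0 hdisk1  # allow boot from either disk

With this setup, the client keeps running on the surviving copy if one Virtual I/O Server goes down, and resynchronizes the stale copy when it returns.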
56. This chart shows a configuration using Multipath I/O to access an ESS disk. The client partition sees two paths to the physical disk through MPIO; each path uses a different virtual SCSI adapter, and each of these virtual SCSI adapters is backed by a separate Virtual I/O Server. This type of configuration works only when the physical disk is assigned as a whole to the client partition; you cannot split the physical disk into logical volumes at the Virtual I/O Server level. Depending on your SAN topology, each physical adapter can be connected to a separate SAN switch to provide redundancy, and at the physical disk level the ESS provides redundancy because it RAIDs the disks internally.
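From the client partition, the two paths can be inspected and periodic path health checking enabled. A hedged sketch, assuming the MPIO disk appears as hdisk0 (the device name and interval are illustrative):

    lspath -l hdisk0                          # list both vscsi paths and their states
    lsattr -El hdisk0                         # show the disk's MPIO attributes
    # Enable periodic path health checking; -P defers the change to the next boot:
    chdev -l hdisk0 -a hcheck_interval=60 -P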
57. Virtual LAN (VLAN) is a technology for establishing virtual network segments on top of physical switch devices. If configured appropriately, a VLAN definition can straddle multiple switches. Typically, a VLAN is a broadcast domain that enables all nodes in the VLAN to communicate with each other without any L3 routing or inter-VLAN bridging. In the diagram shown in this chart, two VLANs (VLAN 1 and 2) are defined on three switches (Switch A, B, and C). Although nodes C-1 and C-2 are physically connected to the same switch C, traffic between the two nodes can be blocked. To enable communication between VLAN 1 and 2, L3 routing or inter-VLAN bridging has to be established between them; this is typically provided by an L3 device. The use of VLANs provides increased LAN security and flexible network deployment compared with traditional network devices.
VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation. An IEEE 802.1Q VLAN is achieved by adding a VLAN ID tag to each Ethernet frame; the Ethernet switches restrict frames to the ports that are authorized to receive frames with that VLAN ID. Switches also restrict broadcasts to the logical network by ensuring that a broadcast packet is delivered to all ports configured to receive frames with the VLAN ID that the broadcast frame was tagged with.
A port on a VLAN-capable switch has a default PVID that indicates the default VLAN the port belongs to; the switch adds the PVID tag to untagged frames received on that port. In addition to the PVID, a port may belong to additional VLANs and have those VLAN IDs assigned to it. A port will only accept untagged packets, or packets tagged with a VLAN ID (the PVID or an additional VID) of one of the VLANs the port belongs to. A port configured in untagged mode is only allowed to have a PVID and will receive untagged packets or packets tagged with the PVID; the untagged-port feature helps systems that do not understand VLAN tagging communicate with other systems using standard Ethernet.
Each VLAN ID is associated with a separate Ethernet interface to the upper layers (IP and so on) and creates a unique logical Ethernet adapter instance per VLAN (for example, ent1, ent2, and so on). You can configure multiple VLAN logical devices on a single system; each VLAN logical device constitutes an additional Ethernet adapter instance. These logical devices can be used to configure the same Ethernet IP interfaces as are used with physical Ethernet adapters.
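In AIX, such a VLAN logical device is created on top of an existing adapter. A minimal sketch, assuming ent0 is the underlying adapter and VLAN ID 100 is configured on the switch port (both values are hypothetical):

    # Create a VLAN pseudo-adapter; it appears as a new entX device:
    mkdev -c adapter -s vlan -t eth -a base_adapter=ent0 -a vlan_tag_id=100
    # The matching enX interface can then be given an IP address exactly
    # like the interface of a physical Ethernet adapter.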
58. The Virtual Ethernet enables inter-partition communication without the need for physical network adapters in each partition. It allows the administrator to define in-memory point-to-point connections between partitions. These connections exhibit characteristics similar to high-bandwidth Ethernet connections and support multiple protocols (IPv4, IPv6, and ICMP). Virtual Ethernet requires a POWER5 system with either AIX 5L V5.3 or the appropriate level of Linux, and a Hardware Management Console (HMC) to define the Virtual Ethernet devices. Virtual Ethernet does not require the purchase of any additional features or software, such as the Advanced POWER Virtualization feature. Virtual Ethernet is also called Virtual LAN or even VLAN, which can be confusing because these terms are also used in network topology: the Virtual Ethernet described here, which uses virtual devices, is not the same as the VLAN concept from network topology, which divides a LAN into further sub-LANs.
59. The Virtual Ethernet connections supported in POWER5 systems use VLAN technology to ensure that partitions can access only data directed to them. The POWER Hypervisor provides a Virtual Ethernet switch function based on the IEEE 802.1Q VLAN standard, which allows partition communication within the same server. Partitions that wish to communicate through a Virtual Ethernet channel need an additional in-memory channel, which a user requests by creating it between partitions on the HMC. The kernel creates a virtual adapter for each memory channel indicated by the firmware, and the normal AIX configuration routine creates the device special files. A virtual LAN adapter appears to the operating system in the same way as a physical adapter. A unique Media Access Control (MAC) address is generated when the user creates a Virtual Ethernet adapter; a prefix value can be assigned for the system so that the generated MAC addresses consist of a common system prefix plus an algorithmically generated unique part per adapter. The MAC address of the virtual adapter is generated by the HMC. The transmission speed of Virtual Ethernet adapters is in the range of 1-3 Gigabits per second, depending on the maximum transmission unit (MTU) size. Like Gigabit (Gb) Ethernet, the Virtual Ethernet adapter supports the standard MTU size of 1500 bytes and jumbo frames of 9000 bytes. Additionally, Virtual Ethernet supports an MTU size of 65394 bytes, which is not available on Gb Ethernet and can therefore be used only inside a Virtual Ethernet. A partition can support up to 256 Virtual Ethernet adapters, with each Virtual Ethernet adapter capable of being associated with up to 18 VLANs. The Virtual Ethernet can also be used as a bootable device, allowing tasks such as operating system installation to be performed using NIM.
60. The POWER Hypervisor switch is consistent with IEEE 802.1Q. This standard defines the operation of virtual LAN (VLAN) bridges that permit the definition, operation, and administration of virtual LAN topologies within a bridged LAN infrastructure. The switch works at OSI Layer 2 and supports a VLAN ID space of 4096 VIDs, of which up to 4094 virtual LANs are usable. The Hypervisor works as a virtual Ethernet switch and maintains queues for each VLAN in its own memory. IEEE 802.1Q requires a virtual LAN ID (VID); in this implementation, the VID is optional. When this option is selected while adding a new virtual LAN interface at the HMC, a VID can be chosen. Up to 18 VIDs can be configured per virtual LAN port. The authority to communicate between LPARs is granted by configuring ports on the virtual Ethernet switch maintained by the Hypervisor; the switch configuration is defined using the HMC. When frames are sent across the network, a tag header indicates which VLAN a frame belongs to, ensuring that the switch forwards the frame only to those ports that belong to that VLAN. Untagged packets are handled by adding the port VLAN identifier (PVID) to each frame.
61. When a message arrives at a Logical LAN switch port from a Logical LAN adapter, the POWER Hypervisor caches the message's source MAC address to use as a filter for future messages to that adapter. Processing then proceeds as follows:
– If the port is configured for VLAN headers, the VLAN header is checked against the port's allowable VLAN list; if the VLAN specified in the message is not in the port's configuration, the message is dropped. If the port is not configured for VLAN headers, the Hypervisor (conceptually) inserts a two-byte VLAN header based on the port's configured VLAN number.
– Next, the destination MAC address is processed by searching the table of cached MAC addresses built from messages received at Logical LAN switch ports (see above). If no match is found and there is no Trunk Adapter defined for the specified VLAN number, the message is dropped; if no match is found but a Trunk Adapter is defined for that VLAN number, the message is passed on to the Trunk Adapter.
– If a MAC address match is found, the allowable VLAN number table of the associated switch port is scanned for a match to the VLAN number contained in the message's VLAN header; if no match is found, the message is dropped.
– Finally, the VLAN header configuration of the destination switch port is checked. If the port is configured for VLAN headers, the message is delivered to the destination Logical LAN adapter with the inserted VLAN header included; if the port is configured for no VLAN headers, the VLAN header is removed before delivery to the destination Logical LAN adapter.
62. The measurements shown were taken on a 4-way POWER5 system running AIX 5L V5.3 with several partitioning configurations. SMT (simultaneous multi-threading) was turned on, and default settings were used for the virtual LAN adapters and the Gigabit Ethernet adapter. Virtual Ethernet connections generally consume more processor time than a local adapter to move a packet (copy versus DMA). For shared processor partitions, performance is gated by the partition definitions (for example, entitled capacity and number of processors); small partitions communicating with each other experience more packet latency due to partition context switching. In general, high bandwidth applications should not be deployed in small shared processor partitions. For dedicated partitions, throughput should be comparable to 1 Gigabit Ethernet for small packets, and much better than 1 Gigabit Ethernet for large packets; for large packets, Virtual Ethernet communication is limited by copy bandwidth. The throughput of the Virtual Ethernet scales nearly linearly with the allocated capacity entitlements. This linear scaling with CPU entitlements shows that there is no measurable overhead in using shared processors instead of dedicated processors for throughput between virtual LANs. Throughput increases, as expected, with growing MTU sizes (from MTU size 1500 to 9000 by a factor of roughly 3, and from 1500 to 65394 by a factor of more than 7). The Virtual Ethernet adapter has higher raw throughput at all MTU sizes; at MTU 9000 the difference in throughput is very large, because the in-memory copy that Virtual Ethernet uses to transfer data is more efficient at larger MTU sizes.
63. The following limitation must be considered when implementing a Virtual Ethernet: Virtual Ethernet uses the system processors for all communication functions, instead of offloading that load to processors on network adapter cards. As a result, the use of Virtual Ethernet increases system processor load. (Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940)
64. Because experience with virtual LANs is still limited, these guidelines should not be taken as a performance guarantee; they are offered for orientation only.
– Know your environment and the network traffic.
– Choose a high MTU size, as far as it makes sense for the network traffic in the virtual LAN.
– Use MTU size 65394 if you expect a large amount of data to be copied inside your virtual LAN.
– Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394 if there is communication with physical adapters.
– Do not turn off SMT (simultaneous multi-threading) unless your applications demand it.
– Throughput in virtual LANs scales linearly with CPU entitlements, so there is no need to dedicate CPUs to partitions for the sake of virtual LAN performance.
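On AIX 5L V5.3, these tunables map to standard commands. A hedged sketch (en0 is a hypothetical virtual Ethernet interface):

    chdev -l en0 -a mtu=65394     # large MTU for intra-VLAN bulk traffic
    no -o tcp_pmtu_discover=1     # enable TCP path MTU discovery
    no -o udp_pmtu_discover=1     # enable UDP path MTU discovery
    smtctl                        # verify that SMT is still enabled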
65. There are two ways to connect the Virtual Ethernet that enables communication between logical partitions on the same server to an external network: routing and Shared Ethernet Adapter. By enabling the AIX routing capabilities (the ipforwarding network option), one partition with a physical Ethernet adapter connected to an external network can act as a router. In this type of configuration, the partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server; it can be any partition with a connection to the outside world. The client partitions have their default route set to the partition that routes traffic to the external network. This example shows two systems with VLANs. The first has an internal VLAN with subnet 3.1.1.x; the other has subnet 4.1.1.x. The first system has a partition that routes the internal VLAN to an external LAN on subnet 1.1.1.x, to which another server (1.1.1.10) is also connected. Similarly, the other system has a partition that routes that system's internal VLAN to the external 2.1.1.x subnet. An external IP router connects the two external subnets together.
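A rough sketch of the routing approach on AIX, following the example's 3.1.1.x internal subnet (the exact gateway address is hypothetical):

    # On the routing partition (which has both a virtual and a physical adapter):
    no -o ipforwarding=1
    # On each client partition, point the default route at the routing
    # partition's address on the internal VLAN (not persistent across reboot):
    route add default 3.1.1.1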
  66. Using a Shared Ethernet Adapter (SEA), you can connect internal and external VLANs using one physical adapter. Shared Ethernet Adapter is a new service that acts as a layer 2 network switch to securely bridge network traffic from a Virtual Ethernet to a real network adapter. The Shared Ethernet Adapter service runs in the Virtual I/O server partition.
67. The Shared Ethernet Adapter allows partitions to communicate outside the system without having to dedicate a physical I/O slot and a physical network adapter to a client partition. The Shared Ethernet Adapter has the following characteristics:
– Virtual Ethernet MAC addresses are visible to outside systems.
– Broadcast and multicast are supported.
– ARP and NDP can work across a shared Ethernet.
In order to bridge network traffic between the Virtual Ethernet and external networks, the Virtual I/O Server partition has to be configured with at least one physical Ethernet adapter. One Shared Ethernet Adapter can be shared by multiple VLANs, and multiple subnets can connect using a single adapter on the Virtual I/O Server. A Virtual Ethernet adapter configured into a Shared Ethernet Adapter must have the trunk flag set. Once an Ethernet frame is sent from a Virtual Ethernet adapter on a client partition to the POWER Hypervisor, the POWER Hypervisor searches for the destination MAC address within the VLAN; if no such MAC address exists within the VLAN, it forwards the frame to the trunk Virtual Ethernet adapter defined on the same VLAN. The trunk Virtual Ethernet adapter enables a layer-2 bridge to a physical adapter. The Shared Ethernet Adapter directs packets based on the VLAN ID tags, learning this information by observing the packets originating from the virtual adapters. One of the virtual adapters in the Shared Ethernet Adapter is designated as the default PVID adapter: Ethernet frames without any VLAN ID tag are directed to this adapter and assigned the default PVID. When the Shared Ethernet Adapter receives IP (or IPv6) packets that are larger than the MTU of the adapter the packet is forwarded through, either IP fragmentation is performed and the fragments are forwarded, or an ICMP "packet too big" message is returned to the source when the packet cannot be fragmented.
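On the Virtual I/O Server CLI, a Shared Ethernet Adapter of this kind is created from a physical adapter and a trunk virtual adapter. A hedged sketch; ent0, ent1, the PVID, and the addresses below are hypothetical placeholders:

    # Bridge physical adapter ent0 and trunk virtual adapter ent1; untagged
    # frames get default PVID 1. This creates a new SEA device (for example, ent2):
    mkvdev -sea ent0 -vadapter ent1 -default ent1 -defaultid 1
    # Optionally give the Virtual I/O Server an IP interface on the new SEA:
    mktcpip -hostname vios1 -inetaddr 192.0.2.10 -interface en2 \
            -netmask 255.255.255.0 -gateway 192.0.2.1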
68. Similar to Virtual SCSI, the POWER Hypervisor also provides the connection between partitions when Virtual Ethernet is used. Inside the server, the POWER Hypervisor acts like an Ethernet switch, and the connection to the external network is made by the Virtual I/O Server's shared Ethernet function, which acts as a Layer 2 bridge to the physical adapters. The Virtual Ethernet implementation conforms to the IEEE 802.1Q standard, which describes VLAN (virtual local area network) tagging: a VLAN ID tag is inserted into every Ethernet frame, and the Ethernet switch restricts the frames to the ports that are authorized to receive frames with that VLAN ID. Every port of an Ethernet switch can be configured to be a member of several VLANs. Only the network adapters, both virtual and physical, that are connected to a port (virtual or physical) belonging to the same VLAN can receive the frames. The implementation of this VLAN standard ensures that partitions have no access to foreign data.
69. The measurements shown were taken on a 4-way POWER5 system running AIX 5L V5.3 with several partitioning configurations. SMT (simultaneous multi-threading) was turned on, and default settings were used for the virtual LAN adapters and the Gigabit Ethernet adapter. The Shared Ethernet Adapter allows the adapters to stream data at media speed as long as it has enough CPU entitlement. This chart shows the throughput of the Virtual I/O Server at MTU sizes of 1500 and 9000, in both simplex and duplex modes. CPU utilization per Gigabit of throughput is higher with the Shared Ethernet Adapter because it has to receive data on one end and send it out the other, and because of the bridging functionality in the Virtual I/O Server.
70. You must consider the following limitations when implementing Shared Ethernet Adapters in the Virtual I/O Server:
– Because the Shared Ethernet Adapter depends on Virtual Ethernet, which uses the system processors for all communication functions, a significant amount of system processor load can be generated by the use of Virtual Ethernet and Shared Ethernet Adapter.
– One of the virtual adapters in the Shared Ethernet Adapter on the Virtual I/O Server must be defined as the default adapter with a default PVID; this virtual adapter is designated as the PVID adapter, and Ethernet frames without any VLAN ID tag are assigned the default PVID and directed to it.
– Up to 16 Virtual Ethernet adapters, each with up to 18 VLANs, can be shared on a single physical network adapter.
– There is no limit on the number of partitions that can attach to a VLAN, so the theoretical limit is very high; in practice, the amount of network traffic limits the number of clients that can be served through a single adapter.
– The Shared Ethernet Adapter requires the POWER Hypervisor component of POWER5 systems and therefore cannot be used on POWER4 systems. It also cannot be used with AIX 5L Version 5.2, because the device drivers for Virtual Ethernet are available only for AIX 5L Version 5.3 and Linux; thus, there is no way to connect an AIX 5L Version 5.2 system to a Shared Ethernet Adapter.
71. Because experience with the Virtual I/O Server and the Shared Ethernet Adapter is still limited, these guidelines should not be taken as a performance guarantee; they are offered for orientation only.
– Know your environment and the network traffic.
– Do not use the Shared Ethernet Adapter functionality of the Virtual I/O Server if you expect heavy network traffic between virtual LANs and local networks; use a dedicated network adapter instead.
– If possible, use dedicated CPUs for the Virtual I/O Server (no shared processors).
– Choose an MTU size of 9000 if this makes sense for your network traffic.
– Do not use the Shared Ethernet Adapter functionality of the Virtual I/O Server for latency-critical applications.
– With MTU size 1500, you need about one CPU per Gigabit Ethernet adapter streaming at media speed; with MTU size 9000, two Gigabit Ethernet adapters can stream at media speed per CPU.
  72. In order to bridge network traffic between the Virtual Ethernet and external networks, the Virtual I/O Server has to be configured with at least one physical Ethernet adapter. One Shared Ethernet Adapter can be shared by multiple VLANs and multiple subnets can connect using a single adapter on the Virtual I/O Server. The chart shows a configuration example. A Shared Ethernet Adapter can include up to 16 Virtual Ethernet adapters that share the physical access.
73. There are several different ways to configure physical and Virtual Ethernet adapters into Shared Ethernet Adapters to maximize throughput. Using several Shared Ethernet Adapters provides more queues and therefore more performance; an example of this configuration is shown in this chart.
74. This chart shows a configuration using multipath routing and dead gateway detection. The client partition has two virtual Ethernet adapters, each assigned to a different VLAN (using the PVID). Each Virtual I/O Server is configured with a Shared Ethernet Adapter that bridges traffic between the virtual Ethernet and the external network; each Shared Ethernet Adapter is assigned to a different VLAN (using the PVID). By using two VLANs, network traffic is separated so that each virtual Ethernet adapter in the client partition appears to be connected to a different Virtual I/O Server. In the client partition, two default routes with dead gateway detection are defined: one route goes to gateway 9.3.5.10 via the virtual Ethernet adapter with address 9.3.5.12, and the second goes to gateway 9.3.5.20 via the virtual Ethernet adapter with address 9.3.5.22. In case of a failure of the primary route, access to the external network is provided through the second route; AIX detects the route failure and adjusts the cost of the route accordingly. Restriction: multipath routing and dead gateway detection do not make an IP address highly available. In case of the failure of one path, dead gateway detection routes traffic through an alternate path, but the network adapters and their IP addresses remain unchanged. Therefore, with multipath routing and dead gateway detection, only your access to the network becomes redundant, not the IP addresses.
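A hedged sketch of the client-side route definitions for this example, using the gateway addresses from the chart (the -active_dgd route option enables active dead gateway detection in AIX 5L; routes added this way are not persistent across a reboot):

    # Two default routes, each monitored by active dead gateway detection:
    route add default 9.3.5.10 -active_dgd
    route add default 9.3.5.20 -active_dgd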
  75. For more details, refer to the Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940 redbook.