© Copyright IBM Corporation 2016. Technical University/Symposia materials may not be reproduced in whole or in part without the prior written permission of IBM.
Enabling POWER 8 advanced features on Linux
Sébastien Chabrolles
Julien Limodin
Fabrice Moyen
PowerSystem Linux Center
IBM Montpellier
IBM Systems Technical Events | ibm.com/training/events
POWER8 Hardware Accelerator
NX
On-chip accelerators (NX):
Symmetric crypto
Compression engine
Random number generator
One NX complex per chip
A given NX can access all memory in the SMP
A given NX can be accessed by any core
Can be accessed via a PowerVM hypervisor call
In-core accelerators:
Symmetric crypto
Private per core
Leverage the vector unit (VMX)
Direct access for guest/VM (including KVM)
IBM - POWER8
12 cores per socket (from 3 to 4 GHz)
8 HW threads / core (SMT technology)
Large cache (96 MB : 8 MB / core)
High memory bandwidth (~200 GB/s)
Enable POWER 8 advanced features on Linux
1. Transparent Memory Compression
2. -
3. Power8 Split-Core
Transparent Memory Compression
Transparent memory compression is a feature provided by the operating system (kernel) that dynamically compresses process memory without the process being aware of it.
PowerVM with AIX offers this functionality via AME (Active Memory Expansion).
Unfortunately, AME does not exist for Linux.
Linux has an alternative solution named zswap.
Zswap is a feature that hooks into the read and write sides of the swap code and acts as a compressed cache for pages going to and from the swap device.
Like AME, zswap can use the Power NX compression accelerator (842) to improve compression performance.
But unlike AME, zswap has some restrictions:
A paging device is needed, with enough space to store the uncompressed data.
[...] but still the real one.
Application processes must allow themselves to be swapped out.
P8 NX (on-chip) block diagram
Second generation Nest Accelerator complex*
Encryption engine
Random number generator
Two 842 compression/decompression engines
Proprietary IBM Research algorithm
SRAM-based dictionary compression
Used by AME
Good compression ratio at high bandwidth
106% of LZO on 190+ benchmarks
158% of the compression ratio of software DEFLATE with FHT on the Canterbury corpus
Only available via PowerVM or BareMetal Linux.
-chip accelerators for cryptography and active
IBM J. Res. & Dev., vol. 57, no. 4, Nov./Dec. 2013.
[Block diagram: on-chip SMP interconnect interface and DMA controller; channels 0 to 3; 842, RNG, AES, SHA and IOB units; ingress and egress arrays; 16B and 32B data paths]
Zswap!
To evaluate zswap, we use a well-known Java benchmark (SPECjbb) and run it several times while increasing the JVM heap size.
Test system: 1 POWER8 core, 10 GB physical memory, Ubuntu 16.04
JVM heap size: 9 GB, 10 GB, ... up to 18 GB
1- Baseline test with zswap deactivated
2- Test with zswap and software compression (default)
3- Test with zswap and Power HW compression (842)
Memory Over-Allocation test with SPECjbb2005 (Baseline)
[Chart: SPECjbb2005 performance and memory over-allocation, 1 P8 core SMT8, 10 GB memory, zswap off. X axis: JVM heap size (9 to 18 GB); Y axis: % bops vs nominal. Marked region: memory over-commitment.]
10% of nominal performance (due to memory thrashing)
SWAP / Paging Activity
[Diagram: system memory and swap device, linked by swap-out and swap-in arrows]
1- Swap Out / Page Out
When the memory is full, a process (LRUD) scans memory and moves pages to the swap device.
Asynchronous background task => no impact on [...]
2- Swap In / Page In
When a page fault occurs and the pages are located in the paging device, those pages must be moved back to memory.
As physical disks are much slower than memory => THIS HURTS PERFORMANCE!
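The swap-in and swap-out traffic described above can be observed on a running Linux system through the pswpin and pswpout counters of /proc/vmstat (pages swapped in and out since boot). The following is a minimal sketch, not part of the original slides: the helper names (swap_rate_mb, vmstat_counter, sample_swap_io) are hypothetical and the sampling interval is an arbitrary choice.

```shell
#!/bin/sh
# Convert a page-count delta into MB/s.
# Arguments: pages moved, page size in bytes, interval in seconds.
swap_rate_mb() {
    awk -v p="$1" -v s="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", (p * s) / (t * 1048576) }'
}

# Read one counter (pswpin or pswpout) from /proc/vmstat; prints 0 if unavailable.
vmstat_counter() {
    [ -r /proc/vmstat ] || { echo 0; return 0; }
    awk -v k="$1" '$1 == k { v = $2 } END { print v + 0 }' /proc/vmstat
}

# Sample the swap-in and swap-out rates over the given interval (default 1 s).
sample_swap_io() {
    interval="${1:-1}"
    in0=$(vmstat_counter pswpin);  out0=$(vmstat_counter pswpout)
    sleep "$interval"
    in1=$(vmstat_counter pswpin);  out1=$(vmstat_counter pswpout)
    page=$(getconf PAGESIZE)
    echo "swap in:  $(swap_rate_mb $((in1 - in0)) "$page" "$interval") MB/s"
    echo "swap out: $(swap_rate_mb $((out1 - out0)) "$page" "$interval") MB/s"
}
```

Calling sample_swap_io 5 while the benchmark runs prints the average rates over five seconds.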
Memory Over-Allocation test with SPECjbb2005 (Swap I/O)
[Chart: swap I/O activity, SPECjbb2005 memory over-allocation, 1 P8 core SMT8, 10 GB memory, zswap off. X axis: JVM heap size (9 to 18 GB); Y axis: swap I/O (MB/s). Marked region: memory over-commitment.]
Single SAS disk used as swap device
Reaches its limit at ~100 MB/s (50% read)
In the memory thrashing case, the non-deterministic latency and performance degradation that I/O introduces could be fatal to your [...]
The I/O storm could even prevent you from connecting to your system or starting any [...]
We need a way to smooth out this I/O storm and performance cliff as memory demand meets memory capacity.
Zswap!
ZSWAP requirements
1. Zswap has been directly available in the Linux kernel since v3.11:
RedHat 7, CentOS 7, Fedora 19
Suse 12
Ubuntu 14.04
Enable zswap at boot by adding the option zswap.enabled=1 to the kernel command line in your boot loader.
2. Power NX (on-chip) acceleration (842) is only available for PowerVM and BareMetal Linux.
Not available today for PowerKVM guests.
cat /proc/device-tree/ibm,platform-facilities/ibm,compression-v1/status should return okay
Note: Ubuntu needs kernel 4.2 or above to get access to the Power NX hardware (starting with Ubuntu 15.10).
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1488495
Enable zswap HW compression with zswap.compressor=842 in your boot loader.
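The two requirements above can be checked with a small script. This is a sketch under stated assumptions: the function names (version_ge, numeric_release, check_zswap_prereqs) are hypothetical, the device-tree path is the one quoted on the slide, and the 3.11 threshold is the kernel version given above.

```shell
#!/bin/sh
# Succeeds when dotted numeric version $1 is >= $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Keep only the leading numeric part of a release string,
# e.g. "4.4.0-21-generic" -> "4.4.0".
numeric_release() {
    printf '%s\n' "${1%%[!0-9.]*}"
}

# Report whether the running kernel and platform meet the requirements.
check_zswap_prereqs() {
    rel=$(numeric_release "$(uname -r)")
    if version_ge "$rel" 3.11; then
        echo "kernel $rel: zswap available"
    else
        echo "kernel $rel: too old for zswap"
    fi
    nx=/proc/device-tree/ibm,platform-facilities/ibm,compression-v1/status
    if [ -r "$nx" ] && grep -q okay "$nx"; then
        echo "NX 842 compression: okay"
    else
        echo "NX 842 compression: not reported by the device tree"
    fi
}

check_zswap_prereqs
```

The version comparison is numeric field by field, so 3.9 is correctly treated as older than 3.11.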
Enabling POWER HW compression engine (842) with zswap
RedHat :
1- Enable Zswap with 842 compressor at boot time.
vi /etc/sysconfig/grub
add zswap.enabled=1 zswap.compressor=842 to GRUB_CMDLINE_LINUX
2- Regenerate your grub.cfg file.
grub2-mkconfig > /boot/grub2/grub.cfg
3- Add 842 kernel modules to your ramdisk
echo 842 > /etc/modules-load.d/842.conf
dracut -f
4- Reboot and verify with dmesg | grep zswap
[ 1.064790] zswap: loaded using pool 842/zbud
Enabling POWER HW compression engine (842) with zswap
Ubuntu:
1- Enable zswap with the 842 compressor at boot time.
vi /etc/default/grub
add zswap.enabled=1 zswap.compressor=842 to GRUB_CMDLINE_LINUX
2- Regenerate your grub.cfg file.
update-grub
3- Add the 842 kernel modules to your ramdisk
echo 842 > /etc/modules-load.d/842.conf
vi /usr/share/initramfs-tools/hooks/842
Add the following lines:
#!/bin/sh -e
PREREQS=""
case $1 in
prereqs) echo "${PREREQS}"; exit 0;;
esac
. /usr/share/initramfs-tools/hook-functions
force_load 842
Make the hook executable and rebuild the initramfs:
chmod +x /usr/share/initramfs-tools/hooks/842
update-initramfs -u
4- Reboot and verify with dmesg | grep zswap
[ 1.064790] zswap: loaded using pool 842/zbud
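After the reboot, it can also be useful to confirm that both options really reached the kernel command line. A minimal sketch, assuming the two option strings given in the steps above; the function names (cmdline_has, zswap_842_requested) are hypothetical.

```shell
#!/bin/sh
# Succeeds when the kernel command line $1 contains the exact option $2.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when both zswap options from the steps above are present.
zswap_842_requested() {
    cmdline_has "$1" zswap.enabled=1 && cmdline_has "$1" zswap.compressor=842
}

# Check the running system when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    if zswap_842_requested "$(cat /proc/cmdline)"; then
        echo "zswap with the 842 compressor was requested at boot"
    else
        echo "zswap.enabled=1 and/or zswap.compressor=842 missing from the kernel command line"
    fi
fi
```

Matching whole words avoids false positives such as zswap.enabled=10.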
Zswap parameters and monitoring
Zswap parameters are located in /sys/module/zswap/parameters
You can change:
- compressor : [ lzo or 842 ] default lzo
Compression algorithm to use
- enabled : [ Y or N ]
Enable zswap
- max_pool_percent : [ 1 to 100 ] default 20
Compressed pool size limit (in % of RAM)
- zpool : [ zbud or zsmalloc ] default zbud
Compressed pool allocator.
zbud : - stores 2 pages in one slot (compression ratio 2:1)
- evicts the oldest pages to disk when full
zsmalloc : - can store more pages per slot than zbud (compression ratio ~ 3:1)
- but unlike zbud, redirects new allocations to the paging device when full (does not recycle old pages).
You can monitor zswap activity by looking at the counters located in /sys/kernel/debug/zswap
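The parameters and counters above can be dumped with a short script. This is a sketch, not an official tool: it assumes the debugfs counters are named stored_pages and pool_total_size (as in mainline kernels of that period), that debugfs is mounted and readable (usually root only), and the helper names (zswap_ratio, dump_dir) are hypothetical.

```shell
#!/bin/sh
PARAMS=/sys/module/zswap/parameters
DEBUG=/sys/kernel/debug/zswap

# Effective compression ratio: uncompressed bytes stored / bytes used by the pool.
# Arguments: stored pages, page size in bytes, pool size in bytes.
zswap_ratio() {
    awk -v n="$1" -v s="$2" -v p="$3" 'BEGIN {
        if (p + 0 == 0) { print "n/a"; exit }
        printf "%.2f\n", (n * s) / p
    }'
}

# Print every readable file of a directory as "name = value".
dump_dir() {
    [ -d "$1" ] || return 0
    for f in "$1"/*; do
        [ -r "$f" ] && printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
    done
    return 0
}

dump_dir "$PARAMS"
dump_dir "$DEBUG"   # needs root and a mounted debugfs

if [ -r "$DEBUG/stored_pages" ] && [ -r "$DEBUG/pool_total_size" ]; then
    echo "ratio: $(zswap_ratio "$(cat "$DEBUG/stored_pages")" \
        "$(getconf PAGESIZE)" "$(cat "$DEBUG/pool_total_size")")"
fi
```

A parameter can be changed at runtime by writing to its file, for example echo 40 > /sys/module/zswap/parameters/max_pool_percent.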
zswap
[Diagram: uncompressed memory, zpool (zbud) and swap device, with zswap between them]
1- Compress/Uncompress
Pages are compressed into a pool kept in memory (zbud by default).
Scan/compress uses extra CPU cycles, but when a page fault occurs it is much faster to get pages from the compressed pool in memory than from disk.
2- Swap Out / Page Out
When the compressed zpool is full, zbud moves the oldest compressed pages to the swap device.
3- Swap In / Page In
When a page fault occurs and the pages are located in the paging device, those pages must be moved back to memory.
THIS HURTS PERFORMANCE!
ZSWAP Memory Over-Allocation test with SPECjbb2005
[Chart: Testing zswap (zbud) with SPECjbb2005, 1 P8 core SMT8, 10 GB memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB); Y axis: % bops vs nominal. Series: zswap off, zswap 842 (HW). Marked regions: memory over-commitment, zpool over-commitment.]
75% of nominal performance at 140% memory
ZSWAP HW vs Soft. compression
[Chart: Testing zswap (zbud) with SPECjbb2005, 1 P8 core SMT8, 10 GB memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB); Y axis: % bops vs nominal. Series: zswap off, zswap lzo, zswap 842 (HW). Marked regions: memory over-commitment, zpool over-commitment. Annotation: x1.5.]
ZSWAP Memory Over-Allocation test with SPECjbb2005
[Chart: Testing zswap (zbud) with SPECjbb2005, 1 P8 core SMT8, 10 GB memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB); Y axis: % bops vs nominal. Series: zswap 842 (HW). Marked regions: memory over-commitment, zpool over-commitment. Markers 1, 2 and 3 identify three cases along the curve.]
Case 1 : Zswap with Memory not Over-Committed
[Diagram: memory used (uncompressed), free memory, swap device]
Enough memory is available for the application
No/little swap I/O occurring
Zswap is idle (no CPU overhead)
=> You can use almost all the memory before zswap starts working
100% memory used (uncompressed)
100% CPU user
Best performance for the application
Case 2 : Zswap with Memory Over-Committed
[Diagram: memory used (uncompressed), zpool (zbud), zswap, swap device]
The application needs more memory than is available
Zswap starts working, compressing pages into and out of the zpool.
The zpool is growing
No/little swap I/O occurring
Below nominal performance due to memory scanning and unmapping.
Compression/decompression is offloaded to NX 842
25% CPU system due to page scanning
75% of nominal performance on a CPU-bound application (worst case)
Zswap with 842 (HW) vs LZO (Soft)
Zswap HW compression 842
10GB RAM, 14GB Java heap size
25% of system CPU (overhead) due to memory page scanning.
Compression offloaded to NX 842
75% of nominal performance
Zswap soft. compression LZO
10GB RAM, 14GB Java heap size
50% of system CPU (overhead) due to memory page scanning and compression
50% of nominal performance
50% better CPU usage with POWER HW compression
ZSWAP Memory Over-Allocation (Swap IO activity)
[Chart: Testing zswap (zbud) with SPECjbb2005, 1 P8 core SMT8, 10 GB memory, max_pool_percent=40. X axis: JVM heap size (9 to 18 GB); Y axis: swap I/O (kB/s). Series: zswap off, zswap on. Marked regions: memory over-commitment, zpool over-commitment. Markers 1, 2 and 3.]
No or little paging when running [...]
Case 3 : Zswap with Memory Over-Committed and Zpool Full
[Diagram: memory used (uncompressed), zpool (zbud) full at max_pool_percent=40, zswap, swap device with swap in/out]
The application needs more memory than is available
Zswap is working, compressing pages into and out of the zpool
The zpool reaches the max_pool_percent limit (the compressed pool is full). Space must be freed in the zpool
=> Swapping in/out! Performance degradation
75% CPU wait I/O; only 10% CPU user
10% of nominal performance due to waiting for pages on the swap device (swap in)
Zswap Conclusion
Zswap is not AME, but it can really help to reduce the impact of paging activity and secure your production system at no cost and with no penalty:
Power8 NX 842 compression engines are available for PowerVM and BareMetal Linux
No impact when memory demand is below the installed RAM capacity.
Can maintain your system at 75% of nominal performance in the 100% CPU case (the worst scenario)
Zswap with zbud gives a x1.4 memory expansion ratio (with max_pool_percent=40)
Need more? Then you can try zswap with the zsmalloc allocator.
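The x1.4 figure is consistent with a simple back-of-the-envelope model: the part of RAM outside the pool holds uncompressed pages, and the pool share holds pages at the allocator's compression ratio. This model is an assumption made here for illustration, not something stated on the slides, and the function name expansion_ratio is hypothetical.

```shell
#!/bin/sh
# Rough memory expansion estimate: the uncompressed part of RAM plus the
# pool share multiplied by the compression ratio.
# Arguments: max_pool_percent, compression ratio of the pool allocator.
expansion_ratio() {
    awk -v pct="$1" -v r="$2" \
        'BEGIN { printf "%.2f\n", ((100 - pct) + pct * r) / 100 }'
}

expansion_ratio 40 2   # zbud at 2:1 with max_pool_percent=40
expansion_ratio 40 3   # zsmalloc at about 3:1 with max_pool_percent=40
```

With max_pool_percent=40 the model gives 1.40 for a 2:1 ratio and 1.80 for a 3:1 ratio. Real workloads will deviate, since pages do not all compress equally well.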
Zswap with Zsmalloc compress pool (vs zbud)
[Diagram: uncompressed memory, zpool (zsmalloc) and swap device, with zswap between them]
1- Compress/Uncompress
Scan/compress uses extra CPU cycles, but when a page fault occurs it is much faster to get pages from the compressed pool in memory than from disk.
Zsmalloc can store more pages per slot than zbud (3:1 measured), resulting in a higher memory expansion ratio.
2- Swap In / Out
But compared to zbud, zsmalloc has no page replacement algorithm. When the zpool is full, page-out occurs directly from main memory to the paging device.
ZSWAP (zsmalloc) Memory Over-Allocation test with SPECjbb2005
[Chart: Testing zswap (zbud vs zsmalloc) with SPECjbb2005, 1 P8 core SMT8, 10 GB memory, max_pool_percent=40. X axis: JVM heap size (9 to 33 GB); Y axis: % bops vs nominal. Series: zswap off, zswap 842 (HW), zswap zsmalloc 842 (HW). Marked: memory over-commitment, zpool (zbud) limit, zpool (zsmalloc) limit. Annotation: x2.]
75% of nominal performance at x1.8 memory size
50% of nominal performance at x2 memory size
Monitor Zswap (zsmalloc) activity on 10GB VM with Grafana
[Screenshot: Grafana dashboard, annotated with 10GB, 15GB, 20GB, 25GB, 30GB, 35GB and 40GB marks]
Symmetric vs Asymmetric encryption
Asymmetric encryption:
SLOW/complex operation
Private key never distributed
Used to send the AES secret key
Not optimized by Power8
Symmetric encryption (AES):
FAST/simple operation
Secret key must be distributed
Optimized by Power8
Anatomy of an SSL/HTTPS request
[Diagram: exchange between client browser and server]
SSL handshake: executed only once, asymmetric encryption, secret key exchange
Data exchange: symmetric encryption
The majority of the exchange uses symmetric encryption
POWER8 Hardware Accelerator
NX
On-chip accelerators (NX):
Symmetric crypto: AES, SHA
True random number generator
Must be used through a hypervisor call for guest/VM
Better single-thread performance, larger bandwidth
Symmetric crypto currently not available for PowerKVM guests
In-core accelerators:
Symmetric crypto: AES, SHA
Cyclic Redundancy Check
Private per core
Leverage the vector unit (VMX)
Direct access for guest/VM
IBM - POWER8
12 cores per socket (from 3 to 4 GHz)
8 HW threads / core (SMT technology)
Large cache (96 MB : 8 MB / core)
High memory bandwidth (~200 GB/s)
AES Symmetric Cryptography / SHA Hash Engine
AES Key lengths: 128b,192b,256b
Combination AES-SHA / SHA-AES supported
Move the data once to encrypt/decrypt and/then authenticate
I/O buffer (IOB) provides function
8.9Gbps throughput per engine for AES 128 CBC Encrypt at 2.4GHz, 256B message
7Gbps engine throughput for SHA-512 at 2.4GHz, 256B message
Supports byte aligned source and target data buffers, scatter/gather
AES modes supported
Electronic Codebook (ECB)
Cipher Block Chaining (CBC)
Counter (CTR)
Counter with CBC-MAC (CCM)
Galois Counter Mode (GCM)
XCBC-MAC-96 (XMAC)
Hash mode supported
SHA1
SHA2 SHA-256
SHA2 SHA-512
Keyed-hash MAC (HMAC)
MD5
POWER8 Hardware Encryption
Source: Performance Characteristics of the POWER8 Processor, Alex Mericas, IBM Corporation
Algorithm   POWER7+ On-Chip   POWER8 On-Chip   POWER8 In-Core
AES-GCM     X                 X                X
AES-CTR     X                 X                X
AES-CBC     X                 X                X
AES-ECB     X                 X                X
SHA-256     X                 X                X
SHA-512     X                 X                X
RNG         X                 X                -
CRC         -                 -                X
Cycles per byte (1 core and in-core crypto):
Algorithm     POWER7+ (SW)   POWER8 (HW) single thread   POWER8 (HW) multi thread
SHA-512       35             10.7 (x3)                   2.6 (x13)
AES-128-ENC   17             4 (x4)                      0.8 (x21)
AES-256-ENC   21             5.5 (x3.8)                  1.1 (x19)
On-Chip Hardware Accelerators introduced with POWER7+
POWER8 has the same accelerators
Offload encryption for OS-based large messages (encrypted file systems, etc.)
On a virtualized system, access to the On-Chip (NX) Hardware Accelerators needs to be made through a hypervisor call.
In-Core acceleration is directly accessible to virtualized guests (no hypervisor call needed).
POWER8 includes user-mode instructions to accelerate common algorithms
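Linux on Power hypervisor compatibility matrix
Columns: Accelerator, Feature, Bare metal, PowerVM guest, PowerKVM guest
On-chip features: Compression (842), AES, RNG
In-core features: AES, SHA, CRC
[The per-cell support marks were graphical and did not survive in this text]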
P8 Hardware Encryption Acceleration
Combination of on-chip accelerators for CPU offload with larger blocks of encryption work, and
in-core instructions for small data sizes.
Exploitation available transparently under OS services and APIs
[Diagram: layered stack (hardware, kernel, user space, applications) marking where acceleration can be exploited. Labels: on-chip crypto, in-core crypto, random number generation, physical TPM; hypervisor H_COP calls; /dev/random, /dev/urandom; IPsec, TCP/IP, encrypted file system; OpenSSL 1.0.2 libcrypto, GSKit, standard crypto APIs, cryptographic library in C, custom application use/libs; key generation (strong keys), encrypted data in flight, encrypted data at rest.]
How to enable the in-core crypto accelerator:
In Java, starting with IBM Java 7.1, AES is accelerated with the POWER8 in-core AES instructions by specifying -Dcom.ibm.crypto.provider.doAESInHardware=true on the JVM command line.
OpenSSL 1.0.2 and later uses the VMX in-core P8 instructions and optimizations for AES/SHA.
All applications based on this version of OpenSSL benefit from P8 encryption acceleration.
Ubuntu: OpenSSL 1.0.2 in Ubuntu 15.10 and 16.04
RedHat: still on OpenSSL 1.0.1 => crypto not accelerated
Fedora 23: OpenSSL 1.0.2
Suse 12, OpenSuse 13: still on OpenSSL 1.0.1 => crypto not accelerated
What can you do if you do not have OpenSSL 1.0.2?
Code recompilation with "Advance Toolchain (v9)"
"Advance Toolchain" is a gcc-based toolchain (provided by IBM for free) that provides POWER-optimized libraries (like libcrypto).
You can then enable HW crypto acceleration for your application even if your Linux distribution does not provide the latest libcrypto (OpenSSL 1.0.2).
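Whether a given system falls on the accelerated side of the list above can be estimated from the output of openssl version. A minimal sketch, assuming the 1.0.2 threshold from the slide; the function names are hypothetical, and the check only looks at the version number, not at how the distribution built its package.

```shell
#!/bin/sh
# Extract the numeric version from `openssl version` output,
# e.g. "OpenSSL 1.0.2g  1 Mar 2016" -> "1.0.2".
openssl_numeric_version() {
    printf '%s\n' "$1" | awk '{ v = $2; sub(/[^0-9.].*$/, "", v); print v }'
}

# Succeeds when dotted numeric version $1 is >= $2.
version_ge() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        na = split(a, x, "."); nb = split(b, y, ".")
        n = (na > nb) ? na : nb
        for (i = 1; i <= n; i++) {
            if (x[i] + 0 > y[i] + 0) exit 0
            if (x[i] + 0 < y[i] + 0) exit 1
        }
        exit 0
    }'
}

# Succeeds when the version string indicates OpenSSL 1.0.2 or later.
openssl_has_p8_crypto() {
    version_ge "$(openssl_numeric_version "$1")" 1.0.2
}

if command -v openssl >/dev/null 2>&1; then
    v=$(openssl version)
    if openssl_has_p8_crypto "$v"; then
        echo "$v: 1.0.2 or later, in-core AES/SHA code paths expected"
    else
        echo "$v: older than 1.0.2, crypto not accelerated"
    fi
fi
```

To see the effect rather than infer it, openssl speed -evp aes-128-gcm can be compared between the distribution build and an Advance Toolchain build.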
IBM Advance Toolchain for PowerLinux
URLs:
IBM Advance Toolchain for PowerLinux Documentation
Improving performance with IBM Advance Toolchain for PowerLinux
Description:
The IBM Advance Toolchain for PowerLinux is a set of open source development tools and
runtime libraries which allows users to take leading edge advantage of IBM's latest POWER
hardware features on Linux.
Over time, these libraries and latest compiler technologies are integrated into the shipping
distributions.
However, the IBM Advance Toolchain for PowerLinux contains the latest tested and
supported GNU Compiler Collection (GCC) compiler versions, tailored for Power systems, and
packaged together with an expanding set of processor-tuned libraries, allowing you to take
advantage of the latest technology without waiting.
Example of Apache and wget compiled with Advance Toolchain (1/3)
The idea was to recompile Apache and wget with Advance Toolchain so that they use the Power8 HW in-core cryptography and improve performance.
Recompile on PowerLinux:
Get the source code of Apache and wget from the community
Install Advance Toolchain AT9
Recompile out of the box with the following flags; no source code changes are required.
export CFLAGS="-O3 -m64 -mcpu=power8 -mtune=power8"
export PATH=/opt/at9.0/bin/:$PATH
Configure, make and make install
Simple test: download a 10 GB file with wget from the Apache web server over HTTPS
[Diagram: wget downloads the 10 GB file from Apache (httpd) over SSL on the loopback interface]
Example of Advance Toolchain with Apache and wget (2/3)
Standard Apache and wget provided by the repo: transfer done in 3m10s
Apache and wget compiled with Advance Toolchain: transfer done in 23s
Example of Advance Toolchain with Apache and wget (3/3)
[Profiler screenshots: standard build vs Advance Toolchain build]
Profiling shows that the AT version uses the P8-accelerated versions of ghash and aes
Example 2 : J2EE Application benchmark (DayTrader application)
60% better CPU utilisation with Power in-core encryption
[Charts: CPU utilisation with P8 HW crypto vs without P8 HW crypto]
Enabling SMT on PowerKVM guests (1/2)
How do we schedule guest VCPUs onto physical CPU cores?
Introduce the notion of a "virtual core" (vcore).
VCPUs are allocated to vcores before being dispatched by the PowerKVM host to a real core.
By default, 1 VCPU = 1 vcore.
Can be modified to x VCPUs = 1 vcore to enable SMT.
Example: PowerKVM with 2 P8 cores
Guest1, 2 vcpus. Default: 2 vcores, 1 thread
Guest2, 4 vcpus. Manually defined: 1 vcore, 4 threads (guest2.xml):
<vcpu>4</vcpu>
<cpu>
<topology sockets='1' cores='1' threads='4'/>
</cpu>
WAIT: no free core available. The vcore cannot be dispatched and waits for the next dispatch (time sharing).
An SMT level other than 1 slows down guest dispatching.
Enabling SMT on PowerKVM guests (2/2)
To configure a KVM guest, the number of VCPUs must be set to the product of the cores and the threads per core assigned to the guest, and the number of threads per core must be explicitly set.
vcpu = sockets x cores x threads
For example, when using libvirt, you can configure a guest with the following settings to get a guest with SMT=8 and 2 cores (16 vcpus in total):
<vcpu>16</vcpu>
<cpu>
<topology sockets='1' cores='2' threads='8'/>
</cpu>
With that configuration, the guest OS is able to enable SMT=8 (default) and use the 16 threads across the two assigned cores.
This also allows the guest to dynamically control the SMT level directly from the OS (ppc64_cpu --smt=x).
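The vcpu rule above is easy to get wrong when editing guest definitions by hand, so the two elements can be generated together. A minimal sketch: the function names (vcpu_count, guest_topology_xml) are hypothetical, and the output is only the fragment shown on the slide, not a complete libvirt domain definition.

```shell
#!/bin/sh
# vcpu = sockets x cores x threads
vcpu_count() {
    echo $(( $1 * $2 * $3 ))
}

# Print the libvirt <vcpu> and <cpu><topology> elements for a guest.
# Arguments: sockets, cores, threads per core.
guest_topology_xml() {
    printf "<vcpu>%s</vcpu>\n" "$(vcpu_count "$1" "$2" "$3")"
    printf "<cpu>\n"
    printf "  <topology sockets='%s' cores='%s' threads='%s'/>\n" "$1" "$2" "$3"
    printf "</cpu>\n"
}

# The example from the slide: 1 socket, 2 cores, SMT=8.
guest_topology_xml 1 2 8
```

The fragment can then be pasted into the guest definition opened with virsh edit.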
Enabling SMT topology with Kimchi on PowerKVM 3.1
Default guest SMT mode is 1 VCPU/vcore
Inefficient use of resources in whole-core mode (1 thread/core)
Often chosen by users who are not familiar with POWER
Often chosen by management agents (e.g. OpenStack)
Setting the topology is too complex in a big cloud environment
Up to now, the default core-split mode was whole-core
Good for single-thread performance
Allows users to run SMT1, SMT2, SMT4 and SMT8 guests
Hits over-commitment early, especially with SMT1 guests
With a 20-core P8, at most 20 vcpus are dispatched in parallel by default.
PowerKVM 3.1 addresses these points with 2 features:
1. (sub)core sharing (piggybacking)
2. Dynamic multi-threading (split-core)
[Diagrams: PowerKVM with 2 P8 cores running Guest 1 (2 vcpus); PowerKVM with 2 P8 cores running Guest 1 and Guest 2]
PowerKVM Micro-Threading (Split-Core)
No split-core:
1 full core available with up to 8 parallel threads
Only 1 guest running at a time
(the only mode available with PowerVM)
Split-core by 2:
2 subcores available, each with up to 4 parallel threads.
Up to 2 guests running at a time
Split-core by 4:
4 subcores available, each with up to 2 parallel threads.
Up to 4 guests running at a time
[Diagram: one IBM Power8 core shown whole, split into 2 subcores, and split into 4 subcores]
PowerKVM Micro-Threading (Split-Core)
VM1 VM2 VM3 VM4
Context switching (hypervisor overhead)
time
Fullcore
thr1 thr2 thr3 thr4
thr5 thr6 thr7 thr8
Full core
POWER8
Power8 is a 8 threads processor.
All threads share MMU(1) context, therefore must
be in same partition.
Guests in single thread (SMT 1) mode cannot use
the full core capacity.
Micro-Threading benefits:
Better CPU resources usage
More virtual machines per core
Reduces over-commitment overhead (context
switch)
Micro-Threading limitations:
Guest SMT is limited to 4 or 2, depending on the
split-core level (half core or quarter core)
The core always runs in SMT8 mode (lower
single-thread performance)
PowerKVM introduces the possibility to split a Power8 core into 2 or 4 subcores: Micro-Threading
(static in PowerKVM 2.1, dynamic in PowerKVM 3.1).
Each subcore has its own MMU(1) and can be dispatched independently to a different guest (VM).
(1) The MMU (Memory Management Unit) is the hardware unit
that maps virtual addresses to physical addresses.
(Diagram: with Micro-Threading, the POWER8 core threads thr1 to thr8 are grouped into subcore1 to subcore4, and VM1 to VM4 each run on their own subcore over time.)
PowerKVM 3.1 Dynamic Micro-Threading (SubCores)
With PowerKVM 3.1, the hypervisor may dynamically choose to split each core by two or by
four in order to match vcpu needs with the available hardware resources.
Example 1: PowerKVM 3.1 with 1 P8 core, 2 guests running. Splitting by 2 is optimum.
Guest1, 2 vcpus, manually defined: 1 vcore, 2 threads
<topology sockets='1' cores='1' threads='2'/>
Guest2, 4 vcpus, manually defined: 1 vcore, 4 threads
<topology sockets='1' cores='1' threads='4'/>
Example 2: PowerKVM 3.1 with 1 P8 core, 3 guests running. Splitting by 4 is optimum.
Guest1, 2 vcpus, manually defined: 1 vcore, 2 threads
<topology sockets='1' cores='1' threads='2'/>
Guest2, 2 vcpus, manually defined: 1 vcore, 2 threads
<topology sockets='1' cores='1' threads='2'/>
Guest3, 2 vcpus, manually defined: 1 vcore, 2 threads
<topology sockets='1' cores='1' threads='2'/>
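The choice between splitting by two and by four in these two examples can be reproduced with a small model. This is only a sketch of the reasoning, not the hypervisor's actual algorithm: it assumes every guest has a single vcore, ignores piggybacking, and picks the smallest split level that runs the most guests at once. The function name is illustrative.

```python
from typing import List

THREADS_PER_CORE = 8  # POWER8 hardware threads per core


def best_split(guest_threads: List[int]) -> int:
    """Pick the split level (1, 2 or 4) that runs the most guests at once.

    Each entry of guest_threads is the thread count of a one-vcore guest.
    A split level is usable only if every guest fits in one subcore.
    """
    best, best_running = 1, 0
    for split in (1, 2, 4):
        threads_per_subcore = THREADS_PER_CORE // split
        if any(t > threads_per_subcore for t in guest_threads):
            continue  # at least one guest would not fit in a subcore
        running = min(split, len(guest_threads))
        if running > best_running:  # strict: ties keep the smaller split
            best, best_running = split, running
    return best


print(best_split([2, 4]))     # one 2-thread guest and one 4-thread guest
print(best_split([2, 2, 2]))  # three 2-thread guests
```

With a 2-thread and a 4-thread guest, the model settles on a split by 2, because a split by 4 would leave only 2 threads per subcore. With three 2-thread guests it settles on a split by 4.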
To manually and statically set the level of subcoring, use at PowerKVM host level:
ppc64_cpu --subcores-per-core # Get number of subcores per core
ppc64_cpu --subcores-per-core=X # Set subcores per core to X (1,2 or 4)
ppc64_cpu --threads-per-core # Get threads per core
(Changing the number of subcores requires all VMs to be offline)
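When these commands are driven from a script, it can help to validate the requested value before calling the tool. The helper below is a hypothetical sketch that only builds the argument list for the ppc64_cpu options shown above; it does not run anything itself.

```python
from typing import List, Optional


def subcore_command(subcores: Optional[int] = None) -> List[str]:
    """Build the ppc64_cpu argument list to get or set subcores per core.

    With no argument the command queries the current value; otherwise it
    sets it. Only 1, 2 and 4 are accepted, as on the slide.
    """
    if subcores is None:
        return ["ppc64_cpu", "--subcores-per-core"]
    if subcores not in (1, 2, 4):
        raise ValueError("subcores per core must be 1, 2 or 4")
    return ["ppc64_cpu", f"--subcores-per-core={subcores}"]


print(" ".join(subcore_command()))
print(" ".join(subcore_command(4)))
```

On a PowerKVM host, the returned list could be passed to subprocess.run() once all guests have been shut down.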
PowerKVM 3.1
Micro-Threading (Subcore) DEMO
PowerKVM 3.1 Dynamic Micro-Threading (SubCores) DEMO
The demonstration is done with:
4 guests (virtual machines), all pinned onto one single core of a 20-core S822L Power8 server.
PowerKVM 3.1 virtualization.
Each guest is defined with a manual topology of 1 vcore and 2 threads.
(Diagram: PowerKVM 3.1 with 1 P8 core running four guests, split1 to split4. Each guest has 2 vcpus, manually defined as 1 vcore, 2 threads:)
<topology sockets='1' cores='1' threads='2'/>
PowerKVM 3.1 Dynamic Micro-Threading (SubCores) DEMO
(guest topology is 1 vcore, 2 threads)
(Figure: three panels showing which guest, split1 to split4, occupies each of the 8 core threads in each time slice. Axes: Time Slice and Core Threads 1 to 8. Panels: No Micro-Threading allowed; Micro-Threading with 2 sub-cores max; Micro-Threading with 4 sub-cores max.)
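The three demo cases can be approximated with a simple round-robin model. It assumes the four guests of the demo (1 vcore, 2 threads each), one guest per subcore, and no sharing of a subcore between different guests; function names are illustrative.

```python
THREADS_PER_CORE = 8  # POWER8 hardware threads per core
GUEST_THREADS = 2     # each demo guest: 1 vcore, 2 threads


def time_slices(guests, max_subcores):
    """Group guests into time slices, max_subcores guests per slice."""
    return [guests[i:i + max_subcores] for i in range(0, len(guests), max_subcores)]


def busy_fraction(guests_in_slice):
    """Share of the core's hardware threads that run guest vcpus in a slice."""
    return len(guests_in_slice) * GUEST_THREADS / THREADS_PER_CORE


guests = ["split1", "split2", "split3", "split4"]
for max_subcores in (1, 2, 4):
    slices = time_slices(guests, max_subcores)
    print(max_subcores, "subcore(s):", len(slices), "slice(s),",
          f"{busy_fraction(slices[0]):.0%} of threads busy")
```

In this model, the four guests need four time slices without Micro-Threading, two with 2 subcores and a single one with 4 subcores, while the share of busy hardware threads grows accordingly.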
400 VMs on a (small) 20-core S822LC?
Thanks to split-core (and piggybacking), even 400 VMs on a small
but nevertheless powerful IBM S822LC is OK (even if definitely extreme).
(Chart: PowerKVM 3.1 split-core benefits. pgbench PostgreSQL workload (tps) against the number of VMs.)
Guest = 2 vcpus (default topology: 2 vcores, 1 thread)
Chart annotations:
No need to split with 20 VMs (thanks to piggybacking)
Split-core helps optimize core utilization
Almost like PowerKVM 2.1 (piggybacking not available with PowerKVM 2.1)
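To get a feeling for how extreme the 400-VM case is, the vcpu-to-hardware-thread ratio can be worked out as below. This is plain arithmetic on the figures given on the slide (2 vcpus per guest, 20 cores, 8 threads per core), not a measurement; function names are illustrative.

```python
def overcommit_ratio(vms: int, vcpus_per_vm: int, cores: int,
                     threads_per_core: int = 8) -> float:
    """Total guest vcpus divided by the hardware threads of the host."""
    return (vms * vcpus_per_vm) / (cores * threads_per_core)


def vcores_per_core(vms: int, vcores_per_vm: int, cores: int) -> float:
    """Average number of vcores competing for each physical core."""
    return (vms * vcores_per_vm) / cores


for vms in (20, 400):
    print(f"{vms} VMs: {overcommit_ratio(vms, 2, 20):.2f} vcpus per hardware thread, "
          f"{vcores_per_core(vms, 2, 20):.0f} vcores per core")
```

With 20 VMs there are only 2 vcores per core, both from guests using the default 2-vcore topology, which is consistent with the note that piggybacking is enough and no split is needed.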