JT Kellington (IBM) and Allan Cantle (Nallatech) present at the 2015 HPCC Systems Engineering Summit Community Day on porting HPCC Systems to the POWER8-based ppc64el architecture.
1. OpenPOWER Acceleration of HPCC Systems
2015 LexisNexis® Risk Solutions HPCC Engineering Summit – Community Day
Presenters: JT Kellington & Allan Cantle
September 29, 2015
Delray Beach, FL
Accelerating the Pace of Innovation
2. Overview/Agenda
• The OpenPOWER Story
• Porting HPCC Systems to ppc64el
• POWER8 at a Glance
• POWER8 Data Centric Architectures with FPGA Acceleration
• Summary
4. OpenPOWER, a catalyst for Open Innovation
Market Shifts
• Moore's Law no longer delivers the needed performance gains
• Growing workload demands
• Numerous IT consumption models
• Mature open software ecosystem

New Open Innovation
• Performance of the POWER architecture – amplified capability
• Open development – open software, open hardware
• Collaboration of thought leaders – simultaneous innovation across multiple disciplines

The OpenPOWER Foundation is an open development community, using the POWER Architecture to serve the evolving needs of customers.
• Rich software ecosystem
• Spectrum of POWER servers
• Multiple hardware options
• Derivative POWER chips

• 145+ members investing in the POWER architecture, with geographic, sector, and expertise diversity
• 1,500 ISVs serving Linux on POWER, now with easier porting from x86!
• Open from the chip through software
• 9 workgroups chartered, 15 hardware innovations revealed at the March 2015 summit
• Partners have announced plans, offerings, and deployments
www.openpowerfoundation.org
6. HPCC Systems and POWER8 Designed For Easy Porting
HPCC Systems Framework
• ECL is platform agnostic
• Hardware and architecture changes are transparent to the end user
• CMake, gcc, binutils, Node.js, etc.
• Open source! Quick and easy collaboration

POWER8 Designed for Open Innovation
• Little-endian OS – more compatible with the industry
• IBM Software Development Kit for Linux on Power – Advance Toolchain, post-link optimization, Linux performance analysis tools
• Full support in Ubuntu, RHEL, and SLES
7. Porting Details
Initial port took less than a week!
• Added ARCH defines for PPC64EL (system/include/platform.h)
• Added a stack description for POWER8 (ecl/hql/hqlstack.hpp)
• Added binutils support for powerpc (ecl/hqlcpp/hqlres.cpp)
Deployment and testing took a bit longer
• A couple of bugs found:
• Stricter alignment required for atomic operations
• The newer compiler no longer allows comparing "this" to NULL (illustrated in the sketch below)
• Updates to the init system with the latest Ubuntu
• Monkey on the keyboard!
• Basic setup and running was easy, but getting to the "next level" was harder
• Storage configuration, memory usage, thread support, etc.
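To make these items concrete, here is a minimal C++ sketch of the flavor of changes involved; the macro name _ARCH_PPC64EL_ and the Counter and Node types are illustrative stand-ins, not the actual HPCC Systems code.

// Illustrative sketch only -- names are hypothetical, not the HPCC Systems source.

// 1. Architecture define for little-endian 64-bit PowerPC, in the spirit of the
//    platform.h change (uses gcc's predefined macros).
#if defined(__powerpc64__) && defined(__BYTE_ORDER__) && \
    (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__)
#define _ARCH_PPC64EL_ 1
#endif

#include <atomic>
#include <cstdint>

// 2. Stricter alignment for atomics: make the natural alignment explicit rather
//    than relying on x86's more forgiving handling of unaligned accesses.
struct alignas(8) Counter
{
    std::atomic<std::uint64_t> value{0};
};

// 3. Newer compilers reject (or constant-fold away) "this == NULL", since "this"
//    can never legally be null; move the null check to the caller instead.
class Node
{
public:
    bool isAttached() const { return parent != nullptr; } // not: this == NULL
private:
    Node *parent = nullptr;
};

int main()
{
    Counter c;
    c.value.fetch_add(1);
    Node n;
    return n.isAttached() ? 1 : 0;
}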
9. POWER8 Is Designed For Big Data!
Technology
• 22 nm SOI, eDRAM, 15 metal layers, 650 mm²
Caches
• 512 KB SRAM L2 per core
• 96 MB shared eDRAM L3
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Up to 48x integrated PCIe Gen3 per socket
• SMP interconnect
• CAPI
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 execution pipes
• 2x internal data flows/queues
• Enhanced prefetching
• 64 KB data cache, 32 KB instruction cache per core
Accelerators
• Crypto and memory expansion
• Transactional memory
• VMM assist
• Data move / VM mobility
POWER8 Scale-Out Dual Chip Module
[Diagram: two chips, each with six cores and per-core L2 caches sharing six L3 banks over a chip interconnect, plus memory buses, SMP interconnects, CAPI, and PCIe links off each chip.]
10. POWER8 Introduces CAPI Technology
[Diagram: a POWER8 processor whose CAPP connects over PCIe to an FPGA containing the IBM-supplied POWER Service Layer and a custom Accelerated Functional Unit (AFU); CAPI makes the AFU a coherent peer of the cores.]
Typical I/O model flow: device driver call → copy or pin source data → MMIO notify accelerator → acceleration → poll/interrupt for completion → copy or unpin result data → return from device driver on completion.
Flow with a coherent model: notify accelerator via shared memory → acceleration → completion via shared memory.
Advantages of coherent attachment over I/O attachment:
• Virtual addressing and data caching (significant latency reduction)
• Easier, natural programming model (avoids application restructuring; sketched below)
• Enables applications not possible over I/O (pointer chasing, shared-memory semaphores, …)
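For intuition only, here is a small self-contained C++ sketch of the coherent flow; the accelerator is simulated by a thread, and WorkElement and the doorbell are invented names, not the actual CAPI/PSL interface. Notification and completion are just loads and stores to shared memory, with no driver call, pinning, or interrupt.

// Conceptual sketch of the coherent (CAPI-style) flow; not the real CAPI API.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

struct WorkElement
{
    const std::uint8_t *src = nullptr;   // ordinary virtual address: no pin/copy
    std::uint64_t       len = 0;
    std::uint64_t       sum = 0;
    std::atomic<bool>   done{false};     // completion flag in shared memory
};

int main()
{
    std::vector<std::uint8_t> data(1 << 20, 1);
    WorkElement we;
    we.src = data.data();
    we.len = data.size();

    std::atomic<WorkElement *> doorbell{nullptr};

    // Simulated AFU: waits on the shared-memory doorbell, works directly on the
    // application's buffer via virtual addresses, then signals completion.
    std::thread afu([&] {
        WorkElement *w = nullptr;
        while ((w = doorbell.load(std::memory_order_acquire)) == nullptr) {}
        for (std::uint64_t i = 0; i < w->len; ++i)
            w->sum += w->src[i];
        w->done.store(true, std::memory_order_release);
    });

    // Coherent flow: shared-memory notify, then wait on shared-memory completion
    // (contrast: driver call, pin/copy, MMIO doorbell, interrupt in the I/O flow).
    doorbell.store(&we, std::memory_order_release);
    while (!we.done.load(std::memory_order_acquire)) {}

    std::printf("sum = %llu\n", static_cast<unsigned long long>(we.sum));
    afu.join();
    return 0;
}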
11. CAPI Attached Flash Optimization
• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs from applications to eliminate 97% of the code path length (see the POSIX-style example below)
• Saves 20-30 cores per 1M IOPS
[Diagram: conventional path – application → read/write syscall → file system → LVM → disk and adapter device drivers (strategy()/iodone(): pin buffers, translate, map DMA, start I/O; interrupt, unmap, unpin, iodone scheduling), roughly 20K instructions per I/O. CAPI path – application → user library exposing a POSIX async I/O style API (aio_read()/aio_write()) over a shared-memory work queue, reduced to fewer than 2,000 instructions.]
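For a sense of what a "POSIX async I/O style API" looks like from the application side, here is plain POSIX AIO from <aio.h>; this is the standard library interface the CAPI Flash user library resembles, not the CAPI library's own API, and the file path is an arbitrary example.

/* Standard POSIX AIO, shown only to illustrate the aio_read()-style interface;
 * link with -lrt on Linux. */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/tmp/sample.dat", O_RDONLY);   /* arbitrary example file */
    if (fd < 0) { perror("open"); return 1; }

    static char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) != 0) { perror("aio_read"); return 1; }   /* submit */

    while (aio_error(&cb) == EINPROGRESS)
        ;   /* poll for completion; a real application would overlap useful work */

    printf("read %zd bytes asynchronously\n", (ssize_t)aio_return(&cb));
    close(fd);
    return 0;
}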
12. CAPI Unlocks the Next Level of Performance for Flash
Identical hardware with two different paths to the data: IBM FlashSystem attached via conventional I/O (Fibre Channel) versus CAPI.
[Charts: IOPS per hardware thread and latency in microseconds, conventional vs. CAPI, measured on an IBM POWER S822L.]
• >5x better IOPS per HW thread
• >2x lower latency
13. POWER8 Data Centric Architectures with FPGA Acceleration
• Allan Cantle – President & Founder, Nallatech
14. Nallatech at a glance
Server-qualified accelerator cards featuring FPGAs, network I/O, and an open-architecture software/firmware framework
» Energy-efficient high-performance heterogeneous computing
» Real-time, low-latency network and I/O processing
» Design services for application porting
» IBM OpenPOWER partner
» Altera OpenCL partner
» Xilinx Alliance partner
» 20-year history of FPGA acceleration, including IBM System z qualification and 1000+ node deployments
15. Heterogeneous Computing – Motivation
• Efficient, evolutionary compute performance
• 1998 – DARPA Polymorphic Computing
• 2003 – the Power Wall
[Diagram: multicore microprocessors, GPGPU processors, and FPGA processors mapped against application data types – symbolic, streaming/SIMD/vector (DSP), and bit level.]
Polymorphic processor: a processor that is equally efficient at processing all the different data types.
SWEPT efficiency: Size, Weight, Energy, Performance, Time.
17. FPGAs for “Software Accessible” Compute Acceleration
• FPGAs are the best processors for:
• “In the pipe” stream computing operations
• Highly parallel bit- and byte-level computation
• FPGAs are NOW programmable by software engineers with OpenCL! (a minimal kernel sketch follows)
• Industry-standard language managed by Khronos
• Heterogeneous language of choice for all processor types
• Low-level, explicitly parallel language
• Heterogeneous support is evolving
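To give a flavor of what OpenCL programmability means for a software engineer, here is a minimal OpenCL C kernel (OpenCL C is a C-based language) doing the kind of byte-level filtering an FPGA streams well; the kernel name and operation are illustrative, not taken from the presentation.

/* Minimal OpenCL C kernel sketch: a byte-level filter of the kind FPGAs stream
 * efficiently. Kernel and buffer names are illustrative only. */
__kernel void byte_filter(__global const uchar *in,
                          __global uchar       *out,
                          const uchar           match)
{
    size_t i = get_global_id(0);           /* one work-item per input byte */
    out[i] = (in[i] == match) ? 1 : 0;     /* emit a simple match bitmap */
}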
19. Typical Scale-Out Thor & Roxie Cluster Configurations
[Diagram: a 400-node Thor cluster and a 100-node Roxie cluster, each behind a hierarchical 400-port or 100-port network switch with 1 GB/s links per node; Thor nodes use disks at 400 MB/s and 140 IOPS each, Roxie nodes use SSDs at 1.1 GB/s and 100K IOPS each.]
400-node Thor cluster:
• CPU-intensive operations on fragmented storage data
• Average 1 GB/s to storage
• Mapper-type parallel operations
• Aggregate 160 GB/s storage bandwidth (400 x 400 MB/s)
• Aggregate 56K IOPS (400 x 140 IOPS)
100-node Roxie cluster:
• CPU-intensive operations on fragmented storage data
• Average 1 GB/s to storage; pre-indexing eases the limitation
• Mapper-type parallel operations
• Aggregate 110 GB/s storage bandwidth (100 x 1.1 GB/s)
• Aggregate 10M IOPS (100 x 100K IOPS)
20. Memory Coherency & Data Movement Conflict
• Mapper type – needle-in-a-haystack operations
• Little to no compute on lots of data
• But hey, it’s not a problem for the Xeon, so why worry?
• Creates unnecessary traffic congestion across I/O interfaces
• I/O management consumes CPU threads
• Excessive cache thrashing consumes memory bandwidth
• E.g. only 8x 4K I/O buffers fit in a 32K L1 data cache
• Power and efficiency issues from all the unnecessary data movement
[Diagram: data movement example of a CPU operating on large data files with mapper-type functions – roughly 100x as much data crosses the memory bus and L1/L2/L3 caches as leaves through the 10GbE network controller (1x). The burden of using CPU-centric architectures for data-centric problems. Legend: I/O bus, memory bus.]
21. Memory Coherency & Data Movement Conflict
[Diagram: example of a CPU moving stored data to the network for another CPU – again roughly 100x as much traffic over the memory bus and caches as out the 10GbE network controller (1x).]
• Processor-intensive operations on fragmented data across the cluster
• The CPU is burdened by passing lots of fragmented data from storage to the network
• Similarly, the CPU has to wait for fragmented data to arrive
• Network congestion and latency become big issues
• Fewer CPU cycles and less cache memory are available for compute-intensive operations
22. Homogeneous Commodity Clusters still win out…right?
• Maybe, but storage is getting a LOT faster with NVMe!
[Diagram: the same CPU-centric example, now with 24 NVMe drives offering a potential 72 GB/s of storage bandwidth against roughly 60 GB/s of available memory bandwidth.]
The increased storage I/O bandwidth would saturate the CPU’s available memory bandwidth, and the CPU-centric architecture breaks down completely.
23. IBM initiated Data Centric Transformation with CAPI
• CAPI = Coherent Accelerator Processor Interface
• Transport over the PCIe bus
• IBM’s CAPI Flash product
• Makes flash memory appear as block memory (BM)
• The FPGA provides the “translation” from block memory to flash commands
• CPU threads for storage management reduced from 40-50 down to 2
• CAPI-attached network adapter
• 3 Symmetric Multiprocessing (SMP) interconnects for scale-up architectures
[Diagram: a POWER8 socket with 230 GB/s memory bandwidth and 3 SMP links, a 10GbE network controller on PCIe, and 56 TB of flash attached through a CAPI FPGA at 4 GB/s – an IBM standard flash drawer vs. a CAPI flash drawer.]
24. Introducing In-Storage Acceleration over CAPI
• Extend CAPI Flash Product Innovations
• Increase CAPI Flash Bandwidth
• FPGA Acceleration
• Leverage NVMe IOPS & Sequential data rates
• Data Compression, Encryption, Filtering
• Bit/Byte Level Stream Computing
• Very power efficient architecture
[Diagram: the POWER8 socket (230 GB/s memory, 3 SMP links) connects over CAPI at 4 GB/s to an FPGA acceleration layer placed directly in front of the 56 TB of storage and the 10GbE network controller; FPGA-side links of 96 GB/s and 16 GB/s keep bulk data movement off the CPU’s memory bus.]
25. P8 Node with CAPI Flash – Available Today
[Diagram: two 12-core POWER8 sockets linked at 76.8 GB/s, each with 1 TB of memory at 230 GB/s, a 40/100GbE NIC, and two CAPI-attached 56 TB flash drawers, each reached through an FPGA at 4 GB/s and 1 MIOPS. Maximum configuration shown.]
• Possible Configuration for Scale Out Roxie & Thor Clusters
26. Scale Out P8 Node with FPGA accelerated CAPI Flash
[Diagram: the same two-socket node, but each 56 TB drawer is now FPGA accelerated – 60 GB/s and 14.4 MIOPS inside the drawer – while the CAPI link to the host remains 4 GB/s and 1 MIOPS. Maximum configuration shown.]
• Possible Configuration for Scale Out Roxie & Thor Clusters
27. Scale Up IBM E870/E880 with Scale out FPGA Acceleration
[Diagram: four POWER8 sockets interconnected at 76.8 GB/s, each with 1 TB of memory at 230 GB/s and an FPGA-accelerated 56 TB drawer (4 GB/s CAPI to the host; 60 GB/s and 14.4 MIOPS within the drawer).]
• Balanced scale-up and scale-out solution
• Scale-up processors are ideal for compute-intensive operations
• Scale-out FPGA-accelerated storage is ideal for parallel mapper-type operations
29. In Summary: HPCC Systems and OpenPOWER Are Ready For Business!
• IBM POWER8’s data-centric architecture is ideal for big data problems
• FPGAs provide an ideal streaming, near-storage accelerator
• Opportunity to re-architect HPCC Systems’ ECL for POWER8 with FPGA acceleration
• This would provide a step change in performance at lower power and a smaller footprint
• A community effort will help accelerate the transition
Contact information: JT Kellington | Allan Cantle
Phone: 512-286-6948 | 805-377-4993
Email: jwkellin@us.ibm.com | a.cantle@nallatech.com
Web: www.ibm.com | www.nallatech.com