Saviak lviv ai-2019-e-mail (1)

HPE Solutions for Challenges in AI and Big Data
Volodymyr Saviak, November 2019

Agenda
1. Introduction
2. Storage for Challenging AI & Big Data Projects
3. Future of Data Storage Paradigm – In-Memory Computing
2

IMPORTANT DATES IN HP(E) HISTORY
1939
A new
company;
HP invents
first product
1994
Planet
Partners
program
launched
1959
Going
global
1966
HP enters
computer
industry;
HP Labs
opens
1972
Replacing the
slide rule—HP
invents the
pocket
calculator
1980
Our first PCs
1984
A print
revolution:
HP
introduces
both the
ThinkJet and
the LaserJet
2003
Cooler
servers
2005
Halo
Collaboration
Studio
2008
Commitment
to cloud
computing
30s 00s90s80s70s60s50s

Hewlett Packard Enterprise: At a glance
People
Values
Quality
Market leadership
Living Progress
Revenue
Servers1
1. IDC Worldwide Quarterly Server Tracker 2Q18, Sept 2018. Market share on a global level for
HPE includes New H3C Group . All data points are worldwide.
2. IDC Worldwide Quarterly Enterprise Storage Tracker 2Q18, Sept 2018. Market share on a global level for HPE includes
New H3C Group All data points are worldwide
3. Hyperion Research HPC Qview for 2Q18, September 2018. TOP500 List of Supercomputer sites, Nov 2017
#1 x86 blade server revenue
#1 Modular server revenue
#1 Four-socket x86 server revenue
#1 Mid-Range Enterprise x86 server
Storage2
#1 Product brand worldwide midrange SAN revenue :
HPE 3PAR StoreServ
#1 Worldwide Internal OEM storage revenue
High Performance Compute3
#1 HPC Server Revenue3
Provider of Top500 energy efficient supercomputers
Hyperconverged Infrastructure4
Fastest growing HCI systems vendor of the top 3,
growing YoY and faster than overall market
HPE Market Leadership
Enterprise WLAN5
#2 Worldwide Enterprise WLAN Vendor
Campus Switching6
#2 Worldwide Enterprise WLAN Vendor
Gartner:
• 2018 Magic Quadrant for Wired and Wireless LAN access
• 2018 Magic Quadrant for Operations Support Systems
• 2018 Magic Quadrant for Hyperconverged Infrastructure
• Highest Scores in 5 out of 6 Gartner use cases for
Critical Capabilities for Wired and Wireless LAN Access
Infrastructure
Forrester:
• Q3-18 The Forrester Wave: Hyperconverged Infrastructure
IDC:
• IDC MarketScape for Wireless LAN
InfoTech Research Group:
• HPE Aruba “Champion Wired and Wireless LAN Vendor
Landscape
HPE Named Leader7
4. IDC Worldwide Converged Systems Tracker for 2Q18, Sept 25, 2018
5. Worldwide Quarterly IDC Enterprise WLAN Tracker 4Q1
6. 650 Group 2QCY18, September 2018
7. Sources provided via hyperlinks

Hewlett Packard Enterprise: At a glance
People
Values
Quality
Market leadership
Living Progress
Revenue
Partnership first
We believe in the power of
collaboration – building long
term relationships with our
customers, our partners and
each other.
Bias for action
We never sit still – we take
advantage of every opportunity.
Innovators at heart
We are driven to innovate –
creating both practical and
breakthrough advancements.

Together, shaping and leading the next generation of High Performance
Computing (HPC) and Artificial Intelligence (AI)
7

HPC Solutions Business Unit Solutions Areas.
HP HPC BU Solutions Areas
HighPerformance
Computing
- Oil & Gas computations
- Meteo/Weather forecast
- Manufacturing CAE
- Life sciences (Bio, Chem,…)
BigDataapplications
- Hadoop & SPARK
- Content delivery
- Rendering
- In memory compute & DB
Scale-OUTStorage
- Scale out digital archive
- Media assets archives
- Geo distributed storage
- Video surveillance archive
PerformanceOptimized
Datacenters
- Modular datacenters
- Mobile datacenters
- EMI/EMR protected DC
- Portable miniDC

Storage for Challenging AI & Big Data Projects
9

HPE Data Management Framework
• Efficient storage utilization and cost management
• Streamline data workflows
• Data assurance and protection
Tape
Zero
Watt
Storage
Object
Storage
& Cloud
Data Management | Fast & Slow Tier Models
Aggregated Storage-in-Compute
10
Ethernet / InfiniBand / Slingshot
HPE Compute Node
File System Access
HPE Compute Node
File System Access
HPE Compute Node
File System Access
HPE Compute Node
File System Access
Flash Tier Storage Server
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
Flash Tier Storage Server
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
NVMe
• All nodes have full POSIX access to the flash tier and parallel file system
• Aggregated Storage-in-Compute model is where multiple NVMe devices are placed
in dense compute nodes (e.g. 1U nodes with 10 NVMe devices)
• Flash configuration provides burst buffer capabilities and persistent shareable
POSIX file system functionality in a single layer
• For expanded tiered data management capabilities, DMF can tier data from/to this
layer into object & cloud storage, Zero Watt buffer storage or tape in order to
deliver virtually infinite capacity as well as integrated backup, archive and disaster
recovery capabilities
SOLUTION ATTRIBUTES
Lustre
HDFS

Apollo 4000 Cluster
More data from the edge means more storage in the core
1
1
Apollo 4200 Gen9
– 2U platform;
28 LFF HDD or 54 SFF HDD
Apollo 4510 Gen10
– 4U platform; 60 LFF HDD
JBOD Option (D8000)
- 4U 106 LFF HDD
Data lakeHot
Warm
Cold
Tiered storage for Big Data Analytics
Process Train
Data storage for AI workflows

Zero Watt Storage
HPE Data Management Framework
High Performance
Power Optimized
Extended Drive Lifespan
• Near 20 GB/s per JBOD performance
provides ‘fast’ hard disk tier to stream data
to active ‘hot’ storage
• Each drive is individually managed by DMF
to track data activity and data layout
• Drives can be spun down when not in use
to significantly reduce power and cooling
costs and increase drive lifespan
• HPE D6020 5U 70 bay JBOD is qualified
today
1
2
Software-based DMF warm tier storage
option with minimized power utilization
paired with the HPE D6020 JBOD

HPE Scalable Object Storage – Scality
Object storage (and some file)
• Key attributes
− Scalable software-defined storage for object (S3) and
file access (SMB/NFS) at the same time
− erasure coding (variable) and replication (small files)
− Native data protection in a shared-nothing, distributed
architecture with no single point of failure
− Multi-node, multi-site, multi-geo data distribution for
extreme data durability (up to 13x 9s)
− Connectors for multiple file and cloud access protocols
to easily support various business applications
− Easy and proven growth path
− Large (reference) customer installed base
• Tight collaboration with HPE; HW encryption
• Certified as Cloud Bank Storage target
• Various whitepapers and reference architectures
available
• Architecture/building blocks
• Sweet spot 500TB – 100s of PBs (scales to Exabytes)
• Minimum: 6 nodes with 10 HDDs/node
3-node min. support (200TB+) – single/2-site only
• Connector nodes need to be configured separately

Object store resilience – through Geo-distributed Erasure Coding
Drive failures Node failures Zone and region failures
Compute Compute
Storage Storage Storage Storage
Data Center A Data Center B
• Component and network failures are to be expected – and thus considered a normal
state
• System functions properly in spite of multiple failures
Compute Compute Compute Compute
Storage Storage Storage Storage Storage Storage Storage StorageStorage Storage
Compute Compute Compute
Storage Storage Storage Storage
Data Center C
1
4

How does erasure coding work?
9MB
1 M B 1 M B
1 M B 1 M B
1 M B
1 M B
1 M B 1 M B 1 M B
data chunks parity chunksoriginal file
Example: ARC(9,3)
Provides three-disk failure protection with ~33% overhead
RING Erasure Coding
 Reed-Solomon EC algorithm (custom XOR acceleration
library)
 Dynamically configurable schema – Up to 64 data +
parity chunks to protect against variable number of
failures
Flexible & Efficient
 Configurable replication or erasure coding per connector
 Great for large objects – avoids replication overhead
 Data chunks stored in the clear to avoid read
performance penalties
 Scales easily – more cost savings and less overhead
with multiple sites
1 M B 1 M B 1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
1 M B
Erasure coding is a cost-effective way to store big files

WekaIO Parallel File System for All-Flash Environments
Applications and storage share the compute & fabric infrastructure
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
APP
ColdData
Unified Namespace
Nodes can be
• FS Clients
• FS Servers or
• Both
Ethernet or InfiniBand Network
Apollo 2000
Apollo 6000
Apollo 6500
SGI 8600
DL 360/380
Option for Apollo 2000-based
storage server model with 4
nodes per 2U chassis loaded
with NVMe storage

o Problem: Could not achieve the bandwidth
required to keep GPU cluster saturated
o Pain Point: Wasted cycle time ($$$$) on
very expensive GPU clusters.
o Test Platform: 10 Node HPE Apollo 2000
vs. local disk and Pure Storage Flashblade
server
o Result:
– WekaIO – 42% faster than Local Disk
– WekaIO – 4.4x faster than FlashBlade
WekaIO is Faster Performance than Local Disk
MB/Second
Higher is Better
Analytics Cluster Results to Single GPU Client
Actual measured data at an autonomous vehicle training installation
0
500
1,000
1,500
2,000
2,500
3,000
3,500
4,000
4,500
5,000
WekaIO 3.0 Local Disk SSD Pure Storage
1MB Read Performance – Single GPU Client

AI data node using Apollo 4200 Gen10
For tiered and hybrid solutions
Apollo 2000 Gen10
DL360 Gen10
Scale-out S3-compatible archive
Petabytes of geo-dispersed data
AI Workloads
GPU-driven data training, recognition, visualization, and simulation
High-performance File
All-NVMe flash storage
Apollo 4200 Gen10
Scality
Apollo 4200 Gen10
Scality
RING
AI data node
NVMe-optimized storage
Scale-out HDD bulk storage
Combined in one storage platform
Current Approach: Tiered Storage New Approach: Hybrid Storage
HPE validated solution
https://www.hpe.com/h20195/v2/Getdocument.aspx?docname=a00065979enw

1
9
CLUSTERSTOR E1000 HARDWARE
“Zero Bottleneck” End-to-End PCIe 4.0 Design
Up to 24 x 2.5” NVMe
PCIe 4.0 SSD in
2 rack units
2 embedded
storage servers each
with 1 AMD “Rome”
socket and PCIe 4.0
Up to 6 x 100/200
Gbps PCIe 4.0 NICs
(Slingshot, GbE, IB)
Up to 230 TB usable
in 2 rack units
Lustre Flash Optimized
Metadata Servers,
Object Storage Servers
and
All Flash Arrays
60 GB/sec Write
80 GB/sec Read

2
0
CLUSTERSTOR E1000 DISK ARRAY
Ultra-dense for less enclosures, racks, floor space
Up to 106 x 3.5” SAS
HDD in 4 rack units
Usable capacity points
• 1.07 PB (14 TB HDD) in 2019
• 1.22 PB (16 TB HDD) in 2020
• 1.53 PB (20 TB HDD) in 2021
Separate Disk Server with:
• 2 embedded storage servers
each with 1 AMD “Rome” socket
and PCEe 4.0
• Up to 4 x 100/200 Gbps
PCIe 4.0 NICs
(GbE, IB, OPA)
• 2 or 4 SSDs for WIBs,
Journals and NXD

Flexible modularity AND extreme scale for HPC & AI workloads
HPE Superdome Flex - Advanced SMP
21
5U, 4-socket
chassis
Scales up to 8 chassis and 32
sockets as a single system in a
single rack
Unparalleled Scale
– Modular scale-up architecture
– Scales seamlessly from 4 to 32 sockets as a single system
with both Gold and Platinum processors
– Designed to provide 768GB-48TB of shared memory
– High bandwidth (13.3GB/sec- bi-directional per link)/low latency (<400ns)
HPE Flex Grid, ~1TB/s total aggregation bandwidth
– Intel ® Xeon® Scalable processors, 1st and 2nd generation, with up to 28 cores
Unbounded I/0
– Up to 128 PCIe standup cards, LP/FH PCIe
Optimum Flexibility
– 4-socket chassis building blocks, low entry cost; HPE nPars
– NVIDIA GPUs, Intel SDVis
– 1/10/25 Gbe, 16/32Gb FC, IB EDR/Ethernet 100 GB, IB HDR, Omni-Path
– SAS, Multi-Rail LNet for Lustre; NVMe SSD
– MPI, OpenMP
Extreme Availability
– Advanced memory resilience, Firmware First, diagnostic engine,
self-healing
– HPE Serviceguard for Linux
Simplified User Experience
– HPE OneView, IRS, OpenStack, Redfish API
– HPE Datacenter Care, HPE Proactive Care

Future of Data Storage Paradigm –
In-Memory Computing

HPE’s architecture innovation addresses declining system ratios despite
improvements in processing performance
2
3
0.0001
0.0010
0.0100
0.1000
1.0000
2010 2012 2013 2013 2016 2016 2018 2018
Hopper Sequoia Titan Edison Cori Hsw Trinity KNL Aurora Summit
Memory (wAvg) / Flops Memory bw (wAvg) / Flops Injection bw / Flops Bissection bw / Flops
2022
Logarithmic
Time 2010 - 2022
Balanced System
Architecture
Memory Driven
Programming Model
Energy Efficiency from
Chip to Cooling Tower
Open Architecture,
Open Ecosystem
HPE is developing advanced system architecture for more balanced systems at scale

Memory
Bandwidth
− Embrace co-packaged
memory transition
(HBM, HMC …)
− Minimize latency for
Gen-Z attached
memory
Memory
Capacity
− Drive co-packaged
memory cost as low
as possible
− Enable Gen-Z
attached memory as
second memory tier
(DRAM or NVM)
Fabric
Injection Rate
− Embed the HCA to the
CPU leveraging
SerDes generalization
thanks to Gen-Z
− Integrated switches
close to compute for
multiple rails option
Fabric Bisection
Bandwidth
− Design high-radix
switches
− Integrate and optimize
for cost and usability
optical technologies
(vcsel -> SiP)
HPE’s technological innovation includes new memory, photonics and fabric
technology for data intensive workloads
2
4

Here is Edward Bear, coming downstairs
now, bump, bump, bump, on the back of
his head, behind Christopher Robin. It is,
as far as he knows, the only way of
coming downstairs, but sometimes he
feels that there really is another way, if
only he could stop bumping for a moment
and think of it.
A. A. Milne, Winnie-the-Pooh

For highest possible level of performance applications must change
Evolving the Software Ecosystem for Persistent Memory
Controller
Cache
File system
I/O Buffers
Drivers
Objects
Interpreters
Libraries
Media
~25k instructions
3+ data copies
Bottleneck
Application
Operating System
SSD/HDD
Objects
Interpreters
Media
Application
Persistent Memory
Bottleneck ?3 instructions
0 data copies
Libraries

The Traditional Memory/Storage Hierarchy
2
7
Processor
Hot Tier
Cold Tier
Super Fast
Super Expensive
Tiny Capacity
Processor
Registers
Level 1 (L1)
Level 2 (L2)
Level 3 (L3)
Physical
Memory
Random Access
Memory
Faster
Expensive
Small Capacity
Fast
Reasonably Priced
Average Capacity
Non-Volatile
Flash-based Memory
Solid State
Storage
Average Speed
Priced Reasonably
Average Capacity
Magnetic
Storage
File-based Memory
Slow
Inexpensive
Large Capacity
Processor
Cache
SAS/SATA HDD
SAS/SATA SSD
NVMe SSD
CPU
DRAM
Capacity

Redefining the Memory/Storage Hierarchy
2
8
SAS/SATA HDD
SAS/SATA SSD
NVMe SSD
CPU
DRAM
Memory
Storage
Persistent
Memory
• Data is volatile
• System DRAM is used
as a cache
• Data is persistent
• System DRAM is used
as main memory
Work as DRAM
Work as SSD

Storage Devices Access Modes IO Stack Comparison
App
File
System
Volsnap
Volmgr /
Partmgr
Disk /
ClassPnP
StorPort
MiniPort
HDD/SSD
Traditional
App
File
System
Volsnap
Volmgr /
Partmgr
PMM
Disk Driver
PMM
PMM Block Mode
PMM
Bus Driver
App
PMM-Aware
File System
Volmgr /
Partmgr
PMM
PMM Direct Access (DAX)
PMM
Bus Driver
Non-CachedIO
CachedIO
MemoryMapped
User Mode
Kernel Mode
4-10μs
read(fileptr,offset)
write(fileptr,offset)
/* OS call */
1-3μs 0.3μs
2
9
load(address)
store(address)
/* CPU opcode */

Storage over App Direct and Direct Access Applications
3
0
Persistent Memory Devices
Storage over App Direct
Applications
Load/Store
File System + DAX
PMM Drivers
mmapread / write Syscalls
PMDK APIs
Page Cache
Read/Write
User space
Kernel
FW/HW
Direct Access
Applications
Volatile DRAM used for system memory
Persistent memory devices from PMM
 PMEM device(s) presented to OS
 Can be formatted and mounted as a
filesystem in fsdax: ext4, xfs, NTFS
Storage over App Direct (SToAD)
 Applications can access through the
storage software layer (legacy, no
application change): open(), read(),
write()
Direct Access
 NVM programing model NVMPM
load/store: mmap(), memcpy(), PMDK

3
1
Baseline – fastest local Optane™ DC SSD P4800X
(built on 3D XPoint technology)
0.000050 cpu=18 pid=16625 tgid=16625 pread64 [17] entry fd=3 *buf=0x268c000 count=4096 offset=0xdeea7000
0.000051 cpu=18 pid=16625 tgid=16625 block_rq_issue dev_t=0x1030000b wr=read flags=SYNC|DONTPREP
sector=0x6f7538 len=4096 async=0 sync=0
0.000058 cpu=18 pid=16625 tgid=16625 comm=fio sched_switch syscall=pread64 prio=120 state=SSLEEP next_pid=0
next_prio=120 next_tgid=n/a policy=n/a vss=174969 rss=192 io_schedule_timeout+0xa6 do_blockdev_direct_IO+0xbc3
__blockdev_direct_IO+0x43 blkdev_direct_IO+0x58 generic_file_read_iter+0x57a blkdev_read_iter+0x37
__vfs_read+0xd9 vfs_read+0x86 sys_pread64+0x8a tracesys_phase2+0x6d|[libpthread-2.17.so]:__pread_nocancel+0x2a
0.000059 cpu=27 pid=-1 tgid=-1 block_rq_complete dev_t=0x1030000b wr=read flags=SYNC|DONTPREP sector=0x6f7538
len=4096 async=0 sync=0 qpid=16625 spid=16625 qtm= 0.000000 svtm= 0.000007
0.000059 cpu=27 pid=-1 tgid=-1 sched_wakeup target_pid=16625 prio=120 target_cpu=18 success=1
0.000061 cpu=18 pid=0 tgid=0 comm=swapper/18 sched_switch syscall=idle prio=n/a state=n/a next_pid=16625
next_prio=120 next_tgid=16625 policy=SCHED_NORMAL vss=0 rss=0
0.000062 cpu=18 pid=16625 tgid=16625 pread64 [17] ret=0x1000 syscallbeg= 0.000012 fd=3 *buf=0x268c000
count=4096 offset=0xdeea7000 type=REG dev=0x1030000b ino=22673
Single 4 KB read
- logical I/O
- physical I/O
green
yellow
Total 4 KB read time 12 us
Latest AFA arrays will show 100’s us here

3
2
Now “slow” (non-DAX) access to the Persistent Memory via
standard OS I/O system calls
The same single 4 KB read, but…
- logical I/O
- NO physical I/O anymore
green
NO yellowNO changes for any Db/App required!
Total 4 KB read time <2 us
0.000003 cpu=13 pid=5979 tgid=0 pread64 [17] entry fd=3 *buf=0x55b883ff2000 count=4096
offset=0x15c145d000
0.000004 cpu=13 pid=5979 tgid=0 pread64 [17] ret=0x1000 syscallbeg= 0.000002 fd=3
*buf=0x55b883ff2000 count=4096 offset=0x15c145d000

3
3
Fastest ever
access
via
Direct Access
(DAX)
Total 4 KB read time ? – NO read!, latencies 350 ns or less!
NO physical I/O, NO logical I/O, NO block device layer, NO buffers, NO queues!!
Nothing
beyond!
App Direct !!

3
4
CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ POLL C1 C1E C6 POLL% C1% C1E% C6%
55 617 33.54 1843 2694 120930 50017 58378 27884 29561 6.01 36.41 15.46 15.90
And let me explain why…
CPU Avg_MHz Busy% Bzy_MHz TSC_MHz IRQ POLL C1 C1E C6 POLL% C1% C1E% C6%
55 3786 99.97 3796 2694 1792 0 0 0 0 0.00 0.00 0.00 0.00
linux-tg7k:/home/anton # numactl --physcpubind=55 --membind=1 fio --filename=/mnt2/file --rw=randwrite --
ioengine=psync --direct=1 --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --group_reporting --name=perf_test
linux-tg7k:/home/anton # numactl --physcpubind=55 --membind=1 fio --filename=/mnt1/file --rw=randwrite --
ioengine=psync --direct=1 --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --group_reporting --name=perf_test
NVMe – real CPU utilization is 617 MHz
PMEM – real CPU utilization is 3786 MHz, Intel’s Turbo-Boost activated!!
Simplest log-writer like workload
not true CPU idle state; not true CPU work state; “do nothing” CPU state
you shouldn’t pay money for that!

CPUs Accelerators
Memory technologies I/O
– High bandwidth
– Low latency
– Advanced workloads & technologies
– Scalable from IoT to exascale
– Compatible
– Economical
– Supports electrical or optical interconnects
– Open standard
– Security built-in at the hardware level
Gen-Z: new open interconnect protocol
Key enabler of the Memory-Driven Computing open architecture
FPGA
GPU
SoC ASICNEURO
Memory
Memory
Network Storage
Direct Attach, Switched, or Fabric Topology
NVM NVM NVM
SoC
Memory
43

Transform performance with Memory-Driven programming
3
6
In-memory analytics
15x
faster
New algorithms Completely rethink
Modify existing
frameworks
Similarity search
40x
faster
Financial models
10,000x
faster
Large-scale
graph inference
100x
faster

DZNE discovered HPE’s Memory-Driven
Computing — and saw unprecedented
computational speed improvements that hold
new promise in the race against Alzheimer’s
60% power reduction cuts research costs
101x
increase in analytics speed blasts
research bottlenecks, leading to shorter
processing time — from 22 minutes to
13seconds
Memory-Driven Computing helps outpace the global time bomb of
neurodegenerative disease

Thank You!
Questions are welcome…
HPC.CEE@HPE.COM
38

Saviak lviv ai-2019-e-mail (1)

Empfohlen

Empfohlen

Weitere ähnliche Inhalte

Was ist angesagt?

Was ist angesagt? (20)

Ähnlich wie Saviak lviv ai-2019-e-mail (1)

Ähnlich wie Saviak lviv ai-2019-e-mail (1) (20)

Mehr von Lviv Startup Club

Mehr von Lviv Startup Club (20)

Kürzlich hochgeladen

Kürzlich hochgeladen (20)

Saviak lviv ai-2019-e-mail (1)