In this presentation from the DDN User Meeting at SC13, Tommy Minyard from the Texas Advanced Computing Center describes TACC's new Corral data storage system.
Watch the video presentation: http://insidehpc.com/2013/11/13/ddn-user-meeting-coming-sc13-nov-18/
1. Corralling Big Data at TACC
Tommy Minyard
Texas Advanced Computing Center
DDN User Group Meeting
November 18, 2013
2. TACC Mission & Strategy
The mission of the Texas Advanced Computing Center is to enable scientific discovery and enhance society through the application of advanced computing technologies.
To accomplish this mission, TACC:
– Evaluates, acquires & operates advanced computing systems
– Provides training, consulting, and documentation to users
– Collaborates with researchers to apply advanced computing techniques
– Conducts research & development to produce new computational technologies
(Slide graphic: "Resources & Services" and "Research & Development")
3. TACC Storage Needs
⢠Cluster specific storage
â High performance (tens to hundreds GB/s bandwidth)
â Large-capacity (~2TBs per Teraflop), purged frequently
â Very scalable to thousands of clients
⢠Center-wide persistent storage
â Global filesystem available on all systems
â Very large capacity, quota enabled
â Moderate performance, very reliable, high availability
⢠Permanent archival storage
â Maximum capacity, tens of PBs of capacity
â Slow performance, tape-based offline storage with spinning
storage cache
4. History of DDN at TACC
⢠2006 â Lonestar 3 with DDN S2A9500
controllers and 120TB of disk
⢠2008 â Corral with DDN S2A9900 controller
and 1.2PB of disk
⢠2010 â Lonestar 4 with DDN SFA10000
controllers with 1.8PB of disk
⢠2011 â Corral upgrade with DDN SFA10000
controllers and 5PB of disk
5. Global Filesystem Requirements
⢠User requests for persistent storage available
on all production systems
â Corral limited to UT System users only
⢠RFP issued for storage system capable of:
â At least 20PB of usable storage
â At least 100GB/s aggregate bandwidth
â High availability and reliability
⢠DDN solution selected for project
7. Stockyard: Design and Setup
⢠A Lustre 2.4.1 based global files system, with
scalability for future upgrades
⢠Scalable Unit (SU): 16 OSS nodes providing
access to 168 OSTâs of RAID6 arrays from
two SFA12k couplets, corresponding to 5PB
capacity and 25+ GB/s throughput per SU
⢠Four SUâs provide 20PB with 100GB/s now
⢠16 initial LNET router set for external mounts
11. Stockyard: Capabilities and Features
⢠20PB usable capacity with 100+ GB/s
aggregate bandwidth
⢠Client systems can bring its own LNET router
set to connect to the Stockyard core IB
switches or connect to the built-in LNET
routers using either IB or TCP. (FDR14 or
10GigE)
⢠HSM potential to Ranch tape archival system
12. Capabilities and Features (cont'd)
• Metadata performance enhancement possible with DNE (phase 1)
• NRS (Network Request Scheduler) evaluation: the characteristics of the different ost_io.nrs_policies settings, particularly crrn (client round-robin over NIDs), under contention dominated by a few jobs (see the sketch below)
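For reference, the ost_io NRS policy is switched with lctl on the OSS nodes; a minimal sketch of such a wrapper (illustrative only, not TACC's evaluation harness) might look like:

import subprocess

def set_nrs_policy(policy="crrn"):
    # Switch the NRS policy of the ost_io service on this OSS (run as root),
    # e.g. to client round-robin over NIDs (crrn).
    subprocess.run(["lctl", "set_param",
                    f"ost.OSS.ost_io.nrs_policies={policy}"], check=True)

def show_nrs_policy():
    # Read back the currently configured policies.
    out = subprocess.run(["lctl", "get_param", "ost.OSS.ost_io.nrs_policies"],
                         capture_output=True, text=True, check=True)
    print(out.stdout, end="")

if __name__ == "__main__":
    set_nrs_policy("crrn")
    show_nrs_policy()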
13. Stockyard: Numbers So Far
⢠16 LET-routers configured as direct client
(within the Stockyard fabric) can push 25GB/s
on the unit
⢠With two SUâs the same set of clients can
achieve 50GB/s, and 75GB/s with three SU.
⢠With four SU we hit the 16 client limit. No
improvement beyond 75GB/s (corresponding
to ~4.7GB/s from each client)
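A minimal sketch of that bound, using the ~25 GB/s per SU and ~4.7 GB/s per client figures above, shows why the curve flattens at 75 GB/s:

GBPS_PER_SU = 25       # per-SU throughput from the single-SU runs
GBPS_PER_CLIENT = 4.7  # observed per-client ceiling
N_CLIENTS = 16         # direct LNET-router clients available

for n_su in (1, 2, 3, 4):
    bound = min(n_su * GBPS_PER_SU, N_CLIENTS * GBPS_PER_CLIENT)
    print(f"{n_su} SU: ~{bound:.0f} GB/s")
# At 4 SUs the client-side limit (16 x 4.7 ~ 75 GB/s) dominates,
# so adding the fourth SU shows no further improvement.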
14. Numbers So Far (Single Client)
⢠Single thread write performance with Lustre
2.4.1 is ~770MB/s
â big improvement over 2.1.X at about 500MB/s
⢠Multi-thread from a single client saturates
around 4.7GB/s (with credits=256 on both
servers and clients)
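The credits value here is an LNET/LND module parameter; as a minimal sketch (assuming the o2iblnd transport, so the setting lives in the ko2iblnd module; the persistent form would be a modprobe.d line such as "options ko2iblnd credits=256"), a node's running value can be checked like this:

# Read the concurrent-send credit setting the node booted with.
# Path is illustrative and assumes the module exposes the parameter via sysfs.
PARAM = "/sys/module/ko2iblnd/parameters/credits"

with open(PARAM) as f:
    print("credits =", f.read().strip())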
15. Numbers So Far (Aggregate)
⢠Performance numbers with 16 lnet-routers :
75GB/s from 16 direct clients
⢠Numbers from Stampede compute clients:
65GB/s with 256 clients (IOR, posix, fpp, with
8 tasks per node)
⢠Saturation point for Stampede clients: 65GB/s
⢠N.B. credits=64 on client nodes of Stampede
â Quick test on interactive 2.1.x node with higher
credit number gives expected boost.
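For context, an "IOR, POSIX, file-per-process, 8 tasks per node" run corresponds roughly to the following sketch; the launcher, block and transfer sizes, and the output path are illustrative, not the exact TACC settings:

import subprocess

nodes, tasks_per_node = 256, 8
cmd = ["mpirun", "-np", str(nodes * tasks_per_node),
       "ior",
       "-a", "POSIX",   # POSIX I/O API
       "-F",            # file-per-process
       "-b", "4g",      # illustrative per-task block size
       "-t", "1m",      # illustrative transfer size
       "-o", "/stockyard/benchmarks/ior/testfile"]  # hypothetical path
subprocess.run(cmd, check=True)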
16. Numbers So Far (Failover Tests)
⢠OSS failover test setup and results
⢠Procedure:
â Identify the OSTâs for the test pair
â Initiate the dd processes targeted to the particular OSTâs each of
about 67GB in size so that it does not finish before the failover
â Interrupt one of the OSS server with shutdown using ipmitool
â Record the individual dd process outputs as well as server and
client side Lustre messages
â Compare and confirm the recovery and operation of the failover
pair with 21 OSTâs
⢠All I/O completes within 2 minutes of failover
17. Failover Testing (cont'd)
• Similarly for the MDS pair: the same sequence of interrupted I/O and collection of Lustre messages on both servers and clients; the client-side log below shows the recovery:
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre:
13689:0:(client.c:1869:ptlrpc_expire_one_request()) @@@ Request sent has timed
out for sent delay: [sent 1381348698/real 0] req@ffff88180cfcd000
x1448277242593528/t0(0) o250->MGC192.168.200.10@o2ib100@192.168.200.10@o2ib100:26/25 lens 400/544 e 0 to
1 dl 1381348704 ref 2 fl Rpc:XN/0/ffffffff rc 0/-1
Oct 9 14:58:24 gsfs-lnet-006 kernel: : Lustre:
13689:0:(client.c:1869:ptlrpc_expire_one_request()) Skipped 1 previous similar
message
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: Evicted from MGS (at
MGC192.168.200.10@o2ib100_1) after server handle changed from
0xb9929a99b6d258cd to 0x6282da9e97a66646
Oct 9 14:58:43 gsfs-lnet-006 kernel: : Lustre: MGC192.168.200.10@o2ib100:
Connection restored to MGS (at 192.168.200.11@o2ib100)
18. Automated Failover
⢠The tests were on an artificial setup to
simplify the tracking of the completion of the
I/O on clients and shutdown and failover
mounts were done manually.
⢠Corosync and pacemaker are being set up to
automate the process.
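As an illustration of that direction, a Pacemaker-managed OST mount is commonly expressed as a Filesystem resource; a hypothetical pcs invocation (device, mount point, and resource name are made up, and a real configuration also needs constraints and fencing), wrapped in Python for consistency:

import subprocess

# Hypothetical OST of one failover pair, managed as a Pacemaker resource.
subprocess.run(["pcs", "resource", "create", "stockyard-OST0000",
                "ocf:heartbeat:Filesystem",
                "device=/dev/mapper/ost0000",
                "directory=/mnt/ost0000",
                "fstype=lustre"], check=True)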
19. Routed Clients
⢠We monitor the routerstat output on the
attached routers and differences between two
timestamps, focusing on the even distribution
of request streams
⢠Contrary to the expectation that âautodownâ
may suffice, Lustre clients need to have
âcheck_routers_before_use=1â to have
automatic updates of router status
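A minimal sketch of checking that a client was brought up with that setting (the persistent form would normally be a modprobe.d line such as "options lnet check_routers_before_use=1"; the sysfs path assumes the lnet module exposes the parameter):

# Verify the lnet module parameter on a routed client.
PARAM = "/sys/module/lnet/parameters/check_routers_before_use"

with open(PARAM) as f:
    value = f.read().strip()
print("check_routers_before_use =", value)
assert value == "1", "this client will not probe routers before using them"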
20. Routed Clients (cont'd)
• Even with automatic router checks, clients cannot detect non-functional routers: a router that is alive only on the client side will still be assumed active by the clients
• Clients then encounter timeouts because of the non-functional routers
• Resolution: separate router checks were added on the router nodes themselves (a sketch follows)
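A minimal sketch of such a router-side check (the peer NIDs are illustrative; the server-side one reuses the MGS NID from the log above): from each LNET router, lctl ping a peer on both fabrics, so a router that is healthy on only one side gets flagged instead of silently black-holing client traffic.

import subprocess

# Hypothetical peers: one NID on the server-side fabric, one on the client side.
PEERS = ["192.168.200.10@o2ib100", "10.10.0.1@o2ib"]

def reachable(nid):
    # lctl ping returns non-zero if the peer cannot be reached over LNET.
    return subprocess.run(["lctl", "ping", nid],
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL).returncode == 0

bad = [nid for nid in PEERS if not reachable(nid)]
if bad:
    print("router check FAILED for:", ", ".join(bad))  # feed into alerting
else:
    print("both fabrics reachable from this router")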
21. Stockyard: Looking Ahead
⢠Deploy as a global $WORK space for TACC
resources, will push the number of clients to
all TACC resources
⢠Evaluation of Lustre 2.5.0 before full
production for HSM functionality and
compatibility with SAMFS on Ranch
⢠Quota management (different on 2.4+)
⢠Integrated monitoring setup
⢠Security evaluation
22. Summary
⢠Storage capacity and performance needs
growing at exponential rate
⢠High-performance and reliable filesystems
critical for HPC productivity
⢠Benefits of large parallel filesystems outweigh
the system administration overhead
⢠Current best solution for cost, performance
and scalability is Lustre-based filesystem