Dirk Petersen, Scientific Computing Manager, Fred Hutchinson Cancer Research Center (FHCRC)
Joe Arnold, President and Chief Product Officer, SwiftStack
Considering deploying a multi-petabyte storage-as-a-service offering in your research environment? Learn how an industry-leading software-defined object storage solution, architected by SwiftStack and Silicon Mechanics, helped shift hundreds of users to an object-based workflow for their archival data. With an emphasis on cost efficiencies, scalability, and manageability, see how this implementation at Fred Hutchinson Cancer Research Center (FHCRC) is continually evolving across new use cases and access methods.
BIO IT 15 - Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups?
1. Are Your Researchers Paying Too Much for Their Cloud-Based Data Backups?
Dirk Petersen, Scientific Computing Director,
Fred Hutchinson Cancer Research Center (FHCRC)
Bio-IT World 2015 1
2. Who are we and what do we do?
What is Fred Hutch?
• Cancer & HIV research
• 3 Nobel Laureates
• $430M budget / 85% NIH funding
• 15-acre Seattle campus with 13 buildings and 1.5+ million sq ft of facility space
Research at “The Hutch”
• 2,700 employees
• 220 Faculty, many with custom requirements
• 13 research programs
• 14 core facilities
• Conservative use of information technology
IT at “The Hutch”
• Multiple data centers with >1,000 kW capacity
• 100 staff in Center IT plus divisional IT
• Team of 3 Sysadmins to support storage
• IT funded by indirects (F&A)
• Storage Chargebacks started Nov 2014
3. How did we get here:
Economy File project in production in 2014
• Chargebacks drove the Hutch to embrace more economical storage
• Selected Swift object storage managed by SwiftStack
• Go-live in 2014, strong interest and expansion in 2015
• Researchers do not want to pay the price for standard enterprise storage
• Additional use cases:
– In production: Swift as a backend for Galaxy
– In progress: Swift replaces standard disk deduplication devices for backup
– Planning: Swift as backend for endpoint backup (Druva)
– Planning: Swift as backend for virtual machines (openvstorage)
– Future option: Swift as backend for Enterprise file sharing / NAS
• File System Gateway for CIFS/NFS access phased out
4. Phasing out of Filesystem Gateway
• Initial deployment used the SwiftStack Filesystem Gateway (CIFS/NFS)
– User survey: strong preference for traditional file access
– Gateway was the easiest integration option within the existing authentication and authorization process
• However, the Gateway was up to 10x slower than direct access to the API
– Users had accepted lower performance because of the low cost
– But low performance still causes frustration and increases Ops cost
• Now we have alternatives, and better AD integration of Swift
– Gateway was non-HA, higher Ops costs
– Removing gateway allows rolling updates during business hours
– Gateway didn’t allow for full auditing of file access, but Swift does
• Users finally saw the benefit of removing the gateway and were willing to try alternative tools
5. How chargebacks were implemented
• Custom SharePoint site for storage
chargeback processing and
allocation to grants
– Each PI can allocate a percentage of charges to up to 3 grant budgets
– Allocation is default setup for next
month
– User comments positive:
“very easy to use“
• Don’t make chargebacks worse by offering bad tools!
6. Chargebacks spike Swift utilization
• Started storage chargebacks
on Nov 1st
– Triggered strong growth in October
– Users sought to avoid high cost of
enterprise NAS and put as much as
possible into lower cost Swift
• Underestimated the success of Swift
– Needed to pause the migration to buy more hardware
– Can migrate 30+ TB per day today
7. Chargebacks spike Swift utilization, cont.
• High Aggregate throughput
• Current network architecture is
an (anticipated) bottleneck
• Many parallel streams required to
max out throughput
• Ideal for HPC cluster architecture
8. Silicon Mechanics – Expert included.
• Commodity hardware selection
• Open source software identification
• Quality assembly process with zero defects
• On-time installation and deployment
• Design consultation for the right solution
• Focused on your real world problems
• Real people behind the product
• Support staff who know your system
9. Silicon Mechanics: The value of highly customizable hardware
Silicon Mechanics Storform Storage Servers
• Flexible, Configurable, Reliable
• 144TB raw capacity; 130TB usable
• No RAID controllers; no storage lost to RAID
• 36 x 4TB 3.5” Seagate SATA drives
• 2 x 120GB Intel S3700 SSDs; OS + metadata
• 10Gb Base-T connectivity
• (2) Intel Xeon E5 CPUs
• 64GB RAM
Supermicro SC847 4U chassis
Learn more at Booth #361 / @ExpertIncluded
10. Management of OpenStack Swift using SwiftStack
• SwiftStack provides control & visibility
– Deployment automation
• Roll out Swift nodes in 10 minutes
• Upgrade Swift across clusters with 1 click
– Monitoring and stats at cluster, node,
and drive levels
– Authentication & Authorization
– Capacity & Utilization Management
via Quotas and Rate Limits
– Alerting & Diagnostics
11. SwiftStack Architecture Overview
Standard Linux Distribution
Off-the-shelf Ubuntu, Red Hat, CentOS
Standard Hardware
Silicon Mechanics, Supermicro, etc.
Swift Runtime
Integrated storage engine with all node components
Integrations & Interfaces
End-user web UI, legacy interfaces,
authentication, utilization API, etc.
OpenStack Swift
Released and supported by SwiftStack
100% Open Source
SwiftStack Nodes (2 → 1000s)
Rolling Upgrades & 24x7 Support
Monitoring, Alerting & Diagnostics
Capacity & Utilization Mgmt.
Client Support
Ring & Cluster Management
Authentication Services
Deployment Automation
SwiftStack Controller
12. How much does it cost?
• Only small changes vs 2014
– Kryder’s law obsolete at <15%/yr?
– Swift now down to Glacier cost
(hardware down to $3 / TB / month)
– No price reductions in the cloud
• 4TB (~$120) and 6TB (~$250) drives cost the same
– Do you want a fault domain of 144TB or 216TB in your storage servers?
– Don’t save on CPU: Erasure Coding is coming!
[Bar chart: SwiftStack 11, Google 26, Amazon S3 28, NAS 40]
13. Object storage systems and traditional file systems –
totally different, right?
• No traditional file system hierarchy; we just have buckets (S3 lingo) or containers (Swift lingo) that can contain millions of objects (aka files)
• Huh, no sub-directories? But how can I upload my uber-complex bioinformatics file system with 11 levels of folder hierarchy to Swift?
– Answer: we simulate the hierarchical structure by simply putting forward slashes (/) in the object name (or file name)
– Source /dir1/dir2/dir3/dir4/file5 can simply be copied to /container1/many/fake/dirs/file5
• So, how do you actually copy / migrate data over to Swift if you don’t want to use the API?
– The standard tool is the OpenStack Swift client. Let’s assume I want to copy /my/local/folder to /Swiftcontainer/pseudo/folder; here is the command you have to type:
swift upload --changed --segment-size=2G --use-slo --object-name="pseudo/folder" "container" "/my/local/folder"
– Really? Can’t we make this a little easier?
– There are a handful of open source tools available, some of which are easier to use (e.g. rclone)
– However, the Swift client is frequently used, well supported, maintained, and really fast!
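The slash convention above is easy to see in code. A minimal Python sketch (the function name is illustrative, not part of any Swift tooling):

```python
# Sketch: map a POSIX-style path onto a Swift container plus a
# pseudo-hierarchical object name (slashes in the name fake the directories).
def posix_to_swift(path):
    """'/container1/many/fake/dirs/file5' -> ('container1', 'many/fake/dirs/file5')"""
    parts = path.strip("/").split("/")
    return parts[0], "/".join(parts[1:])

print(posix_to_swift("/container1/many/fake/dirs/file5"))
# -> ('container1', 'many/fake/dirs/file5')
```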
14. Object storage systems and traditional file systems –
totally different, right?
• OK, so let’s get this over with and do what HPC shops do all the time: write a wrapper and verify that people who don’t have a lot of patience find it usable.
• Swift Commander, a simple shell wrapper around the Swift client, curl, and some other tools, makes working with Swift very easy:
– Sub-commands such as swc ls, swc cd, swc rm and swc more give you a feel quite similar to a Unix file system (idea borrowed from Google’s gsutil)
• Actively maintained and available at https://github.com/FredHutch/Swift-commander/
$ swc upload /my/posix/folder /my/Swift/folder
$ swc compare /my/posix/folder /my/Swift/folder
$ swc download /my/Swift/folder /my/scratch/fs
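Conceptually, a wrapper like this mainly assembles the verbose swift-client command line shown on the previous slide from two simple arguments. A hedged sketch (build_upload_argv is a hypothetical helper, not actual swc code):

```python
# Hypothetical helper: turn "upload <local folder> <swift path>" into the
# full swift-client argv from the previous slide.
def build_upload_argv(local_folder, swift_path, segment_size="2G"):
    container, _, object_name = swift_path.strip("/").partition("/")
    return ["swift", "upload", "--changed",
            "--segment-size=" + segment_size, "--use-slo",
            "--object-name=" + object_name, container, local_folder]

print(" ".join(build_upload_argv("/my/local/folder", "/container/pseudo/folder")))
```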
15. Object storage systems and traditional file systems –
totally different, right?
• Didn’t someone say that object storage systems were great at using metadata?
• Yes, and you can just add a few key:value pairs as upload argument:
• Query the metadata via swc, or use an external search engine such as Elasticsearch
$ swc upload /my/posix/folder /my/Swift/folder project:grant-xyz
collaborators:jill,joe,jim cancer:breast
$ swc meta /my/Swift/folder
Meta Cancer: breast
Meta Collaborators: jill,joe,jim
Meta Project: grant-xyz
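Swift stores such pairs as X-Object-Meta-* headers on the object (that header prefix is part of the Swift API; the parsing function below is only an illustration of the idea, not actual swc code):

```python
# Illustration: convert swc-style "key:value" arguments into the
# X-Object-Meta-* headers Swift keeps with each object.
def meta_args_to_headers(args):
    headers = {}
    for arg in args:
        key, value = arg.split(":", 1)
        headers["X-Object-Meta-" + key.capitalize()] = value
    return headers

print(meta_args_to_headers(["project:grant-xyz",
                            "collaborators:jill,joe,jim",
                            "cancer:breast"]))
```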
16. Object storage systems and traditional file systems –
totally different, right?
• Users tend to prefer working with a POSIX file system with all files in one place… but integrating Swift into your workflows is not really hard
• Example, running samtools using persistent scratch space
(files deleted if not accessed for 30 days)
• A complex 50-line HPC submission script prepping a GATK workflow requires just 3 more lines!
• Read the file from persistent scratch space and if it is not there simply pull it again from Swift
• If you don’t have scratch space you can pipe download from Swift directly to samtools
if ! [[ -f /fh/scratch/delete30/pi/raw/genome.bam ]]; then
  swc download /Swiftfolder/genome.bam /fh/scratch/delete30/pi/raw/genome.bam
fi
samtools view -F 0xD04 -c /fh/scratch/delete30/pi/raw/genome.bam > otherfile
17. Object storage systems and traditional file systems –
totally different, right?
• Use HPC system to download lots of bam files in parallel
• 30 cluster jobs run in parallel on 30 1G nodes (which is my HPC limit)
• My scratch file system says it loads data at 1.4 GB/s
• This means that each bam file is downloaded at 47 MB/s on average, and downloading this 1.2 TB dataset takes 14 min
$ swc ls /Ext/seq_20150112/ > bamfiles.txt
$ while read FILE; do
>   sbatch -N1 -c4 --wrap="swc download /Ext/seq_20150112/$FILE .";
> done < bamfiles.txt
$ squeue -u petersen
JOBID PARTITION NAME USER ST TIME NODES NODELIST
17249368 campus sbatch petersen R 15:15 1 gizmof120
17249371 campus sbatch petersen R 15:15 1 gizmof123
17249378 campus sbatch petersen R 15:15 1 gizmof130
$ fhgfs-ctl --userstats --names --interval=5 --nodetype=storage
====== 10 s ======
Sum: 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s]
petersen 13803 [sum] 13803 [ops-wr] 1380.300 [MiB-wr/s]
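A quick sanity check of those numbers:

```python
# 30 parallel jobs sharing ~1.4 GB/s of aggregate write bandwidth, 1.2 TB dataset.
aggregate_gb_s = 1.4
jobs = 30
dataset_tb = 1.2

per_stream_mb_s = aggregate_gb_s * 1000 / jobs           # ~47 MB/s per bam file
total_minutes = dataset_tb * 1000 / aggregate_gb_s / 60  # ~14 minutes total
print(round(per_stream_mb_s), round(total_minutes))      # -> 47 14
```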
18. Scientific file systems are a mixture of small files & large files
• How does Swift handle copying lots of small files?
• Answer: not so fast… but to be honest, your NFS NAS does not handle this too well either
• Example: (ab)using filenames as a database:
dirk@rhino04:# ls metapop_results/corrected/release_test/evo/ | head
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.05_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.15_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.1_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
global_indv_n=1_mutant-freq=1_mig=0_coop-release=0.25_km-adv=10_death-adv=2_coop-freq=1_size=32_occ=1_u=0_hrs=5000
• So, we could tar up this entire directory structure… but then we have one giant 1 TB tarball that becomes really hard to handle
• But what if we had a tool that would not tar up sub-dirs into one file, but create a tarball for each level? /folder1/folder2/folder3 could turn into:
/folder1.tar.gz
/folder1/folder2.tar.gz
/folder1/folder2/folder3.tar.gz
• So to restore folder2 and below we just need folder2.tar.gz + folder3.tar.gz
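The per-level scheme reads naturally in Python. This is a sketch of the idea only, writing tarballs locally; the real implementation is swbundler.py in the Swift Commander repo:

```python
import os
import tarfile

# Sketch: one tar.gz per directory level, containing only that level's
# plain files; subdirectories get their own tarballs further down the tree.
def archive_per_level(root, outdir):
    for dirpath, dirnames, filenames in os.walk(root):
        name = dirpath.strip("/").replace("/", "_") + ".tar.gz"
        with tarfile.open(os.path.join(outdir, name), "w:gz") as tar:
            for fname in filenames:
                tar.add(os.path.join(dirpath, fname), arcname=fname)
```

Restoring a subtree then means extracting only the tarballs at and below that directory, as the slide describes.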
19. Scientific file systems are a mixture of small files & large files
• Solution: Swift Commander contains an archiving module
• Written by the author of the Postmark file system benchmark… who has some experience with handling small files
• It’s easy:
• It’s fast:
– Archiving uses multiple processes; measured up to 400 MB/s from one Linux box
– Each process uses pigz multithreaded gzip compression (example: compressing a 1GB DNA string down to 272MB takes 111 sec using gzip, 5 seconds using pigz)
– Restore can use standard gzip
• It’s simple & free: https://github.com/FredHutch/Swift-commander/blob/master/bin/swbundler.py
archive: $ swc arch /my/posix/folder /my/Swift/folder
restore: $ swc unarch /my/Swift/folder /my/scratch/fs
20. Scientific file systems are a mixture of small files & large files
• Special case: sometimes we have large NGS files mixed with many small files; we want to copy but not tar the large files, and archive the small files as tar.gz
• The default bundle option in Swift Commander copies files >64MB straight and bundles files <64MB into tar.gz archives
• Can change the default to other sizes:
• Benefit: archives small files effectively and still allows you to open large files directly with other tools; e.g., bam files in a public folder in Swift can be opened by the IGV browser
archive: $ swc bundle /my/posix/folder /my/Swift/folder
         $ swc bundle /my/posix/folder /my/Swift/folder 512M
restore: $ swc unbundle /my/Swift/folder /my/scratch/fs
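The copy-vs-bundle decision is just a size partition. A minimal sketch (illustrative, not the actual swc bundle code):

```python
import os

# Files at or above the threshold are copied to Swift as individual objects;
# smaller files are collected for tar.gz bundling (64MB default, as in swc).
def partition_by_size(paths, threshold=64 * 1024 * 1024):
    copy_direct, bundle = [], []
    for p in paths:
        (copy_direct if os.path.getsize(p) >= threshold else bundle).append(p)
    return copy_direct, bundle
```

The real tool presumably applies this split while walking the tree, which is what keeps large bam files directly readable by tools like IGV.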
21. Access with GUI tools is required for collaboration
• Reality: even if infrequent, every archive requires access via GUI tools
• Needs to work with Windows and Mac
• Tools such as Cyberduck are standard but not perfectly convenient; we need tools that
– are very easy to use,
– do not create any proprietary data structures in Swift that cannot be read by other tools, and
– simply replace a shared drive
22. Access with GUI tools is required for collaboration
• Another example: ExpanDrive and Storage Made Easy
– Work with Windows and Mac
– Integrate into the Mac Finder and are mountable as a drive in Windows
23. rclone: mass copy, backup, data migration - better than rsync
• rclone is a multithreaded data copy / mirror tool
• Consistent performance on Linux, Mac and Windows
• E.g., keep a mirror of a Synology workgroup NAS (QNAP has a built-in Swift mirror option)
• Data remains accessible by swc and desktop clients
• Mirror protected by Swift undelete (currently 60 days retention)
24. Galaxy integration with OpenStack Swift in production
• Galaxy web-based high-throughput computing at the Hutch uses Swift as primary storage in production today
• SwiftStack patches contributed to the Galaxy Project
• Swift allows delegating “root” access to bioinformaticians
• Integrated with the Slurm HPC scheduler: automatically assigns the default PI account for each user