chris@bioteam.net / @chris_dag
PRACTICAL PETABYTE PUSHING
Jan 2019 / Lightning Talk / Foundation Medicine
Boston Computational Biology and Bioinformatics Meetup
Chris Dagdigian; chris@bioteam.net
30 Second Background
● 24x7 Production HPC Environment
● 100s of user accounts; 10+ power users; 50+ frequent users
● Many integrated “cluster aware” commercial apps leverage this system
● ~2 petabytes scientific & user data (Linux & Windows clients)
● Multiple catastrophic NAS outages in 2018
○ Demoralized scientists; shell-shocked IT staff; angry management
○ Replacement storage platform procured; 100% NAS-to-NAS migration ordered
● Mandate / Mission - 2 petabyte live data migration
○ IT must re-earn trust and confidence of scientific end-users & leadership
○ User morale/confidence is low; Stability/Uptime is key; Zero Unplanned Outages
○ “Jobs must flow” -- HPC remains in production during data migration
Lightning Talk ProTip: CONCLUSIONS FIRST
Things we already knew + things we wished we knew beforehand
1. NEVER commingle “data management” & “data movement” at the same time
Clean up/manage your data BEFORE or AFTER; never DURING
2. Understand vendor-specific data protection overhead up front (especially for small files)
The new NAS needed 20% more raw disk to store the same data, a non-trivial CapEx cost at petascale
3. Interrogate/understand your data before you move it (or buy new storage!)
Massive replication bandwidth is meaningless if you have 200+ million tiny files;
this was our real-world data movement bottleneck
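One way to "interrogate" a filesystem before moving it is a file-size histogram. The following is a minimal sketch, not the tooling used in this migration: the bucket boundaries and the function names `bucket_for` and `size_histogram` are illustrative assumptions, and a single-threaded walk like this would itself be slow on hundreds of millions of files.

```python
import os
from collections import Counter

# Bucket upper bounds in bytes. These boundaries are arbitrary illustration
# values, not thresholds taken from the talk.
BUCKETS = [(4 * 1024, "<=4KiB"), (1024 * 1024, "<=1MiB"), (1024 ** 3, "<=1GiB")]


def bucket_for(size):
    """Return the label of the first bucket whose upper bound covers size."""
    for limit, label in BUCKETS:
        if size <= limit:
            return label
    return ">1GiB"


def size_histogram(root):
    """Walk root and return (file counts per size bucket, total bytes)."""
    counts = Counter()
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are measured, not followed
                size = os.lstat(path).st_size
            except OSError:
                # Files can vanish on a live filesystem; skip them
                continue
            counts[bucket_for(size)] += 1
            total_bytes += size
    return counts, total_bytes
```

A result dominated by the smallest bucket suggests that per-file overhead, not bandwidth, is likely to set the pace of the migration.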
4. Be proactive in setting (and re-setting) management expectations
Data transfer time estimates based on aggregate network bandwidth were
insanely wrong. Real world throughput range was: [ 2mb/sec -- 13GB/sec ]
5. Tasks that take days/weeks require visibility & transparency
Users & management will want a dashboard or progress view
6. Work against full filesystems or network shares ONLY (See tip #1 …)
Attempts to get clever with curated “exclude-these-files-and-folders” lists add
complexity and introduce vectors for human/operator error
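To illustrate why bandwidth-only estimates (tip 4) go wrong, here is a rough back-of-the-envelope model. It is a sketch under stated assumptions: the 5 ms per-file overhead and the `workers` parameter are hypothetical, and the 13 GB/sec peak figure is used as the sustained bandwidth purely for illustration.

```python
def estimate_seconds(total_bytes, file_count, bandwidth_bytes_per_s,
                     per_file_overhead_s, workers=1):
    """Crude transfer-time model: streaming time plus per-file overhead.

    per_file_overhead_s stands in for metadata work (stat, open, close,
    attribute updates) and is a hypothetical value. The overhead is assumed
    to parallelize perfectly across workers, which is optimistic.
    """
    stream_time = total_bytes / bandwidth_bytes_per_s
    metadata_time = file_count * per_file_overhead_s / workers
    return stream_time + metadata_time


PB = 10 ** 15
SECONDS_PER_DAY = 86400

# Bandwidth-only estimate: 2 PB at the 13 GB/sec peak, ignoring file count.
naive = estimate_seconds(2 * PB, 0, 13 * 10 ** 9, 0.0)

# Same data with 200 million files and an assumed 5 ms of overhead per file.
with_files = estimate_seconds(2 * PB, 200_000_000, 13 * 10 ** 9, 0.005)

naive_days = naive / SECONDS_PER_DAY
with_files_days = with_files / SECONDS_PER_DAY
```

Under these assumptions the bandwidth-only figure is under two days, while the per-file overhead alone adds roughly eleven and a half days for a single worker. The exact numbers matter less than the shape: with enough small files, the file count dominates.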
Materials & Methods - Tooling
Tooling
● We are not special/unique in life science informatics - plagiarizing methods
from Amazon, supercomputing sites & high-energy physics is a legit strategy
● Our tooling choice: fpart/fpsync from https://github.com/martymac/fpart
○ ‘fpart’ - Does the hard work of filesystem crawling to build ‘partition’ lists that can be used as
input data for whatever tool you want to use to replicate/copy data
○ ‘fpsync’ - Wrapper script to parallelize, distribute and manage a swarm of replication jobs
○ ‘rsync’ - https://rsync.samba.org/
● Actual data replication via ‘rsync’ (managed by fpsync)
○ fpsync wrapper script is pluggable and supports different data mover/copy binaries
○ We explicitly chose ‘rsync’ because it is well known, well tested and had the fewest
potential edge and corner cases to deal with
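The partition-then-copy idea can be sketched in a few lines. This is a conceptual illustration only, not fpart's actual algorithm or fpsync's job management: `partition` and `rsync_command` are hypothetical helper names, and the limits are arbitrary. The only rsync options used are `-a` and `--files-from`.

```python
def partition(files, max_files, max_bytes):
    """Greedily split (path, size) pairs into chunks bounded by count and bytes.

    A single file larger than max_bytes still gets a chunk of its own.
    Returns a list of lists of paths.
    """
    parts = []
    current, current_bytes = [], 0
    for path, size in files:
        too_many = len(current) >= max_files
        too_big = bool(current) and current_bytes + size > max_bytes
        if too_many or too_big:
            parts.append(current)
            current, current_bytes = [], 0
        current.append(path)
        current_bytes += size
    if current:
        parts.append(current)
    return parts


def rsync_command(list_file, src_root, dst_root):
    """Build (but do not run) an rsync invocation for one partition.

    list_file is assumed to hold one path per line, relative to src_root.
    Note that with --files-from, -a does not imply recursion.
    """
    return ["rsync", "-a", f"--files-from={list_file}", src_root, dst_root]
```

Each chunk can then be handed to an independent worker, which is what lets many small transfers run in parallel instead of one long serial crawl.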
Materials & Methods - Process
The Process (one filesystem or share at a time):
● [A] Perform initial full replication in background on live “in-use” file system
● [B] Perform additional ‘re-sync’ replications to stay current
● [C] Perform ‘delete pass’ sync to catch data that was deleted from source filesystem while
replication(s) were occurring
● Repeat tasks [B] and [C] until time window for full sync + delete-pass is small enough to fit
within an acceptable maintenance/outage window
● Schedule outage window; make source filesystem Read-Only at a global level; perform final
replication sync; migrate client mounts; have backout plan handy
● Test, test, test, test, test, test (admins & end-users should both be involved in testing)
● Have a plan to document & support the previously unknown storage users that will come out of the
woodwork once you mark the source filesystem read-only (!)
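The repeat-until-it-fits loop in steps [B] and [C] can be sketched as follows. This is a simplified illustration that assumes plain `rsync -a` and `rsync -a --delete` passes rather than the fpsync-managed jobs used in the actual migration; `sync_command`, `converge` and `timed_subprocess_pass` are hypothetical names.

```python
import subprocess
import time


def sync_command(src, dst, delete=False):
    """Build an rsync command for a re-sync pass or, with delete=True, a delete pass."""
    cmd = ["rsync", "-a"]
    if delete:
        cmd.append("--delete")
    # Trailing slashes make rsync copy directory contents, not the directory itself
    return cmd + [src.rstrip("/") + "/", dst.rstrip("/") + "/"]


def converge(run_pass, window_seconds, max_rounds=10):
    """Repeat [B] re-sync + [C] delete pass until both fit in the outage window.

    run_pass(delete) must run one pass and return its duration in seconds.
    Returns (fits_in_window, list of per-round durations).
    """
    history = []
    for _ in range(max_rounds):
        elapsed = run_pass(False) + run_pass(True)
        history.append(elapsed)
        if elapsed <= window_seconds:
            return True, history
    return False, history


def timed_subprocess_pass(src, dst):
    """Return a run_pass callable that shells out to rsync and times it."""
    def run_pass(delete):
        start = time.monotonic()
        # check=True raises on any non-zero exit, including rsync's exit
        # code 24 (source files vanished), which is common on live data.
        subprocess.run(sync_command(src, dst, delete), check=True)
        return time.monotonic() - start
    return run_pass
```

Passing the runner in as a callable keeps the convergence logic testable without touching real storage. The per-round history is also the kind of data a progress dashboard could plot.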
Wrap Up
Commercial Alternative
● If management requires fancy live dashboards & other UI candy --OR-- you have limited IT/ops staff available to
support scripted OSS tooling …
● You can purchase petascale data migration capability commercially
○ Recommendation: Talk to DataDobi (https://datadobi.com)
○ (Yes this is a different niche than IBM Aspera or GridFTP type tooling …)
Acknowledgements
● Aaron Gardner (aaron@bioteam.net)
○ One of several Bioteam infrastructure gurus with extreme storage & filesystem expertise
○ He did the hard work on this
○ I just scripted things & monitored progress #lazy
More Info/Details: If you want to see this topic expanded into a long-form blog post / technical write-up
or BioITWorld conference talk then please let me know via email!
