SlideShare ist ein Scribd-Unternehmen logo
1 von 27
Ensuring Data Integrity
in a Digital Preservation
Archive
  Gabe Nault
      LDS Church
 naultga@ldschurch.org
Future Perfect Conference
           2012
                            image courtesy of IBM
Introducing the LDS Church
• The Church of Jesus Christ of Latter-day Saints
• Global Christian church with 14 million members
• 3 universities, 1 college
• State-of-the-art audio-
  visual capabilities
• Scriptural mandate to
  keep and preserve records                         since
  since 1830
                                  photo by Henok Montoya
The Church History Department
• Preserves records of enduring value from
  Church leaders, departments, universities,
  and affiliations (more than 35 organizations)
• Typically, less than
  10% of records are
  candidates for
  preservation



                            Church History Library on Temple Square
Granite Mountain Records Vault
          • Bored into a solid granite
            mountain

          • Stores large microfilm
            collections and valuable
            Church artifacts

          • Plans recently developed
            to renovate the facility for
            digital preservation
The Media Services Department
• Audiovisual records will
  consume majority of our
  archive capacity


                                                  Mormon Tabernacle Choir and Orchestra




 Free Bible videos from biblevideos.lds.org


• 100+ PB in a decade
  for a single copy!
                                              Conference Center on Temple Square
DRPS System Architecture
                  Fixity
                 Creation     DRPS Ingest Tools

               Preservation
Digital         Functions
Records          Fixity
                              Storage Extensions
Preservation     Bridge

               Information
System           Lifecycle        StorageGRID
               Management

                   Tape              IBM
                 Interface    Tivoli Storage Manager
DRPS Infrastructure
DRPS Highlights
• Multiple copies in multiple geographic
  locations (eventually)
• Approximately 1 PB spinning media
• Automatic replication to remote site(s)
• End to end data integrity
• Tape base permanent storage
Why Tape for Preservation?
        Total cost of storage ownership study
• TCO - Over ten years, ownership and
  operating costs of tape are three to
  fifteen times less than associated costs
  for disk arrays                             IBM TS3500
                                             Tape Libraries
• Cost advantages of tape are expected        image courtesy of IBM

  to increase over time
• Conclusion—for now, tape is required
  to sustain a multi-PB digital archive
• But . . . tape presents some challenges
Why Tape for Preservation?
                    Limitations
• Latency
• Limited to sequential access
• Limited number of read/writes    IBM TS3500
• Leads to greater system and     Tape Libraries
                                   image courtesy of IBM
  operational complexity
Data Integrity
• Data integrity validation is provided by fixity checks
  when data is written, transferred, moved, or copied
• Fixity checking should be performed from file
  creation to permanent storage to delivery
• Periodic validation of the entire archive should also
  be performed to detect data corruption(bit rot, drive
  errors, tape degradation, etc)
• DRPS uses a variety of integrity values for fixity
DRPS Fixity Data Flow
DRPS Data Integrity Validation
SHA-1
control
          DRPS Ingest Tools    SHA-1 created for producer files


SHA-1                          SHA-1 checked upon ingest
control                        and write to permanent storage

                               Web service retrieves StorageGRID
          Storage Extensions   SHA-1, then Rosetta plug-in
                               compares with Rosetta SHA-1

SHA-1                          SHA-1 created for ingested files
control
             StorageGRID
StorageGRID Fixity Checking
• StorageGRID is constructed around the concept
  of object storage
                               StorageGRID
• Provides a layered/overlapping set of protection
  domains to guard against object data corruption
  1.   SHA-1 object hash—checked on store and access
  2.   Content hash—checked on access
  3.   CRC checksum—checked with every operation
  4.   Key-based hash value—checked on access
DRPS Data Integrity Validation
SHA-1
control
          DRPS Ingest Tools        SHA-1 created for producer files


SHA-1                              SHA-1 checked upon ingest
control                            and write to permanent storage

                                   Web service retrieves StorageGRID
          Storage Extensions       SHA-1, then Rosetta plug-in
                                   compares with Rosetta SHA-1

SHA-1                              SHA-1 created for ingested files
control
              StorageGRID
                                   SHA-1 and other fixity checked
                                   during write to storage nodes
CRCs,            IBM               TSM end-to-end logical block
ECCs      Tivoli Storage Manager
                                   protection
TSM End-to-End Logical Block Protection
• Supersedes SHA-1 fixity information with
  cyclic redundancy check values (CRCs)
  and error-correcting codes (ECCs)

• Enabled with new, state-of-the-art
  functionality of IBM LTO-5 and TS1140
  tape drives

• Seamlessly extends validation
  of data integrity as data is
  written to tape
TSM End-to-End Logical Block Protection
                  1. TSM server calculates and
                     appends “original data CRC”
                     to logical data block




  2. Tape drive computes its
     own CRC and compares
     to original data CRC
TSM End-to-End Logical Block Protection
                    3. As logical block is loaded into
                       drive data buffer, on-the-fly
                       verifier checks original data CRC




  4. In parallel, a “C1 code”
     (ECC) is computed and
     appended
TSM End-to-End Logical Block Protection
                    5. An additional ECC, referred to
                       as “C2 code,” is added to the
                       logical block




  6. More powerful than the
     original data CRC, the
     C1 code is checked every
     time data is read from the buffer
TSM End-to-End Logical Block Protection
                  7. Data written to tape at full
                     line speed with read-while-
                     write process
                  8. Just written data loaded to
                     buffer and C1 code checked



  Successful read-while-
  write operation assures no
  data corruption from TSM
  server to tape
TSM End-to-End Logical Block Protection
                     9. When tape is read, all codes
                        (C1, C2, original data CRC)
                        are checked by drive
                     10. Original data CRC appended
                             to logical block



 11. TSM server verifies original
   data CRC, completing TSM
   end-to-end logical block
   protection cycle
Ongoing Archive Data Integrity
• We must assume that data may become
  corrupted after being written correctly
  to tape

• Therefore, tapes must be read
  periodically to identify and correct data
  errors




                                        image courtesy of IBM
Ongoing Archive Data Integrity
• Staging IEs to disk to verify integrity
  is resource intensive!
• IBM LTO-5 and TS1140 tape drives
  provide a more efficient solution
• During “SCSI Verify” operation, a
  tape is mounted, drive checks all         image courtesy of IBM


  codes (C1, C2, original data CRC) as
  data is being read (at full line speed)
• Only status is reported as these
  internal checks are completed
Summary
• Fixity information is the     SHA-1
                                          DRPS Ingest Tools
  key to data integrity         control

• SHA-1 values ensure data      SHA-1
  integrity to StorageGRID      control

• TSM end-to-end logical                  Storage Extensions
  block protection ensures
  data integrity to tape        SHA-1
                                control
                                              StorageGRID
• In-drive validation enables
  ongoing integrity checks      CRCs,            IBM
                                ECCs      Tivoli Storage Manager
  for the entire archive
Thank you!

Questions?
Trademarks


The Ex Libris logo and Rosetta are trademarks of Ex Libris Group.

The IBM logo and Tivoli Storage Manager are trademarks of International Business Machines Corporation.

The NetApp logo and StorageGRID are trademarks of NetApp, Inc.
Rate of Bit Errors
• Preliminary validation of DRPS archive
  resulted in a 3.3x10-14 bit error rate
• USC Shoah Foundation Institute visit
• 8 PB tape archive of videotaped interviews of
  Holocaust survivors and other witnesses
• Experienced 1500 bit flips in 8 PB
  (2.3x10-14 bit error rate)`

Weitere ähnliche Inhalte

Andere mochten auch

How to – data integrity checks in batch processing
How to – data integrity checks in batch processingHow to – data integrity checks in batch processing
How to – data integrity checks in batch processingSon Nguyen
 
Portrait Of Jesus In The Tabernacle 2
Portrait Of Jesus In The Tabernacle 2Portrait Of Jesus In The Tabernacle 2
Portrait Of Jesus In The Tabernacle 2New Life Bible Chapel
 
Portrait Of Jesus In The Tabernacle 3
Portrait Of Jesus In The Tabernacle 3Portrait Of Jesus In The Tabernacle 3
Portrait Of Jesus In The Tabernacle 3New Life Bible Chapel
 
21 CFR Part 58 & Data integrity
21 CFR Part 58 & Data integrity21 CFR Part 58 & Data integrity
21 CFR Part 58 & Data integrityAnand Pandya
 
Portrait Of Jesus In The Tabernacle 4
Portrait Of Jesus In The Tabernacle 4Portrait Of Jesus In The Tabernacle 4
Portrait Of Jesus In The Tabernacle 4New Life Bible Chapel
 
FDA Data Integrity Issues - DMS hot fixes
FDA Data Integrity Issues - DMS hot fixesFDA Data Integrity Issues - DMS hot fixes
FDA Data Integrity Issues - DMS hot fixesVidyasagar P
 
Greg Bilbrey - Data Integrity, Using Records for Benchmarking and Operations
Greg Bilbrey - Data Integrity, Using Records for Benchmarking and OperationsGreg Bilbrey - Data Integrity, Using Records for Benchmarking and Operations
Greg Bilbrey - Data Integrity, Using Records for Benchmarking and OperationsJohn Blue
 
Data Integrity in a GxP-regulated Environment - Pauwels Consulting Academy
Data Integrity in a GxP-regulated Environment - Pauwels Consulting AcademyData Integrity in a GxP-regulated Environment - Pauwels Consulting Academy
Data Integrity in a GxP-regulated Environment - Pauwels Consulting AcademyPauwels Consulting
 
The garments of the high priest
The garments of the high priestThe garments of the high priest
The garments of the high priestbishop01
 
Portrait Of Jesus In The Tabernacle I
Portrait Of Jesus In The Tabernacle IPortrait Of Jesus In The Tabernacle I
Portrait Of Jesus In The Tabernacle INew Life Bible Chapel
 
Data Integrity webinar - Essentials & Solutions
Data Integrity webinar - Essentials & SolutionsData Integrity webinar - Essentials & Solutions
Data Integrity webinar - Essentials & Solutionspi
 
The Shadow of Jesus in the Tabernacle
The Shadow of Jesus in the TabernacleThe Shadow of Jesus in the Tabernacle
The Shadow of Jesus in the TabernacleRick Bowen
 
High priest's garments
High priest's garmentsHigh priest's garments
High priest's garmentsKyle Lammott
 
Data Integrity Validation Keynote Address Boston August 2016
Data Integrity Validation Keynote Address Boston August 2016Data Integrity Validation Keynote Address Boston August 2016
Data Integrity Validation Keynote Address Boston August 2016Ajaz Hussain
 

Andere mochten auch (20)

Tabernacle
TabernacleTabernacle
Tabernacle
 
How to – data integrity checks in batch processing
How to – data integrity checks in batch processingHow to – data integrity checks in batch processing
How to – data integrity checks in batch processing
 
The Tabernacle
The TabernacleThe Tabernacle
The Tabernacle
 
Data integrity
Data integrityData integrity
Data integrity
 
Portrait Of Jesus In The Tabernacle 2
Portrait Of Jesus In The Tabernacle 2Portrait Of Jesus In The Tabernacle 2
Portrait Of Jesus In The Tabernacle 2
 
Data Integrity GMP Scientific
Data Integrity GMP ScientificData Integrity GMP Scientific
Data Integrity GMP Scientific
 
Portrait Of Jesus In The Tabernacle 3
Portrait Of Jesus In The Tabernacle 3Portrait Of Jesus In The Tabernacle 3
Portrait Of Jesus In The Tabernacle 3
 
Data integrity challenges and solutions
Data integrity challenges and solutionsData integrity challenges and solutions
Data integrity challenges and solutions
 
21 CFR Part 58 & Data integrity
21 CFR Part 58 & Data integrity21 CFR Part 58 & Data integrity
21 CFR Part 58 & Data integrity
 
Portrait Of Jesus In The Tabernacle 4
Portrait Of Jesus In The Tabernacle 4Portrait Of Jesus In The Tabernacle 4
Portrait Of Jesus In The Tabernacle 4
 
FDA Data Integrity Issues - DMS hot fixes
FDA Data Integrity Issues - DMS hot fixesFDA Data Integrity Issues - DMS hot fixes
FDA Data Integrity Issues - DMS hot fixes
 
Greg Bilbrey - Data Integrity, Using Records for Benchmarking and Operations
Greg Bilbrey - Data Integrity, Using Records for Benchmarking and OperationsGreg Bilbrey - Data Integrity, Using Records for Benchmarking and Operations
Greg Bilbrey - Data Integrity, Using Records for Benchmarking and Operations
 
Data Integrity in a GxP-regulated Environment - Pauwels Consulting Academy
Data Integrity in a GxP-regulated Environment - Pauwels Consulting AcademyData Integrity in a GxP-regulated Environment - Pauwels Consulting Academy
Data Integrity in a GxP-regulated Environment - Pauwels Consulting Academy
 
The garments of the high priest
The garments of the high priestThe garments of the high priest
The garments of the high priest
 
Portrait Of Jesus In The Tabernacle I
Portrait Of Jesus In The Tabernacle IPortrait Of Jesus In The Tabernacle I
Portrait Of Jesus In The Tabernacle I
 
Data Integrity webinar - Essentials & Solutions
Data Integrity webinar - Essentials & SolutionsData Integrity webinar - Essentials & Solutions
Data Integrity webinar - Essentials & Solutions
 
The Shadow of Jesus in the Tabernacle
The Shadow of Jesus in the TabernacleThe Shadow of Jesus in the Tabernacle
The Shadow of Jesus in the Tabernacle
 
High priest's garments
High priest's garmentsHigh priest's garments
High priest's garments
 
High Priestly Garments
High Priestly GarmentsHigh Priestly Garments
High Priestly Garments
 
Data Integrity Validation Keynote Address Boston August 2016
Data Integrity Validation Keynote Address Boston August 2016Data Integrity Validation Keynote Address Boston August 2016
Data Integrity Validation Keynote Address Boston August 2016
 

Ähnlich wie Gabe Nault Data Integrity

Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4Tony Pearson
 
002-Storage Basics and Application Environments V1.0.pptx
002-Storage Basics and Application Environments V1.0.pptx002-Storage Basics and Application Environments V1.0.pptx
002-Storage Basics and Application Environments V1.0.pptxDrewMe1
 
IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its ApplicationsIBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its ApplicationsTony Pearson
 
S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5Tony Pearson
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSandeep Patil
 
Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Trishali Nayar
 
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...Ben Stopford
 
The Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsThe Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsTony Pearson
 
Securing the Container Pipeline at Salesforce by Cem Gurkok
Securing the Container Pipeline at Salesforce by Cem Gurkok   Securing the Container Pipeline at Salesforce by Cem Gurkok
Securing the Container Pipeline at Salesforce by Cem Gurkok Docker, Inc.
 
SNW Pres City Of Safford 101209 Lr
SNW Pres City Of Safford 101209 LrSNW Pres City Of Safford 101209 Lr
SNW Pres City Of Safford 101209 LrEric Herzog
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inRahulBhole12
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualizationSisimon Soman
 
The Efficient Use of Cyberinfrastructure to Enable Data Analysis Collaboration
The Efficient Use of Cyberinfrastructure  to Enable Data Analysis CollaborationThe Efficient Use of Cyberinfrastructure  to Enable Data Analysis Collaboration
The Efficient Use of Cyberinfrastructure to Enable Data Analysis CollaborationCybera Inc.
 
Deep Dive on Amazon Elastic File System - June 2017 AWS Online Tech Talks
Deep Dive on Amazon Elastic File System - June 2017 AWS Online Tech TalksDeep Dive on Amazon Elastic File System - June 2017 AWS Online Tech Talks
Deep Dive on Amazon Elastic File System - June 2017 AWS Online Tech TalksAmazon Web Services
 

Ähnlich wie Gabe Nault Data Integrity (20)

Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4Inter connect2016 yss1841-cloud-storage-options-v4
Inter connect2016 yss1841-cloud-storage-options-v4
 
002-Storage Basics and Application Environments V1.0.pptx
002-Storage Basics and Application Environments V1.0.pptx002-Storage Basics and Application Environments V1.0.pptx
002-Storage Basics and Application Environments V1.0.pptx
 
San
SanSan
San
 
San
SanSan
San
 
IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its ApplicationsIBM Cloud Object Storage System (powered by Cleversafe) and its Applications
IBM Cloud Object Storage System (powered by Cleversafe) and its Applications
 
S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5S ss0885 spectrum-scale-elastic-edge2015-v5
S ss0885 spectrum-scale-elastic-edge2015-v5
 
Spectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN CachingSpectrum Scale Unified File and Object with WAN Caching
Spectrum Scale Unified File and Object with WAN Caching
 
Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...Software Defined Analytics with File and Object Access Plus Geographically Di...
Software Defined Analytics with File and Object Access Plus Geographically Di...
 
IWM Infrastructure
IWM InfrastructureIWM Infrastructure
IWM Infrastructure
 
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for H...
 
The Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged EnvironmentsThe Pendulum Swings Back: Converged and Hyperconverged Environments
The Pendulum Swings Back: Converged and Hyperconverged Environments
 
Securing the Container Pipeline at Salesforce by Cem Gurkok
Securing the Container Pipeline at Salesforce by Cem Gurkok   Securing the Container Pipeline at Salesforce by Cem Gurkok
Securing the Container Pipeline at Salesforce by Cem Gurkok
 
SNW Pres City Of Safford 101209 Lr
SNW Pres City Of Safford 101209 LrSNW Pres City Of Safford 101209 Lr
SNW Pres City Of Safford 101209 Lr
 
Cloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation inCloud computing UNIT 2.1 presentation in
Cloud computing UNIT 2.1 presentation in
 
VDI storage and storage virtualization
VDI storage and storage virtualizationVDI storage and storage virtualization
VDI storage and storage virtualization
 
LAN Fundamentals
LAN FundamentalsLAN Fundamentals
LAN Fundamentals
 
Performance
PerformancePerformance
Performance
 
The Efficient Use of Cyberinfrastructure to Enable Data Analysis Collaboration
The Efficient Use of Cyberinfrastructure  to Enable Data Analysis CollaborationThe Efficient Use of Cyberinfrastructure  to Enable Data Analysis Collaboration
The Efficient Use of Cyberinfrastructure to Enable Data Analysis Collaboration
 
Timesten Architecture
Timesten ArchitectureTimesten Architecture
Timesten Architecture
 
Deep Dive on Amazon Elastic File System - June 2017 AWS Online Tech Talks
Deep Dive on Amazon Elastic File System - June 2017 AWS Online Tech TalksDeep Dive on Amazon Elastic File System - June 2017 AWS Online Tech Talks
Deep Dive on Amazon Elastic File System - June 2017 AWS Online Tech Talks
 

Mehr von Future Perfect 2012

Working Across Organizations white paper
Working Across Organizations white paperWorking Across Organizations white paper
Working Across Organizations white paperFuture Perfect 2012
 
Ensuring Data Integrity white paper
Ensuring Data Integrity white paperEnsuring Data Integrity white paper
Ensuring Data Integrity white paperFuture Perfect 2012
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Future Perfect 2012
 
Joe Coleman Biodiversity Heritage Library
Joe Coleman Biodiversity Heritage LibraryJoe Coleman Biodiversity Heritage Library
Joe Coleman Biodiversity Heritage LibraryFuture Perfect 2012
 
James Smithies Academic Earthquake Research
James Smithies Academic Earthquake ResearchJames Smithies Academic Earthquake Research
James Smithies Academic Earthquake ResearchFuture Perfect 2012
 
Shaun Hendy Innovation Ecosystem
Shaun Hendy Innovation EcosystemShaun Hendy Innovation Ecosystem
Shaun Hendy Innovation EcosystemFuture Perfect 2012
 
Martin Donnelly Sarah Jones DMP Online
Martin Donnelly Sarah Jones DMP OnlineMartin Donnelly Sarah Jones DMP Online
Martin Donnelly Sarah Jones DMP OnlineFuture Perfect 2012
 
Steve Mc Eachern Australian Data Archive
Steve Mc Eachern Australian Data ArchiveSteve Mc Eachern Australian Data Archive
Steve Mc Eachern Australian Data ArchiveFuture Perfect 2012
 
Parul Sharma Sally Vermaaten Right Combination
Parul Sharma Sally Vermaaten Right CombinationParul Sharma Sally Vermaaten Right Combination
Parul Sharma Sally Vermaaten Right CombinationFuture Perfect 2012
 
Alison Fleming Michael Upton Collaborating for Success
Alison Fleming Michael Upton Collaborating for SuccessAlison Fleming Michael Upton Collaborating for Success
Alison Fleming Michael Upton Collaborating for SuccessFuture Perfect 2012
 
Clare Somerville Trish O’Kane Data in Databases
Clare Somerville Trish O’Kane Data in DatabasesClare Somerville Trish O’Kane Data in Databases
Clare Somerville Trish O’Kane Data in DatabasesFuture Perfect 2012
 
Cochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsCochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsFuture Perfect 2012
 
Dave Pearson The Adventures of Digi
Dave Pearson The Adventures of DigiDave Pearson The Adventures of Digi
Dave Pearson The Adventures of DigiFuture Perfect 2012
 
Jay Gattuso Persistently Identifying Formats
Jay Gattuso Persistently Identifying FormatsJay Gattuso Persistently Identifying Formats
Jay Gattuso Persistently Identifying FormatsFuture Perfect 2012
 
Jeff Rothenberg Digital Preservation Perspective
Jeff Rothenberg Digital Preservation PerspectiveJeff Rothenberg Digital Preservation Perspective
Jeff Rothenberg Digital Preservation PerspectiveFuture Perfect 2012
 
Stuart Wakefield Cloud Computing
Stuart Wakefield Cloud ComputingStuart Wakefield Cloud Computing
Stuart Wakefield Cloud ComputingFuture Perfect 2012
 

Mehr von Future Perfect 2012 (20)

Working Across Organizations white paper
Working Across Organizations white paperWorking Across Organizations white paper
Working Across Organizations white paper
 
Ensuring Data Integrity white paper
Ensuring Data Integrity white paperEnsuring Data Integrity white paper
Ensuring Data Integrity white paper
 
Bigger Hard Drive Jamie Lean
Bigger Hard Drive Jamie LeanBigger Hard Drive Jamie Lean
Bigger Hard Drive Jamie Lean
 
Steve Knight by Design
Steve Knight by DesignSteve Knight by Design
Steve Knight by Design
 
Michael Parsons Passion
Michael Parsons PassionMichael Parsons Passion
Michael Parsons Passion
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
 
Joe Coleman Biodiversity Heritage Library
Joe Coleman Biodiversity Heritage LibraryJoe Coleman Biodiversity Heritage Library
Joe Coleman Biodiversity Heritage Library
 
James Smithies Academic Earthquake Research
James Smithies Academic Earthquake ResearchJames Smithies Academic Earthquake Research
James Smithies Academic Earthquake Research
 
Shaun Hendy Innovation Ecosystem
Shaun Hendy Innovation EcosystemShaun Hendy Innovation Ecosystem
Shaun Hendy Innovation Ecosystem
 
Martin Donnelly Sarah Jones DMP Online
Martin Donnelly Sarah Jones DMP OnlineMartin Donnelly Sarah Jones DMP Online
Martin Donnelly Sarah Jones DMP Online
 
Steve Mc Eachern Australian Data Archive
Steve Mc Eachern Australian Data ArchiveSteve Mc Eachern Australian Data Archive
Steve Mc Eachern Australian Data Archive
 
Parul Sharma Sally Vermaaten Right Combination
Parul Sharma Sally Vermaaten Right CombinationParul Sharma Sally Vermaaten Right Combination
Parul Sharma Sally Vermaaten Right Combination
 
Alison Fleming Michael Upton Collaborating for Success
Alison Fleming Michael Upton Collaborating for SuccessAlison Fleming Michael Upton Collaborating for Success
Alison Fleming Michael Upton Collaborating for Success
 
Andrew Waugh Business Systems
Andrew Waugh Business SystemsAndrew Waugh Business Systems
Andrew Waugh Business Systems
 
Clare Somerville Trish O’Kane Data in Databases
Clare Somerville Trish O’Kane Data in DatabasesClare Somerville Trish O’Kane Data in Databases
Clare Somerville Trish O’Kane Data in Databases
 
Cochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and FormatsCochrane von Suchodoletz File Creation, Rendering and Formats
Cochrane von Suchodoletz File Creation, Rendering and Formats
 
Dave Pearson The Adventures of Digi
Dave Pearson The Adventures of DigiDave Pearson The Adventures of Digi
Dave Pearson The Adventures of Digi
 
Jay Gattuso Persistently Identifying Formats
Jay Gattuso Persistently Identifying FormatsJay Gattuso Persistently Identifying Formats
Jay Gattuso Persistently Identifying Formats
 
Jeff Rothenberg Digital Preservation Perspective
Jeff Rothenberg Digital Preservation PerspectiveJeff Rothenberg Digital Preservation Perspective
Jeff Rothenberg Digital Preservation Perspective
 
Stuart Wakefield Cloud Computing
Stuart Wakefield Cloud ComputingStuart Wakefield Cloud Computing
Stuart Wakefield Cloud Computing
 

Kürzlich hochgeladen

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 

Kürzlich hochgeladen (20)

Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 

Gabe Nault Data Integrity

  • 1. Ensuring Data Integrity in a Digital Preservation Archive Gabe Nault LDS Church naultga@ldschurch.org Future Perfect Conference 2012 image courtesy of IBM
  • 2. Introducing the LDS Church • The Church of Jesus Christ of Latter-day Saints • Global Christian church with 14 million members • 3 universities, 1 college • State-of-the-art audio- visual capabilities • Scriptural mandate to keep and preserve records since since 1830 photo by Henok Montoya
  • 3. The Church History Department • Preserves records of enduring value from Church leaders, departments, universities, and affiliations (more than 35 organizations) • Typically, less than 10% of records are candidates for preservation Church History Library on Temple Square
  • 4. Granite Mountain Records Vault • Bored into a solid granite mountain • Stores large microfilm collections and valuable Church artifacts • Plans recently developed to renovate the facility for digital preservation
  • 5. The Media Services Department • Audiovisual records will consume majority of our archive capacity Mormon Tabernacle Choir and Orchestra Free Bible videos from biblevideos.lds.org • 100+ PB in a decade for a single copy! Conference Center on Temple Square
  • 6. DRPS System Architecture Fixity Creation DRPS Ingest Tools Preservation Digital Functions Records Fixity Storage Extensions Preservation Bridge Information System Lifecycle StorageGRID Management Tape IBM Interface Tivoli Storage Manager
  • 8. DRPS Highlights • Multiple copies in multiple geographic locations (eventually) • Approximately 1 PB spinning media • Automatic replication to remote site(s) • End to end data integrity • Tape base permanent storage
  • 9. Why Tape for Preservation? Total cost of storage ownership study • TCO - Over ten years, ownership and operating costs of tape are three to fifteen times less than associated costs for disk arrays IBM TS3500 Tape Libraries • Cost advantages of tape are expected image courtesy of IBM to increase over time • Conclusion—for now, tape is required to sustain a multi-PB digital archive • But . . . tape presents some challenges
  • 10. Why Tape for Preservation? Limitations • Latency • Limited to sequential access • Limited number of read/writes IBM TS3500 • Leads to greater system and Tape Libraries image courtesy of IBM operational complexity
  • 11. Data Integrity • Data integrity validation is provided by fixity checks when data is written, transferred, moved, or copied • Fixity checking should be performed from file creation to permanent storage to delivery • Periodic validation of the entire archive should also be performed to detect data corruption(bit rot, drive errors, tape degradation, etc) • DRPS uses a variety of integrity values for fixity
  • 13. DRPS Data Integrity Validation SHA-1 control DRPS Ingest Tools SHA-1 created for producer files SHA-1 SHA-1 checked upon ingest control and write to permanent storage Web service retrieves StorageGRID Storage Extensions SHA-1, then Rosetta plug-in compares with Rosetta SHA-1 SHA-1 SHA-1 created for ingested files control StorageGRID
  • 14. StorageGRID Fixity Checking • StorageGRID is constructed around the concept of object storage StorageGRID • Provides a layered/overlapping set of protection domains to guard against object data corruption 1. SHA-1 object hash—checked on store and access 2. Content hash—checked on access 3. CRC checksum—checked with every operation 4. Key-based hash value—checked on access
  • 15. DRPS Data Integrity Validation SHA-1 control DRPS Ingest Tools SHA-1 created for producer files SHA-1 SHA-1 checked upon ingest control and write to permanent storage Web service retrieves StorageGRID Storage Extensions SHA-1, then Rosetta plug-in compares with Rosetta SHA-1 SHA-1 SHA-1 created for ingested files control StorageGRID SHA-1 and other fixity checked during write to storage nodes CRCs, IBM TSM end-to-end logical block ECCs Tivoli Storage Manager protection
  • 16. TSM End-to-End Logical Block Protection • Supersedes SHA-1 fixity information with cyclic redundancy check values (CRCs) and error-correcting codes (ECCs) • Enabled with new, state-of-the-art functionality of IBM LTO-5 and TS1140 tape drives • Seamlessly extends validation of data integrity as data is written to tape
  • 17. TSM End-to-End Logical Block Protection 1. TSM server calculates and appends “original data CRC” to logical data block 2. Tape drive computes its own CRC and compares to original data CRC
  • 18. TSM End-to-End Logical Block Protection 3. As logical block is loaded into drive data buffer, on-the-fly verifier checks original data CRC 4. In parallel, a “C1 code” (ECC) is computed and appended
  • 19. TSM End-to-End Logical Block Protection 5. An additional ECC, referred to as “C2 code,” is added to the logical block 6. More powerful than the original data CRC, the C1 code is checked every time data is read from the buffer
  • 20. TSM End-to-End Logical Block Protection 7. Data written to tape at full line speed with read-while- write process 8. Just written data loaded to buffer and C1 code checked Successful read-while- write operation assures no data corruption from TSM server to tape
  • 21. TSM End-to-End Logical Block Protection 9. When tape is read, all codes (C1, C2, original data CRC) are checked by drive 10. Original data CRC appended to logical block 11. TSM server verifies original data CRC, completing TSM end-to-end logical block protection cycle
  • 22. Ongoing Archive Data Integrity • We must assume that data may become corrupted after being written correctly to tape • Therefore, tapes must be read periodically to identify and correct data errors image courtesy of IBM
  • 23. Ongoing Archive Data Integrity • Staging IEs to disk to verify integrity is resource intensive! • IBM LTO-5 and TS1140 tape drives provide a more efficient solution • During “SCSI Verify” operation, a tape is mounted, drive checks all image courtesy of IBM codes (C1, C2, original data CRC) as data is being read (at full line speed) • Only status is reported as these internal checks are completed
  • 24. Summary • Fixity information is the SHA-1 DRPS Ingest Tools key to data integrity control • SHA-1 values ensure data SHA-1 integrity to StorageGRID control • TSM end-to-end logical Storage Extensions block protection ensures data integrity to tape SHA-1 control StorageGRID • In-drive validation enables ongoing integrity checks CRCs, IBM ECCs Tivoli Storage Manager for the entire archive
  • 26. Trademarks The Ex Libris logo and Rosetta are trademarks of Ex Libris Group. The IBM logo and Tivoli Storage Manager are trademarks of International Business Machines Corporation. The NetApp logo and StorageGRID are trademarks of NetApp, Inc.
  • 27. Rate of Bit Errors • Preliminary validation of DRPS archive resulted in a 3.3x10-14 bit error rate • USC Shoah Foundation Institute visit • 8 PB tape archive of videotaped interviews of Holocaust survivors and other witnesses • Experienced 1500 bit flips in 8 PB (2.3x10-14 bit error rate)`

Hinweis der Redaktion

  1. Good morning! My presentation will cover the challenges of, and some working solutions to, a key requirement of digital preservation—ongoing data integrity of the archive. The solutions I will discuss were developed cooperatively by three vendors in conjunction with the Church of Jesus Christ of Latter-day Saints. By the way, you will be able to download a white paper that covers my presentation along with the presentation itself when the conference is over.
  2. First let me introduce the Church.Its full name is the Church of Jesus Christ of Latter-day Saints. Headquarters are in Salt Lake City, Utah – a Western State in the United Sates of America. The building shown here is the Salt Lake Temple, which has come to be a symbol for the Church. For your information, the Church operates 134 temples around the world. Temples are not weekly meeting places; rather, they are sacred places where families are sealed together forever – beyond life here on earth.We are a global Christian Church with more than 14 million members.The Church has more than 700,000 students enrolled in religious training around the world.It also operates three universities and a business college. Education is very important to members of the Church of Jesus Christ of Latter-day Saints.Over the last two decades, the Church has developed state-of-the-art digital audiovisual capabilities to support its vast, worldwide communications needs. I will talk more about this later.The Church has a scriptural mandate to keep records of its proceedings and preserve them for future generations. Accordingly, the Church has been creating and keeping records since 1830, when it was organized. A Church Historian’s Office was formed in the 1840s, and in 1972 it was renamed the Church History Department.
  3. Today, the Church History Department has ultimate responsibility for preserving records of enduring value that originate from its ecclesiastical leaders and within the various Church departments, the Church’s educational institutions, and its affiliations. In order to carry out this responsibility, the Church History Department’s Records Management team helps each Church organization develop a records management plan. The plan identifies all records used by the organization and establishes a record retention and disposition schedule for each collection.Usually, less than 10% of the records have a final disposition of “archive.” Only these records are preserved for future generations.
  4. Unfortunately, reading all the tapes in the archive in order to stage AIPs to disk so servers can check the fixity information is clearly a resource intensive task—especially for an archive with a capacity measured in hundreds of petabytes! Fortunately, IBM LTO-5 and TS1140 tape drives provide a much more efficient solution.During a “Verify” operation, IBM LTO-5 and TS1140 drives perform data integrity validation in-drive, which means a drive reads a tape and concurrently checks the three logical block CRCs and ECCs discussed previously at full line speed.Good or bad status is reported as soon as these internal checks are completed. And this is done without requiring any other resources! Clearly, this advanced capability enhances the ability of DRPS to perform periodic data integrity validations of the entire archive more frequently, which will facilitate the correction of bit flips after AIPs are written correctly to tape.
  5. I mentioned earlier that the Church has developed state-of-the-art digital audiovisual capabilities to support its vast, worldwide communications needs. The Media Services Department uses these capabilities to support the rest of the Church organizations in their audiovisual needs. Because of the average size of MSD audiovisual files, which is several hundred gigabytes so far, MSD audiovisual files will consume the vast majority of archive capacity in the Church History Department’s Digital Records Preservation System.One example of audiovisual records we preserve is weekly broadcasts of Music and the Spoken Word—the world’s longest continuous network broadcast (now in its 83rd year). Each broadcast features an inspirational message and music performed by the Mormon Tabernacle Choir, also known as “America’s Choir,” and the Orchestra at Temple Square. These National Radio Hall of Fame broadcasts are clearly a priceless treasure for the world that are being preserved for future generations.Another example of audiovisual records we preserve is semiannual broadcasts of General Conference, which is held in the remarkable Conference Center, shown here, that seats 21,000. The meetings are broadcast in high definition video via satellite to more than 7,400 Church buildings in 102 countries. The broadcasts are simultaneously translated into 76 languages. Ultimately, digital audio tracks for 96 languages are created and preserved to augment the digital video taping of each meeting. Not surprisingly, the Church is the world’s largest language broadcaster. As a gift to the world, the Church launched a new website last Christmas that provides free Bible videos of the birth, life, death, and resurrection of the Lord Jesus Christ. Viewable with a free mobile app, these videos are faithful to the biblical account, and of course will be preserved for future generations. I encourage you to visit the website at biblevideos.lds.org.With audiovisual files such as these, we expect that our archive capacity within a decade will exceed 100 petabytes for a single copy!
  6. The Church History Department’s Digital Records Preservation System, or DRPS, is based on Ex Libris Rosetta. Rosetta provides configurable preservation workflows and advanced preservation planning functions, but only writes a single copy of an AIP to a storage device for permanent storage. Therefore, an appropriate storage layer must be integrated with Rosetta in order to provide the full capabilities of a digital preservation archive, including AIP replication.After investigating a host of potential storage layer solutions, the Church History Department chose NetApp StorageGRID to provide the ILM capabilities that were desired. In particular, StorageGRID’s data integrity, data resilience, and data replication capabilities were attractive. In order to support ILM migration of AIPs from disk to tape, StorageGRID utilizes IBM Tivoli Storage Manager, or TSM, as an interface to tape libraries.DRPS also employs software extensions developed by my team, which is part of Church Information and Communications Services . The first is a set of ingest tools that help with fixity information creation, which I will discuss later.The second involves a fixity information bridge that will also be described later.
  7. You may wonder why we chose to use tape libraries for the DRPS archive. In 2008, an internal study was performed to compare the costs of acquisition, maintenance, administration, data center floor space, and power to archive hundreds of petabytes of digital records using disk arrays, optical disks, virtual tape libraries, and automated tape cartridges. The model also incorporated assumptions about increasing storage densities of these different storage technologies over time.Calculating all costs over a ten year period, the study concluded that the total cost of ownership of automated tape cartridges would be 33.7% of the next closest storage technology, which was disk arrays.Based on discussions with major storage providers, we believe that the cost of power and the cost per terabyte advantages of tape will only increase over time.Therefore, we concluded that, at least for now, we should use tape libraries to sustain our digital archive that is expected to skyrocket to a multiple petabyte capacity in just a few years. When we made this decision, we were NOT naive to the challenges of tape we would be facing.
  8. You may wonder why we chose to use tape libraries for the DRPS archive. In 2008, an internal study was performed to compare the costs of acquisition, maintenance, administration, data center floor space, and power to archive hundreds of petabytes of digital records using disk arrays, optical disks, virtual tape libraries, and automated tape cartridges. The model also incorporated assumptions about increasing storage densities of these different storage technologies over time.Calculating all costs over a ten year period, the study concluded that the total cost of ownership of automated tape cartridges would be 33.7% of the next closest storage technology, which was disk arrays.Based on discussions with major storage providers, we believe that the cost of power and the cost per terabyte advantages of tape will only increase over time.Therefore, we concluded that, at least for now, we should use tape libraries to sustain our digital archive that is expected to skyrocket to a multiple petabyte capacity in just a few years. When we made this decision, we were NOT naive to the challenges of tape we would be facing.
  9. One of those challenges has to do with ensuring data integrity of the tape archive. This is a critical requirement for any digital preservation archive, and it differentiates a tape archive from other types of tape farms. Modern IT equipment, including servers, storage, and network switches and routers, incorporate advanced features to minimize data corruption. Nevertheless, undetected errors still occur for a variety of reasons. Whenever data files are written, read, stored, transmitted over a network, or processed, there is a small but real possibility that corruption will occur. Causes range from hardware and software failures to network transmission failures and interruptions. Bit flips within files stored on tape can also cause data corruption.Fixity information enables data integrity validation. Fixity information is a checksum, or integrity value, that is calculated by a secure hash algorithm to ensure data integrity of an AIP file throughout preservation workflows and after the file has been written to the archive. By comparing fixity hash values before and after records are written, transferred over a network, moved or copied, a digital preservation system can determine if data corruption has taken place during its workflows or while the AIP is stored in the archive.To do data integrity validation correctly, end-to-end fixity checking should be performed from file ingest to storing the file on permanent storage to eventual access and delivery.Furthermore, data integrity validation of the entire archive should be performed periodically to detect and correct bit flips (also known as bit rot).DRPS uses a variety of hash values, cyclic redundancy check values, and error-correcting codes for such fixity information.
  10. As mentioned earlier, DRPS employs a variety of hash values, cyclic redundancy check values, and error-correcting codes in order to ensure data integrity of its tape archive. The chain of control of this fixity information is illustrated here.In order to implement fixity information as early as possible in the preservation process, and thus minimize data errors, DRPS provides ingest tools developed by my team that create SHA-1 fixity information for producer files before they are transferred to DRPS for ingest Control of this SHA-1 fixity information is transferred when a file is ingested into Rosetta. Within Rosetta, SHA-1 fixity checks are performed three times—(1) when the deposit server receives a Submission Information Package (SIP), (2) during the SIP validation process, and (3) when an AIP is moved to permanent storage. Rosetta also provides the capability to perform fixity checks on AIP files written to permanent storage, but the ILM features of StorageGRID do not utilize this capability. Therefore, StorageGRID must take over control of the SHA-1 fixity information once files have been written to it.By collaborating with Ex Libris on this process, ICS and Ex Libris have been successful in making the fixity information hand off from Rosetta to StorageGRID. This is accomplished with a web service we developed that retrieves SHA-1 hash values generated independently by StorageGRID when the files are written to the StorageGRID gateway node. Ex Libris developed a Rosetta plug-in that calls this web service and compares the StorageGRID SHA-1 hash values with those in the Rosetta database, which are known to be correct.
  11. Before I go any further with the DRPS data integrity validation chain, I’d like to discuss in some detail how StorageGRID handles fixity checking.First, StorageGRID is constructed around the concept of object storage, which enables it to provide advanced Information Lifecycle Management capabilities.To ensure object data integrity, StorageGRID provides a layered and overlapping set of protection domains that guard against object data corruption and alteration of files that are written to the grid. The first domain is called the SHA-1 object hash—this is the same SHA-1 hash value we just discussed with the previous slide. It is generated when the object (or AIP) is created (i.e., when the gateway node writes it to the first storage node), and it is verified every time the object is stored and accessed.The second domain is called the content hash. Because this hash is not self-contained, it requires external information for verification, and therefore is checked only when the object is accessed. The third domain is a cyclic redundancy check, or CRC, checksum. It is verified during every StorageGRID object operation—store, retrieve, transmit, receive, access, and background verification.And finally, the fourth domain is a key-based hash value. Using the hash key, this domain secures against all forms of tampering. As you see, StorageGRID provides very sophisticated and advanced fixity checking, which is a major reason we selected it for our DRPS storage layer.
  12. Continuing with the DRPS data integrity validation chain . . .. . . StorageGRID uses the four levels of fixity checking we just discussed to ensure integrity of AIPs that are written to the grid—from the gateway node to the storage nodes. Once a file has been correctly written to a storage node, StorageGRID invokes the TSM Client running on the archive node server in order to write the file to tape. As this happens, the SHA-1 fixity information is not handed off to TSM. Rather, TSM end-to-end logical block protection takes over.
  13. TSM end-to-end logical block protection utilizes CRCs and ECCs that supersede the use of SHA-1 fixity information while TSM is in control of the file.This advanced protection is enabled with brand new, state-of-the-art functionality provided by IBM LTO-5 and TS1140 tape drives, which I will soon illustrate.While the DRPS fixity information chain of control is altered when StorageGRID invokes TSM, validation of the file’s data integrity continues seamlessly until it is correctly written to tape using TSM end-to-end logical block protection.
  14. The TSM end-to-end logical block protection process begins when . . . . . . the TSM server calculates and appends a cyclic redundancy check value, or CRC, to each AIP logical block before transferring it to a tape drive for writing. Each appended CRC is called the “original data CRC” for that logical block. When the tape drive receives a logical block, it computes its own CRC for the data and compares it to the original data CRC. If an error is detected, a check condition is generated, forcing a re-drive or a permanent error. This step effectively guarantees protection of the logical block during transfer.
  15. As the logical block is loaded into the drive’s main data buffer, two parallel processes occur. In one process, data is cycled back through an on-the-fly verifier that once again validates the original data CRC. Any introduced error will force a re-drive or a permanent error. In parallel, an error-correcting code, or ECC, is computed and appended to the data. Referred to as the “C1 code,” this ECC protects data integrity of the logical block as it goes through additional formatting steps . . .
  16. . . . including the addition of an additional ECC, referred to as the “C2 code.”As part of these formatting steps, the C1 code is checked every time data is read from the data buffer. Thus, protection of the original data CRC is essentially transformed to protection from the more powerful C1 code.
  17. Finally the data is read from the main data buffer and is written to tape using a read-while-write process. During this process, the just written data is read back from tape and loaded into the main data buffer so the C1 code can be checked once again to verify the written data. A successful read-while-write operation assures that no data corruption has occurred from the time the AIP logical block was transferred from the TSM server until it is written to tape. And using these ECCs and CRCs, the tape drive can validate AIP logical blocks at full line speed as they are being written!
  18. During a read operation, data is read from the tape and all three codes (C1, C2, and the original data CRC) are decoded and checked, and a read error is generated if any process indicates an error. The original data CRC is then appended to the logical block.When the logical block is transferred to the TSM server, the original data CRC is independently verified by that server, thus completing the TSM end-to-end logical block protection cycle. I didn’t mention this previously, but TSM also performs data integrity validation during client sessions when data is sent between a client and the server, or vice versa.
  19. Unfortunately, continuously ensuring data integrity of a DRPS AIP does not end once the AIP has been written correctly to tape. We must assume that bits will flip after being written correctly to tape.As we discussed earlier, the USC Shoah Foundation Institute has seen a 10-14 bit error rate, and we have also when we recently validated our entire tape archive. Therefore, we believe that all written tapes in the archive must be read periodically to find and correct bit flips that have occurred since the tapes were written correctly.
  20. Unfortunately, reading all the tapes in the archive in order to stage AIPs to disk so servers can check the fixity information is clearly a resource intensive task—especially for an archive with a capacity measured in hundreds of petabytes! Fortunately, IBM LTO-5 and TS1140 tape drives provide a much more efficient solution.During a “Verify” operation, IBM LTO-5 and TS1140 drives perform data integrity validation in-drive, which means a drive reads a tape and concurrently checks the three logical block CRCs and ECCs discussed previously at full line speed.Good or bad status is reported as soon as these internal checks are completed. And this is done without requiring any other resources! Clearly, this advanced capability enhances the ability of DRPS to perform periodic data integrity validations of the entire archive more frequently, which will facilitate the correction of bit flips after AIPs are written correctly to tape.
  21. To summarize my presentation, fixity information is the key to archive data integrity.For the Church History Department’s Digital Records Preservation System, SHA-1 fixity values ensure data integrity all the way from the producer to StorageGRID archive nodes.From there, TSM end-to-end logical block protection takes over to ensure data integrity until the data is correctly written to tape.And finally, the new in-drive data integrity validation capability of IBM tape drives enables DRPS to perform periodic data integrity checks of the entire archive to provide continuous data integrity.
  22. Sizing bit errors in your digital archive may be somewhat difficult. I heard from a tape vendor at a recent preservation conference that tape exhibits a 10-19 bit error rate. This rate is optimistic, however, compared to the results we recently encountered when we performed a data integrity validation of our entire DRPS archive. We realized a 3.3x10-14 bit error rate—which is five orders of magnitude higher than the vendor claim!10-19 is also higher than what I encountered when I visited the University of Southern California’s Shoah Foundation Institute in 2009. But first, some background on this Institute. It was established by Steve Spielberg, the great film producer, after he finished filming Schindler’s List. Shoah is the Hebrew term for Holocaust. More than 51,000 interviews of Holocaust survivors and other witness have been videotaped by the Shoah Institute.Currently, 87% of the 204,000+ Betacam SP master tapes have been converted to Motion JPEG 2000 preservation masters. When I visited the Institute in 2009, the tape archive capacity was 8 petabytes.Sam Gustman, the CTO of the Shoah Foundation Institute, told me that his team had encountered 1500 bit flips in those 8 petabytes. This translates to a bit error rate of 2.3x10-14—also five orders of magnitude higher than the vendor claim!I believe these real life measurements provide credible guidance for tape archives.