SlideShare ist ein Scribd-Unternehmen logo
1 von 26
Downloaden Sie, um offline zu lesen
SCAPE

Evaluation of format
identification tools
Johan van der Knijff
Koninklijke Bibliotheek – National Library of the Netherlands
johan.vanderknijff@kb.nl


The Future of File Format Identification
DPC / The National Archives, Kew, London, 28.11.2011
SCAPE
Background & context
SCAPE
                  About SCAPE

SCAlable Preservation Environments

EU funded FP7 project; 16 partners

Scalable services for preservation and preservation
  planning

Semi-automated workflows for large-scale,
  heterogeneous collections of complex digital objects
SCAPE
    Importance of identification

Object         Automated          Alert
                 watch



Object                            Object
                migration      (new format)




Object                           Emulated
                emulation       performance




         What is it?        Identification!
SCAPE
        Evaluation of identification tools

Which tools suitable for SCAPE architecture?

Specific strengths/weaknesses

Decide on needed enhancements and modifications

Hopefully provide some useful input to developers as
  well!
SCAPE
                          Tools

DROID 6.0

FIDO 0.93/0.95

Unix File Utility 5.0.3

FITS 0.5 (uses DROID 3.0)

JHOVE2 (uses DROID 4.0)
SCAPE
                Evaluation framework
Total of 22 criteria, broadly covering:

   Usability in automated workflow (interface, dependencies)

   Fit to requirements archival setting: format coverage,
      extendibility, accuracy

   Output: format, identifiers, granularity

   User documentation

   Performance, stability and error handling
SCAPE
                Key principles




Hands-on testing
using real data is
  essential!
SCAPE
Key principles



          Inform tool developers
             on results, and give
             them opportunity to
             provide feedback
SCAPE


Performance tests
SCAPE
     Test data: KB Scientific Journals set




N      size(min) size(median) size(max)   Total size
11,892 11        4,737        25,495,289 1.15 GB
SCAPE
One file per tool invocation



tool     File 1     Output

tool     File 2     Output




tool     File n     Output
SCAPE
   Many files per tool invocation



               File 1

               File 2

tool                            Output



               File n
SCAPE
One-file per invocation
SCAPE
One-file per invocation
SCAPE
One-file per invocation
SCAPE
      Comparison: one-file per invocation




File FIDO     FITS JHOVE2 DROID
SCAPE
     Comparison: many-files per invocation



File DROID FIDO
                                     JHOVE2
SCAPE
          Performance: main conclusions
All tested Java-based tools slow for one-file-per-
   invocation use case

Performance much better for many-files-per-invocation
  use case

Slow initialisation seems to be main culprit
   - Actual processing time per file: milliseconds
   - Tool initialisation time: several seconds!
SCAPE
           So is this really a problem?
Depends on required throughput

Depends on workflow interface (command line or Java
  API)

Depends on organisation of workflow

Depends on purpose (e.g. pre-ingest vs profiling of large
  file collections)
SCAPE
              Apples vs oranges

                          FITS, JHOVE2: wrappers;
                             also feature extraction
                             and validation




DROID 6, FIDO: recurse
  into ZIP files ; File
  doesn’t!
SCAPE
Miscellaneous observations
SCAPE
               Other observations
Signature-based identification doesn’t work too well for
   text-based formats (including XML)

File outperforms other tools on format coverage and
   performance; management of signatures (‘magic’ file)
   awkward

DROID 6 output handling clumsy in automated
  workflows (separate DROID invocation needed for
  exporting profile information!)
SCAPE
          Response to this work so far

FIDO: version 0.9.6 released in October; fixes most
  reported issues

FITS: version 0.6 released in October; various
   enhancements based on outcome of evaluation

DROID, JHOVE2: both provided feedback and will
  consider test results for upcoming releases
SCAPE
                Possible next steps

Improve evaluation of accuracy

Keep up with tool updates; keep this work up-to-date

Publish all used scripts and detailed description of
  analysis methods so others can contribute more
  easily

Use publicly available test corpus (e.g. Govdocs1)
SCAPE
                 Link to full report on OPF blog:
www.openplanetsfoundation.org/blogs/2011-09-21-evaluation-identification-tools-first-results-scape




                          More about SCAPE:
                   http://www.scape-project.eu

                                         #SCAPEProject

Weitere ähnliche Inhalte

Andere mochten auch

Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Project
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE Project
 

Andere mochten auch (10)

Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
SCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with NaniteSCAPE Information Day at BL - Characterising content in web archives with Nanite
SCAPE Information Day at BL - Characterising content in web archives with Nanite
 
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 

Ähnlich wie Evaluation of format identification tools

Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, pleaseAndy Jackson
 
OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)Jooho Lee
 
FusionInventory at LSM/RMLL 2012
FusionInventory at LSM/RMLL 2012FusionInventory at LSM/RMLL 2012
FusionInventory at LSM/RMLL 2012Nouh Walid
 
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...JISC KeepIt project
 
Webinar: SpagoBI Suite
Webinar: SpagoBI SuiteWebinar: SpagoBI Suite
Webinar: SpagoBI SuiteSpagoWorld
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesNicola Ferraro
 
What is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP EcosystemWhat is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP Ecosystemsparkfabrik
 
OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)Jooho Lee
 
Code, ci, infrastructure - the gophers way
Code, ci, infrastructure - the gophers wayCode, ci, infrastructure - the gophers way
Code, ci, infrastructure - the gophers wayAlex Baitov
 
OWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionOWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionSergey Sotnikov
 
WORKS 11 Presentation
WORKS 11 PresentationWORKS 11 Presentation
WORKS 11 Presentationdgarijo
 
Linux/Unix Night - (PEN) Testing Toolkits (English)
Linux/Unix Night - (PEN) Testing Toolkits (English)Linux/Unix Night - (PEN) Testing Toolkits (English)
Linux/Unix Night - (PEN) Testing Toolkits (English)Jelmer de Reus
 
Why documentation osidays
Why documentation osidaysWhy documentation osidays
Why documentation osidaysBastian Feder
 
Decoder Open Research Webinar
Decoder Open Research WebinarDecoder Open Research Webinar
Decoder Open Research WebinarDecoder Project
 
Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...
Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...
Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...Amplexor
 
ImageJ and the SciJava software stack
ImageJ and the SciJava software stackImageJ and the SciJava software stack
ImageJ and the SciJava software stackCurtis Rueden
 
aleph - Malware analysis pipelining for the masses
aleph - Malware analysis pipelining for the massesaleph - Malware analysis pipelining for the masses
aleph - Malware analysis pipelining for the massesJan Seidl
 

Ähnlich wie Evaluation of format identification tools (20)

Unified characterisation, please
Unified characterisation, pleaseUnified characterisation, please
Unified characterisation, please
 
OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)
 
FusionInventory at LSM/RMLL 2012
FusionInventory at LSM/RMLL 2012FusionInventory at LSM/RMLL 2012
FusionInventory at LSM/RMLL 2012
 
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
Supporting Significant Properties in a Working Archive (SPs part 5), by Steph...
 
Webinar: SpagoBI Suite
Webinar: SpagoBI SuiteWebinar: SpagoBI Suite
Webinar: SpagoBI Suite
 
Extending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with KubernetesExtending DevOps to Big Data Applications with Kubernetes
Extending DevOps to Big Data Applications with Kubernetes
 
Resume_Shanthi
Resume_ShanthiResume_Shanthi
Resume_Shanthi
 
What is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP EcosystemWhat is the Secure Supply Chain and the Current State of the PHP Ecosystem
What is the Secure Supply Chain and the Current State of the PHP Ecosystem
 
OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)OpenSCAP Overview(security scanning for docker image and container)
OpenSCAP Overview(security scanning for docker image and container)
 
Code, ci, infrastructure - the gophers way
Code, ci, infrastructure - the gophers wayCode, ci, infrastructure - the gophers way
Code, ci, infrastructure - the gophers way
 
OWASP Dependency-Track Introduction
OWASP Dependency-Track IntroductionOWASP Dependency-Track Introduction
OWASP Dependency-Track Introduction
 
WORKS 11 Presentation
WORKS 11 PresentationWORKS 11 Presentation
WORKS 11 Presentation
 
Linux/Unix Night - (PEN) Testing Toolkits (English)
Linux/Unix Night - (PEN) Testing Toolkits (English)Linux/Unix Night - (PEN) Testing Toolkits (English)
Linux/Unix Night - (PEN) Testing Toolkits (English)
 
Why documentation osidays
Why documentation osidaysWhy documentation osidays
Why documentation osidays
 
Koha vs abcd
Koha vs abcdKoha vs abcd
Koha vs abcd
 
Open64 compiler
Open64 compilerOpen64 compiler
Open64 compiler
 
Decoder Open Research Webinar
Decoder Open Research WebinarDecoder Open Research Webinar
Decoder Open Research Webinar
 
Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...
Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...
Amplexor Drupal for the Enterprise seminar - evaluating Drupal for the Enterp...
 
ImageJ and the SciJava software stack
ImageJ and the SciJava software stackImageJ and the SciJava software stack
ImageJ and the SciJava software stack
 
aleph - Malware analysis pipelining for the masses
aleph - Malware analysis pipelining for the massesaleph - Malware analysis pipelining for the masses
aleph - Malware analysis pipelining for the masses
 

Mehr von SCAPE Project

Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...SCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulationSCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000SCAPE Project
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation SCAPE Project
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPESCAPE Project
 
Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...SCAPE Project
 
Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...SCAPE Project
 

Mehr von SCAPE Project (16)

Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPE
 
Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...
 
Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...
 

Kürzlich hochgeladen

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Strongerpanagenda
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfIngrid Airi González
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 

Kürzlich hochgeladen (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better StrongerModern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
Modern Roaming for Notes and Nomad – Cheaper Faster Better Stronger
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Generative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdfGenerative Artificial Intelligence: How generative AI works.pdf
Generative Artificial Intelligence: How generative AI works.pdf
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 

Evaluation of format identification tools

  • 1. SCAPE Evaluation of format identification tools Johan van der Knijff Koninklijke Bibliotheek – National Library of the Netherlands johan.vanderknijff@kb.nl The Future of File Format Identification DPC / The National Archives, Kew, London, 28.11.2011
  • 3. SCAPE About SCAPE SCAlable Preservation Environments EU funded FP7 project; 16 partners Scalable services for preservation and preservation planning Semi-automated workflows for large-scale, heterogeneous collections of complex digital objects
  • 4. SCAPE Importance of identification Object Automated Alert watch Object Object migration (new format) Object Emulated emulation performance What is it? Identification!
  • 5. SCAPE Evaluation of identification tools Which tools suitable for SCAPE architecture? Specific strengths/weaknesses Decide on needed enhancements and modifications Hopefully provide some useful input to developers as well!
  • 6. SCAPE Tools DROID 6.0 FIDO 0.93/0.95 Unix File Utility 5.0.3 FITS 0.5 (uses DROID 3.0) JHOVE2 (uses DROID 4.0)
  • 7. SCAPE Evaluation framework Total of 22 criteria, broadly covering: Usability in automated workflow (interface, dependencies) Fit to requirements archival setting: format coverage, extendibility, accuracy Output: format, identifiers, granularity User documentation Performance, stability and error handling
  • 8. SCAPE Key principles Hands-on testing using real data is essential!
  • 9. SCAPE Key principles Inform tool developers on results, and give them opportunity to provide feedback
  • 11. SCAPE Test data: KB Scientific Journals set N size(min) size(median) size(max) Total size 11,892 11 4,737 25,495,289 1.15 GB
  • 12. SCAPE One file per tool invocation tool File 1 Output tool File 2 Output tool File n Output
  • 13. SCAPE Many files per tool invocation File 1 File 2 tool Output File n
  • 17. SCAPE Comparison: one-file per invocation File FIDO FITS JHOVE2 DROID
  • 18. SCAPE Comparison: many-files per invocation File DROID FIDO JHOVE2
  • 19. SCAPE Performance: main conclusions All tested Java-based tools slow for one-file-per- invocation use case Performance much better for many-files-per-invocation use case Slow initialisation seems to be main culprit - Actual processing time per file: milliseconds - Tool initialisation time: several seconds!
  • 20. SCAPE So is this really a problem? Depends on required throughput Depends on workflow interface (command line or Java API) Depends on organisation of workflow Depends on purpose (e.g. pre-ingest vs profiling of large file collections)
  • 21. SCAPE Apples vs oranges FITS, JHOVE2: wrappers; also feature extraction and validation DROID 6, FIDO: recurse into ZIP files ; File doesn’t!
  • 23. SCAPE Other observations Signature-based identification doesn’t work too well for text-based formats (including XML) File outperforms other tools on format coverage and performance; management of signatures (‘magic’ file) awkward DROID 6 output handling clumsy in automated workflows (separate DROID invocation needed for exporting profile information!)
  • 24. SCAPE Response to this work so far FIDO: version 0.9.6 released in October; fixes most reported issues FITS: version 0.6 released in October; various enhancements based on outcome of evaluation DROID, JHOVE2: both provided feedback and will consider test results for upcoming releases
  • 25. SCAPE Possible next steps Improve evaluation of accuracy Keep up with tool updates; keep this work up-to-date Publish all used scripts and detailed description of analysis methods so others can contribute more easily Use publicly available test corpus (e.g. Govdocs1)
  • 26. SCAPE Link to full report on OPF blog: www.openplanetsfoundation.org/blogs/2011-09-21-evaluation-identification-tools-first-results-scape More about SCAPE: http://www.scape-project.eu #SCAPEProject