SlideShare ist ein Scribd-Unternehmen logo
1 von 7
Downloaden Sie, um offline zu lesen
Characterising content in web archives with Nanite 
William Palmer SCAPE Information Day British Library, UK, 14th July 2014
•When web sites are harvested they are stored in a container format 
•The main web archive container formats are ARC and WARC (an ISO standard) 
•They are effectively analogous to a zip file 
2 
Web Archives 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
WARC Container
•Web archives can hold billions of individual records 
•To answer deeper questions you have to determine what data is held 
•Not the same as a homogenous collection of images 
•Can contain everything and anything 
•Correctly formed files 
•Malformed files 
•Viruses 
•Unknown files? 
•You name it 
3 
Characterisation 
This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). 
? 
? 
? 
? 
JPG 
GIF 
TXT 
XLS
Nanite 
•Nanite is formed of two main modules 
•nanite-core: a Java API for the UK National Archives’ Droid 
•nanite-hadoop: WARC content characterisation using Hadoop 
•Apache Tika (Detector), Nanite-core & libmagic-jni (‘file’) 
•Optionally use Tika (Parsers); data output to sequence files 
•Also list server content type & file extension 
•Reuses: warc-hadoop-recordreaders (partially SCAPE)
Speed 
•Fast: for 1TB, 14k warcs, 93m files; mimetypes detected in 17 hours 
•Nanite has also been used at the Danish State and University Library 
•7.3TB data, 80k ARC files, 261m files 
•Identification using Droid and Tika 
•Characterisation using Tika 
•…in 32 hours 
•Same platform but using FITS (not using Hadoop, but parallelised): 
•12TB data, 100k ARC files, 400m files 
•An entire year of processing (8760 hours) 
Map 
Tika Identify 
Nanite/ Droid 
Libmagic 
Tika Parser
Stats 
•1370 different MIME types reported by the original servers 
•Tika detected 342 
•DROID detected 319 
•Additional information in this blog post: http://www.openplanetsfoundation.org/blogs/2014- 05-28-weekend-nanite
Visualising Characterisation Information 
•Nanite has an option to output C3PO compatible outputs 
•They can be directly loaded into C3PO for visualisation

Weitere ähnliche Inhalte

Was ist angesagt?

Basic Analytic Techniques - Using R Tool - Part 1
Basic Analytic Techniques - Using R Tool - Part 1Basic Analytic Techniques - Using R Tool - Part 1
Basic Analytic Techniques - Using R Tool - Part 1Beamsync
 
Expert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive MetastoreExpert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive MetastorelakeFS
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesKamesh Pemmaraju
 
DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014Repository Fringe
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and PotentialDaniel Gomes
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataRob Gardner
 
Archivematica integration handshaking towards comprehensive digital preserva...
Archivematica integration  handshaking towards comprehensive digital preserva...Archivematica integration  handshaking towards comprehensive digital preserva...
Archivematica integration handshaking towards comprehensive digital preserva...Artefactual Systems - Archivematica
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryBiblioteca Nacional de España
 
LoCloud - Local content in a Europeana cloud
LoCloud - Local content in a Europeana cloudLoCloud - Local content in a Europeana cloud
LoCloud - Local content in a Europeana cloudEuropeana
 

Was ist angesagt? (13)

Basic Analytic Techniques - Using R Tool - Part 1
Basic Analytic Techniques - Using R Tool - Part 1Basic Analytic Techniques - Using R Tool - Part 1
Basic Analytic Techniques - Using R Tool - Part 1
 
Expert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive MetastoreExpert Roundtable: The Future of Metadata After Hive Metastore
Expert Roundtable: The Future of Metadata After Hive Metastore
 
HDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the CloudHDFCloud Workshop: HDF5 in the Cloud
HDFCloud Workshop: HDF5 in the Cloud
 
New Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference ArchitecturesNew Ceph capabilities and Reference Architectures
New Ceph capabilities and Reference Architectures
 
DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014DSpace Update from Open Repositories 2014
DSpace Update from Open Repositories 2014
 
Hierarchical Data Formats (HDF) Update
Hierarchical Data Formats (HDF) UpdateHierarchical Data Formats (HDF) Update
Hierarchical Data Formats (HDF) Update
 
Web Archiving – Lessons and Potential
 Web Archiving – Lessons and Potential Web Archiving – Lessons and Potential
Web Archiving – Lessons and Potential
 
Using Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider DataUsing Ceph for Large Hadron Collider Data
Using Ceph for Large Hadron Collider Data
 
Archivematica integration handshaking towards comprehensive digital preserva...
Archivematica integration  handshaking towards comprehensive digital preserva...Archivematica integration  handshaking towards comprehensive digital preserva...
Archivematica integration handshaking towards comprehensive digital preserva...
 
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara AubryArchiving the French Web: the BnF web archiving workflow. Sara Aubry
Archiving the French Web: the BnF web archiving workflow. Sara Aubry
 
Report: Archivematica hosting in the cloud
Report: Archivematica hosting in the cloudReport: Archivematica hosting in the cloud
Report: Archivematica hosting in the cloud
 
LoCloud - Local content in a Europeana cloud
LoCloud - Local content in a Europeana cloudLoCloud - Local content in a Europeana cloud
LoCloud - Local content in a Europeana cloud
 
Summit2013 eventos onto quad
Summit2013   eventos onto quadSummit2013   eventos onto quad
Summit2013 eventos onto quad
 

Andere mochten auch

SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Project
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...SCAPE Project
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsSCAPE Project
 
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...SCAPE Project
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusSCAPE Project
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation WatchSCAPE Project
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Project
 
Policy levels in SCAPE
Policy levels in SCAPEPolicy levels in SCAPE
Policy levels in SCAPESCAPE Project
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsSCAPE Project
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE Project
 
Evaluation of format identification tools
Evaluation of format identification toolsEvaluation of format identification tools
Evaluation of format identification toolsSCAPE Project
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbSCAPE Project
 
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012SCAPE Project
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE Project
 

Andere mochten auch (15)

SCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs AvailableSCAPE Information Day at BL - Some of the SCAPE Outputs Available
SCAPE Information Day at BL - Some of the SCAPE Outputs Available
 
Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...Hadoop and its applications at the State and University Library, SCAPE Inform...
Hadoop and its applications at the State and University Library, SCAPE Inform...
 
An image based approach for content analysis in document collections
An image based approach for content analysis in document collectionsAn image based approach for content analysis in document collections
An image based approach for content analysis in document collections
 
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
Large scale preservation workflows with Taverna – SCAPE Training event, Guima...
 
Preservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, AarhusPreservation Policy in SCAPE - Training, Aarhus
Preservation Policy in SCAPE - Training, Aarhus
 
Automatic Preservation Watch
Automatic Preservation WatchAutomatic Preservation Watch
Automatic Preservation Watch
 
SCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation ToolSCAPE Information day at BL - Flint, a Format and File Validation Tool
SCAPE Information day at BL - Flint, a Format and File Validation Tool
 
Policy levels in SCAPE
Policy levels in SCAPEPolicy levels in SCAPE
Policy levels in SCAPE
 
Scape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation EnvironmentsScape project presentation - Scalable Preservation Environments
Scape project presentation - Scalable Preservation Environments
 
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
SCAPE – Scalable Preservation Environments, SCAPE Information Day, 25 June 20...
 
Evaluation of format identification tools
Evaluation of format identification toolsEvaluation of format identification tools
Evaluation of format identification tools
 
LIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven SchlarbLIBER Satellite Event, SCAPE by Sven Schlarb
LIBER Satellite Event, SCAPE by Sven Schlarb
 
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
Digital Preservation - The Saga Continues - SCAPE Training event, Guimarães 2012
 
C sz z6
C sz z6C sz z6
C sz z6
 
SCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation EnvironmentsSCAPE - Scalable Preservation Environments
SCAPE - Scalable Preservation Environments
 

Ähnlich wie SCAPE Information Day at BL - Characterising content in web archives with Nanite

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSven Schlarb
 
Application scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National LibraryApplication scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National LibrarySven Schlarb
 
From Box to Hydra via Archivematica
From Box to Hydra via ArchivematicaFrom Box to Hydra via Archivematica
From Box to Hydra via ArchivematicaJisc RDM
 
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...dwig
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebPascal-Nicolas Becker
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Project
 
The Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital ArchivingThe Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital ArchivingSven Schlarb
 
Openstack – An introduction
Openstack – An introductionOpenstack – An introduction
Openstack – An introductionMuddassir Nazir
 
E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016Sven Schlarb
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...SCAPE Project
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File Systemelliando dias
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Future Perfect 2012
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on DockerDataWorks Summit
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaVangelis Banos
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologiesgagravarr
 

Ähnlich wie SCAPE Information Day at BL - Characterising content in web archives with Nanite (20)

SCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/BelgiumSCAPE Presentation at the Elag2013 conference in Gent/Belgium
SCAPE Presentation at the Elag2013 conference in Gent/Belgium
 
Application scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National LibraryApplication scenarios of the SCAPE project at the Austrian National Library
Application scenarios of the SCAPE project at the Austrian National Library
 
Internet content as research data
Internet content as research dataInternet content as research data
Internet content as research data
 
From Box to Hydra via Archivematica
From Box to Hydra via ArchivematicaFrom Box to Hydra via Archivematica
From Box to Hydra via Archivematica
 
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
Something That Works: Implementing ResourceSpace Open Source Digital Asset Ma...
 
I say emulate
I say emulateI say emulate
I say emulate
 
SWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic WebSWIB14 Weaving repository contents into the Semantic Web
SWIB14 Weaving repository contents into the Semantic Web
 
SCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with HadoopSCAPE Information Day at BL - Large Scale Processing with Hadoop
SCAPE Information Day at BL - Large Scale Processing with Hadoop
 
The Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital ArchivingThe Use of Big Data Techniques for Digital Archiving
The Use of Big Data Techniques for Digital Archiving
 
Openstack – An introduction
Openstack – An introductionOpenstack – An introduction
Openstack – An introduction
 
E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016E-ARK-iPRES2016-Bern-October-2016
E-ARK-iPRES2016-Bern-October-2016
 
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
Integrating the Fedora based DOMS repository with Hadoop, SCAPE Information D...
 
Hadoop Distributed File System
Hadoop Distributed File SystemHadoop Distributed File System
Hadoop Distributed File System
 
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
Kris Carpenter Negulescu Gordon Paynter Archiving the National Web of New Zea...
 
Lessons learned from running Spark on Docker
Lessons learned from running Spark on DockerLessons learned from running Spark on Docker
Lessons learned from running Spark on Docker
 
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with EuropeanaThe Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
The Hellenic Aggregator - Overview, procedures & the cooperation with Europeana
 
Digital Preservation with Archivematica: An Introduction
Digital Preservation with Archivematica: An IntroductionDigital Preservation with Archivematica: An Introduction
Digital Preservation with Archivematica: An Introduction
 
NISO Two-Part Webinar: Sustainable Information Part 2: Digital Preservation o...
NISO Two-Part Webinar: Sustainable Information Part 2: Digital Preservation o...NISO Two-Part Webinar: Sustainable Information Part 2: Digital Preservation o...
NISO Two-Part Webinar: Sustainable Information Part 2: Digital Preservation o...
 
Apache Content Technologies
Apache Content TechnologiesApache Content Technologies
Apache Content Technologies
 
Hadoop training in bangalore
Hadoop training in bangaloreHadoop training in bangalore
Hadoop training in bangalore
 

Mehr von SCAPE Project

Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...SCAPE Project
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Project
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...SCAPE Project
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014SCAPE Project
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3POSCAPE Project
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulationSCAPE Project
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE Project
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalitySCAPE Project
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000SCAPE Project
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation WorkflowsSCAPE Project
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation SCAPE Project
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPESCAPE Project
 
Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...SCAPE Project
 
Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...SCAPE Project
 

Mehr von SCAPE Project (14)

Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
Scape information day at BL - Using Jpylyzer and Schematron for validating JP...
 
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositoriesSCAPE Webinar: Tools for uncovering preservation risks in large repositories
SCAPE Webinar: Tools for uncovering preservation risks in large repositories
 
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
Policy driven validation of JPEG 2000 files based on Jpylyzer, SCAPE Informat...
 
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
Migration of audio files using Hadoop, SCAPE Information Day, 25 June 2014
 
Content profiling and C3PO
Content profiling and C3POContent profiling and C3PO
Content profiling and C3PO
 
Control policy formulation
Control policy formulationControl policy formulation
Control policy formulation
 
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
SCAPE - Skalierbare Langzeitarchivierung (SCAPE - scalable longterm digital p...
 
TAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionalityTAVERNA Components - Semantically annotated and sharable units of functionality
TAVERNA Components - Semantically annotated and sharable units of functionality
 
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000PDF/A-3 for preservation. Notes on embedded files and JPEG2000
PDF/A-3 for preservation. Notes on embedded files and JPEG2000
 
Scalable Preservation Workflows
Scalable Preservation WorkflowsScalable Preservation Workflows
Scalable Preservation Workflows
 
Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation Quality assurance for document image collections in digital preservation
Quality assurance for document image collections in digital preservation
 
Digital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPEDigital Preservation Policies - SCAPE
Digital Preservation Policies - SCAPE
 
Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...Matchbox tool. Quality control for digital collections – SCAPE Training event...
Matchbox tool. Quality control for digital collections – SCAPE Training event...
 
Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...Characterisation - 101. An introduction to the identification and characteris...
Characterisation - 101. An introduction to the identification and characteris...
 

Kürzlich hochgeladen

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 

Kürzlich hochgeladen (20)

FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 

SCAPE Information Day at BL - Characterising content in web archives with Nanite

  • 1. Characterising content in web archives with Nanite William Palmer SCAPE Information Day British Library, UK, 14th July 2014
  • 2. •When web sites are harvested they are stored in a container format •The main web archive container formats are ARC and WARC (an ISO standard) •They are effectively analogous to a zip file 2 Web Archives This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). WARC Container
  • 3. •Web archives can hold billions of individual records •To answer deeper questions you have to determine what data is held •Not the same as a homogenous collection of images •Can contain everything and anything •Correctly formed files •Malformed files •Viruses •Unknown files? •You name it 3 Characterisation This work was partially supported by the SCAPE Project. The SCAPE project is co‐funded by the European Union under FP7 ICT‐2009.4.1 (Grant Agreement number 270137). ? ? ? ? JPG GIF TXT XLS
  • 4. Nanite •Nanite is formed of two main modules •nanite-core: a Java API for the UK National Archives’ Droid •nanite-hadoop: WARC content characterisation using Hadoop •Apache Tika (Detector), Nanite-core & libmagic-jni (‘file’) •Optionally use Tika (Parsers); data output to sequence files •Also list server content type & file extension •Reuses: warc-hadoop-recordreaders (partially SCAPE)
  • 5. Speed •Fast: for 1TB, 14k warcs, 93m files; mimetypes detected in 17 hours •Nanite has also been used at the Danish State and University Library •7.3TB data, 80k ARC files, 261m files •Identification using Droid and Tika •Characterisation using Tika •…in 32 hours •Same platform but using FITS (not using Hadoop, but parallelised): •12TB data, 100k ARC files, 400m files •An entire year of processing (8760 hours) Map Tika Identify Nanite/ Droid Libmagic Tika Parser
  • 6. Stats •1370 different MIME types reported by the original servers •Tika detected 342 •DROID detected 319 •Additional information in this blog post: http://www.openplanetsfoundation.org/blogs/2014- 05-28-weekend-nanite
  • 7. Visualising Characterisation Information •Nanite has an option to output C3PO compatible outputs •They can be directly loaded into C3PO for visualisation