EAD without XSLT
a practical approach to archival finding aids

Trevor Thornton
Senior Applications Developer, NYPL Labs
The New York Public Library
Project goals
• Enable multiple presentations of
  the same data

• Support dynamic web applications

• Cross-collection search with
  component-level specificity in
  results, and faceting on common
  access points
System overview
Ruby on Rails
+ MySQL
+ Solr

Key functionality:
Data Import
Search index
API
Core models
Collection model

Each collection:
•must have one
description
•may have one or more
components
•may be associated with
one or more access terms
Component model
Each component:
•must belong to one
collection
•must have one description
•may have one parent
component
•may have one or more
child components
•may be associated with
one or more access terms
Component hierarchy attributes
• collection_id (id of root collection)
• parent_id (id of parent component)
• sib_seq (sibling sequence)
• level_num (numeric level within hierarchy)
• level_text (series, sub-series, file, etc.)

• has_children
• max_levels
• top_component_id

  The last three attributes are computed after initial data import; they are
  provided as a convenience for finding aid UIs and to streamline formulation
  of API responses.
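
A minimal sketch of how the computed attributes might be derived once the import has finished. It is written in Python with plain dicts rather than the Rails models the system actually uses, and it rests on assumptions the slides do not confirm: that max_levels is the deepest level_num in a collection, that top_component_id is the id of a component's top-level ancestor (its own id for a top-level component), and that top-level components have a parent_id of None. The function name compute_derived is hypothetical.

```python
from collections import defaultdict


def compute_derived(components):
    """Set has_children and top_component_id on each component; return max_levels.

    `components` is a list of dicts with `id`, `parent_id` (None for a
    top-level component) and `level_num` (1 for a top-level component).
    """
    by_id = {c["id"]: c for c in components}

    # Group child ids under their parent id.
    children = defaultdict(list)
    for c in components:
        if c["parent_id"] is not None:
            children[c["parent_id"]].append(c["id"])

    for c in components:
        c["has_children"] = bool(children[c["id"]])
        # Walk up the parent chain until a top-level component is reached.
        top = c
        while top["parent_id"] is not None:
            top = by_id[top["parent_id"]]
        c["top_component_id"] = top["id"]

    # Assumption: max_levels is the deepest level present in the collection.
    return max((c["level_num"] for c in components), default=0)


components = [
    {"id": 1, "parent_id": None, "level_num": 1},
    {"id": 2, "parent_id": 1, "level_num": 2},
    {"id": 3, "parent_id": 2, "level_num": 3},
    {"id": 4, "parent_id": None, "level_num": 1},
]
max_levels = compute_derived(components)
```

Storing these values up front means a finding aid UI can decide whether to draw an expand control, or how many nesting levels to allow for, without issuing extra queries.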
Description model
Elements of description organized
(roughly) based on ISAD(G):
•Descriptive identity
ISAD(G) 3.1

•Context
ISAD(G) 3.2.1 - 3.2.3

•Acquisition & processing
ISAD(G) 3.2.4, 3.3.2-3.3.3

•Content and structure
ISAD(G) 3.3.1, 3.3.4

•Access and use
ISAD(G) 3.4

•Related material
ISAD(G) 3.5

•Notes
ISAD(G) 3.6
Description model: basic EAD mapping
Description model: JSON format
{
    "unitid": [
       { "value": "3283", "type": "local_mss" }
    ],
    "unittitle": [
       { "value": "David Ames Wells papers" }
    ],
    "unitdate": [
       { "type": "inclusive", "normal": "1847/1895", "value": "1847-1895" }
    ],
    "physdesc_extent": [
       { "value": ".5 linear feet", "unit": "linear feet" },
       { "value": "2 boxes", "unit": "containers" }
    ],
    "abstract": [
       { "value": "David Ames Wells was an engineer, economist, textbook author, and advocate for lower tariff rates. This collection contains correspondence with Gordon L. Ford, Worthington C. Ford, and others; clippings; a manuscript draft of Protection: The Poor Man's Friend; and a lecture Wells delivered on free trade in 1882" }
    ],
    "prefercite": [
       { "value": "<p>David Ames Wells papers, Manuscripts and Archives Division, The New York Public Library</p>" }
    ]
}
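
In the format above, every element maps to a list of objects, each with a "value" and optional attributes such as "type" or "unit". A short Python sketch of reading that shape follows; the helper name values is hypothetical and is not part of the NYPL code.

```python
import json

# A trimmed copy of the description shown above.
description_json = """
{
  "unitid": [{"value": "3283", "type": "local_mss"}],
  "unittitle": [{"value": "David Ames Wells papers"}],
  "unitdate": [{"type": "inclusive", "normal": "1847/1895", "value": "1847-1895"}],
  "physdesc_extent": [
    {"value": ".5 linear feet", "unit": "linear feet"},
    {"value": "2 boxes", "unit": "containers"}
  ]
}
"""


def values(description, element, **attrs):
    """Return the "value" of every entry under `element` whose attributes match `attrs`."""
    return [
        entry["value"]
        for entry in description.get(element, [])
        if all(entry.get(key) == wanted for key, wanted in attrs.items())
    ]


description = json.loads(description_json)
title = values(description, "unittitle")[0]
extents = values(description, "physdesc_extent")
containers = values(description, "physdesc_extent", unit="containers")
missing = values(description, "abstract")  # absent elements yield an empty list
```

Because repeatable elements are always lists, a presentation layer can treat single and multiple values the same way.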
EAD as a guide for data storage
• EAD elements that allow only CDATA are stored as
  plain strings
• EAD elements that require content to be structured in
  <p> or other block elements are stored as HTML
• Rules established for converting EAD to HTML
  when necessary
• HTML conversion designed to support re-conversion
  back to EAD
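
One way a reversible EAD-to-HTML conversion could work is sketched below in Python. The mapping table and the data-ead attribute convention are assumptions made for illustration; the slides do not say which rules the actual system applies. The sketch also assumes the stored HTML is well-formed XML, since it uses the standard library XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical conversion rules: EAD tag name -> HTML tag name.
EAD_TO_HTML = {"p": "p", "emph": "em", "title": "cite", "list": "ul", "item": "li"}
PREFIX = "data-ead-"


def ead_to_html(ead_xml):
    """Rename EAD elements to HTML ones, keeping the originals in data-* attributes."""
    root = ET.fromstring(ead_xml)
    for el in root.iter():
        # Record the EAD tag and its attributes so the step can be reversed.
        new_attrs = {"data-ead": el.tag}
        new_attrs.update({PREFIX + k: v for k, v in el.attrib.items()})
        el.tag = EAD_TO_HTML.get(el.tag, "span")
        el.attrib.clear()
        el.attrib.update(new_attrs)
    return ET.tostring(root, encoding="unicode")


def html_to_ead(html):
    """Restore the original EAD tags and attributes from the data-* attributes."""
    root = ET.fromstring(html)
    for el in root.iter():
        old_attrs = dict(el.attrib)
        el.tag = old_attrs.pop("data-ead")
        el.attrib.clear()
        el.attrib.update(
            {k[len(PREFIX):]: v for k, v in old_attrs.items() if k.startswith(PREFIX)}
        )
    return ET.tostring(root, encoding="unicode")


source = '<p>A draft of <title render="italic">Protection</title>.</p>'
html = ead_to_html(source)
round_trip = html_to_ead(html)
```

Carrying the original tag name on each HTML element is what makes the round trip lossless even when several EAD elements map to the same HTML tag.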
Special handling for dates
• Dates are hard
   o Inclusive dates and bulk dates
   o Multiple date formats
   o Ranges, lists, and combinations of both

• Special data structure for dates:
   o date_statement (original text)
   o inclusive_start / inclusive_end
   o bulk_start / bulk_end
   o keydate (for ordering query response – earliest inclusive date
     or earliest bulk date when present)
   o index_dates (for search faceting – every year included in range/list)
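
The date structure can be sketched in Python as follows. The function names are hypothetical, and the input is assumed to be an EAD-style normalized value such as "1847/1895". The slide's wording for keydate is ambiguous; this sketch assumes it means the earliest bulk date when a bulk range is present, otherwise the earliest inclusive date. Only years are handled.

```python
def parse_normal(normal):
    """Parse a normalized date such as "1847/1895" or "1882" into (start, end) years."""
    parts = normal.split("/")
    return int(parts[0][:4]), int(parts[-1][:4])


def build_date_record(date_statement, inclusive_normal=None, bulk_normal=None):
    record = {
        "date_statement": date_statement,  # original text, kept verbatim
        "inclusive_start": None,
        "inclusive_end": None,
        "bulk_start": None,
        "bulk_end": None,
    }
    years = set()
    for prefix, normal in (("inclusive", inclusive_normal), ("bulk", bulk_normal)):
        if normal:
            start, end = parse_normal(normal)
            record[prefix + "_start"] = start
            record[prefix + "_end"] = end
            years.update(range(start, end + 1))

    # Assumed reading of the slide: prefer the bulk start when there is one.
    if record["bulk_start"] is not None:
        record["keydate"] = record["bulk_start"]
    else:
        record["keydate"] = record["inclusive_start"]

    # Every covered year, so a search facet on any single year matches.
    record["index_dates"] = sorted(years)
    return record


wells = build_date_record("1847-1895", inclusive_normal="1847/1895")
with_bulk = build_date_record(
    "1847-1895, bulk 1860-1880", inclusive_normal="1847/1895", bulk_normal="1860/1880"
)
```

Expanding a range into individual years trades some index size for very simple facet and filter queries.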
Access Term model
Refinement of Access Term / Access Term Association models
Data import
•   It’s a messy business
•   Bulk of work has focused on EAD;
    Nokogiri used extensively for parsing XML
•   Basic process for EAD import:
    1.   Create collection record
    2.   Extract collection-level data,
         create/save description
    3.   Extract access terms, and for each
         a. Save if it doesn’t already exist
         b. Save collection/term association
    4.   Extract top-level components, and for each:
         a. Create component record
         b. Extract component-level data,
            create/save description
         c. Extract/save access terms & associations
         d. Extract child components and repeat for each
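
The import steps above can be sketched as a recursive walk. The real system uses Nokogiri and Rails models; this Python version uses the standard library XML parser and an in-memory dict as a stand-in for the database, assumes EAD 2002 element names without a namespace, and extracts only unittitle. All function names, the sample XML, and its series and subject names are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

COMPONENT_TAG = re.compile(r"^c(0[1-9]|1[0-2])?$")  # <c>, <c01> ... <c12>


def child_components(parent):
    return [el for el in parent if COMPONENT_TAG.match(el.tag)]


def description_of(el):
    title = el.findtext("did/unittitle", default="").strip()
    return {"unittitle": [{"value": title}]}


def save_terms(store, el, owner):
    """Step 3: save each term once, then record the owner/term association."""
    controlaccess = el.find("controlaccess")
    for term_el in (controlaccess if controlaccess is not None else []):
        key = (term_el.tag, (term_el.text or "").strip())
        term_id = store["terms"].setdefault(key, len(store["terms"]) + 1)
        store["associations"].append((owner, term_id))


def import_component(store, el, parent_id, level_num, sib_seq):
    """Step 4: create the component, describe it, save terms, recurse into children."""
    component = {
        "id": len(store["components"]) + 1,
        "collection_id": store["collection"]["id"],
        "parent_id": parent_id,
        "sib_seq": sib_seq,
        "level_num": level_num,
        "level_text": el.get("level"),
        "description": description_of(el),
    }
    store["components"].append(component)
    save_terms(store, el, ("component", component["id"]))
    for seq, child in enumerate(child_components(el), start=1):
        import_component(store, child, component["id"], level_num + 1, seq)


def import_ead(xml_text):
    store = {"collection": None, "components": [], "terms": {}, "associations": []}
    archdesc = ET.fromstring(xml_text).find("archdesc")
    # Steps 1 and 2: collection record plus its description.
    store["collection"] = {"id": 1, "description": description_of(archdesc)}
    save_terms(store, archdesc, ("collection", 1))
    dsc = archdesc.find("dsc")
    if dsc is not None:
        for seq, c in enumerate(child_components(dsc), start=1):
            import_component(store, c, None, 1, seq)
    return store


SAMPLE = """<ead><archdesc level="collection">
  <did><unittitle>David Ames Wells papers</unittitle></did>
  <controlaccess><subject>Example subject</subject></controlaccess>
  <dsc>
    <c01 level="series"><did><unittitle>Series A</unittitle></did>
      <controlaccess><subject>Example subject</subject></controlaccess>
      <c02 level="file"><did><unittitle>File A1</unittitle></did></c02>
    </c01>
    <c01 level="series"><did><unittitle>Series B</unittitle></did></c01>
  </dsc>
</archdesc></ead>"""

store = import_ead(SAMPLE)
```

In the sample, the subject appears at both collection and component level but is saved only once, with two associations pointing at it.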
Integration with NYPL digital repository
• Fedora repository
  + custom metadata creation/digitization workflow system
  + API to query repository data
• All records in repository identified with UUID

• UUID of digital object associated with a given component
  is stored locally in archives data system
• Best case scenario: common identifiers appear in
  archival description and in Fedora
Apache Solr
• Inter- and intra-collection search

• Collocation via faceting and filter queries

• Using RSolr to facilitate interaction with Solr
  (for both search and index)
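
A rough illustration of how faceting and filter queries combine in a Solr request. The system itself talks to Solr through RSolr; this Python sketch only builds the query string using standard Solr parameters (q, fq, facet, facet.field, wt). The field names, the core name "archives", and the host are hypothetical, and special characters inside filter values are not escaped here.

```python
from urllib.parse import urlencode


def build_search_params(query, collection_id=None, facet_fields=("subject", "name"), filters=None):
    """Return Solr request parameters as a list of (name, value) pairs."""
    params = [("q", query), ("wt", "json"), ("facet", "true")]
    for field in facet_fields:
        params.append(("facet.field", field))
    # Intra-collection search: restrict results to one collection.
    if collection_id is not None:
        params.append(("fq", "collection_id:%s" % collection_id))
    # Each facet value the user has selected becomes another filter query.
    for field, value in (filters or {}).items():
        params.append(("fq", '%s:"%s"' % (field, value)))
    return params


params = build_search_params("tariff", collection_id=42, filters={"subject": "Example subject"})
url = "http://localhost:8983/solr/archives/select?" + urlencode(params)
```

Leaving out collection_id gives the cross-collection case; the same facet fields then collocate matching components from every collection.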
API
• API development is proceeding in step with finding aid
  development – available requests added as needed
• Basic requests:
   o Collection-level data
   o Components of a collection,
     or sub-components of a component
      o Includes all component-level descriptive data
      o Max. depth can be specified
   o Digital assets associated with
     a component
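
The components request with a maximum depth could be assembled along these lines. This is a Python sketch over a flat list of component dicts; the response shape and the function name component_tree are assumptions, not the API's actual format, and depth is counted from the requested starting point.

```python
def component_tree(components, parent_id=None, max_depth=None, _depth=1):
    """Nest the children of `parent_id`, ordered by sib_seq, down to `max_depth` levels."""
    if max_depth is not None and _depth > max_depth:
        return []
    siblings = sorted(
        (c for c in components if c["parent_id"] == parent_id),
        key=lambda c: c["sib_seq"],
    )
    return [
        {
            "id": c["id"],
            "title": c["title"],
            "components": component_tree(components, c["id"], max_depth, _depth + 1),
        }
        for c in siblings
    ]


# Placeholder records; titles are illustrative only.
flat = [
    {"id": 3, "parent_id": None, "sib_seq": 2, "title": "Series B"},
    {"id": 1, "parent_id": None, "sib_seq": 1, "title": "Series A"},
    {"id": 2, "parent_id": 1, "sib_seq": 1, "title": "File A1"},
]

full = component_tree(flat)                      # components of the collection
shallow = component_tree(flat, max_depth=1)      # top-level components only
under_a = component_tree(flat, parent_id=1)      # sub-components of component 1
```

Passing a component id as parent_id covers the "sub-components of a component" request with the same function.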
Finding aid prototype
Finding aid prototype
Front-end system overview
Considerations for future development
• Separate API from data management?
  o Data management app to handle all create/update/destroy
    operations, while API (Sinatra?) is read-only
  o Open API to public? Security/load considerations…

• ArchivesSpace
  o NYPL is considering it as a possible replacement for
    our existing ‘home-grown’ system
  o How would this system integrate with ArchivesSpace API?

• Upcoming EAD revision
some code to look at and/or borrow from:
github.com/nypl/archives_data_public

finding aid prototype:
archives.nypl.org

me:
trevorthornton@nypl.org

NYPL Labs:
nypl.org/labs
