Technologies For
Appraising and
Managing Electronic
Records
Presented by: Peter Bajcsy
-Research Scientist at NCSA
-Associate Director of I-CHASS, I3 Institute
-Adjunct Assistant Professor, CS & ECE, UIUC

National Center for Supercomputing Applications
University of Illinois at Urbana-Champaign
Acknowledgement

   • This research was partially supported by a National
     Archives and Records Administration (NARA) supplement
     to NSF PACI cooperative agreement CA #SCI-9619019
     and NCSA Industrial Partners.
   • The views and conclusions contained in this document
     are those of the authors and should not be interpreted as
     representing the official policies, either expressed or
     implied, of the National Archives and Records
     Administration, or the U.S. government.
   • Contributions by: Peter Bajcsy, Kenton McHenry, Rob
     Kooper, Michal Ondrejcek, Jason Kastner, William
     McFadden, and Sang-Chul Lee


Imaginations unbound
Outline

• Introduction
• A discovery of relationships among digital
  file collections (file2learn)
• A comprehensive comparison of
  contemporary documents (doc2learn)
• Automated file format conversions and
  conversion quality assessment (Polyglot)
• Summary
Introduction
Supporting NARA’s Strategic Plan
• According to The Strategic Plan of The
  National Archives and Records
  Administration 2006–2016, “Preserving the
  Past to Protect the Future”:
  • “Strategic Goal: We will preserve and
    process records to ensure access by the
    public as soon as legally possible”
     • “D. We will improve the efficiency with
       which we manage our holdings from
       the time they are scheduled through
       accessioning, processing, storage,
       preservation, and public use.”
To Be Preserved!
[Diagram: AGENCY → Information transfer? → ARCHIVES (digital representation of information & knowledge; preservation)]

Do We Know the Answers?

Questions During Appraisal of Electronic
  Records Series
• (1) Given M full DVDs with files, which
  files are related?
• (2) Given N versions of the ‘same’ file,
  which file version(s) should be
  preserved?
Do We Know the Answers?
• (3) Given P file formats, which file format
  to use and which conversion software
  to use so that files remain viewable
  in the long run?
  • How much information is lost during file format
    conversion?
• (4) What is the granularity of
  information that one should preserve
  about a decision process in order to
  reconstruct it?
Goal: Design Technologies for Appraising
and Managing Electronic Records
• Technologies should address the following
  problems:
   • (1) a discovery of relationships among
     digital file collections (file2learn)
   • (2) a comprehensive comparison of
     contemporary documents (doc2learn)
   • (3) automated file format conversions and
     conversion quality assessment (Polyglot)
A Discovery of Relationships
 Among Digital File Collections
Discovering Relationships Among Files

• How should one establish relationships
  among electronic records coming
      • From disparate sources or
      • From the same source at multiple time
        instances?


• Need to Understand the Complexity of the
  Problem


Discovering Relationships Among Files:
  Components
• Metadata describing electronic records
    • How to extract metadata?
    • How to automate metadata extraction from multiple data
      types, e.g., 2D drawings and 3D CAD models?
• Storage of metadata
    • What ontology to use to represent the extracted metadata?
    • How to represent and store data and metadata?
• Exploratory and Search Capabilities
    • How to automate discovery of relationships?
    • How to support discovery of relationships between
      electronic records corresponding to the same physical
      objects but different multidimensional observations?
Relationships Among Multiple Data Types
  • Example Data: Torpedo Weapon Retriever 841
       • 784 existing 2D image drawings and N>22 3D CAD
         models
  • How to establish relationships among the 3D
    CAD models and 2D image drawings during a
    product lifecycle?




          Hypothetical Distribution of 3D CAD models for TWR 841


Methodology
•   File Identification
•   Information Extraction from
     • File System
     • File Content
•   Information Organization
     • Taxonomy
        (classification)
     • Ontology
        (relationships)
•   Information
    Representation,
    Integration and Storage
     • XML
     • RDF
•   Relationship Discovery
File Identification and File System Analyses
• File Identification
   • What is the file format?
   • Is the file format well formed?
• Approach: Used DROID built on top of the PRONOM File
  Registry with additional NCSA support of 3D file identification
• Metadata extraction about a file system
   • Where is the file located?
   • What is the file size, time stamp, etc.?
• Approach: Use any file system information extraction
  software, such as Aperture (cross platform, open source, active
  development), Google desktop, OS specific solutions (e.g.,
  Apple Spotlight, Linux, MS Search)
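
The two questions above (what is the file format, and what does the file system say about the file) can be illustrated with a minimal Java sketch. It is not DROID, PRONOM or Aperture code: the class FileProbe and its methods are hypothetical names, and only three well-known leading-byte signatures (PDF, PNG, STEP Part 21) are checked, whereas a real registry holds many more.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;

// Hypothetical helper: signature-based format guess plus basic file system metadata.
final class FileProbe {

    // Guess a format from the leading bytes of a file (three signatures only).
    static String guessFormat(byte[] head) {
        // ISO-8859-1 maps every byte to one char, so binary signatures survive decoding.
        String ascii = new String(head, StandardCharsets.ISO_8859_1);
        if (ascii.startsWith("%PDF-")) {
            return "pdf";
        }
        if (ascii.startsWith("ISO-10303-21;")) {
            return "step";
        }
        if (ascii.startsWith("\u0089PNG\r\n\u001a\n")) {
            return "png";
        }
        return "unknown";
    }

    // Combine the format guess with location, size and time stamp.
    static String describe(Path file) throws IOException {
        byte[] head;
        try (InputStream in = Files.newInputStream(file)) {
            head = in.readNBytes(16);
        }
        BasicFileAttributes attrs = Files.readAttributes(file, BasicFileAttributes.class);
        return file.toAbsolutePath()
                + " format=" + guessFormat(head)
                + " size=" + attrs.size()
                + " modified=" + attrs.lastModifiedTime();
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny stand-in file so the example runs without external data.
        Path tmp = Files.createTempFile("probe", ".bin");
        Files.write(tmp, "%PDF-1.4\n".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(describe(tmp));
        Files.delete(tmp);
    }
}
```

A signature check answers only "what is the file format?"; whether the file is well formed would need a format-specific validator, which this sketch does not attempt.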
Content Analyses: Automation?
[Diagram labels: OCR; File Descriptors (Part name, Author, Software, Date, …); Relationship Discovery]

Content Analyses: Optical Character
Recognition (OCR) of 2D Drawings


              Reference Block




              Title Block


              MMC Block (Marinette Marine Corporation)
‘Standard’ Title Blocks: Organization and
 Ontology                           TEMPLATES

• Examples of title blocks used on
  drawings prepared by Naval
  Construction Battalion and Naval
  Construction Regiment
Title Block: Ontology and Metadata
    Representation

Ontology for sub-fields:
•   A – Record of preparation (<tdrw:recordOfPreparation>),
•   B – Drawing title (<tdrw:drawingTitle>),
•   C – Preparing activity (<tdrw:preparingActivity>),
•   F – Code identification number (<tdrw:FSCMNumber>),
•   G – Drawing size (<tdrw:drawingSize>),
•   H – Drawing number (<tdrw:drawingNumber>),
•   J – Scale (<tdrw:drawingScale>),
•   K – Specification number (<tdrw:drawingNumber>),
•   L – Sheet number (<tdrw:sheetNumber>).
Resource Description Framework (RDF):
•   Metadata representation: subject – predicate - object
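
As an illustration of the subject – predicate – object representation, the following Java sketch turns a few title-block fields into N-Triples statements. The namespace URI behind the tdrw: prefix, the drawing URI and the field values are placeholders assumed for this example; the slides give only the predicate names. TitleBlockTriples and Triple are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: title-block metadata as subject-predicate-object triples.
final class TitleBlockTriples {

    // Assumed namespace for the tdrw: prefix; the slides do not state its URI.
    static final String TDRW = "http://example.org/tdrw#";

    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        // One N-Triples statement with a plain string literal as the object.
        // Only backslashes and double quotes are escaped in this sketch.
        String toNTriples() {
            String escaped = object.replace("\\", "\\\\").replace("\"", "\\\"");
            return "<" + subject + "> <" + predicate + "> \"" + escaped + "\" .";
        }
    }

    // Build triples for three of the title-block sub-fields (B, H and J).
    static List<Triple> describe(String drawingUri, String title, String number, String scale) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple(drawingUri, TDRW + "drawingTitle", title));
        triples.add(new Triple(drawingUri, TDRW + "drawingNumber", number));
        triples.add(new Triple(drawingUri, TDRW + "drawingScale", scale));
        return triples;
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from any real drawing.
        for (Triple t : describe("http://example.org/drawings/1", "EXAMPLE TITLE", "0000000", "1:10")) {
            System.out.println(t.toNTriples());
        }
    }
}
```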
MMC and Reference Blocks: Organization

• MMC Blocks


                                   • The list varies in length
                                   • The notation is not standardized




Inconsistencies
Summary of OCR Based Analyses
• Manually encoded block coordinates for 784 files in PNG
  (converted from the original LZW-compressed TIFF files)
• Automated the OCR and ran it on
   • 700 title blocks,
   • 150 reference blocks,
   • about a dozen revision and list-of-material blocks,
   • about 200 additional areas with the drawing numbers
     (MMC DWG. NO.).
• Performance benchmarks:
   • Full OCR of the title (TB), MMC and reference (RF) blocks
     for about 50 image files (105 blocks) took about 6 hours
     on a quad-core machine
Content Based Extraction from STEP Files

     • 3D CAD models in STEP file format are searched for any ASCII
       strings matching an English dictionary and following the STEP
       metadata specification.


                                      Example Metadata for TWR841 ship deck

STEP METADATA SPECIFICATION
FILE_DESCRIPTION( /* description */ (''),
/* implementation_level */ '2;1');
FILE_NAME(
/* name */ '',
/* time_stamp */ '',
/* author */ (''),
/* organization */ (''),
/* preprocessor_version */ ' ',
/* originating_system */ '',
/* authorization */ ' ');

EXPECTED STEP METADATA
FILE_DESCRIPTION((''),
/* implementation_level */ '2;1');
FILE_NAME(
'120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK',
'04-10-86',
('LDOBSON'),
('NAVAL SEA SYSTEMS COMMAND'),
' ',
'IDA-STEP',
' ');

PARSED STEP METADATA
FILE_DESCRIPTION((''),
'2;1');
FILE_NAME(
'D:NARAArchieve_data_samplesBHD_FR12U2110_BHD12_2007_05_09.stp',
'2007-05-10T13:45:37',
('rakowpj'),
(''),
'Autodesk Inventor 11',
'Autodesk Inventor 11',
'');
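
A rough Java sketch of content-based extraction from a STEP header: it pulls the FILE_NAME entity out of the text and assigns its quoted strings, in order, to the seven attribute names shown in the specification column. This is an assumption-laden illustration, not the project's parser: StepHeaderSketch is a hypothetical name, the author and organization lists are assumed to hold exactly one string each, and the sample text reuses the "expected" values from the slide.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: read the FILE_NAME entity of a STEP (ISO 10303-21) header.
final class StepHeaderSketch {

    // Attribute order of FILE_NAME as listed in the specification column.
    private static final String[] FIELDS = {
        "name", "time_stamp", "author", "organization",
        "preprocessor_version", "originating_system", "authorization"
    };

    private static final Pattern FILE_NAME =
            Pattern.compile("FILE_NAME\\s*\\((.*?)\\)\\s*;", Pattern.DOTALL);

    // A STEP string is single-quoted; a doubled quote ('') stands for a literal quote.
    private static final Pattern QUOTED = Pattern.compile("'((?:[^']|'')*)'");

    // Sample header built from the "expected" values on the slide.
    static final String EXPECTED_HEADER =
            "FILE_DESCRIPTION((''),\n"
            + "/* implementation_level */ '2;1');\n"
            + "FILE_NAME(\n"
            + "'120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK',\n"
            + "'04-10-86',\n"
            + "('LDOBSON'),\n"
            + "('NAVAL SEA SYSTEMS COMMAND'),\n"
            + "' ',\n"
            + "'IDA-STEP',\n"
            + "' ');\n";

    // Returns field name -> value; empty when no FILE_NAME entity is found.
    static Map<String, String> parseFileName(String header) {
        // Drop /* ... */ comments before matching.
        String noComments = header.replaceAll("(?s)/\\*.*?\\*/", "");
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher entity = FILE_NAME.matcher(noComments);
        if (!entity.find()) {
            return fields;
        }
        // Assumes one string per attribute, including the two list attributes.
        Matcher quoted = QUOTED.matcher(entity.group(1));
        for (int i = 0; i < FIELDS.length && quoted.find(); i++) {
            fields.put(FIELDS[i], quoted.group(1).replace("''", "'"));
        }
        return fields;
    }

    public static void main(String[] args) {
        parseFileName(EXPECTED_HEADER).forEach((k, v) -> System.out.println(k + " = " + v));
    }
}
```

Running the same function on the header actually parsed from the converted file would expose the differences listed in the third column (author, organization, originating system).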
Exploratory Framework – User Interface Overview
[Screenshot: left and right panels, each with a filter for files, a list of files and a preview of selected data; center: graph of relationships between selected files]
Exploratory Framework – User Interface Overview
[Screenshot: additional import/export and preference options; table of relationships between selected files]
Exploratory Framework: Modes of Operations

• Detection of discrepancies/anomalies in file descriptors
   • OCR results
      • View 2D drawings and OCR results, and then edit OCR
        descriptors
   • 3D Model
      • View 3D model and content based extraction, and then edit
        descriptors
• Comparison of pairs of files
   • Pairs of 2D drawings
   • Pairs of 3D models
   • Pairs of (2D drawing, 3D model)
• Establish file relationships
   • Insert logical links to relate a pair of files
Detection of Anomalies in OCR Results
Comparison of Files

Color encoding:
• Predicates and values match
• Predicates match
• Predicate occurs only in one file
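
The three comparison outcomes can be sketched in Java as a classification over two descriptor maps (predicate to value). DescriptorDiff and its Status names are hypothetical; the sample values are taken from the expected and parsed STEP metadata shown earlier, and the slides do not say which color goes with which outcome.

```java
import java.util.Map;
import java.util.Objects;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch: classify each predicate of two file descriptors.
final class DescriptorDiff {

    enum Status { PREDICATE_AND_VALUE_MATCH, PREDICATE_MATCH, ONLY_IN_ONE_FILE }

    static Map<String, Status> compare(Map<String, String> a, Map<String, String> b) {
        // Visit the union of predicates in sorted order for stable output.
        Set<String> predicates = new TreeSet<>(a.keySet());
        predicates.addAll(b.keySet());
        Map<String, Status> result = new TreeMap<>();
        for (String p : predicates) {
            if (!a.containsKey(p) || !b.containsKey(p)) {
                result.put(p, Status.ONLY_IN_ONE_FILE);
            } else if (Objects.equals(a.get(p), b.get(p))) {
                result.put(p, Status.PREDICATE_AND_VALUE_MATCH);
            } else {
                result.put(p, Status.PREDICATE_MATCH);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> expected = Map.of(
                "author", "LDOBSON",
                "implementation_level", "2;1",
                "organization", "NAVAL SEA SYSTEMS COMMAND");
        Map<String, String> parsed = Map.of(
                "author", "rakowpj",
                "implementation_level", "2;1");
        compare(expected, parsed).forEach((p, s) -> System.out.println(p + ": " + s));
    }
}
```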
Establish File Relationships
Establish File Relationships: Logical Link
A Comprehensive Comparison
 of Contemporary Documents
Support of Appraisals by Enabling Comparisons
• How to compare containers with heterogeneous
  information (text, images, vector graphics,
  animation, 3D, etc.)?
   • Methodology
   • Metrics
   • Weighting factors for fusion
• How to quantify similarities between the same type
  of information?
   • Encodings and Representations
   • Metrics
   • Local versus global differences
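
One simple way to read "weighting factors for fusion" is a weighted average of per-component similarities (text, images, vector graphics, ...). The Java sketch below assumes exactly that; the slides do not state the fusion rule actually used in doc2learn, and SimilarityFusion, the component order and the weights are hypothetical.

```java
// Hypothetical sketch: fuse per-component similarities with a weighted average.
final class SimilarityFusion {

    // sims[i] is the similarity of component i in [0, 1]; weights[i] >= 0 is its importance.
    static double fuse(double[] sims, double[] weights) {
        if (sims.length != weights.length) {
            throw new IllegalArgumentException("one weight per component is required");
        }
        double weighted = 0.0;
        double total = 0.0;
        for (int i = 0; i < sims.length; i++) {
            if (weights[i] < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            weighted += weights[i] * sims[i];
            total += weights[i];
        }
        if (total == 0.0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        // Assumed component order: text, images, vector graphics.
        double[] sims = {1.0, 0.5, 0.0};
        double[] weights = {2.0, 1.0, 1.0};
        System.out.println(fuse(sims, weights)); // (2*1.0 + 1*0.5 + 1*0.0) / 4 = 0.625
    }
}
```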
Example: Adobe Portable Document
   Format (PDF)
 • Why PDF? - PDF is just an example of a container
      • Office environment (Adobe PDF, PS, MS Word, HTML, …)
      • Satellite measurements (HDF, netCDF, …)
 [Figure labels: 3D (Adobe Library 6.0); Movie (Adobe Library 7.0)]




Comparisons




Example: Compare Veterans Affairs Fact
Sheets in PDF and MS Word file formats
• Test data: 108 files from RG 015 - Records of the
  Department of Veterans Affairs/Fact
  Sheets/www1.va.gov/opa/fact/docs.
   • These files are Veterans Affairs Fact Sheets and are stored in both
     PDF and MS Word file formats (54 MS Word and 54 PDF files).
• Which files have identical content?
• Demo: 6 files
   • amwars-2.pdf, amwars.pdf
   • claimpro-2.pdf, claimpro.pdf
   • comprates-2.pdf, comprates.pdf
Methodology
[Diagram: pair-wise comparison of the same digital objects (+ …); comparison of multiple and heterogeneous digital objects (+ …); relationship to permanent records]
Exploration of Text Components

                      LOADED FILES


Occurrence of words       Occurrence of numbers   “Ignore” words
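
A sketch of how word occurrences with an "ignore" list could be turned into a text similarity score. Cosine similarity over word counts is a common choice assumed here; the slides do not name the metric used, and this simple tokenizer drops numbers, which the tool counts separately. WordSimilarity is a hypothetical name.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: compare two texts by the occurrence of their words.
final class WordSimilarity {

    // Count lower-cased words, skipping those on the ignore list.
    static Map<String, Integer> wordCounts(String text, Set<String> ignore) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!word.isEmpty() && !ignore.contains(word)) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Cosine similarity of two count vectors; 0.0 when either text has no counted words.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += (double) e.getValue() * e.getValue();
            dot += (double) e.getValue() * b.getOrDefault(e.getKey(), 0);
        }
        for (int v : b.values()) {
            normB += (double) v * v;
        }
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Set<String> ignore = Set.of("the", "of", "and");
        Map<String, Integer> first = wordCounts("Records of the Department of Veterans Affairs", ignore);
        Map<String, Integer> second = wordCounts("Veterans Affairs Fact Sheets", ignore);
        System.out.println(cosine(first, second));
    }
}
```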
Exploration of Image Components

                     LOADED FILES

                                                  “Ignore” colors
List of images   Occurrence of colors   Preview
Exploration of Vector Graphics
   Components

                       LOADED FILES

                                      Preview




            Occurrence of v/h lines




Comprehensive Pair-Wise Comparison of
Documents
   Grouping and
Visualization Control

                             Similarity Values




Document ID
Visual Comparison for 6 Test Files




                   Result:
         amwars-2.pdf = amwars.pdf
        claimpro-2.pdf = claimpro.pdf
      comprates-2.pdf = comprates.pdf
Computational
Requirements for
Executing the
Methodology


 Yellow indicates
  computations



    Relationship to
  Permanent Records




Appraisal & Sampling
Work in progress: Group and Validate Documents
[Plot axes: attributes of documents vs. order of documents]
Automated File Format Conversions
 and Conversion Quality
 Assessment
Conversions of Electronic Records
  • Conversions of electronic records are needed because
     • Visual exploration depends on various software
       packages
     • Many formats are retired (deprecated) over time
  • How to measure the degree of information
    preservation when files are converted from format A to
    format B?
     • During conversions, information could be lost, added or
       modified
     • What is the importance of each byte, object, etc.?
  • How to design a test bed for analyzing the quality of
    conversion and visualization software?

Illustration of 3D File Format Reality
[Figure: 3D software packages and their file formats: *.ma, *.mb, *.mp; *.k3d; *.pdf (*.prc, *.u3d); *.w3d; *.lwo; *.c4d; *.dwg; *.blend; *.iam; *.max, *.3ds]
Our Survey about 3D Content
• Q: How Many 3D File Formats Exist?
• A: We have found more than 140 3D file
  formats. Many are proprietary file formats. Many
  are extremely complex (1,200 and more pages
  of specifications).
• Q: How Many Software Packages Support 3D
  File Format Import, Export and Display?
• A: We have documented about 16 software
  packages. There are many more. Most of them
  are proprietary/closed source code. Many
  contain incomplete support of file specifications.
Examples of 3D Formats and Stored Content
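        | Geometry                     | Appearance                  | Scene                      |
Format  | Faceted Parametric CSG B-Rep | Color Material Texture Bump | Lights Views Trans. Groups | Animation
3ds     | √       √          -   -     | √     √        √       √    | √      √     √      -      | -
igs     | √       √          √   √     | √     -        -       -    | -      -     √      √      | -
lwo     | √       √          -   -     | √     √        √       √    | -      -     -      -      | -
obj     | √       √          -   -     | √     √        √       √    | -      -     -      √      | -
ply     | √       -          -   -     | √     √        √       √    | -      -     -      -      | -
stp     | √       √          √   √     | √     -        -       -    | -      -     -      √      | -
wrl     | √       √          -   -     | √     √        √       √    | √      √     √      √      | √
u3d     | √       -          -   -     | √     -        √       √    | √      √     √      √      | √
x3d     | √       √          -   -     | √     √        √       √    | √      √     √      √      | √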

 


     • Some content may be more important than others
             • The relative importance is situation dependent
Example: Conversion of X3D to STEP to X3D
[Diagram: X3D → WRL (software: X3dToVrml97) → STEP (software: A3D Reviewer) → WRL (software: A3D Reviewer) → X3D (software: Vrml97ToX3d); final result: Nothing!]
Towards a Universal Converter

• Use what is available in 3rd party software to
  perform conversions
   • Document what formats can be
     opened/imported by each application
   • Document what formats can be
     saved/exported by each application
   • Automate the use of each application and
     combine their abilities to perform conversions
     over a larger set of formats
Input/Output Graphs

                      Adobe 3D Reviewer
Automation of 3D File Format Mapping




          Find the shortest path


                                   Convert



                                   Preview




Automation of 3D File Format Conversion

• The I/O-Graph stores the information needed to convert
  between the formats represented in the graph.
• In order to perform the conversion we must execute the
  conversion path found.
   • Many high-end graphics programs are found on the Windows
     platform
   • Those on other platforms, such as Linux, tend to have Windows
     ports
   • Some are command-line driven (usually small converter
     applications).
   • Many have only GUI interfaces
   • AutoHotKey: a scripting language for the Windows GUI.
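
The idea of finding a conversion path in the I/O-Graph can be sketched in Java as a breadth-first search over a directed graph whose nodes are formats and whose edges are (software, input format, output format). This is an illustration under assumptions, not Polyglot's code: IoGraphSketch is a hypothetical name, and the edges are only loosely based on the conversions mentioned in these slides.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of an input/output graph with a shortest-path search.
final class IoGraphSketch {

    static final class Edge {
        final String software;
        final String from;
        final String to;

        Edge(String software, String from, String to) {
            this.software = software;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return from + " -> " + to + " (" + software + ")";
        }
    }

    private final Map<String, List<Edge>> outgoing = new HashMap<>();

    // Record that a software package can open `from` and save `to`.
    void add(String software, String from, String to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(software, from, to));
    }

    // Breadth-first search: the first path found uses the fewest conversions.
    Optional<List<Edge>> shortestPath(String source, String target) {
        Map<String, Edge> reachedBy = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            String format = queue.remove();
            if (format.equals(target)) {
                break;
            }
            for (Edge e : outgoing.getOrDefault(format, Collections.emptyList())) {
                if (!e.to.equals(source) && !reachedBy.containsKey(e.to)) {
                    reachedBy.put(e.to, e);
                    queue.add(e.to);
                }
            }
        }
        if (!source.equals(target) && !reachedBy.containsKey(target)) {
            return Optional.empty();
        }
        // Walk back from the target to the source to rebuild the path.
        LinkedList<Edge> path = new LinkedList<>();
        for (String at = target; !at.equals(source); at = reachedBy.get(at).from) {
            path.addFirst(reachedBy.get(at));
        }
        return Optional.of(path);
    }

    // Illustrative edges; a real graph would be filled from documented import/export lists.
    static IoGraphSketch demoGraph() {
        IoGraphSketch g = new IoGraphSketch();
        g.add("X3dToVrml97", "x3d", "wrl");
        g.add("Vrml97ToX3d", "wrl", "x3d");
        g.add("A3D Reviewer", "wrl", "stp");
        g.add("A3D Reviewer", "stp", "wrl");
        g.add("A3D Reviewer", "obj", "pdf");
        return g;
    }

    public static void main(String[] args) {
        System.out.println(demoGraph().shortestPath("x3d", "stp").orElse(Collections.emptyList()));
    }
}
```

Executing the path is a separate step: each edge would be carried out by driving the named application, through its command line where one exists or through GUI scripting otherwise.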
Methodology           EXTENSIBILITY




                         AUTOMATION

    Cloud Computing

              COMPUTATIONAL
               SCALABILITY




Services to Archivists
NCSA Polyglot – Conversion Services

 • Web interface: user
   can drag and drop files
   into upload area for
   conversion

 • Java interface:
PolyglotRequest pgr;
pgr = new PolyglotRequest("http://???", "obj");
pgr.convertFile("file.wrl", "./");


 • Scalability Test
Number of PCs     One PC                 Two PCs
Processing Time   33 minutes 6 seconds   16 minutes 40 seconds
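
The scalability figures can be checked with a few lines of Java. Speedup (one-PC time divided by two-PC time) and parallel efficiency (speedup divided by the number of PCs) are standard definitions applied here for illustration; the slide itself reports only the two processing times. ScalabilityCheck is a hypothetical name.

```java
import java.time.Duration;
import java.util.Locale;

// Hypothetical sketch: speedup and efficiency from the two reported processing times.
final class ScalabilityCheck {

    // Ratio of the single-PC time to the multi-PC time.
    static double speedup(Duration onePc, Duration manyPcs) {
        return (double) onePc.getSeconds() / manyPcs.getSeconds();
    }

    public static void main(String[] args) {
        Duration onePc = Duration.ofMinutes(33).plusSeconds(6);   // 1986 seconds
        Duration twoPcs = Duration.ofMinutes(16).plusSeconds(40); // 1000 seconds
        double s = speedup(onePc, twoPcs);
        // Efficiency divides the speedup by the number of PCs used (two).
        System.out.println(String.format(Locale.ROOT, "speedup=%.3f efficiency=%.3f", s, s / 2));
    }
}
```

With 1986 s on one PC and 1000 s on two, the speedup is about 1.99, so doubling the machines nearly halves the processing time in this test.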
NCSA Polyglot – Data Loss Measurement
Services
                               We would like to assign
                                  a value to each
                                conversion edge …
Geometry Based Content Retention

  • Several metrics
  • Data driven assignment




  • Example results
Metric         Single Optimal Conversion                        ‘Best’ File Format
               Software            From   To     Information    Format   Information
                                                 Retention               Retention
Light Fields   Adobe 3D Reviewer   .pdf   .stp   61.67          .stp     40.73
Spin Images    Adobe 3D Reviewer   .obj   .pdf   59.07          .stl     34.89
Summary
• Technologies for appraisal of electronic records
  should assist archivists
• They are designed to support decisions and data
  explorations by automating appraisal tasks
• The software for doc2learn and Polyglot is available
  for downloading at
  http://isda.ncsa.uiuc.edu/download/
• File2learn software – the work is still in progress
• Feedback is very welcome

• Questions: Peter Bajcsy – pbajcsy@ncsa.uiuc.edu
Demo exercise

• Step 1: Check that a
  path exists between
  wrl and pdf

• Step 2: drag and drop
  heart.wrl; select target
  to be pdf, click upload

• Step 3: download to
  desktop and open in
  Adobe PDF Viewer

Weitere ähnliche Inhalte

Was ist angesagt?

IRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication SystemIRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication SystemIRJET Journal
 
Stream Processing with DDS and CEP
Stream Processing with  DDS and CEPStream Processing with  DDS and CEP
Stream Processing with DDS and CEPAngelo Corsaro
 
Guest Lecture: Exchange and QA for Metadata at WSU
Guest Lecture: Exchange and QA for Metadata at WSUGuest Lecture: Exchange and QA for Metadata at WSU
Guest Lecture: Exchange and QA for Metadata at WSUMeghan Finch
 
Advanced OpenSplice Programming - Part I
Advanced OpenSplice Programming - Part IAdvanced OpenSplice Programming - Part I
Advanced OpenSplice Programming - Part IAngelo Corsaro
 
Federated HDFS
Federated HDFSFederated HDFS
Federated HDFShuguk
 
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...Karen R
 
Building Reactive Applications with DDS
Building Reactive Applications with DDSBuilding Reactive Applications with DDS
Building Reactive Applications with DDSAngelo Corsaro
 
HiTIME project
HiTIME projectHiTIME project
HiTIME projectvty
 
HA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talkHA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talkSteve Loughran
 
Presentation Ispass 2012 Session6 Presentation1
Presentation Ispass 2012 Session6 Presentation1Presentation Ispass 2012 Session6 Presentation1
Presentation Ispass 2012 Session6 Presentation1sairahul321
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Carlos Castillo (ChaTo)
 
The DDS Security Standard
The DDS Security StandardThe DDS Security Standard
The DDS Security StandardAngelo Corsaro
 
The DDS Tutorial - Part I
The DDS Tutorial - Part IThe DDS Tutorial - Part I
The DDS Tutorial - Part IAngelo Corsaro
 
Digitization Projects for Small Archives and Museums
Digitization Projects for Small Archives and MuseumsDigitization Projects for Small Archives and Museums
Digitization Projects for Small Archives and MuseumsAnna Naruta-Moya
 
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Using Architectures for Semantic Interoperability to Create Journal Clubs for...Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Using Architectures for Semantic Interoperability to Create Journal Clubs for...James Powell
 

Was ist angesagt? (19)

IRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication SystemIRJET - A Secure Access Policies based on Data Deduplication System
IRJET - A Secure Access Policies based on Data Deduplication System
 
Tese phd
Tese phdTese phd
Tese phd
 
Stream Processing with DDS and CEP
Stream Processing with  DDS and CEPStream Processing with  DDS and CEP
Stream Processing with DDS and CEP
 
Guest Lecture: Exchange and QA for Metadata at WSU
Guest Lecture: Exchange and QA for Metadata at WSUGuest Lecture: Exchange and QA for Metadata at WSU
Guest Lecture: Exchange and QA for Metadata at WSU
 
Advanced OpenSplice Programming - Part I
Advanced OpenSplice Programming - Part IAdvanced OpenSplice Programming - Part I
Advanced OpenSplice Programming - Part I
 
Federated HDFS
Federated HDFSFederated HDFS
Federated HDFS
 
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
Using Dublin Core for DISCOVER: a New Zealand visual art and music resource f...
 
Building Reactive Applications with DDS
Building Reactive Applications with DDSBuilding Reactive Applications with DDS
Building Reactive Applications with DDS
 
HiTIME project
HiTIME projectHiTIME project
HiTIME project
 
Whither Small Data?
Whither Small Data?Whither Small Data?
Whither Small Data?
 
HA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talkHA Hadoop -ApacheCon talk
HA Hadoop -ApacheCon talk
 
Presentation Ispass 2012 Session6 Presentation1
Presentation Ispass 2012 Session6 Presentation1Presentation Ispass 2012 Session6 Presentation1
Presentation Ispass 2012 Session6 Presentation1
 
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
Challenges Distributed Information Retrieval [RBY] (ICDE 2007 Turkey)
 
The DDS Security Standard
The DDS Security StandardThe DDS Security Standard
The DDS Security Standard
 
The DDS Tutorial - Part I
The DDS Tutorial - Part IThe DDS Tutorial - Part I
The DDS Tutorial - Part I
 
Digitization Projects for Small Archives and Museums
Digitization Projects for Small Archives and MuseumsDigitization Projects for Small Archives and Museums
Digitization Projects for Small Archives and Museums
 
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Using Architectures for Semantic Interoperability to Create Journal Clubs for...Using Architectures for Semantic Interoperability to Create Journal Clubs for...
Using Architectures for Semantic Interoperability to Create Journal Clubs for...
 
Dexjava Technical Seminar Dec 2011
Dexjava Technical Seminar Dec 2011Dexjava Technical Seminar Dec 2011
Dexjava Technical Seminar Dec 2011
 
DDS In Action Part II
DDS In Action Part IIDDS In Action Part II
DDS In Action Part II
 

Andere mochten auch

Key Aspects in 3D File Format Conversions
Key Aspects in 3D File Format ConversionsKey Aspects in 3D File Format Conversions
Key Aspects in 3D File Format Conversionspbajcsy
 
Soccer 3v3 Field Sponsorship2009
Soccer 3v3 Field Sponsorship2009Soccer 3v3 Field Sponsorship2009
Soccer 3v3 Field Sponsorship2009Misty Nesvick
 
Soccer 3v3 Fun Zone 2009
Soccer 3v3 Fun Zone 2009Soccer 3v3 Fun Zone 2009
Soccer 3v3 Fun Zone 2009Misty Nesvick
 
SLiMS improving librarian competences 20150508
SLiMS improving librarian competences 20150508SLiMS improving librarian competences 20150508
SLiMS improving librarian competences 20150508Wardi Yono
 
Spc Gen Pres Final
Spc Gen Pres FinalSpc Gen Pres Final
Spc Gen Pres Finaladamlefebvre
 
To Preserve Or Not To Preserve?
To Preserve Or Not To Preserve?To Preserve Or Not To Preserve?
To Preserve Or Not To Preserve?pbajcsy
 
Mobile ISD Metcalf IEL2010
Mobile ISD Metcalf IEL2010Mobile ISD Metcalf IEL2010
Mobile ISD Metcalf IEL2010David Metcalf
 
Overview of Lincoln Paper Design
Overview of Lincoln Paper DesignOverview of Lincoln Paper Design
Overview of Lincoln Paper Designpbajcsy
 
Home Selling Tips - Pricing and Staging
Home Selling Tips - Pricing and StagingHome Selling Tips - Pricing and Staging
Home Selling Tips - Pricing and StagingDan Eason
 
Gsm2009
Gsm2009Gsm2009
Gsm2009gtran
 

Andere mochten auch (10)

Key Aspects in 3D File Format Conversions
Key Aspects in 3D File Format ConversionsKey Aspects in 3D File Format Conversions
Key Aspects in 3D File Format Conversions
 
Soccer 3v3 Field Sponsorship2009
Soccer 3v3 Field Sponsorship2009Soccer 3v3 Field Sponsorship2009
Soccer 3v3 Field Sponsorship2009
 
Soccer 3v3 Fun Zone 2009
Soccer 3v3 Fun Zone 2009Soccer 3v3 Fun Zone 2009
Soccer 3v3 Fun Zone 2009
 
SLiMS improving librarian competences 20150508
SLiMS improving librarian competences 20150508SLiMS improving librarian competences 20150508
SLiMS improving librarian competences 20150508
 
Spc Gen Pres Final
Spc Gen Pres FinalSpc Gen Pres Final
Spc Gen Pres Final
 
To Preserve Or Not To Preserve?
To Preserve Or Not To Preserve?To Preserve Or Not To Preserve?
To Preserve Or Not To Preserve?
 
Mobile ISD Metcalf IEL2010
Mobile ISD Metcalf IEL2010Mobile ISD Metcalf IEL2010
Mobile ISD Metcalf IEL2010
 
Overview of Lincoln Paper Design
Overview of Lincoln Paper DesignOverview of Lincoln Paper Design
Overview of Lincoln Paper Design
 
Home Selling Tips - Pricing and Staging
Home Selling Tips - Pricing and StagingHome Selling Tips - Pricing and Staging
Home Selling Tips - Pricing and Staging
 
Gsm2009
Gsm2009Gsm2009
Gsm2009
 

Technologies For Appraising and Managing Electronic Records

  • 1. Technologies For Appraising and Managing Electronic Records Presented by: Peter Bajcsy - Research Scientist at NCSA - Associate Director of I-CHASS, I3 Institute - Adjunct Assistant Professor, CS & ECE, UIUC National Center for Supercomputing Applications University of Illinois at Urbana-Champaign
  • 2. Acknowledgement • This research was partially supported by a National Archives and Records Administration (NARA) supplement to NSF PACI cooperative agreement CA #SCI-9619019 and NCSA Industrial Partners. • The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Archives and Records Administration, or the U.S. government. • Contributions by: Peter Bajcsy, Kenton McHenry, Rob Kooper, Michal Ondrejcek, Jason Kastner, William McFadden, and Sang-Chul Lee
  • 3. Outline • Introduction • A discovery of relationships among digital file collections (file2learn) • A comprehensive comparison of contemporary documents (doc2learn) • Automated file format conversions and conversion quality assessment (Polyglot) • Summary
  • 5. Supporting NARA’s Strategic Plan • According to The Strategic Plan of The National Archives and Records Administration 2006–2016, “Preserving the Past to Protect the Future”: • “Strategic Goal: We will preserve and process records to ensure access by the public as soon as legally possible” • “D. We will improve the efficiency with which we manage our holdings from the time they are scheduled through accessioning, processing, storage, preservation, and public use.”
  • 6. To Be Preserved! [Diagram labels: Digital representation of information & knowledge; Information transfer; Preservation; ?; AGENCY; ARCHIVES]
  • 7. Do We Know the Answers? Questions During Appraisal of Electronic Records Series • (1) Given M full DVDs with files, which files are related? • (2) Given N versions of the ‘same’ file, which file version(s) should be preserved?
  • 8. Do We Know the Answers? • (3) Given P file formats, which file format and which conversion software should be used so that files remain viewable in the long run? • How much information is lost during file format conversion? • (4) What is the granularity of information that one should preserve about a decision process in order to reconstruct it?
  • 9. Goal: Design Technologies for Appraising and Managing Electronic Records • Technologies should address the following problems: • (1) a discovery of relationships among digital file collections (file2learn) • (2) a comprehensive comparison of contemporary documents (doc2learn) • (3) automated file format conversions and conversion quality assessment (Polyglot)
  • 10. A Discovery of Relationships Among Digital File Collections
  • 11. Discovering Relationships Among Files • How should one establish relationships among electronic records coming • from disparate sources or • from the same source at multiple time instances? • Need to understand the complexity of the problem
  • 12. Discovering Relationships Among Files: Components • Metadata describing electronic records • How to extract metadata? • How to automate metadata extraction from multiple data types, e.g., 2D drawings and 3D CAD models? • Storage of metadata • What ontology to use to represent the extracted metadata? • How to represent and store data and metadata? • Exploratory and Search Capabilities • How to automate discovery of relationships? • How to support discovery of relationships between electronic records corresponding to the same physical objects but different multidimensional observations?
  • 13. Relationships Among Multiple Data Types • Example Data: Torpedo Weapon Retriever 841 • 784 existing 2D image drawings and N>22 3D CAD models • How to establish relationships among the 3D CAD models and 2D image drawings during a product lifecycle? [Figure: Hypothetical distribution of 3D CAD models for TWR 841]
  • 14. Methodology • File Identification • Information Extraction from • File System • File Content • Information Organization • Taxonomy (classification) • Ontology (relationships) • Information Representation, Integration and Storage • XML • RDF • Relationship Discovery
  • 15. File Identification and File System Analyses • File Identification • What is the file format? • Is the file format well formed? • Approach: Used DROID built on top of the PRONOM File Registry with additional NCSA support of 3D file identification • Metadata extraction about a file system • Where is the file located? • What is the file size, time stamp, etc.? • Approach: Use any file system information extraction software, such as Aperture (cross platform, open source, active development), Google Desktop, OS specific solutions (e.g., Apple Spotlight, Linux, MS Search)
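The file system questions on this slide (location, size, time stamp) can be illustrated with the Java standard library alone. The sketch below is not DROID, Aperture, or any file2learn code; `FileSystemDescriptors` and its key names are hypothetical, and the extension lookup is only a crude stand-in for real format identification.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper (not part of file2learn): gathers file system descriptors for one file.
final class FileSystemDescriptors {

    static Map<String, String> describe(Path file) {
        BasicFileAttributes attrs;
        try {
            attrs = Files.readAttributes(file, BasicFileAttributes.class);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        Map<String, String> d = new LinkedHashMap<>();
        d.put("location", file.toAbsolutePath().normalize().toString());
        d.put("sizeBytes", Long.toString(attrs.size()));
        d.put("lastModified", attrs.lastModifiedTime().toString());
        d.put("created", attrs.creationTime().toString());
        // Assumption: the extension is used as a rough format hint only;
        // it says nothing about whether the file is well formed.
        Path fileName = file.getFileName();
        String name = fileName == null ? "" : fileName.toString();
        int dot = name.lastIndexOf('.');
        d.put("extension", dot >= 0 ? name.substring(dot + 1).toLowerCase(Locale.ROOT) : "");
        return d;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample file, created only so the sketch can run anywhere.
        Path sample = Files.createTempFile("drawing", ".PNG");
        Files.write(sample, new byte[] {1, 2, 3});
        describe(sample).forEach((k, v) -> System.out.println(k + " = " + v));
        Files.delete(sample);
    }
}
```

Keeping the descriptors in an ordered string map makes them easy to hand to a later representation step such as RDF.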
  • 16. Content Analyses: Automation [Diagram labels: OCR; File Descriptors (part name, author, software, date, …); Relationship Discovery; ?]
  • 17. Content Analyses: Optical Character Recognition (OCR) of 2D Drawings [Figure labels: Reference Block; Title Block; MMC Block (Marinette Marine Corporation)]
  • 18. ‘Standard’ Title Blocks: Organization and Ontology TEMPLATES • Examples of title blocks used on drawings prepared by Naval Construction Battalion and Naval Construction Regiment
  • 19. Title Block: Ontology and Metadata Representation Ontology for sub-fields: • A – Record of preparation (<tdrw:recordOfPreparation>), • B – Drawing title (<tdrw:drawingTitle>), • C – Preparing activity (<tdrw:preparingActivity>), • F – Code identification number (<tdrw:FSCMNumber>), • G – Drawing size (<tdrw:drawingSize>), • H – Drawing number (<tdrw:drawingNumber>), • J – Scale (<tdrw:drawingScale>), • K – Specification number (<tdrw:drawingNumber>), • L – Sheet number (<tdrw:sheetNumber>). Resource Description Framework (RDF): • Metadata representation: subject – predicate – object
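To make the subject – predicate – object representation concrete, here is a minimal sketch that turns a few title block fields into N-Triples lines. The predicate local names come from the slide; the namespace URI behind the `tdrw` prefix is not given there, so `http://example.org/tdrw#` is a placeholder, and the class name, file URI and field values are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: title block fields expressed as RDF-style triples.
final class TitleBlockTriples {

    // Placeholder namespace; the real URI for the tdrw prefix is not stated in the slides.
    static final String TDRW = "http://example.org/tdrw#";

    /** One subject-predicate-object statement whose object is a plain literal. */
    static final class Triple {
        final String subject;
        final String predicate;
        final String object;

        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }

        /** Serializes the statement as one N-Triples line (escapes only backslash and quote). */
        String toNTriples() {
            String escaped = object.replace("\\", "\\\\").replace("\"", "\\\"");
            return "<" + subject + "> <" + predicate + "> \"" + escaped + "\" .";
        }
    }

    static List<Triple> describeDrawing(String fileUri, String title, String number, String sheet) {
        List<Triple> triples = new ArrayList<>();
        triples.add(new Triple(fileUri, TDRW + "drawingTitle", title));
        triples.add(new Triple(fileUri, TDRW + "drawingNumber", number));
        triples.add(new Triple(fileUri, TDRW + "sheetNumber", sheet));
        return triples;
    }

    public static void main(String[] args) {
        // Made-up values standing in for OCR output from one drawing.
        for (Triple t : describeDrawing("file:///drawings/sample.png", "MAIN DECK", "0001", "1")) {
            System.out.println(t.toNTriples());
        }
    }
}
```

In practice an RDF library would handle serialization and escaping; the hand-rolled writer only shows the shape of the data.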
  • 20. MMC and Reference Blocks: Organization • MMC Blocks • The list varies in length • The notation is not standardized [Figure label: Inconsistencies]
  • 21. Summary of OCR Based Analyses • Manually encoded block coordinates for 784 files in PNG (converted from originally LZW compressed TIFF files) • Automated OCR and executed OCR on • 700 title blocks, • 150 reference blocks, • a dozen revision and list of material blocks, • about 200 additional areas with the drawing numbers (MMC DWG. NO.). • Performance benchmarks: • Full OCR of TB, MMC and RF for about 50 image files (105 blocks) took about 6 hours on a quad core machine
  • 22. Content Based Extraction from STEP Files • 3D CAD models in STEP file format are searched for any ASCII strings matching an English dictionary and following the STEP metadata specification. • Example metadata for TWR841 ship deck, per STEP header field (expected value / parsed value): • FILE_DESCRIPTION description: ('') / ('') • FILE_DESCRIPTION implementation_level: '2;1' / '2;1' • FILE_NAME name: '120 TORPEDO WEAPONS RETRIEVER, TRANSVERSE BULKHEADS BELOW, MAIN DECK' / 'D:NARAArchieve_data_samplesBHD_FR12 U2110_BHD12_2007_05_09.stp' • FILE_NAME time_stamp: '04-10-86' / '2007-05-10T13:45:37' • FILE_NAME author: ('LDOBSON') / ('rakowpj') • FILE_NAME organization: ('NAVAL SEA SYSTEMS COMMAND') / ('') • FILE_NAME preprocessor_version: ' ' / 'Autodesk Inventor 11' • FILE_NAME originating_system: 'IDA-STEP' / 'Autodesk Inventor 11' • FILE_NAME authorization: ' ' / ''
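A rough sketch of how the FILE_NAME and FILE_DESCRIPTION arguments could be pulled out of a STEP header follows. It is an illustration only, not the NCSA extractor: `StepHeaderSketch` is a hypothetical name, the scanner assumes no comments inside the entity and no whitespace between the entity name and its opening parenthesis, and the sample header reuses the parsed values from the slide with a made-up file name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split the top-level arguments of a STEP header entity.
final class StepHeaderSketch {

    // Field order of FILE_NAME as listed on the slide.
    static final String[] FILE_NAME_FIELDS = {
        "name", "time_stamp", "author", "organization",
        "preprocessor_version", "originating_system", "authorization"
    };

    /** Returns the top-level arguments of the first ENTITY(...) occurrence, or an empty list. */
    static List<String> arguments(String header, String entity) {
        List<String> args = new ArrayList<>();
        int start = header.indexOf(entity + "(");
        if (start < 0) {
            return args;
        }
        StringBuilder current = new StringBuilder();
        boolean inString = false;
        int depth = 0;
        for (int i = start + entity.length(); i < header.length(); i++) {
            char c = header.charAt(i);
            if (c == '\'') {
                inString = !inString; // a doubled '' toggles twice, so the state is preserved
            }
            if (!inString) {
                if (c == '(') {
                    if (depth++ == 0) {
                        continue; // skip the entity's own opening parenthesis
                    }
                } else if (c == ')') {
                    if (--depth == 0) {
                        args.add(current.toString().trim());
                        return args;
                    }
                } else if (c == ',' && depth == 1) {
                    args.add(current.toString().trim());
                    current.setLength(0);
                    continue;
                }
            }
            current.append(c);
        }
        return args; // unterminated entity: return what was completed so far
    }

    public static void main(String[] args) {
        String header = "FILE_DESCRIPTION((''),'2;1');\n"
            + "FILE_NAME('sample.stp','2007-05-10T13:45:37',('rakowpj'),(''),"
            + "'Autodesk Inventor 11','Autodesk Inventor 11','');";
        List<String> values = arguments(header, "FILE_NAME");
        for (int i = 0; i < values.size() && i < FILE_NAME_FIELDS.length; i++) {
            System.out.println(FILE_NAME_FIELDS[i] + " = " + values.get(i));
        }
    }
}
```

Comparing these parsed values with the expected ones field by field is what exposes gaps such as the empty organization entry.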
  • 23. Exploratory Framework – User Interface Overview [Screenshot labels: Filter for Files; Graph of Relationships Between Selected Files; Files; Preview of Selected Data]
  • 24. Exploratory Framework – User Interface Overview [Screenshot labels: Additional Import/Export and Preference Options; Table of Relationships Between Selected Files]
  • 25. Exploratory Framework: Modes of Operations • Detection of discrepancies/anomalies in file descriptors • OCR results • View 2D drawings and OCR results, and then edit OCR descriptors • 3D Model • View 3D model and content based extraction, and then edit descriptors • Comparison of pairs of files • Pairs of 2D drawings • Pairs of 3D models • Pairs of (2D drawing, 3D model) • Establish file relationships • Insert logical links to relate a pair of files
  • 26. Detection of Anomalies in OCR Results
  • 27. Comparison of Files Color encoding: • Predicates and values match • Predicates match • Predicate occurs only in one file
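The three color classes can be read as a three-way classification of each predicate across two files' descriptors. The sketch below assumes the descriptors are available as predicate-to-value maps; the class and enum names are hypothetical and the slides do not say how file2learn implements the comparison.

```java
import java.util.Map;
import java.util.Objects;
import java.util.TreeMap;

// Hypothetical sketch of the three comparison outcomes named on the slide.
final class DescriptorComparison {

    enum Match { PREDICATE_AND_VALUE, PREDICATE_ONLY, ONLY_IN_ONE_FILE }

    /** Classifies every predicate that occurs in either descriptor map. */
    static Map<String, Match> compare(Map<String, String> a, Map<String, String> b) {
        Map<String, Match> result = new TreeMap<>();
        for (Map.Entry<String, String> e : a.entrySet()) {
            String predicate = e.getKey();
            if (!b.containsKey(predicate)) {
                result.put(predicate, Match.ONLY_IN_ONE_FILE);
            } else if (Objects.equals(e.getValue(), b.get(predicate))) {
                result.put(predicate, Match.PREDICATE_AND_VALUE);
            } else {
                result.put(predicate, Match.PREDICATE_ONLY);
            }
        }
        for (String predicate : b.keySet()) {
            if (!a.containsKey(predicate)) {
                result.put(predicate, Match.ONLY_IN_ONE_FILE);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up descriptors for a drawing and a model.
        Map<String, String> drawing = Map.of("drawingTitle", "MAIN DECK", "drawingScale", "1:10");
        Map<String, String> model = Map.of("drawingTitle", "MAIN DECK", "drawingScale", "1:20", "author", "X");
        compare(drawing, model).forEach((k, v) -> System.out.println(k + " -> " + v));
    }
}
```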
  • 30. A Comprehensive Comparison of Contemporary Documents
  • 31. Support of Appraisals by Enabling Comparisons • How to compare containers with heterogeneous information (text, images, vector graphics, animation, 3D, etc.)? • Methodology • Metrics • Weighting factors for fusion • How to quantify similarities between the same type of information? • Encodings and Representations • Metrics • Local versus global differences
  • 32. Example: Adobe Portable Document Format (PDF) • Why PDF? - PDF is just an example of a container • Office environment (Adobe PDF, PS, MS Word, HTML, …) • Satellite measurements (HDF, netCDF, …) [Figure labels: 3D; Movie; Adobe Library 6.0; Adobe Library 7.0]
  • 34. Example: Compare Veterans Affairs Fact Sheets in PDF and MS Word file formats • Test data: 108 files from RG 015 - Records of the Department of Veterans Affairs/Fact Sheets/www1.va.gov/opa/fact/docs. • These files are Veterans Affairs Fact Sheets and are stored in both PDF and MS Word file formats (54 MS Word and 54 PDF files). • Which files have identical content? • Demo: 6 files • amwars-2.pdf, amwars.pdf • claimpro-2.pdf, claimpro.pdf • comprates-2.pdf, comprates.pdf
  • 35. Methodology [Diagram labels: Pair-wise comparison of the same digital objects; Comparison of multiple and heterogeneous digital objects; Relationship to Permanent Records]
  • 36. Exploration of Text Components [Screenshot labels: Loaded files; Occurrence of words; Occurrence of numbers; “Ignore” words]
  • 37. Exploration of Image Components [Screenshot labels: Loaded files; “Ignore” colors; List of images; Occurrence of colors; Preview]
  • 38. Exploration of Vector Graphics Components [Screenshot labels: Loaded files; Preview; Occurrence of v/h lines]
  • 39. Comprehensive Pair-Wise Comparison of Documents [Screenshot labels: Grouping and Visualization Control; Similarity Values; Document ID]
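The slides do not state which metrics doc2learn uses to produce its similarity values. As a stand-in for the text component only, the sketch below counts word occurrences (with an "ignore" list, as in the text exploration view) and compares two documents by cosine similarity; the class name and the choice of cosine similarity are assumptions, not the tool's documented method.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: word-occurrence vectors compared by cosine similarity.
final class WordSimilaritySketch {

    /** Counts lower-cased words, skipping anything in the ignore set. */
    static Map<String, Integer> wordCounts(String text, Set<String> ignore) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (w.isEmpty() || ignore.contains(w)) {
                continue;
            }
            counts.merge(w, 1, Integer::sum);
        }
        return counts;
    }

    /** Cosine similarity in [0, 1]; 0 when either document has no counted words. */
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0;
        double normA = 0;
        double normB = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            normA += e.getValue() * (double) e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) {
                dot += e.getValue() * (double) other;
            }
        }
        for (int v : b.values()) {
            normB += v * (double) v;
        }
        return (normA == 0 || normB == 0) ? 0.0 : dot / Math.sqrt(normA * normB);
    }

    public static void main(String[] args) {
        Set<String> ignore = Set.of("the", "of", "and");
        // Made-up sentences standing in for extracted document text.
        Map<String, Integer> first = wordCounts("Compensation rates of the claims process", ignore);
        Map<String, Integer> second = wordCounts("The claims process and compensation", ignore);
        System.out.println(cosine(first, second));
    }
}
```

A full comparison would compute a similar score per component (text, images, vector graphics) and fuse them with weighting factors.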
  • 40. Visual Comparison for 6 Test Files Result: • amwars-2.pdf = amwars.pdf • claimpro-2.pdf = claimpro.pdf • comprates-2.pdf = comprates.pdf
  • 41. Computational Requirements for Executing the Methodology [Diagram labels: Yellow indicates computations; Relationship to Permanent Records; Appraisal & Sampling]
  • 42. Work in progress: Group and Validate Documents [Plot axes: Attributes of documents; Order of documents]
  • 43. Automated File Format Conversions and Conversion Quality Assessment
  • 44. Conversions of Electronic Records • Conversions of electronic records are needed because • Visual exploration depends on various software packages • Many formats are retired (deprecated) over time • How to measure the degree of information preservation when files are converted from format A to format B? • During conversions, information could be lost, added or modified • What is the importance of each byte, object, etc.? • How to design a test bed for analyzing the quality of conversion and visualization software?
  • 45. Illustration of 3D File Format Reality [Figure labels: *.ma, *.mb, *.mp; *.k3d; *.pdf (*.prc, *.u3d); *.w3d; *.lwo; *.c4d; *.dwg; *.blend; *.iam; *.max, *.3ds]
  • 46. Our Survey about 3D Content • Q: How Many 3D File Formats Exist? • A: We have found more than 140 3D file formats. Many are proprietary file formats. Many are extremely complex (1,200 and more pages of specifications). • Q: How Many Software Packages Support 3D File Format Import, Export and Display? • A: We have documented about 16 software packages. There are many more. Most of them are proprietary/closed source code. Many contain incomplete support of file specifications.
  • 47. Examples of 3D Formats and Stored Content Format Geometry Appearance Scene Animation Faceted Parametric CSG B-Rep Color Material Texture Bump Lights Views Trans. Groups 3ds √ √ √ √ √ √ √ √ √ igs √ √ √ √ √ √ √ lwo √ √ √ √ √ √ obj √ √ √ √ √ √ √ ply √ √ √ √ √ stp √ √ √ √ √ √ wrl √ √ √ √ √ √ √ √ √ √ √ u3d √ √ √ √ √ √ √ √ √ x3d √ √ √ √ √ √ √ √ √ √ √   • Some content may be more important than others • The relative importance is situation dependent
  • 48. Example: Conversion of X3D to STEP to X3D [Diagram: X3D to WRL (software: X3dToVrml97); WRL to STEP (software: A3D Reviewer); STEP to WRL (software: A3D Reviewer); WRL to X3D (software: Vrml97ToX3d); result label: Nothing!]
  • 49. Towards a Universal Converter • Use what is available in 3rd party software to perform conversions • Document what formats can be opened/imported by each application • Document what formats can be saved/exported by each application • Automate the use of each application and combine their abilities to perform conversions over a larger set of formats
  • 50. Input/Output Graphs Adobe 3D Reviewer
  • 51. Automation of 3D File Format Mapping • Find the shortest path • Convert • Preview
  • 52. Automation of 3D File Format Conversion • The I/O-Graph stores the information needed to convert between the formats represented in the graph. • In order to perform the conversion we must execute the conversion path found. • Many high-end graphics programs are found on the Windows platform • Those on other platforms, such as Linux, tend to have Windows ports • Some are command line driven (usually small converter applications) • Many have only GUI interfaces • AutoHotkey: a scripting language for the Windows GUI.
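Finding a conversion path in an input/output graph can be sketched as a breadth-first search over formats, where each edge records which application performs that step. This is an illustration under assumptions, not Polyglot's code: `IoGraphSketch` and its methods are hypothetical, all conversions are treated as equally costly, and the example edges are the four steps read from the X3D to STEP to X3D slide.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of an I/O graph: formats are nodes, conversions are directed edges.
final class IoGraphSketch {

    static final class Edge {
        final String application;
        final String from;
        final String to;

        Edge(String application, String from, String to) {
            this.application = application;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return from + " -> " + to + " via " + application;
        }
    }

    private final Map<String, List<Edge>> outgoing = new HashMap<>();

    void addConversion(String application, String from, String to) {
        outgoing.computeIfAbsent(from, k -> new ArrayList<>()).add(new Edge(application, from, to));
    }

    /** Breadth-first search: returns the path with the fewest conversions, if any exists. */
    Optional<List<Edge>> shortestPath(String source, String target) {
        Map<String, Edge> reachedBy = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(source);
        while (!queue.isEmpty()) {
            String format = queue.remove();
            if (format.equals(target)) {
                break;
            }
            for (Edge e : outgoing.getOrDefault(format, List.of())) {
                if (!e.to.equals(source) && !reachedBy.containsKey(e.to)) {
                    reachedBy.put(e.to, e);
                    queue.add(e.to);
                }
            }
        }
        // Walk back from the target to the source along the recorded edges.
        List<Edge> path = new ArrayList<>();
        String at = target;
        while (!at.equals(source)) {
            Edge e = reachedBy.get(at);
            if (e == null) {
                return Optional.empty();
            }
            path.add(0, e);
            at = e.from;
        }
        return Optional.of(path);
    }

    public static void main(String[] args) {
        IoGraphSketch graph = new IoGraphSketch();
        graph.addConversion("X3dToVrml97", "x3d", "wrl");
        graph.addConversion("A3D Reviewer", "wrl", "stp");
        graph.addConversion("A3D Reviewer", "stp", "wrl");
        graph.addConversion("Vrml97ToX3d", "wrl", "x3d");
        System.out.println(graph.shortestPath("x3d", "stp"));
    }
}
```

If each edge carried an information-retention value, the same graph could be searched with a weighted algorithm instead of counting hops.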
  • 53. Methodology [Diagram labels: Extensibility; Automation; Cloud Computing; Computational Scalability; Services to Archivists]
  • 54. NCSA Polyglot – Conversion Services • Web interface: user can drag and drop files into upload area for conversion • Java interface: PolyglotRequest pgr; pgr = new PolyglotRequest("http://???", "obj"); pgr.convertFile("file.wrl", "./"); • Scalability test (number of PCs: processing time): • One PC: 33 minutes 6 seconds • Two PCs: 16 minutes 40 seconds
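Reading the scalability figures as 33 minutes 6 seconds on one PC and 16 minutes 40 seconds on two, the speedup and parallel efficiency work out as follows (the symbols T_1, T_2, S and E are introduced here for illustration):

```latex
\begin{align*}
T_1 &= 33 \cdot 60 + 6 = 1986\ \text{s}, &
T_2 &= 16 \cdot 60 + 40 = 1000\ \text{s}, \\
S &= \frac{T_1}{T_2} = \frac{1986}{1000} \approx 1.99, &
E &= \frac{S}{2} \approx 0.99 .
\end{align*}
```

On this two-machine test the conversion workload therefore scales almost linearly.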
  • 55. NCSA Polyglot – Data Loss Measurement Services We would like to assign a value to each conversion edge …
  • 56. Geometry Based Content Retention • Several metrics • Data driven assignment • Example results: • Light Fields metric: single optimal conversion with Adobe 3D Reviewer from .pdf to .stp, information retention 61.67; ‘best’ file format .stp, information retention 40.73 • Spin Images metric: single optimal conversion with Adobe 3D Reviewer from .obj to .pdf, information retention 59.07; ‘best’ file format .stl, information retention 34.89
  • 57. Summary • Technologies for appraisal of electronic records should assist archivists • They are designed to support decisions and data explorations by automating appraisal tasks • The software for doc2learn and Polyglot is available for downloading at http://isda.ncsa.uiuc.edu/download/ • File2learn software – the work is still in progress • Feedback is very welcome • Questions: Peter Bajcsy – pbajcsy@ncsa.uiuc.edu
  • 58. Demo exercise • Step 1: Check that a path exists between wrl and pdf • Step 2: Drag and drop heart.wrl; select pdf as the target; click upload • Step 3: Download to desktop and open in Adobe PDF Viewer