Jakob Voß
Revealing digital documents
 Concealed structures in data
     http://arxiv.org/abs/1105.5832
          http://aboutdata.org


          International Conference on Theory
          and Practice in Digital Libraries (TPDL)
          Doctoral Consortium, Berlin 2011-09-25
question




           how are (digital) documents
            structured and described?



Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th     http://aboutdata.org
what is a document?

          “[...] any physical or symbolic sign, preserved
                or recorded, intended to represent, to
            reconstruct, or to demonstrate a physical or
             conceptual phenomenon” – Suzanne Briet

       “[...] consists of anything that someone wishes
       to store. A document is something designated
      by a person to be a document [...]” – Ted Nelson



scope




                digital documents
            somehow recorded (stable),
           ultimately as a sequence of bits



CR2, AAF, AAT, ADL, AES Core Audio, AES Process History, AGLS, Alleg
SCII, ASN.1, Atom, BIBO, BibTeX, BISAC, BPEL, BPMN, BSON, CanCor
 CO, CDR, CDWA, CDWA Lite, CIDOC/CRM, CQL, CSDGM, CSV, DACS
ata Committee Content Standard, DC, DCAM, DDC, DDI, DDL, DFDL, DI
 G35, DjVU, DOM, DTD, Dublin Core, DwC, EAC, EAC-CPF, EAD, ebXM
    ECN, Ediakt, EDIFAKT, eduPerson, EML, ERM, Etch, EXIF, Federal
eographic, FOAF, FRAD, FRBR, FRSAD, FRSAR, GEM, GILS, GKD, GM
ssian, HTML, HTTP, ID3, IDL, IEEE/LOM, indecs, inetOrgPerson, INI, IPT
I, ISAAR(CPF), ISAD(G), ISBD, ISBN, ISO 19115, ISO 19119, JSON, KM
               there is not one single document format
LCC, LCSH, LDAP, Linked Data, LMER, MAB2, MADS, MARC, MARC21
 RC Relator Codes, MARCXML, MathML, MEI, MESH, METS, METS Rig
MFC, MGraph, MIX, MO, MODS, MOTS, MPEG-21 , MPEG-7, MSchema
seumDat, MusicXML, MXF, NewsML, NFC, NFD, NFKC, NFKD, NIAM, O
OAI-ORE, OAI-PMH, OAIS, ODRL, ONIX, Ontology for Media, OODBMS
OpenDocument, OpenSearch, OpenURL, ORM, OWL, PB Core, PDF, PI
ca+, Pica3, PND, PREMIS, PRISM, Proto, QDC, RAD, RAK, RDA, RDBM
DF, RDFS, RDF/XML, Relax NG, RELAX NG, Resource, RIS, RSS, RSW
 Schematron, SCORM, SDXF, Seel, S-EXP, SGML, SIOC, SKOS, SMIL,
PECTRUM, SQL, SRU/SRW, SWAP, SWB, TEI, TEX, TextMD, TGM I, TG
 TGN, Thrift, Topic Maps, UCS, ULAN, UML, unAPI, UNIMARC, URI, UTF
 ard, Vorbis Comment, VRA, VSO Data Model, XDR, XMetaDiss, XML, XM
thesis



       but there are common patterns
          on all levels of description,
               independent of
            particular technologies


examples of particular technologies
     XML                                                        relational databases
      ●   Unicode                                                ●   Relational Model
      ●   XML Infoset                                            ●   SQL
      ●   XML Schema                                             ●   Entity-Relationship
      ●   XPath                                                      Diagrams



                      families of related standards


method


                   not statistical
           this would limit my research to
             one level and technology of
                     description


method




              phenomenological
      data description in all of its forms
       as it appears in our experience



phenomenological method

      data description analyzed as phenomena:
      1. critical intuiting (experience)
      2. analyzing structures, free of known categories
      3. describing the essence

      [Portraits: Hegel, Husserl, Merleau-Ponty*]

  * Image CC-BY Pierre-Alain Gouanvic

results
      1) Categorization
         of data structuring methods
      2) Collection
         of data structuring paradigms
      3) Pattern language
         of data patterns




result 1: categorization of methods
      ●   encodings express data
          (UTF-8 Unicode, IEEE floating point, Base64…)
      ●   file and database systems store data
      ●   identifiers and query languages refer to data
      ●   data structuring and markup languages
          structure data
      ●   schema languages constrain and validate data
      ●   conceptual models describe data

    Concrete methods appear as combinations of categories!
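
The first category, encodings that express data, can be illustrated with a minimal Python sketch. It only uses the three encodings named on the slide (UTF-8, IEEE floating point, Base64); the sample values and variable names are assumptions chosen for illustration.

```python
import base64
import struct

# Sample values are illustrative assumptions, not taken from the slides.
text = "Voß"
year = 1979.0

# UTF-8 expresses a Unicode string as a sequence of bytes.
utf8_bytes = text.encode("utf-8")

# IEEE 754 double precision (big-endian) expresses a number as 8 bytes.
float_bytes = struct.pack(">d", year)

# Base64 expresses arbitrary bytes as ASCII characters.
base64_text = base64.b64encode(utf8_bytes).decode("ascii")

# Each encoding can be reversed to recover the original data.
decoded_text = base64.b64decode(base64_text).decode("utf-8")
decoded_year = struct.unpack(">d", float_bytes)[0]

print(utf8_bytes, float_bytes.hex(), base64_text)
```

Note that the Base64 string is itself stored as characters in some character encoding, which is one way concrete methods end up as combinations of categories.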

result 2: paradigms
      ●   Document- or Object-oriented approach
            ●   Document-oriented (e.g. ordered tree with
                tagged character strings: XML, RELAX NG…)
                ⇒ descriptive data description
            ●   Object-oriented (objects with properties and
                defined value spaces: XML Schema, UML…)
                ⇒ prescriptive data description
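
The contrast between the two approaches might be sketched as follows in Python. The `person` record, its field names, and the `Person` class are hypothetical; the sketch only shows one plausible reading of "descriptive" versus "prescriptive", not a definition from the talk.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Document-oriented: an ordered tree of tagged character strings.
# The markup describes what is there; every value is just text.
doc = ET.fromstring("<person><name>Jakob</name><born>1979</born></person>")
tags = [child.tag for child in doc]
values = [child.text for child in doc]


# Object-oriented: an object with properties and defined value spaces.
# The class prescribes which properties exist and which values are allowed.
@dataclass
class Person:
    name: str
    born: int

    def __post_init__(self):
        if not isinstance(self.born, int):
            raise TypeError("born must be an integer year")


# Moving from the document to the object requires an explicit conversion.
person = Person(name=doc.findtext("name"), born=int(doc.findtext("born")))
print(tags, values, person)
```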




result 2: paradigms
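      ●   Entities and connections

          [Diagram: three ways to connect "Jakob" and "1979"]

               Jakob ------------------- 1979
               Jakob ------ born ------- 1979
               Jakob ----- Birth ------- 1979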
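
One reading of this diagram is that the same fact can be modelled with a plain connection, a labelled connection, or a connection that is an entity in its own right. The Python sketch below follows that reading; the data shapes and the `endpoints` helper are assumptions for illustration only.

```python
# 1. Plain connection: two entities that simply belong together.
plain = ("Jakob", "1979")

# 2. Labelled connection: the link itself carries a name.
labelled = ("Jakob", "born", "1979")

# 3. Connection as an entity: the birth is a thing with its own properties.
birth = {"type": "Birth", "person": "Jakob", "year": "1979"}


def endpoints(connection):
    """Return the two entities that a connection links (hypothetical helper)."""
    if isinstance(connection, dict):
        return connection["person"], connection["year"]
    return connection[0], connection[-1]


# All three variants link the same pair of entities.
for variant in (plain, labelled, birth):
    print(endpoints(variant))
```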


result 2: paradigms
      ●   Layers of abstraction
      ●   Standards and rules
      ●   Collections and types
      ●   Granularity




result 3: patterns
      ●   Patterns as a systematic tool for describing good design
          practice, introduced by Christopher Alexander:
          “Each pattern describes a problem which occurs over and
            over again in our environment, and then describes the
                   core of the solution to that problem […]”
      ●   Adopted as design patterns in software engineering
      ●   Collected in a pattern language with meaningful
          connections between patterns (network of patterns).




result 3: patterns
      [Diagram: network of connected patterns: collection, separator,
       known size, sequence, position, ordered set, array]



applications
      ●   data archeology
            ●   In 200 years someone finds snapshots and
                archives of Wikipedia in different forms
                (SQL, XML, Wikitext, DBpedia, HTML…)
            ●   What are the significant parts?
                How do the parts relate to each other?




… another document




                               to give a simple example…




… another document

                                   sequence with delimiter



                     grouping of sequences with delimiter



                                   encoding (morse code)
                         D A T A   P A T T E R N S
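
The three patterns named on this slide can be traced in a short Python sketch that decodes the message. The textual form of the signal is an assumption: a single space is taken as the delimiter between letters and " / " as the delimiter between words, and the table covers only the letters needed here.

```python
# Partial Morse table: only the letters that occur in the message.
MORSE = {
    "-..": "D", ".-": "A", "-": "T", ".--.": "P",
    ".": "E", ".-.": "R", "-.": "N", "...": "S",
}


def decode(signal, letter_sep=" ", word_sep=" / "):
    """Decode a Morse signal written with explicit delimiters."""
    words = []
    # Grouping of sequences with delimiter: words are separated by word_sep.
    for word in signal.split(word_sep):
        # Sequence with delimiter: letters are separated by letter_sep.
        symbols = word.split(letter_sep)
        # Encoding: each dot-dash symbol stands for one letter.
        words.append("".join(MORSE[symbol] for symbol in symbols))
    return " ".join(words)


signal = "-.. .- - .- / .--. .- - - . .-. -. ..."
print(decode(signal))
```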

Revealing digital documents - concealed structures in data

  • 1. Jakob Voß Revealing digital documents Concealed structures in data http://arxiv.org/abs/1105.5832 http://aboutdata.org International Conference on Theory and Practice in Digital Libraries (TPDL) Doctoral Consortium, Berlin 2011-09-25
  • 2. question how are (digital) documents structured and described? Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org
  • 3. what is a document? “[...] any physical or symbolic sign, preserved or recorded, intended to represent, to reconstruct, or to demonstrate a physical or conceptual phenomenon” – Suzanne Briet “[...] consists of anything that someone wishes to store. A document is something designated by a person to be a document [...]“ – Ted Nelson Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org
  • 4. scope digital documents somehow recorded (stable), eventually as sequence of bits Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org
  • 5. CR2, AAF, AAT, ADL, AES Core Audio, AES Process History, AGLS, Alleg SCII, ASN.1, Atom, BIBO, BibTeX, BISAC, BPEL, BPMN, BSON, CanCor CO, CDR, CDWA, CDWA Lite, CIDOC/CRM, CQL, CSDGM, CSV, DACS ata Committee Content Standard, DC, DCAM, DDC, DDI, DDL, DFDL, DI G35, DjVU, DOM, DTD, Dublin Core, DwC, EAC, EAC-CPF, EAD, ebXM ECN, Ediakt, EDIFAKT, eduPerson, EML, ERM, Etch, EXIF, Federal eographic, FOAF, FRAD, FRBR, FRSAD, FRSAR, GEM, GILS, GKD, GM ssian, HTML, HTTP, ID3, IDL, IEEE/LOM, indecs, inetOrgPerson, INI, IPT I, ISAAR(CPF), ISAD(G), ISBD, ISBN, ISO 19115, ISO 19119, JSON, KM there is not one LCC, LCSH, LDAP, Linked Data, LMER, MAB2, MADS, MARC, MARC21 RC Relator Codes, MARCXML, MathML, MEI, MESH, METS, METS Rig single document format MFC, MGraph, MIX, MO, MODS, MOTS, MPEG-21 , MPEG-7, MSchema seumDat, MusicXML, MXF, NewsML, NFC, NFD, NFKC, NFKD, NIAM, O OAI-ORE, OAI-PMH, OAIS, ODRL, ONIX, Ontology for Media, OODBMS OpenDocument, OpenSearch, OpenURL, ORM, OWL, PB Core, PDF, PI ca+, Pica3, PND, PREMIS, PRISM, Proto, QDC, RAD, RAK, RDA, RDBM DF, RDFS, RDF/XML, Relax NG, RELAX NG, Resource, RIS, RSS, RSW Schematron, SCORM, SDXF, Seel, S-EXP, SGML, SIOC, SKOS, SMIL, PECTRUM, SQL, SRU/SRW, SWAP, SWB, TEI, TEX, TextMD, TGM I, TG TGN, Thrift, Topic Maps, UCS, ULAN, UML, unAPI, UNIMARC, URI, UTF ard, Vorbis Comment, VRA, VSO Data Model, XDR, XMetaDiss, XML, XM
  • 6. thesis but there are common patterns on all levels of description, independent from particular technologies Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org
  • 7. examples of particular technologies (families of related standards): XML ● Unicode ● XML Infoset ● XML Schema ● XPath; relational databases ● Relational Model ● SQL ● Entity-Relationship Diagrams
  • 8. method: not statistical; this would limit my research to one level and technology of description
  • 9. method: phenomenological; data description in all of its forms as it appears in our experience
  • 10. phenomenological method: data description analyzed as phenomena: 1. critical intuiting (experience) 2. analyzing structures, free of known categories 3. describing the essence [pictured: Hegel, Husserl, Merleau-Ponty*] * Image CC-BY Pierre-Alain Gouanvic
  • 11. results: 1) Categorization of data structuring methods 2) Collection of data structuring paradigms 3) Pattern language of data patterns
  • 12. result 1: categorization of methods ● encodings express data (UTF-8 Unicode, IEEE floating point, Base64…) ● file and database systems store data ● identifiers and query languages refer to data ● data structuring and markup languages structure data ● schema languages constrain and validate data ● conceptual models describe data. Concrete methods appear as combinations of categories!
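The claim that concrete methods combine several categories can be illustrated with a small sketch. It is only an illustration under stated assumptions: the record, the `is_valid` check (a hand-written stand-in for a schema language), and the choice of JSON as the structuring language are hypothetical and do not come from the slides. It touches three of the six categories: structuring, encoding, and constraining.

```python
import base64
import json

# Hypothetical record, used only for illustration.
record = {"name": "Jakob Voß", "born": 1979}

# Structuring: a data structuring language (here JSON) arranges the values.
structured = json.dumps(record, ensure_ascii=False, sort_keys=True)

# Encoding: UTF-8 expresses the characters as bytes, and Base64
# expresses those bytes as ASCII characters again.
utf8_bytes = structured.encode("utf-8")
base64_text = base64.b64encode(utf8_bytes).decode("ascii")


def is_valid(rec):
    """Constraining: a hand-written stand-in for a schema language."""
    return isinstance(rec.get("name"), str) and isinstance(rec.get("born"), int)


# Undoing each layer in reverse order recovers the original record.
decoded = json.loads(base64.b64decode(base64_text).decode("utf-8"))
```

Each layer can be swapped independently (for example UTF-16 instead of UTF-8), which is one way to read "combinations of categories".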
  • 13. result 2: paradigms ● Document- or Object-oriented approach ● Document-oriented (e.g. ordered tree with tagged character strings: XML, Relax NG…) ⇒ descriptive data description ● Object-oriented (objects with properties and defined value spaces: XML Schema, UML…) ⇒ prescriptive data description
  • 14. result 2: paradigms ● Entities and connections [diagram, three variants: “Jakob” and “1979” connected directly; “Jakob” and “1979” joined by a connection labelled “born”; “Jakob” and “1979” joined through an entity “Birth”]
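The diagram on this slide survives only as its labels, so the following is one possible reading rather than the author's model: the same fact can be recorded as a plain link, as a labelled link, or with the connection turned into an entity of its own. The tuple layouts and the `reify` helper are hypothetical.

```python
# Three hypothetical ways to record that Jakob was born in 1979.

# 1. Unlabelled connection: two entities are linked, nothing says how.
plain_link = ("Jakob", "1979")

# 2. Labelled connection: the link itself carries a name.
labelled_link = ("Jakob", "born", "1979")


def reify(subject, obj, node):
    """3. Connection as entity: `node` becomes an entity linked to both ends."""
    return [(node, subject), (node, obj)]


birth = reify("Jakob", "1979", node="Birth")
```

In the third form further statements (a place, a source) could be attached to "Birth", which the first two forms do not allow without changing their shape.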
  • 15. result 2: paradigms ● Layers of abstraction ● Standards and rules ● Collections and types ● Granularity
  • 16. result 3: patterns ● patterns as systematic tool for describing good design practice, introduced by Christopher Alexander: “Each pattern describes a problem which occurs over and over again in our environment, and then describes the core of the solution to that problem […]” ● Adopted as design patterns in software engineering ● Collected in a pattern language with meaningful connections between patterns (network of patterns).
  • 17. result 3: patterns [diagram: network of connected patterns: collection, separator, known size, sequence, position, ordered set, array]
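The connections between the patterns are not recoverable from the transcript. As a sketch under that caveat, two of the named patterns, "separator" and "known size", can be read as alternative ways to mark where the elements of a sequence end. Both function names and the one-byte length prefix are assumptions made for this illustration.

```python
def split_by_separator(data: bytes, separator: bytes) -> list:
    """Separator pattern: a reserved value marks the boundary between elements."""
    return data.split(separator)


def split_by_known_size(data: bytes) -> list:
    """Known-size pattern: each element is preceded by one byte giving its length."""
    elements = []
    position = 0
    while position < len(data):
        size = data[position]
        elements.append(data[position + 1 : position + 1 + size])
        position += 1 + size
    return elements
```

For example, `split_by_separator(b"data,patterns", b",")` and `split_by_known_size(b"\x04data\x08patterns")` describe the same two-element sequence. The separator variant cannot hold elements that contain the separator, while the known-size variant limits element length to what the size field can express.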
  • 18. applications ● data archeology ● In 200 years someone finds snapshots and archives of Wikipedia in different forms (SQL, XML, Wikitext, DBpedia, HTML…) ● What are the significant parts? How do the parts relate to each other?
  • 20. … another document, to give a simple example…
  • 21. … another document: sequence with delimiter
  • 22. … another document: sequence with delimiter; grouping of sequences with delimiter
  • 23. … another document: sequence with delimiter; grouping of sequences with delimiter; encoding (Morse code): DATA PATTERNS
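The three observations on the last slides (a sequence with delimiter, a grouping of sequences with delimiter, and an encoding) can be sketched as a small decoder. The signal on the slide itself is not part of the transcript, so the delimiters chosen here (a space between letters, " / " between words) and the `decode` function are assumptions. The Morse table covers only the letters needed for the decoded text.

```python
# Morse codes for the letters that occur in "DATA PATTERNS".
MORSE = {
    "-..": "D", ".-": "A", "-": "T", ".--.": "P",
    ".": "E", ".-.": "R", "-.": "N", "...": "S",
}


def decode(signal, letter_gap=" ", word_gap=" / "):
    """Split into words, then letters, then map each code to a character."""
    words = []
    for group in signal.split(word_gap):      # grouping of sequences
        codes = group.split(letter_gap)       # sequence with delimiter
        words.append("".join(MORSE[code] for code in codes))  # encoding
    return " ".join(words)


signal = "-.. .- - .- / .--. .- - - . .-. -. ..."
message = decode(signal)  # "DATA PATTERNS"
```

The decoder only works because the reader already knows the three structures involved, which is the point the slides make about revealing concealed structure.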