SlideShare a Scribd company logo
1 of 31
Download to read offline
Embedding Metadata and Other
      Semantics In Word-Processing
              Documents
                 Peter Sefton (University Southern Queensland)
                   Ian Barnes (Australian National University)
                  Ron Ward (University Southern Queensland)
             Jim Downing (University of Cambridge) (presenting)


                                         [breath]


The paper supporting this presentation provides important detail and can be obtained from
http://www.dspace.cam.ac.uk/handle/1810/206423
Agenda

Motivations

Axioms of choice

Interoperability is Hard

The approach

Examples (+ chemistry)

                           http://www.flickr.com/photos/forezt/524108228
Why is this interesting?


We want to move towards semantically-rich
documents for e-Research. In some disciplines 100%
of documents start life in a word processor.

Introduction of real world constraints yields
interesting result
Semantically Rich Documents


         Enable automation

         Prevent information loss

         Better discovery

         Improved presentation




Automation - zero click upload, not filling in redundant forms etc
Information loss - rich data reduced to tables, images.
Semantic information leads to richer alternatives for discovery and communication of
research.
Fully Supported Research - all the supporting data delivered with the text
Constraints
                    Work
                    in the
               real world,
                    today
                                                http://www.flickr.com/photos/amirjina/2281612876
Solution had to work in ICE - the Integrated Content Environment, a distributed authoring
system in production at USQ.

Therefore the approach is PRAGMATIC!
Real World

Metadata, semantics and data not easily distinguished

Document creation == Metadata creation

 Not separable activities

 Metadata is in the document

Documents have multiple, distributed authors
Tools and Formats

Microsoft Word [Adoption]

OpenOffice.org writer
[Access]

ICE - Integrated Content
Environment

                 .doc, .docx, OOXML, ODF

                 HTML, PDF
The Difference Between
       Standards and Interoperability




This is the test that semantic solutions must work inside to be useful in production - once
semantics are created, they must survive when the document is edited in the wild.
This is the simple subset of document interop we’re talking about, including only word and
OO Writer.

In the wild you can’t control what formats people use to save, or the software they use.

If any of these routes destroys semantics, then we’ve lost interoperability.

There are a lot of standards already involved in this space, but none of them on their own
deliver semantic data interoperability.
Interoperability in Publishing




PDF - scholarly publishing now
HTML - the medium term future of scholarly publishing.

Converter needed since HTML and PDF creation in OOo Writer and MS Word produce pretty
poor results.
http://www.flickr.com/photos/druclimb/289636172


When you apply these interoperability constraints, the solution space gets very small.
<metaphor>Like walking along a ridge, keep it simple and take small steps. The paths off to
the side lead quickly to peril.</metaphor>
Approaches Ruled Out
MS Word “Smart Tags”

 No interop with OOo, but not necessarily a bad idea

MS Word foreign namespace XML encoding

 Expensive, no interop with OOo, lock-in issues

ODF 1.2 embedded semantic

 No Word equivalent in sight

Things that would destroy WYSIWYG such as using wiki
markup in the word processor.
Define A New Encoding
                       Standard?




Codifying a standard wouldn’t work unless vanilla wp software can be shown not to destroy
the information.

For delivering interoperability in this area, standards are not sufficient.
Microformats!




      http://www.flickr.com/photos/onion/2046003604
Encoding Microformats

         Tables: for, like, tabulating things

         Styles: The original extensible inline semantic
         mechanism for word processing and still working!

         Links

         Frames: fragile

         Bookmarks and fields: require lots of field testing, not
         all that reliable in an interop situation


The paper contains much more detail about the mechanism.
Styles




The style approach is: -
 * Simple
 * Metadata schema agnostic
 * User extensible

It doesn’t /need/ any plugin / customized software to work.
Style: p-meta-author


         Style: p-meta-affiliation


d


d
              Style: p-meta-issued                             Style: p-meta-abstract




    Styles can be nested by placing inline styled text within styled paragraphs.
{ 'title':['Metadata in ICE documents'],
                      'author':[{'name':'Ian Barnes',
                                  'affiliation':'ANU'},
                                 {'name':'Peter Sefton',
                                  'affiliation':'USQ'}
                               ]
                    }




Tables are also useful since the layout implies semantics
Toolbars




The toolbars are implemented for Word and Writer. They provide easy access to the common
microformat encoding styles and structures. They also contain macros for communicating
with the ICE system, and uploading the document to the Institutional Repo / publisher system
etc.
http://www.flickr.com/photos/jima/460348206


To make it even easier, templates can be used that include sample text in the relevant places
- all the user has to do is replace the sample text.
Dublin Core metadata can be extracted directly from the document.
As can RDF metadata using the ORE vocabulary.
ICE-TheOREM

        Semantics in chemistry thesis documents

        Structural elements, Chapters, Appendices etc

        Data (molecules, spectral data etc)

        Chemical entities in text




http://wwmm.ch.cam.ac.uk/trac/theorem/
Chemistry

                                               Style: p-exptl-compound

                                                  Link to data.

                                                 Style: p-exptl-compnum



                                                     Style: p-compound-name




This text from a synthetic chemistry thesis.

Highlights the grey area between data and metadata - the compound name is data, but also
the subject of the document.
These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/
demos/index.htm
These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/
demos/index.htm
These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/
demos/index.htm
These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/
demos/index.htm
These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/
demos/index.htm
FIN
                 Thank you.




http://www.flickr.com/photos/jaysun/367670007
ICE - Integrated Content Environment http://ice.usq.edu.au/
       Demos at http://ice.usq.edu.au/presentations/demos/
                   ICE-TheOREM. Tag: jisctheorem
              https://wwmm.ch.cam.ac.uk/trac/theorem
http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/
                          theoremice.aspx
                           Peter Sefton
                       sefton@usq.edu.au
                       http://ptsefton.com/
                         Jim Downing
                      ojd20@cam.ac.uk
            http://wwmm.ch.cam.ac.uk/blogs/downing/

More Related Content

Similar to Embedding Metadata In Word Processing Documents

SLIDEGen: Approach to automatic Slides Generation
SLIDEGen: Approach to automatic Slides GenerationSLIDEGen: Approach to automatic Slides Generation
SLIDEGen: Approach to automatic Slides GenerationIRJET Journal
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsIRJET Journal
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Stuart Chalk
 
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...Jerry SILVER
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling TechniqueCarmen Sanborn
 
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...IEEEMEMTECHSTUDENTSPROJECTS
 
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...IEEEMEMTECHSTUDENTPROJECTS
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET Journal
 
PowerPoint
PowerPointPowerPoint
PowerPointVideoguy
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshopl_ernest
 
0001 introduction to database management system
0001 introduction to database management system0001 introduction to database management system
0001 introduction to database management systemJugdambay S
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsMarkus Neteler
 
Trekk cross media series using xml to create once - distribute everywhere - e...
Trekk cross media series using xml to create once - distribute everywhere - e...Trekk cross media series using xml to create once - distribute everywhere - e...
Trekk cross media series using xml to create once - distribute everywhere - e...Jeffrey Stewart
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldRob Gillen
 
A web standards & ud approach for access (bps public)
A web standards & ud approach for access (bps   public)A web standards & ud approach for access (bps   public)
A web standards & ud approach for access (bps public)Howard Kramer
 
How to Find a Needle in the Haystack
How to Find a Needle in the HaystackHow to Find a Needle in the Haystack
How to Find a Needle in the HaystackAdrian Stevenson
 

Similar to Embedding Metadata In Word Processing Documents (20)

SLIDEGen: Approach to automatic Slides Generation
SLIDEGen: Approach to automatic Slides GenerationSLIDEGen: Approach to automatic Slides Generation
SLIDEGen: Approach to automatic Slides Generation
 
Multikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive GraphsMultikeyword Hunt on Progressive Graphs
Multikeyword Hunt on Progressive Graphs
 
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
Rule-based Capture/Storage of Scientific Data from PDF Files and Export using...
 
Bibliographic metadata (including citation)
Bibliographic metadata (including citation)Bibliographic metadata (including citation)
Bibliographic metadata (including citation)
 
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
Building a Scalable XML-based Dynamic Delivery Architecture: Standards and Be...
 
Document Based Data Modeling Technique
Document Based Data Modeling TechniqueDocument Based Data Modeling Technique
Document Based Data Modeling Technique
 
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
2014 IEEE DOTNET DATA MINING PROJECT A novel model for mining association rul...
 
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
IEEE 2014 DOTNET DATA MINING PROJECTS A novel model for mining association ru...
 
IRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction FrameworkIRJET- Resume Information Extraction Framework
IRJET- Resume Information Extraction Framework
 
Sword Bl 0903[1]
Sword Bl 0903[1]Sword Bl 0903[1]
Sword Bl 0903[1]
 
PowerPoint
PowerPointPowerPoint
PowerPoint
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
 
0001 introduction to database management system
0001 introduction to database management system0001 introduction to database management system
0001 introduction to database management system
 
Metadata Cloud
Metadata CloudMetadata Cloud
Metadata Cloud
 
The need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formatsThe need of Interoperability in Office and GIS formats
The need of Interoperability in Office and GIS formats
 
Trekk cross media series using xml to create once - distribute everywhere - e...
Trekk cross media series using xml to create once - distribute everywhere - e...Trekk cross media series using xml to create once - distribute everywhere - e...
Trekk cross media series using xml to create once - distribute everywhere - e...
 
Windows Azure: Lessons From The Field
Windows Azure: Lessons From The FieldWindows Azure: Lessons From The Field
Windows Azure: Lessons From The Field
 
A web standards & ud approach for access (bps public)
A web standards & ud approach for access (bps   public)A web standards & ud approach for access (bps   public)
A web standards & ud approach for access (bps public)
 
Sweo talk
Sweo talkSweo talk
Sweo talk
 
How to Find a Needle in the Haystack
How to Find a Needle in the HaystackHow to Find a Needle in the Haystack
How to Find a Needle in the Haystack
 

More from Jim Downing

The Metaverse in Fashion
The Metaverse in FashionThe Metaverse in Fashion
The Metaverse in FashionJim Downing
 
Metail and eTryOn for De Montfort Uni Fashion
Metail and eTryOn for De Montfort Uni FashionMetail and eTryOn for De Montfort Uni Fashion
Metail and eTryOn for De Montfort Uni FashionJim Downing
 
Creative Cambridge Metail presentation
Creative Cambridge Metail presentationCreative Cambridge Metail presentation
Creative Cambridge Metail presentationJim Downing
 
XR in fashion & the eTryOn project
XR in fashion  & the eTryOn projectXR in fashion  & the eTryOn project
XR in fashion & the eTryOn projectJim Downing
 
Towards Lensfield
Towards LensfieldTowards Lensfield
Towards LensfieldJim Downing
 
Web Feeds and Repositories
Web Feeds and RepositoriesWeb Feeds and Repositories
Web Feeds and RepositoriesJim Downing
 

More from Jim Downing (6)

The Metaverse in Fashion
The Metaverse in FashionThe Metaverse in Fashion
The Metaverse in Fashion
 
Metail and eTryOn for De Montfort Uni Fashion
Metail and eTryOn for De Montfort Uni FashionMetail and eTryOn for De Montfort Uni Fashion
Metail and eTryOn for De Montfort Uni Fashion
 
Creative Cambridge Metail presentation
Creative Cambridge Metail presentationCreative Cambridge Metail presentation
Creative Cambridge Metail presentation
 
XR in fashion & the eTryOn project
XR in fashion  & the eTryOn projectXR in fashion  & the eTryOn project
XR in fashion & the eTryOn project
 
Towards Lensfield
Towards LensfieldTowards Lensfield
Towards Lensfield
 
Web Feeds and Repositories
Web Feeds and RepositoriesWeb Feeds and Repositories
Web Feeds and Repositories
 

Recently uploaded

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 

Recently uploaded (20)

How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 

Embedding Metadata In Word Processing Documents

  • 1. Embedding Metadata and Other Semantics In Word-Processing Documents Peter Sefton (University Southern Queensland) Ian Barnes (Australian National University) Ron Ward (University Southern Queensland) Jim Downing (University of Cambridge) (presenting) [breath] The paper supporting this presentation provides important detail and can be obtained from http://www.dspace.cam.ac.uk/handle/1810/206423
  • 2. Agenda Motivations Axioms of choice Interoperability is Hard The approach Examples (+ chemistry) http://www.flickr.com/photos/forezt/524108228
  • 3. Why is this interesting? We want to move towards semantically-rich documents for e-Research. In some disciplines 100% of documents start life in a word processor. Introduction of real world constraints yields interesting result
  • 4. Semantically Rich Documents Enable automation Prevent information loss Better discovery Improved presentation Automation - zero click upload, not filling in redundant forms etc Information loss - rich data reduced to tables, images. Semantic information leads to richer alternatives for discovery and communication of research. Fully Supported Research - all the supporting data delivered with the text
  • 5. Constraints Work in the real world, today http://www.flickr.com/photos/amirjina/2281612876 Solution had to work in ICE - the Integrated Content Environment, a distributed authoring system in production at USQ. Therefore the approach is PRAGMATIC!
  • 6. Real World Metadata, semantics and data not easily distinguished Document creation == Metadata creation Not separable activities Metadata is in the document Documents have multiple, distributed authors
  • 7. Tools and Formats Microsoft Word [Adoption] OpenOffice.org writer [Access] ICE - Integrated Content Environment .doc, .docx, OOXML, ODF HTML, PDF
  • 8. The Difference Between Standards and Interoperability This is the test that semantic solutions must work inside to be useful in production - once semantics are created, they must survive when the document is edited in the wild.
  • 9. This is the simple subset of document interop we’re talking about, including only word and OO Writer. In the wild you can’t control what formats people use to save, or the software they use. If any of these routes destroys semantics, then we’ve lost interoperability. There are a lot of standards already involved in this space, but none of them on their own deliver semantic data interoperability.
  • 10. Interoperability in Publishing PDF - scholarly publishing now HTML - the medium term future of scholarly publishing. Converter needed since HTML and PDF creation in OOo Writer and MS Word produce pretty poor results.
  • 11. http://www.flickr.com/photos/druclimb/289636172 When you apply these interoperability constraints, the solution space gets very small. <metaphor>Like walking along a ridge, keep it simple and take small steps. The paths off to the side lead quickly to peril.</metaphor>
  • 12. Approaches Ruled Out MS Word “Smart Tags” No interop with OOo, but not necessarily a bad idea MS Word foreign namespace XML encoding Expensive, no interop with OOo, lock-in issues ODF 1.2 embedded semantic No Word equivalent in sight Things that would destroy WYSIWYG such as using wiki markup in the word processor.
  • 13. Define A New Encoding Standard? Codifying a standard wouldn’t work unless vanilla wp software can be shown not to destroy the information. For delivering interoperability in this area, standards are not sufficient.
  • 14. Microformats! http://www.flickr.com/photos/onion/2046003604
  • 15. Encoding Microformats Tables: for, like, tabulating things Styles: The original extensible inline semantic mechanism for word processing and still working! Links Frames: fragile Bookmarks and fields: require lots of field testing, not all that reliable in an interop situation The paper contains much more detail about the mechanism.
  • 16. Styles The style approach is: - * Simple * Metadata schema agnostic * User extensible It doesn’t /need/ any plugin / customized software to work.
  • 17. Style: p-meta-author Style: p-meta-affiliation d d Style: p-meta-issued Style: p-meta-abstract Styles can be nested by placing inline styled text within styled paragraphs.
  • 18. { 'title':['Metadata in ICE documents'], 'author':[{'name':'Ian Barnes', 'affiliation':'ANU'}, {'name':'Peter Sefton', 'affiliation':'USQ'} ] } Tables are also useful since the layout implies semantics
  • 19. Toolbars The toolbars are implemented for Word and Writer. They provide easy access to the common microformat encoding styles and structures. They also contain macros for communicating with the ICE system, and uploading the document to the Institutional Repo / publisher system etc.
  • 20. http://www.flickr.com/photos/jima/460348206 To make it even easier, templates can be used that include sample text in the relevant places - all the user has to do is replace the sample text.
  • 21. Dublin Core metadata can be extracted directly from the document.
  • 22. As can RDF metadata using the ORE vocabulary.
  • 23. ICE-TheOREM Semantics in chemistry thesis documents Structural elements, Chapters, Appendices etc Data (molecules, spectral data etc) Chemical entities in text http://wwmm.ch.cam.ac.uk/trac/theorem/
  • 24. Chemistry Style: p-exptl-compound Link to data. Style: p-exptl-compnum Style: p-compound-name This text from a synthetic chemistry thesis. Highlights the grey area between data and metadata - the compound name is data, but also the subject of the document.
  • 25. These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/ demos/index.htm
  • 26. These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/ demos/index.htm
  • 27. These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/ demos/index.htm
  • 28. These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/ demos/index.htm
  • 29. These screenshots taken from CML in ICE demo at http://ice.usq.edu.au/presentations/ demos/index.htm
  • 30. FIN Thank you. http://www.flickr.com/photos/jaysun/367670007
  • 31. ICE - Integrated Content Environment http://ice.usq.edu.au/ Demos at http://ice.usq.edu.au/presentations/demos/ ICE-TheOREM. Tag: jisctheorem https://wwmm.ch.cam.ac.uk/trac/theorem http://www.jisc.ac.uk/whatwedo/programmes/digitalrepositories2007/ theoremice.aspx Peter Sefton sefton@usq.edu.au http://ptsefton.com/ Jim Downing ojd20@cam.ac.uk http://wwmm.ch.cam.ac.uk/blogs/downing/