Transforming PDF into HTML
Matt Kuznicki, CTO
Agenda
1. Challenges of PDF conversion
2. Making convertible PDF from the start
3. About all those other PDFs out there…
4. Features of Datalogics PDF Alchemist
5. Summary and concluding thoughts
A bit about me
• CTO at Datalogics
• Worked with PDF for over 15 years
• Board member of PDF Association
• Active participant in the PDF standards community
Challenges of PDF conversion
• PDF was designed to convey an exact visual representation of information to humans
• PDF’s origins did not account for storing and retrieving machine-understandable information
• PDF is page and position based, lacks the notion of text flow and grouping*
• Many different PDFs in the wild – some easy to interpret, some very complex
PDF designed to convey exact visual representation
• Reliable visual representation, but many potential ways to make something that looks a certain way
• The capability to tie semantic information to content was added to PDF later on
• Its use is increasing, but it is still far from covering the majority of content being produced
• Most PDF generators still prefer smaller files to PDF files that are easier to repurpose
PDF designed for human consumption
At the time PDF was conceived as a PostScript replacement, reliable
rendering for human readers was an important issue…
• Focus was on retrieving the information needed to display and print pages for people’s use
• Affordances for machine “reading” were bolt-ons to the format
• Community has made great strides in allowing for machine interpretation, but proper use requires expertise in the domain
• Structure and semantics are optional – usage is still rare
• This is NOT a PDF-specific issue
PDF is page and position based
• Like a TIFF or raster image, marks on a PDF page are precisely positioned and usually come in small discrete pieces
• Humans automatically see a page flow that is not always present in the PDF syntax
• Contents of a PDF page can be specified in an order very different from how we read
• Words, images, and other elements on a page may have the marks that constitute them spread far throughout the page marking stream
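To make the ordering problem concrete, here is a minimal sketch of recovering reading order from positioned text. It is only an illustration under stated assumptions: the `Fragment` class, the `reading_order` function and the `line_tolerance` value are hypothetical, and the approach assumes a simple single-column page. It is not how any particular converter works.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of placed text (hypothetical model of one text-showing operation)."""
    text: str
    x: float  # left edge in points
    y: float  # baseline in points; PDF's origin is bottom-left, so larger y is higher up


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by baseline, then read top to bottom, left to right."""
    lines = []
    for frag in sorted(fragments, key=lambda f: -f.y):
        # Fragments whose baselines nearly match the line's first fragment share a line.
        if lines and abs(lines[-1][0].y - frag.y) <= line_tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    return [" ".join(f.text for f in sorted(line, key=lambda f: f.x)) for line in lines]


# Fragments in the order they might appear in the page marking stream.
stream_order = [
    Fragment("world", 130.0, 700.0),
    Fragment("Second", 72.0, 686.0),
    Fragment("Hello", 72.0, 700.5),
    Fragment("line", 120.0, 686.0),
]

print(reading_order(stream_order))  # ['Hello world', 'Second line']
```

Multi-column layouts, rotated text and tables break this simple sort, which is why real conversion needs more than position sorting.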
Avoid all this trouble at the start – if you can!
• Creating Tagged PDF means you embed the information for repurposing and reflow directly into the PDF when it’s created – at the right time!
• Easy to convert Tagged PDF into other formats
• But, not all Tagged PDF is the same, and not all generators emit useful Tagged PDF!
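As a rough illustration of why Tagged PDF converts easily, the sketch below walks a structure tree and emits HTML. The tag names (`H1`, `P`, `L`, `LI`, `Table` and so on) are standard PDF structure types, but everything else is an assumption made for the example: `StructElem`, `TAG_TO_HTML` and `to_html` are hypothetical names, the tree is built by hand rather than read from a file, and real list items carry more nested structure than shown here.

```python
from dataclasses import dataclass, field
from html import escape

# Rough mapping from standard PDF structure types to HTML elements.
# Mapping every list (L) to <ul> is a simplification; unknown tags fall back to <div>.
TAG_TO_HTML = {
    "H1": "h1", "H2": "h2", "P": "p",
    "L": "ul", "LI": "li",
    "Table": "table", "TR": "tr", "TH": "th", "TD": "td",
}


@dataclass
class StructElem:
    """Hypothetical in-memory stand-in for one node of a PDF structure tree."""
    tag: str
    text: str = ""
    kids: list = field(default_factory=list)


def to_html(elem):
    """Recursively render a structure element and its children as HTML."""
    name = TAG_TO_HTML.get(elem.tag, "div")
    inner = escape(elem.text) + "".join(to_html(kid) for kid in elem.kids)
    return f"<{name}>{inner}</{name}>"


doc = StructElem("Sect", kids=[
    StructElem("H1", "Title"),
    StructElem("P", "A & B"),
])

print(to_html(doc))  # <div><h1>Title</h1><p>A &amp; B</p></div>
```

Because the tags already record headings, paragraphs and lists, the converter has nothing to guess. That is the advantage of adding structure when the PDF is created.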
But how about all those other PDFs out there?
• Existing PDFs aren’t going to magically gain structure semantics
• Existing tools and workflows may not be upgradable in the near future – or at all
• Not all files converted to PDF contain enough information for structure semantics in the first place
Is OCR the only way to handle these? No!
• OCR is not always reliable in converting pictures of text back into actual text flows
• Rasterizing PDFs, then scanning them to turn them back into non-raster form, introduces multiple chances for errors and unexpected results
Conversion of PDF to HTML relies upon:
• Seeing pages the way a human reads them
• Figuring out the logical structure of the pages
• Putting text back together into text flows
• Putting all these elements out in the correct order
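The step of putting text back together into flows can be sketched as follows. This is an illustration only: the function names, the `gap_factor` value and the use of the smallest line gap as the "typical" leading are all assumptions, and production converters use far richer cues such as fonts, indentation and columns.

```python
from html import escape


def lines_to_paragraphs(lines, gap_factor=1.5):
    """Merge text lines into paragraphs.

    `lines` is a list of (baseline_y, text) pairs ordered top to bottom.
    A new paragraph starts when the vertical gap to the previous line is
    noticeably larger than the smallest gap on the page (assumed leading).
    """
    if not lines:
        return []
    gaps = [upper[0] - lower[0] for upper, lower in zip(lines, lines[1:])]
    typical = min(gaps) if gaps else 0.0
    paragraphs = [[lines[0][1]]]
    for gap, (_, text) in zip(gaps, lines[1:]):
        if gap > gap_factor * typical:
            paragraphs.append([text])
        else:
            paragraphs[-1].append(text)
    return [" ".join(parts) for parts in paragraphs]


def paragraphs_to_html(paragraphs):
    """Emit one reflowable <p> element per paragraph."""
    return "\n".join(f"<p>{escape(p)}</p>" for p in paragraphs)


page_lines = [
    (700.0, "PDF is page"),
    (688.0, "and position based."),
    (664.0, "Humans see flow."),
]

print(paragraphs_to_html(lines_to_paragraphs(page_lines)))
```

The resulting paragraphs no longer depend on where lines happened to break on the page, which is what makes the HTML reflowable.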
PDF Alchemist
Datalogics PDF to HTML conversion technology
What does PDF Alchemist offer?
• Works on untagged PDFs – handles existing PDFs, does not require workflow changes or regenerating/reconstructing source PDFs
• Turns placed words in PDFs back into text flows – reflowable text
• Re-creates tables and lists from page content
• Removes pagination artifacts such as page numbers and running headers
• Converts PDF into single-page HTML5 + CSS or into EPUB packages
• Converts PDF forms into fixed-layout HTML forms for use in mobile environments
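To give a feel for what removing pagination artifacts involves, here is one simple heuristic: treat a first or last line as a running header or page number if, after digits are masked, it repeats on most pages. This is a hypothetical sketch and not a description of how PDF Alchemist works; `strip_pagination` and `min_share` are names invented for the example.

```python
import re
from collections import Counter


def normalize(line):
    """Mask digits so 'Page 1' and 'Page 2' compare equal."""
    return re.sub(r"\d+", "#", line.strip())


def strip_pagination(pages, min_share=0.6):
    """Drop first/last lines that repeat on at least `min_share` of the pages.

    `pages` is a list of pages, each a list of text lines from top to bottom.
    """
    edge_counts = Counter()
    for page in pages:
        # Count each distinct edge line once per page.
        edge_counts.update({normalize(line) for line in page[:1] + page[-1:]})
    threshold = min_share * len(pages)
    repeated = {key for key, count in edge_counts.items() if count >= threshold}

    cleaned = []
    for page in pages:
        body = list(page)
        if body and normalize(body[0]) in repeated:
            body = body[1:]
        if body and normalize(body[-1]) in repeated:
            body = body[:-1]
        cleaned.append(body)
    return cleaned


pages = [
    ["Annual Report", "Intro text", "Page 1"],
    ["Annual Report", "More text", "Page 2"],
    ["Annual Report", "Final text", "Page 3"],
]

print(strip_pagination(pages))  # [['Intro text'], ['More text'], ['Final text']]
```

A heuristic like this can misfire on short documents or on body lines that genuinely repeat, so treat it as a starting point rather than a complete solution.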
Demonstrations
• Conversion of a PDF file into an HTML file
• Conversion of a PDF form into an HTML form
Using PDF Alchemist
• Available as a command line tool for server and workflow integration
• Or as a simple “C” API for integration into programs
Summary
• Most PDFs are and will continue to be made without regard to repurposing
• Reconstructing the content and flow of PDF relies upon advanced logic and mimicry of how humans read pages
• PDF Alchemist offers this logic in an easy-to-use software package
Any Questions?
Matt Kuznicki
Chief Technical Officer
mattk@datalogics.com
LinkedIn: mattkuznicki
Datalogics Inc.
www.datalogics.com
Twitter: @DatalogicsInc


Editor's notes

  1. Good afternoon! My name is Ching Yue, and I am the director of ebook mobile technologies at Datalogics. Today I’d like to introduce you to Datalogics, talk briefly about the solutions we provide, and describe where we think the market is expanding in the ebook industry. What I would like you to take away from this presentation is who we are. If you are thinking of ebooks, come talk to us. If you are already in the ebook business and are thinking of expanding, do look into the areas that we speak of. If you have questions, feel free to talk to me or to anyone at Datalogics; we will be very happy to help you succeed in your current or any new endeavors.
  3. Datalogics was founded in Chicago in 1967. For a software company, we have a pretty good history. We have evolved a lot, of course. In the earlier years, the software we developed ran on room-size mainframe machines, and now a good part of our business is working with a computer that fits in your pocket. One thing we have stayed true to, though, is being engineering focused. That means we not only develop and sell software, but also develop tools and solutions to support developers, who can take what we have and develop it further into a product of their own. This allows you to tailor the solution to your needs and to add your competitive advantage to your solutions. We are the primary channel for Adobe ebook technologies, including Adobe Reader Mobile SDK, Adobe Content Server, and more. We have a dedicated ebook team at Datalogics to promote, sell, and support these solutions. We work closely with many of our customers in different phases of their ebook initiatives and integration.
  5. The first thing that we look at is the market. The ebook business has gone through a few phases. Where we now see potential is in bringing digital content and digital learning into classrooms. If you are a library, you probably have some kind of ebook offering for your readers already. Your next phase is to think about how you can leverage the electronic platform to engage your patrons and become, yet again, their source of knowledge, this time virtually, without their physical presence in a library building. Geographic expansion: ebooks have a great advantage in breaking geographic barriers. Ebook delivery, without oversimplifying the business process, makes content delivery across the globe much more feasible. And lastly, the ebook platform can be very useful for people facing physical and learning challenges. Software can be a great way to assist them in enjoying reading and learning in ways that they wouldn’t be able to with a printed book.
  7. In conjunction with the market expansion, we see higher demand in hardware and software channels.