www.karakun.com
Extracting information
from tables in documents
Holger Keibel
AI-SDV 2022, Vienna
© 2022 Karakun AG | 2
Karakun
Services
Software Engineering, UX Design, Consulting,
Training, Maintenance & Support
Platforms & Products
Efficiency-enhancing software platforms,
ready-made products for selected use cases,
e.g., HIBU Platform for search and LT solutions
Experienced & Established Team
60+ employees working in 4 locations
in CH (HQ), DE and IN
Competences / Skills
State-of-the-art tech stack (Java, web &
mobile), LT / AI / Big Data,
focus on open-source software
Sustainable Custom Solutions
Customers from various industries,
e.g., Insurance, Finance, Life Science,
Logistics
Community Engagement
Authors, speakers, lecturers at universities,
Java Champions, contributors to
open-source projects
HIBU Platform
Efficient development of custom solutions in the areas of
Artificial Intelligence: Rule-based, statistical, neural
Intelligent Search
Full-text search,
search filters,
convenience functions
Text Analysis
Classification,
information extraction,
sentiment analysis,
…
Document Automation
Content-driven,
input management,
smart actions
[Figure: sample document with running text (Lorem ipsum placeholder) and two horizontal label-value pairs, "Lorem ipsum: dolor sit" and "Lorem ipsum 1500"]
Information extraction methods are generally tuned to sequential textual data
• Running text
  • Example approach: modern language models (transformer-based)
  • Graphical information (coordinates) can largely be ignored
• Horizontal label-value pairs
  • Example approach: regular expressions
  • Graphical/layout information rarely needed
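A minimal sketch of the regular-expression approach for horizontal label-value pairs. The label names, the sample text and the pattern itself are illustrative assumptions, not taken from the talk.

```python
import re

# Hypothetical pattern: a known label, a colon, then the value up to the line end.
LABEL_VALUE = re.compile(
    r"^(?P<label>Order date|Reference)\s*:\s*(?P<value>.+?)\s*$",
    re.MULTILINE,
)

def extract_pairs(text: str) -> dict[str, str]:
    """Return {label: value} for every 'Label: value' line found in text."""
    return {m.group("label"): m.group("value") for m in LABEL_VALUE.finditer(text)}

# Made-up sample text; no coordinates are needed, only the character sequence.
sample = "Dear customer,\nOrder date: Oct 5, 2022\nReference: ABC-789/5\nThank you."
pairs = extract_pairs(sample)
```

Because label and value sit on the same line, the text order alone is enough here.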
Vertical label-value pairs
• Semi-sequential problem → more challenging
• Some graphical/layout information needed: same x-coordinate on subsequent lines
• Example approach:
  • regular expressions (to find labels)
  • plus mild use of graphical information (to extract the corresponding values)
  • plus possibly regular expressions again (to constrain values)
Date            Reference
Oct 10, 2022    12345-67

Order date      Your order
Oct 5, 2022     ABC-789/5
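The three-part approach for vertical pairs could look roughly like the sketch below. It assumes the layout analysis already delivers text chunks with a left x-coordinate and a line index; the `Token` type, the tolerance `x_tol` and the coordinates are hypothetical.

```python
import re
from dataclasses import dataclass

@dataclass
class Token:
    text: str   # text of a layout chunk (may contain spaces)
    x: float    # left x-coordinate of the chunk
    line: int   # index of the text line the chunk sits on

def value_below(tokens, label_pattern, value_pattern=r".+", x_tol=5.0):
    """Find the label by regex, then return the chunk on the next line that
    starts at (nearly) the same x-coordinate and matches value_pattern."""
    for label in tokens:
        if not re.fullmatch(label_pattern, label.text):
            continue
        for cand in tokens:
            if (cand.line == label.line + 1
                    and abs(cand.x - label.x) <= x_tol
                    and re.fullmatch(value_pattern, cand.text)):
                return cand.text
    return None

# Coordinates are made up for illustration.
tokens = [
    Token("Date", 50, 0), Token("Reference", 200, 0),
    Token("Oct 10, 2022", 50, 1), Token("12345-67", 201, 1),
]
reference = value_below(tokens, r"Reference", r"[\w/-]+")
date = value_below(tokens, r"Date")
```

The x-coordinate is the only layout information used; everything else stays regex-based.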
Information extraction from tables
Table extraction challenges
• How are row boundaries encoded? Lines, spacing, aligned content, …
• How are column boundaries encoded? Lines, spacing, aligned content, …
• Merged table cells
• …
• General lesson learned: very difficult to solve with a general-purpose table extraction solution
• Better: limit the solution to specific table types → use any known constraints in the algorithm
Use case 1
• Land certificates (in Germany)
• Only scanned documents
Detect tables
Step 1: Detect graphical elements (red) vs. free text elements (blue)
Indicators:
• pixel density
• border lines
Using LAREX (Reul et al., 2017)
Step 2: Straighten lines (bounding boxes)
This and the subsequent steps were performed with OpenCV (https://opencv.org/).
Step 3: For each detected table: Cut out table image
Analyze table structure
Step 4: Blur image to smooth and repair lines
Step 5: Invert colors
Step 6: Binarize image to increase width of lines
Step 6 (continued): Detect horizontal lines
Step 7: Extend horizontal lines by means of dilation
Step 8: The same for vertical lines
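The talk performed line detection and dilation with OpenCV. The sketch below is a dependency-free stand-in that only illustrates the idea on a tiny binary image (1 = line pixel after inversion and binarization): keep long horizontal runs, then extend them sideways. The function names and parameters are assumptions, not the author's code.

```python
def keep_long_runs(img, min_len):
    """Keep only pixels in horizontal runs of at least min_len set pixels
    (the effect of a morphological opening with a 1 x min_len kernel)."""
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        start = None
        for x, v in enumerate(row + [0]):  # sentinel 0 closes a run at the row end
            if v and start is None:
                start = x
            elif not v and start is not None:
                if x - start >= min_len:
                    out[y][start:x] = [1] * (x - start)
                start = None
    return out

def extend_runs(img, reach):
    """Horizontal dilation: spread every set pixel `reach` pixels left and right."""
    out = [[0] * len(row) for row in img]
    for y, row in enumerate(img):
        for x, v in enumerate(row):
            if v:
                lo, hi = max(0, x - reach), min(len(row), x + reach + 1)
                out[y][lo:hi] = [1] * (hi - lo)
    return out

def on_columns(op, img, n):
    """Apply a horizontal operation to columns by transposing twice."""
    cols = [list(c) for c in zip(*img)]
    return [list(r) for r in zip(*op(cols, n))]

img = [
    [0, 1, 1, 1, 1, 0, 1, 1, 1, 0],  # table line with a one-pixel gap
    [0, 1, 0, 1, 1, 0, 0, 1, 0, 0],  # short strokes, e.g. from text
]
h_lines = extend_runs(keep_long_runs(img, 3), 1)

tall = [[1, 0], [1, 1], [1, 0], [0, 0]]
v_lines = on_columns(keep_long_runs, tall, 3)
```

In this example the dilation closes the gap in the first row, while the short text strokes in the second row are discarded.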
Step 9: Combine vertical and horizontal lines into a grid and derive the coordinates of the cells.
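One way to derive cell coordinates from the two line masks, sketched under the assumption that every table line spans most of the image: count line pixels per row and per column, collapse thick lines to one position, and take the rectangles between neighbouring positions. The `coverage` threshold and the function names are hypothetical.

```python
def line_positions(profile, min_count):
    """Collapse runs of consecutive indices whose pixel count reaches
    min_count into single positions (the middle index of each run)."""
    positions, run = [], []
    for i, count in enumerate(profile):
        if count >= min_count:
            run.append(i)
        elif run:
            positions.append(run[len(run) // 2])
            run = []
    if run:
        positions.append(run[len(run) // 2])
    return positions

def grid_cells(h_mask, v_mask, coverage=0.8):
    """Return rows of cells, each cell as (x0, y0, x1, y1)."""
    height, width = len(h_mask), len(h_mask[0])
    ys = line_positions([sum(row) for row in h_mask], coverage * width)
    xs = line_positions([sum(col) for col in zip(*v_mask)], coverage * height)
    return [[(x0, y0, x1, y1) for x0, x1 in zip(xs, xs[1:])]
            for y0, y1 in zip(ys, ys[1:])]

# 5 x 7 toy masks: horizontal lines in rows 0, 2, 4; vertical lines in columns 0, 3, 6.
h_mask = [[1] * 7 if y in (0, 2, 4) else [0] * 7 for y in range(5)]
v_mask = [[1 if x in (0, 3, 6) else 0 for x in range(7)] for _ in range(5)]
cells = grid_cells(h_mask, v_mask)
```

This sketch yields the full regular grid; merged cells, which lack some separator lines, still have to be resolved afterwards.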
Step 10: Submit entire table to OCR engine
Step 11: Parse OCR result (hOCR) to assign words to table cells
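A sketch of the word-to-cell assignment. hOCR marks words as elements with class `ocrx_word` and stores their bounding box in the `title` attribute; assigning a word to the cell that contains its center point is an assumption made here, and the sample markup and cell coordinates are invented.

```python
import re
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, bbox) for every element with class 'ocrx_word'."""
    def __init__(self):
        super().__init__()
        self.words, self._bbox = [], None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            m = re.search(r"bbox (\d+) (\d+) (\d+) (\d+)", attrs.get("title") or "")
            self._bbox = tuple(map(int, m.groups())) if m else None

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

def assign_to_cells(words, cells):
    """Map each cell (x0, y0, x1, y1) to the words whose center lies inside it."""
    table = {cell: [] for cell in cells}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for cell in cells:
            if cell[0] <= cx < cell[2] and cell[1] <= cy < cell[3]:
                table[cell].append(text)
                break
    return table

hocr = (
    "<span class='ocrx_word' title='bbox 10 10 40 25; x_wconf 95'>234/15</span>"
    "<span class='ocrx_word' title='bbox 110 10 130 25; x_wconf 90'>94</span>"
)
parser = HocrWords()
parser.feed(hocr)
cells = [(0, 0, 100, 40), (100, 0, 200, 40)]
table = assign_to_cells(parser.words, cells)
```

Using the center point rather than the full box tolerates words that slightly overlap a cell border.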
Step 12: Resolve merged cells to derive structured representation
[Lfd.Nr. …] [Bish. …] [Bezeichnung …] [Bezeichnung …] [Bezeichnung …] [Größe] [Größe] [Größe]
[Lfd.Nr. …] [Bish. …] [a) Gemarkung …] [a) Gemarkung …] [c) Wirtschaftsart …] [ha] [a] [m²]
[Lfd.Nr. …] [Bish. …] [b) Karte] [Flurstück] [c) Wirtschaftsart …] [ha] [a] [m²]
[1] [2] [3] [3] [3] [4] [4] [4]
[1] [-] [123.45] [234/15] [Musterstraße 123\nGebäude- und Freifläche] [] [5] [94]
[2] [-] [123.45] [234/18] [Musterstraße 123a\nGebäude- und Freifläche] [] [3] [41]
[3] [-] [123.45] [137/8] [Musterstraße 58\nGebäude- und Freifläche] [] [11] [70]
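The structured representation above repeats the content of a merged cell in every grid slot it covers. A minimal sketch of that expansion, assuming the row and column spans of each cell are already known (for example from missing separator lines); the tuple layout is hypothetical.

```python
def expand_merged(cells, n_rows, n_cols):
    """cells: list of (row, col, rowspan, colspan, text).
    Returns an n_rows x n_cols matrix in which the text of a merged cell
    is repeated in every grid slot the cell covers."""
    grid = [[""] * n_cols for _ in range(n_rows)]
    for row, col, rowspan, colspan, text in cells:
        for r in range(row, row + rowspan):
            for c in range(col, col + colspan):
                grid[r][c] = text
    return grid

# Simplified header, loosely modeled on the land certificate example.
header = [
    (0, 0, 2, 1, "Lfd.Nr."),   # spans two header rows
    (0, 1, 1, 3, "Größe"),     # spans three columns
    (1, 1, 1, 1, "ha"), (1, 2, 1, 1, "a"), (1, 3, 1, 1, "m²"),
]
grid = expand_merged(header, n_rows=2, n_cols=4)
```

After expansion every data column can be read together with its full stack of header labels.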
Use case 2
• Order confirmations + invoices
• Digitally generated documents (PDFs)
• Focus: Known table layouts
Table extraction
{
  "freightCosts": 145.0,
  "orderDate": "2020-10-09",
  "orderId": "12345",
  "packagingCosts": 11.8,
  "positions": [
    { … },
    {
      "values": {
        "articleDescription": "Air/oil separator",
        "articleId": "341018-00",
        "articlePrice": 13.7,
        "deliveryDate": "2020-10-28",
        "positionPrice": 3425.0,
        "positionQuantity": 250.0,
        "positionReference": "107.246"
      }
    },
    { … }
  ],
  "positionsTotal": 7017.15
}
Considerations
• Broad range of fairly special table layouts
• Tables might spread across multiple pages
• But mostly with some commonalities:
  • Column boundaries indicated by aligned content
  • Column titles use fairly recurrent terms
• Strategy of four steps
  • Each can be rule-based or an ML component
Step 1: Detect table area
Approaches:
• Train on textual and graphical input
• Rule-based (layout-specific keywords)
Step 2: Detect columns
Approaches:
• Train on textual and graphical input
• Cluster left/right x-coordinates of tokens
• Configure exact x-coordinates
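The clustering approach can be sketched with a simple one-dimensional gap-based grouping of left token edges. The `gap` and `min_support` parameters and the sample coordinates are assumptions; a real system might use a different clustering method.

```python
def cluster_x(xs, gap=10.0, min_support=2):
    """Group sorted x-coordinates into clusters separated by more than `gap`
    and return the mean of every cluster with at least `min_support` members."""
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    return [sum(c) / len(c) for c in clusters if len(c) >= min_support]

# Left edges of tokens from several table lines (made-up values);
# the isolated 120 belongs to no column and is dropped.
left_edges = [50, 51, 49, 200, 202, 201, 400, 398, 120]
columns = cluster_x(left_edges)
```

The same function can be run on right edges to catch right-aligned columns such as prices.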
Step 3: Detect positions (logical rows)
Approaches:
• Train on textual and graphical input
• Rule-based (layout-specific)
Handle exceptions (e.g. values across cell boundaries)
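A layout-specific rule for grouping physical lines into positions might look like the sketch below. It assumes that each position starts with a position number in the first column and that lines without one are continuation lines; both the rule and the sample rows are hypothetical.

```python
import re

def group_positions(lines, start_pattern=r"\d+"):
    """lines: list of rows, each a list of cell strings.
    A row whose first cell matches start_pattern opens a new position;
    any other row is attached to the current position as a continuation."""
    positions = []
    for row in lines:
        if row and re.fullmatch(start_pattern, row[0].strip()):
            positions.append([row])
        elif positions:
            positions[-1].append(row)
    return positions

rows = [
    ["10", "341018-00", "Air/oil", "250"],
    ["", "", "separator", ""],           # description wraps onto a second line
    ["20", "X-1", "Filter", "5"],        # invented second position
]
positions = group_positions(rows)
```

The `start_pattern` is the kind of parameter that would be configured per layout.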
Step 4: For each position: Extract target field values from cells
Select proper cell by column label.
Approaches:
• Train on cell content
• Rule-based (partly layout-specific)
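Selecting cells by column label can exploit the observation that column titles use recurrent terms. The sketch below maps titles to target fields with regular expressions and converts numeric values. The title patterns, the column titles and the number format (comma as thousands separator) are assumptions; the field names follow the JSON example from the talk.

```python
import re

# Hypothetical title patterns; real layouts would need their own terms.
FIELD_PATTERNS = {
    "articleId": r"(?i)art(icle)?\.?\s*(no|id)",
    "positionQuantity": r"(?i)qty|quantity",
    "articlePrice": r"(?i)unit\s*price",
}
NUMERIC_FIELDS = {"positionQuantity", "articlePrice"}

def map_columns(titles):
    """Return {field: column index} for titles that match a known pattern."""
    mapping = {}
    for idx, title in enumerate(titles):
        for field, pattern in FIELD_PATTERNS.items():
            if field not in mapping and re.search(pattern, title):
                mapping[field] = idx
                break
    return mapping

def extract_fields(titles, cells):
    """Pick each target field from its column and parse numeric values."""
    out = {}
    for field, idx in map_columns(titles).items():
        raw = cells[idx].strip()
        # Assumes ',' is a thousands separator, which is locale-dependent.
        out[field] = float(raw.replace(",", "")) if field in NUMERIC_FIELDS else raw
    return out

titles = ["Pos.", "Article no.", "Description", "Quantity", "Unit price"]
cells = ["20", "341018-00", "Air/oil separator", "250", "13.70"]
values = extract_fields(titles, cells)
```

Columns whose title matches no pattern are ignored rather than guessed.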
Rule-based steps
• Very efficient approach (unless there is a large number of different layouts)
• General rule logic with configurable parameters or regex patterns
  • Default configuration
  • Parameters/patterns only have to be configured where a layout deviates from the defaults
• Logic can also be applied to unknown layouts and still produce some results
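The default-plus-overrides idea can be sketched as a plain dictionary merge. All parameter names, values and the layout identifier `supplier_a` are invented for illustration.

```python
# Hypothetical default parameters shared by all layouts.
DEFAULTS = {
    "position_start": r"\d+",
    "column_gap": 10.0,
    "table_start_keywords": ["Pos.", "Item"],
}

# Only deviations from the defaults are configured per layout.
LAYOUT_OVERRIDES = {
    "supplier_a": {"position_start": r"\d{3}\.\d{3}"},
}

def config_for(layout):
    """Default configuration, overridden only where a layout deviates.
    Unknown layouts simply fall back to the defaults."""
    return {**DEFAULTS, **LAYOUT_OVERRIDES.get(layout, {})}

known = config_for("supplier_a")
unknown = config_for("never_seen_before")
```

Falling back to the defaults is what lets the same rule logic run on unknown layouts and still return some results.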
Challenge: Map table data to table context
Defined globally
Per position, overhead
Per position, inside position
One table per position
Summary and insights
• No general-purpose table extraction solution exists on the market
• Do not try to build one
• Instead:
  • Limit the solution to specific table types
  • Use any known constraints to inform the extraction algorithm
  • Split the task into smaller tasks
  • Decide for each task independently which method to use (an ML method or rule-based?)
Karakun AG
Elisabethenanlage 25
4051 Basel
Switzerland
P +41 61 551 36 00
E info@karakun.com
W www.karakun.com
Dr. Holger Keibel
Product Manager
holger.keibel@karakun.com
Thank you!
