AI-SDV 2022: Extracting information from tables in documents
Holger Keibel (Karakun, CH)
2. © 2022 Karakun AG | 2
Karakun
Services
Software Engineering, UX Design, Consulting,
Training, Maintenance & Support
Platforms & Products
Efficiency-enhancing software platforms,
ready-made products for selected use cases,
e.g., HIBU Platform for search and LT solutions
Experienced & Established Team
60+ employees working in 4 locations
in CH (HQ), DE and IN
Competences / Skills
State-of-the-art tech stack (Java, web &
mobile), LT / AI / Big Data,
focus on open-source software
Sustainable Custom Solutions
Customers from various industries,
e.g., Insurance, Finance, Life Science,
Logistics
Community Engagement
Authors, speakers, lecturers at universities,
Java Champions, contributors to
open-source projects
3. © 2022 Karakun AG | 3
HIBU Platform
Efficient development of custom solutions in the areas of
Artificial Intelligence: Rule-based, statistical, neural
Intelligent Search
Full-text search,
search filters,
convenience functions
Text Analysis
Classification,
information extraction,
sentiment analysis,
…
Document Automation
Content-driven,
input management,
smart actions
4. © 2022 Karakun AG | 4
Lorem ipsum dolor sit amet,
consectetur adipiscing elit, sed
do eiusmod tempor incididunt ut
labore et dolore magna aliqua.
Ut enim ad minim veniam, quis
nostrud exercitation ullamco
laboris nisi ut aliquip ex ea
commodo consequat.
Lorem ipsum: dolor sit
Lorem ipsum 1500
Information extraction methods generally
tuned to sequential textual data
• Running text
• Example approach: modern language models
(transformer-based)
• Graphical information (coordinates) can
largely be ignored
• Horizontal label-value pairs
• Example approach: regular expressions
• Graphical/layout information rarely needed
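A minimal sketch of the regular-expression approach for horizontal label-value pairs. The pattern, the function name `extract_pairs`, and the sample text are assumptions made for illustration, not part of the presented solution.

```python
import re

# Hypothetical pattern: a label, a colon, then the value up to the end of the line.
LABEL_VALUE = re.compile(r"^(?P<label>[^:\n]+):[ \t]*(?P<value>.+)$", re.MULTILINE)

def extract_pairs(text: str) -> dict:
    """Return all 'label: value' pairs that sit on a single line of the text."""
    return {m["label"].strip(): m["value"].strip() for m in LABEL_VALUE.finditer(text)}

# Invented sample text; no coordinates are needed for this case.
sample = "Order date: Oct 5, 2022\nYour order: ABC-789/5\nSome running text without a pair."
pairs = extract_pairs(sample)
```

In practice the label part would usually be restricted to a list of known labels so that arbitrary colons in running text do not produce false pairs.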
5. © 2022 Karakun AG | 5
Vertical label-value pairs
• Semi-sequential problem → more challenging
• Some graphical/layout information needed:
same x-coordinate on subsequent lines
• Example approach:
• regular expressions (to find labels)
• plus mild use of graphical information
(to extract the corresponding values)
• plus possibly again regular expressions
(to constrain values)
Date Reference
Oct 10, 2022 12345-67
Order date Your order
Oct 5, 2022 ABC-789/5
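The example above can be sketched as follows: a regular expression finds the label, the x-coordinate selects the value on the next line, and a second regular expression optionally constrains the value. The `Segment` type, the coordinates, and the tolerance `x_tol` are assumptions for illustration only.

```python
import re
from dataclasses import dataclass

@dataclass
class Segment:
    text: str   # contiguous run of words
    x: float    # left x-coordinate (invented values below)
    line: int   # index of the text line, top to bottom

def value_below(segments, label_pattern, value_pattern=r".+", x_tol=5.0):
    """Find a label by regex and return the text directly below it, or None."""
    label_re = re.compile(label_pattern)
    value_re = re.compile(value_pattern)
    for label in segments:
        if not label_re.fullmatch(label.text):
            continue
        for cand in segments:
            same_column = abs(cand.x - label.x) <= x_tol
            next_line = cand.line == label.line + 1
            if next_line and same_column and value_re.fullmatch(cand.text):
                return cand.text
    return None

segments = [
    Segment("Date", 50, 0), Segment("Reference", 200, 0),
    Segment("Oct 10, 2022", 50, 1), Segment("12345-67", 200, 1),
    Segment("Order date", 50, 2), Segment("Your order", 200, 2),
    Segment("Oct 5, 2022", 50, 3), Segment("ABC-789/5", 200, 3),
]
order_date = value_below(segments, r"Order date")
reference = value_below(segments, r"Reference", r"[\d-]+")
```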
7. © 2022 Karakun AG | 7
Table extraction challenges
• How are row boundaries encoded?
Lines, spacing, aligned content, …
• How are column boundaries encoded?
Lines, spacing, aligned content, …
• Merged table cells
• …
• General lesson learned: Very difficult to solve by a
general-purpose table extraction solution
• Better: Limit the solution to specific table types
→ use any known constraints in the algorithm
8. © 2022 Karakun AG | 8
Use case 1
• Land certificates (in Germany)
• Only scanned documents
10. © 2022 Karakun AG | 10
Detect tables
Step 1:
Detect graphical
elements (red)
vs.
free text elements (blue)
Indicators:
• pixel density
• border lines
Using LAREX (Reul et al., 2017)
11. © 2022 Karakun AG | 11
Detect tables
Step 2:
Straighten lines
(bounding boxes)
This and the subsequent steps are performed with OpenCV (https://opencv.org/).
12. © 2022 Karakun AG | 12
Detect tables
Step 3: For each detected table: Cut out table image
13. © 2022 Karakun AG | 13
Analyze table structure
Step 4: Blur image to smoothen and repair lines
14. © 2022 Karakun AG | 14
Analyze table structure
Step 5: Invert colors
15. © 2022 Karakun AG | 15
Analyze table structure
Step 6: Binarize image to increase width of lines
16. © 2022 Karakun AG | 16
Analyze table structure
Step 6: Detect horizontal lines
17. © 2022 Karakun AG | 17
Analyze table structure
Step 7: Extend horizontal lines by means of dilation
18. © 2022 Karakun AG | 18
Analyze table structure
Step 8: The same for vertical lines
19. © 2022 Karakun AG | 19
Analyze table structure
Step 9: Combine vertical and horizontal lines into a grid
and derive the coordinates of the cells.
20. © 2022 Karakun AG | 20
Analyze table structure
Step 10: Submit entire table to OCR engine
21. © 2022 Karakun AG | 21
Analyze table structure
Step 11: Parse OCR result (hOCR) to assign words to table cells
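A minimal sketch of this step using only the Python standard library. It assumes the OCR engine emits words as `<span class='ocrx_word' title='bbox x0 y0 x1 y1; ...'>` elements, as Tesseract's hOCR output does, and it assigns each word to the cell that contains the word's center. The class and function names and the sample markup are made up for illustration.

```python
import re
from html.parser import HTMLParser

BBOX = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")

class HocrWordParser(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""
    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None

def assign_words(words, cells):
    """Map each cell index to the words whose center lies inside the cell."""
    assigned = {i: [] for i in range(len(cells))}
    for text, (x0, y0, x1, y1) in words:
        cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
        for i, (cx0, cy0, cx1, cy1) in enumerate(cells):
            if cx0 <= cx < cx1 and cy0 <= cy < cy1:
                assigned[i].append(text)
                break
    return assigned

hocr = (
    "<div class='ocr_page'><span class='ocr_line' title='bbox 0 0 200 30'>"
    "<span class='ocrx_word' title='bbox 12 12 40 28; x_wconf 95'>123.45</span> "
    "<span class='ocrx_word' title='bbox 110 12 160 28; x_wconf 91'>234/15</span>"
    "</span></div>"
)
parser = HocrWordParser()
parser.feed(hocr)
by_cell = assign_words(parser.words, [(0, 0, 100, 40), (100, 0, 200, 40)])
```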
22. © 2022 Karakun AG | 22
Analyze table structure
Step 12: Resolve merged cells to derive structured representation
[Lfd.Nr. …] [Bish. …] [Bezeichnung …] [Bezeichnung …] [Bezeichnung …] [Größe] [Größe] [Größe]
[Lfd.Nr. …] [Bish. …] [a) Gemarkung …] [a) Gemarkung …] [c) Wirtschaftsart …] [ha] [a] [m²]
[Lfd.Nr. …] [Bish. …] [b) Karte] [Flurstück] [c) Wirtschaftsart …] [ha] [a] [m²]
[1] [2] [3] [3] [3] [4] [4] [4]
[1] [-] [123.45] [234/15] [Musterstraße 123\nGebäude- und Freifläche] [] [5] [94]
[2] [-] [123.45] [234/18] [Musterstraße 123a\nGebäude- und Freifläche] [] [3] [41]
[3] [-] [123.45] [137/8] [Musterstraße 58\nGebäude- und Freifläche] [] [11] [70]
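One way to obtain such a representation is to copy the content of every merged cell into each grid position it covers, which is why "Bezeichnung …" and "Größe" appear three times in the first row above. The sketch below assumes that the row and column spans of merged cells are already known (for example from missing grid lines); the `MergedCell` type and the reduced two-row example are made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class MergedCell:
    row: int       # top grid row covered by the cell
    col: int       # left grid column covered by the cell
    rowspan: int   # number of grid rows covered
    colspan: int   # number of grid columns covered
    text: str

def resolve_merged(cells, n_rows, n_cols):
    """Repeat the text of each (possibly merged) cell in every grid slot it spans."""
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for cell in cells:
        for r in range(cell.row, cell.row + cell.rowspan):
            for c in range(cell.col, cell.col + cell.colspan):
                grid[r][c] = cell.text
    return grid

# Reduced header: one cell merged vertically, one merged horizontally.
header = [
    MergedCell(0, 0, 2, 1, "Lfd.Nr."),
    MergedCell(0, 1, 1, 2, "Größe"),
    MergedCell(1, 1, 1, 1, "ha"),
    MergedCell(1, 2, 1, 1, "a"),
]
grid = resolve_merged(header, n_rows=2, n_cols=3)
```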
23. © 2022 Karakun AG | 23
Use case 2
• Order confirmations + invoices
• Digitally generated documents (PDFs)
• Focus: Known table layouts
24. © 2022 Karakun AG | 24
Table extraction
{
  "freightCosts": 145.0,
  "orderDate": "2020-10-09",
  "orderId": "12345",
  "packagingCosts": 11.8,
  "positions": [
    { … },
    {
      "values": {
        "articleDescription": "Air/oil separator",
        "articleId": "341018-00",
        "articlePrice": 13.7,
        "deliveryDate": "2020-10-28",
        "positionPrice": 3425.0,
        "positionQuantity": 250.0,
        "positionReference": "107.246"
      }
    },
    { … }
  ],
  "positionsTotal": 7017.15
}
25. © 2022 Karakun AG | 25
Considerations
• Broad range of fairly special table layouts
• Tables might spread across multiple pages
• But mostly with some commonalities:
• Column boundaries indicated by aligned content
• Column titles use fairly recurrent terms
• Strategy of four steps
• Each can be rule-based or an ML component
26. © 2022 Karakun AG | 26
Step 1: Detect table area
Approaches:
• Train on textual
and graphical input
• Rule-based
(layout-specific
keywords)
27. © 2022 Karakun AG | 27
Step 2: Detect columns
Approaches:
• Train on textual
and graphical input
• Cluster left/right
x-coordinates
of tokens
• Configure exact
x-coordinates
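The clustering approach can be sketched as a simple one-dimensional gap clustering: sorted x-coordinates that lie close together form one column. The function name, the threshold `max_gap` and the coordinates are assumptions; right-aligned columns (typically numbers) would be clustered on right x-coordinates in the same way.

```python
def cluster_columns(xs, max_gap=10.0):
    """Group x-coordinates into columns; a gap larger than max_gap starts a new column."""
    clusters = []
    for x in sorted(xs):
        if clusters and x - clusters[-1][-1] <= max_gap:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Represent each column by the mean coordinate of its members.
    return [sum(c) / len(c) for c in clusters]

# Invented left x-coordinates of tokens from several table rows.
left_xs = [50, 51, 49, 200, 202, 350, 351, 50]
columns = cluster_columns(left_xs)
```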
28. © 2022 Karakun AG | 28
Step 3: Detect positions (logical rows)
Approaches:
• Train on textual
and graphical input
• Rule-based
(layout-specific)
Handle exceptions
(e.g. values across
cell boundaries)
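A minimal sketch of one possible layout-specific rule: a new position starts on every line that begins with a position number, and all other lines are continuation lines of the current position (for example a wrapped article description). The pattern, the function name and the sample lines are assumptions for illustration.

```python
import re

def group_positions(table_lines, start_pattern=r"^\s*\d{1,3}\s"):
    """Split the physical lines of a table body into positions (logical rows)."""
    start_re = re.compile(start_pattern)
    positions = []
    for line in table_lines:
        if start_re.match(line) or not positions:
            positions.append([line])        # a new position begins
        else:
            positions[-1].append(line)      # continuation of the current position
    return positions

body = [
    "10  341018-00  Air/oil      250  3425.00",
    "               separator",
    "20  341020-00  Filter       100   512.00",
]
positions = group_positions(body)
```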
29. © 2022 Karakun AG | 29
Step 4: For each position:
Extract target field values from cells
Select proper cell
by column label.
Approaches:
• Train on cell
content
• Rule-based
(partly layout-
specific)
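A minimal sketch of the rule-based variant: recurrent column titles are matched with regular expressions to select the cell for each target field, and numeric cells are then parsed. The patterns, the number format and the sample row are assumptions; the field names follow the JSON example shown earlier.

```python
import re

# Hypothetical mapping from target field to a pattern over recurrent column titles.
COLUMN_PATTERNS = {
    "articleId": re.compile(r"art(icle|ikel)?[\s.-]*(no|nr|id)", re.I),
    "positionQuantity": re.compile(r"qty|quantity|menge", re.I),
    "positionPrice": re.compile(r"total|amount|betrag", re.I),
}
NUMERIC_FIELDS = ("positionQuantity", "positionPrice")

def parse_number(text):
    """Parse numbers written like '3,425.00'; other locales need their own rule."""
    return float(text.replace(",", ""))

def extract_fields(column_titles, cells):
    """Select the cell for each target field by its column title."""
    fields = {}
    for field, pattern in COLUMN_PATTERNS.items():
        for title, cell in zip(column_titles, cells):
            if pattern.search(title):
                fields[field] = cell.strip()
                break
    for field in NUMERIC_FIELDS:
        if field in fields:
            fields[field] = parse_number(fields[field])
    return fields

titles = ["Pos.", "Article no.", "Description", "Quantity", "Amount"]
row = ["2", "341018-00", "Air/oil separator", "250", "3,425.00"]
values = extract_fields(titles, row)
```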
30. © 2022 Karakun AG | 30
Rule-based steps
• Very efficient approach (unless large
number of different layouts)
• General rule logic with configurable
parameters or regex patterns
• Default configuration
• Parameters/patterns only have to be
configured if deviating
• Logic can also be applied to unknown layouts
and produce some results
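The idea of a default configuration with per-layout deviations can be sketched as a dictionary merge. All parameter names, patterns and layout names below are made up for illustration.

```python
# Defaults used by the general rule logic (hypothetical parameters).
DEFAULT_CONFIG = {
    "table_end_pattern": r"^\s*total\b",
    "position_start_pattern": r"^\s*\d{1,3}\s",
    "column_gap": 10.0,
}

# Only deviating parameters are listed per layout.
LAYOUT_OVERRIDES = {
    "supplier_a": {"table_end_pattern": r"^\s*sum\b"},
    "supplier_b": {"column_gap": 6.0, "position_start_pattern": r"^\s*\d{2}\.\s"},
}

def config_for(layout=None):
    """Merge the defaults with the overrides of a known layout, if any."""
    return {**DEFAULT_CONFIG, **LAYOUT_OVERRIDES.get(layout, {})}

config_a = config_for("supplier_a")
config_unknown = config_for("never_seen_before")
```

An unknown layout simply falls back to the defaults, which matches the point that the logic can still produce some results for layouts that were never configured.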
31. © 2022 Karakun AG | 31
Challenge: Map table data to table context
Defined globally
Per position, overhead
Per position, inside position
One table per position
32. © 2022 Karakun AG | 32
Summary and insights
• No general-purpose table extraction
solution exists on the market
• Do not try to build one
• Instead:
• Limit the solution to specific table types
• Use any known constraints to inform the
extraction algorithm
• Split task into smaller tasks
• Decide for each task independently
which method to use (an ML method or rule-based)
34. Karakun AG
Elisabethenanlage 25
4051 Basel
Switzerland
P +41 61 551 36 00
E info@karakun.com
W www.karakun.com
Dr. Holger Keibel
Product Manager
holger.keibel@karakun.com
Thank you!