Page Layout Analysis of 19th Century Siamese Newspapers using Python and OpenCV
Mark Hollow
PyCon APAC, 2017
about me...
graduated in classical music · self-taught in computing
programming python since 2002 · 20 years working in IT
IT infrastructure · UNIX sysadmin · project management
software engineering · data systems · product management
2
once upon a time...
Dr Dan Beach Bradley - หมอ บรัดเลย
Born 18th July 1804, New York; died 23rd June 1873, Bangkok
Graduated as Doctor of Medicine from New York University
American Protestant missionary in Siam
Arrives in Bangkok on 18th July 1835 from Boston via Singapore
Brings with him the first printing press to Siam
Many notable achievements & firsts in Siam
first surgery, first vaccination, first printed Royal edict, first newspaper, first classified & commercial
advertising, first copyright transaction, first printing of Siamese law, first Siamese translation of the
Old Testament, first monolingual Siamese dictionary
3
the first siamese newspaper
The Bangkok Recorder - หนังสือจดหมายเหตุ
1844–1845 magazine-like, fact-based, introduces western
ideas, knowledge, science and Christianity
1865–1867 more social commentary and introduction of
western liberalism -- rather controversial
a lot of historical information...
thai society (seen from a western perspective)
regional and global news/information
prices of goods, services, imports and exports
4
there is no online
searchable database
of this historical
information.
5
DigitalBangkokRecorder
markhollow.com
digital bangkok recorder project
objectives
scan all the surviving editions
transcribe all text
make all text available online
learn how to do all of this
in this presentation
cleaning scanned images
detecting the page layout
extract all text lines
prepare for transcription
6
page layout
2 column layout
front page:
title & date lines
last page:
tabular data
some illustrations
some full-width
tables
7
a closer look...
large header on cover
dual-language headings
column separator line
topic separators
unique typeface
the first ever thai typeface
now-obsolete characters
not supported by modern ocr
8
basic workflow
1. SCAN
2. CLEAN
3. STRUCTURAL ANALYSIS
4. EXTRACT TEXT
5. TRANSCRIPTION
9
getting started
with opencv
10
what is opencv?
“OpenCV (Open Source Computer Vision Library) is an
open source computer vision and machine learning
software library.”
- opencv.org
Written in C++; bindings for Python and others
v3.2 used here; v3.1 probably works
v2.x won’t work - different API structure
many v2.x blogs/articles still online - beware!
11
opencv basics: installation
$ pip install opencv-python
or
$ pip install opencv-contrib-python
No FFmpeg, GTK or carbon support - limits some features.
Works well in jupyter/ipython.
Non-free
patented
stuff!!
12
opencv basics: loading/saving images
loading images…
>>> import cv2
>>> img = cv2.imread('image001.jpg')
>>> type(img)
<type 'numpy.ndarray'>
saving images...
>>> cv2.imwrite('newfile.png', img)
OpenCV images
are numpy arrays!
All common
formats supported.
Extra args
supported for
image formats.
13
document cleaning.
14
removing background noise (1)
- binarization: set pixel value based on threshold
- types: basic, adaptive
- both need experimentation with threshold value
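# cv2.threshold returns (threshold used, binarized image), in that order
th, bin_image = cv2.threshold(image, 192, 255,
    cv2.THRESH_BINARY)
# block size 101; the constant 2 is subtracted from the local mean
bin_image = cv2.adaptiveThreshold(image, 255,
    cv2.ADAPTIVE_THRESH_MEAN_C,
    cv2.THRESH_BINARY, 101, 2)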
15
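To make the adaptive variant less of a black box, here is a minimal pure-Python sketch of mean-based adaptive thresholding. It follows the rule described in the OpenCV documentation (a pixel is set when it is brighter than its local mean minus a constant), but the function name, the list-of-lists image format and the clamped border handling are assumptions of this sketch, not OpenCV's implementation.

```python
def adaptive_mean_threshold(image, block_size, c, max_value=255):
    """Binarize a grayscale image (list of rows of ints) against a local mean.

    A pixel becomes max_value when it is brighter than the mean of its
    block_size x block_size neighbourhood minus c, otherwise 0.
    """
    if block_size % 2 == 0:
        raise ValueError("block_size must be odd")
    height, width = len(image), len(image[0])
    half = block_size // 2
    result = []
    for y in range(height):
        row = []
        for x in range(width):
            total = 0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    # Clamp coordinates so border pixels reuse the nearest edge value.
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += image[yy][xx]
            local_mean = total / (block_size * block_size)
            row.append(max_value if image[y][x] > local_mean - c else 0)
        result.append(row)
    return result


# Left half: bright paper (200) with faint ink (130).
# Right half: darker paper (120) with dark ink (40).
page = [
    [200, 200, 200, 120, 120, 120],
    [200, 130, 200, 120,  40, 120],
    [200, 200, 200, 120, 120, 120],
]
binary = adaptive_mean_threshold(page, block_size=3, c=2)
```

A single global threshold struggles with this toy page: any value above 120 that catches the faint ink at 130 also turns the darker paper on the right black. The local mean classifies both ink pixels as black while the page corners stay white.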
removing background noise (2)
# with THRESH_OTSU the threshold argument (0) is ignored; th is the value otsu chose
th, bin_image = cv2.threshold(img, 0, 255,
    cv2.THRESH_BINARY + cv2.THRESH_OTSU)
otsu binarization tries to find best threshold value
example:
manual threshold guessed at v=192
otsu selects v=177
improvement in number of artifacts
16
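How otsu "finds the best threshold" can be sketched in a few lines of plain Python: try every candidate value and keep the one that maximizes the variance between the dark class and the bright class. This is a minimal illustration of the idea on a flat list of gray values; the function name and sample data are made up for the example, and the exact value OpenCV returns for the same data may differ when several thresholds separate the classes equally well.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form one class, pixels > t the other.
    """
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low = 0   # number of pixels in the low class
    sum_low = 0      # sum of gray values in the low class
    for t in range(256):
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (sum_all - sum_low) / weight_high
        # Between-class variance, up to a constant factor of 1 / total**2.
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


# Dark ink around 40-60, paper around 200-220.
sample = [40, 50, 60, 45, 55] * 10 + [200, 210, 220, 205, 215] * 30
t = otsu_threshold(sample)
binary = [255 if v > t else 0 for v in sample]
```

On this sample the chosen threshold falls in the gap between the ink values and the paper values, so all 50 ink pixels become black and all 150 paper pixels become white.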
removing background noise (3)
* Contrast emphasized for display purposes.
17
structural analysis:
page margins
18
morphological transforms (1)
- erosion: erodes away the
boundaries of foreground object
- dilation: dilates/thickens
boundaries
NOTE: black = background
white = foreground
import numpy as np
kernel = np.ones((5, 5), np.uint8)
# invert so text is the white foreground, erode, then invert back
new_image = ~cv2.erode(
    ~original_image,
    kernel,
    iterations=1)
19
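What erosion and dilation do to each pixel can be shown without OpenCV. The sketch below works on a small binary grid where 1 is foreground (so no inversion is needed, unlike the scanned pages) and takes the minimum or maximum over a square window. The helper names are hypothetical, and ignoring neighbours that fall outside the image is an assumption of this sketch.

```python
def morph(image, kernel_size, pick):
    """Apply min (erosion) or max (dilation) over a square neighbourhood.

    image is a list of rows of 0/1 values, with 1 = foreground.
    Neighbours outside the image are ignored (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    half = kernel_size // 2
    output = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[yy][xx]
                for yy in range(max(0, y - half), min(height, y + half + 1))
                for xx in range(max(0, x - half), min(width, x + half + 1))
            ]
            row.append(pick(window))
        output.append(row)
    return output


def erode(image, kernel_size=3):
    return morph(image, kernel_size, min)


def dilate(image, kernel_size=3):
    return morph(image, kernel_size, max)


square = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
eroded = erode(square)    # only the centre pixel survives
dilated = dilate(square)  # the square grows to fill the 5x5 grid
```

With a 3x3 window the 3x3 square shrinks to its centre pixel under erosion and grows by one pixel on every side under dilation.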
morphological transforms (2)
- opening: erosion+dilation
used for removing noise
- closing: dilation+erosion
closes small holes in objects
import numpy as np
kernel = np.ones((5, 5), np.uint8)
# invert so text is the white foreground, open, then invert back
img2 = ~cv2.morphologyEx(
    ~img1,
    cv2.MORPH_OPEN,
    kernel,
    iterations=1)
20
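Opening and closing are just the two basic operations chained in opposite orders. The following pure-Python sketch, again on a binary grid with 1 as foreground, shows opening removing an isolated speck and closing filling a one-pixel hole. The function names are hypothetical and pixels outside the image are ignored, which is an assumption of this sketch.

```python
def _filter(image, size, pick):
    """Min or max filter over a size x size window; outside pixels are ignored."""
    h, w, r = len(image), len(image[0]), size // 2
    return [
        [
            pick(
                image[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def opening(image, size=3):
    """Erosion followed by dilation: removes specks smaller than the kernel."""
    return _filter(_filter(image, size, min), size, max)


def closing(image, size=3):
    """Dilation followed by erosion: fills holes smaller than the kernel."""
    return _filter(_filter(image, size, max), size, min)


noisy = [
    [1, 0, 0, 0, 0, 0],  # a single-pixel speck in the corner
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
cleaned = opening(noisy)

holed = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],  # a single-pixel hole in the middle
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
]
filled = closing(holed)
```

The 3x3 block in `noisy` is exactly as large as the kernel, so it survives the opening unchanged while the speck disappears.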
contours
“a curve joining all the
continuous points having same
color or intensity”
# OpenCV 3.x returns three values; 4.x returns only (contours, hierarchy)
_, contours, hierarchy = cv2.findContours(
    ~img, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
col_img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
cv2.drawContours(col_img, contours, -1, (255,0,0), 5)
findContours return values:
contours: list of contours
hierarchy: contour structure
21
finding page margins
- “open” removes artifacts; “dilate” emphasizes text (opened & dilated)
- findContours() to group blocks; filter out small contours.
- get margin from contour edges
22
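The margin step can be sketched without OpenCV by swapping in a different technique: instead of tracing contours, the code below labels 4-connected blobs with a flood fill and records each blob's bounding box, which is roughly what a contour's bounding rectangle would give. Small boxes are filtered out as noise and the margin is the rectangle enclosing the rest. All names and the `min_area` filter are assumptions made for illustration.

```python
from collections import deque


def bounding_boxes(image, min_area=1):
    """Return (left, top, right, bottom) boxes of 4-connected foreground blobs.

    image is a list of rows of 0/1 values (1 = foreground). Blobs whose
    bounding box covers fewer than min_area pixels are dropped as noise.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Breadth-first flood fill over one blob, tracking its extent.
            left = right = x
            top = bottom = y
            queue = deque([(x, y)])
            seen[y][x] = True
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and image[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if (right - left + 1) * (bottom - top + 1) >= min_area:
                boxes.append((left, top, right, bottom))
    return boxes


def page_margins(boxes):
    """Smallest rectangle enclosing all boxes, as (left, top, right, bottom)."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


page = [
    [1, 0, 0, 0, 0, 0, 0, 0],  # stray speck, filtered out by min_area
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0],
]
blocks = bounding_boxes(page, min_area=4)
margins = page_margins(blocks)
```

On a real scan the input would be the opened and dilated image, so that each text block has merged into a single blob before the boxes are collected.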
structural analysis:
identify page
sections
23
morphological transforms (revisited)
structuring element (kernel) array
is made of 1’s & 0’s
it’s compared to each pixel
erode: takes minimum value
dilate: takes maximum value
a linear structuring element will
operate on linear patterns
[Diagram: dilation example showing the kernel, input image and output image. Source: http://docs.opencv.org/3.1.0/d1/dee/tutorial_moprh_lines_detection.html]
>>> cv2.getStructuringElement(cv2.MORPH_RECT, (10, 1))
array([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1]], dtype=uint8)
24
page segmentation
- erode & dilate with a long linear structuring element: extracts lines to mask
- once for horizontal, then for vertical lines
- findContours() on the mask gets contour coordinates
- draw contour to remove line
- centre line of contours used as page section boundaries
25
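On a binary image, eroding and then dilating with a 1 x N horizontal structuring element keeps exactly those foreground pixels that sit in a horizontal run of at least N pixels (away from the image borders). The sketch below uses that run-length formulation directly, in pure Python, to build the line mask, find the centre row of each detected rule and erase the rules from the page. Function names and the toy page are assumptions for illustration.

```python
def long_horizontal_runs(image, min_length):
    """Keep only foreground pixels that lie in a horizontal run >= min_length.

    image is a list of rows of 0/1 values (1 = foreground). The result is a
    mask of the same size containing just the long horizontal lines.
    """
    mask = []
    for row in image:
        out = [0] * len(row)
        x = 0
        while x < len(row):
            if row[x]:
                start = x
                while x < len(row) and row[x]:
                    x += 1
                if x - start >= min_length:
                    out[start:x] = [1] * (x - start)
            else:
                x += 1
        mask.append(out)
    return mask


def line_centres(mask):
    """Row indices at the vertical centre of each band of consecutive line rows."""
    centres, start = [], None
    for y, row in enumerate(mask + [[0]]):  # sentinel row closes a trailing band
        if any(row):
            if start is None:
                start = y
        elif start is not None:
            centres.append((start + y - 1) // 2)
            start = None
    return centres


page = [
    [0, 1, 1, 0, 1, 0, 1, 1, 0, 0],  # text: short runs only
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # separator rule, three pixels thick
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1, 0, 1, 0, 0],  # more text
]
mask = long_horizontal_runs(page, min_length=6)
boundaries = line_centres(mask)
# Removing the detected lines leaves only the text pixels.
text_only = [
    [0 if m else p for p, m in zip(page_row, mask_row)]
    for page_row, mask_row in zip(page, mask)
]
```

For vertical lines the same functions can be applied to the transposed image, for example `[list(col) for col in zip(*page)]`.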
page segmentation (full page)
section boundaries
page margin
blank areas from average values of multiple adjacent lines
26
structural analysis:
topic separators
27
template matching: finding objects in an image
result = cv2.matchTemplate(image,
    template, cv2.TM_CCOEFF_NORMED)
_, maxval, _, maxloc = cv2.minMaxLoc(result)
a template is a small image segment:
cv2.matchTemplate() returns match scores
28
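The match scores can be reproduced by hand on a tiny example. This pure-Python sketch slides the template over the image and scores each position with the zero-mean normalized cross-correlation, which is the formula the OpenCV documentation lists for TM_CCOEFF_NORMED. The function names are hypothetical, and giving flat patches a score of 0 is a choice made for this sketch.

```python
from math import sqrt


def match_template(image, template):
    """Score every placement of template inside image.

    Both arguments are lists of rows of numbers. Returns a grid of scores in
    [-1, 1]; flat patches with zero variance score 0 (a choice of this sketch).
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_values = [v for row in template for v in row]
    t_mean = sum(t_values) / len(t_values)
    t_centred = [v - t_mean for v in t_values]
    t_norm = sum(v * v for v in t_centred)

    scores = []
    for y in range(ih - th + 1):
        score_row = []
        for x in range(iw - tw + 1):
            patch = [image[y + j][x + i] for j in range(th) for i in range(tw)]
            p_mean = sum(patch) / len(patch)
            p_centred = [v - p_mean for v in patch]
            p_norm = sum(v * v for v in p_centred)
            denom = sqrt(t_norm * p_norm)
            numer = sum(a * b for a, b in zip(t_centred, p_centred))
            score_row.append(numer / denom if denom else 0.0)
        scores.append(score_row)
    return scores


def best_match(scores):
    """Return (max_score, (x, y)), like the maxval/maxloc pair from minMaxLoc."""
    return max(
        (score, (x, y))
        for y, row in enumerate(scores)
        for x, score in enumerate(row)
    )


# A small cross-shaped "decorative marker" as the template.
template = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
image = [[0] * 8 for _ in range(6)]
for j, row in enumerate(template):
    for i, value in enumerate(row):
        image[2 + j][4 + i] = value  # paste the marker with its top-left at x=4, y=2

max_score, max_loc = best_match(match_template(image, template))
```

The best score is found where the marker was pasted, and the location is reported as (x, y) of the template's top-left corner.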
structural analysis complete
- margins identified
- horizontal and vertical lines detected
- original lines removed
- blank areas identified
- removed decorative markers with templates
- use template matching to identify titles
- and therefore page style (e.g. first or other page)
29
structural analysis: first edition
30
extract
text lines
31
extract text lines
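THRESHOLD = 248
# True for every row whose average gray value is near-white, i.e. a blank row
thresholds = cv2.reduce(
    image,
    1,  # dim=1 reduces to a single column (one average per row); dim=0 gives a single row
    cv2.REDUCE_AVG
) >= THRESHOLD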
32
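Once each row is flagged as blank or not, the text lines are simply the runs of consecutive non-blank rows. The sketch below shows that grouping step in pure Python on a toy page; the function names are hypothetical and the image is a list of rows of gray values rather than a numpy array.

```python
def blank_rows(image, threshold=248):
    """True for each row whose mean gray value is at least threshold.

    image is a list of rows of 0-255 values with dark text on white paper.
    """
    return [sum(row) / len(row) >= threshold for row in image]


def text_line_ranges(blank):
    """Group consecutive non-blank rows into (first_row, last_row) ranges."""
    ranges, start = [], None
    for y, is_blank in enumerate(blank):
        if not is_blank and start is None:
            start = y
        elif is_blank and start is not None:
            ranges.append((start, y - 1))
            start = None
    if start is not None:
        ranges.append((start, len(blank) - 1))
    return ranges


W, K = 255, 0  # white paper, black ink
page = [
    [W, W, W, W, W, W],
    [W, K, K, W, K, W],
    [W, K, W, W, K, W],
    [W, W, W, W, W, W],
    [W, W, W, W, W, W],
    [W, K, K, K, W, W],
    [W, W, W, W, W, W],
]
lines = text_line_ranges(blank_rows(page))
```

Each returned range can then be used to crop one text line from the page section, for example `page[first:last + 1]`.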
workflow: page layout analysis all done!
1. SCAN
2. CLEAN
3. STRUCTURAL ANALYSIS
4. EXTRACT TEXT
5. TRANSCRIPTION
✓ ✓ ✓
33
what’s next?
34
transcription
- transcribe enough text for developing an OCR model
- regular OCR is very inaccurate due to the unique font
- hire typists or use Amazon Mechanical Turk
- there are a few problems to solve:
  - transcription cost; guidelines needed due to archaic text & unique typeface
  - how to develop an OCR system?
    - retrain Tesseract 4.x’s LSTM (Long Short-Term Memory) neural network?
    - use TensorFlow or similar?
- perhaps that’s my next PyCon presentation!
35
appendix
Not enough time to cover these
topics… :-(
- Removing page frames *
- Skew correction *
- Detecting tables †
- Detecting pictures
* See https://markhollow.com/
† Coming soon
Other resources:
- ocropus / ocropy: python document analysis tools
- scantailor: GUI for cleaning scanned documents
- CE316 / CE866: Computer Vision, University of Essex, UK
  http://orb.essex.ac.uk/ce/ce316/
36
in summary...
opencv basics · thresholds · morphological
transformations · contours · masks · template
matching and a little bit of numpy
...plus a practical application
to document layout analysis
37
thank you for listening.
questions?
38
Mark Hollow
markhollow.com
DigitalBangkokRecorder
