Slides for the paper "Towards a Higher Accuracy of Optical Character Recognition of Chinese Rare Books in Making Use of Text Model" by Hsiang-An Wang and Pin-Ting Liu, presented at DATeCH 2019, the 3rd International Conference on Digital Access to Textual Cultural Heritage
1. Towards a Higher Accuracy of Optical
Character Recognition of Chinese Rare
Books in Making Use of Text Model
Hsiang-An Wang
Academia Sinica
Center for Digital Cultures
4. Experiment: Data Collection
• Training dataset: 187 ancient medical books
from the Scripta Sinica Database (about 40
million words)
• Testing dataset: 1 related ancient medical
book named " ", with a total of
185,000 words
• The OCR results contain about 180,000 correct
words and about 5,000 incorrect words,
which means an accuracy of about 97.3%
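A quick arithmetic restatement of that rate (our own check, not on the slide):

\[ \frac{180{,}000}{185{,}000} \approx 0.973 = 97.3\% \]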
5. Experiment: Building an N-gram Model
• Relies on the sequence of words in the
training dataset: we pick the output with the
highest frequency (a minimal sketch follows
this slide)
• Example: " "
– 2-gram: input to predict " "
– 3-gram: input to predict " "
– 4-gram: input to predict " "
– ...
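A minimal Python sketch of this frequency-counting scheme; the code, names, and toy corpus are our illustration, not from the paper:

    from collections import Counter, defaultdict

    def build_ngram_model(text, n):
        # Count how often each (n-1)-character context is followed by each character.
        model = defaultdict(Counter)
        for i in range(len(text) - n + 1):
            context, nxt = text[i:i + n - 1], text[i + n - 1]
            model[context][nxt] += 1
        return model

    def predict(model, context):
        # Output the highest-frequency continuation of the context, if one was seen.
        counts = model.get(context)
        return counts.most_common(1)[0][0] if counts else None

    # Usage: a 3-gram model predicts the character after a 2-character context.
    model = build_ngram_model("a toy corpus standing in for the training corpus", n=3)
    print(predict(model, "co"))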
6. Experiment: Building a
Backward and Forward N-gram Model
• Relies on the sequences of preceding and following
words in the training dataset: we pick the
output with the highest frequency
• The backward and forward N-grams are kept as two
separate sets of N-grams, so a word can be predicted
from the context that follows it as well as the
context that precedes it (one possible combination
is sketched after this slide)
• Example: " "
– Backward 4-gram: input to predict " "
– Forward 4-gram: input to predict " "
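A Python sketch of the backward set and of one way to combine the two directions, reusing build_ngram_model from the previous sketch as the forward model; the slide does not say how the two sets are merged, so the frequency-voting scheme below is our assumption:

    from collections import Counter, defaultdict

    def build_backward_model(text, n):
        # Count how often each character precedes a given (n-1)-character context.
        model = defaultdict(Counter)
        for i in range(len(text) - n + 1):
            target, context = text[i], text[i + 1:i + n]
            model[context][target] += 1
        return model

    def predict_bf(forward_model, backward_model, left_context, right_context):
        # Sum the frequency votes from both directions and take the winner.
        votes = Counter()
        votes.update(forward_model.get(left_context, {}))
        votes.update(backward_model.get(right_context, {}))
        return votes.most_common(1)[0][0] if votes else None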
7. Experiment: Building an LSTM Model
• Used Word2vec to project the text into a
200-dimensional vector space
• Used a three-layer LSTM neural network
• Picked the word with the highest score in the
softmax layer as the prediction (an architecture
sketch follows this slide)
• Example: " "
– LSTM 2-gram: input to predict " "
– LSTM 3-gram: input to predict " "
– LSTM 4-gram: input to predict " "
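A PyTorch sketch of this kind of model; the 200-dimensional embeddings, the three LSTM layers, and the softmax prediction come from the slide, while PyTorch itself, the hidden size, and the vocabulary size are our assumptions:

    import torch
    import torch.nn as nn

    class LSTMTextModel(nn.Module):
        def __init__(self, vocab_size, embed_dim=200, hidden_dim=256):
            super().__init__()
            # In practice the embedding weights would be initialized from Word2vec.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=3, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, token_ids):
            x = self.embed(token_ids)         # (batch, seq_len, 200)
            h, _ = self.lstm(x)               # (batch, seq_len, hidden_dim)
            logits = self.out(h[:, -1, :])    # score candidates for the next word
            return torch.softmax(logits, dim=-1)

    # The predicted word is the vocabulary entry with the highest softmax score:
    # probs = LSTMTextModel(vocab_size=8000)(torch.tensor([[3, 17, 42]]))
    # prediction = probs.argmax(dim=-1)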
8. The Correction Rate of
the N-gram Model
• The 7-gram achieves the best correction rate
9. The Correction Rate of the
Backward and Forward N-gram Model
• The Backward and Forward 4-gram achieves
the best correction rate
10. The Correction Rate of
the LSTM Model
• The LSTM 6-gram achieves the best correction
rate
11. Comparison of the 7-gram, LSTM 6-gram,
and BF 4-gram Text Models

Model         Correct→wrong   Incorrect→right   Accuracy
OCR           X               X                 97.30%
7-gram        0.35%           13.06%            97.49%
LSTM 6-gram   0.10%           7.33%             97.50%
BF 4-gram     0.08%           9.54%             97.57%

(Correct→wrong: ratio of correct OCR results changed to wrong ones by the text model; Incorrect→right: ratio of incorrect OCR results corrected; Accuracy: OCR combined with the text model; X = not applicable.)

• The Backward and Forward (BF) 4-gram has the best
performance, with the lowest ratio of erroneous
modifications and the highest accuracy
12. Three Text Models with
OCR Top-5 Candidate Words
• The OCR software we use is a convolutional neural
network (CNN) model that computes classification
probabilities through a softmax function
• When the probability of the OCR Top-1 candidate is
lower than 95%, the word is judged possibly wrong
and the mixed model is applied
• We pick the word with the highest text-model score
that also appears among the OCR Top-5 candidate
words (a sketch of this rule follows this slide)
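A Python sketch of this mixing rule as we read it from the slide; the function shape, names, and data structures are our assumptions:

    def mix_ocr_and_text_model(ocr_top5, text_model_score, threshold=0.95):
        # ocr_top5: list of (word, probability) pairs sorted by OCR probability.
        # text_model_score: function mapping a candidate word to its text-model score.
        top1_word, top1_prob = ocr_top5[0]
        if top1_prob >= threshold:
            return top1_word  # trust the OCR result outright
        # Otherwise pick the OCR Top-5 candidate the text model scores highest.
        return max((word for word, _ in ocr_top5), key=text_model_score)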
13. Comparison of the Three Text Models
Mixed with the Probability of OCR

Model         Correct→wrong   Incorrect→right   Accuracy
OCR           X               X                 97.30%
7-gram        0.012%          9%                97.63%
LSTM 6-gram   0.13%           16%               97.71%
BF 4-gram     0.009%          5.92%             97.55%

(Columns as in the previous table.)

• The LSTM 6-gram mixed with the probability of OCR
has the best performance
14. Conclusion: Using a Text Model
• The N-gram, Backward and Forward N-gram, and LSTM
N-gram text models can all increase the accuracy
of OCR
• The Backward and Forward 4-gram model has the lowest
ratio of erroneous modifications and the highest
accuracy
15. Conclusion: Mixing Text Models with
the Probability of OCR
• Mixing the OCR Top-5 candidate rule and the Top-1
probability threshold with a text model achieves
better results than using a text model alone
• Mixing the LSTM 6-gram with the probability of
OCR yields the highest accuracy