This is a detailed report evaluating whether Google Cloud Vision can be used for business form OCR at a practical, real-world level.
Google Cloud Vision API Evaluation for Japanese Healthcare Reports
1. 1
Evaluation on Google Cloud Vision API 1.1 (beta)
- POC Report for Japanese Health Care Report OCR
26-May, 2017
Asia Technology Office
Shinichi Hashitani
2. Executive Summary
• Google Cloud Vision API is an AIaaS offering from Google built on a machine learning engine.
It provides OCR (Optical Character Recognition), image classification, landmark
detection, and other features. The OCR feature is highly sophisticated and recognizes
Japanese characters with almost perfect accuracy.
• Its OCR capability is suited to standard document scanning. It does not recognize
multi-column document structure well and is not well suited to tabular documents. The
format of health care reports varies across medical institutions, but they are all primarily
tabular, which makes it difficult to extract meaningful data accurately. Only about 30% of
the text is extracted, and the extracted text is not structured well enough for processing.
• The OCR feature accepts no parameter other than the image itself; therefore, the response
JSON must be processed in-house. There are two primary approaches: 1. Text mining on the
full-text content section of the response. 2. Programmatic text composition based on each
character's coordinate information. For health care report scanning, neither approach is feasible.
• Google Cloud Vision API can still be used for standard-format documents (books,
whitepapers, academic writing, public announcements, etc.). It can also be used on publications
(newspapers, magazines, case study reports) for text mining purposes. Other formats were
also evaluated during the POC.
3. About Google Cloud Vision API
Google Cloud Vision API is a REST-based service, accessible from any system, in any language,
that can exchange JSON over HTTP. Requests are authenticated with either OAuth2
(recommended) or a Cloud API key.
The request format is common to all Cloud Vision services; a “type” must be specified to select
a specific feature. Multiple features can also be requested for the same image within a single
call. (In this case, each type specified is counted as one unit.)
The round-trip response time for an A4 page with 500 characters is 3 to 6 seconds.
The pricing model is per unit of task and relatively inexpensive. (1.5 USD per 1,000 units per
month; 1.0 USD beyond 20 million units per month; free below 1,000 units per month.)
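As a rough illustration of the pricing above, the following sketch estimates a monthly bill from the number of images sent and the number of feature types requested per image. It uses only the figures quoted in this report. The function name, and the reading that the tiers are marginal and that both rates are per 1,000 units, are assumptions made for illustration, not Google's published billing rules.

```python
def estimate_monthly_cost_usd(images: int, features_per_image: int = 1) -> float:
    """Estimate a monthly bill from the rates quoted in this report.

    Assumptions (not confirmed by the report): the first 1,000 units are
    free, tiers are marginal, and both rates apply per 1,000 units.
    """
    free_units = 1_000        # free allowance per month
    tier_limit = 20_000_000   # units above this use the lower rate
    rate_standard = 1.5       # USD per 1,000 units
    rate_volume = 1.0         # USD per 1,000 units beyond the tier limit

    # Each feature type requested on an image counts as one unit.
    units = images * features_per_image
    standard_units = max(min(units, tier_limit) - free_units, 0)
    volume_units = max(units - tier_limit, 0)
    return (standard_units / 1000 * rate_standard
            + volume_units / 1000 * rate_volume)
```

Under these assumptions, 10,000 images with one feature type each give 9,000 billable units, or 13.5 USD for the month.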
{"requests": [
{"image": {"content": image_base64},
"features": [{"type": "TEXT_DETECTION",
"maxResults": 1}]
}
]}
{"responses": [
….
{"blockType": "TEXT",
"boundingBox": {
"vertices": [
{"x": 594, "y": 327}
….
"text": "遥"
….
4. Google Cloud Vision API – Output format
The output is JSON composed of two sets of information.
1. Character (or small group of characters) information.
2. The restructured full text of the converted text.
The full description mimics the actual text structure by concatenating characters based on
their coordinates and appending a line break character at the end of each line.
The output confirms that the engine analyzes text character by character and can accurately
process text that mixes characters from different languages. At this point in time, this
capability is far superior to that of Microsoft Cognitive Services, which confuses alphabetic
characters with similar-looking Kanji characters.
{"boundingBox":
{"vertices": [{"x": 444, "y": 71}, {"x": 485, "y": 67}, {"x": 488, "y": 104}, {"x": 447, "y": 108}]},
"property": {"detectedLanguages":
[{"languageCode": "ja"}]},
"text": "基"
}
….
"textAnnotations": [
{"boundingPoly": {"vertices": [{"x": 24, "y": 62}, {"x": 1538, "y": 62}, {"x": 1538, "y": 3096}, {"x": 24, "y": 3096}]},
"description": "-基準値\n|今回ー前回ー前々回ー\n総合判定\n要経過観察!
要経過観察|要経過観察\nメタボリックシンドローム判定\n非該当 1予備群
該当1基準該当\n【心電図】不完全右脚ブロック\n甩\n血中脂質] LDLコレス
テロールやや高値。食べ過ぎに注意し、動物\n性脂肪や卵などコレステ
ロールの多いものを制限し、経過を見て下\nさい。\n[尿酸]尿酸が高めです。
注意してください。\n治療中の場合は、この結果表を主治医にお見せ下さ
い。\n総合判定医師名: 川口 毅 ーーーーーッ童\n総合所見\n",
"locale": "ja",
….
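The two sets of information described above can be pulled out of a parsed response with a few lines of Python. This is a minimal sketch, assuming `response` is one element of the top-level `responses` array and that per-character data is nested as `fullTextAnnotation` → `pages` → `blocks` → `paragraphs` → `words` → `symbols`. The excerpts in this report elide that nesting, so the path and the function names here are assumptions to verify against a real response.

```python
from typing import Any, Dict, Iterator, List, Tuple

# (character, list of bounding-box vertices such as {"x": 444, "y": 71})
Symbol = Tuple[str, List[Dict[str, int]]]


def iter_symbols(response: Dict[str, Any]) -> Iterator[Symbol]:
    """Yield every recognized character with its bounding-box vertices."""
    annotation = response.get("fullTextAnnotation", {})
    for page in annotation.get("pages", []):
        for block in page.get("blocks", []):
            for paragraph in block.get("paragraphs", []):
                for word in paragraph.get("words", []):
                    for symbol in word.get("symbols", []):
                        box = symbol.get("boundingBox", {})
                        yield symbol["text"], box.get("vertices", [])


def full_text(response: Dict[str, Any]) -> str:
    """Return the restructured full text (first textAnnotations entry)."""
    annotations = response.get("textAnnotations", [])
    return annotations[0]["description"] if annotations else ""
```

Keeping the two accessors separate mirrors the two processing approaches: one works on characters and coordinates, the other on the full text only.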
5. Google Cloud Vision API – Output Processing
Since Google Cloud Vision API does not support structured documents and does not accept any
additional information for processing, in-house output processing is needed in order to
extract the desired data from the output. There are two ways:
1. Restructure the data character by character, based on coordinates.
2. Text mining on the structured full text.
Which approach to take depends on the accuracy, the composition of the structured text, and
what needs to be extracted. Text mining is the simpler of the two methods.
{"boundingBox":
{"vertices": [{"x": 444, "y": 71}, {"x": 485, "y": 67}, {"x": 488, "y": 104}, {"x": 447, "y": 108}]},
"property": {"detectedLanguages": ….
"textAnnotations": [
…
"description": "-基準値\n|今回ー前回ー
前々回ー\n総合判定\n要経過観察!要経過観
察|要経過観察\n…
….
"要" + "経" + "過" + "観" + "察"
= "要経過観察"
"…\n総合判定\n要経過観察…"
= "要経過観察"
6. Output Processing – Text Restructuring
This approach processes the raw data. Like the structured full text provided in the output itself,
it restructures text by concatenating characters based on their coordinates.
Pros:
- Targets a specific area to be extracted. (Suitable for structured documents.)
- Less affected by the accuracy of the scan.
Cons:
- Requires complex logic to process. (Requires coordinate-based calculation for each string)
- Requires tailored logic for each type of document.
It is ideal for extracting a small amount of information from an entire document. The logic
depends on coordinates; therefore, it cannot process unstructured or semi-structured
documents. It also depends strongly on how the document is positioned in the scan; a
small misalignment can cause the logic to fail to fetch the characters it needs.
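A minimal sketch of this coordinate-based approach follows. It assumes the characters have already been collected as (character, vertices) pairs and that the target field is a single horizontal line of text; the function name and the region coordinates are hypothetical and would have to be tailored to each document layout.

```python
from typing import Dict, List, Tuple

# (character, list of bounding-box vertices such as {"x": 444, "y": 71})
Symbol = Tuple[str, List[Dict[str, int]]]


def text_in_region(symbols: List[Symbol], left: int, top: int,
                   right: int, bottom: int) -> str:
    """Concatenate the characters whose box center lies inside the region."""
    hits = []
    for char, vertices in symbols:
        if not vertices:
            continue  # no coordinates, so the character cannot be placed
        center_x = sum(v.get("x", 0) for v in vertices) / len(vertices)
        center_y = sum(v.get("y", 0) for v in vertices) / len(vertices)
        if left <= center_x <= right and top <= center_y <= bottom:
            hits.append((center_x, char))
    # Assumes one horizontal line of text: order characters left to right.
    return "".join(char for _, char in sorted(hits))
```

Because the region is fixed, a scan shifted by more than the margin built into the region silently returns the wrong characters or an empty string, which is the fragility noted above.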
7. Output Processing – Full Text Mining
Text mining disregards the coordinate information of each character. Instead, it takes the
restructured full text as input and searches through the string to extract text.
Pros:
- Logic is simple and known text mining techniques are directly applicable.
- The same logic can possibly be reused across multiple document formats.
Cons:
- The accuracy entirely depends on the accuracy of full text extraction.
- If the “key” text is not read, the value cannot be extracted either.
It is ideal for processing large text, especially for analytical purposes. It can still be used for
extracting a particular set of information if the accuracy of the extracted text is high.
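The key/value lookup described here can be sketched with a regular expression over the full text. This is an illustrative example only: it assumes the value sits on the line immediately after a line consisting of the key, as in the "総合判定" excerpt earlier in this report, and the function name is hypothetical.

```python
import re
from typing import Optional


def value_after_key(full_text: str, key: str) -> Optional[str]:
    """Return the line that immediately follows a line consisting of `key`."""
    pattern = r"^" + re.escape(key) + r"[ \t]*\n(.+)$"
    match = re.search(pattern, full_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None
```

If the key itself is misread or skipped by the OCR, the function returns `None`, which corresponds to the second drawback listed above.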
8. Google Cloud Vision API – Restructured Full Text
Google Cloud Vision is designed for standard single-column documents, reading and processing
from top to bottom, left to right. It cannot restructure the full text well if the document is
in a multi-column format.
Google Cloud Vision tries to read and process line by line. Therefore, an entire row is
output as one line, with the columns concatenated with spaces in between.
AAAAA\n
BBBBB EEEEE HHHHH\n
CCCCC FFFFF IIIII\n
DDDDD GGGGG JJJJJ\n
The sentence flows from B to C to D, but the text comes out as B, E, H. A word can
be split across two lines; therefore, words that span multiple lines cannot be
recognized correctly.
Also, when lines do not align horizontally across columns, or when the space between columns
is too wide, an entire sentence is often not processed.
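When every column does survive as a space-separated cell, the column order can in principle be rebuilt by re-splitting each row. The sketch below illustrates that idea on the layout above. It assumes cells contain no spaces and that no column was dropped, which the issues just described often violate, so it is not a general fix. The function name is hypothetical.

```python
from typing import List


def read_by_column(row_lines: List[str]) -> List[List[str]]:
    """Regroup space-separated row cells into columns, left to right."""
    rows = [line.split() for line in row_lines if line.strip()]
    width = max((len(row) for row in rows), default=0)
    # Column i collects the i-th cell of every row that has one.
    return [[row[i] for row in rows if i < len(row)] for i in range(width)]
```

For the three body rows of the example, this returns the B, C, D cells first, then E, F, G, then H, I, J, which is the order a reader would follow.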
9. Reading Health Care Report – POC Procedures
In this POC, an actual health care report was scanned by an MFP in both color and monochrome
modes. The Cloud Vision API was called from a Python program running on a local machine. The
same report was prepared in three modes (color, monochrome, and grayscale) at the same
resolution (300 dpi, JPEG). Since the MFP does not support grayscale, a color TIFF was
converted into a grayscale JPEG.
1. The program reads the image and encodes it into a text format (base64).
2. The program constructs a JSON request that includes the encoded image and sends it to the Cloud Vision API.
3. The Cloud Vision API processes the image and sends back the text in JSON format.
4. The program dumps the JSON response into a physical file for analysis.
[Diagram: Program (Python), annotated with steps 1 to 4]
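The four steps can be sketched in Python as follows. This is not the program used in the POC, which the report does not reproduce. It is a minimal standard-library example that assumes the public REST endpoint `https://vision.googleapis.com/v1/images:annotate` with API-key authentication; the function names and file paths are hypothetical.

```python
import base64
import json
import urllib.request
from typing import Any, Dict

# Assumed endpoint of the public REST API; check it against the API version in use.
ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_request(image_bytes: bytes) -> Dict[str, Any]:
    """Steps 1 and 2: base64-encode the image and wrap it in a request body."""
    content = base64.b64encode(image_bytes).decode("ascii")
    return {"requests": [{
        "image": {"content": content},
        "features": [{"type": "TEXT_DETECTION", "maxResults": 1}],
    }]}


def annotate(image_path: str, api_key: str, dump_path: str) -> Dict[str, Any]:
    """Steps 1 to 4: send one image to the API and dump the JSON response."""
    with open(image_path, "rb") as image_file:
        body = json.dumps(build_request(image_file.read())).encode("utf-8")
    request = urllib.request.Request(
        f"{ENDPOINT}?key={api_key}",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:  # step 3
        result = json.load(response)
    with open(dump_path, "w", encoding="utf-8") as dump_file:  # step 4
        json.dump(result, dump_file, ensure_ascii=False, indent=2)
    return result
```

Writing the dump with `ensure_ascii=False` keeps the Japanese text readable in the output file, which helps when comparing the result against the scanned page by eye.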
10. POC Result - Monochrome
- Overall read accuracy is very poor. The left-most pane is not scanned entirely.
- Only limited parts of the document are scanned. Where text is scanned, characters are recognized correctly in most cases.
- Traditional OCR works better with monochrome images, but this is not the case with Google Cloud Vision.
Legend: Correct / Incorrect / Not Scanned
11. POC Result - Grayscale
- Overall read accuracy is the worst among the three options.
- The left-most pane is recognized well; outlined characters are read as well.
- Only limited parts of the document are scanned. Where text is scanned, characters are recognized correctly in most cases.
Legend: Correct / Incorrect / Not Scanned
12. POC Result - Color
- Overall read accuracy is poor, but better than the other two options.
- The left-most pane is recognized well; outlined characters are read as well.
- Only limited parts of the document are scanned. Where text is scanned, characters are recognized correctly in most cases.
Legend: Correct / Incorrect / Not Scanned
13. POC Result – Summary
All patterns failed to deliver dependable results for production use.
- The results vary among the three patterns, but none of them recognized even half of the
fields of interest.
- Character recognition accuracy itself is high (around 95%). Still, it is not reliable enough for
production use.
Health care reports are often in a multi-pane/tabular format and are not suited to this solution.
- Due to the document structure, a large part of the document is not recognized as a text area
for scanning.
- Table column borders are wrongly recognized as characters.
- Table columns are often not fully scanned. (Whitespace between columns is recognized as
the end of a sentence.)
14. POC Result – Critical Issues
Rows are not scanned in a multi-column structure
- Since the entire image is scanned as a single-column paragraph, some rows are skipped
entirely, depending on how lines align across columns.
Table borders are often wrongly converted to “!” or “1”
- Since the scan is processed as a single line, a table border is also converted to a character,
normally “|”, but often to a meaningful value such as “1”.
- This happens by chance, and the wrongly converted character can alter the actual value.
(In the sample case, 80 was read as 180.)
15. POC Result – Critical Issues cont’d
Columns are skipped due to whitespaces between them.
- In a tabular format, the whitespace between column values is often treated as the end of the
line, and the remaining columns are not scanned.
16. Follow Up Case – Overview
Given that document structure significantly affects scan accuracy, the complexity of the
health care report makes it particularly challenging for Google Cloud Vision API to
process correctly.
An additional test was conducted in which the image was divided into three independent
images, so that a single 3-pane tabular image became three tabular images. Each divided image
was sent to Cloud Vision API as a separate request.
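The report does not say how the image was divided. As one possible sketch, the crop boxes for the panes can be computed as below, assuming the three panes are of equal width and span the full page height; real reports would need pane boundaries measured per layout. The function name is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, upper, right, lower) in pixels


def pane_boxes(width: int, height: int, panes: int = 3) -> List[Box]:
    """Split a page into equal-width vertical panes that cover every pixel."""
    edges = [round(i * width / panes) for i in range(panes + 1)]
    return [(edges[i], 0, edges[i + 1], height) for i in range(panes)]
```

The boxes follow the (left, upper, right, lower) convention, so they can be passed to an imaging library's crop function, such as Pillow's `Image.crop`, before each pane is encoded and sent as its own request.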
17. Follow Up Case - Result
- Read accuracy is significantly improved. Around 90% of the fields of interest are scanned.
- Character recognition accuracy is high, about the same level as in the previous cases.
- Still, all critical issues remain. (They are caused not by the multi-pane document structure but by the tabular format.)
Legend: Correct / Incorrect / Not Scanned
18. Overall Summary
Google Cloud Vision API is not suitable for health care report (HCR) scanning.
- The nature of the document structure prevents it from scanning the desired values.
- Due to some critical issues in tabular data scanning, incorrect values can be extracted.
- For HCR, neither the Text Restructuring nor the Full Text Mining approach can compensate
for the scanning inaccuracy.
By dividing or cropping the image and processing it in parts, Google Cloud Vision API could
possibly be used as part of a solution. However:
- Each image sent is counted as one request. The number of partial images per HCR multiplies
the cost and response time of the processing.
- A fair amount of effort is needed for pre-processing and post-processing in order to extract
the right set of data.
- The required logic depends strongly on the accuracy of the service. There is a high risk that
a change in Cloud Vision API behavior will affect the entire solution.
- By the same token, there is a chance that improvements to Google Cloud Vision API will
significantly simplify the overall solution. (Cloud Vision API is still in beta.)
19. Appendix 1 – Sample Scanning 1
Standard Report with a footer annotation
Scan Rate: 100%
Scan Accuracy (without punctuation): 100%
Scan Accuracy (with punctuation): 99%
Source: Reinsurance Trend Report by SOMPO Japan
Legend: Correct / Incorrect / Not Scanned
20. Appendix 1 – Sample Scanning 2
Standard Report within a single-column table
Scan Rate: 99%
Scan Accuracy (without punctuation): 100%
Scan Accuracy (with punctuation): 99%
Source: Overview on Japan Pension System by Ministry of Health, Labour, and Welfare
Legend: Correct / Incorrect / Not Scanned
21. Appendix 1 – Sample Scanning 3
Standard Report within a single-column table and a standard paragraph
Scan Rate: 100%
Scan Accuracy (without punctuation): 99%
Scan Accuracy (with punctuation): 99%
Source: Reinsurance Trend Report by SOMPO Japan
Legend: Correct / Incorrect / Not Scanned
22. Appendix 1 – Sample Scanning 4
Standard Report within a row-wide image
Scan Rate: 100%
Scan Accuracy (without punctuation): 100%
Scan Accuracy (with punctuation): 99%
Source: Overview on Japan Pension System by Ministry of Health, Labour, and Welfare
Legend: Correct / Incorrect / Not Scanned
23. Appendix 1 – Sample Scanning 5
Case Study Report in two columns with a row-wide image
Scan Rate: 94%
Scan Accuracy (without punctuation): 99%
Scan Accuracy (with punctuation): 98%
Source: IoT Case Study on Fujitsu i Network Systems by CISCO Solution
Legend: Correct / Incorrect / Not Scanned
24. Appendix 1 – Sample Scanning 6
Case Study Report in three columns with in-text images
Scan Rate: 93%
Scan Accuracy (without punctuation): 100%
Scan Accuracy (with punctuation): 99%
Source: IoT Case Study on Fujitsu i Network Systems by CISCO Solution
Legend: Correct / Incorrect / Not Scanned