Search by Screenshots for Universal Article Clipping in Mobile Apps

Kazutoshi Umemoto 1, Ruihua Song 2, Jian-Yun Nie 3,
Xing Xie 2, Katsumi Tanaka 1, Yong Rui 2
1 Kyoto University, 2 Microsoft Research Asia, 3 University of Montreal

Information Access from Mobile
http://gs.statcounter.com/press/mobile-and-tablet-internet-usage-exceeds-desktop-for-first-time-worldwide

Web Access Style: Desktop vs. Mobile
[Figure: app categories on mobile: social, news, food, travel, ⋯]

How can we assist the read-it-later behavior of mobile users?
• What a user reads or likes is scattered across different apps
• People have limited time to read articles in one sitting

Existing Solutions
There is no universal way to clip articles on mobile
• Clipping features vary by app: OneNote, Evernote, Pocket, URL copy, Email, …
• It is difficult for a clipping service to partner with every mobile app

Proposal: UniClip
[Figure: camera (screenshot) → Cloud: Search by Screenshots (core task) → Main Article Extractor]
Users only have to take one screenshot of the target article
Advantages
• UniClip only requires a single interaction
• UniClip allows users to save articles in one place
• UniClip is independent of app features

Core Task: Search by Screenshots
• Input
– One screenshot of a single article
– Any part is OK if it contains the article’s text
• Output
– A URL that corresponds to the given article
– Exactly the same page is desired (if identifiable from the screenshot)
Challenges
1. How to represent a given article screenshot in a tractable format
2. How to formulate queries effective for identifying the article
3. How to aggregate search results of multiple queries

Overview of Our Approaches
[Figure: Text recognition → Block segmentation → Attribute extraction → Query formulation → Result aggregation]
1. How to represent a given article screenshot in a tractable format
2. How to formulate queries effective for identifying the article
3. How to aggregate search results of multiple queries

Approaches
Block segmentation

Two-Phase Segmentation Algorithm
1. Merge adjacent lines detected by the OCR engine into candidate segments
2. Merge adjacent candidate segments into final segments
[Figure: Detected lines → (1) → Candidate segments → (2) → Final segments]

[Figure: merge criteria. Line1 and Line2 are compared by alignment, distance, and height; Segment1 and Segment2 are compared by distance, alignment, and height]

Before: OCR lines After: Segmented blocks

Baseline: OCR blocks Proposed: Segmented blocks
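
The two-phase merge described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the `Box` type, the `should_merge` test, the greedy top-to-bottom pass, and every threshold value are hypothetical. The slides only say that alignment, distance, and height are compared.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Box:
    """Axis-aligned text box: an OCR line or a merged segment (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float
    line_height: float  # height of one text line inside the box, assumed > 0


def should_merge(a: Box, b: Box, max_gap: float, max_height_ratio: float,
                 max_align_offset: float) -> bool:
    """Compare two vertically adjacent boxes by distance, height, and alignment."""
    gap = b.top - a.bottom
    close = gap <= max_gap * min(a.line_height, b.line_height)
    ratio = max(a.line_height, b.line_height) / min(a.line_height, b.line_height)
    similar_height = ratio <= max_height_ratio
    left_aligned = abs(a.left - b.left) <= max_align_offset
    centered = abs((a.left + a.right) - (b.left + b.right)) / 2 <= max_align_offset
    right_aligned = abs(a.right - b.right) <= max_align_offset
    return close and similar_height and (left_aligned or centered or right_aligned)


def merge_pass(boxes: List[Box], **thresholds: float) -> List[Box]:
    """Greedy top-to-bottom pass: fold each box into the previous one when allowed."""
    merged: List[Box] = []
    for box in sorted(boxes, key=lambda b: b.top):
        if merged and should_merge(merged[-1], box, **thresholds):
            prev = merged[-1]
            merged[-1] = Box(min(prev.left, box.left), prev.top,
                             max(prev.right, box.right), max(prev.bottom, box.bottom),
                             prev.line_height)
        else:
            merged.append(box)
    return merged


def segment(lines: List[Box]) -> List[Box]:
    # Phase 1: OCR lines -> candidate segments (threshold values are assumptions).
    candidates = merge_pass(lines, max_gap=0.8, max_height_ratio=1.2, max_align_offset=5.0)
    # Phase 2: candidate segments -> final segments, with a looser gap (also an assumption).
    return merge_pass(candidates, max_gap=1.5, max_height_ratio=1.2, max_align_offset=5.0)
```

With one tall title line followed by three closely spaced body lines, this sketch keeps the title apart and merges the body lines into one segment.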

Approaches
Attribute extraction

Role of Each Block in Article
• Screenshots may contain blocks unrelated to the main content
❯ Queries generated from such blocks are useless for retrieving the original article
• Classify blocks into 3 groups
❯ Title block
❯ Body block
― e.g., paragraphs
❯ Other block
― e.g., ads, toolbar

How to Estimate Block Attribute
1. Estimate the attribute of each line using a CRF
❯ Each line is regarded as one observation in the CRF model
2. Use majority voting over the line sequence in a block
❯ Example: #{Title lines} = 1, #{Body lines} = 3, #{Other lines} = 1, so the block is labeled Body
[Figure: a CRF chain over the lines of a block and their attributes (Title, Body, Other)]
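
The voting step can be illustrated with a few lines of Python. The per-line labels are assumed to come from an already trained CRF tagger, which is not shown here. The tie-breaking rule is an assumption of this sketch, since the slides do not say how ties are resolved.

```python
from collections import Counter
from typing import Sequence


def block_attribute(line_labels: Sequence[str]) -> str:
    """Return the most frequent line label in a block (majority voting)."""
    if not line_labels:
        raise ValueError("a block must contain at least one line")
    counts = Counter(line_labels)
    # Ties are broken by a fixed priority (an assumption, not stated in the slides).
    priority = {"Title": 0, "Body": 1, "Other": 2}
    return min(counts, key=lambda label: (-counts[label], priority.get(label, len(priority))))


# The example from the slide: 1 Title line, 3 Body lines, 1 Other line.
labels = ["Title", "Body", "Body", "Other", "Body"]
print(block_attribute(labels))  # Body
```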

Features
• Point-wise features (for a target line)
❯ Font size of words
❯ Recognition confidence of words
❯ Number of words
❯ Whether the line contains any punctuation
❯ Whether the line follows a certain format (e.g., upper-case, title-case)
❯ Vertical position
• Pair-wise features (for a target line and the previous line)
❯ Whether the alignment (left/center/right) of the two lines agrees
❯ Gap (line spacing) between the two lines
❯ Difference in height between the two lines
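
As a rough illustration of how such features might be computed from OCR output, consider the sketch below. The `OcrWord` and `OcrLine` types, the use of word height as a proxy for font size, the averaging over words, and the pixel tolerance are all assumptions made for this example. The slides list the features but not how they are computed.

```python
import string
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OcrWord:
    text: str
    height: float      # used here as a proxy for font size (assumption)
    confidence: float  # recognition confidence in [0, 1]


@dataclass
class OcrLine:
    words: List[OcrWord]  # assumed non-empty
    left: float
    right: float
    top: float
    bottom: float


def pointwise_features(line: OcrLine, page_height: float) -> Dict[str, float]:
    """Features of a single line."""
    text = " ".join(w.text for w in line.words)
    n = len(line.words)
    return {
        "font_size": sum(w.height for w in line.words) / n,
        "confidence": sum(w.confidence for w in line.words) / n,
        "num_words": float(n),
        "has_punct": float(any(ch in string.punctuation for ch in text)),
        "is_upper": float(text.isupper()),
        "is_title": float(text.istitle()),
        "vertical_pos": line.top / page_height,
    }


def pairwise_features(prev: OcrLine, cur: OcrLine, tol: float = 5.0) -> Dict[str, float]:
    """Features of a line relative to the previous line."""
    return {
        "same_left": float(abs(prev.left - cur.left) <= tol),
        "same_center": float(abs((prev.left + prev.right) - (cur.left + cur.right)) / 2 <= tol),
        "same_right": float(abs(prev.right - cur.right) <= tol),
        "gap": cur.top - prev.bottom,
        "height_diff": (cur.bottom - cur.top) - (prev.bottom - prev.top),
    }
```

Feature dictionaries of this shape are what common CRF toolkits accept as per-observation input, one dictionary per line.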

Approaches
Query formulation

Basic Approach
Observation
• Queries should be as unique as possible
• Queries that are too long do not return good results
• Simple query
❯ Formulate phrase queries from substrings of each block
❯ The length of each query must be within $[l_{\min}, l_{\max}]$
― ($l_{\min}$ and $l_{\max}$ are empirically set to 4 and 14)
• Compound query
❯ Simple queries generated from a single block may not be unique in some cases (e.g., cited paragraphs)
❯ Concatenate two (half-length) simple queries, each generated from a different block
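
One way to picture the two query types is the sketch below. It assumes that query length is counted in words and that phrases are cut as consecutive, non-overlapping windows. The slides do not specify either point, and the function names are hypothetical.

```python
from itertools import combinations
from typing import List

L_MIN, L_MAX = 4, 14  # bounds from the slides; counting in words is an assumption


def phrases(block_text: str, max_len: int, min_len: int) -> List[str]:
    """Cut a block into consecutive, non-overlapping word windows and quote each one."""
    words = block_text.split()
    windows = [words[i:i + max_len] for i in range(0, len(words), max_len)]
    return ['"' + " ".join(w) + '"' for w in windows if len(w) >= min_len]


def simple_queries(blocks: List[str]) -> List[str]:
    """Phrase queries, each taken from a single block."""
    return [q for block in blocks for q in phrases(block, L_MAX, L_MIN)]


def compound_queries(blocks: List[str]) -> List[str]:
    """Concatenate half-length phrases taken from two different blocks."""
    half_max, half_min = L_MAX // 2, L_MIN // 2
    per_block = [phrases(block, half_max, half_min) for block in blocks]
    queries: List[str] = []
    for a, b in combinations(per_block, 2):
        # Pair the i-th phrase of one block with the i-th phrase of the other.
        queries.extend(f"{qa} {qb}" for qa, qb in zip(a, b))
    return queries
```

A compound query built this way contains two quoted phrases, so a page must match text from both blocks, which helps when one block alone is a quotation that also appears elsewhere.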

How Simple/Compound Queries Are Generated
[Figure: simple queries and compound queries generated from a screenshot with one Title block, two Body blocks, and one Other block]

Advanced Approach
Take block attributes into account for query formulation
• Title is unique enough to distinguish a given article from others
• Body blocks are sometimes not unique enough (e.g., citations)
• Other blocks are usually noisy but may have useful content
• Hybrid query
❯ Title: simple query
❯ Body: compound query
❯ Other: simple query
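
The attribute-dependent dispatch can be sketched as below. The query generators are passed in as parameters (`simple` and `compound` are hypothetical callables), and the fallback to simple queries when fewer than two Body blocks exist is an assumption of this sketch rather than something the slides state.

```python
from typing import Callable, List, Tuple

Block = Tuple[str, str]  # (attribute, text); attribute is "Title", "Body", or "Other"
Generator = Callable[[List[str]], List[str]]


def hybrid_queries(blocks: List[Block], simple: Generator,
                   compound: Generator) -> List[Tuple[str, str]]:
    """Return (attribute, query) pairs: simple for Title/Other, compound for Body."""
    queries: List[Tuple[str, str]] = []
    for attr, text in blocks:
        if attr in ("Title", "Other"):
            queries += [(attr, q) for q in simple([text])]
    body_texts = [text for attr, text in blocks if attr == "Body"]
    # A compound query needs two different blocks; otherwise fall back (assumption).
    body_gen = compound if len(body_texts) >= 2 else simple
    queries += [("Body", q) for q in body_gen(body_texts)]
    return queries


# Toy generators, for illustration only.
def toy_simple(texts: List[str]) -> List[str]:
    return [f'"{t}"' for t in texts]


def toy_compound(texts: List[str]) -> List[str]:
    return [f'"{a}" "{b}"' for a, b in zip(texts, texts[1:])]


blocks = [("Title", "T"), ("Body", "B1"), ("Body", "B2"), ("Other", "O")]
result = hybrid_queries(blocks, toy_simple, toy_compound)
```

Keeping the attribute next to each query lets a later aggregation step weight results by the attribute of the block the query came from.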

[Figure: Hybrid method vs. Simple method]

Approaches
Result aggregation

Exploit Attribute and Rank
1. Score each search result based on its rank and the weight of the query's attribute
❯ $s(q, u_r) = \frac{w_{a(q)}}{r}$, where $u_r$ is the URL returned at rank $r$ for query $q$ and $w_{a(q)}$ is the weight of the attribute of $q$
2. Aggregate the scores for each URL over all queries $Q$
❯ $s(u) = \sum_{q \in Q} s(q, u)$
3. Output the URL with the highest aggregated score
❯ $\hat{u} = \arg\max_{u} s(u)$
• Good queries return answer URLs at high rank positions
• Bad queries return diverse URLs at different positions
• Queries from title/body blocks are more promising than others
[Figure: a query $q$ and its ranked results $u_1, \dots, u_5$]
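
A minimal sketch of this aggregation follows, assuming the per-result score is the weight of the query's attribute divided by the rank of the result. The weight values below are purely illustrative and do not come from the slides.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Illustrative attribute weights (hypothetical values).
WEIGHTS: Dict[str, float] = {"Title": 1.0, "Body": 1.0, "Other": 0.5}


def aggregate(results: List[Tuple[str, List[str]]],
              weights: Dict[str, float] = WEIGHTS) -> Optional[str]:
    """results holds one (query attribute, ranked URL list) pair per issued query."""
    scores: Dict[str, float] = defaultdict(float)
    for attr, urls in results:
        for rank, url in enumerate(urls, start=1):
            scores[url] += weights[attr] / rank  # weight of the attribute over the rank
    # Return the URL with the highest aggregated score, or None if nothing was retrieved.
    return max(scores, key=scores.get) if scores else None


results = [
    ("Title", ["u1", "u2"]),
    ("Body", ["u2", "u1"]),
    ("Other", ["u3", "u2"]),
]
best = aggregate(results)  # u1: 1.5, u2: 1.75, u3: 0.5
```

Here `u2` wins because it appears in all three result lists, even though it is ranked first in only one of them.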

Datasets
• Training dataset: 98 screenshots
❯ used to learn some parameters and to build the CRF model
• Testing dataset: 189 screenshots
❯ used to evaluate the effectiveness of each approach
• Ground truth: the relevance of output URLs is assessed manually

Block Segmentation: Setting
• Baselines
❯ Line: regard each OCR line as a segment
❯ Region: regard each OCR region as a segment
• Procedure
1. Manually group OCR lines into ground-truth clusters
2. Evaluate the clustering quality of each method
― with Purity, NMI, RI, Precision, Recall, and F1

Attribute Extraction: Setting
• Heuristic baseline
❯ Title
― select the biggest block after some filtering
❯ Body
― select sentence-like blocks, excluding the title block
• Procedure
1. Manually label the ground-truth attribute of each OCR line
2. Evaluate the classification performance of each method
― with Precision, Recall, and F1
[Figure annotations: "Too short"; "Low confidence due to blur"]
Attribute Extraction: Result
(Macro Average)
Attribute  Method     Precision  Recall  F1
Title      Heuristic  0.340      0.912   0.327
Title      CRF        0.928      0.919   0.868
Body       Heuristic  0.754      0.780   0.702
Body       CRF        0.967      0.880   0.893
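
A macro-averaged F1 need not equal the harmonic mean of the macro-averaged Precision and Recall, which may explain why some F1 values above are lower than both of the other columns. The sketch below shows the effect on toy numbers. It assumes the average is taken over per-item scores (for example, per screenshot), which the slides do not state.

```python
from typing import List, Tuple


def f1(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


def macro_average(per_item: List[Tuple[float, float]]) -> Tuple[float, float, float]:
    """Average precision, recall, and F1 separately over items."""
    n = len(per_item)
    mean_p = sum(p for p, _ in per_item) / n
    mean_r = sum(r for _, r in per_item) / n
    mean_f1 = sum(f1(p, r) for p, r in per_item) / n
    return mean_p, mean_r, mean_f1


# Toy (precision, recall) pairs, illustrative only.
items = [(1.0, 1.0), (0.0, 1.0)]
mean_p, mean_r, mean_f1 = macro_average(items)
# mean_p = 0.5, mean_r = 1.0, mean_f1 = 0.5, while f1(mean_p, mean_r) is about 0.667
```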

Query Formulation & Result Aggregation: Setting
• Methods
❯ Hybrid: hybrid query (i.e., w/ attribute)
❯ Simple: simple query (i.e., w/o attribute)
❯ Keyword: non-phrase query consisting of keywords extracted by TextRank
• Procedure
❯ Evaluate the retrieval performance of each method under different query budgets
― Measure: F1 (and RR@8)

Query Formulation & Result Aggregation: Result

Successful Cases
(1) Thanks to compound queries: the screenshot contains quotations from other pages
(2) Thanks to query weighting: only one block in the screenshot is useful

(Simple) User Study
• Implemented UniClip as an Android app
• Asked 22 participants to try the app and state their preference compared to other methods

Summary
• Approaches
❯ CRF-based attribute extraction for segmented blocks
❯ Attribute-dependent phrase query formulation
❯ Aggregation based on result rank and attribute weight
• Future work
❯ Improving efficiency by allowing only one query (done)
❯ Evaluation with larger datasets
❯ Leveraging the potential of screenshots for other tasks
UniClip: A Framework for Article Clipping in Mobile Devices
