This document proposes UniClip, a framework that allows mobile users to clip articles from any app using a single screenshot. It segments the screenshot into blocks, classifies each block by attribute using CRF, formulates queries from the blocks depending on attribute, and aggregates search results. Experiments show the CRF approach outperforms heuristics in attribute extraction. The hybrid query method using attributes performs best in retrieving articles, outperforming simple and keyword queries. A user study found participants preferred UniClip over other clipping methods.
Search by Screenshots for Universal Article Clipping in Mobile Apps
1. Search by Screenshots for Universal Article Clipping in Mobile Apps
Kazutoshi Umemoto 1, Ruihua Song 2, Jian-Yun Nie 3,
Xing Xie 2, Katsumi Tanaka 1, Yong Rui 2
1 Kyoto University, 2 Microsoft Research Asia, 3 University of Montreal
2. Information Access from Mobile
http://gs.statcounter.com/press/mobile-and-tablet-internet-usage-exceeds-desktop-for-first-time-worldwide
4. Web Access Style: Desktop vs. Mobile
social news food travel
⋯
How can we assist the read-it-later behavior of mobile users?
l What a user reads/likes is scattered across different apps
l People have limited time to read articles at a time
5. Existing Solutions
No universal way to clip articles on mobile
l Various features: OneNote, Evernote, Pocket, URL copy, Email, …
l Difficult for clipping service to get partnership with all mobile apps
7. Core Task: Search by Screenshots
• Input
– One screenshot of a single article
– Any part is OK if it contains the article’s text
• Output
– A URL that corresponds to the given article
– Exactly the same page is desired (if identifiable from the screenshot)
Challenges
1. How to represent a given article screenshot in a tractable format
2. How to formulate queries effective for identifying the article
3. How to aggregate search results of multiple queries
8. Overview of Our Approaches
Pipeline components: text recognition, block segmentation, attribute extraction, query formulation, result aggregation
1. How to represent a given article screenshot in a tractable format
2. How to formulate queries effective for identifying the article
3. How to aggregate search results of multiple queries
15. Role of Each Block in Article
l Screenshots may contain blocks unrelated to the main content
❯ Queries generated from such blocks are useless for retrieving the original article
l Classify blocks into 3 groups
❯ Title block
❯ Body block
― e.g., paragraphs
❯ Other block
― e.g., ads, toolbar
16. How to Estimate Block Attribute
1. Estimate the attribute of each line using a CRF
❯ One line is regarded as one observation in the CRF model
2. Use majority voting over the line sequence in a block
❯ Example: #{Title lines} = 1, #{Body lines} = 3, #{Other lines} = 1, so the block is labeled Body
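The majority-voting step can be sketched as follows. This is a minimal illustration that assumes per-line labels have already been predicted (for example by a CRF); the function name `block_attribute` and the tie-breaking rule are assumptions, not details from the slides.

```python
from collections import Counter


def block_attribute(line_labels):
    """Return the most frequent line label in a block (majority voting).

    Assumption: ties are broken in favor of the label seen first,
    which is how Counter.most_common orders equal counts.
    """
    if not line_labels:
        raise ValueError("a block must contain at least one line")
    return Counter(line_labels).most_common(1)[0][0]


# Counts from the slide example: 1 Title line, 3 Body lines, 1 Other line.
labels = ["Body", "Title", "Body", "Other", "Body"]
print(block_attribute(labels))
```

Voting over a block smooths out isolated line-level errors, such as a single body line mislabeled as Other.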
17. Features
l Point-wise features (for a targeted line)
❯ Font size of words
❯ Recognition confidence of words
❯ Number of words
❯ Does the line contain any punctuation?
❯ Does the line follow a certain format (e.g., upper-case, title-case)?
❯ Vertical position
l Pair-wise features (for a target line and the previous line)
❯ Whether the two lines share the same alignment (left/center/right)
❯ Gap (line space) between two lines
❯ Difference of height between two lines
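As a rough illustration of the pair-wise features, the sketch below computes alignment agreement, line gap, and height difference from two OCR line bounding boxes. The `Line` type, the pixel tolerance `tol`, and the feature names are hypothetical; the slides do not specify how these features are encoded.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Bounding box of one OCR line, in pixels (hypothetical layout)."""
    left: float
    right: float
    top: float
    bottom: float

    @property
    def height(self):
        return self.bottom - self.top

    @property
    def center(self):
        return (self.left + self.right) / 2


def pairwise_features(prev, cur, tol=5.0):
    """Features for a target line `cur` and the previous line `prev`."""
    return {
        # Alignment agreement: do the two lines share an edge or a center?
        "left_aligned": abs(prev.left - cur.left) <= tol,
        "center_aligned": abs(prev.center - cur.center) <= tol,
        "right_aligned": abs(prev.right - cur.right) <= tol,
        # Vertical gap (line space) between the two lines.
        "gap": cur.top - prev.bottom,
        # Difference in line height, a proxy for a font-size change.
        "height_diff": cur.height - prev.height,
    }


prev = Line(left=10, right=300, top=100, bottom=120)
cur = Line(left=12, right=250, top=128, bottom=148)
print(pairwise_features(prev, cur))
```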
19. Basic Approach
l Simple query
❯ Formulate phrase queries from substring of each block
❯ Length of each query must be in $[l_{\min}, l_{\max}]$
― ($l_{\min}$ and $l_{\max}$ are empirically set to 4 and 14)
l Compound query
❯ Simple queries generated from a single block may not be unique in some cases (e.g., cited paragraphs)
❯ Concatenate two (half-length) simple queries, each generated from a different block
Observation
l Queries should be as unique as possible
l Overly long queries do not return good results
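The simple and compound queries on this slide might be generated along these lines. The sketch assumes that query length is counted in words and that a block is cut into consecutive, non-overlapping chunks; neither detail is stated in the slides, and the function names are hypothetical.

```python
L_MIN, L_MAX = 4, 14  # bounds from the slide; the unit (words) is an assumption


def simple_queries(block_text, l_min=L_MIN, l_max=L_MAX):
    """Cut a block into quoted phrase queries of l_min..l_max words."""
    words = block_text.split()
    queries = []
    for start in range(0, len(words), l_max):
        chunk = words[start:start + l_max]
        if len(chunk) >= l_min:  # drop leftovers that are too short
            queries.append('"' + " ".join(chunk) + '"')
    return queries


def compound_query(block_a, block_b, l_max=L_MAX):
    """Concatenate two half-length phrases taken from different blocks."""
    half = l_max // 2
    first = " ".join(block_a.split()[:half])
    second = " ".join(block_b.split()[:half])
    return f'"{first}" "{second}"'


text = " ".join(f"w{i}" for i in range(30))
print(simple_queries(text))
print(compound_query("a b c d e f g h i", "j k l m n o p q"))
```

Quoting each phrase asks the search engine for an exact match, which is what makes a short substring distinctive enough to identify one article.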
20. How Simple/Compound Queries Are Generated
[Figure: how simple queries and compound queries are generated from Title, Body, and Other blocks]
21. Advanced Approach
l Hybrid query
❯ Title: simple query
❯ Body: compound query
❯ Other: simple query
Take block attribute into account for query formulation
l Title is unique enough to distinguish a given article from others
l Body blocks are sometimes not unique enough (e.g., citation)
l Other blocks are usually noisy but may have useful contents
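A hybrid formulation could dispatch on the block attribute as sketched below. The pairing of adjacent body blocks, the fallback for a single body block, and all names are assumptions made for illustration; the slides only state which query type each attribute uses.

```python
def phrase(text, max_words):
    """Quote the first max_words words of a block as a phrase query."""
    return '"' + " ".join(text.split()[:max_words]) + '"'


def hybrid_queries(blocks, l_max=14):
    """blocks: list of (attribute, text) pairs; returns (attribute, query) pairs.

    Title and Other blocks get simple queries; Body blocks get compound
    queries built from two half-length phrases of adjacent body blocks
    (the adjacency pairing is an assumption).
    """
    half = l_max // 2
    bodies = [text for attr, text in blocks if attr == "Body"]
    queries = []
    for attr, text in blocks:
        if attr != "Body":
            queries.append((attr, phrase(text, l_max)))
    if len(bodies) == 1:
        # Assumed fallback: no second body block to combine with.
        queries.append(("Body", phrase(bodies[0], l_max)))
    for first, second in zip(bodies, bodies[1:]):
        queries.append(("Body", phrase(first, half) + " " + phrase(second, half)))
    return queries


blocks = [
    ("Title", "Search by screenshots"),
    ("Body", "a b c d e f g h"),
    ("Body", "i j k l m n o p q"),
    ("Other", "Share this"),
]
for attr, query in hybrid_queries(blocks):
    print(attr, query)
```

Keeping the attribute next to each query lets a later aggregation step weight results by where the query came from.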
24. Exploit Attribute and Rank
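1. Score each search result based on its rank and the weight of the query's attribute
❯ $s(q, u_r) = \frac{w_{a(q)}}{r}$
― ($u_r$: URL at rank $r$ in the results of query $q$; $w_{a(q)}$: weight of the attribute of the block that $q$ comes from)
2. Aggregate the scores of each URL over all queries $Q$
❯ $s(u) = \sum_{q \in Q} s(q, u)$
3. Output the URL with the highest aggregated score
❯ $\hat{u} = \arg\max_{u} s(u)$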
l Good queries return answer URLs at high rank positions
l Bad queries return diverse URLs at different positions
l Queries from title/body blocks are more promising than others
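The scoring and aggregation steps on this slide can be sketched as a weighted reciprocal-rank sum. The weight values in `WEIGHTS` are placeholders chosen for the example (the slides say such parameters are learned from training data), and the function name is hypothetical.

```python
from collections import defaultdict

# Placeholder attribute weights; the actual values are not given in the slides.
WEIGHTS = {"Title": 1.0, "Body": 1.0, "Other": 0.5}


def aggregate(results, weights=WEIGHTS):
    """results: list of (query_attribute, ranked_urls) pairs, one per query.

    Each URL at rank r (1-based) receives weight / r from that query;
    scores are summed over all queries and the top URL is returned.
    """
    scores = defaultdict(float)
    for attr, urls in results:
        for rank, url in enumerate(urls, start=1):
            scores[url] += weights[attr] / rank
    if not scores:
        return None, {}
    best = max(scores, key=scores.get)
    return best, dict(scores)


results = [
    ("Title", ["a", "b"]),
    ("Body", ["a", "c"]),
    ("Other", ["d", "a"]),
]
best, scores = aggregate(results)
print(best, scores)
```

A URL that several queries return near the top accumulates score, while URLs that appear once in scattered positions stay low.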
26. Datasets
l Training dataset: 98 screenshots
❯ to learn some parameters and to build CRF model
l Testing dataset: 189 screenshots
❯ to evaluate effectiveness of each approach
l Ground-truth: Manually assess the relevance of output URLs
27. Block Segmentation: Setting
l Baselines
❯ Line: regard each OCR line as a segment
❯ Region: regard each OCR region as a segment
l Procedure
1. Manually group OCR lines into ground-truth clusters
2. Evaluate clustering quality of each method
― with Purity, NMI, RI, Precision, Recall, and F1
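Of the clustering measures listed, purity is the simplest to state. The sketch below uses the standard definition: each predicted segment is credited with its most common ground-truth cluster. The function name and the toy labels are illustrative only.

```python
from collections import Counter, defaultdict


def purity(predicted, truth):
    """Fraction of lines that fall in the majority ground-truth cluster
    of their predicted segment."""
    members = defaultdict(list)
    for seg, gold in zip(predicted, truth):
        members[seg].append(gold)
    matched = sum(Counter(golds).most_common(1)[0][1] for golds in members.values())
    return matched / len(truth)


predicted = [0, 0, 1, 1, 1]        # segment id of each OCR line
truth = ["a", "a", "a", "b", "b"]  # ground-truth cluster of each OCR line
print(purity(predicted, truth))

# Treating every line as its own segment always gives purity 1.0.
print(purity(list(range(5)), truth))
```

Because one-line segments are trivially pure, purity alone would favor the Line baseline, which is one reason to report it alongside pair-based measures such as RI, Precision, Recall, and F1.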
29. Attribute Extraction: Setting
[Figure callouts: too short; low recognition confidence due to blur]
l Heuristic baseline
❯ Title
― select the biggest block after some filtering
❯ Body
― select sentence-like blocks, excluding the title block
l Procedure
1. Manually label the ground-truth attribute of each OCR line
2. Evaluate classification performance of each method
― with Precision, Recall, and F1
30. Attribute Extraction: Result
(Macro average)
Attribute  Method     Precision  Recall  F1
Title      Heuristic  0.340      0.912   0.327
Title      CRF        0.928      0.919   0.868
Body       Heuristic  0.754      0.780   0.702
Body       CRF        0.967      0.880   0.893
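In the table, F1 is not the harmonic mean of the reported Precision and Recall (for Title/Heuristic, the harmonic mean of 0.340 and 0.912 is about 0.495, not 0.327). That is consistent with macro averaging, where each measure is averaged separately. The sketch below shows the effect on two made-up items; what the averaging unit is (for example, one screenshot) is an assumption, and the numbers are not from the experiments.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up (precision, recall) pairs for two evaluation items.
per_item = [(0.1, 1.0), (0.6, 0.8)]

macro_p = sum(p for p, _ in per_item) / len(per_item)
macro_r = sum(r for _, r in per_item) / len(per_item)
macro_f1 = sum(f1(p, r) for p, r in per_item) / len(per_item)

print(macro_p, macro_r, macro_f1)
print(f1(macro_p, macro_r))  # larger than macro_f1 in this example
```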
31. Query Formulation & Result Aggregation: Setting
l Methods
❯ Hybrid: hybrid query (i.e., w/ attribute)
❯ Simple: simple query (i.e., w/o attribute)
❯ Keyword: non-phrase query consisting of keywords extracted by TextRank
l Procedure
❯ Evaluate retrieval performance of each method with different query budgets
― Measure: F1 (and RR@8)
34. (Simple) User Study
l Implement UniClip as an Android app
l Ask 22 participants to try our app and state their preference compared with other methods
35. Summary
l Approaches
❯ CRF-based attribute extraction for segmented blocks
❯ Attribute-dependent phrase query formulation
❯ Aggregation based on result rank and attribute weight
l Future work
❯ Improving efficiency by allowing only one query (done)
❯ Evaluation with larger datasets
❯ Leveraging the potential of screenshots for other tasks
[Figure: UniClip: A Framework for Article Clipping in Mobile Devices (camera, cloud, Search by Screenshots, Main Article Extractor)]