Search by Screenshots for Universal Article Clipping in Mobile Apps
Kazutoshi Umemoto (1), Ruihua Song (2), Jian-Yun Nie (3), Xing Xie (2), Katsumi Tanaka (1), Yong Rui (2)
(1) Kyoto University, (2) Microsoft Research Asia, (3) University of Montreal
Information Access from Mobile
http://gs.statcounter.com/press/mobile-and-tablet-internet-usage-exceeds-desktop-for-first-time-worldwide
Web Access Style: Desktop vs. Mobile
[Figure: mobile web access by app category: social, news, food, travel, ⋯]
How can we assist the read-it-later behavior of mobile users?
• What a user reads or likes is scattered across different apps
• People have limited time to read articles in one sitting
Existing Solutions
No universal way to clip articles on mobile
• Various features: OneNote, Evernote, Pocket, URL copy, Email, …
• It is difficult for a clipping service to form partnerships with all mobile apps
Proposal: UniClip
[Figure: UniClip architecture. Labels: camera, Cloud, Search by Screenshots (core task), Main Article Extractor]
Users only have to take one screenshot of the target article
Advantages
• UniClip only requires a single interaction
• UniClip allows users to save articles in one place
• UniClip is independent of app features
Core Task: Search by Screenshots
• Input
– One screenshot of a single article
– Any part is OK if it contains the article’s text
• Output
– A URL that corresponds to the given article
– Exactly the same page is desired (if identifiable from the screenshot)
Challenges
1. How to represent a given article screenshot in a tractable format
2. How to formulate queries effective for identifying the article
3. How to aggregate search results of multiple queries
Overview of Our Approaches
[Figure: processing stages that address the three challenges: text recognition, block segmentation, attribute extraction, query formulation, result aggregation]
Approaches: Block segmentation
Two-Phase Segmentation Algorithm
1. Merge adjacent lines detected by the OCR engine into candidate segments
2. Merge adjacent candidate segments into final segments
[Figure: detected lines, candidate segments, final segments. Both merge steps compare alignment, distance, and height, first between two lines and then between two segments]
[Figure: before (OCR lines) vs. after (segmented blocks)]
[Figure: baseline (OCR blocks) vs. proposed (segmented blocks)]
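The two merge phases could be sketched roughly as follows. This is only an illustration under stated assumptions: the slides name alignment, distance, and height as the merge criteria but give no rules or thresholds, so the `Line` type, the `can_merge` test, and every threshold value below are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Line:
    """Bounding box of one OCR line (hypothetical representation)."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top


Segment = List[Line]


def can_merge(upper: Line, lower: Line,
              max_gap: float, max_offset: float, max_ratio: float) -> bool:
    """Compare distance, alignment, and height, all relative to line height."""
    h = min(upper.height, lower.height)
    distance_ok = (lower.top - upper.bottom) <= max_gap * h
    alignment_ok = abs(upper.left - lower.left) <= max_offset * h
    height_ok = max(upper.height, lower.height) <= max_ratio * h
    return distance_ok and alignment_ok and height_ok


def merge_adjacent(segments: List[Segment], **thresholds: float) -> List[Segment]:
    """Greedily merge each segment into the previous one when they look alike."""
    merged: List[Segment] = []
    for seg in segments:
        if merged and can_merge(merged[-1][-1], seg[0], **thresholds):
            merged[-1] = merged[-1] + seg
        else:
            merged.append(list(seg))
    return merged


def segment_blocks(lines: List[Line]) -> List[Segment]:
    ordered = sorted(lines, key=lambda ln: ln.top)
    # Phase 1: strict thresholds turn single lines into candidate segments.
    candidates = merge_adjacent([[ln] for ln in ordered],
                                max_gap=0.6, max_offset=0.5, max_ratio=1.2)
    # Phase 2: looser (assumed) thresholds merge adjacent candidates.
    return merge_adjacent(candidates,
                          max_gap=1.2, max_offset=2.0, max_ratio=1.2)
```

With these assumed thresholds, an indented paragraph line that phase 1 keeps apart is joined to the preceding lines in phase 2, while a taller title line stays in its own block.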
Approaches: Attribute extraction
Role of Each Block in Article
• Screenshots may contain blocks unrelated to the main content
❯ Queries generated from such blocks are useless for retrieving the original article
• Classify blocks into 3 groups
❯ Title block
❯ Body block
― e.g., paragraphs
❯ Other block
― e.g., ads, toolbar
How to Estimate Block Attribute
1. Estimate the attribute of each line using a CRF
❯ Each line is regarded as one observation in the CRF model
2. Use majority voting over the line sequence in a block
❯ Example: #{Title lines} = 1, #{Body lines} = 3, #{Other lines} = 1, so the block is labeled Body
[Figure: CRF chain assigning an attribute (Title, Body, Other) to each line]
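The voting step can be illustrated with a few lines of Python. The per-line labels are taken as given here (in the slides they come from the CRF); the function name and the tie-breaking behavior (the label seen first wins) are assumptions, since the slides do not say how ties are resolved.

```python
from collections import Counter
from typing import List


def block_attribute(line_labels: List[str]) -> str:
    """Return the most frequent line label in a block.

    Ties go to the label that appears first in the block, which is an
    assumption of this sketch, not something stated in the slides.
    """
    counts = Counter(line_labels)
    return counts.most_common(1)[0][0]


# The example from the slide: 1 Title line, 3 Body lines, 1 Other line.
print(block_attribute(["Body", "Body", "Other", "Body", "Title"]))  # Body
```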
Features
• Point-wise features (for a target line)
❯ Font size of words
❯ Recognition confidence of words
❯ Number of words
❯ Does the line contain any punctuation?
❯ Does the line follow a certain format (e.g., upper-case, title-case)?
❯ Vertical position
• Pair-wise features (for a target line and the previous line)
❯ Whether the alignment (left/center/right) of the two lines agrees
❯ Gap (line spacing) between the two lines
❯ Difference in height between the two lines
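A rough sketch of how these features might be computed for one OCR line is shown below. The `Word` and `OcrLine` types, the use of word height as a proxy for font size, the normalization by page height, and the crude margin-based alignment test are all assumptions made for illustration; the slides list the features but not how they are computed.

```python
import string
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Word:
    text: str
    height: float       # used as a proxy for font size (assumption)
    confidence: float   # OCR recognition confidence in [0, 1]


@dataclass
class OcrLine:
    words: List[Word]
    left: float
    right: float
    top: float
    bottom: float


def alignment(line: OcrLine, page_width: float, tol: float = 0.05) -> str:
    """Crude alignment guess from the left and right margins."""
    left_margin = line.left
    right_margin = page_width - line.right
    if abs(left_margin - right_margin) <= tol * page_width:
        return "center"
    return "left" if left_margin < right_margin else "right"


def line_features(line: OcrLine, prev: Optional[OcrLine],
                  page_width: float, page_height: float) -> Dict[str, float]:
    text = " ".join(w.text for w in line.words)
    n = len(line.words)
    feats = {
        # Point-wise features of the target line.
        "font_size": sum(w.height for w in line.words) / n,
        "confidence": sum(w.confidence for w in line.words) / n,
        "num_words": float(n),
        "has_punct": float(any(c in string.punctuation for c in text)),
        "is_upper": float(text.isupper()),
        "is_title": float(text.istitle()),
        "v_pos": line.top / page_height,
    }
    if prev is not None:
        # Pair-wise features with the previous line.
        same = alignment(line, page_width) == alignment(prev, page_width)
        feats["same_alignment"] = float(same)
        feats["gap"] = line.top - prev.bottom
        feats["height_diff"] = (line.bottom - line.top) - (prev.bottom - prev.top)
    return feats
```

Feature dictionaries of this shape are what sequence-labeling toolkits typically accept, one dictionary per line and one list of dictionaries per screenshot.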
Approaches: Query formulation
Basic Approach
Observation
• Queries should be as unique as possible
• Overly long queries do not return good results
• Simple query
❯ Formulate phrase queries from substrings of each block
❯ The length of each query must lie in $[l_{\min}, l_{\max}]$
― ($l_{\min}$ and $l_{\max}$ are empirically set to 4 and 14)
• Compound query
❯ Simple queries generated from a single block may not be unique in some cases (e.g., cited paragraphs)
❯ Concatenate two (half-length) simple queries, each generated from a different block
How Simple/Compound Queries Are Generated
[Figure: example blocks (Title, Body, Body, Other) with numbered words, and the simple queries and compound queries formed from them]
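One possible reading of simple and compound query generation is sketched below. Several details are assumptions: query length is measured in words, each block is cut into consecutive non-overlapping windows, and a compound query pairs the first half-length phrase of two different blocks. The slides only fix the length bounds (4 and 14) and the idea of concatenating two half-length queries from different blocks.

```python
from itertools import combinations
from typing import List

L_MIN, L_MAX = 4, 14  # bounds from the slides; word units are an assumption


def simple_queries(block: str, l_min: int = L_MIN, l_max: int = L_MAX) -> List[str]:
    """Cut a block into quoted phrase queries of l_min..l_max words."""
    words = block.split()
    chunks = [words[i:i + l_max] for i in range(0, len(words), l_max)]
    return ['"' + " ".join(c) + '"' for c in chunks if len(c) >= l_min]


def compound_queries(blocks: List[str], l_min: int = L_MIN, l_max: int = L_MAX) -> List[str]:
    """Pair one half-length phrase from each of two different blocks."""
    half = l_max // 2
    heads = [simple_queries(b, l_min, half)[:1] for b in blocks]
    return [a[0] + " " + b[0] for a, b in combinations(heads, 2) if a and b]
```

Blocks shorter than the lower bound produce no query at all, which also filters out many toolbar or button labels.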
Advanced Approach
Take block attributes into account for query formulation
• Title is unique enough to distinguish a given article from others
• Body blocks are sometimes not unique enough (e.g., citations)
• Other blocks are usually noisy but may contain useful content
• Hybrid query
❯ Title: simple query
❯ Body: compound query
❯ Other: simple query
[Figure: example comparing the hybrid method and the simple method]
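The hybrid strategy could be sketched as a dispatch on the block attribute. As before, word-based lengths and the windowing in `phrases` are assumptions, and so is the fallback to simple queries when a screenshot has fewer than two usable Body blocks; the slides do not cover that case.

```python
from itertools import combinations
from typing import List, Tuple


def phrases(text: str, l_min: int, l_max: int) -> List[str]:
    """Quoted phrase queries from consecutive word windows (word units assumed)."""
    words = text.split()
    chunks = [words[i:i + l_max] for i in range(0, len(words), l_max)]
    return ['"' + " ".join(c) + '"' for c in chunks if len(c) >= l_min]


def hybrid_queries(blocks: List[Tuple[str, str]],
                   l_min: int = 4, l_max: int = 14) -> List[Tuple[str, str]]:
    """blocks holds (attribute, text) pairs; returns (attribute, query) pairs."""
    queries: List[Tuple[str, str]] = []
    body_texts: List[str] = []
    for attr, text in blocks:
        if attr == "Body":
            body_texts.append(text)
        else:
            # Title and Other blocks use simple queries.
            queries += [(attr, q) for q in phrases(text, l_min, l_max)]
    half = l_max // 2
    heads = [p[0] for p in (phrases(t, l_min, half) for t in body_texts) if p]
    if len(heads) >= 2:
        # Body blocks use compound queries built from two different blocks.
        queries += [("Body", a + " " + b) for a, b in combinations(heads, 2)]
    else:
        # Assumption: fall back to simple queries when pairing is impossible.
        for t in body_texts:
            queries += [("Body", q) for q in phrases(t, l_min, l_max)]
    return queries
```

Each query keeps the attribute of the block it came from, so a later aggregation step can weight it accordingly.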
Approaches: Result aggregation
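Exploit Attribute and Rank
• Good queries return answer URLs at high rank positions
• Bad queries return diverse URLs at different positions
• Queries from title/body blocks are more promising than others
1. Score each search result based on its rank and the weight of the query's attribute
❯ $s(q, u_r) = \frac{w_{a(q)}}{r}$, where $u_r$ is the URL returned at rank $r$ for query $q$ and $w_{a(q)}$ is the weight of the attribute of the block that produced $q$
2. Aggregate the scores for each URL over all queries $Q$
❯ $s(u) = \sum_{q \in Q} s(q, u)$
3. Output the URL with the highest aggregated score
❯ $\hat{u} = \arg\max_{u} s(u)$
[Figure: a query $q$ and its ranked result URLs $u_1, \dots, u_5$]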
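The three aggregation steps might look like this in code. The scoring rule used here, attribute weight divided by rank, is this sketch's reading of the slide, and the weight values are purely hypothetical; the slides say only that weights depend on the query's attribute and that some parameters are learned from training data.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical attribute weights; the real values are not given in the slides.
WEIGHTS = {"Title": 3.0, "Body": 2.0, "Other": 1.0}


def aggregate(results: List[Tuple[str, List[str]]],
              weights: Dict[str, float] = WEIGHTS
              ) -> Tuple[Optional[str], Dict[str, float]]:
    """results holds one (query attribute, ranked URL list) pair per query."""
    scores: Dict[str, float] = defaultdict(float)
    for attr, urls in results:
        for rank, url in enumerate(urls, start=1):
            # Step 1: score by rank and attribute weight.
            # Step 2: sum the scores per URL across all queries.
            scores[url] += weights[attr] / rank
    if not scores:
        return None, {}
    # Step 3: output the URL with the highest aggregated score.
    best = max(scores, key=scores.get)
    return best, dict(scores)


best, scores = aggregate([
    ("Title", ["a", "b"]),
    ("Body", ["b", "a"]),
    ("Other", ["c"]),
])
print(best, scores)  # a {'a': 4.0, 'b': 3.5, 'c': 1.0}
```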
Experiments
Datasets
• Training dataset: 98 screenshots
❯ to learn some parameters and to build the CRF model
• Testing dataset: 189 screenshots
❯ to evaluate the effectiveness of each approach
• Ground truth: manually assessed relevance of output URLs
Block Segmentation: Setting
• Baselines
❯ Line: regard each OCR line as a segment
❯ Region: regard each OCR region as a segment
• Procedure
1. Manually group OCR lines into ground-truth clusters
2. Evaluate the clustering quality of each method
― with Purity, NMI, RI, Precision, Recall, and F1
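As an illustration of the first measure, purity can be computed from two label lists, one giving the predicted segment of each OCR line and one giving its ground-truth cluster. The function name and the integer encoding of segments are assumptions of this sketch.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def purity(predicted: List[int], truth: List[int]) -> float:
    """Fraction of lines that fall in the majority true cluster of their segment."""
    by_segment: Dict[int, Counter] = defaultdict(Counter)
    for p, t in zip(predicted, truth):
        by_segment[p][t] += 1
    return sum(max(c.values()) for c in by_segment.values()) / len(truth)


truth = [0, 0, 1, 1, 1, 1]
print(purity([0, 0, 0, 1, 1, 2], truth))  # 5 of 6 lines are in a majority cluster
print(purity([0, 1, 2, 3, 4, 5], truth))  # 1.0
```

The second call shows that the Line baseline, where every line is its own segment, is perfectly pure by construction, which is one reason to report purity alongside the other measures.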
Block Segmentation: Result
[Figure: block segmentation results]
Attribute Extraction: Setting
• Heuristic baseline
❯ Title
― select the biggest block after some filtering
❯ Body
― select sentence-like blocks except for the title block
• Procedure
1. Manually label the ground-truth attribute of each OCR line
2. Evaluate the classification performance of each method
― with Precision, Recall, and F1
[Figure: example screenshot with blocks annotated "Too short" and "Low confidence due to blur"]
Attribute Extraction: Result (Macro Average)
Attribute | Method    | Precision | Recall | F1
Title     | Heuristic | 0.340     | 0.912  | 0.327
Title     | CRF       | 0.928     | 0.919  | 0.868
Body      | Heuristic | 0.754     | 0.780  | 0.702
Body      | CRF       | 0.967     | 0.880  | 0.893
Query Formulation & Result Aggregation: Setting
• Methods
❯ Hybrid: hybrid query (i.e., w/ attribute)
❯ Simple: simple query (i.e., w/o attribute)
❯ Keyword: non-phrase query consisting of keywords extracted by TextRank
• Procedure
❯ Evaluate the retrieval performance of each method with different query budgets
― Measure: F1 (and RR@8)
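The slides do not define RR@8. A common reading is the reciprocal rank of the first relevant URL within the top 8 of the aggregated ranking, and 0 if none appears; the sketch below assumes that reading, and the function name is hypothetical.

```python
from typing import List, Set


def reciprocal_rank_at_k(ranked_urls: List[str], relevant: Set[str], k: int = 8) -> float:
    """1 / rank of the first relevant URL in the top k, or 0.0 if there is none."""
    for rank, url in enumerate(ranked_urls[:k], start=1):
        if url in relevant:
            return 1.0 / rank
    return 0.0


print(reciprocal_rank_at_k(["x", "y", "answer"], {"answer"}))  # 0.3333333333333333
```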
Query Formulation & Result Aggregation: Result
[Figure: query formulation and result aggregation results]
Successful Cases
(1) Thanks to compound queries: the screenshot contains quotations from other pages
(2) Thanks to query weighting: only one block in the screenshot is useful
(Simple) User Study
• Implement UniClip as an Android app
• Ask 22 participants to try our app and give their preference compared to other methods
Summary
• Approaches
❯ CRF-based attribute extraction for segmented blocks
❯ Attribute-dependent phrase query formulation
❯ Aggregation based on result rank and attribute weight
• Future work
❯ Improving efficiency by allowing only one query (done)
❯ Evaluation with larger datasets
❯ Leveraging the potential of screenshots for other tasks
UniClip: A Framework for Article Clipping in Mobile Devices
