Efficiently Estimating Statistics of Points of
Interest on Maps – Wang, He, Liu (2016)
Alex Klibisz, alex.klibisz.com, UTK STAT645
November 10, 2016
Motivation
Point-of-interest (PoI) data is valuable.
Google, Foursquare, Baidu, etc. collect user activity and feedback for
PoIs.
This data can help businesses understand their market, choose locations,
gauge consumer preferences, etc.
Public-facing APIs have restrictions
Number of PoIs returned, query frequency, area size, no guarantee of
underlying sampling.
Propose methods to sample and approximate aggregate statistics
Emphasize query efficiency, aggregate statistic error.
Notation
A – Area of interest (e.g. a city).
P – Set of PoIs in A (e.g. all hotels).
k – Maximum number of PoIs returned (API constraint).
Fully-accessible region – when queried, returns < k PoIs.
For example, with k = 50, a region is fully accessible when a query returns at most 49 PoIs.
δ – minimum acceptable lat. and lon. precision of map APIs.
Hit-ratio – probability of sampling a non-empty sub-region from A.
ρ is the fraction of non-empty sub-regions in A → 1/ρ queries to get a
non-empty sub-region.
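A small worked example of the 1/ρ claim: if each uniformly sampled sub-region is non-empty with probability ρ, independently of earlier draws, the number of queries until the first non-empty one is geometric with mean 1/ρ. The simulation below is only a sketch under that independence assumption; the function names and the value ρ = 0.2 are illustrative and not taken from the paper.

```python
import random

def queries_until_hit(rho, rng):
    """Count uniform draws until the first non-empty sub-region is sampled."""
    n = 1
    while rng.random() >= rho:   # a draw is a hit with probability rho
        n += 1
    return n

def mean_queries(rho, trials=100_000, seed=0):
    """Empirical mean number of queries over many independent runs."""
    rng = random.Random(seed)
    return sum(queries_until_hit(rho, rng) for _ in range(trials)) / trials

rho = 0.2
print("theory:", 1 / rho, "simulated:", mean_queries(rho))
```

The simulated mean should land close to 1/ρ, which is why a low hit-ratio makes naive region sampling expensive.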
Problem Statement
Given
1. Area of interest A containing PoIs P.
2. API-specific query restrictions.
Estimate
1. Sum Aggregate: the sum of an attribute across P.
2. Average Aggregate: the average of an attribute across P.
3. PoI Distribution: distribution of an attribute across P.
4. n(A), number of PoIs in A.¹
¹ The paper does not directly present how to estimate this, but it is used for evaluation.
Datasets
The Baidu and Foursquare datasets are taken from two other publications, with
no clear indication of their time span.
Algorithms
1. Random Region Zoom-In, RRZI
2. RRZI w/ Count Information, RRZIC
3. Uniform Region Sampling, RRZI(C)_URS
4. Metropolis-Hastings Weighted Region Sampling, RRZIC_MHWRS
RRZI Algorithm
1. Sample region Q from A at random.
2. Divide Q into two sub-regions Q0, Q1 without overlap.
3. Randomly select a non-empty sub-region as the next region to query.
4. Query the selected region.
5. Repeat until a fully-accessible sub-region is found.
Characteristics
Typically run RRZI until m fully-accessible sub-regions are found.
How to divide Q into Q0, Q1? – Equations (1), (2).
How to determine if Q0, Q1 are empty? – Store prior PoIs.
Correcting for sampling bias? – Counter τ records probability of
sampling each sub-region.
Maximum number of queries? – H_max = log(L_x/δ) + log(L_y/δ)
L_x and L_y are the x, y dimensions; δ is the degree granularity.
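A quick worked instance of the H_max bound. The slide does not state the base of the logarithm; base 2 is assumed here because each division halves one side of the region. The dimensions and granularity are made-up values, and `h_max` is a hypothetical helper name.

```python
import math

def h_max(lx, ly, delta):
    """Zoom-in depth bound: halvings per axis until a side reaches delta.

    Assumes base-2 logarithms (one halving per division step).
    """
    return math.log2(lx / delta) + math.log2(ly / delta)

# A 4.0 x 2.0 degree area with 0.5 degree granularity:
# log2(4.0 / 0.5) + log2(2.0 / 0.5) = 3 + 2
print(h_max(4.0, 2.0, 0.5))  # 5.0
```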
Binary search over a 2-d array
Seems like random binary search until a fully-accessible sub-region is found.
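The RRZI steps above can be sketched as follows. This is a simplified illustration, not the authors' implementation: `query` is a hypothetical stand-in for a map API that truncates results at k, the split rule (alternate axes, cut at the midpoint) is an assumption in place of Equations (1) and (2), and emptiness is checked by reusing stored PoIs first and issuing an extra query only when they are inconclusive.

```python
import random

def inside(p, region):
    x0, y0, x1, y1 = region
    return x0 <= p[0] < x1 and y0 <= p[1] < y1

def query(points, region, k):
    """Simulated map API (hypothetical): at most k PoIs inside the region."""
    return [p for p in points if inside(p, region)][:k]

def split(region, depth):
    """Halve the region, alternating axes (assumed rule, not Eq. (1)-(2))."""
    x0, y0, x1, y1 = region
    if depth % 2 == 0:
        mx = (x0 + x1) / 2
        return (x0, y0, mx, y1), (mx, y0, x1, y1)
    my = (y0 + y1) / 2
    return (x0, y0, x1, my), (x0, my, x1, y1)

def rrzi(points, area, k, rng, max_depth=64):
    """Return (fully-accessible region, its PoIs, sampling probability tau)."""
    region, tau = area, 1.0
    result = query(points, region, k)
    for depth in range(max_depth):
        if len(result) < k:              # fewer than k returned: fully accessible
            return region, result, tau
        candidates = []
        for sub in split(region, depth):
            # Stored PoIs prove non-emptiness; otherwise spend one extra query.
            if any(inside(p, sub) for p in result) or query(points, sub, k):
                candidates.append(sub)
        region = rng.choice(candidates)  # uniform among non-empty sub-regions
        tau /= len(candidates)           # probability of the path taken so far
        result = query(points, region, k)
    raise RuntimeError("depth limit reached before a fully-accessible region")

rng = random.Random(0)
pois = [(rng.random(), rng.random()) for _ in range(1000)]
region, found, tau = rrzi(pois, (0.0, 0.0, 1.0, 1.0), k=50, rng=rng)
print(region, len(found), tau)
```

Running `rrzi` m times yields m (region, PoIs, τ) triples; τ is what the estimators later use to correct for the unequal chance of reaching each region.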
RRZI Example
Note that RRZI is repeated m = 3 times here. The evaluation shows that
estimation improves as m increases.
RRZI Estimators
Sum Estimator (proven consistent, p. 5)
The average of some attribute over all PoIs in fully-accessible
regions, standardized by the probability of picking the PoI’s
region.
Confidence Interval
(variance defined in Equation 4)
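The slides describe the sum estimator only in words. One form consistent with that description is the usual inverse-probability weighting: average, over the m sampled regions, the region's attribute total divided by its sampling probability τ. The sketch below assumes that form; it is not copied from the paper, and `sum_estimate` is a hypothetical name.

```python
def sum_estimate(samples):
    """Inverse-probability-weighted estimate of a sum over all PoIs.

    samples: list of (attribute values of the PoIs in a sampled
    fully-accessible region, probability tau of sampling that region).
    """
    return sum(sum(values) / tau for values, tau in samples) / len(samples)

# Toy case: two regions partition the area and each is reached with tau = 0.5.
# The true total is 4 + 6 + 30 = 40.
samples = [([4, 6], 0.5), ([30], 0.5)]
print(sum_estimate(samples))  # 40.0
```

Dividing by τ is what removes the bias: regions that are easy to reach are sampled often but each contributes less.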
RRZI Estimators
Distribution Estimator
RRZIC Algorithm
Context
Some APIs provide the count z of PoIs in a queried region.
Use the count to improve RRZI:
Choose the next sub-region with probability z0/z or z1/z, respectively.
→ The sub-region with more PoIs is more likely to be chosen to query next.
→ The number of PoIs in the next-explored region is more stable.
→ Sampling is closer to uniform and error is reduced.
Estimators now standardize by the known count in each region, n(r_i),
instead of the probability of choosing the region.
Question: why not always pick the region with greater z?
You would end up with the same fully-accessible (FA) region every time.
Seems like semi-sorted binary search now.
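The count-weighted choice can be sketched in a few lines. This assumes the API's counts are exact and that z = z0 + z1; `choose_subregion` is a hypothetical helper, not a function from the paper.

```python
import random

def choose_subregion(z0, z1, rng):
    """Return 0 or 1, picking sub-region i with probability z_i / (z0 + z1)."""
    z = z0 + z1
    if z == 0:
        raise ValueError("both sub-regions are empty")
    return 0 if rng.random() < z0 / z else 1

rng = random.Random(0)
picks = [choose_subregion(30, 10, rng) for _ in range(10_000)]
print(picks.count(0) / len(picks))  # close to 0.75
```

Under these assumptions the probability of arriving at a fully-accessible region r is the product of the ratios along the path, which telescopes to n(r)/n(A). That is why the estimators can standardize by the known count instead of a recorded path probability.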
Mix Methods to overcome Size Constraints
Context
Some APIs constrain the size of the queried region.
A 3° × 3° (lat, lon) query fails on Foursquare.
Introduce mix-methods URS and MHWRS to overcome size
constraints with clever sampling.
Intuition
Subdivide A before running RRZI and RRZIC.
Improved sampling makes it more query-efficient and lowers error.
Uniform Random Sampling (RRZI_URS, RRZIC_URS)
Uniform Random Sampling Step
1. Apply L recursive region divisions to get the set of sub-regions B_L,
|B_L| = 2^L; B*_L is the set of non-empty sub-regions.
L is tuned such that the sub-regions meet the size constraint.
Continue with RRZI or RRZIC using regions from URS
2. Randomly select a non-empty b from B_L.
3. Sample fully-accessible region(s) from b using RRZI(b) and
RRZIC(b) (instead of RRZI(A) and RRZIC(A)).
Characteristics
Estimator functions are similar; standardize w.r.t. B_L instead of A.
(Generally) more query-efficient:
URS requires |B_L|/|B*_L| queries to find a non-empty region; non-URS
requires L. |B_L|/|B*_L| < L for small values of L (few division steps).
Arrives at a non-empty query more quickly → undersamples dense
regions → higher error.
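To make the |B_L|/|B*_L| versus L comparison concrete, the sketch below counts non-empty cells after L halvings of the unit square. The alternating-axis split and the unit-square area are assumptions of this example, and both function names are hypothetical.

```python
def cell_index(p, L):
    """Cell containing p after L alternating halvings of [0, 1) x [0, 1)."""
    nx = 2 ** ((L + 1) // 2)   # number of columns: ceil(L / 2) splits along x
    ny = 2 ** (L // 2)         # number of rows: floor(L / 2) splits along y
    return int(p[0] * nx), int(p[1] * ny)

def urs_expected_queries(points, L):
    """|B_L| / |B*_L|: mean uniform draws until a non-empty sub-region."""
    nonempty = {cell_index(p, L) for p in points}
    return 2 ** L / len(nonempty)

pois = [(0.1, 0.1), (0.9, 0.9)]
print(urs_expected_queries(pois, 2))  # 4 cells, 2 non-empty: 2.0
```

When the PoIs cover most cells the ratio stays near 1 and URS beats the L queries a zoom-in from A would need; with very sparse data and large L the ratio grows and the advantage disappears.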
Metropolis-Hastings Based Weighted Region Sampling
(RRZIC_MHWRS)
Modify the Sampling Step to Improve Error
1. Sample non-empty region b from B_L following distribution
π = (π_b = n(b)/n(A) : b ∈ B*_L).
Draw more samples from dense regions.
2. Sample a fully-accessible region with RRZIC.
Move to Another Region (maybe)
3. Sample next region b*, and move to it with probability min(n(b*)/n(b), 1).
If b* has more PoIs than b, the move is always accepted.
Characteristics
Only works if you know the count of the region.
More query-efficient for same reasons as URS.
MHWRS falls into Markov-Chain Monte Carlo techniques.
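The accept/reject rule can be checked with a short simulation. The sketch assumes a uniform proposal over the non-empty regions and known counts n(b); the region ids and counts are made up, and `mh_region_walk` is a hypothetical name rather than the paper's procedure.

```python
import random
from collections import Counter

def mh_region_walk(counts, steps, rng):
    """Metropolis-Hastings walk over regions, weighted by PoI count.

    counts: dict mapping region id -> n(b) for the non-empty regions.
    Returns how often each region was the current state.
    """
    regions = list(counts)
    b = rng.choice(regions)
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(regions)              # uniform proposal (assumed)
        accept = min(counts[proposal] / counts[b], 1.0)
        if rng.random() < accept:
            b = proposal
        visits[b] += 1
    return visits

counts = {"a": 10, "b": 30, "c": 60}
visits = mh_region_walk(counts, 100_000, random.Random(0))
total = sum(visits.values())
print({r: round(v / total, 3) for r, v in visits.items()})
```

The long-run visit frequencies should approach n(b)/n(A), here 0.1, 0.3 and 0.6, so dense regions are sampled proportionally more often without the total n(A) being known in advance.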
Algorithms Comparison
Parameter L required to determine sub-region size.
Evaluation
Tests
1. Estimate n(A) (number of PoIs in area A).
2. Estimate average and distribution statistics.
Baselines
1. Nearest-Neighbor Search
2. Random Region Sampling
Hypothesis
1. RRZIC_MHWRS will be most efficient if PoI count is available.
2. RRZI_URS will be most efficient otherwise.
Estimating n(A)
Normalized root-mean-squared error for n(A) estimate using RRZI.
m ↑, error ↓
k ↑, error ↓
Not obvious how they actually estimate n(A).
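For reference, a common definition of normalized root-mean-squared error is the RMSE of repeated estimates divided by the true value. The slides do not spell out the paper's exact formula, so the helper below is an assumption based on that common definition.

```python
import math

def nrmse(estimates, truth):
    """sqrt(mean((estimate - truth)^2)) / truth over repeated runs."""
    mse = sum((e - truth) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse) / truth

# Two runs that miss a true count of 100 by 10 in either direction.
print(nrmse([90, 110], 100))  # 0.1
```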
Estimating n(A)
How many RRZI_URS queries to reach a fully-accessible region?
L = 0 models RRZI.
L ↑, sub-region size ↓
Local minimum around 10-15.
Estimating n(A)
How many RRZI_URS queries to sample a non-empty sub-region?
Small L → large sub-region → less likely to be empty.
Estimating n(A)
How does decreasing sub-region size affect error for n(A)?
Smaller regions → lower error until L = 20
Estimating n(A)
How many queries to decrease error to 0.1?²
Baseline methods require ~150K queries.
RRZI and RRZI_URS require ~20K and ~50K queries.
² The lines between cities don’t really represent anything.
Estimating n(A)
Test Interpretations
1. The RRZI and RRZI_URS methods reach a low error much sooner than the baselines.
2. Tuning hyper-parameter L is important.
3. Not obvious how the proposed methods compute the n(A) estimate.
Estimating average and distribution statistics
“Correct” data for Foursquare.
Leaving out Baidu evaluation for brevity.
Estimating average and distribution statistics
Average and distribution root-mean-squared error for RRZI, RRZIC,
RRZI_URS, RRZIC_MHWRS up to 10K queries.
RRZIC_MHWRS is best in all cases.
Estimating average and distribution statistics
How many queries needed for error < 0.1 for average number of
Foursquare check-ins?
RRZIC_MHWRS is best in all cases.
RRZI_URS is best if PoI count is unavailable.
Estimating average and distribution statistics
Test Interpretations
1. True PoI count is very nice to have.
2. Modified sampling methods reach a low error much more efficiently.
3. Modified sampling methods are more query-efficient, possibly because they
reach more meaningful data more quickly.
Omitted for Brevity
Real Applications
Present data collected from Foursquare, Google, Baidu using the
proposed methods.
Related Work
Describe Nearest-Neighbor Search and Random Region Sampling
drawbacks for this task.
Contributions, Questions
Contributions
1. A practical guide to overcoming data limitations.
2. Clear improvement on prior “state-of-the-art” methods (NNS, RRS).
3. Clever sampling methods to reach low errors very efficiently.
Questions
1. What’s the shelf-life of PoI estimates? How often would you have to
re-query to maintain accurate estimates?
2. Do the ground-truth data and estimates come from the same time
window? If not, is it valid to compare data from different points in
time? Is it useful to use data from all time?³
3. Is it possible that the decreased error for mix-methods is a product of
more efficient querying?
4. Didn’t fully understand the role of CDS_UNI and CDS_NOR in Section
4.2.
³ At a quick glance, most of Foursquare’s venue statistics can be, but do not
have to be, restricted to a bounded time range.
