Research summary for my STAT645 course, fall 2016. Paper: "Scalable Algorithms for Nearest-Neighbor Joins on Big Trajectory Data" by Fang, Cheng, Tang, Maniu, and Yang. http://ieeexplore.ieee.org/document/7498408/
Trajectory Joins Vocabulary
• Trajectory: a series of locations that depicts the movement of an entity over time.
• Trajectory Object: a snapshot of time and location; a single trajectory contains many trajectory objects.
• Trajectory Join: given two sets M and R of trajectories, join(M, R) returns trajectory objects from M and R within some proximity in space and time.
• Joining Criterion: the criterion by which objects in M and R are joined. This paper uses the k-nearest-neighbors algorithm to join objects.
Example Use Case
• The Hubble Space Telescope generates 140 GB of data per week about the movements of stars and asteroids. Analyzing proximity among trajectory objects helps uncover the behavior of outer-space objects, discover meteors, etc. Trajectory joins let us find objects within some proximity of one another.
• Given two groups A and B of asteroids, return the identities of asteroids from B that have been close to those in A.
MapReduce Basics
• Divide-and-conquer "big data" on share-nothing clusters.
• A master node partitions the data and assigns it to map nodes.
• Map performs analysis on local data.
• The shuffle step redistributes data by key after the map step.
• Reduce performs a summary operation over the data from the map step.
• The MapReduce framework handles the data partitioning, execution over distributed nodes, and error recovery (see the sketch below).
[1] https://goo.gl/0nbYhp
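To make the flow concrete, here is a minimal single-process Python sketch of the map -> shuffle -> reduce pattern, using word counting as a stand-in task. It is illustrative only: the function names (map_phase, shuffle, reduce_phase) are mine, and a real deployment would rely on Hadoop or a similar framework for the distribution and error recovery described above.

from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs from one input record.
    for word in record.split():
        yield word, 1

def shuffle(pairs):
    # Group values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Summarize all values seen for one key.
    return key, sum(values)

records = ["a b a", "b c"]
pairs = [kv for r in records for kv in map_phase(r)]
result = dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())
print(result)  # {'a': 2, 'b': 2, 'c': 1}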
Problem Statement
kNN Join
Find the k nearest neighbors from set R for each object in M over the time interval [ts, te] ⊆ [Ts, Te].
(h,k)NN Join
Find a list of h objects from M over the time interval [ts, te] ⊆ [Ts, Te] that minimize a function f. Then return the k nearest neighbors for each of the h objects.
kNN Example
The figure (on the original slide) illustrates a kNN join. An (h,k)NN join with h = 1, k = 2 might use f(m1) = max{d1, d2} = d2 to select m1 and return its k nearest neighbors {r1, r2}.
Some Fundamental Operations
• Min/max distance from point to line-segment.
• Min/max distance from point to trajectory.
• Min/max distance from trajectory to trajectory.
• kNN from trajectory object to trajectory objects.
[2] Formulas omitted for brevity; they are available in Section 3 of the paper. A sketch of the point-to-segment case follows.
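A minimal sketch of the point-to-segment case, using the standard clamped-projection formula rather than the paper's exact formulation; the function names are mine. The point-to-trajectory distance then follows as a minimum over the trajectory's segments.

import math

def min_dist_point_segment(p, a, b):
    # Minimum Euclidean distance from point p to segment ab (clamped projection).
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:  # degenerate segment
        return math.hypot(px - ax, py - ay)
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))  # clamp the projection onto the segment
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def min_dist_point_trajectory(p, traj):
    # Minimum distance from a point to a trajectory, taken over its segments.
    return min(min_dist_point_segment(p, traj[i], traj[i + 1])
               for i in range(len(traj) - 1))

print(min_dist_point_segment((0, 1), (0, 0), (2, 0)))  # 1.0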
Sub-optimal Solutions
Single Machine Brute Force (BF)
A nested loop computes the Euclidean distance between every pair of points in M and R (sketched below). Worst case O(|M||R|l) for l points in the trajectory of interest tr.
Single Machine Sweep Line (SL)
Pre-sort the data by time and compute distances only for temporally overlapping trajectories. Also worst case O(|M||R|l).
Naive MapReduce
Map divides the objects in M and R randomly into disjoint subsets. Reduce joins all pairs of subsets to compute distances. A second MapReduce job selects the k nearest neighbors.
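A rough single-machine sketch of the brute-force (BF) baseline. The object layout ((id, t, x, y) tuples) and the same-timestamp comparison are simplifying assumptions of mine, not the paper's exact join criterion.

import heapq
import math

def brute_force_knn_join(M, R, k):
    # For every object in M, scan all of R and keep the k nearest.
    result = {}
    for mid, mt, mx, my in M:
        dists = [(math.hypot(mx - rx, my - ry), rid)
                 for rid, rt, rx, ry in R if rt == mt]
        result[(mid, mt)] = heapq.nsmallest(k, dists)
    return result

M = [("m1", 0, 0.0, 0.0)]
R = [("r1", 0, 1.0, 0.0), ("r2", 0, 2.0, 0.0), ("r3", 0, 3.0, 0.0)]
print(brute_force_knn_join(M, R, k=2))
# {('m1', 0): [(1.0, 'r1'), (2.0, 'r2')]}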
Overview of kNN Join
Each step is implemented as its own MapReduce algorithm, for a total of six algorithms.
Pre-processing Phase
Algorithm 1
1 Input: non-partitioned trajectories.
2 Map splits the trajectories in sets M and R into T temporal partitions. O(l + T), where l is the size of a trajectory.
3 Reduce splits each temporal partition into N spatial partitions. O((|M| + |R|)(l + N)).
4 Output: trajectories partitioned by time and space (partition assignment sketched below).
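A hedged sketch of the partition-assignment idea behind Algorithm 1: each trajectory object is mapped to a (temporal slot, spatial cell) bucket. Equal-width time slots and a uniform grid are my simplifying assumptions; the paper's actual partition boundaries may be chosen differently.

def temporal_partition(t, T, t_min, t_max):
    # Map-side idea: index of the temporal partition containing time t (T equal-width slots).
    width = (t_max - t_min) / T
    return min(T - 1, int((t - t_min) / width))

def spatial_partition(x, y, grid, x_min, x_max, y_min, y_max):
    # Reduce-side idea: index of the spatial cell containing (x, y) on a grid x grid layout.
    cx = min(grid - 1, int((x - x_min) / (x_max - x_min) * grid))
    cy = min(grid - 1, int((y - y_min) / (y_max - y_min) * grid))
    return cx, cy

# One trajectory object -> (temporal slot, spatial cell) bucket.
t_slot = temporal_partition(37.0, T=8, t_min=0.0, t_max=100.0)
cell = spatial_partition(3.0, 7.5, grid=4, x_min=0.0, x_max=10.0, y_min=0.0, y_max=10.0)
print(t_slot, cell)  # 2 (1, 3)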
Sub-Trajectory Extraction
• An anchor trajectory must span an entire time partition.
• Tr_i^L is object i in trajectory r from set L in time partition T.
Algorithm 2
1 Input: trajectories partitioned by time and space.
2 Map retrieves all sub-trajectories in [ts, te] (the queried time window). Ot(log l), Os(l). A sketch of this window lookup follows.
3 Reduce finds the anchor trajectories that will be used in the next step. Ot(|Tr_i^L|^2 · l), Os(|Tr_i^L| · l).
4 Output: anchor trajectories.
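A small sketch of the map-side window lookup: because a trajectory's timestamps are sorted, the portion inside [ts, te] can be located with binary search, matching the Ot(log l) cost above. The function name and data layout are mine.

from bisect import bisect_left, bisect_right

def extract_subtrajectory(times, points, ts, te):
    # times is sorted, so the window boundaries are found in O(log l).
    lo = bisect_left(times, ts)
    hi = bisect_right(times, te)
    return times[lo:hi], points[lo:hi]

times = [0, 10, 20, 30, 40]
points = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
print(extract_subtrajectory(times, points, 10, 30))
# ([10, 20, 30], [(1, 1), (2, 2), (3, 3)])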
Anchor Trajectories
• An anchor trajectory must span an entire time partition ts
to te.
Computing Time-dependent Bound (TDB)
• The TDB is a circle c(t) that bounds the k nearest
neighbors of a set S of objects at time t.
• The TDB for a set S of objects can change over time.
Algorithm 4, containing Algorithm 3
1 Input: anchor trajectories.
2 Map computes the maximum distance from each anchor trajectory to each central point p_i in each temporal partition T. Ot(N · l), Os(l).
3 Reduce computes the TDB of Tr_i^M based on the maximum distances. Ot(|R| log |R|), Os(|R|) for the set of objects R.
4 Output: time-dependent bounds.
Time-dependent Bounds
In the original figure, white dots are objects from M and black dots are objects from R. At time t, c(t) is a small circle that encompasses k = 2 points; at a later time t', c(t') must be a bigger circle to encompass k = 2 points. A rough sketch of the bound's role at a fixed time follows.
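A rough illustration, at one fixed time t, of what the bound guarantees: any circle whose radius is at least the distance to the k-th nearest object of R is guaranteed to enclose the k nearest neighbors. The paper instead derives a tighter, time-dependent bound from maximum distances to anchor trajectories (Algorithms 3 and 4); the function name and single-center simplification here are mine.

import heapq
import math

def bound_radius_at_time(center, r_points, k):
    # Distance from a representative center to its k-th nearest object of R:
    # a radius of at least this value encloses the k nearest neighbors at this time.
    cx, cy = center
    dists = [math.hypot(cx - x, cy - y) for x, y in r_points]
    return heapq.nsmallest(k, dists)[-1]

print(bound_radius_at_time((0, 0), [(1, 0), (0, 2), (5, 5)], k=2))  # 2.0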
Finding Candidate Trajectories
Algorithm 5
1 Input: partitions of trajectories Tr_j^R.
2 Map classifies each partition of trajectories Tr_j^R as having no candidates, all candidates, or some candidates. Ot(|Tr| · N · l), Os(|Tr| · l).
3 Reduce gathers the candidates for a join into C_i^R. Ot(1), Os(|C_i^R| · l).
4 Output: a set of candidate trajectories C_i^R.
Candidate Trajectories
Finding candidates for Tr_j^R (shown in red in the original figure). Case 1 has no overlap, case 2 has complete overlap, and case 3 has partial overlap. The three-way case analysis is sketched below.
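A hedged sketch of that three-way case analysis. The real pruning compares min/max distances between a partition and the query's bound; the threshold logic below only illustrates the three outcomes.

def classify_partition(min_dist, max_dist, bound):
    # Case 1: entirely outside the bound -> contributes no candidates.
    # Case 2: entirely inside the bound  -> all of it is a candidate.
    # Case 3: straddles the bound        -> only part of it may qualify.
    if min_dist > bound:
        return "no candidates"
    if max_dist <= bound:
        return "all candidates"
    return "some candidates"

print(classify_partition(5.0, 9.0, bound=3.0))  # no candidates
print(classify_partition(0.5, 2.0, bound=3.0))  # all candidates
print(classify_partition(1.0, 6.0, bound=3.0))  # some candidates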
Trajectory Join
Algorithm 6
1 Input: candidate trajectories
2 Map joins each partition Tr_i^M with its corresponding candidates C_i^R using a single machine. O(|Tr| · |C_i^R| · l).
3 Reduce sorts each object's neighbors and leaves only the k nearest. O(kN).
4 Output: each queried object with its k nearest neighbors (the reduce-side top-k step is sketched below).
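A minimal sketch of the reduce-side top-k step: group the (query object, distance, neighbor) pairs emitted by the map-side local joins and keep only the k nearest per object. The function name and tuple layout are mine.

import heapq
from collections import defaultdict

def keep_k_nearest(neighbor_pairs, k):
    # Group candidate neighbors per query object, then keep the k nearest.
    grouped = defaultdict(list)
    for obj_id, dist, neighbor_id in neighbor_pairs:
        grouped[obj_id].append((dist, neighbor_id))
    return {obj: heapq.nsmallest(k, cands) for obj, cands in grouped.items()}

pairs = [("m1", 2.0, "r1"), ("m1", 0.5, "r2"), ("m1", 1.2, "r3"), ("m2", 3.0, "r1")]
print(keep_k_nearest(pairs, k=2))
# {'m1': [(0.5, 'r2'), (1.2, 'r3')], 'm2': [(3.0, 'r1')]}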
Extension: kNN Load Balancing
1 Hash the trajectory objects by an ID to distribute them more uniformly among compute nodes (see the sketch below).
2 Requires modifications to the sub-trajectory extraction, candidate finding, and trajectory join steps.
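A minimal sketch of the hashing idea only, assuming a stable hash over string IDs; the paper's extension also changes how the affected algorithms group and route their data.

import zlib

def assign_node(trajectory_id, num_nodes):
    # Stable hash of the ID spreads objects roughly uniformly across nodes.
    return zlib.crc32(trajectory_id.encode()) % num_nodes

ids = ["tr_%d" % i for i in range(10)]
print([assign_node(t, num_nodes=4) for t in ids])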
Extension: hkNN Join
1 Review: find the h objects from M that minimize some function f and return each of their k nearest neighbors (sketched below).
2 Computes a smaller (tighter) TDB.
3 The query result has size h × k, versus |M| × k for the plain kNN query.
4 Time and space complexities remain the same.
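A sketch of the (h,k)NN idea under my own simplifications: pick the h objects of M that minimize f, then answer an ordinary kNN query for only those objects, giving h × k result pairs. The scoring function f below (distance to the origin) is purely hypothetical.

import heapq
import math

def hknn_join(M, R, h, k, f):
    # Select the h objects of M minimizing f, then find each one's k nearest in R.
    top_h = heapq.nsmallest(h, M, key=f)
    answer = {}
    for mid, mx, my in top_h:
        dists = [(math.hypot(mx - rx, my - ry), rid) for rid, rx, ry in R]
        answer[mid] = heapq.nsmallest(k, dists)
    return answer

M = [("m1", 0.0, 0.0), ("m2", 10.0, 10.0)]
R = [("r1", 1.0, 0.0), ("r2", 0.0, 2.0), ("r3", 9.0, 9.0)]
print(hknn_join(M, R, h=1, k=2, f=lambda o: math.hypot(o[1], o[2])))
# {'m1': [(1.0, 'r1'), (2.0, 'r2')]}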
Evaluation Setup
• Two synthetic and two real datasets.
• Non-trivial size: up to 1.2B observations and 17.2 GB.
• Hadoop cluster with 60 slave nodes; multi-core 3.40 GHz CPU and 16 GB memory per node.
• Sweep Line (SL) is used for the single-node parts.
• Measured: query execution time and MapReduce shuffling cost (the amount of data sent from mappers to reducers).
• k = 10 and N = 400 are held constant for all datasets; T and tq are varied.
Effect of T (number of temporal partitions)
As T grows, the running time decreases until it hits an inflection point; the pattern is similar for both datasets. Most of the time is still spent in the single-node SL step.
kNN Results Summary
• Increasing N (the number of spatial partitions) improves performance up to a point of inflection. This point differs between the two datasets. Fig. 15.
• Balanced Sweep-Line (BL-SL) is the more efficient single-node algorithm. Fig. 16 (I think they mixed up the figure labels).
• Adding slave nodes improves performance, but the rate of improvement is slow, likely due to I/O overhead. Fig. 17.
• As k increases, the running time and shuffle cost increase. The TDB makes a difference. Fig. 18.
• Increasing tq shows a near-linear increase in running time and shuffling cost. The TDB and load balancing make a difference. Fig. 19.
• Time increases linearly with dataset size, with a sharper increase in shuffling cost than in time. Fig. 20.
hkNN Results Summary
• Running time is roughly constant as h grows (probably because k is constant).
• (h,k)NN is about 2× faster than the kNN methods.
• The load-balanced variant is faster than the non-load-balanced one.
Conclusion
Contributions
1 Leverages the share-nothing MapReduce structure for kNN joins, which typically rely on shared indices.
2 Introduce the TDB and load-balancing methods, which
yield tangible improvements.
Questions
1 Most of the time is still spent on the single-node
computation. What is the theoretical bound for
improvement via parallelization?
2 How much time does the partitioning step take?
3 The partitioning step probably has to be re-run when new
data arrives. Does this prevent a real-time
implementation?
4 Is there any benefit to localizing data instead of using HDFS?