Presentation given at the ACM/IFIP/USENIX Middleware workshop, 2015
Adaptive and Reflective Middleware (ARM) is the main forum for researchers working on adaptive and reflective middleware platforms and systems. It was the first workshop ever held with the ACM/IFIP/USENIX International Middleware Conference, dating back to Middleware 2000 in Palisades, NY, and it has run every year since.
Authors:
Y.S.Horawalavithana
D.N.Ranasinghe
http://dl.acm.org/citation.cfm?id=2834975
Citation:
Y. S. Horawalavithana and D. N. Ranasinghe. 2015. An Efficient Incremental Indexing Mechanism for Extracting Top-k Representative Queries Over Continuous Data-streams. In Proceedings of the 14th International Workshop on Adaptive and Reflective Middleware (ARM 2015). ACM, New York, NY, USA, Article 8. DOI=http://dx.doi.org/10.1145/2834965.2834975
[ARM 15 | ACM/IFIP/USENIX Middleware 2015] Research Paper Presentation
An Efficient Incremental Indexing Mechanism for Extracting Top-k Representative Queries over Continuous Data Streams
Y.S. Horawalavithana, D.N. Ranasinghe
Adaptive and Reflective Middleware (ARM)
ACM/IFIP/USENIX Middleware
Vancouver, BC, Canada
December 08, 2015
University of Colombo School of Computing,
Sri Lanka
Handling streaming publications
[Figure: streaming publications p1–p5 indexed over buckets v1–v5; when publication p6 arrives, the index is updated incrementally with bucket v6]
Continuity Requirements
1. Durability
An item selected as diversified in the i-th window may still appear in the (i+1)-th window, provided it has not expired and the other valid items in the (i+1)-th window fail to out-compete it.
2. Order
The publication stream follows chronological order: we avoid later selecting an item j as diverse when we have already selected an item i that is not older than j.
1.Motivation 2.Adaptive Diversification 3.Incremental Top-k 4.Evaluation 5.Conclusion 6.Future Work
Locality Sensitive Hashing (LSH)
Simple Idea
if two points are close together, then after a “projection” operation these two
points will remain close together
LSH in Adaptive Diversification:
Publications as categorical data
LSH in Adaptive Diversification:
Characteristic Matrix
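As a rough illustration of turning publications into a characteristic matrix (all publication names and attribute values below are made up, not from the paper's dataset), each publication is treated as a set of categorical attribute values, rows are the universe of values, and columns are publications:

```python
# Sketch: build the binary characteristic matrix for publications treated
# as sets of categorical attribute values. All names here are illustrative.
publications = {
    "p1": {"books", "fiction", "paperback"},
    "p2": {"books", "fiction", "hardcover"},
    "p3": {"electronics", "camera"},
}

# Rows = the universe of attribute values, columns = publications.
universe = sorted(set().union(*publications.values()))
matrix = {
    value: [1 if value in attrs else 0 for attrs in publications.values()]
    for value in universe
}

for value in universe:
    print(value, matrix[value])
```

A cell is 1 exactly when the row's attribute value occurs in the column's publication, which is the input the minhashing step below operates on.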
LSH in Adaptive Diversification:
Minhashing
A signature now represents each publication; the publications themselves are no longer needed.
Technique:
Randomly permute the rows of the characteristic matrix m times.
For each publication's column, take the index of the first row, in the permuted order, in which that column has a 1.
[Figure: first permutation of the rows of the characteristic matrix]
Advantage:
Reduces the dimensions to a small minhash signature
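The permutation-based technique above can be sketched as a toy (it materialises full permutations, which real implementations avoid, as the next slide notes; the example rows are made up):

```python
import random

# Illustrative minhashing by explicit row permutation.
def minhash_signature(matrix_rows, num_columns, m, seed=42):
    """matrix_rows: 0/1 rows of the characteristic matrix.
    Returns an m-row signature matrix (one column per publication)."""
    rng = random.Random(seed)
    n = len(matrix_rows)
    signature = []
    for _ in range(m):
        order = list(range(n))
        rng.shuffle(order)  # one random permutation of the rows
        row_sig = []
        for col in range(num_columns):
            # Index of the first row, in permuted order, with a 1 in this column.
            row_sig.append(next(i for i, r in enumerate(order)
                                if matrix_rows[r][col] == 1))
        signature.append(row_sig)
    return signature

rows = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
sig = minhash_signature(rows, num_columns=3, m=4)
```

Each of the m signature rows compresses all characteristic-matrix rows into a single value per publication.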
LSH in Adaptive Diversification:
Signature Matrix
Fast-minhashing
Select m random hash functions to model the effect of m random permutations.
Mathematically proven only when the number of rows is a prime.
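A minimal sketch of the hash-function trick, assuming universal hash functions of the form h(x) = (a·x + c) mod p with p a prime no smaller than the number of rows; the parameter values and example rows are illustrative assumptions:

```python
import random

# "Fast" minhashing: m universal hash functions stand in for m random
# row permutations. p must be a prime >= the number of rows.
def fast_minhash(matrix_rows, num_columns, m, p=7, seed=1):
    rng = random.Random(seed)
    funcs = [(rng.randrange(1, p), rng.randrange(0, p)) for _ in range(m)]
    # Start every signature cell at "infinity" and lower it row by row.
    sig = [[float("inf")] * num_columns for _ in range(m)]
    for row_idx, row in enumerate(matrix_rows):
        hashes = [(a * row_idx + c) % p for a, c in funcs]
        for col in range(num_columns):
            if row[col] == 1:
                for k in range(m):
                    sig[k][col] = min(sig[k][col], hashes[k])
    return sig

rows = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
sig = fast_minhash(rows, num_columns=3, m=4)
```

One streaming-friendly advantage: each new row only requires m hash evaluations and cell-wise minimums, with no permutation ever stored.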
LSH in Adaptive Diversification:
LSH Buckets
Take r-sized signature vectors from the m-sized minhash signature and map them into L hash tables, each with an arbitrary number b of buckets.
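The banding step above can be sketched as follows; the use of Python's built-in hash and the particular parameter values are illustrative assumptions, not the paper's implementation:

```python
# Split an m-value minhash signature column into L bands of r values and
# hash each band into one of b buckets per table (m = r * L assumed).
def to_buckets(column_signature, r, L, b):
    """Returns, for each of the L hash tables, the bucket this
    publication's signature lands in."""
    assert len(column_signature) == r * L
    buckets = []
    for table in range(L):
        band = tuple(column_signature[table * r:(table + 1) * r])
        buckets.append(hash(band) % b)  # arbitrary b buckets per table
    return buckets

sig = [3, 0, 2, 1, 0, 0]  # m = 6 minhash values for one publication
print(to_buckets(sig, r=2, L=3, b=8))
```

Two publications that agree on any one band collide in that table, which is what makes near neighbors land together somewhere.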
LSH in Adaptive Diversification:
Batch-wise Top-k computation
Bucket “winner”: the publication with the highest relevancy score in the bucket.
The winner is dominant and represents its bucket neighborhood.
The top-k “winners” are those that gather the most votes.
The k winners are independent of each other.
[Figure: publication stream P_A, P_B, P_C, P_D, P_E, P_F, P_G, P_H, … within the i-th window]
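A toy sketch of the voting scheme above, with made-up relevancy scores and bucket contents (none of these values come from the paper):

```python
from collections import Counter

# Each bucket votes for its "winner", the member with the highest
# relevancy score; the k winners with the most votes across all hash
# tables form the diversified top-k.
relevance = {"A": 0.9, "B": 0.4, "C": 0.7, "D": 0.6}
# Per hash table: bucket-id -> member publications (illustrative).
tables = [
    {0: ["A", "B"], 1: ["C", "D"]},
    {0: ["A"], 1: ["B", "C"], 2: ["D"]},
]

votes = Counter()
for table in tables:
    for members in table.values():
        winner = max(members, key=relevance.get)
        votes[winner] += 1

k = 2
top_k = [p for p, _ in votes.most_common(k)]
print(top_k)
```

Because winners of different buckets sit in different neighborhoods, the vote count doubles as a diversity filter.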
LSH in Dynamic Diversification:
Incremental Top-k computation
When a new publication i arrives:
1. Update the i-th characteristic vector (Characteristic Matrix).
2. Generate the i-th minhash signature (Signature Matrix).
3. Map the i-th signature into the L hash tables.
4. Update the “winner” at every bucket the i-th signature maps into.
5. Vote for the top-k candidates.
LSH in Dynamic Diversification:
When new publication F arrives…
Only buckets B13, B23, B32 and B43 will vote.
Follow continuity requirements
Durability
Order
[Figure: publication stream P_A … P_H spanning the i-th and (i+1)-th windows]
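The incremental path can be sketched as below; the data structures, the strict-improvement rule for displacing a winner, and all scores are assumptions for illustration, not the paper's exact algorithm:

```python
# When one new publication arrives, only the buckets its signature maps
# into re-elect their winner and re-vote (all names are illustrative).
def on_new_publication(pub_id, signature, tables, winners, relevance, r, b):
    """tables: list of L dicts bucket_id -> member list.
    winners: list of L dicts bucket_id -> current winner."""
    touched = []
    for t in range(len(tables)):
        band = tuple(signature[t * r:(t + 1) * r])
        bucket = hash(band) % b
        tables[t].setdefault(bucket, []).append(pub_id)
        old = winners[t].get(bucket)
        # The newcomer displaces the winner only if strictly more relevant.
        if old is None or relevance[pub_id] > relevance[old]:
            winners[t][bucket] = pub_id
        touched.append((t, bucket))
    return touched  # only these buckets need to re-vote

relevance = {"E": 0.5, "F": 0.8}
tables = [{}, {}]
winners = [{}, {}]
on_new_publication("E", [1, 2, 3, 4], tables, winners, relevance, r=2, b=4)
touched = on_new_publication("F", [1, 2, 0, 0], tables, winners, relevance, r=2, b=4)
```

This is what avoids re-computing all neighborhoods: the untouched buckets keep their winners and their earlier votes.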
LSH in Adaptive Diversification:
Analysis
For two vectors x, y:
JD(x, y) = 1 − JSIM(x, y), where JSIM(x, y) = |x ∩ y| / |x ∪ y|
For publications x and y:
JSIM(x, y) ∝ Prob[H(x) = H(y)]
At a particular hash table:
x and y map into the same bucket: JSIM(x, y)^b
x and y do not map into the same bucket: 1 − JSIM(x, y)^b
At L hash tables:
x and y do not map into the same bucket in any table: (1 − JSIM(x, y)^b)^L
x and y map into the same bucket in at least one table: 1 − (1 − JSIM(x, y)^b)^L
True near neighbors are unlikely to be unlucky in all the projections.
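The analysis above yields the familiar LSH S-curve; a quick numeric check (the parameter values are chosen arbitrarily for illustration):

```python
# With band exponent b and L hash tables, two publications with Jaccard
# similarity s share at least one bucket with probability 1 - (1 - s**b)**L.
def collision_prob(s, b, L):
    return 1 - (1 - s ** b) ** L

# Close pairs almost always collide somewhere; distant pairs rarely do.
print(round(collision_prob(0.9, b=4, L=10), 3))
print(round(collision_prob(0.2, b=4, L=10), 3))
```

The gap between the two printed values is exactly the "unlikely to be unlucky in all the projections" property.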
Evaluation:
Dataset
Amazon on-line marketplace data collected 17–19 November 2014
[Figure: publication stream; Zipfian subscriptions; normalized preferences]
zipf(k; s, N) = (1/k^s) / Σ_{n=1}^{N} (1/n^s)
where N is the number of elements in the distribution, k is the rank of an element, and s is the value of the exponent.
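The Zipf weight above translates directly to code (the s and N values below are arbitrary, not the paper's experimental settings):

```python
# Zipf weight: the k-th ranked element gets probability
# (1/k**s) / sum_{n=1..N} (1/n**s).
def zipf(k, s, N):
    return (1 / k ** s) / sum(1 / n ** s for n in range(1, N + 1))

weights = [zipf(k, s=1.0, N=5) for k in range(1, 6)]
```

The weights sum to one and decay with rank, which is what makes a few subscriptions dominate the generated workload.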
Total number of subscriber views = Σ_{i=2}^{32} [C(48, i) + C(42, i) + C(54, i) + C(66, i) + C(57, i) + C(67, i)]
Terminology
ILSH, BLSH and NAÏVE
[Figure: over the publication stream P1 … P8, BLSH or NAIVE recomputes the top-k for each window from scratch, while a single ILSH instance runs incrementally across all windows.]
Accuracy:
ILSH vs. NAÏVE
Probability that ILSH produces the optimal diverse result set under a Jaccard similarity threshold (s)
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
log(top-k matching time) versus the number of publications, with D = 500
Conclusions
A Locality Sensitive Hashing (LSH) based indexing method:
Produces a diverse set of results at 70% average accuracy relative to the naïve method
Reduces the matching time very significantly over the NAÏVE method
Further refined by its incremental version:
Handles streaming publications
Avoids the curse of re-computing neighborhoods
Top-k restricts the delivery to the top publications:
Given a window size and delivery method, the model can produce the best diverse set of personalized results to represent the set of all matching publications at a given instant.
Future work
Explore other suitable use cases for the proposed model and develop prototype applications, e.g.:
A personalized newspaper for every Facebook user
Adaptive resource scheduling in large-scale distributed systems
Exploit the overlap among diversified results of users who have similar interests
Develop an LSH-based index over a multi-threaded, distributed environment
each user gets exposed to more than 1,500 stories each day, but an average user would only get to see about
Since similar publications tend to map into the same bucket with probability 1 − d, the dominance condition is well served: the “winner” publication, being the most relevant publication in each bucket, can cover its neighborhood. Also, two buckets represent two separate neighborhoods, so all “winner” publications are dissimilar from each other by at least distance d, which also satisfies the independence condition.
Talk about the ILSH update cost, which comes from maintaining a large characteristic matrix.