[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams
1. Cloud based publish/subscribe model for
Top-k matching over continuous data
streams
Author:
Y.S. Horawalavithana
10002103
Supervisor:
Dr. D.N. Ranasinghe
U/Graduate Thesis Defense
January 23, 2015
UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING
SCS 4001: INDIVIDUAL PROJECT
1
2. 2
Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work
4. 4
Boolean publish/subscribe
Drawbacks
• A subscriber may be either overloaded with publications or receive too few publications
• It is impossible to compare different matching publications, as ranking functions are not defined
• Partial matching between subscriptions and publications is not supported
1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work
5. 5
Top-k publish/subscribe
• Expressive stateful query processing systems
• User-defined parameter k restricts the delivered publications
• Pub/Sub Matching
  • Top-k pub/sub scoring or ranking
• Pub/Sub Indexing
  • Indexing to support personalized subscriptions
  • Indexing to support continuous Top-k publications retrieval
6. 6
Target
1. How can we define an efficient scoring algorithm that integrates query-independent and query-dependent score metrics?
- Relevance, Freshness & Diversity
2. How can we adapt the indexing data structures used in state-of-the-art publish/subscribe systems to support Top-k matching queries under
a) large subscription volume,
b) high event rate, and
c) a variety of subscribable attributes?
7. 7
Scope
• Optimize the Top-k heuristic for a specific domain
  • E-commerce with buyers & sellers
• Subscriptions & publications follow a pre-defined data structure
• The number of incoming publications follows a Poisson distribution
• Retrieve Top-k publications against subscriptions, not the reverse
12. 12
Relevancy: Personalized Subscription space
2
Carrier = Verizon
Storage ≤ 32GB
2.5
Carrier = AT&T
Storage ≤ 16GB
1.75
Brand = HTC
1.3
2.3
Carrier = Verizon
Color = White
OS = Android
Storage = 16GB
Brand = HTC
Subscribe
13. 13
Subscription Indexing: Modified opIndex
• Based on inverted lists
  • Posting lists
• Two-level partitioning
  • Attribute posting list
  • Operator posting list
• Locate satisfying subscription tuples
• Relevancy score
  • By satisfying relations
  • By satisfying subscription tuples
14. 14
Freshness
• When the window becomes larger,
  • Older publications may prevent newer publications from entering the Top-k results
• Lease relevancy scores?
  • But scores have to be re-calculated
• Forward decaying!
  • Fresh-relevancy score = relevancy score × freshness score
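A minimal sketch of forward decay, assuming an exponential decay function; the landmark time, the decay rate ALPHA and the sample publications are hypothetical values chosen for illustration, not taken from the thesis. The point it illustrates: the part of the score that depends on the arrival time is fixed when the publication arrives, so the relative order of publications never has to be re-calculated as time advances.

```python
import math

LANDMARK = 0.0  # hypothetical landmark time, at or before the first arrival
ALPHA = 0.1     # hypothetical decay rate


def static_score(relevancy, arrival_time):
    """Computed once, when the publication arrives; never updated."""
    return relevancy * math.exp(ALPHA * (arrival_time - LANDMARK))


def fresh_relevancy(relevancy, arrival_time, now):
    """Fresh-relevancy score = relevancy score x freshness score."""
    freshness = (math.exp(ALPHA * (arrival_time - LANDMARK))
                 / math.exp(ALPHA * (now - LANDMARK)))
    return relevancy * freshness


# (name, relevancy score, arrival time)
pubs = [("old", 2.5, 1.0), ("new", 2.0, 9.0)]

# Ranking by the static score gives the same order as ranking by the
# fully normalized score at any later time, because the normalizer
# exp(ALPHA * (now - LANDMARK)) is shared by all publications.
ranked_static = sorted(pubs, key=lambda p: static_score(p[1], p[2]), reverse=True)
ranked_now = sorted(pubs, key=lambda p: fresh_relevancy(p[1], p[2], 10.0), reverse=True)
```

Here the newer publication outranks the older one even though its relevancy score is lower.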
15. 15
Diversity: Top-k representative set
[Figure: representative Top-k set. Panels: the drawback (without diversity); what we want (with diversity)]
Method to retrieve Top-k publications from matching publications
16. 16
MAX* k-diversity problem
Find:
$S^{*} = \operatorname{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$
where
1. $P = \{p_1, \dots, p_n\}$
2. $k \le n$
3. d: a distance metric
4. f: a diversity function
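A brute-force sketch of the problem statement, for intuition only. The slide leaves d and f open; here d is assumed to be the Hamming distance over categorical attributes and f the minimum pairwise distance (the MAXMIN choice). The sample publications are hypothetical. Exhaustive search is exponential in k, so this is not a practical method.

```python
from itertools import combinations


def hamming(p, q):
    """Assumed distance d: number of attributes on which p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def min_pairwise(S, d):
    """Assumed diversity function f: smallest distance between any two items of S."""
    return min(d(p, q) for p, q in combinations(S, 2))


def max_k_diverse(P, k, d, f):
    """Return a subset S of P with |S| = k that maximizes f(S, d)."""
    return max(combinations(P, k), key=lambda S: f(S, d))


# Hypothetical publications: (brand, color, storage)
P = [
    ("HTC", "White", "16GB"),
    ("HTC", "White", "32GB"),
    ("Apple", "Black", "64GB"),
    ("Nokia", "Blue", "8GB"),
]
best = max_k_diverse(P, 3, hamming, min_pairwise)
```

The two near-identical HTC publications never appear together in the selected set, since including both would pull the minimum pairwise distance down to 1.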
21. 21
Handling streaming publications
[Figure: publications p1 to p5 with labels v1 to v5 in one window; after p6 arrives, labels v1 to v6]
Continuity Requirements
1. Durability
An item selected as diverse in the i-th window may still appear in the (i+1)-th window if it has not expired and the other valid items in the (i+1)-th window fail to outcompete it.
2. Order
The publication stream follows chronological order.
We avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.
23. 23
Locality Sensitive Hashing (LSH)
• Simple idea
  • If two points are close together, then after a "projection" operation these two points will remain close together
24. 24
LSH Analysis
• For any given points $p, q \in R^{d}$:
  $P_H[h(p) = h(q)] \ge P_1$ for $\lVert p - q \rVert \le r_1$
  $P_H[h(p) = h(q)] \le P_2$ for $\lVert p - q \rVert \ge c\,r_1 = r_2$
• Hash function h is $(r_1, r_2, P_1, P_2)$-sensitive
• Ideally we need
  • $(P_1 - P_2)$ to be large
  • $(r_1 - r_2)$ to be small
25. 25
LSH in MAXDIVREL:
Publications as categorical data
26. 26
LSH in MAXDIVREL:
Characteristic Matrix
27. 27
LSH in MAXDIVREL:
Minhashing
• No publications any more!
  • A signature represents each publication
• Technique
  • Randomly permute the rows of the characteristic matrix m times
  • For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1
[Figure: first permutation of rows of the characteristic matrix]
• Advantage:
  • Reduces the dimensions to a small minhash signature
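The permutation technique can be sketched as follows. The matrix layout (rows are attribute values, columns are publications), the function name and the sample data are assumptions made for illustration; every column is assumed to contain at least one 1.

```python
import random


def minhash_signatures(matrix, m, seed=0):
    """matrix[r][c] == 1 if publication c has the attribute value of row r.

    Returns one signature (a list of m row positions) per publication.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    signatures = [[] for _ in range(n_cols)]
    for _ in range(m):
        order = list(range(n_rows))
        rng.shuffle(order)  # one random permutation of the rows
        for c in range(n_cols):
            # Position, in the permuted order, of the first row with a 1 in column c.
            first = next(pos for pos, r in enumerate(order) if matrix[r][c] == 1)
            signatures[c].append(first)
    return signatures


# Hypothetical rows: HTC, Apple, White, Black, Android
# Columns 0 and 1 are identical publications; column 2 shares nothing with them.
matrix = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
]
sigs = minhash_signatures(matrix, m=4)
```

Identical publications always receive identical signatures, and the fraction of positions on which two signatures agree estimates the Jaccard similarity of the two columns.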
28. 28
LSH in MAXDIVREL:
Signature Matrix
• Fast minhashing
  • Select m random hash functions
  • To model the effect of m random permutations
  • Mathematically proven only when
    • the number of rows is a prime
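A sketch of fast minhashing under one common assumption: each hash function has the form h(r) = (a·r + b) mod p with p equal to the (prime) number of rows and a ≠ 0, which makes h a true permutation of the row numbers. The slides do not state the hash family, so the form of h, the function name and the sample data are illustrative.

```python
import random


def fast_minhash(columns, n_rows, m, seed=0):
    """columns: one non-empty set of row indices (the rows holding a 1) per publication.

    n_rows is assumed to be prime, so (a * r + b) % n_rows with a != 0
    maps the rows 0..n_rows-1 onto themselves without collisions.
    """
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, n_rows), rng.randrange(0, n_rows)) for _ in range(m)]
    return [
        [min((a * r + b) % n_rows for r in col) for a, b in coeffs]
        for col in columns
    ]


# Same hypothetical data as a 5-row characteristic matrix (5 is prime).
columns = [{0, 2, 4}, {0, 2, 4}, {1, 3}]
fast_sigs = fast_minhash(columns, n_rows=5, m=4)
```

No row is physically permuted: each signature entry is the smallest hashed row number among the rows where the column has a 1.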
29. 29
LSH in MAXDIVREL:
LSH Buckets
• Take r-sized signature vectors
  • From the m-sized minhash signature
• Map them into
  • L hash tables
  • Each with an arbitrary number b of buckets
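The bucket mapping can be sketched as below, assuming the r-sized vectors are consecutive, non-overlapping slices of the signature, so that L = m // r. That slicing rule, the use of Python's built-in hash on the slice, and the sample signatures are assumptions, not details given in the slides.

```python
from collections import defaultdict


def build_tables(signatures, r, b):
    """signatures: dict mapping a publication id to its list of m minhash values.

    Returns L = m // r hash tables; each maps a bucket number in [0, b)
    to the list of publication ids that fall into it.
    """
    m = len(next(iter(signatures.values())))
    L = m // r
    tables = [defaultdict(list) for _ in range(L)]
    for pub_id, sig in signatures.items():
        for t in range(L):
            band = tuple(sig[t * r:(t + 1) * r])  # r-sized signature vector
            tables[t][hash(band) % b].append(pub_id)
    return tables


# Hypothetical signatures with m = 4; A and B agree on the first r = 2 values.
signatures = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
tables = build_tables(signatures, r=2, b=16)
```

Publications that agree on all r values of at least one slice land in the same bucket of that table, which is what makes them neighborhood candidates.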
32. 32
LSH in MAXDIVREL:
Batch-wise Top-k computation
• Bucket "winner": the publication with the highest relevancy score
  ✓ The winner is dominant and represents its bucket neighborhood
• Top-k "winners" that have a majority of votes
  ✓ The k winners are independent
[Figure: publication stream p_A, p_B, ..., p_H, ... with the i-th window marked]
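A sketch of the winner-and-vote idea, assuming each non-empty bucket casts exactly one vote for its winner and the k most-voted winners are returned. The voting rule, the table layout and the sample scores are assumptions for illustration; the independence check between winners is not modeled here.

```python
from collections import Counter


def top_k_winners(tables, relevancy, k):
    """tables: list of dicts mapping a bucket id to a list of publication ids.

    relevancy: dict mapping a publication id to its relevancy score.
    """
    votes = Counter()
    for table in tables:
        for bucket in table.values():
            # The bucket "winner" is its publication with the highest relevancy.
            winner = max(bucket, key=lambda pub: relevancy[pub])
            votes[winner] += 1
    return [pub for pub, _ in votes.most_common(k)]


# Hypothetical window with three publications hashed into two tables.
tables_example = [
    {0: ["A", "B"], 1: ["C"]},
    {0: ["A"], 1: ["B", "C"]},
]
relevancy = {"A": 2.5, "B": 2.0, "C": 1.3}
winners = top_k_winners(tables_example, relevancy, k=2)
```

In this example A wins two buckets and collects two votes, so it is ranked first.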
33. 33
LSH in MAXDIVREL:
Incremental Top-k computation
1. A new publication arrives
2. Update its characteristic vector (characteristic matrix)
3. Generate its minhash signature (signature matrix)
4. Map the signature into L hash tables
5. Update the "winner" at each bucket the signature maps into
6. Vote for Top-k candidates
34. 34
LSH in MAXDIVREL:
When new publication F arrives…
• Only buckets B13, B23, B32, B43 will vote
• Follow continuity requirements
  • Durability
  • Order
[Figure: publication stream p_A, p_B, ..., p_H, ... with the i-th and (i+1)-th windows marked]
39. 39
Subscriber Effectiveness:
Quality or natural behavior
• Testing the Zipf or power-law hypothesis on the distribution of ranked results (KS test)
i. Fitting power law
ii. Goodness of fit tests
iii. Alternative distributions
• Compute 19030 ranked distributions over a 100K publication stream
  • Under different subscriber views
  • Under different sized sliding window instances
[Figure: sample distribution of ranked votes; axis labels: log zipf_prob(rank), log(rank)]
$\mathrm{zipf}(k; s, N) = \dfrac{1/k^{s}}{\sum_{n=1}^{N} (1/n^{s})}$
N - number of elements in distribution,
k - rank of element
s - value of exponent
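The Zipf probability defined above translates directly into code. This is a plain transcription of the formula; the function name is illustrative.

```python
def zipf_pmf(k, s, N):
    """Probability of the element at rank k under a Zipf law with exponent s over N elements."""
    normalizer = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normalizer


# With s = 1 and N = 2 the two ranks receive probabilities 2/3 and 1/3.
p_first = zipf_pmf(1, 1.0, 2)
p_second = zipf_pmf(2, 1.0, 2)
```

A larger exponent s concentrates more probability on the top ranks.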
40. 40
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values
41. 41
Subscriber Effectiveness:
i. Fitting power law
Illustration of Zipf exponent values convergence
42. 42
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values under different similarity thresholds
43. 43
Subscriber Effectiveness:
ii. Goodness of fit tests
$K_1 = \max_{x \ge x_{\min}} \lvert S(x) - P(x) \rvert$
S(x): observed CDF
P(x): fitted CDF
$p\text{-value} = \dfrac{\text{number of } K_i \text{ where } K_i > K_1}{n}$
n = 1000 synthetic Zipf datasets
P-values of KS test under different subscriber views
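A sketch of the p-value procedure under simplifying assumptions: x_min is taken to be 1, the exponent s is held fixed instead of being re-fitted for every synthetic dataset, and observations are ranks in 1..N. Function names and the sample data are illustrative.

```python
import random
from itertools import accumulate


def zipf_cdf(s, N):
    """Model CDF P(x) for ranks x = 1..N."""
    weights = [1.0 / n ** s for n in range(1, N + 1)]
    total = sum(weights)
    return [c / total for c in accumulate(weights)]


def ks_statistic(sample, model_cdf):
    """Largest gap |S(x) - P(x)| over all ranks, S being the empirical CDF."""
    counts = [0] * len(model_cdf)
    for x in sample:
        counts[x - 1] += 1
    empirical = [c / len(sample) for c in accumulate(counts)]
    return max(abs(e - p) for e, p in zip(empirical, model_cdf))


def ks_p_value(sample, s, N, n=1000, seed=0):
    """Fraction of n synthetic Zipf datasets whose KS statistic exceeds the observed one."""
    rng = random.Random(seed)
    model = zipf_cdf(s, N)
    k1 = ks_statistic(sample, model)
    ranks = list(range(1, N + 1))
    exceed = 0
    for _ in range(n):
        synthetic = rng.choices(ranks, cum_weights=model, k=len(sample))
        if ks_statistic(synthetic, model) > k1:
            exceed += 1
    return exceed / n


# A sample that piles all mass on the last rank is clearly not Zipf-like.
p_bad = ks_p_value([10] * 50, s=1.0, N=10, n=200)
```

A p-value close to 0 means the observed distribution lies further from the fitted model than almost all synthetic datasets do, so the Zipf hypothesis is implausible for that sample.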
44. 44
Subscriber Effectiveness:
iii. Testing alternative distributions
45. 45
Subscriber Effectiveness:
Other diversity based methods
• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC
• For an even comparison,
  • Combine relevancy with every diversity method
  • To achieve a bi-criteria objective
[Figure: average Zipf law exponent in a comparison with other methods]
46. 46
Subscriber Effectiveness:
Other diversity based methods
• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC
[Figure: a comparison of the average Zipf law exponent with other methods]
47. 47
Subscriber Effectiveness:
Accuracy of Top-k results
LSH Index vs. NAÏVE
• Rank probability
• Diversity probability
[Figure panels: accuracy at similarity threshold = 0.55; accuracy at similarity threshold = 0.85]
49. 49
Performance
Subscription index update time
Index construction time on opIndex vs. modified opIndex
opIndex vs. modified opIndex
50. 50
Efficiency:
Initial matching time at modified opIndex
[Figure panels: initial matching time under different sizes of subscription space; initial matching time under different numbers of publications]
51. 51
Performance & Efficiency:
LSH Index
[Figure: BLSH index construction + update time for different numbers of minhash functions]
Number of minhash functions: $m = \dfrac{1}{(\text{estimated error})^{2}}$
• How much accuracy do we sacrifice by comparing small minhash signatures?
52. 52
Performance & Efficiency
ILSH vs. BLSH vs. NAÏVE
[Figure: publication stream p1, p2, ..., p8, ... with repeated BLSH or NAIVE computations contrasted with a single ILSH computation]
53. 53
Performance & Efficiency:
BLSH vs. NAÏVE
[Figure panels: log(ranking time) vs. number of publications with D=250; log(ranking time) vs. number of publications with D=500]
54. 54
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
[Figure panels: log(ranking time) vs. number of publications with D=250; log(ranking time) vs. number of publications with D=500]
55. 55
Conclusions
• Diversified results produced by MAXDIVREL, based on the independent-dominating set problem
  • Exhibit stronger natural behavior than
  • Methods based on the p-dispersion problem
• Relevancy is an important factor to employ
  • In distance-based diversity methods
  • It always tends to produce a diverse set of personalized results
• Absolute ranks are sensitive to the preference value
  • While the deviation among relative ranks stays small
56. 56
Conclusions (Ctd.)
• Locality Sensitive Hashing (LSH) indexing method
  • Produces the MAXDIVREL diverse set of results at 70% accuracy on average, relative to the naïve method
  • Reduces the matching time significantly compared with the NAÏVE method
• Further refined by its incremental version
  • For handling streaming publications
  • Avoids the curse of re-computing neighborhoods
• No such k to restrict the delivery of Top publications
  • Given a window size & delivery method
  • The model can produce the best diverse set of personalized results
  • To represent the set of all matching publications at a given instance
57. 57
Major Contributions
• Dynamic diversification method based on the independent-dominating set problem
  • We introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy.
• Index-based diversification approach to rank results incrementally
  • We proposed a novel hashing-based index approach to solve the MAXDIVREL continuous k-diversity problem, based on the Locality Sensitive Hashing (LSH) technique
• Advanced evaluation method to measure the quality of diverse results
  • First significant attempt to model the natural behavior of diversity methods in the pub/sub community
58. 58
Future work
• Explore other suitable use cases to apply the proposed model & develop prototype applications, e.g.
  • Personalized newspaper for every Facebook user
  • Diverse set of personalized Twitter trends
  • Social annotation of news stories
• Exploit overlap among the diversified results of users who have similar interests
• Employ existing implicit methods to extract human preferences
  • E.g. click-stream analytics
• Develop the LSH-based index over a multi-threaded distributed environment