Cloud based publish/subscribe model for
Top-k matching over continuous data
streams
Author:
Y.S. Horawalavithana
10002103
Supervisor:
Dr. D.N. Ranasinghe
U/Graduate Thesis Defense
January 23, 2015
UNIVERSITY OF COLOMBO SCHOOL OF COMPUTING
SCS 4001: INDIVIDUAL PROJECT
1
2
Overview
• Motivation
• Target
• Design & Architecture
• Related work
• Dynamic Diversification
• Incremental Top-k
• Implementation
• Evaluation
• Conclusion
• Future work
3
Motivation – "Big Filter"
1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work
4
Boolean publish/subscribe
Drawbacks
• A subscriber may be either overloaded with publications or receive too few publications.
• Matching publications cannot be compared with one another, because no ranking function is defined.
• Partial matching between subscriptions and publications is not supported.
5
Top-k publish/subscribe
• Expressive, stateful query processing systems
• A user-defined parameter k restricts the delivered publications
• Pub/Sub Matching
  • Top-k pub/sub scoring or ranking
• Pub/Sub Indexing
  • Indexing to support personalized subscriptions
  • Indexing to support continuous Top-k publication retrieval
6
Target
1. How can we define an efficient scoring algorithm that integrates query-independent & query-dependent score metrics?
- Relevance, Freshness & Diversity
2. How can we adapt the indexing data structures used in state-of-the-art publish/subscribe systems to support Top-k matching queries under
a) large subscription volume,
b) high event rate, and
c) a variety of subscribable attributes?
7
Scope
• Optimize the Top-k heuristic for a specific domain
  • E-commerce with buyers & sellers
• Subscriptions & publications follow a pre-defined data structure
• The number of incoming publications follows a Poisson distribution
• Retrieve Top-k publications against subscriptions, not the reverse
8
Design & Architecture
[Figure: system architecture. Components shown: publication stream; relevance matching against the subscription index (subscription store holding personalized subscriptions, with expiry); publication store holding publications with relevance scores (with expiry); publication indexing; continuous Top-k diversity over a sliding window, combining relevancy and dissimilarity; Top-k notification store; event delivery of notifications]
9
Related work: General Top-k publish/subscribe
Pub/sub model | Subscription | Timing policy | Diversity / scoring metric | Subscription indexing method | Incremental publication indexing | Architecture
PrefSIENA (Drosou, ACM DEBS 2009) | Preferential subscription | Sliding window | Relevancy + MAXMIN diversity | Subscription covering | - | Centralized message brokers
RRPS (Lu, ICCSA 2009) | Normal | Continuous | QoS | - | - | Centralized
DaZaLaPs (Pripuzi, IS 2012) | Normal | Sliding window | Relevancy | Grid based | - | P2P
Top-k pub/sub (Shraer [Google], VLDB 2014) | Normal | Continuous | Relevancy + Freshness | Tree based | TAAT & DAAT | Centralized
Our model | Personalized subscription space | Sliding window | MAXDIVREL diversity | Inverted-list based | Hashing based | Cloud based
10
Sliding window Top-k computation
[Figure: a matching publication stream P1 ... P10 shown under three successive window positions, with Top-2 results (P5, P1), (P5, P6) and (P5, P9); jumping step h = 1 and h = 3; legend: Expired, Active, Top-k]
• Top-k notification delivery
  • On-demand
  • Proactive
11
Relevancy: Personalized Subscription space
Carrier = AT&T (0.4) Subscribe
Brand = HTC (0.3)
Storage ≤ 16GB (0.7)
1.75
1.3
2.3
Carrier = Verizon (0.5)
Storage ≤ 32GB (0.2)
2.52
Storage ≤ 32GB (0.6)
Brand = HTC (0.3)
12
Relevancy: Personalized Subscription space
2
Carrier = Verizon
Storage ≤ 32GB
2.5
Carrier = AT&T
Storage ≤ 16GB
1.75
Brand = HTC
1.3
2.3
Carrier = Verizon
Color = White
OS = Android
Storage = 16GB
Brand = HTC
Subscribe
13
Subscription Indexing: Modified opIndex
• Based on inverted lists
  • Posting lists
• Two-level partitioning
  • Attribute posting list
  • Operator posting list
• Locate satisfying subscription tuples
• Relevancy score
  • By satisfying relations
  • By satisfying subscription tuples
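The two-level partitioning above can be illustrated with a minimal sketch. It is not the modified opIndex itself: the class name `SubscriptionIndex`, the operator table `OPS`, the attribute names, and the choice to score a subscription by summing the preference weights of its satisfied predicates are all assumptions made for illustration.

```python
from collections import defaultdict
import operator

# Hypothetical operator table; the slides only name "operator posting lists".
OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge}


class SubscriptionIndex:
    """Two-level inverted index: attribute -> operator -> posting list."""

    def __init__(self):
        # index[attribute][op] = list of (value, subscription_id, weight)
        self.index = defaultdict(lambda: defaultdict(list))

    def add(self, sub_id, predicates):
        # predicates: iterable of (attribute, op, value, preference_weight)
        for attr, op, value, weight in predicates:
            self.index[attr][op].append((value, sub_id, weight))

    def match(self, publication):
        """Return {sub_id: score}; the score sums the weights of satisfied
        predicates (an assumed scoring rule, not taken from the slides)."""
        scores = defaultdict(float)
        for attr, pub_value in publication.items():
            for op, postings in self.index.get(attr, {}).items():
                for value, sub_id, weight in postings:
                    if OPS[op](pub_value, value):
                        scores[sub_id] += weight
        return dict(scores)


idx = SubscriptionIndex()
idx.add("s1", [("carrier", "=", "AT&T", 0.4),
               ("brand", "=", "HTC", 0.3),
               ("storage_gb", "<=", 16, 0.7)])
idx.add("s2", [("carrier", "=", "Verizon", 0.5),
               ("storage_gb", "<=", 32, 0.2)])
publication = {"carrier": "Verizon", "brand": "HTC", "storage_gb": 16,
               "os": "Android", "color": "White"}
scores = idx.match(publication)
```

Only the posting lists of attributes present in the publication are visited, which is the point of partitioning by attribute first and by operator second.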
14
Freshness
• As the window becomes larger,
  • older publications may prevent newer publications from entering the Top-k results
• Lease relevancy scores?
  • But the scores then have to be re-calculated
• Forward decay!
  • Fresh-relevancy score = relevancy score × freshness score
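A minimal sketch of why forward decay avoids re-calculation, assuming an exponential decay function g and a fixed landmark time. `LANDMARK`, `LAMBDA` and the function names are illustrative, not taken from the slides. The freshness weight g(t_i − L) is fixed when a publication arrives, so the stored fresh-relevancy score never changes; normalising by g(now − L) at query time rescales every score by the same factor and leaves the ranking untouched.

```python
import math

LANDMARK = 0.0   # landmark time L, no later than the earliest arrival (assumed)
LAMBDA = 0.001   # decay rate per millisecond (illustrative value)


def forward_weight(arrival_time):
    """g(t_i - L) with exponential g; computed once, at arrival."""
    return math.exp(LAMBDA * (arrival_time - LANDMARK))


def fresh_relevancy(relevancy, arrival_time):
    """Static score: relevancy score x forward-decay freshness weight."""
    return relevancy * forward_weight(arrival_time)


def decayed_score(relevancy, arrival_time, now):
    """Normalised score at query time; dividing by g(now - L) keeps the order."""
    return fresh_relevancy(relevancy, arrival_time) / forward_weight(now)


older = fresh_relevancy(0.9, arrival_time=0.0)
newer = fresh_relevancy(0.8, arrival_time=5000.0)
```

Here the newer publication outranks the older one despite its lower relevancy, and that order is the same at every later query time.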
15
Diversity: Top-k representative set
[Figure: representative Top-k. One panel shows the drawback (without diversity), the other what we want (with diversity)]
A method to retrieve Top-k publications from the matching publications
16
MAX* k-diversity problem
Find:
$$S^{*} = \underset{S \subseteq P,\ |S| = k}{\operatorname{argmax}}\ f(S, d)$$
where
1. P = {p1, …, pn}
2. k ≤ n
3. d: a distance metric
4. f: a diversity function
17
Proposed: MAXDIVREL k-diversity problem
$$S^{*} = \underset{S \subseteq P}{\operatorname{argmax}}\ f(S, d, r) = \underset{S \subseteq P}{\operatorname{argmax}}\ \frac{g(S, d, r)}{h(S, d, r)}$$
g(S, d, r): maximize the dissimilarity & relevancy in S
h(S, d, r): minimize the dissimilarity & relevancy in P − S
where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. f: a diversity function
18
Formal Definition: MAXDIVREL k-diversity
$$g(S, d, r) = \operatorname{argmax}\ \frac{1}{|S|} \sum_{p_i, p_j \in S,\ i \neq j} \frac{r(p_j)}{r(p_i)}\, d(p_i, p_j), \quad \text{independence holds}$$
$$h(S, d, r) = \operatorname{argmin}\ \frac{1}{|P - S|} \sum_{p_i \in S,\ p_j \in P - S} \frac{r(p_j)}{r(p_i)}\, d(p_i, p_j), \quad \text{dominance holds}$$
where
1. P = {p1, …, pn}
2. d: a distance metric
3. r: a relevance metric
4. α > 0
Independence condition:
$$\forall p_i, p_j \in S:\ d(p_i, p_j) > \alpha$$
Dominance condition:
$$\forall p_i \in P,\ \exists p_j \in S \ \text{s.t.}\ d(p_i, p_j) \le \alpha;\ i \neq j$$
19
NP-Hardness:
Minimum independent-dominating set
[Figure: publications p1 ... p5 in the publication space, with radius α, and the corresponding graph model on vertices v1 ... v5. Four example vertex sets are shown: three are independent and dominating, one is dominating but not independent]
$$\mathrm{Neighborhood}(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \neq p_j \,\}$$
20
NAÏVE Greedy
$$\underset{p_i}{\operatorname{argmax}}\ \frac{r(p_i)^{2}}{\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)}$$
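A minimal sketch of how the greedy score above could drive a selection loop. The loop is an assumption modelled on the usual greedy for independent dominating sets: pick the best-scoring publication, then discard everything within distance α of it. The function name, the optional cap `k`, and the rule that a publication with an empty neighborhood gets the best score are illustrative choices, not the author's stated algorithm.

```python
def naive_greedy(pubs, relevancy, dist, alpha, k=None):
    """pubs: list of ids; relevancy: id -> score > 0; dist(a, b) -> distance."""
    remaining = set(pubs)
    selected = []
    while remaining and (k is None or len(selected) < k):
        def score(p):
            # Neighborhood of p among the publications not yet covered.
            nbrs = [q for q in remaining if q != p and dist(p, q) <= alpha]
            denom = sum(relevancy[q] * dist(p, q) for q in nbrs)
            # Empty neighborhood: treated as the best possible score (assumed).
            return float("inf") if denom == 0 else relevancy[p] ** 2 / denom

        # sorted() makes tie-breaking deterministic.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        # Remove the winner and everything it dominates; this keeps the
        # selected set independent (pairwise distance > alpha).
        remaining -= {q for q in remaining if q == best or dist(best, q) <= alpha}
    return selected


coords = {"p1": 0.0, "p2": 0.1, "p3": 0.2, "p4": 1.0, "p5": 1.1}
rel = {"p1": 0.5, "p2": 0.9, "p3": 0.4, "p4": 0.7, "p5": 0.6}
distance = lambda a, b: abs(coords[a] - coords[b])
result = naive_greedy(list(coords), rel, distance, alpha=0.25)
```

Every call recomputes all neighborhoods from scratch, which is the cost the incremental index later in the deck is meant to avoid.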
21
Handling streaming publications
[Figure: a new publication p6 arrives in the publication space (radius α) and is added to the graph model as vertex v6, alongside v1 ... v5]
Continuity Requirements
1. Durability
An item selected as diverse in the i-th window may remain selected in the (i+1)-th window if it has not expired and the other valid items in the (i+1)-th window fail to outcompete it.
2. Order
The publication stream follows chronological order.
Once an item i has been selected as diverse, we do not later select an item j if i is no older than j.
22
MAXDIVREL continuous k-diversity
[Figure: matching publication stream P1, P2, ..., Pj, Pj+1, ... with the i-th and (i+1)-th windows; MAXDIVREL k-diversity yields S*_i and S*_(i+1); constraints shown: Independence, Dominance, Durability, Order]
• Straightforward solution:
  • Apply the naïve greedy method at each window instance
• Proposed: an incremental index mechanism!
  • Avoids the cost of re-calculating neighborhoods
23
Locality Sensitive Hashing (LSH)
• Simple idea
  • If two points are close together, then after a "projection" operation these two points will remain close together
24
LSH Analysis
• For any given points p, q ∈ R^d:
$$\Pr_{H}[h(p) = h(q)] \ge P_1 \quad \text{for } \lVert p - q \rVert \le d_1$$
$$\Pr_{H}[h(p) = h(q)] \le P_2 \quad \text{for } \lVert p - q \rVert \ge c\,d_1 = d_2$$
• The hash function h is (d1, d2, P1, P2)-sensitive
• Ideally we need
  • (P1 − P2) to be large
  • (d2 − d1) to be small
25
LSH in MAXDIVREL:
Publications as categorical data
26
LSH in MAXDIVREL:
Characteristic Matrix
27
LSH in MAXDIVREL:
Minhashing
• No publications any more!
  • A signature represents each publication
• Technique
  • Randomly permute the rows of the characteristic matrix m times
  • For each publication's column, take the number of the first row, in the permuted order, in which the column has a 1
• Advantage:
  • Reduces the dimensions to a small minhash signature
[Figure: first permutation of the rows of the characteristic matrix]
28
LSH in MAXDIVREL:
Signature Matrix
Fast-minhashing
• Select m random hash functions
  • to model the effect of m random permutations
• Mathematically valid only when
  • the number of rows is a prime
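A minimal sketch of fast minhashing, assuming hash functions of the form h(x) = (a·x + b) mod p with p equal to the (prime) number of rows and a ≠ 0. Under that assumption each function is a true permutation of the row numbers, which is why primality matters. The function names and the seed are illustrative.

```python
import random


def make_hash_functions(m, prime, seed=0):
    """m functions h(x) = (a*x + b) mod prime, each simulating a row permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(m)]


def minhash_signature(rows_with_one, hash_functions, prime):
    """rows_with_one: row indices (0 <= row < prime) where the publication's
    column of the characteristic matrix holds a 1; must be non-empty."""
    return [min((a * row + b) % prime for row in rows_with_one)
            for a, b in hash_functions]


def estimated_similarity(sig_x, sig_y):
    """Fraction of agreeing positions; an estimate of the Jaccard similarity."""
    return sum(1 for u, v in zip(sig_x, sig_y) if u == v) / len(sig_x)


PRIME = 13  # number of rows in the toy characteristic matrix
funcs = make_hash_functions(m=50, prime=PRIME)
sig_a = minhash_signature({0, 3, 5, 7}, funcs, PRIME)
sig_b = minhash_signature({0, 3, 5, 7}, funcs, PRIME)
sig_c = minhash_signature({1, 2, 9}, funcs, PRIME)
```

Only the m-entry signatures need to be kept and compared afterwards; the characteristic columns are not consulted again.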
29
LSH in MAXDIVREL:
LSH Buckets
• Take r-sized signature vectors
  • from the m-sized minhash signature
• Map them into
  • L hash tables
  • each with an arbitrary number b of buckets
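A minimal sketch of the bucket mapping described above, assuming the m-sized signature is cut into L consecutive r-sized vectors and that each vector is hashed with Python's built-in `hash` modulo b. The function name and the choice of hash are illustrative.

```python
from collections import defaultdict


def build_lsh_tables(signatures, L, r, b):
    """signatures: pub_id -> minhash signature of length m = L * r.
    Returns L tables, each mapping a bucket number in [0, b) to pub_ids."""
    tables = [defaultdict(list) for _ in range(L)]
    for pub_id, sig in signatures.items():
        assert len(sig) == L * r, "signature length must equal L * r"
        for t in range(L):
            band = tuple(sig[t * r:(t + 1) * r])   # r-sized signature vector
            tables[t][hash(band) % b].append(pub_id)
    return tables


signatures = {"A": [1, 2, 3, 4], "B": [1, 2, 9, 9], "C": [7, 7, 7, 7]}
tables = build_lsh_tables(signatures, L=2, r=2, b=1000)
```

Publications A and B agree on their first r-sized vector, so they land in the same bucket of the first table even though their full signatures differ.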
30
LSH in MAXDIVREL:
How to select L, r?
For two vectors x, y:
$$J_D(x, y) = 1 - J_{SIM}(x, y), \quad \text{where } J_{SIM}(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
1. $L \times r = m$
2. $\text{similarity threshold } (s) \approx \left(\frac{1}{L}\right)^{1/r}$
31
LSH in MAXDIVREL:
Analysis
For two vectors x, y:
$$J_D(x, y) = 1 - J_{SIM}(x, y), \quad \text{where } J_{SIM}(x, y) = \frac{|x \cap y|}{|x \cup y|}$$
• For publications x & y:
$$J_{SIM}(x, y) \propto \Pr[H(x) = H(y)]$$
• At a particular hash table (r = size of the signature vector hashed into it)
  • x & y map into the same bucket: $J_{SIM}(x, y)^{r}$
  • x & y do not map into the same bucket: $1 - J_{SIM}(x, y)^{r}$
• Over L hash tables
  • x & y do not map into the same bucket in any table: $(1 - J_{SIM}(x, y)^{r})^{L}$
  • x & y map into the same bucket in at least one table: $1 - (1 - J_{SIM}(x, y)^{r})^{L}$
True near neighbors are unlikely to be unlucky in all the projections.
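The parameter choice and the collision probability can be checked numerically with a small sketch. Here r is the number of signature entries hashed per table and L the number of tables, with L × r = m. The helper names are illustrative, and picking the factorization whose threshold lies closest to the target similarity is one possible selection rule, stated as an assumption.

```python
def threshold(L, r):
    """Approximate similarity threshold (1/L)^(1/r)."""
    return (1.0 / L) ** (1.0 / r)


def candidate_probability(s, L, r):
    """Probability that two publications with Jaccard similarity s share a
    bucket in at least one of the L hash tables: 1 - (1 - s^r)^L."""
    return 1.0 - (1.0 - s ** r) ** L


def choose_L_r(m, s):
    """Among factorizations L * r = m, pick the threshold closest to s."""
    pairs = [(m // r, r) for r in range(1, m + 1) if m % r == 0]
    return min(pairs, key=lambda p: abs(threshold(*p) - s))


L, r = choose_L_r(m=100, s=0.55)
p_near = candidate_probability(0.8, L, r)   # well above the threshold
p_far = candidate_probability(0.2, L, r)    # well below the threshold
```

With m = 100 and a target threshold of 0.55, this rule selects L = 20 and r = 5. Pairs well above the threshold almost always collide in some table, and pairs well below it almost never do.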
32
LSH in MAXDIVREL:
Batch-wise Top-k computation
• Bucket "winner": the publication with the highest relevancy score in the bucket
  ✓ The winner is dominant, representing its bucket neighborhood
• Top-k "winners" that have a majority of votes
  ✓ The k winners are independent
[Figure: publications P_A ... P_H in the i-th window]
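A minimal sketch of the winner-and-vote step, assuming every non-empty bucket casts one vote for its highest-relevancy member and the k most-voted winners form the result. The tie-breaking rules (relevancy, then identifier) are assumptions added to make the outcome deterministic.

```python
from collections import Counter


def batch_top_k(tables, relevancy, k):
    """tables: list of {bucket: [pub_id, ...]}; relevancy: pub_id -> score."""
    votes = Counter()
    for table in tables:
        for members in table.values():
            # Bucket winner: highest relevancy; ties broken by pub_id.
            winner = max(members, key=lambda p: (relevancy[p], p))
            votes[winner] += 1
    # Rank winners by votes, then by relevancy as a tie-breaker (assumed).
    ranked = sorted(votes, key=lambda p: (-votes[p], -relevancy[p], p))
    return ranked[:k]


toy_tables = [
    {0: ["A", "B"], 1: ["C"]},
    {0: ["A"], 1: ["B", "C"]},
]
toy_relevancy = {"A": 0.9, "B": 0.5, "C": 0.7}
top2 = batch_top_k(toy_tables, toy_relevancy, k=2)
```

Publication B never wins a bucket here because a more relevant publication shares each of its buckets, so it receives no votes.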
33
LSH in MAXDIVREL:
Incremental Top-k computation
1. New publication i arrives
2. Update the i-th characteristic vector (characteristic matrix)
3. Generate the i-th minhash signature (signature matrix)
4. Map the i-th signature into the L hash tables
5. Update the "winner" at each bucket the i-th signature maps into
6. Vote for the Top-k candidates
34
LSH in MAXDIVREL:
When new publication F arrives...
• Only buckets B13, B23, B32, B43 will vote
• Follow continuity requirements
  • Durability
  • Order
[Figure: publications P_A ... P_H with the i-th and (i+1)-th windows]
35
Implementation
36
Cloud service modules
[Figures: cloud service diagrams. Sources: Amazon Kinesis, Amazon ElastiCache]
Publication stream
• Zipfian subscriptions
• Normalized preferences
37
Evaluation:
Dataset
Amazon online marketplace data as available from 17th to 19th November 2014
$$\mathrm{zipf}(k; s, N) = \frac{1 / k^{s}}{\sum_{n=1}^{N} \left( 1 / n^{s} \right)}$$
N - number of elements in distribution,
k - rank of element
s - value of exponent
$$\text{Total number of subscriber views} = \sum_{i=2}^{32} \left( {}^{48}C_{i} + {}^{42}C_{i} + {}^{54}C_{i} + {}^{66}C_{i} + {}^{57}C_{i} + {}^{67}C_{i} \right)$$
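The Zipf probability and the subscriber-view count above can be evaluated with a short sketch. The function names are illustrative, and reading "n C i" as the binomial coefficient (the number of ways to choose i items out of n) is an assumption about the slide's notation.

```python
from math import comb


def zipf_pmf(k, s, N):
    """Probability of rank k under a Zipf law with exponent s over N elements."""
    normaliser = sum(1.0 / n ** s for n in range(1, N + 1))
    return (1.0 / k ** s) / normaliser


def total_subscriber_views(sizes=(48, 42, 54, 66, 57, 67), low=2, high=32):
    """Sum over i = low..high of C(n, i) for each n in sizes (assumed reading)."""
    return sum(comb(n, i) for i in range(low, high + 1) for n in sizes)


probabilities = [zipf_pmf(k, s=1.0, N=10) for k in range(1, 11)]
views = total_subscriber_views()
```

The probabilities fall monotonically with rank and sum to one, which is what makes the exponent s a compact summary of how skewed a ranked distribution is.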
38
Evaluation:
Methodology
• Subscriber effectiveness: quality, accuracy, resiliency, freshness
• Performance & efficiency: index construction time, Top-k matching time
• Platform: Amazon AWS
  ✓ Linux-based micro-node instances
  ✓ Each with 2.3 GHz, 8GB memory
  ✓ Algorithms are implemented in Java
39
Subscriber Effectiveness:
Quality or natural behavior
• Testing the Zipf (power-law) hypothesis on the distribution of ranked results (KS test)
  i. Fitting a power law
  ii. Goodness-of-fit tests
  iii. Alternative distributions
• Compute 19030 ranked distributions over a 100K publication stream
  • Under different subscriber views
  • Under different-sized sliding window instances
[Figure: sample distribution of ranked votes; axes: log(rank) and log zipf_prob(rank)]
$$\mathrm{zipf}(k; s, N) = \frac{1 / k^{s}}{\sum_{n=1}^{N} \left( 1 / n^{s} \right)}$$
N - number of elements in distribution,
k - rank of element
s - value of exponent
40
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values
41
Subscriber Effectiveness:
i. Fitting power law
Illustration of Zipf exponent values convergence
42
Subscriber Effectiveness:
i. Fitting power law
Zipf exponent values under different similarity threshold
43
Subscriber Effectiveness:
ii. Goodness of fit tests
$$\gamma_1 = \max_{x \ge x_{min}} \left| f(x) - g(x) \right|$$
f(x): observed rank CDF
g(x): perfectly fitted CDF
$$p\text{-value} = \frac{\text{number of } \gamma_i \text{ where } \gamma_i > \gamma_1}{i}, \quad i = 1000 \text{ synthetic Zipf datasets}$$
[Figure: p-values of the KS test under different subscriber views]
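The two quantities above reduce to a few lines of code. This sketch assumes the observed and fitted CDFs are sampled at the same points x ≥ x_min and that the KS distances of the synthetic datasets have already been computed; the function names and the toy numbers are illustrative, not results from the evaluation.

```python
def ks_distance(observed_cdf, fitted_cdf):
    """Largest absolute gap between two CDFs sampled at the same x >= x_min."""
    return max(abs(f - g) for f, g in zip(observed_cdf, fitted_cdf))


def ks_p_value(gamma_observed, synthetic_gammas):
    """Share of synthetic datasets whose KS distance exceeds the observed one."""
    exceed = sum(1 for gamma in synthetic_gammas if gamma > gamma_observed)
    return exceed / len(synthetic_gammas)


gamma_1 = ks_distance([0.5, 0.8, 1.0], [0.4, 0.85, 1.0])
p_value = ks_p_value(0.1, [0.05, 0.2, 0.3, 0.08])
```

A large p-value means many synthetic Zipf samples fit worse than the observed data, so the Zipf hypothesis cannot be rejected.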
44
Subscriber Effectiveness:
iii. Testing alternative distributions
45
Subscriber Effectiveness:
Other diversity based methods
• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC
For an even comparison, relevancy is combined into every diversity method to achieve a bi-criteria objective.
[Figure: average Zipf law exponent in comparison with other methods]
46
Subscriber Effectiveness:
Other diversity based methods
• P-dispersion problem: MAXMIN, MAXSUM
• Minimum independent-dominating set problem: MAXDIVREL, DisC
[Figure: a comparison of average Zipf law exponent with other methods]
47
Subscriber Effectiveness:
Accuracy of Top-k results
LSH Index vs. NAÏVE
• Rank probability
• Diversity probability
[Figures: accuracy at similarity threshold = 0.55; accuracy at similarity threshold = 0.85]
48
Subscriber Effectiveness:
Resiliency of Top-k results
[Figures: getting Top-k publications (unordered); getting Top-k publications (ordered)]
49
Performance
Subscription index update time
[Figure: index construction time, opIndex vs. modified opIndex]
50
Efficiency:
Initial matching time at modified opIndex
[Figures: initial matching time under different sizes of subscription space; initial matching time under different numbers of publications]
51
Performance & Efficiency:
LSH Index
[Figure: BLSH index construction + update time for different numbers of minhash functions]
$$\text{Number of minhash functions } (m) = \frac{1}{(\text{estimated error})^{2}}$$
• How much accuracy do we sacrifice by comparing small minhash signatures?
52
Performance & Efficiency
ILSH vs. BLSH vs. NAÏVE
[Figure: publication stream P1 ... P8; BLSH or NAÏVE is shown at each of four window instances, ILSH once across the stream]
53
Performance & Efficiency:
BLSH vs. NAÏVE
[Figures: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]
54
Performance & Efficiency:
ILSH vs. BLSH vs. NAÏVE
[Figures: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]
55
Conclusions
• Diversified results produced by MAXDIVREL, which is based on the independent-dominating set problem,
  • exhibit stronger natural behavior than
  • methods based on the p-dispersion problem
• Relevancy is an important factor to employ
  • in distance-based diversity methods
• The model always tends to produce a diverse set of personalized results
  • Absolute ranks are sensitive to the preference value
  • while the deviation among relative ranks stays small
56
Conclusions (Ctd.)
• Locality Sensitive Hashing (LSH) indexing method
  • Produces the MAXDIVREL diverse set of results at 70% average accuracy relative to the naïve method
  • Reduces the matching time significantly compared with the NAÏVE method
  • Is further refined by its incremental version
    • for handling streaming publications
    • avoiding the cost of re-computing neighborhoods
• No fixed k is needed to restrict the delivery of top publications
  • Given a window size & delivery method,
  • the model can produce the best diverse set of personalized results
  • to represent the set of all matching publications at a given instance
57
Major Contributions
• Dynamic diversification method based on the independent-dominating set problem
  • We introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy.
• Index-based diversification approach to rank results incrementally
  • We proposed a novel hashing-based index approach to solve the MAXDIVREL continuous k-diversity problem, based on the Locality Sensitive Hashing (LSH) technique.
• Advanced evaluation method to measure the quality of diverse results
  • First significant attempt to model the natural behavior of diversity methods in the pub/sub community
58
Future work
• Explore other suitable use cases for the proposed model & develop prototype applications, e.g.
  • Personalized newspaper for every Facebook user
  • Diverse set of personalized Twitter trends
  • Social annotation of news stories
• Exploit the overlap among the diversified results of users who have similar interests
• Employ existing implicit methods to extract human preferences
  • E.g. click-stream analytics
• Develop the LSH-based index for a multi-threaded distributed environment
59
Q&A
THANK YOU!
60
Appendix
Freshness
Mean delay between publications = 5000ms
A comparison between relevancy scores after influenced by freshness
61
Appendix
NAÏVE Ranking time
Average naïve Top-k matching time in comparison with size D of publications
62
Appendix
BLSH Ranking time
Average BLSH Top-k matching time in comparison with size D of publications
63
Appendix
ILSH Ranking time
Average ILSH Top-k matching time in comparison with size D of publications

More Related Content

Viewers also liked

Thesis Identifying Activity
Thesis Identifying ActivityThesis Identifying Activity
Thesis Identifying Activitymr_rodriguez23
Β 
La motivation au travail ude s
La motivation au travail ude sLa motivation au travail ude s
La motivation au travail ude sjoannecyr1962
Β 
Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be...
 Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be... Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be...
Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be...Arturo Hoffstadt
Β 
Thesis Defense Presentation
Thesis Defense PresentationThesis Defense Presentation
Thesis Defense Presentationosideloc
Β 
My Thesis Defense Presentation
My Thesis Defense PresentationMy Thesis Defense Presentation
My Thesis Defense PresentationDavid Onoue
Β 
Dissertation oral defense presentation
Dissertation   oral defense presentationDissertation   oral defense presentation
Dissertation oral defense presentationDr. Naomi Mangatu
Β 
How to Defend your Thesis Proposal like a Professional
How to Defend your Thesis Proposal like a ProfessionalHow to Defend your Thesis Proposal like a Professional
How to Defend your Thesis Proposal like a ProfessionalMiriam College
Β 
Opportunistic persistent data storage
Opportunistic persistent data storage Opportunistic persistent data storage
Opportunistic persistent data storage Luke Weerasooriya
Β 
Thesis Powerpoint
Thesis PowerpointThesis Powerpoint
Thesis Powerpointneha47
Β 
The thesis and its parts
The thesis and its partsThe thesis and its parts
The thesis and its partsDraizelle Sexon
Β 
Thesis Power Point Presentation
Thesis Power Point PresentationThesis Power Point Presentation
Thesis Power Point Presentationriddhikapandya1985
Β 
Writing thesis chapters 1-3 guidelines
Writing thesis chapters 1-3 guidelinesWriting thesis chapters 1-3 guidelines
Writing thesis chapters 1-3 guidelinespoleyseugenio
Β 

Viewers also liked (12)

Thesis Identifying Activity
Thesis Identifying ActivityThesis Identifying Activity
Thesis Identifying Activity
Β 
La motivation au travail ude s
La motivation au travail ude sLa motivation au travail ude s
La motivation au travail ude s
Β 
Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be...
 Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be... Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be...
Planning Mode Simulator: A simulation tool for studying ALMA's scheduling be...
Β 
Thesis Defense Presentation
Thesis Defense PresentationThesis Defense Presentation
Thesis Defense Presentation
Β 
My Thesis Defense Presentation
My Thesis Defense PresentationMy Thesis Defense Presentation
My Thesis Defense Presentation
Β 
Dissertation oral defense presentation
Dissertation   oral defense presentationDissertation   oral defense presentation
Dissertation oral defense presentation
Β 
How to Defend your Thesis Proposal like a Professional
How to Defend your Thesis Proposal like a ProfessionalHow to Defend your Thesis Proposal like a Professional
How to Defend your Thesis Proposal like a Professional
Β 
Opportunistic persistent data storage
Opportunistic persistent data storage Opportunistic persistent data storage
Opportunistic persistent data storage
Β 
Thesis Powerpoint
Thesis PowerpointThesis Powerpoint
Thesis Powerpoint
Β 
The thesis and its parts
The thesis and its partsThe thesis and its parts
The thesis and its parts
Β 
Thesis Power Point Presentation
Thesis Power Point PresentationThesis Power Point Presentation
Thesis Power Point Presentation
Β 
Writing thesis chapters 1-3 guidelines
Writing thesis chapters 1-3 guidelinesWriting thesis chapters 1-3 guidelines
Writing thesis chapters 1-3 guidelines
Β 

Similar to [Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...
[Undergraduate Thesis] Interim presentation on A Publish/Subscribe Model for ...Sameera Horawalavithana
Β 
Reranking based-recommender-system-with-deep-learning
Reranking based-recommender-system-with-deep-learningReranking based-recommender-system-with-deep-learning
Reranking based-recommender-system-with-deep-learningAhmed Saleh
Β 
Portfolio & Roadmap: 2 tools to scale Agile
Portfolio & Roadmap: 2 tools to scale AgilePortfolio & Roadmap: 2 tools to scale Agile
Portfolio & Roadmap: 2 tools to scale AgileDashlane
Β 
ODSC West 2021 – Composition in ML
ODSC West 2021 – Composition in MLODSC West 2021 – Composition in ML
ODSC West 2021 – Composition in MLBryan Bischof
Β 
"Paradigm Shifting" Presentation
"Paradigm Shifting" Presentation"Paradigm Shifting" Presentation
"Paradigm Shifting" PresentationDiego Malpica Chauvet
[Undergraduate Thesis] Final Defense presentation on Cloud Publish/Subscribe Model for Top-k Matching Over Continuous Data-streams

  • 1. Cloud based publish/subscribe model for Top-k matching over continuous data streams. Author: Y.S. Horawalavithana (10002103). Supervisor: Dr. D.N. Ranasinghe. U/Graduate Thesis Defense, January 23, 2015. University of Colombo School of Computing. SCS 4001: Individual Project.
  • 2. Overview: Motivation • Target • Design & Architecture • Related work • Dynamic Diversification • Incremental Top-k • Implementation • Evaluation • Conclusion • Future work
  • 3. Motivation – “Big Filter”
  • 4. Boolean publish/subscribe: drawbacks. (1) A subscriber may be either overloaded with publications or receive too few publications. (2) It is impossible to compare different matching publications, as ranking functions are not defined. (3) Partial matching between subscriptions and publications is not supported.
  • 5. Top-k publish/subscribe. Expressive, stateful query processing systems; a user-defined parameter k restricts the delivered publications. Pub/sub matching: Top-k pub/sub scoring or ranking. Pub/sub indexing: indexing to support personalized subscriptions, and indexing to support continuous retrieval of Top-k publications.
  • 6. Target. 1. How to define an efficient scoring algorithm that takes both query-independent and query-dependent score metrics into account (relevance, freshness and diversity)? 2. How to adapt the indexing data structures used in state-of-the-art publish/subscribe systems to support Top-k matching queries under a) large subscription volume, b) high event rate, and c) a variety of subscribable attributes?
  • 7. Scope. Optimize the Top-k heuristic for a specific domain: e-commerce with buyers and sellers. Subscriptions and publications follow a pre-defined data structure. The number of incoming publications is modeled as a Poisson random variable. Retrieve Top-k publications against subscriptions, not the reverse.
  • 9. Related work: general Top-k publish/subscribe. Systems are compared by subscription type, timing policy, scoring metric (including diversity), subscription indexing method, incremental publication indexing and architecture:
    - PrefSIENA (Drosou, ACM DEBS 2009): preferential subscription; sliding window; relevancy + MAXMIN diversity; subscription covering; centralized message brokers
    - RRPS (Lu, ICCSA 2009): normal; continuous; QoS; centralized
    - DaZaLaPs (Pripuzi, IS 2012): normal; sliding window; relevancy; grid based; P2P
    - Top-k pub/sub (Shraer [Google], VLDB 2014): normal; continuous; relevancy + freshness; tree based; TAAT & DAAT; centralized
    - Our model: personalized subscription space; sliding window; MAXDIVREL diversity; inverted-list based; hashing based; cloud based
  • 10. Sliding window Top-k computation. [Figure: a matching publication stream P1 … P10 shown at three window positions, with publications marked as expired, active or Top-k; the window advances by a jumping step h (h=1, h=3); Top-2 labels per window: (P5, P1), (P5, P6), (P5, P9).] Top-k notification delivery: on-demand or pro-active.
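The sliding-window computation on this slide can be sketched as follows. This is a minimal illustration that assumes each matching publication already carries a numeric score. The names `sliding_top_k`, `window_size` and the example scores are hypothetical, and the Top-k is recomputed from scratch for every reported window rather than maintained incrementally.

```python
import heapq
from collections import deque

def sliding_top_k(stream, window_size, k, h):
    """Yield the Top-k of each reported window; the window jumps by h arrivals."""
    window = deque(maxlen=window_size)  # the oldest publication expires automatically
    for n, (pub_id, score) in enumerate(stream, start=1):
        window.append((pub_id, score))
        # Report once the first window is full, then after every h new arrivals.
        if n >= window_size and (n - window_size) % h == 0:
            yield heapq.nlargest(k, window, key=lambda item: item[1])

# Hypothetical scored stream of matching publications.
stream = [("P1", 5), ("P2", 1), ("P3", 2), ("P4", 9), ("P5", 3)]
for top in sliding_top_k(stream, window_size=3, k=2, h=1):
    print(top)
```

A larger jumping step `h` reports fewer windows, so fewer notifications are produced for the same stream.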
  • 11. Relevancy: personalized subscription space. [Figure: a subscriber's weighted predicates: Carrier = AT&T (0.4), Brand = HTC (0.3), Storage ≤ 16GB (0.7), Carrier = Verizon (0.5), Storage ≤ 32GB (0.2), Storage ≤ 32GB (0.6), Brand = HTC (0.3); node scores 1.75, 1.3, 2.3, 2.5 and 2.]
  • 12. Relevancy: personalized subscription space. [Figure: the subscription space (Carrier = Verizon, Storage ≤ 32GB; Carrier = AT&T, Storage ≤ 16GB; Brand = HTC; node scores 2, 2.5, 1.75, 1.3, 2.3) matched against the publication Carrier = Verizon, Color = White, OS = Android, Storage = 16GB, Brand = HTC.]
  • 13. Subscription indexing: modified opIndex. Based on inverted lists (posting lists) with two-level partitioning: an attribute posting list, then an operator posting list. The index locates the satisfying subscription tuples. Relevancy score: by satisfying relations and by satisfying subscription tuples.
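A two-level inverted index of this kind might look like the sketch below. It is not the thesis implementation: the class `TwoLevelIndex`, the supported operators, and the choice to score a subscription by the summed weights of its satisfied predicates are all assumptions made for illustration.

```python
import operator
from collections import defaultdict

# Supported predicate operators (an assumed, minimal set).
OPS = {"=": operator.eq, "<=": operator.le, ">=": operator.ge}

class TwoLevelIndex:
    def __init__(self):
        # First level: attribute. Second level: operator.
        # Each posting is (predicate value, subscription id, weight).
        self.postings = defaultdict(lambda: defaultdict(list))

    def subscribe(self, sub_id, predicates):
        for attr, op, value, weight in predicates:
            self.postings[attr][op].append((value, sub_id, weight))

    def match(self, publication):
        """Return {sub_id: summed weight of the predicates the publication satisfies}."""
        scores = defaultdict(float)
        for attr, pub_value in publication.items():
            for op, entries in self.postings.get(attr, {}).items():
                for value, sub_id, weight in entries:
                    if OPS[op](pub_value, value):
                        scores[sub_id] += weight
        return dict(scores)

index = TwoLevelIndex()
index.subscribe("s1", [("Carrier", "=", "AT&T", 0.4), ("Storage", "<=", 16, 0.7)])
index.subscribe("s2", [("Carrier", "=", "Verizon", 0.5), ("Storage", "<=", 32, 0.2),
                       ("Brand", "=", "HTC", 0.3)])
publication = {"Carrier": "Verizon", "Color": "White", "OS": "Android",
               "Storage": 16, "Brand": "HTC"}
print(index.match(publication))
```

Only the attributes present in the publication are visited, so subscriptions on unrelated attributes are never touched; partially matching subscriptions still receive a score.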
  • 14. Freshness. When the window becomes larger, older publications may prevent newer publications from entering the Top-k results. Lease relevancy scores? That would require re-calculating scores. Instead, use forward decay: fresh-relevancy score = relevancy score × freshness score.
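Forward decay measures age from a fixed landmark time rather than from the current time, so a publication's score can be computed once on arrival. The sketch below assumes an exponential decay function; `LANDMARK`, `DECAY_RATE` and the function names are hypothetical and not taken from the slides.

```python
import math

LANDMARK = 0.0    # fixed reference time L (assumed), no later than any arrival
DECAY_RATE = 0.1  # assumed decay rate

def fresh_relevancy(relevancy, arrival_time):
    """Static score: relevancy x g(arrival_time - L) with exponential g."""
    return relevancy * math.exp(DECAY_RATE * (arrival_time - LANDMARK))

def decayed_score(static_score, now):
    """Normalise by g(now - L); needed only when an absolute value is required."""
    return static_score / math.exp(DECAY_RATE * (now - LANDMARK))

old = fresh_relevancy(1.0, arrival_time=0.0)   # more relevant, but older
new = fresh_relevancy(0.8, arrival_time=10.0)  # less relevant, but fresher
print(new > old)
```

Because the normalising factor is the same for every publication at a given time, the ranking by static score stays valid as time advances and no stored score has to be recomputed.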
  • 15. Diversity: Top-k representative set. A method to retrieve Top-k publications from the matching publications. [Figure: representative Top-k; the drawback without diversity versus what we want with diversity.]
  • 16. MAX* k-diversity problem. Find $S^* = \operatorname*{argmax}_{S \subseteq P,\ |S| = k} f(S, d)$, where 1. $P = \{p_1, \dots, p_n\}$; 2. $k \le n$; 3. $d$: a distance metric; 4. $f$: a diversity function.
  • 17. Proposed: MAXDIVREL k-diversity problem. $S^* = \operatorname*{argmax}_{S \subseteq P} f(S, d, r) = \operatorname*{argmax}_{S \subseteq P} \frac{g(S, d, r)}{h(S, d, r)}$, where $g(S, d, r)$: maximize the dissimilarity and relevancy in $S$; $h(S, d, r)$: minimize the dissimilarity and relevancy in $P - S$; and 1. $P = \{p_1, \dots, p_n\}$; 2. $d$: a distance metric; 3. $r$: a relevance metric; 4. $f$: a diversity function.
  • 18. Formal definition: MAXDIVREL k-diversity. $\operatorname*{argmax}\ g(S, d, r) = \frac{1}{|S|} \sum_{p_i, p_j \in S} \frac{r(p_i)}{r(p_j)}\, d(p_i, p_j)$ such that independence holds; $\operatorname*{argmin}\ h(S, d, r) = \frac{1}{|P - S|} \sum_{p_i \in P - S,\ p_j \in S} \frac{r(p_i)}{r(p_j)}\, d(p_i, p_j)$ such that dominance holds; where 1. $P = \{p_1, \dots, p_n\}$; 2. $d$: a distance metric; 3. $r$: a relevance metric; 4. $\alpha > 0$. Independence condition: $\forall p_i, p_j \in S,\ d(p_i, p_j) > \alpha$. Dominance condition: $\forall p_i \in P,\ \exists p_j \in S$ s.t. $d(p_i, p_j) \le \alpha;\ i \ne j$.
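The independence and dominance conditions can be checked directly for a candidate set. The sketch below uses hypothetical helper names and one-dimensional points with absolute difference as the distance. It reads the dominance condition as applying to the publications outside S, which is an interpretation rather than something the slide states.

```python
from itertools import combinations

def is_independent(S, d, alpha):
    """Every pair of selected publications is more than alpha apart."""
    return all(d(p, q) > alpha for p, q in combinations(S, 2))

def is_dominating(P, S, d, alpha):
    """Every unselected publication lies within alpha of some selected one."""
    selected = set(S)
    return all(any(d(p, q) <= alpha for q in selected)
               for p in P if p not in selected)

def distance(a, b):
    return abs(a - b)

P = [0, 1, 5, 6]
print(is_independent([0, 5], distance, 1.5), is_dominating(P, [0, 5], distance, 1.5))
print(is_independent([0, 1, 5], distance, 1.5), is_dominating(P, [0, 1, 5], distance, 1.5))
```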
  • 19. NP-hardness: minimum independent dominating set. [Figure: publications p1 … p5 in the publication space are modeled as graph vertices v1 … v5 using the distance threshold α; of the four example vertex subsets, three are labeled "independent, dominating" and one is labeled "dominating, not independent".] Neighborhood: $N(p_i) = \{\, p_j \mid d(p_i, p_j) \le \alpha,\ p_i \ne p_j \,\}$.
  • 20. Naïve greedy: $\operatorname*{argmax}_{p_i} \dfrac{r(p_i)^2}{\sum_{p_j \in N(p_i)} r(p_j) \times d(p_i, p_j)}$
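One way to turn the greedy criterion into a procedure is sketched below. It rests on several assumptions that the slide does not confirm: the criterion is read as r(p_i)² divided by the sum of r(p_j) × d(p_i, p_j) over the neighbors of p_i, neighborhoods are computed over the publications not yet covered, and each selected publication removes itself and its neighbors from further consideration. The function name and the example data are hypothetical, and the sketch does not cap the result at k.

```python
import math

def greedy_diverse(P, r, d, alpha):
    """Pick the best-scoring publication, drop it and its neighbors, and repeat."""
    remaining = set(P)
    selected = []
    while remaining:
        def score(p):
            neighbours = [q for q in remaining if q != p and d(p, q) <= alpha]
            denom = sum(r(q) * d(p, q) for q in neighbours)
            # A publication with no weighted neighbours must represent itself.
            return math.inf if denom == 0 else r(p) ** 2 / denom
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        # Everything within alpha of the pick (including the pick) is now covered.
        remaining -= {q for q in remaining if d(best, q) <= alpha}
    return selected

relevance = {0: 1.0, 1: 2.0, 5: 1.0, 6: 3.0}
result = greedy_diverse(list(relevance), relevance.get, lambda a, b: abs(a - b), 1.5)
print(result)
```

Removing the neighbors of each pick keeps the selected set independent, and stopping only when nothing remains makes it dominating.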
  • 21. Handling streaming publications. [Figure: a new publication p6 arrives in the publication space of p1 … p5 and vertex v6 is added to the graph v1 … v5 built with threshold α.] Continuity requirements: 1. Durability: an item selected as diversified in the i-th window may still be selected in the (i+1)-th window if it has not expired and the other valid items in the (i+1)-th window fail to compete with it. 2. Order: the publication stream follows chronological order; we avoid selecting an item j as diverse later when we have already selected an item i that is not older than j.
  • 22. MAXDIVREL continuous k-diversity. [Figure: the matching publication stream P1, P2, P3, P4, …, Pj, Pj+1, … with the i-th and (i+1)-th windows; MAXDIVREL k-diversity yields $S_i^*$ and $S_{i+1}^*$ under the independence, dominance, durability and order requirements.] Straightforward solution: apply the naïve greedy method at each instance. Proposed: an incremental index mechanism that avoids the cost of re-calculating neighborhoods.
  • 23. Locality Sensitive Hashing (LSH). Simple idea: if two points are close together, then after a “projection” operation these two points will remain close together.
  • 24. LSH analysis. For any given points $p, q \in \mathbb{R}^d$: $\Pr_H[h(p) = h(q)] \ge P_1$ for $\lVert p - q \rVert \le d_1$, and $\Pr_H[h(p) = h(q)] \le P_2$ for $\lVert p - q \rVert \ge c\,d_1 = d_2$. The hash function $h$ is then $(d_1, d_2, P_1, P_2)$-sensitive. Ideally we need $(P_1 - P_2)$ to be large and the gap between $d_1$ and $d_2$ to be small.
  • 25. 25 LSH in MAXDIVREL: Publications as categorical data 1.Motivation 2.Target 3.Design & Architecture 4.Related work 5.Dynamic Diversification 6.Incremental Top-k 7.Implementation 8.Evaluation 9.Conclusion 10.Future Work
  • 26. LSH in MAXDIVREL: Characteristic Matrix
  • 27. LSH in MAXDIVREL: Minhashing
      - Publications are no longer compared directly; each one is represented by a signature.
      - Technique: randomly permute the rows of the characteristic matrix m times. For each permutation, take the number of the first row, in the permuted order, in which the publication's column has a 1.
      - Advantage: reduces the dimensions to a small minhash signature.
      [Figure: first permutation of the rows of the characteristic matrix]
  • 28. LSH in MAXDIVREL: Signature Matrix
      - Fast minhashing: select m random hash functions to model the effect of m random permutations.
      - This equivalence is mathematically proved only when the number of rows is a prime.
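The fast-minhashing step can be sketched as follows. This is a minimal illustration, not the thesis implementation (which was written in Java). It assumes hash functions of the form h(x) = (a·x + b) mod p with p prime and row numbers smaller than p. The names `make_hash_functions`, `minhash_signature` and `estimated_similarity` are hypothetical.

```python
import random

def make_hash_functions(m, prime, seed=0):
    """Draw m (a, b) pairs for h(x) = (a*x + b) % prime, with a != 0."""
    rng = random.Random(seed)
    return [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(m)]

def minhash_signature(rows, hash_functions, prime):
    """rows: non-empty set of row numbers (each < prime) where the column has a 1.
    Each hash function stands in for one row permutation; the minimum hashed
    row number plays the role of 'first row with a 1 in permuted order'."""
    return [min((a * r + b) % prime for r in rows) for a, b in hash_functions]

def estimated_similarity(sig_x, sig_y):
    """Fraction of signature positions that agree (estimates Jaccard similarity)."""
    matches = sum(1 for u, v in zip(sig_x, sig_y) if u == v)
    return matches / len(sig_x)
```

Because the number of rows is taken to be the prime p and a is non-zero, each hash function maps distinct rows to distinct values, which is why it can stand in for a permutation.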
  • 29. LSH in MAXDIVREL: LSH Buckets
      - Take r-sized signature vectors from the m-sized minhash signature.
      - Map them into L hash tables, each with an arbitrary number b of buckets.
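A hedged sketch of the bucket mapping: each m-sized signature is cut into consecutive r-sized vectors, and the t-th vector is hashed into one of b buckets of the t-th table. The function name `build_tables` and the use of Python's built-in `hash` as the bucket hash are assumptions made for illustration.

```python
from collections import defaultdict

def build_tables(signatures, r, b):
    """signatures: dict mapping publication id -> minhash signature (length m).
    Returns a list of L = m // r tables; each table maps bucket index -> ids."""
    m = len(next(iter(signatures.values())))
    num_tables = m // r
    tables = [defaultdict(list) for _ in range(num_tables)]
    for pub_id, signature in signatures.items():
        for t in range(num_tables):
            # r-sized slice of the signature assigned to table t
            band = tuple(signature[t * r:(t + 1) * r])
            tables[t][hash(band) % b].append(pub_id)
    return tables
```

Two publications become neighbors in a table whenever their r-sized vectors for that table hash to the same bucket.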
  • 30. LSH in MAXDIVREL: How to select L, r?
      - For two vectors x, y: $JD(x, y) = 1 - JSIM(x, y)$, where $JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$
      1. $L \times r = m$
      2. Similarity threshold $s \approx (1/L)^{1/r}$
  • 31. LSH in MAXDIVREL: Analysis
      - For two vectors x, y: $JD(x, y) = 1 - JSIM(x, y)$, where $JSIM(x, y) = \frac{|x \cap y|}{|x \cup y|}$
      - For publications x and y: $JSIM(x, y) \propto \mathrm{Prob}[H(x) = H(y)]$
      - At a particular hash table, where r signature components are hashed together:
        x and y map into the same bucket: $JSIM(x, y)^r$
        x and y do not map into the same bucket: $1 - JSIM(x, y)^r$
      - Across L hash tables:
        x and y never map into the same bucket: $(1 - JSIM(x, y)^r)^L$
        x and y map into the same bucket at least once: $1 - (1 - JSIM(x, y)^r)^L$
      - True near neighbors are unlikely to be unlucky in all the projections.
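The collision analysis and the threshold rule from the previous slide can be checked numerically. This sketch assumes r signature components per table and L tables. The function names are hypothetical.

```python
def collision_probability(s, r, L):
    """Probability that two publications with Jaccard similarity s share
    a bucket in at least one of L tables: 1 - (1 - s^r)^L."""
    return 1.0 - (1.0 - s ** r) ** L

def approximate_threshold(r, L):
    """Similarity at which the collision curve rises steeply: (1/L)^(1/r)."""
    return (1.0 / L) ** (1.0 / r)
```

For example, with m = 100 minhash functions split as L = 20 and r = 5, `approximate_threshold(5, 20)` is roughly 0.55. Pairs well above that similarity almost always collide in some table, while pairs well below it rarely do.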
  • 32. LSH in MAXDIVREL: Batch-wise Top-k computation
      - Bucket "winner": the publication with the highest relevancy score in the bucket. The winner is dominant, so it represents its bucket neighborhood.
      - Top-k: the k "winners" that collect the most votes. The k winners are independent.
      [Figure: publications PA to PH in the i-th window]
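A minimal sketch of the batch-wise step, assuming each table maps a bucket index to the list of publication ids in that bucket and that every bucket casts one vote for its winner. Breaking vote ties by relevancy is an assumption, and `batch_top_k` is a hypothetical name.

```python
from collections import Counter

def batch_top_k(tables, relevance, k):
    """tables: list of dicts, bucket index -> list of publication ids.
    relevance: dict, publication id -> relevancy score."""
    votes = Counter()
    for table in tables:
        for bucket in table.values():
            if bucket:
                # the bucket winner is its most relevant publication
                winner = max(bucket, key=lambda p: relevance[p])
                votes[winner] += 1
    # most votes first; ties broken by higher relevancy (assumed rule)
    ranked = sorted(votes, key=lambda p: (-votes[p], -relevance[p]))
    return ranked[:k]
```

Every bucket in every table is rescanned on each call, so this corresponds to recomputing the result from scratch for each window.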
  • 33. LSH in MAXDIVREL: Incremental Top-k computation
      1. New publication i arrives.
      2. Update the i-th characteristic vector (characteristic matrix).
      3. Generate the i-th minhash signature (signature matrix).
      4. Map the i-th signature into the L hash tables.
      5. Update the "winner" at each bucket the i-th signature maps into.
      6. Vote for the Top-k candidates.
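The incremental pipeline can be sketched as a small class that touches only the L buckets a new signature maps into. This is an illustration under stated assumptions: window expiry is omitted, an existing winner is replaced only by a strictly more relevant publication, and the class name `IncrementalLshTopK` is hypothetical.

```python
from collections import Counter

class IncrementalLshTopK:
    def __init__(self, r, b):
        self.r = r                # signature components per table
        self.b = b                # buckets per table
        self.winners = {}         # (table, bucket) -> winning publication id
        self.relevance = {}       # publication id -> relevancy score
        self.votes = Counter()    # publication id -> number of buckets won

    def _buckets(self, signature):
        starts = range(0, len(signature) - self.r + 1, self.r)
        for table, start in enumerate(starts):
            band = tuple(signature[start:start + self.r])
            yield table, hash(band) % self.b

    def insert(self, pub_id, signature, relevance):
        """Only the buckets this signature maps into are updated."""
        self.relevance[pub_id] = relevance
        for key in self._buckets(signature):
            current = self.winners.get(key)
            if current is None or relevance > self.relevance[current]:
                if current is not None:
                    self.votes[current] -= 1
                self.winners[key] = pub_id
                self.votes[pub_id] += 1

    def top_k(self, k):
        candidates = [p for p, v in self.votes.items() if v > 0]
        candidates.sort(key=lambda p: (-self.votes[p], -self.relevance[p]))
        return candidates[:k]
```

Each arrival costs work proportional to L rather than to the window size, since no neighborhood outside the affected buckets is recomputed.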
  • 34. LSH in MAXDIVREL: When new publication F arrives…
      - Only buckets B13, B23, B32, B43 will vote.
      - The continuity requirements are followed: durability and order.
      [Figure: publications PA to PH across the i-th and (i+1)-th windows]
  • 35. Implementation
  • 36. Cloud service modules
      [Figures: service diagrams. Source: Amazon Kinesis; Source: Amazon ElastiCache]
  • 37. Evaluation: Dataset
      - Publication stream: Amazon online marketplace data available from 17th to 19th November 2014.
      - Zipfian subscriptions: $\mathrm{zipf}(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} 1/n^s}$, where N is the number of elements in the distribution, k is the rank of the element and s is the value of the exponent.
      - Normalized preferences.
      - Total number of subscriber views $= \sum_{i=2}^{32} \left[ \binom{48}{i} + \binom{42}{i} + \binom{54}{i} + \binom{66}{i} + \binom{57}{i} + \binom{67}{i} \right]$
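The Zipf probability used for the subscriptions can be written directly from its definition. This is a small sketch, and `zipf_pmf` is a hypothetical name.

```python
def zipf_pmf(k, s, n):
    """Probability of the element with rank k (1-based) among n elements,
    for Zipf exponent s: (1 / k^s) / sum_{i=1..n} (1 / i^s)."""
    normalizer = sum(1.0 / (i ** s) for i in range(1, n + 1))
    return (1.0 / (k ** s)) / normalizer
```

With s = 1 and n = 3, the three ranks receive probabilities 6/11, 3/11 and 2/11, so the mass concentrates on the top ranks and grows more skewed as s increases.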
  • 38. Evaluation: Methodology
      - Subscriber effectiveness: quality, accuracy, resiliency, freshness.
      - Performance & efficiency: index construction time, Top-k matching time.
      - Platform: Amazon AWS, Linux-based micro-node instances, each with 2.3 GHz and 8 GB memory. Algorithms are implemented in Java.
  • 39. Subscriber Effectiveness: Quality or natural behavior
      - Testing the Zipf (power law) hypothesis on the distribution of ranked results (KS test): i. fitting the power law, ii. goodness-of-fit tests, iii. alternative distributions.
      - Compute 19030 ranked distributions over a 100K publication stream, under different subscriber views and under different sized sliding window instances.
      - $\mathrm{zipf}(k; s, N) = \frac{1/k^s}{\sum_{n=1}^{N} 1/n^s}$, where N is the number of elements in the distribution, k is the rank of the element and s is the value of the exponent.
      [Figure: sample distribution of ranked votes; log zipf_prob(rank) against log(rank)]
  • 40. Subscriber Effectiveness: i. Fitting power law
      [Figure: Zipf exponent values]
  • 41. Subscriber Effectiveness: i. Fitting power law
      [Figure: illustration of the convergence of Zipf exponent values]
  • 42. Subscriber Effectiveness: i. Fitting power law
      [Figure: Zipf exponent values under different similarity thresholds]
  • 43. Subscriber Effectiveness: ii. Goodness of fit tests
      - $\gamma_1 = \max_{x \ge x_{min}} |f(x) - g(x)|$, where $f(x)$ is the observed rank CDF and $g(x)$ is the perfectly fitted CDF.
      - $p\text{-value} = \frac{\text{number of } \gamma_i \text{ where } \gamma_i > \gamma_1}{i}$, with i = 1000 synthetic Zipf datasets.
      [Figure: p-values of the KS test under different subscriber views]
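The goodness-of-fit procedure can be sketched as below. It is a simplified illustration: it takes $x_{min} = 1$, keeps the exponent s fixed instead of refitting it for each synthetic dataset, and uses hypothetical names (`zipf_cdf`, `ks_distance`, `ks_p_value`).

```python
import random

def zipf_cdf(s, n):
    """Cumulative Zipf probabilities for ranks 1..n."""
    weights = [1.0 / (k ** s) for k in range(1, n + 1)]
    total = sum(weights)
    cdf, running = [], 0.0
    for w in weights:
        running += w
        cdf.append(running / total)
    return cdf

def ks_distance(counts, s):
    """counts[k] is the observed frequency of rank k + 1.
    Returns the maximum gap between the observed and fitted CDFs."""
    total = sum(counts)
    model = zipf_cdf(s, len(counts))
    running, worst = 0, 0.0
    for k, count in enumerate(counts):
        running += count
        worst = max(worst, abs(running / total - model[k]))
    return worst

def ks_p_value(counts, s, trials=1000, seed=0):
    """Fraction of synthetic Zipf datasets whose KS distance exceeds
    the distance of the observed data."""
    rng = random.Random(seed)
    n, total = len(counts), sum(counts)
    weights = [1.0 / (k ** s) for k in range(1, n + 1)]
    gamma_1 = ks_distance(counts, s)
    exceed = 0
    for _ in range(trials):
        synthetic = [0] * n
        for rank in rng.choices(range(n), weights=weights, k=total):
            synthetic[rank] += 1
        if ks_distance(synthetic, s) > gamma_1:
            exceed += 1
    return exceed / trials
```

A large p-value means the observed ranks are at least as close to the fitted Zipf law as typical synthetic Zipf samples, so the hypothesis is not rejected.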
  • 44. Subscriber Effectiveness: iii. Testing alternative distributions
  • 45. Subscriber Effectiveness: Other diversity based methods
      - P-dispersion problem: MAXMIN, MAXSUM.
      - Minimum independent-dominating set problem: MAXDIVREL, DisC.
      - For an even comparison, relevancy is combined into every diversity method to achieve a bi-criteria objective.
      [Figure: average Zipf law exponent in comparison with other methods]
  • 46. Subscriber Effectiveness: Other diversity based methods
      - P-dispersion problem: MAXMIN, MAXSUM.
      - Minimum independent-dominating set problem: MAXDIVREL, DisC.
      [Figure: a comparison of the average Zipf law exponent with other methods]
  • 47. Subscriber Effectiveness: Accuracy of Top-k results
      - LSH index vs. NAÏVE: rank probability and diversity probability.
      [Figures: accuracy at similarity threshold = 0.55; accuracy at similarity threshold = 0.85]
  • 48. Subscriber Effectiveness: Resiliency of Top-k results
      [Figures: getting Top-k publications (unordered); getting Top-k publications (ordered)]
  • 49. Performance
      [Figures: subscription index update time on opIndex vs. modified opIndex; index construction time on opIndex vs. modified opIndex]
  • 50. Efficiency: Initial matching time at modified opIndex
      [Figures: initial matching time under different sizes of subscription space; initial matching time under different numbers of publications]
  • 51. Performance & Efficiency: LSH Index
      - How much accuracy do we sacrifice by comparing small minhash signatures?
      - Number of minhash functions: $m = \frac{1}{(\text{estimated error})^2}$
      [Figure: BLSH index construction + update time for different numbers of minhash functions]
  • 52. Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE
      [Figure: publication stream P1 to P8; four window instances labelled "BLSH or NAÏVE" and one span labelled "ILSH"]
  • 53. Performance & Efficiency: BLSH vs. NAÏVE
      [Figures: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]
  • 54. Performance & Efficiency: ILSH vs. BLSH vs. NAÏVE
      [Figures: log(ranking time) against number of publications with D = 250; log(ranking time) against number of publications with D = 500]
  • 55. Conclusions
      - The diversified results produced by MAXDIVREL, which is based on the independent-dominating set problem, exhibit stronger natural behavior than methods based on the p-dispersion problem.
      - Relevancy is an important factor to employ in distance-based diversity methods.
      - MAXDIVREL always tends to produce a diverse set of personalized results: absolute ranks are sensitive to the preference value, while the deviation among relative ranks stays small.
  • 56. Conclusions (Ctd.)
      - The Locality Sensitive Hashing (LSH) indexing method produces the MAXDIVREL diverse set of results at an average 70% accuracy relative to the naïve method, and reduces the matching time significantly compared with the NAÏVE method.
      - Its incremental version refines it further for handling streaming publications and avoids the curse of re-computing neighborhoods.
      - No fixed k is needed to restrict the delivery of top publications: given a window size and a delivery method, the model can produce the best diverse set of personalized results to represent the set of all matching publications at a given instance.
  • 57. Major Contributions
      - Dynamic diversification method based on the independent-dominating set problem: we introduced a novel diversity definition based on representative neighborhoods, called MAXDIVREL k-diversity, which employs relevancy.
      - Index-based diversification approach to rank results incrementally: we proposed a novel hashing-based index approach, built on the Locality Sensitive Hashing (LSH) technique, to solve the MAXDIVREL continuous k-diversity problem.
      - Advanced evaluation method to measure the quality of diverse results: the first significant attempt to model the natural behavior of diversity methods in the pub/sub community.
  • 58. Future work
      - Explore other suitable use cases for the proposed model and develop prototype applications, e.g. a personalized newspaper for every Facebook user, a diverse set of personalized Twitter trends, or social annotation of news stories.
      - Exploit the overlap among the diversified results of users who have similar interests.
      - Employ existing implicit methods to extract human preferences, e.g. click stream analytics.
      - Develop the LSH-based index for a multi-threaded distributed environment.
  • 60. Appendix: Freshness
      - Mean delay between publications = 5000 ms.
      [Figure: a comparison of relevancy scores after being influenced by freshness]
  • 61. Appendix: NAÏVE ranking time
      [Figure: average naïve Top-k matching time against the size D of publications]
  • 62. Appendix: BLSH ranking time
      [Figure: average BLSH Top-k matching time against the size D of publications]
  • 63. Appendix: ILSH ranking time
      [Figure: average ILSH Top-k matching time against the size D of publications]

Editor's Notes

  1. to overcome the drawbacks identified in traditional pub/sub systems