SlideShare ist ein Scribd-Unternehmen logo
1 von 22
STRIPStream Learning of Influence
Probabilities
Konstantin Kutzkov – IT University of Copenhagen, Denmark
Albert Bifet, Francesco Bonchi – Yahoo! Labs, Barcelona
Aristides Gionis – Aalto University and HIIT, Finland
1
KDD 2013
Problem: Learning Influence
Probabilities
• Learning influence strength along links in
social networks
• Real-time interactions
– Small amount of time
and memory
– Only one pass
The general
Propagation log
Social graph
Learnprobabilities
Data Stream Learning
• Sequence is potentially infinite
• High amount of data, high speed of arrival
• Change over time
• Process elements from a data stream in only
one pass
• Approximation algorithms
– Small error rate with high probability
Social graph and log of propagations
We have 2 pieces of input data:
(1) social graph and (2) a log of past propagations
u12
u45
u45 follows u12 – arc u12→ u45
I liked this
movie
I read this
article
great
movie 09:30
09:00
Action Node Time
a u12 1
a u45 2
a u32 3
a u76 8
b u32 1
b u45 3
b u98 7
Learning Influence Probabilities
• STRIP
– Streaming methods
• approximate solutions
– Superlinear space
– Linear space
– Sublinear space
• Landmark and sliding window
Independent Casca
Every arc (u,v) has associated the prob
Time proceeds in di
At time t, nodes that became active at
neighbors, and succeed
15
.1
.1
.1
.1
.1
.2
.2
.2
.2
.3
.3
.3
.3
.4
.4
.4
.4
.4
.1
.1
b
a c
f
e
d
g
h
i
The problem
• Given a social network with nodes being the users
and directed edges the social relations among them.
• For two users u and v connected by an edge (u, v), let
pu,v ϵ [0,1] be the influence of u on v.
• In the Independent Cascade Model the influence is
the probability that an action propagates from user u
to user v.
• The problem of finding the influential users has been
widely studied in viral marketing. (Domingos and
Richardson KDD’01, Kempe et al. KDD’03)
6
But how do we know the influence
probabilities?
• It is usually assumed the influence probabilities are known in
advance.
• However, in real-life (social) networks the relations and influence
probabilities are not known in advance.
• The problem of learning the probabilities was first considered in
Goyal et al. WSDM’10.
• In particular the following definition was shown to give a good
prediction of propagation:
Let Au2v(t) be the number of actions that propagated from
user u to user v within time t and Au|v – the total number of
actions performed by either u or v. Then
pu,v = Au2v(t)/Au|v
7
First solution. Store the whole social
graph in memory.
• Count exactly for each user u how many actions
propagate to its neighbor v within time
t, i.e., compute exactly Au2v(t). (Means we have to
store all actions performed at most t time units ago.)
• Estimate the total number of actions performed by
either u or v, i.e., estimate Au|v. (Set size estimation in
a streaming setting, many efficient algorithms.)
9
10
Experiments with Twitter dataset.
“+” Very good approximation, storing the network in RAM does not
dominate the space usage (real graphs are sparse), for small time
threshold t good space usage.
“-” Slow processing time per action (one has to check all
neighbors), bad space usage for large t.
Experiments with Twitter Dataset
Second solution. (Not storing the whole social
graph in main memory.)
Lower bound:
11
• Intuitively, u and v can perform just one action and
have influence probability 1.
• We thus need to somehow store information for each
user.
• Formally, using a reduction from the Set Disjointness
problem we show the following result:
Let the number of users be n. Any streaming algorithm which
distinguishes with probability at least 2/3 between the cases
(i) there exists an edge with influence probability 1/2 (ii) all
edges have influence probability at most 1/3, needs Ω(n) bits.
Second solution. (Not storing the whole social
graph in main memory.)
• The definition pu,v = Au2v(t)/Au|v is almost identical to
Jaccard coefficient for estimating the similarity
between sets A and B.
• Jaccard(A, B) = |A∩B|/|A U B|
• Min-wise independent hashing (Broder et al., 2000)
has become the de facto standard technique for
estimating the Jaccard similarity in data streams.
12
Min-wise independent hashing. Example.
Similarity in market baskets among supermarket customers.
13
Alice Bob
14
Min-wise independent hashing. Example.
Let h: Items  [0,1] be a random hash function.
Find the 4 items bought by either Alice or Bob with the smallest
hash values:
i) find the 4 smallest hash values for Alice, and the 4 for Bob.
ii) Aggregate the results.
iii) Estimate the similarity.
15
Min-wise independent hashing. Example.
Let h: Items  [0,1] be a random hash function.
Find the 4 items bought by either Alice or Bob with the smallest
hash values:
i) find the 4 smallest hash values for Alice, and the 4 for Bob.
ii) Aggregate the results.
iii) Estimate the similarity.
Alice Bob
h( ) = 0.06
h( ) = 0.09
h( ) = 0.1
h( ) = 0.12
h( ) = 0.03
h( ) = 0.07
h( ) = 0.09
h( ) = 0.096
Min-wise independent hashing. Example.
16
Let h: Items  [0,1] be a random hash function.
Find the 4 items bought by either Alice or Bob with the smallest
hash values:
i) find the 4 smallest hash values for Alice, and the 4 for Bob.
ii) Aggregate the results.
iii) Estimate the similarity.
Alice or Bob:
h( ) = 0.03 h( ) = 0.07 h( ) = 0.09h( ) = 0.06
Min-wise independent hashing. Example.
17
Let h: Items  [0,1] be a random hash function.
Find the 4 items bought by either Alice or Bob with the smallest
hash values:
i) find the 4 smallest hash values for Alice, and the 4 for Bob.
ii) Aggregate the results.
iii) Estimate the similarity.
Among the 4 items with the smallest hash
values, only is bought by Alice and
Bob, thus the estimated similarity is 1/4.
Second solution. Min-wise independent
hashing.
• Store the k actions with the smallest hash values for
each user.
• After processing the stream, for an edge (u, v) we
count the number of actions among the k actions
with the smallest hash values that propagated from u
to v within time t.
18
For all edges (u, v) with constant influence
probability pu,v, we can obtain a constant factor
approximation with probability 2/3 using O(n log n)
bits and constant processing time per action.
19
Experiments with Twitter dataset.
“+” Good space usage, fast processing time, reasonably
good approximation.
“-” One achieves really good space savings for large streams
(average number of actions per user is a few thousands).
Experiments with Twitter Dataset
Third solution.
(Estimate influence only for active users.)
• Often, the most interesting users are the most active
users.
• By sampling only actions a with h(a)<α, for some α =
α(k) ϵ [0,1], we can collect the min-wise samples for
the k most active users.
• Thus, we can estimate influence probabilities pu,v for
users u and v who are among the k most active users.
• Space efficient for small k and highly skewed user
activity.
20
Extending the algorithms to sliding
windows.
• Influence probabilities may change over time.
• Streaming Sliding Window: Exponential histograms
technique for bit-counting in data streams (Datar et
al. SIAMCOMP, 2002)
• Essentially, we obtain the same results, at the price
of multiplicative polylog-factors in the time and
space complexity.
21
Conclusions and future directions.
• The first algorithms for learning influence
probabilities in data streams.
• We performed only a high-level evaluation of our
algorithms, confirming the theoretical findings.
• We plan to extend the algorithms to handle adaptive
size windows such that the user does not have to
decide in advance the optimal window size.
• The min-wise independent hashing approach can be
distributed such that each processor processes only a
subset of the users.
22
STRIP: Stream Learning of Influence
Probabilities
Thank you!
23

Weitere ähnliche Inhalte

Was ist angesagt?

MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 Albert Bifet
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsAlbert Bifet
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsAlbert Bifet
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsAlbert Bifet
 
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsMining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsAlbert Bifet
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream miningAlbert Bifet
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataAlbert Bifet
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAAlbert Bifet
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themAlbert Bifet
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming AlgorithmsJoe Kelley
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017StampedeCon
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16MLconf
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filterxlight
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker DiarizationHONGJOO LEE
 
Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)Aijun Zhang
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithmsSandeep Joshi
 
Bloom filter
Bloom filterBloom filter
Bloom filterwang ping
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Kira
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEHONGJOO LEE
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Andrii Gakhov
 

Was ist angesagt? (20)

MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016 MOA for the IoT at ACML 2016
MOA for the IoT at ACML 2016
 
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data StreamsFast Perceptron Decision Tree Learning from Evolving Data Streams
Fast Perceptron Decision Tree Learning from Evolving Data Streams
 
Mining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data StreamsMining Frequent Closed Graphs on Evolving Data Streams
Mining Frequent Closed Graphs on Evolving Data Streams
 
Moa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data StreamsMoa: Real Time Analytics for Data Streams
Moa: Real Time Analytics for Data Streams
 
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data StreamsMining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
Mining Adaptively Frequent Closed Unlabeled Rooted Trees in Data Streams
 
Artificial intelligence and data stream mining
Artificial intelligence and data stream miningArtificial intelligence and data stream mining
Artificial intelligence and data stream mining
 
Sentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming DataSentiment Knowledge Discovery in Twitter Streaming Data
Sentiment Knowledge Discovery in Twitter Streaming Data
 
Mining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOAMining Big Data Streams with APACHE SAMOA
Mining Big Data Streams with APACHE SAMOA
 
Pitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid themPitfalls in benchmarking data stream classification and how to avoid them
Pitfalls in benchmarking data stream classification and how to avoid them
 
Streaming Algorithms
Streaming AlgorithmsStreaming Algorithms
Streaming Algorithms
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
Sergei Vassilvitskii, Research Scientist, Google at MLconf NYC - 4/15/16
 
New zealand bloom filter
New zealand bloom filterNew zealand bloom filter
New zealand bloom filter
 
Speaker Diarization
Speaker DiarizationSpeaker Diarization
Speaker Diarization
 
Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)Dual-time Modeling and Forecasting in Consumer Banking (2016)
Dual-time Modeling and Forecasting in Consumer Banking (2016)
 
Data streaming algorithms
Data streaming algorithmsData streaming algorithms
Data streaming algorithms
 
Bloom filter
Bloom filterBloom filter
Bloom filter
 
Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)Tutorial 9 (bloom filters)
Tutorial 9 (bloom filters)
 
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOMEEuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
EuroPython 2017 - PyData - Deep Learning your Broadband Network @ HOME
 
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
Exceeding Classical: Probabilistic Data Structures in Data Intensive Applicat...
 

Ähnlich wie STRIPStream Learning of Influence Probabilities

The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingUniversity of Washington
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network Cigdem Aslay
 
Lecture_2_Stats.pdf
Lecture_2_Stats.pdfLecture_2_Stats.pdf
Lecture_2_Stats.pdfpaijitk
 
Introduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theoryIntroduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theoryAbner Chih Yi Huang
 
Artificial Agents Without Ontological Access to Reality
Artificial Agents Without Ontological Access to RealityArtificial Agents Without Ontological Access to Reality
Artificial Agents Without Ontological Access to Realityogeorgeon
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept Miha Ahronovitz
 
Information theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloudInformation theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloudMOVING Project
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data StreamsSujaAldrin
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningSujit Pal
 
A Novel Target Marketing Approach based on Influence Maximization
A Novel Target Marketing Approach based on Influence MaximizationA Novel Target Marketing Approach based on Influence Maximization
A Novel Target Marketing Approach based on Influence MaximizationSurendra Gadwal
 
HOP-Rec_RecSys18
HOP-Rec_RecSys18HOP-Rec_RecSys18
HOP-Rec_RecSys18Matt Yang
 
Influence-based Network-oblivious - ICDM 2013
Influence-based Network-oblivious - ICDM 2013Influence-based Network-oblivious - ICDM 2013
Influence-based Network-oblivious - ICDM 2013Nicola Barbieri
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsStavros Kontopoulos
 
Mining data streams
Mining data streamsMining data streams
Mining data streamsAkash Gupta
 
Simplicial closure and higher-order link prediction
Simplicial closure and higher-order link predictionSimplicial closure and higher-order link prediction
Simplicial closure and higher-order link predictionAustin Benson
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraJason Riedy
 
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Olivier Jeunen
 

Ähnlich wie STRIPStream Learning of Influence Probabilities (20)

Slides barcelona risk data
Slides barcelona risk dataSlides barcelona risk data
Slides barcelona risk data
 
The Other HPC: High Productivity Computing
The Other HPC: High Productivity ComputingThe Other HPC: High Productivity Computing
The Other HPC: High Productivity Computing
 
Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network	Maximizing the Diversity of Exposure in a Social Network
Maximizing the Diversity of Exposure in a Social Network
 
Lecture_2_Stats.pdf
Lecture_2_Stats.pdfLecture_2_Stats.pdf
Lecture_2_Stats.pdf
 
Introduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theoryIntroduction to algorithmic aspect of auction theory
Introduction to algorithmic aspect of auction theory
 
Artificial Agents Without Ontological Access to Reality
Artificial Agents Without Ontological Access to RealityArtificial Agents Without Ontological Access to Reality
Artificial Agents Without Ontological Access to Reality
 
Dynamic Data Center concept
Dynamic Data Center concept  Dynamic Data Center concept
Dynamic Data Center concept
 
Information theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloudInformation theoritic analysis of entity dynamics on the linked open data cloud
Information theoritic analysis of entity dynamics on the linked open data cloud
 
Mining Data Streams
Mining Data StreamsMining Data Streams
Mining Data Streams
 
Artificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep LearningArtificial Intelligence, Machine Learning and Deep Learning
Artificial Intelligence, Machine Learning and Deep Learning
 
A Novel Target Marketing Approach based on Influence Maximization
A Novel Target Marketing Approach based on Influence MaximizationA Novel Target Marketing Approach based on Influence Maximization
A Novel Target Marketing Approach based on Influence Maximization
 
HOP-Rec_RecSys18
HOP-Rec_RecSys18HOP-Rec_RecSys18
HOP-Rec_RecSys18
 
Influence-based Network-oblivious - ICDM 2013
Influence-based Network-oblivious - ICDM 2013Influence-based Network-oblivious - ICDM 2013
Influence-based Network-oblivious - ICDM 2013
 
Online machine learning in Streaming Applications
Online machine learning in Streaming ApplicationsOnline machine learning in Streaming Applications
Online machine learning in Streaming Applications
 
Mining data streams
Mining data streamsMining data streams
Mining data streams
 
Simplicial closure and higher-order link prediction
Simplicial closure and higher-order link predictionSimplicial closure and higher-order link prediction
Simplicial closure and higher-order link prediction
 
Graph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear AlgebraGraph Analysis Beyond Linear Algebra
Graph Analysis Beyond Linear Algebra
 
Big Data Challenges and Solutions
Big Data Challenges and SolutionsBig Data Challenges and Solutions
Big Data Challenges and Solutions
 
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
Efficient Similarity Computation for Collaborative Filtering in Dynamic Envir...
 
Sensors1(1)
Sensors1(1)Sensors1(1)
Sensors1(1)
 

Mehr von Albert Bifet

Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkAlbert Bifet
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big DataAlbert Bifet
 
Multi-label Classification with Meta-labels
Multi-label Classification with Meta-labelsMulti-label Classification with Meta-labels
Multi-label Classification with Meta-labelsAlbert Bifet
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real TimeAlbert Bifet
 
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsPAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsAlbert Bifet
 
MOA : Massive Online Analysis
MOA : Massive Online AnalysisMOA : Massive Online Analysis
MOA : Massive Online AnalysisAlbert Bifet
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streamsAlbert Bifet
 
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Albert Bifet
 
Adaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data StreamsAdaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data StreamsAlbert Bifet
 
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAdaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAlbert Bifet
 
Mining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed TreesMining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed TreesAlbert Bifet
 
Kalman Filters and Adaptive Windows for Learning in Data Streams
Kalman Filters and Adaptive Windows for Learning in Data StreamsKalman Filters and Adaptive Windows for Learning in Data Streams
Kalman Filters and Adaptive Windows for Learning in Data StreamsAlbert Bifet
 

Mehr von Albert Bifet (13)

Apache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache FlinkApache Samoa: Mining Big Data Streams with Apache Flink
Apache Samoa: Mining Big Data Streams with Apache Flink
 
Introduction to Big Data
Introduction to Big DataIntroduction to Big Data
Introduction to Big Data
 
Multi-label Classification with Meta-labels
Multi-label Classification with Meta-labelsMulti-label Classification with Meta-labels
Multi-label Classification with Meta-labels
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
Mining Big Data in Real Time
Mining Big Data in Real TimeMining Big Data in Real Time
Mining Big Data in Real Time
 
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and SolutionsPAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
PAKDD 2011 TUTORIAL Handling Concept Drift: Importance, Challenges and Solutions
 
MOA : Massive Online Analysis
MOA : Massive Online AnalysisMOA : Massive Online Analysis
MOA : Massive Online Analysis
 
New ensemble methods for evolving data streams
New ensemble methods for evolving data streamsNew ensemble methods for evolving data streams
New ensemble methods for evolving data streams
 
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.
 
Adaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data StreamsAdaptive XML Tree Mining on Evolving Data Streams
Adaptive XML Tree Mining on Evolving Data Streams
 
Adaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent PatternsAdaptive Learning and Mining for Data Streams and Frequent Patterns
Adaptive Learning and Mining for Data Streams and Frequent Patterns
 
Mining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed TreesMining Implications from Lattices of Closed Trees
Mining Implications from Lattices of Closed Trees
 
Kalman Filters and Adaptive Windows for Learning in Data Streams
Kalman Filters and Adaptive Windows for Learning in Data StreamsKalman Filters and Adaptive Windows for Learning in Data Streams
Kalman Filters and Adaptive Windows for Learning in Data Streams
 

Kürzlich hochgeladen

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 

Kürzlich hochgeladen (20)

Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 

STRIPStream Learning of Influence Probabilities

  • 1. STRIPStream Learning of Influence Probabilities Konstantin Kutzkov – IT University of Copenhagen, Denmark Albert Bifet, Francesco Bonchi – Yahoo! Labs, Barcelona Aristides Gionis – Aalto University and HIIT, Finland 1 KDD 2013
  • 2. Problem: Learning Influence Probabilities • Learning influence strength along links in social networks • Real-time interactions – Small amount of time and memory – Only one pass The general Propagation log Social graph Learnprobabilities
  • 3. Data Stream Learning • Sequence is potentially infinite • High amount of data, high speed of arrival • Change over time • Process elements from a data stream in only one pass • Approximation algorithms – Small error rate with high probability
  • 4. Social graph and log of propagations We have 2 pieces of input data: (1) social graph and (2) a log of past propagations u12 u45 u45 follows u12 – arc u12→ u45 I liked this movie I read this article great movie 09:30 09:00 Action Node Time a u12 1 a u45 2 a u32 3 a u76 8 b u32 1 b u45 3 b u98 7
  • 5. Learning Influence Probabilities • STRIP – Streaming methods • approximate solutions – Superlinear space – Linear space – Sublinear space • Landmark and sliding window Independent Casca Every arc (u,v) has associated the prob Time proceeds in di At time t, nodes that became active at neighbors, and succeed 15 .1 .1 .1 .1 .1 .2 .2 .2 .2 .3 .3 .3 .3 .4 .4 .4 .4 .4 .1 .1 b a c f e d g h i
  • 6. The problem • Given a social network with nodes being the users and directed edges the social relations among them. • For two users u and v connected by an edge (u, v), let pu,v ϵ [0,1] be the influence of u on v. • In the Independent Cascade Model the influence is the probability that an action propagates from user u to user v. • The problem of finding the influential users has been widely studied in viral marketing. (Domingos and Richardson KDD’01, Kempe et al. KDD’03) 6
  • 7. But how do we know the influence probabilities? • It is usually assumed the influence probabilities are known in advance. • However, in real-life (social) networks the relations and influence probabilities are not known in advance. • The problem of learning the probabilities was first considered in Goyal et al. WSDM’10. • In particular the following definition was shown to give a good prediction of propagation: Let Au2v(t) be the number of actions that propagated from user u to user v within time t and Au|v – the total number of actions performed by either u or v. Then pu,v = Au2v(t)/Au|v 7
  • 8. First solution. Store the whole social graph in memory. • Count exactly for each user u how many actions propagate to its neighbor v within time t, i.e., compute exactly Au2v(t). (Means we have to store all actions performed at most t time units ago.) • Estimate the total number of actions performed by either u or v, i.e., estimate Au|v. (Set size estimation in a streaming setting, many efficient algorithms.) 9
  • 9. 10 Experiments with Twitter dataset. “+” Very good approximation, storing the network in RAM does not dominate the space usage (real graphs are sparse), for small time threshold t good space usage. “-” Slow processing time per action (one has to check all neighbors), bad space usage for large t. Experiments with Twitter Dataset
  • 10. Second solution. (Not storing the whole social graph in main memory.) Lower bound: 11 • Intuitively, u and v can perform just one action and have influence probability 1. • We thus need to somehow store information for each user. • Formally, using a reduction from the Set Disjointness problem we show the following result: Let the number of users be n. Any streaming algorithm which distinguishes with probability at least 2/3 between the cases (i) there exists an edge with influence probability 1/2 (ii) all edges have influence probability at most 1/3, needs Ω(n) bits.
  • 11. Second solution. (Not storing the whole social graph in main memory.) • The definition pu,v = Au2v(t)/Au|v is almost identical to Jaccard coefficient for estimating the similarity between sets A and B. • Jaccard(A, B) = |A∩B|/|A U B| • Min-wise independent hashing (Broder et al., 2000) has become the de facto standard technique for estimating the Jaccard similarity in data streams. 12
  • 12. Min-wise independent hashing. Example. Similarity in market baskets among supermarket customers. 13 Alice Bob
  • 13. 14 Min-wise independent hashing. Example. Let h: Items  [0,1] be a random hash function. Find the 4 items bought by either Alice or Bob with the smallest hash values: i) find the 4 smallest hash values for Alice, and the 4 for Bob. ii) Aggregate the results. iii) Estimate the similarity.
  • 14. 15 Min-wise independent hashing. Example. Let h: Items  [0,1] be a random hash function. Find the 4 items bought by either Alice or Bob with the smallest hash values: i) find the 4 smallest hash values for Alice, and the 4 for Bob. ii) Aggregate the results. iii) Estimate the similarity. Alice Bob h( ) = 0.06 h( ) = 0.09 h( ) = 0.1 h( ) = 0.12 h( ) = 0.03 h( ) = 0.07 h( ) = 0.09 h( ) = 0.096
  • 15. Min-wise independent hashing. Example. 16 Let h: Items  [0,1] be a random hash function. Find the 4 items bought by either Alice or Bob with the smallest hash values: i) find the 4 smallest hash values for Alice, and the 4 for Bob. ii) Aggregate the results. iii) Estimate the similarity. Alice or Bob: h( ) = 0.03 h( ) = 0.07 h( ) = 0.09h( ) = 0.06
  • 16. Min-wise independent hashing. Example. 17 Let h: Items  [0,1] be a random hash function. Find the 4 items bought by either Alice or Bob with the smallest hash values: i) find the 4 smallest hash values for Alice, and the 4 for Bob. ii) Aggregate the results. iii) Estimate the similarity. Among the 4 items with the smallest hash values, only is bought by Alice and Bob, thus the estimated similarity is 1/4.
  • 17. Second solution. Min-wise independent hashing. • Store the k actions with the smallest hash values for each user. • After processing the stream, for an edge (u, v) we count the number of actions among the k actions with the smallest hash values that propagated from u to v within time t. 18 For all edges (u, v) with constant influence probability pu,v, we can obtain a constant factor approximation with probability 2/3 using O(n log n) bits and constant processing time per action.
  • 18. 19 Experiments with Twitter dataset. “+” Good space usage, fast processing time, reasonably good approximation. “-” One achieves really good space savings for large streams (average number of actions per user is a few thousands). Experiments with Twitter Dataset
  • 19. Third solution. (Estimate influence only for active users.) • Often, the most interesting users are the most active users. • By sampling only actions a with h(a)<α, for some α = α(k) ϵ [0,1], we can collect the min-wise samples for the k most active users. • Thus, we can estimate influence probabilities pu,v for users u and v who are among the k most active users. • Space efficient for small k and highly skewed user activity. 20
  • 20. Extending the algorithms to sliding windows. • Influence probabilities may change over time. • Streaming Sliding Window: Exponential histograms technique for bit-counting in data streams (Datar et al. SIAMCOMP, 2002) • Essentially, we obtain the same results, at the price of multiplicative polylog-factors in the time and space complexity. 21
  • 21. Conclusions and future directions. • The first algorithms for learning influence probabilities in data streams. • We performed only a high-level evaluation of our algorithms, confirming the theoretical findings. • We plan to extend the algorithms to handle adaptive size windows such that the user does not have to decide in advance the optimal window size. • The min-wise independent hashing approach can be distributed such that each processor processes only a subset of the users. 22
  • 22. STRIP: Stream Learning of Influence Probabilities Thank you! 23