14. Machine Learning @ PeerIndex
• The usual stuff
• topic modelling/classification of tweets, statuses and URLs
• identity resolution across Twitter, Facebook and LinkedIn
• spambot/fraud detection: identifying people gaming the system
• sentiment classification: happy/sad/neutral
• The really exciting stuff
• inferring networks of influence - more about this later
• visualising different aspects of influence, in an engaging way
• influence maximisation - submodular optimisation
Wednesday, 16 May 12
23. Heuristic approaches to estimate p_{i,j}
• purely based on local network structure
  p_{i,j} = 1 / d_in(j)
• trivalency “model” - my personal favourite :)
  p_{i,j} ∈ {0.1, 0.01, 0.001}, chosen randomly
• data-driven heuristics
  p_{i,j} = (number of items shared by j after i shared them) / (number of items shared by i)
How do you solve this with machine learning?
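The three heuristics above can be sketched in a few lines. This is a minimal illustration, not PeerIndex code; the function names and count arguments are my own labels for the quantities on the slide.

```python
import random

def degree_heuristic(in_degree_j):
    """Weighted-cascade style estimate: p_ij = 1 / d_in(j), i.e. each of
    j's in-neighbours is equally likely to have influenced j."""
    return 1.0 / in_degree_j

def trivalency(rng=random):
    """Trivalency 'model': assign each edge a probability drawn uniformly
    at random from {0.1, 0.01, 0.001}."""
    return rng.choice([0.1, 0.01, 0.001])

def data_driven(shares_by_j_after_i, shares_by_i):
    """Data-driven estimate: the fraction of i's shared items that j
    subsequently shared too."""
    if shares_by_i == 0:
        return 0.0
    return shares_by_j_after_i / shares_by_i
```

All three are cheap enough to compute for every edge of a large graph, which is what makes them useful at scale.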
35. The likelihood
P(D | θ)
Example cascade for http://www.pcworld.com/article/239719 (user ID, share time):
1079306 2011-08-25T00:03:06+01:00
4549198 2011-08-25T04:32:25+01:00
2662975 2011-08-25T00:35:11+01:00
2333224 2011-08-25T01:43:18+01:00
3141371 2011-08-25T01:52:06+01:00
3482720 2011-08-25T07:18:24+01:00
1403682 2011-08-25T03:52:58+01:00
4679657 2011-08-25T01:07:48+01:00
32460 2011-08-25T01:11:43+01:00
What is the probability of the cascade u_1, u_2, u_3, ..., u_n?
For subsequent users in the cascade (u_0 := 0 denotes the external source):
  p_{0,u_1} · (1 − (1 − p_{0,u_2})(1 − p_{u_1,u_2})) · · ·
  = ∏_{i=1}^{n} ( 1 − ∏_{j=0}^{i−1} (1 − p_{u_j,u_i}) )
For users that are not in the cascade:
  ∏_{u ∈ {u_1,...,u_n}} ∏_{v ∈ users ∖ {u_1,...,u_n}} (1 − p_{u,v})
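The two products above can be turned directly into a log-likelihood: each user in the cascade was activated by at least one earlier sharer (or the external source), and each user outside it was activated by none. A small sketch, assuming the edge probabilities are given as a dict; names are illustrative:

```python
import math

def cascade_log_likelihood(cascade, p, all_users):
    """Log-likelihood of an ordered cascade u_1..u_n under the model on
    the slide.

    cascade:   list of user ids in sharing order; the external source is 0
    p:         dict mapping (u, v) -> influence probability p_{u,v}
    all_users: set of all users who could have shared the item
    """
    ll = 0.0
    # users in the cascade: u_i was activated by at least one of the
    # earlier sharers, or by the external source u_0 = 0
    for i, ui in enumerate(cascade):
        predecessors = [0] + cascade[:i]
        prob_none = 1.0
        for uj in predecessors:
            prob_none *= 1.0 - p.get((uj, ui), 0.0)
        ll += math.log(1.0 - prob_none)
    # users not in the cascade: none of the sharers activated them
    for v in all_users - set(cascade):
        for u in cascade:
            ll += math.log(1.0 - p.get((u, v), 0.0))
    return ll
```

Working in log space keeps the many small factors from underflowing; maximising this over the p_{u,v} (or over parameters that generate them) is the learning problem.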
41. Maximum likelihood at scale
• data too sparse to learn one parameter per edge
• large-scale gradient-based optimisation is costly
• Solution: combine an ensemble of heuristics with ML
• use heuristics to compute probabilities at scale
• use ML to tune parameters on small-scale data
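One way to read "ensemble of heuristics + ML": each heuristic cheaply produces a probability for every edge, and only a handful of mixing parameters are then fitted by maximum likelihood on a small sample of cascades. The sketch below is my own illustration of that idea (a convex blend tuned by grid search), not the talk's actual method:

```python
import math

def blend(heuristics, w):
    """Convex combination of per-edge heuristic estimates.
    heuristics: list of dicts (u, v) -> probability; w: matching weights."""
    edges = set().union(*heuristics)
    return {e: sum(wk * h.get(e, 0.0) for wk, h in zip(w, heuristics))
            for e in edges}

def fit_mixture_weight(h1, h2, log_likelihood, grid=21):
    """Pick alpha in [0, 1] maximising the cascade log-likelihood of
    p = alpha * h1 + (1 - alpha) * h2 on a small held-out sample.
    `log_likelihood` maps an edge-probability dict to a score."""
    best_alpha, best_ll = 0.0, -math.inf
    for k in range(grid):
        alpha = k / (grid - 1)
        ll = log_likelihood(blend([h1, h2], [alpha, 1 - alpha]))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```

The expensive part (one probability per edge) stays heuristic; the learned part is a single scalar, so it can be fitted on data small enough for ordinary optimisation.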
48. Influence maximisation
• Select a set of users to maximise outreach
• Influence of people combines non-linearly
• In many models it combines sub-modularly:
  A ⊆ B ⟹ f(A ∪ {x}) − f(A) ≥ f(B ∪ {x}) − f(B)
• these functions are fun to optimise
• pops up many times in machine learning
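Submodularity (diminishing returns) is what makes these functions fun to optimise: for monotone submodular f, greedily adding the element with the largest marginal gain k times achieves at least a (1 − 1/e) fraction of the optimum (Nemhauser et al.). A minimal sketch on a coverage-style influence function; the reach sets are illustrative toy data:

```python
def greedy_maximise(candidates, f, k):
    """Greedily pick up to k elements maximising a monotone submodular
    set function f, by repeatedly taking the largest marginal gain."""
    chosen = set()
    for _ in range(k):
        best, best_gain = None, 0.0
        base = f(chosen)
        for x in candidates - chosen:
            gain = f(chosen | {x}) - base
            if gain > best_gain:
                best, best_gain = x, gain
        if best is None:  # no element adds value
            break
        chosen.add(best)
    return chosen

# Coverage function: a seed set's "outreach" is the number of distinct
# users it reaches. Coverage is monotone and submodular.
reach = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}}
f = lambda S: len(set().union(*(reach[x] for x in S)) if S else set())
```

Here `greedy_maximise({"a", "b", "c"}, f, 2)` picks "c" first (4 new users) and then "a" (3 more), skipping "b" whose users are already covered.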
56. Wrap up
• two lines of ‘data’ products: PeerIndex, PeerPerks
• lots of ‘standard’ machine learning tasks
• some uniquely exciting problems
• inferring propagation probabilities
• computing the expected number of users one reaches
• putting all aspects together into a single number, and visualising it
• influence maximisation
57. Thanks
We’re hiring ML scientists, interns and engineers...
@fhuszar
fh@peerindex.com