PhD Thesis presentation
1. DETECTION OF DISHONEST
BEHAVIORS IN ON-LINE
NETWORKS USING GRAPH-BASED
RANKING TECHNIQUES
Francisco Javier Ortega Rodríguez
Supervised by
Prof. Dr. José Antonio Troyano Jiménez
3. Motivation
WWW: Web Search
A new business model
Advertisements on the web pages
More web traffic → more visits to (or views of) the ads
Search Engine Optimization (SEO) is born
White Hat SEO
Black Hat SEO → Web Spam!
4. Motivation
Social Networks
Reputation of users is similar to the relevance of web pages
Higher reputation can imply some benefits
Malicious users manipulate the Trust and Reputation Systems (TRS’s)
On-line marketplaces: money
Social news sites: slant the contents of the web site
Simply for “trolling” (for pleasure)
6. Motivation
Hypothesis
The detection of dishonest behaviors in on-line networks can be carried out with graph-based techniques, flexible enough to include in their schemes specific information (in the form of features of the elements in a graph) about the network to be processed and the concrete task to be solved.
8. Web Spam Detection
Web spam mechanisms try to increase the
web traffic to specific web sites
Reach the top positions of a web search
engine
Relatedness: similarity to the user query
Changing the content of the web page
Visibility: relevance in the collection
Getting a high number of references
12. Web Spam Detection
Relevant web spam detection methods:
Link-based approaches
PageRank-based
Adaptations:
Truncated PageRank [Castillo et al. 2007]
TrustRank [Gyöngyi et al. 2004]
13. Web Spam Detection
Relevant web spam detection methods:
Link-based approaches
Pros:
Tackle the link-based spam methods
The ranking can be used directly as the result of a user query
Cons:
Do not take into account the content of the web pages
Need human intervention in some specific parts
14. Web Spam Detection
Relevant web spam detection methods:
Content-based approaches
Database:
WP | Size | Compressibility | Avg. word length | ...
1  | …    | …               | …                | …
2  | …    | …               | …                | …
…  | …    | …               | …                | …
Classifier → Spam / Not Spam
15. Web Spam Detection
Relevant web spam detection methods:
Content-based approaches
Pros:
Deal with content-based spam methods
Binary classification methods
Cons:
Very slow in comparison to the link-based methods
Based on user-specified features
Do not take into account the topology of the web graph
17. Web Spam Detection
Relevant web spam detection methods:
Hybrid approaches
Pros:
Combine the pros of link and content-based methods.
Really effective in the classification of web pages
Cons:
Need user-specified features for both the content and the
link-based heuristics.
Opportunity:
Do not take advantage of the global topology of the web
graph
19. PolaritySpam
Intuition
Include content-based knowledge in a link-based
system.
[Diagram: pipeline with the stages Database, Content Evaluation, Selection of sources, Propagation algorithm, Ranking]
20. PolaritySpam
Content Evaluation
[Diagram: Database and Content Evaluation stages]
21. PolaritySpam
Content Evaluation
Acquire useful knowledge from the textual content
Content-based heuristics
Adequate for spam detection
Easy to compute
Highest discriminative ability
A-priori spam likelihood of a web page
22. PolaritySpam
Content Evaluation
Small set of heuristics [Ntoulas et al., 2006]
Compressibility
Average length of words
A high value of the metrics implies an a-priori
high spam likelihood of a web page
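The two heuristics can be sketched in a few lines. This is only an illustration: the function names are hypothetical, compressibility is taken here as raw size divided by zlib-compressed size (one common definition, assumed rather than taken from the slides), and words are split on whitespace.

```python
import zlib


def compressibility(text: str) -> float:
    """Raw size divided by compressed size; repetitive text scores higher."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def average_word_length(text: str) -> float:
    """Mean number of characters per whitespace-separated token."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(word) for word in words) / len(words)
```

A page that repeats the same keywords many times compresses very well, so its compressibility is high, which matches the intuition that high metric values point to spam.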
23. PolaritySpam
Selection of Sources
[Diagram: Database and Selection of sources stages]
24. PolaritySpam
Selection of Sources
Automatically pick a set of a-priori spam and not-spam web pages, Sources- and Sources+, respectively
Take into account the content-based heuristics
Given a web page wpi with metrics: Mi = {mi1, mi2, ..., mij}
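One possible way to pick Sources+ and Sources- from the metrics Mi is sketched below. The slides do not say how the heuristics are combined or how many sources are chosen, so the min-max normalization, the unweighted sum, the parameter `k` and the function name are all assumptions.

```python
def select_sources(metrics, k):
    """metrics: dict page -> list of heuristic values (higher = more spam-like).

    Returns (sources_pos, sources_neg): the k least and k most spam-like pages.
    Assumes a non-empty dict and 2 * k <= number of pages.
    """
    pages = list(metrics)
    n_metrics = len(next(iter(metrics.values())))
    score = {page: 0.0 for page in pages}
    for m in range(n_metrics):
        values = [metrics[page][m] for page in pages]
        low, high = min(values), max(values)
        span = (high - low) or 1.0  # avoid dividing by zero for constant metrics
        for page in pages:
            # Rescale each metric to [0, 1] so that all metrics weigh the same.
            score[page] += (metrics[page][m] - low) / span
    ranked = sorted(pages, key=lambda page: score[page])
    return set(ranked[:k]), set(ranked[-k:])
```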
26. PolaritySpam
Propagation algorithm
[Diagram: Propagation algorithm and Ranking stages]
27. PolaritySpam
Propagation algorithm:
PageRank-based algorithm
Idea: propagate a-priori information from a
specific set of web pages, Sources
A-priori scores for the Sources
ei ≠ 0 ⇔ wpi ∈ Sources
28. PolaritySpam
Propagation algorithm:
Two scores for each web page, vi:
ei+ ≠ 0 ⇔ wpi ∈ Sources+ (set of a-priori non-spam web pages)
ei- ≠ 0 ⇔ wpi ∈ Sources- (set of a-priori spam web pages)
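A minimal sketch of the two-score propagation follows. It assumes that each score is an ordinary personalized PageRank seeded from its own source set, with uniform a-priori scores and a damping factor d = 0.85; none of these details are given on the slides, and the function names are hypothetical.

```python
def personalized_pagerank(out_links, e, d=0.85, iterations=50):
    """out_links: dict node -> list of linked nodes (every target must be a key).
    e: dict node -> a-priori score, non-zero only for the sources."""
    nodes = list(out_links)
    score = {v: e.get(v, 0.0) for v in nodes}
    for _ in range(iterations):
        nxt = {v: (1 - d) * e.get(v, 0.0) for v in nodes}
        for j in nodes:
            targets = out_links[j]
            if not targets:
                continue  # dangling node: its score is not propagated in this sketch
            share = d * score[j] / len(targets)
            for i in targets:
                nxt[i] += share
        score = nxt
    return score


def polarity_scores(out_links, sources_pos, sources_neg, d=0.85):
    """Return (positive scores, negative scores), one propagation per source set."""
    e_pos = {v: 1.0 / len(sources_pos) for v in sources_pos}
    e_neg = {v: 1.0 / len(sources_neg) for v in sources_neg}
    return (personalized_pagerank(out_links, e_pos, d),
            personalized_pagerank(out_links, e_neg, d))
```

A page can then be ranked by comparing its two scores, for instance by their difference.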
31. PolaritySpam
Evaluation:
Dataset
WEBSPAM-UK 2006 (Università degli Studi di Milano)
Metrics:
98 million pages
11,400 hosts manually labeled
7,423 hosts are labeled as spam
About 10 million web pages are labeled as spam
Processed with Terrier IR Platform
http://terrier.org
32. PolaritySpam
Evaluation:
Baseline: TrustRank [Gyöngyi et al., 2004]
Link-based web spam detection method
Personalized PageRank equation
Propagation from a set of hand-picked web pages
34. PolaritySpam
Evaluation:
Evaluation methods: PR-Buckets
Evaluation metric: number of spam web pages in each
bucket
[Figure: the ranking split into Bucket 1, Bucket 2, …, Bucket N]
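The bucket metric can be illustrated as follows. The slides do not say how the bucket boundaries are chosen, so this sketch simply uses equal-sized buckets; the function and parameter names are hypothetical.

```python
def spam_per_bucket(ranking, spam, n_buckets):
    """ranking: pages ordered from best to worst; spam: set of pages labeled spam.

    Returns the number of spam pages in each consecutive bucket. The last
    bucket may be smaller, and fewer than n_buckets buckets may be produced
    when the ranking does not divide evenly.
    """
    if not ranking:
        return []
    size = -(-len(ranking) // n_buckets)  # ceiling division
    counts = []
    for start in range(0, len(ranking), size):
        bucket = ranking[start:start + size]
        counts.append(sum(1 for page in bucket if page in spam))
    return counts
```

A good spam detector pushes spam towards the last buckets, so the counts should grow from left to right.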
36. PolaritySpam
Evaluation:
Normalized Discounted Cumulative Gain (nDCG):
Global metric: measures the demotion of spam web
pages
Sum the “relevance” scores of not-spam web pages
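As an illustration, nDCG over a ranking can be computed as below. The binary gain (1 for a not-spam page, 0 for a spam page) and the log2 position discount are assumptions of this sketch; the slides only state that the relevance scores of not-spam pages are summed.

```python
import math


def ndcg(ranking, spam):
    """ranking: pages ordered from best to worst; spam: set of pages labeled spam."""
    gains = [0.0 if page in spam else 1.0 for page in ranking]
    # Discounted cumulative gain: later positions contribute less.
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    # Ideal ordering: every not-spam page ranked before every spam page.
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```

The value is 1.0 when all spam pages sit at the bottom of the ranking and drops as spam pages move up.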
43. Trust & Reputation in Social Networks
Trust and reputation are key concepts in
social networks
Similar to the relevance of web pages in the
WWW
Reputation: assessment of the
trustworthiness of a user in a social
network, according to his behavior and the
opinions of the other users.
44. Trust & Reputation in Social Networks
Example: On-line marketplaces
Trustworthiness is as decisive as the price
Higher reputation implies more sales
Positive and negative opinions
45. Trust & Reputation in Social Networks
Main goal: gain high reputation
Obtain positive feedback from the customers
Sell some bargains
Special offers
Give negative opinions to sellers who may be competitors.
Obtain false positive opinions from other accounts
(not necessarily other users).
Dishonest behaviors!
47. Trust & Reputation in Social Networks
TRS’s in the real world
Moderators
Special set of users with specific responsibilities
Example: Slashdot.org
A hierarchy of moderators
A special user, No_More_Trolls, maintains a list of known
trolls
Drawbacks:
Scalability
Subjectivity
48. Trust & Reputation in Social Networks
TRS’s in the real world
Unsupervised TRS’s
Users rate the contents of the system (and also other
users)
Scalability problem: rely on the users
Subjectivity problem: decentralized
Examples: Digg.com, eBay.com
Drawbacks:
Unsupervised!
49. Trust & Reputation in Social Networks
Transitivity of Trust and Distrust [Guha et al., 2004]
Multiplicative distrust
The enemy of my enemy is my friend
Additive distrust
Don’t trust someone not trusted by someone you don’t trust
Neutral distrust
Don’t take into account your enemies’ opinions
50. Trust & Reputation in Social Networks
Threats of TRS’s
Orchestrated attacks
Camouflage behind good behavior
Malicious Spies
Camouflage behind judgments
51. Trust & Reputation in Social Networks
Threats of TRS’s
Orchestrated attacks: obtaining positive opinions
from other accounts (not necessarily other users)
[Figure: example graph with nodes 0 to 9]
52. Trust & Reputation in Social Networks
Threats of TRS’s
Camouflage behind good behavior: feigning good
behavior in order to obtain positive feedback from
others.
[Figure: example graph with nodes 0 to 9]
53. Trust & Reputation in Social Networks
Threats of TRS’s
Malicious spies: using an “honest” account to
provide positive opinions to malicious users.
[Figure: example graph with nodes 0 to 9]
54. Trust & Reputation in Social Networks
Threats of TRS’s
Camouflage behind judgments: giving negative
feedback to users who can be competitors.
[Figure: example graph with nodes 0 to 9]
56. PolarityTrust
Intuition
Compute a ranking of the users in a social
network according to their trustworthiness
Take into account both positive and negative
feedback
Graph-based ranking algorithm to obtain two scores for each node:
PT+(vi): positive reputation of user i
PT-(vi): negative reputation of user i
57. PolarityTrust
Intuition
Propagation algorithm for the opinions of the users
Given a set of trustworthy users
Their PT+ and PT- scores are propagated to their
neighbors
… and so on
[Figure: example graph with nodes 0 to 9]
58. PolarityTrust
Algorithm
Propagation schema of the opinions of the users
Different behavior depending on the type of relation
between the users: positive or negative
[Diagram: user a, holding PT+(a), is linked to b and c: PT+(b) ↑ and PT-(c) ↑. User d, holding PT-(d), is linked to e and f: PT-(e) ↑ and PT+(f) ↑.]
59. PolarityTrust
Algorithm
The scores of the nodes influence the scores of
their neighbors
$PT^{+}(v_i)$
60. PolarityTrust
Algorithm
The scores of the nodes influence the scores of
their neighbors
$PT^{+}(v_i) = (1 - d)\,e_i^{+} + d \cdot \ldots$
$e_i^{+}$: a-priori score, non-zero only for the set of sources
61. PolarityTrust
Algorithm
The scores of the nodes influence the scores of
their neighbors
$$PT^{+}(v_i) = (1 - d)\,e_i^{+} + d \sum_{j \in In^{+}(i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{+}(v_j)$$
Direct relation with the PT+ of positively voting
users
62. PolarityTrust
Algorithm
The scores of the nodes influence the scores of
their neighbors
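$$PT^{+}(v_i) = (1 - d)\,e_i^{+} + d \left[ \sum_{j \in In^{+}(i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{+}(v_j) + \sum_{j \in In^{-}(i)} \frac{-p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{-}(v_j) \right]$$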
Inverse relation with the PT- of negatively voting
users
63. PolarityTrust
Algorithm
The scores of the nodes influence the scores of
their neighbors
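$$PT^{+}(v_i) = (1 - d)\,e_i^{+} + d \left[ \sum_{j \in In^{+}(i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{+}(v_j) + \sum_{j \in In^{-}(i)} \frac{-p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{-}(v_j) \right]$$
$$PT^{-}(v_i) = (1 - d)\,e_i^{-} + d \left[ \sum_{j \in In^{+}(i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{-}(v_j) + \sum_{j \in In^{-}(i)} \frac{-p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|}\, PT^{+}(v_j) \right]$$
Here $In^{+}(i)$ and $In^{-}(i)$ are the users voting positively and negatively on $v_i$, $p_{ij}$ is the signed weight of the vote linking $v_j$ and $v_i$ (negative for a negative opinion), and $d$ is the damping factor.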
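The two update rules above can be sketched as a simple iteration. This is an illustrative reading, not the thesis implementation: votes are given as hypothetical `(voter, target, weight)` tuples, sources receive uniform a-priori scores, and the damping factor and iteration count are arbitrary defaults.

```python
def polarity_trust(edges, sources_pos, sources_neg=(), d=0.85, iterations=50):
    """edges: list of (voter j, target i, weight p); p > 0 is trust, p < 0 distrust.

    Returns (pt_pos, pt_neg): positive and negative reputation per node.
    """
    nodes = {n for j, i, _ in edges for n in (j, i)}
    out_weight = {v: 0.0 for v in nodes}
    for j, _, p in edges:
        out_weight[j] += abs(p)
    e_pos = {v: 1.0 / len(sources_pos) for v in sources_pos}
    e_neg = {v: 1.0 / len(sources_neg) for v in sources_neg}
    pt_pos = {v: e_pos.get(v, 0.0) for v in nodes}
    pt_neg = {v: e_neg.get(v, 0.0) for v in nodes}
    for _ in range(iterations):
        new_pos = {v: (1 - d) * e_pos.get(v, 0.0) for v in nodes}
        new_neg = {v: (1 - d) * e_neg.get(v, 0.0) for v in nodes}
        for j, i, p in edges:
            if p == 0:
                continue  # a zero-weight vote carries no opinion
            w = d * abs(p) / out_weight[j]
            if p > 0:
                # Positive vote: scores pass through with the same polarity.
                new_pos[i] += w * pt_pos[j]
                new_neg[i] += w * pt_neg[j]
            else:
                # Negative vote: the polarity is swapped.
                new_pos[i] += w * pt_neg[j]
                new_neg[i] += w * pt_pos[j]
        pt_pos, pt_neg = new_pos, new_neg
    return pt_pos, pt_neg
```

With a single trusted source a that votes positively on b and negatively on c, b ends up with more positive than negative reputation and c with the opposite.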
64. PolarityTrust
Non-Negative Propagation
Problems caused by negative opinions from
malicious users
Solution:
dynamically avoid the propagation of
these opinions from malicious users
[Diagram: user a, holding PR-(a), is linked to b and c: PR-(b) ↑ and PR+(c) ↑]
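One way to read this idea is as a filter applied to the votes before each propagation step. The slides do not state how a malicious user is recognized, so the test used here (negative score above positive score) is an assumption, as are the function name and the `(voter, target, weight)` tuple format.

```python
def propagated_edges(edges, pt_pos, pt_neg):
    """Drop the negative votes cast by users that currently look malicious.

    edges: list of (voter, target, weight); weight < 0 is a negative opinion.
    pt_pos, pt_neg: current positive and negative scores per user.
    """
    kept = []
    for voter, target, weight in edges:
        # Assumption: a user looks malicious when PT- exceeds PT+.
        looks_malicious = pt_neg.get(voter, 0.0) > pt_pos.get(voter, 0.0)
        if weight < 0 and looks_malicious:
            continue
        kept.append((voter, target, weight))
    return kept
```

Because the scores change at every iteration, the filter would be re-evaluated each time, which is what makes the avoidance dynamic.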
65. PolarityTrust
Action-Reaction Propagation
Problems caused by dishonest voting attacks
Positive votes to malicious users
Orchestrated attacks, malicious spies…
Negative votes to good users
Camouflage behind bad judgments
React against bad actions: dishonest voting
Penalize users who perform these actions
Proportional to the trustworthiness of the affected nodes
66. PolarityTrust
Action-Reaction Propagation
Computation:
Relation between the number of dishonest votes and
the total number of votes
Applied after each iteration of the ranking algorithm
[Diagram: example graph with nodes a, b, c, d]
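The ratio of dishonest votes to total votes could be computed per user as sketched below. What counts as a dishonest vote is an assumption of this sketch: a positive vote for a user whose negative score exceeds the positive one, or a negative vote for a user that looks trustworthy. How the resulting ratio penalizes the voter is not specified on the slides and is left out.

```python
def dishonest_vote_ratio(edges, pt_pos, pt_neg):
    """edges: list of (voter, target, weight); weight > 0 is a positive vote.

    Returns dict voter -> dishonest votes / total votes cast by that voter.
    """
    total = {}
    dishonest = {}
    for voter, target, weight in edges:
        target_trusted = pt_pos.get(target, 0.0) >= pt_neg.get(target, 0.0)
        total[voter] = total.get(voter, 0) + 1
        # Dishonest: praising an untrusted user or attacking a trusted one.
        if (weight > 0) != target_trusted:
            dishonest[voter] = dishonest.get(voter, 0) + 1
    return {voter: dishonest.get(voter, 0) / n for voter, n in total.items()}
```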
69. PolarityTrust
Evaluation
Datasets
Barabasi & Albert
Preferential attachment property
Randomly generated attacks
Metrics of the dataset
10^4 nodes per graph
10^3 malicious users
100 malicious spies
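A graph with the preferential attachment property can be generated with the standard library alone, as sketched here. The parameters `n` (nodes) and `m` (links added per new node) and the fully connected starting core are assumptions; the randomly generated attacks are not modeled.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Return the undirected edges of a preferential attachment graph.

    Requires m >= 1 and n > m. Each new node attaches to m distinct
    existing nodes chosen with probability proportional to their degree.
    """
    rng = random.Random(seed)
    edges = []
    # Start from a small fully connected core of m + 1 nodes.
    for u in range(m + 1):
        for v in range(u):
            edges.append((u, v))
    # Every node appears here once per incident edge, so a uniform draw
    # from this list is a degree-proportional draw over the nodes.
    endpoints = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))
        for target in targets:
            edges.append((new, target))
            endpoints.extend((new, target))
    return edges
```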
70. PolarityTrust
Evaluation
Datasets
Slashdot Zoo
Graph of users in Slashdot.org
Friend and Foe relationships
Gold Standard = list of Foes of the special user
No_More_Trolls
Metrics of the dataset
71,500 users in total
24% negative edges
96 known trolls
Source set: CmdrTaco and his friends (6 users in total)
71. PolarityTrust
Evaluation
Baselines
EigenTrust [Kamvar et al. 2003]
It does not take into account negative opinions
Fans Minus Freaks
(Number of friends – Number of foes)
Signed Spectral Ranking [Kunegis et al. 2009]
Negative Ranking [Kunegis et al. 2009]
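The simplest baseline, Fans Minus Freaks, needs no propagation at all. The sketch assumes that votes are `(voter, target, weight)` tuples and that the count is taken over the votes a user receives (positive votes minus negative votes); the function name is hypothetical.

```python
def fans_minus_freaks(edges):
    """edges: list of (voter, target, weight); weight > 0 friend, weight < 0 foe.

    Returns dict user -> received positive votes minus received negative votes.
    """
    score = {}
    for voter, target, weight in edges:
        score.setdefault(voter, 0)  # users who only vote still get a score
        if weight > 0:
            score[target] = score.get(target, 0) + 1
        elif weight < 0:
            score[target] = score.get(target, 0) - 1
        else:
            score.setdefault(target, 0)
    return score
```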
75. PolarityTrust
Evaluation
Include a set of sources of distrust
In Slashdot Zoo Dataset:
Sources of trust: CmdrTaco and friends
Sources of distrust: 5 random foes of No_More_Trolls
Many possible methods to choose the sources of
distrust
78. Conclusions
Final Remarks
Development of two systems for the detection of
dishonest behaviors in on-line networks
Web Spam Detection: PolaritySpam
Trust and Reputation: PolarityTrust
Propagation of some a-priori information
Web Spam: Textual content of the web pages
Trust and Reputation: Trust and distrust sources sets
79. Conclusions
Final Remarks
Web Spam Detection
Unlike existing approaches, includes content-based knowledge in a link-based technique
Unsupervised methods for the selection of sources
Propagate information of the sources through the
network
Two simple metrics improve state-of-the-art methods
80. Conclusions
Final Remarks
Trust and Reputation in social networks
Negative links improve the discriminative ability of
TRS’s
Propagation strategies to deal with different attacks
against a TRS
Non-Negative propagation
Action-Reaction propagation
Interrelated scores modeling the transitivity of trust
and distrust
Flexible to be adapted to different situations and
81. Conclusions
Future Work
PolaritySpam
Applicability of more content-based metrics
Additional methods for the selection of sources
Propagation ability of each source
Infer negative relations between web pages
According to their textual content
Apply similar propagation schemas as in PolarityTrust
82. Conclusions
Future Work
PolarityTrust
Study other possible attacks
Playbook sequences (omniscience of the attackers)
Analyze the casuistry of the different social networks
Selection of sources of trust and distrust
Link-based methods
Study other contexts with positive and negative
relations:
Trending topics
Authorities in the blogosphere
83. Conclusions
Future Work
Both techniques
Study of the parallelization of both algorithms
Many works on the parallelization of PageRank
Saving time and memory
Detection of Spam on the social networks
Spam messages and spam user accounts
Recommender Systems
NLP and Opinion Mining techniques in a link-based
system
Use the positive and negative information
84. Curriculum Vitae
Academic and Research milestones
2006: Degree in Computer Science
2006: Funded student in the Itálica research group
2008: Master of Advanced Studies:
“STR: A graph-based tagger generator”
2010: Research stay at the University of Glasgow
IR Group (Dr. Iadh Ounis and Dr. Craig Macdonald)
85. Curriculum Vitae
26 contributions to conferences and journals
5 JCR
10 International Conferences
2 CORE B
4 CORE C
4 ISI Proceedings
3 Lecture Notes in Computer Science
3 CiteSeer Venue Impact Ratings
Research projects
86. Curriculum Vitae
Contributions related to the thesis
[Diagram legend: National Conf., International Conf., JCR. Items shown: TextRank for Tagging, System Combination Methods, Improving a Tagger Generator in IE, STR, PolarityRank, Web Spam Detection, PolarityTrust, PolaritySpam]
87. Curriculum Vitae
Contributions related to the thesis
[Diagram legend: National Conf., International Conf., JCR. Items shown: TextRank for Tagging, System Combination Methods, Improving a Tagger Generator in IE]
TextRank como motor de aprendizaje en tareas de etiquetado, SEPLN 2006
Bootstrapping Applied to a Corpus Generation Task, EUROCAST 2007
Improving the Performance of a Tagger Generator in an Information Extraction Application, Journal of Universal Computer Science (2007)
88. Curriculum Vitae
Contributions related to the thesis
[Diagram legend: National Conf., International Conf., JCR. Item shown: STR]
STR: A Graph-based Tagging Technique, International Journal on Artificial Intelligence Tools (2011)
89. Curriculum Vitae
Contributions related to the thesis
[Diagram legend: National Conf., International Conf., JCR. Items shown: PolarityRank, Web Spam Detection]
A Knowledge-Rich Approach to Feature-based Opinion Extraction from Product Reviews, SMUC 2010 (CIKM 2010)
Combining Textual Content and Hyperlinks in Web Spam Detection, NLDB 2010
90. Curriculum Vitae
Contributions related to the thesis
[Diagram legend: National Conf., International Conf., JCR. Items shown: PolarityTrust, PolaritySpam]
PolarityTrust: Measuring Trust and Reputation in Social Networks, ITA 2011
PolaritySpam: Propagating Content-based Information Through a Web Graph to Detect Web Spam, International Journal of Innovative Computing, Information and Control (2012)