PhD Thesis presentation

DETECTION OF DISHONEST
BEHAVIORS IN ON-LINE
NETWORKS USING GRAPH-BASED
RANKING TECHNIQUES
Francisco Javier Ortega Rodríguez

Supervised by
Prof. Dr. José Antonio Troyano Jiménez

Motivation
3

 WWW: Web Search
A new business model
 Advertisements on the web pages
 More web traffic → more visits to (or views of) the ads

 Search Engine Optimization (SEO) is born

 White Hat SEO

 Black Hat SEO → Web Spam!

Motivation
4

 Social Networks
 Reputation of users similar to relevance of web
pages

 Higher reputation can imply some benefits

 Malicious users manipulate Trust and Reputation Systems (TRSs)
 On-line marketplaces: money
 Social news sites: slant the contents of the web site
 Simply for “trolling” (for pleasure)

Motivation
6

 Hypothesis

The detection of dishonest behaviors in on-line networks can be
carried out with graph-based techniques, flexible enough to include
in their schemes specific information (in the form of features of
the elements in a graph) about the network to be processed and the
concrete task to be solved.

Web Spam Detection
8

 Web spam mechanisms try to increase the
web traffic to specific web sites

 Reach the top positions of a web search
engine
 Relatedness: similarity to the user query
 Changing the content of the web page

 Visibility: relevance in the collection
 Getting a high number of references

Web Spam Detection
9

 Content-based methods: self promotion
 Hidden HTML code
 Keyword stuffing

Web Spam Detection
10

 Link-based methods: mutual promotion
 Link-farms

 PR-sculpting

Web Spam Detection
12

 Relevant web spam detection methods:
 Link-based approaches
 PageRank-based

 Adaptations:
 Truncated PageRank [Castillo et al. 2007]
 TrustRank [Gyöngyi et al. 2004]

Web Spam Detection
13

 Link-based approaches
 Pros:
 Tackle the link-based spam methods
 The ranking can be directly used as the result of a user query

 Cons:
 Do not take into account the content of the web pages
 Need human intervention in some specific parts

Web Spam Detection
14

 Content-based approaches

Database of web pages (WPs):

WP   Size   Compressibility   Avg. word length   ...
1    …      …                 …                  …
2    …      …                 …                  …
…    …      …                 …                  …

Database → Classifier → Spam / Not Spam

Web Spam Detection
15

 Content-based approaches
 Pros:
 Deal with content-based spam methods
 Binary classification methods

 Cons:
 Very slow in comparison to the link-based methods
 Based on user-specified features
 Do not take into account the topology of the web graph

Web Spam Detection
16

 Hybrid approaches

Database of web pages (WPs):

WP   Size   % In-links   Compressibility   Out-links / In-links   Avg. word length   ...
1    …      …            …                 …                      …                  …
2    …      …            …                 …                      …                  …
…    …      …            …                 …                      …                  …

Link-based metrics: % In-links, Out-links / In-links

Web Spam Detection
17

 Hybrid approaches
 Pros:
 Combine the pros of link and content-based methods.
 Really effective in the classification of web pages

 Cons:
 Need user-specified features for both the content and the
link-based heuristics.

 Opportunity:
 Do not take advantage of the global topology of the web
graph

PolaritySpam
19

 Intuition
 Include content-based knowledge in a link-based
system.
[Diagram: Database → Content Evaluation → Selection of sources → Propagation algorithm → Ranking]

PolaritySpam
20

 Content Evaluation

[Diagram: Database → Content Evaluation]

PolaritySpam
21

 Acquire useful knowledge from the textual content

 Content-based heuristics
 Adequate for spam detection
 Easy to compute
 Highest discriminative ability

 A-priori spam likelihood of a web page

PolaritySpam
22

 Small set of heuristics [Ntoulas et al., 2006]
 Compressibility

 Average length of words

A high value of the metrics implies an a-priori
high spam likelihood of a web page
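
As an illustration only, the two heuristics could be computed roughly as follows. This is a minimal sketch: the use of zlib as the compressor, whitespace tokenization, and the function names are assumptions, not details taken from the slides.

```python
import zlib


def compressibility(text: str) -> float:
    """Raw size divided by compressed size; repetitive text scores higher."""
    raw = text.encode("utf-8")
    if not raw:
        return 0.0
    return len(raw) / len(zlib.compress(raw))


def average_word_length(text: str) -> float:
    """Mean number of characters per whitespace-separated token."""
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)


# Hypothetical page contents, for illustration only.
stuffed = "cheap pills " * 200
normal = "The quick brown fox jumps over the lazy dog near the river bank."

print(compressibility(stuffed) > compressibility(normal))
print(average_word_length("aa bbbb"))
```

A keyword-stuffed page repeats the same tokens many times, so it compresses far better than ordinary prose, which is why a high compressibility value points towards spam.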

PolaritySpam
23

 Selection of Sources

[Diagram: Database → Selection of sources]

PolaritySpam
24

 Automatically pick a set of a-priori spam and not-spam web pages,
Sources- and Sources+, respectively

 Take into account the content-based heuristics
 Given a web page $wp_i$ with metrics: $M_i = \{m_{i1}, m_{i2}, \ldots, m_{ij}\}$

PolaritySpam
25

 Most Spammy/Not-Spammy sources (S-NS)

 Content-based S-NS (CS-NS)

 Content-based Graph Sources (C-GS)

PolaritySpam
26

 Propagation algorithm

[Diagram: Propagation algorithm → Ranking]

PolaritySpam
27

 Propagation algorithm:
 PageRank-based algorithm

 Idea: propagate a-priori information from a
specific set of web pages, Sources

 A-priori scores for the Sources

$e_i \neq 0 \Leftrightarrow wp_i \in Sources$

PolaritySpam
28

 Propagation algorithm:
 Two scores for each web page $v_i$:

$e_i^+ \neq 0 \Leftrightarrow wp_i \in Sources^+$ (set of a-priori non-spam web pages)
$e_i^- \neq 0 \Leftrightarrow wp_i \in Sources^-$ (set of a-priori spam web pages)

PolaritySpam
29

[Diagram: Propagation algorithm → Ranking]

PolaritySpam
30

 Evaluation:
 Dataset

 Baseline

 Evaluation Methods

 Results

PolaritySpam
31

 Evaluation:
 Dataset
 WEBSPAM-UK 2006 (Università degli Studi di Milano)

 Metrics:
 98 million pages
 11,400 hosts manually labeled
 7,423 hosts are labeled as spam

 About 10 million web pages are labeled as spam

 Processed with Terrier IR Platform
 http://terrier.org

PolaritySpam
32

 Evaluation:
 Baseline: TrustRank [Gyöngyi et al., 2004]

 Link-based web spam detection method

 Personalized PageRank equation

 Propagation from a set of hand-picked web pages
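
The baseline idea (a personalized PageRank whose teleport mass is concentrated on a trusted seed set) can be sketched as below. This is a simplified illustration, not the TrustRank implementation: the seed-selection step is omitted, the damping value 0.85 is an assumption, dangling pages simply lose their mass, and the function name and toy graph are hypothetical.

```python
def personalized_pagerank(out_links, seeds, d=0.85, iterations=50):
    """out_links: dict page -> list of linked pages (every page must be a key).
    seeds: set of trusted pages that receive all the teleport mass."""
    nodes = list(out_links)
    e = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    score = dict(e)
    for _ in range(iterations):
        nxt = {v: (1 - d) * e[v] for v in nodes}
        for u in nodes:
            targets = out_links[u]
            if not targets:
                continue  # dangling page: its mass is dropped in this sketch
            share = d * score[u] / len(targets)
            for v in targets:
                nxt[v] += share
        score = nxt
    return score


# Toy graph: a trusted component and an isolated link farm.
graph = {
    "good": ["a"],
    "a": ["good"],
    "spam1": ["spam2"],
    "spam2": ["spam1"],
}
scores = personalized_pagerank(graph, seeds={"good"})
print(scores)
```

Because trust only flows along links from the seeds, the link farm in the toy graph never receives any score, however densely its pages link to each other.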

PolaritySpam
33

 Evaluation:
 Evaluation methods: PR-Buckets

[Diagram: the ranking split into consecutive buckets: Bucket 1, Bucket 2, …, Bucket N]

PolaritySpam
34

 Evaluation:
 Evaluation metric: number of spam web pages in each
bucket
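
A minimal sketch of this metric is shown below. The slides do not specify how the buckets are built, so the sketch assumes N contiguous buckets of (nearly) equal size over the ranked list; the function name and sample data are hypothetical.

```python
def spam_per_bucket(ranking, spam, n_buckets=10):
    """ranking: list of page ids, best ranked first.
    spam: set of page ids labeled as spam.
    Returns the number of spam pages in each of n_buckets contiguous buckets."""
    n = len(ranking)
    counts = []
    for i in range(n_buckets):
        lo = i * n // n_buckets
        hi = (i + 1) * n // n_buckets
        counts.append(sum(1 for page in ranking[lo:hi] if page in spam))
    return counts


ranking = ["a", "b", "c", "d", "e", "f"]
spam = {"b", "e", "f"}
print(spam_per_bucket(ranking, spam, n_buckets=3))
```

A good spam-demoting ranking shows few spam pages in the first buckets and most of them in the last ones.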


PolaritySpam
36

 Evaluation:
 Normalized Discounted Cumulative Gain (nDCG):
 Global metric: measures the demotion of spam web
pages

 Sum the “relevance” scores of not-spam web pages
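
One way to read this metric is sketched below, assuming a binary gain (1 for a not-spam page, 0 for a spam page) and the usual logarithmic position discount. The exact gain values used in the thesis are not given on the slide, so they are an assumption here, as is the function name.

```python
import math


def ndcg(ranking, spam):
    """ranking: list of page ids, best ranked first; spam: set of spam page ids."""
    gains = [0.0 if page in spam else 1.0 for page in ranking]
    dcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(gains))
    ideal = sorted(gains, reverse=True)
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


print(ndcg(["a", "b", "s"], {"s"}))  # spam demoted to the end
print(ndcg(["s", "a", "b"], {"s"}))  # spam ranked first
```

The score reaches its maximum when every spam page sits below every not-spam page, and drops as spam pages climb towards the top of the ranking.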

PolaritySpam
39

 Evaluation:
 PR-Buckets evaluation
[Figure: number of spam web pages (log scale, 1 to 1000) in each of buckets 1 to 10, for TrustRank, S-NS, CS-NS and C-GS]

PolaritySpam
40

 Evaluation:
 nDCG evaluation

Method      nDCG
TrustRank   0.7381
S-NS        0.4230
CS-NS       0.8621
C-GS        0.8648

PolaritySpam
41

 Evaluation:
 Content-based heuristics
[Figure: number of spam web pages (log scale, 1 to 1000) in each of buckets 1 to 10, for AverageLength, Compressibility, AllMetrics, TrustRank and PolaritySpam]

Trust & Reputation in Social Networks
43

 Trust and reputation are key concepts in
social networks
 Similar to the relevance of web pages in the
WWW

 Reputation: assessment of the
trustworthiness of a user in a social
network, according to his behavior and the
opinions of the other users.

Trust & Reputation in Social Networks
44

 Example: On-line marketplaces

 Trustworthiness is as decisive as the price
 Higher reputation implies more sales

 Positive and negative opinions

Trust & Reputation in Social Networks
45

 Main goal: gain high reputation
 Obtain positive feedback from the customers
 Sell some bargains
 Special offers

 Give negative opinions for sellers that can be competitors.
 Obtain false positive opinions from other accounts (not necessarily other users).

Dishonest behaviors!

Trust & Reputation in Social Networks
47

 TRSs in the real world
 Moderators
 Special set of users with specific responsibilities

 Example: Slashdot.org
 A hierarchy of moderators
 A special user, No_More_Trolls, maintains a list of known
trolls

 Drawbacks:
 Scalability
 Subjectivity

Trust & Reputation in Social Networks
48

 TRSs in the real world
 Unsupervised TRS’s
 Users rate the contents of the system (and also other
users)
 Scalability problem: rely on the users
 Subjectivity problem: decentralized

 Examples: Digg.com, eBay.com

 Drawbacks:
 Unsupervised!

Trust & Reputation in Social Networks
49

 Transitivity of Trust and Distrust [Guha et al., 2004]
 Multiplicative distrust
 The enemy of my enemy is my friend

 Additive distrust
 Don’t trust someone not trusted by someone you don’t trust

 Neutral distrust
 Don’t take into account your enemies’ opinions

Trust & Reputation in Social Networks
50

 Threats to TRSs
Orchestrated attacks

Camouflage behind good behavior

Malicious Spies

Camouflage behind judgments

Trust & Reputation in Social Networks
51

 Orchestrated attacks: obtaining positive opinions
from other accounts (not necessarily other users)

[Figure: example graph of users, nodes 0 to 9]

Trust & Reputation in Social Networks
52

 Camouflage behind good behavior: feigning good
behavior in order to obtain positive feedback from
others.

[Figure: example graph of users, nodes 0 to 9]

Trust & Reputation in Social Networks
53

 Malicious spies: using an “honest” account to
provide positive opinions to malicious users.

[Figure: example graph of users, nodes 0 to 9]

Trust & Reputation in Social Networks
54

 Camouflage behind judgments: giving negative
feedback to users who can be competitors.

[Figure: example graph of users, nodes 0 to 9]

PolarityTrust
56

 Intuition
 Compute a ranking of the users in a social
network according to their trustworthiness

 Take into account both positive and negative
feedback

 Graph-based ranking algorithm to obtain two
scores for each node:
 $PT^+(v_i)$: positive reputation of user i
 $PT^-(v_i)$: negative reputation of user i

PolarityTrust
57

 Intuition
 Propagation algorithm for the opinions of the users
 Given a set of trustworthy users
 Their PT+ and PT- scores are propagated to their
neighbors
 … and so on

[Figure: example graph of users, nodes 0 to 9]

PolarityTrust
58

 Algorithm
 Propagation schema of the opinions of the users
 Different behavior depending on the type of relation
between the users: positive or negative

[Figure: propagation example with nodes a to f. PT⁺(a) raises PT⁺(b) and PT⁻(c); PT⁻(d) raises PT⁻(e) and PT⁺(f)]

PolarityTrust
59

 Algorithm
 The scores of the nodes influence the scores of
their neighbors

$PT^+(v_i)$

PolarityTrust
60

 Algorithm
 The scores of the nodes influence the scores of
their neighbors

$PT^+(v_i) = (1 - d)\, e_i^+ + d\, (\ldots)$

Set of sources: $e_i^+$

PolarityTrust
61

 Algorithm
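 The scores of the nodes influence the scores of
their neighbors

$PT^+(v_i) = (1 - d)\, e_i^+ + d \left( \sum_{j \in In^+(v_i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^+(v_j) + \ldots \right)$

Direct relation with the PT+ of positively voting
users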

PolarityTrust
62

 Algorithm
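 The scores of the nodes influence the scores of
their neighbors

$PT^+(v_i) = (1 - d)\, e_i^+ + d \left( \sum_{j \in In^+(v_i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^+(v_j) + \sum_{j \in In^-(v_i)} \frac{-p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^-(v_j) \right)$

Inverse relation with the PT- of negatively voting
users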

PolarityTrust
63

 Algorithm
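 The scores of the nodes influence the scores of
their neighbors

$PT^+(v_i) = (1 - d)\, e_i^+ + d \left( \sum_{j \in In^+(v_i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^+(v_j) + \sum_{j \in In^-(v_i)} \frac{-p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^-(v_j) \right)$

$PT^-(v_i) = (1 - d)\, e_i^- + d \left( \sum_{j \in In^+(v_i)} \frac{p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^-(v_j) + \sum_{j \in In^-(v_i)} \frac{-p_{ij}}{\sum_{k \in Out(v_j)} |p_{jk}|} PT^+(v_j) \right)$

($In^+(v_i)$, $In^-(v_i)$: users voting positively / negatively for $v_i$)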
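
The propagation scheme of this section can be sketched in a few lines. This is an illustrative reading, not the thesis code: it assumes that a positive vote passes the voter's PT+ to the target's PT+ and the voter's PT- to the target's PT-, that a negative vote crosses the two scores, that each vote is normalized by the voter's total absolute out-weight, and that the damping factor is 0.85. The function name, parameter names and toy graph are hypothetical.

```python
def polarity_trust(edges, trust_sources, distrust_sources=(), d=0.85, iterations=50):
    """edges: dict {(voter, target): p}, p > 0 positive vote, p < 0 negative vote.
    Returns two dicts (pt_pos, pt_neg) with the scores of every node."""
    nodes = sorted({n for edge in edges for n in edge})
    out_weight = {v: 0.0 for v in nodes}
    for (j, _), p in edges.items():
        out_weight[j] += abs(p)
    # A-priori scores: nonzero only for the source sets.
    e_pos = {v: (1.0 / len(trust_sources) if v in trust_sources else 0.0) for v in nodes}
    e_neg = {v: (1.0 / len(distrust_sources) if v in distrust_sources else 0.0) for v in nodes}
    pt_pos, pt_neg = dict(e_pos), dict(e_neg)
    for _ in range(iterations):
        new_pos = {v: (1 - d) * e_pos[v] for v in nodes}
        new_neg = {v: (1 - d) * e_neg[v] for v in nodes}
        for (j, i), p in edges.items():
            if p == 0:
                continue  # a zero-weight vote carries no information
            w = d * abs(p) / out_weight[j]
            if p > 0:   # positive vote: scores propagate directly
                new_pos[i] += w * pt_pos[j]
                new_neg[i] += w * pt_neg[j]
            else:       # negative vote: scores propagate crossed
                new_pos[i] += w * pt_neg[j]
                new_neg[i] += w * pt_pos[j]
        pt_pos, pt_neg = new_pos, new_neg
    return pt_pos, pt_neg


# Toy example: trusted user "s" likes "a" and dislikes "b".
votes = {("s", "a"): 1, ("s", "b"): -1, ("a", "s"): 1}
pos, neg = polarity_trust(votes, trust_sources={"s"})
print(pos["a"] > neg["a"], neg["b"] > pos["b"])
```

In the toy example the user liked by a trusted source ends up with a higher positive score, while the user disliked by that source ends up with a higher negative score.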

PolarityTrust
64

 Non-Negative Propagation
 Problems caused by negative opinions from
malicious users

 Solution:
dynamically avoid the propagation of
these opinions from malicious users
[Figure: a user a with negative score PR⁻(a) and its neighbors b and c; PR⁻(b) and PR⁺(c) increase]

PolarityTrust
65

 Action-Reaction Propagation
 Problems caused by dishonest voting attacks
 Positive votes to malicious users
 Orchestrated attacks, malicious spies…
 Negative votes to good users
 Camouflage behind bad judgments

 React against bad actions: dishonest voting
 Penalize users who perform these actions
 Proportional to the trustworthiness of the nodes being
affected

PolarityTrust
66

 Action-Reaction Propagation
 Computation:
 Relation between the number of dishonest votes and
the total number of votes

 Applied after each iteration of the ranking algorithm
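
The ratio mentioned above could be computed along the following lines. This is only a sketch under explicit assumptions: a vote is treated as dishonest when its sign disagrees with the target's current scores (a positive vote for a user whose PT- exceeds PT+, or a negative vote for a user whose PT+ is at least PT-). That rule, the function name and the sample data are hypothetical, and the sketch does not weight the penalty by the trustworthiness of the affected nodes.

```python
def dishonest_vote_ratio(votes, pt_pos, pt_neg):
    """votes: dict {(voter, target): sign}, sign > 0 positive, sign < 0 negative.
    Returns, for each voter, dishonest votes divided by total votes."""
    total, dishonest = {}, {}
    for (voter, target), sign in votes.items():
        total[voter] = total.get(voter, 0) + 1
        target_is_trusted = pt_pos.get(target, 0.0) >= pt_neg.get(target, 0.0)
        if (sign > 0) != target_is_trusted:
            dishonest[voter] = dishonest.get(voter, 0) + 1
    return {v: dishonest.get(v, 0) / total[v] for v in total}


# Hypothetical scores after one iteration of the ranking algorithm.
pt_pos = {"good": 0.9, "troll": 0.1}
pt_neg = {"good": 0.1, "troll": 0.8}
votes = {
    ("x", "good"): -1, ("x", "troll"): 1,   # x attacks good users, supports trolls
    ("y", "good"): 1, ("y", "troll"): -1,   # y votes consistently with the scores
}
ratios = dishonest_vote_ratio(votes, pt_pos, pt_neg)
print(ratios)
```

A ratio like this could then be used to scale down the scores of the voter after each iteration; how exactly the penalty is applied is not shown on the slide.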

[Figure: example graph with nodes a, b, c and d]

PolarityTrust
67

 Complete Formulation

PolarityTrust
68

 Evaluation
 Datasets

 Baselines

 Results

PolarityTrust
69

 Evaluation
 Datasets
 Barabasi & Albert
 Preferential attachment property

 Randomly generated attacks

 Metrics of the dataset
 $10^4$ nodes per graph
 $10^3$ malicious users
 100 malicious spies

PolarityTrust
70

 Evaluation
 Datasets
 Slashdot Zoo
 Graph of users in Slashdot.org
 Friend and Foe relationships
 Gold Standard = list of Foes of the special user
No_More_Trolls

 Metrics of the dataset
 71,500 users in total
 24% negative edges
 96 known trolls
 Source set: CmdrTaco and his friends → 6 users in total

PolarityTrust
71

 Evaluation
 Baselines
 EigenTrust [Kamvar et al. 2003]
 It does not take into account negative opinions

 Fans Minus Freaks
 (Number of friends – Number of foes)

 Signed Spectral Ranking [Kunegis et al. 2009]

 Negative Ranking [Kunegis et al. 2009]
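
The simplest of these baselines, Fans Minus Freaks, needs no propagation at all. The sketch below assumes that the counts are taken over incoming edges (users who marked the target as friend or foe); the function name and sample edges are hypothetical.

```python
def fans_minus_freaks(edges):
    """edges: iterable of (source, target, sign); sign > 0 is friend, sign < 0 is foe.
    Returns, for each user, incoming friend marks minus incoming foe marks."""
    score = {}
    for source, target, sign in edges:
        score.setdefault(source, 0)  # users with no incoming edges score 0
        score[target] = score.get(target, 0) + (1 if sign > 0 else -1)
    return score


edges = [("a", "t", -1), ("b", "t", -1), ("c", "t", 1), ("a", "g", 1)]
fmf = fans_minus_freaks(edges)
print(fmf)
```

Because every vote counts the same regardless of who casts it, a group of colluding accounts can raise this score easily, which is the weakness the propagation-based methods try to address.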

PolarityTrust
72

 Evaluation
 Results: Randomly generated datasets
 nDCG

Threats   ET      FmF     SR      NR      PTNN    PTAR    PT
A         0.833   0.843   0.599   0.749   0.876   0.906   0.987
AB        0.833   0.844   0.811   0.920   0.876   0.906   0.987
ABC       0.842   0.719   0.816   0.920   0.877   0.903   0.984
ABCD      0.823   0.723   0.818   0.937   0.879   0.903   0.984
ABCDE     0.753   0.777   0.877   0.933   0.966   0.862   0.982

Threats:
A: No strategies
B: Orchestrated attack
C: Camouflage behind good behaviors
D: Malicious spies
E: Camouflage behind judgments

Methods:
ET: EigenTrust
FmF: Fans Minus Freaks
SR: Spectral Ranking
NR: Negative Ranking
PTNN: Non-Negative Propagation
PTAR: Action-Reaction Propagation
PT: PolarityTrust

PolarityTrust
73

 Evaluation
 Results: Slashdot Zoo dataset

        ET      FmF     SR      NR      PTNN    PTAR    PT
nDCG    0.310   0.460   0.479   0.477   0.593   0.570   0.588

ET: EigenTrust
FmF: Fans Minus Freaks
SR: Spectral Ranking
NR: Negative Ranking
PTNN: Non-Negative Propagation
PTAR: Action-Reaction Propagation
PT: PolarityTrust

PolarityTrust
74

 Evaluation
 Results: Trolling Slashdot
 nDCG

Threats   ET      FmF     SR      NR      PTNN    PTAR    PT
A         0.310   0.460   0.479   0.477   0.593   0.570   0.588
AB        0.308   0.460   0.478   0.477   0.593   0.570   0.588
ABC       0.311   0.460   0.474   0.484   0.593   0.570   0.588
ABCD      0.370   0.476   0.501   0.501   0.580   0.570   0.586
ABCDE     0.370   0.475   0.501   0.496   0.580   0.574   0.588

Threats:
A: No strategies
B: Orchestrated attack
C: Camouflage behind good behaviors
D: Malicious spies
E: Camouflage behind judgments

Methods:
ET: EigenTrust
FmF: Fans Minus Freaks
SR: Spectral Ranking
NR: Negative Ranking
PTNN: Non-Negative Propagation
PTAR: Action-Reaction Propagation
PT: PolarityTrust

PolarityTrust
75

 Evaluation
 Include a set of sources of distrust

 In Slashdot Zoo Dataset:
 Sources of trust: CmdrTaco and friends

 Sources of distrust: 5 random foes of No_More_Trolls

 Many possible methods to choose the sources of
distrust

PolarityTrust
76

 Evaluation
 Results: Sources of trust and distrust
 nDCG

          Sources of Trust          Sources of Trust & Distrust
Threats   PTNN    PTAR    PT        PTNN    PTAR    PT
A         0.593   0.570   0.588     0.846   0.790   0.846
AB        0.593   0.570   0.588     0.846   0.790   0.846
ABC       0.593   0.570   0.588     0.846   0.790   0.846
ABCD      0.580   0.570   0.586     0.775   0.739   0.782
ABCDE     0.580   0.574   0.588     0.774   0.741   0.781

Threats:
A: No strategies
B: Orchestrated attack
C: Camouflage behind good behaviors
D: Malicious spies
E: Camouflage behind judgments

Methods:
PTNN: Non-Negative Propagation
PTAR: Action-Reaction Propagation
PT: PolarityTrust

Conclusions
78

 Final Remarks

 Development of two systems for the detection of
dishonest behaviors in on-line networks
 Web Spam Detection: PolaritySpam
 Trust and Reputation: PolarityTrust

 Propagation of some a-priori information
 Web Spam: Textual content of the web pages
 Trust and Reputation: Trust and distrust sources sets

Conclusions
79

 Final Remarks

 Web Spam Detection
 Unlike existing approaches, includes content-based
knowledge in a link-based technique

 Unsupervised methods for the selection of sources

 Propagate information of the sources through the
network

 Two simple metrics improve state-of-the-art methods

Conclusions
80

 Final Remarks
 Trust and Reputation in social networks
 Negative links improve the discriminative ability of
TRSs
 Propagation strategies to deal with different attacks
against a TRS
 Non-Negative propagation
 Action-Reaction propagation

 Interrelated scores modeling the transitivity of trust
and distrust
 Flexible to be adapted to different situations and

Conclusions
81

 Future Work

 PolaritySpam
 Applicability of more content-based metrics

 Additional methods for the selection of sources
 Propagation ability of each source

 Infer negative relations between web pages
 According to their textual content
 Apply similar propagation schemas as in PolarityTrust

Conclusions
82

 Future Work
 PolarityTrust
 Study other possible attacks
 Playbook sequences (omniscience of the attackers)
 Analyze the particular cases of the different social networks

 Selection of sources of trust and distrust
 Link-based methods

 Study other contexts with positive and negative
relations:
 Trending topics
 Authorities in the blogosphere

Conclusions
83

 Future Work
 Both techniques
 Study of the parallelization of both algorithms
 Many works on the parallelization of PageRank
 Saving time and memory

 Detection of Spam on the social networks
 Spam messages and spam user accounts

 Recommender Systems
 NLP and Opinion Mining techniques in a link-based
system
 Use the positive and negative information

Curriculum Vitae
84

 Academic and Research milestones
 2006: Degree in Computer Science

 2006: Funded Student in the Itálica research
group

 2008: Master of Advanced Studies:
 “STR: A graph-based tagger generator”

 2010: Research stay at the University of Glasgow
 IR Group (Dr. Iadh Ounis and Dr. Craig Macdonald)

Curriculum Vitae
85

 26 contributions to conferences and journals
 5 JCR
 10 International Conferences

 2 CORE B

 4 CORE C

 4 ISI Proceedings

 3 Lecture Notes in Computer Science

 3 CiteSeer Venue Impact Ratings

 Research projects

Curriculum Vitae
86

 Contributions related to the thesis
[Diagram: contributions by topic (TextRank for Tagging, System Combination Methods, Improving a Tagger Generator in IE, STR, PolarityRank, Web Spam Detection, PolaritySpam, PolarityTrust) and by venue type (National Conf., International Conf., JCR)]

Curriculum Vitae
87

[Diagram highlight: TextRank for Tagging, System Combination Methods, Improving a Tagger Generator in IE; legend: National Conf., International Conf., JCR]

TextRank como motor de aprendizaje en tareas de etiquetado, SEPLN 2006

Bootstrapping Applied to a Corpus Generation Task, EUROCAST 2007

Improving the Performance of a Tagger Generator in an Information Extraction
Application, Journal of Universal Computer Science (2007)

Curriculum Vitae
88

[Diagram highlight: STR; legend: National Conf., International Conf., JCR]

STR: A Graph-based Tagging Technique, International Journal on
Artificial Intelligence Tools (2011)

Curriculum Vitae
89

[Diagram highlight: PolarityRank, Web Spam Detection; legend: National Conf., International Conf., JCR]

A Knowledge-Rich Approach to Featured-based Opinion Extraction from
Product Reviews, SMUC 2010 (CIKM 2010)

Combining Textual Content and Hyperlinks in Web Spam Detection, NLDB 2010

Curriculum Vitae
90

[Diagram highlight: PolarityTrust, PolaritySpam; legend: National Conf., International Conf., JCR]

PolarityTrust: Measuring Trust and Reputation in Social Networks, ITA 2011

PolaritySpam: Propagating Content-based Information Through a Web Graph to
Detect Web Spam, International Journal of Innovative Computing, Information and
Control (2012)
