2. Personal information
2007 – 2011: B.S. NTU CSIE
• Prof. Juin-Ming Chen (Math): Lattice Reduction
• Prof. Der-Tsai Lee (CSIE), Chen-Mou Cheng (EE): Secure Index
2011 – 2012: Military Police second lieutenant
2013 – 2019: Ph.D. study, UC Davis
• Prof. S. Felix Wu
• Dissertation: Attackers’ Intention and Influence
Analysis in Social Media
2019 – present: Cloud Innovation School, Tunghai University
10/18/2021 2
3. Social Media
Exerting significant impact on mass communication
              Traditional Media    Social Media
Data size     Less                 More
User type     Reader               Editor/Reporter
Time-based    Delayed              Real time
11. Suitable Targets Problem
Given any post thread p on a social media platform, predict whether p
contains at least one malicious comment via a classifier c:
c(p) ∈ {target, nontarget}
13. Definition
Time Series (TS)
• TScreated(post): the time an original article is posted
• TSj: a time period j following the time the original article is posted
• TSfinal: the end of our observation
Accumulated Number of participants (AccNcomment)
• The number of comments on the post between TS(i-1) and TSi
Discussion Atmosphere Vector (DAV)
14. Example
TScreated(Climate) = 2014-12-19 03:06:42
Suppose j = 5 minutes, final = 120 minutes
DAV(Climate) = [# of comments 03:06:42 ~ 03:11:42 (1st),
# of comments 03:11:42 ~ 03:16:42 (2nd),
…,
# of comments 05:01:42 ~ 05:06:42 (24th)]
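The windowing in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `discussion_atmosphere_vector`, the half-open window boundaries, and the comment timestamps are hypothetical and do not come from the slides.

```python
from datetime import datetime, timedelta


def discussion_atmosphere_vector(created, comment_times, window_min=5, final_min=120):
    """Count comments in consecutive windows after the post creation time.

    Assumption: each window is half-open, [start, start + window_min).
    """
    n_windows = final_min // window_min
    window = timedelta(minutes=window_min)
    dav = [0] * n_windows
    for t in comment_times:
        offset = t - created
        # Ignore comments before creation or after the final observation time.
        if timedelta(0) <= offset < window * n_windows:
            dav[offset // window] += 1
    return dav


created = datetime(2014, 12, 19, 3, 6, 42)  # TScreated(Climate) from the slide
# Hypothetical comment timestamps, for illustration only.
comments = [
    created + timedelta(minutes=1),
    created + timedelta(minutes=4, seconds=59),
    created + timedelta(minutes=5),
    created + timedelta(minutes=119, seconds=59),
    created + timedelta(minutes=120),  # falls outside the observation period
]
dav = discussion_atmosphere_vector(created, comments)
```

With j = 5 and final = 120 the vector has 24 entries, matching the 1st to 24th windows listed above.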
16. Feature Engineering
# of comments, # of likes, # of shares
Spanning time (last comment time – first comment time)
Temporal features with a delta time window and a final observation time
Context-free: no Natural Language Processing is needed
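A minimal sketch of the count and spanning-time features, assuming each thread is available as a list of comment timestamps plus like and share counts. The function name `thread_features`, the dictionary keys, and the sample values are hypothetical.

```python
from datetime import datetime


def thread_features(comment_times, n_likes, n_shares):
    """Context-free features of one post thread; no comment text is inspected."""
    if comment_times:
        span = (max(comment_times) - min(comment_times)).total_seconds()
    else:
        span = 0.0
    return {
        "n_comments": len(comment_times),
        "n_likes": n_likes,
        "n_shares": n_shares,
        "spanning_seconds": span,  # last comment time minus first comment time
    }


# Hypothetical thread, for illustration only.
times = [
    datetime(2014, 12, 19, 3, 7, 0),
    datetime(2014, 12, 19, 3, 20, 30),
    datetime(2014, 12, 19, 4, 7, 0),
]
features = thread_features(times, n_likes=12, n_shares=3)
```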
[Figure: time elapsed until the 1st comment, 1st like, and 1st share]
25. Conclusion
Suitable targets are predicted successfully with temporal features
• Attackers: follow or not?
• Defenders: where to deploy resources
Temporal analysis with different variables
• Influence Ratio: will it increase or decrease in the next time window?
• 24-hour pattern: links online and offline behavior
26. Outline
Suitable Target, Lifecycle Analysis
Multiple Accounts Detection
Geolocation Identification
Personal words
27. Semi-Supervised Learning on Graphs
Motivation of detecting multiple accounts on FB
[Figure: Crawler 1, Crawler 2, and Crawler 3 each call the Facebook API]
When calling the Facebook API, the API gives each crawler a different scope ID. Thus the same user appears under different scope IDs in the dataset.
31. Multiple Accounts Detection using
Semi-Supervised Learning on Graphs
When crawling data from FB with multiple crawlers, the API gives each crawler a scope ID instead of the user's primary ID.
For example, a user's primary ID is mohamed.aimane.98. He has multiple scope IDs: 1815396745342476, 1815402648675219, 1815411572007660, 1815468805335270, 1815515615330589, 1815482155333935, 1815488781999939, 1816157185266432. This implies mohamed's data was crawled by 8 different crawlers.
As a result, in our dataset we know these user names are all mohamed aimane, but there are many IDs with the same user name.
Problem: Given 2 scope IDs with the same user name, are they the same user (same primary ID) or not?
Motivation of detecting multiple accounts on FB
33. Main Algorithms
Unsupervised learning using Katz Similarity
P_xy(i) = (x, x1, x2, …, y): a path from x to y of length i
u1 and u2 are similar if their activity paths are similar
Katz similarity can be computed by:
$K = \sum_{i=1}^{\infty} \beta^i M^i = (I - \beta M)^{-1} - I$
where M is the adjacency matrix of graph G, $\beta$ is a scalar smaller than $1/\|M\|_2$ to ensure convergence, and I is the identity matrix.
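As a rough illustration, Katz similarity can be approximated by truncating the standard power series K = βM + β²M² + …, which equals (I − βM)⁻¹ − I when β is small enough. The sketch below assumes that standard form. The helper names, the toy path graph, and β = 0.2 are hypothetical choices, not taken from the slides.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def katz_similarity(adj, beta, max_len=50):
    """Approximate K = sum over i = 1..max_len of beta^i * M^i."""
    n = len(adj)
    step = [[beta * adj[i][j] for j in range(n)] for i in range(n)]  # beta * M
    term = [row[:] for row in step]   # current term beta^i * M^i, starting at i = 1
    total = [row[:] for row in step]  # running sum of the series
    for _ in range(max_len - 1):
        term = mat_mul(term, step)
        for i in range(n):
            for j in range(n):
                total[i][j] += term[i][j]
    return total


# Hypothetical undirected path graph 0 - 1 - 2 - 3.
M = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
K = katz_similarity(M, beta=0.2)
```

In this toy graph, nodes that are closer together receive a higher score, because short paths are weighted more heavily than long ones.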
35. Example of Algorithm 1
Katz matrix is:
1    0.9  0.2  0.6
0.9  1    0.3  0.5
0.2  0.3  1    0.8
0.5  0.5  0.8  1
The threshold we use is 0.8.
Then the 1st node and the 2nd node belong to the same user, and the 3rd and 4th nodes belong to the same user; the others do not.
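The thresholding step can be sketched as follows. Two assumptions are made here: the example values are read as a 4×4 matrix (a guess about the slide layout), and the comparison is inclusive, since the 3rd and 4th nodes score exactly 0.8. The function name `same_user_pairs` is hypothetical.

```python
from itertools import combinations


def same_user_pairs(similarity, threshold=0.8):
    """Return 1-based node pairs whose similarity reaches the threshold.

    Only the upper triangle is read, so each pair is reported once.
    """
    n = len(similarity)
    return [(i + 1, j + 1)
            for i, j in combinations(range(n), 2)
            if similarity[i][j] >= threshold]


# Example values from the slide, arranged as an assumed 4x4 layout.
katz = [
    [1.0, 0.9, 0.2, 0.6],
    [0.9, 1.0, 0.3, 0.5],
    [0.2, 0.3, 1.0, 0.8],
    [0.5, 0.5, 0.8, 1.0],
]
pairs = same_user_pairs(katz)
```

Under these assumptions `pairs` contains nodes (1, 2) and (3, 4), which agrees with the conclusion stated on the slide.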
37. Classical ML Tasks in Networks
• Node Classification
• Predict type of a node
• Link Prediction
• Predict friends
• Community Detection
• Network Similarity
• Measure how similar two networks are
40. Node2vec(3/4)
Parameters
• Two parameters:
• Return parameter p:
• Controls returning to the previous node
• In-out parameter q:
• Controls moving outwards (DFS) vs. inwards (BFS)
• The ratio of BFS vs. DFS
• Biased 2nd-order random walks explore network neighborhoods.
41. Node2vec(4/4)
• Simulate r random walks of length l starting from each node u
• Optimize the node2vec objective using Stochastic Gradient Descent
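The walk simulation can be sketched as below, using the usual node2vec weighting: 1/p for returning to the previous node, 1 for a neighbor that is also adjacent to the previous node, and 1/q otherwise. The function names, the toy graph, and the parameter values are hypothetical, and the SGD optimization step is left out.

```python
import random


def node2vec_walk(graph, start, length, p, q, rng):
    """One biased 2nd-order random walk; graph maps node -> set of neighbors."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break
        if len(walk) == 1:
            # The first step has no previous node, so it is uniform.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)      # stay close to prev (BFS-like)
            else:
                weights.append(1.0 / q)  # move outward (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def simulate_walks(graph, r, l, p, q, seed=0):
    """Simulate r walks of l nodes each, starting from every node u."""
    rng = random.Random(seed)
    return [node2vec_walk(graph, u, l, p, q, rng)
            for _ in range(r) for u in sorted(graph)]


# Hypothetical toy graph, for illustration only.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
walks = simulate_walks(graph, r=2, l=5, p=1.0, q=0.5)
```

A small q makes outward moves more likely, so walks behave more like DFS; a large q keeps them near the start, closer to BFS.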
42. Example of Algorithm 3
Embedding for node 1: (0.1, 0.3, 0.2, 0.4). Embedding for node 2: (0.2, 0.3, 0.2, 0.4).
We sample some ground truth: node 1 and node 2 belong to the same user, etc.
L looks like: ((1,2), 1), ((1,3), 0), ((2,3), 0), ((2,4), -1), ((3,4), -1), …
X is from the embeddings: for example, ((1,2), (0.1, 0, 0, 0)), …
Then feed X and L into the label spreading model. We get that the 1st node and the 2nd node belong to the same user, and the 3rd and 4th nodes belong to the same user; the others do not.
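The pair features in this example look like the element-wise absolute difference of the two node embeddings, since |(0.1, 0.3, 0.2, 0.4) − (0.2, 0.3, 0.2, 0.4)| = (0.1, 0, 0, 0). The sketch below builds X and L under that assumption. The embeddings for nodes 3 and 4 and the function name `pair_features` are hypothetical. The label spreading step itself is not shown; one option would be scikit-learn's `LabelSpreading`, which also marks unlabeled samples with -1.

```python
def pair_features(embeddings, pairs):
    """Map each node pair to the element-wise absolute embedding difference."""
    return {
        (u, v): tuple(round(abs(a - b), 6)
                      for a, b in zip(embeddings[u], embeddings[v]))
        for u, v in pairs
    }


embeddings = {
    1: (0.1, 0.3, 0.2, 0.4),  # from the slide
    2: (0.2, 0.3, 0.2, 0.4),  # from the slide
    3: (0.9, 0.1, 0.5, 0.2),  # hypothetical
    4: (0.8, 0.1, 0.5, 0.2),  # hypothetical
}
# L from the slide: 1 = same user, 0 = different users, -1 = unlabeled.
L = {(1, 2): 1, (1, 3): 0, (2, 3): 0, (2, 4): -1, (3, 4): -1}
X = pair_features(embeddings, L)
```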
44. Experiments and Evaluation
Comparison among the Three Methods
Two simple datasets: dataset 1 has 188 nodes and 262 activities (links);
dataset 2 has 4188 accounts and 6715 activities (links).
45. Outline
Suitable Target, Lifecycle Analysis
Multiple Account Detection
Geolocation Identification
Personal words
46. Page Information and Page-like Graph
[Figure: page-like graph with "like" edges among Sports Illustrated, Golden State Warriors, Oakland Museum, and Giving Tuesday]
Field          Example
Page ID        47657117525
Name           Golden State Warriors
Category       Sports Team
Country        United States
Fan Count      11,019,236
Description    The Official Facebook page of the Golden State Warriors
47. • Facebook public pages are public profiles used by local businesses, companies, organizations or public figures
• Likes: promoting other pages to community participants
48. Data Collection
Facebook Graph API version 2.8 was used to collect our data [1]
• 38,831,367 pages (for this work)
• 2,430,873 US
• 12,685,090 other countries
• 23,715,404 unknown
[1] https://developers.facebook.com/docs/graph-api/reference/page
49. Majority Vote Algorithm
• Location is designated as state information in this scenario
• The location label is determined by the most votes
• Overall accuracy is only 59.4%
• This algorithm works well in the page nationality prediction task, with 90.25% accuracy
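A minimal sketch of majority voting, assuming the votes come from a page's neighbors in the page-like graph that already have a known state. The slide does not say which neighbors vote, so that choice, the function name `majority_vote`, and the toy data are hypothetical.

```python
from collections import Counter


def majority_vote(page, neighbors, known_state):
    """Label a page with the most common state among its labeled neighbors."""
    votes = Counter(known_state[n]
                    for n in neighbors.get(page, ())
                    if n in known_state)
    if not votes:
        return None  # no labeled neighbor, so no prediction
    # On a tie, most_common keeps the state that was counted first.
    return votes.most_common(1)[0][0]


# Hypothetical neighborhood, for illustration only.
neighbors = {"page_x": ["a", "b", "c", "d"]}
known_state = {"a": "CA", "b": "CA", "c": "NY"}
predicted = majority_vote("page_x", neighbors, known_state)
```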
50. Baseline Algorithm
Utilizes locality of states to find pages
belonging to their corresponding states
• Pick anchored pages with a local property as multiple seeds from which to start BFS
Target classifier: 51 classes
• 50 classes of US states and a class of "others (OT)"
State Distance Vector (SDV)
52. Anchor Page Selection (1/2)
Effectiveness of BFS-based algorithms
• It depends on anchored page selection
Anchored pages have to be local so that SDV reflects the true tendency of a page’s locality
Suitable examples (focusing on local communities)
• state universities, government, park or police organizations
Ill-suited examples (popular and thus having global impact)
• NBA, MLB, or NFL sports teams
53. Anchor Page Selection (2/2)
• We adopt all subsidiary pages of "OnlyInYourState.com" as a set of anchored pages
• It has a distinct page for each state
• Each subsidiary page mostly connects local communities
Page Name                      Page ID
Only In Alabama                783744898386760
Only In Alaska                 686107314826906
Only In Southern California    1840349052857006
Only In Northern California    856450181102963
Idaho Only                     435099846671531
Only In New York               386608421546055
Only In Virginia               1560515737540492
Only In West Virginia          1509709509286532
Only In Wisconsin              1390297064627420
Only In Wyoming                1724174364476381
55. Advanced Algorithm
Baseline algorithm’s drawback
• A local page can have a few connections to pages far away
• This kind of connection noise can greatly reduce prediction accuracy
State Neighborhood Probability (SNP)
Both SDV and SNP are taken as feature vectors for ML models
• Utilize locality and neighborhood context for better identification
56. Dataset
California accounts for 20% of all US pages, and about half of all
pages (49.49%) are located in the top 5 states
• California, New York, Florida, Illinois, and Texas
Top-down, Authoritative, vs. distributed, skim
SFW – “Editor/Reporter” and “reader”
Sometimes it’s hard to evaluate “spamming”
New
SFW – Likefarm? Is that ContentFarm?
Every principle has its mind, reason, everything has its causality
SFW – we need to have a better organized presentation for problems.
SFW – the defenders concern might be different – we need to consider the risk factor
Shelf Life, skim messages, can “catch” ones eyes only , enlarge the influence
https://www.facebook.com/barackobama/posts/10151673679836749
https://www.facebook.com/cnn/posts/313652498762911
SFW – ask the audience “which post has higher prob to be attacked”?
SFW – watch out for the transition into this slide.
SFW – do you want to provide one example for all or most of the slides?
SFW – I feel that you should give an example to explain.
SFW – Definition**s**
SFW – how to interpret 10 minutes? (what is the total time and attack time)?
Naive Bayes: DAV entries are not independent of each other
AdaBoost: not good with outliers; number of estimators = 50 and learning rate = 1.
Decision Tree: good for social network data
We set minimum samples split = 2 and minimum samples leaf = 1; with no depth limit, nodes are expanded until all leaves are pure.
1. IR is learnable?
2. No difference between Light and Critical malicious URLs, since their performance is quite similar
3. Increase recall result is high
SFW – explain “Exact time after last attack”
Why do you choose similarity
Fast
Read the slide
Our first thought is majority vote algorithm
where IHOP(Page,Si) denotes hop distance between page and seed Si, using inward edges as connection for BFS;
OHOP(Page, Si) denotes hop distance between page and seed Si, using outward edges as connection for BFS.
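The hop distances can be sketched with a plain BFS. The sketch makes several assumptions: OHOP follows like edges in their own direction starting from the seed, IHOP follows them in reverse, an unreachable page gets the placeholder value -1, and the vector interleaves IHOP and OHOP per seed. None of these details are given in the notes, and the function names and toy graph are hypothetical.

```python
from collections import deque


def hop_distances(edges, seed):
    """BFS hop distance from seed to each reachable page; edges maps page -> pages."""
    dist = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def reverse(edges):
    """Flip every edge so that BFS walks the graph against the like direction."""
    rev = {}
    for src, dsts in edges.items():
        for dst in dsts:
            rev.setdefault(dst, []).append(src)
    return rev


def state_distance_vector(page, likes, seeds, unreachable=-1):
    """Return [IHOP(page, S1), OHOP(page, S1), IHOP(page, S2), ...]."""
    inward = reverse(likes)
    vector = []
    for seed in seeds:
        ihop = hop_distances(inward, seed).get(page, unreachable)
        ohop = hop_distances(likes, seed).get(page, unreachable)
        vector += [ihop, ohop]
    return vector


# Hypothetical like graph: each key likes the pages in its list.
likes = {"seed_ca": ["museum"], "museum": ["cafe"], "fan": ["seed_ny"], "cafe": []}
sdv_cafe = state_distance_vector("cafe", likes, ["seed_ca", "seed_ny"])
sdv_fan = state_distance_vector("fan", likes, ["seed_ca", "seed_ny"])
```

For many pages it would be cheaper to run each BFS once per seed and reuse the distance maps; the per-page version is kept here for brevity.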
In particular, since California is much larger than other states in perspectives of population and economy,
“OnlyInYourState.com” splits California into Northern and Southern regions, as shown in Table.
Therefore, both "Only In Northern California" and "Only In Southern California" are used as anchored pages to calculate IHOP(Page, Si) and OHOP(Page, Si),
in addition to the other forty-nine anchored pages. Hence N_{anchored pages} is set to 51.
Furthermore, since "Only In Idaho" had already been registered, OnlyInYourState.com named its Idaho counterpart "Idaho Only" instead.
In general, more anchored pages involved would enlarge the BFS coverage of pages.
This probability is not high; however, the baseline BFS-based ML algorithm only cares about the hop distances to the anchored pages.
where INP(Page,Ri) denotes inward neighborhood location probability between this page and the adjacent pages belonging to the region Ri;
where IE(Page,Ri) is the number of inward edges between this page and the adjacent pages belonging to the region Ri;
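One plausible reading of these definitions is that INP(Page, Ri) is the share of a page's inward edges that come from region Ri, that is, IE(Page, Ri) divided by the sum of IE over all regions. The notes do not show the formula, so this normalization is an assumption, as are the function name and the toy data in the sketch below.

```python
from collections import Counter


def inward_neighborhood_probability(page, inward_neighbors, region_of, regions):
    """Return INP(page, Ri) for each region, assuming INP = IE / total inward edges."""
    # counts[r] plays the role of IE(page, r): inward edges from pages in region r.
    counts = Counter(region_of[n]
                     for n in inward_neighbors.get(page, ())
                     if n in region_of)
    total = sum(counts.values())
    return [counts[r] / total if total else 0.0 for r in regions]


# Hypothetical neighborhood, for illustration only.
inward_neighbors = {"cafe": ["a", "b", "c", "d"]}
region_of = {"a": "CA", "b": "CA", "c": "CA", "d": "NY"}
snp = inward_neighborhood_probability("cafe", inward_neighbors, region_of, ["CA", "NY", "OT"])
```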
We took the pages with declared location information of country and city as ground truth data.
Few pages are excluded because their city names exist in multiple states, which can result in ambiguous city-to-state mapping.
There are 29,849 cities in total in the US.
The training set used 80% of the data and the test set used the remaining 20%.
Since the number of classes is rather large, the Random Forest classifier is preferred over the Gradient Boosting classifier [23].
The default parameter sets were applied when using the implementations available in the scikit-learn package [54].
As shown in Table 4.2, the precision, recall, and f1 score of the Random Forest classifier are at least 20% better than those of the Naive Bayes classifier and the AdaBoost classifier.
Thus in the following, we only present results obtained with the Random Forest classifier.
The baseline BFS-based ML algorithm with the Random Forest classifier achieved 69% accuracy, about 10 percentage points better than the 59.4% accuracy of the majority vote algorithm.
With the addition of SNP, the advanced BFS-based ML algorithm reached 89% prediction accuracy, a 20 percentage point improvement over the baseline.
SFW – what have been done? Whether you can justify some of your work is fundamental and not just incremental and applied?
SFW – balance between contributions to CS versus Social Science