NDU Present

Security Challenges in
Online Social Media
Chun-Ming Tim Lai
賴俊鳴, Assistant Professor

Personal information
2007 – 2011: B.S. NTU CSIE
• Prof. Juin-Ming Chen (Math): Lattice Reduction
• Prof. Der-Tsai Lee (CSIE), Prof. Chen-Mou Cheng (EE): Secure Index
2011 – 2012: Military Police second lieutenant
2013 – 2019: Ph.D. study, UC Davis
• Prof. S. Felix Wu
• Dissertation: Attackers’ Intention and Influence
Analysis in Social Media
2019 – present: Cloud Innovation School, Tunghai University (東海大學)
10/18/2021 2

Social Media
Exerting significant impact on mass communication
             Traditional Media   Social Media
Data size    Less                More
User type    Reader              Editor/Reporter
Timeliness   Delayed             Real time

Reaction
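• Timeliness of replies
• Whether a reply addresses the key point; opening a case for follow-up
• Life cycle of a post
• About 1.5 hours on average, yet it affects people's lives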

Security Threat
Severe Threats
• Phishing
• Malware, drive-by download
Moderate to Minor Threats
• Advertisement
• Spamming (fund-raising, porn, canned messages, etc.)
New Types of Threats
• Rumors, media manipulation, sign-ups, vote stuffing, etc.
• Fake news
• Crowdturfing = Crowdsourcing + Astroturfing

Outline
• Suitable Target, Lifecycle Analysis
• Multiple Accounts Detection
• Geolocation Identification
• Personal words

Social Media: Climate Change
Facebook.com/63811549237/posts/10153038271604238
2014-12-19, 03:06 am GMT
Total: 609 comments

Suitable Targets Problem
Given any post thread p on a social media platform, predict whether p
contains at least one malicious comment, via a classifier
c(p) ∈ {target, nontarget}

Key idea: Life Cycle of Posts
[Figure: life cycle of a post; annotated span: 10 hrs]

Definition
• Time Series (TS)
  • TScreated(post): the time the original article is posted
  • TSj: the j-th time period following the time of the original post
  • TSfinal: the end of our observation
• Accumulated Number of Comments (AccNcomment)
  • The number of post comments between TS(i-1) and TSi
• Discussion Atmosphere Vector (DAV)
  • The vector of per-window comment counts from TScreated to TSfinal

Example
TScreated(Climate) = 2014-12-19 03:06:42
Suppose each time window is 5 minutes and TSfinal = 120 minutes, giving 24 windows
DAV(Climate) = [# of comments 03:06:42 ~ 03:11:42 (1st),
                # of comments 03:11:42 ~ 03:16:42 (2nd),
                …,
                # of comments 05:01:42 ~ 05:06:42 (24th)]
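The DAV example can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name, the half-open window convention (a comment exactly on a boundary falls into the later window), and the sample comment timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def discussion_atmosphere_vector(created, comment_times,
                                 window_minutes=5, final_minutes=120):
    """Count comments in each consecutive time window after the post is created."""
    n_windows = final_minutes // window_minutes
    window = timedelta(minutes=window_minutes)
    dav = [0] * n_windows
    for t in comment_times:
        offset = t - created
        if offset < timedelta(0):
            continue  # comment timestamp precedes the post; ignore it
        index = offset // window  # timedelta // timedelta yields an int
        if index < n_windows:     # comments after TSfinal are not observed
            dav[index] += 1
    return dav


created = datetime(2014, 12, 19, 3, 6, 42)  # TScreated(Climate)
# Hypothetical comment timestamps, for illustration only.
comments = [
    created + timedelta(minutes=1),
    created + timedelta(minutes=4, seconds=59),
    created + timedelta(minutes=7),
    created + timedelta(minutes=119),
    created + timedelta(minutes=121),  # after TSfinal, so it is dropped
]
dav = discussion_atmosphere_vector(created, comments)
```

With the 5-minute window and 120-minute horizon from the example, the vector has 24 entries and uses timestamps only, so no comment text is needed.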

Dataset
2011–2014: ten main media pages on Facebook
Total: 42,703,463

Feature Engineering
• # of comments, # of likes, # of shares
• Spanning time (last comment time – first comment time)
• Temporal features over a delta time window, up to a final observation time
• Context-free: no need for Natural Language Processing
[Figure: Time Elapsed timeline marking the 1st comment, 1st like, and 1st share]

Results
Near Real Time

Discussion: Do you understand Facebook well enough?
• Attackers' preference
• Selected by Facebook
• Audience reaction
• Bandwagon effect
• Rich get richer
• People love biased and controversial posts

Life Cycle and Influence Ratio
[Figure: CNN 2012, all post threads; annotations: >70%, mURL]

Using DAV to Predict Influence Ratio (IR) (1/2)

Using DAV to Predict Influence Ratio (IR) (2/2)

Account Activity Within a Week Around the Election Date
Active = Count(Activities) within 1 week >= threshold
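The Active rule can be sketched as a simple count over a one-week window. This is an illustrative sketch only: the function name, the (account, timestamp) record layout, the window start, and the sample records are hypothetical and do not come from the slides.

```python
from collections import Counter
from datetime import datetime, timedelta


def active_accounts(activities, week_start, threshold):
    """Return accounts with at least `threshold` activities in the 7 days from week_start."""
    week_end = week_start + timedelta(days=7)
    counts = Counter(
        account for account, ts in activities if week_start <= ts < week_end
    )
    return {account for account, n in counts.items() if n >= threshold}


# Hypothetical activity log: (account, timestamp) pairs.
week_start = datetime(2016, 11, 1)
activities = [
    ("acct_a", datetime(2016, 11, 1, 9, 0)),
    ("acct_a", datetime(2016, 11, 3, 12, 30)),
    ("acct_a", datetime(2016, 11, 7, 23, 59)),
    ("acct_b", datetime(2016, 11, 2, 8, 0)),
    ("acct_b", datetime(2016, 11, 9, 8, 0)),    # outside the week
    ("acct_c", datetime(2016, 10, 31, 22, 0)),  # outside the week
]
active = active_accounts(activities, week_start, threshold=2)
```

The threshold is left as a parameter because the slides do not state its value.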

[Figure: account activity for Clinton, 1st week and 2nd week]

[Figure: account activity for Trump, 1st week and 2nd week]
All accounts: Periodic
Attacker accounts: Random

Conclusion
Suitable targets are predicted successfully with temporal features
• Attackers: follow or not?
• Defenders: where to deploy resources
Temporal analysis with different variables
• Influence Ratio: increase or decrease in the next time window?
• 24-hour patterns that link online and offline behavior

Outline
• Multiple Accounts Detection
• Personal words

Semi-Supervised Learning on Graphs
Motivation for detecting multiple accounts on FB
[Figure: Crawler 1, Crawler 2, and Crawler 3 each call the Facebook API]
When calling the Facebook API, the API gives each crawler a different
scope ID. The same user therefore appears under different scope IDs in
the dataset.

Ting-Ting's Family
ID               Name    Profile URL
100003468896671  高婷婷  https://www.facebook.com/mayuko.sakamoto.503
100004123536871  賴婷婷  https://www.facebook.com/profile.php?id=100004123536871
100003251795795  陳婷婷  https://www.facebook.com/rika.etoh
100000681128139  高婷婷  https://www.facebook.com/vincenzo.muscari.5
100002630019886  陳婷婷  https://www.facebook.com/sven.erkens.98
813243492        高婷婷  https://www.facebook.com/profile.php?id=813243492

Number of friends Facebook allows: 5000
ID               Name    Friends
100003468896671  高婷婷  45xx
100004123536871  賴婷婷  45xx
100003251795795  陳婷婷  4xxx

Multiple Accounts Detection using
Semi-Supervised Learning on Graphs
Motivation for detecting multiple accounts on FB
When data is crawled from FB with multiple crawlers, the API gives each crawler a scope ID
for a user instead of the user's primary ID.
For example, a user's primary ID is mohamed.aimane.98. He has multiple scope IDs:
1815396745342476, 1815402648675219, 1815411572007660, 1815468805335270,
1815515615330589, 1815482155333935, 1815488781999939, and 1816157185266432.
This implies that mohamed's data was crawled by 8 different crawlers.
As a result, the user name in our dataset is always mohamed aimane, but many IDs
share that user name.
Problem: Given 2 scope IDs with the same user name, do they belong to the same user
(same primary ID) or not?

Graph Construction
U = {users}, V = {pages}; an edge (u, v) means user u had an activity on page v

Main Algorithms
Unsupervised learning using Katz Similarity
P_xy(i) = (x, x_1, x_2, …, y): a path from x to y of length i
u1 and u2 are similar if their activity paths are similar
Katz similarity can be computed by:
$$K = \sum_{i=1}^{\infty} \beta^{i} M^{i} = (I - \beta M)^{-1} - I$$
where M is the adjacency matrix of graph G, $\beta$ is a scalar smaller than $1/\lVert M \rVert_2$ to
ensure convergence, and I is the identity matrix.
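As a rough illustration, the Katz index is commonly written as the series sum of beta^i * M^i over path lengths i >= 1, which converges when beta is below 1/||M||_2. The sketch below truncates that series in plain Python. The helper names, the truncation length, the value of beta, and the small user-page graph are assumptions for the example, not values from the slides.

```python
def mat_mul(a, b):
    """Multiply two dense matrices given as lists of rows."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def katz_similarity(adj, beta, max_len=50):
    """Truncated Katz index: sum over i = 1..max_len of beta^i * M^i."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    term = [[float(i == j) for j in range(n)] for i in range(n)]  # beta^0 * M^0 = I
    for _ in range(max_len):
        term = [[beta * x for x in row] for row in mat_mul(term, adj)]
        total = [[s + x for s, x in zip(s_row, t_row)]
                 for s_row, t_row in zip(total, term)]
    return total


# Hypothetical bipartite activity graph.
# Node order: users u1, u2, u3 (0..2), then pages p1, p2, p3 (3..5).
edges = [(0, 3), (0, 4), (1, 3), (1, 4), (2, 4), (2, 5)]
adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    adj[u][v] = adj[v][u] = 1

K = katz_similarity(adj, beta=0.1)  # 0.1 is below 1/||M||_2 for this graph
```

Here u1 and u2 were active on exactly the same pages, so K[0][1] comes out larger than K[0][2], where u3 shares only one page with u1.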

Main Algorithms
Unsupervised learning using Katz Similarity

Example of Algorithm 1
The Katz matrix is
$$\begin{pmatrix} 1 & 0.9 & 0.2 & 0.3 \\ 0.9 & 1 & 0.5 & 0.5 \\ 0.2 & 0.6 & 1 & 0.8 \\ 0.3 & 0.5 & 0.8 & 1 \end{pmatrix}$$
The threshold we use is 0.8.
Then the 1st and 2nd nodes belong to the same user, the 3rd and 4th nodes belong to the
same user, and the others do not.
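The thresholding step of this example can be sketched as follows. The 4x4 layout of the matrix is an assumption about how the extracted numbers were arranged on the slide (two diagonal blocks and two off-diagonal blocks); the function name is hypothetical.

```python
def same_user_pairs(similarity, threshold):
    """Return 1-based node pairs (i, j), i < j, whose similarity reaches the threshold."""
    n = len(similarity)
    return [(i + 1, j + 1)
            for i in range(n)
            for j in range(i + 1, n)
            if similarity[i][j] >= threshold]


# Assumed reading of the slide's Katz matrix as a 4x4 block layout.
katz = [
    [1.0, 0.9, 0.2, 0.3],
    [0.9, 1.0, 0.5, 0.5],
    [0.2, 0.6, 1.0, 0.8],
    [0.3, 0.5, 0.8, 1.0],
]
pairs = same_user_pairs(katz, threshold=0.8)
```

Under that reading, `pairs` is `[(1, 2), (3, 4)]`, matching the slide's conclusion that nodes 1 and 2 are one user and nodes 3 and 4 are another.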

Main Algorithms
Semi-Supervised Method using Graph Embedding

Classical ML Tasks in Networks
• Node Classification
  • Predict the type of a node
• Link Prediction
  • Predict friends
• Community Detection
• Network Similarity
  • How similar two networks are

Node2vec (1/4)
Features
Many possible ways to derive them:
• PageRank score, degree, centrality, # of edges, etc.

Node2vec (2/4)
Mixture of BFS and DFS
• BFS: local view (u and S1)
• DFS: global view (u and S6)

Node2vec (3/4)
Parameters
• Biased 2nd-order random walks explore network neighborhoods
• Two parameters:
  • Return parameter p: controls returning to the previous node
  • In-out parameter q: controls moving outwards (DFS) vs. inwards (BFS),
    i.e. the ratio of BFS vs. DFS

Node2vec(4/4)
• Simulate r random walks of length l starting from each node u
• Optimize the node2vec objective using Stochastic Gradient Descent
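The walk simulation can be sketched in plain Python. This is a simplified illustration on an unweighted, undirected graph, not the reference node2vec implementation: the function names, the adjacency-dict representation, and the toy graph are assumptions, and the embedding optimization step is omitted. Given the previous node t and the current node v, a candidate next node x gets weight 1/p if x is t, 1 if x is also a neighbor of t, and 1/q otherwise.

```python
import random


def node2vec_walk(graph, start, length, p, q, rng):
    """One biased 2nd-order random walk containing up to `length` nodes."""
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        neighbors = sorted(graph[cur])
        if not neighbors:
            break  # dead end
        if len(walk) == 1:
            walk.append(rng.choice(neighbors))  # first step is uniform
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)  # return to the previous node
            elif nxt in graph[prev]:
                weights.append(1.0)      # stay close to prev (BFS-like)
            else:
                weights.append(1.0 / q)  # move outwards (DFS-like)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


def simulate_walks(graph, r, length, p, q, seed=0):
    """Simulate r walks of the given length starting from every node."""
    rng = random.Random(seed)
    return [node2vec_walk(graph, u, length, p, q, rng)
            for _ in range(r) for u in sorted(graph)]


# Hypothetical toy graph as an adjacency dict of sets.
graph = {"a": {"b", "c"}, "b": {"a", "c", "d"}, "c": {"a", "b"}, "d": {"b"}}
walks = simulate_walks(graph, r=2, length=5, p=1.0, q=2.0)
```

The resulting walks would then be fed to a skip-gram style model trained with stochastic gradient descent to obtain the node embeddings.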

Example of Algorithm 3
Embedding for node 1: (0.1, 0.3, 0.2, 0.4); embedding for node 2: (0.2, 0.3, 0.2, 0.4)
We sample some ground truth, e.g. that node 1 and node 2 belong to the same user, etc.
L looks like: ((1,2), 1), ((1,3), 0), ((2,3), 0), ((2,4), -1), ((3,4), -1), … (here -1 marks an unlabeled pair)
X is derived from the embeddings, e.g. the element-wise absolute difference: ((1,2), (0.1, 0, 0, 0)), …
Then we feed X and L into a label spreading model and obtain: the 1st and 2nd nodes
belong to the same user, the 3rd and 4th nodes belong to the same user, and the
others do not.
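The pair feature in this example, (0.1, 0, 0, 0) for nodes 1 and 2, matches the element-wise absolute difference of the two embeddings. The sketch below computes that feature along with three other binary operators commonly used with node2vec embeddings (average, Hadamard, squared difference). The function name and the operator keys are hypothetical, and which operator the authors used is an inference from the numbers on the slide.

```python
def pair_features(emb_a, emb_b, operator="l1"):
    """Combine two node embeddings into one feature vector for the node pair."""
    if operator == "average":
        return [(a + b) / 2 for a, b in zip(emb_a, emb_b)]
    if operator == "hadamard":
        return [a * b for a, b in zip(emb_a, emb_b)]
    if operator == "l1":
        return [abs(a - b) for a, b in zip(emb_a, emb_b)]
    if operator == "l2":
        return [(a - b) ** 2 for a, b in zip(emb_a, emb_b)]
    raise ValueError(f"unknown operator: {operator}")


node1 = [0.1, 0.3, 0.2, 0.4]
node2 = [0.2, 0.3, 0.2, 0.4]
x_12 = pair_features(node1, node2, operator="l1")
```

Each labeled or unlabeled pair in L gets one such row in X, so the label spreading model operates on pairs rather than on single nodes.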

Main Algorithms
Different measurement of Embedding Vectors

Experiments and Evaluation
Comparison among the Three Methods
Two simple datasets:
• Dataset 1: 188 nodes and 262 activities (links)
• Dataset 2: 4188 accounts and 6715 activities (links)

Outline
• Multiple Account Detection
• Personal words

Page Information and Page-like Graph
[Figure: page-like graph connecting Sports Illustrated, Golden State Warriors, Oakland Museum, and Giving Tuesday through "like" edges]
Field        Example
Page ID      47657117525
Name         Golden State Warriors
Category     Sports Team
Country      United States
Fan Count    11,019,236
Description  The Official Facebook page of the Golden State Warriors

• Facebook public pages are public profiles used by local businesses,
  companies, organizations, or public figures
• Likes: promoting other pages to community participants

Data Collection
Facebook Graph API version 2.8 was used to collect our data [1]
• 38,831,367 pages (for this work)
  • 2,430,873 US
  • 12,685,090 other countries
  • 23,715,404 unknown
[1] https://developers.facebook.com/docs/graph-api/reference/page

Majority Vote Algorithm
• Location is designated as state information in this scenario
• The location label is determined by the most votes
• Overall accuracy is only 59.4%
• This algorithm works well in the page nationality prediction task,
  with 90.25% accuracy
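A majority vote over a page's labeled neighbors can be sketched as below. The slides do not spell out which pages cast votes, so this sketch assumes the voters are the pages connected to the target through like edges; the function name, the data layout, and the sample pages are hypothetical.

```python
from collections import Counter


def majority_vote(page, neighbors, labels):
    """Predict a page's location as the most common label among its labeled neighbors."""
    votes = Counter(labels[n] for n in neighbors.get(page, ()) if n in labels)
    if not votes:
        return None  # no labeled neighbor, so no prediction
    return votes.most_common(1)[0][0]


# Hypothetical like-graph neighborhoods and known state labels.
neighbors = {
    "page_x": ["page_1", "page_2", "page_3", "page_4"],
    "page_y": ["page_5"],
}
labels = {"page_1": "CA", "page_2": "CA", "page_3": "NY"}  # page_4, page_5 unlabeled

predicted_x = majority_vote("page_x", neighbors, labels)
predicted_y = majority_vote("page_y", neighbors, labels)
```

Ties are resolved arbitrarily by `Counter.most_common` in this sketch; a real system would need an explicit tie-breaking rule.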

Baseline Algorithm
Utilizes the locality of states to find pages
belonging to their corresponding states
• Pick anchor pages with local character as multiple seeds from which to start BFS
Target classifier: 51 classes
• 50 classes for US states and one class for "others (OT)"
State Distance Vector (SDV)

[Figure: BFS from state anchor pages (Alabama, Alaska, Arizona, Arkansas, …, Wyoming) to a page P]
IHOP(P, S_Arizona) == 4
OHOP(P, S_Arizona) == 3
Graph size: 31M+ nodes, 600M+ edges
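A State Distance Vector can be sketched with one BFS per anchor and direction. The slides do not define IHOP and OHOP precisely, so this sketch makes a loud assumption: OHOP(P, S) is the hop count from P to the state's anchor along outgoing like edges, and IHOP(P, S) is the hop count from the anchor to P. All function names and the toy like-graph are hypothetical.

```python
from collections import deque


def hop_distances(graph, source):
    """BFS hop counts from `source` along the directed edges of `graph`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def reverse(graph):
    """Reverse every directed edge."""
    rev = {}
    for src, targets in graph.items():
        for dst in targets:
            rev.setdefault(dst, []).append(src)
    return rev


def state_distance_vector(likes, anchors, page):
    """Map each state to (IHOP, OHOP) for `page`; None means unreachable."""
    reversed_likes = reverse(likes)
    sdv = {}
    for state, anchor in anchors.items():
        ihop = hop_distances(likes, anchor).get(page)           # anchor -> ... -> page
        ohop = hop_distances(reversed_likes, anchor).get(page)  # page -> ... -> anchor
        sdv[state] = (ihop, ohop)
    return sdv


# Hypothetical like-graph: likes[x] lists the pages that x likes.
likes = {
    "anchor_AZ": ["a"],
    "a": ["P"],
    "P": ["b"],
    "b": ["c"],
    "c": ["anchor_AZ"],
    "anchor_WY": [],
}
anchors = {"AZ": "anchor_AZ", "WY": "anchor_WY"}
sdv = state_distance_vector(likes, anchors, "P")
```

On a graph with 31M+ nodes one would run the two BFS passes once per anchor and read off distances for all pages, rather than recomputing them per page as this small sketch does.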

Anchor Page Selection (1/2)
Effectiveness of BFS-based algorithms
• It depends on anchor page selection
Anchor pages have to be local so that SDV can reflect the true
tendency of a page's locality
Suitable examples (focusing on local communities)
• State universities, government, park, or police organizations
Ill-suited examples (popular and thus having global impact)
• NBA, MLB, or NFL sports teams

Anchor Page Selection (2/2)
• We adopt all subsidiary pages of "OnlyInYourState.com" as a set of anchor pages
  • It has a distinct page for each state
  • Each subsidiary page mostly connects local communities
Page Name                    Page ID
Only In Alabama              783744898386760
Only In Alaska               686107314826906
Only In Southern California  1840349052857006
Only In Northern California  856450181102963
Idaho Only                   435099846671531
Only In New York             386608421546055
Only In Virginia             1560515737540492
Only In West Virginia        1509709509286532
Only In Wisconsin            1390297064627420
Only In Wyoming              1724174364476381

51 Anchors
[Figure: the 51 anchors; labeled examples: Arizona, Northern California]

Advanced Algorithm
Baseline algorithm's drawback
• A local page can have a few connections to pages far outside its locality
• This kind of connection noise can sharply reduce prediction accuracy
State Neighborhood Probability (SNP)
Both SDV and SNP are taken as feature vectors for ML models
• Utilize locality and neighborhood context for better identification

Dataset
California accounts for 20% of all US pages, and nearly half of all
pages (49.49%) are located in the top 5 states
• California, New York, Florida, Illinois, and Texas

Accuracy Summary
Classifier                    Precision  Recall  F1 score
Naive Bayes (Baseline BFS)    0.44       0.27    0.26
AdaBoost (Baseline BFS)       0.46       0.40    0.37
Random Forest (Baseline BFS)  0.69       0.69    0.68
Random Forest (Advanced BFS)  0.89       0.88    0.88

Outline
• Multiple Account Detection
• Personal words

Future Trends -- IT

Thank you!
Q & A