This document summarizes a tutorial on mining misinformation in social media. It discusses how misinformation spreads through social networks, fueled by spreaders and influenced users, and covers the challenges of detecting misinformation: its large scale, dynamic nature, and deceptive character. The tutorial outlines detection methods based on content, context, and propagation; discusses features such as text, images, and modeled message sequences; and emphasizes early detection because of misinformation's potential for societal harm.
1. Arizona State University Mining Misinformation in Social Media, November 21, 2017 1
Mining Misinformation in Social Media:
Understanding Its Rampant Spread, Harm, and Intervention
Liang Wu1, Giovanni Luca Ciampaglia2, Huan Liu1
1Arizona State University
2Indiana University Bloomington
Tutorial Web Page
• All materials and resources are available online: http://bit.ly/ICDMTutorial
Definition of Misinformation
• False and inaccurate information that is spontaneously
spread.
• Misinformation can be…
– Disinformation
– Rumors
– Urban legend
– Spam
– Troll
– Fake news
– …
https://www.kdnuggets.com/2016/08/misinformation-key-terms-explained.html
Ecosystem of Misinformation
• Spreaders
– Fabricate misinformation
– Motivations: spammers, fraudsters, …
• Misinformation
– Fake news, rumors, spam, clickbait, …
• Influenced users
– Echo chambers
– Filter bubbles
Misinformation Ramification
• Top 10 global risks highlighted for 2014 (World Economic Forum), selected items with scores:
1. Rising societal tensions in the Middle East and North Africa (4.07)
2. Widening income disparities (4.02)
3. Persistent structural unemployment (3.97)
…
10. The rapid spread of misinformation online (3.35)
Word of The Year
• Macquarie Dictionary Word of the Year 2016: "fake news"
• Oxford Dictionaries Word of the Year 2016: "post-truth"
Social Media
• Social media has changed the way we exchange and obtain information.
• 500 million tweets are posted per day, making social media an effective channel for information dissemination.
(Platform logos: Twitter, Facebook, RenRen.)
Social Media: A Channel for Misinformation
• False and inaccurate information is pervasive.
• Misinformation can be devastating.
– Cause undesirable consequences
– Wreak havoc
• Echo chamber: misinformation can be reinforced.
• Filter bubble: misinformation can be targeted.
Two Examples
• PizzaGate – Fake News has Real Consequences
– What made Edgar Maddison Welch "raid" a "pedo ring" on 12/4/2016?
– It all started with a post on Facebook, spread to Twitter, and then went viral via platforms like Breitbart and Infowars
• Anti-Vaccine Movement on Social Media: A
case of echo chambers and filter bubbles
– Peer-to-peer connection
– Groups
– Facebook feeds
PizzaGate
https://www.nytimes.com/interactive/2016/12/10/business/media/pizzagate.html
Timeline (Oct 2016 to 2017):
1. WikiLeaks began releasing the emails of John Podesta (Oct–Nov 2016).
2. Social media users on Reddit searched the releases for evidence of wrongdoing.
3. Discussions that included the word "pizza," including dinner plans, were found.
4. A participant connected the phrase "cheese pizza" to pedophiles ("c.p." → child pornography) (around Nov 3, 2016).
5. Following the use of "pizza," theorists focused on the Washington pizza restaurant Comet Ping Pong.
6. The theory snowballed, taking on the meme #PizzaGate; fake news articles emerged (around Nov 23, 2016).
7. The false stories swept up neighboring businesses and bands that had played at Comet; theories about kill rooms, underground tunnels, satanism, and even cannibalism emerged.
8. Edgar M. Welch, a 28-year-old from North Carolina, fired a rifle inside the pizzeria and surrendered after finding no evidence to support claims of child slaves (Dec 4, 2016).
9. The shooting did not put the theory to rest; purveyors of the theory and fake news pointed to the mainstream media as conspirators in a cover-up (2016–2017).
Challenges in Dealing with Misinformation
• Large-scale
– Misinformation can be rampant
• Dynamic
– It can happen fast
• Deceiving
– Hard to verify
• Homophily
– Consistent with one’s beliefs
Overview of Today’s Tutorial
• Introduction (40 minutes)
• Misinformation Detection (20 minutes)
• Misinformation in Social Media (40 minutes)
• Misinformation Spreader Detection (10 minutes)
• Resources (10 minutes)
Misinformation Detection
Misinformation in Social Media: An Example
Misinformation Spreader
Content of Misinformation
• Text
• Hashtag
• URL
• Emoticon
• Image
• Video (GIF)
Context of Misinformation
• Date, Time
• Location
Propagation of Misinformation
• Retweet
• Reply
• Like
Overview of Misinformation Detection
Misinformation detection branches:
• Content [1, 2, 3, 4]: individual messages or message clusters; supervised (classification) or unsupervised (anomaly detection)
• Context (when) [5, 6]: anomalous timing of bursts
• Propagation (who, how) [7, 8]
• Early detection: lack of data [9]; lack of labels [10]

[1] Qazvinian et al. "Rumor has it: Identifying misinformation in microblogs." EMNLP 2011.
[2] Castillo et al. "Predicting information credibility in time-sensitive social media." Internet Research 23.5 (2013).
[3] Zubiaga et al. "Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media."
[4] Wu et al. "Information Credibility Evaluation on Social Media." AAAI 2016.
[5] Wang et al. "Detecting rumor patterns in streaming social media." IEEE BigData 2015.
[6] Kwon et al. "Modeling Bursty Temporal Pattern of Rumors." ICWSM 2014.
[7] Wu et al. "Characterizing Social Media Messages by How They Propagate." WSDM 2018.
[8] Ma et al. "Detect Rumors in Microblog Posts Using Propagation Structure via Kernel Learning." ACL 2017.
[9] Sampson et al. "Leveraging the Implicit Structure within Social Media for Emergent Rumor Detection." CIKM 2016.
[10] Wu et al. "Gleaning Wisdom from the Past: Early Detection of Emerging Rumors in Social Media." SDM 2017.
Feature Engineering on Content
Text feature | Example
Length of post | #words, #characters
Punctuation marks | question mark (?), exclamation mark (!)
Emojis/emoticons | angry face ;-L
Sentiment | sentiment/swear/curse words
Pronouns (1st, 2nd, 3rd person) | I, me, myself, my, mine
URL | PageRank of the linked domain
Mention (@) | presence/count of mentions
Hashtag (#) | presence/count of hashtags
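The content features above can be computed with a few lines of code. A minimal sketch; the feature names and the sample post are illustrative, not the tutorial's exact feature set:

```python
import re

def text_features(post):
    """Extract simple content features of the kind listed above.
    Feature definitions here are illustrative choices."""
    words = post.split()
    return {
        "n_words": len(words),
        "n_chars": len(post),
        "n_question": post.count("?"),
        "n_exclaim": post.count("!"),
        "n_first_person": sum(w.lower().strip(".,!?") in
                              {"i", "me", "my", "mine", "myself"} for w in words),
        "n_urls": len(re.findall(r"https?://\S+", post)),
        "n_mentions": len(re.findall(r"@\w+", post)),
        "n_hashtags": len(re.findall(r"#\w+", post)),
    }

f = text_features("BREAKING! Sharks in the streets of Houston?! http://t.co/x #Harvey")
```

Each post then becomes a fixed-length numeric vector, ready for the classifiers discussed next.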
Misinformation Detection: Text Matching
• Text Matching
– Exact matching
– Relevance
• TF-IDF
• BM25
– Semantic
• Word2Vec
• Doc2Vec
• Drawback
– Low recall: reworded or paraphrased misinformation escapes matching
(Figure: matching a new post against known fake news, ranging from exact duplication and similar words to similar or different semantic representations.)
Starbird, Kate, et al. "Rumors, false flags, and digital vigilantes: Misinformation on twitter after the 2013 boston marathon bombing." iConference 2014 Proceedings (2014).
Jin, Zhiwei, et al. "Detection and Analysis of 2016 US Presidential Election Related Rumors on Twitter." International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, Cham, 2017.
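Relevance-based matching can be sketched with TF-IDF vectors and cosine similarity. The corpus, query, and threshold below are toy values, not from the tutorial:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Known misinformation texts (a toy corpus; real systems match against
# curated databases of debunked claims).
known = [
    "sharks are swimming in the streets of houston",
    "two explosions in the white house",
]
vec = TfidfVectorizer().fit(known)
K = vec.transform(known)

def looks_like_known_misinformation(post, threshold=0.5):
    """Flag a post whose TF-IDF vector is close to known misinformation.
    The 0.5 threshold is an arbitrary illustrative choice."""
    sims = cosine_similarity(vec.transform([post]), K)[0]
    return bool(sims.max() >= threshold)

hit = looks_like_known_misinformation("sharks swimming in the streets of houston!")
miss = looks_like_known_misinformation("totally unrelated kitten video")
```

This illustrates the low-recall drawback directly: any rewording that drops the shared vocabulary (here, every word) pushes the similarity to zero.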
Misinformation Detection: Supervised Learning
• Message-based
– A vector represents a tweet
• Message cluster-based
– A vector represents a cluster of tweets
• Methods
– Random Forest
– SVM
– Naïve Bayes
– Decision Tree
– Maximum Entropy
– Logistic Regression
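A minimal sketch of the supervised message-based setting, with hypothetical feature vectors (say, #words, #question marks, #exclamation marks) and toy labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy feature matrix: rows are messages, columns are content features such as
# #words, #question marks, #exclamation marks (all values are made up).
X = np.array([[12, 0, 0], [8, 2, 3], [15, 0, 1], [6, 3, 4],
              [20, 1, 0], [7, 2, 5], [11, 0, 0], [9, 3, 2]])
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])  # 1 = misinformation (toy labels)

# Any of the listed classifiers (Random Forest, SVM, ...) plugs in here;
# logistic regression is just one concrete choice.
clf = LogisticRegression().fit(X, y)
pred = clf.predict([[10, 3, 4]])  # many ?/! marks: resembles the positive class
```

Swapping `LogisticRegression` for `RandomForestClassifier` or `SVC` changes only one line; the feature engineering stays the same.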
Visual Content-based Detection
• Diversity of Images
Jin, Zhiwei, et al. "Novel visual and statistical image features for microblogs news verification."
IEEE Transactions on Multimedia 19.3 (2017): 598-608.
Example image claims:
– "Texas Pizza Hut workers paddle through flood waters to deliver free pizzas by kayak"
– "There are sharks swimming in the streets of Houston during Hurricane Harvey"
References
• Starbird, Kate, et al. "Rumors, false flags, and digital vigilantes: Misinformation on twitter after the 2013 boston marathon bombing." iConference 2014 Proceedings (2014).
• Jin, Zhiwei, et al. "Detection and Analysis of 2016 US Presidential Election Related Rumors on Twitter." International Conference on Social Computing, Behavioral-Cultural Modeling and Prediction and Behavior Representation in Modeling and Simulation. Springer, Cham, 2017.
• Gupta, Aditi, and Ponnurangam Kumaraguru. "Credibility ranking of tweets during high impact events." Proceedings of the 1st Workshop on Privacy and Security in Online Social Media. ACM, 2012.
• Yu, Suisheng, Mingcai Li, and Fengming Liu. "Rumor Identification with Maximum Entropy in MicroNet."
• Yang, Fan, et al. "Automatic detection of rumor on sina weibo." Proceedings of the ACM SIGKDD Workshop on Mining Data Semantics. ACM, 2012.
• Zhang, Qiao, et al. "Automatic detection of rumor on social network." Natural Language Processing and Chinese Computing. Springer, Cham, 2015. 113-122.
• Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. "Information credibility on twitter." Proceedings of the 20th International Conference on World Wide Web. ACM, 2011.
• Castillo, Carlos, Marcelo Mendoza, and Barbara Poblete. "Predicting information credibility in time-sensitive social media." Internet Research 23.5 (2013): 560-588.
• Qazvinian, Vahed, et al. "Rumor has it: Identifying misinformation in microblogs." Proceedings of the Conference on Empirical Methods in Natural Language Processing. ACL, 2011.
• Wu, Shu, et al. "Information Credibility Evaluation on Social Media." AAAI 2016.
Modeling Message Sequence
• Content-only methods ignore the chronological order of messages
• Messages are generated as a temporal sequence; modeling posts as independent documents discards this structural information
Modeling Post Sequence: Message-based
• Message-based
– Linear-chain Conditional Random Fields (CRF) label each post in the sequence
Zubiaga, Arkaitz, Maria Liakata, and Rob Procter. "Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media."
Modeling Post Sequence: Cluster-based
• Message cluster-based
– A Recurrent Neural Network reads the sequence of posts; a classifier layer outputs the prediction
Ma et al. "Detecting Rumors from Microblogs with Recurrent Neural Networks." IJCAI 2016.
Personalized Misinformation Detection (PCA)
• Detect anomalous content of a user with PCA
• Main assumption: misinformation is likely to be eccentric relative to a user's normal content
• Detect misinformation as content outliers
– Tweet-based modeling
– Measure the distance between a new message and the user's historical posts
Zhang, Yan, et al. "A distance-based outlier detection method for rumor detection exploiting user behaviorial differences." ICoDSE 2016. IEEE, 2016.
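The distance-to-history idea can be sketched with PCA: project a new post onto the principal subspace of the user's historical posts and use the reconstruction distance as an outlier score. All data below are synthetic:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy feature vectors for one user's historical posts; variance is mostly in
# the first two dimensions (all values are synthetic).
history = rng.normal(loc=[5.0, 1.0, 0.5], scale=[1.0, 0.5, 0.05], size=(200, 3))

pca = PCA(n_components=2).fit(history)

def reconstruction_distance(x):
    """Distance between a post and its projection onto the principal
    subspace of the user's history; large values flag out-of-character posts."""
    x = np.atleast_2d(x)
    recon = pca.inverse_transform(pca.transform(x))
    return float(np.linalg.norm(x - recon))

d_normal = reconstruction_distance([5.1, 0.9, 0.55])   # in-character post
d_odd = reconstruction_distance([5.0, 1.0, 7.0])       # far from the subspace
```

Thresholding this distance gives a per-user outlier detector; the paper's pipeline is more elaborate, but the core signal is this reconstruction gap.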
Personalized Misinformation Detection (Autoencoder)
• Detect anomalous content of a user with an autoencoder
• Multi-layer autoencoder
– Train an autoencoder on the user's historical data
– To test a message: feed it to the autoencoder, obtain the reconstruction, and compute the distance between the original and the reconstruction
Zhang, Yan, et al. "Detecting rumors on Online Social Networks using multi-layer autoencoder." TEMSCON 2017. IEEE, 2017.
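A minimal sketch of the autoencoder variant, using a linear one-unit bottleneck regressor as a stand-in for the paper's multi-layer autoencoder (architecture and data are illustrative):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
# Toy historical posts: three features tied together by one latent factor,
# mimicking a user's consistent "style" (entirely synthetic data).
z = rng.normal(size=(300, 1))
history = np.hstack([z, 2 * z, -z]) + 0.05 * rng.normal(size=(300, 3))

# A regressor trained to reproduce its own input acts as an autoencoder;
# one linear hidden unit gives a PCA-like bottleneck (sizes are illustrative).
ae = MLPRegressor(hidden_layer_sizes=(1,), activation="identity",
                  max_iter=5000, random_state=0)
ae.fit(history, history)

def recon_error(x):
    """Distance between a message and its reconstruction; large values
    suggest content that is out of character for this user."""
    x = np.atleast_2d(x)
    return float(np.linalg.norm(x - ae.predict(x)))

e_typical = recon_error([1.0, 2.0, -1.0])    # consistent with history
e_atypical = recon_error([1.0, -2.0, 1.0])   # breaks the learned correlation
```

With deeper, nonlinear hidden layers the same reconstruction-error test captures manifolds that PCA cannot.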
Detecting Misinformation with Context
Context of Misinformation
• Date, Time
• Location
Peak Time of Misinformation
• Rebirth of misinformation
– Misinformation has multiple peaks of activity over time
– True information typically has only one
(Figures: temporal activity of misinformation on Twitter and on Facebook.)
Kwon et al. "Modeling Bursty Temporal Pattern of Rumors." ICWSM 2014.
Friggeri et al. "Rumor Cascades." ICWSM 2014.
Detecting Misinformation with Propagation
Propagation of Misinformation
• Retweet
• Reply
• Like
Detecting Misinformation with Propagation
• Misinformation is spread by similar users
– Bot armies
– Echo chambers of misinformation
• Intuition: misinformation can be distinguished by who spreads it and how it is spread
• Challenges
– Users may change accounts (bot armies)
– Data sparsity
Detecting Misinformation with Propagation
• Pipeline:
1. User embedding: embed users into representations learned from their posts, networks, and communities
2. Sequence modeling: represent a message's propagation pathway as the sequence of users who spread it (e.g., A → B → D)
3. Message classification: feed the modeled sequence to a classifier
Liang Wu, Huan Liu. "Characterizing Social Media Messages by How They Propagate." WSDM 2018.
Key Issue for Misinformation Detection
• April 2013: the AP account tweeted "Two Explosions in the White House and Barack Obama is injured"
– Truth: hackers had compromised the account
– Nevertheless, the tweet tipped the stock market by $136 billion within 2 minutes
Early Detection of Misinformation: Challenges
• Challenges of early detection
–Message cluster based methods
• Lack of data
–Supervised learning methods
• Lack of labels
Early Detection Challenge I: Lack of Data
• Early stage: only a few posts, sparsely scattered
• Most methods only prove effective at a later stage, once enough posts have accumulated
(Figure: sparse posts at the early stage vs. dense clusters at the later stage.)
Early Detection: Lack of Data
• Linking scattered messages
– Cluster messages, then merge individual messages into larger groups using:
• Hashtag linkage: posts sharing a hashtag
• Web linkage: posts sharing an embedded URL
Sampson et al. "Leveraging the implicit structure within social media for emergent rumor detection." CIKM 2016.
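A toy sketch of hashtag/web linkage: group posts that share a hashtag or an embedded URL. The posts and the grouping rule are illustrative; the paper's method is more involved:

```python
import re
from collections import defaultdict

# Hypothetical early-stage posts about an emerging story.
posts = [
    "Explosion downtown! #breaking http://ex.am/1",
    "Huge explosion reported #breaking",
    "So many casualties http://ex.am/1",
    "Cute cat pictures #caturday",
]

def link_posts(posts):
    """Group post indices by shared hashtags or URLs, a simplified version
    of hashtag/web linkage for merging sparse early-stage messages."""
    groups = defaultdict(set)
    for i, p in enumerate(posts):
        for key in re.findall(r"#\w+|https?://\S+", p):
            groups[key].add(i)
    return dict(groups)

groups = link_posts(posts)
```

Merging groups that share members (posts 0-2 here, via #breaking and the URL) yields larger clusters, giving cluster-based detectors enough data at the early stage.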
Early Detection Challenge II: Lack of Labels
• Lack of labels
– In traditional text categorization, articles within a category share vocabulary and writing style (e.g., sports articles resemble each other)
– Misinformation is heterogeneous: two rumors are unlikely to resemble each other (e.g., a rumor about the presidential election vs. a rumor about the Ferguson protests)
Early Detection (II): Lack of Labels
• Utilize user responses to prior misinformation
– Cluster misinformation with similar responses
– Select effective features shared within a cluster
• Example responses to one rumor:
– "Can't fix stupid but it can be blocked"
– "So, when did bearing false witness become a Christian value?"
– "Christians Must Support Trump or Face Death Camps. Does he still claim to be a Christian?"
• Example responses to a different rumor:
– "i've just seen the sign on fb. you can't fix stupid"
– "THIS IS PURE INSANITY. HOW ABOUT THIS STATEMENT"
– "No Mother Should Have To Fear For Her Son's Life Every Time He Robs A Store"
Wu et al. "Gleaning Wisdom from the Past: Early Detection of Emerging Rumors in Social Media." SDM 2017.
Early Detection Results: Lack of Data
• Effectiveness of linkage
(Figures: classification results without linkage, with hashtag linkage, and with web linkage.)
Early Detection Results: Lack of Labels
• Effectiveness of response-based features
(Figures: effectiveness of different methods over time; results at an early stage, 2 hours in.)
Overview of Misinformation Detection
(Recap of the detection overview slide: content [1-4]; context, i.e., when [5, 6]; propagation, i.e., who and how [7, 8]; early detection under lack of data [9] and lack of labels [10]. See the reference list on the earlier overview slide.)
Spread of Misinformation
Introduction
➢ What is Misinformation and Why It Spreads on Social Media
➢ Modeling the Spread of Misinformation
➢ Open Questions
○ What techniques are used to boost misinformation?
Pheme
❖ Wartime studies, types of rumors (e.g., pipe dreams) [Knapp 1944; Allport & Postman 1947]
❖ "Demand" for improvised news [Shibutani 1968]
❖ Two-step information diffusion [Katz & Lazarsfeld 1955]
❖ Reputation exchange [Rosnow & Fine 1976]
❖ Collective sensemaking, watercooler effect [Bordia & DiFonzo 2004]

"Swift is her walk, more swift her winged haste:
A monstrous phantom, horrible and vast.
As many plumes as raise her lofty flight,
So many piercing eyes inlarge her sight;
Millions of opening mouths to Fame belong,
And ev'ry mouth is furnish'd with a tongue,
And round with list'ning ears the flying plague is hung."
(Aeneid, Book IV)
Source: "Fake News. It's Complicated." First Draft News, medium.com/1st-draft
Echo Chambers
What is the role of online social networks and social media in fostering echo chambers, filter bubbles, segregation, and polarization?
Adamic & Glance (2005) [blogs]; Conover et al. (2011) [Twitter]
Recap: What Is Misinformation and Why It Spreads
❖ Misinformation has always existed
❖ Social media disseminate (mis)information very quickly
❖ Echo chambers insulate people from fact-checking and verification
Introduction
➢ What is Misinformation and How It Spreads
➢ Modeling the Spread of Misinformation
➢ Open Questions
○ What techniques are used to boost misinformation?
Models of Information Diffusion
❖ Compartmental models (SI, SIR, SIS, etc.) [Kermack and McKendrick 1927]
❖ Rumor-spreading models (DK, MT) [Daley and Kendall 1964; Maki 1973]
❖ Independent Cascade Model [Kempe et al. 2005]
❖ Threshold Model, Complex Contagion [Granovetter 1979; Centola 2010]

P_i(m) ∝ f(i): the probability of adopting a "meme" m at the i-th exposure, where f is monotonically increasing.
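The Independent Cascade Model from the list above can be simulated in a few lines; the graph and parameters are toy values:

```python
import random

def independent_cascade(graph, seeds, p, seed=42):
    """Simulate the Independent Cascade Model on a directed graph given as an
    adjacency-list dict: each newly activated node gets exactly one chance to
    activate each out-neighbor, succeeding with probability p."""
    rng = random.Random(seed)
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        nxt = []
        for u in frontier:
            for v in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

# A toy line network 0 -> 1 -> 2 -> 3; with p = 1.0 the cascade reaches
# everyone, and with p = 0.0 it never leaves the seed set.
g = {0: [1], 1: [2], 2: [3]}
full = independent_cascade(g, seeds=[0], p=1.0)
none = independent_cascade(g, seeds=[0], p=0.0)
```

Threshold models differ in exactly the point the slide makes: adoption there depends on the fraction of already-active neighbors, not on independent per-edge coin flips.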
Simple vs Complex Contagion
❖ Complex contagion: strong concentration of communication inside communities
❖ Simple contagion: weak concentration
❖ Most memes spread like complex contagion
❖ Viral memes spread across communities more like diseases (simple contagion)
Weng et al. (2014) [Twitter]
Role of the Social Network and Limited Attention
❖ Spread among agents with limited attention on a social network is sufficient to explain virality patterns
❖ It is not necessary to invoke more complicated explanations based on intrinsic meme quality
Weng et al. (2014), Nature Sci. Rep.
Recap: Models of the Spread of Misinformation
❖ Simple vs complex contagion
❖ More realistic features:
➢ Agents have limited attention
➢ Social network structure
➢ Competition between different memes
❖ Tradeoff between information quality and diversity
Introduction
➢ What is Misinformation and Why It Spreads
➢ Modeling the Spread of Misinformation
➢ Open Questions
○ What techniques are used to boost misinformation?
Conclusions
❖ What is misinformation and why it spreads
➢ Online, it spreads through a mix of social, cognitive, and algorithmic biases.
❖ Modeling the spread of misinformation
➢ Social network structure, limited attention, and information overload make us vulnerable to misinformation.
❖ Open questions
➢ Bots are strategic super-spreaders and are effective at spreading misinformation.
❖ Tools to detect manipulation of public opinion may be first steps toward a trustworthy Web.
Recap: Open Questions
❖ Social bots amplify misinformation
➢ through social reinforcement
➢ through early amplification
➢ by targeting humans, possibly "influentials"
Supply of and Demand for Information
❖ Production of information is associated with shifts in collective attention
❖ Evidence that attention precedes production
❖ Higher demand → higher price → more production
Ciampaglia et al., Scientific Reports 2015 (source: Wikipedia)
Predicting Virality
Ingredients: structural trapping, social reinforcement, homophily.
M1: random sampling model
M2: random cascading model (structural trapping)
M3: social reinforcement model (structural trapping + social reinforcement)
M4: homophily model (structural trapping + homophily)
The models range from simple contagion to complex contagion.
Weng et al. (2014) [Twitter]
Misinformation Spreader Detection
Misinformation in Social Media: An Example
(Recap of the earlier example: a misinformation spreader; content of misinformation: text, hashtags, URLs, emoticons, images, video/GIF; context: date, time, location; propagation: retweets, replies, likes.)
Detecting Misinformation by Its Spreaders
• A large portion of OSN accounts are likely to be fake
– Facebook: an estimated 67.65 million to 137.76 million fake accounts
– Twitter: an estimated 48 million
A Misinformation Spreader
• Misinformation spreaders: users who deliberately spread misinformation to mislead others
• Example: a phishing link pointing to "Twivvter.com"
Types of Misinformation Spreaders
• Spammers
• Fraudsters
• Trolls
• Crowdturfers
• …
(Figure: the misinformation ecosystem of fake news, rumors, spam, and clickbait, and the spreaders behind them, e.g., spammers and fraudsters.)
Features for Capturing a Spreader
• What can be used to detect a spammer?
– Profile (profile features)
– Posts (text features)
– Friends (network features)
Feature Engineering: Profile
• Extracting features from a profile
– #followers, #followees (e.g., a small #followers suggests a suspicious account)
– Biography, registration time, screen name, etc.
Feature Engineering: Text
• Extracting text features from user posts
– Text: bag-of-words, TF-IDF, etc., yielding a textual feature vector per user
Feature Engineering: Network
• Extracting network features
– Network: adjacency matrix, number of followers, follower/followee ratio, centrality
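A small sketch of network feature extraction from an adjacency matrix; the graph and users are hypothetical:

```python
import numpy as np

# Toy directed follower graph: A[i, j] = 1 means user i follows user j.
# The users and edges are made up for illustration.
users = ["alice", "bob", "spammer"]
A = np.array([
    [0, 1, 0],   # alice follows bob
    [1, 0, 0],   # bob follows alice
    [1, 1, 0],   # the spammer follows both, hoping for follow-backs
])

followers = A.sum(axis=0)   # column sums: how many accounts follow each user
followees = A.sum(axis=1)   # row sums: how many accounts each user follows

# Follower/followee ratio; a very low ratio is a classic spam signal.
ratio = followers / np.maximum(followees, 1)
```

Richer network features (centrality, community membership) are computed from the same matrix.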
Overview: Misinformation Spreader Detection
Methods, by the data they use:
• Content: text mining [1]
• Content + network: text and graph mining [2, 3]
• Content + network under camouflage: instance (post/user) selection [4, 5, 6]

[1] Jindal, Nitin, and Bing Liu. "Review spam detection." WWW 2007.
[2] Hu, X., Tang, J., Zhang, Y., and Liu, H. "Social Spammer Detection in Microblogging." IJCAI 2013.
[3] Song, Yuqi, et al. "PUD: Social Spammer Detection Based on PU Learning." International Conference on Neural Information Processing. Springer, Cham, 2017.
[4] Wu, L., Hu, X., Morstatter, F., and Liu, H. "Adaptive Spammer Detection with Sparse Group Modeling." ICWSM 2017, pp. 319-326.
[5] Wu, Liang, et al. "Detecting Camouflaged Content Polluters." ICWSM 2017.
[6] Hooi, Bryan, et al. "Fraudar: Bounding graph fraud in the face of camouflage." KDD 2016.
Supervised Learning: Content + Network
• Features for supervised learning
– Text features (textual feature vectors)
– Network features (the adjacency matrix)
– Profile features
Traditional Approach: Content Modeling
• Supervised learning with text features
– Assumption: positive and negative accounts can be distinguished by their text
– A linear model maps each textual feature vector to a label; the coefficients are estimated from labeled accounts
Traditional Approach: Network Modeling
• Supervised learning with network features
– Assumption: friends are likely to have the same label
– The adjacency matrix is used to push connected users toward the same prediction
Emerging Challenge: Camouflage
• Content camouflage
– Copying content from legitimate users
– Exploiting compromised accounts
• Network camouflage
– Link farming with other spreaders and bots
– Link farming with normal users
Challenge (I): Camouflage
• To avoid being detected, spreaders:
– manipulate the text feature vector by posting content similar to regular users', undermining the assumption that accounts can be distinguished by text
– manipulate the adjacency matrix by harvesting links with other users, undermining the assumption that friends share the same label
Challenge (II): Network Camouflage
• Network camouflage undermines heuristic-based methods such as:
– #followers
– follower/followee ratio
– anomaly detection
Challenge III: Limited Label Information
• Labeling malicious accounts (positive examples)
– Suspended accounts
– Honeypots: bots created to lure other bots in the wild; any user that follows them is assumed to be a bot, on the assumption that normal users can easily recognize them
• Labeled camouflage, however, remains lacking
Camouflage
• Prior assumptions broken by camouflage:
– All suspended accounts are misinformation spreaders
– All posts of a spreader are malicious
• Remedies:
– Select a subset of users for training
– Select a subset of posts for training
Selecting Users for Training
• How to select the optimal set for training? Iterate:
1. Select a subset of users for training
2. Evaluate with a validation set
3. Update the training set
Wu et al. "Adaptive Spammer Detection with Sparse Group Modeling." ICWSM 2017.
Relaxation I: Group Structure
• Assumption: malicious accounts cannot join a legitimate community
– Organize users into groups
– Users in the same group should be similarly weighted
• Example group hierarchy (G_j^i denotes group j on layer i):
– Layer 0: G_1^0 = {1, 2, 3, 4, 5, 6, 7}
– Layer 1: G_1^1 = {1, 2, 3, 4}, G_2^1 = {5}, G_3^1 = {6, 7}
– Layer 2: G_1^2 = {1, 2}, G_2^2 = {3, 4}
Wu et al. "Adaptive Spammer Detection with Sparse Group Modeling." ICWSM 2017.
Relaxation II: Weighted Training
The weighted training objective is

min_{w,c} Σ_{i=1}^{N} c_i (x_i w − y_i)^2 + λ_1 ||w||_2^2 + λ_2 Σ_{i=0}^{d} Σ_{j=1}^{n_i} ||c_{G_j^i}||_2
subject to Σ_i c_i = K, 0 < c_i < 1

where the λ_1 term avoids overfitting and the λ_2 term is a group Lasso: an L1 norm at the inter-group level over an L2 norm at the intra-group level.
– d: depth of the hierarchy produced by the Louvain method
– n_i: number of groups on layer i
– c_{G_j^i}: weights of the nodes of group j on layer i
Wu et al. "Adaptive Spammer Detection with Sparse Group Modeling." ICWSM 2017.
Optimization
Alternating optimization, step 1: fix c and optimize w. The subproblem

min_w Σ_{i=1}^{m} c_i (x_i w − y_i)^2 + λ_1 ||w||_2^2

is a weighted ridge regression. Notation: c_i is the weight of instance i, x_i its attribute vector, w the regression coefficients, y_i the label, m the number of instances; ||w||_2^2 avoids overfitting.
Wu et al. "Adaptive Spammer Detection with Sparse Group Modeling." ICWSM 2017.
96. Optimization
Full objective:
min_{w,c} Σ_{i=1}^{N} c_i (x_i w − y_i)^2 + λ1 ||w||_2^2 + λ2 Σ_{i=0}^{d} Σ_{j=1}^{n_i} ||c_{G_j^i}||_2
subject to Σ_i c_i = K
Step 2: fix w and optimize c. Let t_i = (x_i w − y_i)^2 be the (now constant) residual of instance i; the λ1 term is also constant, leaving:
min_c Σ_{i=1}^{N} c_i t_i + λ2 Σ_{i=0}^{d} Σ_{j=1}^{n_i} ||c_{G_j^i}||_2
subject to Σ_i c_i = K
Wu et al. Adaptive Spammer Detection with Sparse Group Modeling. ICWSM 2017
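The alternating scheme on these two slides can be sketched in a few lines of numpy. This is a simplified illustration, not the paper's implementation: the group-Lasso term is dropped from the c-step and the strict inequalities 0 < c_i < 1 are relaxed to a box constraint, under which minimizing Σ c_i t_i with Σ c_i = K simply keeps the K instances with the smallest residuals. All data and parameter values are made up for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: N instances, p features, labels from a linear model plus noise.
# The first 5 labels are corrupted, mimicking mislabeled training instances.
N, p, K = 40, 3, 30
X = rng.normal(size=(N, p))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.01 * rng.normal(size=N)
y[:5] += 10.0                       # corrupted labels

lam1 = 0.1
c = np.ones(N)                      # start with all instances weighted equally

for _ in range(10):
    # Step 1: fix c, solve the weighted ridge regression in closed form:
    # w = (X^T C X + lam1 I)^{-1} X^T C y
    C = np.diag(c)
    w = np.linalg.solve(X.T @ C @ X + lam1 * np.eye(p), X.T @ C @ y)
    # Step 2: fix w, update c. With the group-Lasso term dropped and the
    # box constraint 0 <= c_i <= 1, the minimizer of sum_i c_i t_i subject
    # to sum_i c_i = K puts weight 1 on the K smallest residuals.
    t = (X @ w - y) ** 2
    c = np.zeros(N)
    c[np.argsort(t)[:K]] = 1.0

print(c[:5])  # the corrupted instances should end up with weight 0
```

The corrupted instances accumulate large residuals, get weight 0 in the c-step, and the final w is then fit only on clean data; the group term (omitted here) additionally pushes users from the same community to be selected or discarded together.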
97. Experimental Results
• Collecting data with honeypots: http://infolab.tamu.edu/data/
Tweets | Users | ReTweets | Links | Spammers
4,453,380 | 38,400 | 223,115 | 8,739,105 | 19,200

Approaches | Precision | Recall | F-Score
SSDM [1] | 92.15% | 92.00% | 92.07%
NFS [2] | 88.16% | 65.67% | 75.27%
SGASD | 93.75% | 96.92% | 95.31%

[1] Hu et al. "Social Spammer Detection in Microblogging." IJCAI 2013.
[2] Ye et al. "Discovering Opinion Spammer Groups by Network Footprints." ECML-PKDD 2015.
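As a sanity check, each F-score reported above is the harmonic mean of the corresponding precision and recall:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1)."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the table's F-scores from the reported precision/recall:
for name, p, r in [("SSDM", 0.9215, 0.9200),
                   ("NFS", 0.8816, 0.6567),
                   ("SGASD", 0.9375, 0.9692)]:
    print(f"{name}: {100 * f_score(p, r):.2f}%")
# SSDM: 92.07%, NFS: 75.27%, SGASD: 95.31%
```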
98. Content Camouflage
• Basic assumption of traditional methods:
– All content of a misinformation spreader is malicious
• Content camouflage: posts of a misinformation spreader may be legitimate
– Copy content from legitimate users
– Exploit compromised accounts
99. Content Camouflage: An Example
[Screenshot: two normal-looking posts published by a misinformation spreader]
100. Challenge: Lack of Labeled Data
• Labels of camouflage are costly to collect
101. Learning to Identify Camouflage
• Assumption: posts of misinformation spreaders are a mix of normal and malicious content.
• Introduce a weight for each post label.
• Select the posts that distinguish misinformation spreaders from normal users.
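The weighting idea can be illustrated with a small, hedged sketch. This is not the paper's formulation: the data is synthetic, and the centroid-based scoring rule below is a stand-in chosen only to show how camouflage posts receive low weights while discriminative posts receive high ones:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy feature vectors (e.g. bag-of-words projections). Spreader accounts
# contain malicious posts plus camouflage posts copied from normal users.
normal_posts = rng.normal(loc=0.0, size=(50, 5))
malicious_posts = rng.normal(loc=3.0, size=(30, 5))
camouflage_posts = rng.normal(loc=0.0, size=(20, 5))   # look normal
spreader_posts = np.vstack([malicious_posts, camouflage_posts])

# Weight each spreader post by how much closer it sits to the spreader
# centroid than to the normal centroid. Camouflage posts land near the
# normal centroid, so only genuinely malicious posts get high weight.
mu_norm = normal_posts.mean(axis=0)
mu_spread = spreader_posts.mean(axis=0)
d_norm = np.linalg.norm(spreader_posts - mu_norm, axis=1)
d_spread = np.linalg.norm(spreader_posts - mu_spread, axis=1)
weights = d_norm / (d_norm + d_spread)   # in (0, 1)

print(round(weights[:30].mean(), 2), round(weights[30:].mean(), 2))
```

Downweighting the camouflage posts keeps a classifier trained on spreader accounts from learning that normal-looking content is a spreader signal.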
102. Learning to Identify Camouflage: Formulation
Wu et al. Detecting Camouflaged Content Polluters. ICWSM 2017
103. Experimental Results
• Findings:
– Sophisticated misinformation spreaders first disguise themselves, and then do harm.
• Results: (reported as a table in the original slides)
Wu et al. Detecting Camouflaged Content Polluters. ICWSM 2017
104. Misinformation Spreader Detection
[1] Jindal and Liu. "Review Spam Detection." WWW 2007.
[2] Hu, Tang, Zhang, and Liu. "Social Spammer Detection in Microblogging." IJCAI 2013.
[3] Song et al. "PUD: Social Spammer Detection Based on PU Learning." ICONIP 2017.
[4] Wu, Hu, Morstatter, and Liu. "Adaptive Spammer Detection with Sparse Group Modeling." ICWSM 2017.
[5] Wu et al. "Detecting Camouflaged Content Polluters." ICWSM 2017.
[6] Hooi et al. "FRAUDAR: Bounding Graph Fraud in the Face of Camouflage." KDD 2016.
Data → Methods:
• Content → Text Mining [1]
• Content + Network → Text + Graph Mining [2, 3]
• Content + Network + Camouflage → Instance (Post/User) Selection [4, 5, 6]
105. Challenges in Dealing with Misinformation
• Large-scale
– Misinformation can be rampant
• Dynamic
– It can happen fast
• Deceiving
– Hard to verify
• Homophily
– Consistent with one’s beliefs
106. Codes, Platforms and Datasets
107. Platforms
• TweetTracker: Detecting Topic-centric Bots
• Hoaxy: Tracking Online Misinformation
• Botometer: Detecting Bots on Twitter
108. Fact-Checking Websites
• Online Fact-checking websites
– PolitiFact: http://www.politifact.com/
– Truthy: http://truthy.indiana.edu/
– Snopes: http://www.snopes.com/
– TruthOrFiction: https://www.truthorfiction.com/
– Weibo Rumor: http://service.account.weibo.com/
• Volunteering committee
109. Code and Data Repositories
• Honeypot: http://bit.ly/ASUHoneypot
• Identification: https://veri.ly/
• Diffusion:
– Python Networkx: https://networkx.github.io/
– Stanford SNAP: http://snap.stanford.edu/
• Datasets
– http://socialcomputing.asu.edu/pages/datasets
– http://bit.ly/asonam-bot-data
– https://github.com/jsampso/AMNDBots
– http://carl.cs.indiana.edu/data/#fact-checking
– http://snap.stanford.edu/data/index.html
110. Book Chapters
• "Mining Misinformation in Social Media", Chapter 5 in Big Data in Complex and Social Networks
– http://bit.ly/2AYr5KM
• "Detecting Crowdturfing in Social Media", in Encyclopedia of Social Network Analysis and Mining
– http://bit.ly/2hE6LXE
111. Twitter Data Analytics
• Common tasks in mining Twitter data:
– Collection
– Analysis
– Visualization
• Free download with code & data
tweettracker.fulton.asu.edu/tda/
112. Social Media Mining
• Social Media Mining: An Introduction (a textbook)
• A comprehensive coverage of social media mining techniques:
– Free Download
– Network Measures and Analysis
– Influence and Diffusion
– Community Detection
– Classification and Clustering
– Behavior Analytics
http://dmml.asu.edu/smm/
113. Challenges in Dealing with Misinformation
• Large-scale
– Misinformation can be rampant
• Dynamic
– It can happen fast
• Deceiving
– Hard to verify
• Homophily
– Consistent with one’s beliefs
114. Q&A
• Liang Wu, Giovanni Luca Ciampaglia, Huan Liu
– Liang Wu, Huan Liu: {wuliang, huanliu}@asu.edu
– Giovanni Luca Ciampaglia: gciampag@indiana.edu
• All materials and resources are available online:
http://bit.ly/ICDMTutorial
115. Acknowledgements
• DMML @ ASU
• NaN @ IUB
• MINERVA initiative through the ONR N000141310835 on
Multi-Source Assessment of State Stability