On the Origins of Memes by Means of Fringe Web Communities - Invited talk at Sigmetrics
1. On the Origins of Memes by Means of Fringe Web Communities
Savvas Zannettou, Tristan Caulfield, Jeremy Blackburn, Emiliano De Cristofaro,
Michael Sirivianos, Gianluca Stringhini, Guillermo Suarez-Tangil
5. iDRAMA Lab Overview
• Team of international researchers and academics
• Backgrounds ranging from computer security and cryptography to social network analysis and physics
• Geographically distributed
7. iDRAMA Approach
• Large-scale data-driven approach
• Analyzing billions of posts over several years
• Cross-Platform
• Twitter, Reddit, 4chan, Gab, etc.
• Quantitative Analysis
8. Online platforms do not exist in a vacuum
Looking at a single platform at a time is not enough to
capture online dynamics
We lack tools to effectively trace how
information spreads across different platforms
17. Memes in politics
Memes have become a popular,
and seemingly effective, method to
transmit ideology.
Memes have been weaponized
18. But what do we really know about memes?
• How can we track meme propagation
across Web communities?
• Can we characterize Web
communities through their memes?
• Can we measure the influence of Web
communities with respect to memes
they share?
19. Memes processing pipeline
(Figure: pipeline diagram)
• Inputs: images from Web communities posting memes (4chan, Twitter, Reddit, Gab, generic Web communities) and annotated images from meme annotation sites (Know Your Meme, generic annotation sites)
• 1. pHash Extraction: pHashes of some or all Web communities' images
• 2. pHash-based Pairwise Distance Calculation: pairwise comparisons of pHashes
• 3. Clustering: clusters of images
• 4. Screenshot Classifier: pHashes of non-screenshot annotated images
• 5. Cluster Annotation: annotated clusters
• 6. Association of Images to Clusters: pHashes (all Web communities) matched to annotated clusters, giving occurrences of memes in all Web communities
• 7. Analysis and Influence Estimation
20. Let’s see our data sources…
21. Know Your Meme (KYM)
• Crowdsourced encyclopedia for Memes
• Provides useful metadata
• E.g., origin, descriptive tags, description, examples, image galleries
• Built custom crawler
• Obtained data for 15K KYM entries
• Download every image per entry (706K)
22. Datasets
Dataset                  Twitter  Reddit  /pol/  Gab   KYM
# of posts               1.4B     1.0B    48M    12M   15K
# of posts with images   242M     62M     13M    955K  15K
# of images              114M     40M     4M     235K  706K
23. Perceptual hashing extraction
24. Perceptual hashing (pHash)
• Generates a hash for each image
• Visually similar images have minor differences in
their hashes
• Reduces dimensionality of the images
• Ran the pHash algorithm on:
• All images from KYM (706K)
• All images from Twitter, Reddit, /pol/, and Gab (159.5M)
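To make the idea concrete, here is a minimal sketch of a DCT-based perceptual hash. It is not the pHash implementation used in the talk: the function names (`dct_phash`, `hamming`), the 32x32 input size, the 64-bit output, and the unnormalized DCT are all assumptions made for illustration, and synthetic grayscale grids stand in for decoded, resized images.

```python
import math
import random
import statistics


def dct_phash(gray, hash_size=8):
    """Hash a square grayscale grid (list of rows) into hash_size**2 bits.

    Simplified scheme (an assumption, not the authors' exact method):
    keep the low-frequency corner of an unnormalized 2D DCT-II and
    threshold each coefficient against the median.
    """
    n = len(gray)
    # cos_table[k][i] = cos(pi * (2i + 1) * k / (2n)), the DCT-II basis.
    cos_table = [
        [math.cos(math.pi * (2 * i + 1) * k / (2 * n)) for i in range(n)]
        for k in range(hash_size)
    ]
    coeffs = []
    for u in range(hash_size):
        for v in range(hash_size):
            coeffs.append(
                sum(
                    gray[x][y] * cos_table[u][x] * cos_table[v][y]
                    for x in range(n)
                    for y in range(n)
                )
            )
    median = statistics.median(coeffs)
    bits = 0
    for c in coeffs:
        bits = (bits << 1) | int(c > median)
    return bits


def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


# Synthetic stand-ins for images (hypothetical data).
rng = random.Random(0)
image = [[rng.uniform(0, 255) for _ in range(32)] for _ in range(32)]
brighter = [[p + 10 for p in row] for row in image]  # same picture, brighter
other = [[rng.uniform(0, 255) for _ in range(32)] for _ in range(32)]

print("image vs. brighter copy:", hamming(dct_phash(image), dct_phash(brighter)))
print("image vs. unrelated image:", hamming(dct_phash(image), dct_phash(other)))
```

A uniform brightness change only moves the lowest-frequency coefficient, so the two hashes stay close, while an unrelated image lands far away. Each 1,024-pixel grid is reduced to a single 64-bit integer, which is what makes comparisons at this scale tractable.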
25. Creating clusters of images/memes
26. Pairwise comparisons and clustering
• Calculated all pairwise comparisons between
all pHashes from /pol/, The_Donald, and Gab
• Used TensorFlow and GPUs to speed up the process
• Hamming distance
• Performed clustering using:
• DBSCAN algorithm
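The two steps above can be sketched on a toy scale as follows. This is a plain-Python, CPU-only illustration, not the TensorFlow/GPU implementation mentioned on the slide; the `eps` and `min_samples` values and the example hashes are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def dbscan(hashes, eps, min_samples):
    """Cluster integer hashes with DBSCAN under Hamming distance.

    Returns one label per hash: 0, 1, ... for clusters, -1 for noise.
    A point counts itself among its neighbors.
    """
    n = len(hashes)
    # All pairwise comparisons (quadratic; fine only for small inputs).
    neighbors = [
        [j for j in range(n) if hamming(hashes[i], hashes[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # noise, unless a later cluster claims it
            continue
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
        cluster += 1
    return labels


# Two groups of near-duplicate hashes plus one outlier (hypothetical values).
base_a = 0x0F0F0F0F0F0F0F0F
base_b = 0xFFFF0000FFFF0000
hashes = (
    [base_a] + [base_a ^ (1 << k) for k in range(3)]
    + [base_b] + [base_b ^ (1 << k) for k in range(3)]
    + [0xAAAAAAAAAAAAAAAA]
)
labels = dbscan(hashes, eps=4, min_samples=3)
print(labels)
```

DBSCAN suits this setting because the number of memes is not known in advance and images that resemble nothing else are simply left as noise.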
28. Annotating clusters
29. Annotating clusters
• Calculated medoid of each cluster
• “Representative” image in cluster
• Compared all medoids with all KYM images
• We have a hit if the Hamming distance is <= a predefined threshold
• Assign the representative label according to:
• Number of hits
• Average distance between all hits
• Performed small-scale evaluation of
annotations
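A minimal sketch of the annotation step, under stated assumptions: the slide does not say how the number of hits and the average distance are combined, so this version prefers the entry with the most hits and breaks ties by the lower mean distance. The threshold value, the entry names, and the hashes are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def medoid(hashes):
    """The member with the smallest total distance to all other members."""
    return min(hashes, key=lambda h: sum(hamming(h, other) for other in hashes))


def annotate(cluster_hashes, kym_galleries, threshold=8):
    """Pick a KYM entry for a cluster, or None if nothing matches.

    kym_galleries maps an entry name to the hashes of its gallery images.
    Ranking rule (an assumption): most hits first, then lowest mean distance.
    """
    representative = medoid(cluster_hashes)
    best = None
    for entry, gallery in kym_galleries.items():
        distances = [hamming(representative, g) for g in gallery]
        hits = [d for d in distances if d <= threshold]
        if not hits:
            continue
        score = (-len(hits), sum(hits) / len(hits), entry)
        if best is None or score < best:
            best = score
    return None if best is None else best[2]


# Hypothetical cluster and galleries.
base = 0x0F0F0F0F0F0F0F0F
cluster = [base ^ 1, base, base ^ 2, base ^ 4]
galleries = {
    "entry_a": [base ^ 0b11000, base ^ (1 << 10)],  # two close images
    "entry_b": [base ^ (1 << 20)],                  # one close image
    "entry_c": [base ^ 0xFFFFFFFF],                 # far away
}
print(annotate(cluster, galleries))
```

Comparing only the medoid against the KYM images keeps the cost at one comparison per cluster per gallery image, rather than one per cluster member.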
30. Finding all memes and analyzing final dataset
32. Studying specific groups of memes
• Focus on racist and political memes
• Use KYM tags to find relevant memes
• “politics,” “2016 us presidential election,” “trump,” and
“clinton” tags
• “racism,” “racist,” or “antisemitism” tags
• Obtain 117 racist memes and 556 political memes
from KYM dataset
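The tag-based selection can be sketched as a simple set intersection. The record layout (`name`, `tags`) and the sample entries are hypothetical, and matching on any one of the listed tags is an assumption; only the tag strings come from the slide.

```python
POLITICAL_TAGS = {"politics", "2016 us presidential election", "trump", "clinton"}
RACIST_TAGS = {"racism", "racist", "antisemitism"}


def select_entries(entries, wanted_tags):
    """Names of entries carrying at least one of the wanted tags."""
    return [
        entry["name"]
        for entry in entries
        if wanted_tags & {tag.lower() for tag in entry["tags"]}
    ]


# Hypothetical KYM-style records.
entries = [
    {"name": "entry-1", "tags": ["Politics", "debate"]},
    {"name": "entry-2", "tags": ["cats"]},
    {"name": "entry-3", "tags": ["racism", "trump"]},
]
print(select_entries(entries, POLITICAL_TAGS))
print(select_entries(entries, RACIST_TAGS))
```

An entry can fall into both groups, as the third record does here.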
33. How are memes shared over time?
(Figure, built up across slides 33 to 38: two time-series panels, “Political Memes” and “Racist Memes”. Annotations: 2nd US presidential debate; 2016 US elections; Gab activity increase in 2017 (marked twice); /pol/ constant share.)
39. How to quantify the influence?
• Hawkes processes
• Assume K processes
• Each with a rate of events (i.e., posting of a meme),
called the background rate
• An event can cause impulse responses in other
processes
• Increases the rates of other processes for a period of
time
• Enables us to assess root cause of events
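The mechanism described above can be illustrated with a small function that evaluates the rate of one process at a given time. This is a sketch under assumptions not stated in the talk: continuous time, an exponentially decaying impulse response, and hypothetical names (`intensity`, `background`, `weights`, `decay`).

```python
import math


def intensity(k, t, events, background, weights, decay=1.0):
    """Rate of process k at time t.

    events: list of (time, process_index) pairs.
    background[k]: background rate of process k.
    weights[j][k]: expected number of events on k triggered by one event on j.
    The impulse response decay * exp(-decay * dt) integrates to 1, so the
    weight alone controls how many follow-up events an event causes.
    """
    rate = background[k]
    for t_i, j in events:
        if t_i < t:
            rate += weights[j][k] * decay * math.exp(-decay * (t - t_i))
    return rate


# Two processes; an event on process 0 temporarily raises the rate of process 1.
background = [0.1, 0.2]
weights = [[0.0, 0.5], [0.0, 0.0]]
events = [(1.0, 0)]

before = intensity(1, 0.5, events, background, weights)
after = intensity(1, 1.0 + math.log(2), events, background, weights)
print(before, after)
```

Before the event, process 1 runs at its background rate; afterwards its rate is raised and decays back over time. Attributing each event either to the background rate or to an earlier event on some process is what allows root causes to be assessed.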
48. For our purposes…
• Hawkes model with 5 processes
• One for each platform/community (/pol/,
The_Donald, Reddit, Twitter, Gab)
• Distinct model for each cluster; fit each
model with Gibbs sampling
• Calculate the influence and efficiency of each
community
50. Communities’ efficiency (racist memes)
If we look at the influence normalized by the number of
memes posted, The_Donald is the most efficient in terms of
disseminating memes
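One way to read this normalization is sketched below. The exact definition used in the talk may differ; here efficiency is taken to be the number of events a community causes on other communities per 100 events it posts itself. The community names and all counts are hypothetical.

```python
def influence_and_efficiency(caused, posted):
    """Return {community: (external_influence, efficiency_percent)}.

    caused[src][dst]: events on dst attributed to src by a fitted model.
    posted[src]: number of events posted on src.
    """
    result = {}
    for src, targets in caused.items():
        external = sum(n for dst, n in targets.items() if dst != src)
        result[src] = (external, 100.0 * external / posted[src])
    return result


# Hypothetical counts: "big" posts a lot, "small" posts little.
posted = {"big": 1000, "small": 100}
caused = {
    "big": {"big": 0, "small": 50},
    "small": {"big": 20, "small": 0},
}
print(influence_and_efficiency(caused, posted))
```

In this toy example the larger community has more raw influence, but the smaller one causes more external events per meme it posts, which is the sense in which a community can be more efficient.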
51. Summary
• Proposed meme processing pipeline
• Code and datasets available on GitHub
(https://github.com/memespaper/memes_pipeline)
• Important differences between the memes posted on
Web communities
• Quantified influence among Web communities
52. Now For Some “Fun”
• As researchers, our goal is to share
what we learn
• We write papers; sometimes
people even read them!
• Unfortunately, this can attract some
unwanted attention…
54. Nature did a really nice interview with Gianluca about our work.
We were pretty excited!
“Wow! We get to share our work with a general audience!!!”
58. From what we could determine, this image was produced by Daily Stormer users…
A literal, non-satirical neo-Nazi community
62. Remarks
• These types of problems are not easy for researchers
• It’s disturbing content; stressful and emotionally draining
• We are putting ourselves at personal risk of attack
• As researchers studying these fringe Web communities, we should be prepared for this kind of personal attack
• We should regularly check in with colleagues/students working on these communities to ensure that they do not sink into the cesspool