Social media have transformed the Web into an interactive sharing platform where users upload data and media, comment on them, and share this content within their social circles. The large-scale availability of user-generated content on social media platforms has opened up new possibilities for studying and understanding real-world phenomena, trends and events. The objective of this talk is to provide an overview of social media mining, which offers a unique opportunity to discover, collect, and extract relevant information in order to provide useful insights. It covers key challenges and issues, such as fighting misinformation; data collection, analysis and visualization components; and applications, results and demos from multiple areas ranging from news to environment and security.
Social media mining for sensing and responding to real-world trends and events
1. Social media mining for sensing and responding to
real-world trends and events
Dr. Yiannis Kompatsiaris, ikom@iti.gr
Multimedia, Knowledge and Social Media Analytics Lab, Head
CERTH-ITI
CLEF 2020
Thessaloniki, Greece
September 2020
3. CLEF 2020 Social Media Mining
Pope Francis
Pope Benedict
2007: iPhone release
2008: Android release
2010: iPad release
http://petapixel.com/2013/03/14/a-starry-sea-of-cameras-at-the-unveiling-of-pope-francis/
4. CLEF 2020 Social Media Mining
Hillary Clinton's Epic Group Selfie
5. CLEF 2020 Social Media Mining
User
Profile
Tags
Social Media aspects
9. CLEF 2020 Social Media Mining
Social Networks as Real-Life Sensors
• Social networks are a data source with an
extremely dynamic nature that reflects
events and the evolution of community
focus (users’ interests)
• High penetration of smartphones and mobile
devices provides real-time and
location-based user feedback
• Transform individually rare but
collectively frequent media to meaningful
topics, events, points of interest, emotional
states and social connections
• Present in an efficient way for a variety of
applications (news, security (cyber and
physical), marketing, science, health)
10. CLEF 2020 Social Media Mining
Real-life Social Networks
• Social networks have emergent
properties. Emergent properties
are new attributes of a whole
that arise from the interaction
and interconnection of the parts
• Emotions, Health, Sexual
relationships depend on our
connections (e.g. number of
them) and on our position -
structure in the social graph
• Central – Hub
• Outlier
• Transitivity (connections between
friends)
12. CLEF 2020 Social Media Mining
Example – twitter and earthquakes
13. CLEF 2020 Social Media Mining
Conceptual Architecture (figure; component labels by layer)
• CRAWLING: API Wrapper, Website Wrapper, Scheduler
• INDEXING: Visual Indexing, Near-duplicates, Text Indexing, Media Fetcher
• MINING: SNA, Sentiment - Influence, Trends - Topics, Model Building, Concepts
• RANKING: Relevance, Diversity, Popularity, Veracity
• VISUALIZATION: Interaction, Responsiveness, Aggregation, Aesthetics
• Other labels: Crawling Specs, Sources; ANALYSIS, PRESENTATION
14. CLEF 2020 Social Media Mining
Challenges – Content (Indexing - Mining)
•Multi-modality: e.g. image + tags, video, audio
•Rich social context: spatio-temporal, social connections,
relations and social graph
•Specific messages: short, conversations, errors, no context
•Inconsistent quality: noise, spam, fake, propaganda
•Huge volume: Massively produced and disseminated
•Multi-source: may be generated by different applications and
user communities
•Dynamic: Fast updates, real-time
15. CLEF 2020 Social Media Mining
Policy – Licensing – Legal challenges
• Fragmented access to data
– Separate wrappers/APIs for each source (Twitter, Facebook, etc.)
– Different data collection/crawling policies
• Limitations imposed by API providers (“Walled Gardens”)
• Full access to data impossible or extremely expensive (e.g. see data
licensing plans for GNIP and DataSift)
• Non-transparent data access practices (e.g. access is provided to an
organization/person if they have a contact in Twitter)
• Constant change of model and ToS of social APIs
– No backwards compatibility, additional development costs
• Ephemeral nature of content
• Social search results often lead to removed content, which makes
referencing inconsistent and unreliable
• User Privacy & Purpose of use
• Fuzzy regulatory framework regarding mining user-contributed data
17. CLEF 2020 Social Media Mining
The Rise of Fake News
https://trends.google.com/trends/explore?date=all&geo=US&q=fake%20news
US Elections 2016
Volume for query “fake news” over time: A key milestone has
been the US Elections in 2016, which marked the beginning of
large-scale coordinated disinformation campaigns.
18. CLEF 2020 Social Media Mining
Key Concepts
• Fake news: popular term for the phenomenon of
disinformation, but currently avoided by academics and the
EC because it is often misused by Trump and the
alt-right
• Disinformation: general term that typically refers to
intentional (and often coordinated) efforts to spread
misleading information to the public
• Misinformation: refers to misleading content and information
but not necessarily intentional
• Propaganda: refers to coordinated campaigns aiming to
spread a particular ideology or belief
• Manipulated content: Also known as tampered or doctored.
Refers to multimedia content that has been digitally altered
typically for malicious purposes.
19. CLEF 2020 Social Media Mining
The Diffusion of Fake News
Example cascade
Vosoughi, S., Roy, D., & Aral, S. (2018). The spread of true and false news
online. Science, 359(6380), 1146-1151.
Number of cascades
Topic frequency
Misleading posts tend to
spread faster and wider
compared to accurate ones.
20. CLEF 2020 Social Media Mining
The Famous Shark
https://www.snopes.com/photos/animals/puertorico.asp
2005
23. CLEF 2020 Social Media Mining
A bit of “historical” background
2011-2014 2013-2016 2016-2018
Trend detection
Social media search Quality & veracity of
social media
Media forensics
Social media video
verification
Reverse video search
2018-2021
Deepfake detection
Deep learning-assisted
forensics and analysis
EU funded projects
24. CLEF 2020 Social Media Mining
Overview of Media Verification Resources
Tools/Approaches
• Social media verification
– Tweet Credibility Classification
– Context Analysis and Aggregation
• Multimedia forensics
– Image Verification Assistant
– Video forensics
• Reverse-image and video search
Datasets
• Tweet verification Corpus
• Fake Video Corpus
• FIVR-200K
25. CLEF 2020 Social Media Mining
Tweet Credibility Classification - Features
Credibility cues (aka features)
26. CLEF 2020 Social Media Mining
Tweet Credibility Classification - Model
27. CLEF 2020 Social Media Mining
Tweet Credibility Classification - Evaluation
92.5% accuracy in identifying misleading posts
88-98% accuracy depending on language
(major languages tested: en, fr, es, nl)
New features and agreement-based retraining led to
significant improvements! One of the top performing
methods in the MediaEval VMU 2015 & 2016 tasks!
Boididou, C., Papadopoulos, S., Zampoglou, M., Apostolidis, L., Papadopoulou, O.,
& Kompatsiaris, Y. (2018). Detection and visualization of misleading content on
Twitter. International Journal of Multimedia Information Retrieval, 7(1), 71-86.
28. CLEF 2020 Social Media Mining
Image Verification Assistant – Intro (1/2)
Types of Multimedia Manipulation: copy-move, splicing, in-painting, retouching
29. CLEF 2020 Social Media Mining
Image Verification Assistant – Intro (2/2)
Image Capturing & Tampering Process (figure)
DIGITAL CAMERA: Real-world scene → Lens → Optical filter → CFA pattern (R, G, G, B) → Imaging sensor (e.g. CCD) → CFA interpolation → In-camera SW processing → In-camera JPEG compression → Digital image
Out of camera: SW processing
Piva, A. (2013). An overview on image forensics. ISRN Signal Processing, 2013.
30. CLEF 2020 Social Media Mining
Image Verification Assistant – Forensics
Assume that when a “foreign” object is inserted into an
image, some traces of it will be possible to detect.
• Noise-based methods try to locate areas where the
noise patterns are different compared to the rest.
• JPEG compression analysis methods try to locate
areas where some JPEG-specific property is different,
e.g. 8x8 grid, DCT quantization, etc.
• Machine learning-based methods try to locate areas
that look like areas of tampered images that were
used to “train” them.
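A minimal sketch of the noise-based idea above, assuming a grayscale image held in a NumPy array. This is not the algorithm used in the Image Verification Assistant; the function names, the 8-pixel block size and the threshold factor are hypothetical choices made for illustration.

```python
import numpy as np

def block_noise_map(gray, block=8):
    """Return the per-block standard deviation of a high-pass residual."""
    gray = np.asarray(gray, dtype=float)
    # High-pass residual: pixel minus the mean of its 4 neighbours (edges wrap).
    neighbours = (
        np.roll(gray, 1, axis=0) + np.roll(gray, -1, axis=0)
        + np.roll(gray, 1, axis=1) + np.roll(gray, -1, axis=1)
    ) / 4.0
    residual = gray - neighbours
    h, w = gray.shape
    h, w = h - h % block, w - w % block  # crop to a whole number of blocks
    tiles = residual[:h, :w].reshape(h // block, block, w // block, block)
    return tiles.std(axis=(1, 3))

def suspicious_blocks(noise_map, factor=3.0):
    """Flag blocks whose noise level is far from the image-wide median."""
    median = np.median(noise_map)
    # Median absolute deviation as a robust spread estimate.
    mad = np.median(np.abs(noise_map - median)) + 1e-9
    return np.abs(noise_map - median) > factor * mad
```

The resulting boolean map plays the role of a very coarse tampering heat map: blocks whose noise statistics differ from the rest of the image are candidates for closer inspection, not proof of manipulation.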
MeVer – Media Verification (mever.iti.gr)
31. CLEF 2020 Social Media Mining
Image Verification Assistant – Forensics
Zampoglou, M., Papadopoulos, S., & Kompatsiaris, Y. (2015). Detecting image splicing in the wild (web).
In International Conference on Multimedia & Expo Workshops (ICMEW), 2015 (pp. 1-6). IEEE
The Challenge of Image
Forensics on the (Wild) Web!
32. CLEF 2020 Social Media Mining
Image Verification Assistant - UI
http://reveal-mklab.iti.gr/
Zampoglou, M., Papadopoulos, S., Kompatsiaris, Y., Bouwmeester, R., & Spangenberg, J. (2016, April). Web
and Social Media Image Forensics for News Professionals. In SMN@ ICWSM.
33. CLEF 2020 Social Media Mining
Image Verification Assistant - Comparison
FotoForensics1 Forensically2 Ghiro3 Ours
ELA X X X X
Ghost X
DW Noise X
Median Noise X X
Block Artifact X
Double Quantization X
Deep Learning-based X
Copy-move X* X
Thumbnail X X
Metadata X X X X
Geotagging X X X X
Reverse search X
*Forensically implements a very simple block-matching algorithm with low robustness
1 http://fotoforensics.com
2 http://29a.ch/photo-forensics/
3 http://www.imageforensic.org/
34. CLEF 2020 Social Media Mining
The Fake Video Corpus - Overview
• 200 fake and 188 real newsworthy videos
• 2206 fake and 1209 real near-duplicates
• 388 cascades of near-duplicate videos
https://mklab.iti.gr/results/fake-video-corpus/
35. CLEF 2020 Social Media Mining
The Fake Video Corpus - Analysis
• Fake videos keep
reappearing years
later
• Real videos tend
to be reproduced
mostly during the
first month
Papadopoulou, O., Zampoglou, M., Papadopoulos, S., & Kompatsiaris, I. (2019). A corpus
of debunked and verified user-generated videos. Online information review.
36. CLEF 2020 Social Media Mining
The Rise of Deepfakes
• Synthetic media become increasingly realistic mainly
using Generative Adversarial Networks
• We seem to get into an arms race on disinformation!
• Novel solutions beyond supervised learning models
will be needed!
38. CLEF 2020 Social Media Mining
Approach
• Thousands of tweets are generated during a crisis
event in a specific location
256% rise in Italian tweets about floods
on Thu, 01 November 2018 16:12 in Veneto
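As an illustration of how a figure such as "256% rise" could be computed, the sketch below compares the tweet count in the current time window against a baseline count. The slide does not state how the baseline is obtained, so the function name and the choice of baseline are assumptions.

```python
def percent_rise(current_count, baseline_count):
    """Relative increase of the current window over a baseline, in percent."""
    if baseline_count <= 0:
        raise ValueError("baseline_count must be positive")
    return 100.0 * (current_count - baseline_count) / baseline_count

# Hypothetical counts: 25 tweets in a typical window, 89 in the current one.
rise = percent_rise(89, 25)
```

With these hypothetical counts the rise is 256%; an alert could be raised whenever the value exceeds a threshold chosen per use case.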
40. CLEF 2020 Social Media Mining
Problem – Challenges – Existing Limitations
• Civil protection agencies and local authorities require
timely access to citizen observations during a crisis event
to estimate the
– Location of a crisis event (e.g. floods, fires, etc.)
– Relevance of each tweet
– Concepts of the image (e.g. people in danger)
• Challenges and existing limitations include:
– Management of large streams of data for event
detection
– Disambiguation from multimodal content (text/image)
– Limited location information (only as mention in text)
41. CLEF 2020 Social Media Mining
Social Media Data Mining
• Focusing on Twitter posts, collected with Twitter Streaming API
https://developer.twitter.com/en/docs/tweets/filter-realtime/overview
• Various analysis techniques to obtain further knowledge on the tweets
• The complete flow:
Inputs: search terms (keywords, accounts, bounding boxes) and keys & tokens
→ Twitter Streaming API
→ Client receives each new tweet
→ Get tweet in JSON format & find matching use case
→ Analysis modules: fake tweets detection, text classification, image classification, nudity detection, tweets localisation, concept extraction (with a branch for tweets that have an image)
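The flow above can be sketched as follows. This is a hedged illustration rather than the project's implementation: the use-case table, its values and the step names are hypothetical, and the assignment of image classification, nudity detection and concept extraction to the "tweet has image" branch is an assumption based on the diagram. The JSON fields used (`text`, `user.screen_name`, `coordinates.coordinates`, `entities.media`) follow the tweet objects delivered by the v1.1 streaming endpoints.

```python
import json

# Hypothetical use-case definitions; names and values are illustrative only.
USE_CASES = {
    "floods_italy": {
        "keywords": {"alluvione", "allagamento"},
        "accounts": set(),
        "bbox": (6.6, 35.5, 18.5, 47.1),  # (west, south, east, north), approximate
    },
}

def in_bbox(point, bbox):
    """True if a (lon, lat) point lies inside a (west, south, east, north) box."""
    lon, lat = point
    west, south, east, north = bbox
    return west <= lon <= east and south <= lat <= north

def match_use_cases(tweet):
    """Return the names of all use cases matched by keyword, account or location."""
    text = tweet.get("text", "").lower()
    author = tweet.get("user", {}).get("screen_name", "").lower()
    coords = (tweet.get("coordinates") or {}).get("coordinates")
    matched = []
    for name, spec in USE_CASES.items():
        if (any(k in text for k in spec["keywords"])
                or author in spec["accounts"]
                or (coords is not None and in_bbox(coords, spec["bbox"]))):
            matched.append(name)
    return matched

def plan_analysis(raw_json):
    """Parse one tweet and list the analysis steps it should go through."""
    tweet = json.loads(raw_json)
    steps = ["fake_tweet_detection", "text_classification", "localisation"]
    if tweet.get("entities", {}).get("media"):  # the tweet has an image
        steps += ["image_classification", "nudity_detection", "concept_extraction"]
    return match_use_cases(tweet), steps
```

Keeping the matching step separate from the analysis plan means that new use cases only require a new entry in the table, not a change to the processing code.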
42. CLEF 2020 Social Media Mining
Datasets
• Benchmark datasets (e.g. MediaEval tasks)
• Collected datasets about crisis events
– 10 m. about fires in Spain
– 75 k. about floods in Italy
– 74 k. about heatwave in Greece
– 42 k. about snow in Finland
43. CLEF 2020 Social Media Mining
Estimation of the location mentioned in a tweet
• Localisation steps after Named Entity Recognition (NER) has been performed on available tweets
• Results of the NER task for English
Dataset (CoNLL2003)                                    Precision       Recall          F1-score
Our system (ELMo embeddings)                           91.63           93.01           92.32
Best-scoring CoNLL2003 system: Florian et al., 2003    88.99           88.54           88.76
Baevski, A. et al. 2019                                (not reported)  (not reported)  93.5
• Results of the NER task for Italian
Dataset (EVALITA2009)                                  Precision       Recall          F1-score
Our system (GloVe embeddings)                          75.49           75.60           75.37
Best-scoring shared task system: FBK_ZanoliPianta      84.07           80.02           82.00
Nguyen and Moschitti, 2012                             85.99           82.73           84.33
44. CLEF 2020 Social Media Mining
Concept Detection in Social Media Images
• Extracts high-level concepts from visual low-level information
• A pre-trained 22-layer GoogLeNet DCNN is fine-tuned to recognize the 345
TRECVID INS concepts, and thresholding keeps only the concepts with higher probability
• Concept examples: animal, boat_ship, clouds, waterscape_waterfront
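The thresholding step can be sketched as below, assuming the network outputs one probability per concept. The function name and the 0.5 threshold are hypothetical; the slides do not state the value actually used.

```python
def keep_confident_concepts(probabilities, threshold=0.5):
    """Keep concepts whose probability reaches the threshold, best first."""
    kept = [(concept, p) for concept, p in probabilities.items() if p >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)

# Hypothetical network output for one image.
scores = {"boat_ship": 0.9, "animal": 0.2, "clouds": 0.6}
concepts = keep_confident_concepts(scores)
```

Here `concepts` holds `boat_ship` followed by `clouds`; `animal` is dropped because it falls below the threshold.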
45. CLEF 2020 Social Media Mining
CERTH-ITI participation in MediaEval 2018
First in the social media image classification (Average F1-score)
https://www.youtube.com/watch?v=yq1nIPc6dWw&list=PLOPRp1vNOG9ahE5viJmF6Gx8XDk8hG9MP&index=2&t=0s
46. CLEF 2020 Social Media Mining
Demo
• Social media dashboard in EOPEN project:
– https://eopen.spaceapplications.com/dashboard/
– Dashboards Social Media
48. CLEF 2020 Social Media Mining
Multiple identities detection in social media:
sockpuppets, doppelgängers, and more
• Users often hold several accounts in their effort to multiply the
spread of their thoughts, ideas, and viewpoints
• Illegal & abusive activities: creation of multiple accounts to bypass
the combating measures enforced by social media platforms
Figure: Kumar et al. “An Army of Me: Sockpuppets in Online Discussion Communities” WWW 2017
User Identity Linkage
Detect accounts likely to belong to
the same natural person
(“linked accounts”)
49. CLEF 2020 Social Media Mining
Approach
Data Collection → Feature extraction → User Modeling → Classification → Linked accounts detection
• Feature extraction: Profile (P), Activity (A), Linguistic (L), Network (N)
• User Modeling: Individual representation, Joint representation
• Classification: Probabilistic, Tree-based, Ensemble, Neural networks
51. CLEF 2020 Social Media Mining
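User Modeling: Individual representation
Feature sets: S = {P, A, L, N}
• For each user $u_i$: $V_S^{u_i} = \langle f_{S,i1}, f_{S,i2}, \ldots, f_{S,ij}, \ldots, f_{S,in} \rangle$
• $f_{S,ij}$: the $j$-th feature of category $S$ for $u_i$
• $n$: total number of features for category $S$
Example: $V_N^{u_i} = \langle \mathit{authority}_i, \mathit{hub}_i, \ldots, \mathit{PageRank}_i \rangle$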
52. CLEF 2020 Social Media Mining
User Modeling: Joint representation
1. abs: absolute difference of feature vectors of 𝑢𝑖, 𝑢𝑗
2. sim: similarity of the per-category feature vector (Cosine similarity,
Euclidean distance, Manhattan distance)
3. Similarity of the content posted by users 𝑢𝑖, 𝑢𝑗
• edits: edit distance - Levenshtein distance
• sem: semantic similarity - vector space model approach (word
embeddings)
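A minimal sketch of the joint representation for a pair of users, assuming each user is already described by a numeric feature vector and a piece of posted text. The function names are hypothetical, and the semantic similarity (sem) is left out because it needs a word-embedding model.

```python
import math

def abs_diff(u, v):
    """abs: element-wise absolute difference of two feature vectors."""
    return [abs(a - b) for a, b in zip(u, v)]

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def levenshtein(s, t):
    """edits: minimum number of single-character edits turning s into t."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def joint_representation(vec_i, vec_j, text_i, text_j):
    """Concatenate abs, sim and edits features for the pair (u_i, u_j)."""
    return abs_diff(vec_i, vec_j) + [
        cosine_similarity(vec_i, vec_j),
        euclidean(vec_i, vec_j),
        manhattan(vec_i, vec_j),
        levenshtein(text_i, text_j),
    ]
```

The output vector is what a pair classifier would consume: one row per pair of accounts, labelled linked or non-linked.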
53. CLEF 2020 Social Media Mining
Classification
• Probabilistic: Naïve Bayes, BayesNet
• Tree-based: J48, LADTree, LMT
• Ensemble: Random Forest (RF), AdaBoost and voting
ensembles
• Deep Neural Network
• Recurrent Neural Network (RNN)
• Combined Network: Text classification network +
Metadata network
54. CLEF 2020 Social Media Mining
Comparison to other approaches
[1] Fredrik Johansson, Lisa Kaati, and Amendra Shrestha (2013). Detecting multiple aliases in social media. In Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining.
[2] Thamar Solorio, Ragib Hasan, and Mainul Mizan (2013). A case study of sockpuppet detection in Wikipedia. In Proceedings of the Workshop on Language Analysis in Social Media. ACL.
[3] Michail Tsikerdekis and Sherali Zeadally (2014). Multiple account identity deception detection in social media using nonverbal behavior. IEEE Transactions on Information Forensics and Security 9, 8.
[4] Fredrik Johansson, Lisa Kaati, and Amendra Shrestha (2015). Timeprints for identifying social media users with multiple aliases. Security Informatics 4, 1.
[5] Srijan Kumar, Justin Cheng, Jure Leskovec, and VS Subrahmanian (2017). An army of me: Sockpuppets in online discussion communities. In Proceedings of the 26th International Conference on World Wide Web.
[6] Jan Pennekamp, Martin Henze, Oliver Hohlfeld, and Andriy Panchenko (2019). Hi Doppelgänger: Towards Detecting Manipulation in News Comments. In Companion Proceedings of The 2019 World Wide Web Conference.
[7] Despoina Chatzakou, Juan Soler-Company, Theodora Tsikrika, Leo Wanner, Stefanos Vrochidis, and Ioannis Kompatsiaris (2020). User Identity Linkage in Social Media Using Linguistic and Social Interaction Features. In Proceedings of the 2020 ACM on Web Science Conference.
Features Classifier
Activity Linguistic Network Traditional
ML
NN
Character Word Sentence Dictionary Syntactic Distribution Segmentation Connection
Johansson et al. [1] X X X X
Solorio et al. [2] X X X X X
Tsikerdekis et al. [3] X X
Johansson et al. [4] X X X X X
Kumar et al. [5] X X X X X X X X X
Pennekamp et al. [6] X X X X X
Ours [7] X X X X X X X X X X X
55. CLEF 2020 Social Media Mining
Datasets and Ground Truth
Manual creation of the ground truth due to the absence of ground truth that
indicates which user accounts belong to the same person
• Split each account 𝑢𝑖 (its posts) in two distinct accounts: 𝑢𝑖𝑎 and 𝑢𝑖𝑏
• linked accounts: (𝑢𝑖𝑎, 𝑢𝑖𝑏)
• non-linked accounts: (𝑢𝑖𝑎, 𝑢𝑗𝑏), where 𝑖 ≠ 𝑗
• 10% of linked and 90% non-linked accounts
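The ground-truth construction above can be sketched as follows. The slides do not say how the posts of an account are divided, so the random split, the function names and the seed are assumptions made for illustration.

```python
import random

def split_account(posts, rng):
    """Randomly split one account's posts into two pseudo-accounts (a, b)."""
    shuffled = list(posts)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def build_ground_truth(accounts, n_linked, n_non_linked, seed=0):
    """Return (halves, pairs); each pair is (user_of_half_a, user_of_half_b, label)."""
    users = sorted(accounts)
    if n_linked > len(users) or n_non_linked > len(users) * (len(users) - 1):
        raise ValueError("not enough accounts for the requested pairs")
    rng = random.Random(seed)
    halves = {u: split_account(accounts[u], rng) for u in users}
    # Linked pair: both halves come from the same original account.
    pairs = [(u, u, 1) for u in rng.sample(users, n_linked)]
    # Non-linked pair: half a of u_i with half b of u_j, where i != j.
    seen = set()
    while len(seen) < n_non_linked:
        ui, uj = rng.sample(users, 2)
        if (ui, uj) not in seen:
            seen.add((ui, uj))
            pairs.append((ui, uj, 0))
    return halves, pairs
```

Calling `build_ground_truth(accounts, 200, 1800)` would reproduce the 10% linked / 90% non-linked ratio, provided `accounts` maps at least 200 user ids to their lists of posts.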
Abusive Dataset
• June to August 2016
• Relevant to Gamergate
controversy
• Abusive-related English
hashtags
• 650K tweets and 312K users
Terrorism Dataset
• February 2017 to June 2018
• Relevant to Jihadist terrorism
• Terrorism-related Arabic
keywords
• 65K tweets and 35K users
57. CLEF 2020 Social Media Mining
Experimental Phases
• Phase 1: 10% linked & 90% non-linked
• # linked accounts: 200
• # non-linked accounts: 1,800
• Phase 2: Varying number of linked accounts
• # linked accounts: 200 to 500 with step 100
• # non-linked accounts: 1,800
• Phase 3: Varying number of non-linked accounts
• # linked accounts: 200
• # non-linked accounts: 1,800 to 39,800 with step 1,800
58. CLEF 2020 Social Media Mining
Results: Abusive dataset (Phase 1)
Features
• Traditional classifiers: Network features perform better (whether or not combined with the edit & sim
features)
• Neural Network: Linguistic features result in better performance (whether or not combined with the edit &
sim features)
Classifiers
• Random Forest achieves the best performance (AUC: 99.50%)
59. CLEF 2020 Social Media Mining
Results: Abusive dataset (Phases 2 & 3)
Varied linked accounts
• From 200 to 300: slight increase (precision, recall, accuracy)
• Stable performance: 99% AUC
Varied non-linked accounts
• Even with the highest number of non-linked user accounts,
AUC remains at around 87.30%
• Increase of precision & recall when more data are available
• At ~24k non-linked accounts, precision & recall converge
Results obtained by using Random Forest as classifier
60. CLEF 2020 Social Media Mining
Results: Terrorism dataset (Phase 1)
Features
• J48: Network features perform better than the Activity and Linguistic features
• Random Forest, BayesNet, Neural Network: Linguistic features result in better performance than the
Activity and Network features
• In most cases, all feature categories (using the abs) combined with the similarity feature vectors give the best
performance
Classifiers
• Random Forest achieves the best performance (AUC: 99.50%)
61. CLEF 2020 Social Media Mining
Results: Terrorism dataset (Phases 2 & 3)
Varied linked accounts
• Higher number of linked user accounts
=> higher precision, recall & accuracy
• Stable performance: 99% AUC
Varied non-linked accounts
• AUC fluctuates from 94% to 99.50%
• Precision & recall fluctuate from 97.1% to 99%
• Stable model even with a quite unbalanced dataset
Results obtained by using Random Forest as classifier
62. CLEF 2020 Social Media Mining
Conclusions
• Social media data useful in many applications
– From confirming existing and known correlations to prediction and decision-
making
• Many challenges exist
– Data availability and representativeness (of society, real-event)
– Coverage, robustness and reproducibility
– Authenticity (threat to democratic society)
– Real-time and scalable approaches
– Fusion of various modalities (Content, social, temporal, location)
• Required contribution from various disciplines
– Content Analytics
– Machine Learning
– Network Analysis
– Psychology – Social Sciences (patterns of presentation, sharing)
– Visualization
• Currently mostly an auxiliary means for real-events assessment and decision-
making, which can generate additional insights
63. CLEF 2020 Social Media Mining
With Contributions from
• Dr. Symeon Papadopoulos
– Social network analysis, social media content mining and multimedia indexing and
retrieval
– http://mklab.iti.gr/people/papadop
– Twitter: @sympap
• Dr. Ilias Gialampoukidis
– Social media mining and classification, topic detection, community and key-player
identification, multimodal fusion and multimedia retrieval
– http://www.researchgate.net/profile/Ilias_Gialampoukidis
• Dr. Theodora Tsikrika
– Web and social media search and mining, multimedia indexing and retrieval, AI-
based multimodal analytics, evaluation
– https://www.iti.gr/iti/people/Theodora_Tsikrika.html
• Dr. Stefanos Vrochidis
– Multimodal data fusion, web and social media mining, multimedia analysis and
retrieval, multimodal analytics
– https://sites.google.com/site/stevrochidis/
64. CLEF 2020 Social Media Mining
Support
Tools and services for Social
Media verification from a
journalistic and enterprise
perspective.
Video verification platform
including video forensics, reverse
video search and context
analysis and aggregation.
Social media verification
platform including deepfake
detection and a database of
known fakes.
InterCONnected NEXt-
Generation Immersive IoT
Platform of Crime and
Terrorism DetectiON,
PredictiON, InvestigatiON, and
PreventiON Services
EU funded projects
opEn interOperable
Platform for unified access
and analysis of Earth
observatioN data
Enhancing decision
support and
management services
in extreme weather
climate events
65. Thank you for your attention!
ikom@iti.gr
http://mklab.iti.gr
The main point here is that the original tweet which was misleading was retweeted much more than the tweet that made the correction.
https://twitter.com/Thomas_Binder/status/984934979451879424
https://twitter.com/Thomas_Binder/status/985665154695262211
https://www.dailysabah.com/syrian-crisis/2018/04/18/cardiologist-apologizes-after-falsely-accusing-white-helmets-of-staging-syria-chemical-attack
Of the many tools that our team develops, we will briefly focus on a model for tweet credibility classification and a tool for image verification based on image forensics. Also, the Fake Video Corpus will be presented.
- Automatic resaving and exif removal
Features
Tampering localization heat maps
Six state-of-the-art algorithms and one newly proposed (CAGI)
Zoom-in and overlay of heat map over image
Auxiliary features
Metadata: full listing, GPS geolocation, Exif thumbnail extraction
Reverse image search: auto-generation of link to perform search on Google Images
Quantitative
Six reference datasets (images + binary masks of tampering = “ground truth”)
Measures capturing the matching between ground truth mask and algorithm output
Comparison of 14 algorithms, “best” six plus a newly proposed one ended up in the tool
Qualitative
Informal feedback received by end users
Usability and quality of results
In the case where location entities are recognised, the bounding box of each location is retrieved via the OpenStreetMap API. In case no location entities are recognised, the organisation entities are considered. Finally, an analysis of the bounding boxes returned follows. Specifically, in case of one entity, a single bounding box is returned. However, in case of multiple entities, the bounding boxes are compared with each other in order to exclude bigger areas when a smaller - more precise one is also available and all the remaining are returned as output.
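The bounding-box comparison described above can be sketched as follows. Boxes are assumed to be (west, south, east, north) tuples already retrieved for each entity, and "bigger area" is interpreted here as a box that fully contains another returned box. Both the interpretation and the function names are assumptions, and the coordinates in the example are approximate.

```python
def contains(outer, inner):
    """True if box `outer` fully contains box `inner` and the two differ."""
    return (outer != inner
            and outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def keep_most_precise(boxes):
    """Drop every box that contains another, smaller box from the same tweet."""
    return [b for b in boxes if not any(contains(b, other) for other in boxes)]

# Hypothetical boxes for entities mentioned in one tweet.
italy = (6.6, 35.5, 18.5, 47.1)
veneto = (10.6, 44.8, 13.1, 46.7)
result = keep_most_precise([italy, veneto])
```

In this example only the Veneto box is kept, because the Italy box is the bigger area that contains it; a tweet with a single entity keeps its single box unchanged.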
- English language results are already satisfactory for the purposes of PUC 1. Although our scores are lower than the current state-of-the-art (Baevski, A. et al. 2019), they are not far off, while the model still outperforms the baseline method (Florian et al., 2003).
- The Italian dataset results are not yet satisfactory for the purposes of PUC 1. The model is still being worked on and fine-tuning is in progress. Its experimental character is apparent, since it is outperformed by the baseline method. Accuracy is expected to increase considerably once we manually enhance the dataset with our own annotations, update the annotation format and decide on final parameters.
- To the best of our knowledge there is only one freely available Finnish dataset which was extracted from the archives of Digitoday, an online technology news source. It consists of 953 annotated with six named entity classes (organisation, location, person, product (PRO), event (EVENT), and date (DATE)). The dataset in its current state is too small to be used with a DNN, so no remarkable results are expected until the issue of data size is addressed. On that account, we are working towards enhancing it, by adding more sentences of our own manual annotation efforts.
Figure: Nodes represent users and edges connect users that reply to each other. Sockpuppets (red nodes) tend to interact with other sockpuppets, and are more central in the network than ordinary users (blue nodes) – where a sockpuppet can be defined as a user account that is controlled by an individual (or puppetmaster) who controls at least one other user account.
It should be noted that not all people who have multiple online identities engage in malicious activities; however, our focus is on those who do. Of course, the proposed techniques for detecting multiple identities are applicable irrespective of the context (i.e., whether or not the motivation for creating multiple identities is malicious).
How to represent each individual user: one feature vector for each of the feature sets
How to represent pairs of users: We do that since the goal is to classify whether a pair of users are likely to belong to the same natural person.