1. Learning to Detect Misleading Content on Twitter
Christina Boididou, Symeon Papadopoulos,
Lazaros Apostolidis, Yiannis Kompatsiaris
Information Technologies Institute, CERTH, Thessaloniki, Greece
ACM International Conference on Multimedia Retrieval
June 6-9, Bucharest, Romania
3. REAL OR FAKE: THE VERIFICATION PROBLEM
Captured in Dublin’s Olympia Theatre
Mislabeled on social media as showing
the crowd at the Bataclan theatre just
before gunmen began firing.
4. TYPES OF FAKE
A fake is any post (tweet) that shares multimedia content that does not faithfully represent the event that it refers to:
• Reposting of real multimedia content
• Reposting of synthetic content/artworks
• Digital tampering/photoshop
• Speculations
6. FEATURE EXTRACTION
Features related to tweets
• e.g., number of slang words
Features related to users
• Link-based features
9. VERIFICATION CORPUS
Set of tweets T collected with a set of keywords K
Tweets contain multimedia content (Image or Video)
Debunked using reputable online resources
Publicly available corpus: https://github.com/MKLab-ITI/image-verification-corpus
193 real Images & Videos
220 fake Images & Videos
10. EXPERIMENTAL STUDY
Evaluate the fake detection accuracy on samples from new events
Accuracy: a = (number of correctly classified tweets) / (total number of tweets)
Kind of event-based cross-validation
For each event Ei -> training: 16 remaining events, testing: Ei
Additional split proposed in the MediaEval task
Random Forest of 100 trees
 Christina Boididou, Katerina Andreadou, Symeon Papadopoulos, Duc-Tien Dang-Nguyen, Giulia Boato, Michael Riegler, and Yiannis
Kompatsiaris. 2015. Verifying Multimedia Use at MediaEval 2015. In MediaEval 2015 Workshop, Sept. 14-15, 2015, Wurzen, Germany.
15. COMPARISON WITH OTHER METHODS
MCG-ICT (2015) method:
• Approach tailored to the given MediaEval dataset
• Preprocessing step that first groups tweets by their multimedia content
• Difficult to apply in realistic setting
16. TWEET VERIFICATION ASSISTANT
Visualize the verification result
Present list of extracted features
and their values
Compare these values against those of the labelled tweets in the verification corpus
HOW TO USE
Provide URL or tweet ID
Inspect the features and the
verification result (fake/real)
Find the Tweet Verification Assistant here: http://reveal-mklab.iti.gr/reveal/fake/
18. CHALLENGES AND FUTURE WORK
Making the tool usable and easy to understand by non-computer scientists
• Interpretation of Machine Learning outputs is challenging
• Difficult to create an application that journalists could rely on and trust
Test the Verification Assistant's usefulness when used by journalists/news editors
Extend the framework to other social media
Leverage method output for other verification problems 
 Olga Papadopoulou, Markos Zampoglou, Symeon Papadopoulos and Yiannis Kompatsiaris. Web Video
Verification using Contextual Cues
19. Thank you!
Get in touch:
• Christina Boididou: email@example.com / @CMpoi
• Symeon Papadopoulos: firstname.lastname@example.org / @sympap
• Lazaros Apostolidis: email@example.com
• Verification Corpus: https://github.com/MKLab-ITI/image-verification-corpus
• Tweet Verification Assistant: http://reveal-mklab.iti.gr/reveal/fake/
With the support of:
Editor's notes
In recent years, we have seen a tremendous increase in the use of social media platforms as a means of sharing content. The simplicity of sharing has led to large volumes of news content reaching huge numbers of readers in a short time. Multimedia content in particular can easily go viral, as it is easily consumed and carries entertainment value.
Given the speed of the news cycle and the competition among journalists to publish first, verification of the content is neglected or carried out in a superficial manner. This leads to the online appearance of misleading multimedia content, or, for the sake of brevity, fake content. For example, let's look at this picture: can you make a guess? Is it real or fake? Even though Saruman could well attend this meeting, the image was ultimately found to be photoshopped.
Now, let’s have a look at this image. What is your guess now?
Here we deal with another type of fake photo. It is a real photo, but it was mislabeled on social media as showing the crowd at the Bataclan theatre just before gunmen started firing.
So, as misleading or fake we consider any Twitter post that shares multimedia content that does not faithfully represent the event that it refers to.
This could include reposting of real multimedia content, reposting of synthetic content/artworks, digital tampering/photoshopping, or speculations.
In order to deal with the verification problem, we present a robust approach for detecting in real time whether a tweet that shares a multimedia item is fake or real.
The proposed framework relies on two independent classification models built on the training data (the verification corpus) using different sets of features: tweet-based and user-based. A bagging technique is used when building the models: we use n subsets of tweets, each including an equal number of samples from each class, leading to the creation of n classifiers. The final prediction is the majority vote among the n predictions.
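The balanced-bagging scheme just described can be sketched as follows. This is a minimal illustration using scikit-learn; the function names, the choice of n_bags, and the base classifier configuration are placeholders rather than the authors' exact implementation:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def train_bagged(X, y, n_bags=5, seed=0):
    """Train n_bags classifiers, each on a class-balanced subset of (X, y).

    y is assumed binary: 1 = fake, 0 = real.
    """
    rng = np.random.default_rng(seed)
    fake_idx = np.where(y == 1)[0]
    real_idx = np.where(y == 0)[0]
    per_class = min(len(fake_idx), len(real_idx))  # equal samples per class
    models = []
    for _ in range(n_bags):
        idx = np.concatenate([
            rng.choice(fake_idx, per_class, replace=False),
            rng.choice(real_idx, per_class, replace=False),
        ])
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X[idx], y[idx])
        models.append(clf)
    return models

def predict_majority(models, X):
    """Final label is the majority vote among the per-bag predictions."""
    votes = np.stack([m.predict(X) for m in models])  # (n_bags, n_samples)
    return (votes.mean(axis=0) >= 0.5).astype(int)
```

Using an odd n_bags avoids ties in the majority vote; the same scheme is trained once on tweet-based features and once on user-based features to obtain the two independent models.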
At prediction time, an agreement-based retraining technique is employed, which combines the outputs of the two models. The outcome is then visualized to the users, using information from the labelled verification corpus.
The selection of our features was carried out following a thorough study of the way journalists verify content on the web.
We have defined two sets of features: the tweet-based features, extracted from the tweet itself, and the user-based features, extracted from the account that posted it.
Assess the trust of the website
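A few of the tweet-based features mentioned above could be computed along these lines. This is an illustrative sketch only: the slang list is a toy placeholder and the actual feature set used in the framework is considerably larger:

```python
import re

# Toy slang list for illustration only; the real feature extractor would
# use a proper slang dictionary.
SLANG = {"lol", "omg", "wtf", "lmao"}

def tweet_features(text: str) -> dict:
    """Compute a handful of simple tweet-based features from the tweet text."""
    words = re.findall(r"\w+", text.lower())
    return {
        "text_length": len(text),
        "num_words": len(words),
        "num_slang": sum(w in SLANG for w in words),
        "num_hashtags": text.count("#"),
        "contains_link": int("http" in text.lower()),
        "num_exclamations": text.count("!"),
    }
```

For example, `tweet_features("OMG look at this!! http://t.co/x #fake")` flags one slang word, one hashtag, two exclamation marks and the presence of a link.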
A key novelty in our approach is the ABR (agreement-based retraining) technique, the fusion block. We combine the outputs as follows: for each sample, we compare the two predictions and, depending on their agreement, divide the test set into agreed and disagreed samples. The agreed samples are assigned the common label (fake or real), assuming it is correct with high likelihood, and this constitutes the prediction for the agreed samples. For the rest, we use a retraining technique: first we select the most effective of the two independent classifiers based on their cross-validated performance on the training set; then we use the agreed samples, together with the initial training samples of the verification corpus, to predict labels for the disagreed samples. The goal is to adapt the initial model to the characteristics of the new, unseen event.
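A simplified sketch of this agreement-based fusion follows. For brevity it assumes both models operate on the same feature matrix (in the actual method they use tweet-based and user-based features respectively) and that `model_a` is the classifier already selected as stronger via cross-validation on the training set:

```python
import numpy as np
from sklearn.base import clone

def fuse_with_retraining(model_a, model_b, X_train, y_train, X_test):
    """Agreement-based fusion: agreed samples keep the common label;
    the stronger model is retrained on train + agreed samples and
    relabels the disagreed samples."""
    pred_a = model_a.predict(X_test)
    pred_b = model_b.predict(X_test)
    agreed = pred_a == pred_b

    y_pred = pred_a.copy()  # agreed samples already carry the common label
    if (~agreed).any():
        # Augment the original training set with the agreed test samples,
        # trusting their (likely correct) predicted labels.
        X_aug = np.vstack([X_train, X_test[agreed]])
        y_aug = np.concatenate([y_train, pred_a[agreed]])
        retrained = clone(model_a).fit(X_aug, y_aug)
        y_pred[~agreed] = retrained.predict(X_test[~agreed])
    return y_pred
```

Retraining on the agreed samples is what adapts the model to the characteristics of the new, unseen event.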
Our verification corpus (VC) is a publicly available dataset of fake and real tweets. It consists of tweets related to 17 events, comprising in total 193 real and 220 fake images and videos. The tweets were collected using a set of keywords and were debunked using reputable online resources. Only tweets sharing one of these multimedia items were included in the dataset, and several manual steps were necessary to arrive at them.
The aim of the conducted experiments was to evaluate the fake detection accuracy on samples from new events. We consider this very important aspect of a verification framework as the nature of fake tweets may vary across different events.
The employed scheme can be thought of as an event-based cross-validation.
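The event-based cross-validation amounts to a leave-one-event-out loop: for each event, train on the tweets of all remaining events and test on that event. A minimal sketch, where `make_model` is an illustrative model factory, not part of the original method:

```python
import numpy as np

def event_cross_validation(X, y, events, make_model):
    """Leave-one-event-out evaluation: for each event E_i, train on the
    tweets of all other events and test on the tweets of E_i.

    events is an array of per-tweet event labels aligned with X and y.
    Returns a dict mapping each event to the accuracy achieved on it.
    """
    accs = {}
    for ev in np.unique(events):
        test = events == ev
        model = make_model().fit(X[~test], y[~test])
        accs[ev] = (model.predict(X[test]) == y[test]).mean()
    return accs
```

With the slide's setup this would be called with a factory such as `lambda: RandomForestClassifier(n_estimators=100)`, matching the 100-tree forest; scikit-learn's `LeaveOneGroupOut` splitter expresses the same scheme.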
We first assess the contribution of the features to the method's accuracy. We compare the performance using the baseline and the full set of features; the baseline features are a subset of the features used in our previous work. Then, we assess the bagging applied in our method. We can see that the full feature set and the bagging, in both the tweet-based and the user-based model, led to considerably improved accuracy.
In this graph, we present the agreement level and the accuracy of the classifiers on the agreed set. We note that the higher the agreement level, the higher the achieved accuracy. The last column is the average across the different trials.
This bar chart shows the agreed accuracy, the disagreed accuracy and, finally, the overall accuracy across the trials. On the right chart, we can see their average accuracy levels in green, orange and grey respectively. The last columns, in blue, show the performance of each of the models when tested individually on the test set. One can see a clear improvement (about 5%) of the overall accuracy compared to the individual models.
We also assessed the model on tweets written in different languages.
These are the five most used languages in the corpus; "no lang" means the language was either not detected or the tweet contained too little text.
Accuracy is stable independent of the language
We also compare our model with the methods submitted to the MediaEval 2015 verification task, against their best runs.
Our proposed method achieves the second best performance, almost equal to the best run.
One of the biggest challenges we are facing is making the tool usable and easy to understand for non-computer scientists.
Our experience with media experts from Deutsche Welle & AFP (Agence France Presse) shows that the …