Tunisian Republic
Ministry of Higher Education
and Scientific Research
University of Tunis El Manar
Higher Institute of Computer Science
Master’s Thesis
Presented in order to obtain the
Master’s Degree in Information and Technology
Mention: Information and Technology
Specialty: Software Engineering (GL)
By:
Wajdi KHATTEL
Proposal of a Terrorist Detection Model in
Social Networks
Presented on 07.12.2019
In front of a jury composed of:
President: Najet AROUS
Evaluator: Olfa EL MOURALI
Academic supervisor: Ramzi GUETARI
Laboratory supervisor: Nour El Houda BEN CHAABENE
Realized within
Academic year: 2018-2019
Academic Supervisor: Ramzi Guetari
I authorize the student to submit his internship report for a defense.
Signature, 22/11/2019

Laboratory Supervisor: Nour El Houda Ben Chaabene
I authorize the student to submit his internship report for a defense.
Signature, 22/11/2019
Dedications
I want to dedicate this humble work to:
My parents Abderraouf and Sonia, for all the pain they have been through and all the sacrifices they made so that I could reach this level and be who I am today.
To my sister Yosra and her husband Jamel, for their patience, continuous support and care.
To all the members of my family and my dearest friends, for the best times and laughs we had, and for sticking by my side when I needed them.
To all those I love and all those who love me, and to everyone who helped whom I forgot to mention.
With Love,
Wajdi Khattel.
Acknowledgements
I would first like to thank and express my very profound gratitude to my academic advisor, Mrs. Nour El Houda BEN CHAABENE, for the huge effort and sacrifice she gave the entire time, for believing in our capacities, and for her patience, motivation and immense knowledge. Her guidance helped us throughout the research and writing of this thesis.
I also thank my academic professor, Mr. Ramzi GUETARI, for his great support and generosity; his office door was always open whenever I ran into a trouble spot or had a question about our research, and he steered us in the right direction whenever I needed it.
Thanks also to everyone who contributed to this work with their support, even spiritual, especially during the last couple of weeks.
With Gratitude
Wajdi Khattel.
Table of Contents
General Introduction 1
I State of the art 3
1 Anomaly Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 4
1.1 Activity-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.2 Graph-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2 Terrorist Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 10
2.1 Existing Content-based Models . . . . . . . . . . . . . . . . . . . . . 11
2.2 Existing Graph-input Analysis . . . . . . . . . . . . . . . . . . . . . 13
II Existing Techniques 16
1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.1 Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.1.2 Data Representation . . . . . . . . . . . . . . . . . . . . . . 20
1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
1.2.1 CNN: Convolutional Layer . . . . . . . . . . . . . . . . . . 23
1.2.2 CNN: Pooling Layer . . . . . . . . . . . . . . . . . . . . . . 25
1.2.3 CNN: Fully-Connected Layer . . . . . . . . . . . . . . . . . 26
1.3 Numerical-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . 26
2 Data Classification in Machine Learning . . . . . . . . . . . . . . . . . . . . 26
2.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
III Proposed Model 29
1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.1 Offline Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
1.2 Online Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2 Proposed Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.1 Model Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.2 Content-Based Classification . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.1 Text Classification Model . . . . . . . . . . . . . . . . . . . 34
2.2.2 Image Classification Model . . . . . . . . . . . . . . . . . . 36
2.2.3 General Information Classification Model . . . . . . . . . . 37
2.3 Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Global Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
IV Implementation and Results 43
1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.1 Offline Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
1.1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . 44
1.1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . 46
1.1.3 General Information Data . . . . . . . . . . . . . . . . . . . 48
1.2 Online Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.2.1 Facebook Data . . . . . . . . . . . . . . . . . . . . . . . . . 49
1.2.2 Instagram Data . . . . . . . . . . . . . . . . . . . . . . . . . 51
1.2.3 Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.1 Text Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.1 NLP Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.2 Data Vectorization . . . . . . . . . . . . . . . . . . . . . . . 54
2.1.3 Data Classification . . . . . . . . . . . . . . . . . . . . . . . 55
2.2 Image Classification Model . . . . . . . . . . . . . . . . . . . . . . . 56
2.2.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . 56
2.2.2 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . 57
2.3 General Information Classification Model . . . . . . . . . . . . . . . 58
2.4 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3 Results Interpretation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
V Conclusions and Perspectives 63
Bibliography 65
List of Figures
I.1 Unified User Profiling (UUP) system with cyber security perspective . . . . 6
I.2 User Profiling Method in Authorization Logs . . . . . . . . . . . . . . . . . 7
I.3 Context-aware graph-based approach framework . . . . . . . . . . . . . . . 8
I.4 Forum user profiling approach framework . . . . . . . . . . . . . . . . . . . 9
I.5 Transfer-Learning CNN Framework . . . . . . . . . . . . . . . . . . . . . . . 12
I.6 Multidimensional Key Actor Detection Framework . . . . . . . . . . . . . . 14
II.1 An example of morphemes extraction . . . . . . . . . . . . . . . . . . . . . . 18
II.2 An example of syntax analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 19
II.3 An example of semantic network . . . . . . . . . . . . . . . . . . . . . . . . 20
II.4 Curved Edge Filter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
III.1 Multi-dimensional Network . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
III.2 Text Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
III.3 Image Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
III.4 General Information Classification Model . . . . . . . . . . . . . . . . . . . 38
III.5 Proposed Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
III.6 Model Workflow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
IV.1 Twitter Searching Tool . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
IV.2 Sample of news headlines . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
IV.3 Word Cloud of our Textual Data . . . . . . . . . . . . . . . . . . . . . . . . . 46
IV.4 Sample of Terrorists images . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
IV.5 Sample of Military/News images . . . . . . . . . . . . . . . . . . . . . . . . 48
IV.6 Age Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
IV.7 Relationship Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
IV.8 Gender Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
IV.9 Facebook Graph API . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
IV.10 An example of data augmentation . . . . . . . . . . . . . . . . . . . . . . . . 57
IV.11 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
List of Tables
I.1 Anomaly detection existing works comparison . . . . . . . . . . . . . . . . 10
I.2 Activity-based techniques comparison . . . . . . . . . . . . . . . . . . . . . 13
II.1 Comparison of word embedding methods . . . . . . . . . . . . . . . . . . . 22
IV.1 Textual-Content Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
IV.2 Image-Content Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
IV.3 Text Models Metric Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
IV.4 Image Models Metric Scores . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
IV.5 General Information Models Metric Scores . . . . . . . . . . . . . . . . . . . 59
IV.6 Model Testing Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
Acronyms
UUP Unified User Profiling
CERT Computer Emergency Response Team
NATOPS Naval Air Training and Operating Procedures Standardization
SVM Support Vector Machine
SMP Social Media Processing
M-SN Multiple Social Networks
M-IT Multiple Input Types
T-UBC Time-based User Behavior Changes
T-FBC Time-based Future Behavior’s Changes
ISIS Islamic State of Iraq and Syria
URL Uniform Resource Locator
LSTM Long Short-Term Memory
CNN Convolutional Neural Network
API Application Program Interface
GTD Global Terrorism Database
GDELT Global Database of Events, Language, and Tone
SNA Social Network Analysis
NLP Natural Language Processing
FOPL First Order Predicate Logic
TF-IDF Term Frequency-Inverse Document Frequency
CBOW Continuous Bag Of Words
RGB Red, Green, Blue
START Study of Terrorism And Responses to Terrorism
PIRUS Profiles of Individual Radicalization In the United States
HTTP HyperText Transfer Protocol
RDF Resource Description Framework
REST REpresentational State Transfer
JSON JavaScript Object Notation
SDK Software Development Kit
RAM Random Access Memory
CPU Central Processing Unit
GPU Graphics Processing Unit
NLTK Natural Language ToolKit
DA Data Augmentation
TL Transfer Learning
VGG Visual Geometry Group
FB Facebook
IG Instagram
T Twitter
General Introduction
The emergence of social networks has made communication and idea sharing easy. Several of them have become among the most popular sources of information, namely Facebook, Twitter, LinkedIn, etc.

Within the last decade, the number of people across the world using these websites has kept increasing, surpassing a billion active users per day [1]. Most of these users are there to interact with their friends and family and to meet new people who share their interests. Other users, such as business owners, are there to communicate with their target audience, promoting their brand or receiving feedback from customers.

Although this facilitated communication can be used in a friendly way, other users, such as bullies, spammers and hackers, take advantage of it in a harmful way. One of the most dangerous categories is terrorist groups, who are among the users who profit most from this advantage: it has become very simple for them to incite other people, promote their groups and plan attacks.

Detecting these groups accurately and quickly has become one of the most important tasks for social network owners. Several approaches and methods have been proposed to that end, such as manual monitoring and firewalls. But as the number of those individuals keeps increasing, accurate and fully automated approaches must be used. Fortunately, the evolution of new technologies, especially the rise of machine learning, has made that task easier.

In this thesis, we propose a model that learns the characteristics that describe a terrorist individual. Additionally, the model learns by itself the new characteristics that define terrorist behavior, since what our socioculture considers abnormal changes over time.
The first chapter presents some existing works that deal with the issue of anomaly
detection in general and terrorism detection in particular, to give the reader a general
idea of the research carried out in this domain.
The second chapter presents existing techniques in the machine learning domain that we need in order to implement our proposed model.
The third chapter introduces the basis of our proposed model from a theoretical perspective, so that we can then implement the model's design.
The fourth chapter presents the practical part of our work, where we go through the pipeline of our model's implementation and discuss the results.
Finally, we end with a general conclusion and perspectives.
Chapter I: State of the art
This chapter presents an overview of some existing works that deal with the issue of anomaly detection in general and terrorism detection in particular. We begin the chapter by defining the concept of anomaly and pointing out the importance of its application in the social media area. Next, we present an overview of some applied anomaly detection and terrorist detection works, categorized by their input format. The purpose of this chapter is to give the reader a general idea of the research carried out in the detection of anomalies and terrorism.
Introduction
Social media's main objective is to provide a platform for people to communicate and share their thoughts. Although most users use it in a friendly way, many others take advantage of this ease of communication to plan attacks or incite others to adopt extremist behaviors. Therefore, it is extremely important to be able to detect these users accurately and quickly. Such users are often referred to as anomalies due to their abnormal behaviors.
Abnormal behaviors are behaviors that differ from, or follow an unusual pattern compared to, what is defined as normal sociocultural behavior.
Our main objective in this research is to study the characteristics that describe an anomalous individual. In social media, however, an anomalous user will certainly hide his anomalousness; therefore, time is important, since we will be looking for peaks and deviations from his/her usual behavior pattern. Moreover, what is considered abnormal in today's socioculture could become normal after a period of time; thus, we should take the behavior's evolution into consideration when defining abnormal behavior.

Different models and approaches have been proposed toward solving this problem. Based on their input format, we can categorize them into activity-based detection, where the input data is the user's activity, and graph-based detection, where the input data is a graph of multiple users.

However, "anomaly" itself is too abstract a term; this motivated us to work on only one concrete type of anomaly, namely terrorism. To consider an individual a terrorist, we first have to define what a terrorist is, since there is no universal agreement on the definition of a terrorist [2]. Facebook, in its definition of dangerous individuals and organizations, defined terrorism as follows:
Terrorism: Any nongovernmental organization that engages in premeditated acts of vio-
lence against persons or property to intimidate a civilian population, government or interna-
tional organization in order to achieve a political, religious or ideological aim. [3]
Since we are working with social media, we decided to adopt that definition.
In the following sections, we start by presenting the existing activity-based and graph-based anomaly detection proposals; then we put our focus on terrorist detection works.
1 Anomaly Detection in Social Media
This section presents the existing models and approaches for anomaly detection in social media, categorized by their input format. We examine whether the latest proposals can identify future changes in anomalous behavior and changes in a user's behavior over time, and whether they make use of multiple social networks.
1.1 Activity-based Detection
Activity-based detection approaches treat users as largely independent of each other. An individual is defined by his/her own activities, and these determine whether his/her behavior is abnormal.
In [4], the authors presented a survey of the available user profiling methods for anomaly detection, then proposed their own anomaly detection model. They showed the advantages and disadvantages of each model from a cybersecurity perspective; some models used operating system logs and web browser history as data sources, while others were more focused on social networks such as Twitter and Facebook. Their analysis revealed that the models based on history and logs were more limited and less consistent, since one cannot really know whether a single user is the only one using that operating system or web browser. The social-network-related models were more consistent, because they are based on private accounts and also include users' interactions with each other, which leads to better results. Based on the data sources of the other methods, they defined a user profile representation as a vector of 7 main feature categories:
• Users interests features
• Knowledge and skills features
• Demographic information features
• Intention features
• Online and offline behaviour features
• Social media activity features
• Network traffic features
Each feature category contains features and sub-grouped features, which finally leads to more than 270 features that are mostly security-related. Their proposed model, called "Unified User Profiling" (Fig. I.1), mainly collects the data from the different sources, then cleans and parses it in order to produce structured data, which finally yields a user profile vector that an administrator can monitor across the different categories to detect anomalies based on the user's activity.
While their model is mostly complete in terms of features and considers different social networks, it is still limited in that it does not detect anomalies automatically.
Figure I.1: Unified User Profiling (UUP) system with cyber security perspective
In [5], the authors proposed a pattern recognition method that, given a user profile vector, takes the user's daily activity and creates a time-series pattern for each activity the user performs (Fig. I.2). Each time the user is involved in an activity, the new behavior is compared to his/her behavioral pattern for that activity. If a deviation from the normal behavior occurs, it is flagged as suspicious; but since a minor deviation does not always indicate suspicion, the activity is also compared to a behavioral model of all system users, so that false alarms are kept to a minimum. Their model is a random forest trained on the CERT dataset along with a private dataset acquired from NextLabs, which achieved over 97% accuracy.
This method showed great results in terms of insider threat detection, which can be considered a single-network setting; thus, it is still limited in that it does not support multiple social networks and cannot automatically learn future abnormal behaviors over time.
Figure I.2: User Profiling Method in Authorization Logs
1.2 Graph-based Detection
Graph-based detection approaches consider user interactions by analyzing a snapshot of a network. Each user can have relations with other users, such as mentions, shares and likes.
There are two families of approaches: static and dynamic. In static graph-based detection, the analysis is done on a single snapshot of the network, while in dynamic graph-based detection, the analysis is time-based, over a series of snapshots.
In [6], the authors proposed an anomaly detection framework in which, at each timestamp t, each user within a network has an activity score and a mutual score with other users. The scores are based on the user's activities and on other users' interactions with those activities. A mutual agreement matrix is then produced to represent those scores, with the users' activity scores on the matrix diagonal. The users' scores are passed to an anomaly scoring function that they proposed and thresholded to decide whether a user is anomalous or not (Fig. I.3). As data sources, they used the "CMU-CERT Insider Threat Dataset" and the "NATOPS Gesture Dataset", and compared the results of their framework to other known models. Their model was by far the best, reaching an area-under-curve score of around 0.95, while other models such as SVM and clustering were around 0.89.
Although the framework far exceeded the expected results for detecting insider threats and supports over-time behavior changes, it is still limited in that it does not consider different input data types, such as images and texts, and does not analyze multiple networks simultaneously.
Figure I.3: Context-aware graph-based approach framework
In [7], the authors proposed a user profiling approach based on user behavior features and social network connection features (Fig. I.4). The first set of features (user behavior features) is the foundation of the user representation and is composed of post content statistics, post content semantics and user behavior statistics. The social network connection features are a set of features that lead to the construction of a network of users with similar network representations. The experimental results showed that using the network connections improved the model's overall score. Their approach took second place among around 900 participants in the SMP 2017 User Profiling Competition.
This work showed that using graphs and considering user interactions is an improvement toward grouping individuals and thus detecting anomalous communities. The limitation of this work is that it cannot detect future changes in a category's behavior.
Figure I.4: Forum user profiling approach framework
1.3 Summary
Within the scope of our research on anomaly detection in social media, we studied different papers. Table I.1 presents the advantages and limitations of those papers in terms of their support of multiple social networks (M-SN), support of multiple input data types such as text and images (M-IT), support of over-time user behavior changes (T-UBC) and their ability to learn future new abnormal behavior changes (T-FBC).
Paper                           Description                                   Input Format     M-SN  M-IT  T-UBC  T-FBC
Lashakry et al., 2019 [4]       Proposed model for user profile creation      User's Activity   ✓     ✓     ✗      ✗
                                to monitor users
Zamanian et al., 2019 [5]       Proposed model for user activity pattern      User's Activity   ✗     ✗     ✓      ✗
                                recognition with random forest
Bhattacharjee et al., 2017 [6]  Proposed a probabilistic anomaly              Graph of users    ✗     ✗     ✓      ✗
                                classifier model
Chen et al., 2018 [7]           Proposed a user profiling framework that      Graph of users    ✗     ✓     ✗      ✗
                                can be used to detect anomalous users

Table I.1: Anomaly detection existing works comparison (✓ = supported, ✗ = not supported)
None of the mentioned works considered all of these functionalities together. Therefore, we decided to work on a model that supports all of them. To facilitate that, we considered a hybrid architecture where the input format is graph-based, to include user interactions and ease the detection of communities, but which also focuses on the user's activity, to solve our main problem of identifying the characteristics that describe an anomalous individual.
2 Terrorist Detection in Social Media
As we decided on a hybrid architecture with both graph-input and activity-based detection, we identified the existing terrorist detection works that focus on the user's social media content, and other works that treat a graph as input. In this section, we present those papers to gain a broader overview of how to solve our problem.
2.1 Existing Content-based Models
In this section, we focus on models that treat the content of the activities in which an individual can be involved on social media. These serve as a proof of concept for our own implementation.
In [8], the authors implemented a model that detects extremists in social media based on information related to usernames, profiles and textual content. They built their dataset from Twitter by looking for hashtags related to extremism, which resulted in around 1.5M tweets; they then extracted 150 ISIS-related accounts that posted those tweets and were reported to the Twitter Safety account (@TwitterSafety) by normal users, plus 150 normal users, to have a balanced dataset, along with 3k unlabeled examples.
Afterwards, they categorized the features into 3 major groups:
• Twitter handle’s (username) related features: length, number of unique characters
and Kolmogorov complexity of the username.
• Profile related features: this group contains 7 features related to the profile of the
user such as the profile’s description, the number of followers and the location.
• Content related features: the number of URLs, the number of hashtags and the
sentiment of the content.
Based on this dataset, they tried to answer two research questions:
• Are extremists on Twitter inclined to adopt similar handles?
• Can we infer the labels (extremist vs. non-extremist) of unseen handles based on
their proximity to the labeled instances?
After their experiments with different supervised and semi-supervised approaches, both questions had a positive answer: SVM had the best precision score with 0.96, which shows the significance of the proposed feature set, but char-LSTM had the best precision-recall score with 0.76, which minimizes the number of false negatives.
This work presented different ways of collecting the necessary data for an extremist detection work. It also showed that the use of different input data types from social media can help detect extremists. The limitation of this model is that it does not support over-time changes in user behavior and cannot learn future extremist behaviors.
In [9], the authors presented a convolutional neural network (CNN) to detect suspicious e-crimes and terrorist involvement by classifying social media image contents. They used three different kinds of datasets, of which we are only interested in the terrorism images dataset. Based on the transfer learning technique, they took the CNN architecture of the ImageNet model [10] and reduced its network size by lowering the kernel size of each layer, coming up with a new, smaller network (Fig. I.5). In the results, their architecture outperformed the default ImageNet model by around 1% in mean average precision score and took half of ImageNet's execution time.
This paper showed that detecting terrorists based on their social media image contents is possible, along with the advantage of using transfer learning rather than building a CNN from scratch. However, their model supports only one type of data, namely images.
Figure I.5: Transfer-Learning CNN Framework
In Table I.2, we present the content-based models that we analyzed, along with their advantages and limitations.
Alvari et al., 2019 [8]
  Description: (semi-)supervised model of extremist detection based on a user's general information and textual-content data
  Advantages: proof of concept of detection based on textual content and general information; supports multiple input data types
  Limits: cannot support multiple social networks; cannot detect if a user is adopting new behaviors over time; cannot learn future behavior changes

Chitrakar et al., 2016 [9]
  Description: image classification model using CNN and transfer learning
  Advantages: proof of concept of image-content-based detection; highlighted a model improvement technique, transfer learning
  Limits: cannot support multiple input data types; cannot learn future behavior changes

Table I.2: Activity-based techniques comparison
2.2 Existing Graph-input Analysis
In this section, we study the existing works that take a graph as input for the terrorist detection in social media problem.
In [11], the authors proposed a framework that takes a multidimensional network as input for the identification of the key actors of a terrorist network. The dimensions represent the types of relationships or interactions in a social medium. The workflow of their framework starts by building a multidimensional network through a keyword-based search on a social media platform; that network is then mapped to a single-layer network using certain mapping functions. To detect the key actors, they use several centrality measures, such as degree centrality and betweenness centrality. The output of the framework is a ranked list of the key actors within the network. The framework's effectiveness was evaluated on a ground-truth dataset of 16 months of Twitter data. Fig. I.6 presents the workflow of this framework.
This work presented the usage of multidimensional networks and how they can be analyzed to detect the key actors of a terrorist network. Their usage of the multiple dimensions could be more effective if they considered multiple social media platforms instead of multiple relationship and interaction types.
Figure I.6: Multidimensional Key Actor Detection Framework
In [12], the authors created a survey on social network analysis for counter-terrorism, in which they covered the data collection methods and the different types of analysis.
The two sources of data are online social networks and offline social networks. Online social networks are the social media websites that allow users to interact with other users by sending messages and posting information; these are websites like Facebook, Twitter and YouTube, from which we collect the data using their APIs. On the other hand, offline social networks are the real-life social networks based on relations such as financial transactions, locations, events, etc.; these are the public databases such as the Global Terrorism Database (GTD) [13] and the Global Database of Events, Language, and Tone (GDELT) [14].
Furthermore, they analyzed the different centrality measures that quantify the importance and position of a node in a network (a short sketch follows the list), such as:
• Degree Centrality: A node with higher degree value is often considered as an active
actor in a network. The degree value is the number of connections linked to a node.
[15]
• Closeness Centrality: A node with higher closeness value can quickly access other
nodes in a network. The closeness value is a measure for how fast a node can reach
other nodes. [15]
• Betweenness Centrality: A node with higher betweenness value is often considered
as an influencer in a network. The betweenness value is the number of shortest
paths between any pair that pass through a node. We can see this as which node
acts as a bridge to make communities in a network. [15]
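To make these measures concrete, here is a minimal sketch using the networkx Python library on a toy graph; the graph and node names are illustrative and are not data from the survey.

    import networkx as nx

    # Toy interaction graph: edges represent interactions between users.
    G = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")])

    degree = nx.degree_centrality(G)            # active actors: many direct connections
    closeness = nx.closeness_centrality(G)      # fast reach to the rest of the network
    betweenness = nx.betweenness_centrality(G)  # bridges between parts of the network

    # 'c' bridges the {a, b} triangle to {d, e}, so it ranks highest in betweenness.
    print(max(betweenness, key=betweenness.get))  # -> 'c'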
Finally, they compared some SNA tools based on functionality, platform, license type and file formats. They concluded that when doing social network analysis, the main challenge is the data itself, since user privacy is a very sensitive issue and the data also tends to be incomplete, with many missing or fake nodes and relations, which often leads to incorrect analysis results. This survey provided us with the different data collection methods as well as the graph analysis methodologies.
Conclusion
In this chapter, we presented some existing works that have dealt with anomaly detection in general and terrorist detection in particular, following different approaches. To the best of our analysis, the existing methods did not deal with terrorism in multidimensional graphs by combining different types of classification in a time-based way. This motivated us to propose a model for terrorism detection in multidimensional graphs that supports different types of input data and can also detect behavior changes over time.
In the next chapter, we review the existing techniques needed to implement our proposed model.
Chapter II: Existing Techniques
This chapter presents the techniques necessary to implement our proposed model. We begin by presenting the different input data types that we consider and the techniques used for the analysis of each type. Then, we present the classification models to use and how they work.
Introduction
Each social network hosts ample data that can be shared on it; identifying these data types and choosing which ones we will be working with is an important task toward achieving our goal. In our previous analysis of the different existing proposals, the authors of [4] identified nearly 270 security-related anomaly detection features, some of which were social media activity features. We analyzed those features and, based on [8, 9], grouped them into three data type categories, namely: textual-content data, image-content data and numerical-content data. To classify an individual based on those content data, different classification models exist.
In the next sections, we begin by giving an overview of the identified input data types and their analysis approaches; then we present the different classification models.
1 Data Types
In this section, we briefly introduce each type of data along with the chosen approach toward its analysis and classification.
1.1 Textual-Content Data
Textual-content data mainly consists of characters that are part of a certain language and can be read by a human being. We begin by presenting the chosen text analysis approach; then we decide on a data representation technique to transform the text into numerical input.
1.1.1 Text Analysis
In text analysis, the most commonly used technique is text mining.
Text mining is the process of extracting high-quality information from textual data, where the information can be patterns or matching structures in the text, without considering its semantics. Its outcome is mostly statistical information, such as word frequency and correlation. [16]
In the terrorism detection domain, we are interested in knowing what the user is trying to incite with a post and whether it is serious, sarcastic or reporting news. To differentiate these cases, we need to go through semantic analysis rather than treat words as mere objects. One of the most important text-mining processing methodologies, which also considers the semantics of words, is natural language processing.
Natural language processing is the process of making the computer understand the language spoken by humans, along with the semantics and sentiments conveyed by it, through analyses such as morphological, syntactic and semantic analysis [16].
The first step in NLP is morphology processing, which involves analyzing the structure of words by studying their construction from primitive meaningful units called morphemes. This will help us divide the different words/phrases of a document into tokens that will be used in later analyses.
Morphemes are the smallest units with a meaning in a word. There are two types of morphemes, namely stems and affixes, where the stem is the base or root of a word and an affix can be a prefix, an infix or a suffix. Affixes never appear isolated; they are always combined with a stem.
Taking the example of Fig. II.1, we can see how a word is split into a stem, which carries the main meaning of the word, and some affixes.
Figure II.1: An example of morphemes extraction
Tokens are words, keywords, phrases or symbols that form a useful semantic unit for processing. We refer to the extraction process as tokenization. A token is mainly composed of a lemma + part-of-speech tag + grammatical features, for example (a short sketch of this process follows the list):
• plays → play (lemma) + Noun (part-of-speech tag) + plural (grammatical feature)
• plays → play (lemma) + Verb (part-of-speech tag) + singular (grammatical feature)
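As a concrete illustration, the following minimal sketch uses the NLTK Python library (listed in the acronyms) to tokenize, POS-tag and lemmatize a sentence; the exact tags and resource package names may vary across NLTK versions.

    import nltk
    from nltk.stem import WordNetLemmatizer

    # One-time downloads: tokenizer models, POS tagger and the WordNet lexicon.
    for pkg in ("punkt", "averaged_perceptron_tagger", "wordnet"):
        nltk.download(pkg)

    tokens = nltk.word_tokenize("Three people were killed in an incident today")
    tagged = nltk.pos_tag(tokens)  # e.g. ('people', 'NNS'), ('killed', 'VBN'), ...

    lemmatizer = WordNetLemmatizer()
    print(lemmatizer.lemmatize("killed", pos="v"))  # -> 'kill' (the stem/lemma)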
Having studied the structure of the words, we next examine their arrangement and combination in a sentence using syntax analysis.
In a sentence, word arrangement follows the precise rules of the language's grammar. Taking the example sentence "Three people were killed in an incident today" and following the English grammar parser, we end up with the parse tree of Fig. II.2, with grammatical groups such as S for sentence, NP for noun phrase, VP for verb phrase, NN for singular nouns and NNS for plural nouns.
Figure II.2: An example of syntax analysis
This analysis makes the machine able to understand the relationships between the words and the different references.
After structuring the words and studying their relationships, it is time for the machine to understand the meaning of the words and phrases along with the context of the document. Focusing on the relationships between words and elements such as synonyms, antonyms and hyponyms (hierarchical order of meaning), the semantic system is able to build blocks composed of:
• Entities: Individuals or instances.
• Concepts: Category of individuals or classes.
• Relations: Relationship between entities and concepts.
• Predicates: Verb structures or semantic roles.
These can be represented through methods such as first order predicate logic (FOPL),
semantic networks and conceptual dependency.
Fig. II.3 illustrates an example of a semantic network using our previous example sentence, "Three people were killed in an incident today".
Figure II.3: An example of semantic network
Based on these semantics, the machine can now learn the meaning of the words and of the text; thus, from this point, it is possible to learn the meaning of the user's textual data.
1.1.2 Data Representation
After going through text analysis, our machine can now understand the meaning of textual content data. But in order to build a classifier that automatically categorizes current and future data, the data must be numerical, so that mathematical rules can be applied while the semantics are preserved.
Word embedding is one of the most popular representations of textual data: it transforms a word in a document into a vector of numerical features, where close vectors usually mean that the words share the same meaning or appear in the same context; the data therefore does not lose its semantics.
In our research, the most widely used word embedding techniques we found are Word2Vec and Term Frequency-Inverse Document Frequency (TF-IDF).
Word2Vec uses two different approaches, namely Continuous Bag Of Words (CBOW) and Skip-Gram; both are based on neural networks that take a context as input and use back-propagation to learn [17]. Mathematically, Word2Vec tries to maximize the probability of the next word w_t given its context h, i.e. the probability P(w_t | h) in Equation II.1, where score(w_t, h) computes the compatibility of w_t with the context h and softmax is the standard softmax function.

\[
P(w_t \mid h) = \mathrm{softmax}(\mathrm{score}(w_t, h)) \tag{II.1}
\]
CBOW learns the embedding of a word by predicting it from the surrounding words, which are considered as the context.
Skip-Gram learns the embedding of a word by considering the current word as the context and predicting the surrounding words.
According to [17], Skip-Gram is able to work with less data and represents rarer words better, while CBOW is faster and represents frequent words more clearly.
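As a sketch of how both approaches are used in practice, the snippet below trains the two Word2Vec variants with the gensim library (assuming gensim 4.x, where the dimensionality parameter is vector_size); the toy corpus is purely illustrative.

    from gensim.models import Word2Vec

    # Toy tokenized corpus; a real corpus would hold many tokenized documents.
    sentences = [
        ["attack", "planned", "against", "civilians"],
        ["news", "report", "about", "the", "attack"],
    ]

    # sg=0 -> CBOW: predict a word from its surrounding context.
    cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)
    # sg=1 -> Skip-Gram: predict the surrounding context from a word.
    skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

    vector = skipgram.wv["attack"]                # 50-dimensional embedding
    similar = skipgram.wv.most_similar("attack")  # nearest words in the vector space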
TF-IDF represents words with weights, based on the product of the term frequency and the inverse document frequency. In simpler terms, words that occur frequently throughout the corpus should be given very little weight or significance; in English, for example, such terms include "the", "or" and "and", which do not provide much value. However, if a word appears rarely, or appears frequently but only in one or two documents, then it is identified as a more important word and should be weighted as such [18].
Term frequency (TF) is the rate of occurrence of a term t in a document d. As illustrated in Equation II.2, we calculate it by dividing the number of times the term t appears in the document d by the total number of words in the document d.

\[
\mathrm{tf}_{t,d} = \frac{n_{t,d}}{\sum_{t'} n_{t',d}} \tag{II.2}
\]

where n_{t,d} is the number of occurrences of the term t in the document d, and \sum_{t'} n_{t',d} is the sum of the occurrences of all the terms appearing in the document d, i.e. the total number of words in the document d.
Method    Advantages                                     Disadvantages
Word2Vec  - Optimized memory usage                       - Contains a lot of noisy data
          - Fast execution time                          - Does not work well with ambiguity
TF-IDF    - The vocabulary is built with words that      - High memory usage
            identify the category                        - The closest words are not similar in meaning
          - Extracts relevant information                  but in the category of the document's context

Table II.1: Comparison of word embedding methods
Inverse document frequency (IDF) ranks a term t by its relevance across the document collection. Equation II.3 shows the formula: we take the total number of documents N and divide it by df_t, the number of documents that contain the term t.

\[
\mathrm{idf}(t) = \log_e\!\left(\frac{N}{\mathrm{df}_t}\right) \tag{II.3}
\]
Finally, the TF-IDF weight w_{t,d} of a term t in a document d is obtained, as shown in Equation II.4, by multiplying tf_{t,d} by idf(t).

\[
w_{t,d} = \mathrm{tf}_{t,d} \cdot \mathrm{idf}(t) \tag{II.4}
\]
As found in our review of existing research, such as [18], Word2Vec performs better in terms of memory, execution time and embedding quality for words similar in context and meaning, while TF-IDF performs better at identifying the words that determine a document's category; in other words, it detects the keywords that identify a category of documents. Table II.1 summarizes the advantages and disadvantages of each method.
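For illustration, the following self-contained Python sketch computes Equations II.2-II.4 on a two-document toy corpus; note how a term occurring in every document gets a weight of zero.

    import math
    from collections import Counter

    docs = [
        ["three", "people", "were", "killed", "in", "an", "incident"],
        ["the", "incident", "was", "reported", "in", "the", "news"],
    ]

    def tf(t, d):                      # Equation II.2
        return Counter(d)[t] / len(d)

    def idf(t, docs):                  # Equation II.3
        df = sum(1 for d in docs if t in d)
        return math.log(len(docs) / df)

    def tfidf(t, d, docs):             # Equation II.4
        return tf(t, d) * idf(t, docs)

    print(tfidf("killed", docs[0], docs))  # rare term -> positive weight (~0.099)
    print(tfidf("in", docs[0], docs))      # appears in every document -> 0.0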
1.2 Image-Content Data
This type of data is anything that is a visual representation of something. Different approaches are available for image processing, but as determined in [10], the convolutional neural network is by far the best-performing method for image classification in terms of precision and execution time.

A convolutional neural network is a deep learning algorithm and an extension of the neural network, distinguished from other methods by its ability to account for spatial structure and translation invariance: regardless of where an object is located in an image, it is still recognized as the same object [19]. Accepting a multidimensional input, unlike regular neural networks that use a vector as input, makes it perform better on image data, since images usually have three color channels (RGB), which makes them three-dimensional matrices. Taking the example of a 32×32 image with 3 color channels, a regular neural network would need 32×32×3 = 3072 weights; for a 512×512 image, it would need 512×512×3 = 786432 weights. This results in huge calculations as well as over-fitting from having too much information and detail. [20]

A simple CNN is a sequence of layers: convolutional layers, pooling layers and a fully-connected layer. In a typical CNN, there are several rounds of convolution/pooling before we proceed to the fully-connected layer.
1.2.1 CNN: Convolutional Layer
Each convolutional layer of the network has a set of feature maps that can recognize increasingly complex patterns/shapes in a hierarchical manner. Instead of regular matrix multiplications, the convolutional layer uses convolution calculations. To do that, it constructs filters and applies calculations with them, while using optimization techniques such as striding and padding.
Filters are used to detect patterns in an image; they also offer weight sharing. For example, a filter which detects a curved edge (Fig. II.4) matches the left corner of an image, but may also match the bottom-right corner of the image if both corners have curved edges.
Figure II.4: Curved Edge Filter
Calculations are element-wise multiplications and sums used to apply a filter to an input image. Let us consider the following 5×5 input, 3×3 filter and 3×3 output:

[0 0 1 1 0]
[1 1 3 1 2]     [1 1 0]     [? ? ?]
[1 0 1 4 2]  *  [0 0 1]  =  [? ? ?]
[0 2 2 1 0]     [1 0 0]     [? ? ?]
[3 4 1 0 0]

In order to get the value of the first '?', we apply the filter to the first 3×3 block of pixels:
? = (0 ∗ 1) + (0 ∗ 1) + (1 ∗ 0) + (1 ∗ 0) + (1 ∗ 0) + (3 ∗ 1) + (1 ∗ 1) + (0 ∗ 0) + (1 ∗ 0) = 4.
Then we continue: the value next to '?' is the value of the second 3×3 block of pixels, in which '3' is the center. This means we moved by 1 pixel to the right.
[0 0 1 1 0]
[1 1 3 1 2]     [1 1 0]     [4 ? ?]
[1 0 1 4 2]  *  [0 0 1]  =  [? ? ?]
[0 2 2 1 0]     [1 0 0]     [? ? ?]
[3 4 1 0 0]

? = (0 ∗ 1) + (1 ∗ 1) + (1 ∗ 0) + (1 ∗ 0) + (3 ∗ 0) + (1 ∗ 1) + (0 ∗ 1) + (1 ∗ 0) + (4 ∗ 0) = 2. And so on.
Striding is a parameter defining how many pixels we move to calculate the next value. It is mainly used to reduce computation, as values next to each other are likely to be similar. In our last example the stride was 1: we moved the filter window by only 1 pixel to get the next value. Usually, a value of 2 or 3 is used, since in most cases a shift of 2-3 pixels is enough to produce a variation or a change of pattern.
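The whole mechanism can be written in a few lines of NumPy; this is an illustrative sketch, not an optimized implementation, and its padding argument implements the zero padding described next. On the 5×5 input and 3×3 filter above, it reproduces the values 4 and 2 computed by hand.

    import numpy as np

    def conv2d(image, kernel, stride=1, padding=0):
        if padding:
            image = np.pad(image, padding)  # zero padding on all sides
        kh, kw = kernel.shape
        out_h = (image.shape[0] - kh) // stride + 1
        out_w = (image.shape[1] - kw) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
                out[i, j] = np.sum(patch * kernel)  # element-wise product, then sum
        return out

    image = np.array([[0, 0, 1, 1, 0],
                      [1, 1, 3, 1, 2],
                      [1, 0, 1, 4, 2],
                      [0, 2, 2, 1, 0],
                      [3, 4, 1, 0, 0]])
    kernel = np.array([[1, 1, 0],
                       [0, 0, 1],
                       [1, 0, 0]])
    print(conv2d(image, kernel))  # first row starts with 4, 2, as computed above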
Padding is used to prevent information loss. In our example, when applying the filter, the values of the first/last rows and the first/last columns were never the center of a 3×3 block. To fix that, we add zero padding, which adds new rows/columns filled with 0:

[0 0 1 1 0]       [0 0 0 0 0 0 0]
[1 1 3 1 2]       [0 0 0 1 1 0 0]
[1 0 1 4 2]   ⇒   [0 1 1 3 1 2 0]
[0 2 2 1 0]       [0 1 0 1 4 2 0]
[3 4 1 0 0]       [0 0 2 2 1 0 0]
                  [0 3 4 1 0 0 0]
                  [0 0 0 0 0 0 0]
1.2.2 CNN: Pooling Layer
The pooling layer is used to determine which information is critical and which constitutes irrelevant detail. There are many types of pooling layers, such as the max pooling layer and the average pooling layer. With max pooling, we look at a neighborhood of pixels and keep only the maximum value.
Consider a 2×2 max pooling with a stride of 2:
[1 0 0 1]
[3 2 0 2]       [3 2]
[0 0 4 2]   ⇒   [4 4]
[4 1 0 1]

For each 2×2 block we took the maximum value, and each time we moved by two pixels (the stride) to get the next 2×2 block.
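A minimal NumPy sketch of 2×2 max pooling with stride 2, reproducing the example above:

    import numpy as np

    def max_pool(x, size=2, stride=2):
        out_h = (x.shape[0] - size) // stride + 1
        out_w = (x.shape[1] - size) // stride + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                # Keep only the maximum value of each neighborhood.
                out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
        return out

    x = np.array([[1, 0, 0, 1],
                  [3, 2, 0, 2],
                  [0, 0, 4, 2],
                  [4, 1, 0, 1]])
    print(max_pool(x))  # [[3. 2.], [4. 4.]]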
1.2.3 CNN: Fully-Connected Layer
A fully-connected layer is a layer in which all the inputs are connected to all the outputs. In a CNN, it is used to finally determine the class that will be assigned to our main input. Before proceeding to the fully-connected layer, we have to use a technique called flattening in order to generate the vector this layer needs (a sketch combining all three layer types follows the list).
Flattening:
• Each 2D matrix of pixels is turned into 1 column of pixels.
• Each one of our 2D matrices is placed on top of another.
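Putting the three layer types together, here is a minimal Keras sketch of the convolution/pooling/flatten/fully-connected sequence described above; the layer sizes and the 32×32 RGB input shape are illustrative assumptions, not values fixed by this chapter.

    import tensorflow as tf
    from tensorflow.keras import layers

    model = tf.keras.Sequential([
        layers.Conv2D(32, (3, 3), activation="relu",
                      input_shape=(32, 32, 3)),  # convolutional layer: pattern filters
        layers.MaxPooling2D((2, 2)),             # pooling layer: keep strongest responses
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),                        # flattening: 2D feature maps -> 1D vector
        layers.Dense(64, activation="relu"),     # fully-connected layer
        layers.Dense(1, activation="sigmoid"),   # binary class probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])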
1.3 Numerical-Content Data
Numerical-content data is data based on numbers that can be statistically interpreted. This type of data does not require pre-processing; thus, it can be fitted directly into a model. The models for this type of data are mostly the general statistical machine learning models that we present below.
2 Data Classification in Machine Learning
Machine learning is a subset of the artificial intelligence domain that makes the machine able to automatically gain knowledge from experience without being explicitly programmed. Following statistical and mathematical concepts, it looks for patterns in the data we provide, learns them and makes better decisions in the future. [21]
Several learning methods exist in machine learning:
• Supervised Learning: Given a sample of data and the desired output, the machine
should learn a function that maps the inputs to the outputs.
• Unsupervised Learning: Given a sample of data without the output, the machine should learn a function that categorizes these samples based on learned patterns.
• Semi-Supervised Learning: Given a small number of data with the desired output
(labeled data), and other data without output (unlabeled data), the machine should
learn a function that can label the unlabeled data using the knowledge learned from
the labeled data.
• Reinforcement Learning: Given a sample of data, a set of actions and rewards related to the actions, the machine should learn a function that finds the optimal actions toward achieving maximum reward.
Classification is part of supervised learning, in which the machine categorizes newly observed data based on the patterns of each category learned from the training data. In the following sections, we present the most common classification algorithms.
2.1 Support Vector Machines
A support vector machine model is a representation of the data in a space, in which examples of the same category are close to each other. The group of examples in a category is separated from the examples of other categories by a gap as wide and as clear as possible. New observed examples are then predicted to belong to a category based on the side of the gap on which they fall. [22]
2.2 Logistic Regression
Logistic regression is a statistical model that analyzes data in which at least one feature can determine the outcome. Using a logistic function, it models a binary output measured by a dichotomous variable. Since the output is binary, it can only be used for binary classification problems; to use it for a multi-class problem, N logistic regression models must be trained, where N is the number of classes, each model being trained on one class in a one-vs-all fashion. [23]
2.3 Neural Networks
A neural network is a network composed of multiple layers of perceptrons. A perceptron is the elementary unit of an artificial neural network, introduced as a model of biological neurons in 1959 [24]. The output of each perceptron in a layer is connected as an input to each perceptron of the next layer, which makes such a layer known as a fully connected layer. A neural network must have an input layer, an output layer and, in between, a hidden layer. Any neural network with more than one hidden layer is considered a deep neural network. [20]
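The three classifiers share the same fit/predict interface in scikit-learn; the sketch below trains each of them on a synthetic binary dataset standing in for user feature vectors (all sizes and parameters are illustrative assumptions).

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.svm import SVC

    # Synthetic binary dataset standing in for user feature vectors.
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    models = {
        "SVM": SVC(),  # maximum-margin separator
        "LogisticRegression": LogisticRegression(),  # binary; one-vs-all extends it
        "NeuralNetwork": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000,
                                       random_state=0),  # one hidden layer of perceptrons
    }
    for name, clf in models.items():
        clf.fit(X_train, y_train)
        print(name, clf.score(X_test, y_test))  # accuracy on held-out data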
Conclusion
In this chapter, we studied the existing techniques needed to perform classification on textual-content data, image-content data and numerical-content data. In the next chapter, we detail the basis of our proposed model.
Chapter III: Proposed Model
This chapter introduces a novel time-based terrorism detection model that works with multidimensional networks and different types of input data. The output of our model is the set of nodes that belong to terrorist regions of a graph across the dimensions of the multidimensional network. To identify this type of node, we first have to determine what the terrorist regions are and how to create them. Then, we examine the network to estimate a terrorism score for each node in a dynamic way, in order to detect behavior changes over time.
First, we introduce the purpose of the model along with the proposed research questions; then we present the sources of data. After that, we present in detail the theoretical approach toward constructing our model, and we finish with a conclusion.
Introduction
Nowadays, social networks provide many types of data that can be exploited, such as images, texts and videos, but most existing models work on a specific type of data from a specific social network.
Our proposed model tries to overcome this limitation by supporting a multidimensional network as input, in order to be able to use data from multiple social networks at the same time, along with supporting different input data types. In addition, the model considers the evolution of an individual's behavior over time to detect deviations from the usual behavior pattern. Furthermore, the model adapts itself to the behavior's evolution so as to stay up to date with new abnormal behaviors.
Before describing the basis of the model's construction, it is first necessary to present the research questions that will be used as a metric to track the accuracy of our proposed model in solving the main research problem of this thesis, which is the study of the characteristics that describe a terrorist on different social media platforms.
The research questions being posed are as follows:
Q1: Can we identify the behavior of a terrorist based on his/her social media content?
Q2: Can machine learning help automatically detect whether a user is adopting terrorist behavior over time?
Q3: Do terrorists adopt the same behavior on different social networks?
In order to answer these research questions, we have to pass through some phases:
• Phase 1: Identifying the available data sources
• Phase 2: Determining the convenient classification approach
• Phase 3: Estimating the terrorism score calculation
First, we start by collecting the necessary data for each user. Then, we create a multidimensional network where each dimension represents a social network. Once the network is ready, it is used as input to our model, where each feature from each social network is mapped to its respective sub-model. Finally, a decision score is calculated.
If a node is detected as a terrorist, the model is re-trained with those new inputs to stay up to date with the newest (unseen) terrorist behaviors; in case the model loses accuracy after the update, it is reverted to the last version. Additionally, each node is passed to the model every time it is involved in a new activity; that way, a node can also be flagged as a terrorist once the user adopts terrorist behavior over time.
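To make the workflow concrete, here is a schematic sketch of the decision step; the sub-model scores, weights and threshold are hypothetical placeholders, not values fixed by the model (the actual score calculation is detailed later in this chapter).

    # Hypothetical outputs of the three sub-models for one node, each in [0, 1].
    scores = {"text": 0.82, "image": 0.40, "general": 0.65}
    weights = {"text": 0.5, "image": 0.3, "general": 0.2}  # assumed weighting

    terrorism_score = sum(weights[k] * scores[k] for k in scores)
    is_terrorist = terrorism_score >= 0.5  # assumed decision threshold

    # Re-train on confirmed positives so the model keeps learning new behaviors.
    if is_terrorist:
        print(f"flag node (score = {terrorism_score:.2f}) and schedule re-training")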
1 Data Collection
As part of phase 1, the data sources for the different data types should be identified. As presented in the last chapter, there are three types of data:
• Textual-Content Data: These include posts, comments, image captions, text in an
image, etc.
• Image-Content Data: These are posted photos, profile picture, etc.
• Numerical-Content Data: These are age, number of friends, average posts per day,
etc.
Several other kinds of information exist in social media, such as username, gender and relationship status. Therefore, instead of keeping the numerical-content data category as is, we opted for a broader category named general information data, which contains the existing numerical-content data in addition to the user's information data. We present next the data sources for the different data contents. As mentioned in [12], we can categorize the data sources into two categories, namely offline data sources and online data sources.
In this section, we provide the sources of both offline and online data used to retrieve our target data types for model training and later prediction. As a strategy for training the model and precisely distinguishing terrorism from other similar data, we decided to consider terrorist contents as positive labels against military and news contents as negative labels; as these types of contents are related, training them against each other will make the model more precise.
1.1 Offline Data Sources
Offline data is the data used for model training; it was gathered from public terrorism datasets. For each input type, we used a different dataset. All of them define terrorism from the American point of view.
For the textual-content data, inspired by [8], we use the Twitter API to gather tweets that contain terrorism-related hashtags, as well as tweets from terrorist accounts that were reported to Twitter's safety account (@twittersafety), after ensuring that they are not anti-terrorist accounts. With these tweets as positive labels, our offline textual-content dataset contrasts them against terrorism news tweets and news headlines gathered from other public datasets, such as the Global Terrorism Database (GTD) [13], as negative labels. We also use the Google Translate API, since some accounts publish tweets in different languages.
For the image-content data, we did not find a public terrorism-related image dataset within the scope of our research. We therefore decided to use a manual web scraping method with Google Images as our data source. We manually gather images of terrorist individuals and of incitement to terrorism, which are our positive labels, and contrast them against military and terrorism news images, which are our negative labels.
For the general information data, the Study of Terrorism And Responses to Terrorism (START) consortium published a database called Profiles of Individual Radicalization In the United States (PIRUS) [25], which contains approximately 145 features about radicalized profiles in the United States. From it, we extract our project's relevant features: age, gender, relationship status, etc.
1.2 Online Data Sources
Online data is the social network data used for prediction and future model re-training. Its sources are the public APIs provided by the social networks.
We decided to study three popular platforms that have similar data contents and that can also be linked together: Facebook, Instagram and Twitter.
Facebook provides the Graph API, an HTTP-based API service for accessing the Facebook social graph objects [26]. With the right permissions, the Graph API allows you to query public data as well as create content [27]. The data is rich with semantics, since the Graph API uses the RDF format as a return type [28].
Instagram, as part of Facebook, also provides the Graph API for business accounts [29]. For normal user accounts, it offers a REST API that returns JSON objects for querying public data [30].
Twitter offers a REST API with a JSON return format that provides several public data queries, as well as private data with the right permissions [31, 32].
2 Proposed Model Design
With the data preparation phase ready, we can now determine the classification approach we will use, along with the terrorism score formula, thus completing phases 2 and 3.
In this section, we explain the theoretical side of the steps needed to construct our proposed model. As previously mentioned, the model takes a graph as input. It then performs an individual, content-based classification, and a decision-making component calculates the final score of each node and applies a threshold to determine whether the user is a terrorist.
2.1 Model Input
Inspired by [11], the most suitable way to represent our input data is a multidimensional network. However, unlike their proposal, the dimensions in our work represent the social networks used.
Let G = (V, E, D) denote an undirected, unweighted multidimensional graph, in which V is the set of nodes representing the users, D is the set of dimensions corresponding to the social networks, and E = {(u, v, d); u, v ∈ V, d ∈ D} is the set of edges connecting users through, for example, relationships, shared comments or post sharing. Fig. III.1 illustrates an example of what this network looks like, and a minimal sketch of how it could be represented follows the figure.
At each timestamp, the user's data is inserted into our model to compute his/her score. A timestamp here corresponds to each time the user is involved in a new activity, which is the method used by [5].
Figure III.1: Multi-dimensional Network
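As an illustration only (not part of our implementation), the following sketch shows how such a multidimensional network could be built with the NetworkX Python library; the node and dimension names are purely illustrative.

import networkx as nx

# An undirected multigraph: parallel edges let the same pair of users
# be connected on several dimensions (social networks) at once
G = nx.MultiGraph()
G.add_nodes_from(['u1', 'u2', 'u3'])

# Each edge carries the dimension d in D it belongs to
G.add_edge('u1', 'u2', dimension='facebook')
G.add_edge('u1', 'u2', dimension='twitter')
G.add_edge('u2', 'u3', dimension='instagram')

# Restrict the edge set to a single dimension
facebook_edges = [(u, v) for u, v, d in G.edges(data='dimension')
                  if d == 'facebook']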
2.2 Content-Based Classification
The model itself will contain three different sub-models, one for each content type
we have.
2.2.1 Text Classification Model
As mentioned in the previous chapter, before applying machine learning classification models to textual content, we have to perform text analysis and transform the text into numerical input that a model can understand.
As illustrated in Fig. III.2, when textual data is received, it first passes through the NLP process. Once that is done, it has to be represented numerically. In the last chapter, we presented a comparison between two word embedding techniques, namely Word2Vec and TF-IDF. We chose TF-IDF because, in a classification problem, we are more interested in differentiating the categories than in representing the semantic similarity of words. Now that our textual data can be represented in a numerical way the machine understands, we can pass it to any machine learning model. As a strategy, we decided that in the implementation phase we would try the different models mentioned in the last chapter, such as Support Vector Machines, Logistic Regression and Neural Networks, and then compare their results to assess which one performs best.
Figure III.2: Text Classification Model
2.2.2 Image Classification Model
In the previous chapter, we presented the convolutional neural network as a model for image classification. However, designing a CNN requires extensive parameter tuning and the addition/removal of convolution blocks to find the best architecture, re-training the model each time. This is a hugely time-consuming task. To overcome it, a technique called Transfer Learning can help obtain better results faster.
Transfer Learning is a technique that makes a model benefit from knowledge gained while solving another, similar problem. For example, a model that learned to recognize cars could use its knowledge to recognize trucks [33]. This is done by taking a pre-trained model, changing a few layers, usually the last ones, and re-training only those layers.
It is shown in [34] that transfer learning can bring substantial improvements in accuracy, execution time and memory usage.
Another known limitation usually encountered in image classification is not having diverse enough data or enough samples. A solution to that is the Data Augmentation technique.
Data Augmentation is a technique for generating more data: having little data with insufficient variation creates a bottleneck for neural network models, which usually require thousands of diverse training samples to generalize their learning. It relies on techniques such as:
• Flipping: Flip the image horizontally and vertically.
• Rotating: Rotate the image by some degrees.
• Scaling: Re-scale an image by making it larger or smaller.
• Cropping: Crop a part of an image.
• Translating: Move the image in some direction.
• Adding Gaussian Noise: Add noisy points to the image.
Applying data augmentation can help improve the model score, as discussed in [35].
Therefore, as illustrated in Fig. III.3, once we have image data, it passes through our trained CNN model, resulting in an image-content score.
Figure III.3: Image Classification Model
2.2.3 General Information Classification Model
For the general information model, the features do not require heavy pre-processing for the machine to understand them. We only have to apply some encoding techniques to the non-numerical data, then fit the result to a supervised machine learning classification model.
For non-numerical features such as gender and relationship status, we have to encode them into numerical values. Since these are binary, we can use 0 and 1. For non-binary values, we have to use techniques such as one-hot encoding or integer label encoding (the representation expected by a sparse categorical cross-entropy loss).
As for the username, we can apply some feature engineering to create relevant features
from it such as the length, number of unique characters and other important information
as discussed in [8].
Other numerical features such as the age, the number of friends and the number of fol-
lowers can be passed them directly to the model.
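As an illustration, here is a minimal sketch of these encoding and feature engineering steps using the pandas library; the feature values are hypothetical, and the final feature set is only fixed in the implementation phase.

import pandas as pd

# Hypothetical sample of general information data
users = pd.DataFrame({
    'username': ['jsmith_42', 'xx__k1ller__xx'],
    'gender': ['male', 'female'],
    'relationship': ['single', 'married'],
    'age': [24, 31],
})

# Binary feature: encode directly as 0/1
users['gender'] = users['gender'].map({'male': 0, 'female': 1})

# Non-binary feature: one-hot encoding
users = pd.get_dummies(users, columns=['relationship'])

# Feature engineering on the username, as discussed in [8]
users['username_length'] = users['username'].str.len()
users['username_unique_chars'] = users['username'].apply(
    lambda name: len(set(name)))
users = users.drop(columns=['username'])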
In the implementation phase, we try different classification models and compare their results to select the one that performs best.
Figure III.4: General Information Classification Model
2.3 Decision Making
Now that we have a model for each data type, we can move to phase 3, where we propose a calculation formula that provides a score for each user.
While carrying out this work, and based on the available features, we noticed that the textual content and the image content have more impact on the user's behavior than the general information, which can be misleading. Therefore, as a compromise, we decided to give each input type a weight relative to its impact in determining the anomalousness of the user.
Given the three sub-model scores {s1, s2, s3} and three weights {α1, α2, α3}, each node u ∈ V on each dimension d ∈ D has the terrorism score S(u)_d of that dimension, as in (III.1).
$$S(u)_d = \sum_{i=1}^{3} \alpha_i \times s(u)_i \qquad \text{(III.1)}$$
Each user now has a score for each dimension, based on the sub-model scores of that dimension, but as an output we want a single score. Therefore, given 3 dimensions, each user has a terrorism score S_T(u) as in (III.2).
$$S_T(u) = \frac{\sum_{d=1}^{3} S(u)_d}{3} \qquad \text{(III.2)}$$
Now that each user u ∈ V has a terrorism score S_T(u), we have to decide whether that user is a terrorist or not. This is done by defining a threshold γ where:
$$S_T(u) \geq \gamma \Rightarrow \text{Terrorist}, \qquad S_T(u) < \gamma \Rightarrow \text{Not Terrorist} \qquad \text{(III.3)}$$
The values of the weights αi and the threshold γ are determined in the implementation phase.
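For intuition, consider a worked example with purely illustrative values (the actual weights and threshold are only fixed in the implementation chapter). With weights α = (0.4, 0.4, 0.2) and sub-model scores s(u) = (0.8, 0.7, 0.3) on one dimension, (III.1) gives

$$S(u)_d = 0.4 \times 0.8 + 0.4 \times 0.7 + 0.2 \times 0.3 = 0.66$$

If the three dimension scores are 0.66, 0.70 and 0.62, then (III.2) gives S_T(u) = (0.66 + 0.70 + 0.62)/3 = 0.66, which with a threshold of γ = 0.5 would flag the user as a terrorist by (III.3).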
2.4 Global Model
After defining the different components of our model, let us present its design along with the workflow of its use. Fig. III.5 shows what our model looks like, using the example of a single user with three dimensions: the Facebook, Twitter and Instagram data.
Figure III.5: Proposed Model
Fig. III.6 illustrates the workflow of our model. Each time a user is involved in an activity, the user's data passes through the model. If the user's behavior is detected as terrorist, we re-train the model with this new data to keep it updated with new, unseen behaviors. If the model loses accuracy after re-training, we revert to the last existing model.
Figure III.6: Model Workflow
Conclusion
In this chapter, we presented our proposed approach, starting with the research questions we aim to answer. Then, we showed the different phases to follow in order to answer those questions. Finally, we explored the steps toward the construction of our model.
The next chapter details the achievements and the different results.
IV Implementation and Results
This chapter presents the practical part of our work. We go through our implementation pipeline, starting with the data gathering, then the model creation, and we finish with the interpretation of the results and the answers to the research questions.
1 Data Collection
In this section, we explain how we gathered the data identified in the last chapter. As discussed, there are two types of data: offline data and online data. In the following sections, we implement the data gathering for each of them.
1.1 Offline Data
To train the models, we used offline data, that is, public datasets related to our problem.
In the last chapter, we chose a data source for each input type; we implement their gathering scripts in the following sections.
1.1.1 Textual-Content Data
For the textual data, we have two sources:
• Positive labels: tweets of banned Twitter accounts.
• Negative labels: news headlines from the GTD.
Our positive labels are the data containing terrorist textual content. Our strategy was to gather tweets of banned users that were reported to the @twittersafety account and that contained terrorism-related hashtags at the time they were reported; this can be done through the Twitter API or the Twitter search tool. Fig. IV.1 illustrates an example of our searches, looking for tweets that were reported to or mentioned the @twittersafety account and that contain the hashtags #ISIS, #terrorist, #Daech and #IslamicState.
Figure IV.1: Twitter Searching Tool
While doing our research, we found that an organization had already carried out this process, extracting over 17k clean terrorist tweets of ISIS users and publishing them as a Kaggle dataset called How ISIS Uses Twitter [36].
For our negative labels, we need content related to terrorism in the opposite way, such as news reporting on terrorism. For that, we use the news headlines from the Global Terrorism Database (GTD) [13]. Fig. IV.2 presents a sample of 4 rows of GTD news headlines.
Figure IV.2: Sample of news headlines
Our final dataset merges the tweets, labeled as terrorist, and the GTD data, labeled as news. Fig. IV.3 shows the word cloud of the most frequent keywords in our dataset, including both positive and negative labels.
Figure IV.3: Word Cloud of our Textual Data
In total, we have approximately 300k samples, of which about 122k are terrorist data and around 181k are news headlines. Table IV.1 presents the exact numbers in our dataset.
Label Number of samples
Positive labels 122619
Negative labels 181691
Total Data 304310
Table IV.1: Textual-Content Dataset
1.1.2 Image-Content Data
As discussed in our research, the source of the image data is Google Images, from which we gather images manually. Fortunately, a Python package called google_images_download [37] exists, which allows us to automate this task by specifying the keywords we are looking for and the number of images needed; a usage sketch is given below.
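As an illustration, a minimal sketch of how this package can be driven; the keywords, limit and output directory are hypothetical stand-ins for the queries we actually ran.

from google_images_download import google_images_download

downloader = google_images_download.googleimagesdownload()

# One query per category; 'limit' caps the number of images per keyword
for keywords in ['terrorist propaganda', 'military parade',
                 'terrorism news']:
    downloader.download({'keywords': keywords,
                         'limit': 250,
                         'output_directory': 'raw_images'})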
We ran a script that downloaded around five hundred images of terrorist individuals and incitement acts, in addition to another five hundred images of military and terrorism news. Unfortunately, the images were not 100% related to what we were looking for; we therefore had to manually verify the gathered images and remove the unrelated ones.
After cleaning the data and keeping only related images, we had around 200 terrorist images and 300 military and news images. Table IV.2 shows the exact numbers of images in our dataset. Fig. IV.4 and Fig. IV.5 show three random images of each category.
Label Number of samples
Positive labels 219
Negative labels 314
Total Data 533
Table IV.2: Image-Content Dataset
Figure IV.4: Sample of Terrorists images
Figure IV.5: Sample of Military/News images
1.1.3 General Information Data
For the general information data, we used the Profiles of Individual Radicalization In the United States (PIRUS) [25] public dataset, from which we extracted the ages, genders and relationship statuses of 135 extremist persons; these are our positive labels. As for the negative labels, we use the online data to build our dataset.
Fig. IV.6, Fig. IV.7 and Fig. IV.8 show the distribution of each feature within our positive-label data.
Figure IV.6: Age Distribution
Figure IV.7: Relationship Distribution
Figure IV.8: Gender Distribution
1.2 Online Data
In this section, we implement the scripts that gather the online data from our selected social media platforms: Facebook, Instagram and Twitter.
1.2.1 Facebook Data
Facebook provides the HTTP-based Graph API. A public SDK called facebook-sdk helps us write an automated Facebook data gathering script in Python.
To use the Facebook Graph API, it is necessary to pass an access token that has the relevant permissions for the social graph objects being queried. In the Facebook social graph, each object has fields related to its type; for example, the User object contains information about the user profile such as the age, relationship status and gender. The main objects we are interested in are User, Post and Comment. To access a graph object, you pass the id of an object of that type. Therefore, posts and comments cannot be accessed directly: the post ids are contained in the list of posts of the User object, and likewise the comments are part of the posts. Fig. IV.9 shows a representation of the Facebook Graph API.
Figure IV.9: Facebook Graph API
Our script starts by obtaining the user's information along with the list of post ids. Then, it accesses each post by looping through the post ids from the posts field of the User object and retrieves the necessary information. After that, it extracts the comments by looping through the comment ids from the comments field of each Post object. Finally, it parses the textual and image data from those posts and comments.
The following code is an example of how to get the user information along with the posts data.
import json

import facebook
import requests

# Connect to the Graph API with a valid access token
graph = facebook.GraphAPI(access_token=access_token, version=3.1)

# General information of the user
user_information = graph.get_object(
    id='me', fields='id,name,age_range,gender,relationship_status')

# First page of post ids, then follow the pagination links
posts_ids = []
posts_object = graph.get_object(id='me', fields='posts')
posts_ids.extend(posts_object['posts']['data'])
next_page = posts_object['posts'].get('paging', {}).get('next')
while next_page is not None:
    response = requests.get(next_page)
    new_data = json.loads(response.content)
    posts_ids.extend(new_data['data'])
    try:
        next_page = new_data['paging']['next']
    except KeyError:
        next_page = None

# Relevant fields of each post
for post in posts_ids:
    post_data = graph.get_object(
        id=post['id'],
        fields='created_time,full_picture,message,shares,likes.summary(1)')
    post_data['likes'] = post_data['likes']['summary']['total_count']
    try:
        post_data['shares'] = post_data['shares']['count']
    except KeyError:
        post_data['shares'] = 0
1.2.2 Instagram Data
For Instagram, the task is easier, as it provides a plain REST API with JSON output where each endpoint is directly accessible through any HTTP request module. In Python, we use the requests module with the Instagram endpoint https://api.instagram.com/v1/, where we can access the user's information through /users/self/?access_token={} and the posts through /users/self/media/recent/?access_token={}.
The following code shows how our script gathers information from Instagram.
import json
from datetime import datetime

import requests

# User data
response = requests.get(
    'https://api.instagram.com/v1/users/self/?access_token={}'.format(
        access_token))
user = json.loads(response.content)['data']

# Posts data
response = requests.get(
    'https://api.instagram.com/v1/users/self/media/recent/?access_token={}'
    .format(access_token))
data = json.loads(response.content)
for post in data['data']:
    _id = post['id']
    creation_timestamp = post['created_time']
    created_time = datetime.fromtimestamp(
        int(creation_timestamp)).strftime('%Y-%m-%d %H:%M:%S')
    message = post['caption']['text'] if post['caption'] is not None else ''
    img_url = post['images']['standard_resolution']['url']
    post_data = dict(created_time=created_time, id=_id,
                     message=message, img_url=img_url)
1.2.3 Twitter Data
Similarly to Instagram, Twitter provides a REST API; however, it also offers a Python SDK that makes the API easier to use. To use it, we have to pass 4 access keys: the consumer key, the consumer secret, the access token key and the access token secret. Each key carries permissions that allow access to either the user's private data or the public Twitter data.
The following code is an example of how we load tweets using the Twitter Python SDK.
import twitter

api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret)

# Identify the authenticated user, then load his/her timeline
user_id = api.VerifyCredentials().AsDict()['id']
tweets = api.GetUserTimeline(user_id=user_id)
for tweet in tweets:
    tweet = tweet.AsDict()
    _id = tweet['id']
    created_time = tweet['created_at']
    message = tweet['text'] if tweet['text'] is not None else ''
    tweet_data = dict(created_time=created_time,
                      id=_id, message=message)
2 Model Implementation
In the next sections, we implement the different components that lead toward constructing our proposed model.
For each sub-model, we split the dataset of that content type into 80% training data and 20% testing data. All the models are implemented on the same machine, provided by Kaggle, a data science platform, with the following hardware:
• RAM: 16 GB
• CPU count: 2
• GPU: Tesla K80
• Disk: 5 GB
2.1 Text Classification Model
The steps to construct our text classification model are: first, prepare the NLP pipeline for data pre-processing; then, vectorize the data with TF-IDF; finally, pass it to a classification model.
2.1.1 NLP Process
In practice, the NLP process consists of tokenization, stop-word removal and lemmatization. In the following code, we use the Natural Language Toolkit (NLTK) Python package to perform these steps. We start with regular expressions that remove unnecessary text that disrupts the process, such as links and dates. Then, we split the text into tokens, remove the stopwords (common low-information words like 'a', 'the', 'that', 'on') and lemmatize the words by determining the root of each word based on its part-of-speech tag (adjective, verb, noun).
import re

# The tokenizer, is_stopword and wordnet_lemmatizer helpers are built
# with NLTK; a sketch of their definitions is given after this block.

def process_text(text):
    nltk_processed_data = []
    # Remove links and dates that disrupt the process
    text = re.sub(r'https?://\S+', '', text, flags=re.MULTILINE)
    text = re.sub(r'(?:[0-9]{1,2}[:/-]){2}[0-9]{2,4}', '', text,
                  flags=re.MULTILINE)
    for w in tokenizer.tokenize(text):
        word = w.lower()
        if not is_stopword(word=word):
            processed_text = wordnet_lemmatizer.lemmatize(
                word, get_wordnet_pos(word))
            nltk_processed_data.append(processed_text)
    return nltk_processed_data
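For completeness, the following is a minimal sketch of how the NLTK helpers used above could be defined; our actual script may define them slightly differently.

import nltk
from nltk.corpus import stopwords, wordnet
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def is_stopword(word):
    return word in stop_words

def get_wordnet_pos(word):
    # Map the Penn Treebank tag of the word to a WordNet POS tag
    tag = nltk.pos_tag([word])[0][1][0].upper()
    tag_dict = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'R': wordnet.ADV}
    return tag_dict.get(tag, wordnet.NOUN)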
2.1.2 Data Vectorization
To use our data in classification models, we have to vectorize it into semantic numerical data. In the last chapter, we chose TF-IDF as our vectorizer. Scikit-Learn offers the TfidfVectorizer module, which can be used in two lines. We defined the object's parameters as follows:
• max_df: maximum document frequency for a word to be kept in the vocabulary ⇒ 0.95 (a word must appear in at most 95% of the documents)
• min_df: minimum document frequency for a word to be kept in the vocabulary ⇒ 0.1 (a word must appear in at least 10% of the documents)
• ngram_range: range of n-gram sizes considered as single tokens ⇒ (1,3) (from 1 word to 3 words)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.1, ngram_range=(1,3))
X = vectorizer.fit_transform(train_data)
The data used to fit the TfidfVectorizer comprises around 205K samples.
After vectorization, TF-IDF identified 330 features, which makes our data shape (n_training_samples, n_dimensions) ⇒ (243448, 330).
We use the trained vectorizer to transform the testing data as follows:
transformed_data = vectorizer.transform(test_data).toarray()
2.1.3 Data Classification
As mentioned in the last chapter, we try three classification models, namely Logistic Regression, Support Vector Machine and Neural Network. The best performing model is later used in our global model.
To implement the Logistic Regression and Support Vector Machine models, we used Scikit-Learn, a Python machine learning library that offers many well-known models. We trained these two models with their default parameter values.
For the Neural Network, we used Keras, a framework that runs on top of TensorFlow. The architecture of our model is composed of three layers, with 16, 8 and 1 neurons respectively. The first two layers use a 'relu' activation, which is proven for its performance, and the last layer uses a 'sigmoid' activation, as it is our output layer and we have a binary classification problem. The model is compiled with 'binary_crossentropy' as the loss function and 'adam' as the optimizer. For the training parameters, we used 20 epochs with a batch size of 128 and 20% of the training data held out for validation. A minimal sketch of this architecture follows.
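As an illustration, a minimal Keras sketch of this architecture; X_train and y_train are assumed to hold the TF-IDF-vectorized training samples (330 features) and their labels.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential([
    Dense(16, activation='relu', input_shape=(330,)),
    Dense(8, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=128,
          validation_split=0.2)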
Table IV.3 shows the different metric scores for each model along with the training time. These models were trained and tested with the same data and on the same machine. The model we use in our global model is the Neural Network, as it has the best F1-score with a good training time.
Model Name Accuracy F1-Score Training time
Logistic Regression 0.9726 0.9674 39.9 secs
SVM 0.9626 0.9548 6h 48min 33s
Neural Network 0.9774 0.9719 1min 11s
Table IV.3: Text Models Metric Scores
2.2 Image Classification Model
In the last chapter, we chose the convolutional neural network as our image classification model, along with two optimization techniques, namely Transfer Learning and Data Augmentation. Therefore, as a first step, we implement our data augmentation functions, then define which base model's learnt knowledge is reused in our model.
2.2.1 Data Augmentation
For data augmentation, a Python package called imgaug provides all the different augmentation techniques. In the following code, we show an example of using the augmenters of the imgaug library, where a randomly selected augmentation technique is applied to the image.
from imgaug import augmenters as iaa

img_augmentor = iaa.Sequential([
    # Select one of the augmentation techniques randomly
    iaa.OneOf([
        iaa.Affine(rotate=0),
        iaa.Affine(rotate=90),
        iaa.Affine(rotate=180),
        iaa.Affine(rotate=270),
        iaa.Fliplr(0.5),
        iaa.Flipud(0.5),
    ])], random_order=True)

# Apply the augmentation technique on the image
image_aug = img_augmentor.augment_image(image)
Fig. IV.10 shows an example of two images generated with the data augmentation code above.
Figure IV.10: An example of data augmentation
After applying data augmentation to the training data, we generated an additional 30% of data, resulting in a total of approximately 550 images.
2.2.2 Transfer Learning
Many pre-trained models exist nowadays, but each is focused on a specific problem. In our case, we work mostly with faces and objects like guns, so the pre-trained VGG16 model [38] is the most suitable for our problem.
To adapt VGG16 to our problem, we remove its fully-connected layers, freeze the training of the remaining layers and add two new layers. The first has 16 neurons and a 'relu' activation. The second, our output layer, has 1 neuron and a 'sigmoid' activation. The loss function is 'binary_crossentropy' with 'adam' as the optimizer. Since image classification is a complex task and we have a small amount of data, we train the model for up to 5000 epochs with a batch size of 32, using an early stopping strategy with a patience of 250 rounds. A minimal sketch of this adaptation follows.
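As an illustration, a minimal Keras sketch of this adaptation, assuming the default 224x224 RGB input size of VGG16; the training call with early stopping is omitted.

from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.models import Model

# Load VGG16 without its fully-connected head and freeze its layers
base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))
for layer in base.layers:
    layer.trainable = False

# Add our own classification head
x = Flatten()(base.output)
x = Dense(16, activation='relu')(x)
output = Dense(1, activation='sigmoid')(x)

model = Model(inputs=base.input, outputs=output)
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])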
In Table IV.4, we present the scores of the different combinations of our two added CNN layers, with and without the pre-trained model (TL) and with and without the data generated by data augmentation (DA). While the scores were measured on the same testing data, the training data differs when data augmentation is used. Using both DA and TL together yields the best scores with a reasonable training time; therefore, this is the combination we use in our global model.
Model Accuracy F1-Score Training time
CNN 0.7631 0.7219 3min 50secs
CNN + DA 0.7781 0.7463 4min 12secs
CNN + TL 0.8291 0.8103 8min 48secs
CNN + DA + TL 0.8571 0.8454 9min 23secs
Table IV.4: Image Models Metric Scores
2.3 General Information Classification Model
For the general information, we follow the same strategy used for text classification: we work with three classification models, namely Logistic Regression, Support Vector Machine and Neural Network, and the best performing model is later used in our global model.
For the Logistic Regression and the Support Vector Machine, we used the default Scikit-Learn parameter values.
For the Neural Network, however, we used an architecture of four layers with 16, 8, 4 and 1 neurons respectively. A 'relu' activation is used for the first three layers and a 'sigmoid' activation for the last layer. The model is compiled with 'binary_crossentropy' as the loss function and 'adam' as the optimizer. For the training parameters, we used 200 epochs with a batch size of 32 and 20% of the training data held out for validation.
Table IV.5 presents the metric scores of the models, trained with the same data on the same machine. For the global model, we use the SVM, as it clearly outperforms the other models.
Model Name Accuracy F1-Score Training time
Logistic Regression 0.7650 0.7873 5 secs
SVM 0.8300 0.8495 7 secs
Neural Network 0.8173 0.8325 48.6 secs
Table IV.5: General Information Models Metric Scores
2.4 Proposed Model
In this part, we walk through our proposed model's workflow, putting the components together and implementing the missing ones.
Our model's input is a multidimensional network; therefore, we have to implement a parser that maps the data to the corresponding sub-model.
This is solved by creating objects in which we store the data in a convenient way before passing it to the sub-models. Fig. IV.11 illustrates the class diagram in which we store each user's data, and a sketch of such objects is given after the diagram. The general user information data is in the User object, while the Post object, which can also be a Comment, holds both image and textual data.
Figure IV.11: Class Diagram
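As an illustration, a minimal sketch of such container objects using Python dataclasses; the field names follow the intent of the diagram but are illustrative.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Post:
    # A post or a comment, holding both textual and image data
    id: str
    created_time: str
    message: str = ''
    img_url: Optional[str] = None
    comments: List['Post'] = field(default_factory=list)

@dataclass
class User:
    # General information data of a user
    id: str
    age: Optional[int] = None
    gender: Optional[str] = None
    relationship: Optional[str] = None
    posts: List[Post] = field(default_factory=list)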
The second component of our model is the set of sub-models that receive the input data. For these, we use the chosen pre-trained model of each input type, each one outputting a score.
The next component is the decision making, where we interpret the output scores of the sub-models, calculate the terrorism score and decide on the user's extremeness. The calculation formula was already defined in the last chapter, but the values of the threshold γ and the model weights α had yet to be decided. For the weights, since we have more features in the image and textual content than in the general information, we set them as follows:
• Text-model weight: 0.4 (40%)
• Image-model weight: 0.4 (40%)
• Information-model weight: 0.2 (20%)
As for the threshold, since we do not have enough real online data to determine it in a scientific way, we agreed to keep it neutral with a value of 0.5 (50%). A minimal sketch of the resulting scoring step is given below.
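As an illustration, a minimal sketch of this decision-making step; the function and variable names are ours and hypothetical.

ALPHA = {'text': 0.4, 'image': 0.4, 'info': 0.2}
GAMMA = 0.5  # decision threshold

def dimension_score(scores):
    # Terrorism score of a user on one dimension (Eq. III.1);
    # 'scores' maps each sub-model name to its output score
    return sum(ALPHA[name] * score for name, score in scores.items())

def terrorism_score(per_dimension_scores):
    # Average of the per-dimension scores (Eq. III.2)
    return sum(per_dimension_scores) / len(per_dimension_scores)

def is_terrorist(user_scores):
    # Threshold decision of Eq. III.3; 'user_scores' is a list with
    # one score dictionary per dimension
    final = terrorism_score([dimension_score(s) for s in user_scores])
    return final >= GAMMA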
The model itself is adapted to over-time change; thus, a component that re-trains and reverts a model must also be implemented. For that, we have a database in which we store the last model's score, and a Python function that checks whether the score improved after re-training the model on the new terrorist user's data; a sketch follows.
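As an illustration, a minimal sketch of this re-train-or-revert logic, assuming a Keras-style model and a hypothetical score store exposing get_last_score and save_score.

def retrain_or_revert(model, new_data, new_labels, X_test, y_test, store):
    # Re-train on newly detected terrorist data and keep the update
    # only if the evaluation score does not degrade
    previous_score = store.get_last_score()
    previous_weights = model.get_weights()
    model.fit(new_data, new_labels, epochs=5, verbose=0)
    new_score = model.evaluate(X_test, y_test, verbose=0)[1]
    if new_score >= previous_score:
        store.save_score(new_score)  # keep the re-trained model
    else:
        model.set_weights(previous_weights)  # revert to the last version
    return model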
With those components ready, our model's implementation is finished and the model is ready to be tested.
3 Results Interpretation
In this section, we test our model on a network to see whether we can answer the research questions posed at the beginning of our proposal.
The network passed to the model is composed of two real users (U1 and U2) that are non-terrorist and one generated terrorist user (U3), as we could not find an available terrorist user. The input was tested at only a single timestamp t, due to the lack of historical data.
As shown in Table IV.6, which presents the scores predicted for those users by each sub-model on each social network (Facebook: FB, Instagram: IG, Twitter: T), the model performed well, correctly predicting the anomalousness of the users. Based on these results, we can see that a terrorist can be detected according to his/her social media content; thus, our answer to Q1 is positive. We can also notice that the scores for the same data type on different social networks are mostly similar, except for the text content on Instagram, which consists only of image captions; this means that our answer to Q3 is also positive.
User Text-Model Score Image-Model Score Information-Model Score Final Score
FB IG T FB IG T FB IG T
U1 0.084 0.084 0.079 0.031 0.068 0.063 0.265 0.318 0.345 0.116
U2 0.059 0.054 0.078 0.013 0.054 0.115 0.530 0.445 0.276 0.133
U3 0.859 0.298 0.854 0.658 0.877 0.816 0.530 0.637 0.690 0.705
Table IV.6: Model Testing Results
After detecting the user U3 as a terrorist, the sub-models were re-trained by appending the new data extracted from U3 to the old data. The score of each sub-model increased by an average of 0.01. Although this increase could be considered negligible, over time it helps our model stay up to date with new terrorism contents; thus, if a user starts adopting new terrorist behaviors that the model was not initially trained on, the user will still be detected as a terrorist. Therefore, our answer to Q2 is positive.
Conclusion
In this chapter, we presented the implementation of our solution, starting with the data gathering, then the sub-model training and the construction of our proposed model, and we finished by testing the model and answering our research questions.
V Conclusions and Perspectives
In this thesis, we proposed a terrorist detection model that takes multidimensional networks as its input format and that supports different input data types such as texts and images. Our model can also detect whether a user is adopting a new behavior over time, and the model itself can automatically learn new terrorist behaviors.
We started by presenting the existing work carried out in the anomaly and terrorism detection domains. Then, we discussed the existing techniques for automated data processing and classification. After that, we presented the model's design and the theoretical perspective of its workflow. Finally, we implemented the model and discussed the results.
The model showed good results on two real users and one generated user by predicting their anomalousness correctly. Although the amount of online data used for testing is very small, this still constitutes a proof of concept that our proposed model can be implemented and put into a production environment.
Although we tried to cover the limitations of other existing models, our proposed model still lacks support for some functionalities, such as:
• Graph analysis: since our input data is a network, we could use graph analysis methodologies to detect communities.
• Support for videos: we could add another sub-model for video classification, since videos are among the most important contents in social media.
The model's accuracy can also be improved by using larger datasets, which would also let us calibrate the threshold and the sub-model weights in a more principled way.
Bibliography
[1] Shannon Greenwood, Andrew Perrin, and Maeve Duggan. Social media update
2016. Pew Research Center, 11(2), 2016.
[2] Alex P Schmid. The definition of terrorism. In The Routledge handbook of terrorism
research, pages 57–116. Routledge, 2011.
[3] Facebook community standards. URL https://www.facebook.com/communitystandards/dangerous_individuals_organizations.
[4] Arash Habibi Lashkari, Min Chen, and Ali A Ghorbani. A survey on user profiling
model for anomaly detection in cyberspace. Journal of Cyber Security and Mobility, 8
(1):75–112, 2019.
[5] Zahedeh Zamanian, Ali Feizollah, Nor Badrul Anuar, Laiha Binti Mat Kiah,
Karanam Srikanth, and Sudhindra Kumar. User profiling in anomaly detection of
authorization logs. In Computational Science and Technology, pages 59–65. Springer,
2019.
[6] Sreyasee Das Bhattacharjee, Junsong Yuan, Zhang Jiaqi, and Yap-Peng Tan. Context-
aware graph-based analysis for detecting anomalous activities. In 2017 IEEE Inter-
national Conference on Multimedia and Expo (ICME), pages 1021–1026. IEEE, 2017.
[7] Di Chen, Qinglin Zhang, Gangbao Chen, Chuang Fan, and Qinghong Gao. Forum
user profiling by incorporating user behavior and social network connections. In
International Conference on Cognitive Computing, pages 30–42. Springer, 2018.
[8] Hamidreza Alvari, Soumajyoti Sarkar, and Paulo Shakarian. Detection of violent
extremists in social media. arXiv preprint arXiv:1902.01577, 2019.
[9] Pradip Chitrakar, Chengcui Zhang, Gary Warner, and Xinpeng Liao. Social media
image retrieval using distilled convolutional neural network for suspicious e-crime
and terrorist account detection. In 2016 IEEE International Symposium on Multimedia
(ISM), pages 493–498. IEEE, 2016.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with
deep convolutional neural networks. In Advances in neural information processing
systems, pages 1097–1105, 2012.
[11] George Kalpakis, Theodora Tsikrika, Stefanos Vrochidis, and Ioannis Kompatsiaris.
Identifying terrorism-related key actors in multidimensional social networks. In
International Conference on Multimedia Modeling, pages 93–105. Springer, 2019.
[12] Pankaj Choudhary and Upasna Singh. A survey on social network analysis for
counter-terrorism. International Journal of Computer Applications, 112(9):24–29,
2015.
[13] Gary LaFree and Laura Dugan. Introducing the global terrorism database. Terrorism
and Political Violence, 19(2):181–204, 2007.
[14] Kalev Leetaru and Philip A Schrodt. Gdelt: Global data on events, location, and
tone, 1979–2012. In ISA annual convention, volume 2, pages 1–49. Citeseer, 2013.
[15] Linton C Freeman. Centrality in social networks conceptual clarification. Social
networks, 1(3):215–239, 1978.
[16] EDUCBA contributors. Text mining vs natural language processing - top 5 comparisons, Aug 2019. URL https://www.educba.com/important-text-mining-vs-natural-language-processing/.
[17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of
word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] Shivangi Singhal. Data representation in nlp, Jul 2019. URL https://medium.com/
@shiivangii/data-representation-in-nlp-7bb6a771599a.
[19] Eric Kauderer-Abrams. Quantifying translation-invariance in convolutional neural
networks. arXiv preprint arXiv:1801.01450, 2017.
66
Master's Thesis
Master's Thesis

Weitere ähnliche Inhalte

Was ist angesagt?

Lecture notes on hybrid systems
Lecture notes on hybrid systemsLecture notes on hybrid systems
Lecture notes on hybrid systems
AOERA
 
ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)
Ian Dewancker
 
MACHINE LEARNING METHODS FOR THE
MACHINE LEARNING METHODS FOR THEMACHINE LEARNING METHODS FOR THE
MACHINE LEARNING METHODS FOR THE
butest
 
aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11
Aniket Pingley
 

Was ist angesagt? (18)

thesis
thesisthesis
thesis
 
Anarchi report
Anarchi reportAnarchi report
Anarchi report
 
Lecture notes on hybrid systems
Lecture notes on hybrid systemsLecture notes on hybrid systems
Lecture notes on hybrid systems
 
PhD-2013-Arnaud
PhD-2013-ArnaudPhD-2013-Arnaud
PhD-2013-Arnaud
 
how to design classes
how to design classeshow to design classes
how to design classes
 
ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)ubc_2014_spring_dewancker_ian (9)
ubc_2014_spring_dewancker_ian (9)
 
MACHINE LEARNING METHODS FOR THE
MACHINE LEARNING METHODS FOR THEMACHINE LEARNING METHODS FOR THE
MACHINE LEARNING METHODS FOR THE
 
Capturing Knowledge Of User Preferences With Recommender Systems
Capturing Knowledge Of User Preferences With Recommender SystemsCapturing Knowledge Of User Preferences With Recommender Systems
Capturing Knowledge Of User Preferences With Recommender Systems
 
btpreport
btpreportbtpreport
btpreport
 
Uml (grasp)
Uml (grasp)Uml (grasp)
Uml (grasp)
 
document
documentdocument
document
 
SCE-0188
SCE-0188SCE-0188
SCE-0188
 
Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...Design and implementation of a Virtual Reality application for Computational ...
Design and implementation of a Virtual Reality application for Computational ...
 
Cognos v10.1
Cognos v10.1Cognos v10.1
Cognos v10.1
 
aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11aniketpingley_dissertation_aug11
aniketpingley_dissertation_aug11
 
Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...Im-ception - An exploration into facial PAD through the use of fine tuning de...
Im-ception - An exploration into facial PAD through the use of fine tuning de...
 
General physics
General physicsGeneral physics
General physics
 
Mining of massive datasets
Mining of massive datasetsMining of massive datasets
Mining of massive datasets
 

Ähnlich wie Master's Thesis

Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
Léo Vetter
 
Computer Security: A Machine Learning Approach
Computer Security: A Machine Learning ApproachComputer Security: A Machine Learning Approach
Computer Security: A Machine Learning Approach
butest
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
AimonJamali
 
A proposed taxonomy of software weapons
A proposed taxonomy of software weaponsA proposed taxonomy of software weapons
A proposed taxonomy of software weapons
UltraUploader
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
pfleidi
 
Designing Countermeasures For Tomorrows Threats : Documentation
Designing Countermeasures For Tomorrows Threats : DocumentationDesigning Countermeasures For Tomorrows Threats : Documentation
Designing Countermeasures For Tomorrows Threats : Documentation
Darwish Ahmad
 

Ähnlich wie Master's Thesis (20)

Investigation in deep web
Investigation in deep webInvestigation in deep web
Investigation in deep web
 
Upstill_thesis_2000
Upstill_thesis_2000Upstill_thesis_2000
Upstill_thesis_2000
 
Thesis
ThesisThesis
Thesis
 
Content Based Image Retrieval
Content Based Image RetrievalContent Based Image Retrieval
Content Based Image Retrieval
 
Ibm watson analytics
Ibm watson analyticsIbm watson analytics
Ibm watson analytics
 
IBM Watson Content Analytics Redbook
IBM Watson Content Analytics RedbookIBM Watson Content Analytics Redbook
IBM Watson Content Analytics Redbook
 
main
mainmain
main
 
Master_Thesis
Master_ThesisMaster_Thesis
Master_Thesis
 
Computer Security: A Machine Learning Approach
Computer Security: A Machine Learning ApproachComputer Security: A Machine Learning Approach
Computer Security: A Machine Learning Approach
 
Nweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italyNweke digital-forensics-masters-thesis-sapienza-university-italy
Nweke digital-forensics-masters-thesis-sapienza-university-italy
 
Software guide 3.20.0
Software guide 3.20.0Software guide 3.20.0
Software guide 3.20.0
 
Machine_translation_for_low_resource_Indian_Languages_thesis_report
Machine_translation_for_low_resource_Indian_Languages_thesis_reportMachine_translation_for_low_resource_Indian_Languages_thesis_report
Machine_translation_for_low_resource_Indian_Languages_thesis_report
 
A proposed taxonomy of software weapons
A proposed taxonomy of software weaponsA proposed taxonomy of software weapons
A proposed taxonomy of software weapons
 
Scale The Realtime Web
Scale The Realtime WebScale The Realtime Web
Scale The Realtime Web
 
An Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing UnitsAn Optical Character Recognition Engine For Graphical Processing Units
An Optical Character Recognition Engine For Graphical Processing Units
 
Master's Thesis
Master's ThesisMaster's Thesis
Master's Thesis
 
E.M._Poot
E.M._PootE.M._Poot
E.M._Poot
 
Distributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data StreamsDistributed Decision Tree Learning for Mining Big Data Streams
Distributed Decision Tree Learning for Mining Big Data Streams
 
Designing Countermeasures For Tomorrows Threats : Documentation
Designing Countermeasures For Tomorrows Threats : DocumentationDesigning Countermeasures For Tomorrows Threats : Documentation
Designing Countermeasures For Tomorrows Threats : Documentation
 
Technical report
Technical reportTechnical report
Technical report
 

Kürzlich hochgeladen

Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Dr.Costas Sachpazis
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
Tonystark477637
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Christo Ananth
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
ankushspencer015
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
dharasingh5698
 

Kürzlich hochgeladen (20)

VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
VIP Model Call Girls Kothrud ( Pune ) Call ON 8005736733 Starting From 5K to ...
 
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...Booking open Available Pune Call Girls Koregaon Park  6297143586 Call Hot Ind...
Booking open Available Pune Call Girls Koregaon Park 6297143586 Call Hot Ind...
 
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICSUNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
UNIT-IFLUID PROPERTIES & FLOW CHARACTERISTICS
 
Thermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.pptThermal Engineering -unit - III & IV.ppt
Thermal Engineering -unit - III & IV.ppt
 
Thermal Engineering Unit - I & II . ppt
Thermal Engineering  Unit - I & II . pptThermal Engineering  Unit - I & II . ppt
Thermal Engineering Unit - I & II . ppt
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
Online banking management system project.pdf
Online banking management system project.pdfOnline banking management system project.pdf
Online banking management system project.pdf
 
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance BookingCall Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
Call Girls Walvekar Nagar Call Me 7737669865 Budget Friendly No Advance Booking
 
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
Structural Analysis and Design of Foundations: A Comprehensive Handbook for S...
 
result management system report for college project
result management system report for college projectresult management system report for college project
result management system report for college project
 
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
The Most Attractive Pune Call Girls Manchar 8250192130 Will You Miss This Cha...
 
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
(INDIRA) Call Girl Meerut Call Now 8617697112 Meerut Escorts 24x7
 
Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01Double rodded leveling 1 pdf activity 01
Double rodded leveling 1 pdf activity 01
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
Call for Papers - Educational Administration: Theory and Practice, E-ISSN: 21...
 
Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...Call for Papers - International Journal of Intelligent Systems and Applicatio...
Call for Papers - International Journal of Intelligent Systems and Applicatio...
 
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdfONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
ONLINE FOOD ORDER SYSTEM PROJECT REPORT.pdf
 
AKTU Computer Networks notes --- Unit 3.pdf
AKTU Computer Networks notes ---  Unit 3.pdfAKTU Computer Networks notes ---  Unit 3.pdf
AKTU Computer Networks notes --- Unit 3.pdf
 
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 BookingVIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
VIP Call Girls Ankleshwar 7001035870 Whatsapp Number, 24/07 Booking
 

Master's Thesis

  • 1. Tunisian Republic Ministry of Higher Education and Scientific Research University of Tunis El Manar Higher Institute of Computer Science Master’s Thesis Presented in order to obtain the Master’s Degree in Information and Technology Mention: Information and Technology Specialty : Software Engineering (GL) By: Wajdi KHATTEL Proposal of a Terrorist Detection Model in Social Networks Presented on 07.12.2019 In front of jury composed of: President: Evaluator: Academic supervisor: Laboratory supervisor: Najet AROUS Olfa EL MOURALI Ramzi GUETARI Nour El Houda BEN CHAABENE Realized within Academic year : 2018-2019
  • 2. Laboratory Supervisor Academic Supervisor I authorize the student to submit his internship report for a defense Signature I authorize the student to submit his internship report for a defense Signature Le 22/11/2019 Ramzi Guetari Le 22/11/2019 Nour El Houda Ben Chaabene
  • 3. Dedications I want to dedicate this humble work to: My parents Abderraouf and Sonia for all the pain they have been through and all the sacrifices they made in order for me to reach this level and for me to be what I am today. To my sister Yosra and her husband Jamel for their patience, continuous support and care. To all the members of my family and my dearest friends for the best times and laughs we had and sticking by my side the time I needed. For all those I love and all those who love me. To all who helped that I forgot to mention. With Love, Wajdi Khattel. iii
  • 4. Acknowledgements I would like first to thank and express my very profound gratitude to my academic advisor, Mrs. Nour EL Houda BEN CHAABENE for the huge effort and sacrifice she gave the entire time and also for believing in our capacities and her patience, motivation, and immense knowledge. Her guidance helped us in all the time of research and writing of this thesis. My academic Professor, Mr. Ramzi GUETARI, for his big support and generosity and his continuous welcome in his office that was always open whenever I ran into a trouble spot or had a question about our research, and steering us in the right direction whenever I needed it. Also anyone who contributed to this work for the support, even spiritually especially the last couple of weeks. With Gratitude Wajdi Khattel. iv
  • 5. Table of Contents General Introduction 1 I State of the art 3 1 Anomaly Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 4 1.1 Activity-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.2 Graph-based Detection . . . . . . . . . . . . . . . . . . . . . . . . . . 7 1.3 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9 2 Terrorist Detection in Social Media . . . . . . . . . . . . . . . . . . . . . . . 10 2.1 Existing Content-based Models . . . . . . . . . . . . . . . . . . . . . 11 2.2 Existing Graph-input Analysis . . . . . . . . . . . . . . . . . . . . . 13 II Existing Techniques 16 1 Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1.1 Text Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 17 1.1.2 Data Representation . . . . . . . . . . . . . . . . . . . . . . 20 1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 1.2.1 CNN: Convolutional Layer . . . . . . . . . . . . . . . . . . 23 1.2.2 CNN: Pooling Layer . . . . . . . . . . . . . . . . . . . . . . 25 1.2.3 CNN: Fully-Connected Layer . . . . . . . . . . . . . . . . . 26 1.3 Numerical-Content Data . . . . . . . . . . . . . . . . . . . . . . . . . 26 2 Data Classification in Machine Learning . . . . . . . . . . . . . . . . . . . . 26 2.1 Support Vector Machines . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.2 Logistic Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27 2.3 Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28 v
  • 6. Table of Contents III Proposed Model 29 1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.1 Offline Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 31 1.2 Online Data Sources . . . . . . . . . . . . . . . . . . . . . . . . . . . 32 2 Proposed Model Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.1 Model Input . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 2.2 Content-Based Classification . . . . . . . . . . . . . . . . . . . . . . . 34 2.2.1 Text Classification Model . . . . . . . . . . . . . . . . . . . 34 2.2.2 Image Classification Model . . . . . . . . . . . . . . . . . . 36 2.2.3 General Information Classification Model . . . . . . . . . . 37 2.3 Decision Making . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 2.4 Global Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 IV Implementation and Results 43 1 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 1.1 Offline Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43 1.1.1 Textual-Content Data . . . . . . . . . . . . . . . . . . . . . 44 1.1.2 Image-Content Data . . . . . . . . . . . . . . . . . . . . . . 46 1.1.3 General Information Data . . . . . . . . . . . . . . . . . . . 48 1.2 Online Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49 1.2.1 Facebook Data . . . . . . . . . . . . . . . . . . . . . . . . . 49 1.2.2 Instagram Data . . . . . . . . . . . . . . . . . . . . . . . . . 51 1.2.3 Twitter Data . . . . . . . . . . . . . . . . . . . . . . . . . . . 52 2 Model Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53 2.1 Text Classification Model . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.1.1 NLP Process . . . . . . . . . . . . . . . . . . . . . . . . . . . 54 2.1.2 Data Vectorization . . . . . . . . . . . . . . . . . . . . . . . 54 2.1.3 Data Classification . . . . . . . . . . . . . . . . . . . . . . . 55 2.2 Image Classification Model . . . . . . . . . . . . . . . . . . . . . . . 56 2.2.1 Data Augmentation . . . . . . . . . . . . . . . . . . . . . . 56 2.2.2 Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . 57 2.3 General Information Classification Model . . . . . . . . . . . . . . . 58 vi
2.4 Proposed Model . . . 59
3 Results Interpretation . . . 61
V Conclusions and Perspectives 63
Bibliography 65
List of Figures
I.1 Unified User Profiling (UUP) system with cyber security perspective . . . 6
I.2 User Profiling Method in Authorization Logs . . . 7
I.3 Context-aware graph-based approach framework . . . 8
I.4 Forum user profiling approach framework . . . 9
I.5 Transfer-Learning CNN Framework . . . 12
I.6 Multidimensional Key Actor Detection Framework . . . 14
II.1 An example of morphemes extraction . . . 18
II.2 An example of syntax analysis . . . 19
II.3 An example of semantic network . . . 20
II.4 Curved Edge Filter . . . 24
III.1 Multi-dimensional Network . . . 34
III.2 Text Classification Model . . . 35
III.3 Image Classification Model . . . 37
III.4 General Information Classification Model . . . 38
III.5 Proposed Model . . . 40
III.6 Model Workflow . . . 41
IV.1 Twitter Searching Tool . . . 44
IV.2 Sample of news headlines . . . 45
IV.3 Word Cloud of our Textual Data . . . 46
IV.4 Sample of Terrorists images . . . 47
IV.5 Sample of Military/News images . . . 48
IV.6 Age Distribution . . . 48
IV.7 Relationship Distribution . . . 49
IV.8 Gender Distribution . . . 49
IV.9 Facebook Graph API . . . 50
IV.10 An example of data augmentation . . . 57
IV.11 Class Diagram . . . 60
List of Tables
I.1 Anomaly detection existing works comparison . . . 10
I.2 Content-based techniques comparison . . . 13
II.1 Comparison of word embedding methods . . . 22
IV.1 Textual-Content Dataset . . . 46
IV.2 Image-Content Dataset . . . 47
IV.3 Text Models Metric Scores . . . 56
IV.4 Image Models Metric Scores . . . 58
IV.5 General Information Models Metric Scores . . . 59
IV.6 Model Testing Results . . . 61
Acronyms
UUP Unified User Profiling
CERT Computer Emergency Response Team
NATOPS Naval Air Training and Operating Procedures Standardization
SVM Support Vector Machine
SMP Social Media Processing
M-SN Multiple Social Networks
M-IT Multiple Input Types
T-UBC Time-based User Behavior Changes
T-FBC Time-based Future Behavior's Changes
ISIS Islamic State of Iraq and Syria
URL Uniform Resource Locator
LSTM Long Short-Term Memory
CNN Convolutional Neural Network
API Application Program Interface
GTD Global Terrorism Database
GDELT Global Data on Events Location and Tone
SNA Social Network Analysis
NLP Natural Language Processing
FOPL First Order Predicate Logic
TF-IDF Term Frequency-Inverse Document Frequency
CBOW Continuous Bag Of Words
RGB Red, Green, Blue
START Study of Terrorism And Responses to Terrorism
PIRUS Profiles of Individual Radicalization In the United States
HTTP HyperText Transfer Protocol
RDF Resource Description Framework
REST REpresentational State Transfer
JSON JavaScript Object Notation
SDK Software Development Kit
RAM Random Access Memory
CPU Central Processing Unit
GPU Graphics Processing Unit
NLTK Natural Language ToolKit
DA Data Augmentation
TL Transfer Learning
VGG Visual Geometry Group
FB Facebook
IG Instagram
T Twitter
General Introduction

The rise of social networks has made communication and idea sharing easier than ever. Several of them, namely Facebook, Twitter and LinkedIn, have become among the most popular sources of information. Over the last decade, the number of people across the world using these websites has kept growing, exceeding a billion active users per day [1]. Most of these users are there to interact with their friends and family and to meet new people who share their interests. Others, such as business owners, use these platforms to communicate with their target audience, promote their brand or collect customer feedback.

Although this ease of communication can be used in a friendly way, some users exploit it for harmful purposes, such as bullies, spammers and hackers. One of the most dangerous categories is terrorist groups, who profit heavily from this advantage: inciting other people, promoting their groups and planning attacks has become very simple for them. Detecting these groups accurately and quickly has therefore become one of the most important tasks for social network operators. Several approaches and methods have been proposed to that end, such as manual monitoring and firewalls, but as the number of such individuals keeps increasing, accurate and fully automated approaches are required. Fortunately, the evolution of new technologies, especially the rise of machine learning, has made that task easier.

In this thesis, we propose a model that learns the characteristics that describe a terrorist individual. Additionally, the model learns by itself the new characteristics that define terrorist behavior, since what our socioculture considers abnormal changes over time.

The first chapter presents some existing works that deal with anomaly detection in general and terrorism detection in particular, to give the reader a general idea of the research carried out in this domain.
The second chapter presents the existing machine learning techniques needed to implement our proposed model.

The third chapter introduces the basis of our proposed model from a theoretical perspective, leading to the model's design.

The fourth chapter presents the practical part of our work, where we walk through the pipeline of our model's implementation and discuss the results.

Finally, we end with a general conclusion and perspectives.
Chapter I. State of the art

This chapter presents an overview of existing works that deal with anomaly detection in general and terrorism detection in particular. We begin by defining the concept of anomaly and pointing out the importance of its application in the social media area. We then present an overview of applied anomaly detection and terrorist detection works, categorized by their input format. The purpose of this chapter is to give the reader a general idea of the research carried out in the detection of anomalies and terrorism.

Introduction

Social media's main objective is to provide a platform for people to communicate and share their thoughts. Although most users use it in a friendly way, many others exploit this ease of communication to plan attacks or incite others to adopt extremist behaviors. It is therefore extremely important to be able to detect these users accurately and quickly. Such users are often referred to as anomalies due to their abnormal behavior.
Abnormal behaviors are behaviors that differ from, or follow an unusual pattern compared to, what is defined as normal sociocultural behavior.

Our main objective in this research is to study the characteristics that describe an anomalous individual. In social media, an anomalous user will usually hide his anomalousness; time is therefore important, since we will be looking for peaks and deviations from his/her usual behavior pattern. Moreover, what is considered abnormal in today's socioculture may become normal after a period of time, so we should take the behavior's evolution into account when defining abnormal behavior.

Different models and approaches have been proposed to solve this problem. Based on their input format, we can categorize them into activity-based detection, where the input data is the user's activity, and graph-based detection, where the input data is a graph of multiple users.

However, the anomaly itself is too abstract a term, which motivated us to work on only one concrete type of anomaly: terrorism. To consider an individual a terrorist, we first have to define what a terrorist is, since there is no universal agreement on the definition [2]. Facebook, in their definition of dangerous individuals and organizations, defined terrorism as follows:

Terrorism: Any nongovernmental organization that engages in premeditated acts of violence against persons or property to intimidate a civilian population, government or international organization in order to achieve a political, religious or ideological aim. [3]

Since we are working with social media, we decided to adopt that definition.

In the following sections, we start by presenting the existing activity-based and graph-based anomaly detection proposals, then we focus on the terrorist detection works.

1 Anomaly Detection in Social Media

This section presents the existing models and approaches for anomaly detection in social media, categorized by their input format. We examine whether the latest proposals can identify future changes in abnormal behavior, track a user's behavior changes over time, and support multiple social networks.
1.1 Activity-based Detection

Activity-based detection approaches consider users to be largely independent from each other. An individual is defined by his/her own activities, which determine whether his/her behavior is abnormal.

In [4], the authors presented a survey of the available user profiling methods for anomaly detection, then proposed their own anomaly detection model. They showed the advantages and disadvantages of each model from a cybersecurity perspective: some models used operating system logs and web browser history as data sources, while others focused on social networks such as Twitter and Facebook. Their analysis revealed that the models based on history and logs were more limited and less consistent, since one cannot really know whether a single user is the only one using that operating system or web browser, while the social-network-based models were more consistent, because they rely on private accounts and include users' interactions with each other, which leads to better results. Based on the data sources of the other methods, they defined a user profile representation as a vector of 7 main feature categories:
• Users' interests features
• Knowledge and skills features
• Demographic information features
• Intention features
• Online and offline behaviour features
• Social media activity features
• Network traffic features
Each feature category contains features and sub-grouped features, which finally amounts to more than 270 features that are mostly security-related. Their proposed model, called "Unified User Profiling" (Fig. I.1), mainly collects the data from
the different sources, then cleans and parses it to obtain structured data, finally producing a user profile vector that an administrator can monitor across different categories to detect anomalies based on user activity. While their model is mostly complete in terms of features and considers different social networks, it is still limited in that it does not detect anomalies automatically.

Figure I.1: Unified User Profiling (UUP) system with cyber security perspective

In [5], the writers proposed a pattern recognition method that, given a user profile vector, takes the user's daily activity and creates a time-series pattern for that user for each activity he/she performs (Fig. I.2). Each time the user is involved in an activity, the new behaviour is compared to his/her behavioral pattern for that activity. If a deviation from the normal behavior occurs, it is flagged as suspicious; since a minor deviation does not always indicate suspicion, the activity is also compared to a behavioral model of all system users so that false alarms are kept to a minimum. Their model is a random forest trained on the CERT dataset along with a private dataset acquired from NextLabs, and achieved over 97% accuracy. This method showed great results for insider threat detection, which can be considered a single social network, so it is still limited: it does not support multiple social networks and cannot automatically learn future abnormal behaviors over time.

Figure I.2: User Profiling Method in Authorization Logs
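To make the idea of flagging deviations from a per-activity behavioral pattern concrete, here is a minimal sketch. It is not the random forest of [5]; the activity (daily login counts) and the z-score rule are our own illustrative assumptions:

```python
import numpy as np

def is_suspicious(history, new_value, z_threshold=3.0):
    """Flag a new activity measurement that deviates from a user's
    historical pattern for that activity (e.g. daily login count)."""
    mu = history.mean()
    sigma = history.std()
    if sigma == 0:                   # constant history: any change is a deviation
        return new_value != mu
    z = abs(new_value - mu) / sigma  # how many standard deviations away
    return z > z_threshold

# 30 days of login counts for one user, then a spike on day 31.
logins = np.array([4, 5, 3, 4, 6, 5, 4, 5, 4, 3] * 3)
print(is_suspicious(logins, 42))  # True: far outside the usual pattern
print(is_suspicious(logins, 5))   # False: within the usual pattern
```

In [5], a suspicious flag is additionally compared against a behavioral model of all system users before an alarm is raised, a step this sketch omits.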
1.2 Graph-based Detection

Graph-based detection approaches take user interactivity into account by analyzing a snapshot of a network. Each user can have relations with other users, such as mentions, shares and likes. There are two approaches: static and dynamic. In static graph-based detection, the analysis is done on a single snapshot of the network, while in dynamic graph-based detection, the analysis is time-based, over a series of snapshots.

In [6], the writers proposed an anomaly detection framework in which, at each timestamp t, each user within a network has an activity score and a mutual score with other users. The scores are based on the user's activities and the interactions with other users around these activities. A mutual agreement matrix is then produced to represent those scores, with the users' activity scores on the matrix diagonal. The users' scores are passed through an anomaly scoring function that the authors proposed and thresholded to decide whether a user is anomalous (Fig. I.3). As data sources, they used the "CMU-CERT Insider Threat Dataset" and the "NATOPS Gesture Dataset", and compared the results of their framework to other known models. Their model was by far the best: it reached an area-under-curve score of around 0.95, while other models such as SVM and clustering
were around 0.89. Although the framework far exceeded the expected results for detecting insider threats and supports over-time behavior changes, it is still limited: it does not consider different input data types, such as images and texts, and does not analyze multiple networks simultaneously.

Figure I.3: Context-aware graph-based approach framework

In [7], the authors proposed a user profiling approach based on user behavior features and social network connection features (Fig. I.4). The first set (user behavior features) is the foundation of the user representation and is composed of post content statistics, post content semantics and user behavior statistics. The social network connection features lead to the construction of a network of similar users that share a similar network representation. The experimental results showed that using the network connections improved the model's overall score. Their approach reached second place among around 900 participants in the SMP 2017 User Profiling Competition. This work showed that using graphs and considering user interactivity is an improvement toward grouping individuals and thus detecting anomalous communities. The limitation of this work is that it cannot detect a category's future behavior changes.
Figure I.4: Forum user profiling approach framework

1.3 Summary

Within the scope of our research on anomaly detection in social media, we studied different papers. Table I.1 presents the advantages and limitations of those papers in terms of their support of multiple social networks (M-SN), support of multiple input data types such as text and images (M-IT), support of over-time user behavior changes (T-UBC) and their ability to learn future new abnormal behavior changes (T-FBC).
Paper | Description | Input Format | M-SN | M-IT | T-UBC | T-FBC
Lashakry et al., 2019 [4] | Proposed model for user profile creation to monitor users | User's activity | | | |
Zamanian et al., 2019 [5] | Proposed model for user activity pattern recognition with random forest | User's activity | | | |
Bhattacharjee et al., 2017 [6] | Proposed a probabilistic anomaly classifier model | Graph of users | | | |
Chen et al., 2018 [7] | Proposed a user profiling framework that can be used to detect anomalous users | Graph of users | | | |

Table I.1: Anomaly detection existing works comparison

None of the mentioned works considered all the mentioned functionalities together. Therefore, we decided to work on a model that supports all of those features. To facilitate that, we adopted a hybrid architecture where the input format is graph-based, to include user interactivity and ease the detection of communities, while also focusing on the user's activity to address our main problem of identifying the characteristics that describe an anomalous individual.

2 Terrorist Detection in Social Media

Since we decided on a hybrid architecture with both graph-input and activity-based detection, we identified the existing terrorist detection works that focus on users' social media content, as well as other works that treat a graph as an input. In this section, we present those papers to get a better overview of how to solve our problem.
2.1 Existing Content-based Models

In this section, we focus on models that treat the content of the activities an individual engages in on social media. These serve as a proof of concept for our implementation.

In [8], the writers implemented a model that detects extremists in social media based on information related to usernames, profiles and textual content. They built their dataset from Twitter by searching for hashtags related to extremism, which resulted in around 1.5M tweets. From these they extracted 150 ISIS-related accounts that had posted those tweets and had been reported to the Twitter Safety account (@TwitterSafety) by normal users, plus 150 normal users to obtain a balanced dataset, along with 3k unlabeled examples. They then categorized the features into 3 major groups:
• Twitter handle (username) related features: length, number of unique characters and Kolmogorov complexity of the username.
• Profile related features: 7 features related to the user's profile, such as the profile description, the number of followers and the location.
• Content related features: the number of URLs, the number of hashtags and the sentiment of the content.
Based on this dataset, they tried to answer two research questions:
• Are extremists on Twitter inclined to adopt similar handles?
• Can we infer the labels (extremist vs. non-extremist) of unseen handles based on their proximity to the labeled instances?
After their experiments with different supervised and semi-supervised approaches, both questions had a positive answer. SVM had the best precision score with 0.96, which shows the significance of the proposed feature set, but char-LSTM had the best precision-recall score with 0.76, minimizing the number of false negatives.
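As an illustration of the handle-related features above, the following sketch computes the length, the number of unique characters, and a compression-based proxy for Kolmogorov complexity. The exact complexity estimator used in [8] is not specified here, so the zlib-based approximation is our assumption:

```python
import zlib

def handle_features(handle: str) -> dict:
    """Simple features over a Twitter handle, in the spirit of [8]."""
    data = handle.encode("utf-8")
    compressed = zlib.compress(data)
    return {
        "length": len(handle),
        "unique_chars": len(set(handle)),
        # Ratio of compressed to raw size: closer to 1.0 means the
        # string is harder to compress, i.e. "more random".
        "kolmogorov_proxy": len(compressed) / max(len(data), 1),
    }

print(handle_features("abcabcabcabc"))  # repetitive, compresses well
print(handle_features("x9_Qz7Lk"))      # short and irregular
```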
This work presented different ways of collecting the necessary data for an extremist detection task. It also showed that using different input data types from social media can help detect extremists. The limitation of this model is that it does not support over-time changes in user behavior and cannot learn future extremist behaviors.

In [9], the authors presented a convolutional neural network (CNN) to detect suspicious e-crimes and terrorist involvement by classifying social media image contents. They used three different kinds of datasets, of which we are only interested in the terrorism images dataset. Based on the transfer learning technique, they took the CNN architecture of the ImageNet model [10] and reduced its network size by lowering the kernel size of each layer, obtaining a new, smaller network (Fig. I.5). In their results, this architecture outperformed the default ImageNet model by around 1% in mean average precision and took half of its execution time. This paper showed that detecting terrorists based on their social media image contents is possible, along with the advantage of using transfer learning rather than building a CNN from scratch. However, their model supports only one type of data: images.

Figure I.5: Transfer-Learning CNN Framework

Table I.2 presents the content-based models that we analyzed, with their advantages and limitations.
Paper | Description | Advantages | Limits
Alvari et al., 2019 [8] | (Semi-)supervised extremist detection model based on the user's general information and textual-content data | Proof of concept of detection based on textual content and general information; supports multiple input data types | Cannot support multiple social networks; cannot detect whether a user is adopting new behaviors over time; cannot learn future behavior changes
Chitrakar et al., 2016 [9] | Image classification model using CNN and transfer learning | Proof of concept of image-content-based detection; highlighted a model improvement technique: transfer learning | Cannot support multiple input data types; cannot learn future behavior changes

Table I.2: Content-based techniques comparison

2.2 Existing Graph-input Analysis

In this section, we study the existing works that treat a graph as an input for the terrorist detection in social media problem.

In [11], the authors proposed a framework that treats a multidimensional network as an input for the identification of terrorist network key actors. The dimensions represent the types of relationships or interactions in a social medium. The workflow of their framework starts by building a multidimensional network through a keyword-based search on a social media platform; that network is then mapped to a single-layer network using certain mapping functions. To detect the key actors, they use several centrality measures
such as degree centrality and betweenness centrality. The output of the framework is a ranked list of the key actors within the network. The framework's effectiveness was evaluated on a ground-truth dataset of 16 months of Twitter data. Fig. I.6 presents the workflow of this framework. This work demonstrated the use of multidimensional networks and how we can analyze them to detect a terrorist network's key actors. Their use of multiple dimensions could be more efficient if they considered multiple social media platforms instead of multiple relationship and interaction types.

Figure I.6: Multidimensional Key Actor Detection Framework

In [12], the writers created a survey on social network analysis for counter-terrorism in which they covered the data collection methods and the different types of analysis. The two sources of data are online social networks and offline social networks. The online social networks are the social media websites that allow users to interact with each other by sending messages and posting information, such as Facebook, Twitter and YouTube, from which we collect the data using their APIs. On the other hand, offline social networks are real-life social networks based on relations such as financial transactions, locations and events; these are the public databases such as the Global Terrorism Database (GTD) [13] and the Global Data on Events Location and Tone (GDELT) [14].

Furthermore, they analyzed the different centrality measures that indicate the importance and position of a node in a network (a short computation sketch follows the list):
• Degree Centrality: A node with a higher degree value is often considered an active actor in a network. The degree value is the number of connections linked to a node. [15]
• Closeness Centrality: A node with a higher closeness value can quickly access other nodes in a network. The closeness value measures how fast a node can reach other nodes. [15]
• Betweenness Centrality: A node with a higher betweenness value is often considered an influencer in a network. The betweenness value is the number of shortest paths between any pair of nodes that pass through a node; such a node acts as a bridge between communities in a network. [15]
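These three measures are available out of the box in graph libraries. A minimal sketch with networkx follows; the edge list is a toy example, not data from [12]:

```python
import networkx as nx

# Toy interaction graph: an edge means two accounts interacted.
G = nx.Graph()
G.add_edges_from([
    ("a", "b"), ("a", "c"), ("a", "d"),  # "a" is highly connected
    ("d", "e"), ("e", "f"),              # "d"/"e" bridge a second community
])

degree      = nx.degree_centrality(G)       # share of nodes each node touches
closeness   = nx.closeness_centrality(G)    # inverse average distance to others
betweenness = nx.betweenness_centrality(G)  # share of shortest paths through a node

# Rank candidate key actors by betweenness, i.e. bridges between communities.
for node, score in sorted(betweenness.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```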
Finally, they compared some SNA tools based on functionality, platform, license type and supported file formats. They concluded that, when doing social network analysis, the main challenge is the data itself: user privacy is a very sensitive issue, and data often tends to be incomplete, with many missing and fake nodes and relations, which frequently leads to incorrect analysis results. This survey provided us with the different data collection methods as well as the graph analysis methodologies.

Conclusion

In this chapter, we presented existing works that have dealt with anomaly detection in general and terrorist detection in particular, using different approaches. To the best of our analysis, the existing methods did not deal with terrorism in multidimensional graphs while combining different types of classification in a time-based way. This motivated us to provide a terrorism detection model for multidimensional graphs that supports different types of input data and can also detect over-time behavior changes.

In the next chapter, we survey the existing techniques needed to implement our proposed model.
Chapter II. Existing Techniques

This chapter presents the techniques necessary to implement our proposed model. We begin by presenting the different input data types that we consider and the techniques used to analyze each type. Then, we present the classification models to use and how they work.

Introduction

Each social network carries ample input data that can be shared on it; identifying these data types and choosing which ones to work with is an important task toward achieving our goal. In our previous analysis of the different existing proposals, the authors of [4] identified nearly 270 security-related anomaly detection features, some of which were social media activity features. We analyzed those features and, based on [8, 9], grouped them into three data type categories: textual-content data, image-content data, and numerical-content data. Different classification models exist to classify an individual based on these content data.

In the next sections, we begin with an overview of the identified input data types and their analysis approaches, then we present the different classification models.
1 Data Types

In this section, we briefly introduce each type of data along with the chosen approach for its analysis and classification.

1.1 Textual-Content Data

Textual-content data consists mainly of characters that are part of a certain language and can be read by a human being. We begin by presenting the chosen text analysis approach, then we decide on a data representation technique to transform the text into numerical input.

1.1.1 Text Analysis

The most commonly used technique in text analysis is Text Mining.

Text Mining is the process of extracting high-quality information from textual data, where the information may be patterns or matching structures in the text, without considering its semantics. Its outcomes are mostly statistical information such as the frequency and correlation of words. [16]

In the terrorism detection domain, we are interested in knowing what the user is trying to incite with a post and whether it is serious, sarcasm, or the reporting of a news item. To differentiate these, we need to go through semantic analysis rather than treating words as mere objects. One of the most important text-mining processing methodologies that also considers the semantics of words is Natural Language Processing.

Natural Language Processing is the process of making the computer understand the language spoken by humans, along with the semantics and sentiments it conveys, through analyses such as morphological, syntactic and semantic analysis [16].

The first step in NLP is morphology processing, which involves analyzing the structure of words by studying their construction from primitive meaningful units called
morphemes. This will help us divide the words/phrases of a document into tokens that will be used in later analyses.

Morphemes are the smallest units with a meaning in a word. There are two types of morphemes, namely stems and affixes, where the stem is the base or root of a word and an affix can be a prefix, an infix or a suffix. Affixes never appear in isolation; they are always combined with a stem. In the example of Fig. II.1, we can see how a word is split into a stem, which carries the main meaning of the word, and some affixes.

Figure II.1: An example of morphemes extraction

Tokens are words, keywords, phrases or symbols that form a useful semantic unit for processing. We refer to their extraction process as Tokenization. A token is mainly composed of a lemma + a part-of-speech tag + grammatical features. Example:
• plays → play (lemma) + Noun (part-of-speech tag) + plural (grammatical feature)
• plays → play (lemma) + Verb (part-of-speech tag) + singular (grammatical feature)

After studying the structure of the words, we have to examine their arrangement and combination in a sentence using syntax analysis. In a sentence, word arrangement follows the precise rules of the language's grammar. Taking the example sentence Three people were killed in an incident today and following an English grammar parser, we end up with the example of Fig. II.2, which contains grammatical groups such as S for sentence, NP for noun phrase, VP for verb phrase, NN for singular nouns and NNS for plural nouns.

Figure II.2: An example of syntax analysis
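A minimal sketch of the tokenization, part-of-speech tagging and lemmatization steps with NLTK, applied to the example sentence above (the listed NLTK resources must be downloaded once):

```python
import nltk
from nltk.stem import WordNetLemmatizer

# One-time downloads:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

sentence = "Three people were killed in an incident today"
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)
print(tagged)  # e.g. [('Three', 'CD'), ('people', 'NNS'), ..., ('killed', 'VBN'), ...]

# Lemmatization recovers the stem-like base form, disambiguated by POS.
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("killed", pos="v"))  # 'kill'
print(lemmatizer.lemmatize("plays", pos="n"))   # 'play' (noun reading)
print(lemmatizer.lemmatize("plays", pos="v"))   # 'play' (verb reading)
```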
The syntax analysis enables the machine to understand the relationships between the words and their different references.

After structuring the words and studying their relationships, it is time for the machine to understand the meaning of the words and phrases along with the context of the document. Focusing on the relationships between words and elements such as synonyms, antonyms and hyponyms (hierarchical order of meaning), the semantic system is able to build blocks composed of:
• Entities: Individuals or instances.
• Concepts: Categories of individuals, or classes.
• Relations: Relationships between entities and concepts.
• Predicates: Verb structures or semantic roles.
These can be represented through methods such as first order predicate logic (FOPL), semantic networks and conceptual dependency. Fig. II.3 illustrates an example of a semantic network using our previous example sentence, Three people were killed in an incident today.
Figure II.3: An example of semantic network

Based on these semantics, the machine can now learn the meaning of the words and of the text; from this point, it is possible to learn the meaning of the user's textual data.

1.1.2 Data Representation

After the text analysis stage, our machine can understand the meaning of the textual content data. But in order to build a classifier that will automatically categorize current and future data, the data must be numerical, so that mathematical rules can be applied, while also preserving its semantics.

Word embedding is one of the most popular representations of textual data. It transforms a word in a document into a vector of numerical features, where close vectors mostly mean that the words share the same meaning or appear in the same context, so the data does not lose its semantics. According to our research, the most used word embedding techniques are Word2Vec and Term Frequency-Inverse Document Frequency (TF-IDF).

Word2Vec uses two different approaches, namely Continuous Bag Of Words (CBOW) and Skip-Gram; both are based on neural networks that take a context as input and use back-propagation to learn [17]. Mathematically, Word2Vec tries to maximize the probability of the next word w_t given the previous context h.
Thus, the probability P(w_t | h) is given in Equation II.1, where score(w_t, h) computes the compatibility of w_t with the context h and softmax is the standard softmax function.

P(w_t | h) = softmax(score(w_t, h))    (II.1)

CBOW learns the embedding of a word by predicting it from the surrounding words, which are considered the context. Skip-Gram learns the embedding of a word by considering the current word as the context and predicting the surrounding words. According to [17], Skip-Gram is able to work with less data and represents rare words better, while CBOW is faster and represents frequent words more clearly.

TF-IDF represents words with weights. These weights are the product of the term frequency and the inverse document frequency. In simpler terms, words that occur frequently throughout the document should be given very little weight or significance. For example, in English, such terms include: the, or, and and; they do not provide much value. However, if a word appears rarely, or appears frequently but only in one or two places, it is identified as a more important word and should be weighted accordingly [18].

Term Frequency (TF) is the relative frequency of a term t in a document d. As illustrated in Equation II.2, we calculate the term frequency by dividing the number of times a term t appears in a document d by the total number of words in the document d.

tf_{t,d} = n_{t,d} / \sum_{term} n_{term,d}    (II.2)

n_{t,d}: the number of occurrences of term t in document d.
\sum_{term} n_{term,d}: the sum of occurrences of all terms appearing in document d, i.e. the total number of words in document d.
Method | Advantages | Disadvantages
Word2Vec | Optimized memory usage; fast execution time | Contains a lot of noisy data; does not work well with ambiguity
TF-IDF | The vocabulary is built with words that identify the category; extracts relevant information | High memory usage; the closest words are not similar in meaning but in the category of the document's context

Table II.1: Comparison of word embedding methods

Inverse Document Frequency (IDF) ranks a term t by its relevance within the document collection. Equation II.3 shows the formula: we take the total number of documents N and divide it by df_t, the number of documents that contain the term t.

idf(t) = log_e(N / df_t)    (II.3)

Finally, the weight w_{t,d} of a word t in a document d under TF-IDF is obtained, as shown in Equation II.4, by multiplying tf_{t,d} by idf(t).

w_{t,d} = tf_{t,d} × idf(t)    (II.4)

As found in our review of existing research, such as [18], Word2Vec performs better in terms of memory, execution time and embedding quality for words similar in context and meaning, while TF-IDF performs better at identifying the words that determine a document's category; in other words, it detects the keywords that identify a category of documents. Table II.1 summarizes the advantages and disadvantages of each method.
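Equations II.2-II.4 can be implemented directly. The sketch below does so on a toy corpus, and also shows how gensim exposes the CBOW/Skip-Gram choice; the corpus is purely illustrative, and gensim's parameter names have changed across versions (e.g. vector_size in gensim 4):

```python
import math

docs = [
    ["attack", "planned", "for", "today"],
    ["news", "report", "on", "the", "attack"],
    ["weather", "report", "for", "today"],
]

def tf(term, doc):           # Equation II.2
    return doc.count(term) / len(doc)

def idf(term, docs):         # Equation II.3 (natural log)
    df = sum(1 for d in docs if term in d)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):  # Equation II.4
    return tf(term, doc) * idf(term, docs)

print(tfidf("attack", docs[0], docs))   # appears in 2 of 3 docs: lower weight
print(tfidf("planned", docs[0], docs))  # appears in 1 of 3 docs: higher weight

# Word2Vec with gensim; sg=1 selects Skip-Gram, sg=0 (the default) selects CBOW.
from gensim.models import Word2Vec
w2v = Word2Vec(sentences=docs, vector_size=50, window=2, min_count=1, sg=1)
print(w2v.wv["attack"][:5])             # first 5 dimensions of the embedding
```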
1.2 Image-Content Data

This type of data is anything that is a visual representation of something. Different approaches are available for image processing, but as determined in [10], the convolutional neural network is by far the best-performing method for image classification in terms of precision and execution time.

A Convolutional Neural Network is a deep learning algorithm and an extension of the neural network, distinguished from other methods by its ability to account for spatial structure and translation invariance: regardless of where an object is located in an image, it is still recognized as the same object [19]. Accepting a multidimensional input, unlike regular neural networks that take a vector as input, makes it perform better on image data, since images usually have three color channels (RGB), forming a three-dimensional matrix. Taking the example of a 32×32 image with 3 color channels, a regular neural network would need 32×32×3 = 3072 weights per neuron; for a 512×512 image, that becomes 512×512×3 = 786432 weights. This results in huge computations as well as over-fitting from having too much information and detail. [20]

A simple CNN is a sequence of layers: convolutional layers, pooling layers and a fully-connected layer. In a typical CNN, several rounds of convolution/pooling are applied before proceeding to the fully-connected layer.

1.2.1 CNN: Convolutional Layer

Each convolutional layer of the network has a set of feature maps that can recognize increasingly complex patterns/shapes in a hierarchical manner. Instead of regular matrix multiplications, the convolutional layer uses convolution calculations. To do that, it constructs filters and applies calculations with them, using optimization techniques such as striding and padding.

Filters are used to detect patterns in an image; they also offer weight sharing. For example, a filter which detects a curved edge (Fig. II.4) may match the top-left corner of an image but may also match the bottom-right corner of the image if both corners have curved edges.
Figure II.4: Curved Edge Filter

Calculations are matrix multiplications used to apply a filter to an input image. Let us consider the following 5×5 input, 3×3 filter and 3×3 output:

Input:          Filter:     Output:
0 0 1 1 0       1 1 0       ? ? ?
1 1 3 1 2       0 0 1       ? ? ?
1 0 1 4 2       1 0 0       ? ? ?
0 2 2 1 0
3 4 1 0 0

To get the value of the first '?', we apply the filter to the first 3×3 block of pixels:
? = (0 × 1) + (0 × 1) + (1 × 0) + (1 × 0) + (1 × 0) + (3 × 1) + (1 × 1) + (0 × 0) + (1 × 0) = 4.
Then we continue: the value next to '?' is the value of the second 3×3 block of pixels, in which '3' is the center. This means we moved 1 pixel to the right.
Applying the filter to that second block gives:
? = (0 × 1) + (1 × 1) + (1 × 0) + (1 × 0) + (3 × 0) + (1 × 1) + (0 × 1) + (1 × 0) + (4 × 0) = 2.
And so on for the remaining positions.

Striding is a parameter specifying how many pixels we move to calculate the next value. It is mainly used to reduce computation, as neighboring values are likely to be similar. In our last example the stride was 1: we moved the window by only 1 pixel to get the next value. Usually a value of 2 or 3 is used, since in most cases a shift of 2-3 pixels is enough to produce a variation or a change of pattern.

Padding is used to prevent information loss. In our example, when applying the filter, the values of the first/last rows and the first/last columns were never the center of a 3×3 block. To fix that, we add zero padding, i.e. new rows/columns filled with 0:

0 0 1 1 0        0 0 0 0 0 0 0
1 1 3 1 2        0 0 0 1 1 0 0
1 0 1 4 2   ⇒    0 1 1 3 1 2 0
0 2 2 1 0        0 1 0 1 4 2 0
3 4 1 0 0        0 0 2 2 1 0 0
                 0 3 4 1 0 0 0
                 0 0 0 0 0 0 0

1.2.2 CNN: Pooling Layer

The pooling layer is used to determine which information is critical and which constitutes irrelevant detail. There are several types of pooling layers, such as the max pooling layer and the average pooling layer. With max pooling, we look at a neighborhood of pixels and keep only the maximum value. Consider a 2×2 max pooling with a stride of 2:

1 0 0 1
3 2 0 2   ⇒   3 2
0 0 4 2       4 4
4 1 0 1

For each 2×2 block we took the maximum value, and each time we moved by two pixels (the stride) to get the next 2×2 block.
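The convolution, striding, padding and max-pooling steps above can be reproduced in a few lines of NumPy; this is a didactic sketch that matches the worked numbers, not an efficient implementation:

```python
import numpy as np

def conv2d(image, kernel, stride=1, padding=0):
    """Valid convolution (element-wise multiply and sum) with stride/padding."""
    if padding:
        image = np.pad(image, padding)  # zero padding on all sides
    kh, kw = kernel.shape
    h = (image.shape[0] - kh) // stride + 1
    w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

def max_pool(image, size=2, stride=2):
    """Keep only the maximum of each size x size neighborhood."""
    h = (image.shape[0] - size) // stride + 1
    w = (image.shape[1] - size) // stride + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = image[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

img = np.array([[0,0,1,1,0],[1,1,3,1,2],[1,0,1,4,2],[0,2,2,1,0],[3,4,1,0,0]])
flt = np.array([[1,1,0],[0,0,1],[1,0,0]])
print(conv2d(img, flt))       # top-left value is 4, the next one is 2, as above
pool_in = np.array([[1,0,0,1],[3,2,0,2],[0,0,4,2],[4,1,0,1]])
print(max_pool(pool_in))      # [[3. 2.] [4. 4.]]
```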
1.2.3 CNN: Fully-Connected Layer

A fully-connected layer is a layer in which all inputs are connected to all outputs. In a CNN it is used to finally determine the class assigned to the main input. Before proceeding to the fully-connected layer, we have to use a technique called flattening to generate the vector this layer requires. Flattening:
• Each 2D matrix of pixels is turned into one column of pixels.
• Each of our 2D matrices is placed on top of another.

1.3 Numerical-Content Data

Numerical-content data is data based on numbers that can be statistically interpreted. This type of data requires no pre-processing and can thus be fitted directly into a model. The models for this type of data are mostly the general statistical machine learning models that we present later.

2 Data Classification in Machine Learning

Machine learning is a subset of the artificial intelligence domain that makes the machine able to gain knowledge automatically from experience without being explicitly programmed. Following statistical and mathematical concepts, it looks for patterns in the data we provide, learns them and makes better decisions in the future. [21] Several learning methods exist in machine learning:
• Supervised Learning: Given a sample of data and the desired output, the machine should learn a function that maps the inputs to the outputs.
• Unsupervised Learning: Given a sample of data without the output, the machine should learn a function that categorizes these samples based on learned patterns.
• Semi-Supervised Learning: Given a small amount of data with the desired output (labeled data) and other data without output (unlabeled data), the machine should learn a function that can label the unlabeled data using the knowledge learned from the labeled data.
• Reinforcement Learning: Given a sample of data, certain actions and rewards related to the actions, the machine should learn a function that finds the optimal actions toward achieving maximum rewards.

Classification is a part of supervised learning in which the machine categorizes newly observed data based on the patterns of each category learned from the training data. In the following sections, we present the most common classification algorithms.

2.1 Support Vector Machines

A support vector machine model is a representation of the data in a space. Examples of the same category are close to each other, and the groups of examples of different categories are separated by a clear gap that is as wide as possible. New observed examples are then predicted to belong to a category based on the side of the gap on which they fall. [22]

2.2 Logistic Regression

Logistic regression is a statistical model that analyzes data in which at least one feature can determine the outcome. Using a logistic function, it models a binary output measured with a dichotomous variable. Since the output is binary, it can only be used for binary classification problems. To use it for a multi-class problem, N logistic regression models should be trained, where N is the number of classes, each model being trained on one class in a one-vs-all fashion. [23]
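A minimal sketch of these two classifiers with scikit-learn, including the one-vs-all construction for a multi-class logistic regression; the generated toy data is purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=10, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# SVM: finds the widest separating gap between categories.
svm = SVC(kernel="linear").fit(X_train, y_train)

# N binary logistic regressions, one per class (one-vs-all).
ovr_logreg = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_train, y_train)

print("SVM accuracy:   ", svm.score(X_test, y_test))
print("LogReg accuracy:", ovr_logreg.score(X_test, y_test))
```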
2.3 Neural Networks

A neural network is a network with multiple layers of perceptrons. A perceptron is the elementary unit of an artificial neural network, introduced as a model of biological neurons in 1959 [24]. The output of each perceptron in a layer is connected as an input to each perceptron of the next layer, which is known as a fully connected layer. A neural network must have an input layer, an output layer and, in between, a hidden layer. Any neural network with more than one hidden layer is considered a deep neural network. [20]

Conclusion

In this chapter, we studied the existing techniques needed to perform classification on textual-content data, image-content data and numerical-content data. In the next chapter, we detail the basis of our proposed model.
Chapter III. Proposed Model

This chapter introduces a novel time-based terrorism detection model that works with multidimensional networks and different types of input data. The output of our model is the set of nodes that belong to terrorist regions of a graph, across the dimensions of the multidimensional network. To identify this type of node, we first have to determine what the terrorist regions are and how to create them. Then, we examine the network to estimate a terrorism score for each node in a dynamic way, in order to detect over-time behavior changes. First, we introduce the purpose of the model along with the proposed research questions, then we present the sources of data. After that, we present in detail the theoretical approach toward constructing our model, and we finish with a conclusion.

Introduction

Nowadays, social networks provide many types of data that can be used, such as images, texts and videos, but most of the existing models work on a specific type of data from a specific social network. Our proposed model tries to overcome this limitation by accepting a multidimensional network as input, in order to be able to use data from multiple social networks at the
same time, while supporting different input data types. In addition, the model considers the evolution of an individual's behavior over time to detect deviations from the usual behavior pattern. Furthermore, the model adapts itself to the behavior's evolution so as to stay up to date with new abnormal behaviors.

Before describing the basis of the model's construction, it is first necessary to present the research questions that will be used as a metric to track the accuracy of our proposed model in solving the main research problem of this thesis, which is the study of the characteristics that describe a terrorist on different social media platforms. The research questions are as follows:

Q1: Can we identify the behavior of a terrorist based on his/her social media content?
Q2: Can machine learning help automatically detect whether a user is adopting terrorist behavior over time?
Q3: Do terrorists adopt the same behavior on different social networks?

In order to answer these research questions, we pass through several phases:
• Phase 1: Identifying the available data sources
• Phase 2: Determining the appropriate classification approach
• Phase 3: Estimating the terrorism score calculation

First, we collect the necessary data for each user. Then, we create a multidimensional network where each dimension represents a social network. Once the network is ready, it is used as input to our model, where each feature from each social network is mapped to its respective sub-model. Finally, a decision score is calculated. If a node is detected as a terrorist, the model is re-trained with those new inputs to stay up to date with the newest (unseen) terrorist behaviors; if the model loses accuracy after the update, it is reverted to the last version. Additionally, each node is passed through the model each time it is involved in a new activity; this way, a node can also be flagged as terrorist once the user adopts terrorist behavior over time.
1 Data Collection

As part of phase 1, the data sources for the different data types should be identified. As presented in the last chapter, there are three types of data:
• Textual-Content Data: posts, comments, image captions, text in an image, etc.
• Image-Content Data: posted photos, profile picture, etc.
• Numerical-Content Data: age, number of friends, average posts per day, etc.

Other information also exists in social media, such as the username, gender and relationship status. Therefore, instead of keeping the numerical-content data category, we opted for a category named general information data, which contains the existing numerical-content data in addition to the user's information data. We present next the data sources for the different content types. As mentioned in [12], the data sources can be categorized into two groups: offline data sources and online data sources.

In this section, we provide the sources of both the offline and online data used to retrieve our target data types for model training and later prediction. As a strategy for training the model and precisely distinguishing terrorism from other similar data, we decided to use terrorist contents as positive labels against military and news contents as negative labels; since these types of content are related, training them against each other makes the model more precise.

1.1 Offline Data Sources

Offline data is the data used for model training, gathered from public terrorism datasets. For each input type, we used a different dataset; all of them define terrorism from the American point of view.

For the textual-content data, inspired by [8], we use the Twitter API to gather tweets containing terrorism-related hashtags and tweets from terrorist accounts that were reported to Twitter's safety account (@TwitterSafety), ensuring that they are not anti-terrorist
accounts. With that, we create our offline textual-content dataset, where we use those tweets as positive labels against terrorism news tweets and news headlines gathered from other public datasets, such as the Global Terrorism Database (GTD) [13], as negative labels. We also use the Google Translate API, since some accounts may publish tweets in different languages.

For the image-content data, we did not find a public terrorism-related image dataset within the scope of our research. We decided to use manual web scraping with Google Images as our data source: we manually gather images of terrorist individuals and images inciting terrorism, which are our positive labels, and contrast them against military and terrorism news images, which are our negative labels.

For the general information data, the Study of Terrorism And Responses to Terrorism (START) published a database called Profiles of Individual Radicalization In the United States (PIRUS) [25], which contains approximately 145 features about many radicalized profiles in the United States, from which we extract the features relevant to our project: age, gender, relationship, etc.

1.2 Online Data Sources

Online data is the social network data used for prediction and future model re-training. Its sources are the public APIs provided by the social networks. We decided to study three popular websites that have similar data contents and that can also be linked together: Facebook, Instagram and Twitter.

Facebook provides the Graph API, an HTTP-based API service for accessing the objects of the Facebook social graph [26]. With the right permissions, the Graph API allows you to query public data as well as create content [27]. The data is rich in semantics, since the Graph API uses the RDF format as a return type. [28]

Instagram, as part of Facebook, also provides the Graph API for business accounts [29]. For normal user accounts, it offers a REST API that returns JSON objects for querying public data. [30]

Twitter provides a REST API with a JSON return format that offers several public data queries, as well as private data with the right permissions. [31, 32]
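As an illustration of online collection via these APIs, here is a sketch using the tweepy client for the Twitter REST API. The credentials are placeholders, the query is illustrative, and tweepy's method names differ across versions (e.g. search vs. search_tweets), so treat this as a template rather than working code:

```python
import tweepy

# Placeholder credentials, obtained from the Twitter developer portal.
auth = tweepy.OAuthHandler("API_KEY", "API_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect tweets matching a hashtag query (the hashtag here is a placeholder).
tweets = []
for status in tweepy.Cursor(api.search_tweets, q="#example_hashtag", lang="en",
                            tweet_mode="extended").items(200):
    tweets.append({
        "user": status.user.screen_name,
        "text": status.full_text,
        "followers": status.user.followers_count,
    })
print(len(tweets), "tweets collected")
```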
2 Proposed Model Design

With the data preparation phase ready, we can now determine the classification approach to use along with the terrorism score formula, thus completing phase 2 and phase 3. In this section, we explain the theoretical side of the steps necessary to construct our proposed model. As previously mentioned, the model takes a graph as input, performs a content-based classification per individual, and uses a decision-making component to calculate the final score of each node, with a threshold determining whether the user is a terrorist.

2.1 Model Input

Inspired by [11], the best way to represent our input data is a multidimensional network. However, unlike their proposal, the dimensions in our work represent the social networks used. Let G = (V, E, D) denote an undirected, unweighted multidimensional graph, in which V is the set of nodes representing the users, D reflects the dimensions, which are the social networks, and E = {(u, v, d); u, v ∈ V, d ∈ D} is the set of edges representing the connections between users, such as relationships, shared comments or post sharing. Fig. III.1 illustrates what this network looks like. At each timestamp, the user's data is fed into our model to compute his/her score. A timestamp here is each time the user is involved in a new activity, which is the method used by [5].

Figure III.1: Multi-dimensional Network
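One way to hold such a structure in memory is a networkx MultiGraph whose edges carry the dimension d as an attribute; a minimal sketch, in which the node names and edge kinds are illustrative:

```python
import networkx as nx

G = nx.MultiGraph()  # allows parallel edges: one per dimension d in D

# The same pair of users connected on two different social networks.
G.add_edge("user_1", "user_2", dim="facebook", kind="friendship")
G.add_edge("user_1", "user_2", dim="twitter", kind="retweet")
G.add_edge("user_2", "user_3", dim="instagram", kind="comment")

# Extract the single-dimension subgraph for one social network.
twitter_edges = [(u, v) for u, v, a in G.edges(data=True) if a["dim"] == "twitter"]
print(nx.Graph(twitter_edges).edges())  # [('user_1', 'user_2')]
```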
2.2 Content-Based Classification

The model itself contains three different sub-models, one for each content type we have.

2.2.1 Text Classification Model

As mentioned in the previous chapter, before applying machine learning classification models to textual content, we have to perform text analysis and transform the text into numerical input.
As illustrated in Fig. III.2, when textual data is received, it first passes through the NLP process. Once that is done, it has to be represented numerically. In the last chapter, we presented a comparison between two word embedding techniques, Word2Vec and TF-IDF. We chose TF-IDF because, as we are solving a classification problem, we are more interested in differentiating the categories than in representing the similarity of word meanings. Now that our machine can understand the textual data and the data itself can be represented numerically, we can pass it to any machine learning model. As a strategy, we decided that in the implementation phase we would try the different models mentioned in the last chapter, such as Support Vector Machines, Logistic Regression and Neural Networks, and compare their results to assess which one performs better.

Figure III.2: Text Classification Model
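A sketch of this comparison as scikit-learn pipelines; the two-sentence toy corpus merely stands in for the real dataset of Chapter IV:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Toy stand-in corpus: 1 = terrorist content, 0 = military/news content.
texts = [
    "join the fight and attack the unbelievers",
    "army completes training exercise today",
] * 10                 # repeated so that 5-fold cross-validation works
labels = [1, 0] * 10

for name, clf in [("SVM", LinearSVC()),
                  ("LogReg", LogisticRegression(max_iter=1000)),
                  ("NN", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500))]:
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=5)
    print(name, scores.mean())
```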
2.2.2 Image Classification Model

In the previous chapter, we presented the convolutional neural network as the model to use for image classification. But designing a CNN requires ample parameter tuning and adding/removing convolution blocks to find the best architecture, re-training the model each time. This is a hugely time-consuming job. To overcome it, a technique called Transfer Learning can help achieve better results faster.

Transfer Learning is a technique that lets a model benefit from knowledge gained while solving another, similar problem. For example, a model that learned to recognize cars could use its knowledge to recognize trucks [33]. This is done by taking a pre-trained model, changing a few layers, usually the last ones, and re-training only those layers. It is proven in [34] that transfer learning can bring huge improvements in accuracy, execution time and memory usage.

Another common limitation in image classification is not having diverse enough data, or not enough samples. A solution to that is the Data Augmentation technique.

Data Augmentation is a technique for generating more data, because having little data without enough variation leads to a bottleneck in neural network models, which usually require thousands of training samples with diverse variation to be able to generalize. This is done using techniques such as the following (a sketch combining transfer learning and data augmentation follows the list):
• Flipping: Flip the image horizontally or vertically.
• Rotating: Rotate the image by some degrees.
• Scaling: Re-scale an image by making it larger or smaller.
• Cropping: Crop a part of an image.
• Translating: Move the image in some direction.
• Adding Gaussian Noise: Add noisy points to the image.
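A sketch of both techniques with Keras, assuming a frozen VGG16 base; the layer sizes, augmentation ranges and the train_dir folder are illustrative assumptions, not the tuned values of Chapter IV:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Transfer learning: reuse VGG16's convolutional blocks, retrain only the head.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the pre-trained layers

model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # terrorist vs. military/news
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Data augmentation: random flips, rotations, shifts and zooms at training time.
augmenter = ImageDataGenerator(
    rescale=1.0 / 255,
    horizontal_flip=True,
    rotation_range=20,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.15,
)
# train_dir is a hypothetical folder with one sub-folder per class.
train_flow = augmenter.flow_from_directory("train_dir", target_size=(224, 224),
                                           class_mode="binary")
# model.fit(train_flow, epochs=10)
```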
Applying data augmentation can help improve the model's score, as discussed in [35]. Therefore, as illustrated in Fig. III.3, once we have image data, it passes through our trained CNN model, resulting in an image-content score.

Figure III.3: Image Classification Model

2.2.3 General Information Classification Model

For the general information model, the features do not require pre-processing for the machine to understand them. We only have to apply some encoding techniques to the non-numerical data, then fit it to a supervised machine learning classification model.

Non-numerical features such as gender and relationship status have to be encoded into numerical values. As these are binary, we can use 0 and 1. For non-binary values,
we have to use techniques such as one-hot encoding or integer (label) encoding. As for the username, we can apply some feature engineering to create relevant features from it, such as its length, the number of unique characters and other useful information, as discussed in [8]. Other numerical features, such as the age, the number of friends and the number of followers, can be passed directly to the model. In the implementation phase, we try different classification models and compare their results to select the one that performs best; a sketch of these encoding steps follows below.

Figure III.4: General Information Classification Model
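As an illustration, the following is a minimal sketch of the encoding and feature-engineering steps described above, using pandas. The column names and example values are hypothetical, not taken from the datasets used later.

import pandas as pd

profiles = pd.DataFrame({
    "username": ["user_123", "xXdark99Xx"],
    "gender": ["male", "female"],          # binary feature
    "relationship": ["single", "married"], # binary feature
    "country": ["TN", "FR"],               # non-binary feature
    "age": [27, 34],
    "n_friends": [310, 58],
})

# Binary features encoded as 0/1
profiles["gender"] = (profiles["gender"] == "male").astype(int)
profiles["relationship"] = (profiles["relationship"] == "married").astype(int)

# Non-binary categorical feature one-hot encoded
profiles = pd.get_dummies(profiles, columns=["country"])

# Feature engineering on the username, as discussed in [8]
profiles["username_length"] = profiles["username"].str.len()
profiles["username_unique_chars"] = profiles["username"].apply(lambda u: len(set(u)))
profiles["username_digits"] = profiles["username"].str.count(r"\d")

X = profiles.drop(columns=["username"]).values  # ready for a classifier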
2.3 Decision Making

Now that we have a model for each data type, we can move to phase 3, where we propose a formula that computes a score for each user. While carrying out this work, and based on the available features, we noticed that the textual and image content have more impact on the user's behavior than the general information, which can be misleading. Therefore, as a compromise, we decided to give each input type a weight relative to its impact on determining the anomalousness of the user.

Given the 3 sub-model scores {s_1, s_2, s_3} and 3 weights {α_1, α_2, α_3}, each node u ∈ V on each dimension d ∈ D has a terrorism score for that dimension, S(u)_d, as in (III.1):

S(u)_d = Σ_{i=1}^{3} α_i × s(u)_i    (III.1)

Each user now has a score for each dimension, based on the sub-model scores of that dimension, but as an output we want a single score. For that, given 3 dimensions, each user has a terrorism score S_T(u) as in (III.2):

S_T(u) = (1/3) Σ_{d=1}^{3} S(u)_d    (III.2)

Now that each user u ∈ V has a terrorism score S_T(u), we have to decide whether that user is a terrorist. This is done by defining a threshold γ where:

S_T(u) ≥ γ ⇒ Terrorist
S_T(u) < γ ⇒ Not Terrorist    (III.3)

The values of the weights α_i and the threshold γ are determined in the implementation phase.
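The following is a minimal sketch of formulas (III.1)-(III.3) in code. The weights and threshold shown anticipate the values chosen later in the implementation phase; the per-dimension sub-model scores are placeholders.

def dimension_score(sub_scores, weights):
    # Weighted terrorism score of one dimension, eq. (III.1)
    return sum(a * s for a, s in zip(weights, sub_scores))

def terrorism_score(per_dimension_scores):
    # Average over the dimensions, eq. (III.2)
    return sum(per_dimension_scores) / len(per_dimension_scores)

weights = [0.4, 0.4, 0.2]                # alpha_1..alpha_3: text, image, info
dims = {"facebook":  [0.8, 0.7, 0.5],    # placeholder sub-model scores s_1..s_3
        "twitter":   [0.9, 0.6, 0.4],
        "instagram": [0.3, 0.8, 0.6]}

S_T = terrorism_score([dimension_score(s, weights) for s in dims.values()])
is_terrorist = S_T >= 0.5                # threshold decision, eq. (III.3)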
2.4 Global Model

After defining the different components of our model, let us present its design along with the workflow of how to use it. Fig. III.5 shows what our model looks like, using the example of a single user with three dimensions: the Facebook, Twitter and Instagram data.

Figure III.5: Proposed Model

Fig. III.6 illustrates the workflow of our model. Each time a user is involved in an activity, the user's data passes through our model. If the user's behavior is detected as terrorist, we re-train the model with this new data to keep it updated with new, unseen behaviors. If the model loses accuracy after re-training, we revert to the last existing model.
Figure III.6: Model Workflow

Conclusion

In this chapter, we presented our proposed approach, starting from the research questions we are looking to solve. Then, we showed the different phases to follow in order to answer those questions. Finally, we explored the steps toward the
construction of our model. The next chapter details the achievements and the different results.
IV Implementation and Results

This chapter presents the practical part of our work. We go through the pipeline of our implementation, starting with the data gathering, then the model creation, and we finish with the interpretation of the results and an answer to the research questions.

1 Data Collection

In this section, we explain how we gathered the data identified in the last chapter. As discussed, there are two types of data: offline and online. In the next sections, we implement the data gathering solution for each of them.

1.1 Offline Data

To train the models, we adopted the strategy of using offline data, namely the public datasets related to our problem. In the last chapter, we selected a data source for each input type; we implement their gathering scripts in the next sections.
1.1.1 Textual-Content Data

For the textual data, we have two sources:

• Positive labels: tweets of banned Twitter accounts.
• Negative labels: news headlines from the GTD.

Our positive labels are the data containing terrorist textual content. Our strategy was to gather the tweets of banned users that were reported to the @twittersafety account and that contained terrorism-related hashtags at the time they were reported; this can be done through the Twitter API or the Twitter search tool. Fig. IV.1 illustrates an example of our searches, looking for tweets that were reported to or mentioned the twittersafety account and contained the hashtags #ISIS, #terrorist, #Daech, #IslamicState.

Figure IV.1: Twitter Searching Tool

While doing our research, we found that an organization had already carried out this process and extracted over 17k clean samples of terrorist data from ISIS users, published as a
Kaggle dataset called How ISIS Uses Twitter [36]. For our negative labels, we need content related to terrorism in an opposite way, such as news reporting on terrorism. For that, we use the news headlines from the Global Terrorism Database (GTD) [13]. Fig. IV.2 presents a sample of 4 rows from the GTD news headlines.

Figure IV.2: Sample of news headlines

Our final dataset merges the tweets, labeled as terrorist, with the GTD data, labeled as news (a minimal sketch of this merge follows the word cloud below). Fig. IV.3 shows the word cloud of the most frequent keywords in our dataset, which includes both positive and negative labels.

Figure IV.3: Word Cloud of our Textual Data
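The merge itself can be sketched as follows with pandas. The file and column names are assumptions for illustration; the actual data come from [36] and the GTD.

import pandas as pd

tweets = pd.read_csv("isis_tweets.csv")      # Kaggle "How ISIS Uses Twitter"
gtd = pd.read_csv("gtd_headlines.csv")       # GTD news headlines

positives = pd.DataFrame({"text": tweets["tweets"], "label": 1})  # terrorist
negatives = pd.DataFrame({"text": gtd["headline"], "label": 0})   # news

dataset = pd.concat([positives, negatives], ignore_index=True)
dataset = dataset.sample(frac=1, random_state=42)  # shuffle before splitting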
In total, we have approximately 300K samples, of which about 122K are terrorist data and around 181K are news headlines. Table IV.1 presents the exact numbers in our dataset.

Label              Number of samples
Positive labels    122,619
Negative labels    181,691
Total Data         304,310

Table IV.1: Textual-Content Dataset

1.1.2 Image-Content Data

As discussed in our research, the source of the image data is Google Images, from which we gather the images ourselves. Fortunately, a Python package called google_images_download [37] exists, which allows us to automate this task by choosing the keywords we are looking for and the number of images needed, as sketched below.
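A minimal sketch of such a download script, with illustrative keywords and limits (per the package documentation, larger limits may require a chromedriver setup):

from google_images_download import google_images_download

downloader = google_images_download.googleimagesdownload()
downloader.download({
    "keywords": "terrorist incitement, military news",  # one folder per keyword
    "limit": 100,                                       # images per keyword
    "output_directory": "raw_images",
})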
Using a script, we downloaded around five hundred images of terrorist persons and incitement acts, in addition to another five hundred images of military and terrorism news. Unfortunately, the images were not 100% related to what we were looking for; therefore, we had to verify the gathered images manually and remove the unrelated ones.

After cleaning the data and keeping only related images, we had around 200 terrorist images and 300 military and news images. Table IV.2 gives the exact numbers of images in our dataset. Fig. IV.4 and Fig. IV.5 show three random images of each category.

Label              Number of samples
Positive labels    219
Negative labels    314
Total Data         533

Table IV.2: Image-Content Dataset

Figure IV.4: Sample of Terrorists images
Figure IV.5: Sample of Military/News images

1.1.3 General Information Data

For the general information data, we used the public Profiles of Individual Radicalization in the United States (PIRUS) [25] dataset, from which we extracted the ages, genders and relationship statuses of 135 extremist persons; these are our positive labels. For the negative labels, we use the online data to build our dataset.

Fig. IV.6, Fig. IV.7 and Fig. IV.8 show the distribution of each feature within our positive-label data.

Figure IV.6: Age Distribution
Figure IV.7: Relationship Distribution

Figure IV.8: Gender Distribution

1.2 Online Data

In this section, we implement the scripts that gather the online data from our selected social media platforms: Facebook, Instagram and Twitter.

1.2.1 Facebook Data

Facebook provides an HTTP-based API called the Graph API. A public SDK called facebook-sdk helps us write an automated Facebook data gathering script in Python. To use the Facebook Graph API, it is necessary to pass an access token that has the relevant permissions for the social graph objects being queried. In the Facebook social graph, each object has fields related to its object type; for example,
the User object contains information about the user profile, such as the age, relationship status and gender. The main objects we are interested in are the User, the Post, and the Comment. To access a graph object, we pass the id of an object of that type. Therefore, posts and comments cannot be accessed directly: the post ids are contained in the posts field of the User object, and likewise the comments are part of the posts. Fig. IV.9 shows a representation of the Facebook Graph API.

Figure IV.9: Facebook Graph API

Our script starts by obtaining the information of the user along with the list of post ids. Then, it accesses all the posts by looping through the post ids from the posts field of the User object and retrieves the necessary information. After that, it extracts the comments by looping through the comment ids from the comments field of each Post object. Finally, it parses the textual and image data from those posts and comments. The following code is an example of how to get the user information along with the posts data.

import json

import facebook  # the facebook-sdk package
import requests

graph = facebook.GraphAPI(access_token=access_token, version="3.1")

# User object: profile information
user_information = graph.get_object(
    id="me", fields="id,name,age_range,gender,relationship_status")

# First page of post ids, then follow the paging links
posts_object = graph.get_object(id="me", fields="posts")
posts_ids = list(posts_object["posts"]["data"])
next_page = posts_object["posts"].get("paging", {}).get("next")

while next_page is not None:
    new_data = json.loads(requests.get(next_page).content)
    posts_ids.extend(new_data["data"])
    next_page = new_data.get("paging", {}).get("next")

# Retrieve the relevant fields of each post
for post in posts_ids:
    post_data = graph.get_object(
        id=post["id"],
        fields="created_time,full_picture,message,shares,likes.summary(1)")
    post_data["likes"] = post_data["likes"]["summary"]["total_count"]
    post_data["shares"] = post_data.get("shares", {}).get("count", 0)
1.2.2 Instagram Data

For Instagram, the task is easier, as it provides a plain REST API with JSON output, where each endpoint is accessed directly through any HTTP request module. In Python, we use the requests module with the Instagram endpoint https://api.instagram.com/v1/, where we can access the user information through /users/self/?access_token={} and the posts through /users/self/media/recent/?access_token={}. The following code shows how our script gathers information from Instagram.

import json
from datetime import datetime

import requests

BASE = "https://api.instagram.com/v1"

# User data
response = requests.get(
    "{}/users/self/?access_token={}".format(BASE, access_token))
user = json.loads(response.content)["data"]

# Posts data
response = requests.get(
    "{}/users/self/media/recent/?access_token={}".format(BASE, access_token))
data = json.loads(response.content)

for post in data["data"]:
    created_time = datetime.fromtimestamp(
        int(post["created_time"])).strftime("%Y-%m-%d %H:%M:%S")
    message = post["caption"]["text"] if post["caption"] is not None else ""
    img_url = post["images"]["standard_resolution"]["url"]
    post_data = dict(created_time=created_time, id=post["id"],
                     message=message, img_url=img_url)
1.2.3 Twitter Data

Similarly to Instagram, Twitter provides a REST API; however, it also offers a Python SDK that makes the API easier to use. In order to use it, we have to pass 4 access keys: the consumer key, consumer secret, access token key and access token secret. Each key carries the permissions that allow access to either the user's private data or the public Twitter data. The following code is an example of how we load the tweets using the Twitter Python SDK.
import twitter  # the python-twitter package

api = twitter.Api(consumer_key=consumer_key,
                  consumer_secret=consumer_secret,
                  access_token_key=access_token_key,
                  access_token_secret=access_token_secret)

user_id = api.VerifyCredentials().AsDict()["id"]
tweets = api.GetUserTimeline(user_id=user_id)

for tweet in tweets:
    tweet = tweet.AsDict()
    message = tweet["text"] if tweet.get("text") is not None else ""
    tweet_data = dict(created_time=tweet["created_at"],
                      id=tweet["id"], message=message)

2 Model Implementation

In the next sections, we implement the different components that lead to our proposed model. For each sub-model, we split the dataset of that content type into 80% training data and 20% testing data (a sketch of this split follows the hardware list below). All the models are implemented on the same machine, provided by Kaggle, a data science platform, with the following hardware:

• RAM: 16 GB
• CPU count: 2
• GPU: Tesla K80
• Disk: 5 GB
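A minimal sketch of the 80/20 split on placeholder data, where X and y stand for the prepared samples and labels of one sub-model:

import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 10)       # placeholder features of one content type
y = np.random.randint(0, 2, 1000)  # placeholder binary labels

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)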
2.1 Text Classification Model

The steps to construct our text classification model are: first, have the NLP pipeline ready for data pre-processing; then vectorize the data with TF-IDF and pass it to a classification model.

2.1.1 NLP Process

In the practical phase, the NLP process becomes: tokenization, removal of stop words and lemmatization. In the following code, we use the Natural Language Toolkit (NLTK) Python package for these steps. We start with regular expressions that remove unnecessary text that disrupts the process, such as links and dates. Then, we split the text into tokens, remove the stop words (common, uninformative words like 'a', 'the', 'that', 'on') and lemmatize the words, determining the root word based on its part-of-speech tag (adjective, verb, noun).

import re

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer

tokenizer = TweetTokenizer()
wordnet_lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

def process_text(text):
    nltk_processed_data = []
    # Remove links and date-like patterns that disrupt the process
    text = re.sub(r"https?://\S+", "", text, flags=re.MULTILINE)
    text = re.sub(r"(?:[0-9]{1,2}[:/-]){2}[0-9]{2,4}", "", text, flags=re.MULTILINE)
    for w in tokenizer.tokenize(text):
        word = w.lower()
        if word not in stop_words:
            # get_wordnet_pos is a small helper that maps a word to its
            # WordNet part-of-speech tag (adjective, verb, noun)
            nltk_processed_data.append(
                wordnet_lemmatizer.lemmatize(word, get_wordnet_pos(word)))
    return nltk_processed_data

2.1.2 Data Vectorization

To use our data with classification models, we have to vectorize it into semantically meaningful numerical data. In the last chapter, we defined TF-IDF as our vectorizer. Scikit-Learn
offers a TfidfVectorizer module that can be used in two lines. We defined the object parameters as follows:

• max_df: maximum document frequency for a word to be kept in the vocabulary ⇒ 0.95 (a word may appear in at most 95% of the documents)
• min_df: minimum document frequency for a word to be kept in the vocabulary ⇒ 0.1 (a word must appear in at least 10% of the documents)
• ngram_range: number of consecutive words considered as a single token ⇒ (1, 3) (from 1 word to 3 words)

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_df=0.95, min_df=0.1, ngram_range=(1, 3))
X = vectorizer.fit_transform(train_data)

The data used to train the TfidfVectorizer amounts to around 243K samples (80% of the dataset). After vectorizing the data, TF-IDF identified 330 feature vectors, which makes our data shape:

(n_training_samples, n_dimensions) ⇒ (243448, 330)

We use the trained vectorizer to transform the testing data, as follows:

transformed_data = vectorizer.transform(test_data).toarray()

2.1.3 Data Classification

As mentioned in the last chapter, we try three classification models, namely Logistic Regression, Support Vector Machine and Neural Network. The best performing model is used later in our global model. To implement the Logistic Regression and the Support Vector Machine, we used Scikit-Learn, a Python machine learning library that offers many well-known models. We trained these two models with their default parameter values. For the Neural Network, we used Keras, a framework that works on top of TensorFlow. The architecture of our model is composed of three layers, with 16 neurons, 8 neurons and 1 neuron respectively. The first two layers use a ReLU activation, chosen for its proven performance, and the last layer uses a sigmoid activation, as it is our output layer and we have a binary classification problem. The model is compiled with binary cross-entropy as the loss function and Adam as the optimizer. For the training parameters, we used 20 epochs with a batch size of 128 and 20% validation data extracted from the training data. A sketch of this architecture follows.
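A minimal sketch of this neural network in Keras, assuming X_train is the dense TF-IDF matrix from the vectorization step and y_train the corresponding labels:

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(16, activation="relu", input_shape=(330,)),  # 330 TF-IDF features
    Dense(8, activation="relu"),
    Dense(1, activation="sigmoid"),  # binary output: terrorist vs. news
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

model.fit(X_train, y_train, epochs=20, batch_size=128, validation_split=0.2)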
Table IV.3 shows the metric scores of each model along with the training time. These models were trained and tested with the same data and on the same machine. The model we use in our global model is the Neural Network, as it has the best F1-score with a reasonable training time.

Model Name            Accuracy   F1-Score   Training time
Logistic Regression   0.9726     0.9674     39.9 secs
SVM                   0.9626     0.9548     6h 48min 33s
Neural Network        0.9774     0.9719     1min 11s

Table IV.3: Text Models Metric Scores

2.2 Image Classification Model

In the last chapter, we defined the convolutional neural network as our image classification model, along with two optimization techniques, namely Transfer Learning and Data Augmentation. Therefore, as a first step, we implement our data augmentation functions, then define which base model's learned knowledge will be used in our model.

2.2.1 Data Augmentation

For data augmentation, a Python package called imgaug provides all the different augmentation techniques. The following code shows an example of using the imgaug augmenters, applying a randomly selected augmentation technique to the image.

from imgaug import augmenters as iaa

img_augmentor = iaa.Sequential([
    # Select one of the augmentation techniques randomly
    iaa.OneOf([
        iaa.Affine(rotate=0),
        iaa.Affine(rotate=90),
        iaa.Affine(rotate=180),
        iaa.Affine(rotate=270),
        iaa.Fliplr(0.5),
        iaa.Flipud(0.5),
    ])], random_order=True)

# Apply the augmentation technique on the image
image_aug = img_augmentor.augment_image(image)
Fig. IV.10 shows an example of two images generated by the data augmentation code above.

Figure IV.10: An example of data augmentation

After applying data augmentation to the training data, we generated an additional 30% of data, resulting in a total of approximately 550 images.

2.2.2 Transfer Learning

Many pre-trained models exist nowadays, but each is focused on a specific problem. In our case, we work mostly with faces and objects such as guns, so the pre-trained model VGG16 [38] is well suited to our problem. To adapt VGG16 to our problem, we remove its fully-connected layers, freeze the training of the remaining layers and add two new layers: the first with 16 neurons and a ReLU activation; the second, our output layer, with 1 neuron and a sigmoid activation. The loss function is binary cross-entropy with Adam as the optimizer. Since image classification can be a complex task and we have a small amount of data, we train the model for up to 5000 epochs with a batch size of 32, with an early stopping strategy of 250 rounds. A sketch of this setup follows.
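A minimal sketch of this transfer-learning setup in Keras. The input size and the Flatten layer connecting the convolutional base to the new layers are assumptions; X_img and y_img stand for the augmented image data and labels.

from keras.applications import VGG16
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.callbacks import EarlyStopping

# Pre-trained convolutional base, without its fully-connected layers
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # freeze the training of the remaining layers

model = Sequential([
    base,
    Flatten(),
    Dense(16, activation="relu"),
    Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam",
              metrics=["accuracy"])

model.fit(X_img, y_img, epochs=5000, batch_size=32, validation_split=0.2,
          callbacks=[EarlyStopping(patience=250, restore_best_weights=True)])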
In Table IV.4, we present the scores of the different combinations of our two new CNN layers, with and without the pre-trained model, and with and without the data generated by data augmentation. While the scores were measured on the same testing data, the training data differs when data augmentation is used. Using both DA and TL together results in better scores with an acceptable training time; therefore, we use that combination in our global model.

Model            Accuracy   F1-Score   Training time
CNN              0.7631     0.7219     3min 50secs
CNN + DA         0.7781     0.7463     4min 12secs
CNN + TL         0.8291     0.8103     8min 48secs
CNN + DA + TL    0.8571     0.8454     9min 23secs

Table IV.4: Image Models Metric Scores

2.3 General Information Classification Model

For the general information, we follow the same strategy as in text classification: we work with three classification models, namely Logistic Regression, Support Vector Machine and Neural Network, and the best performing model is later used in our global model. For the Logistic Regression and the Support Vector Machine, we used the default Scikit-Learn parameter values. For the Neural Network, we used an architecture of four layers with 16 neurons, 8 neurons, 4 neurons and 1 neuron respectively. A ReLU activation is used for the first three layers and a sigmoid activation for the last layer. The model is compiled with binary cross-entropy as the loss function and Adam as the optimizer. For the training parameters, we used 200 epochs with a batch size of 32 and 20% validation data extracted from the training data.
Table IV.5 presents the metric scores of the models trained with the same data on the same machine. For the global model, we use the SVM, as it outperforms the other models.

Model Name            Accuracy   F1-Score   Training time
Logistic Regression   0.7650     0.7873     5 secs
SVM                   0.8300     0.8495     7 secs
Neural Network        0.8173     0.8325     48.6 secs

Table IV.5: General Information Models Metric Scores

2.4 Proposed Model

In this part, we go through our proposed model's workflow to put things together and implement the missing components.

Our model's input is a multidimensional network; therefore, we have to implement a parser that maps the data to the corresponding sub-model. This is solved by creating objects in which we store the data in a convenient way before passing it to the sub-models. Fig. IV.11 illustrates our class diagram for storing each user's data. The general user information is stored in the User object, while the Post object, which can also be a Comment, holds both the image and the textual data.

Figure IV.11: Class Diagram
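A minimal sketch of these objects as Python dataclasses. The field names are assumptions based on the features discussed earlier, not the exact thesis classes.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Post:                        # a Comment can be modeled the same way
    id: str
    created_time: str
    message: str = ""              # textual content
    img_url: Optional[str] = None  # image content

@dataclass
class User:
    id: str
    username: str
    age: Optional[int] = None
    gender: Optional[str] = None
    relationship: Optional[str] = None
    posts: List[Post] = field(default_factory=list)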
The second component of our model is the set of sub-models that receive the input data. For that, we use the chosen pre-trained model of each input type, each outputting a score.

The next component is the decision making, where we interpret the output scores of the sub-models, calculate the terrorism score and decide on the user's extremeness. The calculation formula was already defined in the last chapter, but the values of the threshold γ and the model weights α were not yet decided. For the weights, since we have more features in the image and textual content than in the general information, we set them as follows:

• Text-model weight: 0.4 (40%)
• Image-model weight: 0.4 (40%)
• Information-model weight: 0.2 (20%)

As for the threshold, since we do not have enough real online data to determine it in a principled way, we agreed to keep it neutral with a value of 0.5 (50%).

The model itself must adapt to change over time; thus, a component that re-trains and reverts a model must be implemented as well. For that, we keep a database storing the last model's score, and a Python function checks whether the score improved after re-training the model on the new terrorist user's data (a sketch of this logic follows).
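A minimal sketch of this re-train-and-revert logic, assuming a Scikit-Learn-style model; the score store is a stand-in for the database, not the thesis code.

import copy
from sklearn.metrics import f1_score

def retrain_or_revert(model, new_X, new_y, X_test, y_test, store):
    backup = copy.deepcopy(model)          # keep the last existing model

    model.fit(new_X, new_y)                # re-train on the new user's data
    new_score = f1_score(y_test, model.predict(X_test))

    if new_score >= store["last_score"]:
        store["last_score"] = new_score    # keep the improved model
        return model
    return backup                          # score dropped: revert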
With those components ready, our model's implementation is finished and the model is ready to be tested.

3 Results Interpretation

In this section, we test our model on a network to see whether we can answer the research questions posed at the beginning of our proposal.

The network passed to the model is composed of two real users (U1, U2) that are non-terrorist and one generated terrorist user (U3), as we could not find an available terrorist user. The input was tested at only a single timestamp t, due to the lack of historical data. As shown in Table IV.6, which presents the scores predicted for those users by each sub-model on each social network (Facebook: FB, Instagram: IG, Twitter: T), the model performed well, correctly predicting the anomalousness of the users. Based on these results, we can see that a terrorist can be detected according to his/her social media content; thus, our answer to Q1 is positive. We also notice that the scores for the same data type on different social networks are mostly similar, except for the text content on Instagram, which consists only of image captions; this means that our answer to Q3 is positive.

        Text-Model Score       Image-Model Score      Information-Model Score   Final
User    FB     IG     T        FB     IG     T        FB     IG     T           Score
U1      0.084  0.084  0.079    0.031  0.068  0.063    0.265  0.318  0.345       0.116
U2      0.059  0.054  0.078    0.013  0.054  0.115    0.530  0.445  0.276       0.133
U3      0.859  0.298  0.854    0.658  0.877  0.816    0.530  0.637  0.690       0.705

Table IV.6: Model Testing Results
After detecting user U3 as a terrorist, the sub-models were re-trained with the new data extracted from U3 appended to the old data. The new score of each sub-model increased by an average of 0.01. Although this increase could be considered negligible, over time it helps keep our model up-to-date with new terrorism content; thus, if a user starts adopting new terrorism behaviors that the model was not originally trained on, the user will still be detected as a terrorist. Therefore, our answer to Q2 is positive.

Conclusion

In this chapter, we presented the implementation of our solution, starting with the data gathering, then the sub-model training and the construction of our proposed model, and finished by testing our model and answering our research questions.
V Conclusions and Perspectives

In this thesis, we proposed a terrorist detection model that takes multidimensional networks as its input format and supports different input data types, such as texts and images. Our model can also detect whether a user is adopting a new behavior over time, and the model itself can automatically learn new terrorism behaviors.

We started by presenting the existing work in the anomaly and terrorism detection domains. Then, we discussed the existing techniques for automated data processing and data classification. After that, we presented the model's design and the theoretical perspective of its workflow. Finally, we implemented the model and discussed the results.

The model showed good results on two real users and one generated user, predicting their anomalousness correctly. Although the amount of online data used for testing is very small, this still serves as a proof of concept that our proposed model can be implemented and put into a production environment.

Although we tried to address the limitations of other existing models, our proposed model is still limited by not supporting some functionalities, such as:

• Graph analysis: Since our input data is a network, graph analysis methodologies could be used to detect communities.
• Support of videos: Another sub-model could be added for video classification, since videos are among the most important content types on social media.

The model's accuracy could also be improved by using larger datasets, which would additionally allow a more principled calculation of the threshold and the sub-model weights.
Bibliography

[1] Shannon Greenwood, Andrew Perrin, and Maeve Duggan. Social media update 2016. Pew Research Center, 11(2), 2016.
[2] Alex P Schmid. The definition of terrorism. In The Routledge Handbook of Terrorism Research, pages 57–116. Routledge, 2011.
[3] Facebook community standards. URL https://www.facebook.com/communitystandards/dangerous_individuals_organizations.
[4] Arash Habibi Lashkari, Min Chen, and Ali A Ghorbani. A survey on user profiling model for anomaly detection in cyberspace. Journal of Cyber Security and Mobility, 8(1):75–112, 2019.
[5] Zahedeh Zamanian, Ali Feizollah, Nor Badrul Anuar, Laiha Binti Mat Kiah, Karanam Srikanth, and Sudhindra Kumar. User profiling in anomaly detection of authorization logs. In Computational Science and Technology, pages 59–65. Springer, 2019.
[6] Sreyasee Das Bhattacharjee, Junsong Yuan, Zhang Jiaqi, and Yap-Peng Tan. Context-aware graph-based analysis for detecting anomalous activities. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pages 1021–1026. IEEE, 2017.
[7] Di Chen, Qinglin Zhang, Gangbao Chen, Chuang Fan, and Qinghong Gao. Forum user profiling by incorporating user behavior and social network connections. In International Conference on Cognitive Computing, pages 30–42. Springer, 2018.
[8] Hamidreza Alvari, Soumajyoti Sarkar, and Paulo Shakarian. Detection of violent extremists in social media. arXiv preprint arXiv:1902.01577, 2019.
[9] Pradip Chitrakar, Chengcui Zhang, Gary Warner, and Xinpeng Liao. Social media image retrieval using distilled convolutional neural network for suspicious e-crime and terrorist account detection. In 2016 IEEE International Symposium on Multimedia (ISM), pages 493–498. IEEE, 2016.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[11] George Kalpakis, Theodora Tsikrika, Stefanos Vrochidis, and Ioannis Kompatsiaris. Identifying terrorism-related key actors in multidimensional social networks. In International Conference on Multimedia Modeling, pages 93–105. Springer, 2019.
[12] Pankaj Choudhary and Upasna Singh. A survey on social network analysis for counter-terrorism. International Journal of Computer Applications, 112(9):24–29, 2015.
[13] Gary LaFree and Laura Dugan. Introducing the Global Terrorism Database. Terrorism and Political Violence, 19(2):181–204, 2007.
[14] Kalev Leetaru and Philip A Schrodt. GDELT: Global data on events, location, and tone, 1979–2012. In ISA Annual Convention, volume 2, pages 1–49. Citeseer, 2013.
[15] Linton C Freeman. Centrality in social networks: conceptual clarification. Social Networks, 1(3):215–239, 1978.
[16] EDUCBA contributors. Text mining vs natural language processing - top 5 comparisons, Aug 2019. URL https://www.educba.com/important-text-mining-vs-natural-language-processing/.
[17] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[18] Shivangi Singhal. Data representation in NLP, Jul 2019. URL https://medium.com/@shiivangii/data-representation-in-nlp-7bb6a771599a.
[19] Eric Kauderer-Abrams. Quantifying translation-invariance in convolutional neural networks. arXiv preprint arXiv:1801.01450, 2017.