Unsupervised machine learning techniques such as clustering are gaining wide use with the recent growth of social communication platforms like Twitter and Facebook. Clustering enables finding patterns in these unstructured datasets. We collected tweets matching COVID-19-related hashtags from a Kaggle dataset and compared the performance of nine clustering algorithms on this data. We evaluated the generalizability of these algorithms using a supervised learning model. Finally, using a selected unsupervised learning algorithm, we categorized the clusters. The top five categories are Safety, Crime, Products, Countries, and Health. This can prove helpful for organizations working with large amounts of Twitter data that need to quickly find key points in the data before proceeding to further classification.
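As a hedged illustration of the kind of pipeline compared here, the sketch below clusters tweet text with TF-IDF and k-means, one plausible member of the nine algorithms (which the summary does not name); the file and column names are assumptions.

```python
# Minimal sketch: clustering tweet text with TF-IDF + KMeans. The paper
# compares nine algorithms but does not list them here; KMeans is one
# plausible choice. The file and column names are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

tweets = pd.read_csv("covid19_tweets.csv")["text"].dropna()

vectorizer = TfidfVectorizer(stop_words="english", max_features=5000)
X = vectorizer.fit_transform(tweets)

km = KMeans(n_clusters=5, random_state=42, n_init=10)
labels = km.fit_predict(X)

# Silhouette score gives one rough quality measure for comparing algorithms.
print("silhouette:", silhouette_score(X, labels))

# Top terms per cluster hint at categories such as Safety, Crime, Products.
terms = vectorizer.get_feature_names_out()
for i, centroid in enumerate(km.cluster_centers_):
    top = centroid.argsort()[-5:][::-1]
    print(f"cluster {i}:", [terms[t] for t in top])
```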
Hate speech has been an ongoing problem on the Internet for many years, and social media platforms, especially Facebook and Twitter, have given it a global stage where it can spread far more rapidly. Every social media platform needs an effective hate speech detection system to remove offensive content in real time. There are various approaches to identifying hate speech, such as rule-based, machine-learning-based, deep-learning-based, and hybrid approaches. As this is a review paper, we survey the notable work of various authors who have studied hate speech identification using these approaches.
Political prediction analysis using text mining and deep learning (Vishwambhar Deshpande)
We have proposed a system to determine current sentiment on Twitter using the Twitter API for open access, covering opinions from different content structures such as latest news, reviews, articles, and social media posts, together with a deep learning method that studies historic data to predict future results. We utilized Naive Bayes and dictionary-based algorithms to predict the sentiment of live Twitter data.
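A minimal sketch of the Naive Bayes side of such a system, assuming a small hand-labeled training set; the examples are illustrative, not the authors' data.

```python
# Minimal sketch of the Naive Bayes component: bag-of-words counts feeding
# a multinomial Naive Bayes classifier. Training data is purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["great result for the party", "terrible policy decision",
               "love this candidate", "worst speech ever"]
train_labels = ["positive", "negative", "positive", "negative"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# In the proposed system these would be live tweets fetched via the Twitter API.
print(model.predict(["what a great campaign speech"]))
```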
SOCIAL NETWORK HATE SPEECH DETECTION FOR AMHARIC LANGUAGE (cscpconf)
The anonymity of social networks makes them attractive to those who mask their criminal activities online through hate speech, posing a challenge to the world and to Ethiopia in particular. With the ever-increasing volume of social media data, hate speech identification becomes a challenge, aggravating conflict between citizens of nations. Because of the high rate of production, it has become difficult to collect, store, and analyze such big data using traditional detection methods. This paper proposes the application of Apache Spark to hate speech detection to reduce these challenges. The authors developed an Apache Spark based model to classify Amharic Facebook posts and comments into hate and non-hate. They employed Random Forest and Naïve Bayes for learning, and Word2Vec and TF-IDF for feature selection. Tested by 10-fold cross-validation, the model based on Word2Vec embeddings performed best, with 79.83% accuracy. The proposed method achieves promising results with the unique features of Spark for big data.
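A hedged PySpark sketch of the Word2Vec + Random Forest variant described above; the DataFrame schema and tiny sample rows are assumptions for illustration.

```python
# Minimal PySpark sketch of the Word2Vec + Random Forest pipeline described
# above; the column names ("post", "label") and sample rows are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, Word2Vec
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("amharic-hate").getOrCreate()
df = spark.createDataFrame(
    [("example hate post", 1.0), ("example neutral post", 0.0)],
    ["post", "label"],
)

pipeline = Pipeline(stages=[
    Tokenizer(inputCol="post", outputCol="tokens"),
    Word2Vec(vectorSize=100, minCount=1, inputCol="tokens", outputCol="features"),
    RandomForestClassifier(labelCol="label", featuresCol="features"),
])
model = pipeline.fit(df)
model.transform(df).select("post", "prediction").show()
```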
An evolutionary approach to comparative analysis of detecting Bangla abusive ... (journalBEEI)
The use of Bangla abusive texts has accelerated with the growing use of social media, a platform through which one can spread hatred or negativity virally. Plenty of research has been done on detecting abusive text in English, but Bangla abusive text detection has not been explored to a great extent. In this experimental study, we applied three distinct approaches to a comprehensive dataset to obtain a better outcome. In the first study, a large dataset collected from Facebook and YouTube was used to detect abusive texts. After extensive pre-processing and feature extraction, a set of consciously selected supervised machine learning classifiers, i.e., multinomial Naïve Bayes (MNB), multilayer perceptron (MLP), support vector machine (SVM), decision tree, random forest, stochastic gradient descent (SGD), ridge, perceptron, and k-nearest neighbors (k-NN), was applied to determine the best result. The second experiment was conducted by constructing a balanced dataset through random undersampling of the majority class, and in the final experiment a Bengali stemmer was applied to the dataset. In all three experiments, SVM with the full dataset obtained the highest accuracy of 88%.
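A minimal sketch of the second experiment's balancing step, assuming scikit-learn features and the imbalanced-learn undersampler; the toy data and the LinearSVC choice stand in for the paper's setup.

```python
# Minimal sketch: balance classes by randomly undersampling the majority
# class, then train an SVM. Data is illustrative; imbalanced-learn provides
# the undersampler.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from imblearn.under_sampling import RandomUnderSampler

texts = ["abusive comment"] * 2 + ["normal comment"] * 8
labels = [1] * 2 + [0] * 8

X = TfidfVectorizer().fit_transform(texts)
X_bal, y_bal = RandomUnderSampler(random_state=42).fit_resample(X, labels)

clf = LinearSVC().fit(X_bal, y_bal)
print("balanced class counts:", sorted(y_bal))
```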
ANALYSIS OF TOPIC MODELING WITH UNPOOLED AND POOLED TWEETS AND EXPLORATION OF... (IJCSEA Journal)
In this digital era, social media is an important tool for information dissemination, and Twitter is a popular platform. Social media analytics helps make informed decisions based on people's needs and opinions; this information, when properly interpreted, provides valuable insights into domains such as public policymaking, marketing, sales, and healthcare. Topic modeling is an unsupervised algorithm for discovering hidden patterns in text documents. In this study, we explore the Latent Dirichlet Allocation (LDA) topic model algorithm. We collected tweets with hashtags related to coronavirus discussions and compare regular LDA with LDA based on collapsed Gibbs sampling (LDAMallet). The experiments use different data processing steps, with and without trigrams and with and without hashtags. This study provides a comprehensive analysis of LDA for short text messages using unpooled and pooled tweets. The results suggest that a pooling scheme using hashtags helps improve topic inference, yielding a better coherence score.
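A minimal gensim sketch of the regular-LDA half of this comparison, with a c_v coherence score (LdaMallet would be swapped in for the collapsed-Gibbs variant); the three-document corpus is illustrative only.

```python
# Minimal sketch: fit LDA on tokenized tweets and report a c_v coherence
# score, the metric the study uses to compare pooling schemes. The tiny
# corpus here is illustrative only.
from gensim.corpora import Dictionary
from gensim.models import LdaModel, CoherenceModel

docs = [["covid", "vaccine", "health"],
        ["lockdown", "economy", "covid"],
        ["mask", "safety", "health"]]

dictionary = Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               random_state=42, passes=10)

cm = CoherenceModel(model=lda, texts=docs, dictionary=dictionary,
                    coherence="c_v")
print("coherence:", cm.get_coherence())
```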
FAKE NEWS DETECTION WITH SEMANTIC FEATURES AND TEXT MINING (ijnlc)
Nearly 70% of people are concerned about the propagation of fake news. This paper aims to detect fake news in online articles using semantic features and various machine learning techniques. In this research, we compared recurrent neural networks against naive Bayes and random forest classifiers using five groups of linguistic features. Evaluated on a real-or-fake dataset from kaggle.com, the best performing model achieved an accuracy of 95.66% using bigram features with the random forest classifier. The fact that bigrams outperform unigrams, trigrams, and quadgrams shows that word pairs, as opposed to single words or longer phrases, best indicate the authenticity of news.
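A hedged sketch of the best-reported configuration, bigram counts fed to a random forest; the labeled snippets are placeholders, not the Kaggle data.

```python
# Minimal sketch of the best-performing setup reported above: bigram counts
# into a random forest. The labeled examples are placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier

texts = ["shocking secret cure revealed", "officials confirm the report",
         "you won't believe this", "study published in journal"]
labels = ["fake", "real", "fake", "real"]

# ngram_range=(2, 2) extracts word pairs (bigrams) only.
vec = CountVectorizer(ngram_range=(2, 2))
X = vec.fit_transform(texts)

clf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, labels)
print(clf.predict(vec.transform(["officials confirm the study"])))
```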
IRJET - Effective Countering of Communal Hatred During Disaster Events in Soci... (IRJET Journal)
This document summarizes a research paper that aims to effectively counter communal hatred during disaster events on social media. It uses machine learning techniques to analyze tweets and classify them as offensive, hatred, or neither. Tweets are collected using Twitter's API and preprocessed, and a supervised machine learning algorithm (Support Vector Machine) is trained on manually labeled tweet data to classify new tweets. The results are visualized in a pie chart displaying the percentage of tweets containing offensive, hatred, or neutral words. The goal is to reduce the spread of communal hate speech on social media during disasters.
This document summarizes a research paper on analyzing sentiments from Twitter data using data mining techniques. The paper presents an approach for analyzing user sentiments using data mining classifiers and compares the performance of single classifiers versus an ensemble of classifiers for sentiment analysis. Experimental results show that the k-nearest neighbor classifier achieved very high predictive accuracy, and single classifiers outperformed the ensemble approach.
IRJET - Fake News Detection and Rumour Source Identification (IRJET Journal)
This document discusses methods for detecting fake news and identifying the source of rumors on social media. It proposes using Bayesian classification to classify information into real or fake categories based on the classifier outputs; if the combined outputs from the classes do not match, the information is considered fake. It also discusses using a reverse dissemination strategy to identify a group of suspects for the original rumor source, rather than examining each individual, which addresses issues with identifying sources. The method aims to identify the source node based on which nodes have accepted the rumor. Machine learning and natural language processing techniques are used to detect fake news from article content.
This document proposes using a convolutional neural network (CNN) to detect and classify fake news. It first discusses the implications of fake news spreading on social media and the need for automated identification, then explores existing fake news datasets and data preprocessing techniques. Deep learning approaches like word embeddings and CNNs are presented as promising techniques to capture semantics in text for classification. The document outlines a CNN architecture with word embedding, convolutional, max pooling, and fully connected layers that outputs probabilities for fake/real classification. It reports that the CNN approach achieved 99.8% accuracy on a 2.5 GB dataset, significantly outperforming baseline models like SVM and naive Bayes.
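A minimal Keras sketch of the layer stack described (embedding, convolution, max pooling, fully connected, sigmoid output); the vocabulary size and sequence length are illustrative choices, not the paper's settings.

```python
# Minimal Keras sketch of the described architecture: embedding, convolution,
# max pooling, and fully connected layers ending in a fake/real probability.
# Vocabulary size and sequence length are illustrative assumptions.
from tensorflow.keras import layers, models

vocab_size, seq_len = 20000, 300

model = models.Sequential([
    layers.Input(shape=(seq_len,)),          # token-id sequences, padded
    layers.Embedding(vocab_size, 128),       # word embedding layer
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),   # P(fake)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
model.summary()
```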
IRJET - Fake News Detection using Logistic Regression (IRJET Journal)
1) The document discusses a study that uses logistic regression to classify news articles as real or fake. It outlines the methodology, which includes data preprocessing, feature extraction using bag-of-words and TF-IDF, and a logistic regression classifier to predict fake news (see the sketch after this list).
2) The model achieved an accuracy of approximately 72% at classifying news as real or fake when using TF-IDF features and logistic regression.
3) The study aims to address the growing issue of fake news proliferation online by developing a computational method for identifying unreliable news sources.
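A minimal sketch of the described pipeline, TF-IDF features into logistic regression; the example articles and labels are placeholders.

```python
# Minimal sketch: TF-IDF features feeding a logistic regression classifier,
# as described in the study. The example articles are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

articles = ["president signs bill into law", "aliens endorse candidate",
            "court rules on appeal", "miracle pill melts fat overnight"]
labels = [0, 1, 0, 1]  # 0 = real, 1 = fake

clf = make_pipeline(TfidfVectorizer(stop_words="english"),
                    LogisticRegression())
clf.fit(articles, labels)
print(clf.predict(["senate passes new budget law"]))
```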
Recently, fake news has been causing many problems for our society, and many researchers have been working on identifying it. Most fake news detection systems rely on linguistic features of the news, but they have difficulty sensing highly ambiguous fake news that can be detected only after identifying its meaning and the latest related information. In this paper, to resolve this problem, we present a new Korean fake news detection system using a fact DB that is built and updated by humans' direct judgement after collecting obvious facts. Our system receives a proposition and searches for semantically related articles in the fact DB in order to verify whether the given proposition is true by comparing it with the related articles. To achieve this, we utilize a deep learning model, Bidirectional Multi-Perspective Matching for Natural Language Sentences (BiMPM), which has demonstrated good performance on the sentence matching task. However, BiMPM has some limitations: the longer the input sentence, the lower its performance, and it has difficulty making an accurate judgement when an unlearned word or an unlearned relation between words appears. To overcome these limitations, we propose a new matching technique that exploits article abstraction as well as an entity matching set in addition to BiMPM. In our experiments, we show that our system improves overall fake news detection performance. Prasanth. K, Praveen. N, Vijay. S, Auxilia Osvin Nancy. V, "Fake News Detection using Machine Learning", published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-4, Issue-2, February 2020.
URL: https://www.ijtsrd.com/papers/ijtsrd30014.pdf
Paper Url : https://www.ijtsrd.com/engineering/information-technology/30014/fake-news-detection-using-machine-learning/prasanth-k
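As a simple stand-in for the "search semantically related articles from the fact DB" step, the sketch below ranks fact-DB entries by TF-IDF cosine similarity; this is only a retrieval proxy and does not reproduce the BiMPM matching model.

```python
# Simple retrieval proxy for finding fact-DB articles related to a
# proposition, using TF-IDF cosine similarity. The actual system uses the
# BiMPM matching model, which this sketch does not reproduce.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

fact_db = ["The city opened a new subway line in 2019.",
           "The national team won the championship last year.",
           "The ministry raised the minimum wage in January."]
proposition = "A new subway line opened in the city."

vec = TfidfVectorizer()
M = vec.fit_transform(fact_db + [proposition])
sims = cosine_similarity(M[-1], M[:-1]).ravel()
print("most related fact:", fact_db[sims.argmax()], f"(sim={sims.max():.2f})")
```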
In the age of social media communication, it is easy to modulate the minds of users and, in some cases, instigate violent actions. There is a need for a system that can analyze the threat level of tweets from influential users and rank their Twitter handles, so that dangerous tweets can be prevented from going public before fact-checking, which could hurt people's sentiments and take the shape of violence. The study aims to analyze and rank Twitter users according to their influential power and the extremism of their tweets, to help prevent major protests and violent events. We scraped top trending topics and fetched tweets using those hashtags. We propose a custom ranking algorithm that considers source-based and content-based features along with a knowledge graph, generates a score, and ranks Twitter users according to those scores. Our aim with this study is to identify and rank extremist Twitter users with regard to their impact and influence, using a technique that takes into consideration both source-based and content-based features of tweets to generate a ranking of the extremist Twitter users having a high impact factor.
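An illustrative sketch of combining source-based and content-based features into a single per-user score; the feature names, weights, and scaling are assumptions, not the paper's actual ranking algorithm or knowledge-graph component.

```python
# Illustrative sketch of scoring users from source-based and content-based
# features. Feature names and weights are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class UserFeatures:
    followers: int          # source-based: reach
    verified: bool          # source-based: credibility signal
    extremism_score: float  # content-based: e.g. classifier output in [0, 1]
    retweet_rate: float     # content-based: virality of the user's tweets

def threat_score(u: UserFeatures) -> float:
    influence = min(u.followers / 1_000_000, 1.0) + (0.2 if u.verified else 0.0)
    return 0.6 * u.extremism_score + 0.3 * u.retweet_rate + 0.1 * influence

users = {
    "user_a": UserFeatures(250_000, True, 0.9, 0.7),
    "user_b": UserFeatures(5_000, False, 0.4, 0.2),
}
ranking = sorted(users, key=lambda name: threat_score(users[name]), reverse=True)
print(ranking)
```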
Who's in the Gang? Revealing Coordinating Communities in Social Media (Derek Weber)
Political astroturfing and organised trolling are online malicious behaviours with significant real-world effects. Common approaches examining these phenomena focus on broad campaigns rather than the small groups responsible. To reveal networks of cooperating accounts, we propose a novel temporal window approach that relies on account interactions and metadata alone. It detects groups of accounts engaging in behaviours that, in concert, execute different goal-based strategies, which we describe. Our approach is validated against two relevant datasets with ground truth data. See https://github.com/weberdc/find_hccs for code and data.
Presented at ASONAM'20 (2020 IEEE/ACM International Conference on Advances in Social Network Analysis and Mining).
Co-authored with Frank Neumann (University of Adelaide)
IRJET - Fake News Detection using Machine Learning (IRJET Journal)
This document presents a machine learning approach for detecting fake news. It discusses existing fake news detection methods and their limitations. The proposed system uses natural language processing and machine learning techniques like TF-IDF vectorization, naive Bayes classification and XGBoost to build a model that classifies news articles as real or fake. It extracts linguistic features from news content and social context to train models that can identify fake news with greater accuracy than existing approaches. The system is intended to help reduce the spread of misinformation on social media platforms.
This document presents research on analyzing sentiment and affect in dark web forums related to radical groups. It aims to determine how effective automated methods are at measuring opinion polarity and emotion intensity in these forums. The researchers collected data from two forums, the more radical Al-Firdaws forum and the more moderate Montada forum. They used machine learning techniques like SVR ensembles and feature selection to analyze 500 sentences from each forum for sentiment and for the intensities of emotions like violence and hate. The results found that the Al-Firdaws forum expressed more negative sentiment and more intense negative emotions, confirming domain expert assessments. A time series analysis also examined how forum affect changed over time.
This document summarizes research on detecting fake news using text analysis techniques. It discusses how social media consumption of news has increased and the challenges of identifying trustworthy sources. Various types of fake news are described based on visual/text content or the targeted audience. Methods for detection include clustering similar news reports and using predictive models to analyze linguistic features like punctuation, semantic levels, and readability. The proposed approach uses text summarization, web crawling to find related articles, latent semantic analysis to compare articles, and fuzzy logic to determine the authenticity score of a target news article. The goal is to develop a system to help users identify fake news on social media platforms.
POLITICAL OPINION ANALYSIS IN SOCIAL NETWORKS: CASE OF TWITTER AND FACEBOOK (dannyijwest)
The 21st century has been characterized by increased attention to social networks; nowadays, going 24 hours without getting in touch with them in some way has become difficult. Social platforms such as Facebook and Twitter are now part of everyday life. These social networks have thus become important sources for staying aware of frequently discussed topics and public opinion on current issues, as more and more people write messages about current events, give their opinions on any topic, and discuss social issues.
DYNAMIC LARGE SCALE DATA ON TWITTER USING SENTIMENT ANALYSIS AND TOPIC MODELING (Andry Alamsyah)
1. The document presents a case study analyzing tweets about Uber using sentiment analysis and topic modeling to understand public opinion from large-scale social media data.
2. Sentiment analysis classified tweets as positive, negative, or neutral, while topic modeling identified dominant topics of discussion, like promotions or driver complaints.
3. The analyses found that positive tweets often discussed promotions while negative tweets addressed issues like sexual harassment allegations or unsatisfactory drivers.
This work addresses the challenge of hate speech detection in Internet memes and attempts to use visual information to automatically detect hate speech, unlike any previous work to our knowledge. Memes are pixel-based multimedia documents that contain photos or illustrations together with phrases which, when combined, usually adopt a funny meaning. However, hate memes are also used to spread hate through social networks, so their automatic detection would help reduce their harmful societal impact. In our experiments, we built a dataset of 5,020 memes to train and evaluate a multi-layer perceptron over the visual and language representations, whether independent or fused. Our results indicate that the model can learn to detect some of the memes, but that the task is far from being solved with this simple architecture. While previous work focuses on linguistic hate speech, our experiments indicate that the visual modality can be much more informative for hate speech detection in memes than the linguistic one.
https://github.com/imatge-upc/hate-speech-detection
Hybrid sentiment and network analysis of social opinion polarization (icoict, Andry Alamsyah)
The rapid growth of social media and user-generated content (UGC) has provided a rich source of potentially relevant data; the problem is how to summarize those data and transform them into information. Twitter, one of the most popular social networking and micro-blogging services, can be analyzed in terms of the content produced using sentiment analysis. On the other hand, several types of networks can also be constructed to analyze the social network structure and its properties. This research combines these content and structural approaches into a hybrid approach for identifying social opinion polarization in the form of a conversation network: sentiment analysis is used to determine public sentiment, and social network analysis is used to analyze the structure of the network, detecting communities and influential actors. Using this hybrid approach, we gain a comprehensive understanding of social opinion polarization. As a case study, we present real social opinion polarization about the reclamation issue in Indonesia.
INTELLIGENT SOCIAL NETWORKS MODEL BASED ON SEMANTIC TAG RANKING (dannyijwest)
Social networks have become one of the most popular platforms for users to communicate and share their interests without being in the same geographical location. The great and rapid growth of social media sites such as Facebook, LinkedIn, and Twitter produces a huge amount of user-generated content. Thus, improving information quality and integrity becomes a great challenge for all social media sites, allowing users to get the desired content or be linked to the best relation using improved search/link techniques, and introducing semantics to social networks widens their representation. In this paper, a new model of social networks based on semantic tag ranking is introduced, built on the concept of multi-agent systems. In the proposed model, the representation of social links is extended by the semantic relationships found in the vocabularies known as tags in most social networks. The proposed social media engine is based on enhanced Latent Dirichlet Allocation (E-LDA) as a semantic indexing algorithm, combined with Tag Rank as the social network ranking algorithm. The E-LDA phase improves on LDA by using optimal parameters, and a filter is introduced to enhance the final indexing output. In the ranking phase, using Tag Rank based on the indexing phase improves the ranking output. Simulation results of the proposed model show improvements in indexing and ranking output.
Finding Pattern in Dynamic Network Analysis (Andry Alamsyah)
1) The document analyzes social network properties like nodes, edges, average degree, diameter, and average path length for different companies on Twitter over time (these metrics are computed in the sketch after this list).
2) It finds that network properties generally indicate more user interactions and information sharing on weekdays compared to weekends.
3) However, the diameter and average path length are often lowest on weekends, suggesting information spreads more quickly at those times due to the network structure.
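A minimal networkx sketch computing the properties listed above on a toy interaction graph; real input would be mention or retweet edges collected per day.

```python
# Minimal networkx sketch of the listed network properties on a toy graph.
# Real input would be per-day mention/retweet edges for each company.
import networkx as nx

G = nx.Graph()
G.add_edges_from([("a", "b"), ("b", "c"), ("c", "d"), ("a", "c"), ("d", "e")])

print("nodes:", G.number_of_nodes())
print("edges:", G.number_of_edges())
print("average degree:", 2 * G.number_of_edges() / G.number_of_nodes())
print("diameter:", nx.diameter(G))  # requires a connected graph
print("average path length:", nx.average_shortest_path_length(G))
```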
A data mining tool for the detection of suicide in social networks (Yassine Bensaoucha)
This document describes a dissertation that developed a program to detect suicidal tendencies in users on Twitter through data mining and text classification techniques. The program first collects and preprocesses tweets, then classifies them using naive Bayes classifiers into three categories: positive, negative, and suicidal. It analyzes the results to determine if a given user has suicidal tendencies based on the percentage of tweets classified in each category. While initial results were promising, future work could compare this approach to other classifiers and potentially combine it with decision tree classification.
Twitter Based Election Prediction and Analysis (IRJET Journal)
This document discusses using Twitter data to predict election outcomes through sentiment analysis. It begins with an introduction to election prediction methods and why social media data is being explored as an alternative. The paper then reviews related work on using features like user profiles, linguistic content, and sentiment analysis of tweets mentioning candidates. It describes the methodology used, including data collection from Twitter's API, preprocessing tweets, and performing sentiment analysis using both machine learning and lexicon-based approaches. The results section shows the sentiment analysis identified more positive tweets for Clinton and more negative tweets for Trump, suggesting Clinton would win. Emotion analysis found more tweets expressing sadness for Clinton and joy for Trump.
A RELIABLE ARTIFICIAL INTELLIGENCE MODEL FOR FALSE NEWS DETECTION MADE BY PUB... (caijjournal)
Quick access to information on social media networks, as well as its exponential rise, has made it difficult to distinguish between fake and real information. Fast dissemination through sharing has enhanced its falsification exponentially, and avoiding the spread of fake information is important for the credibility of social media networks. It is therefore an emerging research challenge to automatically check information for misstatement through its source, content, or publisher, and to prevent unauthenticated sources from spreading rumours. This paper demonstrates an artificial intelligence based approach for identifying false statements made by social network entities. Two variants of deep neural networks are applied to evaluate datasets and analyse them for the presence of fake news. The implementation produced up to 99% classification accuracy when the dataset was tested for binary (true or false) labeling over multiple epochs.
Groundhog day: near duplicate detection on Twitter (Dan Nguyen)
This document presents a framework for detecting near-duplicate tweets on Twitter. The framework analyzes tweet pairs using three approaches: (1) comparing syntactic characteristics like word overlap, (2) measuring semantic similarity, and (3) analyzing contextual information. Machine learning is used to learn patterns that help identify duplicate tweets. The framework is integrated into a Twitter search engine called Twinder to diversify search results and improve search quality. Extensive experiments evaluate strategies for detecting duplicate tweets and analyzing features that impact detection. The results show semantic features can boost duplicate detection performance.
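A minimal sketch of the first, syntactic strategy: word-overlap (Jaccard) similarity between a tweet pair, with an illustrative threshold; the semantic and contextual features are not reproduced here.

```python
# Minimal sketch of the syntactic word-overlap approach: Jaccard similarity
# between the word sets of two tweets. The 0.6 threshold is illustrative.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

t1 = "Breaking storm hits the coast tonight"
t2 = "breaking storm hits the coast again tonight"
score = jaccard(t1, t2)
print(score, "near-duplicate" if score > 0.6 else "distinct")
```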
1. The document presents a study on automated detection of fake news from social media.
2. It proposes using account features such as the number of hashtags, URLs, mentions, and followers, together with a random forest classifier, to detect fake news with 99.8% precision.
3. The random forest approach is compared to other classifiers like decision trees, Naive Bayes, and neural networks, which achieve lower precisions of 98.4%, 92.6%, and 62.7% respectively (a comparison sketch follows this list).
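A hedged sketch of such a classifier comparison on account-level features, using synthetic data as a stand-in for the paper's features; the printed precisions will not match the reported 99.8%/98.4%/92.6%/62.7%.

```python
# Sketch comparing the classifiers mentioned above on account-level features;
# the synthetic data stands in for counts of hashtags, URLs, mentions, etc.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=1000, n_features=6, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

models = {
    "random forest": RandomForestClassifier(random_state=42),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "naive bayes": GaussianNB(),
    "neural network": MLPClassifier(max_iter=500, random_state=42),
}
for name, model in models.items():
    preds = model.fit(X_tr, y_tr).predict(X_te)
    print(f"{name}: precision = {precision_score(y_te, preds):.3f}")
```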
How does fake news spread? Understanding pathways of disinformation spread thro... (Araz Taeihagh)
What are the pathways for spreading disinformation on social media platforms? This article addresses this question by collecting, categorising, and situating an extensive body of research on how application programming interfaces (APIs) provided by social media platforms facilitate the spread of disinformation. We first examine the landscape of official social media APIs, then perform quantitative research on the open-source code repositories GitHub and GitLab to understand the usage patterns of these APIs. By inspecting the code repositories, we classify developers' usage of the APIs as official or unofficial, and develop a four-stage framework characterising pathways for spreading disinformation on social media platforms. We further highlight how the stages in the framework were activated during the 2016 US Presidential Elections, before providing policy recommendations on access to APIs, algorithmic content, and advertisements, and suggesting rapid response to coordinated campaigns, the development of collaborative and participatory approaches, and government stewardship in the regulation of social media platforms.
Fake News Detection Using Machine Learning (IRJET Journal)
This document proposes a machine learning approach for detecting fake news using support vector machines. It discusses preprocessing news data using techniques like TF-IDF, extracting features related to text, date, source, and author, and training a support vector machine classifier on the preprocessed data. The proposed system architecture involves preprocessing, training a model on the training data, validating it on test data, adjusting parameters to improve accuracy, and then using the model to classify new unlabeled news. Prior research that used techniques like n-gram analysis, naive Bayes classifiers, and linear support vector machines for fake news detection is also reviewed. The conclusion is that the proposed approach using support vector machines can help identify fake news effectively.
Analyzing sentiment dynamics from sparse text coronavirus disease-19 vaccina... (IJECEIAES)
Social media platforms enable people to exchange their thoughts, reactions, and emotions regarding all aspects of their lives, so sentiment analysis using textual data is a widely practiced field. Due to the large amount of textual content available on social media, sentiment analysis is usually considered a text classification task, and the high feature dimension is an important issue that needs to be resolved by examining text meaningfully. The proposed study considers a case study of coronavirus (COVID) vaccination to draw conclusions about public opinion on prospects for vaccination. A corpus of tweets published between December 12, 2020, and July 13, 2021 is considered. The proposed model is developed as a phase-by-phase data analysis process, followed by an assessment of important information about the collected tweets on the coronavirus disease (COVID-19) vaccine using two sentiment analyzer methods and probabilistic models for validation and knowledge analysis. The results indicated that public sentiment is more positive than negative. The study also presented statistics on vaccination progress in the top countries from early 2021 to July 2021. The scope of the study is broad with regard to sentiment analysis based on keyword and document modeling, and the proposed work offers an effective mechanism for a decision-making system to understand public opinion and thereby assist policymakers in health measures and vaccination campaigns.
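A minimal sketch of lexicon-based sentiment scoring on vaccine tweets using VADER; the summary does not name the paper's two analyzers, so VADER here is an assumption.

```python
# Minimal sketch of lexicon-based sentiment scoring on vaccine tweets.
# VADER is one common analyzer, assumed here since the summary does not
# name the paper's two methods. Tweets are illustrative.
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

tweets = ["Got my COVID vaccine today, feeling hopeful!",
          "Worried about vaccine side effects."]
for t in tweets:
    scores = sia.polarity_scores(t)
    label = "positive" if scores["compound"] > 0 else "negative"
    print(label, scores["compound"], t)
```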
IRJET- Identification of Prevalent News from Twitter and Traditional Media us...IRJET Journal
This document describes a study that uses community detection models to identify prevalent news topics discussed on both Twitter and traditional media like BBC. It collects tweets and news articles about sports over a one-month period. Keywords are extracted from the data and a graph is constructed to represent relationships between words. Three community detection models - Girvan-Newman clustering, CLIQUE, and Louvain - are used to cluster similar content and detect communities of keywords representing news topics. The number of unique Twitter users engaged with each topic is also calculated to rank topics by user attention. The goal is to analyze how information is distributed between social and traditional media and identify emerging topics with low coverage in traditional sources.
This document summarizes several research papers that used social network analysis on Twitter data related to COVID-19. The papers analyzed hashtags, retweets, mentions and conversations to understand public debates and information spread about topics like conspiracy theories, medical news, and public responses in different countries. The studies identified influential users, common discussions, and how social media could provide insights into managing pandemic situations.
Detection of Fake News Using Machine LearningIRJET Journal
This document summarizes a literature review on detecting fake news using machine learning. It discusses how machine learning classifiers can be trained to automatically identify fake news. Specifically, it addresses three research questions: 1) Why machine learning is needed to detect fake news, 2) Which machine learning classifiers can be used, and 3) How the classifiers are trained. It finds that common classifiers like support vector machines, naive Bayes, logistic regression, random forests, and recurrent neural networks have been effectively used to detect fake news by analyzing content. Accuracy of the classifiers depends on how well they are trained on labeled datasets.
PANDEMIC INFORMATION DISSEMINATION WEB APPLICATION: A MANUAL DESIGN FOR EVERYONEijcsitcejournal
The aim of this research is to generate a web application from a previously unpublished methodology, with a series of instructions indicating the coding in a flow diagram. The primary purpose of this methodology is to aid non-profits in disseminating information regarding the COVID-19 pandemic, so that users can share vital and up-to-date information. This is a functional design, and a series of screenshots demonstrating its behaviour is presented below. This unique design arose from the necessity to create a web application for an information dissemination platform; it also addresses an audience that does not have programming knowledge. This document uses the scientific method in its writing. The authors understand that there is a similar design in the bibliography; therefore, the differences between the designs are described herein. It is very important to point out that this proposal can be taken as an alternative to the design of any web application.
INCREASING THE INVESTMENT’S OPPORTUNITIES IN KINGDOM OF SAUDI ARABIA BY STUDY...ijcsit
Social networking sites are a significant source of information about the behavior of users and what occupies society across all ages, and accordingly helpful information can be provided to specialists and decision-makers. According to official sources, 98.43% of Saudi youth use social networking sites. The study and analysis of social media data are done to provide the necessary information to increase investment opportunities within the Kingdom of Saudi Arabia, by studying and analyzing what occupies people on the communication sites through their tweets about the labor market and investment. Given the huge volume of data and its randomness, the data will be surveyed and collected through keywords, prioritized and arranged, and recorded as (positive - negative - mixed). The study analysis and conclusions will be based on data mining and its techniques of analysis and deduction.
An Assessment of Sentiment Analysis of Covid 19 Tweetsijtsrd
Various rumors and assumptions have circulated about the COVID-19 immunization, making it a heated subject of discussion in India. This prompted a reaction from the country's populace, who posted favorable, negative, and neutral evaluations in tweets and retweets on Twitter. These tweets are a jumble of unstructured data. The goal of this study is to have the statistics quantify the sentiment implied by them, taking advantage of Twitter's massive data pool to extract insights and the implications that can be drawn from it. Comprehensive research on people's feelings may help us arrive at a fair understanding of the population at large's point of view toward preventing disease by vaccination. The dataset of vaccination-related tweets considered for the study spans 2020 to 2021 and comprises a data mining of 16,05,152 tweets related to vaccination. Ms. Tanzeela Qureshi | Dr. Mohit Singh Tomar | Dr. Ritu Shrivastava "An Assessment of Sentiment Analysis of Covid-19 Tweets" Published in International Journal of Trend in Scientific Research and Development (ijtsrd), ISSN: 2456-6470, Volume-7 | Issue-5, October 2023, URL: https://www.ijtsrd.com/papers/ijtsrd59976.pdf Paper Url: https://www.ijtsrd.com/other-scientific-research-area/other/59976/an-assessment-of-sentiment-analysis-of-covid19-tweets/ms-tanzeela-qureshi
COVID Sentiment Analysis of Social Media Data Using Enhanced Stacked EnsembleIRJET Journal
The document presents a study that analyzes sentiment in tweets related to COVID-19 using an enhanced stacked ensemble model. It first reviews previous research on sentiment analysis of social media data during the pandemic. It then describes collecting a dataset of over 40,000 COVID-19 tweets from March-April 2020. Various classification algorithms are tested to predict sentiment, including naive Bayes, random forest, XGBoost, and an enhanced stacked ensemble. The enhanced stacked ensemble achieved the highest test accuracy of 86% according to the experimental results presented.
COMBINING MACHINE LEARNING AND SEMANTIC ANALYSIS FOR EFFICIENT MISINFORMATION...ijcsit
With the spread of social media platforms and the proliferation of misleading news, misinformation detection within microblogging platforms has become a real challenge. During the Covid-19 pandemic, much fake news and many rumors were broadcast and shared daily on social media. In order to filter out this fake news, much work has been done on misinformation detection using machine learning and sentiment analysis in the English language. However, misinformation detection research for the Arabic language on social media is limited. This paper introduces a misinformation verification system for Arabic COVID-19 related news using an Arabic rumors dataset on Twitter. We explored the dataset and prepared it using multiple phases of preprocessing techniques before applying different machine learning classification algorithms combined with a semantic analysis method. The model was applied on 3.6k annotated tweets, achieving a best overall accuracy of 93% in detecting misinformation. We further built another dataset of Covid-19 related claims in Arabic to examine how our model performs with this new set of claims. Results show that the combination of machine learning techniques and linguistic analysis achieves the best scores, reaching a best accuracy of 92% in detecting the veracity of sentences in the new dataset.
IRJET- Sentiment Analysis using Machine LearningIRJET Journal
This document discusses sentiment analysis of social media posts using machine learning. The authors aim to classify social media posts as having either a political or non-political sentiment. A dictionary of keywords and their sentiments is created to analyze posts. Users can make posts that are then classified, and an admin can hide or delete posts containing harmful keywords. Graphs are generated to analyze the classification of posts as political versus non-political and by sentiment. The accuracy of classification depends on the training data and dictionary.
Event detection in twitter using text and image fusioncsandit
In this paper, we describe an accurate and effective event detection method to detect events from the Twitter stream. It detects events using visual information as well as textual information to improve the performance of the mining. It monitors the Twitter stream to pick up tweets having text and photos and stores them in a database. Then it applies a mining algorithm to detect the event. Firstly, it detects the event based on text only by using bag-of-words features calculated with the term frequency-inverse document frequency (TF-IDF) method. Secondly, it detects the event based on the image only by using visual features including histogram of oriented gradients (HOG) descriptors, the grey-level co-occurrence matrix (GLCM), and the color histogram. k-nearest neighbours (k-NN) classification is used in the detection. Finally, the final decision of the event detection is made based on the reliabilities of text-only detection and image-only detection. The experimental results showed that the proposed method achieved a high accuracy of 0.93, compared with 0.89 with text only and 0.86 with images only.
CATEGORIZING 2019-N-COV TWITTER HASHTAG DATA BY CLUSTERING
International Journal of Artificial Intelligence and Applications (IJAIA), Vol.11, No.4, July 2020
DOI: 10.5121/ijaia.2020.11404
Koffka Khan1 and Emilie Ramsahai2
1 Department of Computing and Information Technology, The University of the West Indies, St. Augustine Campus, Trinidad
2 UWI School of Business & Applied Studies Ltd (UWI-ROYTEC), 136-138 Henry Street, 24105 Port of Spain, Trinidad and Tobago
ABSTRACT
Unsupervised machine learning techniques such as clustering are widely gaining use with the recent
increase in social communication platforms like Twitter and Facebook. Clustering enables the finding of
patterns in these unstructured datasets. We collected tweets matching hashtags linked to COVID-19 from a
Kaggle dataset. We compared the performance of nine clustering algorithms using this dataset. We
evaluated the generalizability of these algorithms using a supervised learning model. Finally, using a
selected unsupervised learning algorithm we categorized the clusters. The top five categories are Safety,
Crime, Products, Countries and Health. This can prove helpful for bodies using large amounts of Twitter data needing to quickly find key points in the data before going into further classification.
KEYWORDS
Unsupervised machine learning, clustering, Twitter, 2019-nCoV, hashtags, Kaggle, supervised,
classification
1. INTRODUCTION
Twitter [43] is a micro-blogging service which has millions of users from around the world. It
enables users to post and exchange 140-character-long messages, also called tweets. Using a wide array of Web-based services, tweets can be published via e-mail, via SMS text message, or directly from smartphones. Twitter thus enables the dissemination of information
to a wide number of people in real time. This makes it an ideal environment for disseminating
breaking news directly from the source of news and/or event location.
People place the hashtag [26] symbol (#) before a related keyword or expression in their message to categorize Tweets and make them appear more readily in Twitter search. Clicking or tapping on a hashtagged word will show you other Tweets that carry the hashtag. Hashtags can be included anywhere in a Tweet. Thus, a hashtag is used on Twitter to index keywords or topics. Hashtagged words that become very popular are often trending topics.
Unsupervised learning algorithms [7] [20] have seen a recent spurt in usage with increasing advances in computing technology. Unsupervised learning is a sub-field of machine learning. In unsupervised learning the machine simply receives inputs but does not obtain supervised target outputs or rewards
from its environment. It is possible to establish a formal structure for unsupervised learning based
on the notion that the purpose of the machine is to construct representations of the inputs that can
be used for decision making, to anticipate potential inputs and to communicate inputs effectively
to another system. In a way, unsupervised learning can be seen as identifying correlations over
and above what should be called mere unstructured noise in the results. Clustering [22] and
dimensionality reduction are two very basic textbook examples of unsupervised learning.
Twitter has been used to monitor patterns and distribute health knowledge over the course of
virus epidemics. The recent 2019-nCoV [41] is no exception. Researchers in [17] used Twitter
and web news mining to predict COVID-19 outbreak. To quantify and understand early changes
in Twitter activity, content, and sentiment [28], [33], [36] about the COVID-19 epidemic used a
large volume of Twitter data [32]. To understand Twitter users' discussions and reactions about
the COVID-19, researchers in [46] used various machine learning techniques. [1] aimed to
identify the main topics posted by Twitter users related to the COVID-19 pandemic. Research offers insights based on theory to help explain and predict these behaviors and associated outcomes in order to inform future research and marketing practice using social media data including Twitter [21]. Researchers [16] call for collaboration amidst the growing mountain of daily data across PubMed, Twitter, Google Scholar and the World Health Organization.
We collected tweets matching hashtags linked to COVID-19 from a Kaggle dataset [42]. We
compared the performance of nine clustering algorithms using this dataset. We evaluated the
generalizability of these algorithms using a supervised learning model. Finally, using a selected
unsupervised learning algorithm we categorized the clusters. This can prove helpful for bodies using large amounts of Twitter data needing to quickly find key points in the data before going into further classification [3].
In section two we investigate previous work related to our study. Then we present a brief
introduction to the nine unsupervised machine learning clustering models used to categorize
2019-nCoV hashtags in section three. In section four the method used to categorize 2019-nCoV
hashtags is given. We show the results in section five. As part of our results we compare the
performance of these methods as we believe it would better guide further research work in
developing clustering techniques to combat future pandemics. In addition, our results can aid disaster relief bodies to quickly sift through huge amounts of Twitter data to accurately capture meaningful categories ‘on-the-fly.’ Finally, we give our conclusion in section six.
2. RELATED WORK
Twitter data has been used extensively since the start of the 2019-nCoV pandemic, for example
in predicting the onset [18] and tracking social media discourse [9]. Another area of research
impacted by the use of Twitter data is sentiment analysis [6]. Research using sentiment analysis
on tweets indicated that while the majority of people throughout the world are taking a positive and hopeful approach, there are instances of fear, sadness and disgust exhibited worldwide [11].
Also, in [31] 126,049 tweets from 53,196 unique users were evaluated. Researchers note that the
hourly number of COVID-19-related tweets starkly increased from January 21, 2020 onward and
that nearly half (49.5%) of all tweets expressed fear and nearly 30% expressed surprise.
Researchers [45] showed that trust for the authorities remained a prevalent emotion in analysed Twitter data consisting of about 1.8 million Tweet messages related to coronavirus collected from January 20th to March 7th, 2020. A total of 11 salient topics are identified and then
categorized into 10 themes, such as "cases outside China (worldwide)," "COVID-19 outbreak in
South Korea," "early signs of the outbreak in New York," "Diamond Princess cruise," "economic
impact," "Preventive/Protective measures," "authorities," and "supply chain" using unsupervised
machine learning techniques. Researchers [35] collected tweets from the users who shared their
location as ‘Nepal’ between 21st May 2020 and 31st May 2020. They observed that while the majority of the people of Nepal are taking a positive and hopeful approach, there are instances of
fear, sadness and disgust exhibited too. Researchers [24] performed a Twitter search using 14
separate common hashtags and keywords associated with the COVID-19 outbreak. Then, in
comparison to checked and peer-reviewed tools they analyzed and tested individual tweets for
misinformation. Finally, they utilized descriptive statistics to compare words and hashtags, and
describe the characteristics of individual tweets and accounts. They found that the keyword
“COVID-19” had the lowest rate of misinformation and unverifiable information, while the
keywords “#2019_ncov” and “Corona” were associated with the highest amount of
misinformation and unverifiable content respectively. Researchers found that two-thirds (66.1%)
of the Instagram users use "COVID-19", and "coronavirus" hashtags to disperse the information
related to COVID-19 [38]. Other work in [12] investigated the temporal tweeting dynamic and
the Twitter users involved in the online discussions around COVID-19-related research. They
observed that over the course of time a shift in the direction of the Twitter discussions can be noted, from initial exposure to virological and scientific content to more practical issues such as new therapies, policy countermeasures, welfare measures, and more recent effects on the economy and culture. Though many researchers attempt to categorize the actual Twitter data, we did not find any work on finding patterns in the hashtags to elicit an initial overview of the existing tweets, which can then be classified further if desired. Our approach does this in an attempt to overcome the huge sizes of these social media datasets.
3. CLUSTERING MODELS
Clustering generally uses iterative techniques to group cases into clusters in a dataset which
contain similar characteristics. These groupings are useful to explore data, identify anomalies in
the data, and ultimately make predictions. Models of clustering can also help you identify
relationships in a dataset that you may not logically derive from browsing or simply observing.
For these reasons, clustering is often used to explore the data in the early stages of machine
learning tasks, and to discover unexpected correlations.
3.1. K-means
The K-means algorithm [25] starts with an initial set of randomly selected centroids, which serve as starting points for each cluster when processing the training data, and applies Lloyd's algorithm to iteratively refine the centroid locations. Cluster assignment is effected by calculating the distance between each new case and the cluster centroids. Every new case is allocated to the cluster with the nearest centroid.
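As an illustrative sketch (not the authors' code), this assignment step can be reproduced with scikit-learn's KMeans; the random feature matrix below is a stand-in for the vectorized hashtag data described in section 4:

# Minimal K-means sketch: Lloyd's algorithm refines randomly initialized
# centroids, and each case is assigned to the cluster with the nearest centroid.
import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(100, 20)        # stand-in for vectorized hashtag data
km = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = km.fit_predict(X)         # cluster index of the nearest centroid
print(km.cluster_centers_.shape)   # (5, 20): one centroid per cluster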
3.2. DBSCAN
Density-based spatial clustering of applications with noise (DBSCAN) [5] is a non-parametric
density-based clustering algorithm: given a set of points in some space, it aggregates points that
are closely packed together (points with many nearby neighbours), marking outliers that lie in
low-density regions (whose closest neighbors are too far away).
3.3. Spectral clustering
Spectral clustering (SC) models [44] use the spectrum (eigenvalues) of the data similarity matrix to reduce dimensionality before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative evaluation of the relative similarity of each pair of points within the dataset.
3.4. Agglomerative clustering
Agglomerative clustering (AC) [14] is a "bottom-up" approach: each observation starts in its own
cluster, merging pairs of clusters as one moves the hierarchy upwards. A measure of the
dissimilarity between sets of observations is necessary in order to determine which clusters
should be combined. As a function of the pairwise distances between observations, a linkage
criterion determines the distance between sets of observations.
3.5. Gaussian mixtures
Gaussian mixture (GM) models [29] are probabilistic models for describing normally distributed subpopulations within an overall population. A Gaussian mixture model is parameterized by two types of quantities: the mixture component weights, and the component means and variances/covariances.
3.6. BIRCH
The Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) [47] algorithm takes as input a set of data points, represented as real-valued vectors, and a desired number of clusters. Due to its ability to find a good clustering solution with only one scan of the data, BIRCH is
particularly suitable for very large data sets or for streaming data.
3.7. Mini Batch
The key principle of the Mini Batch K-means algorithm [34] is to use small random batches of fixed size, so that they can be stored in memory. Each iteration obtains and uses a new
random sample from the dataset to update the clusters, and this is repeated until convergence.
3.8. Mean-Shift
Mean-Shift [2] is a procedure to locate the maxima of a density function given discrete data sampled from that function. It works by placing a kernel at each point in the data set. Mean shift then considers what the points would do if they all climbed uphill to the nearest peak of the kernel density estimate (KDE) surface.
3.9. OPTICS
Ordering points to identify the clustering structure (OPTICS) [4] is an algorithm for the
identification of density-based clusters in spatial data. OPTICS outputs the points in a particular order, annotated with their smallest reachability distance.
4. METHODS
We used the Python version 3.8 programming language to run experiments. Text data requires special preparation before modeling can begin. First the text must be parsed to extract words, a step called tokenization. Then the words must be encoded as integer or floating-point values for use as input to a machine learning algorithm, a step called feature extraction (or vectorization). We use the scikit-learn library to perform both tokenization and feature extraction on the hashtag data.
One difficulty with simple counts is that certain words like "the" will occur several times, and
their large counts in the encoded vectors will not be very important. An alternative is to calculate
the word frequencies, and TF-IDF (Term Frequency–Inverse Document Frequency) [15] is by far the most popular method. Term frequency describes how often a given word appears inside a document. Inverse document frequency scales down words that occur a great deal across documents. TF-IDF scores therefore attempt to highlight the more interesting words, e.g. those frequent in one document but not across documents. The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. A TfidfTransformer computes the inverse document frequencies and performs the encoding. The scores are normalized to values between 0 and 1 and, as with most machine learning algorithms, the encoded document vectors can then be used directly.
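As a minimal sketch of this encoding step, assuming the hashtags have already been read into a list of strings (the toy hashtags here are illustrative, not from the dataset):

# TfidfVectorizer tokenizes documents, learns the vocabulary and IDF weights,
# and encodes each hashtag "document" as a normalized TF-IDF vector.
from sklearn.feature_extraction.text import TfidfVectorizer

hashtags = ["SocialDistancing", "covid19 safety tips", "corona health news"]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(hashtags)     # sparse matrix: documents x vocabulary
print(vectorizer.get_feature_names_out())  # learned vocabulary
print(X.toarray())                         # TF-IDF scores normalized to [0, 1]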
We employ three metrics for evaluating the performance of clustering models: Silhouette [37],
Calinski-Harabasz [30] and Davies-Bouldin [30]. Silhouette refers to a method for interpreting
and validating consistency within data clusters. It is a function of how close an entity is to its own cluster (cohesion) relative to other clusters (separation). The score varies from −1 to +1, where a high value means that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. However, if many points have a low or negative value then there may be too many or too
few clusters in the clustering configuration.
The Calinski-Harabasz score [30] is defined as the ratio of the between-cluster dispersion to the within-cluster dispersion. This procedure ensures that the number of potential splits is effectively reduced. The approach can be generalized to a dichotomous division but is well suited to any number of clusters and to a global division. The Calinski-Harabasz index should be greatest at the optimal number of clusters.
The Davies-Bouldin score [30] is an internal assessment scheme in which the quality of the clustering is evaluated using quantities and characteristics inherent to the dataset. Because it is defined as a function of the ratio of within-cluster scatter to between-cluster separation, a lower value means that the clustering is better. It is the average, taken over all clusters, of the similarity between each cluster and its most similar one. This affirms the idea that no cluster should be similar to another, and thus the best clustering scheme minimizes the Davies-Bouldin index.
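A minimal sketch computing all three metrics with scikit-learn on an arbitrary clustering (random data again stands in for the vectorized hashtags):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

X = np.random.rand(200, 20)                              # stand-in features
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))         # in [-1, +1]; higher is better
print(calinski_harabasz_score(X, labels))  # higher is better
print(davies_bouldin_score(X, labels))     # lower is better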
Gradient descent is a common optimization technique in machine learning and deep learning, and it can be used with most, if not all, learning algorithms. A gradient is the slope of a function: it measures the degree to which one variable changes in response to changes in another. Gradient descent iteratively updates a set of input parameters using the partial derivatives of the objective function with respect to those parameters; the steeper the slope, the greater the gradient. Stochastic gradient descent (SGD) [8] is an iterative method to optimize an objective function with suitable smoothness (e.g. differentiable or sub-differentiable) properties. It can be considered a stochastic approximation of gradient descent optimization, since it substitutes the true gradient (calculated from the entire data set) with an estimate of it (calculated from a randomly chosen subset of the data). For each iteration, a few samples are chosen randomly instead of the entire data set. The sample is shuffled at random and selected to perform the iteration, and the gradient of the cost function for that sample is calculated at each iteration. The Stochastic Gradient Descent (SGD) classifier [19] essentially incorporates a simple SGD learning routine.
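A minimal sketch of how such a classifier can be used to test a clustering (step 7 of the procedure below): the cluster labels are treated as pseudo-ground-truth and an SGD-trained linear model is asked to recover them on held-out data. The data and settings here are assumptions, not the paper's exact configuration:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X = np.random.rand(300, 20)                           # stand-in features
y = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = SGDClassifier(random_state=0).fit(X_tr, y_tr)   # linear model fit by SGD
print(clf.score(X_te, y_te))                          # held-out accuracy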
There are four Outcomes of Binary Classification [39]:
True Positives (TP) - These are the correctly predicted positive values which means that the
value of actual class is positive and the value of predicted class is positive.
True Negatives (TN) - These are the correctly predicted negative values which means that
the value of actual class is negative and value of predicted class is negative.
False Positives (FP) – When actual class is negative and predicted class is positive.
False Negatives (FN) – When actual class is positive but predicted class is negative.
Accuracy is simply the ratio of correctly predicted observations to the total observations (Accuracy = (TP+TN)/(TP+TN+FP+FN)). The number of true positives divided by the number of true positives plus the number of false positives is known as precision (Precision = TP/(TP+FP)). It measures the ability of a classification model to identify only the relevant data points. Recall (Recall = TP/(TP+FN)) is the ability of a classification model to identify all relevant instances. It is the ratio of correctly predicted positive observations to all observations in the actual class. The F1 score is a single metric that combines recall and precision using the harmonic mean (F1 Score = 2*(Recall * Precision)/(Recall + Precision)).
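As a small worked example of these formulas (the confusion counts are made up purely for illustration):

# Hypothetical confusion-matrix counts for a binary classifier.
TP, TN, FP, FN = 40, 45, 5, 10
accuracy  = (TP + TN) / (TP + TN + FP + FN)           # 0.85
precision = TP / (TP + FP)                            # ~0.889
recall    = TP / (TP + FN)                            # 0.80
f1 = 2 * (recall * precision) / (recall + precision)  # ~0.842
print(accuracy, precision, recall, f1)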
The procedure we carried out followed these steps:
1. Read hashtags
2. Convert to string, vectorize and transform hashtags
3. Select clustering model
4. Fit model to hashtag data
5. Predict and obtain labels from specified model
6. Evaluate model using Silhouette, Calinski-Harabasz and Davies-Bouldin metrics
7. Test and train model using a Stochastic Gradient Descent (SGD) Classifier
8. Print the accuracy, precision, recall and F1 score SGD Classification Report
9. Categorize the cluster centroids by obtaining top terms per cluster
The last and very important step in our method is to categorize the clusters by obtaining the top
terms per cluster. We used the TF-IDF vectorizer, thus "features" are the words in a given
hashtag document (and each document is its own vector). Thus, when the document vectors are
clustered, each "feature" of the centroid represents the relevance of that word to it. The "word"
(in vocabulary) is equal to the "feature" (in the vector space) which is equal to the "column" (in
the centroid of the matrix). We get the mapping of column index to the word it represents and
convert each centroid into a sorted (descending) list of the columns most "relevant" (highly
valued) in it, and hence the words most relevant since words are equal to columns. Thus, in
essence we are sorting each centroid in descending order of the features/words most valued in it,
then mapping those columns back to their original words.
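A minimal sketch of this centroid-sorting step; the toy hashtags and the cluster count are stand-ins for the real data:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

hashtags = ["SocialDistancing safety", "StayHome safety", "corona health",
            "covid19 health update", "quarantine safety"]  # toy stand-ins
vec = TfidfVectorizer()
X = vec.fit_transform(hashtags)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

terms = vec.get_feature_names_out()              # column index -> word
order = km.cluster_centers_.argsort()[:, ::-1]   # columns sorted by relevance
for i, row in enumerate(order):
    print("cluster", i, [terms[c] for c in row[:3]])  # top words per centroid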
5. RESULTS
We use a 2019-nCoV dataset from Kaggle [42] which was obtained from Johns Hopkins University [27]. The dataset was made available from January 22nd, 2020 but was accessed on March 30th, 2020. This dataset has day-level details on the number of 2019-nCoV infected cases, deaths and recoveries. The dataset has 1086 cases with nineteen (19) features. Some features were:
Observation Date - Date of the observation in MM/DD/YYYY
Province/State - Province or state of the observation (Could be empty when missing)
Country/Region - Country of observation
Last Update - Time in UTC at which the row is updated for the given province or
country. (Not standardised and so please clean before using it)
Confirmed - Cumulative number of confirmed cases till that date
Deaths - Cumulative number of deaths till that date
Recovered - Cumulative number of recovered cases till that date
The dataset also contained a hashtags.CSV file with 26,833,314 hashtags collected over the period starting on March 13th, 2020 and ending on March 28th, 2020. There were two fields: status_id and the hashtag. An example of a hashtag was ‘SocialDistancing.’ We observed that for a dataset of this size it would take huge amounts of computing power to gather the main points from the full Twitter dataset. Thus, we focused on the hashtags as a starting point, which is still a considerably large sample, and experimented to explore whether this ‘smaller’ sample of the dataset (that is, one not containing the actual tweets) could give insight into a good representation or categorization of the main elements of the tweets.
Figure 1. Elbow method using K-means.
For a range of values of k (here, 1 to 10), the elbow method runs k-means clustering on the dataset and, for each value of k, calculates an average distortion score across all clusters. To determine the optimum number of clusters, we select the value of k at the "elbow", i.e. the point after which the distortion/inertia starts to decrease linearly. We apply the elbow method to the Kaggle dataset using k-means clustering, see Figure 1, and observe that the optimal number of clusters is 5.
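The sketch below shows one way to produce such an elbow plot with scikit-learn and matplotlib, reusing the TF-IDF matrix X from the earlier pipeline sketch.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

ks = range(1, 11)
inertias = [KMeans(n_clusters=k).fit(X).inertia_ for k in ks]

plt.plot(ks, inertias, marker='o')   # look for the "elbow" in the curve
plt.xlabel('Number of clusters k')
plt.ylabel('Inertia (distortion)')
plt.title('Elbow method using k-means')
plt.show()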
Next we evaluate the performance of the nine clustering models with the following defined parameters (one way to instantiate them is sketched after the list):
KMeans: n_clusters=5, max_iter=10000000, n_init=42
DBSCAN: eps=0.078
Spectral Clustering: n_clusters = 5
Agglomerative Clustering: n_clusters = 5
Gaussian Mixture: n_components=3, covariance_type='full'
BIRCH: n_clusters = 5
Mini Batch K-Means: n_clusters = 5
Mean Shift: quantile=0.2, bin_seeding=True
OPTICS: min_samples=5
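One possible scikit-learn rendering of these nine configurations is sketched below; X is the TF-IDF matrix from the earlier sketch, densified because several estimators require dense input, and passing the Mean Shift quantile through estimate_bandwidth is our assumption about how that parameter was used.

from sklearn.cluster import (KMeans, DBSCAN, SpectralClustering,
                             AgglomerativeClustering, Birch, MiniBatchKMeans,
                             MeanShift, OPTICS, estimate_bandwidth)
from sklearn.mixture import GaussianMixture

X_dense = X.toarray()
bandwidth = estimate_bandwidth(X_dense, quantile=0.2)

models = {
    'k-Means':    KMeans(n_clusters=5, max_iter=10000000, n_init=42),
    'DBSCAN':     DBSCAN(eps=0.078),
    'SC':         SpectralClustering(n_clusters=5),
    'AC':         AgglomerativeClustering(n_clusters=5),
    'GM':         GaussianMixture(n_components=3, covariance_type='full'),
    'BIRCH':      Birch(n_clusters=5),
    'Mini Batch': MiniBatchKMeans(n_clusters=5),
    'Mean-Shift': MeanShift(bandwidth=bandwidth, bin_seeding=True),
    'OPTICS':     OPTICS(min_samples=5),
}

for name, m in models.items():
    labels = m.fit_predict(X_dense)  # all nine estimators support fit_predict
    print(name, len(set(labels)))    # number of clusters found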
The results are shown in Table 1. The best Silhouette score is obtained using the Mean-Shift model (0.724). OPTICS (0.530) and DBSCAN (0.526) also had high Silhouette scores, while BIRCH gave the worst (0.108). Gaussian Mixture (738.158) produced the highest Calinski-Harabasz score; Agglomerative Clustering (581.066), k-means (561.455) and Mini-Batch k-means (545.470) also did well considering their high Calinski-Harabasz scores, and BIRCH (19.214) performed the worst. Mean-Shift (0.655) had the best Davies-Bouldin score, with Gaussian Mixture (0.981), k-means (0.992) and Spectral Clustering (0.992) performing better than the remaining models; BIRCH (1.840) again performed the worst.
Table 1. Clustering evaluation scores.
Clustering Algorithm    Silhouette Score    Calinski-Harabasz Score    Davies-Bouldin Score
k-Means                 0.397               561.455                    0.992
DBSCAN                  0.526               69.922                     1.585
SC                      0.404               580.535                    0.992
AC                      0.403               581.066                    1.218
GM                      0.320               738.158                    0.981
BIRCH                   0.108               19.214                     1.840
Mini Batch              0.390               545.470                    1.217
Mean-Shift              0.724               82.744                     0.655
OPTICS                  0.530               69.176                     1.653
Though the labels are arbitrary, the data is now 'labelled' after running the clustering models, which means we can use supervised learning to see how well the clustering generalizes. This is just one way of testing the clustering: if a clustering model found a meaningful split in the data, it should be possible to train a classifier to predict which cluster a given instance belongs to. All models performed excellently, with good classification metric values using the SGD classifier, except Mean-Shift and BIRCH, see Table 2; a sketch of this check follows the table.
Table 2. SGD classification report per clustering model.
Clustering Algorithm    Accuracy Score    Precision    Recall       F1 Score
k-Means                 100.000 %         100.000 %    100.000 %    100.000 %
DBSCAN                  99.412 %          97.503 %     97.455 %     97.448 %
SC                      100.000 %         100.000 %    100.000 %    100.000 %
AC                      100.000 %         100.000 %    100.000 %    100.000 %
GM                      100.000 %         100.000 %    100.000 %    100.000 %
BIRCH                   99.804 %          78.347 %     77.314 %     77.816 %
Mini Batch              100.000 %         100.000 %    100.000 %    100.000 %
Mean-Shift              74.063 %          20.383 %     20.622 %     20.369 %
OPTICS                  99.524 %          98.339 %     97.918 %     98.071 %
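The generalization check itself (steps 7 and 8 of the procedure) can be sketched as follows, treating the cluster labels as classes; X and labels come from the earlier clustering sketch, and the split ratio and random seed are our assumptions.

from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.3, random_state=42)

clf = SGDClassifier().fit(X_train, y_train)   # train on the cluster labels
print(classification_report(y_test, clf.predict(X_test)))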
For the final part of our procedure we illustrate using k-means. In k-means, the centroid is the mean of the documents in the cluster, and in TF-IDF all values are non-negative, so every word in every document in the cluster is represented in its centroid. Thus, the terms significant in the centroid are those that are most significant across all the documents in that cluster: no word gets left out, but many become insignificant. The highest TF-IDF values in a document vector belong to the words most significant to that document; likewise, the highest-valued words in the centroid are those most significant to the cluster as a whole.
Using k-means we are able to distinguish five top cluster categories: (1) Safety, (2) Crime, (3) Products, (4) Countries and (5) Health; see Table 3. Note that each category contains the hashtags corresponding to it. This is relevant in today's world, where there are millions of tweets daily on 2019-nCoV, making it computationally intensive to go through all the tweets to decipher meaning. Our procedure shows how governments and world bodies like the World Health Organization can quickly elicit meaningful categories from Twitter hashtags. This gives them a general overview of what the tweet data contains, with the option of applying more elaborate methods to selected tweets of interest found by looking at the different categories produced by the clustering model.
Table 3. Hashtag Categories using K-means.
Safety               Crime                Products           Countries         Health
confinement          crimestatistics      Dental products    covid_19de        deliverhappiness
lockdown             domestic abuse       Hand sanitizers    covid_19india     difficltybreathing
quarantine life      staysafe             Dettol             covid_19italia    digitalhealth
social distancing    stayhomesavelives    Soap               covid_19sa        diagnosing
curfew               threat precautions   Droplet            covidespana       deathcareindustry
5.1. Limitations
In general, the main limitations of our approach are:
1. A lack of a formal validation of the results using an independent Twitter dataset.
2. The study is presented for just one Twitter dataset, somewhat limiting the potential generality of the proposed approach.
3. The use of supervised learning as a method of verifying the classification requires experts
to validate the classifications made.
5.2. Improvements to Existing System
To improve our current system, we need to include additional 2019-nCoV datasets in our testing. Besides providing further validation, this can also show that our system generalizes to different Twitter datasets. These datasets can first include other 2019-nCoV data and then data from past pandemics such as H1N1 [40] and SARS [10]. We can also engage experts to analyse and categorize Twitter data, which would give us a better indication of the accuracy of our chosen classification algorithm, the Stochastic Gradient Descent (SGD) Classifier. Based on its performance relative to the experts we can also implement other classifiers which may give better results.
5.3. Future aid in 2019-nCoV
In the present fight against 2019-nCoV our system can be used by government and private agencies to categorize Twitter data. This gives them the ability to quickly find tweets of interest instead of employing machine learning algorithms on the actual tweets, which can take days to process. Using these categories, government bodies can set up media briefs or meetings targeted at the individuals in a geographic region or country, giving them a more targeted approach in dealing with the virus. In addition, other machine learning algorithms can be employed on the results by first selecting the tweets represented by a certain category, found by searching for one or more hashtags present within that category. This data reduction technique results in faster processing of the now smaller number of tweets. For instance, many text-based machine learning algorithms can then be employed to find the sentiment of individuals on that particular category or topic of interest. As machine learning sentiment analysis techniques improve, emotions such as fear and panic can be discovered, and this can advise medical institutions to attend to the mental states and effects within the population caused by first and even second 2019-nCoV infection spikes.
6. CONCLUSIONS
Unsupervised machine learning techniques such as clustering are widely gaining use with the
recent increase in social communication platforms like Twitter and Facebook. Clustering enables
the finding of patterns in these unstructured datasets. We collected tweets matching hashtags
linked to COVID-19 from a Kaggle dataset. We compared the performance of nine clustering
algorithms using this dataset. We evaluated the generalizability of these algorithms using a
supervised learning model. Finally, using a selected unsupervised learning algorithm we
categorized the clusters. The top five categories are Safety, Crime, Products, Countries and
Health. This can prove helpful for bodies using large amounts of Twitter data who need to quickly find key points in the data before going into further classification.
REFERENCES
[1] Abd-Alrazaq, Alaa, Dari Alhuwail, Mowafa Househ, Mounir Hamdi, and Zubair Shah. "Top
concerns of Tweeters during the COVID-19 pandemic: infoveillance study." Journal of medical
Internet research vol. 22, no. 4 (2020): e19016.
[2] Aiazzi, B., Alparone, L., Baronti, S., Garzelli, A. and Zoppetti, C., 2013. Nonparametric change
detection in multitemporal SAR images based on mean-shift clustering. IEEE transactions on
geoscience and remote sensing, 51(4), pp.2022-2031.
[3] Allan, K., 1977. Classifiers. Language, 53(2), pp.285-311.
[4] Ankerst, M., Breunig, M.M., Kriegel, H.P. and Sander, J., 1999. OPTICS: ordering points to identify
the clustering structure. ACM Sigmod record, 28(2), pp.49-60.
[5] Arlia, D. and Coppola, M., 2001, August. Experiments in parallel clustering with DBSCAN.
In European Conference on Parallel Processing (pp. 326-331). Springer, Berlin, Heidelberg.
[6] Bakshi, R.K., Kaur, N., Kaur, R. and Kaur, G., 2016, March. Opinion mining and sentiment analysis.
In 2016 3rd IEEE International Conference on Computing for Sustainable Global Development
(INDIACom), pp. 452-455.
[7] Bao, W., Lianju, N. and Yue, K., 2019. Integration of unsupervised and supervised machine learning
algorithms for credit risk assessment. Expert Systems with Applications, 128, pp.301-315.
[8] Bottou, L., 2012. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade (pp. 421-
436). Springer, Berlin, Heidelberg.
[9] Chen, E., Lerman, K. and Ferrara, E., 2020. Tracking Social Media Discourse About the COVID-19
Pandemic: Development of a Public Coronavirus Twitter Data Set. JMIR Public Health and
Surveillance, 6(2), p.e19273.
[10] De Wit, E., Van Doremalen, N., Falzarano, D. and Munster, V.J., 2016. SARS and MERS: recent
insights into emerging coronaviruses. Nature Reviews Microbiology, 14(8), p.523.
[11] Dubey, A.D., 2020. Twitter Sentiment Analysis during COVID19 Outbreak. Available at SSRN
3572023.
[12] Fang, Z. and Costas, R., 2020. Tracking the Twitter attention around the research efforts on the
COVID-19 pandemic. arXiv preprint arXiv:2006.05783.
[13] Feldman, Maryann, and Pierre Desrochers. "Research universities and local economic development:
Lessons from the history of the Johns Hopkins University." Industry and Innovation vol. 10, no. 1
(2003): 5-24.
[14] Gowda, K.C. and Krishna, G., 1978. Agglomerative clustering using the concept of mutual nearest
neighbourhood. Pattern recognition, 10(2), pp.105-112.
[15] Havrlant, L. and Kreinovich, V., 2017. A simple probabilistic explanation of term frequency-inverse
document frequency (tf-idf) heuristic (and variations motivated by this explanation). International
Journal of General Systems, 46(1), pp.27-36.
[16] Hechenbleikner, Elizabeth M., Daniel V. Samarov, and Ed Lin. "Data explosion during COVID-19:
A call for collaboration with the tech industry & data scrutiny." EClinicalMedicine (2020).
[17] Jahanbin, K. and Rahmanian, V., 2020. Using Twitter and web news mining to predict COVID-19
outbreak. Asian Pacific Journal of Tropical Medicine, vol. 13.
[18] Jahanbin, K. and Rahmanian, V., 2020. Using Twitter and web news mining to predict COVID-19
outbreak. Asian Pacific Journal of Tropical Medicine, 13.
[19] Kabir, F., Siddique, S., Kotwal, M.R.A. and Huda, M.N., 2015, March. Bangla text document
categorization using stochastic gradient descent (sgd) classifier. In 2015 International Conference on
Cognitive Computing and Information Processing (CCIP) (pp. 1-4). IEEE.
[20] Khan, K., Nikov, A. and Sahai, A., 2011. A fuzzy bat clustering method for ergonomic screening of
office workplaces. In Third International Conference on Software, Services and Semantic
Technologies S3T 2011 (pp. 59-66). Springer, Berlin, Heidelberg.
[21] Kirk, Colleen P., and Laura S. Rifkin. "I'll Trade You Diamonds for Toilet Paper: Consumer
Reacting, Coping and Adapting Behaviors in the COVID-19 Pandemic." Journal of Business
Research (2020).
[22] Kiselev, V.Y., Andrews, T.S. and Hemberg, M., 2019. Challenges in unsupervised clustering of
single-cell RNA-seq data. Nature Reviews Genetics, 20(5), pp.273-282.
[23] Kouzy, R., Abi Jaoude, J., Kraitem, A., El Alam, M.B., Karam, B., Adib, E., Zarka, J., Traboulsi, C.,
Akl, E.W. and Baddour, K., 2020. Coronavirus goes viral: quantifying the COVID-19 misinformation
epidemic on Twitter. Cureus, 12(3).
[24] Kouzy, R., Abi Jaoude, J., Kraitem, A., El Alam, M.B., Karam, B., Adib, E., Zarka, J., Traboulsi, C.,
Akl, E.W. and Baddour, K., 2020. Coronavirus goes viral: quantifying the COVID-19 misinformation
epidemic on Twitter. Cureus, 12(3).
[25] Krishna, K. and Murty, M.N., 1999. Genetic K-means algorithm. IEEE Transactions on Systems,
Man, and Cybernetics, Part B (Cybernetics), 29(3), pp.433-439.
[26] Kywe, S.M., Hoang, T.A., Lim, E.P. and Zhu, F., 2012, December. On recommending hashtags in
twitter networks. In International conference on social informatics (pp. 337-350). Springer, Berlin,
Heidelberg.
[27] Lee, S.hyun. & Kim Mi Na, (2008) “This is my paper”, ABC Transactions on ECE, Vol. 10, No. 5,
pp120-122.
[28] Lima, M.L., Nascimento, T.P., Labidi, S., Timbó, N.S., Batista, M.V., Neto, G.N., Costa, E.A. and
Sousa, S.R., 2016. Using sentiment analysis for stock exchange prediction. International Journal of
Artificial Intelligence & Applications (IJAIA), 7(1), pp.59-67.
[29] Maugis, C., Celeux, G. and Martin‐Magniette, M.L., 2009. Variable selection for clustering with
Gaussian mixture models. Biometrics, 65(3), pp.701-709.
[30] Maulik, U. and Bandyopadhyay, S., 2002. Performance evaluation of some clustering algorithms and
validity indices. IEEE Transactions on pattern analysis and machine intelligence, 24(12), pp.1650-
1654.
[31] Medford, R.J., Saleh, S.N., Sumarsono, A., Perl, T.M. and Lehmann, C.U., 2020. An "Infodemic": Leveraging High-Volume Twitter Data to Understand Public Sentiment for the COVID-19 Outbreak. medRxiv.
[32] Medford, Richard J., Sameh N. Saleh, Andrew Sumarsono, Trish M. Perl, and Christoph U. Lehmann. "An "Infodemic": Leveraging High-Volume Twitter Data to Understand Public Sentiment for the COVID-19 Outbreak." medRxiv (2020).
[33] Mustafa, H.H., Mohamed, A. and Elzanfaly, D.S., 2017. An enhanced approach for arabic sentiment
analysis. International Journal of Artificial Intelligence and Applications (IJAIA), 8(5), pp.1-14.
[34] Peng, K., Leung, V.C. and Huang, Q., 2018. Clustering approach based on mini batch kmeans for
intrusion detection system over big data. IEEE Access, 6, pp.11897-11906.
[35] Pokharel, B.P., 2020. Twitter Sentiment Analysis During Covid-19 Outbreak in Nepal. Available at
SSRN 3624719.
[36] Romanyshyn, M., 2013. Rule-based sentiment analysis of ukrainian reviews. International Journal of
Artificial Intelligence & Applications, 4(4), p.103.
[37] Rousseeuw, P.J., 1987. Silhouettes: a graphical aid to the interpretation and validation of cluster
analysis. Journal of computational and applied mathematics, 20, pp.53-65.
[38] Rovetta, A. and Bhagavathula, A.S., 2020. Global Infodemiology of COVID-19: Focus on Google
web searches and Instagram hashtags. medRxiv.
[39] Saito, T. and Rehmsmeier, M., 2015. The precision-recall plot is more informative than the ROC plot
when evaluating binary classifiers on imbalanced datasets. PloS one, 10(3).
[40] Schulze, M., Nitsche, A., Schweiger, B. and Biere, B., 2010. Diagnostic approach for the
differentiation of the pandemic influenza A (H1N1) v virus from recent human influenza viruses by
real-time PCR. PloS one, 5(4), p.e9966.
[41] Song, F., Shi, N., Shan, F., Zhang, Z., Shen, J., Lu, H., Ling, Y., Jiang, Y. and Shi, Y., 2020.
Emerging 2019 novel coronavirus (2019-nCoV) pneumonia. Radiology, 295(1), pp.210-217.
[42] SudalaiRajkumar: Novel Corona Virus 2019 Dataset. data retrieved March 30, 2020 from Kaggle,
https://www.kaggle.com/sudalairajkumar/novel-corona-virus-2019-dataset (2020)
[43] Toshniwal, A., Taneja, S., Shukla, A., Ramasamy, K., Patel, J.M., Kulkarni, S., Jackson, J., Gade, K., Fu, M., Donham, J. and Bhagat, N., 2014, June. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (pp. 147-156).
[44] Von Luxburg, U., 2007. A tutorial on spectral clustering. Statistics and computing, 17(4), pp.395-
416.
[45] Xue, J., Chen, J., Chen, C., Zheng, C. and Zhu, T., 2020. Machine learning on Big Data from Twitter
to understand public reactions to COVID-19. arXiv preprint arXiv:2005.08817.
[46] Xue, Jia, Junxiang Chen, Chen Chen, ChengDa Zheng, and Tingshao Zhu. "Machine learning on Big
Data from Twitter to understand public reactions to COVID-19." arXiv preprint
arXiv:2005.08817 (2020).
[47] Zhang, T., Ramakrishnan, R. and Livny, M., 1996. BIRCH: an efficient data clustering method for
very large databases. ACM Sigmod Record, 25(2), pp.103-114.
AUTHORS
Koffka Khan received the M.Sc., M.Phil. and D.Phil. degrees from the University of the West Indies. He is currently an Assistant Lecturer and has, to date, published numerous papers in journals and proceedings of international repute. His research areas are computational intelligence, routing protocols, wireless communications, information security and adaptive streaming controllers.
Emilie Ramsahai is a consulting Data Scientist, with more than 20 years industry
experience. She is currently working with UWI-Roytec in programme development and
course writing. She completed her PhD in Statistics and a Masters in Computer Science,
both at the University of the West Indies, where she has also lectured the Big Data and
Visualisation course from the Masters in Data Science, offered by the Department of
Computing and Information Technology, St Augustine Campus. She also completed her
fellowship at the International Centre for Genetic Engineering and Biotechnology
(ICGEB) in New Delhi, India and continues to publish and collaborate with a number of researchers in this
area.