How Anonymous Are We on Twitter?
Reihane Boghrati
boghrati@usc.edu
Vinit Parakh
vparakh@usc.edu
George Sam
gsam@usc.edu
Nada Aldarrab
naldarra@usc.edu
Abstract
Authorship recognition is a well-studied area of machine learning.
However, less work has been done on author identification for short
texts, especially in an environment like Twitter, where text is
limited to 140 characters per tweet. In this project we extracted
features from around 3 million tweets by 100 different users, combined
them with tf-idf vectors, and obtained 67% accuracy on the test
dataset.
1 Introduction
Authorship attribution and identification has been an important
research topic since the 19th century. In the field of statistical and
mathematical methods, one of the earliest notable works is that of
Mosteller and Wallace (1964) on the disputed authorship of some of the
Federalist Papers. With advances in machine learning and natural
language processing, authorship identification has become a more
general problem in computer science. As the amount of text and
documents available online has grown tremendously over the last
decade, substantial research in this area has become far more
important. We intend to study and apply techniques from natural
language processing to answer some of the most important questions in
author identification. Twitter is one of the most popular social
networks and has experienced rapid growth. With 288 million monthly
active users and 500 million tweets sent per day, Twitter users might
assume that they have a certain level of anonymity. Our project
investigates whether the author of an anonymous tweet can be
identified using stylometry. We create a system that helps identify
the original author of tweets and also identifies patterns in writing
style among authors, which could help Twitter's recommendation system
suggest followers to each other based on similarity in writing style.
In early research on authorship attribution, work concentrated on
building statistical models to identify writing patterns. Attributes
such as character count, word length, and word count were proposed as
measures of writing style. However, most of the early work was only
computer-assisted; no automated system had been developed to identify
such patterns. In most cases, the testing ground was literary works of
unknown or disputed authorship (e.g., the Federalist case), so the
attribution accuracy could not even be estimated. The main
methodological limitations of that period concerning the evaluation
procedure were the following:
• The textual data were too long (usually in-
cluding entire books) and probably not stylis-
tically homogeneous.
• The number of candidate authors was too
small (usually 2 or 3).
• The evaluation corpora were not controlled
for topic.
• The evaluation of the proposed methods was
mainly intuitive (usually based on subjective
visual inspection of scatterplots).
• The comparison of different methods was dif-
ficult due to lack of suitable benchmark data.
Working with Twitter data has many challenges of its own. Each author
has a different pattern of writing tweets. Users tend to write about a
variety of topics, describe different emotions, and refer to different
events and things. Capturing this information is difficult because
tweets do not always follow strict rules of grammar, and some tweets
contain multiple grammatical mistakes. It is thus important to look at
each individual tweet and create a normalized feature extraction
mechanism, so that the features are not biased toward particular
users. Also, studies have shown that existing off-the-shelf tools
perform poorly on Twitter data, so the tools need to be modified
before they are used on it.
2 Related Works
There has been work on author identification, but most of it has been
done on blogs or books, where the amount of content per author is
significant. Author identification on Twitter has been attempted by a
few researchers, e.g., Antonio Castro et al. They extracted user
information from multiple third-party sources, not just the Twitter
API. They performed their experiment on about 800 users with 1000
tweets per user, using a combination of bag of words (tf-idf) and
other features extracted from the tweets. For classification they used
nearest neighbor and regularized least squares classification. In the
other literature that we surveyed, we found that current research is
broadly classified into 2 categories, NLP and ML. The approaches are
mainly categorized as follows.
STATISTICAL UNIVARIATE METHODS
• Naive Bayes classifier: This is a learning and classification
method based on probability theory that uses Bayes' theorem to
build a classifier. It has been used as a baseline model in most
research.
• Cluster Analysis: Cluster analysis is an exploratory data analysis
tool for solving classification problems. In this approach, the
researchers sort the documents according to topics and groups.
They then compute the relative associativity between each pair of
groups. Once a cumulative relation is built for every cluster
pair, they assign the text to individual users.
MACHINE LEARNING techniques involve the use of neural networks and
SVMs.
• Feed-forward neural network: Most of the prior work in machine
learning has been done using neural networks, and the feed-forward
neural network appears to be the most widely used type for this
task. In this type of network the data flow is unidirectional: the
output of a neuron cannot be sent back to neurons in previous
layers. In other words, a feed-forward neural network is an
artificial neural network in which the connections between units
do not form a directed cycle.
• Support Vector Machines: In machine learning, the use of SVMs for
classification is widespread. SVMs are efficient at analyzing data
patterns for both classification and regression. For author
identification, an SVM acts as a non-probabilistic binary linear
classifier.
3 Data Collection
We collected our data from Twitter using its API. Data collection
consisted of two steps, user selection and tweet collection, as
follows:
3.1 User Selection
We randomly selected 100 Twitter accounts. By design, these accounts
had to have more than 3200 tweets. We selected them by using popular
accounts as seeds and following the follower network to reach random
users.
3.2 Collecting Tweets
We collected about 3200 tweets from each Twit-
ter account using Tweepy and the official Twit-
ter API. In collecting these tweets, we had to deal
with some issues which included the following:
1. The Twitter API does not retrieve old tweets.
2. The Twitter API does not give access to pri-
vate accounts.
3. The Twitter API does not filter tweets accord-
ing to language (we got many non-English
tweets).
4. There are some spam accounts that we got by
random selection.
5. There are many representations of emoticons
and many character encodings.
We needed to develop some scripts to filter ac-
counts and tweets to deal with these issues. In ad-
dition, the Twitter API has some rate limits. Since
our approach requires a large number of tweets,
we needed several days to be able to get the num-
ber of tweets that we wanted (after filtering).
3.3 Data sets
We divided the data into training, development,
and test sets using 80%, 10%, and 10% of the col-
lected tweets, respectively. Splitting tweets into
these categories was done randomly to minimize
the effect of the chronological order of the time-
line.
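The random 80/10/10 split described above can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `split_tweets`, its parameters, and the fixed seed are assumptions made for the example.

```python
import random


def split_tweets(tweets, train_frac=0.8, dev_frac=0.1, seed=0):
    """Shuffle tweets and split them into training, development, and test lists.

    The name, parameters, and seed are hypothetical; only the 80/10/10
    proportions and the random assignment come from the text.
    """
    shuffled = list(tweets)
    # Shuffling removes the chronological order of the timeline.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_dev = int(len(shuffled) * dev_frac)
    train = shuffled[:n_train]
    dev = shuffled[n_train:n_train + n_dev]
    test = shuffled[n_train + n_dev:]  # whatever remains
    return train, dev, test
```

Fixing the seed keeps the split reproducible across runs, which makes development-set results comparable between experiments.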
3.4 Tokenization
We implemented a two-stage tokenizer. In the first stage, we took care
of common encoding and formatting issues. In the second stage, we
tokenized tweets using Carnegie Mellon's Tweet NLP tokenizer. Our data
sets had about 5 million tokens in total.
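A two-stage tokenizer of this kind might look like the sketch below. Stage one normalizes encoding and whitespace; stage two uses a small regular expression as a stand-in for the Tweet NLP tokenizer, which is a separate tool and is not reproduced here. The pattern and the function names are assumptions for illustration only.

```python
import html
import re
import unicodedata

# Stand-in for the second stage. The real system used the Tweet NLP
# tokenizer; this regular expression is only an illustrative assumption.
TOKEN_RE = re.compile(
    r"https?://\S+"         # URLs
    r"|[@#]\w+"             # mentions and hashtags
    r"|[:;=][-']?[()DPp]"   # a few simple emoticons
    r"|\w+(?:'\w+)?"        # words, optionally with an apostrophe part
    r"|[^\w\s]"             # any other single symbol
)


def normalize(text):
    """Stage 1: undo HTML escapes, unify Unicode forms, collapse whitespace."""
    text = html.unescape(text)
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Stage 2: split the normalized tweet into tokens."""
    return TOKEN_RE.findall(normalize(text))
```

Keeping URLs, mentions, hashtags, and emoticons as single tokens matters here, because later features count exactly these items.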
4 Feature Extraction
To get the most information out of the tweets, we extracted several
different sets of features. Each set captures a specific part of a
user's writing style. In this section we introduce the features
extracted from our dataset.
4.1 Basic Features
In our project we tried to extract informative features from tweets.
Since the structure of tweets is not uniform, we wanted to capture
information such as word shapes, the structure of words, and the
length of words. These features fall under the main feature category
in our system. Some of the main features are:
number of words, tweet length, number of different words,
number of uppercase words, number of lowercase words,
number of titlecase words, number of othercase words,
percent of uppercase characters, percent of lowercase characters,
number of nonascii characters, number of smiley,
number of stopwords, number of slangwords
Many users like to write tweets in uppercase characters to highlight
emotions or important words, while many others use only lowercase
characters. This is captured by the percent of uppercase characters
and percent of lowercase characters features, which give us important
author-specific information.
Users often use up the entire 140-character limit in their tweets.
This behavior is captured by the number of words, tweet length, and
number of different words features. These features give us an idea of
how long a user's tweets are. They also tell us whether the user likes
to use many unique words or tends to repeat certain words, such as
particular verbs or nouns. Authors who always tweet about the same
entity tend to repeat those words a lot. For example, a student at USC
who loves tweeting about USC will tend to use the words 'school',
'USC', and 'university' a lot, so the number of unique words in his
tweets will be lower than for other types of users.
Twitter also has its own jargon. Abbreviations like 'LMFAO', 'YOLO',
and 'DM' are used mainly on social media, and particularly on Twitter.
It is important to find out how often a user uses these words. We
built a dictionary of such words, which we call slang words, counted
their occurrences in every tweet, and added the count as a feature.
Not all users use these words in their tweets, so the feature helps in
identifying specific types of users. These words have also helped in
clustering the users according to the topics they talk about.
Other important main features include the counts of specific
characters. Many users use a lot of '!' in their tweets, while many
others use special characters like '@' or '&'. We capture this by
keeping a count of such special characters in the features. We also
keep counts of individual letters in our feature vector. This is
important for some very specific tweets and users. For example, a user
who likes writing words like 'happyyyyy', 'louuuve', or 'knooooow'
will tend to use such words a lot. This is captured effectively by the
character count features, since the counts for the repeated characters
will be exceptionally high.
It is also important to note that some users like to retweet or
mention other users in their tweets. We use these as features in our
feature vector as well.
Hashtags and mentions are two pieces of infor-
mation which say a lot about a tweet. Hashtags
usually mark the topic of the tweet, and mentions
identify the persons involved in the discussion. We
recognize the importance of this information and
add these features to our feature set.
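The basic features described in this section could be computed roughly as in the following sketch. The underscored feature keys, the tiny stop-word and slang sets, and the function name are assumptions for illustration; the project's own dictionaries and exact definitions are not given in the text.

```python
import string

# Illustrative subsets only; the real dictionaries are not given in the text.
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}
SLANG = {"lmfao", "yolo", "dm", "lol"}


def basic_features(tweet):
    """Return a dict of simple surface features for one tweet (hypothetical names)."""
    words = tweet.split()
    letters = [c for c in tweet if c.isalpha()]
    n_letters = len(letters) or 1  # avoid division by zero on letter-free tweets
    bare = [w.lower().strip(string.punctuation) for w in words]
    return {
        "number_of_words": len(words),
        "tweet_length": len(tweet),
        "number_of_different_words": len({w.lower() for w in words}),
        "number_of_uppercase_words": sum(w.isupper() for w in words),
        "number_of_lowercase_words": sum(w.islower() for w in words),
        "number_of_titlecase_words": sum(w.istitle() for w in words),
        "percent_of_uppercase_characters":
            100.0 * sum(c.isupper() for c in letters) / n_letters,
        "number_of_nonascii_characters": sum(ord(c) > 127 for c in tweet),
        "number_of_stopwords": sum(w in STOPWORDS for w in bare),
        "number_of_slangwords": sum(w in SLANG for w in bare),
        "number_of_exclamations": tweet.count("!"),
        "number_of_mentions": sum(w.startswith("@") for w in words),
        "number_of_hashtags": sum(w.startswith("#") for w in words),
    }
```

Each value is a plain number, so the dictionary can be flattened into one row of a feature matrix alongside the tf-idf vector.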
4.2 Word clusters
We utilize the idea of word clusters to deal with
technically misspelled words such as gonna and
gna; so, sooo, and sooooooo; etc. We use clus-
ters produced by an unsupervised HMM: Percy
Liang’s Brown clustering implementation (as de-
scribed in the Tweet NLP project).
We use about 900 clusters to represent tweets.
We also add general clusters to handle common
categories of tokens such as URLs. Using word
clusters alone as a bag of clusters with a Naive
Bayes classifier resulted in an increase of 6% in
author prediction accuracy compared to
a bag of words representation of tweets.
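As a rough illustration of the bag-of-clusters idea, each token can be replaced by its cluster identifier before counting. The mapping below is a made-up toy table: real Brown clusters are identified by bit-string paths, and the cluster names, the `<URL>` and `<UNK>` symbols, and the function names are all assumptions.

```python
from collections import Counter

# Toy word-to-cluster table (hypothetical). In the real system this mapping
# comes from the pre-trained clusters, about 900 of them.
WORD_TO_CLUSTER = {
    "gonna": "c_going_to",
    "gna": "c_going_to",
    "so": "c_so",
    "sooo": "c_so",
    "sooooooo": "c_so",
}


def to_cluster(token):
    """Map a token to its cluster, with a general cluster for URLs."""
    if token.startswith(("http://", "https://")):
        return "<URL>"
    return WORD_TO_CLUSTER.get(token.lower(), "<UNK>")


def bag_of_clusters(tokens):
    """Count cluster occurrences instead of raw word occurrences."""
    return Counter(to_cluster(t) for t in tokens)
```

Because spelling variants share a cluster, a user who writes both "gonna" and "gna" contributes to a single feature rather than two sparse ones.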
4.3 Part of Speech Tagging
Words alone as features do not always highlight an author's writing
patterns in the best way. We implemented a POS tagging system to
extract features about the part-of-speech elements that authors use,
such as how many verbs, nouns, and adjectives appear in their tweets.
These features help in understanding the style of tweets an author
writes. Some authors also tend to add a lot of emoticons to their
tweets, while others like to add punctuation marks (!, ?, ...). These
features also need to be captured as part of the author's style.
In our system we used a combination of the Stanford NLP libraries and
a perceptron-based POS tagger to get POS tags for tweets. The Stanford
NLP libraries provide a built-in POS tagger, which needs a model file
to tag tokens. Since we are focusing on Twitter data, a major portion
of the tweets do not follow standard English grammatical rules. This
causes problems when using traditional off-the-shelf POS tagging tools
on tweets: words like LOL and Knoooow, as well as URLs, do not have a
specific POS tag in the tag sets of these tools. When tagging with the
Stanford NLP library we used a model file specifically designed for
Twitter data, which is available on the Stanford CoreNLP website. This
model file allows us to use their POS tagger to tag tweets, but in our
case we could tag only 76% of tweets correctly with it. We therefore
also developed our own perceptron POS tagger, which gives us an
accuracy of 95% on standard English POS tags. We used the output of
the Stanford CoreNLP tagger as the training file for our POS tagger.
This helps us get tags for URLs, emoticons, and retweets. We focused
our tagger on collecting only some of the important POS tags. The
following tags
are the ones we want to use: ”USR”, ”PRP”,
”VBD”, ”CC”, ”IN”, ”JJS”, ”NN”, ”NNS”, ”DT”,
”VBP”, ”VB”, ”VBG”, ”JJ”, ”UH”, ”RB”, ”TO”,
”VBN”, ”PRP$”, ”NNP”, ”VBZ”, ”URL”, ”WP”,
”MD”, ”WRB”, ”RT”, ”SYM”, ”CD”, ”WDT”,
”RP”, ”EX”, ”JJR”, ”RBR”, ”HT”, ”RBS”, ”MB”,
”POS”, ”PDT”, ”WP$”, ”FW”, ”$”, ”NNPS”,
”ON”, ”R”, ”LV”, ”FM”, ”PR”, ”J”, ”HR”, ”AE”,
”F”, ”O”, ”N”, ”CCG”, ”EA”, ”MP”, ”P”, ”T”
We want to find the occurrences of user tags, retweets, emoticons,
punctuation, continuations, etc. in every tweet. The most helpful POS
tag was the hashtag tag (HT). With the help of the POS tagger we are
able to extract some vital POS features. Our system keeps a count of
every POS tag in each tweet, including tags such as punctuation and
emoticons. These counts are added to the feature vector automatically.
The POS tagger was integrated with the Feature Extraction Engine,
which uses the tool to create a feature vector CSV file.
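Turning tagger output into count features can be sketched as follows. The tagger itself is not shown; the input is assumed to be a list of (token, tag) pairs, and the short `TAGS` list is only a subset of the tag set listed above, chosen for the example.

```python
from collections import Counter

# A small subset of the tags listed above, fixed in order so that every
# tweet produces a vector with the same layout.
TAGS = ["USR", "HT", "URL", "RT", "NN", "VB", "JJ", "UH"]


def pos_count_features(tagged_tokens, tags=TAGS):
    """Count how often each tag occurs in one tagged tweet.

    tagged_tokens is assumed to be a list of (token, tag) pairs.
    """
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts.get(tag, 0) for tag in tags]
```

A row of this form can be written to a CSV file with the standard `csv` module, one row per tweet.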
4.4 Paragraph Vector Representation
Conceptual representation of words and texts has always been a
challenge. In recent years the focus has been on distributed
representations of words, where each word is represented as an
n-dimensional vector. More recently the focus has also shifted to
distributed representations of larger texts. Specifically, we used an
approach called Paragraph Vector, which simultaneously learns vector
representations for words and their context.
Using Paragraph Vector, we trained the model on the tweets dataset and
a large blog corpus to compute a vector representation for each tweet.
The vectors have 100 dimensions, which capture the semantic and
syntactic style of the tweets. We then used them as features for the
classifiers to improve performance.
5 Methods
This section explains the different methods used in our project.
5.1 KMeans
In our system we also perform clustering of tweets to find the most
common topics that individual authors talk about.
We created a KMeans clustering system using scikit-learn for Python.
This system runs a clustering algorithm on the tweets to find the most
commonly discussed topics. These topics are extracted from the bag of
words of the tweets. The system gives us a list of clusters with the
topics in each of them. The clusters are created dynamically, and the
number of clusters k is 10, i.e., the top 10 most talked-about topics.
This clustering can help us identify which topics are most talked
about by the users.
The clustering system was tested on the training data from the user
tweets. We extracted the top 3 clusters for each user. Each cluster
has 10 words that the user talks about in the tweets. The clusters are
cluster0, cluster1, and cluster2. Cluster0 denotes the cluster with
the user's most talked-about topics.
The clustering is done using the sklearn module of scikit-learn. We
compute tf-idf vectors for each tweet to measure similarity. Then we
use the KMeans class in sklearn to run k-means clustering on the
training set, which is the vectorized form of all the tweets for the
users. We then store the data from all the clusters.
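The tf-idf similarity that the clustering builds on can be illustrated with a small standard-library sketch. It uses the plain textbook weighting (term frequency times log of inverse document frequency), which is an assumption: scikit-learn's vectorizer applies its own smoothing and normalization, so its numbers will differ. The function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Tweets that share rare terms end up close to each other, which is what lets k-means group them into topic clusters.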
5.2 Naive Bayes
The Naive Bayes classifier is a probabilistic classifier based on
Bayes' theorem, with the assumption that features are independent of
each other given the class. It computes the probability that a given
observation belongs to a class based on its features.
We used Naive Bayes with bag of words and the other extracted features
on a set of 10 users and compared its performance with that of the
Support Vector Machine (SVM), which is explained below. The drawback
of Naive Bayes in our setup was that it needed all the data to be in
memory to compute the probabilities, so it would be time- and
resource-consuming at large scale.
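A minimal multinomial Naive Bayes over bags of words might look like the sketch below, with each author as a class. Add-one (Laplace) smoothing and skipping out-of-vocabulary words are assumptions of this example; the text does not say which variant or library the project used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial Naive Bayes; not the project's implementation."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, doc):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            # log P(class) + sum of log P(word | class), add-one smoothed
            score = math.log(count / n_docs)
            for word in doc:
                if word in self.vocab:  # unseen words are ignored
                    score += math.log(
                        (self.word_counts[label][word] + 1)
                        / (total + len(self.vocab))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Working in log space avoids numeric underflow when many small probabilities are multiplied.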
5.3 Support Vector Machine
The Support Vector Machine (SVM) is another method of supervised
classification. It analyzes the data and finds patterns in the
features of the training set, which it then applies to unseen data.
We used SVM with tf-idf vectors and the extracted features to conduct
the author identification task. The advantage of SVM over Naive Bayes
in our experiments is that it is faster on larger amounts of data.
5.4 Architecture
Figure 1 shows an overall picture of our system components. More
specifically, the system has the following major components:
Data Collector for Tweets: This component is responsible for
collecting tweets from Twitter. We used the Twitter API to collect the
tweets. This component then performs aggregation on the tweets.
Data Cleaner and Filter: The data collected from Twitter has a lot of
noise and garbage. Many of the tweets were in languages other than
English and needed to be filtered out, so we manually filtered them
out of the corpus. We also have a preliminary system to filter tweets
with cuss words and abusive language. This is done in the data filter
section.
Feature Engine: Once the data is collected, the next important step is
to perform feature engineering on it. We used multiple methods to
extract features from our corpus. We first divided the corpus into 3
sections: Training Set (80%), Development Set (10%), and Testing and
Validation Set (10%). The feature engine plays one of the most crucial
parts in the system. In this phase we divided the features into 4 main
types:
• Main Features: These features are extracted from the raw text. A
simple bag of words is not sufficient to extract vital information
from the tweets. Users often write words like knooow, grrreat,
etc.; they extend some words to highlight their importance in the
tweet. This needs to be captured. Also, not every author uses long
words of length 5 or more. Many users like to use ! and ... in
their tweets, and many users retweet other tweets and put URLs in
their tweets. We intended to capture these features. We have the
following features classified as main features: number of words,
tweet length, number of different words, number of uppercase
words, number of lowercase words, number of titlecase words,
number of othercase words, percent of uppercase characters,
percent of lowercase characters, number of nonascii characters,
number of smiley, number of stopwords, number of slangwords.
These features highlight the way the tweet has been written. Most
of them are counts of these factors in the tweets.
Figure 1: System Architecture
• POS Tagged Features: Merely having the structure of words and the
counts of different types of words in a sentence often does not
help in extracting vital information about the structure of the
sentence. In our system, we also intend to find out how often
authors describe the subjects they talk about. The verbs and
adjectives used in the tweets play a vital role here. The
grammatical category of a word is called its part of speech (POS).
We tried to annotate the words in the tweets with their parts of
speech. However, this is difficult since the structure of tweets
is very different from regular English grammar, and the words and
jargon used on Twitter are very different as well. Using the
Stanford NLP libraries and our own POS tagger, we performed POS
tagging on the tweets and extracted some information about them.
We extracted POS tag counts as features and put them in the
feature vector. This proved fruitful in gaining good accuracy on
author identification.
• Stylometric Features: In our system we also intended to identify
writing style through the use of some stylometric features. We
used word clusters, based on the distance of words from each
other. Words like knoooow, know, and noe mean the same thing, but
a regular tokenizer or feature engine will not identify this
similarity. We identified such similarities by creating word
clusters based on the distance between such words. These features
are added to the vector space.
• Sentiment Features: We also built a sentiment analyser for our
system that identifies the sentiment of the author at the time the
tweet was written. The sentiment analyser has 5 classes of
sentiment: Very Positive, Positive, Neutral, Negative, and Very
Negative. These sentiments can be used as features for the tweets
as well. People tend to use Twitter to express what they feel, and
most of the time these emotions follow a sustained pattern. In the
course of feature extraction, we realized that users were
consistent in the sentiments of their tweets. We used these tags
as binary values in our feature set. This gives a lot of
information about the kind of tweets the users write. We used the
Stanford NLP tool to get the sentiments of tweets. However, before
giving a tweet to Stanford NLP, we preprocessed it: we removed
stop words, replaced slang words with their expansions (e.g., lol
with lots of laughter) so that they make more sense, removed
punctuation marks, and replaced emoticons with the expressions
they stand for (e.g., :-) and :-( were replaced with happy and
sad, respectively).
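The sentiment preprocessing and the binary encoding of the five classes can be sketched as below. The replacement tables contain only the examples named in the text plus a few illustrative stop words; the function names are hypothetical, and the call to the Stanford sentiment tool itself is not shown.

```python
import string

# Only the examples from the text, plus an illustrative stop-word subset.
SLANG_EXPANSIONS = {"lol": "lots of laughter"}
EMOTICONS = {":-)": "happy", ":-(": "sad"}
STOPWORDS = {"the", "a", "an", "is", "to"}
SENTIMENT_CLASSES = ["Very Positive", "Positive", "Neutral",
                     "Negative", "Very Negative"]


def preprocess_for_sentiment(tweet):
    """Clean one tweet before it is passed to a sentiment tool."""
    out = []
    for token in tweet.split():
        # Emoticons are handled first, since stripping punctuation
        # would otherwise destroy them.
        if token in EMOTICONS:
            out.append(EMOTICONS[token])
            continue
        # Strips leading and trailing punctuation only.
        word = token.lower().strip(string.punctuation)
        if not word or word in STOPWORDS:
            continue
        out.append(SLANG_EXPANSIONS.get(word, word))
    return " ".join(out)


def sentiment_one_hot(label):
    """Encode a sentiment class as five binary features."""
    return [1 if label == c else 0 for c in SENTIMENT_CLASSES]
```

For example, "lol the game is great!!! :-)" becomes "lots of laughter game great happy", which reads more like the standard English that an off-the-shelf sentiment model expects.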
6 Results
In the previous section we described two methods, Naive Bayes and SVM.
Both classifiers were used to classify and label tweets by author. We
used the features described in Section 4 to train each classifier and
later used the trained model to predict the authors of unseen tweets.
Figure 2 shows the results of running the Naive Bayes and SVM
classifiers on a set of 10 users, using either just bag of words or a
combination of bag of words and the extracted features. Using the
aforementioned features, the accuracy of both SVM and Naive Bayes
improves; however, SVM outperforms Naive Bayes in both experiments.
Based on the better performance of SVM (and Naive Bayes being time-
and resource-consuming on a large corpus), we continued our
experiments using the SVM classifier. Figure 3 shows the results of
running SVM on 10, 50, and 100 users. In each case we ran the
experiment once using tf-idf alone and once using tf-idf along with
the other described features. As shown, using the features increased
the accuracy in all experiments.
7 Discussion and Future Work
We experimented with two major techniques, Naive Bayes and Support
Vector Machine, using a variety of feature combinations to identify
the author of a tweet. Although Naive Bayes was not scalable to 100
users with 3200 tweets per user, it performed really well on 10 users
using bag of words as features. Bag of words with tf-idf gave us our
baseline score, and adding various features extracted from the tweets,
such as the main features, POS tags, doc2vec, and clustering, improved
the accuracy. We were able to run Naive Bayes on at most 20 users.
With Naive Bayes we got the best result by using the combination of
all features. This is because, with the 140-character limit on
Twitter, we tried to extract as much information as we could from each
tweet, and the more information we have, the better the classification
results. We tried SVM trained with Stochastic Gradient Descent (SGD)
as our next approach, since it allowed us to scale to 100 users.
Because the data is so large, it is not possible to load all of it
into memory, so batch gradient descent and Newton's method were not an
option for us. Hence, we used SGD to update the weights. We also
experimented with the learning rate for SGD: we tried a few values and
chose 0.01 as the best one. This gave us better performance than Naive
Bayes. We observed a very similar trend here as well: the best
accuracy was obtained by using bag of words (tf-idf) in combination
with all the other features that we had. We tried various options with
SGD, including different loss functions and different weight
initializations (zeros and small random values). We got the best
results with the hinge loss. The reason for trying random
initialization is that the test data might contain feature values that
were never seen in the training data. Initializing all weights to 0
means that we give a weight of 0 to these unseen features, which might
lead to classification errors. However, giving a small initial weight
to the features did not seem to improve the accuracy in any way, so we
went ahead with initializing the weights to 0.
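The training procedure described above (hinge loss, a learning rate of 0.01, and zero-initialized weights, updated one example at a time) can be sketched for the binary case as follows. This is a simplified illustration rather than the project's code: the regularization strength, epoch count, and function names are assumptions, and a multi-author setup would need one such classifier per author (one-vs-rest) or a library implementation.

```python
import random


def train_sgd_svm(X, y, lr=0.01, reg=0.0001, epochs=50, seed=0):
    """Train a binary linear SVM with stochastic gradient descent.

    X is a list of feature lists and y holds labels in {-1, +1}.
    reg, epochs, and seed are illustrative assumptions.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])  # weights initialized to 0, as in the text
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:  # one example at a time, so data can be streamed
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # L2 regularization shrinks the weights on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if margin < 1.0:  # hinge loss is active for this example
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on the side of the decision boundary."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

Because each update touches only one example, the full data set never has to be held in memory, which is the property that made this approach scale to 100 users.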
What we intended to do in this project is build a system that uses
text on social media to analyze authors' writing patterns. It shows
improvements over previous work by using a large corpus of users, and
we focused on text from Twitter. More work can be done on developing
boosting algorithms with better performance. We can associate
confidence levels with individual classifiers and then aggregate the
weighted output of each classifier to generate a final answer for the
author of a tweet.
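One simple way to aggregate weighted classifier outputs is a confidence-weighted vote, sketched below. This is only one possible reading of the proposal; the function name and the idea of using a single scalar confidence per classifier are assumptions.

```python
from collections import defaultdict


def weighted_vote(predictions, confidences):
    """Pick the author with the largest total confidence.

    predictions holds the author predicted by each classifier and
    confidences holds the weight assigned to each classifier.
    """
    scores = defaultdict(float)
    for author, weight in zip(predictions, confidences):
        scores[author] += weight
    return max(scores, key=scores.get)
```

With predictions ["alice", "bob", "bob"] and confidences [0.9, 0.3, 0.4], the vote goes to "alice", since one confident classifier outweighs two weak ones.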
More research can be done on extracting better features from tweets,
such as vectorizing the parse tree into an n-dimensional vector. This
would help in getting a definite structure for each tweet in the
training data. This system can also be integrated with the existing
Twitter recommendation system to identify writing patterns among users
and suggest followers to each other based on topics that they commonly
write about. The classification system can be improved by adding an
intermediate layer of classification on clusters.
Another area where more work is needed is scalability. Our system
currently works for 100 users, but needs to be modified to work on a
large user corpus like Twitter.
Figure 2: Naive Bayes Performance
Figure 3: Support Vector Machine Performance
Acknowledgment and Contribution
First of all, we want to thank Professor Kenji and Justin for the
amazing class we had this semester; there is no need to say how much
we learned during the course.
This was a team effort that none of us could have done individually,
but the tasks were mainly divided as follows:
George: He mostly worked on extracting POS tagging features (which is
challenging in the Twitter world) and some of the basic features
mentioned earlier. He also tried clustering the tweets in order to do
author identification.
Nada: She worked on collecting tweets and then tokenized them to make
them ready for the next step. She also worked on extracting features
(word clusters, hashtags, and mentions).
Reihane: She first worked on getting random users from Twitter who
tweet in English and have a reasonable number of tweets. Then she
worked on classification (Naive Bayes and SVM). She also extracted
vector representations of tweets using the Paragraph Vector method.
Vinit: He extracted the main features (described in the basic features
section), including sentiment analysis of tweets. He then worked on
the SVM and SGD classifiers.
BitBucket: https://bitbucket.org/georgesam/csci544-
project

A Baseline Based Deep Learning Approach of Live TweetsA Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live Tweetsijtsrd
 

Was ist angesagt? (20)

SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...
SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...
SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...
 
Named Entity Recognition using Tweet Segmentation
Named Entity Recognition using Tweet SegmentationNamed Entity Recognition using Tweet Segmentation
Named Entity Recognition using Tweet Segmentation
 
SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...
SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...
SARCASM AS A CONTRADICTION BETWEEN A TWEET AND ITS TEMPORAL FACTS: A PATTERNB...
 
A survey on automatic detection of hate speech in text
A survey on automatic detection of hate speech in textA survey on automatic detection of hate speech in text
A survey on automatic detection of hate speech in text
 
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
IRJET- A Survey on Trend Analysis on Twitter for Predicting Public Opinion on...
 
Groundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitterGroundhog day: near duplicate detection on twitter
Groundhog day: near duplicate detection on twitter
 
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly DetectionDetection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
Detection and Analysis of Twitter Trending Topics via Link-Anomaly Detection
 
TEXT CLASSIFICATION FOR AUTHORSHIP ATTRIBUTION ANALYSIS
TEXT CLASSIFICATION FOR AUTHORSHIP ATTRIBUTION ANALYSISTEXT CLASSIFICATION FOR AUTHORSHIP ATTRIBUTION ANALYSIS
TEXT CLASSIFICATION FOR AUTHORSHIP ATTRIBUTION ANALYSIS
 
IRJET - Fake News Detection using Machine Learning
IRJET -  	  Fake News Detection using Machine LearningIRJET -  	  Fake News Detection using Machine Learning
IRJET - Fake News Detection using Machine Learning
 
IRJET - Twitter Sentiment Analysis using Machine Learning
IRJET -  	  Twitter Sentiment Analysis using Machine LearningIRJET -  	  Twitter Sentiment Analysis using Machine Learning
IRJET - Twitter Sentiment Analysis using Machine Learning
 
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
58903240-SentiMatrix-Multilingual-Sentiment-Analysis-Service
 
Detecting the presence of cyberbullying using computer software
Detecting the presence of cyberbullying using computer softwareDetecting the presence of cyberbullying using computer software
Detecting the presence of cyberbullying using computer software
 
Political prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learningPolitical prediction analysis using text mining and deep learning
Political prediction analysis using text mining and deep learning
 
IRJET- Fake News Detection
IRJET- Fake News DetectionIRJET- Fake News Detection
IRJET- Fake News Detection
 
Detection of cyber-bullying
Detection of cyber-bullying Detection of cyber-bullying
Detection of cyber-bullying
 
Automatic Hate Speech Detection: A Literature Review
Automatic Hate Speech Detection: A Literature ReviewAutomatic Hate Speech Detection: A Literature Review
Automatic Hate Speech Detection: A Literature Review
 
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for  Spam FilteringA Model for Fuzzy Logic Based Machine Learning Approach for  Spam Filtering
A Model for Fuzzy Logic Based Machine Learning Approach for Spam Filtering
 
Insights into the Twitterverse: Benchmarking and analysis twitter content
Insights into the Twitterverse: Benchmarking and analysis twitter contentInsights into the Twitterverse: Benchmarking and analysis twitter content
Insights into the Twitterverse: Benchmarking and analysis twitter content
 
A Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live TweetsA Baseline Based Deep Learning Approach of Live Tweets
A Baseline Based Deep Learning Approach of Live Tweets
 
Ijet journal
Ijet journalIjet journal
Ijet journal
 

Ähnlich wie How Anonymous Can Someone be on Twitter?

Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSumit Raj
 
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...IRJET Journal
 
IRJET- Tweet Segmentation and its Application to Named Entity Recognition
IRJET- Tweet Segmentation and its Application to Named Entity RecognitionIRJET- Tweet Segmentation and its Application to Named Entity Recognition
IRJET- Tweet Segmentation and its Application to Named Entity RecognitionIRJET Journal
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using Rsantoshi mangalgi
 
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptxSampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx20211a05p7
 
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...IRJET Journal
 
Neural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment AnalysisNeural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment AnalysisEditor IJCATR
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionieeepondy
 
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2VecIRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2VecIRJET Journal
 
Text mining on Twitter information based on R platform
Text mining on Twitter information based on R platformText mining on Twitter information based on R platform
Text mining on Twitter information based on R platformFayan TAO
 
Categorize balanced dataset for troll detection
Categorize balanced dataset for troll detectionCategorize balanced dataset for troll detection
Categorize balanced dataset for troll detectionvivatechijri
 
IRJET - Suicidal Text Detection using Machine Learning
IRJET -  	  Suicidal Text Detection using Machine LearningIRJET -  	  Suicidal Text Detection using Machine Learning
IRJET - Suicidal Text Detection using Machine LearningIRJET Journal
 
Twitter Sentiment Analysis
Twitter Sentiment AnalysisTwitter Sentiment Analysis
Twitter Sentiment Analysisijtsrd
 
Detecting Trends Through Twitter Stream v2
Detecting Trends Through Twitter Stream v2Detecting Trends Through Twitter Stream v2
Detecting Trends Through Twitter Stream v2The Night's Watch
 
A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...
A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...
A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...IRJET Journal
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Serge Beckers
 
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...IRJET Journal
 

Ähnlich wie How Anonymous Can Someone be on Twitter? (20)

Sentiment Analysis of Twitter Data
Sentiment Analysis of Twitter DataSentiment Analysis of Twitter Data
Sentiment Analysis of Twitter Data
 
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...
IRJET- An Experimental Evaluation of Mechanical Properties of Bamboo Fiber Re...
 
IRJET- Tweet Segmentation and its Application to Named Entity Recognition
IRJET- Tweet Segmentation and its Application to Named Entity RecognitionIRJET- Tweet Segmentation and its Application to Named Entity Recognition
IRJET- Tweet Segmentation and its Application to Named Entity Recognition
 
Twitter data analysis using R
Twitter data analysis using RTwitter data analysis using R
Twitter data analysis using R
 
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptxSampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
SampleLiteratureReviewTemplate_IVBTechIISEM_MajorProject.pptx
 
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
IRJET- Effective Countering of Communal Hatred During Disaster Events in Soci...
 
Neural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment AnalysisNeural Network Based Context Sensitive Sentiment Analysis
Neural Network Based Context Sensitive Sentiment Analysis
 
Tweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognitionTweet segmentation and its application to named entity recognition
Tweet segmentation and its application to named entity recognition
 
[IJET V2I4P9] Authors: Praveen Jayasankar , Prashanth Jayaraman ,Rachel Hannah
[IJET V2I4P9] Authors: Praveen Jayasankar , Prashanth Jayaraman ,Rachel Hannah[IJET V2I4P9] Authors: Praveen Jayasankar , Prashanth Jayaraman ,Rachel Hannah
[IJET V2I4P9] Authors: Praveen Jayasankar , Prashanth Jayaraman ,Rachel Hannah
 
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2VecIRJET-  	  Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
IRJET- Improved Real-Time Twitter Sentiment Analysis using ML & Word2Vec
 
Text mining on Twitter information based on R platform
Text mining on Twitter information based on R platformText mining on Twitter information based on R platform
Text mining on Twitter information based on R platform
 
Categorize balanced dataset for troll detection
Categorize balanced dataset for troll detectionCategorize balanced dataset for troll detection
Categorize balanced dataset for troll detection
 
Twitter in Academic Conferences
Twitter in Academic ConferencesTwitter in Academic Conferences
Twitter in Academic Conferences
 
IRJET - Suicidal Text Detection using Machine Learning
IRJET -  	  Suicidal Text Detection using Machine LearningIRJET -  	  Suicidal Text Detection using Machine Learning
IRJET - Suicidal Text Detection using Machine Learning
 
Twitter Sentiment Analysis
Twitter Sentiment AnalysisTwitter Sentiment Analysis
Twitter Sentiment Analysis
 
Detecting Trends Through Twitter Stream v2
Detecting Trends Through Twitter Stream v2Detecting Trends Through Twitter Stream v2
Detecting Trends Through Twitter Stream v2
 
A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...
A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...
A Paper on Web Data Segmentation for Terrorism Detection using Named Entity R...
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?
 
Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?Twitter: Social Network Or News Medium?
Twitter: Social Network Or News Medium?
 
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...IRJET-  	  A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
IRJET- A Review on: Sentiment Polarity Analysis on Twitter Data from Diff...
 

Kürzlich hochgeladen

Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxupamatechverse
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduitsrknatarajan
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...ranjana rawat
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSKurinjimalarL3
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130Suhani Kapoor
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)Suman Mia
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordAsst.prof M.Gokilavani
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Serviceranjana rawat
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).pptssuser5c9d4b1
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSISrknatarajan
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )Tsuyoshi Horigome
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escortsranjana rawat
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlysanyuktamishra911
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxpurnimasatapathy1234
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)simmis5
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...Soham Mondal
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations120cr0395
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls in Nagpur High Profile
 

Kürzlich hochgeladen (20)

Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024Water Industry Process Automation & Control Monthly - April 2024
Water Industry Process Automation & Control Monthly - April 2024
 
Introduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptxIntroduction and different types of Ethernet.pptx
Introduction and different types of Ethernet.pptx
 
UNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular ConduitsUNIT-II FMM-Flow Through Circular Conduits
UNIT-II FMM-Flow Through Circular Conduits
 
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
(PRIYA) Rajgurunagar Call Girls Just Call 7001035870 [ Cash on Delivery ] Pun...
 
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICSAPPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
APPLICATIONS-AC/DC DRIVES-OPERATING CHARACTERISTICS
 
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
VIP Call Girls Service Kondapur Hyderabad Call +91-8250192130
 
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)Software Development Life Cycle By  Team Orange (Dept. of Pharmacy)
Software Development Life Cycle By Team Orange (Dept. of Pharmacy)
 
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete RecordCCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
CCS335 _ Neural Networks and Deep Learning Laboratory_Lab Complete Record
 
Roadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and RoutesRoadmap to Membership of RICS - Pathways and Routes
Roadmap to Membership of RICS - Pathways and Routes
 
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
(RIA) Call Girls Bhosari ( 7001035870 ) HI-Fi Pune Escorts Service
 
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
247267395-1-Symmetric-and-distributed-shared-memory-architectures-ppt (1).ppt
 
UNIT-III FMM. DIMENSIONAL ANALYSIS
UNIT-III FMM.        DIMENSIONAL ANALYSISUNIT-III FMM.        DIMENSIONAL ANALYSIS
UNIT-III FMM. DIMENSIONAL ANALYSIS
 
SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )SPICE PARK APR2024 ( 6,793 SPICE Models )
SPICE PARK APR2024 ( 6,793 SPICE Models )
 
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur EscortsHigh Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
High Profile Call Girls Nagpur Isha Call 7001035870 Meet With Nagpur Escorts
 
KubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghlyKubeKraft presentation @CloudNativeHooghly
KubeKraft presentation @CloudNativeHooghly
 
Microscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptxMicroscopic Analysis of Ceramic Materials.pptx
Microscopic Analysis of Ceramic Materials.pptx
 
Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)Java Programming :Event Handling(Types of Events)
Java Programming :Event Handling(Types of Events)
 
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
OSVC_Meta-Data based Simulation Automation to overcome Verification Challenge...
 
Extrusion Processes and Their Limitations
Extrusion Processes and Their LimitationsExtrusion Processes and Their Limitations
Extrusion Processes and Their Limitations
 
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service NashikCall Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
Call Girls Service Nashik Vaishnavi 7001305949 Independent Escort Service Nashik
 

How Anonymous Can Someone be on Twitter?

How anonymous we are on Twitter?

Reihane Boghrati
boghrati@usc.edu
Vinit Parakh
vparakh@usc.edu
George Sam
gsam@usc.edu
Nada Aldarrab
naldarra@usc.edu

Abstract

Authorship recognition is one of the well-studied areas in machine learning. However, there is less work on author identification of short texts, especially in an environment like Twitter, where text is limited to 140 characters per tweet. In this project we extracted features from around 3 million tweets from 100 different users, used them along with tf-idf vectors, and obtained 67% accuracy on the test dataset.

1 Introduction

Authorship attribution and identification has been an important topic of research since the 19th century. Among statistical and mathematical approaches, one of the earliest and most notable works is that of Mosteller and Wallace (1964) on the disputed authorship of some of the Federalist Papers. With advances in machine learning and natural language processing, authorship identification has become a more general problem in computer science. As the amount of text and documents available online has grown tremendously over the last decade, research in this area has become much more important. We intend to study and apply techniques from natural language processing to answer some of the most important questions in author identification.

Twitter is one of the most popular social networks and has experienced rapid growth. With 288 million monthly active users and 500 million tweets sent per day, Twitter users might assume that they have a certain level of anonymity. Our project investigates whether the author of an anonymous tweet can be identified using stylometry. We create a system that helps identify the original author of tweets and also identifies patterns in writing style among authors, which could help Twitter's recommendation system suggest followers to each user based on similarity in writing styles.

Early research on authorship attribution concentrated on building statistical models to identify writing patterns. Pattern attributes such as character count, word length, and word count were proposed as measures of writing style. However, most of the early work was computer-assisted, and no automated system had been developed to identify such patterns. In most cases, the testing ground was literary works of unknown or disputed authorship (e.g., the Federalist case), so estimating attribution accuracy was not even possible. The main methodological limitations of that period concerning the evaluation procedure were the following:
• The textual data were too long (usually including entire books) and probably not stylistically homogeneous.
• The number of candidate authors was too small (usually 2 or 3).
• The evaluation corpora were not controlled for topic.
• The evaluation of the proposed methods was mainly intuitive (usually based on subjective visual inspection of scatterplots).
• The comparison of different methods was difficult due to the lack of suitable benchmark data.

Working with Twitter data has many challenges of its own. Each author has a different pattern of writing tweets. Users tend to write about a variety of topics, describing different emotions and referring to different events and things. Capturing this data is very difficult because tweets do not always follow a strict pattern of grammar, and some tweets contain multiple grammatical mistakes. It is thus important to look at each individual tweet and create a normalized feature extraction mechanism, so that the results are not biased toward particular users. Also, prior studies have shown that existing off-the-shelf tools perform poorly on Twitter data, so the tools need to be modified before they are used on it.

2 Related Works

There has been work on author identification, but most of it has been done on blogs or books, where the amount of content is significant. Author identification on Twitter has been attempted by a few researchers, e.g., Antonio Castro et al. They extracted user information from multiple third-party sources, not just the Twitter API, and performed their experiment on about 800 users with 1000 tweets per user. They used a combination of bag of words (tf-idf) and some other features extracted from the tweets. For classification they used nearest-neighbor and regularized least squares classification.

In the other literature that we surveyed, current research falls broadly into two categories, NLP and ML. These are mainly categorized as follows.

STATISTICAL UNIVARIATE METHODS
• Naive Bayes classifier: A learning and classification method based on probability theory that uses Bayes' theorem to build a classifier. It has been used as a baseline model in most research.
• Cluster analysis: Cluster analysis is an exploratory data analysis tool for solving classification problems. In this approach, researchers sort the documents according to topics and groups, then compute the relative association between each pair of groups. Once a cumulative relation is built for every cluster pair, they classify the text to the individual users.
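To make the Naive Bayes baseline mentioned above concrete, the following is a minimal sketch of a multinomial Naive Bayes author classifier in plain Python. It is an illustration only, not the implementation used in this project or in the surveyed work: the class name, the whitespace tokenization, and the add-one (Laplace) smoothing parameter `alpha` are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesAuthorClassifier:
    """Hypothetical multinomial Naive Bayes baseline over whitespace tokens."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                       # smoothing constant (assumption)
        self.word_counts = defaultdict(Counter)  # author -> token counts
        self.doc_counts = Counter()              # author -> number of tweets
        self.vocabulary = set()

    def fit(self, tweets, authors):
        for text, author in zip(tweets, authors):
            tokens = text.lower().split()
            self.word_counts[author].update(tokens)
            self.doc_counts[author] += 1
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_author, best_score = None, -math.inf
        for author, n_docs in self.doc_counts.items():
            counts = self.word_counts[author]
            total_words = sum(counts.values())
            # log P(author) + sum over tokens of log P(token | author)
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log(
                    (counts[tok] + self.alpha)
                    / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_author, best_score = author, score
        return best_author


# Toy usage with made-up tweets and author labels.
clf = NaiveBayesAuthorClassifier().fit(
    ["lol so tired today", "lol cant even",
     "Markets closed higher today", "Quarterly earnings beat estimates"],
    ["a", "a", "b", "b"],
)
guess = clf.predict("lol tired")
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing term keeps a token unseen for one author from driving that author's probability to zero.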
MACHINE LEARNING techniques involve use of Neural Networks and SVMs • Feed-forward neural network : Most of the prior work in Machine Learning has been done using neural networks. The Feed for- ward neural network seemed to be the most widely used neural network in this case. In this type of NN the data flow is unidirectional and output of neurons cannot be sent back to the neurons in the previous layers. A feed forward neural network is an artificial neural network where connections between the units do not form a directed cycle. • Support Vector Machines: In machine learn- ing, the use of SVMs for classification is wide spread. SVMs are efficient in also analyzing data patterns and in regression and classifica- tion. For the case of author identification the use of SVM eventually makes it similar to a non-probabilistic binary linear classifier. 3 Data Collection We collected our data form Twitter using their API. It includes two steps of user selection and tweet collection as follows: 3.1 User Selection We randomly selected 100 Twitter accounts. These accounts had to have more than 3200 tweets according to our design. We selected these ac- counts using popular accounts as seeds, and fol- lowing the followers network to get random users. 3.2 Collecting Tweets We collected about 3200 tweets from each Twit- ter account using Tweepy and the official Twit- ter API. In collecting these tweets, we had to deal with some issues which included the following: 1. The Twitter API does not retrieve old tweets. 2. The Twitter API does not give access to pri- vate accounts. 3. The Twitter API does not filter tweets accord- ing to language (we got many non-English tweets). 4. There are some spam accounts that we got by random selection. 5. There are many representations of emoticons and many character encodings. We needed to develop some scripts to filter ac- counts and tweets to deal with these issues. In ad- dition, the Twitter API has some rate limits. 
Since our approach requires a large number of tweets, we needed several days to collect the number of tweets that we wanted (after filtering).
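The paper does not show its filtering scripts, so the following is only a minimal sketch of what a language filter for issue 3 might look like. The function names, the regular expressions, and the 0.8 threshold are all assumptions made for illustration. The heuristic simply measures the share of ASCII letters in a tweet after removing URLs and mentions; it will not separate English from other Latin-script languages, so a real pipeline would need a proper language identifier.

```python
import re

# Hypothetical patterns for tokens that carry no language signal.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def looks_english(text, min_ascii_ratio=0.8):
    """Crude check: fraction of ASCII letters among all letters.

    The threshold is an assumed value, not one taken from the paper.
    """
    stripped = MENTION_RE.sub("", URL_RE.sub("", text))
    letters = [ch for ch in stripped if ch.isalpha()]
    if not letters:
        # Nothing but URLs, mentions, digits, or symbols: drop it.
        return False
    ascii_letters = sum(1 for ch in letters if ch.isascii())
    return ascii_letters / len(letters) >= min_ascii_ratio


def filter_tweets(tweets, min_ascii_ratio=0.8):
    """Keep only the tweets that pass the heuristic above."""
    return [t for t in tweets if looks_english(t, min_ascii_ratio)]
```

Calling `filter_tweets` on a user's timeline before feature extraction would remove tweets written in non-Latin scripts and tweets that consist only of links.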
3.3 Data sets

We divided the data into training, development, and test sets using 80%, 10%, and 10% of the collected tweets, respectively. Tweets were assigned to these sets randomly to minimize the effect of the chronological order of the timeline.

3.4 Tokenization

We implemented a 2-stage tokenizer. In the first stage, we took care of common encoding and formatting issues. In the second stage, we tokenized tweets using Carnegie Mellon's Tweet NLP Tokenizer. Our data sets had about 5 million tokens in total.

4 Feature Extraction

In order to get the most information from tweets, we extracted several different sets of features. Each set captures a specific part of a user's writing style. In this section we introduce the features extracted from our dataset.

4.1 Basic Features

In our project we tried to extract informative features from tweets. Since the structure of tweets is not uniform, we wanted to extract information like word shapes, structure of words, and length of words. These features fall under the main feature category in our system. Some of the main features are: number of words, tweet length, number of different words, number of uppercase words, number of lowercase words, number of titlecase words, number of othercase words, percent of uppercase characters, percent of lowercase characters, number of nonascii characters, number of smiley, number of stopwords, number of slangwords.

Many users like to write tweets in upper case characters to highlight the importance of emotions or words, while many others prefer to use only lower case characters. This is captured by the percent of uppercase characters and percent of lowercase characters features, which give us important author-specific information.

Users often use up the entire 140-character limit of Twitter in their tweets.
The number of words, tweet length, and number of different words features capture this behavior. These features give us an idea of how long a user's tweets are. They also tell us whether the user likes using a lot of unique words or tends to repeat certain words, such as particular verbs or nouns. Authors who always tweet about the same entity tend to repeat those words a lot. For example, a student at USC who loves tweeting about USC will tend to use the words 'school', 'USC', and 'university' a lot, so the number of unique words in his tweets will be lower than for other types of users.

Twitter also has its own jargon. Phrases like 'LMFAO', 'YOLO', and 'DM' are used only on social media, and particularly on Twitter. It is important to find out how many times a user uses these words. We built a dictionary of all such words, which we call slang words. We keep a count of these words in every tweet and add it as a feature. Not all users use these phrases in their tweets, so the count helps in identifying specific types of users. These words have also helped in clustering the users according to the topics they talk about.

Other important main features include the counts of specific characters. Many users use a lot of "!" in their tweets, while many others use other special characters like "@" or "&". We capture this by keeping a count of such special characters in the features.

We also keep counts of individual letters as features in our feature vector. This is important for some very specific tweets and users. For example, a user who likes writing words like 'happyyyyy', 'louuuve', or 'knooooow' will tend to use such words a lot. This is captured by the character count features, since the counts for the repeated characters will be exceptionally high.
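The basic features described so far in this section can be sketched roughly as follows. This is an illustrative approximation and not the project's feature engine: whitespace tokenization stands in for the Tweet NLP tokenizer, the `STOPWORDS`, `SLANG`, and `SMILEYS` sets are tiny hypothetical samples of the dictionaries the paper mentions, and the underscore-separated feature names are assumed.

```python
import string
from collections import Counter

# Tiny hypothetical samples; the paper's actual dictionaries are not given.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}
SLANG = {"lol", "lmfao", "yolo", "dm"}
SMILEYS = {":)", ":-)", ":(", ":-(", ":D", ";)"}
SPECIAL_CHARS = "!@&#"


def word_case(token):
    """Assign each token to exactly one case category."""
    if token.isupper():
        return "uppercase"
    if token.islower():
        return "lowercase"
    if token.istitle():
        return "titlecase"
    return "othercase"


def basic_features(tweet):
    tokens = tweet.split()  # assumption: whitespace tokenization
    letters = [ch for ch in tweet if ch.isalpha()]
    cases = Counter(word_case(t) for t in tokens)

    feats = {
        "number_of_words": len(tokens),
        "tweet_length": len(tweet),
        # Lowercasing before counting distinct words is an assumption.
        "number_of_different_words": len({t.lower() for t in tokens}),
        "number_of_nonascii_characters": sum(not ch.isascii() for ch in tweet),
        "number_of_smiley": sum(t in SMILEYS for t in tokens),
        "number_of_stopwords": sum(t.lower() in STOPWORDS for t in tokens),
        "number_of_slangwords": sum(
            t.lower().strip(string.punctuation) in SLANG for t in tokens
        ),
    }
    for case in ("uppercase", "lowercase", "titlecase", "othercase"):
        feats["number_of_%s_words" % case] = cases[case]

    n_letters = len(letters)
    for name, test in (("uppercase", str.isupper), ("lowercase", str.islower)):
        hits = sum(test(ch) for ch in letters)
        feats["percent_of_%s_characters" % name] = (
            100.0 * hits / n_letters if n_letters else 0.0
        )

    # Per-character counts: elongated words such as 'knooooow' show up
    # as unusually high counts for the repeated letter.
    for ch, count in Counter(tweet.lower()).items():
        if ch.isalpha() or ch in SPECIAL_CHARS:
            feats["char_" + ch] = count
    return feats
```

Each tweet then becomes one row of numeric features, which can be concatenated with the tf-idf vector before classification.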
It is also important to note that some users like to retweet or mention other users in their tweets. We use these as features in our feature vector. Hashtags and mentions are two pieces of information that say a lot about a tweet. Hashtags usually mark the topic of the tweet, and mentions identify the persons involved in the discussion. We recognize the importance of this information and add these features to our feature set.

4.2 Word clusters

We utilize the idea of word clusters to deal with technically misspelled words such as gonna and
gna; so, sooo, and sooooooo; etc. We use clusters produced by an unsupervised HMM: Percy Liang's Brown clustering implementation (as described in the Tweet NLP project). We use about 900 clusters to represent tweets. We also add general clusters to handle common categories of tokens such as URLs. Using word clusters alone as a bag of clusters with a Naive Bayes classifier resulted in an increase of 6% in author prediction accuracy compared to a bag of words representation of tweets.

4.3 Part of Speech Tagging

Having words alone as features does not always highlight an author's writing patterns in the best way. We implemented a POS tagging system to extract features about the parts of speech that authors use, i.e., how many verbs, nouns, and adjectives appear in their tweets. These features help in understanding the style of tweets the author writes. Some authors also tend to add a lot of emoticons to their tweets, while others like to add punctuation marks (!, ?, ...). These also need to be captured as part of the author's style.

In our system we used a combination of the Stanford NLP libraries and a perceptron-based POS tagger to get POS tags for tweets. The Stanford NLP libraries provide a built-in POS tagger, which needs a model file in order to tag tokens. Since we are focusing on Twitter data, the major chunk of tweets do not follow standard English grammatical rules. This causes problems when using traditional off-the-shelf POS tagging tools to tag the words in tweets. Tokens like LOL, Knoooow, and URLs do not have a specific POS tag in the tag set of these tools. When tagging with the Stanford NLP library, we used a model file specifically designed for Twitter data. This file is available on the Stanford CoreNLP website. This model file allows us to use their POS tagger to tag tweets.
However, in our case we could only tag 76% of tweets correctly using this model file. We therefore also developed our own perceptron POS tagger, which gives us an accuracy of 95% on standard English POS tags. We generated a training file from the output of the Stanford CoreNLP tagger and used it to train our POS tagger. This helps us get tags for URLs, emoticons, and retweets.

We focused our tagger on collecting only some of the important POS tags. The following tags are the ones we want to use: "USR", "PRP", "VBD", "CC", "IN", "JJS", "NN", "NNS", "DT", "VBP", "VB", "VBG", "JJ", "UH", "RB", "TO", "VBN", "PRP$", "NNP", "VBZ", "URL", "WP", "MD", "WRB", "RT", "SYM", "CD", "WDT", "RP", "EX", "JJR", "RBR", "HT", "RBS", "MB", "POS", "PDT", "WP$", "FW", "$", "NNPS", "ON", "R", "LV", "FM", "PR", "J", "HR", "AE", "F", "O", "N", "CCG", "EA", "MP", "P", "T"

We want to find the occurrences of user tags, retweets, emoticons, punctuation, continuations, etc. in every tweet. The most helpful POS tag was HashTag (HT). With the help of the POS tagger we are able to extract some vital POS features. Our system keeps a count of every tag in the POS output, as well as counts of occurrences of certain tags, like punctuation and emoticons. These counts are added to the feature vector automatically. The POS tagger was integrated with the Feature Extraction Engine, which uses the tool to create a feature vector CSV file.

4.4 Paragraph Vector Representation

Conceptual representation of words and texts has always been a challenge. In recent years, more focus has been on distributed representations of words, where words are represented as n-dimensional vectors. Recently the focus has also shifted to distributed representations of larger texts. Specifically, we used an approach called Paragraph Vector, which simultaneously learns vector representations for words and context.
Using the Paragraph Vector, we trained the model on the tweets dataset and a large blog corpus to compute the vector representation for each tweet. The vectors have 100 dimensions, which capture the semantic and syntactic style of the tweets. We then used them as features for the classifiers to improve performance.

5 Methods

In the following section, the different methods used in our project are explained.

5.1 KMeans

In our system we also intend to perform clustering of tweets to find out what are the most common
topics talked about by the individual authors. We create a KMeans clustering system using scikit-learn for Python. This system runs a clustering algorithm on the tweets to find the most commonly discussed topics. These topics are extracted from the bag of words of the tweets. The system gives us a list of clusters with topics in each of them. The clusters are created dynamically, and the value of k is 10, i.e., the top 10 most talked about topics. This clustering can help us identify which topics are most talked about by the users.

The clustering system was tested on the training data from the user tweets. We extracted the top 3 clusters for each user. Each cluster has 10 words which the user talks about in the tweets. The clusters are cluster0, cluster1, and cluster2. Cluster0 denotes the cluster with the most talked about tweets for the user. The clustering is done using the sklearn module of scikit-learn. We use tf-idf for each tweet to compute similarity. Then we use the KMeans function in the sklearn module to do KMeans clustering on the training set, which is the vectorized form of all the tweets for the users. We then store the data from all the clusters.

5.2 Naive Bayes

The Naive Bayes classifier is a probabilistic classifier which is based on Bayes' theorem and assumes that features are independent of each other. It computes the probability of a given observation belonging to a class based on its features.

We used Naive Bayes with bag of words and other extracted features on a set of 10 users and compared its performance with the Support Vector Machine (SVM), which is explained later. The drawback of Naive Bayes was that it needs all the data to be in memory to compute the probabilities, so it is time and resource consuming at large scale.

5.3 Support Vector Machine

The Support Vector Machine (SVM) is another method of supervised classification. It analyzes data and finds patterns in the features of the training set, which are then applied to unseen data.
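As a rough illustration of how a linear SVM can be trained one example at a time, the sketch below runs stochastic gradient descent on the hinge loss. It is not the project's code. The learning rate of 0.01, the hinge loss, and the zero initial weights follow the settings reported in Section 7; the one-vs-rest scheme, the L2 regularization strength `reg`, the number of epochs, and all function names are assumptions made for this example.

```python
import random
from collections import defaultdict


def train_sgd_svm(samples, labels, classes, lr=0.01, reg=1e-4,
                  epochs=20, seed=0):
    """One-vs-rest linear SVM trained by SGD on the hinge loss.

    samples: list of sparse feature dicts {feature_name: value}
    labels:  list of class labels, one per sample
    """
    rng = random.Random(seed)
    # defaultdict(float) gives every weight an initial value of 0.
    weights = {c: defaultdict(float) for c in classes}
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit one example at a time in random order
        for i in order:
            x = samples[i]
            for c in classes:
                y = 1.0 if labels[i] == c else -1.0
                w = weights[c]
                margin = y * sum(w[f] * v for f, v in x.items())
                for f, v in x.items():
                    # Simplification: the L2 penalty is applied only to
                    # the features active in the current example.
                    grad = reg * w[f]
                    if margin < 1.0:  # hinge loss is active
                        grad -= y * v
                    w[f] -= lr * grad
    return weights


def predict(weights, x):
    """Return the class whose weight vector scores x highest."""
    def score(c):
        return sum(weights[c].get(f, 0.0) * v for f, v in x.items())
    return max(weights, key=score)
```

Because each update touches only one example, the full training set never has to be held in memory at once, which is the property that makes this approach attractive for a corpus of this size. A feature that never appeared in training keeps a weight of zero and therefore does not influence the score.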
We used SVM with tf-idf vectors and extracted features to conduct the author identification task. The advantage of SVM over Naive Bayes is that it is faster for larger amounts of data.

5.4 Architecture

Figure 1 shows an overall picture of our system components. More specifically, the system has the following major components:

Data Collector for Tweets: This component is responsible for collecting tweets from Twitter. We used the Twitter API to collect the tweets. This component then performs aggregation on the tweets.

Data Cleaner and Filter: The data collected from Twitter has a lot of noise and garbage. Many of the tweets were in languages other than English and needed to be filtered out, so we manually filtered these tweets out of the corpus. We also have a preliminary system to filter tweets with cuss words and abusive language. This is done in the data filter section.

Feature Engine: Once the data is collected, the next important step is to perform feature engineering on it. We used multiple methods to extract features from our corpus. We first divided the corpus into 3 sections:

Training Set 80%
Development Set 10%
Testing and Validation Set 10%

The feature engine plays one of the most crucial parts in the system. In this phase we divided the features into 4 main types:

• Main Features: These features are extracted from the raw text. A simple bag of words is not sufficient to extract vital information from the tweets. Users often write words like knooow, grrreat, etc. They extend some words to highlight the importance of such words in their tweets, and this needs to be captured. Also, not every author uses long words of length 5 or more. Many users like to use ! and ... in their tweets, and many users retweet other tweets and put URLs in their tweets. We intended to capture these features.
We have the following features classified as main features: number of words, tweet length, number of different words, number of uppercase words, number of lowercase words, number of titlecase words, number of othercase words, percent of uppercase characters, percent of lowercase characters, number of nonascii characters, number of smiley, number of stopwords, number of slangwords
Figure 1: System Architecture

These features highlight the way the tweet has been written. Most of the features consist of counts of these factors in the tweets.

• POS Tagged Features: Often, merely having the structure of words and the counts of different types of words in a sentence does not help in extracting vital information about the structure of the sentence. In our system, we also intend to find out the number of times authors describe the subjects they talk about. The verbs and adjectives used in the tweets play a vital role in this case. The grammatical category of a word is called its part of speech. We tried to annotate the words in the tweets with their parts of speech. However, this is difficult since the structure of tweets is very different from regular English grammar, and the words and jargon used on Twitter are very different as well. Using the Stanford NLP libraries and our own POS tagger, we performed POS tagging on the tweets to extract some information about them. We extracted POS tags as features and put them in the feature vector. This proves fruitful in gaining a good accuracy on author identification.

• Stylometric Features: In our system we also intended to identify style of writing through the use of some stylometric features. We used word clusters, based on the distance of words from each other. Words like knoooow, know, and noe mean the same thing, but a regular tokenizer or feature engine will not identify this similarity. We identified such similarities by creating word clusters, computing the distance between such words. These features are added to the vector space.

• Sentiment Features: We also built a sentiment analyser for our system that identifies the sentiment of the author while the tweet was being written.
The sentiment analyser has 5 classes of sentiments: Very Positive, Positive, Neutral, Negative, Very Negative. These sentiments can be used as features for the tweets as well. People tend to use Twitter to express what they feel, and most of the time these emotions follow a sustained pattern. In the course of feature extraction, we realized that users were consistent in the sentiments of their tweets. We used these tags as binary values in our feature set. This gives a lot of information about the kind of tweets the users write. We used the Stanford NLP tool for getting the sentiments of tweets. However, before giving a tweet to Stanford NLP, we preprocessed it to remove stop words, replace the slang words
like lol with lots of laughter to make more sense, remove punctuation marks, and replace emoticons with the expressions they stand for; for example, :-) and :-( were replaced with happy and sad, respectively.

6 Results

In the previous section we described two methods, Naive Bayes and SVM. Both classifiers were used to classify and label tweets based on their authors. We used the features extracted and described in Section 4 to train the classifiers, and later used the trained models to predict the authors of unseen tweets.

Figure 2 shows the results of running the Naive Bayes and SVM classifiers on a set of 10 users, using either just bag of words or a combination of bag of words and extracted features. Using the aforementioned features, the accuracy of both SVM and Naive Bayes improves; however, SVM outperforms Naive Bayes in both experiments.

Based on the better performance of SVM (and Naive Bayes being time and resource consuming on a large corpus), we continued our experiments using the SVM classifier. Figure 3 shows the results of running SVM on 10, 50, and 100 users. In each case we ran the experiment once using tf-idf alone and once using tf-idf along with the other described features. As shown, using the features increased the accuracy in all experiments.

7 Discussion and Future Work

We experimented with 2 major techniques, Naive Bayes and Support Vector Machine, using a variety of feature combinations to identify the author of a tweet. Even though Naive Bayes was not scalable to 100 users with 3200 tweets per user, it performed really well on 10 users using bag of words as features. Bag of words using tf-idf gave us our baseline score, and adding various features extracted from the tweets, such as main features, POS tags, doc2vec, and clustering, improved the accuracy. We were able to run Naive Bayes on at most 20 users. With Naive Bayes we get the best result by using the combination of all features.
This is because, with the 140-character limit on Twitter, we tried to extract as much information as we could from each tweet, and the more information we have, the better the classification results.

We tried SVM with Stochastic Gradient Descent as our next approach since it allowed us to scale to 100 users. Because the data is so large, it is not possible to load all of it into memory, so batch gradient descent and Newton's method were not an option for us. Hence, we used the Stochastic Gradient Descent approach for updating the weights. We also experimented with the learning rate for Stochastic Gradient Descent: we tried a few values and chose 0.01 as the optimum. This gave us better performance compared to Naive Bayes. We observed a very similar trend here as well: the best accuracy that we got was by using bag of words (tf-idf) in combination with all the other features that we had.

We tried various options with SGD. We tried initializing the weights to zero and using different loss functions as well. We got the best results when we used the hinge loss and initialized the weights to some random values. The reason we tried this is that there might be some feature values in the test data which were never seen in the training data. Initializing all weights to 0 means that we are giving a weight of 0 to these unseen instances, which might lead to classification errors. But giving a small initial weight to features did not seem to improve the accuracy in any way. So, we went ahead with initializing the weights to 0.

What we intended to do in this project is build a system that uses text on social media to analyze author writing patterns. It shows improvements over previous work by using a large corpus of users, and we have focused on text from Twitter. More work can be done on developing boosting algorithms with better performance.
We can associate confidence levels with individual classifiers and then aggregate the weighted output of each classifier to generate a final answer for the author of a tweet.

More research can be done on extracting better features from tweets, like vectorizing the parse tree into an n-dimensional vector. This would help in getting a definite structure for each tweet in the training set. This system can also be integrated with the existing Twitter recommendation system to identify writing patterns among users and suggest followers to each other based on topics that they commonly write about. The classification system can be improved by adding an intermediate layer of classification on clusters.

Another area where more work is needed is scalability. Our system currently works for 100 users, but needs to be modified to work on a large user corpus like Twitter's.

Figure 2: Naive Bayes Performance

Figure 3: Support Vector Machine Performance

Acknowledgment and Contribution

First of all, we want to thank Professor Kenji and Justin for the amazing class we had this semester; there is no need to say how much we learned during the course.

This was a team effort that none of us could have done individually, but the main tasks were divided as follows:

George: He mostly worked on extracting POS tagging features (which is challenging in the Twitter world) and some of the basic features mentioned earlier. He also tried clustering the tweets in order to do the author identification.

Nada: She worked on collecting tweets and then tokenizing them to make them ready for the next step. She also worked on extracting features (word clusters, hashtags, and mentions).

Reihane: She first worked on getting random users from Twitter who tweet in English and have a reasonable number of tweets. Then she worked on classification (Naive Bayes and SVM). She also extracted vector representations of tweets using the Paragraph Vector method.

Vinit: He extracted the main features (described in the basic features section), including sentiment analysis of tweets. He then worked on the SVM and SGD classifiers.

BitBucket: https://bitbucket.org/georgesam/csci544-project