How Anonymous Can Someone be on Twitter?

How anonymous are we on Twitter?
Reihane Boghrati
boghrati@usc.edu
Vinit Parakh
vparakh@usc.edu
George Sam
gsam@usc.edu
Nada Aldarrab
naldarra@usc.edu
Abstract
Authorship recognition is a well-studied area of machine learning. However, less work has been done on author identification of short texts, especially in an environment like Twitter, where text is limited to 140 characters per tweet. In this project we extracted features from around 3 million tweets from 100 different users, combined them with tf-idf vectors, and achieved 67% accuracy on the test dataset.
1 Introduction
Authorship attribution and identification has been an important topic of research since the 19th century. Among statistical and mathematical approaches, one of the earliest available works is that of Mosteller and Wallace (1964) on the disputed authorship of some of the Federalist Papers. With advances in machine learning and natural language processing, authorship identification has become a more general problem in computer science. As the volume of text and documents available online has grown tremendously over the last decade, the importance of substantial research in this area has grown with it. We intend to study and apply some natural language processing techniques to answer some of the most important questions in author identification. Twitter is one of the most popular social networks and has experienced rapid growth. With 288 million monthly active users and 500 million tweets sent per day, Twitter users might assume that they have a certain level of anonymity. Our project investigates whether the author of an anonymous tweet can be identified using stylometry. We create a system that helps identify the original author of tweets and also identifies patterns in writing style among authors, which could help Twitter's recommendation system suggest followers to each other based on similarity in writing style.
Early research in author attribution concentrated on building statistical models to identify writing patterns. Attributes such as character count, word length, and word count were proposed as measures of writing style. However, most of the early work was computer-assisted, and no automated system had been developed to identify such patterns. In most cases the testing ground was literary works of unknown or disputed authorship (e.g., the Federalist case), so estimating attribution accuracy was not even possible. The main methodological limitations of that period concerning the evaluation procedure were the following:
• The textual data were too long (usually in-
cluding entire books) and probably not stylis-
tically homogeneous.
• The number of candidate authors was too
small (usually 2 or 3).
• The evaluation corpora were not controlled
for topic.
• The evaluation of the proposed methods was
mainly intuitive (usually based on subjective
visual inspection of scatterplots).
• The comparison of different methods was dif-
ficult due to lack of suitable benchmark data.
Working with Twitter data has many challenges of its own. Each author has a different pattern of writing tweets. Users tend to write about a variety of topics, describing different emotions and referring to different events and things. Capturing this data is difficult because tweets do not always follow strict grammar, and some tweets contain multiple grammatical mistakes. It is thus important to look at each individual tweet and create a normalized feature extraction mechanism, so that the features are not biased toward particular users. Also, studies have shown that existing off-the-shelf tools perform poorly on Twitter data, so the tools need to be modified before they can be used on it.
2 Related Works
Work has been done on author identification, but most of it targets blogs or books, where the amount of content is significant. Author identification on Twitter has been attempted by a few researchers, e.g., Antonio Castro et al. They extracted user information from multiple third-party sources rather than from the Twitter API alone, and ran their experiment on about 800 users with 1000 tweets per user. They used a combination of bag of words (tf-idf) and other features extracted from the tweet, with nearest-neighbor and regularized least squares classification. In the other literature that we surveyed, current research is broadly classified into two categories, NLP and ML. These are mainly categorized as statistical univariate methods and machine learning techniques.
STATISTICAL UNIVARIATE METHODS
• Naive Bayes classifier: A learning and classification method based on probability theory that uses Bayes' theorem to build the classifier. It has been used as a baseline model in most research.
• Cluster analysis: Cluster analysis is an exploratory data analysis tool for solving classification problems. Here, researchers sort the documents by topic and group, then compute the relative association between each pair of groups. Once a cumulative relation has been built for every cluster pair, they assign the text to individual users.
MACHINE LEARNING techniques involve the use of neural networks and SVMs.
• Feed-forward neural network: Most of the prior machine learning work has been done using neural networks, and the feed-forward network appears to be the most widely used type. In a feed-forward neural network the data flow is unidirectional: the connections between units do not form a directed cycle, so the output of a neuron cannot be sent back to neurons in previous layers.
• Support Vector Machines: SVMs are widely used for classification in machine learning. They are effective at analyzing data patterns for both regression and classification. For author identification, the SVM acts as a non-probabilistic binary linear classifier.
3 Data Collection
We collected our data from Twitter using its API. The process has two steps, user selection and tweet collection, as follows:
3.1 User Selection
We randomly selected 100 Twitter accounts.
These accounts had to have more than 3200 tweets
according to our design. We selected these ac-
counts using popular accounts as seeds, and fol-
lowing the followers network to get random users.
3.2 Collecting Tweets
We collected about 3200 tweets from each Twit-
ter account using Tweepy and the official Twit-
ter API. In collecting these tweets, we had to deal
with some issues which included the following:
1. The Twitter API does not retrieve old tweets.
2. The Twitter API does not give access to pri-
vate accounts.
3. The Twitter API does not filter tweets accord-
ing to language (we got many non-English
tweets).
4. There are some spam accounts that we got by
random selection.
5. There are many representations of emoticons
and many character encodings.
We needed to develop some scripts to filter ac-
counts and tweets to deal with these issues. In ad-
dition, the Twitter API has some rate limits. Since
our approach requires a large number of tweets,
we needed several days to be able to get the num-
ber of tweets that we wanted (after filtering).

3.3 Data sets
We divided the data into training, development,
and test sets using 80%, 10%, and 10% of the col-
lected tweets, respectively. Splitting tweets into
these categories was done randomly to minimize
the effect of the chronological order of the time-
line.
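The random 80/10/10 split can be sketched as follows. This is a minimal illustration, not the project's actual script: the function name `split_tweets`, the fixed seed, and the choice to shuffle one flat list of tweets (rather than splitting per user) are assumptions.

```python
import random

def split_tweets(tweets, train=0.8, dev=0.1, seed=0):
    """Shuffle tweets and cut them into train/dev/test portions."""
    shuffled = list(tweets)
    # Shuffling removes the chronological order of the timeline.
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_dev = int(len(shuffled) * dev)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

# Hypothetical toy data: 100 placeholder tweets.
all_tweets = [f"tweet {i}" for i in range(100)]
train_set, dev_set, test_set = split_tweets(all_tweets)
```

Using a fixed seed keeps the split reproducible across runs, which makes accuracy numbers comparable between experiments.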
3.4 Tokenization
We implemented a two-stage tokenizer. In the first stage, we took care of common encoding and formatting issues. In the second stage, we tokenized tweets using Carnegie Mellon's Tweet NLP tokenizer. Our data sets had about 5 million tokens in total.
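A rough sketch of such a two-stage pipeline is shown below. The first stage illustrates the kind of encoding and formatting cleanup described; the specific steps (HTML entity unescaping, Unicode NFKC normalization, whitespace collapsing) are assumptions. The second stage is a simple regular-expression stand-in, not the Tweet NLP tokenizer itself.

```python
import html
import re
import unicodedata

def normalize_tweet(text):
    """Stage 1: repair common encoding and formatting issues (assumed steps)."""
    text = html.unescape(text)                  # "&amp;" -> "&"
    text = unicodedata.normalize("NFKC", text)  # unify equivalent code points
    text = re.sub(r"\s+", " ", text)            # collapse newlines and repeated spaces
    return text.strip()

def simple_tokenize(text):
    """Stage 2 stand-in: a regex tokenizer, NOT the Tweet NLP tokenizer."""
    # Order matters: URLs, then @mentions/#hashtags, then words, then punctuation runs.
    pattern = r"https?://\S+|[@#]\w+|\w+(?:'\w+)?|[^\w\s]+"
    return re.findall(pattern, normalize_tweet(text))
```

Keeping URLs, mentions, and hashtags as single tokens matters here because later features count them directly.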
4 Feature Extraction
In order to get the most information from tweets,
several different features were extracted from
tweets. Each set of features captures a specific part
of users’ writing style. In this section we introduce
the extracted features from our dataset.
4.1 Basic Features
In our project we tried to extract informative features from tweets. Since the structure of tweets is not uniform, we wanted to extract information such as word shapes, word structure, and word length. These features fall under the main feature category in our system. Some of the main features are: number of words, tweet length, number of different words, number of uppercase words, number of lowercase words, number of titlecase words, number of othercase words, percent of uppercase characters, percent of lowercase characters, number of nonascii characters, number of smiley, number of stopwords, and number of slangwords.
Many users like to write tweets in uppercase characters to highlight the importance of emotions or words, while many others use only lowercase characters. This is captured by the percent of uppercase characters and percent of lowercase characters features, which give us important author-specific information.
Users often use up the entire 140-character limit in their tweets. This behavior is captured by the number of words, tweet length, and number of different words features. These features give us an idea of how long a user's tweets are. They also tell us whether the user likes using many unique words or tends to repeat certain words, such as particular verbs or nouns. Authors who always tweet about the same entity tend to repeat those words a lot. For example, a student at USC who loves tweeting about USC will tend to use the words 'school', 'USC', and 'university' a lot, so the number of unique words in their tweets will be lower than for other types of users.
Twitter also has its own jargon. Abbreviations like 'LMFAO', 'YOLO', and 'DM' are used mainly on social media, and particularly on Twitter. It is important to find out how often a user uses these words. We built a dictionary of all such words, which we call slang words, and added the count of these words in each tweet as a feature. Not all users use these words in their tweets, so the feature helps in identifying specific types of users. These words have also helped in clustering users according to the topics they talk about.
Other important main features include counts of specific characters. Many users use a lot of '!' in their tweets, while others use special characters like '@' or '&'. We capture this by keeping a count of such special characters in the feature vector. We also keep counts of individual letters as features. This matters for some very specific tweets and users. For example, a user who likes writing words like 'happyyyyy', 'louuuve', and 'knooooow' will tend to use such words a lot. This is captured by the character count features, since the counts for the repeated characters will be exceptionally high. It is also important to note that some users like to retweet or mention other users in their tweets. We add these as features to our feature vector.
Hashtags and mentions are two pieces of infor-
mation which say a lot about a tweet. Hashtags
usually mark the topic of the tweet, and mentions
identify the persons involved in the discussion. We
recognize the importance of this information and
add these features to our feature set.
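The basic features described in this section can be sketched as a single function that maps a tweet to a dictionary of counts. This is an illustrative sketch only: the underscore-style key names, the tiny `STOPWORDS` and `SLANG_WORDS` sets, and whitespace splitting in place of the real tokenizer are all assumptions.

```python
import string

# Illustrative subsets; the project used its own, larger dictionaries.
STOPWORDS = {"the", "a", "an", "is", "to", "and", "of", "in"}
SLANG_WORDS = {"lol", "lmfao", "yolo", "dm"}

def basic_features(tweet):
    """Return a dict of simple count-based style features for one tweet."""
    words = tweet.split()  # whitespace split stands in for the real tokenizer
    bare = [w.lower().strip(string.punctuation) for w in words]
    letters = [c for c in tweet if c.isalpha()]
    n_letters = max(len(letters), 1)  # avoid division by zero
    return {
        "number_of_words": len(words),
        "tweet_length": len(tweet),
        "number_of_different_words": len(set(bare)),
        "number_of_uppercase_words": sum(w.isupper() for w in words),
        "number_of_lowercase_words": sum(w.islower() for w in words),
        "number_of_titlecase_words": sum(w.istitle() for w in words),
        "percent_of_uppercase_characters": sum(c.isupper() for c in letters) / n_letters,
        "percent_of_lowercase_characters": sum(c.islower() for c in letters) / n_letters,
        "number_of_nonascii_characters": sum(ord(c) > 127 for c in tweet),
        "number_of_stopwords": sum(w in STOPWORDS for w in bare),
        "number_of_slangwords": sum(w in SLANG_WORDS for w in bare),
        "number_of_hashtags": sum(w.startswith("#") for w in words),
        "number_of_mentions": sum(w.startswith("@") for w in words),
        "count_!": tweet.count("!"),
    }

features = basic_features("LOL I love USC!!! @bob #fighton")
```

Per-letter counts, as described above, could be added in the same way with one key per character.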
4.2 Word clusters
We use word clusters to deal with nonstandard spellings such as gonna and gna, or so, sooo, and sooooooo. We use clusters produced by an unsupervised HMM: Percy Liang's Brown clustering implementation (as described in the Tweet NLP project).
We use about 900 clusters to represent tweets. We also add general clusters to handle common categories of tokens such as URLs. Using word clusters alone as a bag of clusters with a Naive Bayes classifier increased author prediction accuracy by 6% compared to a bag-of-words representation of tweets.
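A bag-of-clusters representation can be sketched as a lookup from token to cluster id followed by counting. The table below is a hypothetical toy example: real Brown clusters are identified by bit-string paths, and the ids `C_FUTURE`, `C_SO`, `C_URL`, and `C_UNKNOWN` are made up for illustration.

```python
from collections import Counter

# Hypothetical word -> cluster-id table (the real one has about 900 clusters).
WORD_TO_CLUSTER = {
    "gonna": "C_FUTURE", "gna": "C_FUTURE",
    "so": "C_SO", "sooo": "C_SO", "sooooooo": "C_SO",
}

def to_cluster(token):
    """Map a token to its cluster id, with a general cluster for URLs."""
    if token.startswith(("http://", "https://")):
        return "C_URL"
    return WORD_TO_CLUSTER.get(token.lower(), "C_UNKNOWN")

def bag_of_clusters(tokens):
    """Count cluster ids instead of surface words."""
    return Counter(to_cluster(t) for t in tokens)

bag = bag_of_clusters(["gonna", "be", "sooo", "late", "gna", "http://t.co/x"])
```

Because gonna and gna fall into the same cluster, both spellings contribute to the same feature, which is the effect the section describes.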
4.3 Part of Speech Tagging
Words alone as features do not always highlight an author's writing patterns in the best way. We implemented a POS tagging system to extract features about the parts of speech that authors use, such as how many verbs, nouns, and adjectives appear in their tweets. These features help in understanding the style of tweets an author writes. Some authors also tend to add a lot of emoticons to their tweets, while others like to add punctuation marks (!, ?, ...). These features need to be captured as part of the author's style.
In our system we used a combination of the Stanford NLP libraries and a perceptron-based POS tagger to get POS tags for tweets. The Stanford NLP libraries provide a built-in POS tagger, which needs a model file to tag tokens. Since we focus on Twitter data, the bulk of the tweets do not follow standard English grammatical rules. This causes problems when using traditional off-the-shelf POS tagging tools on tweets: words like LOL and Knoooow, as well as URLs, do not have a specific POS tag in the tag set of these tools. When tagging with the Stanford NLP library we used a model file specifically designed for Twitter data, which is available on the Stanford CoreNLP website. This model file allows us to use the Stanford POS tagger on tweets, but in our case it tagged only 76% of tweets correctly. We also developed our own perceptron POS tagger, which reaches an accuracy of 95% on standard English POS tags. We used the output of the Stanford CoreNLP tagger as the training file for our tagger. This helps us get tags for URLs, emoticons, and retweets.
We focused our tagger on collecting only some of the important POS tags. The following tags are the ones we want to use: "USR", "PRP", "VBD", "CC", "IN", "JJS", "NN", "NNS", "DT", "VBP", "VB", "VBG", "JJ", "UH", "RB", "TO", "VBN", "PRP$", "NNP", "VBZ", "URL", "WP", "MD", "WRB", "RT", "SYM", "CD", "WDT", "RP", "EX", "JJR", "RBR", "HT", "RBS", "MB", "POS", "PDT", "WP$", "FW", "$", "NNPS", "ON", "R", "LV", "FM", "PR", "J", "HR", "AE", "F", "O", "N", "CCG", "EA", "MP", "P", "T"
We want to find occurrences of user tags, retweets, emoticons, punctuation, continuations, etc. in every tweet. The most helpful POS tag was the hashtag (HT). With the help of the POS tagger we are able to extract some vital POS features. Our system keeps a count of every POS tag, as well as counts of certain tag groups such as punctuation and emoticons. These counts are added to the feature vector automatically. The POS tagger is integrated with the feature extraction engine, which uses it to create a feature vector CSV file.
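Turning tagger output into count features can be sketched as below. The sketch assumes the tagger returns a list of (token, tag) pairs and uses only a small subset of the tag set; the function name `pos_count_vector` is hypothetical.

```python
from collections import Counter

# A small subset of the tag set, fixed in order so every tweet
# yields a vector with the same layout.
TAGS = ["USR", "HT", "URL", "RT", "NN", "VB", "JJ", "UH"]

def pos_count_vector(tagged_tokens, tags=TAGS):
    """Count how often each tag occurs in one tagged tweet."""
    counts = Counter(tag for _, tag in tagged_tokens)
    return [counts.get(tag, 0) for tag in tags]

# Hypothetical tagger output for one tweet.
tagged = [("RT", "RT"), ("@bob", "USR"), ("great", "JJ"),
          ("game", "NN"), ("#usc", "HT"), ("#fighton", "HT")]
vector = pos_count_vector(tagged)
```

Each such vector would then become one row (or part of a row) in the feature CSV file.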
4.4 Paragraph Vector Representation
Conceptual representation of words and texts has always been a challenge. In recent years, more focus has been placed on distributed representations of words, where each word is represented as an n-dimensional vector. More recently, the focus has also shifted to distributed representations of larger texts. Specifically, we used an approach called Paragraph Vector, which simultaneously learns vector representations for words and context.
Using Paragraph Vector, we trained the model on the tweets dataset and a large blog corpus to compute a vector representation for each tweet. The vectors have 100 dimensions, which capture the semantic and syntactic style of the tweets. We then used them as features for the classifiers to improve performance.
5 Methods
This section explains the methods used in our project.
5.1 KMeans
In our system we also cluster tweets to find the topics that individual authors talk about most. We built a KMeans clustering system using scikit-learn for Python. This system runs a clustering algorithm on the tweets to find the most commonly discussed topics, which are extracted from the bag of words of the tweets. The system gives us a list of clusters with the topics in each. The clusters are created dynamically, and k is set to 10, i.e., the top 10 most talked-about topics. This clustering can help us identify which topics users talk about most.
The clustering system was tested on the training data from the user tweets. We extracted the top 3 clusters for each user. Each cluster has 10 words that the user talks about in their tweets. The clusters are cluster0, cluster1, and cluster2, where cluster0 denotes the most talked-about cluster for the user.
The clustering is done using the sklearn module of scikit-learn. We compute a tf-idf vector for each tweet to measure similarity. Then we use the KMeans class in sklearn to run k-means clustering on the training set, which is the vectorized form of all the users' tweets. We then store the data from all the clusters.
5.2 Naive Bayes
The Naive Bayes classifier is a probabilistic classifier based on Bayes' theorem, with the assumption that features are independent of each other given the class. It computes the probability that a given observation belongs to a class based on its features.
We used Naive Bayes with bag of words and other extracted features on a set of 10 users and compared its performance with the Support Vector Machine (SVM), which is explained in the next section.
The drawback of Naive Bayes in our setup was that it needed all the data in memory to compute the probabilities, so it would be time- and resource-consuming at large scale.
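For illustration, a multinomial Naive Bayes classifier over bag-of-words tokens can be written in a few lines. This sketch assumes add-one (Laplace) smoothing and token counts as the only features; the class name and toy data are hypothetical, and it is not the implementation used in the experiments.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # author -> token counts
        self.doc_counts = Counter(labels)        # author -> number of tweets
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, tokens):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log P(author) + sum of log P(token | author)
            score = math.log(n_docs / total_docs)
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = NaiveBayes().fit(
    [["go", "trojans"], ["fight", "on", "trojans"],
     ["new", "python", "release"], ["python", "bug"]],
    ["alice", "alice", "bob", "bob"],
)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.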
5.3 Support Vector Machine
The Support Vector Machine (SVM) is another method of supervised classification. It analyzes data and finds patterns in the features of the training set, which it then applies to unseen data.
We used SVM with tf-idf vectors and extracted features to conduct the author identification task. The advantage of SVM over Naive Bayes in our experiments is that it is faster on larger amounts of data.
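The classifier input can be pictured as a tf-idf vector with the extracted features appended. The sketch below computes tf-idf by hand to make the weighting explicit; the smoothed idf formula and the helper names are assumptions, and simple concatenation is only one plausible way to combine the two feature groups.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed inverse document frequency for every term in the corpus."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(tokens, idf, vocab):
    """Term frequency times idf, in a fixed vocabulary order."""
    tf = Counter(tokens)
    return [tf[term] * idf[term] for term in vocab]

def combined_vector(tokens, extra_features, idf, vocab):
    """Append hand-crafted features (counts, POS tags, ...) to the tf-idf part."""
    return tfidf_vector(tokens, idf, vocab) + list(extra_features)

docs = [["fight", "on"], ["fight", "python"]]
idf = fit_idf(docs)
vocab = sorted(idf)  # ["fight", "on", "python"]
vec = combined_vector(["on", "on"], [2, 0.5], idf, vocab)
```

Terms that occur in fewer tweets receive a higher idf, so rare, author-specific words weigh more than common ones.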
5.4 Architecture
Figure 1 shows an overall picture of our system components. More specifically, the system has the following major components:
Data Collector for Tweets: This component is responsible for collecting tweets from Twitter. We used the Twitter API to collect the tweets. The component then aggregates them.
Data Cleaner and Filter: The data collected from Twitter has a lot of noise and garbage. Many of the tweets were in languages other than English and needed to be filtered out, so we manually removed them from the corpus. We also have a preliminary system to filter tweets with swear words and abusive language. This is done in the data filter section.
Feature Engine: Once the data is collected, the next important step is feature engineering. We used multiple methods to extract features from our corpus. We first divided the corpus into 3 sections: training set (80%), development set (10%), and testing and validation set (10%). The feature engine plays one of the most crucial parts in the system. In this phase we divided the features into 4 main types:
• Main Features: These features are extracted from the raw text. A simple bag of words is not sufficient to extract vital information from tweets. Users often write words like knooow and grrreat, extending them to highlight their importance in the tweet. These features need to be captured. Also, not every author uses long words of length 5 or more. Many users like to use ! and ... in their tweets, and many users retweet other tweets and put URLs in their tweets. We intended to capture these features. We classify the following as main features: number of words, tweet length, number of different words, number of uppercase words, number of lowercase words, number of titlecase words, number of othercase words, percent of uppercase characters, percent of lowercase characters, number of nonascii characters, number of smiley, number of stopwords, and number of slangwords. These features highlight the way the tweet has been written. Most of them are counts of these factors in the tweet.
Figure 1: System Architecture
• POS Tagged Features: Often, merely having the structure of words and the counts of different types of words in a sentence does not help in extracting vital information about the structure of the sentence. In our system, we also want to find out how authors describe the subjects they talk about, and the verbs and adjectives used in the tweets play a vital role here. The grammatical category of a word is called its part of speech. We tried to annotate the words in the tweets with parts of speech. However, this is difficult, since the structure of tweets is very different from regular English grammar, and the words and jargon used on Twitter are very different as well. Using the Stanford NLP libraries and our own POS tagger, we performed POS tagging on the tweets and extracted some information about them. We extracted POS tags as features and put them in the feature vector. This proved fruitful in gaining good accuracy on author identification.
• Stylometric Features: In our system we also try to identify writing style through some stylometric features. We used word clusters, based on the distance of words from each other. Words like knoooow, know, and noe mean the same thing, but a regular tokenizer or feature engine will not identify this similarity. We identified such similarities by creating word clusters based on the distance between such words. These features are added to the vector space.
• Sentiment Features: We also built a sentiment analyzer for our system that identifies the sentiment of the author when the tweet was written. The sentiment analyzer has 5 classes of sentiment: Very Positive, Positive, Neutral, Negative, and Very Negative. These sentiments can be used as features for the tweets as well. People tend to use Twitter to express what they feel, and most of the time these emotions follow a sustained pattern. In the course of feature extraction, we found that users were consistent in the sentiment of their tweets. We used these tags as binary values in our feature set, which gives a lot of information about the kind of tweets users write. We used the Stanford NLP tool to get the sentiment of tweets. However, before giving a tweet to Stanford NLP, we preprocessed it: we removed stop words, replaced slang words such as lol with lots of laughter so they make more sense, removed punctuation marks, and replaced emoticons with the expressions they stand for, e.g., :-) and :-( were replaced with happy and sad, respectively.
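The sentiment preprocessing described in the last bullet can be sketched as follows. The dictionaries are tiny illustrative subsets, and the order of the steps is an assumption: emoticons are replaced first here, because stripping punctuation first would destroy them.

```python
import re

# Illustrative subsets; the real dictionaries are assumed to be larger.
STOPWORDS = {"the", "a", "is", "to", "at"}
SLANG = {"lol": "lots of laughter"}
EMOTICONS = {":-)": "happy", ":-(": "sad"}

def preprocess_for_sentiment(tweet):
    """Clean a tweet before passing it to a sentiment analyzer."""
    # Replace emoticons before punctuation is stripped.
    for emoticon, word in EMOTICONS.items():
        tweet = tweet.replace(emoticon, f" {word} ")
    cleaned = []
    for token in tweet.lower().split():
        token = re.sub(r"[^\w\s]", "", token)   # remove punctuation marks
        if not token or token in STOPWORDS:     # drop empty tokens and stop words
            continue
        cleaned.append(SLANG.get(token, token)) # expand slang words
    return " ".join(cleaned)

cleaned_tweet = preprocess_for_sentiment("LOL the game is great!!! :-)")
```

Slang is expanded after the stop-word check so that words inside an expansion, such as "of", are not removed again.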
6 Results
In the previous section we described two methods, Naive Bayes and SVM. Both classifiers were used to label tweets with their authors. We used the features described in Section 4 to train each classifier and then used the trained model to predict the author of unseen tweets.
Figure 2 shows the results of running the Naive Bayes and SVM classifiers on a set of 10 users, using either bag of words alone or a combination of bag of words and extracted features. Adding the extracted features improves the accuracy of both SVM and Naive Bayes; however, SVM outperforms Naive Bayes in both experiments.
Based on the better performance of SVM (and Naive Bayes being time- and resource-consuming on a large corpus), we continued our experiments using the SVM classifier. Figure 3 shows the results of running SVM on 10, 50, and 100 users. In each case we ran the experiment once using tf-idf alone and once using tf-idf along with the other described features. In all experiments, using the features increased the accuracy.
7 Discussion and Future Work
We experimented with two major techniques, Naive Bayes and Support Vector Machine, using a variety of feature combinations to identify the author of a tweet. Although Naive Bayes did not scale to 100 users with 3200 tweets per user, it performed really well on 10 users with bag of words as features. Bag of words using tf-idf gave us our baseline score, and adding features extracted from the tweets, such as main features, POS tags, doc2vec, and clustering, improved the accuracy. We were able to run Naive Bayes on at most 20 users. With Naive Bayes we get the best result by using the combination of all features. This is because, with the 140-character limit on Twitter, we tried to extract as much information as we could from each tweet, and the more information we have, the better the classification results.
We tried SVM with Stochastic Gradient Descent as our next approach, since it allowed us to scale to 100 users. Because the data is so large, it is not possible to load all of it into memory, so batch gradient descent and Newton's method were not an option for us. Hence, we used the Stochastic Gradient Descent approach for updating the weights. We also experimented with the learning rate for Stochastic Gradient Descent: we tried a few values and chose 0.01 as the optimum. This gave us better performance than Naive Bayes. We observed a very similar trend here as well: the best accuracy came from using bag of words (tf-idf) in combination with all the other features that we had. We tried various options with SGD, including different loss functions and different weight initializations. We got the best results with the hinge loss. For initialization, we compared zero weights with small random values. The reason for trying random values is that there might be feature values in the test data that were never seen in the training data; initializing all weights to 0 gives these unseen features a weight of 0, which might lead to classification errors. But giving features a small initial weight did not seem to improve the accuracy in any way, so we went ahead with initializing the weights to 0.
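The training setup described here (hinge loss, a constant learning rate of 0.01, zero-initialized weights, one example at a time) can be sketched in plain Python. The one-vs-rest scheme for handling many authors, the small L2 regularization term `reg`, and the number of epochs are assumptions made to keep the sketch complete.

```python
import random

def train_sgd_svm(samples, n_features, n_classes, lr=0.01, epochs=20, reg=1e-4, seed=0):
    """One-vs-rest linear SVM trained by SGD on the hinge loss; weights start at zero."""
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    rng = random.Random(seed)
    data = list(samples)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:                      # one training example per update
            for c in range(n_classes):
                target = 1.0 if y == c else -1.0
                score = sum(w * xi for w, xi in zip(W[c], x)) + b[c]
                violated = target * score < 1  # hinge loss is active
                for j in range(n_features):
                    grad = reg * W[c][j]
                    if violated:
                        grad -= target * x[j]
                    W[c][j] -= lr * grad
                if violated:
                    b[c] += lr * target
    return W, b

def predict(W, b, x):
    """Return the class whose linear score is highest."""
    scores = [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]
    return scores.index(max(scores))

# Hypothetical toy data: two authors, two features.
toy = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1), ([0.1, 0.9], 1)]
W, b = train_sgd_svm(toy, n_features=2, n_classes=2, epochs=50)
```

Since each update touches only one example, the full data set never has to be held in memory at once, which is the property that made SGD attractive here.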
What we intended to do in this project is build a system that uses text on social media to analyze author writing patterns. It improves on previous work by using a large corpus of users, and we focused on text from Twitter. More work can be done on developing boosting algorithms with better performance. We can associate confidence levels with individual classifiers and then aggregate the weighted output of each classifier to generate a final answer for the author of a tweet.
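One simple way to aggregate weighted classifier outputs is a confidence-weighted vote. This is only a sketch of the proposed direction, not something built in the project; the function name and the (author, confidence) input format are assumptions.

```python
from collections import defaultdict

def weighted_vote(predictions):
    """Pick the author with the largest total confidence.

    predictions: iterable of (author, confidence) pairs, one per classifier.
    """
    totals = defaultdict(float)
    for author, confidence in predictions:
        totals[author] += confidence
    return max(totals, key=totals.get)

# Hypothetical outputs of three classifiers for the same tweet.
winner = weighted_vote([("alice", 0.9), ("bob", 0.5), ("bob", 0.3)])
```

A single confident classifier can outvote two less confident ones in this scheme, which is the intended difference from plain majority voting.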
More research can be done on extracting better features from tweets, such as vectorizing the parse tree into an n-dimensional vector. This would help in getting a definite structure for each tweet in the training set. This system can also be integrated with the existing Twitter recommendation system to identify writing patterns among users and suggest followers to each other based on topics that they commonly write about. The classification system can be improved by adding an intermediate layer of classification on clusters.
Another area where more work is needed is scalability. Our system currently works for 100 users, but needs to be modified to work on a large user corpus like Twitter's.
Figure 2: Naive Bayes Performance
Figure 3: Support Vector Machine Performance
Acknowledgment and Contribution
First of all, we want to thank Professor Kenji and Justin for the amazing class we had this semester; there is no need to say how much we learned during the course.
This was a team effort that none of us could have done individually, but the main tasks were divided as follows:
George: He mostly worked on extracting POS tagging features (which is challenging in the Twitter world) and some of the basic features mentioned earlier. He also tried clustering the tweets in order to do the author identification.
Nada: She worked on collecting tweets and then tokenized them to make them ready for the next step. She also worked on extracting features (word clusters, hashtags, and mentions).
Reihane: She first worked on getting random users from Twitter who tweet in English and have a reasonable number of tweets. Then she worked on classification (Naive Bayes and SVM). She also extracted vector representations of tweets using the Paragraph Vector method.
Vinit: He extracted the main features (described in the basic features section), including sentiment analysis of tweets. He then worked on the SVM and SGD classifiers.
BitBucket: https://bitbucket.org/georgesam/csci544-project
