This is part of a presentation for QCon New York 2015. On April 23, 2013 the stock market experienced one of its biggest flash-crash drops of the year, with the Dow Jones industrial average falling 143 points (over 1%) in a matter of minutes. Unlike the 2012 stock market blip, this one wasn't caused by an individual trade, but rather by a single tweet from The Associated Press account on Twitter. The tweet, of course, wasn't written by AP, but rather by an impostor who had temporarily gained control of the account. Could a computer program have detected that the tweet was hacked?
In this presentation, we'll discuss how machine learning was used to classify tweets as having been authored by The Associated Press or not. As a final test, the program was run on the hacked tweet, and we'll reveal whether it correctly classified the tweet as authentic or hacked.
Full article: http://www.primaryobjects.com/cms/article158
Detecting a Hacked Tweet with Machine Learning (5 Minute Presentation)
1. DETECTING A HACKED TWEET with Machine Learning and Artificial Intelligence
Kory Becker 2015
http://primaryobjects.com/cms/article158
http://linkedin.com/in/korybecker
http://twitter.com/primaryobjects
3. ALL YOUR DATA ARE BELONG TO US
Accord.NET SVM, Tried Gaussian (96%), then linear (97%) kernel
Extract Tweets with TweetSharp
Create Document Corpus (6,054 tweets)
Create Vocabulary (2,225 words)
Digitize Corpus
Porter-Stemmer (“talking” => “talk”, “explosion” => “explos”)
Term Frequency Inverse Document Frequency (TF*IDF)
Word Existence
Vector Size = Vocabulary Size | Matrix = double[6054][2225]
1. Introduction
My name is Kory Becker. I'm a Software Architect at The Associated Press. I develop web applications by day, and have a fascination with artificial intelligence. If you like, you can follow the (short) slides for this talk at slideshare.net/korybecker.
2. What?
On April 23, 2013 the stock market experienced one of its biggest flash-crash drops of the year, with the Dow Jones industrial average falling 143 points (over 1%) in a matter of minutes.
Unlike the 2012 stock market blip, this one wasn't caused by an individual trade, but rather by a single tweet from The Associated Press account on Twitter. The tweet, of course, wasn't written by AP, but rather by an impostor who had temporarily gained control of the account (the Syrian Electronic Army claimed responsibility). Could a computer program have detected that the tweet was hacked?
The tweet was "Breaking: Two Explosions in the White House and Barack Obama is injured".
Now, there are a couple of specific characteristics about the text in question. The term "Breaking" has unusual casing for a tweet coming from AP, which would usually write it in all capitals.
The combination of "White House" + "and" + "Barack Obama" is rare. Maybe a computer could pick up on this? So, what did we do?
3. How?
The idea was to write a program using artificial intelligence, specifically a machine learning algorithm with supervised learning. The computer would be given a list of tweets and be told whether each tweet is real (from AP) or fake (not from AP). It can then learn common terms in each category and (hopefully) figure out how to detect the hacked tweet.
Using the Accord.NET machine learning library, I started by implementing a support vector machine (SVM) with a Gaussian kernel. SVMs work with different kernels, and a Gaussian kernel allows fitting data points in a variety of non-linear shapes (round, curvy, etc.).
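To make the difference between the two kernels concrete, here is a minimal sketch in Python (not the Accord.NET C# code used in the talk). The function names, the `sigma` parameter, and the toy vectors are assumptions for illustration, and the Gaussian formula shown is one common parameterization.

```python
import math

def linear_kernel(x, y):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(x, y))

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel: exp(-||x - y||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))

# Two toy word-existence vectors (hypothetical data, not from the corpus).
a = [1, 0, 1, 1]
b = [1, 1, 0, 1]

print(linear_kernel(a, b))    # counts the words both tweets share
print(gaussian_kernel(a, a))  # identical vectors score the maximum, 1.0
print(gaussian_kernel(a, b))  # similarity shrinks as the vectors differ
```

With 0/1 vectors, the linear kernel simply counts shared words, while the Gaussian kernel turns the distance between two tweets into a similarity between 0 and 1.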
I extracted tweets using the TweetSharp library.
I created a document corpus of about 6,000 tweets and a vocabulary of about 2,000 words.
The documents were digitized by tokenizing the tweets, running the Porter stemmer to shorten words, and then creating a bag-of-words model.
Each tweet's unique terms were added to the vocabulary. Then, you loop through each tweet and check each vocabulary word against it. If the word appears in the tweet, you mark a 1 in that tweet's array. If it doesn't, you mark a 0. You end up with an array of 1s and 0s for each tweet, with one entry per vocabulary word. This is perfect for training a machine learning program.
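The word-existence encoding described above can be sketched in a few lines of Python. This is an illustration rather than the talk's C# code: the helper names and sample tweets are made up, the tokenizer is a simple regular expression, and the stemming step is left out for brevity.

```python
import re

def tokenize(text):
    """Lowercase a tweet and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def build_vocabulary(tweets):
    """Collect each unique token once, in first-seen order."""
    vocabulary, seen = [], set()
    for tweet in tweets:
        for token in tokenize(tweet):
            if token not in seen:
                seen.add(token)
                vocabulary.append(token)
    return vocabulary

def to_vector(tweet, vocabulary):
    """Mark 1 for every vocabulary word present in the tweet, else 0."""
    tokens = set(tokenize(tweet))
    return [1 if word in tokens else 0 for word in vocabulary]

# Hypothetical sample tweets, not taken from the real corpus.
tweets = ["BREAKING: markets fall", "Markets rally after report"]
vocabulary = build_vocabulary(tweets)
matrix = [to_vector(tweet, vocabulary) for tweet in tweets]

print(vocabulary)
for row in matrix:
    print(row)
```

Every row has the same length as the vocabulary, which is why the full corpus becomes a 6,054 by 2,225 matrix.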
To train and test the accuracy, the tweets were split into training, cross-validation, and test sets. The computer uses the training set to fit its model, learning from the tweets it classifies right or wrong. It is then run against the cross-validation set to see how it does on tweets that it hasn't trained on.
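A three-way split and an accuracy measure might look like the following Python sketch. The 60/20/20 proportions, the fixed seed, and the function names are assumptions for illustration; the talk does not state the actual split sizes.

```python
import random

def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle a copy of rows and cut it into train/validation/test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],  # whatever remains is the test set
    )

def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for p, a in zip(predicted, actual) if p == a)
    return matches / len(actual)

# Hypothetical stand-ins for ten labeled tweets.
rows = list(range(10))
train_set, validation_set, test_set = split_dataset(rows)
print(len(train_set), len(validation_set), len(test_set))
```

Scoring the model on the validation rows instead of the training rows is what shows whether it generalizes to tweets it has not seen.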
So, what were the results?
4. Result?
The Gaussian kernel did pretty well. It scored 99.7% accuracy on the training set and 96% on the cross-validation set. The SVM was then switched to a linear kernel, which bumped the accuracy up to 100% on training and 97% on cross-validation.
OK, but did it detect the hacked tweet?
The initial training set contained random tweets from AP and non-AP Twitter accounts. The trained model correctly classified AP tweets, but failed on the particular hacked tweet.
I fed the training set additional tweets, found with searches such as "-from:AP obama" and "-from:AP breaking", so it had knowledge of the actual topic. And what do you know, it worked!
5. Conclusion
There are a lot more details in this project, including some cool learning curve charts and examples of tweets being classified. You can read my full article at http://www.primaryobjects.com/cms/article158 (the top link in the last slide). There are some code samples for setting up the SVM and you can even download the test set results.
If you're curious about artificial intelligence, I also have some other interesting articles, including Self-Programming Artificial Intelligence (the last link in the slide), where a computer program uses genetic algorithms to successfully write its own computer programs. Scary stuff!
In conclusion, my name is Kory Becker. Feel free to chat if you have any questions or connect online via @primaryobjects on Twitter or Kory Becker on LinkedIn. Thanks.