Mining tweets for security information (rev 2)
1. Mining Tweets for
Security Information
with “R”
Jeff Stanton, School of Information Studies
Syracuse University
2. @highfours: I just
watched a plane
crash into the
hudson rive in
manhattan
@ReallyVirtual:
Helicopter
hovering above
Abbotabad at 1AM
(is a rare event).
Twitter: Early Warning System?
4. 140 characters max
@petridishes – Screen Name
#blackberry – User-created hashtag
@crozzledhearts – “Retweeter” who sent this
tweet after receiving it from @petridishes
30 minutes ago via web – Each tweet
encoded with UTC timecode
No URLs here, but they are auto-shortened
Anatomy of a Tweet
6. A GNU open source project
An implementation of the “S” statistical
language developed at Bell Labs
Largely an interpreted, command-line
interface with some GUI add-ons
More than 4300 add-on packages developed
by the user community
Full-featured data management and matrix
manipulation with performance comparable
to Octave and MATLAB
Extensive graphics for visualization
Starting in 2010, used by more data miners
(43%) than any other single tool
“R” Facts
7. Developed by Jeff
Gentry (Fidelity)
Five classes and 11
functions to:
◦ Authenticate to Twitter
with OAuth and check
current rate limit
◦ Manipulate, send, and
receive direct messages
◦ Update user status
◦ Search for tweets
containing particular
keywords or hashtags
◦ Examine topic trends
◦ Examine timelines
The “twitteR” package
8. Use the R “Packages” menu to install the
necessary packages:
bitops, RJSONIO, RCurl, and twitteR
Depending upon Mac/Win/Linux, you may need
to retrieve a zipped file of RCurl from:
◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
Then ready the packages for use in R with the
library() command:
> library(bitops)
> library(RCurl)
> library(RJSONIO)
> library(twitteR)
Getting Ready – Load Packages
9. > expTweets <- searchTwitter('#exploit', n=500)
> expDF <- do.call("rbind", lapply(expTweets, as.data.frame))
The second command above takes the raw tweet data in
expTweets – which starts as a list of separate status
objects – and binds it into a single data frame for ease
of analysis
lapply() applies a function to each element of a list
as.data.frame is a type coercion
rbind is the function that joins separate objects to become
rows in a data frame
do.call() calls rbind once, passing every element of the list
as an argument
Search Twitter for “#exploit”
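The lapply()/do.call() pattern above can be tried without a Twitter connection. The following is a minimal sketch on hand-made records; the list `records` and its field values are invented for illustration and only stand in for the status objects that searchTwitter() returns.

```r
# Hypothetical stand-in for a list of tweets: each element is one record
records <- list(
  list(text = "New exploit posted", screenName = "userA", retweet = FALSE),
  list(text = "RT @userA: New exploit posted", screenName = "userB", retweet = TRUE),
  list(text = "Patch available", screenName = "userC", retweet = FALSE)
)

# lapply() coerces each record to a one-row data frame
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)

# do.call() makes a single call rbind(rows[[1]], rows[[2]], rows[[3]])
combined <- do.call("rbind", rows)

print(combined)
```

The result has one row per record and one column per field, which is the same shape expDF takes on the slide.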
10. > head(expDF,1)
text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age
Verification Plugin http://t.co/O8wVjKca #Exploit
favorited: FALSE
replyToSN: NA
created: 2012-01-10 18:19:11
truncated: FALSE
replyToSID: <NA>
id: 156802281747124224
replyToUID: NA
statusSource: <a href="http://twitterfeed.com"
rel="nofollow">twitterfeed</a>
screenName: NotaThreat2u
A Preview of the Data
11. > head(expDF$created,1)
[1] "2012-01-10 18:19:11 UTC"
The created variable is conveniently coded as a
POSIX time variable calibrated to UTC
> hist(expDF$created, breaks=15, freq=TRUE)
Shows a frequency histogram (with about 15 break points)
Nice spike at 18:00 UTC (about 1pm EST)
[Figure: Histogram of expDF$created. x-axis: expDF$created, ticks from 13:50 to 10:40; y-axis: Frequency, 0 to 20]
Visualizing the Data: When Tweeted?
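As a rough illustration of working with a POSIX time variable like created, the sketch below counts tweets per UTC hour and converts one timestamp to US Eastern time. The three timestamps are made up (the first copies the value shown on the slide), and the "America/New_York" zone name assumes the system's time zone database is available.

```r
# Made-up timestamps, stored as POSIXct in UTC like expDF$created
created <- as.POSIXct(c("2012-01-10 18:19:11",
                        "2012-01-10 18:45:00",
                        "2012-01-10 22:05:30"), tz = "UTC")

# Extract the UTC hour of each tweet and tabulate: a text-only histogram
hourUTC <- format(created, "%H", tz = "UTC")
perHour <- table(hourUTC)
print(perHour)

# Show the first timestamp in US Eastern time (UTC-5 in January)
localTime <- format(created[1], "%Y-%m-%d %H:%M:%S", tz = "America/New_York")
print(localTime)
```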
12. # Total time between 1st and last tweet, in seconds (units made explicit)
elapsedTime = as.numeric(difftime(max(expDF$created), min(expDF$created), units="secs"))
timeBin = floor(elapsedTime/11) # Bin width in seconds; gives slices 0 through 11
# Add a new variable with the bin designators
expDF$slice = floor(as.numeric(difftime(expDF$created,
min(expDF$created), units="secs"))/timeBin)
expSlices<-expDF[,c("screenName","slice")] # subset the data
expTable<-table(expSlices) # Count tweets per user in each slice
# Convert table data to matrix that heatmap() expects
expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
rownames(expMatrix)<-rownames(expTable)
# Label columns from the slices actually present (empty slices are absent)
colnames(expMatrix)<-paste('Slice',as.integer(colnames(expTable))+1)
heatmap(expMatrix,Rowv=NA,Colv=NA,
col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
Prepare a Heatmap
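The binning step is easier to follow on a tiny example. This is a minimal sketch, assuming time differences are taken explicitly in seconds; the timestamps and the `user` vector are invented for illustration.

```r
# Four made-up tweets spanning exactly 11 hours
created <- as.POSIXct("2012-01-10 12:00:00", tz = "UTC") + c(0, 1800, 3600, 39600)
user <- c("a", "a", "b", "b")

# Seconds elapsed since the first tweet, with units stated explicitly
secsFromStart <- as.numeric(difftime(created, min(created), units = "secs"))

# Eleven equal-width bins; the last tweet falls on the upper edge (slice 11)
binWidth <- max(secsFromStart) / 11
slice <- floor(secsFromStart / binWidth)

# Count tweets per user per slice; only slices that occur get a column
counts <- table(user, slice)
print(counts)
```

Only slices that actually contain tweets appear as columns in the table, which is why hard-coding the number of column labels can fail on other data.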
14. library(stringr) # Provides easy string functions
str_match(expDF$text, "^RT @") # Find RT @ at beginning of each line
Regular expression matching any number of
alphanumeric characters or underscore:
[[:alnum:]_]*
str_match(expDF$text, "^RT @[[:alnum:]_]*") # Matches the whole retweet screen name
expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*") # Adds a new variable
Do Some Parsing with Regex
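To see what the retweet pattern picks up, here is a small sketch on made-up tweet text. It swaps in base R's regexpr() and regmatches() for stringr's str_match() so that it runs without extra packages; the sample strings and the variable names `rtPrefix` and `rtSN` are illustrative only.

```r
# Made-up tweet text: two retweets and one ordinary tweet
tweets <- c("RT @hacktalkblog: New Exploit [webapps]",
            "Plain tweet mentioning @someone #exploit",
            "RT @user_42 another retweet")

# Same pattern as on the slide: "RT @" at the start, then a screen name
pattern <- "^RT @[[:alnum:]_]*"
hit <- regexpr(pattern, tweets)               # -1 where there is no match

# Keep NA for tweets that are not retweets
rtPrefix <- rep(NA_character_, length(tweets))
rtPrefix[hit != -1] <- regmatches(tweets, hit)

# Strip the leading "RT @" to leave only the screen name
rtSN <- sub("^RT @", "", rtPrefix)
print(rtSN)
```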
17. [Figure: bar chart of keyword frequencies, axis 0 to 30. Keywords shown: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]
Most common keywords
18. #security – Another good hashtag to search on
(SQL) Injection – Apparently one of the most common
attacks
cross (site) scripting – Another popular attack
#cyber #cyberwar #ccureit #hacker – More
hashtags?
remote vulnerability, buffer (overflow), cms,
wordpress, phpmydirectory
Each/any of these keywords could provide a basis for a
new tweet search term, or for keyword detection within a
set of tweets obtained from another search, or for an alert
dashboard with periodic updates
Common keywords to explore
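The slides do not show how the keyword counts were produced. One simple way to get a comparable frequency list, sketched here on invented tweet text, is to split on whitespace and tabulate; this is an assumption about method, not the author's code.

```r
# Made-up tweets standing in for expDF$text
tweets <- c("New Exploit for SQL injection #security",
            "SQL injection exploit in cms #security",
            "Remote buffer overflow exploit")

# Lower-case, split on whitespace, and flatten into one vector of words
words <- unlist(strsplit(tolower(tweets), "[[:space:]]+"))
words <- words[words != ""]

# Tabulate and sort so the most frequent keywords come first
freq <- sort(table(words), decreasing = TRUE)
print(head(freq, 5))
```

Any keyword near the top of such a list could then be fed back into searchTwitter() as a new search term.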
19. @shitaesy Je me couche à 20h30 en ce
moment.. J'ai même lu ce soir :3 #exploit
(French: “I go to bed at 8:30 pm these days.. I even read tonight :3”)
Scanning across a sample of the
tweets, some are spam and should be
filtered out
Can we create a classifier that will get rid
of the non-exploit tweets?
Must Remove the Non-Tweets
20. Initial model developed with training data
[Diagram: Attribute 1, Attribute 2, Attribute 3]
Model accuracy checked on training data, but later
cross-validated on new data
• Attributes can be boolean or numeric
• Most useful if independent of other attributes
• The fewer the attributes the better
Anatomy of a Classifier
21. write.table(expDF, sep=",",
file="exploitData.csv")
# I looked at the tweets and added
# training data, using my judgment to code
# the non-tweets
truExp = read.table("truExp.txt")
# Add to the existing data
expDFtrue = cbind(expDF, truExp)
# Note: new variable name defaults to “V1”
Create and Add Training Data
22. # Create true/false values for each row,
# based on whether the string exists
expDFtrue$hassec =
grepl("security",tolower(expDFtrue$text))
# Also count some punctuation to see if
# there are clues there
expDFtrue$numhash =
sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
Easy Predictors with grepl()
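The two predictors can be checked on a few made-up tweets. This sketch reuses the slide's expressions on an invented data frame named `tweets`; the text values are illustrative only.

```r
# Made-up tweets standing in for expDFtrue
tweets <- data.frame(
  text = c("New #exploit in #security tool",
           "No hashtags here about Security",
           "#one #two #three"),
  stringsAsFactors = FALSE
)

# TRUE where the lower-cased text contains "security"
tweets$hassec <- grepl("security", tolower(tweets$text))

# Splitting on "#" yields one more piece than there are hash marks
tweets$numhash <- sapply(strsplit(as.character(tweets$text), "#"), length) - 1

print(tweets[, c("hassec", "numhash")])
```

One caveat with the split-and-count approach: strsplit() drops a trailing empty piece, so a tweet that ends in a bare "#" would be undercounted by one.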
23. Coefficients (output from logit analysis):
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.946e+00 1.904e+00 1.547 0.12187
hassecTRUE 2.302e+00 1.186e+00 1.941 0.05222 .
hassqlTRUE 1.770e+01 4.076e+03 0.004 0.99653
hasbufTRUE -2.476e+00 1.124e+00 -2.203 0.02757 *
hasscrTRUE 1.832e+01 4.135e+03 0.004 0.99647
hasremTRUE -3.075e-01 1.109e+00 -0.277 0.78164
hascybTRUE 1.202e+00 1.958e+00 0.614 0.53937
numhash -2.046e+00 7.218e-01 -2.835 0.00458 **
numast -2.182e+01 5.554e+03 -0.004 0.99687
numdot 6.306e-01 4.167e-01 1.513 0.13017
twtlen 6.548e-03 2.329e-02 0.281 0.77854
# security, buffer keywords are promising, as well as the
# number of hash marks and the number of dots/periods
Choose Best Attributes
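The slide shows only the coefficient table, not the call that produced it. A minimal sketch of how a table of this shape could be produced with glm(), assuming a binomial (logit) model; the data frame `trainDF`, its values, and the outcome column `isExploit` are invented for illustration and are not the author's data.

```r
# Made-up training data: hand-coded outcome plus two predictors
trainDF <- data.frame(
  isExploit = c(1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0),
  hassec    = c(TRUE, TRUE, TRUE, TRUE, FALSE, FALSE,
                FALSE, TRUE, FALSE, TRUE, FALSE, FALSE),
  numhash   = c(1, 2, 3, 3, 1, 4, 1, 2, 5, 4, 2, 2)
)

# Logistic regression of the hand-coded label on the predictors
fit <- glm(isExploit ~ hassec + numhash,
           family = binomial(link = "logit"), data = trainDF)

# Estimate, Std. Error, z value, and Pr(>|z|) for each term
coefTable <- summary(fit)$coefficients
print(coefTable)
```

In the slide's table, very large standard errors (for example on hassqlTRUE and hasscrTRUE) are the usual sign that a predictor almost perfectly separates the classes in a small training set, so those estimates are not informative.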
25. Conclusion 1: R is pretty handy for grabbing and manipulating
tweet data
Conclusion 2: Tweet data are messy and require a good deal of
clean-up, parsing, and filtering
Conclusion 3: As these two examples suggest, tweets can provide
breaking news about vulnerabilities and exploits
◦ WordPress Age Verification plugin versions 0.4 and below open redirect
vulnerability
Exploit availability tweeted at 12:19 PM
Blogged at SecurityBlog 10:24 PM
Added to SiloBreaker two days later
◦ Pragyan CMS v 3.0 Remote File Disclosure
Exploit availability tweeted at 11:07 AM
Appeared on PacketStorm next day
On RealHacker three days later
On WebCriminal.ru eight days later
Twitter: Early Warning System?
This graph shows the Michael Jackson Effect, with a strong uptick in the half hour following the announcement of his death on Hollywood celebrity site TMZ.com (Doctors at UCLA Hospital had announced the death 18 minutes before that). Twitter crashed temporarily under the load at 3:15 PM. Twitter has about 200 million users and is the 9th most popular site on the web. On a typical day Twitter handles 200 million tweets and 1.6 billion search queries.
Could also make a fancier PostScript plot with: post(fit, file = "tree.ps", title = "Classification Tree for Exploit Tweets")