Mining Tweets for
Security Information
            with “R”
 Jeff Stanton, School of Information Studies
                         Syracuse University
@highfours: I just watched a plane crash into the hudson rive in manhattan

@ReallyVirtual: Helicopter hovering above Abbotabad at 1AM (is a rare event).

Twitter: Early Warning System?
2:26 PM    2:44 PM    3:15 PM
UCLA       TMZ.com    Twitter




Twitter Facts
   140 characters max
            @petridishes – Screen Name
            #blackberry – User-created hashtag
            @crozzledhearts – “Retweeter” who sent this
             tweet after receiving it from @petridishes
            30 minutes ago via web – Each tweet
             encoded with UTC timecode
            No URLs here, but they are auto-shortened

Anatomy of a Tweet
“R” – Open Source Analytics
   A GNU open source project
   An implementation of the “S” statistical
    language developed at Bell Labs
   Largely an interpreted, command-line
    interface with some GUI add-ons
   More than 4300 add-on packages developed
    by the user community
   Full-featured data management and matrix
    manipulation with performance comparable
    to Octave and MATLAB
   Extensive graphics for visualization
   Starting in 2010, used by more data miners
    (43%) than any other single tool


“R” Facts
   Developed by Jeff
    Gentry (Fidelity)
   Five classes and 11
    functions to:
    ◦ Authenticate to Twitter
      with OAuth and check
      current rate limit
    ◦ Manipulate, send, and
      receive direct messages
    ◦ Update user status
    ◦ Search for tweets
      containing particular
      keywords or hashtags
    ◦ Examine topic trends
    ◦ Examine timelines


The “twitteR” package
   Use the R “Packages” menu to install the
    necessary packages:
    bitops, RJSONIO, RCurl, and twitteR
   Depending upon Mac/Win/Linux, you may need
    to retrieve a zipped file of RCurl from:
    ◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
   Then ready the packages for use in R with the
    library() command:
    >   library(bitops)
    >   library(RCurl)
    >   library(RJSONIO)
    >   library(twitteR)




Getting Ready – Load Packages
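Before calling library(), it can help to check which of the four packages are actually present. The following is a minimal sketch, not part of the original slides; the helper name missingPkgs is hypothetical, and the install step is left commented out because it needs a network connection.

```r
needed <- c("bitops", "RCurl", "RJSONIO", "twitteR")

# Hypothetical helper: return the subset of pkgs that cannot be loaded here
missingPkgs <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

toInstall <- missingPkgs(needed)
print(toInstall)

# Installing needs network access and a CRAN mirror, so it is commented out:
# if (length(toInstall) > 0) install.packages(toInstall)
```

Anything printed by toInstall still has to be installed before the library() calls will succeed.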
> expTweets <- searchTwitter('#exploit', n=500)
> expDF <- do.call("rbind", lapply(expTweets, as.data.frame))

   The second command above takes the raw tweet data in
    expTweets, which starts as a list of separate status
    objects, converts each one to a one-row data frame, and
    binds the rows into a single data frame for ease of analysis
   lapply() applies a function to each element of a list
   as.data.frame is a type coercion
   rbind is the function that joins separate objects to become
    rows in a data frame
   do.call() calls rbind once, passing every element of the
    list as a separate argument




Search Twitter for “#exploit”
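The lapply / do.call("rbind", ...) idiom can be tried without a Twitter connection. This is a small sketch with made-up records standing in for the list that searchTwitter() returns; the field values and screen names are hypothetical.

```r
# Stand-in for the list returned by searchTwitter(): three hypothetical records
records <- list(
  list(text = "first tweet",  screenName = "alice"),
  list(text = "second tweet", screenName = "bob"),
  list(text = "third tweet",  screenName = "carol")
)

# Coerce each record to a one-row data frame
rows <- lapply(records, as.data.frame, stringsAsFactors = FALSE)

# do.call() hands all three data frames to a single rbind() call
df <- do.call("rbind", rows)
print(df)

# The equivalent call written out by hand
same <- rbind(rows[[1]], rows[[2]], rows[[3]])
```

With 500 tweets the hand-written form is impractical, which is why do.call() is used.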
> head(expDF,1)
text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age
  Verification Plugin http://t.co/O8wVjKca #Exploit
favorited: FALSE
replyToSN: NA
created: 2012-01-10 18:19:11
truncated: FALSE
replyToSID: <NA>
id: 156802281747124224
replyToUID: NA
statusSource: &lt;a href=&quot;http://twitterfeed.com&quot;
  rel=&quot;nofollow&quot;&gt;twitterfeed&lt;/a&gt;
screenName: NotaThreat2u




A Preview of the Data
> head(expDF$created,1)
[1] "2012-01-10 18:19:11 UTC"

   The created variable is conveniently coded as a
    POSIX time variable calibrated to UTC

> hist(expDF$created, breaks=15, freq=TRUE)

   Shows a frequency histogram (with about 15 break points)
   Nice spike at 18:00 UTC (about 1pm EST)

[Figure: Histogram of expDF$created. x-axis: expDF$created, with ticks at 13:50, 18:00, 22:10, 02:20, 06:30, 10:40; y-axis: Frequency, 0 to 20]
Visualizing the Data: When Tweeted?
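The UTC-to-EST reading of the spike can be checked directly in R. A minimal sketch follows, using the timestamp from the data preview; the "America/New_York" zone name and the extra timestamps are assumptions for illustration.

```r
# The timestamp shown in the data preview, parsed as UTC
created <- as.POSIXct("2012-01-10 18:19:11", tz = "UTC")

# Same instant rendered in US Eastern time (zone name is an assumption)
eastern <- format(created, tz = "America/New_York", usetz = TRUE)
print(eastern)

# Hour-of-day counts as a text alternative to the histogram
stamps <- created + c(0, 600, 1200, 7200, 7800)  # hypothetical offsets in seconds
hourly <- table(format(stamps, "%H", tz = "UTC"))
print(hourly)
```

The table of hours gives the same information as the histogram bars when a plot device is not available.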
# Total time between 1st and last tweet, in hours
elapsedTime = as.numeric(difftime(max(expDF$created),
  min(expDF$created), units="hours"))
nBins = 12 # 12 slices, matching the Slice1..Slice12 labels
timeBin = elapsedTime/nBins # Width of each slice in hours
# Add a new variable with the bin designators (1 to nBins)
hoursIn = as.numeric(difftime(expDF$created,
  min(expDF$created), units="hours"))
expDF$slice = factor(pmin(floor(hoursIn/timeBin)+1, nBins),
  levels=1:nBins)

expSlices<-expDF[,c("screenName","slice")] # subset the data
expTable<-table(expSlices) # Count tweets in each slice

# Convert table data to matrix that heatmap() expects
expMatrix<-matrix(expTable,ncol=ncol(expTable))
rownames(expMatrix)<-rownames(expTable)
colnames(expMatrix)<-paste0('Slice',1:nBins)

heatmap(expMatrix,Rowv=NA,Colv=NA,
  col=rainbow(max(expMatrix)+1,start=0.5,end=.7))




Prepare a Heatmap
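The slicing step is easier to follow on a handful of rows. The sketch below uses six made-up tweets from three hypothetical accounts and three slices instead of twelve; it only builds the count table that would be passed on to heatmap().

```r
# Hypothetical data: six tweets from three made-up accounts over six hours
start <- as.POSIXct("2012-01-10 12:00:00", tz = "UTC")
toy <- data.frame(
  screenName = c("alice", "bob", "alice", "carol", "bob", "alice"),
  created = start + 3600 * c(0, 1, 2, 3, 5, 6),
  stringsAsFactors = FALSE
)

nBins <- 3
hoursIn <- as.numeric(difftime(toy$created, min(toy$created), units = "hours"))
binWidth <- max(hoursIn) / nBins

# pmin() folds the very last tweet into the final slice
toy$slice <- factor(pmin(floor(hoursIn / binWidth) + 1, nBins), levels = 1:nBins)

# Rows are accounts, columns are time slices
counts <- table(toy$screenName, toy$slice)
print(counts)
```

Declaring the slice as a factor with fixed levels keeps empty slices as zero-count columns, so the matrix always has nBins columns.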
[Figure: Heatmap of tweet counts. Rows: screen names (xoMC_DDL through NotaThreat2u); columns: Slice1 to Slice12]
library(stringr) # Provides easy string functions
# Find RT @ at the beginning of each tweet
str_match(expDF$text, "^RT @")

   Regular expression matching any number of
    alphanumeric characters or underscore:
    [[:alnum:]_]*

# Matches the whole retweet screen name
str_match(expDF$text, "^RT @[[:alnum:]_]*")

# Adds a new variable (str_match returns a matrix; column 1 is the full match)
expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*")[,1]



Do Some Parsing with Regex
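The same pattern can be tried with base R alone, which avoids the stringr dependency. This is a sketch under that substitution (regexpr and regmatches in place of str_match); the tweet texts are made up.

```r
# Hypothetical tweet texts
tweets <- c(
  "RT @packet_storm: New exploit posted #exploit",
  "Just reading about SQL injection #security",
  "RT @hacktalkblog: WordPress plugin issue"
)

pattern <- "^RT @[[:alnum:]_]*"

# regexpr() returns -1 where the pattern does not match
hits <- regexpr(pattern, tweets)
rtSN <- rep(NA_character_, length(tweets))
rtSN[hits > 0] <- regmatches(tweets, hits)
print(rtSN)

# Strip the "RT @" prefix to leave only the screen name
print(sub("^RT @", "", rtSN))
```

Tweets that are not retweets come back as NA, which is also what str_match gives for a non-match.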
plot(as.factor(expDF$rtSN),las=2)

[Figure: Bar plot of retweet counts (axis 0 to 14) for RT @_joviann_, RT @cedricpernet, RT @CyberCrimeNEWS, RT @hacktalkblog, RT @packet_storm, RT @unixfreaxjp]
# Split each distinct tweet into words (works for factor or character text)
exploitWords = strsplit(unique(as.character(expDF$text))," ")
exploitWords = unlist(exploitWords)
# Remove retweet markers, screen names, the search hashtag, URLs, and punctuation
exploitWords = str_replace_all(exploitWords, "^RT @[[:alnum:]_]*","")
exploitWords = str_replace_all(exploitWords, "@[[:alnum:]_]*","")
exploitWords = str_replace_all(exploitWords, "#Exploit","")
exploitWords = str_replace_all(exploitWords, "#exploit","")
exploitWords = str_replace_all(exploitWords, "^http.*","")
exploitWords = str_replace_all(exploitWords, ":","")
exploitWords = str_replace_all(exploitWords, "_","")
exploitWords = str_replace_all(exploitWords, "-","")
exploitWords = tolower(exploitWords)
exploitWords = sort(exploitWords)
wordCount = summary(as.factor(exploitWords)) # Count each word
# Drop the most frequent entry (typically the empty strings left by the replacements)
wordCount = wordCount[wordCount<(max(wordCount)-1)]
wordCount = wordCount[wordCount>4] # Keep words seen at least 5 times
barplot(wordCount,las=2)




Make a Keyword List
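A compact variant of the keyword count, shown here on three made-up tweets. This sketch uses base R (gsub and table) rather than stringr, and drops empty strings explicitly instead of filtering on the maximum count; both choices are substitutions, not the slide's method.

```r
# Hypothetical tweet texts
tweets <- c(
  "New Exploit: SQL injection in example CMS #exploit",
  "SQL injection alert #security #exploit",
  "Remote file disclosure exploit #security"
)

words <- tolower(unlist(strsplit(tweets, " ", fixed = TRUE)))
words <- gsub("[:,]", "", words)               # drop simple punctuation
words <- words[!words %in% c("", "#exploit")]  # drop empties and the search tag

wordCount <- sort(table(words), decreasing = TRUE)
print(head(wordCount, 5))
```

table() keeps every distinct word, whereas summary() on a factor lumps rare levels together once there are more than 100 of them.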
[Figure: Bar plot of keyword counts (axis 0 to 30). Keywords shown: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]
Most common keywords
    #security – Another good hashtag to search on
   (SQL) Injection – Apparently one of the most common
    attacks
   cross (site) scripting – Another popular attack
   #cyber #cyberwar #ccureit #hacker – More
    hashtags?
   remote vulnerability, buffer (overflow), cms,
    wordpress, phpmydirectory

   Each/any of these keywords could provide a basis for a
    new tweet search term, or for keyword detection within a
    set of tweets obtained from another search, or for an alert
    dashboard with periodic updates




Common keywords to explore
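Keyword detection within an existing set of tweets could look roughly like this. It is a sketch only: the function name flagTweets is hypothetical, the keyword vector is a subset of the list above, and the tweets are made up.

```r
# Keywords taken from the list above
keywords <- c("sql", "injection", "cross", "scripting", "buffer",
              "overflow", "wordpress", "phpmydirectory")

# Hypothetical helper: TRUE for each text containing at least one keyword
flagTweets <- function(text, keywords) {
  pattern <- paste(keywords, collapse = "|")
  grepl(pattern, tolower(text))
}

tweets <- c("New SQL injection in a CMS plugin",
            "Je me couche tot en ce moment #exploit",
            "Buffer overflow PoC released")
flags <- flagTweets(tweets, keywords)
print(flags)
```

Run on a schedule against fresh search results, the flagged subset could feed the kind of alert dashboard mentioned above.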
@shitaesy Je me couche à 20h30 en ce
 moment.. J'ai même lu ce soir :3 #exploit

 Scanning across a sample of the
  tweets, some are spam and should be
  filtered out
 Can we create a classifier that will get rid
  of the non-exploit tweets?


Must Remove the Non-Tweets
   Initial model developed with training data
   Model accuracy checked on training data, but later
    cross-validated on new data
   Inputs: Attribute 1, Attribute 2, Attribute 3
    ◦ Attributes can be boolean or numeric
    ◦ Most useful if independent of other attributes
    ◦ The fewer the attributes the better
Anatomy of a Classifier
write.table(expDF, sep=",",
 file="exploitData.csv")

# I looked at the tweets and added
# training data, using my judgment to code
# the non-tweets
truExp = read.table("truExp.txt")

# Add to the existing data
expDFtrue = cbind(expDF, truExp)
# Note: new variable name defaults to “V1”


Create and Add Training Data
# Create true/false values for each row,
# based on whether the string exists
expDFtrue$hassec =
 grepl("security",tolower(expDFtrue$text))

# Also count some punctuation to see if
# there are clues there
expDFtrue$numhash =
 sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1



Easy Predictors with Grepl()
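One caveat with counting by strsplit(): a "#" at the very end of a tweet is not counted, because strsplit() drops a trailing empty piece. The sketch below compares the slide's approach with a direct match count on made-up text; the gregexpr-based alternative is a suggestion, not the slide's method.

```r
# Hypothetical tweet texts
text <- c("New exploit #security #exploit", "no tags here", "ends with a tag #")

# Slide approach: split on "#" and count the pieces minus one
bySplit <- sapply(strsplit(text, "#"), length) - 1

# Alternative: count the matches directly
byMatch <- lengths(regmatches(text, gregexpr("#", text, fixed = TRUE)))

print(data.frame(text, bySplit, byMatch))
```

The two counts agree except on the third text, where only the direct match count sees the trailing hash mark.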
Coefficients (output from logit analysis):
              Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.946e+00 1.904e+00     1.547 0.12187
hassecTRUE   2.302e+00 1.186e+00    1.941 0.05222 .
hassqlTRUE   1.770e+01 4.076e+03    0.004 0.99653
hasbufTRUE -2.476e+00 1.124e+00 -2.203 0.02757 *
hasscrTRUE   1.832e+01 4.135e+03    0.004 0.99647
hasremTRUE -3.075e-01 1.109e+00 -0.277 0.78164
hascybTRUE   1.202e+00 1.958e+00    0.614 0.53937
numhash     -2.046e+00 7.218e-01 -2.835 0.00458 **
numast      -2.182e+01 5.554e+03 -0.004 0.99687
numdot       6.306e-01 4.167e-01    1.513 0.13017
twtlen       6.548e-03 2.329e-02    0.281 0.77854

# security, buffer keywords are promising, as well as the
# number of hash marks and the number of dots/periods




Choose Best Attributes
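The slides show only the coefficient table, not the call that produced it. A logit model of this kind is usually fitted with glm() and a binomial family; the sketch below does that on synthetic data, so the data frame, effect sizes, and the reduced predictor set are all assumptions.

```r
set.seed(42)
n <- 200

# Synthetic stand-in for expDFtrue; all values are made up
toy <- data.frame(
  hassec  = runif(n) < 0.4,
  numhash = rpois(n, 2)
)

# Simulate the hand-coded label V1 (1 = genuine exploit tweet)
logit <- 1 + 1.5 * toy$hassec - 0.8 * toy$numhash
toy$V1 <- rbinom(n, 1, plogis(logit))

# Logistic regression of the label on the candidate attributes
fit <- glm(V1 ~ hassec + numhash, family = binomial(), data = toy)
coefTable <- summary(fit)$coefficients
print(coefTable)
```

The printed table has the same columns as the output above (Estimate, Std. Error, z value, Pr(>|z|)), and a logical predictor such as hassec appears as hassecTRUE.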
library(rpart)
fit <- rpart(V1 ~ hassec +
  hasbuf + numhash + numdot,
  method="class",
  data=expDFtrue)
summary(fit)
plot(fit, uniform=TRUE,
  margin=0.1, branch=0.5,
  compress=TRUE)
text(fit, use.n=TRUE, all=TRUE,
  cex=.8)

[Figure: Classification tree. Root node: class 1, 26/71, split on numhash>=2.5. Left leaf: class 0, 14/2. Right leaf: class 1, 12/69]


      “numhash” only retained attribute, split at 2.5
      Overall 14 errors (83/97 = 85.6% correct)
      12 false positives (12/97 = 12.4% FP)
      2 false negatives (2/97 = 2.1% FN)


Classification Tree Works OK
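The error rates follow from the leaf counts in the tree (14/2 in the class 0 leaf, 12/69 in the class 1 leaf). A short sketch of that arithmetic, with the counts arranged as a confusion matrix:

```r
# Leaf counts from the tree: rows are predicted class, columns are actual class
confusion <- matrix(c(14, 12, 2, 69), nrow = 2,
  dimnames = list(predicted = c("0", "1"), actual = c("0", "1")))
print(confusion)

total    <- sum(confusion)          # all hand-coded tweets
correct  <- sum(diag(confusion))    # predicted class equals actual class
falsePos <- confusion["1", "0"]     # non-exploit tweets classed as exploit
falseNeg <- confusion["0", "1"]     # exploit tweets classed as non-exploit

rates <- round(100 * c(correct = correct, FP = falsePos, FN = falseNeg) / total, 1)
print(rates)
```

These figures are measured on the training data, so they are likely to be optimistic compared with performance on new tweets.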
   Conclusion 1: R is pretty handy for grabbing and manipulating
    tweet data

   Conclusion 2: Tweet data are messy and require a good deal of
    clean-up, parsing, and filtering

   Conclusion 3: As these two examples suggest, tweets can provide
    breaking news about vulnerabilities and exploits
    ◦ WordPress Age Verification plugin versions 0.4 and below open redirect
      vulnerability
       Exploit availability tweeted at 12:19 PM
       Blogged at SecurityBlog 10:24 PM
       Added to SiloBreaker two days later

    ◦ Pragyan CMS v 3.0 Remote File Disclosure
       Exploit availability tweeted at 11:07 AM
       Appeared on PacketStorm next day
       On RealHacker three days later
       On WebCriminal.ru eight days later


Twitter: Early Warning System?
Image from: http://www.vincegolangco.com

Weitere ähnliche Inhalte

Mehr von Syracuse University

Basic SEVIS Overview for U.S. University Faculty
Basic SEVIS Overview for U.S. University FacultyBasic SEVIS Overview for U.S. University Faculty
Basic SEVIS Overview for U.S. University FacultySyracuse University
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformSyracuse University
 
Carma internet research module scale development
Carma internet research module   scale developmentCarma internet research module   scale development
Carma internet research module scale developmentSyracuse University
 
Carma internet research module getting started with question pro
Carma internet research module   getting started with question proCarma internet research module   getting started with question pro
Carma internet research module getting started with question proSyracuse University
 
Carma internet research module visual design issues
Carma internet research module   visual design issuesCarma internet research module   visual design issues
Carma internet research module visual design issuesSyracuse University
 
Introduction to Advance Analytics Course
Introduction to Advance Analytics CourseIntroduction to Advance Analytics Course
Introduction to Advance Analytics CourseSyracuse University
 
Carma internet research module: Future data collection
Carma internet research module: Future data collectionCarma internet research module: Future data collection
Carma internet research module: Future data collectionSyracuse University
 
Carma internet research module: Sampling for internet
Carma internet research module: Sampling for internetCarma internet research module: Sampling for internet
Carma internet research module: Sampling for internetSyracuse University
 

Mehr von Syracuse University (20)

Discovery informaticsstanton
Discovery informaticsstantonDiscovery informaticsstanton
Discovery informaticsstanton
 
Basic SEVIS Overview for U.S. University Faculty
Basic SEVIS Overview for U.S. University FacultyBasic SEVIS Overview for U.S. University Faculty
Basic SEVIS Overview for U.S. University Faculty
 
Why R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics PlatformWhy R? A Brief Introduction to the Open Source Statistics Platform
Why R? A Brief Introduction to the Open Source Statistics Platform
 
Chapter9 r studio2
Chapter9 r studio2Chapter9 r studio2
Chapter9 r studio2
 
Basic Overview of Data Mining
Basic Overview of Data MiningBasic Overview of Data Mining
Basic Overview of Data Mining
 
Strategic planning
Strategic planningStrategic planning
Strategic planning
 
Carma internet research module scale development
Carma internet research module   scale developmentCarma internet research module   scale development
Carma internet research module scale development
 
Carma internet research module getting started with question pro
Carma internet research module   getting started with question proCarma internet research module   getting started with question pro
Carma internet research module getting started with question pro
 
Carma internet research module visual design issues
Carma internet research module   visual design issuesCarma internet research module   visual design issues
Carma internet research module visual design issues
 
Siop impact of social media
Siop impact of social mediaSiop impact of social media
Siop impact of social media
 
Basic Graphics with R
Basic Graphics with RBasic Graphics with R
Basic Graphics with R
 
R-Studio Vs. Rcmdr
R-Studio Vs. RcmdrR-Studio Vs. Rcmdr
R-Studio Vs. Rcmdr
 
Getting Started with R
Getting Started with RGetting Started with R
Getting Started with R
 
Moving Data to and From R
Moving Data to and From RMoving Data to and From R
Moving Data to and From R
 
Introduction to Advance Analytics Course
Introduction to Advance Analytics CourseIntroduction to Advance Analytics Course
Introduction to Advance Analytics Course
 
Installing R and R-Studio
Installing R and R-StudioInstalling R and R-Studio
Installing R and R-Studio
 
Reducing Response Burden
Reducing Response BurdenReducing Response Burden
Reducing Response Burden
 
PACIS Survey Workshop
PACIS Survey WorkshopPACIS Survey Workshop
PACIS Survey Workshop
 
Carma internet research module: Future data collection
Carma internet research module: Future data collectionCarma internet research module: Future data collection
Carma internet research module: Future data collection
 
Carma internet research module: Sampling for internet
Carma internet research module: Sampling for internetCarma internet research module: Sampling for internet
Carma internet research module: Sampling for internet
 

Kürzlich hochgeladen

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhikauryashika82
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxVishalSingh1417
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docxPoojaSen20
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxVishalSingh1417
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...KokoStevan
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAssociation for Project Management
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDThiyagu K
 
How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17How to Give a Domain for a Field in Odoo 17
How to Give a Domain for a Field in Odoo 17Celine George
 
Seal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxSeal of Good Local Governance (SGLG) 2024Final.pptx
Seal of Good Local Governance (SGLG) 2024Final.pptxnegromaestrong
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
Making and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfMaking and Justifying Mathematical Decisions.pdf
Making and Justifying Mathematical Decisions.pdfChris Hunter
 
PROCESS RECORDING FORMAT.docx
PROCESS      RECORDING        FORMAT.docxPROCESS      RECORDING        FORMAT.docx
PROCESS RECORDING FORMAT.docxPoojaSen20
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.Gardella_Mateo_IntellectualProperty.pdf.
Gardella_Mateo_IntellectualProperty.pdf.MateoGardella
 
Unit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxUnit-IV; Professional Sales Representative (PSR).pptx
Unit-IV; Professional Sales Representative (PSR).pptxVishalSingh1417
 
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...
Ecological Succession. ( ECOSYSTEM, B. Pharmacy, 1st Year, Sem-II, Environmen...Shubhangi Sonawane
 

Kürzlich hochgeladen (20)

Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Unit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptxUnit-V; Pricing (Pharma Marketing Management).pptx
Unit-V; Pricing (Pharma Marketing Management).pptx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Unit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptxUnit-IV- Pharma. Marketing Channels.pptx
Unit-IV- Pharma. Marketing Channels.pptx
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
SECOND SEMESTER TOPIC COVERAGE SY 2023-2024 Trends, Networks, and Critical Th...
 
APM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across SectorsAPM Welcome, APM North West Network Conference, Synergies Across Sectors
APM Welcome, APM North West Network Conference, Synergies Across Sectors
 
Measures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SDMeasures of Dispersion and Variability: Range, QD, AD and SD
Measures of Dispersion and Variability: Range, QD, AD and SD
 
How to Give a Domain for a Field in Odoo 17
Mining tweets for security information (rev 2)

  • 1. Mining Tweets for Security Information with “R” Jeff Stanton, School of Information Studies Syracuse University
  • 2. @highfours: I just watched a plane crash into the hudson rive in manhattan @ReallyVirtual: Helicopter hovering above Abbotabad at 1AM (is a rare event). Twitter: Early Warning System?
  • 3. Twitter Facts
      [Figure: timeline of the announcement: UCLA at 2:26 PM, TMZ.com at 2:44 PM, Twitter at 3:15 PM]
  • 4. Anatomy of a Tweet
      • 140 characters max
      • @petridishes – Screen Name
      • #blackberry – User-created hashtag
      • @crozzledhearts – “Retweeter” who sent this tweet after receiving it from @petridishes
      • 30 minutes ago via web – Each tweet encoded with UTC timecode
      • No URLs here, but they are auto-shortened
  • 5. “R” – Open Source Analytics
  • 6. “R” Facts
      • A GNU open source project
      • An implementation of the “S” statistical language developed at Bell Labs
      • Largely an interpreted, command-line interface with some GUI add-ons
      • More than 4300 add-on packages developed by the user community
      • Full-featured data management and matrix manipulation with performance comparable to Octave and MATLAB
      • Extensive graphics for visualization
      • Starting in 2010, used by more data miners (43%) than any other single tool
  • 7. The “twitteR” package
      • Developed by Jeff Gentry (Fidelity)
      • Five classes and 11 functions to:
        ◦ Authenticate to Twitter with OAuth and check current rate limit
        ◦ Manipulate, send, and receive direct messages
        ◦ Update user status
        ◦ Search for tweets containing particular keywords or hashtags
        ◦ Examine topic trends
        ◦ Examine timelines
  • 8. Getting Ready – Load Packages
      • Use the R “Packages” menu to install the necessary packages: bitops, RJSONIO, RCurl, and twitteR
      • Depending upon Mac/Win/Linux, you may need to retrieve a zipped file of RCurl from:
        ◦ http://www.stats.ox.ac.uk/pub/RWin/bin/windows/contrib/2.14/
      • Then ready the packages for use in R with the library() command:
        > library(bitops)
        > library(RCurl)
        > library(RJSONIO)
        > library(twitteR)
  • 9. Search Twitter for “#exploit”
        > expTweets <- searchTwitter('#exploit', n=500)
        > expDF <- do.call("rbind", lapply(expTweets, as.data.frame))
      • The second command above takes the raw tweet data in expTweets, which starts as a list of separate data objects, and binds it into a single data frame for ease of analysis
      • lapply() applies a function to each element of a list
      • as.data.frame is a type coercion
      • rbind is the function that joins separate objects to become rows in a data frame
      • do.call() calls rbind once, passing every element of the list as an argument
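The list-to-data-frame idiom on this slide can be tried without a Twitter connection. The sketch below is illustrative only: the `tweets` list and its field values are made-up stand-ins for the status objects that searchTwitter() returns.

```r
# Hypothetical stand-in for tweet objects: a list of small named lists.
tweets <- list(
  list(text = "RT @alice: new exploit", screenName = "bob"),
  list(text = "patch released", screenName = "carol"),
  list(text = "RT @alice: another one", screenName = "dave")
)

# lapply() coerces each element to a one-row data frame.
rows <- lapply(tweets, function(tw) as.data.frame(tw, stringsAsFactors = FALSE))

# do.call() makes a single call: rbind(rows[[1]], rows[[2]], rows[[3]]).
tweetDF <- do.call("rbind", rows)

print(dim(tweetDF))
print(tweetDF$screenName)
```

The result has one row per tweet and one column per field, which is the shape the later slides assume for expDF.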
  • 10. A Preview of the Data
        > head(expDF,1)
        text: RT @hacktalkblog: New Exploit [webapps] - Wordpress Age Verification Plugin http://t.co/O8wVjKca #Exploit
        favorited: FALSE
        replyToSN: NA
        created: 2012-01-10 18:19:11
        truncated: FALSE
        replyToSID: <NA>
        id: 156802281747124224
        replyToUID: NA
        statusSource: &lt;a href=&quot;http://twitterfeed.com&quot; rel=&quot;nofollow&quot;&gt;twitterfeed&lt;/a&gt;
        screenName: NotaThreat2u
  • 11. Visualizing the Data: When Tweeted?
        > head(expDF$created,1)
        [1] "2012-01-10 18:19:11 UTC"
      • The created variable is conveniently coded as a POSIX time variable calibrated to UTC
        > hist(expDF$created, breaks=15, freq=TRUE)
      • Shows a frequency histogram (with about 15 break points)
      • Nice spike at 18:00 UTC (about 1pm EST)
      [Figure: histogram of expDF$created; x-axis expDF$created (13:50 through 10:40 the next day), y-axis Frequency (0 to 20)]
  • 12. Prepare a Heatmap
        # Total time between 1st and last tweet
        elapsedTime = max(expDF$created) - min(expDF$created)
        timeBin = floor(elapsedTime/11) # Make 11 bins
        # Add a new variable with the bin designators
        expDF$slice = floor((expDF$created - min(expDF$created))/(as.integer(timeBin)*3600))
        expSlices<-expDF[,c("screenName","slice")] # subset the data
        expTable<-table(expSlices) # Count tweets in each slice
        # Convert table data to matrix that heatmap() expects
        expMatrix<-matrix(expTable,ncol=length(colnames(expTable)))
        rownames(expMatrix)<-rownames(expTable)
        colnames(expMatrix)<-paste('Slice',1:12)
        heatmap(expMatrix,Rowv=NA,Colv=NA, col=rainbow(max(expMatrix)+1,start=0.5,end=.7))
  • 13. [Figure: heatmap of tweet counts. Columns are time slices Slice1 through Slice12; rows are screen names: xoMC_DDL, TheRomamane, TheKingNappy, lauura_5, Sara_Katelyn, kapitanluffy, Zf1r3, CyberCrimeNEWS, CcureIT, Brain_0verride, drb0n3z, sapo2025, packet_storm, cybfor, csec, Federico_II, cloeliae, manero94, Hamoud_Oz, belmontemartin, pretorienx, secwatched, cedricpernet, g4l4drim, unixfreaxjp, theBestRhiannon, bortzmeyer, macmark_de, CyberDomain, cinnamon_carter, binushacker, escan_sachin, shadowy47, iWorlds_it, hacktalkblog, NotaThreat2u]
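The slicing step on this slide depends on the units R picks when subtracting POSIX times. The sketch below is a variant, not the slide's code: it states the units explicitly and runs on made-up timestamps. The names `nBins`, `binWidth`, and `sliceTable` are assumptions introduced for illustration.

```r
# Synthetic data (assumed): 6 tweets spread over 10 hours, in UTC.
start <- as.POSIXct("2012-01-10 08:00:00", tz = "UTC")
created <- start + c(0, 1, 2.5, 5, 9, 10) * 3600
screenName <- c("a", "b", "a", "c", "b", "a")

# Elapsed seconds since the first tweet, with the unit stated explicitly
# instead of relying on the automatic choice made by "-".
elapsedSecs <- as.numeric(difftime(created, min(created), units = "secs"))

# Assign each tweet to one of nBins equal-width slices, numbered 1..nBins.
# pmin() keeps the very last tweet inside the final slice.
nBins <- 5
binWidth <- max(elapsedSecs) / nBins
slice <- pmin(floor(elapsedSecs / binWidth) + 1, nBins)

# Cross-tabulate users against slices; factor() keeps empty slices as columns.
sliceTable <- table(screenName, factor(slice, levels = 1:nBins))
print(sliceTable)
```

Because the slice factor carries all of its levels, the resulting table always has nBins columns, so the column labels passed to heatmap() cannot drift out of step with the data.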
  • 14. Do Some Parsing with Regex
        library(stringr) # Provides easy string functions
        str_match(expDF$text, "^RT @") # Find RT @ at beginning of each line
      • Regular expression matching any number of alphanumeric characters or underscore: [[:alnum:]_]*
        str_match(expDF$text, "^RT @[[:alnum:]_]*") # Matches the whole retweet screen name
        expDF$rtSN = str_match(expDF$text, "^RT @[[:alnum:]_]*") # Adds a new variable
  • 15. plot(as.factor(expDF$rtSN),las=2)
      [Figure: bar plot of retweet counts (y-axis 0 to 14) by retweeted screen name: RT @_joviann_, RT @cedricpernet, RT @CyberCrimeNEWS, RT @hacktalkblog, RT @packet_storm, RT @unixfreaxjp]
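The same pattern can be checked on a few sample strings. The sketch below uses base R (regexpr() and regmatches()) rather than stringr, so it runs without extra packages. The sample texts are made up, and stripping the "RT @" prefix is an extra step not shown on the slide.

```r
# Made-up sample tweets: two retweets and one ordinary tweet.
texts <- c("RT @hacktalkblog: New Exploit [webapps]",
           "Just a plain tweet about @someone",
           "RT @packet_storm: Remote File Disclosure")

pattern <- "^RT @[[:alnum:]_]*"

# regexpr() returns -1 where the pattern does not match.
hit <- regexpr(pattern, texts)

# regmatches() returns only the matched strings, so fill them into
# a vector that has NA for tweets that are not retweets.
rtSN <- rep(NA_character_, length(texts))
rtSN[hit != -1] <- regmatches(texts, hit)

# Optional extra step: strip the "RT @" prefix to keep only the screen name.
rtName <- sub("^RT @", "", rtSN)
print(rtName)
```

Tweets that are not retweets come back as NA, which plot(as.factor(...)) and table() both skip by default.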
  • 16. Make a Keyword List
        # unique() keeps one copy of each tweet, as levels() did when text was a factor;
        # as.character() makes this work whether text is a factor or a character vector
        exploitWords = strsplit(unique(as.character(expDF$text))," ")
        exploitWords = unlist(exploitWords)
        exploitWords = str_replace_all(exploitWords, "^RT @[[:alnum:]_]*","")
        exploitWords = str_replace_all(exploitWords, "@[[:alnum:]_]*","")
        exploitWords = str_replace_all(exploitWords, "#Exploit","")
        exploitWords = str_replace_all(exploitWords, "#exploit","")
        exploitWords = str_replace_all(exploitWords, "^http.*","")
        exploitWords = str_replace_all(exploitWords, ":","")
        exploitWords = str_replace_all(exploitWords, "_","")
        exploitWords = str_replace_all(exploitWords, "-","")
        exploitWords = tolower(exploitWords)
        exploitWords = sort(exploitWords)
        wordCount = summary(as.factor(exploitWords))
        wordCount = wordCount[wordCount<(max(wordCount)-1)]
        wordCount = wordCount[wordCount>4]
        barplot(wordCount,las=2)
  • 17. Most common keywords
      [Figure: bar plot of keyword counts (y-axis 0 to 30). Keywords shown: #security, rt, alert, exploit, injection, sql, cross, scripting, site, #ccureit, new, remote, vulnerability, #cyber, #cyberwar, #hacker, buffer, cms, file, and, disclosure, of, 1.4, execution, vulnerabilities, wordpress, /, [webapps], analysis, multiple, overflow, 1.3.3, advanced, code, command, en, for, information, phpmydirectory, with]
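A compact variant of the keyword count, shown here as a sketch on made-up tweets. It is not the slide's method: instead of blanking unwanted tokens with a chain of replacements, it drops them with one grepl() filter and counts with table(). The sample tweets and the `drop` pattern are assumptions for illustration.

```r
# Made-up sample tweets.
tweets <- c("RT @a: SQL injection exploit http://t.co/x #Exploit",
            "New SQL injection in cms #exploit",
            "buffer overflow exploit #security")

# Split on spaces and normalise case.
words <- tolower(unlist(strsplit(tweets, " ", fixed = TRUE)))

# Drop retweet markers, mentions, the searched hashtag, and URLs.
drop <- grepl("^(rt|@.*|#exploit|http.*)$", words)
words <- words[!drop]

# Count each remaining word, most frequent first.
wordCount <- sort(table(words), decreasing = TRUE)
print(wordCount)
```

Dropping tokens outright avoids the empty-string entry that the replacement approach leaves behind and then has to trim from the top of the counts.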
  • 18. Common keywords to explore
      • #security – Another good hashtag to search on
      • (SQL) Injection – Apparently one of the most common attacks
      • cross (site) scripting – Another popular attack
      • #cyber #cyberwar #ccureit #hacker – More hashtags?
      • remote vulnerability, buffer (overflow), cms, wordpress, phpmydirectory
      • Each/any of these keywords could provide a basis for a new tweet search term, or for keyword detection within a set of tweets obtained from another search, or for an alert dashboard with periodic updates
  • 19. Must Remove the Non-Tweets
      @shitaesy Je me couche à 20h30 en ce moment.. J'ai même lu ce soir :3 #exploit
      • Scanning across a sample of the tweets, some are spam and should be filtered out
      • Can we create a classifier that will get rid of the non-exploit tweets?
  • 20. Anatomy of a Classifier
      • Attributes (Attribute 1, Attribute 2, Attribute 3)
        ◦ Attribute can be boolean or numeric
        ◦ Most useful if independent of other attributes
        ◦ The fewer the attributes the better
      • Initial model developed with training data
      • Model accuracy checked on training data, but later cross-validated on new data
  • 21. Create and Add Training Data
        write.table(expDF, sep=",", file="exploitData.csv")
        # I looked at the tweets and added
        # training data, using my judgment to code
        # the non-tweets
        truExp = read.table("truExp.txt")
        # Add to the existing data
        expDFtrue = cbind(expDF, truExp)
        # Note: new variable name defaults to “V1”
  • 22. Easy Predictors with Grepl()
        # Create true/false values for each row,
        # based on whether the string exists
        expDFtrue$hassec = grepl("security",tolower(expDFtrue$text))
        # Also count some punctuation to see if
        # there are clues there
        expDFtrue$numhash = sapply(strsplit(as.character(expDFtrue$text),"#"),length)-1
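The predictor columns can be built and inspected on a small made-up data frame. In this sketch the `countChar` helper is hypothetical (not from the slides): it counts a character by comparing string lengths before and after removing it, which also counts a character that falls at the very end of a tweet. The column names hasbuf, numdot, and twtlen follow the predictor names in the logit output on the next slide; how the author actually computed them is not shown, so these definitions are assumptions.

```r
# Made-up sample tweets.
tweets <- data.frame(
  text = c("New #security alert: SQL injection #exploit",
           "Je me couche tot ce soir #exploit",
           "Buffer overflow in cms... #Security #cyber #exploit"),
  stringsAsFactors = FALSE
)

# Keyword flags: TRUE where the lower-cased text contains the keyword.
lowerText <- tolower(tweets$text)
tweets$hassec <- grepl("security", lowerText, fixed = TRUE)
tweets$hasbuf <- grepl("buffer", lowerText, fixed = TRUE)

# Hypothetical helper: occurrences of a single character in each string.
countChar <- function(x, ch) nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))

tweets$numhash <- countChar(tweets$text, "#")
tweets$numdot <- countChar(tweets$text, ".")
tweets$twtlen <- nchar(tweets$text)

print(tweets[, c("hassec", "hasbuf", "numhash", "numdot")])
```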
  • 23. Choose Best Attributes
      Coefficients (output from logit analysis):
                      Estimate Std. Error z value Pr(>|z|)
        (Intercept)  2.946e+00  1.904e+00   1.547  0.12187
        hassecTRUE   2.302e+00  1.186e+00   1.941  0.05222 .
        hassqlTRUE   1.770e+01  4.076e+03   0.004  0.99653
        hasbufTRUE  -2.476e+00  1.124e+00  -2.203  0.02757 *
        hasscrTRUE   1.832e+01  4.135e+03   0.004  0.99647
        hasremTRUE  -3.075e-01  1.109e+00  -0.277  0.78164
        hascybTRUE   1.202e+00  1.958e+00   0.614  0.53937
        numhash     -2.046e+00  7.218e-01  -2.835  0.00458 **
        numast      -2.182e+01  5.554e+03  -0.004  0.99687
        numdot       6.306e-01  4.167e-01   1.513  0.13017
        twtlen       6.548e-03  2.329e-02   0.281  0.77854
        # security, buffer keywords are promising, as well as the
        # number of hash marks and the number of dots/periods
  • 24. Classification Tree Works OK
        library(rpart)
        fit <- rpart(V1 ~ hassec + hasbuf + numhash + numdot, method="class", data=expDFtrue)
        summary(fit)
        plot(fit, uniform=TRUE, margin=0.1, branch=0.5, compress=TRUE)
        text(fit, use.n=TRUE, all=TRUE, cex=.8)
      [Figure: classification tree; root node (class 1, 26/71) splits on numhash>=2.5 into a class 0 leaf (14/2) and a class 1 leaf (12/69)]
      “numhash” only retained attribute, split at 2.5
      Overall 14 errors (83/97 = 85.6% correct)
      12 false positives (12/97 = 12.4% FP)
      2 false negatives (2/97 = 2.1% FN)
  • 25. Twitter: Early Warning System?
      • Conclusion 1: R is pretty handy for grabbing and manipulating tweet data
      • Conclusion 2: Tweet data are messy and require a good deal of clean-up, parsing, and filtering
      • Conclusion 3: As these two examples suggest, tweets can provide breaking news about vulnerabilities and exploits
        ◦ WordPress Age Verification plugin versions 0.4 and below open redirect vulnerability
            • Exploit availability tweeted at 12:19 PM
            • Blogged at SecurityBlog 10:24 PM
            • Added to SiloBreaker two days later
        ◦ Pragyan CMS v 3.0 Remote File Disclosure
            • Exploit availability tweeted at 11:07 AM
            • Appeared on PacketStorm next day
            • On RealHacker three days later
            • On WebCriminal.ru eight days later
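The error rates quoted on this slide follow directly from the two leaf counts. The sketch below rebuilds them as a confusion matrix, assuming that class 1 marks a genuine exploit tweet and class 0 a non-exploit tweet (the slide does not spell out the coding) and that each n/m pair reads as class 0 count / class 1 count.

```r
# Leaf counts from the tree: rows = predicted class, columns = actual class.
# Leaf predicted 0 holds 14 class-0 and 2 class-1 tweets;
# leaf predicted 1 holds 12 class-0 and 69 class-1 tweets.
confusion <- matrix(c(14, 12, 2, 69), nrow = 2,
                    dimnames = list(predicted = c("0", "1"),
                                    actual = c("0", "1")))

total <- sum(confusion)                       # all classified tweets
accuracy <- sum(diag(confusion)) / total      # correct predictions
fpRate <- confusion["1", "0"] / total         # predicted 1, actually 0
fnRate <- confusion["0", "1"] / total         # predicted 0, actually 1

print(round(100 * c(accuracy = accuracy, FP = fpRate, FN = fnRate), 1))
```

The column sums (26 and 71) match the root node counts shown in the tree, which is a quick check that the leaf counts were read correctly.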

Editor's Notes

  1. This graph shows the Michael Jackson Effect, with a strong uptick in the half hour following the announcement of his death on Hollywood celebrity site TMZ.com (Doctors at UCLA Hospital had announced the death 18 minutes before that). Twitter crashed temporarily under the load at 3:15 PM. Twitter has about 200 million users and is the 9th most popular site on the web. On a typical day Twitter handles 200 million tweets and 1.6 billion search queries.
  2. Could also make a fancier PostScript plot with: post(fit, file = "tree.ps", title = "Classification Tree for Exploit Tweets")